Discourse to Markdown Archiver

This script archives posts and renders topics to Markdown from one or more Discourse installations. It downloads posts from specified Discourse servers via their API, archives them as JSON files (avoiding duplicates), and renders topic threads into Markdown files. Each site is stored in its own subdirectory along with a separate metadata file tracking synchronization details.

Discourse Forums to Archive

Features

  • Archive new posts as JSON.
  • Render topics to Markdown files.
  • Support for multiple Discourse sites in a single run (sites are processed one at a time).
  • Separate metadata tracking per site (last synchronization date and archived post IDs).
  • Concurrent rendering of topics using a thread pool for improved performance.
  • Exponential backoff for HTTP requests to handle rate limits or transient errors.

Requirements

  • Python 3.7+
  • Standard library modules (argparse, concurrent.futures, functools, etc.)
  • Optionally, the rich module for improved logging output.
    Install it via pip:
    pip install rich
    

Usage

Run the script from the command line.

Command-Line Arguments

  • --urls: A comma-separated list of Discourse server URLs.
    Example:

    --urls "https://forum.hackliberty.org,https://forums.whonix.org"
    

    If not provided, the script defaults to https://forum.hackliberty.org. You can also set the DISCOURSE_URLS environment variable.

  • --target-dir or -t: The base directory where archives and rendered topics will be stored.
    Default is ./archive. The TARGET_DIR environment variable can be used instead, as in the example below.
    Each site is stored in its own subdirectory, named after the site's hostname.

  • --debug: Run in debug mode.
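
The arguments above could be parsed roughly as follows. This is a minimal sketch, not the script's actual code: the function names parse_args and site_dirs are hypothetical, and it assumes that command-line values take precedence over the DISCOURSE_URLS and TARGET_DIR environment variables.

```python
import argparse
import os
from urllib.parse import urlparse

DEFAULT_URL = "https://forum.hackliberty.org"


def parse_args(argv=None):
    """Parse the documented options; environment variables supply defaults (assumed precedence)."""
    parser = argparse.ArgumentParser(
        description="Archive Discourse posts and render topics to Markdown"
    )
    parser.add_argument(
        "--urls",
        default=os.environ.get("DISCOURSE_URLS", DEFAULT_URL),
        help="comma-separated list of Discourse server URLs",
    )
    parser.add_argument(
        "-t",
        "--target-dir",
        default=os.environ.get("TARGET_DIR", "./archive"),
        help="base directory for archives and rendered topics",
    )
    parser.add_argument("--debug", action="store_true", help="run in debug mode")
    return parser.parse_args(argv)


def site_dirs(urls, target_dir):
    """Map each URL to a per-site directory named after the URL's hostname."""
    result = {}
    for url in (u.strip() for u in urls.split(",")):
        if not url:
            continue  # tolerate stray commas
        host = urlparse(url).hostname or url  # fall back to the raw value if no scheme was given
        result[url] = os.path.join(target_dir, host)
    return result


if __name__ == "__main__":
    args = parse_args()
    for url, directory in site_dirs(args.urls, args.target_dir).items():
        print(url, "->", directory)
```

Splitting on commas and deriving the directory from the hostname keeps each site's archive separate without any extra configuration.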

Example

To archive posts and render topics from two sites and store the data in the ./archive directory:

./archive.py --urls "https://forum.hackliberty.org,https://forums.whonix.org" --target-dir ./archive

Alternatively, using environment variables:

export DISCOURSE_URLS="https://forum.hackliberty.org,https://forums.whonix.org"
export TARGET_DIR="./archive"
./archive.py

Directory Structure

After executing the script, the base target directory will be structured as follows:

./archive/
    site1.example.com/
        posts/
            2023-09-September/
                0000000123-username-topic-slug.json
                ...
        rendered-topics/
            2023-09-September/
                2023-09-15-topic-slug-id123.md
                ...
        .metadata.json
    site2.example.com/
        posts/
            ...
        rendered-topics/
            ...
        .metadata.json
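
The file names in the tree above follow a regular pattern, which can be reproduced with a sketch like the one below. The helper names post_path and rendered_topic_path are hypothetical; the zero-padding width and the month directory format are inferred from the example listing, not taken from the script.

```python
from datetime import datetime
from pathlib import Path


def post_path(site_dir, post_id, username, topic_slug, created_at):
    """Build posts/<YYYY-MM-MonthName>/<10-digit id>-<username>-<slug>.json."""
    # %B yields the English month name under the default C locale.
    month_dir = created_at.strftime("%Y-%m-%B")
    filename = f"{post_id:010d}-{username}-{topic_slug}.json"
    return Path(site_dir) / "posts" / month_dir / filename


def rendered_topic_path(site_dir, topic_id, topic_slug, created_at):
    """Build rendered-topics/<YYYY-MM-MonthName>/<YYYY-MM-DD>-<slug>-id<topic id>.md."""
    month_dir = created_at.strftime("%Y-%m-%B")
    filename = f"{created_at:%Y-%m-%d}-{topic_slug}-id{topic_id}.md"
    return Path(site_dir) / "rendered-topics" / month_dir / filename


if __name__ == "__main__":
    when = datetime(2023, 9, 15)
    print(post_path("archive/site1.example.com", 123, "username", "topic-slug", when))
    print(rendered_topic_path("archive/site1.example.com", 123, "topic-slug", when))
```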

Each site's .metadata.json contains:

  • last_sync_date: The ISO formatted date of the last successful sync.
  • archived_post_ids: A list of post IDs that have been archived, used to avoid duplicate downloads across invocations.
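
To illustrate how these two fields can prevent duplicate downloads, here is a minimal sketch of loading, updating, and saving such a file. The function names are hypothetical, and the exact timestamp format and ordering of IDs in the real file may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_metadata(path):
    """Return the stored metadata, or an empty record on the first run."""
    path = Path(path)
    if not path.exists():
        return {"last_sync_date": None, "archived_post_ids": []}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def record_posts(metadata, post_ids):
    """Add unseen post IDs to the metadata and return only those new IDs."""
    seen = set(metadata["archived_post_ids"])
    # dict.fromkeys drops repeated IDs while keeping their order.
    new_ids = [pid for pid in dict.fromkeys(post_ids) if pid not in seen]
    metadata["archived_post_ids"] = sorted(seen.union(new_ids))
    metadata["last_sync_date"] = datetime.now(timezone.utc).isoformat()
    return new_ids


def save_metadata(path, metadata):
    """Write the metadata back to disk as indented JSON."""
    Path(path).write_text(json.dumps(metadata, indent=2), encoding="utf-8")
```

Only the IDs returned by record_posts would need to be fetched and written to disk; everything else was archived by an earlier invocation.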

Logging

The script uses the logging module for feedback during processing. If the optional rich module is installed, rich logging output is enabled.
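
An optional dependency like this is commonly handled with a guarded import. The sketch below shows one way to do it; the function name configure_logging, the logger name "archive", and the mapping of --debug to the DEBUG level are assumptions, not details taken from the script.

```python
import logging


def configure_logging(debug=False):
    """Attach a rich handler when rich is installed, else a plain stream handler."""
    try:
        from rich.logging import RichHandler  # optional dependency

        handler = RichHandler()
        fmt = "%(message)s"  # rich renders time and level itself
    except ImportError:
        handler = logging.StreamHandler()
        fmt = "%(asctime)s %(levelname)s %(message)s"

    handler.setFormatter(logging.Formatter(fmt))
    logger = logging.getLogger("archive")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.propagate = False  # keep messages from being emitted twice via the root logger
    return logger


if __name__ == "__main__":
    log = configure_logging(debug=True)
    log.debug("debug logging enabled")
```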

Troubleshooting

  • Network Issues / Rate Limits: The script retries failed requests with exponential backoff (for example, when it hits a rate limit). If requests keep failing, check your network connectivity or, if you administer the Discourse server, adjust its rate limit settings.
  • JSON Decoding Errors: The script will log a warning if it fails to decode JSON from the API. Ensure the target Discourse instance is accessible and responding correctly.
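
The retry behaviour described above can be sketched as follows. This is an illustration built on urllib from the standard library, not the script's own implementation: the function name fetch_json, the retry count, the base delay, and the choice to return None on a decoding failure are all assumptions.

```python
import json
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger("archive")


def fetch_json(url, max_retries=5, base_delay=1.0,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch and decode JSON, doubling the wait after each failed attempt."""
    for attempt in range(max_retries):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode("utf-8"))
        except json.JSONDecodeError:
            # Assumed behaviour: warn and give up on this URL rather than retry.
            log.warning("could not decode JSON from %s", url)
            return None
        except urllib.error.URLError as exc:  # also covers HTTPError, e.g. 429
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            log.warning("request to %s failed (%s); retrying in %.1fs", url, exc, delay)
            sleep(delay)
```

The opener and sleep parameters exist only so the function can be exercised without network access or real delays.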

Customization

  • Adjust the number of threads in the render_topics_concurrently() function by modifying the max_workers parameter.
  • Customize directories or filename formats in the save() and save_rendered() methods of the Post and Topic classes.
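
For orientation, the effect of max_workers can be seen in a small thread pool sketch like the one below. It is not the body of render_topics_concurrently(); the names render_all and render_one are hypothetical, and the default of 8 workers is an arbitrary illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def render_all(topics, render_one, max_workers=8):
    """Run render_one(topic) for every topic on a pool of max_workers threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(render_one, topic): topic for topic in topics}
        for future in as_completed(futures):
            topic = futures[future]
            try:
                results[topic] = future.result()
            except Exception as exc:
                # Keep going if a single topic fails; store the error for inspection.
                results[topic] = exc
    return results


if __name__ == "__main__":
    rendered = render_all([1, 2, 3], lambda t: f"# Topic {t}", max_workers=2)
    for topic_id in sorted(rendered):
        print(topic_id, rendered[topic_id])
```

Rendering is mostly file I/O, so threads can overlap the waiting time; raising max_workers beyond what the disk can absorb is unlikely to help.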

License

This script is provided under the MIT license.

Acknowledgements

This tool was created with inspiration from community discussions and use cases for archiving and reporting data from Discourse installations. Shout out to https://github.com/jamesob/discourse-archive which is where most of the code came from.

Happy archiving!