Discourse to Markdown Archiver
This script archives posts and renders topics to Markdown from one or more Discourse installations. It downloads posts from specified Discourse servers via their API, archives them as JSON files (avoiding duplicates), and renders topic threads into Markdown files. Each site is stored in its own subdirectory along with a separate metadata file tracking synchronization details.
Discourse Forums to Archive
- https://forum.hackliberty.org
- https://forum.qubes-os.org
- https://forums.whonix.org
- https://forum.torproject.net
- https://discuss.privacyguides.net
Features
- Archive new posts as JSON.
- Render topics to Markdown files.
- Support for multiple Discourse sites in a single run (sites are processed one at a time).
- Separate metadata tracking per site (last synchronization date and archived post IDs).
- Concurrent rendering of topics using a thread pool for improved performance.
- Exponential backoff for HTTP requests to handle rate limits or transient errors.
Requirements
- Python 3.7+
- Standard library modules (argparse, concurrent.futures, functools, etc.)
- Optionally, the rich module for improved logging output.
Install it via pip: pip install rich
Usage
Run the script from the command line.
Command-Line Arguments
- --urls: A comma-separated list of Discourse server URLs.
  Example: --urls "https://forum.hackliberty.org,https://forums.whonix.org"
  If not provided, the script defaults to https://forum.hackliberty.org. You can also set the DISCOURSE_URLS environment variable.
- --target-dir or -t: The base directory where archives and rendered topics will be stored.
  Default is ./archive.
  Each site will have its own subdirectory (using the site's hostname).
- --debug: Run in debug mode.
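The argument handling described above could be sketched roughly as follows. This is an illustration, not the script's actual code: the function name parse_args, the env parameter, and the precedence order (command line, then environment variable, then built-in default) are assumptions.

```python
import argparse
import os

DEFAULT_URL = "https://forum.hackliberty.org"


def parse_args(argv=None, env=None):
    """Parse CLI options, falling back to environment variables (assumed precedence)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(
        description="Archive Discourse posts and render topics to Markdown."
    )
    parser.add_argument(
        "--urls",
        default=env.get("DISCOURSE_URLS", DEFAULT_URL),
        help="Comma-separated list of Discourse server URLs.",
    )
    parser.add_argument(
        "-t", "--target-dir",
        default=env.get("TARGET_DIR", "./archive"),
        help="Base directory for archives and rendered topics.",
    )
    parser.add_argument("--debug", action="store_true", help="Run in debug mode.")
    args = parser.parse_args(argv)
    # Split the comma-separated string into a clean list of URLs.
    args.urls = [u.strip().rstrip("/") for u in args.urls.split(",") if u.strip()]
    return args
```

Calling parse_args([]) with an empty environment yields the single default URL and ./archive as the target directory.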
Example
To archive posts and render topics from two sites and store the data in the ./archive directory:
./archive.py --urls "https://forum.hackliberty.org,https://forums.whonix.org" --target-dir ./archive
Alternatively, using environment variables:
export DISCOURSE_URLS="https://forum.hackliberty.org,https://forums.whonix.org"
export TARGET_DIR="./archive"
./archive.py
Directory Structure
After executing the script, the base target directory will be structured as follows:
./archive/
  site1.example.com/
    posts/
      2023-09-September/
        0000000123-username-topic-slug.json
        ...
    rendered-topics/
      2023-09-September/
        2023-09-15-topic-slug-id123.md
        ...
    .metadata.json
  site2.example.com/
    posts/
      ...
    rendered-topics/
      ...
    .metadata.json
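As a rough illustration of how the paths in the layout above might be built, here is a minimal sketch. The helper names (site_dir, month_dir, post_path, topic_path) are hypothetical, and the filename patterns are inferred from the example tree rather than taken from the script.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def site_dir(base, url):
    """One subdirectory per site, named after the URL's hostname."""
    return Path(base) / urlparse(url).hostname


def month_dir(created_at):
    """Month bucket such as 2023-09-September (%B follows the active locale)."""
    return created_at.strftime("%Y-%m-%B")


def post_path(base, url, post_id, username, topic_slug, created_at):
    # Zero-pad the post ID to 10 digits, matching the example filename.
    name = f"{post_id:010d}-{username}-{topic_slug}.json"
    return site_dir(base, url) / "posts" / month_dir(created_at) / name


def topic_path(base, url, topic_id, topic_slug, created_at):
    name = f"{created_at:%Y-%m-%d}-{topic_slug}-id{topic_id}.md"
    return site_dir(base, url) / "rendered-topics" / month_dir(created_at) / name
```

Zero-padding the post ID keeps files sorted by ID within each month directory when listed alphabetically.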
Each site's .metadata.json contains:
- last_sync_date: The ISO-formatted date of the last successful sync.
- archived_post_ids: A list of post IDs that have been archived, used to avoid duplicate downloads across invocations.
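Reading and updating this metadata file might look something like the sketch below. The function names load_metadata and save_metadata are hypothetical; only the two field names and the .metadata.json filename come from the description above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def load_metadata(site_dir):
    """Return the site's metadata, or an empty record on the first run."""
    path = Path(site_dir) / ".metadata.json"
    if not path.exists():
        return {"last_sync_date": None, "archived_post_ids": []}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_metadata(site_dir, metadata, new_post_ids):
    """Merge newly archived IDs and stamp the sync time in ISO format."""
    ids = set(metadata.get("archived_post_ids", [])) | set(new_post_ids)
    updated = {
        "last_sync_date": datetime.now(timezone.utc).isoformat(),
        "archived_post_ids": sorted(ids),
    }
    path = Path(site_dir) / ".metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
    return updated
```

A post whose ID already appears in archived_post_ids can then be skipped before any download is attempted.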
Logging
The script uses the logging module for feedback during processing. If the optional rich
module is installed, rich logging output is enabled.
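One common way to make rich optional is to try the import and fall back to a plain stream handler. The sketch below assumes this pattern; the function name setup_logging, the logger name "archive", and the mapping of --debug to the DEBUG level are illustrative assumptions, not confirmed details of the script.

```python
import logging


def setup_logging(debug=False):
    """Configure a logger, using rich's handler only if rich is installed."""
    level = logging.DEBUG if debug else logging.INFO
    try:
        from rich.logging import RichHandler  # optional dependency
        handler = RichHandler()
        fmt = "%(message)s"  # RichHandler renders time and level itself
    except ImportError:
        handler = logging.StreamHandler()
        fmt = "%(asctime)s %(levelname)s %(message)s"
    handler.setFormatter(logging.Formatter(fmt))

    logger = logging.getLogger("archive")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger
```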
Troubleshooting
- Network Issues / Rate Limits: The script applies exponential backoff when requests fail (for example, because of rate limits). If requests repeatedly fail, check your network connectivity or, if you administer the server, adjust its rate limit settings.
- JSON Decoding Errors: The script will log a warning if it fails to decode JSON from the API. Ensure the target Discourse instance is accessible and responding correctly.
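For readers unfamiliar with the retry behavior mentioned above, here is a minimal sketch of exponential backoff around a JSON request. It uses urllib from the standard library, which may differ from what the script uses; the function name fetch_json, the retry count, the base delay, and the injectable opener and sleep parameters are all assumptions for illustration.

```python
import json
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger(__name__)


def fetch_json(url, max_retries=5, base_delay=1.0,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch and decode JSON, doubling the wait after each failed attempt."""
    for attempt in range(max_retries):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode("utf-8"))
        except (urllib.error.URLError, json.JSONDecodeError) as exc:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            log.warning("Request to %s failed (%s); retrying in %.1fs", url, exc, delay)
            sleep(delay)
```

With the defaults shown, a request is attempted at most five times, waiting 1, 2, 4, and 8 seconds between attempts.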
Customization
- Adjust the number of threads in the render_topics_concurrently() function by modifying the max_workers parameter.
- Customize directories or filename formats in the save() and save_rendered() methods of the Post and Topic classes.
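To show where max_workers fits, here is a simplified sketch of rendering topics with a thread pool. The function name comes from the text above, but its signature, the render callback, and the default of 8 workers are assumptions; the real function may differ.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def render_topics_concurrently(topics, render, max_workers=8):
    """Render each topic in a worker thread and collect results by topic.

    `topics` is an iterable of hashable items (for example, topic IDs) and
    `render` is a callable that renders one of them.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(render, topic): topic for topic in topics}
        for future in as_completed(futures):
            # future.result() re-raises any exception from the worker.
            results[futures[future]] = future.result()
    return results
```

Rendering is largely I/O-bound (writing files), so raising max_workers can help on fast disks, while lowering it reduces load on slower systems.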
License
This script is provided under the MIT license.
Acknowledgements
This tool was created with inspiration from community discussions and use cases for archiving and reporting data from Discourse installations. Shout out to https://github.com/jamesob/discourse-archive which is where most of the code came from.
Happy archiving!