c0mmando c2f1fa9345
Refactor code, add features, fix bugs
- Removed duplicate post titles
- Fixed script termination
- Removed duplicates in readme
- Removed double image links
- Clean up post titles
- Organized readme topics by category
- Fix bug preventing archive of more than 20 posts per topic

Discourse-to-GitHub-Archiver

A Python script that archives posts from one or more Discourse-based forums and renders their topics to Markdown. The script downloads posts via the Discourse API, saves new posts as JSON files, and renders each topic with its images downloaded and image URLs rewritten as relative paths. It also updates its metadata and generates a README.md with a table of contents that links to each archived topic.

Features

  • Archive Posts: Saves each Discourse post in a JSON file, organized by creation date.
  • Concurrent Rendering: Renders topics concurrently, converting posts from HTML to Markdown.
  • Image Downloading: Processes HTML to download images and rewrites image URLs to relative paths.
  • Metadata Updating: Keeps track of archived posts to avoid duplicates.
  • Incremental README Updates: Updates a README.md with a table of contents for easy navigation.
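
As a rough illustration of the archiving feature listed above, the sketch below builds a date-based path for a post and skips posts that are already on disk. It is not the script's actual code: the function names, the `<id>.json` file name, and the use of a file-existence check (instead of the metadata file the script keeps) are assumptions made for this example. It also assumes posts carry an ISO 8601 `created_at` timestamp ending in `Z`.

```python
import json
from datetime import datetime
from pathlib import Path


def post_archive_path(post: dict, posts_dir: Path) -> Path:
    """Return a path such as posts_dir/2023-10-October/<id>.json (hypothetical layout)."""
    # Python 3.9's fromisoformat() does not accept a trailing "Z", so normalise it.
    created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
    # %B is the full month name; it depends on the active locale.
    return posts_dir / created.strftime("%Y-%m-%B") / f"{post['id']}.json"


def archive_post(post: dict, posts_dir: Path) -> bool:
    """Write the post as JSON; return False if it was already archived."""
    path = post_archive_path(post, posts_dir)
    if path.exists():
        return False  # assumption: an existing file means "already archived"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(post, indent=2), encoding="utf-8")
    return True
```

Calling `archive_post` twice with the same post writes the file once and returns `False` the second time, which is one simple way to keep repeated runs from producing duplicates.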

Requirements

  • Python 3.9 or higher
  • html2text (for converting HTML to Markdown)
  • beautifulsoup4 (for HTML parsing)
  • Optional: rich (for improved logging output)

Installation

  1. Clone the Repository:

    git clone https://github.com/yourusername/discourse-to-github-archiver.git
    cd discourse-to-github-archiver
    
  2. Create a Virtual Environment (Optional but Recommended):

    python3 -m venv venv
    source venv/bin/activate   # On Windows use: venv\Scripts\activate
    
  3. Install Dependencies:

    pip install html2text beautifulsoup4 rich
    

Usage

Run the script from the command line. The main script is called discourse2github.py. By default, it is configured to archive posts from:

  • https://forum.hackliberty.org

You can specify additional or custom Discourse URLs and an output target directory. For example:

    ./discourse2github.py --urls https://forum.hackliberty.org,https://forum.qubes-os.org --target-dir ./archive

Command-Line Options

  • --urls: Comma-separated list of Discourse server URLs.
  • --debug: Enable debug mode for more verbose logging.
  • -t, --target-dir: The base directory where archives (posts, rendered topics, and metadata) will be stored.

Alternatively, you can set the following environment variables:

  • DISCOURSE_URLS: The Discourse server URLs to archive (same purpose as --urls).
  • TARGET_DIR: The output directory (same purpose as --target-dir).
  • DEBUG: Enables debug mode (same purpose as --debug).
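
One way the options and environment variables above could fit together is sketched below with `argparse`. This is an illustration, not the script's real parser: the precedence (command line over environment), the `./archive` default, and the values treated as "true" for DEBUG are all assumptions made for this example.

```python
import argparse
import os

DEFAULT_URLS = "https://forum.hackliberty.org"


def parse_args(argv=None, env=None):
    """Parse CLI options, falling back to environment variables (assumed precedence)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Archive Discourse posts")
    parser.add_argument(
        "--urls",
        default=env.get("DISCOURSE_URLS", DEFAULT_URLS),
        help="Comma-separated list of Discourse server URLs",
    )
    parser.add_argument(
        "-t", "--target-dir",
        default=env.get("TARGET_DIR", "./archive"),  # default is an assumption
        help="Base directory for posts, rendered topics, and metadata",
    )
    parser.add_argument(
        "--debug",
        action="store_true",
        # Assumption: DEBUG=1/true/yes switches debug mode on.
        default=env.get("DEBUG", "").lower() in {"1", "true", "yes"},
        help="Enable verbose logging",
    )
    args = parser.parse_args(argv)
    # Split the comma-separated string and drop empty entries and trailing slashes.
    args.urls = [u.strip().rstrip("/") for u in args.urls.split(",") if u.strip()]
    return args
```

Passing `argv` and `env` explicitly keeps the function easy to test; in normal use both default to the real command line and `os.environ`.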

How It Works

  1. Fetching Posts:
    The script fetches posts from the provided Discourse servers via their public APIs.

  2. Archiving Posts:
    It saves new posts as JSON files in a directory structure organized by the post's creation date (e.g., 2023-10-October).

  3. Rendering Topics:
    The script collects unique topics from the posts and concurrently renders each one into a Markdown file. The Markdown includes headers with metadata about each post (ID, username, post number, creation and update timestamps) and the converted Markdown content from the original HTML.

  4. Downloading Assets:
    Any images embedded in the posts are downloaded, and their URLs are rewritten in the Markdown output to point to the local copies.

  5. Updating README:
    After processing topics, the script incrementally updates a README.md file with a table of contents linking to each archived topic.
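
The URL rewriting in step 4 can be sketched roughly as follows. The script itself depends on beautifulsoup4 for HTML parsing; this example swaps in a regular expression so it runs with the standard library alone, and it only plans the downloads rather than performing them. The function name, the directory layout, and the choice to name local files after the last URL path segment are assumptions made for this example.

```python
import os
import posixpath
import re
from urllib.parse import urlparse

# Matches the src attribute of an <img> tag (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def rewrite_image_urls(html: str, topic_dir: str, images_dir: str):
    """Return (rewritten_html, downloads), where downloads is a list of (url, local_path)."""
    downloads = []

    def replace(match):
        url = match.group(2)
        name = posixpath.basename(urlparse(url).path)
        if not name:
            return match.group(0)  # no usable file name: leave the tag untouched
        local_path = os.path.join(images_dir, name)
        downloads.append((url, local_path))
        # Path from the rendered topic's directory to the local image copy.
        relative = os.path.relpath(local_path, start=topic_dir).replace(os.sep, "/")
        return match.group(1) + relative + match.group(3)

    return IMG_SRC.sub(replace, html), downloads
```

A caller would then fetch each `(url, local_path)` pair and pass the rewritten HTML on to the HTML-to-Markdown conversion, so the resulting Markdown links to the local copies.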

Contributing

This script is not perfect, but it works for my purposes. I provide it as-is, and there are no guarantees of further updates. That said, I'm open to reviewing and merging code changes from the community. Feel free to open issues or pull requests if you have improvements or bug fixes to suggest.

License

This project is licensed under the MIT License. See the LICENSE file for more details.
