The goal of this project is to provide a framework for archiving websites and social media - with a particular focus on subreddits - and creating compilations of information in ways that are very easy for non-tech-savvy people to consume, copy, and distribute.
r/static initial 2018-10-28 05:08:54 -07:00
screenshots initial 2018-10-28 05:08:54 -07:00
templates initial 2018-10-28 05:08:54 -07:00
.gitignore initial 2018-10-28 05:08:54 -07:00
fetch_links.py fix subreddit name casing bugs, resolves #3 2019-04-09 02:59:18 -07:00
README.md fix duplicate comments bug, resolves #10 2019-09-12 00:47:01 -07:00
write_html.py write: optimize memory usage 2019-11-12 02:17:21 -08:00

reddit html archiver

pulls reddit data from the pushshift api and renders offline-compatible html pages. uses the reddit markdown renderer (snudown).

install

requires python 3 on linux, OSX, or Windows

sudo apt-get install python3-pip
pip install psaw
git clone https://github.com/chid/snudown
cd snudown
sudo python setup.py install
cd ..
git clone [this repo]
cd reddit-html-archiver
chmod u+x *.py

Windows users may need to run

chcp 65001
set PYTHONIOENCODING=utf-8

before running fetch_links.py or write_html.py to resolve encoding errors such as 'codec can't encode character'.
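
to illustrate why these settings help, here is a minimal sketch of the underlying failure: a legacy console code page simply has no byte for many characters that appear in reddit posts. `can_encode` is a hypothetical helper for illustration, not part of this repo, and `cp437` is only an assumed example of a default Windows code page.

```python
def can_encode(text, encoding):
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # This is the error behind "codec can't encode character".
        return False


snowman = "\u2603"
legacy_ok = can_encode(snowman, "cp437")  # assumed legacy console code page
utf8_ok = can_encode(snowman, "utf-8")    # what PYTHONIOENCODING=utf-8 selects
```

`chcp 65001` switches the console itself to utf-8, and `PYTHONIOENCODING` tells python to use utf-8 for stdin, stdout, and stderr.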

fetch reddit data

data is fetched by subreddit and date range and is stored as csv files in the data directory.

./fetch_links.py politics 2017-1-1 2017-2-1
# or add some link/post filtering to download less data
./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
# show available filters
./fetch_links.py -h
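
the date arguments above are not zero padded (`2017-1-1`). as a rough sketch of how such a range could be turned into the epoch bounds that pushshift-style queries typically take: `to_epoch` and `parse_range` are hypothetical helpers, not functions from fetch_links.py, and treating the dates as UTC midnight is an assumption.

```python
from datetime import datetime, timezone


def to_epoch(date_str):
    """Convert 'YYYY-M-D' (zero padding optional) to a UTC epoch timestamp."""
    year, month, day = (int(part) for part in date_str.split("-"))
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def parse_range(start, end):
    """Return (start_ts, end_ts), rejecting empty or reversed ranges."""
    start_ts, end_ts = to_epoch(start), to_epoch(end)
    if start_ts >= end_ts:
        raise ValueError("start date must be before end date")
    return start_ts, end_ts
```

for example, `parse_range("2017-1-1", "2017-2-1")` covers exactly the 31 days of january 2017.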

if you are getting connection errors, shorten your date range or adjust pushshift_rate_limit_per_minute in fetch_links.py.
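
a per-minute rate limit of this kind can be sketched as a small throttle that spaces requests out. this is an illustration only: the `Throttle` class is hypothetical, and how fetch_links.py actually applies `pushshift_rate_limit_per_minute` is not shown here.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock          # injectable for testing
        self.sleep = sleep          # injectable for testing
        self.next_allowed = None    # earliest time the next call may run

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

calling `wait()` before each api request keeps the request rate at or below the configured value; a smaller value means longer pauses between requests.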

write web pages

write html files for all fetched subreddits to the r directory.

./write_html.py
# or add some output filtering for less fluff or a smaller archive size
./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
# show available filters
./write_html.py -h
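
the output filters above amount to simple threshold checks. a minimal sketch, assuming each link is a dict with `score` and `num_comments` fields, each comment has a `body` field, and deleted comments carry a `[deleted]` or `[removed]` body; these names, and whether the thresholds are inclusive, are assumptions rather than the behaviour of write_html.py.

```python
# assumption: how deleted or removed comments appear in the fetched data
DELETED_MARKERS = {"[deleted]", "[removed]"}


def keep_link(link, min_score=0, min_comments=0):
    """Return True if a link meets both thresholds (inclusive, by assumption)."""
    return (int(link["score"]) >= min_score
            and int(link["num_comments"]) >= min_comments)


def visible_comments(comments, hide_deleted=False):
    """Drop comments whose body is a deletion marker when hide_deleted is set."""
    if not hide_deleted:
        return list(comments)
    return [c for c in comments if c["body"].strip() not in DELETED_MARKERS]
```

values read from csv arrive as strings, which is why the sketch converts them with `int` before comparing.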

your html archive has been written to r. once you are satisfied with your archive, feel free to copy or move the contents of r elsewhere and to delete the git repos you cloned. everything in r is fully self-contained.

to update an html archive, delete everything in r aside from r/static and re-run write_html.py to regenerate everything.
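
the cleanup step above can be scripted. a minimal sketch, assuming the archive lives in a directory named r next to the scripts; `clear_archive` is a hypothetical helper, not part of this repo.

```python
import shutil
from pathlib import Path


def clear_archive(root="r", keep=("static",)):
    """Delete every entry directly under `root` except the names in `keep`.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in Path(root).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # generated subreddit or user directories
        else:
            entry.unlink()         # generated top-level files
        removed.append(entry.name)
    return sorted(removed)
```

after `clear_archive()` returns, re-running write_html.py regenerates the pages from the csv data. the deletion is permanent, so check the path before running it.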

hosting the archived pages

copy the contents of the r directory to a web root or appropriately served git repo.
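
before copying the files to a real web root, the archive can be previewed locally with python's built-in static file server. a minimal sketch, assuming python 3.7 or newer; `make_server` is a hypothetical helper and the port is arbitrary.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory="r", port=8000):
    """Build a local-only static file server rooted at `directory`."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# usage (blocks until interrupted with ctrl-c):
#   make_server().serve_forever()
# then open http://127.0.0.1:8000/ in a browser
```

this is meant for local previews only; for public hosting use a regular web server or static hosting service.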

potential improvements

  • fetch_links
    • num_comments filtering
    • thumbnails or thumbnail urls
    • media posts
    • score update
    • scores from reddit with praw
  • real templating
  • choose Bootswatch theme
  • specify subreddits to output
  • show link domain/post type
  • user pages
    • add pagination, posts sorted by score, comments, date, sub
    • too many files in one directory
  • view on reddit.com
  • js powered search page, show no links by default
  • js inline media embeds/expandos
  • archive.org links

see also

screenshots