red-arch

mirror of https://github.com/sys-nyx/red-arch.git synced 2026-01-06 02:55:41 -05:00

The goal of this project is to provide a framework for archiving websites and social media - with a particular focus on subreddits - and creating compilations of information in ways that are very easy for non-tech-savy people to consume, copy, and distribute.

Find a file

sys-nyx 32ea2550be created requirements.txt		2024-12-26 16:52:44 -08:00
r/static	initial	2018-10-28 05:08:54 -07:00
screenshots	initial	2018-10-28 05:08:54 -07:00
templates	initial	2018-10-28 05:08:54 -07:00
.gitignore	updated .gitignore	2024-12-26 16:19:04 -08:00
config.toml	added example config file	2024-12-26 16:15:52 -08:00
dumps.py	added dumps.py script which will take open json files containing subreddit content, organize them in a way expected by the generate_html func in write_html.py, and then call it	2024-12-26 16:18:42 -08:00
fetch_links.py	fix subreddit name casing bugs, resolves #3	2019-04-09 02:59:18 -07:00
LICENSE	Create LICENSE	2020-07-19 14:50:22 -07:00
README.md	Update README.md	2020-06-30 21:31:37 -07:00
requirements.txt	created requirements.txt	2024-12-26 16:52:44 -08:00
write_html.py	modified generate_html func in write_html.py to accept a dictionary containing subreddit content	2024-12-26 16:15:22 -08:00

README.md

reddit html archiver

pulls reddit data from the pushshift api and renders offline compatible html pages. uses the reddit markdown renderer.

install

requires python 3 on linux, OSX, or Windows.

warning: if $ python --version outputs a python 2 version on your system, then you need to replace all occurances of python with python3 in the commands below.

$ sudo apt-get install pip
$ pip install psaw -U
$ git clone https://github.com/chid/snudown
$ cd snudown
$ sudo python setup.py install
$ cd ..
$ git clone [this repo]
$ cd reddit-html-archiver
$ chmod u+x *.py

Windows users may need to run

> chcp 65001
> set PYTHONIOENCODING=utf-8

before running fetch_links.py or write_html.py to resolve encoding errors such as 'codec can't encode character'.

fetch reddit data

fetch data by subreddit and date range, writing to csv files in data:

$ python ./fetch_links.py politics 2017-1-1 2017-2-1

or you can filter links/posts to download less data:

$ python ./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1

to show all available options and filters run:

$ python ./fetch_links.py -h

decrease your date range or adjust pushshift_rate_limit_per_minute in fetch_links.py if you are getting connection errors.

write web pages

write html files for all subreddits to r:

$ python ./write_html.py

you can add some output filtering to have less empty postssmaller archive size

$ python ./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments

to show all available filters run:

$ python ./write_html.py -h

your html archive has been written to r. once you are satisfied with your archive feel free to copy/move the contents of r to elsewhere and to delete the git repos you have created. everything in r is fully self contained.

to update an html archive, delete everything in r aside from r/static and re-run write_html.py to regenerate everything.

hosting the archived pages

copy the contents of the r directory to a web root or appropriately served git repo.

potential improvements

fetch_links
- num_comments filtering
- thumbnails or thumbnail urls
- media posts
- score update
- scores from reddit with praw
real templating
choose Bootswatch theme
specify subreddits to output
show link domain/post type
user pages
- add pagination, posts sorted by score, comments, date, sub
- too many files in one directory
view on reddit.com
js powered search page, show no links by default
js inline media embeds/expandos
archive.org links