Update README.md

libertysoft3 2020-06-30 21:26:07 -07:00 committed by GitHub
parent da0624f40e
commit 8522fdad96


@ -4,7 +4,7 @@ pulls reddit data from the [pushshift](https://github.com/pushshift/api) api and
### install
requires python 3 on linux, OSX, or Windows
requires python 3 on linux, OSX, or Windows. warning: if `$ python --version` outputs a python 2 version, then replace all occurrences of `python` with `python3` in the commands below.
$ sudo apt-get install pip
$ pip install psaw
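the interpreter check described in the warning above can also be done from inside python. this is a minimal sketch, not part of the repository: the helper name `is_python3` is hypothetical, and it only inspects the major version number.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to python 3 or later."""
    # version_info[0] is the major version, e.g. 2 or 3
    return version_info[0] >= 3


if __name__ == "__main__":
    if is_python3():
        print("python 3 detected: use `python` in the commands below")
    else:
        print("python 2 detected: use `python3` in the commands below")
```

save it under any file name and run it with `python`; if it reports python 2, use `python3` for every command in this README.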
@ -25,25 +25,33 @@ before running `fetch_links.py` or `write_html.py` to resolve encoding errors su
### fetch reddit data
data is fetched by subreddit and date range and is stored as csv files in `data`. You may need to explicitly run the script with python3 if it is not the default on your system.
fetch data by subreddit and date range, writing to csv files in `data`:
$ python3 ./fetch_links.py politics 2017-1-1 2017-2-1
# or add some link/post filtering to download less data
$ ./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
# show available filters
$ ./fetch_links.py -h
$ python ./fetch_links.py politics 2017-1-1 2017-2-1
or you can filter links/posts to download less data:
$ python ./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
to show all available options and filters run:
$ python ./fetch_links.py -h
decrease your date range or adjust `pushshift_rate_limit_per_minute` in `fetch_links.py` if you are getting connection errors.
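to get a feel for what a per-minute limit means in practice, here is a rough sketch of how such a setting can translate into a pause between requests. how `fetch_links.py` actually applies `pushshift_rate_limit_per_minute` is not shown here, so the `Throttle` class and `min_interval_seconds` helper below are hypothetical illustrations only.

```python
import time


def min_interval_seconds(rate_limit_per_minute):
    """Smallest gap between requests that stays within a per-minute limit."""
    if rate_limit_per_minute <= 0:
        raise ValueError("rate limit must be positive")
    return 60.0 / rate_limit_per_minute


class Throttle:
    """Hypothetical helper: spaces calls so they never exceed the limit."""

    def __init__(self, rate_limit_per_minute, clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_seconds(rate_limit_per_minute)
        self._clock = clock
        self._sleep = sleep
        # no request has been made yet, so the first call never waits
        self._next_allowed = float("-inf")

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

with a limit of 20 requests per minute the gap is 3 seconds, so lowering the value makes each fetch slower but less likely to trigger connection errors.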
### write web pages
write html files for all subreddits to `r`.
write html files for all subreddits to `r`:
$ ./write_html.py
# or add some output filtering for less fluff or a smaller archive size
$ ./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
# show available filters
$ ./write_html.py -h
$ python ./write_html.py
you can add some output filtering to have fewer empty posts and a smaller archive size:
$ python ./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
to show all available filters run:
$ python ./write_html.py -h
your html archive has been written to `r`. once you are satisfied with your archive, feel free to copy or move the contents of `r` elsewhere and to delete the git repos you have created. everything in `r` is fully self-contained.