From 8522fdad96fc387cd0417fb7d1cdd5b8ef939467 Mon Sep 17 00:00:00 2001
From: libertysoft3 <31760410+libertysoft3@users.noreply.github.com>
Date: Tue, 30 Jun 2020 21:26:07 -0700
Subject: [PATCH] Update README.md

---
 README.md | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 92dce99..14ff729 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ pulls reddit data from the [pushshift](https://github.com/pushshift/api) api and
 
 ### install
 
-requires python 3 on linux, OSX, or Windows
+requires python 3 on linux, OSX, or Windows. warning: if `$ python --version` outputs a python 2 version, then replace all occurrences of `python` with `python3` in the commands below.
 
     $ sudo apt-get install pip
     $ pip install psaw
@@ -25,25 +25,33 @@ before running `fetch_links.py` or `write_html.py` to resolve encoding errors su
 
 ### fetch reddit data
 
-data is fetched by subreddit and date range and is stored as csv files in `data`. You may need to explicitly run the script with python3 if it is not the default on your system.
+fetch data by subreddit and date range, writing to csv files in `data`:
 
-    $ python3 ./fetch_links.py politics 2017-1-1 2017-2-1
-    # or add some link/post filtering to download less data
-    $ ./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
-    # show available filters
-    $ ./fetch_links.py -h
+    $ python ./fetch_links.py politics 2017-1-1 2017-2-1
+
+or you can filter links/posts to download less data:
+
+    $ python ./fetch_links.py --self_only --score "> 2000" politics 2015-1-1 2016-1-1
+
+to show all available options and filters run:
+
+    $ python ./fetch_links.py -h
 
 decrease your date range or adjust `pushshift_rate_limit_per_minute` in `fetch_links.py` if you are getting connection errors.
 
 ### write web pages
 
-write html files for all subreddits to `r`.
+write html files for all subreddits to `r`:
 
-    $ ./write_html.py
-    # or add some output filtering for less fluff or a smaller archive size
-    $ ./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
-    # show available filters
-    $ ./write_html.py -h
+    $ python ./write_html.py
+
+you can add some output filtering to have fewer empty posts or a smaller archive size:
+
+    $ python ./write_html.py --min-score 100 --min-comments 100 --hide-deleted-comments
+
+to show all available filters run:
+
+    $ python ./write_html.py -h
 
 your html archive has been written to `r`. once you are satisfied with your archive feel free to copy/move the contents of `r` to elsewhere and to delete the git repos you have created. everything in `r` is fully self contained.
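
The install warning added by this patch asks the reader to check `python --version` by hand and substitute `python3` where needed. That check could also be scripted. The following is only a minimal sketch, assuming a POSIX shell; `pick_python` is a hypothetical helper name and is not part of the repository.

```shell
# sketch only: pick_python is a hypothetical helper, not shipped with the project.
# prints the name of the first interpreter that reports a python 3 version,
# trying plain `python` first and falling back to `python3`.
pick_python() {
    for candidate in python python3; do
        # python 2 writes its version string to stderr, python 3 to stdout,
        # so stderr is merged into stdout before matching.
        case "$("$candidate" --version 2>&1)" in
            "Python 3."*)
                echo "$candidate"
                return 0
                ;;
        esac
    done
    echo "no python 3 interpreter found" >&2
    return 1
}
```

With the helper defined, a command from the README could be run as `"$(pick_python)" ./fetch_links.py politics 2017-1-1 2017-2-1`, which uses `python` when it is already python 3 and `python3` otherwise.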