Example scripts for the pushshift dump files

This repo contains example Python scripts for processing the Reddit dump files created by Pushshift. The files can be downloaded from here or torrented from here.

  • single_file.py decompresses and iterates over a single zst-compressed file
  • iterate_folder.py does the same, but for all files in a folder
  • combine_folder_multiprocess.py uses separate processes to iterate over multiple files in parallel, writes the lines that match the criteria passed in to text files, then combines those text files into a final zst-compressed file
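
For orientation, here is a minimal sketch of the kind of loop the scripts above are built around: stream-decompress a zst file, split it into lines, and parse each line as JSON. It is not the code from the repo. The function names `iter_lines` and `iter_zst_objects`, the chunk size, and the assumption that each line holds one JSON object are illustrative choices. The third-party `zstandard` package is imported only inside the function that needs it, and the `max_window_size=2**31` setting assumes the dumps were compressed with a large window.

```python
import codecs
import json


def iter_lines(reader, chunk_size=2**20):
    """Yield text lines from a binary file-like object, reading in chunks."""
    # An incremental decoder copes with multi-byte UTF-8 characters that
    # happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        buffer += decoder.decode(chunk)
        lines = buffer.split("\n")
        buffer = lines.pop()  # the last piece may be an incomplete line
        yield from lines
    # Flush the decoder and emit a final line that has no trailing newline.
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer


def iter_zst_objects(path):
    """Yield one parsed JSON object per non-empty line of a zst file."""
    import zstandard  # third-party package: pip install zstandard

    with open(path, "rb") as fh:
        # Assumption: the dump uses a large compression window, so the
        # decoder's default window limit is raised.
        decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
        reader = decompressor.stream_reader(fh)
        for line in iter_lines(reader):
            if line:
                yield json.loads(line)
```

A caller could then filter while iterating, for example `for obj in iter_zst_objects(path): ...` with a check on whichever field is of interest. Reading in chunks keeps memory use flat regardless of file size, which matters because the decompressed dumps are much larger than the compressed files.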