1492 Commits

Author SHA1 Message Date
Noah Levitt
20996fa501 bump version after merge 2018-10-12 12:46:09 -07:00
jkafader
8fc800d1ef
Merge pull request #127 from nlevitt/ydl-improvements
Ydl improvements
2018-10-12 11:55:47 -07:00
Noah Levitt
65fad5e8bf remove stray bad logging line 2018-10-12 11:35:47 -07:00
Noah Levitt
7497b7e5ac tests expect outlinks to be a set 2018-10-12 11:03:54 -07:00
Noah Levitt
054ba6d7a0 tidy up some comments and docs 2018-10-12 00:48:38 -07:00
Noah Levitt
8f9077fbf3 watch pages as outlinks from youtube-dl playlists
and bypass downloading metadata about individual videos as well as the
videos themselves (for youtube playlists), because even just the
metadata can take many minutes or hours in case of thousands of videos
2018-10-12 00:41:16 -07:00
Noah Levitt
9211fb45ec silence youtube-dl's logging, use only our own
because youtube-dl's logging can be annoyingly verbose and confusing,
doesn't tell us the things we're interested in, and doesn't tell us
where the messages originate
2018-10-12 00:39:37 -07:00
Noah Levitt
e5536182dc use a thread-local callback in monkey-patched
finish_frag_download, instead of locking around monkey-patching, so
that different threads can run youtube-dl concurrently without
interfering with each other
2018-10-11 23:28:34 -07:00
Noah Levitt
82cf5c6dbb skip downloading videos from youtube playlists
because we expect to capture videos from individual watch pages, and
often processing thousands of videos with youtube-dl before the page is
ever opened in the browser is not desired behavior and is a crawling
problem
2018-10-11 15:46:30 -07:00
Noah Levitt
e406e42312 trace-level logging for all the chrome output
because the important/unimportant messages are always shifting and we're
not even trying to keep up with it and mostly it's just noise
2018-10-11 15:43:05 -07:00
Noah Levitt
1e95441ce7 hopefully fixes lingering ydl concurrency issue
which was causing awfulness like this:
2018-09-30 04:39:54,410 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://www.facebook.com/CongresswomanRosaDeLauro/
2018-09-30 04:39:58,092 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - 1080x607' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8001 with url youtube-dl:00037:https://instagram.com/p/BfJvqhfnQ0C/
2018-09-30 04:40:05,120 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc107.us.archive.org:8000 with url youtube-dl:00009:https://www.facebook.com/LDS
2018-09-30 04:40:09,450 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc407.us.archive.org:8000 with url youtube-dl:00048:https://www.youtube.com/watch?v=-gH28zrMmAM
2018-09-30 04:40:14,327 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc108.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepTedLieu/status/1010212963897233408
2018-09-30 04:40:23,018 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc048.us.archive.org:8001 with url youtube-dl:00005:https://www.facebook.com/SenDuckworth/
2018-09-30 04:40:29,553 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '0 - unknown' video stitched-up as application/octet-stream (228243844 bytes) to warcprox at wbgrp-svc045.us.archive.org:8000 with url youtube-dl:00009:http://www.facebook.com/repkathleenrice
2018-09-30 04:40:37,057 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc406.us.archive.org:8000 with url youtube-dl:00023:https://www.youtube.com/watch?v=MaamqVF87mE
2018-09-30 04:40:41,298 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing '22 - 1280x720 (hd720)' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc403.us.archive.org:8000 with url youtube-dl:00039:https://www.youtube.com/watch?v=pRpMp4H8El0
2018-09-30 04:40:45,613 19101 INFO BrozzlingThread:58486 brozzler.ydl._build_youtube_dl.<locals>._YoutubeDL._push_stitched_up_vid_to_warcprox(ydl.py:164) pushing 'hls-2176 - 1280x720' video stitched-up as video/mp4 (228243844 bytes) to warcprox at wbgrp-svc408.us.archive.org:8000 with url youtube-dl:00001:https://twitter.com/RepKevinCramer/status/999771072206639104
i.e. pushing the same stitched-up video to a bunch of wrong places :(
2018-10-11 13:40:57 -07:00
Noah Levitt
e519616f8e brozzler-worker log version number at startup 2018-10-11 13:31:37 -07:00
Noah Levitt
362a2347b9 bump version after merge 2018-09-28 15:27:40 -07:00
jkafader
d2b1843a6d
Merge pull request #122 from nlevitt/new-job-bulk-inserts
improve performance of brozzler-new-job
2018-09-28 15:26:33 -07:00
Noah Levitt
8de3e21103 fix another oversight 2018-09-28 14:45:45 -07:00
Noah Levitt
1ee36c38b9 ugh. oops 2018-09-28 13:44:54 -07:00
Noah Levitt
7137918005 Revert "add a github PR template for this repo"
This reverts commit 83552eb444fc5ef9bd4b9d1772820b62aae67e46.
2018-09-28 13:16:46 -07:00
Noah Levitt
7980b40ee3 improve performance of brozzler-new-job
by inserting pages and sites in bulk
2018-09-28 13:13:22 -07:00
Noah Levitt
2386e85a37 bump doublethink dependency 2018-09-27 14:25:49 -07:00
jkafader
f8ce9858fb
Merge pull request #121 from nlevitt/purge
brozzler-purge - purge crawl state from rethinkdb
2018-09-25 16:45:24 -07:00
Noah Levitt
f4fad934a7 verbiage tweaks 2018-09-25 15:19:33 -07:00
Noah Levitt
560981c1ad safety check and --force for brozzler-purge 2018-09-25 15:17:45 -07:00
Noah Levitt
174178e02e new command brozzler-purge 2018-09-25 14:56:26 -07:00
Noah Levitt
48bf185746 bump version after merge 2018-09-18 11:08:44 -07:00
Noah Levitt
dceee8bdbd
Merge pull request #119 from nlevitt/ydl-stitch-fix
WIP youtube-dl stitching fixes
2018-09-18 11:08:21 -07:00
Noah Levitt
60cd69e2bd send warcprox-meta when pushing stitched up video
also put locking around monkey patching to avoid race condition
2018-09-18 01:07:52 -07:00
Noah Levitt
1ef717fa75 test exposing bug that we don't send warcprox-meta
when pushing stitched-up video with WARCPROX_WRITE_RECORD
2018-09-18 01:05:18 -07:00
Noah Levitt
efb0696833 bump version number after merge 2018-09-06 16:17:59 -07:00
jkafader
8368cd2bcb
Merge pull request #115 from nlevitt/ydl-stitched
Ydl stitched
2018-09-06 16:15:52 -07:00
Noah Levitt
a4eacb5b8f oops, back to dev version number 2018-09-04 10:52:34 -07:00
jkafader
e38b867ff5
Merge pull request #118 from nlevitt/relax-claiming
Relax claiming
2018-09-04 10:45:11 -07:00
Noah Levitt
2d5c6681cf wait 20 seconds to claim sites if none were available
last time, up from 0.5 seconds
this should lighten the load on rethinkdb considerably
2018-08-31 15:23:59 -07:00
Noah Levitt
d0f5cd7168 tweak logging 2018-08-31 15:23:48 -07:00
Noah Levitt
88d3d3b310
why did those tests fail??? (#117)
1.4 for pypi
1.4
2018-08-22 14:35:39 -07:00
Noah Levitt
02e98f101d
Merge pull request #116 from kblumenthal/master
Add screenshots
2018-08-22 14:34:52 -07:00
Karl-Rainer Blumenthal
ff1645ef7d
Add screenshots
Add Brozzler Dashboard and Wayback screenshots to readme
2018-08-22 13:02:08 -04:00
Karl-Rainer Blumenthal
7c8b597ad3
Add screenshots
Add screenshots of Brozzler Dashboard and Wayback
2018-08-22 12:55:10 -04:00
Noah Levitt
2a2952e810 back to dev version 2018-08-21 15:18:18 -07:00
Noah Levitt
b63661ea70 1.4 for pypi 2018-08-21 15:15:38 -07:00
Noah Levitt
eaf7ef74be explain --warcprox-auto briefly 2018-08-17 12:06:04 -07:00
Karl-Rainer Blumenthal
2081e6388a
Merge pull request #2 from internetarchive/master
Updating to upstream origin
2018-08-17 14:26:46 -04:00
Noah Levitt
8cdc3dee21 Merge branch 'master' into ydl-stitched
* master:
  vagrant readme fixes (thanks funkyfuture)
  update cryptography dep version
2018-08-17 10:34:00 -07:00
Noah Levitt
d19e139101 vagrant readme fixes (thanks funkyfuture) 2018-08-17 10:31:01 -07:00
Noah Levitt
ffa8021968 update cryptography dep version
github tells me there's a vulnerability <2.3
2018-08-16 14:32:03 -07:00
Noah Levitt
e7d2273856 fix failing tests 2018-08-16 11:40:54 -07:00
Noah Levitt
3c27132aaa test for youtube-dl stitch-up 2018-08-15 17:42:53 -07:00
Noah Levitt
c2ad8427e1 add missing imports and fix mimetype issue 2018-08-15 17:41:35 -07:00
Noah Levitt
33520da8f9 move youtube-dl code into separate file 2018-08-14 15:10:48 -07:00
Noah Levitt
39155ebcc5 push youtube-dl's stitched up videos to warcprox
(no tests yet)
2018-08-13 15:40:48 -07:00
Noah Levitt
4e398e1da2 expose more brozzle-page args 2018-08-13 15:38:24 -07:00