brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-04-22 16:39:07 -04:00

Author	SHA1	Message	Date
Barbara Miller	1fffaa9eee	Merge branch 'ARI-5689' into qa	2018-06-27 16:47:46 -07:00
Barbara Miller	76ec00c930	skip login for fb groups	2018-06-27 16:47:32 -07:00
Noah Levitt	09dbb4ce1d	Merge branch 'robots-errors' into qa * robots-errors: fix bug in test, add another one	2018-06-22 16:10:51 -05:00
Noah Levitt	c52c16c260	fix bug in test, add another one	2018-06-22 16:10:23 -05:00
Noah Levitt	aff67c3b29	Merge branch 'robots-errors' into qa * robots-errors: treat any error fetching robots.txt as "allow all"	2018-06-22 16:01:21 -05:00
Noah Levitt	aeb7c3f825	treat any error fetching robots.txt as "allow all"	2018-06-22 14:50:57 -05:00
Neil Minton	f5f9a1a137	Merge pull request #109 from internetarchive/ARI-5747 update instagram behavior	2018-06-22 09:24:14 -07:00
Barbara Miller	ad5f409078	Merge branch 'ARI-5744' into qa	2018-06-19 14:17:08 -07:00
Barbara Miller	1f93f70bfe	Revert "test behavior for event.crowdcompass.com" This reverts commit 565a472fb0004f89434f1f775c154e9c4393d380.	2018-06-19 14:07:02 -07:00
Barbara Miller	96014606ec	Merge branch 'ARI-5747' into qa	2018-06-18 10:37:09 -07:00
Barbara Miller	89e54fd2e6	update instagram behavior	2018-06-18 10:36:13 -07:00
Barbara Miller	0857fffeb6	Merge branch 'ARI-5747' into qa	2018-06-13 12:43:35 -07:00
Barbara Miller	5893b1f982	update instagram behavior	2018-06-13 12:43:12 -07:00
Barbara Miller	6b753623b7	Merge branch 'ARI-5744' into qa	2018-06-11 18:34:50 -07:00
Barbara Miller	565a472fb0	test behavior for event.crowdcompass.com	2018-06-11 18:30:40 -07:00
Noah Levitt	27bdfb65d2	monkey-patch youtube-dl to short-circuit video extraction using generic extractor in case of very large url (more than 20 mb) that youtube-dl interprets as html, to avoid spinning forever here: Traceback (most recent call first): File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall return _compile(pattern, flags).findall(string) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract 'uploader': video_uploader, File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract ie_result = self._real_extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info ie_result = ie.extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl info = ydl.extract_info(str(urlcanon.whatwg(page.url))) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page self._try_youtube_dl(ydl, site, page)	2018-06-11 11:50:22 -07:00
Noah Levitt	109d05c59a	Merge branch 'master' into qa * master: monkey-patch youtube-dl to short-circuit	2018-06-11 11:11:09 -07:00
Noah Levitt	a90a29968c	monkey-patch youtube-dl to short-circuit video extraction using generic extractor in case of very large url (more than 20 mb) that youtube-dl interprets as html, to avoid spinning forever here: Traceback (most recent call first): File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall return _compile(pattern, flags).findall(string) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract 'uploader': video_uploader, File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract ie_result = self._real_extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info ie_result = ie.extract(url) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl info = ydl.extract_info(str(urlcanon.whatwg(page.url))) File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page self._try_youtube_dl(ydl, site, page)	2018-06-11 11:08:20 -07:00
Noah Levitt	5c34bd3119	Merge branch 'master' into qa * master: lowercase readme.rst explain brozzler use of warcprox_meta update README copyright date bump dev version after PR #102 these ssurts are strings too fix bad copy/paste ssurts are strings now travis-ci install warcprox from github incorporate urlcanon fix update warcprox dependency to include recent fixes backward compatibility for old scope["surt"] missed a spot where is_permitted_by_robots needs monkeying handle new chrome cookie db schema describe scope rule conditions more explication of scoping update docs to match new seed ssurt behavior ok seriously tests fix more tests for new approach sans scope['surt'] s/max_hops_off_surt/max_hops_off/ new test of max_hops_off rename page.hops_off_surt to page.hops_off doublethink had a bug fix tests for new approach without scope['surt'] tests for new approach without of scope['surt'] WIP add an accept rule instead of modifying surt WIP some words on scoping WIP starting to flesh out "scoping" section WIP some explanation of automatic login WIP documentation!	2018-06-01 16:46:32 -07:00
Noah Levitt	b41ccd7e6b	Merge pull request #108 from nlevitt/docs Docs	2018-05-31 14:15:12 -07:00
Noah Levitt	62bb540a11	lowercase readme.rst	2018-05-31 18:46:37 +00:00
Noah Levitt	a00b5a7fd5	explain brozzler use of warcprox_meta	2018-05-30 18:06:39 -07:00
Noah Levitt	aef4c40993	Merge pull request #107 from internetarchive/copyright-2018 update README copyright date	2018-05-17 11:30:46 -07:00
Barbara Miller	135a13b1c9	update README copyright date	2018-05-17 11:21:47 -07:00
Noah Levitt	8906037d82	bump dev version after PR #102	2018-05-16 17:33:52 -07:00
Noah Levitt	e90e7345a5	Merge pull request #102 from nlevitt/docs complete job configuration documentation	2018-05-16 17:31:27 -07:00
Noah Levitt	331d07fe88	these ssurts are strings too	2018-05-16 17:11:08 -07:00
Noah Levitt	67558528cb	fix bad copy/paste	2018-05-16 16:43:38 -07:00
Noah Levitt	5bb392ec7c	ssurts are strings now because they're friendlier that way in rethinkdb	2018-05-16 16:43:10 -07:00
Noah Levitt	399c097c7c	travis-ci install warcprox from github	2018-05-16 15:48:29 -07:00
Noah Levitt	ac735639ff	incorporate urlcanon fix	2018-05-16 14:41:49 -07:00
Noah Levitt	338d2e48f9	update warcprox dependency to include recent fixes	2018-05-16 14:26:51 -07:00
Noah Levitt	b9b8dcd062	backward compatibility for old scope["surt"] and make sure to store ssurt as string in rethinkdb	2018-05-16 14:19:23 -07:00
Noah Levitt	1572fd3ed6	missed a spot where is_permitted_by_robots needs monkeying	2018-05-15 16:52:48 -07:00
Noah Levitt	a8de9b70d1	handle new chrome cookie db schema	2018-05-15 11:41:02 -07:00
Noah Levitt	de1f240e25	describe scope rule conditions plus a bunch of tweaks and fixes	2018-05-15 11:01:09 -07:00
Noah Levitt	a327cb626f	more explication of scoping	2018-05-14 17:31:45 -07:00
Noah Levitt	2cf474aa1d	update docs to match new seed ssurt behavior	2018-05-14 16:59:55 -07:00
Noah Levitt	fc05cac338	ok seriously tests	2018-05-14 15:38:28 -07:00
Noah Levitt	05f8ab3495	fix more tests for new approach sans scope['surt']	2018-05-14 15:38:28 -07:00
Noah Levitt	85a4757527	s/max_hops_off_surt/max_hops_off/	2018-05-14 15:38:28 -07:00
Noah Levitt	5ebd2fb709	new test of max_hops_off	2018-05-14 15:38:28 -07:00
Noah Levitt	b83d3cb9df	rename page.hops_off_surt to page.hops_off	2018-05-14 15:38:28 -07:00
Noah Levitt	60f2b99cc0	doublethink had a bug fix	2018-05-14 15:38:28 -07:00
Noah Levitt	526a4d718f	tests for new approach without scope['surt'] replaced by an accept rule (two rules in some cases of seed redirects)	2018-05-14 15:38:28 -07:00
Noah Levitt	245e27a21a	tests for new approach without of scope['surt'] replaced by an accept rule (two rules in some cases of seed redirects)	2018-05-14 15:38:28 -07:00
Noah Levitt	f26712ce93	WIP add an accept rule instead of modifying surt in place for seed redirects	2018-05-14 15:38:28 -07:00
Noah Levitt	98ce67ef36	WIP some words on scoping	2018-05-14 15:38:28 -07:00
Noah Levitt	88214236bb	WIP starting to flesh out "scoping" section	2018-05-14 15:38:28 -07:00
Noah Levitt	6df2c1cf22	WIP some explanation of automatic login	2018-05-14 15:38:28 -07:00

1468 Commits