1468 Commits

Author SHA1 Message Date
Barbara Miller
1fffaa9eee Merge branch 'ARI-5689' into qa 2018-06-27 16:47:46 -07:00
Barbara Miller
76ec00c930 skip login for fb groups 2018-06-27 16:47:32 -07:00
Noah Levitt
09dbb4ce1d Merge branch 'robots-errors' into qa
* robots-errors:
  fix bug in test, add another one
2018-06-22 16:10:51 -05:00
Noah Levitt
c52c16c260 fix bug in test, add another one 2018-06-22 16:10:23 -05:00
Noah Levitt
aff67c3b29 Merge branch 'robots-errors' into qa
* robots-errors:
  treat any error fetching robots.txt as "allow all"
2018-06-22 16:01:21 -05:00
Noah Levitt
aeb7c3f825 treat any error fetching robots.txt as "allow all" 2018-06-22 14:50:57 -05:00
Neil Minton
f5f9a1a137
Merge pull request #109 from internetarchive/ARI-5747
update instagram behavior
2018-06-22 09:24:14 -07:00
Barbara Miller
ad5f409078 Merge branch 'ARI-5744' into qa 2018-06-19 14:17:08 -07:00
Barbara Miller
1f93f70bfe Revert "test behavior for event.crowdcompass.com"
This reverts commit 565a472fb0004f89434f1f775c154e9c4393d380.
2018-06-19 14:07:02 -07:00
Barbara Miller
96014606ec Merge branch 'ARI-5747' into qa 2018-06-18 10:37:09 -07:00
Barbara Miller
89e54fd2e6 update instagram behavior 2018-06-18 10:36:13 -07:00
Barbara Miller
0857fffeb6 Merge branch 'ARI-5747' into qa 2018-06-13 12:43:35 -07:00
Barbara Miller
5893b1f982 update instagram behavior 2018-06-13 12:43:12 -07:00
Barbara Miller
6b753623b7 Merge branch 'ARI-5744' into qa 2018-06-11 18:34:50 -07:00
Barbara Miller
565a472fb0 test behavior for event.crowdcompass.com 2018-06-11 18:30:40 -07:00
Noah Levitt
27bdfb65d2 monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
Noah Levitt
109d05c59a Merge branch 'master' into qa
* master:
  monkey-patch youtube-dl to short-circuit
2018-06-11 11:11:09 -07:00
Noah Levitt
a90a29968c monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:08:20 -07:00
Noah Levitt
5c34bd3119 Merge branch 'master' into qa
* master:
  lowercase readme.rst
  explain brozzler use of warcprox_meta
  update README copyright date
  bump dev version after PR #102
  these ssurts are strings too
  fix bad copy/paste
  ssurts are strings now
  travis-ci install warcprox from github
  incorporate urlcanon fix
  update warcprox dependency to include recent fixes
  backward compatibility for old scope["surt"]
  missed a spot where is_permitted_by_robots needs monkeying
  handle new chrome cookie db schema
  describe scope rule conditions
  more explication of scoping
  update docs to match new seed ssurt behavior
  ok seriously tests
  fix more tests for new approach sans scope['surt']
  s/max_hops_off_surt/max_hops_off/
  new test of max_hops_off
  rename page.hops_off_surt to page.hops_off
  doublethink had a bug fix
  tests for new approach without scope['surt']
  tests for new approach without of scope['surt']
  WIP add an accept rule instead of modifying surt
  WIP some words on scoping
  WIP starting to flesh out "scoping" section
  WIP some explanation of automatic login
  WIP documentation!
2018-06-01 16:46:32 -07:00
Noah Levitt
b41ccd7e6b
Merge pull request #108 from nlevitt/docs
Docs
2018-05-31 14:15:12 -07:00
Noah Levitt
62bb540a11 lowercase readme.rst 2018-05-31 18:46:37 +00:00
Noah Levitt
a00b5a7fd5 explain brozzler use of warcprox_meta 2018-05-30 18:06:39 -07:00
Noah Levitt
aef4c40993
Merge pull request #107 from internetarchive/copyright-2018
update README copyright date
2018-05-17 11:30:46 -07:00
Barbara Miller
135a13b1c9
update README copyright date 2018-05-17 11:21:47 -07:00
Noah Levitt
8906037d82 bump dev version after PR #102 2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
331d07fe88 these ssurts are strings too 2018-05-16 17:11:08 -07:00
Noah Levitt
67558528cb fix bad copy/paste 2018-05-16 16:43:38 -07:00
Noah Levitt
5bb392ec7c ssurts are strings now
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c travis-ci install warcprox from github 2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9 update warcprox dependency to include recent fixes 2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062 backward compatibility for old scope["surt"]
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
Noah Levitt
1572fd3ed6 missed a spot where is_permitted_by_robots needs monkeying 2018-05-15 16:52:48 -07:00
Noah Levitt
a8de9b70d1 handle new chrome cookie db schema 2018-05-15 11:41:02 -07:00
Noah Levitt
de1f240e25 describe scope rule conditions
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f more explication of scoping 2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d update docs to match new seed ssurt behavior 2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338 ok seriously tests 2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495 fix more tests for new approach sans scope['surt'] 2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527 s/max_hops_off_surt/max_hops_off/ 2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709 new test of max_hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df rename page.hops_off_surt to page.hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0 doublethink had a bug fix 2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a tests for new approach without of scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26712ce93 WIP add an accept rule instead of modifying surt
in place for seed redirects
2018-05-14 15:38:28 -07:00
Noah Levitt
98ce67ef36 WIP some words on scoping 2018-05-14 15:38:28 -07:00
Noah Levitt
88214236bb WIP starting to flesh out "scoping" section 2018-05-14 15:38:28 -07:00
Noah Levitt
6df2c1cf22 WIP some explanation of automatic login 2018-05-14 15:38:28 -07:00