1126 Commits

Author SHA1 Message Date
Noah Levitt
c52c16c260 fix bug in test, add another one 2018-06-22 16:10:23 -05:00
Noah Levitt
aeb7c3f825 treat any error fetching robots.txt as "allow all" 2018-06-22 14:50:57 -05:00
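For illustration only, the "allow all on any error" policy named in this commit message could look roughly like the sketch below. It is not brozzler's code: the function name `permitted_by_robots_sketch` and the `fetch_robots_txt` callable are hypothetical, and only the standard-library `urllib.robotparser` is used.

```python
import urllib.robotparser


def permitted_by_robots_sketch(url, fetch_robots_txt, user_agent="*"):
    """Return True if url may be fetched.

    fetch_robots_txt is a hypothetical zero-argument callable that returns
    the robots.txt body as text and may raise on any network or HTTP error.
    """
    try:
        robots_txt = fetch_robots_txt()
    except Exception:
        # Any failure to fetch robots.txt is treated as "allow all".
        return True
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The broad `except Exception` is deliberate in this sketch: the point of the policy is that no fetch failure, whatever its type, blocks the crawl.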
Neil Minton
f5f9a1a137
Merge pull request #109 from internetarchive/ARI-5747
update instagram behavior
2018-06-22 09:24:14 -07:00
Barbara Miller
89e54fd2e6 update instagram behavior 2018-06-18 10:36:13 -07:00
Noah Levitt
27bdfb65d2 monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
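As a rough sketch of the monkey-patching idea in this commit message (not the actual patch), the example below wraps an extractor method so that oversized input is rejected before the expensive `re.findall` call runs. `GenericExtractor`, `real_extract`, `ExtractionSkipped`, and the `max_bytes` parameter are hypothetical stand-ins, and checking the length of the downloaded body is an assumption; only the 20 MB threshold comes from the message.

```python
import re

MAX_HTML_BYTES = 20 * 1024 * 1024  # threshold taken from the commit message


class ExtractionSkipped(Exception):
    """Raised instead of running the regex over an oversized body."""


class GenericExtractor:
    """Hypothetical stand-in for a third-party extractor class."""

    def real_extract(self, body):
        return re.findall(r"<video[^>]*>", body)


# Keep a reference to the original method, then replace it on the class.
_original_real_extract = GenericExtractor.real_extract


def _short_circuit_real_extract(self, body, max_bytes=MAX_HTML_BYTES):
    if len(body) > max_bytes:
        # Bail out early rather than spin in re.findall on a huge string.
        raise ExtractionSkipped("body of %d bytes exceeds limit" % len(body))
    return _original_real_extract(self, body)


GenericExtractor.real_extract = _short_circuit_real_extract
```

Patching the class attribute rather than an instance means every extractor object created afterwards picks up the guard, which is the usual reason to monkey-patch a library instead of subclassing it.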
Noah Levitt
b41ccd7e6b
Merge pull request #108 from nlevitt/docs
Docs
2018-05-31 14:15:12 -07:00
Noah Levitt
62bb540a11 lowercase readme.rst 2018-05-31 18:46:37 +00:00
Noah Levitt
a00b5a7fd5 explain brozzler use of warcprox_meta 2018-05-30 18:06:39 -07:00
Noah Levitt
aef4c40993
Merge pull request #107 from internetarchive/copyright-2018
update README copyright date
2018-05-17 11:30:46 -07:00
Barbara Miller
135a13b1c9 update README copyright date 2018-05-17 11:21:47 -07:00
Noah Levitt
8906037d82 bump dev version after PR #102 2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
331d07fe88 these ssurts are strings too 2018-05-16 17:11:08 -07:00
Noah Levitt
67558528cb fix bad copy/paste 2018-05-16 16:43:38 -07:00
Noah Levitt
5bb392ec7c ssurts are strings now
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c travis-ci install warcprox from github 2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9 update warcprox dependency to include recent fixes 2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062 backward compatibility for old scope["surt"]
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
Noah Levitt
1572fd3ed6 missed a spot where is_permitted_by_robots needs monkeying 2018-05-15 16:52:48 -07:00
Noah Levitt
a8de9b70d1 handle new chrome cookie db schema 2018-05-15 11:41:02 -07:00
Noah Levitt
de1f240e25 describe scope rule conditions
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f more explication of scoping 2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d update docs to match new seed ssurt behavior 2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338 ok seriously tests 2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495 fix more tests for new approach sans scope['surt'] 2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527 s/max_hops_off_surt/max_hops_off/ 2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709 new test of max_hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df rename page.hops_off_surt to page.hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0 doublethink had a bug fix 2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26712ce93 WIP add an accept rule instead of modifying surt
in place for seed redirects
2018-05-14 15:38:28 -07:00
Noah Levitt
98ce67ef36 WIP some words on scoping 2018-05-14 15:38:28 -07:00
Noah Levitt
88214236bb WIP starting to flesh out "scoping" section 2018-05-14 15:38:28 -07:00
Noah Levitt
6df2c1cf22 WIP some explanation of automatic login 2018-05-14 15:38:28 -07:00
Noah Levitt
914289b414 WIP documentation! 2018-05-14 15:38:28 -07:00
Noah Levitt
a1af18230c
Merge pull request #103 from internetarchive/ARI-5671
instagram updates
2018-03-23 14:18:04 -07:00
Barbara Miller
426ca48554 less is more 2018-03-23 14:17:22 -07:00
Barbara Miller
51977908ec uncomment; now tested 2018-03-20 10:39:14 -07:00
Barbara Miller
9e871a9f81 instagram umbraBehavior & vanishing elem fix 2018-03-20 10:22:55 -07:00
Noah Levitt
6aa8af9d80
Merge pull request #101 from galgeek/ARI-5617
repeatSameElement, firstMatchOnly, configurable interval timing, for ARI-5617
2018-03-19 16:36:52 -07:00
Barbara Miller
1e2e7213c8 better booleans for umbraBehavior 2018-03-19 16:31:23 -07:00
Barbara Miller
bc5a36e8a3 better booleans 2018-03-19 16:28:47 -07:00
Barbara Miller
745e6cc942 log behavior params better 2018-03-19 16:28:14 -07:00
Barbara Miller
ae6f72769a better config names 2018-03-19 16:02:07 -07:00
Barbara Miller
74fc7cd102 update behaviors.yaml 2018-03-19 14:44:29 -07:00
Barbara Miller
cc207763d5 add onceOnly config; other tweaks 2018-03-19 14:44:29 -07:00
Barbara Miller
8f861389ba amerciaspresidents.si.edu/gallery behavior 2018-03-19 14:44:29 -07:00
Barbara Miller
5dfb081bb4 skipIDcheck, default false / no / 0 2018-03-19 14:44:29 -07:00