Noah Levitt
27bdfb65d2
monkey-patch youtube-dl to short-circuit
...
video extraction using generic extractor in case of a url whose content is
very large (more than 20 MB) and which youtube-dl interprets as html, to
avoid spinning forever here:
Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
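The short-circuit described in commit 27bdfb65d2 can be illustrated with a minimal sketch of the general monkey-patching pattern. This is not the code from the commit: `StandInExtractor`, `read_content`, and `MAX_CONTENT_CHARS` are hypothetical names standing in for the third-party extractor, and the threshold only mirrors the "more than 20 MB" figure from the commit message.

```python
import logging

# Assumed threshold, mirroring "more than 20 MB" in the commit message.
MAX_CONTENT_CHARS = 20 * 1000 * 1000


class StandInExtractor:
    """Hypothetical stand-in for a third-party extractor class."""

    def read_content(self, payload):
        # Pretend this downloads and decodes a page body.
        return payload

    def extract(self, payload):
        content = self.read_content(payload)
        # Pretend this is the expensive scan over the whole page.
        return content.count("<video")


# Keep a reference to the original method so the wrapper can delegate to it.
_orig_read_content = StandInExtractor.read_content


def _guarded_read_content(self, *args, **kwargs):
    content = _orig_read_content(self, *args, **kwargs)
    if len(content) > MAX_CONTENT_CHARS:
        logging.warning(
            "bypassing extraction, content too large (%s characters)",
            len(content))
        # Returning an empty body makes the later scan trivially cheap.
        return ""
    return content


# The monkey-patch: every instance now uses the guarded method.
StandInExtractor.read_content = _guarded_read_content
```

Patching the content-reading step rather than the scan itself leaves the rest of the extractor untouched; the oversized page simply looks empty to everything downstream.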
Noah Levitt
b41ccd7e6b
Merge pull request #108 from nlevitt/docs
...
Docs
2018-05-31 14:15:12 -07:00
Noah Levitt
62bb540a11
lowercase readme.rst
2018-05-31 18:46:37 +00:00
Noah Levitt
a00b5a7fd5
explain brozzler use of warcprox_meta
2018-05-30 18:06:39 -07:00
Noah Levitt
aef4c40993
Merge pull request #107 from internetarchive/copyright-2018
...
update README copyright date
2018-05-17 11:30:46 -07:00
Barbara Miller
135a13b1c9
update README copyright date
2018-05-17 11:21:47 -07:00
Noah Levitt
8906037d82
bump dev version after PR #102
2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
...
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
331d07fe88
these ssurts are strings too
2018-05-16 17:11:08 -07:00
Noah Levitt
67558528cb
fix bad copy/paste
2018-05-16 16:43:38 -07:00
Noah Levitt
5bb392ec7c
ssurts are strings now
...
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c
travis-ci install warcprox from github
2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff
incorporate urlcanon fix
2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9
update warcprox dependency to include recent fixes
2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062
backward compatibility for old scope["surt"]
...
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
Noah Levitt
1572fd3ed6
missed a spot where is_permitted_by_robots needs monkeying
2018-05-15 16:52:48 -07:00
Noah Levitt
a8de9b70d1
handle new chrome cookie db schema
2018-05-15 11:41:02 -07:00
Noah Levitt
de1f240e25
describe scope rule conditions
...
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f
more explication of scoping
2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d
update docs to match new seed ssurt behavior
2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338
ok seriously tests
2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495
fix more tests for new approach sans scope['surt']
2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527
s/max_hops_off_surt/max_hops_off/
2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709
new test of max_hops_off
2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df
rename page.hops_off_surt to page.hops_off
2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0
doublethink had a bug fix
2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f
tests for new approach without scope['surt']
...
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a
tests for new approach without scope['surt']
...
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26712ce93
WIP add an accept rule instead of modifying surt
...
in place for seed redirects
2018-05-14 15:38:28 -07:00
Noah Levitt
98ce67ef36
WIP some words on scoping
2018-05-14 15:38:28 -07:00
Noah Levitt
88214236bb
WIP starting to flesh out "scoping" section
2018-05-14 15:38:28 -07:00
Noah Levitt
6df2c1cf22
WIP some explanation of automatic login
2018-05-14 15:38:28 -07:00
Noah Levitt
914289b414
WIP documentation!
2018-05-14 15:38:28 -07:00
Noah Levitt
a1af18230c
Merge pull request #103 from internetarchive/ARI-5671
...
instagram updates
2018-03-23 14:18:04 -07:00
Barbara Miller
426ca48554
less is more
2018-03-23 14:17:22 -07:00
Barbara Miller
51977908ec
uncomment; now tested
2018-03-20 10:39:14 -07:00
Barbara Miller
9e871a9f81
instagram umbraBehavior & vanishing elem fix
2018-03-20 10:22:55 -07:00
Noah Levitt
6aa8af9d80
Merge pull request #101 from galgeek/ARI-5617
...
repeatSameElement, firstMatchOnly, configurable interval timing, for ARI-5617
2018-03-19 16:36:52 -07:00
Barbara Miller
1e2e7213c8
better booleans for umbraBehavior
2018-03-19 16:31:23 -07:00
Barbara Miller
bc5a36e8a3
better booleans
2018-03-19 16:28:47 -07:00
Barbara Miller
745e6cc942
log behavior params better
2018-03-19 16:28:14 -07:00
Barbara Miller
ae6f72769a
better config names
2018-03-19 16:02:07 -07:00
Barbara Miller
74fc7cd102
update behaviors.yaml
2018-03-19 14:44:29 -07:00
Barbara Miller
cc207763d5
add onceOnly config; other tweaks
2018-03-19 14:44:29 -07:00
Barbara Miller
8f861389ba
amerciaspresidents.si.edu/gallery behavior
2018-03-19 14:44:29 -07:00
Barbara Miller
5dfb081bb4
skipIDcheck, default false / no / 0
2018-03-19 14:44:29 -07:00
Barbara Miller
8f12f0b0c0
better idCheck and configurable interval timing
2018-03-19 14:44:04 -07:00
Barbara Miller
c31f13e47f
add idCheck feature, default: true
2018-03-19 14:44:04 -07:00
Noah Levitt
8e273b2e6b
Merge pull request #100 from nlevitt/max-claimed-sites
...
reimplement max_claimed_sites
2018-03-15 15:05:46 -07:00
Noah Levitt
dc00f5de32
reimplement max_claimed_sites
...
Other approach was too slow and caused db contention.
New approach avoids (slow) rethinkdb join by copying max_claimed_sites job
parameter to each of the job's sites. Uses rethinkdb fold() to count
claimed sites and enforce max_claimed_sites within a single query.
2018-03-15 12:57:49 -07:00
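The fold-based counting described in commit dc00f5de32 can be approximated in memory. The sketch below is a plain-Python analogy using `functools.reduce`, not the RethinkDB query from the commit; the site fields `id`, `job_id`, and `claimed` are assumptions, and `max_claimed_sites` is assumed to have been copied from the job onto each site as the commit message describes.

```python
from functools import reduce


def claimable_site_ids(sites, limit):
    """Return ids of up to `limit` unclaimed sites without exceeding
    each job's max_claimed_sites, using a single pass over the sites."""
    # Claimed sites sort first, so a job's existing claims are counted
    # before any of its unclaimed sites is considered.
    ordered = sorted(sites, key=lambda s: not s["claimed"])

    def step(acc, site):
        counts, picked = acc
        job = site["job_id"]
        # Running per-job count of sites seen so far in the fold.
        counts[job] = counts.get(job, 0) + 1
        cap = site.get("max_claimed_sites")  # None means no cap
        if not site["claimed"] and (cap is None or counts[job] <= cap):
            picked.append(site["id"])
        return counts, picked

    _, picked = reduce(step, ordered, ({}, []))
    return picked[:limit]
```

Because the cap travels with each site, the pass never has to look up the job record, which is the in-memory counterpart of avoiding the join.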