Noah Levitt
39155ebcc5
push youtube-dl's stitched up videos to warcprox
...
(no tests yet)
2018-08-13 15:40:48 -07:00
Noah Levitt
4e398e1da2
expose more brozzle-page args
2018-08-13 15:38:24 -07:00
Noah Levitt
b44a444dc2
update pillow dependency to get rid of github vulnerability warning
2018-07-24 16:37:25 -05:00
Noah Levitt
771d6aa626
more readme edits
2018-07-23 19:05:49 -05:00
Noah Levitt
073fc713f4
Merge pull request #113 from nlevitt/karl-readme
...
Karl readme copy edits
2018-07-23 18:36:00 -05:00
Noah Levitt
f7407a87c1
reformat readme to 80 columns
2018-07-23 23:32:56 +00:00
Noah Levitt
a7fb7bcc37
Merge branch 'master' into karl
...
* master:
  bump up heartbeat interval (see comment)
  back to dev version
  version 1.3 (messed up 1.2)
  setuptools wants README not readme
  back to dev version number
  version 1.2
  bump dev version after merge
  is test_time_limit is failing because of timing?
  fix bug in test, add another one
  treat any error fetching robots.txt as "allow all"
  update instagram behavior
2018-07-23 23:28:42 +00:00
Karl-Rainer Blumenthal
bd78e07232
Copy edits to job-conf readme
...
Good reading and rampant pedantry!
2018-07-06 15:24:12 -04:00
Noah Levitt
9d18dc6aeb
bump up heartbeat interval (see comment)
2018-07-03 18:35:08 -05:00
Karl-Rainer Blumenthal
eebbc1d279
Copy edits
2018-06-28 12:59:22 -04:00
Noah Levitt
783fd0ea87
back to dev version
2018-06-25 19:32:27 +00:00
Noah Levitt
bd63908fb9
version 1.3 (messed up 1.2)
1.3
2018-06-25 19:30:39 +00:00
Noah Levitt
2780c92569
setuptools wants README not readme
2018-06-25 19:10:57 +00:00
Noah Levitt
032c7d2898
back to dev version number
2018-06-25 12:33:34 -05:00
Noah Levitt
442d02b26a
version 1.2
1.2
2018-06-25 12:21:00 -05:00
Noah Levitt
196cd555ea
bump dev version after merge
2018-06-25 11:44:45 -05:00
Noah Levitt
05ec6a68b0
Merge pull request #110 from nlevitt/robots-errors
...
treat any error fetching robots.txt as "allow all"
2018-06-25 11:44:18 -05:00
Noah Levitt
d4db8ba9bc
is test_time_limit is failing because of timing?
...
give it up to ten seconds to mark the job finished
2018-06-25 10:35:24 -05:00
Noah Levitt
c52c16c260
fix bug in test, add another one
2018-06-22 16:10:23 -05:00
Noah Levitt
aeb7c3f825
treat any error fetching robots.txt as "allow all"
2018-06-22 14:50:57 -05:00
Neil Minton
f5f9a1a137
Merge pull request #109 from internetarchive/ARI-5747
...
update instagram behavior
2018-06-22 09:24:14 -07:00
Barbara Miller
89e54fd2e6
update instagram behavior
2018-06-18 10:36:13 -07:00
Noah Levitt
27bdfb65d2
monkey-patch youtube-dl to short-circuit
...
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:
Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
Noah Levitt
b41ccd7e6b
Merge pull request #108 from nlevitt/docs
...
Docs
2018-05-31 14:15:12 -07:00
Noah Levitt
62bb540a11
lowercase readme.rst
2018-05-31 18:46:37 +00:00
Noah Levitt
a00b5a7fd5
explain brozzler use of warcprox_meta
2018-05-30 18:06:39 -07:00
Noah Levitt
aef4c40993
Merge pull request #107 from internetarchive/copyright-2018
...
update README copyright date
2018-05-17 11:30:46 -07:00
Barbara Miller
135a13b1c9
update README copyright date
2018-05-17 11:21:47 -07:00
Noah Levitt
8906037d82
bump dev version after PR #102
2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5
Merge pull request #102 from nlevitt/docs
...
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
331d07fe88
these ssurts are strings too
2018-05-16 17:11:08 -07:00
Noah Levitt
67558528cb
fix bad copy/paste
2018-05-16 16:43:38 -07:00
Noah Levitt
5bb392ec7c
ssurts are strings now
...
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c
travis-ci install warcprox from github
2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff
incorporate urlcanon fix
2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9
update warcprox dependency to include recent fixes
2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062
backward compatibility for old scope["surt"]
...
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
Noah Levitt
1572fd3ed6
missed a spot where is_permitted_by_robots needs monkeying
2018-05-15 16:52:48 -07:00
Noah Levitt
a8de9b70d1
handle new chrome cookie db schema
2018-05-15 11:41:02 -07:00
Noah Levitt
de1f240e25
describe scope rule conditions
...
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f
more explication of scoping
2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d
update docs to match new seed ssurt behavior
2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338
ok seriously tests
2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495
fix more tests for new approach sans scope['surt']
2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527
s/max_hops_off_surt/max_hops_off/
2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709
new test of max_hops_off
2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df
rename page.hops_off_surt to page.hops_off
2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0
doublethink had a bug fix
2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f
tests for new approach without scope['surt']
...
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a
tests for new approach without of scope['surt']
...
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00