Commit graph

1587 commits

Author SHA1 Message Date
Noah Levitt
a7fb7bcc37 Merge branch 'master' into karl
* master:
  bump up heartbeat interval (see comment)
  back to dev version
  version 1.3 (messed up 1.2)
  setuptools wants README not readme
  back to dev version number
  version 1.2
  bump dev version after merge
  is test_time_limit is failing because of timing?
  fix bug in test, add another one
  treat any error fetching robots.txt as "allow all"
  update instagram behavior
2018-07-23 23:28:42 +00:00
Karl-Rainer Blumenthal
bd78e07232 Copy edits to job-conf readme
Good reading and rampant pedantry!
2018-07-06 15:24:12 -04:00
Noah Levitt
9d18dc6aeb bump up heartbeat interval (see comment) 2018-07-03 18:35:08 -05:00
Barbara Miller
98c21d9d1f Merge branch 'ARI-5689' into qa 2018-07-03 14:50:51 -07:00
Barbara Miller
687d51de20 Revert "skip login for fb groups"
This reverts commit 5e1c86421e.
2018-07-03 14:49:58 -07:00
Karl-Rainer Blumenthal
eebbc1d279 Copy edits 2018-06-28 12:59:22 -04:00
Barbara Miller
c3a19d3186 Revert "switch to group discussion tab"
This reverts commit 01bb731f54.
2018-06-27 16:56:34 -07:00
Barbara Miller
9a8fc15ff2 Merge branch 'ARI-5689' into qa 2018-06-27 16:54:41 -07:00
Barbara Miller
423e91a69f Revert "switch to group discussion tab"
This reverts commit 01bb731f54.
2018-06-27 16:51:55 -07:00
Barbara Miller
5e1c86421e skip login for fb groups 2018-06-27 16:51:38 -07:00
Barbara Miller
01bb731f54 switch to group discussion tab 2018-06-27 16:51:37 -07:00
Barbara Miller
1fffaa9eee Merge branch 'ARI-5689' into qa 2018-06-27 16:47:46 -07:00
Barbara Miller
76ec00c930 skip login for fb groups 2018-06-27 16:47:32 -07:00
Noah Levitt
783fd0ea87 back to dev version 2018-06-25 19:32:27 +00:00
Noah Levitt
bd63908fb9 version 1.3 (messed up 1.2) 2018-06-25 19:30:39 +00:00
Noah Levitt
2780c92569 setuptools wants README not readme 2018-06-25 19:10:57 +00:00
Noah Levitt
032c7d2898 back to dev version number 2018-06-25 12:33:34 -05:00
Noah Levitt
442d02b26a version 1.2 2018-06-25 12:21:00 -05:00
Noah Levitt
196cd555ea bump dev version after merge 2018-06-25 11:44:45 -05:00
Noah Levitt
05ec6a68b0 Merge pull request #110 from nlevitt/robots-errors
treat any error fetching robots.txt as "allow all"
2018-06-25 11:44:18 -05:00
Noah Levitt
d4db8ba9bc is test_time_limit is failing because of timing?
give it up to ten seconds to mark the job finished
2018-06-25 10:35:24 -05:00
Noah Levitt
09dbb4ce1d Merge branch 'robots-errors' into qa
* robots-errors:
  fix bug in test, add another one
2018-06-22 16:10:51 -05:00
Noah Levitt
c52c16c260 fix bug in test, add another one 2018-06-22 16:10:23 -05:00
Noah Levitt
aff67c3b29 Merge branch 'robots-errors' into qa
* robots-errors:
  treat any error fetching robots.txt as "allow all"
2018-06-22 16:01:21 -05:00
Noah Levitt
aeb7c3f825 treat any error fetching robots.txt as "allow all" 2018-06-22 14:50:57 -05:00
Neil Minton
f5f9a1a137 Merge pull request #109 from internetarchive/ARI-5747
update instagram behavior
2018-06-22 09:24:14 -07:00
Barbara Miller
ad5f409078 Merge branch 'ARI-5744' into qa 2018-06-19 14:17:08 -07:00
Barbara Miller
1f93f70bfe Revert "test behavior for event.crowdcompass.com"
This reverts commit 565a472fb0.
2018-06-19 14:07:02 -07:00
Barbara Miller
96014606ec Merge branch 'ARI-5747' into qa 2018-06-18 10:37:09 -07:00
Barbara Miller
89e54fd2e6 update instagram behavior 2018-06-18 10:36:13 -07:00
Barbara Miller
0857fffeb6 Merge branch 'ARI-5747' into qa 2018-06-13 12:43:35 -07:00
Barbara Miller
5893b1f982 update instagram behavior 2018-06-13 12:43:12 -07:00
Barbara Miller
6b753623b7 Merge branch 'ARI-5744' into qa 2018-06-11 18:34:50 -07:00
Barbara Miller
565a472fb0 test behavior for event.crowdcompass.com 2018-06-11 18:30:40 -07:00
Noah Levitt
27bdfb65d2 monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:50:22 -07:00
Noah Levitt
109d05c59a Merge branch 'master' into qa
* master:
  monkey-patch youtube-dl to short-circuit
2018-06-11 11:11:09 -07:00
Noah Levitt
a90a29968c monkey-patch youtube-dl to short-circuit
video extraction using generic extractor in case of very large url (more
than 20 mb) that youtube-dl interprets as html, to avoid spinning
forever here:

Traceback (most recent call first):
  File "/opt/brozzler-ve3/lib/python3.5/re.py", line 213, in findall
    return _compile(pattern, flags).findall(string)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/generic.py", line 2878, in _real_extract
    'uploader': video_uploader,
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/extractor/common.py", line 503, in extract
    ie_result = self._real_extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/youtube_dl/YoutubeDL.py", line 792, in extract_info
    ie_result = ie.extract(url)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 302, in _try_youtube_dl
    info = ydl.extract_info(str(urlcanon.whatwg(page.url)))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 361, in brozzle_page
    self._try_youtube_dl(ydl, site, page)
2018-06-11 11:08:20 -07:00
Noah Levitt
5c34bd3119 Merge branch 'master' into qa
* master:
  lowercase readme.rst
  explain brozzler use of warcprox_meta
  update README copyright date
  bump dev version after PR #102
  these ssurts are strings too
  fix bad copy/paste
  ssurts are strings now
  travis-ci install warcprox from github
  incorporate urlcanon fix
  update warcprox dependency to include recent fixes
  backward compatibility for old scope["surt"]
  missed a spot where is_permitted_by_robots needs monkeying
  handle new chrome cookie db schema
  describe scope rule conditions
  more explication of scoping
  update docs to match new seed ssurt behavior
  ok seriously tests
  fix more tests for new approach sans scope['surt']
  s/max_hops_off_surt/max_hops_off/
  new test of max_hops_off
  rename page.hops_off_surt to page.hops_off
  doublethink had a bug fix
  tests for new approach without scope['surt']
  tests for new approach without of scope['surt']
  WIP add an accept rule instead of modifying surt
  WIP some words on scoping
  WIP starting to flesh out "scoping" section
  WIP some explanation of automatic login
  WIP documentation!
2018-06-01 16:46:32 -07:00
Noah Levitt
b41ccd7e6b Merge pull request #108 from nlevitt/docs
Docs
2018-05-31 14:15:12 -07:00
Noah Levitt
62bb540a11 lowercase readme.rst 2018-05-31 18:46:37 +00:00
Noah Levitt
a00b5a7fd5 explain brozzler use of warcprox_meta 2018-05-30 18:06:39 -07:00
Noah Levitt
aef4c40993 Merge pull request #107 from internetarchive/copyright-2018
update README copyright date
2018-05-17 11:30:46 -07:00
Barbara Miller
135a13b1c9 update README copyright date 2018-05-17 11:21:47 -07:00
Noah Levitt
8906037d82 bump dev version after PR #102 2018-05-16 17:33:52 -07:00
Noah Levitt
e90e7345a5 Merge pull request #102 from nlevitt/docs
complete job configuration documentation
2018-05-16 17:31:27 -07:00
Noah Levitt
331d07fe88 these ssurts are strings too 2018-05-16 17:11:08 -07:00
Noah Levitt
67558528cb fix bad copy/paste 2018-05-16 16:43:38 -07:00
Noah Levitt
5bb392ec7c ssurts are strings now
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c travis-ci install warcprox from github 2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00