Neil Minton
4733b0ac7d
Update SoundCloud.com behavior selectors.
2017-08-18 14:16:51 -07:00
Neil Minton
a8a624fbbf
Add Archive.org playlists to default behavior.
2017-08-18 14:16:51 -07:00
Neil Minton
b0fd1df1ef
Generalize default behavior.
2017-08-18 14:16:51 -07:00
Neil Minton
12e02ae401
Merge pull request #62 from internetarchive/ARI-5460
...
update Instagram selectors
2017-08-17 16:08:44 -07:00
Barbara Miller
6391e7b40f
Merge pull request #60 from galgeek/ARI-5453
...
simpleclicks for wixsite.com
2017-08-14 17:14:09 -07:00
Barbara Miller
901995c6cf
Merge pull request #58 from internetarchive/ARI-5379
...
ARI 5379 URL regex update
2017-08-14 16:54:17 -07:00
Barbara Miller
36b7e4f3d6
Merge pull request #59 from galgeek/ARI-5465
...
skip a.uiMorePagerPrimary after all
2017-08-14 16:50:39 -07:00
Barbara Miller
b5121c26a8
simpleclicks for wixsite.com
2017-08-14 16:47:49 -07:00
Barbara Miller
12cca540bf
Merge branch 'ARI-5379' of github.com:internetarchive/brozzler into ARI-5379
2017-08-11 17:17:07 -07:00
Barbara Miller
4d99ebc1b4
only div.teaser, for *pm.gc.ca*
2017-08-11 17:16:02 -07:00
Barbara Miller
6952bf76fb
improve/expand url_regex
2017-08-11 17:16:02 -07:00
Barbara Miller
05c0cd7914
skip a.uiMorePagerPrimary after all
2017-08-11 10:41:15 -07:00
Noah Levitt
f785adb944
Merge pull request #57 from internetarchive/ARI-5409
...
simpleclicks for tuebingen.de
2017-08-10 10:59:38 -07:00
Barbara Miller
61bbb9774f
sign up button and updated selectors
2017-08-09 16:23:35 -07:00
Noah Levitt
55856d2180
Merge pull request #54 from vbanos/worker_on_request
...
Pass missing on_request callback to BrozzlerWorker methods
2017-08-03 14:17:02 -07:00
Barbara Miller
101257effe
simpleclicks for tuebingen.de
2017-08-03 13:21:46 -07:00
Noah Levitt
bf250194b4
bump dev version number after some PR merges
2017-08-01 12:04:56 -07:00
Noah Levitt
50881ab38b
Merge pull request #55 from galgeek/ARI-5242
...
ARI-5242
2017-08-01 12:04:18 -07:00
Noah Levitt
f7fe06874d
Merge pull request #53 from vbanos/bugfix-needs-browsing
...
bugfix for BrozzlerWorker._needs_browsing
2017-08-01 10:14:56 -07:00
Vangelis Banos
78b9d61654
Pass missing on_request callback to BrozzlerWorker methods
...
``Browser.browse_page`` has the ``on_request`` parameter, but it is only
used by ``BrozzlerWorker._browse_page``, where we invoke it.
This commit adds the parameter to ``BrozzlerWorker.brozzle_page`` and
passes it on to ``BrozzlerWorker._browse_page``. Other applications can
now use this callback when calling ``BrozzlerWorker.brozzle_page``.
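
A minimal sketch of the pass-through described in this message. The
``Browser`` and ``Worker`` classes below are simplified stand-ins written for
illustration; their signatures are assumptions and not brozzler's real API:

```python
class Browser:
    """Stand-in for a browser wrapper; not brozzler's real class."""

    def browse_page(self, url, on_request=None):
        # Pretend that loading the page triggered two network requests.
        for request_url in (url, url + "/style.css"):
            if on_request is not None:
                on_request({"url": request_url})
        return url


class Worker:
    """Stand-in worker that forwards on_request down to the browser."""

    def __init__(self, browser):
        self._browser = browser

    def brozzle_page(self, url, on_request=None):
        # The public entry point accepts the callback and hands it on.
        return self._browse_page(url, on_request=on_request)

    def _browse_page(self, url, on_request=None):
        return self._browser.browse_page(url, on_request=on_request)


# A caller outside the worker can now observe each request.
seen = []
Worker(Browser()).brozzle_page("http://example.com", on_request=seen.append)
```

The point of the design is that the callback is optional at every level, so
existing callers that pass nothing keep working unchanged.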
2017-08-01 13:58:42 +00:00
Vangelis Banos
ae73edb244
bugfix for BrozzlerWorker._needs_browsing
...
Sorry, my previous commit introduced a bug in the
``BrozzlerWorker._needs_browsing`` method.
Specifically, if the ``brozzler_spy`` param is False (which happens
when ``youtube_dl`` is disabled), ``_needs_browsing`` always returns
``False``, and this breaks the ``brozzle_page`` workflow: the
browser never visits the page.
This commit fixes that.
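
A simplified sketch of the failure mode and the fix. The function names and
the ``saw_html`` attribute are hypothetical, and the real method inspects
more than this; only the fall-through behavior is taken from the message
above:

```python
from types import SimpleNamespace


def needs_browsing_buggy(spy):
    # With no spy object, the check falls through to False,
    # so the browser never visits the page.
    if spy:
        return spy.saw_html
    return False


def needs_browsing_fixed(spy):
    # Without a spy nothing is known about the response,
    # so the safe answer is to browse the page.
    if spy:
        return spy.saw_html
    return True


# youtube-dl disabled: no spy is available.
print(needs_browsing_buggy(False), needs_browsing_fixed(False))
```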
2017-08-01 11:43:42 +00:00
Noah Levitt
895bfbf913
Merge pull request #51 from vbanos/youtube-dl-option
...
Make youtube-dl optional in BrozzlerWorker.brozzle_page
2017-07-31 11:32:51 -07:00
Noah Levitt
7d62eb6525
Merge pull request #52 from vbanos/remove-redundant-param
...
Remove redundant method parameter.
2017-07-31 11:23:28 -07:00
Vangelis Banos
0343969807
Remove redundant method parameter.
...
``ignore_cert_errors`` is passed to ``Chrome`` via ``Browser`` via
``BrowserPool`` here:
https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L120
It does nothing in ``Browser.browse_page``.
2017-07-31 12:36:17 +00:00
Vangelis Banos
6259d03be1
bugfix
2017-07-31 10:36:35 +00:00
Vangelis Banos
9c81a7bbda
Make youtube-dl optional in BrozzlerWorker.brozzle_page
...
Enabled by default (of course).
Speed is significantly improved when disabled.
2017-07-31 08:57:47 +00:00
Barbara Miller
a563e9eb0c
Merge pull request #50 from internetarchive/ARI-5407
...
Add click selector for facebook’s new See More link
2017-07-21 14:43:38 -07:00
Barbara Miller
7620ee6fae
only div.teaser, for *pm.gc.ca*
2017-07-20 15:52:53 -07:00
Barbara Miller
795b9ab809
improve/expand url_regex
2017-07-19 15:50:49 -07:00
Barbara Miller
5c4961fbce
add selector for See More link
2017-07-19 14:44:27 -07:00
Barbara Miller
59571dadfa
div.compactTrackListItem for soundcloud.com
2017-07-18 14:27:10 -07:00
Barbara Miller
3c9eb30212
div.soundItem selector for multi-item list
2017-07-18 14:08:20 -07:00
Noah Levitt
0955d56926
Merge pull request #46 from internetarchive/ARI-5379
...
ARI-5379: custom behavior for pm.gc.ca
2017-07-13 11:42:34 -07:00
Barbara Miller
0a2895364d
resolve conflict
2017-07-13 11:32:25 -07:00
Barbara Miller
762b65ee3e
selectors for multi-item playlist
2017-07-12 11:19:53 -07:00
Noah Levitt
c77f4e4249
dev version bump
2017-07-06 17:19:53 -07:00
Noah Levitt
6cbe097c87
Merge pull request #48 from vbanos/WWM-802
...
new skip cli options for brozzle-page and brozzler-worker
2017-07-06 17:19:28 -07:00
Vangelis Banos
8019eb4b5f
Hide the options using argparse.SUPPRESS
2017-07-06 06:25:04 +00:00
Vangelis Banos
475ddd329c
add skip cli options to brozzle-page
...
Add the --skip-extract-outlinks and --skip-visit-hashtags options to the
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4
--skip-extract-outlinks, --skip-visit-hashtags
...
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
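
A minimal ``argparse`` sketch of two opt-out flags like the ones named above.
The flag names come from this message; the parser name, the ``url`` argument
and the ``dest`` names are assumptions for illustration, not the actual
brozzler command-line code:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="brozzle-page-sketch")
    parser.add_argument("url")
    # Both actions stay enabled unless the caller explicitly skips them.
    parser.add_argument(
        "--skip-extract-outlinks",
        dest="skip_extract_outlinks",
        action="store_true",
    )
    parser.add_argument(
        "--skip-visit-hashtags",
        dest="skip_visit_hashtags",
        action="store_true",
    )
    return parser


# Fetch one page without extracting outlinks; hashtag visits stay on.
args = build_parser().parse_args(
    ["--skip-extract-outlinks", "http://example.com"]
)
```

Using ``store_true`` keeps the default behavior identical to before the
options existed, since both values are ``False`` when the flags are absent.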
2017-07-04 21:50:05 +00:00
Noah Levitt
261e7977ad
Merge pull request #47 from galgeek/ARI-5389
...
custom behavior for pitchfork.com, based on facebook & pm-gc-ca behaviors
2017-07-03 16:40:27 -07:00
Barbara Miller
24a68cb55d
pitchfork behavior, based on pm-ca and facebook behaviors
2017-06-30 13:54:54 -07:00
Noah Levitt
051e299a80
fix "local variable 'start' referenced before assignment"
2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30
enforce time limits based on time claimed by worker actively brozzling, to avoid problem of stopping crawls that haven't had much chance to crawl, because of cluster busy-ness
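
A rough sketch of the accounting idea, assuming active time is recorded as
(start, stop) intervals in seconds. The function names and the interval
representation are hypothetical and do not reflect how brozzler stores this:

```python
def active_seconds(claims):
    """Sum the (start, stop) intervals during which a worker held the site."""
    return sum(stop - start for start, stop in claims)


def time_limit_reached(claims, time_limit):
    # Compare the limit to time actually spent brozzling,
    # not to wall-clock time since the crawl started.
    return time_limit is not None and active_seconds(claims) >= time_limit


# The crawl began at t=0 and it is now past t=3000, but on a busy
# cluster the site was only worked on during two short intervals.
claims = [(0, 60), (3000, 3030)]
print(active_seconds(claims), time_limit_reached(claims, 600))
```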
2017-06-26 18:00:32 -07:00
Noah Levitt
3385d727ac
minimally update test_time_limit for new time accounting
2017-06-26 17:57:50 -07:00
Noah Levitt
8ef7972ace
make sure youtube-dl progress thing can't derail youtube-dl operation
2017-06-26 16:10:40 -07:00
Noah Levitt
caee2787b0
have brozzler-list-sites --active use the index
2017-06-24 01:05:19 +00:00
Noah Levitt
35babeb01b
make youtube-dl prefer unsegmented videos
2017-06-23 15:19:30 -07:00
Noah Levitt
e6b5770f6c
try workaround, maybe this is an issue with https://blog.travis-ci.com/2017-06-21-trusty-updates-2017-Q2-launch
2017-06-23 14:07:07 -07:00
Noah Levitt
29b19b1e9d
shed some light on the travis-ci error
2017-06-23 13:56:25 -07:00