1053 Commits

Author SHA1 Message Date
Barbara Miller
e786013b1b fix typo 2017-08-26 16:58:00 -07:00
Barbara Miller
00b57ed87a Merge pull request #61 from internetarchive/x11-support
screenshots don't work with Xvfb
2017-08-26 16:45:50 -07:00
Barbara Miller
f810603cdf Merge pull request #63 from vbanos/configurable-page-timeout
Thank you, @vbanos!
2017-08-23 13:31:29 -07:00
Vangelis Banos
00513af877 Configurable page timeout
The page loading timeout was hard-coded to 300s. With this change,
we make it configurable with a default value of 300.
2017-08-23 08:05:14 +00:00
Neil Minton
4733b0ac7d Update SoundCloud.com behavior selectors. 2017-08-18 14:16:51 -07:00
Neil Minton
a8a624fbbf Add Archive.org playlists to default behavior. 2017-08-18 14:16:51 -07:00
Neil Minton
b0fd1df1ef Generalize default behavior. 2017-08-18 14:16:51 -07:00
Neil Minton
12e02ae401 Merge pull request #62 from internetarchive/ARI-5460
update Instagram selectors
2017-08-17 16:08:44 -07:00
Barbara Miller
c181f4bcc3 screenshots don't work w/Xvfb 2017-08-16 15:20:43 -07:00
Barbara Miller
6391e7b40f Merge pull request #60 from galgeek/ARI-5453
simpleclicks for wixsite.com
2017-08-14 17:14:09 -07:00
Barbara Miller
901995c6cf Merge pull request #58 from internetarchive/ARI-5379
ARI 5379 URL regex update
2017-08-14 16:54:17 -07:00
Barbara Miller
36b7e4f3d6 Merge pull request #59 from galgeek/ARI-5465
skip a.uiMorePagerPrimary after all
2017-08-14 16:50:39 -07:00
Barbara Miller
b5121c26a8 simpleclicks for wixsite.com 2017-08-14 16:47:49 -07:00
Barbara Miller
12cca540bf Merge branch 'ARI-5379' of github.com:internetarchive/brozzler into ARI-5379 2017-08-11 17:17:07 -07:00
Barbara Miller
4d99ebc1b4 only div.teaser, for *pm.gc.ca* 2017-08-11 17:16:02 -07:00
Barbara Miller
6952bf76fb improve/expand url_regex 2017-08-11 17:16:02 -07:00
Barbara Miller
05c0cd7914 skip a.uiMorePagerPrimary after all 2017-08-11 10:41:15 -07:00
Noah Levitt
f785adb944 Merge pull request #57 from internetarchive/ARI-5409
simpleclicks for tuebingen.de
2017-08-10 10:59:38 -07:00
Barbara Miller
61bbb9774f sign up button and updated selectors 2017-08-09 16:23:35 -07:00
Noah Levitt
55856d2180 Merge pull request #54 from vbanos/worker_on_request
Pass missing on_request callback to BrozzlerWorker methods
2017-08-03 14:17:02 -07:00
Barbara Miller
101257effe simpleclicks for tuebingen.de 2017-08-03 13:21:46 -07:00
Noah Levitt
bf250194b4 bump dev version number after some PR merges 2017-08-01 12:04:56 -07:00
Noah Levitt
50881ab38b Merge pull request #55 from galgeek/ARI-5242
ARI-5242
2017-08-01 12:04:18 -07:00
Noah Levitt
f7fe06874d Merge pull request #53 from vbanos/bugfix-needs-browsing
bugfix for BrozzlerWorker._needs_browsing
2017-08-01 10:14:56 -07:00
Vangelis Banos
78b9d61654 Pass missing on_request callback to BrozzlerWorker methods
``Browser.browser_page`` has the ``on_request`` parameter but it is not
used by ``BrozzlerWorker._browse_page`` where we invoke it.

I add this to ``BrozzlerWorker.brozzle_page`` and pass it also to
``BrozzlerWorker._browser_page``. Now, it is possible to use this
callback from other applications when calling
``BrozzlerWorker.brozzle_page``.
2017-08-01 13:58:42 +00:00
Vangelis Banos
ae73edb244 bugfix for BrozzlerWorker._needs_browsing
I'm sorry, but with my previous commit I introduced a bug in the
``BrozzlerWorker._needs_browsing`` method.

More specifically, if the ``brozzler_spy`` param is False (this happens
when ``youtube_dl`` is disabled), the ``_needs_browsing`` method always
returns ``False``, and this breaks the ``brozzle_page`` workflow: the
browser never visits the page.

I'm fixing this here.
2017-08-01 11:43:42 +00:00
Noah Levitt
895bfbf913 Merge pull request #51 from vbanos/youtube-dl-option
Make youtube-dl optional in BrozzlerWorker.brozzle_page
2017-07-31 11:32:51 -07:00
Noah Levitt
7d62eb6525 Merge pull request #52 from vbanos/remove-redundant-param
Remove redundant method parameter.
2017-07-31 11:23:28 -07:00
Vangelis Banos
0343969807 Remove redundant method parameter.
``ignore_cert_errors`` is passed to ``Chrome`` via ``Browser`` via
``BrowserPool`` here:

https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L120

It does not do anything in ``Browser.browser_page``.
2017-07-31 12:36:17 +00:00
Vangelis Banos
6259d03be1 bugfix 2017-07-31 10:36:35 +00:00
Vangelis Banos
9c81a7bbda Make youtube-dl optional in BrozzlerWorker.brozzle_page
Enabled by default (of course).
Speed is significantly improved when disabled.
2017-07-31 08:57:47 +00:00
Barbara Miller
a563e9eb0c Merge pull request #50 from internetarchive/ARI-5407
Add click selector for facebook’s new See More link
2017-07-21 14:43:38 -07:00
Barbara Miller
7620ee6fae only div.teaser, for *pm.gc.ca* 2017-07-20 15:52:53 -07:00
Barbara Miller
795b9ab809 improve/expand url_regex 2017-07-19 15:50:49 -07:00
Barbara Miller
5c4961fbce add selector for See More link 2017-07-19 14:44:27 -07:00
Barbara Miller
59571dadfa div.compactTrackListItem for soundcloud.com 2017-07-18 14:27:10 -07:00
Barbara Miller
3c9eb30212 div.soundItem selector for multi-item list 2017-07-18 14:08:20 -07:00
Noah Levitt
0955d56926 Merge pull request #46 from internetarchive/ARI-5379
ARI-5379: custom behavior for pm.gc.ca
2017-07-13 11:42:34 -07:00
Barbara Miller
0a2895364d resolve conflict 2017-07-13 11:32:25 -07:00
Barbara Miller
762b65ee3e selectors for multi-item playlist 2017-07-12 11:19:53 -07:00
Noah Levitt
c77f4e4249 dev version bump 2017-07-06 17:19:53 -07:00
Noah Levitt
6cbe097c87 Merge pull request #48 from vbanos/WWM-802
new skip cli options for brozzle-page and brozzler-worker
2017-07-06 17:19:28 -07:00
Vangelis Banos
8019eb4b5f Hide the options using argparse.SUPPRESS 2017-07-06 06:25:04 +00:00
Vangelis Banos
475ddd329c add skip cli options to brozzle-page
Add --skip-extract-outlinks --skip-visit-hashtags options to
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4 --skip-extract-outlinks, --skip-visit-hashtags
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
Noah Levitt
261e7977ad Merge pull request #47 from galgeek/ARI-5389
custom behavior for pitchfork.com, based on facebook & pm-gc-ca behaviors
2017-07-03 16:40:27 -07:00
Barbara Miller
24a68cb55d pitchfork behavior, based on pm-ca and facebook behaviors 2017-06-30 13:54:54 -07:00
Noah Levitt
051e299a80 fix "local variable 'start' referenced before assignment" 2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30 enforce time limits based on time claimed by worker actively brozzling, to avoid problem of stopping crawls that haven't had much chance to crawl, because of cluster busy-ness 2017-06-26 18:00:32 -07:00
Noah Levitt
3385d727ac minimally update test_time_limit for new time accounting 2017-06-26 17:57:50 -07:00
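
Commits 89877670a4, 475ddd329c and 8019eb4b5f above describe two skip options that are accepted on the command line but hidden from the help text with argparse.SUPPRESS. A minimal sketch of that pattern follows. Only the two option names and the use of argparse.SUPPRESS come from the log; the function name build_parser, the program name, the positional url argument and the store_true action are assumptions made for illustration, not brozzler's actual code.

```python
import argparse


def build_parser():
    # Hypothetical parser; not brozzler's real command-line definition.
    parser = argparse.ArgumentParser(prog="brozzle-page-sketch")
    parser.add_argument("url", help="page to fetch")
    # help=argparse.SUPPRESS keeps an option working while omitting it
    # from the usage line and from the --help listing.
    parser.add_argument(
        "--skip-extract-outlinks", dest="skip_extract_outlinks",
        action="store_true", help=argparse.SUPPRESS)
    parser.add_argument(
        "--skip-visit-hashtags", dest="skip_visit_hashtags",
        action="store_true", help=argparse.SUPPRESS)
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
parser = build_parser()
args = parser.parse_args(["http://example.com/", "--skip-extract-outlinks"])
print(args.skip_extract_outlinks, args.skip_visit_hashtags)
```

Hidden options like these stay usable by callers that know about them, which fits the use case named in 89877670a4: fetching one specific page without extracting outlinks to continue crawling.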