Barbara Miller
eb1f79271f
blog.sin.com.cn pagination
2017-09-05 14:20:36 -07:00
Barbara Miller
71d54faae0
Merge pull request #65 from vbanos/behavior_timeout
...
Make behavior_timeout configurable
2017-08-31 14:39:39 -07:00
Vangelis Banos
bb93b04c23
Make behavior_timeout configurable
...
``behavior_timeout`` is hardcoded to 900s. With this MR we make it
configurable with a default value of 900. We add a new variable to
``BrozzlerWorker`` and ``Browser``.
2017-08-31 08:06:26 +00:00
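
The pattern this commit describes can be sketched as follows. This is only an illustration under stated assumptions: `SketchWorker` and `SketchBrowser` are hypothetical stand-ins, not brozzler's actual classes or signatures. The only details taken from the commit message are the name `behavior_timeout` and its default of 900.

```python
class SketchBrowser:
    """Hypothetical stand-in for a browser wrapper (not brozzler's Browser)."""

    def __init__(self, behavior_timeout=900):
        # Previously a hardcoded 900; now an argument with the same default.
        self.behavior_timeout = behavior_timeout


class SketchWorker:
    """Hypothetical stand-in for a worker that owns browsers."""

    def __init__(self, behavior_timeout=900):
        self.behavior_timeout = behavior_timeout

    def make_browser(self):
        # The worker forwards its configured value to each browser it creates.
        return SketchBrowser(behavior_timeout=self.behavior_timeout)


default_browser = SketchWorker().make_browser()
short_browser = SketchWorker(behavior_timeout=60).make_browser()
```

Keeping 900 as the default means existing callers see no change in behavior, while new callers can shorten or lengthen the limit.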
Barbara Miller
18a52f0b15
Merge pull request #64 from galgeek/typo
...
fix typo
2017-08-26 16:58:58 -07:00
Barbara Miller
e786013b1b
fix typo
2017-08-26 16:58:00 -07:00
Barbara Miller
00b57ed87a
Merge pull request #61 from internetarchive/x11-support
...
screenshots don't work with Xvfb
2017-08-26 16:45:50 -07:00
Barbara Miller
f810603cdf
Merge pull request #63 from vbanos/configurable-page-timeout
...
Thank you, @vbanos!
2017-08-23 13:31:29 -07:00
Vangelis Banos
00513af877
Configurable page timeout
...
The page loading timeout was hard-coded to 300s. With this change,
we make it configurable with a default value of 300.
2017-08-23 08:05:14 +00:00
Neil Minton
4733b0ac7d
Update SoundCloud.com behavior selectors.
2017-08-18 14:16:51 -07:00
Neil Minton
a8a624fbbf
Add Archive.org playlists to default behavior.
2017-08-18 14:16:51 -07:00
Neil Minton
b0fd1df1ef
Generalize default behavior.
2017-08-18 14:16:51 -07:00
Neil Minton
12e02ae401
Merge pull request #62 from internetarchive/ARI-5460
...
update Instagram selectors
2017-08-17 16:08:44 -07:00
Barbara Miller
c181f4bcc3
screenshots don't work w/Xvfb
2017-08-16 15:20:43 -07:00
Barbara Miller
6391e7b40f
Merge pull request #60 from galgeek/ARI-5453
...
simpleclicks for wixsite.com
2017-08-14 17:14:09 -07:00
Barbara Miller
901995c6cf
Merge pull request #58 from internetarchive/ARI-5379
...
ARI 5379 URL regex update
2017-08-14 16:54:17 -07:00
Barbara Miller
36b7e4f3d6
Merge pull request #59 from galgeek/ARI-5465
...
skip a.uiMorePagerPrimary after all
2017-08-14 16:50:39 -07:00
Barbara Miller
b5121c26a8
simpleclicks for wixsite.com
2017-08-14 16:47:49 -07:00
Barbara Miller
12cca540bf
Merge branch 'ARI-5379' of github.com:internetarchive/brozzler into ARI-5379
2017-08-11 17:17:07 -07:00
Barbara Miller
4d99ebc1b4
only div.teaser, for *pm.gc.ca*
2017-08-11 17:16:02 -07:00
Barbara Miller
6952bf76fb
improve/expand url_regex
2017-08-11 17:16:02 -07:00
Barbara Miller
05c0cd7914
skip a.uiMorePagerPrimary after all
2017-08-11 10:41:15 -07:00
Noah Levitt
f785adb944
Merge pull request #57 from internetarchive/ARI-5409
...
simpleclicks for tuebingen.de
2017-08-10 10:59:38 -07:00
Barbara Miller
61bbb9774f
sign up button and updated selectors
2017-08-09 16:23:35 -07:00
Noah Levitt
55856d2180
Merge pull request #54 from vbanos/worker_on_request
...
Pass missing on_request callback to BrozzlerWorker methods
2017-08-03 14:17:02 -07:00
Barbara Miller
101257effe
simpleclicks for tuebingen.de
2017-08-03 13:21:46 -07:00
Noah Levitt
bf250194b4
bump dev version number after some PR merges
2017-08-01 12:04:56 -07:00
Noah Levitt
50881ab38b
Merge pull request #55 from galgeek/ARI-5242
...
ARI-5242
2017-08-01 12:04:18 -07:00
Noah Levitt
f7fe06874d
Merge pull request #53 from vbanos/bugfix-needs-browsing
...
bugfix for BrozzlerWorker._needs_browsing
2017-08-01 10:14:56 -07:00
Vangelis Banos
78b9d61654
Pass missing on_request callback to BrozzlerWorker methods
...
``Browser.browser_page`` has the ``on_request`` parameter but it is only
used by ``BrozzlerWorker._browse_page`` where we invoke it.
I add this to ``BrozzlerWorker.brozzle_page`` and pass it also to
``BrozzlerWorker._browse_page``. Now, it is possible to use this
callback from other applications when calling
``BrozzlerWorker.brozzle_page``.
2017-08-01 13:58:42 +00:00
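
A rough sketch of the callback pass-through this commit describes. The classes below are hypothetical and much simpler than brozzler's real ones. Only the idea comes from the commit message: the public method accepts `on_request` and forwards it down to the browser. The fake "requests" are invented for the illustration.

```python
class CallbackBrowser:
    """Hypothetical browser that reports each request it makes."""

    def browse_page(self, url, on_request=None):
        # Pretend the page load issued two requests.
        for request_url in (url, url + "style.css"):
            if on_request:
                on_request(request_url)
        return url


class CallbackWorker:
    """Hypothetical worker that forwards the callback to the browser."""

    def __init__(self, browser):
        self.browser = browser

    def brozzle_page(self, url, on_request=None):
        # Public entry point: accept the callback and pass it on.
        return self._browse_page(url, on_request=on_request)

    def _browse_page(self, url, on_request=None):
        return self.browser.browse_page(url, on_request=on_request)


seen = []
worker = CallbackWorker(CallbackBrowser())
worker.brozzle_page("http://example.com/", on_request=seen.append)
```

After the call, `seen` holds every URL the hypothetical browser reported, which is what an external application would use the callback for.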
Vangelis Banos
ae73edb244
bugfix for BrozzlerWorker._needs_browsing
...
I'm sorry but with my previous commit I introduced a bug in
``BrozzlerWorker._needs_browsing`` method.
More specifically, if the ``brozzler_spy`` param is False (this happens
when ``youtube_dl`` is disabled), the ``_needs_browsing`` method always
returns ``False``, which breaks the ``brozzle_page`` workflow: the
browser never visits the page.
I'm fixing this here.
2017-08-01 11:43:42 +00:00
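
The bug pattern described here can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function names, the `spy` dictionary, and the content-type check are hypothetical, since the commit message does not show the real code. The point is only the fall-through case when no spy is available.

```python
def needs_browsing_buggy(spy):
    """Hypothetical buggy version: with no spy, it never asks for browsing."""
    if spy:
        return spy["content_type"].startswith("text/html")
    return False  # bug: the page is silently skipped


def needs_browsing_fixed(spy):
    """Hypothetical fixed version: with no spy, fall back to browsing."""
    if spy:
        return spy["content_type"].startswith("text/html")
    return True  # nothing is known about the page, so let the browser visit it


html_spy = {"content_type": "text/html; charset=utf-8"}
video_spy = {"content_type": "video/mp4"}
```

With a spy present, both versions agree. They differ only when the spy is missing, which is the case the commit message says occurs when `youtube_dl` is disabled.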
Noah Levitt
895bfbf913
Merge pull request #51 from vbanos/youtube-dl-option
...
Make youtube-dl optional in BrozzlerWorker.brozzle_page
2017-07-31 11:32:51 -07:00
Noah Levitt
7d62eb6525
Merge pull request #52 from vbanos/remove-redundant-param
...
Remove redundant method parameter.
2017-07-31 11:23:28 -07:00
Vangelis Banos
0343969807
Remove redundant method parameter.
...
``ignore_cert_errors`` is passed to ``Chrome`` via ``Browser`` via
``BrowserPool`` here:
https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L120
It is not doing anything in ``Browser.browser_page``.
2017-07-31 12:36:17 +00:00
Vangelis Banos
6259d03be1
bugfix
2017-07-31 10:36:35 +00:00
Vangelis Banos
9c81a7bbda
Make youtube-dl optional in BrozzlerWorker.brozzle_page
...
Enabled by default (of course).
Speed is significantly improved when disabled.
2017-07-31 08:57:47 +00:00
Barbara Miller
a563e9eb0c
Merge pull request #50 from internetarchive/ARI-5407
...
Add click selector for facebook’s new See More link
2017-07-21 14:43:38 -07:00
Barbara Miller
7620ee6fae
only div.teaser, for *pm.gc.ca*
2017-07-20 15:52:53 -07:00
Barbara Miller
795b9ab809
improve/expand url_regex
2017-07-19 15:50:49 -07:00
Barbara Miller
5c4961fbce
add selector for See More link
2017-07-19 14:44:27 -07:00
Barbara Miller
59571dadfa
div.compactTrackListItem for soundcloud.com
2017-07-18 14:27:10 -07:00
Barbara Miller
3c9eb30212
div.soundItem selector for multi-item list
2017-07-18 14:08:20 -07:00
Noah Levitt
0955d56926
Merge pull request #46 from internetarchive/ARI-5379
...
ARI-5379: custom behavior for pm.gc.ca
2017-07-13 11:42:34 -07:00
Barbara Miller
0a2895364d
resolve conflict
2017-07-13 11:32:25 -07:00
Barbara Miller
762b65ee3e
selectors for multi-item playlist
2017-07-12 11:19:53 -07:00
Noah Levitt
c77f4e4249
dev version bump
2017-07-06 17:19:53 -07:00
Noah Levitt
6cbe097c87
Merge pull request #48 from vbanos/WWM-802
...
new skip cli options for brozzle-page and brozzler-worker
2017-07-06 17:19:28 -07:00
Vangelis Banos
8019eb4b5f
Hide the options using argparse.SUPPRESS
2017-07-06 06:25:04 +00:00
Vangelis Banos
475ddd329c
add skip cli options to brozzle-page
...
Add --skip-extract-outlinks --skip-visit-hashtags options to
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4
--skip-extract-outlinks, --skip-visit-hashtags
...
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
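
A minimal sketch of how the two skip options, hidden from help output with `argparse.SUPPRESS`, might be declared. The flag names and the use of `argparse.SUPPRESS` come from the commit messages. The parser name, the `url` argument, and the `store_true` action are assumptions, not brozzler's actual CLI code.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the two flag names are taken from the log.
    parser = argparse.ArgumentParser(prog="brozzle-page-sketch")
    parser.add_argument("url")
    parser.add_argument(
        "--skip-extract-outlinks",
        dest="skip_extract_outlinks",
        action="store_true",
        help=argparse.SUPPRESS,  # accepted, but omitted from --help
    )
    parser.add_argument(
        "--skip-visit-hashtags",
        dest="skip_visit_hashtags",
        action="store_true",
        help=argparse.SUPPRESS,
    )
    return parser


args = build_parser().parse_args(
    ["http://example.com/", "--skip-extract-outlinks"]
)
help_text = build_parser().format_help()
```

Both flags default to off, so the outlink extraction and hashtag visits still happen unless a caller opts out explicitly.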
Noah Levitt
261e7977ad
Merge pull request #47 from galgeek/ARI-5389
...
custom behavior for pitchfork.com, based on facebook & pm-gc-ca behaviors
2017-07-03 16:40:27 -07:00