923 Commits

Author SHA1 Message Date
Vangelis Banos
9c81a7bbda Make youtube-dl optional in BrozzlerWorker.brozzle_page
Enabled by default (of course).
Speed is significantly improved when disabled.
2017-07-31 08:57:47 +00:00
Barbara Miller
a563e9eb0c Merge pull request #50 from internetarchive/ARI-5407
Add click selector for facebook’s new See More link
2017-07-21 14:43:38 -07:00
Barbara Miller
7620ee6fae only div.teaser, for *pm.gc.ca* 2017-07-20 15:52:53 -07:00
Barbara Miller
795b9ab809 improve/expand url_regex 2017-07-19 15:50:49 -07:00
Barbara Miller
5c4961fbce add selector for See More link 2017-07-19 14:44:27 -07:00
Barbara Miller
59571dadfa div.compactTrackListItem for soundcloud.com 2017-07-18 14:27:10 -07:00
Barbara Miller
3c9eb30212 div.soundItem selector for multi-item list 2017-07-18 14:08:20 -07:00
Noah Levitt
0955d56926 Merge pull request #46 from internetarchive/ARI-5379
ARI-5379: custom behavior for pm.gc.ca
2017-07-13 11:42:34 -07:00
Barbara Miller
0a2895364d resolve conflict 2017-07-13 11:32:25 -07:00
Barbara Miller
762b65ee3e selectors for multi-item playlist 2017-07-12 11:19:53 -07:00
Noah Levitt
c77f4e4249 dev version bump 2017-07-06 17:19:53 -07:00
Noah Levitt
6cbe097c87 Merge pull request #48 from vbanos/WWM-802
new skip cli options for brozzle-page and brozzler-worker
2017-07-06 17:19:28 -07:00
Vangelis Banos
8019eb4b5f Hide the options using argparse.SUPPRESS 2017-07-06 06:25:04 +00:00
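For readers unfamiliar with `argparse.SUPPRESS`, the following is a minimal sketch of how options can be accepted on the command line while staying out of the `--help` output. The two flag names come from the commit messages in this log; the parser layout, the `dest` names, and the `build_parser` helper are assumptions for illustration, not brozzler's actual code.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the two --skip-* flag names are taken
    # from the commit log, everything else is illustrative.
    parser = argparse.ArgumentParser(prog="brozzle-page")
    parser.add_argument("url", help="page url")
    # help=argparse.SUPPRESS keeps the option functional but omits it
    # from both the usage line and the option list in --help output.
    parser.add_argument(
        "--skip-extract-outlinks", dest="skip_extract_outlinks",
        action="store_true", help=argparse.SUPPRESS)
    parser.add_argument(
        "--skip-visit-hashtags", dest="skip_visit_hashtags",
        action="store_true", help=argparse.SUPPRESS)
    return parser


parser = build_parser()
args = parser.parse_args(["--skip-extract-outlinks", "http://example.com/"])
help_text = parser.format_help()
```

Here `args.skip_extract_outlinks` is set even though `help_text` never mentions the flag, which suits options meant for specialised callers rather than everyday users.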
Vangelis Banos
475ddd329c add skip cli options to brozzle-page
Add --skip-extract-outlinks --skip-visit-hashtags options to
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4 --skip-extract-outlinks, --skip-visit-hashtags
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
Noah Levitt
261e7977ad Merge pull request #47 from galgeek/ARI-5389
custom behavior for pitchfork.com, based on facebook & pm-gc-ca behaviors
2017-07-03 16:40:27 -07:00
Barbara Miller
24a68cb55d pitchfork behavior, based on pm-ca and facebook behaviors 2017-06-30 13:54:54 -07:00
Noah Levitt
051e299a80 fix "local variable 'start' referenced before assignment" 2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30 enforce time limits based on time claimed by worker actively brozzling, to avoid problem of stopping crawls that haven't had much chance to crawl, because of cluster busy-ness 2017-06-26 18:00:32 -07:00
Noah Levitt
3385d727ac minimally update test_time_limit for new time accounting 2017-06-26 17:57:50 -07:00
Noah Levitt
8ef7972ace make sure youtube-dl progress thing can't derail youtube-dl operation 2017-06-26 16:10:40 -07:00
Noah Levitt
caee2787b0 have brozzler-list-sites --active use the index 2017-06-24 01:05:19 +00:00
Noah Levitt
35babeb01b make youtube-dl prefer unsegmented videos 2017-06-23 15:19:30 -07:00
Noah Levitt
e6b5770f6c try workaround, maybe this is an issue with https://blog.travis-ci.com/2017-06-21-trusty-updates-2017-Q2-launch 2017-06-23 14:07:07 -07:00
Noah Levitt
29b19b1e9d shed some light on the travis-ci error 2017-06-23 13:56:25 -07:00
Noah Levitt
405c5725e4 restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site 2017-06-23 13:50:49 -07:00
Noah Levitt
6bae53e646 disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue 2017-06-19 11:21:02 -07:00
Barbara Miller
82b77b6903 WIP 2017-06-16 10:10:42 -07:00
Noah Levitt
7ae22381ef bump version number for pull request 2017-06-12 15:42:49 -07:00
Noah Levitt
33508af58f Merge pull request #44 from galgeek/ARI-5384
simpleclicks for recent issuu.com URLs
2017-06-12 15:42:04 -07:00
Barbara Miller
626220ce86 simpleclicks for recent issuu.com URLs 2017-06-09 14:08:34 -07:00
Noah Levitt
193ac43797 back to dev version number 2017-06-08 17:33:29 -07:00
Noah Levitt
44f74066cf 1.1b11 2017-06-08 17:30:24 -07:00
Noah Levitt
27040fd8b7 mini fix 2017-06-08 17:29:51 -07:00
Noah Levitt
02e1c88fac oops bump version 2017-06-07 13:08:23 -07:00
Noah Levitt
4d7f4518b5 use %r instead of calling repr() 2017-06-07 13:07:42 -07:00
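As a small illustration of what this kind of change looks like, the `%r` conversion applies `repr()` to its argument during formatting, so the explicit call is redundant. The variable names and message text below are made up for the example and are not taken from the brozzler source.

```python
url = "http://example.com/?q=a b"

# Explicit call: format the result of repr() with %s.
explicit = "fetching %s" % repr(url)

# Equivalent and shorter: %r calls repr() on the argument itself.
implicit = "fetching %r" % url
```

Both strings are identical. If the value being formatted may itself be a tuple, it should be wrapped in a one-element tuple, as in `"%r" % (value,)`.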
Noah Levitt
65adc11d95 oops, should have bumped version number after merging pull requests 2017-06-07 08:51:21 -07:00
Noah Levitt
39fb811d13 Merge pull request #41 from galgeek/ARI-4868
ARI-4868 behavior for Huffington Post slideshow
2017-06-02 14:41:02 -07:00
Noah Levitt
5e38a9755e Merge pull request #42 from galgeek/loginAndReloadSeed
login and reload original url if navigated away
2017-06-02 14:03:51 -07:00
Barbara Miller
a0330d9716 updates per Noah's review 2017-06-02 13:27:01 -07:00
Barbara Miller
830b0eef89 undo post-login nav (ARI-5385 and/or ARI-5386) 2017-06-02 12:47:19 -07:00
Noah Levitt
f2227e6759 have travis-ci test against python 3.5 and 3.6 too 2017-05-26 13:28:00 -07:00
Noah Levitt
bdc0badec3 rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!) 2017-05-26 13:24:14 -07:00
Noah Levitt
d904daea9c remove stray logging 2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6 use "ttl" for updated doublethink svc reg api 2017-05-23 11:33:04 -07:00
Barbara Miller
079db762d4 add relocated behavior file with updated copyright 2017-05-22 12:38:50 -07:00
Barbara Miller
d7c31be8d0 enable huffpostslides.js 2017-05-22 12:32:28 -07:00
Noah Levitt
89e7c8b079 fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called 2017-05-16 15:47:18 -07:00
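The failure mode described here can be reproduced in a few lines: if an instance is created without `__init__` running, attributes that `__repr__` expects are missing. A minimal sketch of a defensive `__repr__` follows. The class name `LimitReachedSketch` and its `detail` attribute are hypothetical, and `__new__` is used here only as one way to obtain an instance whose `__init__` never ran; the commit's actual fix may differ.

```python
class LimitReachedSketch(Exception):
    """Hypothetical stand-in for an exception that carries extra state."""

    def __init__(self, detail=None):
        super().__init__(detail)
        self.detail = detail

    def __repr__(self):
        # getattr with a default tolerates instances whose __init__
        # never ran, where self.detail was never assigned.
        return "LimitReachedSketch(detail=%r)" % (
            getattr(self, "detail", None),)


# Simulate implicit instantiation that bypasses __init__.
uninitialised = LimitReachedSketch.__new__(LimitReachedSketch)
initialised = LimitReachedSketch("quota")
```

Calling `repr(uninitialised)` then returns a string instead of raising `AttributeError`.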
Noah Levitt
31dc6a2d97 improve thread_raise() so that the new tests pass
1. If thread is not currently accepting exceptions, queue it and raise if and
   when it does start accepting them. This fixes problem of thread_raise
   exceptions being ignored when raised just before the target thread starts
   accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
   thread in quick succession by ensuring that only one is actually raised for
   a given `with` block. This type of occurrence had been putting brozzler into
   a borked/frozen state.
2017-05-16 14:20:53 -07:00
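The two points in the commit message above can be sketched roughly as follows. This is an assumption-laden illustration, not brozzler's `thread_raise()`: the `ExceptionGate` class, its attribute names, and the `queue_or_raise` method are all hypothetical. The only real API relied on is CPython's `PyThreadState_SetAsyncExc`, reached through `ctypes`, which asks the interpreter to raise an exception class in another thread.

```python
import ctypes
import threading


class ExceptionGate:
    """Hypothetical per-thread gate: exceptions from other threads are
    only delivered while the owning thread is inside the `with` block."""

    def __init__(self, thread=None):
        self.thread = thread or threading.current_thread()
        self.lock = threading.Lock()
        self.accepting = False
        self.pending = None      # exception class queued before the block opened
        self.delivered = False   # True once one exception was sent to this block

    def __enter__(self):
        with self.lock:
            pending, self.pending = self.pending, None
            if pending is not None:
                # Point 1: something was queued while we were not yet
                # accepting, so raise it now instead of losing it.
                self.delivered = True
                raise pending
            self.accepting = True
            self.delivered = False
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        with self.lock:
            self.accepting = False
        return False  # never swallow the exception

    def queue_or_raise(self, exc_class):
        """Called from another thread; returns what was done."""
        with self.lock:
            if self.delivered:
                # Point 2: at most one exception per `with` block.
                return "dropped"
            if not self.accepting:
                self.pending = exc_class
                return "queued"
            self.delivered = True
            # CPython-specific: schedule exc_class to be raised in the
            # target thread the next time it executes bytecode.
            ctypes.pythonapi.PyThreadState_SetAsyncExc(
                ctypes.c_ulong(self.thread.ident),
                ctypes.py_object(exc_class))
            return "raised"
```

In this sketch the lock makes the accepting check and the asynchronous raise one atomic step, so an exception cannot be aimed at a block that has already finished exiting. Asynchronous delivery remains best effort: the exception surfaces only when the target thread next runs Python bytecode, not while it is blocked in a long C call.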
Noah Levitt
d514eaec15 even more, better failing tests for thread_raise 2017-05-16 14:00:10 -07:00