Barbara Miller
7620ee6fae
only div.teaser, for *pm.gc.ca*
2017-07-20 15:52:53 -07:00
Barbara Miller
795b9ab809
improve/expand url_regex
2017-07-19 15:50:49 -07:00
Barbara Miller
0a2895364d
resolve conflict
2017-07-13 11:32:25 -07:00
Noah Levitt
c77f4e4249
dev version bump
2017-07-06 17:19:53 -07:00
Noah Levitt
6cbe097c87
Merge pull request #48 from vbanos/WWM-802
new skip cli options for brozzle-page and brozzler-worker
2017-07-06 17:19:28 -07:00
Vangelis Banos
8019eb4b5f
Hide the options using argparse.SUPPRESS
2017-07-06 06:25:04 +00:00
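For commit 8019eb4b5f, a minimal sketch of hiding options with argparse.SUPPRESS. Only the two flag names come from the commit messages in this log; the build_parser helper, the positional url argument and the store_true actions are assumptions for illustration, not brozzler's actual parser.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the two --skip-* flag names are from the log.
    parser = argparse.ArgumentParser(prog="brozzle-page")
    parser.add_argument("url")
    # help=argparse.SUPPRESS keeps the option working but omits it
    # from the usage line and the --help listing.
    parser.add_argument(
        "--skip-extract-outlinks", dest="skip_extract_outlinks",
        action="store_true", help=argparse.SUPPRESS)
    parser.add_argument(
        "--skip-visit-hashtags", dest="skip_visit_hashtags",
        action="store_true", help=argparse.SUPPRESS)
    return parser


args = build_parser().parse_args(
    ["http://example.com/", "--skip-extract-outlinks"])
help_text = build_parser().format_help()
```

The hidden flags still parse normally; they are simply not advertised to users reading the help output.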
Vangelis Banos
475ddd329c
add skip cli options to brozzle-page
Add --skip-extract-outlinks --skip-visit-hashtags options to
`brozzle-page` command.
2017-07-05 07:31:14 +00:00
Vangelis Banos
89877670a4
--skip-extract-outlinks, --skip-visit-hashtags
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
Noah Levitt
261e7977ad
Merge pull request #47 from galgeek/ARI-5389
custom behavior for pitchfork.com, based on facebook & pm-gc-ca behaviors
2017-07-03 16:40:27 -07:00
Barbara Miller
24a68cb55d
pitchfork behavior, based on pm-ca and facebook behaviors
2017-06-30 13:54:54 -07:00
Noah Levitt
051e299a80
fix "local variable 'start' referenced before assignment"
2017-06-27 11:08:51 -07:00
Noah Levitt
b9640b8a30
enforce time limits based on the time a site has been claimed by a worker that is actively brozzling it, to avoid stopping crawls that have had little chance to crawl because the cluster was busy
2017-06-26 18:00:32 -07:00
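For commit b9640b8a30, a rough sketch of the idea of charging a time limit only for claimed (active) time. The ActiveTimeBudget class and its method names are hypothetical and are not taken from brozzler; timestamps are passed in explicitly to keep the example deterministic.

```python
class ActiveTimeBudget:
    """Hypothetical accounting: only time between claim and release counts."""

    def __init__(self, limit_seconds):
        self.limit_seconds = limit_seconds
        self.active_seconds = 0.0
        self._claimed_at = None

    def claim(self, now):
        # A worker starts actively working on the site.
        self._claimed_at = now

    def release(self, now):
        # Charge only the interval during which the site was claimed.
        self.active_seconds += now - self._claimed_at
        self._claimed_at = None

    def exceeded(self):
        return self.active_seconds >= self.limit_seconds


budget = ActiveTimeBudget(limit_seconds=100)
budget.claim(now=0)
budget.release(now=60)       # 60 seconds of active time
budget.claim(now=1000)       # the long idle gap in between is not charged
budget.release(now=1050)     # 50 more seconds of active time
```

Under wall-clock accounting the crawl would have hit a 100 second limit long before the second claim; under active-time accounting it is exceeded only after the second interval.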
Noah Levitt
3385d727ac
minimally update test_time_limit for new time accounting
2017-06-26 17:57:50 -07:00
Noah Levitt
8ef7972ace
make sure youtube-dl progress thing can't derail youtube-dl operation
2017-06-26 16:10:40 -07:00
Noah Levitt
caee2787b0
have brozzler-list-sites --active use the index
2017-06-24 01:05:19 +00:00
Noah Levitt
35babeb01b
make youtube-dl prefer unsegmented videos
2017-06-23 15:19:30 -07:00
Noah Levitt
e6b5770f6c
try workaround, maybe this is an issue with https://blog.travis-ci.com/2017-06-21-trusty-updates-2017-Q2-launch
2017-06-23 14:07:07 -07:00
Noah Levitt
29b19b1e9d
shed some light on the travis-ci error
2017-06-23 13:56:25 -07:00
Noah Levitt
405c5725e4
restore reclamation of orphaned, claimed sites, and heartbeat site.last_claimed every 7 minutes during youtube-dl processing, to prevent another brozzler-worker claiming the site
2017-06-23 13:50:49 -07:00
Noah Levitt
6bae53e646
disable the re-claiming of sites that are marked claimed from more than an hour ago, because sometimes pages legitimately take longer than an hour to brozzle; working on a better solution to this issue
2017-06-19 11:21:02 -07:00
Barbara Miller
82b77b6903
WIP
2017-06-16 10:10:42 -07:00
Noah Levitt
7ae22381ef
bump version number for pull request
2017-06-12 15:42:49 -07:00
Noah Levitt
33508af58f
Merge pull request #44 from galgeek/ARI-5384
simpleclicks for recent issuu.com URLs
2017-06-12 15:42:04 -07:00
Barbara Miller
626220ce86
simpleclicks for recent issuu.com URLs
2017-06-09 14:08:34 -07:00
Noah Levitt
193ac43797
back to dev version number
2017-06-08 17:33:29 -07:00
Noah Levitt
44f74066cf
1.1b11
2017-06-08 17:30:24 -07:00
Noah Levitt
27040fd8b7
mini fix
2017-06-08 17:29:51 -07:00
Noah Levitt
02e1c88fac
oops bump version
2017-06-07 13:08:23 -07:00
Noah Levitt
4d7f4518b5
use %r instead of calling repr()
2017-06-07 13:07:42 -07:00
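For commit 4d7f4518b5, a small illustration of the two spellings. The Site class here is a stand-in invented for the example, not brozzler's model class.

```python
class Site:
    # Stand-in class for illustration only.
    def __init__(self, seed):
        self.seed = seed

    def __repr__(self):
        return "Site(seed=%r)" % (self.seed,)


site = Site("http://example.com/")

# Explicit repr() call, formatted with %s.
explicit = "claimed %s" % repr(site)

# %r asks the formatter to call repr() itself.
with_r = "claimed %r" % (site,)
```

Both produce the same string. With the logging module, passing the object as an argument to a "%r" format also defers the repr() call until the record is actually emitted.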
Noah Levitt
65adc11d95
oops, should have bumped version number after merging pull requests
2017-06-07 08:51:21 -07:00
Noah Levitt
39fb811d13
Merge pull request #41 from galgeek/ARI-4868
ARI-4868 behavior for Huffington Post slideshow
2017-06-02 14:41:02 -07:00
Noah Levitt
5e38a9755e
Merge pull request #42 from galgeek/loginAndReloadSeed
login and reload original url if navigated away
2017-06-02 14:03:51 -07:00
Barbara Miller
a0330d9716
updates per Noah's review
2017-06-02 13:27:01 -07:00
Barbara Miller
830b0eef89
undo post-login nav (ARI-5385 and/or ARI-5386)
2017-06-02 12:47:19 -07:00
Noah Levitt
f2227e6759
have travis-ci test against python 3.5 and 3.6 too
2017-05-26 13:28:00 -07:00
Noah Levitt
bdc0badec3
rewrite frontier.scope_and_schedule_outlinks() to use batch rethinkdb queries, because we have witnessed the method running for hours(!)
2017-05-26 13:24:14 -07:00
Noah Levitt
d904daea9c
remove stray logging
2017-05-24 11:36:06 -07:00
Noah Levitt
ac543ee5b6
use "ttl" for updated doublethink svc reg api
2017-05-23 11:33:04 -07:00
Barbara Miller
079db762d4
add relocated behavior file with updated copyright
2017-05-22 12:38:50 -07:00
Barbara Miller
d7c31be8d0
enable huffpostslides.js
2017-05-22 12:32:28 -07:00
Noah Levitt
89e7c8b079
fix exception from ReachedLimit.__repr__ when it has been instantiated implicitly and __init__ was not called
2017-05-16 15:47:18 -07:00
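For commit 89e7c8b079, a sketch of the general failure mode and one way to guard against it. The LimitReached class and its detail attribute are hypothetical; this is not the actual ReachedLimit code, and the getattr guard is only one possible fix.

```python
class LimitReached(Exception):
    # Hypothetical exception; 'detail' is an invented attribute.
    def __init__(self, detail=None):
        super().__init__(detail)
        self.detail = detail

    def __repr__(self):
        # getattr with a default tolerates instances whose __init__
        # never ran, so repr() cannot raise AttributeError.
        return "LimitReached(detail=%r)" % (getattr(self, "detail", None),)


# Create an instance without running __init__, as can happen when an
# exception object is instantiated implicitly.
bare = LimitReached.__new__(LimitReached)
bare_repr = repr(bare)
normal_repr = repr(LimitReached("too many urls"))
```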
Noah Levitt
31dc6a2d97
improve thread_raise() so that the new tests pass
1. If thread is not currently accepting exceptions, queue it and raise if and
when it does start accepting them. This fixes problem of thread_raise
exceptions being ignored when raised just before the target thread starts
accepting exceptions.
2. Avoid problems caused by raising multiple exceptions in the same
thread in quick succession by ensuring that only one is actually raised for
a given `with` block. This type of occurrence had been putting brozzler into
a borked/frozen state.
2017-05-16 14:20:53 -07:00
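For commit 31dc6a2d97, a toy model of the two rules in the message: queue an exception requested before the thread accepts exceptions, and deliver at most one exception per `with` block. The ExceptionGate class is hypothetical and is not brozzler's implementation; actually injecting an exception into another thread is left out, so request() only reports what should be delivered.

```python
import threading


class ExceptionGate:
    """Toy model of the queue-then-raise and one-per-block rules."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepting = False
        self._pending = None
        self._delivered = False

    def __enter__(self):
        with self._lock:
            pending, self._pending = self._pending, None
            if pending is None:
                self._accepting = True
                self._delivered = False
        if pending is not None:
            # Rule 1: an exception queued earlier is raised on entry.
            raise pending
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._accepting = False
        return False

    def request(self, exc):
        """Return exc if it should be delivered now, else None."""
        with self._lock:
            if not self._accepting:
                if self._pending is None:
                    self._pending = exc   # queue for the next block
                return None
            if self._delivered:
                return None               # Rule 2: one per `with` block
            self._delivered = True
            return exc


gate = ExceptionGate()
queued = gate.request(KeyError("early"))   # thread not accepting yet
try:
    with gate:
        raised_on_entry = False
except KeyError:
    raised_on_entry = True

with gate:
    first = gate.request(ValueError("a"))
    second = gate.request(ValueError("b"))
```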
Noah Levitt
d514eaec15
even more, better failing tests for thread_raise
2017-05-16 14:00:10 -07:00
Noah Levitt
d2525e2e87
failing test for forthcoming behavior of thread_raise
2017-05-15 16:20:20 -07:00
Noah Levitt
60c5a7c1c4
recognize ConnectionError (of which ConnectionResetError is a subclass) in _warcprox_write_record as a proxy error
2017-05-12 10:03:53 -07:00
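For commit 60c5a7c1c4, a short check of the built-in exception hierarchy the message relies on. The is_proxy_error helper is hypothetical and only illustrates catching the base class; it is not the code in _warcprox_write_record.

```python
def is_proxy_error(exc):
    # Hypothetical helper: matching the base class ConnectionError also
    # covers ConnectionResetError, ConnectionRefusedError and
    # ConnectionAbortedError, which are built-in subclasses.
    return isinstance(exc, ConnectionError)


reset_is_subclass = issubclass(ConnectionResetError, ConnectionError)
reset_matches = is_proxy_error(ConnectionResetError("peer reset"))
other_matches = is_proxy_error(ValueError("unrelated"))
```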
Barbara Miller
054625b8a5
Merge pull request #40 from BitBaron/ari-4960
Crawl Google Calendar for fortstjames.ca
2017-05-09 14:12:48 -07:00
Noah Levitt
b4bf17df9b
do a better job of making sure to shut down the browser when brozzle-page is killed
2017-05-03 16:43:31 -07:00
Noah Levitt
9d4cbbf6eb
handle another rethinkdb outage corner case
2017-05-01 14:12:43 -07:00
Noah Levitt
389db01458
BrozzlerWorkerThread separate from MainThread, to avoid SIGTERM/SIGINT raising an exception inside rethinkdb code or other sensitive code that BrozzlerWorker.run() calls
2017-05-01 13:46:19 -07:00
Noah Levitt
52433ade78
re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
2017-05-01 13:00:04 -07:00