Noah Levitt | fee008266f | support for one-hop-off (or n-hop-off) scoping | 2016-04-21 17:41:59 +00:00
Noah Levitt | 4bbbbcf138 | fix bug where the first time a site was claimed, another brozzler-worker would claim it anyway (and find no pages to brozzle) | 2016-04-21 00:21:08 +00:00
Noah Levitt | 5bb23b354c | fix stupid bug where all new sites would have same start_time | 2016-04-07 23:34:30 +00:00
Noah Levitt | 733124c7dc | fix bug preventing brozzler from simultaneously working on more than one site from the same job | 2016-04-04 23:28:24 +00:00
Noah Levitt | 36e2bb2729 | use rethinkdb native time type for date/time values | 2015-11-18 02:07:27 +00:00
Noah Levitt | 40522ef5a5 | fix some rethinkdb related stuff; most notably r.desc() and related stuff don't currently work correctly if r is a Rethinker, so use rethinkdb directly in that case | 2015-09-23 01:53:05 +00:00
Noah Levitt | dc9d1a4959 | detecting job finish seems to be working now | 2015-09-10 01:38:31 +00:00
Noah Levitt | 92a288bc35 | detect jobs finishing! (not well tested yet) | 2015-09-09 22:11:48 +00:00
Noah Levitt | 5fe2805285 | fix bug claiming site, looks like there could be a race condition with other worker claiming the same site | 2015-09-04 01:36:29 +00:00
Noah Levitt | 3c23aa8fd4 | finally, the jobs table | 2015-09-03 01:05:03 +00:00
Noah Levitt | ad543e6134 | enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect | 2015-08-19 20:16:25 +00:00
Noah Levitt | 2533229fa1 | add __all__ to modules | 2015-08-19 19:01:28 +00:00
Noah Levitt | b7df0a1f37 | make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py | 2015-08-19 18:45:19 +00:00
Noah Levitt | a878730e02 | goodbye sqlite and rabbitmq, hello rethinkdb | 2015-08-18 21:44:54 +00:00
Noah Levitt | e96b16e19a | support for max_hops scope rule | 2015-08-07 22:36:39 +00:00
Noah Levitt | 2a7a0b7c30 | little fix, tweak | 2015-08-05 00:17:43 +00:00
Noah Levitt | b6beac3807 | new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file | 2015-08-04 19:52:01 +00:00
Noah Levitt | 7b98af7d9f | handle reached limit response from warcprox | 2015-08-01 00:09:57 +00:00
Noah Levitt | 11fbbc9d49 | change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc | 2015-07-31 00:03:13 +00:00
Noah Levitt | 5c701abb36 | reject urls with scheme other than http/https (for now) | 2015-07-28 01:11:26 +00:00
Noah Levitt | a04bf04307 | keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over | 2015-07-23 02:19:25 +00:00
Noah Levitt | 4dacc0b087 | new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue | 2015-07-23 01:21:23 +00:00
Noah Levitt | 6a09f2095c | handle exceptions in robots.txt fetching/parsing | 2015-07-22 00:54:49 +00:00
Noah Levitt | b5cb94fc8b | some additional logging and error handling to avoid mysterious messages | 2015-07-21 06:33:02 +00:00
Noah Levitt | 2f28f00a09 | make putmeta requests respect site configured extra_headers | 2015-07-17 16:52:06 -07:00
Noah Levitt | 2ba5bd4d4b | support adding extra http request headers | 2015-07-17 13:45:27 -07:00
Noah Levitt | a54e60dbaf | change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix | 2015-07-16 18:39:29 -07:00
Noah Levitt | d2650a2547 | update scope if seed redirects | 2015-07-16 18:27:47 -07:00
Noah Levitt | 140a441eb5 | honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs | 2015-07-16 17:19:12 -07:00
Noah Levitt | e04247c3f7 | add support for supplying json blob defining site with configuration to brozzler-add-site | 2015-07-16 14:48:01 -07:00
Noah Levitt | f2bc7ec271 | refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling | 2015-07-15 18:03:03 -07:00