Commit graph

31 commits

Author SHA1 Message Date
Noah Levitt
fee008266f support for one-hop-off (or n-hop-off) scoping 2016-04-21 17:41:59 +00:00
Noah Levitt
4bbbbcf138 fix bug where the first time a site was claimed, another brozzler-worker would claim it anyway (and find no pages to brozzle) 2016-04-21 00:21:08 +00:00
Noah Levitt
5bb23b354c fix stupid bug where all new sites would have same start_time 2016-04-07 23:34:30 +00:00
Noah Levitt
733124c7dc fix bug preventing brozzler from simultaneously working on more than one site from the same job 2016-04-04 23:28:24 +00:00
Noah Levitt
36e2bb2729 use rethinkdb native time type for date/time values 2015-11-18 02:07:27 +00:00
Noah Levitt
40522ef5a5 fix some rethinkdb related stuff; most notably r.desc() and related stuff don't currently work correctly if r is a Rethinker, so use rethinkdb directly in that case 2015-09-23 01:53:05 +00:00
Noah Levitt
dc9d1a4959 detecting job finish seems to be working now 2015-09-10 01:38:31 +00:00
Noah Levitt
92a288bc35 detect jobs finishing! (not well tested yet) 2015-09-09 22:11:48 +00:00
Noah Levitt
5fe2805285 fix bug claiming site, looks like there could be a race condition with other worker claiming the same site 2015-09-04 01:36:29 +00:00
Noah Levitt
3c23aa8fd4 finally, the jobs table 2015-09-03 01:05:03 +00:00
Noah Levitt
ad543e6134 enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect 2015-08-19 20:16:25 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py 2015-08-19 18:45:19 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
e96b16e19a support for max_hops scope rule 2015-08-07 22:36:39 +00:00
Noah Levitt
2a7a0b7c30 little fix, tweak 2015-08-05 00:17:43 +00:00
Noah Levitt
b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file 2015-08-04 19:52:01 +00:00
Noah Levitt
7b98af7d9f handle reached limit response from warcprox 2015-08-01 00:09:57 +00:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
5c701abb36 reject urls with scheme other than http/https (for now) 2015-07-28 01:11:26 +00:00
Noah Levitt
a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over 2015-07-23 02:19:25 +00:00
Noah Levitt
4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue 2015-07-23 01:21:23 +00:00
Noah Levitt
6a09f2095c handle exceptions in robots.txt fetching/parsing 2015-07-22 00:54:49 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
2f28f00a09 make putmeta requests respect site configured extra_headers 2015-07-17 16:52:06 -07:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix 2015-07-16 18:39:29 -07:00
Noah Levitt
d2650a2547 update scope if seed redirects 2015-07-16 18:27:47 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site 2015-07-16 14:48:01 -07:00
Noah Levitt
f2bc7ec271 refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling 2015-07-15 18:03:03 -07:00