Commit Graph

  • 1ca17f204b brozzler web console initial fiddling Noah Levitt 2015-09-25 17:59:38 +00:00
  • c08128dfe2 Merge pull request #44 from nlevitt/AITFIVE-451-3 Hunter 2015-09-24 15:44:11 -07:00
  • a17b0f3b8d refactor umbraAboveBelowOrOnScreen into umbraBehavior object Noah Levitt 2015-09-24 12:34:55 -07:00
  • f2ead0570e fixes for psu24 behavior Noah Levitt 2015-09-24 12:20:19 -07:00
  • dff4149185 missed one more use of brozzler.version Noah Levitt 2015-09-24 00:44:35 +00:00
  • a94dfd27f8 oops, set brozzler.__version__ Noah Levitt 2015-09-24 00:34:51 +00:00
  • 8c69ca3b39 giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing Noah Levitt 2015-09-24 00:17:33 +00:00
  • 9699a40645 remove "dev" from version number and switch README to rst Noah Levitt 2015-09-23 22:35:26 +00:00
  • 245078284d pep440 compliant versioning Noah Levitt 2015-09-23 14:46:57 -07:00
  • 40522ef5a5 fix some rethinkdb related stuff; most notably r.desc() and related stuff don't currently work correctly if r is a Rethinker, so use rethinkdb directly in that case Noah Levitt 2015-09-23 01:53:05 +00:00
  • 8bf34d9db6 tweaks Noah Levitt 2015-09-23 00:50:38 +00:00
  • 2bc66f52d4 new rethinkstuff.Rethinker api Noah Levitt 2015-09-23 00:50:15 +00:00
  • 2863b7e422 goodbye requirements.txt now that we have devpi Noah Levitt 2015-09-23 00:49:20 +00:00
  • f8a70f3842 More changes. Hunter Stern 2015-09-17 16:24:41 -07:00
  • 8829323a38 Remove changes for https://webarchive.jira.com/browse/ARI-4518: Hunter Stern 2015-09-17 09:07:03 -07:00
  • f282213981 Add fix for https://webarchive.jira.com/browse/ARI-4518 Hunter Stern 2015-09-17 08:43:30 -07:00
  • c780b147b3 missed "git+" Noah Levitt 2015-09-16 19:24:48 +00:00
  • c682627aec Rethinker moved to pyrethink library Noah Levitt 2015-09-16 19:24:17 +00:00
  • a8f9664212 separate virtualenvs Noah Levitt 2015-09-16 19:23:11 +00:00
  • 5ccc535f51 More changes Hunter Stern 2015-09-16 09:23:13 -07:00
  • 3467670900 More changes for handling psu24 site Hunter Stern 2015-09-15 18:03:08 -07:00
  • 5a6cbf01da Dockerfile for brozzler worker Noah Levitt 2015-09-15 23:02:37 +00:00
  • ea41653c44 Pulled in changes from https://github.com/nlevitt/umbra/tree/aitfive-451-alt Hunter Stern 2015-09-15 11:53:53 -07:00
  • 70308c10f4 shouldn't have local paths as requirements Noah Levitt 2015-09-15 18:07:47 +00:00
  • b30cc2d68b simpler implementation for https://github.com/internetarchive/umbra/pull/42/files Noah Levitt 2015-09-14 17:57:01 -07:00
  • dc9d1a4959 detecting job finish seems to be working now Noah Levitt 2015-09-10 01:38:31 +00:00
  • 92a288bc35 detect jobs finishing! (not well tested yet) Noah Levitt 2015-09-09 22:11:48 +00:00
  • 72e72e03c4 brozzler-job-starter.py -> ait-brozzler-boss.py Noah Levitt 2015-09-09 22:11:14 +00:00
  • 1b94d10723 on reset, mark active jobs as finished Noah Levitt 2015-09-08 22:38:39 +00:00
  • 290ea433a5 save full size screenshot as jpeg too Noah Levitt 2015-09-08 22:37:35 +00:00
  • 9698b0f847 create thumbnail of screenshot and send to warcprox Noah Levitt 2015-09-07 06:27:21 +00:00
  • 565ab5f936 save screenshots with new scheme url screenshot:..., WARC-Type:resource Noah Levitt 2015-09-07 00:26:37 +00:00
  • 993ae6a833 run ait5 partner webapp; consolidate "status" and "fullstatus" Noah Levitt 2015-09-04 21:02:33 +00:00
  • 5fe2805285 fix bug claiming site, looks like there could be a race condition with other worker claiming the same site Noah Levitt 2015-09-04 01:36:29 +00:00
  • 3c23aa8fd4 finally, the jobs table Noah Levitt 2015-09-03 01:05:03 +00:00
  • 6cda4739b8 log exception when thread dies (seems to be dying silently sometimes) Noah Levitt 2015-09-03 01:04:41 +00:00
  • 839bf6f4ae script to help with starting/restarting/etc in my dev environment Noah Levitt 2015-09-03 01:03:19 +00:00
  • f334107b47 support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell) Noah Levitt 2015-08-28 00:37:26 +00:00
  • cf91fb1377 Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily" Ugh.. too much pain, not worth the time to figure out the magic #egg= incantation. Noah Levitt 2015-08-26 19:44:04 +00:00
  • 78ca070165 use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily Noah Levitt 2015-08-26 19:22:59 +00:00
  • efa640c640 refactor to simplify starting new job from code Noah Levitt 2015-08-25 19:52:33 +00:00
  • 68de85022a there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb Noah Levitt 2015-08-21 17:55:29 +00:00
  • 231d019659 use nlevitt fork of surt library for less stupid handling of mailto: urls, etc Noah Levitt 2015-08-20 21:23:59 +00:00
  • ee50818dca if database already exists but tables don't, just create them Noah Levitt 2015-08-20 21:23:08 +00:00
  • 3af1e10e13 make it work again, and list discovered outlinks Noah Levitt 2015-08-20 21:22:08 +00:00
  • 8b45d7eb69 since I can't figure out what's causing these sporadic errors fetching certain robots.txt through warcprox, stick a retry loop around the fetch Noah Levitt 2015-08-19 22:50:04 +00:00
  • ad543e6134 enforce time limits; move scope_and_schedule_outlinks into frontier.py; fix bugs around scoping on seed redirect Noah Levitt 2015-08-19 20:16:25 +00:00
  • ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception Noah Levitt 2015-08-19 20:04:46 +00:00
  • 2533229fa1 add __all__ to modules Noah Levitt 2015-08-19 19:01:28 +00:00
  • b7df0a1f37 make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py Noah Levitt 2015-08-19 18:45:19 +00:00
  • b8506a2ab4 rename "db" to "frontier" Noah Levitt 2015-08-19 17:47:05 +00:00
  • cd3a644298 switch order of brozzle_count and claimed in priority_by_site index to fix has_outstanding_pages check Noah Levitt 2015-08-19 00:04:20 +00:00
  • 382c826678 rethinkdb connection per request, to server chosen randomly from list Noah Levitt 2015-08-18 23:47:28 +00:00
  • a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb Noah Levitt 2015-08-18 21:44:54 +00:00
  • e6fbf0e2e9 rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al Noah Levitt 2015-08-17 22:48:25 +00:00
  • 6b6583e63a more notes on choosing a db Noah Levitt 2015-08-13 01:01:35 +00:00
  • e68c98e66d brozzle a site for 5 minutes at a time instead of 1 for now Noah Levitt 2015-08-11 18:15:16 +00:00
  • fc75e18928 handle "aw snap" or "he's dead jim" from chrome Noah Levitt 2015-08-11 18:14:53 +00:00
  • 3d70776ce3 some thoughts on distributed database Noah Levitt 2015-08-11 18:06:58 +00:00
  • ce154fc3db more robustness improvements Noah Levitt 2015-08-10 20:11:46 +00:00
  • e96b16e19a support for max_hops scope rule Noah Levitt 2015-08-07 22:36:39 +00:00
  • a47292dab5 thread to read and selectively log output from chrome Noah Levitt 2015-08-07 22:36:07 +00:00
  • 2a7a0b7c30 little fix, tweak Noah Levitt 2015-08-05 00:17:43 +00:00
  • b6beac3807 new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file Noah Levitt 2015-08-04 19:52:01 +00:00
  • 4624f47402 Merge pull request #41 from ldko/add-routing-key Noah Levitt 2015-08-03 12:39:26 -07:00
  • e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq Noah Levitt 2015-08-01 17:54:29 +00:00
  • 511e19ff4d handle 420 "Limit reached" when browser receives it Noah Levitt 2015-08-01 01:26:59 +00:00
  • f5acb6c34b make requests library dependency explicit Noah Levitt 2015-08-01 01:25:07 +00:00
  • 7b98af7d9f handle reached limit response from warcprox Noah Levitt 2015-08-01 00:09:57 +00:00
  • d4a783285e Adds routing_key to queue Queue creation Lauren Ko 2015-07-31 14:15:18 -05:00
  • 11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc Noah Levitt 2015-07-31 00:03:13 +00:00
  • 8366bd2d66 refactor to simplify run() Noah Levitt 2015-07-28 01:12:41 +00:00
  • 5c701abb36 reject urls with scheme other than http/https (for now) Noah Levitt 2015-07-28 01:11:26 +00:00
  • a0a0b0ff2c use nlevitt brozzler branch of youtube-dl Noah Levitt 2015-07-28 01:10:39 +00:00
  • 060b796d78 avoid putting "extra_http_headers" in params if value is None (because "in" operator would return true) Noah Levitt 2015-07-24 01:40:35 +00:00
  • a04bf04307 keep robots caches in BrozzlerHQ class, because Site instances get recreated over and over, which meant robots.txt was fetched over and over Noah Levitt 2015-07-23 02:19:25 +00:00
  • 4dacc0b087 new queue for disclaimed sites so that sites that are finished don't get picked up again off of the unclaimed queue Noah Levitt 2015-07-23 01:21:23 +00:00
  • 6e6fd5dc2c don't brozzle the same url more than once, and note when site has been crawled to completion i.e. finished (other behavior like recrawling urls under certain circumstances could be supported in the future) Noah Levitt 2015-07-23 00:44:33 +00:00
  • 5d5151584c fix another dumb little bug in handling exceptions from youtube_dl Noah Levitt 2015-07-23 00:41:26 +00:00
  • 85a863b1e3 change argument to --amqp-url for clarity and consistency Noah Levitt 2015-07-23 00:39:57 +00:00
  • 6a09f2095c handle exceptions in robots.txt fetching/parsing Noah Levitt 2015-07-22 00:54:49 +00:00
  • f00571f7bd fix youtube-dl exception handling Noah Levitt 2015-07-22 00:53:39 +00:00
  • 83a8e7cbe5 fix bug when --extra-header switch is not supplied Noah Levitt 2015-07-21 20:39:41 +00:00
  • f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening Noah Levitt 2015-07-21 20:39:19 +00:00
  • 88f352efea use new fork of youtube-dl with support for extra http headers on every request Noah Levitt 2015-07-21 19:23:01 +00:00
  • b5cb94fc8b some additional logging and error handling to avoid mysterious messages Noah Levitt 2015-07-21 06:33:02 +00:00
  • 1e56bc8686 add only one site at a time, specify settings with command line switches Noah Levitt 2015-07-21 06:32:00 +00:00
  • 38ddfe498d require my "tweaks" branch of websocket-client Noah Levitt 2015-07-20 16:06:47 -07:00
  • dc04048d50 add some info to the readme Noah Levitt 2015-07-20 12:00:14 -07:00
  • 2f28f00a09 make putmeta requests respect site configured extra_headers Noah Levitt 2015-07-17 16:52:06 -07:00
  • 2ba5bd4d4b support adding extra http request headers Noah Levitt 2015-07-17 13:45:27 -07:00
  • c178ed1950 fix buglet Noah Levitt 2015-07-16 18:43:14 -07:00
  • a54e60dbaf change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix Noah Levitt 2015-07-16 18:39:29 -07:00
  • d2650a2547 update scope if seed redirects Noah Levitt 2015-07-16 18:27:47 -07:00
  • 140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs Noah Levitt 2015-07-16 17:19:12 -07:00
  • e04247c3f7 add support for supplying json blob defining site with configuration to brozzler-add-site Noah Levitt 2015-07-16 14:48:01 -07:00
  • 6b2ee9faee chmod -x worker.py Noah Levitt 2015-07-15 18:03:49 -07:00
  • f2bc7ec271 refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling Noah Levitt 2015-07-15 18:03:03 -07:00
  • a9c51edd84 robots cache per site, and some so far unused support for site level configuration Noah Levitt 2015-07-15 17:44:42 -07:00
  • efa3cd6269 don't set http_proxy environment variable, because it affects things we don't want it to Noah Levitt 2015-07-15 17:33:29 -07:00