Commit Graph

  • 1f7f55a14a browser.py - Fix port search logic Adam Miller 2016-05-05 22:55:45 +00:00
  • 8e84465ff9 browser.py - Check for open ports before starting Chrome. Open next available on conflict Adam Miller 2016-05-05 22:31:07 +00:00
  • 053767d393 bump version again Noah Levitt 2016-05-05 10:37:58 -07:00
  • 8d618ed135 refactor post-behavior stuff into separate interval function for clarity Noah Levitt 2016-05-04 12:19:56 -07:00
  • 1ef528eea7 do the clearInterval thing when umbraBehaviorFinished is about to return true on all the behaviors (that have that function)... for the record the impetus for this is to stop scrolling so we can take the screenshot Noah Levitt 2016-05-05 10:35:06 -07:00
  • 5b492ac6f1 remove old facebook behavior, replaced by facebook.js.template (missed this on commit cea192b) Noah Levitt 2016-05-05 10:28:01 -07:00
  • 5a2ea2cea4 make brozzle-page utility save the screenshot to a file Noah Levitt 2016-05-04 11:53:45 -07:00
  • 87af7eaa73 Merge pull request #2 from internetarchive/AITFIVE-832 Noah Levitt 2016-05-05 10:08:21 -07:00
  • 31356d526a Merge branch 'master' into AITFIVE-832 Noah Levitt 2016-05-05 10:06:12 -07:00
  • cea192b4b3 copy over latest behaviors and stuff from umbra Noah Levitt 2016-05-05 00:58:26 -07:00
  • 6e4e28d2df Modifying default.js behavior to stop the interval function when umbraBehaviorFinished returns true. We should do this in all behaviors ultimately to stop the behavior script upon completion Adam Miller 2016-05-05 01:03:57 +00:00
  • 61cec15fff Restructure browser.py to take screenshot after behavior script. Adam Miller 2016-05-03 22:06:03 +00:00
  • 0af00bb3d5 support for host rules in outlink scoping Noah Levitt 2016-05-03 20:52:22 +00:00
  • 1d21f2c307 recover from rethinkdb error updating service registry Noah Levitt 2016-05-03 08:02:59 +00:00
  • f285be71fb new generator site_pages() iterates over a site's pages Noah Levitt 2016-04-28 00:29:22 +00:00
  • abe2c244eb fix brozzler.svg symlink Noah Levitt 2016-04-25 20:03:02 +00:00
  • df61e55b6b add license headers Noah Levitt 2016-04-25 20:02:11 +00:00
  • e210d417fb add methods to get all sites for a job, seed page for a site Noah Levitt 2016-04-25 17:01:56 +00:00
  • 2c7c713f00 add "metadata" field to site object Noah Levitt 2016-04-25 17:01:22 +00:00
  • 8d9fc7d3e3 working on avoiding race condition resulting in multiple brozzler-workers claiming the same site Noah Levitt 2016-04-22 01:27:50 +00:00
  • 2825ffea15 support for extra "blocks" and "accepts" scope rules Noah Levitt 2016-04-21 22:22:44 +00:00
  • 68abb3cb94 log "behavior finished"/"hard timeout" only once Noah Levitt 2016-04-21 22:02:50 +00:00
  • 568a553432 use the uncanonicalized url as part of the sha1 input to generate the page id, since canonicalization was stripping off the #fragment, and we might want to crawl the same url with different fragments (and there's no option to GoogleURLCanonicalizer to not strip the fragment) Noah Levitt 2016-04-21 22:01:49 +00:00
  • dd8f0d525d set read_mode=majority when claiming a site to brozzle, to avoid weird thing where brozzler keeps claiming site it's already working on (not sure this is the cause of the problem but i don't see why else it might happen) Noah Levitt 2016-04-21 20:32:28 +00:00
  • 1e52d1cf98 restore scoping out of urls with unsupported schemes Noah Levitt 2016-04-21 11:40:08 -07:00
  • fee008266f support for one-hop-off (or n-hop-off) scoping Noah Levitt 2016-04-21 17:41:30 +00:00
  • 7bc726f717 fix bug preventing links from being extracted if hard timeout is reached Noah Levitt 2016-04-20 17:24:18 -07:00
  • 4bbbbcf138 fix bug where the first time a site was claimed, another brozzler-worker would claim it anyway (and find no pages to brozzle) Noah Levitt 2016-04-21 00:21:08 +00:00
  • 416aa064f8 don't know why some jobs were missing from the list, but with this change they all show up Noah Levitt 2016-04-19 22:41:48 +00:00
  • b5f5581477 only list available services (ones with recent heartbeats) Noah Levitt 2016-04-19 21:14:20 +00:00
  • 72a94ed816 un-hardcode some stuff in webconsole, load from environment variables instead Noah Levitt 2016-04-19 18:51:14 +00:00
  • 35b713a2e7 little version bump Noah Levitt 2016-04-07 23:36:05 +00:00
  • 919692f9fa pin rethinkdb requirement to 2.3.x (this needs to roughly track deployed version) Noah Levitt 2016-04-07 23:35:20 +00:00
  • 7c637a45e0 remove debugging line Noah Levitt 2016-04-07 23:34:44 +00:00
  • 5bb23b354c fix stupid bug where all new sites would have same start_time Noah Levitt 2016-04-07 23:34:30 +00:00
  • ecb2e44442 if youtube-dl fetches pages or makes HEAD requests, look at the responses to determine if the page is html and therefore needs to be browsed; if it doesn't need to be browsed, check if youtube-dl has already fetched it (GET request to final bounce of redirect chain that returned a 200); if not, simply fetch it Noah Levitt 2016-04-06 17:50:48 -07:00
  • ed0ea24de6 Merge branch 'master' of github.com:nlevitt/brozzler Noah Levitt 2016-04-04 22:43:18 -07:00
  • d834516362 include custom http request headers in youtube-dl requests without need for special hacked youtube-dl Noah Levitt 2016-04-04 22:43:08 -07:00
  • 733124c7dc fix bug preventing brozzler from simultaneously working on more than one site from the same job Noah Levitt 2016-04-04 23:28:24 +00:00
  • a43b5016e1 use a dev version number Noah Levitt 2016-03-18 02:03:20 +00:00
  • c2e80ed6ff make whole process die if main worker thread dies Noah Levitt 2016-03-16 23:35:33 +00:00
  • ca9e62f5cf if a site is marked "claimed" in rethinkdb, but last_disclaimed is more than 2 hours ago, claim it and log a warning Noah Levitt 2016-03-14 22:21:16 +00:00
  • 4874eaccbb Merge remote-tracking branch 'umbra/master' Noah Levitt 2016-03-07 17:37:12 -08:00
  • b06381790c honor crawl job stop requests Noah Levitt 2016-03-08 00:18:54 +00:00
  • d2567f4a13 loosen surt req Noah Levitt 2016-03-02 00:16:58 +00:00
  • b75577fca4 Merge pull request #52 from vonrosen/ARI-4725 Noah Levitt 2016-02-16 15:22:15 -08:00
  • 77dfbcd328 remove cluster-control.sh script because it's specific to ait environment Noah Levitt 2016-02-11 23:56:16 +00:00
  • 664bb33add tweaks that have been sitting here Noah Levitt 2016-02-10 00:38:48 +00:00
  • 887eadb99a lock down vnc Noah Levitt 2016-02-10 00:37:36 +00:00
  • fe650b69ed Handle Python to JS boolean conversion Hunter Stern 2016-02-09 10:48:33 -08:00
  • 2ed96f9b59 Allow clicking on already clicked element to continue in behaviors if click_until_hard_timeout is set to true Hunter Stern 2016-02-05 10:00:24 -08:00
  • b9973c7cae Merge pull request #51 from vonrosen/ARI-4690 Neil Minton 2016-02-03 14:07:51 -08:00
  • fe81aa4ff2 Make Umbra click on 'Load More' button for youtube pages Hunter Stern 2016-01-28 11:53:59 -08:00
  • 54d92f88b0 Merge pull request #49 from nlevitt/work-dir-cleanup-exception Neil Minton 2015-12-18 11:34:54 -08:00
  • f1770b813d Merge pull request #48 from sfdevguy/master Noah Levitt 2015-12-18 11:34:00 -08:00
  • 8ab0857dad catch and log exception deleting temporary work directory Noah Levitt 2015-12-18 11:26:36 -08:00
  • c494afb749 Merge branch 'AITFIVE-497' Neil Minton 2015-12-02 10:02:05 -08:00
  • 36e2bb2729 use rethinkdb native time type for date/time values Noah Levitt 2015-11-18 02:07:27 +00:00
  • ca0053e3be also when adding new job, insert all sites before the job, to prevent brozzler workers thinking the job is finished before all the sites are in the db Noah Levitt 2015-11-14 03:10:58 +00:00
  • 3260fe4e9e when adding new job, insert the seed url Page document into the database before the Site, to avoid situation where brozzler worker claims the site, finds no pages to crawl, and decides the site is finished Noah Levitt 2015-11-13 23:47:51 +00:00
  • 21906f8cad vnc-websock.sh uses bashisms Noah Levitt 2015-11-12 02:59:45 +00:00
  • 3bcd2400f7 2 instances of warcprox; no docker for brozzler worker Noah Levitt 2015-11-12 02:59:21 +00:00
  • 4c2ecab856 surt==0.3b2 (available on pypi) Noah Levitt 2015-11-12 02:58:53 +00:00
  • 38dec97e19 logging tweaks Noah Levitt 2015-11-12 02:58:26 +00:00
  • 5597b4cf1a quiet down requests.packages.urllib3 Noah Levitt 2015-11-12 02:58:00 +00:00
  • 998c3975b2 replace jobs page with home page which also lists services Noah Levitt 2015-11-12 02:57:27 +00:00
  • 343b5c0f82 register with service registry; only start chrome right before using it, so that web console vnc windows aren't always full of about:blank Noah Levitt 2015-11-12 02:56:27 +00:00
  • b91d7e4c3f startup scripts for services needed for non-docker deployment Noah Levitt 2015-11-11 21:28:55 +00:00
  • 29b6a0b0d4 Merge branch 'master' of github.com:nlevitt/brozzler Noah Levitt 2015-11-05 20:10:22 +00:00
  • 8c422534a5 smart waiting for tables and indexes to be ready Noah Levitt 2015-11-05 20:10:14 +00:00
  • b329d193ca Merge pull request #46 from nlevitt/facebook-modal-close Hunter 2015-11-04 07:37:36 -08:00
  • f6f4daf24a update detection of modal close button for facebook changes Noah Levitt 2015-11-03 15:36:31 -08:00
  • 8889707f24 update detection of modal close button for facebook changes Noah Levitt 2015-11-03 15:33:46 -08:00
  • 85d87a5e42 Merge remote-tracking branch 'umbra/master' Noah Levitt 2015-11-03 15:31:38 -08:00
  • dceef1a676 Add custom behavior for Brooklyn Museum. Neil Minton 2015-11-03 13:59:20 -08:00
  • 90fad87f7e websockify startup script Noah Levitt 2015-11-03 20:15:41 +00:00
  • 03e7c29701 switch noVNC git url to https Noah Levitt 2015-10-29 21:36:43 +00:00
  • d9d69a88fd tweaking workers page Noah Levitt 2015-10-29 01:01:28 +00:00
  • 7b39ba021b proof of concept presenting workers in web console with novnc Noah Levitt 2015-10-27 19:00:44 +00:00
  • a0f4fd449c Merge pull request #1 from adam-miller/fixes Noah Levitt 2015-10-22 15:33:46 -07:00
  • 20bde1c482 uncommented init imports, removed required job_id in Frontier.finished Adam Miller 2015-10-22 22:29:24 +00:00
  • d1aebb0258 fix indentation Noah Levitt 2015-10-14 00:44:29 +00:00
  • 80f963591f mount warcs dir with sshfs; start-dead to start only services that aren't already running Noah Levitt 2015-10-12 23:10:13 +00:00
  • 196e52ac0a homegrown infinite scroll through pages on site page Noah Levitt 2015-10-12 23:08:35 +00:00
  • 3df4a3e109 make the site page present something sensible Noah Levitt 2015-10-10 00:30:03 +00:00
  • 549b149e39 Merge branch 'master' of github.com:nlevitt/brozzler Noah Levitt 2015-10-09 20:31:15 +00:00
  • 9ed1ac817e 4 space indent everywhere Noah Levitt 2015-10-09 20:31:07 +00:00
  • 0591548861 more incremental progress on web console Noah Levitt 2015-10-09 20:12:40 +00:00
  • 0050fe56b8 logo Noah Levitt 2015-10-07 17:53:16 -07:00
  • 2ddda68392 symlink to root Noah Levitt 2015-10-08 00:37:39 +00:00
  • d1158ab224 incremental progress on web console Noah Levitt 2015-10-08 00:33:49 +00:00
  • 7ab2eb4fda brozzler web console in the mix Noah Levitt 2015-10-08 00:31:28 +00:00
  • 82011c15cd Merge branch 'master' of github.com:nlevitt/brozzler Noah Levitt 2015-10-07 23:56:44 +00:00
  • 3805c7bf93 logo!? Noah Levitt 2015-10-07 15:45:01 -07:00
  • a5eb223b32 run brozzler workers inside docker containers Noah Levitt 2015-10-06 01:24:01 +00:00
  • 5868192e0a more stubby stuff Noah Levitt 2015-09-28 22:05:43 +00:00
  • 2e1601ac81 i think hash-less urls are working Noah Levitt 2015-09-25 22:48:01 +00:00
  • 05e15b9667 progress on the structure of this little app Noah Levitt 2015-09-25 22:19:29 +00:00
  • 51732d0d49 run warcprox on wbgrp-svc111 Noah Levitt 2015-09-25 19:16:27 +00:00
  • 69a25bc74a equivalent functionality using angular and restful json Noah Levitt 2015-09-25 19:15:20 +00:00
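
The two most recent commits (8e84465ff9 and 1f7f55a14a) describe checking for an open port before starting Chrome and moving to the next available one on conflict. The log does not show how browser.py does this, so the following is only a minimal sketch of that general idea. The function name `find_available_port` and its `host` and `max_tries` parameters are hypothetical and are not taken from the repository.

```python
import socket


def find_available_port(start_port, host="127.0.0.1", max_tries=100):
    """Return the first port >= start_port that can be bound on host.

    Hypothetical helper for illustration; not the code in browser.py.
    """
    # Never probe past the highest valid TCP port number.
    last_port = min(start_port + max_tries, 65536)
    for port in range(start_port, last_port):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is already in use (or otherwise unavailable); try the next one.
                continue
            # Bind succeeded; the socket is closed on exiting the with block.
            return port
    raise RuntimeError(
        "no available port in range %d-%d" % (start_port, last_port - 1))
```

A caller might pass the returned port to Chrome as its remote debugging port. Note that a check like this is inherently racy: another process can take the port between the probe and the moment the browser binds it, so the caller may still need to handle a startup failure.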