609 Commits

Author SHA1 Message Date
Noah Levitt
2c7c713f00 add "metadata" field to site object 2016-04-25 17:01:22 +00:00
Noah Levitt
8d9fc7d3e3 working on avoiding race condition resulting in multiple brozzler-workers claiming the same site 2016-04-22 01:27:50 +00:00
Noah Levitt
2825ffea15 support for extra "blocks" and "accepts" scope rules 2016-04-21 22:22:44 +00:00
Noah Levitt
68abb3cb94 log "behavior finished"/"hard timeout" only once 2016-04-21 22:02:50 +00:00
Noah Levitt
568a553432 use the uncanonicalized url as part of the sha1 input to generate the page id, since canonicalization was stripping off the #fragment, and we might want to crawl the same url with different fragments (and there's no option to GoogleURLCanonicalizer to not strip the fragment) 2016-04-21 22:01:49 +00:00
Noah Levitt
dd8f0d525d set read_mode=majority when claiming a site to brozzle, to avoid weird thing where brozzler keeps claiming site it's already working on (not sure this is the cause of the problem but i don't see why else it might happen) 2016-04-21 20:43:25 +00:00
Noah Levitt
1e52d1cf98 restore scoping out of urls with unsupported schemes 2016-04-21 11:40:08 -07:00
Noah Levitt
fee008266f support for one-hop-off (or n-hop-off) scoping 2016-04-21 17:41:59 +00:00
Noah Levitt
7bc726f717 fix bug preventing links from being extracted if hard timeout is reached 2016-04-20 17:24:18 -07:00
Noah Levitt
4bbbbcf138 fix bug where the first time a site was claimed, another brozzler-worker would claim it anyway (and find no pages to brozzle) 2016-04-21 00:21:08 +00:00
Noah Levitt
416aa064f8 don't know why some jobs were missing from the list, but with this change they all show up 2016-04-19 22:41:48 +00:00
Noah Levitt
b5f5581477 only list available services (ones with recent heartbeats) 2016-04-19 21:14:20 +00:00
Noah Levitt
72a94ed816 un-hardcode some stuff in webconsole, load from environment variables instead 2016-04-19 18:51:14 +00:00
Noah Levitt
35b713a2e7 little version bump 2016-04-07 23:36:05 +00:00
Noah Levitt
919692f9fa pin rethinkdb requirement to 2.3.x (this needs to roughly track deployed version) 2016-04-07 23:35:20 +00:00
Noah Levitt
7c637a45e0 remove debugging line 2016-04-07 23:34:44 +00:00
Noah Levitt
5bb23b354c fix stupid bug where all new sites would have same start_time 2016-04-07 23:34:30 +00:00
Noah Levitt
ecb2e44442 if youtube-dl fetches pages or makes HEAD requests, look at the responses to determine if the page is html and therefore needs to be browsed; if it doesn't need to be browsed, check if youtube-dl has already fetched it (GET request to final bounce of redirect chain that returned a 200); if not, simply fetch it 2016-04-06 17:50:48 -07:00
Noah Levitt
ed0ea24de6 Merge branch 'master' of github.com:nlevitt/brozzler
* 'master' of github.com:nlevitt/brozzler:
  fix bug preventing brozzler from simultaneously working on more than one site from the same job
2016-04-04 22:43:18 -07:00
Noah Levitt
d834516362 include custom http request headers in youtube-dl requests without need for special hacked youtube-dl 2016-04-04 22:43:08 -07:00
Noah Levitt
733124c7dc fix bug preventing brozzler from simultaneously working on more than one site from the same job 2016-04-04 23:28:24 +00:00
Noah Levitt
a43b5016e1 use a dev version number 2016-03-18 02:03:20 +00:00
Noah Levitt
c2e80ed6ff make whole process die if main worker thread dies 2016-03-16 23:35:33 +00:00
Noah Levitt
ca9e62f5cf if a site is marked "claimed" in rethinkdb, but last_disclaimed is more than 2 hours ago, claim it and log a warning 2016-03-14 22:21:16 +00:00
Noah Levitt
4874eaccbb Merge remote-tracking branch 'umbra/master'
* umbra/master:
  Handle Python to JS boolean conversion
  Allow clicking on already clicked element to continue in behaviors if click_until_hard_timeout is set to true
  Make Umbra click on 'Load More' button for youtube pages
  catch and log exception deleting temporary work directory
  update detection of modal close button for facebook changes
  Add custom behavior for Brooklyn Museum.
2016-03-07 17:37:12 -08:00
Noah Levitt
b06381790c honor crawl job stop requests 2016-03-08 00:18:54 +00:00
Noah Levitt
d2567f4a13 loosen surt req 2016-03-02 00:16:58 +00:00
Noah Levitt
b75577fca4 Merge pull request #52 from vonrosen/ARI-4725
Allow clicking on already clicked element to continue in behaviors if…
2016-02-16 15:22:15 -08:00
Noah Levitt
77dfbcd328 remove cluster-control.sh script because it's specific to ait environment 2016-02-11 23:56:16 +00:00
Noah Levitt
664bb33add tweaks that have been sitting here 2016-02-10 00:38:48 +00:00
Noah Levitt
887eadb99a lock down vnc 2016-02-10 00:37:36 +00:00
Hunter Stern
fe650b69ed Handle Python to JS boolean conversion 2016-02-09 10:48:33 -08:00
Hunter Stern
2ed96f9b59 Allow clicking on already clicked element to continue in behaviors if click_until_hard_timeout is set to true 2016-02-05 10:00:24 -08:00
Neil Minton
b9973c7cae Merge pull request #51 from vonrosen/ARI-4690
Make Umbra click on 'Load More' button for youtube pages
2016-02-03 14:07:51 -08:00
Hunter Stern
fe81aa4ff2 Make Umbra click on 'Load More' button for youtube pages 2016-01-28 11:53:59 -08:00
Neil Minton
54d92f88b0 Merge pull request #49 from nlevitt/work-dir-cleanup-exception
catch and log exception deleting temporary work directory
2015-12-18 11:34:54 -08:00
Noah Levitt
f1770b813d Merge pull request #48 from sfdevguy/master
Add custom behavior for Brooklyn Museum
2015-12-18 11:34:00 -08:00
Noah Levitt
8ab0857dad catch and log exception deleting temporary work directory 2015-12-18 11:26:36 -08:00
Neil Minton
c494afb749 Merge branch 'AITFIVE-497' 2015-12-02 10:02:05 -08:00
Noah Levitt
36e2bb2729 use rethinkdb native time type for date/time values 2015-11-18 02:07:27 +00:00
Noah Levitt
ca0053e3be also when adding new job, insert all sites before the job, to prevent brozzler workers thinking the job is finished before all the sites are in the db 2015-11-14 03:10:58 +00:00
Noah Levitt
3260fe4e9e when adding new job, insert the seed url Page document into the database before the Site, to avoid situation where brozzler worker claims the site, finds no pages to crawl, and decides the site is finished 2015-11-13 23:47:51 +00:00
Noah Levitt
21906f8cad vnc-websock.sh uses bashisms 2015-11-12 02:59:45 +00:00
Noah Levitt
3bcd2400f7 2 instances of warcprox; no docker for brozzler worker 2015-11-12 02:59:21 +00:00
Noah Levitt
4c2ecab856 surt==0.3b2 (available on pypi) 2015-11-12 02:58:53 +00:00
Noah Levitt
38dec97e19 logging tweaks 2015-11-12 02:58:26 +00:00
Noah Levitt
5597b4cf1a quiet down requests.packages.urllib3 2015-11-12 02:58:00 +00:00
Noah Levitt
998c3975b2 replace jobs page with home page which also lists services 2015-11-12 02:57:27 +00:00
Noah Levitt
343b5c0f82 register with service registry; only start chrome right before using it, so that web console vnc windows aren't always full of about:blank 2015-11-12 02:56:27 +00:00
Noah Levitt
b91d7e4c3f startup scripts for services needed for non-docker deployment 2015-11-11 21:28:55 +00:00