Noah Levitt
317a5eb99d
without sudo, psutil.net_connections() raises psutil.AccessDenied on mac; in this case, silently try running chrome on the unvetted configured port
2016-05-09 17:25:14 -07:00
Noah Levitt
1141c5951e
add psutil dependency
2016-05-09 17:19:53 -07:00
Noah Levitt
c12090b3ef
oops, no "+" there
2016-05-07 01:46:36 +00:00
Noah Levitt
464da5c3a6
avoid errors with old versions of pip or non-utf-8 locales by specifying the encoding of README.rst
2016-05-07 01:46:15 +00:00
Noah Levitt
1445aa9976
make Site.warcprox_meta a special thing, replacing Site.extra_headers; this way, warcprox_meta is a dictionary in rethinkdb rather than a long json string
2016-05-05 23:24:10 +00:00
Noah Levitt
07e15e26bd
Merge pull request #3 from internetarchive/AITFIVE-859
browser.py - Check for open ports before starting Chrome. Open next a…
2016-05-05 16:00:38 -07:00
Adam Miller
1f7f55a14a
browser.py - Fix port search logic
2016-05-05 22:55:45 +00:00
Adam Miller
8e84465ff9
browser.py - Check for open ports before starting Chrome. Open next available on conflict
2016-05-05 22:31:07 +00:00
Noah Levitt
053767d393
bump version again
2016-05-05 10:37:58 -07:00
Noah Levitt
8d618ed135
refactor post-behavior stuff into separate interval function for clarity
2016-05-05 10:37:00 -07:00
Noah Levitt
1ef528eea7
do the clearInterval thing when umbraBehaviorFinished is about to return true on all the behaviors (that have that function)... for the record the impetus for this is to stop scrolling so we can take the screenshot
2016-05-05 10:35:06 -07:00
Noah Levitt
5b492ac6f1
remove old facebook behavior, replaced by facebook.js.template (missed this on commit cea192b)
2016-05-05 10:28:01 -07:00
Noah Levitt
5a2ea2cea4
make brozzle-page utility save the screenshot to a file
2016-05-05 10:10:53 -07:00
Noah Levitt
87af7eaa73
Merge pull request #2 from internetarchive/AITFIVE-832
Restructure browser.py to take screenshot after behavior script.
2016-05-05 10:08:21 -07:00
Noah Levitt
31356d526a
Merge branch 'master' into AITFIVE-832
* master:
  copy over latest behaviors and stuff from umbra
  support for host rules in outlink scoping
  recover from rethinkdb error updating service registry
2016-05-05 10:06:12 -07:00
Noah Levitt
cea192b4b3
copy over latest behaviors and stuff from umbra
2016-05-05 00:58:26 -07:00
Adam Miller
6e4e28d2df
Modifying default.js behavior to stop the interval function when umbraBehaviorFinished returns true
We should do this in all behaviors ultimately to stop the behavior script upon completion
2016-05-05 01:03:57 +00:00
Adam Miller
61cec15fff
Restructure browser.py to take screenshot after behavior script.
2016-05-03 22:06:03 +00:00
Noah Levitt
0af00bb3d5
support for host rules in outlink scoping
2016-05-03 20:52:22 +00:00
Noah Levitt
1d21f2c307
recover from rethinkdb error updating service registry
2016-05-03 08:02:59 +00:00
Noah Levitt
f285be71fb
new generator site_pages() iterates over a site's pages
2016-04-28 00:29:22 +00:00
Noah Levitt
abe2c244eb
fix brozzler.svg symlink
2016-04-25 20:03:02 +00:00
Noah Levitt
df61e55b6b
add license headers
2016-04-25 20:02:11 +00:00
Noah Levitt
e210d417fb
add methods to get all sites for a job, seed page for a site
2016-04-25 17:01:56 +00:00
Noah Levitt
2c7c713f00
add "metadata" field to site object
2016-04-25 17:01:22 +00:00
Noah Levitt
8d9fc7d3e3
working on avoiding race condition resulting in multiple brozzler-workers claiming the same site
2016-04-22 01:27:50 +00:00
Noah Levitt
2825ffea15
support for extra "blocks" and "accepts" scope rules
2016-04-21 22:22:44 +00:00
Noah Levitt
68abb3cb94
log "behavior finished"/"hard timeout" only once
2016-04-21 22:02:50 +00:00
Noah Levitt
568a553432
use the uncanonicalized url as part of the sha1 input to generate the page id, since canonicalization was stripping off the #fragment, and we might want to crawl the same url with different fragments (and there's no option to GoogleURLCanonicalizer to not strip the fragment)
2016-04-21 22:01:49 +00:00
Noah Levitt
dd8f0d525d
set read_mode=majority when claiming a site to brozzle, to avoid weird thing where brozzler keeps claiming site it's already working on (not sure this is the cause of the problem but i don't see why else it might happen)
2016-04-21 20:43:25 +00:00
Noah Levitt
1e52d1cf98
restore scoping out of urls with unsupported schemes
2016-04-21 11:40:08 -07:00
Noah Levitt
fee008266f
support for one-hop-off (or n-hop-off) scoping
2016-04-21 17:41:59 +00:00
Noah Levitt
7bc726f717
fix bug preventing links from being extracted if hard timeout is reached
2016-04-20 17:24:18 -07:00
Noah Levitt
4bbbbcf138
fix bug where the first time a site was claimed, another brozzler-worker would claim it anyway (and find no pages to brozzle)
2016-04-21 00:21:08 +00:00
Noah Levitt
416aa064f8
don't know why some jobs were missing from the list, but with this change they all show up
2016-04-19 22:41:48 +00:00
Noah Levitt
b5f5581477
only list available services (ones with recent heartbeats)
2016-04-19 21:14:20 +00:00
Noah Levitt
72a94ed816
un-hardcode some stuff in webconsole, load from environment variables instead
2016-04-19 18:51:14 +00:00
Noah Levitt
35b713a2e7
little version bump
2016-04-07 23:36:05 +00:00
Noah Levitt
919692f9fa
pin rethinkdb requirement to 2.3.x (this needs to roughly track deployed version)
2016-04-07 23:35:20 +00:00
Noah Levitt
7c637a45e0
remove debugging line
2016-04-07 23:34:44 +00:00
Noah Levitt
5bb23b354c
fix stupid bug where all new sites would have same start_time
2016-04-07 23:34:30 +00:00
Noah Levitt
ecb2e44442
if youtube-dl fetches pages or makes HEAD requests, look at the responses to determine if the page is html and therefore needs to be browsed; if it doesn't need to be browsed, check if youtube-dl has already fetched it (GET request to final bounce of redirect chain that returned a 200); if not, simply fetch it
2016-04-06 17:50:48 -07:00
Noah Levitt
ed0ea24de6
Merge branch 'master' of github.com:nlevitt/brozzler
* 'master' of github.com:nlevitt/brozzler:
  fix bug preventing brozzler from simultaneously working on more than one site from the same job
2016-04-04 22:43:18 -07:00
Noah Levitt
d834516362
include custom http request headers in youtube-dl requests without need for special hacked youtube-dl
2016-04-04 22:43:08 -07:00
Noah Levitt
733124c7dc
fix bug preventing brozzler from simultaneously working on more than one site from the same job
2016-04-04 23:28:24 +00:00
Noah Levitt
a43b5016e1
use a dev version number
2016-03-18 02:03:20 +00:00
Noah Levitt
c2e80ed6ff
make whole process die if main worker thread dies
2016-03-16 23:35:33 +00:00
Noah Levitt
ca9e62f5cf
if a site is marked "claimed" in rethinkdb, but last_disclaimed is more than 2 hours ago, claim it and log a warning
2016-03-14 22:21:16 +00:00
Noah Levitt
4874eaccbb
Merge remote-tracking branch 'umbra/master'
* umbra/master:
  Handle Python to JS boolean conversion
  Allow clicking on already clicked element to continue in behaviors if click_until_hard_timeout is set to true
  Make Umbra click on 'Load More' button for youtube pages
  catch and log exception deleting temporary work directory
  update detection of modal close button for facebook changes
  Add custom behavior for Brooklyn Museum.
2016-03-07 17:37:12 -08:00
Noah Levitt
b06381790c
honor crawl job stop requests
2016-03-08 00:18:54 +00:00