Noah Levitt
d04c5a31cc
to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism
2016-06-29 00:02:25 +00:00
Noah Levitt
7a805a43d1
calm logging, don't print stacktrace on 430 from youtube-dl
2016-06-28 23:19:28 +00:00
Noah Levitt
e9c398caea
fix buglet in creation of new least_hops on pages table
2016-06-28 23:14:23 +00:00
Noah Levitt
77c800f6a2
renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
2016-06-28 15:13:48 -05:00
Noah Levitt
e64a4d6985
let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this: "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'"
2016-06-28 13:56:30 -05:00
Noah Levitt
772bcf0df6
handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988)
2016-06-28 12:23:32 -05:00
Noah Levitt
0bd687abde
avoid hanging in case a page has no outlinks
2016-06-28 11:25:04 -05:00
Noah Levitt
6a11e1da2a
fix noVNC submodule path since brozzler webconsole has moved
2016-06-28 16:15:16 +00:00
Noah Levitt
cb4a16e58c
handle new bucket format in brozzler-webconsole
2016-06-28 00:13:01 +00:00
Noah Levitt
98915b3d86
fix brozzler.svg symlink
2016-06-27 20:01:35 +00:00
Noah Levitt
89474cb430
convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
2016-06-27 14:12:56 -05:00
Noah Levitt
e4f8efe376
make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies
2016-06-27 12:43:24 -05:00
Noah Levitt
08a9636e95
remove crufty docker and no-docker scripts
2016-06-27 00:11:37 -05:00
Noah Levitt
55fda6e892
note python 3.4 requirement in readme
2016-06-25 15:44:27 -05:00
Noah Levitt
366e467501
enhancements to the page thumbnail on the site page
2016-06-22 23:09:27 +00:00
Noah Levitt
9b3f3809cc
expose full rethinkdb entry as yaml on job page
2016-06-22 22:29:07 +00:00
Noah Levitt
510456eef2
order the page thumbnails on site page by least number of hops, so the seed shows up first
2016-06-22 21:20:00 +00:00
Noah Levitt
2038598f41
fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''}
2016-06-22 17:43:53 +00:00
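A minimal sketch of the kind of bug this commit message describes, assuming the outlinks come back from the browser as one newline-joined string (the log does not say so; `parse_outlinks` is a hypothetical helper, not brozzler's actual code). Splitting an empty string yields `['']`, so a naive conversion produces `{''}` rather than an empty set:

```python
def parse_outlinks(raw):
    """Turn a newline-joined string of URLs into a set of URLs.

    Hypothetical helper for illustration only. Filtering out empty
    pieces is what makes an empty input yield an empty set.
    """
    return frozenset(piece for piece in raw.split("\n") if piece)


# The naive version keeps the empty string that str.split produces.
naive = set("".split("\n"))       # a set holding one empty string
fixed = parse_outlinks("")        # an empty frozenset
```

Filtering at the point of parsing keeps callers from having to special-case a page with no links.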
Noah Levitt
d198a69e45
recurse through all frames to find outlinks
2016-06-22 11:39:31 -05:00
Noah Levitt
3b615120d4
Merge branch 'master' of github.com:internetarchive/brozzler
...
* 'master' of github.com:internetarchive/brozzler:
back to a dev version number
update url
back to a dev version number
beta version number for pypi upload
bump version number
call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7)
copy over fec.gov behavior from umbra master
2016-06-20 16:48:11 +00:00
Noah Levitt
5ccf5a9dcb
fix site warcprox-meta lookup now that it is a native json object rather than a string
2016-06-20 16:48:01 +00:00
Noah Levitt
8d7bb582cb
back to a dev version number
2016-06-16 14:19:00 -05:00
Noah Levitt
6237ed3f34
update url
2016-06-16 14:18:24 -05:00
Noah Levitt
9c2fe25dd0
back to a dev version number
2016-06-16 14:13:03 -05:00
Noah Levitt
b2d4cc5ff0
beta version number for pypi upload
2016-06-16 14:11:45 -05:00
Noah Levitt
81d709eed0
bump version number
2016-06-16 13:56:08 -05:00
Noah Levitt
1577cd8926
call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7)
2016-06-16 13:55:17 -05:00
Noah Levitt
98acc8dc92
copy over fec.gov behavior from umbra master
2016-06-16 13:53:28 -05:00
Noah Levitt
d75e8c394a
switch flask requirement to recent release, suggest gunicorn for running the app
2016-06-15 22:00:39 +00:00
Noah Levitt
b0ed4b8128
Merge pull request #7 from galgeek/disable_extensions
...
disable browser extensions
2016-06-06 11:57:46 -07:00
Noah Levitt
c63c21c30a
Merge pull request #6 from ato/document-config
...
Document the job config format
2016-06-06 11:56:45 -07:00
Barbara Miller
1c1237d07e
disable browser extensions
2016-05-27 22:51:38 -07:00
Noah Levitt
92f8f7c16d
Merge pull request #5 from ato/fix-brozzler-new-site
...
brozzler-new-site: Fix warcprox_meta's default value and json import
2016-05-17 10:14:48 -07:00
Alex Osborne
484805fbda
proxy is not supposed to have http:// prefix
...
Looks like the prefixes are added by BrozzleWorker._fetch_url()
2016-05-17 16:20:38 +10:00
Alex Osborne
02af30edd4
Document the job config format
2016-05-17 15:20:09 +10:00
Alex Osborne
a939689d44
Fix warcprox_meta's default value and json import
2016-05-17 13:51:42 +10:00
Noah Levitt
182cbfd0ce
bump version so we can upload to pypi and fix the readme
2016-05-11 12:10:23 -07:00
Noah Levitt
dd2211df31
yes you can install brozzler from the outside world now!
2016-05-11 12:09:09 -07:00
Noah Levitt
6f6216e432
catch exception from rethinkdb when unregistering from the service registry at shutdown
2016-05-11 00:46:50 +00:00
Noah Levitt
c6e0e7c507
correctly handle site with no pages (which means the seed was blocked by robots.txt) in frontier.seed_page
2016-05-11 00:45:47 +00:00
Noah Levitt
317a5eb99d
without sudo, psutil.net_connections() raises psutil.AccessDenied on mac; in this case, silently try running chrome on the unvetted configured port
2016-05-09 17:25:14 -07:00
Noah Levitt
1141c5951e
add psutil dependency
2016-05-09 17:19:53 -07:00
Noah Levitt
c12090b3ef
oops, no "+" there
2016-05-07 01:46:36 +00:00
Noah Levitt
464da5c3a6
avoid errors with old versions of pip or non-utf-8 locales by specifying the encoding of README.rst
2016-05-07 01:46:15 +00:00
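A short sketch of the general technique named in this commit message: reading a file with an explicit encoding so the result does not depend on the locale. The helper name `read_long_description` and the temporary file are assumptions made for illustration; the actual setup.py change is not shown in this log.

```python
import os
import tempfile


def read_long_description(path):
    """Read a text file as UTF-8 regardless of the current locale.

    Hypothetical helper: without encoding=..., open() falls back to the
    locale's preferred encoding, which can fail on non-ASCII bytes.
    """
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demonstration with a throwaway file that contains non-ASCII text.
fd, demo_path = tempfile.mkstemp(suffix=".rst")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write("brozzler \u2013 browser crawler\n")
demo_text = read_long_description(demo_path)
os.remove(demo_path)
```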
Noah Levitt
1445aa9976
make Site.warcprox_meta a special thing, replacing Site.extra_headers; this way, warcprox_meta is a dictionary in rethinkdb rather than a long json string
2016-05-05 23:24:10 +00:00
Noah Levitt
07e15e26bd
Merge pull request #3 from internetarchive/AITFIVE-859
...
browser.py - Check for open ports before starting Chrome. Open next a…
2016-05-05 16:00:38 -07:00
Adam Miller
1f7f55a14a
browser.py - Fix port search logic
2016-05-05 22:55:45 +00:00
Adam Miller
8e84465ff9
browser.py - Check for open ports before starting Chrome. Open next available on conflict
2016-05-05 22:31:07 +00:00
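The idea in this commit message (check whether a port is in use, and move to the next one on conflict) can be sketched as follows. This is not the project's implementation: a later entry in this log mentions `psutil.net_connections()`, whereas this sketch swaps in a plain attempt to bind with the standard `socket` module. The function names and the `max_tries` limit are assumptions.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def next_free_port(start, max_tries=100, host="127.0.0.1"):
    """Return the first free port at or above start.

    Raises RuntimeError if none of the candidate ports is free.
    """
    for port in range(start, min(start + max_tries, 65536)):
        if port_is_free(port, host):
            return port
    raise RuntimeError("no free port found starting at %d" % start)
```

A check like this is inherently racy: another process can take the port between the check and the browser starting, so the caller still has to handle a failed launch.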
Noah Levitt
053767d393
bump version again
2016-05-05 10:37:58 -07:00
Noah Levitt
8d618ed135
refactor post-behavior stuff into separate interval function for clarity
2016-05-05 10:37:00 -07:00