Noah Levitt
be58fb46f7
forgot to commit easy.py, add pywb.py with support for pywb rethinkdb index, and make brozzler-easy also run pywb
2016-07-06 14:52:00 -05:00
Noah Levitt
3b252002b7
working on brozzler-easy, single process with brozzler-worker and warcprox working together (pywb to be added)
2016-07-05 18:46:42 -05:00
Noah Levitt
1a7b94cae7
twirldown for site yaml on site page
2016-07-05 21:42:36 +00:00
Noah Levitt
f825e76371
give master a version number considered later than the one up on pypi (1.1b3.dev45 > 1.1b2)
2016-07-05 10:44:48 -05:00
Noah Levitt
0b9ce94226
in vagrant/ansible, install brozzler from this checkout instead of from github master
2016-07-01 15:45:39 -05:00
Noah Levitt
3e128d2b27
option to save list of outlinks (categorized as "accepted", "blocked" (by robots), or "rejected") per page in rethinkdb (to be used by archive-it for out-of-scope reporting)
2016-07-01 15:23:46 -05:00
Noah Levitt
01e38ea8c7
oops didn't mean to leave that windows-only subprocess flag
2016-07-01 14:07:04 -05:00
Noah Levitt
ad502f33da
remove accidentally committed playbook.retry
2016-06-30 17:56:56 -05:00
Noah Levitt
2aef00826b
vagrant setup (unfinished)
2016-06-30 17:50:11 -05:00
Noah Levitt
79ad57669c
do not send more than one SIGTERM when shutting down browser process, because on recent chromium on linux, the second sigterm abruptly ends the process, and sometimes leaves orphan subprocesses; also send TERM/KILL signals to the whole process group, another measure to avoid orphans; and adjust logging levels for captured chrome output
2016-06-30 17:10:27 -05:00
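The shutdown approach in the commit above (a single SIGTERM, delivered to the whole process group, with SIGKILL as a fallback) might look roughly like the sketch below. This is not brozzler's actual code: `stop_process_group`, the timeout value, and the sleeping child that stands in for the browser are all assumptions made for illustration, and the sketch is POSIX-only.

```python
import os
import signal
import subprocess
import sys


def stop_process_group(proc, term_timeout=10.0):
    """Stop proc and its children: one SIGTERM to the group, SIGKILL if needed.

    Hypothetical helper for illustration only.
    """
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)  # sent exactly once, to the whole group
    try:
        return proc.wait(timeout=term_timeout)
    except subprocess.TimeoutExpired:
        # The group ignored SIGTERM for too long; escalate to SIGKILL.
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


# Stand-in for the browser process (assumption): a child that just sleeps.
# start_new_session=True puts it in its own session and process group, so
# signalling the group does not hit this script.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
returncode = stop_process_group(child)
```

Signalling the group rather than the single pid is what reaches helper subprocesses that the main process spawned, which is the orphan case the commit message describes.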
Noah Levitt
371590b578
command line utility brozzler-ensure-tables, creates rethinkdb tables if they don't already exist... brozzler normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster
2016-06-30 15:16:04 -05:00
Noah Levitt
9fd78fdbe8
implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top
2016-06-30 11:45:19 -05:00
Noah Levitt
a1910fc0fe
avoid "AttributeError: 'ExtractorError' object has no attribute 'code'" checking for 430 (soft limit) from youtube-dl
2016-06-29 19:57:51 -05:00
Noah Levitt
79beddfc44
set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure
2016-06-29 19:47:25 -05:00
Noah Levitt
2e687b65fb
fix case where rethinkdb page already has claimed=True
2016-06-29 19:29:18 -05:00
Noah Levitt
ffcf26b6c9
undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3 )
2016-06-29 18:53:32 -05:00
Noah Levitt
7431ae0eb1
fix bug preventing brozzler-new-site from working, add note about brozzler-new-site in readme
2016-06-29 18:41:45 -05:00
Noah Levitt
479713e25b
--trace level logging
2016-06-29 18:29:45 -05:00
Noah Levitt
d04c5a31cc
to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism
2016-06-29 00:02:25 +00:00
Noah Levitt
7a805a43d1
calm logging, don't print stacktrace on 430 from youtube-dl
2016-06-28 23:19:28 +00:00
Noah Levitt
e9c398caea
fix buglet in creation of new least_hops on pages table
2016-06-28 23:14:23 +00:00
Noah Levitt
77c800f6a2
renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
2016-06-28 15:13:48 -05:00
Noah Levitt
e64a4d6985
let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'"
2016-06-28 13:56:30 -05:00
Noah Levitt
772bcf0df6
handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988)
2016-06-28 12:23:32 -05:00
Noah Levitt
0bd687abde
avoid hanging in case a page has no outlinks
2016-06-28 11:25:04 -05:00
Noah Levitt
6a11e1da2a
fix noVNC submodule path since brozzler webconsole has moved
2016-06-28 16:15:16 +00:00
Noah Levitt
cb4a16e58c
handle new bucket format in brozzler-webconsole
2016-06-28 00:13:01 +00:00
Noah Levitt
98915b3d86
fix brozzler.svg symlink
2016-06-27 20:01:35 +00:00
Noah Levitt
89474cb430
convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
2016-06-27 14:12:56 -05:00
Noah Levitt
e4f8efe376
make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies
2016-06-27 12:43:24 -05:00
Noah Levitt
08a9636e95
remove crufty docker and no-docker scripts
2016-06-27 00:11:37 -05:00
Noah Levitt
55fda6e892
note python 3.4 requirement in readme
2016-06-25 15:44:27 -05:00
Noah Levitt
366e467501
enhancements to the page thumbnail on the site page
2016-06-22 23:09:27 +00:00
Noah Levitt
9b3f3809cc
expose full rethinkdb entry as yaml on job page
2016-06-22 22:29:07 +00:00
Noah Levitt
510456eef2
order the page thumbnails on site page by least number of hops, so the seed shows up first
2016-06-22 21:20:00 +00:00
Noah Levitt
2038598f41
fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''}
2016-06-22 17:43:53 +00:00
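One common way to end up with `{''}` in Python is splitting an empty string, as sketched below. Whether that was the actual cause in `brozzler.browser.browse_page()` is an assumption, as are the helper name `parse_outlinks` and the newline-separated input format.

```python
def parse_outlinks(raw):
    """Split newline-separated URLs into a set, dropping empty entries.

    Hypothetical helper for illustration only.
    """
    return {line for line in raw.split("\n") if line}


# Splitting an empty string yields [''], so the naive set has one empty element.
naive = set("".split("\n"))

# Filtering out empty strings gives a genuinely empty set for a page without outlinks.
fixed = parse_outlinks("")
```

With this filter, callers can test `if outlinks:` or iterate directly without special-casing the empty string.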
Noah Levitt
d198a69e45
recurse through all frames to find outlinks
2016-06-22 11:39:31 -05:00
Noah Levitt
3b615120d4
Merge branch 'master' of github.com:internetarchive/brozzler
* 'master' of github.com:internetarchive/brozzler:
back to a dev version number
update url
back to a dev version number
beta version number for pypi upload
bump version number
call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7)
copy over fec.gov behavior from umbra master
2016-06-20 16:48:11 +00:00
Noah Levitt
5ccf5a9dcb
fix site warcprox-meta lookup now that it is a native json object rather than a string
2016-06-20 16:48:01 +00:00
Noah Levitt
8d7bb582cb
back to a dev version number
2016-06-16 14:19:00 -05:00
Noah Levitt
6237ed3f34
update url
1.1b2
2016-06-16 14:18:24 -05:00
Noah Levitt
9c2fe25dd0
back to a dev version number
2016-06-16 14:13:03 -05:00
Noah Levitt
b2d4cc5ff0
beta version number for pypi upload
2016-06-16 14:11:45 -05:00
Noah Levitt
81d709eed0
bump version number
2016-06-16 13:56:08 -05:00
Noah Levitt
1577cd8926
call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7)
2016-06-16 13:55:17 -05:00
Noah Levitt
98acc8dc92
copy over fec.gov behavior from umbra master
2016-06-16 13:53:28 -05:00
Noah Levitt
d75e8c394a
switch flask requirement to recent release, suggest gunicorn for running the app
2016-06-15 22:00:39 +00:00
Noah Levitt
b0ed4b8128
Merge pull request #7 from galgeek/disable_extensions
disable browser extensions
2016-06-06 11:57:46 -07:00
Noah Levitt
c63c21c30a
Merge pull request #6 from ato/document-config
Document the job config format
2016-06-06 11:56:45 -07:00
Barbara Miller
1c1237d07e
disable browser extensions
2016-05-27 22:51:38 -07:00