Noah Levitt
|
3bf3c80720
|
implement timeout and retries to work around issue where sometimes we receive no result message after requesting outlinks
|
2016-07-06 17:54:36 -05:00 |
|
Noah Levitt
|
be58fb46f7
|
forgot to commit easy.py, add pywb.py with support for pywb rethinkdb index, and make brozzler-easy also run pywb
|
2016-07-06 14:52:00 -05:00 |
|
Noah Levitt
|
3b252002b7
|
working on brozzler-easy, single process with brozzler-worker and warcprox working together (pywb to be added)
|
2016-07-05 18:46:42 -05:00 |
|
Noah Levitt
|
1a7b94cae7
|
twirldown for site yaml on site page
|
2016-07-05 21:42:36 +00:00 |
|
Noah Levitt
|
f825e76371
|
give master a version number considered later than the one up on pypi (1.1b3.dev45 > 1.1b2)
|
2016-07-05 10:44:48 -05:00 |
|
Noah Levitt
|
0b9ce94226
|
in vagrant/ansible, install brozzler from this checkout instead of from github master
|
2016-07-01 15:45:39 -05:00 |
|
Noah Levitt
|
3e128d2b27
|
option to save list of outlinks (categorized as "accepted", "blocked" (by robots), or "rejected") per page in rethinkdb (to be used by archive-it for out-of-scope reporting)
|
2016-07-01 15:23:46 -05:00 |
|
Noah Levitt
|
01e38ea8c7
|
oops didn't mean to leave that windows-only subprocess flag
|
2016-07-01 14:07:04 -05:00 |
|
Noah Levitt
|
ad502f33da
|
remove accidentally committed playbook.retry
|
2016-06-30 17:56:56 -05:00 |
|
Noah Levitt
|
2aef00826b
|
vagrant setup (unfinished)
|
2016-06-30 17:50:11 -05:00 |
|
Noah Levitt
|
79ad57669c
|
do not send more than one SIGTERM when shutting down browser process, because on recent chromium on linux, the second sigterm abruptly ends the process, and sometimes leaves orphan subprocesses; also send TERM/KILL signals to the whole process group, another measure to avoid orphans; and adjust logging levels for captured chrome output
|
2016-06-30 17:10:27 -05:00 |
|
Noah Levitt
|
371590b578
|
command line utility brozzler-ensure-tables, creates rethinkdb tables if they don't already exist... brozzler normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster
|
2016-06-30 15:16:04 -05:00 |
|
Noah Levitt
|
9fd78fdbe8
|
implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top
|
2016-06-30 11:45:19 -05:00 |
|
Noah Levitt
|
a1910fc0fe
|
avoid "AttributeError: 'ExtractorError' object has no attribute 'code'" checking for 430 (soft limit) from youtube-dl
|
2016-06-29 19:57:51 -05:00 |
|
Noah Levitt
|
79beddfc44
|
set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure
|
2016-06-29 19:47:25 -05:00 |
|
Noah Levitt
|
2e687b65fb
|
fix case where rethinkdb page already has claimed=True
|
2016-06-29 19:29:18 -05:00 |
|
Noah Levitt
|
ffcf26b6c9
|
undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3)
|
2016-06-29 18:53:32 -05:00 |
|
Noah Levitt
|
7431ae0eb1
|
fix bug preventing brozzler-new-site from working, add note about brozzler-new-site in readme
|
2016-06-29 18:41:45 -05:00 |
|
Noah Levitt
|
479713e25b
|
--trace level logging
|
2016-06-29 18:29:45 -05:00 |
|
Noah Levitt
|
d04c5a31cc
|
to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism
|
2016-06-29 00:02:25 +00:00 |
|
Noah Levitt
|
7a805a43d1
|
calm logging, don't print stacktrace on 430 from youtube-dl
|
2016-06-28 23:19:28 +00:00 |
|
Noah Levitt
|
e9c398caea
|
fix buglet in creation of new least_hops on pages table
|
2016-06-28 23:14:23 +00:00 |
|
Noah Levitt
|
77c800f6a2
|
renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
|
2016-06-28 15:13:48 -05:00 |
|
Noah Levitt
|
e64a4d6985
|
let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'"
|
2016-06-28 13:56:30 -05:00 |
|
Noah Levitt
|
772bcf0df6
|
handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988)
|
2016-06-28 12:23:32 -05:00 |
|
Noah Levitt
|
0bd687abde
|
avoid hanging in case a page has no outlinks
|
2016-06-28 11:25:04 -05:00 |
|
Noah Levitt
|
6a11e1da2a
|
fix noVNC submodule path since brozzler webconsole has moved
|
2016-06-28 16:15:16 +00:00 |
|
Noah Levitt
|
cb4a16e58c
|
handle new bucket format in brozzler-webconsole
|
2016-06-28 00:13:01 +00:00 |
|
Noah Levitt
|
89474cb430
|
convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
|
2016-06-27 14:12:56 -05:00 |
|
Noah Levitt
|
e4f8efe376
|
make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies
|
2016-06-27 12:43:24 -05:00 |
|
Noah Levitt
|
366e467501
|
enhancements to the page thumbnail on the site page
|
2016-06-22 23:09:27 +00:00 |
|
Noah Levitt
|
9b3f3809cc
|
expose full rethinkdb entry as yaml on job page
|
2016-06-22 22:29:07 +00:00 |
|
Noah Levitt
|
510456eef2
|
order the page thumbnails on site page by least number of hops, so the seed shows up first
|
2016-06-22 21:20:00 +00:00 |
|
Noah Levitt
|
2038598f41
|
fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''}
|
2016-06-22 17:43:53 +00:00 |
|
Noah Levitt
|
d198a69e45
|
recurse through all frames to find outlinks
|
2016-06-22 11:39:31 -05:00 |
|
Noah Levitt
|
8d7bb582cb
|
back to a dev version number
|
2016-06-16 14:19:00 -05:00 |
|
Noah Levitt
|
6237ed3f34
|
update url
|
2016-06-16 14:18:24 -05:00 |
|
Noah Levitt
|
9c2fe25dd0
|
back to a dev version number
|
2016-06-16 14:13:03 -05:00 |
|
Noah Levitt
|
b2d4cc5ff0
|
beta version number for pypi upload
|
2016-06-16 14:11:45 -05:00 |
|
Noah Levitt
|
81d709eed0
|
bump version number
|
2016-06-16 13:56:08 -05:00 |
|
Noah Levitt
|
182cbfd0ce
|
bump version so we can upload to pypi and fix the readme
|
2016-05-11 12:10:23 -07:00 |
|
Noah Levitt
|
dd2211df31
|
yes you can install brozzler from the outside world now!
|
2016-05-11 12:09:09 -07:00 |
|
Noah Levitt
|
6f6216e432
|
catch exception from rethinkdb when unregistering from the service registry at shutdown
|
2016-05-11 00:46:50 +00:00 |
|
Noah Levitt
|
1141c5951e
|
add psutil dependency
|
2016-05-09 17:19:53 -07:00 |
|
Noah Levitt
|
464da5c3a6
|
avoid errors with old versions of pip or non-utf-8 locales by specifying the encoding of README.rst
|
2016-05-07 01:46:15 +00:00 |
|
Noah Levitt
|
053767d393
|
bump version again
|
2016-05-05 10:37:58 -07:00 |
|
Noah Levitt
|
cea192b4b3
|
copy over latest behaviors and stuff from umbra
|
2016-05-05 00:58:26 -07:00 |
|
Noah Levitt
|
0af00bb3d5
|
support for host rules in outlink scoping
|
2016-05-03 20:52:22 +00:00 |
|
Noah Levitt
|
df61e55b6b
|
add license headers
|
2016-04-25 20:02:11 +00:00 |
|
Noah Levitt
|
2825ffea15
|
support for extra "blocks" and "accepts" scope rules
|
2016-04-21 22:22:44 +00:00 |
|