* master:
fix problem where each hashtag visited causes a page load if page url redirects
new test exposing problem where each hashtag visited causes a page load, if page redirects
``behavior_timeout`` was hardcoded to 900s. This MR makes it
configurable, with a default value of 900s, by adding a new variable to
``BrozzlerWorker`` and ``Browser``.
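
A minimal sketch of the idea, not the actual brozzler code: the worker accepts a timeout and hands it to each browser it creates, and the browser uses it as the upper bound on how long a behavior may run. The class names ``SketchWorker`` and ``SketchBrowser``, the ``run_behavior`` method and the ``is_finished`` callback are hypothetical; only the name ``behavior_timeout`` and the default of 900 come from the description above.

```python
import time


class SketchBrowser:
    """Hypothetical stand-in for a browser wrapper with a configurable timeout."""

    def __init__(self, behavior_timeout=900):
        # Default matches the previously hardcoded value of 900 seconds.
        self.behavior_timeout = behavior_timeout

    def run_behavior(self, is_finished, poll_interval=0.5):
        """Poll is_finished() until it returns True or the timeout elapses.

        Returns True if the behavior finished, False if it timed out.
        """
        start = time.monotonic()
        while time.monotonic() - start < self.behavior_timeout:
            if is_finished():
                return True
            time.sleep(poll_interval)
        return False


class SketchWorker:
    """Hypothetical stand-in for a worker that passes the setting through."""

    def __init__(self, behavior_timeout=900):
        self.behavior_timeout = behavior_timeout

    def new_browser(self):
        # The worker-level setting is forwarded to every browser it creates.
        return SketchBrowser(behavior_timeout=self.behavior_timeout)
```

Keeping the default at 900 means callers that do not pass the new argument see no change in behavior.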
* master:
bump dev version number after some PR merges
bugfix for BrozzlerWorker._needs_browsing
Remove redundant method parameter.
bugfix
Make youtube-dl optional in BrozzlerWorker.brozzle_page
Brozzler previously always performed these actions. This MR makes it
possible to skip them via options passed to `brozzler-worker`.
This is useful for tasks where we only need to retrieve a specific
page and do not need to extract outlinks to continue crawling.
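
A rough sketch of how such skip options might gate the steps of brozzling a page. It assumes the skippable actions include running youtube-dl (made optional in the commit listed above) and extracting outlinks (mentioned in the description). The function name ``brozzle_page_sketch`` and the keyword names ``skip_youtube_dl`` and ``skip_extract_outlinks`` are illustrative, not confirmed option names.

```python
def brozzle_page_sketch(url, skip_youtube_dl=False, skip_extract_outlinks=False):
    """Return the list of steps that would run for one page.

    Hypothetical illustration: a real worker would perform the steps
    instead of recording their names.
    """
    performed = []
    if not skip_youtube_dl:
        performed.append("youtube-dl")
    # Browsing the page itself is never skipped in this sketch.
    performed.append("browse")
    if not skip_extract_outlinks:
        performed.append("extract-outlinks")
    return performed
```

With both options left at their defaults every step runs, which matches the old always-on behavior; a single-page retrieval would pass both skip options.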
* master:
re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
add a github PR template for this repo
update headless chrome instructions for regular chrome builds
use the new api `with brozzler.thread_accept_exceptions()`
refactor thread_raise safety to use a context manager
allow this stupid test to fail
improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
* master:
raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
raise new exception brozzler.ProxyError in case of proxy error browsing a page
make brozzle-page respect --proxy (no test for this!)
oops, version bump for previous commit
bubble up proxy errors fetching robots.txt, with unit test, and documentation
* master:
restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select()
missed a spot
improve brozzler-dashboard logging; fix default wayback baseurl in brozzler dashboard (https://github.com/internetarchive/brozzler/issues/31); tweak arg parsing related stuff
avoid js errors in case site or job is not configured to keep stats
add travis-ci slack notification to internetarchive/brozzler channel
* master:
fix oversight including username/password in site config when starting a new job
restore BrozzlerWorker built-in support for managing its own thread
restore handling of 420 Reached limit, with a rudimentary test
add import missing from test
restore support for on_response and on_request, with an automated test for on_response
* master:
need $DISPLAY set for test_brozzling.py
restore handling of "aw snap" or "he's dead jim"
add seed username/password parameters to job config schema
loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated
convert mouseovers and simpleclicks to jinja2
remove obsolete facebook login code
convert behaviors to jinja2, move them to new subdir js-templates, along with javascript previously stored as a string in browser.py
add hack for submitting a login form containing an element with name or id "submit", which masks the form submit() method
how did i miss this file?
forgot to git add new test data
detect <input type="email"> as potential username field for login
generalized support for login doing automatic detection of login form on a page
yet more refactoring of browser.py, clearer separation of purpose, Browser class manages browsing, sends most of the messages to chrome, WebsockReceiverThread handles messages that come back from chrome
bump version number in setup.py
major refactoring of browsing code to make it easier to add functionality
back to dev version number
i dub thee 1.1b8