brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-08-08 06:22:23 -04:00

Author	SHA1	Message	Date
Barbara Miller	5e7b3b73dd	skip_youtube_dl	2017-09-29 14:33:23 -07:00
Noah Levitt	9422fb6a26	Merge branch 'master' into qa * master: fix problem where each hashtag visited causes a page load if page url redirects new test exposing problem where each hashtag visited causes a page load, if page redirects	2017-09-27 14:11:30 -07:00
Noah Levitt	ec847e48bc	fix problem where each hashtag visited causes a page load if page url redirects	2017-09-27 14:11:20 -07:00
Barbara Miller	17d410f000	Merge branch 'behavior_timeout' into qa	2017-08-31 10:31:55 -07:00
Vangelis Banos	bb93b04c23	Make behavior_timeout configurable ``behavior_timeout`` is hardcoded to 900s. With this MR we make it configurable with a default value of 900. We add a new variable to ``BrozzlerWorker`` and ``Browser``.	2017-08-31 08:06:26 +00:00
Neil Minton	337945004c	Merge branch 'aitfive-1295' into qa	2017-08-25 10:34:51 -07:00
Neil Minton	5ad7c9c7cc	Revert "Log oulinks for all users of Browser." This reverts commit `58b95fa7bf`. It was decided that this change didn't make sense for Brozzler.	2017-08-25 10:30:42 -07:00
Barbara Miller	96ba4f1a78	Merge branch 'configurable-page-timeout' into qa	2017-08-23 11:11:32 -07:00
Vangelis Banos	00513af877	Configurable page timeout The page loading timeout was hard-coded to 300s. With this change, we make it configurable with a default value of 300.	2017-08-23 08:05:14 +00:00
Neil Minton	3e8e699661	Merge branch 'aitfive-1295' into qa	2017-08-02 10:59:00 -07:00
Noah Levitt	5be7dd4407	Merge branch 'master' into qa * master: bump dev version number after some PR merges bugfix for BrozzlerWorker._needs_browsing Remove redundant method parameter. bugfix Make youtube-dl optional in BrozzlerWorker.brozzle_page	2017-08-01 12:05:07 -07:00
Neil Minton	58b95fa7bf	Log oulinks for all users of Browser.	2017-07-31 15:43:53 -07:00
Vangelis Banos	0343969807	Remove redundant method parameter. ``ignore_cert_errors`` is passed to ``Chrome`` via ``Browser`` via ``BrowserPool`` here: https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L120 it is not doing anything in ``Browser.browser_page``.	2017-07-31 12:36:17 +00:00
Neil Minton	512931b6c8	Merge branch 'ari-5210' into qa	2017-07-12 17:30:06 -07:00
Vangelis Banos	89877670a4	--skip-extract-outlinks, --skip-visit-hashtags Brozzler always did these actions. We make it possible to skip them with this MR. Options are passed to `brozzler-worker`. This feature is useful for tasks where we just need to retrieve a specific page and we don't need to extract outlinks to continue crawling.	2017-07-04 21:50:05 +00:00
Barbara Miller	d41f30cbc7	Merge branch 'loginAndReloadSeed' into qa	2017-06-02 13:40:36 -07:00
Barbara Miller	a0330d9716	updates per Noah's review	2017-06-02 13:27:01 -07:00
Barbara Miller	830b0eef89	undo post-login nav (ARI-5385 and/or ARI-5386)	2017-06-02 12:47:19 -07:00
Noah Levitt	69d8571871	Merge branch 'master' into qa * master: re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker add a github PR template for this repo update headless chrome instructions for regular chrome builds use the new api `with brozzler.thread_accept_exceptions()` refactor thread_raise safety to use a context manager allow this stupid test to fail improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such	2017-05-01 13:00:34 -07:00
Noah Levitt	d916b68ab9	use the new api `with brozzler.thread_accept_exceptions()`	2017-04-24 20:02:34 -07:00
Noah Levitt	7706bab8b8	safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such	2017-04-20 17:08:16 -07:00
Noah Levitt	6844cb5bcb	Merge branch 'master' into qa * master: raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch raise new exception brozzler.ProxyError in case of proxy error browsing a page make brozzle-page respect --proxy (no test for this!) oops, version bump for previous commit bubble up proxy errors fetching robots.txt, with unit test, and documentation	2017-04-17 18:15:32 -07:00
Noah Levitt	349b41ab32	raise new exception brozzler.ProxyError in case of proxy error browsing a page	2017-04-17 18:14:02 -07:00
Noah Levitt	a83c11b302	Merge branch 'master' into qa * master: new model for crawling hashtags, each one is no longer a top-level page remove some vestiges of old proxy stuff	2017-03-27 12:16:11 -07:00
Noah Levitt	3d47805ec1	new model for crawling hashtags, each one is no longer a top-level page	2017-03-27 12:15:49 -07:00
Noah Levitt	63474c09f2	Merge branch 'master' into qa * master: use urlcanon library for canonicalization, surtification, scope match rules more automated tests of frontier stuff	2017-03-15 15:00:01 -07:00
Noah Levitt	12fb9eaa15	use urlcanon library for canonicalization, surtification, scope match rules	2017-03-15 14:59:51 -07:00
Noah Levitt	95f362d49a	Merge branch 'master' into qa * master: use new rethinkstuff ORM	2017-02-28 16:12:58 -08:00
Noah Levitt	700b08b7d7	use new rethinkstuff ORM	2017-02-28 16:12:50 -08:00
Noah Levitt	cb75bb6e04	Merge branch 'master' into qa * master: let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port	2017-02-22 12:44:27 -08:00
Noah Levitt	2398031010	let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port	2017-02-22 12:44:19 -08:00
Noah Levitt	23601e2e0a	Merge branch 'master' into qa * master: handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭	2017-02-22 10:57:27 -08:00
Noah Levitt	3c4ab834da	handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭	2017-02-22 10:57:11 -08:00
Noah Levitt	f6fdb91d57	Merge branch 'master' into qa * master: add --yaml option to brozzler-list-* commands take screenshot before running behavior (but after login) - thanks danielbicho	2017-02-15 23:13:32 +00:00
Noah Levitt	1054e8e3cb	take screenshot before running behavior (but after login) - thanks danielbicho	2017-02-15 09:13:44 -08:00
Noah Levitt	08752a5163	Merge branch 'master' into qa * master: logging tweaks	2017-02-10 15:19:35 -08:00
Noah Levitt	e58f4b7c44	logging tweaks	2017-02-10 15:19:28 -08:00
Noah Levitt	aa22594928	Merge branch 'master' into qa * master: fix TypeError: not all arguments converted during string formatting	2017-02-03 17:24:53 -08:00
Noah Levitt	09fa41f959	fix TypeError: not all arguments converted during string formatting	2017-02-03 17:24:47 -08:00
Noah Levitt	8c116295ea	Merge branch 'master' into qa * master: restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select() missed a spot improve brozzler-dashboard logging; fix default wayback baseurl in brozzler dashboard (https://github.com/internetarchive/brozzler/issues/31); tweak arg parsing related stuff avoid js errors in case site or job is not configured to keep stats add travis-ci slack notification to internetarchive/brozzler channel	2017-01-24 09:56:14 -08:00
Noah Levitt	d22cc075e0	restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select()	2017-01-24 09:55:56 -08:00
Noah Levitt	58bac8fc83	Merge branch 'master' into qa * master: adapt to exception message from newer versions of chromium (e.g. 57.0.2981.0)	2017-01-13 12:08:09 -08:00
Noah Levitt	77c4dc1116	adapt to exception message from newer versions of chromium (e.g. 57.0.2981.0)	2017-01-13 12:08:00 -08:00
Noah Levitt	87eeaf7888	Merge branch 'master' into qa * master: tests for dismissal of javascript dialogs (alert, prompt, confirm) dismiss alerts from the page being browsed (avoids hanging)	2017-01-13 11:46:52 -08:00
Noah Levitt	011d814ee2	tests for dismissal of javascript dialogs (alert, prompt, confirm)	2017-01-13 11:46:42 -08:00
Noah Levitt	d2ed6b97a2	dismiss alerts from the page being browsed (avoids hanging)	2017-01-13 10:27:37 -08:00
Noah Levitt	4e7f9f8690	Merge branch 'master' into qa * master: fix oversight including username/password in site config when starting a new job restore BrozzlerWorker built-in support for managing its own thread restore handling of 420 Reached limit, with a rudimentary test add import missing from test restore support for on_response and on_request, with an automated test for on_response	2017-01-06 13:03:25 -08:00
Noah Levitt	70b67942a5	restore handling of 420 Reached limit, with a rudimentary test	2016-12-22 13:44:09 -08:00
Noah Levitt	eabb0fb114	restore support for on_response and on_request, with an automated test for on_response	2016-12-21 18:35:55 -08:00
Noah Levitt	422a5ad726	Merge branch 'master' into qa * master: need $DISPLAY set for test_brozzling.py restore handling of "aw snap" or "he's dead jim" add seed username/password parameters to job config schema loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated convert mouseovers and simpleclicks to jinja2 remove obsolete facebook login code convert behaviors to jinja2, move them to new subdir js-templates, along with javascript previously stored as a string in browser.py add hack for submitting a login form containing an element with name or id "submit", which masks the form submit() method how did i miss this file? forgot to git add new test data detect <input type="email"> as potential username field for login generalized support for login doing automatic detection of login form on a page yet more refactoring of browser.py, clearer separation of purpose, Browser class manages browsing, sends most of the messages to chrome, WebsockReceiverThread handles messages that come back from chrome bump version number in setup.py major refactoring of browsing code to make it easier to add functionality back to dev version number i dub thee 1.1b8	2016-12-21 18:11:56 -08:00

190 commits