Commit graph

190 commits

Author SHA1 Message Date
Barbara Miller
5e7b3b73dd skip_youtube_dl 2017-09-29 14:33:23 -07:00
Noah Levitt
9422fb6a26 Merge branch 'master' into qa
* master:
  fix problem where each hashtag visited causes a page load if page url redirects
  new test exposing problem where each hashtag visited causes a page load, if page redirects
2017-09-27 14:11:30 -07:00
Noah Levitt
ec847e48bc fix problem where each hashtag visited causes a page load if page url redirects 2017-09-27 14:11:20 -07:00
Barbara Miller
17d410f000 Merge branch 'behavior_timeout' into qa 2017-08-31 10:31:55 -07:00
Vangelis Banos
bb93b04c23 Make behavior_timeout configurable
``behavior_timeout`` is hardcoded to 900s. With this MR we make it
configurable with a default value of 900. We add a new variable to
``BrozzlerWorker`` and ``Browser``.
2017-08-31 08:06:26 +00:00
Neil Minton
337945004c Merge branch 'aitfive-1295' into qa 2017-08-25 10:34:51 -07:00
Neil Minton
5ad7c9c7cc Revert "Log oulinks for all users of Browser."
This reverts commit 58b95fa7bf.

It was decided that this change didn't make sense for Brozzler.
2017-08-25 10:30:42 -07:00
Barbara Miller
96ba4f1a78 Merge branch 'configurable-page-timeout' into qa 2017-08-23 11:11:32 -07:00
Vangelis Banos
00513af877 Configurable page timeout
The page loading timeout was hard-coded to 300s. With this change,
we make it configurable with a default value of 300.
2017-08-23 08:05:14 +00:00
Neil Minton
3e8e699661 Merge branch 'aitfive-1295' into qa 2017-08-02 10:59:00 -07:00
Noah Levitt
5be7dd4407 Merge branch 'master' into qa
* master:
  bump dev version number after some PR merges
  bugfix for BrozzlerWorker._needs_browsing
  Remove redundant method parameter.
  bugfix
  Make youtube-dl optional in BrozzlerWorker.brozzle_page
2017-08-01 12:05:07 -07:00
Neil Minton
58b95fa7bf Log oulinks for all users of Browser. 2017-07-31 15:43:53 -07:00
Vangelis Banos
0343969807 Remove redundant method parameter.
``ignore_cert_errors`` is passed to ``Chrome`` via ``Browser`` via
``BrowserPool`` here:

https://github.com/internetarchive/brozzler/blob/master/brozzler/worker.py#L120

it is not doing anything in ``Browser.browser_page``.
2017-07-31 12:36:17 +00:00
Neil Minton
512931b6c8 Merge branch 'ari-5210' into qa 2017-07-12 17:30:06 -07:00
Vangelis Banos
89877670a4 --skip-extract-outlinks, --skip-visit-hashtags
Brozzler always did these actions. We make it possible to skip them with
this MR. Options are passed to `brozzler-worker`.
This feature is useful for tasks where we just need to retrieve a specific
page and we don't need to extract outlinks to continue crawling.
2017-07-04 21:50:05 +00:00
Barbara Miller
d41f30cbc7 Merge branch 'loginAndReloadSeed' into qa 2017-06-02 13:40:36 -07:00
Barbara Miller
a0330d9716 updates per Noah's review 2017-06-02 13:27:01 -07:00
Barbara Miller
830b0eef89 undo post-login nav (ARI-5385 and/or ARI-5386) 2017-06-02 12:47:19 -07:00
Noah Levitt
69d8571871 Merge branch 'master' into qa
* master:
  re-claim sites after 1 hour instead of 2 so that sites don't have to wait as long to be brozzled again in case of kill -9 brozzler-worker
  add a github PR template for this repo
  update headless chrome instructions for regular chrome builds
  use the new api `with brozzler.thread_accept_exceptions()`
  refactor thread_raise safety to use a context manager
  allow this stupid test to fail
  improve messaging when brozzler-stop-crawl is passed nonexistent seed/job id
  safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such
2017-05-01 13:00:34 -07:00
Noah Levitt
d916b68ab9 use the new api with brozzler.thread_accept_exceptions() 2017-04-24 20:02:34 -07:00
Noah Levitt
7706bab8b8 safen up brozzler.thread_raise() to avoid interrupting rethinkdb transactions and such 2017-04-20 17:08:16 -07:00
Noah Levitt
6844cb5bcb Merge branch 'master' into qa
* master:
  raise brozzler.ProxyError in case of proxy error fetching robots.txt, doing youtube-dl, or doing raw fetch
  raise new exception brozzler.ProxyError in case of proxy error browsing a page
  make brozzle-page respect --proxy (no test for this!)
  oops, version bump for previous commit
  bubble up proxy errors fetching robots.txt, with unit test, and documentation
2017-04-17 18:15:32 -07:00
Noah Levitt
349b41ab32 raise new exception brozzler.ProxyError in case of proxy error browsing a page 2017-04-17 18:14:02 -07:00
Noah Levitt
a83c11b302 Merge branch 'master' into qa
* master:
  new model for crawling hashtags, each one is no longer a top-level page
  remove some vestiges of old proxy stuff
2017-03-27 12:16:11 -07:00
Noah Levitt
3d47805ec1 new model for crawling hashtags, each one is no longer a top-level page 2017-03-27 12:15:49 -07:00
Noah Levitt
63474c09f2 Merge branch 'master' into qa
* master:
  use urlcanon library for canonicalization, surtification, scope match rules
  more automated tests of frontier stuff
2017-03-15 15:00:01 -07:00
Noah Levitt
12fb9eaa15 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 14:59:51 -07:00
Noah Levitt
95f362d49a Merge branch 'master' into qa
* master:
  use new rethinkstuff ORM
2017-02-28 16:12:58 -08:00
Noah Levitt
700b08b7d7 use new rethinkstuff ORM 2017-02-28 16:12:50 -08:00
Noah Levitt
cb75bb6e04 Merge branch 'master' into qa
* master:
  let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port
2017-02-22 12:44:27 -08:00
Noah Levitt
2398031010 let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port 2017-02-22 12:44:19 -08:00
Noah Levitt
23601e2e0a Merge branch 'master' into qa
* master:
  handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭
2017-02-22 10:57:27 -08:00
Noah Levitt
3c4ab834da handle errors from extract-outlinks.js, which happens on polyvore.com because it changes the definition of Set 😭 2017-02-22 10:57:11 -08:00
Noah Levitt
f6fdb91d57 Merge branch 'master' into qa
* master:
  add --yaml option to brozzler-list-* commands
  take screenshot before running behavior (but after login) - thanks danielbicho
2017-02-15 23:13:32 +00:00
Noah Levitt
1054e8e3cb take screenshot before running behavior (but after login) - thanks danielbicho 2017-02-15 09:13:44 -08:00
Noah Levitt
08752a5163 Merge branch 'master' into qa
* master:
  logging tweaks
2017-02-10 15:19:35 -08:00
Noah Levitt
e58f4b7c44 logging tweaks 2017-02-10 15:19:28 -08:00
Noah Levitt
aa22594928 Merge branch 'master' into qa
* master:
  fix TypeError: not all arguments converted during string formatting
2017-02-03 17:24:53 -08:00
Noah Levitt
09fa41f959 fix TypeError: not all arguments converted during string formatting 2017-02-03 17:24:47 -08:00
Noah Levitt
8c116295ea Merge branch 'master' into qa
* master:
  restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select()
  missed a spot
  improve brozzler-dashboard logging; fix default wayback baseurl in brozzler dashboard (https://github.com/internetarchive/brozzler/issues/31); tweak arg parsing related stuff
  avoid js errors in case site or job is not configured to keep stats
  add travis-ci slack notification to internetarchive/brozzler channel
2017-01-24 09:56:14 -08:00
Noah Levitt
d22cc075e0 restore ping_timeout argument to WebSocketApp.run_forever to fix problem of leaking websocket receiver threads hanging forever on select() 2017-01-24 09:55:56 -08:00
Noah Levitt
58bac8fc83 Merge branch 'master' into qa
* master:
  adapt to exception message from newer versions of chromium (e.g. 57.0.2981.0)
2017-01-13 12:08:09 -08:00
Noah Levitt
77c4dc1116 adapt to exception message from newer versions of chromium (e.g. 57.0.2981.0) 2017-01-13 12:08:00 -08:00
Noah Levitt
87eeaf7888 Merge branch 'master' into qa
* master:
  tests for dismissal of javascript dialogs (alert, prompt, confirm)
  dismiss alerts from the page being browsed (avoids hanging)
2017-01-13 11:46:52 -08:00
Noah Levitt
011d814ee2 tests for dismissal of javascript dialogs (alert, prompt, confirm) 2017-01-13 11:46:42 -08:00
Noah Levitt
d2ed6b97a2 dismiss alerts from the page being browsed (avoids hanging) 2017-01-13 10:27:37 -08:00
Noah Levitt
4e7f9f8690 Merge branch 'master' into qa
* master:
  fix oversight including username/password in site config when starting a new job
  restore BrozzlerWorker built-in support for managing its own thread
  restore handling of 420 Reached limit, with a rudimentary test
  add import missing from test
  restore support for on_response and on_request, with an automated test for on_response
2017-01-06 13:03:25 -08:00
Noah Levitt
70b67942a5 restore handling of 420 Reached limit, with a rudimentary test 2016-12-22 13:44:09 -08:00
Noah Levitt
eabb0fb114 restore support for on_response and on_request, with an automated test for on_response 2016-12-21 18:35:55 -08:00
Noah Levitt
422a5ad726 Merge branch 'master' into qa
* master:
  need $DISPLAY set for test_brozzling.py
  restore handling of "aw snap" or "he's dead jim"
  add seed username/password parameters to job config schema
  loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated
  convert mouseovers and simpleclicks to jinja2
  remove obsolete facebook login code
  convert behaviors to jinja2, move them to new subdir js-templates, along with javascript previously stored as a string in browser.py
  add hack for submitting a login form containing an element with name or id "submit", which masks the form submit() method
  how did i miss this file?
  forgot to git add new test data
  detect <input type="email"> as potential username field for login
  generalized support for login doing automatic detection of login form on a page
  yet more refactoring of browser.py, clearer separation of purpose, Browser class manages browsing, sends most of the messages to chrome, WebsockReceiverThread handles messages that come back from chrome
  bump version number in setup.py
  major refactoring of browsing code to make it easier to add functionality
  back to dev version number
  i dub thee 1.1b8
2016-12-21 18:11:56 -08:00
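
Note on commit 2398031010 ("let the OS pick an available port"): the log does not show the change itself, so what follows is only a minimal sketch of the general technique, not brozzler's actual code. Binding a socket to port 0 asks the operating system to assign a currently unused port. The helper name `find_free_port` and the default host are assumptions made for illustration.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Hypothetical helper: ask the OS for a currently unused TCP port.

    Binding to port 0 tells the kernel to pick any free port; the chosen
    number is then read back with getsockname().
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))           # port 0 means "let the OS choose"
        return sock.getsockname()[1]   # the port the OS actually assigned


port = find_free_port()
print(port)
```

The socket is closed before the port number is returned, so another process could in principle take the port before the caller uses it. Letting the OS choose still avoids the case where several processes scan for a free port at the same moment and settle on the same one.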