Commit graph

190 commits

Author SHA1 Message Date
Noah Levitt
c6e6b34e82 handle case where websocket connection is unexpectedly closed during the post-behavior phase 2016-07-06 18:17:01 -05:00
Noah Levitt
3bf3c80720 implement timeout and retries to work around issue where sometimes we receive no result message after requesting outlinks 2016-07-06 17:54:36 -05:00
Noah Levitt
01e38ea8c7 oops didn't mean to leave that windows-only subprocess flag 2016-07-01 14:07:04 -05:00
Noah Levitt
79ad57669c do not send more than one SIGTERM when shutting down browser process, because on recent chromium on linux, the second sigterm abruptly ends the process, and sometimes leaves orphan subprocesses; also send TERM/KILL signals to the whole process group, another measure to avoid orphans; and adjust logging levels for captured chrome output 2016-06-30 17:10:27 -05:00
Noah Levitt
9fd78fdbe8 implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top 2016-06-30 11:45:19 -05:00
Noah Levitt
79beddfc44 set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure 2016-06-29 19:47:25 -05:00
Noah Levitt
ffcf26b6c9 undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3) 2016-06-29 18:53:32 -05:00
Noah Levitt
479713e25b --trace level logging 2016-06-29 18:29:45 -05:00
Noah Levitt
772bcf0df6 handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988) 2016-06-28 12:23:32 -05:00
Noah Levitt
0bd687abde avoid hanging in case a page has no outlinks 2016-06-28 11:25:04 -05:00
Noah Levitt
2038598f41 fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''} 2016-06-22 17:43:53 +00:00
Noah Levitt
d198a69e45 recurse through all frames to find outlinks 2016-06-22 11:39:31 -05:00
Barbara Miller
1c1237d07e disable browser extensions 2016-05-27 22:51:38 -07:00
Noah Levitt
317a5eb99d without sudo, psutil.net_connections() raises psutil.AccessDenied on mac; in this case, silently try running chrome on the unvetted configured port 2016-05-09 17:25:14 -07:00
Adam Miller
1f7f55a14a browser.py - Fix port search logic 2016-05-05 22:55:45 +00:00
Adam Miller
8e84465ff9 browser.py - Check for open ports before starting Chrome. Open next available on conflict 2016-05-05 22:31:07 +00:00
Noah Levitt
8d618ed135 refactor post-behavior stuff into separate interval function for clarity 2016-05-05 10:37:00 -07:00
Noah Levitt
31356d526a Merge branch 'master' into AITFIVE-832
* master:
  copy over latest behaviors and stuff from umbra
  support for host rules in outlink scoping
  recover from rethinkdb error updating service registry
2016-05-05 10:06:12 -07:00
Noah Levitt
cea192b4b3 copy over latest behaviors and stuff from umbra 2016-05-05 00:58:26 -07:00
Adam Miller
61cec15fff Restructure browser.py to take screenshot after behavior script. 2016-05-03 22:06:03 +00:00
Noah Levitt
df61e55b6b add license headers 2016-04-25 20:02:11 +00:00
Noah Levitt
68abb3cb94 log "behavior finished"/"hard timeout" only once 2016-04-21 22:02:50 +00:00
Noah Levitt
7bc726f717 fix bug preventing links from being extracted if hard timeout is reached 2016-04-20 17:24:18 -07:00
Noah Levitt
4874eaccbb Merge remote-tracking branch 'umbra/master'
* umbra/master:
  Handle Python to JS boolean conversion
  Allow clicking on already clicked element to continue in behaviors if click_until_hard_timeout is set to true
  Make Umbra click on 'Load More' button for youtube pages
  catch and log exception deleting temporary work directory
  update detection of modal close button for facebook changes
  Add custom behavior for Brooklyn Museum.
2016-03-07 17:37:12 -08:00
Noah Levitt
343b5c0f82 register with service registry; only start chrome right before using it, so that web console vnc windows aren't always full of about:blank 2015-11-12 02:56:27 +00:00
Noah Levitt
ddce1cdc71 fix mistakenly removed import; try to shut down chrome in case of unexpected exception 2015-08-19 20:04:46 +00:00
Noah Levitt
2533229fa1 add __all__ to modules 2015-08-19 19:01:28 +00:00
Noah Levitt
a878730e02 goodbye sqlite and rabbitmq, hello rethinkdb 2015-08-18 21:44:54 +00:00
Noah Levitt
fc75e18928 handle "aw snap" or "he's dead jim" from chrome 2015-08-11 18:14:53 +00:00
Noah Levitt
ce154fc3db more robustness improvements 2015-08-10 20:11:46 +00:00
Noah Levitt
a47292dab5 thread to read and selectively log output from chrome 2015-08-07 22:36:07 +00:00
Noah Levitt
e6eeca6ae2 handle 420 Reached limit when fetching robots in brozzler-hq 2015-08-01 17:54:29 +00:00
Noah Levitt
511e19ff4d handle 420 "Limit reached" when browser receives it 2015-08-01 01:26:59 +00:00
Noah Levitt
11fbbc9d49 change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc 2015-07-31 00:03:13 +00:00
Noah Levitt
f9c049a69e navigate to about:blank before the real url to avoid situation where we navigate to the same page that we're currently on, perhaps with a different #fragment, which prevents Page.loadEventFired from happening 2015-07-21 20:39:19 +00:00
Noah Levitt
b5cb94fc8b some additional logging and error handling to avoid mysterious messages 2015-07-21 06:33:02 +00:00
Noah Levitt
2ba5bd4d4b support adding extra http request headers 2015-07-17 13:45:27 -07:00
Noah Levitt
d2650a2547 update scope if seed redirects 2015-07-16 18:27:47 -07:00
Noah Levitt
140a441eb5 honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs 2015-07-16 17:19:12 -07:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Renamed from umbra/browser.py
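
Commit 2038598f41 above describes brozzler.browser.browse_page() returning {''} instead of an empty set when a page has no outlinks. The log does not show the code, so the following is only a minimal sketch of one common way such a bug can arise in Python, assuming the outlinks arrive as a single newline-joined string. The helper name parse_outlinks and that input format are hypothetical and not taken from the brozzler source.

```python
def parse_outlinks(raw):
    """Split a newline-joined string of URLs into a set, dropping blanks.

    Hypothetical helper for illustration only; not brozzler's actual code.
    """
    # "".split("\n") returns [""], so a naive set(raw.split("\n")) would
    # produce {""} for a page with no links. Filtering out empty entries
    # yields an empty set in that case.
    return {url for url in raw.split("\n") if url}


if __name__ == "__main__":
    # No outlinks: the naive approach gives {''}, the filtered one gives set()
    print(set("".split("\n")))
    print(parse_outlinks(""))
    # Two outlinks, with a trailing newline that would otherwise add ''
    print(sorted(parse_outlinks("http://example.com/a\nhttp://example.com/b\n")))
```

Filtering on truthiness also drops blank entries caused by trailing or doubled newlines, which may or may not be what a real crawler wants; treat it as an illustration of the failure mode rather than the fix that was actually committed.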