Commit Graph

  • e3c23a0f2b Merge pull request #25 from vonrosen/ari-3724 Noah Levitt 2014-06-06 15:15:24 -07:00
  • d40b542ffe Merge pull request #1 from vonrosen/ari-3724 vonrosen 2014-06-06 10:51:09 -07:00
  • 41270af223 Allow flash requests to be detected. Hunter Stern 2014-06-06 10:47:29 -07:00
  • e8456e0a62 Merge pull request #24 from nlevitt/dev vonrosen 2014-06-05 12:00:37 -07:00
  • dd2d36328f scroll up faster on facebook Noah Levitt 2014-06-04 12:34:20 -07:00
  • c2153be288 start behaviors again on any Page.loadEventFired, because if we don't do that, we keep asking the page if the behavior thinks it's finished, and it doesn't know what we're talking about Noah Levitt 2014-06-03 18:06:02 -07:00
  • bfb6cac25f use temp dir as $HOME instead of just chromium user-data-dir, because sometimes we have been seeing chrome print this error message and hang "[1975:2001:0603/215855:ERROR:nss_util.cc(444)] Error initializing NSS with a persistent database (sql:/home/archiveit/.pki/nssdb): NSS error code: -8187" Noah Levitt 2014-06-03 16:02:00 -07:00
  • e619e013b6 sleep for 5 seconds after starting a browser, since starting 20 at once brings the computer to its knees Noah Levitt 2014-06-03 15:57:12 -07:00
  • 1f91018d91 even more patience killing chrome, send another sigterm every ten seconds if chrome is still alive Noah Levitt 2014-06-02 12:09:15 -07:00
  • c6bd2417d7 good smarter killing of chrome Noah Levitt 2014-06-02 11:58:11 -07:00
  • 1ae9b83dab Merge branch 'dev' of github.com:nlevitt/umbra into dev Noah Levitt 2014-05-30 23:07:54 -07:00
  • 56a721f059 dump stack trace and don't return browser to pool on critical error where chrome process might still be running Noah Levitt 2014-05-30 23:07:39 -07:00
  • b2e27b99d2 nice log message when fully shut down Noah Levitt 2014-05-30 17:32:01 -07:00
  • c9d503e690 log version number at startup Noah Levitt 2014-05-30 15:00:01 -07:00
  • ed92f3bd53 for the version string, use abbreviated commit hash instead of attempting to use the branch name Noah Levitt 2014-05-29 23:33:14 -07:00
  • bef57e2819 for version string, try to handle case where head is detached Noah Levitt 2014-05-29 20:57:33 -07:00
  • 3127e02cbb fancy --version that includes git branch and timestamp of last commit if available Noah Levitt 2014-05-29 20:43:00 -07:00
  • 0bcc583b40 think it's safer to use a range of ports 9200 thru 9200+n than to try to choose random ports and hold them with socket.bind() (don't know how we can be sure a port is available) Noah Levitt 2014-05-29 17:55:00 -07:00
  • 94c2e4390b debugging to and mitigation for problem "[Errno 98] Address already in use" Noah Levitt 2014-05-28 18:57:21 -07:00
  • 2dc30cc8bc Merge pull request #23 from nlevitt/master vonrosen 2014-05-28 12:38:11 -07:00
  • 9c08be2699 sigterm and sigint both request shutdown, which stops consuming urls and waits for active browsers to finish; a second sigint/sigterm immediately shuts down active browsers Noah Levitt 2014-05-24 01:52:22 -07:00
  • b67d9fadf0 log ports chosen for browsers, and give threads nice names to make logs easier to understand Noah Levitt 2014-05-23 22:30:25 -07:00
  • 2c4ba005b5 make umbra amenable to clustering by using a pool of n browsers and removing the browser-clientId affinity (not useful currently since we start a fresh browser instance for each page browsed), and set prefetch_count=1 on amqp consumers to round-robin incoming urls among umbra instances Noah Levitt 2014-05-23 21:59:34 -07:00
  • 8d269f4c56 add options --verbose, --exchange, --queue, --routing-key Noah Levitt 2014-05-23 13:39:39 -07:00
  • bd3f979b56 capitalize AMQP in description Noah Levitt 2014-05-23 13:39:08 -07:00
  • 6f61d0289b improve readme, mentioning archive-it per kristine Noah Levitt 2014-05-23 13:34:51 -07:00
  • a7cd872b95 sleep for 0.5 sec before attempting to reconnect to amqp; documentation tweaks Noah Levitt 2014-05-23 13:34:07 -07:00
  • 155db96461 provide abbreviated api Noah Levitt 2014-05-23 13:27:00 -07:00
  • bf3afcccb9 oops, Browser.__init__ doesn't take client_id anymore Noah Levitt 2014-05-20 19:27:53 -07:00
  • d7cfcbf233 new helper utility to browse urls provided as command line args Noah Levitt 2014-05-20 17:11:16 -07:00
  • 6c69b68771 organize imports, tweak command line args Noah Levitt 2014-05-20 17:10:41 -07:00
  • d4693b2aba remove unused param to __init__, avoid exception when on_request callback not provided Noah Levitt 2014-05-20 17:07:42 -07:00
  • 99d219dfda not sure why /bin/ et al were in .gitignore... replace with a couple of useful things Noah Levitt 2014-05-20 17:06:26 -07:00
  • 1e18c2ca74 improve helper utilities Noah Levitt 2014-05-20 16:44:13 -07:00
  • 8749b97811 oops, check in browser.py Noah Levitt 2014-05-20 03:10:33 -07:00
  • b59e76a5b9 clean shutdown without draining entire amqp queue (only consume urls from amqp when browser activity isn't saturated) Noah Levitt 2014-05-20 03:02:48 -07:00
  • 3e4232f32c refactor umbra.py into controller.py and browser.py, improve class names Noah Levitt 2014-05-20 02:42:40 -07:00
  • 6fdcdd0bf0 configurable max number of instances of chrome simultaneously browsing pages (default=3); close and reopen connection to amqp every 15 minutes (consumer only); increase default browser wait to 60 sec Noah Levitt 2014-05-20 01:09:11 -07:00
  • cc0ffee508 only websocket-client-py3==0.13.1 works right with python3 at the moment, see https://github.com/liris/websocket-client/issues/84 Noah Levitt 2014-05-20 00:57:07 -07:00
  • 154eb6f334 Merge pull request #22 from nlevitt/master Eldon 2014-05-06 09:13:56 -04:00
  • 05e673917d "wasThrown" is necessarily always included in the result message from chrome for Runtime.evaluate Noah Levitt 2014-05-05 19:58:41 -07:00
  • 93b16f28b9 improve facebook behavior: when we expect a "close" button to appear, wait for it before moving on to other actions; and when we discover a missed click target above, scroll back up to click on it Noah Levitt 2014-05-05 18:39:16 -07:00
  • fa6e3eebb2 clear UmbraWorker.self._behavior when finished with a page (after the first page, nothing was getting behaviors); bump hard timeout to 20 minutes Noah Levitt 2014-05-05 18:37:39 -07:00
  • 55fad80553 UmbraWorker.send_to_chrome() - central place to send message to chrome via websocket Noah Levitt 2014-05-05 12:26:39 -07:00
  • a62a07e6b7 change magic first line of behavior js files to a commented-out json blob, which should include the fields 'url_regex' and 'request_idle_timeout_sec'; behavior.is_finished() incorporates the custom idle timeout into its check; also rename variables in behavior scripts with umbra/UMBRA_ prefix to sort of namespace them; and add "finished" logic to facebook and vimeo behaviors (flickr needs work to support it) Noah Levitt 2014-05-05 11:58:55 -07:00
  • 2a9633ad77 Bunch of improvements, most importantly a default fallback behavior script which scrolls to the bottom of the page, and rearchitecting some stuff so that the behavior script can have some say on when it's finished with the page. Also some doc comments. Noah Levitt 2014-05-04 21:33:13 -07:00
  • 602459bb42 Merge pull request #21 from nlevitt/disable-google-analytics Adam Miller 2014-05-02 18:32:35 -07:00
  • 8679ee0ea7 disable google analytics by setting a breakpoint in www.google-analytics.com/analytics.js and replacing the content of that script when the breakpoint is hit Noah Levitt 2014-05-02 18:30:28 -07:00
  • d6b696ded8 Merge pull request #20 from adam-miller/master Noah Levitt 2014-05-02 17:42:53 -07:00
  • 9cf20f195c Removing first run ui checks Adam Miller 2014-05-02 17:37:10 -07:00
  • e7353fbb4b Merge pull request #19 from nlevitt/ari-3814 Eldon 2014-04-09 13:25:22 -04:00
  • 89e41e7c82 remove exception raised for testing Noah Levitt 2014-04-07 11:45:54 -07:00
  • aacb886b62 ARI-3814 try to recover from rabbitmq communication problems Noah Levitt 2014-04-07 11:45:12 -07:00
  • 4e72cbae58 Merge pull request #18 from nlevitt/ari-3771 Eldon 2014-04-04 16:04:38 -04:00
  • beeb4a2a2c Merge pull request #17 from nlevitt/ari-3811 Eldon 2014-04-04 15:21:41 -04:00
  • be9115fd11 to address ARI-3771 "Lasalle Facebook last scrolldown doesn't work", scroll by 200 pixels each time instead of 100 on facebook, which avoids hitting the 15 second idle timeout in my tests; also detect when unclicked targets are above the screen/viewport and not below and log it as such, instead of trying to continue scrolling down Noah Levitt 2014-04-04 12:16:00 -07:00
  • da975bc586 thread dump on SIGQUIT a la java Noah Levitt 2014-04-03 21:19:08 -07:00
  • e1c297269c Merge pull request #15 from nlevitt/master Eldon 2014-03-13 11:09:34 -04:00
  • f3a540b92d setup.py - include behaviors.d/*.js in installation Noah Levitt 2014-03-13 00:00:32 -07:00
  • b3bd959ab2 Merge pull request #14 from eldondev/master vonrosen 2014-03-10 12:01:20 -07:00
  • 427b74ebfc Check to see if the object has a click method before calling it Eldon 2014-03-10 14:58:16 -04:00
  • a16ce4abeb Merge pull request #13 from nlevitt/master vonrosen 2014-03-09 16:47:19 -07:00
  • 3fd792fddb lengthen timeouts and improve timeout handling; log js console messages from browser Noah Levitt 2014-03-07 19:39:27 -08:00
  • 5637e7111f use *rel=["theater"] to click on photos and videos that won't navigate to a new page; don't click on comments links for now, since it might interfere with other stuff; more verbose logging of click targets Noah Levitt 2014-03-07 19:37:43 -08:00
  • a0f8474a73 Merge pull request #12 from nlevitt/master vonrosen 2014-03-07 11:32:14 -08:00
  • 5a7a24083f simplify checking for *.js Noah Levitt 2014-03-07 11:29:43 -08:00
  • a30b5d8dd2 only reset idle timer on Network.requestWillBeSent instead of all events (otherwise long-running videos keep the browser open unnecessarily) Noah Levitt 2014-03-06 18:35:04 -08:00
  • 9d9014c864 start the hard stop timer Noah Levitt 2014-03-06 18:32:30 -08:00
  • 52db581a3c restore logging Noah Levitt 2014-03-06 18:25:46 -08:00
  • 12d66982d1 only load behaviors files named like *.js (avoids vim .swp files and stuff); tweak logging Noah Levitt 2014-03-06 18:25:35 -08:00
  • 9cb9172a4d behavior for vimeo - click on <video> elements Noah Levitt 2014-03-06 18:24:12 -08:00
  • 9848c41d5f make regexes the same that crawlman puts in crawler-beans.cxml Noah Levitt 2014-03-06 18:23:31 -08:00
  • 5b1992a8c0 Merge pull request #11 from eldondev/master vonrosen 2014-03-06 11:08:45 -08:00
  • 393df3f16e Update behaviors for facebook theater Eldon 2014-03-05 23:44:52 -05:00
  • f2f78d2ced Convert from one big json file, to js files with a regex as a comment at the top. Eldon 2014-03-05 23:19:09 -05:00
  • 4c22891093 Merge pull request #10 from nlevitt/master Eldon 2014-02-25 17:35:26 -05:00
  • b763d6550f remove unused function Noah Levitt 2014-02-25 14:26:10 -08:00
  • b4675a7cd2 Merge pull request #9 from nlevitt/master Eldon 2014-02-25 16:23:27 -05:00
  • 11da122ec2 remove old commented out line of code Noah Levitt 2014-02-18 13:20:18 -08:00
  • b96d8856d4 create temp dir for user profile rather than rely on --temp-profile Noah Levitt 2014-02-14 19:45:16 -08:00
  • b4846e1063 scrolldown seems to get everything for flickr and facebook at the moment Noah Levitt 2014-02-14 17:57:04 -08:00
  • 28282641f2 add a little logging Noah Levitt 2014-02-14 15:18:10 -08:00
  • 2368688fbe Merge remote-tracking branch 'eldondev/master' into nlevitt-master (add behaviors) Noah Levitt 2014-02-14 15:10:23 -08:00
  • 3389c5a66d remove some extraneous debug logging Noah Levitt 2014-02-13 18:36:08 -08:00
  • fe15932c26 Click on photos in gallery behavior Eldon 2014-02-13 13:37:08 -05:00
  • af01fcbcfe Add more flickr behavior Eldon 2014-02-13 13:32:34 -05:00
  • 445288d5e7 First few behaviors Eldon 2014-02-13 01:00:39 -05:00
  • f69edd5a87 handle multiple clients, browsers Noah Levitt 2014-02-13 01:59:09 -08:00
  • 4dbe111aee Merge branch 'master' into nlevitt-master Noah Levitt 2014-02-12 18:15:05 -08:00
  • fe1c68af90 Merge pull request #7 from eldondev/master Noah Levitt 2014-02-12 18:13:52 -08:00
  • bdf00cc515 Refactor to pull Chrome execution inside of umbra, simplify some things Eldon 2014-02-12 19:31:03 -05:00
  • dd871f3a6a Merge pull request #6 from nlevitt/master Eldon 2014-02-12 16:09:04 -05:00
  • 4935d55b6e specify classifier 'Programming Language :: Python :: 3.3' since websocket-client-py3 requires python 3.3, doesn't work with 3.2 Noah Levitt 2014-02-12 12:17:41 -08:00
  • f9d56d3071 formatting change only - indent with 4 spaces Noah Levitt 2014-02-10 20:45:18 -08:00
  • 02fbe725cb cache parent url metadata and send back via amqp with child urls Noah Levitt 2014-02-10 20:40:06 -08:00
  • f8c5a08c1b Merge pull request #5 from eldondev/master Noah Levitt 2014-01-28 14:57:20 -08:00
  • 5588eedbe2 Update readme Eldon 2014-01-28 00:12:33 -05:00
  • 8afe7d90a2 Replace js evaluation with direct page navigation, add default for dump_queue Eldon 2014-01-28 00:10:31 -05:00
  • 7d89d1bed1 Merge pull request #4 from nlevitt/master Eldon 2014-01-27 20:32:59 -08:00
  • 8eb92b28e6 make load_url handle arguments similarly to umbra Noah Levitt 2014-01-27 19:34:54 -08:00
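
Commit a62a07e6b7 above describes behavior js files whose first line is a commented-out json blob with the fields 'url_regex' and 'request_idle_timeout_sec'. A minimal sketch of reading such a header might look like the following. It assumes the blob sits in a `//` line comment; the function name `parse_behavior_header`, the example script, and the example.com regex are hypothetical and are not taken from the umbra source.

```python
import json
import re

# Fields the commit message says the header blob should include.
REQUIRED_FIELDS = ("url_regex", "request_idle_timeout_sec")


def parse_behavior_header(script_text):
    """Return the metadata dict from the first line of a behavior script.

    Assumption: the first line is a '//' comment holding a JSON object.
    """
    first_line = script_text.split("\n", 1)[0].strip()
    if not first_line.startswith("//"):
        raise ValueError("first line is not a // comment")
    meta = json.loads(first_line[2:].strip())
    for field in REQUIRED_FIELDS:
        if field not in meta:
            raise ValueError("missing field: " + field)
    # Compile the regex once so callers can match urls against it directly.
    meta["url_regex"] = re.compile(meta["url_regex"])
    return meta


# Hypothetical behavior script: a JSON header line followed by page JavaScript.
EXAMPLE_SCRIPT = "\n".join([
    r'// {"url_regex": "^https?://(?:www\\.)?example\\.com/", "request_idle_timeout_sec": 10}',
    "window.scrollBy(0, 200);",
])

if __name__ == "__main__":
    meta = parse_behavior_header(EXAMPLE_SCRIPT)
    print(meta["url_regex"].pattern, meta["request_idle_timeout_sec"])
```

Keeping the metadata in a comment means the file stays valid JavaScript that can be sent to the browser unchanged, while the controller only needs to read one line to decide which urls the behavior applies to.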