Commit Graph

  • 923cd98652 save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl) Noah Levitt 2015-07-15 16:32:02 -07:00
  • 5aea76ab6d refactor worker code into worker module Noah Levitt 2015-07-15 15:42:40 -07:00
  • 7b92ba39c7 avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately) Noah Levitt 2015-07-15 14:33:22 -07:00
  • 9b13f0c34c refactor hq code into hq module Noah Levitt 2015-07-15 14:27:21 -07:00
  • 4cfb287397 refactor hq code into hq module Noah Levitt 2015-07-15 14:26:48 -07:00
  • 9b5da57d7e initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled Noah Levitt 2015-07-14 18:57:45 -07:00
  • fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff Noah Levitt 2015-07-13 17:09:39 -07:00
  • 3eff099b16 determine if youtube-dl can do something with a url Noah Levitt 2015-07-13 16:40:56 -07:00
  • 6470a8ef26 sigquit dumps thread traces Noah Levitt 2015-07-13 15:57:14 -07:00
  • 18ca996216 rudimentary robots.txt support Noah Levitt 2015-07-13 15:56:54 -07:00
  • eb74967fed brozzler-worker round-robins sites needing crawling Noah Levitt 2015-07-13 12:13:41 -07:00
  • ddd764cac5 brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) Noah Levitt 2015-07-11 23:07:47 -07:00
  • b0f3b8a5e3 clean shutdown for brozzler-hq Noah Levitt 2015-07-11 18:18:54 -07:00
  • 384120928c set in_progress=0 for completed url Noah Levitt 2015-07-11 13:24:38 -07:00
  • 610f9c8cf4 add missing file hq.py, improve some logging, fix little race condition bug Noah Levitt 2015-07-11 13:09:45 -07:00
  • bb3561a690 check scope (on hq side), fix buglets Noah Levitt 2015-07-11 12:33:19 -07:00
  • 1fb336cb2e crawling outlinks not totally working Noah Levitt 2015-07-11 02:29:19 -07:00
  • 56a7bb7306 submit outlinks to hq Noah Levitt 2015-07-10 21:31:41 -07:00
  • fd99764baa brozzler-worker partially working Noah Levitt 2015-07-10 21:07:47 -07:00
  • 8aa1e6715a feed seed url to the crawl url queue Noah Levitt 2015-07-10 20:12:33 -07:00
  • 1d068f4f86 starting work on brozzler crawl hq Noah Levitt 2015-07-10 18:01:54 -07:00
  • fcc63b6675 fancier prioritization takes into account hops from seed, path depth; and clean shutdown Noah Levitt 2015-07-09 22:35:37 -07:00
  • 5f3c247e0c trick to avoid crawling same url again too quickly Noah Levitt 2015-07-09 21:49:55 -07:00
  • 7cc777661d fix dumb bug Noah Levitt 2015-07-09 18:54:09 -07:00
  • 783794ca37 basics of site/seed crawling with scoping Noah Levitt 2015-07-09 18:36:07 -07:00
  • 92ea701987 rudimentary crawling in parallel with multiple browsers Noah Levitt 2015-07-08 18:50:18 -07:00
  • 32abfcac8a fix 'CrawlUrl' object has no attribute 'priority' bug Noah Levitt 2015-07-08 17:51:09 -07:00
  • 4022cc0162 simple in-memory frontier with prioritized queues by host Noah Levitt 2015-07-08 17:44:38 -07:00
  • 4042f22497 rudimentary link extraction and crawling Noah Levitt 2015-07-07 16:45:52 -07:00
  • d8a962b29e experimenting with captureScreenshot Noah Levitt 2015-06-16 18:42:21 -07:00
  • f254e2eec1 it's been stable, call it 1.0 Noah Levitt 2015-06-13 11:30:01 -07:00
  • 903d2f3107 Merge pull request #39 from nlevitt/simple-behaviors Hunter 2015-04-16 15:01:49 -07:00
  • 73bbd87d5d merge in latest from master and adjust config as needed Noah Levitt 2015-02-02 14:52:56 -08:00
  • 776a6dac68 Merge branch 'master' into simple-behaviors Noah Levitt 2015-02-02 14:49:34 -08:00
  • 48b8754f40 Merge branch 'master' into simple-behaviors Noah Levitt 2015-02-02 14:48:26 -08:00
  • db759f1066 Merge pull request #32 from adam-miller/ARI-3904 Noah Levitt 2015-02-02 14:47:44 -08:00
  • ce47461656 Making scrolling and image loading more tolerant of slow loading. Adam Miller 2015-01-30 16:55:53 -08:00
  • 9e5900c61f ARI-3956 simple behavior for usask.ca slideshows (which also required enhancing the simple behavior logic) Noah Levitt 2015-01-27 16:03:58 -08:00
  • 0901cac2e0 Merge pull request #38 from nlevitt/bump-browser-timeout Noah Levitt 2015-01-26 21:22:18 -08:00
  • e9c2fc61dd increase browser start and stop timeouts, since sometimes we strand browser processes after starting them, when the machine is very busy Noah Levitt 2015-01-26 21:09:56 -08:00
  • d467cce221 Merge pull request #27 from vonrosen/ari-3774 Noah Levitt 2015-01-26 20:58:49 -08:00
  • c5c642a990 support for simple behavior that clicks on elements matching configured css selector; and one such behavior for acalog sites ARI-3775 Noah Levitt 2015-01-26 16:58:12 -08:00
  • 0647df1ab9 behaviors.yaml to configure behaviors, in preparation for "simple" behavior support Noah Levitt 2015-01-26 16:01:53 -08:00
  • 91f9788eb2 Add iframe css path to target id for soundcloud buttons. Hunter Stern 2015-01-21 16:28:29 -08:00
  • e9451f88d8 Merge branch 'master' of github.com:internetarchive/umbra into ari-3774 Hunter Stern 2015-01-21 16:21:13 -08:00
  • cdcef934e7 rewrite instagram behavior to be more like a state machine; update css selectors for current instagram; refactor as a sort of singleton class for cleaner namespacing Noah Levitt 2015-01-16 13:21:12 -08:00
  • ddc7064585 Merge branch 'master' into ARI-3904 Noah Levitt 2015-01-15 18:37:28 -08:00
  • ffd60d35e6 Merge pull request #36 from vonrosen/ari-4150 Noah Levitt 2014-12-22 21:47:31 -08:00
  • 5ea12fd053 More refinements. Hunter Stern 2014-12-19 15:52:13 -08:00
  • 8d225b8859 More debugging. Hunter Stern 2014-12-19 15:13:02 -08:00
  • 5304f2909d Less verbose logging. Hunter Stern 2014-12-19 14:35:11 -08:00
  • ae60205648 Fix for https://webarchive.jira.com/browse/ARI-4150 Hunter Stern 2014-12-19 14:17:50 -08:00
  • cf88b9968c Merge branch 'master' of github.com:internetarchive/umbra Hunter Stern 2014-12-12 15:59:25 -08:00
  • 1108ef9362 Merge pull request #33 from adam-miller/ARI-4016 Noah Levitt 2014-11-21 15:10:53 -08:00
  • 7f8e6802de Implementing suggestions in pull request. Adam Miller 2014-11-07 15:56:05 -08:00
  • 8e6859ef56 Merge pull request #35 from nlevitt/amqp-socket-error vonrosen 2014-11-03 12:09:27 -08:00
  • 9053279b4e change default routing key to "urls" Noah Levitt 2014-11-03 11:54:59 -08:00
  • ab86426475 properly handle socket.error from amqp conn.drain_events (was previously diagnosed as error starting browser) Noah Levitt 2014-11-03 11:54:10 -08:00
  • f40bd39e1a Merge pull request #34 from dhamaniasad/patch-1 Noah Levitt 2014-10-30 19:04:24 -07:00
  • 9231cc2b5c Update README.md Asad Dhamani 2014-10-31 07:02:49 +05:30
  • e264f09c27 Update README.md Asad Dhamani 2014-10-29 12:42:43 +05:30
  • 52bb02cbbe Merge branch 'master' of github.com:internetarchive/umbra Hunter Stern 2014-10-16 20:09:42 +00:00
  • 01ed5a7d4d Merge pull request #28 from internetarchive/ari-3940 vonrosen 2014-10-09 21:21:02 +00:00
  • bdf3e73062 Wait until big image is loaded before clicking to next image. Adam Miller 2014-10-03 14:17:07 -07:00
  • 1ee45053c5 Even more formatting changes. Hunter Stern 2014-09-22 14:22:52 -07:00
  • 6af3455dbf Improve formatting. Hunter Stern 2014-09-22 14:21:00 -07:00
  • 0edef7be6b Merge remote-tracking branch 'internetarchive/master' into ari-3774 Hunter Stern 2014-09-22 14:12:59 -07:00
  • 916f1b990e Cleanup instagram timeout and state handling Adam Miller 2014-09-17 16:26:53 -07:00
  • eb3ea95b87 Cleanup timeout logic Adam Miller 2014-09-17 15:26:13 -07:00
  • 5a3c8e9a05 ARI-4016 - Support: embedded videos on marquette.edu Adam Miller 2014-09-15 11:06:33 -07:00
  • a2ea2501db More soundcloud changes. Hunter Stern 2014-09-12 16:07:32 -07:00
  • e320654d1e Allow selector to detect https and http soundcloud widget. Hunter Stern 2014-09-12 09:56:41 -07:00
  • 7afdd7b50b Added behavior for instagram to scroll past two pages, and click to enlarge images. Adam Miller 2014-09-02 17:02:30 -07:00
  • 9052fd8569 add license section Noah Levitt 2014-09-02 16:11:49 -07:00
  • 51d6b1a4e2 apache license Noah Levitt 2014-09-02 16:10:00 -07:00
  • eb8c9faf89 Merge remote-tracking branch 'internetarchive/master' Hunter Stern 2014-08-28 10:56:27 -07:00
  • ce2957269f Merge pull request #31 from nlevitt/drain-republish Adam Miller 2014-08-26 16:58:21 -07:00
  • 2ab767eaa9 make drain-queue output actual json instead of python dict syntax Noah Levitt 2014-08-26 23:28:21 +00:00
  • fe1d9e01eb utility queue-json to publish an arbitrary json blob to amqp Noah Levitt 2014-08-26 23:11:44 +00:00
  • 0e7fd93967 Merge remote-tracking branch 'internetarchive/master' into ari-3774 Hunter Stern 2014-08-26 15:12:13 -07:00
  • bbba344886 Merge pull request #29 from nlevitt/handle-bad-message vonrosen 2014-08-20 08:21:29 -07:00
  • c886b57d3a reject (discard) bad messages Noah Levitt 2014-08-19 18:51:43 -07:00
  • b110a57938 Merge remote-tracking branch 'internetarchive/master' Hunter Stern 2014-08-14 15:26:15 -07:00
  • 9d90b5830a facebook - scroll all the way to the bottom before scrolling back up to click more stuff Noah Levitt 2014-08-01 16:53:13 -07:00
  • dd9ef50484 suppress logging of umbraBehaviorFinished() message which is sent a lot Noah Levitt 2014-08-01 16:22:45 -07:00
  • 6a5d1e2266 Disable web security in chromium so iframes on different domains can be accessed by behavior javascript. Hunter Stern 2014-07-24 16:46:06 -07:00
  • 80f3a4a067 Enhancement to allow embedded soundcloud audio files to be detected Hunter Stern 2014-07-24 16:44:05 -07:00
  • e7e82aa913 Merge branch 'master' of github.com:vonrosen/umbra into vonrosenmaster Hunter Stern 2014-07-23 11:09:27 -07:00
  • 8e44e18053 Merge pull request #26 from nlevitt/dev Adam Miller 2014-07-21 13:18:24 -07:00
  • ae838af25d set amqp prefetch count to the number of urls we can handle at a time, i.e. max_active_browsers (with prefetch=1 umbra was only browsing one url at a time, after quickly burning through urls already on the queue when started) Noah Levitt 2014-07-02 10:30:51 -07:00
  • 6306c16698 kill -HUP to immediately close and reopen amqp consumer connection Noah Levitt 2014-06-23 17:18:27 -07:00
  • 02c054c284 do not wait forever for zombie websocket threads (this change should also reveal how we get these sometimes) Noah Levitt 2014-06-20 18:13:45 -07:00
  • 9b32f9a3d1 ugh, it was better with the default width, in spite of the ridiculous behavior.script Noah Levitt 2014-06-20 14:40:12 -07:00
  • 2cf69bdaff seriously, don't try to wrap any lines, pprint Noah Levitt 2014-06-20 14:37:33 -07:00
  • c6fa00812c when dumping state on SIGQUIT, build the whole string before printing to avoid stuff getting intermingled with other logging and stuff Noah Levitt 2014-06-20 14:33:01 -07:00
  • ead46d5716 more elaborate dumping of state on SIGQUIT to replace faulthandler Noah Levitt 2014-06-20 14:05:33 -07:00
  • ebb14ff889 get rid of chrome_wait straggler Noah Levitt 2014-06-18 17:31:28 -07:00
  • 17ef9d9f28 close and reopen the amqp consumer connection only every 2.5 hours instead of every 15 minutes, because now that we have to wait for all browsers to close when we do the reconnection, it slows us down a lot Noah Levitt 2014-06-18 14:58:44 -07:00
  • 025db91dea get rid of --browser-wait and --routing-key in favor of sensible defaults, some other tweaks Noah Levitt 2014-06-11 10:58:08 -07:00
  • a78e60f1da wait for a browser to become available and start it up before reading the next url from amqp; ack the message only after completing the browsing process successfully, and requeue if it's not successful; some refactoring to make the timing work for this Noah Levitt 2014-06-09 13:15:05 -07:00