138 Commits

Author SHA1 Message Date
Noah Levitt
eb74967fed brozzler-worker round-robins sites needing crawling 2015-07-13 12:13:41 -07:00
Noah Levitt
ddd764cac5 brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) 2015-07-11 23:07:47 -07:00
Noah Levitt
610f9c8cf4 add missing file hq.py, improve some logging, fix little race condition bug 2015-07-11 13:09:45 -07:00
Noah Levitt
1fb336cb2e crawling outlinks not totally working 2015-07-11 02:29:19 -07:00
Noah Levitt
56a7bb7306 submit outlinks to hq 2015-07-10 21:31:41 -07:00
Noah Levitt
fd99764baa brozzler-worker partially working 2015-07-10 21:07:47 -07:00
Noah Levitt
fcc63b6675 fancier prioritization takes into account hops from seed, path depth; and clean shutdown 2015-07-09 22:35:37 -07:00
Noah Levitt
783794ca37 basics of site/seed crawling with scoping 2015-07-09 18:36:07 -07:00
Noah Levitt
92ea701987 rudimentary crawling in parallel with multiple browsers 2015-07-08 18:50:18 -07:00
Noah Levitt
32abfcac8a fix 'CrawlUrl' object has no attribute 'priority' bug 2015-07-08 17:51:09 -07:00
Noah Levitt
4022cc0162 simple in-memory frontier with prioritized queues by host 2015-07-08 17:44:38 -07:00
Noah Levitt
4042f22497 rudimentary link extraction and crawling 2015-07-07 16:45:52 -07:00
Noah Levitt
d8a962b29e experimenting with captureScreenshot 2015-06-16 18:42:21 -07:00
Noah Levitt
73bbd87d5d merge in latest from master and adjust config as needed 2015-02-02 14:52:56 -08:00
Noah Levitt
776a6dac68 Merge branch 'master' into simple-behaviors 2015-02-02 14:49:34 -08:00
Noah Levitt
48b8754f40 Merge branch 'master' into simple-behaviors 2015-02-02 14:48:26 -08:00
Noah Levitt
db759f1066 Merge pull request #32 from adam-miller/ARI-3904
ARI-3904 Instagram behavior to scroll past two pages, and click to enla...
2015-02-02 14:47:44 -08:00
Adam Miller
ce47461656 Making scrolling and image loading more tolerant of slow loading. 2015-01-30 16:55:53 -08:00
Noah Levitt
9e5900c61f ARI-3956 simple behavior for usask.ca slideshows (which also required enhancing the simple behavior logic) 2015-01-27 16:03:58 -08:00
Noah Levitt
0901cac2e0 Merge pull request #38 from nlevitt/bump-browser-timeout
increase browser start and stop timeouts, since sometimes we strand brow...
2015-01-26 21:22:18 -08:00
Noah Levitt
e9c2fc61dd increase browser start and stop timeouts, since sometimes we strand browser processes after starting them, when the machine is very busy 2015-01-26 21:09:56 -08:00
Noah Levitt
c5c642a990 support for simple behavior that clicks on elements matching configured css selector; and one such behavior for acalog sites ARI-3775 2015-01-26 16:58:12 -08:00
Noah Levitt
0647df1ab9 behaviors.yaml to configure behaviors, in preparation for "simple" behavior support 2015-01-26 16:01:53 -08:00
Hunter Stern
91f9788eb2 Add iframe css path to target id for soundcloud buttons. 2015-01-21 16:28:29 -08:00
Hunter Stern
e9451f88d8 Merge branch 'master' of github.com:internetarchive/umbra into ari-3774 2015-01-21 16:21:13 -08:00
Noah Levitt
cdcef934e7 rewrite instagram behavior to be more like a state machine; update css selectors for current instagram; refactor as a sort of singleton class for cleaner namespacing 2015-01-16 13:21:12 -08:00
Noah Levitt
ddc7064585 Merge branch 'master' into ARI-3904 2015-01-15 18:37:28 -08:00
Hunter Stern
5ea12fd053 More refinements. 2014-12-19 15:52:13 -08:00
Hunter Stern
8d225b8859 More debugging. 2014-12-19 15:13:02 -08:00
Hunter Stern
5304f2909d Less verbose logging. 2014-12-19 14:35:11 -08:00
Hunter Stern
ae60205648 Fix for https://webarchive.jira.com/browse/ARI-4150 2014-12-19 14:17:50 -08:00
Noah Levitt
1108ef9362 Merge pull request #33 from adam-miller/ARI-4016
ARI-4016 - Support: embedded videos on marquette.edu
2014-11-21 15:10:53 -08:00
Adam Miller
7f8e6802de Implementing suggestions in pull request. 2014-11-07 15:56:05 -08:00
Noah Levitt
ab86426475 properly handle socket.error from amqp conn.drain_events (was previously diagnosed as error starting browser) 2014-11-03 11:54:10 -08:00
vonrosen
01ed5a7d4d Merge pull request #28 from internetarchive/ari-3940
Ari 3940 - prioritize scrolling all the way to the bottom
2014-10-09 21:21:02 +00:00
Adam Miller
bdf3e73062 Wait until big image is loaded before clicking to next image. 2014-10-03 14:17:07 -07:00
Hunter Stern
1ee45053c5 Even more formatting changes. 2014-09-22 14:22:52 -07:00
Hunter Stern
6af3455dbf Improve formatting. 2014-09-22 14:21:00 -07:00
Adam Miller
916f1b990e Cleanup instagram timeout and state handling 2014-09-17 16:26:53 -07:00
Adam Miller
eb3ea95b87 Cleanup timeout logic 2014-09-17 15:26:13 -07:00
Adam Miller
5a3c8e9a05 ARI-4016 - Support: embedded videos on marquette.edu 2014-09-15 11:06:33 -07:00
Hunter Stern
a2ea2501db More soundcloud changes. 2014-09-12 16:07:32 -07:00
Hunter Stern
e320654d1e Allow selector to detect https and http soundcloud widget. 2014-09-12 09:56:41 -07:00
Adam Miller
7afdd7b50b Added behavior for instagram to scroll past two pages, and click to enlarge images. 2014-09-02 17:02:30 -07:00
Hunter Stern
0e7fd93967 Merge remote-tracking branch 'internetarchive/master' into ari-3774 2014-08-26 15:12:13 -07:00
Noah Levitt
c886b57d3a reject (discard) bad messages 2014-08-19 18:51:43 -07:00
Noah Levitt
9d90b5830a facebook - scroll all the way to the bottom before scrolling back up to click more stuff 2014-08-01 16:53:13 -07:00
Noah Levitt
dd9ef50484 suppress logging of umbraBehaviorFinished() message which is sent a lot 2014-08-01 16:22:45 -07:00
Hunter Stern
6a5d1e2266 Disable web security in chromium so iframes on different domains can be accessed by behavior javascript. 2014-07-24 16:46:06 -07:00
Hunter Stern
80f3a4a067 Enhancement to allow embedded soundcloud audio files to be detected 2014-07-24 16:44:05 -07:00