1611 Commits

Author SHA1 Message Date
Noah Levitt
1fb336cb2e crawling outlinks not totally working 2015-07-11 02:29:19 -07:00
Noah Levitt
56a7bb7306 submit outlinks to hq 2015-07-10 21:31:41 -07:00
Noah Levitt
fd99764baa brozzler-worker partially working 2015-07-10 21:07:47 -07:00
Noah Levitt
8aa1e6715a feed seed url to the crawl url queue 2015-07-10 20:12:33 -07:00
Noah Levitt
1d068f4f86 starting work on brozzler crawl hq 2015-07-10 18:01:54 -07:00
Noah Levitt
fcc63b6675 fancier prioritization takes into account hops from seed, path depth; and clean shutdown 2015-07-09 22:35:37 -07:00
Noah Levitt
5f3c247e0c trick to avoid crawling same url again too quickly 2015-07-09 21:49:55 -07:00
Noah Levitt
7cc777661d fix dumb bug 2015-07-09 18:54:09 -07:00
Noah Levitt
783794ca37 basic of site/seed crawling with scoping 2015-07-09 18:36:07 -07:00
Noah Levitt
92ea701987 rudimentary crawling in parallel with multiple browsers 2015-07-08 18:50:18 -07:00
Noah Levitt
32abfcac8a fix 'CrawlUrl' object has no attribute 'priority' bug 2015-07-08 17:51:09 -07:00
Noah Levitt
4022cc0162 simple in-memory frontier with prioritized queues by host 2015-07-08 17:44:38 -07:00
Noah Levitt
4042f22497 rudimentary link extraction and crawling 2015-07-07 16:45:52 -07:00
Noah Levitt
d8a962b29e experimenting with captureScreenshot 2015-06-16 18:42:21 -07:00
Noah Levitt
f254e2eec1 it's been stable, call it 1.0 2015-06-13 11:30:01 -07:00
Hunter
903d2f3107 Merge pull request #39 from nlevitt/simple-behaviors
ARI-3775, ARI-3956 Simple behaviors
2015-04-16 15:01:49 -07:00
Noah Levitt
73bbd87d5d merge in latest from master and adjust config as needed 2015-02-02 14:52:56 -08:00
Noah Levitt
776a6dac68 Merge branch 'master' into simple-behaviors 2015-02-02 14:49:34 -08:00
Noah Levitt
48b8754f40 Merge branch 'master' into simple-behaviors 2015-02-02 14:48:26 -08:00
Noah Levitt
db759f1066 Merge pull request #32 from adam-miller/ARI-3904
ARI-3904 Instagram behavior to scroll past two pages, and click to enla...
2015-02-02 14:47:44 -08:00
Adam Miller
ce47461656 Making scrolling and image loading more tolerant of slow loading. 2015-01-30 16:55:53 -08:00
Noah Levitt
9e5900c61f ARI-3956 simple behavior for usask.ca slideshows (which also required enhancing the simple behavior logic) 2015-01-27 16:03:58 -08:00
Noah Levitt
0901cac2e0 Merge pull request #38 from nlevitt/bump-browser-timeout
increase browser start and stop timeouts, since sometimes we strand brow...
2015-01-26 21:22:18 -08:00
Noah Levitt
e9c2fc61dd increase browser start and stop timeouts, since sometimes we strand browser processes after starting them, when the machine is very busy 2015-01-26 21:09:56 -08:00
Noah Levitt
d467cce221 Merge pull request #27 from vonrosen/ari-3774
Allow default behavior to include clicking on sound cloud player buttons embedded in 3rd party sites.
2015-01-26 20:58:49 -08:00
Noah Levitt
c5c642a990 support for simple behavior that clicks on elements matching configured css selector; and one such behavior for acalog sites ARI-3775 2015-01-26 16:58:12 -08:00
Noah Levitt
0647df1ab9 behaviors.yaml to configure behaviors, in preparation for "simple" behavior support 2015-01-26 16:01:53 -08:00
Hunter Stern
91f9788eb2 Add iframe css path to target id for soundcloud buttons. 2015-01-21 16:28:29 -08:00
Hunter Stern
e9451f88d8 Merge branch 'master' of github.com:internetarchive/umbra into ari-3774 2015-01-21 16:21:13 -08:00
Noah Levitt
cdcef934e7 rewrite instagram behavior to be more like a state machine; update css selectors for current instagram; refactor as a sort of singleton class for cleaner namespacing 2015-01-16 13:21:12 -08:00
Noah Levitt
ddc7064585 Merge branch 'master' into ARI-3904 2015-01-15 18:37:28 -08:00
Noah Levitt
ffd60d35e6 Merge pull request #36 from vonrosen/ari-4150
Allow scrolling down a timeline in the facebook plugin so as to capture content in third party embedded timelines.
2014-12-22 21:47:31 -08:00
Hunter Stern
5ea12fd053 More refinements. 2014-12-19 15:52:13 -08:00
Hunter Stern
8d225b8859 More debugging. 2014-12-19 15:13:02 -08:00
Hunter Stern
5304f2909d Less verbose logging. 2014-12-19 14:35:11 -08:00
Hunter Stern
ae60205648 Fix for https://webarchive.jira.com/browse/ARI-4150 2014-12-19 14:17:50 -08:00
Hunter Stern
cf88b9968c Merge branch 'master' of github.com:internetarchive/umbra 2014-12-12 15:59:25 -08:00
Noah Levitt
1108ef9362 Merge pull request #33 from adam-miller/ARI-4016
ARI-4016 - Support: embedded videos on marquette.edu
2014-11-21 15:10:53 -08:00
Adam Miller
7f8e6802de Implementing suggestions in pull request. 2014-11-07 15:56:05 -08:00
vonrosen
8e6859ef56 Merge pull request #35 from nlevitt/amqp-socket-error
properly handle socket.error from amqp conn.drain_events (was previously...
2014-11-03 12:09:27 -08:00
Noah Levitt
9053279b4e change default routing key to "urls" 2014-11-03 11:54:59 -08:00
Noah Levitt
ab86426475 properly handle socket.error from amqp conn.drain_events (was previously diagnosed as error starting browser) 2014-11-03 11:54:10 -08:00
Noah Levitt
f40bd39e1a Merge pull request #34 from dhamaniasad/patch-1
Update README.md
2014-10-30 19:04:24 -07:00
Asad Dhamani
9231cc2b5c Update README.md 2014-10-31 07:02:49 +05:30
Asad Dhamani
e264f09c27 Update README.md 2014-10-29 12:42:43 +05:30
Hunter Stern
52bb02cbbe Merge branch 'master' of github.com:internetarchive/umbra 2014-10-16 20:09:42 +00:00
vonrosen
01ed5a7d4d Merge pull request #28 from internetarchive/ari-3940
Ari 3940 - prioritize scrolling all the way to the bottom
2014-10-09 21:21:02 +00:00
Adam Miller
bdf3e73062 Wait until big image is loaded before clicking to next image. 2014-10-03 14:17:07 -07:00
Hunter Stern
1ee45053c5 Even more formatting changes. 2014-09-22 14:22:52 -07:00
Hunter Stern
6af3455dbf Improve formatting. 2014-09-22 14:21:00 -07:00