Noah Levitt
|
dc04048d50
|
add some info to the readme
|
2015-07-20 12:00:14 -07:00 |
|
Noah Levitt
|
2f28f00a09
|
make putmeta requests respect site configured extra_headers
|
2015-07-17 16:52:06 -07:00 |
|
Noah Levitt
|
2ba5bd4d4b
|
support adding extra http request headers
|
2015-07-17 13:45:27 -07:00 |
|
Noah Levitt
|
c178ed1950
|
fix buglet
|
2015-07-16 18:43:14 -07:00 |
|
Noah Levitt
|
a54e60dbaf
|
change terminology CrawlUrl => Page since that better represents what it means in brozzler and differentiates from heritrix
|
2015-07-16 18:39:29 -07:00 |
|
Noah Levitt
|
d2650a2547
|
update scope if seed redirects
|
2015-07-16 18:27:47 -07:00 |
|
Noah Levitt
|
140a441eb5
|
honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs
|
2015-07-16 17:19:12 -07:00 |
|
Noah Levitt
|
e04247c3f7
|
add support for supplying json blob defining site with configuration to brozzler-add-site
|
2015-07-16 14:48:01 -07:00 |
|
Noah Levitt
|
6b2ee9faee
|
chmod -x worker.py
|
2015-07-15 18:03:49 -07:00 |
|
Noah Levitt
|
f2bc7ec271
|
refactor brozzler.hq.Site and brozzler.url.CrawlUrl into new brozzler.site package; fix bugs in robots.txt handling
|
2015-07-15 18:03:03 -07:00 |
|
Noah Levitt
|
a9c51edd84
|
robots cache per site, and some so far unused support for site level configuration
|
2015-07-15 17:44:42 -07:00 |
|
Noah Levitt
|
efa3cd6269
|
don't set http_proxy environment variable, because it affects things we don't want it to
|
2015-07-15 17:33:29 -07:00 |
|
Noah Levitt
|
923cd98652
|
save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl)
|
2015-07-15 16:32:02 -07:00 |
|
Noah Levitt
|
5aea76ab6d
|
refactor worker code into worker module
|
2015-07-15 15:42:40 -07:00 |
|
Noah Levitt
|
7b92ba39c7
|
avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately)
|
2015-07-15 14:33:22 -07:00 |
|
Noah Levitt
|
9b13f0c34c
|
refactor hq code into hq module
|
2015-07-15 14:27:21 -07:00 |
|
Noah Levitt
|
4cfb287397
|
refactor hq code into hq module
|
2015-07-15 14:26:48 -07:00 |
|
Noah Levitt
|
9b5da57d7e
|
initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled
|
2015-07-14 18:57:45 -07:00 |
|
Noah Levitt
|
fd0c3322ee
|
update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff
|
2015-07-13 17:09:39 -07:00 |
|
Noah Levitt
|
3eff099b16
|
determine if youtube-dl can do something with a url
|
2015-07-13 16:40:56 -07:00 |
|
Noah Levitt
|
6470a8ef26
|
sigquit dumps thread traces
|
2015-07-13 15:57:14 -07:00 |
|
Noah Levitt
|
18ca996216
|
rudimentary robots.txt support
|
2015-07-13 15:56:54 -07:00 |
|
Noah Levitt
|
eb74967fed
|
brozzler-worker round-robins sites needing crawling
|
2015-07-13 12:13:41 -07:00 |
|
Noah Levitt
|
ddd764cac5
|
brozzler-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox)
|
2015-07-11 23:07:47 -07:00 |
|
Noah Levitt
|
b0f3b8a5e3
|
clean shutdown for brozzler-hq
|
2015-07-11 18:18:54 -07:00 |
|
Noah Levitt
|
384120928c
|
set in_progress=0 for completed url
|
2015-07-11 13:24:38 -07:00 |
|
Noah Levitt
|
610f9c8cf4
|
add missing file hq.py, improve some logging, fix little race condition bug
|
2015-07-11 13:09:45 -07:00 |
|
Noah Levitt
|
bb3561a690
|
check scope (on hq side), fix buglets
|
2015-07-11 12:33:19 -07:00 |
|
Noah Levitt
|
1fb336cb2e
|
crawling outlinks not totally working
|
2015-07-11 02:29:19 -07:00 |
|
Noah Levitt
|
56a7bb7306
|
submit outlinks to hq
|
2015-07-10 21:31:41 -07:00 |
|
Noah Levitt
|
fd99764baa
|
brozzler-worker partially working
|
2015-07-10 21:07:47 -07:00 |
|
Noah Levitt
|
8aa1e6715a
|
feed seed url to the crawl url queue
|
2015-07-10 20:12:33 -07:00 |
|
Noah Levitt
|
1d068f4f86
|
starting work on brozzler crawl hq
|
2015-07-10 18:01:54 -07:00 |
|
Noah Levitt
|
fcc63b6675
|
fancier prioritization takes into account hops from seed, path depth; and clean shutdown
|
2015-07-09 22:35:37 -07:00 |
|
Noah Levitt
|
5f3c247e0c
|
trick to avoid crawling same url again too quickly
|
2015-07-09 21:49:55 -07:00 |
|
Noah Levitt
|
7cc777661d
|
fix dumb bug
|
2015-07-09 18:54:09 -07:00 |
|
Noah Levitt
|
783794ca37
|
basics of site/seed crawling with scoping
|
2015-07-09 18:36:07 -07:00 |
|
Noah Levitt
|
92ea701987
|
rudimentary crawling in parallel with multiple browsers
|
2015-07-08 18:50:18 -07:00 |
|
Noah Levitt
|
32abfcac8a
|
fix 'CrawlUrl' object has no attribute 'priority' bug
|
2015-07-08 17:51:09 -07:00 |
|
Noah Levitt
|
4022cc0162
|
simple in-memory frontier with prioritized queues by host
|
2015-07-08 17:44:38 -07:00 |
|
Noah Levitt
|
4042f22497
|
rudimentary link extraction and crawling
|
2015-07-07 16:45:52 -07:00 |
|
Noah Levitt
|
d8a962b29e
|
experimenting with captureScreenshot
|
2015-06-16 18:42:21 -07:00 |
|
Noah Levitt
|
f254e2eec1
|
it's been stable, call it 1.0
|
2015-06-13 11:30:01 -07:00 |
|
Hunter
|
903d2f3107
|
Merge pull request #39 from nlevitt/simple-behaviors
ARI-3775, ARI-3956 Simple behaviors
|
2015-04-16 15:01:49 -07:00 |
|
Noah Levitt
|
73bbd87d5d
|
merge in latest from master and adjust config as needed
|
2015-02-02 14:52:56 -08:00 |
|
Noah Levitt
|
776a6dac68
|
Merge branch 'master' into simple-behaviors
|
2015-02-02 14:49:34 -08:00 |
|
Noah Levitt
|
48b8754f40
|
Merge branch 'master' into simple-behaviors
|
2015-02-02 14:48:26 -08:00 |
|
Noah Levitt
|
db759f1066
|
Merge pull request #32 from adam-miller/ARI-3904
ARI-3904 Instagram behavior to scroll past two pages, and click to enla...
|
2015-02-02 14:47:44 -08:00 |
|
Adam Miller
|
ce47461656
|
Making scrolling and image loading more tolerant of slow loading.
|
2015-01-30 16:55:53 -08:00 |
|
Noah Levitt
|
9e5900c61f
|
ARI-3956 simple behavior for usask.ca slideshows (which also required enhancing the simple behavior logic)
|
2015-01-27 16:03:58 -08:00 |
|