Noah Levitt | df61e55b6b | add license headers | 2016-04-25 20:02:11 +00:00
Noah Levitt | c2e80ed6ff | make whole process die if main worker thread dies | 2016-03-16 23:35:33 +00:00
Noah Levitt | 5597b4cf1a | quiet down requests.packages.urllib3 | 2015-11-12 02:58:00 +00:00
Noah Levitt | 343b5c0f82 | register with service registry; only start chrome right before using it, so that web console vnc windows aren't always full of about:blank | 2015-11-12 02:56:27 +00:00
Noah Levitt | dff4149185 | missed one more use of brozzler.version | 2015-09-24 00:44:35 +00:00
Noah Levitt | 8c69ca3b39 | giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing | 2015-09-24 00:17:33 +00:00
Noah Levitt | 2bc66f52d4 | new rethinkstuff.Rethinker api | 2015-09-23 00:50:15 +00:00
Noah Levitt | 92a288bc35 | detect jobs finishing! (not well tested yet) | 2015-09-09 22:11:48 +00:00
Noah Levitt | 3c23aa8fd4 | finally, the jobs table | 2015-09-03 01:05:03 +00:00
Noah Levitt | f334107b47 | support for specifying rethinkdb database name; wrap rethinkdb operations and retry if appropriate (as best as we can tell) | 2015-08-28 00:37:26 +00:00
Noah Levitt | efa640c640 | refactor to simplify starting new job from code | 2015-08-25 19:52:33 +00:00
Noah Levitt | 3af1e10e13 | make it work again, and list discovered outlinks | 2015-08-20 21:22:08 +00:00
Noah Levitt | b7df0a1f37 | make frontier prioritize least recently brozzled site; move disclaim_site() and completed_page() into frontier.py | 2015-08-19 18:45:19 +00:00
Noah Levitt | b8506a2ab4 | rename "db" to "frontier" | 2015-08-19 17:47:05 +00:00
Noah Levitt | a878730e02 | goodbye sqlite and rabbitmq, hello rethinkdb | 2015-08-18 21:44:54 +00:00
Noah Levitt | e6fbf0e2e9 | rename brozzler-add-site to brozzler-new-site to match brozzler-new-job et al | 2015-08-17 22:48:25 +00:00
Noah Levitt | 2a7a0b7c30 | little fix, tweak | 2015-08-05 00:17:43 +00:00
Noah Levitt | b6beac3807 | new script brozzler-new-job to queue a new job with brozzler based on yaml configuration file | 2015-08-04 19:52:01 +00:00
Noah Levitt | 511e19ff4d | handle 420 "Limit reached" when browser receives it | 2015-08-01 01:26:59 +00:00
Noah Levitt | 7b98af7d9f | handle reached limit response from warcprox | 2015-08-01 00:09:57 +00:00
Noah Levitt | 11fbbc9d49 | change browse-url command to brozzle-page, which does some more stuff as if it were in brozzler, like youtube_dl, warcprox features, etc | 2015-07-31 00:03:13 +00:00
Noah Levitt | 85a863b1e3 | change argument to --amqp-url for clarity and consistency | 2015-07-23 00:39:57 +00:00
Noah Levitt | 83a8e7cbe5 | fix bug when --extra-header switch is not supplied | 2015-07-21 20:39:41 +00:00
Noah Levitt | b5cb94fc8b | some additional logging and error handling to avoid mysterious messages | 2015-07-21 06:33:02 +00:00
Noah Levitt | 1e56bc8686 | add only one site at a time, specify settings with command line switches | 2015-07-21 06:32:00 +00:00
Noah Levitt | 2ba5bd4d4b | support adding extra http request headers | 2015-07-17 13:45:27 -07:00
Noah Levitt | 140a441eb5 | honor site proxy setting; remove brozzler-worker options that are now configured at the site level (and in the case of ignore_cert_errors, always on, no longer an option); use "reppy" library for robots.txt handling; fix some bugs | 2015-07-16 17:19:12 -07:00
Noah Levitt | e04247c3f7 | add support for supplying json blob defining site with configuration to brozzler-add-site | 2015-07-16 14:48:01 -07:00
Noah Levitt | 923cd98652 | save screenshots as metadata records using warcprox PUTMETA (same format as kenji's wide crawl) | 2015-07-15 16:32:02 -07:00
Noah Levitt | 5aea76ab6d | refactor worker code into worker module | 2015-07-15 15:42:40 -07:00
Noah Levitt | 7b92ba39c7 | avoid printing stack trace on normal youtube_dl unsupported condition (still prints error message unfortunately) | 2015-07-15 14:33:22 -07:00
Noah Levitt | 9b13f0c34c | refactor hq code into hq module | 2015-07-15 14:27:21 -07:00
Noah Levitt | 9b5da57d7e | initial youtube-dl support, including saving youtube-dl derived json with warcprox by sending a PUTMETA request, if new option --enable-warcprox-features is enabled | 2015-07-14 18:57:45 -07:00
Noah Levitt | fd0c3322ee | update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff | 2015-07-13 17:09:39 -07:00
Noah Levitt | 3eff099b16 | determine if youtube-dl can do something with a url | 2015-07-13 16:40:56 -07:00
Noah Levitt | 6470a8ef26 | sigquit dumps thread traces | 2015-07-13 15:57:14 -07:00
Noah Levitt | 18ca996216 | rudimentary robots.txt support | 2015-07-13 15:56:54 -07:00
Noah Levitt | eb74967fed | brozzler-worker round-robins sites needing crawling | 2015-07-13 12:13:41 -07:00
Noah Levitt | ddd764cac5 | brozzle-worker options --proxy-server=host:port and --ignore-certificate-errors (for use with warcprox) | 2015-07-11 23:07:47 -07:00
Noah Levitt | b0f3b8a5e3 | clean shutdown for brozzler-hq | 2015-07-11 18:18:54 -07:00
Noah Levitt | 384120928c | set in_progress=0 for completed url | 2015-07-11 13:24:38 -07:00
Noah Levitt | 610f9c8cf4 | add missing file hq.py, improve some logging, fix little race condition bug | 2015-07-11 13:09:45 -07:00
Noah Levitt | bb3561a690 | check scope (on hq side), fix buglets | 2015-07-11 12:33:19 -07:00
Noah Levitt | 1fb336cb2e | crawling outlinks not totally working | 2015-07-11 02:29:19 -07:00
Noah Levitt | 56a7bb7306 | submit outlinks to hq | 2015-07-10 21:31:41 -07:00
Noah Levitt | fd99764baa | brozzler-worker partially working | 2015-07-10 21:07:47 -07:00
Noah Levitt | 8aa1e6715a | feed seed url to the crawl url queue | 2015-07-10 20:12:33 -07:00
Noah Levitt | 1d068f4f86 | starting work on brozzler crawl hq | 2015-07-10 18:01:54 -07:00
Noah Levitt | fcc63b6675 | fancier prioritization takes into account hops from seed, path depth; and clean shutdown | 2015-07-09 22:35:37 -07:00
Noah Levitt | 5f3c247e0c | trick to avoid crawling same url again too quickly | 2015-07-09 21:49:55 -07:00