Noah Levitt
|
fcc63b6675
|
fancier prioritization takes into account hops from seed, path depth; and clean shutdown
|
2015-07-09 22:35:37 -07:00 |
|
Noah Levitt
|
5f3c247e0c
|
trick to avoid crawling same url again too quickly
|
2015-07-09 21:49:55 -07:00 |
|
Noah Levitt
|
7cc777661d
|
fix dumb bug
|
2015-07-09 18:54:09 -07:00 |
|
Noah Levitt
|
783794ca37
|
basics of site/seed crawling with scoping
|
2015-07-09 18:36:07 -07:00 |
|
Noah Levitt
|
92ea701987
|
rudimentary crawling in parallel with multiple browsers
|
2015-07-08 18:50:18 -07:00 |
|
Noah Levitt
|
4022cc0162
|
simple in-memory frontier with prioritized queues by host
|
2015-07-08 17:44:38 -07:00 |
|
Noah Levitt
|
4042f22497
|
rudimentary link extraction and crawling
|
2015-07-07 16:45:52 -07:00 |
|
Noah Levitt
|
d8a962b29e
|
experimenting with captureScreenshot
|
2015-06-16 18:42:21 -07:00 |
|
Noah Levitt
|
9053279b4e
|
change default routing key to "urls"
|
2014-11-03 11:54:59 -08:00 |
|
Noah Levitt
|
2ab767eaa9
|
make drain-queue output actual json instead of python dict syntax
|
2014-08-26 23:46:00 +00:00 |
|
Noah Levitt
|
fe1d9e01eb
|
utility queue-json to publish an arbitrary json blob to amqp
|
2014-08-26 23:45:42 +00:00 |
|
Noah Levitt
|
6306c16698
|
kill -HUP to immediately close and reopen amqp consumer connection
|
2014-06-23 17:18:27 -07:00 |
|
Noah Levitt
|
9b32f9a3d1
|
ugh, it was better with the default width, in spite of the ridiculous behavior
|
2014-06-20 14:40:12 -07:00 |
|
Noah Levitt
|
2cf69bdaff
|
seriously, don't try to wrap any lines, pprint
|
2014-06-20 14:37:33 -07:00 |
|
Noah Levitt
|
c6fa00812c
|
when dumping state on SIGQUIT, build the whole string before printing to avoid stuff getting intermingled with other logging and stuff
|
2014-06-20 14:33:01 -07:00 |
|
Noah Levitt
|
ead46d5716
|
more elaborate dumping of state on SIGQUIT to replace faulthandler
|
2014-06-20 14:05:33 -07:00 |
|
Noah Levitt
|
ebb14ff889
|
get rid of chrome_wait straggler
|
2014-06-18 17:31:28 -07:00 |
|
Noah Levitt
|
025db91dea
|
get rid of --browser-wait and --routing-key in favor of sensible defaults, some other tweaks
|
2014-06-11 10:58:08 -07:00 |
|
Noah Levitt
|
a78e60f1da
|
wait for a browser to become available and start it up before reading the next url from amqp; ack the message only after completing the browsing process successfully, and requeue if it's not successful; some refactoring to make the timing work for this
|
2014-06-09 13:15:05 -07:00 |
|
Noah Levitt
|
b2e27b99d2
|
nice log message when fully shut down
|
2014-05-30 17:32:01 -07:00 |
|
Noah Levitt
|
c9d503e690
|
log version number at startup
|
2014-05-30 15:00:01 -07:00 |
|
Noah Levitt
|
3127e02cbb
|
fancy --version that includes git branch and timestamp of last commit if available
|
2014-05-29 20:43:00 -07:00 |
|
Noah Levitt
|
9c08be2699
|
sigterm and sigint both request shutdown, which stops consuming urls and waits for active browsers to finish; a second sigint/sigterm immediately shuts down active browsers
|
2014-05-24 01:52:22 -07:00 |
|
Noah Levitt
|
2c4ba005b5
|
make umbra amenable to clustering by using a pool of n browsers and removing the browser-clientId affinity (not useful currently since we start a fresh browser instance for each page browsed), and set prefetch_count=1 on amqp consumers to round-robin incoming urls among umbra instances
|
2014-05-23 21:59:34 -07:00 |
|
Noah Levitt
|
8d269f4c56
|
add options --verbose, --exchange, --queue, --routing-key
|
2014-05-23 13:39:39 -07:00 |
|
Noah Levitt
|
bd3f979b56
|
capitalize AMQP in description
|
2014-05-23 13:39:08 -07:00 |
|
Noah Levitt
|
d7cfcbf233
|
new helper utility to browse urls provided as command line args
|
2014-05-20 17:11:16 -07:00 |
|
Noah Levitt
|
6c69b68771
|
organize imports, tweak command line args
|
2014-05-20 17:10:41 -07:00 |
|
Noah Levitt
|
1e18c2ca74
|
improve helper utilities
|
2014-05-20 16:44:13 -07:00 |
|
Noah Levitt
|
b59e76a5b9
|
clean shutdown without draining entire amqp queue (only consume urls from amqp when browser activity isn't saturated)
|
2014-05-20 03:02:48 -07:00 |
|
Noah Levitt
|
3e4232f32c
|
refactor umbra.py into controller.py and browser.py, improve class names
|
2014-05-20 02:42:40 -07:00 |
|
Noah Levitt
|
f69edd5a87
|
handle multiple clients, browsers
|
2014-02-13 01:59:09 -08:00 |
|
Eldon
|
bdf00cc515
|
Refactor to pull Chrome execution inside of umbra, simplify some things
|
2014-02-12 19:31:03 -05:00 |
|
Eldon
|
8afe7d90a2
|
Replace js evaluation with direct page navigation, add default for dump_queue
|
2014-01-28 00:10:31 -05:00 |
|
Noah Levitt
|
8eb92b28e6
|
make load_url handle arguments similarly to umbra
|
2014-01-27 19:34:54 -08:00 |
|
Eldon
|
bd0183058d
|
Incognito messes with currently running chromium instances, disable it
|
2014-01-23 18:26:20 -05:00 |
|
Eldon
|
6dc20e660f
|
Remove debugging output, improve support scripts
|
2014-01-22 18:41:00 +00:00 |
|
Eldon
|
4e38a142d4
|
Some refactor/testing and utility scripts
|
2014-01-22 18:03:02 +00:00 |
|
Eldon
|
428d6cb7da
|
Rework executable script so that it uses a main
|
2014-01-22 02:30:12 +00:00 |
|
Eldon
|
7b219ab011
|
Fix parameter passing and work with chromium's wrapper stuff
|
2014-01-22 02:22:16 +00:00 |
|
Eldon
|
dd72311e2d
|
Create executable umbra script
|
2014-01-21 18:23:11 +00:00 |
|