Commit graph

484 commits

Author SHA1 Message Date
Noah Levitt
568a553432 use the uncanonicalized url as part of the sha1 input to generate the page id, since canonicalization was stripping off the #fragment, and we might want to crawl the same url with different fragments (and there's no option to GoogleURLCanonicalizer to not strip the fragment) 2016-04-21 22:01:49 +00:00
Noah Levitt
fee008266f support for one-hop-off (or n-hop-off) scoping 2016-04-21 17:41:59 +00:00
Noah Levitt
35b713a2e7 little version bump 2016-04-07 23:36:05 +00:00
Noah Levitt
919692f9fa pin rethinkdb requirement to 2.3.x (this needs to roughly track deployed version) 2016-04-07 23:35:20 +00:00
Noah Levitt
ecb2e44442 if youtube-dl fetches pages or makes HEAD requests, look at the responses to determine if the page is html and therefore needs to be browsed; if it doesn't need to be browsed, check if youtube-dl has already fetched it (GET request to final bounce of redirect chain that returned a 200); if not, simply fetch it 2016-04-06 17:50:48 -07:00
Noah Levitt
a43b5016e1 use a dev version number 2016-03-18 02:03:20 +00:00
Noah Levitt
b06381790c honor crawl job stop requests 2016-03-08 00:18:54 +00:00
Noah Levitt
d2567f4a13 loosen surt req 2016-03-02 00:16:58 +00:00
Noah Levitt
4c2ecab856 surt==0.3b2 (available on pypi) 2015-11-12 02:58:53 +00:00
Noah Levitt
8c69ca3b39 giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2015-09-24 00:17:33 +00:00
Noah Levitt
9699a40645 remove "dev" from version number and switch README to rst 2015-09-23 22:35:26 +00:00
Noah Levitt
245078284d pep440 compliant versioning 2015-09-23 14:46:57 -07:00
Noah Levitt
2863b7e422 goodbye requirements.txt now that we have devpi 2015-09-23 00:49:20 +00:00
Noah Levitt
cf91fb1377 Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg= incantation.

This reverts commit 78ca070165.
2015-08-26 19:44:04 +00:00
Noah Levitt
78ca070165 use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily 2015-08-26 19:22:59 +00:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Noah Levitt
783794ca37 basics of site/seed crawling with scoping 2015-07-09 18:36:07 -07:00
Noah Levitt
4022cc0162 simple in-memory frontier with prioritized queues by host 2015-07-08 17:44:38 -07:00
Noah Levitt
f254e2eec1 it's been stable, call it 1.0 2015-06-13 11:30:01 -07:00
Noah Levitt
c5c642a990 support for simple behavior that clicks on elements matching configured css selector; and one such behavior for acalog sites ARI-3775 2015-01-26 16:58:12 -08:00
Noah Levitt
0647df1ab9 behaviors.yaml to configure behaviors, in preparation for "simple" behavior support 2015-01-26 16:01:53 -08:00
Noah Levitt
ed92f3bd53 for the version string, use abbreviated commit hash instead of attempting to use the branch name 2014-05-29 23:33:14 -07:00
Noah Levitt
bef57e2819 for version string, try to handle case where head is detached 2014-05-29 20:57:33 -07:00
Noah Levitt
3127e02cbb fancy --version that includes git branch and timestamp of last commit if available 2014-05-29 20:43:00 -07:00
Noah Levitt
1e18c2ca74 improve helper utilities 2014-05-20 16:44:13 -07:00
Noah Levitt
3e4232f32c refactor umbra.py into controller.py and browser.py, improve class names 2014-05-20 02:42:40 -07:00
Noah Levitt
cc0ffee508 only websocket-client-py3==0.13.1 works right with python3 at the moment, see https://github.com/liris/websocket-client/issues/84 2014-05-20 00:57:07 -07:00
Noah Levitt
f3a540b92d setup.py - include behaviors.d/*.js in installation 2014-03-13 00:00:32 -07:00
Noah Levitt
4935d55b6e specify classifier 'Programming Language :: Python :: 3.3' since websocket-client-py3 requires python 3.3, doesn't work with 3.2 2014-02-12 12:17:41 -08:00
Eldon
bd0183058d Incognito messes with currently running chromium instances, disable it 2014-01-23 18:26:20 -05:00
Eldon
4852fbf29f Update setup.py, get rid of unused dependency 2014-01-23 16:18:13 -05:00
Eldon
db9eee5f2b Should be full python 3 now 2014-01-22 01:32:41 +00:00
Eldon
272a9a3f42 Fix readme filename 2014-01-21 18:10:43 +00:00
Eldon
fdb62be2ba First commit of umbra 2014-01-21 06:41:46 +00:00
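
Commit 568a553432 above switches the page id hash input from the canonicalized url to the uncanonicalized one, so that urls differing only in their #fragment get distinct ids. A minimal sketch of that idea follows. It is not the project's actual code: the function names, the site_id parameter, and the exact hash input format are hypothetical, and urllib.parse.urldefrag stands in for the fragment-stripping canonicalizer.

```python
import hashlib
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Stand-in for a canonicalizer that drops the #fragment (assumption)."""
    return urldefrag(url).url


def page_id(site_id: str, url: str) -> str:
    """Hypothetical page id: SHA-1 hex digest over the site id and a url string."""
    digest_input = f"site_id:{site_id},url:{url}"
    return hashlib.sha1(digest_input.encode("utf-8")).hexdigest()


a = "http://example.com/app#/page/1"
b = "http://example.com/app#/page/2"

# Hashing the fragment-stripped form collapses both urls into a single id.
assert page_id("1", strip_fragment(a)) == page_id("1", strip_fragment(b))

# Hashing the raw urls keeps the two pages distinct.
assert page_id("1", a) != page_id("1", b)
```

With the raw url as hash input, two pages that a single-page app distinguishes only by fragment can each be queued and crawled as separate pages.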