409 Commits

Author SHA1 Message Date
Noah Levitt
0bd687abde avoid hanging in case a page has no outlinks 2016-06-28 11:25:04 -05:00
Noah Levitt
6a11e1da2a fix noVNC submodule path since brozzler webconsole has moved 2016-06-28 16:15:16 +00:00
Noah Levitt
cb4a16e58c handle new bucket format in brozzler-webconsole 2016-06-28 00:13:01 +00:00
Noah Levitt
89474cb430 convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc) 2016-06-27 14:12:56 -05:00
Noah Levitt
e4f8efe376 make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies 2016-06-27 12:43:24 -05:00
Noah Levitt
366e467501 enhancements to the page thumbnail on the site page 2016-06-22 23:09:27 +00:00
Noah Levitt
9b3f3809cc expose full rethinkdb entry as yaml on job page 2016-06-22 22:29:07 +00:00
Noah Levitt
510456eef2 order the page thumbnails on site page by least number of hops, so the seed shows up first 2016-06-22 21:20:00 +00:00
Noah Levitt
2038598f41 fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''} 2016-06-22 17:43:53 +00:00
Noah Levitt
d198a69e45 recurse through all frames to find outlinks 2016-06-22 11:39:31 -05:00
Noah Levitt
8d7bb582cb back to a dev version number 2016-06-16 14:19:00 -05:00
Noah Levitt
6237ed3f34 update url 2016-06-16 14:18:24 -05:00
Noah Levitt
9c2fe25dd0 back to a dev version number 2016-06-16 14:13:03 -05:00
Noah Levitt
b2d4cc5ff0 beta version number for pypi upload 2016-06-16 14:11:45 -05:00
Noah Levitt
81d709eed0 bump version number 2016-06-16 13:56:08 -05:00
Noah Levitt
182cbfd0ce bump version so we can upload to pypi and fix the readme 2016-05-11 12:10:23 -07:00
Noah Levitt
dd2211df31 yes you can install brozzler from the outside world now! 2016-05-11 12:09:09 -07:00
Noah Levitt
6f6216e432 catch exception from rethinkdb when unregistering from the service registry at shutdown 2016-05-11 00:46:50 +00:00
Noah Levitt
1141c5951e add psutil dependency 2016-05-09 17:19:53 -07:00
Noah Levitt
464da5c3a6 avoid errors with old versions of pip or non-utf-8 locales by specifying the encoding of README.rst 2016-05-07 01:46:15 +00:00
Noah Levitt
053767d393 bump version again 2016-05-05 10:37:58 -07:00
Noah Levitt
cea192b4b3 copy over latest behaviors and stuff from umbra 2016-05-05 00:58:26 -07:00
Noah Levitt
0af00bb3d5 support for host rules in outlink scoping 2016-05-03 20:52:22 +00:00
Noah Levitt
df61e55b6b add license headers 2016-04-25 20:02:11 +00:00
Noah Levitt
2825ffea15 support for extra "blocks" and "accepts" scope rules 2016-04-21 22:22:44 +00:00
Noah Levitt
568a553432 use the uncanonicalized url as part of the sha1 input to generate the page id, since canonicalization was stripping off the #fragment, and we might want to crawl the same url with different fragments (and there's no option to GoogleURLCanonicalizer to not strip the fragment) 2016-04-21 22:01:49 +00:00
Noah Levitt
fee008266f support for one-hop-off (or n-hop-off) scoping 2016-04-21 17:41:59 +00:00
Noah Levitt
35b713a2e7 little version bump 2016-04-07 23:36:05 +00:00
Noah Levitt
919692f9fa pin rethinkdb requirement to 2.3.x (this needs to roughly track deployed version) 2016-04-07 23:35:20 +00:00
Noah Levitt
ecb2e44442 if youtube-dl fetches pages or makes HEAD requests, look at the responses to determine if the page is html and therefore needs to be browsed; if it doesn't need to be browsed, check if youtube-dl has already fetched it (GET request to final bounce of redirect chain that returned a 200); if not, simply fetch it 2016-04-06 17:50:48 -07:00
Noah Levitt
a43b5016e1 use a dev version number 2016-03-18 02:03:20 +00:00
Noah Levitt
b06381790c honor crawl job stop requests 2016-03-08 00:18:54 +00:00
Noah Levitt
d2567f4a13 loosen surt req 2016-03-02 00:16:58 +00:00
Noah Levitt
4c2ecab856 surt==0.3b2 (available on pypi) 2015-11-12 02:58:53 +00:00
Noah Levitt
8c69ca3b39 giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2015-09-24 00:17:33 +00:00
Noah Levitt
9699a40645 remove "dev" from version number and switch README to rst 2015-09-23 22:35:26 +00:00
Noah Levitt
245078284d pep440 compliant versioning 2015-09-23 14:46:57 -07:00
Noah Levitt
2863b7e422 goodbye requirements.txt now that we have devpi 2015-09-23 00:49:20 +00:00
Noah Levitt
cf91fb1377 Revert "use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily"
Ugh.. too much pain, not worth the time to figure out the magic #egg=
incantation.

This reverts commit 78ca0701651c35bda69122ddf652cbb8d95daeb0.
2015-08-26 19:44:04 +00:00
Noah Levitt
78ca070165 use dependency_links instead of requirements.txt in spite of ugliness of --process-dependency-links, #egg=..., so that dependent projects can use brozzler more easily 2015-08-26 19:22:59 +00:00
Noah Levitt
fd0c3322ee update readme, s/umbra/brozzler/ in most places, delete non-brozzler stuff 2015-07-13 17:09:39 -07:00
Noah Levitt
783794ca37 basic of site/seed crawling with scoping 2015-07-09 18:36:07 -07:00
Noah Levitt
4022cc0162 simple in-memory frontier with prioritized queues by host 2015-07-08 17:44:38 -07:00
Noah Levitt
f254e2eec1 it's been stable, call it 1.0 2015-06-13 11:30:01 -07:00
Noah Levitt
c5c642a990 support for simple behavior that clicks on elements matching configured css selector; and one such behavior for acalog sites ARI-3775 2015-01-26 16:58:12 -08:00
Noah Levitt
0647df1ab9 behaviors.yaml to configure behaviors, in preparation for "simple" behavior support 2015-01-26 16:01:53 -08:00
Noah Levitt
ed92f3bd53 for the version string, use abbreviated commit hash instead of attempting to use the branch name 2014-05-29 23:33:14 -07:00
Noah Levitt
bef57e2819 for version string, try to handle case where head is detached 2014-05-29 20:57:33 -07:00
Noah Levitt
3127e02cbb fancy --version that includes git branch and timestamp of last commit if available 2014-05-29 20:43:00 -07:00
Noah Levitt
1e18c2ca74 improve helper utilities 2014-05-20 16:44:13 -07:00
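
Note on 2038598f41: that commit says brozzler.browser.browse_page() returned {''} instead of an empty set when no outlinks were found. The log does not show the code, so the following is only a minimal sketch of one common way such a bug arises. It assumes the outlinks arrive as a single newline-joined string. The function names parse_outlinks_buggy and parse_outlinks_fixed are hypothetical and are not part of brozzler.

```python
def parse_outlinks_buggy(joined):
    # Hypothetical helper: splitting an empty string yields [''],
    # so the resulting set contains one empty-string element.
    return set(joined.split("\n"))


def parse_outlinks_fixed(joined):
    # Hypothetical helper: drop empty entries so that "no outlinks"
    # becomes a genuinely empty set.
    return {url for url in joined.split("\n") if url}


if __name__ == "__main__":
    print(parse_outlinks_buggy(""))
    print(parse_outlinks_fixed(""))
    print(sorted(parse_outlinks_fixed("http://example.com/a\nhttp://example.com/b")))
```

A caller iterating over the buggy result would try to process an empty URL, which is the kind of downstream failure the commit message describes avoiding.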