Noah Levitt
|
5bcd10c228
|
extract area/@href links, and add test for outlink extraction
|
2017-04-05 12:09:48 -07:00 |
|
Noah Levitt
|
3d47805ec1
|
new model for crawling hashtags, each one is no longer a top-level page
|
2017-03-27 12:15:49 -07:00 |
|
Noah Levitt
|
a836269e95
|
remove some vestiges of old proxy stuff
|
2017-03-24 16:04:43 -07:00 |
|
Noah Levitt
|
a826fdc7ef
|
new test of frontier.seed_page
|
2017-03-24 15:45:40 -07:00 |
|
Noah Levitt
|
934190084c
|
Refactor the way the proxy is configured. Job/site settings "proxy" and "enable_warcprox_features" are gone. Brozzler-worker now has mutually exclusive options --proxy and --warcprox-auto. --warcprox-auto means find an instance of warcprox in the service registry, and enable warcprox features. If --proxy is provided, brozzler-worker determines whether the proxy is warcprox by consulting http://{proxy_address}/status (see https://github.com/internetarchive/warcprox/commit/8caae0d7d3), and enables warcprox features if so.
|
2017-03-24 13:55:23 -07:00 |
|
Noah Levitt
|
34bb64297f
|
fix frontier tests now that enable_warcprox_features is simply omitted by default
|
2017-03-22 15:46:12 -07:00 |
|
Noah Levitt
|
eeee523b18
|
three-value "brozzled" parameter for frontier.site_pages(); fix thing where every Site got a list of all the seeds from the job; and some more frontier tests to catch these kinds of things
|
2017-03-20 17:28:16 -07:00 |
|
Noah Levitt
|
0e9f4a0c26
|
forgot to add the new test data
|
2017-03-20 12:33:52 -07:00 |
|
Noah Levitt
|
e9c7606318
|
oops remove pdb call
|
2017-03-20 12:14:11 -07:00 |
|
Noah Levitt
|
13130bd9d9
|
save info about embedded videos in page document in rethinkdb
|
2017-03-20 11:49:11 -07:00 |
|
Noah Levitt
|
0685c77d01
|
always save outlinks info on rethinkdb page object, get rid of 'remember_outlinks' option, to keep config simple, and because it's not a very expensive thing
|
2017-03-17 10:04:10 -07:00 |
|
Noah Levitt
|
6c81b40e28
|
if parent page has a redirect_url, check scope rules both with the parent_page original url and with the redirect url, with automated tests
|
2017-03-16 12:12:33 -07:00 |
|
Noah Levitt
|
12fb9eaa15
|
use urlcanon library for canonicalization, surtification, scope match rules
|
2017-03-15 14:59:51 -07:00 |
|
Noah Levitt
|
479f0f7e09
|
more automated tests of frontier stuff
|
2017-03-15 14:54:16 -07:00 |
|
Noah Levitt
|
9e1e002a71
|
turns out we want populate_defaults to happen in __init__, fix so things work right
|
2017-03-07 17:52:38 -08:00 |
|
Noah Levitt
|
01653c01d7
|
use updated doublethink library populate_defaults() to avoid problem where under certain circumstances field values from the database would be overwritten by defaults
|
2017-03-07 13:19:56 -08:00 |
|
Noah Levitt
|
242ff51ec7
|
fix bug with seed redirects where scope change was applied too late to affect scoping of outlinks from the seed (with automated tests)
|
2017-03-06 15:13:40 -08:00 |
|
Noah Levitt
|
40bbbb3524
|
add tests of backwards compatibility handling of start/stop times and fix a bug or two
|
2017-03-02 16:53:24 -08:00 |
|
Noah Levitt
|
569af05b11
|
rethinkstuff is now "doublethink"
|
2017-03-02 12:48:45 -08:00 |
|
Noah Levitt
|
2398031010
|
let the OS pick an available port, to avoid what appear to be timing issues causing multiple browsers to choose the same port
|
2017-02-22 12:44:19 -08:00 |
|
Noah Levitt
|
b409e49cfa
|
deprecate current scope rule syntax and create new syntax with slightly different semantics (to be documented), and add parent_url_regex scope rule; unit test for scoping
|
2017-02-15 16:46:45 -08:00 |
|
Noah Levitt
|
14e312e4c4
|
make sure site is not "claimed" when it's finished
|
2017-02-03 16:40:15 -08:00 |
|
Noah Levitt
|
a60878c5a7
|
support for resuming jobs, keeping track of each start and stop time, used to enforce time limits correctly
|
2017-02-03 14:56:12 -08:00 |
|
Noah Levitt
|
5c684779e5
|
pywb support for thumbnail: and screenshot: urls
|
2017-01-31 10:26:38 -08:00 |
|
Noah Levitt
|
4b6831b464
|
new flag Page.blocked_by_robots
|
2017-01-30 10:43:25 -08:00 |
|
Noah Levitt
|
5375b819dd
|
missed a spot
|
2017-01-20 23:59:31 -08:00 |
|
Noah Levitt
|
011d814ee2
|
tests for dismissal of javascript dialogs (alert, prompt, confirm)
|
2017-01-13 11:46:42 -08:00 |
|
Noah Levitt
|
70b67942a5
|
restore handling of 420 Reached limit, with a rudimentary test
|
2016-12-22 13:44:09 -08:00 |
|
Noah Levitt
|
e5fb6cb4b9
|
add import missing from test
|
2016-12-21 19:19:34 -08:00 |
|
Noah Levitt
|
eabb0fb114
|
restore support for on_response and on_request, with an automated test for on_response
|
2016-12-21 18:35:55 -08:00 |
|
Noah Levitt
|
f7427219cf
|
restore handling of "aw snap" or "he's dead jim"
|
2016-12-21 14:21:20 -08:00 |
|
Noah Levitt
|
86d6060a2d
|
loosen the find_available_port test slightly, since it seems to be not 100% predictable for reasons i haven't investigated
|
2016-12-20 17:52:21 -08:00 |
|
Noah Levitt
|
b24b229cb2
|
how did i miss this file?
|
2016-12-20 11:13:48 -08:00 |
|
Noah Levitt
|
7a40822e64
|
forgot to git add new test data
|
2016-12-19 18:10:07 -08:00 |
|
Noah Levitt
|
86ac48d6c3
|
generalized support for login doing automatic detection of login form on a page
|
2016-12-19 17:30:09 -08:00 |
|
Noah Levitt
|
9bcec54f4b
|
fix _find_available_port and its unit test
|
2016-12-07 14:08:34 -08:00 |
|
Noah Levitt
|
eed8b9ec30
|
little fixes
|
2016-12-07 11:20:10 -08:00 |
|
Noah Levitt
|
ce03381b92
|
move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test
|
2016-12-06 17:12:20 -08:00 |
|
Noah Levitt
|
72816d1058
|
don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence
|
2016-11-16 12:23:59 -08:00 |
|
Noah Levitt
|
24cc8377fb
|
robots.txt for testing
|
2016-11-16 12:12:17 -08:00 |
|
Noah Levitt
|
3aead6de93
|
monkey-patch reppy to support substring user-agent matching
|
2016-11-16 11:41:34 -08:00 |
|
Noah Levitt
|
5ac8994a24
|
rename webconsole to dashboard
|
2016-11-04 17:46:23 -07:00 |
|
Mouse Reeve
|
2215aaab21
|
Use warcprox if enable_warcprox_features is true
|
2016-10-18 17:39:33 -07:00 |
|
Noah Levitt
|
a370e7b987
|
tiny fix, and now the test passes for me
|
2016-10-14 19:21:26 -07:00 |
|
Noah Levitt
|
27452990ee
|
toward getting initial tests to pass
|
2016-10-14 18:26:48 -07:00 |
|
Noah Levitt
|
56e651baeb
|
working on basic integration tests
|
2016-10-13 17:12:35 -07:00 |
|
Noah Levitt
|
c864499a64
|
starting to create a framework for testing
|
2016-09-14 17:06:49 -07:00 |
|