241 Commits

Author SHA1 Message Date
Noah Levitt
7a40822e64 forgot to git add new test data 2016-12-19 18:10:07 -08:00
Noah Levitt
2f8f20bbb4 detect <input type="email"> as potential username field for login 2016-12-19 18:08:10 -08:00
Noah Levitt
86ac48d6c3 generalized support for login doing automatic detection of login form on a page 2016-12-19 17:30:09 -08:00
Noah Levitt
bc6e0d243f yet more refactoring of browser.py, clearer separation of purpose, Browser class manages browsing, sends most of the messages to chrome, WebsockReceiverThread handles messages that come back from chrome 2016-12-16 13:52:12 -08:00
Noah Levitt
534d2e63d6 bump version number in setup.py 2016-12-15 16:43:27 -08:00
Noah Levitt
f6333df6ef back to dev version number 2016-12-15 12:34:26 -08:00
Noah Levitt
85de2fad6a i dub thee 1.1b8 2016-12-15 12:33:34 -08:00
Noah Levitt
d68053764c fix bug handling page with zero outlinks 2016-12-09 16:43:23 -08:00
Noah Levitt
af1e1c75ec avoid infinite loop in case youtube-dl encounters redirect loop (which can be ok if cookies have been set or something) 2016-12-09 14:16:27 -08:00
Noah Levitt
f6a25aa4f0 brozzler logo svg with small default size 2016-12-08 15:16:02 -08:00
Noah Levitt
40b4d9bfe8 travis-ci slack integration 2016-12-07 14:46:29 -08:00
Noah Levitt
9bcec54f4b fix _find_available_port and its unit test 2016-12-07 14:08:34 -08:00
Noah Levitt
eed8b9ec30 little fixes 2016-12-07 11:20:10 -08:00
Noah Levitt
0b6c5346bd avoid broken version of websocket-client to fix https://github.com/internetarchive/brozzler/issues/28 2016-12-07 11:18:41 -08:00
Noah Levitt
e250c4ca89 wrong branch of warcprox in ansible install 2016-12-07 09:33:06 -08:00
Noah Levitt
d3063fbd2b move cookie db management code into chrome.py 2016-12-06 18:04:51 -08:00
Noah Levitt
ce03381b92 move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test 2016-12-06 17:12:20 -08:00
Noah Levitt
74009852d6 split Chrome class into its own module 2016-12-06 12:50:38 -08:00
Noah Levitt
3c43fdaced new utility brozzler-list-captures for looking up entries in the "captures" table 2016-11-30 00:52:14 +00:00
Noah Levitt
9567c088c8 in warcprox 2.0b2, captures table field has been renamed to "record_length" 2016-11-21 16:21:21 -08:00
Noah Levitt
55c9ae07b7 remove flickr behavior, flickr is better off with the default behavior for now 2016-11-16 17:16:48 -08:00
Noah Levitt
72816d1058 don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence 2016-11-16 12:23:59 -08:00
Noah Levitt
3aead6de93 monkey-patch reppy to support substring user-agent matching 2016-11-16 11:41:34 -08:00
Noah Levitt
398871d46b give vagrant vm enough memory so that tests pass consistently 2016-11-14 18:26:00 -08:00
Noah Levitt
a74247412c need warcprox to listen on public address because that's what it puts in the service registry 2016-11-14 10:03:40 -08:00
Noah Levitt
28b010a2ba back to dev version number 2016-11-11 14:58:55 -08:00
Noah Levitt
7aca046905 1.1b7 2016-11-11 14:58:07 -08:00
Noah Levitt
26b571219b use \n to delimit outlinks because urls can contain spaces (and anything else except [\n\t\0]) in the fragment part even after browser canonicalization 2016-11-11 14:14:47 -08:00
Noah Levitt
02bf23059e pass behavior_parameters from job configuration into Site objects 2016-11-09 13:43:10 -08:00
Noah Levitt
8e115b44fa add --behavior-parameters argument to brozzler-new-site 2016-11-09 13:12:36 -08:00
Noah Levitt
953e50d9a6 fix bug in final_bounces (not sure what I was thinking) 2016-11-09 13:12:14 -08:00
Noah Levitt
054cb255ac cat logs on travis-ci failure 2016-11-08 14:26:12 -08:00
Noah Levitt
125a31165a reppy 0.4.1 has a significantly different api apparently, so for now let's go back to 0.3.4 2016-11-08 14:11:46 -08:00
Noah Levitt
fe18d915f5 still trying to get installation of pip to work on travis-ci 2016-11-08 13:50:12 -08:00
Noah Levitt
f10b4c71e6 update for reppy api change and pin to current version of reppy 2016-11-08 13:39:32 -08:00
Noah Levitt
cba5fa4a0b tweaks to ansible config to try to get the deployment to run on travis-ci 2016-11-08 13:31:52 -08:00
Noah Levitt
9d66f294ec move behavior_parameters into top level of site configuration 2016-11-07 18:16:04 -08:00
Noah Levitt
abca90a128 install the virtualenv package with pip because the apt version is old and conflicts with the recent version of pip we're using 2016-11-07 17:51:43 -08:00
Noah Levitt
99feeab581 logging tweak 2016-11-04 17:53:02 -07:00
Noah Levitt
5ac8994a24 rename webconsole to dashboard 2016-11-04 17:46:23 -07:00
Noah Levitt
5bd4908e1d punycode host part of url to avoid errors doing WARCPROX_WRITE_RECORD 2016-10-26 13:50:23 -07:00
Noah Levitt
f30c143c66 avoid exception in case of url without host part 2016-10-26 12:45:24 -07:00
Noah Levitt
332912acd7 apparently response.status doesn't work sometimes; response.getcode() is documented so hopefully it keeps working 2016-10-25 17:50:49 -07:00
Noah Levitt
70ce642bee integer job ids are permitted as well as string 2016-10-21 21:25:16 +00:00
Noah Levitt
21891476c4 avoid use of __double_underscore member variables because they're special https://shahriar.svbtle.com/underscores-in-python 2016-10-18 18:57:11 -07:00
Noah Levitt
becd832ea3 bump version after merging accept-encoding pull request 2016-10-18 17:55:00 -07:00
Noah Levitt
aae34452f5 bump version number after merging travis-ci pull request 2016-10-18 17:48:45 -07:00
Noah Levitt
68a32fcbe2 bump version number after mouse's pull request 2016-10-18 17:45:55 -07:00
Noah Levitt
a370e7b987 tiny fix, and now the test passes for me 2016-10-14 19:21:26 -07:00
Noah Levitt
4044fcb647 fix pywb/brozzler replay of revisit records 2016-10-14 19:15:23 -07:00