791 commits

Author SHA1 Message Date
Noah Levitt
7a68599057 Merge branch 'refactor-browsing' into qa
* refactor-browsing:
  more shutdown tweaks
  improving shutdown process
  working on major refactoring of browser management
2016-12-15 12:28:21 -08:00
Noah Levitt
4186869bf9 Merge branch 'master' into qa
* master:
  fix bug handling page with zero outlinks
  avoid infinite loop in case youtube-dl encounters redirect loop (which can be ok if cookies have been set or something)
  brozzler logo svg with small default size
  travis-ci slack integration
  fix _find_available_port and its unit test
  little fixes
  avoid broken version of websocket-client to fix https://github.com/internetarchive/brozzler/issues/28
  wrong branch of warcprox in ansible install
  move cookie db management code into chrome.py
  move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test
  split Chrome class into its own module
  new utility brozzler-list-captures for looking up entries in the "captures" table
2016-12-15 12:07:29 -08:00
Noah Levitt
4bdad4729a more shutdown tweaks 2016-12-14 16:13:14 -08:00
Noah Levitt
5fa96b6438 improving shutdown process 2016-12-14 14:49:41 -08:00
Noah Levitt
f23f928c16 working on major refactoring of browser management 2016-12-09 16:50:11 -08:00
Noah Levitt
d68053764c fix bug handling page with zero outlinks 2016-12-09 16:43:23 -08:00
Noah Levitt
af1e1c75ec avoid infinite loop in case youtube-dl encounters redirect loop (which can be ok if cookies have been set or something) 2016-12-09 14:16:27 -08:00
Noah Levitt
f6a25aa4f0 brozzler logo svg with small default size 2016-12-08 15:16:02 -08:00
Noah Levitt
40b4d9bfe8 travis-ci slack integration 2016-12-07 14:46:29 -08:00
Noah Levitt
9bcec54f4b fix _find_available_port and its unit test 2016-12-07 14:08:34 -08:00
Noah Levitt
eed8b9ec30 little fixes 2016-12-07 11:20:10 -08:00
Noah Levitt
0b6c5346bd avoid broken version of websocket-client to fix https://github.com/internetarchive/brozzler/issues/28 2016-12-07 11:18:41 -08:00
Noah Levitt
e250c4ca89 wrong branch of warcprox in ansible install 2016-12-07 09:33:06 -08:00
Noah Levitt
d3063fbd2b move cookie db management code into chrome.py 2016-12-06 18:04:51 -08:00
Noah Levitt
ce03381b92 move _find_available_ports to chrome.py, changing the way it works so that browser:9200 doesn't get stuck at 9201 forever, which pushes 9201 to 9202 etc, and add a unit test 2016-12-06 17:12:20 -08:00
Noah Levitt
74009852d6 split Chrome class into its own module 2016-12-06 12:50:38 -08:00
Noah Levitt
3c43fdaced new utility brozzler-list-captures for looking up entries in the "captures" table 2016-11-30 00:52:14 +00:00
Noah Levitt
2eea50dcfb Merge branch 'master' into qa
* master:
  in warcprox 2.0b2, captures table field has been renamed to "record_length"
  remove flickr behavior, flickr is better off with the default behavior for now
  Update README.rst
  add travis-ci badge
2016-11-21 16:21:30 -08:00
Noah Levitt
9567c088c8 in warcprox 2.0b2, captures table field has been renamed to "record_length" 2016-11-21 16:21:21 -08:00
Noah Levitt
55c9ae07b7 remove flickr behavior, flickr is better off with the default behavior for now 2016-11-16 17:16:48 -08:00
Noah Levitt
899ee8a8dd Update README.rst 2016-11-16 12:26:50 -08:00
Noah Levitt
6bb9d68dce add travis-ci badge 2016-11-16 12:26:33 -08:00
Noah Levitt
eaa32ad3fc Merge branch 'master' into qa
* master:
  don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence
  robots.txt for testing
  monkey-patch reppy to support substring user-agent matching
  give vagrant vm enough memory so that tests pass consistently
  need warcprox to listen on public address because that's what it puts in the service registry
  looks like the problem may have been a bug in ansible 2.2.0.0, so pin to 2.1.3.0
2016-11-16 12:24:30 -08:00
Noah Levitt
72816d1058 don't check robots.txt when scheduling a new site to be crawled, but mark the seed Page as needs_robots_check, and delegate the robots check to brozzler-worker; new test of robots.txt adherence 2016-11-16 12:23:59 -08:00
Noah Levitt
24cc8377fb robots.txt for testing 2016-11-16 12:12:17 -08:00
Noah Levitt
3aead6de93 monkey-patch reppy to support substring user-agent matching 2016-11-16 11:41:34 -08:00
Noah Levitt
398871d46b give vagrant vm enough memory so that tests pass consistently 2016-11-14 18:26:00 -08:00
Noah Levitt
2b0a47c914 Merge pull request #27 from internetarchive/i2
update Instagram behavior, mostly css selectors
2016-11-14 12:40:55 -08:00
Noah Levitt
a74247412c need warcprox to listen on public address because that's what it puts in the service registry 2016-11-14 10:03:40 -08:00
Noah Levitt
c9b45a7e76 looks like the problem may have been a bug in ansible 2.2.0.0, so pin to 2.1.3.0 2016-11-14 09:58:13 -08:00
Barbara Miller
e01739743f Merge branch 'i2' into qa 2016-11-14 09:25:58 -08:00
Barbara Miller
12a054e6dc update behavior, mostly css selectors 2016-11-14 09:20:40 -08:00
Noah Levitt
28b010a2ba back to dev version number 2016-11-11 14:58:55 -08:00
Noah Levitt
7aca046905 1.1b7 2016-11-11 14:58:07 -08:00
Barbara Miller
eb3fad9c84 cp feature branch instagram.js 2016-11-11 14:51:11 -08:00
Barbara Miller
54ec6cf15b Merge branch 'i2' into qa 2016-11-11 14:44:10 -08:00
Barbara Miller
bb9334d757 jslint edits 2016-11-11 14:21:08 -08:00
Barbara Miller
d162a85a65 update markup, & simplify big image browse? 2016-11-11 14:21:08 -08:00
Noah Levitt
a80d6bcc9a Merge branch 'master' into qa
* master:
  use \n to delimit outlinks because urls can contain spaces (and anything else except [\n\t\0]) in the fragment part even after browser canonicalization
2016-11-11 14:19:37 -08:00
Noah Levitt
26b571219b use \n to delimit outlinks because urls can contain spaces (and anything else except [\n\t\0]) in the fragment part even after browser canonicalization 2016-11-11 14:14:47 -08:00
Barbara Miller
7093e66360 Merge branch 'i2' into qa 2016-11-11 13:34:44 -08:00
Barbara Miller
51dfb2a899 jslint edits 2016-11-11 13:33:09 -08:00
Barbara Miller
3c3a09f5c0 Merge branch 'i2' into qa 2016-11-10 17:21:33 -08:00
Barbara Miller
2f6767627b update markup, & simplify big image browse? 2016-11-10 17:21:20 -08:00
Noah Levitt
0eb07c9ca2 Merge branch 'master' into qa
* master:
  pass behavior_parameters from job configuration into Site objects
  add --behavior-parameters argument to brozzler-new-site
  fix bug in final_bounces (not sure what I was thinking)
  restore accidentally removed functionality handling page redirects and friends
  cat logs on travis-ci failure
  reppy 0.4.1 has a significantly different api apparently, so for now let's go back to 0.3.4
  still trying to get installation of pip to work on travis-ci
  update for reppy api change and pin to current version of reppy
  tweaks to ansible config to try to get the deployment to run on travis-ci
2016-11-09 13:43:24 -08:00
Noah Levitt
02bf23059e pass behavior_parameters from job configuration into Site objects 2016-11-09 13:43:10 -08:00
Noah Levitt
8e115b44fa add --behavior-parameters argument to brozzler-new-site 2016-11-09 13:12:36 -08:00
Noah Levitt
953e50d9a6 fix bug in final_bounces (not sure what I was thinking) 2016-11-09 13:12:14 -08:00
Noah Levitt
8889e4ab20 restore accidentally removed functionality handling page redirects and friends 2016-11-08 18:17:48 -08:00
Noah Levitt
054cb255ac cat logs on travis-ci failure 2016-11-08 14:26:12 -08:00