Commit graph

1048 commits

Author SHA1 Message Date
Noah Levitt
2aef00826b vagrant setup (unfinished) 2016-06-30 17:50:11 -05:00
Noah Levitt
79ad57669c do not send more than one SIGTERM when shutting down browser process, because on recent chromium on linux, the second sigterm abruptly ends the process, and sometimes leaves orphan subprocesses; also send TERM/KILL signals to the whole process group, another measure to avoid orphans; and adjust logging levels for captured chrome output 2016-06-30 17:10:27 -05:00
Noah Levitt
371590b578 command line utility brozzler-ensure-tables, creates rethinkdb tables if they don't already exist... brozzler normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster 2016-06-30 15:16:04 -05:00
Noah Levitt
d82feb14da Merge branch 'master' into qa
* master:
  implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top
  avoid "AttributeError: 'ExtractorError' object has no attribute 'code'" checking for 430 (soft limit) from youtube-dl
  set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure
  fix case where rethinkdb page already has claimed=True
  undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3)
  fix bug preventing brozzler-new-site from working, add note about brozzler-new-site in readme
  --trace level logging
2016-06-30 11:46:31 -05:00
Noah Levitt
9fd78fdbe8 implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top 2016-06-30 11:45:19 -05:00
Noah Levitt
a1910fc0fe avoid "AttributeError: 'ExtractorError' object has no attribute 'code'" checking for 430 (soft limit) from youtube-dl 2016-06-29 19:57:51 -05:00
Noah Levitt
79beddfc44 set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure 2016-06-29 19:47:25 -05:00
Noah Levitt
2e687b65fb fix case where rethinkdb page already has claimed=True 2016-06-29 19:29:18 -05:00
Noah Levitt
ffcf26b6c9 undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3) 2016-06-29 18:53:32 -05:00
Noah Levitt
7431ae0eb1 fix bug preventing brozzler-new-site from working, add note about brozzler-new-site in readme 2016-06-29 18:41:45 -05:00
Noah Levitt
479713e25b --trace level logging 2016-06-29 18:29:45 -05:00
Noah Levitt
8576c71c62 Merge branch 'master' into qa
* master:
  to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism
  calm logging, don't print stacktrace on 430 from youtube-dl
  fix buglet in creation of new least_hops on pages table
  renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
2016-06-29 16:56:12 -05:00
Noah Levitt
d04c5a31cc to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism 2016-06-29 00:02:25 +00:00
Noah Levitt
7a805a43d1 calm logging, don't print stacktrace on 430 from youtube-dl 2016-06-28 23:19:28 +00:00
Noah Levitt
e9c398caea fix buglet in creation of new least_hops on pages table 2016-06-28 23:14:23 +00:00
Noah Levitt
77c800f6a2 renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well 2016-06-28 15:13:48 -05:00
Noah Levitt
cf3004033f Merge branch 'master' into qa
* master:
  let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'"
2016-06-28 13:57:17 -05:00
Noah Levitt
e64a4d6985 let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'" 2016-06-28 13:56:30 -05:00
Noah Levitt
de889b7553 re-add submodule accidentally removed during merge 2016-06-28 12:37:57 -05:00
Noah Levitt
6674c96bc6 Merge branch 'master' into qa
* master:
  handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988)
  avoid hanging in case a page has no outlinks
  fix noVNC submodule path since brozzler webconsole has moved
  handle new bucket format in brozzler-webconsole
  fix brozzler.svg symlink
  convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
  make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies
  remove crufty docker and no-docker scripts
  note python 3.4 requirement in readme
2016-06-28 12:26:38 -05:00
Noah Levitt
772bcf0df6 handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988) 2016-06-28 12:23:32 -05:00
Noah Levitt
0bd687abde avoid hanging in case a page has no outlinks 2016-06-28 11:25:04 -05:00
Noah Levitt
6a11e1da2a fix noVNC submodule path since brozzler webconsole has moved 2016-06-28 16:15:16 +00:00
Noah Levitt
cb4a16e58c handle new bucket format in brozzler-webconsole 2016-06-28 00:13:01 +00:00
Noah Levitt
98915b3d86 fix brozzler.svg symlink 2016-06-27 20:01:35 +00:00
Noah Levitt
89474cb430 convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc) 2016-06-27 14:12:56 -05:00
Noah Levitt
e4f8efe376 make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies 2016-06-27 12:43:24 -05:00
Noah Levitt
08a9636e95 remove crufty docker and no-docker scripts 2016-06-27 00:11:37 -05:00
Barbara Miller
cdd62c0dd6 disable downloads & kccfed behavior 2016-06-25 15:05:57 -07:00
Noah Levitt
55fda6e892 note python 3.4 requirement in readme 2016-06-25 15:44:27 -05:00
Barbara Miller
8f726eac76 Merge branch 'master' into qa 2016-06-22 17:45:44 -07:00
Noah Levitt
366e467501 enhancements to the page thumbnail on the site page 2016-06-22 23:09:27 +00:00
Noah Levitt
9b3f3809cc expose full rethinkdb entry as yaml on job page 2016-06-22 22:29:07 +00:00
Noah Levitt
510456eef2 order the page thumbnails on site page by least number of hops, so the seed shows up first 2016-06-22 21:20:00 +00:00
Noah Levitt
2038598f41 fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''} 2016-06-22 17:43:53 +00:00
Noah Levitt
d198a69e45 recurse through all frames to find outlinks 2016-06-22 11:39:31 -05:00
Noah Levitt
3b615120d4 Merge branch 'master' of github.com:internetarchive/brozzler
* 'master' of github.com:internetarchive/brozzler:
  back to a dev version number
  update url
  back to a dev version number
  beta version number for pypi upload
  bump version number
  call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7)
  copy over fec.gov behavior from umbra master
2016-06-20 16:48:11 +00:00
Noah Levitt
5ccf5a9dcb fix site warcprox-meta lookup now that it is a native json object rather than a string 2016-06-20 16:48:01 +00:00
Barbara Miller
66d697e662 clickGetPDFs for kansascityfed 2016-06-17 12:23:46 -07:00
Noah Levitt
8d7bb582cb back to a dev version number 2016-06-16 14:19:00 -05:00
Noah Levitt
6237ed3f34 update url 2016-06-16 14:18:24 -05:00
Noah Levitt
9c2fe25dd0 back to a dev version number 2016-06-16 14:13:03 -05:00
Noah Levitt
b2d4cc5ff0 beta version number for pypi upload 2016-06-16 14:11:45 -05:00
Noah Levitt
81d709eed0 bump version number 2016-06-16 13:56:08 -05:00
Noah Levitt
1577cd8926 call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7) 2016-06-16 13:55:17 -05:00
Noah Levitt
98acc8dc92 copy over fec.gov behavior from umbra master 2016-06-16 13:53:28 -05:00
Noah Levitt
d75e8c394a switch flask requirement to recent release, suggest gunicorn for running the app 2016-06-15 22:00:39 +00:00
Barbara Miller
d11da721bd disable browser extensions 2016-06-15 11:50:05 -07:00
Barbara Miller
77185e2b9b kcfed test 2016-06-09 23:16:43 -07:00
Noah Levitt
b0ed4b8128 Merge pull request #7 from galgeek/disable_extensions
disable browser extensions
2016-06-06 11:57:46 -07:00
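
Note on 79ad57669c: the shutdown approach that commit message describes (a single SIGTERM, signals sent to the whole process group, SIGKILL as a fallback) can be sketched roughly as below. This is an illustration only, not the code from the commit. The function names `start_in_new_group` and `stop_process_group` and the `grace_seconds` parameter are hypothetical, and the sketch assumes a POSIX system and that the child was started as the leader of its own process group.

```python
import os
import signal
import subprocess


def start_in_new_group(argv):
    """Start a child as leader of a new session (and so a new process group).

    Hypothetical helper: with its own group, signalling the group later
    does not also signal the parent.
    """
    return subprocess.Popen(argv, start_new_session=True)


def stop_process_group(proc, grace_seconds=5.0):
    """Stop proc and its descendants; return proc's exit status.

    Sends SIGTERM exactly once to the whole process group, waits up to
    grace_seconds, then falls back to SIGKILL for the group.
    """
    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)  # one SIGTERM only, never repeated
    except ProcessLookupError:
        # The process (group) is already gone; just collect the status.
        return proc.wait()

    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)  # last resort, still group-wide
        except ProcessLookupError:
            pass
        return proc.wait()
```

Signalling the group rather than the single pid is what reaches subprocesses that the browser itself spawned. That only works safely if the child has its own group, which is why the sketch starts it with `start_new_session=True`.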