Commit Graph

  • 2046ee36e0 add a timeout to the one post-behavior step that didn't already have one (getting a screenshot), and majorly refactored the post-behavior code to incorporate timeouts automatically into each step, and hopefully make it easier to follow Noah Levitt 2016-07-28 19:59:28 -05:00
  • b2b07b79a9 logging tweaks Noah Levitt 2016-07-28 10:19:30 -05:00
  • dd2d8c89e3 reduce log level of messages from chrome, since it spews stuff that looks bad but usually isn't Noah Levitt 2016-07-27 18:48:13 -05:00
  • 041a4970ce back to a dev version number Noah Levitt 2016-07-27 16:57:42 -05:00
  • d94a7c23b9 1.1b3 for upload to pypi 1.1b3 Noah Levitt 2016-07-27 16:53:10 -05:00
  • fdc2f87a0e Merge branch 'master' into qa Noah Levitt 2016-07-26 19:47:50 -05:00
  • c4bdb6c1fd pass behavior template parameters on to behavior - fixes umbra's ability to log in with parameters received from amqp Noah Levitt 2016-07-26 19:47:09 -05:00
  • c685a4432c Merge pull request #9 from internetarchive/AITFIVE-841 Noah Levitt 2016-07-26 11:28:32 -05:00
  • c2dc2fee2a Changing EnvironmentError to OSError Adam Miller 2016-07-26 00:46:16 +00:00
  • 77dabd4057 Fix naming conventions. Adam Miller 2016-07-26 00:39:50 +00:00
  • 2029964a74 Create cookie directory if it doesn't exist. Add debug messages for cookie db read/write. Adam Miller 2016-07-25 23:36:14 +00:00
  • 1cb6653fab Read/Write Cookie DB file when creating and stopping browser instance. Adam Miller 2016-07-22 00:22:28 +00:00
  • 127002b77d brozzler[easy] requires warcprox>=2.0b1 Noah Levitt 2016-07-21 19:14:11 -05:00
  • 37bff5328b look for a sensible default chromium/chrome executable Noah Levitt 2016-07-19 15:57:24 -05:00
  • c902a70450 tweak thread names Noah Levitt 2016-07-19 14:33:57 -05:00
  • ac3a71742d convert domain specific rule url prefixes to our style of surt Noah Levitt 2016-07-19 14:31:43 -05:00
  • 7d9f019e67 have pywb support loading warc records from warc files still being written (look for foo.warc.gz.open) Noah Levitt 2016-07-17 20:09:56 -05:00
  • b62d5a6350 install flash plugin for chromium Noah Levitt 2016-07-13 15:23:50 -05:00
  • 04e1e5277e make state dumping signal handler more robust (now you can kill -QUIT a thousand times in a row without causing problems) Noah Levitt 2016-07-13 14:52:05 -05:00
  • c6e6b34e82 handle case where websocket connection is unexpectedly closed during the post-behavior phase Noah Levitt 2016-07-06 18:17:01 -05:00
  • 3bf3c80720 implement timeout and retries to work around issue where sometimes we receive no result message after requesting outlinks Noah Levitt 2016-07-06 17:54:36 -05:00
  • be58fb46f7 forgot to commit easy.py, add pywb.py with support for pywb rethinkdb index, and make brozzler-easy also run pywb Noah Levitt 2016-07-06 14:52:00 -05:00
  • 3b252002b7 working on brozzler-easy, single process with brozzler-worker and warcprox working together (pywb to be added) Noah Levitt 2016-07-05 18:46:42 -05:00
  • 1a7b94cae7 twirldown for site yaml on site page Noah Levitt 2016-07-05 21:42:36 +00:00
  • f825e76371 give master a version number considered later than the one up on pypi (1.1b3.dev45 > 1.1b2) Noah Levitt 2016-07-05 10:44:48 -05:00
  • 0b9ce94226 in vagrant/ansible, install brozzler from this checkout instead of from github master Noah Levitt 2016-07-01 15:45:39 -05:00
  • 3e128d2b27 option to save list of outlinks (categorized as "accepted", "blocked" (by robots), or "rejected") per page in rethinkdb (to be used by archive-it for out-of-scope reporting) Noah Levitt 2016-07-01 15:23:46 -05:00
  • 01e38ea8c7 oops didn't mean to leave that windows-only subprocess flag Noah Levitt 2016-07-01 14:07:04 -05:00
  • ad502f33da remove accidentally committed playbook.retry Noah Levitt 2016-06-30 17:56:56 -05:00
  • 2aef00826b vagrant setup (unfinished) Noah Levitt 2016-06-30 17:50:11 -05:00
  • 79ad57669c do not send more than one SIGTERM when shutting down browser process, because on recent chromium on linux, the second sigterm abruptly ends the process, and sometimes leaves orphan subprocesses; also send TERM/KILL signals to the whole process group, another measure to avoid orphans; and adjust logging levels for captured chrome output Noah Levitt 2016-06-30 17:10:27 -05:00
  • 371590b578 command line utility brozzler-ensure-tables, creates rethinkdb tables if they don't already exist... brozzler normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster Noah Levitt 2016-06-30 15:16:04 -05:00
  • d82feb14da Merge branch 'master' into qa Noah Levitt 2016-06-30 11:46:31 -05:00
  • 9fd78fdbe8 implement timeout to work around issue where sometimes we receive no result message after requesting scroll to top Noah Levitt 2016-06-30 11:45:19 -05:00
  • a1910fc0fe avoid "AttributeError: 'ExtractorError' object has no attribute 'code'" checking for 430 (soft limit) from youtube-dl Noah Levitt 2016-06-29 19:57:51 -05:00
  • 79beddfc44 set Browser._chrome_instance=None if _chrome_instance.start() throws exception, to avoid endless loop after one failure Noah Levitt 2016-06-29 19:47:25 -05:00
  • 2e687b65fb fix case where rethinkdb page already has claimed=True Noah Levitt 2016-06-29 19:29:18 -05:00
  • ffcf26b6c9 undo accidentally committed change to browser startup timeout, and remove now misleading comment about browser ports (see https://github.com/internetarchive/brozzler/pull/3) Noah Levitt 2016-06-29 18:53:32 -05:00
  • 7431ae0eb1 fix bug preventing brozzler-new-site from working, add note about brozzler-new-site in readme Noah Levitt 2016-06-29 18:41:45 -05:00
  • 479713e25b --trace level logging Noah Levitt 2016-06-29 18:29:45 -05:00
  • 8576c71c62 Merge branch 'master' into qa Noah Levitt 2016-06-29 16:56:12 -05:00
  • d04c5a31cc to avoid infinite loops in some cases, ignore the "claimed" field in the rethinkdb table "pages", because if a page is left "claimed", it must have been because of some error... site.claimed is the real claiming mechanism Noah Levitt 2016-06-29 00:02:25 +00:00
  • 7a805a43d1 calm logging, don't print stacktrace on 430 from youtube-dl Noah Levitt 2016-06-28 23:19:28 +00:00
  • e9c398caea fix buglet in creation of new least_hops on pages table Noah Levitt 2016-06-28 23:14:23 +00:00
  • 77c800f6a2 renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well Noah Levitt 2016-06-28 15:13:48 -05:00
  • cf3004033f Merge branch 'master' into qa Noah Levitt 2016-06-28 13:57:17 -05:00
  • e64a4d6985 let youtube-dl write to a temporary directory instead of /dev/null, to fix errors like this "youtube_dl.utils.DownloadError: ERROR: unable to open for writing: [Errno 13] Permission denied: '/dev/null-Frag0.part'" Noah Levitt 2016-06-28 13:56:30 -05:00
  • de889b7553 re-add submodule accidentally removed during merge Noah Levitt 2016-06-28 12:37:57 -05:00
  • 6674c96bc6 Merge branch 'master' into qa Noah Levitt 2016-06-28 12:26:38 -05:00
  • 772bcf0df6 handle "undefined" in list of frames when extracting outlinks (fixes ARI-4988) Noah Levitt 2016-06-28 12:23:32 -05:00
  • 0bd687abde avoid hanging in case a page has no outlinks Noah Levitt 2016-06-28 11:25:04 -05:00
  • 6a11e1da2a fix noVNC submodule path since brozzler webconsole has moved Noah Levitt 2016-06-28 16:15:16 +00:00
  • cb4a16e58c handle new bucket format in brozzler-webconsole Noah Levitt 2016-06-28 00:13:01 +00:00
  • 98915b3d86 fix brozzler.svg symlink Noah Levitt 2016-06-27 20:01:35 +00:00
  • 89474cb430 convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc) Noah Levitt 2016-06-27 14:12:56 -05:00
  • e4f8efe376 make brozzler-webconsole a part of the main brozzler package, using optional "extras_require" dependencies Noah Levitt 2016-06-27 12:43:24 -05:00
  • 08a9636e95 remove crufty docker and no-docker scripts Noah Levitt 2016-06-27 00:11:37 -05:00
  • cdd62c0dd6 disable downloads & kccfed behavior Barbara Miller 2016-06-25 15:05:37 -07:00
  • 55fda6e892 note python 3.4 requirement in readme Noah Levitt 2016-06-25 15:44:27 -05:00
  • 8f726eac76 Merge branch 'master' into qa Barbara Miller 2016-06-22 17:45:44 -07:00
  • 366e467501 enhancements to the page thumbnail on the site page Noah Levitt 2016-06-22 23:09:27 +00:00
  • 9b3f3809cc expose full rethinkdb entry as yaml on job page Noah Levitt 2016-06-22 22:29:07 +00:00
  • 510456eef2 order the page thumbnails on site page by least number of hops, so the seed shows up first Noah Levitt 2016-06-22 21:20:00 +00:00
  • 2038598f41 fix bug in case no outlinks are found, make brozzler.browser.browse_page() return an empty set instead of a set with one element which is an empty string {''} Noah Levitt 2016-06-22 17:43:53 +00:00
  • d198a69e45 recurse through all frames to find outlinks Noah Levitt 2016-06-22 11:39:31 -05:00
  • 3b615120d4 Merge branch 'master' of github.com:internetarchive/brozzler Noah Levitt 2016-06-20 16:48:11 +00:00
  • 5ccf5a9dcb fix site warcprox-meta lookup now that it is a native json object rather than a string Noah Levitt 2016-06-20 16:48:01 +00:00
  • 66d697e662 clickGetPDFs for kansascityfed Barbara Miller 2016-05-08 19:55:50 -07:00
  • 8d7bb582cb back to a dev version number Noah Levitt 2016-06-16 14:19:00 -05:00
  • 6237ed3f34 update url 1.1b2 Noah Levitt 2016-06-16 14:18:24 -05:00
  • 9c2fe25dd0 back to a dev version number Noah Levitt 2016-06-16 14:13:03 -05:00
  • b2d4cc5ff0 beta version number for pypi upload Noah Levitt 2016-06-16 14:11:45 -05:00
  • 81d709eed0 bump version number Noah Levitt 2016-06-16 13:56:08 -05:00
  • 1577cd8926 call clearInterval when umbraBehaviorFinished is about to return true (see 1ef528eea7) Noah Levitt 2016-06-16 13:55:17 -05:00
  • 98acc8dc92 copy over fec.gov behavior from umbra master Noah Levitt 2016-06-16 13:53:28 -05:00
  • d75e8c394a switch flask requirement to recent release, suggest gunicorn for running the app Noah Levitt 2016-06-15 22:00:39 +00:00
  • d11da721bd disable browser extensions Barbara Miller 2016-05-09 13:04:49 -07:00
  • 77185e2b9b kcfed test Barbara Miller 2016-05-27 22:20:36 -07:00
  • b0ed4b8128 Merge pull request #7 from galgeek/disable_extensions Noah Levitt 2016-06-06 11:57:46 -07:00
  • c63c21c30a Merge pull request #6 from ato/document-config Noah Levitt 2016-06-06 11:56:45 -07:00
  • 1c1237d07e disable browser extensions Barbara Miller 2016-05-09 13:04:49 -07:00
  • cc6403e031 clickGetPDFs.template Barbara Miller 2016-05-21 19:27:05 -07:00
  • a2d50e60ff huffpo slideshow custom behavior Barbara Miller 2016-05-17 17:40:41 -07:00
  • 92f8f7c16d Merge pull request #5 from ato/fix-brozzler-new-site Noah Levitt 2016-05-17 10:14:48 -07:00
  • 484805fbda proxy is not supposed to have http:// prefix Alex Osborne 2016-05-17 16:20:38 +10:00
  • 02af30edd4 Document the job config format Alex Osborne 2016-05-17 15:20:09 +10:00
  • a939689d44 Fix warcprox_meta's default value and json import Alex Osborne 2016-05-17 13:51:42 +10:00
  • 182cbfd0ce bump version so we can upload to pypi and fix the readme Noah Levitt 2016-05-11 12:10:23 -07:00
  • dd2211df31 yes you can install brozzler from the outside world now! Noah Levitt 2016-05-11 12:05:50 -07:00
  • 7dd841cae2 Merge branch 'master' into qa Noah Levitt 2016-05-11 00:47:05 +00:00
  • 6f6216e432 catch exception from rethinkdb when unregistering from the service registry at shutdown Noah Levitt 2016-05-11 00:46:50 +00:00
  • c6e0e7c507 correctly handle site with no pages (which means the seed was blocked by robots.txt) in frontier.seed_page Noah Levitt 2016-05-11 00:45:47 +00:00
  • 43e3ada240 Merge pull request #4 from galgeek/qa Barbara Miller 2016-05-10 14:55:28 -07:00
  • 34d0f228dc multiclicks behavior Barbara Miller 2016-05-09 22:50:40 -07:00
  • 317a5eb99d without sudo, psutil.net_connections() raises psutil.AccessDenied on mac; in this case, silently try running chrome on the unvetted configured port Noah Levitt 2016-05-09 17:25:14 -07:00
  • 1141c5951e add psutil dependency Noah Levitt 2016-05-09 17:19:53 -07:00
  • c12090b3ef oops, no "+" there Noah Levitt 2016-05-07 01:46:36 +00:00
  • 464da5c3a6 avoid errors with old versions of pip or non-utf-8 locales by specifying the encoding of README.rst Noah Levitt 2016-05-07 01:46:15 +00:00
  • 1445aa9976 make Site.warcprox_meta a special thing, replacing Site.extra_headers; this way, warcprox_meta is a dictionary in rethinkdb rather than a long json string Noah Levitt 2016-05-05 23:23:52 +00:00
  • 07e15e26bd Merge pull request #3 from internetarchive/AITFIVE-859 Noah Levitt 2016-05-05 16:00:38 -07:00