Merge pull request #2 from internetarchive/master

Updating to upstream origin
2025-12-10 06:15:32 -05:00 · 2018-08-17 14:26:46 -04:00 · 2081e6388a
commit 2081e6388a
parent bd78e07232 d19e139101
12 changed files with 193 additions and 108 deletions
--- a/README.rst
+++ b/README.rst
@ -1,14 +1,20 @@
 .. image:: https://travis-ci.org/internetarchive/brozzler.svg?branch=master
    :target: https://travis-ci.org/internetarchive/brozzler

-.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
+.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b12/brozzler/dashboard/static/brozzler.svg
   :width: 60px

 |logo| brozzler
 ===============
 "browser" \| "crawler" = "brozzler"

-Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It employs `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture capabilities, warcprox to write content to Web ARChive (WARC) files, `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to index captured URLs, a native dashboard for crawl job monitoring, and a customized Python Wayback interface for archival replay.
+Brozzler is a distributed web crawler (爬虫) that uses a real browser (Chrome
+or Chromium) to fetch pages and embedded URLs and to extract links. It employs
+`youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media capture
+capabilities and `rethinkdb <https://github.com/rethinkdb/rethinkdb>`_ to
+manage crawl state.
+
+Brozzler is designed to work in conjunction with warcprox for web archiving.

 Requirements
 ------------
@ -17,12 +23,21 @@ Requirements
 - RethinkDB deployment
 - Chromium or Google Chrome >= version 64

-Note: The browser requires a graphical environment to run. When brozzler is run on a server, this may require deploying some additional infrastructure (typically X11; Xvfb does not support screenshots, however Xvnc4 from package vnc4server, does). The `vagrant configuration <vagrant/>`_ in the brozzler repository (still a work in progress) has an example setup. 
+Note: The browser requires a graphical environment to run. When brozzler is run
+on a server, this may require deploying some additional infrastructure,
+typically X11. Xvnc4 and Xvfb are X11 variants that are suitable for use on a
+server, because they don't display anything to a physical screen. The `vagrant
+configuration <vagrant/>`_ in the brozzler repository has an example setup
+using Xvnc4. (When last tested, chromium on Xvfb did not support screenshots,
+so Xvnc4 is preferred at this time.)

 Getting Started
 ---------------

-The easiest way to get started with brozzler for web archiving is with ``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox, brozzler wayback, and brozzler-dashboard, configured to work with each other in a single process.
+The easiest way to get started with brozzler for web archiving is with
+``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox, brozzler
+wayback, and brozzler-dashboard, configured to work with each other in a single
+process.

 Mac instructions:

@ -45,7 +60,8 @@ Mac instructions:
    # start brozzler-easy
    brozzler-easy

-At this point brozzler-easy will start archiving your site. Results will be immediately available for playback in pywb at http://localhost:8880/brozzler/.
+At this point brozzler-easy will start archiving your site. Results will be
+immediately available for playback in pywb at http://localhost:8880/brozzler/.

 *Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
 does not take advantage of brozzler's distributed nature.*
@ -72,7 +88,9 @@ Submit sites not tied to a job::
 Job Configuration
 -----------------

-Brozzler jobs are defined using YAML files. Options may be specified either at the top-level or on individual seeds. At least one seed URL must be specified, however everything else is optional. For details, see `<job-conf.rst>`_.
+Brozzler jobs are defined using YAML files. Options may be specified either at
+the top-level or on individual seeds. At least one seed URL must be specified,
+however everything else is optional. For details, see `<job-conf.rst>`_.

 ::

@ -116,7 +134,9 @@ See ``brozzler-dashboard --help`` for configuration options.
 Brozzler Wayback
 ----------------

-Brozzler comes with a customized version of `pywb <https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb "captures" table (populated by warcprox) as its index.
+Brozzler comes with a customized version of `pywb
+<https://github.com/ikreymer/pywb>`_, which supports using the rethinkdb
+"captures" table (populated by warcprox) as its index.

 To use, first install dependencies.

@ -150,28 +170,11 @@ Run pywb like so:

 Then browse http://localhost:8880/brozzler/.

-
 Headless Chrome (experimental)
--------------------------------
+------------------------------

-`Headless Chromium <https://chromium.googlesource.com/chromium/src/+/master/headless/README.md>`_ is now available in stable Chrome releases for 64-bit Linux and may be used to run the browser without a visible window or X11.
-
-To try this out, create a wrapper script like ~/bin/chrome-headless.sh:
-
-::
-
-    #!/bin/bash
-    exec /opt/google/chrome/chrome --headless --disable-gpu "$@"
-
-Run brozzler passing the path to the wrapper script as the ``--chrome-exe``
-option:
-
-::
-
-    chmod +x ~/bin/chrome-headless.sh
-    brozzler-worker --chrome-exe ~/bin/chrome-headless.sh
-
-Beware: Chrome's headless mode is still very new and has `unresolved issues <https://bugs.chromium.org/p/chromium/issues/list?can=2&q=Proj%3DHeadless>`_. Its use with brozzler has not yet been extensively tested. You may experience hangs or crashes with some types of content. For the moment we recommend using Chrome's regular mode instead.
+Brozzler is known to work nominally with Chrome/Chromium in headless mode, but
+this has not yet been extensively tested.

 License
 -------
--- a/brozzler/behaviors.yaml
+++ b/brozzler/behaviors.yaml
@ -27,7 +27,7 @@
  default_parameters:
    actions:
      - selector: a.coreSpriteDismissLarge
-      - selector: div._mck9w a
+      - selector: a>div[role='button']
        firstMatchOnly: true
      - selector: a.coreSpriteRightPaginationArrow
        repeatSameElement: true
--- a/brozzler/cli.py
+++ b/brozzler/cli.py
@ -156,13 +156,12 @@ def brozzle_page(argv=None):
            '--proxy', dest='proxy', default=None, help='http proxy')
    arg_parser.add_argument(
            '--skip-extract-outlinks', dest='skip_extract_outlinks',
-            action='store_true', help=argparse.SUPPRESS)
+            action='store_true')
    arg_parser.add_argument(
            '--skip-visit-hashtags', dest='skip_visit_hashtags',
-            action='store_true', help=argparse.SUPPRESS)
+            action='store_true')
    arg_parser.add_argument(
-            '--skip-youtube-dl', dest='skip_youtube_dl',
-            action='store_true', help=argparse.SUPPRESS)
+            '--skip-youtube-dl', dest='skip_youtube_dl', action='store_true')
    add_common_options(arg_parser, argv)

    args = arg_parser.parse_args(args=argv[1:])
--- a/brozzler/dashboard/__init__.py
+++ b/brozzler/dashboard/__init__.py
@ -24,7 +24,7 @@ try:
 except ImportError as e:
    logging.critical(
            '%s: %s\n\nYou might need to run "pip install '
-            'brozzler[dashboard]".\nSee readme.rst for more information.',
+            'brozzler[dashboard]".\nSee README.rst for more information.',
            type(e).__name__, e)
    sys.exit(1)
 import doublethink
--- a/brozzler/easy.py
+++ b/brozzler/easy.py
@ -31,7 +31,7 @@ try:
 except ImportError as e:
    logging.critical(
            '%s: %s\n\nYou might need to run "pip install '
-            'brozzler[easy]".\nSee readme.rst for more information.',
+            'brozzler[easy]".\nSee README.rst for more information.',
            type(e).__name__, e)
    sys.exit(1)
 import argparse
--- a/brozzler/pywb.py
+++ b/brozzler/pywb.py
@ -31,7 +31,7 @@ try:
 except ImportError as e:
    logging.critical(
            '%s: %s\n\nYou might need to run "pip install '
-            'brozzler[easy]".\nSee readme.rst for more information.',
+            'brozzler[easy]".\nSee README.rst for more information.',
            type(e).__name__, e)
    sys.exit(1)
 import doublethink
@ -270,7 +270,7 @@ Run pywb like so:

    $ PYWB_CONFIG_FILE=pywb.yml brozzler-wayback

-See readme.rst for more information.
+See README.rst for more information.
 '''

 # copied and pasted from cdxdomainspecific.py, only changes are commented as
--- a/brozzler/robots.py
+++ b/brozzler/robots.py
@ -46,20 +46,21 @@ def _reppy_rules_getitem(self, agent):
    return self.agents.get('*')
 reppy.parser.Rules.__getitem__ = _reppy_rules_getitem

+class _SessionRaiseOn420(requests.Session):
+    timeout = 60
+    def get(self, url, *args, **kwargs):
+        res = super().get(url, timeout=self.timeout, *args, **kwargs)
+        if res.status_code == 420 and 'warcprox-meta' in res.headers:
+            raise brozzler.ReachedLimit(
+                    warcprox_meta=json.loads(res.headers['warcprox-meta']),
+                    http_payload=res.text)
+        else:
+            return res
+
 _robots_caches = {}  # {site_id:reppy.cache.RobotsCache}
 def _robots_cache(site, proxy=None):
-    class SessionRaiseOn420(requests.Session):
-        def get(self, url, *args, **kwargs):
-            res = super().get(url, *args, **kwargs)
-            if res.status_code == 420 and 'warcprox-meta' in res.headers:
-                raise brozzler.ReachedLimit(
-                        warcprox_meta=json.loads(res.headers['warcprox-meta']),
-                        http_payload=res.text)
-            else:
-                return res
-
    if not site.id in _robots_caches:
-        req_sesh = SessionRaiseOn420()
+        req_sesh = _SessionRaiseOn420()
        req_sesh.verify = False   # ignore cert errors
        if proxy:
            proxie = "http://%s" % proxy
@ -68,7 +69,8 @@ def _robots_cache(site, proxy=None):
            req_sesh.headers.update(site.extra_headers())
        if site.user_agent:
            req_sesh.headers['User-Agent'] = site.user_agent
-        _robots_caches[site.id] = reppy.cache.RobotsCache(session=req_sesh)
+        _robots_caches[site.id] = reppy.cache.RobotsCache(
+                session=req_sesh, disallow_forbidden=False)

    return _robots_caches[site.id]

@ -76,13 +78,9 @@ def is_permitted_by_robots(site, url, proxy=None):
    '''
    Checks if `url` is permitted by robots.txt.

-    In case of problems fetching robots.txt, different things can happen.
-    Reppy (the robots.txt parsing library) handles some exceptions internally
-    and applies an appropriate policy. It bubbles up other exceptions. Of
-    these, there are two kinds that this function raises for the caller to
-    handle, described below. Yet other types of exceptions are caught, and the
-    fetch is retried up to 10 times. In this case, after the 10th failure, the
-    function returns `False` (i.e. forbidden by robots).
+    Treats any kind of error fetching robots.txt as "allow all". See
+    http://builds.archive.org/javadoc/heritrix-3.x-snapshot/org/archive/modules/net/CrawlServer.html#updateRobots(org.archive.modules.CrawlURI)
+    for some background on that policy.

    Returns:
        bool: `True` if `site.ignore_robots` is set, or if `url` is permitted
@ -95,29 +93,21 @@ def is_permitted_by_robots(site, url, proxy=None):
    if site.ignore_robots:
        return True

-    tries_left = 10
-    while True:
-        try:
-            result = _robots_cache(site, proxy).allowed(
-                    url, site.user_agent or "brozzler")
-            return result
-        except Exception as e:
-            if isinstance(e, reppy.exceptions.ServerError) and isinstance(
-                    e.args[0], brozzler.ReachedLimit):
-                raise e.args[0]
-            elif hasattr(e, 'args') and isinstance(
-                    e.args[0], requests.exceptions.ProxyError):
-                # reppy has wrapped an exception that we want to bubble up
-                raise brozzler.ProxyError(e)
-            else:
-                if tries_left > 0:
-                    logging.warn(
-                            "caught exception fetching robots.txt (%r tries "
-                            "left) for %r: %r", tries_left, url, e)
-                    tries_left -= 1
-                else:
-                    logging.error(
-                            "caught exception fetching robots.txt (0 tries "
-                            "left) for %r: %r", url, e, exc_info=True)
-                    return False
+    try:
+        result = _robots_cache(site, proxy).allowed(
+                url, site.user_agent or "brozzler")
+        return result
+    except Exception as e:
+        if isinstance(e, reppy.exceptions.ServerError) and isinstance(
+                e.args[0], brozzler.ReachedLimit):
+            raise e.args[0]
+        elif hasattr(e, 'args') and isinstance(
+                e.args[0], requests.exceptions.ProxyError):
+            # reppy has wrapped an exception that we want to bubble up
+            raise brozzler.ProxyError(e)
+        else:
+            logging.warn(
+                    "returning true (permitted) after problem fetching "
+                    "robots.txt for %r: %r", url, e)
+            return True
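
Note on the robots.py change above: the new behaviour is "fail open", meaning any problem fetching robots.txt is treated as permission to crawl. The following is a minimal sketch of that policy only, written against the standard library's ``urllib.robotparser`` instead of reppy. The names ``permitted_fail_open`` and ``fetch_robots_txt`` are hypothetical and not part of brozzler, and the special handling of ``ReachedLimit`` and ``ProxyError`` shown in the diff is left out.

```python
import logging
import urllib.robotparser


def permitted_fail_open(fetch_robots_txt, url, user_agent="brozzler"):
    """Return True if `url` may be fetched.

    `fetch_robots_txt` is a hypothetical callable that returns the text of
    robots.txt or raises on any fetch problem. Any such problem is treated
    as "allow all", mirroring the policy described in the diff above.
    """
    try:
        robots_txt = fetch_robots_txt()
    except Exception as e:
        logging.warning(
                "returning True (permitted) after problem fetching "
                "robots.txt for %r: %r", url, e)
        return True

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the fetcher in as a callable keeps the sketch independent of any HTTP library, so the policy can be exercised without a network connection.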

--- a/brozzler/worker.py
+++ b/brozzler/worker.py
@ -113,7 +113,11 @@ class YoutubeDLSpy(urllib.request.BaseHandler):
 class BrozzlerWorker:
    logger = logging.getLogger(__module__ + "." + __qualname__)

-    HEARTBEAT_INTERVAL = 20.0
+    # 3⅓ min heartbeat interval => 10 min ttl
+    # This is kind of a long time, because `frontier.claim_sites()`, which runs
+    # in the same thread as the heartbeats, can take a while on a busy brozzler
+    # cluster with slow rethinkdb.
+    HEARTBEAT_INTERVAL = 200.0
    SITE_SESSION_MINUTES = 15

    def __init__(
@ -346,7 +350,8 @@ class BrozzlerWorker:
                raise

    def full_and_thumb_jpegs(self, large_png):
-        img = PIL.Image.open(io.BytesIO(large_png))
+        # these screenshots never have any alpha (right?)
+        img = PIL.Image.open(io.BytesIO(large_png)).convert('RGB')

        out = io.BytesIO()
        img.save(out, "jpeg", quality=95)
--- a/setup.py
+++ b/setup.py
@ -32,12 +32,12 @@ def find_package_data(package):

 setuptools.setup(
        name='brozzler',
-        version='1.1b13.dev291',
+        version='1.4.dev299',
        description='Distributed web crawling with browsers',
        url='https://github.com/internetarchive/brozzler',
        author='Noah Levitt',
        author_email='nlevitt@archive.org',
-        long_description=open('readme.rst', mode='rb').read().decode('UTF-8'),
+        long_description=open('README.rst', mode='rb').read().decode('UTF-8'),
        license='Apache License 2.0',
        packages=['brozzler', 'brozzler.dashboard'],
        package_data={
@ -63,26 +63,29 @@ setuptools.setup(
            ],
        },
        install_requires=[
-            'PyYAML',
-            'youtube-dl',
+            'PyYAML>=3.12',
+            'youtube-dl>=2018.7.21',
            'reppy==0.3.4',
-            'requests',
-            'websocket-client!=0.39.0',
-            'pillow==3.3.0',
+            'requests>=2.18.4',
+            'websocket-client>=0.39.0,!=0.49.0',
+            'pillow>=5.2.0',
            'urlcanon>=0.1.dev23',
            'doublethink>=0.2.0.dev88',
-            'rethinkdb>=2.3,<2.4',
-            'cerberus==1.0.1',
-            'jinja2',
-            'cryptography!=2.1.1', # 2.1.1 installation is failing on ubuntu
+            'rethinkdb>=2.3',
+            'cerberus>=1.0.1',
+            'jinja2>=2.10',
+            'cryptography>=2.3',
        ],
        extras_require={
-            'dashboard': ['flask>=0.11', 'gunicorn'],
+            'dashboard': [
+                'flask>=0.11',
+                'gunicorn>=19.8.1'
+            ],
            'easy': [
                'warcprox>=2.4b2.dev173',
-                'pywb<2',
+                'pywb>=0.33.2,<2',
                'flask>=0.11',
-                'gunicorn'
+                'gunicorn>=19.8.1'
            ],
        },
        zip_safe=False,
--- a/tests/test_cluster.py
+++ b/tests/test_cluster.py
@ -769,7 +769,7 @@ def test_time_limit(httpd):
    rr = doublethink.Rethinker('localhost', db='brozzler')
    frontier = brozzler.RethinkDbFrontier(rr)

-    # create a new job with three sites that could be crawled forever
+    # create a new job with one seed that could be crawled forever
    job_conf = {'seeds': [{
        'url': 'http://localhost:%s/infinite/foo/' % httpd.server_port,
        'time_limit': 20}]}
@ -789,6 +789,10 @@ def test_time_limit(httpd):
    assert sites[0].status == 'FINISHED_TIME_LIMIT'

    # all sites finished so job should be finished too
+    start = time.time()
    job.refresh()
+    while not job.status == 'FINISHED' and time.time() - start < 10:
+        time.sleep(0.5)
+        job.refresh()
    assert job.status == 'FINISHED'
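
Note on the test_time_limit change above: the added loop polls ``job.refresh()`` for up to 10 seconds before asserting. If the same pattern is needed in other tests, it could be factored into a small helper along these lines. This is only a sketch; ``wait_until`` is a hypothetical name and is not part of the brozzler test suite.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses.

    Returns True if the predicate became truthy in time, False otherwise.
    The predicate is always called at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the test above the predicate would refresh the job and compare its status, for example a small function that calls ``job.refresh()`` and returns ``job.status == 'FINISHED'``.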

--- a/tests/test_units.py
+++ b/tests/test_units.py
@ -32,6 +32,7 @@ import uuid
 import socket
 import time
 import sys
+import threading

 logging.basicConfig(
        stream=sys.stderr, level=logging.INFO, format=(
@ -67,6 +68,87 @@ def test_robots(httpd):
    site = brozzler.Site(None, {'seed':url,'user_agent':'im/a bAdBOt/uh huh'})
    assert not brozzler.is_permitted_by_robots(site, url)

+def test_robots_http_statuses():
+    for status in (
+            200, 204, 400, 401, 402, 403, 404, 405,
+            500, 501, 502, 503, 504, 505):
+        class Handler(http.server.BaseHTTPRequestHandler):
+            def do_GET(self):
+                response = (('HTTP/1.1 %s Meaningless message\r\n'
+                          + 'Content-length: 0\r\n'
+                          + '\r\n') % status).encode('utf-8')
+                self.connection.sendall(response)
+                # self.send_response(status)
+                # self.end_headers()
+        httpd = http.server.HTTPServer(('localhost', 0), Handler)
+        httpd_thread = threading.Thread(name='httpd', target=httpd.serve_forever)
+        httpd_thread.start()
+
+        try:
+            url = 'http://localhost:%s/' % httpd.server_port
+            site = brozzler.Site(None, {'seed': url})
+            assert brozzler.is_permitted_by_robots(site, url)
+        finally:
+            httpd.shutdown()
+            httpd.server_close()
+            httpd_thread.join()
+
+def test_robots_empty_response():
+    class Handler(http.server.BaseHTTPRequestHandler):
+        def do_GET(self):
+            self.connection.shutdown(socket.SHUT_RDWR)
+            self.connection.close()
+    httpd = http.server.HTTPServer(('localhost', 0), Handler)
+    httpd_thread = threading.Thread(name='httpd', target=httpd.serve_forever)
+    httpd_thread.start()
+
+    try:
+        url = 'http://localhost:%s/' % httpd.server_port
+        site = brozzler.Site(None, {'seed': url})
+        assert brozzler.is_permitted_by_robots(site, url)
+    finally:
+        httpd.shutdown()
+        httpd.server_close()
+        httpd_thread.join()
+
+def test_robots_socket_timeout():
+    stop_hanging = threading.Event()
+    class Handler(http.server.BaseHTTPRequestHandler):
+        def do_GET(self):
+            stop_hanging.wait(60)
+            self.connection.sendall(
+                    b'HTTP/1.1 200 OK\r\nContent-length: 0\r\n\r\n')
+
+    orig_timeout = brozzler.robots._SessionRaiseOn420.timeout
+
+    httpd = http.server.HTTPServer(('localhost', 0), Handler)
+    httpd_thread = threading.Thread(name='httpd', target=httpd.serve_forever)
+    httpd_thread.start()
+
+    try:
+        url = 'http://localhost:%s/' % httpd.server_port
+        site = brozzler.Site(None, {'seed': url})
+        brozzler.robots._SessionRaiseOn420.timeout = 2
+        assert brozzler.is_permitted_by_robots(site, url)
+    finally:
+        brozzler.robots._SessionRaiseOn420.timeout = orig_timeout
+        stop_hanging.set()
+        httpd.shutdown()
+        httpd.server_close()
+        httpd_thread.join()
+
+def test_robots_dns_failure():
+    # .invalid. is guaranteed nonexistent per rfc 6761
+    url = 'http://whatever.invalid./'
+    site = brozzler.Site(None, {'seed': url})
+    assert brozzler.is_permitted_by_robots(site, url)
+
+def test_robots_connection_failure():
+    # a refused connection should also be treated as "allow all"
+    url = 'http://localhost:4/' # nobody listens on port 4
+    site = brozzler.Site(None, {'seed': url})
+    assert brozzler.is_permitted_by_robots(site, url)
+
 def test_scoping():
    test_scope = yaml.load('''
 max_hops: 100
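
Note on the new robots tests above: each one repeats the same start-server / try / finally-shutdown boilerplate. One possible way to factor that out is a context manager like the sketch below. The names ``fixed_status_server`` and ``fetch_status`` are hypothetical, not part of brozzler. The sketch also uses ``send_response`` for brevity, whereas the test in the diff writes the raw status line with ``sendall``, so it is a simplification and not a drop-in replacement.

```python
import contextlib
import http.client
import http.server
import threading


@contextlib.contextmanager
def fixed_status_server(status):
    """Run a throwaway HTTP server that answers every GET with `status`.

    Yields the port the server is listening on and shuts the server down
    when the `with` block exits, even if the body raises.
    """
    class Handler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(status)
            self.send_header('Content-Length', '0')
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep test output quiet

    httpd = http.server.HTTPServer(('127.0.0.1', 0), Handler)
    thread = threading.Thread(name='httpd', target=httpd.serve_forever)
    thread.start()
    try:
        yield httpd.server_port
    finally:
        httpd.shutdown()
        httpd.server_close()
        thread.join()


def fetch_status(port, path='/'):
    """Return the HTTP status code of a GET request to the local server."""
    conn = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
    try:
        conn.request('GET', path)
        return conn.getresponse().status
    finally:
        conn.close()
```

A test body would then shrink to a ``with fixed_status_server(status) as port:`` block containing only the assertion.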
--- a/vagrant/README.rst
+++ b/vagrant/README.rst
@ -1,15 +1,14 @@
 Single-VM Vagrant Brozzler Deployment
 -------------------------------------

-This is a work in progress. Vagrant + ansible configuration for a single-vm
-deployment of brozzler and warcprox with dependencies (notably rethinkdb).
+This is a vagrant + ansible configuration for a single-vm deployment of
+brozzler and warcprox with dependencies (notably rethinkdb).

 The idea is for this to be a quick way for people to get up and running with a
 deployment resembling a real distributed deployment, and to offer a starting
 configuration for people to adapt to their clusters.

-And equally important, as a harness for integration tests. (As of now brozzler
-itself has no automated tests!)
+And equally important, as a harness for integration tests.

 You'll need vagrant installed.
 https://www.vagrantup.com/docs/installation/
@ -25,27 +24,27 @@ the brozzler virtualenv.
 ::

    my-laptop$ vagrant ssh
-    vagrant@brozzler-easy:~$ source ~/brozzler-ve34/bin/activate
-    (brozzler-ve34)vagrant@brozzler-easy:~$
+    vagrant@brzl:~$ source /opt/brozzler-ve34/bin/activate
+    (brozzler-ve34)vagrant@brzl:~$

 Then you can run brozzler-new-site:

 ::

-    (brozzler-ve34)vagrant@brozzler-easy:~$ brozzler-new-site \
-           --proxy=localhost:8000 http://example.com/
+    (brozzler-ve34)vagrant@brzl:~$ brozzler-new-site --proxy=localhost:8000 http://example.com/


 Or brozzler-new-job (make sure to set the proxy to localhost:8000):

 ::

-    (brozzler-ve34)vagrant@brozzler-easy:~$ cat >job1.yml
+    (brozzler-ve34)vagrant@brzl:~$ cat >job1.yml <<EOF
    id: job1
    proxy: localhost:8000 # point at warcprox for archiving
    seeds:
-      - url: https://example.org/
-    (brozzler-ve34)vagrant@brozzler-easy:~$ brozzler-new-job job1.yml
+    - url: https://example.org/
+    EOF
+    (brozzler-ve34)vagrant@brzl:~$ brozzler-new-job job1.yml

 WARC files will appear in ./warcs and brozzler, warcprox and rethinkdb logs in
 ./logs (via vagrant folders syncing).