Merge branch 'master' into qa

* master:
  avoid "Uncaught TypeError: Cannot read property 'querySelectorAll' of undefined" from outlinks script
  little readme fix
  for vagrant, static ansible inventory file, add brozzler-webconsole
  add info to display of jobless sites in brozzler-webconsole; fix creation of "least_hops" index on the rethinkdb table "pages"
  add arguments --webconsole-address --webconsole-port --pywb-address and change default ports
  list jobless sites on brozzler-webconsole front page
  run brozzler-webconsole inside brozzler-easy
  add section about brozzler-easy to the readme
  add --help to brozzler-webconsole
Noah Levitt 2016-08-29 09:59:55 -07:00
commit caadb2beff
17 changed files with 244 additions and 113 deletions


@ -1,4 +1,4 @@
.. |logo| image:: https://cdn.rawgit.com/nlevitt/brozzler/d1158ab2242815b28fe7bb066042b5b5982e4627/webconsole/static/brozzler.svg
.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
:width: 7%
brozzler |logo|
@ -6,37 +6,76 @@ brozzler |logo|
"browser" \| "crawler" = "brozzler"
Brozzler is a distributed web crawler (爬虫) that uses a real browser
(chrome or chromium) to fetch pages and embedded urls and to extract
links. It also uses `youtube-dl <https://github.com/rg3/youtube-dl>`__
to enhance media capture capabilities.
It is forked from https://github.com/internetarchive/umbra.
Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome
or chromium) to fetch pages and embedded urls and to extract links. It also
uses `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media
capture capabilities.
Brozzler is designed to work in conjunction with
`warcprox <https://github.com/internetarchive/warcprox>`__ for web
`warcprox <https://github.com/internetarchive/warcprox>`_ for web
archiving.
Installation
Requirements
------------
Brozzler requires python 3.4 or later.
- Python 3.4 or later
- RethinkDB deployment
- Chromium or Google Chrome browser
Worth noting is that the browser requires a graphical environment to run. You
already have this on your laptop, but on a server it will probably require
deploying some additional infrastructure (typically X11). The vagrant
configuration in the brozzler repository (still a work in progress) has an
example setup.
Getting Started
---------------
The easiest way to get started with brozzler for web archiving is with
``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox,
`pywb <https://github.com/ikreymer/pywb>`_, and brozzler-webconsole, configured
to work with each other, in a single process.
Mac instructions:
::
# set up virtualenv if desired
pip install brozzler
# install and start rethinkdb
brew install rethinkdb
rethinkdb &>>rethinkdb.log &
Brozzler also requires a rethinkdb deployment.
# install brozzler with special dependencies pywb and warcprox
pip install brozzler[easy] # in a virtualenv if desired
Usage
-----
# queue a site to crawl
brozzler-new-site http://example.com/
# or a job
brozzler-new-job job1.yml
# start brozzler-easy
brozzler-easy
At this point brozzler-easy will start brozzling your site. Results will be
immediately available for playback in pywb at http://localhost:8880/brozzler/.
*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
does not take advantage of brozzler's distributed nature.*
Installation and Usage
----------------------
To install brozzler only:
::
pip install brozzler # in a virtualenv if desired
Launch one or more workers:
::
brozzler-worker -e chromium
brozzler-worker
Submit jobs:
@ -44,6 +83,13 @@ Submit jobs:
brozzler-new-job myjob.yaml
Submit sites not tied to a job:
::
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
--time-limit=600 http://example.com/
Job Configuration
-----------------
@ -70,14 +116,6 @@ must be specified, everything else is optional.
scope:
surt: http://(org,example,
Submit a Site to Crawl Without Configuring a Job
------------------------------------------------
::
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
--time-limit=600 http://example.com/
Brozzler Web Console
--------------------
@ -95,19 +133,7 @@ To start the app, run
brozzler-webconsole
XXX configuration stuff
Fonts (for decent screenshots)
------------------------------
On ubuntu 14.04 trusty I installed these packages:
xfonts-base ttf-mscorefonts-installer fonts-arphic-bkai00mp
fonts-arphic-bsmi00lp fonts-arphic-gbsn00lp fonts-arphic-gkai00mp
fonts-arphic-ukai fonts-farsiweb fonts-nafees fonts-sil-abyssinica
fonts-sil-ezra fonts-sil-padauk fonts-unfonts-extra fonts-unfonts-core
ttf-indic-fonts fonts-thai-tlwg fonts-lklug-sinhala
See ``brozzler-webconsole --help`` for configuration options.
License
-------


@ -304,11 +304,13 @@ class Browser:
var __brzl_framesDone = new Set();
var __brzl_compileOutlinks = function(frame) {
__brzl_framesDone.add(frame);
var outlinks = Array.prototype.slice.call(
frame.document.querySelectorAll('a[href]'));
for (var i = 0; i < frame.frames.length; i++) {
if (frame.frames[i] && !__brzl_framesDone.has(frame.frames[i])) {
outlinks = outlinks.concat(__brzl_compileOutlinks(frame.frames[i]));
if (frame && frame.document) {
var outlinks = Array.prototype.slice.call(
frame.document.querySelectorAll('a[href]'));
for (var i = 0; i < frame.frames.length; i++) {
if (frame.frames[i] && !__brzl_framesDone.has(frame.frames[i])) {
outlinks = outlinks.concat(__brzl_compileOutlinks(frame.frames[i]));
}
}
}
return outlinks;


@ -1,7 +1,7 @@
#!/usr/bin/env python
'''
brozzler-easy - brozzler-worker, warcprox, and pywb all working together in a
single process
brozzler-easy - brozzler-worker, warcprox, pywb, and brozzler-webconsole all
working together in a single process
Copyright (C) 2016 Internet Archive
@ -27,7 +27,7 @@ try:
import brozzler.pywb
import wsgiref.simple_server
import wsgiref.handlers
import six.moves.socketserver
import brozzler.webconsole
except ImportError as e:
logging.critical(
'%s: %s\n\nYou might need to run "pip install '
@ -44,16 +44,17 @@ import threading
import time
import rethinkstuff
import traceback
import socketserver
def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
arg_parser = argparse.ArgumentParser(
prog=prog, formatter_class=argparse.ArgumentDefaultsHelpFormatter,
description=(
'brozzler-easy - easy deployment of brozzler, with '
'brozzler-worker, warcprox, and pywb all running in a single '
'process'))
'brozzler-worker, warcprox, pywb, and brozzler-webconsole all '
'running in a single process'))
# === common args ===
# common args
arg_parser.add_argument(
'--rethinkdb-servers', dest='rethinkdb_servers',
default='localhost', help=(
@ -66,7 +67,7 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
'-d', '--warcs-dir', dest='warcs_dir', default='./warcs',
help='where to write warcs')
# === warcprox args ===
# warcprox args
arg_parser.add_argument(
'-c', '--cacert', dest='cacert',
default='./%s-warcprox-ca.pem' % socket.gethostname(),
@ -83,24 +84,42 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
'host:port of tor socks proxy, used only to connect to '
'.onion sites'))
# === brozzler-worker args ===
# brozzler-worker args
arg_parser.add_argument(
'-e', '--chrome-exe', dest='chrome_exe',
default=brozzler.cli.suggest_default_chome_exe(),
help='executable to use to invoke chrome')
arg_parser.add_argument(
'-n', '--max-browsers', dest='max_browsers', default='1',
help='max number of chrome instances simultaneously browsing pages')
'-n', '--max-browsers', dest='max_browsers',
type=int, default=1, help=(
'max number of chrome instances simultaneously '
'browsing pages'))
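Note on the `--max-browsers` change: the conversion to `int` now happens inside argparse rather than at the call site. A minimal sketch of that behavior, using only the standard library; the parser below is a stand-alone illustration with a made-up `prog` name, not the full brozzler-easy parser:

```python
import argparse

# Stand-alone parser mirroring only the --max-browsers option from the diff.
parser = argparse.ArgumentParser(
    prog='example',  # hypothetical program name
    formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument(
    '-n', '--max-browsers', dest='max_browsers',
    type=int, default=1, help=(
        'max number of chrome instances simultaneously '
        'browsing pages'))

# With type=int, argparse converts the command line string itself...
args = parser.parse_args(['--max-browsers', '4'])
# ...and the default is already an int, so callers need no int() wrapper.
defaults = parser.parse_args([])
```

This is why the later hunk can pass `args.max_browsers` straight through instead of wrapping it in `int(...)`.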
# === pywb args ===
# pywb args
arg_parser.add_argument(
'--pywb-port', dest='pywb_port', type=int, default=8091,
help='pywb wayback port')
'--pywb-address', dest='pywb_address',
default='0.0.0.0',
help='pywb wayback address to listen on')
arg_parser.add_argument(
'--pywb-port', dest='pywb_port', type=int,
default=8880, help='pywb wayback port')
# === common at the bottom args ===
# webconsole args
arg_parser.add_argument(
'-v', '--verbose', dest='verbose', action='store_true')
arg_parser.add_argument('-q', '--quiet', dest='quiet', action='store_true')
'--webconsole-address', dest='webconsole_address',
default='localhost',
help='brozzler web console address to listen on')
arg_parser.add_argument(
'--webconsole-port', dest='webconsole_port',
type=int, default=8881, help='brozzler web console port')
# common at the bottom args
arg_parser.add_argument(
'-v', '--verbose', dest='verbose', action='store_true',
help='verbose logging')
arg_parser.add_argument(
'-q', '--quiet', dest='quiet', action='store_true',
help='quiet logging (warnings and errors only)')
# arg_parser.add_argument(
# '-s', '--silent', dest='log_level', action='store_const',
# default=logging.INFO, const=logging.CRITICAL)
@ -110,6 +129,10 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
return arg_parser
class ThreadingWSGIServer(
socketserver.ThreadingMixIn, wsgiref.simple_server.WSGIServer):
pass
class BrozzlerEasyController:
logger = logging.getLogger(__module__ + "." + __qualname__)
@ -120,6 +143,12 @@ class BrozzlerEasyController:
self._warcprox_args(args))
self.brozzler_worker = self._init_brozzler_worker(args)
self.pywb_httpd = self._init_pywb(args)
self.webconsole_httpd = self._init_brozzler_webconsole(args)
def _init_brozzler_webconsole(self, args):
return wsgiref.simple_server.make_server(
args.webconsole_address, args.webconsole_port,
brozzler.webconsole.app, ThreadingWSGIServer)
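Note on `ThreadingWSGIServer`: combining `socketserver.ThreadingMixIn` with `wsgiref.simple_server.WSGIServer` lets the simple server handle each request in its own thread. A minimal sketch under stated assumptions: `hello_app`, `QuietHandler`, and `fetch_once` are hypothetical names for illustration, `hello_app` stands in for `brozzler.webconsole.app`, and port 0 (any free port) replaces the 8881 default from the diff:

```python
import http.client
import socketserver
import threading
import wsgiref.simple_server


class ThreadingWSGIServer(
        socketserver.ThreadingMixIn, wsgiref.simple_server.WSGIServer):
    # Same shape as the class in the diff: one thread per request.
    pass


class QuietHandler(wsgiref.simple_server.WSGIRequestHandler):
    # Hypothetical helper: silence per-request logging in this sketch.
    def log_message(self, format, *args):
        pass


def hello_app(environ, start_response):
    # Hypothetical stand-in for brozzler.webconsole.app.
    body = b'hello'
    start_response('200 OK', [
        ('Content-Type', 'text/plain'),
        ('Content-Length', str(len(body)))])
    return [body]


def fetch_once():
    # Port 0 asks the OS for any free port.
    httpd = wsgiref.simple_server.make_server(
        'localhost', 0, hello_app, ThreadingWSGIServer,
        handler_class=QuietHandler)
    thread = threading.Thread(target=httpd.serve_forever)
    thread.start()
    try:
        host, port = httpd.server_address[:2]
        conn = http.client.HTTPConnection(host, port, timeout=10)
        conn.request('GET', '/')
        body = conn.getresponse().read()
        conn.close()
        return body
    finally:
        httpd.shutdown()      # stops serve_forever()
        thread.join()
        httpd.server_close()  # releases the listening socket
```

The `shutdown()` call mirrors what `BrozzlerEasyController.shutdown` does for `webconsole_httpd`; joining the thread and closing the socket are additions for the sketch.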
def _init_brozzler_worker(self, args):
r = rethinkstuff.Rethinker(
@ -128,7 +157,7 @@ class BrozzlerEasyController:
service_registry = rethinkstuff.ServiceRegistry(r)
worker = brozzler.worker.BrozzlerWorker(
frontier, service_registry,
max_browsers=int(args.max_browsers),
max_browsers=args.max_browsers,
chrome_exe=args.chrome_exe,
proxy='%s:%s' % self.warcprox_controller.proxy.server_address,
enable_warcprox_features=True)
@ -166,12 +195,9 @@ class BrozzlerEasyController:
# disable is_hop_by_hop restrictions
wsgiref.handlers.is_hop_by_hop = lambda x: False
class ThreadingWSGIServer(
six.moves.socketserver.ThreadingMixIn,
wsgiref.simple_server.WSGIServer):
pass
return wsgiref.simple_server.make_server(
'', args.pywb_port, wsgi_app, ThreadingWSGIServer)
args.pywb_address, args.pywb_port, wsgi_app,
ThreadingWSGIServer)
def start(self):
self.logger.info('starting warcprox')
@ -185,7 +211,15 @@ class BrozzlerEasyController:
'starting pywb at %s:%s', *self.pywb_httpd.server_address)
threading.Thread(target=self.pywb_httpd.serve_forever).start()
self.logger.info(
'starting brozzler-webconsole at %s:%s',
*self.webconsole_httpd.server_address)
threading.Thread(target=self.webconsole_httpd.serve_forever).start()
def shutdown(self):
self.logger.info('shutting down brozzler-webconsole')
self.webconsole_httpd.shutdown()
self.logger.info('shutting down brozzler-worker')
self.brozzler_worker.shutdown_now()
# brozzler-worker is fully shut down at this point


@ -69,7 +69,7 @@ class RethinkDbFrontier:
self.r.table("pages").index_create(
"least_hops", [
self.r.row["site_id"], self.r.row["brozzle_count"],
self.r.row["hops_from_seed"]])
self.r.row["hops_from_seed"]]).run()
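Note on the added `.run()`: a query that is only built and never run has no effect, which is presumably why the "least_hops" index was not being created before this fix. The toy class below is a hypothetical stand-in that illustrates deferred execution in general; it is not the rethinkdb or rethinkstuff API:

```python
class LazyQuery:
    """Hypothetical stand-in for a query builder; not the rethinkdb API."""

    def __init__(self, log, description):
        self._log = log
        self._description = description

    def run(self):
        # Only here does the "query" take effect.
        self._log.append(self._description)
        return {'created': 1}


executed = []

# Building the query alone does nothing, which mirrors the old code path.
LazyQuery(executed, 'index_create least_hops')
built_only = list(executed)

# Calling run() is what actually performs the work, as in the fix.
result = LazyQuery(executed, 'index_create least_hops').run()
```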
if not "jobs" in tables:
self.logger.info(
"creating rethinkdb table 'jobs' in database %s",


@ -27,7 +27,6 @@ except ImportError as e:
'brozzler[webconsole]".\nSee README.rst for more information.',
type(e).__name__, e)
sys.exit(1)
import rethinkstuff
import json
import os
@ -56,11 +55,16 @@ SETTINGS = {
'RETHINKDB_SERVERS', 'localhost').split(','),
'RETHINKDB_DB': os.environ.get('RETHINKDB_DB', 'brozzler'),
'WAYBACK_BASEURL': os.environ.get(
'WAYBACK_BASEURL', 'http://wbgrp-svc107.us.archive.org:8091'),
'WAYBACK_BASEURL', 'http://localhost:8091/brozzler'),
}
r = rethinkstuff.Rethinker(
SETTINGS['RETHINKDB_SERVERS'], db=SETTINGS['RETHINKDB_DB'])
service_registry = rethinkstuff.ServiceRegistry(r)
_svc_reg = None
def service_registry():
global _svc_reg
if not _svc_reg:
_svc_reg = rethinkstuff.ServiceRegistry(r)
return _svc_reg
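Note on `service_registry()`: replacing the module-level object with a function defers construction until first use, so importing the module (for example to print `--help`) no longer builds the registry. A minimal sketch of the pattern; `FakeRegistry` is a hypothetical stand-in for `rethinkstuff.ServiceRegistry`, and the sketch tests `is None` where the diff uses `if not _svc_reg`:

```python
class FakeRegistry:
    """Hypothetical stand-in for rethinkstuff.ServiceRegistry."""
    constructed = 0

    def __init__(self):
        # Count constructions so the laziness is observable.
        FakeRegistry.constructed += 1


_svc_reg = None


def service_registry():
    # Build the registry on first call only, then reuse it.
    global _svc_reg
    if _svc_reg is None:
        _svc_reg = FakeRegistry()
    return _svc_reg


constructed_at_import = FakeRegistry.constructed
first = service_registry()
second = service_registry()
```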
@app.route("/api/sites/<site_id>/queued_count")
@app.route("/api/site/<site_id>/queued_count")
@ -149,6 +153,16 @@ def sites(job_id):
s["cookie_db"] = base64.b64encode(s["cookie_db"]).decode("ascii")
return flask.jsonify(sites=sites_)
@app.route("/api/jobless-sites")
def jobless_sites():
# XXX inefficient (unindexed) query
sites_ = list(r.table("sites").filter(~r.row.has_fields("job_id")).run())
# TypeError: <binary, 7168 bytes, '53 51 4c 69 74 65...'> is not JSON serializable
for s in sites_:
if "cookie_db" in s:
s["cookie_db"] = base64.b64encode(s["cookie_db"]).decode("ascii")
return flask.jsonify(sites=sites_)
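Note on the `cookie_db` handling: raw bytes cannot be serialized to JSON, so the handler base64-encodes them into an ASCII string first. A minimal standard-library sketch of the same step; `jsonable_site` and the sample record are hypothetical, and `json.dumps` stands in for `flask.jsonify`:

```python
import base64
import json


def jsonable_site(site):
    # Hypothetical helper: copy the record and encode the binary field.
    out = dict(site)
    if 'cookie_db' in out:
        out['cookie_db'] = base64.b64encode(out['cookie_db']).decode('ascii')
    return out


# Made-up record; the bytes begin with the same "SQLite" prefix that the
# TypeError comment in the diff shows in hex.
site = {
    'id': 1,
    'seed': 'http://example.com/',
    'cookie_db': b'SQLite format 3\x00',
}

try:
    json.dumps(site)
    raw_is_serializable = True
except TypeError:
    # bytes values are rejected by the JSON encoder
    raw_is_serializable = False

encoded = json.dumps(jsonable_site(site), sort_keys=True)
```

A client that needs the original bytes can reverse the step with `base64.b64decode`.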
@app.route("/api/jobs/<int:job_id>")
@app.route("/api/job/<int:job_id>")
def job(job_id):
@ -165,12 +179,12 @@ def job_yaml(job_id):
@app.route("/api/workers")
def workers():
workers_ = service_registry.available_services("brozzler-worker")
workers_ = service_registry().available_services("brozzler-worker")
return flask.jsonify(workers=list(workers_))
@app.route("/api/services")
def services():
services_ = service_registry.available_services()
services_ = service_registry().available_services()
return flask.jsonify(services=list(services_))
@app.route("/api/jobs")
@ -221,7 +235,26 @@ except ImportError:
logging.info('running brozzler-webconsole using simple flask app.run')
app.run()
if __name__ == "__main__":
# arguments?
def main():
import argparse
arg_parser = argparse.ArgumentParser(
prog=os.path.basename(sys.argv[0]),
formatter_class=argparse.RawDescriptionHelpFormatter,
description=(
'brozzler-webconsole - web application for viewing brozzler '
'crawl status'),
epilog=(
'brozzler-webconsole has no command line options, but can be '
'configured using the following environment variables:\n\n'
' RETHINKDB_SERVERS rethinkdb servers, e.g. db0.foo.org,'
'db0.foo.org:38015,db1.foo.org (default: localhost)\n'
' RETHINKDB_DB rethinkdb database name (default: '
'brozzler)\n'
' WAYBACK_BASEURL base url for constructing wayback '
'links (default http://localhost:8091/brozzler)'))
args = arg_parser.parse_args(args=sys.argv[1:])
run()
if __name__ == "__main__":
main()
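Note on the new `main()`: the parser exists mainly so that `--help` documents the environment variables, and `RawDescriptionHelpFormatter` keeps the epilog's line breaks intact. A minimal sketch under stated assumptions: `build_parser` and `load_settings` are hypothetical helpers (the module in the diff builds `SETTINGS` at import time), and the epilog text is shortened:

```python
import argparse
import os

EPILOG = (
    'configured using the following environment variables:\n\n'
    '  RETHINKDB_SERVERS  rethinkdb servers (default: localhost)\n'
    '  RETHINKDB_DB       rethinkdb database name (default: brozzler)\n')


def build_parser(prog='brozzler-webconsole'):
    # RawDescriptionHelpFormatter keeps the newlines in the epilog,
    # so each variable stays on its own line in --help output.
    return argparse.ArgumentParser(
        prog=prog,
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description='web application for viewing brozzler crawl status',
        epilog=EPILOG)


def load_settings(environ=None):
    # Hypothetical helper: read settings from the environment, falling
    # back to the same defaults the diff uses.
    environ = os.environ if environ is None else environ
    return {
        'RETHINKDB_SERVERS': environ.get(
            'RETHINKDB_SERVERS', 'localhost').split(','),
        'RETHINKDB_DB': environ.get('RETHINKDB_DB', 'brozzler'),
    }


help_text = build_parser().format_help()
```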


@ -79,6 +79,9 @@ brozzlerControllers.controller("HomeController", ["$scope", "$http",
$http.get("/api/services").success(function(data) {
$scope.services = data.services;
});
$http.get("/api/jobless-sites").success(function(data) {
$scope.joblessSites = data.sites;
});
}]);
brozzlerControllers.controller("WorkersListController", ["$scope", "$http",


@ -41,7 +41,6 @@
</div>
<h2>Jobs</h2>
<div class="row">
<div class="col-sm-12">
<table class="table table-striped">
@ -66,4 +65,29 @@
</table>
</div>
</div>
<h2>Jobless Sites</h2>
<div class="row">
<div class="col-sm-12">
<table class="table table-striped">
<thead>
<tr>
<th>id</th>
<th>status</th>
<th>started</th>
<th>seed url</th>
</tr>
</thead>
<tbody>
<tr ng-repeat="site in joblessSites">
<td><a href="/sites/{{site.id}}">{{site.id}}</a></td>
<td>{{site.status}}</td>
<td>{{site.start_time}}</td>
<td>{{site.seed}}</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>


@ -32,7 +32,7 @@ def find_package_data(package):
setuptools.setup(
name='brozzler',
version='1.1b6.dev69',
version='1.1b6.dev78',
description='Distributed web crawling with browsers',
url='https://github.com/internetarchive/brozzler',
author='Noah Levitt',
@ -51,7 +51,7 @@ setuptools.setup(
'brozzler-new-site=brozzler.cli:brozzler_new_site',
'brozzler-worker=brozzler.cli:brozzler_worker',
'brozzler-ensure-tables=brozzler.cli:brozzler_ensure_tables',
'brozzler-webconsole=brozzler.webconsole:run',
'brozzler-webconsole=brozzler.webconsole:main',
'brozzler-easy=brozzler.easy:main',
],
},
@ -69,7 +69,7 @@ setuptools.setup(
],
extras_require={
'webconsole': ['flask>=0.11', 'gunicorn'],
'easy': ['warcprox>=2.0b1', 'pywb'],
'easy': ['warcprox>=2.0b1', 'pywb', 'flask>=0.11', 'gunicorn'],
},
zip_safe=False,
classifiers=[

vagrant/Vagrantfile

@ -1,16 +1,13 @@
Vagrant.configure(2) do |config|
config.vm.box = "ubuntu/trusty64"
config.vm.hostname = "brozzler-easy"
config.vm.define "10.9.9.9"
config.vm.hostname = "brzl"
config.vm.network :private_network, ip: "10.9.9.9"
config.vm.synced_folder "..", "/brozzler"
config.vm.provision "ansible" do |ansible|
ansible.inventory_path = "ansible/hosts"
ansible.playbook = "ansible/playbook.yml"
ansible.groups = {
"rethinkdb" => ["default"],
"warcprox" => ["default"],
"brozzler-worker" => ["default"],
# "brozzler-webconsole" => ["default"],
}
end
end

vagrant/ansible/hosts

@ -0,0 +1,16 @@
ansible_ssh_private_key_file=.vagrant/machines/10.9.9.9/virtualbox/private_key
[rethinkdb]
10.9.9.9
[warcprox]
10.9.9.9
[brozzler-worker]
10.9.9.9
[brozzler-webconsole]
10.9.9.9
[pywb]
10.9.9.9


@ -2,27 +2,27 @@
- name: apply common configuration to all nodes
hosts: all
roles:
- common
- common
- name: deploy rethinkdb
hosts: rethinkdb
roles:
- rethinkdb
- rethinkdb
- name: deploy warcprox
hosts: warcprox
roles:
- warcprox
- warcprox
- name: deploy brozzler-worker
hosts: brozzler-worker
roles:
- brozzler-worker
- brozzler-worker
# - name: deploy brozzler-webconsole
# hosts: brozzler-webconsole
# roles:
# - brozzler-webconsole
- name: deploy brozzler-webconsole
hosts: brozzler-webconsole
roles:
- brozzler-webconsole
# - name: deploy pywb
# hosts: pywb


@ -0,0 +1,4 @@
---
- name: restart brozzler-webconsole
service: name=brozzler-webconsole state=restarted
become: true


@ -1,19 +1,15 @@
---
- name: git clone https://github.com/internetarchive/brozzler.git
git: repo=https://github.com/internetarchive/brozzler.git
dest=/home/vagrant/brozzler
- name: pip install -r requirements.txt in virtualenv
pip: requirements=/home/vagrant/brozzler/webconsole/requirements.txt
- name: install brozzler[webconsole] in virtualenv
become: true
pip: name='-e /brozzler[webconsole]'
virtualenv=/home/vagrant/brozzler-webconsole-ve34
virtualenv_python=python3.4
extra_args='--no-input --upgrade --pre'
notify:
- restart brozzler-webconsole
- restart brozzler-webconsole
- name: install upstart config /etc/init/brozzler-webconsole.conf
become: true
template: src=templates/brozzler-webconsole.conf.j2
dest=/etc/init/brozzler-webconsole.conf
notify:
- restart brozzler-webconsole
- restart brozzler-webconsole


@ -3,19 +3,16 @@ description "brozzler-webconsole"
start on runlevel [2345]
stop on runlevel [!2345]
env PYTHONPATH=/home/vagrant/brozzler-webconsole-ve34/lib/python3.4/site-packages:/home/vagrant/brozzler/webconsole
env PYTHONPATH=/home/vagrant/brozzler-webconsole-ve34/lib/python3.4/site-packages
env PATH=/home/vagrant/brozzler-webconsole-ve34/bin:/usr/bin:/bin
env LC_ALL=C.UTF-8
env WAYBACK_BASEURL={{base_wayback_url}}/all
# env RETHINKDB_SERVERS={{groups['rethinkdb'] | join(',')}}
env RETHINKDB_SERVERS=localhost
env RETHINKDB_DB={{rethinkdb_db}}
env WAYBACK_BASEURL=http://{{groups['pywb'][0]}}:8880/brozzler
env RETHINKDB_SERVERS={{groups['rethinkdb'] | join(',')}}
env RETHINKDB_DB=brozzler
setuid vagrant
# console log
exec gunicorn --bind=0.0.0.0:8081 brozzler-webconsole:app >&/vagrant/logs/brozzler-webconsole.log
exec gunicorn --bind=0.0.0.0:8881 brozzler.webconsole:app >>/vagrant/logs/brozzler-webconsole.log 2>&1


@ -19,7 +19,5 @@ stop on stopping Xvnc
kill timeout 60
exec nice brozzler-worker \
--rethinkdb-servers=localhost \
--max-browsers=4 >>/vagrant/logs/brozzler-worker.log 2>&1
# --rethinkdb-servers={{groups['rethinkdb'] | join(',')}} \
--rethinkdb-servers={{groups['rethinkdb'] | join(',')}} \
--max-browsers=4 >>/vagrant/logs/brozzler-worker.log 2>&1


@ -10,5 +10,6 @@ console log
env PYTHONPATH=/home/vagrant/websockify-ve34/lib/python3.4/site-packages
env PATH=/home/vagrant/websockify-ve34/bin:/usr/bin:/bin
# port 8901 is hard-coded in brozzler/webconsole/static/partials/workers.html
exec nice websockify 0.0.0.0:8901 localhost:5901


@ -1,5 +1,5 @@
runuser=vagrant
# bind=0.0.0.0
bind=0.0.0.0
# directory=/var/lib/rethinkdb
# log-file=/var/log/rethinkdb.log
log-file=/vagrant/logs/rethinkdb.log # synced dir