mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-12-14 16:19:00 -05:00
Merge branch 'master' into qa
* master:
  avoid "Uncaught TypeError: Cannot read property 'querySelectorAll' of undefined" from outlinks script
  little readme fix
  for vagrant, static ansible inventory file, add brozzler-webconsole
  add info to display of jobless sites in brozzler-webconsole; fix creation of "least_hops" index on the rethinkdb table "pages"
  add arguments --webconsole-address --webconsole-port --pywb-address and change default ports
  list jobless sites on brozzler-webconsole front page
  run brozzler-webconsole inside brozzler-easy
  add section about brozzler-easy to the readme
  add --help to brozzler-webconsole
commit
caadb2beff
17 changed files with 244 additions and 113 deletions
100
README.rst
|
|
@ -1,4 +1,4 @@
|
||||||
.. |logo| image:: https://cdn.rawgit.com/nlevitt/brozzler/d1158ab2242815b28fe7bb066042b5b5982e4627/webconsole/static/brozzler.svg
|
.. |logo| image:: https://cdn.rawgit.com/internetarchive/brozzler/1.1b5/brozzler/webconsole/static/brozzler.svg
|
||||||
:width: 7%
|
:width: 7%
|
||||||
|
|
||||||
brozzler |logo|
|
brozzler |logo|
|
||||||
|
|
@ -6,37 +6,76 @@ brozzler |logo|
|
||||||
|
|
||||||
"browser" \| "crawler" = "brozzler"
|
"browser" \| "crawler" = "brozzler"
|
||||||
|
|
||||||
Brozzler is a distributed web crawler (爬虫) that uses a real browser
|
Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome
|
||||||
(chrome or chromium) to fetch pages and embedded urls and to extract
|
or chromium) to fetch pages and embedded urls and to extract links. It also
|
||||||
links. It also uses `youtube-dl <https://github.com/rg3/youtube-dl>`__
|
uses `youtube-dl <https://github.com/rg3/youtube-dl>`_ to enhance media
|
||||||
to enhance media capture capabilities.
|
capture capabilities.
|
||||||
|
|
||||||
It is forked from https://github.com/internetarchive/umbra.
|
|
||||||
|
|
||||||
Brozzler is designed to work in conjunction with
|
Brozzler is designed to work in conjunction with
|
||||||
`warcprox <https://github.com/internetarchive/warcprox>`__ for web
|
`warcprox <https://github.com/internetarchive/warcprox>`_ for web
|
||||||
archiving.
|
archiving.
|
||||||
|
|
||||||
Installation
|
Requirements
|
||||||
------------
|
------------
|
||||||
|
|
||||||
Brozzler requires python 3.4 or later.
|
- Python 3.4 or later
|
||||||
|
- RethinkDB deployment
|
||||||
|
- Chromium or Google Chrome browser
|
||||||
|
|
||||||
|
Worth noting is that the browser requires a graphical environment to run. You
|
||||||
|
already have this on your laptop, but on a server it will probably require
|
||||||
|
deploying some additional infrastructure (typically X11). The vagrant
|
||||||
|
configuration in the brozzler repository (still a work in progress) has an
|
||||||
|
example setup.
|
||||||
|
|
||||||
|
Getting Started
|
||||||
|
---------------
|
||||||
|
|
||||||
|
The easiest way to get started with brozzler for web archiving is with
|
||||||
|
``brozzler-easy``. Brozzler-easy runs brozzler-worker, warcprox,
|
||||||
|
`pywb <https://github.com/ikreymer/pywb>`_, and brozzler-webconsole, configured
|
||||||
|
to work with each other, in a single process.
|
||||||
|
|
||||||
|
Mac instructions:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
# set up virtualenv if desired
|
# install and start rethinkdb
|
||||||
pip install brozzler
|
brew install rethinkdb
|
||||||
|
rethinkdb &>>rethinkdb.log &
|
||||||
|
|
||||||
Brozzler also requires a rethinkdb deployment.
|
# install brozzler with special dependencies pywb and warcprox
|
||||||
|
pip install brozzler[easy] # in a virtualenv if desired
|
||||||
|
|
||||||
Usage
|
# queue a site to crawl
|
||||||
-----
|
brozzler-new-site http://example.com/
|
||||||
|
|
||||||
|
# or a job
|
||||||
|
brozzler-new-job job1.yml
|
||||||
|
|
||||||
|
# start brozzler-easy
|
||||||
|
brozzler-easy
|
||||||
|
|
||||||
|
At this point brozzler-easy will start brozzling your site. Results will be
|
||||||
|
immediately available for playback in pywb at http://localhost:8880/brozzler/.
|
||||||
|
|
||||||
|
*Brozzler-easy demonstrates the full brozzler archival crawling workflow, but
|
||||||
|
does not take advantage of brozzler's distributed nature.*
|
||||||
|
|
||||||
|
Installation and Usage
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
To install brozzler only:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
pip install brozzler # in a virtualenv if desired
|
||||||
|
|
||||||
Launch one or more workers:
|
Launch one or more workers:
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
brozzler-worker -e chromium
|
brozzler-worker
|
||||||
|
|
||||||
Submit jobs:
|
Submit jobs:
|
||||||
|
|
||||||
|
|
@ -44,6 +83,13 @@ Submit jobs:
|
||||||
|
|
||||||
brozzler-new-job myjob.yaml
|
brozzler-new-job myjob.yaml
|
||||||
|
|
||||||
|
Submit sites not tied to a job:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
|
||||||
|
--time-limit=600 http://example.com/
|
||||||
|
|
||||||
Job Configuration
|
Job Configuration
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
|
|
@ -70,14 +116,6 @@ must be specified, everything else is optional.
|
||||||
scope:
|
scope:
|
||||||
surt: http://(org,example,
|
surt: http://(org,example,
|
||||||
|
|
||||||
Submit a Site to Crawl Without Configuring a Job
|
|
||||||
------------------------------------------------
|
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
brozzler-new-site --proxy=localhost:8000 --enable-warcprox-features \
|
|
||||||
--time-limit=600 http://example.com/
|
|
||||||
|
|
||||||
Brozzler Web Console
|
Brozzler Web Console
|
||||||
--------------------
|
--------------------
|
||||||
|
|
||||||
|
|
@ -95,19 +133,7 @@ To start the app, run
|
||||||
|
|
||||||
brozzler-webconsole
|
brozzler-webconsole
|
||||||
|
|
||||||
|
See ``brozzler-webconsole --help`` for configuration options.
|
||||||
XXX configuration stuff
|
|
||||||
|
|
||||||
Fonts (for decent screenshots)
|
|
||||||
------------------------------
|
|
||||||
|
|
||||||
On ubuntu 14.04 trusty I installed these packages:
|
|
||||||
|
|
||||||
xfonts-base ttf-mscorefonts-installer fonts-arphic-bkai00mp
|
|
||||||
fonts-arphic-bsmi00lp fonts-arphic-gbsn00lp fonts-arphic-gkai00mp
|
|
||||||
fonts-arphic-ukai fonts-farsiweb fonts-nafees fonts-sil-abyssinica
|
|
||||||
fonts-sil-ezra fonts-sil-padauk fonts-unfonts-extra fonts-unfonts-core
|
|
||||||
ttf-indic-fonts fonts-thai-tlwg fonts-lklug-sinhala
|
|
||||||
|
|
||||||
License
|
License
|
||||||
-------
|
-------
|
||||||
|
|
|
||||||
|
|
@ -304,11 +304,13 @@ class Browser:
|
||||||
var __brzl_framesDone = new Set();
|
var __brzl_framesDone = new Set();
|
||||||
var __brzl_compileOutlinks = function(frame) {
|
var __brzl_compileOutlinks = function(frame) {
|
||||||
__brzl_framesDone.add(frame);
|
__brzl_framesDone.add(frame);
|
||||||
var outlinks = Array.prototype.slice.call(
|
if (frame && frame.document) {
|
||||||
frame.document.querySelectorAll('a[href]'));
|
var outlinks = Array.prototype.slice.call(
|
||||||
for (var i = 0; i < frame.frames.length; i++) {
|
frame.document.querySelectorAll('a[href]'));
|
||||||
if (frame.frames[i] && !__brzl_framesDone.has(frame.frames[i])) {
|
for (var i = 0; i < frame.frames.length; i++) {
|
||||||
outlinks = outlinks.concat(__brzl_compileOutlinks(frame.frames[i]));
|
if (frame.frames[i] && !__brzl_framesDone.has(frame.frames[i])) {
|
||||||
|
outlinks = outlinks.concat(__brzl_compileOutlinks(frame.frames[i]));
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
return outlinks;
|
return outlinks;
|
||||||
|
|
|
||||||
|
|
@ -1,7 +1,7 @@
|
||||||
#!/usr/bin/env python
|
#!/usr/bin/env python
|
||||||
'''
|
'''
|
||||||
brozzler-easy - brozzler-worker, warcprox, and pywb all working together in a
|
brozzler-easy - brozzler-worker, warcprox, pywb, and brozzler-webconsole all
|
||||||
single process
|
working together in a single process
|
||||||
|
|
||||||
Copyright (C) 2016 Internet Archive
|
Copyright (C) 2016 Internet Archive
|
||||||
|
|
||||||
|
|
@ -27,7 +27,7 @@ try:
|
||||||
import brozzler.pywb
|
import brozzler.pywb
|
||||||
import wsgiref.simple_server
|
import wsgiref.simple_server
|
||||||
import wsgiref.handlers
|
import wsgiref.handlers
|
||||||
import six.moves.socketserver
|
import brozzler.webconsole
|
||||||
except ImportError as e:
|
except ImportError as e:
|
||||||
logging.critical(
|
logging.critical(
|
||||||
'%s: %s\n\nYou might need to run "pip install '
|
'%s: %s\n\nYou might need to run "pip install '
|
||||||
|
|
@ -44,16 +44,17 @@ import threading
|
||||||
import time
|
import time
|
||||||
import rethinkstuff
|
import rethinkstuff
|
||||||
import traceback
|
import traceback
|
||||||
|
import socketserver
|
||||||
|
|
||||||
def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
|
def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
|
||||||
arg_parser = argparse.ArgumentParser(
|
arg_parser = argparse.ArgumentParser(
|
||||||
prog=prog, formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
prog=prog, formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||||
description=(
|
description=(
|
||||||
'brozzler-easy - easy deployment of brozzler, with '
|
'brozzler-easy - easy deployment of brozzler, with '
|
||||||
'brozzler-worker, warcprox, and pywb all running in a single '
|
'brozzler-worker, warcprox, pywb, and brozzler-webconsole all '
|
||||||
'process'))
|
'running in a single process'))
|
||||||
|
|
||||||
# === common args ===
|
# common args
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'--rethinkdb-servers', dest='rethinkdb_servers',
|
'--rethinkdb-servers', dest='rethinkdb_servers',
|
||||||
default='localhost', help=(
|
default='localhost', help=(
|
||||||
|
|
@ -66,7 +67,7 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
|
||||||
'-d', '--warcs-dir', dest='warcs_dir', default='./warcs',
|
'-d', '--warcs-dir', dest='warcs_dir', default='./warcs',
|
||||||
help='where to write warcs')
|
help='where to write warcs')
|
||||||
|
|
||||||
# === warcprox args ===
|
# warcprox args
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'-c', '--cacert', dest='cacert',
|
'-c', '--cacert', dest='cacert',
|
||||||
default='./%s-warcprox-ca.pem' % socket.gethostname(),
|
default='./%s-warcprox-ca.pem' % socket.gethostname(),
|
||||||
|
|
@ -83,24 +84,42 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
|
||||||
'host:port of tor socks proxy, used only to connect to '
|
'host:port of tor socks proxy, used only to connect to '
|
||||||
'.onion sites'))
|
'.onion sites'))
|
||||||
|
|
||||||
# === brozzler-worker args ===
|
# brozzler-worker args
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'-e', '--chrome-exe', dest='chrome_exe',
|
'-e', '--chrome-exe', dest='chrome_exe',
|
||||||
default=brozzler.cli.suggest_default_chome_exe(),
|
default=brozzler.cli.suggest_default_chome_exe(),
|
||||||
help='executable to use to invoke chrome')
|
help='executable to use to invoke chrome')
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'-n', '--max-browsers', dest='max_browsers', default='1',
|
'-n', '--max-browsers', dest='max_browsers',
|
||||||
help='max number of chrome instances simultaneously browsing pages')
|
type=int, default=1, help=(
|
||||||
|
'max number of chrome instances simultaneously '
|
||||||
|
'browsing pages'))
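Note on the hunk above: with `type=int` on `--max-browsers`, argparse does the conversion and rejects non-numeric values at parse time, which is why the later `int(args.max_browsers)` call can go away. A minimal sketch of that behaviour, assuming a standalone parser (`build_parser` and the `example` prog name are illustrative and not part of brozzler):

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the brozzler-easy parser; it carries only
    # the one option discussed here.
    parser = argparse.ArgumentParser(
        prog='example',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '-n', '--max-browsers', dest='max_browsers',
        type=int, default=1, help=(
            'max number of chrome instances simultaneously '
            'browsing pages'))
    return parser


# The parsed value is already an int, so callers need no int() wrapper.
args = build_parser().parse_args(['--max-browsers', '4'])
default_args = build_parser().parse_args([])
```

A non-numeric value such as `-n many` makes argparse print a usage error and exit, instead of failing later inside the worker.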
|
||||||
|
|
||||||
# === pywb args ===
|
# pywb args
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'--pywb-port', dest='pywb_port', type=int, default=8091,
|
'--pywb-address', dest='pywb_address',
|
||||||
help='pywb wayback port')
|
default='0.0.0.0',
|
||||||
|
help='pywb wayback address to listen on')
|
||||||
|
arg_parser.add_argument(
|
||||||
|
'--pywb-port', dest='pywb_port', type=int,
|
||||||
|
default=8880, help='pywb wayback port')
|
||||||
|
|
||||||
# === common at the bottom args ===
|
# webconsole args
|
||||||
arg_parser.add_argument(
|
arg_parser.add_argument(
|
||||||
'-v', '--verbose', dest='verbose', action='store_true')
|
'--webconsole-address', dest='webconsole_address',
|
||||||
arg_parser.add_argument('-q', '--quiet', dest='quiet', action='store_true')
|
default='localhost',
|
||||||
|
help='brozzler web console address to listen on')
|
||||||
|
arg_parser.add_argument(
|
||||||
|
'--webconsole-port', dest='webconsole_port',
|
||||||
|
type=int, default=8881, help='brozzler web console port')
|
||||||
|
|
||||||
|
# common at the bottom args
|
||||||
|
arg_parser.add_argument(
|
||||||
|
'-v', '--verbose', dest='verbose', action='store_true',
|
||||||
|
help='verbose logging')
|
||||||
|
arg_parser.add_argument(
|
||||||
|
'-q', '--quiet', dest='quiet', action='store_true',
|
||||||
|
help='quiet logging (warnings and errors only)')
|
||||||
# arg_parser.add_argument(
|
# arg_parser.add_argument(
|
||||||
# '-s', '--silent', dest='log_level', action='store_const',
|
# '-s', '--silent', dest='log_level', action='store_const',
|
||||||
# default=logging.INFO, const=logging.CRITICAL)
|
# default=logging.INFO, const=logging.CRITICAL)
|
||||||
|
|
@ -110,6 +129,10 @@ def _build_arg_parser(prog=os.path.basename(sys.argv[0])):
|
||||||
|
|
||||||
return arg_parser
|
return arg_parser
|
||||||
|
|
||||||
|
class ThreadingWSGIServer(
|
||||||
|
socketserver.ThreadingMixIn, wsgiref.simple_server.WSGIServer):
|
||||||
|
pass
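Note on the hunk above: moving `ThreadingWSGIServer` to module level lets both the pywb server and the new webconsole server pass it to `wsgiref.simple_server.make_server`. The following is a minimal sketch of how such a class serves a WSGI app, assuming a trivial app and a loopback address with an ephemeral port (`hello_app`, `QuietHandler`, and `fetch_once` are illustrative names, not brozzler code):

```python
import http.client
import socketserver
import threading
import wsgiref.simple_server


class ThreadingWSGIServer(
        socketserver.ThreadingMixIn, wsgiref.simple_server.WSGIServer):
    pass


class QuietHandler(wsgiref.simple_server.WSGIRequestHandler):
    # Illustrative only: silence per-request logging on stderr.
    def log_message(self, format, *args):
        pass


def hello_app(environ, start_response):
    # Trivial WSGI app standing in for brozzler.webconsole.app.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


def fetch_once():
    # Port 0 asks the OS for any free port.
    httpd = wsgiref.simple_server.make_server(
        '127.0.0.1', 0, hello_app, ThreadingWSGIServer,
        handler_class=QuietHandler)
    thread = threading.Thread(target=httpd.serve_forever)
    thread.start()
    try:
        port = httpd.server_address[1]
        conn = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
        conn.request('GET', '/')
        body = conn.getresponse().read()
        conn.close()
        return body
    finally:
        # Same shutdown call the controller uses for its httpd objects.
        httpd.shutdown()
        httpd.server_close()
        thread.join()
```

Each request is handled in its own thread, so one slow request does not block the others.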
|
||||||
|
|
||||||
class BrozzlerEasyController:
|
class BrozzlerEasyController:
|
||||||
logger = logging.getLogger(__module__ + "." + __qualname__)
|
logger = logging.getLogger(__module__ + "." + __qualname__)
|
||||||
|
|
||||||
|
|
@ -120,6 +143,12 @@ class BrozzlerEasyController:
|
||||||
self._warcprox_args(args))
|
self._warcprox_args(args))
|
||||||
self.brozzler_worker = self._init_brozzler_worker(args)
|
self.brozzler_worker = self._init_brozzler_worker(args)
|
||||||
self.pywb_httpd = self._init_pywb(args)
|
self.pywb_httpd = self._init_pywb(args)
|
||||||
|
self.webconsole_httpd = self._init_brozzler_webconsole(args)
|
||||||
|
|
||||||
|
def _init_brozzler_webconsole(self, args):
|
||||||
|
return wsgiref.simple_server.make_server(
|
||||||
|
args.webconsole_address, args.webconsole_port,
|
||||||
|
brozzler.webconsole.app, ThreadingWSGIServer)
|
||||||
|
|
||||||
def _init_brozzler_worker(self, args):
|
def _init_brozzler_worker(self, args):
|
||||||
r = rethinkstuff.Rethinker(
|
r = rethinkstuff.Rethinker(
|
||||||
|
|
@ -128,7 +157,7 @@ class BrozzlerEasyController:
|
||||||
service_registry = rethinkstuff.ServiceRegistry(r)
|
service_registry = rethinkstuff.ServiceRegistry(r)
|
||||||
worker = brozzler.worker.BrozzlerWorker(
|
worker = brozzler.worker.BrozzlerWorker(
|
||||||
frontier, service_registry,
|
frontier, service_registry,
|
||||||
max_browsers=int(args.max_browsers),
|
max_browsers=args.max_browsers,
|
||||||
chrome_exe=args.chrome_exe,
|
chrome_exe=args.chrome_exe,
|
||||||
proxy='%s:%s' % self.warcprox_controller.proxy.server_address,
|
proxy='%s:%s' % self.warcprox_controller.proxy.server_address,
|
||||||
enable_warcprox_features=True)
|
enable_warcprox_features=True)
|
||||||
|
|
@ -166,12 +195,9 @@ class BrozzlerEasyController:
|
||||||
|
|
||||||
# disable is_hop_by_hop restrictions
|
# disable is_hop_by_hop restrictions
|
||||||
wsgiref.handlers.is_hop_by_hop = lambda x: False
|
wsgiref.handlers.is_hop_by_hop = lambda x: False
|
||||||
class ThreadingWSGIServer(
|
|
||||||
six.moves.socketserver.ThreadingMixIn,
|
|
||||||
wsgiref.simple_server.WSGIServer):
|
|
||||||
pass
|
|
||||||
return wsgiref.simple_server.make_server(
|
return wsgiref.simple_server.make_server(
|
||||||
'', args.pywb_port, wsgi_app, ThreadingWSGIServer)
|
args.pywb_address, args.pywb_port, wsgi_app,
|
||||||
|
ThreadingWSGIServer)
|
||||||
|
|
||||||
def start(self):
|
def start(self):
|
||||||
self.logger.info('starting warcprox')
|
self.logger.info('starting warcprox')
|
||||||
|
|
@ -185,7 +211,15 @@ class BrozzlerEasyController:
|
||||||
'starting pywb at %s:%s', *self.pywb_httpd.server_address)
|
'starting pywb at %s:%s', *self.pywb_httpd.server_address)
|
||||||
threading.Thread(target=self.pywb_httpd.serve_forever).start()
|
threading.Thread(target=self.pywb_httpd.serve_forever).start()
|
||||||
|
|
||||||
|
self.logger.info(
|
||||||
|
'starting brozzler-webconsole at %s:%s',
|
||||||
|
*self.webconsole_httpd.server_address)
|
||||||
|
threading.Thread(target=self.webconsole_httpd.serve_forever).start()
|
||||||
|
|
||||||
def shutdown(self):
|
def shutdown(self):
|
||||||
|
self.logger.info('shutting down brozzler-webconsole')
|
||||||
|
self.webconsole_httpd.shutdown()
|
||||||
|
|
||||||
self.logger.info('shutting down brozzler-worker')
|
self.logger.info('shutting down brozzler-worker')
|
||||||
self.brozzler_worker.shutdown_now()
|
self.brozzler_worker.shutdown_now()
|
||||||
# brozzler-worker is fully shut down at this point
|
# brozzler-worker is fully shut down at this point
|
||||||
|
|
|
||||||
|
|
@ -69,7 +69,7 @@ class RethinkDbFrontier:
|
||||||
self.r.table("pages").index_create(
|
self.r.table("pages").index_create(
|
||||||
"least_hops", [
|
"least_hops", [
|
||||||
self.r.row["site_id"], self.r.row["brozzle_count"],
|
self.r.row["site_id"], self.r.row["brozzle_count"],
|
||||||
self.r.row["hops_from_seed"]])
|
self.r.row["hops_from_seed"]]).run()
|
||||||
if not "jobs" in tables:
|
if not "jobs" in tables:
|
||||||
self.logger.info(
|
self.logger.info(
|
||||||
"creating rethinkdb table 'jobs' in database %s",
|
"creating rethinkdb table 'jobs' in database %s",
|
||||||
|
|
|
||||||
|
|
@ -27,7 +27,6 @@ except ImportError as e:
|
||||||
'brozzler[webconsole]".\nSee README.rst for more information.',
|
'brozzler[webconsole]".\nSee README.rst for more information.',
|
||||||
type(e).__name__, e)
|
type(e).__name__, e)
|
||||||
sys.exit(1)
|
sys.exit(1)
|
||||||
|
|
||||||
import rethinkstuff
|
import rethinkstuff
|
||||||
import json
|
import json
|
||||||
import os
|
import os
|
||||||
|
|
@ -56,11 +55,16 @@ SETTINGS = {
|
||||||
'RETHINKDB_SERVERS', 'localhost').split(','),
|
'RETHINKDB_SERVERS', 'localhost').split(','),
|
||||||
'RETHINKDB_DB': os.environ.get('RETHINKDB_DB', 'brozzler'),
|
'RETHINKDB_DB': os.environ.get('RETHINKDB_DB', 'brozzler'),
|
||||||
'WAYBACK_BASEURL': os.environ.get(
|
'WAYBACK_BASEURL': os.environ.get(
|
||||||
'WAYBACK_BASEURL', 'http://wbgrp-svc107.us.archive.org:8091'),
|
'WAYBACK_BASEURL', 'http://localhost:8091/brozzler'),
|
||||||
}
|
}
|
||||||
r = rethinkstuff.Rethinker(
|
r = rethinkstuff.Rethinker(
|
||||||
SETTINGS['RETHINKDB_SERVERS'], db=SETTINGS['RETHINKDB_DB'])
|
SETTINGS['RETHINKDB_SERVERS'], db=SETTINGS['RETHINKDB_DB'])
|
||||||
service_registry = rethinkstuff.ServiceRegistry(r)
|
_svc_reg = None
|
||||||
|
def service_registry():
|
||||||
|
global _svc_reg
|
||||||
|
if not _svc_reg:
|
||||||
|
_svc_reg = rethinkstuff.ServiceRegistry(r)
|
||||||
|
return _svc_reg
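Note on the hunk above: replacing the module-level `service_registry` object with a function defers construction until the first call, presumably so that importing `brozzler.webconsole` (as brozzler-easy now does) does not build the registry at import time. A minimal sketch of the pattern, assuming a stand-in class (`FakeServiceRegistry` is hypothetical and only counts how often it is constructed):

```python
class FakeServiceRegistry:
    """Hypothetical stand-in for rethinkstuff.ServiceRegistry."""
    instances = 0

    def __init__(self):
        FakeServiceRegistry.instances += 1


_svc_reg = None


def service_registry():
    # Build the registry on the first call, then hand back the same object.
    global _svc_reg
    if not _svc_reg:
        _svc_reg = FakeServiceRegistry()
    return _svc_reg
```

Callers then write `service_registry().available_services(...)`, as in the `workers()` and `services()` hunks further down. This simple form is not guarded against two threads racing on the first call.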
|
||||||
|
|
||||||
@app.route("/api/sites/<site_id>/queued_count")
|
@app.route("/api/sites/<site_id>/queued_count")
|
||||||
@app.route("/api/site/<site_id>/queued_count")
|
@app.route("/api/site/<site_id>/queued_count")
|
||||||
|
|
@ -149,6 +153,16 @@ def sites(job_id):
|
||||||
s["cookie_db"] = base64.b64encode(s["cookie_db"]).decode("ascii")
|
s["cookie_db"] = base64.b64encode(s["cookie_db"]).decode("ascii")
|
||||||
return flask.jsonify(sites=sites_)
|
return flask.jsonify(sites=sites_)
|
||||||
|
|
||||||
|
@app.route("/api/jobless-sites")
|
||||||
|
def jobless_sites():
|
||||||
|
# XXX inefficient (unindexed) query
|
||||||
|
sites_ = list(r.table("sites").filter(~r.row.has_fields("job_id")).run())
|
||||||
|
# TypeError: <binary, 7168 bytes, '53 51 4c 69 74 65...'> is not JSON serializable
|
||||||
|
for s in sites_:
|
||||||
|
if "cookie_db" in s:
|
||||||
|
s["cookie_db"] = base64.b64encode(s["cookie_db"]).decode("ascii")
|
||||||
|
return flask.jsonify(sites=sites_)
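Note on the hunk above: the `cookie_db` field holds raw bytes, which a JSON encoder cannot serialize, so the handler base64-encodes it before calling `flask.jsonify`. A minimal sketch of the same step using only the standard library (`make_json_safe` and the sample records are illustrative, not taken from brozzler):

```python
import base64
import json


def make_json_safe(sites):
    # Return copies of the site dicts in which a binary cookie_db field
    # is replaced by its base64 text form.
    safe = []
    for site in sites:
        site = dict(site)
        if "cookie_db" in site:
            site["cookie_db"] = base64.b64encode(
                site["cookie_db"]).decode("ascii")
        safe.append(site)
    return safe


# Made-up sample data; the first record carries a binary field.
sites = [
    {"id": 1, "seed": "http://example.com/",
     "cookie_db": b"SQLite format 3\x00"},
    {"id": 2, "seed": "http://example.org/"},
]
encoded = json.dumps(make_json_safe(sites))
```

A client can recover the original bytes with `base64.b64decode` on the returned string.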
|
||||||
|
|
||||||
@app.route("/api/jobs/<int:job_id>")
|
@app.route("/api/jobs/<int:job_id>")
|
||||||
@app.route("/api/job/<int:job_id>")
|
@app.route("/api/job/<int:job_id>")
|
||||||
def job(job_id):
|
def job(job_id):
|
||||||
|
|
@ -165,12 +179,12 @@ def job_yaml(job_id):
|
||||||
|
|
||||||
@app.route("/api/workers")
|
@app.route("/api/workers")
|
||||||
def workers():
|
def workers():
|
||||||
workers_ = service_registry.available_services("brozzler-worker")
|
workers_ = service_registry().available_services("brozzler-worker")
|
||||||
return flask.jsonify(workers=list(workers_))
|
return flask.jsonify(workers=list(workers_))
|
||||||
|
|
||||||
@app.route("/api/services")
|
@app.route("/api/services")
|
||||||
def services():
|
def services():
|
||||||
services_ = service_registry.available_services()
|
services_ = service_registry().available_services()
|
||||||
return flask.jsonify(services=list(services_))
|
return flask.jsonify(services=list(services_))
|
||||||
|
|
||||||
@app.route("/api/jobs")
|
@app.route("/api/jobs")
|
||||||
|
|
@ -221,7 +235,26 @@ except ImportError:
|
||||||
logging.info('running brozzler-webconsole using simple flask app.run')
|
logging.info('running brozzler-webconsole using simple flask app.run')
|
||||||
app.run()
|
app.run()
|
||||||
|
|
||||||
if __name__ == "__main__":
|
def main():
|
||||||
# arguments?
|
import argparse
|
||||||
|
arg_parser = argparse.ArgumentParser(
|
||||||
|
prog=os.path.basename(sys.argv[0]),
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
description=(
|
||||||
|
'brozzler-webconsole - web application for viewing brozzler '
|
||||||
|
'crawl status'),
|
||||||
|
epilog=(
|
||||||
|
'brozzler-webconsole has no command line options, but can be '
|
||||||
|
'configured using the following environment variables:\n\n'
|
||||||
|
' RETHINKDB_SERVERS rethinkdb servers, e.g. db0.foo.org,'
|
||||||
|
'db0.foo.org:38015,db1.foo.org (default: localhost)\n'
|
||||||
|
' RETHINKDB_DB rethinkdb database name (default: '
|
||||||
|
'brozzler)\n'
|
||||||
|
' WAYBACK_BASEURL base url for constructing wayback '
|
||||||
|
'links (default http://localhost:8091/brozzler)'))
|
||||||
|
args = arg_parser.parse_args(args=sys.argv[1:])
|
||||||
run()
|
run()
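Note on the hunk above: `RawDescriptionHelpFormatter` leaves the description and epilog unwrapped, so the aligned list of environment variables in the epilog is printed by `--help` as written. A minimal sketch, assuming made-up variable names (`EXAMPLE_DB` and `EXAMPLE_URL` are placeholders, not brozzler settings):

```python
import argparse

parser = argparse.ArgumentParser(
    prog='example',
    formatter_class=argparse.RawDescriptionHelpFormatter,
    description='example - demonstrates a preformatted epilog',
    epilog=(
        'configured using the following environment variables:\n\n'
        '  EXAMPLE_DB    database name (default: example)\n'
        '  EXAMPLE_URL   base url (default: http://localhost:8000)'))

# format_help() returns the same text that --help would print.
help_text = parser.format_help()
```

With the default formatter, argparse would re-wrap the epilog and collapse the line breaks and column alignment.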
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -79,6 +79,9 @@ brozzlerControllers.controller("HomeController", ["$scope", "$http",
|
||||||
$http.get("/api/services").success(function(data) {
|
$http.get("/api/services").success(function(data) {
|
||||||
$scope.services = data.services;
|
$scope.services = data.services;
|
||||||
});
|
});
|
||||||
|
$http.get("/api/jobless-sites").success(function(data) {
|
||||||
|
$scope.joblessSites = data.sites;
|
||||||
|
});
|
||||||
}]);
|
}]);
|
||||||
|
|
||||||
brozzlerControllers.controller("WorkersListController", ["$scope", "$http",
|
brozzlerControllers.controller("WorkersListController", ["$scope", "$http",
|
||||||
|
|
|
||||||
|
|
@ -41,7 +41,6 @@
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<h2>Jobs</h2>
|
<h2>Jobs</h2>
|
||||||
|
|
||||||
<div class="row">
|
<div class="row">
|
||||||
<div class="col-sm-12">
|
<div class="col-sm-12">
|
||||||
<table class="table table-striped">
|
<table class="table table-striped">
|
||||||
|
|
@ -66,4 +65,29 @@
|
||||||
</table>
|
</table>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
<h2>Jobless Sites</h2>
|
||||||
|
<div class="row">
|
||||||
|
<div class="col-sm-12">
|
||||||
|
<table class="table table-striped">
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>id</th>
|
||||||
|
<th>status</th>
|
||||||
|
<th>started</th>
|
||||||
|
<th>seed url</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr ng-repeat="site in joblessSites">
|
||||||
|
<td><a href="/sites/{{site.id}}">{{site.id}}</a></td>
|
||||||
|
<td>{{site.status}}</td>
|
||||||
|
<td>{{site.start_time}}</td>
|
||||||
|
<td>{{site.seed}}</td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
|
||||||
6
setup.py
|
|
@ -32,7 +32,7 @@ def find_package_data(package):
|
||||||
|
|
||||||
setuptools.setup(
|
setuptools.setup(
|
||||||
name='brozzler',
|
name='brozzler',
|
||||||
version='1.1b6.dev69',
|
version='1.1b6.dev78',
|
||||||
description='Distributed web crawling with browsers',
|
description='Distributed web crawling with browsers',
|
||||||
url='https://github.com/internetarchive/brozzler',
|
url='https://github.com/internetarchive/brozzler',
|
||||||
author='Noah Levitt',
|
author='Noah Levitt',
|
||||||
|
|
@ -51,7 +51,7 @@ setuptools.setup(
|
||||||
'brozzler-new-site=brozzler.cli:brozzler_new_site',
|
'brozzler-new-site=brozzler.cli:brozzler_new_site',
|
||||||
'brozzler-worker=brozzler.cli:brozzler_worker',
|
'brozzler-worker=brozzler.cli:brozzler_worker',
|
||||||
'brozzler-ensure-tables=brozzler.cli:brozzler_ensure_tables',
|
'brozzler-ensure-tables=brozzler.cli:brozzler_ensure_tables',
|
||||||
'brozzler-webconsole=brozzler.webconsole:run',
|
'brozzler-webconsole=brozzler.webconsole:main',
|
||||||
'brozzler-easy=brozzler.easy:main',
|
'brozzler-easy=brozzler.easy:main',
|
||||||
],
|
],
|
||||||
},
|
},
|
||||||
|
|
@ -69,7 +69,7 @@ setuptools.setup(
|
||||||
],
|
],
|
||||||
extras_require={
|
extras_require={
|
||||||
'webconsole': ['flask>=0.11', 'gunicorn'],
|
'webconsole': ['flask>=0.11', 'gunicorn'],
|
||||||
'easy': ['warcprox>=2.0b1', 'pywb'],
|
'easy': ['warcprox>=2.0b1', 'pywb', 'flask>=0.11', 'gunicorn'],
|
||||||
},
|
},
|
||||||
zip_safe=False,
|
zip_safe=False,
|
||||||
classifiers=[
|
classifiers=[
|
||||||
|
|
|
||||||
11
vagrant/Vagrantfile
vendored
|
|
@ -1,16 +1,13 @@
|
||||||
Vagrant.configure(2) do |config|
|
Vagrant.configure(2) do |config|
|
||||||
config.vm.box = "ubuntu/trusty64"
|
config.vm.box = "ubuntu/trusty64"
|
||||||
config.vm.hostname = "brozzler-easy"
|
config.vm.define "10.9.9.9"
|
||||||
|
config.vm.hostname = "brzl"
|
||||||
|
config.vm.network :private_network, ip: "10.9.9.9"
|
||||||
|
|
||||||
config.vm.synced_folder "..", "/brozzler"
|
config.vm.synced_folder "..", "/brozzler"
|
||||||
|
|
||||||
config.vm.provision "ansible" do |ansible|
|
config.vm.provision "ansible" do |ansible|
|
||||||
|
ansible.inventory_path = "ansible/hosts"
|
||||||
ansible.playbook = "ansible/playbook.yml"
|
ansible.playbook = "ansible/playbook.yml"
|
||||||
ansible.groups = {
|
|
||||||
"rethinkdb" => ["default"],
|
|
||||||
"warcprox" => ["default"],
|
|
||||||
"brozzler-worker" => ["default"],
|
|
||||||
# "brozzler-webconsole" => ["default"],
|
|
||||||
}
|
|
||||||
end
|
end
|
||||||
end
|
end
|
||||||
|
|
|
||||||
16
vagrant/ansible/hosts
Normal file
|
|
@ -0,0 +1,16 @@
|
||||||
|
ansible_ssh_private_key_file=.vagrant/machines/10.9.9.9/virtualbox/private_key
|
||||||
|
|
||||||
|
[rethinkdb]
|
||||||
|
10.9.9.9
|
||||||
|
|
||||||
|
[warcprox]
|
||||||
|
10.9.9.9
|
||||||
|
|
||||||
|
[brozzler-worker]
|
||||||
|
10.9.9.9
|
||||||
|
|
||||||
|
[brozzler-webconsole]
|
||||||
|
10.9.9.9
|
||||||
|
|
||||||
|
[pywb]
|
||||||
|
10.9.9.9
|
||||||
|
|
@ -2,27 +2,27 @@
|
||||||
- name: apply common configuration to all nodes
|
- name: apply common configuration to all nodes
|
||||||
hosts: all
|
hosts: all
|
||||||
roles:
|
roles:
|
||||||
- common
|
- common
|
||||||
|
|
||||||
- name: deploy rethinkdb
|
- name: deploy rethinkdb
|
||||||
hosts: rethinkdb
|
hosts: rethinkdb
|
||||||
roles:
|
roles:
|
||||||
- rethinkdb
|
- rethinkdb
|
||||||
|
|
||||||
- name: deploy warcprox
|
- name: deploy warcprox
|
||||||
hosts: warcprox
|
hosts: warcprox
|
||||||
roles:
|
roles:
|
||||||
- warcprox
|
- warcprox
|
||||||
|
|
||||||
- name: deploy brozzler-worker
|
- name: deploy brozzler-worker
|
||||||
hosts: brozzler-worker
|
hosts: brozzler-worker
|
||||||
roles:
|
roles:
|
||||||
- brozzler-worker
|
- brozzler-worker
|
||||||
|
|
||||||
# - name: deploy brozzler-webconsole
|
- name: deploy brozzler-webconsole
|
||||||
# hosts: brozzler-webconsole
|
hosts: brozzler-webconsole
|
||||||
# roles:
|
roles:
|
||||||
# - brozzler-webconsole
|
- brozzler-webconsole
|
||||||
|
|
||||||
# - name: deploy pywb
|
# - name: deploy pywb
|
||||||
# hosts: pywb
|
# hosts: pywb
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,4 @@
|
||||||
|
---
|
||||||
|
- name: restart brozzler-webconsole
|
||||||
|
service: name=brozzler-webconsole state=restarted
|
||||||
|
become: true
|
||||||
|
|
@ -1,19 +1,15 @@
|
||||||
---
|
---
|
||||||
- name: git clone https://github.com/internetarchive/brozzler.git
|
- name: install brozzler[webconsole] in virtualenv
|
||||||
git: repo=https://github.com/internetarchive/brozzler.git
|
become: true
|
||||||
dest=/home/vagrant/brozzler
|
pip: name='-e /brozzler[webconsole]'
|
||||||
- name: pip install -r requirements.txt in virtualenv
|
|
||||||
pip: requirements=/home/vagrant/brozzler/webconsole/requirements.txt
|
|
||||||
virtualenv=/home/vagrant/brozzler-webconsole-ve34
|
virtualenv=/home/vagrant/brozzler-webconsole-ve34
|
||||||
virtualenv_python=python3.4
|
virtualenv_python=python3.4
|
||||||
extra_args='--no-input --upgrade --pre'
|
extra_args='--no-input --upgrade --pre'
|
||||||
notify:
|
notify:
|
||||||
- restart brozzler-webconsole
|
- restart brozzler-webconsole
|
||||||
- name: install upstart config /etc/init/brozzler-webconsole.conf
|
- name: install upstart config /etc/init/brozzler-webconsole.conf
|
||||||
become: true
|
become: true
|
||||||
template: src=templates/brozzler-webconsole.conf.j2
|
template: src=templates/brozzler-webconsole.conf.j2
|
||||||
dest=/etc/init/brozzler-webconsole.conf
|
dest=/etc/init/brozzler-webconsole.conf
|
||||||
notify:
|
notify:
|
||||||
- restart brozzler-webconsole
|
- restart brozzler-webconsole
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -3,19 +3,16 @@ description "brozzler-webconsole"
|
||||||
start on runlevel [2345]
|
start on runlevel [2345]
|
||||||
stop on runlevel [!2345]
|
stop on runlevel [!2345]
|
||||||
|
|
||||||
env PYTHONPATH=/home/vagrant/brozzler-webconsole-ve34/lib/python3.4/site-packages:/home/vagrant/brozzler/webconsole
|
env PYTHONPATH=/home/vagrant/brozzler-webconsole-ve34/lib/python3.4/site-packages
|
||||||
env PATH=/home/vagrant/brozzler-webconsole-ve34/bin:/usr/bin:/bin
|
env PATH=/home/vagrant/brozzler-webconsole-ve34/bin:/usr/bin:/bin
|
||||||
env LC_ALL=C.UTF-8
|
env LC_ALL=C.UTF-8
|
||||||
|
|
||||||
env WAYBACK_BASEURL={{base_wayback_url}}/all
|
env WAYBACK_BASEURL=http://{{groups['pywb'][0]}}:8880/brozzler
|
||||||
# env RETHINKDB_SERVERS={{groups['rethinkdb'] | join(',')}}
|
env RETHINKDB_SERVERS={{groups['rethinkdb'] | join(',')}}
|
||||||
env RETHINKDB_SERVERS=localhost
|
env RETHINKDB_DB=brozzler
|
||||||
env RETHINKDB_DB={{rethinkdb_db}}
|
|
||||||
|
|
||||||
setuid vagrant
|
setuid vagrant
|
||||||
|
|
||||||
# console log
|
# console log
|
||||||
|
|
||||||
exec gunicorn --bind=0.0.0.0:8081 brozzler-webconsole:app >&/vagrant/logs/brozzler-webconsole.log
|
exec gunicorn --bind=0.0.0.0:8881 brozzler.webconsole:app >>/vagrant/logs/brozzler-webconsole.log 2>&1
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -19,7 +19,5 @@ stop on stopping Xvnc
|
||||||
kill timeout 60
|
kill timeout 60
|
||||||
|
|
||||||
exec nice brozzler-worker \
|
exec nice brozzler-worker \
|
||||||
--rethinkdb-servers=localhost \
|
--rethinkdb-servers={{groups['rethinkdb'] | join(',')}} \
|
||||||
--max-browsers=4 >>/vagrant/logs/brozzler-worker.log 2>&1
|
--max-browsers=4 >>/vagrant/logs/brozzler-worker.log 2>&1
|
||||||
# --rethinkdb-servers={{groups['rethinkdb'] | join(',')}} \
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -10,5 +10,6 @@ console log
|
||||||
env PYTHONPATH=/home/vagrant/websockify-ve34/lib/python3.4/site-packages
|
env PYTHONPATH=/home/vagrant/websockify-ve34/lib/python3.4/site-packages
|
||||||
env PATH=/home/vagrant/websockify-ve34/bin:/usr/bin:/bin
|
env PATH=/home/vagrant/websockify-ve34/bin:/usr/bin:/bin
|
||||||
|
|
||||||
|
# port 8901 is hard-coded in brozzler/webconsole/static/partials/workers.html
|
||||||
exec nice websockify 0.0.0.0:8901 localhost:5901
|
exec nice websockify 0.0.0.0:8901 localhost:5901
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -1,5 +1,5 @@
|
||||||
runuser=vagrant
|
runuser=vagrant
|
||||||
# bind=0.0.0.0
|
bind=0.0.0.0
|
||||||
# directory=/var/lib/rethinkdb
|
# directory=/var/lib/rethinkdb
|
||||||
# log-file=/var/log/rethinkdb.log
|
# log-file=/var/log/rethinkdb.log
|
||||||
log-file=/vagrant/logs/rethinkdb.log # synced dir
|
log-file=/vagrant/logs/rethinkdb.log # synced dir
|
||||||
|
|
|
||||||