mirror of
https://github.com/internetarchive/brozzler.git
synced 2025-02-24 00:29:53 -05:00
there is no hq anymore; database notes can still be found in git history, though there's nothing about rethinkdb
This commit is contained in:
parent
231d019659
commit
68de85022a
119
hq-notes.txt
119
hq-notes.txt
@ -1,119 +0,0 @@
|
|||||||
possible architecture of brozzler-hq
|
|
||||||
====================================
|
|
||||||
|
|
||||||
keeps queues in rdbms
|
|
||||||
because easy to update, index on priority, index on canonicalized url
|
|
||||||
also easy to inspect
|
|
||||||
initially sqlite
|
|
||||||
|
|
||||||
-- sqlite3 syntax
|
|
||||||
create table brozzler_sites (
|
|
||||||
id integer primary key,
|
|
||||||
-- claimed boolean,
|
|
||||||
site_json text,
|
|
||||||
-- data_limit integer, -- bytes
|
|
||||||
-- time_limit integer, -- seconds
|
|
||||||
-- page_limit integer,
|
|
||||||
);
|
|
||||||
|
|
||||||
create table brozzler_urls (
|
|
||||||
id integer primary key,
|
|
||||||
site_id integer,
|
|
||||||
priority integer,
|
|
||||||
in_progress boolean,
|
|
||||||
canon_url varchar(4000),
|
|
||||||
crawl_url_json text,
|
|
||||||
index(priority),
|
|
||||||
index(canon_url),
|
|
||||||
index(site_id)
|
|
||||||
);
|
|
||||||
|
|
||||||
feeds rabbitmq:
|
|
||||||
- json payloads
|
|
||||||
- queue per site brozzler.{site_id}.crawl_urls
|
|
||||||
- queue of unclaimed sites brozzler.sites.unclaimed
|
|
||||||
|
|
||||||
reads from rabbitmq
|
|
||||||
- queue of new sites brozzler.sites.new
|
|
||||||
- queue per site brozzler.{site_id}.completed_urls
|
|
||||||
* json blob fed to this queue includes urls extracted to schedule
|
|
||||||
|
|
||||||
??? brozzler-hq considers site unclaimed if brozzler.{site_id}.crawl_urls has
|
|
||||||
not been read in some amount of time ??? or do workers need to explicitly
|
|
||||||
disclaim ???
|
|
||||||
|
|
||||||
brozzler-worker
|
|
||||||
- decides if it can run a new browser
|
|
||||||
- if so reads site from brozzler.sites.unclaimed
|
|
||||||
- site includes scope definition, crawl job info, ...
|
|
||||||
- starts browser
|
|
||||||
- reads urls from brozzler.{site-id}.crawl_urls
|
|
||||||
- after each(?) (every n?) urls, feeds brozzler.{site_id}.completed_urls
|
|
||||||
|
|
||||||
=== considering distributed database ===
|
|
||||||
preferred database requirements:
|
|
||||||
- secondary index (so we can look up by url or priority)
|
|
||||||
- good performance on updates since we will be doing many updates
|
|
||||||
- good performance of secondary index on updates that change the value of secondarily indexed field
|
|
||||||
- ideally strong consistency to support multiple instances of brozzler-hq (but we can probably tolerate eventual consistency)
|
|
||||||
- redundancy, fault tolerance
|
|
||||||
|
|
||||||
alternative to distrubuted database: each brozzler-hq instance has its own local db (sqlite?) and distribution is handled at application level
|
|
||||||
but implementing redundancy, fault tolerance, etc sounds daunting
|
|
||||||
|
|
||||||
cassandra:
|
|
||||||
- pluses
|
|
||||||
- easy to set up cluster, add nodes, administer (all nodes are basically the same)
|
|
||||||
- sharding, replication, fault tolerance are native, default features
|
|
||||||
- seems more reliable than others?
|
|
||||||
- minuses
|
|
||||||
- not so good for looking up pages by both url and priority because
|
|
||||||
- secondary indexes are bad for columns with high cardinality (url), and also bad for columns that get updated frequently (priority)
|
|
||||||
- other approach with second table by "priority_key" also not great because you can't update the value of a primary key, have to delete it and add a new row, and deletion in cassandra seems kind of heavy ("tombstones")
|
|
||||||
- cqlsh:brozzler> select * from priorities order by priority_key desc limit 1;
|
|
||||||
- InvalidRequest: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
|
|
||||||
- cqlsh:brozzler> select * from priorities where priority_key >= 999900000000;
|
|
||||||
- InvalidRequest: code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"
|
|
||||||
- possible solution: finite set of possible priorities, e.g. 0-1000, then secondary-indexable etc
|
|
||||||
redis:
|
|
||||||
- pluses
|
|
||||||
- fast, reliable, already known at ia
|
|
||||||
- perhaps can use the data structures
|
|
||||||
- minuses
|
|
||||||
- no experience with cluster at ia nor ilya
|
|
||||||
- all data being in memory limits amount of data
|
|
||||||
- Sam says sync to disk is slow
|
|
||||||
- no real namespaces
|
|
||||||
hbase:
|
|
||||||
- pluses
|
|
||||||
- already deployed, known, dedup data is already in there
|
|
||||||
- minuses
|
|
||||||
- no secondary indexes
|
|
||||||
- has not been very reliable for us, lots of moving parts
|
|
||||||
mongodb:
|
|
||||||
- pluses
|
|
||||||
- very popular according to http://db-engines.com/en/ranking
|
|
||||||
- secondary indexes
|
|
||||||
- some institutional knowledge (kenji)
|
|
||||||
- minuses
|
|
||||||
- according to kenji (https://webarchive.jira.com/wiki/display/~nlevitt/2015/08/10/Kenji%27s+thoughts+on+MongoDB)
|
|
||||||
- cluster is very cumbersome to setup & manage
|
|
||||||
- cluster member names are hard-wired
|
|
||||||
- each shard must be configured with master-slave pair if you want high availability.
|
|
||||||
- you cannot easily replace one shard with different VM
|
|
||||||
- mongodb is known to be slow on writes
|
|
||||||
couchdb:
|
|
||||||
- pluses
|
|
||||||
- mature, more reliable?
|
|
||||||
- minuses
|
|
||||||
- doesn't support sharding natively
|
|
||||||
- sharded implementations seem stale (bigcouch, lounge, ...)
|
|
||||||
multi-master rdbms (postgres-xl, mysql-cluster):
|
|
||||||
- pluses
|
|
||||||
- yes secondary indexes
|
|
||||||
- minuses:
|
|
||||||
- more difficult to deploy, administer?
|
|
||||||
- seem to be less uses than other distributed dbs, smaller community, less knowledge and experience available
|
|
||||||
- fault tolerance not so great? see http://www.slideshare.net/mason_s/postgres-xl-scaling slide 9
|
|
||||||
|
|
||||||
|
|
Loading…
x
Reference in New Issue
Block a user