brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 00:29:53 -05:00

Author	SHA1	Message	Date
Noah Levitt	5bb392ec7c	ssurts are strings now because they're friendlier that way in rethinkdb	2018-05-16 16:43:10 -07:00
Noah Levitt	399c097c7c	travis-ci install warcprox from github	2018-05-16 15:48:29 -07:00
Noah Levitt	ac735639ff	incorporate urlcanon fix	2018-05-16 14:41:49 -07:00
Noah Levitt	338d2e48f9	update warcprox dependency to include recent fixes	2018-05-16 14:26:51 -07:00
Noah Levitt	b9b8dcd062	backward compatibility for old scope["surt"] and make sure to store ssurt as string in rethinkdb	2018-05-16 14:19:23 -07:00
Noah Levitt	1572fd3ed6	missed a spot where is_permitted_by_robots needs monkeying	2018-05-15 16:52:48 -07:00
Noah Levitt	de1f240e25	describe scope rule conditions plus a bunch of tweaks and fixes	2018-05-15 11:01:09 -07:00
Noah Levitt	a327cb626f	more explication of scoping	2018-05-14 17:31:45 -07:00
Noah Levitt	2cf474aa1d	update docs to match new seed ssurt behavior	2018-05-14 16:59:55 -07:00
Noah Levitt	fc05cac338	ok seriously tests	2018-05-14 15:38:28 -07:00
Noah Levitt	05f8ab3495	fix more tests for new approach sans scope['surt']	2018-05-14 15:38:28 -07:00
Noah Levitt	85a4757527	s/max_hops_off_surt/max_hops_off/	2018-05-14 15:38:28 -07:00
Noah Levitt	5ebd2fb709	new test of max_hops_off	2018-05-14 15:38:28 -07:00
Noah Levitt	b83d3cb9df	rename page.hops_off_surt to page.hops_off	2018-05-14 15:38:28 -07:00
Noah Levitt	60f2b99cc0	doublethink had a bug fix	2018-05-14 15:38:28 -07:00
Noah Levitt	526a4d718f	tests for new approach without scope['surt'] replaced by an accept rule (two rules in some cases of seed redirects)	2018-05-14 15:38:28 -07:00
Noah Levitt	245e27a21a	tests for new approach without of scope['surt'] replaced by an accept rule (two rules in some cases of seed redirects)	2018-05-14 15:38:28 -07:00
Noah Levitt	f26712ce93	WIP add an accept rule instead of modifying surt in place for seed redirects	2018-05-14 15:38:28 -07:00
Noah Levitt	98ce67ef36	WIP some words on scoping	2018-05-14 15:38:28 -07:00
Noah Levitt	88214236bb	WIP starting to flesh out "scoping" section	2018-05-14 15:38:28 -07:00
Noah Levitt	6df2c1cf22	WIP some explanation of automatic login	2018-05-14 15:38:28 -07:00
Noah Levitt	914289b414	WIP documentation!	2018-05-14 15:38:28 -07:00
Noah Levitt	a1af18230c	Merge pull request #103 from internetarchive/ARI-5671 instagram updates	2018-03-23 14:18:04 -07:00
Barbara Miller	426ca48554	less is more	2018-03-23 14:17:22 -07:00
Barbara Miller	51977908ec	uncomment; now tested	2018-03-20 10:39:14 -07:00
Barbara Miller	9e871a9f81	instagram umbraBehavior & vanishing elem fix	2018-03-20 10:22:55 -07:00
Noah Levitt	6aa8af9d80	Merge pull request #101 from galgeek/ARI-5617 repeatSameElement, firstMatchOnly, configurable interval timing, for ARI-5617	2018-03-19 16:36:52 -07:00
Barbara Miller	1e2e7213c8	better booleans for umbraBehavior	2018-03-19 16:31:23 -07:00
Barbara Miller	bc5a36e8a3	better booleans	2018-03-19 16:28:47 -07:00
Barbara Miller	745e6cc942	log behavior params better	2018-03-19 16:28:14 -07:00
Barbara Miller	ae6f72769a	better config names	2018-03-19 16:02:07 -07:00
Barbara Miller	74fc7cd102	update behaviors.yaml	2018-03-19 14:44:29 -07:00
Barbara Miller	cc207763d5	add onceOnly config; other tweaks	2018-03-19 14:44:29 -07:00
Barbara Miller	8f861389ba	amerciaspresidents.si.edu/gallery behavior	2018-03-19 14:44:29 -07:00
Barbara Miller	5dfb081bb4	skipIDcheck, default false / no / 0	2018-03-19 14:44:29 -07:00
Barbara Miller	8f12f0b0c0	better idCheck and configurable interval timing	2018-03-19 14:44:04 -07:00
Barbara Miller	c31f13e47f	add idCheck feature, default: true	2018-03-19 14:44:04 -07:00
Noah Levitt	8e273b2e6b	Merge pull request #100 from nlevitt/max-claimed-sites reimplement max_claimed_sites	2018-03-15 15:05:46 -07:00
Noah Levitt	dc00f5de32	reimplement max_claimed_sites Other approach was too slow and caused db contention. New approach avoids (slow) rethinkdb join by max_claimed_sites job parameter to each of the job's sites. Uses rethinkdb fold() to count claimed sites and enforce max_claimed_sites within a single query.	2018-03-15 12:57:49 -07:00
Noah Levitt	55701ae373	bump version number after merge	2018-03-08 16:49:28 -08:00
jkafader	7d61673d3e	Merge pull request #97 from nlevitt/max-claimed-sites Max claimed sites	2018-03-08 16:48:31 -08:00
Noah Levitt	4daac3dfc5	fix timely time limit enforcement by including current brozzling session duration in time accounting	2018-03-05 17:05:41 -08:00
Noah Levitt	318ae13bcb	honor stop request before choosing proxy makes test_warcprox_outage_resiliency pass again	2018-03-05 16:08:24 -08:00
Noah Levitt	a914fb8461	Merge pull request #99 from vbanos/chromium-single-process Use single process model for chromium-browser	2018-03-05 12:06:20 -08:00
Vangelis Banos	171ce8d854	Use single process model for chromium-browser By default chromium creates multiple renderer processes (each running multiple threads) for each instance of a site the user visits. What we see from `ps auxcf` output is the following: ``` \_ chromium-browse \_ chromium-browse | \_ chromium-browse | \_ chromium-browse | \_ chromium-browse | \_ chromium-browse ``` Using the `--single-process` option, we run all renderers in the same process, saving the overhead of running multiple processes. `ps auxcf` output is the following: ``` \_ chromium-browse \_ chromium-browse \_ chromium-browse ``` Performance is improved a bit and I guess that using this in large scale Brozzler deployments will have even better performance effects. The potential problem of `--single-process` is stability (if a renderer crashes, the whole browser also crashes) but since we use very short-lived instances of chromium, we don't worry about this. Details on chromium process models: https://www.chromium.org/developers/design-documents/process-models	2018-03-04 20:48:29 +00:00
Noah Levitt	2639d7b991	fix query to make tests pass?	2018-03-02 16:30:35 -08:00
Noah Levitt	f9834ca77d	bump after merge	2018-03-02 11:51:50 -08:00
Noah Levitt	a0710b605c	Merge pull request #96 from vbanos/jinja2-auto-reload Disable Jinja2 template auto_reload for higher performance	2018-03-02 11:51:11 -08:00
Noah Levitt	f26d711a89	new job setting max_claimed_sites Puts a cap on the number of sites belonging to a given job that can be brozzled simultaneously across the cluster. Addresses the problem of a job with many seeds starving out other jobs. For AITFIVE-1578.	2018-03-01 17:17:54 -08:00
Noah Levitt	d7512fbeb6	move time limit enforcement now it's next to stop request enforcement which makes more sense and supports more timely action	2018-03-01 11:28:30 -08:00

1011 Commits