1011 Commits

Author SHA1 Message Date
Noah Levitt
5bb392ec7c ssurts are strings now
because they're friendlier that way in rethinkdb
2018-05-16 16:43:10 -07:00
Noah Levitt
399c097c7c travis-ci install warcprox from github 2018-05-16 15:48:29 -07:00
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9 update warcprox dependency to include recent fixes 2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062 backward compatibility for old scope["surt"]
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
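Note on b9b8dcd062: the message describes two things, accepting an old-style scope["surt"] and storing the value as a string rather than bytes. A minimal sketch of that kind of migration follows. The function name `migrate_legacy_scope`, the "accepts" key, and the rule shape are assumptions made for illustration, not the code from the commit.

```python
def migrate_legacy_scope(scope):
    """Return a copy of `scope` with a legacy "surt" key turned into an accept rule.

    Hypothetical helper: the key names and rule shape are assumed for illustration.
    """
    migrated = dict(scope)
    # Copy the list so the caller's scope dict is never mutated.
    accepts = list(migrated.get("accepts", []))
    legacy = migrated.pop("surt", None)
    if legacy is not None:
        # Store text rather than bytes, which is friendlier in a JSON-like database.
        if isinstance(legacy, bytes):
            legacy = legacy.decode("utf-8")
        accepts.append({"surt": legacy})
    if accepts:
        migrated["accepts"] = accepts
    return migrated
```

A scope that has no "surt" key comes back as an unchanged copy, so the helper is safe to run on every load.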
Noah Levitt
1572fd3ed6 missed a spot where is_permitted_by_robots needs monkeying 2018-05-15 16:52:48 -07:00
Noah Levitt
de1f240e25 describe scope rule conditions
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f more explication of scoping 2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d update docs to match new seed ssurt behavior 2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338 ok seriously tests 2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495 fix more tests for new approach sans scope['surt'] 2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527 s/max_hops_off_surt/max_hops_off/ 2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709 new test of max_hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df rename page.hops_off_surt to page.hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0 doublethink had a bug fix 2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26712ce93 WIP add an accept rule instead of modifying surt
in place for seed redirects
2018-05-14 15:38:28 -07:00
Noah Levitt
98ce67ef36 WIP some words on scoping 2018-05-14 15:38:28 -07:00
Noah Levitt
88214236bb WIP starting to flesh out "scoping" section 2018-05-14 15:38:28 -07:00
Noah Levitt
6df2c1cf22 WIP some explanation of automatic login 2018-05-14 15:38:28 -07:00
Noah Levitt
914289b414 WIP documentation! 2018-05-14 15:38:28 -07:00
Noah Levitt
a1af18230c
Merge pull request #103 from internetarchive/ARI-5671
instagram updates
2018-03-23 14:18:04 -07:00
Barbara Miller
426ca48554 less is more 2018-03-23 14:17:22 -07:00
Barbara Miller
51977908ec uncomment; now tested 2018-03-20 10:39:14 -07:00
Barbara Miller
9e871a9f81 instagram umbraBehavior & vanishing elem fix 2018-03-20 10:22:55 -07:00
Noah Levitt
6aa8af9d80
Merge pull request #101 from galgeek/ARI-5617
repeatSameElement, firstMatchOnly, configurable interval timing, for ARI-5617
2018-03-19 16:36:52 -07:00
Barbara Miller
1e2e7213c8 better booleans for umbraBehavior 2018-03-19 16:31:23 -07:00
Barbara Miller
bc5a36e8a3 better booleans 2018-03-19 16:28:47 -07:00
Barbara Miller
745e6cc942 log behavior params better 2018-03-19 16:28:14 -07:00
Barbara Miller
ae6f72769a better config names 2018-03-19 16:02:07 -07:00
Barbara Miller
74fc7cd102 update behaviors.yaml 2018-03-19 14:44:29 -07:00
Barbara Miller
cc207763d5 add onceOnly config; other tweaks 2018-03-19 14:44:29 -07:00
Barbara Miller
8f861389ba americaspresidents.si.edu/gallery behavior 2018-03-19 14:44:29 -07:00
Barbara Miller
5dfb081bb4 skipIDcheck, default false / no / 0 2018-03-19 14:44:29 -07:00
Barbara Miller
8f12f0b0c0 better idCheck and configurable interval timing 2018-03-19 14:44:04 -07:00
Barbara Miller
c31f13e47f add idCheck feature, default: true 2018-03-19 14:44:04 -07:00
Noah Levitt
8e273b2e6b
Merge pull request #100 from nlevitt/max-claimed-sites
reimplement max_claimed_sites
2018-03-15 15:05:46 -07:00
Noah Levitt
dc00f5de32 reimplement max_claimed_sites
Other approach was too slow and caused db contention.
New approach avoids (slow) rethinkdb join by copying max_claimed_sites job
parameter to each of the job's sites. Uses rethinkdb fold() to count
claimed sites and enforce max_claimed_sites within a single query.
2018-03-15 12:57:49 -07:00
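Note on dc00f5de32: the idea of counting claimed sites per job and enforcing the cap in a single pass can be sketched in plain Python, with `functools.reduce` standing in for rethinkdb's fold(). This is not the query from the commit. The function name `pick_claimable` and the field names other than max_claimed_sites are assumptions, and each site is assumed to carry its own copy of the job's cap.

```python
from functools import reduce


def pick_claimable(sites, limit):
    """Pick up to `limit` unclaimed site ids without exceeding any job's cap.

    Hypothetical sketch: each site dict has "id", "job_id", "claimed", and
    optionally "max_claimed_sites" copied from its job.
    """
    # Claimed sites go first so their counts are known before any
    # unclaimed site is considered (sorted() is stable).
    ordered = sorted(sites, key=lambda s: not s["claimed"])

    def step(acc, site):
        counts, picked = acc
        job = site["job_id"]
        n = counts.get(job, 0)
        cap = site.get("max_claimed_sites")
        if site["claimed"]:
            # Already claimed: count it against the job, pick nothing.
            return {**counts, job: n + 1}, picked
        if len(picked) < limit and (cap is None or n < cap):
            # Under the cap: claim it and count it.
            return {**counts, job: n + 1}, picked + [site["id"]]
        return counts, picked

    _, picked = reduce(step, ordered, ({}, []))
    return picked
```

Because the accumulator carries both the per-job counts and the picks, no join against a jobs table is needed, which is the point the commit message makes.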
Noah Levitt
55701ae373 bump version number after merge 2018-03-08 16:49:28 -08:00
jkafader
7d61673d3e
Merge pull request #97 from nlevitt/max-claimed-sites
Max claimed sites
2018-03-08 16:48:31 -08:00
Noah Levitt
4daac3dfc5 fix timely time limit enforcement
by including current brozzling session duration in time accounting
2018-03-05 17:05:41 -08:00
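Note on 4daac3dfc5: the fix described is to add the duration of the session in progress to the time already recorded before comparing against the limit. A minimal sketch of that arithmetic follows. The function and parameter names are hypothetical, not taken from the commit.

```python
import time


def time_limit_reached(previous_elapsed, session_start, time_limit, now=None):
    """Return True once recorded time plus the current session meets the limit.

    Hypothetical sketch: all arguments are in seconds, and `time_limit`
    of None means no limit.
    """
    if time_limit is None:
        return False
    now = time.time() if now is None else now
    # Include the session that is still running, not only finished ones.
    total = previous_elapsed + (now - session_start)
    return total >= time_limit
```

Without the `now - session_start` term, a long session would overrun the limit until it finished and its time was recorded.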
Noah Levitt
318ae13bcb honor stop request before choosing proxy
makes test_warcprox_outage_resiliency pass again
2018-03-05 16:08:24 -08:00
Noah Levitt
a914fb8461
Merge pull request #99 from vbanos/chromium-single-process
Use single process model for chromium-browser
2018-03-05 12:06:20 -08:00
Vangelis Banos
171ce8d854 Use single process model for chromium-browser
By default chromium creates multiple renderer processes (each running
multiple threads) for each instance of a site the user visits. What we
see from `ps auxcf` output is the following:
```
\_ chromium-browse
  \_ chromium-browse
  |   \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
```

Using the `--single-process` option, we run all renderers in the same
process, saving the overhead of running multiple processes. `ps auxcf`
output is the following:

```
\_ chromium-browse
  \_ chromium-browse
    \_ chromium-browse
```

Performance is improved a bit and I guess that using this in large scale
Brozzler deployments will have even better performance effects.

The potential problem of `--single-process` is stability (if a renderer
crashes, the whole browser also crashes) but since we use very short-lived
instances of chromium, we don't worry about this.

Details on chromium process models:
https://www.chromium.org/developers/design-documents/process-models
2018-03-04 20:48:29 +00:00
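Note on 171ce8d854: a minimal sketch of building a chromium command line with `--single-process` follows. The function name `chromium_command` and its parameters are hypothetical, and this is not the launcher code from the commit. The other two flags shown are standard chromium switches.

```python
def chromium_command(executable, port, user_data_dir, single_process=True):
    """Build an argv list for launching chromium.

    Hypothetical sketch: the function and its parameters are assumed for
    illustration.
    """
    cmd = [
        executable,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]
    if single_process:
        # Run all renderers inside the browser process.
        cmd.append("--single-process")
    cmd.append("about:blank")
    return cmd
```

The resulting list can be passed to `subprocess.Popen`. Keeping the flag behind a parameter makes it easy to switch back if the stability trade-off mentioned in the commit message becomes a problem.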
Noah Levitt
2639d7b991 fix query to make tests pass? 2018-03-02 16:30:35 -08:00
Noah Levitt
f9834ca77d bump after merge 2018-03-02 11:51:50 -08:00
Noah Levitt
a0710b605c
Merge pull request #96 from vbanos/jinja2-auto-reload
Disable Jinja2 template auto_reload for higher performance
2018-03-02 11:51:11 -08:00
Noah Levitt
f26d711a89 new job setting max_claimed_sites
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs. For AITFIVE-1578.
2018-03-01 17:17:54 -08:00
Noah Levitt
d7512fbeb6 move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00