1009 Commits

Author SHA1 Message Date
Noah Levitt
ac735639ff incorporate urlcanon fix 2018-05-16 14:41:49 -07:00
Noah Levitt
338d2e48f9 update warcprox dependency to include recent fixes 2018-05-16 14:26:51 -07:00
Noah Levitt
b9b8dcd062 backward compatibility for old scope["surt"]
and make sure to store ssurt as string in rethinkdb
2018-05-16 14:19:23 -07:00
Noah Levitt
1572fd3ed6 missed a spot where is_permitted_by_robots needs monkeying 2018-05-15 16:52:48 -07:00
Noah Levitt
de1f240e25 describe scope rule conditions
plus a bunch of tweaks and fixes
2018-05-15 11:01:09 -07:00
Noah Levitt
a327cb626f more explication of scoping 2018-05-14 17:31:45 -07:00
Noah Levitt
2cf474aa1d update docs to match new seed ssurt behavior 2018-05-14 16:59:55 -07:00
Noah Levitt
fc05cac338 ok seriously tests 2018-05-14 15:38:28 -07:00
Noah Levitt
05f8ab3495 fix more tests for new approach sans scope['surt'] 2018-05-14 15:38:28 -07:00
Noah Levitt
85a4757527 s/max_hops_off_surt/max_hops_off/ 2018-05-14 15:38:28 -07:00
Noah Levitt
5ebd2fb709 new test of max_hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
b83d3cb9df rename page.hops_off_surt to page.hops_off 2018-05-14 15:38:28 -07:00
Noah Levitt
60f2b99cc0 doublethink had a bug fix 2018-05-14 15:38:28 -07:00
Noah Levitt
526a4d718f tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
245e27a21a tests for new approach without scope['surt']
replaced by an accept rule (two rules in some cases of seed redirects)
2018-05-14 15:38:28 -07:00
Noah Levitt
f26712ce93 WIP add an accept rule instead of modifying surt
in place for seed redirects
2018-05-14 15:38:28 -07:00
Noah Levitt
98ce67ef36 WIP some words on scoping 2018-05-14 15:38:28 -07:00
Noah Levitt
88214236bb WIP starting to flesh out "scoping" section 2018-05-14 15:38:28 -07:00
Noah Levitt
6df2c1cf22 WIP some explanation of automatic login 2018-05-14 15:38:28 -07:00
Noah Levitt
914289b414 WIP documentation! 2018-05-14 15:38:28 -07:00
Noah Levitt
a1af18230c
Merge pull request #103 from internetarchive/ARI-5671
instagram updates
2018-03-23 14:18:04 -07:00
Barbara Miller
426ca48554 less is more 2018-03-23 14:17:22 -07:00
Barbara Miller
51977908ec uncomment; now tested 2018-03-20 10:39:14 -07:00
Barbara Miller
9e871a9f81 instagram umbraBehavior & vanishing elem fix 2018-03-20 10:22:55 -07:00
Noah Levitt
6aa8af9d80
Merge pull request #101 from galgeek/ARI-5617
repeatSameElement, firstMatchOnly, configurable interval timing, for ARI-5617
2018-03-19 16:36:52 -07:00
Barbara Miller
1e2e7213c8 better booleans for umbraBehavior 2018-03-19 16:31:23 -07:00
Barbara Miller
bc5a36e8a3 better booleans 2018-03-19 16:28:47 -07:00
Barbara Miller
745e6cc942 log behavior params better 2018-03-19 16:28:14 -07:00
Barbara Miller
ae6f72769a better config names 2018-03-19 16:02:07 -07:00
Barbara Miller
74fc7cd102 update behaviors.yaml 2018-03-19 14:44:29 -07:00
Barbara Miller
cc207763d5 add onceOnly config; other tweaks 2018-03-19 14:44:29 -07:00
Barbara Miller
8f861389ba amerciaspresidents.si.edu/gallery behavior 2018-03-19 14:44:29 -07:00
Barbara Miller
5dfb081bb4 skipIDcheck, default false / no / 0 2018-03-19 14:44:29 -07:00
Barbara Miller
8f12f0b0c0 better idCheck and configurable interval timing 2018-03-19 14:44:04 -07:00
Barbara Miller
c31f13e47f add idCheck feature, default: true 2018-03-19 14:44:04 -07:00
Noah Levitt
8e273b2e6b
Merge pull request #100 from nlevitt/max-claimed-sites
reimplement max_claimed_sites
2018-03-15 15:05:46 -07:00
Noah Levitt
dc00f5de32 reimplement max_claimed_sites
Other approach was too slow and caused db contention.
New approach avoids (slow) rethinkdb join by copying max_claimed_sites job
parameter to each of the job's sites. Uses rethinkdb fold() to count
claimed sites and enforce max_claimed_sites within a single query.
2018-03-15 12:57:49 -07:00
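A rough illustration of the counting idea described in dc00f5de32, written in plain Python rather than as a rethinkdb query. Here `functools.reduce` stands in for rethinkdb's fold(). The function name `claimable_sites` and the fields `id`, `job_id`, and `claimed` are hypothetical; only `max_claimed_sites` comes from the commit message, and the actual query is not shown in this log.

```python
from functools import reduce

def claimable_sites(sites, n):
    """Pick up to n unclaimed sites without exceeding any job's cap.

    Each site dict carries its own copy of the job's max_claimed_sites,
    so no join against a jobs table is needed. Field names are
    illustrative assumptions, not brozzler's schema.
    """
    def step(acc, site):
        counts, picked = acc
        job = site["job_id"]
        if site["claimed"]:
            # already-claimed sites count toward their job's cap
            counts[job] = counts.get(job, 0) + 1
        elif len(picked) < n:
            cap = site.get("max_claimed_sites")
            if cap is None or counts.get(job, 0) < cap:
                counts[job] = counts.get(job, 0) + 1
                picked.append(site["id"])
        return counts, picked

    # visit claimed sites first so per-job counts are known before any
    # unclaimed site of the same job is considered
    ordered = sorted(sites, key=lambda s: not s["claimed"])
    _, picked = reduce(step, ordered, ({}, []))
    return picked

sites = [
    {"id": "a1", "job_id": "A", "claimed": True, "max_claimed_sites": 2},
    {"id": "a2", "job_id": "A", "claimed": False, "max_claimed_sites": 2},
    {"id": "a3", "job_id": "A", "claimed": False, "max_claimed_sites": 2},
    {"id": "b1", "job_id": "B", "claimed": False, "max_claimed_sites": None},
]
print(claimable_sites(sites, 3))  # ['a2', 'b1']
```

Job A already has one claimed site and a cap of two, so only one more of its sites is picked. Job B has no cap in this sketch.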
Noah Levitt
55701ae373 bump version number after merge 2018-03-08 16:49:28 -08:00
jkafader
7d61673d3e
Merge pull request #97 from nlevitt/max-claimed-sites
Max claimed sites
2018-03-08 16:48:31 -08:00
Noah Levitt
4daac3dfc5 fix timely time limit enforcement
by including current brozzling session duration in time accounting
2018-03-05 17:05:41 -08:00
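A minimal sketch of the accounting described in 4daac3dfc5: the time spent in the session currently in progress is added to the time from completed sessions before comparing against the limit. The function and parameter names are hypothetical and do not come from the brozzler source.

```python
import time

def time_limit_reached(time_limit, past_sessions_duration, session_start, now=None):
    """Return True if total brozzling time has reached the time limit.

    Includes the duration of the session in progress, not only completed
    sessions. All names here are illustrative assumptions.
    """
    if time_limit is None:
        # no limit configured
        return False
    now = time.time() if now is None else now
    elapsed = past_sessions_duration + (now - session_start)
    return elapsed >= time_limit

# 50s from earlier sessions plus 15s in the current one exceeds a 60s limit
print(time_limit_reached(60, 50, session_start=100, now=115))  # True
```

Without the current-session term, the same site would still appear to have 10 seconds left until the session ended.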
Noah Levitt
318ae13bcb honor stop request before choosing proxy
makes test_warcprox_outage_resiliency pass again
2018-03-05 16:08:24 -08:00
Noah Levitt
a914fb8461
Merge pull request #99 from vbanos/chromium-single-process
Use single process model for chromium-browser
2018-03-05 12:06:20 -08:00
Vangelis Banos
171ce8d854 Use single process model for chromium-browser
By default chromium creates multiple renderer processes (each running
multiple threads) for each instance of a site the user visits. What we
see from `ps auxcf` output is the following:
```
\_ chromium-browse
  \_ chromium-browse
  |   \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
```

Using the `--single-process` option, we run all renderers in the same
process, saving the overhead of running multiple processes. `ps auxcf`
output is the following:

```
\_ chromium-browse
  \_ chromium-browse
    \_ chromium-browse
```

Performance is improved a bit and I guess that using this in large-scale
Brozzler deployments will have even better performance effects.

The potential problem of `--single-process` is stability (if a renderer
crashes, the whole browser also crashes) but since we use very short-lived
instances of chromium, we don't worry about this.

Details on chromium process models:
https://www.chromium.org/developers/design-documents/process-models
2018-03-04 20:48:29 +00:00
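As a sketch of what 171ce8d854 describes, the flag can be appended when the browser command line is assembled. The helper `chromium_argv` and its parameters are hypothetical; how brozzler actually builds its command line is not shown in this log. `--single-process` and `--remote-debugging-port` are existing chromium switches.

```python
def chromium_argv(chrome_exe, port, single_process=True):
    """Build an argument list for launching chromium.

    Hypothetical helper for illustration only.
    """
    argv = [chrome_exe, "--remote-debugging-port=%d" % port]
    if single_process:
        # run all renderers inside the browser process
        argv.append("--single-process")
    return argv

print(chromium_argv("chromium-browser", 9222))
```

The resulting list could be passed to `subprocess.Popen`. Keeping the flag behind a parameter makes it easy to fall back to the default multi-process model if renderer crashes become a problem.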
Noah Levitt
2639d7b991 fix query to make tests pass? 2018-03-02 16:30:35 -08:00
Noah Levitt
f9834ca77d bump after merge 2018-03-02 11:51:50 -08:00
Noah Levitt
a0710b605c
Merge pull request #96 from vbanos/jinja2-auto-reload
Disable Jinja2 template auto_reload for higher performance
2018-03-02 11:51:11 -08:00
Noah Levitt
f26d711a89 new job setting max_claimed_sites
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs. For AITFIVE-1578.
2018-03-01 17:17:54 -08:00
Noah Levitt
d7512fbeb6 move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Vangelis Banos
ce473897a3 Disable Jinja2 template auto_reload for higher performance
Every time we run a JS behavior, we load a Jinja2 template.
By default, Jinja2 has option `auto_reload=True`. This means that
every time a template is requested, the loader checks whether the source file
has changed and, if so, reloads the template. For higher performance it’s
possible to disable that.

Also note that Jinja caches 400 templates by default.

Ref: http://jinja.pocoo.org/docs/2.10/api/

In Brozzler, we don't make changes to JS templates while the system is
running. So, there is no point in having auto_reload=True.
2018-02-25 20:24:25 +00:00
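The effect of the setting discussed in ce473897a3 can be seen in a small experiment. This is a sketch, not brozzler's code: the template name, its contents, and the `render` helper are made up, and a `DictLoader` is used in place of template files so the example is self-contained. `auto_reload` is a real `jinja2.Environment` option.

```python
from jinja2 import DictLoader, Environment

# hypothetical template standing in for a JS behavior template
source = {"behavior.js": "click('{{ selector }}');"}

cached_env = Environment(loader=DictLoader(source), auto_reload=False)
reloading_env = Environment(loader=DictLoader(source), auto_reload=True)

def render(env):
    return env.get_template("behavior.js").render(selector="a.more")

before = (render(cached_env), render(reloading_env))

# change the template source after both environments have cached it
source["behavior.js"] = "scroll('{{ selector }}');"

after = (render(cached_env), render(reloading_env))
print(before)
print(after)
```

With `auto_reload=False` the cached template is returned without an up-to-date check, so the first environment keeps rendering the old source. That is acceptable when templates never change while the process is running, which is the situation the commit message describes.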
Noah Levitt
b438cdd33e
Merge pull request #94 from vbanos/json-compact
Send more compact JSON to browser
2018-02-21 09:53:16 -08:00