1587 commits

Author SHA1 Message Date
Barbara Miller
510bfa36f7 log chrome_msg URLs 2018-03-16 15:02:35 -07:00
Barbara Miller
81983d1695 use custom timeout always 2018-03-16 15:02:35 -07:00
Barbara Miller
d7b695613e check for None 2018-03-16 15:02:35 -07:00
Barbara Miller
3d615aed16 behavior_timeout_custom (not timeout_from_behavior) 2018-03-16 15:02:35 -07:00
Barbara Miller
2cc0eac5a6 configurable behavior timeout 2018-03-16 15:02:32 -07:00
Noah Levitt
9995754b60 Merge branch 'master' into qa
* master:
  reimplement max_claimed_sites
2018-03-15 15:07:08 -07:00
Noah Levitt
8e273b2e6b
Merge pull request #100 from nlevitt/max-claimed-sites
reimplement max_claimed_sites
2018-03-15 15:05:46 -07:00
Noah Levitt
dc00f5de32 reimplement max_claimed_sites
Other approach was too slow and caused db contention.
New approach avoids the (slow) rethinkdb join by copying the max_claimed_sites
job parameter to each of the job's sites. Uses rethinkdb fold() to count
claimed sites and enforce max_claimed_sites within a single query.
2018-03-15 12:57:49 -07:00
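
The fold-based counting described in this commit can be illustrated with a plain-Python analogue. This is only a sketch, not the actual rethinkdb query: `functools.reduce` stands in for rethinkdb's `fold()`, and the function name, the `job_id`, `claimed`, and `id` fields, and the `limit` parameter are all assumptions made for illustration.

```python
from functools import reduce


def claimable_sites(sites, limit):
    """Pick up to `limit` unclaimed sites without exceeding any job's cap.

    Each site dict carries its own copy of the job's max_claimed_sites,
    so no join against a jobs table is needed (field names are hypothetical).
    """
    # Visit already-claimed sites first so each job's running count
    # is complete before any unclaimed site of that job is considered.
    ordered = sorted(sites, key=lambda s: not s["claimed"])

    def step(acc, site):
        claimed_per_job, chosen = acc
        job = site["job_id"]
        count = claimed_per_job.get(job, 0)
        cap = site.get("max_claimed_sites")
        if site["claimed"]:
            claimed_per_job[job] = count + 1
        elif len(chosen) < limit and (cap is None or count < cap):
            claimed_per_job[job] = count + 1
            chosen.append(site["id"])
        return claimed_per_job, chosen

    _, chosen = reduce(step, ordered, ({}, []))
    return chosen
```

The accumulator carries both the per-job counts and the sites chosen so far, which is what lets counting and enforcement happen in a single pass.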
Barbara Miller
7713f0eb69 Merge branch 'ARI-5617' into qa 2018-03-14 13:13:26 -07:00
Barbara Miller
3cd06910da update behaviors.yaml 2018-03-14 13:05:49 -07:00
Barbara Miller
0f2f16e09f add onceOnly config; other tweaks 2018-03-14 13:05:10 -07:00
Barbara Miller
f7a655985b Merge branch 'ARI-5617' into qa 2018-03-13 18:49:00 -07:00
Barbara Miller
da057e93e0 amerciaspresidents.si.edu/gallery behavior 2018-03-13 18:47:32 -07:00
Barbara Miller
e778789370 skipIDcheck 2018-03-13 18:46:16 -07:00
Barbara Miller
409017c47f noIDcheck instead 2018-03-13 17:52:06 -07:00
Barbara Miller
cd3fa154d9 better idCheck and configurable interval timing 2018-03-13 17:36:38 -07:00
Barbara Miller
d371db8166 Merge branch 'ARI-5617' into qa 2018-03-13 16:34:21 -07:00
Barbara Miller
9353d2ed60 fix merge error 2018-03-13 16:33:01 -07:00
Barbara Miller
037e4193ab add idCheck feature, default: true 2018-03-13 16:29:42 -07:00
Noah Levitt
55701ae373 bump version number after merge 2018-03-08 16:49:28 -08:00
jkafader
7d61673d3e
Merge pull request #97 from nlevitt/max-claimed-sites
Max claimed sites
2018-03-08 16:48:31 -08:00
Noah Levitt
0c9ebcff6e Merge branch 'max-claimed-sites' into qa
* max-claimed-sites:
  fix timely time limit enforcement
  honor stop request before choosing proxy
  fix query to make tests pass?
2018-03-05 17:11:10 -08:00
Noah Levitt
4daac3dfc5 fix timely time limit enforcement
by including current brozzling session duration in time accounting
2018-03-05 17:05:41 -08:00
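
A minimal sketch of the accounting change described in this commit: the time spent in the session currently in progress is added to the time already recorded before comparing against the limit. The function and parameter names are hypothetical, not taken from the brozzler source.

```python
import time


def time_limit_reached(accumulated_secs, session_start, time_limit, now=None):
    """Return True once recorded time plus the current session hits the limit.

    accumulated_secs: time recorded for previously finished sessions
    session_start:    timestamp at which the current session began
    time_limit:       limit in seconds, or None for no limit
    """
    if time_limit is None:
        return False
    now = time.time() if now is None else now
    current_session_secs = now - session_start
    return accumulated_secs + current_session_secs >= time_limit
```

Without the `current_session_secs` term, a site would only be found over its limit after the current session had already ended.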
Noah Levitt
318ae13bcb honor stop request before choosing proxy
makes test_warcprox_outage_resiliency pass again
2018-03-05 16:08:24 -08:00
Noah Levitt
a914fb8461
Merge pull request #99 from vbanos/chromium-single-process
Use single process model for chromium-browser
2018-03-05 12:06:20 -08:00
Vangelis Banos
171ce8d854 Use single process model for chromium-browser
By default chromium creates multiple renderer processes (each running
multiple threads) for each instance of a site the user visits. What we
see from `ps auxcf` output is the following:
```
\_ chromium-browse
  \_ chromium-browse
  |   \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
  |       \_ chromium-browse
```

Using the `--single-process` option, we run all renderers in the same
process, saving the overhead of running multiple processes. `ps auxcf`
output is the following:

```
\_ chromium-browse
  \_ chromium-browse
    \_ chromium-browse
```

Performance is improved a bit and I guess that using this in large-scale
Brozzler deployments will have an even bigger performance effect.

The potential problem of `--single-process` is stability (if a renderer
crashes, the whole browser also crashes) but since we use very short-lived
instances of chromium, we don't worry about this.

Details on chromium process models:
https://www.chromium.org/developers/design-documents/process-models
2018-03-04 20:48:29 +00:00
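
As a rough illustration of the commit above, the option can be appended when the browser command line is assembled. The helper name and its parameters are assumptions for this sketch; only `--single-process` comes from the commit message, and `--remote-debugging-port` is a standard chromium switch.

```python
def chromium_command(executable, port, single_process=True):
    """Build an argv list for launching chromium (hypothetical helper)."""
    cmd = [executable, "--remote-debugging-port=%d" % port]
    if single_process:
        # Run all renderers inside the browser process. A renderer crash
        # then takes the whole browser down, which is tolerable when the
        # browser instances are short-lived.
        cmd.append("--single-process")
    return cmd
```

Keeping the flag behind a parameter makes it easy to fall back to the default multi-process model if stability becomes a concern.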
Noah Levitt
2639d7b991 fix query to make tests pass? 2018-03-02 16:30:35 -08:00
Noah Levitt
f9834ca77d bump after merge 2018-03-02 11:51:50 -08:00
Noah Levitt
a0710b605c
Merge pull request #96 from vbanos/jinja2-auto-reload
Disable Jinja2 template auto_reload for higher performance
2018-03-02 11:51:11 -08:00
Noah Levitt
5a0700b297 Merge branch 'max-claimed-sites' into qa
* max-claimed-sites:
  new job setting max_claimed_sites
  move time limit enforcement
  Invalid syntax in WebsockReceiverThread._javascript_dialog_open
  Make Browser._wait_for sleep time a variable
  Send more compact JSON to browser
  Remove google safebrowsing flags
  try to get chromium 64? (#92)
  Add chromium CLI flags to improve capture performance
2018-03-01 17:20:17 -08:00
Noah Levitt
f26d711a89 new job setting max_claimed_sites
Puts a cap on the number of sites belonging to a given job that can be brozzled
simultaneously across the cluster. Addresses the problem of a job with many
seeds starving out other jobs. For AITFIVE-1578.
2018-03-01 17:17:54 -08:00
Noah Levitt
d7512fbeb6 move time limit enforcement
now it's next to stop request enforcement which makes more sense and
supports more timely action
2018-03-01 11:28:30 -08:00
Vangelis Banos
ce473897a3 Disable Jinja2 template auto_reload for higher performance
Every time we run a JS behavior, we load a Jinja2 template.
By default, Jinja2 has option `auto_reload=True`. This means that
every time a template is requested, the loader checks whether the source file
has changed and, if so, reloads the template. For higher performance it’s
possible to disable that.

Also note that Jinja caches 400 templates by default.

Ref: http://jinja.pocoo.org/docs/2.10/api/

In Brozzler, we don't make changes to JS templates while the system is
running. So, there is no point in having auto_reload=True.
2018-02-25 20:24:25 +00:00
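
A small sketch of the setting this commit describes, using the public `jinja2.Environment` constructor. The in-memory `DictLoader`, the template name, and the template body are made up for illustration and do not reflect how brozzler loads its behavior templates.

```python
import jinja2

# Hypothetical one-line JS behavior template held in memory.
templates = {"behavior.js.j2": "click('{{ selector }}');"}

env = jinja2.Environment(
    loader=jinja2.DictLoader(templates),
    auto_reload=False,  # skip the "has the source changed?" check on each lookup
)

script = env.get_template("behavior.js.j2").render(selector="button.more")
```

With `auto_reload=False`, a template that is already in the environment's cache is returned without asking the loader whether it is still up to date.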
Noah Levitt
b438cdd33e
Merge pull request #94 from vbanos/json-compact
Send more compact JSON to browser
2018-02-21 09:53:16 -08:00
Vangelis Banos
646faa8ab0 Invalid syntax in WebsockReceiverThread._javascript_dialog_open
Fix `)` position
2018-02-21 07:34:36 +00:00
Noah Levitt
eda5133301
Merge pull request #95 from vbanos/configurable-wait-interval
Make Browser._wait_for sleep time a variable
2018-02-20 15:05:34 -08:00
Vangelis Banos
e2128b42f0 Make Browser._wait_for sleep time a variable
Useful to be able to tweak this value in other apps using `Browser`.
2018-02-18 23:08:51 +00:00
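
The idea of a tunable polling interval can be sketched as follows. This is not the `Browser._wait_for` implementation; the class name, the `sleep_interval` attribute, and the use of the built-in `TimeoutError` are assumptions for illustration.

```python
import time


class PollingWaiter:
    """Polls a callback until it returns a truthy value or a timeout expires."""

    def __init__(self, sleep_interval=0.5):
        # Hypothetical attribute: callers can lower it for snappier
        # polling or raise it to reduce busy work.
        self.sleep_interval = sleep_interval

    def wait_for(self, callback, timeout):
        start = time.monotonic()
        while True:
            if callback():
                return
            elapsed = time.monotonic() - start
            if elapsed > timeout:
                raise TimeoutError("timed out after %.1fs" % elapsed)
            time.sleep(self.sleep_interval)
```

An application embedding such a class could set `sleep_interval` once at construction instead of patching a hard-coded constant.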
Vangelis Banos
d6c707d941 Send more compact JSON to browser
Use JSON separators without spaces to reduce json size.
It's already used elsewhere in Brozzler but not here.
2018-02-18 19:03:36 +00:00
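
The effect of space-free separators can be seen with the standard `json` module. The message below is an invented example in the shape of a DevTools protocol request, not a payload taken from brozzler.

```python
import json

# Illustrative message only; the id and URL are made up.
msg = {"id": 1, "method": "Page.navigate",
       "params": {"url": "https://example.com/"}}

default = json.dumps(msg)                         # ", " and ": " separators
compact = json.dumps(msg, separators=(",", ":"))  # no whitespace at all

print(len(default), len(compact))
```

Both strings decode to the same object; the compact form just drops the whitespace after every comma and colon.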
Noah Levitt
0d605d0a88
Merge pull request #90 from vbanos/chrome-flags-performance
Add chromium CLI flags to improve capture performance
2018-02-15 10:54:34 -08:00
Vangelis Banos
970e2bd661 Remove google safebrowsing flags
Global Wayback policy is to archive everything, so it's best to avoid
disabling these flags.
2018-02-15 13:35:24 +00:00
Noah Levitt
9e4737ee0a
try to get chromium 64? (#92)
chromium 64 for travis-ci
2018-02-14 13:43:53 -08:00
Noah Levitt
3d12daea06 Merge branch 'master' into qa
* master:
  bump up timeout waiting for websocket connection
  try taking screenshot 3 times, proceed on failure
2018-02-14 12:33:53 -08:00
Noah Levitt
f8c41c5e8d bump up timeout waiting for websocket connection
We've been seeing some of this:

2018-02-14 20:16:44,011 17816 CRITICAL BrozzlingThread:36444 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 447, in _browse_page
    cookie_db=site.get('cookie_db'))
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 338, in start
    self._wait_for(lambda: self.websock_thread.is_open, timeout=10)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 11.1s waiting for: <function Browser.start.<locals>.<lambda> at 0x7fb2dc772bd8>

Mostly at startup. Now that brozzler claims sites in batches for
brozzling, we have situations where we start up a whole bunch of
browsers at the same time. That's probably why in some cases they are
slow to establish the websocket connection.
2018-02-14 12:29:51 -08:00
Noah Levitt
b38fbdcda6 try taking screenshot 3 times, proceed on failure
We've been seeing a lot of this:

2018-02-14 20:06:01,472 13286 CRITICAL BrozzlingThread:44789 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:559) unexpected exception
Traceback (most recent call last):
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 528, in brozzle_site
    enable_youtube_dl=not self._skip_youtube_dl)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 385, in brozzle_page
    on_request)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/worker.py", line 459, in _browse_page
    behavior_timeout=self._behavior_timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 463, in browse_page
    jpeg_bytes = self.screenshot()
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 565, in screenshot
    timeout=timeout)
  File "/opt/brozzler-ve3/lib/python3.5/site-packages/brozzler/browser.py", line 311, in _wait_for
    elapsed, callback))
brozzler.browser.BrowsingTimeout: timed out after 90.5s waiting for: <function Browser.screenshot.<locals>.<lambda> at 0x7f5ab0076a68>

Browser bug, maybe? To work around it, reduce timeout to 45 seconds, try
getting screenshot 3 times, and if it fails proceed anyway, don't queue
the page for recrawling.
2018-02-14 12:15:48 -08:00
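
The workaround described in this commit (shorter timeout, three attempts, carry on if all fail) might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, `take_screenshot` stands for any callable accepting a `timeout` keyword, and the built-in `TimeoutError` stands in for brozzler's `BrowsingTimeout`.

```python
import logging


def screenshot_with_retries(take_screenshot, attempts=3, timeout=45):
    """Try to take a screenshot a few times; return None if every try times out."""
    for attempt in range(1, attempts + 1):
        try:
            return take_screenshot(timeout=timeout)
        except TimeoutError:
            logging.warning(
                "screenshot attempt %d of %d timed out after %ss",
                attempt, attempts, timeout)
    # Give up on the screenshot but let the caller continue with the page
    # rather than treating the whole page as failed.
    return None
```

Returning `None` instead of raising leaves the decision to the caller, which can record the page as brozzled without a screenshot.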
Noah Levitt
df0717d072 Merge branch 'master' into qa
* master:
  fix attempt for deadlock-ish situation
  fix unclosed file warnings when running python in debug mode
  give vagrant vm a name in virtualbox
  add note to readme about browser version
  check browser version at startup
  Reinstate logging
  Fix typo and block legacy google-analytics.com/ga.js
  Use Network.setBlockedUrls instead of Debugger to block URLs
  bump dev version after PR merge
  back to dev version number
  commit for beta release
  this should fix travis build?
  fix tests
  update brozzler-easy for current warcprox api
  claim sites to brozzle in batches to reduce contention over sites table
  lengthen site session brozzling time to 15 minutes
  fix needs_browsing check
  new test test_needs_browsing
  increase timeout waiting for screenshot
  Use TCP_NODELAY in websocket connection to improve performance
2018-02-13 17:10:10 -08:00
Noah Levitt
0faeaab3ac fix attempt for deadlock-ish situation
see https://github.com/internetarchive/brozzler/issues/91
2018-02-13 17:09:28 -08:00
Noah Levitt
6086bfe4b4 fix unclosed file warnings when running python in debug mode 2018-02-13 17:07:40 -08:00
Noah Levitt
56e01b9078 give vagrant vm a name in virtualbox 2018-02-13 17:05:45 -08:00
Vangelis Banos
dffd9504af Add chromium CLI flags to improve capture performance
``--disable-background-timer-throttling`` and ``--disable-renderer-backgrounding``:
karma JS test runner uses these to improve chrome performance
https://github.com/karma-runner/karma-chrome-launcher/issues/123

``--disable-hang-monitor``: Suppresses hang monitor dialogs in renderer
processes. This may allow slow unload handlers on a page to prevent the
tab from closing, but the Task Manager can be used to terminate the
offending process in this case.

``--mute-audio``: obvious.

The following are part of google safe browsing features:
``--disable-client-side-phishing-detection``
``--safebrowsing-disable-auto-update``
``--safebrowsing-disable-download-protection``

Reference: https://peter.sh/experiments/chromium-command-line-switches/
2018-02-13 12:32:39 +00:00
Barbara Miller
cbc552e88d Merge branch 'ARI-5517' into qa 2018-02-12 16:14:27 -08:00