brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-24 08:39:59 -05:00

Author	SHA1	Message	Date
Vangelis Banos	34d8f87fb5	Add option to capture full page screenshot Add option `full_page` to `Browser.screenshot`. The default behavior remains the same. We get inspiration from puppeteer to capture a screenshot of the full page: https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898 Add option `screenshot_full_page=False` to `Browser.browse_page` to use the new feature when capturing a page.	2019-10-08 10:55:10 +00:00
Noah Levitt	464562461c	Merge pull request #170 from danielbicho/master Add option to specify port and interface binding on brozzler-dashboard	2019-10-03 14:43:53 -07:00
Daniel Bicho	4feede08e4	Add option to specify port and interface binding on brozzler-dashboard	2019-10-03 15:20:03 +01:00
Noah Levitt	8a51f28c3d	fix dishonest travis badge	2019-10-02 15:02:56 -07:00
Corentin Barreau	f5ed91de6e	Replace facebook.js with behaviors.yaml	2019-09-27 21:57:35 +02:00
Noah Levitt	85e6027838	bump version after merge	2019-09-27 10:40:59 -07:00
Noah Levitt	996070b35c	Merge pull request #167 from vbanos/console-debug-only Enable Console and Runtime outputs only when debugging	2019-09-27 10:40:17 -07:00
Vangelis Banos	fed5e6b741	Enable Console and Runtime outputs only when debugging When capturing a page, we receive a LOT of messages from chrome. Examining these messages, we see that we can reduce them a bit to speed up Brozzler. We always use `Console.enable` which returns all browser console output. Also, we always use `Runtime.enable`. Doc says: https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable Enables reporting of execution contexts creation by means of executionContextCreated event. When the reporting gets enabled the event will be sent immediately for each existing execution context. These outputs are useful when debugging but not in production. If we disable them, we reduce the websocket traffic and improve performance. With this PR, we enable them only when the current logging level is `DEBUG`. Counting the number of messages before and after the change, we see improvements like: https://www.gnome.org/technologies/ 220 -> 202 messages. https://www.whitehouse.gov/issues/budget-spending/ 203 -> 189 messages	2019-09-27 13:24:06 +00:00
Noah Levitt	7273c7c3a2	Merge pull request #166 from CorentinB/facebook-ads-lib Add support for Facebook ads library and fix closing	2019-09-26 14:13:47 -07:00
Corentin Barreau	e701e3f101	Add: break after closing the first visible element	2019-09-26 21:44:25 +02:00
Corentin Barreau	101f7f2e4a	Remove: useless comment	2019-09-25 19:48:38 +02:00
Corentin Barreau	fb30fb9aa3	Add: isVisible check for close selectors Modify: doTarget - Revert to initial code	2019-09-25 16:19:41 +02:00
Corentin Barreau	5c5743ea11	Fix: closeSelector not being clicked Add: support for facebook.com/ads/library - Open and close metrics for ads	2019-09-25 16:10:59 +02:00
Noah Levitt	efa185a8dc	Merge pull request #160 from vbanos/behavior-timeout More accurate JS behavior timeout	2019-09-24 12:11:37 -07:00
Noah Levitt	eb30ba0c33	Merge pull request #165 from vbanos/stderr-stdout-exception-handling Improve exception handling when reading STDIN/STDERR	2019-09-24 12:03:06 -07:00
Vangelis Banos	f42ff08da1	Improve exception handling when reading STDIN/STDERR When the chrome process dies and we try to read STDIN/STDERR, we get `ValueError: I/O operation on closed file` or `OSError: [Errno 9] Bad file descriptor`. We modify `readline_nonblock` method to return the buffer it read up to this point.	2019-09-19 20:08:55 +00:00
Vangelis Banos	0b28a4a57f	More accurate JS behavior timeout If you use a JS behavior timeout smaller than 7 sec, the JS behavior will always need 7 sec because `sleep(7)` is hard-coded there. We make a minor addition to use `min(timeout, 7)` for sleep so it will finish faster when using a smaller JS behavior timeout.	2019-08-22 21:15:44 +00:00
Noah Levitt	16f886259d	Merge pull request #158 from galgeek/aitfive-1668-soundcoud capture soundcloud user page before capturing tracks	2019-08-15 15:46:55 -07:00
Noah Levitt	94cd6cacb6	bump version after merge	2019-07-18 11:07:27 -07:00
Noah Levitt	726c6effed	Merge pull request #157 from vbanos/block-amp-analytics Block AMP analytics JS script	2019-07-18 11:07:09 -07:00
Barbara Miller	9cc60449d7	skip downloading tracks from soundcloud user page	2019-07-17 17:45:02 -07:00
Vangelis Banos	6bd4fd6532	Block AMP analytics JS script AMP analytics is part of Google analytics. We need to block it for similar reasons. AMP analytics reference: https://developers.google.com/analytics/devguides/collection/amp-analytics/	2019-06-26 21:19:35 +00:00
Noah Levitt	8107abd804	Merge pull request #154 from vbanos/fix-brozzling-test Fix test_brozzling::httpd fixture 1.5.6	2019-05-16 14:23:04 -07:00
Noah Levitt	5fdb2dd39c	documentation tweak	2019-05-16 14:03:43 -07:00
Noah Levitt	aa2d491009	i don't know where pyyaml 5.8 came from	2019-05-16 01:29:05 -07:00
Noah Levitt	42ddfba923	Merge pull request #150 from nlevitt/purge-old Purge old	2019-05-16 00:29:58 -07:00
Noah Levitt	40331f02ba	Merge pull request #153 from vbanos/warn-deprecated logging.warn is deprecated and replaced by logging.warning	2019-05-16 00:27:22 -07:00
Noah Levitt	f8db17ce3d	bump version after merge	2019-05-16 00:22:29 -07:00
Noah Levitt	eb34bebb91	Merge pull request #149 from nlevitt/travis-py37 trying to make this work with xenial for travis	2019-05-16 00:22:08 -07:00
Noah Levitt	c651bcdd18	remove some travis-ci debugging stuff	2019-05-16 00:21:28 -07:00
Noah Levitt	0a1360ab25	don't use localhost for test http server... ... because apparently sometimes chromium bypasses the proxy for local addresses	2019-05-15 18:49:18 -07:00
Noah Levitt	f8165dc02b	work around pytest issue until fix is out https://github.com/pytest-dev/pytest/issues/5257	2019-05-15 18:46:21 -07:00
Vangelis Banos	a1f9122317	Fix test_brozzling::httpd fixture We used `self.headers.getheader` which no longer works. We replace it with `self.headers.get`. We change the code to write binary data to `self.wfile` because we get an exception for writing str and/or None.	2019-05-14 16:29:52 +00:00
Vangelis Banos	a2ac3a0374	logging.warn is deprecated and replaced by logging.warning We replace it everywhere in the code base.	2019-05-14 12:10:59 +00:00
Noah Levitt	ee8ef23f0c	fix mistake in job-conf.rst	2019-04-30 10:49:48 -07:00
Noah Levitt	411b3f266a	bump version after merge	2019-04-09 22:07:51 +00:00
Noah Levitt	d4386491df	Merge pull request #151 from nlevitt/no-cerberus-normalize don't attempt cerberus normalization	2019-04-09 15:06:17 -07:00
Noah Levitt	5385232b40	don't attempt cerberus normalization which encumbers the validation with additional requirements, specifically makes it difficult to validate a subclass of `dict` because it expects a constructor that works like dict.__init__()	2019-04-09 01:45:37 -07:00
Noah Levitt	8dfd92cf7f	fix this utility	2019-04-09 01:44:14 -07:00
Noah Levitt	433b201b52	use logging.warning() to quiet py37 warnings	2019-04-09 01:43:38 -07:00
Noah Levitt	dfd9d9ecdd	omfg	2019-04-04 17:22:15 -07:00
Noah Levitt	fd0fe811e9	so little output from chromium-browser :( https://travis-ci.org/internetarchive/brozzler/jobs/515942434 could it be problems running as this other user?	2019-04-04 16:09:21 -07:00
Noah Levitt	55541be9e9	let's see chromium output inside brozzler-worker using --trace, because chromium seems to be working ok when we just run it	2019-04-04 15:11:24 -07:00
Noah Levitt	58d1d1c429	chromium-browser with no args isn't dying at start what about with all the args?	2019-04-04 14:38:29 -07:00
Noah Levitt	473e891fb4	not sure if --disable-extensions did something	2019-04-04 13:34:45 -07:00
Noah Levitt	6d145c87c8	chromium-browser --disable-extensions ?	2019-04-04 13:24:12 -07:00
Noah Levitt	0d46d8ce19	still trying to figure out what's up with chromium	2019-04-04 13:15:17 -07:00
Noah Levitt	45ac12117a	maybe Xvnc.log will tell us something	2019-04-04 13:09:02 -07:00
Noah Levitt	8303fd3ab3	guessing DISPLAY was the issue here https://travis-ci.org/internetarchive/brozzler/jobs/515882174#L610	2019-04-04 12:50:50 -07:00
Noah Levitt	899794f2da	debug what's going on with chromium in travis see https://travis-ci.org/internetarchive/brozzler/jobs/514858838 (unroll "sudo cat /var/log/brozzler-worker.log") 2019-04-02 20:16:01,792 18595 CRITICAL BrozzlingThread:42073 brozzler.worker.BrozzlerWorker.brozzle_site(worker.py:412) unexpected exception Traceback (most recent call last): File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 379, in brozzle_site enable_youtube_dl=not self._skip_youtube_dl) File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 215, in brozzle_page browser, site, page, on_screenshot, on_request) File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/worker.py", line 292, in _browse_page cookie_db=site.get('cookie_db')) File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/browser.py", line 341, in start self.websock_url = self.chrome.start(**kwargs) File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/chrome.py", line 200, in start return self._websocket_url() File "/opt/brozzler-ve3/lib/python3.6/site-packages/brozzler/chrome.py", line 247, in _websocket_url raise e Exception: chrome process died with status 1	2019-04-04 12:38:46 -07:00
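
Several entries above deal with the chrome process dying (899794f2da, f42ff08da1). Commit f42ff08da1 describes making `readline_nonblock` return whatever it has read so far when the stream raises `ValueError` or `OSError`. The following is a minimal sketch of that idea, not the code from the commit: the signature, the `timeout` parameter, and the use of `select` are assumptions made for illustration, and it assumes a POSIX system and an unbuffered binary file object.

```python
import select


def readline_nonblock(f, timeout=0.5):
    """Read at most one line from f, returning what was read so far.

    Hypothetical sketch: `f` is assumed to be an unbuffered binary file
    object (for example a subprocess pipe), and `timeout` is an
    illustrative parameter, not taken from the commit.
    """
    buf = b""
    try:
        # Keep reading single bytes until a newline arrives or no data
        # becomes readable within `timeout` seconds.
        while (len(buf) == 0 or buf[-1] != 0x0A) and select.select(
            [f], [], [], timeout
        )[0]:
            chunk = f.read(1)
            if not chunk:  # EOF: the writer closed its end
                break
            buf += chunk
    except (ValueError, OSError):
        # A closed file raises ValueError; a bad file descriptor raises
        # OSError. Either way, fall through and return the partial buffer.
        pass
    return buf
```

Returning the partial buffer rather than re-raising lets a caller log the last output the process produced before it died.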

1591 Commits
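
Commit 0b28a4a57f in the log above replaces a hard-coded `sleep(7)` with `min(timeout, 7)` so that a JS behavior with a short timeout finishes sooner. A rough sketch of that pattern follows. The function name `wait_for_behavior`, the `is_finished` callback, and the injectable `sleep` argument are hypothetical and only serve the illustration; they are not brozzler's API.

```python
import time


def wait_for_behavior(is_finished, timeout, max_interval=7, sleep=time.sleep):
    """Poll is_finished() until it returns True or `timeout` seconds pass.

    Hypothetical helper: the polling interval is capped at `max_interval`
    seconds but never exceeds the timeout itself, mirroring the
    min(timeout, 7) change described in the commit message.
    """
    check_interval = min(timeout, max_interval)
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        sleep(check_interval)
        if is_finished():
            return True
    return False
```

With a fixed 7 second sleep, a 2 second timeout would still wait 7 seconds before the first check; with the capped interval, the first check happens after 2 seconds.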