brozzler

mirror of https://github.com/internetarchive/brozzler.git synced 2025-02-23 16:19:49 -05:00

Author	SHA1	Message	Date
Noah Levitt	e23fa68d65	fix bug clobbering own changes to parent_page and some other tweaks (python 3.5+, pytest logging config, ...)	2019-10-17 13:47:54 -07:00
Noah Levitt	ba85917f70	Merge pull request #172 from vbanos/block-more-analytics Block more google-analytics URLs	2019-10-16 10:49:48 -07:00
Vangelis Banos	f23f49108b	Block more google-analytics URLs After analysing capture logs, we see that we didn't block many google-analytics related URLs which are used for web statistics. We add these to the blocked URLs. In addition, we improve existing block rules. We used to block `google-analytics.com/analytics.js` but many sites used some kind of param at the end so these URLs weren't blocked. We add a wildcard at the end of the existing rules to block these cases as well.	2019-10-11 10:45:23 +00:00
Noah Levitt	1bda52d4c9	bump version	2019-10-09 16:28:58 -07:00
Noah Levitt	65c7ccdcff	brozzle-page --screenshot-full-page option	2019-10-09 16:28:26 -07:00
Noah Levitt	e5a3ada349	Merge pull request #171 from vbanos/screenshot-full-screen Add option to capture full page screenshot	2019-10-09 16:27:05 -07:00
Vangelis Banos	ba901e3a99	Fix JPEG thumbnail problems Due to the fact that we run JS behaviors before we capture the screenshot, the browser could be scrolled down in the page. When we don't capture the full page, we may get a screenshot of the bottom part of the page and not the top. To fix that we run `window.scroll(0, 0)` before capturing the screenshot. We change method `BrozzlerWorker.full_and_thumb_jpegs` to `BrozzlerWorker.thumb_jpeg`. That's because we already get a JPEG now from the browser after our changes at `Browser.screenshot`. `thumb_jpeg` only returns a thumbnail now. There is no need to read PNG and convert to JPEG. This means that screenshots will be a bit faster now :)	2019-10-09 13:34:38 +00:00
Vangelis Banos	674da4aa99	Use JPEG quality: 95 for screenshots	2019-10-09 11:57:18 +00:00
Vangelis Banos	544222b021	Moved screenshot code right after run_behavior There were some weird screenshots when invoking `try_screenshot` at the end after `visit_hashtags` and `extract_outlinks`. The screenshot was distorted.	2019-10-09 11:39:32 +00:00
Vangelis Banos	c007cda87e	Capture screenshot after running behaviors This is necessary to load all images before taking the screenshot.	2019-10-09 11:05:58 +00:00
Vangelis Banos	34d8f87fb5	Add option to capture full page screenshot Add option `full_page` to `Browser.screenshot`. The default behavior remains the same. We get inspiration from puppeteer to capture a screenshot of the full page: https://github.com/GoogleChrome/puppeteer/blob/master/lib/Page.js#L898 Add option `screenshot_full_page=False` to `Browser.browse_page` to use the new feature when capturing a page.	2019-10-08 10:55:10 +00:00
Noah Levitt	464562461c	Merge pull request #170 from danielbicho/master Add option to specify port and interface binding on brozzler-dashboard	2019-10-03 14:43:53 -07:00
Daniel Bicho	4feede08e4	Add option to specify port and interface binding on brozzler-dashboard	2019-10-03 15:20:03 +01:00
Noah Levitt	8a51f28c3d	fix dishonest travis badge	2019-10-02 15:02:56 -07:00
Noah Levitt	85e6027838	bump version after merge	2019-09-27 10:40:59 -07:00
Noah Levitt	996070b35c	Merge pull request #167 from vbanos/console-debug-only Enable Console and Runtime outputs only when debugging	2019-09-27 10:40:17 -07:00
Vangelis Banos	fed5e6b741	Enable Console and Runtime outputs only when debugging When capturing a page, we receive a LOT of messages from chrome. Examining these messages, we see that we can reduce them a bit to speed up Brozzler. We always use `Console.enable` which returns all browser console output. Also, we always use `Runtime.enable`. Doc says: https://chromedevtools.github.io/devtools-protocol/1-3/Runtime#method-enable Enables reporting of execution contexts creation by means of executionContextCreated event. When the reporting gets enabled the event will be sent immediately for each existing execution context. These outputs are useful when debugging but not in production. If we disable them, we reduce the websocket traffic and improve performance. With this PR, we enable them only when the current logging level is `DEBUG`. Counting the number of messages before and after the change, we see improvements like: https://www.gnome.org/technologies/ 220 -> 202 messages. https://www.whitehouse.gov/issues/budget-spending/ 203 -> 189 messages	2019-09-27 13:24:06 +00:00
Noah Levitt	7273c7c3a2	Merge pull request #166 from CorentinB/facebook-ads-lib Add support for Facebook ads library and fix closing	2019-09-26 14:13:47 -07:00
Corentin Barreau	e701e3f101	Add: break after closing the first visible element	2019-09-26 21:44:25 +02:00
Corentin Barreau	101f7f2e4a	Remove: useless comment	2019-09-25 19:48:38 +02:00
Corentin Barreau	fb30fb9aa3	Add: isVisible check for close selectors Modify: doTarget - Revert to initial code	2019-09-25 16:19:41 +02:00
Corentin Barreau	5c5743ea11	Fix: closeSelector not being clicked Add: support for facebook.com/ads/library - Open and close metrics for ads	2019-09-25 16:10:59 +02:00
Noah Levitt	efa185a8dc	Merge pull request #160 from vbanos/behavior-timeout More accurate JS behavior timeout	2019-09-24 12:11:37 -07:00
Noah Levitt	eb30ba0c33	Merge pull request #165 from vbanos/stderr-stdout-exception-handling Improve exception handling when reading STDIN/STDERR	2019-09-24 12:03:06 -07:00
Vangelis Banos	f42ff08da1	Improve exception handling when reading STDIN/STDERR When the chrome process dies and we try to read STDIN/STDERR, we get `ValueError: I/O operation on closed file` or `OSError: [Errno 9] Bad file descriptor`. We modify `readline_nonblock` method to return the buffer it read up to this point.	2019-09-19 20:08:55 +00:00
Vangelis Banos	0b28a4a57f	More accurate JS behavior timeout If you use a JS behavior timeout smaller than 7 sec, the JS behavior will always need 7 sec because `sleep(7)` is hard-coded there. We make a minor addition to use `min(timeout, 7)` for sleep so it will finish faster when using a smaller JS behavior timeout.	2019-08-22 21:15:44 +00:00
Noah Levitt	16f886259d	Merge pull request #158 from galgeek/aitfive-1668-soundcoud capture soundcloud user page before capturing tracks	2019-08-15 15:46:55 -07:00
Noah Levitt	94cd6cacb6	bump version after merge	2019-07-18 11:07:27 -07:00
Noah Levitt	726c6effed	Merge pull request #157 from vbanos/block-amp-analytics Block AMP analytics JS script	2019-07-18 11:07:09 -07:00
Barbara Miller	9cc60449d7	skip downloading tracks from soundcloud user page	2019-07-17 17:45:02 -07:00
Vangelis Banos	6bd4fd6532	Block AMP analytics JS script AMP analytics is part of Google analytics. We need to block it for similar reasons. AMP analytics reference: https://developers.google.com/analytics/devguides/collection/amp-analytics/	2019-06-26 21:19:35 +00:00
Noah Levitt	8107abd804	Merge pull request #154 from vbanos/fix-brozzling-test Fix test_brozzling::httpd fixture 1.5.6	2019-05-16 14:23:04 -07:00
Noah Levitt	5fdb2dd39c	documentation tweak	2019-05-16 14:03:43 -07:00
Noah Levitt	aa2d491009	i don't know where pyyaml 5.8 came from	2019-05-16 01:29:05 -07:00
Noah Levitt	42ddfba923	Merge pull request #150 from nlevitt/purge-old Purge old	2019-05-16 00:29:58 -07:00
Noah Levitt	40331f02ba	Merge pull request #153 from vbanos/warn-deprecated logging.warn is deprecated and replaced by logging.warning	2019-05-16 00:27:22 -07:00
Noah Levitt	f8db17ce3d	bump version after merge	2019-05-16 00:22:29 -07:00
Noah Levitt	eb34bebb91	Merge pull request #149 from nlevitt/travis-py37 trying to make this work with xenial for travis	2019-05-16 00:22:08 -07:00
Noah Levitt	c651bcdd18	remove some travis-ci debugging stuff	2019-05-16 00:21:28 -07:00
Noah Levitt	0a1360ab25	don't use localhost for test http server... ... because apparently sometimes chromium bypasses the proxy for local addresses	2019-05-15 18:49:18 -07:00
Noah Levitt	f8165dc02b	work around pytest issue until fix is out https://github.com/pytest-dev/pytest/issues/5257	2019-05-15 18:46:21 -07:00
Vangelis Banos	a1f9122317	Fix test_brozzling::httpd fixture We used `self.headers.getheader` which no longer works. We replace it with `self.headers.get`. We change the code to write binary data to `self.wfile` because we get an exception for writing str and/or None.	2019-05-14 16:29:52 +00:00
Vangelis Banos	a2ac3a0374	logging.warn is deprecated and replaced by logging.warning We replace it everywhere in the code base.	2019-05-14 12:10:59 +00:00
Noah Levitt	ee8ef23f0c	fix mistake in job-conf.rst	2019-04-30 10:49:48 -07:00
Noah Levitt	411b3f266a	bump version after merge	2019-04-09 22:07:51 +00:00
Noah Levitt	d4386491df	Merge pull request #151 from nlevitt/no-cerberus-normalize don't attempt cerberus normalization	2019-04-09 15:06:17 -07:00
Noah Levitt	5385232b40	don't attempt cerberus normalization which encumbers the validation with additional requirements, specifically makes it difficult to validate a subclass of `dict` because it expects a constructor that works like dict.__init__()	2019-04-09 01:45:37 -07:00
Noah Levitt	8dfd92cf7f	fix this utility	2019-04-09 01:44:14 -07:00
Noah Levitt	433b201b52	use logging.warning() to quiet py37 warnings	2019-04-09 01:43:38 -07:00
Noah Levitt	dfd9d9ecdd	omfg	2019-04-04 17:22:15 -07:00

1200 Commits