Commit Graph

109 Commits

Author SHA1 Message Date
Erik Johnston
58c9653c6b Don't infer paragraphs from newlines 2016-08-02 18:50:24 +01:00
Erik Johnston
6b58ade2f0 Comment on why we clone 2016-08-02 18:41:22 +01:00
Erik Johnston
9e66c58ceb Spelling. 2016-08-02 18:37:31 +01:00
Erik Johnston
f83f5fbce8 Make it actually compile 2016-08-02 18:32:42 +01:00
Erik Johnston
aecaec3e10 Change the way we summarize URLs
Using XPath is slow on some machines (for unknown reasons), so use a
different approach to get a list of text nodes.

Try to generate a summary that respects paragraph and then word
boundaries, adding ellipses when appropriate.
2016-08-02 18:25:53 +01:00
Erik Johnston
f52cb4cd78 Remove race 2016-06-29 15:24:50 +01:00
Erik Johnston
a70688445d Implement purge_media_cache admin API 2016-06-29 14:57:59 +01:00
Erik Johnston
314b146b2e Track approximate last access time for remote media 2016-06-29 11:41:20 +01:00
Mark Haines
13e334506c Remove the legacy v0 content upload API.
The existing content can still be downloaded. The last upload to the
matrix.org server was in January 2015, so it is probably safe to remove
the upload API.
2016-06-21 11:47:39 +01:00
Erik Johnston
09a17f965c Line lengths 2016-06-15 16:58:12 +01:00
Erik Johnston
1e9026e484 Handle floats as img widths 2016-06-15 16:58:05 +01:00
Erik Johnston
a60169ea09 Handle og props with no content 2016-06-15 16:57:48 +01:00
Erik Johnston
eba4ff1bcb 502 on /thumbnail when can't contact remote server 2016-06-09 11:29:43 +01:00
Mark Haines
eb79110beb Clean up the blacklist/whitelist handling.
Always set the config key with an empty list, even if a list isn't specified.
This means that the codepaths are the same for both the empty list and
for a missing key. Since the behaviour is the same for both cases this
makes the code somewhat easier to reason about.
2016-05-16 13:03:59 +01:00
Mark Haines
8d7ad44331 Report per request metrics for all of the things using request_handler 2016-04-28 10:57:49 +01:00
Erik Johnston
e8884e5e9c Add self.media_repo to PreviewUrlResource 2016-04-19 14:51:34 +01:00
Erik Johnston
a7001c311b _make_dirs was moved to MediaRepository 2016-04-19 14:49:31 +01:00
Erik Johnston
9181e2f4c7 Add store to PreviewUrlResource 2016-04-19 14:48:24 +01:00
Erik Johnston
fb76a81ff7 Reorder imports 2016-04-19 14:45:05 +01:00
Erik Johnston
0c93df89b6 Move MediaRepository to media_repository module 2016-04-19 11:31:43 +01:00
Erik Johnston
43f0941e8f Split out BaseMediaResource into MediaRepository
This is so that a single MediaRepository can be shared across all
resources, rather than having a "copy" per resource.

In particular this allows us to guard against both the thumbnail and
download resource triggering a download of remote content at the same
time.
2016-04-19 11:24:59 +01:00
Matthew Hodgson
aaabbd3e9e explicitly pass in the charset from Content-Type to lxml to fix cyrillic woes better 2016-04-15 14:32:25 +01:00
Matthew Hodgson
84f9cac4d0 fix cyrillic URL previews by hardcoding all page decoding to UTF-8 for now, rather than relying on lxml's heuristics which seem to get it wrong 2016-04-15 13:20:08 +01:00
Matthew Hodgson
f78b479118 fix urlparse import thinko breaking tiny URLs 2016-04-14 15:23:55 +01:00
Matthew Hodgson
bd77216d06 comment out 2c838f6459 due to risk of https://en.wikipedia.org/wiki/Billion_laughs attacks - thanks @torhve 2016-04-14 14:39:24 +01:00
Erik Johnston
d0633e6dbe Sanitize the optional dependencies for spider API 2016-04-13 13:38:09 +01:00
Erik Johnston
17515bae14 PEP8 2016-04-11 11:02:50 +01:00
Matthew Hodgson
5ffacc5e84 fix typos and needless try/except from PR review 2016-04-11 10:39:16 +01:00
Matthew Hodgson
83b2f83da0 actually throw meaningful errors 2016-04-08 21:36:59 +01:00
Mark Haines
b36270b5e1 Fix pep8 warning 2016-04-08 19:52:23 +01:00
Matthew Hodgson
1ccabe2965 more PR feedback 2016-04-08 18:58:08 +01:00
Matthew Hodgson
dafef5a688 Add url_preview_enabled config option to turn on/off preview_url endpoint. defaults to off.
Add url_preview_ip_range_blacklist to let admins specify internal IP ranges that must not be spidered.
Add url_preview_url_blacklist to let admins specify URL patterns that must not be spidered.
Implement a custom SpiderEndpoint and associated support classes to implement url_preview_ip_range_blacklist
Add commentary and generally address PR feedback
2016-04-08 18:37:15 +01:00
Matthew Hodgson
cf51c4120e report image size (bytewise) in OG meta 2016-04-03 23:57:05 +01:00
Matthew Hodgson
0834b152fb char encoding 2016-04-03 12:59:27 +01:00
Matthew Hodgson
8b98a7e8c3 pep8 2016-04-03 12:56:29 +01:00
Matthew Hodgson
eab4d462f8 fix etag typing error. fix timestamp typing error 2016-04-03 02:02:46 +01:00
Matthew Hodgson
c3916462f6 rebase all image URLs 2016-04-03 01:33:12 +01:00
Matthew Hodgson
110780b18b remove stale todo 2016-04-03 00:48:31 +01:00
Matthew Hodgson
b09e29a03c Ensure only one download for a given URL is active at a time 2016-04-03 00:47:40 +01:00
Matthew Hodgson
7426c86eb8 add a persistent cache of URL lookups, and fix up the in-memory one to work 2016-04-03 00:31:57 +01:00
Matthew Hodgson
d1b154a10f support gzip compression, and don't pass through error msgs 2016-04-02 03:06:39 +01:00
Matthew Hodgson
9377157961 how was _respond_default_thumbnail ever meant to work? 2016-04-02 02:31:45 +01:00
Matthew Hodgson
2c838f6459 pass back SVGs as their own thumbnails 2016-04-02 02:30:07 +01:00
Matthew Hodgson
5037ee0d37 handle missing dimensions without crashing 2016-04-02 02:29:57 +01:00
Matthew Hodgson
b26e8604f1 make meta comparisons case insensitive 2016-04-02 01:35:44 +01:00
Matthew Hodgson
5fd07da764 refactor calc_og; spider image URLs; fix xpath; add a (broken) expiringcache; loads of other fixes 2016-04-02 00:35:49 +01:00
Matthew Hodgson
c60b751694 fix assorted redirect, unicode and screenscraping bugs 2016-04-01 02:17:48 +01:00
Matthew Hodgson
683e564815 handle spidered relative images correctly 2016-03-31 23:52:58 +01:00
Matthew Hodgson
72550c3803 prevent choking on invalid utf-8, and handle image thumbnailing smarter 2016-03-31 15:14:14 +01:00
Matthew Hodgson
bb9a2ca87c synthesise basic OG metadata from pages lacking it 2016-03-31 14:15:09 +01:00