From 54110410bf7c1009ebdf0f790debf1123b9ef665 Mon Sep 17 00:00:00 2001
From: Greg Lindahl
Date: Tue, 4 Jul 2023 01:33:09 -0700
Subject: [PATCH 1/3] warcio was stable a long time ago (#134)

Co-authored-by: Greg Lindahl
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ff484e0..9c32d61 100644
--- a/README.md
+++ b/README.md
@@ -161,7 +161,7 @@ This list of tools and software is intended to briefly describe some of the most
 * [Sparkling](https://github.com/internetarchive/Sparkling) - Internet Archive's Sparkling Data Processing Library. *(Stable)*
 * [Unwarcit](https://github.com/emmadickson/unwarcit) - Command line interface to unzip WARC and WACZ files (Python).
 * [Warcat](https://github.com/chfoo/warcat) - Tool and library for handling Web ARChive (WARC) files (Python). *(Stable)*
-* [warcio](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO (Python).
+* [warcio](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO (Python). *(Stable)*
 * [warctools](https://github.com/internetarchive/warctools) - Library to work with ARC and WARC files (Python).
 * [webarchive](https://github.com/richardlehane/webarchive) - Golang readers for ARC and WARC webarchive formats (Golang).
 

From bf9664ff45ba12b4a879740962a7a3935f442aa6 Mon Sep 17 00:00:00 2001
From: Greg Lindahl
Date: Tue, 4 Jul 2023 01:34:33 -0700
Subject: [PATCH 2/3] add web data commons (#137)

Co-authored-by: Greg Lindahl
Co-authored-by: Andy Jackson
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 9c32d61..1b116b4 100644
--- a/README.md
+++ b/README.md
@@ -172,6 +172,7 @@ This list of tools and software is intended to briefly describe some of the most
 * [Archives Unleashed Notebooks](https://github.com/archivesunleashed/notebooks) - Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit. *(Stable)*
 * [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut) - Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark. *(Stable)*
 * [Tweet Archvies Unleashed Toolkit](https://github.com/archivesunleashed/twut) - An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark. *(In Development)*
+* [Web Data Commons](http://webdatacommons.org/) - Structured data extracted from Common Crawl. *(Stable)*
 
 ### Quality Assurance
 

From d395bb1b44ca73ae8f15bc8fb4b7902079e607d9 Mon Sep 17 00:00:00 2001
From: Greg Lindahl
Date: Tue, 4 Jul 2023 01:36:05 -0700
Subject: [PATCH 3/3] add common crawl mailing list (#136)

Co-authored-by: Greg Lindahl
Co-authored-by: Andy Jackson
---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 1b116b4..8001255 100644
--- a/README.md
+++ b/README.md
@@ -213,6 +213,7 @@ This list of tools and software is intended to briefly describe some of the most
 
 ### Mailing Lists
 
+* [Common Crawl](https://groups.google.com/g/common-crawl)
 * [IIPC](http://netpreserve.org/about-us/iipc-mailing-list/)
 * [OpenWayback](https://groups.google.com/g/openwayback-dev)
 * [WASAPI](https://groups.google.com/g/wasapi-community)