The Content Name Collection

A collection of open content name datasets for Information Centric Networking

Motivation

The "Content Name Collection" (CNC) lists and hosts open datasets of content names. These datasets are derived either from URL link databases or from web traces. The names are typically used for research on Information Centric Networking (ICN), for example to measure cache hit/miss ratios in simulations.
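As an illustration of the simulation use case mentioned above, the following is a minimal sketch of replaying a stream of content names against an LRU cache and computing the hit ratio. The function name `lru_hit_ratio`, the LRU policy, and the sample names are assumptions made for this example; they are not part of the CNC.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_ratio(names: Iterable[str], capacity: int) -> float:
    """Replay a request stream against an LRU cache and return the hit ratio."""
    cache = OrderedDict()  # keys are content names, values are unused
    hits = 0
    total = 0
    for name in names:
        total += 1
        if name in cache:
            hits += 1
            cache.move_to_end(name)  # mark as most recently used
        else:
            cache[name] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used name
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical request stream; real experiments would read a CNC dataset.
    requests = ["/a", "/b", "/a", "/c", "/b", "/a"]
    print(lru_hit_ratio(requests, capacity=2))
```

Because the function accepts any iterable, a dataset can be streamed through it without loading all names into memory.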

Current size of the CNC (compressed/uncompressed): 85,8 GB / 486 GB
CNC timestamp: 2017-05-17-14:00:00

Download

2014
Dataset | # Content Names | Unique | MIME Type, Encoding | Compression | # Files | Size (comp./uncomp.)
unibas-icn-names-2014-08-teaser | 10'000 | no | text/plain, UTF-8 | none | 1 | - / 411 KB
unibas-icn-names-2014-08 | 2'144'314'011 | no | text/plain, UTF-8 | yes, LZMA2/xz | 215 | 24,5 GB / 115 GB
unibas-icn-names-2014-08-unique | 870'501'646 | yes | text/plain, UTF-8 | yes, LZMA2/xz | 88 | 8,49 GB / 56,3 GB
unibas-url-names-2014-08-teaser | 10'000 | no | text/plain, UTF-8 | none | 1 | - / 451 KB
unibas-url-names-2014-08 | 2'144'314'011 | no | text/plain, UTF-8 | yes, LZMA2/xz | 215 | 24,9 GB / 120 GB
unibas-url-names-2014-08-unique | 870'896'633 | yes | text/plain, UTF-8 | yes, LZMA2/xz | 88 | 8,68 GB / 58,6 GB
cisco-icn-names-2014-12 | 13'549'122 | no | text/plain, UTF-8 | yes, LZMA2/xz | 14 | 104 MB / 754 MB
cisco-url-names-2014-12 | 13'549'129 | no | text/plain, UTF-8 | yes, LZMA2/xz | 14 | 104 MB / 755 MB

2016
Dataset | # Content Names | Unique | MIME Type, Encoding | Compression | # Files | Size (comp./uncomp.)
unibas-icn-names-2016-08 | 1'409'358'326 | yes | text/plain, UTF-8 | yes, LZMA2/xz | 141 | 16,7 GB / 117 GB
unibas-icn-names-2016-08-cc-(1-5) | 244'014'444 | yes | text/plain, UTF-8 | yes, LZMA2/xz | 5 | 2,3 GB / 17,4 GB

2017 - Hackathon at University of Basel
Dataset | # Content Names | Unique | MIME Type, Encoding | Compression | # Files | Size (comp./uncomp.)
urls.txt | 2'144'314'011 | no | text/plain, UTF-8 | none | 1 | - / 121 GB
urls-sample.txt | 170'000'000 | no | text/plain, UTF-8 | none | 1 | - / 9,68 GB

Metadata: metadata.txt
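
Most of the datasets listed above are distributed as LZMA2/xz-compressed UTF-8 text files. The following is a minimal sketch of streaming names from one such file with Python's standard `lzma` module, without first decompressing it to disk. It assumes one content name per line, which the listing does not state explicitly; the helper name `iter_names` and the file name in the usage example are hypothetical.

```python
import lzma
from typing import Iterator


def iter_names(path: str) -> Iterator[str]:
    """Yield content names from an xz-compressed UTF-8 text file.

    Assumption: the file holds one content name per line.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            name = line.rstrip("\n")
            if name:  # skip empty lines
                yield name


if __name__ == "__main__":
    # Hypothetical local file name; substitute a downloaded dataset part.
    for index, name in enumerate(iter_names("names-part-001.txt.xz")):
        print(name)
        if index >= 9:  # show only the first ten names
            break
```

Streaming keeps memory use roughly constant, which matters for the larger datasets whose uncompressed size exceeds 100 GB.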

Example: the unibas datasets

The raw material for the unibas datasets was the URL shortener archive (release of 2013-07-20) provided by the URLTeam. This archive consists of URLs that users submitted to URL shortening providers such as bit.ly, is.gd, TinyURL.com and others, together with the corresponding short URLs. The advantage of this archive is that the URLs (mostly) point to content objects that actually exist, so the derived names reflect real Internet content.

The unibas datasets come in two flavours: ICN-like content names and URL-like content names. For every valid entry in the URL shortener archive we provide both representations. Examples for both categories:

ICN content names:

Corresponding URL content names:

Note that the domain name components of the URL-like content names are inverted in the ICN-like representation.
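
The inversion of the domain name components noted above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the datasets: the function name `url_to_icn_like`, the leading slash, the dropping of scheme, port and query string, and the example URL are all choices made for this example.

```python
from urllib.parse import urlsplit


def url_to_icn_like(url: str) -> str:
    """Turn a URL into an ICN-like name with reversed domain components.

    Assumptions: scheme, port, query and fragment are dropped, and the
    result is a slash-separated name that starts with a slash.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    reversed_labels = list(reversed(labels))  # www.example.org -> org, example, www
    path_components = [c for c in parts.path.split("/") if c]
    return "/" + "/".join(reversed_labels + path_components)


if __name__ == "__main__":
    # Hypothetical URL, not taken from the datasets.
    print(url_to_icn_like("http://www.example.org/docs/index.html"))
    # prints /org/example/www/docs/index.html
```

Reversing the domain labels puts the most general component first, so names sharing a top-level domain or organisation also share a name prefix.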

Contact

General coordinator:
Urs Schnurrenberger (urs.schnurrenberger@unibas.ch)

Also involved:
Christian Tschudin, Manolis Sifalakis

University of Basel
Department of Mathematics and Computer Science
Spiegelgasse 1
CH - 4051 Basel