# Dmap Alexa 1M dataset
* In this document, we present the results produced by [Dmap][1] after crawling Alexa 1M domains
* We present the SQL code (and Python) used to obtain the results from Table III from our [Dmap paper](paper.pdf).
## Requirements:
* [PostgreSQL DB][3]
* Any SQL client that can connect to postgresql
* Check the data model to understand the results on the [Data model page](datamodel.html)
## Download the data and set it up
* You can download the Postgresql database dump (result tables only):
* File: [pg_dump_export_20180227.zip](https://s3.eu-central-1.amazonaws.com/dmap-pgdump/pg_dump_export_20180227.zip).
* `ms5sum: b5a248deace64fed2931df97908d39a5`
* Then install it at your already configured [PostgreSQL DB][3] using the following command:
```bash
# steps to import pg_dump_export_20180227.zip into PostgreSQL
# unzip file
$ unzip pg_dump_export_20180227.zip
# create database
$ createdb -T template0 DMapdb
# import the data
$ psql DMapdb < pg_dump_export_20180227
```
## DNS
* From this point onwards, we show the results of Table III in the paper, and which queries/python code we use to obtain these numbers
* **Reference**: check the [data model](datamodel.html) for a description of each column.
### DNS results from Table III in [Dmap paper][2]:
|*Metric* |*IPv4* |*IPv6* |*IPv6/IPv4*|
| --------| ------|-------| ----------|
|# Domains (OK) |972,155|153,485|0.16|
|# Unique NSes |289,014|26,127|0.09|
|# Unique IP |210,650|19,754|0.09|
|# Unique ASes |18,418|3,178 |0.17|
|# CDN Cloudflare |117,538|115,396 |0.98|
### SQL queries to obtain these results
*IMPORTANT:* crawl_run=67 is a ID of this entire measurment, and it' s a fixed value in all queries here.
```sql
--Domains (OK)
--IPv4(crawl_status=0 means no error during crawling)
select count(1) from crawl_result_dns
where crawl_run=67 and ip_version=4 and crawl_status=0
--IPv6 (crawl_status=0 no error)
select count(1) from crawl_result_dns
where crawl_run=67 and ip_version=6 and crawl_status=0
-- unique IPs (IPv4)
select count( distinct uq_ip)
from (select json_array_elements((jsonb_array_elements(dns_ns)->>'addresses')::json)->>'address' as uq_ip
from crawl_result_dns
where dns_ns is not null
and ip_version = 4
and crawl_status = 0
and crawl_run = 67)as ips
-- unique IPs (IPv6)
select count( distinct uq_ip)
from (select json_array_elements((jsonb_array_elements(dns_ns)->>'addresses')::json)->>'address' as uq_ip
from crawl_result_dns
where dns_ns is not null
and ip_version = 4
and crawl_status = 0
and crawl_run = 67)as ips
--unique NSes (IPv4)
select count( distinct lower(ns_name))
from (select jsonb_array_elements(dns_ns)->>'name' as ns_name
from crawl_result_dns
where dns_ns is not null
and ip_version = 6
and crawl_status = 0
and crawl_run = 67)as ns_names
-- unique NSes (IPV6)
select count( distinct lower(ns_name))
from (select jsonb_array_elements(dns_ns)->>'name' as ns_name
from crawl_result_dns
where dns_ns is not null
and ip_version = 6
and crawl_status = 0
and crawl_run = 67)as ns_names
--- unique ASes (IPv4)
select count( distinct uq_as)
from (select json_array_elements((jsonb_array_elements(dns_ns)->>'addresses')::json)->>'asn' as uq_as
from crawl_result_dns
where dns_ns is not null
and ip_version = 4
and crawl_status = 0
and crawl_run = 67)as ases
--- unique ASes (IPv6)
select count( distinct uq_as)
from (select json_array_elements((jsonb_array_elements(dns_ns)->>'addresses')::json)->>'asn' as uq_as
from crawl_result_dns
where dns_ns is not null
and ip_version = 6
and crawl_status = 0
and crawl_run = 67)as ases
---cloudflare
select count(domainname) from crawl_result_dns
where crawl_run=67 and ip_version=4 and crawl_status=0
and dns_cdn_cloudflare=true
select count(domainname) from crawl_result_dns
where crawl_run=67 and ip_version=6 and crawl_status=0
and dns_cdn_cloudflare=true
```
## HTTP
### HTTP results from Table III in [Dmap paper][2]:
|*Metric* |*IPv4* |*IPv6* |*IPv6/IPv4*|
| --------| ------|-------| ----------|
|# Domains (OK) |968,338|153,485 |0.16|
|#HTML 5 |681,757|116,066|0.17|
|Bytes (median) |53,889|64,735| 1.20|
|External links (median) |7|8|1.14|
|Internal links (median) |67 |75|1.12|
|Cookies (median) |1|1|1.00|
### HTTP SQL queries
```sql
--IPv4(crawl_status=0 no error)
select count(1) from crawl_result_http
where crawl_run=67 and ip_version=4
and crawl_status=0
--IPv6 (crawl_status=0 no error)
select count(1) from crawl_result_http
where crawl_run=67 and ip_version=6
and crawl_status=0
---- html 5 (IPv4)
select html_version, count(1) as df from crawl_result_http
where crawl_run=67 and ip_version=4 and http_status=200
and crawl_status=0 group by html_version
---- html 5 (IPv6)
select html_version, count(1) as df from crawl_result_http
where crawl_run=67 and ip_version=6 and crawl_status=0
and http_status=200 group by html_version
---median bytes len (IPv4/IPV6)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by http_bytes_len)
FROM crawl_result_http where crawl_run=67 and ip_version=4
and crawl_status=0 and http_status=200
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by http_bytes_len)
FROM crawl_result_http where crawl_run=67 and ip_version=6
and crawl_status=0 and http_status=200
---internal links (IPv4/IPV6)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by html_links_int)
FROM crawl_result_http where crawl_run=67 and ip_version=4 and
crawl_status=0 and http_status=200
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by html_links_int)
FROM crawl_result_http where crawl_run=67 and ip_version=6
sand crawl_status=0 and http_status=200
--- external links (IPv4/IPV6)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by html_links_ext)
FROM crawl_result_http where crawl_run=67 and ip_version=4
and crawl_status=0 and http_status=200
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by html_links_ext)
FROM crawl_result_http where crawl_run=67 and ip_version=6
and crawl_status=0 and http_status=200
---cookie count (IPv4/IPV6)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by cookie_count)
FROM crawl_result_http where crawl_run=67 and ip_version=4
and crawl_status=0
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER by cookie_count)
FROM crawl_result_http where crawl_run=67 and ip_version=6
and crawl_status=0
```
## TLS
### DNS results from Table III in [Dmap paper][2]:
|*Metric* |*IPv4* |*IPv6* |*IPv6/IPv4*|
| --------| ------|-------| ----------|
|#Domains (OK) |772,455 |129,443|0.17|
|# Let’s Encrypt |165,526|10,466|0.06|
### TLS SQL queries
```sql
---Domains (OK) (IPv4/IPv6)
select count(1) from crawl_result_tls where
crawl_run=67 and ip_version=4 and crawl_status=0
and tls_avail=true
select count(1) from crawl_result_tls where
crawl_run=67 and ip_version=6 and crawl_status=0
and tls_avail=true
--- Let's Encrypt (IPv4/IPv6)
select count(1) from crawl_result_tls where crawl_run=67
and ip_version=4 and tls_letsencrypt=True
select count(1) from crawl_result_tls where crawl_run=67
and ip_version=6 and tls_letsencrypt=True
```
## SMTP
### SMTP results from Table III in [DMap paper][2]:
|*Metric* |*IPv4* |*IPv6* |*IPv6/IPv4*|
| --------| ------|-------| ----------|
|# Domains (OK) |843,126|190,736|0.23|
|# Unique SMTP |501,848 |24,311|0.05|
|# Unique IP |286,504|10,113|0.04|
|# Unique StartTLS |302,871 |8,016|0.03|
### STMP SQL queries
* For StartTLS query support, we did not write the JSON queries, but a python script. Please [download it here][4] and run it, and obtain all results for SMTP.
* Here is part of the queries, for the remaining see the [smtpstats.py][4].
```sql
--domains (OK), IPv4 and IPv6:
select count(domainname) from crawl_result_smtp
where crawl_run=67 and ip_version=4 and
crawl_status=0 and smtp_count>0
select count(domainname) from crawl_result_smtp
where crawl_run=67 and ip_version=6 and
crawl_status=0 and smtp_count>0
--unique SMTP (IPv4, IPv6):
WITH smtp_hosts AS (
select lower(json_array_elements(smtp_hosts::json)->>'name') as host
from crawl_result_smtp
where ip_version = 4
and crawl_status = 0
and crawl_run = 67
)
select count(distinct host)
from smtp_hosts
--unique SMTP (IPv6):
WITH smtp_hosts AS (
select lower(json_array_elements(smtp_hosts::json)->>'name') as host
from crawl_result_smtp
where ip_version = 6
and crawl_status = 0
and crawl_run = 67
)
select count(distinct host)
from smtp_hosts
-- unique IP (IPv4):
WITH smtp_hosts AS (
select lower(json_array_elements((json_array_elements(smtp_hosts::json)->>'smtpHostIPs')::json)->>'ip') as addr
from crawl_result_smtp
where ip_version = 4
and crawl_status = 0
and crawl_run = 67
)
select count(distinct addr)
from smtp_hosts
-- unique IP (IPv6):
WITH smtp_hosts AS (
select lower(json_array_elements((json_array_elements(smtp_hosts::json)->>'smtpHostIPs')::json)->>'ip') as addr
from crawl_result_smtp
where ip_version = 6
and crawl_status = 0
and crawl_run = 67
)
select count(distinct addr)
from smtp_hosts
```
[1]: https://dmap.sidnlabs.nl
[2]: paper.pdf
[3]: https://www.postgresql.org/
[4]: https://sidn.github.io/dmap-site/smtpstats.py
Copyright (C) 2018 SIDN Labs