GATHERING INFORMATION ABOUT NETWORK INFRASTRUCTURE FROM DNS
NAMES AND ITS APPLICATIONS
by
ABHIJIT ALUR
A THESIS
Presented to the Department of Computer and Information Science
and the Graduate School of the University of Oregon
in partial fulfillment of the requirements
for the degree of
Master of Science
December 2014
THESIS APPROVAL PAGE
Student: Abhijit Alur
Title: Gathering Information about Network Infrastructure from DNS Names and Its Applications
This thesis has been accepted and approved in partial fulfillment of the requirements for the Master
of Science degree in the Department of Computer and Information Science by:
Prof. Reza Rejaie Chair
Prof. Jun Li Member
and
J. Andrew Berglund Dean of the Graduate School
Original approval signatures are on file with the University of Oregon Graduate School.
Degree awarded December 2014
© 2014 Abhijit Alur
THESIS ABSTRACT
Abhijit Alur
Master of Science
Department of Computer and Information Science
December 2014
Title: Gathering Information about Network Infrastructure from DNS Names and Its Applications
DNS (Domain Name System) names contain a wide variety of information, such as
geographic location, speed of the interface, type of interface, etc. However, extracting this
information is challenging since this information does not have a consistent format across
different ISPs (internet service providers) or even a particular ISP.
We present a new tool, GINIE, which extracts useful information and some common
dictionary words from a DNS name. We use three ISPs and a CAIDA (Center for Applied
Internet Data Analysis) dataset to demonstrate these capabilities.
Information extracted with GINIE provides valuable insight about the infrastructure of the
three ISPs and shows the availability and type of information in a collection of DNS names from
many ISPs that exist in a typical dataset. The embedded information from DNS names can be
used (with some additional active measurements) to infer the geo-aware topology of an ISP.
CURRICULUM VITAE
NAME OF AUTHOR: Abhijit Alur
GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR
B.V.B College of Engineering and Technology, Hubli, Karnataka, India
DEGREES AWARDED:
Master of Science, Computer and Information Science, 2014, University of Oregon
Bachelor of Engineering, Computer Science, 2008, BVB College of Engineering
AREAS OF SPECIAL INTEREST:
Network Measurement, Data Mining
PROFESSIONAL EXPERIENCE:
Systems Engineer, Tata Consultancy Services, 3.9 Years
ACKNOWLEDGEMENTS
I thank my advisor, Prof. Reza Rejaie, for his assistance in the preparation
of this manuscript. Special thanks are due to PhD student Reza Motamedi, whose guidance
and familiarity with the concepts of network measurement have been helpful throughout this
undertaking. I was able to use some of his databases of network measurement data and build
on them further. Specifically, the databases of information from RouteViews and Team-Cymru that he
had collected were very helpful in my research.
TABLE OF CONTENTS
Chapter Page
I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II. RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
III. METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Generating All IP Addresses in Prefixes . . . . . . . . . . . . . . . . . . . . . . . 7
Selecting The Public DNS Servers . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Issuing Reverse DNS Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Correcting The IP Addresses That Had Errors . . . . . . . . . . . . . . . . . . . 11
Creating Dictionaries of Interface Names, Router Function, Cities, etc. . . . . . 11
Parsing DNS Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
IV. CHARACTERISTICS OF INDIVIDUAL ISPS . . . . . . . . . . . . . . . . . . . . . 22
Selection of ISPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Level3 Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Verizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Cogent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
V. CROSS ISP VS CAIDA DATASET ANALYSIS . . . . . . . . . . . . . . . . . . . . . 41
Observations of The CAIDA Dataset . . . . . . . . . . . . . . . . . . . . . . . . 41
VI. TOPOLOGY MAPPING FROM XNET AND IFFINDER . . . . . . . . . . . . . . . 46
Databank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Yahoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Graph Based on City Information . . . . . . . . . . . . . . . . . . . . . . . . . . 48
REFERENCES CITED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
LIST OF FIGURES
Figure Page
1. DNS Servers Whose Timeout Percentage is More Than 6% . . . . . . . . . . . . . . . 9
2. Number of TIMEOUTs Observed vs Timeout Value . . . . . . . . . . . . . . . . . . . 11
3. Level3 Inferred Prefixes and Their Size Distribution . . . . . . . . . . . . . . . . . . . 24
4. Verizon Inferred Prefixes and Their Size Distribution . . . . . . . . . . . . . . . . . . 25
5. Cogent Inferred Prefixes and Their Size Distribution . . . . . . . . . . . . . . . . . . 26
6. Level3 - Domains and Their Size Distribution . . . . . . . . . . . . . . . . . . . . . . 27
7. Level3 - Names Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8. Level3 - Names Distribution(Others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
9. Level3 - Parts Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
10. Level3 - CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . . . . . . . 31
11. Verizon - Domains and Their Size Distribution . . . . . . . . . . . . . . . . . . . . . . 32
12. Verizon - Names Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
13. Verizon - Names Distribution(Others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
14. Verizon - Names Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
15. Verizon - CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . . . . . . 36
16. Cogent - Names Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
17. Cogent - Names Distribution (Others) . . . . . . . . . . . . . . . . . . . . . . . . . . 38
18. Cogent - Parts Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
19. Cogent - CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . . . . . . 40
20. CAIDA - CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . . . . . . 43
21. CAIDA vs Others CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . 43
22. CAIDA Segment Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
23. CAIDA Others Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
24. CAIDA vs Others CDF of Extracted Information . . . . . . . . . . . . . . . . . . . . 45
25. Topology of Databank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
26. Topology of Yahoo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
27. Region Level Topology of Verizon-gni . . . . . . . . . . . . . . . . . . . . . . . . . . 49
28. Region Level Topology of Level3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
LIST OF TABLES
Table Page
1. Prefixes of Level3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. Distribution of Number of TIMEOUT or SERVFAIL Responses For Different Timeouts
Specified in The Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3. Juniper Interface Naming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4. Cisco Interface Naming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5. Huawei Interfaces And Their Meanings. . . . . . . . . . . . . . . . . . . . . . . . . . 15
6. ISPs and Their Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
7. ISPs with Status Distribution Before Repair . . . . . . . . . . . . . . . . . . . . . . . 23
8. ISPs with Status Distribution After Repair . . . . . . . . . . . . . . . . . . . . . . . . 23
9. Count of All domains in Level3, Verizon and Cogent . . . . . . . . . . . . . . . . . . 23
10. Level3-Domain-Subnet Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
11. Top 10 Domains in Level3 and Their Distribution . . . . . . . . . . . . . . . . . . . . 26
12. Level3 - Parsed DNS Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
13. Level3 - Parsed DNS Names (others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
14. Level3 - Most Occurring Dictionary Words . . . . . . . . . . . . . . . . . . . . . . . 29
15. Level3 - CDF of Information Parsed . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
16. Top 10 Domains in Verizon and Their Distribution . . . . . . . . . . . . . . . . . . . 31
17. Verizon - Parsed DNS Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
18. Verizon - Parsed DNS Names(Others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
19. Verizon - Most Occurring Dictionary Words . . . . . . . . . . . . . . . . . . . . . . . 34
20. Verizon - CDF of Information Parsed . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
21. Cogent - Parsed DNS Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
22. Cogent - Parsed DNS Names(Others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
23. Cogent - CDF of Information Parsed . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
24. Cogent - Most Occurring Dictionary Words . . . . . . . . . . . . . . . . . . . . . . . 38
25. Cogent - Place Matches and Mismatches with IP2Location Data . . . . . . . . . . . 39
26. Observations of CAIDA dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
27. CAIDA - CDF of Information parsed . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
28. CAIDA - Parsed DNS Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
29. CAIDA - Parsed DNS names(Others) . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
CHAPTER I
INTRODUCTION
Knowledge of the Internet infrastructure is important because it helps in mapping
out the topology of the Internet. Information such as the geographic location, speed, and type of
router interfaces gives an idea of how a network is built and helps form a complete picture of the
Internet. Tier-1 ISPs connect a major part of the Internet. Finding out their router locations, the
types of routers (physical and logical), and the types and speeds of their interfaces is one way of
understanding their topological structure. Typical approaches to reconstructing the router-level
topology of ISPs include traceroute probes, multicast tools such as mrinfo [8] as used in
MERLIN [19], IP options probing, and manual analysis of existing DNS records.
The active probing techniques mentioned above have some problems. Routers might
block some of the probes. Even when a probe is successful, it might not reveal
much information. Inferences drawn from traceroute often rest on delay values, which might not
be very accurate. The process is slow, and it may have to be run multiple times to
obtain more information or to validate prior results. Using DNS names for
information gathering, in contrast, is both fast and easy. Some information that is very hard to
obtain by probing, such as the speed and type of interfaces, is readily available in DNS names.
The objective of our work is to build a parser for DNS names that extracts
information such as interface type, router function, geographic location, the companies
subleasing address space from a bigger ISP, and dictionary words (which might indicate something
specific about the configuration). We run our parser on 10 different ISPs and discuss the results
for 3 of them in detail in this thesis. We also run our parser on a publicly available CAIDA dataset
of DNS names and compare the results against the 3 ISPs to see how naming structure varies across
multiple ASes. We use IFFINDER and XNET to find aliases of IP addresses, and we derive the
router-level and city-level topologies of individual companies such as verizon-gni and level3.net.
There are many advantages to using reverse DNS names to map the topological structure
and to find other useful data, but there are inherent challenges [26]. There is no standard for
how ISPs name their routers; each one follows its own rules. Names are often
misconfigured, which might lead to erroneous inferences. Sometimes, even within an ISP, one set
of routers follows one naming convention and another set follows a different one. Many
ISPs don't update the names regularly as they add, move, or replace their routers. In such cases our
measurement approach doesn't work. However, we know of some ISPs which follow good naming
conventions and keep the DNS names of their servers up to date, since doing so helps them monitor
their resources. Typically, large ISPs tend to follow this practice.
We find that Level 3 Communications, Verizon, and Cogent are some of the most connected
Internet service providers, with large networks in both Europe and North America, and that they
follow naming conventions. Two of the reverse DNS names that we found are given
here as examples:
se-4-0.hsa1.Baltimore1.Level3.net
fa1-0-0.burlma2-cr1.bbnplanet.net
Both are reverse DNS records for routers owned by Level3 (their addresses map to the same
ASN).
These records include four pieces of information. First, a router location (e.g., Baltimore,
and burlma, which might mean Burlington, Massachusetts). We have uncovered more than 50,000
interface names that end with 'level3.net', and almost all of them contain full city names. Second,
the router code within a location (e.g., hsa1 and cr1). Third, the type of interface, which we
infer based on Cisco naming conventions (e.g., te for 10 Gbps Ethernet and fa for 100 Mbps
Ethernet). Fourth, the interface's position within the router (e.g., 1-0 and 4-0, which are,
respectively, the first ports on their line cards). The code cr1 also hints that the device is a core router.
To capitalize on this information, we first generate all possible IP addresses in a list of
subnets that belong to the 10 ISPs listed in the next section. We then request their reverse
DNS names.
We have several interesting results which we are releasing to the research community.
These results include:
– DNS name Parser named GINIE
– Information about the interface types used by different ISPs.
– Speeds of the interfaces.
– Location of interfaces of ISPs.
– Router level topology of smaller companies which form the larger AS.
– City level topology of smaller companies which form the larger AS.
CHAPTER II
RELATED WORK
Extracting information from DNS names has been done before. Two notable projects
are UNDNS [13], which is described in detail in the paper "Measuring ISP Topologies with
Rocketfuel" [21], and PathAudit [9], which is described in detail in the paper "What's in a name".
UNDNS UNDNS uses regular expressions to match DNS names and parse
information from them. This approach requires a human in the loop: someone has to
come up with the rules by looking at the patterns in the DNS names and write a regular expression
for each pattern. There are problems with this approach. DNS names might change over
time, and a slight change in a DNS name can render the regular expression written for an ISP
useless. There are also wide variations in the DNS name formats used by system administrators.
To avoid erroneous results, the regular expressions have to be very
specific in some cases, which defeats the purpose of writing one regular expression for a group of
DNS names. It also means that many regular expressions have to be written and a lot of manual
inspection is needed. This makes the process very slow and impractical for large ISPs and widespread use.
PathAudit PathAudit [9] is another project, described in detail in [14]. It uses a dictionary of
information such as city names and interface types to check for information in the DNS names.
It uses clustering algorithms to group names into clusters based on "tags". These tags are:
router function, dots ("."), dashes ("-"), alphanumeric names [A-Za-z][A-Za-z]+[0-9], interface
speed, IP address in the DNS name, and router type (Cisco, Juniper, etc.). However, with this
approach a smaller substring of one part of a name might be matched first
and lead to a wrongly inferred tag. For example, a name such as "Fibernet" might lead to the city tag
"bern".
GINIE Approach We have built a parser called GINIE (Gathering Information from
Network InfrastructurE). We too use dictionaries of cities, interface types, etc., but
we use the separators in the DNS names, such as dots (".") and dashes ("-"), to split the names and
check the resulting parts against our dictionaries. Based on our observation of DNS names, we were
able to come up with a logical flow which most of the DNS names follow. For example, interface
information, if present, always comes first in the name, before other types of information. Based
on our observation of city names, we have also seen that the DNS names follow the CLLI [3] name format
for city information.
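The difference between substring matching and separator-based matching can be illustrated with a minimal Python sketch. It is not GINIE's implementation: the tiny CITY_CODES dictionary, both function names, and the step that strips trailing digits from a token (so that "Baltimore1" matches "baltimore") are assumptions made only for this example.

```python
import re

# Illustrative dictionary only (assumption); a real city dictionary is far larger.
CITY_CODES = {"bern": "Bern", "baltimore": "Baltimore"}


def substring_city_tags(name):
    """Naive matching: report any city code that occurs anywhere in the name."""
    lowered = name.lower()
    return sorted(code for code in CITY_CODES if code in lowered)


def token_city_tags(name):
    """Separator-based matching: split on dots and dashes, strip trailing
    digits from each token (an assumption of this sketch), and require
    the whole token to match a dictionary entry."""
    tags = []
    for token in re.split(r"[.-]", name.lower()):
        stem = token.rstrip("0123456789")
        if stem in CITY_CODES:
            tags.append(stem)
    return tags
```

With these definitions, substring matching tags the hypothetical name fibernet.example.net with "bern", whereas token matching reports no city for it and still finds "baltimore" in se-4-0.hsa1.Baltimore1.Level3.net.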
CHAPTER III
METHODOLOGY
Most network interfaces are assigned DNS names by their ISPs for ease of management.
These domain names usually have some structure that depends upon the hardware,
the function of the router, etc. A typical name is of the form
a7-0.lsanca1-ar53.bbnplanet.net. The parts of a DNS name are separated by dots ("."). We already
know some parts of this name: the right-most part, "net", is a Top Level Domain (TLD). We
encounter other TLDs such as .com and .us. "bbnplanet" may be a company subleasing
address space from Level3 Communications, because BGP advertisements announce the IP address
as part of the Level3 ASN (3356). The names to the left of the TLD are usually
managed by the individual organizations. Our aim is to identify patterns in these names and
retrieve as much data from them as possible.
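The decomposition described above can be sketched in a few lines of Python. The function name split_dns_name and the returned field names are hypothetical, and the sketch assumes a single-label TLD, so it would mislabel names under multi-label suffixes such as .co.uk.

```python
def split_dns_name(name):
    """Split a DNS name into TLD, domain, and the host labels to its left.

    Assumes the TLD is the single right-most label (a simplification).
    """
    labels = name.rstrip(".").lower().split(".")
    if len(labels) < 2:
        raise ValueError("expected at least a domain and a TLD: %r" % name)
    return {
        "tld": labels[-1],            # e.g. "net"
        "domain": labels[-2],         # e.g. "bbnplanet"
        "host_labels": labels[:-2],   # e.g. ["a7-0", "lsanca1-ar53"]
    }
```

Only the host labels are left for dictionary-based parsing; the domain and TLD identify who manages the name.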
The data from BGP announcements is captured by projects like RouteViews [20]. The
BGP announcements take the form of prefix-to-ASN mappings. Team-Cymru also maintains a large
database of IP address to ASN mappings. We gather information about ISPs and their
prefixes from these two sources. We issue reverse DNS queries for a sample of IP
addresses for some ISPs and select a list of about 10 ISPs whose DNS names seem to have good
structure (e.g., a7-0.lsanca1-ar53.bbnplanet.net). Usually the large ISPs have a good naming
structure. The remaining steps are listed below:
– Generate all IP addresses in the prefixes of the selected ISPs.
– Select a list of public DNS servers from the public-dns.tk [10] website.
– Issue reverse DNS queries for all the IP addresses and store the DNS names and any error
messages that we get.
– Repair any IP addresses for which we encountered errors by re-sending the reverse DNS
queries.
– Build dictionaries of city names, city codes, interface types and their codes, and states of the
United States of America from various sources.
– Use the dictionaries to parse the DNS names that we resolved.
– Validate the cities that we parsed from the names against an IP geolocation database such as
IP2Location [7].
Generating All IP Addresses in Prefixes
Example prefixes for Level3 are listed in Table 1 below.
TABLE 1. Prefixes of Level3
Prefix Size
4.0.0.0/10 4,194,304
8.0.0.0/10 4,194,304
62.67.0.0/16 65,536
62.140.0.0/19 8,192
63.208.0.0/13 524,288
64.30.32.0/19 8,192
64.152.0.0/13 524,288
64.200.0.0/16 65,536
65.88.0.0/14 262,144
66.170.136.0/22 1,024
67.96.0.0/14 262,144
166.90.0.0/16 65,536
195.16.160.0/19 8,192
195.50.64.0/18 16,384
198.17.30.0/24 256
This table shows some of the prefixes we use that belong to the Level3 Communications
Autonomous System Number (ASN). The size of a prefix is the number of possible IP addresses
in that prefix. We generate all possible IP addresses in each prefix and issue reverse DNS queries
for them.
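The prefix sizes in Table 1 and the address enumeration can be reproduced with Python's standard ipaddress module. This is a minimal sketch rather than the script used in this work; the function names are hypothetical.

```python
import ipaddress


def prefix_size(prefix):
    """Number of addresses covered by a CIDR prefix, e.g. a /24 covers 256."""
    return ipaddress.ip_network(prefix).num_addresses


def addresses_in(prefix):
    """Lazily yield every address in the prefix as a dotted-quad string.

    Iterating the network object includes the network and broadcast
    addresses, which matches the sizes listed in Table 1.
    """
    for address in ipaddress.ip_network(prefix):
        yield str(address)
```

A generator is used so that a large prefix such as a /10 does not have to be held in memory at once.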
Problems in This Approach There are certain issues with this approach.
– We acquire these prefixes from RouteViews [20] and Team-Cymru [23], which in turn
gather this information from router BGP updates, so the sizes and values of these prefixes
might vary from time to time. New prefixes may also become associated with the ISPs.
– Some of the IP addresses might not be allocated. Since the only method we use is reverse
DNS queries, we cannot be sure whether these IP addresses were allocated. One approach
is to send ping requests to the routers, but this approach isn't foolproof either, since many
routers block ping requests.
– The ISP might rename its routers from time to time, which might change our view of the
ISP and its configuration. It might also move servers from one location to another.
We assume that these changes are slow and affect only a small fraction of the
routers.
In spite of these drawbacks, analyzing DNS names is potentially very useful because
they contain a lot of information, and much of the information that we gather using this
approach might not be obtainable by any other means of active or passive network measurement.
Moreover, our methodology can be repeated to obtain a more up-to-date view of the ISPs.
Selecting The Public DNS Servers
There are many public DNS servers online. Public-dns.tk [10] lists
public DNS servers and their statuses. We only use IPv4 DNS servers because we consider
only the IPv4 addresses of the ISPs. There are around 3000 such DNS servers. Some of these
servers might not be very efficient, so it is essential for us to weed out the servers which are
either slow or have a high error rate. We send 12000 sample queries (one cycle of
our algorithm) to the list of servers and check the responses and the behavior of the servers. We
remove all the servers with an error rate of more than 6%. Around 273 server addresses have losses
of more than 6%; they are shown in red in Figure 1 to indicate that they are not used further in our
analysis. We also see that most of the DNS servers lie within the 6% error rate. We apply such a
strict check because we have millions of addresses to resolve, and servers with an
error rate higher than 10% would eventually have a much higher error rate because of our persistent
queries. Figure 1 shows the server timeout percentage on the x axis and the number
of servers with that percentage. We delete all these DNS servers from our list. 28 of the DNS
servers have 100% timeouts. In all, 272 servers are deleted from our list.
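The filtering step can be sketched as follows, assuming that the sample run has produced, for each server, the number of queries sent and the number of timeouts. The function name, the input layout, and the example addresses are assumptions for illustration; only the 6% threshold comes from the text.

```python
def reliable_servers(stats, max_timeout_pct=6.0):
    """Keep servers whose timeout percentage does not exceed the threshold.

    `stats` maps a server address to a (queries_sent, timeouts) tuple.
    """
    keep = []
    for server, (sent, timeouts) in stats.items():
        if sent == 0:
            continue  # never queried, so there is no evidence to keep it
        if 100.0 * timeouts / sent <= max_timeout_pct:
            keep.append(server)
    return sorted(keep)
```

A server with exactly 6% timeouts is kept here, since the text removes servers with more than 6%.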
FIGURE 1. DNS Servers Whose Timeout Percentage is More Than 6%
Issuing Reverse DNS Queries
We use the dig tool to perform reverse DNS queries. Other DNS lookup tools exist, such as
nslookup and host. nslookup has been marked as deprecated in the past, and host is a much more
succinct form of dig. We use dig because it gives us a lot more information, such as:
Answer Section: Contains the type of the DNS record returned (in our case PTR) and the
DNS name, if it exists.
Question Section: Contains information about the request.
Authority Section: Names of the authoritative DNS servers.
Additional Section: If we query for an MX record, the answer section will show the DNS
names of the mail servers and the additional section will show the IP addresses of those
servers (if they are present).
EDNS option: If the DNS server is EDNS enabled, the query is converted into an
EDNS query and sent to the server. The server then recursively relays the query to
the authoritative DNS server for the domain requested in the query. The authoritative
DNS server then looks at the EDNS query (which contains the prefix of the
client that initially made the request) and provides a response which might contain an IP
address that is nearer to the client. host doesn't have this option.
Statistics: dig also shows statistics of the query, such as the time it took for the query
to be resolved and the size of the message received.
This information might be useful for analysis later. The responses generated by the dig tool
and their meanings are described in RFC 1035 [4].
Setting The Timeout Value in dig
The dig tool has an option for setting the timeout for each query in seconds. To decide on an
optimal timeout setting for our queries, we ran our script against 300,000 IP addresses of Level3
(this is also the number of requests after which the script is programmed to wait for around
10 minutes; each of the DNS servers is queried approximately 300 times) with
timeout values ranging from 1 second to 6 seconds. When a DNS server is bogged down by requests, it
tends to take longer than the timeout specified in the query to respond, and hence
we expect to get a higher number of TIMEOUT responses. The distribution of TIMEOUT
responses received in each of these experiments is shown in Table 2 and in Figure 2.
TABLE 2. Distribution of Number of TIMEOUT or SERVFAIL Responses For Different
Timeouts Specified in The Queries
Timeout (seconds)   Total IP addresses   TIMEOUT responses   Percentage
1                   500,127              73,813              14.75%
2                   500,117              68,850              13.76%
3                   500,117              69,439              13.88%
4                   500,117              67,590              13.51%
5                   489,739              77,739              15.87%
6                   480,089              63,765              13.28%
7                   480,089              65,507              13.64%
8                   480,089              67,614              14.08%
9                   480,089              66,541              13.86%
10                  480,089              68,569              14.28%
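As a rough illustration of how a reverse query with an explicit timeout can be issued, the sketch below builds the PTR query name with Python's ipaddress module and assembles a dig command line. The -x, +time, and +tries options are standard dig options; the function names, the choice of a single try, and the example addresses are assumptions, and the sketch only constructs the command without running it.

```python
import ipaddress


def ptr_name(ip):
    """Return the in-addr.arpa name that a reverse (PTR) query asks for."""
    return ipaddress.ip_address(ip).reverse_pointer


def dig_command(ip, server, timeout_s, tries=1):
    """Build the argument list for one reverse lookup through a given server.

    timeout_s is the per-query timeout in seconds; tries=1 (an assumption)
    keeps dig from silently retrying, so each timeout is counted once.
    """
    return [
        "dig", "-x", ip, "@" + server,
        "+time=%d" % timeout_s,
        "+tries=%d" % tries,
    ]
```

The returned list can be passed to subprocess.run with capture_output=True to collect the answer, authority, and statistics sections for later analysis.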
Methodology of Running Reverse DNS Queries
We produce 100,000 possible addresses from our list of IP address prefixes and divide
them into 4 parts. We pass these 4 lists of IP addresses to 4 new processes. Each process
spawns 500 threads, and its 25,000 addresses are subdivided into 500 parts so that
each thread is responsible for 25 IP addresses. Once all threads of the 4 processes complete, we
generate 100,000 more IP addresses. This process constitutes one loop, or cycle. We repeat this
process until all the prefixes and IP addresses are exhausted. The program is able to resolve
160-170 IP addresses per second. After every 300,000 addresses are resolved, we put the program
to sleep for 10 minutes so that the DNS servers don't blacklist us. For every reverse DNS query
(for each thread), a different DNS server is chosen randomly. On average, each DNS server is
queried 2-3 times a minute.
FIGURE 2. Number of TIMEOUTs Observed vs Timeout Value
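The work division inside one process can be sketched with a thread pool. This is an illustration under stated assumptions, not the script used here: lookup is a hypothetical callable that takes (ip, server) and returns the resolved name or an error status, the function names are invented, and the 10 minute pause between batches is left out.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def resolve_chunk(addresses, servers, lookup, rng):
    """Resolve one thread's share, picking a random server for every query."""
    return {ip: lookup(ip, rng.choice(servers)) for ip in addresses}


def run_cycle(addresses, servers, lookup, threads, seed=None):
    """Resolve all addresses using `threads` worker threads."""
    rng = random.Random(seed)
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        jobs = pool.map(
            lambda chunk: resolve_chunk(chunk, servers, lookup, rng),
            split_evenly(addresses, threads),
        )
        for partial in jobs:
            results.update(partial)
    return results
```

The same split_evenly helper could divide a batch among processes before each process divides its share among threads.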
Correcting The IP Addresses That Had Errors
Some of the reverse DNS requests result in the errors mentioned above, namely TIMEOUT
and SERVFAIL. We issue reverse DNS requests for these IP addresses again. To minimize
errors, we reduce the rate at which we query by making threads sleep for a random time.
We also choose only those servers which have a very high success rate. We use a list of around 600
DNS servers which have TIMEOUT error rates of less than 1%.
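A minimal sketch of this repair pass is given below. It assumes a hypothetical lookup callable that returns either a DNS name or one of the status strings TIMEOUT and SERVFAIL; the function name, the one second upper bound on the random sleep, and the injectable sleep and rng parameters are choices made for the example.

```python
import random
import time

ERROR_STATUSES = {"TIMEOUT", "SERVFAIL"}


def repair(failed_ips, good_servers, lookup, max_sleep_s=1.0,
           sleep=time.sleep, rng=random):
    """Re-query failed addresses slowly, using only highly reliable servers.

    Returns (repaired, still_failing): a dict of newly resolved names and
    a list of addresses that failed again.
    """
    repaired, still_failing = {}, []
    for ip in failed_ips:
        sleep(rng.uniform(0, max_sleep_s))  # random pause lowers the query rate
        answer = lookup(ip, rng.choice(good_servers))
        if answer in ERROR_STATUSES:
            still_failing.append(ip)
        else:
            repaired[ip] = answer
    return repaired, still_failing
```

Addresses returned in still_failing can be fed into another pass if needed.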
Creating Dictionaries of Interface Names, Router Function, Cities, etc.
The DNS names are composed of a wide variety of information. Broadly, the DNS names
have the following sections of information (if they are present) [14]:
Interface - Describes the type of the interface and possibly its speed, make, model, and
position (in terms of numbers).
Router Type - Contains information about the function of the router, for
example, border router or core router.
Location - Contains information about the location of the router.
The very first thing that needs to be done to analyze the names is to build a database
of all these codes so that when we encounter them in the DNS names, we can deduce
the information they carry. The CLLI codes [3] (which are not to be shared without the
permission of Telcodata.us [12]) are stored in the city clli table. The airport codes are stored in the
city decode table.
Decoding Interfaces As mentioned before, the interface names used by many ISPs
follow the naming standards of the vendor that makes their routers. Cisco and Juniper
are the major vendors of routers to ISPs. The Cisco and Juniper interface
naming conventions contain interesting details about how router interfaces are named.
The Juniper and Cisco interface types are listed in Tables 3 and 4. Juniper gives a
much more elaborate explanation of the interface naming procedure, whereas Cisco just lists the
interface types. In Juniper routers, the physical part of an interface name identifies the physical
device, which corresponds to a single physical network connector. This part of the interface
name has the format shown in Table 3 (only part of the table is shown here; the full table
can be found in [17]). Table 5 shows the Huawei router interface naming guidelines, and Table 4
shows the interface naming guidelines for Cisco routers. Neither Cisco nor Huawei states explicitly
how interfaces are to be named; they only give guidelines. Based on these guidelines, I
have built dictionaries for them. For example, F might denote an "FE/GE interface" on any of
Cisco, Juniper, or Huawei equipment, and L could denote a "Simplified Interface" on a Huawei device.
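To make the dictionary lookup concrete, the sketch below decodes the first label of a DNS name into an interface code and a position. The INTERFACE_TYPES entries are a small illustrative subset drawn from the vendor tables and the examples in Chapter I, not the dictionaries actually built for GINIE, and the regular expression is an assumption about how such labels are commonly laid out.

```python
import re

# Illustrative subset only (assumption); the real dictionaries are larger.
INTERFACE_TYPES = {
    "ae": "Aggregated Ethernet",       # Juniper
    "at": "ATM",                       # Juniper
    "fa": "Fast Ethernet, 100 Mbps",   # Cisco
    "te": "10 Gbps Ethernet",          # Cisco
}

# Letters, an optional dash, then dash-separated numbers, e.g. fa1-0-0 or se-4-0.
INTERFACE_RE = re.compile(r"^([a-z]+)-?(\d+(?:-\d+)*)$")


def parse_interface(label):
    """Decode an interface label, or return None if it does not look like one."""
    match = INTERFACE_RE.match(label.lower())
    if match is None:
        return None
    code, position = match.groups()
    return {
        "code": code,
        "type": INTERFACE_TYPES.get(code, "unknown"),
        "position": [int(part) for part in position.split("-")],
    }
```

The function is meant to be applied only to the first label of a name, because router codes such as hsa1 have the same letters-then-digits shape and would otherwise be mistaken for interfaces.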
TABLE 3. Juniper Interface Naming
Code    Description
ae      Aggregated Ethernet interface. This is a virtual aggregated link and has
        a different naming format from most PICs; for more information, see
        Aggregated Ethernet Interfaces Overview.
as      Aggregated SONET/SDH interface. This is a virtual aggregated link and
        has a different naming format from most PICs; for more information, see
        Configuring Aggregated SONET/SDH Interfaces.
at      ATM1 or ATM2 intelligent queuing (IQ) interface or a virtual ATM interface
        on a circuit emulation (CE) interface.
bcm     Gigabit Ethernet internal interface.
br      Integrated Services Digital Network (ISDN) interface (configured on a 1-port
        or 4-port ISDN Basic Rate Interface (BRI) card). This interface has a
        different naming format from most PICs: br-pim/0/port. The second number
        is always 0. For more information, see Configuring ISDN Physical Interface
        Properties.
cau4    Channelized AU-4 IQ interface (configured on the Channelized STM1 IQ or
        IQE PIC or Channelized OC12 IQ and IQE PICs).
ce1     Channelized E1 IQ interface (configured on the Channelized E1 IQ PIC or
        Channelized STM1 IQ or IQE PIC).
ci      Container interface.
coc1    Channelized OC1 IQ interface (configured on the Channelized OC12 IQ and
        IQE or Channelized OC3 IQ and IQE PICs).
coc3    Channelized OC3 IQ interface (configured on the Channelized OC3 IQ and
        IQE PICs).
coc12   Channelized OC12 IQ interface (configured on the Channelized OC12 IQ and
        IQE PICs).
coc48   Channelized OC48 interface (configured on the Channelized OC48 and
        Channelized OC48 IQE PICs).
cp      Collector interface (configured on the Monitoring Services II PIC).
cstm1   Channelized STM1 IQ interface (configured on the Channelized STM1 IQ or
        IQE PIC).
cstm4   Channelized STM4 IQ interface (configured on the Channelized OC12 IQ
        and IQE PICs).
cstm16  Channelized STM16 IQ interface (configured on the Channelized
        OC48/STM16 and Channelized OC48/STM16 IQE PICs).
ct1     Channelized T1 IQ interface (configured on the Channelized DS3 IQ and
        IQE PICs, Channelized OC3 IQ and IQE PICs, Channelized OC12 IQ
        and IQE PICs, or Channelized T1 IQ PIC).
ct3     Channelized T3 IQ interface (configured on the Channelized DS3 IQ and
        IQE PICs, Channelized OC3 IQ and IQE PICs, or Channelized OC12 IQ
        and IQE PICs).
demux   Interface that supports logical IP interfaces that use the IP source or
        destination address to demultiplex received packets. Only one demux
        interface (demux0) exists per chassis. All demux logical interfaces must
        be associated with an underlying logical interface.
TABLE 4. Cisco Interface Naming
Type Description
Null Null interface.
Analysis-module A Fast Ethernet interface that connects to the internal interface on the
Network Analysis Module (NAM).
Async Port line used as an asynchronous interface.
ATM ATM interface.
BRI ISDN BRI interface. This interface configuration propagates to each B
channel. B channels cannot be configured individually.
BVI Bridge-group virtual interface. BVI interfaces are used to route traffic at
Layer 3 to the interfaces in a bridge group.
Content-engine Content engine (CE) network module interface.
Dialer Dialer interface.
Ethernet Ethernet IEEE 802.3 interface.
Fast Ethernet 100-Mbps Ethernet interface.
FDDI Fiber Distributed Data Interface.
Gigabit Ethernet 1000-Mbps Ethernet interface.
Group-Async Master asynchronous interface. This interface type creates a single
asynchronous interface with which other interfaces are associated. This one-to-
many configuration enables you to configure all associated member interfaces
by configuring the master interface.
HSSI High-Speed Serial Interface.
Loopback A logical interface that emulates an interface that is always up. For example,
having a loopback interface on the router prevents a loss of adjacency with
neighboring OSPF routers if the physical interfaces on the router go down.
The name of a loopback interface must end with a number ranging from
0-2147483647.
Multilink Multilink interface. A logical interface used for multilink PPP (MLP).
Port channel Port channel interface. This interface type enables you to bundle multiple
point-to-point Fast Ethernet links into one logical link. It provides
bidirectional bandwidth of up to 800 Mbps.
POS Packet OC-3 interface on the Packet-over-SONET (POS) interface processor.
PRI ISDN PRI interface. Includes 23/30 B-channels and one D-channel.
Serial Serial interface.
Switch Switch interface.
Ten Gigabit Ethernet 10000-Mbps Ethernet interface.
Token Ring Token Ring interface.
Tunnel Tunnel interface.
VG-AnyLAN 100VG-AnyLAN port adapter.
VLAN Virtual LAN subinterface.
Virtual Template Virtual template interface. When a user dials in, a predefined configuration
template is used to configure a virtual access interface; when the user is done,
the virtual access interface goes down and the resources are freed for other
dial-in uses.
TABLE 5. Huawei Interfaces And Their Meanings.
Field | Meaning | Description
A | Product name | AR: application and access routers
B | Hardware platform type. The value can be 1 or 2. | 1: four LAN interfaces; 2: eight LAN interfaces
C | Combines with B to indicate different router series using the same hardware platform. The following router series are available: | 15: 4*FE LAN interface series; 16: 4*GE LAN interface series; 20: 8*FE LAN interface series
D | Type of major or fixed uplink interfaces on the router | 1: FE or GE; 6: ADSL-B/J; 7: ADSL-A/M; 8: G.SHDSL; 9: VDSL over POTS
E | Other interface types supported by the router. This field is optional. | E: enhanced major uplink interface (dual-uplink or two-wire/four-wire DSL enhanced); F: uplink GE combo interface; G: uplink wireless interface (GPRS, 3G, or LTE); V: voice interface; W: Wi-Fi access interface
F | Extended information about the router. This field is optional. NOTE: This field starts with and specifies supplementary interface descriptions or other possible configurations. | HSPA+7: WCDMA HSPA+7 3G standard; C: CDMA2000 3G standard; D: DC model; P: PoE supported; L: FDD-LTE, a European standard
A | Product name | AR: application and access routers
B | Hardware platform series code | Currently, three router series are available: 1, 2 and 3. A larger value indicates higher performance.
C | Hardware platform type | 2: modular router
D | Maximum number of slots supported by the router | AR1200 series: D indicates the maximum number of SIC slots supported. AR2200/3200 series: D indicates the maximum number of XISC slots supported. NOTE: D can be 0, indicating the cost-effective router model with fixed uplink interfaces or reduced number of slots. E represents the number of fixed uplink interfaces and or reduced number of slots.
E | Fixed uplink interfaces on the router | 1: FE/GE; 2: E1/SA; 4: four SIC slots. NOTE: If E is 0, the device has no fixed uplink interface.
F | Other interface types supported by the router. This field is optional. | F: FE LAN interface; L: simplified interface; V: fixed voice interface; W: fixed Wi-Fi access interface
G | Extended information about the router. This field is optional. NOTE: This field starts with and specifies supplementary interface descriptions or other possible configurations. | A: AC model (AC is the default configuration, and this field can be omitted in AC models.); D: DC model; 48FE: 48 fixed 100M switching ports
We stored these descriptions of interface names and their types in our database. Once the
interface types of Cisco and the interface naming conventions of Juniper are known, it is fairly
easy to make a reasonably accurate guess of the type of interface present in a DNS name, since
every name can be checked against these values. This brings us a step closer to a technique that
interprets DNS names automatically, without a human writing rules. (The current procedure
requires writing rules or regular expressions that explain the classes of DNS names and their
meanings.)
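The check of a name segment against the stored interface types can be sketched as follows. This
is a minimal illustration and not the GINIE implementation: the dictionary holds only a few
prefixes taken from Table 3, the short descriptions are abbreviated, and the function name
interface_type is hypothetical.

```python
import re

# Hypothetical excerpt of the interface table. The prefixes come from
# Table 3 (Juniper); the descriptions are shortened for illustration.
INTERFACE_TYPES = {
    "at": "ATM interface",
    "bcm": "Gigabit Ethernet internal interface",
    "br": "ISDN interface",
    "ci": "Container interface",
    "cp": "Collector interface",
}


def interface_type(segment):
    """Return the interface description for a name segment, or None."""
    # Digits are removed before the lookup, so "at1" is looked up as "at".
    code = re.sub(r"\d", "", segment.lower())
    return INTERFACE_TYPES.get(code)


print(interface_type("at1"))      # a known prefix followed by a number
print(interface_type("london2"))  # not an interface prefix
```

Because digits are stripped first, prefixes that themselves contain digits (such as coc12) would
need separate handling; the sketch leaves that case out.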
Decoding Router Function
Some DNS names encode the function that the router performs, such as border or gateway.
The codes for these routers are usually br, gw, etc. We store this router information in a table
for later use while parsing the DNS names. An example of such a DNS name is
3e-company.edge2.chicago2.level3.net, which indicates an edge router. Most of the information
required to decode the router function is derived from the regular expressions given in [14].
The rest is based on observation, for example that some of the names contain ’core’, ’gateway’,
’border’, etc.
Decoding Cities
City or region information is abundantly available in DNS names. It is present in four
forms. In the first form, the city name is fully spelled out, e.g.,
8-2-9.ear1.amsterdam1.level3.net. We download the database of world city names from
geonames.org [5]. The full database, which includes all the cities and their details, is too large
(9,115,154 cities). ISPs are unlikely to host their routers in cities with a population below
5,000 (an assumption based on our observations), so we use only the cities with a population of
5,000 or more. The reduced database has 57,021 entries, which also speeds up our parser. In the
second form, the location appears as a 3-letter airport code. For example, in
212-162-17-225.edge3.dus1-ge-500, dus is the airport code of Dusseldorf International Airport,
Germany. It indicates that the router is located in the same city as the airport. We store the
world’s airport codes from airportcodes.org [1] in the database for later use. There are 3,833
airport codes. In the third form, a 4-letter city code is combined with a 2-letter state code.
Upon some research, we found that these are mostly CLLI (Common Language Location Identifier)
codes [3], which are used in the North American telecommunications industry. These codes are
currently owned by telcodata (telcordia telecommunications database) [12]. There are 22,223 CLLI
codes. The fourth form is the 2-letter code of a US state, e.g.,
141-51-97-67-cust-ny.nuvisions.net. This name states that the router is located somewhere in
New York.
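The population filter on the city database can be sketched as below. This is an illustration
under stated assumptions: the column positions (name in column 1, population in column 14 of a
tab-separated file) are our reading of the GeoNames dump layout and should be verified against
the readme of the downloaded file, and the two sample rows are made up.

```python
import csv

# Assumed column positions in a GeoNames-style tab-separated dump.
NAME_COL, POPULATION_COL = 1, 14
MIN_POPULATION = 5000


def load_city_names(lines, min_population=MIN_POPULATION):
    """Return the lowercase names of cities with enough inhabitants."""
    cities = set()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) <= POPULATION_COL:
            continue  # skip malformed rows
        try:
            population = int(row[POPULATION_COL])
        except ValueError:
            continue  # skip rows without a numeric population
        if population >= min_population:
            cities.add(row[NAME_COL].lower())
    return cities


def make_row(name, population):
    """Build one made-up sample row with 19 tab-separated columns."""
    cols = [""] * 19
    cols[NAME_COL] = name
    cols[POPULATION_COL] = str(population)
    return "\t".join(cols)


# Made-up sample data: one large city and one place below the threshold.
sample = [make_row("Amsterdam", 700000), make_row("Smallville", 1200)]
cities = load_city_names(sample)
print(sorted(cities))
```

Keeping the names in a set gives constant-time membership tests, which matters when every
segment of every DNS name is checked against the list.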
Parsing DNS Names
Once we have all the DNS names, we run the parser in two passes. In the first pass we split
each name by both "." and "-". In the second pass we split by "." only. In each pass we parse
every DNS name to extract the embedded information as follows:
1. We extract each part of the name that is separated by a "." or a "-".
2. The two rightmost parts should be the top-level domain (e.g., com) and the ISP name (or
something else for leased addresses). We group names by these two rightmost parts.
3. We take the segments up to position [half of the size of the array of name segments] + 1
and check them against Cisco’s, Juniper’s and Huawei’s conventions for interface
naming. To avoid conflicts between interface codes and location codes, we assume that the
interface name takes precedence if it appears in the first half of the name. This is because
interface names are always at the beginning of DNS names (if numbers are present, we ignore
them, which makes the interface name the first entity in the name).
4. We check against 3-letter airport codes. This is a standard code list taken from
airportcodes.org [1]. We have 3613 airport codes.
5. We check against CLLI codes. These are proprietary location codes maintained by Telcodata
[12] and used in telecommunication. We have 22,223 CLLI codes of cities. Examples are
DLLSTX, which is the code for Dallas, Texas, and STTLWA, STTMWA and STTNWA, which all
stand for Seattle, Washington. For example, in the DNS name
"evrtwa1-ar2-4-62-114-149.cv.dsl.gtei.net." evrtwa stands for Everett, Washington.
6. We check against world city names with a population greater than 5,000, obtained from
geonames.org [5]. There are 57,021 such city names.
7. We check against the 2-letter state codes of the United States, obtained from Wikipedia [11].
8. We repeat the above process in the second pass, splitting only by ".". If this pass parses
the data more successfully, we use its results and ignore those of the first pass.
For example, consider the name s11-0-3-0.london2-cr2.bbnplanet.net. We first split this
name by "." and "-". ’s11’ is an interface of type ’s’, which is a serial interface under
Cisco’s serial interface naming convention. The segments 0, 3 and 0 are not interpreted; they
are ignored. london2 is stripped of its digits and checked against the city names. ’cr’ gives the
router function, a ’Core Router’. bbnplanet.net is the company to which this part of the address
space of Level3 Communications is leased.
We follow certain rules while parsing names.
1. If a segment (split by "." alone or by "." and "-", depending on the pass) contains only
digits, we ignore that segment.
2. We strip all digits from a code before comparing it to our hashmap of codes.
3. If a code is directly adjacent to other letters (without a separator in between), our
method will not find it. An example is the airport code SFO embedded in a longer string
such as airportSFO. In our observation, about 99% of the names have separators between the
logical codes inside a name, so we almost never find codes that are not delimited by
separators.
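The two passes and the parsing rules can be sketched together as follows. This is a simplified
illustration and not the GINIE code: the lookup tables contain only a handful of entries taken
from the examples in this chapter, the position of the router function check in the lookup order
is our assumption, and the names classify and parse are hypothetical.

```python
import re

# Tiny illustrative lookup tables; the real ones hold thousands of entries.
INTERFACES = {"s", "at", "br"}
ROUTER_FUNCTIONS = {"cr", "br", "gw", "edge", "core", "gateway", "border"}
AIRPORTS = {"dus", "sfo"}
CLLI = {"evrtwa", "dllstx", "sttlwa"}
CITIES = {"london", "amsterdam", "chicago"}
STATES = {"ny", "wa"}


def classify(segments):
    """Label the segments of one name; returns a list of (segment, label)."""
    labels = []
    # Interface codes are only accepted in the first half of the name.
    first_half = len(segments) // 2 + 1
    for index, segment in enumerate(segments):
        if segment.isdigit():
            continue  # rule 1: purely numeric segments are ignored
        code = re.sub(r"\d", "", segment.lower())  # rule 2: strip digits
        if index < first_half and code in INTERFACES:
            labels.append((segment, "interface"))
        elif code in ROUTER_FUNCTIONS:  # assumed position in the order
            labels.append((segment, "router function"))
        elif code in AIRPORTS:
            labels.append((segment, "airport code"))
        elif code in CLLI:
            labels.append((segment, "CLLI code"))
        elif code in CITIES:
            labels.append((segment, "city"))
        elif code in STATES:
            labels.append((segment, "state code"))
    return labels


def parse(name):
    """Run both passes and keep the one that recognises more segments."""
    parts = name.rstrip(".").split(".")
    domain, host = parts[-2:], parts[:-2]  # two rightmost parts: the domain
    first = classify(re.split(r"[.-]", ".".join(host)))  # split by . and -
    second = classify(host)                              # split by . only
    best = second if len(second) > len(first) else first
    return ".".join(domain), best


print(parse("s11-0-3-0.london2-cr2.bbnplanet.net"))
print(parse("airportsfo.example.net"))  # rule 3: embedded code is not found
```

Each segment receives at most one label, so the order of the checks decides how a code such as
br is interpreted when it appears in more than one table.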
CHAPTER IV
CHARACTERISTICS OF INDIVIDUAL ISPS
Selection of ISPs
The ASes used in our study are listed in Table 6. We selected them by resolving the reverse
DNS names of a small sample of the addresses in each AS and checking whether the names have a
well-defined naming structure.
TABLE 6. ISPs and Their Details
ASN ISP Name Address Space Size
174 COGENT Cogent/PSI 19,984,128
701 UUNET - MCI Communications Services Inc. d/b/a Verizon Business 37,264,384
702 AS702 Verizon Business EMEA - Commercial IP service provider in Europe 6,960,128
703 UUNET - MCI Communications Services Inc. d/b/a Verizon Business 877,056
1239 AS1239 SprintLink Global Network 11,355,200
3356 LEVEL3 Level 3 Communications 10,933,760
5650 FRONTIER-FRTR - Frontier Communications of America Inc. 5,498,368
7018 ATT-INTERNET4 - AT&T Services Inc. 64,134,401
7922 COMCAST-7922 - Comcast Cable Communications Inc. 69,029,376
22394 CELLCO - Cellco Partnership DBA Verizon Wireless 17,186,816
25899 LSNET - LS Networks 186,112
7385 INTEGRATELECOM - Integra Telecom Inc. 1,801,728
Table 7 shows the status messages for Level3, Verizon and Cogent after the first run. Because
we observe many TIMEOUT and SERVFAIL errors in the first run, we query the erroneous results
again in a second run. Verizon has a very high percentage of DNS names and a very low error
rate, followed by Level3 and Cogent. In the first run, we focus on the speed of our reverse DNS
name resolver so that it can cover a large IP address space. In the second run, we select the
DNS servers that have an error rate of less than 1% (about 600 such DNS servers) and run our
reverse DNS name resolver again at a much lower speed, using fewer threads and inserting waits.
We also query the Google DNS server (8.8.8.8) whenever we encounter a SERVFAIL or TIMEOUT, as a
last check before storing the result as SERVFAIL or TIMEOUT. In the repair run (second run),
64,333 new DNS names are found in Verizon, 62,298 in Cogent and 15,739 in Level3. Table 8 shows
the status distribution for Level3, Verizon and Cogent after the repair run.
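The logic of the repair run can be sketched as follows. This is an illustration only: the two
lookup functions are passed in by the caller and are stand-ins for a slow resolver and for a
query to a public resolver, the function name repair is hypothetical, and the addresses and
names in the example are placeholders.

```python
RETRY_STATUSES = {"SERVFAIL", "TIMEOUT"}


def repair(first_run, slow_lookup, fallback_lookup):
    """Re-query failed addresses.

    first_run maps an IP address to a (status, name) tuple. Both lookup
    arguments are caller-supplied functions that take an IP address and
    return a (status, name) tuple.
    """
    repaired = dict(first_run)
    for address, (status, _name) in first_run.items():
        if status not in RETRY_STATUSES:
            continue  # only erroneous results are queried again
        status, name = slow_lookup(address)
        if status in RETRY_STATUSES:
            # Last check before the error is stored as the final result.
            status, name = fallback_lookup(address)
        repaired[address] = (status, name)
    return repaired


# Placeholder data and stub resolvers, for illustration only.
first_run = {
    "192.0.2.1": ("NOERROR", "host-1.example.net."),
    "192.0.2.2": ("TIMEOUT", None),
    "192.0.2.3": ("SERVFAIL", None),
}


def slow_lookup(address):
    if address == "192.0.2.2":
        return ("NOERROR", "host-2.example.net.")
    return ("SERVFAIL", None)


def fallback_lookup(address):
    return ("SERVFAIL", None)


result = repair(first_run, slow_lookup, fallback_lookup)
print(result)
```

Passing the resolvers in as functions keeps the retry policy separate from the DNS library that
performs the actual queries.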
TABLE 7. ISPs with Status Distribution Before Repair
Status Message Level3 Verizon Cogent
NOERROR 1,465,536 13.40 % 2,721,661 86.55 % 1,055,825 5.28 %
NXDOMAIN 7,582,902 69.35 % 235,390 7.48 % 14,439,522 72.33 %
REFUSED 516,450 4.74 % 48,023 1.53 % 1,055,110 5.28 %
SERVFAIL 952,011 8.74 % 29,626 0.94 % 773,962 3.87 %
TIMEOUT 1,592,910 14.63 % 109,973 3.49 % 2,638,001 13.21 %
TABLE 8. ISPs with Status Distribution After Repair
Status Message Level3 Verizon Cogent
NOERROR 1,477,998 13.51 % 2,786,216 88.6 % 1,108,234 5.55 %
NXDOMAIN 7,745,486 70.84 % 269,254 8.56 % 17,627,692 88.30 %
REFUSED 16,282 0.14 % 48,520 1.54 % 79,754 0.39 %
SERVFAIL 1,691,460 15.47 % 40,666 1.29 % 1,145,655 5.73 %
TIMEOUT 2,047 0.01 % 17 0.0005 % 1,085 0.0054 %
Table 9 shows the number of distinct domains found in each ISP. For example, gsa.gov is a
domain found in the Level3 AS.
TABLE 9. Count of All domains in Level3, Verizon and Cogent
ISP Number of companies
Level3 14,403
Verizon 3,759
Cogent 29,908
Table 9 gives the number of different domains. Figures 3, 4 and 5 plot the different domains
and their sizes. The green plot shows the maximum size of the inferred prefix and the blue line
shows the number of DNS names we found in that prefix (in other words, the utilization of that
prefix). The domains are on the x-axis and are serially indexed. The size of the domains is
shown on the y-axis, which uses a log scale to fit all sizes. We found around 30,000 different
prefix lists in Level3, around 3,700 for Verizon and around 35,000 for Cogent.
FIGURE 3. Level3 Inferred Prefixes and Their Size Distribution
Level3 Communications
Domain Distribution
Level3 Communications is a major tier 1 ISP. Some of the statistics uncovered for this ISP
are given below. The number of distinct domains found is 23,941.
The subnets of the address space taken up by each of those domains are shown in Table 10.
The complete results will be shared with the research community. This classification is
important because names in the same domain usually follow the same naming conventions, which
helps in writing the rules.
DNS Name Count
The total number of addresses that resolve to DNS names is 1,427,358 out of
10,933,760 IP addresses. This is only about 13% of the address space. The IP addresses that
don’t resolve to DNS names are grouped by prefix and subnet length in the null subnet level3
table. Table 10 gives a picture of the distribution for the different companies/domains that
have IP addresses belonging to the Level3 address space. For example, ’gsa.gov’ has 3,970
entries; a domain can appear in different subnets, and the table shows the number of IP
addresses in each subnet. The complete results are stored in the database for every such
company. The Prefix column shows the prefix in which the DNS names containing the domain are
found. The Count of Addrs column shows the number of DNS names/IP addresses whose names contain
that domain.
FIGURE 4. Verizon Inferred Prefixes and Their Size Distribution
TABLE 10. Level3-Domain-Subnet Coverage
Domain Prefix Count of Addrs
Level3.net 63.211.96.0/19 3,972
Level3.net 64.154.64.0/19 3,766
Level3.net 63.208.231.192/26 57
gsa.gov 205.130.224.0/19 3,970
buffalo.edu 8.35.160.0/20 3,959
Level3.net 63.214.128.0/19 3,950
Level3.net 209.246.0.0/15 3,918
fibrant.com 8.25.224.0/19 3,868
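The relation between a prefix and the number of names found in it can be illustrated with three
rows of Table 10. The sketch below assumes that utilization means the number of addresses with a
DNS name divided by the number of addresses in the prefix; this definition and the function name
utilization are ours.

```python
import ipaddress

# (domain, prefix, count of addresses with names), taken from Table 10.
rows = [
    ("Level3.net", "63.211.96.0/19", 3972),
    ("Level3.net", "63.208.231.192/26", 57),
    ("gsa.gov", "205.130.224.0/19", 3970),
]


def utilization(prefix, named_addresses):
    """Fraction of the addresses in the prefix that have a DNS name."""
    size = ipaddress.ip_network(prefix).num_addresses
    return named_addresses / size


for domain, prefix, count in rows:
    print(f"{domain:12} {prefix:20} {utilization(prefix, count):.1%}")
```

A /19 prefix holds 8,192 addresses and a /26 prefix holds 64, so the same count of names can
correspond to very different utilization.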
Table 11 shows the top 10 domains in Level3 with their size and the percentage of names in
each domain. The table lists the different companies/domains in descending order of their
size. A large portion of the names are in the Level3.net domain, but a significant fraction of
the IP address space is used by other companies. Figure 6 shows the distribution of the
different domains/companies and their counts. All domains with a count of 1 are omitted for
clarity.
FIGURE 5. Cogent Inferred Prefixes and Their Size Distribution
TABLE 11. Top 10 Domains in Level3 and Their Distribution
Domain Count Percentage
all domains 1,447,628 100 %
Level3.net 1,036,156 71.57 %
gsa.gov 9,920 0.68 %
buffalo.edu 4,635 0.32 %
fibrant.com 3,868 0.26 %
bbnplanet.net 13,792 0.95 %
gtei.net 11,171 0.77 %
Table 12 shows Level3’s DNS names and the distribution of the components in the names,
such as the interface, router function, city names, state codes and others. It also shows the
dictionary words present in the names that aren’t covered by any of the other categories.
The last column gives the number of times a part of a name was checked against a component
type, and the second column gives the number of times that component type was found. There are
1,447,628 IP addresses that have DNS names. Some of the DNS names contain multiple dictionary
words. Hence, to measure the success in finding dictionary words, we need the total number of
parts in all the DNS names when we split them by "." and "-". Table 13 shows this number.
FIGURE 6. Level3 - Domains and Their Size Distribution
TABLE 12. Level3 - Parsed DNS Names
Information gathered Count Percentage Number of Checks
Total no.of DNS Names 1,447,628 - -
Interface 105,996 7.3 % 2,215,277
Router Function 33,303 2.3% 2,801,113
City Names 78,796 5.44 % 2,783,273
City CLLI 34,084 2.35 % 2,771,077
Airport Codes 53,506 3.69 % 2,772,813
State Codes 51,557 3.56 % 2,770,403
TABLE 13. Level3 - Parsed DNS Names (others)
type of information Count Percentage
Number of Segments 2,788,122 -
Dictionary Words 1,117,297 40.07 %
Others 356,485 12.78 %
Figures 7 and 8 give a pictorial representation of Tables 12 and 13. The "others" bar
shows the number of name segments that couldn’t be classified into any of the interface, router
or city categories. Among the "others" are dictionary words that could tell us more about the
DNS names. The "others" section contains an unusually large number of parts, and a large share
of them are dictionary words. This points to a situation where there is some kind of pattern,
but it is not consistent. The names, and the others section for Level3 in particular, have to be
studied more closely to analyze this further.
FIGURE 7. Level3 - Names Distribution
Figure 9 shows Level3’s region information more clearly, along with the interface and router
function categories. A fair number of names contain interface and location information. Many
names also contain fully spelled-out city names.
Table 14 shows Level3’s dictionary words, their number of occurrences and an example
DNS name. An unusually large number of DNS names contain the word unknown, which suggests
that reverse DNS has not been properly configured for those addresses. A fair number of names
indicate hosts, static addresses, mail servers, etc.
FIGURE 8. Level3 - Names Distribution(Others)
TABLE 14. Level3 - Most Occurring Dictionary Words
ASN Word Number of Occurrences DNS name
3356 unknown 1,002,616 unknown.level3.net.
3356 host 17,115 host-23.kletos.net.
3356 static 11,247 static.34k.dscga.com.
3356 dynamic 9,781 bj-dynamic-245.sys.gtei.net.
3356 bc 6,140 8-6-93-255-bc.redplaid.com.
3356 mail 6,059 mail.clearwaterhousingauth.org.
3356 wireless 4,076 db-wireless.car1.minneapolis1.level3.net.
3356 voice 3,618 voice-retri.edge6.dallas1.level3.net.
3356 domain 3,306 waident-exch2.domain.waident.com.
3356 customer 2,885 customer-co.edge1.minneapolis1.level3.net.
3356 unassigned 2,426 unassigned-183.e.active.com.
3356 reverse 2,235 reverse.vetronix.com.
3356 unused 1,770 8-23-128-124-unused.phx.unsi.net.
3356 dial 1,718 dial-800-ll.car1.dallas1.level3.net.
3356 deploy 1,259 a8-17-144-105.deploy.akamaitechnologies.com.
Table 15 and Figure 10 show Level3’s CDF of information parsed. The x-axis shows the
number of items of information parsed. Since this is a CDF, the x-axis shows bins. The first bin
(x-axis from 0 to 1) is the number of DNS names that have no items of information. The second
bin (x-axis from 1 to 2) shows the number of DNS names that have 0 or 1 items of information,
and so on. Since the "others" section has a high number of name parts, it is clear that we
couldn’t infer any interface, router or location information for 86% of the names; this can be
seen in the first bin (x-axis from 0 to 1).
FIGURE 9. Level3 - Parts Distribution
TABLE 15. Level3 - CDF of Information Parsed
Type Count Percentage
No inference 1,250,287 86.36 %
At least one 96,207 6.64 %
At least two 45,917 3.17 %
At least three 51,672 3.56 %
At least four 3,540 0.24 %
Five and above 5 0.0003 %
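The counts in Table 15 add up to the 1,447,628 DNS names of Level3, so each row can be read as
the number of names that fall into one bin. Under that reading, which is our interpretation of
the table, the cumulative shares plotted as a CDF can be derived as sketched below.

```python
from itertools import accumulate

# Counts from Table 15, indexed by the number of items of information
# found in a name (0, 1, 2, 3, 4, and 5 or more in the last bin).
counts = [1_250_287, 96_207, 45_917, 51_672, 3_540, 5]
total = sum(counts)

# Running totals give the number of names with at most that many items.
cumulative = list(accumulate(counts))
cdf = [value / total for value in cumulative]

for items, share in enumerate(cdf):
    print(f"bin {items}: {share:.2%} of the names")
```

The first value is the share of names without any inferred item, and the last value is always 1
because every name falls into some bin.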
Verizon
Table 16 shows the top 10 domains in Verizon with their size and the percentage of names
in each domain. Figure 11 shows all the domains whose size is greater than 1, in increasing
order of size. Some of the domains clearly have a large size; these tend to be big companies.
FIGURE 10. Level3 - CDF of Extracted Information
TABLE 16. Top 10 Domains in Verizon and Their Distribution
Domain Count Percentage
all domains 3,144,673 100 %
verizon.net 2,601,667 82.73 %
verizon-gni.net 44,312 1.41 %
ALTER.NET 37,059 1.18 %
ba-dsg.net 1,474 0.04 %
nisgroup.com 490 0.01 %
airg.com 475 0.01 %
bellatlantic.net 347 0.01 %
algorithmics.com 245 0.007 %
dwoskin.com 240 0.007 %
Table 17 shows Verizon’s DNS names and the distribution of the components in the
names, such as the interface, router function, city names, state codes and others. It also
shows the dictionary words present in the names that aren’t covered by any of the other
categories. The last column gives the number of times a segment of a name was checked against a
component type, and the second column gives the number of times that component type was found.
FIGURE 11. Verizon - Domains and Their Size Distribution
TABLE 17. Verizon - Parsed DNS Names
Information gathered Count Percentage Number of Checks
Total no. of DNS Names 2,703,582 - -
Interface 102,507 7.18 % 10,855,520
Router Function 31,784 2.22% 18,814,333
City Names 84,773 5.44 % 18,794,999
City CLLI 33,243 2.32 % 18,794,865
Airport Codes 70,706 4.95 % 18,801,068
State Codes 95,068 6.66 % 18,794,865
TABLE 18. Verizon - Parsed DNS Names(Others)
Type of Info Count Percentage
Number of Segments 18,814,155 -
Others 2,148,405 11.41 %
Dictionary Words 3,324,352 17.66 %
Figure 12 gives a pictorial representation of Table 17. Figure 13 shows the number of
name segments that couldn’t be classified into any of the interface, router or city categories.
These include dictionary words that could tell us more about the DNS names upon further
analysis. Some of the names contain more than one dictionary word. Hence, to calculate the
percentage of dictionary words and the percentage of "others", we had to count the total number
of parts that we check across our list of DNS names. This total is shown as the grey bar.
FIGURE 12. Verizon - Names Distribution
Figure 14 shows the parts of the names with region information and their distribution more
clearly. A very high percentage of the name parts fall into the others section, and most of them
are dictionary words. This means that the DNS names have structure, but the structure doesn’t
tell much about the interface or city. It is possible that it describes the router function, but
the description of the router function varies from ISP to ISP. Since there is no consistent
naming format, we can’t classify these parts. A detailed analysis of the others section, along
with other types of measurement, could help to understand these names.
Table 19 shows Verizon’s most frequently occurring dictionary words, their number of
occurrences and an example name.
FIGURE 13. Verizon - Names Distribution(Others)
TABLE 19. Verizon - Most Occurring Dictionary Words
ASN Word Number of occurrences DNS name
702 pool 2,321,007 pool-71-165-110-143.lsanca.fios.verizon.net.
702 east 547,152 pool-71-174-0-142.bstnma.east.verizon.net.
702 static 280,704 static-71-165-70-129.lsanca.dsl-w.verizon.net.
702 customer 13,312 customer.bpsoft.com.
702 client 2,365 client-141-156-58-9.ba-dsg.net.
702 internet 1,845 internet-gw.customer.alter.net.
702 bb 1,737 so-7-3-0-0.lax01-bb-rtr1.verizon-gni.net.
702 mail 1,642 mail.abtinc.com.
702 broadcast 386 broadcast.alter.net.
702 reed 181 smtp17.reed-ian-swx.com.
702 digital 161 smtp29.digital.reinforcedplastics.com.
702 charming 135 charming-gw.customer.alter.net.
702 response 119 email1.response.sdgroup.eu.com.
Table 20 and Figure 15 show Verizon’s CDF of information parsed. The x-axis shows the
number of items of information parsed. Since this is a CDF, the x-axis shows bins. The first bin
(x-axis from 0 to 1) is the number of DNS names that have no items of information. The second
bin (x-axis from 1 to 2) shows the number of DNS names that have 0 or 1 items of information,
and so on. The figure spikes at the second bin on the x-axis, which shows that a large number of
Verizon’s DNS names contain at least one item of information.
FIGURE 14. Verizon - Names Distribution
TABLE 20. Verizon - CDF of Information Parsed
Type Count Percentage
No inference 21,192 0.78 %
At least one 2,619,897 96.9 %
At least two 49,039 1.81 %
At least three 13,297 0.49 %
At least four 138 0.0051 %
Five and above 19 0.0007 %
Cogent
Table 21 shows Cogent’s DNS names and the distribution of the components in the names,
such as the interface, router function, city names, state codes and others. It also shows the
dictionary words present in the names that aren’t covered by any of the other categories.
The last column gives the number of times a segment of a name was checked against a component
type, and the second column gives the number of times that component type was found.
FIGURE 15. Verizon - CDF of Extracted Information
TABLE 21. Cogent - Parsed DNS Names
Information gathered Count Percentage Number of Checks
Total no.of DNS Names 585,674 100 % 0
Interface 117,187 20 % 1,340,222
Router Function 10,261 1.75% 1,873,504
City Names 24,884 4.24 % 1,863,316
City CLLI 304 0.05 % 1,862,845
Airport Codes 90,370 15.43 % 1,865,624
State Codes 24,106 4.11 % 1,862,844
TABLE 22. Cogent - Parsed DNS Names(Others)
Type of Info Count Percentage
Number of Segments 1,872,904 -
Others 45,523 2.43 %
Dictionary Words 180,224 9.6 %
Figure 16 gives a pictorial representation of Table 21. The number of checks is scaled
down to 10% of its original size so that it fits the plot. In Figure 17, the "others" bar shows
the number of name segments that couldn’t be classified into any of the interface, router or
city categories. These include dictionary words that could tell us more about the DNS names upon
further investigation. A sample list of the most frequently occurring dictionary words is shown
later. A DNS name can contain more than one dictionary word. Hence we count the name segments in
all the DNS names we encounter and calculate the percentage of segments that are dictionary
words. The grey bar shows the number of segments, for comparison with the number of dictionary
words found.
FIGURE 16. Cogent - Names Distribution
Figure 18 shows Cogent’s region information more clearly, along with the interface and router
function categories.
Table 23 shows Cogent’s CDF of information parsed. The x-axis shows the number of
items of information parsed. Since this is a CDF, the x-axis shows bins. The first bin (x-axis
from 0 to 1) is the number of DNS names that have no items of information. The second bin
(x-axis from 1 to 2) shows the number of DNS names that have 0 or 1 items of information, and so
on. A spike in the first bin shows that many DNS names don’t carry any specific information.
TABLE 23. Cogent - CDF of Information Parsed
Type Count Percentage
No inference 416,346 71.08 %
At least one 92,224 15.74 %
At least two 60,321 10.29%
At least three 14,493 2.47 %
At least four 2,290 0.39 %
Five and above 0 0 %
FIGURE 17. Cogent - Names Distribution (Others)
Table 24 shows Cogent’s most frequently occurring dictionary words, their number of
occurrences and an example name.
TABLE 24. Cogent - Most Occurring Dictionary Words
ASN Word Number of occurrences DNS name
174 atlas 56,531 gi0-0-0-18.202.nr11.b022073-0.ord01.atlas.cogentco.com.
174 static 19,630 153.38-89-161.static.servergrove.com.
174 host 13,140 host-38.80.71.016.mmcm.com.
174 mail 10,825 mail.amnow.com.
174 dynamic 8,420 dynamic-capital-management.demarc.cogentco.com.
174 cable 7,753 38-82-64-141-cable.cybercable.net.mx.
174 wireless 7,311 wireless.telebright.com.
174 unassigned 5,654 38.69.129.164.unassigned.neptunetg.com.129.69.38.in-addr.arpa.
174 reverse 3,467 181-18-68-38-static.reverse.queryfoundry.net.
174 domain 3,255 domain.not.configured.
174 customer 3,024 customer.hostiserver.com.
174 user 2,345 a.user.bayweb.com.
174 red 1,955 red.rentpayment.com.
174 net 1,021 38.89.246.0.cirbn.net.246.89.38.in-addr.arpa.
174 sac 988 sac-capital-adviser-llc.demarc.cogentco.com.
174 tnt 979 tnt-38-113-28-191.worldpath.net.
174 port 962 port-chan-1-23.core1.cvg1.zimcom.net.
Table 25 compares Cogent’s region information with that of IP2Location. The first column
shows the number of times the extracted location matches the IP2Location data and the second
column shows the number of times it doesn’t match.
FIGURE 18. Cogent - Parts Distribution
TABLE 25. Cogent - Place Matches and Mismatches with IP2Location Data
Type Match Count Mismatch Count
airport code 10,796 72,245
city 1,564 23,320
state 1,645 6,599
FIGURE 19. Cogent - CDF of Extracted Information
CHAPTER V
CROSS ISP VS CAIDA DATASET ANALYSIS
The Center for Applied Internet Data Analysis (CAIDA) runs many projects that perform active
and passive Internet measurements. One of them is the CAIDA DNS lookup [2]. CAIDA performs
millions of DNS lookups every day from a managed central server. Other CAIDA projects use alias
resolution techniques to find the topology of the network. The DNS lookups are performed soon
after the topology traces, on the assumption that the network is then still in the same state
during DNS name resolution as it was during the trace. An IP address is not looked up again if
it has been successfully looked up in the last 7 days.
Several teams of monitors produce the IPv4 Routed /24 Topology Dataset from which
they derive this DNS Names data. These teams independently probe every routed /24 in the
IPv4 address space (one pass through every routed /24 is called a cycle). Because different teams
have different members, locations, and capabilities, each team completes a cycle at a different
rate.
The DNS Names data is collected on a per-day basis. Only a loose connection exists
between the topology traces and the DNS names because the topology data exists on a per-team
and per-cycle basis.
Observations of The CAIDA Dataset
We work on two datasets: an older one collected on 08-31-2012 and a newer one. In the data,
hostnames from successful lookups are shown in lowercase, and IP addresses whose lookups
resulted in errors are marked with uppercase names. Some examples of the error markers are:
FAIL.NON-AUTHORITATIVE.in-addr.arpa : Equivalent to the NXDOMAIN we encounter with dig
FAIL.SERVER-FAILURE.in-addr.arpa : Equivalent to SERVFAIL with dig
FAIL.TIMEOUT.in-addr.arpa : Equivalent to TIMEOUT with dig
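The three markers above can be mapped to lookup outcomes with a small helper. This is only a sketch: the marker strings are copied from the examples above, while the function name and the "NAME" label for successful lookups are our own choices for illustration and are not part of the CAIDA dataset format.

```python
# Marker strings copied from the examples above; the labels on the right
# are the dig-style outcomes they are described as equivalent to.
FAILURE_MARKERS = {
    "FAIL.NON-AUTHORITATIVE.in-addr.arpa": "NXDOMAIN",
    "FAIL.SERVER-FAILURE.in-addr.arpa": "SERVFAIL",
    "FAIL.TIMEOUT.in-addr.arpa": "TIMEOUT",
}


def classify_lookup(hostname: str) -> str:
    """Return the failure label for an error marker, else 'NAME'.

    'NAME' is a hypothetical label meaning the lookup returned a DNS name.
    """
    # Drop surrounding whitespace and a trailing root dot before matching.
    name = hostname.strip().rstrip(".")
    return FAILURE_MARKERS.get(name, "NAME")
```

For example, `classify_lookup("FAIL.TIMEOUT.in-addr.arpa")` yields the TIMEOUT label, while an ordinary lowercase hostname is counted as a resolved name.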
Table 26 gives the basic observations.
TABLE 26. Observations of CAIDA dataset
Observation Value
Date of data collection 08-31-2012
Total number of IP addresses 1,880,374
Total number of SERVFAILs 63,478
Total number of NXDOMAINs 712,537
Total number of TIMEOUTs 6,133
Total number of DNS names 1,098,226
58.4% of the IP addresses have DNS names. 0.33% of the IP addresses resulted in TIMEOUTs and
3.38% resulted in SERVFAIL errors. The remaining 37.9% of the IP addresses were answered with
NXDOMAIN, that is, the lookup completed but no DNS name exists.
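The percentages can be recomputed directly from the counts in Table 26. The short sketch below does this; the dictionary keys are our own labels, and the counts are the ones reported in the table.

```python
# Counts taken from Table 26 (dataset collected on 08-31-2012).
counts = {
    "NAME": 1_098_226,      # IP addresses that have a DNS name
    "NXDOMAIN": 712_537,
    "SERVFAIL": 63_478,
    "TIMEOUT": 6_133,
}

# The four outcomes together account for every IP address in the dataset.
total = sum(counts.values())

# Share of each outcome as a percentage of all IP addresses.
shares = {kind: round(100 * n / total, 2) for kind, n in counts.items()}

for kind, pct in shares.items():
    print(f"{kind:9s} {counts[kind]:>9,d}  {pct:6.2f} %")
```

The four counts add up to the 1,880,374 IP addresses in the dataset, so the shares are all taken over the same denominator.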
Table 27 shows how many items of information were parsed from the names in the CAIDA
dataset. In the corresponding CDF plot, the x-axis shows the number of items of information
parsed, in bins. The first bin (x-axis from 0 to 1) is the number of DNS names that have no
items of information. The second bin (x-axis from 1 to 2) is the number of DNS names that have
0 or 1 items of information, and so on. The counts in Table 27 add up to the 1,098,226 DNS
names, so each row counts a separate group of names rather than a running total. About 40% of
the DNS names (100% minus the 60.24% with no inference) have at least one field of information,
such as interface, router function, city, or state.
TABLE 27. CAIDA - CDF of Information parsed
Type Count Percentage
No inference 661,602 60.24 %
At least one 350,482 31.91 %
At least two 75,352 6.86 %
At least three 9,656 0.87 %
At least four 1,102 0.1 %
Five and above 32 0.0029 %
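The cumulative values plotted in the CDF can be derived from the counts in Table 27. The sketch below assumes that the rows of the table are disjoint bins (names with exactly zero, one, two, three, or four items, and five or more), which is consistent with the counts adding up to the total number of DNS names; the variable names are ours.

```python
from itertools import accumulate

# Counts from Table 27, ordered by number of parsed items: 0, 1, 2, 3, 4, 5+.
# Assumption: the rows are disjoint bins, since they sum to the total.
bins = [661_602, 350_482, 75_352, 9_656, 1_102, 32]

total = sum(bins)

# Running sum divided by the total gives the empirical CDF at each bin.
cdf = [running / total for running in accumulate(bins)]

# Fraction of names from which at least one item was parsed.
at_least_one = 1 - cdf[0]

for items, value in enumerate(cdf):
    print(f"<= {items} items: {100 * value:.2f} %")
print(f"at least one item: {100 * at_least_one:.2f} %")
```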
Figure 20 shows the CDF of the CAIDA names. Figure 21 shows the CDF of CAIDA along with
the CDFs of the other ISPs, namely Level3, Verizon, and Cogent. It plots percentages rather
than counts so that the datasets are on the same scale. From this figure, we can see that
Verizon has the highest percentage of names with at least one part of information. It is
significantly higher than the percentage for the CAIDA names, which are collected from multiple
ISPs. Level3 and Cogent have less information than CAIDA.
FIGURE 20. CAIDA - CDF of Extracted Information
FIGURE 21. CAIDA vs Others CDF of Extracted Information
TABLE 28. CAIDA - Parsed DNS Names
Information gathered Count Percentage Number of Checks
Total no. of DNS Names 1,098,226 100 % 0
Interface 169,645 15.44 % 5,222,736
Router Function 18,045 1.64 % 8,039,110
City Names 45,112 4.1 % 8,023,746
City CLLI 60,584 5.51 % 8,023,189
Airport Codes 154,583 14.07 % 8,024,147
State Codes 86,753 7.89 % 8,023,101
TABLE 29. CAIDA - Parsed DNS Names (Others)
Type of Info Count Percentage
Number of Segments 8,038,088 -
Others 537,044 6.68 %
Dictionary Words 1,264,707 15.73 %
Figure 22 shows the distribution of the names in the CAIDA dataset. Figure 24 shows the
name parts and their distribution. Figure 21 compares the names and the entities found in
them with those of Level3, Verizon, and Cogent. It shows only the percentages of names and
parts found, so that CAIDA can be compared against the other ISPs. We see that our per-ISP
analysis follows the general observations across the multiple ISPs covered by CAIDA.
FIGURE 22. CAIDA Segment Distribution
FIGURE 23. CAIDA Others Distribution
FIGURE 24. CAIDA vs Others CDF of Extracted Information
CHAPTER VI
TOPOLOGY MAPPING FROM XNET AND IFFINDER
xnet [25] is a tool for subnet inference. It sends IP probe packets to hypothetical subnets
of size /31 along the path to the target IP address and records the nodes that respond with
ICMP port unreachable messages. When the destination IP is reached, it uses the hop count to
determine the possible subnet to which the target IP address belongs. A more detailed
explanation is provided in [24]. xnet also reports the alias information that it encounters
during the process.
iffinder [18] is another alias resolution tool. It sends IP probe packets to high-numbered
ports on the destination IP address. The targets are likely to respond with ICMP port
unreachable messages. Sometimes a router sends this ICMP message from a different interface
than the one on which the probe was received. This gives a tuple of interface IP addresses
that belong to the same router. iffinder can be configured to run multiple times, since the
tuple may grow as more aliases of the same router are found.
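The way alias tuples grow over repeated runs can be illustrated with a union-find structure that merges pairs of interface addresses into routers. This is a minimal sketch of the bookkeeping only, not iffinder's own implementation; the function name and the example addresses are made up for illustration.

```python
def merge_aliases(pairs):
    """Group interface IPs into routers from (probed_ip, replying_ip) pairs."""
    parent = {}

    def find(ip):
        # Register unseen addresses as their own root, then follow
        # parent links to the root, halving the path along the way.
        parent.setdefault(ip, ip)
        while parent[ip] != ip:
            parent[ip] = parent[parent[ip]]
            ip = parent[ip]
        return ip

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # both interfaces belong to one router

    routers = {}
    for ip in list(parent):
        routers.setdefault(find(ip), set()).add(ip)
    return list(routers.values())


# Hypothetical alias pairs, e.g. collected over several runs.
pairs = [
    ("10.0.0.1", "10.0.1.1"),
    ("10.0.1.1", "10.0.2.1"),
    ("10.0.9.1", "10.0.9.5"),
]
routers = merge_aliases(pairs)
```

Here the first two pairs share an address, so their three interfaces end up in one router, while the third pair forms a second router.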
Databank
I ran xnet on a small domain named databank.com. It has 1,110 IP addresses assigned
as part of ASN 3356 (Level3). Finding the topology of these small components is reasonable,
since the structure of the larger ASN is made up of these smaller companies. Out of the 1,110
IP addresses, 82 responded to the xnet requests. Alias resolution on this dataset didn’t yield
any aliases in the same IP address space, so we assume each interface to be a router. Based on
the list of IP addresses within the target subnet and their hop distances from the vantage
point, we are able to infer the topology of these 82 nodes. Figure 25 shows the topology that
we found using xnet for databank.com. We used the Gephi [6] graph visualization tool with the
Force Atlas layout and a weak attraction strength to show which routers are connected to each
other. This graph has 108 nodes and 76 edges, so it is clearly disconnected: a connected graph
on 108 nodes needs at least 107 edges. This is because not every address responds to xnet’s
probes.
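That the graph is disconnected follows from the node and edge counts alone, because each edge can join at most two components. The sketch below computes this lower bound and, for an actual edge list, counts the components with a depth-first search. The function names and the toy graph are ours; the edge list of the Databank graph is not reproduced here.

```python
from collections import defaultdict


def min_components(num_nodes, num_edges):
    """Lower bound on components: each edge merges at most two of them."""
    return max(1, num_nodes - num_edges)


def count_components(nodes, edges):
    """Count connected components of an undirected graph with a DFS."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components


# With the counts reported for the Databank graph:
bound = min_components(108, 76)

# Toy example: five nodes, two edges, hence three components.
toy = count_components(range(5), [(0, 1), (1, 2)])
```

With 108 nodes and 76 edges the bound is 32, so the Databank graph consists of at least 32 separate pieces.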
FIGURE 25. Topology of Databank
Yahoo
A similar analysis is done on the yahoo.com domain, which also belongs to the Level3
ASN. The domain has 5,788 IP addresses, out of which 426 responded to xnet. When we plotted
the result in Gephi, we found 421 nodes and 559 edges. Figure 26 shows the router-level
topology of Yahoo.
FIGURE 26. Topology of Yahoo
Graph Based on City Information
As discussed in Growth Analysis of a Large ISP [15], the two addresses of a /31 subnet are
highly likely to be connected, provided they are physical interfaces. Since we have DNS names
for these IP addresses, we know that they are physical nodes. Using xnet and iffinder, we also
have alias information for these IP addresses. Hence we use a similar methodology, in which we
assume that the two addresses of a /31 are connected. Building on the router-level topology
mapping process described above, we also take the region information from the DNS name of each
IP address and group the routers of a region together into one node. When a router in city1
connects to a router in city2, we add a link between the two regions. Every domain, such as
verizon-gni, has its routers in different cities. Hence, we obtain a region-based graph of a
domain, like the one for verizon-gni shown in Figure 27. The size of each node depends on the
number of routers in that region. The color gradient depends on the degree of the node. Nodes
with a degree of 1 are ignored.
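The /31 assumption and the grouping by region can be sketched in a few lines. The example below is only an illustration under simplifying assumptions: it treats every interface as its own router (alias merging is left out), it takes the region of each address as already extracted from its DNS name, and the function names, addresses, and region labels are hypothetical.

```python
import ipaddress
from collections import Counter


def slash31_peer(ip: str) -> str:
    """Return the other address in the /31 that contains ip."""
    addr = ipaddress.IPv4Address(ip)
    # The two addresses of a /31 differ only in the last bit.
    return str(ipaddress.IPv4Address(int(addr) ^ 1))


def region_edges(ip_to_region):
    """Count inter-region links implied by /31 pairs.

    ip_to_region maps an interface IP to the region parsed from its
    DNS name. Links inside a single region are not counted.
    """
    seen = Counter()
    for ip, region in ip_to_region.items():
        peer_region = ip_to_region.get(slash31_peer(ip))
        if peer_region is not None and peer_region != region:
            seen[frozenset((region, peer_region))] += 1
    # Every link was seen once from each of its two ends.
    return {tuple(sorted(pair)): n // 2 for pair, n in seen.items()}


# Hypothetical input: regions as they might be extracted from DNS names.
ip_to_region = {
    "192.0.2.0": "nyc",
    "192.0.2.1": "chi",   # /31 peer of 192.0.2.0, different region
    "192.0.2.2": "nyc",
    "192.0.2.3": "nyc",   # same region as its peer, no inter-region link
    "192.0.2.4": "chi",   # peer 192.0.2.5 is unknown
}
edges = region_edges(ip_to_region)
```

In this toy input only the pair 192.0.2.0 and 192.0.2.1 spans two regions, so the result contains a single link between chi and nyc.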
FIGURE 27. Region Level Topology of Verizon-gni
Another example of such a graph, for the Level3.net domain belonging to the Level3 ASN
(3356), is shown in Figure 28. The size of each node depends on the number of routers in that
region. The color gradient depends on the degree of the node. Nodes with a degree of 1 are
ignored.
FIGURE 28. Region Level Topology of Level3
REFERENCES CITED
[1] Airport codes. https://www.airportcodes.org. Accessed: 2014-09-05.
[2] The caida ucsd ipv4 routed /24 dns names dataset.
http://www.caida.org/data/active/ipv4_dnsnames_dataset.xml. Accessed:
2014-09-09.
[3] Clli code. http://en.wikipedia.org/wiki/CLLI_code. Accessed: 2014-09-05.
[4] Domain names - implementation and specification. https://www.ietf.org/rfc/rfc1035.txt.
Accessed: 2014-09-16.
[5] Geonames. https://www.geonames.org. Accessed: 2014-09-05.
[6] Gephi. http://gephi.github.io.
[7] Ip2location. http://www.ip2location.com. Accessed: 2014-09-05.
[8] mrinfo. http://technet.microsoft.com/en-us/library/cc957933.aspx.
[9] pathaudit. https://github.com/jc-wail/WAIL/tree/master/PathAudit.
[10] Public dns server list. http://www.public-dns.tk. Accessed: 2014-09-05.
[11] State codes of US.
http://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations.
[12] Telcodata.us telecommunications database. https://www.telcodata.us. Accessed:
2014-09-05.
[13] undns. http://www.scriptroute.org/source/.
[14] Joseph Chabarek and Paul Barford. What’s in a name?: Decoding router interface names. In
Proceedings of the 5th ACM Workshop on HotPlanet, HotPlanet ’13, pages 3–8, New York,
NY, USA, 2013. ACM.
[15] Andrew D. Ferguson, Jordan Place, and Rodrigo Fonseca. Growth analysis of a large isp. In
Proceedings of the 2013 Conference on Internet Measurement Conference, IMC ’13, pages
347–352, New York, NY, USA, 2013. ACM.
[16] Huanetwork. The naming conventions of huawei ar routers.
http://www.huanetwork.com/blog/the-naming-conventions-of-huawei-ar-routers/.
[17] Inc Juniper Systems. Interface naming overview.
http://www.juniper.net/techpubs/en_US/junos12.3/topics/concept/interfaces-interface-naming-overview.html.
[18] CAIDA Ken Keys. iffinder. http://www.caida.org/tools/measurement/iffinder/.
[19] P. Mérindol, B. Donnet, J. Pansiot, M. Luckie, and Y. Hyun. MERLIN: MEasure the Router
Level of the INternet. In Conference on Next Generation Internet, Jun 2011.
[20] University of Oregon. University of oregon route views project.
http://www.routeviews.org. Accessed: 2014-09-05.
[21] Neil Spring, Ratul Mahajan, David Wetherall, and Thomas Anderson. Measuring isp
topologies with rocketfuel. IEEE/ACM Trans. Netw., 12(1):2–16, February 2004.
[22] Cisco Systems. Configuring router interfaces.
http://www.cisco.com/c/en/us/td/docs/security/security_management/cisco_security_manager/security_manager/4-1/user/guide/CSMUserGuide_wrapper/rtintf.pdf.
[23] Team-Cymru. Ip to asn mapping. http://www.team-cymru.org/Services/ip-to-asn.html.
Accessed: 2014-09-05.
[24] M.E. Tozal and K. Sarac. Subnet level network topology mapping. In Performance
Computing and Communications Conference (IPCCC), 2011 IEEE 30th International,
pages 1–8, Nov 2011.
[25] Mehmet Engin Tozal. Ntmaps - network mapping & modeling.
http://nsrg.louisiana.edu/project/ntmaps/output/explorenet.html.
[26] Ming Zhang, Yaoping Ruan, Vivek Pai, and Jennifer Rexford. How dns misnaming distorts
internet topology mapping. In Proceedings of the Annual Conference on USENIX ’06
Annual Technical Conference, ATEC ’06, pages 34–34, Berkeley, CA, USA, 2006. USENIX
Association.