Who’s Got Your Mail? Characterizing Mail Service Provider Usage
Enze Liu
UC San Diego
Gautam Akiwate
UC San Diego
Mattijs Jonker
University of Twente
Ariana Mirian
UC San Diego
Stefan Savage
UC San Diego
Georey M. Voelker
UC San Diego
ABSTRACT
E-mail has long been a critical component of daily communication
and the core medium for modern business correspondence. While
traditionally e-mail service was provisioned and implemented in-
dependently by each Internet-connected organization, increasingly
this function has been outsourced to third-party services. As with
many pieces of key communications infrastructure, such central-
ization can bring both economies of scale and shared failure risk.
In this paper, we investigate this issue empirically — providing a
large-scale measurement and analysis of modern Internet e-mail
service provisioning. We develop a reliable methodology to better
map domains to mail service providers. We then use this approach
to document the dominant and increasing role played by a handful
of mail service providers and hosting companies over the past four
years. Finally, we briey explore the extent to which nationality
(and hence legal jurisdiction) plays a role in such mail provisioning
decisions.
CCS CONCEPTS
Information systems → World Wide Web; World Wide Web → Internet
communications tools; Internet communications tools → E-mail.
ACM Reference Format:
Enze Liu, Gautam Akiwate, Mattijs Jonker, Ariana Mirian, Stefan Savage,
and Georey M. Voelker. 2021. Who’s Got Your Mail? Characterizing Mail
Service Provider Usage. In ACM Internet Measurement Conference (IMC ’21),
November 2–4, 2021, Virtual Event, USA. ACM, New York, NY, USA, 15 pages.
https://doi.org/10.1145/3487552.3487820
1 INTRODUCTION
Despite the rise of interactive chat and online social messaging
applications, e-mail continues to play a central role in communica-
tions. By some estimates, close to 300 billion e-mail messages are
sent and received each day [34]. In particular, e-mail remains the
central modality for modern business correspondence, long since
displacing the postal service for such matters over the previous two
decades.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
IMC ’21, November 2–4, 2021, Virtual Event, USA
© 2021 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9129-0/21/11.
https://doi.org/10.1145/3487552.3487820
However, unlike the postal service (and many other forms of
person-to-person communication) e-mail is not centrally admin-
istered, but is organized such that each Internet domain owner,
by virtue of their DNS MX record, can make unique provisioning
decisions about how and where they will accept e-mail delivery.
Thus, organizations are free to provision separate e-mail services
for each domain they own, to share service among domains they
operate, or to outsource e-mail entirely to third-party providers.
These choices, in turn, can have signicant implications for the
resilience, security, legal standing, performance and cost of e-mail
service.
In particular, concerns have been raised in recent years about
the general risks of increasing Internet service centralization and
consolidation [5, 10, 17]. For example, centralization amplifies the
impact of (even rare) service failures [4, 15, 25]. Similarly, a single
data breach in a widely-used service can put thousands of customers’
data at risk.¹ Finally, the legal jurisdiction in which a given
service provider operates is implicitly imposed on the data managed
by that provider. For instance, because Google is a U.S. company, the
data it manages is subject to the Stored Communications Act, which
provides data access to the government under warrant even if the data
belongs to a foreign party not residing in the U.S.
Indeed, while historically e-mail was provisioned and implemented
independently by each organization (i.e., hosting a local mail server
acting as a full-fledged Mail Transfer Agent), the rise of third-party
enterprise mail service providers (notably Google and Microsoft) has
challenged that assumption; there are compelling reasons to believe
that global e-mail service is also increasingly subject to a
significant degree of centralization. However, in spite of the
importance of this issue, there has been little empirical analysis of
e-mail provisioning choices and how they have been evolving over
time.²
In this paper, we perform a large-scale measurement and analysis of
e-mail service provisioning and configuration. Our study uses three
large corpora of domains: one based on all .gov domains, another based
on a stable subset of the Alexa top 1 million domains observed across
nine snapshots between 2017 and 2021, and lastly a similar dataset of
one million .com domains sampled at random from the same period. We
use these datasets to gain insight into the present popularity of
e-mail service providers and their longitudinal shifts, and to
characterize their makeup. From our data we demonstrate the clear and
growing dominance of a handful of third-party e-mail service providers
and the shrinking number of domains that provision mail service
“in-house” themselves or through their hosting providers.
¹ The recent vulnerabilities exploited in Microsoft’s Exchange Server
were serious [20], and it could have been even worse had attackers
been able to penetrate Microsoft’s Outlook e-mail service.
² One example can be found in Trost’s blog post “Mining DNS MX Records
for Fun and Profit”, although, as our results show, their approach has
its limitations [36].
We make the following contributions:
(1) We detail and justify a methodology to map published MX
records to the identity of the mail service provider (providing
significant accuracy improvements over approaches that
rely entirely on MX record content);
(2) Using our methodology we identify the top e-mail service
providers and characterize their market share and customer
demographics;
(3) We provide a longitudinal analysis of mail service provider
popularity over time and document the source of market
share shifts;
(4) We explore the existence of national biases in the choice
of mail service provider (i.e., the extent to which mail for
domains in country X’s top-level domain (TLD) makes use of
mail service from country Y and hence subjects itself
to Y’s legal jurisdiction).
Ultimately, our work not only provides a comprehensive anal-
ysis of the current state of Internet e-mail provisioning (and the
relative role of third-party web mail service providers, mail filtering
providers and “in-house” mail services), but also provides a solid
foundation on which to base future analyses of e-mail infrastruc-
ture.
2 BACKGROUND AND RELATED WORK
2.1 Simple Mail Transfer Protocol
The Simple Mail Transfer Protocol (SMTP) is part of a family of
protocols for mail transmission, including SMTP [27], Extended
SMTP (ESMTP) [18] and SMTP Service Extension for Authentication
(SMTP-AUTH) [33].
In its purest form, as depicted in Figure 1, an e-mail user operates
a mail user agent (MUA) that uses ESMTP or SMTP-AUTH to sub-
mit e-mail messages to the sender’s mail submission agent (MSA)
software (e.g., their local mail server). The MSA in turn queues the
message for delivery with the sender’s mail transfer agent (MTA)
for relay to the mail infrastructure of the addressed parties in the
To:, CC: or Bcc: lines. Next, the sender’s MTA transfers the e-mail
to the recipient’s MTA, using SMTP or — if supported — ESMTP.
It is during this step that the sending MTA uses the recipient’s
DNS “Mail Exchanger” (MX) record to determine the location of
the receiving MTA. Having received the e-mail, the receiving MTA
then either delivers the mail locally or places it into a queue for fur-
ther processing. In practice, the MSA and MTA are often the same
piece of software (typically run on a single server in an “in-house”
implementation) and in Web mail situations (e.g., Gmail) the MUA
is a Web application provided by the same organization as the MSA
and MTA.
2.1.1 SMTP Procedures: A Summary. All protocols in the SMTP
family follow roughly the same procedure. A session starts when
an SMTP client (either an MUA seeking to submit mail or an MTA
seeking to relay mail) opens a connection to an SMTP server, which
responds in kind with a greeting message. This message is informally
referred to as the banner message, in which the server typically
provides either its domain name or IP address [18].
Figure 1: Mail processing model
Figure 2: Banner and EHLO message in a typical SMTP session between
client (C) and server (S).
Once the SMTP client has received the greeting message, it nor-
mally sends the EHLO (or HELO in earlier versions) command to
the SMTP server, signaling its identity, which in turn elicits an
EHLO response message containing the SMTP server’s domain
name and a list of the extensions it supports. Figure 2 illustrates
the banner and EHLO message in a typical SMTP session with the
SMTP server (S) having domain foo.com and the SMTP client (C)
having domain bar.com. In this paper, we use EHLO message to refer
to the second EHLO, i.e., the message elicited from the server.
Depending on the protocol, additional messages may be exchanged
between server and client for negotiating configuration
options such as authentication. The sending SMTP server can then
initiate a mail transaction. These last steps are important for the
delivery of message content, but are not relevant to this paper.
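As a rough illustration of how the server's self-reported name can be pulled out of these two messages, the sketch below parses a banner line and an EHLO reply line. It is a minimal sketch and not the tooling used in this paper: the function name `announced_name`, the sample lines, and the simplified FQDN pattern are all assumptions, and only the conventional reply codes 220 (greeting) and 250 (EHLO reply) are handled.

```python
import re

# Simplified FQDN pattern (assumption): dot-separated labels ending in an
# alphabetic top-level label. Bare strings and IP literals do not match.
_FQDN = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$")


def announced_name(line, codes=("220", "250")):
    """Return the FQDN a server announces in a banner or EHLO reply line.

    Returns None when the line has a different reply code or when the
    first word after the code is not a syntactically valid FQDN.
    """
    line = line.strip()
    if len(line) < 5 or line[:3] not in codes or line[3] not in " -":
        return None
    words = line[4:].split()
    if not words:
        return None
    token = words[0].lower().rstrip(".")
    return token if _FQDN.match(token) else None


# Hypothetical lines in the spirit of Figure 2 (server domain foo.com).
banner = "220 foo.com ESMTP ready"
ehlo_reply = "250-foo.com at your service"
print(announced_name(banner), announced_name(ehlo_reply))
```

A server that announces a placeholder string instead of a domain name yields `None` here, which matters later when such messages are used as evidence of the operator's identity.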
2.1.2 Mail submission and mail relaying. When the SMTP protocol
is used to submit a new message, e.g., between the sender’s MUA
and their MSA, the identity of the mail server is typically
well-known (i.e., pre-configured) and it is common for the MUA to
positively authenticate itself using the SMTP-AUTH protocol.
Thus, the server will not accept SMTP transactions before the sender
presents appropriate credentials (also typically protected via a TLS
session initiated as part of this protocol step). In this fashion, the
customer-facing mail server designated by a broadband Internet
Service Provider is able to limit outbound mail submissions to only
its customers. In this mail submission mode, servers typically
accept connections on TCP port 587, as per RFC 6409 [19]. However,
port 465 is also common (although 465 was deprecated in RFC
8314 [24]), and in a number of cases sites may use port 25 for this
purpose (typically designating particular hosts to be MSAs and
others to be MTAs [19]).
When the SMTP protocol is used to relay a message (i.e., from
one MTA to another), the sending (i.e., outbound) MTA identifies its
partner MTA server by parsing e-mail addresses (i.e., user@domain)
to extract the associated domain names. For each (unique) domain
name in the destination address(es) of an e-mail, the sending MTA
will look up a DNS MX record. This MX record points to the server
to which receiving e-mail on behalf of the particular domain name
is delegated. By fully resolving this record, the sending MTA server
ultimately identifies and establishes a connection with the receiving
MTA server. In this mail relay mode, TCP port 25 is typically used
(there are other ports that are used occasionally, such as port 2525,
but these are not supported by IANA or IETF [39] and so we do not
consider them in this paper).
2.2 Mail Exchanger Records
The Mail Exchanger (MX) record species which MTAs handle
inbound mail for a domain name [
18
,
24
,
26
] and is published in
the DNS zone of the domain. An MX record should itself contain
a valid domain name [
23
,
26
]. Multiple MX records can be con-
gured in a zone, each with an assigned preference number. The
lowest preference has highest priority, and multiple MX records
can share the same priority for load balancing [
18
]. An MX record
can be made up, in part, of the registered domain name for which
it receives e-mail, yet resolve to completely separate infrastructure.
For instance, the MX record for our institution
ucsd.edu
contains
inbound.ucsd.edu
, which in turn resolves to an IP address (A record)
owned and operated by ProofPoint, a well-established mail ltering
company wholly dierent from ucsd.edu.
2.3 STARTTLS and TLS certificates
Modern SMTP implementations opportunistically support the STARTTLS
option which, in the mail relay context, allows the sending
MTA to initiate a TLS connection with the receiving MTA [11, 16].
If the receiving MTA supports STARTTLS, it will provide a TLS
certificate which can be used to bootstrap a TLS session providing
session confidentiality. To provide a valid certificate, the receiving
MTA must obtain a signed certificate from a trusted certificate
authority (CA) for which the MX domain name is either specified in
the Common Name (CN) or a Subject Alternative Name (SAN) field.
While ideally TLS certificates are validated by the sending MTA,
in practice SMTP sessions will continue even if the certificate does
not validate [13, 14]. Note that the SAN field is used when a single
certificate must support TLS connections across a range of domains.
For example, the certificate used by Gmail has Common Name
mx.google.com, and its SAN specifies other alternate domain names,
such as aspmx2.googlemail.com and mx1.smtp.goog.³ In these cases,
the Common Name (CN) almost always specifies a principal domain
operated by the provider of the service.
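To make the CN/SAN requirement concrete, the following sketch checks whether an MX host name is covered by the names on a certificate. It is only an illustration under simplifying assumptions: the certificate is reduced to a CN string and a list of SAN strings, a leading `*.` is treated as matching exactly one label, and the helper names `name_matches` and `cert_covers` are hypothetical. Chain and expiry validation are out of scope.

```python
def name_matches(pattern, hostname):
    """Compare one certificate name against a host name.

    A leading "*." is treated as a wildcard for exactly one leftmost
    label (a simplifying assumption); otherwise names must be equal.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern.startswith("*."):
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == pattern[2:]
    return pattern == hostname


def cert_covers(common_name, sans, mx_host):
    """True if the MX host appears as the CN or as one of the SANs."""
    names = [common_name] + list(sans)
    return any(name_matches(n, mx_host) for n in names if n)


# Names taken from the Gmail example in the text.
print(cert_covers("mx.google.com",
                  ["aspmx2.googlemail.com", "mx1.smtp.goog"],
                  "mx1.smtp.goog"))
```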
2.4 Related work
Considering its critical role, remarkably little contemporary analysis
exists of e-mail infrastructure and who provides it. Some of
the best known modern work in this space is the pair of 2015
papers authored by Durumeric et al. and Foster et al. which
empirically explored the use and configuration of privacy,
authentication, and integrity mechanisms at each stage of the e-mail
delivery pipeline [13, 14]. Notably, Durumeric et al. also provide one
estimate of the top mail providers as a part of their study, although
their methodology may underestimate the influence of major providers
(notably Microsoft). Rijswijk et al. [37, 38] investigated the growth
of three top mail providers over a relatively short, 50-day period,
and demonstrated the phasing out of Windows Live in favor of Office365,
among others. Their analysis, unlike ours, considers only the content
of MX records, and mail was not the focal point of their work.
Finally, in 2005, Afergan et al. [2] measured the loss, latency, and
errors of e-mail transmission over the course of a month with hundreds
of domains.
³ mx1.smtp.goog is a valid and resolvable domain owned by Google.
Somewhat further aeld, there is a literature exploring how dan-
gling DNS records impact e-mail security, starting with the work of
Liu et al. [
22
], who explored e-mail as a special case of a general anal-
ysis of dangling DNS issues. This work was recently expanded by
Reed and Reed in their technical report that focuses specically on
dangling DNS MX records and their potential security impact [
29
].
Another direction of research, notably by Chen et al. [
9
] and Shen et
al. [
32
], studies the vulnerabilities of third-party mail providers and
how those vulnerabilities could be used to spoof e-mail messages.
In spite of these and related eorts, we have found very little
work focused on characterizing which organizations are, in fact,
responsible for providing mail service or how this responsibility
has changed over time. Indeed, perhaps the closest related work
is not from the academic literature, but from the recent Medium
post of Jason Trost which describes an analysis of MX records for
identifying e-mail security providers [36].
3 IDENTIFYING MAIL PROVIDERS
In this section, we rst illustrate the challenges in identifying mail
service providers, in particular how MX records alone can be mis-
leading, and the strengths and weaknesses of using alternative
features. Given these limitations, we then present our priority-
based approach for identifying the mail provider for a given do-
main name. For the purpose of this work, we focus on the primary
e-mail provider, which is identied by the MX record with the
highest priority. Finally, we evaluate the accuracy of this approach
using randomly sampled domains from the three larger datasets
of domains on which we base much of our subsequent analysis
(described in detail in Section 4.3).
3.1 Challenges in Provider Identification
One approach, exemplified by Trost’s analysis [36], relies exclusively
on MX records to identify the mail provider. However, this
approach can be misleading when the purported MX domain resolves
to an IP address operated by a different entity.
Better accuracy can be achieved by incorporating additional
features, such as the autonomous system number (ASN) of the
IP address to which an MX record resolves, the content of
Banner/EHLO messages in the initial SMTP transaction, and TLS
certificates learned during an SMTP session. However, using multiple
features creates additional complexities. In particular, while
SMTP-level information is typically a more reliable indicator of the
mail service provider than the hosting party’s ASN, the latter is
always available while the former is not.
Domain           MX                           MX IP Resolution  ASN of IP
netflix.com      aspmx.l.google.com           172.217.222.26    15169 (Google)
gsipartners.com  mailhost.gsipartners.com     173.194.201.27    15169 (Google)
beats24-7.com    mx10.mailspamprotection.com  35.192.135.139    15169 (Google)
jeniustoto.net   ghs.google.com               172.217.168.243   15169 (Google)
Table 1: Example domains with related mail information.
Domain           Banner/EHLO                  Subject CN
netflix.com      mx.google.com                mx.google.com
gsipartners.com  mx.google.com                mx.google.com
beats24-7.com    se26.mailspamprotection.com  *.mailspamprotection.com
jeniustoto.net   N/A                          N/A
Table 2: Example domains with additional information retrieved from SMTP sessions.
To illustrate these points further, we use the four domains listed
in Tables 1 and 2 as examples. Table 1 shows the MX record, the
IP address resolution, and the ASN from which the address is
announced. Table 2 shows additional information learned by initiating
SMTP sessions with the IP addresses listed in Table 1. Specifically,
we show the subject Common Name (CN) listed on the certificate
presented in STARTTLS (if any) and the Banner/EHLO messages
provided during the SMTP session.
3.1.1 MX Record. Using the MX record to infer the mail provider
works well when the domain owner explicitly names its provider
in the MX record (e.g., netflix.com in Table 1). This is a common
practice for domains that outsource their mail services to third-party
companies (e.g., Google) to ensure that their providers can
properly receive e-mail on their behalf [28, 35].
However, this idiom is not always accurate. For example, the MX
approach will incorrectly infer that gsipartners.com self-hosts its
e-mail delivery because its MX record is mailhost.gsipartners.com.
In fact, this MX name resolves to an IP address announced by
Google. When contacted, it emits mx.google.com in the Banner/EHLO
messages of the SMTP handshake, and the TLS certificate it produces
has a subject common name (CN) of mx.google.com. Clearly,
gsipartners.com e-mail is handled by Google.
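The failure mode is easy to reproduce with a sketch of the MX-only heuristic applied to the two domains from Table 1. The helper names are hypothetical, and `naive_registered_domain` simply keeps the last two labels, which is an assumption made for brevity; a faithful implementation would consult the Public Suffix List.

```python
def naive_registered_domain(fqdn):
    """Keep the last two labels of a name (assumption: no PSL lookup)."""
    labels = fqdn.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def mx_only_provider(domain, mx_host):
    """Infer the provider from the MX name alone.

    Labels the domain "self-hosted" when the MX name falls under the
    domain itself, otherwise returns the MX's registered domain.
    """
    provider = naive_registered_domain(mx_host)
    if provider == naive_registered_domain(domain):
        return "self-hosted"
    return provider


# Rows from Table 1: the second inference is wrong, because the host
# behind mailhost.gsipartners.com is operated by Google.
print(mx_only_provider("netflix.com", "aspmx.l.google.com"))
print(mx_only_provider("gsipartners.com", "mailhost.gsipartners.com"))
```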
3.1.2 Autonomous System Number (ASN). While the ASN to which
the mailhost.gsipartners.com MX leads correctly indicates Google
as the mail provider, this inference is not always accurate. Consider
the domain beats24-7.com whose MX record also resolves to an IP
address owned by Google. In this case e-mail is actually handled
by an e-mail security provider that is hosted in Google Cloud’s IP
space, rather than the internal address space used by Google to
host its own services. Another issue with the ASN is that it does
not reflect whether an IP address is actually operating an SMTP
server and can accept mail. Consider jeniustoto.net in Table 1,
which has an MX record that resolves to an IP address in Google’s
internal address space. However, this IP address is from Google’s
web hosting service and does not run an SMTP server. In this case,
jeniustoto.net does not actually have a mail server (and thus a
mail provider), even though it uses a Google IP address.
3.1.3 Banner/EHLO messages. During an SMTP session, the mail
server for gsipartners.com identifies itself in its Banner/EHLO
handshake as mx.google.com (Table 2). This information is generally
reliable for identifying third-party mail providers, as most
third-party providers configure their servers to properly identify
themselves. However, the Banner/EHLO information need not be
mechanically generated and can contain any text configured by
the server operator, which makes it unreliable in a small number
of scenarios. First, Banner/EHLO messages may not contain valid
domain names. For example, instead of having a valid domain name,
certain providers put a string (e.g., IP-1-2-3-4) in their servers’
Banner/EHLO messages. Second, an individual who runs their own
SMTP server can falsely claim to be mx.google.com in Banner/EHLO
messages. While very rare, we have observed a small number of
such cases.
3.1.4 TLS certificate. The gsipartners.com mail server also presents
a valid certificate with subject CN mx.google.com, which is a clear
indicator of the entity running the mail server (and one attested to
by a trusted Certificate Authority) and thus can generally be used to
infer the mail provider. In the case of gsipartners.com, we conclude
that it uses Google as it presents a valid certificate with subject
CN mx.google.com (this certificate is also used by other legitimate
Google mail servers).
While certificates are ideal for identifying the mail provider
of a domain, they are not always available. Some mail servers
do not support STARTTLS or they respond with self-signed
certificates which are less reliable. Additionally, we note that
certain web hosting providers (e.g., GoDaddy with domain name
secureserver.net) allow their virtual private servers (VPS) to
create certificates using specific subdomains as the subject CN (e.g.,
vps123.secureserver.net). These servers are operated by individuals
renting them instead of the web hosting company providing
the infrastructure. Thus, in this case, the subject CN reflects
the hosting provider (e.g., GoDaddy) instead of the mail provider
(e.g., a self-hosted mail server operated by an individual operating
a GoDaddy VPS). Lastly, in a handful of cases, we observe that
some third-party mail service providers present the certificates of
their customers. For example, the University of Texas (utexas.edu)
has an MX record (inbound.utexas.edu) that resolves to an IP
address that, when contacted, presents a valid certificate with CN
inbound.mail.utexas.edu. However, the ASN of that IP address
suggests that mail service is operated by Ironport, an e-mail security
company. Additionally, the server indicates in its Banner/EHLO
message that it is Ironport. In this case, we can conclude that the
University of Texas is using Ironport instead of hosting their own
e-mail infrastructure. Thus, the CN presented in the certificate does
not indicate the service provider in this instance.
1. Certificate Preprocessing
   1.1 Count occurrence of each registered domain.
   1.2 Group certificates that share at least one FQDN.
   1.3 Compute representative name for each group.
2. IDs of an IP
   2.1 ID from cert: if a valid certificate is present, use
       the representative name of the group containing the
       certificate.
   2.2 ID from Banner/EHLO: if the same registered
       domain shows up in both, use that registered
       domain.
3. Provider ID of an MX
   3.1 If all IPs have the same ID from cert, use that ID as
       the provider ID.
   3.2 Else if all IPs have the same ID from Banner and
       EHLO, use that as the provider ID.
   3.3 Else use the registered domain part of the MX.
4. Check for misidentification
   4.1 Discover potential misidentified cases for a
       predetermined set of provider IDs.
   4.2 Correct misidentifications with heuristics.
5. Provider ID of a domain
   5.1 Assign the ID of the most preferred MX record.
       Split the credit if multiple such MX records exist.
Figure 3: Our five-step approach to infer the provider of an
MX record. The approach considers data from MX records,
Banner/EHLO messages, and TLS certificates to determine
the e-mail provider.
Based on these observations and our experience, we propose an
approach that prioritizes SMTP-level information when available,
and falls back to MX-level information in other cases. This approach
both achieves good accuracy and avoids the availability issues with
SMTP-level information. We provide more details below.
3.2 Methodology: A Priority-Based Approach
We propose a methodology, which we term the priority-based
approach, that takes as input a domain (and relevant information)
and outputs a provider ID as the inferred primary mail provider
responsible for mail service for that domain. Our methodology
incorporates data from multiple sources, including MX records,
Banner/EHLO messages, and TLS certificates. We achieve high
accuracy through prioritizing these sources by reliability: certificates
first, then Banner/EHLO messages, and then MX records.
Our methodology consists of five steps shown in Figure 3. First,
we preprocess all certificates to find and group certificates that
are potentially operated by the same entity. For each group of
certificates, we designate a representative name to represent the
entity owning these certificates. Second, for each IP address that an
MX record resolves to, we try to determine IDs that best represent
the mail provider associated with that IP address. Since an MX
can resolve to multiple IP addresses, knowing the mail provider
operating each IP address is a prerequisite for determining the
provider ID of an MX. Next, we assign a provider ID to the MX
record. We then filter for misidentifications and correct them to
the best of our ability. Finally, we assign a provider ID to a domain,
which is a registered domain representing the entity operating the
mail infrastructure pointed to by the MX record.
We detail our ve step methodology below, using the exam-
ples shown in Table 3, in which domains
third-party1.com
and
third-party2.com
use e-mail services provided by the third-party
provider
provider.com
, domain
myvps.com
operates its own e-mail
service on a VPS hosted with
provider.com
, and domain
selfhosted.
com operates its own mail service.
3.2.1 Certificate Preprocessing. The goal of the first step,
preprocessing, is to find certificates that are potentially operated by
the same mail provider. The domains listed in a certificate aid our
mail provider inferences. However, certificates also introduce two
issues. First, a mail provider can have multiple valid certificates.
Second, each certificate can contain multiple domain names
by using the subject alternative name (SAN) extension. Having
multiple certificates, each with multiple domain names, leads to
two challenges: which certificates belong to the same mail provider,
and which name to use to represent that provider.
We address these two challenges by preprocessing all certificates
in our dataset and grouping certificates that likely belong to
the same mail provider. We output a representative name for each
group to represent that group and the mail provider. The process
of grouping certificates and producing a representative name has
three steps:
(1) Count Occurrences of Each Registered Domain: For fully
qualified domain names (FQDNs) that appear on a certificate’s
Subject CN and SANs, we take the registered domain
part (e.g., in Table 3 provider.com is the registered domain
of both mx1.provider.com and mx2.provider.com) and count
occurrences of each registered domain across all certificates.
For example, in Table 3, the count for provider.com will be
5. We extract the registered domain from the FQDN using
the Public Suffix List [21].
(2) Grouping Certificates: Providers may use different
certificates across their infrastructure, and grouping consolidates
them into sets of related FQDNs. We put two certificates into
the same group if (and as long as) there is some degree of
overlap between their sets of FQDNs. For instance, in Table 3,
we would create two groups. We merge the certificates used
by third-party1.com and third-party2.com into one group,
as they contain the same set of FQDNs: mx1.provider.com
and mx2.provider.com. The certificate with subject CN
myvps.provider.com is in its own group.
(3) Selecting a Representative Name: For each group of
certificates, we choose the most common registered domain as
the representative name, as it is likely to represent the mail
provider best. In our specific example, the representative
name for both groups is provider.com.
At the end of this process, certificates are organized into groups
and each group will have a representative name.
Domain            MX                 MX IP    Banner/EHLO         Subject CN          SANs              Provider ID
third-party1.com  mx1.provider.com   1.2.3.4  mx1.provider.com    mx1.provider.com    mx2.provider.com  provider.com
third-party2.com  mx2.provider.com   2.3.4.5  mx2.provider.com    mx2.provider.com    mx1.provider.com  provider.com
myvps.com         mx.myvps.com       3.4.5.6  myvps.provider.com  myvps.provider.com  N/A               provider.com
selfhosted.com    mx.selfhosted.com  4.5.6.7  ip-4-5-6-7          N/A                 N/A               selfhosted.com
Table 3: Example domains and relevant information used in our methodology.
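The three preprocessing steps can be sketched as follows. This is one possible reading and not the authors' code: each certificate is reduced to the set of FQDNs in its subject CN and SANs, grouping uses a small union-find so that overlap is applied transitively, the representative name is chosen with the global counts from step (1), and `registered_domain` keeps the last two labels instead of consulting the Public Suffix List. All function names are hypothetical.

```python
from collections import Counter


def registered_domain(fqdn):
    # Assumption: last two labels stand in for a Public Suffix List lookup.
    return ".".join(fqdn.lower().rstrip(".").split(".")[-2:])


def preprocess(certs):
    """Group certificates and name each group.

    certs is a list of sets of FQDNs (subject CN plus SANs), one set per
    certificate. Returns (counts, groups), where groups is a list of
    (certificate_indices, representative_name) pairs.
    """
    # Step 1: count registered-domain occurrences across all certificates.
    counts = Counter(registered_domain(n) for names in certs for n in names)

    # Step 2: union certificates that share at least one FQDN.
    parent = list(range(len(certs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for i, names in enumerate(certs):
        for name in names:
            name = name.lower()
            if name in first_seen:
                parent[find(i)] = find(first_seen[name])
            else:
                first_seen[name] = i

    members = {}
    for i in range(len(certs)):
        members.setdefault(find(i), []).append(i)

    # Step 3: the most common registered domain represents the group.
    groups = []
    for idxs in members.values():
        domains = {registered_domain(n) for i in idxs for n in certs[i]}
        rep = max(sorted(domains), key=lambda d: counts[d])
        groups.append((idxs, rep))
    return counts, groups


# The three certificates from Table 3.
table3 = [
    {"mx1.provider.com", "mx2.provider.com"},
    {"mx2.provider.com", "mx1.provider.com"},
    {"myvps.provider.com"},
]
counts, groups = preprocess(table3)
print(counts["provider.com"], groups)
```

On the Table 3 input this reproduces the figures quoted above: a count of 5 for provider.com and two groups, both represented by provider.com.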
3.2.2 Identifying IDs for an IP Address. Before assigning a mail
provider ID to an MX record, we need to determine the ID(s) that
best represent(s) the mail provider for the IP address(es) to which
an MX record resolves. We compute one ID with certificates and
another ID with Banner/EHLO messages. We also prioritize the ID
computed with certificates when using both IDs.
(1) ID from TLS Certificates: If a valid certificate is present at
the IP address, we use the representative name of the group
containing the certificate as the ID. We consider a certificate
valid if it is trusted by a major browser (e.g., Firefox). In our
example, IP addresses 1.2.3.4, 2.3.4.5, and 3.4.5.6 would have
the ID provider.com from certificates.
(2) ID from Banner/EHLO Messages: If the Banner/EHLO
message is available and contains a valid FQDN, we use the
registered domain part of the FQDN as the ID. In our example,
we cannot assign an ID to IP address 4.5.6.7 because it does
not present a certificate and its Banner/EHLO message does
not contain a valid FQDN. The other three IP addresses have
the ID provider.com from Banner/EHLO messages.
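A compact way to picture the two per-IP IDs is the sketch below. It assumes that certificate validation and grouping have already happened, so the caller passes either the group's representative name or `None`. For the Banner/EHLO ID it follows the stricter wording of Figure 3, requiring the banner and the EHLO reply to agree on a registered domain. The names `IpIds` and `ids_for_ip`, the simplified FQDN pattern, and the last-two-labels rule are all assumptions.

```python
import re
from typing import NamedTuple, Optional

# Simplified FQDN check (assumption): labels followed by an alphabetic TLD.
_FQDN = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$")


def registered_domain(fqdn):
    # Assumption: last two labels stand in for a Public Suffix List lookup.
    return ".".join(fqdn.split(".")[-2:])


class IpIds(NamedTuple):
    from_cert: Optional[str]
    from_banner_ehlo: Optional[str]


def _registered_if_fqdn(text):
    if not text:
        return None
    text = text.strip().lower().rstrip(".")
    return registered_domain(text) if _FQDN.match(text) else None


def ids_for_ip(valid_cert_group_name, banner_name, ehlo_name):
    """Compute both candidate IDs for one IP address.

    valid_cert_group_name is the representative name of the group that
    holds the IP's certificate, or None if no valid certificate exists.
    """
    b = _registered_if_fqdn(banner_name)
    e = _registered_if_fqdn(ehlo_name)
    from_banner_ehlo = b if b is not None and b == e else None
    return IpIds(valid_cert_group_name, from_banner_ehlo)


# IP addresses 1.2.3.4 and 4.5.6.7 from Table 3.
print(ids_for_ip("provider.com", "mx1.provider.com", "mx1.provider.com"))
print(ids_for_ip(None, "ip-4-5-6-7", "ip-4-5-6-7"))
```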
3.2.3 Identifying Mail Provider ID for an MX Record. Once we have
computed IDs for each IP address, we next analyze the MX records.
If all IP addresses of an MX record have the same ID from certificates,
we assign that ID as the provider ID to the MX record. In cases where
IDs from certificates do not agree or are not available, we check if
all IP addresses share the same ID from Banner/EHLO messages. If
so, we assign that provider ID to the MX record. Otherwise, we fall
back to using the registered domain part of the MX record as the
provider ID.
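The fallback order described here can be written down in a few lines. The sketch assumes that each resolved IP address has already been reduced to a pair `(id_from_cert, id_from_banner_ehlo)` with `None` for a missing ID; the function names are hypothetical and the registered domain is again approximated by the last two labels.

```python
def registered_domain(fqdn):
    # Assumption: last two labels stand in for a Public Suffix List lookup.
    return ".".join(fqdn.lower().rstrip(".").split(".")[-2:])


def _unanimous(values):
    """Return the single shared value, or None if values differ or are missing."""
    distinct = set(values)
    if len(distinct) == 1:
        (only,) = distinct
        return only
    return None


def provider_id_for_mx(mx_host, ip_ids):
    """Pick a provider ID for an MX record.

    ip_ids holds one (id_from_cert, id_from_banner_ehlo) pair per IP
    address the MX resolves to. Certificates win, then Banner/EHLO,
    then the registered domain of the MX name itself.
    """
    if ip_ids:
        cert_id = _unanimous(cert for cert, _ in ip_ids)
        if cert_id is not None:
            return cert_id
        banner_id = _unanimous(banner for _, banner in ip_ids)
        if banner_id is not None:
            return banner_id
    return registered_domain(mx_host)


# Rows from Table 3.
print(provider_id_for_mx("mx1.provider.com", [("provider.com", "provider.com")]))
print(provider_id_for_mx("mx.selfhosted.com", [(None, None)]))
```

Note that `mx.myvps.com` would also come out as provider.com under this rule, which is exactly the kind of case the misidentification check has to catch.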
3.2.4 Checking for Misidentifications. While this approach infers
the mail provider of an MX record correctly in most cases, a few
cases lead to misidentifications. In the above example, for domain
myvps.com, we infer that its MX record mx.myvps.com is operated by
provider.com using the ID from certificates. However, myvps.com is
running its own mail server on a VPS hosted with provider.com. In
fact, this example represents a situation that is hard to identify
both automatically and correctly: VPS servers hosted with web
hosting companies. Certain web hosting companies (e.g., GoDaddy
with domain name secureserver.net) allow their VPS servers to
create certificates under specific domain names (e.g.,
vps123.secureserver.net). Similarly, as mentioned above,
certificates can be misleading when third-party providers present
their customers’ certificates. Since there is no good way to
automatically detect such cases without prior knowledge, we have
to identify such situations manually.
Another source of error comes from Banner/EHLO messages.
Recall that Banner/EHLO messages are unrestricted text. Thus, a
server can falsely claim to be mx.google.com in its Banner/EHLO
messages. Since our approach prioritizes Banner/EHLO messages
over the MX record, we would mislabel such a server as google.com.
To efficiently find instances of misidentifications, we use the
observation that the corner cases mentioned above are for
unpopular servers, with few domains pointing at them. For example,
IP addresses used by VPS servers (and associated certificates) would
only show up a handful of times in our dataset. By contrast, IP
addresses (and their associated certificates) used by MX records
of popular third-party mail providers would generally be much
more common in our dataset, as those MX records would be used
by many domains. Thus, it is possible to quickly find potentially
misidentified MX records by looking at the number of domains
pointing at them.
We identify potential instances of misidentifications using the
observation above. We keep two global counters: the number of
domains that point to each IP address ($num_{IP}$) and the number
of domains that point to each certificate ($num_{Cert}$). For each
IP address, the confidence score of its mail provider ID inference
is $\max(num_{IP}, num_{Cert})$. If an IP address does not have
certificate information, $num_{Cert}$ is ignored. For any dataset
of a reasonable size, this score greatly reduces the number of
cases we need to examine. That said, it is still unrealistic to
perform such manual work for all the providers on large datasets.
Thus, we only check for misidentifications for large providers.
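One possible way to compute the confidence score is sketched here. The input dictionaries and the function name are hypothetical, and the sketch assumes that a domain counts once toward a certificate even if several of its IP addresses present that same certificate; the text does not specify this detail.

```python
from collections import Counter
from typing import Dict, Optional, Set


def confidence_scores(
    domain_ips: Dict[str, Set[str]],
    ip_cert: Dict[str, Optional[str]],
) -> Dict[str, int]:
    """Score each IP as max(num_IP, num_Cert); num_Cert is skipped if absent.

    domain_ips maps a domain to the IPs of its MX records.
    ip_cert maps an IP to its certificate identifier (or None).
    """
    num_ip: Counter = Counter()
    num_cert: Counter = Counter()
    for ips in domain_ips.values():
        for ip in ips:
            num_ip[ip] += 1
        # Assumption: count each certificate once per domain.
        for cert in {ip_cert.get(ip) for ip in ips} - {None}:
            num_cert[cert] += 1

    scores = {}
    for ip, count in num_ip.items():
        cert = ip_cert.get(ip)
        scores[ip] = count if cert is None else max(count, num_cert[cert])
    return scores
```

IP addresses with low scores are the candidates for manual inspection, since few domains point at them or at their certificates.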
[Figure 4: bar chart. x-axis: 200 domains with SMTP servers sampled from the target domains, in six groups (200 Alexa; 200 Alexa w/ unique MX; 200 .com; 200 .com w/ unique MX; 200 .gov; 200 .gov w/ unique MX). y-axis: number of domains inferred correctly (0 to 200). Bars: MX-only, cert-based, banner-based, and priority-based, with the portion examined in step 4 marked.]
Figure 4: Accuracy of different approaches on 200 domains sampled from the three lists of target domains.
Once we have identified potential candidates to examine, we
employ various heuristics to ease the process of manually going
through all of them. For example, we can quickly determine that a
server is falsely claiming to be google.com if it does not reside in
Google’s AS. Similarly, we observe that GoDaddy uses specific
hostnames for their dedicated servers (e.g.,
mailstore1.secureserver.net) and different patterns for VPS servers
(e.g., s1-2-3.secureserver.net). Such observations can help us
quickly sift through all candidates.
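Heuristics of this kind might be encoded as simple pattern checks. The sketch below is hypothetical: the regular expressions merely generalize from the example hostnames given in the text and are not a documented GoDaddy naming scheme, and the AS set lists only AS15169, Google's primary AS number, although Google operates others.

```python
import re

# Assumed patterns, generalized from the example hostnames in the text.
DEDICATED = re.compile(r"^mailstore\d+\.secureserver\.net$")
VPS = re.compile(r"^(s\d+(-\d+)+|vps\d+)\.secureserver\.net$")

# Assumption: only Google's primary AS is listed here.
GOOGLE_ASNS = {15169}


def classify_secureserver_host(hostname: str) -> str:
    """Label a secureserver.net hostname as dedicated, vps, or unknown."""
    host = hostname.lower().rstrip(".")
    if DEDICATED.match(host):
        return "dedicated"
    if VPS.match(host):
        return "vps"
    return "unknown"


def is_suspicious_google_claim(provider_id: str, asn: int) -> bool:
    """Flag servers labeled google.com that sit outside Google's AS."""
    return provider_id == "google.com" and asn not in GOOGLE_ASNS
```

Hosts classified as "vps" or flagged as suspicious would then have their provider ID corrected during manual review.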
3.2.5 Identifying Mail Provider ID for Domain. At the end, every
MX record will have an assigned mail provider ID. This assignment
is based on TLS certificate information, Banner/EHLO messages, or
the MX record itself. Based on the MX record that a domain uses,
we can assign a mail provider to that domain. In the case that a
domain has more than one primary MX record (multiple MX records
with the same priority but different provider IDs, which happens
occasionally), we split the domain across the multiple providers.
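The domain-level assignment could look roughly like the following. This is a sketch under stated assumptions: the function name and input shapes are hypothetical, the primary MX records are taken to be those with the numerically lowest preference value (standard MX semantics), and the equal fractional split is our assumption, as the text does not say how a domain is divided.

```python
from typing import Dict, List, Tuple


def providers_for_domain(
    mx_records: List[Tuple[int, str]],
    provider_of_mx: Dict[str, str],
) -> Dict[str, float]:
    """Map a domain's primary MX records to fractional provider shares.

    mx_records holds (preference, mx_name) pairs; a lower preference
    value means higher priority.
    """
    if not mx_records:
        return {}
    best = min(pref for pref, _ in mx_records)
    primary_ids = {
        provider_of_mx[name] for pref, name in mx_records if pref == best
    }
    share = 1 / len(primary_ids)  # assumed equal split
    return {provider_id: share for provider_id in primary_ids}
```

A domain whose two equal-priority MX records belong to different providers contributes 0.5 to each, while lower-priority backup MX records are ignored.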
3.3 Relative Accuracy of Approaches
The priority-based approach combines the use of TLS certificates,
Banner/EHLO messages, and MX records. Each of these sources
could be independently used to determine the mail provider for a
domain. As such, we have four potential approaches: (1) the MX-only
approach [36], (2) a cert-based approach that combines TLS
certificates and MX records, (3) a banner-based approach that
combines Banner/EHLO messages and MX records, and (4) the
priority-based approach that combines TLS certificates, Banner/EHLO
messages and MX records.
We evaluate the four approaches and their relative accuracy
using domains sampled from three sets of domains in two ways, with
200 random domains per sample, resulting in an evaluation set of
1,200 domains. The three sets of domains we randomly sample are:
all .gov domains, a stable set of domains from the Alexa list, and a
stable set of 1 million .com domains (see Section 4.1 for how we
define stable domains). From each of the three sets we sample
(a) 200 domains and (b) 200 domains with unique MX records.
Since there is no ground truth for mail providers, we use domains
with SMTP servers, scan the relevant information ourselves, and
manually label their providers.⁴ We then use this labeled data to
compare the results of the different methods.
⁴ Note that we select 200 domains with SMTP servers to ensure a fair comparison
across different methods. Some methods (e.g., the MX-only approach) are oblivious
to SMTP server presence, and their accuracy drops considerably if domains with MX
records but without SMTP servers are in the sample.
Figure 4 shows the results. The dark green part of the
priority-based approach highlights the total number of candidates
manually examined in step 4 (check for misidentifications) of our
approach. In general, the priority-based approach works the best
among all four approaches across all sampled sets of domains, with
an accuracy of at least 97%. In total, it missed 21 domains (1.8%)
out of 1,200 domains sampled and required us to manually examine
20 (1.7%) domains.
Among the 21 domains it missed, we cannot decide the providers of
4 domains. Three of these four domains are hosted on servers with
unpopular web hosting companies. We do not have enough
information and confidence to decide if the servers are VPS instances
rented from the web hosting companies or directly managed by
them. The fourth presents a valid certificate of company A, but
indicates that it is company B in Banner/EHLO messages (a situation
much like utexas.edu described above). However, unlike utexas.edu,
which is hosted with a well-known provider, both company A and
B are relatively unpopular and we are not confident enough to
decide whether company A or B is running the mail server. Out of
the 17 domains for which we decide the provider, 11 are VPS servers
that use subdomains of the web hosting companies in their
certificates or Banner/EHLO messages (like the GoDaddy example
mentioned above),⁵ 4 are poorly configured servers operated by web
hosting companies with Banner/EHLO messages containing strings
like localhost, and 2 are poorly configured local servers that supply
misleading FQDNs in their Banner/EHLO messages. For the
20 domains that require manual examination, our heuristic, which
we publish together with our code, can automatically determine if
they need to be corrected. The amount of labor required in this step
is small.
The MX-only approach, on the other hand, relies upon just one
data source, and consequently performs the worst among all four
approaches (notably with an accuracy of only 40% for 200 random
.com domains with unique MX records). We also observe that its
performance is significantly better on Alexa and .gov domains than
on .com domains. We suspect two factors contribute to this
phenomenon. On the one hand, if a domain (e.g., foo.com) is hosted
with a web hosting company, often its MX record will be configured
as mx.foo.com (a default configuration employed by many web hosting
companies), leading the MX-only approach to believe that the domain
runs its own mail infrastructure. On the other hand, stable Alexa
and .gov domains are generally well-configured and more likely to
name their mail providers in the MX records, in which case the
MX-only approach works well.
⁵ Recall that we only check for misidentifications for large providers.
Considering information from certificates and Banner/EHLO
messages increases accuracy by at least a few percent. Note that
the banner-based approach performs better than the cert-based
approach. This is because, as mentioned in Section 3.1, while more
reliable, certificate information is less often available than
Banner/EHLO messages. Finally, we note that the banner-based
approach achieves an accuracy that is close to the priority-based
approach in most cases. These results suggest that the
banner-based approach is a good fallback in cases where certificates
are not available.
Overall, the priority-based approach performs the best among
these four approaches, identifying at least 5 and at most 115 more
domains than the MX approach on the 200 sampled domains.
3.4 Limitations
The priority-based approach does have several limitations. First, the
flow of exchanging e-mail could involve multiple hops, and we only
observe the first step of delivery using DNS MX records. As a result,
our inference result may not always reflect the eventual e-mail
provider used by users of a domain. Certain heuristics, such as SPF
records, might help discover the eventual e-mail provider. However,
this is not the focus of our work and we leave this as future work.
Second, the MX records of a domain could point to any arbitrary
server, and there is no guarantee that the server is actually the one
responsible for handling the domain’s incoming mail. However, this
is a limitation that all approaches share. Furthermore, we develop
a generic inference method based on IPv4 addresses. We imagine
future work extending this method to incorporate IPv6 addresses
and better handle corner cases in an automatic way (e.g., with
machine learning techniques). Finally, the priority-based approach
relies on both DNS data and active measurement data. To carry
out the longitudinal analysis in Section 5, we rely on scanning
information made available by third-party services like OpenINTEL
and Censys. As such, our results can have blind spots (e.g., Censys
may not scan IP addresses if certain providers choose to opt out of
scans or if it has a bug).
4 LARGE-SCALE IDENTIFICATION OF MAIL
PROVIDERS
We now apply the priority-based approach to three lists of target
domains collected from OpenINTEL [38] and Censys [12]. For each
list we consider nine separate days of data (except for the .gov
domains, for which we only had seven snapshots), equally spaced
over a four-year period between June 2017 and June 2021.
4.1 Target Domains
The first set of domains consists of the Alexa Top 1M domains [3]
that have an MX record in their DNS zone. To capture long-term
dynamics in mail provider use, we only consider stable domains
that consistently appear on the Alexa Top lists across the four years
of our study. Considering only the domains that are stable across
the years also eliminates noise from the churn [31] in the Alexa
Top 1M rankings.
Since the Alexa domains are by definition popular domains, for
comparison we also use a set of stable, random .com domains as a
second list. As with the Alexa domains, we consider .com domains
with MX records that are registered across the four years. We start
by randomly choosing 1M .com domains on June 8, 2017 (the first
day we consider) and then filter out domains that expire before
June 8, 2021 (the last day we consider) or do not have MX records.
We remove Alexa domains that also appear in this dataset to create
a disjoint view.
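The notion of a stable, disjoint domain set can be illustrated with plain set operations. This is a simplified sketch: it treats a domain as stable if it appears with an MX record in every snapshot considered, which approximates but does not exactly reproduce the registration-based filter described for .com domains. The function names and sample data are hypothetical.

```python
from typing import Iterable, List, Set


def stable_domains(snapshots: List[Iterable[str]]) -> Set[str]:
    """Domains present (with MX records) in every snapshot."""
    sets = [set(snapshot) for snapshot in snapshots]
    if not sets:
        return set()
    return set.intersection(*sets)


def disjoint_view(domains: Iterable[str], alexa: Iterable[str]) -> Set[str]:
    """Remove domains that also appear in the stable Alexa set."""
    return set(domains) - set(alexa)


# Hypothetical snapshots of domains with MX records.
com_snapshots = [
    {"a.com", "b.com", "c.com"},
    {"a.com", "b.com"},
    {"a.com", "b.com", "d.com"},
]
alexa_stable = {"b.com"}
print(disjoint_view(stable_domains(com_snapshots), alexa_stable))  # {'a.com'}
```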
The last dataset consists of all .gov domains that have an MX
record in their DNS zone. Since OpenINTEL does not have coverage
of all .gov domains in 2017, our measurement data of .gov domains
starts in June 2018 and consists of seven snapshots instead of nine.
Similar to the .com domains, we remove Alexa domains that also
appear in this dataset to create a disjoint view.
Overall, the Alexa set contains 93,538 domains, the .com set
contains 580,537 domains, and the .gov set contains 3,496 domains.
The three sets of domains provide insight into the changing mail
provider landscape for popular domains, random domains sampled
from the full distribution of registrants in .com, and domains in a
restricted TLD.⁶
4.2 External Data Sources
To enable our longitudinal and large-scale identification of mail
providers, we use two external data sources: OpenINTEL [38] and
Censys [7, 12].
4.2.1 OpenINTEL: Active DNS Measurement Data. OpenINTEL is
a DNS measurement platform that collects snapshots of a large
part of the DNS on a daily basis. It does so by structurally querying
substantial lists of domain names for sets of Resource Records (RRs).
These lists include, for example, all registered domain names under
specific zones such as .com. Other sources of names, such as the
Alexa Top 1M, are also targeted for measurement. The resulting
data accounts for MX records as well as for IP addresses (i.e., A
records) associated with the names found inside MX records. By
using OpenINTEL data, which allows us to look years into the
past, we can investigate MX configuration at scale and perform a
longitudinal analysis.
4.2.2 Censys: Internet Scanning Data. Censys is a service that
performs regular Internet-wide scans on a wide range of ports in the
IPv4 address space, and publishes the data collected. For example,
Censys regularly scans IP addresses on port 25 and, if hosts
respond, collects application-layer information. For our study, we
use the port 25 scans that capture the banner and EHLO messages,
as well as any certificates discovered from the SMTP or STARTTLS
handshake. It is worth noting that, though Censys performs
Internet-wide scans, it may not have data for all IP addresses: the IP
address may not be publicly accessible, the IP address may be
blocked due to requests from the address owner, the host may not
listen on (or have open) the specific port on the day the scan was
performed, or the Censys scan may have failed to cover certain IP
addresses intermittently. These issues may skew results for methods
that rely upon certificates and Banner/EHLO messages. We also note
that Censys recently rolled out an upgraded scanning system, which
reportedly fixed some bugs and should have better coverage [8].
However, for consistency reasons, all of our data is taken from the
previous system.
⁶ Note that we randomly sampled 400 domains each from these three lists to evaluate
our methodology in Section 3.3.
Category                 Alexa Domains   COM Domains   GOV Domains
No MX IP                         1,692        23,040            49
No Censys                        3,215        17,842           160
No Port 25 Data                  8,419        63,042           200
No Valid SSL Cert.              19,920       279,002           665
No Valid Banner/EHLO             2,074         9,992           342
No Missing Data                 58,218       187,619         2,080
Total                           93,538       580,537         3,496
Table 4: Breakdown of data from the June 2021 snapshot
of the Alexa domains and random .com domains. These
domains have MX records and exist across nine snapshots
spanning four years.
4.3 Data Gathering
We start with the target list of domain names (e.g., stable domains
in the Alexa Top 1M list) as well as one or more dates for which to
gather data. We then extract from OpenINTEL the relevant DNS
records for domains in the target list on the selected dates. The
extracted data includes the MX records associated with the target
domains, as well as the IP addresses to which the names in those
MX records resolved. We use CAIDA’s IPv4 prefix-to-AS data [6]
to augment the IP addresses with routing information such as AS
number. For each IP address obtained from OpenINTEL, we query
Censys for the associated scanning information related to port 25.
This data includes the state of the port and data from SMTP and
STARTTLS handshakes, including Banner/EHLO messages and
certificates. Table 4 shows how we filter data collected for a day’s
snapshot.
4.4 Providers and Companies
On the data thus gathered, we then use the priority-based
approach (Section 3) to determine the mail providers for the domains.
Our methodology outputs provider IDs (in the form of registered
domains) as mail providers. For example, our methodology tags
google.com as the provider ID for netflix.com (as seen in Table 1).
The provider ID google.com can then be associated with the mail
service provider company, which is Google in this case. However, a
single company may have multiple provider IDs, which can either
be the result of different services operated by the company or
different sources of data (certificates, Banner/EHLO messages, or MX)
used to derive the provider ID. Table 5 shows various provider IDs
used by Microsoft and ProofPoint identified in our datasets as well
as the ASN information of the mail infrastructure.
For our analyses, we ultimately want to aggregate the registered
domains that make up provider IDs into the companies that operate
these names. This step requires a certain amount of manual work,
which makes a blanket analysis of providers infeasible. Instead,
we focus on the most prominent mail providers. We investigate
frequently-occurring names to identify prominent provider IDs. We
then map these provider IDs to companies by examining relevant
information (e.g., ASN and the provider ID itself) and searching on
the Internet.⁷ We use the resulting company information as input
for our analyses in Section 5.
Company: Microsoft
  Provider IDs: outlook.com, office365.us, hotmail.com, outlook.cn, outlook.de
  ASNs: 8075 (Microsoft), 200517 (MS Deutschland), 58593 (Blue Cloud)
Company: ProofPoint
  Provider IDs: gpphosted.com, ppops.net, pphosted.com, ppe-hosted.com
  ASNs: 52129 (ProofPoint), 26211 (ProofPoint), 22843 (ProofPoint), 13916 (ProofPoint), 15830 (Telecity Group)
Table 5: Provider IDs operated by Microsoft and ProofPoint
identified in our datasets.
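Once provider IDs have been mapped to companies by hand, the aggregation itself is a dictionary lookup. The sketch below uses the provider IDs listed in Table 5; the function names are hypothetical, and treating an unmapped provider ID as its own "company" is our assumption for illustration.

```python
from collections import Counter
from typing import Dict

# Provider IDs from Table 5, keyed to the operating company.
COMPANY_BY_PROVIDER_ID = {
    "outlook.com": "Microsoft",
    "office365.us": "Microsoft",
    "hotmail.com": "Microsoft",
    "outlook.cn": "Microsoft",
    "outlook.de": "Microsoft",
    "gpphosted.com": "ProofPoint",
    "ppops.net": "ProofPoint",
    "pphosted.com": "ProofPoint",
    "ppe-hosted.com": "ProofPoint",
}


def company_for(provider_id: str) -> str:
    """Return the company for a provider ID (assumed fallback: the ID)."""
    return COMPANY_BY_PROVIDER_ID.get(provider_id, provider_id)


def company_counts(provider_id_by_domain: Dict[str, str]) -> Counter:
    """Count domains per company given each domain's provider ID."""
    return Counter(company_for(pid) for pid in provider_id_by_domain.values())
```

For example, domains labeled outlook.com and hotmail.com both count toward Microsoft, which is how several provider IDs collapse into one market-share figure.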
5 ANALYSIS
In this section we characterize various aspects of mail providers
identified for our target set of popular and random domains
(Section 4). We characterize the market share, infrastructure and
services provided by the dominant companies in e-mail delivery, their
trends over time with particular focus on e-mail security services
and web hosting companies, the dynamics of domains switching
companies over the span of our data set, and mail provider
preferences across different countries.
5.1 Market Share of Top Companies
We start by examining the most popular companies that MX records
refer to. We use the priority-based approach from Section 3 to
identify the provider IDs most prevalent. We then associate these
provider IDs with companies (Section 4.4).
Figure 5 shows the top five companies for the three sets of
domains in the most recent snapshot in our dataset (June 2021).⁸ Since
prior work [30, 31] has demonstrated that the nature of domains
in Alexa varies with rank, we also present the top five companies
for domains in the Alexa Top 1k, 10k and 100k. Finally, for .gov
domains we also identify the top five companies separately for
federal and non-federal domains.
For Alexa domains of different ranks, the top two are consistently
mail hosting providers (Google and Microsoft). For the top 1k, 10k
and 100k domains, the third most popular company is ProofPoint,
an e-mail security company. However, when considering all Alexa
domains, the third company is Yandex, a Russian mail hosting
provider. We suspect this likely reflects the presence of many .ru
domains in the long tail of Alexa domains.
⁷ Note that our list of provider IDs associated with a company is never meant to be
exhaustive. Identifying domain names owned by the same company is a research
question by itself [40].
⁸ The appendix contains a longer table that lists the number and percentage of the top
15 companies in each set of domains. Provider IDs associated with each company can
be found in our GitHub repository: https://github.com/ucsdsysnet/mx_inference
Figure 5: Top providers and the number and percentage of
domains using these companies in different sets of domain
names (Jun. 2021).
We observe similar phenomena in .gov domains: Microsoft and
Google are the most prominent companies (although their market
shares are reversed), followed by several e-mail security companies
(Barracuda, ProofPoint and Mimecast). That said, we observe a
non-negligible number of domains pointing at mail servers operated by
the US Department of Health and Human Services (hhs.gov) and the
US Department of the Treasury (treasury.gov) among federal domains.
Manually checking a random sample suggests that most of these
domains are either directly operated by or closely related to the two
departments.
Finally, for .com domains, we note a slightly different company
distribution. While Google and Microsoft still have a significant
presence, the other companies are web hosting providers (GoDaddy,
UnitedInternet, and EIG). Indeed, GoDaddy by far has the dominant
market share among the random .com domains.⁹ In contrast to the
Alexa and .gov sets, the random domains reflect the full distribution
of sites using MX records. This distribution has a long tail with
many small sites, and it is not surprising that many of them operate
using the infrastructure of their hosting provider. Finally, while
e-mail security services such as ProofPoint and Mimecast do not
rank highly among the random domain set, Section 5.2.2 shows
that such services are increasing in popularity over time.
⁹ Given GoDaddy’s dominance, we performed sanity checks to ensure that the domains
using GoDaddy are not simply parked domains. Indeed, when we registered a domain
or published a website using the registered domain with GoDaddy, GoDaddy did not
automatically set up the MX record for the domain. Instead, GoDaddy only configured
an MX record for the domain when an e-mail address was created and associated with
the domain.
5.2 Longitudinal Trends
5.2.1 Top Companies. While Figure 5 shows the most recent
breakdown for the top companies, we now use the full data set to
examine the breakdown for top companies longitudinally.
For each of the companies from the Alexa data set in Figure 5,
Figure 6 shows the percentage and number of domains whose MX
records point to those companies over the four years of our data
set. Each curve corresponds to one of the companies. While not
dramatic, the trends are all steady increases over time. The top
five companies combined are used by 40.1% of MX records in 2017,
and the total increases to 49.0% by June 2021. Google dominates
the market with Gmail, with Microsoft and Outlook a notable
second, and both continue to steadily increase market share. Google
increases from 26.2% to 28.5% from 2017 to 2021, and Microsoft
likewise increases from 7.9% to 10.8%.
Notably, ProofPoint and Mimecast are both in the top five and
increase their market share over the past four years. These
companies are not mail providers, but instead provide an e-mail
security service. We explore the rise of such e-mail security services
in more detail in Section 5.2.2.
The Self-Hosting curve shows the percentage of domains that
host their own SMTP server, rather than using a separate provider.
We estimate the number of domains that are self-hosted by looking
for domains whose provider ID is the same as their registered domain
name. The trend for self-hosting is the opposite of that of the top
companies. The percentage of domains that self-host steadily
decreased over the four years of our data set, falling from 11.7% in
2017 to 7.9% in 2021. Section 5.3 below explores where they switch
to in more detail.
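The self-hosting estimate reduces to comparing each domain with its inferred provider ID. The following is a minimal sketch with a hypothetical function name, assuming the dictionary keys are already registered domain names.

```python
from typing import Dict, Tuple


def self_hosted_share(provider_id_by_domain: Dict[str, str]) -> Tuple[int, float]:
    """Return (count, fraction) of domains whose provider ID is themselves.

    Keys are assumed to be registered domain names.
    """
    total = len(provider_id_by_domain)
    if total == 0:
        return 0, 0.0
    count = sum(
        1 for domain, pid in provider_id_by_domain.items() if pid == domain
    )
    return count, count / total


# Hypothetical input: one of four domains runs its own mail server.
sample = {
    "example.org": "example.org",
    "foo.com": "google.com",
    "bar.com": "outlook.com",
    "baz.com": "secureserver.net",
}
print(self_hosted_share(sample))  # (1, 0.25)
```

As a consistency check on the reported figures, 7,407 self-hosted domains out of 93,538 stable Alexa domains in June 2021 is about 7.9%.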
Figures 6d and 6g similarly show the trends over time for the
top companies serving the random .com and .gov data sets in
Figure 5. We note that Censys is only intermittently successful in
scanning EIG for unknown reasons. Thus, for the longitudinal
results we show OVH instead, which is the sixth largest company in
.com domains and is scanned reliably over time. As with the Alexa
data set, the market share of the dominant mail providers (Google
and Microsoft) increases over time for .com and .gov domains.¹⁰
The consolidation around Google and Microsoft applies not only to
popular domains, but to domains across the full distribution. In
contrast, though, the market share of hosting providers is steadily
decreasing (GoDaddy and UnitedInternet) or flat (OVH) over time.
Either there are fewer customers of hosting providers overall, or
more of their customers switch away from using the default mail
service of the hosting provider.
While the random .com data set has over five times the number
of Alexa domains (580,537 vs. 93,538), the number of self-hosted
domains in .com is significantly smaller than that of the Alexa
domains (1,836 vs. 7,407 in June 2021) and this number slightly
decreased over the last four years. This result matches our
expectation that most .com domains are small sites hosted with other
companies.
¹⁰ This does not hold for Google in the .gov data set from 2019-12 to 2021-06. A quick
sanity check suggests that the majority of the domains moving away from Google
were moving to Microsoft.
[Figure 6: nine line plots over the snapshots from 2017-06 to 2021-06. Panels: (a) Top Companies in Alexa; (b) Popular E-mail Security Companies in Alexa; (c) Popular Web Hosting Companies in Alexa; (d) Top Companies in COM; (e) Popular E-mail Security Companies in COM; (f) Popular Web Hosting Companies in COM; (g) Top Companies in GOV; (h) Popular E-mail Security Companies in GOV; (i) Popular Web Hosting Companies in GOV. Axes: number of domains using the providers and percentage of domains using the providers.]
Figure 6: Market share of different types of services from 2017 to 2021. Note that the y-axes of all graphs show the same
quantities, but the value ranges are distinct to each graph.
5.2.2 E-mail Security Services. Figure 6a highlighted ProofPoint
and Mimecast in the top five companies used by popular domains.
These companies provide e-mail security services that can operate
as a third-party filter for inbound e-mail delivery, removing the
need to purchase and manage a local appliance. Customers use
MX records to direct mail agents to deliver mail intended for the
customer to the security provider instead, either by explicitly using
a provider domain in the MX record (e.g., ge.com, which has MX
record mx0a-00176a02.pphosted.com) or by using a customer domain
whose A record uses a provider IP address (e.g., albabotanica.com,
which has MX record mx1.haingrp.com that resolves to a ProofPoint
IP). The provider then performs spam filtering, phishing detection,
URL rewriting, etc., on behalf of the customer, and subsequently
forwards the customer’s mail to the customer’s servers.
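The two ways a customer can point mail at a security provider can be told apart with a name check and an address check. The sketch below is illustrative only and is not how the paper identifies providers (it uses certificates and Banner/EHLO messages): the function name is hypothetical, and the network 192.0.2.0/24 is a documentation prefix used as a placeholder, not an actual ProofPoint range.

```python
import ipaddress
from typing import Iterable, Optional


def security_provider_match(
    mx_name: str,
    mx_ips: Iterable[str],
    provider_domains: Iterable[str],
    provider_networks: Iterable[str],
) -> Optional[str]:
    """Report how an MX record points at a provider, if it does."""
    host = mx_name.lower().rstrip(".")
    # Case 1: the MX name itself is under a provider domain.
    if any(host == d or host.endswith("." + d) for d in provider_domains):
        return "explicit-mx-name"
    # Case 2: the MX name is the customer's, but resolves into provider space.
    networks = [ipaddress.ip_network(n) for n in provider_networks]
    for ip in mx_ips:
        if any(ipaddress.ip_address(ip) in net for net in networks):
            return "provider-ip"
    return None


domains = {"pphosted.com"}
networks = ["192.0.2.0/24"]  # placeholder documentation prefix
print(security_provider_match("mx0a-00176a02.pphosted.com", [], domains, networks))
# explicit-mx-name
print(security_provider_match("mx1.haingrp.com", ["192.0.2.10"], domains, networks))
# provider-ip
```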
The rise of ProofPoint and Mimecast suggests that such
companies are becoming a more attractive service option. To explore
this point further, in addition to ProofPoint and Mimecast, we
manually identified three other popular companies in the third-party
e-mail security market across our data sets. Figures 6b, 6e, and 6h
show the percentage of MX records that refer to each of five
prominent third-party e-mail security companies over time for the
Alexa, .com, and .gov domains, respectively. The results confirm that
these services are becoming increasingly attractive for both popular
and random domains, as security incidents via e-mail continue to be
a major concern.
5.2.3 Web Hosting Companies. Web hosting companies like
GoDaddy make it convenient for hosted domains to use company
infrastructure for a variety of services including e-mail delivery. As
we saw in Figure 6d, though, fewer domains over time are taking
advantage of hosting company e-mail delivery. We expand upon these
results by manually identifying the top five Web hosting companies
across our data sets.
[Figure 7: Sankey diagram titled "Churn in Self-Hosted Domains (2017 to 2021)". Categories shown for both 2017 and 2021: 1. Google, 2. Microsoft, 3. Yandex, 4. Top100, 5. Self-Hosted, 6. Others, 7. No SMTP.]
Figure 7: Sankey graph that demonstrates churn in Mail
Providers for Alexa domains from 2017 to 2021
Figures 6c, 6f and 6i show the number and percentage of MX
records referring to each of these companies in the Alexa, .com, and
.gov data sets, respectively. In all cases the trends are the same.
The most popular hosting companies (GoDaddy and UnitedInternet)
have fewer domains using their e-mail delivery services over time,
and the trend is particularly pronounced among the large sites
using popular domains in the Alexa data set. The remaining hosting
companies are comparatively flat.
5.3 Churn
Recall that the set of domains we study have valid MX records for
the entire duration of our data set. During this time there is churn
in the values of the MX records that reflect administrative decisions
about mail delivery. Some domains that initially used Google, for
instance, may switch to Microsoft during the four years. Similarly,
other domains that were self-hosting might switch to Google.
Figure 7 is a Sankey diagram illustrating changes in MX records
between the first snapshot in the Alexa data set (June 2017) and
the last (June 2021). The diagram groups the domains into various
categories: the top three third-party mail hosting providers (Google,
Microsoft, Yandex); the remaining top 100 providers; self-hosted
domains; all other providers; and the residual set that either had
no responding SMTP server or timed out during a Censys scan. For
each category, the diagram shows the number of domains using that
company that did not change, the number of domains that used the
company in 2017 but switched to another by 2021 (outgoing flows),
and the number of domains that switched to use the company by
2021 (incoming flows).
While the use of the top companies increased over time, the
diagram shows that domains from all of the various categories
contributed to this increase (e.g., the incoming flows to Google). From
the perspective of domains that switched providers, we in particular
highlight the changes that occurred to self-hosted domains between
2017 and 2021. While self-hosted domains switched to providers
across all categories, more than a quarter of them changed their
mail provider to Google or Microsoft, a quantity larger than the
sum of domains that switched to providers ranked in the remaining
top 100.
5.4 Mail Provider Preferences by Country
Finally, we explore the existence of national biases in e-mail service
provider choice. Since we have no easy mechanical way to classify
the national origin of individual gTLD domains (such as those in
.com), we focused on country code top-level domains (ccTLDs) found
in our stable subset of the Alexa top 1M list as a proxy. We consider
fifteen ccTLDs, namely: .br (Brazil), .ar (Argentina), .uk (the United
Kingdom), .fr (France), .de (Germany), .it (Italy), .es (Spain),
.ro (Romania), .ca (Canada), .au (Australia), .ru (Russia), .cn (China),
.jp (Japan), .in (India) and .sg (Singapore); thus we assume, for
example, that domains under .ru are likely Russian in origin.¹¹
Among the domains in these ccTLDs, we focus on the use of four
popular e-mail service providers: Google, Microsoft, Tencent and
Yandex, representing the two dominant e-mail service providers in
the US and each of the dominant e-mail service providers in China
and Russia, respectively. For each of these four providers, Figure 8
shows the percentage (and absolute number) of domains in each of
our ccTLD sets that make use of the service (June 2021).
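The per-ccTLD percentages and absolute counts described above can be sketched as follows. This is a minimal illustration that assumes a precomputed mapping from domain to inferred mail provider; the sample domains, the "Self-hosted" label, and the helper names cctld_of and provider_share are hypothetical and not part of the paper's methodology or code.

```python
from collections import defaultdict

# The fifteen ccTLDs considered in the text.
CCTLDS = {"br", "ar", "uk", "fr", "de", "it", "es", "ro",
          "ca", "au", "ru", "cn", "jp", "in", "sg"}
# The four providers of interest.
PROVIDERS = ("Google", "Microsoft", "Tencent", "Yandex")


def cctld_of(domain):
    """Return the last label if it is one of the fifteen ccTLDs, else None."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld if tld in CCTLDS else None


def provider_share(domain_to_provider):
    """Map ccTLD -> provider -> (domain count, percentage of that ccTLD)."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for domain, provider in domain_to_provider.items():
        tld = cctld_of(domain)
        if tld is None:
            continue  # gTLD domains such as .com are not classified
        totals[tld] += 1
        if provider in PROVIDERS:
            counts[tld][provider] += 1
    return {
        tld: {p: (counts[tld][p], 100.0 * counts[tld][p] / totals[tld])
              for p in PROVIDERS}
        for tld in totals
    }


# Hypothetical input: domain -> inferred mail provider.
sample = {
    "shop.example.ru": "Yandex",
    "news.example.ru": "Google",
    "mail.example.ru": "Yandex",
    "blog.example.ru": "Self-hosted",
    "store.example.co.uk": "Google",
    "site.example.com": "Google",
}
shares = provider_share(sample)
print(shares["ru"]["Yandex"])
```

Using only the final label as the country proxy mirrors the assumption stated in the text, and it inherits the same limitation: a domain registered under a ccTLD is counted for that country regardless of who operates it.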
There are two clear takeaways. First, Google and Microsoft, the
two dominant US-based e-mail service providers, appear to be in
wide use by organizations outside the US, particularly across
Europe, North America, South America, large parts of Asia and,
to a lesser extent, Russia (but not China). For example, 65% of the
.br domains in our set host mail with Google or Microsoft
(significantly exceeding even the baseline market share for our stable
Alexa domains of 39.3%). This is of note because under US law
(particularly as clarified by the recent Cloud Act’s modification to
the Stored Communications Act [1]) providers operating in the US
can be legally compelled to provide information under their control
(including e-mail content) to US law enforcement regardless of the
location of the data, or the nationality or residency of the customer
using the data. The second clear result is that Yandex and Tencent
are comparatively isolated, primarily serving domains only from
the ccTLD matching their own country of origin. Indeed, the handful
of deviations from these patterns primarily reflect domains for
companies whose national origin is not reflected by their choice of
ccTLD.¹²
It is an open question to what extent this discrepancy is
driven entirely by market power and infrastructure deployment
(i.e., that domain holders do not consider the jurisdictional risk of
hosting mail service with a foreign-owned company and are simply
picking those who are best able to support their feature,
performance, availability and price requirements) or whether it also
reflects an explicit trust decision (i.e., that European and Brazilian
companies are sufficiently comfortable being subject to US jurisdiction that
they do not seek local alternatives). Similarly, the dominance of
Tencent and Yandex in their local markets may, in part, reflect
marketing and infrastructure deployment advantages in their home
countries. However, a key role may also be played by state-imposed
security review requirements in those countries that US service
providers are unwilling or unable to meet. Regardless of reason, the
key result is the same: the centralization of e-mail service has been
heterogeneous across the globe, with certain providers dominating
certain markets. However, it is primarily US-based e-mail service
providers who have been effective in attracting foreign customers,
despite the additional legal risks posed by this arrangement.
¹¹ While there are individual instances that may deviate from this
assumption (e.g., google.ru), we believe it is predominately true in
aggregate (i.e., across the 10,000 plus .ru domains we consider, the
majority are Russian-operated).
¹² For example, Shein is a Chinese-owned apparel company that operates
in the UK under shein.co.uk. Similarly, bitrix24 is a Russian-owned
Cloud collaboration service that operates under a number of ccTLD
aliases including bitrix24.fr.
Figure 8: Mail Provider Preferences by Country (ccTLD)
6 CONCLUSION
In this paper, we have presented a methodology for mapping Inter-
net domains to mail service providers. Our methodology combines
DNS data with active measurement data to significantly improve
accuracy. We have applied this technique to large sets of domains to
identify and characterize the current distribution of dominant mail
providers. Additionally, our longitudinal study over four years has
empirically documented the steady consolidation of Internet e-mail
service towards a small number of providers. Finally, we explore
the extent to which nationality (and hence legal jurisdiction) plays
a role in such mail provisioning decisions.
The analysis code and results for this paper are available at
https://github.com/ucsdsysnet/mx_inference.
7 ACKNOWLEDGMENTS
We thank our anonymous shepherd and reviewers for their insight-
ful and constructive suggestions and feedback. We also thank Cindy
Moore for her support of the software and hardware infrastructure
necessary for this project, and Stewart Grant for his suggestions
and feedback. Funding for this work was provided in part by Na-
tional Science Foundation grants CNS-1629973 and CNS-1705050,
the UCSD CSE Postdoctoral Fellows program, the Irwin Mark and
Joan Klein Jacobs Chair in Information and Computer Science,
the EU H2020 CONCORDIA project (830927), generous support
from Google, and operational support from the UCSD Center for
Networked Systems. This research used data from OpenINTEL, a
project of the University of Twente, SURF, SIDN, and NLnet Labs.
REFERENCES
[1] Stored Communications Act. 2018. 18 USC 2713. Required preservation and disclosure of communications and records.
[2] Mike Afergan and Robert Beverly. 2005. The state of the email address. ACM SIGCOMM Computer Communication Review 35, 1 (2005), 29–36.
[3] Alexa. 2021. Top 1M sites. https://toplists.net.in.tum.de/archive/alexa/
[4] Mark Allman. 2018. Comments on DNS Robustness. In 2018 Internet Measurement Conference. ACM, Boston, MA.
[5] J. Arkko, B. Trammell, M. Nottingham, C. Huitema, M. Thomson, J. Tantsura, and N. ten Oever. 2019. Considerations on Internet Consolidation and the Internet Architecture. https://tools.ietf.org/html/draft-arkko-iab-internet-consolidation-02
[6] CAIDA. 2021. Routeviews Prefix to AS mappings Dataset for IPv4 and IPv6. http://www.caida.org/data/routing/routeviews-prefix2as.xml
[7] Censys. 2020. Bulk Data. Censys. https://censys.io/data
[8] Censys. 2021. Censys Search 2.0 Official Announcement. https://support.censys.io/hc/en-us/articles/360060941211-Censys-Search-2-0-Official-Announcement
[9] Jianjun Chen, Vern Paxson, and Jian Jiang. 2020. Composition kills: A case study of email sender authentication. In 29th USENIX Security Symposium (USENIX Security 20). 2183–2199.
[10] Constance Bommelaer de Leusse and Carl Gahnberg. 2019. The Global Internet Report: Consolidation in the Internet Economy. Internet Society. https://www.internetsociety.org/blog/2019/02/is-the-internet-shrinking-the-global-internet-report-consolidation-in-the-internet-economy-explores-this-question/
[11] Viktor Dukhovni and Wes Hardaker. 2015. SMTP Security via Opportunistic DNS-Based Authentication of Named Entities (DANE) Transport Layer Security (TLS). RFC 7672. 34 pages. https://doi.org/10.17487/RFC7672
[12] Zakir Durumeric, David Adrian, Ariana Mirian, Michael Bailey, and J. Alex Halderman. 2015. A Search Engine Backed by Internet-Wide Scanning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (Denver, Colorado, USA) (CCS ’15). ACM, New York, NY, USA, 542–553. https://doi.org/10.1145/2810103.2813703
[13] Zakir Durumeric, David Adrian, Ariana Mirian, James Kasten, Elie Bursztein, Nicolas Lidzborski, Kurt Thomas, Vijay Eranti, Michael Bailey, and J Alex Halderman. 2015. Neither snow nor rain nor MITM... an empirical analysis of email delivery security. In Proceedings of the 2015 Internet Measurement Conference. ACM, New York, NY, USA, 27–39.
[14] Ian D Foster, Jon Larson, Max Masich, Alex C Snoeren, Stefan Savage, and Kirill Levchenko. 2015. Security by any other name: On the effectiveness of provider based email security. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, New York, NY, USA, 450–464.
[15] Alex Hern. 2020. Google suffers global outage with Gmail, YouTube and majority of services affected. The Guardian. https://www.theguardian.com/technology/2020/dec/14/google-suffers-worldwide-outage-with-gmail-youtube-and-other-services-down
[16] Paul E. Hoffman. 2002. SMTP Service Extension for Secure SMTP over Transport Layer Security. RFC 3207. 9 pages. https://doi.org/10.17487/RFC3207
[17] Cecilia Kang and David McCabe. 2020. Lawmakers, United in Their Ire, Lash Out at Big Tech’s Leaders - The New York Times. The New York Times. https://www.nytimes.com/2020/07/29/technology/big-tech-hearing-apple-amazon-facebook-google.html
[18] Dr. John C. Klensin. 2008. Simple Mail Transfer Protocol. RFC 5321. https://doi.org/10.17487/RFC5321
[19] Dr. John C. Klensin and Randall Gellens. 2011. Message Submission for Mail. RFC 6409. https://doi.org/10.17487/RFC6409
[20] Brian Krebs. 2017. At Least 30,000 U.S. Organizations Newly Hacked Via Holes in Microsoft’s Email Software. Krebs on Security. https://krebsonsecurity.com/2021/03/at-least-30000-u-s-organizations-newly-hacked-via-holes-in-microsofts-email-software/
[21] Public Suffix List. 2021. Public Suffix List. https://publicsuffix.org/
[22] D. Liu, S. Hao, and H. Wang. 2016. All Your DNS Records Point to Us: Understanding the Security Threats of Dangling DNS Records. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (Vienna, Austria) (CCS). ACM, New York, NY, USA, 1414–1425. https://doi.org/10.1145/2976749.2978387
[23] P. Mockapetris. 1987. Domain Names - Implementation and Specification. RFC 1035. https://rfc-editor.org/rfc/rfc1035.txt
[24] Keith Moore and Chris Newman. 2018. Cleartext Considered Obsolete: Use of Transport Layer Security (TLS) for Email Submission and Access. RFC 8314. https://doi.org/10.17487/RFC8314
[25] Giovane CM Moura, Sebastian Castro, Wes Hardaker, Maarten Wullink, and Cristian Hesselman. 2020. Clouding up the Internet: how centralized is DNS traffic becoming?. In Proceedings of the ACM Internet Measurement Conference. ACM, New York, NY, USA, 42–49.
[26] Craig Partridge. 1986. Mail routing and the domain system. RFC 974. https://doi.org/10.17487/RFC0974
[27] Jonathan B. Postel. 1982. Simple Mail Transfer Protocol. RFC 821. https://doi.org/10.17487/RFC0821
[28] Protonmail. 2021. Verify your custom domain and set MX record. https://protonmail.com/support/knowledge-base/dns-records/
[29] Joshua Avery Reed and JC Reed. 2020. Potential Email Compromise via Dangling DNS MX.
[30] Walter Rweyemamu, Tobias Lauinger, Christo Wilson, William K. Robertson, and E. Kirda. 2019. Clustering and the Weekend Effect: Recommendations for the Use of Top Domain Lists in Security Research. In PAM.
[31] Quirin Scheitle, Oliver Hohlfeld, Julien Gamba, Jonas Jelten, Torsten Zimmermann, Stephen D. Strowes, and Narseo Vallina-Rodriguez. 2018. A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists. In Proceedings of the Internet Measurement Conference 2018 (Boston, MA, USA) (IMC ’18). Association for Computing Machinery, New York, NY, USA, 478–493. https://doi.org/10.1145/3278532.3278574
[32] Kaiwen Shen, Chuhan Wang, Minglei Guo, Xiaofeng Zheng, Chaoyi Lu, Baojun Liu, Yuxuan Zhao, Shuang Hao, Haixin Duan, Qingfeng Pan, et al. 2020. Weak Links in Authentication Chains: A Large-scale Analysis of Email Sender Spoofing Attacks. arXiv preprint arXiv:2011.08420 (2020).
[33] Rob Siemborski and Alexey Melnikov. 2007. SMTP Service Extension for Authentication. RFC 4954. https://doi.org/10.17487/RFC4954
[34] Statistica. 2021. Number of sent and received e-mails per day worldwide from 2017 to 2024. https://www.statista.com/statistics/456500/daily-number-of-e-mails-worldwide/
[35] Google Support. 2021. Set up MX records for Google Workspace email - Google Workspace Admin Help. https://support.google.com/a/answer/140034?hl=en
[36] Jason Trost. 2020. Mining DNS MX Records for Fun and Profit. Medium. https://medium.com/@jason_trost/mining-dns-mx-records-for-fun-and-profit-7a069da9ee2d
[37] Roland van Rijswijk-Deij, Mattijs Jonker, Anna Sperotto, and Aiko Pras. 2015. The Internet of Names: A DNS Big Dataset. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. 91–92. https://doi.org/10.1145/2785956.2789996
[38] Roland van Rijswijk-Deij, Mattijs Jonker, Anna Sperotto, and Aiko Pras. 2016. A high-performance, scalable infrastructure for large-scale active DNS measurements. IEEE Journal on Selected Areas in Communications 34, 6 (2016), 1877–1888.
[39] Wikipedia. 2020. Simple Mail Transfer Protocol. Wikipedia. https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol
[40] Maya Ziv, Liz Izhikevich, Kimberly Ruth, Katherine Izhikevich, and Zakir Durumeric. 2021. ASdb: A System for Classifying Owners of Autonomous Systems. In ACM Internet Measurement Conference (IMC’21).
Rank  Alexa                        COM                               GOV
1     Google       26,697 (28.5%)   GoDaddy         168,287 (29.0%)   Microsoft        1,124 (32.1%)
2     Microsoft    10,072 (10.8%)   Google            54,564 (9.4%)   Google              336 (9.6%)
3     Yandex         4,253 (4.5%)   Microsoft         33,406 (5.8%)   Barracuda           280 (8.0%)
4     ProofPoint     2,815 (3.0%)   UnitedInternet    26,939 (4.6%)   ProofPoint          155 (4.4%)
5     Mimecast       2,005 (2.1%)   EIG                8,714 (1.5%)   Mimecast             87 (2.5%)
6     GoDaddy        1,411 (1.5%)   OVH                7,752 (1.3%)   AppRiver             60 (1.7%)
7     Zoho           1,229 (1.3%)   NameCheap          6,620 (1.1%)   Rackspace            48 (1.4%)
8     Tencent          826 (0.9%)   Tucows             5,517 (1.0%)   Cisco                48 (1.4%)
9     Cisco            771 (0.8%)   Strato             5,025 (0.9%)   GoDaddy              32 (0.9%)
10    Rackspace        752 (0.8%)   Rackspace          4,930 (0.8%)   Sophos               29 (0.8%)
11    Barracuda        598 (0.6%)   Web.com Group      4,200 (0.7%)   Solarwinds           28 (0.8%)
12    Mail.Ru          555 (0.6%)   Aruba              3,842 (0.7%)   IntermediaCloud      24 (0.7%)
13    Beget            420 (0.4%)   Yahoo              3,652 (0.6%)   TrendMicro           22 (0.6%)
14    MessageLabs      412 (0.4%)   SiteGround         3,461 (0.6%)   hhs.gov              21 (0.6%)
15    OVH              386 (0.4%)   Tencent            3,451 (0.6%)   treasury.gov         18 (0.5%)
Total              53,201 (56.9%)                   340,362 (58.6%)                    2,312 (66.1%)
Table 6: Top 15 companies identified in the three datasets (Jun. 2021)
A TOP 15 COMPANIES IN EACH DATASET
Table 6 lists the top 15 companies identified in the three datasets
and their market share: the number and percentage of domains in
each dataset using services from these companies.