Where Is Your Data, Really?: The Technical Case Against Data Localization

Dillon Reisman
Monday, May 22, 2017, 7:00 AM

This post is part of a series written by participants in a conference at Georgia Tech on Surveillance, Privacy, and Data Across Borders: Trans-Atlantic Perspectives.

Published by The Lawfare Institute in Cooperation With Brookings

The push for data localization requirements—mandating that certain user data be kept within national borders—reflects an inaccurate understanding of the Internet. The reality is that we cannot have widespread data localization without making web services we rely on technically unviable. To understand why this is the case, we first have to answer a question at the heart of the Mutual Legal Assistance (MLA) and cross-border data transfer debate: Where is data located?

It is effectively impossible for users of services like Facebook or Google’s Gmail to know where their data is stored. Yet in the MLA debate, courts have resolved tricky questions of jurisdiction by reasoning about the physical location of digital data. Take, for example, the Microsoft Ireland case. There, the Second Circuit Court of Appeals ruled that because the data at issue was stored on a Microsoft server in Ireland, it was extraterritorial and thus beyond the reach of US law enforcement under ECPA. That may have led to a convenient outcome in that particular case, but the opinion does not accurately reflect how web services actually store their data.

Courts frequently use simplistic mental models for complex systems out of necessity. Legal reasoning does not require—nor could it likely accommodate—a completely faithful representation of how the Internet works. In the debate around data localization, however, over-simplification leads to some negative legal and public policy outcomes with wide-ranging consequences.

And questions of how we should treat data aren’t only relevant to the data localization debate. How we cope with the transnational nature of the Internet will also impact issues ranging from intellectual property law to global anti-censorship efforts. That is why it is important to base the public conversation on a more sophisticated understanding of the nature of data storage. To that end, below is an overview of several of the ways in which well-developed web services might actually store your data across borders and why they do so.

Consider a (fictional) email service that hosts your email and makes it accessible to you wherever you may be. Your emails would probably exist in multiple copies, which could be located in more than one country. Here are some possibilities for where your data might be stored:

  • Your data might be stored in edge caches across borders.

One of the main pillars of web architecture is performance: applications need to get data to users as fast as is reasonably possible. One way to accomplish this is to keep copies of select chunks of data in “edge caches.” Caches place the most in-demand content as close as possible to the end users who will want it, shortening the trip data has to take across the network. The cache network can strategically choose what data to keep in each cache based on changing demand and other factors. Thus, the expense of storing all of a service’s data can be concentrated in a more centralized location, while cheaper machines (possibly in different countries) quickly distribute data to users nearby.
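
To make that concrete, here is a minimal Python sketch of an edge cache. The origin store, region name, and data are hypothetical; the point is simply that a nearby machine serves a local copy when it has one and fetches from the central origin when it does not.

    # A toy edge cache: serve popular content from a nearby copy and fall
    # back to the central origin store on a miss. All names are hypothetical.

    ORIGIN_STORE = {"inbox:alice": "alice's full mailbox data"}  # central data center

    class EdgeCache:
        def __init__(self, region, capacity=2):
            self.region = region      # e.g., the country this cache node sits in
            self.capacity = capacity  # edge nodes hold only a small, popular subset
            self.entries = {}

        def get(self, key):
            if key in self.entries:
                return self.entries[key]          # served locally: fast, nearby copy
            value = ORIGIN_STORE[key]             # cache miss: fetch from the origin
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict to make room
            self.entries[key] = value             # keep a copy near the user for next time
            return value

    frankfurt_cache = EdgeCache(region="eu-frankfurt")
    print(frankfurt_cache.get("inbox:alice"))  # first request travels to the origin
    print(frankfurt_cache.get("inbox:alice"))  # second request is answered from the edge copy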

  • Your data might be replicated for load balancing.

Another principle driving the development of web services is efficiency: there should be no wasted resources. To make more efficient use of their servers, a web service might replicate user data across multiple data centers in different regions. If one region sees more user activity and has trouble meeting demand, the network might instead route some of that activity to the service’s replica in a different region.
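
A rough Python sketch of that routing decision, with hypothetical region names and load figures, might look like the following.

    # Toy load balancer: send the request to the least-loaded replica of the
    # data, even if that replica sits in another country. Values are hypothetical.

    replicas = {
        "us-east": {"load": 0.92},   # busy region
        "eu-west": {"load": 0.35},   # same data, lighter load
        "ap-south": {"load": 0.50},
    }

    def choose_replica(replicas):
        return min(replicas, key=lambda region: replicas[region]["load"])

    print(choose_replica(replicas))  # -> "eu-west"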

  • Your data might be ‘sharded’ across multiple machines in multiple data centers.

A web service may store millions of gigabytes of data. To do this, the web service stores data across many “shards,” with an individual computer responsible for holding a shard of data. An individual’s data can be split between any number of shards and distributed, copied, and backed up across multiple machines. This helps support a web service’s goals for performance and efficiency—load balancing, for instance, can be made even more efficient if the network chooses which “shards” of data need to be copied and distributed based on demand.
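
As an illustration, a simple hash-based shard assignment, with a made-up mapping of shards to data centers, might look like this.

    # Toy sharding scheme: hash the user ID to pick a shard, then look up which
    # data center holds that shard. The mapping below is hypothetical.

    import hashlib

    NUM_SHARDS = 8
    SHARD_LOCATIONS = {
        0: "us-east", 1: "eu-west", 2: "ap-south", 3: "us-east",
        4: "eu-west", 5: "ap-south", 6: "us-east", 7: "eu-west",
    }

    def shard_for(user_id):
        digest = hashlib.sha256(user_id.encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    for user in ["alice@example.com", "bob@example.com"]:
        shard = shard_for(user)
        print(user, "-> shard", shard, "in", SHARD_LOCATIONS[shard])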

  • Your data might be backed up to multiple locations in case of failure.

Most people have experienced the horrible realization that they’ve accidentally deleted an important file. Imagine accidentally deleting data for several million customers. These sorts of disasters affect even the most mature web applications—in 2011, a software bug wiped out the inboxes of tens of thousands of Gmail users. Fortunately for those users, Google kept regular backups, and so the data was never truly lost. This demonstrates how web services need to maintain a high degree of data integrity: the assurance that data will not be lost or corrupted over its lifetime. Emergencies caused by natural disaster or physical disruption to a data center also require that backups be readily deployable to different data centers in different regions.
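
A toy sketch of cross-region backups, again with hypothetical region names, shows the basic idea: every write lands in several regions, and after a failure the data can be read back from any surviving copy.

    # Toy cross-region backup: write a copy of each record to every configured
    # region, then recover from any region that survives a failure.

    BACKUP_REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical regions

    def write_with_backups(key, value, stores):
        for region in BACKUP_REGIONS:
            stores.setdefault(region, {})[key] = value  # one copy per region

    def recover(key, stores, failed_region):
        for region, data in stores.items():
            if region != failed_region and key in data:
                return region, data[key]  # read from a surviving copy
        raise KeyError(key)

    stores = {}
    write_with_backups("inbox:alice", "alice's mailbox", stores)
    print(recover("inbox:alice", stores, failed_region="us-east"))  # ('eu-west', ...)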

  • Your data might be made accessible to engineers in different countries for maintenance and debugging.

No software is perfectly written on the first go, and new issues will always pop up. This is a normal and expected part of software’s lifecycle. How well a web service functions depends on how well its engineers can diagnose and fix problems. In some cases, this might mean that engineers in one country have access to user data originating from another. Even if engineers only have access to activity logs—records of simple timestamps noting when a user accesses a service—those logs still constitute metadata.
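
For a sense of what such a log might contain, here is a hypothetical sketch. No message content appears, yet each entry still records who used the service, when, what they did, and which data center handled the request.

    # Toy activity log entry: timestamps and identifiers only, no content.
    # Field names and values are hypothetical.

    import json
    import time

    def log_access(user_id, action, region):
        entry = {
            "timestamp": time.time(),  # when the user touched the service
            "user": user_id,           # which account was active
            "action": action,          # e.g., "open_inbox", "send_message"
            "served_from": region,     # which data center handled the request
        }
        print(json.dumps(entry))
        return entry

    log_access("alice@example.com", "open_inbox", "eu-west")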

  • Your data might be processed in batches at a central location, to add features like search or artificial intelligence.

Fetching an email from your inbox or writing a new post on a friend’s social media page is a relatively cheap job for the web service to perform. Relative to those basic operations, more advanced features like the ability to search your inbox or the use of artificial intelligence to build spam filters are more costly. Fortunately, the service might not need to update those features immediately with every user action. A web service can save crucial resources by processing data in batches on a set schedule. These operations don’t necessarily need the same redundancy as other, more user-visible processes, so data can be copied to one single location that is responsible for all of the expensive work. That location might be any one of the data centers the service operates around the world.
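
As a simplified illustration, a scheduled batch job that rebuilds a search index from a handful of hypothetical mailboxes might look like this; in a real service, the copies being scanned could all sit in one data center chosen for the job.

    # Toy batch job: scan every mailbox once and map each word to the messages
    # that contain it, rather than updating the index on every user action.

    from collections import defaultdict

    def build_search_index(mailboxes):
        index = defaultdict(set)
        for user, messages in mailboxes.items():
            for msg_id, text in messages.items():
                for word in text.lower().split():
                    index[word].add((user, msg_id))
        return index

    mailboxes = {  # hypothetical data
        "alice": {"m1": "Lunch on Friday", "m2": "Flight booking confirmed"},
        "bob":   {"m7": "Friday deadline moved"},
    }
    index = build_search_index(mailboxes)   # run nightly at one central location
    print(sorted(index["friday"]))          # [('alice', 'm1'), ('bob', 'm7')]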

  • Your data might be used to generate “derived data.”

Derived data is information computed from raw user data, such as summaries, aggregate statistics, or the output of a machine-learning model. In many cases derived data requires the same protections as the private data it came from—if an AI could extract from your text messages your opinion of everyone you’ve spoken to, you’d likely consider that information highly sensitive and deserving of a level of protection similar to the texts themselves. In other cases, aggregate statistics derived from data could be shared between engineers or released publicly without revealing anything about any individual—if an AI only noted how many people you text regularly, without recording their names or your name, you’d probably view that as less sensitive than the content of the original texts. Deciding when derived data is sufficiently “safe” is a difficult problem.
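
A small, hypothetical Python example illustrates the difference between the two kinds of derived data.

    # Toy derived data: a per-contact profile (sensitive) versus a bare count
    # of regular contacts with no names attached (less sensitive).

    from collections import Counter

    texts = [  # hypothetical message metadata
        {"sender": "alice", "recipient": "bob"},
        {"sender": "alice", "recipient": "carol"},
        {"sender": "alice", "recipient": "bob"},
    ]

    # Sensitive: exactly whom Alice talks to, and how often.
    per_contact = Counter(t["recipient"] for t in texts if t["sender"] == "alice")
    print(per_contact)  # Counter({'bob': 2, 'carol': 1})

    # Less sensitive: only the number of regular contacts, with no names.
    regular_contacts = sum(1 for count in per_contact.values() if count >= 2)
    print(regular_contacts)  # 1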

The overarching point is that data can live ephemerally, in many copies and in many places. Some of our most important Internet applications, from search functions to communications, rely on some of those places being across a national border. It is an immense challenge to design laws and policies that best serve the interests of users and law enforcement without compromising on the principles that power the Internet. We will only be able to meet that challenge, however, by developing a more complete understanding of how our data actually exists in the world.


Dillon Reisman is an independent research engineer, currently collaborating with Princeton's Center for Information Technology Policy. Previously, he was a software engineer on the Google privacy team, where he advised product teams on privacy-conscious development and developed infrastructure to better protect user data.
