A Smear of DNA Can Hold 10,000 Gigabytes of Data – OneZero

Emily Mullin

Feb 5

Photo: imaginima/E+/Getty Images

Around the world, warehouses the size of several football fields store millions of hard drives’ worth of data. Every time we send an email, search Google, upload photos to Facebook, or stream a movie on Netflix — which is to say, all the time — those hard drives are put to work.

Big tech is building more of these sprawling data centers to keep up with the massive growth in data needs. But we are generating so much digital data that our current storage systems won’t be able to keep up for long. Already, large-scale U.S. data centers cost hundreds of millions of dollars to maintain and account for nearly 2% of the country’s electricity consumption, and those numbers are only expected to grow.

“There’s a problem coming where we’re going to have more data than we can store,” says Nicholas Guise, a senior research scientist at the Georgia Tech Research Institute who works on cybersecurity. To solve it, he says, we’ll need to figure out how to store more data in less space.

The U.S. government, which also has a huge data storage problem, has just invested $48 million into one possible solution: storing data in DNA.

For the past few years, researchers have been tinkering with encoding songs, images, and other files in DNA. But it’s still expensive and time-consuming. Now a new program launched by the Intelligence Advanced Research Projects Activity (IARPA), a research agency within the Office of the Director of National Intelligence, aims to change that. Its goal is to shrink a warehouse-sized data center into an affordable tabletop device that can store one exabyte of data — which is equal to a million terabyte-sized hard drives.

“The scale and complexity of the world’s ‘big data’ problems are increasing rapidly, and we are entering an era when the solutions will require storage and random access from an exabyte or more of data,” IARPA program manager David Markowitz tells OneZero. “Faced with exponential data growth, large data consumers may soon face a choice between investing exponentially more resources in storage or discarding an exponentially increasing fraction of data.”

In January, IARPA awarded Guise’s team up to $25 million to start working toward this goal, together with some collaborators. His group will work with San Francisco-based DNA synthesis company Twist Bioscience, San Diego startup Roswell Biotechnologies, and a team at the University of Washington that’s collaborating with Microsoft to develop a fully automated DNA data storage system. Meanwhile, researchers from the Broad Institute of MIT and Harvard and French company DNA Script have been awarded a separate contract worth up to $23 million to work on ways to encode data into DNA and retrieve that information.

Like big tech companies, the government also needs archival data storage capabilities that are more affordable than conventional systems. The federal government collects and stores data on everything from taxes and crime to public health and climate. DNA offers an extremely compact means of storing immense amounts of data. A data storage facility as big as a Walmart Supercenter could be shrunk down to the size of a sugar cube.

Getting data into DNA is similar to traditional coding, albeit with a few more steps. On a computer, data is coded into sequences of zeros and ones, known as binary code. DNA, meanwhile, can store data using a quaternary code consisting of A, C, G, and T (adenine, cytosine, guanine, and thymine), the four chemical bases, or building blocks, that make up DNA. Just as a “bit” of information on a computer has a value of 0 or 1, each chemical base can be viewed as a digit with one of four values, so a single base can in principle stand in for two bits.
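To make the binary-to-quaternary idea concrete, here is a minimal sketch in Python. It assumes the simplest possible mapping, in which every pair of bits becomes one base (00 is A, 01 is C, 10 is G, 11 is T). That mapping and the function name `bytes_to_bases` are illustrative assumptions, not the scheme used by any of the teams described here; real systems would likely add error correction and other safeguards.

```python
# Hypothetical mapping: each 2-bit value (0-3) picks one base.
BASES = "ACGT"  # 00 -> A, 01 -> C, 10 -> G, 11 -> T (an assumption)


def bytes_to_bases(data: bytes) -> str:
    """Translate raw bytes into a string of bases, four bases per byte."""
    bases = []
    for byte in data:
        # Walk the byte two bits at a time, most significant pair first.
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


if __name__ == "__main__":
    # "H" is 01001000 and "i" is 01101001 in binary.
    print(bytes_to_bases(b"Hi"))  # prints CAGACGGC
```

Under this mapping, every byte of a file turns into exactly four bases, so the two-character text "Hi" becomes an eight-base sequence.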

Bits of information can be encoded in DNA by machines that print out synthetic DNA bases to reflect the desired sequence. Advances in manufacturing DNA are making this process faster, but it can still be slow, depending on the amount of data that’s being encoded. Guise says it can take up to a few minutes to add each base, but his team is hoping to speed that up.

“The real limiting step is how long it takes to chemically synthesize DNA, which is much slower than how long it takes your body to replicate DNA,” he says.

To retrieve or “read” the data, scientists must use a DNA sequencing machine. DNA sequencing is the process of spelling out the order of A’s, C’s, G’s, and T’s. But extracting just one file or a few files is challenging because DNA is so small, making a particular piece of data hard to pinpoint. The University of Washington group involved in the IARPA project is developing a chemical that could essentially find and bind to the particular piece of data you want to extract and read. In 2016, the same researchers reported that they were able to encode four digital images in DNA and then successfully retrieve them without any data loss.
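The reading step can be sketched in software as well. The Python example below is only a loose analogy: it assumes each strand begins with a short address tag that identifies its file, and it stands in for the chemical "find and bind" step with a plain prefix match. The two-bits-per-base mapping, the tag layout, and the names `find_file` and `bases_to_bytes` are all hypothetical and are not taken from the University of Washington work.

```python
# Hypothetical mapping: A=00, C=01, G=10, T=11 (an assumption).
BASES = "ACGT"
VALUE = {base: index for index, base in enumerate(BASES)}


def bases_to_bytes(sequence: str) -> bytes:
    """Turn a sequenced string of bases back into bytes, four bases per byte."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for start in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[start:start + 4]:
            byte = (byte << 2) | VALUE[base]  # append two bits per base
        out.append(byte)
    return bytes(out)


def find_file(pool: list[str], address: str) -> list[str]:
    """Return the payloads of strands whose (hypothetical) address tag matches."""
    return [s[len(address):] for s in pool if s.startswith(address)]


if __name__ == "__main__":
    # Two made-up strands: a 4-base address tag followed by the payload.
    pool = ["ACGT" + "CAGACGGC", "TTGA" + "CGAC"]
    for payload in find_file(pool, "ACGT"):
        print(bases_to_bytes(payload))  # prints b'Hi'
```

The point of the sketch is that only the strands carrying the requested tag need to be sequenced and decoded, rather than the whole pool.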

The University of Washington and Microsoft group has since encoded all sorts of things in DNA, from an OK Go music video to the top 100 books of Project Gutenberg.

One of the advantages of DNA is that it can be preserved for hundreds of years, unlike the hard drives, flash drives, and magnetic tapes that are used today, which degrade in years or decades. DNA can be stored in a liquid form or dehydrated into a powder, which makes it last longer.

Once scientists create enough synthetic DNA with information encoded in it, they’ll next need to figure out how to store these tiny amounts of liquid and powder. One storage approach is dehydrating DNA so that it can be organized into tiny specks in vials depending on what kind of data it encodes, says Adam Meier, a senior research scientist at Georgia Tech who is leading the project with Guise. “Depending on which data you want to read, you would find a different spot of powder and feed that into a DNA-reading device,” Meier says.

Another possibility would be to route liquid DNA onto the surface of small storage chips. “When we start producing more and more of this stuff we’ll have to come up with a solution,” Meier says.

Guise says DNA also has potential security benefits over traditional storage systems, since reading encoded data requires a DNA sequencing machine, which costs hundreds of thousands of dollars.

In the near future, the average person probably won’t buy DNA for data storage like they do an external hard drive. But they will likely end up using it without realizing it if companies like Apple and Facebook move from the cloud into DNA storage. “Some of the main users of these giant exabyte-scale data farms are places like Facebook or Apple and a lot of what they’re storing is people’s photos,” Guise says.