Microsoft Researching Storage Based on Biological DNA
SAN JOSE -- @Scale -- Microsoft is researching DNA-based storage, a technology that promises to compact a data center's worth of information into a space the size of a few sugar cubes.
DNA-based storage would require less space than other media, and consume less power -- important in the age of web-scale data centers. But the medium is really slow, with data retrieval rates of about 10 MBytes per week.
"Most of this, by the way, is dominated by FedEx moving test tubes around," said Luis Ceze, a Microsoft researcher and a professor at the University of Washington, who presented on DNA storage during the keynote at Facebook's @Scale conference today.
Read time can be sped up in some obvious ways -- putting everything in the same building, for example. And given how quickly DNA sequencing is advancing, Ceze believes 100Gbit/s read speeds might eventually be possible.
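To put the gap between those two figures in perspective, here is a back-of-the-envelope calculation. It assumes a MByte is 10^6 bytes and that the 10 MBytes per week is a sustained average; neither detail comes from the presentation.

```python
# Rough comparison of today's read rate with the hoped-for 100 Gbit/s.
# Assumption: 1 MByte = 10**6 bytes, averaged evenly over a full week.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

current_bits_per_second = 10 * 10**6 * 8 / SECONDS_PER_WEEK
target_bits_per_second = 100 * 10**9

speedup = target_bits_per_second / current_bits_per_second

print(f"current: {current_bits_per_second:.0f} bit/s")
print(f"required speedup: {speedup:.2e}x")
```

Under those assumptions, today's rate works out to roughly 132 bit/s, so reaching 100Gbit/s would mean a speedup of about 756 million times.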
It helps that the data doesn't need to be retrieved perfectly (more on that in a moment).
The primary motivation for this research is to save power at web scale. But there are other factors, too. Every other storage mechanism is reaching its limits and will inevitably deteriorate. DNA offers the promise of more efficient storage that can last hundreds of thousands of years.
(By the way, all the DNA here is synthetic. It's not as if they're injecting mice with data.)
DNA is made up of combinations of four nucleotides: adenine, cytosine, guanine and thymine. So the translation from bits into nucleotides seems straightforward -- 00 could be "A," 01 could be "C," and so on.
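That naive two-bits-per-nucleotide mapping can be sketched in a few lines of Python. The article only gives 00 as "A" and 01 as "C"; the 10 as "G" and 11 as "T" assignments below are assumed to complete the table, and the function names are illustrative.

```python
# Naive mapping of two bits per nucleotide.
# 00 -> A and 01 -> C come from the talk; 10 -> G and 11 -> T are assumed.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, four nucleotides per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Invert encode(): four nucleotides back into one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(encode(b"Hi"))  # CAGACGGC
```

A zero byte encodes to "AAAA" under this scheme, which is exactly the kind of repeat that causes trouble.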
Of course, it's not that easy. Repeating one nucleotide too many times -- such as C-C-C-C -- makes the sequence more fragile and harder to read, so Microsoft adds coding tricks to prevent such combinations. Long chains are prone to instability, so Microsoft puts only 150 nucleotides in a chain -- but that means adding codes to preserve the proper order of these chains.
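Ceze did not detail Microsoft's coding tricks, so the sketch below swaps in one well-known alternative: a rotating code, in which each base-3 digit selects one of the three nucleotides that differ from the previous one, so no nucleotide ever repeats back to back. The 2-byte index prefix, the 23-byte payload and all function names are assumptions chosen so that a full strand comes out at exactly 150 nucleotides.

```python
BASES = "ACGT"
INDEX_BYTES = 2      # assumed size of the ordering prefix
PAYLOAD_BYTES = 23   # (2 + 23) bytes * 6 nucleotides = 150 nucleotides


def to_trits(data: bytes) -> list[int]:
    """Write each byte as six base-3 digits (3**6 = 729 covers 0..255)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))
    return trits


def from_trits(trits: list[int]) -> bytes:
    """Invert to_trits(): six base-3 digits back into one byte."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def rotating_encode(trits: list[int], prev: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the last one."""
    out = []
    for trit in trits:
        choices = [base for base in BASES if base != prev]
        prev = choices[trit]
        out.append(prev)
    return "".join(out)


def rotating_decode(strand: str, prev: str = "A") -> list[int]:
    """Recover the trits by replaying the same choices."""
    trits = []
    for base in strand:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def to_strands(data: bytes) -> list[str]:
    """Split data into indexed strands of at most 150 nucleotides."""
    strands = []
    for n, start in enumerate(range(0, len(data), PAYLOAD_BYTES)):
        chunk = data[start:start + PAYLOAD_BYTES]
        record = n.to_bytes(INDEX_BYTES, "big") + chunk
        strands.append(rotating_encode(to_trits(record)))
    return strands


def from_strands(strands: list[str]) -> bytes:
    """Decode strands in any order and reassemble them by index."""
    records = [from_trits(rotating_decode(s)) for s in strands]
    records.sort(key=lambda r: int.from_bytes(r[:INDEX_BYTES], "big"))
    return b"".join(r[INDEX_BYTES:] for r in records)
```

Because every strand carries its own index, the strands can come out of the test tube in any order: `from_strands(list(reversed(to_strands(data))))` still returns the original `data`. The cost is density, since this scheme spends six nucleotides per byte instead of four.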
And finally, DNA replication is inherently imperfect, so error-correcting codes go in there as well.
Reading the data involves DNA sequencing, but you can't exactly grab a nucleotide chain with tweezers. The approach is to use polymerase -- an enzyme that synthesizes DNA and RNA molecules -- to make lots of copies of a strand of interest (that's another benefit to DNA storage: unlimited copies nearly for free) and sequence a bunch of them, finding a consensus about what the chain was supposed to be.
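The consensus step can be pictured as a per-position majority vote. This is a simplified sketch, not the researchers' pipeline: it assumes every read has the same length and that errors are substitutions only, whereas real sequencing reads also contain insertions and deletions and need alignment first. The function name is illustrative.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Majority vote at each position across equal-length reads.

    Assumes substitution errors only; ties go to the base seen first.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


# Three noisy reads of the same strand; each error is outvoted.
print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```

The more copies that get sequenced, the less likely it is that the same position is wrong in a majority of them.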
That brings up a point: This type of storage smashes the expectations of precision that we used to have with tapes and hard disks. That's OK, though, because software itself might give up some of that precision for the sake of saving energy.
Ceze referred to an area of study called approximate computing, where a processor's "thinking" can be made less thorough in exchange for consuming less power. It's the same way our brains work, he said; full attention takes more energy. This approach of accepting good-enough rather than perfect might be practical in some cases, "because most applications do not require perfect communication and storage accuracy," he said.
— Craig Matsumoto, Editor-in-Chief, Light Reading