Hashing It Out
The DNA Address kicks off the workflow for Undisclosed
DNA. It functions as a sort of hash that takes in the genetic code of
a person and spits out a number that is fairly unique but not reversible. As a specific form of genetic address, it enables you to filter
out the vast majority of ‘non-matches’ or to correlate large populations – all whilst preserving everyone’s privacy.
1 Humble Beginnings
Very simple hashes help us to do file searches, and we can use them
to even confirm identities in real-life scenarios.
Let us say that I have gone on a hiking trip. Over the radio –
a walkie-talkie – I hear someone who seems to be my sister. We
are in the forest, on trails that are a half-mile apart. I would like to
confirm that this person is my sister, and not some woman who is
pulling a prank. At that distance, however, we cannot see each other. If that woman on the other end of the walkie-talkie is my sister, she could be wondering if I am really her brother
and not some man who is impersonating me. We need to test each
other’s identity.
We could test our identities by checking the other person’s
knowledge on something that we should both know. We could use
our parents’ birthdays, for example. If we are able to confirm those
two pieces of information quickly, then I can be confident that the
other person is my sister and she will know that I am her brother.
I begin by asking when our father was born. My sister replies back,
‘22 March 1959.’ Then she asks, ‘Now how about our mother?’ I
say, ‘3 November 1961.’ Success!
This sounds grand, except for the fact that someone else could
be listening to our conversation over the radio waves. If we stated
our parents’ birthdays whilst someone was listening, then we
would have revealed private information. What we need is a zero-knowledge proof, and a sort of ‘birthday hash’ can get us there.
For the zero-knowledge proof, we turn to modular arithmetic and its
modulo operation. We apply it to the day, month, and year of each parent’s
birthday. With modulo 3, we can take each number in a person’s
birthday and divide it by 3. We then record the remainder, which
would be 0, 1, or 2.
Let’s begin with our father’s birthday, which is 22 March 1959
or 22-03-1959. Dividing 22 by 3 yields 7 and a remainder of 1,
which is now our first number. March is the third month, and 3 divides by 3 perfectly with no remainder. For the year, 1959 divided
by 3 is 653, and the remainder is 0. Our father’s birthday then becomes 1-0-0. By the same steps, our mother’s birthday of 3 November 1961 becomes 0-2-2: 3 leaves no remainder, 11 leaves 2, and 1961 leaves 2.
It is unlikely for an impostor to correctly guess all three of the
numbers of someone’s ‘birthday hash’ on the first attempt; with modulo 3, only 27 combinations exist, so a blind guess succeeds about once in 27 tries. If I say
the correct ‘birthday hash’ for our father, then my sister knows that
I am her brother. If she says the correct ‘birthday hash’ for our
mother, then I know that she is my sister.
Even if somebody had been secretly listening to our conversation, he would not learn what our parents’ birthdays are.
The spy would only hear the ‘birthday hashes’ of ‘one-zero-zero’
or ‘zero-two-two’. He could know that we used modulo 3 and still
be unable to know if our father’s birthday is 22 March 1959, 10
September 1962, 25 June 1956, or something else.
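The ‘birthday hash’ can be sketched in a few lines of Python. This is only a minimal illustration of the remainder trick, not a real zero-knowledge protocol, and the function name birthday_hash is invented for this example.

```python
def birthday_hash(day: int, month: int, year: int, modulus: int = 3) -> tuple:
    """Reduce each part of a date to its remainder after division by `modulus`."""
    return (day % modulus, month % modulus, year % modulus)


# Our father's birthday, 22 March 1959.
father = birthday_hash(22, 3, 1959)
print(father)  # (1, 0, 0)

# Different dates that yield the very same hash, so an eavesdropper
# cannot work backwards from the hash to a single birthday.
lookalikes = [(10, 9, 1962), (25, 6, 1956)]
for day, month, year in lookalikes:
    print(birthday_hash(day, month, year) == father)  # True
```

Raising the modulus makes a lucky guess less likely, at the price of narrowing down the set of dates that an eavesdropper would need to consider.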
With computers, we conduct similar tests with very large numbers. These values allow for tests where a lucky guess is practically
impossible as well as for more finely tuned searches and proofs.
2 Searching as We’ve Known It
Many of us became familiar with the word hash through Twitter’s
hashtags, which are human-readable tags for categories and keywords – not quite hashes as computing experts know them. You
can, however, get a feel for the purpose of a hash.
When you search for something that someone said on some
event, you may search for the word that was the hashtag. Maybe
the band McMusicface has a concert tour announcement. You can
search ‘#McMusicface’ and see what comes up. The results include
concert announcements, a birth, a fan club meetup in Glasgow,
and a review of the band’s first album back in 2012. You can comb
through this limited list of results to find the info you need.
Did you get the exact Tweet you sought right off the bat? No,
but it was easier to manually scroll through a dozen posts than the
millions of Tweets made on one day.
As for hashes that are more similar to the DNA Address, you
encounter them whenever you do a file search. Earlier computers
looked for matches in searches by comparing two files. Doing this
was feasible with a small number of small files. With many larger
files, however, this is a slow process.
Your computer will eventually do a slow and precise comparison within a shortlist of candidates, but the computer needs to
make that shortlist quickly. It accomplishes this thanks to hashes
that you use without ever seeing.
When you search within a hard drive for a specific file, your
search program will look at the filename or some information about
that file, and then derive a number. Maybe that number is relatively
small, only 8 or 64 digits. Remember that computer files can be very
long, and it is easier to work with even 512 zeroes and ones than a
million of them.
With these shorter tags or hashes, we are able to create a shortlist
of candidates quickly. From a pool of thousands of files on your
hard drive, the search program has narrowed our list of potential
matches to a dozen files.
Somewhere in the results, you have the real answer; the rest of
the candidate files are ‘false positives’. Every hash is going to lose
some information, and we have an infinite number of possible computer files. Eventually, two or more unrelated files may yield hashes
that are identical, and we will end up with false positives.
After we have our shortlist of candidates – the files with the ‘correct’ hashes – the next step is to separate the file that is a true match
from the false positive files. This part is a more expensive analysis
that your computer performs on each file. Thanks to the hash, however, your computer will do this on only a small handful of files. The
whole process can happen so fast that you may not notice it.
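The two-stage search described above might look roughly like the following Python sketch. It assumes that the short hash is derived from the file's contents and is cut down to 8 bits so that false positives actually show up; real search programs make their own choices here, and the names short_hash, build_index, and find_matches are invented for illustration.

```python
import hashlib
from collections import defaultdict


def short_hash(data: bytes, bits: int = 8) -> int:
    """Derive a deliberately small tag from a file's contents."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)


def build_index(files: dict) -> dict:
    """Group file names by their short hash."""
    index = defaultdict(list)
    for name, content in files.items():
        index[short_hash(content)].append(name)
    return index


def find_matches(target: bytes, files: dict, index: dict) -> list:
    """Shortlist by hash, then run the expensive byte-for-byte comparison."""
    shortlist = index.get(short_hash(target), [])
    return [name for name in shortlist if files[name] == target]


# A pretend hard drive with a thousand small files.
files = {f"file{i}.txt": f"contents of file {i}".encode() for i in range(1000)}
index = build_index(files)

target = b"contents of file 42"
shortlist = index[short_hash(target)]
print(len(shortlist), "candidates out of", len(files))
print(find_matches(target, files, index))  # ['file42.txt']
```

The shortlist holds the true match plus whichever unrelated files happen to share its tag; only those few files go through the full comparison.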
3 Saving Time, Saving Lives
The process for a file search can happen very fast. You are reaping
the benefits of powerful computers, yes, but more importantly your
computer has an efficient workflow. Time-saving measures apply
to computers today as much as they did at Bletchley Park in
World War II.
It became possible for some people to forget the lessons, however. For many years, computers grew exponentially in power. Some
programmers have not taken for granted that this would always be
the case, but the state of affairs lulled some of their lazier colleagues
into a simpler state of mind. They would try to solve every problem by throwing more computing resources at it – as if energy and
silicon chips were free and limitless.
We have two responses to challenges: we can work harder or
smarter – or pay more dearly. Pretend that I take a slow program and
‘speed it up’ by buying a newer and better computer. Nobody would
marvel at my supposed genius. On the flip-side, look at how Britain
cracked the Nazis’ Enigma machine. The story still captivates audiences.
Alan Turing helped to beat the Nazis by cracking the codes
to their secret messages, but he did not invent a magical supercomputer. His computer performed the repetitive task of ‘brute
force’ decryption. In its simplest form, the computer simply tests
its guesses one by one. Each attempt looks for the correct ‘key’ that
will unlock a certain Nazi communique that the Allies heard over
the radio. The computer at Bletchley could have just tried every possibility. It would eventually crack every message, but British intelligence could not wait five months to decrypt a message about a Nazi
operation that would take place in a week. They needed a faster solution. Turing had to get creative.
The innovation lay in reducing the number of possible keys
that the computer had to test out. The codebreakers looked over
old messages that they already solved as well as new ones that were
still unsolved. They found patterns. Now they knew the ways to direct the computer so that it would spend its time only on the best
candidates as it searched for the key.
We can see pattern recognition with an example involving a German weather report. Let us assume that our encryption machine is
straightforward. A given key will move each letter by a certain number of steps. In this scheme, the key is from −25 to −1 or from +1
to +25 – fifty possible keys. If the message of ‘Khoor’ decrypts to
‘Hello’, then the key to encrypt is +3, and to decrypt is −3. If the
message of ‘Gdkkn’ results in ‘Hello’, then the keys are −1 and +1.
With any message, potentially fifty keys exist, but we should
work smart, not hard. Consider this message: ‘Kv yknn tckp qp
Ucvwtfca, Uwpfca, cpf Oqpfca.’ If we know this is a weather report,
the three cases of ‘fca’ could be ‘day’, which implies a decryption key of −2. In a short amount of time, we
have decoded it as ‘It will rain on Saturday, Sunday, and Monday.’
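The difference between trying every key and exploiting a known pattern can be sketched in Python. This is a toy shift cipher as described in the text, not a model of the Enigma machine, and the helper names shift_text and key_from_crib are invented for the example.

```python
def shift_text(text: str, key: int) -> str:
    """Move every ASCII letter by `key` steps, wrapping around the alphabet."""
    shifted = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shifted.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            shifted.append(ch)  # leave spaces and punctuation alone
    return "".join(shifted)


def key_from_crib(cipher_fragment: str, guess: str):
    """Return the decryption key that turns the fragment into the guess, or None."""
    key = ord(guess[0]) - ord(cipher_fragment[0])
    return key if shift_text(cipher_fragment, key) == guess else None


message = "Kv yknn tckp qp Ucvwtfca, Uwpfca, cpf Oqpfca."

# Working hard: try all fifty keys and read every result.
all_attempts = {k: shift_text(message, k) for k in range(-25, 26) if k != 0}

# Working smart: 'fca' ends three words, so guess that it means 'day'.
key = key_from_crib("fca", "day")
print(key)                       # -2
print(shift_text(message, key))  # It will rain on Saturday, Sunday, and Monday.
```

One educated guess replaces fifty trial decryptions. If the guess had been wrong, key_from_crib would have returned None and we would simply have tried the next likely word.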
Genetic testing of any kind has had to test all possibilities and
tediously compare long strings of DNA. Undisclosed DNA uses a
set of four numbers on a chromosome to skip dead ends and focus
all other efforts on a smaller pool of candidates. The Bletchley team
did more with less. So must we.
4 The Path Less Taken
As computers grew in power, they could work with larger numbers. This enabled new ways of hashing for cryptography, which
is the field of computer security that verifies the integrity of the
data sent and received on the Internet, provides virtually complete anonymity and unique identities with fewer collisions, yields
nonces and seeds, and underpins cryptographic signatures.
The typical hashes of cryptography have different goals from
the DNA Address, however. Cryptographic hashes strive to be as
random as possible with no patterns for anyone to find when comparing inputs and outputs. Coincidentally identical hashes (‘collisions’) should be practically impossible. The hashes not only hide
the ordering of the data that was the input; they obscure what
kind of data you even had. Hence, the computing world coalesced
around the so-called SHA algorithms.
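To see what ‘no patterns’ means in practice, the following Python sketch feeds two nearly identical messages into SHA-256 from the standard hashlib module. The sample messages and the helper differing_bits are assumptions made for this illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 output bits differ between two digests."""
    x = int.from_bytes(hashlib.sha256(a).digest(), "big")
    y = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(x ^ y).count("1")


first = b"It will rain on Saturday."
second = b"It will rain on Saturdaz."  # a single letter changed

print(sha256_hex(first))
print(sha256_hex(second))
print(differing_bits(first, second), "of 256 bits differ")
```

The two digests look unrelated even though the inputs differ by one letter, so the output gives no hint about how similar the inputs were.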
Unlike the SHA hashes, the DNA Addresses are not
totally unique and do not seem random – and they do not want to
be. The choice seems rather puzzling at first glance, but we must
mind the respective contexts for SHA and the DNA Address.
With computing security, you want to completely obscure a
file’s contents. With DNA, we do not need to hide that we are talking about DNA or about the totals of cytosine, guanine, adenine,
and thymine in a chromosome. Those numbers are useless to identity thieves and those who know which genetic configurations lead
to some feature that they want to discriminate against. The four
bases only yield useful information when you know their order.
The DNA Address hides that order.
Cryptographic hashes can be used to differentiate people. You
want the fingerprint of some file you share over BitTorrent to stand
apart from all other fingerprints on the BitTorrent network.
DNA Addresses do not serve this purpose. They do not need
to have at least 8,000,000,000² unique addresses in order to virtually eliminate the possibility of collisions between two humans
on the planet. The final check, which differentiates people and measures the degree of familial relation, uses the private keys, not the
DNA Address.
As with cryptographic hashes, however, the DNA Address
cannot be reversed. The cryptographic hash must hide everything
about the underlying data. The DNA Address is safe because the
usage of someone else’s DNA requires the ordering of the underlying data as much as the data itself, which on its own is often useless.
5 Working Together
We can take advantage of the benefits of hashes that are shorter and
less ‘pseudo-random’. This is because the DNA Address exists to
cut the workload dramatically. It does not aim to do double-duty
for identification purposes or to supplant the private key. The process of Undisclosed DNA uses that private key to do the final confirmation of relation and the specification of the degree of relation.
The DNA Address gets you to that phase more quickly.
Additionally, we only need to make the hash impossible to reverse with surety. If someone wants to ‘forge a signature’ by genetically engineering a human whose DNA will cause a collision, then
that is a whole other kettle of fish!
The DNA Address allows for some features to stand out because it is not the kind of hash that springs forth from a black box after something entered it. The readable features of someone’s DNA
Address are not things such as ‘This person has curly hair’ but instead ‘This person has 104 C bases in the mitochondrial DNA
along with 92 G bases, 3 A bases, and 50 T bases.’ This
information is useless for an adversary. Because this is just the total
count of each base and not the order of those bases, this does
not tell us anything about the person’s ancestry or how the person
looks. To make sense of the DNA, you need to know the order of
those bases.
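A toy Python sketch shows why totals alone give so little away. It assumes that the readable part of an address is simply the count of each of the letters C, G, A, and T in a sequence; the actual DNA Address may be computed differently, and base_totals is a name invented here.

```python
from collections import Counter


def base_totals(sequence: str) -> tuple:
    """Return the totals of C, G, A, and T in a sequence, ignoring their order."""
    counts = Counter(sequence.upper())
    return tuple(counts[base] for base in "CGAT")


print(base_totals("GATTACA"))  # (1, 1, 3, 2)
print(base_totals("ACATTAG"))  # (1, 1, 3, 2): same totals, different order
```

Many different orderings collapse into the same four numbers, so nobody can rebuild the sequence from the totals.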
To illustrate how a spy or criminal can never take advantage of
the information in a DNA Address, imagine reading a telephone directory. In that directory, the telephone number of 0370 010 6676
is assigned to someone whose given name has two Ts in it. Hence,
the owner of that telephone number cannot be someone called
Maxwell. It could, however, belong to Matthew – or to Suzette,
Brittany, Scott, Charlotte, Otto, Betty, Tristan, or Yvette. It could
be one of those people, but you do not know for sure which one.
Now pretend that instead of a name with a certain letter in it, we
were discussing strands of DNA with many millions of possibilities. This lack of clarity suffices to foil an attack by a spy.
By pure accident, two different women could share the same
number of C, G, A, and T bases across mitochondria at the same
time that the totals for these bases in their X chromosomes are
extremely close. If you were looking for a long-lost sister, you may
have hit upon the right one – or a false positive. By virtue of being
a hash, the DNA Address creates opportunities for this to happen.
The process of Undisclosed DNA uses a second step with private
keys to confirm relation, however, and nobody gets false hopes.
We cannot realistically jump to that confirmation step and perform match testing on pairs of people, one-by-one, even within a
small city of twenty thousand people. The DNA Address greatly reduces our workload by quickly shrinking the pool of candidates
to test.
If you are looking for a missing cousin in Newcastle, comparing
the genetic code of yourself and that of tens of thousands of people
is not feasible. Looking in depth at only ten people in the city makes
your task much easier.
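The narrowing step could be sketched in Python as follows. Everything here is an assumption made for illustration: the addresses are made-up (C, G, A, T) totals, the distance measure and tolerance are invented, and the later confirmation with private keys is left out because it is a separate step.

```python
def address_distance(a: tuple, b: tuple) -> int:
    """Sum of absolute differences between two four-number addresses."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shortlist(query: tuple, population: dict, tolerance: int = 0) -> list:
    """Keep only the people whose address lies within `tolerance` of the query."""
    return [name for name, address in population.items()
            if address_distance(query, address) <= tolerance]


# Hypothetical addresses as (C, G, A, T) totals; the values are made up.
population = {
    "person_a": (104, 92, 3, 50),
    "person_b": (104, 92, 3, 50),
    "person_c": (98, 95, 7, 49),
    "person_d": (110, 80, 12, 47),
}

me = (104, 92, 3, 50)
candidates = shortlist(me, population)
print(candidates)  # ['person_a', 'person_b']
```

Only the names on the shortlist would move on to the in-depth comparison, whether the population holds four people or tens of thousands.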
The DNA Address plugs a hole created by the varying demands
of experts, professionals, activists, and legislators. For too long, they
toiled away in silos, unable to create the win-win possibilities that
Undisclosed DNA has unlocked.