Collisions: Resolving the ‘False Positives’
In everyday life, we come across many false positives. The easiest
example is with surnames (‘last names’). Two people from the same
European country with an aristocratic name are likely related. ‘De
Leon’ means ‘of Leon’. It reflects a time when a minority of people, the nobility, marked their holdings with a location in a family
name. The masses usually bore no titles or surnames.
Sometimes, however, identical surnames could signify a different situation. When Italy unified and modernized in the late
nineteenth century, many commoners from a given village simply
adopted surnames without trying to be unique. For example, two
people with the surname of Chiarella may be no more related than
any two Americans with a forebear in Sicily.
The DNA Address is many digits long. Millions of different addresses (hashes) are possible. Far more humans than
that exist, so collisions will occur, yet the DNA Address still fulfils an important role. It reduces a list of hundreds of millions of candidates to one match or a small handful of matches. The
final step is differentiating between multiple potentialities when we
encounter them.
1 The Cryptographic Strength of Hashes
Computer specialists know that collisions are always possible, no
matter how big the hashes are. Two different things could coincidentally end up with the same hash. The chance always exists because a hash maps an unlimited variety of inputs onto a limited set of outputs, which is also why it cannot be reversed.
The hash converts many things to one. In the reverse direction,
a given hash could have come from any of an infinite number of things.
As a simplified example, let’s pretend we have a machine that
takes in any household object and spits out a number between
naught and five thousand. A basketball gives you an output of
2,014. A doorknob is 731. After a full day of playing around with
this machine, you are confident that every object in your house will
yield a unique number. By dumb luck, you discover that a serviette makes for a hash of 4,010, and a paperclip also creates a hash
of 4,010. You have no way to know if a hash of 4,010 on someone’s
computer screen came from a serviette or a paperclip – or something else entirely.
The more you think about it, the more ‘collisions’ seem possible.
When you break it down, a household contains more than five
thousand pieces. Collisions are not just possible, they are guaranteed. In order to make collisions very improbable, you tweak your
machine. Now, a hash can be between naught and fifty thousand.
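The machine can be imitated in a few lines of Python. This is only a sketch: the function names, the made-up inventory of 6,000 items, and the use of SHA-256 reduced to a small range are all assumptions for illustration, so the numbers it produces will not match the 2,014 or 731 quoted above. It shows that with more items than outputs some hashes must be shared, and that widening the range thins the collisions out without abolishing them.

```python
import hashlib


def toy_hash(item: str, outputs: int = 5001) -> int:
    """Map any item name to a number between 0 and outputs - 1."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % outputs


def find_collisions(items, outputs):
    """Group items by toy hash and keep only the hashes shared by several items."""
    seen = {}
    for item in items:
        seen.setdefault(toy_hash(item, outputs), []).append(item)
    return {h: names for h, names in seen.items() if len(names) > 1}


# Hypothetical household inventory: more pieces than the first machine has outputs.
household = [f"item-{n}" for n in range(6000)]

small = find_collisions(household, outputs=5001)    # naught to five thousand
large = find_collisions(household, outputs=50001)   # naught to fifty thousand

print(len(small), "shared hashes with 5,001 possible outputs")
print(len(large), "shared hashes with 50,001 possible outputs")
```

With 6,000 items and only 5,001 outputs, at least one hash has to be shared no matter how the machine is built. The larger range only lowers the count.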
You can approach the ideals of cryptographic practices if you
can make that number even higher. Maybe you can stretch that
number to nine hundred million, or even ninety million to the
tenth power.
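How quickly the odds improve can be estimated with the standard 'birthday' approximation, which assumes that every output is equally likely. The sketch below is illustrative only: the function name and the figure of 5,000 household items are assumptions, and the four ranges are the ones mentioned in this section.

```python
import math


def collision_probability(items: int, outputs: int) -> float:
    """Approximate chance that at least two of `items` share a hash.

    Birthday approximation: 1 - exp(-n(n-1) / (2M)), assuming uniform outputs.
    """
    exponent = items * (items - 1) / (2.0 * outputs)
    return -math.expm1(-exponent)  # accurate even when the result is tiny


ITEMS = 5_000  # hypothetical number of objects in the household

for outputs in (5_000, 50_000, 900_000_000, 90_000_000 ** 10):
    chance = collision_probability(ITEMS, outputs)
    print(f"{outputs:.3g} possible hashes -> collision chance {chance:.3g}")
```

The chance never reaches zero. It only shrinks as the range of possible hashes grows.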
2 Balancing Obscurity with Ease of Use
A big, complex maths problem requires a lot more work to solve
than a smaller one – and this is why extremely long hashes present
a stumbling block. With a number of 8,100,000,000,000, every
household item is all but certain to give you a unique hash that no
one can reverse, but this number seems like overkill. If you want a
searchable directory of the items in your household, and you want it to be practical, then this number is perverse.
In the world of computing, separate from the costlier world of
material DNA, it makes perfect sense to use gigantic numbers. We
want every credit card transaction to be unique, and we want
every encrypted connection or signed message to be distinct. These
are one-off events, so to speak.
If we can refine an operation through a second step, however,
we have less reason to avoid smaller numbers. It is similar to the fact
that you do not need a stupendous password when your workplace
uses two-factor authentication.
With simpler DNA Addresses you will encounter more collisions, but you can separate the seemingly identical candidates in
a more complicated second step. Even if that second step is arduous,
it costs very little in resources when you apply it only
to the handful of collisions that make up the shortlist of potential matches.
The matching process of Undisclosed DNA mirrors the
workflow of file searches. You begin by indexing not each file or filename, but a short hash of the filename. If you have many thousands
of files on your computer, you will probably have collisions, but
that is okay. This step of comparing hashes is simple to do. Your
computer can whittle down the list of potential matches from a
million to a few dozen. The second step is to carefully read the files
themselves and to see which one actually matches. To carefully pore
over a file will take longer than cross-referencing hashes, but you
only need to do it a few times.
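The file-search workflow can be sketched as follows. Everything here is a stand-in chosen for illustration: the ten thousand invented files, the two-character hash, and the function names are assumptions, not a description of how any particular operating system or Undisclosed DNA itself does it.

```python
import hashlib
from collections import defaultdict


def short_hash(name: str, digits: int = 2) -> str:
    """A deliberately short hash: the first few hex digits of SHA-256."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:digits]


def build_index(files: dict) -> dict:
    """Index every file under the short hash of its filename."""
    index = defaultdict(list)
    for name in files:
        index[short_hash(name)].append(name)
    return index


def search(files: dict, index: dict, wanted_name: str, wanted_content: str):
    """Step one: cheap hash lookup. Step two: read only the shortlisted files."""
    shortlist = index.get(short_hash(wanted_name), [])
    matches = [name for name in shortlist if files[name] == wanted_content]
    return shortlist, matches


# Hypothetical collection of files, each with distinct contents.
files = {f"report-{n}.txt": f"contents of report {n}" for n in range(10_000)}
index = build_index(files)

shortlist, matches = search(
    files, index, "report-4321.txt", "contents of report 4321"
)
print(len(files), "files ->", len(shortlist), "candidates ->", matches)
```

The expensive comparison of contents runs only on the shortlist, never on the whole collection. A two-character hash has just 256 possible values, so collisions are plentiful, and the second step still settles on a single file.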
3 Potentialities and Possibilities
By following the workflow of file searches and not cryptography as
we knew it, we are on the right track. We have a two-step process
that dramatically reduces our workload and also ensures that we do
not give false positives. We can bring the list of millions down to a
score and then to just one.
The major problem left is that the second step of this two-step process requires someone to read the original files or information. To differentiate between candidates, we have to test the underlying material. This compromises the privacy of one’s genetic
code. Hence, we should not do that second step in a straightforward manner.
Without a different way to confirm matches, we are stuck. A
very long hash that incorporated data from multiple chromosomes
and spat out a sequence of pseudo-random numbers would prevent
almost all collisions, but it could hardly prove a connection – much
less encrypt or decrypt messages that are meant to stay secret.
If everyone has a unique hash (‘DNA Address’) then we have
eight billion hashes. If we make the hash too simple, then we have
too many collisions. It does nobody any good to say that someone
somewhere in London is your lost child from twenty years ago,
now all grown up. We need to narrow it down.
Then we end up having to pick some arbitrary cutoff point.
Maybe we pick a number for the complexity of the hash so that we
can say that only one of the following is your child, but we have no
way to prove which one it is: four people in Ottawa, two in Liverpool, five in Edinburgh, eight in New York City. Even with very
long hashes, collisions are always possible.
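The trade-off behind that cutoff is simple arithmetic. As a rough sketch, assume the address is a string of decimal digits and that addresses are spread evenly across the population. Both are simplifying assumptions made here for illustration, and the function name is hypothetical.

```python
def expected_shortlist(population: int, address_space: int) -> float:
    """Average number of people sharing one address, assuming an even spread."""
    return population / address_space


POPULATION = 8_000_000_000  # roughly eight billion people

for digits in (6, 8, 9, 10, 12):
    space = 10 ** digits  # number of possible addresses with this many digits
    people = expected_shortlist(POPULATION, space)
    print(f"{digits}-digit address: about {people:g} people per address")
```

Shorter addresses leave thousands of candidates each, while longer ones leave fewer than one on average. Because real addresses are never spread perfectly evenly, a few collisions would still remain.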
4 Private Keys
Undisclosed DNA can do better than that. It can also remove the
need to ever examine anyone’s DNA as is. It does so with the private key.
Creating the private key takes more effort than creating the
DNA Address, but it is ‘more unique’. Using the private key is even
more work, but it is used within a much smaller pool of people.
Nobody can simply reverse a private key. I could ask you to multiply 101 by 117, but it would take you longer to figure out which
two numbers multiply to make 11,817. Now imagine the numbers
are much longer – perhaps hundreds of digits in size.
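The lopsidedness between multiplying and un-multiplying can be made concrete with a small sketch. The two primes below are chosen purely for illustration and are minuscule next to real key material. Trial division is the most naive way to factor a number, used here only because it is easy to follow.

```python
def factor_by_trial_division(n: int):
    """Find one factor pair of n by trying divisors in turn.

    Returns the pair and the number of trial divisions that were needed.
    """
    tries = 0
    d = 2
    while d * d <= n:
        tries += 1
        if n % d == 0:
            return (d, n // d), tries
        d += 1
    return (1, n), tries  # no divisor found: n is prime


# Hypothetical small primes, for illustration only.
p, q = 10_007, 10_009

product = p * q  # the easy direction: a single multiplication
pair, tries = factor_by_trial_division(product)  # the hard direction

print(product, "=", pair[0], "x", pair[1], "after", tries, "trial divisions")
```

One multiplication goes forward, while thousands of divisions are needed to go back, and the gap widens rapidly as the numbers grow. Cleverer factoring methods exist, but for numbers hundreds of digits long none is known to be practical on today's computers.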
Even then, some may say, you could get lucky. Instead of trying this with a supercomputer the size of a planet for billions of
years, let’s pretend that you strike gold with a one-in-a-trillion shot.
No one is fortunate enough, however, to blindly guess the underlying DNA that would yield both a certain hash and a certain key. Additionally, even the luckiest guess in the universe for all
time (it would have to be) could not reveal the underlying DNA
with certainty.
The careful reader will now wonder about the public key
that pairs with this or that private key. Undisclosed DNA uses
cryptographic keys in a way that fundamentally differs from
the system of public–private key pairs. This unlocks a realm of
novel opportunities.