Sequenced DNA comparison,

without DNA sequence disclosure.

🢩 A Compliance Imperative 🢨

Patents: GRANTED

Offering: Licences with know-how, training & reference code.


Collisions: Resolving the ‘False Positives’


        
        In everyday life, we come across many false positives. The easiest example is with surnames (‘last names’). Two people from the same European country who share an aristocratic name are likely related. ‘De Leon’ means ‘of Leon’. It reflects a time when a minority of people, the nobility, marked their holdings with a location in a family name. The masses usually bore no titles or surnames.
        Sometimes, however, identical surnames could signify a different situation. When Italy unified and modernized in the late nineteenth century, many commoners from a given village simply adopted surnames without trying to be unique. For example, two people with the surname of Chiarella may be no more related than any two Americans with a forebear in Sicily.
        The DNA Address is many digits long, so millions of different addresses (hashes) are possible. Far more humans than that exist, so collisions will occur, yet the DNA Address still fulfils an important role: it reduces a list of hundreds of millions of candidates to one match or a small handful of matches. The final step is to differentiate between multiple candidates when we encounter them.


1     The Cryptographic Strength of Hashes

Computer specialists know that collisions are always possible, no matter how big the hashes are. Two different things could coincidentally end up with the same hash. The chance always exists because a hash squeezes an unlimited variety of inputs into a fixed range of outputs, which is also what makes it irreversible.
        The hash converts many things to one. In the reverse direction, a given hash could have come from any of an unlimited number of things.
        As a simplified example, let’s pretend we have a machine that takes in any household object and spits out a number between naught and five thousand. A basketball gives you an output of 2,014. A doorknob is 731. After a full day of playing around with this machine, you are confident that every object in your house will yield a unique number. Then, by dumb luck, you discover that a serviette yields a hash of 4,010, and a paperclip also yields a hash of 4,010. You have no way to know if a hash of 4,010 on someone’s computer screen came from a serviette or a paperclip – or something else entirely.
        The more you think about it, the more likely other ‘collisions’ seem. A household contains more than five thousand pieces, so collisions are not just possible; they are guaranteed. To make collisions far less frequent, you tweak your machine. Now, a hash can be anything between naught and fifty thousand.
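        The guarantee above is the pigeonhole principle, and a toy version of the machine makes it visible. The sketch below is only an illustration: the function names, the use of SHA-256 and the made-up object names are assumptions, not part of Undisclosed DNA.

```python
import hashlib

def toy_machine(item: str, size: int = 5_000) -> int:
    """Hypothetical stand-in for the machine: map any item name to a number in [0, size)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % size

def count_collisions(items, size: int) -> int:
    """Count the items whose number had already been produced by an earlier item."""
    seen = set()
    collisions = 0
    for item in items:
        number = toy_machine(item, size)
        if number in seen:
            collisions += 1
        seen.add(number)
    return collisions

# Assumed household: one more object than the machine has possible outputs.
household = [f"object-{i}" for i in range(5_001)]

print(count_collisions(household, 5_000))   # at least 1 is unavoidable
print(count_collisions(household, 50_000))  # a larger range leaves fewer clashes
```

        With 5,001 objects and only 5,000 possible numbers, at least one clash is unavoidable no matter how the machine works internally. Widening the range reduces clashes but does not remove the possibility.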
        You approach the ideals of cryptographic practice by making that number higher still. Maybe you can stretch that number to nine hundred million, or even ninety million to the tenth power.
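        To get a feel for how much each increase helps, the standard birthday approximation estimates the chance that at least two objects share a hash. This is a rough sketch: it assumes the machine spreads its outputs evenly, and the figure of 5,000 household objects is an illustrative assumption.

```python
import math

def collision_probability(items: int, outputs: int) -> float:
    """Birthday approximation: chance that at least two of `items` share one of `outputs` values.

    Assumes every output value is equally likely.
    """
    if items > outputs:
        return 1.0  # pigeonhole: more items than values forces a collision
    return -math.expm1(-items * (items - 1) / (2 * outputs))

# The ranges mentioned in the text, tried with an assumed 5,000 objects.
for outputs in (5_000, 50_000, 900_000_000, 90_000_000 ** 10):
    chance = collision_probability(5_000, outputs)
    print(f"{outputs:.2e} possible hashes -> collision chance {chance:.3g}")
```

        Under these assumptions, fifty thousand outputs still make a collision close to certain, while nine hundred million bring the chance down to roughly one percent. Only the astronomically large range makes it negligible.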


2     Balancing Obscurity with Ease of Use

A big, complex maths problem requires a lot more work to solve than a smaller one – and this is why extremely long hashes present a stumbling block. With a number of 8,100,000,000,000, every household item is all but certain to give you a unique hash that no one can reverse, but this number seems like overkill. If you want a searchable directory of the items in your household, and you want it to be practical, then this number is perverse.
        In the world of computing, separate from the costlier world of material DNA, it makes perfect sense to use gigantic numbers. We should want every credit card transaction to be unique, and every encrypted connection or signed message to be distinct. These are one-off events, so to speak.
        If we can refine an operation through a second step, however, we have less reason to avoid smaller numbers. Similarly, you do not need a stupendous password when your workplace uses two-factor authentication.
        With simpler DNA Addresses you will encounter more collisions, but you can separate the seemingly identical candidates in a more complicated second step. That second step can be arduous and still cost very little in resources, provided you apply it only to the handful of collisions that make up the shortlist of potential matches.
        The matching process of Undisclosed DNA mirrors the workflow of file searches. You begin by indexing not each file or filename, but a short hash of the filename. If you have many thousands of files on your computer, you will probably have collisions, but that is okay. This step of comparing hashes is simple to do. Your computer can whittle down the list of potential matches from a million to a few dozen. The second step is to carefully read the files themselves and to see which one actually matches. To carefully pore over a file will take longer than cross-referencing hashes, but you only need to do it a few times.
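        The file-search workflow described above can be sketched in a few lines. Everything here is illustrative: the helper names, the two-character hash length and the generated filenames are assumptions, and the sketch shows only the general two-step idea, not the Undisclosed DNA matching process itself.

```python
import hashlib
from collections import defaultdict

def short_hash(name: str, digits: int = 2) -> str:
    """Deliberately short hash of a filename; collisions are expected and tolerated."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:digits]

def build_index(files: dict) -> dict:
    """Index filenames by their short hash (the cheap directory)."""
    index = defaultdict(list)
    for name in files:
        index[short_hash(name)].append(name)
    return index

def find(files: dict, index: dict, wanted_name: str, wanted_content: bytes):
    # Step 1: a cheap hash lookup shrinks the whole collection to a shortlist.
    shortlist = index.get(short_hash(wanted_name), [])
    # Step 2: the slow, careful comparison runs only on that shortlist.
    matches = [name for name in shortlist if files[name] == wanted_content]
    return shortlist, matches

# Assumed toy collection of 10,000 files.
files = {f"report-{i}.txt": f"contents {i}".encode() for i in range(10_000)}
index = build_index(files)

shortlist, matches = find(files, index, "report-42.txt", b"contents 42")
print(len(files), len(shortlist), matches)
```

        The expensive comparison touches a few dozen files rather than ten thousand, which is why a short, collision-prone hash is acceptable in the first step.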


3     Potentialities and Possibilities

By following the workflow of file searches and not cryptography as we knew it, we are on the right track. We have a two-step process that dramatically reduces our workload and also ensures that we do not give false positives. We can bring the list of millions down to a score and then to just one.
        The major problem left is that the second step of this two-step process requires someone to read the original files or information. To differentiate between candidates, we have to test the underlying material. This compromises the privacy of one’s genetic code. Hence, we should not do that second step in a straightforward manner.
        Without a different way to confirm matches, we are stuck. A very long hash that incorporated data from multiple chromosomes and spat out a sequence of pseudo-random numbers would prevent almost all collisions, but it could hardly prove a connection – much less encrypt or decrypt messages that must stay secret.
        If everyone has a unique hash (‘DNA Address’) then we have eight billion hashes. If we make the hash too simple, then we have too many collisions. It does nobody any good to say that someone somewhere in London is your lost child from twenty years ago but all grown up. We need to narrow it down.
        Then we end up having to pick some arbitrary cutoff point. Maybe we pick a number for the complexity of the hash so that we can say that only one of the following is your child, but we have no way to prove which one it is: four people in Ottawa, two in Liverpool, five in Edinburgh, eight in New York City. Even with very long hashes, collisions are always possible.
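        The trade-off behind that cutoff is simple arithmetic: the shorter the address, the more people share it on average. The sketch below assumes, purely for illustration, a decimal address whose values are spread evenly across eight billion people; the function name and the digit counts are hypothetical.

```python
def expected_sharers(population: int, digits: int) -> float:
    """Average number of people per address, assuming addresses are evenly spread
    over 10**digits possible values (an illustrative simplification)."""
    return population / 10 ** digits

WORLD_POPULATION = 8_000_000_000  # the 'eight billion' used in the text

for digits in (6, 8, 9, 10, 12):
    people = expected_sharers(WORLD_POPULATION, digits)
    print(f"{digits:>2}-digit address: about {people:g} people per address")
```

        Under this assumption a nine-digit address leaves a shortlist of around eight people, in the same region as the handful of candidates in the example above. Adding digits shrinks the average but never reduces the chance of a collision to zero.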


4     Private Keys

Undisclosed DNA can do better than that. It can also remove the need to ever examine anyone’s DNA as is. It does so with the private key.
        Creating the private key takes more effort than creating the DNA Address, but it is ‘more unique’. Using the private key is even more work, but it is used within a much smaller pool of people.
        Nobody can simply reverse a private key. I could ask you to multiply 101 by 117, but it would take you longer to figure out which two numbers multiply to make 11,817. Now imagine the numbers are much longer – perhaps hundreds of digits in size.
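        The asymmetry in this analogy is easy to observe. The sketch below uses two small primes chosen only for illustration and plain trial division as the reversing method; it demonstrates the general multiply-versus-factor gap and makes no claim about how Undisclosed DNA derives its private key.

```python
def factor_by_trial_division(n: int):
    """Return (factor, cofactor, tries) for the smallest factor of n greater than 1."""
    tries = 1
    if n % 2 == 0:
        return 2, n // 2, tries
    divisor = 3
    while divisor * divisor <= n:
        tries += 1
        if n % divisor == 0:
            return divisor, n // divisor, tries
        divisor += 2
    return n, 1, tries  # no divisor found, so n is prime

# Illustrative primes, not taken from the text.
p, q = 10_007, 10_009

product = p * q                                   # forward direction: one multiplication
factor, cofactor, tries = factor_by_trial_division(product)

print(product)
print(factor, cofactor, tries)                    # reverse direction: thousands of attempts
```

        Going forward takes a single multiplication, while going backward by this naive method takes thousands of trial divisions even for five-digit primes. The gap widens rapidly as the numbers grow.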
        Even then, some may say, you could get lucky. Instead of trying this with a supercomputer the size of a planet for billions of years, let’s pretend that you strike gold with a one-in-a-trillion shot. No one is fortunate enough, however, to blindly guess the underlying DNA that would yield both a certain hash and a certain key. Additionally, even the luckiest guess in the universe for all time (it would have to be) could not reveal the underlying DNA with certainty.
        The careful reader will now wonder about the public key that pairs with this or that private key. Undisclosed DNA uses cryptographic keys in a way that fundamentally differs from the system of public–private key pairs. This unlocks a realm of novel opportunities.