Sequenced DNA comparison,

without DNA sequence disclosure.


🢩 A Compliance Imperative 🢨

Patents: GRANTED

Offering: Licences with know-how, training & reference code.


Hashing It Out


        The DNA Address kicks off the workflow for Undisclosed DNA. It functions as a sort of hash that takes in the genetic code of a person and spits out a number that is fairly unique but not reversible. As a specific form of genetic address, it enables you to filter out the vast majority of ‘non-matches’ or to correlate large populations – all whilst preserving everyone’s privacy.


1     Humble Beginnings

Very simple hashes help us to do file searches, and we can even use them to confirm identities in real-life scenarios.
        Let us say that I have gone on a hiking trip. Over the radio – a walkie-talkie – I hear someone who seems to be my sister. We are in the forest, on trails that are about half a mile apart, and we cannot see each other. I would like to confirm that this person is my sister, and not some woman who is pulling a prank. If that woman on the other end of the walkie-talkie is my sister, she could be wondering whether I am really her brother and not some man who is impersonating me. We need to test each other’s identity.
        We could test our identities by checking the other person’s knowledge of something that we should both know. We could use our parents’ birthdays, for example. If we are able to confirm those two pieces of information quickly, then I can be confident that the other person is my sister and she will know that I am her brother. I begin by asking when our father was born. My sister replies, ‘22 March 1959.’ Then she asks, ‘Now how about our mother?’ I say, ‘3 November 1961.’ Success!
        This sounds grand, except for the fact that someone else could be listening to our conversation over the radio waves. If we stated our parents’ birthdays whilst someone was listening, then we would have revealed private information. What we need is a zero-knowledge proof, and a sort of ‘birthday hash’ can get us there.
        For the zero-knowledge proof, we turn to a type of arithmetic called modulo. We apply it to the day, month, and year of each parent’s birthday. With modulo 3, we take each number in a person’s birthday, divide it by 3, and record the remainder, which will be 0, 1, or 2.
        Let’s begin with our father’s birthday, which is 22 March 1959 or 22-03-1959. Dividing 22 by 3 yields 7 and a remainder of 1, which is now our first number. March is the third month, and 3 divides by 3 perfectly with no remainder. For the year, 1959 divided by 3 is 653, and the remainder is 0. Our father’s birthday then becomes 1-0-0. By the same steps, our mother’s birthday of 3 November 1961 becomes 0-2-2.
        An impostor is unlikely to guess all three numbers of someone’s ‘birthday hash’ on the first attempt: with three possible values for each number, a blind guess succeeds only about one time in 27. If I say the correct ‘birthday hash’ for our father, then my sister knows that I am her brother. If she says the correct ‘birthday hash’ for our mother, then I know that she is my sister.
        Even if somebody had been secretly listening to our conversation, he would not learn what our parents’ birthdays are. The spy would only hear the ‘birthday hashes’ of ‘one-zero-zero’ or ‘zero-two-two’. He could know that we used modulo 3 and still be unable to know if our father’s birthday is 22 March 1959, 10 September 1962, 25 June 1956, or something else.
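        The ‘birthday hash’ can be sketched in a few lines of Python. This is only an illustration of the modulo 3 arithmetic described above; the function name birthday_hash is made up for this example.

```python
def birthday_hash(day: int, month: int, year: int, modulus: int = 3) -> tuple:
    """Reduce each part of a date to its remainder after division by `modulus`."""
    return (day % modulus, month % modulus, year % modulus)


# Our father's birthday: 22 March 1959.
father = birthday_hash(22, 3, 1959)
print(father)  # (1, 0, 0)

# Other dates share the same hash, so a listener cannot work backwards
# from the hash to the real birthday.
lookalikes = [(10, 9, 1962), (25, 6, 1956)]
for day, month, year in lookalikes:
    assert birthday_hash(day, month, year) == father
```

        With only 27 possible hashes, many dates collide. That is exactly what keeps the real birthday hidden, and it is also why a lucky guess remains possible with such small numbers.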
        With computers, we conduct similar tests with very large numbers. These values allow for tests where a lucky guess is practically impossible as well as for more finely tuned searches and proofs.


2     Searching as We’ve Known It

Many of us became familiar with the word hash through Twitter’s hashtags, which are human-readable tags for categories and keywords – not quite hashes as computing experts know them. You can, however, get a feel for the purpose of a hash.
        When you search for something that someone said about some event, you may search for the word that was the hashtag. Maybe the band McMusicface has a concert tour announcement. You can search ‘#McMusicface’ and see what comes up. The results include concert announcements, a birth, a fan club meetup in Glasgow, and a review of the band’s first album back in 2012. You can comb through this limited list of results to find the info you need.
        Did you get the exact Tweet you sought right off the bat? No, but it was easier to manually scroll through a dozen posts than the millions of Tweets made on one day.
        As for hashes that are more similar to the DNA Address, you encounter them whenever you do a file search. Earlier computers looked for matches in searches by comparing two files. Doing this was feasible with a small number of small files. With many larger files, however, this is a slow process.
        Your computer will eventually do a slow and precise comparison within a shortlist of candidates, but the computer needs to make that shortlist quickly. It accomplishes this thanks to hashes that you use without ever seeing.
        When you search within a hard drive for a specific file, your search program will look at the filename or some information about that file, and then derive a number. Maybe that number is relatively small, only 8 or 64 digits. Remember that computer files can be very long, and it is easier to work with even 512 zeroes and ones than a million of them.
        With these shorter tags or hashes, we are able to create a shortlist of candidates quickly. From a pool of thousands of files on your hard drive, the search program has narrowed our list of potential matches to a dozen files. Somewhere in the results, you have the real answer; the rest of the candidate files are ‘false positives’. Every hash is going to lose some information, and we have an infinite number of possible computer files. Eventually, two or more unrelated files may yield hashes that are identical, and we will end up with false positives.
        After we have our shortlist of candidates – the files with the ‘correct’ hashes – the next step is to separate the file that is a true match from the false positive files. This part is a more expensive analysis that your computer performs on each file. Thanks to the hash, however, your computer will do this on only a small handful of files. The whole process can happen so fast that you may not notice it.
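        The two-step search can be sketched as follows. This is a toy under stated assumptions, not how any particular search program works: the files are in-memory byte strings, and short_hash is a deliberately tiny, made-up 8-bit hash so that a false positive is easy to see.

```python
def short_hash(data: bytes) -> int:
    """A deliberately tiny hash: the sum of all bytes, modulo 256."""
    return sum(data) % 256


def find_matches(target: bytes, files: dict) -> tuple:
    """Return (shortlist, confirmed) for a search over `files`."""
    wanted = short_hash(target)
    # Step 1: the cheap filter. Keep only files whose hash matches.
    shortlist = [name for name, data in files.items() if short_hash(data) == wanted]
    # Step 2: the expensive check. Compare full contents, shortlist only.
    confirmed = [name for name in shortlist if files[name] == target]
    return shortlist, confirmed


files = {
    "notes.txt": b"abc",
    "copy.txt": b"abc",
    "scrambled.txt": b"cba",  # same bytes in another order: same hash
    "other.txt": b"xyz",
}
shortlist, confirmed = find_matches(b"abc", files)
print(shortlist)  # ['notes.txt', 'copy.txt', 'scrambled.txt']
print(confirmed)  # ['notes.txt', 'copy.txt']
```

        Here scrambled.txt is the false positive: it passes the quick hash filter and is only ruled out by the full comparison.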


3     Saving Time, Saving Lives

The process for a file search can happen very fast. You are reaping the benefits of powerful computers, yes, but more importantly your computer has an efficient workflow. Time-saving measures apply to computers today as much as they did at Bletchley Park in World War II.
        It became possible for some people to forget the lessons, however. For many years, computers grew exponentially in power. Some programmers have not taken for granted that this would always be the case, but the state of affairs lulled some of their lazier colleagues into a simpler state of mind. They would try to solve every problem by throwing more computing resources at it – as if energy and silicon chips were free and limitless.
        We can respond to challenges in more than one way: we can work harder or smarter – or more dearly. Pretend that I take a slow program and ‘speed it up’ by buying a newer and better computer. Nobody would marvel at my supposed genius. On the flip-side, look at how Britain cracked the Nazis’ Enigma machine. The story still captivates audiences.
        Alan Turing helped to beat the Nazis by cracking the codes to their secret messages, but he did not invent a magical supercomputer. His computer performed the repetitive task of ‘brute force’ decryption. In its simplest form, the computer simply tests its guesses one by one. Each attempt looks for the correct ‘key’ that will unlock a certain Nazi communique that the Allies heard over the radio. The computer at Bletchley could have just tried every possibility. It would eventually crack every message, but British intelligence could not wait five months to decrypt a message about a Nazi operation that would take place in a week. They needed a faster solution. Turing had to get creative.
        The innovation lay in reducing the number of possible keys that the computer had to test out. The codebreakers looked over old messages that they had already solved as well as new ones that were still unsolved. They found patterns. Now they knew the ways to direct the computer so that it would spend its time only on the best candidates as it searched for the key.
        We can see pattern recognition with an example involving a German weather report. Let us assume that our encryption machine is straightforward. A given key will move each letter by a certain number of steps. In this scheme, the key is from −25 to −1 or from +1 to +25 – fifty possible keys. If the message of ‘Khoor’ decrypts to ‘Hello’, then the key to encrypt is +3, and to decrypt is −3. If the message of ‘Gdkkn’ results in ‘Hello’, then the keys are −1 and +1.
        With any message, potentially fifty keys exist, but we should work smart, not hard. Consider this message: ‘Kv yknn tckp qp Ucvwtfca, Uwpfca, cpf Oqpfca.’ If we know this is a weather report, the three cases of fca could be day. In a short amount of time, we have decoded it as ‘It will rain on Saturday, Sunday, and Monday.’
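        The shortcut can be sketched in Python. This is a minimal illustration of the simple letter-shifting scheme described above, not of Enigma itself; the names shift and key_from_crib are made up for this example.

```python
from typing import Optional


def shift(text: str, key: int) -> str:
    """Move every letter by `key` steps, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation alone
    return "".join(out)


def key_from_crib(cipher_fragment: str, guessed_plain: str) -> Optional[int]:
    """Return the encryption key implied by a guess, or None if the letters disagree."""
    shifts = {
        (ord(c) - ord(p)) % 26
        for c, p in zip(cipher_fragment.lower(), guessed_plain.lower())
    }
    return shifts.pop() if len(shifts) == 1 else None


message = "Kv yknn tckp qp Ucvwtfca, Uwpfca, cpf Oqpfca."
key = key_from_crib("fca", "day")  # guess: the repeated 'fca' stands for 'day'
print(key)                   # 2
print(shift(message, -key))  # It will rain on Saturday, Sunday, and Monday.
```

        One educated guess replaces a trial of all fifty keys. If the guess were wrong, the letters would imply different shifts and key_from_crib would return None.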
        Genetic testing of any kind has had to test all possibilities and tediously compare long strings of DNA. Undisclosed DNA uses a set of four numbers on a chromosome to skip dead ends and focus all other efforts on a smaller pool of candidates. The Bletchley team did more with less. So must we.


4     The Path Less Taken

As computers grew in power, they could work with larger numbers. This enabled new ways of hashing for cryptography, which is the field of computer security that verifies the integrity of the data sent and received on the Internet, provides virtually complete anonymity and unique identities with fewer collisions, yields nonces and seeds, and underpins cryptographic signatures.
        The typical hashes of cryptography have different goals from the DNA Address, however. Cryptographic hashes strive to be as random as possible with no patterns for anyone to find when comparing inputs and outputs. Coincidentally identical hashes (‘collisions’) should be practically impossible. The hashes not only hide the ordering of the data that was the input; they obscure what kind of data you even had. Hence, the computing world coalesced around the so-called SHA algorithms.
        Unlike the SHA hashes, the DNA Addresses are not totally unique and do not seem random – and they do not want to be. The choice seems rather puzzling at first glance, but we must mind the respective contexts for SHA and the DNA Address.
        With computing security, you want to completely obscure a file’s contents. With DNA, we do not need to hide that we are talking about DNA or about the totals of cytosine, guanine, adenine, and thymine in a chromosome. Those numbers are useless to identity thieves and to those who know which genetic configurations lead to some feature that they want to discriminate against. The four bases only yield useful information when you know their order. The DNA Address hides that.
        Cryptographic hashes can be used to differentiate people. You want the fingerprint of some file you share over BitTorrent to stand apart from all other fingerprints on the BitTorrent network.
        DNA Addresses do not serve this purpose. They do not need to have at least 8,000,000,000² unique addresses in order to virtually eliminate the possibility of collisions between any two humans on the planet. The final check, which differentiates people and measures the degree of familial relation, uses the private keys, not the DNA Address.
        As with cryptographic hashes, however, the DNA Address cannot be reversed. The cryptographic hash must hide everything about the underlying data. The DNA Address is safe because making use of someone else’s DNA requires the ordering of the underlying data as much as the data itself, and the data without its ordering is often useless.


5     Working Together

We can take advantage of the benefits of hashes that are shorter and less ‘pseudo-random’. This is because the DNA Address exists to cut the workload dramatically. It does not aim to do double-duty for identification purposes or to supplant the private key. The process of Undisclosed DNA uses that private key to do the final confirmation of relation and the specification of the degree of relation. The DNA Address gets you to that phase more quickly.
        Additionally, we only need to make the hash impossible to reverse with any certainty. If someone wants to ‘forge a signature’ by genetically engineering a human whose DNA will cause a collision, then that is a whole other kettle of fish!
        The DNA Address allows for some features to stand out because it is not the kind of hash that springs forth from a black box after something entered it. The readable features of someone’s DNA Address are not things such as ‘This person has curly hair’ but instead ‘This person has 104 C bases in the mitochondrial DNA along with 92 G bases, 3 A bases, and 50 T bases.’ This information is useless for an adversary. Because this is just the total number of bases and not their order, it does not tell us anything about the person’s ancestry or how the person looks. To make sense of the DNA, you need to know the order of those bases.
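        The idea of ‘totals without order’ can be shown with a toy example. This is only a sketch of counting the letters C, G, A, and T in a short made-up string. It is not the actual construction of a DNA Address, which this text does not spell out, and the name base_totals is hypothetical.

```python
from collections import Counter


def base_totals(sequence: str) -> tuple:
    """Count the letters C, G, A, and T, ignoring their order."""
    counts = Counter(sequence.upper())
    return (counts["C"], counts["G"], counts["A"], counts["T"])


original = "GATTACA"   # a made-up sequence
reordered = "ACATTAG"  # the same letters in a different order

print(base_totals(original))   # (1, 1, 3, 2)
print(base_totals(reordered))  # (1, 1, 3, 2)
```

        Two different sequences give the same four totals, so the totals alone cannot be turned back into either sequence.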
        To illustrate how a spy or criminal can never take advantage of the information in a DNA Address, imagine reading a telephone directory. In that directory, the telephone number of 0370 010 6676 is assigned to someone whose given name has two Ts in it. Hence, the owner of that telephone number cannot be someone called Maxwell. It could, however, belong to Matthew – or to Suzette, Brittany, Scott, Charlotte, Otto, Betty, Tristan, or Yvette. It could be one of those people, but you do not know for sure which one. Now pretend that instead of a name with a certain letter in it, we were discussing strands of DNA with many millions of possibilities. This lack of clarity suffices to foil an attack by a spy.
        By pure accident, two different women could share the same number of C, G, A, and T bases across mitochondria at the same time that the totals for these bases in their X chromosomes are extremely close. If you were looking for a long-lost sister, you may have hit upon the right one – or a false positive. By virtue of being a hash, the DNA Address creates opportunities for this to happen. The process of Undisclosed DNA uses a second step with private keys to confirm relation, however, and nobody gets false hopes.
        We cannot realistically jump to that confirmation step and perform match testing on pairs of people, one-by-one, even within a small city of twenty thousand people. The DNA Address greatly reduces our workload by quickly reducing the pool of candidates to test.
        If you are looking for a missing cousin in Newcastle, comparing the genetic code of yourself and that of tens of thousands of people is not feasible. Looking in depth at only ten people in the city makes your task much easier.
        The DNA Address plugs a hole created by the varying demands of experts, professionals, activists, and legislators. For too long, they toiled away in silos, unable to create the win-win possibilities that Undisclosed DNA has unlocked.