Andres Corrada-Emmanuel, Chief Science Officer at Data Engines, will be speaking at MassTLC’s upcoming ReDev B0st0n event, Boston’s premier conference for developers and technical executives. Andres will be leading a session entitled “Ground Truth Data Problems in Business.”
The following piece by Andres originally appeared on LinkedIn.
Tickets for ReDev are now available. Visit the conference page for more information.
Uniqueness, not identity, is the source of most value in user data. There is a difference. Although identity carries uniqueness with it, uniqueness can be provided without identity. I came to this realization while developing an algorithm that could measure the accuracy of a company’s algorithm for producing unique IDs of Internet users. A super cookie, if you will.
The gold standard for measuring this accuracy requires that you have the ground truth for your data: the true identity of the users that arrived at the company’s publisher network. But we did not have that information. The vast majority of the IDs produced by the unique ID product were completely detached from any Personally Identifiable Information (PII). The company’s industrial process was a concrete realization of the idea that uniqueness is not identity.
This raised a measurement paradox. If the vast majority of the IDs were unmatched to PII, how would we know that our system was healthy? That it was fulfilling its quality SLAs? That algorithm X was better than algorithm Y? The way to solve this paradox has general utility: a statistic of the ground truth can be computed even when the ground truth itself is missing. It is possible to measure the accuracy of a unique identifier system without knowing any of the true identities of the users. In this case, privacy that forgets us is possible while still fulfilling business analytics needs.
The paper “Algebra of Ground Truth Inference for Web Unique Identifiers” shows the math behind the idea. The solution to the above measurement problem had a couple of interesting features:
- It is based on flipping the wisdom of the crowd on its head: votes are combined for inference, not for decisions. In this case, two systems were compared, and each was used to measure both itself and the other system. Diversity of opinion solves ground truth problems.
- The key to solving the math problem was that the two systems compared against each other made different types of errors. Diversity of errors sometimes has value!
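
To make the idea of comparing two systems without ground truth a little more concrete, here is a minimal sketch of the kind of observable statistic such a comparison could start from. It is not the algebra from the paper. It only counts, for every pair of visits, whether each of two ID systems says the pair belongs to the same user. The function name `pairwise_agreement`, the toy ID lists, and the choice of pair counting are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(ids_a, ids_b):
    """Count how two ID systems agree on every pair of visits.

    ids_a[i] and ids_b[i] are the IDs each system assigned to visit i.
    The result maps (same_in_a, same_in_b) to the number of visit pairs.
    No true identities are needed to compute these counts.
    """
    if len(ids_a) != len(ids_b):
        raise ValueError("both systems must label the same visits")
    counts = Counter()
    for i, j in combinations(range(len(ids_a)), 2):
        same_a = ids_a[i] == ids_a[j]
        same_b = ids_b[i] == ids_b[j]
        counts[(same_a, same_b)] += 1
    return counts


# Hypothetical toy data: five visits labeled by two systems.
system_a = ["a1", "a1", "a2", "a3", "a3"]
system_b = ["b1", "b1", "b1", "b2", "b3"]

counts = pairwise_agreement(system_a, system_b)
total = sum(counts.values())
# Fraction of visit pairs on which the two systems give the same verdict.
agreement_rate = (counts[(True, True)] + counts[(False, False)]) / total
```

The pairs on which the two systems disagree are the interesting ones: at least one system must be wrong there. Turning these counts into an accuracy estimate for each system requires additional assumptions about the kinds of errors each system makes, which is the part the paper works out.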