Andres Corrada-Emmanuel, Chief Science Officer at Data Engines, will be speaking at MassTLC’s upcoming ReDev B0st0n event, Boston’s premier conference for developers and technical executives. Andres will be leading a session entitled “Ground Truth Data Problems in Business.”
The following piece by Andres originally appeared on LinkedIn.
Tickets for ReDev are now available. Visit the conference page for more information.
Ground truth is frequently missing in today’s data intensive flows. A new algebraic method, patented by Data Engines, allows algorithms/robots to autonomously measure their own errors without knowing ground truth. The algorithm is almost magical. Its one flaw bodes well for humans. Here is how it works.
You keep counts of the decisions of the three independent classifiers. If they voted (a,b,a) you increase the counter for that voting pattern. Like the figure illustrating this article shows, there are 8 = 2^3 voting patterns when you do binary classification. That is it. You keep processing your data and increasing the counter for each vote pattern you see. This makes this algorithm very similar to other data streaming algorithms – a few counters summarize the data needed as input for the polynomial system. You could have a million decisions or a trillion decisions in production. It does not matter, eight counters is enough.
Using mathematical software like Mathematica then allows you to solve this polynomial system. You plug the counts you collected on the left side of the equations, the computer gives you two solutions for the variables on the right side –
- The prevalence of the the two labels in your data.
- The accuracies of the three classifier for each of the labels.
This double solution means that robots can measure their precision error but not their accuracy error. Luckily, in the real world accuracy is cheap, precision is expensive. To learn more about these ideas, visit the GitHub Repository for Ground Truth Problems in Business.
Tickets for ReDev are now available.