The Problem: Bad Data Equals Garbage In and Poor Analytics Results
According to surveys conducted by Anaconda and Figure Eight, data scientists spend 45% of their time preparing data, and data cleaning can take a quarter of that time. Data cleaning fixes or discards anomalous or incorrect values and otherwise ensures that the data accurately represents what it is meant to measure. Automating the task is challenging because different data sets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., deciding which of several people named Davis in Denver, CO, a record refers to).
Poor data quality is likely caused by four factors:
Remember the old computer science adage—“Garbage In, Garbage Out”!
The Solution: Applying Bayesian Logic to Data Cleaning
Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability that a hypothesis is true as more evidence or information becomes available. MIT researchers have created a new system, called PClean, that automatically cleans the “dirty data” dreaded by data analysts, data engineers, and data scientists: typos, duplicates, missing values, misspellings, and inconsistencies. PClean provides generic common-sense models for judgment calls, and these models can be customized to specific databases and types of errors. It is the first Bayesian data-cleaning system that can combine domain expertise with common-sense reasoning to automatically clean databases of millions of records. PClean also incorporates an AI programming model developed by the MIT Probabilistic Computing Project. PClean achieves this scale via three innovations:
Other data-cleaning solutions are likely to use probabilistic programming models. These models are evolving and will compete with the MIT solution while the market evaluates the ROI of both approaches.
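To make the Bayesian idea concrete, the following is a minimal Python sketch of a single cleaning decision: choosing the most probable intended city name for a possibly misspelled entry. It is not PClean's model or API. The city counts, the string-similarity error model, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical prior: how often each city appears in records already trusted as clean.
CITY_COUNTS = {"Denver": 700, "Dover": 200, "Denton": 100}


def likelihood(observed, candidate):
    """Toy error model (an assumption): P(observed | candidate) grows with similarity."""
    similarity = SequenceMatcher(None, observed.lower(), candidate.lower()).ratio()
    # Raising to a power sharpens the model so near matches dominate distant ones.
    return similarity ** 8


def posterior(observed, counts=CITY_COUNTS):
    """Bayes' theorem: P(candidate | observed) is proportional to prior * likelihood."""
    total = sum(counts.values())
    unnormalized = {
        city: (count / total) * likelihood(observed, city)
        for city, count in counts.items()
    }
    evidence = sum(unnormalized.values())  # normalizing constant P(observed)
    return {city: value / evidence for city, value in unnormalized.items()}


def clean_city(observed, counts=CITY_COUNTS):
    """Return the candidate with the highest posterior probability."""
    post = posterior(observed, counts)
    return max(post, key=post.get)
```

With these assumed counts, `clean_city("Denvr")` selects "Denver", because the candidate is both common (high prior) and close to the observed text (high likelihood). A real system would learn the prior and the error model from the data rather than hard-coding them.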
The Justification: Data-Cleaning Solutions Will Enable Evaluations of Data-Capture Processes
Healthcare organizations will continue to invest large amounts of informaticist and data scientist time in preparing and cleaning data for use with analytics, business intelligence, and artificial intelligence applications. This time would be better spent on upstream data acquisition processes for all enterprise applications. Data governance boards should focus on guiding the organization toward standard data models that dictate the capture of data in defined formats and time frames. Once most enterprise applications meet the data governance standards, data cleaning will take little if any time. Analytics and artificial intelligence programs will then generate results that improve the organizations’ business efficiency, medical outcomes, and quality of care.
The Players: Emerging and Existing Data-Cleaning Solution Competition Will Benefit Healthcare
Data-cleaning solutions based on probabilistic programming applications range from focused functions to broad data-management capabilities.
Success Factors
Summary
Data cleaning represents significant overhead for highly skilled and expensive staff. Reducing the time spent on cleaning data to ensure accurate analytics and AI results will free that time for updating existing models or creating new ones.
Reducing data-cleaning effort will also allow healthcare organizations to focus data governance efforts on capturing clean data and on standardizing data formats across enterprise systems.
The healthcare industry still has data sets that are not standardized, such as symptoms and laboratory tests. Laboratories could quickly standardize their data by adopting LOINC codes. Symptoms are not standardized, although some are associated with ICD-10 codes.
These examples demonstrate the challenges that informaticists and data scientists face when evaluating data sets from different applications and different organizations. Data-cleaning solutions will eliminate the time wasted on matching data, time that is better spent creating effective data models.
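The laboratory case can be sketched in a few lines of Python: local test names from different systems are normalized and then mapped to a LOINC code. This is an illustrative lookup under stated assumptions, not a complete or authoritative mapping. The alias table and function names are hypothetical, and a production mapping would be built from the official LOINC table.

```python
# Illustrative subset only; a real mapping would be loaded from the official LOINC table.
CANONICAL_TO_LOINC = {
    "glucose": "2345-7",    # Glucose [Mass/volume] in Serum or Plasma
    "hemoglobin": "718-7",  # Hemoglobin [Mass/volume] in Blood
}

# Hypothetical local spellings seen in different source systems.
ALIASES = {
    "glu": "glucose",
    "glucose serum": "glucose",
    "hgb": "hemoglobin",
    "hb": "hemoglobin",
}


def normalize(name):
    """Lowercase, drop commas and periods, collapse whitespace, then resolve aliases."""
    key = " ".join(name.lower().replace(",", " ").replace(".", " ").split())
    return ALIASES.get(key, key)


def to_loinc(name):
    """Return the LOINC code for a local test name, or None to flag manual review."""
    return CANONICAL_TO_LOINC.get(normalize(name))
```

Returning `None` for unrecognized names, rather than guessing, keeps unmatched records visible so that a person can review them.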
Photo credit: nattaphol, Adobe Stock