Challenges for Creating Accurate AI Algorithms across Healthcare Data Sets

For healthcare professionals trying to build accurate and reliable AI algorithms that improve healthcare delivery while reducing service costs, AI has passed the “peak of inflated expectations” in Gartner's Hype Cycle and now sits in the “trough of disillusionment.” Healthcare organizations can usually control the quality of data within their own systems, but AI needs large volumes of data that come from third-party sources or other organizations. Bringing that external data in requires a normalization process that can take significant resources to initiate and maintain. One report states that data scientists spend 45% of their time on data preparation used to inform their algorithms.

Existing data normalization challenges include identifying and ingesting data sources, mapping data so it is searchable, and building algorithms for analytics and AI. While healthcare has codified several data elements (e.g., DRGs, ICD-10, CPT-4, and HCPCS), a significant portion of healthcare documentation is still captured as unstructured text. Natural language processing (NLP) engines that extract unstructured text and map it to coded data elements are improving the quality of organizations' internal data, but the problem remains difficult when externally acquired data is folded into larger data sets. Some healthcare organizations are building solutions to ease these data acquisition challenges and create accurate, reliable data environments that improve the accuracy and value of analytics and AI.
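One small piece of this normalization work can be illustrated with diagnosis codes, which often arrive from different sources in inconsistent formats. The sketch below is only an illustration: the function name `normalize_icd10` is hypothetical, and the check covers the general shape of an ICD-10-CM code (three to seven characters, with a dot after the third), not whether the code actually exists in the code set.

```python
import re
from typing import Optional

# General shape of an ICD-10-CM code with the dot removed:
# a letter, a digit, then one to five letters or digits (3 to 7 characters total).
# This is a format check only; it does not look the code up in the code set.
_ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z]{1,5}$")


def normalize_icd10(raw: str) -> Optional[str]:
    """Return the code in dotted upper-case form, or None if the shape is wrong."""
    compact = raw.strip().upper().replace(".", "")
    if not _ICD10_SHAPE.match(compact):
        return None
    if len(compact) == 3:
        return compact  # three-character categories carry no dot
    return f"{compact[:3]}.{compact[3:]}"
```

A real pipeline would follow a format check like this with a lookup against the licensed code set, and would route rejected values to an informaticist for review rather than dropping them.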

Black Box Data Testing across Healthcare Organizations

A key challenge for healthcare organizations is how to share data with other organizations without violating HIPAA regulations. To assemble a data set large enough to support accurate and defensible AI algorithms, hospitals must use data from other organizations. UCSF has developed a solution called BeeKeeperAI to help resolve this problem. The core of this solution is a black box that creates a secure enclave for data sets from multiple healthcare organizations. Organizations can test their algorithms against the data sets of other healthcare organizations without identifying the patients or the data sources.
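The black-box idea can be sketched in a few lines: the data owner keeps the records, the algorithm is brought to the data, and only aggregate results leave. This is a conceptual sketch, not BeeKeeperAI's actual interface; the `Enclave` class, the `AggregateReport` type, and the `min_records` threshold are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical types for illustration only.
Record = Dict[str, float]
Model = Callable[[Record], int]


@dataclass(frozen=True)
class AggregateReport:
    n_records: int
    accuracy: float


class Enclave:
    """Holds one organization's labeled records and releases only aggregates."""

    def __init__(self, records: Sequence[Record], labels: Sequence[int],
                 min_records: int = 10) -> None:
        if len(records) != len(labels):
            raise ValueError("records and labels must have the same length")
        self._records = list(records)
        self._labels = list(labels)
        self._min_records = min_records

    def evaluate(self, model: Model) -> AggregateReport:
        # Refuse to report on very small sets, where an aggregate
        # could reveal individual outcomes.
        if len(self._records) < self._min_records:
            raise ValueError("data set too small to report safely")
        correct = sum(
            1 for record, label in zip(self._records, self._labels)
            if model(record) == label
        )
        return AggregateReport(len(self._records), correct / len(self._records))
```

The caller supplies a model and receives a record count and an accuracy figure, never the rows themselves. A production enclave would also rely on hardware isolation, encryption, and audit controls that this sketch does not attempt to model.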

UCSF has used this environment to create AI algorithms for evaluating radiology images and for extracting data from faxed documents that are used to identify and schedule referral services. On the imaging front, BeeKeeperAI is working with GE Healthcare to create a repository of images that can be used to tune AI algorithms that identify abnormalities and use them to prioritize the order in which radiologists read images. Philips is UCSF's partner on the fax referral solution and is also working with UCSF to improve patient-flow algorithms, so that patients are treated in the correct modalities of care for their services or treatment progressions. Bed availability and staffing algorithms are also being developed to address issues that have been magnified by the pandemic.

Acquisition and Normalization Support Successful AI Programs

AI will continue to evolve and become a necessary component of every healthcare organization's data analytics, driving innovative healthcare services that improve care quality and patient safety while reducing costs. Value-based care will require an AI infrastructure that continuously screens high-risk patient populations to ensure patients receive appropriate care in the modalities best suited to their treatments. The ability to acquire data sets for AI use from comparable healthcare organizations will let organizations train their algorithms on larger data sets, producing more accurate algorithms that deliver the value envisioned for AI.

Data Sharing Solutions to Improve AI Capabilities

The following solutions assist healthcare organizations with the acquisition of de-identified data that can be used to improve data analytics and AI algorithms:

BeeKeeperAI: developed at UCSF to improve several AI solutions.

Truveta: a consortium of 20 leading U.S. healthcare organizations to support sharing of de-identified data.

Project Nightingale: a collaboration between Google and Ascension to generate large patient data volumes to improve healthcare analytics.

Success Factors

  1. Organizations considering a healthcare data collaboration should first evaluate the data on offer to ensure it is de-identified and is representative of the patient populations the organization manages.
  2. After selecting a data source, the organization must evaluate its ability to acquire and download the data in a timely manner (e.g., via FHIR APIs) and confirm that data mapping can be implemented so that the organization's databases are populated correctly.
  3. The organization should use its innovation center and/or data informatics group to thoroughly test the data acquisition and normalization process before the data is made available for general use.
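As a rough illustration of the first and third factors, the sketch below screens a FHIR Bundle that has already been downloaded and parsed from JSON, and flags Patient resources that still carry direct identifiers. The element names (`identifier`, `name`, `telecom`, `address`, `photo`, `contact`) are standard FHIR Patient elements, but the choice of which ones to check is an assumption made here; it is not a complete HIPAA de-identification test, and the function names are hypothetical.

```python
from typing import Any, Dict, List

# Assumed subset of FHIR Patient elements that carry direct identifiers.
# Dates, free-text notes, and other resource types are NOT checked here.
DIRECT_IDENTIFIER_FIELDS = (
    "identifier", "name", "telecom", "address", "photo", "contact",
)


def identifier_fields_present(patient: Dict[str, Any]) -> List[str]:
    """List the checked elements that are present and non-empty."""
    return [field for field in DIRECT_IDENTIFIER_FIELDS if patient.get(field)]


def screen_bundle(bundle: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each flagged Patient id to the identifier elements found on it."""
    flagged: Dict[str, List[str]] = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue
        found = identifier_fields_present(resource)
        if found:
            flagged[resource.get("id", "<no id>")] = found
    return flagged
```

An empty result from `screen_bundle` only means that none of the checked elements were populated. A sample that passes should still go through the organization's formal de-identification review before general use.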


Several healthcare data collaboration entities are emerging to help healthcare organizations acquire larger data sets that will improve analytics and AI algorithms. While BeeKeeperAI and Truveta are driven directly by healthcare organizations, Project Nightingale is a collaboration between Google and Ascension that demonstrates that large technology companies are also moving to develop effective data-sharing solutions. De-identifying the patient data used in these collaborations is a key challenge in meeting HIPAA regulations. Synthetic data generators are likely to become a key component of these collaborations as they evolve, and healthcare organizations should include them in their evaluation of any healthcare data collaboration solution.

The other challenge for data acquisition is the ability to easily extract, transform, and download the data from the collaborative database. Normalizing the data and mapping it to the organization’s database models require skilled informaticists to keep the process efficient. If an organization doesn’t have these resources in house, it will need to contract with consultants to provide them.
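The mapping step described above can be sketched as a small transform that renames external fields to local ones, converts units, and reports anything it could not map. The field names in `FIELD_MAP` and the function `transform_record` are hypothetical; a real mapping would be defined by the organization's informaticists against its own database models.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical mapping from a collaborator's field names to local field names.
FIELD_MAP = {
    "pt_sex": "sex",
    "dx_code": "diagnosis_code",
    "wt_lb": "weight_kg",
}

LB_TO_KG = 0.45359237  # one pound in kilograms, exact by definition


def transform_record(external: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return the record in local form plus the external fields left unmapped."""
    local: Dict[str, Any] = {}
    unmapped: List[str] = []
    for key, value in external.items():
        target = FIELD_MAP.get(key)
        if target is None:
            # Surface unmapped fields instead of silently dropping them.
            unmapped.append(key)
            continue
        if key == "wt_lb" and value is not None:
            value = round(float(value) * LB_TO_KG, 1)
        local[target] = value
    return local, unmapped
```

Returning the unmapped fields alongside the transformed record gives the informatics team a running list of gaps in the mapping, which is useful during the testing phase before the data is released for general use.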