NCCID case study: Setting standards for testing Artificial Intelligence

21 February 2022


Evaluating the performance of AI models using data from the National COVID-19 Chest Imaging Database has produced a valuable proof-of-concept validation process for testing the quality of AI in health and care.

There has been increasing interest in using artificial intelligence (AI) to support disease diagnosis, improve system efficiency and enable at-home care. The potential of AI in imaging is being investigated through tools that support radiology, where medical images like X-rays are used to diagnose and treat disease.

The challenge

During the pandemic, many new AI radiology products were developed to help identify COVID-19 from scans. The potential benefits of faster diagnosis and early treatment encouraged hospitals to consider using these tools to help manage rising demand.

However, there are concerns about the effectiveness and reliability of some of these tools. To develop algorithms successfully, you need large volumes of good quality data. Part of this data is used to train the algorithm, and another part is kept separate for testing its performance. Testing on data the algorithm has not seen helps innovators to understand how well their AI tool really performs.

If both the training and test data come from the same hospital, it can be hard to know if the AI tool would perform well on data from a different population, for example another hospital or region of the UK. Being able to assess this is particularly important for understanding bias in AI models. The term “bias” is used to describe systematic and repeated errors that an AI model can make because of limitations in the training and test data used, which can lead to negative outcomes like discrimination.
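To illustrate the point about keeping test data from a different population, here is a minimal Python sketch that holds out whole hospitals rather than individual scans. The record fields (`id`, `site`, `label`) and the function name `split_by_site` are hypothetical and are not taken from the NCCID tooling.

```python
def split_by_site(records, held_out_sites):
    """Split records so that held-out sites never appear in the training data.

    `records` is a list of dicts with a hypothetical "site" field naming the
    hospital each scan came from.
    """
    train, external = [], []
    for rec in records:
        if rec["site"] in held_out_sites:
            external.append(rec)   # only used to test the finished model
        else:
            train.append(rec)      # available for model development
    return train, external


# Toy records for illustration only.
records = [
    {"id": 1, "site": "A", "label": 1},
    {"id": 2, "site": "A", "label": 0},
    {"id": 3, "site": "B", "label": 1},
    {"id": 4, "site": "C", "label": 0},
]
train, external = split_by_site(records, held_out_sites={"C"})
```

Because no scan from site "C" is seen during development, performance on `external` gives a fairer indication of how the model might behave at a hospital it has never encountered.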

Bias can be assessed through a validation process, using validation data. This is key to ensuring that the AI technologies adopted in the NHS are safe and ethical.

The challenge was to create a validation process that could be used as a blueprint for examining the robustness of AI models, paving the way for safer AI adoption in health and care.

The solution

In May 2020, the NHS AI Lab began building a database of chest scans and supporting medical information from NHS trusts. The aim was to provide researchers, analysts and developers with a sufficiently large volume of quality data to support investigations into COVID-19 diagnosis and treatment.

This National COVID-19 Chest Imaging Database (NCCID) now includes over 60,000 images from more than 27 hospital trusts, and is being used by 16 research groups (as of February 2022).

Such a large database is well placed to provide separate training and testing data, as well as a process through which to validate and examine the robustness of an AI tool and detect any potential issues around performance and bias.

In 2021, the NHS AI Lab Imaging team, together with a research group formed by the British Society of Thoracic Imaging, Faculty, Queen Mary University of London, Royal Surrey Foundation Trust and the University of Dundee, ran a proof-of-concept validation process on 5 AI models / algorithms using NCCID data. A proof of concept is a feasibility study that determines whether or not an idea can be turned into reality.

The models (from a research team, a global company and 3 start-ups) all detected COVID-19, or proxies for COVID-19, from medical images. The exercise aimed to run statistical tests to calculate each model’s performance and understand any biases in the results.

What did the validation process look like?

The tests calculated how accurately the models detected positive and negative COVID-19 cases from medical images (how ‘specific’ and ‘sensitive’ the algorithms were). For an explanation of specificity and sensitivity, please see our AI Dictionary.
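As a rough illustration of these two measures, the sketch below computes sensitivity (the share of true positive cases the model detects) and specificity (the share of true negative cases the model correctly rules out) from binary labels. The function name and the toy data are assumptions made for this example and are not the statistical code used in the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = COVID-19 positive.

    A rate is returned as None when it is undefined, for example when the
    data contains no positive cases.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Toy ground truth and model predictions for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 0.75
```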

The tests also assessed how the models performed for different patient sub-groups, for example by age, ethnicity and / or sex. The robustness of each algorithm was assessed by looking at how it performed in response to changes in the data, such as the inclusion of patients with additional medical conditions, or images taken with different scanning equipment.
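A sub-group comparison of this kind could be sketched as follows. This is a simplified illustration: the row fields (`sex`, `truth`, `pred`) and function names are hypothetical, and the study's actual statistical tests are not described in this article.

```python
from collections import defaultdict


def rates(pairs):
    """Return (sensitivity, specificity) for a list of (truth, prediction) pairs."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec


def metrics_by_subgroup(rows, key):
    """Group rows by a demographic field and compute the rates for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append((row["truth"], row["pred"]))
    return {group: rates(pairs) for group, pairs in groups.items()}


# Toy rows for illustration only.
rows = [
    {"sex": "F", "truth": 1, "pred": 1},
    {"sex": "F", "truth": 1, "pred": 0},
    {"sex": "F", "truth": 0, "pred": 0},
    {"sex": "M", "truth": 1, "pred": 1},
    {"sex": "M", "truth": 0, "pred": 1},
    {"sex": "M", "truth": 0, "pred": 0},
]
by_sex = metrics_by_subgroup(rows, key="sex")
print(by_sex)  # {'F': (0.5, 1.0), 'M': (1.0, 0.5)}
```

A large gap between groups, like the one in this toy data, is the kind of signal that would prompt a closer look for bias. With real data, each estimate would also need a confidence interval, because small sub-groups give noisy results.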

Each external validation had the following 4 broad steps, tailored depending on the inputs and outputs of each model. For each model we:

created a validation data set based on the intended use case of each algorithm by using data from the NCCID that had not been used to train the algorithm
used a cloud-based deployment environment to run the algorithms rather than a locally hosted one, providing a secure accessible space that protects the intellectual property of the developer
ran the model on the validation data set, and performed pre-defined statistical tests to assess the robustness and performance of the model against different demographics
reported the results to the organisations that built the models in order to inform model improvements / developments
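The first of these steps, building a validation set from data the model has never seen, can be sketched in Python as below. The field names (`patient_id`, `modality`) and the function `build_validation_set` are hypothetical. The sketch excludes records by patient, which is an assumption made here so that no patient appears in both training and validation data. The article does not say how exclusion was done in the study.

```python
def build_validation_set(candidates, training_patient_ids, matches_use_case):
    """Keep records that fit the model's intended use and were not used in training.

    `matches_use_case` is a caller-supplied function that returns True when a
    record fits the intended use case of the algorithm.
    """
    return [
        rec
        for rec in candidates
        if rec["patient_id"] not in training_patient_ids and matches_use_case(rec)
    ]


# Toy records for illustration only.
candidates = [
    {"patient_id": "p1", "modality": "CXR"},
    {"patient_id": "p2", "modality": "CT"},
    {"patient_id": "p3", "modality": "CXR"},
]
validation = build_validation_set(
    candidates,
    training_patient_ids={"p1"},  # patients the developer reported training on
    matches_use_case=lambda rec: rec["modality"] == "CXR",  # e.g. a chest X-ray model
)
```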

Outcomes and learnings

By taking part in this study, the NHS team and vendors learned more about the performance of their algorithms, which will help shape their further development.

The proof-of-concept process has proved a valuable blueprint for testing that the AI models adopted for use in health and care are safe, robust and accurate. There is currently no AI model validation process that helps developers to build better performing AI for radiology.

The exercise was particularly valuable for:

Improving understanding of the potential for AI models to support clinicians in diagnosing COVID-19 from medical images
Producing guidance about the statistical tests needed to assess model performance and robustness
Creating a method of developing labelled data sets
Experimenting with data curation that will guide future imaging platform development
Supporting methods of quantifying and helping to reduce bias in AI models for health and care
Producing technical guidance for creating secure development environments that AI vendors can trust

Tools available

GitHub open source tools and guidance

Interactive optimal operating point tool

The tools and guidance linked above are available to help data scientists validate their own models, allowing anyone to improve the performance and safety of their models.

The optimal operating point tool allows developers to easily determine the optimal operating point for their model depending on the prevalence of COVID-19 in the community. This allows them to adjust their model’s threshold so that it detects as many positive cases as possible (low false negative rate) while keeping the false positive rate low.
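To show why prevalence changes the best threshold, the Python sketch below picks the threshold that minimises an expected error cost weighted by prevalence. This is one common way to choose an operating point and is an assumption made for illustration. The article does not describe the method used by the interactive tool, and the function names, cost parameters and toy scores here are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A case is predicted positive when its score is at or above the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative case")
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / pos, 1 - fp / neg))
    return points


def best_threshold(scores, labels, prevalence, cost_fn=1.0, cost_fp=1.0):
    """Pick the operating point with the lowest prevalence-weighted error cost."""
    def expected_cost(point):
        _, sens, spec = point
        missed = prevalence * (1 - sens) * cost_fn             # false negatives
        false_alarms = (1 - prevalence) * (1 - spec) * cost_fp  # false positives
        return missed + false_alarms
    return min(roc_points(scores, labels), key=expected_cost)


# Toy model scores and ground truth for illustration only.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
low = best_threshold(scores, labels, prevalence=0.05)
high = best_threshold(scores, labels, prevalence=0.8)
print(low[0], high[0])  # 0.7 0.4
```

In this toy example, when COVID-19 is rare the chosen threshold is higher, because false positives dominate the errors. When COVID-19 is common the chosen threshold is lower, so that fewer positive cases are missed.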

Our rigorous validation and testing procedures have implemented a novel process to test that AI models adopted are safe, robust and accurate in diagnosing COVID-19 - while protecting developers’ intellectual property. Unfair and biased models can lead to inconsistent levels of care, a serious problem in these critical circumstances. Outside of the NHS, our validation process has helped guide the use of AI in medical diagnosis and inform new approaches to the international governance of AI in healthcare. (Dominic Cushnan, Head of AI Imaging, NHS AI Lab)

What’s next?

We want to provide AI developers and researchers with the right tools to analyse data and train and validate their AI technologies, but in a secure and controlled environment that protects their intellectual property and doesn’t expose them to risk.

This validation process will support work at NHS Digital to develop new, rigorous AI assurance processes and bespoke training to ensure that the health and care sector is ready to deal with the challenges set by AI.

In order to support AI in becoming a real-world success for radiology, we will take more steps like this to answer questions on ethics, regulations, safety and validation.

This work goes hand in hand with the NHS AI Lab’s projects on AI ethics where we are working to ensure that AI is developed in a way that is fair and patient-centred. One project (with the Ada Lovelace Institute) includes proposals for an algorithmic impact assessment for data access, which aims to ensure that algorithmic uses of public-sector data are evaluated and governed to produce benefits for society, governments, public bodies and technology developers, as well as the people represented in the data and affected by the technologies and their outcomes.

The validation pilot also forms part of the Lab’s role in clarifying a structure for AI regulations, by helping to define how we evaluate and quality assure the models being developed.

The NHS AI Lab Imaging team is aiming to continue the progress of the NCCID and of the validation process by developing a broader medical imaging platform, providing researchers with a single source of national medical data.

With thanks

The academic research consortium included:

the Scientific Computing team at the Royal Surrey NHS Foundation Trust
Health Informatics Centre at the University of Dundee
statisticians from Queen Mary University of London
radiologists at the British Society of Thoracic Imaging
the AI company, Faculty

Faculty is an applied AI company that specialises in designing, building and implementing custom AI technology. Faculty works with a number of high-profile brands globally as well as departments and agencies across government.