What public datasets will have high impact on AI in the UK?

The other day I got asked a question: what public datasets will have high impact on AI in the UK? The person asking me was pulling together ideas for the UK’s AI Opportunities Action Plan.

This is the AI action plan that, to use Prime Minister Keir Starmer’s vivid phrase, is intended to ‘mainline AI into the veins of the UK’.

Hmm. It is a surprisingly big question, not least because the AI Opportunities Action Plan is not very specific about how or why it wants the UK to inject AI into its veins.

But, after asking some wiser people for their thoughts and thinking about the AI tools I’ve helped people with in my work, here are some ideas that might be useful.

So, read on if you want to:

- learn a bit more about what the plan recommends
- think about five areas of AI that publishing data could impact:
  - unlocking other datasets
  - supporting accountability and innovation
  - models for government priorities
  - models for particular domains
  - models with particular capabilities
- see five recommended datasets to unlock capabilities that AI investors and firms don’t seem to be working on, yet:
  - navigating the UK’s legal and regulatory framework
  - seeing the UK as the administrative state does
  - understanding what is generally accepted as a fact in the UK
  - understanding how people in the UK communicate with each other
  - understanding how people in the UK experience public services
- remember that publishing data doesn’t lead to magic

What is the AI Opportunities Action Plan recommendation about datasets?

The top-line of recommendation 7 is to:

Rapidly identify at least 5 high-impact public datasets it will seek to make available to AI researchers and innovators

The description says:

Prioritisation should consider the potential economic and social value of the data, as well as public trust, national security, privacy, ethics, and data protection considerations. We should explore use of synthetic data generation techniques to construct privacy-preserving versions of highly sensitive data sets. Government data sets are a public asset, and careful consideration should be given to their valuation.

There are other recommendations related to data, but this one is key to getting things moving.

What does “high-impact” mean?

The recommendation talks about “high-impact” datasets but does not clearly define what type of impact is being looked for. Many conversations in this area focus simplistically on data for developing new AI models, but if we look a bit deeper then five high-level areas might be interesting:

Datasets that unlock other datasets
Datasets about AI models and products to support transparency, accountability and innovation
Datasets that make it easier to develop AI models and products needed for government’s priorities
Datasets that make it easier to develop AI models and products in particular domains
Datasets that make it easier to develop AI models and products with particular capabilities

All of these areas align with the UK government’s overarching goal of (economic) growth, but they get there in different ways and might generate other benefits as well as impact for AI.

The first area is to focus on datasets that unlock other datasets. This is similar to the approach that the EU has taken on high-value datasets. The EU is making specific datasets in the geospatial, earth observation and environment, meteorological, statistics, companies and mobility domains available, based on analysis that says doing so will create billions in economic value. Rather than looking for ideas for new datasets, the UK government could simply copy this work, as these datasets could have similarly high economic value for the UK.

The second is to focus on datasets about AI models and products, for example datasheets, model cards, system cards or information about service performance. These kinds of datasets can drive both accountability and innovation. The UK government could be a lot more radical in its work to improve transparency about AI models and usage, but I suspect this isn’t the desired impact of the AI Action Plan.

The third area identifies datasets that would contribute to developing AI models that help deliver the government’s priorities, for example the five missions. The missions are still being described to the public in quite broad terms so these datasets are best identified by teams working on the government’s missions and other priorities. The AI team should probably just speak with the mission teams.

The fourth area is to focus on datasets that help build AI products in a domain. For example, in the domain of medical diagnosis, Moorfields Hospital collected eye scans for use in identifying or predicting a range of medical conditions. People who want to improve how major infrastructure projects are managed might want to make data available about how big things get done. People who want to improve public sector procurement and delivery might want to look at procurement data. These datasets are best identified by teams working in those domains. The government will need to select the domains it prioritises.

The fifth area is to focus on datasets that help develop domains of capability for AI tools. These datasets might be used to develop models or as technical components that can be incorporated into AI products – for example as filters, or as authoritative knowledge bases that can be looked up when creating a response.

The precise technical architectures will depend upon AI researchers and innovators.
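To make the “authoritative knowledge base” idea a little more concrete, here is a minimal Python sketch of a lookup component that an AI product could call while creating a response. Everything in it is an assumption for illustration: the `Fact` structure, the `KNOWLEDGE_BASE` contents and the `answer_with_source` function are hypothetical and do not describe any existing government service or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One entry in a hypothetical authoritative knowledge base."""
    key: str
    value: str
    source: str        # who published the fact
    last_updated: str  # ISO date the publisher last confirmed it


# Entirely made-up entry, used only to show the shape of the data.
KNOWLEDGE_BASE = {
    "example-threshold": Fact(
        key="example-threshold",
        value="placeholder value",
        source="Example Department (hypothetical)",
        last_updated="2025-01-01",
    ),
}


def answer_with_source(key: str) -> str:
    """Return the authoritative value with its source, or say it is unknown."""
    fact = KNOWLEDGE_BASE.get(key)
    if fact is None:
        # Refuse to guess rather than let a model invent an answer.
        return "No authoritative record found."
    return f"{fact.value} (source: {fact.source}, last updated {fact.last_updated})"
```

The design choice worth noticing is that the component returns the source and date alongside the value, and says so plainly when it has no record, rather than leaving a model to fill the gap.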

These capabilities can simultaneously benefit a wide range of other domains and AI products so could create a lot of impact across the previous four areas.

Given this, the following section recommends five datasets in this fifth area, with a leaning towards capabilities that AI investors and firms are struggling to do by themselves.

Suggestions for datasets that unlock domains of AI capabilities

Legal data

Example dataset: legislation data on legislation.gov.uk

Desired impact/outcome for AI: More AI models and products that can help users navigate both the UK’s legislation as written and the intent behind the rules

Data holder(s): National Archives, MoJ, Parliament, regulators

Getting started: Some of this data is already modelled – for example through legislation.gov.uk – but requires annotating and contextualising for use by AI researchers and innovators.

The law can change more regularly than many people expect, and case law evolves on a daily basis. So developing products that have this capability is likely to encourage the creation of technical architectures that use multiple small components and models that can be iterated at different frequencies, and to require UI/UX design that helps users understand the limitations of the tools they are using.
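As a rough sketch of what “components that can be iterated at different frequencies” might look like, the Python below tracks when each component was last refreshed and flags the ones that are overdue. The component names, intervals and dates are invented for illustration; a real product would choose its own.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Component:
    """A hypothetical part of a legal AI product with its own refresh cycle."""
    name: str
    refresh_every: timedelta
    last_refreshed: date


def stale_components(components, today):
    """Return names of components whose data is older than their refresh interval."""
    return [c.name for c in components if today - c.last_refreshed > c.refresh_every]


# Illustrative values only: case law refreshed daily, legislation text monthly.
components = [
    Component("case-law index", timedelta(days=1), date(2025, 3, 1)),
    Component("legislation text", timedelta(days=30), date(2025, 3, 10 - 9)),
]

overdue = stale_components(components, today=date(2025, 3, 10))
```

A list like `overdue` could feed both a rebuild process and the user interface, so that users are told when part of an answer rests on data that has not been refreshed recently.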

The government’s guide to making legislation describes other information created during the legislation drafting process, for example the policy and factual background to a piece of legislation, that could be turned into data and annotated for use in AI processes. This could further enrich AI models and products.

Beyond legislative data this provides a path to move into regulatory guidance, government policies, and other legal datasets.

The National Archives have expertise in the required activities and data.

AI models and products won’t get to know the law the way the fictional character Neo gets to know kung fu in The Matrix. Our world is more complicated than the movies.

Authoritative reference data about the UK

Example dataset: Address data

Desired impact/outcome: More AI models and products that understand how to navigate, interact with and see the UK as the administrative state does

Data holder(s): Ordnance Survey, ONS, central and devolved government departments, local government, Met Office, National Highways (formerly Highways England), Transport Scotland, etcetera

Getting started: Some of this data is already available – eg national statistics – but will need annotating and contextualising for use by AI researchers and innovators.

Some of this data is behind paywalls and copyright restrictions, for example Ordnance Survey’s geospatial data. Additional work is required to create a sustainable funding model that makes the data available for use by AI researchers and innovators. Doing so will reduce financial and IP barriers for AI model developers and help level the playing field between large AI firms and smaller organisations.

Other data – for example planning rules or benefits eligibility rules – may not be held in digital formats and will require work to create data standards, digitise the data, develop funding models and then annotate and contextualise for use in AI. MHCLG’s digital planning programme is doing some of this work for planning data.

This work will need to cater for the leeway in some of these rules where humans will make decisions and rules cannot be cleanly turned into data.

AI models and products that use this data will require additional work to ensure both that models appropriately use authoritative factual data and that this usage is appropriately communicated to service users. This will help service users understand the source and quality of the data.

Authoritative factual data about the world

Example dataset: Eurostat data

Desired impact/outcome: More AI models and products that understand what is generally accepted as a fact in the UK

Data holder(s): Non-UK NSOs (national statistics organisations), trustworthy news sources

Getting started: Investment will be required to support some data holders to publish and annotate this data for use in AI models and products.

AI models and products that use this data will require additional work to ensure both that models appropriately use authoritative factual data and that this usage is communicated to service users in ways that help them understand the source and quality of the data.

Some facts are disputed between or within countries. AI models and products will need to be developed that can navigate these disputes.

Full Fact have developed a methodology for identifying, collecting and annotating this data so that it is useful for AI researchers and innovators. Government could get started by talking with that team.

Cultural data about the UK

Example dataset: Transcripts of a diverse range of television soap operas and sitcoms

Desired impact/outcome: More AI models and products that understand how people in the UK communicate with each other.

Data holder(s): BBC, ITV, Channel 4, National Archives, British Library

Getting started: The Action Plan recommended that the government make a “copyright-cleared British media asset training data set” available. This is a broad description.

The emphasis should be on making data available that will help AI researchers and innovators develop AI models and products that understand how the UK’s diverse publics communicate with each other.

Identifying a wide range of television shows that depict the lives of people from different and overlapping communities across the UK’s four nations would help achieve this goal.

To use some slightly whimsical examples, a list that included Coronation Street, Emmerdale, EastEnders, Take The High Road, Pobol y Cwm, Grange Hill, Derry Girls, Desmond’s, People Just Do Nothing and Alma’s Not Normal would provide a rich training set about people in the UK.

Government could get started by talking with both creative industry organisations and workers in creative industries.

What use is an AI model that can’t communicate with Alma?

Public service complaints data

Example dataset: Feedback data collected by gov.uk services

Desired impact/outcome: More AI models and products that understand how people in the UK experience public services.

Data holder(s): Central government departments, local authorities, regulators, politicians

Getting started: People make complaints when things do not work. These complaints help us understand how people communicate when they are unhappy, frustrated or even angry. They also help identify where things can be improved and how people use and experience services.

These complaints might be made in a number of different places and at different times: for example, while people are using an online service on gov.uk, in person at a public sector building like a town hall or library, to regulators, or to politicians.

People are more likely to make these complaints if they believe they will be heard, so this data tends not to be fully representative and is likely to over-represent majority groups. Statistical techniques can help counter this bias but public sector organisations can also invest in improving feedback processes to provide a more representative dataset. Done well this will also have the direct benefit of improving public services. Perhaps that improvement should be the primary goal.
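One of the simpler statistical techniques for this kind of bias is post-stratification: weight each group so that the weighted sample matches the group’s known share of service users. The sketch below assumes that those shares are known, which is itself a big assumption, and the group labels and numbers are made up.

```python
def post_stratification_weights(sample_counts, population_shares):
    """Weight each group so the weighted sample matches the population shares.

    sample_counts: complaints received per group, e.g. {"A": 80, "B": 20}
    population_shares: each group's share of service users, summing to 1
    """
    total = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        sample_share = count / total
        # Over-represented groups get a weight below 1, under-represented above 1.
        weights[group] = population_shares[group] / sample_share
    return weights


# Made-up numbers: group A files 80% of complaints but is 60% of service users.
sample_counts = {"A": 80, "B": 20}
population_shares = {"A": 0.6, "B": 0.4}
weights = post_stratification_weights(sample_counts, population_shares)
```

With these invented numbers, each complaint from group A counts for roughly 0.75 of a complaint and each one from group B for roughly 2. The technique cannot help with groups that made no complaints at all (the division fails when a count is zero), which is one reason improving the feedback processes themselves matters.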

Complaints data is sensitive, so synthetic data will need to be created and appropriate governance put in place to ensure that the data is only used in ways aligned with the purposes for which it has been provided.

Government could get started by exploring the feedback processes and data that should be collected by the gov.uk channel used by some public services.

Publishing data doesn’t make magic happen

So, some recommendations for datasets to help develop capabilities in AI models and products that would have a high impact in the UK. Some of those recommendations will be useful, others will probably turn out to be useless, but there’s another important thing to remember.

Simply making data available does not mean that people will use it in ways that you expect.

Luckily for the government it can act like a system, intervening in a range of ways and places to reduce harms and create more desirable outcomes. Government’s roles as a regulator, funder, user and provider of AI models and products will all be useful to help deliver on the outcomes described in this post.

Those, and other, roles are things that the government will need to lean into if it wants the AI being mainlined into the UK’s veins to lead to better lives for everyone in the UK.
