Electronic Health Records (EHRs) have led to a proliferation of healthcare data, held in a combination of structured, coded information and freetext formats. Patient records in primary, secondary and community care, and in different specialities, are largely held in isolated silos, with limited sharing across healthcare settings. Patient-generated data from mobile apps, wearables and sensors is a potentially valuable source of knowledge, but is currently locked in a combination of structured and unstructured forms that cannot easily be blended with the traditional structures of clinical systems.

Data structures, data quality and coding vary greatly across systems and individual organisations. Clinicians find it challenging to apply consistent coding and maintain data quality under the time pressures of clinical practice. The volume and complexity of data from health services and patients is growing exponentially, alongside rising rates of chronic illness and co-morbidity.

Against this backdrop of seemingly unmanageable data expansion, our dependency on structured data held in information silos presents a major barrier to person-centred and personalised care, which needs the full picture of information from all sources to support the person’s needs.


The vision for many clinicians and patients is of an intelligent system that can automatically analyse unstructured, uncoded data from multiple sources, and synthesise this data to make transparent predictions about risk of adverse events, alongside evidence-based and personalised recommendations about treatment and prevention.

The potential of Machine Learning and Natural Language Processing (NLP)

The Digital Health and Care Institute, RedStar Consulting, Tactuum and NHS Greater Glasgow and Clyde, with clinical leadership from Dr Chris Sainsbury, are working on an Innovate UK-funded feasibility project to explore the potential of Machine Learning and NLP to deliver this vision.

The objective is to build Machine Learning models which will (1) analyse all the clinical notes associated with a patient, (2) predict the risk of different clinical endpoints such as heart attack or death, and (3) present this information to the clinician as a score or alert. Clinicians can then use this to tailor the consultation, identify high-risk patients, and target specific clinical outcomes. For this initial feasibility study, the project is using patient records from SCI-Diabetes, a world-renowned electronic health record system, which has comprehensive records for 99% of diabetes patients in Scotland.
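As a rough illustration of this kind of pipeline, the minimal sketch below trains a simple text classifier to predict a binary endpoint directly from freetext notes and outputs a risk score. It is not the project's actual model: the notes, labels, and choice of TF-IDF plus logistic regression are all invented for illustration.

```python
# Hypothetical sketch only: toy notes and labels, not SCI-Diabetes data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each record is the concatenated freetext for one patient, labelled 1 if
# the endpoint of interest (e.g. readmission) occurred.
notes = [
    "patient stable, bloods normal, no concerns at review",
    "worsening breathlessness, poor glycaemic control, missed appointments",
    "routine follow-up, HbA1c improving, continue current therapy",
    "chest pain on exertion, admitted overnight for observation",
]
endpoint = [0, 1, 0, 1]

# Step (1): turn the notes into features; step (2): fit a risk model.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, endpoint)

# Step (3): the clinician-facing output is a risk score in [0, 1].
new_note = "deteriorating renal function, recurrent hypoglycaemic episodes"
risk = model.predict_proba([new_note])[0, 1]
print(f"predicted endpoint risk: {risk:.2f}")
```

In a real system the score would feed an alerting threshold rather than being shown raw, but the shape of the pipeline — notes in, calibrated risk out — is the same.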

A key innovation is the use of sentiment analysis, using freetext to directly predict different clinical endpoints. This is in contrast to most other NLP approaches, which aim to extract structured information from freetext and convert it into clinical codes – for example, identifying mentions of specific diseases and automatically applying the relevant code.

Because it analyses the entire patient history, the model may also be able to aggregate the judgements of different clinicians and even detect new patterns of disease progression.

Initial testing of the ML/NLP model on a test dataset, predicting readmissions and mortality, has given encouraging results, comparable to published results using deep learning techniques.


This feasibility study is only a first step, but it holds great potential to help transform the way we use clinical and patient data to drive personalised medicine and improve healthcare quality.