Synthetic longitudinal patient data

Background and prerequisites

My current research focuses on generating and evaluating synthetic longitudinal patient data. Synthetic data are generated on the basis of existing real-world data (the original data) and try to mimic the original data as closely as possible. Longitudinal data, on the other hand, are data in which at least one variable has been measured at several time points (at least two) and time is treated as an index rather than a random variable (cf. survival or time-to-event data). Longitudinal data usually have fewer of these so-called repeated measurements than time series or signal data. In addition, longitudinal data usually include background variables (covariates) that remain constant over time.

In the current literature, however, these terms are used to mean different things, or different names are used for them. For example, synthetic data sometimes means data that are simulated from a model or distribution that is not based on any existing data or is only loosely connected to such data. Nonetheless, synthetic data are generally assumed to be fake in some way, i.e., they are not the same as real-world data (RWD). Synthetic data are sometimes referred to as realistic synthetic data to emphasize that they resemble real-world data. In addition, longitudinal data are sometimes called repeated measurement data, panel data, event time series and sometimes even time series, because there is no fixed number of repeated measurements that separates longitudinal data from time series. Generally speaking, there are more covariates in longitudinal data than in time series, but with big data this distinction has become even less clear.
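To make the working definition of longitudinal data used here more concrete, the following minimal Python sketch builds a toy data set in "long" format (one row per patient per visit) and checks the two properties mentioned above: each patient has at least two measurement times, and the background covariates stay constant within a patient. All variable names (`patient_id`, `visit_month`, `hba1c`, `sex`, `birth_year`), the helper `summarize` and all values are invented for illustration; they are assumptions and do not come from any real patient data or from the review.

```python
from collections import defaultdict

# Toy records in long format: one row per patient per visit.
# Every value below is made up for illustration only.
records = [
    {"patient_id": 1, "sex": "F", "birth_year": 1950, "visit_month": 0, "hba1c": 7.9},
    {"patient_id": 1, "sex": "F", "birth_year": 1950, "visit_month": 6, "hba1c": 7.4},
    {"patient_id": 1, "sex": "F", "birth_year": 1950, "visit_month": 12, "hba1c": 7.1},
    {"patient_id": 2, "sex": "M", "birth_year": 1962, "visit_month": 0, "hba1c": 8.3},
    {"patient_id": 2, "sex": "M", "birth_year": 1962, "visit_month": 12, "hba1c": 8.0},
]


def summarize(rows, covariates):
    """Summarize, per patient, the measurement times and covariate constancy."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row["patient_id"]].append(row)

    summary = {}
    for pid, patient_rows in by_patient.items():
        # Time is treated as an index: we only sort and count the time points.
        times = sorted(r["visit_month"] for r in patient_rows)
        # A covariate is constant if it takes exactly one value within the patient.
        constant = all(
            len({r[name] for r in patient_rows}) == 1 for name in covariates
        )
        summary[pid] = {
            "n_measurements": len(times),
            "times": times,
            "covariates_constant": constant,
        }
    return summary


summary = summarize(records, covariates=("sex", "birth_year"))
for pid, info in summary.items():
    print(pid, info)
```

Note that the two toy patients have different numbers of visits; the sketch does not assume that every patient is measured at the same time points.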

My inspiration for this topic started when I was working as a biostatistician at Auria Clinical Informatics, a unit of Turku University Central Hospital, where patient information stored in the hospital’s database is used in research, development and innovation (R&D) activities. Current Finnish legislation allows the use of identifiable patient data for scientific research without the patient’s consent, but not for development and innovation activities; only anonymized aggregate-level data can be granted for those. Although it is possible to obtain individual-level data for scientific research, it may take months, even years, to obtain such data. When generated correctly, synthetic data can be considered anonymous, so they can be published and distributed freely for R&D activities. This, in turn, could facilitate R&D activities, as data would be available more easily and promptly. And even though synthetic data could not necessarily be used to make inferences in scientific research, they would nevertheless enable, for example, the fitting and testing of different models while waiting for access to the real data.

My current and past studies related to the topic are presented below.

Methods for generating and evaluating synthetic longitudinal data: a systematic review

The scientific article related to this topic was published in (updated when published).

In 2021, when I started my doctoral studies, I wanted to find out what methods were available at the time for generating and evaluating synthetic longitudinal patient data. Together with my supervisors, Joni Virta, Kari Auranen and Arho Virkki, we decided to do a systematic literature review. I had previously worked on a review article as part of the SHARED project and received feedback from the reviewers on how we had arrived at the literature we presented. For this reason, I wanted to do a systematic review, so that I could transparently describe how we had chosen the included literature. In addition, I believed that a systematic review would most likely cover almost all relevant methods.

We started by familiarizing ourselves with how to conduct a systematic literature review and found the PRISMA guidelines. In accordance with the guidelines, we wrote a research protocol and made it publicly available on PROSPERO, an international database of prospectively registered systematic reviews in various fields including, but not limited to, health and social care. Below are links to the protocol and the data collection forms used in the review. Writing the protocol included, e.g., defining the data sources to be used and designing the search algorithm.

After completing the protocol, we started the searches and reviewed the literature. Because of my hobbies, the research was on hiatus for half a year in 2022, and we decided to update the literature search at the end of 2022 in order to include the latest methods in the review. The research also progressed iteratively as we learned new things and saw how the review could be done better. We updated the protocol twice and revised the data collection forms several times so that they asked the questions relevant to our review.

In January 2023, I became a full-time doctoral researcher with the help of a personal grant I received from the Finnish Cultural Foundation (grant 00220801), and the research gained good momentum. We finished reviewing the literature (almost 7000 abstracts and 400 full reports) in March 2023, after which we started collecting data and writing the manuscript. The manuscript is currently under review, and below you can find the amended protocol, the supplemental material of the manuscript and the poster presentation related to the review. The supplemental material also includes the REDCap data collection forms. The current version of the manuscript is available on arXiv.

Supplemental Material of the Systematic Review

PRISMA protocol (amended 2023-03-09)

EXACTUS Seminar Day 2023 Poster