In a real-world scenario, you would connect to a live NHS database to get the most recent operational data. However, for this training, we will generate a synthetic dataset. This allows us to create a reproducible example that mimics the key features of real data without handling sensitive information.
Our goal is to create a daily time-series dataset with variables that could plausibly influence the OPEL-4 status.
1. The Structure of Our Synthetic Data
We will create a data frame with the following columns:
date: The date of the observation.
daily_admissions: The number of new hospital admissions on that day.
staff_absences: The number of staff members absent on that day.
winter_pressure: A seasonal factor that increases during winter months.
opel_level: The simulated OPEL level for that day (0, 1, 2, 3, or 4).
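The column layout above can be checked programmatically once a data frame exists. The following is a minimal sketch in base R; the helper name `check_opel_columns` and the two-row toy data frame are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: verify that a data frame has the columns listed above
# and that each one has a plausible type.
expected_cols <- c("date", "daily_admissions", "staff_absences",
                   "winter_pressure", "opel_level")

check_opel_columns <- function(df) {
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  stopifnot(
    inherits(df$date, "Date"),
    is.numeric(df$daily_admissions),
    is.numeric(df$staff_absences),
    is.numeric(df$winter_pressure),
    all(df$opel_level %in% 0:4)
  )
  invisible(TRUE)
}

# Toy example with made-up values (not real NHS data)
toy <- data.frame(
  date = as.Date(c("2023-01-01", "2023-01-02")),
  daily_admissions = c(62L, 71L),
  staff_absences = c(27L, 30L),
  winter_pressure = c(20.9, 21.1),
  opel_level = c(3, 4)
)
check_opel_columns(toy)
```

The same check could be run on the generated data, or on the CSV after reading it back in, to catch a renamed or missing column early.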
2. R Code for Data Generation
Copy and paste the following code into your RStudio console to generate the data. We will then save it as a CSV file to be used in the later modules.
Note on Data Realism: For the purpose of this training, we have adjusted the parameters of the synthetic data generation to ensure a more frequent occurrence of OPEL-4 events and a stronger relationship between the predictors and the outcome. This is done to create a clearer signal for the model to learn from, resulting in more discernible predictions and narrower credible intervals in the final plots. In a real-world scenario, the frequency of such events might be lower, and the relationships more subtle.
# Load necessary libraries
library(tidyverse)
library(lubridate)

# Set a seed for reproducibility
set.seed(123)

# Define the time period for our data (2.5 years: 1 January 2023 to 30 June 2025)
dates <- seq(ymd("2023-01-01"), ymd("2025-06-30"), by = "day")

# Generate synthetic data
n_days <- length(dates)

# 1. Create a seasonal winter pressure effect (peaks around day 45, mid-February)
winter_pressure <- cos(2 * pi * (yday(dates) - 45) / 365) * 15 + 10
winter_pressure[winter_pressure < 0] <- 0 # Pressure can't be negative

# 2. Simulate daily admissions and staff absences
daily_admissions <- rpois(n_days, lambda = 50 + winter_pressure)
staff_absences <- rpois(n_days, lambda = 20 + winter_pressure / 2)

# 3. Combine into a tibble
synthetic_data <- tibble(
  date = dates,
  daily_admissions = daily_admissions,
  staff_absences = staff_absences,
  winter_pressure = winter_pressure
)

# 4. Model the probability of being in OPEL-4
# We use a logistic function to make OPEL-4 more likely when pressures are high
prob_opel_4 <- plogis(-4.5 + 0.05 * daily_admissions + 0.08 * staff_absences)

# 5. Assign an OPEL level based on probabilities
synthetic_data <- synthetic_data %>%
  mutate(
    opel_level = case_when(
      prob_opel_4 > 0.6 ~ 4, # High probability -> OPEL 4
      prob_opel_4 > 0.3 ~ 3,
      prob_opel_4 > 0.1 ~ 2,
      TRUE ~ sample(0:1, n(), replace = TRUE, prob = c(0.6, 0.4))
    )
  )

# 6. Save the data to a file
write_csv(synthetic_data, "nhs_opel_data.csv")

# 7. Display the first few rows of the data
head(synthetic_data)
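To see how the logistic step and the thresholds in the code above interact, the sketch below evaluates the same formula at the expected (mean) admissions and absences for a low-pressure day and a peak-winter day. It uses only base R and the coefficients from the generation code; the helper name `expected_prob` is hypothetical, and individual simulated days will scatter around these values because of Poisson noise.

```r
# Hypothetical helper: OPEL-4 probability at the *expected* Poisson means
# for a given winter_pressure value (coefficients copied from the code above).
expected_prob <- function(winter_pressure) {
  admissions <- 50 + winter_pressure      # mean of daily_admissions
  absences   <- 20 + winter_pressure / 2  # mean of staff_absences
  plogis(-4.5 + 0.05 * admissions + 0.08 * absences)
}

# winter_pressure ranges from 0 (summer, after clipping) to 25 (mid-February peak)
p_summer <- expected_prob(0)
p_winter <- expected_prob(25)

round(c(summer = p_summer, winter = p_winter), 2)

# Map each probability onto the thresholds used in case_when()
cut(c(p_summer, p_winter),
    breaks = c(0, 0.1, 0.3, 0.6, 1),
    labels = c("0 or 1", "2", "3", "4"))
```

Under these assumptions, a typical summer day sits at roughly 0.40 (level 3) and a peak-winter day at roughly 0.86 (level 4). Levels 0 to 2 therefore appear only on days when the Poisson draws fall well below their means, which is consistent with the note that high OPEL levels were made deliberately frequent in this dataset.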