Module 2: Sourcing and Preparing the Data

In a real-world scenario, you would connect to a live NHS database to get the most recent operational data. However, for this training, we will generate a synthetic dataset. This allows us to create a reproducible example that mimics the key features of real data without handling sensitive information.

Our goal is to create a daily time-series dataset with variables that could plausibly influence the OPEL-4 status.

1. The Structure of Our Synthetic Data

We will create a data frame with the following columns:

  • date: The date of the observation.
  • daily_admissions: The number of new hospital admissions on that day.
  • staff_absences: The number of staff members absent on that day.
  • winter_pressure: A seasonal factor that increases during winter months.
  • opel_level: The official OPEL level for that day (0, 1, 2, 3, or 4).

2. R Code for Data Generation

Copy and paste the following code into your RStudio console to generate the data. We will then save it as a CSV file to be used in the later modules.

Note on Data Realism: For the purpose of this training, we have adjusted the parameters of the synthetic data generation to ensure a more frequent occurrence of OPEL-4 events and a stronger relationship between the predictors and the outcome. This is done to create a clearer signal for the model to learn from, resulting in more discernible predictions and narrower credible intervals in the final plots. In a real-world scenario, the frequency of such events might be lower, and the relationships more subtle.

# Load necessary libraries
library(tidyverse)
library(lubridate)

# Set a seed for reproducibility
set.seed(123)

# Define the time period for our data (2.5 years: 1 January 2023 to 30 June 2025)
dates <- seq(ymd("2023-01-01"), ymd("2025-06-30"), by = "day")

# Generate synthetic data
n_days <- length(dates)

# 1. Create a seasonal winter pressure effect
winter_pressure <- cos(2 * pi * (yday(dates) - 45) / 365) * 15 + 10
winter_pressure[winter_pressure < 0] <- 0 # Pressure can't be negative

# 2. Simulate daily admissions and staff absences
daily_admissions <- rpois(n_days, lambda = 50 + winter_pressure)
staff_absences <- rpois(n_days, lambda = 20 + winter_pressure / 2)

# 3. Combine into a tibble
synthetic_data <- tibble(
  date = dates,
  daily_admissions = daily_admissions,
  staff_absences = staff_absences,
  winter_pressure = winter_pressure
)

# 4. Model the probability of being in OPEL-4
# We use a logistic function to make OPEL-4 more likely when pressures are high
prob_opel_4 <- plogis(-4.5 + 0.05 * daily_admissions + 0.08 * staff_absences)

# 5. Assign an OPEL level based on probabilities
synthetic_data <- synthetic_data %>%
  mutate(
    opel_level = case_when(
      prob_opel_4 > 0.6 ~ 4, # High probability -> OPEL 4
      prob_opel_4 > 0.3 ~ 3,
      prob_opel_4 > 0.1 ~ 2,
      TRUE ~ sample(0:1, n(), replace = TRUE, prob = c(0.6, 0.4))
    )
  )

# 6. Save the data to a file
write_csv(synthetic_data, "nhs_opel_data.csv")

# 7. Display the first few rows of the data
head(synthetic_data)
# A tibble: 6 × 5
  date       daily_admissions staff_absences winter_pressure opel_level
  <date>                <int>          <int>           <dbl>      <dbl>
1 2023-01-01               66             28            20.9          4
2 2023-01-02               81             28            21.1          4
3 2023-01-03               56             33            21.2          4
4 2023-01-04               72             30            21.4          4
5 2023-01-05               86             33            21.6          4
6 2023-01-06               75             25            21.7          4

3. Understanding the Code

  • set.seed(123): This ensures that every time you run the code, you get the exact same “random” data. This is crucial for reproducibility.
  • winter_pressure: We use a cosine wave to simulate the cyclical nature of winter pressures on the NHS, peaking in mid-February.
  • rpois(...): We use the Poisson distribution to simulate “count” data like admissions and absences, as these are whole numbers.
  • plogis(...): This is the logistic function. We use it to convert our linear combination of predictors into a probability (a value between 0 and 1).
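  • case_when(...): This assigns the OPEL level by thresholding the calculated probability: above 0.6 gives level 4, above 0.3 gives level 3, and above 0.1 gives level 2. There is no separate random draw for OPEL-4, so on these days the level is fully determined by that day's admissions and absences.
  • sample(...): For the remaining low-probability days, we randomly assign level 0 or 1, with probabilities 0.6 and 0.4 respectively.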
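
To see how these pieces fit together, the following minimal sketch traces the formulas from the generation code at two points in the year: the winter peak (day 45) and the summer trough (around day 227). It uses the Poisson means as stand-ins for the simulated counts, so it shows a typical day rather than any actual simulated row. The helper names pressure_on, opel_4_prob and opel_band are illustrative assumptions and do not appear in the module's code.

```r
# Seasonal factor from the generation code, clipped at zero
# (pressure_on is a hypothetical helper name)
pressure_on <- function(day_of_year) {
  pmax(cos(2 * pi * (day_of_year - 45) / 365) * 15 + 10, 0)
}

# Same logistic formula and coefficients as the generation code
opel_4_prob <- function(admissions, absences) {
  plogis(-4.5 + 0.05 * admissions + 0.08 * absences)
}

# Same thresholds as the case_when() call; below 0.1 the
# generation code picks level 0 or 1 at random
opel_band <- function(p) {
  ifelse(p > 0.6, "4",
    ifelse(p > 0.3, "3",
      ifelse(p > 0.1, "2", "0 or 1")))
}

days <- c(winter_peak = 45, summer_trough = 227)
pressure <- pressure_on(days)

# Poisson means (lambda), used here instead of random draws
expected_admissions <- 50 + pressure
expected_absences <- 20 + pressure / 2

prob <- opel_4_prob(expected_admissions, expected_absences)

walkthrough <- data.frame(
  day_of_year = days,
  pressure = pressure,
  expected_admissions = expected_admissions,
  expected_absences = expected_absences,
  prob = round(prob, 3),
  band = opel_band(prob)
)
print(walkthrough)
```

At these expected values the probability stays above 0.3 even at the summer trough, so a typical day lands in level 3 or 4. This is consistent with the Note on Data Realism: the parameters are deliberately tuned so that high OPEL levels are common.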

After running this code, you will have a file named nhs_opel_data.csv in your project directory. We will use this file as the input for our model.
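
Before moving on, it can be worth confirming that the file has the structure described in Section 1. The sketch below is one possible check, written in base R so it runs without extra packages. The function name check_opel_data is a hypothetical helper, not part of the module, and the small example data frame simply copies the first four rows shown in the output above.

```r
# check_opel_data is a hypothetical helper for illustration only
check_opel_data <- function(df) {
  expected_cols <- c("date", "daily_admissions", "staff_absences",
                     "winter_pressure", "opel_level")

  # Fail early with a readable message if a column is missing
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }

  # Basic range checks implied by how the data were generated
  stopifnot(
    all(stats::complete.cases(df[expected_cols])),
    all(df$daily_admissions >= 0),
    all(df$staff_absences >= 0),
    all(df$winter_pressure >= 0),
    all(df$opel_level %in% 0:4)
  )

  # Share of days at each OPEL level, including levels with no days
  prop.table(table(factor(df$opel_level, levels = 0:4)))
}

# First four rows from the head() output above
example <- data.frame(
  date = as.Date("2023-01-01") + 0:3,
  daily_admissions = c(66L, 81L, 56L, 72L),
  staff_absences = c(28L, 28L, 33L, 30L),
  winter_pressure = c(20.9, 21.1, 21.2, 21.4),
  opel_level = c(4, 4, 4, 4)
)

level_shares <- check_opel_data(example)
print(level_shares)
```

To run the same check on the saved file, you could call check_opel_data(read.csv("nhs_opel_data.csv")). The returned shares give a quick view of how often each OPEL level occurs in the synthetic data.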


With our dataset created, we can now move on to the next module.