Closes #234 Blog Post: How I Rebuilt a Lost ECG Data Script in R #235

Merged: 15 commits, Oct 31, 2024
File: posts/zzz_DO_NOT_EDIT_how__i__reb.../appendix.R (73 additions, 0 deletions)
suppressMessages(library(dplyr))
# markdown helpers --------------------------------------------------------

markdown_appendix <- function(name, content) {
paste(paste("##", name, "{.appendix}"), " ", content, sep = "\n")
}
markdown_link <- function(text, path) {
paste0("[", text, "](", path, ")")
}



# worker functions --------------------------------------------------------

insert_source <- function(repo_spec, name,
collection = "posts",
branch = "main",
host = "https://github.com",
text = "Source",
file_name) {
path <- paste(
host,
repo_spec,
"tree",
branch,
collection,
name,
file_name,
sep = "/"
)
return(markdown_link(text, path))
}

insert_timestamp <- function(tzone = Sys.timezone()) {
time <- lubridate::now(tzone = tzone)
stamp <- as.character(time, tz = tzone, usetz = TRUE)
return(stamp)
}

insert_lockfile <- function(repo_spec, name,
collection = "posts",
branch = "main",
host = "https://github.com",
text = "Session info") {
# All posts link to the shared session info page, so the arguments above are currently unused.
path <- "https://pharmaverse.github.io/blog/session_info.html"

return(markdown_link(text, path))
}



# top level function ------------------------------------------------------

insert_appendix <- function(repo_spec, name, collection = "posts", file_name) {
appendices <- paste(
markdown_appendix(
name = "Last updated",
content = insert_timestamp()
),
" ",
markdown_appendix(
name = "Details",
content = paste(
insert_source(repo_spec, name, collection, file_name = file_name),
# get renv information,
insert_lockfile(repo_spec, name, collection),
sep = ", "
)
),
sep = "\n"
)
knitr::asis_output(appendices)
}
File: blog post .qmd (205 additions, 0 deletions)
---
title: "How I Rebuilt a Lost ECG Data Script in R"
author:
- name: Vladyslav Shuliar
description: "During my Data Science placement, I faced the challenge of recreating an essential ECG dataset for the {pharmaversesdtm} project after the original R script was lost. I explored the existing data, identified key parameters, and experimented with R packages to replicate the dataset structure and ensure SDTM compliance. Despite challenges with ensuring accurate ECG measurements, I eventually regenerated the dataset, learning valuable lessons in problem-solving and resilience."
# Note that the date below will be auto-updated when the post is merged.
date: "2024-09-30"
# Please do not use any non-default categories.
# You can find the default categories in the repository README.md
categories: [SDTM, Technical]
# Feel free to change the image
image: "pharmaverse.png"

---

<!--------------- typical setup ----------------->

```{r setup, include=FALSE}
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."
# renv::use(lockfile = "renv.lock")
```

<!--------------- post begins here ----------------->

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a new Data Science placement student at Roche UK, I was given an exciting opportunity to sharpen my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I took on a unique and challenging task within {pharmaversesdtm}, which tested both my technical know-how and problem-solving abilities.

## The Challenge: Rewriting a Lost Script

One of the open-source datasets from CDISC, specifically the Electrocardiogram (ECG) data, had been created by a script that was unfortunately lost and couldn’t be recovered. My task was to write a new R script from scratch to regenerate the ECG dataset, closely matching the original in structure and content.

The {pharmaversesdtm} test SDTM datasets are typically either copies of open-source data from the CDISC pilot project or generated by developers for use within the pharmaverse. In this case, the original ECG dataset came from the CDISC pilot, but since that source was no longer available, we had no file to reference. Fortunately, we still had a saved copy of the dataset, which I was able to analyze. By studying its structure and variables, I could better understand its contents and recreate a similar dataset for ongoing use.

## My Approach: Reverse-Engineering the Data

The existing ECG dataset contained over 25,000 entries, and without the original code, I had to manually explore and make sense of the data to understand how it had been generated. Here's how I approached it:

### 1. Data Exploration and Analysis
I started by thoroughly analyzing the available `EG` dataset. My goal was to understand the structure and key variables involved in the original dataset. By digging deep into the data, I gained insights into how it was organized, such as having one row per test per visit. I also examined the characteristics of the tests, including the range of values and variance, to ensure that I could replicate the dataset faithfully. This understanding was crucial for generating a new dataset that closely mirrored the original.
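A minimal sketch of the kind of exploration described above (my reconstruction, not the original analysis code): here `eg_original` is a toy stand-in for the saved copy of the lost `EG` dataset, and the summary pulls out the per-test means, standard deviations, and ranges that can later seed the simulation step.

```r
library(dplyr)

# `eg_original` stands in for the saved copy of the lost EG dataset.
eg_original <- data.frame(
  EGTESTCD = c("HR", "HR", "QT", "QT", "RR", "RR"),
  EGSTRESN = c(70, 74, 451, 458, 544, 536)
)

# Summarise each test's distribution: these statistics inform the
# means and standard deviations used when regenerating the data.
test_stats <- eg_original %>%
  group_by(EGTESTCD) %>%
  summarise(
    mean = mean(EGSTRESN, na.rm = TRUE),
    sd   = sd(EGSTRESN, na.rm = TRUE),
    min  = min(EGSTRESN, na.rm = TRUE),
    max  = max(EGSTRESN, na.rm = TRUE),
    .groups = "drop"
  )
```

On the real dataset the same grouped summary, run per test code and time point, is what surfaces the distribution parameters quoted later in this post.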

### 2. Writing the New R Script
Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content.

## Challenges and Solutions

Working with a dataset of over 25,000 entries brought its own challenges. Making sure the script was efficient and scalable while still producing accurate, high-quality data was a key focus. I used a range of R techniques to streamline the process and make sure the dataset followed the original patterns.

## The Result: A Recreated ECG Dataset

After days of analysis, coding, and refinement, I successfully wrote an R script that could regenerate the lost ECG dataset.

## Main Parts of the Code

In this section, I’ll walk through the most important pieces of the R script I wrote to recreate the `EG` dataset. The code involved generating a set of dummy patient data, complete with visit information and random test results, based on existing patterns from the original dataset.

### 1. Loading Libraries and Data

To begin, I load the necessary libraries and read in the vital signs (`vs`) dataset, which provides key clinical information about the participants. This data complements the ECG measurements and allows for a more comprehensive analysis of each subject's health status during the study. Setting a seed for the random data generation makes the process reproducible, so others can verify my results and future analyses stay consistent.

```r
library(dplyr)
library(metatools)

data("vs")
set.seed(123)
```

### 2. Extracting Unique Date/Time of Measurements

Next, I extract the unique combination of subject IDs, visit names, and visit dates from the `vs` dataset. This data will be used later to match the generated ECG data to the correct visit and time points.

```r
egdtc <- vs %>%
select(USUBJID, VISIT, VSDTC) %>%
distinct() %>%
rename(EGDTC = VSDTC)
```

The output:

```
USUBJID VISIT EGDTC
<chr> <chr> <chr>
1 01-701-1015 SCREENING 1 2013-12-26
2 01-701-1015 SCREENING 2 2013-12-31
3 01-701-1015 BASELINE 2014-01-02
4 01-701-1015 AMBUL ECG PLACEMENT 2014-01-14
5 01-701-1015 WEEK 2 2014-01-16
6 01-701-1015 WEEK 4 2014-01-30
```

### 3. Generating a Grid of Patient Data

Here, I create a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits. These combinations represent different test results collected across multiple visits.

Each of these test codes corresponds to specific ECG measurements:
- `QT` refers to the QT interval, which measures the time between the start of the Q wave and the end of the T wave in the heart's electrical cycle.
- `HR` represents heart rate.
- `RR` is the interval between successive R waves, measuring the time between heartbeats.
- `ECGINT` covers other general ECG interpretations.

As I analyzed the original ECG dataset, I learned about these test codes and their relevance to the clinical data. This analysis helped me understand how different visits and time points corresponded to various test results, and why it was important to regenerate all these combinations for accuracy.

```r
eg <- expand.grid(
USUBJID = unique(vs$USUBJID),
EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
EGTPT = c("AFTER LYING DOWN FOR 5 MINUTES", "AFTER STANDING FOR 1 MINUTE", "AFTER STANDING FOR 3 MINUTES"),
VISIT = c(
"SCREENING 1",
"SCREENING 2",
"BASELINE",
"AMBUL ECG PLACEMENT",
"WEEK 2",
"WEEK 4",
"AMBUL ECG REMOVAL",
"WEEK 6",
"WEEK 8",
"WEEK 12",
"WEEK 16",
"WEEK 20",
"WEEK 24",
"WEEK 26",
"RETRIEVAL"
), stringsAsFactors = FALSE
)
```

The output:

```
USUBJID EGTESTCD EGTPT VISIT
1 01-701-1015 QT AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
2 01-701-1015 HR AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
3 01-701-1015 RR AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
4 01-701-1015 ECGINT AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
5 01-701-1015 QT AFTER STANDING FOR 1 MINUTE SCREENING 1
6 01-701-1015 HR AFTER STANDING FOR 1 MINUTE SCREENING 1
```
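As noted in step 2, the extracted date/times are later matched back onto this grid. The join below is my illustration of that matching (variable names follow the post), using toy inputs so the example is self-contained.

```r
library(dplyr)

# Toy version of the date/time lookup extracted from `vs` in step 2.
egdtc <- data.frame(
  USUBJID = "01-701-1015",
  VISIT   = c("SCREENING 1", "BASELINE"),
  EGDTC   = c("2013-12-26", "2014-01-02")
)

# Toy version of the expanded grid from step 3.
eg <- expand.grid(
  USUBJID  = "01-701-1015",
  EGTESTCD = c("QT", "HR"),
  VISIT    = c("SCREENING 1", "BASELINE"),
  stringsAsFactors = FALSE
)

# Every generated row picks up the date recorded for that subject and visit.
eg_dated <- left_join(eg, egdtc, by = c("USUBJID", "VISIT"))
```

A `left_join()` keeps every row of the grid, so tests at visits with no recorded date would simply get a missing `EGDTC` rather than being dropped.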

### 4. Generating Random Test Results

For each combination in the grid, I generate random test results using a normal distribution to simulate realistic values for each test code. To determine the means and standard deviations, I used the original EG dataset as a reference. By analyzing the range and distribution of values in the original dataset, I could extract realistic means and standard deviations for each ECG test (e.g., QT, HR, RR, ECGINT). This approach allowed me to ensure that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.

```r
EGSTRESN = case_when(
EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
)
```
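For context, the `case_when()` fragment above lives inside a `mutate()` over the expanded grid. The sketch below shows that shape with only two test codes and simplified means/SDs (my illustration; the full script uses the values listed above).

```r
library(dplyr)

set.seed(123)

# Simplified grid: two subjects, two test codes, three elapsed times.
eg <- expand.grid(
  USUBJID  = c("01-701-1015", "01-701-1023"),
  EGTESTCD = c("HR", "QT"),
  EGELTM   = c("PT1M", "PT3M", "PT5M"),
  stringsAsFactors = FALSE
)

# Draw one value per row from the distribution matching its test code.
eg <- eg %>%
  mutate(
    EGSTRESN = case_when(
      EGTESTCD == "HR" ~ floor(rnorm(n(), 74, 8)),
      EGTESTCD == "QT" ~ floor(rnorm(n(), 455, 60))
    )
  )
```

Because `case_when()` evaluates each right-hand side over all rows and then selects per row, `rnorm(n(), ...)` draws a full vector per branch; only the value matching each row's condition is kept.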

### 5. Finalizing the Dataset

Finally, I add variable labels to the data frame using the `add_labels()` function from the {metatools} package. Descriptive labels for each column make the dataset more intuitive to explore and keep its meaning clear in subsequent analyses.

```r
eg <- add_labels(
  eg,
STUDYID = "Study Identifier",
DOMAIN = "Domain Abbreviation",
USUBJID = "Unique Subject Identifier",
EGSEQ = "Sequence Number",
EGTESTCD = "ECG Test Short Name",
EGTEST = "ECG Test Name",
EGORRES = "Result or Finding in Original Units",
EGORRESU = "Original Units",
EGSTRESC = "Character Result/Finding in Std Format",
EGSTRESN = "Numeric Result/Finding in Standard Units",
EGSTRESU = "Standard Units",
EGSTAT = "Completion Status",
EGLOC = "Location of ECG Measurement",
EGBLFL = "Baseline Flag",
VISITNUM = "Visit Number",
VISIT = "Visit Name",
VISITDY = "Planned Study Day of Visit",
EGDTC = "Date/Time of Measurements",
EGDY = "Study Day of ECG",
EGTPT = "Planned Time Point Name",
EGTPTNUM = "Planned Time Point Number",
EGELTM = "Planned Elapsed Time from Time Point Ref",
EGTPTREF = "Time Point Reference"
)
```

This project not only helped me improve my R programming skills but also gave me valuable experience in reverse-engineering data, exploring large healthcare datasets, and solving practical problems in the open-source world. This structured approach allowed me to successfully recreate the `EG` dataset synthetically, which will be available in the next release of {pharmaversesdtm}.

<!--------------- appendices go here ----------------->

```{r, echo=FALSE}
source("appendix.R")
insert_appendix(
repo_spec = "pharmaverse/blog",
name = long_slug,
# file_name should be the name of your file
file_name = list.files() %>% stringr::str_subset(".qmd") %>% first()
)
```