Pull request #235 (merged Oct 31, 2024, 15 commits): Closes #234, Blog Post: How I Rebuilt a Lost ECG Data Script in R
`posts/zzz_DO_NOT_EDIT_how__i__reb.../appendix.R` (new file, 73 lines)
suppressMessages(library(dplyr))
# markdown helpers --------------------------------------------------------

markdown_appendix <- function(name, content) {
  paste(paste("##", name, "{.appendix}"), " ", content, sep = "\n")
}

markdown_link <- function(text, path) {
  paste0("[", text, "](", path, ")")
}


# worker functions --------------------------------------------------------

insert_source <- function(repo_spec, name,
                          collection = "posts",
                          branch = "main",
                          host = "https://github.com",
                          text = "Source",
                          file_name) {
  path <- paste(
    host,
    repo_spec,
    "tree",
    branch,
    collection,
    name,
    file_name,
    sep = "/"
  )
  return(markdown_link(text, path))
}

insert_timestamp <- function(tzone = Sys.timezone()) {
  time <- lubridate::now(tzone = tzone)
  stamp <- as.character(time, tz = tzone, usetz = TRUE)
  return(stamp)
}

insert_lockfile <- function(repo_spec, name,
                            collection = "posts",
                            branch = "main",
                            host = "https://github.com",
                            text = "Session info") {
  # All posts link to the shared session info page
  path <- "https://pharmaverse.github.io/blog/session_info.html"

  return(markdown_link(text, path))
}


# top level function ------------------------------------------------------

insert_appendix <- function(repo_spec, name, collection = "posts", file_name) {
  appendices <- paste(
    markdown_appendix(
      name = "Last updated",
      content = insert_timestamp()
    ),
    " ",
    markdown_appendix(
      name = "Details",
      content = paste(
        insert_source(repo_spec, name, collection, file_name = file_name),
        # renv/session information
        insert_lockfile(repo_spec, name, collection),
        sep = ", "
      )
    ),
    sep = "\n"
  )
  knitr::asis_output(appendices)
}
---
title: "How I Rebuilt a Lost ECG Data Script in R"
author:
- name: Vladyslav Shuliar
description: "A Data Science placement student shares their experience of rewriting a lost R script to regenerate an essential ECG dataset for the open-source *pharmaversesdtm* project. The post covers their approach to data exploration, identifying key parameters, and overcoming challenges in recreating the dataset from scratch."
# Note that the date below will be auto-updated when the post is merged.
date: "2024-09-30"
# Please do not use any non-default categories.
# You can find the default categories in the repository README.md
categories: [SDTM, Community, Technical]
# Feel free to change the image
image: "pharmaverse.png"

---

<!--------------- typical setup ----------------->

```{r setup, include=FALSE}
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."
# renv::use(lockfile = "renv.lock")
```

<!--------------- post begins here ----------------->

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a new Data Science placement student, I was given an exciting opportunity to sharpen my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I took on a unique and challenging task for *pharmaversesdtm*, which tested both my technical know-how and problem-solving abilities.

## The Challenge: Rewriting a Lost Script

One of the open-source datasets from CDISC, specifically the Electrocardiogram (ECG) data, had been created by a script that had unfortunately been lost and couldn’t be recovered. This was a major issue because the program used to retrieve and process the ECG data was essential for future work. My task was to write a new R script from scratch to regenerate the ECG dataset—one that closely matched the original in both structure and content.

## My Approach: Reverse-Engineering the Data

The existing ECG dataset contained over 25,000 entries, and without the original code, I had to manually explore and make sense of the data to understand how it had been generated. Here's how I approached it:

### 1. Data Exploration and Analysis
I started by thoroughly analysing the available ECG dataset. My goal was to identify patterns, structures, and key variables that were likely involved in creating the original dataset. By digging deep into the data, I could understand how it was organised and what factors were critical to replicate.
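A first pass at that exploration can be sketched with a few `dplyr` one-liners. This is illustrative only: it assumes the original CDISC ECG data has already been read into a data frame named `eg`.

```r
library(dplyr)

# Assumes the original CDISC ECG data is available as `eg`
glimpse(eg)              # variables and their types
count(eg, EGTESTCD)      # which tests occur, and how often
count(eg, EGTPT, EGELTM) # planned time points and elapsed-time codes
count(eg, VISIT)         # the visit schedule
```

Frequency tables like these reveal the fixed vocabulary of the dataset (test codes, time points, visits) that the generation script has to reproduce.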

### 2. Identifying the Parameters
As I explored the dataset, I focused on identifying which features were crucial for recreating the lost data. By paying close attention to trends and relationships between different variables, I could form a rough idea of how the original script might have worked.
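For the numeric results, the concrete parameters to recover are the per-test, per-time-point means and spreads. A minimal sketch of that estimation, again assuming the original data is loaded as `eg`:

```r
library(dplyr)

# Estimate the distribution of each test at each elapsed time
eg %>%
  filter(!is.na(EGSTRESN)) %>%
  group_by(EGTESTCD, EGELTM) %>%
  summarise(
    mean = mean(EGSTRESN),
    sd = sd(EGSTRESN),
    .groups = "drop"
  )
```

Group summaries of this kind are what supply the means used in the random-generation step later in the post.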

### 3. Writing the New R Script
Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content.

## Challenges and Solutions

Working with a dataset of over 25,000 entries brought its own challenges. Making sure the script was efficient and scalable while still producing accurate, high-quality data was a key focus. I used a range of R techniques to streamline the process and make sure the dataset followed the original patterns.

## The Result: A Recreated ECG Dataset

After days of analysis, coding, and refinement, I successfully wrote an R script that could regenerate the lost ECG dataset. This project not only helped me improve my R programming skills but also gave me valuable experience in reverse-engineering data, exploring large healthcare datasets, and solving practical problems in the open-source world.


## Main Parts of the Code

In this section, I’ll walk through the most important pieces of the R script I wrote to recreate the ECG dataset. The code involved generating a set of dummy patient data, complete with visit information and random test results, based on existing patterns from the original dataset.

### 1. Loading Libraries and Data

To begin, I load the necessary libraries and read in the vital signs (`vs`) dataset. The seed is set to ensure that the random data generation is reproducible.

```r
library(dplyr)
library(metatools)
library(pharmaversesdtm) # assumed source of the `vs` dataset

data("vs")
set.seed(123)
```

### 2. Extracting Unique Date/Time of Measurements

Next, I extract the unique combination of subject IDs, visit names, and visit dates from the `vs` dataset. This data will be used later to match the generated ECG data to the correct visit and time points.

```r
egdtc <- vs %>%
  select(USUBJID, VISIT, VSDTC) %>%
  distinct() %>%
  rename(EGDTC = VSDTC)
```

### 3. Generating a Grid of Patient Data

Here, I create a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits. These combinations represent different test results across multiple visits.

```r
eg <- expand.grid(
  USUBJID = unique(vs$USUBJID),
  EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
  EGTPT = c(
    "AFTER LYING DOWN FOR 5 MINUTES",
    "AFTER STANDING FOR 1 MINUTE",
    "AFTER STANDING FOR 3 MINUTES"
  ),
  VISIT = c(
    "SCREENING 1",
    "SCREENING 2",
    "BASELINE",
    "AMBUL ECG PLACEMENT",
    "WEEK 2",
    "WEEK 4",
    "AMBUL ECG REMOVAL",
    "WEEK 6",
    "WEEK 8",
    "WEEK 12",
    "WEEK 16",
    "WEEK 20",
    "WEEK 24",
    "WEEK 26",
    "RETRIEVAL"
  ), stringsAsFactors = FALSE
)
```

### 4. Generating Random Test Results

For each combination in the grid, I generate random test results using a normal distribution to simulate realistic values for each test code.

```r
# Excerpt: EGSTRESN is assigned inside mutate() on the generated grid `eg`;
# EGELTM (e.g. "PT5M") is derived from EGTPT earlier in the script
eg <- eg %>%
  mutate(
    EGSTRESN = case_when(
      EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
      EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
      EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
      EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
      EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
      EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
      EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
      EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
      EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
    )
  )
```
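With results generated for every grid row, the `egdtc` lookup built in step 2 can be joined back so each record carries its measurement date. A one-line sketch, assuming `eg` and `egdtc` as constructed above:

```r
# Attach the visit date/time from vs to each generated ECG record
eg <- eg %>%
  left_join(egdtc, by = c("USUBJID", "VISIT"))
```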

### 5. Finalizing the Dataset

Finally, I add labels to the data frame for easier analysis and future use.

```r
# add_labels() from {metatools}; the data frame is its first argument
eg <- add_labels(
  eg,
  STUDYID = "Study Identifier",
  DOMAIN = "Domain Abbreviation",
  USUBJID = "Unique Subject Identifier",
  EGSEQ = "Sequence Number",
  EGTESTCD = "ECG Test Short Name",
  EGTEST = "ECG Test Name",
  EGORRES = "Result or Finding in Original Units",
  EGORRESU = "Original Units",
  EGELTM = "Planned Elapsed Time from Time Point Ref",
  EGSTRESC = "Character Result/Finding in Std Format",
  EGSTRESN = "Numeric Result/Finding in Standard Units",
  EGSTRESU = "Standard Units",
  EGSTAT = "Completion Status",
  EGLOC = "Location of Vital Signs Measurement",
  EGBLFL = "Baseline Flag",
  VISITNUM = "Visit Number",
  VISIT = "Visit Name",
  VISITDY = "Planned Study Day of Visit",
  EGDTC = "Date/Time of Measurements",
  EGDY = "Study Day of Vital Signs",
  EGTPT = "Planned Time Point Name",
  EGTPTNUM = "Time Point Number",
  EGTPTREF = "Time Point Reference"
)
```

This structured approach allowed me to successfully recreate the lost ECG dataset, providing a solid foundation for future analysis and research.

<!--------------- appendices go here ----------------->

```{r, echo=FALSE}
source("appendix.R")
insert_appendix(
repo_spec = "pharmaverse/blog",
name = long_slug,
# file_name should be the name of your file
file_name = list.files() %>% stringr::str_subset(".qmd") %>% first()
)
```