
Commit

#234 New version of the post
Vladyslav committed Oct 15, 2024
1 parent 5f03106 commit 59cafb9
Showing 2 changed files with 67 additions and 116 deletions.
1 change: 1 addition & 0 deletions inst/WORDLIST.txt
```diff
@@ -1163,3 +1163,4 @@ Shuliar
 stringsAsFactors
 VISITNUM
 Vladyslav
+dataset's
```
---
title: "How I Rebuilt a Lost ECG Data Script in R"
author:
- name: Vladyslav Shuliar
description: "During my Data Science placement, I faced the challenge of recreating an ECG dataset for the {pharmaversesdtm} project after the original R script was lost. I explored the existing data, identified key parameters, and experimented with R packages to replicate the dataset structure and ensure SDTM compliance. Despite challenges with ensuring accurate ECG measurements, I eventually regenerated the dataset, learning valuable lessons in problem-solving and resilience."
# Note that the date below will be auto-updated when the post is merged.
date: "2024-09-30"
# Please do not use any non-default categories.
# You can find the default categories in the repository README.md
categories: [SDTM, Technical]
# Feel free to change the image
image: "pharmaverse.PNG"
---

<!--------------- typical setup ----------------->
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a Data Science placement student at Roche UK, I was given an exciting opportunity to enhance my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I undertook a unique and challenging task within the {pharmaversesdtm} project that tested both my technical expertise and problem-solving abilities.

The project involved recreating the `eg` domain (Electrocardiogram data) from the SDTM datasets used within {pharmaversesdtm}. The original dataset had been sourced from the CDISC pilot project, but since that source was no longer available, we had no direct reference. Fortunately, a saved copy of the dataset still existed, allowing me to analyze it and attempt to reproduce it as closely as possible.

## How I Solved the Problem

### 1. Explored and Analyzed the Data

The first step was to thoroughly explore the existing ECG dataset of over 25,000 entries. I needed to understand the structure and key variables that defined the dataset, such as the "one row for each patient's test during each visit" format. By analyzing these elements, I was able to gain a clear picture of how the dataset was organized. I also examined the range of values, variance, and other characteristics of the tests to ensure that my recreated version would align with the original dataset's structure and statistical properties.
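The exploration code itself isn't shown in the post, but a summary along these lines — a sketch, not the original analysis — recovers the kind of per-test distributions that later parameterize the simulation:

```r
library(dplyr)
library(pharmaversesdtm)

data("eg")

# Group the saved ECG data by test and time point and summarise the
# numeric results; means and standard deviations recovered this way
# are the kind of parameters fed into the random generation below.
eg %>%
  filter(EGTESTCD %in% c("QT", "HR", "RR")) %>%
  group_by(EGTESTCD, EGTPT) %>%
  summarise(
    n    = n(),
    mean = mean(EGSTRESN, na.rm = TRUE),
    sd   = sd(EGSTRESN, na.rm = TRUE),
    .groups = "drop"
  )
```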

To provide a clearer understanding of how the data is structured, let’s take a quick look at the information collected during a patient’s visit. Below is an example of data for patient `01-701-1015` during their `WEEK 2` visit:

```{r, eval = T, message = F, echo = F}
library(dplyr)
library(pharmaversesdtm)
eg %>%
select(USUBJID, EGTEST, VISIT, EGDTC, EGTPT, EGSTRESC) %>%
filter(USUBJID == "01-701-1015" & VISIT == "WEEK 2")
```

In this example, the `EGTPT` column indicates the timing of the test, while `EGTEST` specifies the type of test conducted. The numeric results are recorded in `EGSTRESN`, with the corresponding categorical values in `EGSTRESC`.
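One detail worth spelling out (my illustration, not code from the script): for the quantitative tests the character result is simply the numeric result rendered as text, while qualitative tests such as the overall interpretation carry only a character finding. A hypothetical sketch of that pairing:

```r
library(dplyr)

# Toy rows standing in for the real dataset; the values are made up.
results <- tibble(
  EGTESTCD = c("HR", "QT", "ECGINT"),
  EGSTRESN = c(63, 449, NA)
)

# Quantitative tests: EGSTRESC mirrors EGSTRESN as text.
# Qualitative tests (no numeric result): EGSTRESC holds the finding.
results <- results %>%
  mutate(
    EGSTRESC = if_else(is.na(EGSTRESN), "ABNORMAL", as.character(EGSTRESN))
  )
results
```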

### 2. Wrote the New R Script

Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content. To give you, my reader, a better understanding of the solution, I'll walk you through the main parts of the script.

#### Loading Libraries and Data

To begin, I loaded the necessary libraries and read in the vital signs (`vs`) dataset, which was essential for providing key clinical information about the participants. This data complemented the ECG measurements and allowed for a more comprehensive analysis of each subject's health status during the study. By setting a seed for the random number generation, I ensured that the process was reproducible, allowing others to verify my results and maintain consistency in future analyses. Additionally, the `metatools` package was loaded to facilitate adding labels to the variables later, which enhanced the readability of the dataset.

```{r, eval = T, message = F}
library(dplyr)
library(metatools)
library(pharmaversesdtm)
data("vs")
set.seed(123)
```

#### Extracting Unique Date/Time of Measurements

Next, I extracted the unique combinations of subject IDs, visit names, and visit dates from the `vs` dataset.

```{r, eval = T}
egdtc <- vs %>%
select(USUBJID, VISIT, VSDTC) %>%
distinct() %>%
rename(EGDTC = VSDTC)
head(egdtc)
```

This data was used later to match the generated ECG data to the correct visit and time points.
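The merge step itself isn't reproduced in the post, but the natural way to attach such a lookup is a keyed join — a sketch, with made-up subject IDs and dates standing in for the real objects:

```r
library(dplyr)

# Minimal stand-ins for the generated grid and the date lookup;
# subject IDs and dates here are illustrative only.
eg_grid <- tibble(
  USUBJID  = c("01-701-1015", "01-701-1015"),
  VISIT    = c("WEEK 2", "WEEK 4"),
  EGTESTCD = c("HR", "HR")
)
egdtc <- tibble(
  USUBJID = "01-701-1015",
  VISIT   = c("WEEK 2", "WEEK 4"),
  EGDTC   = c("2013-02-07", "2013-02-21")
)

# Keyed join: each generated record inherits the date of the
# matching subject/visit from the vital-signs-derived lookup.
eg_dated <- left_join(eg_grid, egdtc, by = c("USUBJID", "VISIT"))
```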

#### Generating a Grid of Patient Data

Here, I created a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits. These combinations represented different test results collected across multiple visits.

```{r, eval = T}
eg <- expand.grid(
USUBJID = unique(vs$USUBJID),
EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
EGTPT = c(
"AFTER LYING DOWN FOR 5 MINUTES",
"AFTER STANDING FOR 1 MINUTE",
"AFTER STANDING FOR 3 MINUTES"
),
VISIT = c(
"SCREENING 1",
"SCREENING 2",
    # ... (further visit names not shown in the diff view)
), stringsAsFactors = FALSE
)
head(eg)
```

Each of these test codes corresponds to a specific ECG measurement: `QT` refers to the QT interval, `HR` represents heart rate, `RR` is the interval between successive R waves, and `ECGINT` covers the overall ECG interpretation.
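One caveat worth noting (my observation, not the post's): heart rate and the RR interval are physiologically linked — HR in beats per minute is roughly 60000 divided by RR in milliseconds — but since each test is simulated independently here, the values need not satisfy that identity. A quick arithmetic check against the reference means used later in the script suggests the source data did not satisfy it exactly either:

```r
# HR (beats/min) ~ 60000 / RR (ms). Checking the reference means
# used in the script ("after lying down for 5 minutes"):
rr_mean_ms <- 543.9985
hr_mean    <- 70.04389

hr_implied <- 60000 / rr_mean_ms
hr_implied   # ~110 beats/min, versus a reference HR mean of ~70
```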

As I analyzed the original ECG dataset, I learned more about these test codes and their relevance to the clinical data.

#### Generating Random Test Results

For each combination in the grid, I generated random test results using a normal distribution to simulate realistic values for each test code. To determine the means and standard deviations, I used the original EG dataset as a reference. By analyzing the range and distribution of values in the original dataset, I was able to extract realistic means and standard deviations for each numerical ECG test (`QT`, `HR`, `RR`).

``` r
EGSTRESN = case_when(
EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
)
```

This approach allowed me to ensure that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.
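As a sanity check (mine, not from the script), simulating one cell of the grid confirms that the sample tracks its target parameters — keeping in mind that `floor()` shifts the mean down by about 0.5:

```r
set.seed(123)

# One cell of the grid: QT after lying down for 5 minutes,
# using the target parameters from the case_when() above.
sim <- floor(rnorm(1e5, mean = 450.9781, sd = 60))

mean(sim)   # close to 450.5 (floor() lowers the mean by ~0.5)
sd(sim)     # close to 60
```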

#### Finalizing the Dataset

Finally, I added labels to the dataframe for easier analysis and future use by utilizing the `add_labels` function from the `metatools` package.

``` r
add_labels(
STUDYID = "Study Identifier",
USUBJID = "Unique Subject Identifier",
EGTEST = "ECG Test Name",
VISIT = "Visit Name",
EGSTRESC = "Character Result/Finding in Std Format",
EGSTRESN = "Numeric Result/Finding in Standard Units",
<etc>
)
```

This provided descriptive names for each column in the dataset, making it more intuitive to understand the data during analysis and ensuring clarity in its subsequent use.
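The labels end up stored as column attributes, which is easy to verify — a minimal sketch using a toy frame (the real script labels the full dataset):

```r
library(dplyr)
library(metatools)

# Toy frame; column names mirror the real dataset.
df <- tibble(
  USUBJID = "01-701-1015",
  EGTEST  = "Heart Rate"
)

df <- df %>%
  add_labels(
    USUBJID = "Unique Subject Identifier",
    EGTEST  = "ECG Test Name"
  )

# Each label is stored as a "label" attribute on its column.
attr(df$USUBJID, "label")
```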

### Conclusion

This project not only sharpened my R programming skills but also provided invaluable experience in reverse-engineering data, analyzing large healthcare datasets, and tackling real-world challenges in the open-source domain. By following a structured approach, I was able to recreate the `EG` dataset synthetically, ensuring it mirrors realistic clinical data. The synthetic dataset will be featured in the next release of {pharmaversesdtm}, offering a valuable resource for future research and development.

<!--------------- appendices go here ----------------->

