Commit e272053: #234 Post updated accordingly to the feedback
Vladyslav committed Oct 14, 2024 (1 parent: 35fc952)
Showing 1 changed file with 67 additions and 50 deletions.
date: "2024-09-30"
categories: [SDTM, Technical]
# Feel free to change the image
image: "pharmaverse.png"

---

<!--------------- typical setup ----------------->
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a new Data Science placement student at Roche UK, I was given an exciting opportunity to sharpen my R programming skills while contributing to the open-source community.
Under the guidance of my manager, Edoardo Mancini, I took on a unique and challenging task within {pharmaversesdtm}, which tested both my technical know-how and problem-solving abilities.

## My Approach: Reverse-Engineering the Data

The existing ECG dataset contained over 25,000 entries, and without the original code, I had to manually explore and make sense of the data to understand how it had been generated.
Here's how I approached it:

The {pharmaversesdtm} test SDTM datasets are typically either copies of open-source data from the CDISC pilot project or generated by developers for use within the pharmaverse.
In this case, the original ECG dataset came from the CDISC pilot, but since that source was no longer available, we had no file to reference.
Fortunately, we still had a saved copy of the dataset, which I was able to analyze.
By studying its structure and variables, I could better understand its contents and recreate a similar dataset for ongoing use.

### 1. Data Exploration and Analysis

I started by thoroughly analyzing the available `EG` dataset.
My goal was to understand the structure and key variables involved in the original dataset.
By digging deep into the data, I gained insights into how it was organized, such as having one row per test per visit.
I also examined the characteristics of the tests, including the range of values and variance, to ensure that I could replicate the dataset faithfully.
This understanding was crucial for generating a new dataset that closely mirrored the original.

```
USUBJID VISIT EGTPT EGTEST EGSTRESN EGSTRESC
<chr> <chr> <chr> <chr> <dbl> <chr>
1 01-701-1015 WEEK 2 "" ECG Int… NA ABNORMAL
2 01-701-1015 WEEK 2 "AFTER LYING DOWN FOR 5 MINUTES" Heart R… 63 63
3 01-701-1015 WEEK 2 "AFTER STANDING FOR 1 MINUTE" Heart R… 83 83
4 01-701-1015 WEEK 2 "AFTER STANDING FOR 3 MINUTES" Heart R… 66 66
5 01-701-1015 WEEK 2 "AFTER LYING DOWN FOR 5 MINUTES" QT Dura… 449 449
6 01-701-1015 WEEK 2 "AFTER STANDING FOR 1 MINUTE" QT Dura… 511 511
7 01-701-1015 WEEK 2 "AFTER STANDING FOR 3 MINUTES" QT Dura… 534 534
8 01-701-1015 WEEK 2 "AFTER LYING DOWN FOR 5 MINUTES" RR Dura… 316 316
9 01-701-1015 WEEK 2 "AFTER STANDING FOR 1 MINUTE" RR Dura… 581 581
10 01-701-1015 WEEK 2 "AFTER STANDING FOR 3 MINUTES" RR Dura… 570 570
```
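To make the checks described above concrete, here is a small sketch of that kind of exploration, run on a mock extract that mirrors the rows shown. Everything in it is illustrative (the mock data and the check code are mine, not the original script):

```r
library(dplyr)

# Mock extract standing in for the saved EG dataset; the values mirror
# the WEEK 2 rows shown above and are illustrative only
eg_saved <- tibble(
  USUBJID  = "01-701-1015",
  VISIT    = "WEEK 2",
  EGTESTCD = rep(c("HR", "QT", "RR"), each = 3),
  EGTPT    = rep(c("AFTER LYING DOWN FOR 5 MINUTES",
                   "AFTER STANDING FOR 1 MINUTE",
                   "AFTER STANDING FOR 3 MINUTES"), times = 3),
  EGSTRESN = c(63, 83, 66, 449, 511, 534, 316, 581, 570)
)

# Check the layout: one row per subject, visit, test and time point
stopifnot(!any(duplicated(select(eg_saved, USUBJID, VISIT, EGTESTCD, EGTPT))))

# Characterise each test: range and spread of the numeric results
eg_saved %>%
  group_by(EGTESTCD) %>%
  summarise(min = min(EGSTRESN), max = max(EGSTRESN), sd = sd(EGSTRESN))
```

On the real dataset, summaries like these are what reveal the per-test means and variances needed later for simulation.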

### 2. Writing the New R Script

Armed with insights from my analysis, I set about writing a new R script to replicate the lost one.
This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content.
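That loop was essentially: generate, summarise, compare against the saved copy, adjust the parameters. A minimal sketch of the comparison step, where `eg_saved` and `eg_new` are illustrative stand-ins for the saved copy and a regenerated draft:

```r
set.seed(123)

# Illustrative stand-ins: the saved copy and a regenerated draft
eg_saved <- data.frame(EGTESTCD = rep(c("HR", "QT"), each = 100),
                       EGSTRESN = c(rnorm(100, 70, 9), rnorm(100, 450, 40)))
eg_new   <- data.frame(EGTESTCD = rep(c("HR", "QT"), each = 100),
                       EGSTRESN = c(rnorm(100, 71, 10), rnorm(100, 452, 38)))

# Summarise mean and sd per test for a dataset
per_test <- function(d) {
  aggregate(EGSTRESN ~ EGTESTCD, data = d,
            FUN = function(x) c(mean = mean(x), sd = sd(x)))
}

# Side-by-side comparison guides the next round of parameter tweaks
list(saved = per_test(eg_saved), new = per_test(eg_new))
```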

## Challenges and Solutions

Working with a dataset of over 25,000 entries brought its own challenges.
Making sure the script was efficient and scalable while still producing accurate, high-quality data was a key focus.
I used a range of R techniques to streamline the process and ensure the generated dataset followed the original patterns.
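One example of such a technique (my illustration, not a quote from the script): drawing all values for a test with a single vectorized `rnorm()` call rather than looping row by row, which matters at 25,000+ rows:

```r
n <- 25000

# Row-by-row loop: one rnorm() call per entry (slow at this scale)
loop_draw <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- rnorm(1, mean = 70, sd = 9)
  out
}

# Vectorized: a single rnorm() call for all entries (fast)
vec_draw <- function(n) rnorm(n, mean = 70, sd = 9)

length(vec_draw(n))  # one draw per row of the dataset
```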

## The Result: A Recreated ECG Dataset

After days of analysis, coding, and refinement, I successfully wrote an R script that could regenerate the lost ECG dataset.

## Main Parts of the Code

In this section, I’ll walk through the most important pieces of the R script I wrote to recreate the `EG` dataset.
The code involved generating a set of dummy patient data, complete with visit information and random test results, based on existing patterns from the original dataset.

### 1. Loading Libraries and Data

To begin, I load the necessary libraries and read in the vital signs (`vs`) dataset.
The `vs` dataset is essential for providing key clinical information about the participants.
This data complements the ECG measurements and allows for a more comprehensive analysis of each subject's health status during the study.
By setting a seed for the random data generation, I ensure that the process is reproducible, allowing others to verify my results and maintain consistency in future analyses.

Additionally, the `metatools` package is loaded to facilitate adding labels to the variables later.
This will enhance the readability and interpretability of the dataset when conducting further analysis.

```{r, eval = T}
library(dplyr)
library(metatools)
library(pharmaversesdtm)
data("vs")
set.seed(123)
```

### 2. Extracting Unique Date/Time of Measurements

Next, I extract the unique combination of subject IDs, visit names, and visit dates from the `vs` dataset.
This data will be used later to match the generated ECG data to the correct visit and time points.

```{r, eval = T}
egdtc <- vs %>%
select(USUBJID, VISIT, VSDTC) %>%
distinct() %>%
rename(EGDTC = VSDTC)
print(egdtc, n = 10)
```

### 3. Generating a Grid of Patient Data

Here, I create a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits.
These combinations represent different test results collected across multiple visits.

Each of these test codes corresponds to specific ECG measurements:
- `QT` refers to the QT interval, which measures the time between the start of the Q wave and the end of the T wave in the heart's electrical cycle.
- `HR` represents heart rate.
- `RR` is the interval between R waves, typically measuring the time between heartbeats.
- `ECGINT` covers other general ECG interpretations.

As I analyzed the original ECG dataset, I learned about these test codes and their relevance to the clinical data.

```{r, eval = T}
eg <- expand.grid(
USUBJID = unique(vs$USUBJID),
EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
  EGTPT = c(
    "AFTER LYING DOWN FOR 5 MINUTES",
    "AFTER STANDING FOR 1 MINUTE",
    "AFTER STANDING FOR 3 MINUTES"
  ),
  VISIT = c(
    "SCREENING 1", "SCREENING 2", "BASELINE", "AMBUL ECG PLACEMENT",
    "WEEK 2", "WEEK 4",
    # ... further visits not shown in this excerpt
"RETRIEVAL"
), stringsAsFactors = FALSE
)
print(eg, n = 10)
```

### 4. Generating Random Test Results

For each combination in the grid, I generate random test results using a normal distribution to simulate realistic values for each test code.
To determine the means and standard deviations, I used the original EG dataset as a reference.
By analyzing the range and distribution of values in the original dataset, I could extract realistic means and standard deviations for each ECG test (`QT`, `HR`, `RR`, `ECGINT`).
This approach allowed me to ensure that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.
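As an illustration of how such parameters can be read off a saved copy (the names and values below are stand-ins, not the script's actual code):

```r
set.seed(123)

# Stand-in for the saved EG data
eg_saved <- data.frame(
  EGTESTCD = rep(c("HR", "RR"), each = 100),
  EGSTRESN = c(rnorm(100, 70, 9), rnorm(100, 540, 80))
)

# Estimate mean and sd per test code from the saved copy
params <- aggregate(EGSTRESN ~ EGTESTCD, data = eg_saved,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))

# Draw n simulated results for one test using those estimates
simulate_test <- function(code, n) {
  p <- params$EGSTRESN[params$EGTESTCD == code, ]
  floor(rnorm(n, mean = p["mean"], sd = p["sd"]))
}

simulate_test("HR", 5)
```

The `case_when()` fragment below applies the same idea inside the generation pipeline, with one branch per test code and time point.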

```r
EGSTRESN = case_when(
EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
  # ... remaining EGTESTCD / EGELTM branches not shown in this excerpt
)
```

### 5. Finalizing the Dataset

Finally, I added labels to the dataframe for easier analysis and future use by utilizing the `add_labels` function from the `metatools` package.
This provided descriptive names for each column in the dataset, making it more intuitive to understand the data during analysis and ensuring clarity in its subsequent use.

```r
add_labels(
STUDYID = "Study Identifier",
DOMAIN = "Domain Abbreviation",
  # ... remaining variable labels not shown in this excerpt
)
```

This project not only helped me improve my R programming skills but also gave me valuable experience in reverse-engineering data, exploring large healthcare datasets, and solving practical problems in the open-source world.
This structured approach allowed me to successfully recreate the `EG` dataset synthetically, which will be available in the next release of {pharmaversesdtm}.

<!--------------- appendices go here ----------------->
