Closes #234 Blog Post: How I Rebuilt a Lost ECG Data Script in R #235

Merged: 15 commits, Oct 31, 2024
File: posts/zzz_DO_NOT_EDIT_how__i__reb.../appendix.R (73 additions, 0 deletions)
suppressMessages(library(dplyr))
# markdown helpers --------------------------------------------------------

markdown_appendix <- function(name, content) {
paste(paste("##", name, "{.appendix}"), " ", content, sep = "\n")
}
markdown_link <- function(text, path) {
paste0("[", text, "](", path, ")")
}



# worker functions --------------------------------------------------------

insert_source <- function(repo_spec, name,
collection = "posts",
branch = "main",
host = "https://github.com",
text = "Source",
file_name) {
path <- paste(
host,
repo_spec,
"tree",
branch,
collection,
name,
file_name,
sep = "/"
)
return(markdown_link(text, path))
}

insert_timestamp <- function(tzone = Sys.timezone()) {
time <- lubridate::now(tzone = tzone)
stamp <- as.character(time, tz = tzone, usetz = TRUE)
return(stamp)
}

insert_lockfile <- function(repo_spec, name,
collection = "posts",
branch = "main",
host = "https://github.com",
text = "Session info") {
# All posts link to the shared session info page, so the arguments above are currently unused.
path <- "https://pharmaverse.github.io/blog/session_info.html"

return(markdown_link(text, path))
}



# top level function ------------------------------------------------------

insert_appendix <- function(repo_spec, name, collection = "posts", file_name) {
appendices <- paste(
markdown_appendix(
name = "Last updated",
content = insert_timestamp()
),
" ",
markdown_appendix(
name = "Details",
content = paste(
insert_source(repo_spec, name, collection, file_name = file_name),
# get renv information,
insert_lockfile(repo_spec, name, collection),
sep = ", "
)
),
sep = "\n"
)
knitr::asis_output(appendices)
}
File: blog post .qmd (205 additions, 0 deletions)
---
title: "How I Rebuilt a Lost ECG Data Script in R"
author:
- name: Vladyslav Shuliar
description: "During my Data Science placement, I faced the challenge of recreating an essential ECG dataset for the {pharmaversesdtm} project after the original R script was lost. I explored the existing data, identified key parameters, and experimented with R packages to replicate the dataset structure and ensure SDTM compliance. Despite challenges with ensuring accurate ECG measurements, I eventually regenerated the dataset, learning valuable lessons in problem-solving and resilience."
# Note that the date below will be auto-updated when the post is merged.
date: "2024-09-30"
# Please do not use any non-default categories.
# You can find the default categories in the repository README.md
categories: [SDTM, Technical]
# Feel free to change the image
image: "pharmaverse.png"

---

<!--------------- typical setup ----------------->

```{r setup, include=FALSE}
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."
# renv::use(lockfile = "renv.lock")
```

<!--------------- post begins here ----------------->

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a new Data Science placement student at Roche UK, I was given an exciting opportunity to sharpen my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I took on a unique and challenging task within {pharmaversesdtm}, which tested both my technical know-how and problem-solving abilities.

## The Challenge: Rewriting a Lost Script

One of the open-source datasets from CDISC, specifically the Electrocardiogram (ECG) data, had been created by a script that was unfortunately lost and couldn’t be recovered. My task was to write a new R script from scratch to regenerate the ECG dataset, closely matching the original in structure and content.

The {pharmaversesdtm} test SDTM datasets are typically either copies of open-source data from the CDISC pilot project or generated by developers for use within the pharmaverse. In this case, the original ECG dataset came from the CDISC pilot, but since that source was no longer available, we had no file to reference. Fortunately, we still had a saved copy of the dataset, which I was able to analyze. By studying its structure and variables, I could better understand its contents and recreate a similar dataset for ongoing use.

## My Approach: Reverse-Engineering the Data

The existing ECG dataset contained over 25,000 entries, and without the original code, I had to manually explore and make sense of the data to understand how it had been generated. Here's how I approached it:

### 1. Data Exploration and Analysis
I started by thoroughly analyzing the available `EG` dataset. My goal was to understand the structure and key variables involved in the original dataset. By digging deep into the data, I gained insights into how it was organized, such as having one row per test per visit. I also examined the characteristics of the tests, including the range of values and variance, to ensure that I could replicate the dataset faithfully. This understanding was crucial for generating a new dataset that closely mirrored the original.
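A minimal sketch of the kind of exploration described above (my reconstruction, not the original analysis code): here `eg_original` is a toy stand-in for the saved copy of the lost `EG` dataset, and the summary pulls out the per-test means, standard deviations, and ranges that can later seed the simulation step.

```r
library(dplyr)

# `eg_original` stands in for the saved copy of the lost EG dataset.
eg_original <- data.frame(
  EGTESTCD = c("HR", "HR", "QT", "QT", "RR", "RR"),
  EGSTRESN = c(70, 74, 451, 458, 544, 536)
)

# Summarise each test's distribution: these statistics inform the
# means and standard deviations used when regenerating the data.
test_stats <- eg_original %>%
  group_by(EGTESTCD) %>%
  summarise(
    mean = mean(EGSTRESN, na.rm = TRUE),
    sd   = sd(EGSTRESN, na.rm = TRUE),
    min  = min(EGSTRESN, na.rm = TRUE),
    max  = max(EGSTRESN, na.rm = TRUE),
    .groups = "drop"
  )
```

On the real dataset the same grouped summary, run per test code and time point, is what surfaces the distribution parameters quoted later in this post.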

### 2. Writing the New R Script
Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content.

## Challenges and Solutions

Working with a dataset of over 25,000 entries brought its own challenges. Making sure the script was efficient and scalable while still producing accurate, high-quality data was a key focus. I used a range of R techniques to streamline the process and make sure the dataset followed the original patterns.

## The Result: A Recreated ECG Dataset

After days of analysis, coding, and refinement, I successfully wrote an R script that could regenerate the lost ECG dataset.

## Main Parts of the Code

In this section, I’ll walk through the most important pieces of the R script I wrote to recreate the `EG` dataset. The code involved generating a set of dummy patient data, complete with visit information and random test results, based on existing patterns from the original dataset.

### 1. Loading Libraries and Data

To begin, I load the necessary libraries and read in the vital signs (`vs`) dataset, which provides key clinical information about the participants. This data complements the ECG measurements and allows for a more comprehensive analysis of each subject's health status during the study. Setting a seed for the random data generation makes the process reproducible, so others can verify my results and future analyses stay consistent.

```r
library(dplyr)
library(metatools)

data("vs")
set.seed(123)
```

### 2. Extracting Unique Date/Time of Measurements

Next, I extract the unique combination of subject IDs, visit names, and visit dates from the `vs` dataset. This data will be used later to match the generated ECG data to the correct visit and time points.

```r
egdtc <- vs %>%
select(USUBJID, VISIT, VSDTC) %>%
distinct() %>%
rename(EGDTC = VSDTC)
```

The output:

```
USUBJID VISIT EGDTC
<chr> <chr> <chr>
1 01-701-1015 SCREENING 1 2013-12-26
2 01-701-1015 SCREENING 2 2013-12-31
3 01-701-1015 BASELINE 2014-01-02
4 01-701-1015 AMBUL ECG PLACEMENT 2014-01-14
5 01-701-1015 WEEK 2 2014-01-16
6 01-701-1015 WEEK 4 2014-01-30
```

### 3. Generating a Grid of Patient Data

Here, I create a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits. These combinations represent different test results collected across multiple visits.

Each of these test codes corresponds to specific ECG measurements:
- `QT` refers to the QT interval, which measures the time between the start of the Q wave and the end of the T wave in the heart's electrical cycle.
- `HR` represents heart rate.
- `RR` is the interval between successive R waves, measuring the time between heartbeats.
- `ECGINT` covers other general ECG interpretations.

As I analyzed the original ECG dataset, I learned about these test codes and their relevance to the clinical data. This analysis helped me understand how different visits and time points corresponded to various test results, and why it was important to regenerate all these combinations for accuracy.

```r
eg <- expand.grid(
USUBJID = unique(vs$USUBJID),
EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
EGTPT = c("AFTER LYING DOWN FOR 5 MINUTES", "AFTER STANDING FOR 1 MINUTE", "AFTER STANDING FOR 3 MINUTES"),
VISIT = c(
"SCREENING 1",
"SCREENING 2",
"BASELINE",
"AMBUL ECG PLACEMENT",
"WEEK 2",
"WEEK 4",
"AMBUL ECG REMOVAL",
"WEEK 6",
"WEEK 8",
"WEEK 12",
"WEEK 16",
"WEEK 20",
"WEEK 24",
"WEEK 26",
"RETRIEVAL"
), stringsAsFactors = FALSE
)
```

The output:

```
USUBJID EGTESTCD EGTPT VISIT
1 01-701-1015 QT AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
2 01-701-1015 HR AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
3 01-701-1015 RR AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
4 01-701-1015 ECGINT AFTER LYING DOWN FOR 5 MINUTES SCREENING 1
5 01-701-1015 QT AFTER STANDING FOR 1 MINUTE SCREENING 1
6 01-701-1015 HR AFTER STANDING FOR 1 MINUTE SCREENING 1
```
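As noted in step 2, the extracted date/times are later matched back onto this grid. The join below is my illustration of that matching (variable names follow the post), using toy inputs so the example is self-contained.

```r
library(dplyr)

# Toy version of the date/time lookup extracted from `vs` in step 2.
egdtc <- data.frame(
  USUBJID = "01-701-1015",
  VISIT   = c("SCREENING 1", "BASELINE"),
  EGDTC   = c("2013-12-26", "2014-01-02")
)

# Toy version of the expanded grid from step 3.
eg <- expand.grid(
  USUBJID  = "01-701-1015",
  EGTESTCD = c("QT", "HR"),
  VISIT    = c("SCREENING 1", "BASELINE"),
  stringsAsFactors = FALSE
)

# Every generated row picks up the date recorded for that subject and visit.
eg_dated <- left_join(eg, egdtc, by = c("USUBJID", "VISIT"))
```

A `left_join()` keeps every row of the grid, so tests at visits with no recorded date would simply get a missing `EGDTC` rather than being dropped.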

### 4. Generating Random Test Results

For each combination in the grid, I generate random test results using a normal distribution to simulate realistic values for each test code. To determine the means and standard deviations, I used the original EG dataset as a reference. By analyzing the range and distribution of values in the original dataset, I could extract realistic means and standard deviations for each ECG test (e.g., QT, HR, RR, ECGINT). This approach allowed me to ensure that the synthetic data aligned closely with the patterns and variability observed in the original clinical data.

```r
EGSTRESN = case_when(
EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
)
```
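For context, the `case_when()` fragment above lives inside a `mutate()` over the expanded grid. The sketch below shows that shape with only two test codes and simplified means/SDs (my illustration; the full script uses the values listed above).

```r
library(dplyr)

set.seed(123)

# Simplified grid: two subjects, two test codes, three elapsed times.
eg <- expand.grid(
  USUBJID  = c("01-701-1015", "01-701-1023"),
  EGTESTCD = c("HR", "QT"),
  EGELTM   = c("PT1M", "PT3M", "PT5M"),
  stringsAsFactors = FALSE
)

# Draw one value per row from the distribution matching its test code.
eg <- eg %>%
  mutate(
    EGSTRESN = case_when(
      EGTESTCD == "HR" ~ floor(rnorm(n(), 74, 8)),
      EGTESTCD == "QT" ~ floor(rnorm(n(), 455, 60))
    )
  )
```

Because `case_when()` evaluates each right-hand side over all rows and then selects per row, `rnorm(n(), ...)` draws a full vector per branch; only the value matching each row's condition is kept.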

### 5. Finalizing the Dataset

Finally, I add variable labels to the data frame using the `add_labels()` function from the {metatools} package. Descriptive labels for each column make the dataset more intuitive to explore and keep its meaning clear in subsequent analyses.

```r
eg <- add_labels(
  eg,
STUDYID = "Study Identifier",
DOMAIN = "Domain Abbreviation",
USUBJID = "Unique Subject Identifier",
EGSEQ = "Sequence Number",
EGTESTCD = "ECG Test Short Name",
EGTEST = "ECG Test Name",
EGORRES = "Result or Finding in Original Units",
EGORRESU = "Original Units",
EGSTRESC = "Character Result/Finding in Std Format",
EGSTRESN = "Numeric Result/Finding in Standard Units",
EGSTRESU = "Standard Units",
EGSTAT = "Completion Status",
EGLOC = "Location of ECG Measurement",
EGBLFL = "Baseline Flag",
VISITNUM = "Visit Number",
VISIT = "Visit Name",
VISITDY = "Planned Study Day of Visit",
EGDTC = "Date/Time of Measurements",
EGDY = "Study Day of ECG",
EGTPT = "Planned Time Point Name",
EGTPTNUM = "Planned Time Point Number",
EGELTM = "Planned Elapsed Time from Time Point Ref",
EGTPTREF = "Time Point Reference"
)
```

This project not only helped me improve my R programming skills but also gave me valuable experience in reverse-engineering data, exploring large healthcare datasets, and solving practical problems in the open-source world. This structured approach allowed me to successfully recreate the `EG` dataset synthetically, which will be available in the next release of {pharmaversesdtm}.

<!--------------- appendices go here ----------------->

```{r, echo=FALSE}
source("appendix.R")
insert_appendix(
repo_spec = "pharmaverse/blog",
name = long_slug,
# file_name should be the name of your file
file_name = list.files() %>% stringr::str_subset(".qmd") %>% first()
)
```