Pull request #235 (merged Oct 31, 2024, 15 commits): Closes #234, Blog Post: How I Rebuilt a Lost ECG Data Script in R
`posts/zzz_DO_NOT_EDIT_how__i__reb.../appendix.R` (new file, 73 lines)
suppressMessages(library(dplyr))
# markdown helpers --------------------------------------------------------

markdown_appendix <- function(name, content) {
  paste(paste("##", name, "{.appendix}"), " ", content, sep = "\n")
}

markdown_link <- function(text, path) {
  paste0("[", text, "](", path, ")")
}


# worker functions --------------------------------------------------------

insert_source <- function(repo_spec, name,
                          collection = "posts",
                          branch = "main",
                          host = "https://github.com",
                          text = "Source",
                          file_name) {
  path <- paste(
    host,
    repo_spec,
    "tree",
    branch,
    collection,
    name,
    file_name,
    sep = "/"
  )
  return(markdown_link(text, path))
}

insert_timestamp <- function(tzone = Sys.timezone()) {
  time <- lubridate::now(tzone = tzone)
  stamp <- as.character(time, tz = tzone, usetz = TRUE)
  return(stamp)
}

insert_lockfile <- function(repo_spec, name,
                            collection = "posts",
                            branch = "main",
                            host = "https://github.com",
                            text = "Session info") {
  # All posts link to the shared session info page
  path <- "https://pharmaverse.github.io/blog/session_info.html"

  return(markdown_link(text, path))
}


# top level function ------------------------------------------------------

insert_appendix <- function(repo_spec, name, collection = "posts", file_name) {
  appendices <- paste(
    markdown_appendix(
      name = "Last updated",
      content = insert_timestamp()
    ),
    " ",
    markdown_appendix(
      name = "Details",
      content = paste(
        insert_source(repo_spec, name, collection, file_name = file_name),
        # renv/session information
        insert_lockfile(repo_spec, name, collection),
        sep = ", "
      )
    ),
    sep = "\n"
  )
  knitr::asis_output(appendices)
}
---
title: "How I Rebuilt a Lost ECG Data Script in R"
author:
- name: Vladyslav Shuliar
description: "A Data Science placement student shares their experience of rewriting a lost R script to regenerate an essential ECG dataset for the open-source *pharmaversesdtm* project. The post covers their approach to data exploration, identifying key parameters, and overcoming challenges in recreating the dataset from scratch."
# Note that the date below will be auto-updated when the post is merged.
date: "2024-09-30"
# Please do not use any non-default categories.
# You can find the default categories in the repository README.md
categories: [SDTM, Community, Technical]
# Feel free to change the image
image: "pharmaverse.png"

---

<!--------------- typical setup ----------------->

```{r setup, include=FALSE}
long_slug <- "zzz_DO_NOT_EDIT_how__i__reb..."
# renv::use(lockfile = "renv.lock")
```

<!--------------- post begins here ----------------->

# Rebuilding a Lost Script: My Journey into Open-Source Data Science

As a new Data Science placement student, I was given an exciting opportunity to sharpen my R programming skills while contributing to the open-source community. Under the guidance of my manager, Edoardo Mancini, I took on a unique and challenging task for *pharmaversesdtm*, which tested both my technical know-how and problem-solving abilities.

## The Challenge: Rewriting a Lost Script

One of the open-source datasets from CDISC, specifically the Electrocardiogram (ECG) data, had been created by a script that had unfortunately been lost and couldn’t be recovered. This was a major issue because the program used to retrieve and process the ECG data was essential for future work. My task was to write a new R script from scratch to regenerate the ECG dataset—one that closely matched the original in both structure and content.

## My Approach: Reverse-Engineering the Data

The existing ECG dataset contained over 25,000 entries, and without the original code, I had to manually explore and make sense of the data to understand how it had been generated. Here's how I approached it:

### 1. Data Exploration and Analysis
I started by thoroughly analysing the available ECG dataset. My goal was to identify patterns, structures, and key variables that were likely involved in creating the original dataset. By digging deep into the data, I could understand how it was organised and what factors were critical to replicate.
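A first pass at that exploration can be sketched with a few `dplyr` one-liners. This is illustrative only: it assumes the original CDISC ECG data has already been read into a data frame named `eg`.

```r
library(dplyr)

# Assumes the original CDISC ECG data is available as `eg`
glimpse(eg)              # variables and their types
count(eg, EGTESTCD)      # which tests occur, and how often
count(eg, EGTPT, EGELTM) # planned time points and elapsed-time codes
count(eg, VISIT)         # the visit schedule
```

Frequency tables like these reveal the fixed vocabulary of the dataset (test codes, time points, visits) that the generation script has to reproduce.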

### 2. Identifying the Parameters
As I explored the dataset, I focused on identifying which features were crucial for recreating the lost data. By paying close attention to trends and relationships between different variables, I could form a rough idea of how the original script might have worked.
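For the numeric results, the concrete parameters to recover are the per-test, per-time-point means and spreads. A minimal sketch of that estimation, again assuming the original data is loaded as `eg`:

```r
library(dplyr)

# Estimate the distribution of each test at each elapsed time
eg %>%
  filter(!is.na(EGSTRESN)) %>%
  group_by(EGTESTCD, EGELTM) %>%
  summarise(
    mean = mean(EGSTRESN),
    sd = sd(EGSTRESN),
    .groups = "drop"
  )
```

Group summaries of this kind are what supply the means used in the random-generation step later in the post.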

### 3. Writing the New R Script
Armed with insights from my analysis, I set about writing a new R script to replicate the lost one. This involved a lot of trial and error, as I kept refining the code to ensure it generated a dataset that closely resembled the original ECG data in both structure and content.

## Challenges and Solutions

Working with a dataset of over 25,000 entries brought its own challenges. Making sure the script was efficient and scalable while still producing accurate, high-quality data was a key focus. I used a range of R techniques to streamline the process and make sure the dataset followed the original patterns.

## The Result: A Recreated ECG Dataset

After days of analysis, coding, and refinement, I successfully wrote an R script that could regenerate the lost ECG dataset. This project not only helped me improve my R programming skills but also gave me valuable experience in reverse-engineering data, exploring large healthcare datasets, and solving practical problems in the open-source world.


## Main Parts of the Code

In this section, I’ll walk through the most important pieces of the R script I wrote to recreate the ECG dataset. The code involved generating a set of dummy patient data, complete with visit information and random test results, based on existing patterns from the original dataset.

### 1. Loading Libraries and Data

To begin, I load the necessary libraries and read in the vital signs (`vs`) dataset. The seed is set to ensure that the random data generation is reproducible.

```r
library(dplyr)
library(metatools)
library(pharmaversesdtm) # assumed source of the `vs` dataset

data("vs")
set.seed(123)
```

### 2. Extracting Unique Date/Time of Measurements

Next, I extract the unique combination of subject IDs, visit names, and visit dates from the `vs` dataset. This data will be used later to match the generated ECG data to the correct visit and time points.

```r
egdtc <- vs %>%
  select(USUBJID, VISIT, VSDTC) %>%
  distinct() %>%
  rename(EGDTC = VSDTC)
```

### 3. Generating a Grid of Patient Data

Here, I create a grid of all possible combinations of subject IDs, test codes (e.g., `QT`, `HR`, `RR`, `ECGINT`), time points (e.g., after lying down, after standing), and visits. These combinations represent different test results across multiple visits.

```r
eg <- expand.grid(
  USUBJID = unique(vs$USUBJID),
  EGTESTCD = c("QT", "HR", "RR", "ECGINT"),
  EGTPT = c(
    "AFTER LYING DOWN FOR 5 MINUTES",
    "AFTER STANDING FOR 1 MINUTE",
    "AFTER STANDING FOR 3 MINUTES"
  ),
  VISIT = c(
    "SCREENING 1",
    "SCREENING 2",
    "BASELINE",
    "AMBUL ECG PLACEMENT",
    "WEEK 2",
    "WEEK 4",
    "AMBUL ECG REMOVAL",
    "WEEK 6",
    "WEEK 8",
    "WEEK 12",
    "WEEK 16",
    "WEEK 20",
    "WEEK 24",
    "WEEK 26",
    "RETRIEVAL"
  ), stringsAsFactors = FALSE
)
```

### 4. Generating Random Test Results

For each combination in the grid, I generate random test results using a normal distribution to simulate realistic values for each test code.

```r
# Excerpt: EGSTRESN is assigned inside mutate() on the generated grid `eg`;
# EGELTM (e.g. "PT5M") is derived from EGTPT earlier in the script
eg <- eg %>%
  mutate(
    EGSTRESN = case_when(
      EGTESTCD == "RR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 543.9985, 80)),
      EGTESTCD == "RR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 536.0161, 80)),
      EGTESTCD == "RR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 532.3233, 80)),
      EGTESTCD == "HR" & EGELTM == "PT5M" ~ floor(rnorm(n(), 70.04389, 8)),
      EGTESTCD == "HR" & EGELTM == "PT3M" ~ floor(rnorm(n(), 74.27798, 8)),
      EGTESTCD == "HR" & EGELTM == "PT1M" ~ floor(rnorm(n(), 74.77461, 8)),
      EGTESTCD == "QT" & EGELTM == "PT5M" ~ floor(rnorm(n(), 450.9781, 60)),
      EGTESTCD == "QT" & EGELTM == "PT3M" ~ floor(rnorm(n(), 457.7265, 60)),
      EGTESTCD == "QT" & EGELTM == "PT1M" ~ floor(rnorm(n(), 455.3394, 60))
    )
  )
```
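With results generated for every grid row, the `egdtc` lookup built in step 2 can be joined back so each record carries its measurement date. A one-line sketch, assuming `eg` and `egdtc` as constructed above:

```r
# Attach the visit date/time from vs to each generated ECG record
eg <- eg %>%
  left_join(egdtc, by = c("USUBJID", "VISIT"))
```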

### 5. Finalizing the Dataset

Finally, I add labels to the data frame for easier analysis and future use.

```r
# add_labels() from {metatools}; the data frame is its first argument
eg <- add_labels(
  eg,
  STUDYID = "Study Identifier",
  DOMAIN = "Domain Abbreviation",
  USUBJID = "Unique Subject Identifier",
  EGSEQ = "Sequence Number",
  EGTESTCD = "ECG Test Short Name",
  EGTEST = "ECG Test Name",
  EGORRES = "Result or Finding in Original Units",
  EGORRESU = "Original Units",
  EGELTM = "Planned Elapsed Time from Time Point Ref",
  EGSTRESC = "Character Result/Finding in Std Format",
  EGSTRESN = "Numeric Result/Finding in Standard Units",
  EGSTRESU = "Standard Units",
  EGSTAT = "Completion Status",
  EGLOC = "Location of Vital Signs Measurement",
  EGBLFL = "Baseline Flag",
  VISITNUM = "Visit Number",
  VISIT = "Visit Name",
  VISITDY = "Planned Study Day of Visit",
  EGDTC = "Date/Time of Measurements",
  EGDY = "Study Day of Vital Signs",
  EGTPT = "Planned Time Point Name",
  EGTPTNUM = "Time Point Number",
  EGTPTREF = "Time Point Reference"
)
```

This structured approach allowed me to successfully recreate the lost ECG dataset, providing a solid foundation for future analysis and research.

<!--------------- appendices go here ----------------->

```{r, echo=FALSE}
source("appendix.R")
insert_appendix(
repo_spec = "pharmaverse/blog",
name = long_slug,
# file_name should be the name of your file
file_name = list.files() %>% stringr::str_subset(".qmd") %>% first()
)
```