description added

hadexversum · Apr 23, 2024 · eb3e692 · eb3e692
1 parent 4e69cbd
commit eb3e692
Show file tree

Hide file tree

Showing 7 changed files with 328 additions and 10 deletions.
diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,4 @@
 .Ruserdata
 .RProj
 docs
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,15 +1,17 @@
 Package: HRaDeX
-Title: What the Package Does (One Line, Title Case)
-Version: 0.0.0.9000
+Title: Uptake classification for HDXMS
+Version: 0.5
 Authors@R: 
-    person(given = "First",
-           family = "Last",
-           role = c("aut", "cre"),
-           email = "[email protected]",
-           comment = c(ORCID = "YOUR-ORCID-ID"))
-Description: What the package does (one paragraph).
-License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
-    license
+     c(person("Weronika", "Puchala", 
+              email = "[email protected]", 
+              comment = c(ORCID = "0000-0003-2163-1429"),
+              role = c("cre", "aut")),
+       person("Michal", "Burdukiewicz", 
+              email = "[email protected]", 
+              comment = c(ORCID = "0000-0001-8926-582X"),
+              role = c("aut")))
+Description: Analyses the H/D exchange dynamics of proteins in high reolution. 
+License: GPL-3
 Encoding: UTF-8
 LazyData: true
 Roxygen: list(markdown = TRUE)
@@ -26,6 +28,9 @@ Imports:
     r3dmol,
     tibble
 Suggests: 
+    knitr,
+    rmarkdown,
     testthat (>= 3.0.0)
 Config/testthat/edition: 3
 URL: https://hadexversum.github.io/HRaDeX/
+VignetteBuilder: knitr
diff --git a/HRaDeX.Rproj b/HRaDeX.Rproj
@@ -0,0 +1,22 @@
+Version: 1.0
+
+RestoreWorkspace: No
+SaveWorkspace: No
+AlwaysSaveHistory: Default
+
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+
+RnwWeave: knitr
+LaTeX: pdfLaTeX
+
+AutoAppendNewline: Yes
+StripTrailingWhitespace: Yes
+LineEndingConversion: Posix
+
+BuildType: Package
+PackageUseDevtools: Yes
+PackageInstallArgs: --no-multiarch --with-keep.source
+PackageRoxygenize: rd,collate,namespace
diff --git a/README.md b/README.md
@@ -0,0 +1,12 @@
+
+# hadexversum
+
+HRaDeX is a part of tool family for analysing HDX-MS data, HaDeXversum. It is developed in Mass Spectrometry Lab in Institute of Biophysics and Biochemistry, Polish Acadamy of Sciences.
+
+HRaDeX is a package for classification process of peptide-level uptake curves. While HRaDeX covers all computational processes, there are two GUIs for its functionalities.
+HRaDeXGUI is an app that allows one-state classification process.
+compaHRaDeX is an app for comparative analysis of classification results for two biological states, conducted separately using HRaDeXGUI. 
+
+The applications are available for use online: [HRaDeX](https://hradex.mslab-ibb.pl/) and [compaHRaDeX](https://compahradex.mslab-ibb.pl/).
+
+The project is currently under development.
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/gui_guide.Rmd b/vignettes/gui_guide.Rmd
@@ -0,0 +1,31 @@
+---
+title: "How to use the apps?"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{How to use the apps?}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+HRaDeX itself is just the package for the calculation and visulazation. To make use of the solution more comfortable, we prepared shiny applications. To use them, there are two function calls:
+
+
+You can also use them online, here: [HRaDeX](https://hradex.mslab-ibb.pl/) and [compaHRaDeX](https://compahradex.mslab-ibb.pl/).
+
+Should you notice any inconveniences while using them, please let us know. The best way to contact us is to use the following email: [email protected]. 
+
+# HRaDeX
+
+# compaHRaDeX
+
+Once you have processed two biological states using HRaDeX, there is a need for comparative analysis. What is necessary to perform one?
+
+* fit_data_.csv and uc_data_.csv for the first state
+* fit_data_.csv and uc_data_.csv for the second state
+* PDB file structure for the protein.
+
+fit_data_.csv is to be found in the `Fit params` tab in HRaDeX, and uc_data_.csv is in `UC Data` tab. 
+
+Of course, you can perform compartive analysis providing only some of desired files, but then not all of the fetures will be available.
+
+
diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
@@ -0,0 +1,245 @@
+---
+title: "Workflow description"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Workflow description}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+# Introduction
+
+The baseline of the HDX-MS method is the investigation of the exchange between hydrogen from the protein structure with deuterium from the buffer in which the protein is suspended. The difference in mass between hydrogen and deuterium is very close to 1 Da, so that we can identify the mass uptake with the number of exchanged hydrogens. Multiple time measurements allow us to see the exchange in time and create an exchange curve of the process for each measured and identified peptide (in a bottom-up approach).
+
+The main challenge in HDX-MS is data analysis and interpretation.  Although different methods of analysis focus on numerical values, that may be difficult to interpret, we wanted to propose a classification method to simplify communication of the results. Moreover, we are avoiding any data transformation, instead, propose a class label for perfectly known and valid experimental curves. 
+
+
+# Classification method
+
+This method aims to assign each peptide a specific combination of exchange classes, consistent with experimental results. This allows us to distinguish areas with different exchange patterns in the protein sequence, and to describe the exchange process itself.
+
+The class assignment is the final step of the classification process, based on the literature description of the hydrogen-deuterium phenomenon. The process has several steps, with multiple workflow decisions under the experimenter's control.
+
+In the figure \ref{fig:ex_duc}, examples of experimental curves are presented. Although it is easy to discriminate between the extreme cases (described and defined in section \ref{sec:extreme}) - immediate exchange (red) and none-exchange (black), we need a classification method to provide information on the remaining curves (grey). The problem of classification is mainly focused on them. 
+
+*Future plans: the classification of the subfragments of the overlapping peptides.*
+
+## Obtaining the uptake curve
+
+The experimental data from the csv file is processed and aggregated.  
+Firstly, the deuterium uptake curve is prepared. It is a function of deuterium uptake in time,  with values defined as:
+
+$$
+    D_t = m_t - m_0
+$$
+
+where $D_t$ is deuterium uptake in time $t$ in daltons, $m_t$ is mass of the peptide measured in time $t$ and $m_0$ is mass of undeuterated peptide. 
+
+Then, the uptake curve is normalized. The values are scaled according to the assumption, that the last time point of measurement or supplied fully deuterated control (difference explained below) is the maximal experimental deuterium uptake with 100\% deuterium uptake. This can be calculated in two ways:
+
+$$
+    D_{t, sc} = \frac{D_t}{D_{100}} = \frac{m_t - m_0}{m_{100} - m_0}
+$$
+
+This form of the uptake curve is used for the fitting process. Because of this, we can operate using percentages for populations (denoted as $n_i$, the fraction of exchangeable hydrogens) undergoing different exchange patterns, and disregard the back-exchange problem for a while.
+
+*Comment on the difference between treating the last time point of measurement as fully deuterated (FD) control and additionally prepared fully deuteration control. When the first option is chosen, the last time point of measurement measuremnt ia assigned the value of 100% of deuterium uptake and the measurement point is taking part in the fitting process. When the additional fully deuterated control is chosen, all the measurement values are scaled with respect to that, but this FD value is not assigned to any time point, and not taking part in the fitting process, only the real measurement time points.*
+
+## Extreme case recognition
+
+Firstly, we recognize extreme cases of uptake curve, and assigned them a class name without fitting process.
+
+For the peptide to be labeled as **none exchange**, has to fulfill defined requirements: 
+* the uptake level [Da] in the last time point of measurement ($D_{100}$) is lower than 1 Da,
+* the thoretical deuterium uptake in the last time point of measurement (scaled with respect to maximal possible uptake calculated from the sequence) is lower than 10%.
+
+We interpret this case as a lack of observable exchange during the course of experiment, and indicate it with the black color on the visualization.
+
+For the peptide to be disqualified from the fitting process and labeled by **invalid** there has to be lack of sufficient number of time points. It is crucial for the peptide to have valid measurements for $t_0$ and $t_100$, as otherwise it prevents the normalization of values and causes the falsyfication of data.
+The missing of one or two time points of measurement (apart from controls) causes the use of two- or one- component model, but does not prevent the fitting process.
+
+The fitting process is not conducted for uptake curves classified as extreme cases (no exchange or invalid).
+
+## Exchange class definition
+
+Following the literature - especially in a cornerstone of the studies on the modeling hydrogen-deuterium exchange, the pioneering article of Zhang and Smith \cite{zhang_determination_1993}, we expect the process of deuterium exchange for the peptide to be described as a three-component equation \ref{eq:fit_3}:
+
+$$
+D = n_1 \cdot (1 - exp(-k_1 \cdot t)) + n_2 \cdot (1-exp(-k_2 \cdot t)) + n_3 \cdot (1 - exp(-k_3 \cdot t)) 
+$$
+where $n_i$ indicates the population of i-th group and $k_i$ is the exchange rate of i-th group. Moreover, we assume:
+
+
+* $n_1$, $k_1$ - fast exchange (red)
+* $n_2$, $k_2$ - medium exchange (green)
+* $n_3$, $k_3$ - slow exchange (blue)
+
+Ideally, $n_1 + n_2 + n_3 = 1$.
+In this equation, we have six unknown parameters that need to be found.
+
+Unfortunately, the authors of this paper did not provide boundaries for each exchange group. The  ranges proposed by us, according to our experience, can be found in table \ref{tab:k_ranges}. 
+
+The ranges of exchange rates can be changed if desired. However, we recommend avoiding situations where a certain range is covered by two classes, as it may lead to misclassification. 
+
+
+|      | lower | upper |
+|------|-------|-------|
+|$k_1$ | 1     | 30    |
+|$k_2$ | 0.1   | 1     |
+|$k_3$ | 0     | 0.1   |
+
+
+The Zhang \& Smith equation describes a theoretical situation, which may not occur during every experiment. Thus, we prepared different workflows and modifications (described later on), but the three-component equation is our desired starting point.  
+
+## Workflow selection
+
+We have multiple workflows prepared. We strongly advise using the first one, but we understand the specificity of each experiment and we leave the final decision to the user, trusting their motives.
+
+Available workflows:
+
+* 3exp/1exp - we start the fitting process by looking for the best fit using function \ref{eq:fit_3} with ranges specified in table \ref{tab:k_ranges}. For the peptides with unsatisfactory results, we look for only one component function (described below). The selection of better results is made by comparing the $R^2$ values. 
+* 2exp/1exp - the starting point is the two-component function (described below), and then the one-component function. The decision process is the same as in the previous point.
+* 3exp/2exp/1exp - all of the variants of the fitting process are prepared and the best is chosen based on the $BIC$ value. 
+
+### 3exp fit
+
+The fitting function is described as Zhang \& Smith equation \ref{eq:fit_3}, with initial values: 
+
+* $n_1 = n_2 = n_3 = 0.33$
+* $k_1 = 1$
+* $k_2 = 0.1$
+* $k_3 = 0.01$
+
+The boundaries for the values are as described in table \ref{tab:k_ranges}. 
+
+The number of unknown parameters is 6.
+
+### 2exp fit
+
+The two-component fitting function, is described as follows:
+
+$$
+D = n_a \cdot (1 - exp(-k_a \cdot t)) + n_b \cdot (1-exp(-k_b \cdot t)) 
+$$
+
+Where $a$ and $b$ are two of three exchange groups defined for the Zhang \& Smith equation. We perform three fitting processes (for each group combination) and select the best result comparing the $R^2$ value. That means we look for the fit using $k_1$ with $k_2$, $k_2$ with $k_3$, or $k_1$ with $k_3$ and select one as the answer.
+
+The initial value for $n_a$ and $n_b$ is 0.5. The initial values for $k$ are the same with analogical cases from the three-component equation (see section \ref{sssec:e3exp}).
+
+In this case, we assume that $n_a$ of hydrogen particles are undergoing the exchange with $k_a$ exchange rate, an $n_b$ with exchange rate $k_b$. 
+
+The number of unknown parameters is 4.
+
+### 1exp fit
+
+The fitting function is described as follows: 
+
+$$
+D = n \cdot (1 - exp(-k \cdot t))  
+$$
+
+In this case, we assume that all of the hydrogen particles in the peptide are undergoing the exchange in a similar way. Ideally with $n \sim 1$.
+
+The initial values for the fit:
+
+* $n = 1$
+* $k = 1$ 
+* upper boundary of $k$ = upper boundary of $k_1$
+* lower boundary of $k$ = lower boundary of $k_3$
+
+The number of unknown parameters is 2.
+
+### Fitting parameters
+
+To fit the curve we use the nonlinear least square method. For more information, see the documentation of the used R package, \hyperlink{https://rdocumentation.org/packages/minpack.lm/versions/1.2-3}{here}.
+
+Recommended parameters (possible change):
+* max iteration: 100,
+* algorithm: Levenberg-Marquardt.
+
+
+## Results of the fitting process
+
+TODO: n, k, rgb, interpretation
+
+## Calculation of estimed exchange rate k
+
+As we are working on normalized values and each value $n_i$ is within the limits of (0, 1), whe can treat them somehow as a probability of getting $k_i$. Then, we can caluclate the estimate k value, aggrgating all of the fit parameters into one value:
+
+$$
+k_{exp} = \sum^i n_i \cdot k_i
+$$
+
+## High resolution results
+
+As the classification is performed on uptake curves of each peptide, there is a need of aggrgation to obtain the high resolution results. Currently, HRaDeX allows the use of two methods.
+
+### Selection of the shortest peptide.
+
+For each residue there is a subset of peptides covering said residue. From this set, the shortest peptide is chosen as a determinant for the resiude, as the most represantative. 
+
+### Aggregation of values
+
+For each residue there is a subset of peptides covering said residue. Then, the final $n_{i, agg}$ is calculated from the subset of $n_i$, with weights inverse propotional to the peptides lengths ($l_i$):
+
+$$
+w_i = \frac{\frac{1}{l_i}}{\sum^i \frac{1}{l_i}}
+$$
+
+Then, the $n_i%$ is a weighted average of set of $n_i$.
+
+$$
+n_{i, agg} = \frac{\sum^i w_i \cdot n_i}{\sum^i w_i}
+$$
+There is a possibility that $n_{i, agg}$ obtained that way exceeds the limit of 1. Then, we normalize the values, to get the sense of proportionality. 
+
+$$
+n_{i, agg.scaled} = \frac{n_{i, agg}}{\sum^i n_{i, agg}}
+$$
+
+This way, we have the right values to obtain the RGB color value.
+
+The estimated k values i calculated analogically.
+
+
+## Visualization of the results
+
+TODO: describe all methods 
+
+Each of the $n_i$ indicates a population from the range [0,1] and has an associated group color. Based on that, and the defined exchange classes, for each peptide we construct a color using RGB system. The participation of each group is presented on the chart \ref{fig:color_triangle}:
+
+
+\begin{figure}[h]
+\centering
+\includegraphics[width=10cm]{rgb_class.png}
+\caption{RGB color triangle}
+\label{fig:color_triangle}
+\end{figure}
+
+The more red color is selected, the biggest the population of fast-exchanging hydrogens in the peptide.
+
+The more green color is selected, the biggest the population of medium-exchanging hydrogens in the peptide.
+
+The more blue color is selected, the biggest the population of slow-exchanging hydrogens in the peptide.
+
+*Disclaimer: We are aware that the color code convention may lead to inaccessibility for some users with color blindness. We recognize the problem although cannot see the solution without resigning from the RGB color code. For users whom it may concern, we provide numerical data in the web server, next to the plots for direct information.*
+
+
+
+# Example visualizations
+
+Below, in the figure \ref{fig:coverage_heatmap}, we present the classification results of all peptides in a single biological state on the coverage map (selected workflow: 3exp/1exp). The length of each bar represents the length of the peptide and its position on the protein sequence. It is crucial to notice, that overlapping peptides are classified  as one class, and the cases with different classes can be explained by the length and position. The color code for peptides that are mixed type can be interpreted using the RGB triangle (see figure \ref{fig:color_triangle}). 
+
+The consistency of classification for sub-areas of peptide leads us to believe that this method of classification is working and provide information coherent with experimental data.
+
+\begin{figure}[h!]
+    \centering
+    \includegraphics[width=17cm]{example_class_coverage.png}
+    \caption{Example of class coverage}
+    \label{fig:coverage_heatmap}
+\end{figure}
+
+# Workflow schema
+
+Figure \ref{diag:workflow_peptide} presents the diagram for classifying one peptide in one biological state, based on its uptake curve. This process can be repeated for each peptide from the data file to access the information for the set of peptides from a protein.
+