02.2-general-math.Rmd

## General Math

### Number Sets

| Notation     | Denotes          | Examples                                                |
|-------------------|----------------------|-------------------------------|
| $\emptyset$  | Empty set        | No members                                              |
| $\mathbb{N}$ | Natural numbers  | $\{1, 2, \ldots\}$                                      |
| $\mathbb{Z}$ | Integers         | $\{\ldots, -1, 0, 1, \ldots\}$                          |
| $\mathbb{Q}$ | Rational numbers | Including fractions                                     |
| $\mathbb{R}$ | Real numbers     | Including all finite decimals, irrational numbers       |
| $\mathbb{C}$ | Complex numbers  | Including numbers of the form $a + bi$ where $i^2 = -1$ |

------------------------------------------------------------------------

### Summation Notation and Series

#### Chebyshev's Inequality

Let $X$ be a random variable with mean $\mu$ and standard deviation $\sigma$. For any positive number $k$, Chebyshev's Inequality states:

$$
P(|X-\mu| \geq k\sigma) \leq \frac{1}{k^2}
$$

This provides a probabilistic bound on the deviation of $X$ from its mean and does not require $X$ to follow a normal distribution.

------------------------------------------------------------------------

#### Geometric Sum

For a geometric series of the form $\sum_{k=0}^{n-1} ar^k$, the sum is given by:

$$
\sum_{k=0}^{n-1} ar^k = a\frac{1-r^n}{1-r} \quad \text{where } r \neq 1
$$

#### Infinite Geometric Series

When $|r| < 1$, the geometric series converges to:

$$
\sum_{k=0}^\infty ar^k = \frac{a}{1-r}
$$

------------------------------------------------------------------------

#### Binomial Theorem

The binomial expansion for $(x + y)^n$ is:

$$
(x + y)^n = \sum_{k=0}^n \binom{n}{k} x^{n-k} y^k \quad \text{where } n \geq 0
$$

#### Binomial Series

For non-integer exponents $\alpha$:

$$
\sum_{k=0}^\infty \binom{\alpha}{k} x^k = (1 + x)^\alpha \quad \text{where } |x| < 1
$$

------------------------------------------------------------------------

#### Telescoping Sum

A telescoping sum simplifies as intermediate terms cancel, leaving:

$$
\sum_{a \leq k < b} \Delta F(k) = F(b) - F(a) \quad \text{where } a, b \in \mathbb{Z}, a \leq b
$$

------------------------------------------------------------------------

#### Vandermonde Convolution

The Vandermonde convolution identity is:

$$
\sum_{k=0}^n \binom{r}{k} \binom{s}{n-k} = \binom{r+s}{n} \quad \text{where } n \in \mathbb{Z}
$$

------------------------------------------------------------------------

#### Exponential Series

The exponential function $e^x$ can be represented as:

$$
\sum_{k=0}^\infty \frac{x^k}{k!} = e^x \quad \text{where } x \in \mathbb{C}
$$

------------------------------------------------------------------------

#### Taylor Series

The Taylor series expansion for a function $f(x)$ about $x=a$ is:

$$
\sum_{k=0}^\infty \frac{f^{(k)}(a)}{k!} (x-a)^k = f(x)
$$

For $a = 0$, this becomes the **Maclaurin series**.

------------------------------------------------------------------------

#### Maclaurin Series for $e^z$

A special case of the Taylor series, the Maclaurin expansion for $e^z$ is:

$$
e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots
$$

------------------------------------------------------------------------

#### Euler's Summation Formula

Euler's summation formula connects sums and integrals:

$$
\sum_{a \leq k < b} f(k) = \int_a^b f(x) \, dx + \sum_{k=1}^m \frac{B_k}{k!} \left[f^{(k-1)}(x)\right]_a^b 
+ (-1)^{m+1} \int_a^b \frac{B_m(x-\lfloor x \rfloor)}{m!} f^{(m)}(x) \, dx
$$

Here, $B_k$ are Bernoulli numbers.

-   **For** $m=1$ (Trapezoidal Rule):

$$
\sum_{a \leq k < b} f(k) \approx \int_a^b f(x) \, dx - \frac{1}{2}(f(b) - f(a))
$$

### Taylor Expansion

A differentiable function, $G(x)$, can be written as an infinite sum of its derivatives. More specifically, if $G(x)$ is infinitely differentiable and evaluated at $a$, its Taylor expansion is:

$$
G(x) = G(a) + \frac{G'(a)}{1!} (x-a) + \frac{G''(a)}{2!}(x-a)^2 + \frac{G'''(a)}{3!}(x-a)^3 + \dots
$$

This expansion is valid within the radius of convergence.

------------------------------------------------------------------------

### Law of Large Numbers

Let $X_1, X_2, \ldots$ be an infinite sequence of independent and identically distributed (i.i.d.) random variables with finite mean $\mu$ and variance $\sigma^2$. The **Law of Large Numbers (LLN)** states that the sample average:

$$
\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i
$$

converges to the expected value $\mu$ as $n \rightarrow \infty$. This can be expressed as:

$$
\bar{X}_n \rightarrow \mu \quad \text{(as $n \rightarrow \infty$)}.
$$

#### Variance of the Sample Mean

The variance of the sample mean decreases as the sample size increases:

$$
Var(\bar{X}_n) = Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right) = \frac{\sigma^2}{n}.
$$

$$ 
\begin{aligned}
Var(\bar{X}_n) &= Var(\frac{1}{n}(X_1 + ... + X_n)) =Var\left(\frac{1}{n} \sum_{i=1}^n X_i\right) \\
&= \frac{1}{n^2}Var(X_1 + ... + X_n) \\
&=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n} 
\end{aligned}
$$

**Note**: The connection between the [Law of Large Numbers] and the [Normal Distribution] lies in the [Central Limit Theorem]. The CLT states that, regardless of the original distribution of a dataset, the distribution of the sample means will tend to follow a normal distribution as the sample size becomes larger.

The difference between [Weak Law] and [Strong Law] regards the mode of convergence.

------------------------------------------------------------------------

#### Weak Law of Large Numbers

The **Weak Law of Large Numbers** states that the sample average converges in probability to the expected value:

$$
\bar{X}_n \xrightarrow{p} \mu \quad \text{as } n \rightarrow \infty.
$$

Formally, for any $\epsilon > 0$:

$$
\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0.
$$

Additionally, the sample mean of an i.i.d. random sample ($\{ X_i \}_{i=1}^n$) from any population with a finite mean and variance is a consistent estimator of the population mean $\mu$:

$$
plim(\bar{X}_n) = plim\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \mu.
$$

------------------------------------------------------------------------

#### Strong Law of Large Numbers

The **Strong Law of Large Numbers** states that the sample average converges almost surely to the expected value:

$$
\bar{X}_n \xrightarrow{a.s.} \mu \quad \text{as } n \rightarrow \infty.
$$

Equivalently, this can be expressed as:

$$
P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.
$$

------------------------------------------------------------------------

### Law of Iterated Expectation

The **Law of Iterated Expectation** states that for random variables $X$ and $Y$:

$$
E(X) = E(E(X|Y)).
$$

This means the expected value of $X$ can be obtained by first calculating the conditional expectation $E(X|Y)$ and then taking the expectation of this quantity over the distribution of $Y$.

### Convergence

#### Convergence in Probability

As $n \rightarrow \infty$, an estimator (random variable) $\theta_n$ is said to converge in probability to a constant $c$ if:

$$
\lim_{n \to \infty} P(|\theta_n - c| \geq \epsilon) = 0 \quad \text{for any } \epsilon > 0.
$$

This is denoted as:

$$
plim(\theta_n) = c \quad \text{or equivalently, } \theta_n \xrightarrow{p} c.
$$

------------------------------------------------------------------------

**Properties of Convergence in Probability:**

1.  **Slutsky's Theorem**: For a continuous function $g(\cdot)$, if $plim(\theta_n) = \theta$, then:

    $$
    plim(g(\theta_n)) = g(\theta)
    $$

2.  If $\gamma_n \xrightarrow{p} \gamma$, then:

    -   $plim(\theta_n + \gamma_n) = \theta + \gamma$,
    -   $plim(\theta_n \gamma_n) = \theta \gamma$,
    -   $plim(\theta_n / \gamma_n) = \theta / \gamma$ (if $\gamma \neq 0$).

3.  These properties extend to random vectors and matrices.

------------------------------------------------------------------------

#### Convergence in Distribution

As $n \rightarrow \infty$, the distribution of a random variable $X_n$ may converge to another ("fixed") distribution. Formally, $X_n$ with CDF $F_n(x)$ converges in distribution to $X$ with CDF $F(x)$ if:

$$
\lim_{n \to \infty} |F_n(x) - F(x)| = 0
$$

at all points of continuity of $F(x)$. This is denoted as:

$$
X_n \xrightarrow{d} X \quad \text{or equivalently, } F(x) \text{ is the limiting distribution of } X_n.
$$

**Asymptotic Properties:**

-   $E(X)$: Limiting mean (asymptotic mean).
-   $Var(X)$: Limiting variance (asymptotic variance).

**Note:** Limiting expectations and variances do not necessarily match the expectations and variances of $X_n$:

$$
\begin{aligned}
E(X) &\neq \lim_{n \to \infty} E(X_n), \\
Avar(X_n) &\neq \lim_{n \to \infty} Var(X_n).
\end{aligned}
$$

------------------------------------------------------------------------

**Properties of Convergence in Distribution:**

1.  **Continuous Mapping Theorem**: For a continuous function $g(\cdot)$, if $X_n \xrightarrow{d} X$, then:

    $$
    g(X_n) \xrightarrow{d} g(X).
    $$

2.  If $Y_n \xrightarrow{d} c$ (a constant), then:

    -   $X_n + Y_n \xrightarrow{d} X + c$,
    -   $Y_n X_n \xrightarrow{d} c X$,
    -   $X_n / Y_n \xrightarrow{d} X / c$ (if $c \neq 0$).

3.  These properties also extend to random vectors and matrices.

------------------------------------------------------------------------

#### Summary: Properties of Convergence

| Convergence in Probability                                                                                         | Convergence in Distribution                                                                                             |
|-------------------------------------|-----------------------------------|
| Slutsky's Theorem: For a continuous $g(\cdot)$, if $plim(\theta_n) = \theta$, then $plim(g(\theta_n)) = g(\theta)$ | Continuous Mapping Theorem: For a continuous $g(\cdot)$, if $X_n \xrightarrow{d} X$, then $g(X_n) \xrightarrow{d} g(X)$ |
| If $\gamma_n \xrightarrow{p} \gamma$, then:                                                                        | If $Y_n \xrightarrow{d} c$, then:                                                                                       |
| $plim(\theta_n + \gamma_n) = \theta + \gamma$                                                                      | $X_n + Y_n \xrightarrow{d} X + c$                                                                                       |
| $plim(\theta_n \gamma_n) = \theta \gamma$                                                                          | $Y_n X_n \xrightarrow{d} c X$                                                                                           |
| $plim(\theta_n / \gamma_n) = \theta / \gamma$ (if $\gamma \neq 0$)                                                 | $X_n / Y_n \xrightarrow{d} X / c$ (if $c \neq 0$)                                                                       |

**Relationship between Convergence Types:**

[Convergence in Probability] is stronger than [Convergence in Distribution]. Therefore:

-   [Convergence in Distribution] does not guarantee [Convergence in Probability].

### Sufficient Statistics and Likelihood

#### Likelihood

The **likelihood** describes the degree to which the observed data supports a particular value of a parameter $\theta$.

-   The exact value of the likelihood is **not meaningful**; only relative comparisons matter.
-   Likelihood is **informative** when comparing parameter values, helping identify which values of $\theta$ are more plausible given the data.

For a single observation $Y=y$, the likelihood function is:

$$
L(\theta_0; y) = P(Y = y | \theta = \theta_0) = f_Y(y; \theta_0)
$$

------------------------------------------------------------------------

#### Likelihood Ratio

The **likelihood ratio** compares the relative likelihood of two parameter values $\theta_0$ and $\theta_1$ given the data:

$$
\frac{L(\theta_0; y)}{L(\theta_1; y)}
$$

A likelihood ratio greater than 1 implies that $\theta_0$ is more likely than $\theta_1$, given the observed data.

------------------------------------------------------------------------

#### Likelihood Function

For a given sample, the likelihood for all possible values of $\theta$ forms the **likelihood function**:

$$
L(\theta) = L(\theta; y) = f_Y(y; \theta).
$$

For a sample of size $n$, assuming independence among observations:

$$
L(\theta) = \prod_{i=1}^{n} f_i(y_i; \theta).
$$

Taking the natural logarithm of the likelihood gives the **log-likelihood function**:

$$
l(\theta) = \sum_{i=1}^{n} \log f_i(y_i; \theta).
$$

The log-likelihood function is particularly useful in optimization problems, as logarithms convert products into sums, simplifying computation.

------------------------------------------------------------------------

#### Sufficient Statistics

A statistic $T(y)$ is **sufficient** for a parameter $\theta$ if it summarizes all the information in the data about $\theta$. Formally, by the **Factorization Theorem**, $T(y)$ is sufficient for $\theta$ if:

$$
L(\theta; y) = c(y) L^*(\theta; T(y)),
$$

where:

-   $c(y)$ is a function of the data independent of $\theta$.
-   $L^*(\theta; T(y))$ is a function that depends on $\theta$ and $T(y)$.

In other words, the likelihood function can be rewritten in terms of $T(y)$ alone, without loss of information about $\theta$.

**Example**:

For a sample of i.i.d. observations $Y_1, Y_2, \dots, Y_n$ from a normal distribution $N(\mu, \sigma^2)$:

-   The sample mean $\bar{Y}$ is sufficient for $\mu$.

-   The sufficient statistic conveys all the information about $\mu$ contained in the data.

------------------------------------------------------------------------

#### Nuisance Parameters

Parameters that are not of direct interest in the analysis but are necessary to model the data are called **nuisance parameters**.

**Profile Likelihood**: To handle nuisance parameters, replace them with their maximum likelihood estimates (MLEs) in the likelihood function, creating a **profile likelihood** for the parameter of interest.

------------------------------------------------------------------------

### Parameter Transformations

Transformations of parameters are often used to improve interpretability or statistical properties of models.

#### Log-Odds Transformation

The **log-odds transformation** is commonly used in logistic regression and binary classification problems. It transforms probabilities (which are bounded between 0 and 1) to the real line:

$$
\text{Log odds} = g(\theta) = \ln\left(\frac{\theta}{1-\theta}\right),
$$

where $\theta$ represents a probability (e.g., the success probability in a Bernoulli trial).

------------------------------------------------------------------------

#### General Parameter Transformations

For a parameter $\theta$ and a transformation $g(\cdot)$:

-   If $\theta \in (a, b)$, $g(\theta)$ may map $\theta$ to a different range (e.g., $\mathbb{R}$).
-   Useful transformations include:
    -   Logarithmic: $g(\theta) = \ln(\theta)$ for $\theta > 0$.
    -   Exponential: $g(\theta) = e^{\theta}$ for unconstrained $\theta$.
    -   Square root: $g(\theta) = \sqrt{\theta}$ for $\theta \geq 0$.

**Jacobian Adjustment for Transformations**: If transforming a parameter in Bayesian inference, the Jacobian of the transformation must be included to ensure proper posterior scaling.

------------------------------------------------------------------------

#### Applications of Parameter Transformations

1.  **Improving Interpretability**:
    -   Probabilities can be transformed to odds or log-odds for logistic models.
    -   Rates can be transformed logarithmically for multiplicative effects.
2.  **Statistical Modeling**:
    -   Variance-stabilizing transformations (e.g., log for Poisson data or arcsine for proportions).
    -   Regularization or simplification of complex relationships.
3.  **Optimization**:
    -   Transforming constrained parameters (e.g., probabilities or positive scales) to unconstrained scales simplifies optimization algorithms.