37262 - Mathematical Statistics


Simulation

Simulation

Process of generating sample instances \(X_i\) from some random variable \(X\), essentially simulating the behaviour of the random variable. It requires using a standard sampling technique (such as uniform sampling) and appropriately transforming the results to match the distribution of the random variable of interest

Convolution

Probability distribution formed by a finite, known linear combination (a sum whose terms may be scaled by constants, see Linear Algebra) of independent random variables

\( \sum_{i=1}^{n}\text{Bernoulli}(p) \sim \text{Bin}(n,p)\)

\( \sum_{i=1}^{r}\text{Geo}(p) \sim \text{NB}(r,p)\)

Acceptance-Rejection methods

Probability distribution formed by counting the number of independent random variables simulated until a criterion is met

For instance, a geometric variable can be realized by simulating Bernoulli variables until one realizes the value 1

Uniform sampling

Generating sample instances from \(U \sim \text{U}(0,1)\) (by a computer or otherwise) to be used in simulation

In the discrete case, such samples can be transformed to represent any random variable by mapping partitions of the range of \(U\) to partitions in the range of \(X\) such that

\( \text{Pr}(U \in [a,b]) = \text{Pr}(X \in [c,d]) \)

Inverse transform sampling

Cumulative distribution functions can be used to create such a partition

\(X_i= \min \{ k : \text{Pr}(X \leq k) \geq U_i \} = F_{X}^{-1}(U_i) \)
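
As a concrete illustration, a minimal sketch of discrete inverse transform sampling, assuming Python with NumPy; the PMF values and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_transform_discrete(pmf, n_samples):
    """Sample X_i = min{k : Pr(X <= k) >= U_i} for a PMF on {0, ..., K-1}."""
    cdf = np.cumsum(pmf)
    u = rng.uniform(size=n_samples)
    # searchsorted returns the first index k with cdf[k] >= u
    return np.searchsorted(cdf, u)

# Hypothetical loaded die on values 0..5
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
samples = inverse_transform_discrete(pmf, 10_000)
# relative frequencies of samples should approximate pmf
```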

Transforming uniformly sampled values for specific variables

Bernoulli simulation

For \(X \sim \text{Bernoulli}(p)\), one can trivially form the rule

\(\begin{cases} X_i=0 & U_i \geq p \\ X_i=1 & U_i \lt p \end{cases}\)

Binomial simulation

\(X \sim \text{Bin}(n,p)\)

This is a convolution of \(n\) Bernoulli variables; simulate each of the \(n\) Bernoullis and sum their values for the Binomial variable
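
A minimal sketch of both the Bernoulli rule above and the binomial convolution, assuming Python with NumPy; the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p, size):
    """Bernoulli rule: X_i = 1 when U_i < p, else 0."""
    return (rng.uniform(size=size) < p).astype(int)

def binomial_by_convolution(n, p, n_samples):
    """Bin(n, p) realized as the sum of n independent Bernoulli(p) values."""
    return bernoulli(p, size=(n_samples, n)).sum(axis=1)

x = binomial_by_convolution(n=10, p=0.3, n_samples=5_000)
# the sample mean should be close to n * p = 3
```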

Geometric simulation

\(X \sim \text{Geo}(p)\)

This is an acceptance-rejection technique on a sequence of Bernoulli variables; simulate Bernoullis until a 1 is reached, and the realized value is the number of Bernoulli variables simulated
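
A sketch of that counting scheme (under the convention that the support starts at 1, counting trials up to and including the first success), assuming Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric(p):
    """Geo(p): simulate Bernoulli(p) trials until a 1 is realized;
    return how many trials were simulated."""
    count = 1
    while rng.uniform() >= p:  # U >= p is a failure under the Bernoulli rule
        count += 1
    return count

samples = [geometric(0.25) for _ in range(10_000)]
# the sample mean should be close to 1 / p = 4
```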

Poisson simulation

\(X \sim \text{Pois}(\lambda)\)

This is an acceptance-rejection technique on a sequence of exponential variables; simulate \(\text{Exp}(\lambda)\) interarrival times until their running sum exceeds one unit of time, and the realized value is the number of arrivals before that point
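
A sketch under that interpretation, assuming Python with NumPy: exponential gaps are accumulated until they overrun one unit of time, and the number of arrivals that fit is the Poisson realization:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson(lam):
    """Pois(lam): count Exp(lam) interarrival times whose running
    sum stays within one unit of time."""
    total, count = 0.0, 0
    while True:
        total += rng.exponential(scale=1.0 / lam)  # next arrival gap
        if total > 1.0:
            return count
        count += 1

samples = [poisson(3.0) for _ in range(10_000)]
# the sample mean should be close to lam = 3
```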

Box-Muller method

Proposition that allows the transformation of uniform samples into standard normal samples

\(U_1 , U_2 \sim \text{U}(0,1) \land U_1 ,U_2 \text{ are independent } \implies \)

\( \xi_1=\sqrt{-2\ln(U_1)}\cos (2\pi U_2) \sim \text{N}(0,1) \land \xi_2=\sqrt{-2\ln(U_1)}\sin (2\pi U_2) \sim \text{N}(0,1) \text{ are independent} \)
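
A direct transcription of the proposition, assuming Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n_pairs):
    """Map independent U(0,1) pairs to independent N(0,1) pairs."""
    u1 = rng.uniform(size=n_pairs)
    u2 = rng.uniform(size=n_pairs)
    r = np.sqrt(-2.0 * np.log(u1))
    xi1 = r * np.cos(2.0 * np.pi * u2)
    xi2 = r * np.sin(2.0 * np.pi * u2)
    return xi1, xi2

xi1, xi2 = box_muller(10_000)
# both outputs should have mean near 0, variance near 1,
# and near-zero correlation with each other
```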

Multivariate distributions

Univariate distribution

Distribution of a single random variable \(X\), defined by its PMF (discrete) or PDF (continuous)

Random vector (Vector RV)

On a probability space \( (\Omega, \mathcal{F}, \text{Pr}) \), a column vector whose entries are random variables

\(\textbf{x} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\)

Expectation of functions

\( \text{E}[g(\textbf{x})] = \int_{\Omega} g[\textbf{x}(\omega)] \, d\text{Pr}(\omega ) \)

Expectation

\(\text{E}( \textbf{x} ) = \begin{bmatrix} \text{E} (X_1) \\ \text{E}(X_2) \\ \vdots \\ \text{E}(X_n) \end{bmatrix} \)

Variance

\(\text{Var}( \textbf{x} ) = \begin{bmatrix} \text{Var} (X_1) \\ \text{Var}(X_2) \\ \vdots \\ \text{Var}(X_n) \end{bmatrix} \)

Covariance

\( \text{cov}(\textbf{x} , \textbf{y}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{y} - \text{E}[\textbf{y}])^{T} ] \)

Autocovariance

\( \text{cov}(\textbf{x} , \textbf{x}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{x} - \text{E}[\textbf{x}])^{T} ] \)

\( \text{cov}(\textbf{x} , \textbf{x}) = \text{cov}(\textbf{x} , \textbf{x})^{T} \)

Multivariate distribution

Distribution of an ordered tuple of random variables or vector RV (these are isomorphic interpretations), such as \( (X,Y)\), defined by its JPMF (discrete) or JPDF (continuous)

Joint Probability Mass Function (JPMF)

Vector function \(f : \mathbb{R}^n \to [0,1] \) mapping the range of a discrete random vector to its probability

\( f_{\textbf{x}}(\textbf{u}) = \text{Pr}( \textbf{x} = \textbf{u} ) \)

Note that the use of vector RVs is for convenient notation

Properties

Joint Probability Density Function (JPDF)

Vector function \(f : \mathbb{R}^n \to \mathbb{R}_{+} \) mapping the range of a continuous random vector to its probability density

\( f_{\textbf{x}}(\textbf{u})\)

Properties

Joint Cumulative Distribution Function (JCDF)

Vector function \(F : \mathbb{R}^n \to [0,1] \) mapping the range of a continuous random vector to the probability that every component \(X_i\) is at most the corresponding \(u_i\)

\( \displaystyle F_{\textbf{x}} (\textbf{u}) = \text{Pr}\left(\bigwedge^{n}_{i=1} X_i \leq u_i\right) = \int^{u_1}_{-\infty} \cdots \int^{u_n}_{-\infty}f_{\textbf{x}}(x_1,\cdots,x_n )\, dx_n \cdots dx_1 \)

Conditional distribution

Distribution of a RV \(X\) given another RV \(Y\) has a known outcome

Conditional Probability Density Function (CPDF)

Function \(f : \mathbb{R}^2 \to \mathbb{R}_{+} \) mapping the range of a continuous scalar RV to its probability density given that the outcome of \(Y\) is known

\( f_{X|Y} (x|y) = \frac{f_{X,Y} (x,y) }{f_{Y}(y)} \)

Marginal distribution

Distribution of a RV \(X\) regardless of the outcome of all other RVs

Marginal Probability Mass Function (MPMF)

\( \displaystyle f_{X_1} (x_1) = \sum_{x_2 \in \text{Im}(X_2)} \cdots \sum_{x_n \in \text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n) \)

Marginal Probability Density Function (MPDF)

\( \displaystyle f_{X_1} (x_1) = \int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n)\, dx_n \cdots dx_2 \)

Marginal Cumulative Distribution Function (MCDF)

\( \displaystyle F_{X_1} (u) = \int^{u}_{-\infty}\int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n)\, dx_n \cdots dx_2 \, dx_1 \)

Multivariate independence

\(X,Y \text{ are independent } \iff f_{X,Y}(x,y) = f_{X}(x) f_{Y}(y) \)

With multivariate independence, marginal distributions equal the conditional distributions

Functions of random variables

A function \(u\) in terms of a tuple of random variables \((X,Y)\) can become its own random variable \((S,T)=u(X,Y)\), and its JPMF/JPDF can be found when:

\(u: (X,Y) \to (S,T) \text{ is injective}\)

Discrete

When \(X\) is discrete, form the PMF of \(Y=u(X)\) by mapping each of the original probabilities onto the transformed values

\(u: X \to Y \text{ is injective} \implies \text{Pr}(Y=u(k)) = \text{Pr}(X=k) \)

\(\text{Pr}(Y=u(k)) = \sum_{i\in \text{Im}(X) : u(i)=u(k)}\text{Pr}(X=i) \)

Continuous

When \(X\) is continuous, the concept remains the same; however, one forms the PDF by integration by substitution

\(u: (X,Y) \to (S,T) \text{ is injective} \implies f_{S,T}(s,t) = f_{X,Y}(u^{-1}_1(s,t), u^{-1}_2(s,t))\,|\det (J)|\), where \(J\) is the Jacobian of the inverse transformation, since \(\text{Pr}((X,Y) \in A) = \text{Pr}((S,T) \in u(A))\)

Gamma distribution

Continuous distribution; for integer \(\alpha\) it is the convolution of \(\alpha\) independent exponential variables, representing the waiting time until the \(\alpha\)-th event of a Poisson process occurs

\(X \sim \Gamma (\alpha,\beta)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{\beta^{\alpha} x^{\alpha - 1}e^{-\beta x}}{\Gamma(\alpha)} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)

Features

Beta distribution

\(X \sim \text{Beta}(\alpha,\beta)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{x^{\alpha -1}(1-x)^{\beta -1}}{\text{B}(\alpha,\beta)} & x \in [0,1] \\ 0 & x \notin [0,1] \end{cases}\)

Test statistic distributions

Normal distribution

Continuous, symmetric distribution representing a 'bell curve'

\(Z \sim \text{N}(\mu,\sigma^2)\)

PDF

\( \phi_{Z} (z) = \frac{e^{-\frac{(z-\mu)^2}{2 \sigma^2}}}{\sqrt{2 \pi \sigma^2}} , \quad z \in \mathbb{R} \)

Central Limit Theorem (CLT)

Theorem stating that as the sample size approaches infinity, the standardization of a random sample's mean converges to a standard normal distribution

Test statistics can therefore be formed by standardization and by creating functions of normal variables

\( X_n \text{ is a sequence of independent and identically distributed RVs with mean } \mu \text{ and finite variance } \sigma^2 \implies \frac{\overline{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}} \xrightarrow{d} \text{N}(0,1) \)
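
A quick empirical check of the statement, assuming Python with NumPy and an exponential population (where \(\mu = \sigma = 1\)); the sample size and replication count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 1_000, 2_000
# means of n i.i.d. Exp(1) draws, replicated many times
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (means - 1.0) / (1.0 / np.sqrt(n))  # standardize each mean
# if z is close to N(0,1), about 95% falls within (-1.96, 1.96)
coverage = np.mean(np.abs(z) < 1.96)
```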

Chi-squared distribution

Continuous distribution representing the sum of \(k\) independent, squared, standardized, normal variables.

\(X \sim \chi^2(k)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{1}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} x^{\frac{k}{2}-1}e^{-\frac{x}{2}} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)

Features

Pearson's Chi-squared Test

Test statistic for goodness-of-fit, which is asymptotically Chi-squared distributed by the CLT

Since chi-squared variables are sums of squared standardized terms, the fit of many categories of data can be examined at once, which is notably absent from T-tests, Z-tests and F-tests

\( \chi^2 = \sum_{i=1}^{k} \frac{ (O_i - \text{E} (X_i))^2 }{ \text{E}(X_i) } \)
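
A sketch of computing the statistic for a hypothetical fair-die test (the observed counts are made up for illustration), assuming Python with NumPy:

```python
import numpy as np

# hypothetical counts from 600 rolls of a die claimed to be fair
observed = np.array([90, 105, 95, 110, 100, 100])
expected = np.full(6, observed.sum() / 6)  # 100 per face under H0

chi2 = ((observed - expected) ** 2 / expected).sum()
# compare chi2 against a chi-squared critical value with
# k - 1 = 5 degrees of freedom
```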

Student's T distribution

Continuous distribution representing the ratio of a standardized normal variable to the square root of a chi-squared variable divided by its degrees of freedom.

It was discovered by William Sealy Gosset while working at Guinness

\(X \sim t(\nu)\)

PDF

\(f_{X}(x) = \frac{\Gamma (\frac{\nu + 1}{2} )}{ \sqrt{\pi \nu} \Gamma (\frac{\nu}{2} ) } (1+ \frac{x^2}{\nu})^{ - \frac{\nu +1}{2}} \)

Features

F distribution

Continuous distribution representing the ratio of two chi-squared variables divided by their degrees of freedom.

\(X \sim \text{F}(d_1,d_2)\)

PDF

\(f_{X}(x) = \frac{x^{\frac{d_1}{2}-1} (\frac{d_1}{d_2})^{\frac{d_1}{2}}}{(1+\frac{d_1 x}{d_2})^{\frac{d_1+d_2}{2}}\text{B}(\frac{d_1}{2},\frac{d_2}{2})} \)

Features

Probability theorems and Monte Carlo integration

Markov's inequality

For a nonnegative RV, an upper bound on tail probabilities; equivalently, a lower bound for the expected value

\(X \geq 0 \implies \text{E}(X) \geq a\text{Pr}(X \geq a)\)

Proof

\(\text{E}(X) = \int^{\infty}_{0} xf_{X}(x)dx \geq \int^{\infty}_{a} xf_{X}(x)dx \geq a\int^{\infty}_{a} f_{X}(x)dx = a \text{Pr}(X \geq a)\)

Chebyshev's inequality

Upper bound for the probability that a random variable's distance from its mean exceeds \(k\) standard deviations, found by applying Markov's inequality to a standardized random variable

\(\frac{1}{k^2} \geq \text{Pr}(|X- \text{E}(X)| \geq k\sqrt{\text{Var}(X)})\)

Cantelli's inequality

\(q \gt 0 \implies \text{Pr}(X - \text{E}(X) \geq q) \leq \frac{\text{Var}(X)}{q^2 +\text{Var}(X)}\)

Weak Law of Large Numbers (WLLN)

Theorem that the sample mean of an independent and identically distributed sequence of random variables converges in probability to their common mean

\(X_i \text{ are independent } \land X_i \text{ are all identically distributed } \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)

\(X_k \text{ are uncorrelated } \land \sup [ \text{Var}(X_k) ] \lt \infty \land \forall k ( \text{E}(X_k) = \mu ) \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)

Monte Carlo integration (MC integration)

Numerical integration technique based on sampling random variables; typically one samples from a uniform RV
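
A minimal sketch, assuming Python with NumPy: the integral of \(g\) over \([a,b]\) is estimated as \((b-a)\,\text{E}[g(U)]\) with \(U \sim \text{U}(a,b)\), which converges by the WLLN above:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(g, a, b, n):
    """Estimate the integral of g over [a, b] as (b - a) * mean(g(U_i))."""
    u = rng.uniform(a, b, size=n)
    return (b - a) * g(u).mean()

# example: the integral of x^2 over [0, 1] is 1/3
estimate = mc_integrate(lambda x: x**2, 0.0, 1.0, 100_000)
```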

Estimation

Estimation

Technique that estimates a distribution's parameters

This is in contrast to hypothesis testing, where an educated hypothesis about a distribution parameter is proposed and assessed at \(100(1-\alpha)\)% confidence

Moments

Functions relating to a random variable's (and distribution's) properties.

Letting \(n\) represent the order of the moment and \(X\) the random variable of interest, the \(n\)-th raw moment is \( \text{E}(X^n) \) and the \(n\)-th central moment is \( \text{E}[(X - \text{E}(X))^n] \)

Sample moments

Considering a set \( \{ x_1 ,x_2, ..., x_n \} \) of observations of \(X\), the \(k\)-th sample moment is \( \displaystyle m_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k \)

Method of moments

Parameter estimation technique: substitute sample moments for the theoretical moments and solve for the unknown parameters
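
For instance, if \(X \sim \text{Exp}(\lambda)\), then \(\text{E}(X) = \frac{1}{\lambda}\); substituting the first sample moment \(\overline{x}\) for \(\text{E}(X)\) gives \(\overline{x} = \frac{1}{\hat{\lambda}}\), so \(\hat{\lambda} = \frac{1}{\overline{x}}\)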

Likelihood function

Function representing the probability/probability density (for discrete and continuous cases respectively) of a random variable \(X\) returning a certain sequence of realizations if \(X\) were distributed by \(\text{D}(\boldsymbol{\theta})\)

\( \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) = \prod^{n}_{i=1} f_{X| \boldsymbol{\theta}}( x_i ) \)

Maximum Likelihood Estimation (MLE)

Parameter estimation technique: find an estimator \(\hat{\theta}\) for a distribution's parameters by maximizing the likelihood function with some sample

\( \hat{ \theta } = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) ] \)

Loglikelihood function

To facilitate optimization, the logarithm of the likelihood function is considered: it is easier to perform a derivative test on, and since the logarithm is monotone increasing, the loglikelihood is maximized at the same point as the likelihood function.

\( \ln \mathcal{L} ( \boldsymbol{\theta} |\textbf{x} ) = \sum^{n}_{i=1} \ln ( f_{X| \boldsymbol{\theta}}(x_i ) ) \)

\( \text{argmax}_{\boldsymbol{\theta}} [ \ln \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] \)
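
A sketch of MLE on a hypothetical Bernoulli sample, assuming Python with NumPy; the grid maximization of the loglikelihood should agree with the analytic answer \(\hat{p} = \overline{x}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sample: 200 Bernoulli(0.3) realizations
x = (rng.uniform(size=200) < 0.3).astype(int)
n, s = x.size, x.sum()

def loglik(p):
    """Bernoulli loglikelihood: s ln(p) + (n - s) ln(1 - p)."""
    return s * np.log(p) + (n - s) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
p_hat_numeric = grid[np.argmax(loglik(grid))]
p_hat_exact = s / n  # the argmax found analytically by a derivative test
```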

Linear regression

Bias

Bias

Measure of the bias of a statistic \(T\) that estimates the parameter value \(\theta\)

\(\text{bias}(T, \theta) = \text{E}(T) - \theta\)

\( T \text{ is a biased estimator of } \theta \iff \text{bias}(T,\theta) \neq 0\)

Unbiased estimators are desirable in that their sample values converge to the true value (WLLN); however, a biased estimator with smaller variance may have better utility, see Basu's Elephant

Examples

\( \text{bias}(s_1,\mu) = 0\)

\( \text{bias}(cs_2,\sigma^2) = \frac{\sigma^2}{n}\)

\( \text{bias}( b_n , \beta_n ) = 0 \)

Estimator comparison

Score function

Function quantifying the change in the loglikelihood with respect to the parameter \( \theta \)

\( \text{Score}( \theta | \textbf{x} ) = \frac{\partial \ln \mathcal{L} ( \theta | \textbf{x} ) }{\partial \theta} \)

Fisher information

Measure of information about the parameter value disclosed by the sample.

It is defined as the variance of the score. This definition is suitable since if one knows that the likelihood function is very sensitive when straying from some certain \(\theta\), then this gives stronger information about the true parameter setting

\( \mathcal{I}( \theta ) = \text{Var}[\text{Score}( \theta | \textbf{x} )] = - \text{E} \left( \frac{\partial^2 \ln \mathcal{L} (\theta | \textbf{x}) }{\partial \theta^2} \right) \)

Cramer-Rao Bound

For an unbiased estimator \(\hat{\theta}\): \( \text{Var}(\hat{\theta} ) \geq \frac{1}{\mathcal{I}(\theta)} \)

Efficiency

Measure inspired by the Cramer-Rao bound, representing what proportion of the bound an estimator yields

\( \text{eff}(\hat{ \theta }) = \frac{1}{\mathcal{I}(\theta) \text{Var}(\hat{\theta} ) } \)

Relative Efficiency

Ratio of variances; it is identical to the reciprocal ratio of efficiencies since the term \( \frac{1}{\mathcal{I}(\theta)}\) is fixed

\( \text{eff}(\hat{\theta}_1 , \hat{\theta}_2 ) = \frac{\text{Var}(\hat{\theta}_1) }{ \text{Var}( \hat{\theta}_2 ) } = \frac{ \text{eff}( \hat{\theta}_2 ) }{ \text{eff}( \hat{\theta}_1 ) }\)

Exponential family

Family of distributions whose PDF has an exponential factor of the form below; 'support' refers to the domain of a function minus the elements where \(f(x)=0\)

\(X \sim \text{D}(\boldsymbol{\theta} ) \text{ is in the exponential family } \iff f_{X}(x) = h(x) g( \boldsymbol{\theta} ) \exp ( \textbf{T}(x) \cdot \boldsymbol{\eta}( \boldsymbol{\theta} ) ) \text{ and the support of } f_{X} \text{ is independent of } \boldsymbol{\theta}\)

Properties

Markov Chain Monte Carlo (MCMC)

Equilibrium distributions

As the number of moves in a Markov chain approaches infinity, many Markov chains converge to an equilibrium distribution: the observed probability of being in each state stabilizes regardless of the starting state

Eigenequation technique

An equilibrium distribution \(\boldsymbol{\pi}\) solves the eigenequation \(\boldsymbol{\pi}^{T} P = \boldsymbol{\pi}^{T}\), that is, it is a left eigenvector of the transition matrix \(P\) with eigenvalue 1

Simulation technique

The use of simulations on Markov chains to predict equilibrium distributions. A naive simulation may be susceptible to inaccuracy, for example when the chain mixes slowly or has not yet reached equilibrium

Metropolis Algorithm

MCMC algorithm for modelling a probability distribution through an equilibrium distribution of a Markov chain

It requires some symmetric proposal distribution \(g\), i.e. \( g(x|x_p) = g(x_p |x) \), that proposes a legal move in the Markov chain; this move is accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) }{\text{Pr} (x) } \}\)

'Detailed balance' is assumed: \(\text{Pr}(x|x_p)\text{Pr}(x_p) = \text{Pr}(x_p | x)\text{Pr}(x)\), that is, probability flow between each pair of states is not directional. Then one considers the transition probability \( \text{Pr}(x_p |x ) = g( x_p | x) A(x_p | x) \), where \(A\) is the probability of the algorithm accepting the proposed move

This algorithm is useful in cases where a probability distribution is difficult to calculate, but a function proportional to it can easily be evaluated; see the sketch after the algorithm steps below

Algorithm

  1. Select a random starting state \( x_0 \)
  2. Propose a possible move \( x_p \) by \( g(x_{p}|x) \)
  3. Create a realization \( u \) from \( U \sim \text{U}(0,1) \)
  4. If \( u \lt \min \{ 1, \frac{\text{Pr} (x_{p}) }{\text{Pr} (x) } \} \), then accept the move and set the new state to \( x_p \); otherwise keep the current state
  5. Record the (possibly unchanged) state and then jump to step 2
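
A minimal sketch of the algorithm, assuming Python with NumPy, a Gaussian (hence symmetric) proposal, and an arbitrary unnormalized target proportional to \(\exp(-x^4)\); working on the log scale avoids under/overflow in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_p_unnorm, x0, step, n_samples):
    """Metropolis sampler; log_p_unnorm need only be proportional
    to the target's log-density."""
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_p = x + rng.normal(scale=step)           # symmetric proposal g
        log_ratio = log_p_unnorm(x_p) - log_p_unnorm(x)
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            x = x_p                                # accept the move
        samples[i] = x                             # record even on rejection
    return samples

# target density proportional to exp(-x^4); normalizing constant unknown
draws = metropolis(lambda x: -x**4, x0=0.0, step=1.0, n_samples=20_000)
```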

Derivation

Assume the probability distribution has detailed balance \(\text{Pr}(x | x_p ) \text{Pr}( x_p ) = \text{Pr}(x_p | x) \text{Pr}( x ) \). The transition probability is related to the proposal distribution by \( \text{Pr}(x | x_p ) = g( x | x_p )A( x | x_p ) \), where \(A\) is the probability of the algorithm accepting the proposed move

Substitution into the detailed balance equation leads to the acceptance probability for the Metropolis-Hastings algorithm:

\( \frac{A(x_p | x )}{A( x | x_p )} = \frac{ \text{Pr} ( x_p ) g( x | x_p ) }{ \text{Pr} ( x ) g( x_p | x ) } \)

Applying the symmetry condition \(g(x|x_p) = g(x_p|x)\) then provides the desired acceptance probability for the Metropolis algorithm.

Metropolis-Hastings Algorithm

Variant of the Metropolis algorithm that relaxes the symmetry requirement of the proposal distribution, hence proposed moves are accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) g (x|x_p) }{\text{Pr} (x) g (x_p|x) } \}\)

MCMC Diagnostics

Markov chains that are aperiodic and ergodic at every state are ideal for MCMC. A lack of obvious large-scale patterns in the sampled trace is a sign of good mixing, that is, of close-to-ergodic conditions

Bayesian inference

Statistical technique of estimating the parameter \(\theta\) of an RV \(X\) with a known distribution family, under an assumption of independent realizations, through Bayes' theorem

\( f_{\Theta | \textbf{x}} ( \theta ) = \frac{ \mathcal{L}(\theta | \textbf{x}) f_{\Theta}(\theta) }{ f_{\textbf{x}} ( \textbf{x} ) } \)

By the law of total probability, \(\displaystyle f_{\textbf{x}}(\textbf{x}) = \int_{ \text{Im}(\Theta) } \mathcal{L}(\theta | \textbf{x} )f_{\Theta}(\theta)\, d\theta\), however this is often a highly nontrivial calculation

\(\textbf{x}\) is known through sampling, \(f_{\Theta}\) is set by the analyst, and \(f_{X | \Theta = \theta}\) is implied by the assumed distribution family.

Prior distribution

Probability distribution representing the assumed distribution of the parameter before evidence is observed

Posterior distribution

Probability distribution representing the prior distribution updated with evidence using Bayes' theorem

Conjugate prior

Prior distribution such that under Bayesian inference the posterior distribution is of the same class, albeit with updated parameters
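
For instance, a \(\text{Beta}(\alpha, \beta)\) prior is conjugate for a Bernoulli likelihood: after observing \(s\) successes in \(n\) trials, \( f_{\Theta | \textbf{x}}(\theta) \propto \theta^{s}(1-\theta)^{n-s} \cdot \theta^{\alpha - 1}(1-\theta)^{\beta - 1} \), which is the kernel of \(\text{Beta}(\alpha + s, \beta + n - s)\)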