37262 - Mathematical Statistics


Simulation

Simulation

Process of generating sample instances \(X_i\) from some random variable \(X\), essentially simulating the behaviour of the random variable. It requires using a standard sampling technique (such as uniform sampling) and appropriately transforming the results to match the distribution of the random variable of interest

Convolution

Probability distribution formed by a finite, known linear combination (a sum whose terms may be scaled by constants, see Linear Algebra) of independent random variables

\( \sum_{i=1}^{n}\text{Bernoulli}(p) \sim \text{Bin}(n,p)\)

\( \sum_{i=1}^{r}\text{Geo}(p) \sim \text{NB}(r,p)\)

Acceptance-Rejection methods

Probability distribution formed by counting the number of independent random variables simulated until a criterion is met

For instance, a geometric variable can be realized by simulating Bernoulli variables until one realizes the value 1

Uniform sampling

Generating sample instances from \(U \sim \text{U}(0,1)\) (by a computer or otherwise) to be used in simulation

In the discrete case, such samples can be transformed to represent any random variable by mapping partitions of the range of \(U\) to partitions in the range of \(X\) such that

\( \text{Pr}(U \in [a,b]) = \text{Pr}(X \in [c,d]) \)

Inverse transform sampling

Cumulative distribution functions can be used to create such a partition

\(X_i= \min \{ k : \text{Pr}(X \leq k) \geq U_i \} = F_{X}^{-1}(U_i) \)
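
As a concrete illustration, a minimal sketch of discrete inverse transform sampling, assuming Python with NumPy; the PMF values and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_transform_discrete(pmf, n_samples):
    """Sample X_i = min{k : Pr(X <= k) >= U_i} for a PMF on {0, ..., K-1}."""
    cdf = np.cumsum(pmf)
    u = rng.uniform(size=n_samples)
    # searchsorted returns the first index k with cdf[k] >= u
    return np.searchsorted(cdf, u)

# Hypothetical loaded die on values 0..5
pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
samples = inverse_transform_discrete(pmf, 10_000)
# relative frequencies of samples should approximate pmf
```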

Transforming uniformly sampled values for specific variables

Bernoulli simulation

For \(X \sim \text{Bernoulli}(p)\), one can trivially form the rule

\(\begin{cases} X_i=0 & U_i \geq p \\ X_i=1 & U_i \lt p \end{cases}\)

Binomial simulation

\(X \sim \text{Bin}(n,p)\)

This is a convolution of \(n\) Bernoulli variables; simulate each of the \(n\) Bernoullis and sum their values for the Binomial variable
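
A minimal sketch of both the Bernoulli rule above and the binomial convolution, assuming Python with NumPy; the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(p, size):
    """Bernoulli rule: X_i = 1 when U_i < p, else 0."""
    return (rng.uniform(size=size) < p).astype(int)

def binomial_by_convolution(n, p, n_samples):
    """Bin(n, p) realized as the sum of n independent Bernoulli(p) values."""
    return bernoulli(p, size=(n_samples, n)).sum(axis=1)

x = binomial_by_convolution(n=10, p=0.3, n_samples=5_000)
# the sample mean should be close to n * p = 3
```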

Geometric simulation

\(X \sim \text{Geo}(p)\)

This is an acceptance-rejection technique on a sequence of Bernoulli variables; simulate Bernoullis until a 1 is reached, and the realized value is the number of Bernoulli variables simulated
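
A sketch of that counting scheme (under the convention that the support starts at 1, counting trials up to and including the first success), assuming Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric(p):
    """Geo(p): simulate Bernoulli(p) trials until a 1 is realized;
    return how many trials were simulated."""
    count = 1
    while rng.uniform() >= p:  # U >= p is a failure under the Bernoulli rule
        count += 1
    return count

samples = [geometric(0.25) for _ in range(10_000)]
# the sample mean should be close to 1 / p = 4
```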

Poisson simulation

\(X \sim \text{Pois}(\lambda)\)

This is an acceptance-rejection technique on a sequence of exponential variables; simulate \(\text{Exp}(\lambda)\) interarrival times until their running sum exceeds one unit of time, and the realized value is the number of arrivals before that point
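
A sketch under that interpretation, assuming Python with NumPy: exponential gaps are accumulated until they overrun one unit of time, and the number of arrivals that fit is the Poisson realization:

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson(lam):
    """Pois(lam): count Exp(lam) interarrival times whose running
    sum stays within one unit of time."""
    total, count = 0.0, 0
    while True:
        total += rng.exponential(scale=1.0 / lam)  # next arrival gap
        if total > 1.0:
            return count
        count += 1

samples = [poisson(3.0) for _ in range(10_000)]
# the sample mean should be close to lam = 3
```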

Box-Muller method

Proposition that allows the transformation of uniform samples into standard normal samples

\(U_1 , U_2 \sim \text{U}(0,1) \land U_1 ,U_2 \text{ are independent } \implies \)

\( \xi_1=\sqrt{-2\ln(U_1)}\cos (2\pi U_2) \sim \text{N}(0,1) \land \xi_2=\sqrt{-2\ln(U_1)}\sin (2\pi U_2) \sim \text{N}(0,1) \text{ are independent} \)
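
A direct transcription of the proposition, assuming Python with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n_pairs):
    """Map independent U(0,1) pairs to independent N(0,1) pairs."""
    u1 = rng.uniform(size=n_pairs)
    u2 = rng.uniform(size=n_pairs)
    r = np.sqrt(-2.0 * np.log(u1))
    xi1 = r * np.cos(2.0 * np.pi * u2)
    xi2 = r * np.sin(2.0 * np.pi * u2)
    return xi1, xi2

xi1, xi2 = box_muller(10_000)
# both outputs should have mean near 0, variance near 1,
# and near-zero correlation with each other
```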

Multivariate distributions

Univariate distribution

Distribution of a single random variable \(X\), defined by its PMF (discrete) or PDF (continuous)

Random vector (Vector RV)

On a probability space \( (\Omega, \mathcal{F}, \text{Pr}) \), a column vector whose entries are random variables

\(\textbf{x} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\)

Expectation of functions

\( \text{E}[g(\textbf{x})] = \int_{\Omega} g[\textbf{x}(\omega)] \, d\text{Pr}(\omega ) \)

Expectation

\(\text{E}( \textbf{x} ) = \begin{bmatrix} \text{E} (X_1) \\ \text{E}(X_2) \\ \vdots \\ \text{E}(X_n) \end{bmatrix} \)

Variance

\(\text{Var}( \textbf{x} ) = \begin{bmatrix} \text{Var} (X_1) \\ \text{Var}(X_2) \\ \vdots \\ \text{Var}(X_n) \end{bmatrix} \)

Covariance

\( \text{cov}(\textbf{x} , \textbf{y}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{y} - \text{E}[\textbf{y}])^{T} ] \)

Autocovariance

\( \text{cov}(\textbf{x} , \textbf{x}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{x} - \text{E}[\textbf{x}])^{T} ] \)

\( \text{cov}(\textbf{x} , \textbf{x}) = \text{cov}(\textbf{x} , \textbf{x})^{T} \)

Multivariate distribution

Distribution of an ordered tuple of random variables or vector RV (these are isomorphic interpretations), such as \( (X,Y)\), defined by its JPMF (discrete) or JPDF (continuous)

Joint Probability Mass Function (JPMF)

Vector function \(f : \mathbb{R}^n \to [0,1] \) mapping the range of a discrete random vector to its probability

\( f_{\textbf{x}}(\textbf{u}) = \text{Pr}( \textbf{x} = \textbf{u} ) \)

Note that the use of vector RVs is for convenient notation

Properties

Joint Probability Density Function (JPDF)

Vector function \(f : \mathbb{R}^n \to \mathbb{R}_{+} \) mapping the range of a continuous random vector to its probability density

\( f_{\textbf{x}}(\textbf{u})\)

Properties

Joint Cumulative Distribution Function (JCDF)

Vector function \(F : \mathbb{R}^n \to [0,1] \) mapping the range of a continuous random vector to the probability that every component \(X_i\) is at most the corresponding \(u_i\)

\( \displaystyle F_{\textbf{x}} (\textbf{u}) = \text{Pr}\left(\bigwedge^{n}_{i=1} X_i \leq u_i\right) = \int^{u_1}_{-\infty} \cdots \int^{u_n}_{-\infty}f_{\textbf{x}}(x_1,\cdots,x_n )\, dx_n \cdots dx_1 \)

Conditional distribution

Distribution of a RV \(X\) given another RV \(Y\) has a known outcome

Conditional Probability Density Function (CPDF)

Function \(f : \mathbb{R}^2 \to \mathbb{R}_{+} \) mapping the range of a continuous scalar RV to its probability density given that the outcome of \(Y\) is known

\( f_{X|Y} (x|y) = \frac{f_{X,Y} (x,y) }{f_{Y}(y)} \)

Marginal distribution

Distribution of a RV \(X\) regardless of the outcome of all other RVs

Marginal Probability Mass Function (MPMF)

\( \displaystyle f_{X_1} (x_1) = \sum_{x_2 \in \text{Im}(X_2)} \cdots \sum_{x_n \in \text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n) \)

Marginal Probability Density Function (MPDF)

\( \displaystyle f_{X_1} (x_1) = \int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n)\, dx_n \cdots dx_2 \)

Marginal Cumulative Distribution Function (MCDF)

\( \displaystyle F_{X_1} (u) = \int^{u}_{-\infty}\int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots, x_n)\, dx_n \cdots dx_2 \, dx_1 \)

Multivariate independence

\(X,Y \text{ are independent } \iff f_{X,Y}(x,y) = f_{X}(x) f_{Y}(y) \)

With multivariate independence, marginal distributions equal the conditional distributions

Functions of random variables

A function \(u\) in terms of a tuple of random variables \((X,Y)\) can become its own random variable \((S,T)=u(X,Y)\), and its JPMF/JPDF can be found when:

\(u: (X,Y) \to (S,T) \text{ is injective}\)

Discrete

When \(X\) is discrete, form the PMF of \(Y=u(X)\) by mapping each of the original probabilities onto the transformed values

\(u: X \to Y \text{ is injective} \implies \text{Pr}(Y=u(k)) = \text{Pr}(X=k) \)

\(\text{Pr}(Y=u(k)) = \sum_{i\in \text{Im}(X) : u(i)=u(k)}\text{Pr}(X=i) \)

Continuous

When \(X\) is continuous, the concept remains the same; however, one forms the PDF by integration by substitution

\(u: (X,Y) \to (S,T) \text{ is injective} \implies f_{S,T}(s,t) = f_{X,Y}(u^{-1}_1(s,t), u^{-1}_2(s,t))\,|\det (J)|\), where \(J\) is the Jacobian of the inverse transformation, since \(\text{Pr}((X,Y) \in A) = \text{Pr}((S,T) \in u(A))\)

Gamma distribution

Continuous distribution; for integer \(\alpha\) it is the convolution of \(\alpha\) independent exponential variables, representing the waiting time until the \(\alpha\)-th event of a Poisson process occurs

\(X \sim \Gamma (\alpha,\beta)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{\beta^{\alpha} x^{\alpha - 1}e^{-\beta x}}{\Gamma(\alpha)} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)

Features

Beta distribution

\(X \sim \text{Beta}(\alpha,\beta)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{x^{\alpha -1}(1-x)^{\beta -1}}{\text{B}(\alpha,\beta)} & x \in [0,1] \\ 0 & x \notin [0,1] \end{cases}\)

Test statistic distributions

Normal distribution

Continuous, symmetric distribution representing a 'bell curve'

\(Z \sim \text{N}(\mu,\sigma^2)\)

PDF

\( \phi_{Z} (z) = \frac{e^{-\frac{(z-\mu)^2}{2 \sigma^2}}}{\sqrt{2 \pi \sigma^2}} , \quad z \in \mathbb{R} \)

Central Limit Theorem (CLT)

Theorem stating that as the sample size approaches infinity, the standardization of a random sample's mean converges to a standard normal distribution

Test statistics can therefore be formed by standardization and by creating functions of normal variables

\( X_n \text{ is a sequence of independent and identically distributed RVs with mean } \mu \text{ and finite variance } \sigma^2 \implies \frac{\overline{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}} \xrightarrow{d} \text{N}(0,1) \)
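
A quick empirical check of the statement, assuming Python with NumPy and an exponential population (where \(\mu = \sigma = 1\)); the sample size and replication count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 1_000, 2_000
# means of n i.i.d. Exp(1) draws, replicated many times
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (means - 1.0) / (1.0 / np.sqrt(n))  # standardize each mean
# if z is close to N(0,1), about 95% falls within (-1.96, 1.96)
coverage = np.mean(np.abs(z) < 1.96)
```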

Chi-squared distribution

Continuous distribution representing the sum of \(k\) independent, squared, standardized, normal variables.

\(X \sim \chi^2(k)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{1}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} x^{\frac{k}{2}-1}e^{-\frac{x}{2}} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)

Features

Pearson's Chi-squared Test

Test statistic for goodness-of-fit, which is asymptotically Chi-squared distributed by the CLT

Since chi-squared variables are sums of squared standardized terms, the fit of many categories of data can be examined at once, which is notably absent from T-tests, Z-tests and F-tests

\( \chi^2 = \sum_{i=1}^{k} \frac{ (O_i - \text{E} (X_i))^2 }{ \text{E}(X_i) } \)
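
A sketch of computing the statistic for a hypothetical fair-die test (the observed counts are made up for illustration), assuming Python with NumPy:

```python
import numpy as np

# hypothetical counts from 600 rolls of a die claimed to be fair
observed = np.array([90, 105, 95, 110, 100, 100])
expected = np.full(6, observed.sum() / 6)  # 100 per face under H0

chi2 = ((observed - expected) ** 2 / expected).sum()
# compare chi2 against a chi-squared critical value with
# k - 1 = 5 degrees of freedom
```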

Student's T distribution

Continuous distribution representing the ratio of a standardized normal variable to the square root of a chi-squared variable divided by its degrees of freedom.

It was discovered by William Sealy Gosset while working at Guinness

\(X \sim t(\nu)\)

PDF

\(f_{X}(x) = \frac{\Gamma (\frac{\nu + 1}{2} )}{ \sqrt{\pi \nu} \Gamma (\frac{\nu}{2} ) } (1+ \frac{x^2}{\nu})^{ - \frac{\nu +1}{2}} \)

Features

F distribution

Continuous distribution representing the ratio of two chi-squared variables divided by their degrees of freedom.

\(X \sim \text{F}(d_1,d_2)\)

PDF

\(f_{X}(x) = \frac{x^{\frac{d_1}{2}-1} (\frac{d_1}{d_2})^{\frac{d_1}{2}}}{(1+\frac{d_1 x}{d_2})^{\frac{d_1+d_2}{2}}\text{B}(\frac{d_1}{2},\frac{d_2}{2})} \)

Features

Probability theorems and Monte Carlo integration

Markov's inequality

For a nonnegative RV, an upper bound on tail probabilities; equivalently, a lower bound for the expected value

\(X \geq 0 \implies \text{E}(X) \geq a\text{Pr}(X \geq a)\)

Proof

\(\text{E}(X) = \int^{\infty}_{0} xf_{X}(x)dx \geq \int^{\infty}_{a} xf_{X}(x)dx \geq a\int^{\infty}_{a} f_{X}(x)dx = a \text{Pr}(X \geq a)\)

Chebyshev's inequality

Upper bound for the probability that a random variable's distance from its mean exceeds \(k\) standard deviations, found by applying Markov's inequality to a standardized random variable

\(\frac{1}{k^2} \geq \text{Pr}(|X- \text{E}(X)| \geq k\sqrt{\text{Var}(X)})\)

Cantelli's inequality

\(q \gt 0 \implies \text{Pr}(X - \text{E}(X) \geq q) \leq \frac{\text{Var}(X)}{q^2 +\text{Var}(X)}\)

Weak Law of Large Numbers (WLLN)

Theorem that the sample mean of an independent and identically distributed sequence of random variables converges in probability to their common mean

\(X_i \text{ are independent } \land X_i \text{ are all identically distributed } \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)

\(X_k \text{ are uncorrelated } \land \sup [ \text{Var}(X_k) ] \lt \infty \land \forall k ( \text{E}(X_k) = \mu ) \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)

Monte Carlo integration (MC integration)

Numerical integration technique based on sampling random variables; typically one samples from a uniform RV
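
A minimal sketch, assuming Python with NumPy: the integral of \(g\) over \([a,b]\) is estimated as \((b-a)\,\text{E}[g(U)]\) with \(U \sim \text{U}(a,b)\), which converges by the WLLN above:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_integrate(g, a, b, n):
    """Estimate the integral of g over [a, b] as (b - a) * mean(g(U_i))."""
    u = rng.uniform(a, b, size=n)
    return (b - a) * g(u).mean()

# example: the integral of x^2 over [0, 1] is 1/3
estimate = mc_integrate(lambda x: x**2, 0.0, 1.0, 100_000)
```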

Estimation

Estimation

Technique that estimates a distribution's parameters

This is in contrast to hypothesis testing, where an educated hypothesis about a distribution parameter is proposed and assessed at \(100(1-\alpha)\)% confidence

Moments

Functions relating to a random variable's (and distribution's) properties.

Letting \(n\) represent the order of the moment and \(X\) the random variable of interest, the \(n\)-th raw moment is \( \text{E}(X^n) \) and the \(n\)-th central moment is \( \text{E}[(X - \text{E}(X))^n] \)

Sample moments

Considering a set \( \{ x_1 ,x_2, ..., x_n \} \) of observations of \(X\), the \(k\)-th sample moment is \( \displaystyle m_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k \)

Method of moments

Parameter estimation technique: substitute sample moments for the theoretical moments and solve for the unknown parameters
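
For instance, if \(X \sim \text{Exp}(\lambda)\), then \(\text{E}(X) = \frac{1}{\lambda}\); substituting the first sample moment \(\overline{x}\) for \(\text{E}(X)\) gives \(\overline{x} = \frac{1}{\hat{\lambda}}\), so \(\hat{\lambda} = \frac{1}{\overline{x}}\)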

Likelihood function

Function representing the probability/probability density (for discrete and continuous cases respectively) of a random variable \(X\) returning a certain sequence of realizations if \(X\) were distributed by \(\text{D}(\boldsymbol{\theta})\)

\( \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) = \prod^{n}_{i=1} f_{X| \boldsymbol{\theta}}( x_i ) \)

Maximum Likelihood Estimation (MLE)

Parameter estimation technique: find an estimator \(\hat{\theta}\) for a distribution's parameters by maximizing the likelihood function with some sample

\( \hat{ \theta } = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) ] \)

Loglikelihood function

To facilitate optimization, the logarithm of the likelihood function is considered: it is easier to perform a derivative test on, and since the logarithm is monotone increasing, the loglikelihood is maximized at the same point as the likelihood function.

\( \ln \mathcal{L} ( \boldsymbol{\theta} |\textbf{x} ) = \sum^{n}_{i=1} \ln ( f_{X| \boldsymbol{\theta}}(x_i ) ) \)

\( \text{argmax}_{\boldsymbol{\theta}} [ \ln \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] \)
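
A sketch of MLE on a hypothetical Bernoulli sample, assuming Python with NumPy; the grid maximization of the loglikelihood should agree with the analytic answer \(\hat{p} = \overline{x}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical sample: 200 Bernoulli(0.3) realizations
x = (rng.uniform(size=200) < 0.3).astype(int)
n, s = x.size, x.sum()

def loglik(p):
    """Bernoulli loglikelihood: s ln(p) + (n - s) ln(1 - p)."""
    return s * np.log(p) + (n - s) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
p_hat_numeric = grid[np.argmax(loglik(grid))]
p_hat_exact = s / n  # the argmax found analytically by a derivative test
```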

Linear regression

Bias

Bias

Measure of the bias of a statistic \(T\) that estimates the parameter value \(\theta\)

\(\text{bias}(T, \theta) = \text{E}(T) - \theta\)

\( T \text{ is a biased estimator of } \theta \iff \text{bias}(T,\theta) \neq 0\)

Unbiased estimators are desirable in that their sample values converge to the true value (WLLN); however, a biased estimator with smaller variance may have better utility, see Basu's Elephant

Examples

\( \text{bias}(s_1,\mu) = 0\)

\( \text{bias}(cs_2,\sigma^2) = \frac{\sigma^2}{n}\)

\( \text{bias}( b_n , \beta_n ) = 0 \)

Estimator comparison

Score function

Function quantifying the change in the loglikelihood with respect to the parameter \( \theta \)

\( \text{Score}( \theta | \textbf{x} ) = \frac{\partial \ln \mathcal{L} ( \theta | \textbf{x} ) }{\partial \theta} \)

Fisher information

Measure of information about the parameter value disclosed by the sample.

It is defined as the variance of the score. This definition is suitable since if one knows that the likelihood function is very sensitive when straying from some certain \(\theta\), then this gives stronger information about the true parameter setting

\( \mathcal{I}( \theta ) = \text{Var}[\text{Score}( \theta | \textbf{x} )] = - \text{E} \left( \frac{\partial^2 \ln \mathcal{L} (\theta | \textbf{x}) }{\partial \theta^2} \right) \)

Cramer-Rao Bound

For an unbiased estimator \(\hat{\theta}\): \( \text{Var}(\hat{\theta} ) \geq \frac{1}{\mathcal{I}(\theta)} \)

Efficiency

Measure inspired by the Cramer-Rao bound, representing what proportion of the bound an estimator yields

\( \text{eff}(\hat{ \theta }) = \frac{1}{\mathcal{I}(\theta) \text{Var}(\hat{\theta} ) } \)

Relative Efficiency

Ratio of variances; it is identical to the reciprocal ratio of efficiencies since the term \( \frac{1}{\mathcal{I}(\theta)}\) is fixed

\( \text{eff}(\hat{\theta}_1 , \hat{\theta}_2 ) = \frac{\text{Var}(\hat{\theta}_1) }{ \text{Var}( \hat{\theta}_2 ) } = \frac{ \text{eff}( \hat{\theta}_2 ) }{ \text{eff}( \hat{\theta}_1 ) }\)

Exponential family

Family of distributions whose PDF has an exponential factor of the form below; 'support' refers to the domain of a function minus the elements where \(f(x)=0\)

\(X \sim \text{D}(\boldsymbol{\theta} ) \text{ is in the exponential family } \iff f_{X}(x) = h(x) g( \boldsymbol{\theta} ) \exp ( \textbf{T}(x) \cdot \boldsymbol{\eta}( \boldsymbol{\theta} ) ) \text{ and the support of } f_{X} \text{ is independent of } \boldsymbol{\theta}\)

Properties

Markov Chain Monte Carlo (MCMC)

Equilibrium distributions

As the number of moves in a Markov chain approaches infinity, many Markov chains converge to an equilibrium distribution: the observed probability of being in each state stabilizes regardless of the starting state

Eigenequation technique

An equilibrium distribution \(\boldsymbol{\pi}\) solves the eigenequation \(\boldsymbol{\pi}^{T} P = \boldsymbol{\pi}^{T}\), that is, it is a left eigenvector of the transition matrix \(P\) with eigenvalue 1

Simulation technique

The use of simulations on Markov chains to predict equilibrium distributions. A naive simulation may be susceptible to inaccuracy, for example when the chain mixes slowly or has not yet reached equilibrium

Metropolis Algorithm

MCMC algorithm for modelling a probability distribution through an equilibrium distribution of a Markov chain

It requires some symmetric proposal distribution \(g\), i.e. \( g(x|x_p) = g(x_p |x) \), that proposes a legal move in the Markov chain; this move is accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) }{\text{Pr} (x) } \}\)

'Detailed balance' is assumed: \(\text{Pr}(x|x_p)\text{Pr}(x_p) = \text{Pr}(x_p | x)\text{Pr}(x)\), that is, probability flow between each pair of states is not directional. Then one considers the transition probability \( \text{Pr}(x_p |x ) = g( x_p | x) A(x_p | x) \), where \(A\) is the probability of the algorithm accepting the proposed move

This algorithm is useful in cases where a probability distribution is difficult to calculate, but a function proportional to it can easily be evaluated; see the sketch after the algorithm steps below

Algorithm

  1. Select a random starting state \( x_0 \)
  2. Propose a possible move \( x_p \) by \( g(x_{p}|x) \)
  3. Create a realization \( u \) from \( U \sim \text{U}(0,1) \)
  4. If \( u \lt \min \{ 1, \frac{\text{Pr} (x_{p}) }{\text{Pr} (x) } \} \), then accept the move and set the new state to \( x_p \); otherwise keep the current state
  5. Record the (possibly unchanged) state and then jump to step 2
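
A minimal sketch of the algorithm, assuming Python with NumPy, a Gaussian (hence symmetric) proposal, and an arbitrary unnormalized target proportional to \(\exp(-x^4)\); working on the log scale avoids under/overflow in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_p_unnorm, x0, step, n_samples):
    """Metropolis sampler; log_p_unnorm need only be proportional
    to the target's log-density."""
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_p = x + rng.normal(scale=step)           # symmetric proposal g
        log_ratio = log_p_unnorm(x_p) - log_p_unnorm(x)
        if np.log(rng.uniform()) < min(0.0, log_ratio):
            x = x_p                                # accept the move
        samples[i] = x                             # record even on rejection
    return samples

# target density proportional to exp(-x^4); normalizing constant unknown
draws = metropolis(lambda x: -x**4, x0=0.0, step=1.0, n_samples=20_000)
```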

Derivation

Assume the probability distribution has detailed balance \(\text{Pr}(x | x_p ) \text{Pr}( x_p ) = \text{Pr}(x_p | x) \text{Pr}( x ) \). The transition probability is related to the proposal distribution by \( \text{Pr}(x | x_p ) = g( x | x_p )A( x | x_p ) \), where \(A\) is the probability of the algorithm accepting the proposed move

Substitution into the detailed balance equation leads to the acceptance probability for the Metropolis-Hastings algorithm:

\( \frac{A(x_p | x )}{A( x | x_p )} = \frac{ \text{Pr} ( x_p ) g( x | x_p ) }{ \text{Pr} ( x ) g( x_p | x ) } \)

Applying the symmetry condition \(g(x|x_p) = g(x_p|x)\) then provides the desired acceptance probability for the Metropolis algorithm.

Metropolis-Hastings Algorithm

Variant of the Metropolis algorithm that relaxes the symmetry requirement of the proposal distribution, hence proposed moves are accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) g (x|x_p) }{\text{Pr} (x) g (x_p|x) } \}\)

MCMC Diagnostics

Markov chains that are aperiodic and ergodic at every state are ideal for MCMC. A lack of obvious large-scale patterns in the sampled trace is a sign of good mixing, that is, of close-to-ergodic conditions

Bayesian inference

Statistical technique of estimating the parameter \(\theta\) of an RV \(X\) with a known distribution family, under an assumption of independent realizations, through Bayes' theorem

\( f_{\Theta | \textbf{x}} ( \theta ) = \frac{ \mathcal{L}(\theta | \textbf{x}) f_{\Theta}(\theta) }{ f_{\textbf{x}} ( \textbf{x} ) } \)

By the law of total probability, \(\displaystyle f_{\textbf{x}}(\textbf{x}) = \int_{ \text{Im}(\Theta) } \mathcal{L}(\theta | \textbf{x} )f_{\Theta}(\theta)\, d\theta\), however this is often a highly nontrivial calculation

\(\textbf{x}\) is known through sampling, \(f_{\Theta}\) is set by the analyst, and \(f_{X | \Theta = \theta}\) is implied by the assumed distribution family.

Prior distribution

Probability distribution representing the assumed distribution of the parameter before evidence is observed

Posterior distribution

Probability distribution representing the prior distribution updated with evidence using Bayes' theorem

Conjugate prior

Prior distribution such that under Bayesian inference the posterior distribution is of the same class, albeit with updated parameters
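
For instance, a \(\text{Beta}(\alpha, \beta)\) prior is conjugate for a Bernoulli likelihood: after observing \(s\) successes in \(n\) trials, \( f_{\Theta | \textbf{x}}(\theta) \propto \theta^{s}(1-\theta)^{n-s} \cdot \theta^{\alpha - 1}(1-\theta)^{\beta - 1} \), which is the kernel of \(\text{Beta}(\alpha + s, \beta + n - s)\)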