Process of generating sample instances \(X_i\) from some random variable \(X\), essentially simulating the behaviour of the random variable. It requires using a standard sampling technique (such as uniform sampling) and appropriately transforming the results to match the distribution of the random variable of interest.
Probability distribution formed by a finite known linear combination (sum where coefficients may be modified by some scalar, see Linear Algebra) of independent random variables
\( \sum_{i=1}^{n}\text{Bernoulli}(p) \sim \text{Bin}(n,p)\)
\( \sum_{i=1}^{r}\text{Geo}(p) \sim \text{NB}(r,p)\)
Probability distribution formed by counting the number of independent random variables simulated until a criterion is reached
For instance, a geometric variable can be realized by simulating Bernoulli variables until one realises the value 1
Generating sample instances from \(U \sim \text{U}(0,1)\) (by a computer or otherwise) to be used in simulation
In the discrete case, such samples can be transformed to represent any random variable by mapping partitions of the range of \(U\) to partitions in the range of \(X\) such that
\( \text{Pr}(U \in [a,b]) = \text{Pr}(X \in [c,d]) \)
Cumulative functions can be used to create a partition
\(X_i= \min \{ k : \text{Pr}(X \leq k) \geq U_i \} = F_{X}^{-1}(U_i) \)
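The partition by cumulative probabilities can be sketched as follows. This is a minimal illustration, assuming a finite support given as two hypothetical lists `values` and `probs`; it returns the smallest value whose cumulative probability reaches the uniform draw.

```python
import random
from itertools import accumulate

def inverse_transform_sample(values, probs, u):
    """Return the smallest value whose cumulative probability is >= u."""
    for value, cum in zip(values, accumulate(probs)):
        if u <= cum:
            return value
    return values[-1]  # guards against floating-point round-off in the CDF

rng = random.Random(0)
values = [0, 1, 2]          # hypothetical support of X
probs = [0.2, 0.5, 0.3]     # hypothetical PMF of X
samples = [inverse_transform_sample(values, probs, rng.random()) for _ in range(10_000)]
freq = {v: samples.count(v) / len(samples) for v in values}
```

The empirical frequencies in `freq` should land close to `probs` for a sample of this size.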
\(X \sim \text{Bernoulli}(p)\) one can trivially form the rule
\(\begin{cases} X_i=0 & U_i \geq p \\ X_i=1 & U_i \lt p \end{cases}\)
\(X \sim \text{Bin}(n,p)\)
This is a convolution of \(n\) Bernoulli variables; simulate each of the \(n\) Bernoullis and sum their values for the Binomial variable
\(X \sim \text{Geo}(p)\)
This is an acceptance-rejection technique on a sequence of Bernoulli variables; simulate each Bernoulli until a 1 is reached, the realized value is the number of Bernoulli variables simulated
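The Bernoulli-based constructions of the Binomial and Geometric variables can be sketched as below. The function names are illustrative, and the Geometric variable is assumed to count the trial on which the first 1 occurs (support 1, 2, ...).

```python
import random

def bernoulli(p, rng):
    # Rule from the text: X = 1 when U < p, otherwise X = 0.
    return 1 if rng.random() < p else 0

def binomial(n, p, rng):
    # Sum of n independent Bernoulli(p) draws.
    return sum(bernoulli(p, rng) for _ in range(n))

def geometric(p, rng):
    # Number of Bernoulli(p) draws simulated until the first 1 appears.
    count = 1
    while bernoulli(p, rng) == 0:
        count += 1
    return count

rng = random.Random(1)
bin_mean = sum(binomial(10, 0.3, rng) for _ in range(5_000)) / 5_000  # should be near n*p = 3
geo_mean = sum(geometric(0.25, rng) for _ in range(5_000)) / 5_000    # should be near 1/p = 4
```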
\(X \sim \text{Pois}(\lambda)\)
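This is an acceptance-rejection technique on a sequence of Exponential variables; simulate \(\text{Exp}(\lambda)\) variables until their running sum exceeds 1, the realized value is the number of variables simulated before the sum exceeds 1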
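A minimal sketch of the Poisson construction, assuming the usual stopping rule: exponential inter-arrival times with rate \(\lambda\) are accumulated, and the count of arrivals that fit inside one unit of time is returned. The exponential draws themselves use the inverse CDF of a uniform sample.

```python
import math
import random

def poisson(lam, rng):
    """Count Exp(lam) inter-arrival times whose running sum stays within 1."""
    count = 0
    # 1 - U lies in (0, 1], so the logarithm is always defined.
    elapsed = -math.log(1.0 - rng.random()) / lam
    while elapsed <= 1.0:
        count += 1
        elapsed += -math.log(1.0 - rng.random()) / lam
    return count

rng = random.Random(2)
draws = [poisson(4.0, rng) for _ in range(5_000)]
pois_mean = sum(draws) / len(draws)  # should be near lambda = 4
```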
Proposition that allows the transform of uniform samples to standard normal samples
\(U_1 , U_2 \sim \text{U}(0,1) \land U_1 ,U_2 \text{ are independent } \implies \)
\( \xi_1=\sqrt{-2\ln(U_1)}\cos (2\pi U_2) \sim \text{N}(0,1) \land \xi_2=\sqrt{-2\ln(U_1)}\sin (2\pi U_2) \sim \text{N}(0,1) \text{ are independent } \)
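The transform can be sketched directly from the proposition. This is an illustrative implementation; the only liberty taken is drawing \(U_1\) from \((0,1]\) rather than \([0,1)\) so that the logarithm is always defined.

```python
import math
import random
import statistics

def box_muller(rng):
    """Turn two independent U(0,1) draws into two independent N(0,1) draws."""
    u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(3)
normals = [z for _ in range(10_000) for z in box_muller(rng)]
bm_mean = statistics.fmean(normals)   # should be near 0
bm_stdev = statistics.pstdev(normals) # should be near 1
```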
Distribution of a single random variable \(X\), defined by its PMF (discrete) or PDF (continuous)
On a probability space \( (\Omega, \mathcal{F}, \text{Pr}) \),
\(\textbf{x} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\)
\( \text{E}[g(\textbf{x})] = \int_{\Omega} g[\textbf{x}(\omega)] d\text{Pr}(\omega ) \)
\(\text{E}( \textbf{x} ) = \begin{bmatrix} \text{E} (X_1) \\ \text{E}(X_2) \\ \vdots \\ \text{E}(X_n) \end{bmatrix} \)
\(\text{Var}( \textbf{x} ) = \begin{bmatrix} \text{Var} (X_1) \\ \text{Var}(X_2) \\ \vdots \\ \text{Var}(X_n) \end{bmatrix} \)
\( \text{cov}(\textbf{x} , \textbf{y}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{y} - \text{E}[\textbf{y}])^{T} ] \)
\( \text{cov}(\textbf{x} , \textbf{x}) = \text{E}[ (\textbf{x} - \text{E}[\textbf{x}]) (\textbf{x} - \text{E}[\textbf{x}])^{T} ] \)
\( \text{cov}(\textbf{x} , \textbf{x}) = \text{cov}(\textbf{x} , \textbf{x})^{T} \)
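As a numerical illustration, the covariance matrix of a random vector can be estimated from realizations and checked for symmetry. This sketch uses hypothetical two-component data and the sample (divisor \(m-1\)) estimate rather than the exact expectation.

```python
import random

def sample_cov_matrix(rows):
    """Sample covariance matrix; each row is one realization of the vector."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (m - 1)
         for j in range(n)]
        for i in range(n)
    ]

rng = random.Random(2)
# Hypothetical data: X_2 is X_1 plus independent noise, so the two covary positively.
rows = []
for _ in range(2_000):
    x1 = rng.gauss(0.0, 1.0)
    rows.append([x1, x1 + rng.gauss(0.0, 0.5)])
cov = sample_cov_matrix(rows)
```

The diagonal of `cov` holds the component variances and the off-diagonal entries are equal, matching the symmetry property.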
Distribution of an ordered tuple of random variables or vector RV (these are isomorphic interpretations), such as \( (X,Y)\), defined by its JPMF (discrete) or JPDF (continuous)
Vector function \(f : \mathbb{R}^n \to [0,1] \) mapping the range of a discrete random vector to its probability
\( f_{\textbf{x}}(\textbf{u}) = \text{Pr}( \textbf{x} = \textbf{u} ) \)
Note that the use of vector RVs is for convenient notation
Vector function \(f : \mathbb{R}^n \to \mathbb{R}_{+} \) mapping the range of a continuous random vector to its probability density
\( f_{\textbf{x}}(\textbf{u})\)
Vector function \(F : \mathbb{R}^n \to [0,1] \) mapping the range of a continuous random vector to the probability that every component \(X_i\) is at most the corresponding \(u_i\)
\( \displaystyle F_{\textbf{x}} (\textbf{u}) = \text{Pr}(\bigwedge^{n}_{i=1} X_i \leq u_i) = \int^{u_1}_{-\infty} \cdots \int^{u_n}_{-\infty}f_{\textbf{x}}(x_1,\cdots,x_n ) dx_n \cdots dx_1 \)
Distribution of a RV \(X\) given another RV \(Y\) has a known outcome
Function \(f : \mathbb{R}^2 \to \mathbb{R}_{+} \) mapping the range of a continuous scalar RV to the probability density given that the outcome of \(Y\) is known
\( f_{X|Y} (x|y) = \frac{f_{X,Y} (x,y) }{f_{Y}(y)} \)
Distribution of a RV \(X\) regardless of the outcome of all other RVs
\( \displaystyle f_{X_1} (x_1) = \sum_{x_2 \in \text{Im}(X_2)} \cdots \sum_{x_n \in \text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots x_n) \)
\( \displaystyle f_{X_1} (x_1) = \int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots x_n) dx_n \cdots dx_2 \)
\( \displaystyle F_{X_1} (u) = \int^{u}_{-\infty}\int_{\text{Im}(X_2)} \cdots \int_{\text{Im}(X_n)} f_{\textbf{x}}(x_1 , \cdots x_n) dx_n \cdots dx_2 \, dx_1 \)
\(X,Y \text{ are independent } \iff f_{X,Y}(x,y) = f_{X}(x) f_{Y}(y) \)
With multivariate independence, marginal distributions equal the conditional distributions
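For a discrete pair, marginalization, conditioning and the independence check can be sketched with a small table. The joint PMF below is hypothetical and was chosen so that \(X\) and \(Y\) happen to be independent.

```python
import math

# Hypothetical joint PMF of (X, Y), stored as {(x, y): probability}.
joint = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.15, (1, 1): 0.45}

def marginal(joint, axis):
    """Sum the joint PMF over every coordinate except `axis`."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0.0) + p
    return out

def conditional_x_given_y(joint, y):
    """f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)."""
    f_y_val = marginal(joint, 1)[y]
    return {x: p / f_y_val for (x, yy), p in joint.items() if yy == y}

f_x = marginal(joint, 0)
f_y = marginal(joint, 1)
# Independence: the joint PMF factors into the product of the marginals everywhere.
independent = all(math.isclose(p, f_x[x] * f_y[y]) for (x, y), p in joint.items())
```

Because this table is independent, `conditional_x_given_y(joint, y)` agrees with `f_x` for either value of `y`.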
A function \(u\) in terms of a tuple of random variables \((X,Y)\) can become its own random variable \((S,T)=u(X,Y)\), and its JPMF/JPDF can be found when:
\(u: (X,Y) \to (S,T) \text{ is injective}\)
When \(X\) is discrete, form the PMF of \(Y=u(X)\) by mapping each of the original probabilities to be the transformed value probabilities
\(u: X \to Y \text{ is injective} \implies \text{Pr}(Y=u(k)) = \text{Pr}(X=k) \)
\(\text{Pr}(Y=u(k)) = \sum_{i\in \text{Im}(X) : u(k)=u(i)}\text{Pr}(X=i) \)
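The discrete case can be sketched by accumulating the probability of every original value that maps to the same transformed value. The PMF and the map \(u(k)=k^2\) below are hypothetical and deliberately non-injective.

```python
from collections import defaultdict

def transform_pmf(pmf, u):
    """PMF of Y = u(X): add up Pr(X = i) over all i with the same image u(i)."""
    out = defaultdict(float)
    for k, p in pmf.items():
        out[u(k)] += p
    return dict(out)

pmf_x = {-1: 0.25, 0: 0.5, 1: 0.25}           # hypothetical PMF of X
pmf_y = transform_pmf(pmf_x, lambda k: k * k)  # -1 and 1 both map to 1
```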
When \(X\) is continuous, the concept remains, however one forms the PDF by integration by substitution
\(u: (X,Y) \to (S,T) \text{ is injective} \land \text{Pr}((X,Y) \in A) = \text{Pr}((S,T) \in u(A)) \implies \iint_{A} f_{X,Y}(x,y)\,dx\,dy = \iint_{u(A)} f_{X,Y}(u^{-1}_1(s,t), u^{-1}_2(s,t))|\det (J)|\,ds\,dt\), where \(J\) is the Jacobian matrix of \(u^{-1}\) with respect to \((s,t)\)
Continuous distribution of a convolution of \(\alpha\) independent exponential variables, for instance the waiting time until the \(\alpha\)-th event of a Poisson process
\(X \sim \Gamma (\alpha,\beta)\)
\(f_{X}(x) = \begin{cases} \frac{\beta^{\alpha} x^{\alpha - 1}e^{-\beta x}}{\Gamma(\alpha)} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)
\(X \sim \text{Beta}(\alpha,\beta)\)
\(f_{X}(x) = \begin{cases} \frac{x^{\alpha -1}(1-x)^{\beta -1}}{\text{B}(\alpha,\beta)} & x \in [0,1] \\ 0 & x \notin [0,1] \end{cases}\)
Continuous, symmetric distribution representing a 'bell curve'
\(Z \sim \text{N}(\mu,\sigma^2)\)
\( \phi_{Z} (z) = \frac{e^{-\frac{(z-\mu)^2}{2 \sigma^2}}}{\sqrt{2 \pi \sigma^2}}, \quad z \in \mathbb{R} \)
Theorem stating that as the sample size approaches infinity, the standardized mean of a random sample converges in distribution to a standard normal distribution
Test statistics can therefore be formed by standardization and by creating functions of normal variables
\( X_n \text{ is a sequence of independent and identically distributed RVs with mean } \mu \text{ and finite variance } \sigma^2 \implies \frac{\overline{X}_n - \mu}{\frac{\sigma}{\sqrt{n}}} \xrightarrow{d} \text{N}(0,1) \text{ as } n \to \infty \)
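The theorem can be illustrated by simulation. This sketch assumes \(\text{U}(0,1)\) samples (mean \(1/2\), variance \(1/12\)) with a hypothetical sample size of 50, and checks how often the standardized mean falls within \(\pm 1.96\), which would be about 95% of the time for a standard normal variable.

```python
import math
import random

def standardized_mean(n, rng):
    """Standardize the mean of n U(0,1) draws using mu = 1/2, sigma^2 = 1/12."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    xbar = sum(rng.random() for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

rng = random.Random(3)
zs = [standardized_mean(50, rng) for _ in range(4_000)]
coverage = sum(abs(z) <= 1.96 for z in zs) / len(zs)  # should be near 0.95
```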
Continuous distribution representing the sum of \(k\) independent, squared, standardized, normal variables.
\(X \sim \chi^2(k)\)
\(f_{X}(x) = \begin{cases} \frac{1}{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} x^{\frac{k}{2}-1}e^{-\frac{x}{2}} & x \in [0,\infty) \\ 0 & x \notin [0,\infty) \end{cases}\)
Test statistic for goodness-of-fit, which is approximately Chi-Squared distributed by the CLT
Since Chi-squared variables are linear combinations, this allows for the fit of combinations of categorical data to be examined, which is notably absent from T-tests, Z-tests and F-tests
\( \chi^2 = \sum_{i=1}^{k} \frac{ (O_i - \text{E} (X_i))^2 }{ \text{E}(X_i) } \)
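The statistic is a direct sum over categories. A minimal sketch, using hypothetical counts from 100 rolls of a five-sided die with equal expected counts:

```python
def chi_squared_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 25, 15]  # hypothetical observed counts
expected = [20, 20, 20, 20, 20]  # expected counts under the hypothesized distribution
statistic = chi_squared_statistic(observed, expected)
```

The resulting value would then be compared against a chi-squared quantile with the appropriate degrees of freedom.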
Continuous distribution representing the ratio of a standardized normal variable and the root of an independent chi-squared variable divided by its degrees of freedom.
It was discovered by William Sealy Gosset while working at Guinness
\(X \sim t(\nu)\)
\(f_{X}(x) = \frac{\Gamma (\frac{\nu + 1}{2} )}{ \sqrt{\pi \nu} \Gamma (\frac{\nu}{2} ) } (1+ \frac{x^2}{\nu})^{ - \frac{\nu +1}{2}} \)
Continuous distribution representing the ratio of two independent chi-squared variables, each divided by its degrees of freedom.
\(X \sim \text{F}(d_1,d_2)\)
\(f_{X}(x) = \frac{x^{\frac{d_1}{2}-1} (\frac{d_1}{d_2})^{\frac{d_1}{2}}}{(1+\frac{xd_1}{d_2})^{\frac{d_1+d_2}{2}}\text{B}(\frac{d_1}{2},\frac{d_2}{2})} \)
Lower bound for the expected value of a non-negative random variable, for any \(a \gt 0\)
\(\text{E}(X) \geq a\text{Pr}(X \geq a)\)
\(\text{E}(X) = \int^{\infty}_{0} xf_{X}(x)dx \geq \int^{\infty}_{a} xf_{X}(x)dx \geq a\int^{\infty}_{a} f_{X}(x)dx = a \text{Pr}(X \geq a)\)
Upper bound for the probability that a random variable's distance from its mean exceeds \(k\) standard deviations, found by applying Markov's inequality to the square of the standardized random variable
\(\frac{1}{k^2} \geq \text{Pr}(|X- \text{E}(X)| \geq k\sqrt{\text{Var}(X)})\)
\(\text{Pr}(X - \text{E}(X) \geq q) \leq \frac{\text{Var}(X)}{q^2 +\text{Var}(X)}\)
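The three bounds above can be checked empirically. This sketch assumes \(X \sim \text{Exp}(1)\), for which \(\text{E}(X)=1\) and \(\text{Var}(X)=1\), with hypothetical thresholds \(a=3\), \(k=2\), \(q=2\); each pair holds an empirical tail frequency followed by its bound.

```python
import random

rng = random.Random(4)
xs = [rng.expovariate(1.0) for _ in range(50_000)]  # Exp(1): E(X) = 1, Var(X) = 1
mean, var = 1.0, 1.0

def tail(pred):
    """Empirical probability that a sample satisfies the predicate."""
    return sum(1 for x in xs if pred(x)) / len(xs)

a, k, q = 3.0, 2.0, 2.0
markov = (tail(lambda x: x >= a), mean / a)                               # Pr(X >= a) <= E(X)/a
chebyshev = (tail(lambda x: abs(x - mean) >= k * var ** 0.5), 1 / k ** 2)  # two-sided bound
one_sided = (tail(lambda x: x - mean >= q), var / (q ** 2 + var))          # one-sided bound
```

In each pair the first entry should not exceed the second; the bounds are typically far from tight.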
Theorem that the sample mean made from each term of an independent and identically distributed sequence of random variables converges to their mean in probability
\(X_i \text{ are independent } \land X_i \text{ are all identically distributed } \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)
\(X_k \text{ are uncorrelated } \land \sup [ \text{Var}(X_k) ] \lt \infty \land \forall k ( \text{E}(X_k) = \mu ) \implies \text{plim}_{n \to \infty} \frac{\sum^{n}_{i=1} X_i}{n} = \mu\)
Numerical integration techniques based on sampling random variables, typically one samples from a uniform RV
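A minimal sketch of the idea, assuming uniform sampling on \([a,b]\): the integral is estimated as the interval length times the average of the integrand at the sampled points. The integrand \(x^2\) on \([0,1]\) is a hypothetical example with exact value \(1/3\).

```python
import random

def mc_integrate(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform samples."""
    total = sum(f(a + (b - a) * rng.random()) for _ in range(n))
    return (b - a) * total / n

rng = random.Random(5)
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0, 20_000, rng)  # should be near 1/3
```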
Technique that estimates a distribution's parameters
This is in contrast to hypothesis testing, where an educated hypothesis of a distribution parameter is proposed and tested at the \(100(1-\alpha)\)% confidence level
Functions relating to a random variable's (and distribution's) properties.
\(n\) represents the order of the moment and \(X\) represents the random variable of interest
Considering a set \( \{ x_1 ,x_2, ..., x_n \} \) of observations of \(X\):
Parameter estimation statistic by substitution of known values into a moment and solving for unknown parameter
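As an illustration, consider \(X \sim \Gamma(\alpha,\beta)\) with rate \(\beta\), for which \(\text{E}(X)=\alpha/\beta\) and \(\text{Var}(X)=\alpha/\beta^2\). Substituting the sample mean and sample variance and solving gives \(\hat{\alpha}=\overline{x}^2/s^2\) and \(\hat{\beta}=\overline{x}/s^2\). The sketch below assumes this parametrization; note that Python's `random.gammavariate` takes a scale parameter, so the rate is inverted when sampling.

```python
import random
import statistics

def gamma_method_of_moments(xs):
    """Solve mean = alpha/beta and variance = alpha/beta^2 for (alpha, beta)."""
    m1 = statistics.fmean(xs)
    v = statistics.pvariance(xs, mu=m1)
    return m1 * m1 / v, m1 / v

rng = random.Random(6)
alpha, beta = 3.0, 2.0  # hypothetical true shape and rate
xs = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(20_000)]
alpha_hat, beta_hat = gamma_method_of_moments(xs)
```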
Function representing the probability/probability density (for discrete and continuous cases respectively) of a random variable \(X\) returning a certain sequence of realizations if \(X\) were distributed by \(\text{D}(\boldsymbol{\theta})\)
\( \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) = \prod^{n}_{i=1} f_{X| \boldsymbol{\theta}}( x_i ) \)
Parameter estimation statistic by finding \(\hat{\theta}\) that maximizes likelihood function
Finding an estimator \(\hat{\theta}\) for a distribution's parameters by maximizing the likelihood function with some sample
\( \hat{ \theta } = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x}) ] \)
To facilitate optimization, the logarithm of the likelihood function is considered since it is easier to perform a derivative test on, and the log-likelihood has the same maximizer as the likelihood function since the logarithm is monotone increasing.
\( \ln \mathcal{L} ( \boldsymbol{\theta} |\textbf{x} ) = \sum^{n}_{i=1} \ln ( f_{X| \boldsymbol{\theta}}(x_i ) ) \)
\( \text{argmax}_{\boldsymbol{\theta}} [ \ln \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] = \text{argmax}_{\boldsymbol{\theta}} [ \mathcal{L} ( \boldsymbol{\theta} | \textbf{x} ) ] \)
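A worked sketch, assuming an exponential model with rate \(\lambda\): the log-likelihood is \(n\ln\lambda - \lambda\sum x_i\), and the derivative test gives \(\hat{\lambda} = n/\sum x_i\). The code compares that closed form with a crude grid search over hypothetical candidate rates.

```python
import math
import random

def log_likelihood(lam, xs):
    """Log-likelihood of an Exp(lam) model: n*ln(lam) - lam*sum(x_i)."""
    return len(xs) * math.log(lam) - lam * sum(xs)

rng = random.Random(5)
xs = [rng.expovariate(2.0) for _ in range(2_000)]  # hypothetical sample, true rate 2

closed_form = len(xs) / sum(xs)                    # from setting the derivative to zero
grid = [0.01 * k for k in range(1, 1001)]          # candidate rates 0.01, ..., 10.00
grid_mle = max(grid, key=lambda lam: log_likelihood(lam, xs))
```

The grid maximizer should agree with the closed form up to the grid spacing.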
Measure of the bias of a statistic \(T\) that estimates the parameter value \(\theta\)
\(\text{bias}(T, \theta) = \text{E}(T) - \theta\)
\( T \text{ is a biased estimator of } \theta \iff \text{bias}(T,\theta) \neq 0\)
Unbiased estimators are desirable in that sampling them converges to the true value (WLLN), however a biased estimator with smaller variance may have better utility than an unbiased estimator, see Basu's Elephant
\( \text{bias}(s_1,\mu) = 0\)
\( \text{bias}(cs_2,\sigma^2) = \frac{\sigma^2}{n}\)
\( \text{bias}( b_n , \beta_n ) = 0 \)
Function quantifying the rate of change of the log-likelihood with respect to the parameter \( \theta \)
\( \text{Score}( \theta | \textbf{x} ) = \frac{\partial \ln \mathcal{L} ( \theta | \textbf{x} ) }{\partial \theta} \)
Measure of information about the parameter value disclosed by the sample.
It is defined as the variance of the score. This definition is suitable since if one knows that the likelihood function is very sensitive when straying from some certain \(\theta\), then this gives stronger information about the true parameter setting
\( \mathcal{I}( \theta ) = \text{Var}[\text{Score}( \theta | \textbf{x} )] = - \text{E} ( \frac{\partial^2 \ln \mathcal{L} (\theta | \textbf{x}) }{\partial \theta^2} ) \)
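The two expressions can be compared exactly for a single Bernoulli(\(p\)) observation, where \(\ln \mathcal{L} = x\ln p + (1-x)\ln(1-p)\). This is an illustrative check with a hypothetical \(p\); both forms should equal \(1/(p(1-p))\).

```python
def bernoulli_fisher_information(p):
    """Return (Var[score], -E[second derivative]) for one Bernoulli(p) draw."""
    pmf = {0: 1.0 - p, 1: p}
    score = lambda x: x / p - (1 - x) / (1.0 - p)
    second = lambda x: -x / p ** 2 - (1 - x) / (1.0 - p) ** 2
    mean_score = sum(pmf[x] * score(x) for x in pmf)
    var_score = sum(pmf[x] * (score(x) - mean_score) ** 2 for x in pmf)
    neg_exp_second = -sum(pmf[x] * second(x) for x in pmf)
    return var_score, neg_exp_second

var_score, neg_exp_second = bernoulli_fisher_information(0.3)
```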
\( \text{Var}(\hat{\theta} ) \geq \frac{1}{\mathcal{I}(\theta)} \)
Measure inspired by the Cramer-Rao bound, representing what proportion of the bound an estimator yields
\( \text{eff}(\hat{ \theta }) = \frac{1}{\mathcal{I}(\theta) \text{Var}(\hat{\theta} ) } \)
Ratio of variances, it is identical to the reciprocal ratio of efficiencies since the term \( \frac{1}{\mathcal{I}(\theta)}\) is fixed
\( \text{eff}(\hat{\theta}_1 , \hat{\theta}_2 ) = \frac{\text{Var}(\hat{\theta}_1) }{ \text{Var}( \hat{\theta}_2 ) } = \frac{ \text{eff}( \hat{\theta}_2 ) }{ \text{eff}( \hat{\theta}_1 ) }\)
Family of distributions such that the PDF has some exponential factor, support refers to the domain of a function minus elements such that \(f(x)=0\)
\(X \sim \text{D}(\boldsymbol{\theta} ) \text{ is in the exponential family } \iff f_{X}(x) = h(x) g( \boldsymbol{\theta} ) \exp ( \textbf{T}(x) \cdot \boldsymbol{\eta}( \boldsymbol{\theta} ) ) \land \text{ the support of } f_{X} \text{ is independent of } \boldsymbol{\theta}\)
As the number of moves in a Markov chain approaches infinity, for many Markov chains the observed probability of being in each state converges to a fixed distribution, the equilibrium distribution
The use of simulations on Markov chains to predict equilibrium distributions. A naive simulation may be susceptible to inaccuracy due to the following phenomenon:
MCMC algorithm for modelling a probability distribution through an equilibrium distribution of a Markov chain
It requires some symmetric proposal distribution \(g(x) : g(x|x_p) = g(x_p |x) \) that proposes a legal move in the Markov chain, and this move is accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) }{\text{Pr} (x) } \}\)
'Detailed balance' is assumed, \(P(x|x_p)P(x_p) = P(x_p | x)P(x)\), that is, the probability flow between each pair of states is not directional. Then one considers the probability \( P(x_p |x ) = g( x_p | x) A(x_p | x) \), where \(A\) is the probability of the algorithm accepting the proposed move
This algorithm is useful in cases where a probability distribution is difficult to calculate, however a proportional probability distribution can easily be calculated
Assume the probability distribution has detailed balance \(\text{Pr}(x | x_p ) \text{Pr}( x_p ) = \text{Pr}(x_p | x) \text{Pr}( x ) \). The transition probability is related to the proposal distribution by \( \text{Pr}(x | x_p ) = g( x | x_p )A( x | x_p ) \), where \(A\) is the probability of the algorithm accepting the proposed move
Substitution into the detailed balance equation leads to the acceptance probability for the Metropolis-Hastings algorithm:
\( \frac{A(x_p | x )}{A( x | x_p )} = \frac{ \text{Pr} ( x_p ) g( x | x_p ) }{ \text{Pr} ( x ) g( x_p | x ) } \)
Applying the detailed balance condition provides the desired acceptance probability for the Metropolis algorithm.
Variant of the Metropolis algorithm that relaxes the symmetry requirement of the proposal distribution, hence proposed moves are accepted with probability \(\min \{ 1, \frac{\text{Pr} (x_p) g (x|x_p) }{\text{Pr} (x) g (x_p|x) } \}\)
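A minimal sketch of the sampler, assuming a target known only up to proportionality (here an unnormalized standard normal, \(\text{Pr}(x) \propto e^{-x^2/2}\)) and a Gaussian random-walk proposal. All function and parameter names are illustrative. The computation is done on the log scale, and `log_g(a, b)` stands for \(\ln g(a|b)\) up to an additive constant; with this symmetric proposal the \(g\) terms cancel and the step reduces to the Metropolis rule.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_g, x0, n, rng):
    """Run n steps; log_g(a, b) is the log proposal density of moving to a from b."""
    x = x0
    chain = []
    for _ in range(n):
        x_p = propose(x, rng)
        # log of Pr(x_p) g(x | x_p) / (Pr(x) g(x_p | x))
        log_ratio = (log_target(x_p) + log_g(x, x_p)) - (log_target(x) + log_g(x_p, x))
        # Accept with probability min(1, ratio); 1 - U lies in (0, 1].
        if math.log(1.0 - rng.random()) <= log_ratio:
            x = x_p
        chain.append(x)
    return chain

rng = random.Random(7)
chain = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,              # unnormalized N(0, 1)
    propose=lambda x, rng: x + rng.gauss(0.0, 1.0),  # random-walk proposal
    log_g=lambda a, b: -0.5 * (a - b) ** 2,          # symmetric, constant dropped
    x0=0.0,
    n=20_000,
    rng=rng,
)
kept = chain[1_000:]  # discard a hypothetical burn-in period
chain_mean = sum(kept) / len(kept)
chain_var = sum((x - chain_mean) ** 2 for x in kept) / len(kept)
```

Only ratios of the target appear, which is why the normalizing constant is never needed.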
Markov chains with aperiodicity and ergodicity at every state are ideal for MCMC. Lack of obvious large scale patterns is a sign of good mixing, that is, there are close-to-ergodic conditions
Statistical technique of estimating parameter \(\theta\) of an RV \(X\) with a known distribution, assuming independent realizations, through Bayes' theorem
\( f_{\Theta | \textbf{x}} ( \theta ) = \frac{ \mathcal{L}(\theta | \textbf{x}) f_{\Theta}(\theta) }{ f_{\textbf{x}} ( \textbf{x}) } \)
By the law of total probability, \(\displaystyle f_{\textbf{x}}(\textbf{x}) = \int_{ \text{Im}(\Theta) } \mathcal{L}(\theta | \textbf{x} )f_{\Theta}(\theta) d\theta\), however this is often a highly nontrivial calculation
\(\textbf{x}\) is known through sampling, \(f_{\Theta}\) is arbitrarily set, and \(f_{X | \Theta = \theta}\) is implied by the known distribution of \(X\).
Probability distribution representing assumed probability distribution before evidence
Probability distribution representing an updated prior distribution using Bayes' theorem
Prior distribution such that under Bayesian inference the posterior distribution is of the same class, albeit with updated parameters
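A standard example, offered here as an illustration rather than as part of the original notes, is a \(\text{Beta}(\alpha,\beta)\) prior on the success probability of Bernoulli trials: after observing \(k\) successes in \(n\) trials the posterior is \(\text{Beta}(\alpha+k,\ \beta+n-k)\). The prior parameters and counts below are hypothetical.

```python
def beta_binomial_update(a, b, successes, trials):
    """Posterior Beta parameters after `successes` out of `trials` Bernoulli draws."""
    return a + successes, b + (trials - successes)

a_post, b_post = beta_binomial_update(2.0, 2.0, successes=7, trials=10)
posterior_mean = a_post / (a_post + b_post)  # mean of a Beta(a, b) is a / (a + b)
```

No integral over the parameter space is needed, which is the practical appeal of a conjugate prior.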