37161 - Probability and Random Variables


Basic probability

Sample space

Set \( \Omega \) of all outcomes that can occur

Events

'Measurable' subset of the sample space \(E \subseteq \Omega\)

\(E \text{ is an event } \iff E \subseteq \Omega\)

\(E \text{ occurs } \iff \exists e \in E ( e \text{ occurs} )\)

Probability

Quantity stating how likely an event is. A probability function \(\text{Pr}\) maps events to probabilities, typically obeying the following properties

Frequentist probability

Ratio representing the relative frequency with which an event occurs over repeated trials of the experiment

\( \text{Pr}(E) = \mathrm{plim}_{n \to \infty} \frac{\sum^{n}_{k=1} E_{k} }{n}\)

Theoretical probability

Assuming all outcomes have equal probability, we have the following for the discrete case

\( \text{Pr}(E) = \frac{|E|}{|\Omega|}\)
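
For example, for a fair six-sided die with \(E = \{2,4,6\}\) (rolling an even number):

\( \text{Pr}(E) = \frac{|E|}{|\Omega|} = \frac{3}{6} = \frac{1}{2} \)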

Corollaries

Set theory of probability

Combinatorics of probability

Conditional probability

Probability of an event given that some other event is known to occur.

\( \text{Pr}(E_1|E_2) = \frac{\text{Pr}(E_1 \cap E_2)}{\text{Pr}(E_2)} \)

Mathematically, conditional probability means restricting the sample space to the known event.

Bayes' theorem

\( \text{Pr}(E_1|E_2) = \frac{\text{Pr}(E_2|E_1) \text{Pr}(E_1)}{\text{Pr}(E_2)} \)
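
As an example with illustrative numbers: a test for a condition with prevalence \(\text{Pr}(D)=0.01\), sensitivity \(\text{Pr}(+|D)=0.9\) and false positive rate \(\text{Pr}(+|\neg D)=0.05\) gives

\( \text{Pr}(D|+) = \frac{\text{Pr}(+|D)\,\text{Pr}(D)}{\text{Pr}(+|D)\,\text{Pr}(D) + \text{Pr}(+|\neg D)\,\text{Pr}(\neg D)} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \approx 0.154 \)

so a positive result can still leave a low posterior probability when the prior is small.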

Degree of freedom

For situations such as circular arrangements without repetition, and situations where the first event doesn't matter (like 'what is the probability of rolling the same number on a six-sided die 3 times in a row?'; it doesn't matter what we get on the first throw, but the second and third throws must match it), the degree of freedom is \(n-1\). Hence circular arrangements use the formula \((n-1)!\), and the dice question yields \(\frac{1}{6^2}\). Essentially, we arbitrarily fix the placement of some restricted element.

Independence Indipendenza 自主

Relationship between a pair of events such that one event occurring has no effect on whether the other event occurs

\( A,B \text{ are independent } \iff \text{Pr}(A\cap B)=\text{Pr}(A) \times \text{Pr}(B) \iff \text{Pr}(A|B)=\text{Pr}(A)\)

Basic random variables

Experiment Esperimento 実験

Process that results in some unpredictable outcome

Random variable (RV) Variabile casuale 乱数

Function that maps each outcome in the sample space to a result in the state space, the state space being a set of numerical representations of outcomes.

Essentially, RVs map outcomes to numerical representations called results that can be analyzed mathematically.

\(X : \Omega \to R\)

\(X(\omega)\)

Common abuse of notation

Probability functions should technically map events to probabilities; however, when dealing with random variables, a common abuse of notation uses a predicate to stand for the event of all outcomes satisfying it

\( \text{Pr}[P(X )] = \text{Pr}( \{ \omega \in \Omega : P[X(\omega)] \} ) \)

Probability Mass Function (PMF)

Function \(f\) that maps outcomes of a discrete RV \(X\) to its probability

\(f_{X}(n) = \text{Pr}(X=n)\)

Properties

Cumulative Probability Function (CPF)

Function that returns the probability that a discrete RV takes a value less than or equal to some \(n\)

\( F_{X}(n) = \text{Pr}(X \leq n) = \sum_{k = -\infty}^{n} f_{X}(k) \)

Probability Density Function (PDF)

Function \(f\) that maps outcomes of a continuous random variable \(X\) to a 'probability density'

Note that this 'density' is NOT a probability (for a continuous RV, any single outcome has \(\text{Pr}(X=x)=0\)); probabilities for sets of outcomes must be handled by the CDF (see below).

\(f_{X}(x)\)

Properties

Cumulative Distribution Function (CDF)

Function that, for a continuous random variable \(X\), returns the probability that \(X\) is less than or equal to a given value \(x\)

\( \displaystyle F_{X}(x) = \text{Pr}(X \leq x) = \int_{-\infty}^{x} f_{X}(t) dt \)

Expectation Media 平均

Quantity representing the expected outcome of a random variable as a weighted sum, where outcomes are weighted by their probabilities.

\( \displaystyle \text{E}(X) = \int_{\Omega} X(\omega) d\text{Pr}(\omega)\)

Discrete

\( \displaystyle \text{E}(X) = \sum_{n \in \text{Im}(X)} n f_{X}(n)\)

Continuous

\( \displaystyle \text{E}(X) = \int_{-\infty}^{\infty} xf_{X}(x)\, dx\)

Corollaries

Conditional expected value

RV of the expectation of \(X\) given the result of \(Y\)

\( \text{E}(X|Y) : \Omega \to R\)

\( \text{E}(X|Y)(\omega) = \text{E}(X|Y=Y(\omega)) \)

Variance Varianza 分散

Expected squared distance from the mean of a random variable

\( \text{Var}(X) = \text{E}([X- \text{E}(X)]^2) = \text{E}(X^2) - \text{E}(X)^2 \)
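
For example, for a fair six-sided die:

\( \text{E}(X) = \sum_{n=1}^{6} n \cdot \frac{1}{6} = \frac{7}{2}, \quad \text{E}(X^2) = \sum_{n=1}^{6} n^2 \cdot \frac{1}{6} = \frac{91}{6}, \quad \text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^{2} = \frac{35}{12} \)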

Distribution

Class of random variable defined by a certain structure of PMF/PDF

Discrete

Continuous

Bernoulli distribution

Discrete distribution of an event with binary outcomes; success or failure

\(X \sim \text{Bern} (p)\)

PMF

\( f_{X}(n) = \begin{cases} p & n=1 \\ 1-p & n=0 \\ 0 & \text{otherwise} \end{cases} \)

Features

Binomial distribution

Discrete distribution of the number of successes in a series (sum) of \(n\) independent Bernoulli trials

\(X \sim \text{Bin}(n,p)\)

PMF

\(\displaystyle f_{X}(k)= \begin{cases} \binom{n}{k} p^{k} (1-p)^{n-k} & k \in [0,n] \cap \mathbb{N} \\ 0 & \text{otherwise} \end{cases}\)
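
For example, the probability of exactly 2 heads in 3 fair coin flips is

\( f_{X}(2) = \binom{3}{2}\left(\frac{1}{2}\right)^{2}\left(\frac{1}{2}\right)^{1} = \frac{3}{8} \)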

Features

Geometric distribution

Discrete distribution of the number of independent Bernoulli trials needed for the first success

\(X \sim \text{Geo}(p)\)

PMF

\( f_{X}(n) = \begin{cases} (1-p)^{n-1}p & n \in \{1,2,3,\dots\} \\ 0 & \text{otherwise} \end{cases} \)

Features

Uniform distribution

Continuous distribution where all equally sized subranges of \([a,b]\) have equal probability

\(X \sim \text{U}(a,b)\)

PDF

\(f_{X}(x) = \begin{cases} \frac{1}{b-a} & x \in [a,b] \\ 0 & x \notin [a,b] \end{cases}\)

Features

Law of total probability

Law describing how to calculate the probability of \(A\) through conditional probabilities, given disjoint events \(E_n\) whose union is the sample space (a partition)

\( \text{Pr}(A) = \sum_{n} \text{Pr}(A|E_n) \text{Pr}(E_n) \)
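
As a small example with illustrative numbers: if \(E_1, E_2\) partition \(\Omega\) with \(\text{Pr}(E_1)=\text{Pr}(E_2)=\frac{1}{2}\), \(\text{Pr}(A|E_1)=0.3\) and \(\text{Pr}(A|E_2)=0.7\), then

\( \text{Pr}(A) = 0.3 \times \frac{1}{2} + 0.7 \times \frac{1}{2} = 0.5 \)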

Law of total expectation

Law asserting that the expected value of \(X\) equals the expected value of the conditional expectation \(\text{E}(X|Y)\)

\( \text{E}(X) = \text{E}(\text{E}(X|Y)) \)

\( \text{E}(\text{E}(X|Y)) = \sum_{n \in \text{Im}(Y) } \text{E}(X|Y=n) \text{Pr}(Y=n) \)

Law of total variance

Law describing how to calculate the variance of \(X\) through the conditional variance and conditional expectation of \(X\) given \(Y\)

\( \text{Var}(X) = \text{E}(\text{Var}(X|Y)) + \text{Var}(\text{E}(X|Y)) \)

Functions of random variables

A function \(u\) in terms of a random variable \(X\) can become its own random variable \(Y=u(X)\), and its PMF/PDF is simplest to find when:

\(u: X \to Y \text{ is injective}\)

Discrete

When \(X\) is discrete, form the PMF of \(Y=u(X)\) by mapping each original probability onto the corresponding transformed value, summing whenever several values of \(X\) map to the same value of \(Y\)

\(u: X \to Y \text{ is injective} \implies f_{Y}(u(k)) = f_{X}(k) \)

\(f_{Y}(y) = \sum_{k \,:\, u(k)=y} f_{X}(k) \)

Continuous

When \(X\) is continuous the concept remains; however, one forms the PDF by integration by substitution

\(u \text{ is injective and increasing} \implies \text{Pr}(X \in [a,b]) = \text{Pr}(Y \in [u(a),u(b)]) \implies \int^{b}_{a} f_{X}(x)\,dx = \int^{u(b)}_{u(a)} f_{X}(u^{-1}(y))\,(u^{-1})'(y)\, dy \implies f_{Y}(y) = f_{X}(u^{-1}(y))\,(u^{-1})'(y)\)
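
A minimal numerical sketch (assuming NumPy; the map \(u(x)=x^{2}\) with \(X \sim \text{U}(0,1)\) is an illustrative choice, not from the notes) of checking the substitution formula by simulation:

```python
# Minimal sketch (illustrative, not from the notes): check the change-of-variables formula
# with X ~ U(0,1) and the injective, increasing map u(x) = x^2, so Y = X^2.
# The formula predicts f_Y(y) = f_X(u^{-1}(y)) * (u^{-1})'(y) = 1 / (2*sqrt(y)) on (0, 1],
# and hence the CDF F_Y(y) = sqrt(y).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # samples of X ~ U(0,1)
y = x ** 2                                   # samples of Y = u(X)

for point in (0.1, 0.25, 0.5, 0.9):
    empirical = np.mean(y <= point)          # Monte Carlo estimate of F_Y(point)
    predicted = np.sqrt(point)               # CDF implied by the predicted PDF
    print(point, empirical, predicted)       # the two estimates agree to roughly 3 decimals
```

The agreement is only approximate, up to Monte Carlo sampling noise.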

Poisson distribution

Discrete distribution counting the number of events in an interval of time or space, given a known average rate of events per interval and independence between events

\(X \sim \text{Pois}(\lambda)\)

PMF

\( f_{X}(n) = \begin{cases} \frac{e^{-\lambda}\lambda^{n}}{n!} & n \in \mathbb{N} \\ 0 & \text{otherwise} \end{cases} \)

Features

Derivation

The Poisson distribution arises by taking the infinite limit of a binomial distribution, \(\lim_{N \to \infty} X \sim \text{Bin}(N,\frac{\lambda}{N})\): the interval of time or space is split into \(N\) sub-intervals, and letting \(N\) approach infinity simulates the continuous nature of time and space while still counting discrete events
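
Filling in the algebra of the limit:

\( \displaystyle \lim_{N \to \infty} \binom{N}{k} \left(\frac{\lambda}{N}\right)^{k} \left(1-\frac{\lambda}{N}\right)^{N-k} = \lim_{N \to \infty} \frac{N(N-1)\cdots(N-k+1)}{N^{k}} \cdot \frac{\lambda^{k}}{k!} \cdot \left(1-\frac{\lambda}{N}\right)^{N} \left(1-\frac{\lambda}{N}\right)^{-k} = \frac{e^{-\lambda}\lambda^{k}}{k!} \)

since the first and last factors tend to 1 and \(\left(1-\frac{\lambda}{N}\right)^{N} \to e^{-\lambda}\).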

Merging poisson distributions

\( X \sim \text{Pois}(\lambda) \land Y \sim \text{Pois}(\mu) \implies X+Y \sim \text{Pois}(\lambda + \mu)\)

Proof

To get a PMF for \(X+Y\), we can interpret the probability of \(k\) as the sum over the different ways the two distributions can jointly produce \(k\): \(\text{Pr}(X+Y = k) = \sum_{j=0}^{k} \text{Pr}(X=j)\text{Pr}(Y=k-j) \). Substituting the Poisson PMFs of \(X\) and \(Y\) and simplifying yields a PMF of Poisson form, hence \( X+Y \sim \text{Pois}(\lambda + \mu) \)
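
Filling in the algebra (substituting the Poisson PMFs and using the binomial theorem):

\( \displaystyle \text{Pr}(X+Y=k) = \sum_{j=0}^{k} \frac{e^{-\lambda}\lambda^{j}}{j!} \cdot \frac{e^{-\mu}\mu^{k-j}}{(k-j)!} = \frac{e^{-(\lambda+\mu)}}{k!} \sum_{j=0}^{k} \binom{k}{j} \lambda^{j}\mu^{k-j} = \frac{e^{-(\lambda+\mu)}(\lambda+\mu)^{k}}{k!} \)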

Splitting poisson distributions

Assume a distribution \(X \sim \text{Pois}(\lambda)\) whose events come in two types, counted by \( X_1 , X_2 \), occurring with probabilities \(p\) and \(1-p\) respectively. This implies \(X_1 \sim \text{Pois}(\lambda p)\)

Proof

We want to find \(\text{Pr}(X_1 = n)\), which we can do via \(\text{Pr}(X_1 = n) = \sum_{m} \text{Pr}(X_1 = n| X = n+m) \text{Pr}(X = n+m)\). We can see that \((X_1 \mid X = n+m) \sim \text{Bin}(n+m,p)\), since each of the \(n+m\) events is of the first type with probability \(p\). Recalling that \(\text{Pr}(X = n+m) = \frac{e^{-\lambda}\lambda^{(n+m)}}{(n+m)!}\) and substituting these shows the claim
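
Filling in the algebra:

\( \displaystyle \text{Pr}(X_1=n) = \sum_{m=0}^{\infty} \binom{n+m}{n} p^{n}(1-p)^{m} \cdot \frac{e^{-\lambda}\lambda^{n+m}}{(n+m)!} = \frac{e^{-\lambda}(\lambda p)^{n}}{n!} \sum_{m=0}^{\infty} \frac{(\lambda(1-p))^{m}}{m!} = \frac{e^{-\lambda p}(\lambda p)^{n}}{n!} \)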

Poisson Processes

We can take \(X(t) \sim \text{Pois}(\lambda t)\) to extend a Poisson distribution to larger intervals: for instance, if \(\lambda\) represents the average number of events per unit interval, then \(X(t)\) counts the events in an interval of length \(t\)

Memorylessness

Distribution property such that conditioning on a lower bound \(a\) for the random variable's value simply translates the distribution \(a\) units backwards; the waiting time already elapsed between Poisson events is forgotten once it is given as a condition

\(X \text{ is memoryless} \iff \text{Pr}(X\gt b|X \gt a)=\text{Pr}(X\gt b-a)\)
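
For example, for the exponential distribution below, where \(\text{Pr}(X \gt a) = e^{-\lambda a}\) (assuming \(b \geq a \geq 0\)):

\( \text{Pr}(X \gt b | X \gt a) = \frac{\text{Pr}(X \gt b)}{\text{Pr}(X \gt a)} = \frac{e^{-\lambda b}}{e^{-\lambda a}} = e^{-\lambda(b-a)} = \text{Pr}(X \gt b-a) \)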

Symmetry

Distribution property such that all outcomes within the same distance of some 'symmetric outcome' \(x_0\) have the same probability

\(X \text{ is symmetric } \iff \exists x_0 : \text{Pr}(X = x_0 + \delta)=\text{Pr}(X = x_0 - \delta)\)

Exponential distribution

Continuous, memoryless distribution representing the distance between Poisson events

\(X \sim \text{exp}(\lambda)\)

PDF

\(f_{X}(x) = \begin{cases} \lambda e^{-\lambda x} & x \in [0,\infty ) \\ 0 & \text{otherwise} \end{cases}\)

Features

Advanced random variables

Probability generating function

Generating functions encode a sequence (here the PMF of a random variable) as the coefficients of a power series. For a random variable \(X\), we take the following generating function:

\(G(z)= \text{E}(z^{X})\)

Theorems

\(\text{E}(X) = G'(1) \)

\(\text{Var}(X) = G''(1)+G'(1)-G'(1)^2 \) (see the Poisson check below)

For a sum \(X = X_1 + X_2\) of independent random variables, generating functions can be split as such

\(G_X(z)= G_{X_{1}}(z) G_{X_{2}}(z)\)

For a random sum \(S_N = \sum_{i=1}^{N} X_i\) of i.i.d. \(X_i\) independent of \(N\), applying conditional expectation with probability generating functions implies

\(G_{S_{N}}(z)= G_{N}(G_{X_{i}}(z))\)

\(X \sim \text{exp}(\lambda) \implies G(z) = \frac{\lambda}{\lambda - \ln (z)}\)

\(X \sim \text{Bern}(p) \implies G(z) = p(z-1)+1\)

\(X \sim \text{Bin}(n,p) \implies G(z) = (p(z-1)+1)^n \)

\(X \sim \text{Geo}(p) \implies G(z) = \frac{pz}{1-z(1-p)} \)

\(X \sim \text{Pois}(\lambda) \implies G(z) = e^{\lambda(z-1)} \)
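
As a check of the expectation and variance theorems above, for the Poisson case:

\( G(z) = e^{\lambda(z-1)} \implies G'(1) = \lambda, \quad G''(1) = \lambda^{2}, \quad \text{Var}(X) = G''(1)+G'(1)-G'(1)^{2} = \lambda^{2} + \lambda - \lambda^{2} = \lambda \)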

Time series

Discrete series of datapoints recording numerical observations at regular time intervals.

Markov chain Catena di Markov マルコフ連鎖

Sequence of random variables \(X : \Omega \times D \to R\) (where \(R\) must be countable) is a Markov chain if the probabilities for the subsequent random variable in the sequence are determined solely by the previous random variable

\( (X_t) \text{ is a Markov chain } \iff \text{Pr}(X_n = k_n | \bigwedge_{i=1}^{n-1} X_i = k_i) = \text{Pr}(X_n = k_n | X_{n-1}=k_{n-1})\)

Random walk Passeggiata aleatoria 乱歩

Markov chain defined on the integers \(\mathbb{Z}\) such that \(\text{Pr}_{ij}\) represents the probability of the \((n+1)\)th state being \(j\), given that the current state is \(i\)

\( \text{Pr}_{ij} = \text{Pr}(X_{n+1}= j | X_{n} = i) \)

Asymmetric random walk

Random walk where a step in one direction is more probable than a step in the other

Difference equations

A recursive sequence that relates a present event to a previous event. It is desirable to find some closed form solution for these

First order difference equations

A difference equation such that the highest and lowest order terms differ by 1

\( D_n = aD_{n-1}\)

\( D_n = D_{0} a^{n}\)

Iterate the recurrence relation to find the closed form for \(D_n\)

Second order difference equations

A difference equation such that the highest and lowest order terms differ by 2

\( D_n = aD_{n-1} + bD_{n-2}\)

Substituting the trial solution \( D_n = k^{n}\) gives the characteristic equation \(k^{2} = ak + b\); the general solution is a linear combination of powers of its roots (see the example below)
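
For example (illustrative numbers), for \( D_n = 5D_{n-1} - 6D_{n-2} \):

\( k^{2} = 5k - 6 \implies k \in \{2, 3\} \implies D_n = A \cdot 2^{n} + B \cdot 3^{n} \)

where \(A, B\) are fixed by the initial conditions \(D_0, D_1\).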

Transition probability

Probability of the chain being in state \(x_j\) at time \(t\), given that it was in state \(x_i\) at time \(s\)

\(\text{Pr} _{ij}(t,s) = \text{Pr} (X_{t} = x_j | X_{s} = x_i)\)

Homogeneous transition probability

Transition probability that depends only on the time difference \(h\) between the conditioned RV and the observed RV

\(\text{Pr} _{ij}(h) = \text{Pr} (X_{t+h} = x_j | X_{t} = x_i)\)

Probability vector

Vector representing each of the probabilities of a Markov chain realizing each value at time \(t\)

\( \textbf{p}(t) = \begin{bmatrix} \text{Pr}(X_t =x_1 ) \\ \text{Pr}(X_t =x_2 ) \\ \vdots \\ \text{Pr}(X_t =x_n ) \end{bmatrix} \)

\( \sum_{i=1}^{n} p_i = 1\)

Transition matrix

Square matrix of all homogeneous transition probabilities

\( \textbf{P}(s) = \begin{bmatrix} \text{Pr}_{11}(s) & \text{Pr}_{12}(s) & \cdots & \text{Pr}_{1n}(s) \\ \text{Pr}_{21}(s) & \text{Pr}_{22}(s) & \cdots & \text{Pr}_{2n}(s) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Pr}_{n1}(s) & \text{Pr}_{n2}(s) & \cdots & \text{Pr}_{nn}(s) \end{bmatrix} \)

\(\sum_{j=1}^{n} \text{Pr}_{ij}(s) = 1 \)

\( \textbf{P}^{T}(s) \textbf{p}(t) = \textbf{p}(t+s)\)
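
A minimal sketch (assuming NumPy; the 2-state matrix is illustrative, not from the notes) of evolving a probability vector with the transition matrix:

```python
# Minimal sketch (illustrative 2-state chain): evolve p(t) with p(t+1) = P^T p(t),
# where row i of P holds the transition probabilities out of state i.
import numpy as np

P = np.array([[0.9, 0.1],    # Pr_11, Pr_12
              [0.5, 0.5]])   # Pr_21, Pr_22  (each row sums to 1)
p = np.array([1.0, 0.0])     # p(0): start in state 1 with certainty

for t in range(50):
    p = P.T @ p              # one step: p(t+1) = P^T p(t)

print(p)                     # approaches the equilibrium distribution pi
```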

State diagram

Diagram representing possible values as state nodes, with directed edges weighted by the corresponding elements of the transition matrix

Absorbing state

State quality such that the Markov chain will never leave that state once that state is reached

\( x_i \text{ is absorbing } \iff \text{Pr}_{ii}(s)=1\)

Accessible states

Pair of states where one may transition to the other given some time

\( x_j \leftarrow x_i \iff \exists s ( \text{Pr}_{ij}(s) \gt 0 ) \)

Communicable states

States that can access one another

\( x_j \leftrightarrow x_i \iff x_j \leftarrow x_i \land x_i \leftarrow x_j\)

Irreducible Markov Chain

Markov chain where all states communicate

\((X_t) \text{ is irreducible } \iff \forall i,j ( x_j \leftrightarrow x_i )\)

Ergodic Markov Chain

Markov chain with all states aperiodic and recurrent (persistent); alternatively, in the limit the probability of being in each state approaches an equilibrium distribution that is independent of the initial probabilities

\( (X_t) \text{ is ergodic } \iff \forall i [ x_i \text{ is aperiodic and persistent} ] \)

\( (X_t) \text{ is ergodic } \iff \exists \boldsymbol{\pi} ( \forall \textbf{p}(0) [ \lim_{t \to \infty} \textbf{p}(t) = \boldsymbol{\pi} ] ) \)

\( (X_t) \text{ is ergodic } \implies \exists! \boldsymbol{\pi} ( \lim_{t \to \infty} \textbf{p}(t) = \boldsymbol{\pi} ) \)

\( (X_t) \text{ is a Markov chain with finite states } \land \exists s_0 [ \min_{i,j} \text{Pr}_{ij}(s_0) \gt 0 ] \implies (X_t) \text{ is ergodic} \)

Recurrent state

State quality such that the chain will certainly cycle back to that state after ample time

\( x_i \text{ is recurrent } \iff \text{Pr}(\exists s \gt 0 [ X_{t+s}=x_i] \mid X_{t}=x_i) = 1\)

Transient state

State quality such that there exists a non-zero probability that the Markov chain will never return to that state

\( x_i \text{ is transient } \iff \text{Pr}(\exists s \gt 0 [ X_{t+s}=x_i] \mid X_{t}=x_i) \lt 1\)

Periodicity Periodicità 周期性

A state has period \(d\) when \(d\) is the GCD of all possible numbers of transitions that can lead back to that state

\( x_i \text{ has period } k_i \iff k_i =\gcd \{ n \gt 0 : \text{Pr}( X_n = x_i | X_0 = x_i ) \gt 0 \} \)

\( x_i \text{ has period } k_i \implies [ \neg ( k_i \mid s ) \implies \text{Pr}_{ii}(s)=0 ]\)

Aperiodicity

\( x_i \text{ is aperiodic } \iff k_i = 1 \)

Equilibrium distribution

As \(t \to \infty\), the probability vector converges elementwise to a fixed vector

\(\lim_{t \to \infty} \textbf{p}(t) = \boldsymbol{\pi}\)

Eigenequation

\( \textbf{P}^{T} \boldsymbol{\pi} = \boldsymbol{\pi} \)
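
A minimal sketch (assuming NumPy, reusing the illustrative 2-state matrix from the earlier sketch) of solving the eigenequation directly:

```python
# Minimal sketch: find pi as the eigenvector of P^T with eigenvalue 1, normalised to sum to 1.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))       # column whose eigenvalue is (numerically) 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                           # normalise so the probabilities sum to 1

print(pi)                                    # equilibrium distribution, e.g. [0.8333..., 0.1666...]
```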