A measure that quantifies the strength and direction of a linear relationship between two random variables
\(\rho = \frac{\text{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}\)
\(\rho \in [-1,1] \)
\(r = \frac{\text{cov}(X,Y)}{s_{X}s_{Y}}\)
\(r \in [-1,1] \)
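As a quick numerical check, here is a minimal Python sketch of the sample correlation; the data values are made up purely for illustration.

```python
import numpy as np

# Minimal sketch: sample Pearson correlation r = cov(X, Y) / (s_X * s_Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])               # made-up sample
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

cov_xy = np.cov(X, Y, ddof=1)[0, 1]                    # sample covariance
r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))
print(r, np.corrcoef(X, Y)[0, 1])                      # both agree, and r lies in [-1, 1]
```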
Assuming a linear relationship between random variables, a regression model can represent a dependent variable with an added error component. One can estimate such a model using various techniques.
For some data points \( (X_i , Y_i) : i \in \mathbb{N} \land i \in [1,n]\)
\( Y_i = \beta_0 + \beta_{1}X_i + \epsilon_i \)
The coefficients of the independent variables are called beta coefficients
Note that \(E(Y_i|X_{i}=x) = \beta_0 +\beta_{1}x\)
Given a sample, the coefficients are interpreted as random variables since they vary based on the sample
\( \hat{Y}_i = b_0 + b_{1}X_i \)
The estimated coefficients of the independent variables are called estimated beta coefficients; although they are constants within a fitted model, they are interpreted as random variables since they vary between samples
Class of optimization methods for finding estimated beta coefficients that minimize \(\sum_{i=1}^{n} |Y_i - \hat{Y_i}| \) (sum of absolute residuals)
Basic method of least squares implementation for a model under Gauss-Markov assumptions.
Since each \( |Y_i - \hat{Y_i}| \) is difficult to apply calculus to, we minimize \( (Y_i - \hat{Y_i})^2\) instead, which penalizes the same deviations but is differentiable
Now minimizing \( \text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2\) requires finding the minima with respect to both estimated beta coefficients, so
\( (b_0,b_1) : \frac{\partial}{\partial b_0} \text{SSE} = 0 \land \frac{\partial}{\partial b_1} \text{SSE} = 0\)
Calculating the derivatives and setting them to zero yields the normal equations, whose solution is \( b_1 = \frac{s_{XY}}{s_{XX}} \) and \( b_0 = \overline{Y} - b_{1}\overline{X} \)
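A minimal Python sketch of this closed-form solution, with made-up data for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])               # made-up sample
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

s_XX = np.sum((X - X.mean()) ** 2)                     # corrected sum of squares of X
s_XY = np.sum((X - X.mean()) * (Y - Y.mean()))
b1 = s_XY / s_XX
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X                                    # fitted values
```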
The difference between an observed value and the true (population) regression value
\( \epsilon_i = Y_i - \beta_1 X_i - \beta_0\)
The difference between an observed value and its estimated (fitted) value
\( \hat{\epsilon}_i = Y_i - \hat{Y}_i\)
\( \epsilon_i \sim \text{N}(0,\sigma^2)\)
Note that over repeated random samples of observed values, estimated beta coefficients can be interpreted as random variables \(b \sim \text{N}(\beta, \sigma^2_{b})\)
\( \text{Gauss-Markov assumptions } \implies E(b_{i}) = \beta_{i} \)
Interval that, at a chosen confidence level, is believed to contain the true value of the parameter
\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)
\( \sigma_{b_0}^2 = \sigma^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)
\( \sigma_{b_1}^2 = \frac{\sigma^2}{s_{XX}} \)
\( s_{b_0}^2 = s^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)
\( s_{b_1}^2 = \frac{s^2}{s_{XX}} \)
\( s^2 = \frac{\text{SSE}(b_0,b_1)}{n-2} \)
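Continuing the sketch above (so \(b_0\), \(b_1\), the fitted values and \(s_{XX}\) are assumed to already exist), the estimated variances can be computed as:

```python
import numpy as np

n = len(Y)
SSE = np.sum((Y - Y_hat) ** 2)
s2 = SSE / (n - 2)                                    # s^2 = SSE / (n - 2)
s2_b1 = s2 / s_XX                                     # estimated variance of b1
s2_b0 = s2 * (1.0 / n + X.mean() ** 2 / s_XX)         # estimated variance of b0
se_b0, se_b1 = np.sqrt(s2_b0), np.sqrt(s2_b1)
```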
Property of the response variable such that its beta coefficients contribute to the model (i.e., they are not all equal to 0), confirmed by an F-test (note that for simple regression models with 1 predictor, the t-test is equivalent to the F-test, since \(F = T^2\))
\(Y \text{ is significant } \iff \exists \beta_i : \beta_i \neq 0\)
\( Z_{b_{i}} = \frac{b_i - \beta_{i}}{ \sigma_{b_i} }\)
\( \text{PI} = [(\hat{Y}|X_i=x) - z_{\frac{\alpha}{2}} \sigma \sqrt{ \frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }},\ (\hat{Y}|X_i=x) + z_{\frac{\alpha}{2}} \sigma \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }}] \)
\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)
\(\nu = n-2\)
\( \text{PI} = [(\hat{Y}|X_i=x) - t_{\frac{\alpha}{2},\nu} s \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }},\ (\hat{Y}|X_i=x) + t_{\frac{\alpha}{2}, \nu} s \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }}] \)
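A sketch of the t-based prediction interval, reusing the quantities from the previous sketches (assumed already defined); the new point \(x\) and \(\alpha\) are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

alpha, x = 0.05, 3.5                                  # illustrative choices
y_hat_x = b0 + b1 * x
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
half = t_crit * np.sqrt(s2) * np.sqrt(1 + 1.0 / n + (x - X.mean()) ** 2 / s_XX)
pi = (y_hat_x - half, y_hat_x + half)                 # prediction interval at X = x
```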
Sum of squared differences between fitted values and the mean of the response variable
\(SSR = \sum_{i=1}^{n} (\hat{Y_i} - \overline{Y})^2\) represents the squared sum of the differences between the sample model estimates and the mean
Sum of squared differences between observed values and fitted values
\(SSE =\sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum \hat{\epsilon_i}^2\)
Sum of squared differences between observed values and the mean of the response variable
\(SST = SSR + SSE = \sum_{i=1}^{n} (Y_i - \overline{Y})^2\)
A ratio that determines what proportion of the variation is attributed to the regression model rather than noise
\(R^2 = \frac{SSR}{SST}\)
Test statistic \(F \sim \text{F}(1,\nu) \) of predictors \(\beta\) with alternative hypothesis \(\exists\beta: \beta \neq 0\), where the population variance is unknown:
\(H_0 : \beta_{1} = 0\)
\(H_1 : \beta_{1} \neq 0\)
\(\nu = n-2\)
\(F = \frac{SSR}{s^2}\)
\(F \gt f_{\alpha}\)
\(p = \text{Pr}(F \gt f)\)
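A sketch of the sums of squares and the overall F test, again reusing the simple-regression quantities assumed defined above:

```python
import numpy as np
from scipy import stats

SST = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)        # SST == SSR + SSE up to rounding
R2 = SSR / SST
F = SSR / s2                          # F ~ F(1, n-2) under H0: beta_1 = 0
p = stats.f.sf(F, 1, n - 2)           # p = Pr(F > f)
```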
Ratio of a sampled predictor value's squared distance from the mean to the predictor's corrected sum of squares \(s_{XX}\), offset by \(\frac{1}{n}\) to account for the fact that smaller sample sizes carry more uncertainty and therefore more leverage
\( h_{i} = \frac{1}{n} + \frac{ (X_{i} - \overline{X})^2 }{s_{XX}} \)
\(n\) is the sample size
\(X_i\) is the sampled value of the independent random variable
\(h_i\) is the leverage for the \(i\)th sampled value
\(s_{XX}\) is the corrected sum of squares of the independent random variable, \(\sum_{i=1}^{n}(X_i - \overline{X})^2\)
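A minimal sketch showing that this formula agrees with the diagonal of the hat matrix \(H = X(X^{T}X)^{-1}X^{T}\) (made-up data; the value far from the mean gets the largest leverage):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 10.0])          # made-up sample; 10.0 lies far from the mean
n = len(X)
s_XX = np.sum((X - X.mean()) ** 2)
h = 1.0 / n + (X - X.mean()) ** 2 / s_XX          # leverage of each observation

Xmat = np.column_stack([np.ones(n), X])           # design matrix with intercept
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T  # hat matrix
print(np.allclose(h, np.diag(H)))                 # True: the two definitions agree
```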
Quantity describing a point's contribution in influencing the values of calculated estimated beta coefficients \(b_i\)
When finding the \(b_i\), different values of \(Y_i\) will have more 'power' in influencing these coefficients.
A test statistic to determine the influence of some particular \(X_i\)
\( D_{i} = \frac{1}{m} \frac{h_{ii}}{1-h_{ii}} \hat{t_{i}}^2 \)
\( D_{i} \gt \frac{4}{n-m-1} \hookrightarrow \text{influential} \)
\( D_{i} \gt \frac{4}{n} \hookrightarrow \text{influential (R program)} \)
A test statistic to determine the influence of some particular \(X_i\)
\( \text{DFITS}_{i} = \hat{d_i} \sqrt{\frac{h_{ii}}{1-h_{ii}}} \)
\( |\text{DFITS}_{i}| \gt 2 \sqrt{\frac{m+1}{n-m-1}} \hookrightarrow \text{influential} \)
Test for serial residual correlation, comparing the sum of squared differences between adjacent residuals to the sum of squared residuals.
\(dw = \frac{\sum^{n}_{i=2} (\hat{\epsilon}_{i} - \hat{\epsilon}_{i-1})^2 }{\text{SSE}}\)
\(dw \in [0,4]\)
\(dw = 2 \implies \text{no residual correlation}\)
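A minimal sketch of the statistic on a made-up residual vector:

```python
import numpy as np

resid = np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2])    # made-up residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)       # lies in [0, 4]; ~2 means no serial correlation
```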
Plot of points \((x,y)\) where
Plot of points \((x,y)\) where
Test for residual normality
\(W = \frac{ (\sum_{i=1}^{n} a_{i}X_{(i)})^2 }{ \sum_{i=1}^{n} (X_{i} - \overline{X})^2 } \)
\(\hat{t_i} = \frac{ \hat{\epsilon_i} }{s\sqrt{1 - h_{i,i}}} \)
\(\hat{d_i} = \frac{ \hat{\epsilon_i} }{s_{(i)}\sqrt{1 - h_{i,i}}} \), where \(s_{(i)}\) is the residual standard error computed with observation \(i\) omitted (the studentized deleted residual)
\( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \epsilon_i \)
\( \textbf{y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \)
\( \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix} \)
\( \text{E}(Y_i|X_{i,j}=x_{j}) = \beta_0 + \sum_{j=1}^{m} \beta_{j}x_j\)
\( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} \)
\( \hat{\textbf{y}} = \textbf{X}\textbf{b} \)
\( \begin{bmatrix} \hat{Y}_1 \\ \vdots \\ \hat{Y}_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} b_0 \\ \vdots \\ b_m \end{bmatrix} \)
\( \textbf{b} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}\textbf{y} \)
\( \text{SSE} (\textbf{b}) = \sum_{i=1}^{n} (Y_i - b_0- \sum_{j=1}^{m} b_{j}X_{i,j})^2\)
Then by the previous reasoning of optimisation:
\( \textbf{b} : \forall j \leq m, \frac{\partial}{\partial b_j} \text{SSE} (\textbf{b}) = 0\)
Differentiating and setting to zero gives:
Just as \(s_{XX}b_1=s_{XY}\) in one dimension, in the multivariate case one has the normal equations \(\textbf{X}^{T}\textbf{X}\textbf{b}=\textbf{X}^{T}\textbf{y}\)
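A minimal sketch of solving the normal equations for a multiple-regression fit on simulated data (the coefficients and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # intercept column + m predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # illustrative true coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'y
y_hat = X @ b
```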
\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)
\( s^2 = \frac{\text{SSE}(\textbf{b})}{n-m-1} \)
\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)
\(\nu = n-m-1\)
Note that at this point they just expect you to rip \(s_{b_i}\) directly from R
\( F = \frac{\text{MSR}}{\text{MSE}}\)
\( \text{MSR} = \frac{\text{SSR}}{m}\)
\( \text{MSE} = \frac{\text{SSE}}{n-m-1}\)
To account for the inflation of the COD as more independent variables are added, an adjusted COD is used
\(R^2_{\text{adj}} = 1- (1-R^2) \frac{n-1}{n-m-1}\)
Note how this adjusted COD inflates the residual proportion \(1 - R^2 = \frac{SSE}{SST}\) by the ratio of the total to residual degrees of freedom
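Continuing the multiple-regression sketch above (so \(y\), \(\hat{y}\), \(n\), \(m\) are assumed defined), the ordinary and adjusted COD are:

```python
import numpy as np

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSE / SST                                   # equivalently SSR / SST
R2_adj = 1 - (1 - R2) * (n - 1) / (n - m - 1)        # penalizes adding predictors
```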
Relationship between predictor variables such that they are (approximately) linearly related rather than statistically independent
Collinearity causes difficulties in examining the true effect of individual predictor variables on the model, and is hence generally discouraged
The presence of excess collinearity among several predictors is called multicollinearity
Case of collinearity where there is a linear function that perfectly maps sample values of one predictor variable onto another. In linear algebra terms, it means that the \(X\) matrix has linearly dependent columns and hence is not of full rank (See Linear Algebra)
\( X \text{ is perfectly collinear } \iff \text{rank}(X) \lt m+1\)
\( X \text{ is perfectly collinear } \iff \exists \textbf{X}_i : \textbf{X}_i = \sum_{j \neq i} k_{j}\textbf{X}_j\)
Case of collinearity established by some test statistic such as VIF
VIF is used to test collinearity between a suspected collinear variable \(j\) and the other independent variables. By regressing variable \(j\) on the remaining predictors and computing the coefficient of determination \(R^2_{j}\) of that auxiliary model, the following test is used:
\(\text{VIF}_{j} = \frac{1}{1-R_{j}^2}\)
\(\text{VIF}_{j} \gt 5 \hookrightarrow \text{potentially collinear} \)
\(\text{VIF}_{j} \gt 10 \hookrightarrow \text{collinear} \)
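A minimal sketch of computing \(\text{VIF}_j\) by regressing each predictor on the others; `preds` is assumed to be the predictor matrix without the intercept column (e.g. the earlier simulated `X[:, 1:]`):

```python
import numpy as np

def vif(preds):
    """Variance inflation factor for each column of a predictor matrix (no intercept column)."""
    out = []
    for j in range(preds.shape[1]):
        # Auxiliary regression: predictor j on an intercept plus the remaining predictors.
        others = np.column_stack([np.ones(len(preds)), np.delete(preds, j, axis=1)])
        b = np.linalg.lstsq(others, preds[:, j], rcond=None)[0]
        resid = preds[:, j] - others @ b
        r2_j = 1 - np.sum(resid ** 2) / np.sum((preds[:, j] - preds[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)   # values > 5 (or > 10) flag potential collinearity
```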
Assume we have \(m\) independent variables; combinatorics says there are \(2^m\) possible models that can be created by omitting subsets of the independent variables, among which one searches for the optimum model.
Model with no independent variables
Model with all \(m\) independent variables
Predictor random variable \(Z\) that represents a categorical state, rather than a continuous numerical value, introducing qualitative measures to a regression model.
Let \(Z\) have \(M\) possible states (including the baseline state representing 'no state'); then there exist binary-valued dummy variables \(Z_1, Z_2, \ldots, Z_{M-1}\) with gamma coefficients \(\gamma_1, \gamma_2, \ldots ,\gamma_{M-1}\) such that
\(Z_i = \begin{cases} 1 & \text{categorical state employed}\\ 0 & \text{otherwise}\end{cases}\)
If there are \(m\) numerical independent variables and the categorical variable has \(M\) states, then one considers the plane:
\( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \sum_{j=1}^{M-1}\gamma_j Z_{i,j} + \epsilon_i\)
\( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} + \sum_{j=1}^{M-1} g_j Z_{i,j}\)
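A minimal sketch of building the \(M-1\) dummy columns from a categorical sample (made-up states; the first state is taken as the baseline):

```python
import numpy as np

Z = np.array(["a", "b", "c", "a", "b"])                 # made-up categorical sample
states = np.unique(Z)                                   # the first state acts as the baseline
dummies = np.column_stack([(Z == s).astype(float) for s in states[1:]])
# dummies has M-1 columns; a row of all zeros means the observation is in the baseline state,
# and these columns are appended to the numerical design matrix before fitting.
```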
Categorical predictors do not necessarily affect a model independently of continuous predictors, and hence in some cases a dummy variable may be coupled with an independent variable by multiplication, called an interaction. This is done when the interaction effect is statistically significant; the phenomenon is not restricted to the interaction of a numerical and a categorical predictor, as two numerical or two categorical predictors may also be bound by an interaction.
Since a categorical predictor is a set of \(M-1\) binary predictors (in contrast to numerical predictors), a single t-test is insufficient; hence an F test on only the \(M-1\) desired predictors is used: the partial F test
Test statistic \(F_{m-q} \sim \text{F}(m-q,\, n-m-1) \) of a partition of predictors \(\beta\) with alternative hypothesis \(\exists\beta: \beta \neq 0\), where the population variance is unknown:
\(F_{m-q} = \frac{\text{MSR}_{m-q}}{\text{MSE}}\)
\(\text{MSR}_{m-q} = \frac{\text{SSR}_{m-q}}{m-q}\)
\(\text{MSE} = \frac{\text{SSE}}{n-m-1}\)
\(\nu_1 = m-q, \quad \nu_2 = n-m-1\)
\(p = \text{Pr}(F \gt f)\)
\(H_0 : \beta_{q+1} = \ldots = \beta_{m} = 0\)
\(H_1 : \exists j \in \{q+1,\ldots,m\} : \beta_{j} \neq 0\)
\(F \gt f_{\alpha}\)
If a bivariate relation follows a non-linear model \(f : X \to Y\), a linear model can be recovered by transforming either the predictor or the response:
\( \hat{Y}_i = b_0 + b_{1}f(X_i) \)
\( f^{-1}(\hat{Y}_i) = b_0 + b_{1}X_i \)
Transformation to improve model factors such as:
\(Y_{i}^{(\lambda)} = \begin{cases} \frac{Y_{i}^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(Y_{i}) & \lambda = 0 \end{cases}\)
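A minimal sketch of the transform as stated above, assuming \(Y_i > 0\):

```python
import numpy as np

def box_cox(Y, lam):
    """Box-Cox transform of a positive response vector for a given lambda."""
    Y = np.asarray(Y, dtype=float)
    return np.log(Y) if lam == 0 else (Y ** lam - 1.0) / lam
```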
\( \hat{Y_{i}} = \sum_{j=0}^{m} b_{j}X_{i}^{j}\)
Concept of a different error variance for each observation \(Y_i\)
\( \forall\epsilon_i \sim \text{N}(0,\sigma_{i}^2), \sigma_{i}^2 \propto f(Y_{i}) \)
When data demonstrates non-constant variance, one can conduct OLS on the logarithm transformation, then exponentiate to recover \(\hat{Y_i}\) on the original scale
\( \ln(\hat{Y}_i) = b_0 + b_{1}X_i \to \hat{Y}_i = e^{b_{0}} e^{b_{1}X_{i}}\)
Square matrix representing the covariance between random variables. Note that the Gauss-Markov assumptions give a covariance matrix \(\sigma^2 I\), since independence implies only the diagonal entries may be nonzero, and homoskedasticity (constant variance) implies the diagonal entries are all \(\sigma^2\)
\(\textbf{Q} : q_{ij} = \text{cov}(X_i , X_j)\)
Let \(\Sigma = \Gamma \Gamma^{T}\)
\(\mathfrak{y} = \Gamma^{-1}\textbf{y}\)
\(\mathfrak{X} = \Gamma^{-1}\textbf{X}\)
\(\mathfrak{e} = \Gamma^{-1}\boldsymbol{\epsilon}\)
\(\hat{\mathfrak{y}} = \Gamma^{-1}\hat{\textbf{y}}\)
\(X = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix}\)
Method of least squares for models whose errors are correlated and/or have non-constant variance
Relaxing Gauss-Markov assumptions to \(\boldsymbol{\epsilon} \sim \text{N}(\textbf{0}, \sigma^2 \Sigma)\), where \(\Sigma\) is a known matrix that is symmetric and positive definite.
Therefore \(\Sigma = \Gamma \Gamma^{T}\) for some \(\Gamma\), which can be found using Cholesky's algorithm (See Linear Algebra)
Considering \(\mathfrak{\textbf{y}} = \mathfrak{X}\boldsymbol{\beta} + \mathfrak{e}\), one can see that \(\mathfrak{e} \sim \text{N}(\textbf{0}, \sigma^2 I)\) (since multiplication by \(\Gamma^{-1}\) conserves normality)
\( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{\textbf{y}} = (X^{T}\Sigma^{-1}X)^{-1}X^{T}\Sigma^{-1}\textbf{y} \)
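A minimal sketch of this estimator via Cholesky whitening; `X`, `y` and a symmetric positive-definite `Sigma` are assumed to be supplied by the caller:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimate b = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y via Cholesky whitening."""
    Gamma = np.linalg.cholesky(Sigma)            # Sigma = Gamma @ Gamma.T
    Xw = np.linalg.solve(Gamma, X)               # Gamma^{-1} X
    yw = np.linalg.solve(Gamma, y)               # Gamma^{-1} y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw) # OLS on the whitened system
```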
Special case of GLS where:
\( \Sigma = \begin{bmatrix} \frac{1}{w_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{w_n} \end{bmatrix} \)
And trivially
\( \Sigma^{-1} = \begin{bmatrix} {w_1} & 0 & \ldots & 0 \\ 0 & {w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & {w_n} \end{bmatrix} \)
\( \Gamma = \begin{bmatrix} \frac{1}{\sqrt{w_1}} & 0 & \ldots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{\sqrt{w_n}} \end{bmatrix} \)
\( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{y} \)
Considering categorical variable \(Z\) with states \(z_1,...,z_K\), let \(p_k = \text{Pr} (Z = z_k)\) be the probability of being in each state, and let \(N_k = |\{ i \in [1,N] : Z_i=z_k \}|\) be the number of times state \(z_k\) is randomly drawn after \(N\) draws. Then:
Discrete distribution of the number of 'successes' obtained when a series of trials have binary possibilities
\( (N_1, N_2, ... , N_K) \sim \text{Multinomial}(N,p_1,p_2,...,p_K)\)
\( N_k \sim \text{bin}(N,p_k)\)
Table displaying two categorical variables, say \(Z_1, Z_2\), and the frequency of sample results falling in each combination of a state of \(Z_1\) with a state of \(Z_2\)
\( \sum_{k=1}^{K} \frac{ (N_k - E [N_k] )^2 }{ E[N_k] } \sim \chi^2\)
Ratio of two probabilities to denote how many times more likely some event is over another.
\( \frac{\text{Pr} (A| Z=z_i) }{\text{Pr} (A |Z=z_j ) }\)
Ratio of the probability of an event to the probability of the event not occurring
\( \text{odds}_{k} = \frac{p_k}{1-p_k} \)
\( \text{oddsRatio}_{k,j} = \frac{ \text{odds}_{k} }{ \text{odds}_{j} } \)
Regression model where the response categorical variable has a Bernoulli distribution \(Y \sim \text{Bern}(p)\)
Since the output of the regression model should be \(\{0,1\}\), the approach is to construct a regression model that maps the independent variables to the probability of returning 1. Instead of methods like OLS and GLS, MLE is required.
See Mathematical Statistics
Logistic regression cannot use OLS or GLS since the response variable is binary. A probability-based regression approach will be used to decide whether the output is 0 or 1.
This method generates its estimated beta coefficients by maximizing the likelihood function and then composes the resulting linear function with a link function, which turns the regression model into a probability function.
We let probability determine the binary prediction result by letting \(p_i \geq 0.5 \hookrightarrow Y_i = 1\) and \(p_i \lt 0.5 \hookrightarrow Y_i = 0\)
Function of the beta coefficients used in MLE regression; the estimated beta coefficients are those that maximize it.
\(\mathcal{L} (\boldsymbol{\beta} |(X_i,Y_i)) = \prod^{n}_{i=1} p(X_i)^{Y_i} (1-p(X_i))^{1-Y_i}\)
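A minimal sketch of maximizing this likelihood (equivalently, minimizing the negative log-likelihood) with a generic optimizer on simulated data; the true coefficients used to simulate are arbitrary illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])    # intercept + one predictor
true_p = 1 / (1 + np.exp(-(-0.5 + 2.0 * X[:, 1])))           # illustrative true coefficients
Y = (rng.random(100) < true_p).astype(float)

def neg_log_lik(beta):
    p = 1 / (1 + np.exp(-(X @ beta)))
    p = np.clip(p, 1e-12, 1 - 1e-12)          # guard against log(0)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

b = minimize(neg_log_lik, np.zeros(X.shape[1]), method="BFGS").x   # estimated beta coefficients
```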
Function \(g\) that maps the mean of the response to a linear regression model; it is a method of applying the techniques of linear regression to generalized response variables
In the case of binary logistic regression, \(g\) maps a probability (the value of a CDF of cumulative probabilities) to the linear regression model generated using MLE
A variety of choices for logistic link functions exist, including:
\(\eta (\textbf{x}) = \beta_0 + \beta_1 x : x \in \{0,1\} \)
\(g = \text{logit}\)
\(p(x) = \frac{1}{1+e^{-\beta_0 -\beta_1 x}}\)
Probability when all predictors are set to 0, so \(p(\textbf{0})\). In the above example, \(p(0) = \frac{1}{1+e^{-\beta_0}}\)
Continuous predictor value at which there is a \(50\%\) probability of the categorical response variable being toggled
\(x : p(x) = \frac{1}{2}\)
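Equivalently, setting the linear predictor to zero, \(\beta_0 + \beta_1 x = 0\), gives \(x = -\frac{\beta_0}{\beta_1}\) (estimated as \(-\frac{b_0}{b_1}\)).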
Singular beta coefficient significance test for logistic regression models, analogue of a T test
\( Z_{b_i} = \frac{b_{i} - \beta_i}{s_{b_i}} \sim \text{N}(0,1)\)
\(H_0 : \beta_{j} = \beta_{j^{*}}\)
\(H_1 : \beta_{j} \neq \beta_{j^{*}}\)
Beta coefficient significance test for all predictors in a logistic regression models, analogue of an F test
\( L_{m} = -2 \log (\frac{\mathcal{L}(\beta_0)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m)\)
\(H_0 : \beta_{1} = .. = \beta_{m} = 0\)
\(H_1 : \exists j : \beta_{j} \neq 0\)
Beta coefficient significance test for a subset of predictors in a logistic regression models, analogue of a partial F test
\( L_{m-q} = -2 \log (\frac{\mathcal{L}(\beta_0,...,\beta_q)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m-q)\)
\(H_0 : \beta_{q+1} = .. = \beta_{m} = 0\)
\(H_1 : \exists j \in \{q+1,q+2,...,m\} : \beta_{j} \neq 0\)
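A minimal sketch of carrying out such a likelihood-ratio comparison, assuming the maximized log-likelihoods of the full and reduced models are supplied:

```python
from scipy import stats

def lr_test(ll_full, ll_reduced, df):
    """Likelihood-ratio statistic -2 log(L_reduced / L_full) and its chi-square p-value."""
    L = -2.0 * (ll_reduced - ll_full)
    return L, stats.chi2.sf(L, df)           # df = number of coefficients set to 0 under H0
```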
A goodness-of-fit test statistic for logistic regression models.
\(H_0 : \text{ predicted probabilities match observed probabilities}\)
\(H_1 : \text{ predicted probabilities do not match observed probabilities}\)
Logistic regression models lack a coefficient of determination statistic, so in its place there exist pseudo coefficients of determination such as those of Cox and Snell, and of Nagelkerke
A residual based test statistic to test for outliers
\( r_{i} = \frac{GP_i - \hat{p_i}}{\sqrt{\hat{p_i} (1 - \hat{p_i})}} \)
\( |r_{i}| \gt 2 \implies \text{potential outlier}\)