37252 - Regression and Linear Models


Simple Linear Regression Regressione lineare semplice 単純線形回帰

Pearson's correlation coefficient

A statistic that measures the strength and direction of the linear relationship between two random variables

Population

\(\rho = \frac{\text{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}\)

\(\rho \in [-1,1] \)

Sample

\(r = \frac{\text{cov}(X,Y)}{s_{X}s_{Y}}\)

\(r \in [-1,1] \)
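
A minimal R sketch of the sample version, using a small hypothetical paired sample; cor() reproduces the value computed from the definition:

  x <- c(1, 2, 3, 4, 5); y <- c(2.0, 4.1, 5.9, 8.2, 9.7)   # hypothetical paired sample
  r <- cov(x, y) / (sd(x) * sd(y))                          # sample correlation from the definition
  c(by_definition = r, cor_function = cor(x, y))            # cor() computes the same value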

Linear Regression Model

Assuming a linear relationship between random variables, a regression model represents a dependent variable as a linear function of the independent variable plus an error component. Such a model can be estimated using various techniques.

For some data points \( (X_i , Y_i) : i \in \mathbb{N} \land i \in [1,n]\)

Ideal model

\( Y_i = \beta_0 + \beta_{1}X_i + \epsilon_i \)

The coefficients for the independent variables are called beta coefficients

Note that \(E(Y_i|X_{i}=x) = \beta_0 +\beta_{1}x\)

Estimated model

Given a sample, the coefficients are interpreted as random variables since they vary based on the sample

\( \hat{Y}_i = b_0 + b_{1}X_i \)

The estimated coefficients for the independent variables are called estimated beta coefficients; although they appear as constants in the fitted model, they are interpreted as random variables because they vary from sample to sample

Method of least squares metodo dei minimi quadrati 最小二乗法

Class of optimization methods for finding estimated beta coefficients that minimize \(\sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 \) (the sum of squared residuals)

Ordinary Least Squares (OLS)

Basic method of least squares implementation for a model under Gauss-Markov assumptions.

Derivation

Since each \( |Y_i - \hat{Y_i}| \) is difficult to apply calculus to, the squared residuals \( (Y_i - \hat{Y_i})^2\) are used instead; squaring keeps each term non-negative while making the objective differentiable (minimizing the sum of absolute residuals would give a different estimator, least absolute deviations)

Now minimizing \( \text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2\) requires finding the minima with respect to both estimated beta coefficients, so

\( (b_0,b_1) : \frac{\partial}{\partial b_0} \text{SSE} = 0 \land \frac{\partial}{\partial b_1} \text{SSE} = 0\)

Calculating the derivatives and setting them to zero gives the normal equations, whose solution is \( b_1 = \frac{s_{XY}}{s_{XX}}, \quad b_0 = \overline{Y} - b_{1}\overline{X} \)
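
A minimal R sketch of these estimates (assuming a small hypothetical data frame df with columns x and y), checked against R's own OLS routine lm():

  df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))   # hypothetical data
  sxx <- sum((df$x - mean(df$x))^2)                         # s_XX
  sxy <- sum((df$x - mean(df$x)) * (df$y - mean(df$y)))     # s_XY
  b1 <- sxy / sxx                                           # slope estimate
  b0 <- mean(df$y) - b1 * mean(df$x)                        # intercept estimate
  fit <- lm(y ~ x, data = df)                               # same model fitted by R
  c(b0, b1); coef(fit)                                      # identical estimates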

Error Errore 測定誤差

The difference between true value and observed value

\( \epsilon_i = Y_i - \beta_1 X_i - \beta_0\)

Residual Residuo 計算誤差

The difference between an estimated value and observed result

\( \hat{\epsilon_i} = Y_i - \hat{Y_i}\)

Gauss-Markov assumptions

The errors are independent with mean \(0\) and constant variance; commonly strengthened to \( \epsilon_i \overset{\text{iid}}{\sim} \text{N}(0,\sigma^2)\)

Gauss-Markov theorem

Note that, across repeated samples of observed values, the estimated beta coefficients can be interpreted as random variables \(b \sim \text{N}(\beta, \sigma^2_{\beta})\); under the Gauss-Markov assumptions the OLS estimators are the best linear unbiased estimators, and in particular

\( \text{Gauss-Markov assumptions } \implies E(b_{i}) = \beta_{i} \)

Prediction interval

Interval that, at a chosen confidence level, is believed to contain a new observation of the random variable

Distributions of estimated coefficients

\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)

Population variance

\( \sigma_{b_0}^2 = \sigma^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)

\( \sigma_{b_1}^2 = \frac{\sigma^2}{s_{XX}} \)

Sample variance estimate

\( s_{b_0}^2 = s^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)

\( s_{b_1}^2 = \frac{s^2}{s_{XX}} \)

\( s^2 = \frac{\text{SSE}(b_0,b_1)}{n-2} \)

Significance

Property of the response variable that all of its beta coefficients contribute to the model (i.e., are non-zero), confirmed by an F-test (for a simple regression model with one predictor, the T-test on the slope is equivalent to the F-test)

\(Y \text{ is significant } \iff \forall \beta_i, \beta_i \neq 0\)

Significance Z-test

\( Z_{b_{i}} = \frac{b_i - \beta_{i}}{ \sigma_{b_i} }\)

Prediction interval

\( \text{PI} = [(\hat{Y}|X_i=x) - z_{\frac{\alpha}{2}} \sigma \sqrt{ 1 + \frac{1}{n} + \frac{(x - \overline{X})^2}{ s_{XX} }}, (\hat{Y}|X_i=x) + z_{\frac{\alpha}{2}} \sigma \sqrt{1 + \frac{1}{n} + \frac{(x - \overline{X})^2}{ s_{XX} }}] \)

Significance T-test

\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)

\(\nu = n-2\)

Prediction interval

\( \text{PI} = [(\hat{Y}|X_i=x) - t_{\frac{\alpha}{2},\nu} s \sqrt{1 + \frac{1}{n} + \frac{(x - \overline{X})^2}{ s_{XX} }}, (\hat{Y}|X_i=x) + t_{\frac{\alpha}{2}, \nu} s \sqrt{1 + \frac{1}{n} + \frac{(x - \overline{X})^2}{ s_{XX} }}] \)
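
A short R sketch (on a small hypothetical data frame) of the corresponding prediction interval; predict() with interval = "prediction" applies this formula:

  df <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 8.1, 9.8))   # hypothetical data
  fit <- lm(y ~ x, data = df)
  predict(fit, newdata = data.frame(x = 3.5), interval = "prediction", level = 0.95)
  predict(fit, newdata = data.frame(x = 3.5), interval = "confidence", level = 0.95)  # narrower interval for the mean response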

ANOVA and residual analysis

Types of variation

Sum of Squares Error

Sum of squared differences between observed values and fitted values (i.e., the squared residuals)

\(SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n} \hat{\epsilon_i}^2\)

Sum of Squares Regression

Sum of squared differences between fitted values and the mean of the response variable

\(SSR = \sum_{i=1}^{n} (\hat{Y_i} - \overline{Y})^2\)

Sum of Squares Total

Sum of squared differences between observed values and the mean of the response variable

\(SST = SSE + SSR = \sum_{i=1}^{n} (Y_i - \overline{Y})^2\)

Coefficient of determination (R-Squared)

A ratio that gives the proportion of the total variation attributed to the regression model rather than noise

\(R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\)

F-test

Test statistic \(F \sim \text{F}(1,\nu) \) for the predictor \(\beta_1\), testing the null hypothesis that it does not contribute to the model, used when the population variance is unknown:

\(H_0 : \beta_{1} = 0\)

\(H_1 : \beta_{1} \neq 0\)

\(\nu = n-2\)

\(F = \frac{SSR}{s^2}\)

Reject \(H_0\) if \(F \gt f_{\alpha}\)

\(p = \text{Pr}(F \gt f)\)
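
A minimal R sketch of the decomposition and the F-test on hypothetical simulated data; anova() and summary() report the same quantities:

  set.seed(1)
  df <- data.frame(x = 1:20); df$y <- 3 + 2 * df$x + rnorm(20)     # hypothetical data
  fit <- lm(y ~ x, data = df)
  sse <- sum(residuals(fit)^2)                                     # SSE
  sst <- sum((df$y - mean(df$y))^2)                                # SST
  ssr <- sst - sse                                                 # SSR, by SST = SSE + SSR
  c(R2 = ssr / sst, F = ssr / (sse / (nrow(df) - 2)))              # R^2 and F = SSR / s^2
  anova(fit); summary(fit)$r.squared                               # same values from R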

Leverage

The squared distance of a sampled value of the independent variable from the sample mean, relative to \(s_{XX}\), offset by \(\frac{1}{n}\) to account for the fact that smaller samples carry more uncertainty and therefore more leverage

\( h_{i} = \frac{1}{n} + \frac{ (X_{i} - \overline{X})^2 }{s_{XX}} \)

\(n\) is the sample size

\(X_i\) is the sampled value of the independent random variable

\(h_i\) is the leverage for the \(i\)th sampled value

\(s_{XX} = \sum_{i=1}^{n}(X_i - \overline{X})^2\) is the corrected sum of squares of the independent random variable

Influence

Quantity describing a point's contribution in influencing the values of calculated estimated beta coefficients \(b_i\)

When finding the \(b_i\), different values of \(Y_i\) will have more 'power' in influencing these coefficients.

Cook's D

A test statistic to determine the influence of a particular observation \((X_i, Y_i)\)

\( D_{i} = \frac{1}{m} \frac{h_{ii}}{1-h_{ii}} \hat{t_{i}}^2 \)

\( D_{i} \gt \frac{4}{n-m-1} \hookrightarrow \text{influential} \)

\( D_{i} \gt \frac{4}{n} \hookrightarrow \text{influential (R program)} \)

DFITS

A test statistic to determine the influence of a particular observation \((X_i, Y_i)\)

\( \text{DFITS}_{i} = \hat{d_i} \sqrt{\frac{h_{ii}}{1-h_{ii}}} \)

\( |\text{DFITS}_{i}| \gt 2 \sqrt{\frac{m+1}{n-m-1}} \hookrightarrow \text{influential} \)
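
A short R sketch of these diagnostics on hypothetical data with one deliberately high-leverage point; the functions used are part of base R's stats package:

  set.seed(2)
  df <- data.frame(x = c(1:19, 40)); df$y <- 1 + 0.5 * df$x + rnorm(20)   # last point has high leverage
  fit <- lm(y ~ x, data = df)
  h  <- hatvalues(fit)                         # leverages h_i
  ti <- rstandard(fit)                         # internally studentized residuals
  di <- rstudent(fit)                          # externally studentized residuals
  cd <- cooks.distance(fit)                    # Cook's D
  dfits <- dffits(fit)                         # DFITS
  which(cd > 4 / nrow(df))                                            # influential by the 4/n rule
  which(abs(dfits) > 2 * sqrt((1 + 1) / (nrow(df) - 1 - 1)))          # DFITS cutoff with m = 1 predictor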

Durbin-Watson test

Test for serial residual correlation, comparing the sum of squared differences between adjacent residuals to the sum of squared residuals.

\(dw = \frac{\sum^{n}_{i=2} (\hat{\epsilon_{i}} - \hat{\epsilon_{i-1}})^2 }{\sum^{n}_{i=1} \hat{\epsilon_{i}}^2}\)

\(dw \in [0,4]\)

\(dw = 2 \implies \text{no residual correlation}\)
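
A minimal R sketch computing the statistic directly on hypothetical data; the lmtest package's dwtest() (if installed) also reports a p-value:

  set.seed(3)
  df <- data.frame(x = 1:30); df$y <- 1 + df$x + rnorm(30)   # hypothetical data
  e  <- residuals(lm(y ~ x, data = df))
  sum(diff(e)^2) / sum(e^2)                                  # dw; values near 2 suggest no serial correlation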

Probability-Probability plot (PP plot)

Plot of points \((x,y)\) where \(x\) is the theoretical cumulative probability of each ordered residual and \(y\) is its empirical cumulative probability; points close to the diagonal indicate the residuals follow the assumed distribution

QQ plot

Plot of points \((x,y)\) where \(x\) is a theoretical (normal) quantile and \(y\) is the corresponding sample quantile of the residuals; points close to a straight line indicate normality

Shapiro-Wilk test

Test for residual normality

\(W = \frac{ \left(\sum_{i=1}^{n} a_{i}X_{(i)}\right)^2 }{ \sum_{i=1}^{n} (X_{i} - \overline{X})^2 } \), where \(X_{(i)}\) are the order statistics and the \(a_i\) are tabulated constants
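
A short R sketch of a normality check of the residuals on hypothetical data, via a QQ plot and the Shapiro-Wilk test:

  set.seed(4)
  df <- data.frame(x = 1:30); df$y <- 2 + df$x + rnorm(30)   # hypothetical data
  fit <- lm(y ~ x, data = df)
  qqnorm(residuals(fit)); qqline(residuals(fit))             # QQ plot of residuals vs normal quantiles
  shapiro.test(residuals(fit))                               # small p-value would suggest non-normal residuals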

Studentized residual

Internally

\(\hat{t_i} = \frac{ \hat{\epsilon_i} }{s\sqrt{1 - h_{i,i}}} \)

Externally

\(\hat{d_i} = \frac{ \hat{\epsilon_i} }{s_{(i)}\sqrt{1 - h_{i,i}}} \), where \(s_{(i)}\) is the residual standard error computed with observation \(i\) removed

Multilinear regression

Multidimensional linear regression model

Ideal model

\( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \epsilon_i \)

\( \textbf{y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \)

\( \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix} \)

\( \text{E}(Y_i|X_{i,j}=x_{j}) = \beta_0 + \sum_{j=1}^{m} \beta_{j}x_j\)

Estimated model

\( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} \)

\( \hat{\textbf{y}} = \textbf{X}\textbf{b} \)

\( \begin{bmatrix} \hat{Y}_1 \\ \vdots \\ \hat{Y}_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} b_0 \\ \vdots \\ b_m \end{bmatrix} \)

Multidimensional Ordinary Least Squares (MOLS)

\( \textbf{b} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}\textbf{y} \)

Derivation

\( \text{SSE} (\textbf{b}) = \sum_{i=1}^{n} (Y_i - b_0- \sum_{j=1}^{m} b_{j}X_{i,j})^2\)

Then by the previous reasoning of optimisation:

\( \textbf{b} : \forall j \leq m, \frac{\partial}{\partial b_j} \text{SSE} (\textbf{b}) = 0\)

Deriving and setting to zero gives:

Like how \(s_{XX}b_1=s_{XY}\) in the simple case, in the multidimensional case one has the normal equations \(\textbf{X}^{T}\textbf{X}\textbf{b}=\textbf{X}^{T}\textbf{y}\)
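
A minimal R sketch solving the normal equations directly on hypothetical data and comparing with lm():

  set.seed(5)
  n <- 50
  X1 <- rnorm(n); X2 <- rnorm(n)
  y  <- 1 + 2 * X1 - 3 * X2 + rnorm(n)                 # hypothetical data
  X  <- cbind(1, X1, X2)                               # design matrix with intercept column
  b  <- solve(t(X) %*% X, t(X) %*% y)                  # b = (X'X)^{-1} X'y
  cbind(b, coef(lm(y ~ X1 + X2)))                      # identical estimates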

Distributions of estimated coefficients

\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)

Sample error variance estimate

\( s^2 = \frac{\text{SSE}(\textbf{b})}{n-m-1} \)

Multidimensional T-test

\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)

\(\nu = n-m-1\)

Note that in practice \(s_{b_i}\) is read directly from the R regression output (the coefficient table of the fitted model)

Multidimensional F-test

\( F = \frac{\text{MSR}}{\text{MSE}}\)

\( \text{MSR} = \frac{\text{SSR}}{m}\)

\( \text{MSE} = \frac{\text{SSE}}{n-m-1}\)

Adjusted coefficient of determination

Because adding independent variables can only increase \(R^2\), an adjusted COD is used

\(R^2_{\text{adj}} = 1- (1-R^2) \frac{n-1}{n-m-1}\)

Note how this adjustment inflates the unexplained proportion \(1-R^2 = \frac{SSE}{SST}\) by the ratio of the degrees of freedom \(\frac{n-1}{n-m-1}\), penalising models with more predictors
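
A short R sketch on hypothetical data; summary() reports the coefficient table (with the \(s_{b_i}\) and T-tests), \(R^2\), adjusted \(R^2\) and the overall F-test:

  set.seed(6)
  d <- data.frame(X1 = rnorm(40), X2 = rnorm(40)); d$y <- 1 + 2 * d$X1 + rnorm(40)   # hypothetical data
  fit <- lm(y ~ X1 + X2, data = d)
  summary(fit)                                                   # coefficient table, R^2, adj. R^2, F-test
  1 - (1 - summary(fit)$r.squared) * (40 - 1) / (40 - 2 - 1)     # reproduces summary(fit)$adj.r.squared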

Collinearity

Relationship between predictor variables such that they are not statistically independent

Collinearity causes difficulties in examining the true effect of individual predictor variables on the model, and is hence generally discouraged

The presence of excess collinearity is called an association

Perfect collinearity

Case of collinearity where a linear function perfectly maps the sample values of one predictor variable onto another. In linear algebra terms, the design matrix \(X\) has linearly dependent columns and hence is not of full rank (See Linear Algebra)

\( X \text{ is perfectly collinear } \iff \text{rank}(X) \lt m + 1 \text{ (the number of columns of } X \text{)}\)

\( X \text{ is perfectly collinear } \iff \exists \textbf{X}_i : \textbf{X}_i = \sum_{j \neq i} k_{j}\textbf{X}_j\) for some constants \(k_j\), where the \(\textbf{X}_j\) are the columns of \(X\)

Imperfect collinearity

Case of collinearity established by some test statistic such as VIF

Variance Inflation Factor (VIF)

VIF is used to test collinearity between a suspected collinear predictor \(j\) and the other independent variables. By regressing predictor \(j\) on the remaining predictors and calculating the coefficient of determination \(R^2_{j}\) of that auxiliary model, the following test is used:

\(\text{VIF}_{j} = \frac{1}{1-R_{j}^2}\)

\(\text{VIF}_{j} \gt 5 \hookrightarrow \text{potentially collinear} \)

\(\text{VIF}_{j} \gt 10 \hookrightarrow \text{collinear} \)
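
A minimal R sketch of the auxiliary-regression definition on hypothetical data with two nearly collinear predictors; the car package's vif() (if installed) reports the same values:

  set.seed(7)
  x1 <- rnorm(100); x2 <- x1 + rnorm(100, sd = 0.1); x3 <- rnorm(100)   # x2 nearly collinear with x1
  y  <- 1 + x1 + x3 + rnorm(100)
  r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared     # auxiliary regression of predictor 1 on the others
  1 / (1 - r2_1)                                  # VIF_1, large because x1 and x2 are nearly collinear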

Model selection

Assume there are \(m\) independent variables; combinatorially there are \(2^m\) possible models obtained by omitting subsets of the independent variables, and model selection searches among these for the optimal model.

Null model

Model with no independent variables

Maximal model

Model with all \(m\) independent variables

Selection from all method

  1. Create the \(2^m\) models and calculate some statistic describing goodness of fit (such as \(R^2_{\text{adj}}\); the adjusted version must be used, otherwise this method always chooses the maximal model)
  2. Select the model with the most desirable statistic

Forward selection method

  1. Create the null model, let this be \(\hat{Y}\)
  2. Create all possible models with an additional independent variable
  3. Test for significance for each model (F-test), if:
    • all candidate models are insignificant, then \(\hat{Y}\) is finished.
    • some model is significant, then let the most significant model (T-test) become \(\hat{Y}\). Continue from step 2.

Backwards selection method

  1. Create the maximal model, let this be \(\hat{Y}\)
  2. Create all possible models with an independent variable removed
  3. Test for significance for each model (F-test), if:
    • all candidate models are significant, then \(\hat{Y}\) is finished.
    • some model is insignificant, then let the least significant model (T-test) become \(\hat{Y}\). Continue from step 2.
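
In R, stepwise selection for the two procedures above is commonly done with step(); note that it compares models by AIC rather than by the F/T-tests described here, so this sketch on hypothetical data is only an approximation of those procedures:

  set.seed(8)
  d <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
  d$y <- 1 + 2 * d$x1 + rnorm(50)                            # only x1 truly matters
  null_fit <- lm(y ~ 1, data = d)                            # null model
  full_fit <- lm(y ~ x1 + x2 + x3, data = d)                 # maximal model
  step(null_fit, scope = formula(full_fit), direction = "forward")
  step(full_fit, direction = "backward")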

Categorical predictors

Predictor random variable \(Z\) that represents a categorical state, rather than a continuous numerical value, introducing qualitative measures to a regression model.

Let \(Z\) have \(M\) possible states (including the state representing 'no state'); then there exist binary dummy variables \(Z_1, Z_2, ..., Z_{M-1}\) with gamma coefficients \(\gamma_1, \gamma_2, ... ,\gamma_{M-1}\) such that

\(Z_i = \begin{cases} 1 & \text{categorical state employed}\\ 0 & \text{otherwise}\end{cases}\)

  • only one category in \(Z_i\) can be toggled at a given time
  • the case where all categorical states in \(Z_i\) are not employed is the null category

    Population

    With \(m\) continuous independent variables and the \(M-1\) dummy variables, one considers the model:

    \( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \sum_{j=1}^{M-1}\gamma_j Z_j + \epsilon_i\)

    Sample

    \( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} + \sum_{j=1}^{M-1} g_j Z_j\)

    Interaction Interazione 交互作用

    Categorical predictors do not necessarily affect a model independently of the continuous predictors, so in some cases a dummy variable may be coupled with an independent variable by multiplication; this product term is called an interaction. An interaction term is included when its effect is statistically significant, and the phenomenon is not restricted to a numerical and a categorical predictor: two numerical or two categorical predictors may also be bound by an interaction.
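
    A minimal R sketch on hypothetical data: factor() predictors are automatically expanded into \(M-1\) dummy variables, and x * z adds the interaction terms:

      set.seed(9)
      d <- data.frame(x = rnorm(60), z = factor(sample(c("A", "B", "C"), 60, replace = TRUE)))
      d$y <- 1 + 2 * d$x + ifelse(d$z == "B", 3, 0) + rnorm(60)   # hypothetical data
      fit_add <- lm(y ~ x + z, data = d)      # dummy variables zB, zC relative to the null category A
      fit_int <- lm(y ~ x * z, data = d)      # also includes the x:z interaction terms
      summary(fit_add)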

    Partial F-test

    Since a categorical predictor is a set of \(M-1\) binary predictors (in contrast to numerical predictors), a T-test on a single coefficient is insufficient; instead an F-test restricted to the \(M-1\) desired predictors is used: the partial F-test

    Test statistic \(F_{m-q} \sim \text{F}(m-q,\, n-m-1) \) for a subset of \(m-q\) predictors, testing the null hypothesis that all of their coefficients are zero, when the population variance is unknown:

    \(F_{m-q} = \frac{\text{MSR}_{m-q}}{\text{MSE}}\)

    \(\text{MSR}_{m-q} = \frac{\text{SSR}_{m-q}}{m-q}\)

    \(\text{MSE} = \frac{\text{SSE}}{n-m-1}\)

    Degree of freedom

    \(\nu_1 = m-q, \quad \nu_2 = n-m-1\)

    P-value

    \(p = \text{Pr}(F \gt f)\)

    Hypothesis

    \(H_0 : \beta_{q+1} = \ldots = \beta_{m} = 0\)

    \(H_1 : \exists j \in \{q+1,\ldots,m\} : \beta_{j} \neq 0\)

    Reject \(H_0\) if \(F_{m-q} \gt f_{\alpha}\)
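
    A short R sketch of the partial F-test on hypothetical data, comparing the model without the categorical predictor to the model with its dummy variables:

      set.seed(10)
      d <- data.frame(x = rnorm(60), z = factor(sample(c("A", "B", "C"), 60, replace = TRUE)))
      d$y <- 1 + 2 * d$x + ifelse(d$z == "B", 3, 0) + rnorm(60)
      reduced <- lm(y ~ x, data = d)       # only the continuous predictor (q = 1)
      full    <- lm(y ~ x + z, data = d)   # adds the M - 1 = 2 dummy variables
      anova(reduced, full)                 # partial F-test: are the dummy coefficients jointly zero?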

    Transformations

    Data transformations

    If a bivariate relation follows a non-linear function \(f\), the data can be transformed so that a linear model still applies, by transforming either the predictor or the response:

    \( \hat{Y}_i = b_0 + b_{1}f(X_i) \)

    \( f^{-1}(\hat{Y}_i) = b_0 + b_{1}X_i \)

    Box-Cox transformation

    Power transformation of the response chosen to improve model properties such as normality and constant variance of the residuals:

    \(Y_{i}^{(\lambda)} = \begin{cases} \frac{Y_{i}^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(Y_{i}) & \lambda = 0 \end{cases}\)

    Polynomial linear regression

    Standard

    \( \hat{Y_{i}} = \sum_{j=0}^{m} b_{j}X_{i}^{j}\)

    Categorical

    Heteroskedasticity Eteroschedasticità 分散不均一性

    The situation in which each error term has its own variance rather than a common \(\sigma^2\)

    \( \forall\epsilon_i \sim \text{N}(0,\sigma_{i}^2), \sigma_{i}^2 \propto f(Y_{i}) \)

    Logarithm transformation

    When data demonstrate non-constant variance, one can conduct OLS on the logarithm transformation, then exponentiate to recover the fitted values \(\hat{Y_i}\)

    \( \ln(\hat{Y}_i) = b_0 + b_{1}X_i \to \hat{Y}_i = e^{b_{0}} e^{b_{1}X_{i}}\)

    Covariance matrix

    Square matrix representing the covariance between random variables; note that the Gauss-Markov assumptions correspond to the covariance matrix \(\sigma^2 I\), since independence implies only diagonal entries may be non-zero, and homoskedasticity (constant variance) implies the diagonal entries are all \(\sigma^2\)

    \(\textbf{Q} : q_{ij} = \text{cov}(X_i , X_j)\)

    GLS Matrices

    Let \(\Sigma = \Gamma \Gamma^{T}\)

    \(\mathfrak{y} = \Gamma^{-1}\textbf{y}\)

    \(\mathfrak{X} = \Gamma^{-1}\textbf{X}\)

    \(\mathfrak{e} = \Gamma^{-1}\boldsymbol{\epsilon}\)

    \(\hat{\mathfrak{y}} = \Gamma^{-1}\hat{\textbf{y}}\)

    \(X = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix}\)

    Generalized Least Squares (GLS)

    Method of least squares for models whose errors may be correlated and/or have non-constant variance.

    Relaxing the Gauss-Markov assumptions to \(\boldsymbol{\epsilon} \sim \text{N}(\textbf{0}, \sigma^2 \Sigma)\), where \(\Sigma\) is a known symmetric, positive-definite matrix.

    Since \(\Sigma\) is positive definite, it factors as \(\Sigma = \Gamma \Gamma^{T}\), where \(\Gamma\) is found using Cholesky's algorithm (See Linear Algebra)

    Considering \(\mathfrak{\textbf{y}} = \mathfrak{X}\boldsymbol{\beta} + \mathfrak{e}\), one can see that \(\mathfrak{e} \sim \text{N}(\textbf{0}, \sigma^2 I)\) (since multiplication by \(\Gamma^{-1}\) conserves normality)

    \( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{\textbf{y}} = (X^{T}\Sigma^{-1}X)^{-1}X^{T}\Sigma^{-1}\textbf{y} \)

    Weighted Least Squares (WLS)

    Special case of GLS where:

    \( \Sigma = \begin{bmatrix} \frac{1}{w_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{w_n} \end{bmatrix} \)

    And trivially

    \( \Sigma^{-1} = \begin{bmatrix} {w_1} & 0 & \ldots & 0 \\ 0 & {w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & {w_n} \end{bmatrix} \)

    \( \Gamma = \begin{bmatrix} \frac{1}{\sqrt{w_1}} & 0 & \ldots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{\sqrt{w_n}} \end{bmatrix} \)

    \( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{y} \)
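
    A minimal R sketch of WLS on hypothetical data with variance proportional to \(x\); lm()'s weights argument reproduces the matrix form of the GLS/WLS estimator:

      set.seed(11)
      n <- 40
      x <- runif(n, 1, 10)
      y <- 2 + 3 * x + rnorm(n, sd = sqrt(x))          # error variance grows with x (hypothetical)
      w <- 1 / x                                       # assumed known weights
      fit_wls <- lm(y ~ x, weights = w)                # WLS via lm
      X <- cbind(1, x); Sigma_inv <- diag(w)           # Sigma^{-1} has the weights on the diagonal
      b <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)   # GLS normal equations
      cbind(b, coef(fit_wls))                          # identical estimates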

    Categorical RV Analysis

    Multinomial distribution

    Considering a categorical variable \(Z\) with states \(z_1,...,z_K\), let \(p_k = \text{Pr} (Z = z_k)\) be the probability of being in each state, and let \(N_k = |\{ n \in [1,N] :Z=z_k \}|\) be the number of times state \(z_k\) is drawn in \(N\) independent draws. Then:

    The joint discrete distribution of the category counts after \(N\) draws is the multinomial distribution

    \( (N_1, N_2, ... , N_K) \sim \text{Multinomial}(N,p_1,p_2,...,p_K)\)

    \( N_k \sim \text{bin}(N,p_k)\)

    Two way table

    Table displaying two categorical variables, say \(Z_1, Z_2\), and the frequency of sample results in each combination of a state of \(Z_1\) with a state of \(Z_2\)

    Chi-square Goodness-of-Fit Test

    \( \sum_{k=1}^{K} \frac{ (N_k - E [N_k] )^2 }{ E[N_k] } \sim \chi^2(K-1)\), where \(E[N_k] = N p_k\) under the null hypothesis
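
    A short R sketch with hypothetical observed counts and null probabilities; chisq.test() computes this statistic:

      obs <- c(18, 29, 53)               # hypothetical observed counts for K = 3 categories
      p0  <- c(0.2, 0.3, 0.5)            # category probabilities under the null hypothesis
      chisq.test(obs, p = p0)            # compares obs with E[N_k] = N * p_k, df = K - 1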

    Relative Risk (RR)

    Ratio of two probabilities to denote how many times more likely some event is over another.

    \( \frac{\text{Pr} (A| Z=z_i) }{\text{Pr} (A |Z=z_j ) }\)

    Odds

    Ratio of the probability of an event to the probability of the event not occurring

    \( \text{odds}_{k} = \frac{p_k}{1-p_k} \)

    \( \text{oddsRatio}_{k,j} = \frac{ \text{odds}_{k} }{ \text{odds}_{j} } \)

    Binary Logistic regression

    Binary logistic regression

    Regression model where the response categorical variable has a Bernoulli distribution \(Y \sim \text{Bern}(p)\)

    Since the output of the regression model should lie in \(\{0,1\}\), the approach is to construct a regression model that maps the independent variables to the probability of returning 1. Instead of methods like OLS and GLS, MLE is required.

    Maximum Likelihood Estimation (MLE)

    See Mathematical Statistics

    Logistic regression cannot use OLS or GLS since the response variable is binary. A probability-based regression approach will be used to decide whether the output is 0 or 1.

    This method generates its estimated beta coefficients by maximizing the likelihood function, then composes the resulting linear predictor with a link function, which turns the regression model into a probability function.

    We let probability determine the binary prediction result by letting \(p_i \geq 0.5 \hookrightarrow Y_i = 1\) and \(p_i \lt 0.5 \hookrightarrow Y_i = 0\)

    Likelihood function

    Function to estimate beta coefficients for MLE regression by finding which beta coefficients maximise the function.

    \(\mathcal{L} (\boldsymbol{\beta} |(X_i,Y_i)) = \prod^{n}_{i=1} p(X_i)^{Y_i} (1-p(X_i))^{1-Y_i}\)

    Link function

    Function \(g\) that maps non-linear models to linear regression models, it is a method of applying the techniques of linear regression to generalized response variables

    In the case of binary logistic regression \(g\) maps a CDF (cumulative probabilities) to a regression model generated using MLE

    A variety of choices for logistic link functions exist, including the logit, probit, and complementary log-log links:

    Simple logistic regression

    \(\eta (\textbf{x}) = \beta_0 + \beta_1 x : x \in \{0,1\} \)

    \(g = \text{logit}\)

    \(p(x) = \frac{1}{1+e^{-\beta_0 -\beta_1 x}}\)

    Baseline probability

    Probability when all predictors are set to 0, so \(p(\textbf{0})\). In the above example, \(p(0) = \frac{1}{1+e^{-\beta_0}}\)

    Median Effective Level

    Continuous predictor value that makes a \(50\%\) probability of the categorical response variable being toggled

    \(x : p(x) = \frac{1}{2}\)
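
    A minimal R sketch of a simple logistic fit on hypothetical data; glm() with family = binomial performs the MLE, and for the logit link the median effective level is \(-b_0 / b_1\):

      set.seed(12)
      x <- rnorm(100)
      y <- rbinom(100, 1, 1 / (1 + exp(-(-0.5 + 2 * x))))           # hypothetical Bernoulli responses
      fit <- glm(y ~ x, family = binomial)                          # MLE with the logit link
      coef(fit)                                                     # b0, b1
      predict(fit, newdata = data.frame(x = 0), type = "response")  # baseline probability p(0)
      -coef(fit)[1] / coef(fit)[2]                                  # median effective level: p(x) = 1/2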

    Wald test

    Significance test for a single beta coefficient in a logistic regression model, analogue of a T-test

    Test statistic

    \( Z_{b_i} = \frac{b_{i} - \beta_i}{s_{b_i}} \sim \text{N}(0,1)\)

    Hypothesis

    \(H_0 : \beta_{j} = \beta_{j^{*}}\)

    \(H_1 : \beta_{j} \neq \beta_{j^{*}}\)

    Omnibus test

    Beta coefficient significance test for all predictors in a logistic regression model, analogue of an F-test

    \( L_{m} = -2 \log (\frac{\mathcal{L}(\beta_0)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m)\)

    Hypothesis

    \(H_0 : \beta_{1} = \ldots = \beta_{m} = 0\)

    \(H_1 : \exists j : \beta_{j} \neq 0\)

    Partial omnibus test

    Beta coefficient significance test for a subset of predictors in a logistic regression model, analogue of a partial F-test

    \( L_{m-q} = -2 \log (\frac{\mathcal{L}(\beta_0,...,\beta_q)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m-q)\)

    Hypothesis

    \(H_0 : \beta_{q+1} = \ldots = \beta_{m} = 0\)

    \(H_1 : \exists j \in \{q+1,q+2,...,m\} : \beta_{j} \neq 0\)
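
    A short R sketch of these tests on hypothetical data: summary() gives the Wald z-tests, and anova() with test = "Chisq" gives the (partial) omnibus likelihood-ratio tests:

      set.seed(13)
      d <- data.frame(x1 = rnorm(120), x2 = rnorm(120))
      d$y <- rbinom(120, 1, 1 / (1 + exp(-(0.3 + 1.5 * d$x1))))   # only x1 truly matters
      full    <- glm(y ~ x1 + x2, family = binomial, data = d)
      null    <- glm(y ~ 1,       family = binomial, data = d)
      reduced <- glm(y ~ x1,      family = binomial, data = d)
      summary(full)                          # Wald z-tests for each coefficient
      anova(null, full, test = "Chisq")      # omnibus test
      anova(reduced, full, test = "Chisq")   # partial omnibus test on beta_2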

    Hosmer-Lemeshow test

    A goodness-of-fit test statistic for logistic regression models.

    Hypothesis

    \(H_0 : \text{ predicted probabilities match observed probabilities}\)

    \(H_1 : \text{ predicted probabilities do not match observed probabilities}\)

    Pseudo \(R^2\)

    Logistic regression models lack a coefficient of determination statistic, so in its place there exist pseudo coefficients of determination such as those of Cox & Snell and Nagelkerke

    Pearson residuals

    A residual based test statistic to test for outliers

    \( r_{i} = \frac{Y_i - \hat{p_i}}{\sqrt{\hat{p_i} (1 - \hat{p_i})}} \)

    \( |r_{i}| \gt 2 \implies \text{potential outlier}\)
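
    A minimal R sketch on hypothetical data; residuals() with type = "pearson" returns these residuals for a fitted glm:

      set.seed(14)
      x <- rnorm(80); y <- rbinom(80, 1, plogis(0.2 + x))   # plogis is the logistic CDF
      fit <- glm(y ~ x, family = binomial)
      r_pearson <- residuals(fit, type = "pearson")         # (Y_i - p_hat_i) / sqrt(p_hat_i (1 - p_hat_i))
      which(abs(r_pearson) > 2)                             # flag potential outliers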