A measure that quantifies the strength and direction of a linear relationship between two random variables
\(\rho = \frac{\text{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}\)
\(\rho \in [-1,1] \)
\(r = \frac{\text{cov}(X,Y)}{s_{X}s_{Y}}\)
\(r \in [-1,1] \)
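As a quick numerical check, here is a minimal Python sketch of the sample correlation; the data values are made up purely for illustration.

```python
import numpy as np

# Minimal sketch: sample Pearson correlation r = cov(X, Y) / (s_X * s_Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])               # made-up sample
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

cov_xy = np.cov(X, Y, ddof=1)[0, 1]                    # sample covariance
r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))
print(r, np.corrcoef(X, Y)[0, 1])                      # both agree, and r lies in [-1, 1]
```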
Assuming a linear relationship between random variables, a regression model can represent a dependent variable with an added error component. One can estimate such a model using various techniques.
For some data points \( (X_i , Y_i) : i \in \mathbb{N} \land i \in [1,n]\)
\( Y_i = \beta_0 + \beta_{1}X_i + \epsilon_i \)
The coefficients of the independent variables are called beta coefficients
Note that \(E(Y_i|X_{i}=x) = \beta_0 +\beta_{1}x\)
Given a sample, the coefficients are interpreted as random variables since they vary based on the sample
\( \hat{Y}_i = b_0 + b_{1}X_i \)
The estimated coefficients of the independent variables are called estimated beta coefficients; although they are constants within a fitted model, they are interpreted as random variables since they vary between samples
Class of optimization methods for finding estimated beta coefficients that minimize \(\sum_{i=1}^{n} |Y_i - \hat{Y_i}| \) (sum of absolute residuals)
Basic method of least squares implementation for a model under Gauss-Markov assumptions.
Since each \( |Y_i - \hat{Y_i}| \) is difficult to apply calculus to, we minimize \( (Y_i - \hat{Y_i})^2\) instead, which penalizes the same deviations but is differentiable
Now minimizing \( \text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2\) requires finding the minima with respect to both estimated beta coefficients, so
\( (b_0,b_1) : \frac{\partial}{\partial b_0} \text{SSE} = 0 \land \frac{\partial}{\partial b_1} \text{SSE} = 0\)
Calculating the derivatives and setting them to zero yields the normal equations, whose solution is \( b_1 = \frac{s_{XY}}{s_{XX}} \) and \( b_0 = \overline{Y} - b_{1}\overline{X} \)
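A minimal Python sketch of this closed-form solution, with made-up data for illustration:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])               # made-up sample
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

s_XX = np.sum((X - X.mean()) ** 2)                     # corrected sum of squares of X
s_XY = np.sum((X - X.mean()) * (Y - Y.mean()))
b1 = s_XY / s_XX
b0 = Y.mean() - b1 * X.mean()
Y_hat = b0 + b1 * X                                    # fitted values
```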
The difference between an observed value and the true (population) regression value
\( \epsilon_i = Y_i - \beta_1 X_i - \beta_0\)
The difference between an observed value and its estimated (fitted) value
\( \hat{\epsilon}_i = Y_i - \hat{Y}_i\)
\( \epsilon_i \sim \text{N}(0,\sigma^2)\)
Note that over repeated random samples of observed values, estimated beta coefficients can be interpreted as random variables \(b \sim \text{N}(\beta, \sigma^2_{b})\)
\( \text{Gauss-Markov assumptions } \implies E(b_{i}) = \beta_{i} \)
Interval that, at a chosen confidence level, is believed to contain the true value of the parameter
\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)
\( \sigma_{b_0}^2 = \sigma^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)
\( \sigma_{b_1}^2 = \frac{\sigma^2}{s_{XX}} \)
\( s_{b_0}^2 = s^2 (\frac{1}{n} + \frac{\overline{X}^2}{s_{XX}}) \)
\( s_{b_1}^2 = \frac{s^2}{s_{XX}} \)
\( s^2 = \frac{\text{SSE}(b_0,b_1)}{n-2} \)
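Continuing the sketch above (so \(b_0\), \(b_1\), the fitted values and \(s_{XX}\) are assumed to already exist), the estimated variances can be computed as:

```python
import numpy as np

n = len(Y)
SSE = np.sum((Y - Y_hat) ** 2)
s2 = SSE / (n - 2)                                    # s^2 = SSE / (n - 2)
s2_b1 = s2 / s_XX                                     # estimated variance of b1
s2_b0 = s2 * (1.0 / n + X.mean() ** 2 / s_XX)         # estimated variance of b0
se_b0, se_b1 = np.sqrt(s2_b0), np.sqrt(s2_b1)
```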
Property of the response variable such that its beta coefficients contribute to the model (i.e., they are not all equal to 0), confirmed by an F-test (note that for simple regression models with 1 predictor, the t-test is equivalent to the F-test, since \(F = T^2\))
\(Y \text{ is significant } \iff \exists \beta_i : \beta_i \neq 0\)
\( Z_{b_{i}} = \frac{b_i - \beta_{i}}{ \sigma_{b_i} }\)
\( \text{PI} = [(\hat{Y}|X_i=x) - z_{\frac{\alpha}{2}} \sigma \sqrt{ \frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }},\ (\hat{Y}|X_i=x) + z_{\frac{\alpha}{2}} \sigma \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }}] \)
\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)
\(\nu = n-2\)
\( \text{PI} = [(\hat{Y}|X_i=x) - t_{\frac{\alpha}{2},\nu} s \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }},\ (\hat{Y}|X_i=x) + t_{\frac{\alpha}{2}, \nu} s \sqrt{\frac{n+1}{n} + \frac{(x-\overline{X})^2}{ s_{XX} }}] \)
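A sketch of the t-based prediction interval, reusing the quantities from the previous sketches (assumed already defined); the new point \(x\) and \(\alpha\) are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

alpha, x = 0.05, 3.5                                  # illustrative choices
y_hat_x = b0 + b1 * x
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
half = t_crit * np.sqrt(s2) * np.sqrt(1 + 1.0 / n + (x - X.mean()) ** 2 / s_XX)
pi = (y_hat_x - half, y_hat_x + half)                 # prediction interval at X = x
```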
Sum of squared differences between fitted values and the mean of the response variable
\(SSR = \sum_{i=1}^{n} (\hat{Y_i} - \overline{Y})^2\) represents the squared sum of the differences between the sample model estimates and the mean
Sum of squared differences between observed values and fitted values
\(SSE =\sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum \hat{\epsilon_i}^2\)
Sum of squared differences between observed values and the mean of the response variable
\(SST = SSR + SSE = \sum_{i=1}^{n} (Y_i - \overline{Y})^2\)
A ratio that determines what proportion of the variation is attributed to the regression model rather than noise
\(R^2 = \frac{SSR}{SST}\)
Test statistic \(F \sim \text{F}(1,\nu) \) of predictors \(\beta\) with alternative hypothesis \(\exists\beta: \beta \neq 0\), where the population variance is unknown:
\(H_0 : \beta_{1} = 0\)
\(H_1 : \beta_{1} \neq 0\)
\(\nu = n-2\)
\(F = \frac{SSR}{s^2}\)
\(F \gt f_{\alpha}\)
\(p = \text{Pr}(F \gt f)\)
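A sketch of the sums of squares and the overall F test, again reusing the simple-regression quantities assumed defined above:

```python
import numpy as np
from scipy import stats

SST = np.sum((Y - Y.mean()) ** 2)
SSR = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)        # SST == SSR + SSE up to rounding
R2 = SSR / SST
F = SSR / s2                          # F ~ F(1, n-2) under H0: beta_1 = 0
p = stats.f.sf(F, 1, n - 2)           # p = Pr(F > f)
```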
Ratio of a sampled predictor value's squared distance from the mean to the predictor's corrected sum of squares \(s_{XX}\), offset by \(\frac{1}{n}\) to account for the fact that smaller sample sizes carry more uncertainty and therefore more leverage
\( h_{i} = \frac{1}{n} + \frac{ (X_{i} - \overline{X})^2 }{s_{XX}} \)
\(n\) is the sample size
\(X_i\) is the sampled value of the independent random variable
\(h_i\) is the leverage for the \(i\)th sampled value
\(s_{XX}\) is the corrected sum of squares of the independent random variable, \(\sum_{i=1}^{n}(X_i - \overline{X})^2\)
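A minimal sketch showing that this formula agrees with the diagonal of the hat matrix \(H = X(X^{T}X)^{-1}X^{T}\) (made-up data; the value far from the mean gets the largest leverage):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 10.0])          # made-up sample; 10.0 lies far from the mean
n = len(X)
s_XX = np.sum((X - X.mean()) ** 2)
h = 1.0 / n + (X - X.mean()) ** 2 / s_XX          # leverage of each observation

Xmat = np.column_stack([np.ones(n), X])           # design matrix with intercept
H = Xmat @ np.linalg.inv(Xmat.T @ Xmat) @ Xmat.T  # hat matrix
print(np.allclose(h, np.diag(H)))                 # True: the two definitions agree
```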
Quantity describing a point's contribution in influencing the values of calculated estimated beta coefficients \(b_i\)
When finding the \(b_i\), different values of \(Y_i\) will have more 'power' in influencing these coefficients.
A test statistic to determine the influence of some particular \(X_i\)
\( D_{i} = \frac{1}{m} \frac{h_{ii}}{1-h_{ii}} \hat{t_{i}}^2 \)
\( D_{i} \gt \frac{4}{n-m-1} \hookrightarrow \text{influential} \)
\( D_{i} \gt \frac{4}{n} \hookrightarrow \text{influential (R program)} \)
A test statistic to determine the influence of some particular \(X_i\)
\( \text{DFITS}_{i} = \hat{d_i} \sqrt{\frac{h_{ii}}{1-h_{ii}}} \)
\( |\text{DFITS}_{i}| \gt 2 \sqrt{\frac{m+1}{n-m-1}} \hookrightarrow \text{influential} \)
Test for serial residual correlation, comparing the sum of squared differences between adjacent residuals to the sum of squared residuals.
\(dw = \frac{\sum^{n}_{i=2} (\hat{\epsilon}_{i} - \hat{\epsilon}_{i-1})^2 }{\text{SSE}}\)
\(dw \in [0,4]\)
\(dw = 2 \implies \text{no residual correlation}\)
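A minimal sketch of the statistic on a made-up residual vector:

```python
import numpy as np

resid = np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2])    # made-up residuals
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)       # lies in [0, 4]; ~2 means no serial correlation
```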
Plot of points \((x,y)\) where
Plot of points \((x,y)\) where
Test for residual normality
\(W = \frac{ (\sum_{i=1}^{n} a_{i}X_{(i)})^2 }{ \sum_{i=1}^{n} (X_{i} - \overline{X})^2 } \)
\(\hat{t_i} = \frac{ \hat{\epsilon_i} }{s\sqrt{1 - h_{i,i}}} \)
\(\hat{d_i} = \frac{ \hat{\epsilon_i} }{s_{(i)}\sqrt{1 - h_{i,i}}} \), where \(s_{(i)}\) is the residual standard error computed with observation \(i\) omitted (the studentized deleted residual)
\( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \epsilon_i \)
\( \textbf{y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \)
\( \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_m \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix} \)
\( \text{E}(Y_i|X_{i,j}=x_{j}) = \beta_0 + \sum_{j=1}^{m} \beta_{j}x_j\)
\( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} \)
\( \hat{\textbf{y}} = \textbf{X}\textbf{b} \)
\( \begin{bmatrix} \hat{Y}_1 \\ \vdots \\ \hat{Y}_n \end{bmatrix} = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix} \begin{bmatrix} b_0 \\ \vdots \\ b_m \end{bmatrix} \)
\( \textbf{b} = (\textbf{X}^{T}\textbf{X})^{-1}\textbf{X}^{T}\textbf{y} \)
\( \text{SSE} (\textbf{b}) = \sum_{i=1}^{n} (Y_i - b_0- \sum_{j=1}^{m} b_{j}X_{i,j})^2\)
Then by the previous reasoning of optimisation:
\( \textbf{b} : \forall j \leq m, \frac{\partial}{\partial b_j} \text{SSE} (\textbf{b}) = 0\)
Differentiating and setting to zero gives:
Just as \(s_{XX}b_1=s_{XY}\) in one dimension, in the multivariate case one has the normal equations \(\textbf{X}^{T}\textbf{X}\textbf{b}=\textbf{X}^{T}\textbf{y}\)
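A minimal sketch of solving the normal equations for a multiple-regression fit on simulated data (the coefficients and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])   # intercept column + m predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # illustrative true coefficients
y = X @ beta_true + rng.normal(scale=0.3, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations X'X b = X'y
y_hat = X @ b
```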
\(b_{i} \sim \text{N}(\beta_i, \sigma_{b_i}^2) \)
\( s^2 = \frac{\text{SSE}(\textbf{b})}{n-m-1} \)
\( T_{b_{i}} = \frac{b_i - \beta_{i}}{s_{b_i}}\)
\(\nu = n-m-1\)
Note that at this point they just expect you to rip \(s_{b_i}\) directly from R
\( F = \frac{\text{MSR}}{\text{MSE}}\)
\( \text{MSR} = \frac{\text{SSR}}{m}\)
\( \text{MSE} = \frac{\text{SSE}}{n-m-1}\)
To account for the inflation of the COD as more independent variables are added, an adjusted COD is used
\(R^2_{\text{adj}} = 1- (1-R^2) \frac{n-1}{n-m-1}\)
Note how this adjusted COD inflates the residual proportion \(1 - R^2 = \frac{SSE}{SST}\) by the ratio of the total to residual degrees of freedom
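Continuing the multiple-regression sketch above (so \(y\), \(\hat{y}\), \(n\), \(m\) are assumed defined), the ordinary and adjusted COD are:

```python
import numpy as np

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSE / SST                                   # equivalently SSR / SST
R2_adj = 1 - (1 - R2) * (n - 1) / (n - m - 1)        # penalizes adding predictors
```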
Relationship between predictor variables such that they are (approximately) linearly related rather than statistically independent
Collinearity causes difficulties in examining the true effect of individual predictor variables on the model, and is hence generally discouraged
The presence of excess collinearity among several predictors is called multicollinearity
Case of collinearity where there is a linear function that perfectly maps sample values of one predictor variable onto another. In linear algebra terms, it means that the \(X\) matrix has linearly dependent columns and hence is not of full rank (See Linear Algebra)
\( X \text{ is perfectly collinear } \iff \text{rank}(X) \lt m+1\)
\( X \text{ is perfectly collinear } \iff \exists \textbf{X}_i : \textbf{X}_i = \sum_{j \neq i} k_{j}\textbf{X}_j\)
Case of collinearity established by some test statistic such as VIF
VIF is used to test collinearity between a suspected collinear variable \(j\) and the other independent variables. By regressing variable \(j\) on the remaining predictors and computing the coefficient of determination \(R^2_{j}\) of that auxiliary model, the following test is used:
\(\text{VIF}_{j} = \frac{1}{1-R_{j}^2}\)
\(\text{VIF}_{j} \gt 5 \hookrightarrow \text{potentially collinear} \)
\(\text{VIF}_{j} \gt 10 \hookrightarrow \text{collinear} \)
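A minimal sketch of computing \(\text{VIF}_j\) by regressing each predictor on the others; `preds` is assumed to be the predictor matrix without the intercept column (e.g. the earlier simulated `X[:, 1:]`):

```python
import numpy as np

def vif(preds):
    """Variance inflation factor for each column of a predictor matrix (no intercept column)."""
    out = []
    for j in range(preds.shape[1]):
        # Auxiliary regression: predictor j on an intercept plus the remaining predictors.
        others = np.column_stack([np.ones(len(preds)), np.delete(preds, j, axis=1)])
        b = np.linalg.lstsq(others, preds[:, j], rcond=None)[0]
        resid = preds[:, j] - others @ b
        r2_j = 1 - np.sum(resid ** 2) / np.sum((preds[:, j] - preds[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)   # values > 5 (or > 10) flag potential collinearity
```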
Assume we have \(m\) independent variables; combinatorics says there are \(2^m\) possible models that can be created by omitting subsets of the independent variables, among which one searches for the optimum model.
Model with no independent variables
Model with all \(m\) independent variables
Predictor random variable \(Z\) that represents a categorical state, rather than a continuous numerical value, introducing qualitative measures to a regression model.
Let \(Z\) have \(M\) possible states (including the baseline state representing 'no state'); then there exist binary-valued dummy variables \(Z_1, Z_2, \ldots, Z_{M-1}\) with gamma coefficients \(\gamma_1, \gamma_2, \ldots ,\gamma_{M-1}\) such that
\(Z_i = \begin{cases} 1 & \text{categorical state employed}\\ 0 & \text{otherwise}\end{cases}\)
If there are \(m\) numerical independent variables and the categorical variable has \(M\) states, then one considers the plane:
\( Y_i = \beta_0 + \sum_{j=1}^{m} \beta_{j}X_{i,j} + \sum_{j=1}^{M-1}\gamma_j Z_{i,j} + \epsilon_i\)
\( \hat{Y}_i = b_0 + \sum_{j=1}^{m} b_{j}X_{i,j} + \sum_{j=1}^{M-1} g_j Z_{i,j}\)
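A minimal sketch of building the \(M-1\) dummy columns from a categorical sample (made-up states; the first state is taken as the baseline):

```python
import numpy as np

Z = np.array(["a", "b", "c", "a", "b"])                 # made-up categorical sample
states = np.unique(Z)                                   # the first state acts as the baseline
dummies = np.column_stack([(Z == s).astype(float) for s in states[1:]])
# dummies has M-1 columns; a row of all zeros means the observation is in the baseline state,
# and these columns are appended to the numerical design matrix before fitting.
```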
Categorical predictors do not necessarily affect a model independently of continuous predictors, and hence in some cases a dummy variable may be coupled with an independent variable by multiplication, called an interaction. This is done when the interaction effect is statistically significant; the phenomenon is not restricted to the interaction of a numerical and a categorical predictor, as two numerical or two categorical predictors may also be bound by an interaction.
Since a categorical predictor is a set of \(M-1\) binary predictors (in contrast to numerical predictors), a single t-test is insufficient; hence an F test on only the \(M-1\) desired predictors is used: the partial F test
Test statistic \(F_{m-q} \sim \text{F}(m-q,\, n-m-1) \) of a partition of predictors \(\beta\) with alternative hypothesis \(\exists\beta: \beta \neq 0\), where the population variance is unknown:
\(F_{m-q} = \frac{\text{MSR}_{m-q}}{\text{MSE}}\)
\(\text{MSR}_{m-q} = \frac{\text{SSR}_{m-q}}{m-q}\)
\(\text{MSE} = \frac{\text{SSE}}{n-m-1}\)
\(\nu_1 = m-q, \quad \nu_2 = n-m-1\)
\(p = \text{Pr}(F \gt f)\)
\(H_0 : \beta_{q+1} = \ldots = \beta_{m} = 0\)
\(H_1 : \exists j \in \{q+1,\ldots,m\} : \beta_{j} \neq 0\)
\(F \gt f_{\alpha}\)
If a bivariate relation follows a non-linear model \(f : X \to Y\), a linear model can be recovered by transforming either the predictor or the response:
\( \hat{Y}_i = b_0 + b_{1}f(X_i) \)
\( f^{-1}(\hat{Y}_i) = b_0 + b_{1}X_i \)
Transformation to improve model factors such as:
\(Y_{i}^{(\lambda)} = \begin{cases} \frac{Y_{i}^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \ln(Y_{i}) & \lambda = 0 \end{cases}\)
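A minimal sketch of the transform as stated above, assuming \(Y_i > 0\):

```python
import numpy as np

def box_cox(Y, lam):
    """Box-Cox transform of a positive response vector for a given lambda."""
    Y = np.asarray(Y, dtype=float)
    return np.log(Y) if lam == 0 else (Y ** lam - 1.0) / lam
```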
\( \hat{Y_{i}} = \sum_{j=0}^{m} b_{j}X_{i}^{j}\)
Concept of a different error variance for each observation \(Y_i\)
\( \forall\epsilon_i \sim \text{N}(0,\sigma_{i}^2), \sigma_{i}^2 \propto f(Y_{i}) \)
When data demonstrates non-constant variance, one can conduct OLS on the logarithm transformation, then exponentiate to recover \(\hat{Y_i}\) on the original scale
\( \ln(\hat{Y}_i) = b_0 + b_{1}X_i \to \hat{Y}_i = e^{b_{0}} e^{b_{1}X_{i}}\)
Square matrix representing the covariance between random variables. Note that the Gauss-Markov assumptions give a covariance matrix \(\sigma^2 I\), since independence implies only the diagonal entries may be nonzero, and homoskedasticity (constant variance) implies the diagonal entries are all \(\sigma^2\)
\(\textbf{Q} : q_{ij} = \text{cov}(X_i , X_j)\)
Let \(\Sigma = \Gamma \Gamma^{T}\)
\(\mathfrak{y} = \Gamma^{-1}\textbf{y}\)
\(\mathfrak{X} = \Gamma^{-1}\textbf{X}\)
\(\mathfrak{e} = \Gamma^{-1}\boldsymbol{\epsilon}\)
\(\hat{\mathfrak{y}} = \Gamma^{-1}\hat{\textbf{y}}\)
\(X = \begin{bmatrix} 1 & X_{1,1} & \ldots & X_{1,m} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n,1} & \ldots & X_{n,m}\end{bmatrix}\)
Method of least squares for models whose errors are correlated and/or have non-constant variance
Relaxing Gauss-Markov assumptions to \(\boldsymbol{\epsilon} \sim \text{N}(\textbf{0}, \sigma^2 \Sigma)\), where \(\Sigma\) is a known matrix that is symmetric and positive definite.
Therefore \(\Sigma = \Gamma \Gamma^{T}\) for some \(\Gamma\), which can be found using Cholesky's algorithm (See Linear Algebra)
Considering \(\mathfrak{\textbf{y}} = \mathfrak{X}\boldsymbol{\beta} + \mathfrak{e}\), one can see that \(\mathfrak{e} \sim \text{N}(\textbf{0}, \sigma^2 I)\) (since multiplication by \(\Gamma^{-1}\) conserves normality)
\( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{\textbf{y}} = (X^{T}\Sigma^{-1}X)^{-1}X^{T}\Sigma^{-1}\textbf{y} \)
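A minimal sketch of this estimator via Cholesky whitening; `X`, `y` and a symmetric positive-definite `Sigma` are assumed to be supplied by the caller:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimate b = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y via Cholesky whitening."""
    Gamma = np.linalg.cholesky(Sigma)            # Sigma = Gamma @ Gamma.T
    Xw = np.linalg.solve(Gamma, X)               # Gamma^{-1} X
    yw = np.linalg.solve(Gamma, y)               # Gamma^{-1} y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw) # OLS on the whitened system
```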
Special case of GLS where:
\( \Sigma = \begin{bmatrix} \frac{1}{w_1} & 0 & \ldots & 0 \\ 0 & \frac{1}{w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{w_n} \end{bmatrix} \)
And trivially
\( \Sigma^{-1} = \begin{bmatrix} {w_1} & 0 & \ldots & 0 \\ 0 & {w_2} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & {w_n} \end{bmatrix} \)
\( \Gamma = \begin{bmatrix} \frac{1}{\sqrt{w_1}} & 0 & \ldots & 0 \\ 0 & \frac{1}{\sqrt{w_2}} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \frac{1}{\sqrt{w_n}} \end{bmatrix} \)
\( \textbf{b} = (\mathfrak{X}^{T}\mathfrak{X})^{-1}\mathfrak{X}^{T}\mathfrak{y} \)
Considering categorical variable \(Z\) with states \(z_1,...,z_K\), let \(p_k = \text{Pr} (Z = z_k)\) be the probability of being in each state, and let \(N_k = |\{ i \in [1,N] : Z_i=z_k \}|\) be the number of times state \(z_k\) is randomly drawn after \(N\) draws. Then:
Discrete distribution of the number of 'successes' obtained when a series of trials have binary possibilities
\( (N_1, N_2, ... , N_K) \sim \text{Multinomial}(N,p_1,p_2,...,p_K)\)
\( N_k \sim \text{bin}(N,p_k)\)
Table displaying two categorical variables, say \(Z_1, Z_2\), and the frequency of sample results falling in each combination of a state of \(Z_1\) with a state of \(Z_2\)
\( \sum_{k=1}^{K} \frac{ (N_k - E [N_k] )^2 }{ E[N_k] } \sim \chi^2\)
Ratio of two probabilities to denote how many times more likely some event is over another.
\( \frac{\text{Pr} (A| Z=z_i) }{\text{Pr} (A |Z=z_j ) }\)
Ratio of the probability of an event to the probability of the event not occurring
\( \text{odds}_{k} = \frac{p_k}{1-p_k} \)
\( \text{oddsRatio}_{k,j} = \frac{ \text{odds}_{k} }{ \text{odds}_{j} } \)
Regression model where the response categorical variable has a Bernoulli distribution \(Y \sim \text{Bern}(p)\)
Since the output of the regression model should be \(\{0,1\}\), the approach is to construct a regression model that maps the independent variables to the probability of returning 1. Instead of methods like OLS and GLS, MLE is required.
See Mathematical Statistics
Logistic regression cannot use OLS or GLS since the response variable is binary. A probability-based regression approach will be used to decide whether the output is 0 or 1.
This method generates its estimated beta coefficients by maximizing the likelihood function and then composes the resulting linear function with a link function, which turns the regression model into a probability function.
We let probability determine the binary prediction result by letting \(p_i \geq 0.5 \hookrightarrow Y_i = 1\) and \(p_i \lt 0.5 \hookrightarrow Y_i = 0\)
Function of the beta coefficients used in MLE regression; the estimated beta coefficients are those that maximize it.
\(\mathcal{L} (\boldsymbol{\beta} |(X_i,Y_i)) = \prod^{n}_{i=1} p(X_i)^{Y_i} (1-p(X_i))^{1-Y_i}\)
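A minimal sketch of maximizing this likelihood (equivalently, minimizing the negative log-likelihood) with a generic optimizer on simulated data; the true coefficients used to simulate are arbitrary illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])    # intercept + one predictor
true_p = 1 / (1 + np.exp(-(-0.5 + 2.0 * X[:, 1])))           # illustrative true coefficients
Y = (rng.random(100) < true_p).astype(float)

def neg_log_lik(beta):
    p = 1 / (1 + np.exp(-(X @ beta)))
    p = np.clip(p, 1e-12, 1 - 1e-12)          # guard against log(0)
    return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

b = minimize(neg_log_lik, np.zeros(X.shape[1]), method="BFGS").x   # estimated beta coefficients
```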
Function \(g\) that maps the mean of the response to a linear regression model; it is a method of applying the techniques of linear regression to generalized response variables
In the case of binary logistic regression, \(g\) maps a probability (the value of a CDF of cumulative probabilities) to the linear regression model generated using MLE
A variety of choices for logistic link functions exist, including:
\(\eta (\textbf{x}) = \beta_0 + \beta_1 x : x \in \{0,1\} \)
\(g = \text{logit}\)
\(p(x) = \frac{1}{1+e^{-\beta_0 -\beta_1 x}}\)
Probability when all predictors are set to 0, so \(p(\textbf{0})\). In the above example, \(p(0) = \frac{1}{1+e^{-\beta_0}}\)
Continuous predictor value at which there is a \(50\%\) probability of the categorical response variable being toggled
\(x : p(x) = \frac{1}{2}\)
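Equivalently, setting the linear predictor to zero, \(\beta_0 + \beta_1 x = 0\), gives \(x = -\frac{\beta_0}{\beta_1}\) (estimated as \(-\frac{b_0}{b_1}\)).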
Singular beta coefficient significance test for logistic regression models, analogue of a T test
\( Z_{b_i} = \frac{b_{i} - \beta_i}{s_{b_i}} \sim \text{N}(0,1)\)
\(H_0 : \beta_{j} = \beta_{j^{*}}\)
\(H_1 : \beta_{j} \neq \beta_{j^{*}}\)
Beta coefficient significance test for all predictors in a logistic regression models, analogue of an F test
\( L_{m} = -2 \log (\frac{\mathcal{L}(\beta_0)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m)\)
\(H_0 : \beta_{1} = .. = \beta_{m} = 0\)
\(H_1 : \exists j : \beta_{j} \neq 0\)
Beta coefficient significance test for a subset of predictors in a logistic regression models, analogue of a partial F test
\( L_{m-q} = -2 \log (\frac{\mathcal{L}(\beta_0,...,\beta_q)}{\mathcal{L}(\beta_0,...,\beta_m)}) \sim \chi^2 (m-q)\)
\(H_0 : \beta_{q+1} = .. = \beta_{m} = 0\)
\(H_1 : \exists j \in \{q+1,q+2,...,m\} : \beta_{j} \neq 0\)
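A minimal sketch of carrying out such a likelihood-ratio comparison, assuming the maximized log-likelihoods of the full and reduced models are supplied:

```python
from scipy import stats

def lr_test(ll_full, ll_reduced, df):
    """Likelihood-ratio statistic -2 log(L_reduced / L_full) and its chi-square p-value."""
    L = -2.0 * (ll_reduced - ll_full)
    return L, stats.chi2.sf(L, df)           # df = number of coefficients set to 0 under H0
```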
A goodness-of-fit test statistic for logistic regression models.
\(H_0 : \text{ predicted probabilities match observed probabilities}\)
\(H_1 : \text{ predicted probabilities do not match observed probabilities}\)
Logistic regression models lack a coefficient of determination statistic, so in its place there exist pseudo coefficients of determination such as those of Cox and Snell, and of Nagelkerke
A residual based test statistic to test for outliers
\( r_{i} = \frac{GP_i - \hat{p_i}}{\sqrt{\hat{p_i} (1 - \hat{p_i})}} \)
\( |r_{i}| \gt 2 \implies \text{potential outlier}\)