#sdsc5001


Simple Linear Regression

Basic Setup

Given data $(x_1, y_1), \ldots, (x_n, y_n)$, where:

  • $x_i \in \mathbb{R}$ is the predictor variable (independent variable, input, feature)

  • $y_i \in \mathbb{R}$ is the response variable (dependent variable, output, outcome)

The regression function is expressed as:

$$y = f(x) + \varepsilon$$

The linear regression model assumes:

$$f(x) = \beta_0 + \beta_1 x$$

This is usually considered an approximation of the true relationship.

Example (Attachment Page 2): A simple toy example showing data points and linear fit.

Screenshot 2025-10-06 17.23.26.png

Least Squares Fitting

Parameters are estimated by minimizing the residual sum of squares:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$

The solutions are:

$$\begin{aligned} \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{aligned}$$

Where:

  • $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ is the fitted value

  • $e_i = y_i - \hat{y}_i$ is the residual
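
These closed-form estimates are easy to compute directly; here is a minimal numpy sketch using made-up toy data (the x and y values below are hypothetical):

```python
import numpy as np

# Minimal sketch of the closed-form least squares estimates (toy, made-up data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x   # fitted values
e = y - y_hat                       # residuals
print(beta0_hat, beta1_hat)
```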

Parameter Estimation and Statistical Inference

Model Assumptions

Assume the data generating process is:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$.

Under these assumptions, it can be proven:

  • $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimators of $\beta_0$ and $\beta_1$

    $$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}\right)$$

  • $\hat{\beta}_1$ has the smallest variance among all linear unbiased estimators (it is the best linear unbiased estimator, BLUE)

    $$\hat{\beta}_0 \sim N\left(\beta_0, \left\{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right\}\sigma^2\right)$$

Derivation of Unbiasedness

To prove that $\hat{\beta}_1$ is unbiased, start from its formula:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

where $\bar{x} = \frac{1}{n} \sum x_i$ and $\bar{y} = \frac{1}{n} \sum y_i$.

Substitute $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, and note that $\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}$, where $\bar{\varepsilon} = \frac{1}{n} \sum \varepsilon_i$.

The numerator then becomes $\beta_1 \sum (x_i - \bar{x})^2 + \sum (x_i - \bar{x})\varepsilon_i$ (the terms involving $\bar{\varepsilon}$ vanish because $\sum (x_i - \bar{x}) = 0$), so:

$$\hat{\beta}_1 = \beta_1 + \frac{\sum (x_i - \bar{x})\varepsilon_i}{\sum (x_i - \bar{x})^2}$$

Taking expectation:

$$E[\hat{\beta}_1] = E\left[\beta_1 + \frac{\sum (x_i - \bar{x})\varepsilon_i}{\sum (x_i - \bar{x})^2}\right] = \beta_1 + \frac{\sum (x_i - \bar{x}) E[\varepsilon_i]}{\sum (x_i - \bar{x})^2} = \beta_1$$

because $E[\varepsilon_i] = 0$. Therefore, $\hat{\beta}_1$ is unbiased.

Similarly, for $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, taking expectation:

$$E[\hat{\beta}_0] = E[\bar{y}] - \bar{x} E[\hat{\beta}_1] = (\beta_0 + \beta_1 \bar{x}) - \bar{x} \beta_1 = \beta_0$$

So $\hat{\beta}_0$ is also unbiased.

Practical Significance: Unbiasedness means that over many repeated samples, the average of the estimates will be close to the true parameter value, increasing the reliability of the estimation.
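
A small simulation sketch of this idea, assuming arbitrary true values $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 1$ and a fixed design: averaging $\hat{\beta}_1$ over many simulated samples should land close to the true $\beta_1$.

```python
import numpy as np

# Simulation sketch of unbiasedness: repeatedly sample, estimate, and average.
# The true values below (beta0 = 1, beta1 = 2, sigma = 1) are arbitrary choices.
rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma = 1.0, 2.0, 1.0
x = np.linspace(0, 10, 30)          # fixed design, as in the model assumptions

estimates = []
for _ in range(5000):
    y = beta0_true + beta1_true * x + rng.normal(0, sigma, size=x.size)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

print(np.mean(estimates))           # should be close to beta1_true = 2.0
```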

Derivation of BLUE Property (Gauss-Markov Theorem)

The Gauss-Markov theorem states that in the linear regression model, if the error terms have zero mean, constant variance, and are uncorrelated, then the least squares estimator $\hat{\beta}_1$ has the smallest variance among all linear unbiased estimators.

Consider any linear unbiased estimator $b_1 = \sum_i c_i y_i$, where the $c_i$ are constants. Unbiasedness requires $E[b_1] = \beta_1$, which implies $\sum c_i = 0$ and $\sum c_i x_i = 1$ (by substituting the expression for $y_i$).

The variance is:

$$\text{Var}(b_1) = \text{Var}\left(\sum c_i y_i\right) = \sum c_i^2 \, \text{Var}(y_i) = \sigma^2 \sum c_i^2$$

because $\text{Var}(y_i) = \sigma^2$.

The variance of the least squares estimator $\hat{\beta}_1$ is:

$$\text{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$$

Minimizing $\sum c_i^2$ subject to the two constraints shows that for any other linear unbiased estimator $b_1$, $\text{Var}(b_1) \geq \text{Var}(\hat{\beta}_1)$. This establishes the minimum-variance property of $\hat{\beta}_1$.

Practical Significance: The BLUE property means the OLS estimator is the most precise (minimum variance), thus more efficient in statistical inference, e.g., producing narrower confidence intervals.

Illustrative Example

Suppose we have a simple dataset: house area ($x_i$) and house price ($y_i$). The model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.

  • $\beta_0$ might represent the base price when the area is 0 (though this may not be meaningful in practice, so it is often considered a model offset).

  • $\beta_1$ represents the average increase in price per additional square meter.

  • Unbiasedness: If we collect data multiple times and compute $\hat{\beta}_1$, its average will be close to the true $\beta_1$.

  • BLUE: If we use other linear methods (e.g., weighted least squares) but choose weights inappropriately, the variance might be larger, making the estimation less stable than OLS.

Confidence Intervals

$\sigma^2$ can be estimated unbiasedly by the MSE:

$$\hat{\sigma}^2 = \text{MSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$$

Based on Cochran’s theorem, the confidence intervals for $\beta_0$ and $\beta_1$ are:

$$\hat{\beta}_j \pm t\left(\frac{\alpha}{2}, n-2\right) \cdot \text{se}(\hat{\beta}_j), \quad j = 0, 1$$

Symbol Definitions and Interpretations

  • $\sigma^2$: Variance of the error term, representing the variation in the data not accounted for by the model. It is an unknown constant parameter.

  • $\hat{\sigma}^2$ or MSE: Mean Squared Error, an unbiased estimator of $\sigma^2$. Calculated as $\hat{\sigma}^2 = \text{MSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$, where $n$ is the sample size and $n-2$ is the degrees of freedom (because two parameters, $\beta_0$ and $\beta_1$, are estimated). MSE measures the average squared prediction error of the model.

  • $\text{se}(\hat{\beta}_j)$: Standard error of the estimator $\hat{\beta}_j$, representing the standard deviation of the sampling distribution of $\hat{\beta}_j$. For simple linear regression:

    • $\text{se}(\hat{\beta}_0) = \sqrt{\text{MSE} \cdot \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)}$
    • $\text{se}(\hat{\beta}_1) = \sqrt{\frac{\text{MSE}}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
  • $t\left(\frac{\alpha}{2}, n-2\right)$: The upper $\alpha/2$ quantile of the t-distribution, where $\alpha$ is the significance level (e.g., a 95% confidence level corresponds to $\alpha = 0.05$) and $n-2$ is the degrees of freedom. The t-distribution is used instead of the normal distribution to construct confidence intervals when the population variance is unknown.

Derivation Principle of Confidence Intervals

The derivation of confidence intervals is based on the following steps:

  1. Sampling Distribution: Under the model assumptions (error terms $\varepsilon_i \sim N(0, \sigma^2)$ and independent), the least squares estimators $\hat{\beta}_j$ follow a normal distribution:

    $$\hat{\beta}_j \sim N\left(\beta_j, \text{Var}(\hat{\beta}_j)\right)$$

    where $\text{Var}(\hat{\beta}_j)$ is the variance, which depends on $\sigma^2$.

  2. Variance Estimation: Since $\sigma^2$ is unknown, we use MSE to estimate it. Cochran’s theorem (or related theorems) ensures:

    • $\hat{\sigma}^2 = \text{MSE}$ is independent of $\hat{\beta}_j$.
    • $\frac{(n-2)\,\text{MSE}}{\sigma^2} \sim \chi^2(n-2)$, i.e., it follows a chi-square distribution with $n-2$ degrees of freedom.
  3. t-statistic: Standardizing $\hat{\beta}_j$ gives the t-statistic:

    $$t = \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \sim t(n-2)$$

    This is because:

    $$t = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\text{Var}(\hat{\beta}_j)}} \bigg/ \sqrt{\frac{\text{MSE}}{\sigma^2}} = \frac{N(0,1)}{\sqrt{\chi^2_{n-2} / (n-2)}}$$

    which is exactly the definition of the t-distribution.

  4. Confidence Interval: Based on the properties of the t-distribution:

    $$P\left( -t_{\alpha/2, n-2} \leq \frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \leq t_{\alpha/2, n-2} \right) = 1 - \alpha$$

    Rearranging the inequality gives the confidence interval:

    $$\hat{\beta}_j \pm t_{\alpha/2, n-2} \cdot \text{se}(\hat{\beta}_j)$$

    This means we are $100(1-\alpha)\%$ confident that the true parameter $\beta_j$ lies within this interval.

Practical Significance and Interpretation

Confidence intervals provide a measure of uncertainty for parameter estimates. For example, for a 95% confidence interval for $\beta_1$:

  • Interpretation: If we repeatedly sample many times and compute a confidence interval each time, about 95% of these intervals will contain the true $\beta_1$.

  • Application: If the confidence interval includes zero, it may indicate that the predictor variable has no significant effect on the response variable (though this should be confirmed with a hypothesis test). The width of the interval reflects the precision of the estimate: a narrower interval indicates higher precision.

  • Example: In a house price prediction model, if $\beta_1$ represents the effect of area on price, and its 95% confidence interval is [100, 200], we can say “we are 95% confident that for each additional square meter, the house price increases by an average of 100 to 200 units.”

Illustrative Example

Suppose we have a simple linear regression model predicting test scores ($y$) from study time ($x$). Sample size $n = 20$, and the calculations give:

  • $\hat{\beta}_1 = 5$ (each additional hour of study time increases the score by 5 points on average)

  • $\text{se}(\hat{\beta}_1) = 0.8$

  • $\text{MSE} = 10$, degrees of freedom $n - 2 = 18$

  • For a 95% confidence interval, $\alpha = 0.05$, and from the t-distribution table $t_{0.025, 18} \approx 2.101$

Then the confidence interval for $\beta_1$ is:

$$5 \pm 2.101 \times 0.8 = [3.32, 6.68]$$

This means we are 95% confident that the true effect of study time is between 3.32 and 6.68 points.
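
The same interval can be reproduced with scipy's t-distribution quantile function (the numbers below are taken from the example above):

```python
from scipy import stats

# Reproducing the interval above: 5 +/- t(0.025, 18) * 0.8
beta1_hat, se_beta1, df = 5.0, 0.8, 18
t_crit = stats.t.ppf(1 - 0.05 / 2, df)                               # ~ 2.101
ci = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
print(t_crit, ci)                                                    # ~ (3.32, 6.68)
```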

Hypothesis Testing

Test $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$. Under $H_0$:

$$t_1^* = \frac{\hat{\beta}_1}{\text{se}(\hat{\beta}_1)} \sim t_{n-2}$$

If $|t_1^*| > t\left(\frac{\alpha}{2}, n-2\right)$, reject $H_0$.

Example: Fitted line and confidence band

Screenshot 2025-10-06 22.57.30.png

Multiple Linear Regression

Model Setup

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$$

Matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

Where:

  • $\mathbf{y}$ is the $n \times 1$ response vector

  • $\mathbf{X}$ is the $n \times (p+1)$ design matrix (first column is all 1s)

  • $\boldsymbol{\beta}$ is the $(p+1) \times 1$ parameter vector


Least Squares Estimation

Objective Function

The goal of least squares is to minimize the residual sum of squares:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\right)^2 = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

Formula Meaning: The first expression is the summation form of the residual sum of squares and the second is its matrix form. $\mathbf{y}$ is the response vector, $\mathbf{X}$ is the design matrix, and $\boldsymbol{\beta}$ is the parameter vector to be estimated.

Least Squares Solution

Solving the above optimization problem gives the parameter estimator:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

Statistical Properties:

  • Expectation: $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ (unbiased estimator)
  • Covariance matrix: $\text{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$
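
A minimal numpy sketch of this solution on simulated data (the design, true coefficients, and noise level below are hypothetical); solving the normal equations is preferred over forming the explicit inverse:

```python
import numpy as np

# Sketch: least squares for multiple regression on simulated data
# (the design, true coefficients, and noise level are hypothetical).
rng = np.random.default_rng(1)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column of 1s
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations (X'X) beta = X'y instead of forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```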

Fitted Values and Hat Matrix

Fitted Value Calculation

Using the parameter estimator to get fitted values:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is called the hat matrix.

Hat Matrix Properties:

  • $\mathbf{H}$ is a symmetric idempotent matrix ($\mathbf{H}^2 = \mathbf{H}$)
  • Trace: $\text{tr}(\mathbf{H}) = p + 1$ (the number of parameters)
  • $\mathbf{H}$ projects the response vector $\mathbf{y}$ onto the column space of the design matrix
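
A quick numerical check of these properties on a small hypothetical design matrix:

```python
import numpy as np

# Numerical check of the hat-matrix properties on a small hypothetical design
rng = np.random.default_rng(2)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # explicit inverse, fine for a tiny example

print(np.allclose(H, H.T))             # symmetric
print(np.allclose(H @ H, H))           # idempotent: H^2 = H
print(np.trace(H))                     # ~ p + 1 = 3
```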

Statistical Properties of Fitted Values

$$\mathbb{E}[\hat{\mathbf{y}}] = \mathbf{X}\boldsymbol{\beta}, \qquad \text{cov}(\hat{\mathbf{y}}) = \sigma^2\mathbf{H}$$

Geometric Interpretation: The fitted values $\hat{\mathbf{y}}$ are the orthogonal projection of $\mathbf{y}$ onto the column space of the design matrix.

Residual Properties Analysis

Residual Definition and Expression

The residual vector is defined as the difference between observed and fitted values:

$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$

Statistical Properties of Residuals

$$\mathbb{E}[\mathbf{e}] = \mathbf{0}, \qquad \text{cov}(\mathbf{e}) = \sigma^2(\mathbf{I} - \mathbf{H})$$

Key Understanding:

  • The expectation of the residuals is zero, indicating no systematic bias in the model
  • The covariance matrix of the residuals is not diagonal, indicating correlation between residuals of different observations
  • $\mathbf{I} - \mathbf{H}$ is also a symmetric idempotent matrix, with trace $n - p - 1$

Expectation of Residual Sum of Squares

Derivation of the expected value of the residual sum of squares:

$$\mathbb{E}[\mathbf{e}^\top\mathbf{e}] = \mathbb{E}[\text{tr}(\mathbf{e}^\top\mathbf{e})] = \mathbb{E}[\text{tr}(\mathbf{e}\mathbf{e}^\top)] = \text{tr}(\mathbb{E}[\mathbf{e}\mathbf{e}^\top]) = \text{tr}(\sigma^2(\mathbf{I} - \mathbf{H})) = \sigma^2(n - p - 1)$$

Derivation Explanation:

  • Using the cyclic property of the trace: $\text{tr}(ABC) = \text{tr}(BCA)$
  • $\mathbb{E}[\mathbf{e}\mathbf{e}^\top] = \text{cov}(\mathbf{e}) = \sigma^2(\mathbf{I} - \mathbf{H})$ (since $\mathbb{E}[\mathbf{e}] = \mathbf{0}$)
  • $\text{tr}(\mathbf{H}) = p + 1$, so $\text{tr}(\mathbf{I} - \mathbf{H}) = n - (p + 1)$

Variance Estimation

Mean Squared Error (MSE)

Using the residual sum of squares to estimate the error variance:

$$\hat{\sigma}^2 = \text{MSE} = \frac{\mathbf{e}^\top\mathbf{e}}{n - p - 1}$$

Statistical Meaning:

  • The denominator $n - p - 1$ is the residual degrees of freedom
  • From the derivation above, $\mathbb{E}[\hat{\sigma}^2] = \sigma^2$, so it is an unbiased estimator
  • Used to measure model goodness-of-fit and for statistical inference

Model Evaluation

ANOVA Decomposition

Total Sum of Squares Decomposition

In regression analysis, the total variation can be decomposed into variation explained by regression and residual variation:

$$SS_{TO} = SS_E + SS_R$$

Where:

  • $SS_{TO}$: Total Sum of Squares

  • $SS_E$: Error Sum of Squares

  • $SS_R$: Regression Sum of Squares

Matrix Form Expressions

Total Sum of Squares:

$$SS_{TO} = \mathbf{y}^T\left(\mathbf{I} - \frac{\mathbf{J}}{n}\right)\mathbf{y}$$

Error Sum of Squares:

$$SS_E = \mathbf{e}^T\mathbf{e} = \mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{y}^T\mathbf{y} - \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y}$$

Regression Sum of Squares:

$$SS_R = \mathbf{y}^T\left(\mathbf{H} - \frac{\mathbf{J}}{n}\right)\mathbf{y} = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y} - \frac{(\sum y_i)^2}{n}$$

Symbol Explanation:

  • $\mathbf{J}$ is the $n \times n$ matrix of ones (all elements equal to 1)
  • $\mathbf{H}$ is the hat matrix $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$
  • $\mathbf{I}$ is the identity matrix
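
A short numpy sketch (with simulated data) verifying the decomposition $SS_{TO} = SS_E + SS_R$ numerically from these matrix forms:

```python
import numpy as np

# Numerical check of SS_TO = SS_E + SS_R using the matrix forms above
# (data simulated from an arbitrary linear model).
rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n))
I_n = np.eye(n)

ss_to = y @ (I_n - J / n) @ y
ss_e = y @ (I_n - H) @ y
ss_r = y @ (H - J / n) @ y
print(np.isclose(ss_to, ss_e + ss_r))   # True
```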

Expectation Derivation

Expectation of Error Sum of Squares

$$\mathbb{E}[SS_E] = \sigma^2(n - p - 1)$$

Derivation Explanation:
Since $\mathbb{E}[\mathbf{e}^T\mathbf{e}] = \sigma^2(n - p - 1)$ and $SS_E = \mathbf{e}^T\mathbf{e}$.

Expectation of Total Sum of Squares

$$\mathbb{E}[SS_{TO}] = (n - 1)\sigma^2 + \boldsymbol{\beta}^T\mathbf{X}^T\left(\mathbf{I} - \frac{\mathbf{J}}{n}\right)\mathbf{X}\boldsymbol{\beta}$$

Statistical Meaning:

  • $(n - 1)\sigma^2$: Variation due to random error

  • $\boldsymbol{\beta}^T\mathbf{X}^T\left(\mathbf{I} - \frac{\mathbf{J}}{n}\right)\mathbf{X}\boldsymbol{\beta}$: Systematic variation explained by the model

Expectation of Regression Sum of Squares

$$\mathbb{E}[SS_R] = p\sigma^2 + \boldsymbol{\beta}^T\mathbf{X}^T\left(\mathbf{I} - \frac{\mathbf{J}}{n}\right)\mathbf{X}\boldsymbol{\beta}$$

Statistical Meaning:

  • $p\sigma^2$: Variation due to parameter estimation uncertainty

  • $\boldsymbol{\beta}^T\mathbf{X}^T\left(\mathbf{I} - \frac{\mathbf{J}}{n}\right)\mathbf{X}\boldsymbol{\beta}$: Variation explained by the true regression effect

Coefficient of Determination

$$R^2 = \frac{SS_R}{SS_{TO}} = 1 - \frac{SS_E}{SS_{TO}}$$

Measures the proportion of variation explained by the model; its range is $[0, 1]$.

Adjusted Coefficient of Determination

$$R^2_{adj} = 1 - \frac{SS_E/(n-p-1)}{SS_{TO}/(n-1)}$$

An adjusted measure that accounts for the number of parameters, used for comparing models of different complexity.

Practical Application: ANOVA not only provides a test of the overall significance of the model but also an important basis for model comparison and selection. By decomposing variation from different sources, we can better understand the model’s explanatory power and goodness of fit.

Coefficient of Determination ($R^2$) and Adjusted $R^2$

Coefficient of Determination $R^2$

The coefficient of multiple determination is defined as:

$$R^2 = \frac{SS_R}{SS_{TO}} = 1 - \frac{SS_E}{SS_{TO}}$$

Statistical Meaning:

  • Measures the proportion of total variation in the dependent variable Y explained by the predictor variables X

  • Range [0,1], larger value indicates better model fit

  • Reflects the model’s explanatory power for the data

Example: If $R^2 = 0.85$, it means 85% of the variation in Y is explained by X, while the remaining 15% is attributed to random error.

Limitations of $R^2$

$R^2$ is not suitable for comparing models with different numbers of predictors because:

  • It never decreases as more variables are added to the model

  • Even if irrelevant variables are added, $R^2$ does not decrease

  • May lead to overfitting problems

Adjusted Coefficient of Determination $R^2_a$

To address the limitations of $R^2$, the adjusted coefficient of determination is introduced:

$$R_a^2 = 1 - \frac{SS_E/(n-p-1)}{SS_{TO}/(n-1)} = 1 - \frac{n-1}{n-p-1} \cdot \frac{SS_E}{SS_{TO}}$$

Advantages:

  • Penalizes for the number of variables, avoiding overfitting

  • More suitable for comparing models of different complexity

  • $R^2_a$ increases only if the new variable improves the model sufficiently

Comparison Rule: In model comparison, prefer models with larger $R^2_a$.

F-Test for Linear Models

Hypothesis Test Setup

Test the overall significance of the regression model:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$$

$$H_a: \text{At least one } \beta_k \neq 0 \quad (k \geq 1)$$

Null Hypothesis Meaning: All slope coefficients are simultaneously 0, meaning the predictor variables have no linear effect on the response variable.

F-Test Statistic

$$F^* = \frac{MS_R}{MS_E} = \frac{SS_R/p}{SS_E/(n-p-1)}$$

Statistical Distribution: Under the null hypothesis, $F^* \sim F(p, n-p-1)$

Decision Rule: If $F^* > F(\alpha; p, n-p-1)$, reject the null hypothesis.

Practical Application: The F-test is used to determine if the regression model is overall significant. If H₀ is rejected, it indicates that at least one predictor variable has a significant linear effect on the response variable.
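
A minimal sketch of the F-test computation with scipy, assuming hypothetical values for $SS_R$, $SS_E$, $n$, and $p$:

```python
from scipy import stats

# F-test sketch with hypothetical values for the sums of squares and sizes
ss_r, ss_e = 300.0, 520.0
n, p = 30, 3

f_star = (ss_r / p) / (ss_e / (n - p - 1))
f_crit = stats.f.ppf(0.95, p, n - p - 1)     # F(alpha; p, n-p-1) with alpha = 0.05
p_value = stats.f.sf(f_star, p, n - p - 1)   # upper-tail probability of F*
print(f_star, f_crit, p_value)               # reject H0 if f_star > f_crit
```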

Statistical Inference for Regression Coefficients

Covariance Matrix of Coefficient Estimates

Theoretical covariance matrix:

$$\text{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}$$

Estimated covariance matrix (using MSE instead of the unknown σ²):

$$s^2(\hat{\boldsymbol{\beta}}) = \text{MSE} \cdot (\mathbf{X}^T\mathbf{X})^{-1}$$

t-Test for Individual Coefficients

Hypothesis Test:

$$H_0: \beta_k = 0, \quad H_a: \beta_k \neq 0$$

Test Statistic:

$$t^* = \frac{\hat{\beta}_k - \beta_k}{s(\hat{\beta}_k)} \sim t(n - p - 1), \quad k = 0, \dots, p$$

where $s(\hat{\beta}_k)$ is the square root of the corresponding diagonal element of $s^2(\hat{\boldsymbol{\beta}})$ (the standard error of the coefficient).

Statistical Distribution: Under the normal error assumption, $t^* \sim t(n-p-1)$

Decision Rule: If $|t^*| > t(\alpha/2, n-p-1)$, reject $H_0$.

Confidence Interval Construction

$100(1-\alpha)\%$ confidence interval for $\beta_k$:

$$\hat{\beta}_k \pm t\left(\frac{\alpha}{2}, n-p-1\right) \cdot s(\hat{\beta}_k)$$

Practical Application Example

Overall Model Significance Test

Suppose we have $p = 3$ predictor variables and $n = 30$ observations:

  • Calculated $F^* = 15.2$

  • From the F-distribution table: $F(0.05; 3, 26) \approx 2.98$

  • Since $15.2 > 2.98$, reject H₀; the model is overall significant

Individual Variable Significance Test

Test the significance of the second predictor variable:

  • $\hat{\beta}_2 = 2.5$, $s(\hat{\beta}_2) = 0.8$

  • $t^* = 2.5/0.8 = 3.125$

  • $t(0.025, 26) \approx 2.056$

  • Since $3.125 > 2.056$, $\beta_2$ is significantly different from 0

Confidence Interval Calculation

95% confidence interval for $\beta_2$:

  • $2.5 \pm 2.056 \times 0.8 = [0.855, 4.145]$
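
The critical values and the interval above can be reproduced with scipy (using the same hypothetical numbers):

```python
from scipy import stats

# Reproducing the critical values and interval from the example above
p, n = 3, 30
df = n - p - 1                                # 26

f_crit = stats.f.ppf(0.95, p, df)             # ~ 2.98
t_crit = stats.t.ppf(0.975, df)               # ~ 2.056

beta2_hat, se_beta2 = 2.5, 0.8
t_star = beta2_hat / se_beta2                 # 3.125
ci = (beta2_hat - t_crit * se_beta2, beta2_hat + t_crit * se_beta2)  # ~ (0.855, 4.145)
print(f_crit, t_crit, t_star, ci)
```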

Model Diagnostics

Review of Normal Error Assumption Model

Basic form of linear regression model:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \quad i = 1, \ldots, n$$

Model Assumptions:

  1. $\beta_0, \ldots, \beta_p$ are parameters to be estimated

  2. The $x_{ij}$ are treated as fixed constants (non-random variables)

  3. The $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$

Potential Problems and Model Inapplicability

Situations where the linear regression model may not be applicable include:

  1. Nonlinear regression function: The true relationship is not linear

  2. Omission of important predictor variables: The model lacks key explanatory variables

  3. Non-constant error variance: The variance of $\varepsilon$ is not constant (heteroscedasticity)

  4. Dependent error terms: Autocorrelation exists among the $\varepsilon_i$

  5. Non-normal error distribution: $\varepsilon$ does not follow a normal distribution

  6. Presence of outliers: A few extreme observations affect the model

  7. Correlated predictor variables: Multicollinearity problem

Residual Properties and Diagnostic Basics

Definition and Properties of Residuals

Residuals are estimates of the error terms: $e_i = y_i - \hat{y}_i$

Statistical Properties:

  • $\mathbf{e} \sim N(\mathbf{0}, \sigma^2(\mathbf{I} - \mathbf{H}))$

  • Even if the $\varepsilon_i$ are independent, the $e_i$ are not independent (but they are approximately independent in large samples)

  • $\mathbb{E}[e_i] = 0$, $\text{Var}(e_i) = \sigma^2(1 - h_{ii})$

Standardized Residuals

For better model diagnostics, standardized residuals are often used:

Semi-studentized residuals:

$$e_i^* = \frac{e_i}{\sqrt{\text{MSE}}}$$

Studentized residuals (more commonly used):

$$r_i = \frac{e_i}{\sqrt{\text{MSE}(1 - h_{ii})}}$$

Note: Studentized residuals account for the leverage of each observation, which makes them better suited to outlier detection.
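
A numpy sketch computing semi-studentized and studentized residuals from simulated data (the design and coefficients below are hypothetical):

```python
import numpy as np

# Sketch: residual variants on simulated data (design and coefficients are hypothetical)
rng = np.random.default_rng(4)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                          # leverages h_ii
e = y - H @ y                           # residuals e = (I - H) y
mse = e @ e / (n - p - 1)

semi_studentized = e / np.sqrt(mse)
studentized = e / np.sqrt(mse * (1 - h))
print(studentized[:5])
```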

Detection of Nonlinear Regression Function

Diagnostic Methods

  1. Plot a scatter plot of residuals vs. fitted values

    • If the relationship is linear, residuals should be randomly distributed around 0
    • If nonlinear patterns exist, residuals will show systematic trends

Screenshot 2025-10-18 13.37.51.png

  2. Plot scatter plots of residuals vs. each predictor variable

    • Check the relationship between each predictor variable and residuals
    • Systematic patterns indicate incorrect functional form for that variable

Linear Regression Model Diagnostics and Problem Handling

Diagnostics and Handling of Omitted Important Predictor Variables

Diagnostic Methods

Detect by plotting residuals against other predictor variables:

  • If residuals show systematic patterns with a predictor variable not included, it suggests that variable should be included in the model

  • Any non-random patterns in residuals may indicate omission of important variables

Variable Selection Problem

When multiple predictor variables exist, variable selection becomes an important research area:

  • Forward selection: Start with an empty model, add significant variables step by step

  • Backward elimination: Start with the full model, remove insignificant variables step by step

  • Stepwise regression: Combines forward and backward methods

  • Regularization methods: LASSO, Ridge regression, etc.

Practical Advice: Variable selection should combine theoretical guidance and statistical criteria (e.g., AIC, BIC)

Heteroscedasticity (Non-constant Error Variance) Detection

Diagnostic Methods

Check scatter plot of residuals vs. fitted values (attachment page 25):

  • Ideally, all residuals should have roughly the same variability

  • Increasing (or decreasing) residual variability with fitted values indicates heteroscedasticity

  • Since the sign of the residuals is less important for detecting heteroscedasticity, scatter plots of $|e_i|$ or $e_i^2$ vs. $\hat{y}_i$ are often used

Screenshot 2025-10-18 14.28.38.png

Effects of Heteroscedasticity

  • Parameter estimates remain unbiased, but standard error estimates are biased

  • t-tests and F-tests become invalid

  • Confidence intervals and prediction intervals are inaccurate

Model Diagnostics: Error Term Tests

Dependence of Error Terms

In time series or spatial data, check scatter plots of the residuals $e_i$ against time or geographical location:

  • Purpose: Detect if there is correlation between adjacent residuals in the sequence

  • Method: Plot $e_i$ against time or spatial location

  • Ideal situation: Residuals should be randomly distributed, no specific pattern

Screenshot 2025-10-21 22.45.42.png

Non-normality of Error Term

Three methods to test the normality of residuals eie_i:

  1. Distribution Plots

    • Histogram: Observe if the distribution shape is close to a bell curve
    • Box plot: Detect symmetry and outliers
    • Stem-and-leaf plot: Detailed display of data distribution characteristics
  2. Cumulative Distribution Function Comparison

    • Estimate the sample cumulative distribution function
    • Compare with the theoretical normal cumulative distribution function
    • Large deviations indicate non-normality
  3. Q-Q Plot (Quantile-Quantile Plot)

    • Principle: Compare sample quantiles with theoretical normal distribution quantiles
    • Judgment Criterion:
      • Points approximately on a straight line → Support normality assumption
      • Points significantly deviate from the line → Error terms are non-normal
    • Advantage: Sensitive to deviations from normality, good visualization effect

The core of model diagnostics is to verify whether the basic assumptions of linear regression hold, especially the i.i.d. and normality assumptions on the error terms $\varepsilon_i$. These diagnostic tools help identify model defects and provide direction for model improvement.

Outlying Observations

Definition

  • Outlier: An observation significantly separated from the majority of the data

  • Classification:

    • Outlying Y observation (outlier): $y_i$ is far from the model predicted value
    • Outlying X observation (high leverage point): Observation with unusual X values

Screenshot 2025-10-21 22.53.04.png

Detection Methods for Outlying Y Observations

Types of Residuals and Their Definitions

  1. Ordinary Residuals and Semi-studentized Residuals

  • Ordinary residual: $e_i = y_i - \hat{y}_i$

  • Semi-studentized residual: $e_i^* = \frac{e_i}{\sqrt{\text{MSE}}}$

  2. Studentized Residuals

  • Definition: $r_i = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{\text{MSE}(1-h_{ii})}}$

  • Characteristic: Accounts for differences in residual variability

  3. Deleted Residual

  • Definition: $d_i = y_i - \hat{y}_{i(-i)}$

    • $\hat{y}_{i(-i)}$: Model predicted value fitted without the i-th observation
  • Property: $d_i = \frac{e_i}{1-h_{ii}}$

  • Meaning: Simulates prediction error for a new observation

  4. Studentized Deleted Residual

  • Definition: $t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{\text{MSE}_{(-i)}(1-h_{ii})}}$

  • Distribution: $t_i \sim t_{n-p-2}$

  • Calculation formula: $t_i = e_i \left[ \frac{n-p-2}{SSE(1-h_{ii}) - e_i^2} \right]^{1/2}$

Formal Test Methods

  • Test statistic: Compare $|t_i|$ with $t\left(1-\frac{\alpha}{2n}, n-p-2\right)$

  • Bonferroni correction: Adjust significance level to account for multiple testing
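
A sketch of the studentized deleted residuals and the Bonferroni-corrected outlier test on simulated data (one response is artificially shifted to act as a hypothetical outlier):

```python
import numpy as np
from scipy import stats

# Sketch: studentized deleted residuals and the Bonferroni outlier test.
# Simulated data; one response is shifted to act as a hypothetical outlier.
rng = np.random.default_rng(5)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 5.0                                            # injected outlying Y value

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
sse = e @ e

# t_i = e_i * sqrt((n - p - 2) / (SSE (1 - h_ii) - e_i^2))
t_del = e * np.sqrt((n - p - 2) / (sse * (1 - h) - e**2))

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / (2 * n), n - p - 2)   # Bonferroni-adjusted critical value
print(np.where(np.abs(t_del) > t_crit)[0])             # flagged observations
```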

Detection of Outlying X Observations

Leverage

  • Definition: The diagonal elements $h_{ii}$ of the hat matrix $H = X(X^TX)^{-1}X^T$

  • Properties:

    • $0 \leq h_{ii} \leq 1$
    • $\sum_{i=1}^n h_{ii} = \text{tr}(H) = p+1$
  • Meaning: Measures the distance of $x_i$ from the center of all X values

  • Judgment criterion: $h_{ii} > \frac{2(p+1)}{n}$ indicates an outlying X observation (see the sketch below)
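
A short sketch flagging high-leverage observations with this criterion (the design below is hypothetical, with one observation deliberately made unusual in X):

```python
import numpy as np

# Sketch: flag high-leverage points with h_ii > 2(p+1)/n
# (hypothetical design; one row is deliberately made unusual in X).
rng = np.random.default_rng(6)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X[0, 1] += 8.0                                   # unusual X value

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)    # leverages h_ii
threshold = 2 * (p + 1) / n
print(np.where(h > threshold)[0])                # flagged outlying X observations
```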

Outlier detection is an important part of model diagnostics. Outliers (Y anomalies) may be caused by measurement errors, while high leverage points (X anomalies) may have excessive influence on regression results. Through different residual definitions and leverage analysis, these outliers can be systematically identified and handled, improving model robustness.

Multicollinearity

Definition and Examples

  • Multicollinearity: High correlation exists among predictor variables

  • Ideal situation: Predictor variables are independent of each other (“independent variables” in statistics)

  • Examples:

    • $Y \sim X_1(\text{weight}) + X_2(\text{BMI}) + \text{others}$
    • $Y \sim X_1(\text{credit rating}) + X_2(\text{credit limit}) + \text{others}$

Effects of Multicollinearity

  • Variance of regression coefficient estimates becomes very large

  • After deleting one variable, regression coefficients may change sign

  • Marginal significance of predictor variables highly depends on other predictor variables included in the model

  • Significance of predictor variables may be masked by correlated variables in the model

Variance Inflation Factor (VIF)

  • Definition: $(\text{VIF})_j = (1 - R_j^2)^{-1}$ (see the sketch below)

    • where $R_j^2$ is the coefficient of determination obtained by regressing the j-th predictor on the other $p-1$ predictors in the model
  • Judgment criterion:

    • Maximum VIF > 10 → Multicollinearity is considered to have an undue influence on least squares estimates
    • Average of all VIFs much greater than 1 → Indicates serious multicollinearity
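
A minimal sketch of the VIF computation, regressing each predictor on the others with numpy (the correlated predictors below are simulated for illustration):

```python
import numpy as np

# Sketch: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on the remaining predictors (plus an intercept).
def vif(X):
    """X: n x p matrix of predictors (no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ beta
        r2_j = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1 / (1 - r2_j))
    return np.array(out)

# Hypothetical predictors, two of which are nearly collinear
rng = np.random.default_rng(7)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))        # first two VIFs should be large
```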

Variable Transformation

Purpose

  • Linearize nonlinear regression functions

  • Stabilize error variance

  • Normalize error terms

Box-Cox Transformation

  • Transformation form: Use $y^\lambda$ ($\lambda \geq 0$) as the response variable, where $y^0$ is defined as $\ln(y)$

  • Choose the optimal $\lambda$: by maximizing the likelihood function

    $$L(\lambda; \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i^{(\lambda)} - \beta_0 - \beta_1^T \mathbf{x}_i\right)^2\right)$$
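
In practice, scipy offers a Box-Cox routine; note that scipy.stats.boxcox chooses $\lambda$ to make $y$ itself look as normal as possible, which is a simpler (marginal) criterion than the regression likelihood written above. A sketch on hypothetical log-normal data:

```python
import numpy as np
from scipy import stats

# Sketch: estimating the Box-Cox lambda by maximum likelihood with scipy.
# Note: scipy.stats.boxcox chooses lambda to make y itself as close to normal
# as possible, a simpler marginal criterion than the regression likelihood above.
rng = np.random.default_rng(8)
y = np.exp(rng.normal(loc=2.0, scale=0.5, size=500))  # hypothetical positive, right-skewed data

y_transformed, lmbda = stats.boxcox(y)   # lambda = 0 corresponds to the log transform
print(lmbda)                             # should be close to 0 for log-normal data
```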

Bias-Variance Tradeoff

Mean Squared Error (MSE) Decomposition

  • Let $f_0(\mathbf{x})$ be the true regression function at $\mathbf{x}$; then the mean squared error of the estimator $\hat{f}(\mathbf{x})$ is:

    $$\text{MSE}(\hat{f}(\mathbf{x})) = E\left[ \left( \hat{f}(\mathbf{x}) - f_0(\mathbf{x}) \right)^2 \right]$$

  • Decomposed as:

    $$\text{MSE}(\hat{f}(\mathbf{x})) = \text{var}(\hat{f}(\mathbf{x})) + \left[ E(\hat{f}(\mathbf{x})) - f_0(\mathbf{x}) \right]^2$$

    • First term: Variance (fluctuation of the estimator)
    • Second term: Squared bias (systematic error of the estimator)
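
A short derivation of this decomposition: add and subtract $E[\hat{f}(\mathbf{x})]$ inside the square; the cross term vanishes because $E\left[\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})]\right] = 0$:

$$\begin{aligned} \text{MSE}(\hat{f}(\mathbf{x})) &= E\left[\left(\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})] + E[\hat{f}(\mathbf{x})] - f_0(\mathbf{x})\right)^2\right] \\ &= E\left[\left(\hat{f}(\mathbf{x}) - E[\hat{f}(\mathbf{x})]\right)^2\right] + \left[E(\hat{f}(\mathbf{x})) - f_0(\mathbf{x})\right]^2 = \text{var}(\hat{f}(\mathbf{x})) + \left[E(\hat{f}(\mathbf{x})) - f_0(\mathbf{x})\right]^2 \end{aligned}$$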

Tradeoff Relationship and Regularization

  • Gauss-Markov theorem: If the linear model is correct, the least squares estimator $\hat{f}$ is unbiased and has the smallest variance among all linear unbiased estimators of $y$

  • Advantage of biased estimators: There may exist biased estimators with smaller MSE

  • Regularization methods: Reduce variance through regularization, worthwhile if the increase in bias is small

    • Subset selection (forward, backward, all subsets)
    • Ridge Regression
    • Lasso regression
  • Reality: Models are almost never completely correct, there is model bias between the “best” linear model and the true regression function

Multicollinearity seriously affects the interpretation and stability of regression coefficients and needs to be detected by indicators like VIF. Variable transformation is an effective means to improve model assumptions. The bias-variance tradeoff is the core issue in model selection; regularization methods may achieve smaller prediction errors by introducing bias to reduce variance.

Qualitative Predictors

Basic Model Setup

Consider a quantitative predictor variable $X_1$ and a qualitative predictor with two levels $M_1$ and $M_2$:

Dummy Variable Coding

  • Definition: $X_2 = \begin{cases} 1 & \text{if level } M_1 \\ 0 & \text{if level } M_2 \end{cases}$

  • Regression model: $E(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

Model Interpretation

  • For level $M_1$: $E(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2$

  • For level $M_2$: $E(Y|X) = \beta_0 + \beta_1 X_1$

  • Geometric meaning: Parallel lines with different intercepts but the same slope

  • Parameter meaning: $\beta_2 = E(Y|X_2=1) - E(Y|X_2=0) = E(Y|M_1) - E(Y|M_2)$

    • $\beta_2$ represents the difference in average response between the two levels

Interaction Effects

Model with Interaction Term

  • Model form: $E(Y|X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$

  • Model interpretation:

    • For level $M_1$: $E(Y|X) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1$
    • For level $M_2$: $E(Y|X) = \beta_0 + \beta_1 X_1$

Meaning of Interaction Effects

  • Geometric meaning: Non-parallel lines with different intercepts and slopes

  • Parameter interpretation:

    • $\beta_2$: Difference in intercepts between the two levels (at $X_1 = 0$)
    • $\beta_3$: Difference in slopes between the two levels
  • Interaction term: $X_1 X_2$ allows the slope to vary with the level of the qualitative variable (see the sketch below)
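
A minimal numpy sketch of this coding (the data and group labels below are hypothetical): the design matrix stacks an intercept, $X_1$, the dummy $X_2$, and the product $X_1 X_2$.

```python
import numpy as np

# Sketch: design matrix with a dummy variable and an interaction term
# (hypothetical data; group "M1" is coded as 1, "M2" as 0).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])           # quantitative predictor
group = np.array(["M1", "M1", "M1", "M1", "M2", "M2", "M2", "M2"])
x2 = (group == "M1").astype(float)                                 # dummy variable

X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])           # 1, X1, X2, X1*X2
y = np.array([3.1, 5.2, 7.1, 9.0, 1.9, 3.1, 3.9, 5.2])             # hypothetical responses

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)   # [b0, b1, b2, b3]; b3 is the difference in slopes between M1 and M2
```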

Extended Explanation

Multiple Qualitative Predictors

  • Can include multiple qualitative predictors

  • Each qualitative variable needs separate coding

Multi-level Qualitative Variables

For a qualitative variable with 5 levels, several coding methods are possible:

Method 1: Direct Numeric Coding

  • Directly code the levels as 1, 2, 3, 4, 5

  • Problem: Implies an ordinal relationship, which may not reflect reality

Method 2: Dummy Variable Coding

  • Define 4 dummy variables $X_1, X_2, X_3, X_4$

  • $X_j = 1$ if level $j$, otherwise 0 ($j = 1, 2, 3, 4$)

  • Baseline level: The 5th level serves as the reference baseline

Method 3: Effect Coding

  • Define $X_1, X_2, X_3, X_4$

  • $X_j = 1$ if level $j$, $X_j = -1$ if level 5, otherwise 0

  • Characteristic: Parameters represent deviations from the overall mean

Qualitative predictors are introduced into the regression model through dummy variable coding. Interaction effects allow different groups to have different slopes. Multi-level qualitative variables require careful coding to avoid spurious ordinal relationships; dummy variable coding is the most commonly used method. Correct coding is essential for model interpretation and statistical inference.