#assignment #sdsc5001

Question 1

When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse.

(a) Suppose that we have a set of observations, each with measurements on p=1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X=0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

(b) Now suppose that we have a set of observations, each with measurements on p=2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0,1] x [0,1]. We wish to predict a test observation’s response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1=0.6 and X2=0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?

(c) Now suppose that we have a set of observations on p=100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when p is large is that there are very few training observations “near” any given test observation.

Solution to Question 1

(a) For p=1, the feature X is uniformly distributed on [0,1]. The range of X is 1, and we are considering a neighborhood that is 10% of this range, so the interval length is 0.1. Since the distribution is uniform, the fraction of observations used is exactly the proportion of the interval length to the total range, which is 0.1 or 10%.

(b) For p=2, the features (X1, X2) are uniformly distributed on the unit square [0,1]x[0,1]. The neighborhood is a rectangle with sides of length 0.1 in each dimension (10% of the range for each feature). The area of this rectangle is 0.1 * 0.1 = 0.01. Since the distribution is uniform, the fraction of observations used is the area of the rectangle, which is 0.01 or 1%.

(c) For p=100, the features are uniformly distributed on a 100-dimensional hypercube. The neighborhood is a hypercube with each side of length 0.1. The volume of this hypercube is $(0.1)^{100} = 10^{-100}$. Thus, the fraction of observations used is extremely small, approximately $10^{-100}$.

(d) From parts (a) to (c), we see that as the number of features p increases, the fraction of observations near a test observation decreases dramatically. For p=1, it’s 10%; for p=2, it’s 1%; and for p=100, it’s virtually zero. This means that in high dimensions, there are very few training points within the local neighborhood of any test point. Consequently, KNN and other local methods suffer because they rely on having sufficient nearby points to make accurate predictions. This is the curse of dimensionality: as p grows, the data becomes sparse, and local approximations become unreliable.
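
A quick numeric check of the fractions above (ignoring edge effects near the boundary of [0, 1]); a minimal R sketch:

```r
# Fraction of observations falling in a neighborhood covering 10% of each feature's range
p <- c(1, 2, 100)
fraction_used <- 0.1^p
data.frame(p = p, fraction_used = fraction_used)  # 0.1, 0.01, 1e-100
```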


中文版本

问题 1

当特征数量 p 很大时,KNN 及其他局部预测方法(仅使用靠近测试点的观测值进行预测)的性能往往会下降。这种现象称为“维度诅咒”,它与非参数方法在 p 较大时表现不佳的事实相关。我们现在来研究这个诅咒。

(a) 假设我们有一组观测值,每个观测值有一个特征 X,且 p=1。我们假设 X 在 [0, 1] 上均匀分布。每个观测值有一个响应值。假设我们希望仅使用在 X 的范围内最接近测试点的 10% 区间内的观测值来预测测试点的响应。例如,为了预测 X=0.6 的测试点的响应,我们将使用范围 [0.55, 0.65] 内的观测值。平均而言,我们将使用可用观测值的多大比例进行预测?

(b) 现在假设我们有一组观测值,每个观测值有两个特征 X1 和 X2,且 p=2。我们假设 (X1, X2) 在 [0,1] x [0,1] 上均匀分布。我们希望仅使用在 X1 的范围内最接近测试点的 10% 区间内且在 X2 的范围内最接近测试点的 10% 区间内的观测值来预测测试点的响应。例如,为了预测 X1=0.6 和 X2=0.35 的测试点的响应,我们将使用 X1 在 [0.55, 0.65] 范围内且 X2 在 [0.3, 0.4] 范围内的观测值。平均而言,我们将使用可用观测值的多大比例进行预测?

(c) 现在假设我们有一组观测值,有 p=100 个特征。同样,每个特征在 [0,1] 上均匀分布。我们希望使用在每个特征的 10% 范围内最接近测试点的观测值来预测测试点的响应。我们将使用可用观测值的多大比例进行预测?

(d) 使用你在 (a)-(c) 部分的答案,论证当 p 很大时,KNN 的一个缺点是对于任何给定的测试点,只有极少的训练观测点“靠近”它。

问题 1 的题解

(a) 当 p=1 时,特征 X 在 [0,1] 上均匀分布。X 的范围是 1,我们考虑的邻域是范围的 10%,因此区间长度为 0.1。由于分布均匀,使用的观测值比例正好是区间长度与总范围的比例,即 0.1 或 10%。

(b) 当 p=2 时,特征 (X1, X2) 在单位正方形 [0,1]x[0,1] 上均匀分布。邻域是一个矩形,每个维度的边长为 0.1(每个特征范围的 10%)。这个矩形的面积是 0.1 * 0.1 = 0.01。由于分布均匀,使用的观测值比例就是矩形的面积,即 0.01 或 1%。

(c) 当 p=100 时,特征在 100 维超立方体上均匀分布。邻域是一个超立方体,每个边长为 0.1。这个超立方体的体积是 $(0.1)^{100} = 10^{-100}$。因此,使用的观测值比例极小,约为 $10^{-100}$。

(d) 从 (a) 到 (c) 部分可以看出,随着特征数量 p 的增加,靠近测试点的观测值比例急剧下降。当 p=1 时,比例为 10%;p=2 时,为 1%;p=100 时,几乎为零。这意味着在高维空间中,任何测试点的局部邻域内只有极少的训练点。因此,KNN 及其他局部方法表现不佳,因为它们依赖于有足够的邻近点来进行准确预测。这就是维度诅咒:随着 p 增大,数据变得稀疏,局部近似变得不可靠。

英文版本

Question 2

Answer the following questions about the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

Solution to Question 2

(a)

  • Training Set: We expect QDA to perform better on the training set. QDA is a more flexible model (it has more parameters) than LDA, so it can fit the training data more closely, leading to a lower training error rate.

  • Test Set: We expect LDA to perform better on the test set. Since the true (Bayes) boundary is linear, the extra flexibility of QDA is unnecessary. LDA, by making the correct assumption of a common covariance matrix, will have lower variance. Using QDA in this case would likely lead to overfitting (higher variance without a reduction in bias), resulting in a higher test error.

(b)

  • Training Set: We expect QDA to perform better on the training set. For the same reason as in (a), its higher flexibility allows it to achieve a better fit to the training data.

  • Test Set: We expect QDA to perform better on the test set. Because the true boundary is non-linear, QDA’s flexibility to model different class covariances allows it to better approximate the true boundary. LDA, constrained to a linear boundary, will suffer from higher bias.

(c) As the sample size n increases, we expect the test prediction accuracy of QDA relative to LDA to improve.

  • Reason: QDA requires estimating a separate covariance matrix for each class, which involves more parameters than LDA (which estimates a single pooled covariance matrix). With a small sample size, the variance introduced by estimating these additional parameters in QDA can hurt its performance. However, as the sample size grows larger, these parameters can be estimated more accurately. The benefit of QDA’s flexibility (potentially lower bias) is realized with less risk of overfitting (high variance), so its performance relative to the simpler LDA model improves.

(d) False.

  • Justification: While QDA is flexible enough to model a linear decision boundary (it is a more general model), it is not likely to achieve a superior test error rate when the true boundary is linear. Because QDA has more parameters to estimate, it has higher variance than LDA. If the true boundary is linear, LDA correctly assumes this simpler structure. Using QDA would introduce unnecessary variance without reducing bias, likely leading to a higher test error due to overfitting. Therefore, LDA is generally preferred when we have reason to believe the decision boundary is linear.
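
A small simulation sketch of the claims in (a) and (d), assuming the MASS package; the data are simulated with a common covariance matrix so the Bayes boundary is linear, and the exact error rates depend on the seed:

```r
# Simulate two classes with equal covariance (linear Bayes boundary), then compare LDA and QDA
library(MASS)
set.seed(1)
sim_class <- function(n) {
  y <- rbinom(n, 1, 0.5)
  X <- matrix(rnorm(2 * n), n, 2) + 1.5 * cbind(y, y)  # shifted means, identity covariance
  data.frame(X1 = X[, 1], X2 = X[, 2], y = factor(y))
}
train <- sim_class(200)
test  <- sim_class(10000)

err <- function(fit, data) mean(predict(fit, data)$class != data$y)
lda_fit <- lda(y ~ X1 + X2, data = train)
qda_fit <- qda(y ~ X1 + X2, data = train)

c(lda_train = err(lda_fit, train), qda_train = err(qda_fit, train),
  lda_test  = err(lda_fit, test),  qda_test  = err(qda_fit, test))
# QDA usually matches or beats LDA on the training split, while LDA tends to do
# at least as well on the large test set because the true boundary is linear.
```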


中文版本

问题 2

回答以下关于线性判别分析(LDA)和二次判别分析(QDA)差异的问题。

(a) 如果贝叶斯决策边界是线性的,我们预期LDA和QDA中哪个在训练集上表现更好?在测试集上呢?

(b) 如果贝叶斯决策边界是非线性的,我们预期LDA和QDA中哪个在训练集上表现更好?在测试集上呢?

(c) 通常而言,随着样本量n的增加,我们预期QDA相对于LDA的测试预测精度会提高、下降还是不变?为什么?

(d) 判断对错:即使对于某个问题其贝叶斯决策边界是线性的,我们使用QDA也可能获得比LDA更优的测试误差率,因为QDA足够灵活,可以建模线性决策边界。并证明你的答案。

问题 2 的题解

(a)

  • 训练集: 我们预期 QDA 在训练集上表现更好。QDA是一个更灵活的模型(参数更多),因此它能更紧密地拟合训练数据,从而获得更低的训练误差。

  • 测试集: 我们预期 LDA 在测试集上表现更好。因为真实(贝叶斯)边界是线性的,QDA的额外灵活性是不必要的。LDA通过做出共同协方差矩阵的正确假设,具有更低的方差。在这种情况下使用QDA可能会导致过拟合(方差增加而偏差未减少),从而导致更高的测试误差。

(b)

  • 训练集: 我们预期 QDA 在训练集上表现更好。原因与(a)中相同,其更高的灵活性使其能更好地拟合训练数据。

  • 测试集: 我们预期 QDA 在测试集上表现更好。因为真实边界是非线性的,QDA能够为不同类别建模不同协方差矩阵的灵活性使其能更好地近似真实边界。而受限于线性边界的LDA将会有较高的偏差。

(c) 随着样本量 n 的增加,我们预期 QDA 相对于 LDA 的测试预测精度会提高。

  • 原因: QDA需要为每个类别估计一个单独的协方差矩阵,这比LDA(估计一个共同的协方差矩阵)需要估计更多的参数。在样本量较小时,估计QDA这些额外参数所带来的方差可能会损害其性能。然而,随着样本量增大,这些参数可以被更准确地估计。QDA灵活性的好处(可能带来更低的偏差)得以实现,而过拟合(高方差)的风险降低,因此其性能相对于更简单的LDA模型会有所提升。

(d) 错误。

  • 证明: 尽管QDA足够灵活,可以建模一个线性决策边界(它是一个更一般的模型),但当真实边界是线性时,它不太可能获得更优的测试误差率。因为QDA有更多的参数需要估计,所以它比LDA具有更高的方差。如果真实边界是线性的,LDA正确地假设了这种更简单的结构。使用QDA会引入不必要的方差,而无法减少偏差,很可能由于过拟合导致更高的测试误差。因此,当我们有理由相信决策边界是线性时,通常更倾向于使用LDA。

英文版本

Question 3

Suppose we collect data for a group of students in a statistics class with variables $X_{1} =$ hours studied, $X_{2} =$ undergrad GPA, and $Y =$ receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_{0} = -6$, $\hat{\beta}_{1} = 0.05$, $\hat{\beta}_{2} = 1$.

(a) Estimate the probability that a student who studies for 40h and has an undergrad GPA of 3.5 gets an A in the class.

(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

Solution to Question 3

(a) Estimate the probability

The logistic regression model is:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$$

Given: $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$, $X_1 = 40$, $X_2 = 3.5$.

  1. Calculate the linear predictor (z):

    $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = -6 + (0.05)(40) + (1)(3.5) = -6 + 2 + 3.5 = -0.5$

  2. Calculate the probability:

    $P(Y=1) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(-0.5)}} = \frac{1}{1 + e^{0.5}} \approx \frac{1}{1 + 1.6487} \approx \frac{1}{2.6487} \approx 0.3775$

    Answer: The estimated probability is approximately 0.378 (or 37.8%).

(b) Find hours for a 50% chance

A 50% chance means $P(Y=1) = 0.5$. In the logistic model, this happens precisely when the linear predictor $z = 0$.

  1. Set up the equation:
    We have $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. We know $\beta_0, \beta_1, \beta_2$ and $X_2 = 3.5$. We need to solve for $X_1$ (hours studied).

    $-6 + 0.05(X_1) + 1(3.5) = 0$

  2. Solve for $X_1$:

    $0.05(X_1) - 2.5 = 0$

    $0.05(X_1) = 2.5$

    $X_1 = \frac{2.5}{0.05} = 50$

    Answer: The student would need to study for 50 hours to have a 50% chance of getting an A.
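
A quick numeric check of both parts in base R (plogis is the logistic CDF $1/(1+e^{-z})$):

```r
# (a) Probability of an A with 40 hours of study and a 3.5 GPA
b0 <- -6; b1 <- 0.05; b2 <- 1
z <- b0 + b1 * 40 + b2 * 3.5
plogis(z)                 # approximately 0.3775

# (b) Hours needed for a 50% chance: solve b0 + b1 * hours + b2 * 3.5 = 0
(-(b0 + b2 * 3.5)) / b1   # 50
```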


中文版本

问题 3

假设我们收集了一个统计学班级学生的数据,变量包括 $X_{1} =$ 学习小时数,$X_{2} =$ 本科GPA,和 $Y =$ 是否获得A(二分变量)。我们拟合了一个逻辑回归模型,得到的估计系数为 $\hat{\beta}_{0} = -6$, $\hat{\beta}_{1} = 0.05$, $\hat{\beta}_{2} = 1$。

(a) 估计一名学习了40小时且本科GPA为3.5的学生在班级中获得A的概率。

(b) (a)中的学生需要学习多少小时,才能有50%的机会获得A?

问题 3 的题解

(a) 估计概率

逻辑回归模型为:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$$

已知:$\hat{\beta}_0 = -6$,$\hat{\beta}_1 = 0.05$,$\hat{\beta}_2 = 1$,$X_1 = 40$,$X_2 = 3.5$。

  1. 计算线性预测值 (z):

    $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = -6 + (0.05)(40) + (1)(3.5) = -6 + 2 + 3.5 = -0.5$

  2. 计算概率:

    $P(Y=1) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(-0.5)}} = \frac{1}{1 + e^{0.5}} \approx \frac{1}{1 + 1.6487} \approx \frac{1}{2.6487} \approx 0.3775$

    答案: 估计概率约为 0.378(或 37.8%)。

(b) 求达到50%概率所需的学习时间

50%的概率意味着 $P(Y=1) = 0.5$。在逻辑模型中,这恰好在线性预测值 $z = 0$ 时发生。

  1. 建立方程:
    已知 $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$。我们知道 $\beta_0, \beta_1, \beta_2$ 和 $X_2 = 3.5$。需要求解 $X_1$(学习小时数)。

    $-6 + 0.05(X_1) + 1(3.5) = 0$

  2. 解出 $X_1$:

    $0.05(X_1) - 2.5 = 0$

    $0.05(X_1) = 2.5$

    $X_1 = \frac{2.5}{0.05} = 50$

    答案: 该学生需要学习 50 小时 才能有50%的机会获得A。

英文版本

Question 4

Let’s develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set in the ISLP package.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

(c) Split the data into a training set and a test set.

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

Solution to Question 4

(a) Create binary variable

```r
# Load the library and data
# (in R, the Auto data set ships with the ISLR2 package; ISLP is the Python counterpart)
library(ISLR2)
data(Auto)

# Calculate the median of mpg
mpg_median <- median(Auto$mpg)

# Create the binary variable mpg01
Auto$mpg01 <- as.numeric(Auto$mpg > mpg_median)
```

(b) Exploratory Data Analysis (EDA)

  • Method: Create boxplots of each predictor against mpg01, and scatterplots of pairs of predictors colored by mpg01 (a short code sketch follows this list).

  • Key Findings: Features strongly associated with mpg01 are typically:

    • weight: Heavier cars generally have lower mpg.
    • horsepower: More powerful cars generally have lower mpg.
    • displacement: Larger engine size generally indicates lower mpg.
    • acceleration: Cars with larger acceleration values (a longer 0-60 mph time, which usually means a less powerful engine) tend to have higher mpg, but the relationship is weaker than for the first three.
  • Conclusion: weight, horsepower, and displacement are the most promising predictors.
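
A minimal sketch of the plots described above, assuming Auto with mpg01 from part (a) is already in the workspace:

```r
# Boxplots of candidate predictors split by mpg01
par(mfrow = c(1, 3))
boxplot(weight ~ mpg01, data = Auto, main = "weight")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement")

# Scatterplot of two strong predictors, colored by mpg01
par(mfrow = c(1, 1))
plot(Auto$weight, Auto$horsepower,
     col = ifelse(Auto$mpg01 == 1, "blue", "red"),
     xlab = "weight", ylab = "horsepower")
legend("topleft", legend = c("mpg01 = 1", "mpg01 = 0"),
       col = c("blue", "red"), pch = 1)
```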

(c) Split the data

```r
# Set seed for reproducibility
set.seed(123)
# Create random indices for an 80% training / 20% test split
train_index <- sample(1:nrow(Auto), floor(0.8 * nrow(Auto)))
train_set <- Auto[train_index, ]
test_set <- Auto[-train_index, ]
```

(d) LDA

```r
# Fit LDA model using selected features (e.g., weight, horsepower)
library(MASS)
lda_fit <- lda(mpg01 ~ weight + horsepower, data = train_set)
lda_pred <- predict(lda_fit, newdata = test_set)
# Calculate test error: mean of misclassified observations
lda_test_error <- mean(lda_pred$class != test_set$mpg01)
# Expected test error is typically around 10-15%.
```

(e) QDA

```r
# Fit QDA model
qda_fit <- qda(mpg01 ~ weight + horsepower, data = train_set)
qda_pred <- predict(qda_fit, newdata = test_set)
qda_test_error <- mean(qda_pred$class != test_set$mpg01)
# Test error might be similar or slightly higher than LDA if the linear assumption is reasonable.
```

(f) Logistic Regression

```r
# Fit logistic regression model
glm_fit <- glm(mpg01 ~ weight + horsepower, data = train_set, family = binomial)
glm_probs <- predict(glm_fit, newdata = test_set, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
glm_test_error <- mean(glm_pred != test_set$mpg01)
# Test error is often comparable to LDA.
```

(g) KNN

```r
# Prepare data: standardize features using the training means and SDs
library(class)
train_X <- scale(train_set[, c("weight", "horsepower")])
test_X <- scale(test_set[, c("weight", "horsepower")],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))
train_y <- train_set$mpg01

# Try different K values (e.g., 1, 5, 10, 20)
k_values <- c(1, 5, 10, 20)
knn_errors <- sapply(k_values, function(k) {
  set.seed(123)  # reproducible tie-breaking
  knn_pred <- knn(train_X, test_X, train_y, k = k)
  mean(knn_pred != test_set$mpg01)
})

# Identify best K (the one with the lowest test error)
best_k <- k_values[which.min(knn_errors)]
# The best K is often a moderate value like 5 or 10, balancing bias and variance.
```

Summary of Findings:

  • The best model is often Logistic Regression or LDA, with test errors around 10%.

  • QDA might perform similarly or slightly worse.

  • KNN performance depends heavily on K. A moderate K (e.g., 5, 10) usually performs best, potentially achieving error rates similar to the linear models.


中文版本

问题 4

我们基于ISLP包中的Auto数据集开发一个模型,预测一辆汽车的油耗是高还是低。

(a) 创建一个二元变量mpg01,如果mpg的值高于其中位数则为1,低于其中位数则为0。

(b) 以图形方式探索数据,以调查mpg01与其他特征之间的关联。哪些其他特征似乎最有可能用于预测mpg01?散点图和箱线图可能是回答这个问题的有用工具。描述你的发现。

(c) 将数据拆分为训练集和测试集。

(d) 在训练数据上执行LDA,使用在(b)中看起来与mpg01最相关的变量来预测mpg01。得到的模型的测试误差是多少?

(e) 在训练数据上执行QDA,使用在(b)中看起来与mpg01最相关的变量来预测mpg01。得到的模型的测试误差是多少?

(f) 在训练数据上执行逻辑回归,使用在(b)中看起来与mpg01最相关的变量来预测mpg01。得到的模型的测试误差是多少?

(g) 在训练数据上执行KNN,使用几个不同的K值来预测mpg01。仅使用在(b)中看起来与mpg01最相关的变量。你得到了哪些测试误差?哪个K值在此数据集上表现最佳?

问题 4 的题解

(a) 创建二元变量

```r
# 加载库和数据
# (R 中 Auto 数据集在 ISLR2 包中;ISLP 是对应的 Python 包)
library(ISLR2)
data(Auto)

# 计算mpg的中位数
mpg_median <- median(Auto$mpg)

# 创建二元变量mpg01
Auto$mpg01 <- as.numeric(Auto$mpg > mpg_median)
```

(b) 探索性数据分析

  • 方法: 创建每个预测变量相对于mpg01的箱线图,以及按mpg01着色的预测变量对的散点图。

  • 主要发现:mpg01强相关的特征通常包括:

    • weight(车重): 较重的汽车通常油耗更高(mpg更低)。
    • horsepower(马力): 马力更大的汽车通常油耗更高。
    • displacement(排量): 发动机排量更大的汽车通常油耗更高。
    • acceleration(加速度): 加速度较慢的汽车(加速时间较长,意味着功率较小)油耗可能较低,但关系可能不如前三个变量强。
  • 结论: weighthorsepowerdisplacement是最有预测潜力的变量。

(c) 划分数据

```r
# 设置随机种子以保证结果可重现
set.seed(123)
# 创建随机索引,用于80%训练集,20%测试集的划分
train_index <- sample(1:nrow(Auto), floor(0.8 * nrow(Auto)))
train_set <- Auto[train_index, ]
test_set <- Auto[-train_index, ]
```

(d) 线性判别分析

```r
# 使用选定的特征(例如,车重、马力)拟合LDA模型
library(MASS)
lda_fit <- lda(mpg01 ~ weight + horsepower, data = train_set)
lda_pred <- predict(lda_fit, newdata = test_set)
# 计算测试误差:被错误分类的观测值的比例
lda_test_error <- mean(lda_pred$class != test_set$mpg01)
# 预期的测试误差通常在10-15%左右。
```

(e) 二次判别分析

```r
# 拟合QDA模型
qda_fit <- qda(mpg01 ~ weight + horsepower, data = train_set)
qda_pred <- predict(qda_fit, newdata = test_set)
qda_test_error <- mean(qda_pred$class != test_set$mpg01)
# 如果线性假设合理,测试误差可能与LDA相似或略高。
```

(f) 逻辑回归

```r
# 拟合逻辑回归模型
glm_fit <- glm(mpg01 ~ weight + horsepower, data = train_set, family = binomial)
glm_probs <- predict(glm_fit, newdata = test_set, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
glm_test_error <- mean(glm_pred != test_set$mpg01)
# 测试误差通常与LDA相当。
```

(g) K最近邻

```r
# 准备数据:使用训练集的均值和标准差来标准化特征
library(class)
train_X <- scale(train_set[, c("weight", "horsepower")])
test_X <- scale(test_set[, c("weight", "horsepower")],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))
train_y <- train_set$mpg01

# 尝试不同的K值(例如,1, 5, 10, 20)
k_values <- c(1, 5, 10, 20)
knn_errors <- sapply(k_values, function(k) {
  set.seed(123)  # 保证KNN内部随机性可重复
  knn_pred <- knn(train_X, test_X, train_y, k = k)
  mean(knn_pred != test_set$mpg01)
})

# 确定最佳K值(测试误差最低的那个)
best_k <- k_values[which.min(knn_errors)]
# 最佳K值通常是一个中等大小的值,如5或10,以平衡偏差和方差。
```

结果总结:

  • 最佳模型通常是逻辑回归LDA,测试误差大约在**10%**左右。

  • QDA的表现可能相似或稍差。

  • KNN的表现高度依赖于K值。一个适中的K值(例如5, 10)通常表现最佳,可能达到与线性模型相似的错误率。

英文版本

Question 5

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain $p+1$ models, containing $0, 1, 2, \ldots, p$ predictors. Answer the following questions:

(a) Which of the three models with k predictors has the smallest training RSS?

(b) Which of the three models with k predictors has the smallest test MSE?

(c) True or False for each statement below.
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.

Solution to Question 5

(a) Training RSS
The best subset selection model with k predictors will have the smallest training RSS.

  • Reason: Best subset selection searches through all possible combinations of k predictors and chooses the best one. Forward and backward stepwise are greedy approximations that do not guarantee the absolute best model of size k.

(b) Test MSE
There is no definitive answer; it depends on the specific dataset and the true relationship between the predictors and the response.

  • Reason: The model with the smallest test MSE is the one that best balances bias and variance. While best subset has the lowest training RSS (and thus lowest bias), it might overfit the training data (high variance), leading to a higher test MSE than a more constrained stepwise approach in some cases. The relative performance is unpredictable and must be validated on test data.

(c) True or False

i. True.

  • Reason: Forward stepwise selection starts with no predictors and adds one predictor at a time. The model with k predictors is built directly from the model with k-1 predictors by adding the next best predictor. Therefore, the set of predictors in the k-variable model is always a subset of the predictors in the (k+1)-variable model. The model path is “nested.”

ii. True.

  • Reason: Backward stepwise selection starts with all p predictors and removes one predictor at a time. The model with k+1 predictors is built directly from the model with k predictors by removing the least useful predictor. Therefore, the predictors in the k-variable model are a subset of the predictors in the (k+1)-variable model. This path is also “nested.”

iii. False.

  • Reason: There is no guaranteed relationship between the subsets of predictors chosen by backward stepwise for a given k and the subsets chosen by forward stepwise for k+1. The two algorithms follow different search paths and can produce very different models.

iv. False.

  • Reason: Similarly, there is no guaranteed subset relationship between the models found by forward and backward stepwise selection. The k-variable model from forward stepwise is not necessarily a subset of the (k+1)-variable model from backward stepwise.

v. False.

  • Reason: Best subset selection independently finds the best model for each possible model size. The optimal set of k predictors is not necessarily a subset of the optimal set of k+1 predictors. The algorithm is free to choose a completely different combination of predictors for each model size.
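
A small illustration of the nesting claims in (i), (ii), and (v), assuming the leaps package and simulated data (which predictors end up selected depends on the simulation):

```r
# Compare the selection paths of forward stepwise and best subset selection
library(leaps)
set.seed(1)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 2 * X[, 2] - X[, 3] + rnorm(n)

fwd  <- regsubsets(x = X, y = y, nvmax = p, method = "forward")
best <- regsubsets(x = X, y = y, nvmax = p, method = "exhaustive")

# Each row of the forward path contains the previous row's predictors (nested);
# the best-subset rows are chosen independently and need not be nested.
summary(fwd)$which
summary(best)$which
```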


中文版本

问题 5

我们在同一个数据集上执行最佳子集选择、前向逐步选择和后向逐步选择。对于每种方法,我们得到 $p+1$ 个模型,分别包含 $0, 1, 2, \ldots, p$ 个预测变量。回答以下问题:

(a) 包含k个预测变量的三种模型中,哪个模型的训练RSS最小?

(b) 包含k个预测变量的三种模型中,哪个模型的测试MSE最小?

(c) 判断以下陈述的对错。
i. 通过前向逐步选择得到的k变量模型中的预测变量,是前向逐步选择得到的(k+1)变量模型中预测变量的子集。
ii. 通过后向逐步选择得到的k变量模型中的预测变量,是后向逐步选择得到的(k+1)变量模型中预测变量的子集。
iii. 通过后向逐步选择得到的k变量模型中的预测变量,是前向逐步选择得到的(k+1)变量模型中预测变量的子集。
iv. 通过前向逐步选择得到的k变量模型中的预测变量,是后向逐步选择得到的(k+1)变量模型中预测变量的子集。
v. 通过最佳子集选择得到的k变量模型中的预测变量,是最佳子集选择得到的(k+1)变量模型中预测变量的子集。

问题 5 的题解

(a) 训练RSS
包含k个预测变量的 最佳子集选择 模型会有最小的训练RSS。

  • 原因: 最佳子集选择会搜索所有可能的k个预测变量的组合,并选择最优的一个。前向和后向逐步选择是贪心算法,不能保证得到大小为k的绝对最优模型。

(b) 测试MSE
没有明确的答案;这取决于具体的数据集以及预测变量与响应变量之间的真实关系。

  • 原因: 测试MSE最小的模型是那个能最好地平衡偏差和方差的模型。虽然最佳子集选择的训练RSS最小(因此偏差最低),但在某些情况下,它可能对训练数据过拟合(高方差),导致其测试MSE反而高于约束更强的逐步选择方法。其相对性能是不可预测的,必须在测试数据上进行验证。

(c) 判断对错

i. 正确。

  • 原因: 前向逐步选择从没有预测变量开始,每次添加一个预测变量。包含k个预测变量的模型是直接由包含k-1个预测变量的模型添加下一个最佳预测变量而构建的。因此,k变量模型中的预测变量集合总是(k+1)变量模型中预测变量集合的子集。模型路径是"嵌套"的。

ii. 正确。

  • 原因: 后向逐步选择从所有p个预测变量开始,每次移除一个预测变量。包含k+1个预测变量的模型是直接由包含k个预测变量的模型移除最不重要的预测变量而构建的。因此,k变量模型中的预测变量是(k+1)变量模型中预测变量的子集。该路径也是"嵌套"的。

iii. 错误。

  • 原因: 后向逐步选择对于给定的k所选择的预测变量子集,与前向逐步选择对于k+1所选择的预测变量子集之间,没有保证的关系。两种算法遵循不同的搜索路径,可能产生完全不同的模型。

iv. 错误。

  • 原因: 类似地,前向逐步选择和后向逐步选择找到的模型之间也没有保证的子集关系。由前向逐步选择得到的k变量模型不一定是后向逐步选择得到的(k+1)变量模型的子集。

v. 错误。

  • 原因: 最佳子集选择会独立地找到每个可能模型大小下的最佳模型。k个预测变量的最优集合不一定是k+1个预测变量的最优集合的子集。该算法可以自由地为每个模型大小选择完全不同的预测变量组合。

英文版本

Question 6

Choose the correct answer for each question below.

(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(b) Ridge regression, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(c) Nonlinear methods, relative to least squares, are:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Solution to Question 6

(a) Correct Answer: iii

  • Reasoning: Lasso regression uses L1 regularization (shrinking coefficients, some to exactly zero), which makes it less flexible than least squares. A less flexible model has higher bias but lower variance. The trade-off is beneficial for prediction accuracy only if the increase in bias is small relative to the decrease in variance.

(b) Correct Answer: iii

  • Reasoning: Ridge regression uses L2 regularization (shrinking coefficients towards zero but not exactly zero), which also makes it less flexible than least squares. Similar to the lasso, it improves prediction accuracy when the increase in bias is less than the decrease in variance.

(c) Correct Answer: ii

  • Reasoning: Nonlinear methods (e.g., polynomial regression, splines, decision trees) are more flexible than the linear least squares model. A more flexible model has lower bias but higher variance. The trade-off is beneficial for prediction accuracy only if the increase in variance is less than the decrease in bias.

Summary: Regularization methods like Lasso and Ridge are less flexible and improve accuracy by reducing variance more than they increase bias. Nonlinear methods are more flexible and improve accuracy by reducing bias more than they increase variance.


中文版本

问题 6

为以下每个问题选择正确答案。

(a) Lasso回归,相对于最小二乘法,是:
i. 更灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
ii. 更灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。
iii. 更不灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
iv. 更不灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。

(b) 岭回归,相对于最小二乘法,是:
i. 更灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
ii. 更灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。
iii. 更不灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
iv. 更不灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。

(c) 非线性方法,相对于最小二乘法,是:
i. 更灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
ii. 更灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。
iii. 更不灵活,因此当其偏差的增加小于方差的减少时,会提高预测精度。
iv. 更不灵活,因此当方差的增加小于偏差的减少时,会提高预测精度。

问题 6 的题解

(a) 正确答案: iii

  • 推理: Lasso回归使用L1正则化(收缩系数,有些会变为零),这使其比最小二乘法更不灵活。一个更不灵活的模型具有更高的偏差但更低的方差。只有当偏差的增加量小于方差的减少量时,这种权衡才对预测精度有益。

(b) 正确答案: iii

  • 推理: 岭回归使用L2正则化(将系数向零收缩但不完全为零),这也使其比最小二乘法更不灵活。与Lasso类似,当偏差的增加量小于方差的减少量时,它能提高预测精度。

(c) 正确答案: ii

  • 推理: 非线性方法(例如,多项式回归、样条、决策树)比线性最小二乘模型更灵活。一个更灵活的模型具有更低的偏差但更高的方差。只有当方差的增加量小于偏差的减少量时,这种权衡才对预测精度有益。

总结: 像Lasso和岭回归这样的正则化方法更不灵活,它们通过减少方差(其幅度超过偏差的增加)来提高精度。非线性方法更灵活,它们通过减少偏差(其幅度超过方差的增加)来提高精度。

英文版本

Question 7

Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij}\right)^2 \quad\text{subject to}\quad \sum_{j=1}^p|\beta_j|\leq s$$

for a particular value of s. Choose the correct answer for each question below.

(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(b) As we increase s from 0, the test MSE will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(c) As we increase s from 0, the variance will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(d) As we increase s from 0, the (squared) bias will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution to Question 7

This problem describes the Lasso regression method, where s is a bound on the L1-norm of the coefficients. As s increases, the model becomes less constrained, moving from a null model towards the full least squares solution.

(a) Correct Answer: iv. Steadily decrease.

  • Reasoning: The Training RSS is minimized when the model fits the data best. When s=0, all coefficients are forced to be zero, resulting in a high RSS. As s increases, the constraint is relaxed, allowing the coefficients to take on values that better fit the training data. Therefore, the training RSS will decrease monotonically (steadily) as the model flexibility increases, reaching a minimum at the least squares solution (when s is large enough).

(b) Correct Answer: ii. Decrease initially, and then eventually start increasing in a U shape.

  • Reasoning: Test MSE captures the prediction error on new data and is subject to the bias-variance trade-off. For very small s (strong constraint), the model has high bias (underfitting), leading to high test MSE. As s increases to an optimal value, variance increases slightly, but bias decreases significantly, leading to a decrease in test MSE. Beyond this optimal point, further increasing s leads to overfitting (variance increases dramatically with little reduction in bias), causing the test MSE to rise again. This creates a characteristic U-shape.

(c) Correct Answer: iii. Steadily increase.

  • Reasoning: Variance measures the model’s sensitivity to the training data. A highly constrained model (small s) has low variance. As the constraint is relaxed (increasing s), the model has more freedom to fit the specific nuances of the training set, making its estimates more variable. Therefore, variance increases steadily as s increases.

(d) Correct Answer: iv. Steadily decrease.

  • Reasoning: (Squared) Bias measures the error introduced by the model’s inability to represent the true relationship. A very constrained model (small s) is simplistic and may have high bias. As s increases, the model becomes more flexible and can better approximate the underlying true function, leading to a steady decrease in bias.
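
A small empirical check of part (a), assuming the glmnet package and simulated data. glmnet fits the equivalent penalized form of the lasso, so the budget $s$ is recovered as the L1 norm of each fitted coefficient vector:

```r
# Training RSS as a function of the L1 budget s along a lasso path
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))

fit <- glmnet(X, y, alpha = 1)                 # lasso path over a grid of lambda values
s_budget  <- colSums(abs(coef(fit)[-1, ]))     # L1 norm of the coefficients at each lambda
train_rss <- colSums((y - predict(fit, newx = X))^2)

# Training RSS decreases steadily as the budget s grows, as argued in (a)
plot(s_budget, train_rss, type = "b",
     xlab = "s (L1 norm of coefficients)", ylab = "training RSS")
```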


中文版本

问题 7

假设我们通过最小化

$$\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij}\right)^2 \quad\text{约束条件为}\quad \sum_{j=1}^p|\beta_j|\leq s$$

来估计线性回归模型中的回归系数,其中s是一个特定值。为以下每个问题选择正确答案。

(a) 当我们将s从0开始增加时,训练RSS将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(b) 当我们将s从0开始增加时,测试MSE将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(c) 当我们将s从0开始增加时,方差将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(d) 当我们将s从0开始增加时,偏差平方将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

问题 7 的题解

本题描述的是Lasso回归方法,其中s是系数L1范数的约束上限。当s增加时,模型的约束变弱,从一个空模型向完整的最小二乘解逼近。

(a) 正确答案: iv. 稳步减少。

  • 推理: 训练RSS在模型对数据拟合最好时最小。当s=0时,所有系数被强制为零,导致很高的RSS。随着s增加,约束放松,允许系数取值以更好地拟合训练数据。因此,随着模型灵活性的增加,训练RSS将单调地(稳步地)减少,在达到最小二乘解时(当s足够大)降至最小。

(b) 正确答案: ii. 初始减少,然后以U形开始增加。

  • 推理: 测试MSE衡量的是对新数据的预测误差,并受到偏差-方差权衡的影响。当s非常小(强约束)时,模型具有高偏差(欠拟合),导致高测试MSE。当s增加到一个最优值时,方差略有增加,但偏差显著减少,导致测试MSE降低。超过这个最优点后,进一步增加s会导致过拟合(方差急剧增加,而偏差减少甚微),导致测试MSE再次上升。这就形成了典型的U形曲线。

(c) 正确答案: iii. 稳步增加。

  • 推理: 方差衡量模型对训练数据的敏感性。一个高度约束的模型(小的s)具有低方差。随着约束放松(增加s),模型有更多的自由去拟合训练集中的特定细节,使得其估计值变化更大。因此,方差随着s的增加而稳步增加。

(d) 正确答案: iv. 稳步减少。

  • 推理: (平方)偏差衡量的是模型因无法表示真实关系而引入的误差。一个约束很强的模型(小的s)很简单,可能具有高偏差。随着s增加,模型变得更灵活,能更好地逼近潜在的真实函数,导致偏差稳步减少。

英文版本

Question 8

Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j} x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}$$

for a particular value of $\lambda$. Choose the correct answer for each question below.

(a) As we increase $\lambda$ from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(b) As we increase $\lambda$ from 0, the test MSE will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(c) As we increase $\lambda$ from 0, the variance will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(d) As we increase $\lambda$ from 0, the (squared) bias will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution to Question 8

This problem describes Ridge regression, where $\lambda$ controls the strength of L2 regularization. As $\lambda$ increases, the model becomes more constrained, moving from the full least squares solution toward a null model.

(a) Correct Answer: iii. Steadily increase.

  • Reasoning: The Training RSS measures how well the model fits the training data. When $\lambda=0$, we have the ordinary least squares solution, which minimizes the RSS. As $\lambda$ increases, the regularization term forces the coefficients to shrink toward zero, making the model less flexible and reducing its ability to fit the training data perfectly. Therefore, the training RSS will increase steadily as $\lambda$ increases.

(b) Correct Answer: ii. Decrease initially, and then eventually start increasing in a U shape.

  • Reasoning: Test MSE is subject to the bias-variance trade-off. When $\lambda=0$ (no regularization), the model may overfit (high variance), leading to high test MSE. As $\lambda$ increases to an optimal value, the reduction in variance outweighs the increase in bias, causing test MSE to decrease. Beyond this optimal point, the model becomes too constrained (high bias, underfitting), and test MSE increases again. This creates a U-shaped curve.

(c) Correct Answer: iv. Steadily decrease.

  • Reasoning: Variance measures the model’s sensitivity to fluctuations in the training data. A complex model ($\lambda=0$) has high variance. As $\lambda$ increases, the regularization constrains the coefficients, making the model more stable and less sensitive to the specific training sample. Therefore, variance decreases steadily as $\lambda$ increases.

(d) Correct Answer: iii. Steadily increase.

  • Reasoning: (Squared) Bias measures the error from approximating a complex real-world phenomenon with a simpler model. When $\lambda=0$, the model is very flexible and has low bias. As $\lambda$ increases, the model becomes more constrained and less able to capture the true underlying relationship in the data. Therefore, bias increases steadily as $\lambda$ increases.
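
A small empirical check of part (a), assuming the glmnet package (alpha = 0 gives the ridge penalty) and simulated data:

```r
# Ridge training RSS as a function of lambda
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(3, -2, 1.5, rep(0, p - 3)) + rnorm(n))

lambda_grid <- 10^seq(-3, 3, length.out = 50)
fit <- glmnet(X, y, alpha = 0, lambda = lambda_grid)   # alpha = 0: ridge regression

train_rss <- colSums((y - predict(fit, newx = X))^2)
# Training RSS increases steadily with lambda, as argued in (a)
plot(log10(fit$lambda), train_rss, type = "b",
     xlab = "log10(lambda)", ylab = "training RSS")
```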


中文版本

问题 8

假设我们通过最小化

$$\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j} x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}$$

来估计线性回归模型中的回归系数,其中 $\lambda$ 是一个特定值。为以下每个问题选择正确答案。

(a) 当我们将 $\lambda$ 从0开始增加时,训练RSS将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(b) 当我们将 $\lambda$ 从0开始增加时,测试MSE将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(c) 当我们将 $\lambda$ 从0开始增加时,方差将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

(d) 当我们将 $\lambda$ 从0开始增加时,(平方)偏差将:
i. 初始增加,然后以倒U形开始减少。
ii. 初始减少,然后以U形开始增加。
iii. 稳步增加。
iv. 稳步减少。
v. 保持恒定。

问题 8 的题解

本题描述的是岭回归(Ridge regression),其中 $\lambda$ 控制L2正则化的强度。当 $\lambda$ 增加时,模型的约束变强,从完整的最小二乘解向一个空模型移动。

(a) 正确答案: iii. 稳步增加。

  • 推理: 训练RSS衡量模型对训练数据的拟合程度。当 $\lambda=0$ 时,我们得到普通最小二乘解,它最小化了RSS。随着 $\lambda$ 增加,正则化项迫使系数向零收缩,使得模型更不灵活,降低了其完美拟合训练数据的能力。因此,训练RSS将随着 $\lambda$ 的增加而稳步增加。

(b) 正确答案: ii. 初始减少,然后以U形开始增加。

  • 推理: 测试MSE受到偏差-方差权衡的影响。当 $\lambda=0$(无正则化)时,模型可能过拟合(高方差),导致高测试MSE。当 $\lambda$ 增加到一个最优值时,方差的减少超过了偏差的增加,导致测试MSE降低。超过这个最优点后,模型变得过于受限(高偏差,欠拟合),测试MSE再次增加。这就形成了U形曲线。

(c) 正确答案: iv. 稳步减少。

  • 推理: 方差衡量模型对训练数据波动的敏感性。一个复杂的模型($\lambda=0$)具有高方差。随着 $\lambda$ 增加,正则化约束了系数,使得模型更稳定,对特定训练样本的敏感性降低。因此,方差随着 $\lambda$ 的增加而稳步减少。

(d) 正确答案: iii. 稳步增加。

  • 推理: (平方)偏差衡量用较简单模型近似复杂现实世界现象所产生的误差。当 $\lambda=0$ 时,模型非常灵活,具有低偏差。随着 $\lambda$ 增加,模型变得更受限,捕捉数据中真实潜在关系的能力下降。因此,偏差随着 $\lambda$ 的增加而稳步增加。

英文版本

Question 9

We will predict the number of applications received in the College data set in the ISLP package.

(a) Split the data set into a training set and a test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with $\lambda$ chosen by cross validation. Report the test error obtained.

(d) Fit a lasso model on the training set, with $\lambda$ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M (the number of principal components) chosen by cross validation. Report the test error obtained, along with the number of PCs selected by cross validation.

Solution to Question 9

Note: This is a practical coding exercise. The exact results will depend on the random seed used for splitting the data. Below is a typical approach and expected outcomes.

(a) Data Splitting

```r
# Load required libraries and data
# (in R, the College data set ships with the ISLR2 package; ISLP is the Python counterpart)
library(ISLR2)
data(College)

# Set seed for reproducibility
set.seed(123)

# Create training indices (e.g., 70% training, 30% test)
train_index <- sample(1:nrow(College), floor(0.7 * nrow(College)))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
```

(b) Linear Regression (Least Squares)

```r
# Fit linear model
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_test_error <- mean((lm_pred - test_data$Apps)^2)  # MSE
# Typical test MSE: Around 1,000,000 - 2,000,000
```

(c) Ridge Regression

```r
library(glmnet)

# Prepare data for glmnet (matrix format)
x_train <- model.matrix(Apps ~ ., train_data)[, -1]  # Remove intercept column
y_train <- train_data$Apps
x_test  <- model.matrix(Apps ~ ., test_data)[, -1]

# Fit ridge regression with cross-validation
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)   # alpha = 0 for ridge
best_lambda_ridge <- ridge_cv$lambda.min

ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = best_lambda_ridge)
ridge_pred <- predict(ridge_fit, newx = x_test)
ridge_test_error <- mean((ridge_pred - test_data$Apps)^2)
# Typical test MSE: Slightly lower than linear regression
```

(d) Lasso Regression

```r
# Fit lasso with cross-validation
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)   # alpha = 1 for lasso
best_lambda_lasso <- lasso_cv$lambda.min

lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_lasso)
lasso_pred <- predict(lasso_fit, newx = x_test)
lasso_test_error <- mean((lasso_pred - test_data$Apps)^2)

# Number of non-zero coefficients
lasso_coef <- predict(lasso_fit, type = "coefficients")
num_nonzero <- sum(lasso_coef != 0) - 1  # Exclude intercept
# Typical: Test MSE similar to ridge, with 10-20 non-zero coefficients
```

(e) Principal Component Regression (PCR)

```r
library(pls)

# Fit PCR with cross-validation
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

# Get optimal number of components
pcr_cv <- MSEP(pcr_fit, estimate = "CV")
optimal_m <- which.min(pcr_cv$val[1, , ]) - 1  # Subtract 1 because the first entry is 0 components

pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = optimal_m)
pcr_test_error <- mean((pcr_pred - test_data$Apps)^2)
# Typical: Optimal M around 10-15 components, test MSE similar to ridge/lasso
```

Expected Results Summary:

  • Linear Regression: Highest test error due to potential overfitting

  • Ridge Regression: 5-10% improvement over linear regression

  • Lasso Regression: Similar performance to ridge, with variable selection

  • PCR: Performance depends on whether the important predictors align with the first few principal components
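
A small collection sketch, assuming the objects created in parts (b)-(e) above (lm_test_error, ridge_test_error, lasso_test_error, pcr_test_error, num_nonzero, optimal_m) are still in the workspace:

```r
# Gather the test MSEs from parts (b)-(e) for a side-by-side comparison
results <- data.frame(
  model    = c("Least squares", "Ridge", "Lasso", "PCR"),
  test_MSE = c(lm_test_error, ridge_test_error, lasso_test_error, pcr_test_error)
)
results[order(results$test_MSE), ]

num_nonzero  # non-zero lasso coefficients from (d)
optimal_m    # number of principal components chosen by CV in (e)
```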


中文版本

问题 9

我们将预测ISLP包中College数据集中大学收到的申请数量。

(a) 将数据集分为训练集和测试集。

(b) 在训练集上使用最小二乘法拟合线性模型,并报告得到的测试误差。

(c) 在训练集上拟合岭回归模型,通过交叉验证选择 $\lambda$。报告得到的测试误差。

(d) 在训练集上拟合lasso模型,通过交叉验证选择 $\lambda$。报告得到的测试误差以及非零系数估计的数量。

(e) 在训练集上拟合PCR模型,通过交叉验证选择M(主成分的数量)。报告得到的测试误差以及交叉验证选择的主成分数量。

问题 9 的题解

注意: 这是一个实际的编程练习。具体结果取决于用于分割数据的随机种子。以下是典型的方法和预期结果。

(a) 数据分割

```r
# 加载所需的库和数据
# (R 中 College 数据集在 ISLR2 包中;ISLP 是对应的 Python 包)
library(ISLR2)
data(College)

# 设置随机种子以保证可重复性
set.seed(123)

# 创建训练索引(例如,70%训练,30%测试)
train_index <- sample(1:nrow(College), floor(0.7 * nrow(College)))
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
```

(b) 线性回归(最小二乘法)

```r
# 拟合线性模型
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_test_error <- mean((lm_pred - test_data$Apps)^2)  # 均方误差
# 典型测试MSE:大约1,000,000 - 2,000,000
```

(c) 岭回归

```r
library(glmnet)

# 为glmnet准备数据(矩阵格式)
x_train <- model.matrix(Apps ~ ., train_data)[, -1]  # 移除截距项
y_train <- train_data$Apps
x_test  <- model.matrix(Apps ~ ., test_data)[, -1]

# 用交叉验证拟合岭回归
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)   # alpha=0表示岭回归
best_lambda_ridge <- ridge_cv$lambda.min

ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = best_lambda_ridge)
ridge_pred <- predict(ridge_fit, newx = x_test)
ridge_test_error <- mean((ridge_pred - test_data$Apps)^2)
# 典型测试MSE:比线性回归略低
```

(d) Lasso回归

```r
# 用交叉验证拟合lasso
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)   # alpha=1表示lasso
best_lambda_lasso <- lasso_cv$lambda.min

lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_lasso)
lasso_pred <- predict(lasso_fit, newx = x_test)
lasso_test_error <- mean((lasso_pred - test_data$Apps)^2)

# 非零系数数量
lasso_coef <- predict(lasso_fit, type = "coefficients")
num_nonzero <- sum(lasso_coef != 0) - 1  # 排除截距项
# 典型结果:测试MSE与岭回归相似,有10-20个非零系数
```

(e) 主成分回归

```r
library(pls)

# 用交叉验证拟合PCR
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

# 获取最优主成分数量
pcr_cv <- MSEP(pcr_fit, estimate = "CV")
optimal_m <- which.min(pcr_cv$val[1, , ]) - 1  # 减1是因为第一项对应0个成分

pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = optimal_m)
pcr_test_error <- mean((pcr_pred - test_data$Apps)^2)
# 典型结果:最优M约为10-15个主成分,测试MSE与岭回归/lasso相似
```

预期结果总结:

  • 线性回归: 由于可能过拟合,测试误差最高

  • 岭回归: 比线性回归提高5-10%

  • Lasso回归: 与岭回归性能相似,但具有变量选择功能

  • PCR: 性能取决于重要预测变量是否与前几个主成分对齐

英文版本

Question 10

Suppose we fit a curve with basis functions $b_{1}(X)=X$, $b_{2}(X)=(X-1)^{2} I(X\geq 1)$. We fit the linear regression model

$$Y=\beta_{0}+\beta_{1} b_{1}(X)+\beta_{2} b_{2}(X)+\epsilon$$

and obtain coefficient estimates $\hat{\beta}_{0}=1$, $\hat{\beta}_{1}=1$, $\hat{\beta}_{2}=-2$. Sketch the estimated curve between $X=-2$ and $X=2$. Note the intercepts, slopes, and other relevant information.

Solution to Question 10

Step 1: Write the estimated curve function.
The estimated curve is given by:

$$\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 b_1(X) + \hat{\beta}_2 b_2(X) = 1 + 1 \cdot X + (-2) \cdot (X-1)^2 I(X \geq 1)$$

This function is piecewise defined due to the indicator function:

  • For $X < 1$: $\hat{f}(X) = 1 + X$ (since $I(X \geq 1) = 0$)

  • For $X \geq 1$: $\hat{f}(X) = 1 + X - 2(X-1)^2$

Step 2: Simplify the expression for $X \geq 1$.

$$\hat{f}(X) = 1 + X - 2(X^2 - 2X + 1) = 1 + X - 2X^2 + 4X - 2 = -2X^2 + 5X - 1$$

So the piecewise function is:

$$\hat{f}(X) = \begin{cases} 1 + X & \text{if } X < 1 \\ -2X^2 + 5X - 1 & \text{if } X \geq 1 \end{cases}$$

Step 3: Key features of the curve.

  • For $X < 1$:

    • This is a straight line with slope = 1 and y-intercept = 1.
    • At $X = -2$: $\hat{f}(-2) = 1 + (-2) = -1$
    • At $X = 0$: $\hat{f}(0) = 1 + 0 = 1$
    • As $X$ approaches 1 from the left: $\hat{f}(1) = 1 + 1 = 2$
  • For $X \geq 1$:

    • This is a downward-opening parabola (since the coefficient of $X^2$ is negative).
    • At $X = 1$: $\hat{f}(1) = -2(1)^2 + 5(1) - 1 = 2$ (continuous with the linear part)
    • The vertex of the parabola occurs at $X = -\frac{b}{2a} = -\frac{5}{2(-2)} = 1.25$
    • At the vertex $X = 1.25$: $\hat{f}(1.25) = -2(1.25)^2 + 5(1.25) - 1 = 2.125$
    • At $X = 2$: $\hat{f}(2) = -2(4) + 5(2) - 1 = 1$
  • Continuity and differentiability:

    • The curve is continuous at $X = 1$ since both pieces give $\hat{f}(1) = 2$.
    • The derivative for $X < 1$ is $\hat{f}'(X) = 1$.
    • The derivative for $X > 1$ is $\hat{f}'(X) = -4X + 5$.
    • At $X = 1$: left derivative = 1, right derivative = $-4(1) + 5 = 1$. So the curve is smooth (differentiable) at $X = 1$.

Step 4: Sketch description.
The curve starts at point (-2, -1) and increases linearly with slope 1 until reaching (1, 2). Then it continues as a parabola, rising to a maximum at (1.25, 2.125), and then decreasing to (2, 1). The curve is smooth throughout with no sharp corners.
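
A minimal R sketch of the curve; evaluating the basis-function form directly reproduces the piecewise description above:

```r
# f(x) = 1 + x - 2 * (x - 1)^2 * I(x >= 1), plotted on [-2, 2]
f_hat <- function(x) 1 + x - 2 * (x - 1)^2 * (x >= 1)

x <- seq(-2, 2, length.out = 401)
plot(x, f_hat(x), type = "l", xlab = "X", ylab = "f(X)")
abline(v = 1, lty = 2)  # knot at X = 1
points(c(-2, 1, 1.25, 2), f_hat(c(-2, 1, 1.25, 2)), pch = 19)  # (-2,-1), (1,2), (1.25,2.125), (2,1)
```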


中文版本

问题 10

假设我们使用基函数 $b_{1}(X)=X$, $b_{2}(X)=(X-1)^{2} I(X\geq 1)$ 拟合一条曲线。我们拟合线性回归模型

$$Y=\beta_{0}+\beta_{1} b_{1}(X)+\beta_{2} b_{2}(X)+\epsilon$$

并得到系数估计 $\hat{\beta}_{0}=1$, $\hat{\beta}_{1}=1$, $\hat{\beta}_{2}=-2$。在 $X=-2$ 到 $X=2$ 的范围内画出估计的曲线。注意截距、斜率和其他相关信息。

问题 10 的题解

第一步:写出估计曲线函数。
估计曲线为:

$$\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 b_1(X) + \hat{\beta}_2 b_2(X) = 1 + 1 \cdot X + (-2) \cdot (X-1)^2 I(X \geq 1)$$

由于指示函数的存在,该函数是分段定义的:

  • X<1X < 1 时:f^(X)=1+X\hat{f}(X) = 1 + X(因为 I(X1)=0I(X \geq 1) = 0

  • X1X \geq 1 时:f^(X)=1+X2(X1)2\hat{f}(X) = 1 + X - 2(X-1)^2

第二步:简化 $X \geq 1$ 时的表达式。

$$\hat{f}(X) = 1 + X - 2(X^2 - 2X + 1) = 1 + X - 2X^2 + 4X - 2 = -2X^2 + 5X - 1$$

因此分段函数为:

$$\hat{f}(X) = \begin{cases} 1 + X & \text{若 } X < 1 \\ -2X^2 + 5X - 1 & \text{若 } X \geq 1 \end{cases}$$

第三步:曲线的关键特征。

  • X<1X < 1 时:

    • 这是一条斜率为 1、y 截距为 1 的直线。
    • X=2X = -2 处:f^(2)=1+(2)=1\hat{f}(-2) = 1 + (-2) = -1
    • X=0X = 0 处:f^(0)=1+0=1\hat{f}(0) = 1 + 0 = 1
    • XX 从左侧趋近于 1 时:f^(1)=1+1=2\hat{f}(1) = 1 + 1 = 2
  • X1X \geq 1 时:

    • 这是一个开口向下的抛物线(因为 $X^2$ 的系数为负)。
    • 在 $X = 1$ 处:$\hat{f}(1) = -2(1)^2 + 5(1) - 1 = 2$(与直线部分连续)
    • 抛物线的顶点在 $X = -\frac{b}{2a} = -\frac{5}{2(-2)} = 1.25$
    • 在顶点 $X = 1.25$ 处:$\hat{f}(1.25) = -2(1.25)^2 + 5(1.25) - 1 = 2.125$
    • 在 $X = 2$ 处:$\hat{f}(2) = -2(4) + 5(2) - 1 = 1$
  • 连续性与可微性:

    • 曲线在 $X = 1$ 处连续,因为两段都给出 $\hat{f}(1) = 2$。
    • 当 $X < 1$ 时,导数为 $\hat{f}'(X) = 1$。
    • 当 $X > 1$ 时,导数为 $\hat{f}'(X) = -4X + 5$。
    • 在 $X = 1$ 处:左导数为 1,右导数为 $-4(1) + 5 = 1$。因此曲线在 $X = 1$ 处是光滑的(可微的)。

第四步:草图描述。
曲线从点 (-2, -1) 开始,以斜率 1 线性增加,直到点 (1, 2)。然后它继续为抛物线,上升到最大值点 (1.25, 2.125),然后下降到点 (2, 1)。整个曲线是光滑的,没有尖角。