#assignment #sdsc5001

Question link: SDSC5001 - Question of Assignment 2

Question 1

When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large. We will now investigate this curse.

(a) Suppose that we have a set of observations, each with measurements on p=1 feature, X. We assume that X is uniformly (evenly) distributed on [0, 1]. Associated with each observation is a response value. Suppose that we wish to predict a test observation’s response using only observations that are within 10% of the range of X closest to that test observation. For instance, in order to predict the response for a test observation with X=0.6, we will use observations in the range [0.55, 0.65]. On average, what fraction of the available observations will we use to make the prediction?

(b) Now suppose that we have a set of observations, each with measurements on p=2 features, X1 and X2. We assume that (X1, X2) are uniformly distributed on [0,1] x [0,1]. We wish to predict a test observation’s response using only observations that are within 10% of the range of X1 and within 10% of the range of X2 closest to that test observation. For instance, in order to predict the response for a test observation with X1=0.6 and X2=0.35, we will use observations in the range [0.55, 0.65] for X1 and in the range [0.3, 0.4] for X2. On average, what fraction of the available observations will we use to make the prediction?

(c) Now suppose that we have a set of observations on p=100 features. Again the observations are uniformly distributed on each feature, and again each feature ranges in value from 0 to 1. We wish to predict a test observation’s response using observations within the 10% of each feature’s range that is closest to that test observation. What fraction of the available observations will we use to make the prediction?

(d) Using your answers to parts (a)-(c), argue that a drawback of KNN when p is large is that there are very few training observations “near” any given test observation.

Solution to Question 1

(a) For p=1, the feature X is uniformly distributed on [0,1]. The range of X is 1, and we are considering a neighborhood that is 10% of this range, so the interval length is 0.1. Since the distribution is uniform, the fraction of observations used is, on average, the ratio of the interval length to the total range: 0.1, or 10%. (Strictly, the interval is truncated for test points within 0.05 of either boundary, which pulls the exact average slightly below 10%.)

(b) For p=2, the features (X1, X2) are uniformly distributed on the unit square [0,1]x[0,1]. The neighborhood is a rectangle with sides of length 0.1 in each dimension (10% of the range for each feature). The area of this rectangle is 0.1 * 0.1 = 0.01. Since the distribution is uniform, the fraction of observations used is the area of the rectangle, which is 0.01 or 1%.

(c) For p=100, the features are uniformly distributed on a 100-dimensional hypercube. The neighborhood is a hypercube with each side of length 0.1. The volume of this hypercube is $(0.1)^{100} = 10^{-100}$. Thus, the fraction of observations used is extremely small, approximately $10^{-100}$.

(d) From parts (a) to (c), we see that as the number of features p increases, the fraction of observations near a test observation decreases dramatically. For p=1, it’s 10%; for p=2, it’s 1%; and for p=100, it’s virtually zero. This means that in high dimensions, there are very few training points within the local neighborhood of any test point. Consequently, KNN and other local methods suffer because they rely on having sufficient nearby points to make accurate predictions. This is the curse of dimensionality: as p grows, the data becomes sparse, and local approximations become unreliable.
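
These fractions are easy to check by simulation. Below is a minimal Monte Carlo sketch in R (illustrative only; the sample size and seed are arbitrary choices) that draws uniform data on the unit hypercube and counts the points lying within 0.05 of a random test point in every coordinate:

# Estimate the average fraction of points inside the side-0.1 neighborhood
set.seed(1)
n <- 10000
for (p in c(1, 2, 100)) {
  X <- matrix(runif(n * p), ncol = p)                  # training observations
  x0 <- runif(p)                                       # a random test point
  near <- rowSums(abs(sweep(X, 2, x0)) <= 0.05) == p   # inside in every coordinate
  cat("p =", p, " fraction near:", mean(near), "\n")
}
# Expected output: roughly 0.1 for p = 1, 0.01 for p = 2, and 0 of the
# 10,000 points for p = 100.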


Question 2

Answer the following questions about the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

(c) In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?

(d) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

Solution to Question 2

(a)

  • Training Set: We expect QDA to perform better on the training set. QDA is a more flexible model (it has more parameters) than LDA, so it can fit the training data more closely, leading to a lower training error rate.

  • Test Set: We expect LDA to perform better on the test set. Since the true (Bayes) boundary is linear, the extra flexibility of QDA is unnecessary. LDA, by making the correct assumption of a common covariance matrix, will have lower variance. Using QDA in this case would likely lead to overfitting (higher variance without a reduction in bias), resulting in a higher test error.

(b)

  • Training Set: We expect QDA to perform better on the training set. For the same reason as in (a), its higher flexibility allows it to achieve a better fit to the training data.

  • Test Set: We expect QDA to perform better on the test set. Because the true boundary is non-linear, QDA’s flexibility to model different class covariances allows it to better approximate the true boundary. LDA, constrained to a linear boundary, will suffer from higher bias.

(c) As the sample size n increases, we expect the test prediction accuracy of QDA relative to LDA to improve.

  • Reason: QDA requires estimating a separate covariance matrix for each class, which involves more parameters than LDA (which estimates a single pooled covariance matrix). With a small sample size, the variance introduced by estimating these additional parameters in QDA can hurt its performance. However, as the sample size grows larger, these parameters can be estimated more accurately. The benefit of QDA’s flexibility (potentially lower bias) is realized with less risk of overfitting (high variance), so its performance relative to the simpler LDA model improves.

(d) False.

  • Justification: While QDA is flexible enough to model a linear decision boundary (it is a more general model), it is not likely to achieve a superior test error rate when the true boundary is linear. Because QDA has more parameters to estimate, it has higher variance than LDA. If the true boundary is linear, LDA correctly assumes this simpler structure. Using QDA would introduce unnecessary variance without reducing bias, likely leading to a higher test error due to overfitting. Therefore, LDA is generally preferred when we have reason to believe the decision boundary is linear.
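
As an illustrative check of (a) and (d), the following sketch (simulated data, assuming the MASS package) draws two Gaussian classes with a shared covariance matrix, so the Bayes boundary is linear, and compares LDA and QDA test errors when the training set is small:

library(MASS)   # provides lda() and qda()
set.seed(1)
# Two classes sharing an identity covariance => linear Bayes boundary
sim <- function(n) {
  y  <- factor(rbinom(n, 1, 0.5))
  x1 <- rnorm(n, mean = ifelse(y == 1, 1, 0))
  x2 <- rnorm(n, mean = ifelse(y == 1, -1, 0))
  data.frame(x1, x2, y)
}
train <- sim(50)
test  <- sim(10000)
mean(predict(lda(y ~ x1 + x2, data = train), test)$class != test$y)  # LDA test error
mean(predict(qda(y ~ x1 + x2, data = train), test)$class != test$y)  # typically slightly higher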


Question 3

Suppose we collect data for a group of students in a statistics class with variables $X_1 =$ hours studied, $X_2 =$ undergrad GPA, and $Y =$ receive an A. We fit a logistic regression and produce estimated coefficients $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$.

(a) Estimate the probability that a student who studies for 40h and has an undergrad GPA of 3.5 gets an A in the class.

(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

Solution to Question 3

(a) Estimate the probability

The logistic regression model is:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}$$

Given: $\hat{\beta}_0 = -6$, $\hat{\beta}_1 = 0.05$, $\hat{\beta}_2 = 1$, $X_1 = 40$, $X_2 = 3.5$.

  1. Calculate the linear predictor (z):

    $$z = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 = -6 + (0.05)(40) + (1)(3.5) = -6 + 2 + 3.5 = -0.5$$

  2. Calculate the probability:

    $$P(Y=1) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{0.5}} \approx \frac{1}{1 + 1.6487} \approx \frac{1}{2.6487} \approx 0.3775$$

    Answer: The estimated probability is approximately 0.378 (or 37.8%).

(b) Find hours for a 50% chance

A 50% chance means $P(Y=1) = 0.5$. In the logistic model, this happens precisely when the linear predictor $z = 0$.

  1. Set up the equation:
    We have $z = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 = 0$. We know $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, and $X_2 = 3.5$; we need to solve for $X_1$ (hours studied).

    $$-6 + 0.05 X_1 + 1 \cdot 3.5 = 0$$

  2. Solve for $X_1$:

    $$0.05 X_1 - 2.5 = 0$$

    $$0.05 X_1 = 2.5$$

    $$X_1 = \frac{2.5}{0.05} = 50$$

    Answer: The student would need to study for 50 hours to have a 50% chance of getting an A.
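
Both results can be verified in one line each with base R's logistic CDF plogis():

plogis(-6 + 0.05 * 40 + 1 * 3.5)   # (a): returns ~0.3775
(6 - 3.5) / 0.05                   # (b): hours that make z = 0, returns 50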


Question 4

Let’s develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set in the ISLP package.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

(c) Split the data into a training set and a test set.

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

Solution to Question 4

(a) Create binary variable

# Load the library and data (the Auto data set ships with the ISLR2 R package;
# ISLP is the companion Python package)
library(ISLR2)
data(Auto)

# Calculate the median of mpg
mpg_median <- median(Auto$mpg)

# Create the binary variable mpg01
Auto$mpg01 <- as.numeric(Auto$mpg > mpg_median)

(b) Exploratory Data Analysis (EDA)

  • Method: Create boxplots of each predictor against mpg01, and scatterplots of pairs of predictors colored by mpg01.

  • Key Findings: Features strongly associated with mpg01 are typically:

    • weight: Heavier cars generally have lower mpg.
    • horsepower: More powerful cars generally have lower mpg.
    • displacement: Larger engine size generally indicates lower mpg.
    • acceleration: Cars that take longer to accelerate (a higher acceleration value, typically indicating a less powerful engine) tend to have higher mpg, but the relationship is weaker than for the first three.
  • Conclusion: weight, horsepower, and displacement are the most promising predictors; a plotting sketch follows this list.
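
A minimal base-R plotting sketch for this EDA (assuming mpg01 from part (a) has been added to Auto):

par(mfrow = c(2, 2))
boxplot(weight ~ mpg01, data = Auto, main = "weight vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
plot(Auto$weight, Auto$horsepower, col = Auto$mpg01 + 1,
     xlab = "weight", ylab = "horsepower",
     main = "colored by mpg01 (red = high mpg)")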

(c) Split the data

# Set seed for reproducibility
set.seed(123)
# Create random indices for 80% training, 20% test split
train_index <- sample(1:nrow(Auto), nrow(Auto) * 0.8)
train_set <- Auto[train_index, ]
test_set <- Auto[-train_index, ]

(d) LDA

# Fit LDA model using selected features (e.g., weight, horsepower)
library(MASS)
lda_fit <- lda(mpg01 ~ weight + horsepower, data = train_set)
lda_pred <- predict(lda_fit, newdata = test_set)
# Calculate test error: mean of misclassified observations
lda_test_error <- mean(lda_pred$class != test_set$mpg01)
# Expected test error is typically around 10-15%.

(e) QDA

# Fit QDA model
qda_fit <- qda(mpg01 ~ weight + horsepower, data = train_set)
qda_pred <- predict(qda_fit, newdata = test_set)
qda_test_error <- mean(qda_pred$class != test_set$mpg01)
# Test error might be similar or slightly higher than LDA if the linear assumption is reasonable.

(f) Logistic Regression

# Fit logistic regression model
glm_fit <- glm(mpg01 ~ weight + horsepower, data = train_set, family = binomial)
glm_probs <- predict(glm_fit, newdata = test_set, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
glm_test_error <- mean(glm_pred != test_set$mpg01)
# Test error is often comparable to LDA.

(g) KNN

# Prepare data: Standardize features and create matrices
library(class)
train_X <- scale(train_set[, c("weight", "horsepower")])
test_X <- scale(test_set[, c("weight", "horsepower")],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))
train_y <- train_set$mpg01

# Try different K values (e.g., 1, 5, 10, 20)
k_values <- c(1, 5, 10, 20)
knn_errors <- sapply(k_values, function(k) {
  set.seed(123)
  knn_pred <- knn(train_X, test_X, train_y, k = k)
  mean(knn_pred != test_set$mpg01)
})

# Identify best K (the one with the lowest test error)
best_k <- k_values[which.min(knn_errors)]
# The best K is often a moderate value like 5 or 10, balancing bias and variance.

Summary of Findings:

  • The best model is often Logistic Regression or LDA, with test errors around 10%.

  • QDA might perform similarly or slightly worse.

  • KNN performance depends heavily on K. A moderate K (e.g., 5, 10) usually performs best, potentially achieving error rates similar to the linear models.


Question 5

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain $p+1$ models, containing $0, 1, 2, \ldots, p$ predictors. Answer the following questions:

(a) Which of the three models with k predictors has the smallest training RSS?

(b) Which of the three models with k predictors has the smallest test MSE?

(c) True or False for each statement below.
i. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
ii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
iii. The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
iv. The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
v. The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k+1)-variable model identified by best subset selection.

Solution to Question 5

(a) Training RSS
The best subset selection model with k predictors will have the smallest training RSS.

  • Reason: Best subset selection searches through all possible combinations of k predictors and chooses the best one. Forward and backward stepwise are greedy approximations that do not guarantee the absolute best model of size k.

(b) Test MSE
There is no definitive answer; it depends on the specific dataset and the true relationship between the predictors and the response.

  • Reason: The model with the smallest test MSE is the one that best balances bias and variance. While best subset has the lowest training RSS (and thus lowest bias), it might overfit the training data (high variance), leading to a higher test MSE than a more constrained stepwise approach in some cases. The relative performance is unpredictable and must be validated on test data.

(c) True or False

i. True.

  • Reason: Forward stepwise selection starts with no predictors and adds one predictor at a time. The model with k predictors is built directly from the model with k-1 predictors by adding the next best predictor. Therefore, the set of predictors in the k-variable model is always a subset of the predictors in the (k+1)-variable model. The model path is “nested.”

ii. True.

  • Reason: Backward stepwise selection starts with all p predictors and removes one predictor at a time. The model with k+1 predictors is built directly from the model with k predictors by removing the least useful predictor. Therefore, the predictors in the k-variable model are a subset of the predictors in the (k+1)-variable model. This path is also “nested.”

iii. False.

  • Reason: There is no guaranteed relationship between the subsets of predictors chosen by backward stepwise for a given k and the subsets chosen by forward stepwise for k+1. The two algorithms follow different search paths and can produce very different models.

iv. False.

  • Reason: Similarly, there is no guaranteed subset relationship between the models found by forward and backward stepwise selection. The k-variable model from forward stepwise is not necessarily a subset of the (k+1)-variable model from backward stepwise.

v. False.

  • Reason: Best subset selection independently finds the best model for each possible model size. The optimal set of k predictors is not necessarily a subset of the optimal set of k+1 predictors: for instance, the best single predictor might be X3 while the best pair is {X1, X2}, if X1 and X2 are jointly, but not individually, strong. The algorithm is free to choose a completely different combination of predictors for each model size.
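
A small sketch with the leaps package (an extra dependency, purely illustrative) makes the contrast concrete: in the logical matrices printed below, each forward-stepwise row must contain the previous row's predictors, while the best-subset rows need not be nested.

library(leaps)
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 6), n, 6)
y <- X[, 1] + X[, 2] - 1.5 * X[, 3] + rnorm(n)
dat <- data.frame(y, X)
fwd  <- regsubsets(y ~ ., data = dat, nvmax = 6, method = "forward")
best <- regsubsets(y ~ ., data = dat, nvmax = 6, method = "exhaustive")
summary(fwd)$which   # nested by construction: one new TRUE per row
summary(best)$which  # free to swap predictors between model sizes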


Question 6

Choose the correct answer for each question below.

(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(b) Ridge regression, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(c) Nonlinear methods, relative to least squares, are:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Solution to Question 6

(a) Correct Answer: iii

  • Reasoning: Lasso regression uses L1 regularization (shrinking coefficients, some to exactly zero), which makes it less flexible than least squares. A less flexible model has higher bias but lower variance. The trade-off is beneficial for prediction accuracy only if the increase in bias is small relative to the decrease in variance.

(b) Correct Answer: iii

  • Reasoning: Ridge regression uses L2 regularization (shrinking coefficients towards zero but not exactly zero), which also makes it less flexible than least squares. Similar to the lasso, it improves prediction accuracy when the increase in bias is less than the decrease in variance.

(c) Correct Answer: ii

  • Reasoning: Nonlinear methods (e.g., polynomial regression, splines, decision trees) are more flexible than the linear least squares model. A more flexible model has lower bias but higher variance. The trade-off is beneficial for prediction accuracy only if the increase in variance is less than the decrease in bias.


Question 7

Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^n\left(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij}\right)^2 \quad\text{subject to}\quad \sum_{j=1}^p|\beta_j|\leq s$$

for a particular value of s. Choose the correct answer for each question below.

(a) As we increase s from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(b) As we increase s from 0, the test MSE will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(c) As we increase s from 0, the variance will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(d) As we increase s from 0, the (squared) bias will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution to Question 7

This problem describes the Lasso regression method, where s is a bound on the L1-norm of the coefficients. As s increases, the model becomes less constrained, moving from a null model towards the full least squares solution.

(a) Correct Answer: iv. Steadily decrease.

  • Reasoning: The Training RSS is minimized when the model fits the data best. When s=0, all coefficients are forced to be zero, resulting in a high RSS. As s increases, the constraint is relaxed, allowing the coefficients to take on values that better fit the training data. Therefore, the training RSS will decrease monotonically (steadily) as the model flexibility increases, reaching a minimum at the least squares solution (when s is large enough).

(b) Correct Answer: ii. Decrease initially, and then eventually start increasing in a U shape.

  • Reasoning: Test MSE captures the prediction error on new data and is subject to the bias-variance trade-off. For very small s (strong constraint), the model has high bias (underfitting), leading to high test MSE. As s increases to an optimal value, variance increases slightly, but bias decreases significantly, leading to a decrease in test MSE. Beyond this optimal point, further increasing s leads to overfitting (variance increases dramatically with little reduction in bias), causing the test MSE to rise again. This creates a characteristic U-shape.

(c) Correct Answer: iii. Steadily increase.

  • Reasoning: Variance measures the model’s sensitivity to the training data. A highly constrained model (small s) has low variance. As the constraint is relaxed (increasing s), the model has more freedom to fit the specific nuances of the training set, making its estimates more variable. Therefore, variance increases steadily as s increases.

(d) Correct Answer: iv. Steadily decrease.

  • Reasoning: (Squared) Bias measures the error introduced by the model’s inability to represent the true relationship. A very constrained model (small s) is simplistic and may have high bias. As s increases, the model becomes more flexible and can better approximate the underlying true function, leading to a steady decrease in bias.
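
To visualize the claim in (a), the sketch below (simulated data, assuming the glmnet package) fits a lasso path. glmnet parameterizes the problem by $\lambda$ rather than the budget s: a smaller $\lambda$ corresponds to a larger s, and the path is stored from large $\lambda$ to small, i.e., in order of increasing s.

library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% rnorm(p) + rnorm(n))
fit <- glmnet(X, y, alpha = 1)                 # alpha = 1 selects the lasso
train_rss <- colSums((y - predict(fit, X))^2)  # one RSS per lambda on the path
plot(train_rss, type = "l", xlab = "path index (increasing s ->)",
     ylab = "training RSS")                    # steadily decreasing, as in (a)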


Question 8

Suppose we estimate the regression coefficients in a linear regression model by minimizing

$$\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j} x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}$$

for a particular value of $\lambda$. Choose the correct answer for each question below.

(a) As we increase $\lambda$ from 0, the training RSS will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(b) As we increase $\lambda$ from 0, the test MSE will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(c) As we increase $\lambda$ from 0, the variance will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

(d) As we increase $\lambda$ from 0, the (squared) bias will:
i. Increase initially, and then eventually start decreasing in an inverted U shape.
ii. Decrease initially, and then eventually start increasing in a U shape.
iii. Steadily increase.
iv. Steadily decrease.
v. Remain constant.

Solution to Question 8

This problem describes Ridge regression, where $\lambda$ controls the strength of L2 regularization. As $\lambda$ increases, the model becomes more constrained, moving from the full least squares solution toward a null model.

(a) Correct Answer: iii. Steadily increase.

  • Reasoning: The Training RSS measures how well the model fits the training data. When $\lambda=0$, we have the ordinary least squares solution, which minimizes the RSS. As $\lambda$ increases, the regularization term forces the coefficients to shrink toward zero, making the model less flexible and reducing its ability to fit the training data perfectly. Therefore, the training RSS will increase steadily as $\lambda$ increases.

(b) Correct Answer: ii. Decrease initially, and then eventually start increasing in a U shape.

  • Reasoning: Test MSE is subject to the bias-variance trade-off. When $\lambda=0$ (no regularization), the model may overfit (high variance), leading to high test MSE. As $\lambda$ increases to an optimal value, the reduction in variance outweighs the increase in bias, causing test MSE to decrease. Beyond this optimal point, the model becomes too constrained (high bias, underfitting), and test MSE increases again. This creates a U-shaped curve.

(c) Correct Answer: iv. Steadily decrease.

  • Reasoning: Variance measures the model’s sensitivity to fluctuations in the training data. A complex model ($\lambda=0$) has high variance. As $\lambda$ increases, the regularization constrains the coefficients, making the model more stable and less sensitive to the specific training sample. Therefore, variance decreases steadily as $\lambda$ increases.

(d) Correct Answer: iii. Steadily increase.

  • Reasoning: (Squared) Bias measures the error from approximating a complex real-world phenomenon with a simpler model. When $\lambda=0$, the model is very flexible and has low bias. As $\lambda$ increases, the model becomes more constrained and less able to capture the true underlying relationship in the data. Therefore, bias increases steadily as $\lambda$ increases.
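
A companion glmnet sketch (simulated data; alpha = 0 selects the ridge penalty) shows the training RSS rising steadily with $\lambda$ while the test MSE traces the U shape from (b):

library(glmnet)
set.seed(1)
p <- 30; beta <- rnorm(p)
make <- function(n) {
  X <- matrix(rnorm(n * p), n, p)
  list(X = X, y = as.numeric(X %*% beta + rnorm(n, sd = 3)))
}
tr <- make(60); te <- make(2000)
fit <- glmnet(tr$X, tr$y, alpha = 0, lambda = 10^seq(3, -3, length.out = 60))
train_rss <- colSums((tr$y - predict(fit, tr$X))^2)
test_mse  <- colMeans((te$y - predict(fit, te$X))^2)
par(mfrow = c(1, 2))  # fit$lambda is stored in decreasing order, so reverse it
plot(rev(fit$lambda), rev(train_rss), log = "x", type = "l",
     xlab = "lambda", ylab = "training RSS")   # steadily increasing
plot(rev(fit$lambda), rev(test_mse), log = "x", type = "l",
     xlab = "lambda", ylab = "test MSE")       # U-shaped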


Question 9

We will predict the number of applications received in the College data set in the ISLP package.

(a) Split the data set into a training set and a test set.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

(c) Fit a ridge regression model on the training set, with $\lambda$ chosen by cross validation. Report the test error obtained.

(d) Fit a lasso model on the training set, with $\lambda$ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

(e) Fit a PCR model on the training set, with M (the number of principal components) chosen by cross validation. Report the test error obtained, along with the number of PCs selected by cross validation.

Solution to Question 9

Note: This is a practical coding exercise. The exact results will depend on the random seed used for splitting the data. Below is a typical approach and expected outcomes.

(a) Data Splitting

# Load required libraries and data (the College data set ships with the
# ISLR2 R package; ISLP is the companion Python package)
library(ISLR2)
data(College)

# Set seed for reproducibility
set.seed(123)

# Create training indices (e.g., 70% training, 30% test)
train_index <- sample(1:nrow(College), nrow(College) * 0.7)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

(b) Linear Regression (Least Squares)

# Fit linear model
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_test_error <- mean((lm_pred - test_data$Apps)^2) # MSE
# Typical test MSE: Around 1,000,000 - 2,000,000

(c) Ridge Regression

library(glmnet)

# Prepare data for glmnet (matrix format)
x_train <- model.matrix(Apps ~ ., train_data)[,-1] # Remove intercept
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., test_data)[,-1]

# Fit ridge regression with cross-validation
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0) # alpha=0 for ridge
best_lambda_ridge <- ridge_cv$lambda.min

ridge_fit <- glmnet(x_train, y_train, alpha = 0, lambda = best_lambda_ridge)
ridge_pred <- predict(ridge_fit, newx = x_test)
ridge_test_error <- mean((ridge_pred - test_data$Apps)^2)
# Typical test MSE: Slightly lower than linear regression

(d) Lasso Regression

# Fit lasso with cross-validation
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1) # alpha=1 for lasso
best_lambda_lasso <- lasso_cv$lambda.min

lasso_fit <- glmnet(x_train, y_train, alpha = 1, lambda = best_lambda_lasso)
lasso_pred <- predict(lasso_fit, newx = x_test)
lasso_test_error <- mean((lasso_pred - test_data$Apps)^2)

# Number of non-zero coefficients
lasso_coef <- predict(lasso_fit, type = "coefficients")
num_nonzero <- sum(lasso_coef != 0) - 1 # Exclude intercept
# Typical: Test MSE similar to ridge, with 10-20 non-zero coefficients

(e) Principal Component Regression (PCR)

library(pls)

# Fit PCR with cross-validation
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

# Get optimal number of components
pcr_cv <- MSEP(pcr_fit, estimate = "CV")
optimal_m <- which.min(pcr_cv$val[1,,]) - 1 # Subtract 1 for 0 components

pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = optimal_m)
pcr_test_error <- mean((pcr_pred - test_data$Apps)^2)
# Typical: Optimal M around 10-15 components, test MSE similar to ridge/lasso

Expected Results Summary:

  • Linear Regression: Highest test error due to potential overfitting

  • Ridge Regression: 5-10% improvement over linear regression

  • Lasso Regression: Similar performance to ridge, with variable selection

  • PCR: Performance depends on whether the important predictors align with the first few principal components


Question 10

Suppose we fit a curve with basis functions $b_1(X) = X$, $b_2(X) = (X-1)^2 I(X \geq 1)$. We fit the linear regression model

$$Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \epsilon$$

and obtain coefficient estimates $\hat{\beta}_0 = 1$, $\hat{\beta}_1 = 1$, $\hat{\beta}_2 = -2$. Sketch the estimated curve between $X = -2$ and $X = 2$. Note the intercepts, slopes, and other relevant information.

Solution to Question 10

Step 1: Write the estimated curve function.
The estimated curve is given by:

$$\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 b_1(X) + \hat{\beta}_2 b_2(X) = 1 + 1 \cdot X + (-2) \cdot (X-1)^2 I(X \geq 1)$$

This function is piecewise defined due to the indicator function:

  • For $X < 1$: $\hat{f}(X) = 1 + X$ (since $I(X \geq 1) = 0$)

  • For $X \geq 1$: $\hat{f}(X) = 1 + X - 2(X-1)^2$

Step 2: Simplify the expression for $X \geq 1$.

$$\hat{f}(X) = 1 + X - 2(X^2 - 2X + 1) = 1 + X - 2X^2 + 4X - 2 = -2X^2 + 5X - 1$$

So the piecewise function is:

$$\hat{f}(X) = \begin{cases} 1 + X & \text{if } X < 1 \\ -2X^2 + 5X - 1 & \text{if } X \geq 1 \end{cases}$$

Step 3: Key features of the curve.

  • For $X < 1$:

    • This is a straight line with slope = 1 and y-intercept = 1.
    • At $X = -2$: $\hat{f}(-2) = 1 + (-2) = -1$
    • At $X = 0$: $\hat{f}(0) = 1 + 0 = 1$
    • As $X$ approaches 1 from the left: $\hat{f}(1) = 1 + 1 = 2$
  • For $X \geq 1$:

    • This is a downward-opening parabola (since the coefficient of $X^2$ is negative).
    • At $X = 1$: $\hat{f}(1) = -2(1)^2 + 5(1) - 1 = 2$ (continuous with the linear part)
    • The vertex of the parabola occurs at $X = -\frac{b}{2a} = -\frac{5}{2(-2)} = 1.25$
    • At the vertex $X = 1.25$: $\hat{f}(1.25) = -2(1.25)^2 + 5(1.25) - 1 = 2.125$
    • At $X = 2$: $\hat{f}(2) = -2(4) + 5(2) - 1 = 1$
  • Continuity and differentiability:

    • The curve is continuous at $X = 1$ since both pieces give $\hat{f}(1) = 2$.
    • The derivative for $X < 1$ is $\hat{f}'(X) = 1$.
    • The derivative for $X > 1$ is $\hat{f}'(X) = -4X + 5$.
    • At $X = 1$: left derivative = 1, right derivative = $-4(1) + 5 = 1$. So the curve is smooth (differentiable) at $X = 1$.

Step 4: Sketch description.
The curve starts at point (-2, -1) and increases linearly with slope 1 until reaching (1, 2). Then it continues as a parabola, rising to a maximum at (1.25, 2.125), and then decreasing to (2, 1). The curve is smooth throughout with no sharp corners.
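
A short R sketch (illustrative) reproduces the described curve:

# The indicator (x >= 1) evaluates to 0/1, giving the piecewise form directly
f <- function(x) 1 + x - 2 * (x - 1)^2 * (x >= 1)
curve(f, from = -2, to = 2, xlab = "X", ylab = "estimated f(X)")
abline(v = 1, lty = 2)          # knot at X = 1 (smooth join)
points(1.25, 2.125, pch = 19)   # maximum at (1.25, 2.125)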