#sdsc6012


Fundamentals of Time Series Theory

Definition and Properties of Time Series

A time series is a sequence of random variables arranged in chronological order, denoted $\{X_t : t \in T\}$, where $T$ is the time index set. In practical applications, $T$ is typically a discrete set (e.g., $T = \{0, 1, 2, \ldots\}$).

Core Concept: Time series analysis aims to reveal internal dynamic dependencies within the sequence and build predictive models based on historical data.

Example Data Table

import pandas as pd

# Example multivariate time series: daily weather, sales, and stock observations
data = {
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
             '2023-01-06', '2023-01-07', '2023-01-08', '2023-01-09', '2023-01-10'],
    'Temperature': [22.5, 24.1, 23.8, 21.2, 20.5, 19.8, 22.3, 23.7, 24.5, 25.2],
    'Humidity': [65, 62, 68, 72, 75, 78, 70, 66, 63, 60],
    'Sales': [150, 168, 142, 135, 158, 172, 165, 148, 156, 162],
    'Stock_Price': [105.2, 106.8, 104.5, 103.1, 107.3, 109.6, 108.2, 106.7, 107.9, 110.4]
}

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

Description: Time series data includes timestamps and multiple observed variables, suitable for multivariate time series analysis.

Stationarity: Strict Definition and Classification

Strictly Stationary Process

A time series $\{X_t\}$ is strictly stationary if its finite-dimensional distributions are invariant under time shifts, i.e., for any time points $t_1, \ldots, t_n$ and any shift $k$:

$$F_{X_{t_1}, X_{t_2}, \ldots, X_{t_n}}(x_1, x_2, \ldots, x_n) = F_{X_{t_1+k}, X_{t_2+k}, \ldots, X_{t_n+k}}(x_1, x_2, \ldots, x_n)$$

where $F$ is the joint distribution function and $n$ is any positive integer.

Weakly Stationary Process

In practical applications, weak stationarity is more commonly used, requiring three conditions:

  1. Constant mean function:

    $$\mathbb{E}[X_t] = \mu \quad \text{for all } t$$

  2. Constant variance function:

    $$\text{Var}(X_t) = \sigma^2 \quad \text{for all } t$$

  3. Autocovariance function depends only on time lag:

    $$\text{Cov}(X_t, X_s) = \gamma(|t-s|) \quad \text{for all } t, s$$

Important Note: Strict stationarity implies weak stationarity (provided the second moments are finite), but the converse is not true unless the process is Gaussian.

Stationarity Testing Methodology

Graphical Testing Methods

Time Series Plot Analysis

Plot the time series with a mean line to visually identify:

  • Trend components

  • Seasonality

  • Heteroscedasticity

import matplotlib.pyplot as plt

# Plot each variable with its mean line to visually inspect trend and variance
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
variables = ['Temperature', 'Humidity', 'Sales', 'Stock_Price']

for i, var in enumerate(variables):
    row, col = i // 2, i % 2
    axes[row, col].plot(df.index, df[var], marker='o', linewidth=2)
    axes[row, col].axhline(y=df[var].mean(), color='r', linestyle='--',
                           label=f'Mean ({df[var].mean():.1f})')
    axes[row, col].set_title(f'{var} - Stationarity Analysis')
    axes[row, col].legend()
    axes[row, col].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Autocorrelation Function Plot

Plot the sample autocorrelation function (SACF). For a stationary series, the SACF decays rapidly to near zero; for a non-stationary series, it decays slowly.
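A minimal sketch of this check with statsmodels, applied to the example df above (the lag count is an illustrative choice):

from statsmodels.graphics.tsaplots import plot_acf

# Sample ACF of the temperature series; bars within the shaded band
# are not significantly different from zero
fig, ax = plt.subplots(figsize=(8, 4))
plot_acf(df['Temperature'], lags=8, ax=ax)
plt.show()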

Statistical Test: Augmented Dickey-Fuller Test

Test Principle

The ADF test estimates the following regression model:

$$\Delta X_t = \alpha + \beta t + \gamma X_{t-1} + \sum_{i=1}^{p} \phi_i \Delta X_{t-i} + \varepsilon_t$$

where $\Delta X_t = X_t - X_{t-1}$ denotes the first difference.

Hypothesis Setting

  • Null hypothesis $H_0: \gamma = 0$ (series is non-stationary)

  • Alternative hypothesis $H_1: \gamma < 0$ (series is stationary)

Decision Criterion

If the test statistic is less than the critical value (or p-value < significance level, e.g., 0.05), reject the null hypothesis, indicating stationarity.

from statsmodels.tsa.stattools import adfuller

print("Augmented Dickey-Fuller Test Results:")
print("=" * 50)

for var in variables:
    result = adfuller(df[var])
    print(f"{var}:")
    print(f"  ADF Statistic: {result[0]:.4f}")
    print(f"  p-value: {result[1]:.4f}")

    if result[1] < 0.05:
        print("  -> Series is likely STATIONARY (reject null hypothesis)")
    else:
        print("  -> Series is likely NON-STATIONARY (cannot reject null hypothesis)")
    print("-" * 30)

Note: The ADF test is sensitive to the choice of lag order $p$, typically determined using AIC or BIC criteria.

Gaussian White Noise Process: Ideal Stationary Series

Gaussian white noise $\{\varepsilon_t\}$ is a fundamental process in time series analysis, defined by:

  1. $\mathbb{E}[\varepsilon_t] = 0$ (zero mean)

  2. $\text{Var}(\varepsilon_t) = \sigma^2$ (constant variance)

  3. $\text{Cov}(\varepsilon_t, \varepsilon_s) = 0$ for $t \neq s$ (no autocorrelation)

Mathematical Expression: $\varepsilon_t \sim \text{IID } \mathcal{N}(0, \sigma^2)$, where IID denotes independent and identically distributed.

Mathematical Definition

$$X_t \sim \mathcal{N}(0, 1) \quad \text{for all } t$$

Properties

  • Mean: $\mathbb{E}[X_t] = 0$

  • Variance: $\text{Var}(X_t) = 1$

  • Autocovariance function:

    $$\gamma(k) = \begin{cases} 1 & \text{if } k = 0 \\ 0 & \text{if } k \neq 0 \end{cases}$$

Generation and Verification

import numpy as np

# Generate Gaussian white noise
np.random.seed(42)
n_points = 500
white_noise = np.random.normal(0, 1, n_points)

# Verify statistical properties
print(f"Overall Mean: {white_noise.mean():.4f}")
print(f"Overall Standard Deviation: {white_noise.std():.4f}")
print(f"Variance: {white_noise.var():.4f}")

# Autocorrelation test: sample ACF should be near zero at all non-zero lags
from statsmodels.tsa.stattools import acf
autocorr = acf(white_noise, nlags=10)
print("\nAutocorrelation (lags 1-5):")
for i in range(1, 6):
    print(f"  Lag {i}: {autocorr[i]:.4f}")

Autocovariance and Autocorrelation Functions

Autocovariance Function

For a weakly stationary process, the autocovariance function is defined as:

$$\gamma(k) = \text{Cov}(X_t, X_{t+k}) = \mathbb{E}[(X_t - \mu)(X_{t+k} - \mu)]$$

where $k$ is the lag order.

Autocorrelation Function

The standardized autocovariance function gives the autocorrelation function:

$$\rho(k) = \frac{\gamma(k)}{\gamma(0)} = \frac{\gamma(k)}{\sigma^2}$$

The autocorrelation function satisfies $\rho(0) = 1$, $\rho(k) = \rho(-k)$, and $|\rho(k)| \leq 1$.
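For illustration, the sample autocovariance and autocorrelation can be computed directly from the definitions; a short numpy sketch (the function name is illustrative):

import numpy as np

def sample_acf(x, max_lag):
    # Sample autocorrelation rho_hat(k) = gamma_hat(k) / gamma_hat(0) for k = 0..max_lag
    x = np.asarray(x, dtype=float)
    n = len(x)
    x_c = x - x.mean()
    gamma0 = np.sum(x_c ** 2) / n  # sample autocovariance at lag 0
    return np.array([np.sum(x_c[:n - k] * x_c[k:]) / n / gamma0 for k in range(max_lag + 1)])

print(sample_acf(df['Temperature'].values, max_lag=3))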

Autocovariance of White Noise

For a white noise process:

$$\gamma(k) = \begin{cases} \sigma^2 = 1 & \text{if } k = 0 \\ 0 & \text{if } k \neq 0 \end{cases}$$

Explanation: White noise has no autocorrelation at any non-zero lag, making it a typical stationary process.

Stationarization Methods for Non-Stationary Series

Differencing

First-Order Differencing

$$\nabla X_t = X_t - X_{t-1}$$

Second-Order Differencing

$$\nabla^2 X_t = \nabla(\nabla X_t) = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) = X_t - 2X_{t-1} + X_{t-2}$$

Seasonal Differencing

For seasonal series with period $s$:

$$\nabla_s X_t = X_t - X_{t-s}$$

Application Principle: Differencing order should generally not exceed 2, as over-differencing increases variance and reduces interpretability.

Differencing Implementation

# First-order differencing
df['Temperature_Diff'] = df['Temperature'].diff()

# Visualize differencing results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

# Original series
ax1.plot(df.index, df['Temperature'], marker='o', color='red', linewidth=2)
ax1.set_title('Original Temperature Series')
ax1.set_ylabel('Temperature (°C)')
ax1.grid(alpha=0.3)

# Differenced series
ax2.plot(df.index[1:], df['Temperature_Diff'][1:], marker='s', color='blue', linewidth=2)
ax2.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax2.set_title('First Difference Series (ΔTemperature = Temperature_t - Temperature_{t-1})')
ax2.set_ylabel('Temperature Difference (°C)')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Differencing statistics
diff_stats = f"Mean: {df['Temperature_Diff'].mean():.2f}°C\n"
diff_stats += f"Std Dev: {df['Temperature_Diff'].std():.2f}°C\n"
diff_stats += f"Min: {df['Temperature_Diff'].min():.2f}°C\n"
diff_stats += f"Max: {df['Temperature_Diff'].max():.2f}°C"
print(diff_stats)

Differencing Calculation Example

print("Temperature Difference Calculation:")
print("=" * 40)
for i in range(1, len(df)):
date_str = df.index[i].strftime('%Y-%m-%d')
temp_diff = df['Temperature'].iloc[i] - df['Temperature'].iloc[i-1]
print(f"Δ({date_str}) = {df['Temperature'].iloc[i]:.1f} - {df['Temperature'].iloc[i-1]:.1f} = {temp_diff:.1f}°C")

Transformation Methods

Logarithmic Transformation

$$Y_t = \log(X_t)$$

Applicable for exponential trends and increasing variance over time.

Box-Cox Transformation

$$Y_t = \begin{cases} \frac{X_t^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(X_t) & \text{if } \lambda = 0 \end{cases}$$

The parameter $\lambda$ is optimized to handle non-stationarity and heteroscedasticity.
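A minimal sketch with scipy, which estimates $\lambda$ by maximum likelihood; it is applied here to the strictly positive Sales series from the example df:

from scipy import stats
import numpy as np

# Box-Cox requires strictly positive data; scipy returns the transformed series
# and the maximum-likelihood estimate of lambda
sales_boxcox, lam = stats.boxcox(df['Sales'].values)
print(f"Estimated lambda: {lam:.3f}")

# Log transformation corresponds to the special case lambda = 0
sales_log = np.log(df['Sales'].values)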

Theoretical Framework for Data Preprocessing

Data Quality Dimensions

According to data quality management theory, data quality is evaluated through multidimensional metrics:

  1. Accuracy: Degree of consistency between data and the real entities it describes.

  2. Completeness: Extent to which required data is fully recorded, measured by the missing rate (see the sketch after this list):

    $$\text{Missing Rate} = \frac{\text{Number of Missing Values}}{\text{Total Data Points}} \times 100\%$$

  3. Consistency: Uniform representation of data across different sources.

  4. Timeliness: Proximity of data updates to the current time.

  5. Believability: Trustworthiness of data sources and values.

  6. Interpretability: Ease of understanding and using the data.
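The missing rate from item 2 can be computed directly in pandas; a one-line sketch on the example df:

# Percentage of missing values per column (mean of the boolean missing-value mask × 100)
missing_rate = df.isnull().mean() * 100
print(missing_rate)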

Classification of Missing Data Mechanisms

Types of Missing Mechanisms

  • Missing Completely at Random (MCAR): Missingness is independent of both observed and unobserved values.

  • Missing at Random (MAR): Missingness depends only on observed values, not unobserved ones.

  • Missing Not at Random (MNAR): Missingness depends on unobserved values.

Handling Method Selection Criteria

| Missing Mechanism | Recommended Handling Methods |
| --- | --- |
| MCAR | Direct deletion, mean imputation |
| MAR | Regression imputation, multiple imputation |
| MNAR | Model-based methods, selection models |

Missing Value Handling Implementation

# Detect missing values
print("Missing Values Analysis:")
print("=" * 30)
print(df.isnull().sum())

# Handling methods
# 1. Delete missing values
df_drop = df.dropna()

# 2. Impute missing values
df_fill_mean = df.fillna(df.mean(numeric_only=True))  # Mean imputation
df_fill_forward = df.ffill()  # Forward filling

# 3. Interpolation
df_interpolate = df.interpolate()

Theoretical Basis for Noise Data Handling

Noise Statistical Model

Assume the observed data $Y_t$ consists of a true signal $f(t)$ and noise $\varepsilon_t$:

$$Y_t = f(t) + \varepsilon_t$$

where $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$.

Mathematical Principles of Smoothing Techniques

Moving Average Method:

$$\hat{f}(t) = \frac{1}{2k+1} \sum_{i=-k}^{k} Y_{t+i}$$

Exponential Smoothing Method:

$$\hat{f}(t) = \alpha Y_t + (1-\alpha)\hat{f}(t-1)$$

where $\alpha \in (0,1)$ is the smoothing parameter.
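Both smoothers are available directly in pandas; a brief sketch on the Temperature series (window size and $\alpha$ are illustrative choices):

# Centered moving average with window 2k+1 = 3
df['Temperature_MA'] = df['Temperature'].rolling(window=3, center=True).mean()

# Simple exponential smoothing with alpha = 0.3 (the recursive form above)
df['Temperature_EWM'] = df['Temperature'].ewm(alpha=0.3, adjust=False).mean()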

Noise Data Handling Implementation

# Binning smoothing: replace each value by the mean (or median) of its bin
def binning_smooth(data, bin_size=3, method='mean'):
    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_data = data[i:i+bin_size]
        if method == 'mean':
            smoothed.extend([np.mean(bin_data)] * len(bin_data))
        elif method == 'median':
            smoothed.extend([np.median(bin_data)] * len(bin_data))
    return smoothed

# Apply binning smoothing
df['Temperature_Smooth'] = binning_smooth(df['Temperature'].values)

Data Integration and Correlation Analysis

Statistical Correlation Theory

Pearson Correlation Coefficient

Population correlation coefficient:

$$\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

Sample correlation coefficient:

$$r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$

Correlation Test

Test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2)$$

where the hypotheses are $H_0: \rho = 0$ versus $H_1: \rho \neq 0$.
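This test is available in scipy; a short sketch that returns both the sample correlation and the two-sided p-value (using two columns of the example df):

from scipy import stats

# Pearson correlation between temperature and sales, with significance test
r, p_value = stats.pearsonr(df['Temperature'], df['Sales'])
print(f"r = {r:.3f}, p-value = {p_value:.4f}")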

Correlation Coefficient Calculation Implementation

# Pearson correlation coefficient
corr_matrix = df.corr()
print("Correlation Matrix:")
print("=" * 30)
print(corr_matrix)

# Visualize correlation matrix
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.show()

Chi-Square Test and Contingency Table Analysis

For independence testing of two categorical variables:

Expected Frequency Calculation

$$E_{ij} = \frac{(\text{row}_i \text{ total}) \times (\text{column}_j \text{ total})}{\text{grand total}}$$

Chi-Square Statistic

$$\chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2\big((r-1)(c-1)\big)$$

Application Note: Use Fisher’s exact test when >20% of cells have expected frequencies <5.
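A minimal sketch with scipy on a hypothetical 2×2 contingency table of observed counts (the numbers are purely illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts for two categorical variables
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4g}")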

Feature Engineering and Dimensionality Reduction

Mathematical Basis of Principal Component Analysis (PCA)

Problem Formulation

Given a centered data matrix $X$ ($n \times p$), find the projection direction $w$ that maximizes the projected variance:

$$\max_{w}\; w^T \Sigma w \quad \text{s.t.} \quad w^T w = 1$$

where $\Sigma = \frac{1}{n} X^T X$ is the sample covariance matrix.

Eigenvalue Decomposition Solution

$$\Sigma v_i = \lambda_i v_i, \quad i = 1, 2, \ldots, p$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ are the eigenvalues and $v_i$ are the corresponding eigenvectors.

Variance Explained Ratio

The proportion of variance explained by the $k$-th principal component is:

$$\frac{\lambda_k}{\sum_{i=1}^p \lambda_i}$$
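A minimal sketch with scikit-learn, standardizing the variables first so they share a common scale (applied to the four numeric columns of the example df):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then project onto the first two principal components
X = StandardScaler().fit_transform(df[['Temperature', 'Humidity', 'Sales', 'Stock_Price']])
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Proportion of variance explained: lambda_k / sum(lambda_i)
print("Explained variance ratio:", pca.explained_variance_ratio_)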

Feature Selection Theory

Filter Methods

Evaluate feature importance using statistics (e.g., correlation coefficient, chi-square statistic, mutual information).

Wrapper Methods

Select optimal feature subsets via subset search and cross-validation. Common algorithms:

  • Forward Selection

  • Backward Elimination

  • Recursive Feature Elimination (RFE)

Embedded Methods

Automatically perform feature selection during model training, e.g.:

  • Lasso Regression: $L_1$ regularization induces sparsity (see the sketch after this list)

  • Decision Trees: Feature importance scoring
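A brief sketch of the embedded approach with a Lasso model; the choice of predictors and target below is purely illustrative:

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative setup: predict Sales from the other standardized variables
X = StandardScaler().fit_transform(df[['Temperature', 'Humidity', 'Stock_Price']])
y = df['Sales']

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

# The L1 penalty drives some coefficients to exactly zero; non-zero features are kept
print(dict(zip(['Temperature', 'Humidity', 'Stock_Price'], lasso.coef_)))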

Feature Engineering Implementation

Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Create degree-2 polynomial features (squares and interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
data_poly = poly.fit_transform(df[['Temperature', 'Humidity']])
feature_names = poly.get_feature_names_out(['Temperature', 'Humidity'])

# Combine features (reuse the original DatetimeIndex so rows align in the concat)
df_poly = pd.DataFrame(data_poly, columns=feature_names, index=df.index)
df_extended = pd.concat([df, df_poly], axis=1)

Statistical Feature Creation

# Create new statistical features
# Note: this example assumes a housing dataset (columns RM, LSTAT, PTRATIO, TAX, target),
# not the weather DataFrame used above.
df['RM_LSTAT'] = df['RM'] * df['LSTAT']      # Number of rooms × low-income population ratio
df['RM_PTRATIO'] = df['RM'] / df['PTRATIO']  # Number of rooms ÷ pupil-teacher ratio
df['RM_TAX'] = df['RM'] / df['TAX']          # Number of rooms ÷ property tax

# Calculate correlation with the target variable and keep strongly related features
target_corrs = df.corr()['target'].abs().sort_values(ascending=False)
selected_features = target_corrs[target_corrs >= 0.5].index.tolist()

Data Transformation and Normalization

Normalization Methods

Min-Max Normalization

$$v' = \frac{v - \min_A}{\max_A - \min_A} \times (\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

Z-Score Normalization

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Decimal Scaling Normalization

$$v' = \frac{v}{10^j}$$

where $j$ is the smallest integer such that $\max(|v'|) < 1$.

Normalization Implementation

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max normalization
minmax_scaler = MinMaxScaler()
df_minmax = minmax_scaler.fit_transform(df[['Temperature', 'Humidity']])

# Z-Score normalization
std_scaler = StandardScaler()
df_std = std_scaler.fit_transform(df[['Temperature', 'Humidity']])
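Decimal scaling has no built-in scikit-learn transformer; a tiny numpy sketch on the Sales column:

import numpy as np

# Decimal scaling: divide by 10^j, the smallest power of ten that brings |v'| below 1
max_abs = np.abs(df['Sales'].values).max()
j = int(np.floor(np.log10(max_abs))) + 1
df['Sales_DecScaled'] = df['Sales'] / (10 ** j)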

Data Discretization and Concept Hierarchy

Discretization Algorithm Classification

Unsupervised Discretization

  • Equal-Width Binning: Fixed interval width

    $$\text{bin width} = \frac{\max - \min}{N}$$

  • Equal-Frequency Binning: Each bin contains approximately equal samples

Supervised Discretization

  • Entropy-Based Discretization (e.g., ID3 algorithm)

  • ChiMerge Algorithm: Bottom-up merging based on chi-square statistic

Concept Hierarchy Generation Methods

Statistical-Based Methods

Automatically generate hierarchies based on distinct attribute values, with attributes having more distinct values at lower levels.

Domain Knowledge-Based Methods

Define hierarchies explicitly using domain expertise, e.g.:

  • Geographic hierarchy: Street < City < State < Country

  • Temporal hierarchy: Day < Month < Quarter < Year

Discretization Implementation

# Equal-width discretization
df['Temp_Binned'] = pd.cut(df['Temperature'], bins=5,
                           labels=['Low', 'Medium-Low', 'Medium', 'Medium-High', 'High'])

# Equal-depth (equal-frequency) discretization
df['Humidity_Binned'] = pd.qcut(df['Humidity'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Cluster-based discretization (fixed seed for reproducibility)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df['Sales_Cluster'] = kmeans.fit_predict(df[['Sales']])

Summary

Key Points in Time Series Analysis

  1. Stationarity testing is a prerequisite for time series analysis.

  2. Non-stationary series can be transformed to stationary via methods like differencing.

  3. Autocorrelation analysis helps understand internal series structure.

Critical Steps in Data Preprocessing

  1. Data Cleaning: Handle missing values, noise, and outliers.

  2. Data Integration: Resolve consistency issues in multi-source data.

  3. Data Reduction: Improve efficiency through feature selection and data compression.

  4. Data Transformation: Enhance model performance via normalization and discretization.

Best Practice Recommendations

  • Always start analysis with data exploration and visualization.

  • Choose preprocessing methods based on data characteristics.

  • Iteratively optimize feature engineering strategies.

  • Validate preprocessing impact on final models.