# SDSC5001


Population and Sample

  • Population: Refers to the entire set of individuals from which we attempt to draw conclusions.

  • Sample: Refers to a subset observed from the population.

  • Relationship: Samples are used to infer characteristics of the population; the core of statistics and machine learning is to estimate or predict population parameters based on sample data.

For example, in a coin toss experiment, the population is all possible coin toss outcomes, while the sample is the 10,000 coin toss outcomes we actually observe.


Probability Basics

  • Experiment: Any action that produces an observable outcome.

  • Sample Space: The set of all possible outcomes of an experiment, denoted $S$.

  • Event: A subset of the sample space $S$.

Example: Rolling a Six-Sided Die

  • Experiment: Rolling the die.

  • Sample space: $S = \{1, 2, 3, 4, 5, 6\}$.

  • The event “getting an even number” is the subset $\{2, 4, 6\}$.


Set Operations

Given two events $A$ and $B$:

  • $A \cup B$: the union of $A$ and $B$.

  • $A \cap B$: the intersection of $A$ and $B$.

  • $A'$: the complement of $A$.

These operations form the basis for calculating probabilities of more complex events.

Set operations are often visualized with Venn diagrams.


Probability Calculation

For a sample space of equally likely outcomes, the probability of an event $A$ is:

$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}$$

Counting Techniques:

  • Product Rule: If experiment 1 has $m$ outcomes and experiment 2 has $n$ outcomes, the combined experiment has $m \times n$ outcomes.

    • Example: Rolling a die twice has $6 \times 6 = 36$ possible outcomes.
  • Permutation: The number of ways to select $k$ objects in order from $n$ distinct objects:

    $$P_n^k = \frac{n!}{(n-k)!}$$

    • Example: Permuting the letters {a, b, c} has $3! = 6$ arrangements: (abc), (acb), (bac), (bca), (cab), (cba).
  • Combination: The number of ways to select $k$ objects without regard to order from $n$ distinct objects:

    $$C_n^k = \binom{n}{k} = \frac{n!}{k!(n-k)!}$$

    • Example: Selecting 3 letters from {a, b, c, d, e} has $\binom{5}{3} = 10$ possible selections.
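
As a quick sanity check, these counts can be computed with the standard library's math module (a minimal sketch; math.perm and math.comb require Python 3.8+):

```python
from math import comb, factorial, perm

# Product rule: two die rolls
print(6 * 6)                     # 36

# Permutations of {a, b, c}: P(3, 3) = 3! = 6
print(perm(3, 3), factorial(3))  # 6 6

# Combinations: choosing 3 of 5 letters
print(comb(5, 3))                # 10
```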

Example: Probability that the second die shows a higher value than the first when rolling two fair dice

  • The sample space $S$ has 36 outcomes.

  • Event $E$: the second die value is greater, with 15 outcomes (e.g., (1,2), (1,3), …, (5,6)).

  • $P(E) = \frac{15}{36}$
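
The 15/36 figure can be verified by brute-force enumeration of the 36 equally likely outcomes (a minimal sketch):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (first, second) pairs
favorable = [(a, b) for a, b in outcomes if b > a]
print(len(favorable), len(outcomes))              # 15 36
print(len(favorable) / len(outcomes))             # 0.4166... = 15/36
```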


Probability Axioms

  • Complement Rule: $P(A') = 1 - P(A)$

  • Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

    • If $A$ and $B$ are mutually exclusive ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$.
  • Extension to three events:

    $$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$

Example: Probability of selecting at least one psychologist in a meeting

  • There are 30 psychiatrists and 24 psychologists, totaling 54 people; 3 are randomly selected.

  • Define event $A$: at least one psychologist is selected.

  • Use the complement $A'$: no psychologists are selected (i.e., all three are psychiatrists).

  • $P(A) = 1 - P(A') = 1 - \frac{\binom{30}{3}}{\binom{54}{3}} = 1 - \frac{30 \times 29 \times 28}{54 \times 53 \times 52} \approx 0.84$
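
The complement-rule calculation is easy to reproduce with math.comb (a minimal check of the 0.84 figure):

```python
from math import comb

p_none = comb(30, 3) / comb(54, 3)   # P(A'): all three selected are psychiatrists
p_at_least_one = 1 - p_none          # complement rule
print(round(p_at_least_one, 4))      # ~0.8363
```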

Birthday Paradox

  • Probability that at least two people share a birthday in a group of $n$ people:

    $$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}$$

    • When $n = 23$, the probability exceeds 50%.
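
Evaluating the formula for increasing $n$ shows where the probability first crosses 50% (a minimal sketch):

```python
def p_shared_birthday(n: int) -> float:
    """Probability that at least two of n people share a birthday (365-day year)."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))  # ~0.476
print(round(p_shared_birthday(23), 3))  # ~0.507, the first n exceeding 50%
```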

Conditional Probability and Bayes’ Theorem

  • Conditional Probability: Probability of $A$ given that $B$ has occurred:

    $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

  • Independence: $A$ and $B$ are independent if $P(A \mid B) = P(A)$, or equivalently $P(A \cap B) = P(A)P(B)$.

  • Bayes’ Theorem:

    $$P(A_i \mid B) = \frac{P(A_i) P(B \mid A_i)}{\sum_{k=1}^K P(A_k) P(B \mid A_k)}$$

    • $P(A_i \mid B)$: posterior probability
    • $P(B \mid A_i)$: likelihood
    • $P(A_i)$: prior probability
    • $P(B)$: marginal likelihood (evidence)

Example: Drug Testing

  • Assume true positive rate $P(+ \mid \text{User}) = 0.99$, true negative rate $P(- \mid \text{Non-user}) = 0.99$, and user prevalence $P(\text{User}) = 0.005$.

  • Find $P(\text{User} \mid +)$:

    $$P(\text{User} \mid +) = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995} \approx 0.332$$

    This means that even with a positive test, there is only about a 33.2% probability of being a true user.
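
Plugging the stated rates into Bayes' theorem numerically (a minimal sketch):

```python
p_pos_given_user = 0.99      # true positive rate
p_neg_given_nonuser = 0.99   # true negative rate
p_user = 0.005               # prevalence

# Evidence P(+) via the law of total probability, then the posterior.
p_pos = p_pos_given_user * p_user + (1 - p_neg_given_nonuser) * (1 - p_user)
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(round(p_user_given_pos, 3))  # ~0.332
```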

Monty Hall Problem

  • Three doors: one has a car, two have goats. After the host opens a door with a goat, should you switch?

  • Calculation using Bayes’ theorem:

    • Assume the initial choice is door 1 and the host opens door 3 (revealing a goat). Let $B$ denote this event and $A_i$ the event that the car is behind door $i$.
    • $P(A_1 \mid B) = \frac{1/3 \times 1/2}{1/2} = \frac{1}{3}$

    • $P(A_2 \mid B) = \frac{1/3 \times 1}{1/2} = \frac{2}{3}$

    Thus, switching doors gives a win probability of 2/3, while staying gives 1/3, so switching is optimal.
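
A short simulation supports the 2/3 result (a sketch; it assumes the host always opens a goat door different from the contestant's pick):

```python
import random

def monty_hall(trials: int = 100_000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a goat door that is not the contestant's pick.
        host = random.choice([d for d in range(3) if d != pick and d != car])
        # Switching means taking the one remaining closed door.
        switch = next(d for d in range(3) if d != pick and d != host)
        stay_wins += (pick == car)
        switch_wins += (switch == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.333, 0.667)
```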


Random Variables

  • Random Variable: A variable whose numerical value is determined by the outcome of a random experiment; informally, any characteristic whose value may vary across individuals.

  • Descriptive Statistics:

    • Numerical: Mean, median, trimmed mean, variance, standard deviation.
    • Graphical: Histograms, pie charts, box plots.

Discrete Random Variables

  • Have a finite or countably infinite number of possible outcomes.

  • Probability Mass Function (PMF): $p(x) = P(X = x)$

  • Cumulative Distribution Function (CDF): $F(x) = P(X \leq x)$

  • Expectation: $E(X) = \sum_x x\, p(x)$

  • Variance: $\operatorname{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$

Common Discrete Distributions:

  • Bernoulli: $X \sim \operatorname{Bern}(p)$, $p(x) = p^x (1-p)^{1-x}$ for $x = 0, 1$

  • Binomial: $X \sim \operatorname{Bin}(n, p)$, $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$

  • Poisson: $X \sim \operatorname{Poi}(\lambda)$, $p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$ for $x = 0, 1, \ldots$

    • Example: A call center averages 5 calls per hour ($\lambda = 5$); the probability of exactly 3 calls is $p(3) = \frac{5^3}{3!} e^{-5}$
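
The Poisson example can be evaluated directly from the PMF; scipy.stats, if available, gives the same value (a minimal sketch):

```python
from math import exp, factorial

lam, x = 5, 3
pmf_manual = lam**x / factorial(x) * exp(-lam)
print(round(pmf_manual, 4))   # ~0.1404

# Optional cross-check, assuming scipy is installed:
# from scipy.stats import poisson
# print(poisson.pmf(3, mu=5))  # same value
```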

Continuous Random Variables

  • Possible outcomes form an interval on the real number line.

  • Probability Density Function (PDF): $f(x) = \lim_{h \to 0} \frac{P(x \leq X \leq x+h)}{h}$

  • Cumulative Distribution Function (CDF): $F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt$

  • Expectation: $E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$

  • Variance: $\operatorname{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$

Common Continuous Distributions:

  • Uniform: $X \sim \operatorname{unif}(a, b)$, $f(x) = \frac{1}{b-a}$ for $x \in [a, b]$

  • Exponential: $X \sim \exp(\lambda)$, $f(x) = \lambda e^{-\lambda x}$ for $x > 0$

  • Normal: $X \sim N(\mu, \sigma^2)$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
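
As an illustration, the normal PDF and CDF can be evaluated numerically; the sketch below assumes numpy and scipy are available:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.array([-1.96, 0.0, 1.96])
print(norm.pdf(x, loc=mu, scale=sigma))   # density values at the three points
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.95: mass within ±1.96 standard deviations
```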


Joint Distributions

For random variables $X$ and $Y$ (continuous or discrete):

  • Joint PDF/PMF: $f(x, y)$

  • Marginal PDF: $f_X(x) = \int f(x, y)\, dy$ (continuous), with the integral replaced by a sum in the discrete case

  • Independence: $X$ and $Y$ are independent if $f(x, y) = f_X(x) f_Y(y)$.

  • Conditional PDF: $f_{Y \mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

  • Expectation: $E[h(X, Y)] = \int\!\!\int h(x, y) f(x, y)\, dy\, dx$

  • Covariance: $\operatorname{cov}(X, Y) = E(XY) - E(X)E(Y)$

  • Correlation: $\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$
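
Sample covariance and correlation for paired data can be computed with numpy (a minimal sketch using made-up, dependent data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)   # y is positively related to x by construction

print(np.cov(x, y)[0, 1])        # sample covariance of X and Y (~2)
print(np.corrcoef(x, y)[0, 1])   # sample correlation, close to 2/sqrt(5) ≈ 0.89
```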


Statistics and Distributions

  • Statistic: A function of the sample data; because the data are random, a statistic is itself a random variable.

  • Simple Random Sample: If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.), they form a simple random sample.

  • Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$

    • If $X_i \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
  • Central Limit Theorem (CLT): If the $X_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then:

    $$\frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0,1)$$

    This means that even if the original distribution is non-normal, the sample mean is approximately normal for large samples (see the simulation sketch below).
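
A simulation of sample means from a skewed (exponential) distribution illustrates the CLT (a minimal sketch, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 10_000
# Exponential(1) is skewed, with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)

# Standardized sample means should be approximately N(0, 1).
z = np.sqrt(n) * (means - 1.0) / 1.0
print(z.mean(), z.std())          # near 0 and 1
print(np.mean(np.abs(z) < 1.96))  # near 0.95
```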

General Results:

  • If $X_i \sim N(\mu_i, \sigma_i^2)$ and independent, then $Y = \sum a_i X_i \sim N\!\left(\sum a_i \mu_i, \sum a_i^2 \sigma_i^2\right)$

  • Linearity of expectation and variance: $E(Y) = \sum a_i \mu_i$ and $\operatorname{Var}(Y) = \sum a_i^2 \sigma_i^2$ when the $X_i$ are independent; otherwise covariance terms must be added to the variance.


Statistical Inference

Inferring population truths based on sample data:

  • Estimation: Finding estimates for unknown parameters.

    • Point estimation: e.g., $\hat{\mu} = 2.5$
    • Confidence interval (CI) estimation: e.g., a 95% CI for $\mu$ is (2.0, 3.0)
  • Hypothesis Testing: Making decisions based on specific hypotheses (e.g., $\mu \leq 2$ vs. $\mu > 2$)

Point Estimation

  • Point Estimator: A statistic used to estimate a parameter $\theta$.

  • Example: Estimating the population mean from sample data.

    • Sample mean: $\bar{x} = \frac{1}{n} \sum x_i$
    • Sample median: the middle value after sorting
    • Trimmed mean: the mean after removing extreme values

    Question: Which estimator is closer to the population mean? It depends on the underlying distribution.

Unbiased Estimation

  • Unbiased Estimator: $\hat{\theta}$ is unbiased for $\theta$ if $E(\hat{\theta}) = \theta$.

  • Bias: $E(\hat{\theta}) - \theta$

  • Examples:

    • The sample mean $\bar{X}$ is unbiased for $\mu$.
    • The sample variance $\hat{\sigma}^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$ is unbiased for $\sigma^2$ (Bessel’s correction).

      The degrees of freedom are $n-1$ because the sample mean is used in place of $\mu$, leaving only $n-1$ freely varying deviations.

Another Example: If $X_i \sim \operatorname{unif}(0, \theta)$ and $\hat{\theta} = \max\{X_i\}$, then $E[\hat{\theta}] = \frac{n}{n+1} \theta$, so $\hat{\theta}$ is biased.

MVUE (Minimum Variance Unbiased Estimator)

  • Among all unbiased estimators, choose the one with minimum variance.

  • Example: For $X_i \sim \operatorname{unif}(0, \theta)$, both $\hat{\theta}_1 = \frac{n+1}{n} \max\{X_i\}$ and $\hat{\theta}_2 = 2\bar{X}$ are unbiased, but $\hat{\theta}_1$ has the smaller variance.

  • If $X_i \sim N(\mu, \sigma^2)$, then $\bar{X}$ is the MVUE for $\mu$.
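
A simulation for the uniform case illustrates both points: the raw maximum is biased (as in the previous section), while the corrected maximum is unbiased with a much smaller variance than $2\bar{X}$ (a minimal sketch, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 10.0, 20, 20_000
samples = rng.uniform(0, theta, size=(reps, n))
m = samples.max(axis=1)

theta_hat = m                       # biased: E = n/(n+1) * theta
theta_1 = (n + 1) / n * m           # bias-corrected maximum: unbiased
theta_2 = 2 * samples.mean(axis=1)  # moment-style estimator: unbiased

print(theta_hat.mean())               # ~ 20/21 * 10 ≈ 9.52
print(theta_1.mean(), theta_1.var())  # mean ~ 10, small variance (~0.23)
print(theta_2.mean(), theta_2.var())  # mean ~ 10, larger variance (~1.67)
```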

Method of Moments (MM) Estimation

  • Philosophy: Sample moments should resemble population moments.

  • $k$-th Sample Moment: $\frac{1}{n} \sum X_i^k$

  • $k$-th Population Moment: $E(X^k)$

  • Solve for the parameters by equating sample moments to population moments.

Example: Estimating $\mu$ and $\sigma^2$ for a normal distribution

  • First moment: $E(X) = \mu$, with sample moment $\bar{X}$, so $\hat{\mu}_{MM} = \bar{X}$

  • Second moment: $E(X^2) = \mu^2 + \sigma^2$, with sample moment $\frac{1}{n} \sum X_i^2$, so $\hat{\sigma}_{MM}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2$
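
The method-of-moments estimates for a normal sample are straightforward to compute (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # true mu = 3, sigma^2 = 4

mu_mm = x.mean()                               # first sample moment
sigma2_mm = np.mean(x**2) - x.mean()**2        # second moment minus first moment squared
print(mu_mm, sigma2_mm)                        # close to 3 and 4
```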

Maximum Likelihood Estimation (MLE)

  • Philosophy: Given observed data, choose parameter values that maximize the probability of observing the data.

  • Likelihood Function: $L(\theta) = f(x_1, \ldots, x_n; \theta)$, treated as a function of $\theta$.

  • The MLE $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$.

Example: $X_i \sim N(\mu, \sigma^2)$

  • Likelihood function: $L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left[ -\frac{1}{2\sigma^2} \sum (x_i - \mu)^2 \right]$

  • MLE: $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$ (note the $\frac{1}{n}$ divisor rather than $\frac{1}{n-1}$, so the MLE of $\sigma^2$ is biased)
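
For the normal case, the closed-form MLEs are just the sample mean and the 1/n sample variance (a minimal numerical check):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # true mu = 3, sigma^2 = 4

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle)**2)          # divides by n, not n-1
print(mu_mle, sigma2_mle)                      # close to 3 and 4
print(x.var(ddof=0), x.var(ddof=1))            # MLE variance vs. unbiased sample variance
```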

Another Example: $X_i \sim \operatorname{unif}(0, \theta)$

  • Likelihood function: $L(\theta) = \left( \frac{1}{\theta} \right)^n$ if all $x_i \in [0, \theta]$, and 0 otherwise.

  • To maximize, $\theta$ should be as small as possible but no smaller than any observed $x_i$, so $\hat{\theta}_{MLE} = \max\{X_i\}$.


Confidence Intervals (CI)

  • Confidence Interval: An interval estimate based on a statistic, containing the unknown population parameter with a predetermined probability.

  • CI = Point estimate ± Margin of error

Example: For a normal distribution with known $\sigma^2$, the $100(1-\alpha)\%$ CI for $\mu$ is:

$$\left( \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$

  • If $\sigma$ is unknown, use the sample standard deviation $s$ and the t-distribution:

$$\left( \bar{X} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}},\ \bar{X} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \right)$$

Interpretation: A CI is a random interval; under repeated sampling, approximately $100(1-\alpha)\%$ of the computed intervals will cover the true $\mu$.
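
A t-based 95% CI for the mean can be computed as below (a sketch assuming numpy and scipy are available, with simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=2.5, scale=1.0, size=30)

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)          # t_{alpha/2, n-1} with alpha = 0.05
half_width = t_crit * s / np.sqrt(n)
print((xbar - half_width, xbar + half_width))  # 95% CI for mu
```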

General CI Construction:

  • If an estimator $\hat{\theta}$ is approximately normal and unbiased with known variance $\sigma_{\hat{\theta}}^2$, then an approximate $100(1-\alpha)\%$ CI is:

$$\left( \hat{\theta} - z_{\alpha/2}\, \sigma_{\hat{\theta}},\ \hat{\theta} + z_{\alpha/2}\, \sigma_{\hat{\theta}} \right)$$


Hypothesis Testing

  • Hypothesis: A claim about a characteristic of a probability distribution.

  • Hypothesis Testing: Using data to decide between two competing hypotheses.

    • Null Hypothesis ($H_0$): The initial assumption to be tested.
    • Alternative Hypothesis ($H_a$): A claim contradicting $H_0$.

Test Types:

  • For $H_0: \theta = \theta_0$, the alternative $H_a$ can be:

    • $\theta > \theta_0$ (right-tailed)
    • $\theta < \theta_0$ (left-tailed)
    • $\theta \neq \theta_0$ (two-tailed)

Testing Procedure:

  • Test Statistic: A function based on sample data.

  • Rejection Region: The set of test statistic values that lead to rejecting $H_0$.

Error Types:

  • Type I Error: Rejecting a true $H_0$; its probability is $\alpha$ (the significance level).

  • Type II Error: Failing to reject a false $H_0$; its probability is $\beta$.

  • Power: $1 - \beta$, the probability of rejecting a false $H_0$.

p-value:

  • The smallest significance level at which $H_0$ would be rejected given the data.

  • If the p-value is less than $\alpha$, reject $H_0$; otherwise, do not reject.

  • The p-value is the probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed.

Typically $\alpha = 0.05$, but p-values should be used with caution, as they are sensitive to sample size.
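
As an end-to-end illustration, a one-sample t-test of $H_0: \mu = 2$ against a two-sided alternative (a sketch assuming numpy and scipy are available, with simulated data whose true mean is 2.4):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=2.4, scale=1.0, size=40)

t_stat, p_value = stats.ttest_1samp(x, popmean=2.0)  # two-sided by default
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject H0 at alpha = 0.05")
else:
    print("Fail to reject H0")
```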