# SDSC5001


Population and Sample

  • Population: Refers to the entire set of individuals from which we attempt to draw conclusions.

  • Sample: Refers to a subset observed from the population.

  • Relationship: Samples are used to infer characteristics of the population; the core of statistics and machine learning is to estimate or predict population parameters based on sample data.

For example, in a coin toss experiment, the population is all possible coin toss outcomes, while the sample is the 10,000 coin toss outcomes we actually observe.


Probability Basics

  • Experiment: Any action that produces an observable outcome.

  • Sample Space: The set of all possible outcomes of an experiment, denoted $S$.

  • Event: A subset of the sample space $S$.

Example: Rolling a Six-Sided Die

  • Experiment: Rolling the die.

  • Sample space: $S = \{1, 2, 3, 4, 5, 6\}$.

  • The event “getting an even number” is the subset $\{2, 4, 6\}$.


Set Operations

Given two events $A$ and $B$:

  • $A \cup B$: the union of $A$ and $B$.

  • $A \cap B$: the intersection of $A$ and $B$.

  • $A'$: the complement of $A$.

These operations form the basis for calculating probabilities of more complex events.

Set operations are often visualized with Venn diagrams.


Probability Calculation

For a sample space of equally likely outcomes, the probability of an event $A$ is:

$$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}$$

Counting Techniques:

  • Product Rule: If experiment 1 has $m$ outcomes and experiment 2 has $n$ outcomes, the combined experiment has $m \times n$ outcomes.

    • Example: Rolling a die twice has $6 \times 6 = 36$ possible outcomes.
  • Permutation: The number of ways to select $k$ objects in order from $n$ distinct objects:

    $$P_n^k = \frac{n!}{(n-k)!}$$

    • Example: Permuting the letters {a, b, c} has $3! = 6$ arrangements: (abc), (acb), (bac), (bca), (cab), (cba).
  • Combination: The number of ways to select $k$ objects without regard to order from $n$ distinct objects:

    $$C_n^k = \binom{n}{k} = \frac{n!}{k!(n-k)!}$$

    • Example: Selecting 3 letters from {a, b, c, d, e} has $\binom{5}{3} = 10$ possible selections.
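
As a quick sanity check, these counts can be computed with the standard library's math module (a minimal sketch; math.perm and math.comb require Python 3.8+):

```python
from math import comb, factorial, perm

# Product rule: two die rolls
print(6 * 6)                     # 36

# Permutations of {a, b, c}: P(3, 3) = 3! = 6
print(perm(3, 3), factorial(3))  # 6 6

# Combinations: choosing 3 of 5 letters
print(comb(5, 3))                # 10
```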

Example: Probability that the second die shows a higher value than the first when rolling two fair dice

  • The sample space $S$ has 36 outcomes.

  • Event $E$: the second die value is greater, with 15 outcomes (e.g., (1,2), (1,3), …, (5,6)).

  • $P(E) = \frac{15}{36}$
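
The 15/36 figure can be verified by brute-force enumeration of the 36 equally likely outcomes (a minimal sketch):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (first, second) pairs
favorable = [(a, b) for a, b in outcomes if b > a]
print(len(favorable), len(outcomes))              # 15 36
print(len(favorable) / len(outcomes))             # 0.4166... = 15/36
```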


Probability Axioms

  • Complement Rule: $P(A') = 1 - P(A)$

  • Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

    • If $A$ and $B$ are mutually exclusive ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$.
  • Extension to three events:

    $$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$

Example: Probability of selecting at least one psychologist in a meeting

  • There are 30 psychiatrists and 24 psychologists, totaling 54 people; 3 are randomly selected.

  • Define event $A$: at least one psychologist is selected.

  • Use the complement $A'$: no psychologists are selected (i.e., all three are psychiatrists).

  • $P(A) = 1 - P(A') = 1 - \frac{\binom{30}{3}}{\binom{54}{3}} = 1 - \frac{30 \times 29 \times 28}{54 \times 53 \times 52} \approx 0.84$
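
The complement-rule calculation is easy to reproduce with math.comb (a minimal check of the 0.84 figure):

```python
from math import comb

p_none = comb(30, 3) / comb(54, 3)   # P(A'): all three selected are psychiatrists
p_at_least_one = 1 - p_none          # complement rule
print(round(p_at_least_one, 4))      # ~0.8363
```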

Birthday Paradox

  • Probability that at least two people share a birthday in a group of $n$ people:

    $$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}$$

    • When $n = 23$, the probability exceeds 50%.
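
Evaluating the formula for increasing $n$ shows where the probability first crosses 50% (a minimal sketch):

```python
def p_shared_birthday(n: int) -> float:
    """Probability that at least two of n people share a birthday (365-day year)."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))  # ~0.476
print(round(p_shared_birthday(23), 3))  # ~0.507, the first n exceeding 50%
```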

Conditional Probability and Bayes’ Theorem

  • Conditional Probability: Probability of $A$ given that $B$ has occurred:

    $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

  • Independence: $A$ and $B$ are independent if $P(A \mid B) = P(A)$, or equivalently $P(A \cap B) = P(A)P(B)$.

  • Bayes’ Theorem:

    $$P(A_i \mid B) = \frac{P(A_i) P(B \mid A_i)}{\sum_{k=1}^K P(A_k) P(B \mid A_k)}$$

    • $P(A_i \mid B)$: posterior probability
    • $P(B \mid A_i)$: likelihood
    • $P(A_i)$: prior probability
    • $P(B)$: marginal likelihood (evidence)

Example: Drug Testing

  • Assume true positive rate $P(+ \mid \text{User}) = 0.99$, true negative rate $P(- \mid \text{Non-user}) = 0.99$, and user prevalence $P(\text{User}) = 0.005$.

  • Find $P(\text{User} \mid +)$:

    $$P(\text{User} \mid +) = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995} \approx 0.332$$

    This means that even with a positive test, there is only about a 33.2% probability of being a true user.
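
Plugging the stated rates into Bayes' theorem numerically (a minimal sketch):

```python
p_pos_given_user = 0.99      # true positive rate
p_neg_given_nonuser = 0.99   # true negative rate
p_user = 0.005               # prevalence

# Evidence P(+) via the law of total probability, then the posterior.
p_pos = p_pos_given_user * p_user + (1 - p_neg_given_nonuser) * (1 - p_user)
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(round(p_user_given_pos, 3))  # ~0.332
```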

Monty Hall Problem

  • Three doors: one has a car, two have goats. After the host opens a door with a goat, should you switch?

  • Calculation using Bayes’ theorem:

    • Assume the initial choice is door 1 and the host opens door 3 (revealing a goat). Let $B$ denote this event and $A_i$ the event that the car is behind door $i$.
    • $P(A_1 \mid B) = \frac{1/3 \times 1/2}{1/2} = \frac{1}{3}$

    • $P(A_2 \mid B) = \frac{1/3 \times 1}{1/2} = \frac{2}{3}$

    Thus, switching doors gives a win probability of 2/3, while staying gives 1/3, so switching is optimal.
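
A short simulation supports the 2/3 result (a sketch; it assumes the host always opens a goat door different from the contestant's pick):

```python
import random

def monty_hall(trials: int = 100_000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a goat door that is not the contestant's pick.
        host = random.choice([d for d in range(3) if d != pick and d != car])
        # Switching means taking the one remaining closed door.
        switch = next(d for d in range(3) if d != pick and d != host)
        stay_wins += (pick == car)
        switch_wins += (switch == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.333, 0.667)
```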


Random Variables

  • Random Variable: A variable whose numerical value is determined by the outcome of a random experiment; informally, any characteristic whose value may vary across individuals.

  • Descriptive Statistics:

    • Numerical: Mean, median, trimmed mean, variance, standard deviation.
    • Graphical: Histograms, pie charts, box plots.

Discrete Random Variables

  • Have a finite or countably infinite number of possible outcomes.

  • Probability Mass Function (PMF): $p(x) = P(X = x)$

  • Cumulative Distribution Function (CDF): $F(x) = P(X \leq x)$

  • Expectation: $E(X) = \sum_x x\, p(x)$

  • Variance: $\operatorname{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$

Common Discrete Distributions:

  • Bernoulli: $X \sim \operatorname{Bern}(p)$, $p(x) = p^x (1-p)^{1-x}$ for $x = 0, 1$

  • Binomial: $X \sim \operatorname{Bin}(n, p)$, $p(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$

  • Poisson: $X \sim \operatorname{Poi}(\lambda)$, $p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$ for $x = 0, 1, \ldots$

    • Example: A call center averages 5 calls per hour ($\lambda = 5$); the probability of exactly 3 calls is $p(3) = \frac{5^3}{3!} e^{-5}$
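
The Poisson example can be evaluated directly from the PMF; scipy.stats, if available, gives the same value (a minimal sketch):

```python
from math import exp, factorial

lam, x = 5, 3
pmf_manual = lam**x / factorial(x) * exp(-lam)
print(round(pmf_manual, 4))   # ~0.1404

# Optional cross-check, assuming scipy is installed:
# from scipy.stats import poisson
# print(poisson.pmf(3, mu=5))  # same value
```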

Continuous Random Variables

  • Possible outcomes form an interval on the real number line.

  • Probability Density Function (PDF): $f(x) = \lim_{h \to 0} \frac{P(x \leq X \leq x+h)}{h}$

  • Cumulative Distribution Function (CDF): $F(x) = P(X \leq x) = \int_{-\infty}^x f(t)\, dt$

  • Expectation: $E(X) = \int_{-\infty}^{\infty} x f(x)\, dx$

  • Variance: $\operatorname{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$

Common Continuous Distributions:

  • Uniform: $X \sim \operatorname{unif}(a, b)$, $f(x) = \frac{1}{b-a}$ for $x \in [a, b]$

  • Exponential: $X \sim \exp(\lambda)$, $f(x) = \lambda e^{-\lambda x}$ for $x > 0$

  • Normal: $X \sim N(\mu, \sigma^2)$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $-\infty < x < \infty$
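
As an illustration, the normal PDF and CDF can be evaluated numerically; the sketch below assumes numpy and scipy are available:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.array([-1.96, 0.0, 1.96])
print(norm.pdf(x, loc=mu, scale=sigma))   # density values at the three points
print(norm.cdf(1.96) - norm.cdf(-1.96))   # ~0.95: mass within ±1.96 standard deviations
```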


Joint Distributions

For random variables $X$ and $Y$ (continuous or discrete):

  • Joint PDF/PMF: $f(x, y)$

  • Marginal PDF: $f_X(x) = \int f(x, y)\, dy$ (continuous), with the integral replaced by a sum in the discrete case

  • Independence: $X$ and $Y$ are independent if $f(x, y) = f_X(x) f_Y(y)$.

  • Conditional PDF: $f_{Y \mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$

  • Expectation: $E[h(X, Y)] = \int\!\!\int h(x, y) f(x, y)\, dy\, dx$

  • Covariance: $\operatorname{cov}(X, Y) = E(XY) - E(X)E(Y)$

  • Correlation: $\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$
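
Sample covariance and correlation for paired data can be computed with numpy (a minimal sketch using made-up, dependent data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)   # y is positively related to x by construction

print(np.cov(x, y)[0, 1])        # sample covariance of X and Y (~2)
print(np.corrcoef(x, y)[0, 1])   # sample correlation, close to 2/sqrt(5) ≈ 0.89
```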


Statistics and Distributions

  • Statistic: A function of the sample data; because the data are random, a statistic is itself a random variable.

  • Simple Random Sample: If $X_1, X_2, \ldots, X_n$ are independent and identically distributed (i.i.d.), they form a simple random sample.

  • Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$

    • If $X_i \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$
  • Central Limit Theorem (CLT): If the $X_i$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then:

    $$\frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0,1)$$

    This means that even if the original distribution is non-normal, the sample mean is approximately normal for large samples (see the simulation sketch below).
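
A simulation of sample means from a skewed (exponential) distribution illustrates the CLT (a minimal sketch, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 50, 10_000
# Exponential(1) is skewed, with mean 1 and variance 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)

# Standardized sample means should be approximately N(0, 1).
z = np.sqrt(n) * (means - 1.0) / 1.0
print(z.mean(), z.std())          # near 0 and 1
print(np.mean(np.abs(z) < 1.96))  # near 0.95
```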

General Results:

  • If $X_i \sim N(\mu_i, \sigma_i^2)$ and independent, then $Y = \sum a_i X_i \sim N\!\left(\sum a_i \mu_i, \sum a_i^2 \sigma_i^2\right)$

  • Linearity of expectation and variance: $E(Y) = \sum a_i \mu_i$ and $\operatorname{Var}(Y) = \sum a_i^2 \sigma_i^2$ when the $X_i$ are independent; otherwise covariance terms must be added to the variance.


Statistical Inference

Inferring population truths based on sample data:

  • Estimation: Finding estimates for unknown parameters.

    • Point estimation: e.g., $\hat{\mu} = 2.5$
    • Confidence interval (CI) estimation: e.g., a 95% CI for $\mu$ is (2.0, 3.0)
  • Hypothesis Testing: Making decisions based on specific hypotheses (e.g., $\mu \leq 2$ vs. $\mu > 2$)

Point Estimation

  • Point Estimator: A statistic used to estimate a parameter $\theta$.

  • Example: Estimating the population mean from sample data.

    • Sample mean: $\bar{x} = \frac{1}{n} \sum x_i$
    • Sample median: the middle value after sorting
    • Trimmed mean: the mean after removing extreme values

    Question: Which estimator is closer to the population mean? It depends on the underlying distribution.

Unbiased Estimation

  • Unbiased Estimator: $\hat{\theta}$ is unbiased for $\theta$ if $E(\hat{\theta}) = \theta$.

  • Bias: $E(\hat{\theta}) - \theta$

  • Examples:

    • The sample mean $\bar{X}$ is unbiased for $\mu$.
    • The sample variance $\hat{\sigma}^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$ is unbiased for $\sigma^2$ (Bessel’s correction).

      The degrees of freedom are $n-1$ because the sample mean is used in place of $\mu$, leaving only $n-1$ freely varying deviations.

Another Example: If $X_i \sim \operatorname{unif}(0, \theta)$ and $\hat{\theta} = \max\{X_i\}$, then $E[\hat{\theta}] = \frac{n}{n+1} \theta$, so $\hat{\theta}$ is biased.

MVUE (Minimum Variance Unbiased Estimator)

  • Among all unbiased estimators, choose the one with minimum variance.

  • Example: For $X_i \sim \operatorname{unif}(0, \theta)$, both $\hat{\theta}_1 = \frac{n+1}{n} \max\{X_i\}$ and $\hat{\theta}_2 = 2\bar{X}$ are unbiased, but $\hat{\theta}_1$ has the smaller variance.

  • If $X_i \sim N(\mu, \sigma^2)$, then $\bar{X}$ is the MVUE for $\mu$.
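
A simulation for the uniform case illustrates both points: the raw maximum is biased (as in the previous section), while the corrected maximum is unbiased with a much smaller variance than $2\bar{X}$ (a minimal sketch, assuming numpy is available):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 10.0, 20, 20_000
samples = rng.uniform(0, theta, size=(reps, n))
m = samples.max(axis=1)

theta_hat = m                       # biased: E = n/(n+1) * theta
theta_1 = (n + 1) / n * m           # bias-corrected maximum: unbiased
theta_2 = 2 * samples.mean(axis=1)  # moment-style estimator: unbiased

print(theta_hat.mean())               # ~ 20/21 * 10 ≈ 9.52
print(theta_1.mean(), theta_1.var())  # mean ~ 10, small variance (~0.23)
print(theta_2.mean(), theta_2.var())  # mean ~ 10, larger variance (~1.67)
```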

Method of Moments (MM) Estimation

  • Philosophy: Sample moments should resemble population moments.

  • $k$-th Sample Moment: $\frac{1}{n} \sum X_i^k$

  • $k$-th Population Moment: $E(X^k)$

  • Solve for the parameters by equating sample moments to population moments.

Example: Estimating $\mu$ and $\sigma^2$ for a normal distribution

  • First moment: $E(X) = \mu$, with sample moment $\bar{X}$, so $\hat{\mu}_{MM} = \bar{X}$

  • Second moment: $E(X^2) = \mu^2 + \sigma^2$, with sample moment $\frac{1}{n} \sum X_i^2$, so $\hat{\sigma}_{MM}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2$
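
The method-of-moments estimates for a normal sample are straightforward to compute (a minimal sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # true mu = 3, sigma^2 = 4

mu_mm = x.mean()                               # first sample moment
sigma2_mm = np.mean(x**2) - x.mean()**2        # second moment minus first moment squared
print(mu_mm, sigma2_mm)                        # close to 3 and 4
```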

Maximum Likelihood Estimation (MLE)

  • Philosophy: Given observed data, choose parameter values that maximize the probability of observing the data.

  • Likelihood Function: $L(\theta) = f(x_1, \ldots, x_n; \theta)$, treated as a function of $\theta$.

  • The MLE $\hat{\theta}$ is the value of $\theta$ that maximizes $L(\theta)$.

Example: $X_i \sim N(\mu, \sigma^2)$

  • Likelihood function: $L(\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left[ -\frac{1}{2\sigma^2} \sum (x_i - \mu)^2 \right]$

  • MLE: $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$ (note the $\frac{1}{n}$ divisor rather than $\frac{1}{n-1}$, so the MLE of $\sigma^2$ is biased)
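
For the normal case, the closed-form MLEs are just the sample mean and the 1/n sample variance (a minimal numerical check):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=3.0, scale=2.0, size=500)   # true mu = 3, sigma^2 = 4

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle)**2)          # divides by n, not n-1
print(mu_mle, sigma2_mle)                      # close to 3 and 4
print(x.var(ddof=0), x.var(ddof=1))            # MLE variance vs. unbiased sample variance
```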

Another Example: $X_i \sim \operatorname{unif}(0, \theta)$

  • Likelihood function: $L(\theta) = \left( \frac{1}{\theta} \right)^n$ if all $x_i \in [0, \theta]$, and 0 otherwise.

  • To maximize, $\theta$ should be as small as possible but no smaller than any observed $x_i$, so $\hat{\theta}_{MLE} = \max\{X_i\}$.


Confidence Intervals (CI)

  • Confidence Interval: An interval estimate based on a statistic, containing the unknown population parameter with a predetermined probability.

  • CI = Point estimate ± Margin of error

Example: For a normal distribution with known $\sigma^2$, the $100(1-\alpha)\%$ CI for $\mu$ is:

$$\left( \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$$

  • If $\sigma$ is unknown, use the sample standard deviation $s$ and the t-distribution:

$$\left( \bar{X} - t_{\alpha/2, n-1} \frac{s}{\sqrt{n}},\ \bar{X} + t_{\alpha/2, n-1} \frac{s}{\sqrt{n}} \right)$$

Interpretation: A CI is a random interval; under repeated sampling, approximately $100(1-\alpha)\%$ of the computed intervals will cover the true $\mu$.
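
A t-based 95% CI for the mean can be computed as below (a sketch assuming numpy and scipy are available, with simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=2.5, scale=1.0, size=30)

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)          # t_{alpha/2, n-1} with alpha = 0.05
half_width = t_crit * s / np.sqrt(n)
print((xbar - half_width, xbar + half_width))  # 95% CI for mu
```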

General CI Construction:

  • If an estimator $\hat{\theta}$ is approximately normal and unbiased with known variance $\sigma_{\hat{\theta}}^2$, then an approximate $100(1-\alpha)\%$ CI is:

$$\left( \hat{\theta} - z_{\alpha/2}\, \sigma_{\hat{\theta}},\ \hat{\theta} + z_{\alpha/2}\, \sigma_{\hat{\theta}} \right)$$


Hypothesis Testing

  • Hypothesis: A claim about a characteristic of a probability distribution.

  • Hypothesis Testing: Using data to decide between two competing hypotheses.

    • Null Hypothesis ($H_0$): The initial assumption to be tested.
    • Alternative Hypothesis ($H_a$): A claim contradicting $H_0$.

Test Types:

  • For $H_0: \theta = \theta_0$, the alternative $H_a$ can be:

    • $\theta > \theta_0$ (right-tailed)
    • $\theta < \theta_0$ (left-tailed)
    • $\theta \neq \theta_0$ (two-tailed)

Testing Procedure:

  • Test Statistic: A function based on sample data.

  • Rejection Region: The set of test statistic values that lead to rejecting $H_0$.

Error Types:

  • Type I Error: Rejecting a true $H_0$; its probability is $\alpha$ (the significance level).

  • Type II Error: Failing to reject a false $H_0$; its probability is $\beta$.

  • Power: $1 - \beta$, the probability of rejecting a false $H_0$.

p-value:

  • The smallest significance level at which $H_0$ would be rejected given the data.

  • If the p-value is less than $\alpha$, reject $H_0$; otherwise, do not reject.

  • The p-value is the probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed.

Typically $\alpha = 0.05$, but p-values should be used with caution, as they are sensitive to sample size.
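
As an end-to-end illustration, a one-sample t-test of $H_0: \mu = 2$ against a two-sided alternative (a sketch assuming numpy and scipy are available, with simulated data whose true mean is 2.4):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=2.4, scale=1.0, size=40)

t_stat, p_value = stats.ttest_1samp(x, popmean=2.0)  # two-sided by default
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject H0 at alpha = 0.05")
else:
    print("Fail to reject H0")
```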