SDSC5001 Course 1-Review: Probability and Statistics
#sdsc5001
Population and Sample
- Population: the entire set of individuals from which we attempt to draw conclusions.
- Sample: a subset observed from the population.
- Relationship: samples are used to infer characteristics of the population; the core of statistics and machine learning is to estimate or predict population parameters from sample data.

For example, in a coin-toss experiment, the population is all possible coin-toss outcomes, while the sample is the 10,000 tosses we actually observe.
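A minimal simulation of this setup (the seed and variable names are my own): the population is the coin's true distribution, and the sample proportion over 10,000 tosses estimates the population parameter $P(\text{heads})$.

```python
import random

random.seed(0)
# Population: all possible tosses of a fair coin, so P(heads) = 0.5.
# Sample: the 10,000 simulated tosses we actually observe.
tosses = [random.random() < 0.5 for _ in range(10_000)]
p_hat = sum(tosses) / len(tosses)  # sample proportion estimates P(heads)
print(f"estimated P(heads) = {p_hat:.3f}")
```

The estimate is close to, but not exactly, 0.5; quantifying this gap between sample statistic and population parameter is the job of statistical inference.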
Probability Basics
- Experiment: any action that produces an observable outcome.
- Sample Space: the set of all possible outcomes of an experiment, denoted $S$.
- Event: a subset of the sample space $S$.
Example: Rolling a Six-Sided Die
- Experiment: rolling the die.
- Sample space: $S = \{1, 2, 3, 4, 5, 6\}$.
- The event "getting an even number" is the subset $A = \{2, 4, 6\}$.
Set Operations
Given two events $A$ and $B$:
- $A \cup B$: the union of $A$ and $B$ (at least one of them occurs).
- $A \cap B$: the intersection of $A$ and $B$ (both occur).
- $A'$: the complement of $A$ ($A$ does not occur).
These operations form the basis for calculating probabilities of more complex events.
Probability Calculation
For equally likely outcomes, the probability of an event $A$ is defined as:

$$P(A) = \frac{N(A)}{N(S)} = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } S}$$
Counting Techniques:
- Product Rule: if experiment 1 has $n_1$ outcomes and experiment 2 has $n_2$ outcomes, the combined experiment has $n_1 n_2$ outcomes.
  - Example: rolling a die twice has $6 \times 6 = 36$ possible outcomes.
- Permutation: the number of ways to select $k$ objects in order from $n$ distinct objects: $P_{k,n} = \frac{n!}{(n-k)!}$
  - Example: permuting the letters {a, b, c} has $3! = 6$ ways: (abc), (acb), (bac), (bca), (cab), (cba).
- Combination: the number of ways to select $k$ objects without regard to order from $n$ distinct objects: $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
  - Example: selecting 3 from {a, b, c, d, e} has $\binom{5}{3} = 10$ ways.
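These counting rules are available directly in Python's standard library; a quick check of the examples above:

```python
import math

# Product rule: rolling a die twice gives 6 * 6 = 36 outcomes.
assert 6 * 6 == 36
# Permutations: ordered arrangements of all 3 letters of {a, b, c}: 3! = 6.
assert math.perm(3, 3) == 6
# Combinations: choosing 3 of 5 letters without regard to order: 10.
assert math.comb(5, 3) == 10
```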
Example: Probability that the second die shows a higher value than the first when rolling two fair dice
- The sample space has $6 \times 6 = 36$ outcomes.
- Event $A$: the second die value is greater, with 15 outcomes ((1,2), (1,3), …, (5,6)).
- $P(A) = \frac{15}{36} = \frac{5}{12} \approx 0.417$
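The count of 15 favorable outcomes can be verified by enumerating the sample space (a sketch; the names are my own):

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two die rolls.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
# Event A: the second die shows a strictly higher value than the first.
event = [(i, j) for (i, j) in outcomes if j > i]
p_A = Fraction(len(event), len(outcomes))
print(len(outcomes), len(event), p_A)  # 36 15 5/12
```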
Probability Axioms
- Complement Rule: $P(A') = 1 - P(A)$
- Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  - If $A$ and $B$ are mutually exclusive ($A \cap B = \emptyset$), then $P(A \cup B) = P(A) + P(B)$.
- Extension to Three Events: $P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$
Example: Probability of selecting at least one psychologist in a meeting
- There are 30 psychiatrists and 24 psychologists, totaling 54 people; 3 are randomly selected.
- Define event $A$: at least one psychologist is selected.
- Use the complement $A'$: no psychologists selected (i.e., all three are psychiatrists).
- $P(A) = 1 - P(A') = 1 - \frac{\binom{30}{3}}{\binom{54}{3}} = 1 - \frac{4060}{24804} \approx 0.836$
Birthday Paradox
- Probability that at least two people share a birthday in a group of $k$ people: $P = 1 - \frac{365 \times 364 \times \cdots \times (365 - k + 1)}{365^k}$
  - When $k \ge 23$, the probability exceeds 50%.
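The probability can be computed directly from the product form above (a sketch):

```python
def p_shared_birthday(k: int) -> float:
    """P(at least two of k people share a birthday), 365 equally likely days."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (365 - i) / 365  # i-th person avoids all earlier birthdays
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))  # still below 0.5
print(round(p_shared_birthday(23), 3))  # crosses 0.5 at k = 23
```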
Conditional Probability and Bayes’ Theorem
- Conditional Probability: the probability of $A$ given that $B$ has occurred: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$.
- Independence: $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$ or, equivalently, $P(A \mid B) = P(A)$.
- Bayes' Theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
  - $P(A \mid B)$: posterior probability
  - $P(B \mid A)$: likelihood
  - $P(A)$: prior probability
  - $P(B)$: marginal likelihood (evidence)
Example: Drug Testing
- Assume true positive rate $P(+ \mid U) = 0.99$, true negative rate $P(- \mid U') = 0.99$, and user prevalence $P(U) = 0.005$.
- Find $P(U \mid +)$:

$$P(U \mid +) = \frac{P(+ \mid U)\,P(U)}{P(+ \mid U)\,P(U) + P(+ \mid U')\,P(U')} = \frac{0.99 \times 0.005}{0.99 \times 0.005 + 0.01 \times 0.995} \approx 0.332$$

This means that even with a positive test, there is only about a 33.2% probability of being a true user.
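The same computation in code, with the assumed rates made explicit (a sketch):

```python
# Assumed rates for the example: sensitivity 0.99, specificity 0.99,
# prevalence 0.005.
p_pos_given_user = 0.99
p_neg_given_nonuser = 0.99
p_user = 0.005

# Law of total probability for the evidence P(+).
p_pos = (p_pos_given_user * p_user
         + (1 - p_neg_given_nonuser) * (1 - p_user))
# Bayes' theorem for the posterior P(user | +).
p_user_given_pos = p_pos_given_user * p_user / p_pos
print(f"P(user | positive) = {p_user_given_pos:.3f}")  # 0.332
```

The posterior is low because the disease is rare: most positives come from the large non-user group despite its small false-positive rate.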
Monty Hall Problem
- Three doors: one has a car, two have goats. After the host opens a door with a goat, should you switch?
- Calculation using Bayes' theorem:
  - Assume the initial choice is door 1 and the host opens door 3 (revealing a goat).
  - $P(\text{car behind 1} \mid \text{opens 3}) = \frac{P(\text{opens 3} \mid \text{car 1})\,P(\text{car 1})}{P(\text{opens 3})} = \frac{(1/2)(1/3)}{1/2} = \frac{1}{3}$
  - $P(\text{car behind 2} \mid \text{opens 3}) = \frac{P(\text{opens 3} \mid \text{car 2})\,P(\text{car 2})}{P(\text{opens 3})} = \frac{(1)(1/3)}{1/2} = \frac{2}{3}$
- Thus, switching doors gives a win probability of 2/3, while staying gives 1/3, so switching is optimal.
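A quick simulation confirms the 2/3 vs 1/3 split (a sketch; the seed and names are my own):

```python
import random

random.seed(1)

def play(switch: bool) -> bool:
    """One round of Monty Hall; returns True if the final pick wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a goat door that is neither the pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

n = 100_000
p_switch = sum(play(True) for _ in range(n)) / n
p_stay = sum(play(False) for _ in range(n)) / n
print(f"switch wins {p_switch:.3f}, stay wins {p_stay:.3f}")  # ~0.667 vs ~0.333
```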
Random Variables
- Random Variable: any characteristic whose value may vary across individuals; formally, a function assigning a number to each outcome in the sample space.
- Descriptive Statistics:
  - Numerical: mean, median, trimmed mean, variance, standard deviation.
  - Graphical: histograms, pie charts, box plots.
Discrete Random Variables
- Have a finite or countably infinite number of possible values.
- Probability Mass Function (PMF): $p(x) = P(X = x)$
- Cumulative Distribution Function (CDF): $F(x) = P(X \le x) = \sum_{y \le x} p(y)$
- Expectation: $E(X) = \sum_x x\,p(x)$
- Variance: $V(X) = E[(X - \mu)^2] = E(X^2) - [E(X)]^2$

Common Discrete Distributions:
- Bernoulli: $p(x) = p^x (1 - p)^{1 - x}$, for $x \in \{0, 1\}$
- Binomial: $p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, for $x = 0, 1, \dots, n$
- Poisson: $p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, for $x = 0, 1, 2, \dots$
  - Example: a call center averages 5 calls per hour ($\lambda = 5$); the probability of exactly 3 calls is $P(X = 3) = \frac{e^{-5} \cdot 5^3}{3!} \approx 0.140$.
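The call-center example, computed from the Poisson PMF (a sketch):

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Poisson PMF: e^{-lam} * lam^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Call center averaging 5 calls/hour: probability of exactly 3 calls.
print(round(poisson_pmf(3, 5.0), 4))  # 0.1404
```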
Continuous Random Variables
- Possible outcomes form an interval on the real number line.
- Probability Density Function (PDF): $P(a \le X \le b) = \int_a^b f(x)\,dx$
- Cumulative Distribution Function (CDF): $F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy$
- Expectation: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
- Variance: $V(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E(X^2) - [E(X)]^2$

Common Continuous Distributions:
- Uniform: $f(x) = \frac{1}{b - a}$, for $a \le x \le b$
- Exponential: $f(x) = \lambda e^{-\lambda x}$, for $x \ge 0$
- Normal: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, for $-\infty < x < \infty$
Joint Distributions
For random variables $X$ and $Y$ (continuous or discrete):
- Joint PDF/PMF: $f(x, y)$
- Marginal PDF: $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ (continuous) or the corresponding sum (discrete)
- Independence: $X$ and $Y$ are independent if $f(x, y) = f_X(x) f_Y(y)$ for all $x, y$.
- Conditional PDF: $f_{Y \mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$
- Expectation: $E[h(X, Y)] = \iint h(x, y)\,f(x, y)\,dx\,dy$
- Covariance: $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)$
- Correlation: $\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
Statistics and Distributions
- Statistic: a function of the data, hence itself a random variable.
- Simple Random Sample: if $X_1, \dots, X_n$ are independent and identically distributed (i.i.d.), they form a simple random sample.
- Sample Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  - If $X_i \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \sigma^2 / n)$
- Central Limit Theorem (CLT): if $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$:

$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$

This means that even if the original distribution is non-normal, the sample mean is approximately normal for large samples.
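A small simulation illustrating the CLT (a sketch; the seed, sample size, and the choice of a skewed exponential population are my own):

```python
import random
import statistics

random.seed(0)
n, reps = 50, 2000
# Exponential(rate = 1) is strongly right-skewed, with mean 1 and variance 1.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]
# The sample means cluster around mu = 1 with spread sigma / sqrt(n) ~= 0.141,
# and their histogram is approximately normal despite the skewed population.
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2))
```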
General Results:
- If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
- Linearity of expectation and variance: $E(aX + bY) = aE(X) + bE(Y)$; $V(aX + bY) = a^2 V(X) + b^2 V(Y)$ if $X$ and $Y$ are independent, otherwise add the covariance term $2ab\,\mathrm{Cov}(X, Y)$.
Statistical Inference
Inferring population truths from sample data:
- Estimation: finding estimates for unknown parameters.
  - Point estimation: e.g., $\hat{\mu} = \bar{x}$
  - Confidence interval (CI) estimation: e.g., a 95% CI for $\mu$ is (2.0, 3.0)
- Hypothesis Testing: making decisions between competing hypotheses (e.g., $H_0: \mu = 0$ vs. $H_a: \mu \ne 0$)
Point Estimation
- Point Estimator: a statistic $\hat{\theta}$ used to estimate a parameter $\theta$.
- Example: estimating the population mean $\mu$ from sample data.
  - Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
  - Sample median: the middle value after sorting
  - Trimmed mean: the mean after removing extreme values

Question: which estimator is closer to the population mean? It depends on the distribution.
Unbiased Estimation
- Unbiased Estimator: $\hat{\theta}$ is unbiased for $\theta$ if $E(\hat{\theta}) = \theta$.
- Bias: $\mathrm{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
- Examples:
  - The sample mean $\bar{X}$ is unbiased for $\mu$.
  - The sample variance $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ is unbiased for $\sigma^2$ (Bessel's correction).

The $n - 1$ degrees of freedom arise because the sample mean is used in the estimate, removing one independent deviation.

Another example: for $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$, $E(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2 \ne \sigma^2$, so $\hat{\sigma}^2$ is biased.
MVUE (Minimum Variance Unbiased Estimator)
- Among all unbiased estimators, choose the one with minimum variance.
- Example: for a uniform distribution, the sample mean $\bar{X}$ and the sample median $\tilde{X}$ are both unbiased for the center, but $\bar{X}$ has smaller variance.
- If $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, then $\bar{X}$ is the MVUE for $\mu$.
Method of Moments (MM) Estimation
- Philosophy: sample moments should resemble population moments.
- k-th Sample Moment: $\frac{1}{n} \sum_{i=1}^{n} X_i^k$
- k-th Population Moment: $E(X^k)$
- Solve for the parameters by equating sample moments to population moments.

Example: estimating $\mu$ and $\sigma^2$ for the normal distribution
- First moment: $E(X) = \mu$; the sample moment is $\bar{X}$, so $\hat{\mu} = \bar{X}$.
- Second moment: $E(X^2) = \mu^2 + \sigma^2$; the sample moment is $\frac{1}{n} \sum X_i^2$, so $\hat{\sigma}^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$.
Maximum Likelihood Estimation (MLE)
- Philosophy: given the observed data, choose the parameter values that maximize the probability of observing that data.
- Likelihood Function: $L(\theta) = f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$ for i.i.d. data, treated as a function of $\theta$.
- The MLE $\hat{\theta}$ maximizes $L(\theta)$ (equivalently, $\log L(\theta)$).

Example: $X_1, \dots, X_n \sim N(\mu, \sigma^2)$ i.i.d.
- Likelihood function: $L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$
- MLE: $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$
Another Example: $X_1, \dots, X_n \sim \mathrm{Uniform}(0, \theta)$ i.i.d.
- Likelihood function: $L(\theta) = \frac{1}{\theta^n}$ if all $x_i \le \theta$, else 0.
- To maximize, $\theta$ should be as small as possible but not less than any $x_i$, so $\hat{\theta} = \max_i X_i = X_{(n)}$.
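Comparing the uniform MLE ($\max_i X_i$) with the method-of-moments estimator ($2\bar{X}$, since $E(X) = \theta/2$) on simulated data (a sketch; the true $\theta$ and the seed are my own):

```python
import random

random.seed(0)
theta = 4.0  # true parameter, unknown in practice
x = [random.uniform(0, theta) for _ in range(1000)]

theta_mle = max(x)              # MLE: the sample maximum
theta_mm = 2 * sum(x) / len(x)  # method of moments: E(X) = theta / 2
print(round(theta_mle, 3), round(theta_mm, 3))  # both near 4.0
```

Note the MLE is always at most $\theta$ (slightly biased low, since $E(X_{(n)}) = \frac{n}{n+1}\theta$), while the MM estimator is unbiased but typically has larger spread.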
Confidence Intervals (CI)
- Confidence Interval: an interval estimate, based on a statistic, that contains the unknown population parameter with a predetermined probability (the confidence level).
- CI = point estimate ± margin of error

Example: normal distribution with known $\sigma$; a $100(1-\alpha)\%$ CI for $\mu$ is:

$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

- If $\sigma$ is unknown, use the sample standard deviation $S$ and the t-distribution: $\bar{X} \pm t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$

Interpretation: the CI is a random interval; under repeated sampling, approximately $100(1-\alpha)\%$ of such intervals will cover the true $\mu$.

General CI Construction:
- If the estimator $\hat{\theta}$ is approximately normal and unbiased with known standard deviation $\sigma_{\hat{\theta}}$, then an approximate CI is $\hat{\theta} \pm z_{\alpha/2}\,\sigma_{\hat{\theta}}$.
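A t-based interval on a small sample (the data values and the table value $t_{0.025,7} = 2.365$ are my own assumptions):

```python
import statistics

x = [2.1, 2.6, 2.4, 2.9, 2.3, 2.7, 2.5, 2.2]  # hypothetical sample
n = len(x)
xbar = statistics.fmean(x)    # point estimate of mu
s = statistics.stdev(x)       # sample standard deviation (n - 1 divisor)
t_crit = 2.365                # t_{0.025, 7} from a t-table (df = n - 1 = 7)
half = t_crit * s / n ** 0.5  # margin of error
print(f"95% CI for mu: ({xbar - half:.2f}, {xbar + half:.2f})")
```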
Hypothesis Testing
- Hypothesis: a claim about a characteristic of a probability distribution.
- Hypothesis Testing: using data to decide between two competing hypotheses.
  - Null Hypothesis ($H_0$): the initial assumption to be tested.
  - Alternative Hypothesis ($H_a$): a claim contradicting $H_0$.

Test Types:
- For $H_0: \theta = \theta_0$, the alternative $H_a$ can be:
  - $H_a: \theta > \theta_0$ (right-tailed)
  - $H_a: \theta < \theta_0$ (left-tailed)
  - $H_a: \theta \ne \theta_0$ (two-tailed)
Testing Procedure:
- Test Statistic: a function of the sample data on which the decision is based.
- Rejection Region: the set of test-statistic values that lead to rejecting $H_0$.

Error Types:
- Type I Error: rejecting a true $H_0$; its probability is $\alpha$ (the significance level).
- Type II Error: failing to reject a false $H_0$; its probability is $\beta$.
- Power: $1 - \beta$, the probability of rejecting a false $H_0$.
p-value:
- The p-value is the smallest significance level at which $H_0$ would be rejected given the data.
- Equivalently, it is the probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed.
- If p-value $< \alpha$, reject $H_0$; otherwise, do not reject.

Typically $\alpha = 0.05$, but use p-values with caution, as they are sensitive to sample size.
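A two-tailed z-test p-value can be sketched using the standard normal CDF (the sample numbers below are hypothetical):

```python
import math

def z_test_p_value(xbar: float, mu0: float, sigma: float, n: int) -> float:
    """Two-tailed p-value for H0: mu = mu0 when sigma is known."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

# Hypothetical sample: n = 100, xbar = 5.25, testing H0: mu = 5 with sigma = 1.
p = z_test_p_value(xbar=5.25, mu0=5.0, sigma=1.0, n=100)
print(round(p, 4))  # 0.0124 < 0.05, so reject H0 at alpha = 0.05
```

The same p-value at n = 10,000 would correspond to a far smaller effect, which is why the note above warns that p-values are sensitive to sample size.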
