#sdsc5002

Data Fundamentals

What is Data?

Data are values obtained by measuring certain variables from individuals (people, objects, etc.).

Types of Variables

Categorical Variables (Qualitative)
- Examples: Gender, blood type, disease status
- If categories can be ordered, they are called ordinal categorical variables (e.g., course grades, COVID-19 severity)
Numerical Variables (Quantitative)
- Examples: Height, weight, age, income, blood pressure
- Only numerical variables support meaningful arithmetic operations (e.g., sums, differences, averages)

Data Table Structure

Columns: Each column corresponds to a variable
Rows: Each row corresponds to an individual (observation); the number of rows is denoted $n$

Song	Artist	Genre	Size(MB)	Length(sec)
My Friends	D. Williams	Alternative	3.83	247
Up the Road	E. Clapton	Rock	5.62	378
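The row/column structure above can be sketched in plain Python. This is only an illustration using the two rows shown in the table; the variable names (`songs`, `text_columns`, `numeric_columns`) are assumptions made for the example.

```python
# Each dict is one row (an individual song); each key is a column (a variable).
songs = [
    {"Song": "My Friends", "Artist": "D. Williams", "Genre": "Alternative",
     "Size(MB)": 3.83, "Length(sec)": 247},
    {"Song": "Up the Road", "Artist": "E. Clapton", "Genre": "Rock",
     "Size(MB)": 5.62, "Length(sec)": 378},
]

n = len(songs)                      # number of observations (rows)
variables = list(songs[0].keys())   # column names

# Columns holding text are candidates for categorical variables;
# columns holding numbers are candidates for numerical variables.
text_columns = [v for v in variables if isinstance(songs[0][v], str)]
numeric_columns = [v for v in variables if isinstance(songs[0][v], (int, float))]

print(n, text_columns, numeric_columns)
```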

Data Collection Methods

Observational Method: Direct observation or comparison
Testing and Experimentation: Measurement using tools (e.g., tape measure)
Survey Method:
- Questionnaires
- Interviews
- Email/Phone
Document Analysis: Such as reviewing medical records

📌 Note: The order of questions in a questionnaire may affect how people respond (e.g., asking specific questions before general ones, or vice versa)

Exploratory Data Analysis (EDA)

Definition and Goals of EDA

Exploratory data analysis, promoted by John Tukey in the 1970s, is a preliminary approach to examining data. It mainly involves:

Examining each variable

Examining relationships between variables

Methods are divided into:

Numerical Summaries (calculating numbers)
Graphical Summaries (plotting charts)

Distribution

The distribution of a variable describes the values it can take and how often each value occurs
Key questions:
- What are the possible values of the variable?
- How frequently do these values occur?

Summary of Categorical Variables

Numerical Summary

Use counts or percentages to describe the distribution of each category:

Education Level	Count (millions)	Percentage (%)
Below High School	4.7	12.3
High School Graduate	11.8	30.7
Some College	10.9	28.3
Bachelor’s Degree	8.5	22.1
Advanced Degree	2.5	6.6
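As a minimal sketch of how counts and percentages are obtained from raw categorical data, the snippet below tallies a small list of blood types. The data are hypothetical and are not related to the education table above.

```python
from collections import Counter

# Hypothetical raw observations of a categorical variable.
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]

counts = Counter(blood_types)   # count per category
n = len(blood_types)            # total number of observations

# Percentage of observations falling in each category.
percentages = {category: 100 * count / n for category, count in counts.items()}

# Categories listed from most to least frequent.
for category, count in counts.most_common():
    print(f"{category}\t{count}\t{percentages[category]:.1f}%")
```

The percentages always sum to 100 (up to rounding), which is a quick sanity check on any such table.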

Graphical Summary

Bar Chart: Bar height proportional to count/percentage
- If the bars are sorted in descending order of frequency, the chart is called a Pareto chart
Pie Chart: Sector area proportional to count/percentage

💡 Tip: Bar charts are easier for comparing actual values than pie charts (comparing heights is more intuitive than comparing angles)

Summary of Numerical Variables

Graphical Summary: Histogram

Divide the range into equal-width intervals
Draw rectangles for each interval, with height proportional to the number of observations in that interval

Example: Histogram of the “sepal length” variable in the iris dataset, with range [4, 8] divided into 8 equal-width intervals (each of width 0.5)
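The binning step behind a histogram can be sketched as follows. The function name `histogram_counts` and the sample values are assumptions made for illustration; the values are hypothetical and are not the actual iris measurements. Only the range [4, 8] and the 8 intervals are taken from the example above.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not (low <= v <= high):
            continue  # ignore values outside the plotted range
        # The last interval is closed on the right, so v == high goes in it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical sepal lengths (cm).
sepal_lengths = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 7.0, 7.9]

bin_counts = histogram_counts(sepal_lengths, low=4.0, high=8.0, bins=8)

# Text rendering: one row per interval, bar length equal to the count.
for i, count in enumerate(bin_counts):
    left = 4.0 + 0.5 * i
    print(f"[{left:.1f}, {left + 0.5:.1f})  {'#' * count}")
```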

Density Curves and Density Plots

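Density Curve: A smooth curve whose area over any interval approximates the proportion of observations in that interval (the total area under the curve is 1)
Density Plot: A smoothed alternative to the histogram, obtained by kernel density estimation, that shows the distribution as a continuous curve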
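Kernel density estimation can be sketched in a few lines. The version below assumes a Gaussian kernel and a hand-picked bandwidth of 0.4; both are illustrative choices rather than anything prescribed by these notes, and the data are hypothetical.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump centred on each observation.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

# Hypothetical observations.
data = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 7.0, 7.9]
density = gaussian_kde(data, bandwidth=0.4)

# Check that the area under the estimated curve is approximately 1,
# using a simple Riemann sum on a grid from 2 to 10.
step = 0.01
grid = [2 + i * step for i in range(801)]
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual observations more closely.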

Describing Distribution Shapes

Unimodal: One main peak
Bimodal: Two main peaks
Symmetric: The left and right sides are approximately mirror images around the center
Skewed to the right: Longer tail on the right
Skewed to the left: Longer tail on the left

Numerical Summary

Measures of Central Tendency

Mean: Average value

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
Median: 50th percentile
- If the number of observations is even, take the average of the middle two values

🔍 Insight: The median is more robust to extreme values (outliers) than the mean
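A small, hypothetical example illustrates the robustness of the median: adding one extreme value shifts the mean substantially while the median barely moves.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands).
incomes = [30, 35, 40, 45, 50]
with_outlier = incomes + [1000]   # add one extreme value

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)   # 40 and 40
print(mean_after, median_after)     # 200 and 42.5 (average of the middle two values)
```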

Measures of Variability

Variance: “Average” of squared deviations from the mean

$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}$
Standard Deviation (SD): Square root of variance

$s = \sqrt{s^2}$

Why divide by $n-1$? Dividing by $n-1$ gives an unbiased estimate of the population variance; the difference from dividing by $n$ is small when $n$ is large
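The variance and standard deviation formulas can be checked by hand on a small, hypothetical sample. The sketch below computes them directly and compares the result with Python's `statistics.variance` (which divides by $n-1$) and `statistics.pvariance` (which divides by $n$).

```python
import math
from statistics import variance, pvariance

# Hypothetical sample with mean 5.
x = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(x)
x_bar = sum(x) / n
sum_sq_dev = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations

s2 = sum_sq_dev / (n - 1)      # sample variance (divide by n - 1)
s = math.sqrt(s2)              # sample standard deviation
s2_div_n = sum_sq_dev / n      # dividing by n instead gives a smaller value

print(s2, s, s2_div_n)
print(variance(x), pvariance(x))
```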

Other Measures of Variability

Range: Maximum value - Minimum value
Interquartile Range (IQR): Difference between the third quartile and first quartile

$IQR = Q3 - Q1$

Five-Number Summary and Boxplot

Five-Number Summary: Minimum, Q1, Median, Q3, Maximum
Boxplot:
- Box: From Q1 to Q3, with median inside
- Whiskers: Extend to the minimum and maximum values or, in the more common convention, to the farthest observations within 1.5×IQR below Q1 and above Q3
- Outliers: Points beyond the whiskers, plotted individually
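The quantities behind a boxplot can be computed as in the sketch below. It assumes the "inclusive" quartile method of Python's `statistics.quantiles`; textbooks and software use several slightly different quartile conventions, so the values of Q1 and Q3 may differ a little elsewhere. The data are hypothetical.

```python
from statistics import quantiles

# Hypothetical sample with one unusually large value.
data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, med, q3 = quantiles(data, n=4, method="inclusive")
five_number = (data[0], q1, med, q3, data[-1])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the farthest observations still inside the fences.
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]

# Observations beyond the fences are flagged as outliers.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(five_number, iqr, (whisker_low, whisker_high), outliers)
```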

⚠ Warning: Boxplots and numerical summaries alone may not fully describe distribution shapes (e.g., bimodal distributions); supplement with histograms

Relationships Between Two Variables

Two Categorical Variables: Contingency Table

Contains counts or proportions (percentages)
Can compute:
- Joint Distribution: Proportion of all observations in each cell
- Marginal Distribution: Distribution of a single variable (row or column totals divided by the grand total)
- Conditional Distribution: Distribution of one variable given a fixed value of the other
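The three distributions can be derived from one table of counts. The sketch below uses a hypothetical 2×2 contingency table (smoking status by disease status); the counts and names are invented for illustration only.

```python
# Hypothetical counts: (smoking status, disease status) -> number of people.
table = {
    ("Smoker", "Disease"): 30,
    ("Smoker", "Healthy"): 70,
    ("Non-smoker", "Disease"): 20,
    ("Non-smoker", "Healthy"): 180,
}
total = sum(table.values())

# Joint distribution: each cell divided by the grand total.
joint = {cell: count / total for cell, count in table.items()}

# Marginal distribution of smoking status: row totals divided by the grand total.
row_totals = {}
for (smoking, _), count in table.items():
    row_totals[smoking] = row_totals.get(smoking, 0) + count
marginal_smoking = {k: v / total for k, v in row_totals.items()}

# Conditional distribution of disease status given smoking status:
# each cell divided by its row total.
conditional = {
    (smoking, disease): count / row_totals[smoking]
    for (smoking, disease), count in table.items()
}

print(joint[("Smoker", "Disease")])        # 30 / 300
print(marginal_smoking["Smoker"])          # 100 / 300
print(conditional[("Smoker", "Disease")])  # 30 / 100
```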

Simpson’s Paradox

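The direction of an association between two variables may reverse once a third variable (a hidden or confounding variable) is taken into account: a trend that holds within every group can point the other way when the groups are combined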
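A small numerical illustration of the reversal, using entirely hypothetical counts: treatment A has the higher success rate within each severity group, yet treatment B has the higher success rate once the groups are pooled, because A was used mostly on severe cases.

```python
# Hypothetical (successes, patients) for two treatments, split by severity.
results = {
    "mild":   {"A": (8, 10),   "B": (70, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, patients):
    return successes / patients

# Within each group, compare the two treatments.
within_group = {
    group: {t: rate(*counts) for t, counts in by_treatment.items()}
    for group, by_treatment in results.items()
}

# Pool the groups and compare again.
overall = {}
for t in ("A", "B"):
    successes = sum(results[g][t][0] for g in results)
    patients = sum(results[g][t][1] for g in results)
    overall[t] = rate(successes, patients)

print(within_group)  # A is ahead in both groups
print(overall)       # B is ahead overall
```

Here severity is the third variable: it influences both which treatment is given and how likely success is.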

Two Numerical Variables: Scatterplot

X-axis and Y-axis represent two variables
Each observation is a point
Can add categorical variables (using different colors/symbols)

Interpreting Scatterplots:

Form: Clustering, linear association, etc.
Direction:
- Positive correlation: When one variable is above average, the other tends to be above average
- Negative correlation: When one variable is above average, the other tends to be below average
Strength: How closely points follow the form
Outliers: Points deviating from the overall pattern

Correlation Coefficient (Correlation, $r$)

Measures the direction and strength of linear relationship between two numerical variables
Range: $[-1, 1]$
Characteristics:
- Unitless: changing the measurement units (e.g., centimeters to inches) does not change $r$
- Only measures linear relationships, not curved ones
- Sensitive to outliers

📌 Note: Correlation does not imply causation

Key Formula Summary

Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard Deviation: $s = \sqrt{s^2}$
Correlation Coefficient: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y}$, where $s_x$ and $s_y$ are the standard deviations of $x$ and $y$
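The correlation formula can be implemented directly, as in the sketch below. The function name `correlation` and the height/weight data are hypothetical. The example also checks that converting heights from centimeters to inches leaves $r$ unchanged.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)   # sample standard deviations (n - 1 divisor)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Hypothetical paired measurements.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 74, 85]

r = correlation(heights_cm, weights_kg)

# Changing units rescales the heights but not the correlation.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = correlation(heights_in, weights_kg)

print(r, r_inches)
```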
