SDSC5002 Course 2-EDA
#sdsc5002
Data Fundamentals
What is Data?
Data are values obtained by measuring certain variables from individuals (people, objects, etc.).
Types of Variables
- Categorical Variables (Qualitative)
  - Examples: Gender, blood type, disease status
  - If categories can be ordered, they are called ordinal categorical variables (e.g., course grades, COVID-19 severity)
- Numerical Variables (Quantitative)
  - Examples: Height, weight, age, income, blood pressure
  - Only numerical variables support arithmetic operations
Data Table Structure
- Columns: Correspond to variables
- Rows: Correspond to individuals or observations, with the count denoted as $n$
| Song | Artist | Genre | Size(MB) | Length(sec) |
|---|---|---|---|---|
| My Friends | D. Williams | Alternative | 3.83 | 247 |
| Up the Road | E. Clapton | Rock | 5.62 | 378 |
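As a minimal sketch, the two rows of the table above can be held as a list of records, one record per observation. Classifying a variable by the Python type of its values is a simplification assumed here for illustration, not a general rule.

```python
# Each record is one observation (a row); each key is a variable (a column).
songs = [
    {"Song": "My Friends", "Artist": "D. Williams", "Genre": "Alternative",
     "Size(MB)": 3.83, "Length(sec)": 247},
    {"Song": "Up the Road", "Artist": "E. Clapton", "Genre": "Rock",
     "Size(MB)": 5.62, "Length(sec)": 378},
]

n = len(songs)                      # number of observations (rows)
variables = list(songs[0].keys())   # variable names (columns)

# Assumption for illustration: numeric values -> numerical variable,
# anything else -> categorical variable.
kinds = {
    name: "numerical" if isinstance(songs[0][name], (int, float)) else "categorical"
    for name in variables
}

print(n)
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```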
Data Collection Methods
- Observational Method: Direct observation or comparison
- Testing and Experimentation: Measurement using tools (e.g., tape measure)
- Survey Method:
  - Questionnaires
  - Interviews
  - Email/Phone
- Document Analysis: Such as reviewing medical records
📌 Note: The order of questions in a questionnaire may affect how consistently people respond (e.g., asking specific questions before general ones, or vice versa)
Exploratory Data Analysis (EDA)
Definition and Goals of EDA
EDA, proposed by John Tukey in the 1970s, is a preliminary approach to examining data. It mainly involves:
- Examining each variable
- Examining relationships between variables
Methods are divided into:
- Numerical Summaries (calculating numbers)
- Graphical Summaries (plotting charts)
Distribution
- Describes possible values of a variable and their frequencies
- Key questions:
  - What are the possible values of the variable?
  - How frequently do these values occur?
Summary of Categorical Variables
Numerical Summary
Use counts or percentages to describe the distribution of each category:
| Education Level | Count (millions) | Percentage (%) |
|---|---|---|
| Below High School | 4.7 | 12.3 |
| High School Graduate | 11.8 | 30.7 |
| Some College | 10.9 | 28.3 |
| Bachelor’s Degree | 8.5 | 22.1 |
| Advanced Degree | 2.5 | 6.6 |
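A short sketch of how the percentage column relates to the counts: each percentage is the category count divided by the total. The counts are copied from the table; because they are rounded to one decimal, the recomputed percentages can differ from the table by about 0.1.

```python
# Counts (in millions) copied from the table above.
counts = {
    "Below High School": 4.7,
    "High School Graduate": 11.8,
    "Some College": 10.9,
    "Bachelor's Degree": 8.5,
    "Advanced Degree": 2.5,
}

total = sum(counts.values())

# Percentage of each category = count / total * 100.
percentages = {level: 100 * c / total for level, c in counts.items()}

for level, pct in percentages.items():
    print(f"{level:<22}{counts[level]:>6.1f}{pct:>7.1f}%")
```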
Graphical Summary
- Bar Chart: Bar height proportional to count/percentage
  - If the bars are sorted by frequency in descending order, it is called a Pareto Chart
- Pie Chart: Sector area proportional to count/percentage
💡 Tip: Bar charts are easier for comparing actual values than pie charts (comparing heights is more intuitive than comparing angles)
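To illustrate the Pareto ordering without a plotting library, the sketch below prints a text bar chart of the education counts, sorted from most to least frequent. The scale of one `#` per 0.5 million is an arbitrary choice made here.

```python
# Counts (in millions) from the education-level table.
counts = {
    "Below High School": 4.7,
    "High School Graduate": 11.8,
    "Some College": 10.9,
    "Bachelor's Degree": 8.5,
    "Advanced Degree": 2.5,
}

# Sort categories from most to least frequent (the Pareto ordering).
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# One '#' per 0.5 million (rounded), so bar length is roughly proportional to the count.
for level, c in ordered:
    bar = "#" * round(c / 0.5)
    print(f"{level:<22}{bar} {c}")
```

Education level is ordinal, so sorting by frequency gives up the natural order of the categories; which ordering is more useful depends on the question being asked.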
Summary of Numerical Variables
Graphical Summary: Histogram
- Divide the range into equal-width intervals
- Draw a rectangle over each interval, with height proportional to the number of observations in that interval
Example: Histogram of “sepal length” variable in the iris dataset, range [4, 8], divided into 8 equal-width intervals
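The binning step can be sketched as follows, using the range [4, 8] and 8 intervals from the example. The values are hypothetical, not the actual iris measurements, and the function name `histogram_counts` and the left-closed interval convention are assumptions of this sketch.

```python
def histogram_counts(values, low, high, bins):
    """Count observations in `bins` equal-width intervals covering [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if not (low <= v <= high):
            continue  # ignore values outside the range
        # Intervals are left-closed; the value `high` goes into the last interval.
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical measurements, NOT the actual iris data.
values = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 5.8, 6.1, 6.3, 6.7, 7.0, 7.9]
counts = histogram_counts(values, low=4, high=8, bins=8)

# Text histogram: bar length equals the count in each interval.
for i, c in enumerate(counts):
    left = 4 + i * 0.5
    print(f"[{left:.1f}, {left + 0.5:.1f}) {'#' * c}")
```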
Density Curves and Density Plots
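- Density Curve: A smooth curve whose area over any value range approximates the proportion of observations in that range (the total area is 1)
- Density Plot: A density curve estimated from the data by kernel density estimation, showing the distribution more continuously than a histogram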
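A minimal sketch of kernel density estimation, assuming a Gaussian kernel: the estimate at a point is the average of small bell curves centered on each observation. The data and the bandwidth of 0.4 are hypothetical choices for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

# Hypothetical data; the bandwidth is an arbitrary choice for illustration.
data = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 5.8, 6.1, 6.3, 6.7, 7.0, 7.9]
density = gaussian_kde(data, bandwidth=0.4)

# Riemann sum over a wide grid: the total area under the curve should be close to 1.
step = 0.01
grid = [2 + i * step for i in range(801)]   # 2.00 to 10.00
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, so the shape of a density plot depends on that choice.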
Describing Distribution Shapes
- Unimodal: One main peak
- Bimodal: Two main peaks
- Symmetric: Symmetrical around the center
- Skewed to the right: Longer tail on the right
- Skewed to the left: Longer tail on the left
Numerical Summary
Measures of Central Tendency
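- Mean: Average value (the sum of the observations divided by the number of observations)
- Median: 50th percentile (the middle value of the sorted observations)
  - If the number of observations is even, take the average of the middle two values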
🔍 Insight: Median is more robust to extreme values (outliers)
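The robustness of the median can be seen in a small sketch with hypothetical values: adding one extreme observation moves the mean a long way but barely changes the median.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]      # hypothetical values, in thousands
with_outlier = incomes + [1000]     # add one extreme value

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

With six observations the median is the average of the two middle values, 35 and 38.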
Measures of Variability
- Variance: “Average” of squared deviations from the mean
- Standard Deviation (SD): Square root of variance
Why divide by $n-1$? Dividing by $n-1$ rather than $n$ (the number of observations) gives an unbiased estimate of the variance (the effect is small when $n$ is large)
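A small sketch of the two divisors, using hypothetical values. In Python's `statistics` module, `variance` divides by the number of observations minus one and `pvariance` divides by the number of observations.

```python
import math
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)
xbar = mean(data)

# Sum of squared deviations from the mean.
squared_devs = sum((x - xbar) ** 2 for x in data)

s2 = squared_devs / (n - 1)   # sample variance: divide by n - 1
s = math.sqrt(s2)             # sample standard deviation
p2 = squared_devs / n         # dividing by n instead gives a smaller value

print(s2, variance(data))
print(p2, pvariance(data))
print(s)
```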
Other Measures of Variability
- Range: Maximum value - Minimum value
- Interquartile Range (IQR): Difference between the third quartile and first quartile
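A sketch of both measures on hypothetical values. Quartile conventions differ between textbooks and software; this example assumes the default "exclusive" method of `statistics.quantiles`, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)

def iqr(values):
    """Third quartile minus first quartile (default 'exclusive' method)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

data = [2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 18]   # hypothetical values
with_extreme = data[:-1] + [40]                  # replace the maximum by 40

print(value_range(data), iqr(data))
print(value_range(with_extreme), iqr(with_extreme))
```

In this example the range changes when the largest value becomes extreme, while the IQR stays the same because it depends only on the middle half of the data.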
Five-Number Summary and Boxplot
- Five-Number Summary: Minimum, Q1, Median, Q3, Maximum
- Boxplot:
  - Box: From Q1 to Q3, with the median marked inside
  - Whiskers: Extend to the minimum and maximum values, or, under the common 1.5×IQR rule, to the farthest points within 1.5×IQR below Q1 and above Q3
  - Outliers: Points beyond the whiskers, plotted individually
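The pieces of a boxplot can be computed directly, as in this sketch on hypothetical values. It assumes the 1.5×IQR rule for the whiskers and the default quartile method of `statistics.quantiles`; the names `lower_fence` and `upper_fence` are labels chosen here for the cut-off values.

```python
from statistics import quantiles

data = sorted([2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 40])   # hypothetical values

q1, med, q3 = quantiles(data, n=4)
five_number = (data[0], q1, med, q3, data[-1])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # points below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # points above this are flagged as outliers

# Whiskers reach the farthest observations that are still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number)
print(whisker_low, whisker_high)
print(outliers)
```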
⚠ Warning: Boxplots and numerical summaries alone may not fully describe distribution shapes (e.g., bimodal distributions); supplement with histograms
Relationships Between Two Variables
Two Categorical Variables: Contingency Table
- Contains counts or proportions (percentages)
- Can compute:
  - Joint Distribution: Proportion in each cell (cell count divided by the grand total)
  - Marginal Distribution: Distribution of a single variable (row or column totals divided by the grand total)
  - Conditional Distribution: Distribution of one variable given the value of the other variable
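The three distributions can be sketched from a small table of counts. The counts and category names below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts: rows = group, columns = outcome.
table = {
    "treatment": {"recovered": 30, "not recovered": 70},
    "control":   {"recovered": 20, "not recovered": 180},
}
total = sum(sum(row.values()) for row in table.values())

# Joint distribution: each cell divided by the grand total.
joint = {r: {c: v / total for c, v in row.items()} for r, row in table.items()}

# Marginal distribution of the row variable: row totals divided by the grand total.
marginal_rows = {r: sum(row.values()) / total for r, row in table.items()}

# Conditional distribution of the outcome given the group:
# each cell divided by its row total.
conditional = {
    r: {c: v / sum(row.values()) for c, v in row.items()}
    for r, row in table.items()
}

print(joint["treatment"]["recovered"])
print(marginal_rows)
print(conditional["treatment"]["recovered"], conditional["control"]["recovered"])
```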
Simpson’s Paradox
When a third variable (a hidden or confounding variable) is taken into account, the direction of the association between two variables may reverse
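A numeric sketch of the reversal, using illustrative counts: treatment A has the higher success rate within each subgroup, yet treatment B looks better once the subgroups are pooled. The subgroup labels and treatment names are assumptions made for this example.

```python
# Illustrative counts: (successes, patients) per treatment within each subgroup.
groups = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    """Proportion of successes."""
    return successes / total

# Within each subgroup, compare the success rates of A and B.
for name, g in groups.items():
    print(name, round(rate(*g["A"]), 3), round(rate(*g["B"]), 3))

# Pool the subgroups: add up successes and patients for each treatment.
pooled = {
    t: (sum(g[t][0] for g in groups.values()),
        sum(g[t][1] for g in groups.values()))
    for t in ("A", "B")
}
print("pooled", round(rate(*pooled["A"]), 3), round(rate(*pooled["B"]), 3))
```

The reversal arises because A is applied mostly to the "large" subgroup, where success rates are lower for both treatments, so the subgroup acts as the third variable.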
Two Numerical Variables: Scatterplot
- X-axis and Y-axis represent two variables
- Each observation is a point
- Can add categorical variables (using different colors/symbols)
Interpreting Scatterplots:
- Form: Clustering, linear association, etc.
- Direction:
  - Positive correlation: When one variable is above average, the other tends to be above average
  - Negative correlation: When one variable is above average, the other tends to be below average
- Strength: How closely points follow the form
- Outliers: Points deviating from the overall pattern
Correlation Coefficient (Correlation, $r$)
- Measures the direction and strength of the linear relationship between two numerical variables
- Range: $-1 \le r \le 1$
- Characteristics:
  - Unitless
  - Unaffected by changes of measurement units (e.g., cm to inches)
  - Only measures linear relationships, not curved ones
  - Sensitive to outliers
📌 Note: Correlation does not imply causation
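The sketch below computes the sample correlation as the sum of products of standardized values divided by the number of observations minus one, then checks two of the characteristics listed above. All data are hypothetical, and the helper name `correlation` is chosen here for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum of standardized products divided by n - 1."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical measurements.
heights_cm = [150, 160, 165, 172, 180]
weights_kg = [52, 58, 63, 70, 77]

r = correlation(heights_cm, weights_kg)

# Changing units (cm to inches) leaves the correlation unchanged.
heights_in = [h / 2.54 for h in heights_cm]
r_inches = correlation(heights_in, weights_kg)

# A perfect curved relationship (y = x^2) can still have correlation 0.
xs = [-2, -1, 0, 1, 2]
r_curve = correlation(xs, [x ** 2 for x in xs])

print(round(r, 3), round(r_inches, 3), round(r_curve, 3))
```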
Key Formula Summary
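- Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard Deviation: $s = \sqrt{s^2}$
- Correlation Coefficient: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$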
