#sdsc5002

Data Fundamentals

What is Data?

Data are values obtained by measuring certain variables from individuals (people, objects, etc.).

Types of Variables

  • Categorical Variables (Qualitative)

    • Examples: Gender, blood type, disease status
    • If categories can be ordered, they are called ordinal categorical variables (e.g., course grades, COVID-19 severity)
  • Numerical Variables (Quantitative)

    • Examples: Height, weight, age, income, blood pressure
    • Only numerical variables support arithmetic operations

Data Table Structure

  • Columns: Correspond to variables

  • Rows: Correspond to individuals or observations; the number of rows is denoted $n$

| Song | Artist | Genre | Size (MB) | Length (sec) |
| --- | --- | --- | --- | --- |
| My Friends | D. Williams | Alternative | 3.83 | 247 |
| Up the Road | E. Clapton | Rock | 5.62 | 378 |
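As a rough illustration of the rows-and-columns structure, the two songs can be held as a list of records in Python. The name `variable_types` and the rule "numbers are numerical, strings are categorical" are assumptions made for this sketch; a column such as Song is really an identifier, and numeric codes that act as labels would need manual handling.

```python
# Each row is one observation (a song); each key is one variable (a column).
songs = [
    {"Song": "My Friends", "Artist": "D. Williams", "Genre": "Alternative",
     "Size (MB)": 3.83, "Length (sec)": 247},
    {"Song": "Up the Road", "Artist": "E. Clapton", "Genre": "Rock",
     "Size (MB)": 5.62, "Length (sec)": 378},
]

n = len(songs)  # number of observations (rows)

# Crude classification by stored type (an assumption of this sketch):
# numbers are treated as numerical, strings as categorical.
variable_types = {
    name: "numerical" if isinstance(value, (int, float)) else "categorical"
    for name, value in songs[0].items()
}

print(n)
for name, kind in variable_types.items():
    print(f"{name}: {kind}")
```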

Data Collection Methods

  • Observational Method: Direct observation or comparison

  • Testing and Experimentation: Measurement using tools (e.g., tape measure)

  • Survey Method:

    • Questionnaires
    • Interviews
    • Email/Phone
  • Document Analysis: Such as reviewing medical records

📌 Note: The order in which questions appear in a questionnaire may affect how people respond (e.g., asking specific questions before general ones, or vice versa)

Exploratory Data Analysis (EDA)

Definition and Goals of EDA

EDA, proposed by John Tukey in the 1970s, is a preliminary approach to examining data. It mainly involves:

  1. Examining each variable
  2. Examining relationships between variables

Methods are divided into:

  • Numerical Summaries (calculating numbers)

  • Graphical Summaries (plotting charts)

Distribution

  • Describes possible values of a variable and their frequencies

  • Key questions:

    • What are the possible values of the variable?
    • How frequently do these values occur?

Summary of Categorical Variables

Numerical Summary

Use counts or percentages to describe the distribution of each category:

| Education Level | Count (millions) | Percentage (%) |
| --- | --- | --- |
| Below High School | 4.7 | 12.3 |
| High School Graduate | 11.8 | 30.7 |
| Some College | 10.9 | 28.3 |
| Bachelor’s Degree | 8.5 | 22.1 |
| Advanced Degree | 2.5 | 6.6 |
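A count-and-percentage summary like this can be computed from raw observations in a few lines. The sketch below uses a small made-up list of blood types (hypothetical data, not from the course) and Python's `collections.Counter`.

```python
from collections import Counter

# Hypothetical observations of one categorical variable (blood type).
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "O", "A", "B"]

counts = Counter(blood_types)   # count per category
n = len(blood_types)            # total number of observations
percentages = {category: 100 * count / n for category, count in counts.items()}

# most_common() lists categories from most to least frequent.
for category, count in counts.most_common():
    print(f"{category:>2}  count={count}  percent={percentages[category]:.1f}%")
```

Listing the categories by descending count is also the ordering a Pareto chart uses for its bars.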

Graphical Summary

  • Bar Chart: Bar height proportional to count/percentage

    • If the bars are sorted from most to least frequent, the chart is called a Pareto chart
  • Pie Chart: Sector area proportional to count/percentage

💡 Tip: Bar charts are easier for comparing actual values than pie charts (comparing heights is more intuitive than comparing angles)


Summary of Numerical Variables

Graphical Summary: Histogram

  1. Divide the range into equal-width intervals

  2. Draw rectangles for each interval, with height proportional to the number of observations in that interval

Example: Histogram of “sepal length” variable in the iris dataset, range [4, 8], divided into 8 equal-width intervals
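The two steps can be sketched without any plotting library. The values below are made up for illustration (they are not the actual iris measurements); the range [4, 8] and the 8 intervals follow the example above, and the rule that a value equal to the upper bound goes into the last interval is an assumption of this sketch.

```python
# Hypothetical measurements (not the actual iris data).
values = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 5.8, 6.1, 6.3, 6.7, 7.2, 7.9]

low, high, num_bins = 4.0, 8.0, 8
width = (high - low) / num_bins  # equal-width intervals of 0.5

counts = [0] * num_bins
for v in values:
    # Index of the interval [low + k*width, low + (k+1)*width) that contains v.
    k = int((v - low) / width)
    # A value exactly equal to `high` is placed in the last interval.
    k = min(k, num_bins - 1)
    counts[k] += 1

# Text histogram: one '#' per observation in each interval.
for k, c in enumerate(counts):
    left = low + k * width
    print(f"[{left:.1f}, {left + width:.1f})  {'#' * c}")
```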

Density Curves and Density Plots

  • Density Curve: A smooth curve whose area over any range of values approximates the proportion of observations in that range (the total area is 1)

  • Density Plot: Smoothed using kernel density estimation to show data distribution more continuously
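To make kernel density estimation concrete, here is a minimal sketch using a Gaussian kernel. The helper name `gaussian_kde`, the data, and the bandwidth of 0.4 are all assumptions chosen for illustration; in practice the bandwidth is a tuning choice and plotting libraries usually select one automatically.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x (Gaussian kernel)."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian "bump" centred on each observation.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

# Hypothetical measurements and an assumed bandwidth.
data = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 5.8, 6.1, 6.3, 6.7, 7.2, 7.9]
density = gaussian_kde(data, bandwidth=0.4)

# The area under a density curve should be about 1 (checked with a Riemann sum).
step = 0.01
grid = [i * step for i in range(1201)]  # 0.00 to 12.00
area = sum(density(x) for x in grid) * step

print(f"density at 5.8: {density(5.8):.3f}")
print(f"approximate total area: {area:.3f}")
```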

Describing Distribution Shapes

  • Unimodal: One main peak

  • Bimodal: Two main peaks

  • Symmetric: Symmetrical around the center

  • Skewed to the right: Longer tail on the right

  • Skewed to the left: Longer tail on the left


Numerical Summary

Measures of Central Tendency

  • Mean: Average value

    $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

  • Median: 50th percentile

    • If the number of observations is even, take the average of the middle two values

🔍 Insight: Median is more robust to extreme values (outliers)
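A quick way to see this robustness is to add one extreme value to a small sample and compare how the two measures react. The income figures are hypothetical.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands).
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [400]  # add one extreme value

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

The mean jumps from 35 to roughly 95.8, while the median only moves from 35 to 36.5 (the average of the middle two values, since the count is now even).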

Measures of Variability

  • Variance: “Average” of squared deviations from the mean

    $$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}$$

  • Standard Deviation (SD): Square root of variance

    $$s = \sqrt{s^2}$$

Why divide by $n-1$? To obtain an unbiased estimate of the variance (the effect is small when $n$ is large)
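The formulas can be checked step by step against Python's `statistics` module, where `variance` divides by $n-1$ and `pvariance` divides by $n$. The sample values are hypothetical.

```python
from statistics import mean, pvariance, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(x)
x_bar = mean(x)

# Sum of squared deviations from the mean.
sum_sq_dev = sum((xi - x_bar) ** 2 for xi in x)

s2 = sum_sq_dev / (n - 1)  # sample variance (divide by n - 1)
s = s2 ** 0.5              # sample standard deviation

print("by hand:      ", s2, s)
print("statistics:   ", variance(x), stdev(x))
print("divide by n:  ", pvariance(x))
```

For this sample the sum of squared deviations is 32, so dividing by $n-1 = 7$ gives about 4.57, while dividing by $n = 8$ gives 4.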

Other Measures of Variability

  • Range: Maximum value - Minimum value

  • Interquartile Range (IQR): Difference between the third quartile and first quartile

    $$\mathrm{IQR} = Q_3 - Q_1$$

Five-Number Summary and Boxplot

  • Five-Number Summary: Minimum, Q1, Median, Q3, Maximum

  • Boxplot:

    • Box: From Q1 to Q3, with median inside
    • Whiskers: Extend to the minimum and maximum values, or, under the common 1.5×IQR rule, to the farthest observations that lie within 1.5×IQR below Q1 and above Q3
    • Outliers: Points outside the whiskers
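The five-number summary and the 1.5×IQR rule can be sketched as follows. The data are hypothetical. Quartile conventions differ between textbooks and software, so the `"inclusive"` method of `statistics.quantiles` used here is an assumption, and other tools may report slightly different values for Q1 and Q3.

```python
from statistics import quantiles

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical values

# Quartiles (one of several common conventions; an assumption of this sketch).
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (data[0], q1, med, q3, data[-1])

# 1.5 x IQR rule: points beyond the fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print("five-number summary:", five_number_summary)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the upper whisker stops at 9, the largest value inside the upper fence, and 30 is drawn as a separate outlier point.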

⚠ Warning: Boxplots and numerical summaries alone may not fully describe distribution shapes (e.g., bimodal distributions); supplement with histograms


Relationships Between Two Variables

Two Categorical Variables: Contingency Table

  • Contains counts or proportions (percentages)

  • Can compute:

    • Joint Distribution: Proportion in each cell
    • Marginal Distribution: Distribution of a single variable
    • Conditional Distribution: Distribution given the value of another variable
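The three distributions differ only in what the counts are divided by. The sketch below uses a hypothetical 2×2 table of gender by disease status; the counts and variable names are made up for illustration.

```python
# Hypothetical contingency table: (gender, disease status) -> count.
counts = {
    ("Male", "Yes"): 20, ("Male", "No"): 30,
    ("Female", "Yes"): 10, ("Female", "No"): 40,
}
total = sum(counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {cell: c / total for cell, c in counts.items()}

# Marginal distributions: sum the joint proportions over the other variable.
gender_marginal, disease_marginal = {}, {}
for (gender, disease), p in joint.items():
    gender_marginal[gender] = gender_marginal.get(gender, 0) + p
    disease_marginal[disease] = disease_marginal.get(disease, 0) + p

# Conditional distribution of disease status given gender:
# each cell divided by its row total.
row_totals = {}
for (gender, _), c in counts.items():
    row_totals[gender] = row_totals.get(gender, 0) + c
disease_given_gender = {
    (gender, disease): c / row_totals[gender]
    for (gender, disease), c in counts.items()
}

print(joint)
print(gender_marginal, disease_marginal)
print(disease_given_gender)
```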

Simpson’s Paradox

When considering a third variable (hidden or confounding variable), the direction of association between two variables may reverse
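A small numerical sketch shows how the reversal can happen. All counts are hypothetical: two treatments, A and B, are compared within a "mild" and a "severe" group, and treatment A is mostly given to severe cases.

```python
# Hypothetical (successes, patients) per treatment within each severity group.
data = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

# Success rate of each treatment within each group.
within = {
    group: {t: s / m for t, (s, m) in treatments.items()}
    for group, treatments in data.items()
}

# Success rate of each treatment after pooling the groups.
overall = {}
for t in ("A", "B"):
    successes = sum(data[group][t][0] for group in data)
    patients = sum(data[group][t][1] for group in data)
    overall[t] = successes / patients

for group, rates in within.items():
    print(f"{group:>6}: A={rates['A']:.2f}  B={rates['B']:.2f}")
print(f"pooled: A={overall['A']:.2f}  B={overall['B']:.2f}")
```

Treatment A has the higher success rate in both groups (0.90 vs 0.80 and 0.30 vs 0.20), yet B looks better once the groups are pooled, because severity affects both which treatment is given and how likely success is.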

Two Numerical Variables: Scatterplot

  • X-axis and Y-axis represent two variables

  • Each observation is a point

  • Can add categorical variables (using different colors/symbols)

Interpreting Scatterplots:

  • Form: Clustering, linear association, etc.

  • Direction:

    • Positive correlation: When one variable is above average, the other tends to be above average
    • Negative correlation: When one variable is above average, the other tends to be below average
  • Strength: How closely points follow the form

  • Outliers: Points deviating from the overall pattern

Correlation Coefficient (Correlation, $r$)

  • Measures the direction and strength of linear relationship between two numerical variables

  • Range: $[-1, 1]$

  • Characteristics:

    • Unitless
    • Unaffected by measurement units
    • Only measures linear relationships, not curved ones
    • Sensitive to outliers
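Two of these characteristics can be checked numerically with a direct implementation of the formula for $r$: the sum of products of deviations, divided by $(n-1) s_x s_y$. The function name `correlation` and all data values are assumptions made for this sketch.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation: sum of cross-deviations over (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * stdev(x) * stdev(y))

# Hypothetical heights and weights.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 75, 86]

# Changing the measurement unit (cm to inches) should leave r unchanged.
height_in = [h / 2.54 for h in height_cm]
r_cm = correlation(height_cm, weight_kg)
r_in = correlation(height_in, weight_kg)

# A perfect but curved relationship (y = x^2) can still give r = 0.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]
r_curved = correlation(x, y)

print(f"r (cm) = {r_cm:.4f}, r (inches) = {r_in:.4f}")
print(f"r for y = x^2: {r_curved:.4f}")
```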

📌 Note: Correlation does not imply causation


Key Formula Summary

  • Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

  • Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

  • Standard Deviation: $s = \sqrt{s^2}$

  • Correlation Coefficient: $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y}$