#sdsc5002 #course information

Course Overview

Course Code: SDSC5002C61

Course Name: Exploratory Data Analysis and Visualization

Semester: First Semester, 2025/26 Academic Year

Instructor: Professor Wang Lijia

Email: lijiwang@cityu.edu.hk

Office: Room 16-272, Lau Ming Wai Academic Building

Lecture Time: To be specified (please check Canvas for updates)

Office Hours: To be specified

Teaching Mode: Face-to-face

Teaching Assistants:

  • Li Minghe (mingheli2-c@my.cityu.edu.hk), responsible for Tableau

  • Yin Yanxin (wl.z@cityu.edu.hk), responsible for Python

Assessment Methods

| Assessment Item | Description | Weight or Points |
| --- | --- | --- |
| Group Project | Requires 4-8 person teams, presentations in Weeks 11-13, evaluating teamwork and data analysis skills. | 40% |
| Individual Assignment | Graded based on assignment performance, focusing on individual practical skills. | 25% |
| Quizzes | 2 points for on-time submission, 1 point for late submission, assessing timely participation and understanding. | Point-based (contributes to overall score) |
| Assignments | Graded based on performance, full score 10 points, evaluating quality of task completion. | 10 points |
| Midterm Exam | Held in Week 10, no final exam, testing theoretical knowledge and application skills. | 35% |

Schedule and Teaching

| Week | Date | Activity | Content | Deadline |
| --- | --- | --- | --- | --- |
| 1 | September 6 | Lecture | Key concepts of exploratory data analysis and visualization | |
| 2 | September 13 | Lecture | Statistical analysis and visualization for machine learning | |
| | | Tutorial 1 | Introduction to Python and Tableau | |
| 3 | September 20 | Lecture | High-dimensional data visualization | |
| | | Tutorial 2 | Practical data exploration (e.g., Iris dataset) | |
| 4 | September 27 | Lecture | Machine learning visualization, linear regression | |
| | | Tutorial 3 | Cross-validation and linear regression practice | |
| 5 | October 4 | National Day | No class | Group formation deadline |
| 6 | October 11 | Lecture | Model selection, regularization, classification methods | |
| | | Tutorial 4 | Subset selection, shrinkage methods, PCR and PLS | |
| 7 | October 18 | Lecture | Classification methods, midterm exam Q&A | |
| | | Tutorial 5 | Classification methods practice | |
| 8 | October 25 | Lecture | High-dimensional data techniques | |
| 9 | November 1 | Chung Yeung Festival | No class | Project proposal submission |
| 10 | November 8 | Midterm Exam | Midterm test (closed-book; one A4 cheat sheet allowed) | |
| 11 | November 15 | Lecture | Network visualization | Project presentations begin |
| 12 | November 22 | Project presentations | | |
| 13 | November 29 | Project presentations, course summary | | Final project report submission |

Project Requirements (40% of Total Score)

The project is a core component of the course “Exploratory Data Analysis and Visualization” and gives students hands-on practice in applying exploratory data analysis (EDA) and visualization techniques. The requirements below follow the project proposal guidelines in the course document.

Project Overview

The project requires students to complete a full data analysis project in groups (4-8 people), from topic selection to final presentation. The project proposal must be submitted by Week 9 (Chung Yeung Festival), with presentations in Weeks 11-13. The project aims to develop teamwork, data cleaning, analysis, and visualization skills, while emphasizing clear research questions and reasonable methods.

Project Proposal Components

The project proposal must include the following sections, each requiring detailed description to ensure clarity and feasibility.

1. Title

  • Requirements: Provide a clear and concise title that accurately reflects the project theme. The title should directly relate to the chosen dataset and research questions, avoiding vague or overly broad statements.

  • Examples: “Visualization Analysis of Transmission Patterns Based on COVID-19 Data” or “Consumer Behavior Analysis: Smartphone Brand Switching Trends.”

2. Introduction

  • Requirements: Briefly describe the motivation and background of the project. Explain why the topic was chosen, including its practical significance, relevance, or academic value. The introduction should provide context to help readers understand the project’s importance.

  • Key Points:

    • Motivation: e.g., based on current social issues, industry trends, or personal interest.
    • Relevance: Explain how the topic relates to data science, machine learning, or visualization techniques.
    • Background information: Briefly overview relevant fields or prior research (if applicable).

3. Research Questions

  • Requirements: Specify 1-3 clear research questions that will guide the entire analysis process. The questions should be specific, measurable, and answerable through EDA and visualization.

  • Example Questions:

    • “What factors influence smartphone users’ brand loyalty?”
    • “Which variables in COVID-19 data are correlated with transmission rates?”
    • “How can visualization identify anomalous patterns in the data?”
  • Importance: Research questions should serve as the project framework, ensuring focused and directed analysis.

4. Dataset Description

  • Requirements: Describe the selected dataset in detail, including source, size, characteristics, and suitability. This section must demonstrate that the dataset is adequate to support the research questions.

  • Key Points:

    • Source: Provide the dataset’s access link or citation (e.g., from Kaggle, UCI Machine Learning Repository, government open data, etc.).
    • Size: State the number of records (rows) and features (columns), e.g., “The dataset contains 10,000 records and 20 features.”
    • Characteristics: Describe data types (e.g., numerical, categorical, time series), any special features (e.g., missing values, outliers), and the basic structure of the data.
    • Suitability: Explain why this dataset is appropriate for answering the research questions, e.g., it contains relevant variables or sufficient historical data.
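
The size and characteristics items above can be reported directly from the data. The following is a minimal sketch, assuming a small hypothetical CSV whose column names and values are invented for illustration. It uses only Python's standard library so that it runs without extra packages; in practice a library such as pandas would typically be used for the same purpose.

```python
import csv
import io

# Hypothetical sample data, invented purely for illustration.
SAMPLE_CSV = """age,income,city
34,52000,Hong Kong
29,,Shenzhen
41,61000,Hong Kong
"""

def describe_dataset(csv_text):
    """Return the row count, column count, and missing values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # An empty string in a cell is treated as a missing value.
    missing = {
        col: sum(1 for row in rows if row[col] in ("", None))
        for col in columns
    }
    return {"n_rows": len(rows), "n_cols": len(columns), "missing": missing}

summary = describe_dataset(SAMPLE_CSV)
print(summary)
```

The returned counts map onto the "records (rows) and features (columns)" statement and the missing-value note that the proposal asks for.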

5. EDA and Data Visualization Plan

  • Requirements: Describe in detail the EDA techniques and visualization methods to be used. The plan should include data cleaning, summary statistics, visualization types, and correlation analysis.

  • Specific Steps:

    • Data Cleaning:
      • Handling missing values: Use interpolation, deletion, or imputation methods.
      • Handling duplicates: Identify and remove duplicate records.
      • Handling outliers: Detect and address outliers using statistical methods (e.g., IQR) or visualizations (e.g., box plots).
    • Summary Statistics:
      • Calculate basic statistics: mean, median, mode, standard deviation, quantiles, etc., to describe data distribution.
      • Present statistical results using tables or brief reports.
    • Data Visualization:
      • List the types of visualizations planned, e.g.:
        • Histograms: For distribution analysis.
        • Scatter plots: For exploring relationships between variables.
        • Box plots: For comparing group differences.
        • Heatmaps: For correlation visualization.
        • Time series plots: For trend analysis (if the data includes a time element).
      • Explain how each visualization helps identify trends, patterns, or anomalies, e.g., “Scatter plots will be used to explore the relationship between education years and income.”
    • Correlation Analysis:
      • Use correlation coefficients (e.g., Pearson or Spearman) or visualization tools (e.g., scatter plot matrices) to explore variable relationships.
      • Discuss how these analyses support the research questions.

6. Optional Methods

  • Requirements: Describe any additional methods, such as machine learning algorithms or statistical tests, to enhance the analysis. This section is optional but encouraged to add depth to the project.

  • Possible Methods:

    • Classification: e.g., using logistic regression or decision trees to predict categorical variables.
    • Clustering: e.g., using K-means for customer segmentation.
    • Regression: e.g., linear regression for predicting continuous variables.
    • Hypothesis Testing: e.g., t-tests or ANOVA for comparing group differences.
    • Feature Selection and Dimensionality Reduction: e.g., random forest importance scores for ranking features, or PCA for reducing dimensionality.
  • Rationale: Explain why these methods are suitable for the project and how they complement EDA and visualization.

7. Expected Outcomes

  • Requirements: Discuss the potential findings and insights from the project. Expected outcomes should be based on the research questions and include visualization results and data analysis conclusions.

  • Key Points:

    • Insights: e.g., identifying key trends, patterns, or potential causal relationships, which EDA can suggest but not confirm.
    • Visualization Outputs: Describe the visualizations to be generated and their expected impact (e.g., aiding decision-making or communicating results).
    • Application Value: Explain how the outcomes can be applied in the real world, e.g., providing recommendations for policy-making or business strategies.

Grading Criteria

The project proposal will be graded based on the following criteria, with a total score of 10 points:

  • Clarity (3 points): Is the proposal well-structured and logically coherent? Does the introduction effectively set the project background?

  • Justification (4 points): Is the topic choice persuasive? Does it demonstrate importance and relevance?

  • Research Plan Rationality (3 points): Are the proposed methods (e.g., EDA, visualization, or other techniques) aligned with the research questions? Is the plan feasible and comprehensive?

Additional Notes

  • Teamwork: The project must be completed in groups. Division of labor is encouraged, e.g., some members handle data cleaning while others handle visualization.

  • Tool Usage: Python and Tableau, as taught in the course, are recommended, but other tools (e.g., R or D3.js) are acceptable if they present the results effectively.

  • Presentation Requirements: During the presentations in Weeks 11-13, groups must showcase visualization results and the analysis process, emphasizing communication and presentation skills.

Following these requirements helps ensure that the project proposal is complete and well organized, and that it meets the course objectives.