SDSC5002 Course Information
#sdsc5002 #course information
Course Overview
Course Code: SDSC5002C61
Course Name: Exploratory Data Analysis and Visualization
Semester: First Semester, 2025/26 Academic Year
Instructor: Professor Wang Lijia
Email: lijiwang@cityu.edu.hk
Office: Room 16-272, Lau Pak Here Building
Lecture Time: To be specified (please check Canvas for updates)
Office Hours: To be specified
Teaching Mode: Face-to-face
Teaching Assistants:
- Li Minghe (mingheli2-c@my.cityu.edu.hk), responsible for Tableau
- Yin Yanxin (wl.z@cityu.edu.hk), responsible for Python
Assessment Methods
| Assessment Item | Description | Weight or Points |
|---|---|---|
| Group Project | Teams of 4-8 students; presentations in Weeks 11-13. Evaluates teamwork and data analysis skills. | 40% |
| Individual Assignment | Graded on assignment performance; focuses on individual practical skills. | 25% |
| Quizzes | 2 points for on-time submission, 1 point for late submission. Assesses timely participation and understanding. | Point-based (contributes to overall score) |
| Assignments | Graded on performance, out of 10 points. Evaluates quality of task completion. | 10 points |
| Midterm Exam | Held in Week 10; there is no final exam. Tests theoretical knowledge and application skills. | 35% |
Schedule and Teaching
| Week | Date | Activity | Content | Deadline |
|---|---|---|---|---|
| 1 | September 6 | Lecture | Key concepts of exploratory data analysis and visualization | |
| 2 | September 13 | Lecture | Statistical analysis and visualization for machine learning | |
| | | Tutorial 1 | Introduction to Python and Tableau | |
| 3 | September 20 | Lecture | High-dimensional data visualization | |
| | | Tutorial 2 | Practical data exploration (e.g., Iris dataset) | |
| 4 | September 27 | Lecture | Machine learning visualization, linear regression | |
| | | Tutorial 3 | Cross-validation and linear regression practice | |
| 5 | October 4 | National Day | No class | Group formation deadline |
| 6 | October 11 | Lecture | Model selection, regularization, classification methods | |
| | | Tutorial 4 | Subset selection, shrinkage methods, PCR and PLS | |
| 7 | October 18 | Lecture | Classification methods, midterm exam Q&A | |
| | | Tutorial 5 | Classification methods practice | |
| 8 | October 25 | Lecture | High-dimensional data techniques | |
| 9 | November 1 | Chung Yeung Festival | No class | Project proposal submission |
| 10 | November 8 | Midterm Exam | Midterm test (closed-book; one A4 cheat sheet allowed) | |
| 11 | November 15 | Lecture | Network visualization | Project presentations begin |
| 12 | November 22 | Project presentations | | |
| 13 | November 29 | Project presentations, course summary | | Final project report submission |
Project Requirements (40% of Total Score)
The project is a core component of the course “Exploratory Data Analysis and Visualization”: students apply exploratory data analysis (EDA) and visualization techniques to a practical problem. The requirements below follow the project proposal guidelines in the course document.
Project Overview
The project requires students to complete a full data analysis project in groups (4-8 people), from topic selection to final presentation. The project proposal must be submitted by Week 9 (Chung Yeung Festival), with presentations in Weeks 11-13. The project aims to develop teamwork, data cleaning, analysis, and visualization skills, while emphasizing clear research questions and reasonable methods.
Project Proposal Components
The project proposal must include the following sections, each requiring detailed description to ensure clarity and feasibility.
1. Title
- Requirements: Provide a clear and concise title that accurately reflects the project theme. The title should directly relate to the chosen dataset and research questions, avoiding vague or overly broad statements.
- Example: “Visualization Analysis of Transmission Patterns Based on COVID-19 Data” or “Consumer Behavior Analysis: Smartphone Brand Switching Trends.”
2. Introduction
- Requirements: Briefly describe the motivation and background of the project. Explain why the topic was chosen, including its practical significance, relevance, or academic value. The introduction should provide context to help readers understand the project’s importance.
- Key Points:
  - Motivation: e.g., based on current social issues, industry trends, or personal interest.
  - Relevance: Explain how the topic relates to data science, machine learning, or visualization techniques.
  - Background information: Briefly summarize relevant fields or prior research (if applicable).
3. Research Questions
- Requirements: Specify 1-3 clear research questions that will guide the entire analysis process. The questions should be specific, measurable, and answerable through EDA and visualization.
- Example Questions:
  - “What factors influence smartphone users’ brand loyalty?”
  - “Which variables in COVID-19 data are correlated with transmission rates?”
  - “How can visualization identify anomalous patterns in the data?”
- Importance: Research questions should serve as the project framework, ensuring focused and directed analysis.
4. Dataset Description
- Requirements: Describe the selected dataset in detail, including source, size, characteristics, and suitability. This section must demonstrate that the dataset is adequate to support the research questions.
- Key Points:
  - Source: Provide the dataset’s access link or citation (e.g., from Kaggle, UCI Machine Learning Repository, government open data, etc.).
  - Size: State the number of records (rows) and features (columns), e.g., “The dataset contains 10,000 records and 20 features.”
  - Characteristics: Describe data types (e.g., numerical, categorical, time series), any special features (e.g., missing values, outliers), and the basic structure of the data.
  - Suitability: Explain why this dataset is appropriate for answering the research questions, e.g., it contains relevant variables or sufficient historical data.
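The size and characteristics items above can usually be read directly off a loaded table. The following is a minimal sketch in Python with pandas, assuming a small invented table in place of a real dataset; the column names (`age`, `brand`, `income`) and the helper `describe_dataset` are hypothetical and do not come from the course materials.

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame) -> dict:
    """Collect the basic facts a dataset description usually reports."""
    return {
        "n_records": df.shape[0],                    # number of rows
        "n_features": df.shape[1],                   # number of columns
        "dtypes": df.dtypes.astype(str).to_dict(),   # data type of each column
        "missing": df.isna().sum().to_dict(),        # missing values per column
        "n_duplicates": int(df.duplicated().sum()),  # fully duplicated rows
    }


# Toy data, invented purely for illustration (not a real dataset).
toy = pd.DataFrame(
    {
        "age": [23, 35, 35, None, 51],
        "brand": ["A", "B", "B", "A", "C"],
        "income": [3200.0, 5400.0, 5400.0, 4100.0, 7800.0],
    }
)

summary = describe_dataset(toy)
print(
    f"The dataset contains {summary['n_records']} records "
    f"and {summary['n_features']} features."
)
print("Missing values per column:", summary["missing"])
print("Duplicate rows:", summary["n_duplicates"])
```

For a real project, `toy` would be replaced by the table loaded from the chosen source, for example with `pd.read_csv`.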
5. EDA and Data Visualization Plan
- Requirements: Describe in detail the EDA techniques and visualization methods to be used. The plan should include data cleaning, summary statistics, visualization types, and correlation analysis.
- Specific Steps:
  - Data Cleaning:
    - Handling missing values: Use interpolation, deletion, or imputation methods.
    - Handling duplicates: Identify and remove duplicate records.
    - Handling outliers: Detect and address outliers using statistical methods (e.g., IQR) or visualizations (e.g., box plots).
  - Summary Statistics:
    - Calculate basic statistics: mean, median, mode, standard deviation, quantiles, etc., to describe data distribution.
    - Present statistical results using tables or brief reports.
  - Data Visualization:
    - List the types of visualizations planned, e.g.:
      - Histograms: For distribution analysis.
      - Scatter plots: For exploring relationships between variables.
      - Box plots: For comparing group differences.
      - Heatmaps: For correlation visualization.
      - Time series plots: For trend analysis (if the data includes a time element).
    - Explain how each visualization helps identify trends, patterns, or anomalies, e.g., “Scatter plots will be used to explore the relationship between education years and income.”
  - Correlation Analysis:
    - Use correlation coefficients (e.g., Pearson or Spearman) or visualization tools (e.g., scatter plot matrices) to explore variable relationships.
    - Discuss how these analyses support the research questions.
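The cleaning, summary-statistics, and correlation steps above can be sketched as follows. This is a minimal illustration with pandas under stated assumptions: the data are invented, the column names `education_years` and `income` are hypothetical, median imputation is only one of the listed options for missing values, and the 1.5 × IQR fence is a common convention rather than a course requirement.

```python
import pandas as pd

# Invented toy data: one duplicate row, one missing value, one extreme income.
raw = pd.DataFrame(
    {
        "education_years": [9, 12, 12, 16, 16, 18, 12, 14],
        "income": [2100.0, 3000.0, 3000.0, 4800.0, None, 6000.0, 3100.0, 50000.0],
    }
)

# 1. Data cleaning
clean = raw.drop_duplicates().copy()                                  # remove duplicate records
clean["income"] = clean["income"].fillna(clean["income"].median())    # median imputation (assumed choice)

# Flag outliers with the 1.5 * IQR rule (assumed convention).
q1 = clean["income"].quantile(0.25)
q3 = clean["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (clean["income"] < q1 - 1.5 * iqr) | (clean["income"] > q3 + 1.5 * iqr)
no_outliers = clean[~is_outlier]

# 2. Summary statistics
income_stats = no_outliers["income"].agg(["mean", "median", "std"])
education_mode = no_outliers["education_years"].mode().tolist()

# 3. Correlation analysis
pearson = no_outliers.corr(method="pearson")
spearman = no_outliers.corr(method="spearman")

print(income_stats)
print("Mode of education_years:", education_mode)
print(pearson.round(2))
print(spearman.round(2))
```

Dropping flagged outliers is only one way to address them; a proposal could instead keep them and report results with and without, depending on the research questions.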
6. Optional Methods
- Requirements: Describe any additional methods, such as machine learning algorithms or statistical tests, to enhance the analysis. This section is optional but encouraged to add depth to the project.
- Possible Methods:
  - Classification: e.g., using logistic regression or decision trees to predict categorical variables.
  - Clustering: e.g., using K-means for customer segmentation.
  - Regression: e.g., linear regression for predicting continuous variables.
  - Hypothesis Testing: e.g., t-tests or ANOVA for comparing group differences.
  - Feature Selection and Dimensionality Reduction: e.g., random forest importance scores to rank features, or PCA to reduce dimensionality.
- Rationale: Explain why these methods are suitable for the project and how they complement EDA and visualization.
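As one illustration of the hypothesis-testing option, the sketch below computes Welch's t statistic (the unequal-variance form of the two-sample t-test) using only the Python standard library. The group values are invented and the function name `welch_t` is hypothetical; in practice a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` would typically be used, since it also returns the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / group size).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


# Invented measurements for two groups, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2]
group_b = [6.3, 6.7, 6.1, 7.0, 6.4]

t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A large absolute t value relative to the degrees of freedom suggests that the group means differ; the proposal should state which comparison the test is meant to support.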
7. Expected Outcomes
- Requirements: Discuss the potential findings and insights from the project. Expected outcomes should be based on the research questions and include visualization results and data analysis conclusions.
- Key Points:
  - Insights: e.g., identifying key trends, patterns, or causal relationships.
  - Visualization Outputs: Describe the visualizations to be generated and their expected impact (e.g., aiding decision-making or communicating results).
  - Application Value: Explain how the outcomes can be applied in the real world, e.g., providing recommendations for policy-making or business strategies.
Grading Criteria
The project proposal will be graded based on the following criteria, with a total score of 10 points:
- Clarity (3 points): Is the proposal well-structured and logically coherent? Does the introduction effectively set the project background?
- Justification (4 points): Is the topic choice persuasive? Does it demonstrate importance and relevance?
- Research Plan Rationality (3 points): Are the proposed methods (e.g., EDA, visualization, or other techniques) aligned with the research questions? Is the plan feasible and comprehensive?
Additional Notes
- Teamwork: The project must be completed in groups. Division of labor is encouraged, e.g., some members handle data cleaning while others handle visualization.
- Tool Usage: Python and Tableau, as taught in the course, are recommended, but other tools (e.g., R or D3.js) are acceptable if they present the results effectively.
- Presentation Requirements: During the presentations in Weeks 11-13, groups must showcase visualization results and the analysis process, emphasizing communication and presentation skills.
Following these requirements helps ensure that the project proposal is complete and feasible and that it meets the course objectives.
