Welcome to Practical Data Science!

The course site for Duke MIDS Fall 2019 Practical Data Science Course

If you are not a Duke Masters in Data Science student, please see this page about how best to use this site!

Warning: This site is a work in progress and is subject to change

Data Science is an intrinsically applied field, and yet all too often students are taught the advanced math and statistics behind data science tools, but are left to fend for themselves when it comes to learning the tools we use to do data science on a day-to-day basis or how to manage actual projects. This course is designed to fill that gap.

This course will be divided into two parts:

  • Part 1: Data Wrangling: In Part 1 of this course, students will develop hands-on experience manipulating real-world data using a range of data science tools (including the command line, Python, Jupyter, Git, and GitHub).
  • Part 2: Answering Questions: This course adopts the view that Data Science is about answering important questions using quantitative data. In Part 2 of this course, students will learn to develop data science projects that achieve this goal via backwards design, and learn tips for managing projects from inception to presentation of results.

The first portion of the course will provide students with extensive hands-on experience manipulating real (often messy, error-ridden, and poorly documented) data using a range of bread-and-butter data science tools, including the command line, Git, Python (especially NumPy and pandas), Jupyter notebooks, and more. The goal of these exercises is to ensure students are comfortable working with data in almost any form. In addition to being of intrinsic value, developing these skills will also ensure that in advanced statistics or machine learning courses, students can focus on understanding the concepts being taught rather than having to split their attention between concepts and the nuts and bolts of data manipulation required to complete assignments.

The first portion of the course will culminate in students completing the full data manipulation and analysis component of a data science project (the goal of the project will be provided). This will include everything from gathering data from third parties to cleaning and merging different data sources and analyzing the resultant data. These projects will be completed in teams using Git and GitHub to give students experience managing GitHub workflows.

In the second portion of the class, we will take a step back from the nuts and bolts of data manipulation and talk about how to approach the central task of data science: answering questions about the world. In particular, we’ll discuss how to use backwards design to plan data science projects, how to refine questions to ensure they are answerable, how to evaluate whether you’ve actually answered the question you set out to answer, and how to pick the most appropriate data science tool based on the question you seek to answer (this will be a bit of a preview of material we will engage with even more in Practical Data Science II in Spring 2020).

This portion of the course will culminate in students picking a topic, developing an answerable question, thinking about what (in very concrete terms) an answer to that question would look like, figuring out what tools they would employ to generate that answer, and developing a plan for finding the data they would need to actually execute their project.

Course Syllabus

The full syllabus for this course can be downloaded here. Please note that this syllabus is subject to change up until the first day of class.

Class Schedule

Date Day Topic Do Before Class In-Class Exercise
27-Aug Tues Intro N/A  
29-Aug Thurs Command Line Basics Link
3-Sep Tues Advanced Command Line Link
5-Sep Thurs Jupyter Lab / Notebooks Link
10-Sep Tues
  • IPython
  • Python v. R / variables as pointers
12-Sep Thurs Numpy Basics Link
17-Sep Tues Pandas: Series Link
19-Sep Thurs Pandas: DataFrames Link
24-Sep Tues Intro to Plotting with PlotNine Link
26-Sep Thurs Advanced Plotting Link
1-Oct Tues Pandas: Indices & Missing Link Link
3-Oct Thurs
  • Pandas: Loading and saving data
  • Pandas: Cleaning
8-Oct Tues
  • Pandas: Merging
  • JVP pp 149 - 157
10-Oct Thurs FALL BREAK    
15-Oct Tues Pandas: Reshaping
  • WM 8.3
17-Oct Thurs Pandas: Groupby / Split Apply Combine
  • WM Chapter 10
22-Oct Tues Collaborating using GitHub  
24-Oct Thurs Defensive Programming  
29-Oct Tues Getting Help Online  
31-Oct Thurs Pandas: Categorical Data; Eval and Query
  • WM 12.1
  • JVP pp 208 - 213
5-Nov Tues Statistics with statsmodels
  • WM Chapter 13
7-Nov Thurs Machine Learning with scikit-learn
  • JVP pp 331 - 359
12-Nov Tues Data Science: Questions    
14-Nov Thurs Data Science: Backwards Design    
19-Nov Tues Data Science: Backwards Design II    
21-Nov Thurs Data Science: Tool Selection    
26-Nov Tues Project Proposal Workshopping