Welcome to Practical Data Science!
The course site for Duke MIDS Fall 2019 Practical Data Science Course
If you are not a Duke Masters in Data Science student, please see this page about how best to use this site!
Data Science is an intrinsically applied field, and yet all too often students are taught the advanced math and statistics behind data science tools but are left to fend for themselves when it comes to learning the tools we use to do data science on a day-to-day basis or how to manage actual projects. This course is designed to fill that gap.
This course will be divided into two parts:
Part 1: Data Wrangling: In Part 1 of this course, students will develop hands-on experience manipulating real-world data using a range of data science tools (including the command line, Python, Jupyter, Git, and GitHub).
Part 2: Answering Questions: This course adopts the view that Data Science is about answering important questions using quantitative data. In Part 2 of this course, students will learn to develop data science projects that achieve this goal via backwards design, and learn tips for managing projects from inception to presentation of results.
The first portion of the course will provide students with extensive hands-on experience manipulating real (often messy, error-ridden, and poorly documented) data using a range of bread-and-butter data science tools (like the command line, Git, Python (especially NumPy and pandas), Jupyter notebooks, and more). The goal of these exercises is to ensure students are comfortable working with data in almost any form. In addition to being of intrinsic value, developing these skills will also ensure that in advanced statistics or machine learning courses, students can focus on understanding the concepts being taught rather than having to split their attention between concepts and the nuts and bolts of data manipulation required to complete assignments.
The first portion of the course will culminate in students completing the full data manipulation and analysis component of a data science project (the goal of the project will be provided). This will include everything from gathering data from third parties, to cleaning and merging different data sources, to analyzing the resulting data. These projects will be completed in teams using Git and GitHub to give students experience managing GitHub workflows.
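To give a flavor of what "cleaning, merging, and analyzing" can look like in practice, here is a minimal sketch using pandas. The two small tables, their column names, and the per-100,000 calculation are hypothetical and made up purely for illustration; they are not the actual course project or its data.

```python
import pandas as pd

# Hypothetical toy data standing in for two third-party sources.
sales = pd.DataFrame({
    "county": ["Durham", "Orange", "Wake", "Wake"],
    "units": [120, None, 300, 50],
})
population = pd.DataFrame({
    "county": ["Durham", "Orange", "Wake"],
    "population": [320_000, 148_000, 1_100_000],
})

# Clean: drop records where the quantity of interest is missing.
clean = sales.dropna(subset=["units"])

# Merge: attach population to each record. validate="m:1" raises an
# error if a county appears more than once in the population table.
merged = clean.merge(population, on="county", how="left", validate="m:1")

# Analyze: total units per county, scaled per 100,000 residents.
totals = merged.groupby("county", as_index=False).agg(
    units=("units", "sum"),
    population=("population", "first"),
)
totals["units_per_100k"] = totals["units"] / totals["population"] * 100_000

print(totals)
```

The `validate` argument is one small example of checking assumptions while merging: if the data do not have the shape you expect, the code fails loudly instead of silently producing duplicated rows.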
In the second portion of the class, we will take a step back from the nuts and bolts of data manipulation and talk about how to approach the central task of data science: answering questions about the world. In particular, we’ll discuss how to use backwards design to plan data science projects, how to refine questions to ensure they are answerable, how to evaluate whether you’ve actually answered the question you set out to answer, and how to pick the most appropriate data science tool based on the question you seek to answer (this will be a bit of a preview of material we will engage with even more in Practical Data Science II in Spring 2020).
This portion of the course will culminate in students picking a topic, developing an answerable question, thinking about what (in very concrete terms) an answer to that question would look like, figuring out what tools they would employ to generate that answer, and developing a plan for finding the data they would need to actually execute their project.
Course Syllabus
The full syllabus for this course can be downloaded here. Please note that this syllabus is subject to change up until the first day of class.
Class Schedule
| Date | Day | Rm | Topic | Do Before Class | In-Class Exercise |
|---|---|---|---|---|---|
| 27 Aug | Tues | 270 | Intro | N/A | |
| 29 Aug | Thurs | CC | Command Line Basics | | |
| 3 Sep | Tues | 270 | | | |
| 5 Sep | Thurs | 270 | | | |
| 10 Sep | Tues | 330 | NumPy Basics | | |
| 12 Sep | Thurs | 270 | Pandas: Series | | |
| 17 Sep | Tues | CC | Pandas: DataFrames | | |
| 19 Sep | Thurs | CC | Intro to Plotting with PlotNine | | |
| 24 Sep | Tues | 330 | Advanced Plotting | | |
| 26 Sep | Thurs | 270 | Pandas: Indices & Missing | | |
| 1 Oct | Tues | 270 | | | |
| 3 Oct | Thurs | 330 | | | |
| 8 Oct | Tues | 270 | Pandas: Reshaping | | |
| 10 Oct | Thurs | 330 | FALL BREAK | | |
| 15 Oct | Tues | 270 | Groupby / Split-Apply-Combine | | |
| 17 Oct | Thurs | 330 | Collaborating using GitHub | | |
| 22 Oct | Tues | 330 | Defensive Programming | | |
| 24 Oct | Thurs | 330 | Getting Help Online | | |
| 29 Oct | Tues | 270 | Pandas: Categorical Data; Eval and Query | | |
| 31 Oct | Thurs | 330 | Statistics with statsmodels | | |
| 5 Nov | Tues | 270 | Machine Learning with scikit-learn | | |
| 7 Nov | Thurs | 330 | Big Data: What is it, how do I work with it? | | |
| 12 Nov | Tues | 330 | Speed and Performance in Python | | |
| 14 Nov | Thurs | 330 | Data Science: Questions | | |
| 19 Nov | Tues | 270 | Data Science: Backwards Design | | |
| 21 Nov | Thurs | 330 | Data Science: Tool Selection | | |
| 26 Nov | Tues | 270 | Project Proposal Workshopping | | |
| 28 Nov | Thurs | | THANKSGIVING BREAK | | |
Reminders:
JVP: Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas.
WM: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second Edition by Wes McKinney.