Welcome to Practical Data Science!
The course site for the Duke MIDS Fall 2019 Practical Data Science course.
If you are not a Duke Masters in Data Science student, please see this page about how best to use this site!
Warning: This site is a work in progress and is subject to change.
Data Science is an intrinsically applied field, and yet all too often students are taught the advanced math and statistics behind data science tools but are left to fend for themselves when it comes to learning the tools we use to do data science on a day-to-day basis or how to manage actual projects. This course is designed to fill that gap.
This course will be divided into two parts:
 Part 1: Data Wrangling: In Part 1 of this course, students will develop hands-on experience manipulating real-world data using a range of data science tools (including the command line, Python, Jupyter, Git, and GitHub).
 Part 2: Answering Questions: This course adopts the view that Data Science is about answering important questions using quantitative data. In Part 2 of this course, students will learn to develop data science projects that achieve this goal via backwards design, and learn tips for managing projects from inception to presentation of results.
The first portion of the course will provide students with extensive hands-on experience manipulating real (often messy, error-ridden, and poorly documented) data using a range of bread-and-butter data science tools (like the command line, Git, Python (especially NumPy and pandas), Jupyter notebooks, and more). The goal of these exercises is to ensure students are comfortable working with data in almost any form. In addition to being of intrinsic value, developing these skills will also ensure that in advanced statistics or machine learning courses, students can focus on understanding the concepts being taught rather than having to split their attention between concepts and the nuts and bolts of data manipulation required to complete assignments.
The first portion of the course will culminate in students completing the full data manipulation and analysis component of a data science project (the goal of the project will be provided). This will include everything from gathering data from third parties to cleaning and merging different data sources and analyzing the resulting data. These projects will be completed in teams using Git and GitHub to give students experience managing GitHub workflows.
In the second portion of the class, we will take a step back from the nuts and bolts of data manipulation and talk about how to approach the central task of data science: answering questions about the world. In particular, we’ll discuss how to use backwards design to plan data science projects, how to refine questions to ensure they are answerable, how to evaluate whether you’ve actually answered the question you set out to answer, and how to pick the most appropriate data science tool based on the question you seek to answer (this will be a bit of a preview of material we will engage with even more in Practical Data Science II in Spring 2020).
This portion of the course will culminate in students picking a topic, developing an answerable question, thinking about what (in very concrete terms) an answer to that question would look like, figuring out what tools they would employ to generate that answer, and developing a plan for finding the data they would need to actually execute their project.
Course Syllabus
The full syllabus for this course can be downloaded here. Please note that this syllabus is subject to change up until the first day of class.
Class Schedule
Date  Day  Topic  Do Before Class  In-Class Exercise
27-Aug  Tues  Intro  N/A
29-Aug  Thurs  Command Line Basics  Link
3-Sep  Tues  Advanced Command Line  Link
5-Sep  Thurs  Jupyter Lab / Notebooks  Link
10-Sep  Tues  Link
12-Sep  Thurs  NumPy Basics  Link
17-Sep  Tues  Pandas: Series  Link
19-Sep  Thurs  Pandas: DataFrames  Link
24-Sep  Tues  Intro to Plotting with PlotNine  Link
26-Sep  Thurs  Advanced Plotting  Link
1-Oct  Tues  Pandas: Indices & Missing  Link Link
3-Oct  Thurs  Link
8-Oct  Tues
10-Oct  Thurs  FALL BREAK
15-Oct  Tues  Pandas: Reshaping
17-Oct  Thurs  Pandas: Groupby / Split Apply Combine
22-Oct  Tues  Collaborating using GitHub
24-Oct  Thurs  Defensive Programming
29-Oct  Tues  Getting Help Online
31-Oct  Thurs  Pandas: Categorical Data; Eval and Query
5-Nov  Tues  Statistics with statsmodels
7-Nov  Thurs  Machine Learning with scikit-learn
12-Nov  Tues  Data Science: Questions
14-Nov  Thurs  Data Science: Backwards Design
19-Nov  Tues  Data Science: Backwards Design II
21-Nov  Thurs  Data Science: Tool Selection
26-Nov  Tues  Project Proposal Workshopping
28-Nov  Thurs  THANKSGIVING BREAK
Reminders:
 JVP: Python Data Science Handbook: Essential Tools for Working with Data by Jake VanderPlas.
 WM: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Second Edition by Wes McKinney.