Welcome to Practical Data Science!

The course site for the Duke MIDS Fall 2019 Practical Data Science course

If you are not a Duke Masters in Data Science student, please see this page about how best to use this site!

Data Science is an intrinsically applied field, and yet all too often students are taught the advanced math and statistics behind data science tools, but are left to fend for themselves when it comes to learning the tools we use to do data science on a day-to-day basis or how to manage actual projects. This course is designed to fill that gap.

This course will be divided into two parts:

  • Part 1: Data Wrangling: In Part 1 of this course, students will develop hands-on experience manipulating real-world data using a range of data science tools (including the command line, Python, Jupyter, Git, and GitHub).

  • Part 2: Answering Questions: This course adopts the view that Data Science is about answering important questions using quantitative data. In Part 2 of this course, students will learn to develop data science projects that achieve this goal via backwards design, and learn tips for managing projects from inception to presentation of results.

The first portion of the course will provide students with extensive hands-on experience manipulating real (often messy, error-ridden, and poorly documented) data using a range of bread-and-butter data science tools (like the command line, Git, Python (especially NumPy and pandas), Jupyter notebooks, and more). The goal of these exercises is to ensure students are comfortable working with data in almost any form. In addition to being of intrinsic value, developing these skills will also ensure that in advanced statistics or machine learning courses, students can focus on understanding the concepts being taught rather than having to split their attention between concepts and the nuts and bolts of data manipulation required to complete assignments.

The first portion of the course will culminate in students completing the full data manipulation and analysis component of a data science project (the goal of the project will be provided). This will include everything from gathering data from third parties to cleaning and merging different data sources and analyzing the resulting data. These projects will be completed in teams using Git and GitHub to give students experience managing GitHub workflows.

In the second portion of the class, we will take a step back from the nuts and bolts of data manipulation and talk about how to approach the central task of data science: answering questions about the world. In particular, we’ll discuss how to use backwards design to plan data science projects, how to refine questions to ensure they are answerable, how to evaluate whether you’ve actually answered the question you set out to answer, and how to pick the most appropriate data science tool based on the question you seek to answer (this will be a bit of a preview of material we will engage with even more in Practical Data Science II in Spring 2020).

This portion of the course will culminate in students picking a topic, developing an answerable question, thinking about what (in very concrete terms) an answer to that question would look like, figuring out what tools they would employ to generate that answer, and developing a plan for finding the data they would need to actually execute their project.

Course Syllabus

The full syllabus for this course can be downloaded here. Please note that this syllabus is subject to change up until the first day of class.

Class Schedule

Date | Day | Rm | Topic | Do Before Class | In-Class Exercise
27-Aug | Tues | 270 | Intro | N/A |
29-Aug | Thurs | CC | Command Line Basics | | Link
3-Sep | Tues | 270 | Advanced Command Line; Jupyter Lab / Notebooks | | Link, Link 2
5-Sep | Thurs | 270 | IPython; Packages; Python v. R / variables as pointers | | Link
10-Sep | Tues | 330 | NumPy Basics | | Link
12-Sep | Thurs | 270 | Pandas: Series | | Link
17-Sep | Tues | CC | Pandas: DataFrames | | Link
19-Sep | Thurs | CC | Intro to Plotting with PlotNine | | Link
24-Sep | Tues | 330 | Advanced Plotting | | Link
26-Sep | Thurs | 270 | Pandas: Indices & Missing | | Link, Link
1-Oct | Tues | 270 | Pandas: Loading and saving data; Pandas: Cleaning | | Link
3-Oct | Thurs | 330 | Pandas: Merging | JVP pp 149 - 157 | Link
8-Oct | Tues | 270 | Pandas: Reshaping | | Link
10-Oct | Thurs | 330 | FALL BREAK | |
15-Oct | Tues | 270 | Groupby / Split-Apply-Combine | JVP pp 212-228 | Link
17-Oct | Thurs | 330 | Collaborating using GitHub | | Link
22-Oct | Tues | 330 | Defensive Programming | |
24-Oct | Thurs | 330 | Getting Help Online | |
29-Oct | Tues | 270 | Pandas: Categorical Data; Eval and Query | WM 12.1; JVP pp 208 - 213 |
31-Oct | Thurs | 330 | Statistics with statsmodels | WM Chapter 13 |
5-Nov | Tues | 270 | Machine Learning with scikit-learn | JVP pp 331 - 359 |
7-Nov | Thurs | 330 | Big Data: What is it, how do I work with it? | |
12-Nov | Tues | 330 | Speed and Performance in Python | |
14-Nov | Thurs | 330 | Data Science: Questions | |
19-Nov | Tues | 270 | Data Science: Backwards Design | |
21-Nov | Thurs | 330 | Data Science: Tool Selection | |
26-Nov | Tues | 270 | Project Proposal Workshopping | |
28-Nov | Thurs | | THANKSGIVING BREAK | |

Reminders: