Welcome non-Duke MIDS Students!

In designing this course, I’ve endeavored to make all the materials and resources accessible to students anywhere. However, the materials in this course are tailored to meet the needs of incoming Duke Masters in Data Science (MIDS) students, and so if you are not an incoming Duke MIDS student, there are a couple things you should know about how the course has been designed.

In short:

  • This course assumes familiarity with (regular) Python. Guidance for those new to Python is below.

  • MIDS students have also seen the two Python data science packages we’ll use a lot in this course: numpy and pandas. As a result, this course is not optimally designed for those who have no exposure to these tools. A person who knows Python but has never seen numpy or pandas can still learn all they need from the materials in this course, but for reasons discussed below, it may be harder than is really necessary. Guidance for students in this position is below.

  • No experience or past exposure is assumed for topic areas other than numpy and pandas.

If you’ve never used Python before

The DataCamp courses listed below include a good introduction to Python, so I’d start by taking those courses.

In addition, MIDS students also take a little Python bootcamp when they arrive. Over the last year we have been working to convert that into a set of Coursera courses, the first of which will be up shortly (June 2023 I think). I would strongly recommend students also take that as soon as it’s available.

How Python, numpy and pandas are taught

Over the summer before they arrive, Duke MIDS students are asked to complete some summer coursework to prepare them for their first year. Part of that training is completing a set of DataCamp tutorials on Data Science in Python, which introduce the basics of using Python for data science (namely, they introduce Python, numpy, and pandas). In addition, they get some additional training in (regular) Python the week they arrive on campus.

As a result, the material on numpy and pandas in this course is written assuming that students have completed those DataCamp courses (the specific classes are provided below). Now to be clear, one need not have taken those courses to be able to succeed in this class: the readings we do on numpy and pandas essentially explain those libraries from the ground up. The only thing that this course definitely will not teach is the basics of Python, so you should think of that as an actual prerequisite (more on resources for learning Python below).

But in teaching, an important concept is the difference between “the logic of a subject” and the “logic of learning”. Subjects often have their own logic, which (once understood) seems to flow from basic ideas (e.g. what is the most basic way to represent data) and then build up to fancy concepts (how to merge datasets and fit models). The problem is that teaching a course by following this “logic of the subject” turns out not to be the best way to help people learn. Instead, it’s often useful to start from the middle (doing basic data manipulations) so students see the goal of the tools they are learning about before they double back and enrich their understanding by learning about the deep internal logic of a tool.

The DataCamp courses that MIDS students completed were a “start from the middle” introduction, and so in this course we’re now doubling back and teaching the logic of these libraries from the bottom up. That means we will cover everything one needs to know about these libraries, but we may not be doing it in the most efficient way for newcomers.

If you’ve never seen numpy or pandas before

So: if you have never worked with numpy or pandas, then before you work through this course, I would strongly recommend that (if you can afford it) you do what we ask MIDS students to do: go complete the following excellent DataCamp courses:

DataCamp’s courses are really well designed, and crucially they’re entirely interactive (and research shows that active learning is the best way to learn). They’re a tremendous resource, and you should definitely start there if you can afford their service ($30 a month in the US).

How other topics are taught

While some knowledge is assumed for Python, numpy and pandas, no knowledge is assumed in other areas. As a result, all the other resources on this site should be totally accessible to all readers (and if they’re not, then please let me know by opening an issue on this site’s GitHub repo!).

Do the exercises!

This course is taught (for in-person students) as a flipped classroom: students are required to read instructional materials at home before class, and class time is then spent doing exercises in pairs.

For non-Duke students looking to use this website to develop their data science skills, there are two consequences of this organization:

  • Duke students are using the exact same instructional materials you are. You aren’t missing lecture materials.

  • Duke students are being required to do the exercises associated with each topic, and they are integral to the course in two ways. First, there are lessons that come up in the exercises that aren’t covered (or aren’t covered as well as I’d like) in instructional materials. If you don’t do the exercises, you will miss important take-aways. And second, programming is a skill learned by doing. Requiring students to do the exercises in class is a way of making sure that they get appropriate attention. There’s no way to make them “mandatory” online, but I hope you will take to heart my strong encouragement that you complete the exercises that follow each lesson if you want to get the most out of this site.