Jupyter notebooks are an incredible tool for data science communication, but only when used correctly. In this reading I will provide some guidelines on how to write good notebooks. In addition to hopefully providing you with some guidance on how to use notebooks effectively in your lives, these guidelines will also form the basis of how assignments submitted as Jupyter notebooks in this class will be graded, so please take them seriously!
The Point of Jupyter Notebooks¶
Jupyter Notebooks are sometimes thought of as just an easy environment in which to introduce students to programming, but when used well, they can be so much more than that – Jupyter Notebooks are amazing tools for data science communication, allowing authors to place explanations for their work alongside the code that executes their calculations. Indeed, a good Jupyter Notebook should be a free-standing document that fully communicates its motivation, explains what the author is doing, and provides interpretation of any results presented.
Obviously this may not always feel strictly necessary when doing class exercises, since graders have access to the exercise whose questions you are answering. But like so much in class, the use of Jupyter Notebooks in class is just an opportunity for developing skills you will use in the workplace, and so from this point forward your notebooks will be expected to be coherent examples of data science communication. They don’t need to be quite as free-standing as a document you might provide a manager, but they should be coherent and follow all of the guidelines below.
Guideline 1: Use Clear Sign-Posting¶
Markdown cells aren’t just an easy place to write text; they also support formatting that can be used to easily communicate how a document is organized. Use this capacity!
Each question in your exercise should be clearly marked. Comments within a code-cell don’t count.
Guideline 2: No Code Without Motivation¶
Before any blocks of code, you should have text explaining what you’re doing and why. Even for homework, a grader should never have to go back to the original question to remind themselves of what you’re trying to accomplish.
Guideline 3: No Numbers Without Context¶
The job of the data scientist is to answer questions about the world using quantitative data. To do so we will often have to do lots of mathematical calculations, and it’s easy to forget that the result of those mathematical calculations does not, in and of itself, constitute a coherent answer to a question. Numbers must always be provided with context: an explanation of what the number means, its units, the question it answers, etc.
In other words, if an exercise asks for the average income of women in healthcare, you can’t just put a code cell that outputs 68321.239023198213. You need to either use Python to print out something like “The average income of women in healthcare in the US is 68,321.24 dollars”, or write out that interpretation of the output in a markdown cell under the printed output.
(Also, note that the formatting of the raw output isn’t really appropriate given the context: no one writes incomes with so many decimals, and rarely would one write out a number without a thousands-comma between 68 and 321. Worrying about this may feel silly, but poorly formatted numbers are an impediment to your reader’s ability to quickly understand the data you are providing them, and your goal as a communicator is to always communicate meaning as clearly as possible!)
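As a minimal sketch of what printing a number with context might look like, the snippet below uses Python's built-in format specification to add a thousands separator and round to two decimals. The variable names are illustrative, not part of any assignment.

```python
# Raw result of a calculation, as in the example above.
avg_income = 68321.239023198213

# ",.2f" adds a thousands separator and rounds to two decimal places.
sentence = (
    "The average income of women in healthcare in the US is "
    f"{avg_income:,.2f} dollars"
)
print(sentence)
# prints: The average income of women in healthcare in the US is 68,321.24 dollars
```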
Guideline 3.1: Always Provide Interpretation¶
If you are presenting a number that has significance beyond its explicit value, provide your interpretation of the number! For example, if you’re doing an exercise looking at wage gaps across industries, you probably don’t just want to report that women in healthcare make 68,321 dollars on average – you also want to relate that quantity back to the average wage for men in healthcare you calculated in the previous question, compare them, and make note of what that tells you about wage gaps.
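One way such a comparison might be reported is sketched below. Note that the men's figure is a made-up placeholder chosen purely for illustration, not a real statistic, and the variable names are hypothetical.

```python
# Women's average from the example above. The men's figure is a MADE-UP
# placeholder for illustration only, not a real statistic.
avg_income_women = 68321.24
avg_income_men = 80000.00  # hypothetical value

gap = avg_income_men - avg_income_women
ratio = avg_income_women / avg_income_men

# Report both numbers together with the comparison that gives them meaning.
message = (
    f"Women in healthcare earn {avg_income_women:,.0f} dollars on average, "
    f"about {ratio:.0%} of the {avg_income_men:,.0f} dollars earned by men "
    f"(a gap of {gap:,.0f} dollars)."
)
print(message)
```

A sentence or two in a markdown cell saying what that gap suggests about the industry would then complete the interpretation.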
Guideline 4: Format Your Code¶
Code readability is important to communication. Once people become accustomed to seeing code written with a certain style, not conforming to that style undermines the ability of readers to parse and understand your code easily.
Python style is dictated by a set of guidelines called PEP8, and you can easily format your code to conform to most aspects of PEP8 with a tool like black. In Jupyter Notebooks running in VS Code, you can format the code in a cell by typing option-shift-f (on Macs).
Because “format-on-save” is not yet implemented for Jupyter Notebooks (feel free to add a thumbs up on the issues here and here if you want to see it implemented), we won’t grade you down if not everything is perfectly formatted. But if you have a cell with a big block of stringy code that isn’t formatted, or none of your cells are formatted, we will mark you down.
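To give a rough sense of what “stringy” code looks like next to formatted code, here is a small before-and-after sketch. Both functions are hypothetical, and the second version was laid out by hand following PEP8 conventions; it is not actual output from black.

```python
# Before: valid Python, but hard to read.
def summarize_messy(values,label,units='dollars'):return label+': '+format(sum(values)/len(values),',.2f')+' '+units

# After: the same logic, spaced and split across lines so each step is visible.
def summarize(values, label, units="dollars"):
    mean = sum(values) / len(values)
    return label + ": " + format(mean, ",.2f") + " " + units


# Both versions produce the same result; only the readability differs.
print(summarize([1000.0, 2000.0], "Mean"))
```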
Guideline 5: Restart and Run-All¶
You absolutely must “Restart” your kernel and “Run All” before submitting your assignment. It’s all too common for people to share notebooks with cells that have been run out of order, or notebooks where the code doesn’t work because people had the notebook open forever, defined variables, changed their code, and wrote new code that relied on those already-existing-but-no-longer-defined-in-code variables.
If your notebook doesn’t work when you Restart and Run All, then anyone you share it with won’t be able to run it either, and you’re defeating half the point of a Jupyter Notebook!
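The problem can be illustrated with a toy simulation. This is only a sketch of the idea, not how Jupyter is actually implemented: each string stands in for a notebook cell, and a plain dictionary plays the role of the kernel's memory.

```python
# A long-running session: the dictionary holds every variable defined so far.
kernel = {}
exec("incomes = [50000, 60000]", kernel)  # an old cell, later deleted from the notebook
exec("mean_income = sum(incomes) / len(incomes)", kernel)  # works: `incomes` lingers in memory

# "Restart and Run All": fresh memory, and only the cells still in the notebook.
fresh_kernel = {}
try:
    exec("mean_income = sum(incomes) / len(incomes)", fresh_kernel)
    outcome = "ran cleanly"
except NameError as err:
    outcome = f"failed: {err}"

print(outcome)
```

The surviving cell runs fine in the old session but fails after a restart, because the variable it depends on is no longer defined anywhere in the notebook.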
Moving forward, here is an approximate rubric for homeworks:
A: All of the following are true:
- There are no significant errors in the code/analysis.
- The document is well-formatted and everything the authors are doing is coherently communicated.
- Numbers are presented in context, and with interpretation where appropriate.
- All code is well-formatted.
B: One of the following is true:
- There are some significant errors in the code/analysis.
- The document is not well-formatted, or some of what the authors are doing is not coherently communicated.
- Some numbers are presented without context, or without interpretation where interpretation would have been appropriate.
- The code has significant formatting errors.
C: Two of the conditions for a B assignment are true.
(Grades above an A (e.g. above 0.95), or below a C are possible where deemed appropriate by graders.)