The Jupyter Project

Today, if you use more than one programming language for data science, you probably also use a different program to edit and interact with each of those languages. R users, for example, often use RStudio, Python users use Spyder, and Julia users use Juno.

But in recent years, an amazing effort has been underway to provide a single set of tools that work with nearly any underlying programming language: Jupyter (as in Ju (Julia) - py (Python) - teR (R)).

The idea of Jupyter is to separate the interface you are working with from the underlying programming language doing your analysis. This makes it possible to create one interface (a text editor, a window where results are displayed, etc.) that can be used to run your analyses in any number of different languages. In the Jupyter ecosystem, the program being used to actually run your analysis (e.g. Python or R) is referred to as a kernel.
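One way to picture this separation: Jupyter learns about each kernel from a small JSON description (a kernel.json file) that says how to start that language's process. The sketch below parses two such descriptions with Python's standard library. The two specs are illustrative examples modeled on typical files, not copies of what is on your machine, and describe_kernel is a helper name made up for this example.

```python
import json

# Illustrative kernel specs, modeled on typical kernel.json files.
# Assumption: the exact contents on your machine may differ.
PYTHON_SPEC = """
{
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3",
  "language": "python"
}
"""

R_SPEC = """
{
  "argv": ["R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
  "display_name": "R",
  "language": "R"
}
"""


def describe_kernel(spec_text):
    """Return (display name, language, launch program) for one kernel spec."""
    spec = json.loads(spec_text)
    return spec["display_name"], spec["language"], spec["argv"][0]


for text in (PYTHON_SPEC, R_SPEC):
    name, language, program = describe_kernel(text)
    print(f"{name}: language={language}, started by running '{program}'")
```

The interface only needs the launch command and a display name, which is why the same notebook or console window can sit in front of very different languages.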

Although Jupyter was originally focused on unifying Julia, Python, and R, it now supports dozens of different kernels, including JavaScript, Go, Haskell, Matlab, Stata, bash, Scala, and many more.

(Note: Jupyter Notebooks used to be called IPython Notebooks before they expanded to support more languages, so if you see people talking about IPython Notebooks, just think of them as an early, Python-specific version of Jupyter Notebooks.)

Jupyter Notebooks

Jupyter Notebooks are a tool for easily integrating text, code, and code output into a single document. This not only makes them incredibly useful for instructional materials (this entire site is actually built with Jupyter Notebooks), but also makes them useful as a method of sharing analyses. Using Jupyter Notebooks, you can share not only the conclusions of your analysis with colleagues, but also the code that generated them, making it easy for others to see how you reached your conclusions and, crucially, play with that code to see what happens if the analysis is changed slightly. Indeed, Notebooks are so useful for sharing analyses that they’ve become the de facto standard for sharing information at many companies, including Netflix.
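To make "text, code, and code output in a single document" a little more concrete: a notebook file (.ipynb) is a JSON document containing a list of cells. The sketch below builds a tiny two-cell notebook as a Python dictionary using only the standard library. The cell contents are made up for illustration, and markdown_cell and code_cell are hypothetical helper names, not part of any Jupyter library.

```python
import json


def markdown_cell(text):
    """A text cell: prose written in Markdown."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source, printed_output):
    """A code cell, stored together with the output it printed when run."""
    return {
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": source,
        "outputs": [
            {"output_type": "stream", "name": "stdout", "text": printed_output}
        ],
    }


# A minimal notebook: one cell of text, one cell of code plus its output.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Average income\nWe compute a simple mean."),
        code_cell("print(sum([10, 20, 30]) / 3)", "20.0\n"),
    ],
}

# Round-trip through JSON text, which is what would be saved in a .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(notebook_json)["cells"]]
print(cell_types)  # ['markdown', 'code']
```

Because the output is saved alongside the code that produced it, a colleague who opens the file sees your results immediately and can re-run or modify the code cell to explore variations.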

OK, I know, that all sounds really abstract. What makes Jupyter Notebooks special is their interactivity, so it’s hard to understand their value without seeing them in action.

Jupyter Notebook Tutorial

To learn the basics of Jupyter Notebooks, please watch this short tutorial. As you watch the video, follow along yourself: since you’ve installed Anaconda, you already have Jupyter Notebooks installed, so running the command jupyter notebook from your terminal should launch the notebook navigator. We’ll build on the skills from this tutorial in our in-class exercises.

If you get an error when you try to run jupyter notebook, such as "Error executing Jupyter command 'notebook': [Errno 2] No such file or directory" or command not found: jupyter (this might happen if you didn’t do a fresh install of Anaconda because you already had it installed, or if you’re using Miniconda), you can make sure Jupyter Notebooks is installed by running conda install jupyterlab from the command line.

Jupyter Lab

Notebooks have been around since 2014, but while they are great for instructional materials and sharing analyses, they were never well suited to managing full projects. In 2018, though, Jupyter launched a new tool: Jupyter Lab.

Jupyter Lab is meant to be an all-in-one data science interface:

  • Easily run and write Jupyter Notebooks (and do things you couldn’t do before, like drag cells from one notebook to another, collapse cells, etc.).

  • Work with a text-editor in one pane and an active kernel session in another, just like you do in RStudio or Spyder.

  • Edit popular data science file formats with live preview, such as Markdown, JSON, CSV, Vega, VegaLite, and more.

Again, this is all a little abstract for a tool that’s fundamentally about interactivity, so to see Jupyter Lab in action, watch this video starting 9 minutes in and stopping 30 minutes in (when they switch speakers). Don’t worry about absorbing every detail — we’ll get lots of practice with Jupyter Lab below — the goal is just to give you a sense of how Jupyter Lab works generally.

To follow along with this video:

  • Make sure you have the most recent version of Jupyter Lab installed (the version that comes with the Anaconda distribution you already installed is a little behind) by running conda update jupyterlab.

  • Type jupyter lab on the command line to start a Jupyter Lab session!

Is Jupyter the best tool for data science?

Where Jupyter Lab really shines is in its support for interactive programming and data analysis, and in providing a single interface for working in different programming languages. But every tool has strengths and weaknesses, so whether that makes it the best tool depends on what you’re doing and how you like to work. That said, there are some use cases for which I think Jupyter is really valuable.

In the introduction of this course, we talked about the two big branches of data science: the software development branch, and the data analysis branch.

The way people in these different branches work is very different. People who work in the software development branch of data science are writing generalized software that will be packaged up, shipped out, and run on other computers somewhere. And as a result, the code they write is often “non-linear” (it isn’t meant to be run one line after the other sequentially). Instead it’s full of different functions and routines that may get called at different times depending on the input data. Moreover, when writing this kind of code, you get to assume the data entering your program has some kind of initial structure, making it easier to predict what your code will do when executed. As a result, people doing this type of data science tend to prefer more developed text editors (Atom, Emacs, Vim, etc.) that don’t provide as much support for interactivity, but offer better tools for things like debugging software packages.

People in the data analysis branch, by contrast, often want to see the result of each line of code they write. That’s because when analyzing data for the first time, you may know what your code is doing in an abstract sense (say, calculating the average of a variable), but you don’t know what the output of that code will be! Moreover, you never really know the structure of your real-world data until you’ve cleaned the heck out of it, so you want to look at it often to see what issues are coming up. And when cleaning data, you are often writing very linear (this-line-runs-after-the-last-line) scripts where you want to check your efforts to clean your data after each line. When doing this kind of work (especially if you spend a lot of time moving between programming languages), having one program like Jupyter that works with different kernels and allows for rich, interactive programming can be very valuable.

So if you do a lot of interactive data analysis, I would encourage you to give Jupyter Notebooks and Jupyter Lab a try. It might not be the tool you use for everything you do, but it’s used by enough people and has enough advantages that I think it’s a tool most data scientists should have a working familiarity with.

Setting Up R with Jupyter

When you installed Anaconda, you also installed Jupyter, so you can already open up Jupyter and use it with a Python kernel. However, since most people taking this course are Duke MIDS students who are also doing coursework in R, let’s set up Jupyter to work with R as well.

  1. If you do not have R installed, download and install it here. If you have R installed, skip to step 2.

  2. Open R by opening your command line tool (Oh-My-Zsh on Mac, Cmder on Windows) and typing R. Don’t open it by double-clicking its icon!

    • If you can’t open R by just typing R, you have to launch it by putting in the absolute path to your R installation. On a Mac, doing this requires typing something like the following into your command line (depending on exactly where R is installed on your system): /Applications/R.app/Contents/MacOS/R. Similarly, using Cmder on Windows you need to type something like (depending on installed version): /c/Program Files/R/R-3.6.0/bin/R.exe.

  3. Run install.packages("IRkernel")

  4. After installation is complete, execute the command: IRkernel::installspec() in R.

That should be it. To see if the R kernel installed correctly, open a new session of Jupyter Lab (open a new console and type jupyter lab), and you should see buttons for both “Python 3” and “R” (though you won’t have Bash, Julia, or Stata listed like I do):

jupyter_launch_page

If you don’t see a button for R, make sure you followed all the steps above!
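If you would rather check from code than from the launcher page, Jupyter can report its registered kernels with the command jupyter kernelspec list --json. The sketch below shows one way to read that report in Python. It assumes IRkernel registered itself under its default name, ir, and the SAMPLE text is a trimmed, hypothetical stand-in for real output so the parsing logic can be tried without Jupyter installed. kernel_names and installed_kernels are names made up for this example.

```python
import json
import subprocess


def kernel_names(kernelspec_json):
    """Map kernel name -> display name from `jupyter kernelspec list --json` output."""
    specs = json.loads(kernelspec_json)["kernelspecs"]
    return {name: info["spec"]["display_name"] for name, info in specs.items()}


def installed_kernels():
    """Ask Jupyter which kernels are registered (needs jupyter on your PATH)."""
    result = subprocess.run(
        ["jupyter", "kernelspec", "list", "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return kernel_names(result.stdout)


# Hypothetical sample output, trimmed to the fields used above.
SAMPLE = """
{"kernelspecs": {
  "python3": {"resource_dir": "/path/to/python3",
              "spec": {"display_name": "Python 3", "language": "python"}},
  "ir": {"resource_dir": "/path/to/ir",
         "spec": {"display_name": "R", "language": "R"}}
}}
"""

kernels = kernel_names(SAMPLE)
# Assumption: IRkernel uses the kernel name "ir" unless told otherwise.
print("R kernel registered:", "ir" in kernels)
```

On your own machine, calling installed_kernels() in place of kernel_names(SAMPLE) should return the kernels Jupyter actually knows about.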

If the command IRkernel::installspec() generates this error:

Error in IRkernel::installspec() :
  jupyter-client has to be installed but “jupyter kernelspec --version” exited with code 127.
In addition: Warning message:
In system2("jupyter", c("kernelspec", "--version"), FALSE, FALSE) :
  error in running command

That means R can’t find your installation of Jupyter. That probably means Anaconda isn’t set up with your command line tool, so please go back and see setup_environment.ipynb.
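Exit code 127 is the usual shell signal for "command not found", so a quick way to confirm the diagnosis is to check whether a jupyter executable is visible on your PATH. The sketch below does that with Python's standard library. find_jupyter and diagnose are hypothetical helper names, and the optional search_path argument exists only so the check can be tried against a specific list of directories.

```python
import shutil


def find_jupyter(search_path=None):
    """Return the full path of the jupyter executable, or None if it is not found.

    With search_path=None, the PATH of the current session is searched.
    """
    return shutil.which("jupyter", path=search_path)


def diagnose(search_path=None):
    """Summarize in one line whether jupyter can be found."""
    location = find_jupyter(search_path)
    if location is None:
        return "jupyter not found on PATH"
    return f"jupyter found at {location}"


print(diagnose())
```

Run this from the same command line tool you used to start R. If it reports that jupyter was not found, the problem is with how Anaconda is connected to that tool rather than with R or IRkernel.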

Jupyter Lab Exercises!

Click here for some Jupyter Lab exercises!