Installing Python and Miniconda¶
One of the major learning goals of this class is for you to be comfortable managing all the software and settings required for you to do data science on your own computer.
Why deal with all the headaches of setting up your own environment, you may ask? Why not just use a cloud platform like Google Colab or a virtual machine with everything already set up?
Getting data science tools installed and working together is, for better or worse, a pretty core part of the day-to-day life of data scientists, and learning how to troubleshoot problems quickly is an important skill for being productive in the profession. But it is a skill that takes time and energy to learn, and so in most classes — which want to focus on teaching topics like statistical analysis or programming concepts — instructors choose to provide students with clean, ready-to-use environments so everyone can focus on those topics. For example, if the MIDS Python Bootcamp included a module on setting up Python environments instead of providing you with a clean virtual machine, you’d probably end up learning ~25% less programming!
But the problem with this approach is that if every course you take pursues this strategy, you may find that you don’t feel empowered to go do data science yourself when those clean VMs are taken away at the end of the semester. Moreover, it means you may not know enough about how data science tools work to debug problems on your own when they come up.
So in this course, we’re going to address environment setup head-on. That will probably mean you’ll get a little annoyed at the fragility of many of these tools, and you may get frustrated spending hours trying to find a setting that got set wrong (though we’ll try to minimize these experiences!), but try to think of this time not as wasted, but instead as part of your data science education!
What We’ll Be Setting Up¶
To set ourselves up for this course (and hopefully our careers!), we’ll need to set up the following things:
Python and the conda package manager: This is a Python-centric course, so the first thing we’ll need to do is install Python and a robust, data-science-appropriate package manager.
Visual Studio Code (VS Code): There are a lot of opinions (most strongly held) about what editor is “best.” My own view is that what editor is best depends not just on the kind of work you do and your own working style, but also on what the people around you use (nothing is better than being able to ask the person next to you for help!). But even more importantly, I think everyone who works in data science would agree that becoming proficient in whatever editor you use matters more than picking the “correct” editor. With that in mind, most Duke MIDS courses have decided to coordinate around VS Code to give you the opportunity to get really, really good at VS Code. You may later decide it isn’t for you, but at least this way you’ll have a good sense of what a good editor can do.
Augmented Command Line: As a data scientist, you’ll spend a lot of time working at the command line, so it’s a good idea to invest a little in setting up something more advanced than the default command line tool offered by your operating system (e.g., Terminal/Command Prompt/PowerShell). In addition, this will give us a chance to learn a little about how the command line works, which will be really important for effective troubleshooting.
But that’s a lot, so let’s take things one step at a time! First, let’s install PYTHON!
Reading These Setup Instructions¶
As you work through these setup readings, be certain to follow the directions very carefully! As a data scientist you are working at the frontier of software, which often means that the tools we use have little quirks and issues that are just waiting to trap you. Every note you find in these readings is there because either I ran into a problem or one of my students did, so please take your time and try to be very methodical!
Installing Python with Miniconda¶
The first thing you’ll likely want to do on any computer you work with is install both Python and the package manager conda. This is necessary because unlike a language like R where you can install packages with the install.packages() command, Python doesn’t have an internal tool for installing packages. This means that we need a tool like conda if we want to use anything other than vanilla Python (e.g., tools for plotting, numpy, pandas, etc.).
Python has two main package managers: pip and conda. While most software engineers use pip, most data scientists like conda. That’s because while pip is good at installing Python libraries, conda is better at installing many of the big dependencies that underlie data science tools. Plus, if we install conda, it will come with pip, so we get the best of both worlds!
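To make “vanilla Python” concrete, here is a minimal sketch (not part of the official setup steps) that asks Python which modules it can currently find. The helper name is_installed is made up for illustration; importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if Python can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python itself; numpy and pandas only become available
# after you install them with a package manager such as conda or pip.
for name in ["json", "numpy", "pandas"]:
    status = "available" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

On a fresh installation you would expect json to be available and the other two to be missing until you install them.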
So the first thing we need to do to get started with Python is go to the Miniconda download page and download the most recent installer for our system (these instructions were current as of August 2022).
Note that there are actually two well-known ways to get conda on your system — installing Anaconda from anaconda.com, and installing Miniconda from docs.conda.io. It is my strong recommendation that you use Miniconda. That’s because if you install Anaconda from anaconda.com, you get not only Python and the conda package manager, but also dozens of pre-loaded packages. And while that sounds great, the reality is that it tends to cause lots of package conflicts once you start adding anything new to your installation. Miniconda, as the name implies, is the “mini” version of the Anaconda package, and basically only includes Python and a couple core utilities (conda, pip, etc.). As a result, a Miniconda installation is much less likely to cause package conflict problems down the road.
If you already have a conda installation: My recommendation is to delete it and start fresh. Deleting your Python installation can feel scary once you’ve set stuff up, but you don’t want to get in the practice of being too precious about your Python installations, as you’ll often have to just delete it all to deal with software conflicts.
Thankfully, deleting Anaconda/Miniconda is easy: just delete the anaconda3 (or miniconda3) folder you created during installation! The great thing about conda is that everything lives in that folder, so you can easily delete it and start fresh!
An IMPORTANT Note on Pyenv¶
Note that miniconda is a SUBSTITUTE for a tool like pyenv / venv that may be suggested in some other courses (like our MIDS NLP class). Do NOT install both pyenv / venv and miniconda; just install miniconda. (This is the coordinated recommendation of myself and the Duke MIDS NLP Professor Patrick Wang!)
Miniconda comes with two tools for installing packages that you can use together: conda and pip. As we’ll discuss in a later reading, my suggestion is to always try to install things with conda first (e.g., run conda install numpy), and if that fails try pip (e.g., run pip install numpy).
conda can also manage multiple versions of Python and something called “environments”, so I promise anything pyenv can do conda can do too (and much more!). So install miniconda, but not pyenv.
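The “conda first, pip as a fallback” rule can be sketched as a small function. This is only an illustration of the logic, not a tool you need for this course: in practice you will simply type the two commands at the command line. The names install, default_runner, and fake_runner are hypothetical, and the --yes flag is added so conda does not stop to ask for confirmation.

```python
import subprocess
from typing import Callable, Sequence


def default_runner(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited successfully."""
    try:
        return subprocess.run(list(command)).returncode == 0
    except FileNotFoundError:
        # The tool itself (conda or pip) could not be found.
        return False


def install(package: str,
            run: Callable[[Sequence[str]], bool] = default_runner) -> str:
    """Try conda first, then pip. Return the name of the tool that worked."""
    attempts = {
        "conda": ["conda", "install", "--yes", package],
        "pip": ["pip", "install", package],
    }
    for tool, command in attempts.items():
        if run(command):
            return tool
    raise RuntimeError(f"Could not install {package} with conda or pip")


# Demonstration with a stand-in runner so nothing is really installed:
# pretend the conda attempt fails and the pip attempt succeeds.
def fake_runner(command: Sequence[str]) -> bool:
    return command[0] == "pip"


print(install("numpy", run=fake_runner))  # prints "pip"
```

Passing the runner in as an argument keeps the ordering logic separate from actually running commands, which is why the demonstration is safe to run anywhere.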
1. Go to the miniconda install page.
2. Download a 64-bit version of Miniconda. The latest Python 3.x package is probably best.
   If you have a Mac, go with the pkg installer that’s appropriate for the processor on your computer: if you have a new Mac with an M1 or M2 processor, choose the “Apple M1 64-bit pkg” installer. If your Mac is older, use the “Intel x86 pkg” installer. Not sure which you have? Go to the Apple menu in the top left of your screen, select “About This Mac,” and see whether M1/M2 or Intel appears in the Chip or Processor line.
3. Run the installer, paying attention to the following options:
   - If you’re asked where to install the software, you want to install it “For me only,” not “Install for all users of this computer.” Note that as of July 2021, you may find the “For me only” option has a warning saying you can’t install there, but if you click a different option and then click on the “For me only” option again, the warning goes away.
   - On Windows, you’ll be asked if you want to add Miniconda to your PATH variable. Although the installer recommends that you do not do this, DO add it to your PATH. This will be important when we change how our command line works.
4. Miniconda is installed!
Why did we want to install it “for me only” in step 3? To install software for all users, you have to install software at the level of your operating system so it’s visible to all users. And your computer is very protective of anything installed at the level of the operating system because of the dangers of computer viruses, so anything installed there can run into “permission” problems when it tries to run. Anything installed “for me only” gets installed in your user folder which your computer is less paranoid about, leading to fewer problems.
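Once the installer finishes, a quick sanity check can confirm that the Python you are running lives in your user folder and that conda is visible on your PATH. The following is a minimal sketch using only the standard library; the function name describe_python and the dictionary keys are made up for illustration. The architecture value comes from platform.machine(), which typically reports something like arm64 on Apple M1/M2 Macs and x86_64 or AMD64 on Intel-style processors.

```python
import platform
import shutil
import sys
from pathlib import Path


def describe_python() -> dict:
    """Collect a few facts about the Python interpreter that is running."""
    interpreter = Path(sys.executable).resolve()
    home = Path.home().resolve()
    return {
        # Full path to the Python program currently running this code.
        "interpreter": str(interpreter),
        # True when that program sits somewhere inside your user folder.
        "in_user_folder": home in interpreter.parents,
        # Processor architecture as reported by the operating system.
        "architecture": platform.machine(),
        # Path to the conda command if it is on your PATH, otherwise None.
        "conda_on_path": shutil.which("conda"),
    }


for key, value in describe_python().items():
    print(f"{key}: {value}")
```

If in_user_folder is False or conda_on_path is None after a “For me only” installation, that is a hint that a different Python is being picked up or that the PATH option was not selected.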
Changing the Default Repository¶
Now we also want to change one setting in Miniconda: the default “channel” it uses to get packages.
When you install packages using conda, conda can actually pull from a number of different package repositories (called “channels”). The default for this is the “anaconda” channel, but the best channel is actually called conda-forge. To set this as the default:
Open the default command line on your computer (on a Mac, it’s Terminal, found in Applications > Utilities; on Windows, you can use PowerShell, which you can get by just putting PowerShell in the search bar), and run the following three commands:
conda config --add channels conda-forge (you may be told you already have it listed)
conda config --set channel_priority strict
conda install python=3.10
This last command may take a little while to run (as much as 10-15 minutes in extreme cases), or you could get “All requested packages already installed.” Why?
The first two of these will just change where conda looks for packages by default. But the last command will cause conda to swap out the version of Python 3.9 that came with miniconda (at least that was the default as of August 2022) for a slightly newer Python (3.10) that was built by the folks at conda-forge. That step isn’t strictly necessary, but it will ensure we’re all working with 3.10, and because we’re moving to a conda-forge build, it can help avoid conflicts later.
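To confirm that the three commands took effect, you can check the Python version and ask conda which channels it is configured to use. The sketch below is one way to do that from Python itself; the function names python_is_at_least and conda_channels are hypothetical helpers, and the code assumes conda is reachable on your PATH (it returns None if it is not).

```python
import shutil
import subprocess
import sys
from typing import Optional


def python_is_at_least(major: int, minor: int) -> bool:
    """Return True if the running Python is at least version major.minor."""
    return sys.version_info >= (major, minor)


def conda_channels() -> Optional[str]:
    """Return the output of `conda config --show channels`, or None on failure."""
    conda = shutil.which("conda")
    if conda is None:
        return None  # conda is not on the PATH
    result = subprocess.run(
        [conda, "config", "--show", "channels"],
        capture_output=True,
        text=True,
    )
    return result.stdout if result.returncode == 0 else None


print("Python 3.10 or newer:", python_is_at_least(3, 10))
print(conda_channels() or "conda was not found on the PATH")
```

If the setup worked, conda-forge should appear in the printed list of channels.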
And that’s it!¶
You now have Python on your system, as well as all the tools you’ll need for managing packages!