Advanced Command Line Tutorial

In the last section, we practiced using a few tools and introduced the idea that the command line is just a way of talking to your operating system using text commands rather than by clicking on icons. In this section, we’ll introduce some more advanced tools, and discuss general principles that will help you during your data science career.

Like everything in this course, this section will focus on tools that are most relevant for an applied data scientist. We will not try to cover advanced bash programming (loops, function definitions, etc.), because anything you can do that way you will also be able to do in Python, for which you are receiving lots of additional training. If you want to learn those skills, there are lots of great tutorials out there (e.g. the full DataCamp tutorial). Instead, the focus here is on skills you’re likely to use when using git, managing packages in Python, or getting stuff set up on remote servers so you can run your R or Python scripts.

In the examples below, we’ll be working with the example data in the Example_Data/command_line folder in this repository, which you can download if you wish to follow along.

Command Line Syntax: General Principles

You may have noticed there are some patterns to how the command line tools we’ve covered so far operate. In this section we’ll introduce some general principles that are used by most command line programs (like git, python, julia, conda, zip, ssh, etc.):

1: The first thing you type into the shell is actually just the name of a program. This may not be obvious, but when you type a command like ls, you’re actually asking your operating system to find and execute a program with that name. If you wanted to, you could find the individual file called ls that the operating system runs when you use that command. (A few very basic commands, such as cd, are instead built into the shell itself, but you use them in exactly the same way.) And later on, you’ll spend a lot of time using the commands python or git, which is just a way of asking the operating system to execute those programs.

2: The things that come after the program being called are called “arguments”, and they are passed to the program being called. For example, if you were to run python my_file.py, you are calling the program python and passing it the name of a file as an argument (which it will then execute). What arguments a program accepts or requires depends on the program.

3: The shell is very sensitive to spaces. If you have filenames with spaces, you’ll need to use quotes or escape the spaces in the file names by preceding them with a \ (e.g. less this\ is\ my\ file.txt).

4: Many programs have options that are activated with “flags”. A flag is usually a single dash followed by a single letter. For example, you can ask the ls command to display the contents of a directory in a list format using the flag -l.

[1]:
# Normal `ls` display:
cd ~/github/programming4ds/Example_Data/command_line
ls
a_folder_with_stuff     hello.txt
example_csvs            just_another_file.txt
[2]:
# With the `-l` flag, it also shows file sizes, when last modified, and all sorts of operating
# system information that you don't need to worry about.
ls -l
total 16
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt
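To make principle 3 (spaces in filenames) concrete, here is a minimal sketch comparing quoting and escaping. The throwaway folder created with mktemp -d and the file name are assumptions made for illustration; they are not part of the example data.

```shell
# Work in a throwaway folder so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create a file whose name contains spaces.
# The quotes keep the whole name together as ONE argument.
echo "hello" > "this is my file.txt"

# Two equivalent ways to refer to the file:
cat "this is my file.txt"      # quote the whole name (prints: hello)
cat this\ is\ my\ file.txt     # escape each space     (prints: hello)

# Keep both results so they can be compared.
quoted_output=$(cat "this is my file.txt")
escaped_output=$(cat this\ is\ my\ file.txt)
```

Without the quotes or backslashes, the shell would split the name into four separate arguments (this, is, my, file.txt), and cat would report that none of those files exist.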

One Dash Versus Two Dashes

Many flags also have a longer (easier to read) version that you call with two dashes. Basically, if a shell command sees one dash, it knows that each letter immediately afterwards is a different flag. If it sees two dashes, it knows that everything after the dashes and before the first space is a single flag name.

(Two-dash options are common in modern commands, but aren’t always available in older commands like cd and ls. In the early days of programming, people didn’t see being “user friendly” as a priority).

To illustrate, consider the less (file viewer) command. If you want it to report the currently installed version, you can either type less -V (single dash followed by a single letter) or less --version (double dash followed by a full word).

Note that because a single dash tells the command that what follows is a single letter flag, you can actually pile up flags after a single dash. For example, we already know that -l tells ls to show files in a list. -h says to show file sizes in a human-readable format. Since each flag after a single dash is only one letter, if we squish them together the command knows that it’s a series of one letter flags (since lh itself is two letters so wouldn’t be a valid single flag).

[3]:
# You can use these separately
ls -l -h
total 16
drwxr-xr-x   4 Nick  staff   128B Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff   1.1K Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41B Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40B Apr 11 14:08 just_another_file.txt
[4]:
# Or together!
ls -lh
total 16
drwxr-xr-x   4 Nick  staff   128B Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff   1.1K Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41B Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40B Apr 11 14:08 just_another_file.txt

Getting Help

Now that you know that many commands have options, the next obvious question is: how do I learn what options are available?

The answer is that most commands have helpfiles you can get either by typing NAMEOFCOMMAND -h or man NAMEOFCOMMAND.

-h

For most commands, NAMEOFCOMMAND -h or NAMEOFCOMMAND --help will bring up a small guide to command options. For example, python -h or python --help bring up:

usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Options and arguments (and corresponding environment variables):
-b     : issue warnings about str(bytes_instance), str(bytearray_instance)
         and comparing bytes/bytearray with str. (-bb: issue errors)
-B     : don't write .pyc files on import; also PYTHONDONTWRITEBYTECODE=x
-c cmd : program passed in as string (terminates option list)
-d     : debug output from parser; also PYTHONDEBUG=x
-E     : ignore PYTHON* environment variables (such as PYTHONPATH)
-h     : print this help message and exit (also --help)
-i     : inspect interactively after running script; forces a prompt even
         if stdin does not appear to be a terminal; also PYTHONINSPECT=x

man

While NAMEOFCOMMAND -h works for most modern commands, for very old commands (those that have been around since the early days of computing like ls or cd), you often need to use man NAMEOFCOMMAND (man is short for manual). To illustrate, man ls brings up:

LS(1)                     BSD General Commands Manual                    LS(1)

NAME
     ls -- list directory contents

SYNOPSIS
     ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
     For each operand that names a file of a type other than directory, ls
     displays its name as well as any requested, associated information.  For
     each operand that names a file of type directory, ls displays the names
     of files contained within that directory, as well as any requested, asso-
     ciated information.
...

The “Recursive” Flag

Now that you’re familiar with the idea of using flags to modify the behavior of commands, there’s one kinda weird flag that’s worth discussing in detail: -r, or occasionally -R.

Many command line tools are designed to operate on files, and by default they won’t work if you try to use them on folders (directories). For example, if you try to copy a folder with cp, you’ll get the following error:

➜  cp a_folder ~/desktop
cp: a_folder is a directory (not copied).

To get tools that only work on files to work on folders, we use the -r flag. The r stands for “recursive”, and basically it says “do what I’m asking you to do to this directory to every file in this directory.”

Places this comes up a lot:

  • Deleting folders requires rm -r
  • Copying folders requires cp -r
  • Compressing a folder with zip requires zip -r
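As a rough sketch of the first two cases in the list above, the following copies and then deletes a small folder. The folder and file names are made up for illustration, and everything happens inside a throwaway directory created with mktemp -d.

```shell
# Work in a throwaway folder so nothing real is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Build a small folder with two files inside it.
mkdir a_folder
echo "one" > a_folder/first.txt
echo "two" > a_folder/second.txt

# Copy the folder AND everything inside it.
cp -r a_folder a_copy

# Delete the original folder AND everything inside it.
# Be careful: rm does not use the trash, so this cannot be undone.
rm -r a_folder

# Only the copy is left, with both files inside.
ls a_copy
```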

Invisible Files

Now that you’re comfortable with options, it’s time to introduce you to a dark secret of modern operating systems: there are invisible files everywhere. When a programmer needs to create a file or folder, but doesn’t want to show it to the user, (s)he prefixes the file name with a single period (.). The operating system then hides these files from the user.

But now you can see them using the command line. Just use the -a flag (short for “all”) for the ls command to have it show you all the files that are there:

[5]:
# You thought you knew what was in this folder:
ls -l
total 16
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt
[6]:
# But there were other files hiding! Notice that `.this_file_is_invisible.txt` and `.DS_Store` were hidden before?
ls -la
total 48
drwxr-xr-x   8 Nick  staff   256 Jun 24 15:46 .
drwxr-xr-x   5 Nick  staff   160 May 27 16:05 ..
-rw-r--r--@  1 Nick  staff  8196 Jun 25 08:55 .DS_Store
-rw-r--r--@  1 Nick  staff   179 Apr 11 14:09 .this_file_is_invisible.txt
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt

Yup! .this_file_is_invisible.txt and .DS_Store were there all along! These are normal files – you can move them, rename them, or open them like any other – they are just hidden by default. However, be careful about modifying these so-called “dot-files” – they are often hidden for a reason. Dot-files are used by programs to store configuration or settings data, and they’re usually hidden because casual users can easily screw them up.

In this case, .this_file_is_invisible.txt is just a plain text document I created for this exercise. .DS_Store, by contrast, is a file created by the macOS operating system to store information like how this folder should be displayed when opened. It is sufficiently unimportant that playing with it won’t ruin your computer, but there’s not really anything in there you’re meant to change.

This trick is useful to know because some programs (like git) rely on settings hidden in dot-files. In fact, you should try to memorize this command (ls -la); many people use it more than plain old ls.

How common are dot-files? Extremely. See for yourself: if you go to your home directory, you’ll find that all sorts of programs have been storing their settings and installed packages in dot-files. Just run cd ~ (remember that ~ is just shorthand for your home folder, which is /Users/YOURUSERNAME on macOS and usually /home/YOURUSERNAME on Linux), then ls -la.

Feel free to explore these files and folders if you want, but I would strongly suggest against editing anything unless you know what you’re doing – unlike .DS_Store files, changing some of these can really screw up how some applications work.
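If you would like to see the dot-file convention in action without touching any real settings, a small sketch along these lines should work. The file names notes.txt and .my_settings are hypothetical, and the folder is a throwaway one created with mktemp -d.

```shell
# Work in a throwaway folder so no real settings are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# One ordinary file and one dot-file.
echo "visible" > notes.txt
echo "hidden" > .my_settings

# Plain ls skips any name that starts with a period.
plain_listing=$(ls)
echo "$plain_listing"

# ls -a also lists .my_settings, along with . (this folder) and .. (its parent).
full_listing=$(ls -a)
echo "$full_listing"

# The dot-file is an ordinary file: you can read it like any other.
cat .my_settings
```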

Wildcards

As we saw in the last set of exercises, one of the most powerful command line tricks (and one of the places where using the command line can be much easier than trying to do things with your mouse) is the use of wildcards. Any time you are listing files, you can use an asterisk (*) to allow any pattern to appear in part of a filename. For example, to list all the CSV files in a folder (but only the CSVs), you can type:

[7]:
cd example_csvs
ls *.csv
311calls_2018_11_29_Thursday.csv        311calls_2018_2_9_Friday.csv
311calls_2018_11_30_Friday.csv          311calls_2018_3_15_Thursday.csv
311calls_2018_12_6_Thursday.csv         311calls_2018_3_16_Friday.csv
311calls_2018_12_7_Friday.csv           311calls_2018_3_1_Thursday.csv
311calls_2018_1_18_Thursday.csv         311calls_2018_3_2_Friday.csv
311calls_2018_1_19_Friday.csv           311calls_2018_3_30_Friday.csv
311calls_2018_1_25_Thursday.csv         311calls_2018_3_8_Thursday.csv
311calls_2018_1_26_Friday.csv           311calls_2018_3_9_Friday.csv
311calls_2018_1_4_Thursday.csv          311calls_2018_4_12_Thursday.csv
311calls_2018_2_15_Thursday.csv         311calls_2018_4_13_Friday.csv
311calls_2018_2_16_Friday.csv           311calls_2018_4_5_Thursday.csv
311calls_2018_2_1_Thursday.csv          311calls_2018_4_6_Friday.csv
311calls_2018_2_22_Thursday.csv         311calls_2018_5_4_Friday.csv
311calls_2018_2_23_Friday.csv           311calls_2018_6_14_Thursday.csv
311calls_2018_2_2_Friday.csv            311calls_2018_6_1_Friday.csv
311calls_2018_2_8_Thursday.csv          311calls_2018_6_8_Friday.csv

Or if you only wanted to see the CSVs that have data from the month of February (in this case, the files with 2018_2_ in the middle of the file name) you could type:

[8]:
ls *2018_2*
311calls_2018_2_15_Thursday.csv 311calls_2018_2_23_Friday.csv
311calls_2018_2_16_Friday.csv   311calls_2018_2_2_Friday.csv
311calls_2018_2_1_Thursday.csv  311calls_2018_2_8_Thursday.csv
311calls_2018_2_22_Thursday.csv 311calls_2018_2_9_Friday.csv

This is an extremely powerful tool, and one you’ll use a lot. Just be careful: wildcards can also get you in trouble. For example, suppose you wanted to erase all the CSVs from January. You might be inclined to type rm *2018_1*. But that pattern will catch much more than just January…

[9]:
ls *2018_1*
311calls_2018_11_29_Thursday.csv        311calls_2018_1_19_Friday.csv
311calls_2018_11_30_Friday.csv          311calls_2018_1_25_Thursday.csv
311calls_2018_12_6_Thursday.csv         311calls_2018_1_26_Friday.csv
311calls_2018_12_7_Friday.csv           311calls_2018_1_4_Thursday.csv
311calls_2018_1_18_Thursday.csv

It will also catch (and if you were to use rm, delete) November (2018_11_) and December (2018_12_)! To just catch January, you’d have to be more specific and use rm *2018_1_* (with an underscore right after the 1).
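A habit that helps here is to preview a pattern with ls before handing the same pattern to rm. The sketch below tries this on a few empty stand-in files named like the tutorial’s CSVs; the files are created with touch in a throwaway folder, so their contents (and the folder itself) are assumptions made for illustration.

```shell
# Work in a throwaway folder so no real data is deleted.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Empty stand-ins named like the tutorial's files.
touch 311calls_2018_1_4_Thursday.csv \
      311calls_2018_11_29_Thursday.csv \
      311calls_2018_12_7_Friday.csv \
      311calls_2018_2_9_Friday.csv

# Preview what each pattern matches BEFORE using it with rm.
loose_matches=$(ls *2018_1*)     # January, November AND December
strict_matches=$(ls *2018_1_*)   # January only

echo "$loose_matches"
echo "$strict_matches"

# Only once the preview looks right, delete with the same pattern.
rm *2018_1_*

# Three files remain: November, December and February.
ls
```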

Using The Outputs of Commands

We’ve seen there are several commands that will print information to the terminal for you to see. But sometimes we want to do something with the information that programs return. For example, it’s nice that ls shows us the contents of a folder, but what if we wanted to save that to disk so we could open it and use it in a different program?

Saving to Disk

You can redirect the output of any program that prints something to the screen into a file with the > operator. For example, to save the output of the ls command to a file in your current folder, you would type ls > ls_output.txt.

Note that this will only work for commands that print something directly to the screen (like ls, or cat). It won’t work for programs that just open up an interactive session (like less).
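Here is a small sketch of redirection in a throwaway folder. It also uses >>, a closely related operator that this tutorial does not otherwise cover: it appends to a file instead of overwriting it. The file names are made up for illustration.

```shell
# Work in a throwaway folder with two empty stand-in files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.csv b.csv

# `>` creates listing.txt, or silently overwrites it if it already exists.
ls *.csv > listing.txt

# `>>` adds to the end of the file instead of overwriting it.
echo "end of listing" >> listing.txt

# The file now holds three lines: a.csv, b.csv, end of listing
cat listing.txt
```

Because > overwrites without asking, it is worth double-checking the file name on the right-hand side before pressing enter.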

Piping

Sometimes instead of saving the output of a program to disk, you want to pass it to another program to analyze. This practice – using the output of one program as input to another – is called “piping”, and it can be very powerful (and is actually used in many programming languages, not just bash).

For example, suppose we wanted to count the number of .csv files in a folder. One way to do this would be to save the output of ls -1 *.csv to disk (the -1 option forces ls to put one file name on each line), then point the wc command (short for “word count”) at that file with the -l option, which counts lines rather than words (a file name containing a space would otherwise be counted as multiple words). See man wc for more information on how wc works:

[10]:
ls -1 *.csv
311calls_2018_11_29_Thursday.csv
311calls_2018_11_30_Friday.csv
311calls_2018_12_6_Thursday.csv
311calls_2018_12_7_Friday.csv
311calls_2018_1_18_Thursday.csv
311calls_2018_1_19_Friday.csv
311calls_2018_1_25_Thursday.csv
311calls_2018_1_26_Friday.csv
311calls_2018_1_4_Thursday.csv
311calls_2018_2_15_Thursday.csv
311calls_2018_2_16_Friday.csv
311calls_2018_2_1_Thursday.csv
311calls_2018_2_22_Thursday.csv
311calls_2018_2_23_Friday.csv
311calls_2018_2_2_Friday.csv
311calls_2018_2_8_Thursday.csv
311calls_2018_2_9_Friday.csv
311calls_2018_3_15_Thursday.csv
311calls_2018_3_16_Friday.csv
311calls_2018_3_1_Thursday.csv
311calls_2018_3_2_Friday.csv
311calls_2018_3_30_Friday.csv
311calls_2018_3_8_Thursday.csv
311calls_2018_3_9_Friday.csv
311calls_2018_4_12_Thursday.csv
311calls_2018_4_13_Friday.csv
311calls_2018_4_5_Thursday.csv
311calls_2018_4_6_Friday.csv
311calls_2018_5_4_Friday.csv
311calls_2018_6_14_Thursday.csv
311calls_2018_6_1_Friday.csv
311calls_2018_6_8_Friday.csv
[11]:
ls -1 *.csv > ~/files_in_folder.txt
wc -l ~/files_in_folder.txt
      32 /Users/Nick/files_in_folder.txt

But obviously that seems wasteful. Why do we have to save to disk just to move the data from one program to another?

The answer is we don’t! Instead we can use the pipe operator: |. The pipe operator says “just pass the output of the first command to the second command as its input”. And now we can do:

[12]:
ls -1 *.csv | wc -l
      32
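Pipes can also be chained, with each stage doing one small job. As a sketch, the following counts only the Friday files. It relies on grep, a standard tool not covered in this tutorial that keeps only the lines containing a given piece of text, and on three empty stand-in files created in a throwaway folder.

```shell
# Work in a throwaway folder with three empty stand-in files.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch 311calls_2018_2_9_Friday.csv \
      311calls_2018_2_8_Thursday.csv \
      311calls_2018_3_2_Friday.csv

# Stage 1 lists the files, one per line.
# Stage 2 keeps only the lines containing "Friday".
# Stage 3 counts the lines that are left (two here).
friday_count=$(ls -1 *.csv | grep Friday | wc -l)
echo "$friday_count"
```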

The nano Editor

It is often the case when working at the command line that one wants to actually edit a file, not just look at it or move it around. For small, quick edits, most systems come with an extremely useful tool for this purpose: nano. Just type nano FILENAME on almost any system, and you can edit your file without opening or installing additional programs.

The PATH Variable

The last feature of the command line that is important to understand is the PATH variable. We won’t get into all the intricacies of the PATH variable here, but having a basic understanding of its purpose and function will likely prove useful to you if you ever have to troubleshoot problems in the future.

Have you ever wondered how the command line knows what to do when you type a command like python or ls? How does it know what program to run, especially on a computer that might have multiple installations of a program like Python?

The answer is that your system has a list of folders stored in an “environment variable” called PATH, and when you run a command (like python), it goes through those folders in order until it finds an executable file with the name of the command you typed. Then when it finds that file, it executes that program and stops looking.

You can see the value of the PATH variable on your computer by typing echo $PATH (echo says evaluate and print what follows, and the dollar sign in $PATH says “please fill in the value of the environment variable named PATH”). On my system, the PATH variable looks like this:

[13]:
echo $PATH
/Users/Nick/anaconda3/bin:/Users/Nick/anaconda3/bin:/Users/Nick/anaconda3/condabin:/Users/Nick/anaconda3/bin:/Users/Nick/anaconda:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/users/nick/github/barrio_networks/code/modules:/Library/TeX/texbin:/opt/X11/bin:/usr/local/git/bin:/Library/TeX/texbin:/opt/local/bin:/opt/local/sbin:/users/nick/github/barrio_networks/code/modules:/users/nick/.local/bin

That means that when I type python, my computer will first look in the folder /Users/Nick/anaconda3/bin to see if there’s a file named python it can run. If it can’t find one there, it moves on to /Users/Nick/anaconda3/condabin, etc.

(You’ll see that /Users/Nick/anaconda3/bin appears several times in my PATH. That’s because the program I’m working with adds /Users/Nick/anaconda3/bin to my PATH when it starts up, leading to duplication. Thankfully, duplication doesn’t really matter, since the time it takes the computer to check that folder more than once is minuscule.)
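A long PATH is easier to read with one folder per line. One way to get that, sketched here with the standard tools tr and cat (neither is covered in this tutorial), is to swap each colon for a line break and number the result so the search order is visible:

```shell
# Replace every colon with a newline so each folder gets its own line,
# then number the lines to show the order in which folders are searched.
echo "$PATH" | tr ':' '\n' | cat -n

# Count how many entries there are (duplicates included).
path_entry_count=$(echo "$PATH" | tr ':' '\n' | wc -l)
echo "$path_entry_count"
```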

Why this is useful to know

In a perfect world, you’ll never have to worry about your PATH variable, but there are a couple situations where knowing about your PATH variable can be helpful. In particular:

  • If you downloaded a program but can’t run it from the command line, that probably means that its location isn’t in the PATH variable.
  • If you find that when you type a command like python, the command line isn’t running the version of python you want it to run, that’s probably because a different version of python appears earlier in the PATH variable (since the command line will stop looking through these folders as soon as it finds a match). Note you can diagnose this problem by typing which COMMANDNAME, which will tell you the folder from which COMMANDNAME is being run.

Modifying your PATH Variable

How you modify your PATH variable depends a little on your operating system.

Configuration File on macOS and Linux

In Linux or macOS, the easiest way to modify your PATH variable is using your command line configuration file. This is a small script that runs in the background whenever you open a new command line window. If you add a modification to your PATH variable here, that modification will always be loaded when you open a new command line session.

The exact name of your configuration file will depend a little on what command line tool you’re using. If you haven’t changed your default terminal program (i.e. haven’t installed oh-my-zsh, as suggested in the last tutorial), your configuration file will be located in your home directory (cd ~) and is named either .bash_profile or .bashrc. Note that the name starts with a . so it’s invisible by default! You’ll have to use your ls -la trick to see and open it.

If you installed oh-my-zsh (or your shell is zsh, which is the default on recent versions of macOS), then the file will still be located in your home directory but will be called .zshrc.

Configuration File on Windows

If you’re using Windows, you can change the PATH variable globally by going to System Properties > Advanced > Environment Variables. But if you’re just using bash through Cmder, you should have a .bash_profile file in your home directory.

Actually Changing your PATH Variable

When modifying your PATH variable, you really don’t want to remove anything already in your PATH variable (because who knows what program may need one of those obscure directories). Instead, the best practice is to just prefix the folders you want searched first. If your program isn’t on your PATH, this will add it; if the wrong version of a program is being used, the folder you add will take priority because it sits at the front of the PATH variable.

So to add a folder to the front of your PATH variable while keeping the old folders at the back, we type:

[15]:
export PATH="/NEW/FOLDER/ON/PATH:$PATH"
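To watch this work end to end, here is a minimal sketch that creates a tiny program in a throwaway folder, puts that folder at the front of PATH, and then runs the program by name. The program name hello_tool is hypothetical. The sketch also uses chmod +x (which marks a file as executable) and command -v (a shell built-in that, much like which, reports the file that would be run), neither of which is covered in this tutorial.

```shell
# Throwaway folder holding a two-line script named hello_tool (a made-up name).
tool_dir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from my tool"\n' > "$tool_dir/hello_tool"
chmod +x "$tool_dir/hello_tool"    # mark the file as executable

# Put the folder at the FRONT of PATH, keeping everything already there.
export PATH="$tool_dir:$PATH"

# The shell can now find the program by name alone.
tool_location=$(command -v hello_tool)
tool_output=$(hello_tool)

echo "$tool_location"   # the full path inside the throwaway folder
echo "$tool_output"     # prints: hello from my tool
```

Typed directly at the prompt, an export like this only lasts until you close the terminal window, which is why the line normally goes in your configuration file.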

Command Line Exercises

Let’s do some exercises!

Advanced Command Line Exercises