In the last section, we practiced using a few tools and introduced the idea that the command line is just a way of talking to your operating system using text commands rather than by clicking on icons. In this section, we’ll introduce some more advanced tools, and discuss general principles that will help you during your data science career.

Like everything in this course, this section will focus on tools that are most relevant for an applied data scientist. We will not try and cover advanced bash programming (loops, function definitions, etc.), because anything you can do that way you will also be able to do in Python, for which you are receiving lots of additional training. If you want to learn those skills, there are lots of great tutorials out there (e.g. the full DataCamp tutorial). Instead, the focus here is on skills you’re likely to use when using git, managing packages in Python, or getting stuff set up on remote servers so you can run your R or Python scripts.

In the examples below, we’ll be working with the example data in the Example_Data/command_line folder in this repository, which you can download if you wish to follow along.

## Command Line Syntax: General Principles

You may have noticed some patterns in how the command line tools we’ve covered so far operate. In this section we’ll introduce some general principles that are used by most command line programs (like git, python, julia, conda, zip, ssh, etc.):

1: The first thing you type into the shell is actually just the name of a program. This may not be obvious, but when you type a command like ls, you’re actually asking your operating system to find and execute a program with that name. If you wanted to, you could find the individual file called ls that the operating system runs when you use that command. (A few commands, such as cd, are usually built into the shell itself rather than stored as separate files, but you call them the same way.) And later on, you’ll spend a lot of time using the commands python or git, which is just a way of asking the operating system to execute those programs.

2: The things that come after the program being called are called “arguments”, and they are passed to the program being called. For example, if you were to run python my_file.py, you are calling the program python and passing it the name of a file as an argument (which it will then execute). What arguments a program accepts or requires depends on the program.

3: The shell is very sensitive to spaces. If you have filenames with spaces, you’ll need to use quotes or escape the spaces in the file names by preceding them with a \ (e.g. less this\ is\ my\ file.txt).

4: Many programs have options that are activated with “flags”. A flag is usually a single dash followed by a single letter. For example, you can ask the ls command to display the contents of a directory in a list format using the flag -l.

[1]:

# Normal ls display:
cd ~/github/programming4ds/Example_Data/command_line
ls

a_folder_with_stuff     hello.txt
example_csvs            just_another_file.txt

[2]:

# With the -l flag, it also shows file sizes, when last modified, and all sorts of operating
# system information that you don't need to worry about.
ls -l

total 16
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt
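
Principle 3 (spaces in file names) is easy to try for yourself. The sketch below is only an illustration: it works in a throwaway folder made with mktemp -d, and it uses a couple of things we haven’t covered (variables like demo_dir and the $( ) syntax for capturing output), so treat those names as assumptions of this example rather than anything from the course materials.

```shell
# Work in a throwaway directory so nothing on your machine is touched.
demo_dir=$(mktemp -d)

# Quotes keep the spaces inside a single argument.
echo "hello" > "$demo_dir/this is my file.txt"

# Both forms below hand cat ONE argument: the full file name.
quoted=$(cat "$demo_dir/this is my file.txt")
escaped=$(cat "$demo_dir"/this\ is\ my\ file.txt)
echo "$quoted"
echo "$escaped"

# Without quotes or backslashes, the shell would split the name at each
# space, and cat would go looking for several separate files instead.

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Both echo lines print the same contents, because quoting and backslash-escaping are just two ways of telling the shell “this is all one name”.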


One Dash Versus Two Dashes

Many flags also have a longer (easier to read) version that you call with two dashes. Basically, if a command sees one dash, it knows that each letter immediately afterwards is a different flag. If it sees two dashes, it knows that everything after the dashes, up to the next space, is a single flag name.

(Two-dash options are common in modern commands, but aren’t always available in older commands like cd and ls. In the early days of programming, people didn’t see being “user friendly” as a priority).

To illustrate, consider the less command. If you want it to report the currently installed version, you can either type less -V (single dash followed by a single letter) or less --version (double dash followed by a full word).

Note that because a single dash tells the command that what follows is a single-letter flag, you can actually pile up flags after a single dash. For example, we already know that -l tells ls to show files in a list. -h says to show file sizes in a human-readable format. Since each flag after a single dash is only one letter, if we squish them together the command knows that it’s a series of one-letter flags (since lh itself is two letters, it wouldn’t be a valid single flag).

[3]:

# You can use these separately
ls -l -h

total 16
drwxr-xr-x   4 Nick  staff   128B Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff   1.1K Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41B Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40B Apr 11 14:08 just_another_file.txt

[4]:

# Or together!
ls -lh

total 16
drwxr-xr-x   4 Nick  staff   128B Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff   1.1K Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41B Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40B Apr 11 14:08 just_another_file.txt


## Getting Help

Now that you know that many commands have options, the next obvious question is: how do I learn what options are available?

The answer is that most commands have help files you can get either by typing NAMEOFCOMMAND -h or man NAMEOFCOMMAND.

-h

For most commands, NAMEOFCOMMAND -h or NAMEOFCOMMAND --help will bring up a small guide to command options. For example, python -h or python --help bring up:

usage: python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Options and arguments (and corresponding environment variables):
-b     : issue warnings about str(bytes_instance), str(bytearray_instance)
and comparing bytes/bytearray with str. (-bb: issue errors)
-B     : don't write .pyc files on import; also PYTHONDONTWRITEBYTECODE=x
-c cmd : program passed in as string (terminates option list)
-d     : debug output from parser; also PYTHONDEBUG=x
-E     : ignore PYTHON* environment variables (such as PYTHONPATH)
-h     : print this help message and exit (also --help)
-i     : inspect interactively after running script; forces a prompt even
if stdin does not appear to be a terminal; also PYTHONINSPECT=x


man

While NAMEOFCOMMAND -h works for most modern commands, for very old commands (those that have been around since the early days of computing like ls or cd), you often need to use man NAMEOFCOMMAND (man is short for manual). To illustrate, man ls brings up:

LS(1)                     BSD General Commands Manual                    LS(1)

NAME
ls -- list directory contents

SYNOPSIS
ls [-ABCFGHLOPRSTUW@abcdefghiklmnopqrstuwx1] [file ...]

DESCRIPTION
For each operand that names a file of a type other than directory, ls
displays its name as well as any requested, associated information.  For
each operand that names a file of type directory, ls displays the names
of files contained within that directory, as well as any requested, asso-
ciated information.
...


NOTE: Windows bash clients (like Cmder and git bash) often don’t support man. To get help for old commands, try typing what you would have typed at the command line into Google instead (e.g. search for man rmdir).

## The “Recursive” Flag

Now that you’re familiar with the idea of using flags to modify the behavior of commands, there’s one kinda weird flag that’s worth discussing in detail: -r, or occasionally -R.

Many command line tools are designed to operate on files, and by default they won’t work if you try to use them on folders (directories). For example, if you try to copy a folder with cp, you’ll get the following error:

➜  cp a_folder ~/desktop
cp: a_folder is a directory (not copied).


To get tools that only work on files to work on folders, we use the -r flag. The r stands for “recursive”, and basically it says “do what I’m asking you to do to this directory to every file in this directory.”

Places this comes up a lot:

• Deleting folders requires rm -r

• Copying folders requires cp -r

• Compressing a folder with zip requires zip -r
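
To see the recursive flag in action without risking any real files, here is a minimal sketch that builds a small folder inside a throwaway directory and then copies and deletes it. The names demo_dir, a_folder, a_copy and notes.txt are made up for this illustration, and mktemp -d and the $( ) capture syntax are conveniences we haven’t formally covered.

```shell
# Build a small folder with one file inside a throwaway directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/a_folder"
echo "some text" > "$demo_dir/a_folder/notes.txt"

# Plain cp refuses to copy a directory and reports an error...
cp "$demo_dir/a_folder" "$demo_dir/a_copy" || echo "plain cp failed, as expected"

# ...but cp -r copies the folder and everything inside it.
cp -r "$demo_dir/a_folder" "$demo_dir/a_copy"
copied=$(cat "$demo_dir/a_copy/notes.txt")
echo "$copied"

# rm -r deletes a folder and all of its contents. There is no undo and
# no trash can, so double-check the path before running it for real.
rm -r "$demo_dir"
```

The exact wording of the error from plain cp differs between macOS and Linux, but in both cases the folder is not copied until you add -r.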

## Invisible Files

Now that you’re comfortable with options, it’s time to introduce you to a dark secret of modern operating systems: there are invisible files everywhere. When a programmer needs to create a file or folder, but doesn’t want to show it to the user, (s)he prefixes the file name with a single period (.). The operating system then hides these files from the user.

But now you can see them using the command line. Just use the -a flag (short for “all”) for the ls command to have it show you all the files that are there:

[5]:

# You thought you knew what was in this folder:
ls -l

total 16
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt

[6]:

# But there was another file hiding! Notice that .this_file_is_invisible.txt and .DS_Store were hidden before?
ls -la

total 48
drwxr-xr-x   8 Nick  staff   256 Jun 24 15:46 .
drwxr-xr-x   5 Nick  staff   160 May 27 16:05 ..
-rw-r--r--@  1 Nick  staff  8196 Jun 25 08:55 .DS_Store
-rw-r--r--@  1 Nick  staff   179 Apr 11 14:09 .this_file_is_invisible.txt
drwxr-xr-x   4 Nick  staff   128 Apr 11 14:13 a_folder_with_stuff
drwxr-xr-x  35 Nick  staff  1120 Jun 25 08:55 example_csvs
-rw-r--r--@  1 Nick  staff    41 Feb  5 12:41 hello.txt
-rw-r--r--   1 Nick  staff    40 Apr 11 14:08 just_another_file.txt


Yup! .this_file_is_invisible.txt and .DS_Store were there all along! These are normal files – you can move them, rename them, or open them like any other – they are just hidden by default. However, be careful about modifying these so-called “dot-files” – they are often hidden for a reason. Dot-files are used by programs to store configuration or settings data, and they’re usually hidden because casual users can easily screw them up.

In this case, .this_file_is_invisible.txt is just a plain text document I created for this exercise. .DS_Store, by contrast, is a file created by the macOS operating system to store information like how this folder should be displayed when opened. This is sufficiently unimportant that playing with it won’t ruin your computer, but there’s not really anything in there you’re meant to change.

This trick is useful to know, because some programs (like git) rely on settings hidden in dot-files. In fact, you should try to memorize this command (ls -la) – many people use it more than plain old ls.

How common are dot-files? Extremely. See for yourself: if you go to your home directory, you’ll find that all sorts of programs have been storing their settings and installed packages in dot-files. Just run cd ~ (remember that ~ is just shorthand for your home folder, which is /Users/YOURUSERNAME on macOS and usually /home/YOURUSERNAME on Linux), then ls -la.

Feel free to explore these files and folders if you want, but I would strongly suggest against editing anything unless you know what you’re doing – unlike .DS_Store files, changing some of these can really screw up how some applications work.

## Wildcards

As we saw in the last set of exercises, one of the most powerful command line tricks (and one of the places where using the command line can be much easier than trying to do things with your mouse) is the use of wildcards. Any time you are listing files, you can use an asterisk (*) to allow any pattern to appear in part of a filename. For example, to list all the CSV files in a folder (but only the CSVs), you can type:

[7]:

cd example_csvs
ls *.csv

311calls_2018_11_29_Thursday.csv        311calls_2018_2_9_Friday.csv
311calls_2018_11_30_Friday.csv          311calls_2018_3_15_Thursday.csv
311calls_2018_12_6_Thursday.csv         311calls_2018_3_16_Friday.csv
311calls_2018_12_7_Friday.csv           311calls_2018_3_1_Thursday.csv
311calls_2018_1_18_Thursday.csv         311calls_2018_3_2_Friday.csv
311calls_2018_1_19_Friday.csv           311calls_2018_3_30_Friday.csv
311calls_2018_1_25_Thursday.csv         311calls_2018_3_8_Thursday.csv
311calls_2018_1_26_Friday.csv           311calls_2018_3_9_Friday.csv
311calls_2018_1_4_Thursday.csv          311calls_2018_4_12_Thursday.csv
311calls_2018_2_15_Thursday.csv         311calls_2018_4_13_Friday.csv
311calls_2018_2_16_Friday.csv           311calls_2018_4_5_Thursday.csv
311calls_2018_2_1_Thursday.csv          311calls_2018_4_6_Friday.csv
311calls_2018_2_22_Thursday.csv         311calls_2018_5_4_Friday.csv
311calls_2018_2_23_Friday.csv           311calls_2018_6_14_Thursday.csv
311calls_2018_2_2_Friday.csv            311calls_2018_6_1_Friday.csv
311calls_2018_2_8_Thursday.csv          311calls_2018_6_8_Friday.csv


Or if you only wanted to see the CSVs that have data from the month of February (in this case, the files with 2018_2_ in the middle of the file name) you could type:

[8]:

ls *2018_2*

311calls_2018_2_15_Thursday.csv 311calls_2018_2_23_Friday.csv
311calls_2018_2_16_Friday.csv   311calls_2018_2_2_Friday.csv
311calls_2018_2_1_Thursday.csv  311calls_2018_2_8_Thursday.csv
311calls_2018_2_22_Thursday.csv 311calls_2018_2_9_Friday.csv


This is an extremely powerful tool, and one you’ll use a lot. Just be careful – wildcards can also get you in trouble. For example, suppose you wanted to erase all the CSVs from January. You might be inclined to type rm *2018_1*. But that pattern will catch much more than just January…

[9]:

ls *2018_1*

311calls_2018_11_29_Thursday.csv        311calls_2018_1_19_Friday.csv
311calls_2018_11_30_Friday.csv          311calls_2018_1_25_Thursday.csv
311calls_2018_12_6_Thursday.csv         311calls_2018_1_26_Friday.csv
311calls_2018_12_7_Friday.csv           311calls_2018_1_4_Thursday.csv
311calls_2018_1_18_Thursday.csv


It will also catch (and if you were to use rm, delete) November (2018_11_) and December (2018_12_)! To just catch January, you’d have to be more specific and use rm *2018_1_* (with the underscore after the 1).
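
A habit that helps here is to preview a pattern with ls before handing the same pattern to rm. The sketch below tries this on three empty stand-in files in a throwaway folder; the files are created with touch purely for illustration (they are not the real example data), and demo_dir and remaining are names invented for this example.

```shell
# Make three empty stand-in files in a throwaway directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/311calls_2018_1_4_Thursday.csv"
touch "$demo_dir/311calls_2018_11_29_Thursday.csv"
touch "$demo_dir/311calls_2018_12_6_Thursday.csv"

# Preview first: use ls to see exactly what a pattern matches.
ls "$demo_dir"/*2018_1*     # too broad: lists all three files
ls "$demo_dir"/*2018_1_*    # lists only the January file

# Once the preview looks right, reuse the same pattern with rm.
rm "$demo_dir"/*2018_1_*

# Only the November and December files are left.
remaining=$(ls "$demo_dir")
echo "$remaining"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

Because the shell expands a wildcard the same way no matter which command follows it, whatever ls shows you is what rm would delete.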

## Using The Outputs of Commands

We’ve seen there are several commands that will print information to the terminal for you to see. But sometimes we want to do something with the information that programs return. For example, it’s nice that ls shows us the contents of a folder, but what if we wanted to save that to disk so we could open it and use it in a different program?

### Saving to Disk

You can redirect the output of any program that prints something to the screen into a file with the > operator. For example, to save the output of the ls command to a file called ls_output.txt in your current folder, you would type ls > ls_output.txt.

Note that this will only work for commands that print something directly to the screen (like ls, or cat). It won’t work for programs that just open up an interactive session (like less).
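
Here is a small sketch of redirection you can run safely. It assumes a throwaway folder with two made-up files (a.csv and b.csv), and it also shows one behavior worth knowing: if the target file already exists, > replaces its contents rather than adding to them.

```shell
# Set up a throwaway directory with a data folder holding two empty files.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/data"
touch "$demo_dir/data/a.csv" "$demo_dir/data/b.csv"

# Nothing appears on screen here: the listing goes into the file instead.
ls "$demo_dir/data" > "$demo_dir/ls_output.txt"

# Look at what was saved.
saved=$(cat "$demo_dir/ls_output.txt")
echo "$saved"

# Redirecting to the same file again replaces what was there before.
echo "replaced" > "$demo_dir/ls_output.txt"
saved_again=$(cat "$demo_dir/ls_output.txt")
echo "$saved_again"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

The output file is deliberately kept outside the folder being listed. The shell creates the output file before ls runs, so if it lived in the same folder it would show up in its own listing.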

### Piping

Sometimes instead of saving the output of a program to disk, you want to pass it to another program to analyze. This practice – using the output of one program as input to another – is called “piping”, and it can be very powerful (and is actually used in many programming languages, not just bash).

For example, suppose we wanted to count the number of .csv files in a folder. One way to do this would be to save the names of all the .csv files to disk, then use the wc command (short for “word count”) to count the number of lines in that file. To do so, we save the output of ls -1 *.csv to disk (the -1 option forces ls to put one file name on each line), then point wc at the file using the -l option, which counts lines rather than words (a file name containing a space would otherwise be counted as multiple words). See man wc for more information on how wc works:

[10]:

ls -1 *.csv

311calls_2018_11_29_Thursday.csv
311calls_2018_11_30_Friday.csv
311calls_2018_12_6_Thursday.csv
311calls_2018_12_7_Friday.csv
311calls_2018_1_18_Thursday.csv
311calls_2018_1_19_Friday.csv
311calls_2018_1_25_Thursday.csv
311calls_2018_1_26_Friday.csv
311calls_2018_1_4_Thursday.csv
311calls_2018_2_15_Thursday.csv
311calls_2018_2_16_Friday.csv
311calls_2018_2_1_Thursday.csv
311calls_2018_2_22_Thursday.csv
311calls_2018_2_23_Friday.csv
311calls_2018_2_2_Friday.csv
311calls_2018_2_8_Thursday.csv
311calls_2018_2_9_Friday.csv
311calls_2018_3_15_Thursday.csv
311calls_2018_3_16_Friday.csv
311calls_2018_3_1_Thursday.csv
311calls_2018_3_2_Friday.csv
311calls_2018_3_30_Friday.csv
311calls_2018_3_8_Thursday.csv
311calls_2018_3_9_Friday.csv
311calls_2018_4_12_Thursday.csv
311calls_2018_4_13_Friday.csv
311calls_2018_4_5_Thursday.csv
311calls_2018_4_6_Friday.csv
311calls_2018_5_4_Friday.csv
311calls_2018_6_14_Thursday.csv
311calls_2018_6_1_Friday.csv
311calls_2018_6_8_Friday.csv

[11]:

ls -1 *.csv > ~/files_in_folder.txt
wc -l ~/files_in_folder.txt

      32 /Users/Nick/files_in_folder.txt


But obviously that seems wasteful. Why do we have to save to disk just to move the data from one program to another?

The answer is we don’t! Instead we can use the pipe operator: |. The pipe operator says “just pass the output of the first command to the second command as its input”. And now we can do:

[12]:

ls -1 *.csv | wc -l

      32
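
One detail that is easy to miss: a pipe delivers text to the next program as input, which is not quite the same thing as giving that program a file name. The sketch below makes the difference visible with wc. The file lines.txt and the variable names are invented for this illustration, and printf is used only as a quick way to write three lines of text.

```shell
# Write a three-line file into a throwaway directory.
demo_dir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$demo_dir/lines.txt"

# Given a file name as an argument, wc knows which file it counted,
# so it prints the name next to the count.
wc -l "$demo_dir/lines.txt"

# Through a pipe, wc only receives the text itself, so it prints
# just the count with no file name.
cat "$demo_dir/lines.txt" | wc -l

# Capture the piped count so it can be reused later in a script.
piped_count=$(cat "$demo_dir/lines.txt" | wc -l)

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

This is why the piped version of the count shows only a number, while the version that read from a saved file also printed the file’s path.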


## The nano Editor

It is often the case when working at the command line that one wants to actually edit a file, not just look at it or move it around. For small, quick edits, most systems come with an extremely useful tool for this purpose: nano. Just type nano FILENAME on almost any system, and you can edit your file without opening or installing additional programs.

(Note for MIDS Students: you can also use emacs for the same purpose, since you’ve already gone through the pain of learning it!)

## The PATH Variable

The last feature of the command line that is important to understand is the PATH variable. We won’t get into all the intricacies of the PATH variable here, but having a basic understanding of its purpose and function will likely prove useful to you if you ever have to troubleshoot problems in the future.

Have you ever wondered how the command line knows what to do when you type a command like python or ls? How does it know what program to run, especially on a computer that might have multiple installations of a program like Python?

The answer is that your system has a list of folders stored in an “environment variable” called PATH, and when you run a command (like python), it goes through those folders in order until it finds an executable file with the name of the command you typed. Then when it finds that file, it executes that program and stops looking.
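
To make the “search in order, stop at the first match” idea concrete, here is a sketch that sets up two throwaway folders, each containing a tiny program with the same name, and shows that whichever folder comes first in PATH wins. Everything here is hypothetical scaffolding for the demonstration: the program name greet, the folder names first and second, and the use of chmod +x to mark the files as executable. The PATH=... prefix changes PATH only for the single command on that line.

```shell
# Create two folders in a throwaway directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/first" "$demo_dir/second"

# Put a tiny program named "greet" in each folder, then make both executable.
printf '#!/bin/sh\necho "from first"\n' > "$demo_dir/first/greet"
printf '#!/bin/sh\necho "from second"\n' > "$demo_dir/second/greet"
chmod +x "$demo_dir/first/greet" "$demo_dir/second/greet"

# Folders in PATH are separated by colons and searched left to right.
# Here "first" is listed before "second", so its greet is the one that runs.
result=$(PATH="$demo_dir/first:$demo_dir/second:$PATH" greet)
echo "$result"

# Swap the order and the other program is found first.
result_swapped=$(PATH="$demo_dir/second:$demo_dir/first:$PATH" greet)
echo "$result_swapped"

# Clean up the throwaway directory.
rm -r "$demo_dir"
```

The same logic explains why a computer with several Python installations runs one particular copy when you type python: it is simply the first one found along PATH. On most systems, typing which python will tell you which file that is.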

You can see the value of the PATH variable on your computer by typing echo $PATH (echo says “print what follows”, and the dollar sign in $PATH says “please fill in the value of the environment variable named PATH”). On my system, the PATH variable looks like this:

[13]:

echo $PATH


## Command Line Exercises

Let’s do some exercises!