Welcome to the Command Line Basics Exercises!¶

In this exercise we’re going to get some practice navigating and exploring files and folders from the command line by looking at some data from New York City’s 311 system. 311 is a citizen hotline set up by the city of New York for reporting non-emergency issues to the city. 311 takes calls about all sorts of issues, from noise complaints to issues with street lights to complaints about restaurant hygeine violations and rodent sightings.

You can find the 311 data we’ll be working with in a zipped file called NYC_311calls_2018.zip here. Please download the file and place it somewhere easy to remember (desktop, downloads, etc.).

Navigating Folders¶

As previously discussed in Command Line Basics, the command line is just a tool for interacting with your computer using text commands rather than by clicking around with a mouse on pictures of folders (i.e. using your Graphical User Interface, or GUI). That means that most of the principles you’re familiar with (like the way files are organized into folders) apply on the command line just like in your more familiar graphical user interface. In this section, we’ll briefly review how to nagivate your folder structure using the command line.

There are two main ways to navigate your file system:

The first step-by-step: analogous to how you would find a file using the operating system GUI:

Check where you are currently located with the pwd command (short for print working directory).
Use ls to see what folders and files are in your current directory.
If the file you want to find is in one of the folders in your current directory, move into that folder with the cd command (e.g. cd FOLDERNAME). Then return to Step 2.
If the file you want is not somewhere inside your current folder, you can move your working directory up one folder by typing cd ... Then return to Step 2.

The second is using an absolute path. This requires knowing the exact path of the folder where you want to work, but if you know it, you can just type cd /THE/PATH/TO/WHERE/YOU/WANT/TO/GO.

Important Things to Remember:

Recall that if you start a path with a /, your are specifying an absolute path. When you use an absolute path, your current working directory doesn’t matter. An absolute path is like a street address: it fully specifies where you want to go. For example, the absolute path to my desktop is /users/nick/desktop.
If you don’t start with a /, you’re specifying a relative path. Wen you use a relative path, that path is interpreted with respect to your current directory. In this sense, a relative path is like the directions you give a lost driver (“from where we’re standing, go one block north, then take a left”). cd FOLDERNAME, for example, says to your computer “I want you to make your new working directory the folder in the current directory named FOLDERNAME.
~ is a shortcut for your user directory. If you get lost, cd ~ will always take you back to your home directory. Also, any path that starts with ~ is actually an absolute path, even though it doesn’t start with a / since ~ always refers to a single (absolute) location on your computer.
Remember that if file or folder name has a space in it, you have to put a \ before the space. So to move into a folder called “My Folder”, you would type cd My\ Folder.

Exercise 1¶

Use the pwd (short for “print working directory”) command to ask your command line session where it currently thinks of itself being located. What is your current working directory?

Exercise 2¶

Using whatever strategy you would like to move around, change your working directory to the folder where you saved NYC_311calls_2018.zip. Then run the command unzip NYC_311calls_2018.zip to decompress the file.

Note: If you download with Safari, Safari will automatically unzip the file and delete the .zip, just leaving you an already unzipped folder. If you want to practice unzipping, just use Chrome or any other browser.

Exploring Files¶

Once you’ve unzipped NYC_311calls_2018, use cd to navigate into the folder so it is now your working directory. Then use ls to look at what’s in the folder. What you see should look something like this:

$ ls
311_SR_Data_Dictionary_2018.xlsx README.md                        raw data
CE-20170824.pdf                  just_the_letter_a.docx
NYC311_column_names.txt          just_the_letter_a.txt

Up until now, we’ve just been moving around at the level of the filesystem, seeing file names but not their contents. But if a file is a plain text file, we can also look at it’s contents. There are actually a few ways to do this, but the two most used options are cat (which will print the contents of the files to your screen), or less (which will open a small program to allow you to read through the document in a controlled manner). cat is quicker, but if you use cat with a big file, the whole file will just print out to your screen and you’ll end up overwhelmed (though you’ll be fine for a small file here).

If you use less, get to the end of a file and can’t get out, type q and enter.

Exercise 3¶

Do as the README.md suggests and read it first with the command cat README.md, then with the command less README.md (press q for quit to get out when you’re done).

Exercise 4¶

Now let’s do the same with CE-20170824.pdf: run less CE-20170824.pdf. If less asks you a question, just type y.

What happened?! Unfortunately, CE-20170824.pdf was not a plain text file, but instead is what is referred to as a binary file. This distinction between plain text files and binary files will come up a lot, so let’s discuss it briefly.

The terms “plain text” and “binary” are a little misleading since everything on your computer is stored as 1s and 0s (i.e. binary). What differentiates plain text and binary files is what those 1s and 0s are meant to represent.

In a plain text file, the 1s and 0s of the file encode numbers and letters based on simple, commonly used codes (like ASCII or Unicode. These files also do not contain anything complicated (pictures, media, etc.), and in fact don’t even include information like fonts, or formatting. This simplicity makes plaintext files universally compatible, and easy to work with, so are a favorite of programmers. Any code you’ve ever written has probably been saved as a plaintext file.

In a binary file, by contrast, the 1s and 0s encode much more complicated information. In this case, CE-20170824.pdf is a PDF file that includes images, different fonts, careful formatting, etc. As a result, it can only be openned by a PDF reader (like Preview or Adobe Reader) that knows how to interprete the file’s complicated content. If you open it with less, less tries to treat the 0s and 1s like they were just encoding simple letters and numbers, but since they don’t, the result is just gobblygook.

Exercise 5¶

Lets actually see the difference between plaintext and binary files. In your folder are two files called just_the_letter_a, one with a .txt suffix, and one with a .docx suffix. Using your normal operating system interface, open both files (assuming you have Microsoft Word installed). You should see that both files include nothing except a lower-case letter “a”.

Exercise 6¶

You can see the actual 1’s and 0’s that underlay a file from the command line using the command xxd -b [filename]. First, use this to see what’s in just_the_letter_a.txt file.

What you will see is a counter on the left, a colon, then the actual contents of the file grouped into sets of 8 bits (what’s called a byte). The first is the code for a lower case “a” (01100001). The second is the code that says “this is the end of the current line”. And that’s it! Congratulations, you can now read binary!

(Don’t believe me? You can find the code for the end of a line here, and for “a” here. Go check for yourself – there’s no magic here.

Exercise 7¶

Now let use xxd -b [filename] do the same for the Microsoft Word doc that also encodes just a single letter “a”. Does it look similar?

Exercise 8¶

And that is why plaintext so useful – it’s simplicity makes it nearly universal across both platforms and time.

Be aware that lots of file endings can be used for plaintext files. For example, .csv files are also plaintext. Indeed, it is because they have such a simple format that .csvs are the most used format for sharing tabular data. .md, .txt, .tsv, and other file suffixes are also usually plaintext.

But just because a file is not plaintext doesn’t mean we don’t want to know what it is!

Use the open command (on a mac) or the start command (if you’re using a bash shell on windows). open FILENAME / start FILENAME just asks your computer to do whatever it would do if you double-clicked on FILENAME. So if you type open CE-20170824.pdf / start CE-20170824.pdf, your computer will open the PDF in your default PDF reader.

Similarly, if you type open . or start ., your file explorer will open with the current directory open!

Exercise 9¶

OK, so CE-20170824.pdf is just a paper someone wrote using this data. Since the name CE-20170824.pdf doesn’t tell us anything about this paper, let’s rename it using the mv command. Recall from DataCamp that mv stands for move, but that while it is moving files it can also rename them. If you “move” something from its current location back to its current location but with a different name, you’ve effectively re-named it!. So try re-naming CE-20170824.pdf to something more descriptive.

Organizing Files¶

Up till now, we haven’t done anything that wouldn’t have been easier to do using a mouse and a regular graphical user interface. But now let’s suppose we want to analyze the data from 311 calls placed on Thursdays and Fridays to see if city workers are less likely to address problems that are reported on Fridays.

In your normal operating system GUI, open up the raw data folder inside NYC_311calls_2018. As you will see, the folder is full of CSVs (comma-separated-values, a plain-text format for storing spreadsheets), with one file for each day.

Exercise 10¶

Without using the command line (or another progamming language), how you would pull out all the files for Thursdays and Fridays and move them to a new folder without using the command line? Would you strategy work if you had 10 years of data instead of 1 year of data?

..

Exercise 11¶

One of the advantages of the command line is that you can use wildcards (the * symbol) to identify any files with a given pattern. For example, if I wanted to list all the CSV files in raw data from February, I would type ls 311calls_2018_2_*.csv, since all the files from February (month 2) would have the same prefix (311calls_2018_2_) and suffix (.csv). Now, using the mv command and the * symbol, move all the Thursday and Friday files to a new folder. (Hint: you’ll probably need to make a new folder to put the files into first.)