Git and Github

1. What Is Git(hub) & Why Should I Learn It?

Git and github are a set of tools designed to facilitate collaboration on code. They make it possible for multiple people (in some cases, dozens or even hundreds of people) to work on code at the same time without getting in one another’s way. (Note: as we’ll discuss below, git and github are technically two different things, but you’ll almost certainly use them together, so I’m just going to call them “git” for now).

It is no exaggeration to say that git (and other forms of version control software) underlie the entire world of open source software, and are central to the operation of nearly every tech company on the planet. Git is just a central part of how software is developed today.

Git is less commonly used in Data Science, however, where people are often working in groups that are small enough that they feel comfortable just keeping their code on Dropbox and trying to make sure that collaborators never work on the same file at the same time (in the hopes of never creating a dreaded Conflicted Copy file). But git and github have much to offer the modern data scientist, even if they only work on small teams and don’t want to contribute to larger software projects. In particular, here are the main benefits I see for git and github, in order of likelihood of relevance for an applied data scientist (the sort who answers ad hoc questions like an academic researcher, rather than writing data analysis routines into widely distributed software):

  1. Keep an archive of every version of your project: Git works by logging the work you do on your project into a series of discrete sets of changes called “commits”. Crucially, it remembers all of your commits, making it possible to easily go back to a previous version of the project any time you want. Suddenly discover that your co-author deleted the code for your favorite graph weeks ago and you didn’t notice till now? No problem! You can easily recover the version of that file that existed before your co-author’s mistake.
  2. Allow you and your co-authors to work at the same time: Git is much more forgiving when it comes to allowing people to edit the same document at the same time. Git treats each line in a text document separately, so if your co-author is editing the introduction of your paper (assuming it’s in a text format like \(\LaTeX\)) while you’re editing the conclusion, git can easily integrate your simultaneous edits. Moreover, if you do both edit the same line of code or text, then git will help you resolve those conflicting edits in a very efficient manner instead of what Dropbox does: create a Conflicted Copy, and leave you to figure out what changes conflicted and how to integrate both authors’ changes.
  3. You can easily see what changes your co-author has made: Because git is organized around keeping track of changes (again, called commits), when your co-author makes changes to a document, git allows you to easily see just the changes your co-author has made. This makes it much easier for colleagues to be aware of how their project is changing and to watch out for problems (e.g. you can easily see if your co-author recoded a variable in a way that is problematic for code you wrote later). For instance, here’s a github report on changes a co-author made in a shared project, where the (red) text on the left is what the code used to say, while the (green) text on the right is what it now says. And github even allows you to comment directly on changes if you want:

[Figure: git_diff_example (screenshot of a github report on a co-author’s changes)]

Much more helpful than a Dropbox “[Name] made changes to [File]” notification. :)

  4. Allows you to contribute to open source projects: This may not be something you’re planning to jump into, but learning git will make this an option. If you find that a package you use doesn’t have a feature you need, the ability to use git will make it possible for you to add that feature to the package, not only allowing you to do what you want to do, but also making that fix available to the broader community.
  5. Allows you to make your project open source so others can contribute to your project: The time may come when you want to write a software package and share it with the world. If you know git, you can also share that code in a way that makes it easy for other people to contribute code and improve that package. Open source isn’t just for big things like the Python programming language – it’s also used for lots of little projects, like packages for simulating electoral boundaries to study gerrymandering, or tools to make it easy to access campaign finance data (sorry, I’m a political scientist, so I’m doing poli sci examples, but you get the idea). And you’ll be amazed how many people contribute to these kinds of projects. Indeed, even these tutorials have a github repository where people can submit improvements!
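To make the first and third benefits above a little more concrete, here’s a rough sketch of what keeping an archive and inspecting changes can look like at the command line. It runs in a throwaway temporary directory, and the file name analysis.py, the commit messages, and the author identity are all made up for illustration.

```shell
# Scratch repository in a temporary directory (nothing here touches a real project).
archive_repo="$(mktemp -d)"
cd "$archive_repo"
git init -q
git config user.name "Example Author"        # placeholder identity, local to this repo
git config user.email "author@example.com"

# Version 1: the analysis file includes the code for your favorite graph.
printf 'load_data()\nplot_favorite_graph()\n' > analysis.py
git add analysis.py
git commit -q -m "Add analysis with favorite graph"

# Version 2: a co-author deletes the graph line.
printf 'load_data()\n' > analysis.py
git commit -q -am "Simplify analysis"

git log --oneline        # every commit is still on record
git diff HEAD~1 HEAD     # shows exactly which lines the last commit changed

# Bring back the file as it was one commit ago, and record that as a new commit.
git checkout HEAD~1 -- analysis.py
git commit -q -m "Restore favorite graph"
```

Nothing is erased along the way: the commit that deleted the graph stays in the log, and the restored version is simply one more commit on top of it.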

OK, now the bad news: there’s a reason I put so much energy into discussing the value of git before getting into how it works: learning git kinda sucks. I mean, it’s not painful like performing an appendectomy on yourself without anesthesia, and it’s not hard like quantum mechanics or geometric topology; it’s definitely something anyone can learn. But there’s no pretending that git is user friendly, and you’re sure to have a couple of moments when you’ll find yourself thinking (rightly) “why on Earth did they do it like that?”. So before you dive in, it’s good to have the expectation that there’s an initial uphill slog to learning git before it becomes really useful. But I promise: it pays off big time.

Git versus Github

As I alluded to earlier, though they’re almost always used together, git and github are actually two different things:

  • git is the program that keeps track of changes in your code and helps you manage multiple people working on code at the same time.
  • github is a service that hosts a copy of your project in the cloud so you and your co-authors can easily share project changes. Github also has a great interface for reviewing changes to a project in a user-friendly manner, and it has an “issue tracker” system for hosting conversations about things that need to be done in a project.

And when it comes to what is user-friendly and what is not, git is the kinda awful thing to learn, and github is magical.

2. Learning Git

Because it’s such a central tool in software development, there are lots of tutorials online for learning git, and it would be silly for me to try and write my own. However, I think most do a so-so job of offering users a general sense of what working with git is like. With that in mind, below is a simple overview of working with git, after which is a link to what I think is probably the best git tutorial I’ve found. Once you’ve completed the tutorial, please come back here to learn a little bit more about data-science-specific git issues (like how to put datasets into a git repository).

Initializing a git repo

Much of the first part of a git tutorial will be configuring your installation and creating a new git repo (repo is short for repository, which is what one calls a git project). Here’s the good news: you don’t have to remember any of this. If for any reason you have to change your configuration some day, you can google directions, and you can do the same when you need to make a new project. So don’t stress this part.
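For a sense of scale, the setup usually boils down to a handful of commands like the ones sketched below. This is only an illustration: the name and email are placeholders, and I’ve kept the settings local to a scratch repo in a temporary directory (tutorials typically add --global to the config commands so the settings apply to every repo on your machine).

```shell
# Create a new, empty repository in a throwaway directory.
project="$(mktemp -d)"
cd "$project"
git init -q

# Tell git who you are. These values are placeholders; without --global
# they are stored only in this repository's own configuration.
git config user.name "Your Name"
git config user.email "you@example.com"

git config --get user.name   # read a setting back to check it
git status                   # confirms you are inside a (still empty) repo
```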

A Normal Day Working with Git

Once your team has a git repository set up, a normal day goes like this:

  • You start by using the pull command to integrate the most recent changes your co-authors have made from github. This keeps you up to date.
  • You edit the code you want to edit.
  • You then commit these changes to the project (i.e. tell git you feel good about the changes you’ve made and want them to really be part of the project). You do this by staging the files you want in the commit (you can include as many changes as you want in a commit – it could be 8 new analysis files, or a single change to one line of code in a single file). Then you formally commit those staged changes.
  • When you’re done, you push those changes, which is when the changes you’ve committed on your own computer are shared with the project on github.
  • If by some chance a co-author has also been working at the same time as you, you may have to pull their new changes first, and if by some chance you both edited the same lines of code (or same lines of a latex or text document), you may have to then “resolve conflicts” and push again.
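The steps above can be sketched at the command line roughly as follows. To keep the example runnable without a github account, a local “bare” repository stands in for the copy github would host; the directory names (seed, hub.git, mine), the file analysis.py, and the identities are all hypothetical.

```shell
# --- One-time setup for this sketch only: a local stand-in for github ---
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q seed
cd seed
git config user.name "Co Author"
git config user.email "coauthor@example.com"
echo "x = 1" > analysis.py
git add analysis.py
git commit -q -m "Initial analysis"
cd "$sandbox"
git clone -q --bare seed hub.git   # plays the role of the project on github
git clone -q hub.git mine          # your own working copy
cd mine
git config user.name "Your Name"
git config user.email "you@example.com"

# --- A normal day ---
git pull                               # 1. integrate co-authors' latest changes
echo "y = 2" >> analysis.py            # 2. edit the code
git add analysis.py                    # 3. stage the files you want in the commit...
git commit -q -m "Add y to analysis"   #    ...and commit them
git push -q                            # 4. share your commits with the hosted copy

# If the push is rejected because a co-author pushed first: pull again,
# resolve any conflicts git reports, and then push once more.
```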

That’s it! There are lots of other things one can do in git, like spinning off parallel versions of the project so you can experiment without ruining the main project (called branching), reverting to old versions (if you realize you screwed something up and just want to go back to what you had last week), and more, but those are kinda secondary.
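If you’re curious what branching looks like, here is a minimal sketch under made-up assumptions: a scratch repo in a temporary directory, a hypothetical file paper.txt, and a branch I’ve arbitrarily named experiment. The name of the main branch is looked up rather than assumed, since it differs between git setups.

```shell
branch_repo="$(mktemp -d)"
cd "$branch_repo"
git init -q
git config user.name "Your Name"         # placeholder identity
git config user.email "you@example.com"
printf 'intro\nmethods\nconclusion\n' > paper.txt
git add paper.txt
git commit -q -m "First draft"
main_branch="$(git rev-parse --abbrev-ref HEAD)"   # "main" or "master", depending on setup

# Spin off a parallel version and experiment there.
git checkout -q -b experiment
printf 'intro\nmethods\nnew conclusion\n' > paper.txt
git commit -q -am "Try a new conclusion"

# The main version is untouched while you experiment.
git checkout -q "$main_branch"
cat paper.txt

# Happy with the experiment? Fold it back into the main version.
git merge -q experiment
cat paper.txt
```

If the experiment had gone badly, you could simply stay on the main branch and never merge it.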

Command line versus Graphical User Interface

Git is most commonly used from the command line (called Terminal on Macs): the text-based interface for your operating system. There are graphical user interfaces for git, but most don’t quite implement all of git’s features, so most tutorials teach git as a command line tool.

If you have any comfort with the command line, learning git from the command line is probably the best way to start. You can later move to a graphical user interface, but knowing how to use the command line version means you can always do anything git can do, and online help will be easier to find.

If you aren’t comfortable with the command line, you have two choices:

If you do eventually want a GUI, I recommend either Github Desktop (very clean and easy, but can’t do everything git can do) or Sourcetree (busy, but more powerful and flexible).

Tutorials

I recommend git-it. Unlike many other tutorials, it asks you to do things on your own computer, not in an unrealistically easy-to-use terminal in the web browser, which gives a more authentic tutorial experience. Just follow the directions here to get started.

3. git-lfs

One thing that’s unique to using git for data science is that we often want to put datasets into our git repositories. Unfortunately, git by itself can’t really handle datasets efficiently. To solve this problem, we use git-lfs (git Large File Storage). You can learn all about it here!
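As a small taste, here’s a sketch of telling git-lfs to manage CSV files in a scratch repo. It assumes the separate git-lfs extension is installed (the script checks and skips the steps otherwise), and the "*.csv" pattern, file name, and identity are just examples.

```shell
if git lfs version >/dev/null 2>&1; then
  lfs_repo="$(mktemp -d)"
  cd "$lfs_repo"
  git init -q
  git config user.name "Your Name"        # placeholder identity
  git config user.email "you@example.com"

  git lfs install --local   # set up the lfs filters for this repository only
  git lfs track "*.csv"     # writes a rule for CSV files into .gitattributes
  cat .gitattributes        # shows the rule git-lfs just recorded

  # Commit the rule together with a (tiny, made-up) dataset.
  printf 'id,value\n1,3.14\n' > data.csv
  git add .gitattributes data.csv
  git commit -q -m "Track CSV datasets with git-lfs"
  git lfs ls-files          # lists the files git-lfs is managing
else
  echo "git-lfs is not installed; skipping this sketch"
fi
```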
