Git is a powerful, widely used distributed version control system. Knowing and understanding its features is important, because they give you a lot of freedom in managing source code. In our articles about Git we are going to explain the fundamentals and then dive into topics that matter to everyone who uses Git heavily on a daily basis. There are no real prerequisites for reading these articles; however, we expect our readers to be at least familiar with the general concepts of a version control system (VCS), such as pulling, updating, committing, pushing, branching and tagging.
Before we discuss specific features, let’s talk about what makes Git special in comparison with other VCSs, such as Subversion, Mercurial and Darcs.
Git is a distributed version control system (DVCS), which means that nearly all commands are performed locally and need no information from another computer. This avoids network latency, so most operations are quick and can be performed without a network connection. Moreover, it gives you the opportunity to experiment without disturbing the work of other team members.
In other words, almost everything is local. You can commit, tag, create and switch branches, and view and search history without a connection to a remote server. A repository can even exist without any remotes at all (which rarely makes sense). The commands that do require a connection are the ones that exchange data with a remote, such as push and fetch.
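To make the "everything is local" point concrete, here is a minimal sketch that commits, tags, branches and inspects history in a throwaway repository that has no remote at all. The directory, file name, tag name, branch name and identity are placeholders chosen for illustration.

```shell
set -e
# Work in a throwaway directory; nothing below talks to a network.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "first draft" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"

git tag v0.1                               # tag the current commit
git checkout --quiet -b experiment         # create a branch and switch to it
echo "an idea" >> notes.txt
git commit --quiet -am "Try an idea"

git log --oneline                          # view history
remotes=$(git remote)                      # stays empty: no remote is configured
commits=$(git rev-list --count HEAD)
echo "remotes: '${remotes}', commits on this branch: ${commits}"
```

Every command above completes without a remote being configured, which is the behaviour the paragraph describes.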
Nearly all Git operations only add data to the Git database. Once something is committed, it is hard to do anything that cannot be undone or to erase data. It’s not impossible, though, and we will cover this a little bit later.
Every piece of data is checksummed before it is stored in the Git database and is then referred to by that checksum. This means that it’s impossible to change the contents of any file or directory without Git knowing about it. This feature is part of the Git philosophy and is built into the lowest levels of the system.
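One way to see this content addressing in action is git hash-object, which prints the checksum Git would assign to some content without storing it. The sketch below only assumes that Git is installed; the sample strings are arbitrary.

```shell
# Identical content always yields the same checksum ...
h1=$(printf 'hello\n' | git hash-object --stdin)
h2=$(printf 'hello\n' | git hash-object --stdin)
# ... while changing a single character yields a completely different one.
h3=$(printf 'hello!\n' | git hash-object --stdin)

echo "hello  -> $h1"
echo "hello  -> $h2"
echo "hello! -> $h3"
```

Because an object's name is derived from its content, any change to the content necessarily shows up as a different name.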
To calculate the checksum, Git uses the SHA-1 hash algorithm, which returns a 40-character string of hexadecimal characters (0–9 and a–f) calculated from the contents of a file or directory structure in Git. You’ve probably already noticed long hashes like 66e855fd2a52ea55fb75fc4bf2e63875eb58bc29 all over the place. A SHA-1 hash is 160 bits long, which gives 2^160 possible values. That’s a lot. Fortunately, Git can abbreviate these hashes in most cases. By default, an abbreviated hash is the first 7 characters (recent versions of Git may use more in large repositories to keep the abbreviation unambiguous). The hash mentioned above would therefore be abbreviated to 66e855f. We can play with hashes using the git rev-parse command.
$ git rev-parse --short 66e855fd2a52ea55fb75fc4bf2e63875eb58bc29
66e855f
$ git rev-parse 66e855f
66e855fd2a52ea55fb75fc4bf2e63875eb58bc29
Most other VCSs store information as a list of file-based changes: the data is treated as a set of files plus the changes made to each of them over time (see figure). This makes some operations harder and slower. For example, switching between revisions that are separated by a large set of changes may take a while just to compute the content of each file.
Git doesn’t think about data this way. Whenever you commit, Git takes a snapshot of what your files look like at that moment (see figure). For efficiency, files that haven’t changed are not stored again; the new snapshot simply refers to the identical file that is already stored. Git therefore treats its data as a series of snapshots.
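The snapshot model can be observed directly. The following sketch (file names and identity are made up for the demo) records two commits and compares the object that each snapshot holds for every file, using the `<commit>:<path>` syntax of git rev-parse.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "stable" > unchanged.txt
echo "v1" > changed.txt
git add .
git commit --quiet -m "First snapshot"

echo "v2" > changed.txt
git commit --quiet -am "Second snapshot"

# <commit>:<path> resolves to the object recorded for that path in that snapshot.
old_unchanged=$(git rev-parse HEAD~1:unchanged.txt)
new_unchanged=$(git rev-parse HEAD:unchanged.txt)
old_changed=$(git rev-parse HEAD~1:changed.txt)
new_changed=$(git rev-parse HEAD:changed.txt)

echo "unchanged.txt: $old_unchanged -> $new_unchanged"
echo "changed.txt:   $old_changed -> $new_changed"
```

The untouched file resolves to the same object in both commits, so it is stored only once, while the edited file gets a new object in the second snapshot.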
This gives one major advantage: switching revisions is as fast as swapping two snapshots. That applies both to checking out an older version of the project and to switching to another branch. Add the ability to perform all these operations locally, and you get an almost instant switch.
Nevertheless, this approach has a disadvantage: it requires more space. It’s pretty common to see overgrown repositories simply because Git stores snapshots of files instead of deltas. This becomes very noticeable when you are dealing with large binary files such as images. That’s why solutions like Git LFS are needed and popular.
However, there is good news. Git has some optimizations, which work best for non-binary files. It can combine multiple objects into single files known as packfiles. A packfile holds many objects, stored with an efficient delta compression scheme, in a single file. You can think of it as something like a Zip archive of objects, which Git can extract efficiently when needed. You can force Git to pack everything that can be packed by calling git gc. However, Git also does this automatically, for example when too many loose objects have accumulated or when you push. So it is taken care of for you.
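To watch packing happen, git count-objects -v reports how many objects are stored loose (`count`) and how many live inside packfiles (`in-pack`). The sketch below uses a throwaway repository with a made-up file and identity; the exact numbers depend on the repository.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

# Three small commits create a handful of loose objects.
for i in 1 2 3; do
  echo "line $i" >> log.txt
  git add log.txt
  git commit --quiet -m "Commit $i"
done

loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
git gc --quiet                             # pack everything that can be packed
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose objects before gc: $loose_before"
echo "loose objects after gc:  $loose_after"
echo "objects in packfiles:    $packed"
```

After git gc the objects have moved from individual files under .git/objects into a packfile under .git/objects/pack.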
Git strongly encourages you to use branches for parallel feature development. One of the reasons (as we have mentioned before) is fast switching between them. Another reason is that branches are very cheap. A branch is just a pointer to a commit: it is stored as a file containing 41 characters (the 40 characters of the SHA-1 hash and a newline). Keeping in mind that branch operations are also local, you get a really powerful and cheap feature.
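The "branch is just a file" claim is easy to check. This sketch assumes a freshly created repository with Git's default file-based ref storage and SHA-1 object names; in an older repository the ref may have been moved into .git/packed-refs instead. The branch name `feature` and the identity are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "hello" > readme.txt
git add readme.txt
git commit --quiet -m "Initial commit"

git branch feature                         # creating a branch writes one small file
ref_file=.git/refs/heads/feature

cat "$ref_file"                            # the commit hash the branch points to
size=$(( $(wc -c < "$ref_file") ))         # 40 hex characters plus a newline
tip=$(git rev-parse feature)
echo "size of $ref_file: $size bytes"
```

Creating the branch copied no project files; Git only wrote a single 41-byte reference.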
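All files that are tracked by Git can be in one of three states: modified (changed but not yet staged), staged (marked to go into the next commit) or committed (safely stored in the Git database).
Accordingly, a Git project has three main sections: the working directory, the staging area (index) and the Git directory.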
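As a rough illustration, assuming the three states meant here are the usual modified, staged and committed, the sketch below walks one file through all of them and records what git status --short reports at each step. The file name and identity are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "v1" > file.txt
git add file.txt
git commit --quiet -m "Add file"

echo "v2" > file.txt                       # edit in the working directory
modified=$(git status --short)             # " M file.txt": modified, not staged

git add file.txt                           # copy the change into the staging area
staged=$(git status --short)               # "M  file.txt": staged for the next commit

git commit --quiet -m "Update file"        # store the snapshot in the Git directory
committed=$(git status --short)            # empty: nothing left to commit

echo "modified:  '$modified'"
echo "staged:    '$staged'"
echo "committed: '$committed'"
```

In the short status format the first column describes the staging area and the second the working directory, which is why the `M` moves from the second column to the first after git add.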
We’ve now covered some important fundamental features of Git that should help you better understand how Git works. In the next article we are going to talk about branches: what they are and what operations can be performed on them. So stay tuned.