Under-the-hood of Git

For many years now Git has been the SCM (source control management, also known as version control) of choice. It offered many features which alternatives such as CVS did not, and combined with the GitHub website it created an entire CI pipeline around which any team's development practices could be built.

When I began reading about the mechanics of Git it was obvious that it is a combination of many different techniques, all of which produce the “replicated versioned file system” known as Git, for example:

  • Linked lists
  • File system object database
  • Hashing (stat SHA-1 vs content SHA-1 vs content Deflate)
  • Differential encoding

So I decided to create a mini-working version with some of the core version control features. Thankfully there are many helpful books which break down how things work, so I have attempted to strip the internals down to their bare minimum.

This post will focus on:

  • repositories
  • working directories
  • staging
  • committing
  • status checks

I have omitted packfiles, deltas, branches, tags, merging and comparing staged chunks (diffing). I may do a follow-up post/repository on those.

This is part of my “under-the-hood of” series:

The article today will be broken down into:

1. Overview

  • Workflow
  • Object model
  • Components
  • Additional reading

2. Building our own Git

  • Our git code
  • Testing it works

3. What have we missed?

1: Overview

Git is described as a distributed version-control system, which tracks changes in any set of files. It was initially released 15 years ago (in 2005) and has grown massively in functionality and popularity since then. As any developer who uses GitHub (or an alternative such as Bitbucket or GitLab) knows, it has become a staple of the software world as a best practice.

I am not going to review how it is used but the basic workflow can be summarised by:

  1. A new Git repository is initialised
  2. One or more files are changed locally and saved
  3. The files are added to staging
  4. The files in the staging area are committed
  5. The commit is pushed to a remote repository (pulling the latest before doing so).

We will break down each step, but before we do we need to review the mechanism at the core of Git, the “Object model”.

The object model is essentially an incredibly efficient versioned file system (with replication).

Each file in the repository exists both in the file system and in the object database. The object database stores each item under a hash of its contents, and each stored item is called an object. There are 4 object types in total, but today we will look at 3 of them (excluding “tags”):

  1. Blob -> a sequence of bytes. A blob in Git contains exactly the same data as a file; it is just stored in the Git object database. Basically the file contents.
  2. Tree -> corresponds to UNIX directory entries. It can contain blobs or sub-trees (sub-directories). The commit tree holds the entire project as blobs and trees at the time of the commit, so the entire project can be recreated from that tree. It always starts from the root directory, even if the commit only updates a file in a sub-directory.
  3. Commit -> a single tree id plus the ids of the commits preceding it (its parents).

Each tree node, commit and file has its own unique 40-character SHA-1 representation. The filename is a hash of the contents, so if the contents change, so does the hash. Each change adds a new entry/hash while the old ones are kept.

Inside a git repository they are found under the .git/objects folder.
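To make the relationship between the three object types concrete, here is a minimal in-memory sketch written as a standalone Node.js script. It is not Git's real storage format: the `db` map, the JSON serialisation and the helper names (`put`, `checkoutTree`, `history`) are all assumptions made for illustration. It only shows that a commit points at a root tree, trees point at blobs or sub-trees, and commits chain to their parents like a linked list.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical in-memory object database: hash -> object.
const db = new Map();

// Store an object under the SHA-1 of its serialised form and return that hash.
// (JSON is used for brevity; real Git has its own binary formats.)
function put(obj) {
  const hash = createHash('sha1').update(JSON.stringify(obj)).digest('hex');
  db.set(hash, obj);
  return hash;
}

// Blobs hold file contents, trees hold named entries, commits hold a tree id and a parent id.
const blob = (content) => put({ type: 'blob', content });
const tree = (entries) => put({ type: 'tree', entries });
const commit = (treeHash, parent, message) =>
  put({ type: 'commit', tree: treeHash, parent, message });

// Recreate every file reachable from a tree, always starting at the root.
function checkoutTree(treeHash, prefix = '') {
  const files = {};
  for (const [name, hash] of Object.entries(db.get(treeHash).entries)) {
    const obj = db.get(hash);
    if (obj.type === 'tree') Object.assign(files, checkoutTree(hash, `${prefix}${name}/`));
    else files[prefix + name] = obj.content;
  }
  return files;
}

// Follow parent pointers like a linked list, newest commit first.
function history(commitHash) {
  const messages = [];
  for (let h = commitHash; h; h = db.get(h).parent) messages.push(db.get(h).message);
  return messages;
}

const first = commit(tree({ 'one.txt': blob('one') }), null, 'first');
const second = commit(
  tree({ 'one.txt': blob('one'), two: tree({ 'three.txt': blob('three') }) }),
  first,
  'second'
);

console.log(checkoutTree(db.get(second).tree)); // both files, keyed by path
console.log(history(second)); // [ 'second', 'first' ]
```

Because the key is derived from the contents, storing the same blob twice yields the same hash, which is why unchanged files cost nothing extra between commits.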

This is my favourite image to describe the structure.

Hash

Within the object model, the filename is a one-way SHA-1 hash of the contents.

Git prefixes any blob object with the word blob and a space, followed by the length (as a human-readable integer), followed by a NUL character. Example:

Equivalent to

Object file contents are compressed with the DEFLATE algorithm via zlib. The result is less human-readable and not filename-friendly, but it is a more efficient encoding.
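The header-plus-content hashing and the DEFLATE step can be sketched in a few lines of Node.js. The function names (`blobObject`, `hashBlob`, `compressBlob`) are made up for this illustration; the header layout is the one described above, and the standard `crypto` and `zlib` modules do the work.

```javascript
const { createHash } = require('node:crypto');
const { deflateSync, inflateSync } = require('node:zlib');

// Build the bytes that get hashed for a blob: "blob <byte length>\0<content>".
function blobObject(content) {
  const body = Buffer.from(content);
  const header = Buffer.from(`blob ${body.length}\0`);
  return Buffer.concat([header, body]);
}

// The object id is the SHA-1 of header + content, written as 40 hex characters.
function hashBlob(content) {
  return createHash('sha1').update(blobObject(content)).digest('hex');
}

// What lands on disk is the zlib-deflated header + content.
function compressBlob(content) {
  return deflateSync(blobObject(content));
}

console.log(hashBlob('hello\n')); // ce013625030ba8dba906f756967f9e9ca394464a
console.log(JSON.stringify(inflateSync(compressBlob('hello\n')).toString())); // "blob 6\u0000hello\n"
```

The first value should match what `git hash-object` reports for a file containing `hello` and a newline, which is a handy way to check the sketch against a real Git install.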

Components

I will be covering the components we will be building in our mini-working version.

Working directory

The current system folder containing the Git repository, also known as the working tree.

HEAD

A file holding a ref to the current working branch, basically the last checked-out workspace. It holds a reference to the parent commit, usually the last branch checked out.

Found in the file .git/HEAD. Example

Branches

A branch is actually just a named pointer to a specific snapshot. When it is checked out, Git:

  1. moves the HEAD pointer to point to the feature ref (branch)
  2. moves all content from the current branch in the repository into the index file, so it is easy to track changes
  3. makes the working dir match the content of the commit being pointed to (using tree and blob objects to update the working dir contents)
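Step 1 of a checkout amounts to rewriting one small file. The sketch below, a standalone Node.js script, reads a HEAD file of the form `ref: refs/heads/<branch>`, follows it to the commit id stored in the ref file, and repoints HEAD at another branch. The helper names (`resolveHead`, `pointHeadAt`), the branch names and the placeholder commit ids are assumptions for the demo, and it ignores details of real Git such as packed refs.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read HEAD and follow a symbolic ref ("ref: refs/heads/<branch>") to a commit id.
function resolveHead(gitDir) {
  const head = fs.readFileSync(path.join(gitDir, 'HEAD'), 'utf8').trim();
  // If HEAD holds a raw commit id rather than a ref, there is no branch to follow.
  if (!head.startsWith('ref: ')) return { branch: null, commit: head };
  const ref = head.slice('ref: '.length);
  const refFile = path.join(gitDir, ref);
  const commit = fs.existsSync(refFile) ? fs.readFileSync(refFile, 'utf8').trim() : null;
  return { branch: ref.replace('refs/heads/', ''), commit };
}

// Step 1 of a checkout: repoint HEAD at another branch ref.
function pointHeadAt(gitDir, branch) {
  fs.writeFileSync(path.join(gitDir, 'HEAD'), `ref: refs/heads/${branch}\n`);
}

// Demo in a throwaway directory with two made-up branches and placeholder ids.
const gitDir = fs.mkdtempSync(path.join(os.tmpdir(), 'head-demo-'));
fs.mkdirSync(path.join(gitDir, 'refs', 'heads'), { recursive: true });
fs.writeFileSync(path.join(gitDir, 'refs', 'heads', 'main'), 'a'.repeat(40) + '\n');
fs.writeFileSync(path.join(gitDir, 'refs', 'heads', 'feature'), 'b'.repeat(40) + '\n');

pointHeadAt(gitDir, 'main');
console.log(resolveHead(gitDir)); // branch "main", commit made of a's
pointHeadAt(gitDir, 'feature');
console.log(resolveHead(gitDir)); // branch "feature", commit made of b's
```

Steps 2 and 3 (refreshing the index and rewriting the working directory) are left out here; they need the tree and blob objects of the target commit.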

Tags

An alias for a commit id. HEAD can point to the latest commit or to a predefined one such as a tag, e.g. .git/refs/tags/<tag_name>

Repository

A Git project stored on disk, i.e. not in-memory. Essentially a collection of objects.

Staging

The area between the working directory and the repository. All changes in staging will be in the next commit.

Index file

The index is a binary file. It does not hold objects (blobs/trees); it stores info about the files in the repository. It is a virtual working tree state.

The index file is located at .git/index. You can see the status of the Index file via > git ls-files --stage

Information stored

For each file it stores

  • time of last update
  • name of file
  • file version in working dir
  • file version in index
  • file version in repository

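For the purposes of this post, file versions are marked with checksums: a SHA-1 hash of the stat() data, not a hash of the contents. This is more efficient because the file contents do not have to be read. (Real Git keeps the raw stat() fields next to each entry's content SHA-1 and compares those fields first, which gives the same saving.)

Refresh

It is updated when you check out a branch or when the working directory is updated. This runs in the background automatically.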

Hashing

It uses a filesystem stat() call to get each file's information, so it can quickly check whether the working tree file content has changed from the version recorded in the index file. It checks the file modification time under st_mtime.

The refresh literally calls stat() for all files.
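A stat-based check like this is cheap to sketch in Node.js. The snippet below hashes two stat() fields, the modification time and the size, and shows that the hash changes after an edit without the contents ever being read for comparison. The choice of fields and the name `statHash` are assumptions for this illustration, not what Git itself records.

```javascript
const { createHash } = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hash a couple of stat() fields instead of the file contents.
// (Which fields to include is an assumption made for this sketch.)
function statHash(file) {
  const { mtimeMs, size } = fs.statSync(file);
  return createHash('sha1').update(`${mtimeMs}:${size}`).digest('hex');
}

// Create a throwaway file and record its hash.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'stat-demo-'));
const file = path.join(dir, 'one.txt');
fs.writeFileSync(file, 'one');
const before = statHash(file);

// Simulate a later edit: new content and a modification time one minute ahead.
fs.writeFileSync(file, 'one, edited');
const later = new Date(Date.now() + 60000);
fs.utimesSync(file, later, later);

console.log(before === statHash(file)); // false
```

The trade-off is that a stat hash says only that something about the file changed; working out what changed still requires reading the contents.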

The main goal of this post is the mini-working version below, so we have only touched briefly on how Git works. Here are websites which go into far more detail

2: Building our own Git

The code consists of 5 files: one for each of the 4 commands, plus a util.

  • init.mjs
  • status.mjs
  • add.mjs
  • commit.mjs
  • util.mjs

init.mjs

(1) Grab all the files from the current working directory
(2) Build the index file using the SHA-1 hash of each file's stat()
(3) Write a repository folder under .repo
(4) Inside the repository write a HEAD file and an objects folder
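The init steps above can be sketched as a standalone Node.js script. This is not the original init.mjs: the JSON index stored at `.repo/index`, the entry shape `{ cwd, staging, repository }`, the empty HEAD file and the helper names are all assumptions made so the example is self-contained.

```javascript
const { createHash } = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// SHA-1 over a couple of stat() fields (field choice is an assumption).
const statHash = (file) => {
  const { mtimeMs, size } = fs.statSync(file);
  return createHash('sha1').update(`${mtimeMs}:${size}`).digest('hex');
};

// (1) Recursively list files relative to root, skipping the repository folder itself.
function listFiles(root, dir = '') {
  return fs.readdirSync(path.join(root, dir), { withFileTypes: true }).flatMap((entry) => {
    if (entry.name === '.repo') return [];
    const rel = path.posix.join(dir, entry.name);
    return entry.isDirectory() ? listFiles(root, rel) : [rel];
  });
}

function init(root) {
  // (2) One index entry per file; only the working directory version is known so far.
  const index = {};
  for (const rel of listFiles(root)) {
    index[rel] = { cwd: statHash(path.join(root, rel)), staging: '', repository: '' };
  }
  // (3) + (4) Create .repo with an objects folder, an empty HEAD and the index.
  const repo = path.join(root, '.repo');
  fs.mkdirSync(path.join(repo, 'objects'), { recursive: true });
  fs.writeFileSync(path.join(repo, 'HEAD'), '');
  fs.writeFileSync(path.join(repo, 'index'), JSON.stringify(index, null, 2));
  return index;
}

// Demo against a throwaway working directory shaped like the test project.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'init-demo-'));
fs.mkdirSync(path.join(root, 'two'));
fs.writeFileSync(path.join(root, 'one.txt'), 'one');
fs.writeFileSync(path.join(root, 'two', 'three.txt'), 'three');
fs.writeFileSync(path.join(root, 'two', 'four.txt'), 'four');

console.log(Object.keys(init(root)).sort()); // [ 'one.txt', 'two/four.txt', 'two/three.txt' ]
```

Keeping the index as JSON makes it easy to inspect while experimenting; Git's own index is a binary file.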

status.mjs

(1) Grab the index data
(2) For each item in the index data
(2a) Grab the SHA-1 hash of the file's stat()
(2b) If it doesn't match the stored working dir hash of the file, flag as changed, not staged
(2c) If it does match the above but doesn't match staged, flag as not staged
(2d) If it does match staged but not repository, flag as not committed
(3) Update the index file
(4) Output local changes not staged
(5) Output staged changes not committed
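The comparison logic in steps (2b) to (2d) is the heart of the status check, and it can be sketched as a pure function. The index entry shape `{ cwd, staging, repository }` and the short placeholder hashes are assumptions for this illustration; step (3), writing the refreshed index back, is left out.

```javascript
// Classify one index entry given the file's current stat() hash.
// The entry shape { cwd, staging, repository } is a hypothetical layout for this sketch.
function classify(entry, currentHash) {
  if (currentHash !== entry.cwd) return 'changed, not staged'; // (2b)
  if (currentHash !== entry.staging) return 'not staged'; // (2c)
  if (entry.staging !== entry.repository) return 'not committed'; // (2d)
  return 'clean';
}

// Group every path in the index by state; currentHashes maps path -> current stat() hash.
function status(index, currentHashes) {
  const report = { notStaged: [], notCommitted: [] };
  for (const [file, entry] of Object.entries(index)) {
    const state = classify(entry, currentHashes[file]);
    if (state === 'not committed') report.notCommitted.push(file);
    else if (state !== 'clean') report.notStaged.push(file);
  }
  return report;
}

// Placeholder hashes: one.txt is staged, three.txt was edited after a commit, four.txt is new.
const index = {
  'one.txt': { cwd: 'h1', staging: 'h1', repository: '' },
  'two/three.txt': { cwd: 'h3', staging: 'h3', repository: 'h3' },
  'two/four.txt': { cwd: 'h4', staging: '', repository: '' },
};
const current = { 'one.txt': 'h1', 'two/three.txt': 'h3-edited', 'two/four.txt': 'h4' };

console.log(status(index, current));
// { notStaged: [ 'two/three.txt', 'two/four.txt' ], notCommitted: [ 'one.txt' ] }
```

The order of the checks matters: a file that changed on disk is reported as a local change even if an older version of it is still staged.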

add.mjs

(1) Explicitly give files, e.g. one.txt and two/three.txt
(2) For each file, get the SHA-1 of its contents and use it for the directory name and filename
(3) Get the DEFLATED value and use it for the content
(4) Get the SHA-1 value of the file's stat()
(5) Update the index
(5a) If the file was not touched, just proxy the values
(5b) If the file was touched, update staging for the file
(6) Override the old index data with the new index data
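As a rough sketch of the add steps above, the standalone Node.js script below writes a compressed blob for each given file and returns an updated index. It is not the original add.mjs: the entry shape `{ cwd, staging, repository }`, the `.repo/objects` location and the helper names are assumptions, and the content is hashed without Git's `blob` header, following the steps as listed.

```javascript
const { createHash } = require('node:crypto');
const { deflateSync } = require('node:zlib');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

const sha1 = (data) => createHash('sha1').update(data).digest('hex');
const statHash = (file) => {
  const { mtimeMs, size } = fs.statSync(file);
  return sha1(`${mtimeMs}:${size}`);
};

// Stage the given files: write a blob object for each and return the new index.
function add(root, index, files) {
  const next = {};
  // (5a) Entries for files that were not passed in are copied across unchanged.
  for (const [file, entry] of Object.entries(index)) next[file] = { ...entry };
  for (const file of files) {
    const content = fs.readFileSync(path.join(root, file));
    const hash = sha1(content); // (2) content SHA-1 names the directory and file
    const dir = path.join(root, '.repo', 'objects', hash.slice(0, 2));
    fs.mkdirSync(dir, { recursive: true });
    fs.writeFileSync(path.join(dir, hash.slice(2)), deflateSync(content)); // (3)
    const stat = statHash(path.join(root, file)); // (4)
    // (5b) The file was touched, so its staging version now matches the working directory.
    next[file] = { repository: '', ...next[file], cwd: stat, staging: stat };
  }
  return next; // (6) the caller overwrites the stored index with this
}

// Demo in a throwaway working directory.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'add-demo-'));
fs.writeFileSync(path.join(root, 'one.txt'), 'one');
fs.writeFileSync(path.join(root, 'two.txt'), 'two');
const empty = { cwd: '', staging: '', repository: '' };
const index = add(root, { 'one.txt': { ...empty }, 'two.txt': { ...empty } }, ['one.txt']);

console.log(index); // one.txt has matching cwd/staging hashes, two.txt is untouched
```

Returning a fresh index rather than mutating the old one keeps the "override old index data with new index data" step explicit.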

commit.mjs

(1) Grab the list of files to commit
(2) Build a tree for the files in staging or committed, excluding those only in the working dir
(3) Iterate the items of the root “tree” into a flattened array of trees
(3a) If it is a tree, create a tree for its children
(3b) Then add the children to the flattened tree
(3c) If it is not a tree, push it with the previous tree
(4) Create the tree object for the root
(5) Create the commit object, using the parent commit (if one exists) and the tree hash
(6) From the commit object get the commit hash
(7) Update the index file
(7a) If the staging hash does not match the repository hash then update it. An existing file has been updated.
(8) Update HEAD with the latest commit
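The commit steps can be sketched with an in-memory object store. This is a simplified stand-in rather than the original commit.mjs: objects are kept in a `Map` instead of on disk, the text formats for trees and commits are invented for readability, the staging hash doubles as the blob id, and it walks the tree recursively instead of flattening it first. All function names here are hypothetical.

```javascript
const { createHash } = require('node:crypto');

const sha1 = (data) => createHash('sha1').update(data).digest('hex');

// Turn flat paths into a nested tree: { 'one.txt': hash, two: { 'three.txt': hash } }.
function buildTree(files) {
  const root = {};
  for (const [file, hash] of Object.entries(files)) {
    const parts = file.split('/');
    const name = parts.pop();
    let node = root;
    for (const part of parts) node = node[part] || (node[part] = {});
    node[name] = hash;
  }
  return root;
}

// Store child trees first so each parent can reference them by hash.
function writeTree(node, objects) {
  const lines = Object.keys(node).sort().map((name) => {
    const child = node[name];
    return typeof child === 'string'
      ? `blob ${child} ${name}`
      : `tree ${writeTree(child, objects)} ${name}`;
  });
  const content = lines.join('\n');
  const hash = sha1(content);
  objects.set(hash, content);
  return hash;
}

// A commit records the root tree and, when there is one, the parent commit.
function writeCommit(treeHash, parent, message, objects) {
  const lines = [`tree ${treeHash}`];
  if (parent) lines.push(`parent ${parent}`);
  lines.push('', message);
  const content = lines.join('\n');
  const hash = sha1(content);
  objects.set(hash, content);
  return hash;
}

function commit(index, head, message, objects) {
  // (1) + (2) Only files that are staged or already committed go into the tree.
  const files = {};
  for (const [file, entry] of Object.entries(index)) {
    if (entry.staging || entry.repository) files[file] = entry.staging || entry.repository;
  }
  const treeHash = writeTree(buildTree(files), objects); // (3) + (4)
  const commitHash = writeCommit(treeHash, head, message, objects); // (5) + (6)
  // (7) Staged versions become the repository versions.
  const next = {};
  for (const [file, entry] of Object.entries(index)) {
    next[file] = entry.staging ? { ...entry, repository: entry.staging } : { ...entry };
  }
  return { head: commitHash, index: next }; // (8) the caller writes head into HEAD
}

const objects = new Map();
const index = {
  'one.txt': { cwd: 'h1', staging: 'h1', repository: '' },
  'two/three.txt': { cwd: 'h3', staging: 'h3', repository: '' },
  'two/four.txt': { cwd: 'h4', staging: '', repository: '' },
};
const first = commit(index, null, 'first commit', objects);

console.log(objects.get(first.head)); // "tree <root tree hash>", a blank line, then the message
console.log(objects.size); // 3 (tree for two/, root tree, commit)
```

Passing `first.head` as the parent of a second call chains the commits together, which is all a history is in this model.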

I have included the helper file but hopefully the names are pretty self-explanatory.

The largest functions are createTreeObject and createCommitObject, both of which:

  1. Process the given contents into a hash
  2. Compress the given contents
  3. Write the compressed contents to the respective directory and file. The first 2 characters of a hash become the directory and the rest the filename.
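Those three steps are shared by every object type, so they can be sketched as one small helper in Node.js. The names `writeObject` and `readObject` are made up for this example and are not the functions in the helper file; the read function is included only to show that the path can be rebuilt from the hash.

```javascript
const { createHash } = require('node:crypto');
const { deflateSync, inflateSync } = require('node:zlib');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hash the contents, compress them, and store them at <objectsDir>/<first 2 chars>/<remaining 38>.
function writeObject(objectsDir, contents) {
  const data = Buffer.from(contents);
  const hash = createHash('sha1').update(data).digest('hex'); // 1. hash
  const compressed = deflateSync(data); // 2. compress
  const dir = path.join(objectsDir, hash.slice(0, 2)); // 3. write
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(path.join(dir, hash.slice(2)), compressed);
  return hash;
}

// Reverse lookup: rebuild the path from the hash and inflate what is stored there.
function readObject(objectsDir, hash) {
  const file = path.join(objectsDir, hash.slice(0, 2), hash.slice(2));
  return inflateSync(fs.readFileSync(file)).toString();
}

// Demo in a throwaway objects folder.
const objectsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'objects-demo-'));
const hash = writeObject(objectsDir, 'tree contents go here');

console.log(hash.slice(0, 2).length, hash.slice(2).length); // 2 38
console.log(readObject(objectsDir, hash)); // tree contents go here
```

Splitting the hash into a 2-character directory and a 38-character filename keeps any single directory from filling up with every object in the repository.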

I wrote a small project to test the version control: 3 files, each with a line of text, 2 of which are inside a folder.

The above scripts are found inside bin/

A working directory / application is found in src/

  • one.txt
  • two/three.txt
  • two/four.txt

Then I wrote some integration tests (test/index.integration.spec.js) to help track what happens to our repository for a given command. The steps (and results) are:

  1. repo:init => created INDEX with the stat() hash of each file in the current working directory
  2. repo:status => flags 3 new local changes not staged (those above)
  3. repo:add one.txt two/three.txt =>
     • should create blob objects, inside 2-character-long directories, with content compressed
     • should update INDEX, moving the items to staged
  4. repo:status => flags 1 new local change not staged and 2 changes not committed
  5. Manually update one.txt
  6. repo:status => similar to the previous run, except it now flags one.txt as locally changed
  7. repo:add one.txt => re-adding the updated file one.txt should update the blob object
  8. repo:status => the re-added file should show alongside the previously added file
  9. repo:add two/four.txt => adds two/four.txt so there are 2 items in the tree object
  10. repo:commit => should create the tree and commit objects and update HEAD and INDEX

3: What have we missed?

As mentioned there are many additional parts to the real Git version control which we have omitted from our library. Some of those are:

  • Comparing change chunks (diffing)
  • Packfiles
  • Deltas
  • Branches
  • Tags
  • Merging

Thanks so much for reading, I learnt a huge amount about Git from this research and I hope it was useful for you. You can find the repository for all this code here.

Thanks, Craig 😃