Git is a popular SCM tool, widely used to maintain code history and to manage project versions. There are many cases where Git can be useful. In this article, I will present one such case, with quite an interesting trick that uses the very low-level structures of the Git system. I’m sure it will help you learn how Git works under the hood.
Annoying “example files”
One of the most common ways to store example files in a Git project is to append a suffix to the file name. To use such a file, we copy it and remove the suffix. The real file should be added to .gitignore, so that a file with sensitive data never gets into the index. After committing the example files, we’re ready to share them with anyone.
For instance, when we have a database configuration file in a Rails project (config/database.yml), we create the example file config/database.yml.example, add config/database.yml to .gitignore, fill in the example file, and commit it.
Easy, isn’t it?
Not really – there are a few problems with that solution:
- We have to choose a suffix that is not matched by .gitignore (obviously).
- We need to remember to add the original file to .gitignore.
- There’s a risk of making a commit that removes the examples if we accidentally delete them in the working directory.
- The .gitignore file, as well as the project’s code history, gets more and more complex and cluttered.
- Are there any good practices? Which extension is better?
- How should such files be committed? Should they go into the same commit as the functionality they support, or into a separate one? Which naming convention suits such a “separate” commit best?
- Which data should be included in the examples?
- What if we don’t want to share those files with others?
- What if we have a lot of example files, and we want to use them instantly when needed?
- What if we want to have more than one version of the example files?
Those are the major issues I encountered while using this workflow. I found it messy and annoying, so I started searching for something more elegant.
To avoid the issues mentioned above, I started to study the Git documentation, hoping to find a solution. I found out that there is a bunch of so-called “plumbing commands” that work on the Git filesystem at a very low level. For instance, there is the git hash-object command, which computes the object ID (a hash) for the content of a file given as an argument:
$ git ls-files
file.txt
$ cat file.txt
test
$ git hash-object file.txt
9daeafb9864cf43055ae93beb0afd6c7d144bfa4
We can also generate such a hash for any data we want:
$ echo "any data" | git hash-object --stdin
f19452ee6b88c31f81dd4f70e83ef9cfedc12478
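The hash is not computed over the raw data alone: Git prepends a short header (the object type, a space, the size in bytes, and a NUL byte) and hashes the result. The sketch below reproduces this by hand. It assumes a Git setup that uses the default SHA-1 object format and a system that provides `sha1sum`; the variable names are only illustrative.

```shell
# Data to hash; echo appends a trailing newline, just like in the example above.
data='test'

# What Git reports for a blob with this content.
git_id=$(echo "$data" | git hash-object --stdin)

# The same value computed by hand: "blob <size>" + NUL byte + content.
size=$(echo "$data" | wc -c | tr -d ' ')
manual_id=$( { printf 'blob %s\000' "$size"; echo "$data"; } | sha1sum | cut -d' ' -f1 )

echo "$git_id"
echo "$manual_id"
```

Both lines should be identical. For the content "test" followed by a newline, this is the 9daeafb… value shown earlier for file.txt.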
Of course, there is no such object in our repository:
$ git show f19452ee6b88c31f81dd4f70e83ef9cfedc12478
fatal: bad object f19452ee6b88c31f81dd4f70e83ef9cfedc12478
$ git cat-file -t f19452ee6b88c31f81dd4f70e83ef9cfedc12478
fatal: git cat-file f19452ee6b88c31f81dd4f70e83ef9cfedc12478: bad file
However, we can create such objects using the -w option:
$ echo -en "content of the first file" | git hash-object -w --stdin
3896fb7b0dd65c5d75bb66abf504b183d7e2c219
$ echo -en "content of the second file" | git hash-object -w --stdin
7b3facd72dde52a7ea5eab22fee1717cf630742f
$ git show 3896fb7b0dd65c5d75bb66abf504b183d7e2c219
content of the first file
$ git show 7b3facd72dde52a7ea5eab22fee1717cf630742f
content of the second file
$ git cat-file -t 3896fb7b0dd65c5d75bb66abf504b183d7e2c219
blob
$ git cat-file -s 3896fb7b0dd65c5d75bb66abf504b183d7e2c219
25
$ git cat-file -p 3896fb7b0dd65c5d75bb66abf504b183d7e2c219
content of the first file
$ git cat-file -t 7b3facd72dde52a7ea5eab22fee1717cf630742f
blob
$ git cat-file -s 7b3facd72dde52a7ea5eab22fee1717cf630742f
26
$ git cat-file -p 7b3facd72dde52a7ea5eab22fee1717cf630742f
content of the second file
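In practice the content usually comes from a file rather than from standard input. The following is a minimal sketch, run in a throwaway repository created with `mktemp`. The file name `database.yml` and its content are made up for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# A hypothetical configuration file we want to store without committing it.
printf 'adapter: sqlite3\n' > database.yml

# -w writes the blob into .git/objects and prints its ID.
blob=$(git hash-object -w database.yml)

# The stored object can be inspected by ID, independently of the file.
type=$(git cat-file -t "$blob")
content=$(git cat-file -p "$blob")
echo "$blob $type"
echo "$content"
```

Once the blob is written, the file in the working directory can be changed or deleted; the stored copy stays available under its ID for as long as the object is not pruned.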
What’s more, the newly created objects appear neither in the working directory nor in the index:
$ git ls-files
file.txt
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
Great! So there is a way to store any data in a Git repository, and leave the project clean and untouched.
The newly created objects are now stored in the .git/objects directory:
$ ls .git/objects/38
96fb7b0dd65c5d75bb66abf504b183d7e2c219
$ ls .git/objects/7b
3facd72dde52a7ea5eab22fee1717cf630742f
Each file contains a header with the object type and size, a NUL byte (invisible in the output below), and the object content; the whole thing is compressed with zlib:
$ openssl zlib -d < .git/objects/38/96fb7b0dd65c5d75bb66abf504b183d7e2c219
blob 25content of the first file
$ openssl zlib -d < .git/objects/7b/3facd72dde52a7ea5eab22fee1717cf630742f
blob 26content of the second file
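The `openssl zlib` subcommand is only available when OpenSSL was built with zlib support, so the commands above may not work everywhere. The type and size from the header can also be read through Git itself. Here is a small sketch in a throwaway repository, using `git cat-file --batch-check`:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

blob=$(printf 'content of the first file' | git hash-object -w --stdin)

# Prints "<object id> <type> <size>" for each ID read from standard input.
info=$(echo "$blob" | git cat-file --batch-check)
echo "$info"
```

The last two fields correspond to the "blob 25" header seen in the decompressed file.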
These objects are unreachable:
$ git fsck --unreachable
unreachable blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219
unreachable blob 7b3facd72dde52a7ea5eab22fee1717cf630742f
It means that when we run git prune, they will disappear. The same will happen when the garbage collector (git gc) runs after some time. To prevent that, we need to make a reference to the objects, for example by making a tree. Blobs are equivalent to files in the Git filesystem, while trees behave like directories.
We can organize our objects into a tree by specifying the file mode, object type, object hash and name of each entry:
$ echo -en "100644 blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219\tfirst.txt\n100644 blob 7b3facd72dde52a7ea5eab22fee1717cf630742f\tsecond.txt" | git mktree
0f2a261623552516bcbc39c5f71904b67f7b38ac
$ git cat-file -t 0f2a261623552516bcbc39c5f71904b67f7b38ac
tree
Note: To add another tree as an entry of an existing tree (to build a hierarchical structure), use the mode 040000 and the type tree.
The tree contains two files now:
$ git ls-tree 0f2a261623552516bcbc39c5f71904b67f7b38ac
100644 blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219	first.txt
100644 blob 7b3facd72dde52a7ea5eab22fee1717cf630742f	second.txt
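The note about mode 040000 can be illustrated with a nested structure. The sketch below builds a tree that contains a subdirectory, in a throwaway repository. It uses `printf` instead of `echo -en` because `printf` behaves the same across shells; the names `config` and `database.yml` are hypothetical.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

blob=$(printf 'adapter: sqlite3\n' | git hash-object -w --stdin)

# Inner tree: a "directory" holding one file.
inner=$(printf '100644 blob %s\tdatabase.yml\n' "$blob" | git mktree)

# Outer tree: references the inner tree under the name "config".
outer=$(printf '040000 tree %s\tconfig\n' "$inner" | git mktree)

# -r recurses into subtrees, --name-only prints just the paths.
paths=$(git ls-tree -r --name-only "$outer")
echo "$paths"
```

The listing shows the file under its full path, config/database.yml, even though no such path exists in the working directory.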
To make the objects survive git prune, we need to create a reference to this tree. For example, we can make a tag:
$ git tag my_data 0f2a261623552516bcbc39c5f71904b67f7b38ac
Now git fsck --unreachable will not show our objects – they are safe. The result is a directory-like structure containing two files, and we have access to them without touching stash, workspace, index, or any structure of the project.
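The effect of the tag can be checked end to end. The sketch below runs in a throwaway repository and compares the `git fsck --unreachable` output before and after tagging. It starts with an empty commit because, as far as I know, fsck does not report unreachable objects in a repository without any refs. The identity passed with `-c` is a placeholder.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Give the repository at least one ref (placeholder identity).
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "initial commit"

blob=$(printf 'content of the first file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)

# Before tagging, both the blob and the tree are unreachable.
before=$(git fsck --unreachable 2>/dev/null)

git tag my_data "$tree"

# After tagging, the tag keeps the tree and its blob reachable.
after=$(git fsck --unreachable 2>/dev/null)

echo "before tagging: $before"
echo "after tagging: $after"
```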
Working with tagged trees
To apply the tree to the working directory, we need to check out the content of the tag (note that git checkout with a path also records the restored files in the index):
$ git checkout my_data .
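The path argument selects which entries of the tree to restore; it does not move them to another location. For a tree that stores its files under config/examples, we can restore only that part (with the flat tree built above, which has no such entry, this command fails with a pathspec error):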
$ git checkout my_data ./config/examples
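To unpack a tagged tree into an arbitrary directory, one possible alternative is to combine `git archive` with `tar`. This is a swapped-in technique, not the checkout command used above, and it leaves the index alone. It is sketched here in a throwaway repository, with `config/examples` as an example target directory.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

blob=$(printf 'content of the first file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)
git tag my_data "$tree"

# Unpack the tagged tree into a directory of our choice.
mkdir -p config/examples
git archive my_data | tar -xf - -C config/examples

extracted=$(cat config/examples/first.txt)
staged=$(git ls-files)   # stays empty: nothing was added to the index
echo "$extracted"
```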
The tag itself carries no information about who created it, and there’s no comment about the content of the tree. We can change that by making a signed tag:
$ git tag -s -m "The configuration files bundle for v0.1 of Facebook API" my_signed_data 0f2a261623552516bcbc39c5f71904b67f7b38ac
$ git show my_signed_data
tag my_signed_data
Tagger: Paweł Placzyński <firstname.lastname@example.org>
Date: Thu Sep 17 13:42:26 2015 +0200

The configuration files bundle for v0.1 of Facebook API
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJV+qc6AAoJEMX7aiuDlSqt+Q4P/2n2kK+6B5+vom0cYQzo4QT3
…
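Signing requires a configured GPG key. When only the tagger and the message matter, an annotated but unsigned tag (`-a`) may be enough. Below is a minimal sketch in a throwaway repository; the tag name `my_annotated_data`, the message and the identity are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

blob=$(printf 'content of the first file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)

# -a creates an annotated (unsigned) tag object pointing at the tree.
git -c user.name=Example -c user.email=example@example.com \
    tag -a -m "Example configuration bundle" my_annotated_data "$tree"

# The tag is now an object of its own, which points at the tree.
kind=$(git cat-file -t my_annotated_data)
target=$(git rev-parse 'my_annotated_data^{tree}')
message=$(git for-each-ref --format='%(contents:subject)' refs/tags/my_annotated_data)

echo "$kind $target"
echo "$message"
```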
To share the tree with others, we only need to push tags:
$ git push --tags
In other clones of the repository, the tree will be downloaded while pulling:
$ git pull
From github.com:placek/example.git
 * [new tag]         my_data    -> my_data
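The round trip can be tried locally with a bare repository standing in for the remote. The git fetch documentation describes automatic tag following for tags that point into the histories being fetched, so a tag on a standalone tree may need to be fetched explicitly; the sketch below uses `git fetch --tags` for that reason. The directory names `alice` and `bob` are purely illustrative.

```shell
set -e
base=$(mktemp -d)

# A bare repository plays the role of the shared remote.
git init -q --bare "$base/remote.git"

# First clone: create the bundle and push only its tag.
git init -q "$base/alice"
cd "$base/alice"
blob=$(printf 'adapter: sqlite3\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tdatabase.yml\n' "$blob" | git mktree)
git tag my_data "$tree"
git remote add origin "$base/remote.git"
git push -q origin my_data

# Second clone: fetch tags explicitly and read a file from the tagged tree.
git init -q "$base/bob"
cd "$base/bob"
git remote add origin "$base/remote.git"
git fetch -q --tags origin
fetched=$(git cat-file -p my_data:database.yml)
echo "$fetched"
```

Pushing a single tag by name, as done here, also avoids publishing bundles that are meant to stay local.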
Pros and cons
There are many advantages of the solution above:
- The code history is cleaner.
- The project directory is cleaner.
- Applying the bundle of example files is executed in one command.
- The example files can be managed in well-described bundles.
- There is no limit for the number of files/directories.
- The tree is applied to the directory even when its blobs and/or trees match .gitignore patterns.
- There is no need to keep templates.
- There is no need to rename files.
- The configuration bundle can be applied anywhere in the directory.
- Some bundles can be managed locally only.
On the other hand, creating such a bundle by hand can be a major pain in the back. I plan to create a tool that builds these index/workspace/stash-independent tree structures automatically, and I will present it in my next article.
Which approach is better? In my opinion, it’s up to developers involved in the project.