Apr 04, 2017

Index/workspace/stash-independent data on Git

Paweł Placzyński
Git is a great SCM tool, widely used to maintain code history and to manage project versions. It is useful in many situations. In this article, I will present one of them, along with an interesting trick that relies on Git’s low-level structures. I hope it helps you learn how Git works under the hood.

Annoying “example files”

One of the most common ways to keep example files in a Git project is to append a suffix to the file name. To use such a file, we copy it and remove the suffix. The real file should be added to .gitignore, so that sensitive data never ends up in the index. Once the example files are committed, we can share them with anyone.

For instance, when we have a database configuration file in a Rails project (config/database.yml), we will create the example file config/database.yml.example, add the config/database.yml to .gitignore, prepare the example file, and commit it.
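
As a rough sketch, the conventional workflow described above could look like this in a throwaway repository. The YAML content and the committer identity are placeholders invented for illustration, not taken from a real project.

```shell
# Throwaway repository, so nothing in a real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Hypothetical example configuration (placeholder values).
mkdir config
printf 'development:\n  adapter: sqlite3\n  database: db/development.sqlite3\n' \
  > config/database.yml.example

# Ignore the real file, commit only the example.
echo 'config/database.yml' >> .gitignore
git add .gitignore config/database.yml.example
git -c user.name='Example' -c user.email='example@example.com' \
  commit -q -m 'Add example database configuration'

# To use the example, copy it and drop the suffix.
cp config/database.yml.example config/database.yml
tracked=$(git ls-files)
echo "$tracked"
```

The copied config/database.yml stays invisible to git status because of the .gitignore entry.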

Easy, isn’t it?

Not really – there are a few problems with that solution:

  • We have to choose a suffix that is not matched by .gitignore (obviously).

  • We need to remember to add the original file to .gitignore.

  • There’s a risk of committing the removal of such examples if we accidentally delete them in the working directory.

  • The .gitignore file, as well as the project’s code history, gets more and more complex and cluttered.


It also raises a number of questions:

  • Are there any good practices? Which extension is better?

  • How should such files be committed? Should they go into the same commit as the functionality they support, or into a separate one? Which naming convention suits such a “separate” commit best?

  • Which data should be included in examples?

  • What if we don’t want to share those files with others?

  • What if we have a lot of example files, and we want to use them instantly when needed?

  • What if we want to have more than one version of the example files?

These are the major issues I encountered while using this workflow. I found it messy and annoying, so I started searching for something more elegant.

Git filesystem

To avoid the issues mentioned above, I started studying the Git documentation, hoping to find a solution. I found a set of so-called “plumbing commands” that work on Git’s object store at a very low level. For instance, the git hash-object command computes the hash (object ID) that Git would assign to the content of a given file:

$ git ls-files


$ cat file.txt


$ git hash-object file.txt


We can also generate such a hash for any data we want:

$ echo "any data" | git hash-object --stdin

f19452ee6b88c31f81dd4f70e83ef9cfedc12478

Of course, there is no such object in our repository:

$ git show f19452ee6b88c31f81dd4f70e83ef9cfedc12478

fatal: bad object f19452ee6b88c31f81dd4f70e83ef9cfedc12478

$ git cat-file -t f19452ee6b88c31f81dd4f70e83ef9cfedc12478

fatal: git cat-file f19452ee6b88c31f81dd4f70e83ef9cfedc12478: bad file
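
The identifier printed by git hash-object is not arbitrary: for a blob, Git hashes a small header ("blob", the size in bytes, a NUL byte) followed by the content. The sketch below recomputes it by hand and compares the result. It assumes the default SHA-1 object format and that sha1sum from GNU coreutils is available (on macOS, shasum would be the likely substitute).

```shell
content='any data'

# Size in bytes of the data as echo would emit it (including the trailing newline).
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash "blob <size>\0<content>" manually.
manual=$( { printf 'blob %s\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

# Ask Git for the same thing.
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "manual: $manual"
echo "git:    $git_id"
```

Both lines should show the same identifier, which is why identical content always maps to the same object.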

However, we can create such objects using the -w option:

$ printf 'content of the first file' | git hash-object -w --stdin

3896fb7b0dd65c5d75bb66abf504b183d7e2c219

$ printf 'content of the second file' | git hash-object -w --stdin

7b3facd72dde52a7ea5eab22fee1717cf630742f

$ git show 3896fb7b0dd65c5d75bb66abf504b183d7e2c219

content of the first file

$ git show 7b3facd72dde52a7ea5eab22fee1717cf630742f

content of the second file

$ git cat-file -t 3896fb7b0dd65c5d75bb66abf504b183d7e2c219

blob

$ git cat-file -s 3896fb7b0dd65c5d75bb66abf504b183d7e2c219

25

$ git cat-file -p 3896fb7b0dd65c5d75bb66abf504b183d7e2c219

content of the first file

$ git cat-file -t 7b3facd72dde52a7ea5eab22fee1717cf630742f

blob

$ git cat-file -s 7b3facd72dde52a7ea5eab22fee1717cf630742f

26

$ git cat-file -p 7b3facd72dde52a7ea5eab22fee1717cf630742f

content of the second file

What’s more, newly created objects appear neither in the working directory nor in the index:

$ git ls-files


$ git status

On branch master

Your branch is up-to-date with 'origin/master'.

nothing to commit, working directory clean

Great! So there is a way to store any data in a Git repository while leaving the project clean and untouched.
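
The whole experiment can be repeated safely in a temporary repository. This is a minimal sketch of the steps above; the variable names are only illustrative.

```shell
# Work in a scratch repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Write a blob straight into the object store.
id=$(printf 'content of the first file' | git hash-object -w --stdin)

# Inspect it with plumbing commands.
type=$(git cat-file -t "$id")
size=$(git cat-file -s "$id")

# Neither the index nor the working directory knows about it.
tracked=$(git ls-files)
status=$(git status --porcelain)

echo "$id is a $type of $size bytes"
```

Both `tracked` and `status` end up empty, which matches the clean output shown above.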

Newly created objects are now stored as files in the .git/objects directory:

$ ls .git/objects/38

96fb7b0dd65c5d75bb66abf504b183d7e2c219

$ ls .git/objects/7b

3facd72dde52a7ea5eab22fee1717cf630742f

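Each file holds a header with the object type and size, a NUL byte, and then the object content. Everything is compressed with zlib (the openssl zlib command is only available when OpenSSL was built with zlib support):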

$ openssl zlib -d < .git/objects/38/96fb7b0dd65c5d75bb66abf504b183d7e2c219

blob 25content of the first file

$ openssl zlib -d < .git/objects/7b/3facd72dde52a7ea5eab22fee1717cf630742f

blob 26content of the second file
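
Many OpenSSL builds ship without the zlib subcommand, so here is an alternative sketch that inflates a loose object with a Python one-liner called from the shell. It assumes python3 is installed and that the object is still loose (not yet packed). The NUL byte between header and content is replaced with a space to make it visible.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
id=$(printf 'content of the first file' | git hash-object -w --stdin)

# Loose objects live at .git/objects/<first two hex digits>/<remaining digits>.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)

# Inflate the file and turn the NUL separator into a space.
raw=$(python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))' \
  < ".git/objects/$dir/$file" | tr '\0' ' ')

echo "$raw"   # prints: blob 25 content of the first file
```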

These objects are unreachable:

$ git fsck --unreachable

unreachable blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219

unreachable blob 7b3facd72dde52a7ea5eab22fee1717cf630742f

This means that git prune will delete them. The same will happen, after some time, when the garbage collector (git gc) runs. To prevent that, we need to make the objects reachable, for example by putting them in a tree. Blobs are the equivalent of files in Git’s filesystem, while trees behave like directories.
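
To see the problem in action, the following sketch writes an unreferenced blob in a scratch repository, confirms that git fsck reports it, and then prunes it. Nothing here should be run inside a repository you care about, since git prune deletes every unreachable loose object.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
id=$(printf 'content of the first file' | git hash-object -w --stdin)

# The blob is reported as unreachable...
report=$(git fsck --unreachable 2>/dev/null | grep "$id")

# ...and pruning removes it from the object store.
git prune
if git cat-file -e "$id" 2>/dev/null; then survived=yes; else survived=no; fi

echo "$report"
echo "survived: $survived"
```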

We can organize our objects into a tree by specifying the file mode, object type, object hash and name of each entry:

$ printf '100644 blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219\tfirst.txt\n100644 blob 7b3facd72dde52a7ea5eab22fee1717cf630742f\tsecond.txt\n' | git mktree

0f2a261623552516bcbc39c5f71904b67f7b38ac

$ git cat-file -t 0f2a261623552516bcbc39c5f71904b67f7b38ac

tree

Note: To add another tree to an existing tree (to build a hierarchical structure), use mode 040000 and type tree.

The tree contains two files now:

$ git ls-tree 0f2a261623552516bcbc39c5f71904b67f7b38ac

100644 blob 3896fb7b0dd65c5d75bb66abf504b183d7e2c219  first.txt

100644 blob 7b3facd72dde52a7ea5eab22fee1717cf630742f  second.txt
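
The note about mode 040000 can be illustrated with a nested structure. In this sketch, an inner tree holding one file is placed under a directory named config (a name chosen only for the example) inside an outer tree.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

blob=$(printf 'content of the first file' | git hash-object -w --stdin)

# Inner tree: a single file entry.
inner=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)

# Outer tree: the inner tree appears as a directory entry (mode 040000, type tree).
outer=$(printf '040000 tree %s\tconfig\n' "$inner" | git mktree)

# Recursive listing shows the file under its directory.
listing=$(git ls-tree -r --name-only "$outer")
echo "$listing"
```

Each mktree input line has the form mode, space, type, space, hash, tab, name, which is the same layout that git ls-tree prints.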

To keep the objects after running the git prune command, you need to create a reference to this tree. For example, you can make a tag:

$ git tag my_data 0f2a261623552516bcbc39c5f71904b67f7b38ac

Now git fsck --unreachable will not show our objects – they are safe. The result is a directory-like structure containing two files, and we have access to them without touching stash, workspace, index, or any structure of the project.
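
A short sketch, again in a scratch repository, shows that the tag really protects both the tree and the blobs it lists from pruning:

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

first=$(printf 'content of the first file' | git hash-object -w --stdin)
second=$(printf 'content of the second file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n100644 blob %s\tsecond.txt\n' "$first" "$second" | git mktree)

# A lightweight tag may point at any object, including a tree.
git tag my_data "$tree"

# Pruning now leaves everything reachable from the tag alone.
git prune
if git cat-file -e "$first" && git cat-file -e "$second" && git cat-file -e "$tree"; then
  survived=yes
else
  survived=no
fi
echo "survived: $survived"
```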

Working with tagged trees

To apply the tree to the working directory, we check out the tag together with a path. Keep in mind that this form of git checkout also records the files in the index:

$ git checkout my_data .

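The path given to git checkout only selects which entries of the tree are checked out, so it cannot move them elsewhere. To unpack the tree in any place we like, we can export it with git archive:

$ mkdir -p config/examples && git archive my_data | tar -xf - -C config/examples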
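
Here is a sketch of both ways of applying a tagged tree, run in a scratch repository. Note that the path passed to git checkout picks entries from inside the tree rather than a destination, so unpacking into another directory is done here with git archive and tar. The target directory config/examples is just an example.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

first=$(printf 'content of the first file' | git hash-object -w --stdin)
second=$(printf 'content of the second file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n100644 blob %s\tsecond.txt\n' "$first" "$second" | git mktree)
git tag my_data "$tree"

# 1. Apply the tree in place; the files also land in the index.
git checkout my_data -- .
staged=$(git ls-files)

# 2. Unpack the same tree into another directory, leaving the index alone.
mkdir -p config/examples
git archive my_data | tar -xf - -C config/examples
copied=$(cat config/examples/second.txt)

echo "$staged"
echo "$copied"
```

The second variant is the closer match to the goal of not touching the index, since git archive only reads from the object store.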

A lightweight tag carries no information about who created it, and no comment about the content of the tree. We can change that by making a signed tag (the -s option requires a configured GPG key):

$ git tag -s -m "The configuration files bundle for v0.1 of Facebook API" my_signed_data 0f2a261623552516bcbc39c5f71904b67f7b38ac

$ git show my_signed_data

tag my_signed_data

Tagger: Paweł Placzyński <[email protected]>

Date:   Thu Sep 17 13:42:26 2015 +0200

The configuration files bundle for v0.1 of Facebook API
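
If no GPG key is set up, an annotated tag (-a) records the same tagger and message, only without the signature. The sketch below uses that variant; the tag name my_annotated_data and the identity passed via -c are placeholders for illustration.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

blob=$(printf 'content of the first file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)

# Annotated tag pointing at the tree (placeholder identity).
git -c user.name='Example Author' -c user.email='author@example.com' \
  tag -a -m 'The configuration files bundle for v0.1 of Facebook API' \
  my_annotated_data "$tree"

# The ref now points at a tag object, which in turn points at the tree.
kind=$(git cat-file -t my_annotated_data)
target=$(git rev-parse 'my_annotated_data^{tree}')
message=$(git for-each-ref --format='%(contents:subject)' refs/tags/my_annotated_data)

echo "$kind -> $target"
echo "$message"
```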




To share the tree with others, we only need to push tags:

$ git push --tags

Others receive the tree when they pull from the repository (if the tag is not picked up automatically, git fetch --tags retrieves it explicitly):

$ git pull

From github.com:placek/example.git

 * [new tag]         my_data   -> my_data
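
The round trip can be tried locally with three scratch repositories standing in for the shared remote and two developers. The directory names alice and bob are made up for the example, and the tags are fetched explicitly with --tags.

```shell
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/alice"
git init -q "$work/bob"

# Alice builds the tree, tags it, and pushes only the tags.
cd "$work/alice"
blob=$(printf 'content of the first file' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tfirst.txt\n' "$blob" | git mktree)
git tag my_data "$tree"
git push -q "$work/remote.git" --tags

# Bob fetches the tags and can read the file straight from the tag.
cd "$work/bob"
git remote add origin "$work/remote.git"
git fetch -q --tags origin
received=$(git cat-file -p my_data:first.txt)

echo "$received"
```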

Pros and cons

There are many advantages of the solution above:

  • The code history is cleaner.
  • The project directory is cleaner.
  • Applying the bundle of example files is executed in one command.
  • The example files can be managed in well-described bundles.
  • There is no limit for the number of files/directories.
  • The tree is applied to the directory even when its files match .gitignore patterns.
  • There is no need to keep templates.
  • There is no need to rename files.
  • The configuration bundle can be applied anywhere in the directory.
  • Some bundles can be managed locally only.

On the other hand, creating such a bundle by hand can be a major pain. I plan to create a tool that builds these index/workspace/stash-independent tree structures automatically, and I will present it in my next article.

Which approach is better? In my opinion, it’s up to developers involved in the project.