TL;DR: I wrote a script for enumerating and downloading source code off sites when they accidentally share their .git folder.
I’ve noticed a couple of things of late when looking at security testing software:
- Python now seems to be the scripting language of choice
- Most of it lives not just in git, but on GitHub
And the upswing in git’s popularity isn’t limited to small projects; it extends to deploying web sites too, so much so that it’s now increasingly likely you’ll find a site inadvertently serving its .git folder to the outside world.
With a little work it should be possible to reconstruct a repository remotely (object packs being the only hard part).
Of course, this isn’t a new problem – SVN has the same issue – but the fact that git’s metadata is slightly harder to parse makes it a nice opportunity to finally take the plunge, write some Python and learn more about git.
What’s in the .git folder?
- An index file which, like SVN’s, is effectively a database of all files in the project against hashes of those files. Unlike SVN’s, it’s a binary, memory-mapped format, which is much more fiddly to write code for
- The entire site source code, referenced by SHA1 hash and compressed using zlib deflate
- Logs of git actions such as commit in logs/HEAD
- A small config file, which is a good starting point for testing whether a .git directory is present, or whether the site is configured to return 200 OK for any URL requested, since the file has a very predictable format
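That predictable format makes the probe easy to automate. The following is a minimal sketch (not greedy-git’s actual code): fetch `/.git/config` and check for the INI-style markers a real git config always contains, so a catch-all 200 page full of HTML won’t produce a false positive.

```python
# Sketch: probe for an exposed .git/config. A catch-all 200 handler
# returns HTML here rather than git's predictable INI-style text.
import urllib.request

GIT_CONFIG_MARKERS = ("[core]", "repositoryformatversion")

def looks_like_git_config(body):
    """A real .git/config starts with a very predictable INI layout."""
    return all(marker in body for marker in GIT_CONFIG_MARKERS)

def git_config_exposed(base_url, timeout=10):
    """Fetch /.git/config and check it isn't a catch-all 200 page."""
    url = base_url.rstrip("/") + "/.git/config"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(4096).decode("utf-8", errors="replace")
    except Exception:
        return False
    return looks_like_git_config(body)
```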
Analysing the .git folder
As a starting point, to avoid having to parse the index file myself, I forked gin – a neat little index-file parser written in Python. This already produces a readable, JSON-encoded version of the file, which I can then use to iterate over the files. The script looks at:
- File extensions. Count which file extensions are the most popular. This tells us what our site is written in, if it wasn’t already obvious
- “Interesting” files. Archive format files, backups, SQL, “hidden” files (beginning with “.”) such as .htaccess and .htpassword, files which might have DB configurations in them etc
- The logs/HEAD file, for emails and credentials stored in URLs
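The triage above can be sketched in a few lines. The suffix list and regexes here are illustrative assumptions, not the script’s actual rules:

```python
# Sketch of the triage: count extensions, flag "interesting" files,
# and scan log text for emails and credential-bearing URLs.
import re
from collections import Counter
from pathlib import PurePosixPath

INTERESTING_SUFFIXES = {".sql", ".bak", ".zip", ".tar", ".gz", ".inc"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# URLs of the form scheme://user:pass@host leak credentials into logs.
CRED_URL_RE = re.compile(r"\w+://[^/\s:@]+:[^/\s:@]+@\S+")

def triage(paths, log_text):
    exts = Counter(PurePosixPath(p).suffix for p in paths)
    interesting = [p for p in paths
                   if PurePosixPath(p).suffix in INTERESTING_SUFFIXES
                   or PurePosixPath(p).name.startswith(".")]
    return (exts, interesting,
            EMAIL_RE.findall(log_text), CRED_URL_RE.findall(log_text))
```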
This information is then dumped out into: a simple flat-text interesting.lst file; a report.md file containing the results of the above scan; a copy of the index in its native format, readable text, JSON and flat text; and copies of
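Even before handing the index to gin, its fixed header is easy to read by hand: a 4-byte "DIRC" signature followed by a big-endian version and entry count. A quick sketch of that (the on-disk format, not gin’s API):

```python
# Sketch: read the 12-byte git index header. The format is a 4-byte
# "DIRC" signature, then big-endian 4-byte version and entry count.
import struct

def read_index_header(raw):
    signature, version, entries = struct.unpack(">4sLL", raw[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, entries
```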
Being a greedy git
At this point you already have quite a lot of powerful info. However, if the script has managed the above, it will probably also be able to download the source code for the site. Since we’ve already identified a lot of interesting files in
interesting.lst, we can use that list (edit it and add to it) to download all those files to our computer. In git, the compressed source for a file (in “loose” format) is stored under
.git/objects/ and is referenced by its own SHA1 hash. We have that hash, so we can try to download the files.
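In concrete terms, a loose object for hash `abcd…` lives at `.git/objects/ab/cd…`, and the zlib-deflated payload begins with a `<type> <size>\0` header. A small sketch of mapping a hash to its path and unpacking what comes back:

```python
# Sketch: derive a loose object's path from its SHA1 and split the
# zlib-deflated payload into (type, content).
import zlib

def loose_object_path(sha1_hex):
    """Loose objects live at .git/objects/<first 2 hex chars>/<rest>."""
    return ".git/objects/{}/{}".format(sha1_hex[:2], sha1_hex[2:])

def unpack_loose_object(raw):
    """Decompress and split b'<type> <size>\\x00<content>'."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    assert int(size) == len(content)
    return obj_type.decode(), content
```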
Passing the “-I” command-line argument to greedy-git will make it attempt to download everything into
./files/ in the current working directory. If you really want to go overboard, you can pass “-a“, which will get as much of the site’s source code as it knows about, and passing “-g [remote/file/path]” will download just that file, or files matching that pattern.
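For the curious, the flags described above could be wired up roughly like this (a sketch, not greedy-git’s actual argument parser):

```python
# Sketch of a parser for the three flags described in the text.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="greedy-git")
    p.add_argument("-I", action="store_true",
                   help="download everything into ./files/")
    p.add_argument("-a", action="store_true",
                   help="grab as much site source code as the index knows about")
    p.add_argument("-g", metavar="remote/file/path",
                   help="download just that file, or matching file pattern")
    return p
```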
You now have a target site’s juicy source code. This could contain database or other credentials, clues to vulnerabilities, or “security by obscurity” style back doors that the developer thought no one would find. All of this is now just a few
grep commands away.
Do use responsibly, and let me know if there is a way of guessing the pack file name – that would be the keys to the kingdom…