Tracking our code commits during 2016

As the Director of Product Operations at Interana, one of my responsibilities is to evaluate and improve the effectiveness of our development and release process. It occurred to me that I might be able to ingest our Git history into Interana in order to learn more about how we actually develop and release.


Ingesting a single file of event data in JSON format into Interana is pretty easy, so the main challenge was how to extract a JSON event data file from our Git repository. I knew that I could use the "git log" command with a format string to show me a history of commits in JSON format. But as I worked, I discovered that I wanted to (a) clean up the extract to ensure it was valid JSON, and (b) associate each Git commit with a development phase of a particular release. I ended up building a Python script which, given a list of branch names that I cared about, could extract all the commits from our Git repo, clean up the commit text so that each line was valid JSON, and associate each commit with the appropriate release. Here's an example JSON line created by my extract script:


{"tag": "pre_branch", "version": "2.23", "hash": "985cf1c", "ts": "2016-11-14 18:57:22 -0800", "committer": "Jeff", "committer_email": "", "subject": "the-most-amazing-commit-ever"}


(I know that some people reading this are thinking "hey, aren't you going to show us the Python code?" If you'd like to see the implementation details, I wrote a How To topic: Ingesting GIT source control history into Interana.)
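
For readers who just want a rough feel for the approach, here is a minimal sketch. It is not the script from the How To topic. The function names (log_line_to_event, extract_events) and the choice of a unit-separator character between fields are assumptions made for illustration, and the "version" and "tag" values are simply passed in, because the logic that maps a commit to a release phase isn't described here. The idea is to have "git log" emit delimited fields and let Python's json module do the escaping, so that quotes or backslashes in a commit subject can't produce invalid JSON.

```python
import json
import subprocess

# Unit separator: a control character that is very unlikely to appear in a subject.
FIELD_SEP = "\x1f"
# %h short hash, %ci committer date, %cn committer name, %ce committer email, %s subject
GIT_FORMAT = "%x1f".join(["%h", "%ci", "%cn", "%ce", "%s"])
FIELD_NAMES = ["hash", "ts", "committer", "committer_email", "subject"]


def log_line_to_event(line, version, tag):
    """Turn one delimited `git log` output line into a JSON event string."""
    values = line.split(FIELD_SEP, len(FIELD_NAMES) - 1)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"unexpected git log line: {line!r}")
    event = {"tag": tag, "version": version}
    event.update(zip(FIELD_NAMES, values))
    # json.dumps escapes quotes, backslashes, and control characters.
    return json.dumps(event)


def extract_events(repo_path, branch, version, tag):
    """Yield one JSON line per commit reachable from `branch`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", branch, f"--pretty=format:{GIT_FORMAT}"],
        check=True,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # tolerate commit text that is not valid UTF-8
    )
    for line in result.stdout.split("\n"):
        if line:
            yield log_line_to_event(line, version, tag)
```

Writing the result of each extract_events call to a file, one event per line, gives a data file shaped like the example above. A real script would also need to decide the "tag" for each commit, for instance by comparing it against the point where the release branch was cut.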


Once I had a data file that represented the full set of event data I cared about, it was time for the ingest into Interana. I uploaded my data file into our Ingest Wizard, and chose shard keys of "version" and "committer" so I could ask questions about the behavior of both releases and developers over time.


Customize column types in the Import Wizard

My favorite visualization that I ended up making in the Interana Explorer was a Stacked Area Time view of commits by release. It gave a sense of how many releases we work on in parallel, and the overall lifetime of each release. I ended up doing a few extra annotations (using a photo editing program) and sharing the image more widely.


Commits by release

One really common experience I have working with Interana is that once I get such an amazing visualization of my data, I realize my data isn't clean enough to answer certain questions. For example, when I showed the visualization to one of our team members, she asked "it seems like our trendline for the absolute number of commits is going down over time... does that mean we are being less productive in Engineering?" It turns out that we use Phabricator for code reviews and merges into Git, and Phabricator "squashes" all of a developer's local commits together into a single commit that gets pushed to the origin repository. Because we have been enforcing the use of Phabricator more strictly over time, the older data has more cases where one unit of work for a developer produced multiple commits in the origin repository, while the newer data skews toward one commit per unit of work. The bottom line is that our data isn't yet clean enough to give an accurate answer about how many commits we are producing over time. (And yes, I know that the number of commits over time isn't really a direct measure of any of our core Engineering values or goals. This question is just a good way to illustrate the really common experience of realizing you sometimes need to make deeper process changes in order to collect accurate information about how your business is behaving.)