Save Joern — Open Source at ShiftLeft

by Fabian Yamaguchi on May 8, 2018

TL;DR; We want the technology developed at ShiftLeft to benefit open security projects and the security research community as much as possible. Therefore, we are planning to open-source our semantic code property graph and its query language in the coming months, and integrate the open-source C/C++ code analyzer “Joern” (http://mlsec.org/joern/) with it. Efforts to benefit the community cost time and money, and so we want them to be focused. I am therefore reaching out to the community to initiate a discussion on what would be the most beneficial route to take regarding this endeavour. If this discussion interests you, please write me a short e-mail at [email protected] and I will include you on the [email protected] mailing list.

Joern/Bjoern are dying — let’s save them!

Joern, and its ugly brother Bjoern, are open-source frameworks for the discovery of vulnerabilities in C/C++ and binary code. These projects implement the idea of code mining based on “code property graphs”, an idea first published in a paper in 2014.

Joern and Bjoern are dying. And it’s totally my fault. I developed and maintained these projects until late 2016 and then joined ShiftLeft to develop similar technology behind closed doors. “Boooo”, I hear you shout, and you’re right. So, let’s save the projects, by bringing the open and the closed versions closer together.

Now is the time

Joern and Bjoern, despite their age and lack of maintenance in the last 1.5 years, aren’t totally dead yet. I still receive a bunch of e-mails every week from people playing with them. We’ve recently seen Cisco Talos use Joern in their work (https://twitter.com/daniel_bilar/status/977146516942016513) (you guys are brave!), and we saw a reimplementation of the concept for Android in munmap’s pwn2own work (“Jandroid”). Dinis Cruz, maintainer of OWASP O2Platform, recently expressed interest in the project and wondered whether we can swap notes. Finally, projects in academia are still happily building on the idea, even leading to publications at Tier 1 conferences.

However, let’s not give in to illusion: an unmaintained project gradually loses its usefulness to the community, and that’s a pity. At the same time, it hurts to see deficiencies of the code pop up or lead people to reimplement the concept while spending a lot of time on problems already solved. If we want the project to continue to provide value to the community, then it is time to hook it up to some of the advancements we’ve made in the last 1.5 years at ShiftLeft. If it falls too far behind, then it becomes irrelevant.

Open Source and startups

I strongly believe that open publication of code is a requirement for building truly impactful research and technology. Starting my work at ShiftLeft, I was happy to see that this belief is shared by many of my colleagues, both in engineering and leadership.

While a startup needs to be careful not to go overboard with this, both to build a sustainable business and to avoid seeing entities with lower self-respect “smack a UI on top of it and sell it as their own”, we should never stop asking throughout the process which of our developments would benefit the larger community and how we can go about open-sourcing them. We need to ask ourselves this question because nobody else can do it for us, and we should ask it to motivate others to ask the same about their own tech in the future.

Luckily, many of the components that turn core concepts into a commercial product, components crucial to operate in a customer environment, seem of very little value to the research community: researchers tend to not care about “seamless integration into a customer’s build environment”, the work on “scaling code analysis in cloud environments”, or support for a specific customer’s exotic framework. They care about solid reference implementations of core concepts and ideas, and this is what I believe we can and should share and open to the world.

What we would like to open source

Depending on community interest and feedback, we are seeking to open source the following components:

Semantic code property graph. Joern was a proof of concept that showed how vulnerability discovery via graph database mining is possible. To this end, it introduced the concept of a “code property graph” (see this paper). However, little effort was put into defining a specification for this structure, resulting in a vague concept that required reimplementation for every programming language. In the past 1.5 years, we have put a considerable amount of effort into turning this into a language-independent intermediate code representation for graph mining, one that is flexible enough to support language-specific constructs and represent them in a consistent way. By open-sourcing this specification, it becomes possible to write language modules to implement graph mining for your favorite programming language. This is what we call the semantic code property graph.
Domain-specific language for code analysis. Anybody who has ever played with Joern/Bjoern will remember the horrible trial-and-error sessions when crafting a new query in a Groovy-based query language that makes you question your life choices. It is elegant, it is extensible, but usable only by the bravest. The corresponding shell has tab-completion, but really just lists all methods available, because the dynamically typed Groovy-based environment really does not know what type you are holding in your hand there. We replaced this with a Scala-based, statically typed query language and a corresponding shell, and my tears have now long dried.
In-memory graph database. Joern/Bjoern can be used with any graph database backend that supports the TinkerPop stack. This is great because you can have graphs that are much larger than what fits into your RAM and have your database take care of providing efficient access. However, in most cases, it just ends up loading the entire database into RAM anyway, and if it doesn’t, things become considerably slower. At that point you ask yourself whether you might be able to just reduce the memory footprint significantly and serve graphs directly from RAM most of the time. To realize this, we have created an in-memory graph database that uses 70% less memory than the reference implementation (TinkerGraph), and allows for strict schema validation. We have already open-sourced this and continue to maintain and publish fresh versions, ready for you to pick up from Maven Central. More background in our blog article and on GitHub.
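
To make the three components a bit more concrete, here is a toy sketch in Python. It is not ShiftLeft's specification, query language, or database. In the 2014 paper, a code property graph merges syntax (AST), control flow (CFG), and data dependence information into a single property graph; the sketch mimics that with a handful of made-up labels. All names here (`NODE_LABELS`, `EDGE_SCHEMA`, `Graph`, `calls_fed_by_params`, the label strings) are assumptions chosen for illustration, and the schema check only hints at what "strict schema validation" could mean for an in-memory graph.

```python
from dataclasses import dataclass, field

# Hypothetical schema (illustration only, not the real specification):
# allowed node labels, and for each edge label the allowed (source, target) label pairs.
NODE_LABELS = {"METHOD", "PARAM", "CALL", "IDENTIFIER"}
EDGE_SCHEMA = {
    "AST": {("METHOD", "PARAM"), ("METHOD", "CALL"), ("CALL", "IDENTIFIER")},
    "CFG": {("METHOD", "CALL")},
    "REACHING_DEF": {("PARAM", "IDENTIFIER")},
}


class SchemaError(ValueError):
    """Raised when a node or edge does not conform to the schema."""


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    edges: list = field(default_factory=list)  # (source id, edge label, target id)

    def add_node(self, node_id, label, **props):
        if label not in NODE_LABELS:
            raise SchemaError(f"unknown node label: {label}")
        self.nodes[node_id] = (label, props)

    def add_edge(self, src, label, dst):
        pair = (self.nodes[src][0], self.nodes[dst][0])
        if pair not in EDGE_SCHEMA.get(label, set()):
            raise SchemaError(f"{label} edge not allowed between {pair}")
        self.edges.append((src, label, dst))

    def out(self, node_id, label):
        return [d for s, l, d in self.edges if s == node_id and l == label]

    def incoming(self, node_id, label):
        return [s for s, l, d in self.edges if d == node_id and l == label]


def build_example():
    # Graph for: void copy(n) { memcpy(..., n); }
    g = Graph()
    g.add_node(1, "METHOD", name="copy")
    g.add_node(2, "PARAM", name="n")
    g.add_node(3, "CALL", name="memcpy")
    g.add_node(4, "IDENTIFIER", name="n")
    g.add_edge(1, "AST", 2)           # syntax: method has a parameter
    g.add_edge(1, "AST", 3)           # syntax: method contains a call
    g.add_edge(3, "AST", 4)           # syntax: call has an argument
    g.add_edge(1, "CFG", 3)           # control flow: entry reaches the call
    g.add_edge(2, "REACHING_DEF", 4)  # data flow: parameter reaches the argument
    return g


def calls_fed_by_params(g, callee):
    """Find calls to `callee` with an argument that a method parameter flows into."""
    hits = []
    for call_id, (label, props) in g.nodes.items():
        if label != "CALL" or props.get("name") != callee:
            continue
        for arg in g.out(call_id, "AST"):
            for src in g.incoming(arg, "REACHING_DEF"):
                if g.nodes[src][0] == "PARAM":
                    hits.append((callee, g.nodes[src][1]["name"]))
    return hits
```

Calling `calls_fed_by_params(build_example(), "memcpy")` walks syntax and data-flow edges in one query, which is the kind of question the graph representation is meant to make easy. Note that Python, like Groovy, will not tell you at edit time what type a traversal step returns; that is exactly the gap a statically typed query language is meant to close.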

With these components out there, we can connect Joern to the semantic code property graph, and end up with a fully functional code analysis system, something much better than the original.

Let’s get there together!

If this interests you, please sign up for the mailing list at [email protected] by sending me a short note to [email protected]. I will kick off the discussion in a few weeks and hope to find out how much effort it is worth putting into this. I am interested in hearing what people have tried doing with Joern, but ended up not doing because of issues with the code. I would be interested in hearing from people developing similar open tech to see if we can collaborate. Can we connect this to other open components maybe? In short, whether you want to contribute or just stay up to date with the effort, please sign up.

*** This is a Security Bloggers Network syndicated blog from ShiftLeft Blog - Medium authored by Fabian Yamaguchi. Read the original post at: https://blog.shiftleft.io/save-joern-open-source-at-shiftleft-6961a8cb6b89?source=rss----86a4f941c7da---4