Finding patterns in API data using Word Embedding methods
The devil is in the details. We all know that. Sometimes the smallest things, those buried deep and hidden from sight, can have the biggest impact over time – and all too often, for the worse. Although these may occasionally seem arbitrary, as in ‘For Want of a Nail’, in many cases they simply challenge us to be more meticulous, thoughtful, and careful. But who’s got time for that?
When it comes to API security, how do you know what “the details” are? If your development team follows API best practices, then the OpenAPI specification (formerly known as Swagger) can be a good starting point. It can serve as the basis for building dedicated technical protections and as a stepping stone toward a stronger API security posture.
But even if your API specifications are 100% aligned with what was actually developed and are always kept up to date, you can still be vulnerable to other “details” that simply aren’t documented: the business logic. Flawed API business logic can be the source of functional vulnerabilities.
Functional attacks exploit the intrinsic and unique business logic of an organization’s service, as reflected by the API. From a Machine Learning perspective, the business logic (objects, relations, and processes) is mirrored in the patterns formed by API transactions. These patterns can subsequently be verified and monitored, raising an alert when anomalous activity violates or breaks one of these relations.
The question is, how can you map the business logic? This is where Representation Learning comes into play, or more specifically: Word Embeddings.
Word Embeddings are Natural Language Processing (NLP) methods for Representation Learning, in which words or phrases are mapped to an array of numbers (a vector). These methods take categorical data as input and learn a representation for each data value.
There are various approaches and algorithms for obtaining word embeddings, one of the most popular of which is Word2Vec. Word2Vec is essentially a shallow two-layer neural network that takes text documents as input and outputs a vector for each word in those documents. More concretely, the network learns a vector of numbers for each word seen during the training phase while optimizing its objective function. The objective, in a nutshell, is to predict a word from the other words that appear near it.
One of the interesting traits of the embeddings this method learns for text data is that words with similar meanings end up closer to one another in the learned space.
Consider the following example:
In the figure below, you can see Word Embeddings learned on textual data using Word2Vec. Note how words like ‘application’, ‘interface’ and ‘software’ sit close together and form a topical group, while remaining farther from other words, like ‘documentary’, indicating a weaker association with them.
Figure 1: Word2Vec embeddings learned on textual data; topically related words cluster together.
This semantic-similarity (or association) pattern between words is one type of pattern that Word2Vec captures automatically from textual data.
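To make this concrete, here is a minimal sketch (not the pipeline behind the figure) that trains Word2Vec with the gensim library on a tiny, made-up corpus and inspects which words land near ‘application’ in the learned space; the corpus and hyperparameters are illustrative assumptions only:

```python
# A minimal sketch assuming gensim is installed; the corpus below is a tiny
# illustrative stand-in, far too small to reproduce the structure in Figure 1.
from gensim.models import Word2Vec

corpus = [
    ["the", "application", "exposes", "a", "software", "interface"],
    ["the", "software", "interface", "accepts", "api", "requests"],
    ["we", "watched", "a", "documentary", "about", "wildlife"],
    # ... many more tokenized sentences in practice
]

# Train a small skip-gram model (sg=1); real corpora need far more data.
model = Word2Vec(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50, seed=42)

# Words that co-occur in similar contexts end up close in the vector space.
print(model.wv.most_similar("application", topn=3))
print("application vs interface:  ", model.wv.similarity("application", "interface"))
print("application vs documentary:", model.wv.similarity("application", "documentary"))
```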
Can this method also be used to detect patterns in API data? Well, yes.
Applying word embedding methods to API data can automatically uncover hidden patterns that reflect the API business logic at different levels. By defining the words in our vocabulary as the fields that appear at API endpoints, we can design several learning schemes that capture different patterns in the API data.
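As one possible illustration of such a scheme (the endpoint and field names here are hypothetical, not taken from the data behind the figures), each API transaction can be turned into a “sentence” whose tokens are endpoint-qualified field names, and Word2Vec can then be trained on those sentences:

```python
# Sketch of one possible learning scheme: treat each API transaction as a
# "sentence" whose "words" are endpoint-qualified field names.
# Endpoint and field names are hypothetical.
from gensim.models import Word2Vec

transactions = [
    {"endpoint": "/profile/update", "fields": ["msisdn", "apn", "max_bandwidth"]},
    {"endpoint": "/session/create", "fields": ["msisdn", "ip_allocation_mode"]},
    # ... in practice, a large volume of observed API calls
]

def to_sentence(tx):
    # e.g. the token "/profile/update.apn" joins the vocabulary
    return [f"{tx['endpoint']}.{field}" for field in tx["fields"]]

sentences = [to_sentence(tx) for tx in transactions]

model = Word2Vec(sentences=sentences, vector_size=32, window=10,
                 min_count=1, sg=1, epochs=100, seed=42)

# Each API field now has a learned vector reflecting its usage patterns.
print(model.wv["/profile/update.apn"].shape)  # (32,)
```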
Figure 2 shows a three-dimensional visualization of such an embedding space learned on telecommunications API data:
Figure 2: Three-dimensional visualization of the embedding space learned on telecommunications API data.
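A projection of this kind can be produced, for example, by reducing the learned field vectors to three dimensions with PCA. The sketch below assumes the `model` variable from the previous snippet; it is not the visualization tooling used for the figure:

```python
# Sketch: project the learned field vectors to 3-D for visualization.
# Assumes `model` is the Word2Vec model trained on API transactions above.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

fields = list(model.wv.index_to_key)   # all learned field tokens
vectors = model.wv[fields]             # shape: (n_fields, vector_size)

coords = PCA(n_components=3).fit_transform(vectors)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
for name, (x, y, z) in zip(fields, coords):
    ax.text(x, y, z, name, fontsize=7)
plt.show()
```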
Diving deeper into these results, some interesting patterns emerge. For example, all the required request fields are grouped tightly together and lie far from the other fields’ clusters, reflecting a distinctive relation among them compared to fields that are not required by the API, as shown in Figure 3.
This insight can in turn be used to verify that the related transactions include these required fields, and to flag a transaction that omits one of them as an anomaly.
Figure 3: The required request fields form a tight cluster, far from the other fields.
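For instance, once the set of required fields per endpoint has been established (whether derived from the tight cluster above or confirmed against the OpenAPI specification), each incoming transaction can be checked against it. The sketch below uses hypothetical endpoints and field names:

```python
# Sketch: flag transactions that omit a required field.
# The required-field sets are hypothetical examples; in practice they could be
# derived from the tight cluster in the embedding space and/or the OpenAPI spec.
REQUIRED_FIELDS = {
    "/session/create": {"msisdn", "ip_allocation_mode"},
    "/profile/update": {"msisdn", "apn"},
}

def missing_required_fields(tx):
    required = REQUIRED_FIELDS.get(tx["endpoint"], set())
    return required - set(tx["fields"])

tx = {"endpoint": "/session/create", "fields": ["msisdn"]}
missing = missing_required_fields(tx)
if missing:
    print(f"ANOMALY: {tx['endpoint']} is missing required fields: {missing}")
```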
Another interesting pattern is exhibited in Figure 4, which focuses on Profile-related fields that appear in two different API calls. These profile fields reflect information on a user’s connectivity characteristics in a mobile communication network. As can be seen in the figure, the corresponding fields’ vectors are largely concentrated in two different clusters.
Figure 4: Profile-related fields from two different API calls, concentrated in two distinct clusters.
The top cluster contains fields that deal with IP address manipulation of either the endpoint or the home gateway. What these fields share is a common purpose: they all alter the mechanism that governs the user’s IP address, in contrast to the fields outside the cluster, which do not concern that mechanism.
The bottom cluster, on the other hand, contains fields that concern general profile settings: APN, maximum and minimum bandwidth, and other connectivity properties. So while all these fields may appear together in two different API calls, the logic that governs their joint occurrence, namely that they concern the same business-logic entity in the API, is actually captured by their clustering pattern in the embedding space.
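A grouping like the one in Figure 4 can be recovered automatically, for example by clustering the profile-related field vectors with k-means. The field names below are hypothetical stand-ins and are assumed to exist in the trained model’s vocabulary:

```python
# Sketch: recover the two profile-field groups by clustering their vectors.
# Field names are hypothetical and assumed to appear in the training data;
# `model` is the Word2Vec model trained on API transactions above.
from sklearn.cluster import KMeans

profile_fields = [
    "/profile/update.ip_allocation_mode",
    "/profile/update.home_gateway_ip_mode",
    "/profile/update.apn",
    "/profile/update.max_bandwidth",
    "/profile/update.min_bandwidth",
]
vectors = model.wv[profile_fields]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for field, label in zip(profile_fields, labels):
    print(label, field)
```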
This insight can be used to verify that when a user changes their IP address, the same change is reflected in all the related fields in that user’s session. Consequently, when a session does not adhere to this pattern across the related endpoints, we can flag the behavior as anomalous.
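One simplified way to operationalize this check, using hypothetical endpoint and field names, is to group a session’s API calls and verify that whenever one of the IP-related fields is modified, the related fields on the other endpoints are modified as well:

```python
# Sketch: verify that an IP address change is reflected across related fields.
# Endpoint and field names are hypothetical; RELATED_IP_FIELDS could be derived
# from the IP-manipulation cluster observed in the embedding space.
RELATED_IP_FIELDS = {
    "/profile/update": "ip_allocation_mode",
    "/gateway/configure": "home_gateway_ip_mode",
}

def ip_change_is_consistent(session_calls):
    """session_calls: list of {'endpoint': str, 'payload': dict} for one session."""
    touched = {
        call["endpoint"]
        for call in session_calls
        if RELATED_IP_FIELDS.get(call["endpoint"]) in call["payload"]
    }
    # Either no IP-related field was touched, or all related endpoints were updated.
    return len(touched) in (0, len(RELATED_IP_FIELDS))

session = [
    {"endpoint": "/profile/update", "payload": {"ip_allocation_mode": "static"}},
    # no matching /gateway/configure call in this session
]
if not ip_change_is_consistent(session):
    print("ANOMALY: IP change not reflected across related endpoints in session")
```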
Although the first ‘required fields’ example captures information that can also be found in API documentation (such as an OpenAPI/Swagger specification), the second example reflects a deeper understanding of the business logic, one that is rarely documented in Swagger files or any other API documentation.
By applying a Word Embedding approach to API data, we can discover internal relations that reflect business-logic patterns seldom described in API documentation. This knowledge, in turn, enables in-depth and fine-tuned anomaly detection across the entire API attack surface.
*** This is a Security Bloggers Network syndicated blog from Imvision Blog authored by Liora Braunstain. Read the original post at: https://blog.imvision.ai/finding-patterns-in-api-data-using-word-embedding-methods