Deployment Isn’t the Final Step – Monitoring Machine Learning Models in Production

Unless you’ve been living in a cave for the last decade, you’ve probably heard of the concept of a machine learning system at least once in your life. Whether it’s auto-translation, auto-completion, face or voice recognition, recommendation systems or autonomous driving, AI-based systems can be found in almost every aspect of our daily lives.

But although this field looks innovative, it has actually been around for quite some time, with roots going back to the late 1950s. In recent years the field has seen a resurgence, and artificial intelligence now matches or outperforms human experts on several specific tasks.

So, what does a typical machine learning project look like?

The machine learning project cycle

Although there are many methodologies out there, a typical data science project has five main stages: defining the problem we wish to solve, collecting and cleaning the data, training and testing a model, deploying the product, and monitoring it.

Figure 1: ML Project Cycle

  1. Define the objective
    Like the beginning of any new project, you must define exactly what goals you’re trying to achieve. A well-written objective is crucial as it may affect each step of the project lifecycle. The objective must be specific and measurable, as those aspects will be used throughout the development process and especially when evaluating the model (stage 3) and monitoring the production environment (stage 5) to see if your goals have been achieved. According to this Microsoft blog, we should ask ourselves one of five types of questions, each of which defines the project type and the business goals:

    • How much / how many?   >> regression
    • Which category? >> classification
    • Which group? >> clustering
    • Is this weird? >> anomaly detection
    • Which option should be taken? >> recommendation


    And last but not least, you should ask yourself – is machine learning really the right solution for this type of project?

  2. Collect and clean
    Once the objective has been formulated, now’s the time for gathering the relevant data – an action that in most cases is not as trivial as it sounds. Sometimes the data is simply not available, or at least not in the desired amount (a typical neural network model requires thousands or even millions of samples). Moreover, the data isn’t always organized in a form we can work with, so several measures should be taken:

    • Aggregation
    • Missing fields completion
    • Removing garbage
    • Exploratory Data Analysis (EDA)
    • Feature engineering


    This phase will be the most time-consuming in many cases and requires quite a bit of creativity and unconventional thinking.

  3. Modeling
    This is probably the most fun part – where machine learning takes center stage. The data is divided into training, validation and testing sets, which we use to train a model that can answer the question we’re asking. This stage also requires a lot of domain knowledge and an understanding of the problem and our data, because we have to make critical choices such as the type of model and the hyperparameters to use. How to measure success also depends on the type of problem – should accuracy be the leading measure? Is a false positive as costly as a false negative? At the end of this stage, we’ll have a model that performs well in laboratory conditions, and we’re ready to move on to deployment in the production environment.

  4. Deployment
    After you’ve invested in and refined this model, it’s time to expose it to the world and to some new inputs from customers. It’s also important to note that the production environment will never be like the laboratory environment – this process should therefore be carried out with extreme care.

    Figure 2: System Deployment

  5. Monitoring and tracking
    So, you developed an awesome predictive model that gave good results, deployed it and received positive feedback from your customers. Unfortunately, most people stop right there and move on to their next challenge. Which brings us to the last (and mostly forgotten) stage – monitoring and tracking your intelligent system.


Why monitoring matters

In recent years, most of the attention and open tools have centered on the data exploration and model development stages (2 & 3), and on the deployment stage (4). Is the neglect of the monitoring stage justified?


For a standard application, we should keep monitoring for errors, latency, crashes and so on. All of these are valid, of course, in the case of an artificial intelligence system, but they’re not the main issue. The most important thing to look for is that our system keeps doing what it was built for and doing it well.


The difference from a regular system is that the output depends significantly on the input, and its nature may change over time. Just take a look at Microsoft’s racist Twitter chatbot which was shut down in less than a day, after being heavily trolled. Once it was released, users started corrupting the bot by teaching it racist and sexist terminology while trying to alter its output. Just think about the reputational impact this bot had on the company’s name each hour it was running.

Figure 3: Microsoft’s Twitter Chatbot Tay (Source:

The staging environment will never truly reflect the conditions in production. The training data may suffer from serious bias, resulting in surprising outputs. Amazon’s recruiting tool is a perfect example of this case. Between 2014 and 2017, Amazon’s HR group used an AI-based tool to review resumes and make recommendations during the recruiting stage. It appears this tool was trained on resumes submitted to Amazon over the past decade and therefore preferred male applicants over women, as men were the majority in those positions. The machine in this case just did what it saw and was trained for, or in other words – GIGO (Garbage in, Garbage out) – nonsense input will almost always produce nonsense (or biased) output.


In addition, these systems are characterized by non-deterministic output, which increases the natural fear of unexpected errors.

The clustering problem in Attack Analytics

We faced a similar challenge in our Attack Analytics solution when reaching stage five in the machine learning project cycle. Attack Analytics condenses thousands upon thousands of alerts into a handful of relevant, investigate-able incidents, automating what would take a team of security analysts days to do, and cutting the investigation time down to a matter of minutes.


This product deals with a clustering problem: dividing a dataset into groups such that the members of each group are as similar as possible to each other or, in our case, describe the same cyber-attack.

Figure 4: Attack Analytics Alerts to Cyber-Attacks

The similarity can be in different dimensions on every attack. Sometimes the origin of the attack may be common (IP, country, TOR-based, etc.) and sometimes not (distributed attack). There are cases where the target (URL, files, host) is common and cases where scanners are involved. Other attributes may be the attack tool, type or timing (short/long attack, periodic, etc.).


So we formulated the objectives of the project, collected the data, cleaned and enriched it, and built a stream clustering algorithm that got positive feedback from our customers after launching it in May 2018. But are we done? Besides continuing to release new features, the next step, as we’ve seen, is establishing a mechanism for tracking and monitoring it in production mode.

How to measure success

Each problem category may have its own solutions and measures of success. In our case, we’ll discuss a method for monitoring the clustering solution. This unsupervised task has no labeled samples, meaning that we don’t have the “real answer” against which to measure accuracy. The groups our solution produces must satisfy two main conditions:


      • Similarity among the alerts within each group
      • Dissimilarity to other groups


There is a perfect term for this – silhouettes. Let’s define the silhouette for each sample in our dataset.


For sample i, let’s define a(i) as the average distance between point i and all other points in the cluster it was assigned to. We can interpret a(i) as a measure of how well the point belongs to its cluster, which is exactly the first condition we defined. We want this number to be as small as possible for each point.


Similarly, let’s define b(i) as the average distance from point i to the points in its nearest neighboring cluster. Of course, we want this number to be as large as possible for each point. This is our second condition: the point should be as far as it can from its nearest neighboring cluster (and therefore from all other groups).


These attributes are illustrated in figure 5.

Figure 5: Silhouette Attributes Calculation

As we wish to have both small a(i) and large b(i) at the same time, let’s define the Silhouette of each point s(i) as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

We can easily say that s(i) lies in the range of [-1, 1].


For s(i) to be close to 1, a(i) must be small compared to b(i), which means the point is very close to its own cluster and far from other clusters. Hence, s(i) = 1 is the optimum result.


Similarly, a negative s(i) means the point is closer to its neighboring cluster than to the cluster it was assigned to. And what does s(i) = 0 actually mean? It happens when the point sits in the ‘no man’s-land’ between the two groups.
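
To make the definitions concrete, here is a minimal Python sketch that computes a(i), b(i) and s(i) for a handful of points. It assumes the standard silhouette definition, s(i) = (b(i) − a(i)) / max(a(i), b(i)), and uses plain Euclidean distance on made-up 2D points. The function name `silhouette_samples` and the choice of 0 for single-member clusters are illustrative assumptions, not the Attack Analytics implementation.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def silhouette_samples(points, labels):
    """Return the silhouette value s(i) of every point, given its cluster label."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            # Assumed convention: a point alone in its cluster (or a single
            # cluster overall) gets a silhouette of 0.
            scores.append(0.0)
            continue

        # a(i): average distance to the other points in the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)

        # b(i): smallest average distance to the points of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )

        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_clustered = silhouette_samples(points, ["A", "A", "B", "B"])
# The second point is deliberately assigned to the wrong cluster here.
badly_clustered = silhouette_samples(points, ["A", "B", "B", "B"])
```

With the correct labels every point scores close to 1, while the deliberately misassigned point in the second call gets a negative value.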


Now that we have a silhouette value for each point, we can easily calculate the average silhouette value of each cluster and the total silhouette. Figure 6 shows the meaning of the cluster silhouette grade, and the silhouette graph, where each color represents a different cluster and the silhouette value of all its points.
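
As a rough illustration of this aggregation step, the sketch below averages per-point silhouette values into a grade per cluster and a total grade. The function name `summarize_silhouettes` and the sample values are hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean


def summarize_silhouettes(scores, labels):
    """Return ({cluster: average silhouette}, total average silhouette)."""
    per_cluster = defaultdict(list)
    for score, label in zip(scores, labels):
        per_cluster[label].append(score)
    cluster_means = {label: mean(vals) for label, vals in per_cluster.items()}
    return cluster_means, mean(scores)


# Hypothetical per-point silhouette values for two clusters.
scores = [0.9, 0.8, 0.1, -0.2]
labels = ["A", "A", "B", "B"]
cluster_means, total = summarize_silhouettes(scores, labels)
# Cluster A averages about 0.85 (well separated); cluster B is slightly
# negative (poorly separated), which drags the total down to about 0.4.
```

Looking at the per-cluster averages rather than only the total makes it easier to see which cluster is responsible for a low grade.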

Figure 6: Cluster Silhouette Meaning and Silhouette Graph

Silhouette in the web application security domain

When talking about the web application security domain, we need to apply the distance function to the security alerts that we showed earlier. This function looks into the different dimensions of the attack – its source, type, target, tool, time, etc. Every cluster defines a group of similar attacks according to these dimensions, or in other words, an attack campaign. Given the Attack Analytics clustering results, we can measure a(i) and b(i) for every alert i, and from them its silhouette value.
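
The post doesn’t spell out the actual distance function, so the following is only a sketch of what a distance over alert dimensions could look like. It assumes each alert is a dictionary with the hypothetical fields `source_ip`, `attack_type`, `target` and `tool`, and uses a simple weighted mismatch ratio: 0 when all fields agree, 1 when none do. The field names, weights and sample alerts are all made up for illustration.

```python
# Hypothetical alert dimensions; a real product would use richer features.
ALERT_FIELDS = ("source_ip", "attack_type", "target", "tool")


def alert_distance(alert_a, alert_b, weights=None):
    """Weighted share of dimensions on which two alerts differ, in [0, 1]."""
    weights = weights or {field: 1.0 for field in ALERT_FIELDS}
    total = sum(weights.values())
    mismatch = sum(
        weight
        for field, weight in weights.items()
        if alert_a.get(field) != alert_b.get(field)
    )
    return mismatch / total


# Two made-up alerts that differ only in their source IP.
alert_1 = {"source_ip": "198.51.100.7", "attack_type": "SQLi",
           "target": "/login", "tool": "sqlmap"}
alert_2 = {"source_ip": "198.51.100.9", "attack_type": "SQLi",
           "target": "/login", "tool": "sqlmap"}
d = alert_distance(alert_1, alert_2)  # one of four fields differs
```

A function like this can be plugged into the a(i) and b(i) averages in place of a geometric distance, and the weights give a way to express that some dimensions matter more than others.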


Looking at figure 7, we can see that the blue cluster A and the orange cluster B don’t seem to have good separation. Remember that we wish to have a small a(i), which is the average distance of alert i to all other alerts in its cluster, meaning the alerts describe the same attack campaign. As we expected, alert i has a small a(i) value, as all the alerts have the same target, tool and type, and they also originate from the same subnet.


The poor separation from other clusters shows up as a small value of b(i), which is the average distance to the nearest cluster (orange cluster B). We can see that alert i shares an IP and a tool with alerts from the orange cluster. The result is a small value of s(i): alert i belongs to the blue cluster but also has some features in common with the orange cluster.


In contrast, alert j has a silhouette value close to 1, as it is close to other alerts in the same attack (small a(j)) and far from alerts of other attacks (large b(j)).

Figure 7: Clustering of web application attacks & Silhouette calculation

What’s next?

Having a grading mechanism for our AI-based system gives us a way to monitor the clusters of our customers’ attacks. As the clustering algorithm runs in streaming mode, think of a periodic ‘health’ check for each customer, and the option to respond when the results start to decline, before getting the call from your customer. Another opportunity is A/B testing of new features, as you can measure how well the clustering performed before and after the change.
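
One simple way such a periodic health check could be sketched: keep a history of average silhouette grades per customer and flag a customer when the latest grade falls noticeably below the recent baseline. The function name `silhouette_degraded`, the window size and the drop threshold are assumptions chosen for illustration and would need tuning against real data.

```python
from statistics import mean


def silhouette_degraded(history, latest, window=5, max_drop=0.15):
    """Return True if `latest` fell more than `max_drop` below the recent baseline.

    history -- earlier average silhouette grades, oldest first
    latest  -- the newest average silhouette grade
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history[-window:])
    return (baseline - latest) > max_drop


# Hypothetical grades from previous checks for one customer.
history = [0.62, 0.60, 0.61, 0.63, 0.59]
stable = silhouette_degraded(history, 0.58)   # small dip, within tolerance
alarm = silhouette_degraded(history, 0.40)    # clear drop, worth investigating
```

The same comparison works for A/B testing: compute the grade with and without the new feature on the same traffic and compare the two numbers.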


A machine learning system isn’t a regular application, so the responsibility for monitoring it doesn’t automatically fall to the DevOps team. It’s important that validating the proper operation of these systems be an integral part of the development cycle and involve both developers and data scientists.

*** This is a Security Bloggers Network syndicated blog from Blog authored by Amit Leibovitz. Read the original post at: