
Genetic Algorithms: Using Natural Selection to Block Bot Traffic

In this blog post, we will explore genetic algorithms, an artificial intelligence technique inspired by natural selection, and how they can be leveraged as a rule generation method to effectively block bot traffic.

What are genetic algorithms?

Genetic algorithms are a type of artificial intelligence modeled after the process of natural selection. The main structure of a genetic algorithm is as follows:

  1. Initialization: Create an initial population of potential solutions (individuals).
  2. Evaluation: Evaluate each individual in the population based on a specific criterion.
  3. Selection: Select the best individuals from the population to reproduce.
  4. Reproduction: Combine the genetic material of selected individuals through crossover and create offspring.
  5. Mutation: Introduce random changes (mutations) in the offspring’s genetic material.
  6. Replacement: Replace the least fit individuals in the population with the new offspring.
  7. Termination: Repeat steps 2 to 6 until a satisfactory solution is found or a termination condition is met.

The process of selection, reproduction, and mutation mimics natural selection and genetic variation, allowing the algorithm to explore the solution space and converge towards better solutions over generations.
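To make these steps concrete, here is a minimal, generic sketch of the loop in Python. It is not DataDome's implementation: the problem-specific pieces (random_individual, fitness, crossover, mutate) are placeholders, and the population size, survival rate, and mutation rate are purely illustrative.

```python
import random

def genetic_algorithm(random_individual, fitness, crossover, mutate,
                      population_size=100, generations=50,
                      survival_rate=0.5, mutation_rate=0.05):
    # 1. Initialization: build a random starting population.
    population = [random_individual() for _ in range(population_size)]

    for _ in range(generations):
        # 2. Evaluation + 3. Selection: keep the fittest individuals.
        population.sort(key=fitness, reverse=True)
        survivors = population[: int(population_size * survival_rate)]

        # 4. Reproduction: fill the population back up with offspring.
        offspring = []
        while len(survivors) + len(offspring) < population_size:
            parent_a, parent_b = random.sample(survivors, 2)
            child = crossover(parent_a, parent_b)
            # 5. Mutation: occasionally perturb an offspring at random.
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)

        # 6. Replacement: the least fit individuals are dropped in favor
        # of the new offspring.
        population = survivors + offspring

    # 7. Termination: here we simply stop after a fixed number of
    # generations and return the fittest individual found.
    return max(population, key=fitness)
```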

Leveraging Genetic Algorithms for Bot Protection

The goal of this new rules-based detection method is to use genetic algorithms to discover the segments of traffic caused by malicious activity, then create rules that automatically block those segments.

While DataDome relies heavily on ML techniques to detect malicious bot traffic, our detection engine can also execute millions of rules in a few milliseconds. Thousands of rules, generated both manually and automatically, are already efficiently filtering traffic. Multiplying the methods we use to create rules keeps protection as complete as possible. The nature of this algorithm allows it to find bot traffic that would not have been detected otherwise, because it introduces randomness and makes no assumptions about the nature of the requests to target.

Let’s take a look at how we designed this system:

1. Define Our Desired Output

Our goal is for the algorithm to generate rules that specifically target bot traffic that has not already been blocked by our other detection methods. These rules should be defined as a combination of predicates in order to target specific request signatures.

To begin, we collect a set of unique key-values from the signatures of the requests stored in recent days. This set contains the predicates we will use to build our potential solutions (i.e. our rules targeting specific parts of traffic).

Two past requests and predicates

If we receive a POST request from France with Chrome as a user agent, we add Method=POST, Country=France, and UserAgent=Chrome to our set of predicates.
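As a toy illustration, predicate collection could look like the sketch below. The field names and the plain (key, value) representation are assumptions made for the example, not our actual signature schema.

```python
def collect_predicates(request_signatures):
    """Gather the unique (key, value) pairs seen in recent traffic."""
    predicates = set()
    for signature in request_signatures:
        predicates.update(signature.items())
    return predicates

# A single POST request from France with a Chrome user agent contributes
# three predicates: Method=POST, Country=France, and UserAgent=Chrome.
signatures = [{"Method": "POST", "Country": "France", "UserAgent": "Chrome"}]
print(collect_predicates(signatures))
# {('Method', 'POST'), ('Country', 'France'), ('UserAgent', 'Chrome')}  (order may vary)
```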

We are then able to create a base set of rules by randomly combining elements from our set of predicates. This base set will serve as the starting point for the algorithm’s evolution.

Two main rules to start with

For example, we create rule A, defined as Country=Spain AND UserAgent=Chrome, and rule B as Country=France AND UserAgent=Firefox.

Our hypothesis is that we will be able to make new rules out of these base rules thanks to an evolutionary process. These new rules will use a combination of predicates to effectively match bot traffic.
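Continuing the sketch, a rule can be modeled as a set of predicates combined with AND, and the base population built by random sampling. This representation is an assumption of the example, not our internal rule format.

```python
import random

predicates = {
    ("Method", "POST"), ("Country", "France"), ("Country", "Spain"),
    ("UserAgent", "Chrome"), ("UserAgent", "Firefox"),
}

def random_rule(predicate_pool, size=2):
    """Randomly combine predicates into one candidate rule.

    A real implementation would also avoid contradictory predicates
    on the same key (e.g. Country=France AND Country=Spain).
    """
    return frozenset(random.sample(sorted(predicate_pool), size))

def matches(rule, signature):
    """A request signature matches a rule if all of its predicates hold."""
    return all(signature.get(key) == value for key, value in rule)

# Initialization: a base population of randomly assembled rules, such as
# rule A = {Country=Spain, UserAgent=Chrome} and rule B = {Country=France, UserAgent=Firefox}.
population = [random_rule(predicates) for _ in range(100)]
```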

2. Evaluate Our Potential Solutions

Next, we need to be able to evaluate the fitness of each rule, that is, how good the rule is at targeting bot traffic. This part is challenging: since we are trying to discover bot activity that wasn't detected by any other method, we have no reliable labels to assess whether the requests matched by a rule are bot-made or human-made. The solution we adopted was to look at the time series of the requests that would have been matched by the rule. In particular, we combine two metrics to give a fitness score to each rule:

  1. We evaluate how similar the time series is to the time series of known bot patterns (gathered through DataDome’s other detection methods). We want our rules to be as similar as possible to a known bot pattern.
Time series analysis for known bot patterns

The purple time series is very different from this known bot pattern. If it is different from all of the bot patterns, the rule would get a low score.

Time series analysis by known bot patterns

This orange time series is very similar to this known bot pattern. The rule associated with this time series would get a high score.

  2. We evaluate how similar the time series is to the time series of known legitimate human traffic. We want our rules to be as different as possible from all of our known human patterns.
Time series analysis by known human traffic

The purple time series is very similar to this known human pattern. The rule associated with this time series would get a low score.

Time series analysis of known human traffic

The orange time series is very different from this known human pattern. If it is different from all of the human patterns, the rule would get a high score.

If a time series presents the characteristics of a bot traffic time series and is different from all of our known human time series, the associated rule would be deemed good and get assigned a high score. Otherwise, the rule would get assigned a low score.
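The sketch below illustrates this scoring idea, assuming the time series are request counts per time bucket and using cosine similarity as the similarity measure; the metrics we actually combine are not detailed here.

```python
import numpy as np

def _normalize(series):
    series = np.asarray(series, dtype=float)
    norm = np.linalg.norm(series)
    return series / norm if norm else series

def similarity(a, b):
    """Cosine similarity between two traffic time series (1.0 = identical shape)."""
    return float(np.dot(_normalize(a), _normalize(b)))

def fitness(rule_series, bot_patterns, human_patterns):
    # Reward resembling at least one known bot pattern...
    bot_score = max(similarity(rule_series, p) for p in bot_patterns)
    # ...and penalize resembling any known human pattern.
    human_score = max(similarity(rule_series, p) for p in human_patterns)
    return bot_score - human_score

# A bursty series close to a known bot pattern scores higher than a smooth,
# diurnal-looking series close to the known human pattern.
bots = [[0, 0, 90, 95, 0, 0], [50, 0, 0, 0, 60, 0]]
humans = [[10, 20, 40, 50, 30, 15]]
print(fitness([0, 0, 80, 85, 5, 0], bots, humans))      # high score
print(fitness([12, 22, 38, 45, 28, 14], bots, humans))   # low score
```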

3. “Evolve” Our Rules to a Satisfactory Solution

Once we have assigned a score to each of our rules, we can retain the ones that perform best ("survival of the fittest") and generate new rules by combining the ones we have retained.

The strongest rules survive and are combined

We can combine our rule A and our rule B to create a rule C that could be Country=France AND UserAgent=Chrome.

Top performing rules act as parents

The top-performing rules from the previous generation, along with the newly created rules, make up our next generation.
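One possible crossover for rules represented as predicate sets is sketched below; drawing each predicate from either parent is one of several reasonable choices and is an assumption of this example.

```python
import random

def crossover(rule_a, rule_b):
    """Build a child rule from the predicates of two parent rules."""
    child = {}
    for key, value in sorted(rule_a | rule_b):
        # When both parents constrain the same key, keep one value at random.
        if key not in child or random.random() < 0.5:
            child[key] = value
    return frozenset(child.items())

rule_a = frozenset({("Country", "Spain"), ("UserAgent", "Chrome")})
rule_b = frozenset({("Country", "France"), ("UserAgent", "Firefox")})
print(crossover(rule_a, rule_b))
# One possible child is rule C: {('Country', 'France'), ('UserAgent', 'Chrome')}
```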

Next, we apply random, infrequent mutations to our rules in order to introduce new predicates into the population of potential solutions.

Mutations are introduced to the rules

We can randomly mutate rule C by removing one of its predicates and replacing it with another key-value from our set of predicates, such as replacing Country=France with ConnectionType=Cellular.

By introducing mutations, the algorithm is able to explore new areas of the solution space and potentially find better solutions that it would have missed otherwise.
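Mutation can be sketched in the same spirit: with a small probability, one predicate of the rule is swapped for another drawn from the global predicate set. The mutation rate and replacement strategy below are illustrative.

```python
import random

def mutate(rule, predicate_pool, rate=0.05):
    """Occasionally replace one predicate of the rule with a random one."""
    if random.random() >= rate:
        return rule  # most offspring are left untouched
    kept = list(rule)
    kept.pop(random.randrange(len(kept)))
    replacement = random.choice(sorted(predicate_pool - rule))
    return frozenset(kept + [replacement])

predicate_pool = {
    ("Country", "France"), ("UserAgent", "Chrome"),
    ("ConnectionType", "Cellular"), ("Method", "POST"),
}
rule_c = frozenset({("Country", "France"), ("UserAgent", "Chrome")})
print(mutate(rule_c, predicate_pool, rate=1.0))
# e.g. {('ConnectionType', 'Cellular'), ('UserAgent', 'Chrome')}
```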

Having created this new generation, we’re able to repeat the process of scoring our potential solutions and combining them into new ones over and over. This process will increase the fitness score of our population with each new generation.

Once we’ve gathered enough satisfactory rules, we’re able to stop the evolution process and leverage the new rules through our detection system.
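Putting the pieces together, one possible evolution loop, reusing the sketch functions above, could look like the following. The thresholds and sizes are illustrative, and rule_series is a hypothetical helper that returns the traffic time series a rule would have matched.

```python
import random

def evolve_rules(predicates, bot_patterns, human_patterns, rule_series,
                 population_size=200, score_threshold=0.8,
                 enough_rules=30, max_generations=100):
    # Start from randomly assembled rules (see the initialization sketch above).
    population = [random_rule(predicates) for _ in range(population_size)]

    for _ in range(max_generations):
        # Score every rule on the time series of traffic it would have matched.
        scored = [(fitness(rule_series(r), bot_patterns, human_patterns), r)
                  for r in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)

        # Termination: stop once enough satisfactory rules have been gathered.
        satisfactory = [r for score, r in scored if score >= score_threshold]
        if len(satisfactory) >= enough_rules:
            return satisfactory

        # Selection, reproduction, and mutation produce the next generation.
        survivors = [r for _, r in scored[: population_size // 2]]
        children = [mutate(crossover(*random.sample(survivors, 2)), predicates)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children

    return [r for score, r in scored if score >= score_threshold]
```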

Results & Conclusion

During this “proof of concept” phase of the project, we successfully demonstrated that the algorithm is capable of finding solutions with an increasing fitness score in each new generation.

By running the algorithm periodically on recent data, we were able to continuously detect and filter bot traffic. The rules generated by the algorithm effectively match bot traffic with a minimal false positive rate, thereby validating the effectiveness of our scoring method. About 160K previously undetected malicious requests were blocked over 24 hours with only 30 rules created on a limited scope.

Testing the new rules on real traffic

A sample of rules blocking traffic in real time. The algorithm generated new solutions at 8:40 a.m., resulting in the introduction of new rules to filter bot traffic that was not previously detected.

The main benefit of this method is that we make few, if any, assumptions about which parts of traffic to target or about the structure of the output rules. The algorithm explores the space of potential solutions to find rules that become gradually better at blocking bot traffic with each new generation.

For more insights into how DataDome protects businesses from cyberfraud threats, book a demo today.
