Humans & Machines Uniting to Detect Complex Web Application Attacks

In the last 12 months, the application of machine learning to real-world problems has accelerated, thanks to tools and frameworks that allow ordinary but skilled programmers to apply it to a wide variety of problems. The integration of these toolkits with specialized hardware such as graphics processing units (GPUs) has unleashed tremendous processing power. This integration, combined with a growing community that provides support, guidance, training, and examples, has broadened participation from specialized wizards to mere mortals and moved results from labs to production.

At Alert Logic, we have successfully integrated machine learning into an operational security environment for thousands of customers. Through that experience, we’ve learned that any successful machine learning (ML) program requires five primary ingredients:

  1. Collection of consistent, high-quality, high-volume data
  2. Data scientists to curate and label subsets of data needed for training the ML algorithms
  3. Domain experts to guide the data scientists
  4. A production team to convert the data scientists’ results into a scalable, high-quality production implementation
  5. A feedback loop that evaluates the performance of the algorithms and continually improves the results

All parts are tightly linked. Collecting high-quality, consistent data is critical. If the data is noisy, the system will train incorrectly and produce bad results (garbage in, garbage out). Once the algorithm is trained, the real-world data must be measured and collected with the same kind of sensors that produced the training data. Otherwise, it’s like training the algorithm on text documents and expecting it to recognize speech. We solved the consistency problem at Alert Logic by deploying our own sensors in the customer environments.

How It Works: Do Not Take Machine Learning Lightly

Domain experts work closely with data scientists to create a training set for the machine learning algorithm. In many real-world applications, creating the training set is the hardest part of ML. Domain experts can interpret the raw data and point out to the data scientists which features or characteristics are important. The data scientists use this knowledge to select an appropriate ML technique and to transform the raw data into a balanced training set for the algorithm. The training set must be carefully constructed so that the ML trains on the right features and learns from both positive and negative examples.
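As a rough illustration of what "balanced" can mean here, the sketch below downsamples every class to the size of the smallest one, so that rare attack examples are not swamped by benign traffic. Downsampling is only one of several balancing techniques and is not necessarily the one used at Alert Logic; the function name, the (features, label) layout, and the sample requests are all hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed=0):
    """Return a shuffled subset with equally many examples per label.

    `examples` is a list of (features, label) pairs. Every class is
    randomly downsampled to the size of the smallest class.
    """
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append((features, label))

    if not by_label:
        return []

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    smallest = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical labeled requests: 1 = attack, 0 = benign.
labeled = [("GET /index.html", 0)] * 95 + [("GET /?id=1' OR '1'='1", 1)] * 5
training_set = balance_by_downsampling(labeled)
```

With 95 benign and 5 attack examples, the result holds 5 of each. The trade-off is that most of the benign data is discarded, which is one reason a real pipeline might prefer class weighting or oversampling instead.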

Once the algorithm is working, moving it from the lab to a production environment is another hurdle. The set of skills needed to build a robust, nonstop, maintainable production environment is different from the set needed to apply ML techniques to a problem domain. The production team needs to be hardcore engineers who can understand the data scientists’ algorithms and needs, and who can build an environment in which that work can run.

Once the machine learning algorithms are trained, running in production, and producing results, the experts need to identify which results are accurate and which are not. This new data is then given to the data scientists, who can tweak the algorithm, add the data to their training set, or both. This feedback loop is critical to developing an effective algorithm. At Alert Logic, we have a 24×7 security operations center that validates the output of our detection algorithms, provides feedback to improve the algorithms, and escalates the results for our customers.
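The feedback loop described above can be sketched in a few lines. This is a minimal illustration under assumed names, not Alert Logic's implementation: the `Review` record, its fields, and the sample data are hypothetical. It measures how often flagged results were confirmed by an analyst and collects the disagreements as relabeled examples for the next round of training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One production result checked by an analyst (hypothetical schema)."""
    features: str
    predicted_attack: bool
    analyst_says_attack: bool


def precision(reviews):
    """Fraction of flagged results that analysts confirmed as attacks."""
    flagged = [r for r in reviews if r.predicted_attack]
    if not flagged:
        return None  # nothing was flagged, so precision is undefined
    confirmed = sum(r.analyst_says_attack for r in flagged)
    return confirmed / len(flagged)


def corrections(reviews):
    """Relabeled examples where the analyst disagreed with the model."""
    return [
        (r.features, int(r.analyst_says_attack))
        for r in reviews
        if r.predicted_attack != r.analyst_says_attack
    ]


reviews = [
    Review("req-a", True, True),
    Review("req-b", True, False),   # false positive
    Review("req-c", False, True),   # missed attack
    Review("req-d", False, False),
]
training_set = []
training_set.extend(corrections(reviews))  # feed analyst verdicts back in
```

Here one of the two flagged requests was confirmed, and both the false positive and the missed attack are returned with the analyst's label so they can be added to the training data.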

Developing a successful machine learning environment requires all five of these ingredients, and that investment needs to be planned for from the start.

The benefits of starting an ML program can be great. Machine learning allows us to shift from problems that can be managed inside a human brain to much more complex ones, such as human genome sequencing, facial recognition, cybersecurity, and cancer treatment. It’s important to emphasize that while AI solutions can far outperform humans at addressing complex problems, we still need human domain experts to program and steer them.