Building a Multi-cloud Logging Strategy: Issues and Pitfalls

Posted under: Heavy Research

As we begin our series on multi-cloud logging, we will start with why some traditional approaches to logging won’t work. I generally don’t like to start on a negative tone, but we think it is important to point out some of the challenges and pitfalls that commonly beset firms on their first migration to the cloud. It also helps frame why we make other recommendations later in this series. Let’s take a look at some of the common issues by category.

Tooling

  • Scale & Performance: Most of the log management and SIEM platforms in existence were designed and first sold before anyone had heard of clouds, Kafka or containers. They were architected for ‘hub-and-spoke’ deployments on flat networks, and ‘scalability’ meant they could run on a bigger server. This matters because the infrastructure we now monitor is agile, designed to auto-scale up when we need processing power and scale back down to reduce costs. The ability to scale up, scale down, and scale out is an essential characteristic of cloud, and it is commonly the missing ingredient in older logging products, which require manual setup and lack full API enablement and auto-scaling capability.
  • Data Sources: We mentioned in the introduction that some common network log sources are unavailable. Similarly, as automation and orchestration of cloud resources are conducted through API calls, the API logs become an important source. Data contained in these new log sources may change in format, and the indicators used to group events or users within logs change. For example, servers in auto-scale groups may share a common IP address, and functions and other ‘serverless’ infrastructure are ephemeral, making it impossible to differentiate one instance from the next. The point here is that the tools you rely upon need to ingest new types of logs, ingest them faster, and change their methods of threat detection according to type.
  • Identity: Understanding who did what requires a sense of identity. That identity may be a person, a service, or a device. Regardless, the need to map identity, and perhaps correlate it across one or more sources, is even more important in hybrid and multi-cloud environments.
  • Sheer Volumes of Logs: When SIEM started to make the rounds, there were only so many security tools and they were pumping out only so many logs. Between new niches in security and new compliance regulations, the array of log sources sending unprecedented amounts of logs to collect and analyze only grows every year. Moving from traditional AV logs to EPP, for example, is a huge endpoint log volume increase. Add in EDR logs and you’re really into some serious volumes. On the server side, moving from network and server logs to add application layer and container logs will be a non-trivial increase in log volume. There are only so many tools designed to handle event rates (X billions of events per day) and volumes (Y terabytes per day) without buckling under the load, and more importantly, there are only so many people who know how to deploy and operate them in production. While storage is plentiful and cheap in the cloud, you have to get those logs into storage from the various sources that reside on-prem and in various cloud services, from IaaS/PaaS/SaaS. If you think that’s easy, call your SaaS vendor and ask them how you can export all your logs out of their cloud and into yours (S3/ADLS/GCS/etc). The old saw from Silicon Valley ‘but does it scale’ is funny but also really does apply in some cases.
  • Bandwidth: While we’re on the topic of ridiculous volumes of logs, let’s discuss bandwidth. Network bandwidth and transport layer security between on-prem and cloud, and between clouds, are non-trivial. There are financial costs as well as engineering and operational considerations. If you don’t believe me, ask your AWS or Azure sales person how to move, say, 10 terabytes a day between the two clouds. In some cases, the architecture will only allow a certain amount of bandwidth for log movement and transport, so consider this when you’re planning any migrations or add-ons.
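To make the data-source point concrete, consider what happens to event grouping when instances in an auto-scale group share one egress IP. A minimal Python sketch (the event fields and instance IDs are hypothetical, loosely modeled on API-style audit logs):

```python
from collections import defaultdict

# Hypothetical API-audit events: note all three share one egress IP,
# but they come from two different instances.
events = [
    {"source_ip": "10.0.0.5", "principal": "i-0abc123", "action": "GetObject"},
    {"source_ip": "10.0.0.5", "principal": "i-0def456", "action": "PutObject"},
    {"source_ip": "10.0.0.5", "principal": "i-0abc123", "action": "DeleteObject"},
]

def group_events(events, key):
    """Group event actions by the given field."""
    groups = defaultdict(list)
    for e in events:
        groups[e[key]].append(e["action"])
    return dict(groups)

by_ip = group_events(events, "source_ip")         # one bucket: identity lost
by_principal = group_events(events, "principal")  # two buckets: identity kept
```

Keyed by IP, actions from two distinct actors collapse into a single bucket; keyed by principal, the actors stay separate. The same idea applies to user ARNs, service accounts, and container IDs.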
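The volume and bandwidth concerns above are easy to estimate on the back of an envelope. The event rate, event size, and per-GB transfer price below are assumptions for illustration only, not quotes; check your provider’s current pricing:

```python
def daily_volume_gb(events_per_second, avg_event_bytes):
    """Rough daily log volume in decimal GB."""
    return events_per_second * avg_event_bytes * 86_400 / 1e9

def monthly_egress_cost(gb_per_day, price_per_gb):
    """Rough monthly transfer cost; price_per_gb is an assumed rate."""
    return gb_per_day * 30 * price_per_gb

# e.g. 50,000 events/sec at roughly 1 KB per event:
vol = daily_volume_gb(50_000, 1_000)   # about 4,320 GB per day
cost = monthly_egress_cost(vol, 0.09)  # at an assumed $0.09/GB
```

At 50,000 events per second and about 1 KB per event, that is roughly 4.3 TB per day, which at an assumed $0.09/GB works out to five figures a month in transfer costs alone, before you’ve analyzed a single event.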

Structure

  • Multi-account, Multi-cloud Architectures: Cloud security promotes things like micro-segmentation, multi-account strategies, closing down all unnecessary network access, and even running different workloads in different cloud environments. This sort of segmentation makes it very difficult for attackers to pivot if they gain a foothold. It also means you will need to consider what cloud-native logs are available, what you will need to supplement with other tooling, and how you will stitch all of these sources together. Expecting to dump all your events into a syslog-like service and let them percolate back on-premises is ineffective. You’ll need to employ new architectures for log capture, filtering and analysis to compensate. Storage is the easy part.
  • Monitoring ‘Up The Stack’: As the cloud provider manages infrastructure, and possibly the applications as well, the focus of threat detection must shift from networks to applications. This is partly because you lack visibility into network operations, and partly because cloud network deployments are commonly more secure, prompting attackers to shift their focus up the stack. Even if you’re used to monitoring the app layer from a security perspective, for example with a big WAF in front of your servers on-prem, do you know whether that vendor has a viable cloud offering? If you’re lucky enough to have one that works in both places, and you CAN deploy in the cloud as well, answer this (before you initiate the project…): Where will those logs go, and how will you get them there?
  • Storage vs. Ingestion: Data storage in cloud services, especially object storage, is so cheap it is practically free, and long-term data archival services offer huge cost advantages over older on-premises solutions. In essence we are encouraged to store more. But while storage is cheap, it’s not always cheap to ingest more data, because some logging and analytics services charge based upon volume (GB ingested) and event rates (number of events) ingested into the tool/service/platform. Examples are Splunk, Azure Event Hubs, AWS Kinesis, and Google Stackdriver. Many cloud log sources are verbose in both the number of events they log and the amount of data generated per event. So you will need to architect your solution to be economically efficient, as well as negotiate with your vendors over ingestion of noisy sources such as DNS and proxy logs.
    Side note on ‘closed’ logging pipelines: Some vendors want to own your logging pipeline on top of your analytics toolset. This may sound convenient because it ‘just works’ (mostly for their business model), but beware the lock-in of such approaches. You face cost overruns from the inability to dedupe and filter before ingestion (the water meter is running on every log ingested), as well as the opportunity cost of analytical capabilities other tools could provide, but can’t if you cannot get them a copy of your log stream. Just because you can afford to move a bunch of logs from place to place doesn’t mean it’s easy. Some logging architectures are not conducive to sending logs to more than one place at a time, and once ‘in’ their system, exporting all logs (not just alerts) to another analytical tool is incredibly difficult and resource-intensive, because the logs have already been ‘cooked’ into a proprietary format you then have to reverse after export before other tools can make sense of them.
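One way to avoid the lock-in described above is to tee the raw log stream to every consumer yourself, applying dedupe and filtering before anything metered sees the data. A minimal sketch, with in-memory lists standing in for real sinks (SIEM, object storage, and so on):

```python
def fan_out(records, sinks, drop=lambda r: False):
    """Send each raw record to every sink, skipping records the
    pre-ingestion filter drops (the 'water meter' never sees them)."""
    for record in records:
        if drop(record):
            continue
        for sink in sinks:
            sink.append(record)

siem, archive = [], []
logs = [
    {"source": "dns", "msg": "benign lookup"},
    {"source": "api", "msg": "AssumeRole by unknown principal"},
]
# Drop noisy DNS events before any per-GB metered ingestion.
fan_out(logs, [siem, archive], drop=lambda r: r["source"] == "dns")
```

Because you own the tee, every analytical tool gets the same raw stream, and you can add or swap sinks later without re-exporting ‘cooked’ data from a proprietary store.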

Process

  • What To Filter And When: Compliance, regulatory and contractual commitments prompt organizations to log everything and store it all forever (ok, not literally, but pretty much). And not just in production, but in pre-production, development and test systems. Combine that with overly chatty cloud logging systems (what do you plan to do with logs of every single API call into the entire cloud you’re operating?), and you quickly get overloaded with data. This results in both slower processing and higher costs. Dealing with this problem means deciding what must be kept vs. what can be filtered; what needs to be analyzed vs. what must be captured for posterity; what is relevant today for security analysis and model building, but irrelevant tomorrow. One of the decision points you’ll want to address early on in the logging strategy is what you consider perishable/real-time needs vs. historical/high-latency-tolerant needs.
  • Speed: For several years there has been a movement away from batch processing toward real-time analysis (footnote: batch can be very slow (days) or very fast micro-batching within 1- or 2-second windows, so we use ‘batch’ to mean anything that is not real-time streaming but closer to daily in frequency). Batch mode, with normalization and storage prior to analysis, is becoming antiquated. Stream processing infrastructure, machine learning, and the concept of ‘stateless security’ allow, even promote, events to be analyzed as they are received. Changing the process to analyze events as they arrive is necessary to react in real time to attackers, who commonly employ fully automated attacks.
  • Automated Response: In the face of nukeware with rapid lateral movement (see WannaCry/NotPetya), response models tuned for stealthy, low-and-slow IP and PII exfiltration attacks need an overhaul. Many on-premises megaglobocorps and government agencies suffered tremendously in 2017 from fast-spreading ‘ransomworms’. Once the fast-movers execute, you cannot detect your way out of these grenades in your datacenter; they’re very obvious and very loud. The good news is that the cloud model has essential characteristics that enable micro-segmentation and automated response, and cloud doesn’t rely on the ancient identity and lateral-movement-enabling network protocols that plague even the best on-prem security shops. That said, bad practices in the cloud won’t save you from even untargeted shenanigans; remember the MongoDB massacre of January 2017? https://blog.rapid7.com/2017/01/30/the-ransomware-chronicles-a-devops-survival-guide/ Speed in responding to things that look wrong is a major factor in dropping the net on the bad guys as they try to pwn you. Knowing exactly what you have, its known-good state, and other factors cloud enables are advantages the blue team can leverage when defending in the cloud.
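The filtering decisions above can be expressed as a simple routing policy: perishable events feed real-time analysis, compliance-relevant events go to cheap archival storage, and noise is dropped. The rules below are purely illustrative, not a recommended policy:

```python
def route_event(event):
    """Decide an event's fate: analyze now, archive cheaply, or drop.
    The rules here are illustrative only."""
    if event.get("severity", "info") in ("high", "critical"):
        return "realtime"  # perishable: feed the detection pipeline now
    if event.get("env") != "prod":
        return "drop"      # chatty dev/test noise
    return "archive"       # keep for compliance and high-latency queries

routes = [route_event(e) for e in [
    {"severity": "critical", "env": "prod"},
    {"severity": "info", "env": "dev"},
    {"severity": "info", "env": "prod"},
]]
```

The point is that the policy lives in code you control, applied before events hit any metered service, so tuning what you keep is a config change rather than a vendor negotiation.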
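Putting the speed and automated-response points together: analyze each event as it arrives, keep only minimal state, and fire a response hook the moment a rule trips, rather than waiting for a batch job. A sketch, where quarantine() is a hypothetical stand-in for a real isolation action (such as moving an instance into a locked-down security group):

```python
quarantined = []

def quarantine(instance_id):
    """Hypothetical response hook; a real one would call your cloud
    provider's API to isolate the instance."""
    quarantined.append(instance_id)

def analyze_stream(events, threshold=3):
    """Per-event analysis with minimal state: count failed logins per
    instance and respond as soon as the threshold is crossed."""
    failures = {}
    for e in events:
        if e["type"] == "failed_login":
            n = failures.get(e["instance"], 0) + 1
            failures[e["instance"]] = n
            if n == threshold:
                quarantine(e["instance"])

analyze_stream([{"type": "failed_login", "instance": "i-01"}] * 3
               + [{"type": "failed_login", "instance": "i-02"}])
```

Against a fast-moving worm, the difference between responding on the third bad event and responding after tonight’s batch run is the difference between one quarantined instance and a flattened account.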

Again, the point here is not to bash some of the older products, but rather to point out that to work in cloud environments, you need to rethink how you use tools today and alter their deployment models to be effective. Most will work after some re-engineering of the deployment. In general we are fans of deploying known technologies where appropriate, to help reduce the skills gap most security and IT teams face. But in some cases you will find new and different ways to supplement existing logging infrastructure, and will likely run multiple analysis capabilities in parallel.

Next up in our series: Multi-cloud logging architectures and design.

-Adrian & Gal




*** This is a Security Bloggers Network syndicated blog from Securosis Blog authored by info@securosis.com (Securosis). Read the original post at: http://securosis.com/blog/building-a-multi-cloud-logging-strategy-issues-and-pitfalls