IT organizations striving to comply with HIPAA, Sarbanes-Oxley, Basel II and similar regulations generally have a good grasp of the security considerations that apply to key hardware and software systems running in the cloud: user authentication and access control, disk encryption, update planning and backup/restore. But there are high availability (HA) and disaster recovery (DR) considerations that may be less obvious. Key systems may need to be available no less than 99.99% of the time, and a DR infrastructure may be needed to ensure continuity of operations in the event of a regional catastrophe that takes out the primary cloud infrastructure. While there are numerous options for configuring HA and DR solutions in the cloud, not all of them are suitable for solutions designed with regulatory compliance in mind.
Configuring for HA in a Regulated Environment
The first thing every organization in a regulated environment needs to understand is that its IT team is ultimately responsible for the security of data and applications, particularly when using an infrastructure-as-a-service (IaaS) offering from a cloud service provider. A provider such as AWS, Azure or Google Cloud Platform (GCP) may be responsible for maintaining the virtual machine (VM) infrastructure you’re using, but you are responsible for patching the operating system and applications, setting the access control lists (ACLs) and maintaining the integrity of the software and solutions running on top of that VM.
Then, depending on business, industry or regulatory requirements, you may need to configure certain critical applications for HA, meaning that they will be available no less than 99.99% of the time (roughly 52 minutes of downtime per year). In the cloud, you’ll want to run those critical applications on VMs configured as nodes in a failover cluster: a multi-node compute cluster with intelligent software that immediately moves workloads from one VM to another if the first VM becomes unresponsive. To achieve 99.99% availability, you’ll need to place the clustered VMs in at least two distinct availability zones (AZs), which are, practically speaking, separate data centers. That way, if an entire AZ goes offline, it will not take both your primary and secondary VMs with it; the failover cluster can move the critical workloads to a VM in the AZ that remains online.
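The failover logic described above can be sketched in miniature. The following Python simulation is illustrative only: the node names, zone labels and health-check stand-in are hypothetical, not any cloud provider's API. Real failover clustering software performs the same essential steps: probe the active node, and if it is unresponsive, promote a healthy node in a different AZ.

```python
# Hypothetical node records for a two-node failover cluster spanning
# two availability zones; all names and fields are illustrative only.
cluster = [
    {"name": "vm-primary", "az": "zone-a", "healthy": True, "active": True},
    {"name": "vm-secondary", "az": "zone-b", "healthy": True, "active": False},
]

def heartbeat(node):
    """Stand-in for a real health probe (network ping, service check, etc.)."""
    return node["healthy"]

def failover(cluster):
    """Keep the active role on a healthy node, preferring another AZ's
    node only when the current active node fails its heartbeat."""
    active = next(n for n in cluster if n["active"])
    if heartbeat(active):
        return active  # active node still healthy; nothing to do
    for node in cluster:
        if node is not active and heartbeat(node):
            active["active"] = False
            node["active"] = True  # promote the surviving node
            return node
    raise RuntimeError("no healthy node available")

# Simulate an AZ outage taking down the primary VM.
cluster[0]["healthy"] = False
new_active = failover(cluster)
print(new_active["name"], new_active["az"])  # vm-secondary zone-b
```

Because the two nodes sit in different AZs, the promotion succeeds even when the entire first data center is unreachable, which is the property the multi-AZ requirement exists to guarantee.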
What’s critical from a security and regulatory perspective, though, is that all the secondary VMs be configured identically to your primary VMs. The ACLs and audit controls you apply to the VMs in one AZ must also be applied to the VMs in the second AZ. You’ll also need to ensure that any updates to the security infrastructure affecting the VMs in one AZ are also applied to the VMs in the other AZ.
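One practical way to enforce the identical-configuration requirement is an automated drift check that compares the security-relevant settings of the secondary VMs against the primary before each audit or update cycle. The sketch below uses hypothetical setting names; a real check would pull these values from your provider's configuration or compliance tooling.

```python
# Illustrative security baselines for primary and secondary VMs;
# keys and values are hypothetical, not any provider's API.
primary_config = {
    "os_patch_level": "2024-06",
    "acl": {"admins": ["ops-team"], "auditors": ["compliance-team"]},
    "audit_logging": True,
    "disk_encryption": "AES-256",
}
secondary_config = {
    "os_patch_level": "2024-05",  # drifted: behind the primary's patch level
    "acl": {"admins": ["ops-team"], "auditors": ["compliance-team"]},
    "audit_logging": True,
    "disk_encryption": "AES-256",
}

def config_drift(primary, secondary):
    """Return every setting where the secondary differs from the primary,
    as {setting: (primary_value, secondary_value)}."""
    return {
        key: (primary[key], secondary.get(key))
        for key in primary
        if secondary.get(key) != primary[key]
    }

drift = config_drift(primary_config, secondary_config)
print(drift)  # {'os_patch_level': ('2024-06', '2024-05')}
```

An empty result means the secondary VM would present the same security posture to an auditor as the primary; any non-empty result flags a gap to close before the next failover could expose it.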
Securing Data in the Cloud
Ensuring the security and integrity of your data is always of utmost importance in a regulated environment, but in the cloud you may need to manage your storage differently than you would in an on-premises configuration. Some cloud service providers offer shared storage options that ostensibly enable you to configure a failover cluster much as you might on premises. However, not all shared storage options can be configured to support a failover cluster that spans multiple AZs, and not all of them allow the level of data encryption that regulatory compliance (or your board of directors) may require.
Consequently, many organizations build failover clusters in the cloud with storage attached to each VM. This approach provides the greatest combination of security, availability and flexibility to meet both business and regulatory requirements. The question then becomes how best to replicate data securely and quickly from primary to secondary storage, so that the secondary infrastructure can step in immediately if the primary infrastructure goes offline unexpectedly.
Some database and ERP applications offer built-in data replication services, but these are often designed to replicate only native data (e.g., the SQL Server database). They ignore any other data in storage, and that data may be critical from a business or regulatory standpoint. Some of these tools also provide only data replication, not the cluster failover management that is critical to high availability. More complete solutions can be found in application-agnostic SANless clustering products, which provide full failover management and synchronous data replication. If your data is encrypted, you’ll want a solution that provides block-level replication, because block-level replication tools are indifferent to the nature or source of the data being replicated: they simply copy blocks of data from one storage system to another. Look for synchronous replication, too, as it ensures that the data replicated to the secondary VMs is always identical to the data attached to the primary VM. In a failover scenario, the secondary VM can immediately take up the workloads of the primary VM and carry on with no loss or corruption of data.
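The two properties just described, block-level and synchronous, can be illustrated together. In this minimal sketch (class names and sizes are invented for illustration), a write is not considered complete until every replica has stored the identical block, and the replicator never inspects the block contents, which is why encrypted data replicates unchanged.

```python
class BlockDevice:
    """Toy fixed-size block store standing in for a VM's attached disk."""
    def __init__(self, num_blocks, block_size=4096):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("block-level replication moves fixed-size blocks")
        self.blocks[index] = data

class SyncReplicator:
    """Synchronous write path: a write returns only after every replica
    holds the same block -- the property that lets a secondary VM take
    over with no data loss."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, index, data):
        self.primary.write_block(index, data)
        for replica in self.replicas:         # completes only after all
            replica.write_block(index, data)  # replicas store the block

primary = BlockDevice(num_blocks=8)
secondary = BlockDevice(num_blocks=8)
repl = SyncReplicator(primary, [secondary])

payload = b"\x17" * 4096  # could be ciphertext; the replicator never looks
repl.write(3, payload)
print(primary.blocks[3] == secondary.blocks[3])  # True
```

Because the replicator copies opaque blocks, the encryption applied at the primary travels with the data; nothing in the replication path needs the keys.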
Disaster Recovery Considerations
Even if your critical applications do not need the 99.99% availability that an HA configuration provides, the regulations governing your industry may require you to protect your regulated data from catastrophic loss. The same SANless clustering approach can be used in the cloud to configure a DR solution. Instead of configuring your VMs in a failover cluster spanning two AZs in the same region, you’d configure your VMs across AZs in two geographically distinct regions.
The distance between the two regions will likely require asynchronous data replication between the primary and DR storage infrastructures. That can create windows during which the primary and secondary infrastructures are a few seconds out of sync. If the AZs in your primary region went dark during such a window, you would be able to bring your DR infrastructure online quickly and with only a few seconds of data loss, ensuring minimal operational interruption in the face of a disaster.
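That out-of-sync window is easy to see in a toy model. In the asynchronous sketch below (an illustration, not any product's replication engine), writes are acknowledged at the primary immediately and shipped to the DR region afterward, so at any moment the DR copy can lag by whatever is still queued, which is exactly the few seconds of potential data loss described above.

```python
from collections import deque

primary_log = []     # writes committed and acknowledged in the primary region
dr_log = []          # writes applied in the DR region
in_flight = deque()  # acknowledged writes not yet shipped cross-region

def write_async(data):
    """Async write: acknowledge locally first, replicate in the background."""
    primary_log.append(data)  # application sees success immediately
    in_flight.append(data)    # cross-region shipment happens later

def drain_one():
    """Ship one queued write to the DR region (simulates background transfer)."""
    if in_flight:
        dr_log.append(in_flight.popleft())

for i in range(5):
    write_async(f"txn-{i}")
drain_one()
drain_one()

# If the primary region failed right now, txn-2 through txn-4 would be
# lost: they were acknowledged but never reached the DR region.
print(len(primary_log) - len(dr_log))  # 3
```

Synchronous replication would shrink that gap to zero by refusing to acknowledge a write until the DR copy held it, but across hundreds of miles the added round-trip latency is usually unacceptable, which is why DR configurations accept a small, bounded loss window instead.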