Introduction
“Infrastructure as Code” (IaC) is a term every system administrator has heard by now. We can think of it as the process of managing and provisioning IT infrastructure through source code instead of performing tasks manually. As we will explore, this helps DevOps teams efficiently and safely adapt infrastructure to meet the always-changing requirements dictated by the business.
How can this paradigm help you? It encourages the adoption of software development practices such as keeping infrastructure definitions and configuration scripts in source control, testing code automatically, and conducting peer reviews. This benefits infrastructure management in numerous tried-and-true ways.
If you’re starting your journey into IaC, there are many resources you can reference to familiarize yourself with the concepts and terminology associated with this approach. Kief Morris’ “Infrastructure as Code: Managing Servers in the Cloud” is an essential book on the topic (alternatively, Martin Fowler’s blog gives a great overview).
At Etleap, we embrace IaC to build and improve our service every day. This practice helps us in our ongoing effort to make Etleap the best ETL platform it can be.
Let’s take a look at a few examples of how using IaC has helped Etleap build a better product.
Service uptime and disaster recovery
One advantage of IaC is that it makes it possible to effortlessly and reliably spin up any element of an infrastructure at any time, or even the entire infrastructure, in a matter of minutes. The new infrastructure will be consistent with the previous one, which is to say that its software and configuration are the same (every security patch is applied, OS is configured the same way, allocated resources are identical).
Imagine a scenario where extreme weather or a natural disaster destroys the data centers where Etleap is hosted. For obvious reasons, it’s vital that we have a plan to recover from such an ordeal. Using IaC, we’re able to easily and reliably reproduce the entire infrastructure needed by Etleap and get it running in a new data center in short order. And so, even in this extreme case we’re able to recover from a service disruption incredibly quickly.
Another common issue is configuration drift, where the actual state of the infrastructure gradually diverges from its intended configuration. It is a major concern for services that must ensure high availability and rely on disaster recovery strategies. If left unchecked, configuration drift increases the risk of prolonged outages or loss of data. By making sure every change to the infrastructure configuration goes through the definition files or scripts, we can eliminate configuration drift. This way, we reduce the risk of misconfiguration issues when we need to re-provision our infrastructure.
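To make the idea concrete, here is a minimal sketch of a drift check that compares the configuration declared in a definition file with the configuration observed on a running node. The setting names and values (`instance_type`, `os_patch_level`, `open_ports`) are hypothetical and chosen only for illustration; this is not how Etleap's tooling works.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (declared_value, actual_value)} for every setting that differs.

    A setting that exists on only one side is reported with None on the other side.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared state (what the definition files say).
declared = {"instance_type": "m5.large", "os_patch_level": "2024-05", "open_ports": [22, 443]}
# Hypothetical observed state (what is actually running after manual changes).
actual = {"instance_type": "m5.large", "os_patch_level": "2024-03", "open_ports": [22, 443, 8080]}

drift = find_drift(declared, actual)
for setting, (want, have) in drift.items():
    print(f"{setting}: declared={want!r} actual={have!r}")
```

When every change goes through the definition files, a check like this should always come back empty, which is what makes re-provisioning predictable.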
Finally, to keep Etleap up and running at all times, we should be able to add more resources or replace an unhealthy component at any time. Let’s imagine that a server instance stops serving requests because it’s running out of memory. In this case we should be able to provision a new server with more memory and redirect the traffic to it. Etleap has dealt with a similar challenge, where we encountered memory shortages when running an Amazon Elastic MapReduce (EMR) cluster. After the EMR cluster had become unhealthy, we traced the root cause to memory degradation. Because the cluster’s provisioning and configuration were scripted, it was straightforward to update the configuration, start a new cluster, and point Etleap to it once it had launched, with zero downtime for our users.
Improved monitoring, highly secure
With IaC tools available, almost every aspect of an infrastructure’s configuration can be defined in a configuration file or scripted. This covers not only physical hardware, networks, and storage, but also identity and access management (IAM), monitoring, alarm systems, and much more.
Going back to our example of a server running out of memory: when things go sideways, it’s essential to have a monitoring system that alerts us to these issues so we can avoid service outages. If we know a certain node is going into a bad state, we can take the needed action to improve its behavior or, in the worst case, replace the node outright. This way, we’re usually able to resolve the issue before our customers notice any problems or downtime. It also makes a lot of sense to tie the definition of these alarms to the infrastructure they’re monitoring: any time the infrastructure changes, its monitoring is updated as well.
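One way to picture alarms that are tied to the infrastructure they monitor is to derive each alarm from the server definition itself. The sketch below is illustrative only: the `Server` and `MemoryAlarm` types, the server name, and the 90% threshold are assumptions, not part of any particular IaC tool or of Etleap's setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical server definition, as it might appear in a definition file."""
    name: str
    memory_mb: int


@dataclass(frozen=True)
class MemoryAlarm:
    """Hypothetical alarm that fires when used memory exceeds threshold_mb."""
    server_name: str
    threshold_mb: int


def memory_alarm_for(server: Server, max_used_percent: int = 90) -> MemoryAlarm:
    """Derive the alarm from the server definition so the two cannot diverge."""
    return MemoryAlarm(server.name, server.memory_mb * max_used_percent // 100)


# Resizing the server automatically moves the alarm threshold with it.
small = Server(name="app-1", memory_mb=8192)
large = Server(name="app-1", memory_mb=16384)

small_alarm = memory_alarm_for(small)
large_alarm = memory_alarm_for(large)
print(small_alarm)
print(large_alarm)
```

Because the alarm is computed from the definition, replacing the node with a larger one needs no separate monitoring change.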
IAM is hugely important when it comes to security. Meticulously defining the right access levels and ingress rules to different parts of the infrastructure is crucial for data and system protection. By restricting access to production servers we can prevent unauthorized persons from gaining access to sensitive data. Finally, audits and reviews of the configuration and any changes allow us to maintain the right access at all times.
Etleap productization
At Etleap, IaC practices enable a repeatable deployment process. Each time we provision our infrastructure the result is a known quantity, and that’s something we take advantage of in multiple ways.
Etleap is SaaS, meaning our product runs in the cloud and our users don’t need to install or maintain anything to start using it. However, some of our customers, especially those with strict security requirements, require that Etleap run in an isolated AWS VPC. Embracing IaC helps us efficiently deploy Etleap to a completely new environment. The installation process is well-defined and tested, and is a daily occurrence for us. This allows us to ensure that Etleap running in one environment behaves identically to an instance running in a different environment, which saves time when identifying issues and reduces the need for customers to contact the support team. Thinking of infrastructure as a product itself gives Etleap a competitive advantage, as it allows us to serve customers with complex security requirements.
Running identical instances of Etleap in multiple environments also simplifies updates. For example, diagnosing and fixing a bug for a user running Etleap in their own VPC would be really challenging if the environments differed from one another. By ensuring parity between all environments where Etleap is deployed, we eliminate this potential headache.
Streamline development and delivery cycle
IaC helps manage not only production environments but the entire software development lifecycle. During development, we can provision an isolated sandbox environment to safely make changes without the risk of breaking something. We can test new changes against our sandbox environment to more quickly detect whether they would negatively affect the production environment when deployed. Having each new feature or bug fix properly tested during development reduces the risk of introducing issues when changes are rolled out. Once thoroughly tested, changes are automatically deployed through a CI/CD process: any new feature or bug fix is rolled out to our users as soon as it’s merged into the master branch.
For example, some time ago I was tasked with improving our validation process for users wanting to add or edit an S3 data lake or S3 input connection. One of our goals was to give the user more accurate information about misconfiguration problems with their connections. In both cases, most of these configuration issues were related to incorrect policies being attached to a given IAM user. It would have been quite tedious to set up all these cases manually through the AWS console. Instead, we were able to quickly and easily script the policies that matched the cases we wanted to test and roll them out to the sandbox environment.
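As a rough illustration of what scripting such test cases can look like, the sketch below generates one IAM policy document per misconfiguration, each missing exactly one permission. The list of actions, the bucket name, and the helper names are assumptions made for this example; they are not the permissions Etleap actually requires or the scripts we used.

```python
import json

# Assumed set of S3 permissions a connection needs (illustrative only).
FULL_ACTIONS = ["s3:GetObject", "s3:ListBucket", "s3:PutObject"]


def s3_policy(bucket: str, actions: list[str]) -> dict:
    """Build an IAM policy document allowing the given actions on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Bucket-level and object-level ARNs, so both kinds of action can match.
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


def misconfigured_variants(bucket: str, actions: list[str]) -> dict[str, dict]:
    """Return one policy per action, each missing exactly that action."""
    return {
        f"missing-{action.split(':')[1]}": s3_policy(bucket, [a for a in actions if a != action])
        for action in actions
    }


# Hypothetical sandbox bucket name.
variants = misconfigured_variants("example-sandbox-bucket", FULL_ACTIONS)
for case_name, policy in variants.items():
    print(case_name)
    print(json.dumps(policy, indent=2))
```

Each generated document can then be attached to a test IAM user in the sandbox, so the validation logic is exercised against every missing-permission case.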
Another case where we took advantage of our ability to effortlessly provision a sandbox environment during development was when we improved our ZooKeeper cluster. We switched from a standalone ZooKeeper node to an ensemble of nodes. We scripted the cluster configuration and provisioned it in a sandbox environment. This way, we could test that the cluster was working as expected. We were also able to stress test the cluster to see how it behaved. There were some questions we wanted to answer before rolling it out, such as: How well does the cluster behave when nodes are disconnected? Are new nodes automatically incorporated into the cluster? Will the master node switch to another node when it becomes unhealthy? We tested each of these scenarios in the safety of our sandbox environment without affecting production. When we finally rolled the new ZooKeeper cluster out, we could rest easy that it would work as expected, as we’d already tested against many of the possible points of failure during development.
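The question about disconnected nodes has a simple arithmetic core: a ZooKeeper ensemble stays available only while a majority of its nodes can talk to each other. The sketch below computes that majority and generates the `server.N=host:port:port` lines used in `zoo.cfg`. The host names are hypothetical, and the ports are the ones commonly shown in ZooKeeper examples rather than anything specific to our setup.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


def ensemble_server_lines(hosts: list[str], peer_port: int = 2888, election_port: int = 3888) -> list[str]:
    """Render the server entries of a zoo.cfg file, numbering servers from 1."""
    return [
        f"server.{server_id}={host}:{peer_port}:{election_port}"
        for server_id, host in enumerate(hosts, start=1)
    ]


# Hypothetical sandbox hosts for a three-node ensemble.
hosts = ["zk1.sandbox.internal", "zk2.sandbox.internal", "zk3.sandbox.internal"]
lines = ensemble_server_lines(hosts)
print("\n".join(lines))
print(f"{len(hosts)} nodes tolerate {tolerated_failures(len(hosts))} failure(s)")
```

A standalone node tolerates no failures at all, while a three-node ensemble keeps working with one node down, which is the behavior worth confirming in a sandbox before relying on it in production.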
Conclusion
By leveraging IaC, Etleap benefits in numerous ways. Hosting the infrastructure design in definition files and scripts ensures a consistent environment, where each node has exactly the desired configuration. This makes it easier and less risky to update many aspects of the infrastructure. Errors can be identified and fixed faster, or in the worst case, infrastructure can be reverted to the last functional configuration. Changes can be made quickly and with little effort, and we can easily scale by increasing the number of nodes or their size.