How Etleap automates its infrastructure process with Terraform & Ansible

Introduction

“Infrastructure as Code” (IaC) is a term every system administrator has heard by now. We can think of it as the process of managing and provisioning IT infrastructure through source code instead of performing tasks manually. As we will explore, this helps DevOps teams efficiently and safely adapt infrastructure to meet the ever-changing requirements of the business, and so better serve the organization.

How can this paradigm help you? It encourages the adoption of software development practices such as keeping infrastructure definitions and configuration scripts in a source control system, testing code automatically, and doing peer reviews. These practices have a long track record in software development, and infrastructure management benefits from them in the same ways.

If you’re starting your journey into IaC, there are many resources you can reference to familiarize yourself with the concepts and terminology associated with this approach. Kief Morris’ “Infrastructure as Code: Managing Servers in the Cloud” is an essential book on the topic (alternatively, Martin Fowler’s blog gives a great overview).

At Etleap, we embrace IaC to build and improve our service every day. This practice helps us in our ongoing effort to make Etleap the best ETL platform it can be.

“IaC makes it possible to effortlessly and reliably spin up any element of an infrastructure at any time, or even the entire infrastructure, in a matter of minutes.”

Let’s take a look at a few examples of how using IaC has helped Etleap build a better product.

Service uptime and disaster recovery

One advantage of IaC is that it makes it possible to effortlessly and reliably spin up any element of an infrastructure at any time, or even the entire infrastructure, in a matter of minutes. The new infrastructure will be consistent with the previous one, which is to say that its software and configuration are the same (every security patch is applied, OS is configured the same way, allocated resources are identical).

Imagine a scenario where extreme weather or a natural disaster destroys the data centers where Etleap is hosted. For obvious reasons, it’s vital that we have a plan to recover from such an ordeal. Using IaC, we’re able to easily and reliably reproduce the entire infrastructure needed by Etleap and get it running in a new data center in short order. And so, even in this extreme case we’re able to recover from a service disruption incredibly quickly.

“With IaC tools available, almost every aspect of an infrastructure’s configuration can be defined in a configuration file or scripted.”

Another common issue is configuration drift, which is a major concern for services that must ensure high availability and disaster recovery strategies. If left unchecked, configuration drift increases the risk of prolonged outages or loss of data. By making sure every change introduced to the infrastructure configuration is done through the definition files or scripts, we can totally eliminate configuration drift. This way, we reduce the risk of having misconfiguration issues when we need to re-provision our infrastructure.

Finally, to keep Etleap up and running at all times, we should be able to add more resources or replace an unhealthy component at any time. Let’s imagine that a server instance stops serving requests because it’s running out of memory. In this case we should be able to provision a new server, with more memory, and redirect the traffic to it. Etleap has dealt with a similar challenge where we encountered memory shortages when running an Amazon Elastic MapReduce cluster. After EMR had become unhealthy, we traced the root cause to memory degradation. But because the EMR cluster provisioning and configuration was scripted, it was straightforward to update the configuration and start a new cluster and point Etleap to it after it launched, with zero downtime for our users.

Improved monitoring, highly secure

With IaC tools available, almost every aspect of an infrastructure’s configuration can be defined in a configuration file or scripted. Not only physical hardware, networks, and storage, but also identity access management (IAM), monitoring, alarm systems, and much more.

Going back to our example of a server running out of memory: when things go sideways it’s essential to have a monitoring system that alerts us to these issues so we can avoid service outages. If we know a certain node is going into a bad state, we can take the needed action to improve its behavior or, in the worst case, replace the node outright. This way, we’re usually able to resolve the issue before our customers notice any problems or downtime. It also makes a lot of sense to have the definition of these alarms tied to the infrastructure they’re monitoring: any time infrastructure changes, its monitoring is updated as well.

IAM is hugely important when it comes to security. Meticulously defining the right access levels and ingress rules to different parts of the infrastructure is crucial for data and system protection. By restricting access to production servers we can prevent unauthorized persons from gaining access to sensitive data. Finally, audits and reviews of the configuration and any changes allow us to maintain the right access at all times. 

Etleap productization

At Etleap, IaC practices enable a repeatable deployment process. Each time we provision our infrastructure the result is a known quantity, and that’s something we take advantage of in multiple ways.

Etleap is SaaS, meaning our product runs in the cloud and our users don’t need to install or maintain anything to start using it. However, some of our customers, especially those with strict security requirements, require that Etleap runs in an isolated AWS VPC. Embracing IaC helps us efficiently deploy Etleap to a completely new environment. The installation process is well-defined and tested, and is a daily occurrence for us. This allows us to ensure that Etleap running in one environment will behave identically to another instance running in a different environment, which saves time when identifying issues and reduces the need for customers to contact the support team. Thinking of infrastructure as a product itself gives Etleap a competitive advantage, as it allows us to serve customers with complex security requirements.

“IaC not only helps manage production environments but the entire software development lifecycle.”

Running identical instances of Etleap in multiple environments also simplifies updates. For example, diagnosing and fixing a bug for a user running Etleap in their own VPC would be really challenging if the environments differed from one another. By ensuring parity between all environments where Etleap is deployed, we eliminate this potential headache.

Streamline development and delivery cycle

IaC not only helps manage production environments but the entire software development lifecycle. During development, we can provision an isolated sandbox environment to safely make changes without the risk of breaking something. We can test new changes against our sandbox environment to more quickly detect if they would negatively affect the production environment when deployed. Having each new feature or bug fix properly tested during development reduces the risk of introducing issues when changes are rolled out. Once thoroughly tested, changes are automatically deployed in a CI/CD process: any new feature or bug fix is rolled out to our users as soon as it is merged into the master branch.

For example, some time ago I was tasked with improving our validation process for users wanting to add or edit an S3 data lake or S3 input connection. One of our goals was to give the user more accurate information about misconfiguration problems with their connections. In both cases, most of these configuration issues were related to incorrect policies being attached to a given IAM user. It would have been quite tedious to set up all these cases manually through the AWS console. Instead, we were able to quickly and easily script the policies that matched the cases we wanted to test and roll them out to the sandbox environment.

Another case where we took advantage of our ability to effortlessly provision a sandbox environment during development was when we improved our ZooKeeper cluster. We switched from having a standalone ZooKeeper node to an ensemble of nodes. We scripted the cluster configuration and provisioned it in a sandbox environment. This way, we could test that the cluster was working as expected. We were also able to stress test the cluster to see how it behaved. There were some questions we wanted to answer before rolling it out, like: how well does the cluster behave when nodes are disconnected? Are new nodes automatically incorporated into the cluster? Will the master node switch to another node when it becomes unhealthy? We tested each of these scenarios in the safety of our sandbox environment without affecting production. When we finally rolled the new ZooKeeper cluster out, we could rest easy that it would work as expected, as we’d already tested against many of the possible points of failure during development.

Conclusion

By leveraging IaC, Etleap benefits in numerous ways. Hosting the infrastructure design in definition files and scripts ensures a consistent environment, where each node has exactly the desired configuration. This makes it easier and less risky to update many aspects of the infrastructure. Errors can be identified and fixed faster, or in the worst case, infrastructure can be reverted to the last functional configuration. Changes can be made quickly and with little effort, and we can easily scale by increasing the number of nodes or their size.

What is the “length” of a string?

Finding the length of a string in JavaScript is simple: you use the .length property and that’s it, right?

Not so fast. The “length” of a string may not be exactly what you expect. It turns out that the string length property is the number of code units in the string, and not the number of characters (or more specifically graphemes) as we might expect. For example, “😃” has a length of 2, and “👱‍♂️” has a length of 5!

Screenshot from Etleap’s data wrangler where the column width depends on the column contents.

In our application we have a data wrangler that lets you view a sample of your data in a tabular format. Since this table supports infinite scrolling, both rows and columns are rendered on demand as you scroll vertically or horizontally. We can’t render all the rows and columns at once since a table could easily include more than a hundred thousand cells, which would bring the browser to its knees.

“The ‘length’ of a string may not be exactly what you expect.”

Imagine that most rows of a column contain a small amount of data, such as a single word, but a single row contains more data, such as a sentence. If this row is outside of the currently viewed area we don’t want the column to expand as you scroll down, and we definitely don’t want to cram the sentence into the same small space that’s required by the word. This means that we need to find the widest cell in the column before rendering all the cells. It’s fast and straightforward to find the length of the content in each cell. But what if the cell contains emojis or other content where we can’t rely on the length property to give us an accurate value?

Code units vs. code points

Let’s do a quick Unicode recap. Each character in Unicode is identified by a unique code point, a number between 0 and 10FFFF (hexadecimal). Unfortunately, 10FFFF is a large number, and storing every code point at a fixed width would in practice take 4 bytes per character. To avoid allocating 4 bytes for each character, Unicode also specifies several encodings, including UTF-16, which is the internal string encoding used by JavaScript.

UTF-16 is a variable-length encoding, which means that it uses either 2 or 4 bytes for each code point depending on what is required. To differentiate, we say that UTF-16 uses one or two code units to represent one Unicode code point. The most used characters all fit into one code unit; however, some of the more exotic characters, such as most emojis, require two code units.

“It turns out that code points are not the only caveat regarding string lengths in JavaScript.”

This is where a problem arises. Since the .length property returns the number of code units, and not the number of code points, it does not directly map to what you may expect. As an example, the emoji “😃” (code point U+1F603) has a length of 2, even though it looks like only one character.

How can we work around this? ES2015 introduced a way of splitting a string into its respective code points by providing a string iterator. Both Array.from and the spread operator [...string] use this iterator internally, so either can be used to get the length of a string in code points.
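As a minimal sketch of the difference, the snippet below compares .length with the code point count obtained through the string iterator. The helper name countCodePoints is made up for this illustration and is not part of any library.

```javascript
// "😃" is U+1F603, which lies outside the Basic Multilingual Plane,
// so UTF-16 stores it as a surrogate pair (two code units).
const smiley = "\u{1F603}";

console.log(smiley.length);             // 2 (code units)
console.log([...smiley].length);        // 1 (code points)
console.log(Array.from(smiley).length); // 1 (code points)

// Hypothetical helper: counts code points without building an array.
function countCodePoints(str) {
  let count = 0;
  for (const _codePoint of str) {
    count++; // the string iterator yields one code point per step
  }
  return count;
}
```

Note that this only counts code points; a sequence made of several code points that renders as one glyph is still counted several times.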

Combining Characters

It turns out that code points are not the only caveat regarding string lengths in JavaScript. Another is combining characters. A combining character is a character that doesn’t stand on its own, but rather modifies the other characters around it. Unicode supports this, meaning that a character such as “è” can actually be made up of two code points, “e” and “\u0300”. A similar mechanism is widely used to combine emojis into a new representation, such as “👱‍♂️”, which is a combination of “👱” and “♂” with a zero width joiner (\u200D) in between.

Working around this is more complicated. When this post was written, there was no built-in way of reliably counting graphemes in JavaScript, and Intl.Segmenter was still a stage 2 proposal. It has since been standardized as part of ECMA-402 and is available in current browsers and Node.js. It splits a string into graphemes, which can then be counted (there’s a polyfill for older environments if you’re desperate).
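For runtimes that ship Intl.Segmenter (an assumption about your environment; older engines need a polyfill), counting graphemes could look roughly like the sketch below. The function name countGraphemes and the default locale "en" are choices made for this example only.

```javascript
// Hypothetical helper: counts user-perceived characters (graphemes).
// Assumes Intl.Segmenter is available in the runtime.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++; // one iteration per grapheme cluster
  }
  return count;
}

const accented = "e\u0300";                    // "e" + combining grave accent
const blondMan = "\u{1F471}\u200D\u2642\uFE0F"; // person + ZWJ + male sign + variation selector

console.log(accented.length, countGraphemes(accented));
console.log(blondMan.length, countGraphemes(blondMan));
```

The exact clustering depends on the Unicode data bundled with the engine, so very new emoji sequences may be split differently on older runtimes.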

Environment Specific Differences

Did you know there’s a ninja cat emoji? Neither did we, because it’s a Windows-only emoji! It’s represented by a combination of “🐱” and “👤”. This means that Windows users will see this combination as one character, while other users will see it as two characters. Depending on the user’s choice of fonts, they could even see something completely different. You could try to prevent this issue by choosing a specific font for your web app, but that won’t be sufficient, as the browser will still search through other fonts on the system if a character is not available in your chosen font.

“The various environment-specific differences mean that there’s generally no way of measuring the rendered width of a string mathematically.”

Checkmate?

The various environment-specific differences mean that there’s generally no way of measuring the rendered width of a string mathematically. Therefore, the only way to determine the pixel length is to render it and measure. For our use case in the wrangler, this is exactly what we wanted to avoid in the first place. However, there are some optimizations that we can make.

Instead of rendering all the strings in each column, we can split the strings into their corresponding graphemes and render them individually. This allows us to cache the pixel length of each grapheme we encounter. Since there are substantially fewer graphemes than unique strings in a table, this results in a significant reduction in total rendering. This way we can easily determine the correct width of a column, all while keeping the scrolling snappy and your browser happy.
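The caching idea can be sketched as follows. This is not Etleap’s actual implementation: the names createColumnWidthEstimator and measureGrapheme are hypothetical, the measuring function is injected so that any render-and-measure strategy can be plugged in, and the sketch assumes Intl.Segmenter is available for splitting strings into graphemes. Summing per-grapheme widths also ignores kerning and ligatures, so treat the result as an estimate.

```javascript
// Hypothetical sketch: estimate cell widths by measuring each distinct
// grapheme once and caching the result.
// `measureGrapheme` is an assumed callback that renders one grapheme and
// returns its width in pixels (for example by measuring an off-screen element).
function createColumnWidthEstimator(measureGrapheme) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const widthCache = new Map(); // grapheme -> measured pixel width

  function widthOf(text) {
    let total = 0;
    for (const { segment } of segmenter.segment(text)) {
      let width = widthCache.get(segment);
      if (width === undefined) {
        width = measureGrapheme(segment); // the expensive render happens only here
        widthCache.set(segment, width);
      }
      total += width;
    }
    return total;
  }

  return {
    widthOf,
    // Width of the widest cell in a column, given the cell contents as strings.
    widestCell: (cells) => cells.reduce((max, cell) => Math.max(max, widthOf(cell)), 0),
    cacheSize: () => widthCache.size,
  };
}
```

In a browser, measureGrapheme might wrap a hidden element or a canvas text measurement; the estimator itself does not care how the number is produced.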

Building ETL Infrastructure that Analysts Love

This recorded session is from DataEngConf NYC 17. Slides are available on the event page.

There’s an often-quoted statistic that says that data analysts spend 80% of their time preparing data and only 20% actually analyzing it. There’s a lot that we as data engineers can do to help our analytics teams be more productive and spend less time worrying about data preparation. This session discusses common problems in data warehousing infrastructure from the point of view of analytics teams, and suggests practical solutions.

Watch the session video or read the key takeaways below.

SVG in React

React.js is a great library for creating user interfaces consisting of components. In the browser React is used to output DOM elements like divs, sections and... SVG! The DOM supports SVG elements, so there is nothing stopping us from outputting them inline directly with React. This allows for easy creation of SVG components that are updated with props and state just like any other component.

Why SVG?

Even though a lot is possible with plain CSS, creating complex shapes like hearts or elephants is very difficult and requires a lot of code. This is because you are restricted to a limited set of primitive shapes that you have to combine to create more complex ones. SVG, on the other hand, is an image format and allows you a lot more flexibility in creating custom paths. This makes it much easier to create complex shapes, as you are free to create any shape you want. If you need convincing, check out these slides from Sara Soueidan’s great talk about SVG UI components.

Our use

At Etleap we have used React with SVG output in some of our graphical components. A great example of this is our circular progress bar.

Circular progress bar used on our dashboard.

This component uses SVG to display the circular progress bar and works just like any other React component. It accepts a few props, including the percentage value to display, and updates whenever new props are received. The reason we opted for SVG in this case was that creating a circular progress bar in CSS is tricky. Using SVG for this was much more appropriate, and it was straightforward to use React to output the SVG markup directly to the DOM. Let’s compare the two approaches.

SVG Progress Bar

The essential SVG markup required to render the progress bar is very simple:


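<svg>
  {/* Rotate by -90 degrees so the stroke starts at the top instead of at 3 o'clock */}
  <g transform="rotate(-90 100 100)">
    {/* Dark background ring */}
    <circle className="ProgressBarCircular-bar-background" r={radius} cx={posX} cy={posY} />
    {/* Lighter progress ring; React spells the attribute strokeDasharray */}
    <circle className="ProgressBarCircular-bar" strokeDasharray={strokeDashArray} strokeLinecap="round" r={radius} cx={posX} cy={posY} />
  </g>
</svg>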

We need two circles, one for the dark background, and one for the lighter progress display. The circles have a transparent fill, and their strokes show the progress and the background. To show the amount of progress we use a dashed outline for the circle. If the space between the first and second dash is at least the length of the circumference of the circle, only one dash will be shown, and we can manipulate the length of that dash to show the current progress. We use stroke-dasharray to specify the length of each dash and the distance between dashes, and stroke-linecap: round to get rounded ends.
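The dash arithmetic described above can be sketched as a small helper. The function name progressDashArray is hypothetical, and the original component may compute the value differently; the sketch only assumes a radius in SVG user units and a percentage between 0 and 100.

```javascript
// Hypothetical helper: builds a stroke-dasharray value for a circular
// progress bar with the given radius and percentage (0 to 100).
function progressDashArray(radius, percentage) {
  const circumference = 2 * Math.PI * radius;
  const clamped = Math.min(100, Math.max(0, percentage));
  const dashLength = (clamped / 100) * circumference;
  // A gap as long as the full circumference guarantees that only the
  // first dash is ever visible on the circle.
  return `${dashLength} ${circumference}`;
}
```

The returned string could then be passed as the strokeDasharray prop of the progress circle.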

 

CSS Progress Bar

Let’s have a look at how we can create a similar progress bar in CSS:

Since CSS supports neither stroke-dasharray nor stroke-linecap, we are immediately at a disadvantage, so let’s simplify the problem and start by creating a pie chart. We create two circles here as well, one for the background and one for the progress bar. To display progress we need to be able to cut away part of the circle, so that we are left with a pie slice. To make this happen we can use the CSS clip property (unfortunately it has been deprecated, and the replacement clip-path has very poor browser support). This enables us to define a rectangular mask for the circle so that we can hide parts of it. The problem is that this only works for a maximum of 50% at a time, so we actually need two of these, one for the right and one for the left... As you can see, this is already getting pretty complicated, and we have not even looked at how to handle the rounded edges. So to prevent doubling the length of this post we’ll stop here. If you are interested in the full solution (without rounded edges), check out this post by Anders Ingemann.

When to use SVG

SVG should not be a replacement for all graphical user elements, but can be used to more easily achieve tricky UI effects where CSS falls short. The most important difference is that SVG supports custom paths. This means you can create any complex shape you want and easily display it, or use it as a mask. This is especially relevant in scenarios involving charts or line drawings. Other interesting features that CSS lacks include drawing text along a path, animating paths, and support for a bunch of filters. That being said, CSS is catching up with SVG and has gained support for several filters, masks, and even custom clip paths. For now though, if your designer has created some truly fancy UI effect that you instinctively dismiss as impossible, perhaps it is a good time to look into SVG and make it a reality after all.

Reducing the size of your Webpack bundle

To ensure a great user experience it is important to keep the initial page load as fast as possible. There are two main ways of doing this: one is to reduce the number of file requests made when the site is loading, and the other is to reduce the size of the files. To automate this it is common to use a tool that combines all your JavaScript into one minified bundle file.


Generating password reset tokens

There are a few requirements for a good password reset token:

  1. the user should be able to reset their password with the token they receive in an email
  2. the token should not be guessable
  3. the token should expire
  4. the user should not be able to re-use the token

Ideally, the web framework of your choice should already have a built-in way to generate reset tokens. However, we use Play and it does not provide a way to do that, so we have to roll our own.
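To make the four requirements concrete, here is a minimal sketch in JavaScript for Node.js. It is not the Play-based implementation from the full post: the names createResetTokenStore, issue, and redeem are hypothetical, the one-hour lifetime is an arbitrary assumption, and the in-memory Map stands in for whatever database a real application would use.

```javascript
const crypto = require("node:crypto");

// Hypothetical in-memory store; a real application would persist this.
function createResetTokenStore(ttlMs = 60 * 60 * 1000) {
  // Only a hash of each token is stored, so a leaked store does not
  // reveal usable tokens. Maps sha256(token) -> { userId, expiresAt }.
  const pending = new Map();
  const hash = (token) => crypto.createHash("sha256").update(token).digest("hex");

  return {
    // Requirements 1 and 2: a random, unguessable token to send by email.
    issue(userId, now = Date.now()) {
      const token = crypto.randomBytes(32).toString("hex");
      pending.set(hash(token), { userId, expiresAt: now + ttlMs });
      return token;
    },
    // Returns the user id for a valid token, or null otherwise.
    redeem(token, now = Date.now()) {
      if (typeof token !== "string") return null;
      const key = hash(token);
      const entry = pending.get(key);
      if (!entry) return null;
      pending.delete(key); // requirement 4: a token works at most once
      return now <= entry.expiresAt ? entry.userId : null; // requirement 3
    },
  };
}
```

A caller would email the value returned by issue and call redeem when the user submits the reset form; any null result is treated as an invalid or expired link.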


Distributed CSV Parsing

tl;dr: This post is about how to split and process CSV files in pieces! Newline characters within fields make it tricky, but with the help of a finite-state machine it’s possible to work around that in most real-world cases.

Comma-separated values (CSV) is perhaps the world’s most common data exchange format. It’s human-readable, it’s compact, and it’s supported by pretty much any application that ingests data. At Etleap we frequently encounter really big CSV files that would take a long time to process sequentially. Since we want our clients’ data pipelines to have minimal latency, we split these files into pieces and process them in a distributed fashion.
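To illustrate why newlines inside fields matter, the sketch below uses a two-state machine (inside or outside a quoted field) to find the offsets where a CSV text can safely be cut. It is a simplified illustration and not the approach from the full post: it scans from the start of the file, whereas splitting at arbitrary offsets in a distributed setting also has to work out the state at the start of each piece. The function name findRecordBoundaries is made up, and the input is assumed to use double quotes with "" as the escape.

```javascript
// Hypothetical helper: returns the offsets just past every newline that
// ends a record, skipping newlines that sit inside quoted fields.
function findRecordBoundaries(csv) {
  const boundaries = [];
  let insideQuotes = false; // the state of the finite-state machine

  for (let i = 0; i < csv.length; i++) {
    const ch = csv[i];
    if (ch === '"') {
      // An escaped quote ("") toggles the state twice, leaving it unchanged.
      insideQuotes = !insideQuotes;
    } else if (ch === "\n" && !insideQuotes) {
      boundaries.push(i + 1); // safe place to split
    }
  }
  return boundaries;
}

// Splits the text into whole records using the boundaries found above.
function splitRecords(csv) {
  const records = [];
  let start = 0;
  for (const end of findRecordBoundaries(csv)) {
    records.push(csv.slice(start, end));
    start = end;
  }
  if (start < csv.length) records.push(csv.slice(start)); // trailing record without newline
  return records;
}
```

A naive split on every newline would cut the second record of a file like the one in the test in half, which is exactly the problem the state machine avoids.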


Typescript at Etleap

TypeScript has been getting significant attention in the past year, and with over 2 million downloads per month on npm, there has undoubtedly been an increase in adoption. However, many people are still unsure if TypeScript will benefit their project, and there are few resources that show how TypeScript can be used in large projects and what the practical benefits are. In this post we aim to highlight how we use TypeScript at Etleap so that people can get an impression of why we decided to use it and how we benefit from it.
