Run Etleap Inside Your VPC

Today is a big day for the whole team here at Etleap! We’re announcing that we’re making Etleap VPC available on AWS Marketplace, which means customers can run Etleap inside their own AWS Virtual Private Cloud (VPC) with just a few clicks. This marks the beginning of an era where we’re not just responsible for the success of customers running pipelines on our hosted infrastructure, but also of those running Etleap pipelines on their own AWS infrastructure. This product launch has been in the making for many months, and I am very proud of what our team has accomplished!

So what is Etleap VPC and why did we build it? First I’ll make the case for the VPC SaaS model being the future of enterprise ETL. Then I’ll highlight some of the reasons why being hosted SaaS first has been tremendously beneficial to Etleap’s development as a company.

How did we get here?

Every organization that has a data warehouse or lake has to deal with ETL – after all, what good is a fancy data repository without up-to-date and clean data in it? Traditionally, ETL has been expensive and time-consuming. Learning complex ETL software, building Kimball modeling processes, and setting up compute clusters dedicated to ETL led to projects measured in months or years with 7-figure budgets. And that’s before the operational costs of handling changes and errors.

Enter the cloud. Data warehouses like Amazon Redshift and Snowflake have taken advantage of cloud computing primitives to deliver vastly superior user experiences, offering scalability and flexibility at a fraction of the cost of traditional data warehouses. Today, cloud-native ETL products tailored to these technologies and to modern data teams’ workflows are starting to eliminate the headaches of traditional ETL projects.

Advantages of Cloud-Native, Managed SaaS ETL

What makes cloud-native ETL different? Some products, like Etleap’s ETL solution, are easy to learn and let data engineers create ETL pipelines that are fully managed. This means that customers don’t need a dedicated ops team to operate the hardware the ETL software runs on, or to manage pipelines and fix errors. Pipelines can be set up quickly from any data source, and the transformations that make the data useful can be defined without coding.

This is great news for data teams, because they can avoid hiring engineers dedicated to ETL, and they can get up and running in days or weeks instead of months or years. Teams often see an order-of-magnitude gain in data team productivity as a result. At the same time, it means that a tremendous amount of trust is placed in the product’s data security and operation by the customer.

Towards the Virtual Private Cloud (VPC)

A question we have been asked many times over the years by our customers is whether we can operate inside their own Virtual Private Cloud (VPC). The security benefit to the customer is that data doesn’t flow outside their VPC on the way from their source to the warehouse or lake, and they have more direct control over infrastructure and data access policies. While hosted Etleap uses only S3 buckets owned by the customer for intermediate data storage, data passes through servers managed by Etleap for processing. For the most privacy-sensitive companies, this is a non-starter. In order to adopt the cloud, it is an absolute requirement that their data remains tightly controlled inside their VPC.

While a large segment of the market is happy to adopt, and even prefers, hosted ETL services, it is now clear to us that enterprise companies are often not, and probably never will be. We asked ourselves what this meant for our offering, and the answer we have come up with is that running Etleap inside one’s own VPC should “feel” just like using the hosted product; hiring ETL engineers should not be necessary and pipelines should “just work”. This is of course easier said than done – without our experienced ops team having direct access to operate infrastructure and troubleshoot pipeline issues, how can we offer the same experience?

Through working with early adopters of our VPC offering, we have hardened the infrastructure components so that they are virtually management-free, and homed in on the operational metrics and de-identified logs that need to be shipped back to Etleap’s ops team, enabling them to offer proactive pipeline support and assist with reactive troubleshooting.

Setting up Etleap is designed to be fast and pain-free, and it was important to us to offer the same effortless experience to customers that want to run Etleap inside their own VPC. With AWS CloudFormation, the setup of infrastructure components is fully automated, including an automatically scaling AWS EMR cluster that runs extractions and transformations. Customers enter their name and email address, press “play” on the CloudFormation template, and minutes later they can set up Etleap pipelines in their browser.

Benefits of Being SaaS-First

Creating a robust ETL system is all about handling edge cases. It’s relatively straightforward to build a system that ingests files from an SFTP server or replication logs from a SQL database into a warehouse. The complexity is in how you handle issues like errors in the data, schemas that change in unpredictable ways, or ingesting incrementally from an S3 bucket that contains tens of millions of non-alphabetically ordered files. Our focus is on providing intuitive solutions to all these issues, while at the same time making the ETL software a pleasure to use.

Etleap’s hosted multi-tenant deployment is and will continue to be the biggest Etleap deployment in terms of data scale and number of users. This is a great benefit to customers running Etleap inside their own VPCs because of the continuous improvements we make to the scalability, usability, and flexibility of the solution.

Being SaaS-first has enabled us to work very closely with our customers. From the beginning, we set the high bar that pipelines should “just work”, no matter the source, quantity of data, or transformation complexity. Over the last 7 years, we have been fortunate to experience our customers’ new challenges first-hand every day, and as a result have been able to build a robust system that meets their needs. Today, we are thrilled to provide the same pain-free and effortless experience for customers that want to run Etleap inside their own VPC.


This blog post was written in partnership with the Amazon Redshift team, and also posted on the AWS Big Data Blog.

The materialized views feature in Amazon Redshift is now generally available and has been benefiting customers and partners in preview since December 2019. One customer, AXS, is a leading ticketing, data, and marketing solutions provider for live entertainment venues in the US, UK, Europe, and Japan. Etleap, an Amazon Redshift partner, is an extract, transform, load, and transform (ETLT) service built for AWS. AXS uses Etleap to ingest data into Amazon Redshift from a variety of sources, including file servers, Amazon S3, relational databases, and applications. These ingestion pipelines parse, structure, and load data into Amazon Redshift tables with appropriate column types and sort and distribution keys.

Improving dashboard performance with Etleap models

To analyze data, AXS typically runs queries against large tables that originate from multiple sources. One of the ways AXS uses Amazon Redshift is to power interactive dashboards. To achieve fast dashboard load times, AXS pre-computes partial answers to the queries the dashboards use. These partial answers are orders of magnitude smaller, in terms of the number of rows, than the tables on which they are based. By querying Amazon Redshift tables that hold the pre-computed partial answers, dashboards load much faster than they would if they queried the base tables directly.

Etleap supports creating and managing such pre-computations through a feature called models. A model consists of a SELECT query and triggers for when it should be updated. An example of a trigger is a change to a base table, that is, a table used by the SELECT query that defines the model. This way, the model can remain consistent with its base tables.
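As an illustration, a model and its update triggers could be sketched like this in Python (the names and structure here are hypothetical, not Etleap’s actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Model:
    """A pre-computation defined by a SELECT query plus update triggers."""
    name: str
    select_query: str
    base_tables: set = field(default_factory=set)

    def should_update(self, changed_table: str) -> bool:
        # A change to any base table triggers a refresh of the model.
        return changed_table in self.base_tables

model = Model(
    name="ticket_sales_daily",
    select_query="SELECT venue_id, sale_date, SUM(amount) FROM sales GROUP BY 1, 2",
    base_tables={"sales", "venues"},
)
print(model.should_update("sales"))   # a base-table change triggers an update
print(model.should_update("events"))  # unrelated tables do not
```

A scheduler would combine this with time-based triggers to decide when each model table needs recomputing.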

The following screenshot shows an Etleap model with two base table dependencies.

Etleap represents its models as tables in Amazon Redshift. To create the model table, Etleap wraps the SELECT statement in a CREATE TABLE AS (CTAS) query. When an update is triggered, for example by base table inserts, updates, or deletes, Etleap recomputes the model table through the following code:

CREATE TABLE model_temporary AS SELECT …;
DROP TABLE model;
ALTER TABLE model_temporary RENAME TO model;

Analyzing CTAS performance as data grows

AXS manages a large number of Etleap models. For one particular model, the CTAS query takes over 6 minutes, on average. This query performs an aggregation on a join of three different tables, including an event table that is constantly ingesting new data and contains over a billion rows. The following graph shows that the CTAS query time increases as the event table increases in number of rows.

There are two key problems with the query taking longer:

  • There’s a longer delay before the updated model is available to analysts
  • The model update consumes more Amazon Redshift cluster resources

To address this, AXS would have to resort to workarounds that are either inconvenient or costly, such as archiving older data from the event table or expanding the Amazon Redshift cluster to increase available resources.

Comparing CTAS to materialized views

Etleap decided to run an experiment to verify that Amazon Redshift’s materialized views feature is an improvement over the CTAS approach for this AXS model. First, they built the materialized view by wrapping the SELECT statement in a CREATE MATERIALIZED VIEW AS query. For updates, instead of recreating the materialized view every time that data in a base table changes, a REFRESH MATERIALIZED VIEW query is sufficient. The expectation was that using materialized views would be significantly faster than the CTAS-based procedure. The following graph compares query times of CTAS to materialized view refresh.

Running REFRESH MATERIALIZED VIEW was 7.9 times faster than the CTAS approach—it took 49 seconds instead of 371 seconds on average at the current scale. Additionally, the update time was roughly proportional to the number of rows that were added to the base table since the last update, rather than the total size of the base table. In this use case, this number is 3.8 million, which corresponds to the approximate number of events ingested per day.

This is great news. The solution solves the previous problems: the delay caused by a model update stays constant as new data comes in, and so do the resources that Amazon Redshift consumes (assuming the growth of the base table is constant). In other words, using materialized views eliminates the need for workarounds, such as archiving or cluster expansion, as the dataset grows. It also simplifies the refresh procedure for model updates by reducing the number of SQL statements from three (CREATE, DROP, and RENAME) to one (REFRESH).

Achieving fast refresh performance with materialized views

Amazon Redshift can refresh a materialized view efficiently and incrementally. It keeps track of the last transaction in the base tables up to which the materialized view was previously refreshed. During subsequent refreshes, Amazon Redshift processes only the newly inserted, updated, or deleted tuples in the base tables, referred to as a delta, to bring the materialized view up-to-date with its base tables. In other words, Amazon Redshift can incrementally maintain the materialized view by reading only base table deltas, which leads to faster refresh times.
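The idea behind delta-based maintenance can be sketched with a toy SUM aggregate in Python (a simplified model of the concept, not Redshift’s actual mechanism):

```python
from collections import defaultdict

# Toy materialized view for: SELECT key, SUM(val) FROM base GROUP BY key.
# Instead of recomputing from the full base table, each refresh reads only
# the delta: rows inserted into or deleted from the base table since the
# last refresh.
class IncrementalSumView:
    def __init__(self):
        self.sums = defaultdict(int)

    def apply_delta(self, inserted, deleted):
        for key, val in inserted:
            self.sums[key] += val
        for key, val in deleted:
            self.sums[key] -= val

view = IncrementalSumView()
# Initial load.
view.apply_delta(inserted=[("a", 10), ("b", 5), ("a", 3)], deleted=[])
# Next refresh reads only the small delta, not the whole base table.
view.apply_delta(inserted=[("b", 2)], deleted=[("a", 3)])
print(dict(view.sums))  # {'a': 10, 'b': 7}
```

The cost of each refresh is proportional to the delta size, which is why refresh times stay flat even as the base table grows to billions of rows.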

For AXS, Amazon Redshift analyzed their materialized view definitions, which join multiple tables, filter, and aggregate, to figure out how to incrementally maintain each specific materialized view. Each time AXS refreshes a materialized view, Amazon Redshift quickly determines whether a refresh is needed, and if so, incrementally maintains it. As records are ingested into the base table, the materialized view refresh times shown are much faster and grow very slowly, because each refresh reads a delta that is small and roughly the same size as the previous deltas. In comparison, refresh times using CTAS are much slower, because each refresh reads all the base tables, and they grow much faster, because the amount of data each refresh reads grows with the ingest rate.

You are in full control of when to refresh your materialized views. For example, AXS refreshes their materialized views based on triggers defined in Etleap. As a result, transactions that are run on base tables do not incur additional cost to maintain dependent materialized views. Decoupling the base tables’ updates from the materialized view’s refresh gives AXS an easy way to insulate their dashboard users and offers them a well-defined snapshot to query, while ingesting new data into base tables. When AXS vets the next batch of base table data via their ETL pipelines, they can refresh their materialized views to offer the next snapshot of dashboard results.

In addition to efficiently maintaining their materialized views, AXS also benefits from the simplicity of Amazon Redshift storing each materialized view as a plain table. Queries on the materialized view perform with the same world-class speed that Amazon Redshift runs any query. You can organize a materialized view like other tables, which means that you can exploit distribution key and sort columns to further improve query performance. Finally, when you need to process many queries at peak times, Amazon Redshift’s concurrency scaling kicks in automatically to elastically scale query processing capacity.


Now that the materialized views feature is generally available, Etleap gives you the option of using materialized views rather than tables when creating models. You can use models more actively as part of your ETLT strategies, and also choose more frequent update schedules for your models, due to the performance benefits of incremental refreshes.

For more information about Amazon Redshift materialized views, see Materialize your Amazon Redshift Views to Speed Up Query Execution and Creating Materialized Views in Amazon Redshift.

by Christian Romming, Prasad Varakur (AWS), and Vuk Ercegovac (AWS)

High Pipeline Latency Incident Post-Mortem

Between 15:30 UTC on 8/27 and 14:00 UTC on 8/29 we experienced periods of higher-than-usual pipeline latencies. Between 04:00 and 10:00 UTC on 8/29 most pipelines were completely stopped. At Etleap we want to be transparent about system issues that affect customers, so this post summarizes the timeline of the incident, our team’s response, and what we are doing to prevent a similar incident from happening again.

Number of users with at least one pipeline with higher-than-normal latency.

What happened and what was the impact?

At around 11:30 UTC on 8/27 our ops team was alerted about spikes in two different metrics: CPU of a Zookeeper node and stop-the-world garbage collection (STW GC) time in a Java process responsible for orchestrating certain ETL activities. The two processes were running in different Docker containers on the same host. From this point onwards we saw intermittent spikes in both metrics and periods of downtime of the orchestration process, until the final fix was put in place at 14:00 UTC on 8/29. Additionally, at 15:30 UTC on 8/27 we received the first alert regarding high pipeline latencies. There were intermittent periods of high latency until 10:00 UTC on 8/29.

Incident Response

When our ops team received the first alert they followed our incident response playbook in order to diagnose the problem. It includes checking on potential causes such as spikes in usage, recently deployed changes, and infrastructure component health. The team determined that the issue had to do with the component that sets up source extraction activities, but found no other correlations. Suspecting an external change related to a pipeline source was leading to the increased garbage collection activity, they went on to attempt to narrow down the problem in terms of dimensions such as source, source type, and customer. Etleap uses a Zookeeper cluster for things like interprocess locking and rate limiting, and the theory was that a misbehaving pipeline source was causing the extraction logic to put a significant amount of additional load on the Zookeeper process, while at the same time causing memory pressure within the process itself. However, after an exhaustive search it was determined that the problem could not be attributed to a single source or customer. Also, memory analysis of the Java process with garbage collection issues showed nothing out of the ordinary.

The Culprit

Next, the team looked at the memory situation for the host itself. While each process was running within its defined memory bounds, we found that in aggregate the processes’ memory usage exceeded the amount of physical memory available on the host. The host was configured with a swap space, and while this is often a good practice, it is not so for Zookeeper: by being forced to swap to disk, Zookeeper’s response times went up, leading to queued requests.

Stats show Zookeeper node in an unhealthy state.

In other words, the fact that we had incrementally crossed an overall physical memory limit on this host caused a dramatic degradation of Zookeeper’s performance, which in turn resulted in increased garbage collection time in a client process. The immediate solution was to increase the physical memory on this host, which brought the Zookeeper stats back to normal levels (along with the CPU and STW GC metrics mentioned before).

Zookeeper back in a healthy state after memory increase.

Next steps

We are taking several steps to prevent a similar issue in the future. First, we are configuring Zookeeper not to use swap space. Second, we’re adding monitoring of the key Zookeeper stats, such as latency and outstanding connections. Third, we are adding monitoring of available host physical memory to make sure we know when pressure is getting high. Any of the three configuration and monitoring improvements in isolation would have led us to find the issue sooner, and all three will help prevent issues like this from happening in the first place.
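For the second step, Zookeeper exposes exactly these stats through its `mntr` four-letter command, which returns tab-separated key/value lines such as `zk_avg_latency` and `zk_outstanding_requests`. A small monitor could poll and parse them along these lines (a sketch; the alert thresholds are illustrative, not our production values):

```python
import socket

def fetch_mntr_stats(host="localhost", port=2181, timeout=5.0):
    """Send Zookeeper's 'mntr' four-letter command and return its stats."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    return parse_mntr(data.decode())

def parse_mntr(output):
    """Parse 'mntr' output (tab-separated key/value lines) into a dict."""
    stats = {}
    for line in output.splitlines():
        key, _, value = line.partition("\t")
        if key:
            stats[key] = value
    return stats

def check(stats, max_avg_latency_ms=50, max_outstanding=10):
    """Return alerts for the stats that would have exposed this incident early."""
    alerts = []
    if int(stats.get("zk_avg_latency", 0)) > max_avg_latency_ms:
        alerts.append("high average latency")
    if int(stats.get("zk_outstanding_requests", 0)) > max_outstanding:
        alerts.append("outstanding requests queueing up")
    return alerts
```

A monitor like this, run on a schedule, would have flagged the queued requests long before pipelines stalled.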

While it’s impossible to guarantee there will never be high latencies for some pipelines, periods of high latencies across the board are unacceptable. What made this incident particularly egregious was the fact that it went on for over 40 hours, and the whole Etleap team is sorry that this happened. The long resolution time was in large part because we didn’t have the appropriate monitoring to lead us towards the root cause, and we have learned from this and are putting more monitoring of key components in place going forward.

Etleap Launches Snowflake Integration

I am pleased to announce our integration with Snowflake. This is the second data warehouse we support, augmenting our existing Amazon Redshift data warehouse and our S3/Glue data lake offering. 

Etleap lets you integrate all your company’s data into Snowflake, and transform and model it as necessary. The result is clean and well-structured data in Snowflake that is ready for high-performance analytics. Unlike traditional ETL tools, Etleap does not require engineering effort to create, maintain, and scale. Etleap provides sophisticated data error handling and comprehensive monitoring capabilities. Because it is delivered as a service, there is no infrastructure to maintain.




Like any other pipeline set up in Etleap, pipelines to Snowflake can extract from any of Etleap’s supported sources, including databases, web services, file stores, and event streams. Using Etleap’s interactive data wrangler, users have full control over how data is cleaned, structured, and de-identified before it is loaded into Snowflake. From there, Etleap’s native integration with Snowflake is designed to maximize flexibility for users in specifying attributes such as Snowflake schemas, roles, and cluster keys. Once the data is loaded, Etleap’s SQL-based modeling features can be used to further improve the usability and performance of the data for analytics.

Not only does Etleap’s integration with Snowflake provide a seamless user experience, it is also a natural fit technically. Etleap is built on AWS and stores extracted and transformed data in S3. Since Snowflake stores data in S3, loading data into Snowflake is fast and efficient. Architecturally, part of what differentiates Snowflake is its separate, elastic scaling of compute and storage resources. Etleap is built on the same principle, thus enabling it to overcome traditional bottlenecks in ETL by scaling storage and compute resources for extraction and transformation separately and elastically. By taking advantage of AWS building blocks we are able to provide a powerful yet uncomplicated data analytics stack for our customers. 

Etleap is devoted to helping teams build data warehouses and data lakes on AWS, and we offer both hosted and in-VPC deployment options. Like Snowflake, Etleap takes advantage of AWS services such as S3 and EC2 to provide performance and cost benefits not possible with traditional ETL solutions.

As more and more teams building analytics infrastructure on AWS want to use Snowflake as their data warehouse, offering support for Snowflake was a natural next step for us. 

If you would like to explore building a Snowflake data warehouse with Etleap, you can sign up for a demo here.


New Features: Models and History Tables

I’m excited to tell you about two new features we’re launching today: Models and History Tables.



Etleap has long supported single-source transformations through data wrangling. This is great for cleaning, structuring, and filtering data, and for removing unwanted data, such as PII, before it is loaded to the destination. Today, we’re announcing the general availability of models, which enable transformations expressed as SQL queries. Two primary use cases for models are combining data from different sources to build data views optimized for analytics, and aggregating data to speed up analytics queries.

Etleap models are Redshift tables backed by SQL SELECT queries that you define, running against data that has been loaded to Redshift. Etleap creates tables that are the result of these SELECT queries, and updates these tables incrementally or through full refreshes. Etleap triggers updates based on changes to dependent tables, or on a schedule.




History Tables

For regular pipelines into Redshift, Etleap fetches new and updated records from the source. Following transformation, new rows are appended to the destination table, and updated rows are overwritten. This update strategy is known as type-1 Slowly Changing Dimensions in data warehouse speak.

Sometimes it’s useful to go back in time and query the past state of a record, or to investigate how a record has changed over time. Etleap now provides this through the ability to retain the history of a record collection, using the technique known as type-2 Slowly Changing Dimensions. Here’s how it works in Etleap: an end-date column is added to the table. When a record is first inserted into the destination table, this column’s value is null. When the record changes in the source, instead of overwriting the existing row in the destination table, Etleap appends a new row with a null end-date value, and sets the existing row’s end-date to the new record’s update timestamp.
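The append-and-end-date logic can be sketched in a few lines of Python (a simplified illustration of type-2 Slowly Changing Dimensions, not Etleap’s implementation):

```python
def apply_change(rows, new_record, key="id", end_col="end_date"):
    """Type-2 SCD update: close out the current row for this key and
    append the new version with a null (open-ended) end date."""
    for row in rows:
        if row[key] == new_record[key] and row[end_col] is None:
            # Existing current row: end-date it at the new record's update time.
            row[end_col] = new_record["updated_at"]
    rows.append({**new_record, end_col: None})

history = []
apply_change(history, {"id": 1, "status": "pending", "updated_at": "2019-01-01"})
apply_change(history, {"id": 1, "status": "shipped", "updated_at": "2019-01-05"})
# history now holds both versions of record 1: the "pending" row end-dated
# at 2019-01-05, and the current "shipped" row with a null end date.
```

Querying the state as of a given date then amounts to filtering for rows whose update timestamp is before that date and whose end date is null or after it.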

Starting today, history tables are available for all pipelines from sources that have a primary key and an update timestamp. To get a history table, check the ‘retain history’ box during single or batch pipeline setup.




Want to see these features in action? Request a demo here!

Scaling Etleap with funding from First Round Capital, SV Angel, and more

Today we’re excited to share that we’ve raised $1.5M from First Round Capital, SV Angel, Liquid2, BoxGroup, and others to continue to scale our enterprise-grade ETL solution for building and managing cloud data warehouses.

ETL has traditionally been associated with expensive projects that take months of custom development by specialized engineers. We started Etleap because we believe in a world where analytics teams manage their own data pipelines, and IT teams aren’t burdened with complex ETL infrastructure and tedious operations.

Etleap runs in the cloud and requires no engineering work to set up, maintain, and scale. It helps companies drastically lower the cost and complexity of their ETL solution and improve the usefulness of their data.

Over the past few years we’ve spent a lot of time with analytics teams in order to understand their challenges and have built features for integration, wrangling, and modeling. It’s a thrill to see data-driven customers, including Airtable, Okta, and AXS, use them. Their analytics teams are pushing the boundaries of what’s possible today, and we’re hard at work building features to help bring their productivity to new levels.



Curious how Etleap can solve your analytics infrastructure challenges? Click here to get a demo of Etleap!

Distributed CSV Parsing

tl;dr: This post is about how to split and process CSV files in pieces! Newline characters within fields make it tricky, but with the help of a finite-state machine it’s possible to work around that in most real-world cases.

Comma-separated values (CSV) is perhaps the world’s most common data exchange format. It’s human-readable, it’s compact, and it’s supported by pretty much any application that ingests data. At Etleap we frequently encounter really big CSV files that would take a long time to process sequentially. Since we want our clients’ data pipelines to have minimal latency, we split these files into pieces and process them in a distributed fashion.
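As a preview of the technique, here is a simplified finite-state machine in Python that finds the newlines that are true record boundaries, skipping newlines that fall inside quoted fields (it ignores carriage returns and other dialect details the full post deals with):

```python
# FSM states while scanning one character at a time.
OUTSIDE, IN_QUOTES, QUOTE_SEEN = range(3)

def record_boundaries(data, quote='"'):
    """Yield offsets of newlines that end a record, i.e. newlines
    that are not inside a quoted field."""
    state = OUTSIDE
    for i, ch in enumerate(data):
        if state == OUTSIDE:
            if ch == quote:
                state = IN_QUOTES
            elif ch == "\n":
                yield i
        elif state == IN_QUOTES:
            if ch == quote:
                state = QUOTE_SEEN  # closing quote, or first of an escaped pair
        else:  # QUOTE_SEEN
            if ch == quote:
                state = IN_QUOTES   # escaped quote ("") - still inside the field
            elif ch == "\n":
                state = OUTSIDE
                yield i
            else:
                state = OUTSIDE

csv_chunk = 'id,note\n1,"line one\nline two"\n2,plain\n'
print(list(record_boundaries(csv_chunk)))  # [7, 29, 37]
```

Note that the newline inside the quoted field (offset 19) is not reported, so a split at any reported boundary leaves every record intact.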

Continue reading “Distributed CSV Parsing”