Etleap Launches Native Amazon Redshift Console Integration

SAN FRANCISCO, Calif. – December 9, 2020 — Etleap, a leading provider of cloud-native ETL solutions for Amazon Web Services (AWS), today announced availability of its hosted Software-as-a-Service (SaaS) offering through an integration with the Amazon Redshift Console, which is now in Public Preview. This is designed to allow Amazon Redshift customers to accelerate data onboarding by ingesting data from a plethora of data sources through Etleap.

By launching this Amazon Redshift Console integration, Etleap enables data and analytics teams to deploy a cloud native ETL solution for their Amazon Redshift data warehouse with just a few clicks. After selecting Etleap in the Amazon Redshift Console, users are transferred to a setup wizard on Etleap’s site where they can select and configure data sources in a few clicks. Etleap ingests data from the data sources into the Amazon Redshift cluster, and keeps data up-to-date over time. From there, users can access their data in Amazon Redshift, and also use Etleap to model their data for analytics.

“We’re excited to launch our integration with Amazon Redshift Console,” said Christian Romming, CEO of Etleap. “It fits right in with our mission of decreasing ETL setup and maintenance costs, as well as reducing time-to-value for analytics initiatives.”

Etleap is an analyst-friendly, cloud-native ETL-as-a-service solution. It replaces time-consuming ETL setup and maintenance with intuitive software and a managed service that automates data pipelines. Etleap lets data teams deploy ETL pipelines that source data from databases, web applications, file stores, and streams, and loads that data into Amazon Redshift and Snowflake warehouses, Amazon Simple Storage Service (Amazon S3), and AWS Glue data lakes. Users can transform data using Etleap’s Data Wrangler and SQL-based Data Modeler to make the data analytics-ready.

For more information or to try it out for yourself, follow this link to Amazon Redshift Console’s native partner integration page. Alternatively, you can email info@etleap.com; Follow Etleap on Twitter @etleap; or Like Etleap on Facebook @etleap.

How Etleap Integrates with Amazon Redshift Data Sharing to Provide Isolation of ETL and BI Workloads

This article was also posted on AWS’s blog here.

By Christian Romming, Founder and CEO at Etleap
Neeraja Rentachintala, Principal Product Manager with Amazon Redshift
Jobin George, Sr. Partner Solutions Architect at AWS
Caius Brindescu, Member of Technical Staff at Etleap

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing extract, transform, and load (ETL), business intelligence (BI), and reporting tools.

Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

In this post, we discuss how Etleap integrates with new data sharing and cross-database querying capabilities in Amazon Redshift to enable workload isolation for diverse analytics use cases in customer environments.

Etleap is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation.

Introduction to Etleap and New Amazon Redshift Capabilities

Etleap is a fully-managed ETL solution built specifically for Amazon Web Services (AWS) that doesn’t require extensive engineering work to set up, maintain, and scale. It reduces ETL setup time from months to days, automates most maintenance work, and provides transparency and full control over data pipelines.

Etleap integrates with any data source and runs either as a hosted solution (SaaS) or inside your virtual private cloud (VPC).

The new cross-database query capability in Amazon Redshift allows customers to query across databases in an Amazon Redshift cluster using a simple three-part notation <database>.<schema>.<table>.

Customers will continue to connect to a specific Amazon Redshift database from their BI and analytics tools, and applications via JDBC/ODBC similar to how they do today. Now, however, they can seamlessly query data from all the databases they have permissions to access on the cluster without having to connect to one database at a time.

The new Amazon Redshift data sharing capability allows customers to securely and easily share data across different Amazon Redshift clusters. Data sharing improves the agility of organizations by giving them instant, granular, and high-performance access to live data across Amazon Redshift clusters without the need to manually copy or move it.

Once data is shared, customers can use cross-database query functionality to query the shared databases from the consuming clusters and join the shared data with local data on the consumer cluster.

Why Etleap Customers Need Amazon Redshift Data Sharing

The workloads that Etleap users run on Amazon Redshift can be broadly divided into three categories:

  • Ingestion – Etleap connects to data sources such as databases, applications, files, web services, and event streams, transforms data according to user-specified rules, and loads data into customer’s Amazon Redshift clusters. The load itself consists of COPY statements, and also typically joins with target tables in order to perform UPDATEs and DELETEs of existing records.
    .
  • Modeling – Users create Amazon Redshift materialized views through Etleap models, which are specified through SELECT statements that join and aggregate data. These materialized views ensure queries that power reports and dashboards are performant. For a deeper analysis of the performance benefits of materialized views, see this post from the AWS Big Data Blog.
    .
  • Analytical Queries – These come in many forms—for example ad-hoc analyses, internal or customer-facing dashboards, and scheduled reports. These workloads are read-heavy, consisting mostly of SELECT queries.

We’ll refer to the first two categories (ingestion and modeling) as ETL workloads, and the third category (analytical queries) as BI workloads.

BI workloads require data to have been ingested and modeled. BI users expect their data to be available and up-to-date, often have a low tolerance for data delays, and also cannot miss performance SLAs for critical reporting needs.

In order for data to be up-to-date, ETL processes that ingest and model data need to consistently run on time. There are two issues related to this that Amazon Redshift data sharing helps address.

Workload Variability

ETL workloads tend to be predictable in that they are executed on a schedule. Also, with users moving from nightly to continuous data ingestion and modeling, ETL workloads need to execute more constantly throughout the day as well.

BI workloads, on the other hand, tend to be more dynamic and variable throughout the day. When cluster capacity is shared between ETL and BI workloads, BI workloads can reduce the resources available to ETL because of this variability, delaying ingestions and modeling. These data delays can worsen the BI user’s experience.

There are also scenarios where BI clusters don’t get the necessary resources due to resource contention from ETL and query performance for BI users is impacted.

Different types of workloads also tend to have different data access patterns (for example, small queries from dashboarding applications vs. aggregations from analysts vs. large data processing for ETL) and different SLA requirements.

There may also be different teams responsible for ETL vs. BI, and each team may want to control the compute resources they want to provision for their workloads. Having separate clusters between ETL and different types of BI workloads is therefore beneficial.

By using Amazon Redshift data sharing to isolate ETL and BI workloads to separate clusters, ETL workloads are not affected by BI workload variability, and BI users have up-to-date data and consistent SLAs for their analyses.

Another benefit of this separation is that ETL and BI cluster capacity can be provisioned according to their workload-specific performance requirements, and can be changed independently as requirements change. Data sharing also enables the ability to offer charge-back capabilities for various workloads and teams.

Table Locking

While Amazon Redshift is designed to enable concurrent reads and writes, Etleap often needs to run DDL operations, such as ALTER TABLE and DROP TABLE, in order to apply schema changes and complete full refreshes of ingestion pipelines.

Such DDL operations acquire an AccessExclusiveLock on the table in question, which means it cannot be accessed concurrently by readers. This can lead to contention between ETL and BI workloads.

Using Amazon Redshift data sharing to isolate BI workloads to a consumer cluster solves this problem. Data sharing provides snapshot isolation so consumers continue to query the data even when DDL is happening in producer.

Note there are no data movement as a result of data sharing—live and transactional data is simply shared in place through Amazon Redshift managed storage. This is illustrated in the following diagram.

Figure 1 – Top: ETL and BI workloads running on the same cluster.
Bottom: ETL and BI workloads running on separate clusters with data sharing enabled.

Configuring Amazon Redshift Data Sharing with SQL Queries

Here’s an example of how to configure data sharing. We have configured a process to ingest click logs from our website stored in an Amazon Simple Storage Service (Amazon S3) bucket into our Amazon Redshift producer cluster.

The data is stored in a table called “clicks” in the schema “website” in the database “etl.” We want to share this table with the consumer cluster and grant read access to BI users, which are members of a group in the consumer cluster called “bi_users.”

First, we’re going to set up the datashare itself. For that we need the namespaces of both the producer and consumer clusters, so we run the following query on both clusters:

SELECT CURRENT_NAMESPACE;

Let’s assume the results are “producer_cluster” and “consumer_cluster,” respectively. On the producer cluster, we create the datashare and grant the consumer cluster access to it:

CREATE DATASHARE “etl_to_bi”;

GRANT USAGE ON DATASHARE “etl_to_bi” TO NAMESPACE ‘consumer_cluster’;

Next, we accept the datashare on the consumer cluster by creating a database from it:

CREATE DATABASE “etl_to_bi” FROM DATASHARE “etl_to_bi” OF NAMESPACE ‘producer_cluster’;

Now that we have the datashare established, let’s add the “website” schema and “clicks” table to it on the producer cluster:

ALTER DATASHARE “etl_to_bi” ADD SCHEMA “website”;

ALTER DATASHARE “etl_to_bi” ADD TABLE “website”.”clicks”;

Finally, on the consumer cluster, we can now add the shared schema to our database and grant permissions to our BI users.

CREATE EXTERNAL SCHEMA “website” FROM REDSHIFT DATABASE ‘etl_to_bi’ SCHEMA “website”;

GRANT USAGE ON SCHEMA “website” TO GROUP bi_users;

The “clicks” table is now available for querying by BI users, and, as the table is updated with new data in the producer cluster, it’s immediately available in the consumer cluster as well.

In some scenarios, we need to replace the table on the producer cluster because we’re re-ingesting the data. In this case, we’ll remove the table from the data share and add it back after we have recreated it.

From a consumer point of view, nothing needs to be done and the latest objects are available in the data share automatically in a transactionally consistent fashion.

ALTER DATASHARE “etl_to_bi” REMOVE TABLE ”website”.”clicks”;

DROP TABLE ”website”.”clicks”;

CREATE TABLE ”website”.”clicks” ...

ALTER DATASHARE “etl_to_bi” ADD TABLE “website”.”clicks”;

We don’t need to configure the access privileges on the consumer side again, as the access permissions are granted at the schema level. Note that for every new table that’s added to the schema on the producer side, we need to add the table to the datashare as described above.

Also, for each new schema we want to share, we need to add it to the datashare and also grant usage privileges to it on the consumer side.

Configuring Amazon Redshift Data Sharing with Etleap

Etleap has introduced support for Amazon Redshift data sharing and manages sharing between ETL and BI Amazon Redshift clusters so users don’t need to run the data sharing SQL statements themselves, simplifying workflows for data availability.

Here’s how to configure the data pipeline described in the example above in Etleap, where data ingested into a producer cluster is shared to a consumer cluster using data sharing.

First, we create the consumer cluster connection and add the user group “bi_users,” as shown in the screenshot below.

Figure 2 – Configuring the “bi users” group with read access to tables in the consumer cluster.

Then, we set up the producer cluster connection and, under “Data Sharing” we pick the consumer cluster from the dropdown.

Figure 3 – Configuring the producer cluster to share data with the consumer cluster.

At this point, Etleap has configured sharing from the producer cluster to the consumer cluster. The final step is to create an ingestion pipeline for the click data from Amazon S3 into the producer cluster, which involves going through Etleap’s pipeline setup wizard and interactive data wrangling.

Figure 4 – Creating an Etleap pipeline to ingest data from S3 to the producer cluster.

Once we click “Start ETL’ing,” Etleap creates the “clicks” table in the producer cluster and continuously ingests new data into it as it becomes available. Etleap also adds the table to the datashare with the consumer cluster and configures read access to the “bi_users” group on the consumer cluster.

At this point, the table is ready to be queried by BI users.

Figure 5 – Querying the “clicks” table as a user in the “bi users” group on the consumer cluster.

As Etleap ingests new data into the “clicks” table, BI users will immediately and automatically see up-to-date data through Amazon Redshift data sharing. Note the data is not ingested into the BI cluster, but simply shared from the Amazon Redshift managed storage layer without any performance impact to the producer cluster.

Etleap maintains the datashare throughout the lifecycle of the ingestion pipeline, including cases where the producer table has to be recreated, so there’s no additional work required to maintain data sharing between the two clusters.

Summary

In this post, you learned about Amazon Redshift’s data sharing feature and how it helps achieve workload isolation and charge-ability. You also learned about the architecture of Amazon Redshift with the data sharing feature enabled, how data sharing can be configured using SQL commands, and how easy it is to leverage the capability in Etleap.

To learn more about Etleap, take a look at their Etleap ETL on AWS Quick Start, or visit their listing on AWS Marketplace.

Etleap Achieves Snowflake Ready Status and Announces Snowflake ELT Features

SAN FRANCISCO, Calif. – November 18, 2020 — Etleap announced today that it has achieved the Snowflake Ready status for Snowflake Data Cloud. This recognizes that Etleap’s Snowflake integration has been successfully validated by a third party, and that Etleap adheres to Snowflake best practices for ETL/ELT tools.

Etleap also announced availability of its modeling layer for Snowflake, which takes advantage of Snowflake’s MPP (massively parallel processing) infrastructure to transform data after load in order to combine data from multiple sources and to enable faster query performance.

“We’re excited to welcome Etleap as a Snowflake Ready Technology partner,” Snowflake Director of Technology Alliances, Tarik Dwiek said. “In addition to making it easy to bring data into Snowflake from a multitude of sources, Etleap’s innovative ELT features for Snowflake are opening up new possibilities for customers to further reduce engineering effort and improve time to value for their data. We’re looking forward to seeing Snowflake customers take advantage of these features.”

Etleap is an analyst-friendly, cloud-native ETL-as-a-service solution. It replaces time-consuming ETL setup and maintenance with intuitive software and a managed service that automates data pipelines.

“Etleap is proud to achieve Snowflake Ready status,” said Christian Romming, Founder and CEO of Etleap. “Our team is dedicated to helping companies with maintenance-free, enterprise-grade ETL and ELT by leveraging the ease-of-use, performance, and pace of innovation that Snowflake provides. Our designation as a Snowflake Ready partner shows our continued commitment to Snowflake.”

To try Etleap out for yourself, sign up for a trial here. Alternatively, you can email info@etleap.com; Follow Etleap on Twitter @etleap; or Like Etleap on Facebook @etleap.

Etleap launches AWS Quick Start

Etleap, a leading provider of cloud-native ETL solutions for Amazon Web Services (AWS), today announced availability of an AWS Quick Start for Etleap VPC to enable customers to create, maintain, and scale data warehouses and lakes on AWS without extensive engineering work. The AWS Quick Start automates the deployment of Etleap VPC using AWS CloudFormation and works both with deployments through AWS Marketplace and with licenses provided by Etleap.

Etleap VPC lets data teams deploy ETL pipelines inside their own Amazon Virtual Private Cloud (Amazon VPC) that source data from databases, web applications, file stores, and streams, and load data into Amazon Redshift warehouses, Amazon Simple Storage Service (Amazon S3), and AWS Glue data lakes. They can transform data using Etleap’s Data Wrangler and SQL-based Data Modeler to make the data analytics-ready.

“Customers are looking for ways to integrate their data sources quickly so they can take full advantage of Amazon Redshift’s speed and scalability,” says Chris Grusz, Director, AWS Marketplace, Amazon Web Services, Inc.. “It’s great to see Etleap making an AWS Quick Start available so teams can start ingesting data to Amazon Redshift with just a few clicks, while at the same time run their ETL pipelines inside their own Amazon VPC.”

Enterprises want to take advantage of the cost savings, flexibility, and scalability of cloud data warehouses and lakes, but often have strict data privacy and security policies they need to adhere to as they are migrating to the cloud. This often requires that they keep their data and infrastructure within their own controlled, single-tenant cloud environment as opposed to using hosted, multi-tenant services. Etleap’s customer eMoney Advisor was in this situation when they wanted to spin up their analytics stack on AWS.

“For eMoney to deploy its analytics stack on AWS it was imperative that all data and infrastructure components run inside of our Amazon VPC,” says Barrie Effron, Vice President of Business Analytics Systems and PMO at eMoney Advisor LLC. “With Etleap we were up and running in production in 6 weeks and we have not had to hire engineers to manage our ETL pipelines.”

By launching this AWS Quick Start, Etleap enables data and analytics teams to deploy a cloud native ETL solution in their own Amazon VPC with just a few clicks.

“As enterprises migrate their data warehouses and lakes to AWS, they’re looking to Etleap to reduce the cost of their ETL and time to value for analytics,” said Christian Romming, CEO of Etleap. “With Etleap VPC, customers get all the benefits of Etleap’s managed ETL solution while keeping their data inside their Amazon VPC. This AWS Quick Start makes the setup of Etleap VPC fast and easy.”

For more information or to try it out for yourself, follow this link to Etleap’s AWS Quick Start. Alternatively, you can email info@etleap.com; Follow Etleap on Twitter @etleap; or Like Etleap on Facebook @etleap.

Run Etleap Inside Your VPC

Today is a big day for the whole team here at Etleap! We’re announcing that we’re making Etleap VPC available on AWS Marketplace, which means customers can run Etleap inside their own AWS Virtual Private Cloud (VPC) with just a few clicks. This marks the beginning of an era where we’re not just responsible for the success of customers running pipelines on our hosted infrastructure, but also of those running Etleap pipelines on their own AWS infrastructure. This product launch has been in the making for many months, and I am very proud of what our team has accomplished!

So what is Etleap VPC and why did we build it? First I’ll make the case for the VPC SaaS model being the future of enterprise ETL. Then I’ll highlight some of the reasons why being hosted SaaS first has been tremendously beneficial to Etleap’s development as a company.

How did we get here?

Every organization that has a data warehouse or lake has to deal with ETL – after all, what good is a fancy data repository without up-to-date and clean data in it? Traditionally, ETL has been expensive and time-consuming. Learning complex ETL software, building Kimball modeling processes, and setting up compute clusters dedicated to ETL led to projects measured in months or years with 7-figure budgets. And that’s before the operational costs of handling changes and errors.

Enter the cloud. Data warehouses like Amazon Redshift and Snowflake have taken advantage of cloud computing primitives to deliver vastly superior user experiences, offering scalability and flexibility at a fraction of the cost of traditional data warehouses. Today, cloud-native ETL products tailored to these technologies and to modern data teams’ workflows are starting to eliminate the headaches of traditional ETL projects.

Advantages of Cloud-Native, Managed SaaS ETL

What makes cloud-native ETL different? Some products, like Etleap’s ETL solution, are easy to  learn and let data engineers create ETL pipelines that are fully-managed. This means that customers don’t need a dedicated ops team to operate the hardware that the ETL software runs on, or to manage pipelines and fixing errors. Pipelines can be set up quickly from any data source and transformations to make the data useful can be defined without coding.

This is great news for data teams, because they can avoid hiring engineers dedicated to ETL, and they can get up and running in days or weeks instead of months or years. Teams often see an order-of-magnitude gain in data team productivity as a result. At the same time, it means that a tremendous amount of trust is placed in the product’s data security and operation by the customer.

Towards the Virtual Private Cloud (VPC)

A question we have been asked many times over the years by our customers is whether we can operate inside their own Virtual Private Cloud (VPC). The security benefit to the customer is that data doesn’t flow outside their VPC on the way from their source to the warehouse or lake, and they have more direct control over infrastructure and data access policies. While hosted Etleap uses only S3 buckets owned by the customer for intermediate data storage, data passes through servers managed by Etleap for processing. For the most privacy-sensitive companies, this is a non-starter. In order to adopt the cloud, it is an absolute requirement that their data remains tightly controlled inside their VPC.

While a large segment of the market is happy to adopt, and even prefers, hosted ETL services, it is now clear to us that enterprise companies are often not, and probably never will be. We asked ourselves what this meant for our offering, and the answer we have come up with is that running Etleap inside one’s own VPC should “feel” just like using the hosted product; hiring ETL engineers should not be necessary and pipelines should “just work”. This is of course easier said than done – without our experienced ops team having direct access to operate infrastructure and troubleshoot pipeline issues, how can we offer the same experience?

Through working with early adopters of our VPC offering we have hardened the infrastructure components in order to become virtually management-free, and honed in on the required operational metrics and de-identified logs shipped back to Etleap’s ops team in order to enable our team to offer proactive pipeline support and assist with reactive troubleshooting.

Setting up Etleap is designed to be fast and pain-free, and it was important to us that we offer the same effortless experience for customers that want to run Etleap inside their own VPC. Using AWS CloudFormation the setup of infrastructure components is fully automated, including an automatically scaling AWS EMR cluster that runs extractions and transformations. Customers enter their name and email address, press “play” on the CloudFormation template, and minutes later they can set up Etleap pipelines in their browser.

Benefits of Being SaaS-First

Creating a robust ETL system is all about handling edge cases. It’s relatively straight-forward to build a system to ingest files from an SFTP server or replication logs from a SQL database into a warehouse. The complexity is in how you handle issues like errors in the data, schemas that change in unpredictable ways, or ingesting incrementally from an S3 bucket that contains tens of millions of non-alphabetically ordered files. Our focus is on providing intuitive solutions to all these issues, while at the same time making the ETL software a pleasure to use.

Etleap’s hosted multi-tenant deployment is and will continue to be the biggest Etleap deployment in terms of data scale and number of users. This is a great benefit to customers running Etleap inside their own VPCs because of the continuous improvements we make to the scalability, usability, and flexibility of the solution.

Being SaaS-first has enabled us to work very closely with our customers. From the beginning, we set the high bar that pipelines should “just work”, no matter the source, quantity of data, or transformation complexity. Over the last 7 years, we have been fortunate to experience our customer’s new challenges first-hand every day, and as a result have been able to build a robust system that meets their needs. Today, we are thrilled to provide the same pain-free and effortless experience for customers that want to run Etleap inside their own VPC. 

SPEEDING UP ETLEAP MODELS AT AXS WITH AMAZON REDSHIFT MATERIALIZED VIEWS

This blog post was written in partnership with the Amazon Redshift team, and also posted on the AWS Big Data Blog.

The materialized views feature in Amazon Redshift is now generally available and has been benefiting customers and partners in preview since December 2019. One customer, AXS, is a leading ticketing, data, and marketing solutions provider for live entertainment venues in the US, UK, Europe, and Japan. Etleap, an Amazon Redshift partner, is an extract, transform, load, and transform (ETLT) service built for AWS. AXS uses Etleap to ingest data into Amazon Redshift from a variety of sources, including file servers, Amazon S3, relational databases, and applications. These ingestion pipelines parse, structure, and load data into Amazon Redshift tables with appropriate column types and sort and distribution keys.

Improving dashboard performance with Etleap models

To analyze data, AXS typically runs queries against large tables that originate from multiple sources. One of the ways that AXS uses Amazon Redshift is to power interactive dashboards. To achieve fast dashboard load times, AXS pre-computes partial answers to the queries dashboards use. These partial answers are orders of magnitude smaller in terms of the number of rows than the tables on which they are based. Dashboards can load much faster than they would if they were querying the base tables directly by querying Amazon Redshift tables that hold the pre-computed partial answers.

Etleap supports creating and managing such pre-computations through a feature called models. A model consists of a SELECT query and triggers for when it should be updated. An example of a trigger is a change to a base table, that is, a table the SELECT statement uses that defines the model. This way, the model can remain consistent with its base tables.

The following screenshot shows an Etleap model with two base table dependencies.

Etleap represents their models as tables in Amazon Redshift. To create the model table, Etleap wraps the SELECT statement in a CREATE TABLE AS (CTAS) query. When an update is triggered, for example, due to base table inserts, updates, or deletes, Etleap recomputes the model table through the following code:

CREATE TABLE model_temporary AS SELECT …
DROP TABLE model;
RENAME TABLE model_temporary TO model;

Analyzing CTAS performance as data grows

AXS manages a large number of Etleap models. For one particular model, the CTAS query takes over 6 minutes, on average. This query performs an aggregation on a join of three different tables, including an event table that is constantly ingesting new data and contains over a billion rows. The following graph shows that the CTAS query time increases as the event table increases in number of rows.

There are two key problems with the query taking longer:

  • There’s a longer delay before the updated model is available to analysts
  • The model update consumes more Amazon Redshift cluster resources

To address this, AXS would have to resort to workarounds that are either inconvenient or costly, such as archiving older data from the event table or expanding the Amazon Redshift cluster to increase available resources.

Comparing CTAS to materialized views

Etleap decided to run an experiment to verify that Amazon Redshift’s materialized views feature is an improvement over the CTAS approach for this AXS model. First, they built the materialized view by wrapping the SELECT statement in a CREATE MATERIALIZED VIEW AS query. For updates, instead of recreating the materialized view every time that data in a base table changes, a REFRESH MATERIALIZED VIEW query is sufficient. The expectation was that using materialized views would be significantly faster than the CTAS-based procedure. The following graph compares query times of CTAS to materialized view refresh.

Running REFRESH MATERIALIZED VIEW was 7.9 times faster than the CTAS approach—it took 49 seconds instead of 371 seconds on average at the current scale. Additionally, the update time was roughly proportional to the number of rows that were added to the base table since the last update, rather than the total size of the base table. In this use case, this number is 3.8 million, which corresponds to the approximate number of events ingested per day.

This is great news. The solution solves the previous problems because the delay the model update caused stays constant as new data comes in, and so do the resources that Amazon Redshift consume (assuming the growth of the base table is constant). In other words, using materialized views eliminates the need for workarounds, such as archiving or cluster expansion, as the dataset grows. It also simplifies the refresh procedure for model updates by reducing the number of SQL statements from three (CREATE, DROP, and RENAME) to one (REFRESH).

Achieving fast refresh performance with materialized views

Amazon Redshift can refresh a materialized view efficiently and incrementally. It keeps track of the last transaction in the base tables up to which the materialized view was previously refreshed. During subsequent refreshes, Amazon Redshift processes only the newly inserted, updated, or deleted tuples in the base tables, referred to as a delta, to bring the materialized view up-to-date with its base tables. In other words, Amazon Redshift can incrementally maintain the materialized view by reading only base table deltas, which leads to faster refresh times.

For AXS, Amazon Redshift analyzed their materialized view definitions, which join multiple tables, filters, and aggregates, to figure out how to incrementally maintain their specific materialized view. Each time AXS refreshes the materialized view, Amazon Redshift quickly determines if a refresh is needed, and if so, incrementally maintains the materialized view. As records are ingested into the base table, the materialized view refresh times shown are much faster and grow very slowly because each refresh reads a delta that is small and roughly the same size as the other deltas. In comparison, the refresh times using CTAS are much slower because each refresh reads all the base tables. Moreover, the refresh times using CTAS grow much faster because the amount of data that each refresh reads grows with the ingest rate.

You are in full control of when to refresh your materialized views. For example, AXS refreshes their materialized views based on triggers defined in Etleap. As a result, transactions that are run on base tables do not incur additional cost to maintain dependent materialized views. Decoupling the base tables’ updates from the materialized view’s refresh gives AXS an easy way to insulate their dashboard users and offers them a well-defined snapshot to query, while ingesting new data into base tables. When AXS vets the next batch of base table data via their ETL pipelines, they can refresh their materialized views to offer the next snapshot of dashboard results.

In addition to efficiently maintaining their materialized views, AXS also benefits from the simplicity of Amazon Redshift storing each materialized view as a plain table. Queries on the materialized view perform with the same world-class speed that Amazon Redshift runs any query. You can organize a materialized view like other tables, which means that you can exploit distribution key and sort columns to further improve query performance. Finally, when you need to process many queries at peak times, Amazon Redshift’s concurrency scaling kicks in automatically to elastically scale query processing capacity.

Conclusion

Now that the materialized views feature is generally available, Etleap gives you the option of using materialized views rather than tables when creating models. You can use models more actively as part of your ETLT strategies, and also choose more frequent update schedules for your models, due to the performance benefits of incremental refreshes

For more information about Amazon Redshift materialized views, see Materialize your Amazon Redshift Views to Speed Up Query Execution and Creating Materialized Views in Amazon Redshift.

by Christian Romming, Prasad Varakur (AWS), and Vuk Ercegovac (AWS)

How Etleap automates its infrastructure process with Terraform & Ansible

Introduction

“Infrastructure as Code”, IaC, is a term every system administrator has heard by now. We can think about it as the process of managing and provisioning IT infrastructure through source-code instead of performing tasks manually. As we will explore, this helps DevOps teams efficiently and safely adapt infrastructure to meet the always-changing requirements dictated by the business. This approach helps to manage infrastructure in a way that enables the devops team to better serve the organization.

How can this paradigm help you? It encourages the adoption of software development practices like keeping infrastructure’s definition and configuration scripts in a source control system, automated code testing, and doing peer reviews. This benefits infrastructure management in numerous tried-and-true ways.

If you’re starting your journey into IaC, there are many resources you can reference to familiarize yourself with the concepts and terminology associated with this approach. Kief Morris’ “Infrastructure as Code: Managing Servers in the Cloud” is an essential book on the topic (alternatively, Martin Fowler’s blog gives a great overview).

At Etleap, we embrace IaC to build and improve our service every day. This practice helps us in our ongoing effort to make Etleap the best ETL platform it can be.

“IaC is makes it possible to effortlessly and reliably spin up any element of an infrastructure at any time, or even the entire infrastructure, in a matter of minutes.”

Let’s take a look at a few examples of how using IaC has helped Etleap build a better product.

Service uptime and disaster recovery

One advantage of IaC is that it makes it possible to effortlessly and reliably spin up any element of an infrastructure at any time, or even the entire infrastructure, in a matter of minutes. The new infrastructure will be consistent with the previous one, which is to say that its software and configuration are the same (every security patch is applied, OS is configured the same way, allocated resources are identical).

Imagine a scenario where extreme weather or a natural disaster destroys the data centers where Etleap is hosted. For obvious reasons, it’s vital that we have a plan to recover from such an ordeal. Using IaC, we’re able to easily and reliably reproduce the entire infrastructure needed by Etleap and get it running in a new data center in short order. And so, even in this extreme case we’re able to recover from a service disruption incredibly quickly.

“With IaC tools available, almost every aspect of an infrastructure’s configuration can be defined in a configuration file or scripted.”

Another common issue is configuration drift, which is a major concern for services that must ensure high availability and disaster recovery strategies. If left unchecked, configuration drift increases the risk of prolonged outages or loss of data. By making sure every change introduced to the infrastructure configuration is done through the definition files or scripts, we can totally eliminate configuration drift. This way, we reduce the risk of having misconfiguration issues when we need to re-provision our infrastructure.

Finally, to keep Etleap up and running at all times, we should be able to add more resources or replace an unhealthy component at any time. Let’s imagine that a server instance stops serving requests because it’s running out of memory. In this case we should be able to provision a new server, with more memory, and redirect the traffic to it. Etleap has dealt with a similar challenge where we encountered memory shortages when running an Amazon Elastic MapReduce cluster. After EMR had become unhealthy, we traced the root cause to memory degradation. But because the EMR cluster provisioning and configuration was scripted, it was straightforward to update the configuration and start a new cluster and point Etleap to it after it launched, with zero downtime for our users.

Improved monitoring, highly secure

With IaC tools available, almost every aspect of an infrastructure’s configuration can be defined in a configuration file or scripted. Not only physical hardware, networks, and storage, but also identity access management (IAM), monitoring, alarm systems, and much more.

Going back to our example of a server running out of memory: when things go sideways it’s essential to have a monitoring system that alerts us of these issues to avoid service outages. If we know a certain node is going into a bad state, we can take the needed action to improve its behavior or, in the worst case, replace the node outright. This way, we’re usually able to resolve the issue, before our customers notice any issues or downtime. It also makes a lot of sense having the definition of these alarms tied to the infrastructure they’re monitoring — any time infrastructure changes, its monitoring is updated as well.

IAM is hugely important when it comes to security. Meticulously defining the right access levels and ingress rules to different parts of the infrastructure is crucial for data and system protection. By restricting access to production servers we can prevent unauthorized persons from gaining access to sensitive data. Finally, audits and reviews of the configuration and any changes allow us to maintain the right access at all times. 

Etleap productization

At Etleap, IaC practices enable a repeatable deployment process. Each time we provision our infrastructure the result is a known quantity, and that’s something we take advantage of in multiple ways.

Etleap is SaaS, meaning our product runs in the cloud and our users don’t need to install or maintain anything to start using it. However, some of our customers, especially those with strict security requirements, require that Etleap runs in an isolated AWS VPC. Embracing IaC helps us efficiently deploy Etleap to a completely new environment. The installation process is well-defined and tested, and is a daily occurrence for us. This allows us to ensure that Etleap running in one environment will behave identically to another instance running in a different environment, which saves time when identifying issues and reduces the need for customers to contact the support team. Thinking of infrastructure as a product itself gives Etleap a competitive advantage, as it allows us to serve customers with complex security requirements.

“IaC not only helps manage production environments but the entire software development lifecycle.”

Running identical instances of Etleap in multiple environments also simplifies updates. For example, diagnosing and fixing a bug for a user running Etleap in his or her own VPC would be really challenging if each of the environments differed from one another. By ensuring parity between all environments where Etleap is deployed, we eliminate this potential headache.

Streamline development and delivery cycle

IaC not only helps manage production environments but the entire software development lifecycle. During development, we can provision an isolated sandbox environment to safely make changes without the risk of breaking something. We can test new changes against our sandbox environment to more quickly detect if they would negatively affect the production environment when deployed. Having each new feature or bug fix properly tested during development reduces the risk of introducing issues when changes are rolled out. Once thoroughly tested, changes are then automatically deployed in a CI/CD process, any new feature or bug fix is rolled out to our users as soon as they’re merged into the master branch.

For example, some time ago I was tasked with improving our validation process for users wanting to add or edit an S3 data lake or S3 input connection. One of our goals was to give to the user more accurate information about misconfiguration problems with their connections. In both cases, most of these configuration issues were related to incorrect policies being attached to a given IAM user. It would have been quite tedious to add all these cases manually through the AWS console. Instead, we were able to quickly and easily script the policies that matched the cases we wanted to test and roll them out to the sandbox environment.

Another case where we took advantage of our ability to effortlessly provision a sandbox environment during development was when we improved our ZooKeeper cluster. We switched from having a standalone ZooKeeper node to an ensemble of nodes. We scripted the cluster configuration and provisioned it in a sandbox environment. This way, we could test that the cluster was working as expected. We were also able to stress test the cluster out to see how it behaved. There were some questions we wanted to answer before rolling it out, like: how well does the cluster behaves when nodes are disconnected? Are new nodes automatically incorporated into the cluster? Will the master node switch to another node when it becomes unhealthy? We tested each of these scenarios in the safety of our sandbox environment without affecting production. When we finally rolled the new ZooKeeper cluster out, we could rest easy that it would work as expected, as we’d already tested against many of the possible point of failures during development.

Conclusion

By leveraging IaC, Etleap benefits in numerous ways. Hosting the infrastructure design in definition files and scripts ensures a consistent environment, where each node has exactly the desired configuration. This makes it easier and less risky to update many aspects of the infrastructure. Errors can be identified and fixed faster, or in the worst case, infrastructure can be reverted to the last functional configuration. Changes can be made quickly and with little effort, and we can easily scale by increasing the number of nodes or their size.

AWS re:Invent 2019 Roundup

Materialized Views, Amazon Redshift Ready, and more!

Last week Etleap put on another exciting show at AWS re:Invent, where we announced some new features and integrations with AWS services, were interviewed by the tech experts over at “theCUBE,” hosted a session all about data lakes, and most importantly, spoke with countless attendees about ETL. Here’s a roundup of all the Etleap action you may have missed at AWS re:Invent 2019.


Etleap’s booth was a veritable oasis of ETL discussion and Etleap product demos
Amazon Redshift Launches materialized views with help from etleap

Among AWS’ numerous announcements at re:Invent this year was the availability of Materialized Views in preview on Amazon Redshift. The Materialized Views feature is designed to help customers achieve up to 100x faster query performance on analytical workloads such as dashboarding queries from Business Intelligence (BI) tools and ELT data processing. Etleap helped launch this feature by integrating it into a beta version of Etleap Models (let us know if you want to be included in the beta!) and showing that it can give an ~8x performance boost. The Redshift team showcased our results in their chalk talk on “Accelerating performance with Materialized Views.”


Yannis (seated, left) and Vuk (standing, right) from the Amazon Redshift team showcase Etleap at their Redshift Materialized Views Chalk Talk

“We are delighted to have Etleap help launch the Materialized Views feature in Amazon Redshift,” said Andi Gutmans, Vice President, Analytics, Amazon Web Services, Inc. “Amazon Redshift Materialized Views allow customers to realize a significant boost in query performance in ETL pipelines and BI dashboards. By integrating Etleap with this new functionality, customers can seamlessly get the benefits of Amazon Redshift Materialized Views without needing to make any application changes.”

You can read the full Etleap press release about Amazon Redshift Materialized Views here.

Etleap Founder makes the case for more analyst-friendly data lakes, alongside Redshift team

Many Etleap customers use our solution to build their S3/Glue data lakes, so data lakes are a topic we’ve learned a thing or two about over the years. For re:Invent this year, we thought we’d share our data lake expertise with the world by hosting a session alongside the Redshift team entitled “Five data lake considerations with Amazon Redshift, Amazon S3 & AWS Glue.”


Etleap founder and CEO, Christian Romming, led the session focused on data lakes

Have an interest in data lakes yourself? You can check out the session here.

Etleap featured on enterprise tech talk show

After our data lakes session, Founder and CEO of Etleap, Christian Romming, sat down with the hosts of “theCUBE,” re:Invent’s resident technologies interview show. Check it out:

Etleap founder sits down with David Vellante and John Walls of theCUBE
Etleap achieves Amazon Redshift Ready Designation

Distinguishing ourselves in the Amazon Redshift partner ecosystem, we announced that Etleap has achieved the designation of “Amazon Redshift Ready,” a recently announced status among partners who have proven integration with Amazon Redshift.

Etleap was featured in the keynote announcement among a select few debuting partners

“Etleap is proud to achieve Amazon Redshift Ready status,” said Christian Romming, Founder and CEO of Etleap. “Our team is dedicated to helping companies achieve maintenance-free, enterprise-grade ETL by leveraging the agility, breadth of services, and pace of innovation that AWS provides. Our status as an Amazon Redshift Ready partner shows our continued commitment to Amazon Redshift and the AWS ecosystem.”

You can read the full Etleap press release covering the Amazon Redshift Ready announcement here.


This concludes our roundup of the biggest Etleap new stories from AWS re:Invent 2019. Stay tuned for more Etleap trade show news, and for all things ETL you’re already in the right place.

On-Demand Webinar: Etleap presents “Customer First Technology”

In this webinar, we explore how and why eMoney puts their customers first by choosing technologies that solve customer challenges and their use case for running Etleap and Looker within a highly secure VPC environment.

Ready to try Etleap for yourself? Click here to get started!

Stay tuned to this blog for more webinars and other Etleap content, and for all things ETL you’re already in the right place.

Etleap Achieves Amazon Redshift Ready designation

Recently-announced designation distinguishes Etleap on the Redshift platform

SAN FRANCISCO, Calif. – December 4, 2019 — Etleap announced today that it has achieved the Amazon Redshift Ready designation. This designation recognizes that Etleap has demonstrated successful integration with Amazon Redshift. 

Achieving the Amazon Redshift Ready designation differentiates Etleap as an AWS Partner Network (APN) member with a product integrating with Amazon Redshift and is generally available and fully supported for AWS customers. AWS Service Ready Partners have demonstrated success building products integrated with AWS services, helping AWS customers evaluate and use their technology productively, at scale and varying levels of complexity. 

“Etleap is proud to achieve Amazon Redshift Ready status,” said Christian Romming, Founder and CEO of Etleap. “Our team is dedicated to helping companies achieve maintenance-free, enterprise-grade ETL by leveraging the agility, breadth of services, and pace of innovation that AWS provides. Our status as an Amazon Redshift Ready partner shows our continued commitment to Amazon Redshift and the AWS ecosystem.”

To support the seamless integration and deployment of these solutions, AWS established the AWS Service Ready Program to help customers identify products integrated with AWS services and spend less time evaluating new tools, and more time scaling their use of products that are integrated with AWS Services.

Etleap is analyst-friendly ETL-as-a-service for Amazon Redshift and Snowflake data warehouses and Amazon S3/AWS Glue data lakes. Etleap replaces time-consuming ETL setup and maintenance with intuitive software and a managed service that automates data pipelines and reduces time to value.

For more information, email info@etleap.com; Follow us on Twitter @etleap; or Like us on Facebook @etleap.


About Etleap: Etleap was founded by Christian Romming in 2013. Before founding Etleap, Romming was the CTO of an ad-tech company, where he recognized the available solutions for building data pipelines required monumental engineering resources to implement, maintain, and scale. Etleap is backed by world-class investment firms First Round Capital, SV Angel, BoxGroup, and Y Combinator. Our mission is to make data analytics teams more productive. Our ETL solution lets analysts build data warehouses without internal IT resources or knowledge of complex scripting languages. This reduces the time of typical ETL projects from weeks to hours, and takes out the pain of maintaining data pipelines over time.