Carlos Peralta, Moderna’s data engineering leader, recently presented how his team has handled the exponential data growth at the biotech firm over the past two years.
“When COVID hit and we realized our vaccine would lead to company hyper-growth, we knew we had to accelerate building our new analytics platform.”
Moderna faced three main challenges in supporting its scientists with the full breadth of its data: a large number of siloed data sources, strict security requirements, and limited data engineering capacity.
The team has successfully navigated these challenges with the help of a modern data architecture, anchored by an Amazon Redshift data warehouse and Etleap for data transformations and pipelines.
Data Sources
Peralta’s team faced nearly 100 internal applications, each a siloed data source. Some partners, such as drug manufacturers, provide their data to Moderna through file servers and databases. In the past, the team tried one-off tools to ingest individual data sources, but found that this approach did not scale. By unifying its pipeline process, the team has become far more efficient and can tap into the expertise of the Etleap support team for troubleshooting and strategy.
Moderna has used Etleap’s easy-to-use interface to build pipelines from S3 buckets, file stores, and hundreds of SQL and NoSQL databases such as SQL Server, Postgres, and MongoDB. Most of these were available “out of the box” with Etleap, and Etleap quickly added integration support for Moderna’s non-standard sources.
Security
Data is highly sensitive and heavily regulated in the health and life sciences industry. Moderna’s vendors must meet stringent audit and change management compliance. Moderna also cannot use traditional SaaS workflows where its data would leave its own virtual private cloud (VPC). While Etleap is most commonly deployed as SaaS, it is also one of the few ETL tools available for deployment in a company’s VPC. A key part of this deployment is sending de-identified diagnostic data to enable Etleap to provide the same proactive support it delivers for SaaS customers.
Etleap’s focus on security has given Moderna comfort and flexibility. For instance, it has allowed Moderna to isolate ingest and transformation workloads from other AWS workloads and to run separate VPCs for production and pre-production. This has helped with auditability and development efficiency.
Data Engineering Team Productivity
The Moderna data engineering team faces high expectations from the company, and it needs to devote its scarce time to differentiating projects rather than to the plumbing of building and maintaining data pipelines. In Etleap, the team found a solution that is tightly integrated with AWS and just works.
“We don't have to worry about scalability, because of the way Etleap is architected with auto-scaling EMR clusters. Some of the sources contain many terabytes of data, and the pipelines to bring them into our platform just work.”
Etleap has accelerated pipeline creation and also cut the time needed to maintain pipelines. If there's a connection issue or a schema change, Etleap delivers alerts and proposed corrections.
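To make the schema-change case concrete, here is a minimal Python sketch of the general idea: compare the schema a pipeline expects with the schema observed at the source, then turn the differences into proposed corrections. This is not Etleap's implementation or API. The function names, the example columns, and the wording of the proposals are all hypothetical.

```python
def detect_schema_changes(expected, observed):
    """Compare two {column: type} mappings and report the differences."""
    added = {col: typ for col, typ in observed.items() if col not in expected}
    removed = [col for col in expected if col not in observed]
    retyped = {
        col: (expected[col], observed[col])
        for col in expected
        if col in observed and expected[col] != observed[col]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


def propose_corrections(table, changes):
    """Turn detected changes into human-readable proposals (hypothetical wording)."""
    proposals = []
    for col, typ in changes["added"].items():
        proposals.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
    for col in changes["removed"]:
        proposals.append(f"Keep {table}.{col} and load NULL for new rows")
    for col, (old, new) in changes["retyped"].items():
        proposals.append(f"Review type change for {table}.{col}: {old} -> {new}")
    return proposals


# Hypothetical example: the source gained a column, lost one, and changed a type.
expected = {"id": "INTEGER", "lot": "TEXT", "dose": "REAL"}
observed = {"id": "INTEGER", "dose": "TEXT", "site": "TEXT"}

changes = detect_schema_changes(expected, observed)
for proposal in propose_corrections("batch_results", changes):
    print(proposal)
```

In a real deployment the proposals would be sent as alerts for a person to approve rather than printed.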
The small team has been able to manage pipelines into both an Amazon Redshift data warehouse and a data lake that uses AWS Glue as its catalog. Team members specify the schema and table name, and Etleap automatically creates the tables, pulls the data from the sources, applies the transformations, and finally loads the data.
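The flow described above (create the table, pull the data, transform it, load it) can be sketched in a few lines of Python. This is only an illustration of the pattern, not Etleap's code. It uses an in-memory SQLite database as a stand-in for the warehouse, and `run_pipeline`, the table name, and the sample rows are all hypothetical.

```python
import sqlite3


def run_pipeline(conn, schema, table, source_rows, transform):
    """Create the destination table, transform the source rows, and load them."""
    # 1. Create the destination table from the user-specified schema.
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in schema.items())
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')

    # 2. Pull the rows from the source and 3. apply the transformation.
    transformed = [transform(row) for row in source_rows]

    # 4. Load the transformed rows, in schema column order.
    placeholders = ", ".join("?" for _ in schema)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(row[name] for name in schema) for row in transformed],
    )
    conn.commit()
    return len(transformed)


def to_typed(row):
    """Hypothetical transformation: cast the yield from text to a float."""
    return {"batch_id": row["batch_id"], "yield_pct": float(row["yield_pct"])}


# Hypothetical source data, as it might arrive from a partner file drop.
source_rows = [
    {"batch_id": "B-001", "yield_pct": "91.5"},
    {"batch_id": "B-002", "yield_pct": "88.0"},
]

conn = sqlite3.connect(":memory:")
loaded = run_pipeline(
    conn,
    schema={"batch_id": "TEXT", "yield_pct": "REAL"},
    table="batch_results",
    source_rows=source_rows,
    transform=to_typed,
)
print(f"Loaded {loaded} rows")
```

The point of the sketch is that the user supplies only the schema, the table name, and the transformation, while the pipeline handles the remaining steps.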
“Because this is an AWS-native platform we've found that we have to do very little engineering around the platform itself. For example, we used to have to run SSH tunneling and jump-boxes when connecting to external systems, but Etleap has that all built into it. This means less infrastructure to maintain, which again means that we don't have to hire more data engineers for ETL purposes. This lets us keep our ETL team lean, which is an important thing for us.”
Moderna has had a busy few years, to say the least. Its growing public profile has been accompanied by meteoric growth in data and by the need to consolidate that data for analytics and AI/ML workloads. Thanks to the Etleap architecture and private cloud (VPC) deployment option, Moderna does not worry about scalability and is confident that its data stays secure.
The small data engineering team can now add multiple new data sources every week. This depends on a highly functioning ETL environment that balances simplicity and robustness. Moderna’s data team and even non-technical users have been able to quickly learn to build pipelines with Etleap. Etleap’s support team and its deep ETL experience are also a powerful resource. Etleap support minimizes the pipeline maintenance Moderna needs to worry about and also helps quickly resolve edge cases in pipeline creation and maintenance.
This all enables a highly effective team and analytics environment that lets Moderna devote more resources to its essential work around mRNA science and medicine.