Serverless web app architectures can be a strikingly affordable option, offering near-limitless scalability when configured well. They can also process enormous data tasks, even at terabyte-level volumes, without a substantial OpEx commitment, letting you jumpstart your analytics capability without a huge monthly bill.
How? A basic serverless architecture built on 1) effectively unlimited object storage, such as Amazon S3; 2) ETL operations, crawlers, and data catalogs; and 3) an analytics compute engine such as Amazon Athena can work as follows:
- Phase 1: Collection. Gather structured, semi-structured, and unstructured data into a main bucket that consolidates data from different sources.
- Phase 2: Transformation. Transform consolidated data into a structured form compatible with table creation.
- Phase 3: Cataloging. For more efficient and targeted queries later using a compute engine such as Athena, catalog the transformed data into metadata tables.
- Phase 4: Querying. Use Athena or a similar analytics compute engine to execute SQL-like queries and discover insights.
A properly designed serverless analytics solution can process terabytes of data at savings of 40–90% compared with a serverful setup. Let's get into more detail about how each of these four phases works.
Phase 1: Collection
The collection phase gets all of your data into your main consolidated data bucket. You can send data directly, for example as flat files, or use an ETL job to query various sources and direct the output into the bucket.
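As a minimal sketch of the landing step (bucket layout, source names, and filenames here are hypothetical), raw objects can be keyed by source and arrival date so that downstream transformation jobs can find each day's data:

```python
from datetime import datetime, timezone
from typing import Optional

def raw_object_key(source: str, filename: str, now: Optional[datetime] = None) -> str:
    """Build a date-partitioned key for landing a raw file in the
    consolidated bucket, e.g. raw/crm/2024/05/17/export.csv."""
    now = now or datetime.now(timezone.utc)
    return f"raw/{source}/{now:%Y/%m/%d}/{filename}"

# An uploader (boto3, a Firehose delivery stream, etc.) would write each
# incoming file under this key in the main consolidated data bucket.
key = raw_object_key("crm", "export.csv",
                     datetime(2024, 5, 17, tzinfo=timezone.utc))
```

Keeping the date in the key path pays off later, in the cataloging phase, where the same path segments can serve as partition values.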
Phase 2: Transformation
Since your data comes from disparate sources, it's unlikely to be structured for analytics purposes. The transformation phase lets you arrange that data into an analytics-friendly structure.
In a generic serverless analytics architecture, ETL jobs read data, perform the necessary transformations (for example, dropping columns the analytics model doesn't need, or renaming fields), and then write the transformed data into a processed data bucket, where it becomes structured enough to query in a meaningful way.
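The per-record core of such a job can be sketched in a few lines. The field names below are illustrative, not part of any real schema; an actual ETL job (AWS Glue, for instance) would apply the same drop-and-rename logic at scale:

```python
# Hypothetical field mapping: keep only the fields the analytics model
# needs, renaming them to the processed schema's column names.
FIELD_MAP = {"cust_id": "customer_id", "amt": "amount_usd"}

def transform_record(raw: dict) -> dict:
    """Drop unneeded fields and rename the rest. An ETL job would apply
    this to every record read from the raw bucket before writing the
    result to the processed data bucket."""
    return {new: raw[old] for old, new in FIELD_MAP.items() if old in raw}

row = transform_record({"cust_id": 7, "amt": 19.99, "internal_note": "ignore"})
# Fields absent from FIELD_MAP (like internal_note) are dropped.
```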
Phase 3: Cataloging
A data catalog is simply a collection of metadata tables describing stored data. Although the data isn't actually stored in tables at this point, you'll want to query it quickly and efficiently as if it were; a data catalog allows for this.
Crawlers discover new data as it accumulates in the storage repository and update the partitions of catalog tables as necessary. Crawlers are the general-purpose solution, but they can be replaced by alternatives, such as having the ETL job that handles the raw-to-processed transformation update the affected catalog tables itself.
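Either way, registering a partition boils down to a small piece of DDL. As a sketch (table name, bucket, and partition scheme are all hypothetical), an ETL job could generate the Athena/Hive statement that points a new date partition at its prefix in the processed bucket:

```python
def add_partition_ddl(table: str, source: str, year: int, month: int, day: int,
                      bucket: str) -> str:
    """Produce the DDL a crawler or ETL job could run against Athena to
    register a newly arrived date partition in a catalog table."""
    location = f"s3://{bucket}/processed/{source}/{year:04d}/{month:02d}/{day:02d}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (source='{source}', year={year}, month={month}, day={day}) "
        f"LOCATION '{location}'"
    )
```

With partitions registered this way, queries that filter on `source`, `year`, `month`, or `day` scan only the matching prefixes instead of the whole bucket.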
Phase 4: Querying
Compute engines like Athena are built on Trino (formerly known as Presto), a distributed SQL query engine designed for massively parallel processing of huge amounts of data. Using it requires no specialized expertise beyond querying tables in a relational database like MySQL, Postgres, or Oracle. The Athena console in the web-based AWS Management Console can handle all ad-hoc queries, and Athena can also be used programmatically in applications, through ODBC or JDBC connections, or via the AWS SDK.
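For the programmatic route, a query submission reduces to a SQL string plus an execution context and an output location. The sketch below (database, table, and bucket names are hypothetical) assembles the keyword arguments that would be passed to a boto3 Athena client's `start_query_execution` call:

```python
def athena_query_request(sql: str, database: str, output_bucket: str) -> dict:
    """Assemble the kwargs for the AWS SDK's Athena start_query_execution
    call. Passing these to boto3.client('athena').start_query_execution(**req)
    would run the query and write results to the S3 output location."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{output_bucket}/results/"},
    }

req = athena_query_request(
    "SELECT customer_id, SUM(amount_usd) FROM sales GROUP BY customer_id",
    database="analytics_db",          # hypothetical catalog database
    output_bucket="my-athena-results",  # hypothetical results bucket
)
```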
Transitioning Away from Serverless Analytics
What if you need to pivot to a non-serverless, dedicated cluster for more complex analytics requirements in the future? Will your serverless analytics compromise future agility?
In the hypothetical scenario requiring a future expansion (perhaps more complex analytics requirements arise, or stricter performance and predictability requirements, or both), adding another ETL transformation job can adapt the data for loading into a dedicated analytics cluster: the job reads from the processed bucket and writes modeled data to a target bucket for that cluster. From there, it's a simple load command, followed by clearing out the modeled bucket.
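That load command can be as small as a single statement. As a sketch, assuming a Redshift-style cluster (the table, bucket, and IAM role names below are hypothetical), it might look like:

```python
def copy_command(table: str, modeled_bucket: str, iam_role_arn: str) -> str:
    """Sketch of the load step: a Redshift-style COPY statement that pulls
    modeled data from the target bucket into the dedicated cluster.
    After the load succeeds, the modeled bucket can be cleared out."""
    return (
        f"COPY {table} FROM 's3://{modeled_bucket}/modeled/' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS PARQUET"
    )
```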
When your analytics requirements become more sophisticated and you need visualization, you can connect Athena or another interactive query service to data visualization and dashboarding tools like Amazon QuickSight and Power BI to give your users that capability.
Whether the future brings complex dashboarding on top of a basic serverless analytics solution or a dedicated cluster for new analytics requirements, starting with serverless closes off none of these options.
Serverless analytics is a great way to start experimenting with analytics for workloads that don’t demand the OpEx of maintaining a dedicated cluster. It’s a cost-effective way to explore or improve your analytics capability.