Organizations today are creating and storing more data than ever before. Massive data sets allow them to build AI-driven features that customers now expect from modern business experiences: personalized recommendations, helpful chatbots, fraud detection, and more.
But with big data come big challenges.
For one, data isn’t always stored in the same systems: it may live in object storage such as Amazon S3 while being queried by entirely different engines such as Dremio or Trino. Being able to operate on a single copy of this data through one system gives you a significant advantage, but this has historically been difficult to achieve.
Organizations normally have two options: write and manage complicated data pipelines, which are often brittle and difficult to maintain, or make an all-hands-on-deck effort to move data into a single warehousing solution like Snowflake. The first option creates maintainability and scalability problems that make acting on your data much more difficult. The second raises concerns about vendor lock-in, which can deter data teams who don’t want to be stuck with a solution that offers little to no interoperability.
Snowflake and Iceberg tables
Luckily, Snowflake customers have a reliable solution at their disposal—and it begins with the Snowflake Polaris Catalog.
The Polaris Catalog is an open source catalog for Apache Iceberg. It’s a critical component for organizations with a multi-engine architecture because it allows data teams to manage data pipelines concurrently across multiple systems. Because every read and write operation is routed through the catalog, each engine operates on the same copy of the data and sees consistent results.
A look inside the mapping grid of an Iceberg Node in Coalesce and how it’s routing through a catalog.
With the Snowflake Polaris catalog, you can now take advantage of Apache Iceberg tables stored in object storage such as AWS S3, Google Cloud Storage, or Azure Blob Storage. Your data never has to be moved or copied into Snowflake; instead, it can be queried directly from the source through Snowflake’s query engine. This resolves the concern around vendor lock-in: organizations can query any Iceberg table directly from the source system while still using Snowflake’s powerful features, regardless of where their data lives.
Creating an external volume and a catalog
With Iceberg table support, Snowflake can query data from a multitude of systems including Amazon Web Services, Google Cloud, Microsoft Azure, Confluent, Dremio, and more. Additionally, Snowflake users can manage Iceberg tables within the Snowflake platform itself, exposing these tables to other systems that support Iceberg tables for added interoperability.
In order to start using Iceberg tables and take advantage of this interoperability, there are two primary requirements within Snowflake. The first is to create an external volume, which Snowflake defines as follows:
“An external volume is a named, account-level Snowflake object that you use to connect Snowflake to your external cloud storage for Iceberg tables. An external volume stores an identity and access management (IAM) entity for your storage location. Snowflake uses the IAM entity to securely connect to your storage for accessing table data, Iceberg metadata, and manifest files that store the table schema, partitions, and other metadata.”
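As a rough sketch, creating an external volume for an S3 bucket looks something like the following. The object names, bucket, and role ARN here are placeholders; Google Cloud Storage and Azure use their own STORAGE_PROVIDER values and credential parameters.

```sql
-- Sketch: external volume pointing at an S3 location (all names are placeholders)
CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
    )
  );
```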
Second, you need to create a catalog. Put simply, a catalog allows a compute engine to manage and load Iceberg tables. When your Iceberg tables are managed by a system other than Snowflake, you need a catalog integration: an account-level Snowflake object that stores information about how your table metadata is organized.
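A minimal sketch of a catalog integration for Iceberg tables whose metadata lives in object storage (the integration name is a placeholder; Snowflake also supports other catalog sources, such as AWS Glue, with their own parameters):

```sql
-- Sketch: catalog integration for Iceberg metadata stored in object storage
CREATE CATALOG INTEGRATION iceberg_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = ICEBERG
  ENABLED = TRUE;
```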
Once your external volume and catalog exist, you can set them as defaults for a database. That configuration will look something like this:
ALTER DATABASE iceberg_tutorial_db SET CATALOG = 'SNOWFLAKE';
ALTER DATABASE iceberg_tutorial_db SET EXTERNAL_VOLUME = 'iceberg_external_volume';
Once your external volume and catalog are configured, you can load and query data from your storage location directly within Snowflake. You now have the ability to query data across multiple systems within your organization without constraint.
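As an illustration, registering an externally managed Iceberg table and querying it might look like the following. This is a sketch that assumes a catalog integration named iceberg_catalog_int has already been created; the table name and metadata file path are placeholders for whatever your Iceberg writer produced in object storage.

```sql
-- Sketch: register an externally managed Iceberg table from its metadata file
CREATE ICEBERG TABLE customer_events
  EXTERNAL_VOLUME = 'iceberg_external_volume'
  CATALOG = 'iceberg_catalog_int'
  METADATA_FILE_PATH = 'customer_events/metadata/v1.metadata.json';

-- Query it like any other Snowflake table
SELECT COUNT(*) FROM customer_events;
```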
Configuration options that Coalesce exposes to manage Iceberg tables.
All this comes with an extra bonus: because the data you are querying is not stored in Snowflake, you are only charged for the compute used to query the data from the Iceberg tables—there is no storage cost associated with querying Iceberg tables in Snowflake.
Coalesce makes it easier
Setting up the ability to create and manage Iceberg tables does require some technical and architectural knowledge. But with Coalesce, customers can easily create and manage Iceberg tables directly in the Coalesce platform. Through Coalesce Marketplace, you can install the Iceberg package and immediately take advantage of all the benefits discussed here.
Coalesce provides support for creating Iceberg tables at any of these three levels:
- Snowflake catalog
- AWS Glue catalog
- Object storage (AWS, Azure, and GCP)
This means that users can take advantage of each of the possible variations of Iceberg table support within Snowflake. As support for additional Iceberg catalogs becomes available, Coalesce will continue to implement that support into our Iceberg package.
With this functionality, Coalesce customers can create external volumes directly in Coalesce from a single interface, with no need to learn the syntax or take multiple steps to complete the configuration. Because Coalesce assumes a catalog has already been created, you can pass it in as a catalog integration and begin creating and managing Iceberg tables in Coalesce.
Opening the Node selector to add an Iceberg table to a pipeline in Coalesce.
What all of this means is that you can build your data pipelines in Coalesce using all the usual Snowflake functionality such as Dynamic Tables, Cortex Functions, and Tasks. But now, thanks to Iceberg tables, you can also incorporate additional data sources and functionality to gain more data interoperability than has ever been available within the Snowflake ecosystem—and all without needing years of data engineering experience to accomplish it.
Watch the demo video below to learn how to take advantage of all of this functionality:
Try it for yourself
Data teams can try out our Iceberg Package, which enables full use of Snowflake-supported Iceberg functionality. Additionally, our Streams and Tasks Package includes an Iceberg Table with Task node, which uses Snowflake-managed Iceberg tables.
Want to learn more about Iceberg tables and Coalesce? Contact us to request a demo.