Four years and a bit may not seem like a long time, but in my experience working with Snowflake, it feels like dog years. Within this relatively short span, I’ve witnessed it all. I was part of the initial group of SEs in APJ and later became the first Field CTO supporting the region. My journey with Snowflake has taken me from solving data warehousing problems to supporting a platform for applications and AI/ML. It has been an exhilarating experience, and I will always be grateful for the opportunities it has brought me.
Snowflake provides all the Lego pieces for data teams and customers
In my role as Field CTO, I was responsible for supporting Snowflake’s data engineering and AI/ML workloads in the APJ region. One of the most rewarding aspects of my journey with Snowflake was the launch of Snowpark. This played a vital role in uniting all stakeholders – data engineers, ML engineers, data scientists, analysts, and business users – on a single platform. Collaborating closely with the product and engineering teams, as well as early customers, we successfully launched Snowpark and its capabilities into the platform within a remarkably short timeframe.
Snowpark enabled Snowflake to offer native support for multiple programming languages, including SQL, Python, Java, and Scala. This capability has since expanded to purpose-built compute options, such as Snowpark Optimized Warehouses for memory-intensive workloads and, most recently, Snowpark Container Services for GPU-intensive workloads.
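To make the compute side concrete, here is a minimal sketch using the Snowpark for Python client to create and switch to a Snowpark Optimized Warehouse. The connection details, warehouse name, and size are placeholders, not a prescription.

```python
from snowflake.snowpark import Session

# Placeholder connection details; in practice these come from a secrets store.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Snowpark Optimized Warehouses provide more memory per node for memory-intensive
# workloads; the WAREHOUSE_TYPE option below selects that warehouse class.
session.sql(
    "CREATE WAREHOUSE IF NOT EXISTS snowpark_opt_wh "
    "WAREHOUSE_SIZE = 'MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
).collect()
session.use_warehouse("snowpark_opt_wh")
```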
Snowflake empowers data engineers to use SQL for efficient data transformations and data modeling geared toward business consumption. With Snowpark, data engineers can now use Python for more complex transformations, leveraging familiar dataframe constructs and third-party packages. The advantage of using Python in Snowpark is that data engineers are relieved of the burden of creating and maintaining runtime environments, managing dependencies, and addressing the security and governance concerns that come with pulling open source packages from external sources.
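For a flavor of that dataframe style, here is a minimal Snowpark Python sketch; the table and column names are invented for the example, and the connection details are placeholders.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Placeholder connection details.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# Lazily reference a source table; nothing is pulled down to the client.
orders = session.table("RAW.SALES.ORDERS")

# Filter, aggregate, and rename: the whole plan is pushed down and executed in Snowflake.
daily_revenue = (
    orders.filter(col("STATUS") == "COMPLETED")
          .group_by(col("ORDER_DATE"))
          .agg(sum_(col("AMOUNT")).alias("TOTAL_REVENUE"))
)

# Materialize the result as a table for downstream consumption.
daily_revenue.write.mode("overwrite").save_as_table("ANALYTICS.SALES.DAILY_REVENUE")
```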
The Snowpark Python framework on Snowflake not only empowers data scientists and ML engineers to perform data processing and feature engineering activities but also enables them to do so at scale, closer to where the data resides. Furthermore, at the recently concluded Snowflake Summit 2023, Snowflake introduced Snowpark ML—a comprehensive suite of tools, including SDKs and underlying infrastructure, designed for the development and deployment of machine learning models directly on the Snowflake platform.
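For a sense of what model training on the platform looks like, here is a hedged sketch using the Snowpark ML modeling API from the snowflake-ml-python package. The table and column names are illustrative, and because Snowpark ML was in preview at the time of writing, the exact API details may differ.

```python
from snowflake.snowpark import Session
from snowflake.ml.modeling.xgboost import XGBRegressor

# Placeholder connection details, as in the earlier sketches.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# A feature table produced by an upstream Snowpark pipeline (name is illustrative).
train_df = session.table("ANALYTICS.SALES.TRAINING_FEATURES")

# Column-name based API: the estimator reads features and labels straight from the
# Snowpark DataFrame, and training runs on Snowflake compute.
model = XGBRegressor(
    input_cols=["PRICE", "QUANTITY", "DISCOUNT"],
    label_cols=["TOTAL_REVENUE"],
    output_cols=["PREDICTED_REVENUE"],
)
model.fit(train_df)

# Score a Snowpark DataFrame; predictions stay inside Snowflake.
predictions = model.predict(train_df)
predictions.show()
```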
As a Field CTO, I had the privilege of traveling to various parts of the APJ region and engaging with a wide range of customers and prospects. Undoubtedly, this was the most fulfilling aspect of my job! I had the opportunity to meet customers at different stages of their Snowflake journey, from those in the evaluation phase or early adoption stage to those who were scaling their Snowflake implementation and pushing the boundaries of its capabilities.
It is essential to recognize that Snowflake, as an exceptional data platform, offers its customers a comprehensive set of building blocks that empower them to tailor the platform to their specific use cases. Whether it’s data ingestion at any latency and in any format, transformations using preferred programming languages and patterns, or support for diverse modeling approaches to enable analytics, AI, and ML, Snowflake provides the tools and capabilities. However, what Snowflake does not provide are predefined patterns or standards out of the box. When customers sign up for a Snowflake account, they are presented with a blank slate of immense potential—it is up to the customers to unlock the value and realize their specific objectives.
Snowflake greatly relies on its partners to provide valuable support to customers in the areas of data ingestion, transformation, and modeling on the Snowflake platform. Notably, Snowflake partners such as Fivetran play a crucial role in ensuring reliable data ingestion into Snowflake.
One of the key areas where challenges often arise during the implementation of Snowflake is the transformation step and beyond. Customers end up using one of three patterns to transform their data:
- Custom code orchestrated using open source tools – In this pattern, customers typically write their transformation code as stored procedures in SQL, JavaScript, or Python, and orchestrate the process using open source tools such as Airflow or Argo (a sketch of this pattern follows the list). While this option can work well for engineering-centric organizations, it often leads, regardless of the skills involved, to inconsistencies in the codebase and makes the code difficult to manage and maintain through future changes and updates.
- Legacy ETL tools – Many large customers rely on legacy tools that were originally built for on-premises ETL against on-premises source systems and data platforms. While these tools were valuable in the early 2000s, they have struggled to keep up with the migration to the cloud. Recent efforts to adopt cloud-native versions of these products have proven cumbersome, with many customers reporting suboptimal code generation and code pushdown, leading to increased costs on their Snowflake platform.
- Open source tools for transformations – More recently, several open source tools have emerged to facilitate SQL pushdown on Snowflake for transformation tasks. These tools excel at pushing code down and executing transformations close to the data, which is advantageous. However, they often rely on code-heavy pushdown approaches that can overwhelm the Snowflake optimizer, resulting in slower execution. Additionally, the absence of visual pipeline DAGs makes the resulting pipelines harder to maintain and optimize. Furthermore, because these tools typically support multiple platforms, they lack optimizations tailored specifically for Snowflake, which can drive up costs. This has been a common complaint from customers scaling their Snowflake implementations with tools in this category.
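To ground the first of these patterns, here is a hypothetical sketch of hand-written transformations orchestrated with Airflow’s Snowflake provider. The connection ID, schedule, and stored procedure names are all placeholders, not a recommended setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="nightly_snowflake_transform",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task calls a hand-written stored procedure that lives in Snowflake.
    build_dim = SnowflakeOperator(
        task_id="build_customer_dim",
        snowflake_conn_id="snowflake_default",
        sql="CALL ANALYTICS.PUBLIC.BUILD_CUSTOMER_DIM()",
    )
    build_fact = SnowflakeOperator(
        task_id="build_orders_fact",
        snowflake_conn_id="snowflake_default",
        sql="CALL ANALYTICS.PUBLIC.BUILD_ORDERS_FACT()",
    )

    # Dependencies between transformations are maintained by hand in the DAG.
    build_dim >> build_fact
```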
Lego instruction booklet – Coalesce.io
Through my conversations with customers across the region, I discovered that a significant portion of their Snowflake credits, ranging from 40% to 80%, was being allocated to data transformation, processing, and modeling tasks. Considering the substantial investment involved, I found it astonishing that none of the existing transformation tools were specifically addressing the need to scale pipelines on the Snowflake platform.
I discovered Coalesce towards the end of last year, and I was incredibly impressed by how directly the platform addressed the exact pain points my customers had been trying to solve. One example is Coalesce’s collaboration with Snowflake engineering to support Dynamic Tables as soon as the feature was announced in Private Preview in November 2022, and again when it reached Public Preview at Summit 2023 in June. While other tools have yet to add support for this functionality, Coalesce has been at the forefront, actively meeting customer demands. With its exclusive focus on Snowflake, Coalesce enables customers to readily adopt the latest product innovations introduced by Snowflake.
Coalesce is a data transformation platform built exclusively for Snowflake. This alignment is crucial for Snowflake customers who rely on dedicated transformation platforms to efficiently build, manage, maintain, and scale their data engineering workloads, ultimately driving business value. Just like Snowflake, Coalesce is cloud-native and delivered as a service. This means customers can seamlessly connect to their Snowflake account and start building pipelines without the need to worry about infrastructure, software installation, or upgrades. Coalesce scales effortlessly alongside your growing needs.
Coalesce has placed a strong emphasis on supporting customers in standardizing their data engineering and transformation tasks since its inception. The platform offers pre-built “nodes” – User-Defined Nodes, or UDNs – that assist customers in implementing modeling architectures like Dimensional modeling and Data Vault modeling. These nodes significantly reduce development time, establish standards and best practices, automatically generate code, and provide visibility into lineage for improved observability and impact analysis.
Coalesce is actively working on launching support for Snowflake Snowpark capabilities on its platform. This expansion opens up the platform to accommodate complex data engineering tasks using programming languages like Python. Snowpark support on Coalesce also enables the implementation of various ML and AI workflows, including Feature Engineering pipelines, Model training pipelines, and Model inference pipelines. This empowers customers with end-to-end MLOps capabilities for managing their ML and AI use cases.
I am thrilled to join Coalesce at this opportune time. The possibilities that arise from Snowpark support are immense, and I am eager to leverage my experience and knowledge to contribute to the launch of Snowpark support on Coalesce.
To learn more about Coalesce and stay in the loop on future updates, get in touch to request a demo or try it out with a free account.