Businesses are now able to produce more valuable data insights than ever, thanks to an explosion of tools that support the analytics engineering function. Cloud data warehouses like Snowflake, Amazon Redshift, and Google BigQuery make it possible to collect, store, and run complex analyses in a single location. But it can be easy to be caught off guard by runaway cloud service costs, and finance teams often find themselves unequipped to manage these new expenses.
Data teams can help avoid these situations by embracing financial operations as a shared responsibility. Our data team here at Ramp applies the same operational rigor to our own processes that we bring to our customers. In this article, we’ll discuss how Ramp’s data team implemented targeted cloning strategies in Snowflake to keep up with the pace of innovation without breaking the bank.
Ramp leverages an extract, load, transform (ELT) paradigm. At the core of our data operations is Snowflake, which acts both as a refinery for raw data (in conjunction with dbt) and as a warehouse for cleaned data consumed by a variety of functions across marketing automation, sales operations, underwriting, and capital markets, to name a few. Snowflake is one of the newest additions to our stack, having joined our data toolbelt in July 2022. We chose Snowflake for its ability to scale alongside us, both in handling our growing workloads and in providing features that let our developers continue to iterate quickly.
“Trust is built in drops and lost in buckets.”
– Kevin Plank
Since Snowflake powers Ramp’s most important single-source-of-truth reporting outputs, the data team at Ramp invests heavily in test-driven development. In practice, this means a variety of continuous integration (CI) checks that ensure each pull request builds correctly and produces values that align with expectations.
When performing CI checks, our goal is to simulate a production environment as closely as possible and observe how changes interact with this environment. To do this, we’ve built a series of processes that kick off when a pull request is opened.
Each pull request spawns its own database, named after the pull request (for example, PR_1234), and the dbt project is built inside it. It is a simple process, but one that can lead to redundant consumption of your cloud credits if left unattended.
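As a rough sketch, the original flow amounted to building the entire project into that per-PR database (commands simplified here; the PR database is assumed to be set on the dbt target):

dbt deps
dbt build   # builds and tests every model, effectively duplicating production for each pull request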
As the data team at Ramp migrated over to Snowflake, we also welcomed nearly 20 new data team members and grew the number of data models maintained into the thousands. The velocity of pull requests flowing through our systems grew and weaknesses in our CI process began to surface in the form of rising costs.
What’s not measured in this graph is the hidden cost of developer idle time. Our analytics engineers could spend up to 30 minutes waiting for CI to build before they could start debugging failures, which quickly adds up in developer costs. Under a magnifying glass, it was apparent that our simplistic approach was building unnecessary models and keeping the meter running. Our first iteration effectively created a full duplicate of our production environment: each database spawned by a pull request could have been swapped in for production. What we needed instead was a targeted approach that tests only the models affected by a developer’s changes.
Our plan of action was to reduce the number of resources needed in every step of the CI check through a combination of features offered by Snowflake and dbt: dbt artifacts and Snowflake’s zero-copy clones.
Selectors in dbt are powerful configurations that help narrow down the set of models that need to be built. In particular, the state:modified selector, used with the --state flag, identifies changes against a previous version of our dbt project. However, in order to leverage this selector in CI, we need to store our artifacts for retrieval during the CI check.
Setting this up meant that after each production run, we saved our manifest.json file in an S3 bucket to be distributed as needed. When it came time to run CI, we could copy that manifest.json into the workspace and pass it to --state to check changes against what was live in production, drastically reducing the number of models built.
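The production side of this is a single upload after each run. A minimal sketch, assuming dbt’s default target/ artifact directory and the same illustrative bucket path used below:

dbt run
aws s3 cp target/manifest.json s3://path/to/file   # publish the latest production manifest for CI to consume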
From a technical standpoint, the commands invoked by CI looked like:
aws s3 cp s3://path/to/file manifest.json
dbt build --select +state:modified --state .
While the team had eliminated the vast majority of unneeded models, some of our most commonly updated data lineages were still problematic due to computationally heavy upstream models (think event stream data with window functions). How could we avoid building these models and incurring those costs? There were two options available: the defer functionality unlocked by storing dbt artifacts, or Snowflake’s zero-copy clone functionality.
On our first attempt, we decided to work with existing resources and extend dbt to use the --defer functionality. The defer flag is a powerful configuration provided by dbt Labs that avoids the need to build upstream models by pointing to an existing model in another environment. While this worked as intended, debugging a pull request review was not always straightforward: analytics engineers needed to keep a mental note of when to swap out namespaces to ensure their updates were working as intended.
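For reference, a defer-based invocation looks roughly like this (selector shown for illustration): unselected upstream models are resolved against the production namespace recorded in the manifest rather than being rebuilt.

dbt build --select state:modified+ --defer --state .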
Knowing that we needed to maintain a single namespace to debug, we tackled this problem by mimicking dbt’s defer functionality through Snowflake’s zero-copy clones. With these clones, developers could find and debug all of their issues in a single database, rather than hopping between two databases.
On the technical side, we addressed this with a novel chain of commands and a bespoke dbt macro to clone the correct upstream models.
First, we needed to identify which upstream models needed to be cloned. This proved to be difficult since, as of this writing, dbt’s selected_resources Jinja function is not supported in a dbt run-operation command. To get around this limitation, we listed the models upstream of those modified and redirected the output of dbt ls into a CSV file, which would later be seeded as part of our CI process.
dbt ls --select +state:modified --exclude state:modified --resource-type model --state . >> data/upstream_models.csv
dbt seed
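One detail worth noting: the macro below joins on a seed column named models, so the CSV needs a header row with that name. The append (>>) above assumes the file already exists with that header; if not, you would create it first, for example:

echo "models" > data/upstream_models.csv   # header row matching the column the macro joins on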
The macro, shown below, has two responsibilities: it queries the source database’s information schema to find which of the upstream models exist as tables, and it then creates a zero-copy clone of each one in the destination database.
{% macro clone_upstream_models(destination_db, source_db='ANALYTICS') %}
    {% set destination_db = destination_db|upper %}

    {# Find the seeded upstream models that exist as tables in the source database #}
    {% call statement('find_models', fetch_result=True) %}
        select distinct
            existing_models.table_schema as "table_schema",
            existing_models.table_name as "table_name",
            existing_models.table_type as "table_type"
        from {{ source_db }}.information_schema.tables as existing_models
        inner join {{ destination_db }}.staging.upstream_models as new_models
            on existing_models.table_name = upper(new_models.models)
        where existing_models.table_schema not in ('INFORMATION_SCHEMA', 'PUBLIC')
            and "table_type" = 'BASE TABLE'
    {% endcall %}

    {%- if execute -%}
        {% set model_table_results_table = load_result('find_models') %}
    {%- endif -%}

    {# Build a relation in the destination database for each table found #}
    {%- if model_table_results_table and model_table_results_table['table'] -%}
        {%- set tbl_relations = [] -%}
        {%- for row in model_table_results_table['table'] -%}
            {%- set tbl_relation = api.Relation.create(
                database=destination_db,
                schema=row.table_schema,
                identifier=row.table_name,
                type='table'
            ) -%}
            {%- do tbl_relations.append(tbl_relation) -%}
        {%- endfor -%}
        {%- set relations = tbl_relations -%}
    {%- else -%}
        {%- set relations = [] -%}
    {%- endif -%}

    {{ dbt_utils.log_info("Cloning necessary tables...") }}

    {# Zero-copy clone each upstream table and hand ownership to the developer role #}
    {%- for relation in relations -%}
        {% set clone %}
            create or replace transient table {{ relation.database }}.{{ relation.schema }}.{{ relation.identifier }}
                clone {{ source_db }}.{{ relation.schema }}.{{ relation.identifier }};
            grant ownership on table {{ relation.database }}.{{ relation.schema }}.{{ relation.identifier }} to role developer revoke current grants;
        {% endset %}
        {% do run_query(clone) %}
        {{ dbt_utils.log_info("Successfully created " ~ relation.database ~ "." ~ relation.schema ~ "." ~ relation.identifier) }}
    {% endfor %}

    {{ dbt_utils.log_info("Cloning completed") }}
{% endmacro %}
Altogether, the full set of CI commands invoked looked like:
aws s3 cp s3://path/to/file manifest.json
dbt ls --select +state:modified --exclude state:modified --resource-type model --state . >> data/upstream_models.csv
dbt seed
dbt run-operation clone_upstream_models
dbt build --select state:modified+1 --state .
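One note on the run-operation step: the macro signature requires a destination_db argument. Invoked by hand with dbt’s --args flag, it might look like the following (PR database name illustrative):

dbt run-operation clone_upstream_models --args '{"destination_db": "PR_1234"}'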
The CI process was now building and testing only the modified models and their direct downstream dependencies. Additionally, any changes that did not align with expectations could be traced without mentally tracking which database to check.
The final result left us in a great place, balancing speed and cost without losing the fidelity to debug issues. We saw a dramatic dip in the rolling 7-day average cost of these warehouses, one that has been mostly sustained even in the face of a higher velocity of pull requests.
There’s no shortage of data products being built on top of your cloud data warehouse, and it’s more important than ever for data teams to be aware of the cost structures and efficacy of their tools. Our targeted cloning solution is just one example of the many interesting projects being tackled by the data team at Ramp.