What is indexing in a database, and why is it important?

A set of techniques to skip data that the query doesn't need. On cloud data platforms, this means file or partition pruning rather than B-tree lookups. Without effective pruning, queries scan entire tables, which is slow and expensive on consumption-priced platforms.

Does indexing in databases always improve performance?

No. Each clustering or indexing approach carries a maintenance cost, can slow writes, and does nothing for queries that don't filter on the relevant column.

How does indexing in the database impact cloud costs?

Effective pruning reduces bytes scanned per query, which directly reduces compute charges on consumption-priced platforms.

When should you avoid indexing in a database?

Small tables, low-cardinality columns, write-heavy staging tables, and any table where the query workload doesn't filter on the candidate column.

Is indexing in the database useful for large datasets?

Critical. The larger the table, the bigger the win from pruning and the higher the cost of unprunable queries.

Database Indexing: Improve Cost and Performance

Key takeaways

None of the three platforms supports traditional B-tree or hash indexes. They use file-skipping mechanisms instead.
The "index" you choose is really the column the data is physically organized on.
Maintenance overhead from re-clustering and OPTIMIZE jobs shows up directly on the compute bill.
Skip indexing on small tables, low-cardinality columns, and write-heavy staging tables.
Track pruning effectiveness, not just query latency.

Indexing in a database used to mean B-tree indexes on transactional row stores. On modern cloud data platforms, that primitive doesn't exist, and trying to recreate it is one of the more common ways teams waste money. Snowflake, Databricks, and BigQuery each have their own mechanisms for accelerating queries: micro-partitions and clustering keys, Z-ordering and bloom filters, and partitioning and clustering. Understanding what each one actually does, and what it costs to maintain, is the difference between fast queries and an inflated invoice. The patterns below come from real production environments where the wrong choice meant six-figure overruns; the fix wasn't more indexing, it was less.

What is indexing in a cloud data platform?

On a traditional database, an index is a separate data structure (typically a B-tree, a self-balancing tree that lets the engine find any row in roughly the same number of steps) that points to rows in a table. The database reads the index first, then jumps directly to the matching rows. On Snowflake, Databricks, and BigQuery, that data structure doesn't exist. There's no CREATE INDEX in the way a PostgreSQL or SQL Server engineer would recognize it.

Instead, the cloud warehouses use file-skipping. The data is stored in many small files on cloud object storage, and the engine keeps metadata about each file: the minimum and maximum value of every column, null counts, and distinct value approximations. When a query filters on a column, the engine looks at this metadata first and skips entire files whose ranges don't include the filter value.

Think of it like a librarian organizing a library. A traditional index is a separate card catalog: every book has its own card, and each card points to where the book sits. File-skipping is more like organizing books on shelves so that all books about the same topic end up nearby; when someone asks for chemistry books, the librarian knows which shelves to skip without looking up individual cards. "Indexing" on these platforms is really shelf organization.

The key concept

On Snowflake, Databricks, and BigQuery, the question isn't "Should I add an index?" The question is "Is the data physically organized so the engine can skip the parts it doesn't need?" That re-framing makes everything in this guide easier.

Why indexing decisions matter for cost and performance

On a traditional database, a missing index hurts query latency. On a cloud data platform, it hurts the invoice. The connection is direct: queries that can't skip data scan more data, and on consumption-priced platforms, bytes scanned and compute time map directly to dollars.

A query against a 10TB BigQuery table that scans the full table costs around $60 per run on on-demand pricing. The same query, written against a partitioned and clustered version of the table that prunes to 100GB, costs about $0.60. Run that query a few hundred times a day, and the difference is a five-figure annual line item.

On Snowflake, the math runs through warehouse credits rather than bytes scanned, but the principle is the same. A clustered table that prunes 95% of micro-partitions before scanning needs less compute to answer the query, which means a smaller warehouse, faster auto-suspend, or both. Lower compute means fewer credits, which means a smaller invoice.

On Databricks, OPTIMIZE and Z-ordering trade write-time cost (the OPTIMIZE job runs a compute cluster) against read-time savings. The right answer depends on the read-to-write ratio. A read-heavy reporting table benefits from aggressive Z-ordering; a write-heavy staging table doesn't.

What "indexing" actually means on each platform

With the framing in place, here's what each platform actually does.

Snowflake

Data lives in 16MB micro-partitions, with min/max metadata tracked on every column. The optimizer prunes partitions that can't contain matching rows. A clustering key tells Snowflake to physically organize the data so pruning is effective for your most common predicates. Snowflake handles micro-partition selection automatically as data lands; clustering only matters when the natural ingestion order doesn't match your query patterns. For a table loaded chronologically and queried by date, no clustering is needed at all. The pruning works for free.

The non-obvious part is that Snowflake's micro-partition metadata extends beyond min/max to include null counts and distinct value approximations. This is why a query filtering on a column that's mostly null can prune aggressively even without a clustering key. It's also why low-cardinality columns rarely benefit from clustering: the metadata already does the work.

Databricks

Delta tables (Databricks' transactional table format on top of Parquet files) use file-level statistics for pruning. Z-ordering rewrites the underlying Parquet files so that values close on the chosen column end up in the same files. Bloom filter indexes add a probabilistic per-file check (a bloom filter answers "definitely not in this file" or "maybe in this file" without reading the file), which helps on high-cardinality string columns where Z-ordering alone doesn't. Z-ordering is multi-dimensional clustering: you can Z-order on up to three or four columns and get reasonable pruning on each. Beyond four columns, the math falls apart, and pruning effectiveness degrades sharply.

BigQuery

No row-level index exists. The two pruning mechanisms are partitioning (typically on a date or integer column) and clustering (up to four columns, ordered by selectivity). Partition pruning is the cheapest optimization available; if your fact tables aren't partitioned by date, that's the first fix. Clustering on top of partitioning helps further when queries also filter on a non-partition column. BigQuery clusters within partitions, not across them. Clustering on customer_id with date partitioning means each daily partition is sorted by customer_id, but the global ordering is by date. The common Snowflake problems guide covers similar pitfalls on the Snowflake side.

Side-by-side: Indexing mechanisms across the three platforms

Platform	Mechanism	Syntax	Effectiveness check
Snowflake	Micro-partitions + clustering keys	ALTER TABLE t CLUSTER BY (col)	SYSTEM$CLUSTERING_INFORMATION('t')
Databricks	Z-ordering + bloom filters on Delta	OPTIMIZE t ZORDER BY (col)	Files read per query
BigQuery	Partitioning + clustering	PARTITION BY DATE(col) CLUSTER BY (col2)	Bytes scanned per query

Table 1: The three platforms use different mechanisms but solve the same problem: skipping data that the query doesn't need.

When the maintenance cost outweighs the benefit

Re-clustering and OPTIMIZE jobs aren't free. Snowflake's automatic clustering consumes credits when reorganizing partitions. A Databricks OPTIMIZE on a 1TB Delta table runs a job cluster for 20 to 40 minutes. BigQuery clustering rewrites happen on DML (data manipulation language: INSERT, UPDATE, DELETE), so frequent updates mean more rewrite overhead.

Real cost example

Automatic clustering on a 5TB Snowflake table with daily streaming inserts can consume 50 to 100 credits per month, maintaining the cluster order. At three dollars per credit, that's $150 to $300 monthly per table. On a few high-traffic tables, the maintenance cost alone reaches four figures. Whether that's worth it depends entirely on whether the queries against those tables actually benefit from the clustering.

Skip clustering or Z-ordering when:

The table is under roughly 1GB. Pruning has nothing to prune.
Workload queries don't filter on the clustered column. Clustering on created_at doesn't help queries filtering on customer_id.
The table is write-heavy and read-light. Staging tables often get worse, not better, with clustering.
The column has fewer than 10 distinct values. Min/max statistics prune perfectly without clustering.
Natural ingestion order already matches query patterns. A table loaded in date order and queried by date needs nothing extra.

The decision is workload-driven. Track which columns appear in WHERE clauses across your top 50 queries, then cluster on the column that combines high selectivity with high query frequency. A column hit by 80% of queries but with only 100 distinct values is a worse choice than a column hit by 30% of queries with 10 million distinct values.

Implementation specifics

Snowflake

ALTER TABLE my_table CLUSTER BY (customer_id, event_date);

Check effectiveness with SYSTEM$CLUSTERING_INFORMATION('my_table'). The average_overlaps metric below 1.0 is healthy; above 5.0 means re-clustering isn't keeping up. Automatic clustering is on by default and can be suspended if costs spike.

Order matters: clustering on (customer_id, event_date) prunes well for queries filtering on customer_id, with secondary pruning on event_date. The reverse order optimizes for date-first queries. Pick based on query frequency, not what looks tidy.

Databricks

OPTIMIZE my_table ZORDER BY (customer_id);

Run nightly for active tables. For high-cardinality string columns, add a bloom filter:

CREATE BLOOMFILTER INDEX ON TABLE my_table FOR COLUMNS(email);

Bloom filters add storage but eliminate file scans for equality predicates. They don't help range queries; for those, Z-ordering is the right tool.

A practical pattern on Databricks: combine Z-ordering on the primary filter column with bloom filters on secondary high-cardinality columns. A Delta table queried mostly by date but sometimes by user email gets ZORDER BY (date) plus a bloom filter on email. Both queries prune effectively without forcing a multi-column Z-order that helps neither. For more on the operational side, see challenges in cloud data warehouses.

BigQuery

PARTITION BY DATE(created_at)

CLUSTER BY customer_id, product_id

Set partitioning at table creation. Adding it later requires rewriting the table, which on a 10TB fact table means an hours-long job and a temporary doubling of storage cost. List cluster columns from highest to lowest cardinality; BigQuery applies cluster pruning only on the leading columns, so a high-cardinality column at position four gets nearly no pruning benefit.

Watch the BigQuery clustering threshold: tables under 1GB don't benefit from clustering at all because BigQuery's block size is roughly 1GB. The metadata says clustering is enabled, but the engine has nothing to cluster.

Indexing for write-heavy workloads

The advice changes when writes dominate reads. Streaming ingestion at thousands of records per second amplifies any maintenance overhead because each new file triggers metadata updates and potentially clustering re-evaluation.

Minimize cluster column count

Each additional column in a clustering key or Z-order list increases the work the maintenance job has to do. On a write-heavy table, prefer a single high-selectivity cluster column over a multi-column key. The marginal pruning benefit of additional columns rarely justifies the extra rewrite cost on a hot table.

Decouple ingestion from query optimization

The cleanest pattern for very high-throughput environments is two tables: a staging table optimized for writes (no clustering, minimal metadata) and a reporting table optimized for reads, populated from staging on a schedule. This keeps ingestion fast and gives you full control over when expensive maintenance jobs run.

Schedule OPTIMIZE during low-traffic windows

On Databricks specifically, OPTIMIZE on a busy table can contend with concurrent reads. Running it at 3 AM rather than during business hours often reduces total disruption even though the work itself is the same.

A counter-example worth remembering

A Databricks team Z-ordered a high-throughput Delta table on four columns to support every conceivable query pattern. Reads got marginally faster on a few patterns, but writes took 4x longer, and the nightly OPTIMIZE job grew from 30 minutes to over three hours. The fix was reducing to a single Z-order column on the dominant filter and accepting that secondary query patterns would scan more files. Total cost dropped roughly 60% and write latency returned to baseline. More clustering is not strictly better; it's a trade-off you have to measure.

Quick rule

On any table receiving more than 100 writes per second, start with one cluster column. Add more only when measured query patterns prove the benefit outweighs the rewrite cost.

Monitoring what is working

The single most important metric is bytes scanned (BigQuery), partitions scanned (Snowflake), or files read (Databricks) per query. If a query against a 10TB table scans 10TB, none of your optimizations are doing anything. If it scans 100GB, pruning is working.

Track this per query pattern, not in aggregate. A regression usually means one of three things: clustering has drifted on Snowflake, the OPTIMIZE schedule fell behind on Databricks, or a query started filtering on a non-partitioned column on BigQuery. All three are catchable with metadata-only monitoring; no production query replay needed.

A war story worth knowing

A team running daily reports on Snowflake noticed query times had quadrupled over a quarter. The clustering hadn't drifted: SYSTEM$CLUSTERING_INFORMATION still looked fine. What changed was that a new dashboard had started running queries filtering on a non-clustered column, and those scans had become the dominant cost. The fix was adding a second clustered table for the new workload, not re-clustering the existing one. The lesson: clustering health is workload-relative. A table that's perfectly clustered for one query pattern can be useless for another, and the metrics on the table itself won't tell you that. Your AI DBA for every database handles this kind of continuous monitoring across all three platforms.

Database indexing for cost and performance: A guide for Snowflake, Databricks, and BigQuery

Database indexing for cost and performance: A guide for Snowflake, Databricks, and BigQuery

What is indexing in a cloud data platform?

Why indexing decisions matter for cost and performance

What "indexing" actually means on each platform

Snowflake

Databricks

BigQuery

Side-by-side: Indexing mechanisms across the three platforms

When the maintenance cost outweighs the benefit

Implementation specifics

Snowflake

Databricks

BigQuery

Indexing for write-heavy workloads

Minimize cluster column count

Decouple ingestion from query optimization

Schedule OPTIMIZE during low-traffic windows

A counter-example worth remembering

Monitoring what is working

A war story worth knowing