Database Safety Under Continuous Deployment

The database is shared production state, and continuous deployment sends every change to it straight to production.

Related Concepts: Continuous Delivery | Module Decomposition | Modular Monolith

The database holds the one part of a system that survives every deploy. Application code is replaced on each release. The data, and the schema that shapes it, persists. It is also shared: every module, every request, and every background job reads and writes the same tables.

Continuous deployment raises the stakes of that shared state. A merge to the main branch reaches production within minutes, and no person reviews the running result before users hit it (see Continuous Delivery). A schema change, a new query, or a data model decision goes live against real production data and real production load, and the only review it gets happens at the pull request.

This article covers four disciplines for working with the database under those conditions: changing the schema without breaking the running application, writing queries that hold up under load, modeling data so it survives growth, and composing data across module boundaries.

Schema changes

Database migrations run as a one-off task before the new application version starts (see Deploying to AWS ECS). The schema changes first. The new code arrives after.

This ordering creates a window. Once the migration completes, the previous version of the application keeps serving traffic against the already-changed schema, until the new version starts, passes its health checks, and takes over. With blue-green deployment that window lasts the whole rollout: the old tasks run against the new schema the entire time the new tasks are coming up.

Blue-green gives a clean traffic cutover, so two application versions never serve requests at the same time. The old version still runs against the new schema for minutes on every deploy. Every schema change is therefore backward compatible with the code that is already running.

Backward compatibility

A change is backward compatible when the previously deployed code keeps working against the new schema. Additive changes meet this bar. A new nullable column, a new table, or a new index leaves existing queries working.

Changes that remove or rename a structure the old code references break it. A dropped column makes the old version's INSERT and SELECT statements fail. A renamed table makes its queries fail the instant the rename commits. A new NOT NULL column with no default makes the old version's inserts fail, because that code has no value to supply.

Expand and contract

Expand and contract, also called parallel change, makes an incompatible change safe by splitting it across deploys. Instead of changing a structure in place, you stage it:

Expand. Add the new structure alongside the old one, so both exist at once.
Migrate. Move the code to the new structure and backfill existing data. Release this.
Contract. Once nothing reads the old structure, remove it in a later deploy.

A column rename becomes a sequence: add the new column, write to both columns, backfill the old rows, move reads to the new column, then drop the old column in a separate release. Each step is backward compatible on its own, so each one is safe to deploy on its own.

Locks

A migration's risk comes from the lock it takes, how long it holds that lock, and how much traffic the table is under. A statement that finishes in milliseconds is safe even under a strong lock. The same lock held for the length of a full table rewrite stops the application.

PostgreSQL grants different locks for different operations. Most of them conflict only with writes. One lock mode, ACCESS EXCLUSIVE, conflicts with everything, including a plain SELECT. Operations that take it and release it in milliseconds are fine. Operations that take it and hold it through a full table scan or a table rewrite block all access to the table for that whole duration.

A blocked migration is worse than a slow one. When a migration waits for its lock behind a long-running query, every query that arrives after it waits behind the migration. One slow read and one waiting migration can freeze a busy table for every caller. A short lock_timeout on the migration session bounds this: the migration gives up and retries instead of holding the queue.

Safe forms of common changes

Most changes that would take a long lock have a documented form that does not:

Adding a NOT NULL constraint, a check constraint, or a foreign key validates every existing row under a strong lock. Adding the constraint as NOT VALID first and validating it in a separate step moves that scan to a lock that allows reads and writes to continue.
A plain CREATE INDEX blocks writes for the whole build. Building the index concurrently lets writes continue.
A column default that is a constant is a metadata change on current PostgreSQL. A column default that calls a volatile function, such as a random UUID or the current time, rewrites the whole table. Add the column without the default, set the default for new rows, then backfill existing rows in a separate step.

The exact lock each operation takes, and the precise safe rewrite for each one, are stack-specific and belong in implementation guidance. The principle holds across all of them: the dangerous form and the safe form produce the same end state, and the safe form avoids a long hold on a blocking lock.

Backfills

Updating existing rows is a data change, and a large one inside the migration transaction holds the table's lock for the entire update. A single statement that rewrites millions of rows also produces a large transaction and a large write-ahead log, and it blocks the database from reclaiming dead rows while it runs.

Run a backfill in batches, outside the migration that changes the schema. The schema change commits quickly. The data moves in bounded chunks that each commit on their own.

Destructive changes

Blue-green deployment offers fast rollback by shifting traffic back to the old version. That rollback covers code. The schema stays as the migration left it, because the migration already ran and shifting traffic back does not undo it.

A destructive change removes the rollback path. If one deploy drops a column and the new code then fails, shifting traffic back restores code that still expects the column, running against a schema that no longer has it. Both versions are now broken.

Plan destructive changes to roll forward. Stage them through expand and contract so that no single deploy both removes a structure and depends on its removal. When a deploy fails, the fix is the next deploy.

Query design

A query that returns instantly against a developer's small dataset can be slow against production data. The difference is data volume, value distribution, and concurrency, none of which is visible at the pull request.

Limits of static review

Reading a query tells you its shape. It cannot tell you its speed. Whether a query is slow depends on how many rows the table holds, how the values are distributed, and which plan the database chooses. Those facts come from the running database, through EXPLAIN and the statistics it keeps, and they are absent at code review.

Review catches query shapes that correlate with problems. It cannot confirm a real problem on its own. A flagged query is a question for the author, answered with a query plan against production-like data.

Query shapes

Some shapes are visible in the source and reliably worth a second look:

A function or cast wrapped around an indexed column in a WHERE clause. The index on the bare column no longer applies, so WHERE lower(email) = $1 cannot use an index on email.
A LIKE pattern with a leading wildcard, such as '%term'. A standard B-tree index cannot serve it.
SELECT * where the code uses a few columns. It moves more data than the code needs and defeats index-only scans.
A query over a growing table with no LIMIT. The result set grows with the table.
Pagination with a large OFFSET. The database reads and discards every skipped row to reach the page.
A query issued once per item in a loop, the N+1 pattern. N items produce N round trips, and one batched query replaces them.

Index design

An index speeds up reads that match it and slows down every write to the table, because each write maintains the index. Adding an index to a write-heavy table is a real cost, so it is a deliberate decision with a stated reason.

A few index facts are visible at review time. A composite index on (a, b) serves queries that filter on a, and it does nothing for a query that filters only on b. A foreign key column is not indexed automatically, and an unindexed foreign key makes deletes on the parent table scan the child table. An index whose columns are already the leading columns of another index carries cost with no added benefit.

Whether a given index helps a given query, and whether an existing index is used at all, are runtime facts. Review raises the question. The running database answers it.

Data modeling for growth

Some data model decisions are fine the day they ship and become expensive once a table grows. They are worth raising while the design is still soft. Each one is a question about the future, because the cost depends on a scale the system may not reach.

Unbounded tables

A table that only grows needs a plan for old rows. Event logs, audit trails, notifications, and job records accumulate forever by default. The signs are an append-only insert pattern, a monotonic timestamp, and a name like events or audit_log. The question to ask: is there a retention or archival policy, and would time-based partitioning let old data be removed by detaching a partition instead of deleting millions of rows?

Primary keys

A random primary key, such as a version-4 UUID, scatters inserts across the index. Each insert lands in a random position, which splits index pages and fragments the index over time. A time-ordered key, such as a version-7 UUID or a sequential integer, keeps inserts at one end of the index and the index stays dense. The tradeoff: a time-ordered key reveals roughly when a row was created, and it concentrates inserts at one edge of the index, which can become a contention point under heavy concurrent writes.

High-churn tables

A row that is updated frequently produces dead row versions that the database cleans up later. Counters, status flags, and last-seen timestamps churn this way. PostgreSQL can update such a row in place when no indexed column changes and the new version fits on the same page, which avoids most of the cost. The question: are the frequently changed columns left out of indexes, and is there room on the page for in-place updates?

Enums

A database enum type is awkward to change. Adding a value has restrictions, renaming a value breaks running code the same way a column rename does, and removing a value is unsupported without rebuilding the type. A column expected to gain values over time is easier to manage as a lookup table or as text with a check constraint.

Cross-module data access

In a modular monolith, a module owns its tables and no other module reads them directly. Cross-module data goes through the owning module's public API. Module Decomposition covers the reason: shared tables couple modules together, so a schema change for one module can break another.

This rule reaches into query design. Data that would be a single SQL join across two modules' tables cannot be a join, because the join would read another module's tables. The join becomes two steps: fetch from the other module's API, then combine the results in application memory.

Composition patterns

Fetching from one module and combining the results in memory is API composition. It works well for bounded result sets. Its weakness is large joins: combining big datasets in memory is slower than a database join would be, and that cost is the accepted price of the module boundary.

When a module reads another module's data often, a read model is an alternative. The consuming module keeps its own read-only copy of the slice it needs, updated from the owning module's events. The copy is eventually consistent, and it is written only through the owner.

Cross-boundary N+1 queries

Combining data per item turns one logical join into N calls to another module, one call per item. This is the N+1 pattern across a module boundary. The fix is a batch endpoint on the owning module: collect the identifiers for a page of results, make one call that fetches all of them, and combine in memory. Pagination stays on a single resource, and the related data for that page arrives in one batched call.

Postmortems

Database Safety Under Continuous Deployment

Schema changes

Backward compatibility

Expand and contract

Locks

Safe forms of common changes

Backfills

Destructive changes

Query design

Limits of static review

Query shapes

Index design

Data modeling for growth

Unbounded tables

Primary keys

High-churn tables

Enums

Cross-module data access

Composition patterns

Cross-boundary N+1 queries

Further Reading

Database Safety Under Continuous Deployment ​

Schema changes ​

Backward compatibility ​

Expand and contract ​

Locks ​

Safe forms of common changes ​

Backfills ​

Destructive changes ​

Query design ​

Limits of static review ​

Query shapes ​

Index design ​

Data modeling for growth ​

Unbounded tables ​

Primary keys ​

High-churn tables ​

Enums ​

Cross-module data access ​

Composition patterns ​

Cross-boundary N+1 queries ​

Further Reading ​

Database Safety Under Continuous Deployment

Schema changes

Backward compatibility

Expand and contract

Locks

Safe forms of common changes

Backfills

Destructive changes

Query design

Limits of static review

Query shapes

Index design

Data modeling for growth

Unbounded tables

Primary keys

High-churn tables

Enums

Cross-module data access

Composition patterns

Cross-boundary N+1 queries

Further Reading