Consumer Tech

PostgreSQL MVCC: The Default Behaviors Breaking Production

The MVCC Myth: What ‘No Blocking’ Actually Guarantees (And What It Doesn’t) PostgreSQL’s MVCC guarantee is real and narrow. Reads never block writes, and writes never block reads — but only for standard DML operations against heap data. The moment you introduce DDL, that guarantee evaporates entirely. ALTER TABLE, DROP TABLE, TRUNCATE, VACUUM FULL, and ... Read more

PostgreSQL MVCC: The Default Behaviors Breaking Production
Illustration · Newzlet

The MVCC Myth: What ‘No Blocking’ Actually Guarantees (And What It Doesn’t)

PostgreSQL’s MVCC guarantee is real and narrow. Reads never block writes, and writes never block reads — but only for standard DML operations against heap data. The moment you introduce DDL, that guarantee evaporates entirely.

ALTER TABLE, DROP TABLE, TRUNCATE, VACUUM FULL, and CREATE INDEX (without CONCURRENTLY) all acquire ACCESS EXCLUSIVE locks — the most aggressive of PostgreSQL’s 8 table-level lock modes. ACCESS EXCLUSIVE conflicts with every other lock mode, including ACCESS SHARE, which is what a plain SELECT acquires. A routine schema migration doesn’t sidestep MVCC’s concurrency model; it overrides it completely.

The chain-blocking behavior makes this dangerous at scale. Once a transaction queues an ACCESS EXCLUSIVE lock request and waits for existing queries to finish, every subsequent query against that table — even simple reads — queues behind it. A single ALTER TABLE waiting on a long-running analytics query can accumulate dozens of blocked connections within seconds. Connection pools saturate. Applications time out. The table effectively goes offline without any explicit downtime event.

Most engineers absorb “reads never block writes” and conclude PostgreSQL handles concurrency gracefully by default. That mental model applies to a database running only SELECT, INSERT, UPDATE, and DELETE. Real production systems run migrations, reindex tables, and update statistics — operations that land outside MVCC’s protection.

PostgreSQL documents all 8 table-level lock modes and 4 row-level lock modes with full conflict matrices. Knowing that ROW EXCLUSIVE conflicts with SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE is accurate and useless without knowing that UPDATE silently acquires ROW EXCLUSIVE on the table while also acquiring FOR NO KEY UPDATE row-level locks. The documentation tells you what the modes are. It does not prominently warn you that VACUUM FULL takes ACCESS EXCLUSIVE, that adding a column with a non-null default before PostgreSQL 11 rewrote the entire table, or that a queued lock request blocks every operation that follows it in sequence.

The 8-mode taxonomy creates an illusion of precision. Operators assume that because the locking system is well-specified, it is well-behaved. Those are different claims.

The Lock Conflict Table Trap: When Documentation Becomes a False Comfort

PostgreSQL’s official lock conflict table is technically accurate. It tells you that AccessExclusiveLock conflicts with every other lock mode, and that ShareLock conflicts with RowExclusiveLock. Every entry is correct. The table is also nearly useless as an operational guide.

The problem is the gap between lock modes and the SQL statements that trigger them. The documentation lists 8 table-level lock modes and describes their conflict relationships in precise detail. What it does not do is prominently map common DDL statements to the aggressive modes they acquire. ALTER TABLE acquires AccessExclusiveLock. CREATE INDEX without the CONCURRENT option acquires ShareLock. VACUUM FULL acquires AccessExclusiveLock. These mappings exist in the documentation, but they are scattered across reference pages, buried under syntax descriptions, and absent from the conflict table itself.

Engineers read the conflict table, feel informed, and then run a routine ALTER TABLE to add a column or change a default value. That statement immediately acquires AccessExclusiveLock on the target table. Because AccessExclusiveLock conflicts with every other lock mode, it blocks all concurrent reads and writes. Worse, once that lock request is queued and waiting behind an existing long-running transaction, every subsequent query against that table queues behind the AccessExclusiveLock request. A single maintenance command turns into a cascading block that takes down table access entirely.

This is not a bug. PostgreSQL behaves exactly as documented. The failure is that the documentation trains engineers to think in terms of lock modes while production systems operate in terms of SQL statements. No developer mentally translates ALTER TABLE ADD COLUMN into AccessExclusiveLock during a code review. No deployment checklist flags that CREATE INDEX on a 200-million-row table will hold a ShareLock for 45 minutes during a peak traffic window.

Teams that have studied PostgreSQL locking still get caught by this. The documentation creates confidence without creating operational awareness. Knowing that AccessExclusiveLock blocks everything is not the same as knowing which of your weekly migration scripts silently acquires it.

Five Behaviors That Blindside Even Experienced Teams

PostgreSQL’s MVCC architecture earns its reputation: reads don’t block writes, writes don’t block reads. That guarantee holds. What it doesn’t protect you from is the locking behavior of DDL operations, and that gap is where production incidents are born.

Adding a NOT NULL column or building an index feels instant in development because your table has 200 rows and zero concurrent traffic. In production, that same operation requests an ACCESS EXCLUSIVE lock — the heaviest lock mode in PostgreSQL’s eight-level table-lock hierarchy. ACCESS EXCLUSIVE conflicts with every other lock mode, including simple SELECT queries. If a long-running transaction already holds any lock on the table, your migration waits. While it waits, every query that arrives afterward queues behind it. One blocked DDL statement transforms into a full table freeze, and the freeze deepens with every second the original transaction stays open.

This queue compounding is the behavior that consistently catches experienced teams off guard. The migration itself might complete in milliseconds once it acquires the lock. But if a reporting query or a slow API transaction holds the table for 30 seconds before that happens, you’ve handed every subsequent SELECT and INSERT a 30-second wait. Connection pools exhaust. Timeouts cascade. What looked like a routine schema change becomes an incident.

Foreign key additions and column type changes make this worse because developers consistently underestimate the locks they require. ALTER TABLE ... ADD FOREIGN KEY holds a SHARE ROW EXCLUSIVE lock on the referencing table and acquires locks on the referenced table. Changing a column’s type — even between apparently compatible types — typically rewrites the entire table under an ACCESS EXCLUSIVE lock. Neither operation signals its weight through syntax.

Autovacuum adds a fourth axis of risk. The process runs automatically in the background, acquiring locks that interact with application transactions in ways that never appear during development. Under production write volume, autovacuum contention surfaces as intermittent slowdowns with no obvious cause in application logs — the lock is held by an internal maintenance process that most observability tooling doesn’t surface clearly.

Each of these behaviors is documented. None of them are bugs. They become risks because the development environment never generates the conditions that make them dangerous.

What Most Coverage Gets Wrong: The Timing Problem Nobody Talks About

Most PostgreSQL locking articles hand you a conflict matrix and call it a day. Lock mode A conflicts with lock mode B — now you know. That framing misses the actual production failure mode entirely.

The conflict table tells you what will block. It tells you nothing about how long that block lasts. Those are completely different problems, and only one of them kills production systems.

Here’s the mechanic that standard coverage skips: when a DDL statement like ALTER TABLE requests an ACCESS EXCLUSIVE lock, it doesn’t acquire that lock instantly. It waits in a queue behind every transaction currently holding a conflicting lock on that table. In a dev environment running isolated tests, those prior transactions finish in milliseconds. The ALTER TABLE slips through, the benchmark looks clean, and the operation gets classified as low-risk.

Production is not an isolated test. Production has long-running OLAP queries scanning millions of rows. It has application bugs leaving connections idle-in-transaction for minutes at a stretch. When your ALTER TABLE hits that environment, it doesn’t wait milliseconds — it waits for the slowest active transaction touching that table to finish. That could be 30 seconds. It could be 8 minutes. The lock itself is identical to what you tested. The duration is determined entirely by what else was happening at the moment you ran it.

This compounds through PostgreSQL’s queue behavior. Once ACCESS EXCLUSIVE sits in the lock queue, every subsequent query that needs even a basic ACCESS SHARE lock — which includes ordinary SELECT statements — stacks up behind it. The DDL operation becomes a traffic barrier. Connections accumulate. Connection pools exhaust. What started as a schema change that “takes a few milliseconds” becomes a full table outage measured in minutes, caused not by the operation itself but by the workload it landed in the middle of.

No dev benchmark captures that. No staging environment reproduces it unless staging carries the same concurrent query load as production. Lock duration is a property of your entire workload at a specific moment in time — and that’s the variable every standard locking tutorial treats as constant.

Practical Defenses: How to Operate PostgreSQL Without Getting Burned

Set lock_timeout and statement_timeout on every connection, every environment, without exception. A single ALTER TABLE waiting on an ACCESS EXCLUSIVE lock will queue behind long-running transactions and block every subsequent query on that table in a chain — reads included. Without a lock_timeout, that queue grows indefinitely. A value like lock_timeout = '2s' forces the DDL to fail fast rather than silently stall your entire application. statement_timeout acts as the outer boundary, killing any operation that runs longer than your SLA allows. These are not emergency tools; they are baseline configuration.

For index creation, CREATE INDEX CONCURRENTLY is the correct default on any table receiving live traffic. Standard CREATE INDEX takes an ACCESS EXCLUSIVE lock and blocks all reads and writes for its duration. CONCURRENTLY trades a longer build time — it performs multiple passes — for a lock profile that allows normal DML throughout. The tradeoff is almost always worth it in production.

Schema migrations that add constraints deserve the same discipline. The safe sequence is: add the column as nullable with no default, deploy application code that populates it, backfill existing rows in batches, then add the NOT NULL constraint using ALTER TABLE ... SET NOT NULL only after every row is populated. PostgreSQL 18 can validate a NOT NULL constraint without a full table rewrite when rows already comply, but even earlier versions benefit from this phased approach because it keeps lock acquisition windows small and predictable.

Monitoring is not a post-incident activity. pg_stat_activity shows every active session, its current query, and its wait event. pg_locks shows every held and pending lock. Joining these two views gives you a live lock graph — which session holds what, which sessions are waiting, and for how long. Teams running frequent deployments should query this graph continuously, not reactively. A blocked ALTER TABLE that’s been waiting 30 seconds is a recoverable situation. The same operation after five minutes, with 200 queued connections behind it, is an outage. The data to catch it early exists inside PostgreSQL the entire time; the only question is whether anyone is looking.

Why This Matters More Now: The Rise of Continuous Deployment Against Live Databases

Continuous deployment changed the relationship between application code and database schema in a way PostgreSQL’s locking model was never designed to handle. A decade ago, teams ran migrations during Saturday night maintenance windows against tables with zero active connections. Today, those same migrations run automatically on every merged pull request, triggered by GitHub Actions or similar pipelines, against tables processing thousands of transactions per minute.

The collision risk compounds directly with deployment frequency. A team shipping twice a day runs roughly 60 migrations per month against live traffic. Each one is an opportunity for an ACCESS EXCLUSIVE lock — the kind acquired by a standard ALTER TABLE — to queue behind a long-running read and then block every subsequent query on that table until it completes. PostgreSQL’s lock queue is not fair in the way most engineers assume: once that exclusive lock request is waiting, it holds the gate closed for all requests behind it, regardless of their lock mode. A single slow OLAP query can turn a five-second migration into a two-minute outage.

The PostgreSQL locking documentation is accurate and thorough. The conflict tables tell you exactly which modes block which. The problem is that nothing in a standard CI/CD setup prompts an engineer to consult those tables before merging a schema change. Migration frameworks like Flyway and Liquibase execute SQL faithfully — they do not analyze lock implications or warn when a statement will acquire an ACCESS EXCLUSIVE lock on a high-traffic table.

The result is predictable: teams rediscover the same failure modes independently, usually at 2am, usually during a high-traffic period when deployment pipelines are most active. Adding a column with a default value, creating an index without CONCURRENTLY, adding a foreign key without validating separately — each of these has a well-documented, safer alternative. The alternatives exist. The default behavior, however, is the unsafe one, and defaults are what automated pipelines execute.

Treating every schema change as a potentially blocking operation requires a deliberate process change, not just better tooling. Without that shift, continuous deployment against live databases is not a solved problem. It is a scheduled incident.

AI-Assisted Content — This article was produced with AI assistance. Sources are cited below. Factual claims are verified automatically; uncertain claims are flagged for human review. Found an error? Contact us or read our AI Disclosure.

More in Consumer Tech

See all →