
Our team members have built and operated five managed PostgreSQL services over the past 15 years. Across all of them, one configuration has remained constant: strict memory overcommit.
In this blog post, we will explain how strict memory overcommit protects your database from catastrophic OOM (out of memory) kills. We will also share how a one-character kernel bug forced us to temporarily disable this setting. Finally, we will explain our heuristic for determining the right memory overcommit limit. Hopefully, this will help you find the right setting for your workloads.
Linux allows processes to allocate more virtual memory than what is physically available. When a process allocates memory, for example with malloc(), the kernel reserves virtual address space for it. However, the kernel does not immediately back that space with physical memory. Physical pages are only consumed when the process actually touches the memory.
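This lazy backing is easy to observe with an anonymous mmap (a minimal Python sketch; the mmap here stands in for what malloc() does under the hood for large allocations):

```python
import mmap

# Reserve 1 GiB of anonymous virtual address space. The kernel grants
# the reservation immediately without backing it with physical pages.
GIB = 1 << 30
region = mmap.mmap(-1, GIB)

# Only when a page is touched does the kernel fault in physical memory.
# Writing these 5 bytes costs one 4 KiB page, not 1 GiB of RAM.
region[0:5] = b"hello"
assert region[0:5] == b"hello"
region.close()
```

If you watch RSS in `ps` while this runs, it stays tiny even though VSZ jumps by a gigabyte.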
The kernel relies on the assumption that not all allocated memory will be actively used at the same time. Usually, this assumption holds. When it doesn’t, the kernel invokes the OOM killer to free memory by terminating a process.
For most processes, handling an OOM kill is simple: the process restarts, reconnects, and picks up where it left off. PostgreSQL is different.
PostgreSQL's postmaster (its main supervisor process) forks a backend process for each connection. These backends share memory segments that hold shared buffers, WAL buffers, lock tables, and other shared state. The OOM killer doesn't understand this architecture. It simply picks a process based on a heuristic (usually the process that uses the most memory) and terminates it. If that backend was modifying a shared memory segment, the segment may be left in an inconsistent state. Shared memory has no transactional guarantees at the OS level. A half-written page in shared buffers means silent data corruption.
PostgreSQL's postmaster knows this. When it detects that any of its child processes has been killed, it assumes the worst: shared memory may be corrupted. When shared memory is corrupted, there is a risk of corrupting the stored data as well. To prevent this, the postmaster terminates all remaining backends. Every active connection is dropped. Every in-flight transaction is aborted. On its next start, the database goes through crash recovery.
This is the correct behavior. PostgreSQL is protecting your data. But it means a single OOM kill doesn't just affect one connection. It takes down every connection on the server. On top of that, if the write volume was high, replaying all WAL files for crash recovery can take a long time. This means a single out of memory case can cause long outages.
It is possible to configure how the kernel behaves when processes ask for memory. Linux provides three overcommit policies via vm.overcommit_memory:

- `0` (heuristic, the default): the kernel allows most allocations, refusing only ones that look obviously unreasonable.
- `1` (always): every allocation succeeds, no matter how large.
- `2` (strict): the kernel tracks committed memory (Committed_AS in /proc/meminfo) and refuses any allocation that would push it past CommitLimit.
Under strict overcommit, the kernel has two knobs for setting CommitLimit: overcommit_kbytes and overcommit_ratio. If overcommit_kbytes is set, CommitLimit is calculated as:

CommitLimit = overcommit_kbytes + swap

Otherwise:

CommitLimit = overcommit_ratio / 100 * available_memory + swap

When an allocation would push committed memory past CommitLimit, it fails with the ENOMEM error code. PostgreSQL handles this gracefully. A backend that cannot allocate memory reports an error to the client, cancels the transaction, and continues. The postmaster stays up. Other connections remain unaffected. This is a routine error, not a catastrophe. The trade-off is that strict overcommit converts late, destructive failures into early, graceful ones.
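The two cases can be expressed as a small helper (a Python sketch of the arithmetic, not kernel code; values in kB as in /proc/meminfo, and ignoring details such as hugetlb carve-outs):

```python
def commit_limit_kb(mem_total_kb, swap_kb, overcommit_kbytes=0, overcommit_ratio=50):
    """CommitLimit under strict overcommit (vm.overcommit_memory=2).

    A non-zero vm.overcommit_kbytes takes precedence; otherwise
    vm.overcommit_ratio (default 50) is applied to physical memory.
    """
    if overcommit_kbytes:
        return overcommit_kbytes + swap_kb
    return mem_total_kb * overcommit_ratio // 100 + swap_kb

# An 8 GiB machine with no swap and the default ratio of 50 may only
# commit 4 GiB -- one reason the defaults usually need tuning.
print(commit_limit_kb(8 * 1024 * 1024, 0))  # 4194304
```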
This trade-off works best when the machine is dedicated to PostgreSQL and a small set of known sidecar processes. In that scenario, the committed memory profile is predictable and the limit can be tuned with confidence. On shared machines running diverse workloads, committed memory becomes harder to predict. An unrelated process can use up the commit budget, causing PostgreSQL to hit ENOMEM even when the database itself is behaving normally.
We always favored strict overcommit for PostgreSQL. We used it in previous managed PostgreSQL services we built and also in Ubicloud PostgreSQL. However, after enabling it this time, we quickly ran into trouble. A few weeks after we turned on strict memory overcommit, we started to get failures on some of the databases. They showed out of memory errors, even though there was plenty of free physical memory on the machines. We disabled strict memory overcommit and started investigating.
The first clue came from a routine check of /proc/meminfo on one of our servers with 8 GB memory:
$> cat /proc/meminfo | grep "Committed_AS"
Committed_AS: 683547672 kB
651 GB of committed memory on an 8 GB machine! For comparison, a healthy server of the same size showed:
$> cat /proc/meminfo | grep "Committed_AS"
Committed_AS: 2703940 kB
The counter was off by orders of magnitude.
We first looked at ps output.
$> ps -C postgres -o pid,vsz,rss,cmd --sort=-vsz
PID VSZ RSS CMD
96622 2242244 95416 postgres: 18/main: postgres postgres...
95721 2241668 94708 postgres: 18/main: postgres postgres...
96414 2241436 94892 postgres: 18/main: postgres postgres...
96619 2241076 93308 postgres: 18/main: postgres postgres...
96417 2240900 94300 postgres: 18/main: postgres postgres...
95728 2240736 93864 postgres: 18/main: postgres postgres...
96620 2240736 92852 postgres: 18/main: postgres postgres...
95727 2240428 93640 postgres: 18/main: postgres postgres...
96623 2239840 93164 postgres: 18/main: postgres postgres...
VSZ is the total virtual address space a process has mapped and RSS is the physical memory it's actually using. In the output above, each backend shows ~2 GB of VSZ covering its entire mapped address space, but a much smaller RSS (~95 MB) reflecting the memory it is actively using. On this 8 GB VM we configure 2 GB of shared_buffers, and if you think ~2 GB VSZ is suspiciously close to the shared_buffers size, you are right. Most of each backend's VSZ is actually the shared memory segment that holds shared_buffers. Every backend maps the same 2 GB region into its own address space, so it shows up in each backend's VSZ. With many backends, the VSZ numbers add up quickly.
That said, none of this should inflate Committed_AS. The shared memory segment appears in every backend's address space but physically exists only once, so it should be counted only once. On top of that, we run PostgreSQL with huge_pages = on, so shared_buffers is allocated from hugetlb. Hugetlb mappings have their own separate reservation accounting and are not supposed to count toward Committed_AS at all. Still, the 2 GB hugetlb region was by far the largest mapping in each backend, and hugetlb accounting is a special case in the kernel. That made it the most natural place to start looking, so our first hypothesis was that the kernel was somehow miscounting these mappings. For example, charging them once per process instead of ignoring them.
To verify, we checked the VMA (Virtual Memory Area) flags on the hugetlb mapping via /proc/<pid>/smaps. Each VMA has a set of flags, and the ac flag (VM_ACCOUNT) indicates that the region counts toward committed memory:
$> sudo cat /proc/321784/smaps | grep -A 25 "hugepage"
7fce75000000-7fcef0c00000 rw-s 00000000 00:10 10723551 /anon_hugepage (deleted)
Size: 2027520 kB
Shared_Hugetlb: 393216 kB
Private_Hugetlb: 0 kB
...
...
VmFlags: rd wr sh mr mw me ms de ht sd
No ac flag. Hugetlb mappings were correctly excluded from committed memory accounting. That hypothesis was ruled out.
We then summed accountable memory (VMAs with the ac flag) across all processes on the machine:
$> sudo awk '/^Size/{size=$2} /VmFlags:/ && / ac/{sum+=size} END{printf "%.2f GB\n", sum/1048576}' /proc/[0-9]*/smaps
2.43 GB
2.43 GB accountable vs 651 GB reported; 648 GB of phantom committed memory. The vm_committed_as counter was leaking. We suspected that the memory was being charged on allocation but was never recredited. This made us consider a potential kernel bug in committed memory calculation.
At that time, we had two different kernels being used on our fleet. We checked our entire fleet of PostgreSQL servers and compared the ratio of Committed_AS to MemTotal against kernel version and uptime:
| Metric | Kernel 6.5.0 | Kernel 6.8.0 |
|---|---|---|
| Median Ratio | 0.55 | 0.27 |
| Mean Ratio | 24.97 | 0.32 |
| Max Ratio | 3,405 | 1.86 |
| Servers with a ratio > 1.0 | 23% | < 1% |
We also ran a statistical analysis and found that a server running the 6.5 kernel was 52x more likely to have inflated committed memory.
On 6.5 servers, uptime was positively correlated with inflation. The leak grew at roughly 4.7% compound per week, proportional to uptime. On 6.8 servers, no correlation existed.
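To put that growth rate in perspective, compounding ~4.7% per week (a rough fit from our fleet data, not an exact model) multiplies quickly with uptime:

```python
def inflation_factor(weeks, weekly_rate=0.047):
    # Compound growth: after n weeks the leaked committed memory has
    # grown by (1 + rate)^n relative to week zero.
    return (1 + weekly_rate) ** weeks

# A server up for half a year carries roughly 3.3x the leak it started with.
print(round(inflation_factor(26), 1))  # 3.3
```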
This analysis significantly strengthened our hypothesis that this was a kernel bug.
For definitive proof, we tasked an LLM with examining every commit between 6.5.0 and 6.8.0 for possible bug fixes in committed memory calculations. It quickly found the following.
The bug was introduced in Linux 6.5 by commit 408579c. This commit changed the return convention of do_vmi_align_munmap():
The commit updated callers throughout the mm subsystem. However, in mm/mremap.c, inside move_vma(), the error check was converted incorrectly:
BEFORE (correct): error handler runs on negative return (on error)
if (do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false) < 0) {
    /* OOM: unable to split vma, just get accounts right */
    if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
        vm_acct_memory(old_len >> PAGE_SHIFT);
}
AFTER (broken): error handler runs when return is 0 (on success)
if (!do_vmi_munmap(&vmi, mm, old_addr, old_len, uf_unmap, false)) {
    /* OOM: unable to split vma, just get accounts right */
    if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
        vm_acct_memory(old_len >> PAGE_SHIFT);
}
The change from < 0 to ! inverted the condition. To understand why this matters, consider what move_vma() does. It first decrements Committed_AS for the old region as part of the move, then calls do_vmi_munmap() to actually unmap it. If the unmap fails, the kernel needs to increment the counter back to keep accounting correct. After all, unmap has failed and the old region still exists. Its charge must be restored. With the inverted condition, this re-increment runs on every successful mremap instead of only on failure. The counter grew monotonically with every memory remap operation.
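The leak mechanics can be sketched with a toy model (pure illustration in Python, not kernel code; `committed` stands in for the vm_committed_as counter, and `unmap_failed` for the result of do_vmi_munmap()):

```python
def simulate_mremaps(moves, buggy):
    """Net change in a vm_committed_as-like counter after `moves`
    successful mremap() relocations of a 512-page region."""
    pages = 512
    committed = 0
    for _ in range(moves):
        committed -= pages    # move_vma() uncharges the old region...
        committed += pages    # ...and the charge follows the moved vma
        unmap_failed = False  # do_vmi_munmap() succeeded
        # The re-credit should only run when the unmap FAILED. The 6.5
        # bug inverted the check, so it ran on every success instead.
        recredit = (not unmap_failed) if buggy else unmap_failed
        if recredit:
            committed += pages
    return committed

print(simulate_mremaps(1000, buggy=False))  # 0: accounting balances
print(simulate_mremaps(1000, buggy=True))   # 512000: leaks on every move
```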
The bug was reported here and bisected here. Linus himself analyzed the root cause and fixed it with a one-line change, reverting the condition back to < 0:
As Linus Torvalds wrote in the fix:
This didn't change any actual VM behavior _except_ for memory accounting when 'VM_ACCOUNT' was set on the vma. Which made the wrong return value test fairly subtle, since everything continues to work.
Or rather - it continues to work but the "Committed memory" accounting goes all wonky (Committed_AS value in /proc/meminfo), and depending on settings that then causes problems much much later as the VM relies on bogus statistics for its heuristics.
This is the kind of bug that hides in plain sight. Under heuristic overcommit (the default), Committed_AS is purely informational. The kernel doesn't use it to gate allocations. The bug only causes failures under non-default strict overcommit mode, so it went unnoticed. The failure is also indirect. The accounting drifts silently for weeks before Committed_AS finally crosses CommitLimit and allocations start failing.
With the kernel bug behind us, we can gradually go back to enabling strict memory overcommit. This is a good point to explain our heuristic in deciding the commit limit in case you want to enable it for your workloads as well.
We use the formula:

overcommit_kbytes = 0.8 × MemTotal + 2 GB

In plain terms: 80% of total physical memory plus 2 GB.
The 20% holdback covers memory used by kernel data structures not seen in userspace. This includes items like page tables, slab caches, network buffers, and the kernel's own allocations.
It is important to note that 20% is not wasted. The kernel still uses it for page cache (i.e. the kernel uses free physical memory to cache file I/O). This is the biggest consumer and directly benefits PostgreSQL read performance. Page cache doesn't count toward Committed_AS because it's reclaimable. The kernel can evict cached pages anytime a process actually needs the memory.
Every PostgreSQL server in our fleet runs several sidecar processes. Some examples are prometheus, node_exporter, postgres_exporter and wal-g. These are Go programs, and Go's runtime reserves large virtual memory regions upfront via mmap but only faults in pages as needed. Their committed memory contribution is far larger than their actual resident memory.
We surveyed the committed memory of these sidecar processes across our fleet:
| Sidecar Committed Memory | Percentage of Servers |
|---|---|
| 0.0 – 0.5 GB | ~64% |
| 0.5 – 1.0 GB | ~32% |
| 1.0 – 1.5 GB | ~1% |
| 1.5 – 2.0 GB | ~1% |
| 2.0 – 2.5 GB | ~1% |
| 2.5 – 3.0 GB | ~1% |
| 3.0 – 3.5 GB | ~1% |
96% of servers fall under 1 GB. We found a weak positive correlation between vCPU count and sidecar committed memory (r = 0.22). This is likely driven by Go's runtime scaling with available CPUs but it is not strong enough to justify proportional scaling.
The fixed 2 GB covers >99% of servers. It is deliberately generous. If this offset is too small, sidecars can silently consume the remaining commit budget, and PostgreSQL, not the sidecar, is the one that hits ENOMEM.
If you are curious about how we implemented this, it is actually pretty straightforward. You can read the code in our GitHub repo here. I’m also adding the core part of it below for convenience.
def configure_memory_overcommit(strict: false)
  if strict
    total_mem_kb = File.read("/proc/meminfo").match(/MemTotal:\s+(\d+)/)[1].to_i
    overcommit_kbytes = (total_mem_kb * 0.8 + 2 * 1048576).round
    safe_write_to_file("/etc/sysctl.d/99-overcommit.conf", "vm.overcommit_memory=2\nvm.overcommit_kbytes=#{overcommit_kbytes}\n")
  else
    r "sudo rm -f /etc/sysctl.d/99-overcommit.conf"
  end
  r "sudo sysctl --system"
end

Note that we use vm.overcommit_kbytes instead of vm.overcommit_ratio. We need overcommit_kbytes because our formula includes a fixed 2 GB component that can't be expressed as a percentage. On a 4 GB server, the 2 GB buffer is 50% of the physical memory; on a 64 GB server, it's 3%. A single ratio can't capture both.
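A quick sanity check shows that last point (a Python sketch; sizes in GiB, swap ignored). The vm.overcommit_ratio value that would reproduce our formula differs wildly by machine size:

```python
def required_ratio(mem_gib):
    # The ratio (a percentage of physical memory) that would reproduce
    # CommitLimit = 0.8 * MemTotal + 2 GiB on a machine of this size.
    return (0.8 * mem_gib + 2) / mem_gib * 100

print(round(required_ratio(4)))   # 130 -- the 2 GiB offset is 50% of RAM
print(round(required_ratio(64)))  # 83  -- here it is only ~3%
```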
Strict memory overcommit is a small configuration change that provides a meaningful safety improvement for PostgreSQL. It converts catastrophic OOM kills into graceful allocation failures. This way, each backend can manage the issue without disrupting the whole system. Even though we had to disable it for a while due to a kernel bug, it remains a key configuration for healthy PostgreSQL deployments.
If you run PostgreSQL in production, we recommend enabling vm.overcommit_memory=2. However, it is important to configure this carefully. If CommitLimit is set too low, your application may experience frequent OOM errors. On the other hand, if it is set too high, the limit never triggers and you lose the protection that strict memory overcommit is meant to provide. Our recommendation is to monitor your memory usage over time and enable this setting only after you have a solid understanding of the memory characteristics of your workload.