# Heap Leak Hunting — Methodology

The single idea behind this skill: **a histogram tells you *what* is numerous; it does not
tell you *who retains it* or *why it is never freed*.** Generic containers (`char[]`,
`String`, `HashMap$Node`, `Object[]`) always top the histogram and almost never name the
bug. Root-causing a leak means walking *up the reference graph* to the GC-root holder, then
explaining the *mechanism* that keeps the collection growing.

Every conclusion should be reached by **at least two independent routes** so it is not an
artifact of one tool.

---

## Stage 1 — Histogram: find the abnormal class

Run `scripts/hprof_histogram.py`. Skip the all-classes board at first and read the
**business / third-party board (JDK excluded)**. You are hunting for a *domain* class whose
instance count is wildly larger than the number of real things it should represent.

**Anchor the count against reality.** Pick objects whose count equals "live work":

- `io.netty.channel.socket.nio.NioSocketChannel`, `sun.nio.ch.SocketChannelImpl`,
  `java.io.FileDescriptor` → number of live TCP connections / sockets.
- a domain "online user / active session" object, thread-pool size, etc.

If a session/handler/listener/entry class has, say, 190k instances while live connections
are 40k and active users are 6k, then roughly 150k to 184k instances (depending on which
anchor you use) have no live counterpart: they are **retained zombies**. That class is
your Stage-2 suspect.
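
The anchoring arithmetic can be sketched in a few lines. This is only an illustration: the function name and anchor labels are hypothetical, and the counts are the example figures from the paragraph above, not output of `scripts/hprof_histogram.py`.

```python
def surplus_against_anchors(suspect_count, anchors):
    """For each 'live work' anchor, return how many suspect instances exceed it."""
    return {name: suspect_count - live for name, live in anchors.items()}


# Example figures from the text: 190k suspects, 40k connections, 6k active users.
suspect_count = 190_000
anchors = {"live_connections": 40_000, "active_users": 6_000}

surplus = surplus_against_anchors(suspect_count, anchors)
for name, extra in surplus.items():
    ratio = suspect_count / anchors[name]
    print(f"{name}: surplus {extra:,} instances ({ratio:.1f}x the anchor)")
```

Whichever anchor you pick, a surplus several times larger than the anchor itself is the signal; the exact figure matters less than the order of magnitude.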

Cross-check internal consistency: leaked parents usually drag fixed-ratio children. If
`SuspectA = N`, look for classes at `≈N`, `≈2N`, `≈k·N` — they confirm a whole subtree
leaks together (e.g. each leaked session keeps 1 handshake + 2 transport states + 1 store).
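
A minimal sketch of the fixed-ratio check, assuming the histogram is already available as a plain `{class_name: instance_count}` dict. The function, the 5% tolerance, the ratio cap, and the class names are all illustrative assumptions, not part of the scripts.

```python
def fixed_ratio_classes(histogram, suspect, max_ratio=4, tolerance=0.05):
    """Find classes whose count is close to k * (suspect count) for a small integer k."""
    n = histogram[suspect]
    hits = []
    for cls, count in histogram.items():
        if cls == suspect:
            continue
        k = round(count / n)  # nearest integer multiple of the suspect count
        if 1 <= k <= max_ratio and abs(count - k * n) <= tolerance * k * n:
            hits.append((cls, k, count))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical histogram: three children track the suspect, two classes do not.
histogram = {
    "Session": 190_000,
    "Handshake": 189_500,
    "TransportState": 381_000,
    "Store": 190_200,
    "Channel": 40_000,
    "String": 3_100_000,
}
for cls, k, count in fixed_ratio_classes(histogram, "Session"):
    print(f"{cls}: {count:,} ~ {k} x Session")
```

The cap on `k` keeps generic classes such as `String` out of the result: with a large enough multiplier almost any count looks like "some multiple" of the suspect.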

---

## Stage 2 — Reverse-reference trace: who retains them

Run `scripts/trace_referrers.py <dump> <suspect-class>`. It implements MAT's
"path to GC roots" in pure Python: from every instance of the suspect, find referrers via
**instance fields, array elements, and static fields**, hop by hop, until reaching a
GC root or a single accumulation container.

Read each hop for:

- **The dominant referrer field.** `('java.util.concurrent.ConcurrentHashMap$Node', 'val')`
  with a count ≈ suspect count means "they live as *values* of a ConcurrentHashMap."
  Next hops climb `Node → Node[] table → ConcurrentHashMap → holder.field`.
- **★ static field holder** — a `static` field directly references the set. This is the
  smoking gun for static-registry / static-cache leaks: the set stays reachable for as long as the holder class stays loaded.
- **★ referrer is itself a GC root** — thread (local var / thread object), JNI global,
  sticky class.

**Convergence is the goal.** You are looking for the hop where many objects funnel into
*one* container instance (one big `Node[]` table, one singleton holder). That container's
owning field is the retention point.

**Cycles are expected.** Frameworks wire back-references (child holds a pointer to the
registry that holds the child), so the reverse-BFS frontier *grows* after a few hops. Don't
chase the ballooning frontier — read the *early* hops and the ★ anchors.
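
The hop-by-hop reading can be illustrated on a toy graph. This is not the code of `scripts/trace_referrers.py`: the edge-list format, the function signature, and the object ids are assumptions made for the sketch. It shows the per-hop tally of `(referrer class, field)` pairs, the convergence onto one container, and the back-reference cycle.

```python
from collections import Counter


def trace_referrers(edges, class_of, suspects, roots, max_hops=8):
    """Reverse BFS over (referrer, field, referent) edges, tallying referrers per hop."""
    incoming = {}
    for src, field, dst in edges:
        incoming.setdefault(dst, []).append((src, field))

    frontier, seen, report = set(suspects), set(suspects), []
    for hop in range(1, max_hops + 1):
        tally, next_frontier, root_hits = Counter(), set(), set()
        for obj in frontier:
            for src, field in incoming.get(obj, ()):
                tally[(class_of[src], field)] += 1
                if src in roots:
                    root_hits.add(src)
                if src not in seen:  # visit each object once, so cycles terminate
                    seen.add(src)
                    next_frontier.add(src)
        if not tally:
            break
        report.append((hop, tally.most_common(3), sorted(root_hits)))
        frontier = next_frontier
        if not frontier:
            break
    return report


# Toy heap: 3 sessions held as map values; each session points back at the holder.
sessions = ["s0", "s1", "s2"]
class_of = {"table": "ConcurrentHashMap$Node[]", "map": "ConcurrentHashMap",
            "holder": "Registry"}
edges = [("map", "table", "table"), ("holder", "sessions", "map")]
for i, s in enumerate(sessions):
    node = f"n{i}"
    class_of[s], class_of[node] = "Session", "ConcurrentHashMap$Node"
    edges += [(node, "val", s), ("table", "[]", node), (s, "registry", "holder")]

report = trace_referrers(edges, class_of, sessions, roots={"holder"})
for hop, top, root_hits in report:
    print(hop, top, ("ROOT: " + ", ".join(root_hits)) if root_hits else "")
```

On this toy graph the frontier narrows from three nodes to one table, one map, and one holder, which is the convergence to look for. The final hop reports the sessions themselves as referrers of the holder: that is the back-reference cycle, and it carries no new information.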

---

## Stage 3 — MAT cross-validation (optional)

If Eclipse MAT runs (see `mat-headless-runbook.md`), its **Leak Suspects** report should
independently name the same holder ("Problem Suspect 1: one instance of X occupies N% …
accumulated in one `Node[capacity]`"). Two unrelated methods agreeing turns a strong
inference into a confident finding. MAT also gives **retained size** (true cost of the
subtree), which the histogram's shallow size cannot.

---

## Stage 4 — Precise measurement: turn inference into numbers

Run `scripts/inspect_objects.py` to:

- **Measure the suspect collection's true size** — `--map-fields`. Distinguish the real
  leak map (100k+ entries, huge power-of-two table) from innocent siblings (thousands).
- **Read the runtime config that governs cleanup** — `--fields`. Heartbeat/timeout values,
  feature flags, max sizes. A misconfigured timeout (e.g. 100× the default) often explains
  *why* cleanup never fires.

> **Trust non-empty bucket count + table capacity, not `size`/`baseCount`.**
> `ConcurrentHashMap` spreads its counter across `counterCells` under contention (the real
> size is `baseCount` plus the sum of all cell values), so `baseCount` alone can read
> absurdly low (e.g. 552 for a map with ~190k entries). The `Node[]` table capacity (a power
> of two) and the non-empty bucket count are reliable.
> The script flags this automatically (⚠️) when `size` and non-empty buckets differ by >10×.
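
The rule above can be turned into a rough estimate. The sketch below assumes hashes spread uniformly over the table, in which case `n` entries in `m` buckets leave about `m * (1 - exp(-n/m))` buckets non-empty; inverting that gives an entry estimate from the bucket count. Both functions are hypothetical helpers rather than part of `scripts/inspect_objects.py`, and the uniformity assumption will not hold for every key distribution.

```python
import math


def estimate_entries(non_empty_buckets, table_capacity):
    """Estimate entry count from bucket occupancy, assuming uniformly spread hashes."""
    if not 0 <= non_empty_buckets < table_capacity:
        raise ValueError("need 0 <= non_empty_buckets < table_capacity")
    # Invert k = m * (1 - exp(-n / m))  =>  n = -m * ln(1 - k / m)
    return -table_capacity * math.log1p(-non_empty_buckets / table_capacity)


def size_looks_wrong(reported_size, non_empty_buckets, factor=10):
    """Every non-empty bucket holds at least one entry, so a much smaller size is bogus."""
    return non_empty_buckets > factor * max(reported_size, 1)


# Figures quoted in this document: a counter reading of 552, and a 524288-slot
# table with 160,989 non-empty buckets.
estimate = estimate_entries(160_989, 524_288)
print(f"estimated entries: {estimate:,.0f}")
print("size field unreliable:", size_looks_wrong(552, 160_989))
```

The estimate is always at least the non-empty bucket count, because collisions put several entries in one bucket. Treat it as an order-of-magnitude figure and cross-check it against the suspect class's instance count from the histogram.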

When comparing sibling collections, remember **they may hold different things**: a registry
keyed by session-id can be huge while a sibling keyed by live channel stays small (it only
holds currently-connected transports, not every session). Don't expect every map's size to
equal the live-connection count — judge each against *what it is supposed to hold*. The leak
is the one whose size has no business being that large.

---

## Stage 5 — Mechanism & report

Locating the holder is half the job; explain **why it grows unbounded**:

- **Static cache / registry never pruned** — entries added on event A, removal depends on
  event B that doesn't always happen.
- **Listener / callback never deregistered** — observer holds the subject (or vice-versa).
- **Connection / session not removed on disconnect** — cleanup depends on a timeout or an
  explicit close callback that some paths skip.
- **Unbounded queue / buffer** — producer outruns consumer.
- **`ThreadLocal` on a pooled thread** — value outlives the request.
- **`ClassLoader` leak** — a long-lived object pins a webapp/plugin classloader.
- **Misconfigured timeout / TTL** — cleanup mechanism exists but is effectively disabled
  (set to 0) or set so large it never fires.
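
Several of these mechanisms share one shape: the add path always runs, the remove path only sometimes. A deliberately simplified model, with hypothetical class and method names, makes the growth visible:

```python
class Registry:
    """Toy session registry: add on handshake, remove only on an explicit disconnect."""

    def __init__(self):
        self.sessions = {}

    def on_handshake(self, session_id):
        self.sessions[session_id] = object()  # stand-in for per-session state

    def on_disconnect(self, session_id):
        self.sessions.pop(session_id, None)


def simulate(n_clients, clean_close_every):
    """Only every `clean_close_every`-th client triggers the removal path."""
    registry = Registry()
    for session_id in range(n_clients):
        registry.on_handshake(session_id)
        if session_id % clean_close_every == 0:
            registry.on_disconnect(session_id)
    return len(registry.sessions)


leaked = simulate(n_clients=1_000, clean_close_every=10)
print(f"entries still retained: {leaked}")
```

In a real library the removal path is usually a timeout task or a close callback rather than a modulo check, which is why reading the add/remove lifecycle in the source is the step that proves the mechanism.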

To pin the mechanism, read the owning library's source for the **add/remove lifecycle** of
that exact collection, and quote the method names. State plainly what is *proven from the
heap* vs *inferred*.

---

## Worked example (anonymized, real case)

> This is a real, anonymized case shown to illustrate the flow end-to-end. For a *new* dump,
> derive every number yourself with the scripts — **do not pattern-match your conclusion to
> this example.** A different leak will have a different holder class, field, and mechanism;
> the *method* transfers, the answer does not.

**Symptom.** 2.6 GB dump, ~40M objects. Histogram: a Socket.IO session class `ClientHead`
= 192,639; live `NioSocketChannel` = 41,773; active session-wrapper = 6,530. → ~186k
zombie sessions.

**Reverse trace.** `ClientHead ← ConcurrentHashMap$Node.val (194,844) ← Node[] table ←
ClientsBox (singleton) ← GC root (Netty I/O thread → SocketIOChannelInitializer)`. Sibling
ratios confirmed: `TransportState ≈ 2× ClientHead`, message queues `≈ 2× ClientHead`.

**MAT.** Leak Suspects: "one `ClientsBox` occupies 64.67% … accumulated in
`ConcurrentHashMap$Node[524288]`." Same holder. ✓

**Measure.** `ClientsBox.uuid2clients`: table 524288, non-empty buckets 160,989 (≈190k
entries once bucket collisions are counted); `channel2clients`: 2,027; business static maps:
939 / 14,496. → leak is **only** `uuid2clients`. Config: `pingTimeout=120000` (2× default),
`upgradeTimeout=1000000` (100× default).

**Mechanism.** `AuthorizeHandler.authorize()` does `clientsBox.addClient()` on handshake;
removal depends on ping-timeout or explicit disconnect. Only ~18k timeout tasks exist for
190k sessions → most zombies have no pending cleanup. The blown-up `upgradeTimeout`
amplifies retention of mid-upgrade/aborted polling connections.

**Fix.** Upgrade the library; restore `upgradeTimeout`/`pingTimeout` to sane values; add
ingress rate-limit/auth to stop invalid handshakes; monitor `uuid2clients.size`; mitigate
with heap bump + rolling restart / scheduled stale-session cleanup.
