How a Health Check Took Down My LAN // famstack.dev

This is the post-mortem of an outage on my home server. The first report came from my son, who tried to connect to a game on the desktop and got DNS_PROBE_ERROR in the browser. Then I remembered: my wife had occasionally mentioned that the internet felt unstable. Nothing dramatic enough to act on at the time. In hindsight, those were early warnings of the full outage that followed.

The visible symptom on the day of the outage was that every device connected by cable lost name resolution. The actual cause was a single Mac running out of ephemeral TCP ports, triggered by a common but quietly wrong pattern in a Python HTTP client used by one of the services I run.

I want to write it down because the diagnostic path was longer than the fix, and because the underlying lesson has nothing to do with which specific service caused the problem.

#What I saw first

The first signal was AdGuard Home, the DNS server I run for the whole house. Its log was filled with timeouts on every DoH upstream:

[error] dnsproxy: exchange failed
  upstream=https://dns10.quad9.net:443/dns-query
  question=";mask.icloud.com. IN A"
  duration=30.005884152s
  err="net/http: request canceled while waiting for connection
       (Client.Timeout exceeded while awaiting headers)"

Every query to Quad9 and Google DNS was failing with a 30-second connection timeout. The Go HTTP client wording is precise here: the connection was never opened, the request was canceled while waiting on the dial.

Only the wired clients in the house were affected. The wireless side appeared healthy, which initially looked like the kind of asymmetric failure that points at a misconfigured DHCP scope or a separate access point. It wasn’t. Apple devices on the wireless side were silently bypassing AdGuard for most external lookups via iCloud Private Relay, so they kept working. Every non-Apple device, which happened to be the wired ones, lost DNS entirely.

#First theory: bad DoH endpoint

The AdGuard upstreams list had three DoH URLs. One of them was wrong. I had at some point entered https://dns.cloudflare.com/dns-query, which is not a valid Cloudflare endpoint. AdGuard’s own validator confirmed it had never worked.

That explained a fraction of the failures, but the timeouts I was seeing were on Quad9 and Google DNS, not Cloudflare. So a real bug, but not the one bringing the LAN down.

I corrected the Cloudflare entry, expecting immediate recovery. The DoH queries to Quad9 and Google DNS still failed. The wrong theory.

#Second theory: outbound HTTPS blocked

I moved down a layer. From inside the AdGuard container, plain DNS lookups against Quad9 still worked. Name resolution was fine. The HTTPS connection to the DoH endpoints was the problem.

I tried opening a TCP connection directly from the host:

$ nc -w 5 -zv 8.8.8.8 443
nc: connectx to 8.8.8.8 port 443 (tcp) failed:
    Can't assign requested address

EADDRNOTAVAIL on a connect call is unusual. It does not mean the destination is unreachable. It means the kernel could not pick a local source address and port for the outgoing socket.

A quick test against the local gateway confirmed how broad the issue was:

$ nc -w 5 -zv 192.168.188.1 80
nc: connectx to 192.168.188.1 port 80 (tcp) failed:
    Can't assign requested address

The Mac could not open a TCP connection to its own router. ICMP was fine; ping 8.8.8.8 returned in 11 milliseconds. Every TCP connect() was being rejected before a single packet hit the wire.
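The distinction between "the destination is unreachable" and "this host cannot allocate a source port" can also be made in code. What follows is a minimal sketch, not part of the original investigation; the function names probe_tcp and describe_connect_error are made up for illustration, and the wording of the returned strings is arbitrary.

```python
import errno
import socket


def describe_connect_error(exc: OSError) -> str:
    """Separate local resource failures from remote or network failures."""
    if exc.errno == errno.EADDRNOTAVAIL:
        # The kernel could not assign a local address/port: look at this host.
        return "local: no source address or port available"
    if exc.errno == errno.ECONNREFUSED:
        return "remote: port closed or service down"
    if isinstance(exc, (socket.timeout, TimeoutError)) or exc.errno == errno.ETIMEDOUT:
        return "network: no answer before the timeout"
    return f"other: {exc}"


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Try one TCP connect and describe the outcome in plain words."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return describe_connect_error(exc)
```

Calling probe_tcp("192.168.188.1", 80) on the wedged host would presumably have returned the "local" message, which points the investigation at the machine itself rather than at the router or the upstream.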

#The diagnosis

macOS allocates outbound TCP source ports from the range 49152 to 65535, which gives 16,384 slots. Each closed connection sits in TIME_WAIT for about 30 seconds by default (twice the default MSL of 15 seconds) before its port returns to the pool. If new connections are created faster than the timer drains them, the pool eventually empties, and the kernel has nothing to allocate.

$ netstat -an | grep -c TIME_WAIT
17546

More TIME_WAIT entries than the kernel had source ports to give out. Every new TCP call was failing because there were no ports left. This was the wedge.

Grouping the stuck sockets by destination made the source obvious:

14126  127.0.0.1.8888
 1917  1.1.1.1.53
   56  8.8.4.4.443
   21  9.9.9.10.443
   ...

Eighty percent of the stuck sockets were localhost-to-localhost on port 8888. The remaining slice was AdGuard’s own retries against the DNS upstreams it could no longer reach. That part was the symptom I had been chasing; the actual cause was something inside the host hammering its own loopback interface.
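The grouping itself is easy to reproduce. The sketch below is one way to do it in Python, assuming the column layout that macOS netstat -an prints for TCP rows (protocol, two queue counters, local address, foreign address, state). The function names are my own, and this is not necessarily the command used during the incident.

```python
import subprocess
from collections import Counter


def time_wait_by_destination(netstat_output: str) -> Counter:
    """Count TIME_WAIT sockets per foreign address in `netstat -an` output."""
    counts: Counter = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: proto, recv-q, send-q, local, foreign, state
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "TIME_WAIT":
            counts[fields[4]] += 1
    return counts


def print_top_destinations(limit: int = 5) -> None:
    """Run netstat on this host and print the busiest TIME_WAIT destinations."""
    output = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    ).stdout
    for destination, count in time_wait_by_destination(output).most_common(limit):
        print(f"{count:6d}  {destination}")
```

Keeping the parser separate from the subprocess call means it can be fed saved netstat output later, which is useful when the host is too wedged to run much of anything.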

Port 8888 belonged to a self-hosted inference server I run on the same Mac, used for local LLM workloads. Some component of it was opening new TCP connections to itself on a steady cadence and never reusing them.

#The kernel was stuck

Before identifying the exact code path, I tried to recover without rebooting. I lowered net.inet.tcp.msl to 1. macOS reads that value in milliseconds and TIME_WAIT lasts twice the MSL, so existing entries should have expired almost immediately:

$ sudo sysctl -w net.inet.tcp.msl=1
$ sleep 5
$ netstat -an | grep -c TIME_WAIT
17546

The count did not move across repeated checks. The TIME_WAIT timer was no longer firing on existing entries. The TCP state machine had reached a state from which it was not recovering at runtime. A reboot was the only way out.

After the reboot the host was healthy. New TCP connections worked. AdGuard recovered as soon as it could open sockets again. The LAN came back. The diagnosis was still incomplete.

#Catching the source

Post-reboot, with the system idle, I left a monitor running on the TIME_WAIT count and the source ports of new connections to 127.0.0.1:8888:

21:14:11  TIME_WAIT=11  top: 7 127.0.0.1.8888 | source port 49401
21:14:31  TIME_WAIT=11  top: 7 127.0.0.1.8888 | source port 49408
21:14:47  TIME_WAIT=9   top: 7 127.0.0.1.8888 | source port 49413
21:15:07  TIME_WAIT=9   top: 7 127.0.0.1.8888 | source port 49417

A new source port roughly every five seconds. That is a poll opening a fresh connection each time, not connection reuse. In a steady state with one polling caller and a five-second interval, this produces six or seven sockets in TIME_WAIT at any given moment (a roughly 30-second TIME_WAIT lifetime divided by the interval), indefinitely. Harmless at idle. The relationship is linear with poll rate, which means any burst scales the residue accordingly.
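The linear relationship can be written down as back-of-the-envelope arithmetic. The sketch below assumes a TIME_WAIT lifetime of about 30 seconds, which is what the six-or-seven figure implies at a five-second interval, and it uses the 16,384-port range quoted earlier. The function names are hypothetical, and the model ignores everything else on the host that consumes ports.

```python
EPHEMERAL_PORTS = 65535 - 49152 + 1  # 16384, the range quoted earlier
TIME_WAIT_SECONDS = 30.0  # assumption: inferred from the idle measurements


def steady_state_time_wait(connections_per_second: float,
                           lifetime: float = TIME_WAIT_SECONDS) -> float:
    """Average number of sockets in TIME_WAIT: arrival rate times lifetime."""
    return connections_per_second * lifetime


def exhausting_rate(ports: int = EPHEMERAL_PORTS,
                    lifetime: float = TIME_WAIT_SECONDS) -> float:
    """Sustained connection rate at which TIME_WAIT alone fills the pool."""
    return ports / lifetime


# One poll every five seconds, as observed at idle.
print(steady_state_time_wait(1 / 5))
# Connections per second needed to park the whole pool in TIME_WAIT.
print(round(exhausting_rate(), 1))
```

Under these assumptions the idle poll sits around six sockets, while exhausting the pool takes a sustained rate in the hundreds of new connections per second. That gap is why the idle measurement alone does not explain the outage.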

lsof confirmed both ends. The desktop component of the inference server was holding the client end of a connection on 127.0.0.1. Its own Python backend was holding the server end on 127.0.0.1:8888. Two processes I owned, running on the same Mac, talking to each other through the loopback interface on a five-second cadence.

#The pattern, not the project

The polling code was the smoking gun:

def check_health(self) -> bool:
    try:
        session = requests.Session()
        session.trust_env = False
        response = session.get(self._get_health_url(), timeout=2)
        return response.status_code == 200
    except requests.RequestException:
        return False

A requests.Session instantiated inside the function, used once, then discarded when the function returns. The Python garbage collector closes the underlying connection. The kernel parks it in TIME_WAIT for the configured lifetime.

This is one of the most common ways to misuse requests.Session. A Session's value lies in connection pooling across calls, and pooling only works if the Session outlives any individual call. A function-local Session has the setup cost of a Session and the connection behavior of a bare requests.get(). The runtime cost looks identical until something starts calling it on a fast cadence.

The fix is two lines: move the Session into the object’s constructor, reuse it on every call. The connection stays open. No TIME_WAIT accumulation per poll.

def __init__(self, config):
    ...
    self._health_session = requests.Session()
    self._health_session.trust_env = False

def check_health(self) -> bool:
    try:
        response = self._health_session.get(
            self._get_health_url(), timeout=2
        )
        return response.status_code == 200
    except requests.RequestException:
        return False
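The difference between the two versions can be observed without the inference server. The sketch below uses only the standard library: http.client stands in for requests.Session, and a throwaway local HTTP server records the source port of every request it receives. All names here are illustrative and nothing in it comes from the upstream project.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

seen_client_ports = []  # source port of every request the server receives


class HealthHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive between requests

    def do_GET(self):
        seen_client_ports.append(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass


def poll_with_fresh_connections(port: int, polls: int) -> None:
    """Mimics the function-local pattern: one new connection per poll."""
    for _ in range(polls):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
        conn.request("GET", "/health")
        conn.getresponse().read()
        conn.close()  # every close leaves one more socket in TIME_WAIT


def poll_with_one_connection(port: int, polls: int) -> None:
    """Mimics the fixed pattern: one long-lived connection for all polls."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    for _ in range(polls):
        conn.request("GET", "/health")
        conn.getresponse().read()
    conn.close()


def distinct_source_ports(poll, polls: int = 5) -> int:
    """Run one polling strategy against a local server, count ports used."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    seen_client_ports.clear()
    try:
        poll(server.server_address[1], polls)
    finally:
        server.shutdown()
        server.server_close()
    return len(set(seen_client_ports))
```

Comparing distinct_source_ports(poll_with_fresh_connections) with distinct_source_ports(poll_with_one_connection) should show several source ports for the first and a single one for the second, which is the same signature the monitor output showed on the real host.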

I wrote a test that asserts the Session is created exactly once across multiple check_health calls, watched it fail on the original code, applied the fix, watched it pass. The change went upstream as a pull request and has since been merged.
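The test from the pull request is not reproduced here. As a rough sketch of the idea, the version below uses a hypothetical HealthChecker class and an injected factory in place of requests.Session, so it runs with the standard library alone. The class name, the constructor signature, and the /health URL are all assumptions made for illustration.

```python
from unittest import mock


class HealthChecker:
    """Hypothetical stand-in for the class that owns check_health."""

    def __init__(self, health_url: str, session_factory):
        self._health_url = health_url
        # Created once here and reused by every check_health call.
        self._health_session = session_factory()

    def check_health(self) -> bool:
        try:
            response = self._health_session.get(self._health_url, timeout=2)
            return response.status_code == 200
        except OSError:  # requests.RequestException is a subclass of OSError
            return False


def test_session_is_created_once():
    session_factory = mock.Mock(name="Session")
    session_factory.return_value.get.return_value.status_code = 200

    checker = HealthChecker("http://127.0.0.1:8888/health", session_factory)
    results = [checker.check_health() for _ in range(3)]

    assert results == [True, True, True]
    # The regression guard: three polls, exactly one Session construction.
    session_factory.assert_called_once_with()
    assert session_factory.return_value.get.call_count == 3
```

Moving the session_factory() call back inside check_health makes assert_called_once_with fail with three recorded calls, which is the failing state the test is meant to catch.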

Update (June 11): while preparing the PR I found the same function-local-Session pattern at two more call sites in the same codebase. Neither sits on a polling hot path, so they did not contribute to this outage, but it confirms the point of this section: the pattern travels. It rarely appears just once.

#What I have not proven

I want to be careful about the scope of the conclusion. I observed the wedged TCP stack, I observed that the overwhelming majority of stuck sockets were destined for the inference backend, and I observed the function-local Session pattern that scales the socket count linearly with poll rate. I did not capture the storm in the act. I have no log evidence of what drove the polling rate above steady state on the day of the outage.

The honest framing is that the pattern made the failure mode reachable. With a single reused Session, even a hundred-times-faster polling loop would not exhaust the port range, because there would be one connection instead of thousands. Fixing the pattern closes the path, even if I cannot point to the exact trigger that walked the system down it.

#Why this happens at home

The interesting part of this incident is not the bug. The bug is small, fixable in two lines, and the pattern is well documented. The interesting part is the failure mode that becomes possible when everything you depend on runs on a single machine.

In a cloud setup, the DNS server, the inference workload, the reverse proxy, and the family chat bot would each be a separate container on a separate node, with its own kernel, its own port range, and its own blast radius. A buggy HTTP client in one service would degrade that service. It would not silently take down DNS for unrelated devices in another room.

On a self-hosted Mac, they all share one kernel. They share one TCP stack and one ephemeral port range. One misbehaving caller can starve every other process on the host of outbound connectivity, and the effect propagates across services that have no other relationship to each other. The DNS server fails not because of anything it did, but because the host underneath it ran out of resources.

This is the trade-off of running your own stack. A cloud provider's pricing page hides a lot of failure modes, because part of what you pay for is isolation. When you put everything on one machine, you also put everything in scope for any one of them.

#What I take from this

The diagnostic time was not wasted, even though the patch is two lines. I learned something concrete about how a TCP source-port pool behaves under sustained pressure on macOS. I learned that the TIME_WAIT cleanup timer is not as resilient as I had assumed. I learned to put a steady-state TIME_WAIT monitor on the host so the next pile-up does not need a family complaint to be detected.
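A steady-state monitor of that kind does not need much. The following is a minimal sketch, not the monitor actually running on the host: it assumes netstat is on the PATH, the 25 percent threshold and 30-second interval are arbitrary starting points, and the function names are invented. The total TIME_WAIT count is only a rough proxy for pool usage, since the same source port can be in use towards different destinations.

```python
import subprocess
import time
from typing import Callable, Optional

EPHEMERAL_PORTS = 65535 - 49152 + 1  # 16384, the pool size discussed earlier


def count_time_wait() -> int:
    """Count TIME_WAIT sockets by scanning `netstat -an` output."""
    output = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in output.splitlines() if "TIME_WAIT" in line)


def pool_pressure(time_wait: int, pool_size: int = EPHEMERAL_PORTS) -> float:
    """TIME_WAIT count relative to the ephemeral pool (can exceed 1.0)."""
    return time_wait / pool_size


def watch(read_count: Callable[[], int] = count_time_wait,
          alert: Callable[[str], None] = print,
          threshold: float = 0.25,
          interval: float = 30.0,
          iterations: Optional[int] = None) -> None:
    """Poll the TIME_WAIT count and alert when it crosses the threshold."""
    done = 0
    while iterations is None or done < iterations:
        count = read_count()
        if pool_pressure(count) >= threshold:
            alert(f"TIME_WAIT={count} ({pool_pressure(count):.0%} of pool)")
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval)
```

The alert callable is left pluggable on purpose. On a home server it could post to whatever chat or notification channel the family already reads, so the warning arrives before the complaint does.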

I also walked away with a small contribution to a project I use, which is the part I value most. The bug existed before I noticed it, and would have existed for the next person whose polling rate spiked. It is fixed now. That is the upside of running open-source software on hardware you own: when something breaks, you have the option to read the code and fix it, rather than waiting for someone else to.

The trade-off is real. The investment is also real. The next outage will find a different corner of the system, and I will be the one who has to learn it.

That is the burden of running your own stack at home. It is also why I keep doing it.