Extending nsncd with host lookup support

This article describes how we[1] extended nsncd to support NSS host lookups, and made it available in NixOS 22.11 as a drop-in alternative to nscd.

What is NSS, how does it work, why should I care?

NSS is what a Linux system uses under the hood to translate users, groups and hosts[2] from names to numbers/IPs (and back).

You have probably heard of DNS and /etc/hosts, which are used to look up hostnames, but on a modern system there are a bunch of other (dynamic) sources to choose from, such as:

  • Local network device discovery (Zeroconf / Avahi)
  • Names of containers running on your machine
  • *.local hostnames, which come in handy for multi-vhost testing

Similarly, user and group names might be provided by a directory service (LDAP etc.).

All of these lookups are provided by the NSS (Name Service Switch) mechanism, which is part of glibc, a low-level system library used in most binaries.
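To make this concrete, here is a minimal sketch of two such lookups. It assumes CPython on a glibc-based Linux system, where pwd.getpwuid and socket.getaddrinfo call the libc functions of the same name and therefore go through NSS. The helper names lookup_user and lookup_host are made up for this example.

```python
import pwd
import socket


def lookup_user(uid):
    """Resolve a numeric UID to a user name via the passwd database."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None  # no configured NSS source knows this UID


def lookup_host(name):
    """Resolve a host name to a sorted list of addresses via the hosts database."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []  # no configured NSS source could resolve the name
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print(lookup_user(0))
    print(lookup_host("localhost"))
```

Neither helper knows or cares whether the answer came from a file, DNS, or some other source; that choice is made entirely by the NSS configuration.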

glibc reads /etc/nsswitch.conf for the list of configured NSS modules, and then queries each of these in the defined order.
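The file format is simple enough to read by hand. The following sketch parses it into a mapping from database name to the ordered list of sources. The sample configuration is hypothetical, and the parser only handles the common syntax (comments, source names, and bracketed action items such as [!UNAVAIL=return], which it skips).

```python
import re

# Bracketed action items like [NOTFOUND=return] modify control flow; they are not sources.
ACTION = re.compile(r"\[[^\]]*\]")


def parse_nsswitch(text):
    """Map each database name to its ordered list of NSS sources."""
    databases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        database, _, rest = line.partition(":")
        databases[database.strip()] = ACTION.sub(" ", rest).split()
    return databases


SAMPLE = """\
# hypothetical example configuration
passwd: files systemd
hosts:  mymachines resolve [!UNAVAIL=return] files myhostname dns
"""

if __name__ == "__main__":
    for database, sources in parse_nsswitch(SAMPLE).items():
        print(database, "->", sources)
```

In the sample, a host lookup would try mymachines first and dns last.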

Usually, this file is configured by the system administrator according to the desired local configuration.

What’s problematic with it?

Every NSS module is essentially just a libnss_*.so file that is dlopen()’ed from a well-known location into the running process on the first lookup.

  • On regular distros, this works as long as the .so files are mostly compatible, but long-running processes (or just old binaries) dlopen()’ing new NSS modules can segfault the binary.[3]

  • Nix-built binaries running on non-NixOS systems can’t find the NSS modules specified in the host’s /etc/nsswitch.conf, because a Nix-built glibc only knows how to load the most basic NSS modules (the ones shipped with glibc directly) and doesn’t look in /usr/lib.

nscd

On NixOS (and GUIX), this has so far been worked around by making use of nscd.

nscd was meant as a “caching daemon” for NSS requests. If glibc sees a Unix socket at /var/run/nscd/socket, it tries to connect to it and run its queries through it, using an undocumented, but somewhat stable binary protocol.
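Since the protocol is undocumented, the following is only a sketch of how a request appears to be framed, going by glibc’s nscd/nscd-client.h: three native 32-bit integers (protocol version, request type, key length) followed by the NUL-terminated key. The constant values used here (version 2, GETHOSTBYNAME = 4) are assumptions taken from that header and should be checked against the glibc version in use. The helper name build_request is made up for this example.

```python
import struct

# Assumed values, taken from glibc's nscd/nscd-client.h; verify before relying on them.
NSCD_VERSION = 2
GETHOSTBYNAME = 4


def build_request(request_type, key):
    """Encode an nscd-style request: a header of three native ints, then the key."""
    payload = key.encode() + b"\0"  # the key length includes the trailing NUL
    header = struct.pack("=iii", NSCD_VERSION, request_type, len(payload))
    return header + payload


if __name__ == "__main__":
    print(build_request(GETHOSTBYNAME, "example.org").hex())
```

The response format differs per request type and is not covered here.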

That daemon can be “steered appropriately” to find the NSS modules specified in /etc/nsswitch.conf[4], or in the case of non-NixOS, use the host-provided NSS modules from /usr/lib or similar.

As only nscd takes care of the dlopen() calls, segfaults and problems with ABI incompatibilities are minimized.

Problems with nscd

However, using nscd exposed some problems:

Caching, even when disabled

We tried hard to disable caching in nscd.

Yet, there were occurrences where nscd still seemed to cache results.

Occasionally getting stuck

When roaming around in various WiFi networks, especially those with captive portals, I often experienced entirely “stuck” DNS lookups for tens of seconds.

Sometimes a systemctl restart nscd helped, sometimes it did not.

Search for alternatives

This problem kept popping up over and over again. We tried some of the alternatives, such as versioned import paths, or using some of the alternative nscd implementations, but ultimately none of them supported the feature set we needed to be an nscd replacement.

nsncd

We ultimately decided to add support for host lookups to nsncd, a non-caching nscd alternative written in Rust that already supported most of the other lookup types.

This required understanding a lot of very hard-to-read glibc code, combined with using sockdump to stare at the bits going over the wire, and re-implementing the various lookup methods required for it.

We also added support for sd_notify readiness signalling.
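For readers unfamiliar with it, readiness signalling boils down to sending a short datagram to the Unix socket named in the NOTIFY_SOCKET environment variable. The sketch below illustrates that general protocol; it is not nsncd’s actual (Rust) implementation, and the function name sd_notify_ready is made up for this example.

```python
import os
import socket


def sd_notify_ready(environ=os.environ):
    """Send READY=1 to the service manager; return False if none is listening."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not started by a service manager that expects notifications
    if path.startswith("@"):
        address = b"\0" + path[1:].encode()  # Linux abstract socket namespace
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", address)
    return True
```

A daemon would call this once its listening socket is set up, so that units ordered after it only start when lookups can actually be answered.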

A NixOS test was added that verifies matching NSS lookup behaviour for both nscd and nsncd, and provides wire format dumps via sockdump.

We also added a NixOS option, services.nscd.enableNsncd, which can be set to true to use nsncd instead of nscd.

We plan to flip the default for the release after NixOS 22.11. Please give this some testing!

Upstreaming

The host lookup patches are still under review by the upstream maintainer, Two Sigma. For now, the nixpkgs version points to a fork maintained in the nix-community project, but obviously, having host lookups “just work” in the “official” nsncd package will make it much easier for users on non-NixOS systems to install it.

We’re working with upstream to hopefully get this merged in some form.

Future work

nsncd: Wire Tests

We want to include more wire format unit tests of various lookup responses into nsncd itself. andi found a bug when looking up IPv6-only hosts that some glibc clients handled ungracefully.

nsncd: Socket activation

While nsncd is very quick to start up, we still would like to see it being socket-activated to prevent failed lookups early during boot, or when switching to a new NixOS configuration and restarting nsncd while doing so.

Initially nsncd had support for being socket-activated, but that got removed due to some deadlocks.

It might have gotten fixed by a recent systemd commit and should probably be re-evaluated.
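As a reminder of what socket activation involves on the daemon side, here is a minimal sketch of the environment-variable handshake described in sd_listen_fds(3): systemd sets LISTEN_PID and LISTEN_FDS, and the inherited descriptors start at file descriptor 3. The function name inherited_fds is made up for this example, and this is not the code that was removed from nsncd.

```python
import os

SD_LISTEN_FDS_START = 3  # first inherited descriptor, per sd_listen_fds(3)


def inherited_fds(environ=os.environ, pid=None):
    """Return the file descriptors passed in by systemd socket activation."""
    pid = os.getpid() if pid is None else pid
    if environ.get("LISTEN_PID") != str(pid):
        return []  # unset, or the descriptors were meant for another process
    try:
        count = int(environ.get("LISTEN_FDS", "0"))
    except ValueError:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))
```

Because systemd already holds the listening socket, clients connecting during a restart are queued instead of being refused.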

nsncd: Use client namespace

We also got feedback from some users that they disable ns(n)cd because they run some workloads in a separate network namespace, where everything is tunneled via a VPN, and don’t want to leak DNS lookups to the untunneled connection. We should investigate whether we can detect the network namespace of the connecting client and do the lookup in that namespace, rather than in the host namespace.
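One conceivable way to do the detection part, sketched here under the assumption of a Linux host: ask the kernel for the peer’s credentials with SO_PEERCRED, then read the peer’s network namespace link under /proc. Both function names are made up for this example, and nothing like this exists in nsncd today.

```python
import os
import socket
import struct

UCRED = "iII"  # struct ucred: pid_t pid, uid_t uid, gid_t gid


def peer_pid(conn):
    """Return the PID of the process on the other end of a Unix socket (Linux only)."""
    creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize(UCRED))
    pid, _uid, _gid = struct.unpack(UCRED, creds)
    return pid


def netns_of(pid):
    """Return the namespace identifier link target of /proc/<pid>/ns/net."""
    return os.readlink(f"/proc/{pid}/ns/net")
```

Comparing netns_of(peer_pid(conn)) with netns_of(os.getpid()) would tell the daemon whether the client lives in a different namespace. Actually performing the lookup there would additionally require entering that namespace, which needs privileges and is not attempted in this sketch. The PID could also be reused between the two calls.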

glibc: NIX_GLIBC_NSS_PATH

There’s a nixpkgs PR adding a patch to glibc that makes it look for NSS modules in an additional path, without affecting any other module loading paths.

It could serve as a workaround for:

  • the “host lookup network namespace leakage problems” described above
  • non-NixOS distributions where ns(n)cd can’t be run at all, but the user is super sure that the NSS modules pointed to are compatible with the binaries being run

Even with all these improvements on nsncd, we should probably still include this somehow. Ideally, reach out to glibc upstream, and see if something like this can be added.

glibc: Simplify client code

The current glibc nscd client code is pretty convoluted. In some cases, it asks for a file descriptor pointing to the internal nscd cache structures “to look if the response is there already”, and then uses some client-side logic to extract it from there. The protocol also has some more commands for shutdown and for flushing the cache.

None of these commands are really desirable for a non-caching implementation that simply acts as a dispatcher, so the client code could probably be simplified a lot, or rewritten, to stop using the other request types.

This should be in line with Fedora’s choice to remove nscd in Fedora 36 and discussion around a simplification.

We should use nsncd for these use cases, and get the nscd client code simplified.


  1. This is mostly me and NinjaTrappeur, while helping a NumTide customer, OTTO Motors.

  2. There are some more “databases” it provides lookups for; check nsswitch.conf for the full list.

  3. https://github.com/NixOS/nixpkgs/pull/138178#issuecomment-925104467, https://github.com/erikarvstedt/check-glibc-compatibilities/

  4. This is usually accomplished by setting its LD_LIBRARY_PATH to all the NSS module paths configured in the host OS.