Understanding Nix's String Context

Sunday, August 5, 2018

Note: This post assumes some familiarity with and interest in Nix. In particular, a basic understanding of the distinction and relationship between the Nix language and the Nix store is required.

In the Nix expression language, all strings have some metadata associated with them called "context". At a high level, this context is used to track dependencies in our build scripts to ensure we have all of the packages we need available during our builds. In this post, we'll go into more detail about why this context is necessary, how to interpret it, and the technical details of how this feature interacts with other aspects of the language.

Why string context?

From an overly operational standpoint, we can think of Nix-lang as being a domain-specific language for writing build scripts (usually Bash). As a first attempt at designing such a language, we might focus on a language with good string manipulation facilities, making it easy to combine strings, interpolate values, convert values to strings, etc., since ultimately we'll be expressing these scripts as strings¹ for Bash to run. But Nix is a bit more ambitious than this: We don't just want to define the build actions themselves, we want to capture the dependency relationships between builds as part of their definition.

From the perspective of the Nix store, build scripts (that is, drv files) include specifications of what input sources they depend on and what build outputs of other builds they depend on. The input sources are represented as a store path pointing to a valid path in the Nix store, while the build outputs of other builds are essentially² represented as a store path pointing to another drv file, which represents the build that must complete successfully to create the desired dependency. So if Nix-lang wants to create Nix-store build scripts with the right information, we need some way to keep track of build dependency information in a way that lets us get at the relevant store paths when we're creating the drv file. Since we're building these scripts by concatenating and interpolating strings, Nix-lang tracks this dependency information as metadata associated with strings themselves; this metadata is known as "string context."

The semantics of string context

Speaking conceptually, every string in Nix-lang has associated with it a set of specifications of store paths. There are two³ ways these store paths are specified, mirroring the two kinds of inputs Nix-store drv files can represent: A specification is either a plain store path, representing any valid path in the store, or an output of a derivation, representing the result of running the build specified by a specific drv file (which may not yet have been built). For example, a string representing a script to compile some C program might have /nix/store/0011c1sr9cm6s7ak0mj1wn0574d1dfkb-hello.c as a plain store path in its context, tracking that build's dependency on hello.c as a source input, as well as /nix/store/hzynaqawj3j3zrshhc85bjd6j502zhrk-gcc-7.3.0.drv as a derivation output store path in its context, tracking that build's dependency on having built gcc 7.3.0 first.

How are we to interpret this set of store paths? I think of it this way: If a Nix-lang string S has context set C, then we can only meaningfully use S if all of the plain store paths in C actually exist as valid paths in the Nix store and all of the derivation outputs specified in C have been built. There is one qualification: derivation output specifications have a kind of call-by-need laziness associated with them, so some (though not all) usages of S are valid before the input derivations have been built. This is all very abstract and intentionally vague⁴ about what "meaningfully use" actually means, though, so let's walk through some examples of different ways strings can be used in Nix-lang:

If we have an attribute set, we can use a string to look up a particular attribute: builtins.getAttr "a" { a = 1; } evaluates to 1. What would it mean to try to look up a string with context in an attribute set? It would mean that the attribute lookup operation can only yield a meaningful result if the relevant store paths are valid. But attribute lookup is just a pure string equality comparison, and the existence or non-existence of some path on the filesystem can't impact that, so if we claim that the string's meaning depends on some store paths (i.e., the string has a non-empty context) we must be doing something wrong. And indeed, Nix throws an error if you call builtins.getAttr with a string with non-empty context.
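As a rough sketch of this behavior, the expression below uses builtins.toFile purely as a convenient way to obtain a string with non-empty context; the file name and contents are arbitrary placeholders, not anything from a real package.

```nix
let
  attrs = { a = 1; };

  # A plain string literal: empty context.
  plainKey = "a";

  # builtins.toFile returns a store path as a string, so the result
  # carries that store path in its context. The name and contents
  # are placeholders for this sketch.
  contextKey = builtins.toFile "a" "";
in {
  # Evaluates to 1, as in the text above.
  ok = builtins.getAttr plainKey attrs;

  # Forcing this attribute should abort evaluation with an error,
  # because the key has non-empty context.
  bad = builtins.getAttr contextKey attrs;
}
```

Since Nix is lazy, ok can be evaluated on its own; the error should only show up once bad is forced.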

From a given Nix expression, you can evaluate an expression defined in some other file by import-ing it. Typically import is called with a path argument, but it can be called with a string argument; that string must be an absolute path to the file defining the expression. What would it mean to try to import a string with context? It would mean that the expression cannot be evaluated if the relevant store paths aren't valid. Unlike in the attribute lookup case, we can come up with cases where this makes sense: suppose package foo in nixpkgs installs some helper Nix expression into its output. We can determine that foo's output will be at, say, /nix/store/6ks8cw9np947wr990a87r12zlm6whkk9-foo, but it's not enough to simply say import "/nix/store/6ks8cw9np947wr990a87r12zlm6whkk9-foo/foo-helper.nix", as that string doesn't contain enough information on its own to determine the contents of the expression we're importing; we need to actually use the string as a pointer into the store, and dereferencing that pointer in turn requires the store path to actually be valid. If we instead write import "${pkgs.foo}/foo-helper.nix", everything works as expected, because "${pkgs.foo}/foo-helper.nix" is a string with context denoting that the derivation defined in pkgs.foo must be built before we can use it, in this case before we can import it.
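Since pkgs.foo is hypothetical, here is a self-contained sketch of the same idea. It assumes <nixpkgs> is on your NIX_PATH and uses pkgs.writeText as a stand-in for a package that installs a helper Nix expression; the file name and its contents are made up for illustration.

```nix
{ pkgs ? import <nixpkgs> { } }:

let
  # Stand-in for the hypothetical pkgs.foo: a derivation whose output
  # is a single file containing a Nix expression.
  helper = pkgs.writeText "foo-helper.nix" ''
    { greeting = "hello from the store"; }
  '';

  # Interpolating the derivation yields its output path as a string
  # whose context records that the derivation must be built first.
  helperPath = "${helper}";
in
  # import has to read the file, so Nix builds `helper` during
  # evaluation before it can continue.
  (import helperPath).greeting
```

Evaluating this should trigger a build of the helper derivation in the middle of evaluation, which is exactly the dependency the context is tracking.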

Finally, we can of course use strings to define the environment and commands we want to run during a build. Nixpkgs's stdenv.mkDerivation, for example, lets you provide a buildCommand attribute as a string that is expected to be a Bash snippet that performs the core logic of the build. What would it mean for the buildCommand string to have context? It would mean that we can't actually run the snippet unless all of the relevant paths exist and the builds it depends on have completed successfully. Let's suppose package foo in nixpkgs contains some utility that can generate a useful build artifact. It's not enough to just say buildCommand = "/nix/store/6ks8cw9np947wr990a87r12zlm6whkk9-foo/bin/do-foo --bar=false";, as that string on its own doesn't contain enough information to actually execute the do-foo program; we actually need to use the first part of that line as a pointer into the store, and dereferencing that pointer (calling the program from Bash) requires that the store path is actually valid. If instead we say buildCommand = "${pkgs.foo}/bin/do-foo --bar=false";, everything works as desired, since that string has context denoting that it requires pkgs.foo to be built to be meaningfully used as a shell snippet. Note that this case is a bit different from the import case, though: If we write import "${pkgs.foo}/foo-helper.nix", pkgs.foo must be built before we can evaluate that expression at all. On the other hand, stdenv.mkDerivation { name = "baz"; buildCommand = "${pkgs.foo}/bin/do-foo --bar=false"; } can be evaluated as a Nix-lang expression without actually building pkgs.foo itself; it's only when we actually run the build defined there that pkgs.foo must be built (this is an example of the laziness referred to above).
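To make the laziness concrete, here is a minimal sketch that swaps the hypothetical pkgs.foo for pkgs.hello, a real package in nixpkgs. It assumes <nixpkgs> is on your NIX_PATH; the derivation name is taken from the example above.

```nix
{ pkgs ? import <nixpkgs> { } }:

pkgs.stdenv.mkDerivation {
  name = "baz";

  # pkgs.hello stands in for the hypothetical pkgs.foo. The
  # interpolation puts hello's output into the context of this
  # string, which is how it ends up as an input of the drv file.
  buildCommand = ''
    ${pkgs.hello}/bin/hello > $out
  '';
}
```

Instantiating this expression (for example with nix-instantiate) only needs to write drv files. It is when the build is realised (for example with nix-build) that hello has to be built or substituted first.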

A digression into coercions

To get a full handle on what strings in a given Nix expression have what context, we need to understand how different values become strings in the first place. Due to the centrality of strings in the language, non-string values are often automatically coerced to strings, but which values can be coerced, and what the coercion produces, depend on the specific construct that demands it.

There are two axes along which coercions to string can vary. The first, called coerceMore in the C++ codebase, determines whether it's an error if the value being coerced is something like a list or an integer or whether we recurse into lists and convert scalar values. The second, called copyToStore, determines whether path values are simply converted to a string representation of their absolute path or whether the file/directory at the path in question is added to the store first, with the coercion result being the resulting store path. Copying to store connects us back to string context: When a path is copied to the store, the string that gets returned has that store path as its context.

Most builtins and operators which do string coercion, such as string concatenation, don't coerce more but do copy to store. So "${[ "string" ]}-concatenation" will be an error, since we don't recurse into lists, while "${./string}-concatenation" will copy the ./string file into the store and result in something like "/nix/store/wrd12y30yvlwwpilssbkly81964kab6p-string-concatenation".

The toString builtin does coerce more but does not copy to store. So toString [ "string" "list" ] converts to "string list", while toString ./file converts to something like "/home/shlevy/file".
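The difference between these two coercion modes can be sketched in one attribute set. The ./file path is a hypothetical file sitting next to the expression; it has to exist for the interpolation case, since that is the one that copies it into the store.

```nix
let
  # Hypothetical file next to this expression.
  file = ./file;
in {
  # Interpolation copies to the store, so the result has context.
  interpolated = "${file}-concatenation";
  interpolatedHasContext = builtins.hasContext "${file}-concatenation"; # true

  # toString coerces more: list elements are joined with spaces.
  list = toString [ "string" "list" ]; # "string list"

  # toString does not copy to the store: the result is the plain
  # absolute path, with no context.
  path = toString file;
  pathHasContext = builtins.hasContext (toString file); # false

  # Interpolation does not coerce more, so forcing this attribute
  # should be an evaluation error.
  bad = "${[ "string" ]}-concatenation";
}
```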

Finally, the derivation builtin coerces each⁵ attribute to a string, coercing more and copying to store. So derivation { foo = [ "string" "list" ]; bar = ./file; } would have the foo environment variable set to "string list", and would copy the ./file file into the store and have the bar variable set to something like /nix/store/wrd12y30yvlwwpilssbkly81964kab6p-file.
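The derivation call above leaves out the attributes the builtin requires, so here is a fuller sketch. The name, the /bin/sh builder and the args are placeholders I've picked for illustration (whether /bin/sh is available depends on your sandbox settings), and ./file is again a hypothetical file next to the expression.

```nix
derivation {
  # name, system and builder are required by the derivation builtin;
  # the values here are placeholders for this sketch.
  name = "coercion-demo";
  system = builtins.currentSystem;
  builder = "/bin/sh";

  # args is the exception noted in the footnote: it stays a list of
  # things that are each coercible to a string.
  args = [ "-c" ''echo "$foo" > "$out"'' ];

  # Coerced more: the builder sees foo as "string list".
  foo = [ "string" "list" ];

  # Copied to the store: the builder sees bar as a store path.
  bar = ./file;
}
```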

Operators and builtins

To round out our understanding of context, we need to understand how each language construct interacts with strings with context.

String literals have no context. When paths are coerced to strings in a copy-to-store context, such as "${./foo}", the resulting string has that path in the store as its context. The outPath attribute of a derivation has that derivation's output in its context, while the drvPath attribute has all of the derivation's outputs in its context. Concatenation, whether via interpolation or the concatenation operator, unions the context of the concatenated strings.
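Here is a small sketch of those rules, using builtins.hasContext to observe the result. As before, builtins.toFile is only a convenient source of a string with context, and the file name and contents are placeholders.

```nix
let
  # A literal: no context.
  literal = "cat ";

  # A store path string: has that store path as context.
  stored = builtins.toFile "input.txt" "some input";
in {
  literalHasContext = builtins.hasContext literal; # false
  storedHasContext = builtins.hasContext stored; # true

  # Both forms of concatenation union the contexts of their parts,
  # so context from `stored` survives in the result.
  plusHasContext = builtins.hasContext (literal + stored); # true
  interpolatedHasContext = builtins.hasContext "${literal}${stored}"; # true
}
```

Note that hasContext only tells us whether the union is non-empty, not which paths it contains.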

Nix's operators and builtins treat strings with context in four ways, depending on their relevant semantics. Some builtins, such as builtins.getAttr, simply throw an error if passed a string with context. Some, such as string concatenation or the derivation builtin, propagate the context in some way to their results but don't need to actually perform any builds at the time of evaluation. Some, such as import, build any needed derivations during evaluation⁶. Finally, some builtins, such as throw, simply ignore their context.

Construct	Context Handling	Notes
scopedImport	Build during eval	import is a special case of scopedImport. import of a drv file directly does not build that drv.
importNative	Build during eval
exec	Build during eval
abort	Ignore
throw	Ignore
getEnv	Error
derivation	Propagate	Context is propagated via the inputs of the store derivation. The returned output and drv paths have context corresponding to those builds.
pathExists	Error
baseNameOf	Propagate	The resulting string shares the context of the input path
dirOf	Propagate	The resulting path shares the context of the input path, if any
readFile	Build during eval
findFile	Error	The second argument is expected to be a string
readDir	Build during eval
fromJSON	Error
toFile	Error/Propagate	The name can't have context; the body's context is added as references of the resulting file, unless they are drv outputs (which is an error)
getAttr / .	Error
hasAttr / ?	Error
removeAttrs	Error
toString	Propagate
substring	Propagate
stringLength	Ignore
hasContext	Propagate	Turns context existence or lack thereof into a boolean
hashString	Propagate
split	Error/Propagate	Regular expression can't have context
concatStringsSep	Propagate
replaceStrings	Propagate
parseDrvName	Error
compareVersions	Error
splitVersion	Error
+	Propagate
${}	Propagate

Language construct interactions with string context, as of Nix 2.1
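To see one row from each of the four categories side by side, here is a sketch in the same style as before. The file name and contents passed to builtins.toFile are placeholders.

```nix
let
  # A store path string with non-empty context.
  stored = builtins.toFile "demo.txt" "contents";
in {
  # Propagate: the substring keeps the context of its input.
  prefixHasContext =
    builtins.hasContext (builtins.substring 0 10 stored); # true

  # Ignore: the length is just an integer.
  length = builtins.stringLength stored;

  # Build during eval: nothing needs building here, since the path
  # is already valid, but the context is what gets checked.
  contents = builtins.readFile stored; # "contents"

  # Error: forcing this attribute should fail, since hasAttr
  # refuses strings with context.
  bad = builtins.hasAttr stored { };
}
```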

Future directions

In Nix as it exists today, there is very little we can do to introspect on or manipulate context directly. The hasContext builtin can tell us whether a given string has non-empty context at all, but we can't extract the set of paths. In the future, it may be desirable to be able to attach arbitrary metadata as context and introspect on it later; this could give us functionality like pure derivation poisoning, which would, for example, let us cause certain packages to fail to evaluate (e.g. for licensing issues) when they or their dependents are built without breaking query-style evaluations of the whole package set.

¹

My internal Rúnar Bjarnason is crying at the complete lack of structure here as I write this. Maybe some day someone will come up with a clean constrained language for describing general build processes, but for now this is what we've got.

²

This is an oversimplification, since Nix builds can have multiple outputs and a given build need not depend on all of them.

³

In the C++ Nix evaluator and in the WIP Haskell implementation, there are three kinds of specifications. The third, though, is simply shorthand for "all of the outputs of this specific drv file", and can conceptually be replaced by an individual context element for each individual output the drv file specifies.

⁴

The vagueness of parametricity, of course 😉

⁵

Except the args attribute, which is expected to be a list of coercible-to-strings.

⁶

This is how the "import from derivation" feature is implemented.