magical shell utilities: ld

Written: 2020-10-31 to 2021-03-04

A surprisingly huge amount of problems can be reduced to simple graph traversal. Every time you order your compiler to build your spaghetti projects, that mess of function calls has to be tsorted in order for the linker to figure out where your code actually jumps to. Even well-organized projects will have their functions separated into distinct modules. This makes it easier for the programmer to reason about how their code fits together, but creates a difficult challenge for the compiler to figure out.

tsort itself was originally created as a linker utility, according to the GNU team. Unless you’re compiling everything statically, like a true suckless adherent, you’re leaving a lot of work for ld to do. The linker has to do a LOT in order to resolve undefined symbols when duct-taping your object files together into an executable. Having function signatures in header files is what allows ld to have some notion of what goes where.

You don’t have to reimplement printf every time you write a new project, because dynamic linking lets you abstract away from the problem of integrating shared libraries into your program. stdio.h itself doesn’t implement printf either. It’s only 160 lines. This blew my mind when I was first learning C, because I was under the impression that all programs were statically linked. I knew that the #include <file> pragma literally just copy-pasted the entirety of <file> into your source code, but I figured #include <stdio.h> just vomited 2000+ lines for each library function you added into your code.

Why do we need to write header files at all? Isn’t

int launchMissiles(char **targets, uint32_t *launchCodes) {
. . .
}

enough information for the compiler? It shows how much space needs to be allocated for its body, how many bytes needs to be pushed onto the stack for its arguments, and how many meaningful bytes will be moved into RAX when it returns. Why do you have to write that twice? Once in the implementation, and again in a separate header file. You need to #include that header in any other module that you want to launch missiles from.

Imagine your program as a tree, rooted at main.c or whatever module includes main(). Your program’s entry point is primitive, in the sense that main() doesn’t require a header. Your program will always start from main(). That’s your onramp onto the highway, and it’s the point from which every other symbol lookup and function call will stem. All the declarations made in a header file form a collection of edges emanating from that header file to compiled object files. The ultimate goal of a linker is to have every jmp and call instruction point somewhere in a program’s address space. This involves traversing a complex web of object files and figuring out which modules and subroutines depend on what. ld needs to know where launchMissiles() is defined before it can add a location to jump to when the function is invoked. Sound familiar?

The ELF format for executable files consists of a few important parts. The order tends to vary between environments, but the idea remains the same. There’s a header that tells you it’s an ELF file. Next is a program header table with information to the OS on how the program should be run. Then a series of sections where your program’s data lies - space for static variables, string literals, opcodes for your CPU to crunch, etc. These come from your individual object files. The clues on where your functions are actually defined lie in the section header table. There are a few interesting entries in the section header table, like redirection addresses for sections that are too big to fit, the total offsets for each section’s location relative to your program’s address space, various flags, the size of symbols, etc.

Before linking, every object file corresponds to a single source file. If there’s a function call within a source file whose implementation is elsewhere, the compiler basically just adds a little TODO for the linker to figure out. This is assuming that the function has been mentioned somewhere, of course. GCC will throw an error if it’s unknown. This is also where those cryptic implicit declaration of function 'foo' is invalid errors come from - if you try to call foo() before it’s mentioned, the compiler doesn’t know what you’re talking about. GCC doesn’t maintain any context between translating different source files. It sees a file, and does what it does. Try it for yourself.

int main(void) {
  foo(); // won't compile
  return 0;
}

void foo(void) {
  puts("hello world!")
}

foo’s definition is only 4 lines down, but GCC doesn’t know that. You’re giving that black box way too much credit. What about this?

void foo(void);

int main(void) {
  foo(); // "hey, I know what that is!" - `GCC`
  return 0;
}

void foo(void) {
  puts("hello world!")
}

Remember what I said about #include just vomiting text into your file? Your header files contain these function declarations for a reason - to give GCC a clue of which instructions are necessary in order to call these functions. Depending on the compiler and environment, arguments might all be passed on the stack, or in partly in registers. This also means you need to know in advance how many arguments there are, and the size of each argument. In terms of handling, a function with the arguments (uint32_t x, uint64_t y) is not the same as (uint64_t x, uint32_t y). Sure, the two values have the same combined size, and the space allocated is equal in both cases. But if you’re passing the arguments on the stack, the offsets for each value are completely different.

f1 |  f2
========
X     y
X     y
X     y
X     y
y     y
y     y
y     y
y     y
y     X
y     X
y     X
y     X
-- rest of the stack --

The compiler needs to know exactly how much space each variable takes up, because it matters. You might need a popq or a popl in order to get that data off the stack, but the only way to know is to have some idea of what data you’re working with. Imagine if your compiler was prone to making hasty conclusions and guessing about data size. What if it created one version of launchMissiles that took arguments like (char targets, uint8_t launchCodes), and another version with the original arguments? You’d be writing and overwriting data all over the place, and we’d probably be dead.

Back to the linker. Now that we know why header files are required (right?), let’s look at how these function definitions are actually resolved. Every object file ends up with a symbol table (assume 64-bit because we’re not living in the 90’s anymore). The symbol table is an array of Elf64_Sym, each corresponding to one function:

typedef struct {
    Elf64_Word      st_name;
    unsigned char   st_info;
    unsigned char   st_other;
    Elf64_Half      st_shndx;
    Elf64_Addr      st_value;
    Elf64_Xword     st_size;
} Elf64_Sym;

Elf64_Word and Elf64_Xword are just typedef aliases for uint32_t and uint64_t, respectively. Elf64_Addr is, as you guessed it, an unsigned 64-bit pointer (uint64_t). st_name isn’t a string, but an index into a string table within the object file that contains the function’s string representation. st_info is a bitmask that contains metadata about whether the symbol is local or global, and some other bookkeeping stuff. The most crucial bits of info here are st_name, st_info, st_size, and st_value. With this, the linker now has everything it needs to be able to redirect any invocations of the function to its exact location within its resident object file.

When object files are combined together, their data occurs sequentially within the resulting executable’s data. The linker takes the addresses of each function definition and calculates a new offset based on where it spits the function body into the executable. This order doesn’t have to necessarily follow the original order within each object file - functions might be inlined depending on how often they get called, they might be placed closer together or overlaid for cache locality, etc. As ld builds the executable, it’s unclear what the offset of the next function, or the one after that, will be. All it knows is the offset of the program header, data, and everything it’s added in so far. ld provides a lot of fine-grained control to the programmer when specifying how a program should be linked, since there’s so much metadata contained in each object file. The compiler didn’t break when it was translating everything to assembly, the assembler didn’t whine, and so ld gets to do its thing now.

What’s cool is that ld is lazy in binding functions now. Since the compiler was able to find a valid traversal through your program, the only thing left to be done for your program to run is to actually write the addresses into the executable. This doesn’t have to be done until you try to run your program, and it’s how dynamic linking allows you to save literally KILOBYTES of memory by allowing different programs to share the same library functions in memory. This wasn’t the case back in the 1970’s. In the infancy of UNIX, ld had a lot more responsibility. Static analysis wasn’t as feasible as it was today, so tasks like dead-code elimination were up to the linker. In particular, ld used to only do a single pass over a tarball of objects in their predefined order. This is where it was crucial that a traversal-order permutation of symbols was obtained.