Everything you need to build your own nm and otool

Recreate the nm and otool binaries in the C programming language. You will learn about the Mach-O format, which defines the layout your operating system expects from an executable.

Photo by Tianyi Ma / Unsplash

Understand how your computer compiles and executes binaries.

Have you ever wondered how your computer decodes binaries? Lately I wanted to learn this concept, and that’s why I decided to implement the nm and otool commands. Written in C with only the most basic functions, these two programs taught me a lot about binaries and Unix. Those interested in the field might learn a few concepts here 🎓.

This article should have all the resources needed to build your own implementations. I strongly advise you to try doing this project by yourself. You will gain a lot of skills exploring the man pages and the system header files.

This implementation covers Mach-O, the current executable format for macOS. Feel free to access the complete GitHub project below.

Executables

When the operating system starts a binary, it expects the file to follow a predefined layout. Each operating system has its own conventions. In this article, we will focus on the Mach-O format, the one used by modern macOS computers. Other conventions exist: for example, Linux mainly uses ELF and Windows uses PE. You can find a complete list here.

The following document gives you a complete reference in case you want to understand this in depth.

1st step: Identify a Mach-O file

The first bytes of a file usually define its identity: they are called the magic number. For Mach-O, the magic number is the first 4 bytes of the file. By comparing them to a list of known magic numbers, we can deduce whether the file follows the Mach-O format. To get these constants, include the file <mach-o/loader.h> in your project.

// Defined in <mach-o/loader.h>

#define MH_MAGIC    0xfeedface  /* the mach magic number */
#define MH_CIGAM    0xcefaedfe  /* NXSwapInt(MH_MAGIC) */
#define MH_MAGIC_64 0xfeedfacf  /* the 64-bit mach magic number */
#define MH_CIGAM_64 0xcffaedfe  /* NXSwapInt(MH_MAGIC_64) */

Those 4 magic numbers identify Mach-O files. They differ by the size of the structures that follow (32 or 64 bits) and by their endianness: a CIGAM value means the file was written with the opposite byte order from the machine reading it.

Archives and fat binaries can also contain Mach-O data. nm and otool are able to parse them, so we’ll talk briefly about them at the end of the article.
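To make the check concrete, here is a minimal sketch of how the comparison could look. The constants are repeated so that the snippet also compiles without the macOS headers, and describe_magic is a helper name I made up for illustration, not a system function.

```c
#include <stdint.h>
#include <string.h>

/* Values as found in <mach-o/loader.h>; redefined only if the header is absent. */
#ifndef MH_MAGIC
# define MH_MAGIC    0xfeedfaceu
# define MH_CIGAM    0xcefaedfeu
# define MH_MAGIC_64 0xfeedfacfu
# define MH_CIGAM_64 0xcffaedfeu
#endif

/* Hypothetical helper: returns 1 for a Mach-O magic and fills the two flags,
 * 0 otherwise. `data` must point to at least 4 readable bytes. */
static int describe_magic(const void *data, int *is_64, int *is_swapped)
{
    uint32_t magic;

    memcpy(&magic, data, sizeof(magic));   /* avoids an unaligned read */
    *is_64 = (magic == MH_MAGIC_64 || magic == MH_CIGAM_64);
    *is_swapped = (magic == MH_CIGAM || magic == MH_CIGAM_64);
    return (magic == MH_MAGIC || magic == MH_CIGAM || *is_64);
}
```

Keeping the two flags around is convenient, because every later step needs to know which structure size to use and whether integers must be byte-swapped.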

Nm and otool

So why are we implementing nm and otool?

Those commands are great for learning about Mach-O files because they parse the file, analyse its structure and then display the data.

nm displays the list of symbols of an executable.
otool displays a hexdump of the data of a specified section (otool -t dumps the __text section, for example). We will see what segments and sections are later.

Parsing the structure

Figure 1-1 of the Mach-O File Format Reference: https://github.com/aidansteele/osx-abi-macho-file-format-reference/blob/master/Mach-O_File_Format.pdf

Access the file

We will first access and read the file. I use a simple combination of open, fstat, mmap and close to get a pointer to the start of the data.

Then check the magic number against the Mach-O magic numbers listed earlier.

macOS ships many header files that define the Mach-O structures and constants for us. We will use them in the following sections. Because nm and otool need to parse the same structures, we can share common functions between them.
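The open, fstat, mmap and close sequence could look like the sketch below. map_file is a hypothetical helper name, and I assume a read-only private mapping is enough since we never modify the file.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps `path` read-only and stores its size in *size.
 * Returns NULL on any failure. Release the mapping with munmap(ptr, *size). */
static void *map_file(const char *path, size_t *size)
{
    struct stat st;
    void *ptr;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    /* Reject directories, devices and empty files before mapping. */
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode) || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (ptr == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return ptr;
}
```

Keep the size next to the pointer: it is the only way to know later whether an offset read from the file still points inside the mapping.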

The header

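A Mach-O file always starts with a header, described by struct mach_header (or struct mach_header_64 for 64-bit files) in <mach-o/loader.h>.

It carries information such as cputype (the CPU family able to run this executable), filetype, and the number and total size of the load commands that follow.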
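For reference, here is a sketch of the 32-bit header layout and a way to copy it out of the mapped file. The struct is a local mirror of struct mach_header that I named my_mach_header so the snippet compiles anywhere; on macOS you would include <mach-o/loader.h> instead. read_header is a hypothetical helper, and it assumes the file uses the host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct mach_header from <mach-o/loader.h> (32-bit form).
 * struct mach_header_64 has the same fields plus a trailing `reserved`. */
struct my_mach_header {
    uint32_t magic;       /* one of the four magic numbers */
    int32_t  cputype;     /* CPU family the file was built for */
    int32_t  cpusubtype;  /* CPU variant */
    uint32_t filetype;    /* object, executable, dylib, ... */
    uint32_t ncmds;       /* number of load commands */
    uint32_t sizeofcmds;  /* total size of the load commands, in bytes */
    uint32_t flags;
};

/* Hypothetical helper: copies the header out of the mapping.
 * Returns 0 on success, -1 if the file is too small to hold a header. */
static int read_header(const void *file, size_t file_size,
                       struct my_mach_header *out)
{
    if (file_size < sizeof(*out))
        return -1;
    memcpy(out, file, sizeof(*out));
    return 0;
}
```

The two fields that matter most for the next step are ncmds and sizeofcmds, since the load commands start right after the header.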

Load commands

The load commands describe how the rest of the binary is laid out. You can get the complete list of load command types in the loader.h header, under LC_XXX names. For this article, you’ll only need the LC_SYMTAB and LC_SEGMENT commands (LC_SEGMENT_64 in 64-bit files).

Because the load commands are placed one after another, we can iterate through them using their cmdsize field.

otool prints the content of some sections described in the LC_SEGMENT commands. For nm, we have to match the items of the LC_SYMTAB command to their related LC_SEGMENT section.
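A minimal sketch of that loop, assuming cmds points just past the header and size is the header's sizeofcmds. The struct mirrors struct load_command from <mach-o/loader.h>; walk_load_commands and the lc_visitor callback type are names I invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct load_command from <mach-o/loader.h>. */
struct my_load_command {
    uint32_t cmd;      /* LC_xxx type */
    uint32_t cmdsize;  /* size of this command, including these 8 bytes */
};

/* Hypothetical callback type: receives each command's type and position. */
typedef void (*lc_visitor)(uint32_t cmd, const uint8_t *lc, void *ctx);

/* Walks `ncmds` load commands stored in the `size` bytes at `cmds`.
 * Returns the number of commands visited, or -1 if the data is inconsistent. */
static int walk_load_commands(const uint8_t *cmds, size_t size, uint32_t ncmds,
                              lc_visitor visit, void *ctx)
{
    struct my_load_command lc;
    size_t off = 0;
    uint32_t i;

    for (i = 0; i < ncmds; i++) {
        if (size - off < sizeof(lc))
            return -1;                       /* truncated command header */
        memcpy(&lc, cmds + off, sizeof(lc));
        if (lc.cmdsize < sizeof(lc) || lc.cmdsize > size - off)
            return -1;                       /* would loop forever or overrun */
        if (visit)
            visit(lc.cmd, cmds + off, ctx);
        off += lc.cmdsize;
    }
    return (int)i;
}
```

The cmdsize check is not optional: a corrupt file with a cmdsize of 0 would otherwise make the loop read the same command forever.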

LC_SEGMENT — The segment command

The segment command tells where to find a segment in the file (fileoff, filesize) and where to map it in memory (vmaddr, vmsize). It also specifies the number of sections the segment contains (nsects). 64-bit files use LC_SEGMENT_64, which has the same fields with wider types.

The segment_command structure is followed directly, inside the load command, by nsects section structures. A section is characterized by its section name (__text for example) and segment name (__TEXT for example), the address of its data in memory (addr), the position of that data in the file (offset), the data size, etc.

With this information, we can iterate through the sections. With otool, you have to hexdump the size bytes found at offset in the file, and print addr as the address of each line. With nm you must save each section to match it later with a symbol in the SYMTAB. For that we need a new parameter: the index of the section, so don’t forget to save it. Sections are numbered from 1, in file order and across all segments, because that is how the n_sect field of a symbol refers to them (0 means the symbol has no section).
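Saving the sections for nm could be sketched as follows, for the 64-bit case only. The two structs are local mirrors of segment_command_64 and section_64 from <mach-o/loader.h>. The sect_table type and record_sections are hypothetical names. Names are stored in file order, so slot 0 holds the first section, which a symbol's n_sect field counts as number 1.

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of struct segment_command_64 and struct section_64
 * from <mach-o/loader.h>. */
struct my_segment_command_64 {
    uint32_t cmd, cmdsize;
    char     segname[16];
    uint64_t vmaddr, vmsize, fileoff, filesize;
    int32_t  maxprot, initprot;
    uint32_t nsects, flags;
};

struct my_section_64 {
    char     sectname[16];
    char     segname[16];
    uint64_t addr, size;
    uint32_t offset, align, reloff, nreloc, flags;
    uint32_t reserved1, reserved2, reserved3;
};

#define MAX_SECTS 255   /* n_sect is one byte, and 0 means "no section" */

struct sect_table {
    char     names[MAX_SECTS][17];  /* names[i] belongs to section number i + 1 */
    unsigned count;
};

/* Appends the sections of one LC_SEGMENT_64 command to `table`.
 * `lc` points to the command and `lc_size` is its cmdsize.
 * Returns 0 on success, -1 if the command is malformed or the table is full. */
static int record_sections(const uint8_t *lc, size_t lc_size,
                           struct sect_table *table)
{
    struct my_segment_command_64 seg;
    struct my_section_64 sect;
    uint32_t i;

    if (lc_size < sizeof(seg))
        return -1;
    memcpy(&seg, lc, sizeof(seg));
    if (seg.nsects > (lc_size - sizeof(seg)) / sizeof(sect))
        return -1;                              /* sections overrun the command */
    for (i = 0; i < seg.nsects; i++) {
        if (table->count >= MAX_SECTS)
            return -1;
        memcpy(&sect, lc + sizeof(seg) + i * sizeof(sect), sizeof(sect));
        memcpy(table->names[table->count], sect.sectname, 16);
        table->names[table->count][16] = '\0';  /* sectname may lack a NUL */
        table->count++;
    }
    return 0;
}
```

Call it once per segment command with the same table, so that the numbering continues across segments. A 32-bit version would be identical apart from the structure definitions.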

LC_SYMTAB — The symbol table command

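The symtab_command gives the file offset (symoff) and the count (nsyms) of an array of nlist symbols, plus the offset (stroff) and size (strsize) of the string table.

To get the name of a symbol, we need to look into the string table (strtab): the name is the NUL-terminated string found at stroff + n_strx. The nlist structure also gives us other useful information, such as the type (n_type), the section number (n_sect) and the value (n_value).

What do we need to build a line for nm? The first column shows the address of the symbol, the second one gives a letter describing its type (for example T for a symbol defined in the text section, U for an undefined symbol that must come from another file), and the last one is the name. The complete list of letters is available below.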
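Here is a sketch of the lookup for 64-bit files. The structs mirror symtab_command (<mach-o/loader.h>) and nlist_64 (<mach-o/nlist.h>). In the real header n_strx sits inside a one-member union called n_un, which I flattened here. get_symbol is a hypothetical helper and assumes host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of struct symtab_command and struct nlist_64. */
struct my_symtab_command {
    uint32_t cmd, cmdsize;
    uint32_t symoff;   /* file offset of the nlist array */
    uint32_t nsyms;    /* number of nlist entries */
    uint32_t stroff;   /* file offset of the string table */
    uint32_t strsize;  /* size of the string table, in bytes */
};

struct my_nlist_64 {
    uint32_t n_strx;   /* offset of the name inside the string table */
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* Copies symbol number `i` into *sym and returns its name, or NULL when the
 * symbol or its name falls outside the file. */
static const char *get_symbol(const uint8_t *file, size_t file_size,
                              const struct my_symtab_command *st,
                              uint32_t i, struct my_nlist_64 *sym)
{
    uint64_t sym_off, str_off, str_end;

    if (i >= st->nsyms)
        return NULL;
    sym_off = (uint64_t)st->symoff + (uint64_t)i * sizeof(*sym);
    if (sym_off > file_size || file_size - sym_off < sizeof(*sym))
        return NULL;
    memcpy(sym, file + sym_off, sizeof(*sym));
    str_end = (uint64_t)st->stroff + st->strsize;
    if (str_end > file_size || sym->n_strx >= st->strsize)
        return NULL;
    str_off = (uint64_t)st->stroff + sym->n_strx;
    /* The name must be NUL-terminated before the string table ends. */
    if (memchr(file + str_off, '\0', (size_t)(str_end - str_off)) == NULL)
        return NULL;
    return (const char *)(file + str_off);
}
```

Looping i from 0 to nsyms - 1 gives you every symbol with its name, ready to be sorted and printed.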

nm(1): symbols from object files - Linux man page

Here is how to get the representation for a symbol.

When (n_type & N_TYPE) is equal to N_SECT, the symbol is defined in a section, and the letter depends on which one: n_sect gives you its number. Remember you saved the index of each section? You can use it here 😉
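As a rough sketch, the mapping could be written like this. The masks and values come from <mach-o/nlist.h> and are only redefined when that header is missing. symbol_letter is a hypothetical helper, and the letters follow what the macOS nm man page describes (U, A, T, D, B, C, S, I, lowercase for local symbols, - for debugging entries).

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Masks and values from <mach-o/nlist.h>; redefined only if the header is absent. */
#ifndef N_TYPE
# define N_STAB 0xe0  /* any of these bits set: debugging entry */
# define N_TYPE 0x0e  /* mask for the type bits */
# define N_EXT  0x01  /* external (global) symbol */
# define N_UNDF 0x0
# define N_ABS  0x2
# define N_INDR 0xa
# define N_SECT 0xe
#endif

/* Hypothetical helper. `sectname` is the name of section number n_sect
 * (only read when the type is N_SECT); `n_value` tells undefined from common.
 * Returns the nm letter, or '-' for debugging entries. */
static char symbol_letter(uint8_t n_type, uint64_t n_value, const char *sectname)
{
    char c;

    if (n_type & N_STAB)
        return '-';
    switch (n_type & N_TYPE) {
    case N_UNDF: c = n_value ? 'C' : 'U'; break;  /* common or undefined */
    case N_ABS:  c = 'A'; break;
    case N_INDR: c = 'I'; break;
    case N_SECT:
        if (strcmp(sectname, "__text") == 0)      c = 'T';
        else if (strcmp(sectname, "__data") == 0) c = 'D';
        else if (strcmp(sectname, "__bss") == 0)  c = 'B';
        else                                      c = 'S';
        break;
    default:     c = '?'; break;
    }
    /* Local (non-external) symbols are shown in lowercase. */
    if (!(n_type & N_EXT) && c != '?')
        c = (char)tolower((unsigned char)c);
    return c;
}
```

For example, an external symbol in __text gives T, while the same symbol declared static gives t.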

Go further

Look at you! You should now be able to build your own nm and otool 😎. But wait, if you’re serious about this project you still need to handle some edge cases. I will briefly talk about 4 of them.

Archives and fat files


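Parsing those files is not complicated if you followed the previous steps. The headers are available at <mach-o/fat.h> and <ar.h>. The process is the same: read the container's header, find the offset and size of each embedded Mach-O file, then parse that file as before. Note that the fields of the fat header are always stored in big-endian order.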
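For fat files, locating one architecture slice could look like the sketch below. It reads the fields by hand instead of using the structs from <mach-o/fat.h>, relying on the layout described in that header. fat_slice and read_be32 are hypothetical helpers. Archives are not covered here.

```c
#include <stdint.h>
#include <string.h>

#ifndef FAT_MAGIC
# define FAT_MAGIC 0xcafebabeu   /* value from <mach-o/fat.h> */
#endif

/* Reads a big-endian 32-bit integer, whatever the host byte order. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: finds slice `index` of a fat file.
 * Layout: fat_header {magic, nfat_arch} followed by nfat_arch fat_arch
 * entries {cputype, cpusubtype, offset, size, align}, 20 bytes each.
 * Returns 0 and fills *offset / *size, or -1 on malformed input. */
static int fat_slice(const uint8_t *file, size_t file_size, uint32_t index,
                     uint32_t *offset, uint32_t *size)
{
    const uint8_t *arch;
    uint32_t narch;

    if (file_size < 8 || read_be32(file) != FAT_MAGIC)
        return -1;
    narch = read_be32(file + 4);
    if (index >= narch || (file_size - 8) / 20 <= index)
        return -1;                       /* entry missing or truncated */
    arch = file + 8 + (size_t)index * 20;
    *offset = read_be32(arch + 8);
    *size = read_be32(arch + 12);
    if (*offset > file_size || *size > file_size - *offset)
        return -1;                       /* slice lies outside the file */
    return 0;
}
```

Once you have the offset and size of a slice, treat file + offset as the start of an ordinary Mach-O file of that size and run the usual parser on it.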

Support for little/big endian

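Values stored in the headers may use a different byte order from the machine running your program. The magic number tells you which case you are in: if it matches MH_CIGAM or MH_CIGAM_64, you need to reverse the byte order of every integer you read from the structures.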
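A minimal sketch of the swap itself, using only shifts and masks. The names swap32, swap64 and fix32 are mine, and the swapped flag is assumed to come from the magic number check.

```c
#include <stdint.h>

/* Reverses the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Reverses the byte order of a 64-bit value by swapping each half
 * and exchanging their positions. */
static uint64_t swap64(uint64_t v)
{
    return ((uint64_t)swap32((uint32_t)v) << 32) | swap32((uint32_t)(v >> 32));
}

/* Hypothetical helper: `swapped` is 1 when the magic was MH_CIGAM or
 * MH_CIGAM_64, 0 otherwise. */
static uint32_t fix32(uint32_t v, int swapped)
{
    return swapped ? swap32(v) : v;
}
```

Wrapping every field access in a helper like fix32 keeps the parsing code identical for both byte orders.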

Support for 64 and 32 bits

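64-bit files use wider structures (mach_header_64, segment_command_64, section_64, nlist_64) in which addresses and sizes are 64-bit integers, so be prepared to handle both layouts.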
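One way to keep the rest of the program independent of the width is to convert both symbol layouts into a single internal structure, as in this sketch. The two nlist structs are local mirrors of the ones in <mach-o/nlist.h> (with the n_un union flattened). struct symbol, nlist_size and load_symbol are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Local mirrors of struct nlist and struct nlist_64 from <mach-o/nlist.h>. */
struct my_nlist {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    int16_t  n_desc;
    uint32_t n_value;   /* 32-bit address */
};

struct my_nlist_64 {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;   /* 64-bit address */
};

/* Hypothetical width-independent view used by the rest of the program. */
struct symbol {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint64_t n_value;
};

/* Size of one symbol table entry for the given file flavour. */
static size_t nlist_size(int is_64)
{
    return is_64 ? sizeof(struct my_nlist_64) : sizeof(struct my_nlist);
}

/* Copies the entry at `entry` into *out, widening n_value when needed. */
static void load_symbol(const uint8_t *entry, int is_64, struct symbol *out)
{
    if (is_64) {
        struct my_nlist_64 n;

        memcpy(&n, entry, sizeof(n));
        out->n_strx = n.n_strx;
        out->n_type = n.n_type;
        out->n_sect = n.n_sect;
        out->n_value = n.n_value;
    } else {
        struct my_nlist n;

        memcpy(&n, entry, sizeof(n));
        out->n_strx = n.n_strx;
        out->n_type = n.n_type;
        out->n_sect = n.n_sect;
        out->n_value = n.n_value;
    }
}
```

The width still matters when printing: nm shows addresses with 8 hexadecimal digits for 32-bit files and 16 for 64-bit files.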

Secure against corrupt files 🏴‍☠️

This part is a bit more complicated and requires some testing. I consider that a program should never segfault.

If you receive a corrupted binary, the program could try to access a memory location that is not available. Every time you move a pointer based on values read from the file, I suggest you check that it never goes before the start of the file or past the end of it.
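A small helper is enough to centralize that check. Here is a sketch, where file_range is a hypothetical name and file_size is the size obtained when the file was mapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns a pointer to `len` bytes at `offset` inside the
 * mapping, or NULL if any part of that range lies outside the file. */
static const void *file_range(const void *file, size_t file_size,
                              uint64_t offset, uint64_t len)
{
    /* Written as a subtraction so that offset + len can never overflow. */
    if (offset > file_size || len > file_size - offset)
        return NULL;
    return (const uint8_t *)file + offset;
}
```

If every access to the mapping goes through a function like this, a corrupt offset or size ends in a clean error message instead of a crash.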

I’m starting a new website called myopen.market. It’s still at an early stage, but if you found this article useful, subscribing to its newsletter would be the best way to thank me ❤️