Reading Code

By Michael Gebis, Thu 15 April 2021, in category Coding

Get good at reading code

If you're a professional software engineer, it pays to think about the process of reading code. You're going to be doing it for the rest of your career, so you might as well get good at it.

But working on multimillion-line legacy code bases is hard. Code rot, duplicated functionality, general cruft, obsolete packages, dead code, "prototype" code that became production code--the fun just never stops.

I've read a lot of blog entries advising "you should learn to read code", but not much advice on how to actually do it. Everybody has their own approach, but I wanted to write down some techniques that have worked for me. Remember, the end goal is to get up to speed on a new code base quickly. My memory isn't what it once was, so when I start to get lost reading things, I take a lot of notes. Here are some of the ways I do that:

General advice

Document the directory hierarchy

As you start to tackle a large code base, go through important directories, and for each subdirectory, annotate what it's for. You will be completely clueless at first, but do a bit of code research, and just give it your best guess. I tend to use question marks to indicate where I'm unsure, and come back and refine my answers over time.

For example, for the top-level Linux directories, my notes would look something like this:

    Documentation
      ABI                   the ABI between the Linux kernel and userspace
      PCI                   PCI bus
      RCU                   Read/Copy/Update
      ...
    LICENSES
    arch                    processor-specific code
    block                   block device code?
    certs                   TPM certificates???  I dunno.
    ...

I take these notes just to confirm my own understanding. Pro-tip: some day, you might want to refine this guide and internally publish it for others.
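
If typing the outline by hand feels tedious, a small script can produce the blank skeleton for you to annotate. This is only a sketch: the function name notes_skeleton, the two-level depth limit, and the 24-column layout are my own assumptions, not part of any tool.

```python
import os


def notes_skeleton(root, max_depth=2):
    """Return an indented outline of the subdirectories under root.

    Every entry ends in "???" as a reminder that the purpose is unknown
    until you have actually looked inside the directory.
    """
    lines = []

    def walk(path, depth):
        if depth >= max_depth:
            return
        try:
            with os.scandir(path) as it:
                entries = sorted(it, key=lambda e: e.name)
        except OSError:
            return  # unreadable directory: skip it rather than abort
        for entry in entries:
            # Skip files and hidden directories such as .git
            if entry.is_dir(follow_symlinks=False) and not entry.name.startswith("."):
                label = "  " * depth + entry.name
                lines.append(f"{label:<24}???")
                walk(entry.path, depth + 1)

    walk(root, 0)
    return "\n".join(lines)


if __name__ == "__main__":
    print(notes_skeleton("."))
```

Redirect the output to a text file, then replace each ??? with your best guess as you read.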

Find and take notes on the key functions/data structures

I find it helpful to create a one-line summary of key functions and data structures. You're not trying to document everything, and your notes should be terse and to the point. If I'm lucky, I can pull the summary straight from the comments. Either way, writing it forces me to read and understand each function well enough to produce a cogent entry.

This is kind of a bad example because the functions are fairly self-describing, but here's what I'd do, using a random file in the Linux source tree:

    btintel_check_bdaddr    // check device for corrupt/buggy address
    btintel_enter_mfg       // enter manufacturing mode
    btintel_exit_mfg        // exit manufacturing mode, with/without reset
    btintel_set_bdaddr      // set bluetooth device address
    btintel_load_ddc_config // load intel "Device Data Control?" parameters

This list can help quickly jog your mind about what each function does, especially when function names are cryptic and unfamiliar.
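
To seed a list like this for a C file, a rough script can pull out the function names and leave the summaries for you to fill in. This is a heuristic sketch, not a parser: it assumes a definition starts in column 0 with the return type and name on the same line, so it will miss definitions that put the name on its own line. The names function_index and notes_template are made up for this example.

```python
import re

# Heuristic: a line starting in column 0 that contains "name(" and has no
# ";" after the parenthesis is probably a function definition, not a
# prototype or a macro invocation.
FUNC_DEF = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*$")
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def function_index(source):
    """Return probable function names in order of first appearance."""
    names = []
    for line in source.splitlines():
        match = FUNC_DEF.match(line)
        if match and match.group(1) not in KEYWORDS and match.group(1) not in names:
            names.append(match.group(1))
    return names


def notes_template(names):
    """Format names as an aligned list with a placeholder summary each."""
    width = max((len(n) for n in names), default=0)
    return "\n".join(f"{n:<{width}} // ???" for n in names)


if __name__ == "__main__":
    # A made-up C fragment, only to show the output format.
    sample = (
        "static int open_device(struct dev *d)\n"
        "{\n"
        "\treturn probe(d);\n"
        "}\n"
        "void close_device(struct dev *d);\n"
    )
    print(notes_template(function_index(sample)))
```

The script only saves typing. Writing each summary yourself is still the part that builds understanding.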

Document the lifespan of an API, top-down then bottom-up

For example: I have worked on many hardware platforms that perform read/write IO calls. And in each case, I've written a document titled "The Life Of A Hardware Read", that starts at the uppermost user API, and recursively describes each step until the read reaches hardware, and then unwinds the stack showing how the data gets back to the user.

This can be a very large task--depending on the system, it could take days to do the topic justice. Usually my first pass is pretty sloppy. But later, I've turned these documents into full presentations to familiarize new hires with how things work.

Create a list of threads and thread responsibilities

Look for calls to the thread-creation function (thread_create, pthread_create, or whatever your platform calls it), and document each thread that is started, along with its responsibilities.

If you're lucky, each thread has a clearly defined responsibility. If you're not, well, better to know that you've got a big mess on your hands.
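
Finding every place a thread is started is mostly a search problem. Here is a minimal sketch that walks a source tree and reports the matching lines. The list of function names in THREAD_CALLS and the file extensions are assumptions; replace them with whatever your code base actually uses.

```python
import os
import re

# Assumed spellings of "start a thread"; adjust for your code base.
THREAD_CALLS = re.compile(
    r"\b(pthread_create|kthread_run|kthread_create|thread_create)\s*\("
)


def find_thread_starts(root, extensions=(".c", ".h", ".cc", ".cpp")):
    """Return (relative path, line number, source line) for each match."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps odd encodings from stopping the scan
            with open(path, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, start=1):
                    if THREAD_CALLS.search(line):
                        rel = os.path.relpath(path, root)
                        hits.append((rel, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for rel, lineno, text in find_thread_starts("."):
        print(f"{rel}:{lineno}: {text}")
```

Each hit is a starting point for one entry in your thread list: follow the entry-point function that is passed in and write down what it does.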

Look at (or create) function call graphs and function pseudocode

For important functions, create a function call graph. You can generate one with cflow, Doxygen, or callgrind, or with your editor's "Show Call Hierarchy" function.

Alternatively, just go through the functions and write pseudocode for them. The idea is to pull out the important bits and remove the boilerplate code. When you forget what "redo_foobar_froz()" does, your pseudocode should tell you at a glance.
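
If the code you're reading happens to be Python, the standard ast module is enough for a rough call graph without any extra tooling. The sketch below is deliberately limited: it only looks at top-level functions and records the names of whatever they call, without resolving which object a method belongs to. The function name call_graph is my own.

```python
import ast


def call_graph(source):
    """Map each top-level function name to the sorted names it calls."""
    graph = {}
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        callees = set()
        for sub in ast.walk(node):
            if isinstance(sub, ast.Call):
                func = sub.func
                if isinstance(func, ast.Name):         # plain call: foo()
                    callees.add(func.id)
                elif isinstance(func, ast.Attribute):  # method call: obj.foo()
                    callees.add(func.attr)
        graph[node.name] = sorted(callees)
    return graph


if __name__ == "__main__":
    # A made-up module, only to show the output shape.
    example = (
        "def redo_foobar_froz(items):\n"
        "    cleaned = scrub(items)\n"
        "    return cleaned.sort()\n"
        "\n"
        "def scrub(items):\n"
        "    return list(items)\n"
    )
    for caller, callees in call_graph(example).items():
        print(f"{caller} -> {', '.join(callees)}")
```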

Don’t be afraid to run/debug/add logging

Just because you're reading code doesn't mean you can't also run the code. So go ahead, run it under the debugger to be able to inspect runtime values and control flow. Add printf statements to elucidate the tricky bits. Delete a function body and see what unit tests fail, and how.
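
The printf approach carries over to any language. As a minimal sketch in Python, a decorator can log every call and return value of a function you are trying to understand, without touching its body. The names traced and reading_notes are hypothetical, and the sketch does not log exceptions.

```python
import functools
import logging

log = logging.getLogger("reading_notes")


def traced(func):
    """Log arguments on entry and the result on exit."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("-> %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        log.debug("<- %s returned %r", func.__name__, result)
        return result
    return wrapper


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)

    @traced
    def redo_foobar_froz(x):
        return x * 2

    redo_foobar_froz(21)
```

Remove the decorator once the tricky bit makes sense, so the temporary logging doesn't become permanent cruft.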

Document state machines with Graphviz/dot

If you are trying to read a state machine in code, create a very simple Graphviz document of its states and transitions. This can be done so quickly that it's almost always worth doing.
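
One way to keep this quick is to jot the transitions down as a plain list while reading, then let a few lines of code produce the dot text. In this sketch the function to_dot and the states and events in TRANSITIONS are invented for illustration; substitute the ones you find in the code.

```python
def to_dot(name, transitions):
    """Render (source, event, target) triples as a Graphviz digraph."""
    lines = [f"digraph {name} {{", "    rankdir=LR;"]
    for source, event, target in transitions:
        lines.append(f'    "{source}" -> "{target}" [label="{event}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical transitions, written down while reading the code.
TRANSITIONS = [
    ("IDLE", "connect", "CONNECTING"),
    ("CONNECTING", "ok", "CONNECTED"),
    ("CONNECTING", "timeout", "IDLE"),
    ("CONNECTED", "close", "IDLE"),
]

if __name__ == "__main__":
    print(to_dot("fsm", TRANSITIONS))
```

Save the output as fsm.dot and render it with `dot -Tpng fsm.dot -o fsm.png`.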

Code archeology

If you're looking at a feature, sometimes it's worth looking at the commit that introduced it, so that you see all the related code at once. Use your source control's history and blame tools to find it. Of course, if the code was checked in haphazardly, or the function has been through a ton of bug fixes, this may be impossible. But it's worth checking out.
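
Assuming your source control is git, two commands do most of the digging: `git log -S<string>` lists the commits that added or removed a string, and `git blame -L` shows which commit last touched a range of lines. The sketch below just wraps them. The helper names are mine, and the symbol and path in the demo are placeholders.

```python
import subprocess


def pickaxe_command(symbol, path="."):
    """argv listing commits that added or removed occurrences of symbol."""
    return ["git", "log", "--oneline", f"-S{symbol}", "--", path]


def blame_command(path, first_line, last_line):
    """argv showing the last commit to touch each line in the range."""
    return ["git", "blame", "-L", f"{first_line},{last_line}", "--", path]


def run_git(argv, repo="."):
    """Run a git command inside repo and return its standard output."""
    done = subprocess.run(argv, cwd=repo, capture_output=True, text=True, check=True)
    return done.stdout


if __name__ == "__main__":
    # Placeholder symbol and path; nothing is executed here.
    print(" ".join(pickaxe_command("redo_foobar_froz", "src")))
    print(" ".join(blame_command("src/foobar.c", 10, 40)))
```

Once you have a commit hash from either command, `git show <hash>` displays the whole change in one place.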

Conclusion

This list is by no means exhaustive, and I'll keep expanding it as I come up with new tricks. In the meantime, connect with me on Twitter to let me know what tricks I missed.