Day 1: Identifying the Architecture

Welcome to Day 1 of the Advent of Radare!

Introduction

Today, we’re diving into one of the foundational steps in binary analysis: identifying the specific CPU architecture.

Knowing a binary’s architecture is crucial because it affects everything from disassembly to emulation. Radare2 provides various tools and commands to help us uncover this information. Most of the time it is autodetected or directly exposed by the binary headers, but sometimes it’s not straightforward.

Let’s look at how we can use Radare2 commands like rasm2, i, and asm.cpu settings to investigate architectures. We’ll also explore an advanced script to attempt automatic architecture detection for binaries that lack headers, such as firmware binaries.

Listing Supported Architectures

One of the quickest ways to familiarize yourself with the architectures Radare2 supports is by running rasm2 -L. This command outputs a comprehensive list of architectures, along with the bits, endianness, and available CPU types associated with each.

The output of rasm2 -L will list architectures such as x86, arm, mips, powerpc, and more. Here’s a sample of what it looks like:

$ rasm2 -L
_de 8           6502        Disassembler for the 6502 microprocessor family (NES, c64, ..)
_de 8           6502.cs     Capstone mos65xx 8 bit microprocessors
ade 8 16        8051        8051 microcontroller (also known as MCS-51)
_de 64          alpha       ALPHA architecture disassembler based on GNU binutils
_de 32          amd29k      AMD 29k decoder
a__ 16 32 64    any.as      Use system's gnu/clang 'as' assembler
a__ 8 16 32 64  any.vasm    Use asm.cpu=6502, 6809, c16x, jagrisc, m68k, pdp11, ppc,qnice, tr3200, vidcore, x86, z80
_de 16 32       arc         ARC processor instruction decoder
a__ 16 32 64    arm.nz      Custom thumb, arm32 and arm64 assembler
_de 16 32 64    arm         Capstone ARM analyzer
_de 16 32 64    arm.gnu     ARM code analysis plugin (asm.cpu=wd for winedbg disassembler)
_de 64          arm.v35     Vector35 ARM analyzer
ade 8 16        avr         AVR microcontroller CPU by Atmel
ade 32          bf          brainfuck architecture
ade 32          bpf.mr      BPF the Berkeley Packet Filter bytecode
_de 32 64       bpf         Capstone BPF bytecode
...

Note that for parsing purposes you can always add -j to get the output in JSON: rasm2 -jL.

The first column indicates the following information:

a the plugin supports assembling instructions (encode)
d the plugin supports disassembling instructions (decode)
e the plugin supports emulating instructions (emulate)
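
If you prefer to work with the plain text listing, the flag column can be decoded with a few lines of JavaScript. This is only a sketch: parseRasm2Line and parseRasm2List are hypothetical helper names, and the regular expression assumes the column layout of the sample above (three flag characters, bit widths separated by single spaces, at least two spaces before the plugin name).

```javascript
// Hypothetical helpers: decode the text output of `rasm2 -L`.
// Assumed layout (taken from the sample listing above):
//   <3 flag chars> <bit widths> <2+ spaces> <plugin name> <description>
const RASM2_LINE = /^([a_])([d_])([e_])\s+(\d+(?: \d+)*)\s{2,}(\S+)\s+(.*)$/;

function parseRasm2Line(line) {
    const m = RASM2_LINE.exec(line.trim());
    if (m === null) {
        return null; // not a plugin line (e.g. the "..." elision)
    }
    return {
        assemble: m[1] === "a",    // 'a' = encode
        disassemble: m[2] === "d", // 'd' = decode
        emulate: m[3] === "e",     // 'e' = emulate
        bits: m[4].split(" ").map(Number),
        name: m[5],
        description: m[6],
    };
}

function parseRasm2List(text) {
    // Keep only the lines that matched the expected layout
    return text.split("\n").map(parseRasm2Line).filter((p) => p !== null);
}
```

For example, parseRasm2List(output).filter(p => p.assemble) would keep only the plugins that can assemble instructions.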

If the architecture we are looking for is not listed there, we can use r2pm -s to search for third-party plugins and install them like this:

$ r2pm -ci hexagon

Binary Headers

Most of the time we will be loading binaries with a structured header that specifies all this information.

The i command outputs basic information about the binary, often including arch, bits, endian, class, and machine. For instance:

$ r2 -qci /path/to/binary
arch     x86
bits     64
endian   little

This metadata helps you confirm whether radare2 has correctly detected the binary’s architecture. However, when working with raw binaries (like firmware images or memory dumps) that lack those headers, radare2 falls back to a default, such as the host architecture, which is often wrong and requires manual intervention.

But there’s no need to load the entire binary inside radare2 to retrieve the architecture information. We can get the same output using just rabin2 from the shell like this:

$ rabin2 -I /path/to/binary
arch     x86
bits     64
endian   little
...

This information is also exposed in JSON format by just appending the -j flag:

$ rabin2 -j -I /bin/ls | jq .
{
  "info": {
    "arch": "arm",
    "baddr": 4294967296,
    "binsz": 89088,
    "bintype": "mach0",
    "bits": 64,
    "canary": true,
    "injprot": false,
    "class": "MACH064",
...
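
Once the JSON is parsed, checking whether the header actually provided an architecture takes only a few lines. The sketch below is an illustration, not part of radare2: describeBinary is a hypothetical helper, and it relies only on the info.arch, info.bits, info.bintype and info.class fields visible in the output above.

```javascript
// Hypothetical helper: summarize the parsed output of `rabin2 -j -I`.
// Only fields visible in the sample above are used.
function describeBinary(parsed) {
    const info = (parsed && parsed.info) || {};
    if (!info.arch || !info.bits) {
        // No usable header metadata: the architecture must be set by hand
        return null;
    }
    const kind = info.bintype || "unknown";
    const cls = info.class || "unknown";
    return `${info.arch}-${info.bits} (${kind}, ${cls})`;
}
```

A null result is the cue to fall back to the manual techniques described in the rest of this post.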

Specifying the CPU Model

To configure the CPU model for the given architecture we must use the asm.cpu variable. This can be essential when dealing with binaries optimized for specific processors, such as ARM Cortex-M or MIPS R3000.

To list valid asm.cpu options for the currently loaded architecture, use:

e asm.cpu=?

You can list the CPUs supported by the arm.gnu plugin from the shell with the following one-liner:

$ r2 -a arm.gnu -b 32 -qc 'e asm.cpu=?' --
v2
v2a
v3M
v4
v5
v5t
v5te
v5j
XScale
ep9312
iWMMXt
iWMMXt2

Changing asm.cpu takes effect immediately in the next disassembly, which may help us discover what some seemingly invalid instructions are really doing. Note that inside the r2 shell you can also use -e (just like the command-line flag of the same tool):

-e asm.cpu=v5t
pd 10

Setting the asm.cpu appropriately can enhance disassembly accuracy by accounting for architecture-specific opcodes and behaviors, providing a more precise interpretation of the binary’s instructions.
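
Trying each CPU model by hand gets tedious, so the comparison can be scripted. The following is a rough sketch rather than an official recipe: rankCpuModels is a hypothetical helper that takes the r2 object as a parameter, assumes that e asm.cpu=? prints one model per line (as in the listing above), and assumes undecodable instructions are reported with the opcode "invalid".

```javascript
// Hypothetical helper: try every CPU model of the current asm.arch and
// rank them by how many of the next `count` instructions fail to decode.
function rankCpuModels(r2, count) {
    // Assumption: `e asm.cpu=?` prints one model name per line
    const models = r2.cmd("e asm.cpu=?")
        .split("\n")
        .map((name) => name.trim())
        .filter((name) => name.length > 0);

    const results = models.map((cpu) => {
        r2.cmd(`e asm.cpu=${cpu}`);
        const ops = r2.cmdj(`pdj ${count}`) || [];
        // Assumption: invalid instructions carry opcode === "invalid"
        const invalid = ops.filter((op) => op.opcode === "invalid").length;
        return { cpu, invalid };
    });

    // Fewest invalid instructions first
    results.sort((a, b) => a.invalid - b.invalid);
    return results;
}

// Inside r2js the global `r2` object exists; elsewhere this is skipped.
if (typeof r2 !== "undefined") {
    for (const { cpu, invalid } of rankCpuModels(r2, 80)) {
        console.log(`${cpu}: ${invalid} invalid instructions`);
    }
}
```

A low count is only a hint: several models may tie, so it is worth reading the disassembly of the top candidates before settling on one.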

Automating it in a Script

When a binary has no header information, architecture detection becomes a manual process. However, we can leverage Radare2’s flexibility with an r2js script that tries different architecture and bit configurations, analyzes the disassembly, and measures the ratio of valid to invalid instructions. This process can give us a strong indication of the correct architecture by narrowing down configurations that yield the fewest decoding errors.

The script below attempts various arch and bits combinations, performs a short disassembly (pdj), and counts invalid instructions. The configuration with the fewest invalid instructions is likely the correct one.

Auto-Detect Script

This script is written in r2js, Radare2’s JavaScript interface, which allows for dynamic command execution and result parsing.

const architectures = [
    {arch: "arm", bits: [64, 32, 16]},
    {arch: "x86", bits: [64, 32, 16]},
    {arch: "mips", bits: [64, 32, 16]},
    {arch: "ppc", bits: [64, 32]}
];

let bestMatch = {arch: "", bits: 0, invalidCount: Infinity};

for (const config of architectures) {
    for (const bit of config.bits) {
        // Set architecture and bit width
        r2.cmd(`e asm.arch=${config.arch}`);
        r2.cmd(`e asm.bits=${bit}`);

        // Perform a short disassembly
        const disasm = r2.cmdj('pdj 80');
        let invalidCount = 0;

        // Count invalid instructions
        disasm.forEach(instruction => {
            if (instruction.opcode === 'invalid') {
                invalidCount++;
            }
        });

        console.log(`Testing ${config.arch}-${bit}: ${invalidCount} invalid instructions`);

        // Track the best configuration
        if (invalidCount < bestMatch.invalidCount) {
            bestMatch = {arch: config.arch, bits: bit, invalidCount};
        }
    }
}

console.log(`Best match: ${bestMatch.arch}-${bestMatch.bits}`);

// Set the best configuration
r2.cmd(`e asm.arch=${bestMatch.arch}`);
r2.cmd(`e asm.bits=${bestMatch.bits}`);

How It Works:

The script iterates through the different configurations, disassembles the first 80 instructions with each one, and counts how many of them can’t be decoded and are therefore reported as invalid.

Best Match Selection: It tracks the configuration with the lowest count of invalid instructions, which is likely to be the correct one.

$ uname -m
arm64
$ r2 -i whatarch.r2.js /bin/ls
Testing arm-64: 0 invalid instructions
Testing arm-32: 12 invalid instructions
Testing arm-16: 3 invalid instructions
Testing x86-64: 3 invalid instructions
Testing x86-32: 1 invalid instructions
Testing x86-16: 0 invalid instructions
Testing mips-64: 18 invalid instructions
Testing mips-32: 2 invalid instructions
Testing mips-16: 18 invalid instructions
Testing ppc-64: 4 invalid instructions
Testing ppc-32: 4 invalid instructions
Best match: arm-64
 -- Now with more better English!
[0x100003a58]>

This script makes several assumptions that can be improved, and it’s important to take them into consideration.

We took the instructions starting from the entrypoint
Firmware images usually have a single jump at the beginning, followed by data
Sometimes the code is not in the right place
Other metrics, like branches to valid destinations, would make sense too
16-bit architectures tend to have a low ratio of invalid instructions, which may lead to false positives
This script only plays with a hardcoded list of arch/bits; we could use the whole list of archs too
The script doesn’t play with endian or asm.cpu values
We could also decode every instruction at bit level and analyze the entropy level of each operand
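
As a starting point, here is a sketch that addresses a few of these items: it also toggles endianness through cfg.bigendian, disassembles from a caller-chosen address instead of the current seek, scores by the ratio of invalid instructions, and returns the full ranking so ties stay visible. The names CANDIDATES, invalidRatio and detectArch are made up for this example, and the candidate list is illustrative rather than exhaustive.

```javascript
// Illustrative candidate list (an assumption, not the full plugin list)
const CANDIDATES = [
    { arch: "arm", bits: [64, 32, 16] },
    { arch: "mips", bits: [64, 32] },
    { arch: "ppc", bits: [64, 32] },
    { arch: "x86", bits: [64, 32, 16] },
];

// Fraction of instructions that failed to decode (1 when nothing decoded)
function invalidRatio(ops) {
    if (!Array.isArray(ops) || ops.length === 0) {
        return 1;
    }
    const invalid = ops.filter((op) => op.opcode === "invalid").length;
    return invalid / ops.length;
}

// Try every arch/bits/endian combination at `address`, best ratio first
function detectArch(r2, candidates, address, count) {
    const results = [];
    for (const { arch, bits } of candidates) {
        for (const bit of bits) {
            for (const bigEndian of [false, true]) {
                r2.cmd(`e asm.arch=${arch}`);
                r2.cmd(`e asm.bits=${bit}`);
                r2.cmd(`e cfg.bigendian=${bigEndian}`);
                const ops = r2.cmdj(`pdj ${count} @ ${address}`);
                results.push({ arch, bits: bit, bigEndian, ratio: invalidRatio(ops) });
            }
        }
    }
    results.sort((a, b) => a.ratio - b.ratio);
    return results;
}

// Inside r2js the global `r2` object exists; elsewhere this is skipped.
if (typeof r2 !== "undefined") {
    const ranked = detectArch(r2, CANDIDATES, "0", 80);
    for (const r of ranked.slice(0, 3)) {
        const endian = r.bigEndian ? "big" : "little";
        console.log(`${r.arch}-${r.bits} ${endian}: ${(r.ratio * 100).toFixed(1)}% invalid`);
    }
    // Apply the top-ranked configuration
    const best = ranked[0];
    r2.cmd(`e asm.arch=${best.arch}`);
    r2.cmd(`e asm.bits=${best.bits}`);
    r2.cmd(`e cfg.bigendian=${best.bigEndian}`);
}
```

Printing the top three candidates instead of a single winner makes the 16-bit false positives mentioned above easier to spot. Branch-target validation, asm.cpu and operand entropy are still left out.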

Today’s challenge is to write a better version of this script that addresses the problems described in this post, and to share it! Ideally as a pull request to the examples/ directory on GitHub!

Keep learning

If you are curious about radare2, I recommend checking out the following links:

Feel free to contribute and open tickets to improve the documentation from what you learned here!

Wrapping Up

Identifying a binary’s architecture is the foundation of effective reverse engineering, and Radare2 offers robust tools to assist in this process. By using commands like rasm2 -L, i, and asm.cpu, we can investigate architecture and bit options manually. In cases with limited metadata, scripts can automate the detection process, saving valuable time.

Hope you all learned something new and see you tomorrow in the second advent post of radare2!

–pancake