19 - Function Bytes

Welcome to Day 19 of the Advent of Radare!

Today’s focus is on extracting byte sequences from functions, which raises several questions:

Why do we need to do this?
Do all the bits matter?
Are they linearly mapped?
What’s the meaning of life?

This post will try to answer all these questions and provide you with key commands to create YARA rules and zignature files, as well as identify patterns for similar functions. We’ll also address the challenges of different code constructions and discuss how to eliminate parts that can vary between similar patterns.

Linear vs Sparse

Functions can be described as a consecutive list of instructions with a single entrypoint. The rest of the rules vary depending on how the function is implemented:

A function can have none, one or many exit points, and those may or may not return values to the caller
Code can be linear and consecutive, but it usually is not
Basic blocks can be shared between different functions
Basic blocks can be split by branches
Multiple basic blocks can coexist at the same address
Multiple representations of the same code can be seen at the same address if jump-in-the-middle hacks are used
The implementation can jump backwards, so the entrypoint doesn’t need to be the lowest address
Data can be inlined between the code blocks
There can be dead code: padding that aligns branch destinations in optimized builds, or code that was never removed in unoptimized ones
Relocations can patch the code at loading time and change the entire meaning of existence

As you can see, things that look simple or easy can become a complete nightmare for the analyst, and even more so for anyone writing software that aims to reliably recover the real constructions behind the assembly.

With all these concepts in mind, we want to know the best or easiest way to get a list of all the basic blocks.

Linear Paradise

Imagine a perfect world where all compilers generate a single entry point for every function, without reusing basic blocks, and place the implementation linearly below the entry point.

Imagine these functions have no data mixed with code within their boundaries, instead delegating data to the space between functions or into a separate rodata section within the binary.

In this imaginary world, we wouldn’t have many problems and could simply use commands like:

pD $FS @ $FB

Where:

$FS: the linear size of the function
$FB: the beginning address of the function

Alternatively, you can use p8 to view the byte sequence in hexadecimal pairs:

p8 $FS @ $FB

Note that $FB can also be written as $F

This linear disassembly is usually easily readable with commands like pdf or pD $FS. However, if you’ve used these commands for a while, you’ve probably noticed that sometimes the output is incomplete and contains disappointing “…” ellipses.

Unfortunately, this perfect world doesn’t exist.

Sparse Land

Now let’s get into the harsh problems of the real world. Following the rules listed earlier, we need a way to enumerate the basic blocks of a function: afb.

[0x100003a58]> afb~:0..10
0x100003a58 0x100003aa4 00:0000 76 j 0x100003aa8 f 0x100003aa4
0x100003aa4 0x100003aa8 00:0000 4 j 0x100003aa8
0x100003aa8 0x100003aec 00:0000 68 j 0x100003b1c f 0x100003aec
0x100003aec 0x100003b00 00:0000 20 j 0x100003b88 f 0x100003b00
0x100003b00 0x100003b1c 00:0000 28 j 0x100003b88
0x100003b1c 0x100003b38 00:0000 28 j 0x100003b40 f 0x100003b38
0x100003b38 0x100003b40 00:0000 8 j 0x100003b6c f 0x100003b40
0x100003b40 0x100003b60 00:0000 32 j 0x100003b80 f 0x100003b60
0x100003b60 0x100003b68 00:0000 8 j 0x100003b7c f 0x100003b68
0x100003b68 0x100003b6c 00:0000 4 j 0x100003b80
[0x100003a58]>

Use afbq (the quiet version of afb) to enumerate only the addresses.
The cons filter ~:0..10 is the same as | head -n 10
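If you want to post-process that listing outside of r2, a minimal Python sketch like the following can parse it and tell whether the blocks are laid out back to back. It assumes the column layout shown above (start, end, a trace field, size, then optional jump/fail targets); the names Block, parse_afb and is_linear are made up for this example.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    start: int
    end: int
    size: int


def parse_afb(text: str) -> List[Block]:
    """Parse text shaped like the afb output above into Block tuples."""
    blocks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip prompts and anything that is not "start end trace size ..."
        if len(fields) < 4 or not fields[0].startswith("0x"):
            continue
        blocks.append(Block(int(fields[0], 16), int(fields[1], 16), int(fields[3])))
    return blocks


def is_linear(blocks: List[Block]) -> bool:
    """True when every block starts exactly where the previous one ends."""
    ordered = sorted(blocks)
    return all(a.end == b.start for a, b in zip(ordered, ordered[1:]))


SAMPLE = """\
[0x100003a58]> afb~:0..10
0x100003a58 0x100003aa4 00:0000 76 j 0x100003aa8 f 0x100003aa4
0x100003aa4 0x100003aa8 00:0000 4 j 0x100003aa8
0x100003aa8 0x100003aec 00:0000 68 j 0x100003b1c f 0x100003aec
"""

blocks = parse_afb(SAMPLE)
print(len(blocks), is_linear(blocks))
```

A function whose blocks fail this check is one where a plain p8 $FS @ $FB would include bytes that do not belong to it.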

Now that we have the list of basic block entrypoints of the current function we need to disassemble every basic block:

pdb @@= `afbq`

pdb : disassemble the basic block (same as pD $BS @ $BB)
@@= : foreach operator that takes a space-separated list of addresses for temporary seeking
`afbq` : the backticks are replaced with the output of the command inside them, on the same line

We said we wanted to get the bytes, right? So we may replace the pdb with:

p8 $BS : show N hexpairs, where N comes from $BS, the numvar that holds the size of the current basic block.

Additionally, we may want to append @ $BB to force the temporary seek to start at the beginning of the basic block. But as we have learned, a single address can be owned by multiple basic blocks, so we must assume the afbq output is enough for us to determine each basic block address.

For parsing reasons, we can replace the following expression:

@@=`command`

With the non-backtick version: @@c:

This lets us write a oneliner that prints the bytes of every basic block as a single blob:

[0x100003a58]> echo `p8b@@c:afbq`|sed -e 's, ,,g'
7f2303d5fc6fbaa9fa6701a9f85f02a9f65703a9f44f04a9fd7b05...
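The same concatenation can be scripted. The sketch below is written against a plain callable that runs an r2 command and returns its output, so it works with the cmd method of an r2pipe session without importing r2pipe here. The function name function_hex is hypothetical, and the example assumes the session is already seeked inside an analyzed function.

```python
from typing import Callable


def function_hex(cmd: Callable[[str], str]) -> str:
    """Concatenate the hexpairs of every basic block of the current function.

    `cmd` is any callable that runs an r2 command and returns its text
    output, for example the `cmd` method of an r2pipe session.
    """
    chunks = []
    for addr in cmd("afbq").split():
        # One temporary seek per basic block, same idea as p8 $BS @@c:afbq
        chunks.append(cmd(f"p8 $BS @ {addr}").strip())
    return "".join(chunks)


# Possible usage with r2pipe (not executed here):
#   import r2pipe
#   r2 = r2pipe.open("/bin/ls")
#   r2.cmd("aaa; s main")
#   print(function_hex(r2.cmd))
```

Keeping the r2 session behind a callable also makes it easy to test the logic with canned output.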

How sparse the function code is can be visualized with the afb= command, which shows some nice ASCII art about it. (Yes, this ASCII art could be much better, and I’m open to suggestions and pull requests!)

[0x1000038fc]> afb=

0*  0x1000038fc ███―――――― 0x100003914
1   0x100003914 ――█―――――― 0x100003918
2   0x100003918 ――██――――― 0x100003928
3   0x100003928 ――――█―――― 0x100003930
4   0x100003930 ――――██――― 0x100003934
5   0x100003934 ―――――█――― 0x10000393c
6   0x10000393c ―――――████ 0x100003960
=>  0x1000038fc ^^^^^^^^^ 0x1000039fc
[0x1000038fc]>

Which command can we use instead of pdf to disassemble a sparse function?

Correct! It’s pdr, which stands for print-disasm-recursive. That command will probably not show the branch lines in the best possible way, but will cover all the basic blocks, trying to enumerate them by jump and address location order.

The good part of pdr is that it shows all the reachable code of the function
The bad part is that it skips dead code and inlined data, which is sometimes valuable for the reader

Give it a try!

Sorting blocks

For non-linear functions, it’s essential to process each basic block individually to retrieve the complete byte sequence.

Functions can be sorted using the afls command, which affects the default listing from afl. However, we can always use afl, (note the trailing comma) to create custom table queries that filter and reorder functions as needed.

[0x100003a58]> afls?
Usage: afls  [afls] # sort function list
| afls   same as aflsa
| aflsa  sort by address (same as afls)
| aflss  sort by size
| aflsn  sort by name
| aflsb  sort by number of basic blocks
[0x100003a58]>

Unlike afl, which has afls, there’s no sorting command for afb (no afbs). This could be a potential contribution to the project. However, we can use afb, (again with a trailing comma) to filter basic block listings according to our requirements.

But what’s the correct order for sorting them? Can we simply use the entrypoint address as a numeric ordinal? Unfortunately, no. The appropriate approach depends on our specific needs.

If we’re reading code: It’s generally fine to follow each branch on every basic block until we’ve covered all basic blocks.

afbq > $bbs
p8 $BS @@c:cat $bbs

Note: dollar-files are memory-based virtual files that can be used with any r2 command as if they were physical files.

For cases where we need a pattern that’s as linear as possible, we might want to sort them and fill gaps with masked bytes.
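As a rough illustration of that idea, the Python sketch below sorts blocks by address and pads the holes between them with bytes whose mask is zero, so they are ignored when matching. It is only one way of doing it, not how r2 builds its own patterns, and the linear_pattern name and the address-to-bytes dictionary are assumptions made for this example.

```python
from typing import Dict, Tuple


def linear_pattern(blocks: Dict[int, bytes]) -> Tuple[bytes, bytes]:
    """Build a (pattern, mask) pair from {start_address: block_bytes}.

    Gaps between blocks are filled with 0x00 bytes whose mask is 0x00,
    meaning "ignore this byte". Overlapping blocks only contribute the
    bytes that were not already covered.
    """
    pattern = bytearray()
    mask = bytearray()
    cursor = None  # next address not yet covered by the pattern
    for start in sorted(blocks):
        data = blocks[start]
        if cursor is None:
            cursor = start
        if start > cursor:
            gap = start - cursor
            pattern += b"\x00" * gap
            mask += b"\x00" * gap
            cursor = start
        elif start < cursor:
            data = data[cursor - start:]
        pattern += data
        mask += b"\xff" * len(data)
        cursor += len(data)
    return bytes(pattern), bytes(mask)


pat, msk = linear_pattern({0x14: b"\xcc", 0x10: b"\xaa\xbb"})
print(pat.hex(), msk.hex())
```

The result always starts at the lowest block address, which may not be the entrypoint when the function jumps backwards.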

Instead of using afbq (which sorts by offset), we can use afba which mirrors the implementation of afla. This means taking all basic blocks and traversing them in reverse order, covering all basic blocks and code paths while defining the proper analysis order following the jumps.

[0x100003a58]> afb? | grep order
| afba[!]                                       list basic blocks of current offset in analysis order (EXPERIMENTAL, see afla)

The only issue with this listing is that it’s reversed. We can solve this using tac (the reverse version of cat) to achieve our desired order.

aaaa           # analyze all the things
afba > $bbs    # list basic blocks in reverse analysis dependency order
tac $bbs       # print the lines of that file in reverse order, entrypoint should be the first

Zignaturezzz

Creating function signatures is useful for many reasons. While this post delves into the topic, we might be focusing too much on the technical requirements rather than the practical applications.

In r2land we call them “zignaturez”, yes; with ‘z’.

And as expected, radare2 ships its own implementation for signatures. You might wonder: “Why not just use FLIRT or whatever GHIDRA provides?” The main reason is that IDA’s implementation is too simplistic and comes packed in a proprietary file format, while Ghidra’s FIDB is stored as Java serialization data, version 5.

I don’t really understand why such a complex topic was reduced to just “byte pattern + binary mask” and stored in a proprietary, non-standard file format. This explains why r2 has its own implementation, which is more flexible, configurable, powerful, and precise.

And the best part? As usual, hardly anyone knows about it! So you can feel even more exclusive when using these features.

The z command was chosen for this feature because s was already taken, and using z became a memorable pronunciation joke. We won’t go into much detail here since there’s an entire chapter in the r2book about it.

Let’s look at how to use it first, then we’ll explore the configuration options and metrics. Here’s a sample session:

aaaa       # analyze all
zg         # generate zignatures for all the functions
z* > z.r2  # dump them into an r2script

If we load a different file or a program with functions from the library we’ve analyzed, we must use the z/ command, which scans and compares every function with the loaded signatures. We can load multiple zignatures at once and manipulate them as needed.

To inspect what’s created in the file, we can simply read this z.r2 or dump it in JSON format.

[0x100003a58]> zj~{}|head -n 50
[
  {
    "name": "sym.imp.__tolower",
    "bytes": "110000b031820091300240f9110a1fd7",
    "mask": "ff000000ffffffffffffffffffffffff",
    "graph": {
      "cc": 1,
      "nbbs": 1,
      "edges": 0,
      "ebbs": 1,
      "bbsum": 16
    },
    "addr": 4294997000,
    "next": "sym.imp.abort",
    "types": "int __tolower (int c)",
    "refs": [

    ],
    "xrefs": [
      "sym.func.100006780"
    ],
    "collisions": [

    ],
    "vars": [

    ],
    "hash": {
      "bbhash": "ceb84efacc7c830486ac15b8fa27a452ecb0f75f3d21eac04bd44bdcfc580a2e"
    }
  },
  {
    "name": "sym.imp.compat_mode",
    "bytes": "110000b031220291300240f9110a1fd7",
    "mask": "ff000000ffffffffffffffffffffffff",
    "graph": {
      "cc": 1,
      "nbbs": 1,
      "edges": 0,
      "ebbs": 1,
      "bbsum": 16
    },

Each entry is a zignature: a function signature that uses bitmasks to exclude bits prone to variation.

We can list just the byte patterns like this:

z~bytes[1]

To see the bitmask applied by the zignature (showing which bits are ignored), use:

z~mask[1]

This bitmask helps identify which parts of the instruction sequence are static and which are variable, making it easier to create robust YARA rules or other detection mechanisms that remain effective across different build variations.
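To make the bytes-plus-mask idea concrete, here is a small Python sketch that checks a buffer against a masked pattern and converts the pair into a YARA hex string. YARA wildcards work on nibbles, so any nibble that is not fully fixed by the mask is widened to "?", which makes the rule slightly more permissive than the original bitmask. The function names are invented for this example; the input values are the first entry of the JSON dump above.

```python
def matches(data: bytes, pattern: bytes, mask: bytes) -> bool:
    """True when data agrees with pattern on every bit set in mask."""
    if len(data) < len(pattern):
        return False
    return all((d & m) == (p & m) for d, p, m in zip(data, pattern, mask))


def yara_hex(pattern_hex: str, mask_hex: str) -> str:
    """Turn a bytes/mask pair (hex strings) into a YARA hex string."""
    pattern = bytes.fromhex(pattern_hex)
    mask = bytes.fromhex(mask_hex)
    tokens = []
    for p, m in zip(pattern, mask):
        # Keep a nibble only when all four of its bits are fixed by the mask
        hi = f"{p >> 4:X}" if m & 0xF0 == 0xF0 else "?"
        lo = f"{p & 0x0F:X}" if m & 0x0F == 0x0F else "?"
        tokens.append(hi + lo)
    return "{ " + " ".join(tokens) + " }"


BYTES = "110000b031820091300240f9110a1fd7"
MASK = "ff000000ffffffffffffffffffffffff"
print(yara_hex(BYTES, MASK))
```

The printed string can be pasted into the strings section of a rule, for example as $fn = { ... }.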

Metrics and Options

Other tools come with just a big button that makes things happen, but that’s not the r2 way. Here, we prefer to understand how things work and customize them to fit our specific use cases. While we strive to provide good defaults, sometimes they might not work well or haven’t been thoroughly tested for... reasons.

The metrics used to generate signatures are the following:

name: is the function named in a similar way?
bytes: linear byte patterns
mask: binary mask to be applied on the bytes
graph: code complexity, basic block count, number of edges, ending basic blocks...
context: which are the next and previous symbols
types: which types are used as arguments or variables, and the function signature
refs: which data is referenced
xrefs: who is referencing this function
hash: approximated minhash of the bytes, used to compute the distance to other functions
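The graph metric is easy to reason about with a toy model. The sketch below computes the counters seen in the "graph" object of the JSON dump from a successor map. It assumes cc follows the common E - N + 2P cyclomatic complexity formula with P taken as the number of ending blocks, which agrees with the single-block entries shown earlier but is an assumption about r2 internals rather than something confirmed here. bbsum is left out on purpose.

```python
from typing import Dict, List


def graph_metrics(succ: Dict[int, List[int]]) -> Dict[str, int]:
    """Compute graph counters from {block_address: [successor_addresses]}."""
    nbbs = len(succ)                                    # number of basic blocks
    edges = sum(len(s) for s in succ.values())          # jump + fail edges
    ebbs = sum(1 for s in succ.values() if not s)       # blocks with no successor
    # Assumed formula: CC = E - N + 2P, with P = ending basic blocks
    return {"cc": edges - nbbs + 2 * ebbs, "nbbs": nbbs, "edges": edges, "ebbs": ebbs}


single = graph_metrics({0x1000: []})
diamond = graph_metrics({0x10: [0x20, 0x30], 0x20: [0x40], 0x30: [0x40], 0x40: []})
print(single, diamond)
```

Two functions with different bytes but the same counters are candidates for being the same code built with different options, which is why this metric complements the byte pattern.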

We can see the configuration options to generate and match the metrics with the following command:

[0x100003a58]> e??zign.
       zign.autoload: autoload all zignatures located in dir.zigns
          zign.bytes: use bytes patterns for matching
   zign.diff.bthresh: threshold for diffing zign bytes [0, 1] (see zc?)
   zign.diff.gthresh: threshold for diffing zign graphs [0, 1] (see zc?)
           zign.dups: allow duplicate zignatures
          zign.graph: use graph metrics for matching
           zign.hash: use Hash for matching
        zign.mangled: use the manged name for zignatures (EXPERIMENTAL)
          zign.maxsz: maximum zignature length
          zign.mincc: minimum cyclomatic complexity for matching
          zign.minsz: minimum zignature length for matching
         zign.offset: use original offset for matching
         zign.prefix: default prefix for zignatures matches
           zign.refs: use references for matching
      zign.threshold: minimum similarity required for inclusion in zb output
          zign.types: use types for matching
[0x100003a58]>

Yarayarayara

R2Yara brings the power of YARA pattern matching into radare2, enabling efficient binary analysis and malware detection. This integration allows you to scan binaries for specific patterns using YARA rules directly within your r2 session.

Simply install using r2pm:

$ r2pm -ci r2yara

R2Yara provides two command sets: yara for full commands and yr for shorter alternatives. The key commands are:

Load rules: yr <file>
List rules: yrl
Scan binary: yrs
Clear rules: yr-*

[0x00000000]> yr crypto.yara   # Load crypto detection rules
[0x00000000]> yrs              # Scan current binary

R2Yara automatically creates flags at matched locations, making it valuable for both automated analysis and manual investigation of suspicious binaries.

Common use cases for r2yara are:

Malware Detection
- Identify known malware patterns
- Match suspicious code structures
Crypto Detection
- Find cryptographic constants
- Identify encryption algorithms
Binary Classification
- Detect compiler patterns
- Match library signatures

I encourage you to watch “Uncovering more crypto secrets”, a presentation by Sylvain and Azox at #r2con2024, to learn more about practical use cases of YARA rules for cryptographic pattern detection.

Challenge

With the scripting knowledge from yesterday’s post, I’m challenging you to create an r2js script (or Python using the r2pipe API) that creates a binary mask pattern for every function, similar to how the zignatures implementation works. Try to improve upon the existing implementation by:

Warning about patterns that might result in false positives
Identifying potential misbehavior
Running the script across all binaries from the testbins repository to perform testing and understand the challenges encountered

$ for a in test/bins/**/* ; do r2 -qi script.r2.js $a ; done

After completing this, share your results by:

Posting the script you used to the examples repository
Identifying at least 3 binaries that exhibited unusual behavior patterns
Explaining the specific issues encountered with these binaries

Additional questions:

Which architectures proved most challenging when creating function patterns?
What limitations did you encounter during implementation?
Could some of these missing elements be exposed in the ao output, rather than relying on the zignatures implementation to remove parameterized and immediate values from the disassembly and show the associated binary mask?
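As a possible starting point for the false-positive part of the challenge, the Python sketch below inspects a pattern and its mask and returns human-readable warnings. The checks and the 64-bit threshold are arbitrary choices for illustration, not values taken from the zignatures implementation, and pattern_warnings is a hypothetical helper name.

```python
from typing import List


def pattern_warnings(pattern: bytes, mask: bytes, min_fixed_bits: int = 64) -> List[str]:
    """Return warnings about a bytes/mask pair that is likely to over-match."""
    warnings = []
    if len(pattern) != len(mask):
        warnings.append("pattern and mask have different lengths")
    # Count the bits the mask actually constrains
    fixed_bits = sum(bin(m).count("1") for m in mask)
    if fixed_bits < min_fixed_bits:
        warnings.append(f"only {fixed_bits} fixed bits, pattern may match unrelated code")
    if mask and mask[0] == 0x00:
        warnings.append("pattern starts with a fully masked byte")
    # A pattern whose fixed part is a single repeated value is probably padding
    if pattern and len({p & m for p, m in zip(pattern, mask)}) == 1:
        warnings.append("masked pattern is a single repeated byte value")
    return warnings


good = pattern_warnings(
    bytes.fromhex("110000b031820091300240f9110a1fd7"),
    bytes.fromhex("ff000000ffffffffffffffffffffffff"),
)
weak = pattern_warnings(b"\x00\x00\x00\x00", b"\xff\x00\x00\x00")
print(good, weak)
```

Running a check like this over every function of every test binary gives a quick list of candidates to inspect by hand.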

Summary

Radare2 provides robust tools for displaying a function’s byte sequence, whether the function is linear or divided into basic blocks.

Commands like p8 and pD make it easy to capture bytes for linear functions, while p8 $BS @@c:afbq captures bytes across multiple basic blocks. For reliable detection across builds, zignatures (zg) generate function signatures with bitmasks to ignore variable bits, helping you create accurate, flexible YARA rules.

Stay tuned for tomorrow’s Radare2 post as we continue exploring advanced analysis and reverse engineering techniques!