19 - Function Bytes

Welcome to Day 19 of the Advent of Radare!

Today’s focus is on extracting byte sequences from functions, which raises several questions:

This post will try to answer all these questions and provide you with key commands to create YARA rules and zignature files, as well as identify patterns for similar functions. We’ll also address the challenges of different code constructions and discuss how to eliminate parts that can vary between similar patterns.

Linear vs Sparse

Functions can be described as a consecutive list of instructions that have one entrypoint. The rest of the rules can vary depending on how they are implemented.

As you can see, things that may look simple or easy can become a complete nightmare for the analyst, and even more so for anyone writing software that aims to reliably recover the real constructions behind the assembly.

So, with all these concepts in mind, we may want to know the best or easiest way to get a list of all the basic blocks.

Linear Paradise

Imagine a perfect world where all compilers generate a single entry point for every function, without reusing basic blocks, and place the implementation linearly below the entry point.

Imagine these functions have no data mixed with code within their boundaries, instead delegating data to the space between functions or into a separate rodata section within the binary.

In this imaginary world, we wouldn’t have many problems and could simply use commands like:

pD $FS @ $FB

Where:

* $FS: The linear size of the function
* $FB: The beginning address of the function

Alternatively, you can use p8 to view the byte sequence in hexadecimal pairs:

p8 $FS @ $FB

This linear disassembly is usually easily readable with commands like pdf or pD $FS. However, if you’ve used these commands for a while, you’ve probably noticed that sometimes the output is incomplete and contains disappointing “…” ellipses.

Unfortunately, this perfect world doesn’t exist.

Sparse Land

Now let’s get into the harsh problems of the real world. Following the rules we covered before, we need a way to enumerate the basic blocks of a function: afb.

[0x100003a58]> afb~:0..10
0x100003a58 0x100003aa4 00:0000 76 j 0x100003aa8 f 0x100003aa4
0x100003aa4 0x100003aa8 00:0000 4 j 0x100003aa8
0x100003aa8 0x100003aec 00:0000 68 j 0x100003b1c f 0x100003aec
0x100003aec 0x100003b00 00:0000 20 j 0x100003b88 f 0x100003b00
0x100003b00 0x100003b1c 00:0000 28 j 0x100003b88
0x100003b1c 0x100003b38 00:0000 28 j 0x100003b40 f 0x100003b38
0x100003b38 0x100003b40 00:0000 8 j 0x100003b6c f 0x100003b40
0x100003b40 0x100003b60 00:0000 32 j 0x100003b80 f 0x100003b60
0x100003b60 0x100003b68 00:0000 8 j 0x100003b7c f 0x100003b68
0x100003b68 0x100003b6c 00:0000 4 j 0x100003b80
[0x100003a58]>
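If we want to post-process this listing outside r2, a small parser is enough. The Python sketch below assumes the column layout seen in the output above (start, end, an index-like field, size, then optional j and f targets). That layout is inferred from the sample, not from a documented format, and parse_afb is a name made up for this example.

```python
# Two lines copied from the afb listing above.
AFB_SAMPLE = """\
0x100003a58 0x100003aa4 00:0000 76 j 0x100003aa8 f 0x100003aa4
0x100003aa4 0x100003aa8 00:0000 4 j 0x100003aa8
"""

def parse_afb(text):
    """Parse afb-style lines into dicts (column layout is an assumption)."""
    blocks = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 4 or not cols[0].startswith("0x"):
            continue  # skip prompts and blank lines
        block = {
            "addr": int(cols[0], 16),
            "end": int(cols[1], 16),
            "size": int(cols[3]),
            "jump": None,
            "fail": None,
        }
        # The remaining columns come in pairs: "j <addr>" and/or "f <addr>".
        for key, value in zip(cols[4::2], cols[5::2]):
            if key == "j":
                block["jump"] = int(value, 16)
            elif key == "f":
                block["fail"] = int(value, 16)
        blocks.append(block)
    return blocks

blocks = parse_afb(AFB_SAMPLE)
for b in blocks:
    # Sanity check: the size column should equal end - start.
    assert b["end"] - b["addr"] == b["size"]
    print(hex(b["addr"]), b["size"], b["jump"] and hex(b["jump"]), b["fail"] and hex(b["fail"]))
```

The size check inside the loop is a cheap way to notice when the column assumption does not hold for the output you are parsing.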

Now that we have the list of basic block entrypoints of the current function, we need to disassemble every basic block:

pdb @@= `afbq`

We said we wanted to get the bytes, right? So we may replace the pdb with:

Additionally, we may want to use @ $BB to force the temporary seek to start at the beginning of the basic block. But, as we have learned, a single address can be owned by multiple basic blocks, so we must assume the afbq output is enough to determine each basic block address.

For parsing reasons, we can replace the following expression:

@@=`command`

With the non-backtick version: @@c:

This lets us write a one-liner that prints all the bytes as a single block:

[0x100003a58]> echo `p8b@@c:afbq`|sed -e 's, ,,g'
7f2303d5fc6fbaa9fa6701a9f85f02a9f65703a9f44f04a9fd7b05...

Visualizing how sparse the function code is can be done with the afb= command, which shows some nice ASCII art about it. (Yes, this ASCII art could be much better, and I’m open to suggestions and pull requests!)

[0x1000038fc]> afb=

0*  0x1000038fc ███―――――― 0x100003914
1   0x100003914 ――█―――――― 0x100003918
2   0x100003918 ――██――――― 0x100003928
3   0x100003928 ――――█―――― 0x100003930
4   0x100003930 ――――██――― 0x100003934
5   0x100003934 ―――――█――― 0x10000393c
6   0x10000393c ―――――████ 0x100003960
=>  0x1000038fc ^^^^^^^^^ 0x1000039fc
[0x1000038fc]>

Which command can we use instead of pdf to disassemble a sparse function?

Correct! It’s pdr, which stands for print-disasm-recursive. That command will probably not show the branch lines in the best possible way, but will cover all the basic blocks, trying to enumerate them by jump and address location order.

Give it a try!

Sorting blocks

For non-linear functions, it’s essential to process each basic block individually to retrieve the complete byte sequence.

Functions can be sorted using the afls command, which affects the default listing from afl. However, we can always use afl, (note the trailing comma) to create custom table queries to filter and reorder functions as needed.

[0x100003a58]> afls?
Usage: afls  [afls] # sort function list
| afls   same as aflsa
| aflsa  sort by address (same as afls)
| aflss  sort by size
| aflsn  sort by name
| aflsb  sort by number of basic blocks
[0x100003a58]>

Unlike afls, there’s no sorting command for afb (no afbs). This could be a potential contribution to the project. However, we can use afb, (again with the trailing comma) to filter basic block listings according to our requirements.

But what’s the correct order for sorting them? Can we simply use the entrypoint address as a numeric ordinal? Unfortunately, no. The appropriate approach depends on our specific needs.

If we’re reading code: It’s generally fine to follow each branch on every basic block until we’ve covered all basic blocks.

afbq > $bbs
p8 $BS @@c:cat $bbs
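To make "follow each branch until every block is covered" concrete, here is a minimal Python sketch of a breadth-first walk. The edge table is transcribed from the ten blocks in the afb listing earlier in this post. Targets that fall outside that truncated listing are skipped, and walk_blocks is a hypothetical helper, not something r2 provides.

```python
from collections import deque

# block address -> (jump target, fail target), taken from the afb sample.
EDGES = {
    0x100003a58: (0x100003aa8, 0x100003aa4),
    0x100003aa4: (0x100003aa8, None),
    0x100003aa8: (0x100003b1c, 0x100003aec),
    0x100003aec: (0x100003b88, 0x100003b00),
    0x100003b00: (0x100003b88, None),
    0x100003b1c: (0x100003b40, 0x100003b38),
    0x100003b38: (0x100003b6c, 0x100003b40),
    0x100003b40: (0x100003b80, 0x100003b60),
    0x100003b60: (0x100003b7c, 0x100003b68),
    0x100003b68: (0x100003b80, None),
}

def walk_blocks(edges, entry):
    """Breadth-first walk from the entry block, trying jump before fail."""
    order, seen, queue = [], {entry}, deque([entry])
    while queue:
        addr = queue.popleft()
        order.append(addr)
        for target in edges[addr]:
            # Ignore missing edges and targets we have no block for.
            if target in edges and target not in seen:
                seen.add(target)
                queue.append(target)
    return order

order = walk_blocks(EDGES, 0x100003a58)
print([hex(a) for a in order])
```

The walk always starts at the entrypoint and visits each block once, but the resulting order is not the address order, which is exactly why the sorting question matters.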

For cases where we need a pattern that’s as linear as possible, we might want to sort them and fill gaps with masked bytes.
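As a rough illustration of "sort and fill gaps with masked bytes", the Python sketch below joins blocks by address and pads any hole with zero bytes whose mask is also zero. The block addresses and bytes are hypothetical, linear_pattern is a name invented for this example, and overlapping blocks are rejected, not handled.

```python
def linear_pattern(blocks):
    """Join (addr, bytes) blocks by address; gaps get a 0x00 mask."""
    pattern, mask = bytearray(), bytearray()
    cursor = None  # address right after the last emitted byte
    for addr, data in sorted(blocks):
        if cursor is not None and addr < cursor:
            raise ValueError("overlapping blocks are not handled in this sketch")
        if cursor is not None and addr > cursor:
            gap = addr - cursor
            pattern += bytes(gap)  # placeholder bytes for the hole
            mask += bytes(gap)     # 0x00 in the mask means "ignore this byte"
        pattern += data
        mask += b"\xff" * len(data)
        cursor = addr + len(data)
    return bytes(pattern), bytes(mask)

# Hypothetical blocks: 4 bytes at 0x1000, then 2 bytes at 0x1006 (2-byte gap).
blocks = [(0x1006, b"\xc0\x03"), (0x1000, b"\x7f\x23\x03\xd5")]
pattern, mask = linear_pattern(blocks)
print(pattern.hex())  # 7f2303d50000c003
print(mask.hex())     # ffffffff0000ffff
```

Large gaps make the pattern long and mostly wildcard, so in practice it may be better to emit several shorter patterns instead of one padded pattern.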

Instead of using afbq (which sorts by offset), we can use afba which mirrors the implementation of afla. This means taking all basic blocks and traversing them in reverse order, covering all basic blocks and code paths while defining the proper analysis order following the jumps.

[0x100003a58]> afb? | grep order
| afba[!]                                       list basic blocks of current offset in analysis order (EXPERIMENTAL, see afla)

The only issue with this listing is that it’s reversed. We can solve this using tac (the reverse version of cat) to achieve our desired order.

aaaa           # analyze all the things
afba > $bbs    # list basic blocks in reverse analysis dependency order
tac $bbs       # print the lines of that file in reverse order, entrypoint should be the first

Zignaturezzz

Creating function signatures is useful for many reasons. While this post delves into the topic, we might be focusing too much on the technical requirements rather than the practical applications.

In r2land we call them “zignaturez”. Yes, with a ‘z’.

And as expected, radare2 ships its own implementation for signatures. You might wonder: “Why not just use FLIRT or whatever Ghidra provides?” The main reason is that IDA’s implementation is too simplistic and comes packed in a proprietary file format, while Ghidra’s FIDB is stored as Java serialization data, version 5.

I don’t really understand why such a complex topic was reduced to just “byte pattern + binary mask” and stored in a proprietary, non-standard file format. This explains why r2 has its own implementation, which is more flexible, configurable, powerful, and precise.

And the best part? As usual, hardly anyone knows about it! So you can feel even more exclusive when using these features.

The z command was chosen for this feature because s was already taken, and using z became a memorable pronunciation joke. We won’t go into much detail here since there’s an entire chapter in the r2book about it.

Let’s look at how to use it first, then we’ll explore the configuration options and metrics. Here’s a sample session:

aaaa       # analyze all
zg         # generate zignatures for all the functions
z* > z.r2  # dump them into an r2script

If we load a different file or a program with functions from the library we’ve analyzed, we must use the z/ command, which scans and compares every function with the loaded signatures. We can load multiple zignatures at once and manipulate them as needed.

To inspect what’s created in the file, we can simply read this z.r2 or dump it in JSON format.

[0x100003a58]> zj~{}|head -n 50
[
  {
    "name": "sym.imp.__tolower",
    "bytes": "110000b031820091300240f9110a1fd7",
    "mask": "ff000000ffffffffffffffffffffffff",
    "graph": {
      "cc": 1,
      "nbbs": 1,
      "edges": 0,
      "ebbs": 1,
      "bbsum": 16
    },
    "addr": 4294997000,
    "next": "sym.imp.abort",
    "types": "int __tolower (int c)",
    "refs": [

    ],
    "xrefs": [
      "sym.func.100006780"
    ],
    "collisions": [

    ],
    "vars": [

    ],
    "hash": {
      "bbhash": "ceb84efacc7c830486ac15b8fa27a452ecb0f75f3d21eac04bd44bdcfc580a2e"
    }
  },
  {
    "name": "sym.imp.compat_mode",
    "bytes": "110000b031220291300240f9110a1fd7",
    "mask": "ff000000ffffffffffffffffffffffff",
    "graph": {
      "cc": 1,
      "nbbs": 1,
      "edges": 0,
      "ebbs": 1,
      "bbsum": 16
    },

This generates a zignature, a function signature that uses bitmasks to exclude bits prone to variation.
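To see what "excluding bits" means in practice, here is a small Python sketch of a masked comparison using the bytes and mask of sym.imp.__tolower from the JSON above. It illustrates the general byte-pattern-plus-mask idea only. It is not r2's matching code, masked_match is a made-up name, and the variant buffer is hypothetical.

```python
def masked_match(data, pattern, mask):
    """True when data equals pattern on every bit that the mask keeps."""
    if not (len(data) == len(pattern) == len(mask)):
        return False
    return all((d & m) == (p & m) for d, p, m in zip(data, pattern, mask))

# Bytes and mask of sym.imp.__tolower from the zj output above.
pattern = bytes.fromhex("110000b031820091300240f9110a1fd7")
mask = bytes.fromhex("ff000000ffffffffffffffffffffffff")

# Hypothetical build where only the three masked bytes changed.
variant = bytes.fromhex("11aabbcc31820091300240f9110a1fd7")
# sym.imp.compat_mode differs in bytes that the mask keeps (0x82 vs 0x22).
other = bytes.fromhex("110000b031220291300240f9110a1fd7")

print(masked_match(variant, pattern, mask))  # True
print(masked_match(other, pattern, mask))    # False
```

The second result also shows why the two import stubs above do not collide on bytes alone, even though they share the same mask.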

We can filter the patterns with the bytes like this:

z~bytes[1]

To see the bitmask applied by the zignature (showing which bits are ignored), use:

z~mask[1]

This bitmask helps identify which parts of the instruction sequence are static and which are variable, making it easier to create robust YARA rules or other detection mechanisms that remain effective across different build variations.
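One way to carry a bytes/mask pair over to YARA is to turn it into a hex string with wildcards. The Python sketch below does that per nibble: a nibble is kept only when its four mask bits are all set, and anything partially masked becomes "?", which is looser than the original mask but never stricter. to_yara_hex is a hypothetical helper, and the sample values come from the sym.imp.__tolower entry above.

```python
def to_yara_hex(pattern_hex, mask_hex):
    """Turn a bytes/mask pair into a YARA hex string with nibble wildcards."""
    pattern = bytes.fromhex(pattern_hex)
    mask = bytes.fromhex(mask_hex)
    if len(pattern) != len(mask):
        raise ValueError("pattern and mask must have the same length")
    tokens = []
    for p, m in zip(pattern, mask):
        text = f"{p:02X}"
        # Keep a nibble only when the mask fixes all four of its bits.
        hi = text[0] if m & 0xF0 == 0xF0 else "?"
        lo = text[1] if m & 0x0F == 0x0F else "?"
        tokens.append(hi + lo)
    return "{ " + " ".join(tokens) + " }"

rule_body = to_yara_hex(
    "110000b031820091300240f9110a1fd7",
    "ff000000ffffffffffffffffffffffff",
)
print(rule_body)
# { 11 ?? ?? ?? 31 82 00 91 30 02 40 F9 11 0A 1F D7 }
```

The resulting string can be pasted as a hex string in the strings section of a rule. A 16-byte pattern like this one is short, so expect it to need extra conditions before it is useful for detection.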

Metrics and Options

Other tools come with just a big button that makes things happen, but that’s not the r2 way. Here, we prefer to understand how things work and customize them to fit our specific use cases. While we strive to provide good defaults, sometimes they might not work well or haven’t been thoroughly tested for... reasons.

The metrics used to generate signatures are the following:

We can see the configuration options to generate and match the metrics with the following command:

[0x100003a58]> e??zign.
       zign.autoload: autoload all zignatures located in dir.zigns
          zign.bytes: use bytes patterns for matching
   zign.diff.bthresh: threshold for diffing zign bytes [0, 1] (see zc?)
   zign.diff.gthresh: threshold for diffing zign graphs [0, 1] (see zc?)
           zign.dups: allow duplicate zignatures
          zign.graph: use graph metrics for matching
           zign.hash: use Hash for matching
        zign.mangled: use the manged name for zignatures (EXPERIMENTAL)
          zign.maxsz: maximum zignature length
          zign.mincc: minimum cyclomatic complexity for matching
          zign.minsz: minimum zignature length for matching
         zign.offset: use original offset for matching
         zign.prefix: default prefix for zignatures matches
           zign.refs: use references for matching
      zign.threshold: minimum similarity required for inclusion in zb output
          zign.types: use types for matching
[0x100003a58]>

Yarayarayara

R2Yara brings the power of YARA pattern matching into radare2, enabling efficient binary analysis and malware detection. This integration allows you to scan binaries for specific patterns using YARA rules directly within your r2 session.

Simply install using r2pm:

$ r2pm -ci r2yara

R2Yara provides two command sets: yara for full commands and yr for shorter alternatives. The key commands are:

[0x00000000]> yr crypto.yara   # Load crypto detection rules
[0x00000000]> yrs              # Scan current binary

R2Yara automatically creates flags at matched locations, making it valuable for both automated analysis and manual investigation of suspicious binaries.

Common use cases for r2yara are:

I encourage you to watch “Uncovering more crypto secrets”, a presentation by Sylvain and Azox at #r2con2024, to learn more about practical use cases of YARA rules for cryptographic pattern detection.

Challenge

With the scripting knowledge from yesterday’s post, I’m challenging you to create an r2js script (or Python using the r2pipe API) that creates a binary mask pattern for every function, similar to how the zignatures implementation works. Try to improve upon the existing implementation by:

  1. Warning about patterns that might result in false positives
  2. Identifying potential misbehavior
  3. Running the script across all binaries from the testbins repository to perform testing and understand the challenges encountered

$ for a in test/bins/**/* ; do r2 -qi script.r2.js "$a" ; done
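As a starting point for the first item, here is a minimal Python sketch of a false-positive warning for a bytes/mask pair. The two thresholds are arbitrary assumptions and pattern_warnings is a made-up name. A real script would likely also look at how common the fixed bytes are across the whole binary.

```python
def pattern_warnings(pattern_hex, mask_hex, min_fixed_bytes=8, min_fixed_ratio=0.5):
    """Return a list of warnings for a bytes/mask pair (thresholds are arbitrary)."""
    pattern = bytes.fromhex(pattern_hex)
    mask = bytes.fromhex(mask_hex)
    if len(pattern) != len(mask):
        return ["pattern and mask differ in length"]
    warnings = []
    # Count bytes that the mask leaves completely fixed.
    fixed = sum(1 for m in mask if m == 0xFF)
    if fixed < min_fixed_bytes:
        warnings.append(f"only {fixed} fully fixed byte(s)")
    if mask and fixed / len(mask) < min_fixed_ratio:
        warnings.append(f"fixed byte ratio is below {min_fixed_ratio}")
    return warnings

# The sym.imp.__tolower entry from the zj output: 13 of 16 bytes are fixed.
print(pattern_warnings("110000b031820091300240f9110a1fd7",
                       "ff000000ffffffffffffffffffffffff"))
# A hypothetical 4-byte pattern with a single fixed byte.
print(pattern_warnings("11000000", "ff000000"))
```

The first call returns an empty list and the second returns two warnings, which is the kind of signal worth collecting while running over the test binaries.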

After completing this, share your results by:

Additional questions:

Summary

Radare2 provides robust tools for displaying a function’s byte sequence, whether the function is linear or divided into basic blocks.

Commands like p8 and pD make it easy to capture bytes for linear functions, while p8 $BS @@c:afbq captures bytes across multiple basic blocks. For reliable detection across builds, zignatures (zg) generate function signatures with bitmasks to ignore variable bits, helping you create accurate, flexible YARA rules.

Stay tuned for tomorrow’s Radare2 post as we continue exploring advanced analysis and reverse engineering techniques!