COMP4801 · Final Year Project · HKU CS · 2025–26

Disassembling ARM Binaries with Reinforcement Learning

Budgeted Analysis-Strategy Selection for Binary Indirect Control Flow Resolution

Resolving indirect control flow in stripped ARM binaries is a prerequisite for virtually every downstream program analysis task, yet the tools capable of doing so span a wide spectrum of cost and capability. This project asks: under a fixed analysis budget, can a learned policy decide which tool to invoke at each indirect site better than strong baselines? We formulate the task as a Markov decision process (MDP) and train a per-binary PPO policy over a three-level tool hierarchy. The answer is yes in some regimes and explicitly no in others.

0.97
Resolve rate on gcc & dealII (L1-unresolved subset)

0.43
Strongest baseline, same formulation

0.31 → 0.85
ssh resolve rate: initial 4-action vs sunk-cost 3-action

3 tiers
Pattern matching, CFG recovery, symbolic execution

Key figures

Key visual results

The following figures highlight the strongest improvements observed in Phase 2 and the cross-phase trend across binaries.

Phase 2: RL vs random baseline

[Figure: resolve rate by binary in Phase 2, learned RL policy versus the random baseline.]

Phase 1 → Phase 2 slope (ssh highlighted)

[Figure: slope chart of RL resolve rate by binary from Phase 1 to Phase 2, with ssh highlighted.]

The Problem

Indirect control flow makes CFG recovery expensive and uneven

Many analyses begin from a recovered control-flow graph (CFG). On stripped, optimised binaries, a large fraction of edges are indirect: their targets are computed at run time rather than encoded in the instruction, so a linear sweep cannot see them. Typical sources include jump tables, vtable dispatch, and stored function pointers.

Practical tools sit on a capability–cost ladder: fast pattern matchers miss harder sites, while heavier abstract interpretation and symbolic execution resolve more but burn budget quickly. Under a fixed budget, choosing where to spend expensive analyses is a scheduling problem, one usually left to tool defaults or manual tuning with little per-binary adaptation.
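To make the scheduling problem concrete, here is a minimal sketch of a fixed "cheapest-first" escalation heuristic under a budget. It is not one of the project's baselines. The tier costs, the `Site` type, and its `min_tier` field (the cheapest tier that would resolve the site) are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-tier costs in arbitrary budget units (not from the project).
TIER_COST = {1: 1.0, 2: 5.0, 3: 25.0}


@dataclass
class Site:
    """An indirect site; min_tier is the cheapest tier that resolves it.

    min_tier is an illustrative oracle value; None means no tier succeeds.
    """
    name: str
    min_tier: Optional[int]


def escalate_cheapest_first(sites, budget):
    """Try tiers 1, 2, 3 in order at each site until one succeeds or is unaffordable."""
    resolved = []
    remaining = budget
    for site in sites:
        for tier in (1, 2, 3):
            cost = TIER_COST[tier]
            if cost > remaining:
                break  # cannot afford this tier; move on to the next site
            remaining -= cost  # failed attempts still consume budget
            if site.min_tier is not None and tier >= site.min_tier:
                resolved.append(site.name)
                break
    return resolved, remaining


sites = [Site("a", 1), Site("b", 3), Site("c", 2)]
print(escalate_cheapest_first(sites, budget=40.0))
print(escalate_cheapest_first(sites, budget=20.0))
```

With the smaller budget, the heuristic spends units on failed tier-1 and tier-2 attempts at site `b` and then cannot afford tier 3 there. A per-site choice of tier is meant to avoid exactly this kind of waste.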

Approach

MDP formulation and per-binary PPO over a tool hierarchy

We model budgeted strategy selection as a sequential MDP: at each unresolved indirect site, the policy selects among three tiers of analysis. Training uses PPO on a corpus of ARM32 ELF binaries; policies are learned per binary to reflect differing structure and difficulty.
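The formulation above can be sketched as a toy episodic environment with one tier choice per site and a shared budget. This is an illustration under stated assumptions, not the project's environment: the class name `BudgetedResolveEnv`, the tier costs, the two-element observation, the reward shape, and the `cost_weight` penalty are all hypothetical. The `reset`/`step` interface is a common convention that a PPO implementation could be wrapped around.

```python
from dataclasses import dataclass, field

# Hypothetical cost of tiers 1..3, indexed by action 0..2 (illustrative only).
TIER_COST = (1.0, 5.0, 25.0)


@dataclass
class BudgetedResolveEnv:
    """Toy environment: one decision per indirect site, under a shared budget."""
    difficulties: list          # per-site minimum action index (0, 1, 2) that resolves it
    budget: float
    cost_weight: float = 0.01   # assumed reward penalty per unit of budget spent
    _i: int = field(default=0, init=False)
    _left: float = field(default=0.0, init=False)

    def reset(self):
        self._i = 0
        self._left = self.budget
        return self._obs()

    def _obs(self):
        # Observation: fraction of sites visited and fraction of budget left.
        return (self._i / len(self.difficulties), self._left / self.budget)

    def step(self, action):
        """Apply tier `action` to the current site; return (obs, reward, done)."""
        cost = TIER_COST[action]
        reward = 0.0
        if cost <= self._left:
            self._left -= cost
            resolved = action >= self.difficulties[self._i]
            reward = (1.0 if resolved else 0.0) - self.cost_weight * cost
        # An unaffordable action is skipped: no cost is charged and the reward is zero.
        self._i += 1
        done = self._i >= len(self.difficulties) or self._left < TIER_COST[0]
        return self._obs(), reward, done


env = BudgetedResolveEnv(difficulties=[0, 2, 1], budget=40.0)
obs = env.reset()
for action in (0, 2, 2):
    obs, reward, done = env.step(action)
    print(action, obs, round(reward, 2), done)
```

A real observation would carry site-level features rather than two progress fractions. The sketch only shows how budget, action, and reward interact within an episode.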

A key empirical finding is that reformulating the action space (for example, moving to a sunk-cost three-action formulation on ssh) can matter more than fine-grained algorithm tuning: the decision interface shapes what reinforcement learning can exploit.

Results & Limits

Strong gains in some binaries, clear negative space elsewhere

On gcc and dealII, the learned policy reaches a 97% resolve rate on the L1-unresolved subset versus 43% for the strongest baseline under the same MDP formulation. On ssh, the resolve rate rises from 0.31 under the initial four-action formulation to 0.85 under the sunk-cost three-action formulation.

We also identify binaries and budget regimes where learning offers no advantage over simple heuristics. The full report discusses these boundary cases explicitly rather than overstating universal wins.

Scope & Deployment

When this policy is worth using

The strongest gains appear when the L1-unresolved subset is still large and dominated by hard sites, so L2/L3 allocation remains a real scheduling problem rather than a near-random spending decision. In small residual sets, learned scheduling often matches or trails cheap heuristics.

All resolve-rate numbers on this page are measured on the L1-unresolved subset under the silver-label success model, not on the full indirect-site set or compiler-level ground truth.
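The difference between the two denominators can be illustrated with a small calculation. The function name, the input encoding, and the example counts are hypothetical and are not project data. The sketch also assumes that a site resolved at L1 counts as resolved on the full set.

```python
def resolve_rates(outcomes):
    """Compute resolve rates from (l1_resolved, finally_resolved) pairs, one per site.

    Returns (rate on the L1-unresolved subset, rate on the full indirect-site set).
    """
    subset = [final for l1, final in outcomes if not l1]
    subset_rate = sum(subset) / len(subset) if subset else float("nan")
    full = [final for _, final in outcomes]
    full_rate = sum(full) / len(full) if full else float("nan")
    return subset_rate, full_rate


# Made-up example: 10 sites, 6 resolved at L1, 3 of the remaining 4 resolved later.
outcomes = [(True, True)] * 6 + [(False, True)] * 3 + [(False, False)]
print(resolve_rates(outcomes))
```

In this made-up example the subset rate is 0.75 while the full-set rate is 0.9, so figures reported on the two denominators are not directly comparable.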

Within our tested transfer pairs, policies trained on one binary do not generalise reliably to another, so the current practical default is per-binary training. The next step is online-adaptive scheduling with richer code-level features and cross-site information sharing.

About

Zhang Hengyuan · BEng(CompSc), HKU

Supervisor: Prof. Qian Chenxiong
Second examiner: Prof. Bruno Oliveira

Report & poster

Course site: wp2025.cs.hku.hk/fyp25023

Source code: https://github.com/PeterZh6/fyp25023

Final report

Poster