Writing a 32-bit CPU Emulator and Assembler in C++

There is a moment in every programmer’s life when they stop looking at code as magic spells and start wondering, “How does the metal actually understand this?”

We spend our lives building abstractions. We write classes, functions, and modules to distance ourselves from the raw hardware. But eventually, curiosity strikes. That curiosity led me to build my own custom CPU emulator. I didn’t want to emulate an existing chip like the NES (6502) or the Gameboy (Z80) just yet. I wanted to design my own architecture to truly understand the relationship between memory, registers, and the stack.

Introduction

There’s no shortage of resources explaining how processors work, but nothing teaches you faster than implementing one yourself. It forces you to reason about the absolute fundamentals: instructions, memory maps, stacks, addressing modes, pipelines, timing, and how tiny architectural choices shape the whole system.

My goal wasn’t to recreate a real architecture. I wanted something small, believable, and programmable. This blog outlines the most important concepts that I learned while building the emulator. Anyone who understands these concepts can use them to build their own emulator.

This is not a step-by-step guide for building an emulator! If you want a full step-by-step guide, plenty of excellent resources already exist.

Defining the Architecture

Before writing a single line of code, I needed to decide what this machine looked like. I wanted something simple enough to understand in one sitting, but complex enough to do real work (like recursion and 32-bit math).

The Specs:

Word Size: 32-bit (Data is processed in 32-bit chunks).
Address Space: 16-bit (It can address 65,536 locations, though I mapped 4KB).
Registers: 16 general-purpose registers (R0 through R15).
Stack: A dedicated space in memory for function calls.

I started by defining the CPU state in a header file.

class CPU {
private:
    uint32_t m_pc;          // Program Counter: Where are we?
    uint32_t m_sp;          // Stack Pointer: Where is the stack top?
    uint32_t m_regs[16]{};  // The 16 registers
    uint8_t m_flags;        // Status flags (Zero, Carry, Negative)
    Memory m_memory;        // The system RAM
    bool m_running;         // Is the CPU running?
    
    static constexpr uint8_t FLAG_Z = 1 << 0;
    static constexpr uint8_t FLAG_C = 1 << 1;
    static constexpr uint8_t FLAG_N = 1 << 2;
};

The structure is pretty self-explanatory. The only thing to note is the flag masks. CPUs internally hold a collection of flags that are set based on the results of executed instructions. For example, the zero flag is set to true if the result of an arithmetic operation was 0. Note that I said arithmetic operation, not just subtraction. The zero flag is set even if the result of adding two values overflows back to 0.

The reason I need those bit-shifted masks is that I’m storing all three flags in a single 8-bit integer. This practice is commonly referred to as bit packing. I could have just as easily stored the flags in 3 boolean variables, as the choice is up to the programmer. However, bit packing saves space. Booleans always consume 8 bits even though they technically store a 1-bit value (this arises because memory is byte-addressable, as we will see for ourselves shortly).

The Memory

The CPU class uses a memory class internally for handling read and write operations. The memory class is surprisingly simple.

class Memory {
public:
    Memory();
    [[nodiscard]] uint8_t read8(uint16_t addr) const;
    void write8(uint16_t addr, uint8_t data);
 
    [[nodiscard]] uint16_t read16(uint16_t addr) const;
    void write16(uint16_t addr, uint16_t data);
 
    [[nodiscard]] uint32_t read32(uint16_t addr) const;
    void write32(uint16_t addr, uint32_t data);
 
private:
    uint8_t m_cells[65536]{};
};

It defines an array of 65,536 memory cells, with each one being 8 bits wide. That gives our memory a size of exactly 64KB. While the registers can hold 32-bit values, allowing for 4GB of addressable space, I am artificially limiting the address bus to 16 bits to keep the memory footprint small.

The memory class also defines three variations of read and write functions:

read8/write8
read16/write16
read32/write32

Since the memory is 8 bits wide (byte-addressable), we need the capability to read and write 8-, 16-, and 32-bit values. To operate efficiently, the CPU must be able to combine consecutive bytes from memory to form larger data types. The architectural width determines the register and bus sizes, but not the smallest addressable unit in memory.

Memory Map

With the memory class defined, we need to partition it into regions and assign their specific purposes. Unlike modern systems that distinguish between volatile memory and persistent storage, our CPU operates on a single contiguous block of memory for everything. This includes storing the program code, holding general-purpose data (RAM), and hosting the stack.

We can define these regions arbitrarily to suit our needs. I chose a layout consisting of 4KB of ROM (where the program is initially loaded and executed), 56KB of RAM, and 4KB of stack memory. These are defined simply by their start and end addresses.

namespace MemMap {
    constexpr uint32_t ROM_START   = 0x00000000;
    constexpr uint32_t ROM_END     = 0x00000FFF; // 4KB
    constexpr uint32_t RAM_START   = 0x00001000;
    constexpr uint32_t RAM_END     = 0x0000EFFF; // 56KB
    constexpr uint32_t STACK_START = 0x0000F000;
    constexpr uint32_t STACK_END   = 0x0000FFFF; // 4KB
}

The Instruction Set

With our CPU and memory defined, we need to define an Instruction Set Architecture (ISA). An instruction set is effectively the list of opcodes (commands) and operands (arguments) that the processor supports. It is the only language our CPU actually understands.

CPUs are incapable of understanding modern high-level programming languages directly. These languages must be compiled into machine code - raw binary bytes that the CPU decodes and executes. This binary code is often represented in Assembly, a human-readable format that maps directly to those raw bytes. We will look into the details of building an assembler (a tool to convert text to bytes) in later sections.

Assembly is the final stage of the program before it is rendered unreadable to the average human (although someone experienced enough can still decipher the raw hex bytes).

For the sake of simplicity, let’s start with a bare-bones instruction set that allows us to load a value into a register, perform some math on it, and move it to another register.

MNEMONIC	OPCODE	OPERANDS	Description
LOAD	0x10	R, imm	Load 32-bit immediate value into register `R`.
MOV	0x11	R1, R2	Copy value from `R2` into `R1`.
ADD	0x30	R1, R2	Add `R2` to `R1` and store into `R1`.
SUB	0x31	R1, R2	Subtract `R2` from `R1` and store into `R1`.
HLT	0x01	-	Halt the CPU

The Fetch-Decode-Execute Cycle

Regardless of architecture, all CPUs operate on the Fetch-Decode-Execute cycle. It is exactly what it sounds like. We can implement these steps quite simply in our emulated CPU.

Fetch

This is the easiest step. We simply read the value at the current m_pc (Program Counter) address in memory. This retrieves the opcode, which we pass to the decode step.

Note that we are using 32-bit alignment for everything - opcodes and operands are all stored as individual 32-bit values. While this wastes memory (e.g., a register index takes 4 full bytes), it keeps the decoding logic extremely simple.

uint32_t CPU::fetch() const {
    return m_memory.read32(m_pc);
}

Decode

In the decode step, we need to figure out what opcode we just read and then use that to fetch the necessary operands from the following memory addresses. To do this effectively, we first define a struct to represent a decoded instruction.

struct Instruction {
    uint32_t opcode;
    uint8_t reg1;       // First register operand
    uint8_t reg2;       // Second register operand
    uint16_t address;   // For memory operations or jumps
    uint32_t immediate; // For immediate values
    uint8_t length;     // Total instruction size in bytes
};

Now, we need to populate this struct with the actual data living in memory. We use a switch statement to examine the opcode we retrieved during the fetch stage.

Instruction CPU::decode(uint32_t opcode) {
    Instruction instruction{};
    instruction.opcode = opcode;
 
    // Helper to ensure register indices don't segfault (0-15)
    auto safe_reg = [&](const uint16_t offset) {
        return m_memory.read32(m_pc + offset) & 0x0F;
    };
 
    switch (opcode) {
        case 0x10: { // LOAD R, imm
            instruction.reg1 = safe_reg(4);                    // Offset 4 bytes
            instruction.immediate = m_memory.read32(m_pc + 8); // Offset 8 bytes
            instruction.length = 12;
            break;
        }
        case 0x11: { // MOV R1, R2
            instruction.reg1 = safe_reg(4);
            instruction.reg2 = safe_reg(8);
            instruction.length = 12;
            break;
        }
        case 0x30: // ADD R1, R2
        case 0x31: { // SUB R1, R2
            instruction.reg1 = safe_reg(4);
            instruction.reg2 = safe_reg(8);
            instruction.length = 12;
            break;
        }
        case 0x01: { // HLT
            instruction.length = 4; // Consume the opcode itself
            break;
        }
        default: {
            std::cerr << "\n[ERROR] Unknown opcode: 0x"
                      << std::hex << std::setw(2) << std::setfill('0') << opcode << "\n";
            instruction.opcode = 0x01; // Treat as HLT
            instruction.length = 4;    // Consume the bad word
        }
    }
    return instruction;
}

We start by initializing a new instruction and setting the opcode. For LOAD, the opcode is 0x10 (as defined in our instruction set). Since we know the first operand is a register and the second is an immediate value, we access them by peeking ahead in memory relative to the program counter (m_pc).

Keep in mind that m_pc currently points to the opcode we just fetched. Naturally, the operands follow the opcode in memory, so we access them by adding offsets to the current address.

Address:   [PC]       [PC+4]      [PC+8]
Data:      [OPCODE]   [OPERAND1]  [OPERAND2]
Size:      --4bytes----4bytes------4bytes---

Because our emulated architecture aligns everything to 32 bits, all data in memory is stored in 4-byte chunks. That is why we offset our reads by 4 and 8 bytes respectively.

The safe_reg lambda function ensures we do not try to access a register index that doesn’t exist. It masks the read value to ensure the index stays within the valid range (0-15).

Finally, the instruction.length field stores the total size of the instruction in bytes. When we execute this instruction later, the program counter will be advanced by this amount so it lands correctly on the next instruction’s opcode. Note that if two opcodes share the same operand format (like ADD and SUB), we can use a single switch case to populate the instruction struct for both.

Execute

Since we already have the decoded instruction, this stage is surprisingly simple. We just read the struct and perform the required operation.

svoid CPU::execute(const Instruction& instruction) {
    switch (instruction.opcode) {
        case 0x10: // LOAD
            m_regs[instruction.reg1] = instruction.immediate;
            break;
 
        case 0x11: // MOV
            m_regs[instruction.reg1] = m_regs[instruction.reg2];
            break;
 
        case 0x30: { // ADD
            const uint32_t a = m_regs[instruction.reg1];
            const uint32_t b = m_regs[instruction.reg2];
            const uint32_t result = a + b;
 
            m_regs[instruction.reg1] = result;
            set_flag(Flag::ZERO, result == 0);
            set_flag(Flag::NEGATIVE, result & 0x80000000);
            set_flag(Flag::CARRY, result < a);
            break;
        }
 
        case 0x31: { // SUB
            const uint32_t a = m_regs[instruction.reg1];
            const uint32_t b = m_regs[instruction.reg2];
            const uint32_t result = a - b;
            m_regs[instruction.reg1] = result;
 
            set_flag(Flag::ZERO, result == 0);
            set_flag(Flag::NEGATIVE, result & 0x80000000);
            set_flag(Flag::CARRY, a < b);
            break;
        }
 
        case 0x01: // HLT
            m_running = false;
            break;
 
        default: break;
    }
}

LOAD and MOV are simple enough. We operate on the first register using the immediate value (for LOAD) or the value of the second register (for MOV).

ADD and SUB are also straightforward, except they set flags based on the result. The CARRY flag is set if an addition overflows or if a subtraction underflows (requires a borrow).

The NEGATIVE flag is interesting. We cannot simply perform a result < 0 check because our registers store uint32_t (unsigned integers). The CPU has no inherent concept of negative numbers; it only sees raw bits. Negative numbers are represented using Two’s Complement, where the leftmost bit acts as the sign bit. If the leftmost bit is 1, we interpret the number as negative.

To check this, we perform a bitwise AND operation with a mask that isolates the leftmost bit (0x80000000). If the result is non-zero, that bit is set, and we raise the negative flag.

MSB Flag Check Visualization:Suppose we have result = 0xF2345678 (binary: 1111 0010 0011 0100 0101 0110 0111 1000):
Bit:  31                                     0
       |-------------------------------------|
Value: 1  1  1  1  0  0  1  0  0  0 ... 1  0  0  0Mask 0x80000000:
Bit:  31                                     0
       |-------------------------------------|
Mask: 1  0  0  0  0  0  0  0  0  0 ... 0  0  0  0Bitwise AND (&) result:
Result & Mask: 1  0  0  0  0  0  0  0  0  0 ... 0  0  0  0  → 0x80000000 (non-zero)Because the MSB is 1, the result of the AND is non-zero → negative flag is set.

We do not need to handle the default case here because the decode step handles validation by converting unknown opcodes to HLT. It is effectively impossible for the default statement to be triggered in the execute step. However, we keep the default case to prevent the compiler from complaining about unhandled enum values.

Marching On

With the Fetch-Decode-Execute cycle implemented, we just need a function to orchestrate these steps in order.

void CPU::step() {
    const uint32_t opcode = fetch();
    const Instruction instruction = decode(opcode);
    execute(instruction);
    
    // instruction.length is already in bytes (e.g., 12 bytes)
    m_pc += instruction.length; 
}

Finally, we create a function to drive the CPU, stepping through the program continuously as long as the CPU remains active.

void CPU::run() {
    while (m_running) {
        step();
    }
}

With that, our CPU is functional. We can hardcode a small program into the beginning of memory and watch it run. Obviously, this CPU is far from executing complex software because we still lack control flow (branching and looping). But not for long!

Control Flow

Up until now, our emulator has been just a glorified calculator - one that is also extremely hard to use because instructions need to be defined in raw bytes. To have any hope of running real programs, we need a way to run conditional loops.

This is actually quite easy because we handled half of the challenge when we implemented the flags. We just need a way to move the Program Counter around arbitrarily. This is accomplished via Jump opcodes.

MNEMONIC	OPCODE	OPERANDS	Description
JMP	0x40	addr	Unconditional jump to memory address `addr`.
JZ	0x41	addr	Jump to `addr` if `ZERO` flag is set.
JNZ	0x42	addr	Jump to `addr` if `ZERO` flag is not set.

These opcodes should be enough for the time being to write some simple programs. Obviously, other jump opcodes exist for the other flags, but I haven’t included them in the table since they are trivial to implement, and I do not wish to dump the entire source code into this blog.

The decode function gets the following cases added:

case 0x40: // JMP
case 0x41: // JZ
case 0x42: { // JNZ
    instruction.address = m_memory.read32(m_pc + 4);
    // Opcode (4 bytes) + Address (4 bytes) = 8 bytes total
    instruction.length = 8; 
    break;
}

So far, the step function automatically increments the program counter by the instruction’s length after execution. However, if we execute a jump instruction, our program counter is manually set to a specific address. In that case, we do not want to increment the counter again, as it is already exactly where we want it to be.

This demands a small change to the execute function. It now returns a boolean indicating whether a jump was performed.

bool CPU::execute(const Instruction& instruction) {
    bool jumped = false;
 
    switch(instruction.opcode){
        // ... (other opcodes)
 
        case 0x40: // JMP
            m_pc = instruction.address;
            jumped = true;
            break;
 
        case 0x41: // JZ
            if (get_flag(Flag::ZERO)) {
                m_pc = instruction.address;
                jumped = true;
            }
            break;
 
        case 0x42: // JNZ
            if (!get_flag(Flag::ZERO)) {
                m_pc = instruction.address;
                jumped = true;
            }
            break;
            
        default: break;
    }
    
    return jumped;
}

When executed, the jump instructions set the jumped boolean to true. We use this in the step function to determine whether to skip the automatic PC increment.

The modified step function now looks like this:

void CPU::step() {
    const uint32_t opcode = fetch();
    const Instruction instruction = decode(opcode);
    const bool jumped = execute(instruction);
    
    // Only increment PC if we didn't just jump manually
    if (!jumped) {
        m_pc += instruction.length;
    }
}

And with that, our emulator is ready to run programs with control flow. Of course, more opcodes are necessary for a complete system, but they can be added rapidly using the framework we have built. They simply need to be added as switch cases to the decode and execute functions.

A Simple ROM

With everything set up and ready to go, we can provide the emulator with a program (ROM) to execute. Since we do not yet have an assembler to generate binary files from text, we can temporarily hardcode instructions directly into an array.

Because our architecture is 32-bit aligned, every element in this array represents a 4-byte chunk of memory (Opcode, Operand, or Immediate).

const uint32_t rom[] = {
    // 1. LOAD R0, 3 (Length: 12 bytes / 3 words)
    0x10,           // Opcode (LOAD)
    0x00,           // Operand 1 (Register Index 0)
    0x03,           // Operand 2 (Immediate Value 3)
 
    // 2. LOAD R1, 5 (Length: 12 bytes / 3 words)
    0x10,           // Opcode (LOAD)
    0x01,           // Operand 1 (Register Index 1)
    0x05,           // Operand 2 (Immediate Value 5)
 
    // 3. ADD R0, R1 (Length: 12 bytes / 3 words)
    0x30,           // Opcode (ADD)
    0x00,           // Operand 1 (Register Index 0 / Destination)
    0x01,           // Operand 2 (Register Index 1 / Source)
 
    // 4. HLT (Length: 4 bytes / 1 word)
    0x01            // Opcode (HLT)
};
 
// Load ROM into Memory
uint32_t w_loc = MemMap::ROM_START;
for (const auto word : rom) {
    m_memory.write32(w_loc, word);
    w_loc += 4;
}

This is a simple program that adds two numbers together. We can also use jump commands if we wish, but hand-writing programs as raw hex arrays is unwieldy and prone to error.

We now need an assembler. This tool will allow us to write code in a text editor using readable mnemonics and convert it into the binary format that our emulator can load and execute.

Writing an Assembler

An assembler is, in principle, surprisingly simple. Its primary job is to take an assembly program written in plain text and convert it into a binary byte array (machine code). It also allows us to use high-level features like labels and variables, saving us from manually calculating memory offsets.

We can imagine the architecture of an assembler in three main phases:

Lexing/Tokenization: Breaking the raw text into meaningful chunks (tokens).
Parsing + Symbol Table Construction: Understanding the structure and tracking the addresses of labels and variables.
Code Generation: Outputting the final binary data.

Lexing/Tokenization

The first step of any compiler or assembler is to convert the source code (plain text) into a stream of tokens for further processing. In a way, we already did a primitive version of this in the emulator when we loaded raw bytes and interpreted them as instructions. However, an assembler must handle text, which is much messier.

Let’s look at what a typical assembly program looks like:

; Fibonacci Solver for the nth element
 
.data
n: .word 19          ; Compute the 20th Fibonacci number (since we count to 0)
 
.text
 
LD   R5, n           ; Load value from the address stored in n into R5
LOAD R1, 0           ; R1 = first  number (F(0))
LOAD R2, 1           ; R2 = second number (F(1))
 
loop:
    ADD  R3, R1      ; R3 = R3 + R1  → start sum
    ADD  R3, R2      ; R3 = R1 + R2  → Fibonacci step result
    MOV  R1, R2      ; shift: new first  = old second
    MOV  R2, R3      ; shift: new second = sum we just computed
    LOAD R3, 0       ; reset R3 for next iteration
    DEC  R5          ; decrement counter
    JNZ  loop        ; repeat until counter reaches zero
 
HLT                   ; halt CPU

Assembly programs generally consist of two distinct sections:

.data: Defines values and variables that are loaded into RAM before execution begins.
.text: Contains the actual executable instructions.

The text region supports labels (like loop:). These are incredibly useful because they allow us to use human-readable names for jump locations instead of calculating manual memory offsets. If we add lines of code later, the assembler automatically recalculates the label’s address, saving us from breaking all our jumps.

To process the assembly code, we iterate through it line by line:

Strip Comments & Whitespace: Split the line at spaces to get a list of tokens. Remove anything following a semicolon ;.
Handle Directives: If the first token is .data or .text, we switch our internal state to track which section we are currently parsing.
Parse Data Variables: If we are in the .data section and the first token ends with a colon :, we treat it as a variable declaration (e.g., n:).
Parse Labels: If we are in the .text section and the first token ends with a colon :, we treat it as a jump label.
Parse Instructions: If none of the above match, it is likely an instruction. We take the first token as the opcode and parse the subsequent tokens as operands.

At the end of this phase, we populate vectors containing our structured data:

std::vector<struct::Instruction> text;
std::vector<struct::Data> data;

These rely on the following data structures:

enum TokenType {
    Instruction,
    Register,
    Number,
    Address,
    Label,
    Identifier,
    Colon,
    Comma,
    EOL,
    EOF_
};
 
struct Token {
    TokenType type;
    std::string text;
    int value; // For parsed numbers or immediate values
};
 
struct Instruction {
    int size; // Size in bytes (to calculate addresses)
    std::vector<Token> tokens;
};
 
enum Directive {
    Byte,
    Half,
    Word,
    DWord,
    ASCII,
    ASCIZ,
    SPACE
};
 
struct Data {
    std::string name;
    Directive directive;
    int value;
};

Parsing & Symbol Table Construction

After the tokenization phase, we need to replace the labels in instructions with actual memory addresses. To do this, we need to generate a Symbol Table.

The process is simple. We loop through all tokens to find symbols (either variable names declared in the .data section or labels declared in the .text section).

Labels in the .text section are assigned an address based on where they appear. We achieve this by iterating through the Instruction vector (see the code above) and maintaining a generic current_address counter starting from 0x0. The counter is incremented by the size of each instruction.

For example, a label at the very start of the .text section receives address 0x0. If it appears after a LOAD instruction (which is 12 bytes long), the label receives address 0x0C (12 in hex).

The .data section in assembly is simply a list of values intended to be loaded into RAM. In our memory map, RAM starts at 0x00001000. Therefore, when the assembler encounters a variable, it assigns it an address starting from 0x1000 and increments sequentially based on the data size.

When these variables are referenced in the .text section, the assembler simply swaps the variable name for that calculated memory address. This is similar to C pointers: the variable name is just a label for the memory address where the actual value is stored.

However, handling these variables requires a small change in how we load files, which we will discuss in the Loading the compiled ROM section.

For now, the assembler will append the raw data values to the end of the compiled binary file. The emulator will be responsible for loading the code portion into ROM and the data portion into RAM. But this creates a problem:

How will the emulator know where the program ends and the data variables begin?

Quite simply, we introduce a Header at the top of the output binary. This header will act as a table of contents, storing the file offsets for the program start/end and the data start/end.

Code Generation

Code generation consists of three distinct phases:

Text Section: Converting instruction tokens into binary values.
Data Section: Collecting data variables to append to the end of the binary.
Header: Creating a file header so the emulator knows how to load the file.

The simplest approach is to internally maintain three vectors of 32-bit values (uint32_t). We populate these vectors with our data and then flush them to a binary file.

Text Section

We generate the .text section by iterating through the instructions (which have already had their labels resolved to addresses in the previous phase) and pushing their token values into the text vector.

void Generator::generateText(std::vector<struct::Instruction> &instructions) {
    for (const auto& [size, tokens] : instructions) {
        for (const auto& token : tokens) {
            text.push_back(token.value);
        }
    }
}

Data Section

For the .data section, we iterate through the symbol table. If we encounter a VariableSymbol (from the .data section), we push its value into the data vector.

void Generator::generateData(const std::vector<Symbol> &symbols) {
    for (const auto& symbol: symbols) {
        if (symbol.type == VariableSymbol) {
            data.push_back(symbol.value);
        }
    }
}

File header

Finally, we need a header to tell our emulator how to parse the ROM file. We need to store the sizes and offsets of the different sections. For simplicity, we store these values as 32-bit word counts, not byte offsets.

void Generator::generateHeader() {
    header.push_back(0x00); // entry point
    header.push_back(0x00); // ROM START
    header.push_back(text.size()); // ROM SIZE
    header.push_back(0x00 + text.size()); // DATA START
    header.push_back(data.size()); // DATA SIZE
}

Outputting the binary

With the vectors populated, we can write them into a binary file. Since std::ofstream expects bytes, we use a helper function write_section (not shown here) that casts our uint32_t data into char* for writing.

void Generator::outputFile() const {
    std::ofstream outfile(this->filename, std::ios::binary);
 
    write_section(outfile, header);
    write_section(outfile, text);
    write_section(outfile, data);
 
    std::cout << "Generated ROM saved to " << this->filename << std::endl;
}

Loading the Compiled ROM

Now that we finally have a way to compile programs for the CPU, we need a way to load them. Previously, we used a hardcoded array to test the emulator. We can now replace that with a function designed to load binary files generated by our assembler.

void CPU::loadROM(const std::string& filename) {
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (!file) {
        std::cerr << "Failed to open ROM file: " << filename << std::endl;
        exit(1);
    }
 
    const std::streamsize filesize = file.tellg();
    file.seekg(0, std::ios::beg);
 
    // Read entire ROM into a uint32_t buffer
    // Note: This assumes the binary file has the same Endianness as the host machine
    const size_t wordCount = filesize / sizeof(uint32_t);
    std::vector<uint32_t> rom(wordCount);
 
    if (!file.read(reinterpret_cast<char*>(rom.data()), filesize)) {
        std::cerr << "Failed to read ROM: " << filename << std::endl;
        exit(1);
    }
 
    // --- Parse ROM header (First 5 words) ---
    uint32_t entryPoint = rom[0];
    uint32_t romStart   = rom[1]; // Usually 0
    uint32_t textSize   = rom[2]; // Number of words in text section
    uint32_t dataStart  = rom[3] + 5; // Offset + 5 words to skip the header
    uint32_t dataSize   = rom[4]; // Number of words in data section
 
 
    // --- Load TEXT section into memory ---
    uint32_t memAddr = romStart;
    uint32_t romIndex = 5; // Start after the header
 
    for (uint32_t i = 0; i < textSize; i++) {
        uint32_t word = rom[romIndex++];
        m_memory.write32(memAddr, word);
        memAddr += 4; // Advance 4 bytes in CPU memory
    }
 
    // --- Load DATA section into RAM (0x1000 onwards) ---
    uint32_t dataMemAddr = MemMap::RAM_START;
 
    for (uint32_t i = 0; i < dataSize; i++) {
        uint32_t word = rom[dataStart + i];
        m_memory.write32(dataMemAddr, word);
        dataMemAddr += 4; // Advance 4 bytes in CPU memory
    }
 
    // --- Set program counter to entry point ---
    m_pc = entryPoint;
 
    std::cout << "Loaded ROM into memory. Entry point = 0x"
              << std::hex << entryPoint << std::dec << std::endl;
}

The loading process works by first reading the File Header (the first 5 words / 20 bytes of the file). Based on the header information, we load the subsequent instructions into ROM starting at 0x00000000. Next, we calculate the offset for the .data section and load those values into RAM starting at 0x00001000.

The crucial detail here is the use of memAddr += 4 and dataMemAddr += 4. Because our memory is byte-addressable (each address holds 1 byte) but we are writing 32-bit values (4 bytes), we must increment the memory address by 4 for every write to prevent overwriting the data we just stored.

Bonus: The Stack & Function Calls

We have implemented jumping, which allows us to loop and branch, but we are still missing a crucial component of modern programming: Functions.

When you call a function, the CPU needs to jump to that code, execute it, and then magically know exactly where to return to when the function finishes. Since a function might be called from many different places in a program, we can’t hardcode a return address.

This is where the Stack comes in.

The Stack is a specific region in memory (which we defined earlier as 0xF000 to 0xFFFF) used for short-term storage. It works on a LIFO (Last In, First Out) basis. We use the Stack Pointer register (m_sp) to keep track of the “top” of this pile.

To implement functions, we need two new instructions:

CALL: Pushes the address of the next instruction onto the stack, then jumps to the function.
RET: Pops the address off the stack and jumps back to it.

The Stack Layout

Before we implement the instructions, we need to understand how the stack resides in memory. In our MemMap, we defined the stack region from 0xF000 to 0xFFFF.

Unlike our program code, which starts at 0x0000 and grows upwards (increasing addresses), the stack starts at the very end of memory (0xFFFF) and grows downwards towards 0xF000.

Initialization: We set m_sp to the very end of the memory block (0xFFFF).
Pushing (CALL): We decrement the pointer, then write to that lower address.
Popping (RET): We read from the current address, then increment the pointer back up.

Memory Address      Content
  0xF000            [ Limit of Stack ]
   ...                   ...
  0xFFF4            [ Future Data ]
  0xFFF8            [ Future Data ]
  0xFFFC            [ Return Address ]  <-- New m_sp after CALL
  0xFFFF            [ Start of Stack ]  <-- Initial m_sp

Updating the Decoder

We need to register these opcodes in our decode function.

CALL takes an address operand, so it is 2 words (8 bytes) long.
RET takes no operands, so it is 1 word (4 bytes) long.

case 0x72: { // CALL
    instruction.address = m_memory.read32(m_pc + 4);
    instruction.length = 8; // Opcode + Address
    break;
}
case 0x73: { // RET
    instruction.length = 4; // Opcode only
    break;
}

Executing the Call

This is where the magic happens. By convention, stacks usually “grow downwards” in memory (from high addresses to low addresses).

When we execute CALL, we calculate the return address. Since the current CALL instruction is 8 bytes long, the next instruction is at m_pc + 8. We write this value to the stack and decrement the stack pointer.

When we execute RET, we simply read that value back and force the m_pc to go there.

case 0x72: // CALL
    // 1. Decrement Stack Pointer (Grow downwards)
    m_sp -= 4;
    
    // 2. Push the Return Address to the Stack
    // We want to return to the instruction *after* this CALL, which is 8 bytes ahead.
    m_memory.write32(m_sp, m_pc + 8);
    
    // 3. Jump to the function
    m_pc = instruction.address;
    jumped = true;
    break;
 
case 0x73: // RET
    // 1. Pop the return address from the Stack
    m_pc = m_memory.read32(m_sp);
    
    // 2. Increment Stack Pointer (Shrink upwards)
    m_sp += 4;
    
    jumped = true;
    break;

Note: For this to work, make sure your CPU constructor initializes m_sp to the very end of your mapped stack memory (0xFFFF), so it has space to grow downwards.

With these two instructions, our CPU now supports recursion and reusable subroutines - the backbone of structured programming!

Testing it out

To verify that our stack pointer logic works, we can write a simple program that initializes a value, calls a subroutine to modify it, and returns.

; Function Call Test
; Expected Result: R0 should contain 10 (<block>0xA</block>) at the end
 
.text
    LOAD R0, 5        ; R0 = 5
    CALL add_five     ; Push return addr to stack, Jump to 'add_five'
    HLT               ; [Return Point] We land here after RET
 
add_five:
    LOAD R1, 5        ; R1 = 5
    ADD  R0, R1       ; R0 = R0 + R1 (5 + 5 = 10)
    RET               ; Pop return addr from stack, Jump back to HLT

The Flow of Execution:

Setup: We load 5 into Register 0.
The Call: The CALL instruction pushes the address of the HLT instruction (the point of return) onto the stack at 0xFFFC and sets the Stack Pointer to that location. It then sets the Program Counter to the address of add_five.
The Subroutine: The CPU executes the addition, changing Register 0 to 10.
The Return: The RET instruction reads the address stored at 0xFFFC (which is the address of HLT), loads it into the Program Counter, and increments the Stack Pointer back to its starting state (0xFFFF).
The End: The CPU is now back at the HLT instruction and terminates execution gracefully.

Talking to the World: Memory Mapped I/O

Right now, our CPU is a black box. To see the result of a calculation, we have to pause the emulator and inspect the raw memory arrays. A real computer needs a way to send output to a screen or read input from a keyboard.

We could add specific instructions like PRINT or READ, but that bloats the instruction set. Instead, most architectures use Memory Mapped I/O (MMIO).

The concept is simple: We designate a specific memory address to act as a bridge to the outside world. When the CPU writes to this address, instead of storing the value in RAM, the emulator intercepts it and performs an action (like printing to the console).

For example, let’s map the address 0xFF00 to std::cout.

We simply modify our Memory::write32 function:

void Memory::write32(uint32_t addr, uint32_t data) {
    // Check for MMIO address
    if (addr == 0xFF00) {
        std::cout << static_cast<char>(data);
        return; 
    }
 
    // Normal memory write logic
    if (addr < 65536) {
        // ... write to m_cells ...
    }
}

Now, from the assembly side, “printing” is just a standard write operation. If we want to print the letter ‘H’, we just write the ASCII value 72 to address 0xFF00.

.data
OUT_PORT: .word 0xFF00
 
.text
LOAD R0, 72          ; ASCII for 'H'
LD R1, <block>OUT_PORT</block>    ; Load the address of our output port
STORE R1, R0         ; Write 'H' to <block>0xFF00</block> -> appears on screen!

By mapping different addresses, we can control anything. 0xFF01 could be keyboard input, 0xFF02 could clear the screen, or 0xFF03 could even play a beep sound.

Conclusion

It is easy to take modern computing for granted. We spend our days interacting with sleek graphical interfaces, high-level languages like Python or JavaScript, and operating systems that manage terabytes of data without us lifting a finger. But building this emulator forces us to tear back those layers of abstraction and look the machine in the eye.

By defining our own Instruction Set Architecture, implementing the Fetch-Decode-Execute cycle, and managing the Stack manually, we’ve bridged the gap between abstract code and raw logic. We’ve seen that the “magic” of a function call is just a pointer moving in memory, and that a complex while loop is nothing more than a conditional jump.

While we may not have to write raw assembly in our day job, this knowledge is at least interesting, if not invaluable. Understanding how a CPU thinks changes the way we debug high-level code.

As always you can find the code on my Gitlab.

Emulator (Orca) - https://gitlab.com/sharmarohan/orca
Assembler (Harpoon) - https://gitlab.com/sharmarohan/harpoon

EOF