Instructions unclear

The Answer to the Great Question…Of Life, the Universe and Everything…Is…

Po-Tay-To

Contracts

In labour disputes there is a concept called the white strike or work-to-rule. This is where the workforce chooses to still go to work, still do their jobs, but only act within their minimal contractual obligations. This can take many forms but it’s usually things like: Not working on days off to cover illness, not answering phone calls if it’s not part of the job, stopping for lunch at precisely the designated time, etc. Needless to say this is infuriating for employers because they lose all the employee flexibility and free labour they were taking for granted.

Your processor is on a white strike permanently and you are its employer. If we want our processor to do exactly what we want we’re going to have to write an absolutely impeccable contract.

So we’re going to use mathematics.

Instructions

Ignoring absolutely all detail, an x86 processor roughly works as follows. It reads the instruction at the address pointed to by the instruction pointer, interprets it, and then performs the instruction. Once the instruction is complete, the instruction pointer is pointing at the next instruction and the processor does the same thing again.

That’s it as far as we’re concerned. The actual operation of the hardware is the domain of electronics which is a completely different funhouse.
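As a toy illustration of that loop, here is a sketch in Python. It is emphatically not an emulator: the `run` function and its `registers` dictionary are made up for this example, and it only understands the two encodings quoted in the examples further down (0xB9 for MOV CX with a 16-bit immediate, 0xCB for RETF, which the toy just treats as "stop").

```python
# Toy model of the read/interpret/perform cycle; not a real x86 emulator.
# Only two instructions are understood: MOV CX, imm16 (0xB9) and RETF (0xCB).

def run(memory: bytes, ip: int = 0) -> dict:
    """Execute bytes from `memory` starting at `ip` until RETF is reached."""
    registers = {"CX": 0}
    while True:
        opcode = memory[ip]  # read the byte at the instruction pointer
        if opcode == 0xB9:  # interpret: MOV CX, imm16
            registers["CX"] = int.from_bytes(memory[ip + 1:ip + 3], "little")
            ip += 3  # step past the opcode and its 2-byte immediate
        elif opcode == 0xCB:  # interpret: RETF; the toy simply stops here
            registers["IP"] = ip
            return registers
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at offset {ip}")

state = run(bytes([0xB9, 0x01, 0x00, 0xCB]))
print(state)  # {'CX': 1, 'IP': 3}
```

The only point being made is that the processor never sees "MOV" or "RETF", just bytes at wherever the instruction pointer happens to be.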

Instructions are a set of bytes that encode information on what operation to perform as well as any extra values required.

Some examples for flavour:

The instruction RETF is a far return operation. It is encoded as the byte 0xCB.

The instruction MOV CX, 0x1 is an operation that moves the immediate value 1 into the register CX. It is encoded as the bytes 0xB9 0x01 0x00.
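Note the order of the last two bytes: x86 stores multi-byte values least significant byte first, which is why the 16-bit value 1 comes out as 0x01 0x00. A small Python sketch of that, where `mov_cx` is a hypothetical helper and the opcode byte is simply the one quoted above:

```python
MOV_CX_IMM16 = 0xB9  # opcode byte for MOV CX, imm16, as quoted above

def mov_cx(value: int) -> bytes:
    """Encode MOV CX, value for 16-bit mode: opcode, then the immediate low byte first."""
    return bytes([MOV_CX_IMM16]) + value.to_bytes(2, "little")

print(mov_cx(0x1).hex(" "))     # b9 01 00
print(mov_cx(0x1234).hex(" "))  # b9 34 12
```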

Normally whatever tools you use to write code will be performing all the encoding needed to get from your favourite programming language down to something the processor can understand. However, because DwarfOS doesn’t have any tools like that yet the only thing that’s going to be doing any encoding is me.

It’s difficult to know how much detail to go into. All the information required for encoding instructions is in the Intel manuals, but it’s not exactly immediately accessible. So I could just say to go and look, but I’d like this journal to at least give some pointers. Plus it might stop someone, perhaps myself, going through all the reading again so I’ll give some kind of overview.

Parts of an instruction:

  • Instruction prefix(es)
  • Opcode
  • ModR/M
  • SIB
  • Displacement
  • Immediate

Instruction prefixes

Instruction prefixes can override particulars of an instruction’s operation, e.g., the segment register used, whether an operation uses 4 or 2 byte registers, etc. Most operations won’t require any prefixes but there can be up to 4 applied to any one instruction.
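To make that concrete, here is a hedged Python sketch of the operand-size prefix (0x66). In 16-bit mode, putting it in front of the MOV CX encoding from earlier switches the instruction to the 4 byte register ECX, and the immediate grows to 4 bytes to match. The helper names are invented for this example, and the segment override values are listed only for reference.

```python
OPERAND_SIZE_PREFIX = 0x66  # toggles between 2 and 4 byte operands
SEGMENT_OVERRIDE = {"ES": 0x26, "CS": 0x2E, "SS": 0x36, "DS": 0x3E}  # for reference only

def mov_cx_imm16(value: int) -> bytes:
    """MOV CX, imm16 as encoded in 16-bit mode."""
    return bytes([0xB9]) + value.to_bytes(2, "little")

def mov_ecx_imm32_from_16bit_mode(value: int) -> bytes:
    """Same opcode with the operand-size prefix: MOV ECX, imm32 in 16-bit mode."""
    return bytes([OPERAND_SIZE_PREFIX, 0xB9]) + value.to_bytes(4, "little")

print(mov_cx_imm16(1).hex(" "))                   # b9 01 00
print(mov_ecx_imm32_from_16bit_mode(1).hex(" "))  # 66 b9 01 00 00 00
```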

Opcode

An operation code (opcode) is 1 to 3 bytes long. Sometimes the opcode also uses an extra 3 bits that get encoded in the “Reg” part of the ModR/M byte.
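A sketch of that second sentence, in Python, using an example that is not from the text above: in the manual NOT r/m16 is listed as F7 /2 and NEG r/m16 as F7 /3, so both share the byte 0xF7 and are told apart only by the 3 bits placed in the Reg field. The ModR/M layout used here is described in its own section below, and `encode_f7_on_ax` is a hypothetical helper.

```python
# Both instructions share the opcode byte 0xF7; the /digit picks which one.
F7_EXTENSION = {"NOT": 2, "NEG": 3}

def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModR/M fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def encode_f7_on_ax(mnemonic: str) -> bytes:
    """Encode NOT AX or NEG AX: Mod 11b (register), Reg = opcode extension, R/M = AX (0)."""
    return bytes([0xF7, modrm(0b11, F7_EXTENSION[mnemonic], 0)])

print(encode_f7_on_ax("NOT").hex(" "))  # f7 d0
print(encode_f7_on_ax("NEG").hex(" "))  # f7 d8
```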

Here’s an example instruction from the manual:

Opcode   Instruction       Op/En   Description
88 /r    MOV r/m8, r8      MR      Move r8 to r/m8
89 /r    MOV r/m16, r16    MR      Move r16 to r/m16

Depending on whether you’ve done some assembly programming, that may or may not make a little sense to you. I’ve missed out a couple of columns for brevity.

The “Instruction” column shows how the instruction would look in Intel syntax assembly language and the “Description” column tells you what it does. The “Instruction” column gives an operator (MOV) followed by the format for any operands (values/arguments for the operation).

Further up in the manual there’s a section detailing what the symbols in the columns mean.

In the “Instruction” and “Description” columns you have these:

  • r8 is a byte sized general purpose register, e.g., AL
  • r16 is a word sized general purpose register, e.g., AX
  • r/m8 is either a byte sized general purpose register or a byte from memory
  • r/m16 is either a word sized general purpose register or a word from memory

So when you see MOV AX, BX in assembly, and it is encoded with the 89 /r form above, AX is the r/m16 part and BX is the r16 part.

The “Op/En” is a key into a table which tells you about where the operands listed in the “Instruction” column are in the final encoded instruction.

The big one for this section however is the “Opcode” column. Here you get the hex value of the operation code and some symbols on encoding. Here the /r symbol means that the hex value given is followed by a ModR/M byte which encodes the operands part of the instruction.

Here’s a different MOV instruction:

Opcode       Instruction     Op/En   Description
b0 +rb ib    MOV r8, imm8    OI      Move imm8 to r8

This moves an immediate byte value into a byte sized general purpose register, e.g., MOV AL, 0x32.

If we look at the opcode here, the ib part should be self-explanatory: it’s a byte following the opcode byte that will contain our immediate value. The +rb, however, is something we haven’t seen before. It means a 3-bit indicator for the register needs to be added to the 0xb0 byte to form the opcode, because there is no ModR/M byte. For example, the byte sized general purpose register DH is indicated by 110b, so to use this register we add 6 to 0xb0 for the final opcode 0xb6. This is shorthand in the manual; the encoded opcode always points to a specific instruction.
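The +rb arithmetic can be sketched in a few lines of Python. The register numbers are the standard ones for the byte sized registers; the function name `encode_mov_r8_imm8` is just something made up for this example.

```python
# 3-bit register indicators for the byte sized general purpose registers.
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}

def encode_mov_r8_imm8(reg: str, value: int) -> bytes:
    """Encode MOV r8, imm8 (b0 +rb ib): register folded into the opcode, then one immediate byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("imm8 must fit in one byte")
    return bytes([0xB0 + REG8[reg], value])

print(encode_mov_r8_imm8("AL", 0x32).hex(" "))  # b0 32
print(encode_mov_r8_imm8("DH", 0x32).hex(" "))  # b6 32
```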

ModR/M

This, at least for me, is the most complicated part of the instruction encoding.

It’s a single byte with the form:

|7 .. Mod .. 6| |5 .. Reg/Opcode .. 3| |2 .. R/M .. 0|

Mod switches between 4 different groups of addressing.

R/M determines which of the 8 addressing modes in the Mod group is being used.

Reg/Opcode determines which register is being used, or it can be filled with 3 bits of opcode.
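In code terms the byte is just three bit fields shifted into place. A minimal Python sketch, with `pack_modrm` and `unpack_modrm` as hypothetical helper names:

```python
def pack_modrm(mod: int, reg: int, rm: int) -> int:
    """Build a ModR/M byte: Mod in bits 7-6, Reg/Opcode in bits 5-3, R/M in bits 2-0."""
    if not (0 <= mod <= 0b11 and 0 <= reg <= 0b111 and 0 <= rm <= 0b111):
        raise ValueError("field out of range")
    return (mod << 6) | (reg << 3) | rm

def unpack_modrm(byte: int) -> tuple:
    """Split a ModR/M byte back into (mod, reg, rm)."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

print(hex(pack_modrm(0b11, 0b011, 0b000)))  # 0xd8
print(unpack_modrm(0xD8))                   # (3, 3, 0)
```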

The order of source, destination is determined by the encoding specified in the “Instruction Operand Encoding” referenced by the “Op/En” column mentioned earlier.

Here’s an example:

Op/En   Operand 1        Operand 2
RM      ModRM:reg (w)    ModRM:r/m (r)

Which means the source is encoded in the r/m part and the destination is encoded in the reg part; the (w) and (r) mark the operand as written or read. There is only ever one ModR/M byte; you don’t use one per operand.
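To see the operand order matter, here is a Python sketch that encodes MOV AX, BX both ways round. The 89 /r form is the MR entry from the table earlier; 8B /r (MOV r16, r/m16) is the RM counterpart from the manual's MOV entry rather than anything quoted above. The helper names are invented for the example.

```python
# 3-bit indicators for the word sized general purpose registers.
REG16 = {"AX": 0, "CX": 1, "DX": 2, "BX": 3, "SP": 4, "BP": 5, "SI": 6, "DI": 7}

def modrm(mod: int, reg: int, rm: int) -> int:
    """Pack the three ModR/M fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def mov_mr(dest: str, src: str) -> bytes:
    """89 /r, Op/En MR: destination goes in r/m, source goes in reg."""
    return bytes([0x89, modrm(0b11, REG16[src], REG16[dest])])

def mov_rm(dest: str, src: str) -> bytes:
    """8B /r, Op/En RM: destination goes in reg, source goes in r/m."""
    return bytes([0x8B, modrm(0b11, REG16[dest], REG16[src])])

print(mov_mr("AX", "BX").hex(" "))  # 89 d8
print(mov_rm("AX", "BX").hex(" "))  # 8b c3
```

Both byte pairs do the same job; which one you end up with depends only on which form of the instruction you pick.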

As for how you form the ModR/M byte: you could, if you really wanted, memorise things, like a Mod of 11b meaning general purpose registers, with the 3 bits of R/M then indicating the register. But I doubt that in a pinch you’re going to remember that 00001b (Mod 00b followed by R/M 001b) is the [BX + DI] memory address if you’re in 16-bit mode.

Happily Intel has provided the alluringly named “16-Bit Addressing Forms with the ModR/M Byte” table. Forming the ModR/M byte basically boils down to cross referencing your source/destination in the table and reading off the hex value. Peruse at your leisure, I’m not reproducing it here. Needless to say this has been printed off and is sat on my desk.
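For a flavour of what the cross referencing amounts to, here is a Python sketch covering only the Mod 00b group of the 16-bit table. It is a partial stand-in, not a replacement for the real thing: the disp16 entry needs displacement bytes that this sketch refuses to handle, and the function names are hypothetical.

```python
# R/M values 0..7 for Mod 00b in 16-bit addressing (one group of the manual's table).
RM16_MOD00 = ["[BX+SI]", "[BX+DI]", "[BP+SI]", "[BP+DI]", "[SI]", "[DI]", "disp16", "[BX]"]
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]

def modrm_mod00(reg: str, address_form: str) -> int:
    """Look up the ModR/M byte for a Mod 00b memory operand and a word sized register."""
    return (0b00 << 6) | (REG16.index(reg) << 3) | RM16_MOD00.index(address_form)

def encode_mov_mem_r16(address_form: str, reg: str) -> bytes:
    """Encode MOV r/m16, r16 (89 /r) with a Mod 00b memory destination."""
    if address_form == "disp16":
        raise ValueError("disp16 needs displacement bytes; not handled in this sketch")
    return bytes([0x89, modrm_mod00(reg, address_form)])

print(encode_mov_mem_r16("[BX+DI]", "AX").hex(" "))  # 89 01
print(encode_mov_mem_r16("[BX]", "CX").hex(" "))     # 89 0f
```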

SIB

The scale index base (SIB) byte is used with some more complex forms of addressing in 32-bit mode.

For now we’re operating in 16-bit mode so I’m going to ignore it. I’ll provide an explanation when it’s required.

Displacement

Some forms of addressing require a displacement value. It is not always present, but when it is it takes the form of a 1, 2 or 4 byte value.
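A hedged Python sketch of how such a value is laid out, assuming the usual x86 rules: bytes are stored least significant first, and a 1 byte displacement is sign-extended, so it can only hold -128 to 127. `encode_displacement` is a made-up helper name.

```python
def encode_displacement(value: int, size: int) -> bytes:
    """Encode a displacement as 1, 2 or 4 bytes, least significant byte first."""
    if size not in (1, 2, 4):
        raise ValueError("displacement must be 1, 2 or 4 bytes")
    if size == 1 and not -128 <= value <= 127:
        raise ValueError("a 1 byte displacement is sign-extended: -128..127 only")
    mask = (1 << (8 * size)) - 1  # keep only the low `size` bytes (two's complement)
    return (value & mask).to_bytes(size, "little")

print(encode_displacement(-2, 1).hex(" "))    # fe
print(encode_displacement(0x10, 2).hex(" "))  # 10 00
print(encode_displacement(-2, 2).hex(" "))    # fe ff
```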

Immediate

If the instruction specifies an immediate operand then it appears here. It is not always present, but when it is it takes the form of a 1, 2 or 4 byte value.

Still there?

A boring journal entry for a dry topic so if you made it this far then well done.

That’s probably enough information to start encoding instructions. I gave a bit more time to the Opcode encoding because there were a few things that tripped me up reading the manual. If you want to learn how to do the encoding then volume two of the developer’s manual is where you want to be. All the information is in there.

For now what I’ll do is, when I show some snippets of code, and that time is coming, I’ll give details on how the instruction was encoded. That, combined with the lovely ModR/M table, should make things clear.

I’m ignoring the existence of 32-bit and 64-bit modes for the time being. Their time will come, but it’s not now.

— Curufir