Sunday, November 29, 2009

Laying out VM vitals

teeterl's poor performance felt like a cold shower to me, and at an opportune moment I decided to take a closer look at why it was so bad. By then I had a scaffold implementation of a register-based teeterl VM. The new version was already churning out prime numbers in spadefuls without a hitch. But I decided to take two steps back and measure whether the decisions I had made were of any merit. It turned out that many of them did more harm than good.

For about a week I was rewriting my shiny new VM into a new application, cleverly named teepad, that did nothing but create random stack frames, touch various memory locations, allocate chunks of memory from the heap, and dispatch random instructions from a dozen garbage modules. It weighs heavily on a fellow's mind to write simulation software of this sort. When someone laboriously develops a sizable do-nothing system for a long time, thoughts start to creep in. I am lucky it did not take too long. Now I have speed readings of the garbage mincer and can make a few observations.

Do not reenter the VM's main function
As everybody knows, the main function of a VM is where the huge switch statement is. I was toying with the idea of reentering that routine on every Erlang function call. That way register values would instantly be back in existence after the return. But an Erlang process should yield from time to time, so I had to add a separate stack that was filled in when the process was yielding. Boy, it is probably the worst stack management scheme out there. teepad was able to dispatch 2.5 times more commands when such nastiness is avoided.
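To make it concrete, here is a minimal sketch of the flat-loop scheme that won; the frame type, opcode names and instruction layout are invented for illustration, this is not teeterl's actual code. An Erlang call pushes a frame onto an explicit stack and a return pops it, so the C function holding the big switch is never reentered:

#include <stdint.h>

/* A sketch only: frame_t, the opcodes and the instruction layout are
 * invented for illustration; this is not teeterl's actual code. */
typedef struct { uint32_t *ret_ip; } frame_t;

enum { OP_CALL, OP_RETURN, OP_WORK, OP_HALT };

void run(uint32_t *code)
{
    frame_t stack[1024];           /* the VM's own call stack */
    frame_t *sp = stack;
    uint32_t *ip = code;

    for (;;) {
        uint32_t instr = *ip++;
        switch (instr & 0xff) {    /* low byte holds the opcode */
        case OP_CALL:
            sp->ret_ip = ip;       /* push a frame instead of calling run() again */
            sp++;
            ip = code + (instr >> 8);
            break;
        case OP_RETURN:
            --sp;
            ip = sp->ret_ip;       /* pop the frame; no C-level return involved */
            break;
        case OP_WORK:
            /* register shuffling, arithmetic, etc. elided */
            break;
        case OP_HALT:
            return;
        }
    }
}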

Keep the number of opcodes moderate
With teepad it was possible to boost the number of distinct opcodes to a sizable figure, increasing the number of cases in the master switch statement accordingly. I thought it would be neat to have one opcode for, say, 'move r1, r2' and another opcode for 'move r2, r3'. I was quite wrong again. The teepad experiments showed that twice as many opcodes means 15% less throughput, and 5 times as many shaves 25% off performance.
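For illustration, here is roughly what that specialization looks like in the master switch; the register file, opcode names and operand encoding are made up, not teeterl's. The specialized cases save the operand fetch but multiply the number of cases:

#include <stdint.h>

/* A hedged illustration of the specialization I tried; regs[], the opcode
 * names and the operand layout are assumptions, not teeterl's encoding. */
enum { OP_MOVE, OP_MOVE_R1_R2, OP_MOVE_R2_R3 };

static uint32_t regs[32];

void dispatch_one(uint32_t instr)
{
    switch (instr & 0xff) {
    case OP_MOVE:                 /* one generic opcode: operands in the word */
        regs[(instr >> 8) & 0x1f] = regs[(instr >> 16) & 0x1f];
        break;
    case OP_MOVE_R1_R2:           /* specialized: one case per register pair */
        regs[1] = regs[2];
        break;
    case OP_MOVE_R2_R3:
        regs[2] = regs[3];
        break;
    }
}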

Packing/unpacking of arguments does not matter
In these days of multi-layered caches you may do any number of bitwise operations on a value already retrieved from memory. Take the highest 6 bits, mask out the 3 lower bits, add 7 to the result, do it on every dispatch, and you won't tax your performance noticeably. Bitwise operations help to reduce the memory footprint of a VM and thus encourage better cache behavior, which more than offsets the additional CPU cycles burnt on bit fiddling.
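Here is a sketch of the kind of unpacking the experiment exercised, assuming a made-up encoding where one 32-bit word carries four 8-bit arguments (the same 4-into-1 packing mentioned in the details below):

#include <stdint.h>

/* A sketch of the unpacking in question; the encoding of four 8-bit
 * arguments per 32-bit word is an assumption for illustration, not
 * teeterl's actual instruction format. */
typedef struct { uint8_t a, b, c, d; } args_t;

static inline args_t unpack_args(uint32_t word)
{
    args_t r;
    r.a = word & 0xff;             /* the shifts and masks are cheap... */
    r.b = (word >> 8)  & 0xff;
    r.c = (word >> 16) & 0xff;
    r.d = (word >> 24) & 0xff;
    return r;                      /* ...once the word is already in a register */
}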

More details for those who care

1. Instruction set generated from the types below:
Type       Prob     #variants  #args  Options
test       0.2      10         1-3    [branches]
enter      0.05     1          1-1    [reduce,jumps]
enterf     0.01     2          3-4    [reduce,jumps,intermodule]
call       0.02     1          1-1    [reduce,calls]
callf      0.01     2          3-4    [reduce,calls,intermodule]
bif        0.01     8          2-3    [reduce,heap]
move       0.3      10         0-2    []
make       0.1      10         1-3    [heap]
select     0.1      5          1-3    [heap]
compute    0.1      20         2-3    []
apply      0.01     2          1-2    [reduce,jumps]
'receive'  0.001    1          1-1    [yields,messages]
error      0.0001   1          1-1    [exits]

The implementation of each opcode is randomly generated using the Options column, e.g. the 'receive' instruction has 1 variant and appears once per 1000 instructions. Its implementation may include unpacking its arguments, touching the mailbox memory, and faking a context switch.
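To give a feel for it, here is a guess at what one generated 'receive' handler might look like; the process structure, field names and argument encoding are invented for this example and do not come from teepad:

#include <stdint.h>

#define MAILBOX_SIZE 64

/* A guess at the shape of one generated 'receive' handler; proc_t, its
 * fields and the argument encoding are invented for illustration only. */
typedef struct {
    int      reductions;
    uint32_t mailbox[MAILBOX_SIZE];
} proc_t;

/* Returns 1 when the process should yield, i.e. the faked context switch. */
static int do_receive(proc_t *proc, uint32_t instr)
{
    uint32_t slot = (instr >> 8) & 0xff;               /* unpack the argument */
    volatile uint32_t touch = proc->mailbox[slot % MAILBOX_SIZE];
    (void)touch;                                       /* touch mailbox memory */
    return --proc->reductions <= 0;
}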

2. Modules are generated from the instruction set using the probabilities of commands of a given type.

# of modules is 10.
Module size ranges between 1000 and 100000.
The number of reductions before a yield is, obviously, 1000.
Instruction arguments are packed (4 arguments into 1 word).

3. The generated program is compiled, run for 5 seconds, and the number of instructions dispatched is reported.
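The measurement step could look roughly like the sketch below; dispatch_chunk() is a stub standing in for the generated dispatcher, not teepad's real code:

#include <stdio.h>
#include <time.h>

/* A sketch of the measurement step only; dispatch_chunk() is a stub
 * invented for this example. */
static long dispatch_chunk(void)
{
    /* would run a batch of generated instructions and return the count */
    return 1000;
}

int main(void)
{
    long   dispatched = 0;
    time_t start = time(NULL);

    while (time(NULL) - start < 5)      /* run for roughly 5 seconds */
        dispatched += dispatch_chunk();

    printf("instructions dispatched: %ld\n", dispatched);
    return 0;
}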

The source code of teepad and raw measurement data are gladly provided upon an e-mail inquiry.
