Friday, December 25, 2009

teeterl fixes Erlang records

I think Erlang record syntax is neat: expressive, tight and consistent. A bit repetitive sometimes, yes. Still, there is a big problem with Erlang records: not with the syntax, but with record declarations shared between modules in .hrl files. This is fixed in teeterl.

teeterl now partially supports a new breed of records called 'named tuples'. Named tuples and records are very similar: at some point both are mapped to tagged tuples. The true difference is that named tuples are never declared. The inner structure of a named tuple is inferred from its use. No declarations, no .hrl files, no need to recompile.

Named tuples have a syntax separate from records, but that is not essential. Examples:

%% create a tuple named 'car' with fields 'model' and 'mpg'
Car1 = {car#model <- "Lexus", mpg <- 30},

%% create another tuple named 'car'; name is inferred
%% from use in the previous line
Car2 = {model <- "Honda"},

%% get index of the field in the tuple; needed for keyfind()
%% and the likes
Index = car#mpg,

%% access named tuple field; again name is inferred,
%% the alternative is Car1.car#mpg
Mpg = Car1.mpg,

Values:
Car1 = {car,"Lexus",30}
Car2 = {car,"Honda",undefined}
Index = 3
Mpg = 30

The trick is that field offsets of a named tuple are assigned during module loading, not during preprocessing or some other compilation step. Thus the layout of a given named tuple may differ between VMs.

Now if you need another field in a named tuple, just assign a value to the field. Send the extended tuple to the old code and you will not break anything. Modules are compiled in complete isolation, so there is no worry about record versions. Plus, information about fields is accessible at runtime. And all these goodies come with no performance penalty: named tuples work at exactly the same speed as conventional records.

EDIT: teeterl's source code is on GitHub

Sunday, November 29, 2009

New teeterl; 4x faster

Three months ago I decided to give teeterl a once-over. The goal was to make it faster, an order of magnitude faster. Today, the new teeterl version has become mature enough to run estone_SUITE and produce a meaningful performance measurement printout. The result: it is almost 4 times faster than the earlier teeterl. Not 10x, but definitely on the right track. BEAM is still a distant performance star, boasting more than 2 times more estones.

The new version of teeterl has a good bit of new machinery under the hood. The most noticeable changes are the new partial, generation-aware garbage collector and the fact that the teeterl VM is now register-based.

The new teeterl cannot do much more than run a few test suites at the moment. Nevertheless, I will release the source on GitHub as soon as I figure out how to push it into the existing 'teeterl' repo without destroying the previous version.

stdin/stdout in Erlang

Erlang blogs and message boards are littered with messages expressing dissatisfaction with Erlang syntax, the language's handling of strings, the verbosity of records, etc. Members of the discontented crowd usually stop short when specifics are requested. I will try to make my small rant about Erlang as specific as possible.

My grudge is about stdin/stdout. Why would modern software need these? The existence of stdin/stdout assumes that Erlang is started from a text-based shell. These days that is usually not the case. Why is io:format() pretty-printing crafted to output text on a screen of 80 columns? Erlang is all about non-blocking input/output, and yet there is io:get_line() getting text from nowhere. I am aware of Erlang's telecom heritage and the RS-232 interface. But this kind of thing definitely slows the adoption of Erlang in today's world of the World-Wide Web.

The new version of teeterl does not have an io module, as it has no concept of stdin/stdout. On the other hand, the new teeterl has had an embedded web server with comet functionality from day one. All input/output in the new teeterl is done through a socket and HTTP. Instead of io:format() there is a format:to_json() call, and all actual formatting, including pretty-printing, is done on the client side with jQuery/Javascript. Well, a printout may still appear on stderr, but only as an epitaph on the tomb of the init process.

Laying out VM vitals

teeterl's poor performance felt like a cold shower to me. At an opportune moment I decided to have a closer look at why it was so bad. By then I had a scaffold implementation of a register-based teeterl VM. The new version was already churning out prime numbers in spadefuls without a hitch. But I decided to take two steps back and measure whether the decisions I had made were of any merit. It turned out many of them did more harm than good.

For about a week I was rewriting my shiny new VM into a new application, cleverly named teepad, that did nothing but create random stack frames, touch various memory locations, allocate chunks of memory from the heap, and dispatch random instructions from a dozen garbage modules. It weighs heavily on a fellow's mind to write simulation software of this sort. When someone laboriously develops a sizable do-nothing system for a long time, thoughts start to creep in. I am lucky it did not take too long. Now I have speed readings of the garbage mincer and can make a few observations.

Do not reenter VM's main function
As everybody knows, the main function of a VM is where the huge switch statement is. I was toying with the idea of reentering the routine on every Erlang function call. Thus register values would instantly come back into existence after a return. But an Erlang process should yield from time to time, so I added a separate stack which was filled in when the process was yielding. Boy, it is probably the worst stack management scheme out there. teepad was able to dispatch 2.5 times more commands when such nastiness was avoided.

Keep the number of opcodes moderate
With teepad it was possible to boost the number of distinct opcodes to a sizable figure, increasing the number of cases in the master switch statement accordingly. I thought it would be neat to have an opcode for, say, 'move r1, r2' and another opcode for 'move r2, r3'. I was quite wrong again. teepad experiments showed that twice as many opcodes means 15% less throughput, and five times as many shaves 25% off performance.

Packing/unpacking of arguments does not matter
In these days of multi-layered caches you may do any number of bitwise operations on a value already retrieved from memory. Take the highest 6 bits, mask out the 3 lower bits, add 7 to the result, do it on every dispatch, and you won't tax your performance noticeably. Bitwise operations help to reduce the memory footprint of a VM and thus encourage better cache behavior, which more than offsets the additional CPU cycles burned on bit fiddling.

More details for those who care

1. Instruction set generated from the types below:

Type        Prob    #variants  #args  Options
test        0.2     10         1-3    [branches]
enter       0.05    1          1-1    [reduce,jumps]
enterf      0.01    2          3-4    [reduce,jumps,intermodule]
call        0.02    1          1-1    [reduce,calls]
callf       0.01    2          3-4    [reduce,calls,intermodule]
bif         0.01    8          2-3    [reduce,heap]
move        0.3     10         0-2    []
make        0.1     10         1-3    [heap]
select      0.1     5          1-3    [heap]
compute     0.1     20         2-3    []
apply       0.01    2          1-2    [reduce,jumps]
'receive'   0.001   1          1-1    [yields,messages]
error       0.0001  1          1-1    [exits]

The implementation of each opcode is randomly generated using the Options column; e.g. the 'receive' instruction has 1 variant and appears once per 1000 instructions. Its implementation may include unpacking its arguments, touching the mailbox memory and faking a context switch.

2. Modules are generated from the instruction set using the probabilities of commands of a given type:

# of modules is 10.
Module size ranges between 1000 and 100000.
Reductions before yield is, obviously, 1000.
Instruction arguments are packed (4 arguments into 1 word).

3. The generated program is compiled, run for 5 seconds, and the number of instructions dispatched is reported.

The source code of teepad and the raw measurement data are gladly provided upon e-mail inquiry.

Monday, July 27, 2009

apr_pool_t with counter

The lack of tracking of the total size of all allocations made from a given apr_pool_t is a big pain for teeterl. I had to introduce yet another layer of indirection by adding an xpool_t structure, which is no more than a counter and an apr_pool_t reference. Now, most of the time, the sizes of allocations are neatly added up to check when the threshold is reached and gc should kick in.
Recently I transgressed to the point of putting the size counting inside apr_pool_t itself, thus producing a custom version of APR. Yes, it did make teeterl a bit smaller and cleaner, and auxiliary allotments were counted better. On the other hand, the impure act of meddling with APR code made the installation of teeterl much longer and trickier. I still do not have a single person who has managed to build and run teeterl on his/her machine, and the recent change made that possibility even more remote. I am rolling everything back to xpool_t's.

Thursday, April 23, 2009

teeterl's performance; more than modest

Every compiler or runtime system should come with an all-encompassing testing suite. Not so for teeterl. There used to be no testing suite at all, and the only indication that the system was working was that it was capable of compiling itself and producing a still-working copy. Many aspects of the system were thus left out of the spotlight, such as binary construction and matching, as the compiler does not do much of either when producing code.

Then teeterl was released to the public, and I decided to clean it up a bit to expel the many bugs still haunting dark corners of the system. The OTP test_server and emulator test suites came in handy here.

Today teeterl runs many test suites from the OTP emulator testing set. Of course, many test suites turned out not to be relevant to teeterl, such as dynamic driver loading or port commands; these are implemented in teeterl in a very unorthodox way. Others, such as the binary or list manipulation tests, helped to weed out a few overlooked bugs.

One of the test suites was estone_SUITE. It is a performance test suite, and the rest of the post is devoted to its output for teeterl. I would not have published the results, but Richard Feynman's address on cargo cult science convinced me to do otherwise. So, here are all the results of the estone tests, even those that do not speak in teeterl's favor at all.


Run #1. teeterl (not optimized, meaning the -ggdb option to gcc)

**** CPU speed UNKNOWN MHz ****
**** Total time 106.752 seconds ****
**** ESTONES = 9612 ****

    Title                            Millis        Estone

list manipulation                    2522            602
small messages                       38336           80
medium messages                      38806           156
huge messages                        860             576
pattern matching                     1234            628
traverse                             1673            296
Work with large dataset              608             459
Work with large local dataset        441             633
Alloc and dealloc                    184             672
Bif dispatch                         244             3176
Binary handling                      2698            183
Generic server (with timeout)        16953           148
Small Integer arithmetics            750             372
Float arithmetics                    141             219
Function calls                       631             1228
Timers                               672             184

Run #2. teeterl (fully optimized, -O3 -fast options to gcc)

**** CPU speed UNKNOWN MHz ****
**** Total time 82.6701 seconds ****
**** ESTONES = 12654 ****

    Title                            Millis        Estone

list manipulation                    1826            831
small messages                       28284           109
medium messages                      28563           212
huge messages                        623             796
pattern matching                     937             827
traverse                             1061            467
Work with large dataset              607             459
Work with large local dataset        763             365
Alloc and dealloc                    127             972
Bif dispatch                         184             4210
Binary handling                      5901            84
Generic server (with timeout)        12354           203
Small Integer arithmetics            459             607
Float arithmetics                    100             309
Function calls                       398             1946
Timers                               481             257

Run #3. Erlang/OTP, emulator version 5.6.5

**** CPU speed UNKNOWN MHz ****
**** Total time 2.697552 seconds ****
**** ESTONES = 69058 ****

    Title                            Millis        Estone

list manipulation                    292             5204
small messages                       594             5218
medium messages                      1068            5687
huge messages                        315             1573
pattern matching                     104             7456
traverse                             243             2043
Work with large dataset              -490            -569
Work with large local dataset        -634            -439
Alloc and dealloc                    56              2227
Bif dispatch                         36              21695
Binary handling                      228             2171
Generic server (with timeout)        461             5447
Small Integer arithmetics            127             2192
Float arithmetics                    17              1820
Function calls                       118             6572
Timers                               163             761


It looks like teeterl is way slower than out-of-the-box Erlang/OTP, even with all optimizations enabled in the C compiler. The results are especially startling for message passing: teeterl takes 50x more time to pass messages. For the other tests the figure is closer to 2-3x. Something is very, very wrong with teeterl's message passing. One possibility is that a process yields to the scheduler every time it sends a message. This should not be done.

I do not wear hats, but if I did, I would definitely take mine off to the Erlang/OTP team, who produced such a fast system.

Wednesday, March 25, 2009

X Erlang is now teeterl; released under BSD license

I decided to rebrand X Erlang. It is now called 'teeterl'. The previous name sounded too old-fashioned. The new one emphasizes the toy nature of the system.

And, by the way, everything teeterl is in public access on github (http://github.com/maximk/teeterl).

To gain interactive access to the internals of the running system, you will need another project too. It is called spurl (http://github.com/maximk/spurl).

I would be fascinated to learn that someone else was successful in building and running teeterl (and spurl).