
Specialization in ooc

Amos Wenger
June 8, 2012

Contents
Abstract
The ooc programming language
    ooc vs C++
    ooc vs C#/Java
    Generics
        Generic functions
        Generic classes
    Types
        Covers
        Classes
Original implementation
    Generic arguments and return types
    Conversion
    Generic pointers
    Non-optimality of the current approach
Specialization implementation
    A-priori perils of specialization in ooc
        Introspection
        Combinatorial explosion
        Type signatures
    AST transformations
        Function-level specialization
        Class-wide specialization
    Compatibility with legacy code
    Benchmarking
        C compiler optimizations
        Source and binary size
        Memory usage
        Runtime (gcc Ubuntu/Linaro 4.6.3-1ubuntu5)
        Runtime (clang version 3.0-6ubuntu3)
    Conclusion
    Acknowledgements

Abstract
The purpose of this project is to optimize the performance of generics in the ooc programming language via the implementation of specialization in rock, its main compiler.

The ooc programming language


ooc is a general-purpose programming language I designed in early 2009, in order to be able to write an EPFL assignment in an object-oriented language rather than directly in C.

The first ooc-to-C compiler implementation was done in Java, and didn't feature any compile-time checks. Many iterations later, the ooc compiler is now self-hosting (written in ooc itself) and has been successfully tested on a wide array of platforms: Windows, Linux, OSX, FreeBSD, and OpenBSD. The ooc language is very versatile, and lends itself easily to the same kind of experiments one would use C for: for example, a fork of the ooc SDK was made compilable for the TI-89 calculator. There have also been successful attempts to use ooc on Haiku OS, the modern clone of BeOS.

ooc vs C++
ooc is in some ways comparable to C++. The original meaning of ooc was "object-oriented C", which is similar in spirit to C++'s "C with classes". However, there are a few marked differences.

The first and foremost difference is that ooc intends to remain a source-to-source language: even though we might consider alternative backends, such as LLVM, the JVM, etc., in spirit, ooc is a language that remains usable because, even though it is not widespread, it produces readable C output that is familiar to a whole generation of programmers.

ooc also tries to be leaner than C++: it has fewer features, while still remaining general enough to be relevant for most tasks. Another distinctive difference is in the implementation of generics versus templates. C++ doesn't have generics, it has templates: instantiation is always done at compile time and incurs a cost in code size. On the other hand, templates are often used beyond generic collections, as a metaprogramming mechanism more powerful and type-safe than C macros.

ooc vs C#/Java
As far as class-oriented languages go, C# and Java are arguably the two most notable running on VMs. The fact that ooc's initial implementation does not run in a virtual machine is a deliberate decision: while a VM allows JIT optimizations and wider facilities to debug an application, I saw it as a challenge to work on AOT [1] optimizations instead, and to make the output portable enough that a whole VM wouldn't be required to run it.

Note that even though (obviously) binary executables produced by the C compiler are not portable, the C source is: that is, rock (the ooc compiler) will produce the same C code on Linux, Windows, and OSX; that code will contain the necessary preprocessor directives for it to compile and run on the platforms cited above. Platform-specific code can be written in ooc thanks to version blocks.

As far as debugging, profiling, and general insight into a program go, in practice I have found that classical instrumentation tools such as gdb, valgrind, gcov, gprof, etc. all worked very well. Since #line instructions are output, it is even possible to step through ooc code in gdb, for instance.

Generics appeared in Java 5 in 2004, to be used mostly for collections. In order to maintain backward compatibility of the JVM bytecode produced, type parameters are erased. As a result, there is only compile-time safety with Java generics, and no introspection of generic type parameters is possible.

Generics
ooc adopts a middle ground between the C++ way and the Java way: generics allow a limited amount of run-time introspection, as type parameters are not erased, and the implementation of specialization in this research work allows generics to act like templates, when manually marked with a new keyword.

Generic functions

identity is the canonical generic function: it simply returns exactly what has been passed to it.
[1] Just In Time (JIT) optimizations require programs to be run on a virtual machine that has enough insight to modify the program while it is running, in order to make it run faster. Ahead Of Time (AOT) optimizations are made purely at compile time, and make use of static analysis and other such techniques in order to predict as accurately as possible the cases worth optimizing for. While criticisms apply to both approaches, Profile Guided Optimization (PGO) seems to combine the best of both worlds: it doesn't incur the classical cost of VM start-up, while still retaining the relevance of JIT optimizations based on real-world data from actual runs.

identity: func <T> (value: T) -> T { value }

Generics are reified, so that generic type parameters can be inspected at runtime, like this:

info: func <T> (value: T) {
    "value is a %s of size %d" printfln(T name, T size)
    if (T inheritsFrom?(Object)) "and value is an object" println()
}

A limited amount of matching can be done on the type of a generic argument:

printType: func <T> (value: T) {
    match T {
        case Int => "It's an int!" println()
        case     => "It's something else!" println()
    }
}

More advanced matching can also be done on the generic argument itself, alleviating the need for explicit casts.

repr: func <T> (value: T) {
    match value {
        case c: Char   => "Char(%c)" format(c)
        case i: Int    => "Int(%d)" format(i)
        case s: String => "String(%s)" format(s)
        case o: Object => o class name
        case           => "Unknown"
    }
}

Generic classes

Classes in ooc accept type parameters as well. A simple generic container could be implemented like this:

Container: class <T> {
    value: T
    init: func (=value)
    get: func -> T { value }
}

A prime example of generics usage in the ooc codebase is the structs package, containing various collections [2].

Queue: abstract class <T> {
    push: abstract func (elem: T)
    pop: abstract func -> T
}

Classes and functions accept any number of generic parameters:

Map: abstract class <K, V> {
    put: abstract func (key: K, value: V)
    get: abstract func (key: K) -> V
}

Types
Covers

The reason the repr function above cannot simply be handled with a virtual method call is that, in ooc, not everything is an object. Types like Int and Octet are covers of C types:
[2] However, in contrast to the Go language, any class accepts generic type parameters, not exclusively collections.

Int: cover from int
Octet: cover from char

"  int size = %d bytes" printfln(Int size)
"octet size = %d bytes" printfln(Octet size)

Will print:

  int size = 4 bytes
octet size = 1 bytes

Covers generate typedefs in the C backend, and are only a thin layer over a given C type. They make it possible to use C libraries with an object-oriented syntax, even though the library might have its own system of virtual function calls under the hood [3].

Classes

As in Java, all objects in ooc are references. ooc has single inheritance, and ultimately every object inherits from the Object class.

AsciiChar: class {
    c: Char
    value: Int
}

"reference size = %d bytes" printfln(AsciiChar size)
" instance size = %d bytes" printfln(AsciiChar instanceSize)

Will print:

reference size = 4 bytes
 instance size = 5 bytes

. . . if the C compiler does packing. Otherwise, the char will probably get aligned to 4 bytes, and the instance size will be 8.
[3] This is the case, notably, for the GObject library. As the foundation of the GTK and Gnome libraries, it features a remarkably complex object system on top of C, implemented using high-level object definition files and an impressive number of generated C macros. http://developer.gnome.org/gobject/

Original implementation
Generic arguments and return types
Since generic arguments can be either basic types or object types, the generated code must handle arguments of any size. As C has no explicit support for variable-sized types [4], the implementation uses pointers to a memory area, and memory copy operations instead of assignment. Here's an example of the C code generated for the above identity function:

void identity(Class *T, uint8_t *ret, uint8_t *value) {
    if (ret) {
        memcpy(ret, value, T->size);
    }
}

And a call to identity, such as the following:

a := 42
b := identity(a)

would be translated in C as:

int a = 42;
int b;
identity(Int_class(), &b, &a);

Similarly, when declaring variables of a generic type (inside a generic class, for example), they are allocated on the heap. Although the memory is eventually reclaimed by the garbage collector [5], this incurs some additional processing (housekeeping done by the garbage collector) that would not be necessary if the variable were simply allocated on the stack.
[4] That's not entirely accurate: C99 supports VLAs (Variable-Length Arrays), which are allocated on the stack (like local variables of basic types in ooc), but their limitations render them worthless in our case: you can't return VLAs, and furthermore, keeping track of stack-allocated memory is tricky.
[5] rock uses the Boehm garbage collector: http://www.hpl.hp.com/personal/Hans_Boehm/gc/

Conversion
When generic variables are used in a match, or explicitly cast to a non-generic type [6], some pointer trickery is required in the generated C code. For example, casting a generic parameter named value of type T to an integer type would look like this:

void somefunc(uint8_t *value) {
    int i = *((int*) value);
    // do something with i
}

Generic pointers
Generic pointers are yet another can of worms. They can be indexed like C arrays, getting or setting individual elements. However, since, again, the size of the generic type is not known in advance, code like this:

array_set: func <T> (data: T*, index: Int, value: T) -> T {
    data[index] = value
}

requires non-trivial pointer arithmetic in order to maintain the semantics of array manipulation in ooc:

void array_set(Class *T, uint8_t *ret, uint8_t *data,
               int index, uint8_t *value) {
    memcpy(data + (index * T->size), value, T->size);
    if (ret) {
        memcpy(ret, data + (index * T->size), T->size);
    }
}

By default, if the last instruction of a non-void ooc function is an expression, it is implicitly returned, which explains the second part of the generated code.
[6] Note that explicit casts from generic types to non-generic types are unsafe and generally regarded as bad practice. Using a match is a much safer way to deal with generic values.

Non-optimality of the current approach


The current implementation of generics suffers from a few performance problems. Because of the generality of the machine code that is eventually produced, there is a significant amount of lost opportunity for optimization. Given that we can infer, at compile time, the type parameters used when instantiating certain generic classes, optimized machine code could be emitted (through a C compiler) for a given subset of these type parameter combinations.

Specialization implementation
The basic idea behind specialization is to turn a subset of generic instances into template instances, statically compiled to type-specific code. For example, the identity function described above, when used with an integer argument, would compile down to this specialized code:

int identity__int(int value) {
    const Class *T = Int_class();
    return value;
}

Below, we discuss the major problems involved in implementing specialization in a language such as ooc, and the methods used to circumvent them. Note that the implementation discussed in this report is available as a branch of the rock project on GitHub [7].

A-priori perils of specialization in ooc


The following is a list of problems that were known ahead of time and made this project challenging. In comparison, the Post-mortem section will detail unforeseen issues that were encountered while implementing specialization.
[7] GitHub is a general-purpose code repository and collaboration suite for software developers, with an emphasis on open source. All of the code for the ooc compiler, the associated tools, and the general library ecosystem around it is hosted there. It even features syntax highlighting for ooc. https://github.com/nddrylliog/rock/tree/specialize


Introspection

ooc features reified generics [8], which means that when a generic type is instantiated, rather than erasing its type parameters, it stores them in the class structure, next to the vtable. As a result, generic classes can be inspected at runtime, along with their type parameters.

When specializing classes, although the generic parameters become either partially or totally fixed at compile time, we cannot simply erase them from the class structure, because existing code might depend on them (through instanceOf?, or preferably, a match, as demonstrated in the examples above). As a result, the generated code must make the generic parameters available in the specialized version just as in the unspecialized version. In function-level specialization, this can be a simple local declaration, which an optimizing compiler can remove if it is not used:

int identity__int(int value) {
    const Class *T = Int_class();
    return value;
}

Combinatorial explosion

Another thing to consider is the scope of the specialization: which classes and methods to specialize, and which to leave unspecialized. While heuristics could be developed to find the most cost-efficient combinations, for this project we simply annotate by hand the methods we feel would benefit from being specialized. That said, auto-specialization would be an interesting topic for further research.
[8] This design choice departs significantly from, for example, the Java and Scala languages. The reason is that object semantics are built into the JVM. This decision, criticized by detractors of the JVM, makes it unnecessarily hard to implement alternate OO semantics, perhaps closer to the intent of Alan Kay in Smalltalk, where state is embraced and message-sending is the prime mechanism by which computation is done. However, the state of the art is changing with JDK 7, thanks to the work of, among others, Charles Nutter, who played a crucial role in the JRuby implementation and is now championing the development of invokedynamic to better serve dynamic languages on the JVM.


Type signatures

Finally, in the context of C code generation, specialization is tricky because the specialized and the unspecialized versions of a given method can have different signatures, implying that they cannot be called the same way (ABI compatibility is not maintained). However, in the event that we need to maintain the same signature, a shim is easy to write, for example:

void identity__universal(Class *T, uint8_t *ret, uint8_t *value) {
    if (T == Int_class()) {
        if (ret) {
            *((int*) ret) = identity__int(*((int*) value));
        }
    } else {
        identity(T, ret, value);
    }
}

Note that, ironically, in order to generate a shim like the above, we have to use the same style of checks as in the unspecialized versions (i.e. making sure the return pointer is non-null), and we have to cast the pointer type, as shown in the Conversion section.

AST transformations
For the purpose of this project, two types of specialization have been implemented: function-level specialization and class-wide specialization.

Function-level specialization

While ooc is an object-oriented language, it allows module-level functions that are not bound to a specific type. Those functions can be generic too, and are potentially subject to specialization as well. In our implementation, we use the pre-existing inline keyword to mark functions that should be specialized.

The combinations for which a generic function should be specialized are chosen by call site. In theory, this might lead to combinatorial explosion (as seen above), but in practice, module-level functions are rare enough in typical ooc code that such an implementation is still relevant.

Since the ooc AST is mutable, the first step to specializing a function is to keep a copy of it before any AST mutation can transform it into a full-blown generic function. In our implementation, we simply added an inline member to the FunctionDecl AST node.

The second step is to modify the function call resolution process in order to intercept functions that are marked as specializable. This is done by adding a condition in the resolveCall function of the FunctionCall node, which calls the specialize method on the FunctionDecl.

In the specialize method, another copy of the original is made, ready to be specialized. Then, we step through each argument of the function and change its generic type to the type inferred from the call. For example, if a function with generic parameter X took an argument of type X, and was called with an argument of type Char, all references to the generic type X would now refer to the concrete type Char.

On the side of the function call itself, nothing needs to be changed, except its ref, which is a reference to the function declaration being called. This ensures that the correct C function is called in the generated code. The specialized version of a function has a name composed of the name of the original function and a unique generated suffix, in order to make sure that the additional C function generated doesn't clash with any pre-existing code [9].

Class-wide specialization

To specialize whole classes, we have taken a different approach. Instead of marking class declarations and determining combinations from the instantiation site, we have introduced a new keyword, #specialize, that accepts a fully-qualified generic class name
[9] That technique, while imperfect, is used in many different places in the ooc compiler. Because it was designed from the beginning to generate C code, tradeoffs were made in order to facilitate the work of the backend. In retrospect, cleanly separating the C and ooc backends would have been a much cleaner alternative: this approach is taken in the experimental compiler oc (https://github.com/nddrylliog/oc), and more recently in the latest rewrite of rock itself. It is hoped that this clean separation will allow alternative backends to be implemented more easily.


and marks not the class, but the combination of its generic type parameters, for specialization. Adding this keyword required modifying the ooc PEG grammar used by rock, as it constitutes an addition to the syntax of the language. Used carefully in generic code, it lets one hint the compiler as to which specializations would be the most beneficial for the performance of the program, while retaining a relatively small footprint compared to the unspecialized version.

The usage of #specialize triggers a copy mechanism similar to the one described previously for module-level functions. The class variation, however, contains a few interesting differences. The first one is that in order to resolve generic types to concrete types, a map from type parameter names to concrete types is built. This map is then used in the resolving process to make sure that, to take the same example, any access to the type X would in fact point to the concrete type Char.

This solves the first part of the problem, which is to actually generate a specialized version of the class. The second part of the problem is to use this specialized version where it can be used. This is solved in an interesting way.

new is not a keyword in ooc. In other words, class instantiation is not treated specially, and as such it cannot easily be hooked into in the compiler, except by testing against the name of the method, but here again, there is no guarantee that this method is the actual constructor. In fact, in ooc, new is just a normal static method, which allocates an object, assigns generic type parameters, and then calls the init (non-static) method on the newly created instance, returning it shortly thereafter. Due to the flexible nature of the language, an object could very well be created from another static method, such as create, or fromSomethingElse [10].

A very elegant solution was found to this problem: it occurred to me that in order for
[10] These methods are typically used when the default allocation strategy is not deemed fit for a particular use case. In some games, for example, a memory pool could be used for certain classes of objects subject to rapid creation and destruction over short lifetimes, as it would be less costly than letting the GC handle these memory blocks indiscriminately. On embedded systems, such as the TI-89, the Boehm garbage collector could not run at all: a manual allocation strategy was then required to make ooc code run on that platform. In that case, a simple modification of the core of the ooc object system, itself written in ooc, was sufficient.


the specialized version to be used for compatible instantiations, all we had to do was to call the right version of the static function, be it new or any other. In order to achieve this, a check was added at the end of the function resolution code (resolveCall in the TypeDecl AST node) to check whether a static method call was made on a class that has specializations. If so, we try to match the actual type on which the static function is called against the various specializations manually permitted in the code.

The decision whether to use a specialized version of the class, and which one to use, is made by comparing the scores of different associations of types. This method, which we could call quantitative subtyping, is used in various places of the ooc compiler in order to allow part of the flexibility that C is known for in its type system. It is especially useful in the context of C covers, where the relations between multiple covers are often fuzzy and do not obey the rules of a classical, strict type system.

Quantitative subtyping attributes an affinity score to a pair of types: for example, if two types are equivalent, the score will be maximal. If the right-hand side is a subtype of the left-hand side, the score will be slightly lower, but still positive. Pairs of types that have absolutely no subtyping relation (either via extends, implements, or cover from), such as ArrayList and HashMap, will yield negative scores. Operator overloading and function overloading use quantitative subtyping extensively to find the best match or, if there is none, present a helpful message with the closest match.

Once the specialization to be used has been determined by quantitative subtyping, the resolution of the ref of the static method call is relayed to the specialized version of the class, which is then written out just as a normal class would be.

Compatibility with legacy code


Additions made to the compiler to support specialization do not break any existing code. In fact, it was tested against inception-engine [11], an ooc game engine that is highly dynamic and allows runtime manipulation of all entities in the game world at all times.
[11] The source of this project, although dated, still compiles and runs on the current version of rock, and is available under a BSD-compatible license on GitHub: https://github.com/nddrylliog/inception-engine


Figure 1: inception-engine is a good example of real-world usage for ooc


However, the usage of #specialize with data structures such as structs/ArrayList and structs/HashMap causes issues with parts of the code hand-optimized for the initial, naïve implementation of generics. For example, the removeAt method calls memmove directly in order to copy areas of memory efficiently (instead of moving one element at a time):

removeAt: func (index: SSizeT) -> T {
    element := data[index]
    memmove(data + (index * T size),
            data + ((index + 1) * T size),
            (_size - index) * T size)
    _size -= 1
    element
}

This code, instead of using the generic facilities in ooc, bypasses them and directly calls C functions for performance. Unfortunately, it does not make sense in the context of a specialization, and prevents the current specialization implementation from being used directly with this SDK.

A possible solution to this problem would be to extend the semantics of pointer manipulation in ooc, and allow manipulation of ranges, so that the removeAt method could be re-implemented as follows:

removeAt: func (index: SSizeT) -> T {
    element := data[index]
    data[index.._size - 1] = data[index + 1.._size]
    element
}

This code, resorting to higher-level generic primitives, would then allow the generation of both the unspecialized version (using memmove for fast, any-sized copying of array elements) and the specialized version.

Benchmarking
The benchmark that we are going to use is a simple list sorting algorithm. We will compare the respective performance of the specialized and the unspecialized versions.

include stdint
import math/Random, os/Time

List: class <X> {
    data: X*
    size: SizeT

    init: func (=size) {
        data = gc_malloc(X size * size)
    }

    get: func (index: Int) -> X { data[index] }

    set: func (index: Int, element: X) { data[index] = element }

    swap: func (i, j: Int) {
        tmp := get(i)
        (data[i], data[j]) = (get(j), tmp)
        bbtrap()
    }

    bbtrap: func {}

    print: func (f: Func (X) -> String) {
        "(" print()
        for (i in 0..size) {
            x := get(i)
            f(x) print()
            if (i < size - 1) ", " print()
        }
        ")" println()
    }

    bubbleSort!: func (compare: Func (X, X) -> Int) {
        sorted := false
        while (!sorted) {
            sorted = true
            for (i in 0..size - 1) {
                if (compare(get(i), get(i + 1)) > 0) {
                    sorted = false
                    swap(i, i + 1)
                }
            }
        }
    }
}

Box1: class {
    value: Int
    init: func (=value)
}

Box2: class {
    value: Int
    init: func (=value)
}

#specialize List<Box2>

/** sort an unspecialized list of ints with size elements
    and return the number of milliseconds it took */
unspecialized: func (size: Int) -> UInt {
    l := List<Box1> new(size)
    for (i in 0..l size) {
        l set(i, Box1 new(Random random()))
    }
    Time measure(||
        l bubbleSort!(|a, b| a value <=> b value)
    )
}

/** sort a specialized list of ints with size elements
    and return the number of milliseconds it took */
specialized: func (size: Int) -> UInt {
    l := List<Box2> new(size)
    for (i in 0..l size) {
        l set(i, Box2 new(Random random()))
    }
    Time measure(||
        l bubbleSort!(|a, b| a value <=> b value)
    )
}

benchmark: func {
    numRuns := 10
    "# list_size\ttime_unspecialized\ttime_specialized" println()
    for (i in 10..15) {
        size := 1 << i
        meanUnspe := 0
        meanSpe := 0
        for (i in 0..3) {
            meanUnspe += unspecialized(size)
            meanSpe += specialized(size)
        }
        timeUnspe := meanUnspe * (1.0 / numRuns)
        timeSpe := meanSpe * (1.0 / numRuns)
        "%d\t%u\t%d" printfln(size, timeUnspe, timeSpe)
    }
}

main: func { benchmark() }


C compiler optimizations

In the following discussion, we often talk about the optimizations C compilers do. While we won't go into the details of every optimization, here is a simple example of ooc-generated code compiled with clang, based on the identity function above. In our test code, the test1 function calls the unspecialized code, the test2 function calls the manually specialized code, and the test3 function calls the automatically specialized code. test1's body disassembles to the following assembly code:

0000000000402620 <id__test1>:
  402620: 55                        push   %rbp
  402621: 48 89 e5                  mov    %rsp,%rbp
  402624: 48 83 ec 10               sub    $0x10,%rsp
  402628: 48 c7 45 f8 00 00 00 00   movq   $0x0,-0x8(%rbp)
  402630: 30 c0                     xor    %al,%al
  402632: e8 a9 4b 00 00            callq  4071e0 <String_class>
  402637: 48 8b 50 10               mov    0x10(%rax),%rdx
  40263b: 48 8d 7d f8               lea    -0x8(%rbp),%rdi
  40263f: be 18 b4 63 00            mov    $0x63b418,%esi
  402644: e8 a7 fa ff ff            callq  4020f0 <memcpy@plt>
  402649: 48 8b 7d f8               mov    -0x8(%rbp),%rdi
  40264d: e8 4e 3d 00 00            callq  4063a0 <String_println>
  402652: 48 83 c4 10               add    $0x10,%rsp
  402656: 5d                        pop    %rbp
  402657: c3                        retq
  402658: 0f 1f 84 00 00 00 00 00   nopl   0x0(%rax,%rax,1)

Here, we can see that the call is inlined, i.e. the body of the identity function is copied in whole, but the C compiler fails to see the opportunity for optimization, and uses memcpy even though there would be a much faster alternative. On the other hand, test3, which calls the specialized version, compiles to:

0000000000402670 <id__test3>:
  402670: 48 8b 3d 91 8d 23 00      mov    0x238d91(%rip),%rdi   # 63b408 <__strLit5>
  402677: e9 24 3d 00 00            jmpq   4063a0 <String_println>
  40267c: 0f 1f 40 00               nopl   0x0(%rax)

Here, the specialized identity function is also inlined, and the compiler goes even further: it completely disregards the call and the copy, and directly calls println on the original data. This is the furthest we can go with this optimization; with the hints provided by the ooc compiler (i.e. specializing with the call-site types), clang/LLVM is able to take full advantage of compile-time information.

Source and binary size

One downside of specialization is that it produces larger source files and thus larger executables. The next graph shows, in bytes, the difference in size for three files: sorting.c, the C file with the actual implementation of the sorting algorithm; sorting.h, a header file containing, among other things, the structure definitions for the vtable; and sorting-fwd.h, which contains public function prototypes for an ooc module [12].
[12] The ooc compiler always generates two header files for any module. This allows arbitrary circular dependencies within ooc modules, and proper forward declarations in the generated C code. Those circumvention mechanisms would not be necessary in a target language with a saner modularity paradigm, which, unfortunately, is not the case with C. In practice, modules which only instantiate classes or call functions from the imported module will only include the forward header (module-fwd.h), while modules which contain subtypes of the imported module will import the full header (module.h) in order to easily generate the hierarchical vtable structure that characterizes ooc classes.


[Figure: bar chart of source size in bytes (0 to 25000) for sorting.c, sorting.h, and sorting-fwd.h]

As for the size of the executable, when compiled with clang, it went from 728K to 734K with -O0, and from 672K to 674K with -Os. The impact here is minimal, as most of the space in the executable is occupied by the ooc SDK itself, which is not affected by our usage of #specialize.

Memory usage

Runtime (gcc Ubuntu/Linaro 4.6.3-1ubuntu5)

The GNU Compiler Collection is the de-facto standard among open-source compilers. It has been used for decades to build sizable collections of C and C++ programs, such as the Debian project, which contains over 29000 packages. The following graph shows execution times as a function of the size of an array sorted using bubble sort (cf. the algorithm shown above).


[Figure: runtime (ms, 0 to 120000) vs. array size (0 to 18000) for four variants: unspecialized (-O0), unspecialized (-Os), specialized (-O0), specialized (-Os)]

It displays a few unintuitive results: for the non-specialized code, the unoptimized version is faster than the version optimized for size. This means that gcc is willing to compromise on the performance of the program in order to reduce the size of the executable from 710KB to 665KB. Interestingly, on the specialized version of the program, GCC performs as expected, i.e. the -Os version is faster, by about 60%.

Runtime (clang version 3.0-6ubuntu3)

Clang is the gcc-compatible front-end for LLVM, an up-and-coming challenger to the GNU Compiler Collection, which boasts a cleaner codebase, generally faster compile times, and more in-depth optimizations. It now regularly beats GCC on benchmarks, and several Linux distributions are looking to switch their toolchains to it entirely13. We have run the same tests on clang as on GCC, in order to compare the behavior of the two C compilers and interpret the results in relation to the produced C code.
13 Among them is the Gentoo project: http://en.gentoo-wiki.com/wiki/Llvm


[Figure: runtime (ms, 0 to 90000) vs. array size (0 to 18000) for four variants: unspecialized (-O0), unspecialized (-Os), specialized (-O0), specialized (-Os)]

The results are more in line with what one would expect of an optimizing compiler. LLVM seems to strike a better compromise between code size and performance, as the compared runtimes at -O0 and -Os show, both in the unspecialized and the specialized version.

Conclusion
In conclusion, implementing specialization in the ooc language proved relatively painless, thanks to the way generics were designed from the beginning. Specialization allowed a performance gain of up to 78% in a generics-heavy sorting benchmark. Remnants of legacy, backend-specific code prevent this implementation from being useful in an even larger context, but this work constitutes a solid basis for a new class of high-performance ooc applications where, previously, generic types would have been ruled out because of their high cost. The results of this research will be merged into the main trunk of the rock ooc compiler, along with the implementation of the new generic pointer primitives discussed in the Class-wide specialization section, in time for the 1.0 release of rock.


Acknowledgements
I'd like to extend a sincere thank you to the Lab for Automated Reasoning and Analysis at EPFL for allowing me to research the performance and optimization of the ooc programming language. In particular, this work was made possible by the continued guidance and support of Philippe Suter, along with Etienne Kneuss and Viktor Kuncak. Thanks to open source and GitHub, countless contributors got a chance to participate in the specification and implementation of the ooc language, including Friedrich Weber, Yannic Ahrens, Nicholas Markwell, Alexandros Naskos, Joshua Rösslein, Michael Tremel, Peter Lichard, Scott Olson, Noel Cower, Curtis McEnroe, Anthony Roja Buck, Daniel Danopia, Keita Haga, Mark Fayngersh, Michael Kedzierski, Patrice Ferlet, and Tim Howard.

