
Bits of Unicode

Data structures for a large character set
Mark Davis
IBM Emerging Technologies
☢ Caution ☢
• “Characters” ambiguous, sometimes:
– Graphemes: “x ̣” (also “ch”,…)
– Code points: 0078 0323
– Code units: 0078 0323 (or UTF-8: 78 CC A3)
• For programmers
– Unicode associates codepoints (or sequences of
codepoints) with properties
– See UTR#17
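
The difference between code points and code units can be checked directly in Java, whose strings are sequences of UTF-16 code units. The following is a minimal sketch; the class and method names are illustrative and do not come from the slides.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitSketch {
    // "x" followed by U+0323 COMBINING DOT BELOW: one grapheme, two code points.
    static final String X_DOT_BELOW = "x\u0323";

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of UTF-16 code units (Java chars).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of UTF-8 code units (bytes).
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For the example grapheme this gives 2 code points, 2 UTF-16 code units, and 3 UTF-8 code units (78 CC A3).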
The Problem
• Programs often have to do <key,value>
lookups
– Look up properties by codepoint
– Map codepoints to values
– Test codepoints for inclusion in set
• e.g. value == true/false
• Easy with 256 codepoints: just use array
Size Matters

• Not so easy with Unicode!


• Unicode 3.0
– subset (except PUA)
– up to FFFF₁₆ = 65,535₁₀
• Unicode 3.1
– full range
– up to 10FFFF₁₆ = 1,114,111₁₀
Array Lookup
• With ASCII
– Simple
– Fast
– Compact
– codepoint ➠ bit: 32 bytes
– codepoint ➠ short: ½K
• With Unicode
– Simple
– Fast
– Huge (esp. v3.1)
– codepoint ➠ bit: 136 K
– codepoint ➠ short: 2.2 M
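
The sizes above follow from simple arithmetic. The sketch below reproduces them, assuming one bit or one 16-bit short per codepoint and nothing else; the class and constant names are illustrative.

```java
class ArraySizeSketch {
    static final int ASCII_RANGE = 256;             // the "256 codepoints" case
    static final int UNICODE_RANGE = 0x10FFFF + 1;  // 1,114,112 codepoints

    // One bit per codepoint, rounded up to whole bytes.
    static int bitArrayBytes(int codePoints) {
        return (codePoints + 7) / 8;
    }

    // One 16-bit short per codepoint.
    static int shortArrayBytes(int codePoints) {
        return codePoints * 2;
    }
}
```

For the full Unicode range this gives 139,264 bytes (136 K) for the bit array and 2,228,224 bytes (about 2.2 M) for the short array.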
Further complications

• Mappings, tests, properties often must be for sequences of codepoints.
– Human languages don’t just use single
codepoints.
– “ch” in Spanish, Slovak; etc.
First step: Avoidance
• Properties from libraries often suffice
– Test for (Character.getType(c) == Nd)
instead of long list of codepoints
• Easier
• Automatically updated with new versions
• Data structures from libraries often suffice
– Java Hashtable
– ICU (Java or C++) CompactArray
– JavaScript properties
• Consult http://www.unicode.org
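
As a small illustration of the avoidance approach, the digit test can be written against the Java library property instead of a hand-maintained list. The class and method names are illustrative; `Character.getType(int)` and `Character.DECIMAL_DIGIT_NUMBER` (category Nd) are standard Java.

```java
class DigitTestSketch {
    // True when the codepoint has general category Nd (decimal digit number).
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }
}
```

The same test covers non-ASCII digits such as U+0663 ARABIC-INDIC DIGIT THREE without listing them.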
Data structures: criteria
• Speed
– Read (static)
– Write (dynamic)
– Startup
• Memory footprint
– RAM
– Disk
• Multi-threading
Hashtables
• Advantages
– Easy to use out-of-the-box
– Reasonably fast
– General
• Disadvantages
– High overhead
– Discrete (no range lookup)
– Much slower than array lookup
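
A minimal sketch of the hashtable approach, using `java.util.HashMap` as a stand-in for the `Hashtable` named in the slides. The two-entry mapping table is hypothetical; it only shows that every key and value is boxed and that lookup is per codepoint, with no notion of ranges.

```java
import java.util.HashMap;
import java.util.Map;

class HashLookupSketch {
    // Hypothetical mapping table: each entry boxes both key and value.
    static final Map<Integer, Integer> UPPER = new HashMap<>();
    static {
        UPPER.put(0x0061, 0x0041); // a -> A
        UPPER.put(0x00E9, 0x00C9); // e-acute -> E-acute
    }

    // Discrete lookup: a codepoint either has its own entry or maps to itself.
    static int toUpper(int codePoint) {
        return UPPER.getOrDefault(codePoint, codePoint);
    }
}
```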
Overhead: char1 ➠ char2
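[Diagram: mapping char1 ➠ char2 through a hashtable. The entry object carries its own overhead plus next, hash, key, and value fields; key and value each point to a separate object that adds more overhead around char1 and char2.]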


Trie

• Advantages
– Nearly as fast as array lookup
– Much smaller than arrays or Hashtables
– Take advantage of repetition
• Disadvantages
– Not suited for rapidly changing data
– Best for static, preformed data
Trie structure

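[Diagram: the high bits of the codepoint (M1) select an entry in Index, which points to a block in Data; the low bits (M2) select the value within that block.]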
Trie code

• 5 Operations
– Shift, Lookup, Mask, Add, Lookup

v = data[index[c>>S1]+(c&M2)]

[Diagram: the codepoint is split into a high field M1, obtained by shifting right by S1, and a low field M2.]
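
The lookup above can be sketched together with a simple way of building the two arrays. This is an assumption-laden illustration, not the ICU implementation: the block size (S1 = 8), the byte-valued data, and the naive deduplication of identical blocks are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrieSketch {
    static final int S1 = 8;              // assumed block size: 256 values
    static final int M2 = (1 << S1) - 1;  // mask for the low bits

    final int[] index;  // start offset in data for each block of codepoints
    final byte[] data;  // deduplicated blocks, stored back to back

    // Builds the trie from a flat array whose length is a multiple of the block size.
    TrieSketch(byte[] flat) {
        int blockLen = 1 << S1;
        List<byte[]> unique = new ArrayList<>();
        index = new int[flat.length >> S1];
        for (int b = 0; b < index.length; b++) {
            byte[] block = Arrays.copyOfRange(flat, b * blockLen, (b + 1) * blockLen);
            int found = 0;
            while (found < unique.size() && !Arrays.equals(unique.get(found), block)) {
                found++;
            }
            if (found == unique.size()) unique.add(block); // first time this block is seen
            index[b] = found * blockLen;
        }
        data = new byte[unique.size() * blockLen];
        for (int u = 0; u < unique.size(); u++) {
            System.arraycopy(unique.get(u), 0, data, u * blockLen, blockLen);
        }
    }

    // Shift, lookup, mask, add, lookup.
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }
}
```

Repetition is what makes this compact: a flat array covering 0..10FFFF with only a handful of non-default values collapses to a few 256-entry blocks plus a 4,352-entry index.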
Trie: double indexed

• Double, for more compaction:


– Slightly slower than single index
– Smaller chunks of data, so more compaction
Trie: double indexed
[Diagram: the codepoint fields M1, M2, and M3 select entries in Index1, Index2, and Data in turn.]
Trie code: double indexed
b1 = index1[ c >> S1 ]
b2 = index2[ b1 + ((c >> S2) & M2)]
v = data[ b2 + (c & M3) ]

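[Diagram: the codepoint is split into fields M1, M2, and M3; shifting right by S1 isolates M1, and shifting right by S2 then masking isolates M2.]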
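
A sketch of the double-indexed variant, built by compacting twice: first the data blocks, then the index over them. The field widths (4 low bits, 6 middle bits) and the int-valued arrays are assumptions chosen for the example, and the block deduplication is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DoubleTrieSketch {
    static final int S2 = 4, M3 = (1 << S2) - 1;          // assumed: 16 values per data block
    static final int S1 = 10, M2 = (1 << (S1 - S2)) - 1;  // assumed: 64 entries per index2 block

    final int[] index1, index2, data;

    // Builds from a flat array whose length is a multiple of 1 << S1.
    DoubleTrieSketch(int[] flat) {
        int[] stage2 = new int[flat.length >> S2];
        data = compact(flat, 1 << S2, stage2);
        index1 = new int[stage2.length >> (S1 - S2)];
        index2 = compact(stage2, 1 << (S1 - S2), index1);
    }

    // Stores each distinct block once; offsets[b] receives the start of block b.
    static int[] compact(int[] flat, int blockLen, int[] offsets) {
        List<int[]> unique = new ArrayList<>();
        for (int b = 0; b < offsets.length; b++) {
            int[] block = Arrays.copyOfRange(flat, b * blockLen, (b + 1) * blockLen);
            int found = 0;
            while (found < unique.size() && !Arrays.equals(unique.get(found), block)) {
                found++;
            }
            if (found == unique.size()) unique.add(block);
            offsets[b] = found * blockLen;
        }
        int[] out = new int[unique.size() * blockLen];
        for (int u = 0; u < unique.size(); u++) {
            System.arraycopy(unique.get(u), 0, out, u * blockLen, blockLen);
        }
        return out;
    }

    // Same three steps as the slide: index1, then index2, then data.
    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }
}
```

Smaller data blocks are more likely to repeat, which is where the extra compaction comes from; the cost is one more array access per lookup.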
Inversion List
• Compaction of set of codepoints
• Advantages
– Simple
– Very compact
– Faster write than trie
– Very fast boolean operations
• Disadvantages
– Slower read than trie or hashtable
Inversion List Structure
• Structure
– Index (optional)
– List of codepoints in ascending order
• Example Set
[ 0020-0061, 0135, 19A3-201B ]

Index
0: 0020 in
1: 0062 out
2: 0135 in
3: 0136 out
4: 19A3 in
5: 201C out
Inversion List Example
• Find smallest i such that c < data[i]
– If no i, i = length
• Then c ∈ List ↔ odd(i)
• Examples:
– In: 0023, 0135
– Out: 001A, 0136, A357

Index
0: 0020 in
1: 0062 out
2: 0135 in
3: 0136 out
4: 19A3 in
5: 201C out
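
The membership rule can be sketched with an ordinary binary search over the example set. The class and method names are illustrative; this is not the unrolled search used for speed, only a plain statement of the rule.

```java
class InversionListSketch {
    // Boundaries of [0020-0061, 0135, 19A3-201B], in ascending order.
    static final int[] DATA = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};

    // Returns the smallest i with c < data[i], or data.length if there is none.
    static int findIndex(int[] data, int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    // An odd index means the codepoint lies inside a range of the set.
    static boolean contains(int[] data, int c) {
        return (findIndex(data, c) & 1) == 1;
    }
}
```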
Inversion List Operations
• Fast Boolean Operations
• Example: Negation
Index          Index
0: 0020   ➠   0: 0000
1: 0062        1: 0020
2: 0135        2: 0062
3: 0136        3: 0135
4: 19A3        4: 0136
5: 201C        5: 19A3
               6: 201C
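
Negation only toggles a leading 0000 boundary, which flips the parity of every later index. A minimal sketch, with illustrative names, assuming the list is a sorted `int[]` of boundaries:

```java
import java.util.Arrays;

class InversionNegateSketch {
    // Complements the set by adding or removing a leading 0000 boundary.
    static int[] negate(int[] data) {
        if (data.length > 0 && data[0] == 0) {
            // The set started at 0000: drop that boundary.
            return Arrays.copyOfRange(data, 1, data.length);
        }
        int[] out = new int[data.length + 1];
        System.arraycopy(data, 0, out, 1, data.length); // out[0] stays 0000
        return out;
    }
}
```

Negating twice returns the original list.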
Inversion List: Binary Search
• from Programming Pearls
• Completely unrolled, precalculated parameters
int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {
    // each case deliberately falls through to the next
    case 21: if (x < data[t = index-0x10000])
        index = t;
    case 20: if (x < data[t = index-0x8000])
        index = t;
    // …

Inversion Map
• Inversion List plus
• Associated Values
– Lookup index just as in Inversion List
– Take corresponding value

Index
0: 0020
1: 0062
2: 0135
3: 0136
4: 19A3
5: 201C

Values
0: 0
1: 5
2: 3
3: 9
4: 8
5: 3
6: 0
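
A sketch of the lookup using the index and values from this slide. The names are illustrative; the values array has one more entry than the index so that codepoints past the last boundary still have a value.

```java
class InversionMapSketch {
    static final int[] KEYS = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};
    static final int[] VALUES = {0, 5, 3, 9, 8, 3, 0}; // one more value than keys

    // Finds the smallest i with c < KEYS[i] (or KEYS.length) and returns VALUES[i].
    static int get(int c) {
        int lo = 0, hi = KEYS.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < KEYS[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return VALUES[lo];
    }
}
```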
Key ➠ String Value
• Problem
– Often almost all values are 1 codepoint
– But, must map to strings in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate extra processing
– Can use same solution for UTF-16 code units
Example

• Get a character ch
• Find its value v
• If v is in [D800..E000], may be string
– check v2 = valueException[v - D800]
– if v2 not null, process it, continue
• Process v
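
The steps above can be sketched as follows. Everything here is hypothetical scaffolding: `lookup` stands in for the real trie or array lookup, and the one-entry exception table holds the well-known expansion of U+00DF to "SS" in uppercasing.

```java
class StringValueSketch {
    static final int EXC_BASE = 0xD800;
    // Hypothetical exception table; slot 0 holds the expansion of U+00DF.
    static final String[] VALUE_EXCEPTION = {"SS"};

    // Stand-in for the real lookup: maps a-z and U+00DF only.
    static int lookup(int ch) {
        if (ch == 0x00DF) return EXC_BASE + 0;
        if (ch >= 'a' && ch <= 'z') return ch - 0x20;
        return ch;
    }

    static void map(int ch, StringBuilder out) {
        int v = lookup(ch);
        if (v >= 0xD800 && v < 0xE000) {      // value may stand for a string
            int slot = v - EXC_BASE;
            String v2 = slot < VALUE_EXCEPTION.length ? VALUE_EXCEPTION[slot] : null;
            if (v2 != null) {                 // process the string and continue
                out.append(v2);
                return;
            }
        }
        out.appendCodePoint(v);               // ordinary single-codepoint value
    }

    static String mapAll(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> map(cp, out));
        return out.toString();
    }
}
```

Only the rare exceptional values pay for a string; every other codepoint costs one range check.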
String Key ➠ Value
• Problem
– Often almost all keys are 1 codepoint
– Must have string keys in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate possible follow-on
codepoints
– Can use same solution for UTF-16 code units
– Use key closure!
Closure

• If (X + Y) is a key, then X is a key

Before          After
s ➠ x           s ➠ x
sh ➠ y     ➠   sh ➠ y
shch ➠ z        shch ➠ z
c ➠ w           c ➠ w
                shc ➠ yw
Why Closure?

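[Diagram: matching the input "s h c h a …" one codepoint at a time finds x (s), y (sh), yw (shc), then z (shch); at "a" no longer key is found, so the last match is used.]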
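
A sketch of why closure matters, using the table from the Closure slide. The string-keyed map is an assumption made for brevity (a real implementation would walk a trie or state table), and `close` fills in each missing prefix with the result of transforming that prefix, which reproduces shc ➠ yw.

```java
import java.util.Map;
import java.util.TreeMap;

class ClosureSketch {
    // Longest-match transform that assumes the keys are closed under prefixes,
    // so the scan stops at the first prefix that is not a key.
    static String transform(Map<String, String> map, String in) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < in.length()) {
            int end = pos;       // end of the last prefix found
            String last = null;  // value of the last prefix found
            while (end < in.length() && map.containsKey(in.substring(pos, end + 1))) {
                end++;
                last = map.get(in.substring(pos, end));
            }
            if (last == null) {  // no key starts here: copy one char through
                out.append(in.charAt(pos));
                pos++;
            } else {             // no longer key found: use the last one
                out.append(last);
                pos = end;
            }
        }
        return out.toString();
    }

    // Adds every missing prefix X of a key X+Y, valued by transforming X itself.
    static Map<String, String> close(Map<String, String> map) {
        Map<String, String> closed = new TreeMap<>(map);
        for (String key : map.keySet()) {
            for (int len = 1; len < key.length(); len++) {
                String prefix = key.substring(0, len);
                if (!closed.containsKey(prefix)) {
                    closed.put(prefix, transform(closed, prefix));
                }
            }
        }
        return closed;
    }
}
```

With the closed table, "shcha" becomes "za". With the unclosed table the same scan stops at "shc", never sees "shch", and yields "ywha".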
Bitpacking

• Squeeze information into value


• Example: Character Properties
– category: 5 bits
– bidi: 4 bits (+ exceptions)
– canonical category: 6 bits + expansion

• compressCanon = (bits >> SHIFT) & MASK;


• canon = expansionArray[compressCanon];
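
A sketch of packing the three fields into one int. The bit positions are assumptions for illustration only (the slides give the field widths, not the layout), and the expansion table contents in the usage are made up.

```java
class BitpackSketch {
    // Assumed layout: category in bits 0-4, bidi in bits 5-8,
    // compressed canonical value in bits 9-14.
    static final int CAT_MASK = 0x1F;
    static final int BIDI_SHIFT = 5, BIDI_MASK = 0xF;
    static final int CANON_SHIFT = 9, CANON_MASK = 0x3F;

    static int pack(int category, int bidi, int compressCanon) {
        return (category & CAT_MASK)
             | ((bidi & BIDI_MASK) << BIDI_SHIFT)
             | ((compressCanon & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) {
        return bits & CAT_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    // The 6-bit field is an index into a table holding the full values.
    static int canon(int bits, int[] expansionArray) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return expansionArray[compressCanon];
    }
}
```

The expansion array is what lets a 6-bit field stand for values that would not fit in 6 bits themselves.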
Statetables

• Classic:
– entry = stateTable[ state, ch ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
Statetables

• Unicode:
– type = trie[ch];
– entry = stateTable[ state, type ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
• Also, String Key ➠ Value
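
A sketch of the Unicode-style loop: the codepoint is first reduced to a small type, and the state table is indexed by that type instead of by the raw codepoint. The two-type table and the digit-run task are made up for the example, and `Character.getType` stands in for the trie lookup.

```java
class StateTableSketch {
    static final int DIGIT = 0, OTHER = 1;  // character types (table columns)
    static final int STOP = -1;

    // NEXT[state][type]; state 0 = start, state 1 = inside a digit run.
    static final int[][] NEXT = {
        {1, STOP},
        {1, STOP},
    };

    // Stand-in for type = trie[ch]: collapses all codepoints into two types.
    static int typeOf(int ch) {
        return Character.getType(ch) == Character.DECIMAL_DIGIT_NUMBER ? DIGIT : OTHER;
    }

    // Returns the offset just past the run of digits that starts at 'start'.
    static int endOfDigits(String s, int start) {
        int state = 0, pos = start;
        while (pos < s.length()) {
            int ch = s.codePointAt(pos);
            state = NEXT[state][typeOf(ch)];
            if (state < 0) break;            // until (state < 0)
            pos += Character.charCount(ch);  // the "action": consume the codepoint
        }
        return pos;
    }
}
```

The table has one column per type, not one per codepoint, so its size is independent of the size of the character set.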
Sample Data Structures: ICU
• Trie: CompactArray
– Customized for each datatype
– Automatic expansion
– Compact after setting
• Character Properties
– use CompactArray, Bitpacking
• Inversion List: UnicodeSet
– Boolean Operations
Sample Usage #1: ICU
• Collation
– Trie lookup
– Expanding character: String Key ➠ Value
– Contracting character: Key ➠ String Value
• Break Iterators
– For grapheme, word, line, sentence break
– Statetable
Sample Usage #2: ICU
• Transliteration
– Requires
• Mapping codepoints in context to others
• Rearranging codepoints
• Controlling the choice of mapping
– Character Properties
– Inversion List
– Exception values
Sample Usage #3: ICU
• Character Conversion
– From Unicode to bytes
• Trie
– From bytes to Unicode
• Arrays for simple maps
• Statetables for complex maps
– recognizes valid / invalid mappings
– provides compaction

• Complications
– Invalid vs. Valid mapped vs. Valid unmapped
– Fallbacks
References
• Unicode Open Source — ICU
– http://oss.software.ibm.com/icu
– ICU4j: Java API
– ICU4c: C and C++ APIs
• Other references — see Mark’s website:
– http://www.macchiato.com
Q&A
