[Figure: hashtable memory layout, with boxes labeled overhead, next, hash, …, key, value, and further objects labeled overhead, char1, char2, …]
Trie
• Advantages
– Nearly as fast as array lookup
– Much smaller than arrays or Hashtables
– Take advantage of repetition
• Disadvantages
– Not suited for rapidly changing data
– Best for static, preformed data
Trie structure
[Figure: a Codepoint split into bit fields M1 and M2, an Index array, and a Data … array]
Trie code
• 5 Operations
– Shift, Lookup, Mask, Add, Lookup
v = data[ index[ c >> S1 ] + (c & M2) ]
[Figure: Codepoint bit fields M1 and M2, with shift S1]
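The five operations can be tried out with a minimal sketch. The block size (S1 = 7), the `build_trie` helper, and the sample property are assumptions made for illustration; they are not taken from the slides or from ICU.

```python
S1 = 7                    # shift; each block holds 2**S1 values (assumed size)
M2 = (1 << S1) - 1        # mask selecting the offset within a block

def build_trie(values):
    """Compact a flat list (length a multiple of the block size) into (index, data)."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), M2 + 1):
        block = tuple(values[start:start + M2 + 1])
        if block not in seen:          # identical blocks are stored only once
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])      # offset of this block inside data
    return index, data

def lookup(index, data, c):
    # Shift, lookup, mask, add, lookup.
    return data[index[c >> S1] + (c & M2)]

# Hypothetical property: 1 for ASCII letters, 0 for everything else in the BMP.
flat = [1 if 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A else 0 for c in range(0x10000)]
index, data = build_trie(flat)
```

With this sample property only two distinct blocks survive, so `data` shrinks from 65,536 entries to 256, which is the "take advantage of repetition" point in practice.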
Trie: double indexed
[Figure: a Codepoint split into bit fields M1, M2, and M3]
Trie code: double indexed
b1 = index1[ c >> S1 ]
b2 = index2[ b1 + ((c >> S2) & M2)]
v = data[ b2 + (c & M3) ]
[Figure: Codepoint bit fields M1, M2, and M3, with shifts S1 and S2]
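A double-indexed trie can be built by applying the same block compaction twice, first to the values and then to the resulting index. The sketch below assumes bit widths of S1 = 11 and S2 = 5; the `compact` and `build` helpers and the sample property are hypothetical and only illustrate the lookup shown above.

```python
S1, S2 = 11, 5                    # assumed shifts for a 21-bit codepoint
M2 = (1 << (S1 - S2)) - 1         # mask for the middle bit field
M3 = (1 << S2) - 1                # mask for the low bit field

def compact(values, block_len):
    """Store each distinct block once; return (offsets per block, pooled blocks)."""
    offsets, pool, seen = [], [], {}
    for start in range(0, len(values), block_len):
        block = tuple(values[start:start + block_len])
        if block not in seen:
            seen[block] = len(pool)
            pool.extend(block)
        offsets.append(seen[block])
    return offsets, pool

def build(values):
    stage2, data = compact(values, M3 + 1)      # entries are offsets into data
    index1, index2 = compact(stage2, M2 + 1)    # entries are offsets into index2
    return index1, index2, data

def lookup(index1, index2, data, c):
    b1 = index1[c >> S1]
    b2 = index2[b1 + ((c >> S2) & M2)]
    return data[b2 + (c & M3)]

def prop(c):
    # Hypothetical property values, chosen only to give the trie some repetition.
    if 0x4E00 <= c <= 0x9FFF:
        return 1
    if 0x20000 <= c <= 0x2A6DF:
        return 2
    return 0

flat = [prop(c) for c in range(0x110000)]
index1, index2, data = build(flat)
```

The extra stage costs one more shift, mask, add, and lookup per access, in exchange for a much smaller top-level index when large ranges of codepoints share the same values.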
Inversion List
• Compaction of set of codepoints
• Advantages
– Simple
– Very compact
– Faster write than trie
– Very fast boolean operations
• Disadvantages
– Slower read than trie or hashtable
Inversion List Structure
• Structure
– Index (optional)
– List of codepoints in ascending order
• Example Set
[ 0020-0061, 0135, 19A3-201B ]
Index
0: 0020 in
1: 0062 out
2: 0135 in
3: 0136 out
4: 19A3 in
5: 201C out
Inversion List Example
• Find smallest i such that c < data[i]
– If no i, i = length
• Then c ∈ List ↔ odd(i)
• Examples:
– In: 0023, 0135
– Out: 001A, 0136, A357
Index
0: 0020 in
1: 0062 out
2: 0135 in
3: 0136 out
4: 19A3 in
5: 201C out
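The membership rule maps directly onto a standard binary search. This sketch uses Python's `bisect_right`, which returns the smallest i with c < data[i] (or the list length if there is none); the function name `contains` is an assumption made for illustration.

```python
from bisect import bisect_right

# The example set [ 0020-0061, 0135, 19A3-201B ] as an inversion list.
inversion_list = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]

def contains(inv, c):
    # Smallest i such that c < inv[i]; len(inv) if there is no such i.
    i = bisect_right(inv, c)
    return i % 2 == 1          # c is in the set exactly when i is odd

inside = [contains(inversion_list, c) for c in (0x0023, 0x0135)]
outside = [contains(inversion_list, c) for c in (0x001A, 0x0136, 0xA357)]
```

Here `inside` is all True and `outside` is all False, matching the In and Out examples on the slide.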
Inversion List Operations
• Fast Boolean Operations
• Example: Negation
Index          Index
0: 0020        0: 0000
1: 0062        1: 0020
2: 0135   ➠    2: 0062
3: 0136        3: 0135
4: 19A3        4: 0136
5: 201C        5: 19A3
               6: 201C
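Negation only touches the front of the list: prepend 0000, or remove it if it is already there. The sketch below shows that, plus one possible way to do union and intersection in a single merge pass. The `negate` and `combine` functions are hypothetical names, and the merge is an illustration rather than the ICU implementation.

```python
def negate(inv):
    # Toggling the boundary at 0 flips the in/out state of every range.
    return inv[1:] if inv and inv[0] == 0 else [0] + inv

def combine(a, b, op):
    """Merge two inversion lists; op(in_a, in_b) decides membership of the result.

    Assumes op(False, False) is False, so the result starts in the "out" state.
    """
    result = []
    i = j = 0
    in_a = in_b = in_result = False
    while i < len(a) or j < len(b):
        # The next boundary is the smaller of the two upcoming entries.
        point = min(a[i] if i < len(a) else float("inf"),
                    b[j] if j < len(b) else float("inf"))
        if i < len(a) and a[i] == point:
            in_a = not in_a
            i += 1
        if j < len(b) and b[j] == point:
            in_b = not in_b
            j += 1
        if op(in_a, in_b) != in_result:     # membership changes at this boundary
            in_result = not in_result
            result.append(point)
    return result

example = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
negated = negate(example)
union = combine([0x20, 0x62], [0x50, 0x70], lambda x, y: x or y)
intersection = combine([0x20, 0x62], [0x50, 0x70], lambda x, y: x and y)
```

Both operations run in time linear in the number of boundaries, independent of how many codepoints the ranges cover.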
Inversion List: Binary Search
• from Programming Pearls
• Completely unrolled, precalculated parameters
int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {
    // No break statements: each case deliberately falls through to the next.
    case 21: if (x < data[t = index-0x10000])
                 index = t;
    case 20: if (x < data[t = index-0x8000])
                 index = t;
    …
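The fragment above is elided, so its precalculated parameters (startIndex, auxStart, power) are not reproduced here. As a rough illustration of the same power-of-two idea, the following loop-form sketch picks an end position first and then steps downward by halving powers of two. The function name and the way the starting point is derived are assumptions, not the slide's parameters.

```python
from bisect import bisect_right

def find_first_greater(data, x):
    """Return the smallest i with x < data[i], or len(data) if there is none."""
    n = len(data)
    if n == 0:
        return 0
    p = 1 << (n.bit_length() - 1)      # largest power of two <= n
    # Choose a window of p candidate answers that ends at `index`.
    if x < data[p - 1]:
        index = p - 1                  # answer lies in [0, p - 1]
    else:
        index = n                      # answer lies in [n - p + 1, n]
    half = p >> 1
    while half > 0:                    # an unrolled version has one step per power
        t = index - half
        if x < data[t]:
            index = t
        half >>= 1
    return index

data = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
positions = [find_first_greater(data, c) for c in (0x001A, 0x0023, 0x0135, 0xA357)]
```

Because the step sizes depend only on the list length, they can be computed once when the list is built, which is what makes complete unrolling possible.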
Inversion Map
• Inversion List plus
• Associated Values
– Lookup index just as in Inversion List
– Take corresponding value
Index
0: 0020
1: 0062
2: 0135
3: 0136
4: 19A3
5: 201C
Values
0: 0
1: 5
2: 3
3: 9
4: 8
5: 3
6: 0
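A minimal sketch of the inversion map lookup, using the boundaries and values from the slide. The names `boundaries`, `values`, and `map_lookup` are assumptions made for illustration.

```python
from bisect import bisect_right

boundaries = [0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C]
values = [0, 5, 3, 9, 8, 3, 0]      # one more value than there are boundaries

def map_lookup(c):
    # Same index search as the inversion list, then take the corresponding value.
    return values[bisect_right(boundaries, c)]

samples = [map_lookup(c) for c in (0x001A, 0x0023, 0x0062, 0x0135, 0x0136, 0x19A3, 0xA357)]
```

The values list needs one entry per range, including the range before the first boundary and the range after the last, which is why it has seven entries for six boundaries.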
Key ➠ String Value
• Problem
– Often almost all values are 1 codepoint
– But, must map to strings in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate extra processing
– Can use same solution for UTF-16 code units
Example
• Get a character ch
• Find its value v
• If v is in [D800..E000], may be string
– check v2 = valueException[v - D800]
– if v2 not null, process it, continue
• Process v
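The steps above can be sketched as follows, using uppercasing as a hypothetical use case. The dictionary stands in for whatever single-value lookup structure is used (for example, a trie); the table contents and the names `simple`, `value_exception`, and `map_char` are assumptions made for illustration.

```python
EXC_BASE, EXC_LIMIT = 0xD800, 0xE000      # exception range from the slide

# Hypothetical exception table: the few values that are strings.
value_exception = ["SS"]

# Hypothetical single-codepoint mapping; a dict stands in for the trie lookup.
simple = {c: c - 0x20 for c in range(ord("a"), ord("z") + 1)}
simple[0x00DF] = EXC_BASE + 0             # U+00DF maps to exception entry 0

def map_char(ch):
    v = simple.get(ord(ch), ord(ch))      # find the value v for ch
    if EXC_BASE <= v < EXC_LIMIT:         # v may stand for a string
        slot = v - EXC_BASE
        v2 = value_exception[slot] if slot < len(value_exception) else None
        if v2 is not None:
            return v2                     # process the string and continue
    return chr(v)                         # ordinary case: a single codepoint

result = "".join(map_char(ch) for ch in "stra\u00dfe")
```

The common case pays only one range check; the string table is consulted only for the few values that fall in the exception range.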
String Key ➠ Value
• Problem
– Often almost all keys are 1 codepoint
– Must have string keys in a few cases
– Don’t want overhead for strings always
• Solution
– Exception values indicate possible follow-on codepoints
– Can use same solution for UTF-16 code units
– Use key closure!
Closure
Before          After
s ➠ x           s ➠ x
sh ➠ y     ➠    sh ➠ y
shch ➠ z        shch ➠ z
c ➠ w           c ➠ w
                shc ➠ yw
Why Closure?
Input: s h c h a …
s ➠ x
sh ➠ y
shc ➠ yw
shch ➠ z
shcha ➠ not found, use last (z)
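Closure lets a matcher extend the key one codepoint at a time and stop at the first miss, because every prefix of a key is itself a key. The sketch below builds the closure for the table on the slide and then matches incrementally. The helper names are hypothetical, and defining a missing prefix's value as the greedy longest-match mapping of that prefix is an assumption consistent with the shc ➠ yw entry.

```python
def longest_match_map(table, text):
    """Reference mapping: at each position try the longest key first."""
    out, i = [], 0
    max_len = max(map(len, table))
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in table:
                out.append(table[text[i:i + n]])
                i += n
                break
        else:
            out.append(text[i])        # unmapped characters pass through
            i += 1
    return "".join(out)

def close_keys(table):
    """Add every missing proper prefix of every key."""
    closed = dict(table)
    for key in table:
        for n in range(1, len(key)):
            prefix = key[:n]
            if prefix not in closed:
                closed[prefix] = longest_match_map(table, prefix)
    return closed

def map_with_closure(closed, text):
    out, i = [], 0
    while i < len(text):
        n, last = 0, None
        # Extend the key one character at a time; stop at the first miss.
        while i + n < len(text) and text[i:i + n + 1] in closed:
            n += 1
            last = closed[text[i:i + n]]
        if last is None:
            out.append(text[i])
            i += 1
        else:
            out.append(last)           # not found: use the last value seen
            i += n
    return "".join(out)

table = {"s": "x", "sh": "y", "shch": "z", "c": "w"}
closed = close_keys(table)
```

Without the added shc entry, the incremental matcher would stop at sh when reading "shcha" and never reach shch.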
Bitpacking
Statetables
• Classic:
– entry = stateTable[ state, ch ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
• Unicode:
– type = trie[ch];
– entry = stateTable[ state, type ];
– state = entry.state;
– doSomethingWith( entry.action );
– until (state < 0);
• Also, String Key ➠ Value
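A minimal sketch of the Unicode variant: characters are first reduced to a small number of types, so the state table needs one column per type rather than one per codepoint. The types, the table contents, and the function names are hypothetical, and `char_type` stands in for the `trie[ch]` lookup.

```python
LETTER, DIGIT, OTHER = 0, 1, 2
STOP = -1                               # any negative state ends the loop

def char_type(ch):
    # Stand-in for `type = trie[ch]`; a real table would be precomputed.
    if ch.isalpha():
        return LETTER
    if ch.isdigit():
        return DIGIT
    return OTHER

# state_table[state][type] -> (next state, action)
state_table = [
    # LETTER        DIGIT          OTHER
    [(1, "take"), (STOP, None), (STOP, None)],   # state 0: start
    [(1, "take"), (1, "take"),  (STOP, None)],   # state 1: inside an identifier
]

def scan_identifier(text):
    """Collect a leading run of letters and digits that starts with a letter."""
    state, taken = 0, []
    for ch in text:
        state, action = state_table[state][char_type(ch)]
        if action == "take":
            taken.append(ch)
        if state < 0:
            break
    return "".join(taken)

token = scan_identifier("ab12 cd")
```

Here the table has three columns instead of one per codepoint, and adding a new character to a class changes only the type lookup, not the state table.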
Sample Data Structures: ICU
• Trie: CompactArray
– Customized for each datatype
– Automatic expansion
– Compact after setting
• Character Properties
– use CompactArray, Bitpacking
• Inversion List: UnicodeSet
– Boolean Operations
Sample Usage #1: ICU
• Collation
– Trie lookup
– Expanding character: String Key ➠ Value
– Contracting character: Key ➠ String Value
• Break Iterators
– For grapheme, word, line, sentence break
– Statetable
Sample Usage #2: ICU
• Transliteration
– Requires
• Mapping codepoints in context to others
• Rearranging codepoints
• Controlling the choice of mapping
– Character Properties
– Inversion List
– Exception values
Sample Usage #3: ICU
• Character Conversion
– From Unicode to bytes
• Trie
– From bytes to Unicode
• Arrays for simple maps
• Statetables for complex maps
– recognizes valid / invalid mappings
– provides compaction
• Complications
– Invalid vs. Valid mapped vs. Valid unmapped
– Fallbacks
References
• Unicode Open Source — ICU
– http://oss.software.ibm.com/icu
– ICU4j: Java API
– ICU4c: C and C++ APIs
• Other references — see Mark’s website:
– http://www.macchiato.com
Q&A