#!/usr/bin/python

'''
A Deterministic acyclic finite state automaton (DAFSA) is a compact
representation of an unordered word list (dictionary).

http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

This Python program converts a list of strings to a byte array in C++. It
fetches strings and return values from a gperf file and generates a C++ file
with a byte array representing a graph that can be used as a memory-efficient
replacement for the perfect hash table.

The input strings are assumed to consist of printable 7-bit ASCII characters
and the return values are assumed to be one-digit integers.

In this program a DAFSA is a diamond-shaped graph starting at a common
source node and ending at a common sink node. All internal nodes contain
a label and each word is represented by the labels in one path from
the source node to the sink node.

The following Python representation is used for nodes:

  Source node: [ children ]
  Internal node: (label, [ children ])
  Sink node: None

The graph is first compressed by prefixes like a trie. In the next step
suffixes are compressed so that the graph gets diamond-shaped. Finally
one-to-one linked nodes are replaced by nodes with the labels joined.

The order of the operations is crucial since lookups will be performed
starting from the source with no backtracking. Thus a node must have at
most one child with a label starting with the same character. The output
is also arranged so that all jumps are to increasing addresses, thus forward
in memory.

The generated output has suffix-free decoding: the leading bits in a link
(a reference to a child node) indicate whether it has a size of one, two or
three bytes and whether it is the last outgoing link from the current node.
A node label is terminated by a byte with the leading bit set.

The generated byte array can be described by the following BNF:

<byte> ::= < 8-bit value in range [0x00-0xFF] >

<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
<return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >

<offset1> ::= < byte in range [0x00-0x3F] >
<offset2> ::= < byte in range [0x40-0x5F] >
<offset3> ::= < byte in range [0x60-0x7F] >

<end_offset1> ::= < byte in range [0x80-0xBF] >
<end_offset2> ::= < byte in range [0xC0-0xDF] >
<end_offset3> ::= < byte in range [0xE0-0xFF] >

<label> ::= <end_char>
          | <char> <label>

<end_label> ::= <return_value>
              | <char> <end_label>

<offset> ::= <offset1>
           | <offset2> <byte>
           | <offset3> <byte> <byte>

<end_offset> ::= <end_offset1>
               | <end_offset2> <byte>
               | <end_offset3> <byte> <byte>

<offsets> ::= <end_offset>
            | <offset> <offsets>

<source> ::= <offsets>

<node> ::= <label> <offsets>
         | <end_label>

<dafsa> ::= <source>
          | <dafsa> <node>

Decoding:

<char> -> printable 7-bit ASCII character
<end_char> & 0x7F -> printable 7-bit ASCII character
<return_value> & 0x0F -> integer
<offset1> & 0x3F -> integer
((<offset2> & 0x1F) << 8) + <byte> -> integer
((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer

end_offset1, end_offset2 and end_offset3 are decoded the same as offset1,
offset2 and offset3 respectively.

The first offset in a list of offsets is the distance in bytes between the
offset itself and the first child node. Subsequent offsets are the distance
between the previous child node and the next child node. Thus each offset
links a node to a child node. The distance is always counted between start
addresses, i.e. the first byte of the decoded offset or the first byte of the
child node.

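For illustration only, a consumer of the byte array could decode it roughly as
sketched below. These helpers are hypothetical and not part of this script;
they just follow the format described above and return the value stored for a
key, or None if the key is not in the set:

  def Children(dafsa, pos):
    # Yield the positions of all child nodes linked from the offset list
    # starting at pos.
    first = True
    child = 0
    while True:
      low = dafsa[pos] & 0x7F
      if low < 0x40:    # <offset1>
        size, distance = 1, low
      elif low < 0x60:  # <offset2> <byte>
        size, distance = 2, ((low & 0x1F) << 8) | dafsa[pos + 1]
      else:             # <offset3> <byte> <byte>
        size, distance = 3, (((low & 0x1F) << 16) | (dafsa[pos + 1] << 8) |
                             dafsa[pos + 2])
      child = (pos if first else child) + distance
      first = False
      yield child
      if dafsa[pos] & 0x80:  # An <end_offset*> byte ends the link list.
        return
      pos += size

  def Lookup(dafsa, key):
    def Match(pos, key):
      # Consume the label of the node at pos, then try its children.
      while dafsa[pos] < 0x80:             # <char>
        if not key or key[0] != chr(dafsa[pos]):
          return None
        key, pos = key[1:], pos + 1
      if dafsa[pos] < 0x90:                # <return_value>
        return dafsa[pos] & 0x0F if not key else None
      if not key or key[0] != chr(dafsa[pos] & 0x7F):
        return None                        # <end_char> mismatch.
      for child in Children(dafsa, pos + 1):
        value = Match(child, key[1:])
        if value is not None:
          return value
      return None
    for child in Children(dafsa, 0):       # <source> is just an offset list.
      value = Match(child, key)
      if value is not None:
        return value
    return None

With the seven byte array from Example 1 below, Lookup(dafsa, 'aa') would
return 1, Lookup(dafsa, 'a') would return 2 and Lookup(dafsa, 'ab') would
return None.
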
Example 1:

%%
aa, 1
a, 2
%%

The input is first parsed to a list of words:
["aa1", "a2"]

A fully expanded graph is created from the words:
source = [node1, node4]
node1 = ("a", [node2])
node2 = ("a", [node3])
node3 = ("\x01", [sink])
node4 = ("a", [node5])
node5 = ("\x02", [sink])
sink = None

Compression results in the following graph:
source = [node1]
node1 = ("a", [node2, node3])
node2 = ("\x02", [sink])
node3 = ("a\x01", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[7] = {
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
};

The bytes in the generated array have the following meaning:

0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1

1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a"
2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4

3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5
4: 0x82 <return_value> 0x82 & 0x0F -> return 2

5: 0x61 <char> label character 0x61 -> match "a"
6: 0x81 <return_value> 0x81 & 0x0F -> return 1

Example 2:

%%
aa, 1
bbb, 2
baa, 1
%%

The input is first parsed to a list of words:
["aa1", "bbb2", "baa1"]

Compression results in the following graph:
source = [node1, node2]
node1 = ("b", [node2, node3])
node2 = ("aa\x01", [sink])
node3 = ("bb\x02", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[11] = {
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
};

The bytes in the generated array have the following meaning:

0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2
1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5

2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b"
3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5
4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8

5: 0x61 <char> label character 0x61 -> match "a"
6: 0x61 <char> label character 0x61 -> match "a"
7: 0x81 <return_value> 0x81 & 0x0F -> return 1

8: 0x62 <char> label character 0x62 -> match "b"
9: 0x62 <char> label character 0x62 -> match "b"
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
'''

import sys

def ToDafsa(words):
  '''
  Generate a DAFSA from a word list and return the source node

  Each word is split into characters so that each character is represented by
  a unique node. It is assumed the word list is not empty.
  '''
  assert(words)
  def ToNodes(word):
    '''Split words into characters'''
    # Must be printable 7-bit ASCII.
    assert(0x1F < ord(word[0]) < 0x80)
    if len(word) == 1:
      # The last character is the return value digit; store it as \x00-\x0F.
      return chr(ord(word[0]) & 0x0F), [None]
    return word[0], [ToNodes(word[1:])]
  return [ToNodes(word) for word in words]
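# Example (illustrative): ToDafsa(['a2']) == [('a', [('\x02', [None])])], i.e.
# the word "a" with return value 2 becomes a two node chain ending in the sink.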

def ToWords(node):
  '''
  Generate a word list from all paths starting from an internal node
  '''
  if node is None:
    return ['']
  return [(node[0] + word) for child in node[1] for word in ToWords(child)]
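# Example (illustrative): ToWords(('a', [('\x02', [None])])) == ['a\x02'].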

def Reverse(dafsa):
  '''
  Generate a new DAFSA that is reversed, so that the old sink node becomes the
  new source node.
  '''
  def Dfs(node, parent):
    '''Create new nodes'''
    if node is None:
      sink.append(parent)
    elif id(node) not in nodemap:
      nodemap[id(node)] = (node[0][::-1], [parent])
      for child in node[1]:
        Dfs(child, nodemap[id(node)])
    else:
      nodemap[id(node)][1].append(parent)

  sink = []
  nodemap = {}
  for node in dafsa:
    Dfs(node, None)
  return sink
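# Example (illustrative): Reverse([('a', [('b\x01', [None])])]) ==
# [('\x01b', [('a', [None])])]; labels are reversed and all edges now point
# from the old sink towards the old source.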

def JoinLabels(dafsa):
  '''
  Generate a new DAFSA where internal nodes are merged if there is a
  one-to-one connection.
  '''
  def CountParents(node):
    '''Count incoming references'''
    if id(node) in parentcount:
      parentcount[id(node)] += 1
    else:
      parentcount[id(node)] = 1
      for child in node[1]:
        CountParents(child)

  def Join(node):
    '''Create new nodes'''
    if id(node) not in nodemap:
      children = [Join(child) for child in node[1]]
      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
        child = children[0]
        nodemap[id(node)] = (node[0] + child[0], child[1])
      else:
        nodemap[id(node)] = (node[0], children)
    return nodemap[id(node)]

  # The sink is seeded with two parents so that it is never merged into a node.
  parentcount = { id(None): 2 }
  for node in dafsa:
    CountParents(node)
  nodemap = { id(None): None }
  return [Join(node) for node in dafsa]
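# Example (illustrative): JoinLabels([('a', [('b', [None])])]) ==
# [('ab', [None])]; the one-to-one linked "a" and "b" nodes are merged.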

def JoinSuffixes(dafsa):
  '''
  Generate a new DAFSA where nodes that represent the same word lists towards
  the sink are merged.
  '''
  def Join(node):
    '''Return a matching node. A new node is created if no matching node
    exists. The graph is accessed in DFS order.
    '''
    suffixes = frozenset(ToWords(node))
    if suffixes not in nodemap:
      nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
    return nodemap[suffixes]

  nodemap = { frozenset(('',)): None }
  return [Join(node) for node in dafsa]
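# Example (illustrative): in JoinSuffixes([('a', [('c\x01', [None])]),
# ('b', [('c\x01', [None])])]) the two distinct ('c\x01', [None]) children
# represent the same suffix set and are replaced by one shared node.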

def TopSort(dafsa):
  '''Generate list of nodes in topological sort order'''
  def CountIncoming(node):
    '''Count incoming references'''
    if node is not None:
      if id(node) not in incoming:
        incoming[id(node)] = 1
        for child in node[1]:
          CountIncoming(child)
      else:
        incoming[id(node)] += 1

  incoming = {}
  for node in dafsa:
    CountIncoming(node)

  # The references from the source are not real graph edges; remove them.
  for node in dafsa:
    incoming[id(node)] -= 1

  waiting = [node for node in dafsa if incoming[id(node)] == 0]
  nodes = []

  while waiting:
    node = waiting.pop()
    assert(incoming[id(node)] == 0)
    nodes.append(node)
    for child in node[1]:
      if child is not None:
        incoming[id(child)] -= 1
        if incoming[id(child)] == 0:
          waiting.append(child)
  return nodes
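# Note (illustrative): this is Kahn's algorithm; parents always precede their
# children in the returned list, so Encode() below can emit nodes in reverse
# and keep every link pointing forward in the final byte array.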

def EncodeLinks(children, offsets, current):
  '''Encode a list of children as one, two or three byte offsets'''
  if children[0] is None:
    # This is an <end_label> node; no links follow such nodes.
    assert(len(children) == 1)
    return []
  guess = 3 * len(children)
  assert(children)
  # The size of each link depends on its distance, and the distances depend on
  # the total size of the link list, so iterate until the size guess is stable.
  while True:
    offset = current + guess
    buf = []
    for child in sorted(children, key = lambda x: -offsets[id(x)]):
      last = len(buf)
      distance = offset - offsets[id(child)]
      assert(distance > 0 and distance < (1 << 21))

      if distance < (1 << 6):
        # A 6-bit offset: "s0xxxxxx"
        buf.append(distance)
      elif distance < (1 << 13):
        # A 13-bit offset: "s10xxxxxxxxxxxxx"
        buf.append(0x40 | (distance >> 8))
        buf.append(distance & 0xFF)
      else:
        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
        buf.append(0x60 | (distance >> 16))
        buf.append((distance >> 8) & 0xFF)
        buf.append(distance & 0xFF)
      # Distance in the first link is relative to the following record.
      # Distances in the other links are relative to the previous link.
      offset -= distance
    if len(buf) == guess:
      break
    guess = len(buf)
  # Set most significant bit to mark end of links in this node.
  buf[last] |= (1 << 7)
  buf.reverse()
  return buf

def EncodeLabel(label):
  '''
  Encode a node label as a list of bytes with the high bit set on the trailing
  byte.
  '''
  assert(label)
  buf = [ord(c) for c in label]
  buf.reverse()
  # Set most significant bit to mark end of label in this node.
  buf[0] |= (1 << 7)
  return buf
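# Example (illustrative): EncodeLabel('aa\x01') == [0x81, 0x61, 0x61]. The
# bytes are emitted back to front because Encode() reverses the whole stream
# at the end.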

def Encode(dafsa):
  '''Encode a DAFSA to a list of bytes'''
  output = []
  offsets = {}

  for node in reversed(TopSort(dafsa)):
    output += EncodeLinks(node[1], offsets, len(output))
    output += EncodeLabel(node[0])
    offsets[id(node)] = len(output)

  output += EncodeLinks(dafsa, offsets, len(output))
  output.reverse()
  return output
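# For reference, encoding the compressed graph from Example 1 in the module
# docstring should yield the seven bytes listed there.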

def ToCxx(data):
  '''Generate C++ code from a list of encoded bytes'''
  text = '/* This file is generated. Don\'t edit!\n\n'
  text += ' The following command was used to generate the file:\n\n'
  text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
  text += '*/\n\n'
  text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
  for i in range(0, len(data), 12):
    text += '  '
    text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
    text += ',\n'
  text += '};\n'
  return text

def WordsToCxx(words):
  '''Generate C++ code from a word list'''
  dafsa = ToDafsa(words)
  # Reverse + JoinSuffixes + Reverse compresses shared prefixes (cf. the module
  # docstring), the second JoinSuffixes compresses shared suffixes and
  # JoinLabels finally collapses one-to-one linked nodes.
  for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
    dafsa = fun(dafsa)
  byte_values = Encode(dafsa)
  return ToCxx(byte_values)

def ParseGperf(infile):
  '''Parse a gperf file and extract the strings and return values'''
  lines = [line.strip() for line in infile]
  # Extract strings after the first '%%' and before the second '%%'.
  begin = lines.index('%%') + 1
  end = lines.index('%%', begin)
  lines = lines[begin:end]
  for line in lines:
    assert(line[-3:-1] == ', ')
    assert(line[-1] in '01234')
  return [line[:-3] + line[-1] for line in lines]
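# Example (illustrative): the gperf lines '%%', 'aa, 1', 'a, 2', '%%' parse to
# ['aa1', 'a2'], matching Example 1 in the module docstring.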

if __name__ == '__main__':
  if len(sys.argv) != 3:
    print('usage: %s infile outfile' % sys.argv[0])
    sys.exit(-1)
  INFILE = open(sys.argv[1], 'r')
  OUTFILE = open(sys.argv[2], 'w')
  OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))
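# Example invocation (hypothetical file names):
#   python this_script.py words.gperf words-inc.cc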