Chromium Code Reviews| OLD | NEW |
|---|---|
| (Empty) | |
| 1 #!/usr/bin/python | |
|
M-A Ruel
2014/04/23 12:26:18
Please take a look at other scripts around of get
Pam (message me for reviews)
2014/04/23 13:13:56
Python coding style is described at http://www.chr
| |
| 2 | |
| 3 ''' | |
| 4 A Deterministic acyclic finite state automaton (DAFSA) is a compact | |
| 5 representation of an unordered word list (dictionary). | |
| 6 | |
| 7 http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton | |
| 8 | |
| 9 This python program converts a list of strings to a byte array in C++. | |
| 10 This python program fetches strings and return values from a gperf file | |
| 11 and generates a C++ file with a byte array representing graph that can be | |
| 12 used as a memory efficient replacement for the perfect hash table. | |
| 13 | |
| 14 The input strings are assumed to consist of printable 7-bit ASCII characters | |
|
M-A Ruel
2014/04/24 12:45:19
Why not just enforce it and print an error message
Olle Liljenzin
2014/04/29 12:41:52
It is enforced, lines 205 and 431-432. But it has
| |
| 15 and the return values are assumed to be one digit integers. | |
| 16 | |
| 17 In this program a DAFSA is a diamond shaped graph starting at a common | |
| 18 source node and ending at a common sink node. All internal nodes contain | |
| 19 a label and each word is represented by the labels in one path from | |
| 20 the source node to the sink node. | |
| 21 | |
| 22 The following python represention is used for nodes: | |
| 23 | |
| 24 Source node: [ children ] | |
| 25 Internal node: (label, [ children ]) | |
| 26 Sink node: None | |
| 27 | |
| 28 The graph is first compressed by prefixes like a trie. In the next step | |
| 29 suffixes are compressed so that the graph gets diamond shaped. Finally | |
| 30 one to one linked nodes are replaced by nodes with the labels joined. | |
| 31 | |
| 32 The order of the operations is crucial since lookups will be performed | |
| 33 starting from the source with no backtracking. Thus a node must have at | |
| 34 most one child with a label starting by the same character. The output | |
| 35 is also arranged so that all jumps are to increasing addresses, thus forward | |
| 36 in memory. | |
| 37 | |
| 38 The generated output has suffix free decoding so that the sign of leading | |
| 39 bits in a link (a reference to a child node) indicate if it has a size of one, | |
| 40 two or three bytes and if it is the last outgoing link from the actual node. | |
| 41 A node label is terminated by a byte with the leading bit set. | |
| 42 | |
| 43 The generated byte array can described by the following BNF: | |
| 44 | |
| 45 <byte> ::= < 8-bit value in range [0x00-0xFF] > | |
| 46 | |
| 47 <char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] > | |
| 48 <end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] > | |
| 49 <return value> ::= < value + 0x80, byte in range [0x80-0x8F] > | |
| 50 | |
| 51 <offset1> ::= < byte in range [0x00-0x3F] > | |
| 52 <offset2> ::= < byte in range [0x40-0x5F] > | |
| 53 <offset3> ::= < byte in range [0x60-0x7F] > | |
| 54 | |
| 55 <end_offset1> ::= < byte in range [0x80-0xBF] > | |
| 56 <end_offset2> ::= < byte in range [0xC0-0xDF] > | |
| 57 <end_offset3> ::= < byte in range [0xE0-0xFF] > | |
| 58 | |
| 59 <prefix> ::= <char> | |
| 60 | |
| 61 <label> ::= <end_char> | |
| 62 | <char> <label> | |
| 63 | |
| 64 <end_label> ::= <return_value> | |
| 65 | <char> <end_label> | |
| 66 | |
| 67 <offset> ::= <offset1> | |
| 68 | <offset2> <byte> | |
| 69 | <offset3> <byte> <byte> | |
| 70 | |
| 71 <end_offset> ::= <end_offset1> | |
| 72 | <end_offset2> <byte> | |
| 73 | <end_offset3> <byte> <byte> | |
| 74 | |
| 75 <offsets> ::= <end_offset> | |
| 76 | <offset> <offsets> | |
| 77 | |
| 78 <source> ::= <offsets> | |
| 79 | |
| 80 <node> ::= <label> <offsets> | |
| 81 | <prefix> <node> | |
| 82 | <end_label> | |
| 83 | |
| 84 <dafsa> ::= <source> | |
| 85 | <dafsa> <node> | |
| 86 | |
| 87 Decoding: | |
| 88 | |
| 89 <char> -> printable 7-bit ASCII character | |
| 90 <end_char> & 0x7F -> printable 7-bit ASCII character | |
| 91 <return value> & 0x0F -> integer | |
| 92 <offset1 & 0x3F> -> integer | |
| 93 ((<offset2> & 0x1F>) << 8) + <byte> -> integer | |
| 94 ((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer | |
| 95 | |
| 96 end_offset1, end_offset2 and and_offset3 are decoded same as offset1, | |
| 97 offset2 and offset3 respectively. | |
| 98 | |
| 99 The first offset in a list of offsets is the distance in bytes between the | |
| 100 offset itself and the first child node. Subsequent offsets are the distance | |
| 101 between previous child node and next child node. Thus each offset links a node | |
| 102 to a child node. The distance is always counted between start addresses, i.e. | |
| 103 first byte in decoded offset or first byte in child node. | |
| 104 | |
| 105 Example 1: | |
| 106 | |
| 107 %% | |
| 108 aa, 1 | |
| 109 a, 2 | |
| 110 %% | |
| 111 | |
| 112 The input is first parsed to a list of words: | |
| 113 ["aa1", "a2"] | |
| 114 | |
| 115 A fully expanded graph is created from the words: | |
| 116 source = [node1, node4] | |
| 117 node1 = ("a", [node2]) | |
| 118 node2 = ("a", [node3]) | |
| 119 node3 = ("\x01", [sink]) | |
| 120 node4 = ("a", [node5]) | |
| 121 node5 = ("\x02", [sink]) | |
| 122 sink = None | |
| 123 | |
| 124 Compression results in the following graph: | |
| 125 source = [node1] | |
| 126 node1 = ("a", [node2, node3]) | |
| 127 node2 = ("\x02", [sink]) | |
| 128 node3 = ("a\x01", [sink]) | |
| 129 sink = None | |
| 130 | |
| 131 A C++ representation of the compressed graph is generated: | |
| 132 | |
| 133 const unsigned char dafsa[7] = { | |
| 134 0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81, | |
| 135 }; | |
| 136 | |
| 137 The bytes in the generated array has the following meaning: | |
| 138 | |
| 139 0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1 | |
| 140 | |
| 141 1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a" | |
| 142 2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4 | |
| 143 | |
| 144 3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5 | |
| 145 4: 0x82 <return_value> 0x82 & 0x0F -> return 2 | |
| 146 | |
| 147 5: 0x61 <char> label character 0x61 -> match "a" | |
| 148 6: 0x81 <return_value> 0x81 & 0x0F -> return 1 | |
| 149 | |
| 150 Example 2: | |
| 151 | |
| 152 %% | |
| 153 aa, 1 | |
| 154 bbb, 2 | |
| 155 baa, 1 | |
| 156 %% | |
| 157 | |
| 158 The input is first parsed to a list of words: | |
| 159 ["aa1", "bbb2", "baa1"] | |
| 160 | |
| 161 Compression results in the following graph: | |
| 162 source = [node1, node2] | |
| 163 node1 = ("b", [node2, node3]) | |
| 164 node2 = ("aa\x01", [sink]) | |
| 165 node3 = ("bb\x02", [sink]) | |
| 166 sink = None | |
| 167 | |
| 168 A C++ representation of the compressed graph is generated: | |
| 169 | |
| 170 const unsigned char dafsa[11] = { | |
| 171 0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82, | |
| 172 }; | |
| 173 | |
| 174 The bytes in the generated array has the following meaning: | |
| 175 | |
| 176 0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2 | |
| 177 1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5 | |
| 178 | |
| 179 2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b" | |
| 180 3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5 | |
| 181 4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8 | |
| 182 | |
| 183 5: 0x61 <char> label character 0x61 -> match "a" | |
| 184 6: 0x61 <char> label character 0x61 -> match "a" | |
| 185 7: 0x81 <return_value> 0x81 & 0x0F -> return 1 | |
| 186 | |
| 187 8: 0x62 <char> label character 0x62 -> match "b" | |
| 188 9: 0x62 <char> label character 0x62 -> match "b" | |
| 189 10: 0x82 <return_value> 0x82 & 0x0F -> return 2 | |
| 190 ''' | |
| 191 | |
| 192 import sys | |
| 193 | |
| 194 def ToDafsa(words): | |
|
M-A Ruel
2014/04/23 12:26:18
Correct me if I'm wrong, but nowhere you describe
Olle Liljenzin
2014/04/24 09:30:00
It returns a source node described on line 24.
Cl
M-A Ruel
2014/04/24 12:45:19
What's the relative perf impact of using collectio
| |
| 195 ''' | |
| 196 Generate a DAFSA from a word list and return the source node | |
| 197 | |
| 198 Each word is split into characters so that each character is represented by | |
| 199 a unique node. It is assumed the word list is not empty. | |
| 200 ''' | |
| 201 assert(words) | |
|
M-A Ruel
2014/04/23 12:26:18
So the script will throw on an empty file?
Olle Liljenzin
2014/04/24 09:30:00
The output format doesn't support the empty graph.
| |
| 202 def ToNodes(word): | |
| 203 '''Split words into characters''' | |
| 204 # Must be printable 7-bit ASCII. | |
| 205 assert(0x1F < ord(word[0]) < 0x80) | |
| 206 if len(word) == 1: | |
| 207 return chr(ord(word[0]) & 0x0F), [None] | |
| 208 return word[0], [ToNodes(word[1:])] | |
| 209 return [ToNodes(word) for word in words] | |
| 210 | |
| 211 def ToWords(node): | |
| 212 ''' | |
| 213 Generate a word list from all paths starting from an internal node | |
| 214 ''' | |
| 215 if node is None: | |
|
M-A Ruel
2014/04/23 12:26:18
if not node:
(everywhere)
| |
| 216 return [''] | |
| 217 return [(node[0] + word) for child in node[1] for word in ToWords(child)] | |
| 218 | |
| 219 def Reverse(dafsa): | |
| 220 ''' | |
| 221 Generate a new DAFSA that is reversed, so that the old sink node becomes the | |
| 222 new source node. | |
| 223 ''' | |
| 224 def Dfs(node, parent): | |
| 225 '''Create new nodes''' | |
|
Pam (message me for reviews)
2014/04/23 13:13:56
This docstring could be more descriptive.
| |
| 226 if node is None: | |
| 227 sink.append(parent) | |
| 228 elif id(node) not in nodemap: | |
| 229 nodemap[id(node)] = (node[0][::-1], [parent]) | |
| 230 for child in node[1]: | |
| 231 Dfs(child, nodemap[id(node)]) | |
| 232 else: | |
| 233 nodemap[id(node)][1].append(parent) | |
| 234 | |
| 235 sink = [] | |
|
M-A Ruel
2014/04/23 12:26:18
Move closure variables before the function using t
| |
| 236 nodemap = {} | |
| 237 for node in dafsa: | |
| 238 Dfs(node, None) | |
| 239 return sink | |
| 240 | |
| 241 def JoinLabels(dafsa): | |
| 242 ''' | |
| 243 Generate a new DAFSA where internal nodes are merged if there is a one to | |
| 244 one connection. | |
| 245 ''' | |
| 246 def CountParents(node): | |
| 247 '''Count incoming references''' | |
| 248 if id(node) in parentcount: | |
| 249 parentcount[id(node)] += 1 | |
| 250 else: | |
| 251 parentcount[id(node)] = 1 | |
| 252 for child in node[1]: | |
| 253 CountParents(child) | |
| 254 | |
| 255 def Join(node): | |
| 256 '''Create new nodes''' | |
| 257 if id(node) not in nodemap: | |
| 258 children = [Join(child) for child in node[1]] | |
| 259 if len(children) == 1 and parentcount[id(node[1][0])] == 1: | |
| 260 child = children[0] | |
| 261 nodemap[id(node)] = (node[0] + child[0], child[1]) | |
| 262 else: | |
| 263 nodemap[id(node)] = (node[0], children) | |
| 264 return nodemap[id(node)] | |
| 265 | |
| 266 parentcount = { id(None): 2 } | |
| 267 for node in dafsa: | |
| 268 CountParents(node) | |
| 269 nodemap = { id(None): None } | |
| 270 return [Join(node) for node in dafsa] | |
| 271 | |
| 272 def JoinSuffixes(dafsa): | |
| 273 ''' | |
| 274 Generate a new DAFSA where nodes that represent the same word lists towards | |
| 275 the sink are merged. | |
| 276 ''' | |
| 277 def Join(node): | |
| 278 '''Return a macthing node. A new node is created if not matching node | |
|
Pam (message me for reviews)
2014/04/23 13:13:56
typos "matching"; "if no matching node"
| |
| 279 exists. The graph is accessed in dfs order. | |
| 280 ''' | |
| 281 suffixes = frozenset(ToWords(node)) | |
| 282 if suffixes not in nodemap: | |
| 283 nodemap[suffixes] = (node[0], [Join(child) for child in node[1]]) | |
| 284 return nodemap[suffixes] | |
| 285 | |
| 286 nodemap = { frozenset(('',)): None } | |
| 287 return [Join(node) for node in dafsa] | |
| 288 | |
| 289 def TopSort(dafsa): | |
| 290 '''Generate list of nodes in topological sort order''' | |
| 291 def CountIncoming(node): | |
| 292 '''Count incoming references''' | |
| 293 if node is not None: | |
| 294 if id(node) not in incoming: | |
| 295 incoming[id(node)] = 1 | |
| 296 for child in node[1]: | |
| 297 CountIncoming(child) | |
| 298 else: | |
| 299 incoming[id(node)] += 1 | |
| 300 | |
| 301 incoming = {} | |
| 302 for node in dafsa: | |
| 303 CountIncoming(node) | |
| 304 | |
| 305 for node in dafsa: | |
| 306 incoming[id(node)] -= 1 | |
| 307 | |
| 308 waiting = [node for node in dafsa if incoming[id(node)] == 0] | |
| 309 nodes = [] | |
| 310 | |
| 311 while waiting: | |
| 312 node = waiting.pop() | |
| 313 assert(incoming[id(node)] == 0) | |
| 314 nodes.append(node) | |
| 315 for child in node[1]: | |
| 316 if child is not None: | |
| 317 incoming[id(child)] -= 1 | |
| 318 if incoming[id(child)] == 0: | |
| 319 waiting.append(child) | |
| 320 return nodes | |
| 321 | |
| 322 def EncodeLinks(children, offsets, current): | |
| 323 '''Encode a list of children as one, two or three byte offsets''' | |
| 324 if children[0] is None: | |
| 325 # This is an <end_label> node and no links are follow in such nodes | |
|
Pam (message me for reviews)
2014/04/23 13:13:56
language nit: "no links follow such nodes"
| |
| 326 assert(len(children) == 1) | |
| 327 return [] | |
| 328 guess = 3 * len(children) | |
| 329 assert(children) | |
| 330 while True: | |
| 331 offset = current + guess | |
| 332 buf = [] | |
| 333 for child in sorted(children, key = lambda x: -offsets[id(x)]): | |
| 334 last = len(buf) | |
| 335 distance = offset - offsets[id(child)] | |
| 336 assert(distance > 0 and distance < (1 << 21)) | |
| 337 | |
| 338 if distance < (1 << 6): | |
| 339 # A 6-bit offset: "s0xxxxxx" | |
| 340 buf.append(distance) | |
| 341 elif distance < (1 << 13): | |
| 342 # A 13-bit offset: "s10xxxxxxxxxxxxx" | |
| 343 buf.append(0x40 | (distance >> 8)) | |
| 344 buf.append(distance & 0xFF) | |
| 345 else: | |
| 346 # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx" | |
| 347 buf.append(0x60 | (distance >> 16)) | |
| 348 buf.append((distance >> 8) & 0xFF) | |
| 349 buf.append(distance & 0xFF) | |
| 350 # Distance in first link is relative to following record. | |
| 351 # Distance in other links are relative to previous link. | |
| 352 offset -= distance | |
| 353 if len(buf) == guess: | |
| 354 break | |
| 355 guess = len(buf) | |
| 356 # Set most significant bit to mark end of links in this node. | |
| 357 buf[last] |= (1 << 7) | |
| 358 buf.reverse() | |
| 359 return buf | |
| 360 | |
| 361 def EncodeLabel(label): | |
| 362 ''' | |
| 363 Encode a node label as a list of bytes with a trailing high byte >0x80. | |
| 364 ''' | |
| 365 assert(label) | |
| 366 buf = [ord(c) for c in label] | |
| 367 buf.reverse() | |
| 368 # Set most significant bit to mark end of label in this node. | |
| 369 buf[0] |= (1 << 7) | |
| 370 return buf | |
| 371 | |
| 372 def EncodePrefix(label): | |
| 373 ''' | |
| 374 Encode a node label as a list of bytes without a trailing high byte. | |
| 375 | |
| 376 This method encodes a node if there is exactly one child and the | |
| 377 child follows immidiately after so that no jump is needed. This label | |
| 378 will then be a prefix to the label in the child node. | |
| 379 ''' | |
| 380 assert(label) | |
| 381 return [ord(c) for c in reversed(label)] | |
| 382 | |
| 383 def Encode(dafsa): | |
| 384 '''Encode a DAFSA to a list of bytes''' | |
| 385 output = [] | |
| 386 offsets = {} | |
| 387 | |
| 388 for node in reversed(TopSort(dafsa)): | |
| 389 if len(node[1]) == 1 and node[1][0] is not None and \ | |
| 390 (offsets[id(node[1][0])] == len(output)): | |
| 391 output += EncodePrefix(node[0]) | |
|
M-A Ruel
2014/04/23 12:26:18
optional nit: I generally prefer .append()
Olle Liljenzin
2014/04/24 09:30:00
List concat and list append are different operatio
M-A Ruel
2014/04/24 12:45:19
Eh, I also prefer .extend(). Using += is really co
| |
| 392 else: | |
| 393 output += EncodeLinks(node[1], offsets, len(output)) | |
| 394 output += EncodeLabel(node[0]) | |
| 395 offsets[id(node)] = len(output) | |
| 396 | |
| 397 output += EncodeLinks(dafsa, offsets, len(output)) | |
| 398 output.reverse() | |
| 399 return output | |
| 400 | |
| 401 def ToCxx(data): | |
| 402 '''Generate C++ code from a list of encoded bytes''' | |
| 403 text = '/* This file is generated. Don\'t edit!\n\n' | |
|
M-A Ruel
2014/04/23 12:26:18
I'd put DO NOT EDIT in all caps. Yelling is good s
| |
| 404 text += ' The following command was used to generate the file:\n\n' | |
| 405 text += ' \"' + ' '.join(sys.argv) + '\"\n\n' | |
|
M-A Ruel
2014/04/23 12:26:18
Not necessary IMHO.
| |
| 406 text += '*/\n\n' | |
| 407 text += 'const unsigned char kDafsa[%s] = {\n' % len(data) | |
| 408 for i in range(0, len(data), 12): | |
| 409 text += ' ' | |
| 410 text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]]) | |
|
M-A Ruel
2014/04/23 12:26:18
[] are not needed
| |
| 411 text += ',\n' | |
| 412 text += '};\n' | |
| 413 return text | |
| 414 | |
| 415 def WordsToCxx(words): | |
| 416 '''Generate C++ code from a word list''' | |
| 417 dafsa = ToDafsa(words) | |
| 418 for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels): | |
| 419 dafsa = fun(dafsa) | |
| 420 byte_values = Encode(dafsa) | |
| 421 return ToCxx(byte_values) | |
|
M-A Ruel
2014/04/23 12:26:18
return ToCxx(Encode(dafsa))
| |
| 422 | |
| 423 def ParseGperf(infile): | |
| 424 '''Parse gperf file and extract strings and return code''' | |
| 425 lines = [line.strip() for line in infile] | |
| 426 # Extract strings after the first '%%' and before the second '%%'. | |
| 427 begin = lines.index('%%') + 1 | |
| 428 end = lines.index('%%', begin) | |
| 429 lines = lines[begin:end] | |
| 430 for line in lines: | |
| 431 assert(line[-3:-1] == ', ') | |
| 432 assert(line[-1] in '01234') | |
| 433 return [line[:-3] + line[-1] for line in lines] | |
|
M-A Ruel
2014/04/23 12:26:18
Could you add in a comment how the lines should lo
Pam (message me for reviews)
2014/04/23 13:13:56
Also some unit tests for this script. There are va
Olle Liljenzin
2014/04/24 09:30:00
There are two examples, see lines 107 and 152.
M-A Ruel
2014/04/24 12:45:19
Would making it a generator be faster? E.g. replac
Olle Liljenzin
2014/04/29 12:41:52
You are probably right, but I don't understand exa
| |
| 434 | |
| 435 if __name__ == '__main__': | |
| 436 if len(sys.argv) != 3: | |
| 437 print('usage: %s infile outfile' % sys.argv[0]) | |
| 438 sys.exit(-1) | |
| 439 INFILE = open(sys.argv[1], 'r') | |
| 440 OUTFILE = open(sys.argv[2], 'w') | |
| 441 OUTFILE.write(WordsToCxx(ParseGperf(INFILE))) | |
| OLD | NEW |