1 #!/usr/bin/python
M-A Ruel
2014/04/23 12:26:18
Please take a look at other scripts around of get
Pam (message me for reviews)
2014/04/23 13:13:56
Python coding style is described at http://www.chr
2
3 '''
4 A Deterministic acyclic finite state automaton (DAFSA) is a compact
5 representation of an unordered word list (dictionary).
6
7 http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
8
9 This Python program converts a list of strings to a byte array in C++.
10 It fetches strings and return values from a gperf file
11 and generates a C++ file with a byte array representing the graph that can be
12 used as a memory-efficient replacement for the perfect hash table.
13
14 The input strings are assumed to consist of printable 7-bit ASCII characters
M-A Ruel
2014/04/24 12:45:19
Why not just enforce it and print an error message
Olle Liljenzin
2014/04/29 12:41:52
It is enforced, lines 205 and 431-432. But it has
15 and the return values are assumed to be one-digit integers.
16
17 In this program a DAFSA is a diamond-shaped graph starting at a common
18 source node and ending at a common sink node. All internal nodes contain
19 a label and each word is represented by the labels in one path from
20 the source node to the sink node.
21
22 The following Python representation is used for nodes:
23
24 Source node: [ children ]
25 Internal node: (label, [ children ])
26 Sink node: None
27
28 The graph is first compressed by prefixes like a trie. In the next step
29 suffixes are compressed so that the graph gets diamond shaped. Finally
30 one-to-one linked nodes are replaced by nodes with the labels joined.
31
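The prefix-compression step can be sketched with a nested-dict trie. This is a standalone illustration, not code from the script; `to_trie` is a hypothetical helper name.

```python
def to_trie(words):
    """Build a nested-dict trie: shared prefixes are stored only once,
    which is the first compression step described above."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = None  # sink marker: a complete word ends here
    return root
```

For the word list ["aa1", "a2"] the single leading "a" node is shared, so `to_trie(['aa1', 'a2'])` yields `{'a': {'a': {'1': {None: None}}, '2': {None: None}}}`.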
32 The order of the operations is crucial since lookups will be performed
33 starting from the source with no backtracking. Thus a node must have at
34 most one child with a label starting with the same character. The output
35 is also arranged so that all jumps are to increasing addresses, thus forward
36 in memory.
37
38 The generated output has suffix-free decoding, so the leading bits in a link
39 (a reference to a child node) indicate whether it has a size of one, two or
40 three bytes and whether it is the last outgoing link from the current node.
41 A node label is terminated by a byte with the leading bit set.
42
43 The generated byte array can be described by the following BNF:
44
45 <byte> ::= < 8-bit value in range [0x00-0xFF] >
46
47 <char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
48 <end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
49 <return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >
50
51 <offset1> ::= < byte in range [0x00-0x3F] >
52 <offset2> ::= < byte in range [0x40-0x5F] >
53 <offset3> ::= < byte in range [0x60-0x7F] >
54
55 <end_offset1> ::= < byte in range [0x80-0xBF] >
56 <end_offset2> ::= < byte in range [0xC0-0xDF] >
57 <end_offset3> ::= < byte in range [0xE0-0xFF] >
58
59 <prefix> ::= <char>
60
61 <label> ::= <end_char>
62         | <char> <label>
63
64 <end_label> ::= <return_value>
65             | <char> <end_label>
66
67 <offset> ::= <offset1>
68          | <offset2> <byte>
69          | <offset3> <byte> <byte>
70
71 <end_offset> ::= <end_offset1>
72              | <end_offset2> <byte>
73              | <end_offset3> <byte> <byte>
74
75 <offsets> ::= <end_offset>
76           | <offset> <offsets>
77
78 <source> ::= <offsets>
79
80 <node> ::= <label> <offsets>
81        | <prefix> <node>
82        | <end_label>
83
84 <dafsa> ::= <source>
85         | <dafsa> <node>
86
87 Decoding:
88
89 <char> -> printable 7-bit ASCII character
90 <end_char> & 0x7F -> printable 7-bit ASCII character
91 <return_value> & 0x0F -> integer
92 <offset1> & 0x3F -> integer
93 ((<offset2> & 0x1F) << 8) + <byte> -> integer
94 ((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer
95
96 end_offset1, end_offset2 and end_offset3 are decoded the same as offset1,
97 offset2 and offset3 respectively.
98
99 The first offset in a list of offsets is the distance in bytes between the
100 offset itself and the first child node. Subsequent offsets are the distance
101 between the previous child node and the next child node. Thus each offset
102 links a node to a child node. The distance is always counted between start
103 addresses, i.e. the first byte in the decoded offset or the first byte in
104 the child node.
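The offset decoding rules above can be captured in a short helper. This is a sketch written from the BNF for this review, not part of the script; the name `decode_offset` is illustrative.

```python
def decode_offset(buf, pos):
    """Decode one offset at buf[pos] per the BNF above.

    Returns (distance, is_last, next_pos): the decoded distance, whether this
    was the final <end_offset*> in the list, and the position after it."""
    is_last = bool(buf[pos] & 0x80)    # end_offset* variants have the high bit set
    b = buf[pos] & 0x7F
    if b < 0x40:                        # <offset1>: 6-bit distance, one byte
        return b, is_last, pos + 1
    if b < 0x60:                        # <offset2>: 13-bit distance, two bytes
        return ((b & 0x1F) << 8) | buf[pos + 1], is_last, pos + 2
    # <offset3>: 21-bit distance, three bytes
    return ((b & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], is_last, pos + 3
```

On the Example 1 array below, the byte 0x81 at position 0 decodes to distance 1 with the end-of-list flag set.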
105 Example 1:
106
107 %%
108 aa, 1
109 a, 2
110 %%
111
112 The input is first parsed to a list of words:
113 ["aa1", "a2"]
114
115 A fully expanded graph is created from the words:
116 source = [node1, node4]
117 node1 = ("a", [node2])
118 node2 = ("a", [node3])
119 node3 = ("\x01", [sink])
120 node4 = ("a", [node5])
121 node5 = ("\x02", [sink])
122 sink = None
123
124 Compression results in the following graph:
125 source = [node1]
126 node1 = ("a", [node2, node3])
127 node2 = ("\x02", [sink])
128 node3 = ("a\x01", [sink])
129 sink = None
130
131 A C++ representation of the compressed graph is generated:
132
133 const unsigned char dafsa[7] = {
134   0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
135 };
136
137 The bytes in the generated array have the following meaning:
138
139  0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1
140
141  1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
142  2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4
143
144  3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
145  4: 0x82 <return_value> 0x82 & 0x0F -> return 2
146
147  5: 0x61 <char>         label character 0x61 -> match "a"
148  6: 0x81 <return_value> 0x81 & 0x0F -> return 1
150 Example 2:
151
152 %%
153 aa, 1
154 bbb, 2
155 baa, 1
156 %%
157
158 The input is first parsed to a list of words:
159 ["aa1", "bbb2", "baa1"]
160
161 Compression results in the following graph:
162 source = [node1, node2]
163 node1 = ("b", [node2, node3])
164 node2 = ("aa\x01", [sink])
165 node3 = ("bb\x02", [sink])
166 sink = None
167
168 A C++ representation of the compressed graph is generated:
169
170 const unsigned char dafsa[11] = {
171   0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
172 };
173
174 The bytes in the generated array have the following meaning:
175
176  0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
177  1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5
178
179  2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
180  3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
181  4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8
182
183  5: 0x61 <char>         label character 0x61 -> match "a"
184  6: 0x61 <char>         label character 0x61 -> match "a"
185  7: 0x81 <return_value> 0x81 & 0x0F -> return 1
186
187  8: 0x62 <char>         label character 0x62 -> match "b"
188  9: 0x62 <char>         label character 0x62 -> match "b"
189 10: 0x82 <return_value> 0x82 & 0x0F -> return 2
190 '''
191
192 import sys
193
194 def ToDafsa(words):
M-A Ruel
2014/04/23 12:26:18
Correct me if I'm wrong, but nowhere you describe
Olle Liljenzin
2014/04/24 09:30:00
It returns a source node described on line 24.
Cl
M-A Ruel
2014/04/24 12:45:19
What's the relative perf impact of using collectio
195   '''
196   Generate a DAFSA from a word list and return the source node.
197
198   Each word is split into characters so that each character is represented by
199   a unique node. It is assumed the word list is not empty.
200   '''
201   assert(words)
M-A Ruel
2014/04/23 12:26:18
So the script will throw on an empty file?
Olle Liljenzin
2014/04/24 09:30:00
The output format doesn't support the empty graph.
202   def ToNodes(word):
203     '''Split words into characters'''
204     # Must be printable 7-bit ASCII.
205     assert(0x1F < ord(word[0]) < 0x80)
206     if len(word) == 1:
207       return chr(ord(word[0]) & 0x0F), [None]
208     return word[0], [ToNodes(word[1:])]
209   return [ToNodes(word) for word in words]
210
211 def ToWords(node):
212   '''
213   Generate a word list from all paths starting from an internal node.
214   '''
215   if node is None:
M-A Ruel
2014/04/23 12:26:18
if not node:
(everywhere)
216     return ['']
217   return [(node[0] + word) for child in node[1] for word in ToWords(child)]
218
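As a standalone illustration of this traversal, the node tuples below re-create Example 1's compressed graph from the docstring (the lowercase `to_words` is a copy made for this sketch, not the script's function):

```python
def to_words(node):
    """Same traversal as ToWords: collect every path, label by label."""
    if node is None:
        return ['']
    return [node[0] + word for child in node[1] for word in to_words(child)]

# Example 1's compressed graph: node1 is the only child of the source.
node2 = ('\x02', [None])
node3 = ('a\x01', [None])
node1 = ('a', [node2, node3])
assert to_words(node1) == ['a\x02', 'aa\x01']
```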
219 def Reverse(dafsa):
220   '''
221   Generate a new DAFSA that is reversed, so that the old sink node becomes the
222   new source node.
223   '''
224   def Dfs(node, parent):
225     '''Create new nodes'''
Pam (message me for reviews)
2014/04/23 13:13:56
This docstring could be more descriptive.
226     if node is None:
227       sink.append(parent)
228     elif id(node) not in nodemap:
229       nodemap[id(node)] = (node[0][::-1], [parent])
230       for child in node[1]:
231         Dfs(child, nodemap[id(node)])
232     else:
233       nodemap[id(node)][1].append(parent)
234
235   sink = []
M-A Ruel
2014/04/23 12:26:18
Move closure variables before the function using t
236   nodemap = {}
237   for node in dafsa:
238     Dfs(node, None)
239   return sink
240
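The effect of the reversal can be checked with a standalone copy of the routine plus a small path-collecting helper (illustrative only; lowercase names to distinguish them from the script's functions):

```python
def reverse(dafsa):
    """Standalone version of Reverse: flip every edge so that the old sink
    becomes the new source; labels are reversed character-wise."""
    sink = []
    nodemap = {}
    def dfs(node, parent):
        if node is None:
            sink.append(parent)                 # parent now borders the new source
        elif id(node) not in nodemap:
            nodemap[id(node)] = (node[0][::-1], [parent])
            for child in node[1]:
                dfs(child, nodemap[id(node)])
        else:
            nodemap[id(node)][1].append(parent)
    for node in dafsa:
        dfs(node, None)
    return sink

def paths(node):
    """Collect all words reachable from node (None acts as the sink)."""
    if node is None:
        return ['']
    return [node[0] + word for child in node[1] for word in paths(child)]

# Reversing Example 1's compressed graph yields the same words, reversed.
node1 = ('a', [('\x02', [None]), ('a\x01', [None])])
words = sorted(word for node in reverse([node1]) for word in paths(node))
assert words == sorted(w[::-1] for w in ['a\x02', 'aa\x01'])
```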
241 def JoinLabels(dafsa):
242   '''
243   Generate a new DAFSA where internal nodes are merged if there is a one-to-one
244   connection.
245   '''
246   def CountParents(node):
247     '''Count incoming references'''
248     if id(node) in parentcount:
249       parentcount[id(node)] += 1
250     else:
251       parentcount[id(node)] = 1
252       for child in node[1]:
253         CountParents(child)
254
255   def Join(node):
256     '''Create new nodes'''
257     if id(node) not in nodemap:
258       children = [Join(child) for child in node[1]]
259       if len(children) == 1 and parentcount[id(node[1][0])] == 1:
260         child = children[0]
261         nodemap[id(node)] = (node[0] + child[0], child[1])
262       else:
263         nodemap[id(node)] = (node[0], children)
264     return nodemap[id(node)]
265
266   parentcount = { id(None): 2 }
267   for node in dafsa:
268     CountParents(node)
269   nodemap = { id(None): None }
270   return [Join(node) for node in dafsa]
271
272 def JoinSuffixes(dafsa):
273   '''
274   Generate a new DAFSA where nodes that represent the same word lists towards
275   the sink are merged.
276   '''
277   def Join(node):
278     '''Return a macthing node. A new node is created if not matching node
Pam (message me for reviews)
2014/04/23 13:13:56
typos "matching"; "if no matching node"
279     exists. The graph is accessed in dfs order.
280     '''
281     suffixes = frozenset(ToWords(node))
282     if suffixes not in nodemap:
283       nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
284     return nodemap[suffixes]
285
286   nodemap = { frozenset(('',)): None }
287   return [Join(node) for node in dafsa]
288
289 def TopSort(dafsa):
290   '''Generate list of nodes in topological sort order'''
291   def CountIncoming(node):
292     '''Count incoming references'''
293     if node is not None:
294       if id(node) not in incoming:
295         incoming[id(node)] = 1
296         for child in node[1]:
297           CountIncoming(child)
298       else:
299         incoming[id(node)] += 1
300
301   incoming = {}
302   for node in dafsa:
303     CountIncoming(node)
304
305   for node in dafsa:
306     incoming[id(node)] -= 1
307
308   waiting = [node for node in dafsa if incoming[id(node)] == 0]
309   nodes = []
310
311   while waiting:
312     node = waiting.pop()
313     assert(incoming[id(node)] == 0)
314     nodes.append(node)
315     for child in node[1]:
316       if child is not None:
317         incoming[id(child)] -= 1
318         if incoming[id(child)] == 0:
319           waiting.append(child)
320   return nodes
321
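The same counting idea can be shown as a generic sketch over an explicit adjacency dict (hypothetical helper names, not part of the script; TopSort does this with `id()`-keyed incoming counts over node tuples):

```python
from collections import deque

def topo_order(nodes, children):
    """Kahn's algorithm: repeatedly emit a node with no unprocessed parents."""
    incoming = {n: 0 for n in nodes}
    for n in nodes:
        for c in children.get(n, ()):
            incoming[c] += 1
    waiting = deque(n for n in nodes if incoming[n] == 0)
    order = []
    while waiting:
        n = waiting.popleft()
        order.append(n)
        for c in children.get(n, ()):
            incoming[c] -= 1
            if incoming[c] == 0:
                waiting.append(c)
    return order
```

For a diamond graph source -> {a, b} -> sink, any valid order starts with the source and ends with the sink, which is the property Encode relies on for forward-only jumps.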
322 def EncodeLinks(children, offsets, current):
323   '''Encode a list of children as one, two or three byte offsets'''
324   if children[0] is None:
325     # This is an <end_label> node and no links are follow in such nodes
Pam (message me for reviews)
2014/04/23 13:13:56
language nit: "no links follow such nodes"
326     assert(len(children) == 1)
327     return []
328   guess = 3 * len(children)
329   assert(children)
330   while True:
331     offset = current + guess
332     buf = []
333     for child in sorted(children, key = lambda x: -offsets[id(x)]):
334       last = len(buf)
335       distance = offset - offsets[id(child)]
336       assert(distance > 0 and distance < (1 << 21))
337
338       if distance < (1 << 6):
339         # A 6-bit offset: "s0xxxxxx"
340         buf.append(distance)
341       elif distance < (1 << 13):
342         # A 13-bit offset: "s10xxxxxxxxxxxxx"
343         buf.append(0x40 | (distance >> 8))
344         buf.append(distance & 0xFF)
345       else:
346         # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
347         buf.append(0x60 | (distance >> 16))
348         buf.append((distance >> 8) & 0xFF)
349         buf.append(distance & 0xFF)
350       # The distance in the first link is relative to the following record.
351       # Distances in other links are relative to the previous link.
352       offset -= distance
353     if len(buf) == guess:
354       break
355     guess = len(buf)
356   # Set most significant bit to mark end of links in this node.
357   buf[last] |= (1 << 7)
358   buf.reverse()
359   return buf
360
361 def EncodeLabel(label):
362   '''
363   Encode a node label as a list of bytes with a trailing high byte >= 0x80.
364   '''
365   assert(label)
366   buf = [ord(c) for c in label]
367   buf.reverse()
368   # Set most significant bit to mark end of label in this node.
369   buf[0] |= (1 << 7)
370   return buf
371
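A worked example of the label encoding (a standalone sketch; the whole output array is reversed once at the end of Encode, which is why the terminator ends up first here):

```python
def encode_label(label):
    """Reverse the label and set the high bit on what will become the last
    byte after the final output reversal, mirroring EncodeLabel."""
    buf = [ord(c) for c in reversed(label)]
    buf[0] |= 0x80
    return buf

# Example 1's node3 label "a\x01": after the global reversal these bytes
# appear in order 0x61, 0x81 (positions 5-6 of the generated array).
assert encode_label('a\x01') == [0x81, 0x61]
```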
372 def EncodePrefix(label):
373   '''
374   Encode a node label as a list of bytes without a trailing high byte.
375
376   This method encodes a node if there is exactly one child and the
377   child follows immediately after, so that no jump is needed. This label
378   will then be a prefix to the label in the child node.
379   '''
380   assert(label)
381   return [ord(c) for c in reversed(label)]
382
383 def Encode(dafsa):
384   '''Encode a DAFSA to a list of bytes'''
385   output = []
386   offsets = {}
387
388   for node in reversed(TopSort(dafsa)):
389     if len(node[1]) == 1 and node[1][0] is not None and \
390        (offsets[id(node[1][0])] == len(output)):
391       output += EncodePrefix(node[0])
M-A Ruel
2014/04/23 12:26:18
optional nit: I generally prefer .append()
Olle Liljenzin
2014/04/24 09:30:00
List concat and list append are different operatio
M-A Ruel
2014/04/24 12:45:19
Eh, I also prefer .extend(). Using += is really co
392     else:
393       output += EncodeLinks(node[1], offsets, len(output))
394       output += EncodeLabel(node[0])
395     offsets[id(node)] = len(output)
396
397   output += EncodeLinks(dafsa, offsets, len(output))
398   output.reverse()
399   return output
400
400 | |
401 def ToCxx(data): | |
402 '''Generate C++ code from a list of encoded bytes''' | |
403 text = '/* This file is generated. Don\'t edit!\n\n' | |
M-A Ruel
2014/04/23 12:26:18
I'd put DO NOT EDIT in all caps. Yelling is good s
404   text += ' The following command was used to generate the file:\n\n'
405   text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
M-A Ruel
2014/04/23 12:26:18
Not necessary IMHO.
406   text += '*/\n\n'
407   text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
408   for i in range(0, len(data), 12):
409     text += '  '
410     text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
M-A Ruel
2014/04/23 12:26:18
[] are not needed
411     text += ',\n'
412   text += '};\n'
413   return text
414
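The array formatting can be exercised in isolation (an illustrative re-implementation; `to_c_array` is a made-up name and the command-line comment is omitted):

```python
def to_c_array(data, name='kDafsa', per_line=12):
    """Format a list of byte values as a C++ unsigned char array,
    twelve values per line, as ToCxx does."""
    lines = ['const unsigned char %s[%d] = {' % (name, len(data))]
    for i in range(0, len(data), per_line):
        lines.append('  ' + ', '.join('0x%02x' % b for b in data[i:i + per_line]) + ',')
    lines.append('};')
    return '\n'.join(lines) + '\n'
```

Feeding it the seven bytes from Example 1 reproduces the array shown in the docstring (with lowercase hex digits).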
415 def WordsToCxx(words):
416   '''Generate C++ code from a word list'''
417   dafsa = ToDafsa(words)
418   for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
419     dafsa = fun(dafsa)
420   byte_values = Encode(dafsa)
421   return ToCxx(byte_values)
M-A Ruel
2014/04/23 12:26:18
return ToCxx(Encode(dafsa))
422
423 def ParseGperf(infile):
424   '''Parse gperf file and extract strings and return values'''
425   lines = [line.strip() for line in infile]
426   # Extract strings after the first '%%' and before the second '%%'.
427   begin = lines.index('%%') + 1
428   end = lines.index('%%', begin)
429   lines = lines[begin:end]
430   for line in lines:
431     assert(line[-3:-1] == ', ')
432     assert(line[-1] in '01234')
433   return [line[:-3] + line[-1] for line in lines]
M-A Ruel
2014/04/23 12:26:18
Could you add in a comment how the lines should lo
Pam (message me for reviews)
2014/04/23 13:13:56
Also some unit tests for this script. There are va
Olle Liljenzin
2014/04/24 09:30:00
There are two examples, see lines 107 and 152.
M-A Ruel
2014/04/24 12:45:19
Would making it a generator be faster? E.g. replac
Olle Liljenzin
2014/04/29 12:41:52
You are probably right, but I don't understand exa
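The parsing step can be tried in isolation with a standalone sketch of the same extraction (`parse_gperf_lines` is an illustrative name; gperf keyword lines between the `%%` markers look like "string, value"):

```python
def parse_gperf_lines(lines):
    """Extract 'word, value' pairs between the first two '%%' markers and
    append the value digit to each word, as ParseGperf does."""
    lines = [line.strip() for line in lines]
    begin = lines.index('%%') + 1
    end = lines.index('%%', begin)
    words = []
    for line in lines[begin:end]:
        assert line[-3:-1] == ', ' and line[-1] in '01234'
        words.append(line[:-3] + line[-1])
    return words
```

On Example 2's input this yields the word list `['aa1', 'bbb2', 'baa1']` from the docstring.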
434
435 if __name__ == '__main__':
436   if len(sys.argv) != 3:
437     print('usage: %s infile outfile' % sys.argv[0])
438     sys.exit(-1)
439   INFILE = open(sys.argv[1], 'r')
440   OUTFILE = open(sys.argv[2], 'w')
441   OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))