Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(16)

Side by Side Diff: net/tools/tld_cleanup/make_dafsa.py

Issue 197183002: Reduce footprint of registry controlled domain table (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master
Patch Set: Added bounds checks Created 6 years, 8 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View unified diff | Download patch
« net/net.gyp ('K') | « net/tools/tld_cleanup/README ('k') | no next file » | no next file with comments »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
OLDNEW
(Empty)
1 #!/usr/bin/python
M-A Ruel 2014/04/23 12:26:18 Please take a look at other scripts around of get
Pam (message me for reviews) 2014/04/23 13:13:56 Python coding style is described at http://www.chr
2
3 '''
4 A Deterministic acyclic finite state automaton (DAFSA) is a compact
5 representation of an unordered word list (dictionary).
6
7 http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
8
9 This python program converts a list of strings to a byte array in C++.
10 This python program fetches strings and return values from a gperf file
11 and generates a C++ file with a byte array representing graph that can be
12 used as a memory efficient replacement for the perfect hash table.
13
14 The input strings are assumed to consist of printable 7-bit ASCII characters
M-A Ruel 2014/04/24 12:45:19 Why not just enforce it and print an error message
Olle Liljenzin 2014/04/29 12:41:52 It is enforced, lines 205 and 431-432. But it has
15 and the return values are assumed to be one digit integers.
16
17 In this program a DAFSA is a diamond shaped graph starting at a common
18 source node and ending at a common sink node. All internal nodes contain
19 a label and each word is represented by the labels in one path from
20 the source node to the sink node.
21
22 The following python represention is used for nodes:
23
24 Source node: [ children ]
25 Internal node: (label, [ children ])
26 Sink node: None
27
28 The graph is first compressed by prefixes like a trie. In the next step
29 suffixes are compressed so that the graph gets diamond shaped. Finally
30 one to one linked nodes are replaced by nodes with the labels joined.
31
32 The order of the operations is crucial since lookups will be performed
33 starting from the source with no backtracking. Thus a node must have at
34 most one child with a label starting by the same character. The output
35 is also arranged so that all jumps are to increasing addresses, thus forward
36 in memory.
37
38 The generated output has suffix free decoding so that the sign of leading
39 bits in a link (a reference to a child node) indicate if it has a size of one,
40 two or three bytes and if it is the last outgoing link from the actual node.
41 A node label is terminated by a byte with the leading bit set.
42
43 The generated byte array can described by the following BNF:
44
45 <byte> ::= < 8-bit value in range [0x00-0xFF] >
46
47 <char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
48 <end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
49 <return value> ::= < value + 0x80, byte in range [0x80-0x8F] >
50
51 <offset1> ::= < byte in range [0x00-0x3F] >
52 <offset2> ::= < byte in range [0x40-0x5F] >
53 <offset3> ::= < byte in range [0x60-0x7F] >
54
55 <end_offset1> ::= < byte in range [0x80-0xBF] >
56 <end_offset2> ::= < byte in range [0xC0-0xDF] >
57 <end_offset3> ::= < byte in range [0xE0-0xFF] >
58
59 <prefix> ::= <char>
60
61 <label> ::= <end_char>
62 | <char> <label>
63
64 <end_label> ::= <return_value>
65 | <char> <end_label>
66
67 <offset> ::= <offset1>
68 | <offset2> <byte>
69 | <offset3> <byte> <byte>
70
71 <end_offset> ::= <end_offset1>
72 | <end_offset2> <byte>
73 | <end_offset3> <byte> <byte>
74
75 <offsets> ::= <end_offset>
76 | <offset> <offsets>
77
78 <source> ::= <offsets>
79
80 <node> ::= <label> <offsets>
81 | <prefix> <node>
82 | <end_label>
83
84 <dafsa> ::= <source>
85 | <dafsa> <node>
86
87 Decoding:
88
89 <char> -> printable 7-bit ASCII character
90 <end_char> & 0x7F -> printable 7-bit ASCII character
91 <return value> & 0x0F -> integer
92 <offset1 & 0x3F> -> integer
93 ((<offset2> & 0x1F>) << 8) + <byte> -> integer
94 ((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer
95
96 end_offset1, end_offset2 and and_offset3 are decoded same as offset1,
97 offset2 and offset3 respectively.
98
99 The first offset in a list of offsets is the distance in bytes between the
100 offset itself and the first child node. Subsequent offsets are the distance
101 between previous child node and next child node. Thus each offset links a node
102 to a child node. The distance is always counted between start addresses, i.e.
103 first byte in decoded offset or first byte in child node.
104
105 Example 1:
106
107 %%
108 aa, 1
109 a, 2
110 %%
111
112 The input is first parsed to a list of words:
113 ["aa1", "a2"]
114
115 A fully expanded graph is created from the words:
116 source = [node1, node4]
117 node1 = ("a", [node2])
118 node2 = ("a", [node3])
119 node3 = ("\x01", [sink])
120 node4 = ("a", [node5])
121 node5 = ("\x02", [sink])
122 sink = None
123
124 Compression results in the following graph:
125 source = [node1]
126 node1 = ("a", [node2, node3])
127 node2 = ("\x02", [sink])
128 node3 = ("a\x01", [sink])
129 sink = None
130
131 A C++ representation of the compressed graph is generated:
132
133 const unsigned char dafsa[7] = {
134 0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
135 };
136
137 The bytes in the generated array has the following meaning:
138
139 0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1
140
141 1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a"
142 2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4
143
144 3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5
145 4: 0x82 <return_value> 0x82 & 0x0F -> return 2
146
147 5: 0x61 <char> label character 0x61 -> match "a"
148 6: 0x81 <return_value> 0x81 & 0x0F -> return 1
149
150 Example 2:
151
152 %%
153 aa, 1
154 bbb, 2
155 baa, 1
156 %%
157
158 The input is first parsed to a list of words:
159 ["aa1", "bbb2", "baa1"]
160
161 Compression results in the following graph:
162 source = [node1, node2]
163 node1 = ("b", [node2, node3])
164 node2 = ("aa\x01", [sink])
165 node3 = ("bb\x02", [sink])
166 sink = None
167
168 A C++ representation of the compressed graph is generated:
169
170 const unsigned char dafsa[11] = {
171 0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
172 };
173
174 The bytes in the generated array has the following meaning:
175
176 0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2
177 1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5
178
179 2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b"
180 3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5
181 4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8
182
183 5: 0x61 <char> label character 0x61 -> match "a"
184 6: 0x61 <char> label character 0x61 -> match "a"
185 7: 0x81 <return_value> 0x81 & 0x0F -> return 1
186
187 8: 0x62 <char> label character 0x62 -> match "b"
188 9: 0x62 <char> label character 0x62 -> match "b"
189 10: 0x82 <return_value> 0x82 & 0x0F -> return 2
190 '''
191
192 import sys
193
194 def ToDafsa(words):
M-A Ruel 2014/04/23 12:26:18 Correct me if I'm wrong, but nowhere you describe
Olle Liljenzin 2014/04/24 09:30:00 It returns a source node described on line 24. Cl
M-A Ruel 2014/04/24 12:45:19 What's the relative perf impact of using collectio
195 '''
196 Generate a DAFSA from a word list and return the source node
197
198 Each word is split into characters so that each character is represented by
199 a unique node. It is assumed the word list is not empty.
200 '''
201 assert(words)
M-A Ruel 2014/04/23 12:26:18 So the script will throw on an empty file?
Olle Liljenzin 2014/04/24 09:30:00 The output format doesn't support the empty graph.
202 def ToNodes(word):
203 '''Split words into characters'''
204 # Must be printable 7-bit ASCII.
205 assert(0x1F < ord(word[0]) < 0x80)
206 if len(word) == 1:
207 return chr(ord(word[0]) & 0x0F), [None]
208 return word[0], [ToNodes(word[1:])]
209 return [ToNodes(word) for word in words]
210
211 def ToWords(node):
212 '''
213 Generate a word list from all paths starting from an internal node
214 '''
215 if node is None:
M-A Ruel 2014/04/23 12:26:18 if not node: (everywhere)
216 return ['']
217 return [(node[0] + word) for child in node[1] for word in ToWords(child)]
218
219 def Reverse(dafsa):
220 '''
221 Generate a new DAFSA that is reversed, so that the old sink node becomes the
222 new source node.
223 '''
224 def Dfs(node, parent):
225 '''Create new nodes'''
Pam (message me for reviews) 2014/04/23 13:13:56 This docstring could be more descriptive.
226 if node is None:
227 sink.append(parent)
228 elif id(node) not in nodemap:
229 nodemap[id(node)] = (node[0][::-1], [parent])
230 for child in node[1]:
231 Dfs(child, nodemap[id(node)])
232 else:
233 nodemap[id(node)][1].append(parent)
234
235 sink = []
M-A Ruel 2014/04/23 12:26:18 Move closure variables before the function using t
236 nodemap = {}
237 for node in dafsa:
238 Dfs(node, None)
239 return sink
240
241 def JoinLabels(dafsa):
242 '''
243 Generate a new DAFSA where internal nodes are merged if there is a one to
244 one connection.
245 '''
246 def CountParents(node):
247 '''Count incoming references'''
248 if id(node) in parentcount:
249 parentcount[id(node)] += 1
250 else:
251 parentcount[id(node)] = 1
252 for child in node[1]:
253 CountParents(child)
254
255 def Join(node):
256 '''Create new nodes'''
257 if id(node) not in nodemap:
258 children = [Join(child) for child in node[1]]
259 if len(children) == 1 and parentcount[id(node[1][0])] == 1:
260 child = children[0]
261 nodemap[id(node)] = (node[0] + child[0], child[1])
262 else:
263 nodemap[id(node)] = (node[0], children)
264 return nodemap[id(node)]
265
266 parentcount = { id(None): 2 }
267 for node in dafsa:
268 CountParents(node)
269 nodemap = { id(None): None }
270 return [Join(node) for node in dafsa]
271
272 def JoinSuffixes(dafsa):
273 '''
274 Generate a new DAFSA where nodes that represent the same word lists towards
275 the sink are merged.
276 '''
277 def Join(node):
278 '''Return a macthing node. A new node is created if not matching node
Pam (message me for reviews) 2014/04/23 13:13:56 typos "matching"; "if no matching node"
279 exists. The graph is accessed in dfs order.
280 '''
281 suffixes = frozenset(ToWords(node))
282 if suffixes not in nodemap:
283 nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
284 return nodemap[suffixes]
285
286 nodemap = { frozenset(('',)): None }
287 return [Join(node) for node in dafsa]
288
289 def TopSort(dafsa):
290 '''Generate list of nodes in topological sort order'''
291 def CountIncoming(node):
292 '''Count incoming references'''
293 if node is not None:
294 if id(node) not in incoming:
295 incoming[id(node)] = 1
296 for child in node[1]:
297 CountIncoming(child)
298 else:
299 incoming[id(node)] += 1
300
301 incoming = {}
302 for node in dafsa:
303 CountIncoming(node)
304
305 for node in dafsa:
306 incoming[id(node)] -= 1
307
308 waiting = [node for node in dafsa if incoming[id(node)] == 0]
309 nodes = []
310
311 while waiting:
312 node = waiting.pop()
313 assert(incoming[id(node)] == 0)
314 nodes.append(node)
315 for child in node[1]:
316 if child is not None:
317 incoming[id(child)] -= 1
318 if incoming[id(child)] == 0:
319 waiting.append(child)
320 return nodes
321
322 def EncodeLinks(children, offsets, current):
323 '''Encode a list of children as one, two or three byte offsets'''
324 if children[0] is None:
325 # This is an <end_label> node and no links are follow in such nodes
Pam (message me for reviews) 2014/04/23 13:13:56 language nit: "no links follow such nodes"
326 assert(len(children) == 1)
327 return []
328 guess = 3 * len(children)
329 assert(children)
330 while True:
331 offset = current + guess
332 buf = []
333 for child in sorted(children, key = lambda x: -offsets[id(x)]):
334 last = len(buf)
335 distance = offset - offsets[id(child)]
336 assert(distance > 0 and distance < (1 << 21))
337
338 if distance < (1 << 6):
339 # A 6-bit offset: "s0xxxxxx"
340 buf.append(distance)
341 elif distance < (1 << 13):
342 # A 13-bit offset: "s10xxxxxxxxxxxxx"
343 buf.append(0x40 | (distance >> 8))
344 buf.append(distance & 0xFF)
345 else:
346 # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
347 buf.append(0x60 | (distance >> 16))
348 buf.append((distance >> 8) & 0xFF)
349 buf.append(distance & 0xFF)
350 # Distance in first link is relative to following record.
351 # Distance in other links are relative to previous link.
352 offset -= distance
353 if len(buf) == guess:
354 break
355 guess = len(buf)
356 # Set most significant bit to mark end of links in this node.
357 buf[last] |= (1 << 7)
358 buf.reverse()
359 return buf
360
361 def EncodeLabel(label):
362 '''
363 Encode a node label as a list of bytes with a trailing high byte >0x80.
364 '''
365 assert(label)
366 buf = [ord(c) for c in label]
367 buf.reverse()
368 # Set most significant bit to mark end of label in this node.
369 buf[0] |= (1 << 7)
370 return buf
371
372 def EncodePrefix(label):
373 '''
374 Encode a node label as a list of bytes without a trailing high byte.
375
376 This method encodes a node if there is exactly one child and the
377 child follows immidiately after so that no jump is needed. This label
378 will then be a prefix to the label in the child node.
379 '''
380 assert(label)
381 return [ord(c) for c in reversed(label)]
382
383 def Encode(dafsa):
384 '''Encode a DAFSA to a list of bytes'''
385 output = []
386 offsets = {}
387
388 for node in reversed(TopSort(dafsa)):
389 if len(node[1]) == 1 and node[1][0] is not None and \
390 (offsets[id(node[1][0])] == len(output)):
391 output += EncodePrefix(node[0])
M-A Ruel 2014/04/23 12:26:18 optional nit: I generally prefer .append()
Olle Liljenzin 2014/04/24 09:30:00 List concat and list append are different operatio
M-A Ruel 2014/04/24 12:45:19 Eh, I also prefer .extend(). Using += is really co
392 else:
393 output += EncodeLinks(node[1], offsets, len(output))
394 output += EncodeLabel(node[0])
395 offsets[id(node)] = len(output)
396
397 output += EncodeLinks(dafsa, offsets, len(output))
398 output.reverse()
399 return output
400
401 def ToCxx(data):
402 '''Generate C++ code from a list of encoded bytes'''
403 text = '/* This file is generated. Don\'t edit!\n\n'
M-A Ruel 2014/04/23 12:26:18 I'd put DO NOT EDIT in all caps. Yelling is good s
404 text += ' The following command was used to generate the file:\n\n'
405 text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
M-A Ruel 2014/04/23 12:26:18 Not necessary IMHO.
406 text += '*/\n\n'
407 text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
408 for i in range(0, len(data), 12):
409 text += ' '
410 text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
M-A Ruel 2014/04/23 12:26:18 [] are not needed
411 text += ',\n'
412 text += '};\n'
413 return text
414
415 def WordsToCxx(words):
416 '''Generate C++ code from a word list'''
417 dafsa = ToDafsa(words)
418 for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
419 dafsa = fun(dafsa)
420 byte_values = Encode(dafsa)
421 return ToCxx(byte_values)
M-A Ruel 2014/04/23 12:26:18 return ToCxx(Encode(dafsa))
422
423 def ParseGperf(infile):
424 '''Parse gperf file and extract strings and return code'''
425 lines = [line.strip() for line in infile]
426 # Extract strings after the first '%%' and before the second '%%'.
427 begin = lines.index('%%') + 1
428 end = lines.index('%%', begin)
429 lines = lines[begin:end]
430 for line in lines:
431 assert(line[-3:-1] == ', ')
432 assert(line[-1] in '01234')
433 return [line[:-3] + line[-1] for line in lines]
M-A Ruel 2014/04/23 12:26:18 Could you add in a comment how the lines should lo
Pam (message me for reviews) 2014/04/23 13:13:56 Also some unit tests for this script. There are va
Olle Liljenzin 2014/04/24 09:30:00 There are two examples, see lines 107 and 152.
M-A Ruel 2014/04/24 12:45:19 Would making it a generator be faster? E.g. replac
Olle Liljenzin 2014/04/29 12:41:52 You are probably right, but I don't understand exa
434
435 if __name__ == '__main__':
436 if len(sys.argv) != 3:
437 print('usage: %s infile outfile' % sys.argv[0])
438 sys.exit(-1)
439 INFILE = open(sys.argv[1], 'r')
440 OUTFILE = open(sys.argv[2], 'w')
441 OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))
OLDNEW
« net/net.gyp ('K') | « net/tools/tld_cleanup/README ('k') | no next file » | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698