Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(145)

Side by Side Diff: net/tools/tld_cleanup/make_dafsa.py

Issue 197183002: Reduce footprint of registry controlled domain table (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master
Patch Set: Created 6 years, 9 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View unified diff | Download patch
« net/net.gyp ('K') | « net/tools/tld_cleanup/README ('k') | no next file » | no next file with comments »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
OLDNEW
(Empty)
1 #!/usr/bin/python
2
3 '''
4 A Deterministic acyclic finite state automaton (DAFSA) is a compact
5 representation of an unordered word list (dictionary).
6
7 http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
8
9 This python program converts a list of strings to a byte array in C++.
10 This python program fetches strings and return values from a gperf file
11 and generates a C++ file with a byte array representing graph that can be
12 used as a memory efficient replacement for the perfect hash table.
13
14 The input strings are assumed to consist of printable 7-bit ASCII characters
15 and the return values are assumed to be one digit integers.
16
17 In this program a DAFSA is a diamond shaped graph starting at a common
18 source node and ending at a common sink node. All internal nodes contain
19 a label and each word is represented by the labels in one path from
20 the source node to the sink node.
21
22 The following python represention is used for nodes:
23
24 Source node: [ children ]
25 Internal node: (label, [ children ])
26 Sink node: None
27
28 The graph is first compressed by prefixes like a trie. In the next step
29 suffixes are compressed so that the graph gets diamond shaped. Finally
30 one to one linked nodes are replaced by nodes with the labels joined.
31
32 The order of the operations is crucial since lookups will be performed
33 starting from the source with no backtracking. Thus a node must have at
34 most one child with a label starting by the same character. The output
35 is also arranged so that all jumps are to increasing addresses, thus forward
36 in memory.
37
38 The generated output has suffix free decoding so that the sign of leading
39 bits in a link (a reference to a child node) indicate if it has a size of one,
40 two or three bytes and if it is the last outgoing link from the actual node.
41 A node label is terminated by a byte with the leading bit set.
42
43 The generated byte array can described by the following BNF:
44
45 <byte> ::= < 8-bit value in range [0x00-0xFF] >
46
47 <char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
48 <end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
49 <return value> ::= < value + 0x80, byte in range [0x80-0x8F] >
50
51 <offset1> ::= < byte in range [0x00-0x3F] >
52 <offset2> ::= < byte in range [0x40-0x5F] >
53 <offset3> ::= < byte in range [0x60-0x7F] >
54
55 <end_offset1> ::= < byte in range [0x80-0xBF] >
56 <end_offset2> ::= < byte in range [0xC0-0xDF] >
57 <end_offset3> ::= < byte in range [0xE0-0xFF] >
58
59 <label> ::= <end_char>
60 | <char> <label>
61
62 <end_label> ::= <return_value>
63 | <char> <end_label>
64
65 <offset> ::= <offset1>
66 | <offset2> <byte>
67 | <offset3> <byte> <byte>
68
69 <end_offset> ::= <end_offset1>
70 | <end_offset2> <byte>
71 | <end_offset3> <byte> <byte>
72
73 <offsets> ::= <end_offset>
74 | <offset> <offsets>
75
76 <source> ::= <offsets>
77
78 <node> ::= <label> <offsets>
79 | <end_label>
80
81 <dafsa> ::= <source>
82 | <dafsa> <node>
83
84 Decoding:
85
86 <char> -> printable 7-bit ASCII character
87 <end_char> & 0x7F -> printable 7-bit ASCII character
88 <return value> & 0x0F -> integer
89 <offset1 & 0x3F> -> integer
90 ((<offset2> & 0x1F>) << 8) + <byte> -> integer
91 ((<offset3> & 0x1F>) << 16) + (<byte> << 8) + <byte> -> integer
92
93 end_offset1, end_offset2 and and_offset3 are decoded same as offset1,
94 offset2 and offset3 respectively.
95
96 The first offset in a list of offsets is the distance in bytes between the
97 offset itself and the first child node. Subsequent offsets are the distance
98 between previous child node and next child node. Thus each offset links a node
99 to a child node. The distance is always counted between start addresses, i.e.
100 first byte in decoded offset or first byte in child node.
101
102 Example 1:
103
104 %%
105 aa, 1
106 a, 2
107 %%
108
109 The input is first parsed to a list of words:
110 ["aa1", "a2"]
111
112 A fully expanded graph is created from the words:
113 source = [node1, node4]
114 node1 = ("a", [node2])
115 node2 = ("a", [node3])
116 node3 = ("\x01", [sink])
117 node4 = ("a", [node5])
118 node5 = ("\x02", [sink])
119 sink = None
120
121 Compression results in the following graph:
122 source = [node1]
123 node1 = ("a", [node2, node3])
124 node2 = ("\x02", [sink])
125 node3 = ("a\x01", [sink])
126 sink = None
127
128 A C++ representation of the compressed graph is generated:
129
130 const unsigned char dafsa[7] = {
131 0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
132 };
133
134 The bytes in the generated array has the following meaning:
135
136 0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1
137
138 1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a"
139 2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4
140
141 3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5
142 4: 0x82 <return_value> 0x82 & 0x0F -> return 2
143
144 5: 0x61 <char> label character 0x61 -> match "a"
145 6: 0x81 <return_value> 0x81 & 0x0F -> return 1
146
147 Example 2:
148
149 %%
150 aa, 1
151 bbb, 2
152 baa, 1
153 %%
154
155 The input is first parsed to a list of words:
156 ["aa1", "bbb2", "baa1"]
157
158 Compression results in the following graph:
159 source = [node1, node2]
160 node1 = ("b", [node2, node3])
161 node2 = ("aa\x01", [sink])
162 node3 = ("bb\x02", [sink])
163 sink = None
164
165 A C++ representation of the compressed graph is generated:
166
167 const unsigned char dafsa[11] = {
168 0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
169 };
170
171 The bytes in the generated array has the following meaning:
172
173 0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2
174 1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5
175
176 2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b"
177 3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5
178 4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8
179
180 5: 0x61 <char> label character 0x61 -> match "a"
181 6: 0x61 <char> label character 0x61 -> match "a"
182 7: 0x81 <return_value> 0x81 & 0x0F -> return 1
183
184 8: 0x62 <char> label character 0x62 -> match "b"
185 9: 0x62 <char> label character 0x62 -> match "b"
186 10: 0x82 <return_value> 0x82 & 0x0F -> return 2
187 '''
188
189 import sys
190
191 def ToDafsa(words):
192 '''
193 Generate a DAFSA from a word list and return the source node
194
195 Each word is split into characters so that each character is represented by
196 a unique node. It is assumed the word list is not empty.
197 '''
198 assert(words)
199 def ToNodes(word):
200 '''Split words into characters'''
201 # Must be printable 7-bit ASCII.
202 assert(0x1F < ord(word[0]) < 0x80)
203 if len(word) == 1:
204 return chr(ord(word[0]) & 0x0F), [None]
205 return word[0], [ToNodes(word[1:])]
206 return [ToNodes(word) for word in words]
207
208 def ToWords(node):
209 '''
210 Generate a word list from all paths starting from an internal node
211 '''
212 if node is None:
213 return ['']
214 return [(node[0] + word) for child in node[1] for word in ToWords(child)]
215
216 def Reverse(dafsa):
217 '''
218 Generate a new DAFSA that is reversed, so that the old sink node becomes the
219 new source node.
220 '''
221 def Dfs(node, parent):
222 '''Create new nodes'''
223 if node is None:
224 sink.append(parent)
225 elif id(node) not in nodemap:
226 nodemap[id(node)] = (node[0][::-1], [parent])
227 for child in node[1]:
228 Dfs(child, nodemap[id(node)])
229 else:
230 nodemap[id(node)][1].append(parent)
231
232 sink = []
233 nodemap = {}
234 for node in dafsa:
235 Dfs(node, None)
236 return sink
237
238 def JoinLabels(dafsa):
239 '''
240 Generate a new DAFSA where internal nodes are merged if there is a one to
241 one connection.
242 '''
243 def CountParents(node):
244 '''Count incoming references'''
245 if id(node) in parentcount:
246 parentcount[id(node)] += 1
247 else:
248 parentcount[id(node)] = 1
249 for child in node[1]:
250 CountParents(child)
251
252 def Join(node):
253 '''Create new nodes'''
254 if id(node) not in nodemap:
255 children = [Join(child) for child in node[1]]
256 if len(children) == 1 and parentcount[id(node[1][0])] == 1:
257 child = children[0]
258 nodemap[id(node)] = (node[0] + child[0], child[1])
259 else:
260 nodemap[id(node)] = (node[0], children)
261 return nodemap[id(node)]
262
263 parentcount = { id(None): 2 }
264 for node in dafsa:
265 CountParents(node)
266 nodemap = { id(None): None }
267 return [Join(node) for node in dafsa]
268
269 def JoinSuffixes(dafsa):
270 '''
271 Generate a new DAFSA where nodes that represent the same word lists towards
272 the sink are merged.
273 '''
274 def Join(node):
275 '''Return a macthing node. A new node is created if not matching node
276 exists. The graph is accessed in dfs order.
277 '''
278 suffixes = frozenset(ToWords(node))
279 if suffixes not in nodemap:
280 nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
281 return nodemap[suffixes]
282
283 nodemap = { frozenset(('',)): None }
284 return [Join(node) for node in dafsa]
285
286 def TopSort(dafsa):
287 '''Generate list of nodes in topological sort order'''
288 def CountIncoming(node):
289 '''Count incoming references'''
290 if node is not None:
291 if id(node) not in incoming:
292 incoming[id(node)] = 1
293 for child in node[1]:
294 CountIncoming(child)
295 else:
296 incoming[id(node)] += 1
297
298 incoming = {}
299 for node in dafsa:
300 CountIncoming(node)
301
302 for node in dafsa:
303 incoming[id(node)] -= 1
304
305 waiting = [node for node in dafsa if incoming[id(node)] == 0]
306 nodes = []
307
308 while waiting:
309 node = waiting.pop()
310 assert(incoming[id(node)] == 0)
311 nodes.append(node)
312 for child in node[1]:
313 if child is not None:
314 incoming[id(child)] -= 1
315 if incoming[id(child)] == 0:
316 waiting.append(child)
317 return nodes
318
319 def EncodeLinks(children, offsets, current):
320 '''Encode a list of children as one, two or three byte offsets'''
321 if children[0] is None:
322 # This is an <end_label> node and no links are follow in such nodes
323 assert(len(children) == 1)
324 return []
325 guess = 3 * len(children)
326 assert(children)
327 while True:
328 offset = current + guess
329 buf = []
330 for child in sorted(children, key = lambda x: -offsets[id(x)]):
331 last = len(buf)
332 distance = offset - offsets[id(child)]
333 assert(distance > 0 and distance < (1 << 21))
334
335 if distance < (1 << 6):
336 # A 6-bit offset: "s0xxxxxx"
337 buf.append(distance)
338 elif distance < (1 << 13):
339 # A 13-bit offset: "s10xxxxxxxxxxxxx"
340 buf.append(0x40 | (distance >> 8))
341 buf.append(distance & 0xFF)
342 else:
343 # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
344 buf.append(0x60 | (distance >> 16))
345 buf.append((distance >> 8) & 0xFF)
346 buf.append(distance & 0xFF)
347 # Distance in first link is relative to following record.
348 # Distance in other links are relative to previous link.
349 offset -= distance
350 if len(buf) == guess:
351 break
352 guess = len(buf)
353 # Set most significant bit to mark end of links in this node.
354 buf[last] |= (1 << 7)
355 buf.reverse()
356 return buf
357
358 def EncodeLabel(label):
359 '''
360 Encode a node label as a list of bytes with a trailing high byte >0x80.
361 '''
362 assert(label)
363 buf = [ord(c) for c in label]
364 buf.reverse()
365 # Set most significant bit to mark end of label in this node.
366 buf[0] |= (1 << 7)
367 return buf
368
369 def Encode(dafsa):
370 '''Encode a DAFSA to a list of bytes'''
371 output = []
372 offsets = {}
373
374 for node in reversed(TopSort(dafsa)):
375 output += EncodeLinks(node[1], offsets, len(output))
376 output += EncodeLabel(node[0])
377 offsets[id(node)] = len(output)
378
379 output += EncodeLinks(dafsa, offsets, len(output))
380 output.reverse()
381 return output
382
383 def ToCxx(data):
384 '''Generate C++ code from a list of encoded bytes'''
385 text = '/* This file is generated. Don\'t edit!\n\n'
386 text += ' The following command was used to generate the file:\n\n'
387 text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
388 text += '*/\n\n'
389 text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
390 for i in range(0, len(data), 12):
391 text += ' '
392 text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
393 text += ',\n'
394 text += '};\n'
395 return text
396
397 def WordsToCxx(words):
398 '''Generate C++ code from a word list'''
399 dafsa = ToDafsa(words)
400 for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
401 dafsa = fun(dafsa)
402 byte_values = Encode(dafsa)
403 return ToCxx(byte_values)
404
405 def ParseGperf(infile):
406 '''Parse gperf file and extract strings and return code'''
407 lines = [line.strip() for line in infile]
408 # Extract strings after the first '%%' and before the second '%%'.
409 begin = lines.index('%%') + 1
410 end = lines.index('%%', begin)
411 lines = lines[begin:end]
412 for line in lines:
413 assert(line[-3:-1] == ', ')
414 assert(line[-1] in '01234')
415 return [line[:-3] + line[-1] for line in lines]
416
417 if __name__ == '__main__':
418 if len(sys.argv) != 3:
419 print('usage: %s infile outfile' % sys.argv[0])
420 sys.exit(-1)
421 INFILE = open(sys.argv[1], 'r')
422 OUTFILE = open(sys.argv[2], 'w')
423 OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))
OLDNEW
« net/net.gyp ('K') | « net/tools/tld_cleanup/README ('k') | no next file » | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698