1 #!/usr/bin/python
M-A Ruel
2014/04/23 12:26:18
Please take a look at other scripts around of get
Pam (message me for reviews)
2014/04/23 13:13:56
Python coding style is described at http://www.chr
2
3 '''
4 A Deterministic acyclic finite state automaton (DAFSA) is a compact
5 representation of an unordered word list (dictionary).
6
7 http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
8
9 This Python program converts a list of strings to a byte array in C++.
10 It fetches strings and return values from a gperf file
11 and generates a C++ file with a byte array representing the graph that can be
12 used as a memory-efficient replacement for the perfect hash table.
13
14 The input strings are assumed to consist of printable 7-bit ASCII characters
M-A Ruel
2014/04/24 12:45:19
Why not just enforce it and print an error message
Olle Liljenzin
2014/04/29 12:41:52
It is enforced, lines 205 and 431-432. But it has
15 and the return values are assumed to be one-digit integers.
16
17 In this program a DAFSA is a diamond-shaped graph starting at a common
18 source node and ending at a common sink node. All internal nodes contain
19 a label and each word is represented by the labels in one path from
20 the source node to the sink node.
21
22 The following Python representation is used for nodes:
23
24 Source node: [ children ]
25 Internal node: (label, [ children ])
26 Sink node: None
27
28 The graph is first compressed by prefixes like a trie. In the next step
29 suffixes are compressed so that the graph gets diamond shaped. Finally
30 one-to-one linked nodes are replaced by nodes with the labels joined.
31
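The prefix-compression step can be sketched with a nested-dict trie. This is a standalone illustration, not code from the script; `to_trie` is a hypothetical helper name.

```python
def to_trie(words):
    """Build a nested-dict trie: shared prefixes are stored only once,
    which is the first compression step described above."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = None  # sink marker: a complete word ends here
    return root
```

For the word list ["aa1", "a2"] the single leading "a" node is shared, so `to_trie(['aa1', 'a2'])` yields `{'a': {'a': {'1': {None: None}}, '2': {None: None}}}`.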
32 The order of the operations is crucial since lookups will be performed
33 starting from the source with no backtracking. Thus a node must have at
34 most one child with a label starting with the same character. The output
35 is also arranged so that all jumps are to increasing addresses, thus forward
36 in memory.
37
38 The generated output has suffix-free decoding, so the leading bits in a link
39 (a reference to a child node) indicate whether it has a size of one, two or
40 three bytes and whether it is the last outgoing link from the current node.
41 A node label is terminated by a byte with the leading bit set.
42
43 The generated byte array can be described by the following BNF:
44
45 <byte> ::= < 8-bit value in range [0x00-0xFF] >
46
47 <char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
48 <end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
49 <return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >
50
51 <offset1> ::= < byte in range [0x00-0x3F] >
52 <offset2> ::= < byte in range [0x40-0x5F] >
53 <offset3> ::= < byte in range [0x60-0x7F] >
54
55 <end_offset1> ::= < byte in range [0x80-0xBF] >
56 <end_offset2> ::= < byte in range [0xC0-0xDF] >
57 <end_offset3> ::= < byte in range [0xE0-0xFF] >
58
59 <prefix> ::= <char>
60
61 <label> ::= <end_char>
62         | <char> <label>
63
64 <end_label> ::= <return_value>
65             | <char> <end_label>
66
67 <offset> ::= <offset1>
68          | <offset2> <byte>
69          | <offset3> <byte> <byte>
70
71 <end_offset> ::= <end_offset1>
72              | <end_offset2> <byte>
73              | <end_offset3> <byte> <byte>
74
75 <offsets> ::= <end_offset>
76           | <offset> <offsets>
77
78 <source> ::= <offsets>
79
80 <node> ::= <label> <offsets>
81        | <prefix> <node>
82        | <end_label>
83
84 <dafsa> ::= <source>
85         | <dafsa> <node>
86
87 Decoding:
88
89 <char> -> printable 7-bit ASCII character
90 <end_char> & 0x7F -> printable 7-bit ASCII character
91 <return_value> & 0x0F -> integer
92 <offset1> & 0x3F -> integer
93 ((<offset2> & 0x1F) << 8) + <byte> -> integer
94 ((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer
95
96 end_offset1, end_offset2 and end_offset3 are decoded the same as offset1,
97 offset2 and offset3 respectively.
98
99 The first offset in a list of offsets is the distance in bytes between the
100 offset itself and the first child node. Subsequent offsets are the distance
101 between the previous child node and the next child node. Thus each offset
102 links a node to a child node. The distance is always counted between start
103 addresses, i.e. the first byte in the decoded offset or the first byte in
104 the child node.
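The offset decoding rules above can be captured in a short helper. This is a sketch written from the BNF for this review, not part of the script; the name `decode_offset` is illustrative.

```python
def decode_offset(buf, pos):
    """Decode one offset at buf[pos] per the BNF above.

    Returns (distance, is_last, next_pos): the decoded distance, whether this
    was the final <end_offset*> in the list, and the position after it."""
    is_last = bool(buf[pos] & 0x80)    # end_offset* variants have the high bit set
    b = buf[pos] & 0x7F
    if b < 0x40:                        # <offset1>: 6-bit distance, one byte
        return b, is_last, pos + 1
    if b < 0x60:                        # <offset2>: 13-bit distance, two bytes
        return ((b & 0x1F) << 8) | buf[pos + 1], is_last, pos + 2
    # <offset3>: 21-bit distance, three bytes
    return ((b & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], is_last, pos + 3
```

On the Example 1 array below, the byte 0x81 at position 0 decodes to distance 1 with the end-of-list flag set.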
105 Example 1:
106
107 %%
108 aa, 1
109 a, 2
110 %%
111
112 The input is first parsed to a list of words:
113 ["aa1", "a2"]
114
115 A fully expanded graph is created from the words:
116 source = [node1, node4]
117 node1 = ("a", [node2])
118 node2 = ("a", [node3])
119 node3 = ("\x01", [sink])
120 node4 = ("a", [node5])
121 node5 = ("\x02", [sink])
122 sink = None
123
124 Compression results in the following graph:
125 source = [node1]
126 node1 = ("a", [node2, node3])
127 node2 = ("\x02", [sink])
128 node3 = ("a\x01", [sink])
129 sink = None
130
131 A C++ representation of the compressed graph is generated:
132
133 const unsigned char dafsa[7] = {
134   0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
135 };
136
137 The bytes in the generated array have the following meaning:
138
139  0: 0x81 <end_offset1>  child at position 0 + (0x81 & 0x3F) -> jump to 1
140
141  1: 0xE1 <end_char>     label character (0xE1 & 0x7F) -> match "a"
142  2: 0x02 <offset1>      child at position 2 + (0x02 & 0x3F) -> jump to 4
143
144  3: 0x81 <end_offset1>  child at position 4 + (0x81 & 0x3F) -> jump to 5
145  4: 0x82 <return_value> 0x82 & 0x0F -> return 2
146
147  5: 0x61 <char>         label character 0x61 -> match "a"
148  6: 0x81 <return_value> 0x81 & 0x0F -> return 1
150 Example 2:
151
152 %%
153 aa, 1
154 bbb, 2
155 baa, 1
156 %%
157
158 The input is first parsed to a list of words:
159 ["aa1", "bbb2", "baa1"]
160
161 Compression results in the following graph:
162 source = [node1, node2]
163 node1 = ("b", [node2, node3])
164 node2 = ("aa\x01", [sink])
165 node3 = ("bb\x02", [sink])
166 sink = None
167
168 A C++ representation of the compressed graph is generated:
169
170 const unsigned char dafsa[11] = {
171   0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
172 };
173
174 The bytes in the generated array have the following meaning:
175
176  0: 0x02 <offset1>      child at position 0 + (0x02 & 0x3F) -> jump to 2
177  1: 0x83 <end_offset1>  child at position 2 + (0x83 & 0x3F) -> jump to 5
178
179  2: 0xE2 <end_char>     label character (0xE2 & 0x7F) -> match "b"
180  3: 0x02 <offset1>      child at position 3 + (0x02 & 0x3F) -> jump to 5
181  4: 0x83 <end_offset1>  child at position 5 + (0x83 & 0x3F) -> jump to 8
182
183  5: 0x61 <char>         label character 0x61 -> match "a"
184  6: 0x61 <char>         label character 0x61 -> match "a"
185  7: 0x81 <return_value> 0x81 & 0x0F -> return 1
186
187  8: 0x62 <char>         label character 0x62 -> match "b"
188  9: 0x62 <char>         label character 0x62 -> match "b"
189 10: 0x82 <return_value> 0x82 & 0x0F -> return 2
190 '''
191
192 import sys
193
194 def ToDafsa(words):
M-A Ruel
2014/04/23 12:26:18
Correct me if I'm wrong, but nowhere you describe
Olle Liljenzin
2014/04/24 09:30:00
It returns a source node described on line 24.
Cl
M-A Ruel
2014/04/24 12:45:19
What's the relative perf impact of using collectio
195   '''
196   Generate a DAFSA from a word list and return the source node.
197
198   Each word is split into characters so that each character is represented by
199   a unique node. It is assumed the word list is not empty.
200   '''
201   assert(words)
M-A Ruel
2014/04/23 12:26:18
So the script will throw on an empty file?
Olle Liljenzin
2014/04/24 09:30:00
The output format doesn't support the empty graph.
202   def ToNodes(word):
203     '''Split words into characters'''
204     # Must be printable 7-bit ASCII.
205     assert(0x1F < ord(word[0]) < 0x80)
206     if len(word) == 1:
207       return chr(ord(word[0]) & 0x0F), [None]
208     return word[0], [ToNodes(word[1:])]
209   return [ToNodes(word) for word in words]
210
211 def ToWords(node):
212   '''
213   Generate a word list from all paths starting from an internal node.
214   '''
215   if node is None:
M-A Ruel
2014/04/23 12:26:18
if not node:
(everywhere)
216     return ['']
217   return [(node[0] + word) for child in node[1] for word in ToWords(child)]
218
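As a standalone illustration of this traversal, the node tuples below re-create Example 1's compressed graph from the docstring (the lowercase `to_words` is a copy made for this sketch, not the script's function):

```python
def to_words(node):
    """Same traversal as ToWords: collect every path, label by label."""
    if node is None:
        return ['']
    return [node[0] + word for child in node[1] for word in to_words(child)]

# Example 1's compressed graph: node1 is the only child of the source.
node2 = ('\x02', [None])
node3 = ('a\x01', [None])
node1 = ('a', [node2, node3])
assert to_words(node1) == ['a\x02', 'aa\x01']
```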
219 def Reverse(dafsa):
220   '''
221   Generate a new DAFSA that is reversed, so that the old sink node becomes the
222   new source node.
223   '''
224   def Dfs(node, parent):
225     '''Create new nodes'''
Pam (message me for reviews)
2014/04/23 13:13:56
This docstring could be more descriptive.
226     if node is None:
227       sink.append(parent)
228     elif id(node) not in nodemap:
229       nodemap[id(node)] = (node[0][::-1], [parent])
230       for child in node[1]:
231         Dfs(child, nodemap[id(node)])
232     else:
233       nodemap[id(node)][1].append(parent)
234
235   sink = []
M-A Ruel
2014/04/23 12:26:18
Move closure variables before the function using t
236   nodemap = {}
237   for node in dafsa:
238     Dfs(node, None)
239   return sink
240
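The effect of the reversal can be checked with a standalone copy of the routine plus a small path-collecting helper (illustrative only; lowercase names to distinguish them from the script's functions):

```python
def reverse(dafsa):
    """Standalone version of Reverse: flip every edge so that the old sink
    becomes the new source; labels are reversed character-wise."""
    sink = []
    nodemap = {}
    def dfs(node, parent):
        if node is None:
            sink.append(parent)                 # parent now borders the new source
        elif id(node) not in nodemap:
            nodemap[id(node)] = (node[0][::-1], [parent])
            for child in node[1]:
                dfs(child, nodemap[id(node)])
        else:
            nodemap[id(node)][1].append(parent)
    for node in dafsa:
        dfs(node, None)
    return sink

def paths(node):
    """Collect all words reachable from node (None acts as the sink)."""
    if node is None:
        return ['']
    return [node[0] + word for child in node[1] for word in paths(child)]

# Reversing Example 1's compressed graph yields the same words, reversed.
node1 = ('a', [('\x02', [None]), ('a\x01', [None])])
words = sorted(word for node in reverse([node1]) for word in paths(node))
assert words == sorted(w[::-1] for w in ['a\x02', 'aa\x01'])
```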
241 def JoinLabels(dafsa):
242   '''
243   Generate a new DAFSA where internal nodes are merged if there is a one-to-one
244   connection.
245   '''
246   def CountParents(node):
247     '''Count incoming references'''
248     if id(node) in parentcount:
249       parentcount[id(node)] += 1
250     else:
251       parentcount[id(node)] = 1
252       for child in node[1]:
253         CountParents(child)
254
255   def Join(node):
256     '''Create new nodes'''
257     if id(node) not in nodemap:
258       children = [Join(child) for child in node[1]]
259       if len(children) == 1 and parentcount[id(node[1][0])] == 1:
260         child = children[0]
261         nodemap[id(node)] = (node[0] + child[0], child[1])
262       else:
263         nodemap[id(node)] = (node[0], children)
264     return nodemap[id(node)]
265
266   parentcount = { id(None): 2 }
267   for node in dafsa:
268     CountParents(node)
269   nodemap = { id(None): None }
270   return [Join(node) for node in dafsa]
271
272 def JoinSuffixes(dafsa):
273   '''
274   Generate a new DAFSA where nodes that represent the same word lists towards
275   the sink are merged.
276   '''
277   def Join(node):
278     '''Return a macthing node. A new node is created if not matching node
Pam (message me for reviews)
2014/04/23 13:13:56
typos "matching"; "if no matching node"
279     exists. The graph is accessed in dfs order.
280     '''
281     suffixes = frozenset(ToWords(node))
282     if suffixes not in nodemap:
283       nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
284     return nodemap[suffixes]
285
286   nodemap = { frozenset(('',)): None }
287   return [Join(node) for node in dafsa]
288
289 def TopSort(dafsa):
290   '''Generate list of nodes in topological sort order'''
291   def CountIncoming(node):
292     '''Count incoming references'''
293     if node is not None:
294       if id(node) not in incoming:
295         incoming[id(node)] = 1
296         for child in node[1]:
297           CountIncoming(child)
298       else:
299         incoming[id(node)] += 1
300
301   incoming = {}
302   for node in dafsa:
303     CountIncoming(node)
304
305   for node in dafsa:
306     incoming[id(node)] -= 1
307
308   waiting = [node for node in dafsa if incoming[id(node)] == 0]
309   nodes = []
310
311   while waiting:
312     node = waiting.pop()
313     assert(incoming[id(node)] == 0)
314     nodes.append(node)
315     for child in node[1]:
316       if child is not None:
317         incoming[id(child)] -= 1
318         if incoming[id(child)] == 0:
319           waiting.append(child)
320   return nodes
321
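The same counting idea can be shown as a generic sketch over an explicit adjacency dict (hypothetical helper names, not part of the script; TopSort does this with `id()`-keyed incoming counts over node tuples):

```python
from collections import deque

def topo_order(nodes, children):
    """Kahn's algorithm: repeatedly emit a node with no unprocessed parents."""
    incoming = {n: 0 for n in nodes}
    for n in nodes:
        for c in children.get(n, ()):
            incoming[c] += 1
    waiting = deque(n for n in nodes if incoming[n] == 0)
    order = []
    while waiting:
        n = waiting.popleft()
        order.append(n)
        for c in children.get(n, ()):
            incoming[c] -= 1
            if incoming[c] == 0:
                waiting.append(c)
    return order
```

For a diamond graph source -> {a, b} -> sink, any valid order starts with the source and ends with the sink, which is the property Encode relies on for forward-only jumps.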
322 def EncodeLinks(children, offsets, current):
323   '''Encode a list of children as one, two or three byte offsets'''
324   if children[0] is None:
325     # This is an <end_label> node and no links are follow in such nodes
Pam (message me for reviews)
2014/04/23 13:13:56
language nit: "no links follow such nodes"
326     assert(len(children) == 1)
327     return []
328   guess = 3 * len(children)
329   assert(children)
330   while True:
331     offset = current + guess
332     buf = []
333     for child in sorted(children, key = lambda x: -offsets[id(x)]):
334       last = len(buf)
335       distance = offset - offsets[id(child)]
336       assert(distance > 0 and distance < (1 << 21))
337
338       if distance < (1 << 6):
339         # A 6-bit offset: "s0xxxxxx"
340         buf.append(distance)
341       elif distance < (1 << 13):
342         # A 13-bit offset: "s10xxxxxxxxxxxxx"
343         buf.append(0x40 | (distance >> 8))
344         buf.append(distance & 0xFF)
345       else:
346         # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
347         buf.append(0x60 | (distance >> 16))
348         buf.append((distance >> 8) & 0xFF)
349         buf.append(distance & 0xFF)
350       # The distance in the first link is relative to the following record.
351       # Distances in other links are relative to the previous link.
352       offset -= distance
353     if len(buf) == guess:
354       break
355     guess = len(buf)
356   # Set most significant bit to mark end of links in this node.
357   buf[last] |= (1 << 7)
358   buf.reverse()
359   return buf
360
361 def EncodeLabel(label):
362   '''
363   Encode a node label as a list of bytes with a trailing high byte >= 0x80.
364   '''
365   assert(label)
366   buf = [ord(c) for c in label]
367   buf.reverse()
368   # Set most significant bit to mark end of label in this node.
369   buf[0] |= (1 << 7)
370   return buf
371
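A worked example of the label encoding (a standalone sketch; the whole output array is reversed once at the end of Encode, which is why the terminator ends up first here):

```python
def encode_label(label):
    """Reverse the label and set the high bit on what will become the last
    byte after the final output reversal, mirroring EncodeLabel."""
    buf = [ord(c) for c in reversed(label)]
    buf[0] |= 0x80
    return buf

# Example 1's node3 label "a\x01": after the global reversal these bytes
# appear in order 0x61, 0x81 (positions 5-6 of the generated array).
assert encode_label('a\x01') == [0x81, 0x61]
```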
372 def EncodePrefix(label):
373   '''
374   Encode a node label as a list of bytes without a trailing high byte.
375
376   This method encodes a node if there is exactly one child and the
377   child follows immediately after, so that no jump is needed. This label
378   will then be a prefix to the label in the child node.
379   '''
380   assert(label)
381   return [ord(c) for c in reversed(label)]
382
383 def Encode(dafsa):
384   '''Encode a DAFSA to a list of bytes'''
385   output = []
386   offsets = {}
387
388   for node in reversed(TopSort(dafsa)):
389     if len(node[1]) == 1 and node[1][0] is not None and \
390        (offsets[id(node[1][0])] == len(output)):
391       output += EncodePrefix(node[0])
M-A Ruel
2014/04/23 12:26:18
optional nit: I generally prefer .append()
Olle Liljenzin
2014/04/24 09:30:00
List concat and list append are different operatio
M-A Ruel
2014/04/24 12:45:19
Eh, I also prefer .extend(). Using += is really co
392     else:
393       output += EncodeLinks(node[1], offsets, len(output))
394       output += EncodeLabel(node[0])
395     offsets[id(node)] = len(output)
396
397   output += EncodeLinks(dafsa, offsets, len(output))
398   output.reverse()
399   return output
400
400 | |
401 def ToCxx(data): | |
402 '''Generate C++ code from a list of encoded bytes''' | |
403 text = '/* This file is generated. Don\'t edit!\n\n' | |
M-A Ruel
2014/04/23 12:26:18
I'd put DO NOT EDIT in all caps. Yelling is good s
404   text += ' The following command was used to generate the file:\n\n'
405   text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
M-A Ruel
2014/04/23 12:26:18
Not necessary IMHO.
406   text += '*/\n\n'
407   text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
408   for i in range(0, len(data), 12):
409     text += '  '
410     text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
M-A Ruel
2014/04/23 12:26:18
[] are not needed
411     text += ',\n'
412   text += '};\n'
413   return text
414
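The array formatting can be exercised in isolation (an illustrative re-implementation; `to_c_array` is a made-up name and the command-line comment is omitted):

```python
def to_c_array(data, name='kDafsa', per_line=12):
    """Format a list of byte values as a C++ unsigned char array,
    twelve values per line, as ToCxx does."""
    lines = ['const unsigned char %s[%d] = {' % (name, len(data))]
    for i in range(0, len(data), per_line):
        lines.append('  ' + ', '.join('0x%02x' % b for b in data[i:i + per_line]) + ',')
    lines.append('};')
    return '\n'.join(lines) + '\n'
```

Feeding it the seven bytes from Example 1 reproduces the array shown in the docstring (with lowercase hex digits).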
415 def WordsToCxx(words):
416   '''Generate C++ code from a word list'''
417   dafsa = ToDafsa(words)
418   for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
419     dafsa = fun(dafsa)
420   byte_values = Encode(dafsa)
421   return ToCxx(byte_values)
M-A Ruel
2014/04/23 12:26:18
return ToCxx(Encode(dafsa))
422
423 def ParseGperf(infile):
424   '''Parse gperf file and extract strings and return values'''
425   lines = [line.strip() for line in infile]
426   # Extract strings after the first '%%' and before the second '%%'.
427   begin = lines.index('%%') + 1
428   end = lines.index('%%', begin)
429   lines = lines[begin:end]
430   for line in lines:
431     assert(line[-3:-1] == ', ')
432     assert(line[-1] in '01234')
433   return [line[:-3] + line[-1] for line in lines]
M-A Ruel
2014/04/23 12:26:18
Could you add in a comment how the lines should lo
Pam (message me for reviews)
2014/04/23 13:13:56
Also some unit tests for this script. There are va
Olle Liljenzin
2014/04/24 09:30:00
There are two examples, see lines 107 and 152.
M-A Ruel
2014/04/24 12:45:19
Would making it a generator be faster? E.g. replac
Olle Liljenzin
2014/04/29 12:41:52
You are probably right, but I don't understand exa
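The parsing step can be tried in isolation with a standalone sketch of the same extraction (`parse_gperf_lines` is an illustrative name; gperf keyword lines between the `%%` markers look like "string, value"):

```python
def parse_gperf_lines(lines):
    """Extract 'word, value' pairs between the first two '%%' markers and
    append the value digit to each word, as ParseGperf does."""
    lines = [line.strip() for line in lines]
    begin = lines.index('%%') + 1
    end = lines.index('%%', begin)
    words = []
    for line in lines[begin:end]:
        assert line[-3:-1] == ', ' and line[-1] in '01234'
        words.append(line[:-3] + line[-1])
    return words
```

On Example 2's input this yields the word list `['aa1', 'bbb2', 'baa1']` from the docstring.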
434
435 if __name__ == '__main__':
436   if len(sys.argv) != 3:
437     print('usage: %s infile outfile' % sys.argv[0])
438     sys.exit(-1)
439   INFILE = open(sys.argv[1], 'r')
440   OUTFILE = open(sys.argv[2], 'w')
441   OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))