#!/usr/bin/python

'''
A Deterministic acyclic finite state automaton (DAFSA) is a compact
representation of an unordered word list (dictionary).

http://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton

This Python program converts a list of strings to a byte array in C++. It
fetches strings and return values from a gperf file and generates a C++ file
with a byte array representing a graph that can be used as a memory-efficient
replacement for the perfect hash table.

The input strings are assumed to consist of printable 7-bit ASCII characters
and the return values are assumed to be one-digit integers.

In this program a DAFSA is a diamond-shaped graph starting at a common
source node and ending at a common sink node. All internal nodes contain
a label and each word is represented by the labels in one path from
the source node to the sink node.

The following Python representation is used for nodes:

  Source node: [ children ]
  Internal node: (label, [ children ])
  Sink node: None

The graph is first compressed by prefixes like a trie. In the next step
suffixes are compressed so that the graph gets diamond-shaped. Finally
one-to-one linked nodes are replaced by nodes with the labels joined.

The order of the operations is crucial since lookups will be performed
starting from the source with no backtracking. Thus a node must have at
most one child with a label starting with the same character. The output
is also arranged so that all jumps are to increasing addresses, thus forward
in memory.

The generated output has suffix-free decoding: the leading bits in a link
(a reference to a child node) indicate whether it has a size of one, two or
three bytes and whether it is the last outgoing link from the current node.
A node label is terminated by a byte with the leading bit set.

The generated byte array can be described by the following BNF:

<byte> ::= < 8-bit value in range [0x00-0xFF] >

<char> ::= < printable 7-bit ASCII character, byte in range [0x20-0x7F] >
<end_char> ::= < char + 0x80, byte in range [0xA0-0xFF] >
<return_value> ::= < value + 0x80, byte in range [0x80-0x8F] >

<offset1> ::= < byte in range [0x00-0x3F] >
<offset2> ::= < byte in range [0x40-0x5F] >
<offset3> ::= < byte in range [0x60-0x7F] >

<end_offset1> ::= < byte in range [0x80-0xBF] >
<end_offset2> ::= < byte in range [0xC0-0xDF] >
<end_offset3> ::= < byte in range [0xE0-0xFF] >

<label> ::= <end_char>
          | <char> <label>

<end_label> ::= <return_value>
              | <char> <end_label>

<offset> ::= <offset1>
           | <offset2> <byte>
           | <offset3> <byte> <byte>

<end_offset> ::= <end_offset1>
               | <end_offset2> <byte>
               | <end_offset3> <byte> <byte>

<offsets> ::= <end_offset>
            | <offset> <offsets>

<source> ::= <offsets>

<node> ::= <label> <offsets>
         | <end_label>

<dafsa> ::= <source>
          | <dafsa> <node>

Decoding:

<char> -> printable 7-bit ASCII character
<end_char> & 0x7F -> printable 7-bit ASCII character
<return_value> & 0x0F -> integer
<offset1> & 0x3F -> integer
((<offset2> & 0x1F) << 8) + <byte> -> integer
((<offset3> & 0x1F) << 16) + (<byte> << 8) + <byte> -> integer

end_offset1, end_offset2 and end_offset3 are decoded the same as offset1,
offset2 and offset3 respectively.

The first offset in a list of offsets is the distance in bytes between the
offset itself and the first child node. Subsequent offsets are the distance
between the previous child node and the next child node. Thus each offset
links a node to a child node. The distance is always counted between start
addresses, i.e. the first byte of the decoded offset or the first byte of the
child node.

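For illustration only, a consumer of the byte array could decode it roughly as
sketched below. These helpers are hypothetical and not part of this script;
they just follow the format described above and return the value stored for a
key, or None if the key is not in the set:

  def Children(dafsa, pos):
    # Yield the positions of all child nodes linked from the offset list
    # starting at pos.
    first = True
    child = 0
    while True:
      low = dafsa[pos] & 0x7F
      if low < 0x40:    # <offset1>
        size, distance = 1, low
      elif low < 0x60:  # <offset2> <byte>
        size, distance = 2, ((low & 0x1F) << 8) | dafsa[pos + 1]
      else:             # <offset3> <byte> <byte>
        size, distance = 3, (((low & 0x1F) << 16) | (dafsa[pos + 1] << 8) |
                             dafsa[pos + 2])
      child = (pos if first else child) + distance
      first = False
      yield child
      if dafsa[pos] & 0x80:  # An <end_offset*> byte ends the link list.
        return
      pos += size

  def Lookup(dafsa, key):
    def Match(pos, key):
      # Consume the label of the node at pos, then try its children.
      while dafsa[pos] < 0x80:             # <char>
        if not key or key[0] != chr(dafsa[pos]):
          return None
        key, pos = key[1:], pos + 1
      if dafsa[pos] < 0x90:                # <return_value>
        return dafsa[pos] & 0x0F if not key else None
      if not key or key[0] != chr(dafsa[pos] & 0x7F):
        return None                        # <end_char> mismatch.
      for child in Children(dafsa, pos + 1):
        value = Match(child, key[1:])
        if value is not None:
          return value
      return None
    for child in Children(dafsa, 0):       # <source> is just an offset list.
      value = Match(child, key)
      if value is not None:
        return value
    return None

With the seven byte array from Example 1 below, Lookup(dafsa, 'aa') would
return 1, Lookup(dafsa, 'a') would return 2 and Lookup(dafsa, 'ab') would
return None.
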
Example 1:

%%
aa, 1
a, 2
%%

The input is first parsed to a list of words:
["aa1", "a2"]

A fully expanded graph is created from the words:
source = [node1, node4]
node1 = ("a", [node2])
node2 = ("a", [node3])
node3 = ("\x01", [sink])
node4 = ("a", [node5])
node5 = ("\x02", [sink])
sink = None

Compression results in the following graph:
source = [node1]
node1 = ("a", [node2, node3])
node2 = ("\x02", [sink])
node3 = ("a\x01", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[7] = {
  0x81, 0xE1, 0x02, 0x81, 0x82, 0x61, 0x81,
};

The bytes in the generated array have the following meaning:

0: 0x81 <end_offset1> child at position 0 + (0x81 & 0x3F) -> jump to 1

1: 0xE1 <end_char> label character (0xE1 & 0x7F) -> match "a"
2: 0x02 <offset1> child at position 2 + (0x02 & 0x3F) -> jump to 4

3: 0x81 <end_offset1> child at position 4 + (0x81 & 0x3F) -> jump to 5
4: 0x82 <return_value> 0x82 & 0x0F -> return 2

5: 0x61 <char> label character 0x61 -> match "a"
6: 0x81 <return_value> 0x81 & 0x0F -> return 1

Example 2:

%%
aa, 1
bbb, 2
baa, 1
%%

The input is first parsed to a list of words:
["aa1", "bbb2", "baa1"]

Compression results in the following graph:
source = [node1, node2]
node1 = ("b", [node2, node3])
node2 = ("aa\x01", [sink])
node3 = ("bb\x02", [sink])
sink = None

A C++ representation of the compressed graph is generated:

const unsigned char dafsa[11] = {
  0x02, 0x83, 0xE2, 0x02, 0x83, 0x61, 0x61, 0x81, 0x62, 0x62, 0x82,
};

The bytes in the generated array have the following meaning:

0: 0x02 <offset1> child at position 0 + (0x02 & 0x3F) -> jump to 2
1: 0x83 <end_offset1> child at position 2 + (0x83 & 0x3F) -> jump to 5

2: 0xE2 <end_char> label character (0xE2 & 0x7F) -> match "b"
3: 0x02 <offset1> child at position 3 + (0x02 & 0x3F) -> jump to 5
4: 0x83 <end_offset1> child at position 5 + (0x83 & 0x3F) -> jump to 8

5: 0x61 <char> label character 0x61 -> match "a"
6: 0x61 <char> label character 0x61 -> match "a"
7: 0x81 <return_value> 0x81 & 0x0F -> return 1

8: 0x62 <char> label character 0x62 -> match "b"
9: 0x62 <char> label character 0x62 -> match "b"
10: 0x82 <return_value> 0x82 & 0x0F -> return 2
'''

import sys

def ToDafsa(words):
  '''
  Generate a DAFSA from a word list and return the source node

  Each word is split into characters so that each character is represented by
  a unique node. It is assumed the word list is not empty.
  '''
  assert(words)
  def ToNodes(word):
    '''Split words into characters'''
    # Must be printable 7-bit ASCII.
    assert(0x1F < ord(word[0]) < 0x80)
    if len(word) == 1:
      # The last character is the return value digit; store it as \x00-\x0F.
      return chr(ord(word[0]) & 0x0F), [None]
    return word[0], [ToNodes(word[1:])]
  return [ToNodes(word) for word in words]
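# Example (illustrative): ToDafsa(['a2']) == [('a', [('\x02', [None])])], i.e.
# the word "a" with return value 2 becomes a two node chain ending in the sink.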

def ToWords(node):
  '''
  Generate a word list from all paths starting from an internal node
  '''
  if node is None:
    return ['']
  return [(node[0] + word) for child in node[1] for word in ToWords(child)]
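# Example (illustrative): ToWords(('a', [('\x02', [None])])) == ['a\x02'].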

def Reverse(dafsa):
  '''
  Generate a new DAFSA that is reversed, so that the old sink node becomes the
  new source node.
  '''
  def Dfs(node, parent):
    '''Create new nodes'''
    if node is None:
      sink.append(parent)
    elif id(node) not in nodemap:
      nodemap[id(node)] = (node[0][::-1], [parent])
      for child in node[1]:
        Dfs(child, nodemap[id(node)])
    else:
      nodemap[id(node)][1].append(parent)

  sink = []
  nodemap = {}
  for node in dafsa:
    Dfs(node, None)
  return sink
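# Example (illustrative): Reverse([('a', [('b\x01', [None])])]) ==
# [('\x01b', [('a', [None])])]; labels are reversed and all edges now point
# from the old sink towards the old source.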

def JoinLabels(dafsa):
  '''
  Generate a new DAFSA where internal nodes are merged if there is a
  one-to-one connection.
  '''
  def CountParents(node):
    '''Count incoming references'''
    if id(node) in parentcount:
      parentcount[id(node)] += 1
    else:
      parentcount[id(node)] = 1
      for child in node[1]:
        CountParents(child)

  def Join(node):
    '''Create new nodes'''
    if id(node) not in nodemap:
      children = [Join(child) for child in node[1]]
      if len(children) == 1 and parentcount[id(node[1][0])] == 1:
        child = children[0]
        nodemap[id(node)] = (node[0] + child[0], child[1])
      else:
        nodemap[id(node)] = (node[0], children)
    return nodemap[id(node)]

  # The sink is seeded with two parents so that it is never merged into a node.
  parentcount = { id(None): 2 }
  for node in dafsa:
    CountParents(node)
  nodemap = { id(None): None }
  return [Join(node) for node in dafsa]
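# Example (illustrative): JoinLabels([('a', [('b', [None])])]) ==
# [('ab', [None])]; the one-to-one linked "a" and "b" nodes are merged.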

def JoinSuffixes(dafsa):
  '''
  Generate a new DAFSA where nodes that represent the same word lists towards
  the sink are merged.
  '''
  def Join(node):
    '''Return a matching node. A new node is created if no matching node
    exists. The graph is accessed in DFS order.
    '''
    suffixes = frozenset(ToWords(node))
    if suffixes not in nodemap:
      nodemap[suffixes] = (node[0], [Join(child) for child in node[1]])
    return nodemap[suffixes]

  nodemap = { frozenset(('',)): None }
  return [Join(node) for node in dafsa]
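# Example (illustrative): in JoinSuffixes([('a', [('c\x01', [None])]),
# ('b', [('c\x01', [None])])]) the two distinct ('c\x01', [None]) children
# represent the same suffix set and are replaced by one shared node.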

def TopSort(dafsa):
  '''Generate list of nodes in topological sort order'''
  def CountIncoming(node):
    '''Count incoming references'''
    if node is not None:
      if id(node) not in incoming:
        incoming[id(node)] = 1
        for child in node[1]:
          CountIncoming(child)
      else:
        incoming[id(node)] += 1

  incoming = {}
  for node in dafsa:
    CountIncoming(node)

  # The references from the source are not real graph edges; remove them.
  for node in dafsa:
    incoming[id(node)] -= 1

  waiting = [node for node in dafsa if incoming[id(node)] == 0]
  nodes = []

  while waiting:
    node = waiting.pop()
    assert(incoming[id(node)] == 0)
    nodes.append(node)
    for child in node[1]:
      if child is not None:
        incoming[id(child)] -= 1
        if incoming[id(child)] == 0:
          waiting.append(child)
  return nodes
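# Note (illustrative): this is Kahn's algorithm; parents always precede their
# children in the returned list, so Encode() below can emit nodes in reverse
# and keep every link pointing forward in the final byte array.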

def EncodeLinks(children, offsets, current):
  '''Encode a list of children as one, two or three byte offsets'''
  if children[0] is None:
    # This is an <end_label> node; no links follow such nodes.
    assert(len(children) == 1)
    return []
  guess = 3 * len(children)
  assert(children)
  # The size of each link depends on its distance, and the distances depend on
  # the total size of the link list, so iterate until the size guess is stable.
  while True:
    offset = current + guess
    buf = []
    for child in sorted(children, key = lambda x: -offsets[id(x)]):
      last = len(buf)
      distance = offset - offsets[id(child)]
      assert(distance > 0 and distance < (1 << 21))

      if distance < (1 << 6):
        # A 6-bit offset: "s0xxxxxx"
        buf.append(distance)
      elif distance < (1 << 13):
        # A 13-bit offset: "s10xxxxxxxxxxxxx"
        buf.append(0x40 | (distance >> 8))
        buf.append(distance & 0xFF)
      else:
        # A 21-bit offset: "s11xxxxxxxxxxxxxxxxxxxxx"
        buf.append(0x60 | (distance >> 16))
        buf.append((distance >> 8) & 0xFF)
        buf.append(distance & 0xFF)
      # Distance in the first link is relative to the following record.
      # Distances in the other links are relative to the previous link.
      offset -= distance
    if len(buf) == guess:
      break
    guess = len(buf)
  # Set most significant bit to mark end of links in this node.
  buf[last] |= (1 << 7)
  buf.reverse()
  return buf

def EncodeLabel(label):
  '''
  Encode a node label as a list of bytes with the high bit set on the trailing
  byte.
  '''
  assert(label)
  buf = [ord(c) for c in label]
  buf.reverse()
  # Set most significant bit to mark end of label in this node.
  buf[0] |= (1 << 7)
  return buf
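# Example (illustrative): EncodeLabel('aa\x01') == [0x81, 0x61, 0x61]. The
# bytes are emitted back to front because Encode() reverses the whole stream
# at the end.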

def Encode(dafsa):
  '''Encode a DAFSA to a list of bytes'''
  output = []
  offsets = {}

  for node in reversed(TopSort(dafsa)):
    output += EncodeLinks(node[1], offsets, len(output))
    output += EncodeLabel(node[0])
    offsets[id(node)] = len(output)

  output += EncodeLinks(dafsa, offsets, len(output))
  output.reverse()
  return output
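# For reference, encoding the compressed graph from Example 1 in the module
# docstring should yield the seven bytes listed there.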

def ToCxx(data):
  '''Generate C++ code from a list of encoded bytes'''
  text = '/* This file is generated. Don\'t edit!\n\n'
  text += ' The following command was used to generate the file:\n\n'
  text += ' \"' + ' '.join(sys.argv) + '\"\n\n'
  text += '*/\n\n'
  text += 'const unsigned char kDafsa[%s] = {\n' % len(data)
  for i in range(0, len(data), 12):
    text += '  '
    text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]])
    text += ',\n'
  text += '};\n'
  return text

def WordsToCxx(words):
  '''Generate C++ code from a word list'''
  dafsa = ToDafsa(words)
  # Reverse + JoinSuffixes + Reverse compresses shared prefixes (cf. the module
  # docstring), the second JoinSuffixes compresses shared suffixes and
  # JoinLabels finally collapses one-to-one linked nodes.
  for fun in (Reverse, JoinSuffixes, Reverse, JoinSuffixes, JoinLabels):
    dafsa = fun(dafsa)
  byte_values = Encode(dafsa)
  return ToCxx(byte_values)

def ParseGperf(infile):
  '''Parse a gperf file and extract the strings and return values'''
  lines = [line.strip() for line in infile]
  # Extract strings after the first '%%' and before the second '%%'.
  begin = lines.index('%%') + 1
  end = lines.index('%%', begin)
  lines = lines[begin:end]
  for line in lines:
    assert(line[-3:-1] == ', ')
    assert(line[-1] in '01234')
  return [line[:-3] + line[-1] for line in lines]
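# Example (illustrative): the gperf lines '%%', 'aa, 1', 'a, 2', '%%' parse to
# ['aa1', 'a2'], matching Example 1 in the module docstring.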

if __name__ == '__main__':
  if len(sys.argv) != 3:
    print('usage: %s infile outfile' % sys.argv[0])
    sys.exit(-1)
  INFILE = open(sys.argv[1], 'r')
  OUTFILE = open(sys.argv[2], 'w')
  OUTFILE.write(WordsToCxx(ParseGperf(INFILE)))
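# Example invocation (hypothetical file names):
#   python this_script.py words.gperf words-inc.cc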