net/base/registry_controlled_domains/registry_controlled_domain.cc - Issue 197183002: Reduce footprint of registry controlled domain table

Side by Side Diff: net/base/registry_controlled_domains/registry_controlled_domain.cc

Issue 197183002: Reduce footprint of registry controlled domain table (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master

Patch Set: Created 6 years, 9 months ago


OLD	NEW
1 // Copyright (c) 2012 The Chromium Authors. All rights reserved.	1 // Copyright (c) 2012 The Chromium Authors. All rights reserved.

2 // Use of this source code is governed by a BSD-style license that can be	2 // Use of this source code is governed by a BSD-style license that can be

3 // found in the LICENSE file.	3 // found in the LICENSE file.

4	4

5 // NB: Modelled after Mozilla's code (originally written by Pamela Greene,	5 // NB: Modelled after Mozilla's code (originally written by Pamela Greene,

6 // later modified by others), but almost entirely rewritten for Chrome.	6 // later modified by others), but almost entirely rewritten for Chrome.

7 // (netwerk/dns/src/nsEffectiveTLDService.cpp)	7 // (netwerk/dns/src/nsEffectiveTLDService.cpp)

8 /* *** BEGIN LICENSE BLOCK ***	8 /* *** BEGIN LICENSE BLOCK ***

9 * Version: MPL 1.1/GPL 2.0/LGPL 2.1	9 * Version: MPL 1.1/GPL 2.0/LGPL 2.1

10 *	10 *

(...skipping 35 matching lines...)
46 #include "net/base/registry_controlled_domains/registry_controlled_domain.h"	46 #include "net/base/registry_controlled_domains/registry_controlled_domain.h"

47	47

48 #include "base/logging.h"	48 #include "base/logging.h"

49 #include "base/strings/string_util.h"	49 #include "base/strings/string_util.h"

50 #include "base/strings/utf_string_conversions.h"	50 #include "base/strings/utf_string_conversions.h"

51 #include "net/base/net_module.h"	51 #include "net/base/net_module.h"

52 #include "net/base/net_util.h"	52 #include "net/base/net_util.h"

53 #include "url/gurl.h"	53 #include "url/gurl.h"

54 #include "url/url_parse.h"	54 #include "url/url_parse.h"

55	55

56 #include "effective_tld_names.cc"

57

58 namespace net {	56 namespace net {

59 namespace registry_controlled_domains {	57 namespace registry_controlled_domains {

60	58

61 namespace {	59 namespace {

	60 #include "effective_tld_names-inc.cc"

	61

	62 // See make_dafsa.py for documentation of the generated dafsa byte array.

	63

	64 const unsigned char* graph = kDafsa;

	65

	66 int LookupString(const unsigned char* pos, const char* key, int length) {
	Ryan Sleevi 2014/03/19 03:08:32 1) Needs documentation 2) int for byte length = better to use size_t here 3) Would it make more sense to use a StringPiece, which will get optimized out?
	67 const char* end = key + length;

	68 while (true) {

	69 // Read links.

	70 const unsigned char* child = pos;
	Ryan Sleevi 2014/03/18 01:23:58 Generally speaking, this level of manual string manipulation scares the crap out of me. I just mention it because it's going to take me a lot more time to reason about the security of, even if it's (probably) correct. That said, the magic values used here (0x60, 0x40, 0x80, 0x0F) would all be better if given symbolic names and documented heavily within this code: const char kTerminalNodeMask = 0x80; const char kOffsetLengthMask = 0x60; const char kOffsetMask = 0x1F; etc.
	71 while (true) {

	72 bool is_last = (*pos & 0x80) != 0;
	Ryan Sleevi 2014/03/19 03:08:32 Normally within net/ we encapsulate a lot of the string parsing within iterators (see header parsing, header value parsing, etc.). The optimizer should optimize it out all the same, but it makes it easier to read. For example, you could create an iterator to read links, where the returned type can provide disambiguation about end of key, child, etc. Examples: see https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_util... https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_util... https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_requ...
	Olle Liljenzin 2014/03/19 14:19:42 Iterators may increase readability by adding a familiar interface to the implementation, but I can't see how that would fit here. Reading a link has side effects (both the pos and child pointers are incremented), and hiding that in an overloaded operator or GetNext() method would not make the code easier to read. The current code does not contain repeated patterns, so adding abstraction layers would just blow up code size and split the implementation across different locations, making it harder to follow what really happens when debugging. The code is also extremely performance sensitive and compilers are far from perfect. If the compiler (on some platform) fails to eliminate some redundant instructions, we will have to read the assembler code to find out where it failed. The current code is just about 50-60 statements (not counting white space and comments), and I would prefer not to make it much larger.
	73 switch (*pos & 0x60) {

	74 case 0x60: // Read three byte offset

	75 child += ((pos[0] & 0x1F) << 16) | (pos[1] << 8) | pos[2];

	76 pos += 3;

	77 break;

	78

	79 case 0x40: // Read two byte offset

	80 child += ((pos[0] & 0x1F) << 8) | pos[1];

	81 pos += 2;

	82 break;

	83

	84 default: // Read one byte offset

	85 child += pos[0] & 0x3F;

	86 pos += 1;

	87 }

	88 if (key == end) {

	89 // End of key reached. A matching child node must be labeled by a

	90 // single byte in range 0x80-0x9F encoding the return value.
	Ryan Sleevi 2014/03/29 02:14:29 As a justification for why we need to split this code into something more readable: it took me way too long to figure out whether the comment was correct and the code was wrong, or whether the code was correct and the comment wrong. In this case, it's the latter. It's a single byte in the range 0x80 - 0x8F, as reflected in the comments for make_dafsa.py.
	91 if (!(*child & 0x60)) {

	92 // A return value must always be last in a label. If not the byte

	93 // array is corrupt.

	94 DCHECK(*child & 0x80);

	95

	96 // Extract return value.

	97 return *child & 0x0F;

	98 }

	99 // The key matches and is exhausted, but child has more characters.

	100 if (is_last) {

	101 return -1;
	Ryan Sleevi 2014/03/19 03:08:32 Not a fan of magic values. If you took base::StringPiece(), returning a size_t (or npos) would be far clearer. [Edit: After having written that, I realize it's instead some magic flags value; definitely need to document this stuff, along with providing a meaningful constant for this negative return value]
	102 }

	103 // Try next child.

	104 } else {

	105 // If child node has a single char label.

	106 if (*child & 0x80) {

	107 // If key matches char in child node label.

	108 if ((*child & 0x7F) == *key) {

	109 // Consume matching label. Step down in child node and read links.

	110 ++key;

	111 ++child;

	112 pos = child;

	113 } else {

	114 // Key doesn't match label in this child node.

	115 if (is_last) {

	116 return -1;

	117 }

	118 // Try next child.

	119 }

	120 } else {

	121 // Child node label has multi character label.

	122 if (*child == *key) {

	123 // Found a matching link. Step down in child node.

	124 ++key;

	125 pos = child + 1;

	126 break;

	127 } else {

	128 // Key doesn't match label in this child node.

	129 if (is_last) {

	130 return -1;

	131 }

	132 // Try next child.

	133 }

	134 }

	135 }

	136 }

	137 // Compare key with node label. First character is already consumed.

	138 while (true) {

	139 if (key == end) {

	140 // End of key reached.

	141 if (!(*pos & 0x60)) {

	142 // Extract return value.

	143 return *pos & 0x0F;

	144 }

	145 // Node label contains more characters that must match.

	146 return -1;

	147 }

	148 if (*pos & 0x80) {

	149 // Last character in node label.

	150 if (*key & 0x80 || *key < 0x20 || (*key | 0x80) != *pos) {

	151 // Not printable 7-bit ASCII in key or key didn't match.

	152 return -1;

	153 } else {

	154 ++key;

	155 ++pos;

	156 break;

	157 // Read links to child nodes.

	158 }

	159 } else {

	160 if (*key++ != *pos++) {

	161 // Key doesn't match node label.

	162 return -1;

	163 }

	164 // Key matches so far and there are more characters to check in this

	165 // node label.

	166 }

	167 }

	168 }

	169 }

62	170

63 const int kExceptionRule = 1;	171 const int kExceptionRule = 1;

64 const int kWildcardRule = 2;	172 const int kWildcardRule = 2;

65 const int kPrivateRule = 4;	173 const int kPrivateRule = 4;

66	174

67 const FindDomainPtr kDefaultFindDomainFunction = Perfect_Hash::FindDomain;

68

69 // 'stringpool' is defined as a macro by the gperf-generated

70 // "effective_tld_names.cc". Provide a real constant value for it instead.

71 const char* const kDefaultStringPool = stringpool;

72 #undef stringpool

73

74 FindDomainPtr g_find_domain_function = kDefaultFindDomainFunction;

75 const char* g_stringpool = kDefaultStringPool;

76

77 size_t GetRegistryLengthImpl(	175 size_t GetRegistryLengthImpl(

78 const std::string& host,	176 const std::string& host,

79 UnknownRegistryFilter unknown_filter,	177 UnknownRegistryFilter unknown_filter,

80 PrivateRegistryFilter private_filter) {	178 PrivateRegistryFilter private_filter) {

81 DCHECK(!host.empty());	179 DCHECK(!host.empty());

82	180

83 // Skip leading dots.	181 // Skip leading dots.

84 const size_t host_check_begin = host.find_first_not_of('.');	182 const size_t host_check_begin = host.find_first_not_of('.');

85 if (host_check_begin == std::string::npos)	183 if (host_check_begin == std::string::npos)

86 return 0; // Host is only dots.	184 return 0; // Host is only dots.

(...skipping 11 matching lines...)
98	196

99 // Walk up the domain tree, most specific to least specific,	197 // Walk up the domain tree, most specific to least specific,

100 // looking for matches at each level.	198 // looking for matches at each level.

101 size_t prev_start = std::string::npos;	199 size_t prev_start = std::string::npos;

102 size_t curr_start = host_check_begin;	200 size_t curr_start = host_check_begin;

103 size_t next_dot = host.find('.', curr_start);	201 size_t next_dot = host.find('.', curr_start);

104 if (next_dot >= host_check_len) // Catches std::string::npos as well.	202 if (next_dot >= host_check_len) // Catches std::string::npos as well.

105 return 0; // This can't have a registry + domain.	203 return 0; // This can't have a registry + domain.

106 while (1) {	204 while (1) {

107 const char* domain_str = host.data() + curr_start;	205 const char* domain_str = host.data() + curr_start;

108 int domain_length = host_check_len - curr_start;	206 int domain_length = host_check_len - curr_start;
	Ryan Sleevi 2014/03/19 03:08:32 this should have been a size_t, IINM.
109 const DomainRule* rule = g_find_domain_function(domain_str, domain_length);	207 int type = LookupString(graph, domain_str, domain_length);

	208 bool do_check =

	209 type != -1 && (!(type & kPrivateRule) ||

	210 private_filter == INCLUDE_PRIVATE_REGISTRIES);

110	211

111 // We need to compare the string after finding a match because the	212 // If the apparent match is a private registry and we're not including

112 // no-collisions of perfect hashing only refers to items in the set. Since	213 // those, it can't be an actual match.

113 // we're searching for arbitrary domains, there could be collisions.	214 if (do_check) {

114 // Furthermore, if the apparent match is a private registry and we're not	215 // Exception rules override wildcard rules when the domain is an exact

115 // including those, it can't be an actual match.	216 // match, but wildcards take precedence when there's a subdomain.

116 if (rule) {	217 if (type & kWildcardRule && (prev_start != std::string::npos)) {

117 bool do_check = !(rule->type & kPrivateRule) ||	218 // If prev_start == host_check_begin, then the host is the registry

118 private_filter == INCLUDE_PRIVATE_REGISTRIES;	219 // itself, so return 0.

119 if (do_check && base::strncasecmp(domain_str,	220 return (prev_start == host_check_begin) ? 0

120 g_stringpool + rule->name_offset,	221 : (host.length() - prev_start);

121 domain_length) == 0) {	222 }

122 // Exception rules override wildcard rules when the domain is an exact	223

123 // match, but wildcards take precedence when there's a subdomain.	224 if (type & kExceptionRule) {

124 if (rule->type & kWildcardRule && (prev_start != std::string::npos)) {	225 if (next_dot == std::string::npos) {

125 // If prev_start == host_check_begin, then the host is the registry	226 // If we get here, we had an exception rule with no dots (e.g.

126 // itself, so return 0.	227 // "!foo"). This would only be valid if we had a corresponding

127 return (prev_start == host_check_begin) ?	228 // wildcard rule, which would have to be "*". But we explicitly

128 0 : (host.length() - prev_start);	229 // disallow that case, so this kind of rule is invalid.

	230 NOTREACHED() << "Invalid exception rule";

	231 return 0;

129 }	232 }

	233 return host.length() - next_dot - 1;

	234 }

130	235

131 if (rule->type & kExceptionRule) {	236 // If curr_start == host_check_begin, then the host is the registry

132 if (next_dot == std::string::npos) {	237 // itself, so return 0.

133 // If we get here, we had an exception rule with no dots (e.g.	238 return (curr_start == host_check_begin) ? 0

134 // "!foo"). This would only be valid if we had a corresponding	239 : (host.length() - curr_start);

135 // wildcard rule, which would have to be "*". But we explicitly

136 // disallow that case, so this kind of rule is invalid.

137 NOTREACHED() << "Invalid exception rule";

138 return 0;

139 }

140 return host.length() - next_dot - 1;

141 }

142

143 // If curr_start == host_check_begin, then the host is the registry

144 // itself, so return 0.

145 return (curr_start == host_check_begin) ?

146 0 : (host.length() - curr_start);

147 }

148 }	240 }

149	241

150 if (next_dot >= host_check_len) // Catches std::string::npos as well.	242 if (next_dot >= host_check_len) // Catches std::string::npos as well.

151 break;	243 break;

152	244

153 prev_start = curr_start;	245 prev_start = curr_start;

154 curr_start = next_dot + 1;	246 curr_start = next_dot + 1;

155 next_dot = host.find('.', curr_start);	247 next_dot = host.find('.', curr_start);

156 }	248 }

157	249

(...skipping 99 matching lines...)
257 PrivateRegistryFilter private_filter) {	349 PrivateRegistryFilter private_filter) {

258 url_canon::CanonHostInfo host_info;	350 url_canon::CanonHostInfo host_info;

259 const std::string canon_host(CanonicalizeHost(host, &host_info));	351 const std::string canon_host(CanonicalizeHost(host, &host_info));

260 if (canon_host.empty())	352 if (canon_host.empty())

261 return std::string::npos;	353 return std::string::npos;

262 if (host_info.IsIPAddress())	354 if (host_info.IsIPAddress())

263 return 0;	355 return 0;

264 return GetRegistryLengthImpl(canon_host, unknown_filter, private_filter);	356 return GetRegistryLengthImpl(canon_host, unknown_filter, private_filter);

265 }	357 }

266	358

267 void SetFindDomainFunctionAndStringPoolForTesting(FindDomainPtr function,	359 void SetFindDomainGraph(const unsigned char* domains) {

268 const char* stringpool) {	360 graph = domains ? domains : kDafsa;

269 g_find_domain_function = function ? function : kDefaultFindDomainFunction;

270 g_stringpool = stringpool ? stringpool : kDefaultStringPool;

271 }	361 }

272	362

273 } // namespace registry_controlled_domains	363 } // namespace registry_controlled_domains

274 } // namespace net	364 } // namespace net
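
The review comment on NEW line 70 asks for symbolic names in place of the literals 0x80, 0x60, 0x40, 0x1F and 0x3F used when reading links (NEW lines 72-87). A minimal sketch of what that could look like follows. The constant names, the Link struct and DecodeLink() are hypothetical and do not appear in the patch; the bit layout is taken only from the switch statement in the diff, not from make_dafsa.py.

```cpp
#include <cstddef>

// Hypothetical names; the patch under review uses the raw literals.
const unsigned char kLastLinkMask = 0x80;     // Set on the last link of a node.
const unsigned char kOffsetSizeMask = 0x60;   // Selects the offset width.
const unsigned char kThreeByteOffset = 0x60;  // Offset stored in three bytes.
const unsigned char kTwoByteOffset = 0x40;    // Offset stored in two bytes.
const unsigned char kLongOffsetMask = 0x1F;   // Payload bits of the first byte
                                              // of a two/three byte offset.
const unsigned char kShortOffsetMask = 0x3F;  // Payload bits of a one byte
                                              // offset.

// Result of decoding one link (illustrative type, not part of the patch).
struct Link {
  size_t offset;  // Amount to add to the running child pointer.
  size_t size;    // Number of bytes the encoded link occupies.
  bool is_last;   // True if no further links follow in this node.
};

// Decodes the link starting at |pos|. Mirrors the switch on (*pos & 0x60)
// in LookupString(), but returns the result instead of mutating pointers.
inline Link DecodeLink(const unsigned char* pos) {
  Link link;
  link.is_last = (pos[0] & kLastLinkMask) != 0;
  switch (pos[0] & kOffsetSizeMask) {
    case kThreeByteOffset:
      link.offset = (static_cast<size_t>(pos[0] & kLongOffsetMask) << 16) |
                    (static_cast<size_t>(pos[1]) << 8) | pos[2];
      link.size = 3;
      break;
    case kTwoByteOffset:
      link.offset =
          (static_cast<size_t>(pos[0] & kLongOffsetMask) << 8) | pos[1];
      link.size = 2;
      break;
    default:
      link.offset = pos[0] & kShortOffsetMask;
      link.size = 1;
  }
  return link;
}
```

A caller would then do `child += link.offset; pos += link.size;`, which keeps the pointer side effects visible at the call site, one of the concerns raised in the reply about iterators.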

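
The review comment on NEW line 101 asks for a meaningful constant for the negative return value of LookupString(). The sketch below shows one way the result could be interpreted, following the do_check expression on NEW lines 208-210. The names kNotFound and RuleApplies() are hypothetical, and a plain bool stands in for the PrivateRegistryFilter comparison with INCLUDE_PRIVATE_REGISTRIES.

```cpp
// Hypothetical name for the -1 sentinel returned by LookupString().
const int kNotFound = -1;

// Rule flags as defined in the patch (NEW lines 171-173).
const int kExceptionRule = 1;
const int kWildcardRule = 2;
const int kPrivateRule = 4;

// Returns true if a lookup result should be treated as a match.
// |include_private| is an assumed simplification of
// (private_filter == INCLUDE_PRIVATE_REGISTRIES).
inline bool RuleApplies(int type, bool include_private) {
  if (type == kNotFound)
    return false;  // The key is not in the table at all.
  // Private rules only count when the caller asked for them.
  return !(type & kPrivateRule) || include_private;
}
```

Checking the sentinel first matters because -1 has every bit set, so testing the flag bits on it directly would report all three rule types at once.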