OLD | NEW |
| (Empty) |
1 // Copyright (c) 2011 The Chromium Authors. All rights reserved. | |
2 // Use of this source code is governed by a BSD-style license that can be | |
3 // found in the LICENSE file. | |
4 | |
5 // NB: Modelled after Mozilla's code (originally written by Pamela Greene, | |
6 // later modified by others), but almost entirely rewritten for Chrome. | |
7 // (netwerk/dns/src/nsEffectiveTLDService.h) | |
8 /* ***** BEGIN LICENSE BLOCK ***** | |
9 * Version: MPL 1.1/GPL 2.0/LGPL 2.1 | |
10 * | |
11 * The contents of this file are subject to the Mozilla Public License Version | |
12 * 1.1 (the "License"); you may not use this file except in compliance with | |
13 * the License. You may obtain a copy of the License at | |
14 * http://www.mozilla.org/MPL/ | |
15 * | |
16 * Software distributed under the License is distributed on an "AS IS" basis, | |
17 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License | |
18 * for the specific language governing rights and limitations under the | |
19 * License. | |
20 * | |
21 * The Original Code is Mozilla TLD Service | |
22 * | |
23 * The Initial Developer of the Original Code is | |
24 * Google Inc. | |
25 * Portions created by the Initial Developer are Copyright (C) 2006 | |
26 * the Initial Developer. All Rights Reserved. | |
27 * | |
28 * Contributor(s): | |
29 * Pamela Greene <pamg.bugs@gmail.com> (original author) | |
30 * | |
31 * Alternatively, the contents of this file may be used under the terms of | |
32 * either the GNU General Public License Version 2 or later (the "GPL"), or | |
33 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"), | |
34 * in which case the provisions of the GPL or the LGPL are applicable instead | |
35 * of those above. If you wish to allow use of your version of this file only | |
36 * under the terms of either the GPL or the LGPL, and not to allow others to | |
37 * use your version of this file under the terms of the MPL, indicate your | |
38 * decision by deleting the provisions above and replace them with the notice | |
39 * and other provisions required by the GPL or the LGPL. If you do not delete | |
40 * the provisions above, a recipient may use your version of this file under | |
41 * the terms of any one of the MPL, the GPL or the LGPL. | |
42 * | |
43 * ***** END LICENSE BLOCK ***** */ | |
44 | |
45 /* | |
46 (Documentation based on the Mozilla documentation currently at | |
47 http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same | |
48 author.) | |
49 | |
50 The RegistryControlledDomainService examines the hostname of a GURL passed to | |
51 it and determines the longest portion that is controlled by a registrar. | |
52 Although technically the top-level domain (TLD) for a hostname is the last | |
53 dot-portion of the name (such as .com or .org), many domains (such as co.uk) | |
54 function as though they were TLDs, allocating any number of more specific, | |
55 essentially unrelated names beneath them. For example, .uk is a TLD, but | |
56 nobody is allowed to register a domain directly under .uk; the "effective" | |
57 TLDs are ac.uk, co.uk, and so on. We wouldn't want to allow any site in | |
58 *.co.uk to set a cookie for the entire co.uk domain, so it's important to be | |
59 able to identify which higher-level domains function as effective TLDs and | |
60 which can be registered. | |
61 | |
62 The service obtains its information about effective TLDs from a text resource | |
63 that must be in the following format: | |
64 | |
65 * It should use plain ASCII. | |
66 * It should contain one domain rule per line, terminated with \n, with nothing | |
67 else on the line. (The last rule in the file may omit the ending \n.) | |
68 * Rules should have been normalized using the same canonicalization that GURL | |
69 applies. For ASCII, that means they're not case-sensitive, among other | |
70 things; other normalizations are applied for other characters. | |
71 * Each rule should list the entire TLD-like domain name, with any subdomain | |
72 portions separated by dots (.) as usual. | |
73 * Rules should neither begin nor end with a dot. | |
74 * If a hostname matches more than one rule, the most specific rule (that is, | |
75 the one with more dot-levels) will be used. | |
76 * Other than in the case of wildcards (see below), rules do not implicitly | |
77 include their subcomponents. For example, "bar.baz.uk" does not imply | |
78 "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk" | |
79 will match, but "baz.uk" and "qux.baz.uk" won't. | |
80 * The wildcard character '*' will match any valid sequence of characters. | |
81 * Wildcards may only appear as the entire most specific level of a rule. That | |
82 is, a wildcard must come at the beginning of a line and must be followed by | |
83 a dot. (You may not use a wildcard as the entire rule.) | |
84 * A wildcard rule implies a rule for the entire non-wildcard portion. For | |
85 example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule | |
86 "bar"). This is typically important in the case of exceptions (see below). | |
87 * The exception character '!' before a rule marks an exception to a wildcard | |
88 rule. If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then | |
89 "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp" | |
90 has an effective TLD of "tokyo.jp" (the exception prevents the wildcard | |
91 match, and we thus fall through to matching on the implied "tokyo.jp" rule | |
92 from the wildcard). | |
93 * If you use an exception rule without a corresponding wildcard rule, the | |
94 behavior is undefined. | |
95 | |
96 Firefox has a very similar service, and it's their data file we use to | |
97 construct our resource. However, the data expected by this implementation | |
98 differs from the Mozilla file in several important ways: | |
99 (1) We require that all single-level TLDs (com, edu, etc.) be explicitly | |
100 listed. As of this writing, Mozilla's file includes the single-level | |
101 TLDs too, but that might change. | |
102 (2) Our data is expected to be in pure ASCII: all UTF-8 or otherwise encoded | |
103 items must already have been normalized. | |
104 (3) We do not allow comments, rule notes, blank lines, or line endings other | |
105 than LF. | |
106 Rules are also expected to be syntactically valid. | |
107 | |
108 The utility application tld_cleanup.exe converts a Mozilla-style file into a | |
109 Chrome one, making sure that single-level TLDs are explicitly listed, using | |
110 GURL to normalize rules, and validating the rules. | |
111 */ | |
112 | |
113 #ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_ | |
114 #define NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_ | |
115 | |
116 #include <string> | |
117 | |
118 #include "base/basictypes.h" | |
119 #include "net/base/net_export.h" | |
120 | |
121 class GURL; | |
122 | |
123 struct DomainRule; | |
124 | |
125 namespace net { | |
126 | |
127 class NET_EXPORT RegistryControlledDomainService { | |
128 public: | |
129 // Returns the registered, organization-identifying host and all its registry | |
130 // information, but no subdomains, from the given GURL. Returns an empty | |
131 // string if the GURL is invalid, has no host (e.g. a file: URL), has multiple | |
132 // trailing dots, is an IP address, has only one subcomponent (i.e. no dots | |
133 // other than leading/trailing ones), or is itself a recognized registry | |
134 // identifier. If no matching rule is found in the effective-TLD data (or in | |
135 // the default data, if the resource failed to load), the last subcomponent of | |
136 // the host is assumed to be the registry. | |
137 // | |
138 // Examples: | |
139 // http://www.google.com/file.html -> "google.com" (com) | |
140 // http://..google.com/file.html -> "google.com" (com) | |
141 // http://google.com./file.html -> "google.com." (com) | |
142 // http://a.b.co.uk/file.html -> "b.co.uk" (co.uk) | |
143 // file:///C:/bar.html -> "" (no host) | |
144 // http://foo.com../file.html -> "" (multiple trailing dots) | |
145 // http://192.168.0.1/file.html -> "" (IP address) | |
146 // http://bar/file.html -> "" (no subcomponents) | |
147 // http://co.uk/file.html -> "" (host is a registry) | |
148 // http://foo.bar/file.html -> "foo.bar" (no rule; assume bar) | |
149 static std::string GetDomainAndRegistry(const GURL& gurl); | |
150 | |
151 // Like the GURL version, but takes a host (which is canonicalized internally) | |
152 // instead of a full GURL. | |
153 static std::string GetDomainAndRegistry(const std::string& host); | |
154 | |
155 // This convenience function returns true if the two GURLs both have hosts | |
156 // and one of the following is true: | |
157 // * They each have a known domain and registry, and it is the same for both | |
158 // URLs. Note that this means the trailing dot, if any, must match too. | |
159 // * They don't have known domains/registries, but the hosts are identical. | |
160 // Effectively, callers can use this function to check whether the input URLs | |
161 // represent hosts "on the same site". | |
162 static bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2); | |
163 | |
164 // Finds the length in bytes of the registry portion of the host in the | |
165 // given GURL. Returns std::string::npos if the GURL is invalid or has no | |
166 // host (e.g. a file: URL). Returns 0 if the GURL has multiple trailing dots, | |
167 // is an IP address, has no subcomponents, or is itself a recognized registry | |
168 // identifier. If no matching rule is found in the effective-TLD data (or in | |
169 // the default data, if the resource failed to load), returns 0 if | |
170 // |allow_unknown_registries| is false, or the length of the last subcomponent | |
171 // if |allow_unknown_registries| is true. | |
172 // | |
173 // Examples: | |
174 // http://www.google.com/file.html -> 3 (com) | |
175 // http://..google.com/file.html -> 3 (com) | |
176 // http://google.com./file.html -> 4 (com, plus the trailing dot) | |
177 // http://a.b.co.uk/file.html -> 5 (co.uk) | |
178 // file:///C:/bar.html -> std::string::npos (no host) | |
179 // http://foo.com../file.html -> 0 (multiple trailing | |
180 // dots) | |
181 // http://192.168.0.1/file.html -> 0 (IP address) | |
182 // http://bar/file.html -> 0 (no subcomponents) | |
183 // http://co.uk/file.html -> 0 (host is a registry) | |
184 // http://foo.bar/file.html -> 0 or 3, depending (no rule; assume | |
185 // bar) | |
186 static size_t GetRegistryLength(const GURL& gurl, | |
187 bool allow_unknown_registries); | |
188 | |
189 // Like the GURL version, but takes a host (which is canonicalized internally) | |
190 // instead of a full GURL. | |
191 static size_t GetRegistryLength(const std::string& host, | |
192 bool allow_unknown_registries); | |
193 | |
194 private: | |
195 friend class RegistryControlledDomainTest; | |
196 | |
197 // Internal workings of the static public methods. See above. | |
198 static std::string GetDomainAndRegistryImpl(const std::string& host); | |
199 static size_t GetRegistryLengthImpl(const std::string& host, | |
200 bool allow_unknown_registries); | |
201 | |
202 typedef const struct DomainRule* (*FindDomainPtr)(const char *, unsigned int); | |
203 | |
204 // Used for unit tests, so that a different perfect hash map from the full | |
205 // list is used. Set to NULL to use the Default function. | |
206 static void UseFindDomainFunction(FindDomainPtr function); | |
207 | |
208 // Function that returns a DomainRule given a domain. | |
209 static FindDomainPtr find_domain_function_; | |
210 | |
211 | |
212 DISALLOW_IMPLICIT_CONSTRUCTORS(RegistryControlledDomainService); | |
213 }; | |
214 | |
215 } // namespace net | |
216 | |
217 #endif // NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H_ | |
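For reviewers who want to check the rule semantics described in the header comment (most specific rule wins, wildcards cover one extra level, exceptions fall back to the implied parent rule), here is a minimal sketch of the lookup. It is not the code in this file: `Rule`, `RuleTable`, `ParseRules`, `RegistryLength`, `DomainAndRegistry`, and `SameDomainOrHostSketch` are hypothetical names, a `std::unordered_map` stands in for the perfect hash map, and the host is assumed to be already canonicalized, with no leading or trailing dots and not an IP address.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical rule record: "*.foo.bar" is stored under "foo.bar" with
// wildcard set; "!baz.foo.bar" is stored under "baz.foo.bar" with exception set.
struct Rule {
  bool wildcard = false;
  bool exception = false;
};
using RuleTable = std::unordered_map<std::string, Rule>;

// Builds a table from rule lines in the format the header comment describes.
RuleTable ParseRules(const std::vector<std::string>& lines) {
  RuleTable table;
  for (const std::string& line : lines) {
    if (line.empty()) continue;
    if (line.compare(0, 2, "*.") == 0) {
      table[line.substr(2)].wildcard = true;
    } else if (line[0] == '!') {
      table[line.substr(1)].exception = true;
    } else {
      table.emplace(line, Rule{});  // plain rule; keeps existing flags if any
    }
  }
  return table;
}

// Returns the registry length, 0 if the host is itself a registry or has no
// usable registry, or npos for an empty host. Assumes a canonicalized host.
size_t RegistryLength(const std::string& host, const RuleTable& rules,
                      bool allow_unknown_registries) {
  if (host.empty()) return std::string::npos;
  size_t prev_start = std::string::npos;  // label to the left of the suffix
  size_t curr_start = 0;
  while (true) {
    // Suffixes are tried longest first, so the first hit is the most specific.
    auto it = rules.find(host.substr(curr_start));
    if (it != rules.end()) {
      const Rule& rule = it->second;
      if (rule.wildcard && prev_start != std::string::npos) {
        // The wildcard absorbs one more label to the left.
        return prev_start == 0 ? 0 : host.size() - prev_start;
      }
      if (rule.exception) {
        // Fall back to the implied parent rule: drop the leftmost label.
        const size_t next_dot = host.find('.', curr_start);
        if (next_dot == std::string::npos) return 0;  // malformed exception
        return host.size() - next_dot - 1;
      }
      return curr_start == 0 ? 0 : host.size() - curr_start;
    }
    const size_t next_dot = host.find('.', curr_start);
    if (next_dot == std::string::npos) break;
    prev_start = curr_start;
    curr_start = next_dot + 1;
  }
  if (!allow_unknown_registries) return 0;
  const size_t last_dot = host.rfind('.');
  return last_dot == std::string::npos ? 0 : host.size() - last_dot - 1;
}

// Returns the registry plus the one label directly to its left, or "".
std::string DomainAndRegistry(const std::string& host, const RuleTable& rules) {
  const size_t len = RegistryLength(host, rules, true);
  if (len == std::string::npos || len == 0) return std::string();
  // For canonical hosts, a non-empty registry follows a label and a dot.
  assert(len + 1 < host.size());
  const size_t dot = host.rfind('.', host.size() - len - 2);
  return dot == std::string::npos ? host : host.substr(dot + 1);
}

// Same domain-and-registry if one is known, otherwise identical hosts.
bool SameDomainOrHostSketch(const std::string& host1, const std::string& host2,
                            const RuleTable& rules) {
  const std::string domain1 = DomainAndRegistry(host1, rules);
  if (!domain1.empty()) return domain1 == DomainAndRegistry(host2, rules);
  return !host1.empty() && host1 == host2;
}
```

With the rules `com`, `uk`, `co.uk`, `*.tokyo.jp`, and `!pref.tokyo.jp`, this sketch yields a registry length of 10 for `a.b.tokyo.jp` (`b.tokyo.jp`) and 8 for `a.pref.tokyo.jp` (`tokyo.jp`), which matches the wildcard and exception example in the header comment. Trying suffixes from longest to shortest is what implements "the most specific rule will be used" without a separate ranking step.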