OLD | NEW |
| (Empty) |
1 | |
2 This directory contains source code for the SQLite "ICU" extension, an | |
3 integration of the "International Components for Unicode" library with | |
4 SQLite. Documentation follows. | |
5 | |
6 1. Features | |
7 | |
8 1.1 SQL Scalars upper() and lower() | |
9 1.2 Unicode Aware LIKE Operator | |
10 1.3 ICU Collation Sequences | |
11 1.4 SQL REGEXP Operator | |
12 | |
13 2. Compilation and Usage | |
14 | |
15 3. Bugs, Problems and Security Issues | |
16 | |
17 3.1 The "case_sensitive_like" Pragma | |
18 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | |
19 3.3 Collation Sequence Security Issue | |
20 | |
21 | |
22 1. FEATURES | |
23 | |
24 1.1 SQL Scalars upper() and lower() | |
25 | |
26 SQLite's built-in implementations of these two functions only | |
27 provide case mapping for the 26 letters used in the English | |
28 language. The ICU based functions provided by this extension | |
29 provide case mapping, where defined, for the full range of | |
30 unicode characters. | |
31 | |
32 ICU provides two types of case mapping, "general" case mapping and | |
33 "language specific". Refer to ICU documentation for the differences | |
34 between the two. Specifically: | |
35 | |
36 http://www.icu-project.org/userguide/caseMappings.html | |
37 http://www.icu-project.org/userguide/posix.html#case_mappings | |
38 | |
39 To utilise "general" case mapping, the upper() or lower() scalar | |
40 functions are invoked with one argument: | |
41 | |
42 upper('ABC') -> 'abc' | |
43 lower('abc') -> 'ABC' | |
44 | |
45 To access ICU "language specific" case mapping, upper() or lower() | |
46 should be invoked with two arguments. The second argument is the name | |
47 of the locale to use. Passing an empty string ("") or SQL NULL value | |
48 as the second argument is the same as invoking the 1 argument version | |
49 of upper() or lower(): | |
50 | |
51 lower('I', 'en_us') -> 'i' | |
52 lower('I', 'tr_tr') -> 'ı' (small dotless i) | |
53 | |
54 1.2 Unicode Aware LIKE Operator | |
55 | |
56 Similarly to the upper() and lower() functions, the built-in SQLite LIKE | |
57 operator understands case equivalence for the 26 letters of the English | |
58 language alphabet. The implementation of LIKE included in this | |
59 extension uses the ICU function u_foldCase() to provide case | |
60 independent comparisons for the full range of unicode characters. | |
61 | |
62 The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the | |
63 dotless 'I' character used in the Turkish language is considered | |
64 to be in the same equivalence class as the dotted 'I' character | |
65 used by many languages (including English). | |
66 | |
67 1.3 ICU Collation Sequences | |
68 | |
69 A special SQL scalar function, icu_load_collation() is provided that | |
70 may be used to register ICU collation sequences with SQLite. It | |
71 is always called with exactly two arguments, the ICU locale | |
72 identifying the collation sequence to ICU, and the name of the | |
73 SQLite collation sequence to create. For example, to create an | |
74 SQLite collation sequence named "turkish" using Turkish language | |
75 sorting rules, the SQL statement: | |
76 | |
77 SELECT icu_load_collation('tr_TR', 'turkish'); | |
78 | |
79 Or, for Australian English: | |
80 | |
81 SELECT icu_load_collation('en_AU', 'australian'); | |
82 | |
83 The identifiers "turkish" and "australian" may then be used | |
84 as collation sequence identifiers in SQL statements: | |
85 | |
86 CREATE TABLE aust_turkish_penpals( | |
87 australian_penpal_name TEXT COLLATE australian, | |
88 turkish_penpal_name TEXT COLLATE turkish | |
89 ); | |
90 | |
91 1.4 SQL REGEXP Operator | |
92 | |
93 This extension provides an implementation of the SQL binary | |
94 comparision operator "REGEXP", based on the regular expression functions | |
95 provided by the ICU library. The syntax of the operator is as described | |
96 in SQLite documentation: | |
97 | |
98 <string> REGEXP <re-pattern> | |
99 | |
100 This extension uses the ICU defaults for regular expression matching | |
101 behavior. Specifically, this means that: | |
102 | |
103 * Matching is case-sensitive, | |
104 * Regular expression comments are not allowed within patterns, and | |
105 * The '^' and '$' characters match the beginning and end of the | |
106 <string> argument, not the beginning and end of lines within | |
107 the <string> argument. | |
108 | |
109 Even more specifically, the value passed to the "flags" parameter | |
110 of ICU C function uregex_open() is 0. | |
111 | |
112 | |
113 2 COMPILATION AND USAGE | |
114 | |
115 The easiest way to compile and use the ICU extension is to build | |
116 and use it as a dynamically loadable SQLite extension. To do this | |
117 using gcc on *nix: | |
118 | |
119 gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so | |
120 | |
121 You may need to add "-I" flags so that gcc can find sqlite3ext.h | |
122 and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be | |
123 loaded into sqlite in the same way as any other dynamically loadable | |
124 extension. | |
125 | |
126 | |
127 3 BUGS, PROBLEMS AND SECURITY ISSUES | |
128 | |
129 3.1 The "case_sensitive_like" Pragma | |
130 | |
131 This extension does not work well with the "case_sensitive_like" | |
132 pragma. If this pragma is used before the ICU extension is loaded, | |
133 then the pragma has no effect. If the pragma is used after the ICU | |
134 extension is loaded, then SQLite ignores the ICU implementation and | |
135 always uses the built-in LIKE operator. | |
136 | |
137 The ICU extension LIKE operator is always case insensitive. | |
138 | |
139 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | |
140 | |
141 Passing very long patterns to the built-in SQLite LIKE operator can | |
142 cause excessive CPU usage. To curb this problem, SQLite defines the | |
143 SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a | |
144 pattern in bytes (irrespective of encoding). The default value is | |
145 defined in internal header file "limits.h". | |
146 | |
147 The ICU extension LIKE implementation suffers from the same | |
148 problem and uses the same solution. However, since the ICU extension | |
149 code does not include the SQLite file "limits.h", modifying | |
150 the default value therein does not affect the ICU extension. | |
151 The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by | |
152 the ICU extension LIKE operator is 50000, defined in source | |
153 file "icu.c". | |
154 | |
155 3.3 Collation Sequence Security Issue | |
156 | |
157 Internally, SQLite assumes that indices stored in database files | |
158 are sorted according to the collation sequence indicated by the | |
159 SQL schema. Changing the definition of a collation sequence after | |
160 an index has been built is therefore equivalent to database | |
161 corruption. The SQLite library is not very well tested under | |
162 these conditions, and may contain potential buffer overruns | |
163 or other programming errors that could be exploited by a malicious | |
164 programmer. | |
165 | |
166 If the ICU extension is used in an environment where potentially | |
167 malicious users may execute arbitrary SQL (i.e. gears), they | |
168 should be prevented from invoking the icu_load_collation() function, | |
169 possibly using the authorisation callback. | |
OLD | NEW |