| OLD | NEW |
| (Empty) |
| 1 | |
| 2 This directory contains source code for the SQLite "ICU" extension, an | |
| 3 integration of the "International Components for Unicode" library with | |
| 4 SQLite. Documentation follows. | |
| 5 | |
| 6 1. Features | |
| 7 | |
| 8 1.1 SQL Scalars upper() and lower() | |
| 9 1.2 Unicode Aware LIKE Operator | |
| 10 1.3 ICU Collation Sequences | |
| 11 1.4 SQL REGEXP Operator | |
| 12 | |
| 13 2. Compilation and Usage | |
| 14 | |
| 15 3. Bugs, Problems and Security Issues | |
| 16 | |
| 17 3.1 The "case_sensitive_like" Pragma | |
| 18 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | |
| 19 3.3 Collation Sequence Security Issue | |
| 20 | |
| 21 | |
| 22 1. FEATURES | |
| 23 | |
| 24 1.1 SQL Scalars upper() and lower() | |
| 25 | |
| 26 SQLite's built-in implementations of these two functions only | |
| 27 provide case mapping for the 26 letters used in the English | |
| 28 language. The ICU based functions provided by this extension | |
| 29 provide case mapping, where defined, for the full range of | |
| 30 unicode characters. | |
| 31 | |
| 32 ICU provides two types of case mapping, "general" case mapping and | |
| 33 "language specific". Refer to ICU documentation for the differences | |
| 34 between the two. Specifically: | |
| 35 | |
| 36 http://www.icu-project.org/userguide/caseMappings.html | |
| 37 http://www.icu-project.org/userguide/posix.html#case_mappings | |
| 38 | |
| 39 To utilise "general" case mapping, the upper() or lower() scalar | |
| 40 functions are invoked with one argument: | |
| 41 | |
| 42 upper('ABC') -> 'abc' | |
| 43 lower('abc') -> 'ABC' | |
| 44 | |
| 45 To access ICU "language specific" case mapping, upper() or lower() | |
| 46 should be invoked with two arguments. The second argument is the name | |
| 47 of the locale to use. Passing an empty string ("") or SQL NULL value | |
| 48 as the second argument is the same as invoking the 1 argument version | |
| 49 of upper() or lower(): | |
| 50 | |
| 51 lower('I', 'en_us') -> 'i' | |
| 52 lower('I', 'tr_tr') -> 'ı' (small dotless i) | |
| 53 | |
| 54 1.2 Unicode Aware LIKE Operator | |
| 55 | |
| 56 Similarly to the upper() and lower() functions, the built-in SQLite LIKE | |
| 57 operator understands case equivalence for the 26 letters of the English | |
| 58 language alphabet. The implementation of LIKE included in this | |
| 59 extension uses the ICU function u_foldCase() to provide case | |
| 60 independent comparisons for the full range of unicode characters. | |
| 61 | |
| 62 The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the | |
| 63 dotless 'I' character used in the Turkish language is considered | |
| 64 to be in the same equivalence class as the dotted 'I' character | |
| 65 used by many languages (including English). | |
| 66 | |
| 67 1.3 ICU Collation Sequences | |
| 68 | |
| 69 A special SQL scalar function, icu_load_collation() is provided that | |
| 70 may be used to register ICU collation sequences with SQLite. It | |
| 71 is always called with exactly two arguments, the ICU locale | |
| 72 identifying the collation sequence to ICU, and the name of the | |
| 73 SQLite collation sequence to create. For example, to create an | |
| 74 SQLite collation sequence named "turkish" using Turkish language | |
| 75 sorting rules, the SQL statement: | |
| 76 | |
| 77 SELECT icu_load_collation('tr_TR', 'turkish'); | |
| 78 | |
| 79 Or, for Australian English: | |
| 80 | |
| 81 SELECT icu_load_collation('en_AU', 'australian'); | |
| 82 | |
| 83 The identifiers "turkish" and "australian" may then be used | |
| 84 as collation sequence identifiers in SQL statements: | |
| 85 | |
| 86 CREATE TABLE aust_turkish_penpals( | |
| 87 australian_penpal_name TEXT COLLATE australian, | |
| 88 turkish_penpal_name TEXT COLLATE turkish | |
| 89 ); | |
| 90 | |
| 91 1.4 SQL REGEXP Operator | |
| 92 | |
| 93 This extension provides an implementation of the SQL binary | |
| 94 comparision operator "REGEXP", based on the regular expression functions | |
| 95 provided by the ICU library. The syntax of the operator is as described | |
| 96 in SQLite documentation: | |
| 97 | |
| 98 <string> REGEXP <re-pattern> | |
| 99 | |
| 100 This extension uses the ICU defaults for regular expression matching | |
| 101 behaviour. Specifically, this means that: | |
| 102 | |
| 103 * Matching is case-sensitive, | |
| 104 * Regular expression comments are not allowed within patterns, and | |
| 105 * The '^' and '$' characters match the beginning and end of the | |
| 106 <string> argument, not the beginning and end of lines within | |
| 107 the <string> argument. | |
| 108 | |
| 109 Even more specifically, the value passed to the "flags" parameter | |
| 110 of ICU C function uregex_open() is 0. | |
| 111 | |
| 112 | |
| 113 2 COMPILATION AND USAGE | |
| 114 | |
| 115 The easiest way to compile and use the ICU extension is to build | |
| 116 and use it as a dynamically loadable SQLite extension. To do this | |
| 117 using gcc on *nix: | |
| 118 | |
| 119 gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so | |
| 120 | |
| 121 You may need to add "-I" flags so that gcc can find sqlite3ext.h | |
| 122 and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be | |
| 123 loaded into sqlite in the same way as any other dynamically loadable | |
| 124 extension. | |
| 125 | |
| 126 | |
| 127 3 BUGS, PROBLEMS AND SECURITY ISSUES | |
| 128 | |
| 129 3.1 The "case_sensitive_like" Pragma | |
| 130 | |
| 131 This extension does not work well with the "case_sensitive_like" | |
| 132 pragma. If this pragma is used before the ICU extension is loaded, | |
| 133 then the pragma has no effect. If the pragma is used after the ICU | |
| 134 extension is loaded, then SQLite ignores the ICU implementation and | |
| 135 always uses the built-in LIKE operator. | |
| 136 | |
| 137 The ICU extension LIKE operator is always case insensitive. | |
| 138 | |
| 139 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | |
| 140 | |
| 141 Passing very long patterns to the built-in SQLite LIKE operator can | |
| 142 cause a stack overflow. To curb this problem, SQLite defines the | |
| 143 SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a | |
| 144 pattern in bytes (irrespective of encoding). The default value is | |
| 145 defined in internal header file "limits.h". | |
| 146 | |
| 147 The ICU extension LIKE implementation suffers from the same | |
| 148 problem and uses the same solution. However, since the ICU extension | |
| 149 code does not include the SQLite file "limits.h", modifying | |
| 150 the default value therein does not affect the ICU extension. | |
| 151 The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by | |
| 152 the ICU extension LIKE operator is 50000, defined in source | |
| 153 file "icu.c". | |
| 154 | |
| 155 3.3 Collation Sequence Security Issue | |
| 156 | |
| 157 Internally, SQLite assumes that indices stored in database files | |
| 158 are sorted according to the collation sequence indicated by the | |
| 159 SQL schema. Changing the definition of a collation sequence after | |
| 160 an index has been built is therefore equivalent to database | |
| 161 corruption. The SQLite library is not very well tested under | |
| 162 these conditions, and may contain potential buffer overruns | |
| 163 or other programming errors that could be exploited by a malicious | |
| 164 programmer. | |
| 165 | |
| 166 If the ICU extension is used in an environment where potentially | |
| 167 malicious users may execute arbitrary SQL (i.e. gears), they | |
| 168 should be prevented from invoking the icu_load_collation() function, | |
| 169 possibly using the authorisation callback. | |
| 170 | |
| OLD | NEW |