OLD | NEW |
(Empty) | |
| 1 |
| 2 This directory contains source code for the SQLite "ICU" extension, an |
| 3 integration of the "International Components for Unicode" library with |
| 4 SQLite. Documentation follows. |
| 5 |
| 6 1. Features |
| 7 |
| 8 1.1 SQL Scalars upper() and lower() |
| 9 1.2 Unicode Aware LIKE Operator |
| 10 1.3 ICU Collation Sequences |
| 11 1.4 SQL REGEXP Operator |
| 12 |
| 13 2. Compilation and Usage |
| 14 |
| 15 3. Bugs, Problems and Security Issues |
| 16 |
| 17 3.1 The "case_sensitive_like" Pragma |
| 18 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro |
| 19 3.3 Collation Sequence Security Issue |
| 20 |
| 21 |
| 22 1. FEATURES |
| 23 |
| 24 1.1 SQL Scalars upper() and lower() |
| 25 |
| 26 SQLite's built-in implementations of these two functions only |
| 27 provide case mapping for the 26 letters used in the English |
| 28 language. The ICU based functions provided by this extension |
| 29 provide case mapping, where defined, for the full range of |
| 30 unicode characters. |
| 31 |
| 32 ICU provides two types of case mapping, "general" case mapping and |
| 33 "language specific". Refer to ICU documentation for the differences |
| 34 between the two. Specifically: |
| 35 |
| 36 http://www.icu-project.org/userguide/caseMappings.html |
| 37 http://www.icu-project.org/userguide/posix.html#case_mappings |
| 38 |
| 39 To utilise "general" case mapping, the upper() or lower() scalar |
| 40 functions are invoked with one argument: |
| 41 |
| 42 upper('ABC') -> 'abc' |
| 43 lower('abc') -> 'ABC' |
| 44 |
| 45 To access ICU "language specific" case mapping, upper() or lower() |
| 46 should be invoked with two arguments. The second argument is the name |
| 47 of the locale to use. Passing an empty string ("") or SQL NULL value |
| 48 as the second argument is the same as invoking the 1 argument version |
| 49 of upper() or lower(): |
| 50 |
| 51 lower('I', 'en_us') -> 'i' |
| 52 lower('I', 'tr_tr') -> 'ı' (small dotless i) |
| 53 |
| 54 1.2 Unicode Aware LIKE Operator |
| 55 |
| 56 Similarly to the upper() and lower() functions, the built-in SQLite LIKE |
| 57 operator understands case equivalence for the 26 letters of the English |
| 58 language alphabet. The implementation of LIKE included in this |
| 59 extension uses the ICU function u_foldCase() to provide case |
| 60 independent comparisons for the full range of unicode characters. |
| 61 |
| 62 The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the |
| 63 dotless 'I' character used in the Turkish language is considered |
| 64 to be in the same equivalence class as the dotted 'I' character |
| 65 used by many languages (including English). |
| 66 |
| 67 1.3 ICU Collation Sequences |
| 68 |
| 69 A special SQL scalar function, icu_load_collation() is provided that |
| 70 may be used to register ICU collation sequences with SQLite. It |
| 71 is always called with exactly two arguments, the ICU locale |
| 72 identifying the collation sequence to ICU, and the name of the |
| 73 SQLite collation sequence to create. For example, to create an |
| 74 SQLite collation sequence named "turkish" using Turkish language |
| 75 sorting rules, the SQL statement: |
| 76 |
| 77 SELECT icu_load_collation('tr_TR', 'turkish'); |
| 78 |
| 79 Or, for Australian English: |
| 80 |
| 81 SELECT icu_load_collation('en_AU', 'australian'); |
| 82 |
| 83 The identifiers "turkish" and "australian" may then be used |
| 84 as collation sequence identifiers in SQL statements: |
| 85 |
| 86 CREATE TABLE aust_turkish_penpals( |
| 87 australian_penpal_name TEXT COLLATE australian, |
| 88 turkish_penpal_name TEXT COLLATE turkish |
| 89 ); |
| 90 |
| 91 1.4 SQL REGEXP Operator |
| 92 |
| 93 This extension provides an implementation of the SQL binary |
| 94 comparision operator "REGEXP", based on the regular expression functions |
| 95 provided by the ICU library. The syntax of the operator is as described |
| 96 in SQLite documentation: |
| 97 |
| 98 <string> REGEXP <re-pattern> |
| 99 |
| 100 This extension uses the ICU defaults for regular expression matching |
| 101 behaviour. Specifically, this means that: |
| 102 |
| 103 * Matching is case-sensitive, |
| 104 * Regular expression comments are not allowed within patterns, and |
| 105 * The '^' and '$' characters match the beginning and end of the |
| 106 <string> argument, not the beginning and end of lines within |
| 107 the <string> argument. |
| 108 |
| 109 Even more specifically, the value passed to the "flags" parameter |
| 110 of ICU C function uregex_open() is 0. |
| 111 |
| 112 |
| 113 2 COMPILATION AND USAGE |
| 114 |
| 115 The easiest way to compile and use the ICU extension is to build |
| 116 and use it as a dynamically loadable SQLite extension. To do this |
| 117 using gcc on *nix: |
| 118 |
| 119 gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so |
| 120 |
| 121 You may need to add "-I" flags so that gcc can find sqlite3ext.h |
| 122 and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be |
| 123 loaded into sqlite in the same way as any other dynamically loadable |
| 124 extension. |
| 125 |
| 126 |
| 127 3 BUGS, PROBLEMS AND SECURITY ISSUES |
| 128 |
| 129 3.1 The "case_sensitive_like" Pragma |
| 130 |
| 131 This extension does not work well with the "case_sensitive_like" |
| 132 pragma. If this pragma is used before the ICU extension is loaded, |
| 133 then the pragma has no effect. If the pragma is used after the ICU |
| 134 extension is loaded, then SQLite ignores the ICU implementation and |
| 135 always uses the built-in LIKE operator. |
| 136 |
| 137 The ICU extension LIKE operator is always case insensitive. |
| 138 |
| 139 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro |
| 140 |
| 141 Passing very long patterns to the built-in SQLite LIKE operator can |
| 142 cause excessive CPU usage. To curb this problem, SQLite defines the |
| 143 SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a |
| 144 pattern in bytes (irrespective of encoding). The default value is |
| 145 defined in internal header file "limits.h". |
| 146 |
| 147 The ICU extension LIKE implementation suffers from the same |
| 148 problem and uses the same solution. However, since the ICU extension |
| 149 code does not include the SQLite file "limits.h", modifying |
| 150 the default value therein does not affect the ICU extension. |
| 151 The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by |
| 152 the ICU extension LIKE operator is 50000, defined in source |
| 153 file "icu.c". |
| 154 |
| 155 3.3 Collation Sequence Security Issue |
| 156 |
| 157 Internally, SQLite assumes that indices stored in database files |
| 158 are sorted according to the collation sequence indicated by the |
| 159 SQL schema. Changing the definition of a collation sequence after |
| 160 an index has been built is therefore equivalent to database |
| 161 corruption. The SQLite library is not very well tested under |
| 162 these conditions, and may contain potential buffer overruns |
| 163 or other programming errors that could be exploited by a malicious |
| 164 programmer. |
| 165 |
| 166 If the ICU extension is used in an environment where potentially |
| 167 malicious users may execute arbitrary SQL (i.e. gears), they |
| 168 should be prevented from invoking the icu_load_collation() function, |
| 169 possibly using the authorisation callback. |
OLD | NEW |