1. FTS2 Tokenizers

  When creating a new full-text table, FTS2 allows the user to select
  the text tokenizer implementation to be used when indexing text
  by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE
  statement:

    CREATE VIRTUAL TABLE <table-name> USING fts2(
      <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]]
    );

  The built-in tokenizers (valid values to pass as <tokenizer-name>) are
  "simple" and "porter".

  <tokenizer-args> should consist of zero or more whitespace-separated
  arguments to pass to the selected tokenizer implementation. The
  interpretation of the arguments, if any, depends on the individual
  tokenizer.

2. Custom Tokenizers

  FTS2 allows users to provide custom tokenizer implementations. The
  interface used to create a new tokenizer is defined and described in
  the fts2_tokenizer.h source file.

  Registering a new FTS2 tokenizer is similar to registering a new
  virtual table module with SQLite. The user passes a pointer to a
  structure containing pointers to the callback functions that
  make up the implementation of the new tokenizer type. For tokenizers,
  the structure (defined in fts2_tokenizer.h) is called
  "sqlite3_tokenizer_module".

  FTS2 does not expose a C function that users call to register new
  tokenizer types with a database handle. Instead, the pointer must
  be encoded as an SQL blob value and passed to FTS2 through the SQL
  engine by evaluating a special scalar function, "fts2_tokenizer()".
  The fts2_tokenizer() function may be called with one or two arguments,
  as follows:

    SELECT fts2_tokenizer(<tokenizer-name>);
    SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>);

  Here <tokenizer-name> is a string identifying the tokenizer and
  <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module
  structure encoded as an SQL blob. If the second argument is present,
  it is registered as tokenizer <tokenizer-name> and a copy of it is
  returned. If only one argument is passed, a pointer to the tokenizer
  implementation currently registered as <tokenizer-name> is returned,
  encoded as a blob. If no such tokenizer exists, an SQL exception
  (error) is raised.
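
  To make the blob encoding concrete: the blob carries the bytes of the
  pointer value itself, not a copy of the structure it points to. The
  following is a minimal sketch of that round trip in plain C, without
  SQLite. DemoModule, encodeModulePtr(), decodeModulePtr() and
  roundTripOk() are hypothetical names used for illustration only; they
  are not part of FTS2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for sqlite3_tokenizer_module (illustration only). */
typedef struct DemoModule { int iVersion; } DemoModule;

/* Copy the bytes of the pointer p (not the structure) into aBlob.
** Returns the number of bytes written, which is sizeof(p). */
static size_t encodeModulePtr(const DemoModule *p, unsigned char *aBlob){
  memcpy(aBlob, &p, sizeof(p));
  return sizeof(p);
}

/* Rebuild the pointer from a blob of nBlob bytes. Returns NULL if the
** blob is not exactly pointer-sized. */
static const DemoModule *decodeModulePtr(
  const unsigned char *aBlob,
  size_t nBlob
){
  const DemoModule *p = 0;
  if( nBlob==sizeof(p) ){
    memcpy(&p, aBlob, sizeof(p));
  }
  return p;
}

/* Round-trip check: decoding an encoded pointer yields the same address. */
static int roundTripOk(void){
  static const DemoModule module = { 0 };
  unsigned char aBlob[sizeof(const DemoModule *)];
  size_t n = encodeModulePtr(&module, aBlob);
  assert( n==sizeof(aBlob) );
  return decodeModulePtr(aBlob, n)==&module;
}
```

  Because the blob is just an address, it is only meaningful inside the
  process that produced it, and the structure it refers to has to stay
  valid for as long as the tokenizer may be used.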

  SECURITY: If the fts2 extension is used in an environment where potentially
  malicious users may execute arbitrary SQL (e.g. Gears), they should be
  prevented from invoking the fts2_tokenizer() function, possibly by using the
  authorization callback (see sqlite3_set_authorizer()).

  See "Sample code" below for an example of calling the fts2_tokenizer()
  function from C code.

3. ICU Library Tokenizers

  If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor
  symbol defined, then there exists a built-in tokenizer named "icu"
  implemented using the ICU library. The first argument passed to the
  xCreate() method (see fts2_tokenizer.h) of this tokenizer may be
  an ICU locale identifier, such as "tr_TR" for Turkish as used
  in Turkey, or "en_AU" for English as used in Australia. For example:

    "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)"

  The ICU tokenizer implementation is very simple. It splits the input
  text according to the ICU rules for finding word boundaries and discards
  any tokens that consist entirely of white-space. This may be suitable
  for some applications in some locales, but not all. If more complex
  processing is required, for example to implement stemming or
  discard punctuation, this can be done by creating a tokenizer
  implementation that uses the ICU tokenizer as part of its implementation.

  When using the ICU tokenizer this way, it is safe to overwrite the
  contents of the strings returned by the xNext() method (see
  fts2_tokenizer.h).

4. Sample code

  The following two code samples illustrate the way C code should invoke
  the fts2_tokenizer() scalar function:

      #include <string.h>            /* memcpy() */
      #include "sqlite3.h"
      #include "fts2_tokenizer.h"    /* sqlite3_tokenizer_module */

      /* Register module p under the name zName with database handle db. */
      int registerTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module *p
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts2_tokenizer(?, ?)";

        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        /* The blob holds the bytes of the pointer itself, not the structure. */
        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC);
        sqlite3_step(pStmt);

        /* sqlite3_finalize() reports the error code of a failed step, if any. */
        return sqlite3_finalize(pStmt);
      }

      /* Look up the module registered as zName. *pp is left as 0 if the
      ** lookup fails or does not return a pointer-sized blob. */
      int queryTokenizer(
        sqlite3 *db,
        const char *zName,
        const sqlite3_tokenizer_module **pp
      ){
        int rc;
        sqlite3_stmt *pStmt;
        const char zSql[] = "SELECT fts2_tokenizer(?)";

        *pp = 0;
        rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0);
        if( rc!=SQLITE_OK ){
          return rc;
        }

        sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
        if( SQLITE_ROW==sqlite3_step(pStmt) ){
          if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){
            const void *pBlob = sqlite3_column_blob(pStmt, 0);
            if( sqlite3_column_bytes(pStmt, 0)==(int)sizeof(*pp) ){
              memcpy(pp, pBlob, sizeof(*pp));
            }
          }
        }

        return sqlite3_finalize(pStmt);
      }