| OLD | NEW |
| (Empty) |
| 1 | |
| 2 1. FTS3 Tokenizers | |
| 3 | |
| 4 When creating a new full-text table, FTS3 allows the user to select | |
| 5 the text tokenizer implementation to be used when indexing text | |
| 6 by specifying a "tokenize" clause as part of the CREATE VIRTUAL TABLE | |
| 7 statement: | |
| 8 | |
| 9 CREATE VIRTUAL TABLE <table-name> USING fts3( | |
| 10 <columns ...> [, tokenize <tokenizer-name> [<tokenizer-args>]] | |
| 11 ); | |
| 12 | |
| 13 The built-in tokenizers (valid values to pass as <tokenizer name>) are | |
| 14 "simple" and "porter". | |
| 15 | |
| 16 <tokenizer-args> should consist of zero or more white-space separated | |
| 17 arguments to pass to the selected tokenizer implementation. The | |
| 18 interpretation of the arguments, if any, depends on the individual | |
| 19 tokenizer. | |
| 20 | |
| 21 2. Custom Tokenizers | |
| 22 | |
| 23 FTS3 allows users to provide custom tokenizer implementations. The | |
| 24 interface used to create a new tokenizer is defined and described in | |
| 25 the fts3_tokenizer.h source file. | |
| 26 | |
| 27 Registering a new FTS3 tokenizer is similar to registering a new | |
| 28 virtual table module with SQLite. The user passes a pointer to a | |
| 29 structure containing pointers to various callback functions that | |
| 30 make up the implementation of the new tokenizer type. For tokenizers, | |
| 31 the structure (defined in fts3_tokenizer.h) is called | |
| 32 "sqlite3_tokenizer_module". | |
| 33 | |
| 34 FTS3 does not expose a C-function that users call to register new | |
| 35 tokenizer types with a database handle. Instead, the pointer must | |
| 36 be encoded as an SQL blob value and passed to FTS3 through the SQL | |
| 37 engine by evaluating a special scalar function, "fts3_tokenizer()". | |
| 38 The fts3_tokenizer() function may be called with one or two arguments, | |
| 39 as follows: | |
| 40 | |
| 41 SELECT fts3_tokenizer(<tokenizer-name>); | |
| 42 SELECT fts3_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>); | |
| 43 | |
| 44 Where <tokenizer-name> is a string identifying the tokenizer and | |
| 45 <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module | |
| 46 structure encoded as an SQL blob. If the second argument is present, | |
| 47 it is registered as tokenizer <tokenizer-name> and a copy of it | |
| 48 returned. If only one argument is passed, a pointer to the tokenizer | |
| 49 implementation currently registered as <tokenizer-name> is returned, | |
| 50 encoded as a blob. Or, if no such tokenizer exists, an SQL exception | |
| 51 (error) is raised. | |
| 52 | |
| 53 SECURITY: If the fts3 extension is used in an environment where potentially | |
| 54 malicious users may execute arbitrary SQL (i.e. gears), they should be | |
| 55 prevented from invoking the fts3_tokenizer() function, possibly using the | |
| 56 authorisation callback. | |
| 57 | |
| 58 See "Sample code" below for an example of calling the fts3_tokenizer() | |
| 59 function from C code. | |
| 60 | |
| 61 3. ICU Library Tokenizers | |
| 62 | |
| 63 If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor | |
| 64 symbol defined, then there exists a built-in tokenizer named "icu" | |
| 65 implemented using the ICU library. The first argument passed to the | |
| 66 xCreate() method (see fts3_tokenizer.h) of this tokenizer may be | |
| 67 an ICU locale identifier. For example "tr_TR" for Turkish as used | |
| 68 in Turkey, or "en_AU" for English as used in Australia. For example: | |
| 69 | |
| 70 "CREATE VIRTUAL TABLE thai_text USING fts3(text, tokenizer icu th_TH)" | |
| 71 | |
| 72 The ICU tokenizer implementation is very simple. It splits the input | |
| 73 text according to the ICU rules for finding word boundaries and discards | |
| 74 any tokens that consist entirely of white-space. This may be suitable | |
| 75 for some applications in some locales, but not all. If more complex | |
| 76 processing is required, for example to implement stemming or | |
| 77 discard punctuation, this can be done by creating a tokenizer | |
| 78 implementation that uses the ICU tokenizer as part of its implementation. | |
| 79 | |
| 80 When using the ICU tokenizer this way, it is safe to overwrite the | |
| 81 contents of the strings returned by the xNext() method (see | |
| 82 fts3_tokenizer.h). | |
| 83 | |
| 84 4. Sample code. | |
| 85 | |
| 86 The following two code samples illustrate the way C code should invoke | |
| 87 the fts3_tokenizer() scalar function: | |
| 88 | |
| 89 int registerTokenizer( | |
| 90 sqlite3 *db, | |
| 91 char *zName, | |
| 92 const sqlite3_tokenizer_module *p | |
| 93 ){ | |
| 94 int rc; | |
| 95 sqlite3_stmt *pStmt; | |
| 96 const char zSql[] = "SELECT fts3_tokenizer(?, ?)"; | |
| 97 | |
| 98 rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0); | |
| 99 if( rc!=SQLITE_OK ){ | |
| 100 return rc; | |
| 101 } | |
| 102 | |
| 103 sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC); | |
| 104 sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC); | |
| 105 sqlite3_step(pStmt); | |
| 106 | |
| 107 return sqlite3_finalize(pStmt); | |
| 108 } | |
| 109 | |
| 110 int queryTokenizer( | |
| 111 sqlite3 *db, | |
| 112 char *zName, | |
| 113 const sqlite3_tokenizer_module **pp | |
| 114 ){ | |
| 115 int rc; | |
| 116 sqlite3_stmt *pStmt; | |
| 117 const char zSql[] = "SELECT fts3_tokenizer(?)"; | |
| 118 | |
| 119 *pp = 0; | |
| 120 rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0); | |
| 121 if( rc!=SQLITE_OK ){ | |
| 122 return rc; | |
| 123 } | |
| 124 | |
| 125 sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC); | |
| 126 if( SQLITE_ROW==sqlite3_step(pStmt) ){ | |
| 127 if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){ | |
| 128 memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp)); | |
| 129 } | |
| 130 } | |
| 131 | |
| 132 return sqlite3_finalize(pStmt); | |
| 133 } | |
| OLD | NEW |