OLD | NEW |
1 In-memory appengine wrapper | 1 Datastore implementation internals |
2 =========================== | 2 ================================== |
3 | 3 |
4 | 4 This document contains internal implementation details for this in-memory |
5 Notes on the internal encodings | 5 version of datastore. It's mostly helpful to understand what's going on in this |
6 ------------------------------- | 6 implementation, but it also can reveal some insight into how the real appengine |
7 | 7 datastore works (though note that the specific encodings are different). |
8 All datatypes inside of the index Collections of the gkvlite Store are stored | 8 |
9 in a manner which allows them to be compared entirely via bytes.Compare. All | 9 Additionally note that this implementation cheats by moving some of the Key |
10 types are prefixed by a sortable type byte which encodes the sort-order of types | 10 bytes into the table (collection) names (like the namespace, property name for |
11 according to the appengine SDK. Additionally, types have the following data | 11 the builtin indexes, etc.). The real implementation contains these bytes in the |
12 encoding: | 12 table row keys, I think. |
13 * ints | 13 |
14 * stored with the `funnybase` varint encoding | 14 |
15 * floats | 15 Internal datastore key/value collection schema |
16 * http://stereopsis.com/radix.html | 16 ---------------------------------------------- |
17 * toBytes: | 17 |
18 ``` | 18 The datastore implementation here uses several different tables ('collections') |
19 b := math.Float64bits(f) | 19 to manage state for the data. The schema for these tables is enumerated below |
20 return b ^ (-(b >> 63) | 0x8000000000000000) | 20 to make the code a bit easier to reason about. |
21 ``` | 21 |
22 * fromBytes: | 22 All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized |
23 ``` | 23 using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the |
24 return math.Float64frombits(b ^ ((b >> 63) - 1) | 0x8000000000000000) | 24 primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings |
25 ``` | 25 are important to understanding why the schemas below sort correctly when |
26 * string, []byte, BlobKey, ByteString | 26 compared only using `bytes.Compare` (aka `memcmp`). This doc will assume that |
27 * funnybase byte count | 27 you're familiar with those encodings, but will point out where we diverge from |
28 * raw bytes | 28 the stock encodings. |
29 * \*Key, GeoPoint | 29 |
30 * composite of above types | 30 All encoded Property values used in gkvlite Keys (i.e. index rows) are |
31 * time.Time | 31 serialized using the settings `serialize.WithoutContext`, and |
32 * composite of above types, stored with microsecond accuracy. | 32 `datastore.ShouldIndex`. |
33 * rounding to microseconds is a limitation of the real appengine. | 33 |
34 * toMicro: `return t.Unix()*1e6 + int64(t.Nanosecond()/1e3)` | 34 ### Primary table |
35 * fromMicro: `return time.Unix(t/1e6, (t%1e6)*1e3)` | 35 |
36 * nil, true, false | 36 The primary table maps datastore keys to entities. |
37 * value is encoded directly in the type byte | 37 - Name: `"ents:" + namespace` |
38 | 38 - Key: serialized datastore.Property containing the entity's datastore.Key |
39 | 39 - Value: serialized datastore.PropertyMap |
40 Gkvlite Collection schema | 40 |
41 ------------------------- | 41 This table also encodes values for the following special keys: |
42 | 42 - Every entity root (e.g. a Key with nil Parent()) with key K has: |
43 In order to provide efficient result deduplication, the value of an index row | 43 - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}` |
44 which indexes 1 or more properties is a concatenation of the previous values | 44 A child entity with the kind `__entity_group__` and an id of `1`. The value |
45 which would show up in the same index. For example, if you have the property | 45 has a single property `__version__`, which contains the version number of |
46 list for the key K: | 46 this entity group. This is used to detect transaction conflicts. |
47 | 47 - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}` |
48 bob: 1 | 48 A child entity with the kind `__entity_group__` and an id of `1`. The value |
49 bob: 4 | 49 has a single property `__version__`, which contains the last automatically |
50 bob: 7 | 50 allocated entity ID for entities within this entity group. |
51 cat: "hello" | 51 - A root entity with the key `Key("__entity_group_ids__",1)` which contains the |
52 cat: "world" | 52 same `__version__` property, and indicates the last automatically allocated |
53 | 53 entity ID for root entities. |
54 And the regular (non-ancestor) composite index was {bob, -cat}, you'd have the | 54 |
55 rows in the index `idx:ns:kind|R|bob|-cat` (| in the row indicates | 55 ### Compound Index table |
56 concatenation, each value has an implied type byte. `...` indicates that other | 56 |
57 rows may be present): | 57 The next table keeps track of all the user-added 'compound' index descriptions |
58 | 58 (not the content for the indexes). There is a row in this table for each |
59 ... | 59 compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`. |
60 1|"world"|K = nil|nil | 60 - Name: `"idx"` |
61 ... | 61 - Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice |
62 1|"hello"|K = nil|"world" | 62 in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`). |
63 ... | 63 - Value: empty |
64 4|"world"|K = 1|nil | 64 |
65 ... | 65 The Key format here requires some special attention. Say you started with |
66 4|"hello"|K = 1|"world" | 66 a compound IndexDefinition of: |
67 ... | 67 IndexDefinition{ |
68 7|"world"|K = 4|nil | 68 Kind: "Foo", |
69 ... | 69 Ancestor: true, |
70 7|"hello"|K = 4|"world" | 70 SortBy: []IndexColumn{ |
71 ... | 71 {Property: "Something", Direction: DESCENDING}, |
72 | 72 {Property: "Else", Direction: ASCENDING}, |
73 This allows us to, start scanning at any point and be able to determine if we've | 73 {Property: "Cool", Direction: ASCENDING}, |
74 returned a given key already (without storing all of the keys in memory | 74 } |
75 for the duration of the Query run). We can do this because we can see if the | 75 } |
76 value of an index row falls within the original query filter parameters. If it | 76 |
77 does, then we must already have returned they Key, and can safely skip the index | 77 After prepping it for the table, it would be equivalent to: |
78 row. AFAIK, real-datastore provides deduplication by keeping all the returned | 78 IndexDefinition{ |
79 keys in memory as it runs the query, and doing a set-check. | 79 Kind: "Foo", |
80 | 80 Ancestor: true, |
81 The end-result is semantically equivalent, with the exception that Query Cursors | 81 SortBy: []IndexColumn{ |
82 on the real datastore will potentially return the same Key in the first Cursor | 82 {Property: "__key__", Direction: ASCENDING}, |
83 use as well as on the 2nd (or Nth) cursor use, where this method will not. | 83 {Property: "Cool", Direction: ASCENDING}, |
84 | 84 {Property: "Else", Direction: ASCENDING}, |
85 collections | 85 {Property: "Something", Direction: DESCENDING}, |
86 ents:ns -> key -> value | 86 } |
87 (rootkind, rootid, __entity_group__,1) -> {__version__: int} | 87 } |
88 (rootkind, rootid, __entity_group_ids__,1) -> {__version__: int} | 88 |
89 (__entity_group_ids__,1) -> {__version__: int} | 89 The reason for doing this will be covered in the `Query Planning` section, but |
90 // TODO(iannucci): Journal every entity write in a log with a globally | 90 it boils down to allowing the query planner to use this encoded table to |
91 // increasing version number (aka "timestamp"). | 91 intelligently scan for potentially useful compound indexes. |
92 // | 92 |
93 // TODO(iannucci): Use the value in idx collection to indicate the last | 93 ### Index Tables |
94 // global log version reflected in this index. Then index updates can happen | 94 |
95 // in parallel, in a truly eventually-consistent fashion (and completely | 95 Every index (both builtin and compound) has one index table per namespace, which |
96 // avoid holding the DB writelock while calculating index entries). | 96 contains as rows every entry in the index, one per row. |
97 // Unfortunately, copying new records (and removing old ones) into the DB | 97 - Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()` |
98 // would still require holding global writelock. | 98 - Key: concatenated datastore.Property values, one per SortBy column in the |
99 // | 99 IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is |
100 // TODO(iannucci): how do we garbage-collect the journal? | 100 DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF). |
101 // | 101 - Value: empty |
102 // TODO(iannucci): add the ability in gkvlite to 'swap' a collection with | 102 |
103 // another one, transactionally? Not sure if this is possible to do easily. | 103 If the IndexDefinition has `Ancestor: true`, then the very first column of the |
104 // If we had this, then we could do all the index writes for a given index | 104 Key contains the partial Key for the entity. So if an entity has the datastore |
105 // on the side, and then do a quick swap into place with a writelock. As | 105 key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`, |
106 // long as every index only has a single goroutine writing it, then this | 106 `/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor column |
107 // would enable maximum concurrency, since all indexes could update in | 107 of indexes that it matches. |
108 // parallel and only synchronize for the duration of a single pointer swap. | 108 |
109 idx -> kind|A?|[-?prop]* = nil | 109 #### Builtin (automatic) indexes |
110 idx:ns:kind -> key = nil | 110 |
111 idx:ns:kind|prop -> propval|key = [prev val] | 111 The following indexes are automatically created for some entity with a key |
112 idx:ns:kind|-prop -> -propval|key = [next val] | 112 `/Kind,*`, for every property (with `ShouldIndex` values) named "Foo": |
113 idx:ns:kind|A|?prop|?prop -> A|propval|propval|key = [prev/next val]|[prev/next val] | 113 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
114 idx:ns:kind|?prop|?prop -> propval|propval|key = [prev/next val]|[prev/next val] | 114 {Property: "__key__", Direction: ASCENDING}, |
| 115 }} |
| 116 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
| 117 {Property: "Foo", Direction: ASCENDING}, |
| 118 {Property: "__key__", Direction: ASCENDING}, |
| 119 }} |
| 120 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
| 121 {Property: "Foo", Direction: DESCENDING}, |
| 122 {Property: "__key__", Direction: ASCENDING}, |
| 123 }} |
| 124 |
| 125 Index updates |
| 126 ------------- |
| 127 |
| 128 (Note that this is a LARGE departure from how the production appengine datastore |
| 129 does this. This model only works because the implementation is not distributed, |
| 130 and not journaled. The real datastore does index updates in parallel and is |
| 131 generally pretty fancy compared to this). |
| 132 |
| 133 Index updates are pretty straightforward. On a mutation to the primary entity |
| 134 table, we take the old entity value (remember that entity values are |
| 135 PropertyMaps), the new property value, create index entries for both, merge |
| 136 them, and apply the deltas to the affected index tables (i.e. entries that |
| 137 exist in the old entities, but not the new ones, are deleted. Entries that exist |
| 138 in the new entities, but not the old ones, are added). |
| 139 |
| 140 Index generation works (given a slice of indexes []Idxs) by: |
| 141 * serializing all ShouldIndex Properties in the PropertyMap to get a |
| 142 `map[name][]serializedProperty`. |
| 143 * for each index idx |
| 144 * if idx's columns contain properties that are not in the map, skip idx |
| 145 * make a `[][]serializedProperty`, where each serializedProperty slice |
| 146 corresponds with the IndexColumn of idx.SortBy |
| 147 * duplicated values for multi-valued properties are skipped. |
| 148 * generate a `[]byte` row which is the concatenation of one value from each |
| 149 `[]serializedProperty`, permuting through all combinations. If the SortBy |
| 150 column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty |
| 151 value! |
| 152 * add that generated []byte row to the index's corresponding table. |
| 153 |
| 154 Note that we choose to serialize all permutations of the saved entity. This is |
| 155 so that we can use repeated-column indexes to fill queries which use a subset of |
| 156 the columns. E.g. if we have the index `duck,duck,duck,goose`, we can |
| 157 theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by |
| 158 pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index |
| 159 selection at the expense of larger indexes. However, it means that if you have |
| 160 the entity: |
| 161 duck = 1, 2, 3, 4 |
| 162 goose = "færøske" |
| 163 |
| 164 It generates the following index entries: |
| 165 duck=1,duck=1,duck=1,goose="færøske" |
| 166 duck=1,duck=1,duck=2,goose="færøske" |
| 167 duck=1,duck=1,duck=3,goose="færøske" |
| 168 duck=1,duck=1,duck=4,goose="færøske" |
| 169 duck=1,duck=2,duck=1,goose="færøske" |
| 170 duck=1,duck=2,duck=2,goose="færøske" |
| 171 duck=1,duck=2,duck=3,goose="færøske" |
| 172 duck=1,duck=2,duck=4,goose="færøske" |
| 173 duck=1,duck=3,duck=1,goose="færøske" |
| 174 ... a lot ... |
| 175 duck=4,duck=4,duck=4,goose="færøske" |
| 176 |
| 177 This is a very large number of index rows (i.e. an 'exploding index')! |
| 178 |
| 179 An alternate design would be to only generate unique permutations of elements |
| 180 where the index has repeated columns of a single property. This only makes sense |
| 181 because it's illegal to have an equality and an inequality on the same property, |
| 182 under the current constraints of appengine (though not completely ridiculous in |
| 183 general, if inequality constraints meant the same thing as equality constraints; |
| 184 however, it would lead to a multi-dimensional query, which can be quite slow and |
| 185 is very difficult to scale without application knowledge). If we do this, it |
| 186 also means that we need to SORT the equality filter values when generating the |
| 187 prefix (so that the least-valued equality constraint is first). If we did this, |
| 188 then the generated index rows for the above entity would be: |
| 189 duck=1,duck=2,duck=3,goose="færøske" |
| 190 duck=1,duck=2,duck=4,goose="færøske" |
| 191 duck=1,duck=3,duck=4,goose="færøske" |
| 192 duck=2,duck=3,duck=4,goose="færøske" |
| 193 |
| 194 Which would be a LOT more compact. It may be worth implementing this restriction |
| 195 later, simply for the memory savings when indexing multi-valued properties. |
| 196 |
| 197 If this technique is used, there's also room to unambiguously index entities |
| 198 with repeated equivalent values. E.g. if duck=1,1,2,3,4 , then you could see |
| 199 a row in the index like: |
| 200 duck=1,duck=1,duck=2,goose="færøske" |
| 201 |
| 202 Which would allow you to query for "an entity which has duck values equal to 1, |
| 203 1 and 2". Currently such a query is not possible to execute (it would be |
| 204 equivalent to "an entity which has duck values equal to 1 and 2"). |
| 205 |
| 206 Query planning |
| 207 -------------- |
| 208 |
| 209 Now that we have all of our data tabulated, let's plan some queries. The |
| 210 high-level algorithm works like this: |
| 211 * Generate a suffix format from the user's query which looks like: |
| 212 * orders (including the inequality as the first order, if any) |
| 213 * projected fields which aren't explicitly referenced in the orders (we |
| 214 assume ASCENDING order for them), in the order that they were projected. |
| 215 * `__key__` (implied ascending, unless the query's last sort order is for |
| 216 `__key__`, in which case it's whatever order the user specified) |
| 217 * Reverse the order of this suffix format, and serialize it into an |
| 218 IndexDefinition, along with the query's Kind and Ancestor values. This |
| 219 does what PrepForIdxTable did when we added the Index in the first place. |
| 220 * Use this serialized reversed index to find compound indexes which might |
| 221 match by looking up rows in the "idx" table which begin with this serialized |
| 222 reversed index. |
| 223 * Generate every builtin index for the inequality + equality filter |
| 224 properties, and see if they match too. |
| 225 |
| 226 An index is a potential match if its suffix *exactly* matches the suffix format, |
| 227 and it contains *only* sort orders which appear in the query (i.e. an index is |
| 228 disqualified if it contains a column which doesn't appear as an equality or inequality filter). |
| 229 |
| 230 The index search continues until: |
| 231 * We find at least one matching index; AND |
| 232 * The combination of all matching indexes accounts for every equality filter |
| 233 at least once. |
| 234 |
| 235 If we fail to find sufficient indexes to fulfill the query, we generate an index |
| 236 description that *could* be sufficient by concatenating all missing equality |
| 237 filters, in ascending order, followed by concatenating the suffix format that we |
| 238 generated for this query. We then suggest this new index to the user for them to |
| 239 add by returning an error containing the generated IndexDefinition. Note that |
| 240 this index is not REQUIRED for the user to add; they could choose to add bits |
| 241 and pieces of it, extend existing indexes in order to cover the missing columns, |
| 242 invert the direction of some of the equality columns, etc. |
| 243 |
| 244 Recall that equality filters are expressed as |
| 245 `map[propName][]serializedProperty`. We'll refer to this mapping as the |
| 246 'constraint' mapping below. |
| 247 |
| 248 To actually come up with the final index selection, we sort all the matching |
| 249 indexes from greatest number of columns to least. We add the 0th index (the one |
| 250 with the greatest number of columns) unconditionally. We then keep adding indexes |
| 251 which contain one or more of the remaining constraints, until we have no |
| 252 more constraints to satisfy. |
| 253 |
| 254 Adding an index entails determining which columns in that index correspond to |
| 255 equality columns, and which ones correspond to inequality/order/projection |
| 256 columns. Recall that the inequality/order/projection columns are all the same |
| 257 for all of the potential indices (i.e. they all share the same *suffix format*). |
| 258 We can use this to just iterate over the index's SortBy columns which we'll use |
| 259 for equality filters. For each equality column, we remove a corresponding value |
| 260 from the constraints map. In the event that we _run out_ of constraints for a |
| 261 given column, we simply _pick an arbitrary value_ from the original equality |
| 262 filter mapping and use that. This is valid to do because, after all, they're |
| 263 equality filters. |
| 264 |
| 265 Note that for compound indexes, the ancestor key counts as an equality filter, |
| 266 and if the compound index has `Ancestor: true`, then we implicitly put the |
| 267 ancestor as if it were the first SortBy column. For satisfying Ancestor queries |
| 268 with built-in indexes, see the next section. |
| 269 |
| 270 Once we've got our list of constraints for this index, we concatenate them all |
| 271 together to get the *prefix* for this index. When iterating over this index, we |
| 272 only ever want to see index rows whose prefix exactly matches this. Unlike the |
| 273 suffix format, the prefix is per-index (remember that ALL indexes in the |
| 274 query must have the same suffix format). |
| 275 |
| 276 Finally, we set the 'start' and 'end' values for all chosen indexes to either |
| 277 the Start and End cursors, or the Greater-Than and Less-Than values for the |
| 278 inequality. The Cursors contain values for every suffix column, and the |
| 279 inequality only contains a value for the first suffix column. If both cursors |
| 280 and an inequality are specified, we take the smaller set of both (the |
| 281 combination which will return the fewest rows). |
| 282 |
| 283 That's about it for index selection! See Query Execution for how we actually use |
| 284 the selected indexes to run a query. |
| 285 |
| 286 ### Ancestor queries and Built-in indexes |
| 287 |
| 288 You may have noticed that the built-in indexes can be used for Ancestor queries |
| 289 with equality filters, but they don't start with the magic Ancestor column! |
| 290 |
| 291 There's a trick that you can do if the suffix format for the query is just |
| 292 `__key__` though (e.g. the query only contains equality filters, and/or an |
| 293 inequality filter on `__key__`). You can serialize the datastore key that you're |
| 294 planning to use for the Ancestor query, then chop off the terminating null byte |
| 295 from the encoding, and then use this as additional prefix bytes for this index. |
| 296 So if the builtin for the "Val" property has the column format of: |
| 297 |
| 298 {Property: "Val"}, {Property: "__key__"} |
| 299 |
| 300 And your query holds Val as an equality filter, you can serialize the |
| 301 ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had |
| 302 an index row: |
| 303 |
| 304 PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP |
| 305 |
| 306 (where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for |
| 307 the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as: |
| 308 |
| 309 PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 |
| 310 |
| 311 Omitting the STOP which normally terminates the Key encoding. Using this prefix |
| 312 will only return index rows which are `/Kind,1` or its children. |
| 313 |
| 314 "That's cool! Why not use this trick for compound indexes?", I hear you ask :) |
| 315 Remember that this trick only works if the prefix before the `__key__` is |
| 316 *entirely* composed of equality filters. Also recall that if you ONLY use |
| 317 equality filters and Ancestor (and possibly an inequality on `__key__`), then |
| 318 you can always satisfy the query from the built-in indexes! While you |
| 319 technically could do it with a compound index, there's not really a point to |
| 320 doing so. To remain faithful to the production datastore implementation, we |
| 321 don't implement this trick for anything other than the built-in indexes. |
| 322 |
| 323 ### Cursor format |
| 324 |
| 325 Cursors work by containing values for each of the columns in the suffix, in the |
| 326 order and Direction specified by the suffix. In fact, cursors are just encoded |
| 327 versions of the []IndexColumn used for the 'suffix format', followed by the |
| 328 raw bytes of the suffix for that particular row (incremented by 1 bit). |
| 329 |
| 330 This means that technically you can port cursors between any queries which share |
| 331 precisely the same suffix format, regardless of other query options, even if the |
| 332 index planner ends up choosing different indexes to use from the first query to |
| 333 the second. No state is maintained in the service implementation for cursors. |
| 334 |
| 335 I suspect that this is a more liberal version of cursors than how the production |
| 336 appengine implements them, but I haven't verified one way or the other. |
| 337 |
| 338 Query execution |
| 339 --------------- |
| 340 |
| 341 Last but not least, we need to actually execute the query. After figuring out |
| 342 which indexes to use with what prefixes and start/end values, we essentially |
| 343 have a list of index subsets, all sorted the same way. To pull the values out, |
| 344 we start by iterating the first index in the list, grabbing its suffix value, |
| 345 and trying to iterate from that suffix in the second, third, fourth, etc index. |
| 346 |
| 347 If any index iterates past that suffix, we start back at the 0th index with that |
| 348 suffix, and continue to try to find a matching row. Doing this will end up |
| 349 skipping large portions of all of the indexes in the list. This is the algorithm |
| 350 known as "zigzag merge join", and you can find talks on it from some of the |
| 351 appengine folks. It has very good algorithmic running time and tends to scale |
| 352 with the number of full matches, rather than the size of all of the indexes |
| 353 involved. |
| 354 |
| 355 A hit occurs when all of the iterators have precisely the same suffix. This hit |
| 356 suffix is then decoded using the suffix format information. The very last column |
| 357 of the suffix will always be the datastore key. The suffix is then used to call |
| 358 back to the user, according to the query type: |
| 359 * keys-only queries just directly return the Key |
| 360 * projection queries return the projected fields from the decoded suffix. |
| 361 Remember how we added all the projections after the orders? This is why. The |
| 362 projected values are pulled directly from the index, instead of going to the |
| 363 main entity table. |
| 364 * normal queries pull the decoded Key from the "ents" table, and return that |
| 365 entity to the user. |