Side by Side Diff: impl/memory/README.md

Issue 1302813003: impl/memory: Implement Queries (Closed)
Base URL: https://github.com/luci/gae.git@add_multi_iterator
Patch Set: stringSet everywhere (created 5 years, 3 months ago)

Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful for understanding what's going on in
this implementation, but it can also reveal some insight into how the real
appengine datastore works (though note that the specific encodings are
different).

Additionally, note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are
serialized using `github.com/luci/gae/service/datastore/serialize`, which in
turn uses the primitives available in `github.com/luci/luci-go/common/cmpbin`.
The encodings are important for understanding why the schemas below sort
correctly when compared only with `bytes.Compare` (aka `memcmp`). This doc will
assume that you're familiar with those encodings, but will point out where we
diverge from the stock encodings.

All encoded Property values used in gkvlite Keys (i.e. index rows) are
serialized using the settings `serialize.WithoutContext` and
`datastore.ShouldIndex`.
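To see why memcmp-sortable encodings matter, here's a toy sketch; it uses
fixed-width big-endian integers rather than the real `cmpbin` primitives, but
demonstrates the property the schemas below rely on:

    // Toy stand-in for the ordering guarantee: big-endian fixed-width
    // encodings compare bytewise in the same order as the values they
    // encode, so rows can be ordered with bytes.Compare alone.
    // (uses "bytes" and "encoding/binary" from the standard library)
    func encode(n uint64) []byte {
      b := make([]byte, 8)
      binary.BigEndian.PutUint64(b, n)
      return b
    }

    func lessBySerializedForm(a, b uint64) bool {
      // Agrees with a < b, but computed on the serialized bytes, the
      // same way gkvlite compares index rows.
      return bytes.Compare(encode(a), encode(b)) < 0
    }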
### Primary table

The primary table maps datastore keys to entities.
- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:
- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
  same `__version__` property, and indicates the last automatically allocated
  entity ID for root entities.
### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.
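A minimal sketch of that prep step, using simplified stand-ins for the real
`IndexColumn`/`IndexDefinition` types and ignoring the normalization and
serialization that the real code also performs:

    // Simplified stand-ins for the real datastore types.
    const (
      ASCENDING = iota
      DESCENDING
    )

    type IndexColumn struct {
      Property  string
      Direction int
    }

    type IndexDefinition struct {
      Kind     string
      Ancestor bool
      SortBy   []IndexColumn
    }

    // prepForIdxTable sketches what PrepForIdxTable does to the SortBy
    // slice: append an ASCENDING __key__ column, then reverse, so the
    // serialized form can be prefix-scanned suffix-first by the planner.
    func prepForIdxTable(id IndexDefinition) IndexDefinition {
      cols := append([]IndexColumn{}, id.SortBy...)
      cols = append(cols, IndexColumn{Property: "__key__", Direction: ASCENDING})
      for i, j := 0, len(cols)-1; i < j; i, j = i+1, j-1 {
        cols[i], cols[j] = cols[j], cols[i]
      }
      id.SortBy = cols
      return id
    }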
### Index Tables

Every index (both builtin and compound) has one index table per namespace,
which contains as rows every entry in the index, one per row.
- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor
column of indexes that it matches.
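The inversion for DESCENDING columns is just a bytewise XOR; a minimal sketch:

    // invert XORs each byte of a serialized Property with 0xFF, so a
    // DESCENDING column sorts in reverse under bytes.Compare. It is its
    // own inverse: invert(invert(b)) equals b.
    func invert(b []byte) []byte {
      out := make([]byte, len(b))
      for i, c := range b {
        out[i] = c ^ 0xFF
      }
      return out
    }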
#### Builtin (automatic) indexes

The following indexes are automatically created for some entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine
datastore does this. This model only works because the implementation is not
distributed, and not journaled. The real datastore does index updates in
parallel and is generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist in the old entity, but not the new one, are deleted; entries that exist
in the new entity, but not the old one, are added).
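A minimal sketch of that delta step, treating index rows as a string-keyed set
(the real implementation works on serialized `[]byte` rows in gkvlite
collections):

    // applyDeltas diffs the index rows generated from the old and new
    // entity values, and applies only the difference to the index table.
    func applyDeltas(oldRows, newRows map[string]bool, del, add func(row string)) {
      for row := range oldRows {
        if !newRows[row] {
          del(row) // row existed before the mutation but not after
        }
      }
      for row := range newRows {
        if !oldRows[row] {
          add(row) // row is new with this mutation
        }
      }
    }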

Index generation works (given a slice of indexes `[]Idxs`) by:
* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
    * duplicated values for multi-valued properties are skipped.
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations (a sketch of
    this step follows the list). If the SortBy column is DESCENDING, make sure
    to invert (XOR 0xFF) the serializedProperty value!
  * add that generated `[]byte` row to the index's corresponding table.
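A sketch of the permutation step, assuming `serializedProperty` is just
`[]byte` with any DESCENDING inversion already applied:

    type serializedProperty = []byte

    // permute emits one index row per combination of one value drawn
    // from each column's slice of serialized values.
    func permute(cols [][]serializedProperty, emit func(row []byte)) {
      pick := make([]serializedProperty, len(cols))
      var rec func(i int)
      rec = func(i int) {
        if i == len(cols) {
          var row []byte
          for _, v := range pick {
            row = append(row, v...)
          }
          emit(row)
          return
        }
        for _, v := range cols[i] {
          pick[i] = v
          rec(i + 1)
        }
      }
      rec(0)
    }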

Note that we choose to serialize all permutations of the saved entity. This is
so that we can use repeated-column indexes to fill queries which use a subset
of the columns. E.g. if we have the index `duck,duck,duck,goose`, we can
theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by
pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index
selection at the expense of larger indexes. However, it means that if you have
the entity:

    duck = 1, 2, 3, 4
    goose = "færøske"

It generates the following index entries:

    duck=1,duck=1,duck=1,goose="færøske"
    duck=1,duck=1,duck=2,goose="færøske"
    duck=1,duck=1,duck=3,goose="færøske"
    duck=1,duck=1,duck=4,goose="færøske"
    duck=1,duck=2,duck=1,goose="færøske"
    duck=1,duck=2,duck=2,goose="færøske"
    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=1,goose="færøske"
    ... a lot ...
    duck=4,duck=4,duck=4,goose="færøske"

This is a very large number of index rows (4^3 = 64 of them here, i.e. an
'exploding index')!

An alternate design would be to only generate unique permutations of elements
where the index has repeated columns of a single property. This only makes
sense because it's illegal to have an equality and an inequality on the same
property under the current constraints of appengine (though it's not
completely ridiculous in general, if inequality constraints meant the same
thing as equality constraints; however, that would lead to a multi-dimensional
query, which can be quite slow and is very difficult to scale without
application knowledge). If we do this, it also means that we need to SORT the
equality filter values when generating the prefix (so that the least-valued
equality constraint is first). If we did this, then the generated index rows
for the above entity would be:

    duck=1,duck=2,duck=3,goose="færøske"
    duck=1,duck=2,duck=4,goose="færøske"
    duck=1,duck=3,duck=4,goose="færøske"
    duck=2,duck=3,duck=4,goose="færøske"

Which would be a LOT more compact. It may be worth implementing this
restriction later, simply for the memory savings when indexing multi-valued
properties.

If this technique is used, there's also room to unambiguously index entities
with repeated equivalent values. E.g. if duck=1,1,2,3,4, then you could see
a row in the index like:

    duck=1,duck=1,duck=2,goose="færøske"

Which would allow you to query for "an entity which has duck values equal to
1, 1 and 2". Currently such a query is not possible to execute (it would be
equivalent to "an entity which has duck values equal to 1 and 2").
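A hypothetical sketch of this alternate generation (again: this is NOT what
the current implementation does), reusing the `serializedProperty` alias from
the earlier sketch; it emits each k-element combination of the pre-sorted
values instead of every k-tuple:

    // combinations emits each k-element combination of the (pre-sorted)
    // values, preserving duplicates present in the input, so
    // [1,1,2,3,4] choose 3 includes a combination for [1,1,2].
    func combinations(vals []serializedProperty, k int, emit func([]serializedProperty)) {
      combo := make([]serializedProperty, 0, k)
      var rec func(start int)
      rec = func(start int) {
        if len(combo) == k {
          emit(append([]serializedProperty(nil), combo...))
          return
        }
        for i := start; i < len(vals); i++ {
          combo = append(combo, vals[i])
          rec(i + 1)
          combo = combo[:len(combo)-1]
        }
      }
      rec(0)
    }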

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:
* Generate a suffix format from the user's query (a sketch of this step
  follows the list) which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this
  serialized reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.

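A sketch of the suffix-format step, reusing the simplified `IndexColumn` type
from the earlier sketch; `orders` and `projections` stand in for fields
already extracted from the query:

    // suffixFormat assembles the suffix columns for a query: explicit
    // orders first (with the inequality placed first by the caller),
    // then projected-but-unordered fields (implicitly ASCENDING), then
    // __key__ unless the user already ordered on it.
    func suffixFormat(orders []IndexColumn, projections []string) []IndexColumn {
      suffix := append([]IndexColumn{}, orders...)
      seen := map[string]bool{}
      for _, o := range orders {
        seen[o.Property] = true
      }
      for _, p := range projections {
        if !seen[p] {
          suffix = append(suffix, IndexColumn{Property: p, Direction: ASCENDING})
          seen[p] = true
        }
      }
      if !seen["__key__"] {
        suffix = append(suffix, IndexColumn{Property: "__key__", Direction: ASCENDING})
      }
      return suffix
    }
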
An index is a potential match if its suffix *exactly* matches the suffix
format, and it contains *only* sort orders which appear in the query (e.g. an
index which contains a column that doesn't appear as an equality or inequality
filter is not a match).

The index search continues until:
* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate an
index description that *could* be sufficient by concatenating all missing
equality filters, in ascending order, followed by concatenating the suffix
format that we generated for this query. We then suggest this new index to the
user for them to add by returning an error containing the generated
IndexDefinition. Note that this index is not REQUIRED for the user to add;
they could choose to add bits and pieces of it, extend existing indexes in
order to cover the missing columns, invert the direction of some of the
equality columns, etc.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the
one with the greatest number of columns) unconditionally. We then keep adding
indexes which contain one or more of the remaining constraints, until we have
no more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indices (i.e. they all share the same *suffix
format*). We can use this to just iterate over the index's SortBy columns
which we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_
of constraints for a given column, we simply _pick an arbitrary value_ from
the original equality filter mapping and use that. This is valid to do
because, after all, they're equality filters.
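A sketch of that greedy loop, under the simplified types from the earlier
sketches; `remaining` maps each equality property to how many constraint
values are still unaccounted for, and `matches` is non-empty and pre-sorted by
column count, descending:

    // chooseIndexes greedily picks matching indexes until every equality
    // constraint has been consumed at least once.
    func chooseIndexes(matches []IndexDefinition, remaining map[string]int) []IndexDefinition {
      consume := func(id IndexDefinition) {
        for _, col := range id.SortBy {
          if remaining[col.Property] > 0 {
            remaining[col.Property]--
          }
        }
      }
      chosen := []IndexDefinition{matches[0]} // the widest match, unconditionally
      consume(matches[0])
      for _, id := range matches[1:] {
        for _, col := range id.SortBy {
          if remaining[col.Property] > 0 { // still covers a constraint: keep it
            chosen = append(chosen, id)
            consume(id)
            break
          }
        }
      }
      return chosen
    }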

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor
queries with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index,
we only ever want to see index rows whose prefix exactly matches this. Unlike
the suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually
use the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor
queries with equality filters, but they don't start with the magic Ancestor
column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that
you're planning to use for the Ancestor query, then chop off the terminating
null byte from the encoding, and then use this as additional prefix bytes for
this index. So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this
prefix will only return index rows which are `/Kind,1` or its children.
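A minimal sketch of forming that prefix, where `serializedKey` is the
already-encoded ancestor key ending in its STOP byte:

    // ancestorPrefix drops the trailing STOP (0x00) terminator from the
    // encoded ancestor key and appends the result to the equality
    // prefix, turning an exact-key match into a "key or any descendant"
    // prefix match.
    func ancestorPrefix(equalityPrefix, serializedKey []byte) []byte {
      noStop := serializedKey[:len(serializedKey)-1]
      return append(append([]byte(nil), equalityPrefix...), noStop...)
    }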

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indexes.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in
the order and Direction specified by the suffix. In fact, cursors are just
encoded versions of the `[]IndexColumn` used for the 'suffix format', followed
by the raw bytes of the suffix for that particular row (incremented by 1 bit).
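A plausible reading of "incremented by 1 bit" is adding one to the suffix
bytes as a big-endian quantity, so the cursor points just past the row it was
created from; a minimal sketch of that increment:

    // increment adds 1 to a byte string interpreted as a big-endian
    // integer, carrying from right to left. It returns nil if every byte
    // was 0xFF (nothing of the same length sorts after the input).
    func increment(b []byte) []byte {
      out := append([]byte(nil), b...)
      for i := len(out) - 1; i >= 0; i-- {
        out[i]++
        if out[i] != 0 {
          return out
        }
        // 0xFF wrapped to 0x00: keep carrying leftward.
      }
      return nil
    }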

This means that technically you can port cursors between any queries which
share precisely the same suffix format, regardless of other query options,
even if the index planner ends up choosing different indexes to use from the
first query to the second. No state is maintained in the service
implementation for cursors.

I suspect that this is a more liberal version of cursors than how the
production appengine implements them, but I haven't verified one way or the
other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc.
index.

If any index iterates past that suffix, we start back at the 0th index with
that suffix, and continue to try to find a matching row. Doing this will end
up skipping large portions of all of the indexes in the list. This is the
algorithm known as "zigzag merge join", and you can find talks on it from some
of the appengine folks. It has very good algorithmic running time and tends to
scale with the number of full matches, rather than the size of all of the
indexes involved.
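A simplified sketch of zigzag merge join, using sorted string slices to stand
in for index iterators (the real implementation works over gkvlite iterators
and `[]byte` suffixes); `sort.SearchStrings` is the standard-library binary
search:

    // zigzag emits every suffix present in ALL of the sorted lists. Each
    // round seeks every list to the current candidate; any list that
    // jumps past it supplies a new, larger candidate, skipping the gap.
    func zigzag(indexes [][]string, emit func(string)) {
      if len(indexes) == 0 {
        return
      }
      candidate := ""
      for {
        agreed := true
        for _, idx := range indexes {
          i := sort.SearchStrings(idx, candidate) // first entry >= candidate
          if i == len(idx) {
            return // this index is exhausted: no more hits possible
          }
          if idx[i] != candidate {
            candidate = idx[i] // overshot: restart the round from here
            agreed = false
            break
          }
        }
        if agreed {
          emit(candidate)      // every index contains this suffix: a hit
          candidate += "\x00"  // smallest string strictly after candidate
        }
      }
    }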

A hit occurs when all of the iterators have precisely the same suffix. This
hit suffix is then decoded using the suffix format information. The very last
column of the suffix will always be the datastore key. The suffix is then used
to call back to the user, according to the query type:
* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.