Chromium Code Reviews

Side by Side Diff: impl/memory/README.md

Issue 1302813003: impl/memory: Implement Queries (Closed) Base URL: https://github.com/luci/gae.git@add_multi_iterator
Patch Set: minor fixes Created 5 years, 3 months ago
Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful for understanding what's going on in
this implementation, but it can also reveal some insight into how the real
appengine datastore works (though note that the specific encodings are
different).

Additionally, note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, the property name
for the builtin indexes, etc.). The real implementation contains these bytes in
the table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
are important for understanding why the schemas below sort correctly when
compared only using bytes.Compare (aka `memcmp`). This doc assumes that you're
familiar with those encodings, but will point out where we diverge from the
stock encodings.

dnj 2015/08/28 16:37:54 "Compare (aka `memcmp`)"?
iannucci 2015/08/28 19:48:54 ? it's `bytes.Compare`

All encoded Property values used in Keys are serialized `WithoutContext` and
`ShouldIndex`.

dnj 2015/08/28 16:37:54 Wording.
iannucci 2015/08/28 19:48:54 Done I think?

### Primary table

The primary table maps datastore keys to entities.
- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:
- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__", 1)` which contains
  the same `__version__` property, and indicates the last automatically
  allocated entity ID for root entities.
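
As an illustration (using the `/Kind,id` key notation that appears later in
this doc, and with `...` standing in for the actual version numbers), writing
an entity at `/Foo,12/Bar,3` in namespace "ns" implies rows in `"ents:ns"`
shaped roughly like:

    /Foo,12/Bar,3                   -> <serialized PropertyMap>
    /Foo,12/__entity_group__,1      -> {"__version__": ...}
    /Foo,12/__entity_group_ids__,1  -> {"__version__": ...}
    /__entity_group_ids__,1         -> {"__version__": ...}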

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.

### Index Tables

dnj (Google) 2015/08/28 17:54:21 Comment about repeated tags and padding. e.g., "t
iannucci 2015/08/28 19:48:54 Done. I think we should actually implement that te

Every index (both builtin and compound) has one index table per namespace,
which contains one row per index entry.
- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (i.e. XOR 0xFF; see the
  sketch below).
- Value: empty
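
A minimal sketch of that inversion; the helper name `invert` is illustrative,
not necessarily the name used in the code:

    // invert returns a copy of a serialized Property with every byte
    // XOR'd with 0xFF. bytes.Compare then orders inverted encodings in
    // exactly the reverse of their natural (ASCENDING) order.
    func invert(propBytes []byte) []byte {
        out := make([]byte, len(propBytes))
        for i, b := range propBytes {
            out[i] = b ^ 0xFF
        }
        return out
    }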

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor
column of the indexes that it matches.

dnj 2015/08/28 16:37:54 Explain why? Also the key notation, I'm assuming
iannucci 2015/08/28 19:48:54 "/Some,1" is a key with no parent. It's the root o

#### Builtin (automatic) indexes

The following indexes are automatically created for some entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
Index updates
-------------

(Note that this is a LARGE departure from how the production appengine
datastore does this. This model only works because the implementation is not
distributed, and not journaled. The real datastore does index updates in
parallel and is generally pretty fancy compared to this.)

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps) and the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables: entries that exist in
the old but not the new entity are deleted, and entries that exist in the new
but not the old entity are added.

dnj 2015/08/28 16:37:54 mutation**
iannucci 2015/08/28 19:48:54 done
dnj 2015/08/28 16:37:54 Kind of awkward wording. Maybe: "...but not new ar
iannucci 2015/08/28 19:48:54 Better?
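
A sketch of that delta application, assuming the generated rows are tracked as
sets keyed by the raw row bytes (as strings), with a hypothetical `table`
standing in for a gkvlite collection (not this package's actual types):

    // table is a stand-in for a gkvlite collection.
    type table interface {
        Set(key, val []byte)
        Delete(key []byte)
    }

    // applyDelta removes index rows that only the old entity produced and
    // adds index rows that only the new entity produces.
    func applyDelta(t table, oldRows, newRows map[string]bool) {
        for row := range oldRows {
            if !newRows[row] {
                t.Delete([]byte(row)) // in old, not new: delete
            }
        }
        for row := range newRows {
            if !oldRows[row] {
                t.Set([]byte(row), nil) // in new, not old: add (empty value)
            }
        }
    }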

Index generation works (given a slice of indexes `[]Idxs`) by:
* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations (see the sketch
    below). If the SortBy column is DESCENDING, make sure to invert (XOR 0xFF)
    the serializedProperty value!
  * add that generated []byte row to the index's corresponding table.
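
Here's a compact sketch of that permutation step, assuming serialized
properties are plain byte slices and the column/property matching has already
been done. `indexRows` and `serializedProperty` are illustrative names, not
this package's API:

    type serializedProperty []byte

    // indexRows produces one []byte row per combination of column values,
    // concatenating one serialized value from each column, in column order.
    // descending[i] reports whether SortBy column i is DESCENDING.
    func indexRows(cols [][]serializedProperty, descending []bool) [][]byte {
        rows := [][]byte{nil} // one empty row seeds the cross product
        for i, vals := range cols {
            next := make([][]byte, 0, len(rows)*len(vals))
            for _, row := range rows {
                for _, v := range vals {
                    b := []byte(v)
                    if descending[i] {
                        b = invert(b) // XOR 0xFF, as sketched above
                    }
                    next = append(next, append(append([]byte(nil), row...), b...))
                }
            }
            rows = next
        }
        return rows
    }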

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:
* Generate a suffix format from the user's query which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this
  serialized reversed index.
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
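
For example (using a pseudo-query syntax for illustration), a query like
`Query("Kind").Filter("A =", 1).Filter("B >", 7).Project("C")` would produce
the suffix format:

    {Property: "B", Direction: ASCENDING},       // the inequality, as first order
    {Property: "C", Direction: ASCENDING},       // projected, not in the orders
    {Property: "__key__", Direction: ASCENDING}  // implied ascending

Note that the equality filter on "A" does not appear in the suffix; equality
filters are handled by index *prefixes*, described below.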

An index is a potential match if its suffix *exactly* matches the suffix
format, and it contains *only* sort orders which appear in the query (an index
is rejected if, for example, it contains a column which doesn't appear as an
equality or inequality filter in the query).

The index search continues until:
* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate the
'missing' index by concatenating all missing equality filters (in ascending
order, though it doesn't matter if the user wants to make them descending),
followed by the suffix format that we generated for this query.

dnj 2015/08/28 16:37:54 fulfill
iannucci 2015/08/28 19:48:54 done
dnj 2015/08/28 16:37:54 Not clear, maybe "a suggested index description wh
iannucci 2015/08/28 19:48:54 How about this?
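
Continuing the example above (again, purely illustrative): if no combination of
matching indexes covers the equality filter on "A", the suggested missing index
would be:

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "A", Direction: ASCENDING},        // the missing equality filter
      {Property: "B", Direction: ASCENDING},        // the suffix format...
      {Property: "C", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}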

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.
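
For instance, a query with equality filters roughly like `A == 1`, `A == 2`,
and `B == "x"` yields the constraint mapping (where `ser()` is shorthand for
the cmpbin-based serialization):

    {
      "A": [ser(1), ser(2)],
      "B": [ser("x")],
    }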

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the
one with the greatest number of columns) unconditionally. We then keep adding
indexes which contain one or more of the remaining constraints, until we have
no more constraints to satisfy.
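
A greedy-selection sketch, with hypothetical shapes: `eqProps[i]` lists the
equality-filter properties index i consumes, `remaining` counts unsatisfied
values per property, and `matching` is assumed pre-sorted widest-first:

    // selectIndexes returns the positions of the chosen indexes.
    func selectIndexes(eqProps [][]string, remaining map[string]int) []int {
        take := func(i int) {
            for _, p := range eqProps[i] {
                if remaining[p] > 0 {
                    remaining[p]--
                }
                if remaining[p] == 0 {
                    delete(remaining, p)
                }
            }
        }
        chosen := []int{0} // the widest index is added unconditionally
        take(0)
        for i := 1; i < len(eqProps) && len(remaining) > 0; i++ {
            for _, p := range eqProps[i] {
                if remaining[p] > 0 { // covers >= 1 remaining constraint
                    chosen = append(chosen, i)
                    take(i)
                    break
                }
            }
        }
        return chosen
    }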

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indexes (i.e. they all share the same *suffix
format*). This means we only need to iterate over the index's SortBy columns
which we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_
of constraints for a given column, we simply _pick an arbitrary value_ from
the original equality filter mapping and use that. This is valid to do
because, after all, they're equality filters.

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor
queries with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index,
we only ever want to see index rows whose prefix exactly matches this. Unlike
the suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually
use the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor
queries with equality filters, but they don't start with the magic Ancestor
column!

There's a trick that you can do if the suffix format for the query is just
`__key__`, though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that
you're planning to use for the Ancestor query, chop off the terminating null
byte from the encoding, and then use this as additional prefix bytes for this
index. So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this
prefix will only return index rows which are `/Kind,1` or its children.

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indexes.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in
the order and Direction specified by the suffix. In fact, cursors are just
encoded versions of the []IndexColumn used for the 'suffix format', followed
by the raw bytes of the suffix for that particular row (incremented by 1 bit).

This means that technically you can port cursors between any queries which
share precisely the same suffix format, regardless of other query options,
even if the index planner ends up choosing different indexes to use from the
first query to the second. No state is maintained in the service
implementation for cursors.

I suspect that this is a more liberal version of cursors than how the
production appengine implements them, but I haven't verified one way or the
other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc.
index.

If any index iterates past that suffix, we start back at the 0th index with
that new suffix, and continue trying to find a matching row. Doing this ends
up skipping large portions of all of the indexes in the list. This is the
algorithm known as "zigzag merge join", and you can find talks on it from some
of the appengine folks. It has very good algorithmic running time and tends to
scale with the number of full matches, rather than the size of all of the
indexes involved.
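
A minimal sketch of that zigzag loop, assuming each index subset can seek to
the first row whose suffix is at-or-after a target; the `iter` interface and
`zigzag` function are illustrative, not this package's actual API:

    import "bytes"

    // iter is a hypothetical view over one index subset, already
    // restricted to rows matching that index's prefix.
    type iter interface {
        Seek(target []byte) bool // move to first suffix >= target; false if exhausted
        Current() []byte         // suffix at the current position
    }

    // zigzag returns every suffix present in ALL of the iterators.
    func zigzag(iters []iter) [][]byte {
        var hits [][]byte
        if len(iters) == 0 || !iters[0].Seek(nil) {
            return hits
        }
        target := iters[0].Current()
        for {
            matched := true
            for _, it := range iters {
                if !it.Seek(target) {
                    return hits // one index is exhausted: no further hits
                }
                if cur := it.Current(); !bytes.Equal(cur, target) {
                    target = cur // overshot: restart the round at this suffix
                    matched = false
                    break
                }
            }
            if matched {
                hits = append(hits, target)
                // smallest suffix strictly greater than the hit
                target = append(append([]byte(nil), target...), 0x00)
            }
        }
    }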

A hit occurs when all of the iterators have precisely the same suffix. This
hit suffix is then decoded using the suffix format information. The very last
column of the suffix will always be the datastore key. The suffix is then
used to call back to the user, according to the query type:
* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why.
  The projected values are pulled directly from the index, instead of going
  to the main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.