Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful to understand what's going on in this
implementation, but it also can reveal some insight into how the real appengine
datastore works (though note that the specific encodings are different).

Additionally note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
are important to understanding why the schemas below sort correctly when
compared only using bytes.Compare (aka `memcmp`). This doc will assume that
you're familiar with those encodings, but will point out where we diverge from
the stock encodings.

All encoded Property values used in Keys are serialized `WithoutContext`, and
`ShouldIndex`.
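
As a concrete illustration of what "sorts correctly under bytes.Compare" means,
here's a sketch of the byte-sortable float64 transform that an earlier revision
of this doc spelled out (via http://stereopsis.com/radix.html). The actual
serialize/cmpbin wire format differs in detail; this only shows the general
idea:

    import (
        "encoding/binary"
        "math"
    )

    // sortableFloatBytes returns 8 bytes whose bytes.Compare order matches
    // the numeric order of f, including negative values.
    func sortableFloatBytes(f float64) []byte {
        b := math.Float64bits(f)
        b ^= -(b >> 63) | 0x8000000000000000 // flip sign bit; flip everything for negatives
        out := make([]byte, 8)
        binary.BigEndian.PutUint64(out, b)
        return out
    }

    // fromSortableFloatBytes reverses sortableFloatBytes.
    func fromSortableFloatBytes(p []byte) float64 {
        b := binary.BigEndian.Uint64(p)
        return math.Float64frombits(b ^ ((b>>63)-1 | 0x8000000000000000))
    }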

### Primary table

The primary table maps datastore keys to entities.
- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:
- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
  same `__version__` property, and indicates the last automatically allocated
  entity ID for root entities.
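
The `__version__` properties above are what transaction conflict detection
hangs off of. A minimal sketch of one way that check can work (the names
`txnState`, `groupVersion`, and `commitOK` are invented for illustration; the
real logic in this package is organized differently):

    // txnState records, for each entity-group root touched by a transaction,
    // the __version__ observed when the group was first read.
    type txnState struct {
        observed map[string]int64 // encoded root key -> observed __version__
    }

    // commitOK re-reads each group's __version__ (the PTInt stored under
    // Key("__entity_group__", 1, root)) at commit time; any mismatch means a
    // conflicting write landed in the meantime and the transaction must retry.
    func commitOK(t *txnState, groupVersion func(encodedRoot string) int64) bool {
        for root, seen := range t.observed {
            if groupVersion(root) != seen {
                return false
            }
        }
        return true
    }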

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.

### Index Tables

Every index (both builtin and compound) has one index table per namespace, which
contains as rows every entry in the index, one per row.
- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as the values in the ancestor column
of indexes that it matches.
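
In other words, an `Ancestor: true` index gets one row per ancestor prefix of
the entity's key. A toy sketch of that expansion (the `tok` type is invented;
the real code works on serialized Key tokens):

    // tok is a stand-in for one (kind, id) element of a datastore key path.
    type tok struct {
        Kind string
        ID   int64
    }

    // ancestorValues(/Some,1/Thing,2/Else,3) yields /Some,1 then
    // /Some,1/Thing,2 then /Some,1/Thing,2/Else,3: one ancestor-column value
    // (and therefore one index row) per prefix of the key path.
    func ancestorValues(path []tok) [][]tok {
        out := make([][]tok, 0, len(path))
        for i := 1; i <= len(path); i++ {
            out = append(out, path[:i])
        }
        return out
    }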

#### Builtin (automatic) indexes

The following indexes are automatically created for some entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine datastore
does this. This model only works because the implementation is not distributed,
and not journaled. The real datastore does index updates in parallel and is
generally pretty fancy compared to this).

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps), the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist in the old entity but not the new are deleted; entries that exist in the
new entity but not the old are added).
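
A minimal sketch of that delta step, treating each side's generated index
entries as a set of encoded rows (the names here are illustrative only):

    // applyIndexDelta deletes rows that only the old entity produced and adds
    // rows that only the new entity produces; rows present in both sets are
    // left alone.
    func applyIndexDelta(oldRows, newRows map[string]bool, del, add func(row string)) {
        for row := range oldRows {
            if !newRows[row] {
                del(row)
            }
        }
        for row := range newRows {
            if !oldRows[row] {
                add(row)
            }
        }
    }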

Index generation works (given a slice of indexes []Idxs) by:
* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations. If the SortBy
    column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty
    value!
  * add that generated []byte row to the index's corresponding table (see the
    sketch below).
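
Here's a sketch of that row-generation step. The types are simplified (a column
value is just its serialized bytes) and the helpers are invented, but it shows
the permutation and the DESCENDING inversion:

    // invert flips every byte so that bytes.Compare orders the value in
    // reverse; this is how DESCENDING columns are stored.
    func invert(b []byte) []byte {
        out := make([]byte, len(b))
        for i, c := range b {
            out[i] = c ^ 0xFF
        }
        return out
    }

    // genRows emits one row per combination of column values. cols[i] holds
    // the serialized values for SortBy column i; desc[i] marks DESCENDING
    // columns.
    func genRows(cols [][][]byte, desc []bool, emit func(row []byte)) {
        var rec func(i int, row []byte)
        rec = func(i int, row []byte) {
            if i == len(cols) {
                emit(row)
                return
            }
            for _, v := range cols[i] {
                if desc[i] {
                    v = invert(v)
                }
                rec(i+1, append(append([]byte(nil), row...), v...))
            }
        }
        rec(0, nil)
    }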

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:
* Generate a suffix format from the user's query which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this serialized
  reversed index (sketched in code below).
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
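
The compound-index lookup in the second-to-last step above is just a prefix
scan over the "idx" table. A sketch, where `scanAscending` stands in for
whatever in-key-order iteration the underlying store provides:

    import "bytes"

    // findCandidateIdxKeys returns the encoded IndexDefinition keys in the
    // "idx" table which begin with the serialized, reversed suffix format.
    func findCandidateIdxKeys(prefix []byte,
        scanAscending func(start []byte, cb func(key []byte) bool)) [][]byte {

        var ret [][]byte
        scanAscending(prefix, func(key []byte) bool {
            if !bytes.HasPrefix(key, prefix) {
                return false // walked past the matching range; stop scanning
            }
            ret = append(ret, append([]byte(nil), key...))
            return true
        })
        return ret
    }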

An index is a potential match if its suffix *exactly* matches the suffix format,
and it contains *only* sort orders which appear in the query (i.e. the index
doesn't contain any column which fails to appear in the query as an equality or
inequality filter).

The index search continues until:
* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate the
'missing' index by concatenating all missing equality filters (in ascending
order, but it doesn't matter if the user wants to make them descending), followed
by concatenating the suffix format that we generated for this query.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the one
with the greatest number of columns) unconditionally. We then keep adding indexes
which contain one or more of the remaining constraints, until we have no
more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indices (i.e. they all share the same *suffix format*).
We can use this to iterate over just those SortBy columns of the index which
we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_ of
constraints for a given column, we simply _pick an arbitrary value_ from the
original equality filter mapping and use that. This is valid to do because,
after all, they're equality filters.
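
A sketch of that consumption rule (`valueFor` is an invented helper; the two
maps are the 'constraint' mapping and the original, unconsumed equality filter
mapping described above):

    // valueFor pops the next serialized value for an equality column, falling
    // back to an arbitrary value from the original equality filter mapping
    // once that column's constraints run out. The fallback is safe precisely
    // because these are all equality filters.
    func valueFor(col string, constraints, original map[string][][]byte) []byte {
        if vals := constraints[col]; len(vals) > 0 {
            constraints[col] = vals[1:]
            return vals[0]
        }
        return original[col][0]
    }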

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor queries
with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index, we
only ever want to see index rows whose prefix exactly matches this. Unlike the
suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually use
the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor queries
with equality filters, but they don't start with the magic Ancestor column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that you're
planning to use for the Ancestor query, then chop off the terminating null byte
from the encoding, and then use this as additional prefix bytes for this index.
So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this prefix
will only return index rows which are `/Kind,1` or its children.
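
A sketch of that prefix construction (`ancestorScanPrefix` is an invented
helper; `serializedAncestor` stands for the ancestor key's serialized,
WithoutContext form):

    // ancestorScanPrefix appends the ancestor key's encoding, minus its
    // trailing STOP (0x00) byte, to the equality-filter prefix. Dropping the
    // STOP is what lets the prefix match the ancestor itself *and* all of its
    // children.
    func ancestorScanPrefix(eqPrefix, serializedAncestor []byte) []byte {
        enc := serializedAncestor[:len(serializedAncestor)-1] // chop STOP
        out := append([]byte(nil), eqPrefix...)
        return append(out, enc...)
    }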

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indices.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in the
order and Direction specified by the suffix. In fact, cursors are just encoded
versions of the []IndexColumn used for the 'suffix format', followed by the
raw bytes of the suffix for that particular row (incremented by 1 bit).
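
One way to read "incremented by 1 bit": treat the row's suffix bytes as a
single big-endian number and add one, so that resuming from the cursor starts
strictly after the row which produced it. A sketch of that interpretation
(overflow of an all-0xFF suffix is ignored here):

    // increment returns a copy of suffix, treated as a big-endian integer,
    // plus one.
    func increment(suffix []byte) []byte {
        out := append([]byte(nil), suffix...)
        for i := len(out) - 1; i >= 0; i-- {
            out[i]++
            if out[i] != 0 {
                break // no carry to propagate further
            }
        }
        return out
    }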

This means that technically you can port cursors between any queries which share
precisely the same suffix format, regardless of other query options, even if the
index planner ends up choosing different indexes to use from the first query to
the second. No state is maintained in the service implementation for cursors.

I suspect that this is a more liberal version of cursors than how the production
appengine implements them, but I haven't verified one way or the other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc. index.

If any index iterates past that suffix, we start back at the 0th index with that
suffix, and continue to try to find a matching row. Doing this will end up
skipping large portions of all of the indexes in the list. This is the algorithm
known as "zigzag merge join", and you can find talks on it from some of the
appengine folks. It has very good algorithmic running time and tends to scale
with the number of full matches, rather than the size of all of the indexes
involved.
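
A compact sketch of that zigzag loop over `n` sorted index iterators.
`seek(i, s, incl)` is an invented helper: position iterator `i` at the first
suffix >= `s` (or > `s` when `incl` is false) and return it, or nil once that
index is exhausted:

    import "bytes"

    // zigzag calls emit for every suffix on which all n iterators agree.
    func zigzag(n int, seek func(i int, s []byte, incl bool) []byte, emit func(s []byte)) {
        target := seek(0, nil, true)
        for target != nil {
            agreed := true
            for i := 1; i < n; i++ {
                got := seek(i, target, true)
                if got == nil {
                    return // this index ran out of rows; no further hits possible
                }
                if !bytes.Equal(got, target) {
                    // Overshot: restart the scan from the larger suffix.
                    target, agreed = seek(0, got, true), false
                    break
                }
            }
            if agreed {
                emit(target)
                target = seek(0, target, false) // advance strictly past the hit
            }
        }
    }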

A hit occurs when all of the iterators have precisely the same suffix. This hit
suffix is then decoded using the suffix format information. The very last column
of the suffix will always be the datastore key. The suffix is then used to call
back to the user, according to the query type:
* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.