OLD | NEW |
1 In-memory appengine wrapper | 1 Datastore implementation internals |
2 =========================== | 2 ================================== |
3 | 3 |
4 | 4 This document contains internal implementation details for this in-memory |
5 Notes on the internal encodings | 5 version of datastore. It's mostly helpful to understand what's going on in this |
6 ------------------------------- | 6 implementation, but it also can reveal some insight into how the real appengine |
7 | 7 datastore works (though note that the specific encodings are different). |
8 All datatypes inside of the index Collections of the gkvlite Store are stored | 8 |
9 in a manner which allows them to be compared entirely via bytes.Compare. All | 9 Additionally note that this implementation cheats by moving some of the Key |
10 types are prefixed by a sortable type byte which encodes the sort-order of types | 10 bytes into the table (collection) names (like the namespace, property name for |
11 according to the appengine SDK. Additionally, types have the following data | 11 the builtin indexes, etc.). The real implementation contains these bytes in the |
12 encoding: | 12 table row keys, I think. |
13 * ints | 13 |
14 * stored with the `funnybase` varint encoding | 14 |
15 * floats | 15 Internal datastore key/value collection schema |
16 * http://stereopsis.com/radix.html | 16 ---------------------------------------------- |
17 * toBytes: | 17 |
18 ``` | 18 The datastore implementation here uses several different tables ('collections') |
19 b := math.Float64bits(f) | 19 to manage state for the data. The schema for these tables is enumerated below |
20 return b ^ (-(b >> 63) | 0x8000000000000000) | 20 to make the code a bit easier to reason about. |
21 ``` | 21 |
22 * fromBytes: | 22 All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized |
23 ``` | 23 using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the |
24 return math.Float64frombits(b ^ ((b >> 63) - 1) | 0x8000000000000000) | 24 primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings |
25 ``` | 25 are important to understanding why the schemas below sort correctly when |
26 * string, []byte, BlobKey, ByteString | 26 compared only using `bytes.Compare` (aka `memcmp`). This doc will assume that |
27 * funnybase byte count | 27 you're familiar with those encodings, but will point out where we diverge from |
28 * raw bytes | 28 the stock encodings. |
29 * \*Key, GeoPoint | 29 |
30 * composite of above types | 30 All encoded Property values used in gkvlite Keys (i.e. index rows) are |
31 * time.Time | 31 serialized using the settings `serialize.WithoutContext`, and |
32 * composite of above types, stored with microsecond accuracy. | 32 `datastore.ShouldIndex`. |
33 * rounding to microseconds is a limitation of the real appengine. | 33 |
34 * toMicro: `return t.Unix()*1e6 + int64(t.Nanosecond()/1e3)` | 34 ### Primary table |
35 * fromMicro: `return time.Unix(t/1e6, (t%1e6)*1e3)` | 35 |
36 * nil, true, false | 36 The primary table maps datastore keys to entities. |
37 * value is encoded directly in the type byte | 37 - Name: `"ents:" + namespace` |
38 | 38 - Key: serialized datastore.Property containing the entity's datastore.Key |
39 | 39 - Value: serialized datastore.PropertyMap |
40 Gkvlite Collection schema | 40 |
41 ------------------------- | 41 This table also encodes values for the following special keys: |
42 | 42 - Every entity root (e.g. a Key with nil Parent()) with key K has: |
43 In order to provide efficient result deduplication, the value of an index row | 43 - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}` |
44 which indexes 1 or more properties is a concatenation of the previous values | 44 A child entity with the kind `__entity_group__` and an id of `1`. The value |
45 which would show up in the same index. For example, if you have the property | 45 has a single property `__version__`, which contains the version number of |
46 list for the key K: | 46 this entity group. This is used to detect transaction conflicts. |
47 | 47 - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}` |
48 bob: 1 | 48 A child entity with the kind `__entity_group__` and an id of `1`. The value |
49 bob: 4 | 49 has a single property `__version__`, which contains the last automatically |
50 bob: 7 | 50 allocated entity ID for entities within this entity group. |
51 cat: "hello" | 51 - A root entity with the key `Key("__entity_group_ids__",1)` which contains the |
52 cat: "world" | 52 same `__version__` property, and indicates the last automatically allocated |
53 | 53 entity ID for root entities. |
54 And the regular (non-ancestor) composite index was {bob, -cat}, you'd have the | 54 |
55 rows in the index `idx:ns:kind|R|bob|-cat` (| in the row indicates | 55 ### Compound Index table |
56 concatenation, each value has an implied type byte. `...` indicates that other | 56 |
57 rows may be present): | 57 The next table keeps track of all the user-added 'compound' index descriptions |
58 | 58 (not the content for the indexes). There is a row in this table for each |
59 ... | 59 compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`. |
60 1|"world"|K = nil|nil | 60 - Name: `"idx"` |
61 ... | 61 - Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice |
62 1|"hello"|K = nil|"world" | 62 in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`). |
63 ... | 63 - Value: empty |
64 4|"world"|K = 1|nil | 64 |
65 ... | 65 The Key format here requires some special attention. Say you started with |
66 4|"hello"|K = 1|"world" | 66 a compound IndexDefinition of: |
67 ... | 67 IndexDefinition{ |
68 7|"world"|K = 4|nil | 68 Kind: "Foo", |
69 ... | 69 Ancestor: true, |
70 7|"hello"|K = 4|"world" | 70 SortBy: []IndexColumn{ |
71 ... | 71 {Property: "Something", Direction: DESCENDING}, |
72 | 72 {Property: "Else", Direction: ASCENDING}, |
73 This allows us to, start scanning at any point and be able to determine if we've | 73 {Property: "Cool", Direction: ASCENDING}, |
74 returned a given key already (without storing all of the keys in memory | 74 } |
75 for the duration of the Query run). We can do this because we can see if the | 75 } |
76 value of an index row falls within the original query filter parameters. If it | 76 |
77 does, then we must already have returned they Key, and can safely skip the index | 77 After prepping it for the table, it would be equivalent to: |
78 row. AFAIK, real-datastore provides deduplication by keeping all the returned | 78 IndexDefinition{ |
79 keys in memory as it runs the query, and doing a set-check. | 79 Kind: "Foo", |
80 | 80 Ancestor: true, |
81 The end-result is semantically equivalent, with the exception that Query Cursors | 81 SortBy: []IndexColumn{ |
82 on the real datastore will potentially return the same Key in the first Cursor | 82 {Property: "__key__", Direction: ASCENDING}, |
83 use as well as on the 2nd (or Nth) cursor use, where this method will not. | 83 {Property: "Cool", Direction: ASCENDING}, |
84 | 84 {Property: "Else", Direction: ASCENDING}, |
85 collections | 85 {Property: "Something", Direction: DESCENDING}, |
86 ents:ns -> key -> value | 86 } |
87 (rootkind, rootid, __entity_group__,1) -> {__version__: int} | 87 } |
88 (rootkind, rootid, __entity_group_ids__,1) -> {__version__: int} | 88 |
89 (__entity_group_ids__,1) -> {__version__: int} | 89 The reason for doing this will be covered in the `Query Planning` section, but |
90 // TODO(iannucci): Journal every entity write in a log with a globally | 90 it boils down to allowing the query planner to use this encoded table to |
91 // increasing version number (aka "timestamp"). | 91 intelligently scan for potentially useful compound indexes. |
92 // | 92 |
93 // TODO(iannucci): Use the value in idx collection to indicate the last | 93 ### Index Tables |
94 // global log version reflected in this index. Then index updates can happen | 94 |
95 // in parallel, in a truly eventually-consistent fashion (and completely | 95 Every index (both builtin and compound) has one index table per namespace, which |
96 // avoid holding the DB writelock while calculating index entries). | 96 contains as rows every entry in the index, one per row. |
97 // Unfortunately, copying new records (and removing old ones) into the DB | 97 - Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()` |
98 // would still require holding global writelock. | 98 - Key: concatenated datastore.Property values, one per SortBy column in the |
99 // | 99 IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is |
100 // TODO(iannucci): how do we garbage-collect the journal? | 100 DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF). |
101 // | 101 - Value: empty |
102 // TODO(iannucci): add the ability in gkvlite to 'swap' a collection with | 102 |
103 // another one, transactionally? Not sure if this is possible to do easily. | 103 If the IndexDefinition has `Ancestor: true`, then the very first column of the |
104 // If we had this, then we could do all the index writes for a given index | 104 Key contains the partial Key for the entity. So if an entity has the datastore |
105 // on the side, and then do a quick swap into place with a writelock. As | 105 key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`, |
106 // long as every index only has a single goroutine writing it, then this | 106 `/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor column |
107 // would enable maximum concurrency, since all indexes could update in | 107 of indexes that it matches. |
108 // parallel and only synchronize for the duration of a single pointer swap. | 108 |
109 idx -> kind|A?|[-?prop]* = nil | 109 #### Builtin (automatic) indexes |
110 idx:ns:kind -> key = nil | 110 |
111 idx:ns:kind|prop -> propval|key = [prev val] | 111 The following indexes are automatically created for some entity with a key |
112 idx:ns:kind|-prop -> -propval|key = [next val] | 112 `/Kind,*`, for every property (with `ShouldIndex` values) named "Foo": |
113 idx:ns:kind|A|?prop|?prop -> A|propval|propval|key = [prev/next val]|[prev/next val] | 113 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
114 idx:ns:kind|?prop|?prop -> propval|propval|key = [prev/next val]|[prev/next val] | 114 {Property: "__key__", Direction: ASCENDING}, |
| 115 }} |
| 116 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
| 117 {Property: "Foo", Direction: ASCENDING}, |
| 118 {Property: "__key__", Direction: ASCENDING}, |
| 119 }} |
| 120 IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
| 121 {Property: "Foo", Direction: DESCENDING}, |
| 122 {Property: "__key__", Direction: ASCENDING}, |
| 123 }} |
| 124 |
| 125 Index updates |
| 126 ------------- |
| 127 |
| 128 (Note that this is a LARGE departure from how the production appengine datastore |
| 129 does this. This model only works because the implementation is not distributed, |
| 130 and not journaled. The real datastore does index updates in parallel and is |
| 131 generally pretty fancy compared to this). |
| 132 |
| 133 Index updates are pretty straightforward. On a mutation to the primary entity |
| 134 table, we take the old entity value (remember that entity values are |
| 135 PropertyMaps), the new property value, create index entries for both, merge |
| 136 them, and apply the deltas to the affected index tables (i.e. entries that |
| 137 exist in the old entities, but not the new ones, are deleted. Entries that exist |
| 138 in the new entities, but not the old ones, are added). |
| 139 |
| 140 Index generation works (given a slice of indexes []Idxs) by: |
| 141 * serializing all ShouldIndex Properties in the PropertyMap to get a |
| 142 `map[name][]serializedProperty`. |
| 143 * for each index idx |
| 144 * if idx's columns contain properties that are not in the map, skip idx |
| 145 * make a `[][]serializedProperty`, where each serializedProperty slice |
| 146 corresponds with the IndexColumn of idx.SortBy |
| 147 * duplicated values for multi-valued properties are skipped. |
| 148 * generate a `[]byte` row which is the concatenation of one value from each |
| 149 `[]serializedProperty`, permuting through all combinations. If the SortBy |
| 150 column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty |
| 151 value! |
| 152 * add that generated []byte row to the index's corresponding table. |
| 153 |
| 154 Note that we choose to serialize all permutations of the saved entity. This is |
| 155 so that we can use repeated-column indexes to fill queries which use a subset of |
| 156 the columns. E.g. if we have the index `duck,duck,duck,goose`, we can |
| 157 theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by |
| 158 pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index |
| 159 selection at the expense of larger indexes. However, it means that if you have |
| 160 the entity: |
| 161 duck = 1, 2, 3, 4 |
| 162 goose = "færøske" |
| 163 |
| 164 It generates the following index entries: |
| 165 duck=1,duck=1,duck=1,goose="færøske" |
| 166 duck=1,duck=1,duck=2,goose="færøske" |
| 167 duck=1,duck=1,duck=3,goose="færøske" |
| 168 duck=1,duck=1,duck=4,goose="færøske" |
| 169 duck=1,duck=2,duck=1,goose="færøske" |
| 170 duck=1,duck=2,duck=2,goose="færøske" |
| 171 duck=1,duck=2,duck=3,goose="færøske" |
| 172 duck=1,duck=2,duck=4,goose="færøske" |
| 173 duck=1,duck=3,duck=1,goose="færøske" |
| 174 ... a lot ... |
| 175 duck=4,duck=4,duck=4,goose="færøske" |
| 176 |
| 177 This is a very large number of index rows (i.e. an 'exploding index')! |
| 178 |
| 179 An alternate design would be to only generate unique permutations of elements |
| 180 where the index has repeated columns of a single property. This only makes sense |
| 181 because it's illegal to have an equality and an inequality on the same property, |
| 182 under the current constraints of appengine (though not completely ridiculous in |
| 183 general, if inequality constraints meant the same thing as equality constraints; |
| 184 however, it would lead to a multi-dimensional query, which can be quite slow and |
| 185 is very difficult to scale without application knowledge). If we do this, it |
| 186 also means that we need to SORT the equality filter values when generating the |
| 187 prefix (so that the least-valued equality constraint is first). If we did this, |
| 188 then the generated index rows for the above entity would be: |
| 189 duck=1,duck=2,duck=3,goose="færøske" |
| 190 duck=1,duck=2,duck=4,goose="færøske" |
| 191 duck=1,duck=3,duck=4,goose="færøske" |
| 192 duck=2,duck=3,duck=4,goose="færøske" |
| 193 |
| 194 Which would be a LOT more compact. It may be worth implementing this restriction |
| 195 later, simply for the memory savings when indexing multi-valued properties. |
| 196 |
| 197 If this technique is used, there's also room to unambiguously index entities |
| 198 with repeated equivalent values. E.g. if duck=1,1,2,3,4 , then you could see |
| 199 a row in the index like: |
| 200 duck=1,duck=1,duck=2,goose="færøske" |
| 201 |
| 202 Which would allow you to query for "an entity which has duck values equal to 1, |
| 203 1 and 2". Currently such a query is not possible to execute (it would be |
| 204 equivalent to "an entity which has duck values equal to 1 and 2"). |
| 205 |
| 206 Query planning |
| 207 -------------- |
| 208 |
| 209 Now that we have all of our data tabulated, let's plan some queries. The |
| 210 high-level algorithm works like this: |
| 211 * Generate a suffix format from the user's query which looks like: |
| 212 * orders (including the inequality as the first order, if any) |
| 213 * projected fields which aren't explicitly referenced in the orders (we |
| 214 assume ASCENDING order for them), in the order that they were projected. |
| 215 * `__key__` (implied ascending, unless the query's last sort order is for |
| 216 `__key__`, in which case it's whatever order the user specified) |
| 217 * Reverse the order of this suffix format, and serialize it into an |
| 218 IndexDefinition, along with the query's Kind and Ancestor values. This |
| 219 does what PrepForIdxTable did when we added the Index in the first place. |
| 220 * Use this serialized reversed index to find compound indexes which might |
| 221 match by looking up rows in the "idx" table which begin with this serialized |
| 222 reversed index. |
| 223 * Generate every builtin index for the inequality + equality filter |
| 224 properties, and see if they match too. |
| 225 |
| 226 An index is a potential match if its suffix *exactly* matches the suffix format, |
| 227 and it contains *only* sort orders which appear in the query (i.e. an index is |
| 228 disqualified if it contains a column which doesn't appear as an equality or inequality filter). |
| 229 |
| 230 The index search continues until: |
| 231 * We find at least one matching index; AND |
| 232 * The combination of all matching indexes accounts for every equality filter |
| 233 at least once. |
| 234 |
| 235 If we fail to find sufficient indexes to fulfill the query, we generate an index |
| 236 description that *could* be sufficient by concatenating all missing equality |
| 237 filters, in ascending order, followed by concatenating the suffix format that we |
| 238 generated for this query. We then suggest this new index to the user for them to |
| 239 add by returning an error containing the generated IndexDefinition. Note that |
| 240 this index is not REQUIRED for the user to add; they could choose to add bits |
| 241 and pieces of it, extend existing indexes in order to cover the missing columns, |
| 242 invert the direction of some of the equality columns, etc. |
| 243 |
| 244 Recall that equality filters are expressed as |
| 245 `map[propName][]serializedProperty`. We'll refer to this mapping as the |
| 246 'constraint' mapping below. |
| 247 |
| 248 To actually come up with the final index selection, we sort all the matching |
| 249 indexes from greatest number of columns to least. We add the 0th index (the one |
| 250 with the greatest number of columns) unconditionally. We then keep adding indexes |
| 251 which contain one or more of the remaining constraints, until we have no |
| 252 more constraints to satisfy. |
| 253 |
| 254 Adding an index entails determining which columns in that index correspond to |
| 255 equality columns, and which ones correspond to inequality/order/projection |
| 256 columns. Recall that the inequality/order/projection columns are all the same |
| 257 for all of the potential indices (i.e. they all share the same *suffix format*). |
| 258 We can use this to just iterate over the index's SortBy columns which we'll use |
| 259 for equality filters. For each equality column, we remove a corresponding value |
| 260 from the constraints map. In the event that we _run out_ of constraints for a |
| 261 given column, we simply _pick an arbitrary value_ from the original equality |
| 262 filter mapping and use that. This is valid to do because, after all, they're |
| 263 equality filters. |
| 264 |
| 265 Note that for compound indexes, the ancestor key counts as an equality filter, |
| 266 and if the compound index has `Ancestor: true`, then we implicitly put the |
| 267 ancestor as if it were the first SortBy column. For satisfying Ancestor queries |
| 268 with built-in indexes, see the next section. |
| 269 |
| 270 Once we've got our list of constraints for this index, we concatenate them all |
| 271 together to get the *prefix* for this index. When iterating over this index, we |
| 272 only ever want to see index rows whose prefix exactly matches this. Unlike the |
| 273 suffix format, the prefix is per-index (remember that ALL indexes in the |
| 274 query must have the same suffix format). |
| 275 |
| 276 Finally, we set the 'start' and 'end' values for all chosen indexes to either |
| 277 the Start and End cursors, or the Greater-Than and Less-Than values for the |
| 278 inequality. The Cursors contain values for every suffix column, and the |
| 279 inequality only contains a value for the first suffix column. If both cursors |
| 280 and an inequality are specified, we take the smaller set of both (the |
| 281 combination which will return the fewest rows). |
| 282 |
| 283 That's about it for index selection! See Query Execution for how we actually use |
| 284 the selected indexes to run a query. |
| 285 |
| 286 ### Ancestor queries and Built-in indexes |
| 287 |
| 288 You may have noticed that the built-in indexes can be used for Ancestor queries |
| 289 with equality filters, but they don't start with the magic Ancestor column! |
| 290 |
| 291 There's a trick that you can do if the suffix format for the query is just |
| 292 `__key__` though (e.g. the query only contains equality filters, and/or an |
| 293 inequality filter on `__key__`). You can serialize the datastore key that you're |
| 294 planning to use for the Ancestor query, then chop off the terminating null byte |
| 295 from the encoding, and then use this as additional prefix bytes for this index. |
| 296 So if the builtin for the "Val" property has the column format of: |
| 297 |
| 298 {Property: "Val"}, {Property: "__key__"} |
| 299 |
| 300 And your query holds Val as an equality filter, you can serialize the |
| 301 ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had |
| 302 an index row: |
| 303 |
| 304 PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP |
| 305 |
| 306 (where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for |
| 307 the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as: |
| 308 |
| 309 PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 |
| 310 |
| 311 Omitting the STOP which normally terminates the Key encoding. Using this prefix |
| 312 will only return index rows which are `/Kind,1` or its children. |
| 313 |
| 314 "That's cool! Why not use this trick for compound indexes?", I hear you ask :) |
| 315 Remember that this trick only works if the prefix before the `__key__` is |
| 316 *entirely* composed of equality filters. Also recall that if you ONLY use |
| 317 equality filters and Ancestor (and possibly an inequality on `__key__`), then |
| 318 you can always satisfy the query from the built-in indexes! While you |
| 319 technically could do it with a compound index, there's not really a point to |
| 320 doing so. To remain faithful to the production datastore implementation, we |
| 321 don't implement this trick for anything other than the built-in indexes. |
| 322 |
| 323 ### Cursor format |
| 324 |
| 325 Cursors work by containing values for each of the columns in the suffix, in the |
| 326 order and Direction specified by the suffix. In fact, cursors are just encoded |
| 327 versions of the []IndexColumn used for the 'suffix format', followed by the |
| 328 raw bytes of the suffix for that particular row (incremented by 1 bit). |
| 329 |
| 330 This means that technically you can port cursors between any queries which share |
| 331 precisely the same suffix format, regardless of other query options, even if the |
| 332 index planner ends up choosing different indexes to use from the first query to |
| 333 the second. No state is maintained in the service implementation for cursors. |
| 334 |
| 335 I suspect that this is a more liberal version of cursors than how the production |
| 336 appengine implements them, but I haven't verified one way or the other. |
| 337 |
| 338 Query execution |
| 339 --------------- |
| 340 |
| 341 Last but not least, we need to actually execute the query. After figuring out |
| 342 which indexes to use with what prefixes and start/end values, we essentially |
| 343 have a list of index subsets, all sorted the same way. To pull the values out, |
| 344 we start by iterating the first index in the list, grabbing its suffix value, |
| 345 and trying to iterate from that suffix in the second, third, fourth, etc index. |
| 346 |
| 347 If any index iterates past that suffix, we start back at the 0th index with that |
| 348 suffix, and continue to try to find a matching row. Doing this will end up |
| 349 skipping large portions of all of the indexes in the list. This is the algorithm |
| 350 known as "zigzag merge join", and you can find talks on it from some of the |
| 351 appengine folks. It has very good algorithmic running time and tends to scale |
| 352 with the number of full matches, rather than the size of all of the indexes |
| 353 involved. |
| 354 |
| 355 A hit occurs when all of the iterators have precisely the same suffix. This hit |
| 356 suffix is then decoded using the suffix format information. The very last column |
| 357 of the suffix will always be the datastore key. The suffix is then used to call |
| 358 back to the user, according to the query type: |
| 359 * keys-only queries just directly return the Key |
| 360 * projection queries return the projected fields from the decoded suffix. |
| 361 Remember how we added all the projections after the orders? This is why. The |
| 362 projected values are pulled directly from the index, instead of going to the |
| 363 main entity table. |
| 364 * normal queries pull the decoded Key from the "ents" table, and return that |
| 365 entity to the user. |