Datastore implementation internals
==================================

This document contains internal implementation details for this in-memory
version of datastore. It's mostly helpful to understand what's going on in this
implementation, but it also can reveal some insight into how the real appengine
datastore works (though note that the specific encodings are different).

Additionally note that this implementation cheats by moving some of the Key
bytes into the table (collection) names (like the namespace, property name for
the builtin indexes, etc.). The real implementation contains these bytes in the
table row keys, I think.


Internal datastore key/value collection schema
----------------------------------------------

The datastore implementation here uses several different tables ('collections')
to manage state for the data. The schema for these tables is enumerated below
to make the code a bit easier to reason about.

All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
are important to understanding why the schemas below sort correctly when
compared only using bytes.Compare (aka `memcmp`). This doc will assume that
you're familiar with those encodings, but will point out where we diverge from
the stock encodings.

All encoded Property values used in Keys are serialized `WithoutContext`, and
`ShouldIndex`.
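
As a concrete illustration of what "sorts correctly under bytes.Compare" means,
here's a sketch of the byte-sortable float64 transform that an earlier revision
of this doc spelled out (via http://stereopsis.com/radix.html). The actual
serialize/cmpbin wire format differs in detail; this only shows the general
idea:

    import (
        "encoding/binary"
        "math"
    )

    // sortableFloatBytes returns 8 bytes whose bytes.Compare order matches
    // the numeric order of f, including negative values.
    func sortableFloatBytes(f float64) []byte {
        b := math.Float64bits(f)
        b ^= -(b >> 63) | 0x8000000000000000 // flip sign bit; flip everything for negatives
        out := make([]byte, 8)
        binary.BigEndian.PutUint64(out, b)
        return out
    }

    // fromSortableFloatBytes reverses sortableFloatBytes.
    func fromSortableFloatBytes(p []byte) float64 {
        b := binary.BigEndian.Uint64(p)
        return math.Float64frombits(b ^ ((b>>63)-1 | 0x8000000000000000))
    }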

### Primary table

The primary table maps datastore keys to entities.
- Name: `"ents:" + namespace`
- Key: serialized datastore.Property containing the entity's datastore.Key
- Value: serialized datastore.PropertyMap

This table also encodes values for the following special keys:
- Every entity root (e.g. a Key with nil Parent()) with key K has:
  - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group__` and an id of `1`. The value
    has a single property `__version__`, which contains the version number of
    this entity group. This is used to detect transaction conflicts.
  - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
    A child entity with the kind `__entity_group_ids__` and an id of `1`. The
    value has a single property `__version__`, which contains the last
    automatically allocated entity ID for entities within this entity group.
- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
  same `__version__` property, and indicates the last automatically allocated
  entity ID for root entities.
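
The `__version__` properties above are what transaction conflict detection
hangs off of. A minimal sketch of one way that check can work (the names
`txnState`, `groupVersion`, and `commitOK` are invented for illustration; the
real logic in this package is organized differently):

    // txnState records, for each entity-group root touched by a transaction,
    // the __version__ observed when the group was first read.
    type txnState struct {
        observed map[string]int64 // encoded root key -> observed __version__
    }

    // commitOK re-reads each group's __version__ (the PTInt stored under
    // Key("__entity_group__", 1, root)) at commit time; any mismatch means a
    // conflicting write landed in the meantime and the transaction must retry.
    func commitOK(t *txnState, groupVersion func(encodedRoot string) int64) bool {
        for root, seen := range t.observed {
            if groupVersion(root) != seen {
                return false
            }
        }
        return true
    }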

### Compound Index table

The next table keeps track of all the user-added 'compound' index descriptions
(not the content for the indexes). There is a row in this table for each
compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
- Name: `"idx"`
- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
  in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
- Value: empty

The Key format here requires some special attention. Say you started with
a compound IndexDefinition of:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "Something", Direction: DESCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
      }
    }

After prepping it for the table, it would be equivalent to:

    IndexDefinition{
      Kind: "Foo",
      Ancestor: true,
      SortBy: []IndexColumn{
        {Property: "__key__", Direction: ASCENDING},
        {Property: "Cool", Direction: ASCENDING},
        {Property: "Else", Direction: ASCENDING},
        {Property: "Something", Direction: DESCENDING},
      }
    }

The reason for doing this will be covered in the `Query Planning` section, but
it boils down to allowing the query planner to use this encoded table to
intelligently scan for potentially useful compound indexes.

### Index Tables

Every index (both builtin and compound) has one index table per namespace, which
contains as rows every entry in the index, one per row.
- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
- Key: concatenated datastore.Property values, one per SortBy column in the
  IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
  DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
- Value: empty

If the IndexDefinition has `Ancestor: true`, then the very first column of the
Key contains the partial Key for the entity. So if an entity has the datastore
key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as the values in the ancestor column
of indexes that it matches.
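
In other words, an `Ancestor: true` index gets one row per ancestor prefix of
the entity's key. A toy sketch of that expansion (the `tok` type is invented;
the real code works on serialized Key tokens):

    // tok is a stand-in for one (kind, id) element of a datastore key path.
    type tok struct {
        Kind string
        ID   int64
    }

    // ancestorValues(/Some,1/Thing,2/Else,3) yields /Some,1 then
    // /Some,1/Thing,2 then /Some,1/Thing,2/Else,3: one ancestor-column value
    // (and therefore one index row) per prefix of the key path.
    func ancestorValues(path []tok) [][]tok {
        out := make([][]tok, 0, len(path))
        for i := 1; i <= len(path); i++ {
            out = append(out, path[:i])
        }
        return out
    }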

#### Builtin (automatic) indexes

The following indexes are automatically created for some entity with a key
`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":

    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: ASCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}
    IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
      {Property: "Foo", Direction: DESCENDING},
      {Property: "__key__", Direction: ASCENDING},
    }}

Index updates
-------------

(Note that this is a LARGE departure from how the production appengine datastore
does this. This model only works because the implementation is not distributed,
and not journaled. The real datastore does index updates in parallel and is
generally pretty fancy compared to this).

Index updates are pretty straightforward. On a mutation to the primary entity
table, we take the old entity value (remember that entity values are
PropertyMaps), the new entity value, create index entries for both, merge
them, and apply the deltas to the affected index tables (i.e. entries that
exist in the old entity but not the new are deleted; entries that exist in the
new entity but not the old are added).
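
A minimal sketch of that delta step, treating each side's generated index
entries as a set of encoded rows (the names here are illustrative only):

    // applyIndexDelta deletes rows that only the old entity produced and adds
    // rows that only the new entity produces; rows present in both sets are
    // left alone.
    func applyIndexDelta(oldRows, newRows map[string]bool, del, add func(row string)) {
        for row := range oldRows {
            if !newRows[row] {
                del(row)
            }
        }
        for row := range newRows {
            if !oldRows[row] {
                add(row)
            }
        }
    }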

Index generation works (given a slice of indexes []Idxs) by:
* serializing all ShouldIndex Properties in the PropertyMap to get a
  `map[name][]serializedProperty`.
* for each index idx
  * if idx's columns contain properties that are not in the map, skip idx
  * make a `[][]serializedProperty`, where each serializedProperty slice
    corresponds with the IndexColumn of idx.SortBy
  * generate a `[]byte` row which is the concatenation of one value from each
    `[]serializedProperty`, permuting through all combinations. If the SortBy
    column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty
    value!
  * add that generated []byte row to the index's corresponding table (see the
    sketch below).
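
Here's a sketch of that row-generation step. The types are simplified (a column
value is just its serialized bytes) and the helpers are invented, but it shows
the permutation and the DESCENDING inversion:

    // invert flips every byte so that bytes.Compare orders the value in
    // reverse; this is how DESCENDING columns are stored.
    func invert(b []byte) []byte {
        out := make([]byte, len(b))
        for i, c := range b {
            out[i] = c ^ 0xFF
        }
        return out
    }

    // genRows emits one row per combination of column values. cols[i] holds
    // the serialized values for SortBy column i; desc[i] marks DESCENDING
    // columns.
    func genRows(cols [][][]byte, desc []bool, emit func(row []byte)) {
        var rec func(i int, row []byte)
        rec = func(i int, row []byte) {
            if i == len(cols) {
                emit(row)
                return
            }
            for _, v := range cols[i] {
                if desc[i] {
                    v = invert(v)
                }
                rec(i+1, append(append([]byte(nil), row...), v...))
            }
        }
        rec(0, nil)
    }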

Query planning
--------------

Now that we have all of our data tabulated, let's plan some queries. The
high-level algorithm works like this:
* Generate a suffix format from the user's query which looks like:
  * orders (including the inequality as the first order, if any)
  * projected fields which aren't explicitly referenced in the orders (we
    assume ASCENDING order for them), in the order that they were projected.
  * `__key__` (implied ascending, unless the query's last sort order is for
    `__key__`, in which case it's whatever order the user specified)
* Reverse the order of this suffix format, and serialize it into an
  IndexDefinition, along with the query's Kind and Ancestor values. This
  does what PrepForIdxTable did when we added the Index in the first place.
* Use this serialized reversed index to find compound indexes which might
  match, by looking up rows in the "idx" table which begin with this serialized
  reversed index (sketched in code below).
* Generate every builtin index for the inequality + equality filter
  properties, and see if they match too.
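
The compound-index lookup in the second-to-last step above is just a prefix
scan over the "idx" table. A sketch, where `scanAscending` stands in for
whatever in-key-order iteration the underlying store provides:

    import "bytes"

    // findCandidateIdxKeys returns the encoded IndexDefinition keys in the
    // "idx" table which begin with the serialized, reversed suffix format.
    func findCandidateIdxKeys(prefix []byte,
        scanAscending func(start []byte, cb func(key []byte) bool)) [][]byte {

        var ret [][]byte
        scanAscending(prefix, func(key []byte) bool {
            if !bytes.HasPrefix(key, prefix) {
                return false // walked past the matching range; stop scanning
            }
            ret = append(ret, append([]byte(nil), key...))
            return true
        })
        return ret
    }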

An index is a potential match if its suffix *exactly* matches the suffix format,
and it contains *only* sort orders which appear in the query (i.e. the index
doesn't contain any column which fails to appear in the query as an equality or
inequality filter).

The index search continues until:
* We find at least one matching index; AND
* The combination of all matching indexes accounts for every equality filter
  at least once.

If we fail to find sufficient indexes to fulfill the query, we generate the
'missing' index by concatenating all missing equality filters (in ascending
order, but it doesn't matter if the user wants to make them descending), followed
by concatenating the suffix format that we generated for this query.

Recall that equality filters are expressed as
`map[propName][]serializedProperty`. We'll refer to this mapping as the
'constraint' mapping below.

To actually come up with the final index selection, we sort all the matching
indexes from greatest number of columns to least. We add the 0th index (the one
with the greatest number of columns) unconditionally. We then keep adding indexes
which contain one or more of the remaining constraints, until we have no
more constraints to satisfy.

Adding an index entails determining which columns in that index correspond to
equality columns, and which ones correspond to inequality/order/projection
columns. Recall that the inequality/order/projection columns are all the same
for all of the potential indices (i.e. they all share the same *suffix format*).
We can use this to iterate over just those SortBy columns of the index which
we'll use for equality filters. For each equality column, we remove a
corresponding value from the constraints map. In the event that we _run out_ of
constraints for a given column, we simply _pick an arbitrary value_ from the
original equality filter mapping and use that. This is valid to do because,
after all, they're equality filters.
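
A sketch of that consumption rule (`valueFor` is an invented helper; the two
maps are the 'constraint' mapping and the original, unconsumed equality filter
mapping described above):

    // valueFor pops the next serialized value for an equality column, falling
    // back to an arbitrary value from the original equality filter mapping
    // once that column's constraints run out. The fallback is safe precisely
    // because these are all equality filters.
    func valueFor(col string, constraints, original map[string][][]byte) []byte {
        if vals := constraints[col]; len(vals) > 0 {
            constraints[col] = vals[1:]
            return vals[0]
        }
        return original[col][0]
    }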

Note that for compound indexes, the ancestor key counts as an equality filter,
and if the compound index has `Ancestor: true`, then we implicitly put the
ancestor as if it were the first SortBy column. For satisfying Ancestor queries
with built-in indexes, see the next section.

Once we've got our list of constraints for this index, we concatenate them all
together to get the *prefix* for this index. When iterating over this index, we
only ever want to see index rows whose prefix exactly matches this. Unlike the
suffix format, the prefix is per-index (remember that ALL indexes in the
query must have the same suffix format).

Finally, we set the 'start' and 'end' values for all chosen indexes to either
the Start and End cursors, or the Greater-Than and Less-Than values for the
inequality. The Cursors contain values for every suffix column, and the
inequality only contains a value for the first suffix column. If both cursors
and an inequality are specified, we take the smaller set of both (the
combination which will return the fewest rows).

That's about it for index selection! See Query Execution for how we actually use
the selected indexes to run a query.

### Ancestor queries and Built-in indexes

You may have noticed that the built-in indexes can be used for Ancestor queries
with equality filters, but they don't start with the magic Ancestor column!

There's a trick that you can do if the suffix format for the query is just
`__key__` though (e.g. the query only contains equality filters, and/or an
inequality filter on `__key__`). You can serialize the datastore key that you're
planning to use for the Ancestor query, then chop off the terminating null byte
from the encoding, and then use this as additional prefix bytes for this index.
So if the builtin for the "Val" property has the column format of:

    {Property: "Val"}, {Property: "__key__"}

And your query holds Val as an equality filter, you can serialize the
ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
an index row:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP

(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:

    PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1

Omitting the STOP which normally terminates the Key encoding. Using this prefix
will only return index rows which are `/Kind,1` or its children.
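
A sketch of that prefix construction (`ancestorScanPrefix` is an invented
helper; `serializedAncestor` stands for the ancestor key's serialized,
WithoutContext form):

    // ancestorScanPrefix appends the ancestor key's encoding, minus its
    // trailing STOP (0x00) byte, to the equality-filter prefix. Dropping the
    // STOP is what lets the prefix match the ancestor itself *and* all of its
    // children.
    func ancestorScanPrefix(eqPrefix, serializedAncestor []byte) []byte {
        enc := serializedAncestor[:len(serializedAncestor)-1] // chop STOP
        out := append([]byte(nil), eqPrefix...)
        return append(out, enc...)
    }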

"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
Remember that this trick only works if the prefix before the `__key__` is
*entirely* composed of equality filters. Also recall that if you ONLY use
equality filters and Ancestor (and possibly an inequality on `__key__`), then
you can always satisfy the query from the built-in indexes! While you
technically could do it with a compound index, there's not really a point to
doing so. To remain faithful to the production datastore implementation, we
don't implement this trick for anything other than the built-in indices.

### Cursor format

Cursors work by containing values for each of the columns in the suffix, in the
order and Direction specified by the suffix. In fact, cursors are just encoded
versions of the []IndexColumn used for the 'suffix format', followed by the
raw bytes of the suffix for that particular row (incremented by 1 bit).
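
One way to read "incremented by 1 bit": treat the row's suffix bytes as a
single big-endian number and add one, so that resuming from the cursor starts
strictly after the row which produced it. A sketch of that interpretation
(overflow of an all-0xFF suffix is ignored here):

    // increment returns a copy of suffix, treated as a big-endian integer,
    // plus one.
    func increment(suffix []byte) []byte {
        out := append([]byte(nil), suffix...)
        for i := len(out) - 1; i >= 0; i-- {
            out[i]++
            if out[i] != 0 {
                break // no carry to propagate further
            }
        }
        return out
    }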

This means that technically you can port cursors between any queries which share
precisely the same suffix format, regardless of other query options, even if the
index planner ends up choosing different indexes to use from the first query to
the second. No state is maintained in the service implementation for cursors.

I suspect that this is a more liberal version of cursors than how the production
appengine implements them, but I haven't verified one way or the other.

Query execution
---------------

Last but not least, we need to actually execute the query. After figuring out
which indexes to use with what prefixes and start/end values, we essentially
have a list of index subsets, all sorted the same way. To pull the values out,
we start by iterating the first index in the list, grabbing its suffix value,
and trying to iterate from that suffix in the second, third, fourth, etc. index.

If any index iterates past that suffix, we start back at the 0th index with that
suffix, and continue to try to find a matching row. Doing this will end up
skipping large portions of all of the indexes in the list. This is the algorithm
known as "zigzag merge join", and you can find talks on it from some of the
appengine folks. It has very good algorithmic running time and tends to scale
with the number of full matches, rather than the size of all of the indexes
involved.
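
A compact sketch of that zigzag loop over `n` sorted index iterators.
`seek(i, s, incl)` is an invented helper: position iterator `i` at the first
suffix >= `s` (or > `s` when `incl` is false) and return it, or nil once that
index is exhausted:

    import "bytes"

    // zigzag calls emit for every suffix on which all n iterators agree.
    func zigzag(n int, seek func(i int, s []byte, incl bool) []byte, emit func(s []byte)) {
        target := seek(0, nil, true)
        for target != nil {
            agreed := true
            for i := 1; i < n; i++ {
                got := seek(i, target, true)
                if got == nil {
                    return // this index ran out of rows; no further hits possible
                }
                if !bytes.Equal(got, target) {
                    // Overshot: restart the scan from the larger suffix.
                    target, agreed = seek(0, got, true), false
                    break
                }
            }
            if agreed {
                emit(target)
                target = seek(0, target, false) // advance strictly past the hit
            }
        }
    }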

A hit occurs when all of the iterators have precisely the same suffix. This hit
suffix is then decoded using the suffix format information. The very last column
of the suffix will always be the datastore key. The suffix is then used to call
back to the user, according to the query type:
* keys-only queries just directly return the Key
* projection queries return the projected fields from the decoded suffix.
  Remember how we added all the projections after the orders? This is why. The
  projected values are pulled directly from the index, instead of going to the
  main entity table.
* normal queries pull the decoded Key from the "ents" table, and return that
  entity to the user.