Index: impl/memory/README.md |
diff --git a/impl/memory/README.md b/impl/memory/README.md |
index 719edc947d7145b5ee346f68f1fd95de566bc9f3..cf8510e867a90097a8562a9abb488d363b6eac1c 100644 |
--- a/impl/memory/README.md |
+++ b/impl/memory/README.md |
@@ -1,114 +1,365 @@ |
-In-memory appengine wrapper |
-=========================== |
- |
- |
-Notes on the internal encodings |
-------------------------------- |
- |
-All datatypes inside of the index Collections of the gkvlite Store are stored |
-in a manner which allows them to be compared entirely via bytes.Compare. All |
-types are prefixed by a sortable type byte which encodes the sort-order of types |
-according to the appengine SDK. Additionally, types have the following data |
-encoding: |
- * ints |
- * stored with the `funnybase` varint encoding |
- * floats |
- * http://stereopsis.com/radix.html |
- * toBytes: |
- ``` |
- b := math.Float64bits(f) |
- return b ^ (-(b >> 63) | 0x8000000000000000) |
- ``` |
- * fromBytes: |
- ``` |
- return math.Float64frombits(b ^ ((b >> 63) - 1) | 0x8000000000000000) |
- ``` |
- * string, []byte, BlobKey, ByteString |
- * funnybase byte count |
- * raw bytes |
- * \*Key, GeoPoint |
- * composite of above types |
- * time.Time |
- * composite of above types, stored with microsecond accuracy. |
- * rounding to microseconds is a limitation of the real appengine. |
- * toMicro: `return t.Unix()*1e6 + int64(t.Nanosecond()/1e3)` |
- * fromMicro: `return time.Unix(t/1e6, (t%1e6)*1e3)` |
- * nil, true, false |
- * value is encoded directly in the type byte |
- |
- |
-Gkvlite Collection schema |
-------------------------- |
- |
-In order to provide efficient result deduplication, the value of an index row |
-which indexes 1 or more properties is a concatenation of the previous values |
-which would show up in the same index. For example, if you have the property |
-list for the key K: |
- |
- bob: 1 |
- bob: 4 |
- bob: 7 |
- cat: "hello" |
- cat: "world" |
- |
-And the regular (non-ancestor) composite index was {bob, -cat}, you'd have the |
-rows in the index `idx:ns:kind|R|bob|-cat` (| in the row indicates |
-concatenation, each value has an implied type byte. `...` indicates that other |
-rows may be present): |
- |
- ... |
- 1|"world"|K = nil|nil |
- ... |
- 1|"hello"|K = nil|"world" |
- ... |
- 4|"world"|K = 1|nil |
- ... |
- 4|"hello"|K = 1|"world" |
- ... |
- 7|"world"|K = 4|nil |
- ... |
- 7|"hello"|K = 4|"world" |
- ... |
- |
-This allows us to, start scanning at any point and be able to determine if we've |
-returned a given key already (without storing all of the keys in memory |
-for the duration of the Query run). We can do this because we can see if the |
-value of an index row falls within the original query filter parameters. If it |
-does, then we must already have returned they Key, and can safely skip the index |
-row. AFAIK, real-datastore provides deduplication by keeping all the returned |
-keys in memory as it runs the query, and doing a set-check. |
- |
-The end-result is semantically equivalent, with the exception that Query Cursors |
-on the real datastore will potentially return the same Key in the first Cursor |
-use as well as on the 2nd (or Nth) cursor use, where this method will not. |
- |
- collections |
- ents:ns -> key -> value |
- (rootkind, rootid, __entity_group__,1) -> {__version__: int} |
- (rootkind, rootid, __entity_group_ids__,1) -> {__version__: int} |
- (__entity_group_ids__,1) -> {__version__: int} |
- // TODO(iannucci): Journal every entity write in a log with a globally |
- // increasing version number (aka "timestamp"). |
- // |
- // TODO(iannucci): Use the value in idx collection to indicate the last |
- // global log version reflected in this index. Then index updates can happen |
- // in parallel, in a truly eventually-consistent fashion (and completely |
- // avoid holding the DB writelock while calculating index entries). |
- // Unfortunately, copying new records (and removing old ones) into the DB |
- // would still require holding global writelock. |
- // |
- // TODO(iannucci): how do we garbage-collect the journal? |
- // |
- // TODO(iannucci): add the ability in gkvlite to 'swap' a collection with |
- // another one, transactionally? Not sure if this is possible to do easily. |
- // If we had this, then we could do all the index writes for a given index |
- // on the side, and then do a quick swap into place with a writelock. As |
- // long as every index only has a single goroutine writing it, then this |
- // would enable maximum concurrency, since all indexes could update in |
- // parallel and only synchronize for the duration of a single pointer swap. |
- idx -> kind|A?|[-?prop]* = nil |
- idx:ns:kind -> key = nil |
- idx:ns:kind|prop -> propval|key = [prev val] |
- idx:ns:kind|-prop -> -propval|key = [next val] |
- idx:ns:kind|A|?prop|?prop -> A|propval|propval|key = [prev/next val]|[prev/next val] |
- idx:ns:kind|?prop|?prop -> propval|propval|key = [prev/next val]|[prev/next val] |
+Datastore implementation internals |
+================================== |
+ |
+This document contains internal implementation details for this in-memory |
+version of datastore. It's mostly helpful for understanding what's going on in |
+this implementation, but it can also reveal some insight into how the real |
+appengine datastore works (though note that the specific encodings differ). |
+ |
+Additionally note that this implementation cheats by moving some of the Key |
+bytes into the table (collection) names (like the namespace, property name for |
+the builtin indexes, etc.). The real implementation contains these bytes in the |
+table row keys, I think. |
+ |
+ |
+Internal datastore key/value collection schema |
+---------------------------------------------- |
+ |
+The datastore implementation here uses several different tables ('collections') |
+to manage state for the data. The schema for these tables is enumerated below |
+to make the code a bit easier to reason about. |
+ |
+All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized |
+using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the |
+primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings |
+are important to understanding why the schemas below sort correctly when |
+compared only using `bytes.Compare` (aka `memcmp`). This doc will assume that |
+you're familiar with those encodings, but will point out where we diverge from |
+the stock encodings. |
+ |
+All encoded Property values used in gkvlite Keys (i.e. index rows) are |
+serialized using the settings `serialize.WithoutContext`, and |
+`datastore.ShouldIndex`. |
+ |
+### Primary table |
+ |
+The primary table maps datastore keys to entities. |
+- Name: `"ents:" + namespace` |
+- Key: serialized datastore.Property containing the entity's datastore.Key |
+- Value: serialized datastore.PropertyMap |
+ |
+This table also encodes values for the following special keys: |
+- Every entity root (e.g. a Key with nil Parent()) with key K has: |
+ - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}` |
+ A child entity with the kind `__entity_group__` and an id of `1`. The value |
+ has a single property `__version__`, which contains the version number of |
+ this entity group. This is used to detect transaction conflicts. |
+ - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}` |
+    A child entity with the kind `__entity_group_ids__` and an id of `1`. The value |
+ has a single property `__version__`, which contains the last automatically |
+ allocated entity ID for entities within this entity group. |
+- A root entity with the key `Key("__entity_group_ids__",1)` which contains the |
+ same `__version__` property, and indicates the last automatically allocated |
+ entity ID for root entities. |
+ |
+### Compound Index table |
+ |
+The next table keeps track of all the user-added 'compound' index descriptions |
+(not the content for the indexes). There is a row in this table for each |
+compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`. |
+- Name: `"idx"` |
+- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice |
+ in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`). |
+- Value: empty |
+ |
+The Key format here requires some special attention. Say you started with |
+a compound IndexDefinition of: |
+ IndexDefinition{ |
+ Kind: "Foo", |
+ Ancestor: true, |
+ SortBy: []IndexColumn{ |
+ {Property: "Something", Direction: DESCENDING}, |
+ {Property: "Else", Direction: ASCENDING}, |
+ {Property: "Cool", Direction: ASCENDING}, |
+    }, |
+ } |
+ |
+After prepping it for the table, it would be equivalent to: |
+ IndexDefinition{ |
+ Kind: "Foo", |
+ Ancestor: true, |
+ SortBy: []IndexColumn{ |
+ {Property: "__key__", Direction: ASCENDING}, |
+ {Property: "Cool", Direction: ASCENDING}, |
+ {Property: "Else", Direction: ASCENDING}, |
+ {Property: "Something", Direction: DESCENDING}, |
+    }, |
+ } |
+ |
+The reason for doing this will be covered in the `Query Planning` section, but |
+it boils down to allowing the query planner to use this encoded table to |
+intelligently scan for potentially useful compound indexes. |
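+ |
+A minimal sketch of the SortBy transformation implied by the two examples |
+above, using the same `IndexDefinition`/`IndexColumn` types (the helper name |
+is hypothetical; the real `PrepForIdxTable` also handles the serialized form): |
+ |
+  // prepSortBy appends the implied trailing __key__ column, then reverses |
+  // the column order; this is what makes the "idx" table scannable by the |
+  // suffix of the sort order. |
+  func prepSortBy(cols []IndexColumn) []IndexColumn { |
+    out := append([]IndexColumn{}, cols...) |
+    out = append(out, IndexColumn{Property: "__key__", Direction: ASCENDING}) |
+    for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 { |
+      out[i], out[j] = out[j], out[i] |
+    } |
+    return out |
+  } |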
+ |
+### Index Tables |
+ |
+Every index (both builtin and compound) has one index table per namespace, |
+which contains one row per entry in the index. |
+- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()` |
+- Key: concatenated datastore.Property values, one per SortBy column in the |
+ IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is |
+ DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF). |
+- Value: empty |
+ |
+If the IndexDefinition has `Ancestor: true`, then the very first column of the |
+Key contains the partial Key for the entity. So if an entity has the datastore |
+key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`, |
+`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor column |
+of the indexes that it matches. |
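+ |
+Putting the Key description above into code, an index row key is assembled |
+roughly like this (hypothetical helper names; each value is assumed to be |
+already serialized, and the optional leading ancestor column is ignored): |
+ |
+  // invert flips every byte of a serialized property value so that a |
+  // DESCENDING column still sorts correctly under bytes.Compare. |
+  func invert(b []byte) []byte { |
+    out := make([]byte, len(b)) |
+    for i, c := range b { |
+      out[i] = c ^ 0xFF |
+    } |
+    return out |
+  } |
+ |
+  // indexRowKey concatenates one serialized value per SortBy column, |
+  // inverting the values of DESCENDING columns. |
+  func indexRowKey(vals [][]byte, descending []bool) []byte { |
+    row := []byte(nil) |
+    for i, v := range vals { |
+      if descending[i] { |
+        v = invert(v) |
+      } |
+      row = append(row, v...) |
+    } |
+    return row |
+  } |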
+ |
+#### Builtin (automatic) indexes |
+ |
+The following indexes are automatically created for an entity with a key |
+`/Kind,*`, for every property named "Foo" which has `ShouldIndex` values: |
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
+ {Property: "__key__", Direction: ASCENDING}, |
+ }} |
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
+ {Property: "Foo", Direction: ASCENDING}, |
+ {Property: "__key__", Direction: ASCENDING}, |
+ }} |
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{ |
+ {Property: "Foo", Direction: DESCENDING}, |
+ {Property: "__key__", Direction: ASCENDING}, |
+ }} |
+ |
+Index updates |
+------------- |
+ |
+(Note that this is a LARGE departure from how the production appengine datastore |
+does this. This model only works because the implementation is not distributed, |
+and not journaled. The real datastore does index updates in parallel and is |
+generally pretty fancy compared to this). |
+ |
+Index updates are pretty straightforward. On a mutation to the primary entity |
+table, we take the old entity value (remember that entity values are |
+PropertyMaps) and the new entity value, create index entries for both, diff |
+them, and apply the deltas to the affected index tables (i.e. entries that |
+exist for the old entity, but not the new one, are deleted; entries that exist |
+for the new entity, but not the old one, are added). |
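+ |
+A sketch of that delta step, with index rows reduced to opaque byte strings |
+(the map keys) and with hypothetical add/del callbacks standing in for the |
+actual gkvlite writes: |
+ |
+  // applyDelta deletes rows produced only by the old entity and adds rows |
+  // produced only by the new entity; rows common to both are untouched. |
+  func applyDelta(oldRows, newRows map[string]bool, del, add func(string)) { |
+    for row := range oldRows { |
+      if !newRows[row] { |
+        del(row) |
+      } |
+    } |
+    for row := range newRows { |
+      if !oldRows[row] { |
+        add(row) |
+      } |
+    } |
+  } |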
+ |
+Index generation works (given a slice of indexes) by: |
+* serializing all ShouldIndex Properties in the PropertyMap to get a |
+ `map[name][]serializedProperty`. |
+* for each index idx |
+ * if idx's columns contain properties that are not in the map, skip idx |
+ * make a `[][]serializedProperty`, where each serializedProperty slice |
+    corresponds to one of the IndexColumns of idx.SortBy |
+ * duplicated values for multi-valued properties are skipped. |
+  * generate a `[]byte` row which is the concatenation of one value from each |
+    `[]serializedProperty`, permuting through all combinations (see the sketch |
+    after this list). If the SortBy column is DESCENDING, make sure to invert |
+    (XOR 0xFF) the serializedProperty value! |
+ * add that generated []byte row to the index's corresponding table. |
+ |
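+A sketch of the permutation step from the list above, assuming each column's |
+values are already serialized (and already inverted for DESCENDING columns): |
+ |
+  // indexRowsFor takes, per SortBy column, the serialized values the entity |
+  // has for that column, and emits one row per combination (the cartesian |
+  // product), concatenating one value from each column. |
+  func indexRowsFor(cols [][][]byte) [][]byte { |
+    rows := [][]byte{nil} |
+    for _, colVals := range cols { |
+      next := make([][]byte, 0, len(rows)*len(colVals)) |
+      for _, prefix := range rows { |
+        for _, v := range colVals { |
+          row := append(append([]byte(nil), prefix...), v...) |
+          next = append(next, row) |
+        } |
+      } |
+      rows = next |
+    } |
+    return rows |
+  } |
+ |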
+Note that we choose to serialize all permutations of the saved entity. This is |
+so that we can use repeated-column indexes to fill queries which use a subset of |
+the columns. E.g. if we have the index `duck,duck,duck,goose`, we can |
+theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by |
+pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index |
+selection at the expense of larger indexes. However, it means that if you have |
+the entity: |
+ duck = 1, 2, 3, 4 |
+ goose = "færøske" |
+ |
+It generates the following index entries: |
+ duck=1,duck=1,duck=1,goose="færøske" |
+ duck=1,duck=1,duck=2,goose="færøske" |
+ duck=1,duck=1,duck=3,goose="færøske" |
+ duck=1,duck=1,duck=4,goose="færøske" |
+ duck=1,duck=2,duck=1,goose="færøske" |
+ duck=1,duck=2,duck=2,goose="færøske" |
+ duck=1,duck=2,duck=3,goose="færøske" |
+ duck=1,duck=2,duck=4,goose="færøske" |
+ duck=1,duck=3,duck=1,goose="færøske" |
+ ... a lot ... |
+ duck=4,duck=4,duck=4,goose="færøske" |
+ |
+This is a very large number of index rows (i.e. an 'exploding index')! |
+ |
+An alternate design would be to only generate unique combinations of values |
+(rather than all permutations) where the index has repeated columns of a |
+single property. This only makes sense because, under the current constraints |
+of appengine, it's illegal to have an equality and an inequality filter on the |
+same property (though allowing that isn't completely ridiculous in general, if |
+inequality constraints meant the same thing as equality constraints; however, |
+it would lead to a multi-dimensional query, which can be quite slow and is |
+very difficult to scale without application knowledge). Doing this also means |
+that we would need to SORT the equality filter values when generating the |
+prefix (so that the least-valued equality constraint is first). With this |
+scheme, the generated index rows for the above entity would be: |
+ duck=1,duck=2,duck=3,goose="færøske" |
+ duck=1,duck=2,duck=4,goose="færøske" |
+ duck=1,duck=3,duck=4,goose="færøske" |
+ duck=2,duck=3,duck=4,goose="færøske" |
+ |
+Which would be a LOT more compact. It may be worth implementing this |
+restriction later, simply for the memory savings when indexing multi-valued |
+properties. |
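+ |
+A sketch of that restricted generation: sorted k-element combinations of a |
+property's sorted values, instead of the full k-fold product above: |
+ |
+  // combos returns the sorted k-element combinations of vals, e.g. |
+  // combos([1 2 3 4], 3) -> [1 2 3], [1 2 4], [1 3 4], [2 3 4]. |
+  func combos(vals [][]byte, k int) [][][]byte { |
+    if k == 0 { |
+      return [][][]byte{nil} |
+    } |
+    out := [][][]byte(nil) |
+    for i := 0; i+k <= len(vals); i++ { |
+      for _, rest := range combos(vals[i+1:], k-1) { |
+        out = append(out, append([][]byte{vals[i]}, rest...)) |
+      } |
+    } |
+    return out |
+  } |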
+ |
+If this technique is used, there's also room to unambiguously index entities |
+with repeated equivalent values. E.g. if duck=1,1,2,3,4, then you could see |
+a row in the index like: |
+ duck=1,duck=1,duck=2,goose="færøske" |
+ |
+Which would allow you to query for "an entity which has duck values equal to 1, |
+1 and 2". Currently such a query is not possible to execute (it would be |
+equivalent to "an entity which has duck values equal to 1 and 2"). |
+ |
+Query planning |
+-------------- |
+ |
+Now that we have all of our data tabulated, let's plan some queries. The |
+high-level algorithm works like this: |
+ * Generate a suffix format from the user's query (there's a sketch of this |
+   step in code after this list) which looks like: |
+ * orders (including the inequality as the first order, if any) |
+ * projected fields which aren't explicitly referenced in the orders (we |
+ assume ASCENDING order for them), in the order that they were projected. |
+ * `__key__` (implied ascending, unless the query's last sort order is for |
+ `__key__`, in which case it's whatever order the user specified) |
+ * Reverse the order of this suffix format, and serialize it into an |
+ IndexDefinition, along with the query's Kind and Ancestor values. This |
+ does what PrepForIdxTable did when we added the Index in the first place. |
+ * Use this serialized reversed index to find compound indexes which might |
+ match by looking up rows in the "idx" table which begin with this serialized |
+ reversed index. |
+ * Generate every builtin index for the inequality + equality filter |
+ properties, and see if they match too. |
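+ |
+A sketch of the suffix format construction from the first bullet (hypothetical |
+helper; `orders` is assumed to already contain the inequality, if any, as its |
+first entry): |
+ |
+  // suffixFormat builds the suffix columns: explicit orders first, then any |
+  // projected fields that aren't already ordered (ASCENDING), then __key__ |
+  // unless the user already ordered on it. |
+  func suffixFormat(orders []IndexColumn, projections []string) []IndexColumn { |
+    out := append([]IndexColumn{}, orders...) |
+    seen := map[string]bool{} |
+    for _, col := range out { |
+      seen[col.Property] = true |
+    } |
+    for _, p := range projections { |
+      if !seen[p] { |
+        out = append(out, IndexColumn{Property: p, Direction: ASCENDING}) |
+        seen[p] = true |
+      } |
+    } |
+    if !seen["__key__"] { |
+      out = append(out, IndexColumn{Property: "__key__", Direction: ASCENDING}) |
+    } |
+    return out |
+  } |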
+ |
+An index is a potential match if its suffix *exactly* matches the suffix |
+format, and it contains *only* sort orders which appear in the query (i.e. it |
+is NOT a match if it contains a column which doesn't appear in the query as an |
+equality or inequality filter). |
+ |
+The index search continues until: |
+ * We find at least one matching index; AND |
+ * The combination of all matching indexes accounts for every equality filter |
+ at least once. |
+ |
+If we fail to find sufficient indexes to fulfill the query, we generate an index |
+description that *could* be sufficient by concatenating all missing equality |
+filters, in ascending order, followed by the suffix format that we |
+generated for this query. We then suggest this new index to the user for them to |
+add by returning an error containing the generated IndexDefinition. Note that |
+this index is not REQUIRED for the user to add; they could choose to add bits |
+and pieces of it, extend existing indexes in order to cover the missing columns, |
+invert the direction of some of the equality columns, etc. |
+ |
+Recall that equality filters are expressed as |
+`map[propName][]serializedProperty`. We'll refer to this mapping as the |
+'constraint' mapping below. |
+ |
+To actually come up with the final index selection, we sort all the matching |
+indexes from greatest number of columns to least. We add the 0th index (the one |
+with the greatest number of columns) unconditionally. We then keep adding indexes |
+which contain one or more of the remaining constraints, until we have no |
+more constraints to satisfy. |
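+ |
+A sketch of that greedy selection. Each candidate index is reduced here to the |
+list of properties its equality columns can satisfy, and the constraint |
+mapping is simplified to a set (one value per property) for brevity: |
+ |
+  // choose sorts the candidates widest-first, always takes the widest, then |
+  // keeps taking candidates that cover at least one remaining equality |
+  // constraint until none remain. |
+  func choose(candidates [][]string, constraints map[string]bool) [][]string { |
+    sort.Slice(candidates, func(i, j int) bool { |
+      return len(candidates[i]) > len(candidates[j]) |
+    }) |
+    chosen := [][]string(nil) |
+    for i, cand := range candidates { |
+      if i > 0 && len(constraints) == 0 { |
+        break |
+      } |
+      useful := i == 0 // the widest candidate is always used |
+      for _, prop := range cand { |
+        if constraints[prop] { |
+          delete(constraints, prop) |
+          useful = true |
+        } |
+      } |
+      if useful { |
+        chosen = append(chosen, cand) |
+      } |
+    } |
+    return chosen |
+  } |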
+ |
+Adding an index entails determining which columns in that index correspond to |
+equality columns, and which ones correspond to inequality/order/projection |
+columns. Recall that the inequality/order/projection columns are all the same |
+for all of the potential indices (i.e. they all share the same *suffix format*). |
+We can use this to iterate over just the index's SortBy columns which we'll use |
+for equality filters. For each equality column, we remove a corresponding value |
+from the constraints map. In the event that we _run out_ of constraints for a |
+given column, we simply _pick an arbitrary value_ from the original equality |
+filter mapping and use that. This is valid to do because, after all, they're |
+equality filters. |
+ |
+Note that for compound indexes, the ancestor key counts as an equality filter, |
+and if the compound index has `Ancestor: true`, then we implicitly put the |
+ancestor as if it were the first SortBy column. For satisfying Ancestor queries |
+with built-in indexes, see the next section. |
+ |
+Once we've got our list of constraints for this index, we concatenate them all |
+together to get the *prefix* for this index. When iterating over this index, we |
+only ever want to see index rows whose prefix exactly matches this. Unlike the |
+suffix format, the prefix is per-index (remember that ALL indexes in the |
+query must have the same suffix format). |
+ |
+Finally, we set the 'start' and 'end' values for all chosen indexes to either |
+the Start and End cursors, or the Greater-Than and Less-Than values for the |
+inequality. The Cursors contain values for every suffix column, and the |
+inequality only contains a value for the first suffix column. If both cursors |
+and an inequality are specified, we take the smaller set of both (the |
+combination which will return the fewest rows). |
+ |
+That's about it for index selection! See Query Execution for how we actually use |
+the selected indexes to run a query. |
+ |
+### Ancestor queries and Built-in indexes |
+ |
+You may have noticed that the built-in indexes can be used for Ancestor queries |
+with equality filters, but they don't start with the magic Ancestor column! |
+ |
+There's a trick that you can do if the suffix format for the query is just |
+`__key__` though (e.g. the query only contains equality filters, and/or an |
+inequality filter on `__key__`). You can serialize the datastore key that you're |
+planning to use for the Ancestor query, then chop off the terminating null byte |
+from the encoding, and then use this as additional prefix bytes for this index. |
+So if the builtin for the "Val" property has the column format of: |
+ |
+ {Property: "Val"}, {Property: "__key__"} |
+ |
+And your query holds Val as an equality filter, you can serialize the |
+ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had |
+an index row: |
+ |
+ PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP |
+ |
+(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for |
+the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as: |
+ |
+ PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 |
+ |
+Omitting the STOP which normally terminates the Key encoding. Using this prefix |
+will only return index rows which are `/Kind,1` or its children. |
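+ |
+A sketch of forming that prefix for the example query, where eqVal is the |
+serialized `PTInt ++ 100` value and encodedAncestorKey is the serialized |
+`/Kind,1` key including its trailing STOP byte (hypothetical helper): |
+ |
+  // ancestorQueryPrefix drops the ancestor key's terminating STOP byte, so |
+  // the prefix matches rows for the ancestor itself and all its descendants. |
+  func ancestorQueryPrefix(eqVal, encodedAncestorKey []byte) []byte { |
+    prefix := append([]byte(nil), eqVal...) |
+    return append(prefix, encodedAncestorKey[:len(encodedAncestorKey)-1]...) |
+  } |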
+ |
+"That's cool! Why not use this trick for compound indexes?", I hear you ask :) |
+Remember that this trick only works if the prefix before the `__key__` is |
+*entirely* composed of equality filters. Also recall that if you ONLY use |
+equality filters and Ancestor (and possibly an inequality on `__key__`), then |
+you can always satisfy the query from the built-in indexes! While you |
+technically could do it with a compound index, there's not really a point to |
+doing so. To remain faithful to the production datastore implementation, we |
+don't implement this trick for anything other than the built-in indexes. |
+ |
+### Cursor format |
+ |
+Cursors work by containing values for each of the columns in the suffix, in the |
+order and Direction specified by the suffix. In fact, cursors are just encoded |
+versions of the []IndexColumn used for the 'suffix format', followed by the |
+raw bytes of the suffix for that particular row (incremented by 1 bit). |
+ |
+This means that technically you can port cursors between any queries which share |
+precisely the same suffix format, regardless of other query options, even if the |
+index planner ends up choosing different indexes to use from the first query to |
+the second. No state is maintained in the service implementation for cursors. |
+ |
+I suspect that this is a more liberal version of cursors than how the production |
+appengine implements them, but I haven't verified one way or the other. |
+ |
+Query execution |
+--------------- |
+ |
+Last but not least, we need to actually execute the query. After figuring out |
+which indexes to use with what prefixes and start/end values, we essentially |
+have a list of index subsets, all sorted the same way. To pull the values out, |
+we start by iterating the first index in the list, grabbing its suffix value, |
+and trying to iterate from that suffix in the second, third, fourth, etc. index. |
+ |
+If any index iterates past that suffix, we start back at the 0th index with that |
+suffix, and continue to try to find a matching row. Doing this will end up |
+skipping large portions of all of the indexes in the list. This is the algorithm |
+known as "zigzag merge join", and you can find talks on it from some of the |
+appengine folks. It has very good algorithmic running time and tends to scale |
+with the number of full matches, rather than the size of all of the indexes |
+involved. |
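+ |
+A sketch of that loop, using a hypothetical iterator whose Seek returns the |
+first row suffix greater than or equal to its argument. (The sketch |
+round-robins through the iterators rather than literally restarting at the 0th |
+one, but the effect is the same.) |
+ |
+  // iterator is one chosen index subset; Seek returns the first row suffix |
+  // >= target, or ok == false when the subset is exhausted. |
+  type iterator interface { |
+    Seek(target []byte) (suffix []byte, ok bool) |
+  } |
+ |
+  // zigzag emits every suffix that appears in ALL of the iterators. |
+  func zigzag(iters []iterator, emit func(suffix []byte)) { |
+    if len(iters) == 0 { |
+      return |
+    } |
+    target, ok := iters[0].Seek(nil) |
+    if !ok { |
+      return |
+    } |
+    matched, i := 1, 1 |
+    for { |
+      got, ok := iters[i%len(iters)].Seek(target) |
+      if !ok { |
+        return // one subset is exhausted; no further hits are possible |
+      } |
+      if bytes.Equal(got, target) { |
+        matched++ // this index agrees with the candidate suffix |
+      } else { |
+        target, matched = got, 1 // overshot: `got` is the new candidate |
+      } |
+      if matched >= len(iters) { |
+        emit(target) // a hit: every index contains this suffix |
+        // restart just past the hit (its smallest possible successor) |
+        target = append(append([]byte(nil), target...), 0x00) |
+        matched = 0 |
+      } |
+      i++ |
+    } |
+  } |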
+ |
+A hit occurs when all of the iterators have precisely the same suffix. This hit |
+suffix is then decoded using the suffix format information. The very last column |
+of the suffix will always be the datastore key. The suffix is then used to call |
+back to the user, according to the query type: |
+* keys-only queries just directly return the Key |
+* projection queries return the projected fields from the decoded suffix. |
+ Remember how we added all the projections after the orders? This is why. The |
+ projected values are pulled directly from the index, instead of going to the |
+ main entity table. |
+* normal queries use the decoded Key to fetch the entity from the "ents" |
+  table, and return that entity to the user. |