| Index: impl/memory/README.md
|
| diff --git a/impl/memory/README.md b/impl/memory/README.md
|
| index 719edc947d7145b5ee346f68f1fd95de566bc9f3..cf8510e867a90097a8562a9abb488d363b6eac1c 100644
|
| --- a/impl/memory/README.md
|
| +++ b/impl/memory/README.md
|
| @@ -1,114 +1,365 @@
|
| -In-memory appengine wrapper
|
| -===========================
|
| -
|
| -
|
| -Notes on the internal encodings
|
| --------------------------------
|
| -
|
| -All datatypes inside of the index Collections of the gkvlite Store are stored
|
| -in a manner which allows them to be compared entirely via bytes.Compare. All
|
| -types are prefixed by a sortable type byte which encodes the sort-order of types
|
| -according to the appengine SDK. Additionally, types have the following data
|
| -encoding:
|
| - * ints
|
| - * stored with the `funnybase` varint encoding
|
| - * floats
|
| - * http://stereopsis.com/radix.html
|
| - * toBytes:
|
| - ```
|
| - b := math.Float64bits(f)
|
| - return b ^ (-(b >> 63) | 0x8000000000000000)
|
| - ```
|
| - * fromBytes:
|
| - ```
|
| - return math.Float64frombits(b ^ ((b >> 63) - 1) | 0x8000000000000000)
|
| - ```
|
| - * string, []byte, BlobKey, ByteString
|
| - * funnybase byte count
|
| - * raw bytes
|
| - * \*Key, GeoPoint
|
| - * composite of above types
|
| - * time.Time
|
| - * composite of above types, stored with microsecond accuracy.
|
| - * rounding to microseconds is a limitation of the real appengine.
|
| - * toMicro: `return t.Unix()*1e6 + int64(t.Nanosecond()/1e3)`
|
| - * fromMicro: `return time.Unix(t/1e6, (t%1e6)*1e3)`
|
| - * nil, true, false
|
| - * value is encoded directly in the type byte
|
| -
|
| -
|
| -Gkvlite Collection schema
|
| --------------------------
|
| -
|
| -In order to provide efficient result deduplication, the value of an index row
|
| -which indexes 1 or more properties is a concatenation of the previous values
|
| -which would show up in the same index. For example, if you have the property
|
| -list for the key K:
|
| -
|
| - bob: 1
|
| - bob: 4
|
| - bob: 7
|
| - cat: "hello"
|
| - cat: "world"
|
| -
|
| -And the regular (non-ancestor) composite index was {bob, -cat}, you'd have the
|
| -rows in the index `idx:ns:kind|R|bob|-cat` (| in the row indicates
|
| -concatenation, each value has an implied type byte. `...` indicates that other
|
| -rows may be present):
|
| -
|
| - ...
|
| - 1|"world"|K = nil|nil
|
| - ...
|
| - 1|"hello"|K = nil|"world"
|
| - ...
|
| - 4|"world"|K = 1|nil
|
| - ...
|
| - 4|"hello"|K = 1|"world"
|
| - ...
|
| - 7|"world"|K = 4|nil
|
| - ...
|
| - 7|"hello"|K = 4|"world"
|
| - ...
|
| -
|
| -This allows us to, start scanning at any point and be able to determine if we've
|
| -returned a given key already (without storing all of the keys in memory
|
| -for the duration of the Query run). We can do this because we can see if the
|
| -value of an index row falls within the original query filter parameters. If it
|
| -does, then we must already have returned they Key, and can safely skip the index
|
| -row. AFAIK, real-datastore provides deduplication by keeping all the returned
|
| -keys in memory as it runs the query, and doing a set-check.
|
| -
|
| -The end-result is semantically equivalent, with the exception that Query Cursors
|
| -on the real datastore will potentially return the same Key in the first Cursor
|
| -use as well as on the 2nd (or Nth) cursor use, where this method will not.
|
| -
|
| - collections
|
| - ents:ns -> key -> value
|
| - (rootkind, rootid, __entity_group__,1) -> {__version__: int}
|
| - (rootkind, rootid, __entity_group_ids__,1) -> {__version__: int}
|
| - (__entity_group_ids__,1) -> {__version__: int}
|
| - // TODO(iannucci): Journal every entity write in a log with a globally
|
| - // increasing version number (aka "timestamp").
|
| - //
|
| - // TODO(iannucci): Use the value in idx collection to indicate the last
|
| - // global log version reflected in this index. Then index updates can happen
|
| - // in parallel, in a truly eventually-consistent fashion (and completely
|
| - // avoid holding the DB writelock while calculating index entries).
|
| - // Unfortunately, copying new records (and removing old ones) into the DB
|
| - // would still require holding global writelock.
|
| - //
|
| - // TODO(iannucci): how do we garbage-collect the journal?
|
| - //
|
| - // TODO(iannucci): add the ability in gkvlite to 'swap' a collection with
|
| - // another one, transactionally? Not sure if this is possible to do easily.
|
| - // If we had this, then we could do all the index writes for a given index
|
| - // on the side, and then do a quick swap into place with a writelock. As
|
| - // long as every index only has a single goroutine writing it, then this
|
| - // would enable maximum concurrency, since all indexes could update in
|
| - // parallel and only synchronize for the duration of a single pointer swap.
|
| - idx -> kind|A?|[-?prop]* = nil
|
| - idx:ns:kind -> key = nil
|
| - idx:ns:kind|prop -> propval|key = [prev val]
|
| - idx:ns:kind|-prop -> -propval|key = [next val]
|
| - idx:ns:kind|A|?prop|?prop -> A|propval|propval|key = [prev/next val]|[prev/next val]
|
| - idx:ns:kind|?prop|?prop -> propval|propval|key = [prev/next val]|[prev/next val]
|
| +Datastore implementation internals
|
| +==================================
|
| +
|
| +This document contains internal implementation details for this in-memory
|
| +version of datastore. It's mostly helpful to understand what's going on in this
|
| +implementation, but it can also reveal some insight into how the real appengine
|
| +datastore works (though note that the specific encodings are different).
|
| +
|
| +Additionally note that this implementation cheats by moving some of the Key
|
| +bytes into the table (collection) names (like the namespace, property name for
|
| +the builtin indexes, etc.). The real implementation contains these bytes in the
|
| +table row keys, I think.
|
| +
|
| +
|
| +Internal datastore key/value collection schema
|
| +----------------------------------------------
|
| +
|
| +The datastore implementation here uses several different tables ('collections')
|
| +to manage state for the data. The schema for these tables is enumerated below
|
| +to make the code a bit easier to reason about.
|
| +
|
| +All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
|
| +using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
|
| +primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
|
| +are important to understanding why the schemas below sort correctly when
|
| +compared only using `bytes.Compare` (aka `memcmp`). This doc will assume that
|
| +you're familiar with those encodings, but will point out where we diverge from
|
| +the stock encodings.
|
| +
|
| +All encoded Property values used in gkvlite Keys (i.e. index rows) are
|
| +serialized using the settings `serialize.WithoutContext`, and
|
| +`datastore.ShouldIndex`.
|
| +
|
| +### Primary table
|
| +
|
| +The primary table maps datastore keys to entities.
|
| +- Name: `"ents:" + namespace`
|
| +- Key: serialized datastore.Property containing the entity's datastore.Key
|
| +- Value: serialized datastore.PropertyMap
|
| +
|
| +This table also encodes values for the following special keys:
|
| +- Every entity root (e.g. a Key with nil Parent()) with key K has:
|
| + - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
|
| + A child entity with the kind `__entity_group__` and an id of `1`. The value
|
| + has a single property `__version__`, which contains the version number of
|
| + this entity group. This is used to detect transaction conflicts.
|
| + - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
|
| +    A child entity with the kind `__entity_group_ids__` and an id of `1`. The value
|
| + has a single property `__version__`, which contains the last automatically
|
| + allocated entity ID for entities within this entity group.
|
| +- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
|
| + same `__version__` property, and indicates the last automatically allocated
|
| + entity ID for root entities.
|
| +
|
| +### Compound Index table
|
| +
|
| +The next table keeps track of all the user-added 'compound' index descriptions
|
| +(not the content for the indexes). There is a row in this table for each
|
| +compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
|
| +- Name: `"idx"`
|
| +- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
|
| + in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
|
| +- Value: empty
|
| +
|
| +The Key format here requires some special attention. Say you started with
|
| +a compound IndexDefinition of:
|
| + IndexDefinition{
|
| + Kind: "Foo",
|
| + Ancestor: true,
|
| + SortBy: []IndexColumn{
|
| + {Property: "Something", Direction: DESCENDING},
|
| + {Property: "Else", Direction: ASCENDING},
|
| + {Property: "Cool", Direction: ASCENDING},
|
| + }
|
| + }
|
| +
|
| +After prepping it for the table, it would be equivalent to:
|
| + IndexDefinition{
|
| + Kind: "Foo",
|
| + Ancestor: true,
|
| + SortBy: []IndexColumn{
|
| + {Property: "__key__", Direction: ASCENDING},
|
| + {Property: "Cool", Direction: ASCENDING},
|
| + {Property: "Else", Direction: ASCENDING},
|
| + {Property: "Something", Direction: DESCENDING},
|
| + }
|
| + }
|
| +
|
| +The reason for doing this will be covered in the `Query Planning` section, but
|
| +it boils down to allowing the query planner to use this encoded table to
|
| +intelligently scan for potentially useful compound indexes.
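The reshuffle above can be sketched in Go. The struct definitions here are hypothetical stand-ins for `datastore.IndexDefinition` / `datastore.IndexColumn` (the real `PrepForIdxTable` also serializes its result); they only illustrate the column reordering:

```go
package main

import "fmt"

type Direction int

const (
	ASCENDING Direction = iota
	DESCENDING
)

type IndexColumn struct {
	Property  string
	Direction Direction
}

type IndexDefinition struct {
	Kind     string
	Ancestor bool
	SortBy   []IndexColumn
}

// prepForIdxTable mirrors the transformation described above: make the
// implicit trailing __key__ column explicit, then reverse the whole
// SortBy slice so the suffix of the index sorts first in the table key.
func (id IndexDefinition) prepForIdxTable() IndexDefinition {
	cols := append([]IndexColumn{}, id.SortBy...)
	cols = append(cols, IndexColumn{Property: "__key__", Direction: ASCENDING})
	for i, j := 0, len(cols)-1; i < j; i, j = i+1, j-1 {
		cols[i], cols[j] = cols[j], cols[i]
	}
	return IndexDefinition{Kind: id.Kind, Ancestor: id.Ancestor, SortBy: cols}
}

func main() {
	idx := IndexDefinition{
		Kind:     "Foo",
		Ancestor: true,
		SortBy: []IndexColumn{
			{Property: "Something", Direction: DESCENDING},
			{Property: "Else", Direction: ASCENDING},
			{Property: "Cool", Direction: ASCENDING},
		},
	}
	for _, c := range idx.prepForIdxTable().SortBy {
		fmt.Println(c.Property, c.Direction)
	}
	// Columns come out as: __key__, Cool, Else, Something.
}
```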
|
| +
|
| +### Index Tables
|
| +
|
| +Every index (both builtin and compound) has one index table per namespace, which
|
| +contains one row per entry in the index.
|
| +- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
|
| +- Key: concatenated datastore.Property values, one per SortBy column in the
|
| + IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
|
| + DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
|
| +- Value: empty
|
| +
|
| +If the IndexDefinition has `Ancestor: true`, then the very first column of the
|
| +Key contains the partial Key for the entity. So if an entity has the datastore
|
| +key `/Some,1/Thing,2/Else,3`, it would have the values `/Some,1`,
|
| +`/Some,1/Thing,2`, and `/Some,1/Thing,2/Else,3` as values in the ancestor column
|
| +of indexes that it matches.
|
| +
|
| +#### Builtin (automatic) indexes
|
| +
|
| +The following indexes are automatically created for some entity with a key
|
| +`/Kind,*`, for every property (with `ShouldIndex` values) named "Foo":
|
| + IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
|
| + {Property: "__key__", Direction: ASCENDING},
|
| + }}
|
| + IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
|
| + {Property: "Foo", Direction: ASCENDING},
|
| + {Property: "__key__", Direction: ASCENDING},
|
| + }}
|
| + IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
|
| + {Property: "Foo", Direction: DESCENDING},
|
| + {Property: "__key__", Direction: ASCENDING},
|
| + }}
|
| +
|
| +Index updates
|
| +-------------
|
| +
|
| +(Note that this is a LARGE departure from how the production appengine datastore
|
| +does this. This model only works because the implementation is not distributed,
|
| +and not journaled. The real datastore does index updates in parallel and is
|
| +generally pretty fancy compared to this).
|
| +
|
| +Index updates are pretty straightforward. On a mutation to the primary entity
|
| +table, we take the old entity value (remember that entity values are
|
| +PropertyMaps), the new entity value, create index entries for both, merge
|
| +them, and apply the deltas to the affected index tables (i.e. entries that
|
| +exist in the old entities, but not the new ones, are deleted. Entries that exist
|
| +in the new entities, but not the old ones, are added).
|
| +
|
| +Index generation works (given a slice of indexes []Idx) by:
|
| +* serializing all ShouldIndex Properties in the PropertyMap to get a
|
| + `map[name][]serializedProperty`.
|
| +* for each index idx
|
| + * if idx's columns contain properties that are not in the map, skip idx
|
| + * make a `[][]serializedProperty`, where each serializedProperty slice
|
| + corresponds with the IndexColumn of idx.SortBy
|
| + * duplicated values for multi-valued properties are skipped.
|
| +  * generate a `[]byte` row which is the concatenation of one value from each
|
| + `[]serializedProperty`, permuting through all combinations. If the SortBy
|
| + column is DESCENDING, make sure to invert (XOR 0xFF) the serializedProperty
|
| +    value!
|
| + * add that generated []byte row to the index's corresponding table.
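The permutation step above can be sketched as follows (readable byte strings stand in for real serialized Property values; `permute` is an illustrative name, not a function in this package):

```go
package main

import (
	"bytes"
	"fmt"
)

// permute generates index rows as described above: given one slice of
// serialized property values per index column, emit the concatenation
// of one value from each column, for every combination.
func permute(cols [][][]byte) [][]byte {
	rows := [][]byte{nil}
	for _, col := range cols {
		next := make([][]byte, 0, len(rows)*len(col))
		for _, row := range rows {
			for _, val := range col {
				var buf bytes.Buffer
				buf.Write(row)
				buf.Write(val)
				next = append(next, buf.Bytes())
			}
		}
		rows = next
	}
	return rows
}

func main() {
	// duck = 1, 2 and goose = "g", with index columns {duck, duck, goose}.
	duck := [][]byte{[]byte("1"), []byte("2")}
	goose := [][]byte{[]byte("g")}
	for _, row := range permute([][][]byte{duck, duck, goose}) {
		fmt.Printf("%s\n", row)
	}
	// Prints 11g, 12g, 21g, 22g.
}
```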
|
| +
|
| +Note that we choose to serialize all permutations of the saved entity. This is
|
| +so that we can use repeated-column indexes to fill queries which use a subset of
|
| +the columns. E.g. if we have the index `duck,duck,duck,goose`, we can
|
| +theoretically use it to fill a query for `duck=1,duck=2,goose>"canadian"`, by
|
| +pasting 1 or 2 as the value for the 3rd `duck` column. This simplifies index
|
| +selection at the expense of larger indexes. However, it means that if you have
|
| +the entity:
|
| + duck = 1, 2, 3, 4
|
| + goose = "færøske"
|
| +
|
| +It generates the following index entries:
|
| + duck=1,duck=1,duck=1,goose="færøske"
|
| + duck=1,duck=1,duck=2,goose="færøske"
|
| + duck=1,duck=1,duck=3,goose="færøske"
|
| + duck=1,duck=1,duck=4,goose="færøske"
|
| + duck=1,duck=2,duck=1,goose="færøske"
|
| + duck=1,duck=2,duck=2,goose="færøske"
|
| + duck=1,duck=2,duck=3,goose="færøske"
|
| + duck=1,duck=2,duck=4,goose="færøske"
|
| + duck=1,duck=3,duck=1,goose="færøske"
|
| + ... a lot ...
|
| + duck=4,duck=4,duck=4,goose="færøske"
|
| +
|
| +This is a very large number of index rows (i.e. an 'exploding index')!
|
| +
|
| +An alternate design would be to only generate unique permutations of elements
|
| +where the index has repeated columns of a single property. This only makes sense
|
| +because it's illegal to have an equality and an inequality on the same property,
|
| +under the current constraints of appengine (though not completely ridiculous in
|
| +general, if inequality constraints meant the same thing as equality constraints;
|
| +however, it would lead to a multi-dimensional query, which can be quite slow and
|
| +is very difficult to scale without application knowledge). If we do this, it
|
| +also means that we need to SORT the equality filter values when generating the
|
| +prefix (so that the least-valued equality constraint is first). If we did this,
|
| +then the generated index rows for the above entity would be:
|
| + duck=1,duck=2,duck=3,goose="færøske"
|
| + duck=1,duck=2,duck=4,goose="færøske"
|
| + duck=1,duck=3,duck=4,goose="færøske"
|
| + duck=2,duck=3,duck=4,goose="færøske"
|
| +
|
| +This would be a LOT more compact. It may be worth implementing this restriction
|
| +later, simply for the memory savings when indexing multi-valued properties.
|
| +
|
| +If this technique is used, there's also room to unambiguously index entities
|
| +with repeated equivalent values. E.g. if duck=1,1,2,3,4 , then you could see
|
| +a row in the index like:
|
| + duck=1,duck=1,duck=2,goose="færøske"
|
| +
|
| +Which would allow you to query for "an entity which has duck values equal to 1,
|
| +1 and 2". Currently such a query is not possible to execute (it would be
|
| +equivalent to "an entity which has duck values equal to 1 and 2").
|
| +
|
| +Query planning
|
| +--------------
|
| +
|
| +Now that we have all of our data tabulated, let's plan some queries. The
|
| +high-level algorithm works like this:
|
| + * Generate a suffix format from the user's query which looks like:
|
| + * orders (including the inequality as the first order, if any)
|
| + * projected fields which aren't explicitly referenced in the orders (we
|
| + assume ASCENDING order for them), in the order that they were projected.
|
| + * `__key__` (implied ascending, unless the query's last sort order is for
|
| + `__key__`, in which case it's whatever order the user specified)
|
| + * Reverse the order of this suffix format, and serialize it into an
|
| + IndexDefinition, along with the query's Kind and Ancestor values. This
|
| + does what PrepForIdxTable did when we added the Index in the first place.
|
| + * Use this serialized reversed index to find compound indexes which might
|
| + match by looking up rows in the "idx" table which begin with this serialized
|
| + reversed index.
|
| + * Generate every builtin index for the inequality + equality filter
|
| + properties, and see if they match too.
|
| +
|
| +An index is a potential match if its suffix *exactly* matches the suffix format,
|
| +and it contains *only* sort orders which appear in the query (i.e. an index is
|
| +rejected if it contains a column which doesn't appear as an equality or
|
| +inequality filter).
|
| +
|
| +The index search continues until:
|
| + * We find at least one matching index; AND
|
| + * The combination of all matching indexes accounts for every equality filter
|
| + at least once.
|
| +
|
| +If we fail to find sufficient indexes to fulfill the query, we generate an index
|
| +description that *could* be sufficient by concatenating all missing equality
|
| +filters, in ascending order, followed by concatenating the suffix format that we
|
| +generated for this query. We then suggest this new index to the user for them to
|
| +add by returning an error containing the generated IndexDefinition. Note that
|
| +this index is not REQUIRED for the user to add; they could choose to add bits
|
| +and pieces of it, extend existing indexes in order to cover the missing columns,
|
| +invert the direction of some of the equality columns, etc.
|
| +
|
| +Recall that equality filters are expressed as
|
| +`map[propName][]serializedProperty`. We'll refer to this mapping as the
|
| +'constraint' mapping below.
|
| +
|
| +To actually come up with the final index selection, we sort all the matching
|
| +indexes from greatest number of columns to least. We add the 0th index (the one
|
| +with the greatest number of columns) unconditionally. We then keep adding indexes
|
| +which contain one or more of the remaining constraints, until we have no
|
| +more constraints to satisfy.
|
| +
|
| +Adding an index entails determining which columns in that index correspond to
|
| +equality columns, and which ones correspond to inequality/order/projection
|
| +columns. Recall that the inequality/order/projection columns are all the same
|
| +for all of the potential indices (i.e. they all share the same *suffix format*).
|
| +We can use this to just iterate over the index's SortBy columns which we'll use
|
| +for equality filters. For each equality column, we remove a corresponding value
|
| +from the constraints map. In the event that we _run out_ of constraints for a
|
| +given column, we simply _pick an arbitrary value_ from the original equality
|
| +filter mapping and use that. This is valid to do because, after all, they're
|
| +equality filters.
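A sketch of this bookkeeping, with hypothetical names (`consume` and the map shapes are illustrative, not from the real package):

```go
package main

import "fmt"

// consume satisfies one equality column of a chosen index: pop the next
// unconsumed value for the property from the constraint mapping, or, if
// the constraints for it ran out, reuse an arbitrary value from the
// original equality filters (valid, since they're all equalities).
func consume(constraints, original map[string][][]byte, prop string) []byte {
	if vals := constraints[prop]; len(vals) > 0 {
		constraints[prop] = vals[1:]
		return vals[0]
	}
	return original[prop][0]
}

func main() {
	original := map[string][][]byte{"duck": {{1}, {2}}}
	constraints := map[string][][]byte{"duck": {{1}, {2}}}
	// An index with three repeated duck equality columns:
	for i := 0; i < 3; i++ {
		fmt.Println(consume(constraints, original, "duck"))
	}
	// Prints [1], [2], then [1] again (the arbitrarily reused value).
}
```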
|
| +
|
| +Note that for compound indexes, the ancestor key counts as an equality filter,
|
| +and if the compound index has `Ancestor: true`, then we implicitly put the
|
| +ancestor as if it were the first SortBy column. For satisfying Ancestor queries
|
| +with built-in indexes, see the next section.
|
| +
|
| +Once we've got our list of constraints for this index, we concatenate them all
|
| +together to get the *prefix* for this index. When iterating over this index, we
|
| +only ever want to see index rows whose prefix exactly matches this. Unlike the
|
| +suffix format, the prefix is per-index (remember that ALL indexes in the
|
| +query must have the same suffix format).
|
| +
|
| +Finally, we set the 'start' and 'end' values for all chosen indexes to either
|
| +the Start and End cursors, or the Greater-Than and Less-Than values for the
|
| +inequality. The Cursors contain values for every suffix column, and the
|
| +inequality only contains a value for the first suffix column. If both cursors
|
| +and an inequality are specified, we take the smaller set of both (the
|
| +combination which will return the fewest rows).
|
| +
|
| +That's about it for index selection! See Query Execution for how we actually use
|
| +the selected indexes to run a query.
|
| +
|
| +### Ancestor queries and Built-in indexes
|
| +
|
| +You may have noticed that the built-in indexes can be used for Ancestor queries
|
| +with equality filters, but they don't start with the magic Ancestor column!
|
| +
|
| +There's a trick that you can do if the suffix format for the query is just
|
| +`__key__` though (e.g. the query only contains equality filters, and/or an
|
| +inequality filter on `__key__`). You can serialize the datastore key that you're
|
| +planning to use for the Ancestor query, then chop off the terminating null byte
|
| +from the encoding, and then use this as additional prefix bytes for this index.
|
| +So if the builtin for the "Val" property has the column format of:
|
| +
|
| + {Property: "Val"}, {Property: "__key__"}
|
| +
|
| +And your query holds Val as an equality filter, you can serialize the
|
| +ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
|
| +an index row:
|
| +
|
| + PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP
|
| +
|
| +(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
|
| +the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:
|
| +
|
| + PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1
|
| +
|
| +Omitting the STOP which normally terminates the Key encoding. Using this prefix
|
| +will only return index rows which are `/Kind,1` or its children.
|
| +
|
| +"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
|
| +Remember that this trick only works if the prefix before the `__key__` is
|
| +*entirely* composed of equality filters. Also recall that if you ONLY use
|
| +equality filters and Ancestor (and possibly an inequality on `__key__`), then
|
| +you can always satisfy the query from the built-in indexes! While you
|
| +technically could do it with a compound index, there's not really a point to
|
| +doing so. To remain faithful to the production datastore implementation, we
|
| +don't implement this trick for anything other than the built-in indexes.
|
| +
|
| +### Cursor format
|
| +
|
| +Cursors work by containing values for each of the columns in the suffix, in the
|
| +order and Direction specified by the suffix. In fact, cursors are just encoded
|
| +versions of the []IndexColumn used for the 'suffix format', followed by the
|
| +raw bytes of the suffix for that particular row (incremented by 1 bit).
|
| +
|
| +This means that technically you can port cursors between any queries which share
|
| +precisely the same suffix format, regardless of other query options, even if the
|
| +index planner ends up choosing different indexes to use from the first query to
|
| +the second. No state is maintained in the service implementation for cursors.
|
| +
|
| +I suspect that this is a more liberal version of cursors than how the production
|
| +appengine implements them, but I haven't verified one way or the other.
|
| +
|
| +Query execution
|
| +---------------
|
| +
|
| +Last but not least, we need to actually execute the query. After figuring out
|
| +which indexes to use with what prefixes and start/end values, we essentially
|
| +have a list of index subsets, all sorted the same way. To pull the values out,
|
| +we start by iterating the first index in the list, grabbing its suffix value,
|
| +and trying to iterate from that suffix in the second, third, fourth, etc index.
|
| +
|
| +If any index iterates past that suffix, we start back at the 0th index with that
|
| +suffix, and continue to try to find a matching row. Doing this will end up
|
| +skipping large portions of all of the indexes in the list. This is the algorithm
|
| +known as "zigzag merge join", and you can find talks on it from some of the
|
| +appengine folks. It has very good algorithmic running time and tends to scale
|
| +with the number of full matches, rather than the size of all of the indexes
|
| +involved.
|
| +
|
| +A hit occurs when all of the iterators have precisely the same suffix. This hit
|
| +suffix is then decoded using the suffix format information. The very last column
|
| +of the suffix will always be the datastore key. The suffix is then used to call
|
| +back to the user, according to the query type:
|
| +* keys-only queries just directly return the Key
|
| +* projection queries return the projected fields from the decoded suffix.
|
| + Remember how we added all the projections after the orders? This is why. The
|
| + projected values are pulled directly from the index, instead of going to the
|
| + main entity table.
|
| +* normal queries pull the decoded Key from the "ents" table, and return that
|
| + entity to the user.
|
|
|