Chromium Code Reviews

Unified Diff: impl/memory/README.md

Issue 1302813003: impl/memory: Implement Queries (Closed) Base URL: https://github.com/luci/gae.git@add_multi_iterator
Patch Set: minor fixes Created 5 years, 4 months ago
Index: impl/memory/README.md
diff --git a/impl/memory/README.md b/impl/memory/README.md
index 719edc947d7145b5ee346f68f1fd95de566bc9f3..0ecc9e214a79c02614e42f2f0d660dd3695c4747 100644
--- a/impl/memory/README.md
+++ b/impl/memory/README.md
@@ -1,114 +1,307 @@
-In-memory appengine wrapper
-===========================
-
-
-Notes on the internal encodings
--------------------------------
-
-All datatypes inside of the index Collections of the gkvlite Store are stored
-in a manner which allows them to be compared entirely via bytes.Compare. All
-types are prefixed by a sortable type byte which encodes the sort-order of types
-according to the appengine SDK. Additionally, types have the following data
-encoding:
- * ints
- * stored with the `funnybase` varint encoding
- * floats
- * http://stereopsis.com/radix.html
- * toBytes:
- ```
- b := math.Float64bits(f)
- return b ^ (-(b >> 63) | 0x8000000000000000)
- ```
- * fromBytes:
- ```
- return math.Float64frombits(b ^ ((b >> 63) - 1) | 0x8000000000000000)
- ```
- * string, []byte, BlobKey, ByteString
- * funnybase byte count
- * raw bytes
- * \*Key, GeoPoint
- * composite of above types
- * time.Time
- * composite of above types, stored with microsecond accuracy.
- * rounding to microseconds is a limitation of the real appengine.
- * toMicro: `return t.Unix()*1e6 + int64(t.Nanosecond()/1e3)`
- * fromMicro: `return time.Unix(t/1e6, (t%1e6)*1e3)`
- * nil, true, false
- * value is encoded directly in the type byte
-
-
-Gkvlite Collection schema
--------------------------
-
-In order to provide efficient result deduplication, the value of an index row
-which indexes 1 or more properties is a concatenation of the previous values
-which would show up in the same index. For example, if you have the property
-list for the key K:
-
- bob: 1
- bob: 4
- bob: 7
- cat: "hello"
- cat: "world"
-
-And the regular (non-ancestor) composite index was {bob, -cat}, you'd have the
-rows in the index `idx:ns:kind|R|bob|-cat` (| in the row indicates
-concatenation, each value has an implied type byte. `...` indicates that other
-rows may be present):
-
- ...
- 1|"world"|K = nil|nil
- ...
- 1|"hello"|K = nil|"world"
- ...
- 4|"world"|K = 1|nil
- ...
- 4|"hello"|K = 1|"world"
- ...
- 7|"world"|K = 4|nil
- ...
- 7|"hello"|K = 4|"world"
- ...
-
-This allows us to, start scanning at any point and be able to determine if we've
-returned a given key already (without storing all of the keys in memory
-for the duration of the Query run). We can do this because we can see if the
-value of an index row falls within the original query filter parameters. If it
-does, then we must already have returned they Key, and can safely skip the index
-row. AFAIK, real-datastore provides deduplication by keeping all the returned
-keys in memory as it runs the query, and doing a set-check.
-
-The end-result is semantically equivalent, with the exception that Query Cursors
-on the real datastore will potentially return the same Key in the first Cursor
-use as well as on the 2nd (or Nth) cursor use, where this method will not.
-
- collections
- ents:ns -> key -> value
- (rootkind, rootid, __entity_group__,1) -> {__version__: int}
- (rootkind, rootid, __entity_group_ids__,1) -> {__version__: int}
- (__entity_group_ids__,1) -> {__version__: int}
- // TODO(iannucci): Journal every entity write in a log with a globally
- // increasing version number (aka "timestamp").
- //
- // TODO(iannucci): Use the value in idx collection to indicate the last
- // global log version reflected in this index. Then index updates can happen
- // in parallel, in a truly eventually-consistent fashion (and completely
- // avoid holding the DB writelock while calculating index entries).
- // Unfortunately, copying new records (and removing old ones) into the DB
- // would still require holding global writelock.
- //
- // TODO(iannucci): how do we garbage-collect the journal?
- //
- // TODO(iannucci): add the ability in gkvlite to 'swap' a collection with
- // another one, transactionally? Not sure if this is possible to do easily.
- // If we had this, then we could do all the index writes for a given index
- // on the side, and then do a quick swap into place with a writelock. As
- // long as every index only has a single goroutine writing it, then this
- // would enable maximum concurrency, since all indexes could update in
- // parallel and only synchronize for the duration of a single pointer swap.
- idx -> kind|A?|[-?prop]* = nil
- idx:ns:kind -> key = nil
- idx:ns:kind|prop -> propval|key = [prev val]
- idx:ns:kind|-prop -> -propval|key = [next val]
- idx:ns:kind|A|?prop|?prop -> A|propval|propval|key = [prev/next val]|[prev/next val]
- idx:ns:kind|?prop|?prop -> propval|propval|key = [prev/next val]|[prev/next val]
+Datastore implementation internals
+==================================
+
+This document contains internal implementation details for this in-memory
+version of datastore. It's mostly useful for understanding what's going on in
+this implementation, but it can also offer some insight into how the real
+appengine datastore works (though note that the specific encodings differ).
+
+Additionally, note that this implementation cheats by moving some of the Key
+bytes into the table (collection) names (like the namespace, the property name
+for the builtin indexes, etc.). The real implementation keeps these bytes in
+the table row keys, I think.
+
+
+Internal datastore key/value collection schema
+----------------------------------------------
+
+The datastore implementation here uses several different tables ('collections')
+to manage state for the data. The schema for these tables is enumerated below
+to make the code a bit easier to reason about.
+
+All datastore user objects (Keys, Properties, PropertyMaps, etc.) are serialized
+using `github.com/luci/gae/service/datastore/serialize`, which in turn uses the
+primitives available in `github.com/luci/luci-go/common/cmpbin`. The encodings
+are important to understanding why the schemas below sort correctly when
+compared only using bytes.Compare (aka `memcmp`). This doc will assume that
dnj 2015/08/28 16:37:54 "Compare (aka `memcmp`)"?
iannucci 2015/08/28 19:48:54 ? it's `bytes.Compare`
+you're familiar with those encodings, but will point out where we diverge from
+the stock encodings.
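+
+Here's a taste of what memcmp-sortable means, using the float64 encoding trick
+(http://stereopsis.com/radix.html, also described in the old notes this doc
+replaces) as a Go sketch. The helper name is ours, not the serialize package's:
+
+    import (
+        "encoding/binary"
+        "math"
+    )
+
+    // toSortableBytes encodes a float64 such that bytes.Compare over two
+    // encodings agrees with the numeric order of the original floats.
+    func toSortableBytes(f float64) []byte {
+        b := math.Float64bits(f)
+        // Negatives get every bit flipped; positives just the sign bit.
+        b ^= -(b >> 63) | 0x8000000000000000
+        out := make([]byte, 8)
+        binary.BigEndian.PutUint64(out, b)
+        return out
+    }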
+
+All encoded Property values used in Keys are serialized `WithoutContext`, and
dnj 2015/08/28 16:37:54 Wording.
iannucci 2015/08/28 19:48:54 Done I think?
+with `ShouldIndex` set.
+
+### Primary table
+
+The primary table maps datastore keys to entities.
+- Name: `"ents:" + namespace`
+- Key: serialized datastore.Property containing the entity's datastore.Key
+- Value: serialized datastore.PropertyMap
+
+This table also encodes values for the following special keys:
+- Every entity root (i.e. a Key with a nil Parent()) with key K has:
+ - `Key("__entity_group__", 1, K)` -> `{"__version__": PTInt}`
+ A child entity with the kind `__entity_group__` and an id of `1`. The value
+ has a single property `__version__`, which contains the version number of
+ this entity group. This is used to detect transaction conflicts.
+ - `Key("__entity_group_ids__", 1, K)` -> `{"__version__": PTInt}`
+    A child entity with the kind `__entity_group_ids__` and an id of `1`. The value
+ has a single property `__version__`, which contains the last automatically
+ allocated entity ID for entities within this entity group.
+- A root entity with the key `Key("__entity_group_ids__",1)` which contains the
+ same `__version__` property, and indicates the last automatically allocated
+ entity ID for root entities.
+
+### Compound Index table
+
+The next table keeps track of all the user-added 'compound' index descriptions
+(not the content for the indexes). There is a row in this table for each
+compound index that the user adds by calling `ds.Raw().Testable().AddIndexes`.
+- Name: `"idx"`
+- Key: normalized, serialized `datastore.IndexDefinition` with the SortBy slice
+ in reverse order (i.e. `datastore.IndexDefinition.PrepForIdxTable()`).
+- Value: empty
+
+The Key format here requires some special attention. Say you started with
+a compound IndexDefinition of:
+    IndexDefinition{
+      Kind: "Foo",
+      Ancestor: true,
+      SortBy: []IndexColumn{
+        {Property: "Something", Direction: DESCENDING},
+        {Property: "Else", Direction: ASCENDING},
+        {Property: "Cool", Direction: ASCENDING},
+      },
+    }
+
+After prepping it for the table, it would be equivalent to:
+After prepping it for the table, it would be equivalent to:
+    IndexDefinition{
+      Kind: "Foo",
+      Ancestor: true,
+      SortBy: []IndexColumn{
+        {Property: "__key__", Direction: ASCENDING},
+        {Property: "Cool", Direction: ASCENDING},
+        {Property: "Else", Direction: ASCENDING},
+        {Property: "Something", Direction: DESCENDING},
+      },
+    }
+
+The reason for doing this will be covered in the `Query Planning` section, but
+it boils down to allowing the query planner to use this encoded table to
+intelligently scan for potentially useful compound indexes.
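+
+The transformation itself is mechanical. A minimal sketch of what the example
+above shows (an illustrative reimplementation, not the real
+`datastore.IndexDefinition.PrepForIdxTable` code):
+
+    // prepSortBy produces "__key__" (ascending) followed by the original
+    // SortBy columns in reverse order, matching the example above.
+    func prepSortBy(sortBy []IndexColumn) []IndexColumn {
+        out := []IndexColumn{{Property: "__key__", Direction: ASCENDING}}
+        for i := len(sortBy) - 1; i >= 0; i-- {
+            out = append(out, sortBy[i])
+        }
+        return out
+    }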
+
+### Index Tables
dnj (Google) 2015/08/28 17:54:21 Comment about repeated tags and padding. e.g., "t
iannucci 2015/08/28 19:48:54 Done. I think we should actually implement that te
+
+Every index (both builtin and compound) has one index table per namespace,
+which contains one row per index entry.
+- Name: `"idx:" + namespace + IndexDefinition.PrepForIdxTable()`
+- Key: concatenated datastore.Property values, one per SortBy column in the
+ IndexDefinition (the non-PrepForIdxTable version). If the SortBy column is
+ DESCENDING, the serialized Property is inverted (e.g. XOR 0xFF).
+- Value: empty
+
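+The DESCENDING inversion is a plain bytewise XOR. A sketch (the helper name is
+ours, not the library's):
+
+    // invert flips every byte of an encoded Property value, so that
+    // bytes.Compare over inverted encodings yields descending order.
+    func invert(enc []byte) []byte {
+        out := make([]byte, len(enc))
+        for i, b := range enc {
+            out[i] = b ^ 0xFF
+        }
+        return out
+    }
+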
+If the IndexDefinition has `Ancestor: true`, then the very first column of the
+Key contains the partial Key for the entity. So if an entity has the datastore
+key `/Some,1/Thing,2/Else,3` (where `/Some,1` denotes a root key: kind "Some",
+id 1, no parent), it would have the values `/Some,1`, `/Some,1/Thing,2`, and
+`/Some,1/Thing,2/Else,3` as values in the ancestor column
dnj 2015/08/28 16:37:54 Explain why? Also the key notation, I'm assuming
iannucci 2015/08/28 19:48:54 "/Some,1" is a key with no parent. It's the root o
+of the indexes that it matches, so an ancestor query rooted at any of those
+partial keys will find this entity.
+
+#### Builtin (automatic) indexes
+
+For an entity with a key `/Kind,*`, the following indexes are automatically
+created: one over `__key__`, plus an ascending and a descending index for every
+`ShouldIndex` property (shown here for a property named "Foo"):
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
+ {Property: "__key__", Direction: ASCENDING},
+ }}
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
+ {Property: "Foo", Direction: ASCENDING},
+ {Property: "__key__", Direction: ASCENDING},
+ }}
+ IndexDefinition{ Kind: "Kind", Ancestor: false, SortBy: []IndexColumn{
+ {Property: "Foo", Direction: DESCENDING},
+ {Property: "__key__", Direction: ASCENDING},
+ }}
+
+Index updates
+-------------
+
+(Note that this is a LARGE departure from how the production appengine datastore
+does this. This model only works because the implementation is not distributed,
+and not journaled. The real datastore does index updates in parallel and is
+generally pretty fancy compared to this).
+
+Index updates are pretty straightforward. On a mutation to the primary entity
dnj 2015/08/28 16:37:54 mutation**
iannucci 2015/08/28 19:48:54 done
+table, we take the old entity value (remember that entity values are
+PropertyMaps) and the new entity value, create index entries for both, merge
+them, and apply the deltas to the affected index tables (i.e. entries that
+exist in the old entity but not the new one are deleted; entries that exist
dnj 2015/08/28 16:37:54 Kind of awkward wording. Maybe: "...but not new ar
iannucci 2015/08/28 19:48:54 Better?
+in the new entity but not the old one are added).
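+
+A sketch of that delta step over a hypothetical key/value table interface
+(the names are ours; the real code is structured differently):
+
+    type indexTable interface {
+        Set(row string, value []byte)
+        Delete(row string)
+    }
+
+    // applyDelta deletes rows present only in the old entity's index
+    // entries and adds rows present only in the new entity's entries.
+    func applyDelta(t indexTable, oldRows, newRows map[string]bool) {
+        for row := range oldRows {
+            if !newRows[row] {
+                t.Delete(row)
+            }
+        }
+        for row := range newRows {
+            if !oldRows[row] {
+                t.Set(row, nil)
+            }
+        }
+    }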
+
+Index generation works (given a slice of indexes) by:
+* serializing all ShouldIndex Properties in the PropertyMap to get a
+  `map[name][]serializedProperty`.
+* for each index idx
+  * if idx's columns contain properties that are not in the map, skip idx
+  * make a `[][]serializedProperty`, where each serializedProperty slice
+    corresponds to an IndexColumn of idx.SortBy
+  * generate a `[]byte` row which is the concatenation of one value from each
+    `[]serializedProperty`, permuting through all combinations (see the sketch
+    after this list). If the SortBy column is DESCENDING, make sure to invert
+    (XOR 0xFF) the serializedProperty value!
+  * add that generated []byte row to the index's corresponding table.
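+
+The permutation step is a cartesian product over the per-column value slices.
+A sketch (a hypothetical helper; assumes values are already inverted for
+DESCENDING columns):
+
+    // indexRows returns one concatenated row per combination of column
+    // values. cols[i] holds the serialized values for SortBy column i.
+    func indexRows(cols [][][]byte) [][]byte {
+        rows := [][]byte{nil}
+        for _, vals := range cols {
+            next := make([][]byte, 0, len(rows)*len(vals))
+            for _, row := range rows {
+                for _, v := range vals {
+                    next = append(next, append(append([]byte(nil), row...), v...))
+                }
+            }
+            rows = next
+        }
+        return rows
+    }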
+
+Query planning
+--------------
+
+Now that we have all of our data tabulated, let's plan some queries. The
+high-level algorithm works like this:
+ * Generate a suffix format from the user's query which looks like:
+ * orders (including the inequality as the first order, if any)
+ * projected fields which aren't explicitly referenced in the orders (we
+ assume ASCENDING order for them), in the order that they were projected.
+ * `__key__` (implied ascending, unless the query's last sort order is for
+ `__key__`, in which case it's whatever order the user specified)
+ * Reverse the order of this suffix format, and serialize it into an
+ IndexDefinition, along with the query's Kind and Ancestor values. This
+ does what PrepForIdxTable did when we added the Index in the first place.
+ * Use this serialized reversed index to find compound indexes which might
+ match by looking up rows in the "idx" table which begin with this serialized
+ reversed index.
+ * Generate every builtin index for the inequality + equality filter
+ properties, and see if they match too.
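+
+A sketch of the suffix-format assembly from the first step (a hypothetical
+helper reusing the IndexColumn names from the examples above; it assumes
+`orders` already lists the inequality property first, per the first bullet):
+
+    // suffixFormat appends, in order: the query's sort orders, any
+    // projected fields not already present (implied ASCENDING), and
+    // finally "__key__" unless the user already ordered by it.
+    func suffixFormat(orders []IndexColumn, projections []string) []IndexColumn {
+        suffix := append([]IndexColumn(nil), orders...)
+        seen := map[string]bool{}
+        for _, o := range orders {
+            seen[o.Property] = true
+        }
+        for _, p := range projections {
+            if !seen[p] {
+                seen[p] = true
+                suffix = append(suffix, IndexColumn{Property: p, Direction: ASCENDING})
+            }
+        }
+        if !seen["__key__"] {
+            suffix = append(suffix, IndexColumn{Property: "__key__", Direction: ASCENDING})
+        }
+        return suffix
+    }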
+
+An index is a potential match if its suffix *exactly* matches the suffix format,
+and it contains *only* sort orders which appear in the query (i.e. an index is
+rejected if it contains a column which doesn't appear as an equality or
+inequality filter).
+
+The index search continues until:
+ * We find at least one matching index; AND
+ * The combination of all matching indexes accounts for every equality filter
+ at least once.
+
+If we fail to find sufficient indexes to fulfill the query, we generate a
dnj 2015/08/28 16:37:54 fulfill
iannucci 2015/08/28 19:48:54 done
+suggested 'missing' index by concatenating all missing equality filters (in
dnj 2015/08/28 16:37:54 Not clear, maybe "a suggested index description wh
iannucci 2015/08/28 19:48:54 How about this?
+ascending order, though it doesn't matter if the user later wants to make them
+descending), followed by the suffix format that we generated for this query.
+
+Recall that equality filters are expressed as
+`map[propName][]serializedProperty`. We'll refer to this mapping as the
+'constraint' mapping below.
+
+To actually come up with the final index selection, we sort all the matching
+indexes from greatest number of columns to least. We add the 0th index (the one
+with the greatest number of columns) unconditionally. We then keep adding indexes
+which contain one or more of the remaining constraints, until we have no
+more constraints to satisfy.
+
+Adding an index entails determining which columns in that index correspond to
+equality columns, and which ones correspond to inequality/order/projection
+columns. Recall that the inequality/order/projection columns are all the same
+for all of the potential indices (i.e. they all share the same *suffix format*).
+This means we only need to iterate over the index's leading SortBy columns;
+those are its equality columns. For each equality column, we remove a
+corresponding value from the constraints map. In the event that we _run out_
+of constraints for a
+given column, we simply _pick an arbitrary value_ from the original equality
+filter mapping and use that. This is valid to do because, after all, they're
+equality filters.
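+
+A sketch of that greedy loop (hypothetical names; constraint counts stand in
+for the `map[propName][]serializedProperty` mapping, and the Ancestor column
+is ignored for brevity):
+
+    // chooseIndexes takes the matched indexes sorted widest-first and a
+    // count of unsatisfied values per equality property. numSuffixCols is
+    // the length of the shared suffix format; every matched index has at
+    // least that many columns, and everything before the suffix in its
+    // SortBy is an equality column.
+    func chooseIndexes(matched []*IndexDefinition, constraints map[string]int, numSuffixCols int) []*IndexDefinition {
+        take := func(idx *IndexDefinition) bool {
+            took := false
+            for _, col := range idx.SortBy[:len(idx.SortBy)-numSuffixCols] {
+                if constraints[col.Property] > 0 {
+                    if constraints[col.Property]--; constraints[col.Property] == 0 {
+                        delete(constraints, col.Property)
+                    }
+                    took = true
+                }
+                // When a column's constraints are exhausted, the planner
+                // reuses an arbitrary value from the original filter.
+            }
+            return took
+        }
+        chosen := []*IndexDefinition{matched[0]}
+        take(matched[0]) // the widest index is added unconditionally
+        for _, idx := range matched[1:] {
+            if len(constraints) == 0 {
+                break
+            }
+            if take(idx) {
+                chosen = append(chosen, idx)
+            }
+        }
+        return chosen
+    }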
+
+Note that for compound indexes, the ancestor key counts as an equality filter,
+and if the compound index has `Ancestor: true`, then we implicitly put the
+ancestor as if it were the first SortBy column. For satisfying Ancestor queries
+with built-in indexes, see the next section.
+
+Once we've got our list of constraints for this index, we concatenate them all
+together to get the *prefix* for this index. When iterating over this index, we
+only ever want to see index rows whose prefix exactly matches this. Unlike the
+suffix format, the prefix is per-index (remember that ALL indexes in the
+query must have the same suffix format).
+
+Finally, we set the 'start' and 'end' values for all chosen indexes to either
+the Start and End cursors, or the Greater-Than and Less-Than values for the
+inequality. The Cursors contain values for every suffix column, and the
+inequality only contains a value for the first suffix column. If both cursors
+and an inequality are specified, we take the tighter of the two bounds (the
+combination which will return the fewest rows).
+
+That's about it for index selection! See Query Execution for how we actually use
+the selected indexes to run a query.
+
+### Ancestor queries and Built-in indexes
+
+You may have noticed that the built-in indexes can be used for Ancestor queries
+with equality filters, but they don't start with the magic Ancestor column!
+
+There's a trick that you can do if the suffix format for the query is just
+`__key__` though (e.g. the query only contains equality filters, and/or an
+inequality filter on `__key__`). You can serialize the datastore key that you're
+planning to use for the Ancestor query, then chop off the terminating null byte
+from the encoding, and then use this as additional prefix bytes for this index.
+So if the builtin for the "Val" property has the column format of:
+
+ {Property: "Val"}, {Property: "__key__"}
+
+And your query holds Val as an equality filter, you can serialize the
+ancestor key (say `/Kind,1`), and add those bytes to the prefix. So if you had
+an index row:
+
+ PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1 ++ CONTINUE ++ "Child" ++ 2 ++ STOP
+
+(where CONTINUE is the byte 0x01, and STOP is 0x00), you can form a prefix for
+the query `Query("Kind").Ancestor(Key(Kind, 1)).Filter("Val =", 100)` as:
+
+ PTInt ++ 100 ++ PTKey ++ "Kind" ++ 1
+
+Omitting the STOP which normally terminates the Key encoding. Using this prefix
+will only return index rows which are `/Kind,1` or its children.
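+
+In code, the chop is trivial. A sketch (the helper name is ours; how the key
+bytes were produced is assumed):
+
+    // ancestorPrefix turns an encoded key (terminated by STOP, 0x00) into
+    // a scan prefix matching that key and all of its descendants.
+    func ancestorPrefix(encodedKey []byte) []byte {
+        return encodedKey[:len(encodedKey)-1] // drop the trailing STOP
+    }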
+
+"That's cool! Why not use this trick for compound indexes?", I hear you ask :)
+Remember that this trick only works if the prefix before the `__key__` is
+*entirely* composed of equality filters. Also recall that if you ONLY use
+equality filters and Ancestor (and possibly an inequality on `__key__`), then
+you can always satisfy the query from the built-in indexes! While you
+technically could do it with a compound index, there's not really a point to
+doing so. To remain faithful to the production datastore implementation, we
+don't implement this trick for anything other than the built-in indexes.
+
+### Cursor format
+
+Cursors work by containing values for each of the columns in the suffix, in the
+order and Direction specified by the suffix. In fact, cursors are just encoded
+versions of the []IndexColumn used for the 'suffix format', followed by the
+raw bytes of the suffix for that particular row (incremented by 1 bit).
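+
+One plausible reading of "incremented by 1 bit" is a bytewise successor, so
+that resuming from the cursor excludes the row it was minted from. A sketch
+under that assumption:
+
+    // successor returns the next byte string of the same length under
+    // bytes.Compare, carrying as needed.
+    func successor(enc []byte) []byte {
+        out := append([]byte(nil), enc...)
+        for i := len(out) - 1; i >= 0; i-- {
+            out[i]++
+            if out[i] != 0 {
+                return out // no carry past this byte
+            }
+        }
+        // enc was entirely 0xFF bytes; append a zero so the result still
+        // sorts after enc.
+        return append(append([]byte(nil), enc...), 0)
+    }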
+
+This means that technically you can port cursors between any queries which share
+precisely the same suffix format, regardless of other query options, even if the
+index planner ends up choosing different indexes to use from the first query to
+the second. No state is maintained in the service implementation for cursors.
+
+I suspect that this is a more liberal version of cursors than how the production
+appengine implements them, but I haven't verified one way or the other.
+
+Query execution
+---------------
+
+Last but not least, we need to actually execute the query. After figuring out
+which indexes to use with what prefixes and start/end values, we essentially
+have a list of index subsets, all sorted the same way. To pull the values out,
+we start by iterating the first index in the list, grabbing its suffix value,
+and trying to iterate from that suffix in the second, third, fourth, etc. index.
+
+If any index iterates past that suffix, we start back at the 0th index with that
+suffix, and continue to try to find a matching row. Doing this will end up
+skipping large portions of all of the indexes in the list. This is the algorithm
+known as "zigzag merge join", and you can find talks on it from some of the
+appengine folks. It has very good algorithmic running time and tends to scale
+with the number of full matches, rather than the size of all of the indexes
+involved.
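+
+A toy version of the zigzag loop over in-memory sorted slices (illustrative
+only: the real iterators stream rows and carry prefixes and start/end bounds):
+
+    import "bytes"
+
+    // zigzag returns the suffixes present in every sorted [][]byte slice.
+    func zigzag(indexes [][][]byte) (hits [][]byte) {
+        pos := make([]int, len(indexes))
+        // seek advances index i to the first row >= target.
+        seek := func(i int, target []byte) bool {
+            for pos[i] < len(indexes[i]) && bytes.Compare(indexes[i][pos[i]], target) < 0 {
+                pos[i]++
+            }
+            return pos[i] < len(indexes[i])
+        }
+        for pos[0] < len(indexes[0]) {
+            target, matched := indexes[0][pos[0]], true
+            for i := 1; i < len(indexes); i++ {
+                if !seek(i, target) {
+                    return // some index is exhausted; no more hits
+                }
+                if !bytes.Equal(indexes[i][pos[i]], target) {
+                    // Overshot: restart the scan from the larger value.
+                    if !seek(0, indexes[i][pos[i]]) {
+                        return
+                    }
+                    matched = false
+                    break
+                }
+            }
+            if matched {
+                hits = append(hits, target)
+                pos[0]++ // move past this hit
+            }
+        }
+        return
+    }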
+
+A hit occurs when all of the iterators have precisely the same suffix. This hit
+suffix is then decoded using the suffix format information. The very last column
+of the suffix will always be the datastore key. The suffix is then used to call
+back to the user, according to the query type:
+* keys-only queries just directly return the Key
+* projection queries return the projected fields from the decoded suffix.
+ Remember how we added all the projections after the orders? This is why. The
+ projected values are pulled directly from the index, instead of going to the
+ main entity table.
+* normal queries pull the decoded Key from the "ents" table, and return that
+ entity to the user.