third_party/sqlite/sqlite-src-3080704/src/vdbesort.c - Issue 883353008: [sql] Import reference version of SQLite 3.8.7.4.

Side by Side Diff: third_party/sqlite/sqlite-src-3080704/src/vdbesort.c

Issue 883353008: [sql] Import reference version of SQLite 3.8.7.4. (Closed) Base URL: http://chromium.googlesource.com/chromium/src.git@master

Patch Set: Hold back encoding change which is messing up patch. Created 5 years, 10 months ago

OLD	NEW
(Empty)
	1 /*

	2 ** 2011-07-09

	3 **

	4 ** The author disclaims copyright to this source code. In place of

	5 ** a legal notice, here is a blessing:

	6 **

	7 ** May you do good and not evil.

	8 ** May you find forgiveness for yourself and forgive others.

	9 ** May you share freely, never taking more than you give.

	10 **

	11 *************************************************************************

	12 ** This file contains code for the VdbeSorter object, used in concert with

	13 ** a VdbeCursor to sort large numbers of keys for CREATE INDEX statements

	14 ** or by SELECT statements with ORDER BY clauses that cannot be satisfied

	15 ** using indexes and without LIMIT clauses.

	16 **

	17 ** The VdbeSorter object implements a multi-threaded external merge sort

	18 ** algorithm that is efficient even if the number of elements being sorted

	19 ** exceeds the available memory.

	20 **

	21 ** Here is the (internal, non-API) interface between this module and the

	22 ** rest of the SQLite system:

	23 **

	24 ** sqlite3VdbeSorterInit() Create a new VdbeSorter object.

	25 **

	26 ** sqlite3VdbeSorterWrite() Add a single new row to the VdbeSorter

	27 ** object. The row is a binary blob in the

	28 ** OP_MakeRecord format that contains both

	29 ** the ORDER BY key columns and result columns

	30 ** in the case of a SELECT w/ ORDER BY, or

	31 ** the complete record for an index entry

	32 ** in the case of a CREATE INDEX.

	33 **

	34 ** sqlite3VdbeSorterRewind() Sort all content previously added.

	35 ** Position the read cursor on the

	36 ** first sorted element.

	37 **

	38 ** sqlite3VdbeSorterNext() Advance the read cursor to the next sorted

	39 ** element.

	40 **

	41 ** sqlite3VdbeSorterRowkey() Return the complete binary blob for the

	42 ** row currently under the read cursor.

	43 **

	44 ** sqlite3VdbeSorterCompare() Compare the binary blob for the row

	45 ** currently under the read cursor against

	46 ** another binary blob X and report if

	47 ** X is strictly less than the read cursor.

	48 ** Used to enforce uniqueness in a

	49 ** CREATE UNIQUE INDEX statement.

	50 **

	51 ** sqlite3VdbeSorterClose() Close the VdbeSorter object and reclaim

	52 ** all resources.

	53 **

	54 ** sqlite3VdbeSorterReset() Refurbish the VdbeSorter for reuse. This

	55 ** is like Close() followed by Init() only

	56 ** much faster.

	57 **

	58 ** The interfaces above must be called in a particular order. Write() can

	59 ** only occur in between Init()/Reset() and Rewind(). Next(), Rowkey(), and

	60 ** Compare() can only occur in between Rewind() and Close()/Reset(). i.e.

	61 **

	62 ** Init()

	63 ** for each record: Write()

	64 ** Rewind()

	65 ** Rowkey()/Compare()

	66 ** Next()

	67 ** Close()

	68 **

	69 ** Algorithm:

	70 **

	71 ** Records passed to the sorter via calls to Write() are initially held

	72 ** unsorted in main memory. Assuming the amount of memory used never exceeds

	73 ** a threshold, when Rewind() is called the set of records is sorted using

	74 ** an in-memory merge sort. In this case, no temporary files are required

	75 ** and subsequent calls to Rowkey(), Next() and Compare() read records

	76 ** directly from main memory.

	77 **

	78 ** If the amount of space used to store records in main memory exceeds the

	79 ** threshold, then the set of records currently in memory are sorted and

	80 ** written to a temporary file in "Packed Memory Array" (PMA) format.

	81 ** A PMA created at this point is known as a "level-0 PMA". Higher levels

	82 ** of PMAs may be created by merging existing PMAs together - for example

	83 ** merging two or more level-0 PMAs together creates a level-1 PMA.

	84 **

	85 ** The threshold for the amount of main memory to use before flushing

	86 ** records to a PMA is roughly the same as the limit configured for the

	87 ** page-cache of the main database. Specifically, the threshold is set to

	88 ** the value returned by "PRAGMA main.page_size" multiplied by

	89 ** that returned by "PRAGMA main.cache_size", in bytes.

	90 **

	91 ** If the sorter is running in single-threaded mode, then all PMAs generated

	92 ** are appended to a single temporary file. Or, if the sorter is running in

	93 ** multi-threaded mode then up to (N+1) temporary files may be opened, where

	94 ** N is the configured number of worker threads. In this case, instead of

	95 ** sorting the records and writing the PMA to a temporary file itself, the

	96 ** calling thread usually launches a worker thread to do so. Except, if

	97 ** there are already N worker threads running, the main thread does the work

	98 ** itself.

	99 **

	100 ** The sorter is running in multi-threaded mode if (a) the library was built

	101 ** with pre-processor symbol SQLITE_MAX_WORKER_THREADS set to a value greater

	102 ** than zero, and (b) worker threads have been enabled at runtime by calling

	103 ** sqlite3_config(SQLITE_CONFIG_WORKER_THREADS, ...).

	104 **

	105 ** When Rewind() is called, any data remaining in memory is flushed to a

	106 ** final PMA. So at this point the data is stored in some number of sorted

	107 ** PMAs within temporary files on disk.

	108 **

	109 ** If there are fewer than SORTER_MAX_MERGE_COUNT PMAs in total and the

	110 ** sorter is running in single-threaded mode, then these PMAs are merged

	111 ** incrementally as keys are retrieved from the sorter by the VDBE. The

	112 ** MergeEngine object, described in further detail below, performs this

	113 ** merge.

	114 **

	115 ** Or, if running in multi-threaded mode, then a background thread is

	116 ** launched to merge the existing PMAs. Once the background thread has

	117 ** merged T bytes of data into a single sorted PMA, the main thread

	118 ** begins reading keys from that PMA while the background thread proceeds

	119 ** with merging the next T bytes of data. And so on.

	120 **

	121 ** Parameter T is set to half the value of the memory threshold used

	122 ** by Write() above to determine when to create a new PMA.

	123 **

	124 ** If there are more than SORTER_MAX_MERGE_COUNT PMAs in total when

	125 ** Rewind() is called, then a hierarchy of incremental-merges is used.

	126 ** First, T bytes of data from the first SORTER_MAX_MERGE_COUNT PMAs on

	127 ** disk are merged together. Then T bytes of data from the second set, and

	128 ** so on, such that no operation ever merges more than SORTER_MAX_MERGE_COUNT

	129 ** PMAs at a time. This is done to improve locality.

	130 **

	131 ** If running in multi-threaded mode and there are more than

	132 ** SORTER_MAX_MERGE_COUNT PMAs on disk when Rewind() is called, then more

	133 ** than one background thread may be created. Specifically, there may be

	134 ** one background thread for each temporary file on disk, and one background

	135 ** thread to merge the output of each of the others to a single PMA for

	136 ** the main thread to read from.

	137 */
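
The flush-then-merge flow described in the header comment above can be sketched in a few lines. This is only an illustration under loud assumptions: the names `Sorter`, `Run`, `THRESHOLD`, `MAX_RUNS` and the `sorter*()` functions are hypothetical and are not the SQLite interfaces; records are plain ints; sorted runs are kept in heap arrays instead of PMAs in temporary files; `qsort()` stands in for the in-memory merge sort; and a linear scan over the runs stands in for the MergeEngine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define THRESHOLD 4    /* hypothetical: max records held in memory */
#define MAX_RUNS  16   /* hypothetical: max sorted runs kept */

typedef struct Run { int *a; int n; int iRead; } Run;  /* stands in for a PMA */
typedef struct Sorter {
  int aMem[THRESHOLD];   /* unsorted records not yet flushed */
  int nMem;              /* number of records in aMem[] */
  Run aRun[MAX_RUNS];    /* sorted runs, oldest first */
  int nRun;              /* number of runs in aRun[] */
} Sorter;

static int cmpInt(const void *pA, const void *pB){
  int x = *(const int*)pA, y = *(const int*)pB;
  return (x>y) - (x<y);
}

/* Sort the in-memory records and save them as a new run. */
static int sorterFlush(Sorter *p){
  Run *pRun;
  if( p->nMem==0 ) return 0;
  if( p->nRun>=MAX_RUNS ) return -1;
  qsort(p->aMem, (size_t)p->nMem, sizeof(int), cmpInt);
  pRun = &p->aRun[p->nRun];
  pRun->a = malloc(sizeof(int)*(size_t)p->nMem);
  if( pRun->a==0 ) return -1;
  memcpy(pRun->a, p->aMem, sizeof(int)*(size_t)p->nMem);
  pRun->n = p->nMem;
  pRun->iRead = 0;
  p->nRun++;
  p->nMem = 0;
  return 0;
}

/* Add one record, flushing first if the in-memory buffer is full. */
static int sorterWrite(Sorter *p, int v){
  if( p->nMem==THRESHOLD && sorterFlush(p)!=0 ) return -1;
  assert( p->nMem<THRESHOLD );
  p->aMem[p->nMem++] = v;
  return 0;
}

/* Flush whatever remains so that all data sits in sorted runs. */
static int sorterRewind(Sorter *p){ return sorterFlush(p); }

/* Fetch the next record in sorted order. Returns 1 on success, 0 at EOF. */
static int sorterNext(Sorter *p, int *pOut){
  int i, iBest = -1;
  for(i=0; i<p->nRun; i++){
    Run *r = &p->aRun[i];
    if( r->iRead>=r->n ) continue;            /* this run is at EOF */
    if( iBest<0 || r->a[r->iRead] < p->aRun[iBest].a[p->aRun[iBest].iRead] ){
      iBest = i;
    }
  }
  if( iBest<0 ) return 0;
  *pOut = p->aRun[iBest].a[p->aRun[iBest].iRead++];
  return 1;
}

/* Release all runs and zero the object. */
static void sorterClose(Sorter *p){
  int i;
  for(i=0; i<p->nRun; i++) free(p->aRun[i].a);
  memset(p, 0, sizeof(*p));
}
```

Unlike the real interface, where Rewind() positions the cursor on the first element and Rowkey() reads it, this sketch folds reading and advancing into a single `sorterNext()` call to stay short.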

	138 #include "sqliteInt.h"

	139 #include "vdbeInt.h"

	140

	141 /*

	142 ** If SQLITE_DEBUG_SORTER_THREADS is defined, this module outputs various

	143 ** messages to stderr that may be helpful in understanding the performance

	144 ** characteristics of the sorter in multi-threaded mode.

	145 */

	146 #if 0

	147 # define SQLITE_DEBUG_SORTER_THREADS 1

	148 #endif

	149

	150 /*

	151 ** Private objects used by the sorter

	152 */

	153 typedef struct MergeEngine MergeEngine; /* Merge PMAs together */

	154 typedef struct PmaReader PmaReader; /* Incrementally read one PMA */

	155 typedef struct PmaWriter PmaWriter; /* Incrementally write one PMA */

	156 typedef struct SorterRecord SorterRecord; /* A record being sorted */

	157 typedef struct SortSubtask SortSubtask; /* A sub-task in the sort process */

	158 typedef struct SorterFile SorterFile; /* Temporary file object wrapper */

	159 typedef struct SorterList SorterList; /* In-memory list of records */

	160 typedef struct IncrMerger IncrMerger; /* Read & merge multiple PMAs */

	161

	162 /*

	163 ** A container for a temp file handle and the current amount of data

	164 ** stored in the file.

	165 */

	166 struct SorterFile {

	167 sqlite3_file *pFd; /* File handle */

	168 i64 iEof; /* Bytes of data stored in pFd */

	169 };

	170

	171 /*

	172 ** An in-memory list of objects to be sorted.

	173 **

	174 ** If aMemory==0 then each object is allocated separately and the objects

	175 ** are connected using SorterRecord.u.pNext. If aMemory!=0 then all objects

	176 ** are stored in the aMemory[] bulk memory, one right after the other, and

	177 ** are connected using SorterRecord.u.iNext.

	178 */

	179 struct SorterList {

	180 SorterRecord *pList; /* Linked list of records */

	181 u8 *aMemory; /* If non-NULL, bulk memory to hold pList */

	182 int szPMA; /* Size of pList as PMA in bytes */

	183 };

	184

	185 /*

	186 ** The MergeEngine object is used to combine two or more smaller PMAs into

	187 ** one big PMA using a merge operation. Separate PMAs all need to be

	188 ** combined into one big PMA in order to be able to step through the sorted

	189 ** records in order.

	190 **

	191 ** The aReadr[] array contains a PmaReader object for each of the PMAs being

	192 ** merged. An aReadr[] object either points to a valid key or else is at EOF.

	193 ** ("EOF" means "End Of File". When aReadr[] is at EOF there is no more data.)

	194 ** For the purposes of the paragraphs below, we assume that the array is

	195 ** actually N elements in size, where N is the smallest power of 2 greater

	196 ** than or equal to the number of PMAs being merged. The extra aReadr[] elements

	197 ** are treated as if they are empty (always at EOF).

	198 **

	199 ** The aTree[] array is also N elements in size. The value of N is stored in

	200 ** the MergeEngine.nTree variable.

	201 **

	202 ** The final (N/2) elements of aTree[] contain the results of comparing

	203 ** pairs of PMA keys together. Element i contains the result of

	204 ** comparing aReadr[2i-N] and aReadr[2i-N+1]. Whichever key is smaller, the

	205 ** aTree element is set to the index of it.

	206 **

	207 ** For the purposes of this comparison, EOF is considered greater than any

	208 ** other key value. If the keys are equal (only possible with two EOF

	209 ** values), it doesn't matter which index is stored.

	210 **

	211 ** The (N/4) elements of aTree[] that precede the final (N/2) described

	212 ** above contain the index of the smallest of each block of 4 PmaReaders.

	213 ** And so on. So that aTree[1] contains the index of the PmaReader that

	214 ** currently points to the smallest key value. aTree[0] is unused.

	215 **

	216 ** Example:

	217 **

	218 ** aReadr[0] -> Banana

	219 ** aReadr[1] -> Feijoa

	220 ** aReadr[2] -> Elderberry

	221 ** aReadr[3] -> Currant

	222 ** aReadr[4] -> Grapefruit

	223 ** aReadr[5] -> Apple

	224 ** aReadr[6] -> Durian

	225 ** aReadr[7] -> EOF

	226 **

	227 ** aTree[] = { X, 5   0, 5    0, 3, 5, 6 }

	228 **

	229 ** The current element is "Apple" (the value of the key indicated by

	230 ** PmaReader 5). When the Next() operation is invoked, PmaReader 5 will

	231 ** be advanced to the next key in its segment. Say the next key is

	232 ** "Eggplant":

	233 **

	234 ** aReadr[5] -> Eggplant

	235 **

	236 ** The contents of aTree[] are updated first by comparing the new PmaReader

	237 ** 5 key to the current key of PmaReader 4 (still "Grapefruit"). The PmaReader

	238 ** 5 value is still smaller, so aTree[6] is set to 5. And so on up the tree.

	239 ** The value of PmaReader 6 - "Durian" - is now smaller than that of PmaReader

	240 ** 5, so aTree[3] is set to 6. Key 0 is smaller than key 6 (Banana<Durian),

	241 ** so the value written into element 1 of the array is 0. As follows:

	242 **

	243 ** aTree[] = { X, 0   0, 6    0, 3, 5, 6 }

	244 **

	245 ** In other words, each time we advance to the next sorter element, log2(N)

	246 ** key comparison operations are required, where N is the number of segments

	247 ** being merged (rounded up to the next power of 2).

	248 */

	249 struct MergeEngine {

	250 int nTree; /* Used size of aTree/aReadr (power of 2) */

	251 SortSubtask *pTask; /* Used by this thread only */

	252 int *aTree; /* Current state of incremental merge */

	253 PmaReader *aReadr; /* Array of PmaReaders to merge data from */

	254 };
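
The aTree[] layout described in the MergeEngine comment can be checked with a small stand-alone sketch that reproduces the fruit example. Everything here is an assumption made for illustration: `aKey[]` holds C strings in place of PmaReader objects, a NULL key stands for EOF, and the helper names `smaller()`, `setSlot()`, `buildTree()` and `advance()` are hypothetical, not functions from this file.

```c
#include <assert.h>
#include <string.h>

#define N_READERS 8   /* number of readers; a power of two in this sketch */

static const char *aKey[N_READERS];  /* current key per reader; NULL means EOF */
static int aTree[N_READERS];         /* aTree[0] is unused */

/* Return the index of the reader with the smaller key. EOF sorts last. */
static int smaller(int i, int j){
  if( aKey[i]==0 ) return j;
  if( aKey[j]==0 ) return i;
  return strcmp(aKey[i], aKey[j])<=0 ? i : j;
}

/* Recompute one tree slot from its two children. Slots in the final
** N/2 positions compare readers directly; the others compare the
** winners stored in their two child slots. */
static void setSlot(int iSlot){
  int iLeft, iRight;
  if( iSlot>=N_READERS/2 ){
    iLeft = 2*iSlot - N_READERS;
    iRight = iLeft + 1;
  }else{
    iLeft = aTree[2*iSlot];
    iRight = aTree[2*iSlot+1];
  }
  aTree[iSlot] = smaller(iLeft, iRight);
}

/* Populate the whole tree, leaves first. */
static void buildTree(void){
  int i;
  for(i=N_READERS-1; i>=1; i--) setSlot(i);
}

/* Move reader iReader to a new key, then repair only the slots on the
** path from that reader up to aTree[1]. */
static void advance(int iReader, const char *zNewKey){
  int i;
  assert( iReader>=0 && iReader<N_READERS );
  aKey[iReader] = zNewKey;
  for(i=(N_READERS+iReader)/2; i>=1; i/=2) setSlot(i);
}
```

With the eight keys from the example, `buildTree()` leaves reader 5 ("Apple") in aTree[1]. Calling `advance(5, "Eggplant")` recomputes only slots 6, 3 and 1, which is the log2(N) = 3 comparisons the comment refers to.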

	255

	256 /*

	257 ** This object represents a single thread of control in a sort operation.

	258 ** Exactly VdbeSorter.nTask instances of this object are allocated

	259 ** as part of each VdbeSorter object. Instances are never allocated any

	260 ** other way. VdbeSorter.nTask is set to the number of worker threads allowed

	261 ** (see SQLITE_CONFIG_WORKER_THREADS) plus one (the main thread). Thus for

	262 ** single-threaded operation, there is exactly one instance of this object

	263 ** and for multi-threaded operation there are two or more instances.

	264 **

	265 ** Essentially, this structure contains all those fields of the VdbeSorter

	266 ** structure for which each thread requires a separate instance. For example,

	267 ** each thread requires its own UnpackedRecord object to unpack records into

	268 ** as part of comparison operations.

	269 **

	270 ** Before a background thread is launched, variable bDone is set to 0. Then,

	271 ** right before it exits, the thread itself sets bDone to 1. This is used for

	272 ** two purposes:

	273 **

	274 ** 1. When flushing the contents of memory to a level-0 PMA on disk, to

	275 ** attempt to select a SortSubtask for which there is not already an

	276 ** active background thread (since doing so causes the main thread

	277 ** to block until it finishes).

	278 **

	279 ** 2. If SQLITE_DEBUG_SORTER_THREADS is defined, to determine if a call

	280 ** to sqlite3ThreadJoin() is likely to block. Cases that are likely to

	281 ** block provoke debugging output.

	282 **

	283 ** In both cases, the effects of the main thread seeing (bDone==0) even

	284 ** after the thread has finished are not dire. So we don't worry about

	285 ** memory barriers and such here.

	286 */

	287 struct SortSubtask {

	288 SQLiteThread *pThread; /* Background thread, if any */

	289 int bDone; /* Set if thread is finished but not joined */

	290 VdbeSorter *pSorter; /* Sorter that owns this sub-task */

	291 UnpackedRecord *pUnpacked; /* Space to unpack a record */

	292 SorterList list; /* List for thread to write to a PMA */

	293 int nPMA; /* Number of PMAs currently in file */

	294 SorterFile file; /* Temp file for level-0 PMAs */

	295 SorterFile file2; /* Space for other PMAs */

	296 };

	297

	298 /*

	299 ** Main sorter structure. A single instance of this is allocated for each

	300 ** sorter cursor created by the VDBE.

	301 **

	302 ** mxKeysize:

	303 ** As records are added to the sorter by calls to sqlite3VdbeSorterWrite(),

	304 ** this variable is updated so as to be set to the size on disk of the

	305 ** largest record in the sorter.

	306 */

	307 struct VdbeSorter {

	308 int mnPmaSize; /* Minimum PMA size, in bytes */

	309 int mxPmaSize; /* Maximum PMA size, in bytes. 0==no limit */

	310 int mxKeysize; /* Largest serialized key seen so far */

	311 int pgsz; /* Main database page size */

	312 PmaReader *pReader; /* Read data from here after Rewind() */

	313 MergeEngine *pMerger; /* Or here, if bUseThreads==0 */

	314 sqlite3 *db; /* Database connection */

	315 KeyInfo *pKeyInfo; /* How to compare records */

	316 UnpackedRecord *pUnpacked; /* Used by VdbeSorterCompare() */

	317 SorterList list; /* List of in-memory records */

	318 int iMemory; /* Offset of free space in list.aMemory */

	319 int nMemory; /* Size of list.aMemory allocation in bytes */

	320 u8 bUsePMA; /* True if one or more PMAs created */

	321 u8 bUseThreads; /* True to use background threads */

	322 u8 iPrev; /* Previous thread used to flush PMA */

	323 u8 nTask; /* Size of aTask[] array */

	324 SortSubtask aTask[1]; /* One or more subtasks */

	325 };

	326

	327 /*

	328 ** An instance of the following object is used to read records out of a

	329 ** PMA, in sorted order. The next key to be read is cached in nKey/aKey.

	330 ** aKey might point into aMap or into aBuffer. If neither of those locations

	331 ** contain a contiguous representation of the key, then aAlloc is allocated

	332 ** and the key is copied into aAlloc and aKey is made to point to aAlloc.

	333 **

	334 ** pFd==0 at EOF.

	335 */

	336 struct PmaReader {

	337 i64 iReadOff; /* Current read offset */

	338 i64 iEof; /* 1 byte past EOF for this PmaReader */

	339 int nAlloc; /* Bytes of space at aAlloc */

	340 int nKey; /* Number of bytes in key */

	341 sqlite3_file *pFd; /* File handle we are reading from */

	342 u8 *aAlloc; /* Space for aKey if aBuffer and aMap won't work */

	343 u8 *aKey; /* Pointer to current key */

	344 u8 *aBuffer; /* Current read buffer */

	345 int nBuffer; /* Size of read buffer in bytes */

	346 u8 *aMap; /* Pointer to mapping of entire file */

	347 IncrMerger *pIncr; /* Incremental merger */

	348 };

	349

	350 /*

	351 ** Normally, a PmaReader object iterates through an existing PMA stored

	352 ** within a temp file. However, if the PmaReader.pIncr variable points to

	353 ** an object of the following type, it may be used to iterate/merge through

	354 ** multiple PMAs simultaneously.

	355 **

	356 ** There are two types of IncrMerger object - single (bUseThread==0) and

	357 ** multi-threaded (bUseThread==1).

	358 **

	359 ** A multi-threaded IncrMerger object uses two temporary files - aFile[0]

	360 ** and aFile[1]. Neither file is allowed to grow to more than mxSz bytes in

	361 ** size. When the IncrMerger is initialized, it reads enough data from

	362 ** pMerger to populate aFile[0]. It then sets variables within the

	363 ** corresponding PmaReader object to read from that file and kicks off

	364 ** a background thread to populate aFile[1] with the next mxSz bytes of

	365 ** sorted record data from pMerger.

	366 **

	367 ** When the PmaReader reaches the end of aFile[0], it blocks until the

	368 ** background thread has finished populating aFile[1]. It then exchanges

	369 ** the contents of the aFile[0] and aFile[1] variables within this structure,

	370 ** sets the PmaReader fields to read from the new aFile[0] and kicks off

	371 ** another background thread to populate the new aFile[1]. And so on, until

	372 ** the contents of pMerger are exhausted.

	373 **

	374 ** A single-threaded IncrMerger does not open any temporary files of its

	375 ** own. Instead, it has exclusive access to mxSz bytes of space beginning

	376 ** at offset iStartOff of file pTask->file2. And instead of using a

	377 ** background thread to prepare data for the PmaReader, with a single

	378 ** threaded IncrMerger the allocated part of pTask->file2 is "refilled" with

	379 ** keys from pMerger by the calling thread whenever the PmaReader runs out

	380 ** of data.

	381 */

	382 struct IncrMerger {

	383 SortSubtask *pTask; /* Task that owns this merger */

	384 MergeEngine *pMerger; /* Merge engine thread reads data from */

	385 i64 iStartOff; /* Offset to start writing file at */

	386 int mxSz; /* Maximum bytes of data to store */

	387 int bEof; /* Set to true when merge is finished */

	388 int bUseThread; /* True to use a bg thread for this object */

	389 SorterFile aFile[2]; /* aFile[0] for reading, [1] for writing */

	390 };

	391

	392 /*

	393 ** An instance of this object is used for writing a PMA.

	394 **

	395 ** The PMA is written one record at a time. Each record is of an arbitrary

	396 ** size. But I/O is more efficient if it occurs in page-sized blocks where

	397 ** each block is aligned on a page boundary. This object caches writes to

	398 ** the PMA so that aligned, page-size blocks are written.

	399 */

	400 struct PmaWriter {

	401 int eFWErr; /* Non-zero if in an error state */

	402 u8 *aBuffer; /* Pointer to write buffer */

	403 int nBuffer; /* Size of write buffer in bytes */

	404 int iBufStart; /* First byte of buffer to write */

	405 int iBufEnd; /* Last byte of buffer to write */

	406 i64 iWriteOff; /* Offset of start of buffer in file */

	407 sqlite3_file *pFd; /* File handle to write to */

	408 };

	409

	410 /*

	411 ** This object is the header on a single record while that record is being

	412 ** held in memory and prior to being written out as part of a PMA.

	413 **

	414 ** How the linked list is connected depends on how memory is being managed

	415 ** by this module. If using a separate allocation for each in-memory record

	416 ** (VdbeSorter.list.aMemory==0), then the list is always connected using the

	417 ** SorterRecord.u.pNext pointers.

	418 **

	419 ** Or, if using the single large allocation method (VdbeSorter.list.aMemory!=0),

	420 ** then while records are being accumulated the list is linked using the

	421 ** SorterRecord.u.iNext offset. This is because the aMemory[] array may

	422 ** be sqlite3Realloc()ed while records are being accumulated. Once the VM

	423 ** has finished passing records to the sorter, or when the in-memory buffer

	424 ** is full, the list is sorted. As part of the sorting process, it is

	425 ** converted to use the SorterRecord.u.pNext pointers. See function

	426 ** vdbeSorterSort() for details.

	427 */

	428 struct SorterRecord {

	429 int nVal; /* Size of the record in bytes */

	430 union {

	431 SorterRecord *pNext; /* Pointer to next record in list */

	432 int iNext; /* Offset within aMemory of next record */

	433 } u;

	434 /* The data for the record immediately follows this header */

	435 };
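
The reason for linking by offset while the bulk buffer can still be reallocated, and switching to pointers only afterwards, can be illustrated with a simplified stand-alone sketch. The names `Rec`, `Bulk`, `REC_DATA()`, `bulkAdd()` and `bulkToPointers()` are hypothetical, plain `realloc()` replaces sqlite3Realloc(), -1 marks the end of the offset chain, and the conversion is done in its own pass here, whereas the comment above says the real conversion happens as part of vdbeSorterSort().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Rec Rec;
struct Rec {
  int nVal;                          /* Size of the record in bytes */
  union { Rec *pNext; int iNext; } u;
  /* Record data follows the header */
};
#define REC_DATA(p) ((void*)((Rec*)(p) + 1))

typedef struct Bulk {
  unsigned char *aMemory;  /* One growable allocation holding all records */
  int nMemory;             /* Bytes allocated at aMemory */
  int iMemory;             /* Offset of free space in aMemory */
  int iHead;               /* Offset of most recently added record, or -1 */
} Bulk;

/* Append a record. The buffer may move when it grows, so the new record
** is linked to the previous one by offset, never by address. */
static int bulkAdd(Bulk *b, const void *pVal, int nVal){
  int nReq = (int)sizeof(Rec) + ((nVal + 7) & ~7);  /* keep 8-byte alignment */
  Rec *pNew;
  assert( nVal>=0 );
  if( b->iMemory + nReq > b->nMemory ){
    int nNew = b->nMemory ? b->nMemory : 64;
    unsigned char *aNew;
    while( nNew < b->iMemory + nReq ) nNew *= 2;
    aNew = realloc(b->aMemory, (size_t)nNew);
    if( aNew==0 ) return -1;
    b->aMemory = aNew;
    b->nMemory = nNew;
  }
  pNew = (Rec*)&b->aMemory[b->iMemory];
  pNew->nVal = nVal;
  pNew->u.iNext = b->iHead;
  memcpy(REC_DATA(pNew), pVal, (size_t)nVal);
  b->iHead = b->iMemory;
  b->iMemory += nReq;
  return 0;
}

/* Once no more records will be added the buffer can no longer move, so
** the offset chain can be rewritten as ordinary pointers. */
static Rec *bulkToPointers(Bulk *b){
  int iOff = b->iHead;
  Rec *pHead = 0;
  Rec **pp = &pHead;
  while( iOff>=0 ){
    Rec *p = (Rec*)&b->aMemory[iOff];
    iOff = p->u.iNext;     /* read the offset before overwriting the union */
    p->u.pNext = 0;
    *pp = p;
    pp = &p->u.pNext;
  }
  return pHead;
}
```

Had `bulkAdd()` stored addresses instead of offsets, every growth of the buffer that moved it would have left the earlier links dangling.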

	436

	437 /* Return a pointer to the buffer containing the record data for SorterRecord

	438 ** object p. Should be used as if:

	439 **

	440 ** void *SRVAL(SorterRecord *p) { return (void*)&p[1]; }

	441 */

	442 #define SRVAL(p) ((void*)((SorterRecord*)(p) + 1))

	443

	444 /* The minimum PMA size is set to this value multiplied by the database

	445 ** page size in bytes. */

	446 #define SORTER_MIN_WORKING 10

	447

	448 /* Maximum number of PMAs that a single MergeEngine can merge */

	449 #define SORTER_MAX_MERGE_COUNT 16

	450

	451 static int vdbeIncrSwap(IncrMerger*);

	452 static void vdbeIncrFree(IncrMerger *);

	453

	454 /*

	455 ** Free all memory belonging to the PmaReader object passed as the

	456 ** argument. All structure fields are set to zero before returning.

	457 */

	458 static void vdbePmaReaderClear(PmaReader *pReadr){

	459 sqlite3_free(pReadr->aAlloc);

	460 sqlite3_free(pReadr->aBuffer);

	461 if( pReadr->aMap ) sqlite3OsUnfetch(pReadr->pFd, 0, pReadr->aMap);

	462 vdbeIncrFree(pReadr->pIncr);

	463 memset(pReadr, 0, sizeof(PmaReader));

	464 }

	465

	466 /*

	467 ** Read the next nByte bytes of data from the PMA p.

	468 ** If successful, set *ppOut to point to a buffer containing the data

	469 ** and return SQLITE_OK. Otherwise, if an error occurs, return an SQLite

	470 ** error code.

	471 **

	472 ** The buffer returned in *ppOut is only valid until the

	473 ** next call to this function.

	474 */

	475 static int vdbePmaReadBlob(

	476 PmaReader *p, /* PmaReader from which to take the blob */

	477 int nByte, /* Bytes of data to read */

	478 u8 **ppOut /* OUT: Pointer to buffer containing data */

	479 ){

	480 int iBuf; /* Offset within buffer to read from */

	481 int nAvail; /* Bytes of data available in buffer */

	482

	483 if( p->aMap ){

	484 *ppOut = &p->aMap[p->iReadOff];

	485 p->iReadOff += nByte;

	486 return SQLITE_OK;

	487 }

	488

	489 assert( p->aBuffer );

	490

	491 /* If there is no more data to be read from the buffer, read the next

	492 ** p->nBuffer bytes of data from the file into it. Or, if there are fewer

	493 ** than p->nBuffer bytes remaining in the PMA, read all remaining data. */

	494 iBuf = p->iReadOff % p->nBuffer;

	495 if( iBuf==0 ){

	496 int nRead; /* Bytes to read from disk */

	497 int rc; /* sqlite3OsRead() return code */

	498

	499 /* Determine how many bytes of data to read. */

	500 if( (p->iEof - p->iReadOff) > (i64)p->nBuffer ){

	501 nRead = p->nBuffer;

	502 }else{

	503 nRead = (int)(p->iEof - p->iReadOff);

	504 }

	505 assert( nRead>0 );

	506

	507 /* Read data from the file. Return early if an error occurs. */

	508 rc = sqlite3OsRead(p->pFd, p->aBuffer, nRead, p->iReadOff);

	509 assert( rc!=SQLITE_IOERR_SHORT_READ );

	510 if( rc!=SQLITE_OK ) return rc;

	511 }

	512 nAvail = p->nBuffer - iBuf;

	513

	514 if( nByte<=nAvail ){

	515 /* The requested data is available in the in-memory buffer. In this

	516 ** case there is no need to make a copy of the data, just return a

	517 ** pointer into the buffer to the caller. */

	518 *ppOut = &p->aBuffer[iBuf];

	519 p->iReadOff += nByte;

	520 }else{

	521 /* The requested data is not all available in the in-memory buffer.

	522 ** In this case, allocate space at p->aAlloc[] to copy the requested

	523 ** range into. Then return a copy of pointer p->aAlloc to the caller. */

	524 int nRem; /* Bytes remaining to copy */

	525

	526 /* Extend the p->aAlloc[] allocation if required. */

	527 if( p->nAlloc<nByte ){

	528 u8 *aNew;

	529 int nNew = MAX(128, p->nAlloc*2);

	530 while( nByte>nNew ) nNew = nNew*2;

	531 aNew = sqlite3Realloc(p->aAlloc, nNew);

	532 if( !aNew ) return SQLITE_NOMEM;

	533 p->nAlloc = nNew;

	534 p->aAlloc = aNew;

	535 }

	536

	537 /* Copy as much data as is available in the buffer into the start of

	538 ** p->aAlloc[]. */

	539 memcpy(p->aAlloc, &p->aBuffer[iBuf], nAvail);

	540 p->iReadOff += nAvail;

	541 nRem = nByte - nAvail;

	542

	543 /* The following loop copies up to p->nBuffer bytes per iteration into

	544 ** the p->aAlloc[] buffer. */

	545 while( nRem>0 ){

	546 int rc; /* vdbePmaReadBlob() return code */

	547 int nCopy; /* Number of bytes to copy */

	548 u8 *aNext; /* Pointer to buffer to copy data from */

	549

	550 nCopy = nRem;

	551 if( nRem>p->nBuffer ) nCopy = p->nBuffer;

	552 rc = vdbePmaReadBlob(p, nCopy, &aNext);

	553 if( rc!=SQLITE_OK ) return rc;

	554 assert( aNext!=p->aAlloc );

	555 memcpy(&p->aAlloc[nByte - nRem], aNext, nCopy);

	556 nRem -= nCopy;

	557 }

	558

	559 *ppOut = p->aAlloc;

	560 }

	561

	562 return SQLITE_OK;

	563 }

	564

	565 /*

	566 ** Read a varint from the stream of data accessed by p. Set *pnOut to

	567 ** the value read.

	568 */

	569 static int vdbePmaReadVarint(PmaReader *p, u64 *pnOut){

	570 int iBuf;

	571

	572 if( p->aMap ){

	573 p->iReadOff += sqlite3GetVarint(&p->aMap[p->iReadOff], pnOut);

	574 }else{

	575 iBuf = p->iReadOff % p->nBuffer;

	576 if( iBuf && (p->nBuffer-iBuf)>=9 ){

	577 p->iReadOff += sqlite3GetVarint(&p->aBuffer[iBuf], pnOut);

	578 }else{

	579 u8 aVarint[16], *a;

	580 int i = 0, rc;

	581 do{

	582 rc = vdbePmaReadBlob(p, 1, &a);

	583 if( rc ) return rc;

	584 aVarint[(i++)&0xf] = a[0];

	585 }while( (a[0]&0x80)!=0 );

	586 sqlite3GetVarint(aVarint, pnOut);

	587 }

	588 }

	589

	590 return SQLITE_OK;

	591 }
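
For readers unfamiliar with the encoding consumed by vdbePmaReadVarint(): an SQLite varint occupies 1 to 9 bytes, most significant bits first. Each of the first eight bytes carries 7 payload bits, with the high bit set when another byte follows, and a ninth byte, if present, carries all 8 of its bits. That 9-byte maximum is why the fast path above requires at least 9 bytes left in the buffer. The decoder below is an independent sketch of that format for illustration; `getVarint()` is a hypothetical name and this is not the library's sqlite3GetVarint() implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one varint starting at p[0]. Store the value in *pOut and
** return the number of bytes consumed (1 to 9). The caller must make
** sure that enough bytes are readable at p. */
static int getVarint(const unsigned char *p, uint64_t *pOut){
  uint64_t v = 0;
  int i;
  assert( p!=0 && pOut!=0 );
  for(i=0; i<8; i++){
    v = (v<<7) | (uint64_t)(p[i] & 0x7f);  /* 7 payload bits per byte */
    if( (p[i] & 0x80)==0 ){                /* high bit clear: last byte */
      *pOut = v;
      return i+1;
    }
  }
  v = (v<<8) | p[8];                       /* ninth byte uses all 8 bits */
  *pOut = v;
  return 9;
}
```

The slow path in vdbePmaReadVarint() has to collect the bytes one at a time because a varint may straddle the boundary between two buffer refills.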

	592

	593 /*

	594 ** Attempt to memory map file pFile. If successful, set *pp to point to the

	595 ** new mapping and return SQLITE_OK. If the mapping is not attempted

	596 ** (because the file is too large or the VFS layer is configured not to use

	597 ** mmap), return SQLITE_OK and set *pp to NULL.

	598 **

	599 ** Or, if an error occurs, return an SQLite error code. The final value of

	600 ** *pp is undefined in this case.

	601 */

	602 static int vdbeSorterMapFile(SortSubtask *pTask, SorterFile *pFile, u8 **pp){

	603 int rc = SQLITE_OK;

	604 if( pFile->iEof<=(i64)(pTask->pSorter->db->nMaxSorterMmap) ){

	605 sqlite3_file *pFd = pFile->pFd;

	606 if( pFd->pMethods->iVersion>=3 ){

	607 rc = sqlite3OsFetch(pFd, 0, (int)pFile->iEof, (void**)pp);

	608 testcase( rc!=SQLITE_OK );

	609 }

	610 }

	611 return rc;

	612 }

	613

	614 /*

	615 ** Attach PmaReader pReadr to file pFile (if it is not already attached to

	616 ** that file) and seek it to offset iOff within the file. Return SQLITE_OK

	617 ** if successful, or an SQLite error code if an error occurs.

	618 */

	619 static int vdbePmaReaderSeek(

	620 SortSubtask *pTask, /* Task context */

	621 PmaReader *pReadr, /* Reader whose cursor is to be moved */

	622 SorterFile *pFile, /* Sorter file to read from */

	623 i64 iOff /* Offset in pFile */

	624 ){

	625 int rc = SQLITE_OK;

	626

	627 assert( pReadr->pIncr==0 || pReadr->pIncr->bEof==0 );

	628

	629 if( sqlite3FaultSim(201) ) return SQLITE_IOERR_READ;

	630 if( pReadr->aMap ){

	631 sqlite3OsUnfetch(pReadr->pFd, 0, pReadr->aMap);

	632 pReadr->aMap = 0;

	633 }

	634 pReadr->iReadOff = iOff;

	635 pReadr->iEof = pFile->iEof;

	636 pReadr->pFd = pFile->pFd;

	637

	638 rc = vdbeSorterMapFile(pTask, pFile, &pReadr->aMap);

	639 if( rc==SQLITE_OK && pReadr->aMap==0 ){

	640 int pgsz = pTask->pSorter->pgsz;

	641 int iBuf = pReadr->iReadOff % pgsz;

	642 if( pReadr->aBuffer==0 ){

	643 pReadr->aBuffer = (u8*)sqlite3Malloc(pgsz);

	644 if( pReadr->aBuffer==0 ) rc = SQLITE_NOMEM;

	645 pReadr->nBuffer = pgsz;

	646 }

	647 if( rc==SQLITE_OK && iBuf ){

	648 int nRead = pgsz - iBuf;

	649 if( (pReadr->iReadOff + nRead) > pReadr->iEof ){

	650 nRead = (int)(pReadr->iEof - pReadr->iReadOff);

	651 }

	652 rc = sqlite3OsRead(

	653 pReadr->pFd, &pReadr->aBuffer[iBuf], nRead, pReadr->iReadOff

	654 );

	655 testcase( rc!=SQLITE_OK );

	656 }

	657 }

	658

	659 return rc;

	660 }

	661

	662 /*

	663 ** Advance PmaReader pReadr to the next key in its PMA. Return SQLITE_OK if

	664 ** no error occurs, or an SQLite error code if one does.

	665 */

	666 static int vdbePmaReaderNext(PmaReader *pReadr){

	667 int rc = SQLITE_OK; /* Return Code */

	668 u64 nRec = 0; /* Size of record in bytes */

	669

	670

	671 if( pReadr->iReadOff>=pReadr->iEof ){

	672 IncrMerger *pIncr = pReadr->pIncr;

	673 int bEof = 1;

	674 if( pIncr ){

	675 rc = vdbeIncrSwap(pIncr);

	676 if( rc==SQLITE_OK && pIncr->bEof==0 ){

	677 rc = vdbePmaReaderSeek(

	678 pIncr->pTask, pReadr, &pIncr->aFile[0], pIncr->iStartOff

	679 );

	680 bEof = 0;

	681 }

	682 }

	683

	684 if( bEof ){

	685 /* This is an EOF condition */

	686 vdbePmaReaderClear(pReadr);

	687 testcase( rc!=SQLITE_OK );

	688 return rc;

	689 }

	690 }

	691

	692 if( rc==SQLITE_OK ){

	693 rc = vdbePmaReadVarint(pReadr, &nRec);

	694 }

	695 if( rc==SQLITE_OK ){

	696 pReadr->nKey = (int)nRec;

	697 rc = vdbePmaReadBlob(pReadr, (int)nRec, &pReadr->aKey);

	698 testcase( rc!=SQLITE_OK );

	699 }

	700

	701 return rc;

	702 }

	703

	704 /*

	705 ** Initialize PmaReader pReadr to scan through the PMA stored in file pFile

	706 ** starting at offset iStart and ending at offset iEof-1. This function

	707 ** leaves the PmaReader pointing to the first key in the PMA (or EOF if the

	708 ** PMA is empty).

	709 **

	710 ** If the pnByte parameter is NULL, then it is assumed that the file

	711 ** contains a single PMA, and that that PMA omits the initial length varint.

	712 */

	713 static int vdbePmaReaderInit(

	714 SortSubtask *pTask, /* Task context */

	715 SorterFile *pFile, /* Sorter file to read from */

	716 i64 iStart, /* Start offset in pFile */

	717 PmaReader *pReadr, /* PmaReader to populate */

	718 i64 *pnByte /* IN/OUT: Increment this value by PMA size */

	719 ){

	720 int rc;

	721

	722 assert( pFile->iEof>iStart );

	723 assert( pReadr->aAlloc==0 && pReadr->nAlloc==0 );

	724 assert( pReadr->aBuffer==0 );

	725 assert( pReadr->aMap==0 );

	726

	727 rc = vdbePmaReaderSeek(pTask, pReadr, pFile, iStart);

	728 if( rc==SQLITE_OK ){

	729 u64 nByte; /* Size of PMA in bytes */

	730 rc = vdbePmaReadVarint(pReadr, &nByte);

	731 pReadr->iEof = pReadr->iReadOff + nByte;

	732 *pnByte += nByte;

	733 }

	734

	735 if( rc==SQLITE_OK ){

	736 rc = vdbePmaReaderNext(pReadr);

	737 }

	738 return rc;

	739 }

	740

	741

	742 /*

	743 ** Compare key1 (buffer pKey1, size nKey1 bytes) with key2 (buffer pKey2,

	744 ** size nKey2 bytes). Use (pTask->pKeyInfo) for the collation sequences

	745 ** used by the comparison. Return the result of the comparison.

	746 **

	747 ** Before returning, object (pTask->pUnpacked) is populated with the

	748 ** unpacked version of key2. Or, if pKey2 is passed a NULL pointer, then it

	749 ** is assumed that the (pTask->pUnpacked) structure already contains the

	750 ** unpacked key to use as key2.

	751 **

	752 ** If an OOM error is encountered, (pTask->pUnpacked->error_rc) is set

	753 ** to SQLITE_NOMEM.

	754 */

	755 static int vdbeSorterCompare(

	756 SortSubtask *pTask, /* Subtask context (for pKeyInfo) */

	757 const void *pKey1, int nKey1, /* Left side of comparison */

	758 const void *pKey2, int nKey2 /* Right side of comparison */

	759 ){

	760 UnpackedRecord *r2 = pTask->pUnpacked;

	761 if( pKey2 ){

	762 sqlite3VdbeRecordUnpack(pTask->pSorter->pKeyInfo, nKey2, pKey2, r2);

	763 }

	764 return sqlite3VdbeRecordCompare(nKey1, pKey1, r2);

	765 }

	766

	767 /*

	768 ** Initialize the temporary index cursor just opened as a sorter cursor.

	769 **

	770 ** Usually, the sorter module uses the value of (pCsr->pKeyInfo->nField)

	771 ** to determine the number of fields that should be compared from the

	772 ** records being sorted. However, if the value passed as argument nField

	773 ** is non-zero and the sorter is able to guarantee a stable sort, nField

	774 ** is used instead. This is used when sorting records for a CREATE INDEX

	775 ** statement. In this case, keys are always delivered to the sorter in

	776 ** order of the primary key, which happens to make up the final part

	777 ** of the records being sorted. So if the sort is stable, there is never

	778 ** any reason to compare PK fields and they can be ignored for a small

	779 ** performance boost.

	780 **

	781 ** The sorter can guarantee a stable sort when running in single-threaded

	782 ** mode, but not in multi-threaded mode.

	783 **

	784 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	785 */

	786 int sqlite3VdbeSorterInit(

	787 sqlite3 *db, /* Database connection (for malloc()) */

	788 int nField, /* Number of key fields in each record */

	789 VdbeCursor *pCsr /* Cursor that holds the new sorter */

	790 ){

	791 int pgsz; /* Page size of main database */

	792 int i; /* Used to iterate through aTask[] */

	793 int mxCache; /* Cache size */

	794 VdbeSorter *pSorter; /* The new sorter */

	795 KeyInfo *pKeyInfo; /* Copy of pCsr->pKeyInfo with db==0 */

	796 int szKeyInfo; /* Size of pCsr->pKeyInfo in bytes */

	797 int sz; /* Size of pSorter in bytes */

	798 int rc = SQLITE_OK;

	799 #if SQLITE_MAX_WORKER_THREADS==0

	800 # define nWorker 0

	801 #else

	802 int nWorker;

	803 #endif

	804

	805 /* Initialize the upper limit on the number of worker threads */

	806 #if SQLITE_MAX_WORKER_THREADS>0

	807 if( sqlite3TempInMemory(db) || sqlite3GlobalConfig.bCoreMutex==0 ){

	808 nWorker = 0;

	809 }else{

	810 nWorker = db->aLimit[SQLITE_LIMIT_WORKER_THREADS];

	811 }

	812 #endif

	813

	814 /* Do not allow the total number of threads (main thread + all workers)

	815 ** to exceed the maximum merge count */

	816 #if SQLITE_MAX_WORKER_THREADS>=SORTER_MAX_MERGE_COUNT

	817 if( nWorker>=SORTER_MAX_MERGE_COUNT ){

	818 nWorker = SORTER_MAX_MERGE_COUNT-1;

	819 }

	820 #endif

	821

	822 assert( pCsr->pKeyInfo && pCsr->pBt==0 );

	823 szKeyInfo = sizeof(KeyInfo) + (pCsr->pKeyInfo->nField-1)*sizeof(CollSeq*);

	824 sz = sizeof(VdbeSorter) + nWorker * sizeof(SortSubtask);

	825

	826 pSorter = (VdbeSorter*)sqlite3DbMallocZero(db, sz + szKeyInfo);

	827 pCsr->pSorter = pSorter;

	828 if( pSorter==0 ){

	829 rc = SQLITE_NOMEM;

	830 }else{

	831 pSorter->pKeyInfo = pKeyInfo = (KeyInfo*)((u8*)pSorter + sz);

	832 memcpy(pKeyInfo, pCsr->pKeyInfo, szKeyInfo);

	833 pKeyInfo->db = 0;

	834 if( nField && nWorker==0 ) pKeyInfo->nField = nField;

	835 pSorter->pgsz = pgsz = sqlite3BtreeGetPageSize(db->aDb[0].pBt);

	836 pSorter->nTask = nWorker + 1;

	837 pSorter->bUseThreads = (pSorter->nTask>1);

	838 pSorter->db = db;

	839 for(i=0; i<pSorter->nTask; i++){

	840 SortSubtask *pTask = &pSorter->aTask[i];

	841 pTask->pSorter = pSorter;

	842 }

	843

	844 if( !sqlite3TempInMemory(db) ){

	845 pSorter->mnPmaSize = SORTER_MIN_WORKING * pgsz;

	846 mxCache = db->aDb[0].pSchema->cache_size;

	847 if( mxCache<SORTER_MIN_WORKING ) mxCache = SORTER_MIN_WORKING;

	848 pSorter->mxPmaSize = mxCache * pgsz;

	849

	850 /* If the application has not configured scratch memory using

	851 ** SQLITE_CONFIG_SCRATCH then we assume it is OK to do large memory

	852 ** allocations. If scratch memory has been configured, then assume

	853 ** large memory allocations should be avoided to prevent heap

	854 ** fragmentation.

	855 */

	856 if( sqlite3GlobalConfig.pScratch==0 ){

	857 assert( pSorter->iMemory==0 );

	858 pSorter->nMemory = pgsz;

	859 pSorter->list.aMemory = (u8*)sqlite3Malloc(pgsz);

	860 if( !pSorter->list.aMemory ) rc = SQLITE_NOMEM;

	861 }

	862 }

	863 }

	864

	865 return rc;

	866 }

	867 #undef nWorker /* Defined at the top of this function */

	868

	869 /*

	870 ** Free the list of sorted records starting at pRecord.

	871 */

	872 static void vdbeSorterRecordFree(sqlite3 *db, SorterRecord *pRecord){

	873 SorterRecord *p;

	874 SorterRecord *pNext;

	875 for(p=pRecord; p; p=pNext){

	876 pNext = p->u.pNext;

	877 sqlite3DbFree(db, p);

	878 }

	879 }

	880

	881 /*

	882 ** Free all resources owned by the object indicated by argument pTask. All

	883 ** fields of *pTask are zeroed before returning.

	884 */

	885 static void vdbeSortSubtaskCleanup(sqlite3 *db, SortSubtask *pTask){

	886 sqlite3DbFree(db, pTask->pUnpacked);

	887 pTask->pUnpacked = 0;

	888 #if SQLITE_MAX_WORKER_THREADS>0

	889 /* pTask->list.aMemory can only be non-zero if it was handed memory

	890 ** from the main thread. That only occurs if SQLITE_MAX_WORKER_THREADS>0 */

	891 if( pTask->list.aMemory ){

	892 sqlite3_free(pTask->list.aMemory);

	893 pTask->list.aMemory = 0;

	894 }else

	895 #endif

	896 {

	897 assert( pTask->list.aMemory==0 );

	898 vdbeSorterRecordFree(0, pTask->list.pList);

	899 }

	900 pTask->list.pList = 0;

	901 if( pTask->file.pFd ){

	902 sqlite3OsCloseFree(pTask->file.pFd);

	903 pTask->file.pFd = 0;

	904 pTask->file.iEof = 0;

	905 }

	906 if( pTask->file2.pFd ){

	907 sqlite3OsCloseFree(pTask->file2.pFd);

	908 pTask->file2.pFd = 0;

	909 pTask->file2.iEof = 0;

	910 }

	911 }

	912

	913 #ifdef SQLITE_DEBUG_SORTER_THREADS

	914 static void vdbeSorterWorkDebug(SortSubtask *pTask, const char *zEvent){

	915 i64 t;

	916 int iTask = (pTask - pTask->pSorter->aTask);

	917 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	918 fprintf(stderr, "%lld:%d %s\n", t, iTask, zEvent);

	919 }

	920 static void vdbeSorterRewindDebug(const char *zEvent){

	921 i64 t;

	922 sqlite3OsCurrentTimeInt64(sqlite3_vfs_find(0), &t);

	923 fprintf(stderr, "%lld:X %s\n", t, zEvent);

	924 }

	925 static void vdbeSorterPopulateDebug(

	926 SortSubtask *pTask,

	927 const char *zEvent

	928 ){

	929 i64 t;

	930 int iTask = (pTask - pTask->pSorter->aTask);

	931 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	932 fprintf(stderr, "%lld:bg%d %s\n", t, iTask, zEvent);

	933 }

	934 static void vdbeSorterBlockDebug(

	935 SortSubtask *pTask,

	936 int bBlocked,

	937 const char *zEvent

	938 ){

	939 if( bBlocked ){

	940 i64 t;

	941 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	942 fprintf(stderr, "%lld:main %s\n", t, zEvent);

	943 }

	944 }

	945 #else

	946 # define vdbeSorterWorkDebug(x,y)

	947 # define vdbeSorterRewindDebug(y)

	948 # define vdbeSorterPopulateDebug(x,y)

	949 # define vdbeSorterBlockDebug(x,y,z)

	950 #endif

	951

	952 #if SQLITE_MAX_WORKER_THREADS>0

	953 /*

	954 ** Join thread pTask->thread.

	955 */

	956 static int vdbeSorterJoinThread(SortSubtask *pTask){

	957 int rc = SQLITE_OK;

	958 if( pTask->pThread ){

	959 #ifdef SQLITE_DEBUG_SORTER_THREADS

	960 int bDone = pTask->bDone;

	961 #endif

	962 void *pRet = SQLITE_INT_TO_PTR(SQLITE_ERROR);

	963 vdbeSorterBlockDebug(pTask, !bDone, "enter");

	964 (void)sqlite3ThreadJoin(pTask->pThread, &pRet);

	965 vdbeSorterBlockDebug(pTask, !bDone, "exit");

	966 rc = SQLITE_PTR_TO_INT(pRet);

	967 assert( pTask->bDone==1 );

	968 pTask->bDone = 0;

	969 pTask->pThread = 0;

	970 }

	971 return rc;

	972 }

	973

	974 /*

	975 ** Launch a background thread to run xTask(pIn).

	976 */

	977 static int vdbeSorterCreateThread(

	978 SortSubtask *pTask, /* Thread will use this task object */

	979 void *(*xTask)(void*), /* Routine to run in a separate thread */

	980 void *pIn /* Argument passed into xTask() */

	981 ){

	982 assert( pTask->pThread==0 && pTask->bDone==0 );

	983 return sqlite3ThreadCreate(&pTask->pThread, xTask, pIn);

	984 }

	985

	986 /*

	987 ** Join all outstanding threads launched by SorterWrite() to create

	988 ** level-0 PMAs.

	989 */

	990 static int vdbeSorterJoinAll(VdbeSorter *pSorter, int rcin){

	991 int rc = rcin;

	992 int i;

	993

	994 /* This function is always called by the main user thread.

	995 **

	996 ** If this function is being called after SorterRewind() has been called,

	997 ** it is possible that thread pSorter->aTask[pSorter->nTask-1].pThread

	998 ** is currently attempting to join one of the other threads. To avoid a race

	999 ** condition where this thread also attempts to join the same object, join

	1000 ** thread pSorter->aTask[pSorter->nTask-1].pThread first. */

	1001 for(i=pSorter->nTask-1; i>=0; i--){

	1002 SortSubtask *pTask = &pSorter->aTask[i];

	1003 int rc2 = vdbeSorterJoinThread(pTask);

	1004 if( rc==SQLITE_OK ) rc = rc2;

	1005 }

	1006 return rc;

	1007 }

	1008 #else

	1009 # define vdbeSorterJoinAll(x,rcin) (rcin)

	1010 # define vdbeSorterJoinThread(pTask) SQLITE_OK

	1011 #endif

	1012

	1013 /*

	1014 ** Allocate a new MergeEngine object capable of handling up to

	1015 ** nReader PmaReader inputs.

	1016 **

	1017 ** nReader is automatically rounded up to the next power of two.

	1018 ** nReader may not exceed SORTER_MAX_MERGE_COUNT even after rounding up.

	1019 */

	1020 static MergeEngine *vdbeMergeEngineNew(int nReader){

	1021 int N = 2; /* Smallest power of two >= nReader */

	1022 int nByte; /* Total bytes of space to allocate */

	1023 MergeEngine *pNew; /* Pointer to allocated object to return */

	1024

	1025 assert( nReader<=SORTER_MAX_MERGE_COUNT );

	1026

	1027 while( N<nReader ) N += N;

	1028 nByte = sizeof(MergeEngine) + N * (sizeof(int) + sizeof(PmaReader));

	1029

	1030 pNew = sqlite3FaultSim(100) ? 0 : (MergeEngine*)sqlite3MallocZero(nByte);

	1031 if( pNew ){

	1032 pNew->nTree = N;

	1033 pNew->pTask = 0;

	1034 pNew->aReadr = (PmaReader*)&pNew[1];

	1035 pNew->aTree = (int*)&pNew->aReadr[N];

	1036 }

	1037 return pNew;

	1038 }

	1039

	1040 /*

	1041 ** Free the MergeEngine object passed as the only argument.

	1042 */

	1043 static void vdbeMergeEngineFree(MergeEngine *pMerger){

	1044 int i;

	1045 if( pMerger ){

	1046 for(i=0; i<pMerger->nTree; i++){

	1047 vdbePmaReaderClear(&pMerger->aReadr[i]);

	1048 }

	1049 }

	1050 sqlite3_free(pMerger);

	1051 }

	1052

	1053 /*

	1054 ** Free all resources associated with the IncrMerger object indicated by

	1055 ** the first argument.

	1056 */

	1057 static void vdbeIncrFree(IncrMerger *pIncr){

	1058 if( pIncr ){

	1059 #if SQLITE_MAX_WORKER_THREADS>0

	1060 if( pIncr->bUseThread ){

	1061 vdbeSorterJoinThread(pIncr->pTask);

	1062 if( pIncr->aFile[0].pFd ) sqlite3OsCloseFree(pIncr->aFile[0].pFd);

	1063 if( pIncr->aFile[1].pFd ) sqlite3OsCloseFree(pIncr->aFile[1].pFd);

	1064 }

	1065 #endif

	1066 vdbeMergeEngineFree(pIncr->pMerger);

	1067 sqlite3_free(pIncr);

	1068 }

	1069 }

	1070

	1071 /*

	1072 ** Reset a sorting cursor back to its original empty state.

	1073 */

	1074 void sqlite3VdbeSorterReset(sqlite3 *db, VdbeSorter *pSorter){

	1075 int i;

	1076 (void)vdbeSorterJoinAll(pSorter, SQLITE_OK);

	1077 assert( pSorter->bUseThreads || pSorter->pReader==0 );

	1078 #if SQLITE_MAX_WORKER_THREADS>0

	1079 if( pSorter->pReader ){

	1080 vdbePmaReaderClear(pSorter->pReader);

	1081 sqlite3DbFree(db, pSorter->pReader);

	1082 pSorter->pReader = 0;

	1083 }

	1084 #endif

	1085 vdbeMergeEngineFree(pSorter->pMerger);

	1086 pSorter->pMerger = 0;

	1087 for(i=0; i<pSorter->nTask; i++){

	1088 SortSubtask *pTask = &pSorter->aTask[i];

	1089 vdbeSortSubtaskCleanup(db, pTask);

	1090 }

	1091 if( pSorter->list.aMemory==0 ){

	1092 vdbeSorterRecordFree(0, pSorter->list.pList);

	1093 }

	1094 pSorter->list.pList = 0;

	1095 pSorter->list.szPMA = 0;

	1096 pSorter->bUsePMA = 0;

	1097 pSorter->iMemory = 0;

	1098 pSorter->mxKeysize = 0;

	1099 sqlite3DbFree(db, pSorter->pUnpacked);

	1100 pSorter->pUnpacked = 0;

	1101 }

	1102

	1103 /*

	1104 ** Free any cursor components allocated by sqlite3VdbeSorterXXX routines.

	1105 */

	1106 void sqlite3VdbeSorterClose(sqlite3 *db, VdbeCursor *pCsr){

	1107 VdbeSorter *pSorter = pCsr->pSorter;

	1108 if( pSorter ){

	1109 sqlite3VdbeSorterReset(db, pSorter);

	1110 sqlite3_free(pSorter->list.aMemory);

	1111 sqlite3DbFree(db, pSorter);

	1112 pCsr->pSorter = 0;

	1113 }

	1114 }

	1115

	1116 #if SQLITE_MAX_MMAP_SIZE>0

	1117 /*

	1118 ** The first argument is a file-handle open on a temporary file. The file

	1119 ** is guaranteed to be nByte bytes or smaller in size. This function

	1120 ** attempts to extend the file to nByte bytes in size and to ensure that

	1121 ** the VFS has memory mapped it.

	1122 **

	1123 ** Whether or not the file does end up memory mapped of course depends on

	1124 ** the specific VFS implementation.

	1125 */

	1126 static void vdbeSorterExtendFile(sqlite3 *db, sqlite3_file *pFd, i64 nByte){

	1127 if( nByte<=(i64)(db->nMaxSorterMmap) && pFd->pMethods->iVersion>=3 ){

	1128 int rc = sqlite3OsTruncate(pFd, nByte);

	1129 if( rc==SQLITE_OK ){

	1130 void *p = 0;

	1131 sqlite3OsFetch(pFd, 0, (int)nByte, &p);

	1132 sqlite3OsUnfetch(pFd, 0, p);

	1133 }

	1134 }

	1135 }

	1136 #else

	1137 # define vdbeSorterExtendFile(x,y,z)

	1138 #endif

	1139

	1140 /*

	1141 ** Allocate space for a file-handle and open a temporary file. If successful,

	1142 ** set *ppFd to point to the malloc'd file-handle and return SQLITE_OK.

	1143 ** Otherwise, set *ppFd to 0 and return an SQLite error code.

	1144 */

	1145 static int vdbeSorterOpenTempFile(

	1146 sqlite3 *db, /* Database handle doing sort */

	1147 i64 nExtend, /* Attempt to extend file to this size */

	1148 sqlite3_file **ppFd

	1149 ){

	1150 int rc;

	1151 rc = sqlite3OsOpenMalloc(db->pVfs, 0, ppFd,

	1152 SQLITE_OPEN_TEMP_JOURNAL |

	1153 SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |

	1154 SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_DELETEONCLOSE, &rc

	1155 );

	1156 if( rc==SQLITE_OK ){

	1157 i64 max = SQLITE_MAX_MMAP_SIZE;

	1158 sqlite3OsFileControlHint(*ppFd, SQLITE_FCNTL_MMAP_SIZE, (void*)&max);

	1159 if( nExtend>0 ){

	1160 vdbeSorterExtendFile(db, *ppFd, nExtend);

	1161 }

	1162 }

	1163 return rc;

	1164 }

	1165

	1166 /*

	1167 ** If it has not already been allocated, allocate the UnpackedRecord

	1168 ** structure at pTask->pUnpacked. Return SQLITE_OK if successful (or

	1169 ** if no allocation was required), or SQLITE_NOMEM otherwise.

	1170 */

	1171 static int vdbeSortAllocUnpacked(SortSubtask *pTask){

	1172 if( pTask->pUnpacked==0 ){

	1173 char *pFree;

	1174 pTask->pUnpacked = sqlite3VdbeAllocUnpackedRecord(

	1175 pTask->pSorter->pKeyInfo, 0, 0, &pFree

	1176 );

	1177 assert( pTask->pUnpacked==(UnpackedRecord*)pFree );

	1178 if( pFree==0 ) return SQLITE_NOMEM;

	1179 pTask->pUnpacked->nField = pTask->pSorter->pKeyInfo->nField;

	1180 pTask->pUnpacked->errCode = 0;

	1181 }

	1182 return SQLITE_OK;

	1183 }

	1184

	1185

	1186 /*

	1187 ** Merge the two sorted lists p1 and p2 into a single list.

	1188 ** Set *ppOut to the head of the new list.

	1189 */

	1190 static void vdbeSorterMerge(

	1191 SortSubtask *pTask, /* Calling thread context */

	1192 SorterRecord *p1, /* First list to merge */

	1193 SorterRecord *p2, /* Second list to merge */

	1194 SorterRecord **ppOut /* OUT: Head of merged list */

	1195 ){

	1196 SorterRecord *pFinal = 0;

	1197 SorterRecord **pp = &pFinal;

	1198 void *pVal2 = p2 ? SRVAL(p2) : 0;

	1199

	1200 while( p1 && p2 ){

	1201 int res;

	1202 res = vdbeSorterCompare(pTask, SRVAL(p1), p1->nVal, pVal2, p2->nVal);

	1203 if( res<=0 ){

	1204 *pp = p1;

	1205 pp = &p1->u.pNext;

	1206 p1 = p1->u.pNext;

	1207 pVal2 = 0;

	1208 }else{

	1209 *pp = p2;

	1210 pp = &p2->u.pNext;

	1211 p2 = p2->u.pNext;

	1212 if( p2==0 ) break;

	1213 pVal2 = SRVAL(p2);

	1214 }

	1215 }

	1216 *pp = p1 ? p1 : p2;

	1217 *ppOut = pFinal;

	1218 }

	1219

	1220 /*

	1221 ** Sort the linked list of records headed at pTask->pList. Return

	1222 ** SQLITE_OK if successful, or an SQLite error code (i.e. SQLITE_NOMEM) if

	1223 ** an error occurs.

	1224 */

	1225 static int vdbeSorterSort(SortSubtask *pTask, SorterList *pList){

	1226 int i;

	1227 SorterRecord **aSlot;

	1228 SorterRecord *p;

	1229 int rc;

	1230

	1231 rc = vdbeSortAllocUnpacked(pTask);

	1232 if( rc!=SQLITE_OK ) return rc;

	1233

	1234 aSlot = (SorterRecord **)sqlite3MallocZero(64 * sizeof(SorterRecord *));

	1235 if( !aSlot ){

	1236 return SQLITE_NOMEM;

	1237 }

	1238

	1239 p = pList->pList;

	1240 while( p ){

	1241 SorterRecord *pNext;

	1242 if( pList->aMemory ){

	1243 if( (u8*)p==pList->aMemory ){

	1244 pNext = 0;

	1245 }else{

	1246 assert( p->u.iNext<sqlite3MallocSize(pList->aMemory) );

	1247 pNext = (SorterRecord*)&pList->aMemory[p->u.iNext];

	1248 }

	1249 }else{

	1250 pNext = p->u.pNext;

	1251 }

	1252

	1253 p->u.pNext = 0;

	1254 for(i=0; aSlot[i]; i++){

	1255 vdbeSorterMerge(pTask, p, aSlot[i], &p);

	1256 aSlot[i] = 0;

	1257 }

	1258 aSlot[i] = p;

	1259 p = pNext;

	1260 }

	1261

	1262 p = 0;

	1263 for(i=0; i<64; i++){

	1264 vdbeSorterMerge(pTask, p, aSlot[i], &p);

	1265 }

	1266 pList->pList = p;

	1267

	1268 sqlite3_free(aSlot);

	1269 assert( pTask->pUnpacked->errCode==SQLITE_OK

	1270 || pTask->pUnpacked->errCode==SQLITE_NOMEM

	1271 );

	1272 return pTask->pUnpacked->errCode;

	1273 }
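
Reviewer-style aside on the slot technique used by vdbeSorterSort() above: the following is a minimal standalone sketch, not code from this patch. It assumes a hypothetical `Node` type with an integer key in place of SorterRecord, and the helper names `merge_lists` and `slot_sort` are invented for illustration. Slot i holds either nothing or a sorted run of 2^i nodes, so adding a node behaves like incrementing a binary counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type for illustration; not SQLite's SorterRecord. */
typedef struct Node Node;
struct Node {
  int key;
  Node *next;
};

/* Merge two sorted lists into one. On equal keys the node from `a`
** is emitted first. */
static Node *merge_lists(Node *a, Node *b){
  Node *head = NULL;
  Node **pp = &head;
  while( a && b ){
    if( a->key<=b->key ){
      *pp = a; pp = &a->next; a = a->next;
    }else{
      *pp = b; pp = &b->next; b = b->next;
    }
  }
  *pp = a ? a : b;
  return head;
}

#define NSLOT 64

/* Sort a linked list. slot[i] is either NULL or a sorted run of 2^i
** nodes. Each incoming node is merged upward through occupied slots
** until an empty slot is found. */
static Node *slot_sort(Node *list){
  Node *slot[NSLOT] = {0};
  Node *p = list;
  int i;
  while( p ){
    Node *next = p->next;
    p->next = NULL;
    for(i=0; slot[i]; i++){
      assert( i<NSLOT-1 );   /* would need 2^63 nodes to trip */
      p = merge_lists(p, slot[i]);
      slot[i] = NULL;
    }
    slot[i] = p;
    p = next;
  }
  /* Final pass: fold every remaining run into a single list. */
  p = NULL;
  for(i=0; i<NSLOT; i++){
    p = merge_lists(p, slot[i]);
  }
  return p;
}
```

With 64 slots the scheme can absorb far more records than fit in memory, which is presumably why the patch uses a fixed-size aSlot[] array rather than a growable one.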

	1274

	1275 /*

	1276 ** Initialize a PMA-writer object.

	1277 */

	1278 static void vdbePmaWriterInit(

	1279 sqlite3_file *pFd, /* File handle to write to */

	1280 PmaWriter *p, /* Object to populate */

	1281 int nBuf, /* Buffer size */

	1282 i64 iStart /* Offset of pFd to begin writing at */

	1283 ){

	1284 memset(p, 0, sizeof(PmaWriter));

	1285 p->aBuffer = (u8*)sqlite3Malloc(nBuf);

	1286 if( !p->aBuffer ){

	1287 p->eFWErr = SQLITE_NOMEM;

	1288 }else{

	1289 p->iBufEnd = p->iBufStart = (iStart % nBuf);

	1290 p->iWriteOff = iStart - p->iBufStart;

	1291 p->nBuffer = nBuf;

	1292 p->pFd = pFd;

	1293 }

	1294 }

	1295

	1296 /*

	1297 ** Write nData bytes of data to the PMA. Return SQLITE_OK

	1298 ** if successful, or an SQLite error code if an error occurs.

	1299 */

	1300 static void vdbePmaWriteBlob(PmaWriter *p, u8 *pData, int nData){

	1301 int nRem = nData;

	1302 while( nRem>0 && p->eFWErr==0 ){

	1303 int nCopy = nRem;

	1304 if( nCopy>(p->nBuffer - p->iBufEnd) ){

	1305 nCopy = p->nBuffer - p->iBufEnd;

	1306 }

	1307

	1308 memcpy(&p->aBuffer[p->iBufEnd], &pData[nData-nRem], nCopy);

	1309 p->iBufEnd += nCopy;

	1310 if( p->iBufEnd==p->nBuffer ){

	1311 p->eFWErr = sqlite3OsWrite(p->pFd,

	1312 &p->aBuffer[p->iBufStart], p->iBufEnd - p->iBufStart,

	1313 p->iWriteOff + p->iBufStart

	1314 );

	1315 p->iBufStart = p->iBufEnd = 0;

	1316 p->iWriteOff += p->nBuffer;

	1317 }

	1318 assert( p->iBufEnd<p->nBuffer );

	1319

	1320 nRem -= nCopy;

	1321 }

	1322 }

	1323

	1324 /*

	1325 ** Flush any buffered data to disk and clean up the PMA-writer object.

	1326 ** The results of using the PMA-writer after this call are undefined.

	1327 ** Return SQLITE_OK if flushing the buffered data succeeds or is not

	1328 ** required. Otherwise, return an SQLite error code.

	1329 **

	1330 ** Before returning, set *piEof to the offset immediately following the

	1331 ** last byte written to the file.

	1332 */

	1333 static int vdbePmaWriterFinish(PmaWriter *p, i64 *piEof){

	1334 int rc;

	1335 if( p->eFWErr==0 && ALWAYS(p->aBuffer) && p->iBufEnd>p->iBufStart ){

	1336 p->eFWErr = sqlite3OsWrite(p->pFd,

	1337 &p->aBuffer[p->iBufStart], p->iBufEnd - p->iBufStart,

	1338 p->iWriteOff + p->iBufStart

	1339 );

	1340 }

	1341 *piEof = (p->iWriteOff + p->iBufEnd);

	1342 sqlite3_free(p->aBuffer);

	1343 rc = p->eFWErr;

	1344 memset(p, 0, sizeof(PmaWriter));

	1345 return rc;

	1346 }

	1347

	1348 /*

	1349 ** Write value iVal encoded as a varint to the PMA. Return

	1350 ** SQLITE_OK if successful, or an SQLite error code if an error occurs.

	1351 */

	1352 static void vdbePmaWriteVarint(PmaWriter *p, u64 iVal){

	1353 int nByte;

	1354 u8 aByte[10];

	1355 nByte = sqlite3PutVarint(aByte, iVal);

	1356 vdbePmaWriteBlob(p, aByte, nByte);

	1357 }

	1358

	1359 /*

	1360 ** Write the current contents of in-memory linked-list pList to a level-0

	1361 ** PMA in the temp file belonging to sub-task pTask. Return SQLITE_OK if

	1362 ** successful, or an SQLite error code otherwise.

	1363 **

	1364 ** The format of a PMA is:

	1365 **

	1366 ** * A varint. This varint contains the total number of bytes of content

	1367 ** in the PMA (not including the varint itself).

	1368 **

	1369 ** * One or more records packed end-to-end in order of ascending keys.

	1370 ** Each record consists of a varint followed by a blob of data (the

	1371 ** key). The varint is the number of bytes in the blob of data.

	1372 */

	1373 static int vdbeSorterListToPMA(SortSubtask *pTask, SorterList *pList){

	1374 sqlite3 *db = pTask->pSorter->db;

	1375 int rc = SQLITE_OK; /* Return code */

	1376 PmaWriter writer; /* Object used to write to the file */

	1377

	1378 #ifdef SQLITE_DEBUG

	1379 /* Set iSz to the expected size of file pTask->file after writing the PMA.

	1380 ** This is used by an assert() statement at the end of this function. */

	1381 i64 iSz = pList->szPMA + sqlite3VarintLen(pList->szPMA) + pTask->file.iEof;

	1382 #endif

	1383

	1384 vdbeSorterWorkDebug(pTask, "enter");

	1385 memset(&writer, 0, sizeof(PmaWriter));

	1386 assert( pList->szPMA>0 );

	1387

	1388 /* If the first temporary PMA file has not been opened, open it now. */

	1389 if( pTask->file.pFd==0 ){

	1390 rc = vdbeSorterOpenTempFile(db, 0, &pTask->file.pFd);

	1391 assert( rc!=SQLITE_OK || pTask->file.pFd );

	1392 assert( pTask->file.iEof==0 );

	1393 assert( pTask->nPMA==0 );

	1394 }

	1395

	1396 /* Try to get the file to memory map */

	1397 if( rc==SQLITE_OK ){

	1398 vdbeSorterExtendFile(db, pTask->file.pFd, pTask->file.iEof+pList->szPMA+9);

	1399 }

	1400

	1401 /* Sort the list */

	1402 if( rc==SQLITE_OK ){

	1403 rc = vdbeSorterSort(pTask, pList);

	1404 }

	1405

	1406 if( rc==SQLITE_OK ){

	1407 SorterRecord *p;

	1408 SorterRecord *pNext = 0;

	1409

	1410 vdbePmaWriterInit(pTask->file.pFd, &writer, pTask->pSorter->pgsz,

	1411 pTask->file.iEof);

	1412 pTask->nPMA++;

	1413 vdbePmaWriteVarint(&writer, pList->szPMA);

	1414 for(p=pList->pList; p; p=pNext){

	1415 pNext = p->u.pNext;

	1416 vdbePmaWriteVarint(&writer, p->nVal);

	1417 vdbePmaWriteBlob(&writer, SRVAL(p), p->nVal);

	1418 if( pList->aMemory==0 ) sqlite3_free(p);

	1419 }

	1420 pList->pList = p;

	1421 rc = vdbePmaWriterFinish(&writer, &pTask->file.iEof);

	1422 }

	1423

	1424 vdbeSorterWorkDebug(pTask, "exit");

	1425 assert( rc!=SQLITE_OK || pList->pList==0 );

	1426 assert( rc!=SQLITE_OK || pTask->file.iEof==iSz );

	1427 return rc;

	1428 }
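
Reviewer-style aside on the PMA layout described in the comment above vdbeSorterListToPMA(): the sketch below is illustrative only and is not part of this patch. It serializes records into an in-memory buffer rather than a temp file. The names `put_varint`, `get_varint`, `write_pma` and `count_pma_records` are hypothetical, and the varint helper is a simplified big-endian base-128 encoding that assumes values below 2^28.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified varint writer (assumption: v < 2^28). Every byte except
** the last has its high bit set; the most significant group is first. */
static int put_varint(unsigned char *out, unsigned int v){
  unsigned char tmp[5];
  int n = 0, i;
  assert( v < (1u<<28) );
  do{
    tmp[n++] = (unsigned char)(v & 0x7f);
    v >>= 7;
  }while( v );
  for(i=0; i<n; i++){
    out[i] = (unsigned char)(tmp[n-1-i] | (i<n-1 ? 0x80 : 0));
  }
  return n;
}

/* Matching reader. Returns the number of bytes consumed. */
static int get_varint(const unsigned char *in, unsigned int *pv){
  unsigned int v = 0;
  int n = 0;
  do{
    v = (v<<7) | (in[n] & 0x7f);
  }while( in[n++] & 0x80 );
  *pv = v;
  return n;
}

/* Write a PMA: a varint holding the content size, then each record as
** a length varint followed by its bytes. The records are assumed to be
** sorted already. Returns the total number of bytes written. */
static size_t write_pma(unsigned char *out, const char **aRec,
                        const int *aLen, int nRec){
  unsigned char scratch[5];
  size_t content = 0, off;
  int i;
  for(i=0; i<nRec; i++){
    content += (size_t)put_varint(scratch, (unsigned int)aLen[i]);
    content += (size_t)aLen[i];
  }
  off = (size_t)put_varint(out, (unsigned int)content);
  for(i=0; i<nRec; i++){
    off += (size_t)put_varint(&out[off], (unsigned int)aLen[i]);
    memcpy(&out[off], aRec[i], (size_t)aLen[i]);
    off += (size_t)aLen[i];
  }
  return off;
}

/* Walk a PMA and count its records, using the leading size varint to
** find the end of the content. */
static int count_pma_records(const unsigned char *in){
  unsigned int content, len;
  size_t off = (size_t)get_varint(in, &content);
  size_t end = off + content;
  int n = 0;
  while( off<end ){
    off += (size_t)get_varint(&in[off], &len);
    off += len;
    n++;
  }
  return n;
}
```

The total written equals the content size plus the length of its own varint, which is the same relationship the debug-only iSz assert in the patch checks against pTask->file.iEof.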

	1429

	1430 /*

	1431 ** Advance the MergeEngine to its next entry.

	1432 ** Set *pbEof to true if there is no next entry because

	1433 ** the MergeEngine has reached the end of all its inputs.

	1434 **

	1435 ** Return SQLITE_OK if successful or an error code if an error occurs.

	1436 */

	1437 static int vdbeMergeEngineStep(

	1438 MergeEngine *pMerger, /* The merge engine to advance to the next row */

	1439 int *pbEof /* Set TRUE at EOF. Set false for more content */

	1440 ){

	1441 int rc;

	1442 int iPrev = pMerger->aTree[1];/* Index of PmaReader to advance */

	1443 SortSubtask *pTask = pMerger->pTask;

	1444

	1445 /* Advance the current PmaReader */

	1446 rc = vdbePmaReaderNext(&pMerger->aReadr[iPrev]);

	1447

	1448 /* Update contents of aTree[] */

	1449 if( rc==SQLITE_OK ){

	1450 int i; /* Index of aTree[] to recalculate */

	1451 PmaReader *pReadr1; /* First PmaReader to compare */

	1452 PmaReader *pReadr2; /* Second PmaReader to compare */

	1453 u8 *pKey2; /* To pReadr2->aKey, or 0 if record cached */

	1454

	1455 /* Find the first two PmaReaders to compare. The one that was just

	1456 ** advanced (iPrev) and the one next to it in the array. */

	1457 pReadr1 = &pMerger->aReadr[(iPrev & 0xFFFE)];

	1458 pReadr2 = &pMerger->aReadr[(iPrev | 0x0001)];

	1459 pKey2 = pReadr2->aKey;

	1460

	1461 for(i=(pMerger->nTree+iPrev)/2; i>0; i=i/2){

	1462 /* Compare pReadr1 and pReadr2. Store the result in variable iRes. */

	1463 int iRes;

	1464 if( pReadr1->pFd==0 ){

	1465 iRes = +1;

	1466 }else if( pReadr2->pFd==0 ){

	1467 iRes = -1;

	1468 }else{

	1469 iRes = vdbeSorterCompare(pTask,

	1470 pReadr1->aKey, pReadr1->nKey, pKey2, pReadr2->nKey

	1471 );

	1472 }

	1473

	1474 /* If pReadr1 contained the smaller value, set aTree[i] to its index.

	1475 ** Then set pReadr2 to the next PmaReader to compare to pReadr1. In this

	1476 ** case there is no cache of pReadr2 in pTask->pUnpacked, so set

	1477 ** pKey2 to point to the record belonging to pReadr2.

	1478 **

	1479 ** Alternatively, if pReadr2 contains the smaller of the two values,

	1480 ** set aTree[i] to its index and update pReadr1. If vdbeSorterCompare()

	1481 ** was actually called above, then pTask->pUnpacked now contains

	1482 ** a value equivalent to pReadr2. So set pKey2 to NULL to prevent

	1483 ** vdbeSorterCompare() from decoding pReadr2 again.

	1484 **

	1485 ** If the two values were equal, then the value from the oldest

	1486 ** PMA should be considered smaller. The VdbeSorter.aReadr[] array

	1487 ** is sorted from oldest to newest, so pReadr1 contains older values

	1488 ** than pReadr2 iff (pReadr1<pReadr2). */

	1489 if( iRes<0 || (iRes==0 && pReadr1<pReadr2) ){

	1490 pMerger->aTree[i] = (int)(pReadr1 - pMerger->aReadr);

	1491 pReadr2 = &pMerger->aReadr[ pMerger->aTree[i ^ 0x0001] ];

	1492 pKey2 = pReadr2->aKey;

	1493 }else{

	1494 if( pReadr1->pFd ) pKey2 = 0;

	1495 pMerger->aTree[i] = (int)(pReadr2 - pMerger->aReadr);

	1496 pReadr1 = &pMerger->aReadr[ pMerger->aTree[i ^ 0x0001] ];

	1497 }

	1498 }

	1499 *pbEof = (pMerger->aReadr[pMerger->aTree[1]].pFd==0);

	1500 }

	1501

	1502 return (rc==SQLITE_OK ? pTask->pUnpacked->errCode : rc);

	1503 }

	1504

	1505 #if SQLITE_MAX_WORKER_THREADS>0

	1506 /*

	1507 ** The main routine for background threads that write level-0 PMAs.

	1508 */

	1509 static void *vdbeSorterFlushThread(void *pCtx){

	1510 SortSubtask *pTask = (SortSubtask*)pCtx;

	1511 int rc; /* Return code */

	1512 assert( pTask->bDone==0 );

	1513 rc = vdbeSorterListToPMA(pTask, &pTask->list);

	1514 pTask->bDone = 1;

	1515 return SQLITE_INT_TO_PTR(rc);

	1516 }

	1517 #endif /* SQLITE_MAX_WORKER_THREADS>0 */

	1518

	1519 /*

	1520 ** Flush the current contents of VdbeSorter.list to a new PMA, possibly

	1521 ** using a background thread.

	1522 */

	1523 static int vdbeSorterFlushPMA(VdbeSorter *pSorter){

	1524 #if SQLITE_MAX_WORKER_THREADS==0

	1525 pSorter->bUsePMA = 1;

	1526 return vdbeSorterListToPMA(&pSorter->aTask[0], &pSorter->list);

	1527 #else

	1528 int rc = SQLITE_OK;

	1529 int i;

	1530 SortSubtask *pTask = 0; /* Thread context used to create new PMA */

	1531 int nWorker = (pSorter->nTask-1);

	1532

	1533 /* Set the flag to indicate that at least one PMA has been written.

	1534 ** Or will be, anyhow. */

	1535 pSorter->bUsePMA = 1;

	1536

	1537 /* Select a sub-task to sort and flush the current list of in-memory

	1538 ** records to disk. If the sorter is running in multi-threaded mode,

	1539 ** round-robin between the first (pSorter->nTask-1) tasks. Except, if

	1540 ** the background thread from a sub-task's previous turn is still running,

	1541 ** skip it. If the first (pSorter->nTask-1) sub-tasks are all still busy,

	1542 ** fall back to using the final sub-task. The first (pSorter->nTask-1)

	1543 ** sub-tasks are preferred as they use background threads - the final

	1544 ** sub-task uses the main thread. */

	1545 for(i=0; i<nWorker; i++){

	1546 int iTest = (pSorter->iPrev + i + 1) % nWorker;

	1547 pTask = &pSorter->aTask[iTest];

	1548 if( pTask->bDone ){

	1549 rc = vdbeSorterJoinThread(pTask);

	1550 }

	1551 if( rc!=SQLITE_OK || pTask->pThread==0 ) break;

	1552 }

	1553

	1554 if( rc==SQLITE_OK ){

	1555 if( i==nWorker ){

	1556 /* Use the foreground thread for this operation */

	1557 rc = vdbeSorterListToPMA(&pSorter->aTask[nWorker], &pSorter->list);

	1558 }else{

	1559 /* Launch a background thread for this operation */

	1560 u8 *aMem = pTask->list.aMemory;

	1561 void *pCtx = (void*)pTask;

	1562

	1563 assert( pTask->pThread==0 && pTask->bDone==0 );

	1564 assert( pTask->list.pList==0 );

	1565 assert( pTask->list.aMemory==0 || pSorter->list.aMemory!=0 );

	1566

	1567 pSorter->iPrev = (u8)(pTask - pSorter->aTask);

	1568 pTask->list = pSorter->list;

	1569 pSorter->list.pList = 0;

	1570 pSorter->list.szPMA = 0;

	1571 if( aMem ){

	1572 pSorter->list.aMemory = aMem;

	1573 pSorter->nMemory = sqlite3MallocSize(aMem);

	1574 }else if( pSorter->list.aMemory ){

	1575 pSorter->list.aMemory = sqlite3Malloc(pSorter->nMemory);

	1576 if( !pSorter->list.aMemory ) return SQLITE_NOMEM;

	1577 }

	1578

	1579 rc = vdbeSorterCreateThread(pTask, vdbeSorterFlushThread, pCtx);

	1580 }

	1581 }

	1582

	1583 return rc;

	1584 #endif /* SQLITE_MAX_WORKER_THREADS!=0 */

	1585 }

	1586

	1587 /*

	1588 ** Add a record to the sorter.

	1589 */

	1590 int sqlite3VdbeSorterWrite(

	1591 const VdbeCursor *pCsr, /* Sorter cursor */

	1592 Mem *pVal /* Memory cell containing record */

	1593 ){

	1594 VdbeSorter *pSorter = pCsr->pSorter;

	1595 int rc = SQLITE_OK; /* Return Code */

	1596 SorterRecord *pNew; /* New list element */

	1597

	1598 int bFlush; /* True to flush contents of memory to PMA */

	1599 int nReq; /* Bytes of memory required */

	1600 int nPMA; /* Bytes of PMA space required */

	1601

	1602 assert( pSorter );

	1603

	1604 /* Figure out whether or not the current contents of memory should be

	1605 ** flushed to a PMA before continuing. If so, do so.

	1606 **

	1607 ** If using the single large allocation mode (pSorter->aMemory!=0), then

	1608 ** flush the contents of memory to a new PMA if (a) at least one value is

	1609 ** already in memory and (b) the new value will not fit in memory.

	1610 **

	1611 ** Or, if using separate allocations for each record, flush the contents

	1612 ** of memory to a PMA if either of the following are true:

	1613 **

	1614 ** * The total memory allocated for the in-memory list is greater

	1615 ** than (page-size * cache-size), or

	1616 **

	1617 ** * The total memory allocated for the in-memory list is greater

	1618 ** than (page-size * 10) and sqlite3HeapNearlyFull() returns true.

	1619 */

	1620 nReq = pVal->n + sizeof(SorterRecord);

	1621 nPMA = pVal->n + sqlite3VarintLen(pVal->n);

	1622 if( pSorter->mxPmaSize ){

	1623 if( pSorter->list.aMemory ){

	1624 bFlush = pSorter->iMemory && (pSorter->iMemory+nReq) > pSorter->mxPmaSize;

	1625 }else{

	1626 bFlush = (

	1627 (pSorter->list.szPMA > pSorter->mxPmaSize)

	1628 || (pSorter->list.szPMA > pSorter->mnPmaSize && sqlite3HeapNearlyFull())

	1629 );

	1630 }

	1631 if( bFlush ){

	1632 rc = vdbeSorterFlushPMA(pSorter);

	1633 pSorter->list.szPMA = 0;

	1634 pSorter->iMemory = 0;

	1635 assert( rc!=SQLITE_OK || pSorter->list.pList==0 );

	1636 }

	1637 }

	1638

	1639 pSorter->list.szPMA += nPMA;

	1640 if( nPMA>pSorter->mxKeysize ){

	1641 pSorter->mxKeysize = nPMA;

	1642 }

	1643

	1644 if( pSorter->list.aMemory ){

	1645 int nMin = pSorter->iMemory + nReq;

	1646

	1647 if( nMin>pSorter->nMemory ){

	1648 u8 *aNew;

	1649 int nNew = pSorter->nMemory * 2;

	1650 while( nNew < nMin ) nNew = nNew*2;

	1651 if( nNew > pSorter->mxPmaSize ) nNew = pSorter->mxPmaSize;

	1652 if( nNew < nMin ) nNew = nMin;

	1653

	1654 aNew = sqlite3Realloc(pSorter->list.aMemory, nNew);

	1655 if( !aNew ) return SQLITE_NOMEM;

	1656 pSorter->list.pList = (SorterRecord*)(

	1657 aNew + ((u8*)pSorter->list.pList - pSorter->list.aMemory)

	1658 );

	1659 pSorter->list.aMemory = aNew;

	1660 pSorter->nMemory = nNew;

	1661 }

	1662

	1663 pNew = (SorterRecord*)&pSorter->list.aMemory[pSorter->iMemory];

	1664 pSorter->iMemory += ROUND8(nReq);

	1665 pNew->u.iNext = (int)((u8*)(pSorter->list.pList) - pSorter->list.aMemory);

	1666 }else{

	1667 pNew = (SorterRecord *)sqlite3Malloc(nReq);

	1668 if( pNew==0 ){

	1669 return SQLITE_NOMEM;

	1670 }

	1671 pNew->u.pNext = pSorter->list.pList;

	1672 }

	1673

	1674 memcpy(SRVAL(pNew), pVal->z, pVal->n);

	1675 pNew->nVal = pVal->n;

	1676 pSorter->list.pList = pNew;

	1677

	1678 return rc;

	1679 }
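The flush test and the buffer-growth policy in sqlite3VdbeSorterWrite() can be read in isolation. The following is a minimal sketch, not the SQLite implementation: DemoSorter, demoShouldFlush() and demoNextCapacity() are hypothetical names, and the heapNearlyFull argument stands in for the result of sqlite3HeapNearlyFull().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the sorter fields consulted above. */
typedef struct DemoSorter {
  int mxPmaSize;     /* Maximum PMA size in bytes; 0 disables flushing */
  int mnPmaSize;     /* Lower threshold used when the heap is nearly full */
  int iMemory;       /* Bytes used so far in the single large allocation */
  int szPMA;         /* Bytes the in-memory list would occupy as a PMA */
  bool bBulkMemory;  /* True in single-large-allocation mode */
} DemoSorter;

/* Mirrors the bFlush computation. heapNearlyFull is an assumed input that
** replaces the call to sqlite3HeapNearlyFull(). */
static bool demoShouldFlush(const DemoSorter *p, int nReq, bool heapNearlyFull){
  if( p->mxPmaSize==0 ) return false;
  if( p->bBulkMemory ){
    /* Flush only if something is already buffered and the new record
    ** would push the buffer past the limit. */
    return p->iMemory!=0 && (p->iMemory + nReq) > p->mxPmaSize;
  }
  return p->szPMA > p->mxPmaSize
      || (p->szPMA > p->mnPmaSize && heapNearlyFull);
}

/* Mirrors the growth policy: double until nMin fits, cap at mxPmaSize,
** but never return less than nMin. */
static int demoNextCapacity(int nMemory, int nMin, int mxPmaSize){
  int nNew = nMemory * 2;
  assert( nMemory>0 && nMin>nMemory );
  while( nNew < nMin ) nNew = nNew*2;
  if( nNew > mxPmaSize ) nNew = mxPmaSize;
  if( nNew < nMin ) nNew = nMin;
  return nNew;
}
```

Note that the final clamp to nMin means a single record larger than mxPmaSize still gets a buffer big enough to hold it.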

	1680

	1681 /*

	1682 ** Read keys from pIncr->pMerger and populate pIncr->aFile[1]. The format

	1683 ** of the data stored in aFile[1] is the same as that used by regular PMAs,

	1684 ** except that the number-of-bytes varint is omitted from the start.

	1685 */

	1686 static int vdbeIncrPopulate(IncrMerger *pIncr){

	1687 int rc = SQLITE_OK;

	1688 int rc2;

	1689 i64 iStart = pIncr->iStartOff;

	1690 SorterFile *pOut = &pIncr->aFile[1];

	1691 SortSubtask *pTask = pIncr->pTask;

	1692 MergeEngine *pMerger = pIncr->pMerger;

	1693 PmaWriter writer;

	1694 assert( pIncr->bEof==0 );

	1695

	1696 vdbeSorterPopulateDebug(pTask, "enter");

	1697

	1698 vdbePmaWriterInit(pOut->pFd, &writer, pTask->pSorter->pgsz, iStart);

	1699 while( rc==SQLITE_OK ){

	1700 int dummy;

	1701 PmaReader *pReader = &pMerger->aReadr[ pMerger->aTree[1] ];

	1702 int nKey = pReader->nKey;

	1703 i64 iEof = writer.iWriteOff + writer.iBufEnd;

	1704

	1705 /* Check if the output file is full or if the input has been exhausted.

	1706 ** In either case exit the loop. */

	1707 if( pReader->pFd==0 ) break;

	1708 if( (iEof + nKey + sqlite3VarintLen(nKey))>(iStart + pIncr->mxSz) ) break;

	1709

	1710 /* Write the next key to the output. */

	1711 vdbePmaWriteVarint(&writer, nKey);

	1712 vdbePmaWriteBlob(&writer, pReader->aKey, nKey);

	1713 assert( pIncr->pMerger->pTask==pTask );

	1714 rc = vdbeMergeEngineStep(pIncr->pMerger, &dummy);

	1715 }

	1716

	1717 rc2 = vdbePmaWriterFinish(&writer, &pOut->iEof);

	1718 if( rc==SQLITE_OK ) rc = rc2;

	1719 vdbeSorterPopulateDebug(pTask, "exit");

	1720 return rc;

	1721 }
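The "output file is full" test in vdbeIncrPopulate() adds the key size to the size of its length varint. The sketch below illustrates that arithmetic under the assumption that SQLite varints carry 7 payload bits per byte and never exceed 9 bytes; demoVarintLen() and demoKeyFits() are hypothetical helpers, not functions from this file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of bytes needed to encode v as a varint, assuming 7 payload bits
** per byte and a maximum encoded length of 9 bytes. */
static int demoVarintLen(uint64_t v){
  int i = 0;
  do{
    i++;
    v >>= 7;
  }while( v!=0 && i<9 );
  return i;
}

/* True if a key of nKey bytes, plus its length prefix, still fits in the
** region [iStart, iStart+mxSz) given that iEof bytes have been written. */
static bool demoKeyFits(int64_t iEof, int nKey, int64_t iStart, int mxSz){
  assert( nKey>=0 );
  return (iEof + nKey + demoVarintLen((uint64_t)nKey)) <= (iStart + mxSz);
}
```

The loop in vdbeIncrPopulate() breaks exactly when the equivalent of demoKeyFits() would be false.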

	1722

	1723 #if SQLITE_MAX_WORKER_THREADS>0

	1724 /*

	1725 ** The main routine for background threads that populate aFile[1] of

	1726 ** multi-threaded IncrMerger objects.

	1727 */

	1728 static void *vdbeIncrPopulateThread(void *pCtx){

	1729 IncrMerger *pIncr = (IncrMerger*)pCtx;

	1730 void *pRet = SQLITE_INT_TO_PTR( vdbeIncrPopulate(pIncr) );

	1731 pIncr->pTask->bDone = 1;

	1732 return pRet;

	1733 }

	1734

	1735 /*

	1736 ** Launch a background thread to populate aFile[1] of pIncr.

	1737 */

	1738 static int vdbeIncrBgPopulate(IncrMerger *pIncr){

	1739 void *p = (void*)pIncr;

	1740 assert( pIncr->bUseThread );

	1741 return vdbeSorterCreateThread(pIncr->pTask, vdbeIncrPopulateThread, p);

	1742 }

	1743 #endif

	1744

	1745 /*

	1746 ** This function is called when the PmaReader corresponding to pIncr has

	1747 ** finished reading the contents of aFile[0]. Its purpose is to "refill"

	1748 ** aFile[0] such that the PmaReader should start rereading it from the

	1749 ** beginning.

	1750 **

	1751 ** For single-threaded objects, this is accomplished by literally reading

	1752 ** keys from pIncr->pMerger and repopulating aFile[0].

	1753 **

	1754 ** For multi-threaded objects, all that is required is to wait until the

	1755 ** background thread is finished (if it is not already) and then swap

	1756 ** aFile[0] and aFile[1] in place. If the contents of pMerger have not

	1757 ** been exhausted, this function also launches a new background thread

	1758 ** to populate the new aFile[1].

	1759 **

	1760 ** SQLITE_OK is returned on success, or an SQLite error code otherwise.

	1761 */

	1762 static int vdbeIncrSwap(IncrMerger *pIncr){

	1763 int rc = SQLITE_OK;

	1764

	1765 #if SQLITE_MAX_WORKER_THREADS>0

	1766 if( pIncr->bUseThread ){

	1767 rc = vdbeSorterJoinThread(pIncr->pTask);

	1768

	1769 if( rc==SQLITE_OK ){

	1770 SorterFile f0 = pIncr->aFile[0];

	1771 pIncr->aFile[0] = pIncr->aFile[1];

	1772 pIncr->aFile[1] = f0;

	1773 }

	1774

	1775 if( rc==SQLITE_OK ){

	1776 if( pIncr->aFile[0].iEof==pIncr->iStartOff ){

	1777 pIncr->bEof = 1;

	1778 }else{

	1779 rc = vdbeIncrBgPopulate(pIncr);

	1780 }

	1781 }

	1782 }else

	1783 #endif

	1784 {

	1785 rc = vdbeIncrPopulate(pIncr);

	1786 pIncr->aFile[0] = pIncr->aFile[1];

	1787 if( pIncr->aFile[0].iEof==pIncr->iStartOff ){

	1788 pIncr->bEof = 1;

	1789 }

	1790 }

	1791

	1792 return rc;

	1793 }

	1794

	1795 /*

	1796 ** Allocate and return a new IncrMerger object to read data from pMerger.

	1797 **

	1798 ** If an OOM condition is encountered, set *ppOut to NULL and return

	1799 ** SQLITE_NOMEM. In this case free the pMerger argument before returning.

	1800 */

	1801 static int vdbeIncrMergerNew(

	1802 SortSubtask *pTask, /* The thread that will be using the new IncrMerger */

	1803 MergeEngine *pMerger, /* The MergeEngine that the IncrMerger will control */

	1804 IncrMerger **ppOut /* Write the new IncrMerger here */

	1805 ){

	1806 int rc = SQLITE_OK;

	1807 IncrMerger *pIncr = *ppOut = (IncrMerger*)

	1808 (sqlite3FaultSim(100) ? 0 : sqlite3MallocZero(sizeof(*pIncr)));

	1809 if( pIncr ){

	1810 pIncr->pMerger = pMerger;

	1811 pIncr->pTask = pTask;

	1812 pIncr->mxSz = MAX(pTask->pSorter->mxKeysize+9,pTask->pSorter->mxPmaSize/2);

	1813 pTask->file2.iEof += pIncr->mxSz;

	1814 }else{

	1815 vdbeMergeEngineFree(pMerger);

	1816 rc = SQLITE_NOMEM;

	1817 }

	1818 return rc;

	1819 }

	1820

	1821 #if SQLITE_MAX_WORKER_THREADS>0

	1822 /*

	1823 ** Set the "use-threads" flag on object pIncr.

	1824 */

	1825 static void vdbeIncrMergerSetThreads(IncrMerger *pIncr){

	1826 pIncr->bUseThread = 1;

	1827 pIncr->pTask->file2.iEof -= pIncr->mxSz;

	1828 }

	1829 #endif /* SQLITE_MAX_WORKER_THREADS>0 */

	1830

	1831

	1832

	1833 /*

	1834 ** Recompute pMerger->aTree[iOut] by comparing the next keys on the

	1835 ** two PmaReaders that feed that entry. Neither of the PmaReaders

	1836 ** is advanced. This routine merely does the comparison.

	1837 */

	1838 static void vdbeMergeEngineCompare(

	1839 MergeEngine *pMerger, /* Merge engine containing PmaReaders to compare */

	1840 int iOut /* Store the result in pMerger->aTree[iOut] */

	1841 ){

	1842 int i1;

	1843 int i2;

	1844 int iRes;

	1845 PmaReader *p1;

	1846 PmaReader *p2;

	1847

	1848 assert( iOut<pMerger->nTree && iOut>0 );

	1849

	1850 if( iOut>=(pMerger->nTree/2) ){

	1851 i1 = (iOut - pMerger->nTree/2) * 2;

	1852 i2 = i1 + 1;

	1853 }else{

	1854 i1 = pMerger->aTree[iOut*2];

	1855 i2 = pMerger->aTree[iOut*2+1];

	1856 }

	1857

	1858 p1 = &pMerger->aReadr[i1];

	1859 p2 = &pMerger->aReadr[i2];

	1860

	1861 if( p1->pFd==0 ){

	1862 iRes = i2;

	1863 }else if( p2->pFd==0 ){

	1864 iRes = i1;

	1865 }else{

	1866 int res;

	1867 assert( pMerger->pTask->pUnpacked!=0 ); /* from vdbeSortSubtaskMain() */

	1868 res = vdbeSorterCompare(

	1869 pMerger->pTask, p1->aKey, p1->nKey, p2->aKey, p2->nKey

	1870 );

	1871 if( res<=0 ){

	1872 iRes = i1;

	1873 }else{

	1874 iRes = i2;

	1875 }

	1876 }

	1877

	1878 pMerger->aTree[iOut] = iRes;

	1879 }
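The index arithmetic in vdbeMergeEngineCompare() describes a tournament tree stored in a flat array. The sketch below reproduces it over plain integers as an illustration only: demoCompare(), demoBuild() and DEMO_EOF are hypothetical, one int per reader stands in for its current key, and nTree is assumed to be a power of two no smaller than 2.

```c
#include <assert.h>

#define DEMO_EOF (-1)   /* Hypothetical marker: this reader has no more keys */

/* Recompute aTree[iOut]. Slots in the lower half of aTree refer to the
** winners of their two child slots; slots in the upper half compare a
** fixed pair of adjacent readers directly. */
static void demoCompare(const int *aKey, int *aTree, int nTree, int iOut){
  int i1, i2;
  assert( iOut>0 && iOut<nTree );
  if( iOut>=(nTree/2) ){
    i1 = (iOut - nTree/2) * 2;
    i2 = i1 + 1;
  }else{
    i1 = aTree[iOut*2];
    i2 = aTree[iOut*2+1];
  }
  if( aKey[i1]==DEMO_EOF ){
    aTree[iOut] = i2;
  }else if( aKey[i2]==DEMO_EOF ){
    aTree[iOut] = i1;
  }else{
    /* Ties go to the lower-numbered reader, as with res<=0 above. */
    aTree[iOut] = (aKey[i1]<=aKey[i2]) ? i1 : i2;
  }
}

/* Fill the whole tree bottom-up and return the index of the reader that
** holds the smallest key, which ends up in aTree[1]. */
static int demoBuild(const int *aKey, int *aTree, int nTree){
  int i;
  for(i=nTree-1; i>0; i--) demoCompare(aKey, aTree, nTree, i);
  return aTree[1];
}
```

With keys {5, 3, 8, 1} and nTree of 4, slot 3 compares readers 2 and 3, slot 2 compares readers 0 and 1, and slot 1 compares the two winners.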

	1880

	1881 /*

	1882 ** Allowed values for the eMode parameter to vdbeMergeEngineInit()

	1883 ** and vdbePmaReaderIncrMergeInit().

	1884 **

	1885 ** Only INCRINIT_NORMAL is valid in single-threaded builds (when

	1886 ** SQLITE_MAX_WORKER_THREADS==0). The other values are only used

	1887 ** when there exists one or more separate worker threads.

	1888 */

	1889 #define INCRINIT_NORMAL 0

	1890 #define INCRINIT_TASK 1

	1891 #define INCRINIT_ROOT 2

	1892

	1893 /* Forward reference.

	1894 ** The vdbeIncrMergeInit() and vdbePmaReaderIncrMergeInit() routines call each

	1895 ** other (when building a merge tree).

	1896 */

	1897 static int vdbePmaReaderIncrMergeInit(PmaReader *pReadr, int eMode);

	1898

	1899 /*

	1900 ** Initialize the MergeEngine object passed as the second argument. Once this

	1901 ** function returns, the first key of merged data may be read from the

	1902 ** MergeEngine object in the usual fashion.

	1903 **

	1904 ** If argument eMode is INCRINIT_ROOT, then it is assumed that any IncrMerge

	1905 ** objects attached to the PmaReader objects that the merger reads from have

	1906 ** already been populated, but that they have not yet populated aFile[0] and

	1907 ** set the PmaReader objects up to read from it. In this case all that is

	1908 ** required is to call vdbePmaReaderNext() on each PmaReader to point it at

	1909 ** its first key.

	1910 **

	1911 ** Otherwise, if eMode is any value other than INCRINIT_ROOT, then use

	1912 ** vdbePmaReaderIncrMergeInit() to initialize each PmaReader that feeds data

	1913 ** to pMerger.

	1914 **

	1915 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	1916 */

	1917 static int vdbeMergeEngineInit(

	1918 SortSubtask *pTask, /* Thread that will run pMerger */

	1919 MergeEngine *pMerger, /* MergeEngine to initialize */

	1920 int eMode /* One of the INCRINIT_XXX constants */

	1921 ){

	1922 int rc = SQLITE_OK; /* Return code */

	1923 int i; /* For looping over PmaReader objects */

	1924 int nTree = pMerger->nTree;

	1925

	1926 /* eMode is always INCRINIT_NORMAL in single-threaded mode */

	1927 assert( SQLITE_MAX_WORKER_THREADS>0 || eMode==INCRINIT_NORMAL );

	1928

	1929 /* Verify that the MergeEngine is assigned to a single thread */

	1930 assert( pMerger->pTask==0 );

	1931 pMerger->pTask = pTask;

	1932

	1933 for(i=0; i<nTree; i++){

	1934 if( SQLITE_MAX_WORKER_THREADS>0 && eMode==INCRINIT_ROOT ){

	1935 /* PmaReaders should be normally initialized in order, as if they are

	1936 ** reading from the same temp file this makes for more linear file IO.

	1937 ** However, in the INCRINIT_ROOT case, if PmaReader aReadr[nTask-1] is

	1938 ** in use it will block the vdbePmaReaderNext() call while it uses

	1939 ** the main thread to fill its buffer. So calling PmaReaderNext()

	1940 ** on this PmaReader before any of the multi-threaded PmaReaders takes

	1941 ** better advantage of multi-processor hardware. */

	1942 rc = vdbePmaReaderNext(&pMerger->aReadr[nTree-i-1]);

	1943 }else{

	1944 rc = vdbePmaReaderIncrMergeInit(&pMerger->aReadr[i], INCRINIT_NORMAL);

	1945 }

	1946 if( rc!=SQLITE_OK ) return rc;

	1947 }

	1948

	1949 for(i=pMerger->nTree-1; i>0; i--){

	1950 vdbeMergeEngineCompare(pMerger, i);

	1951 }

	1952 return pTask->pUnpacked->errCode;

	1953 }

	1954

	1955 /*

	1956 ** Initialize the IncrMerge field of a PmaReader.

	1957 **

	1958 ** If the PmaReader passed as the first argument is not an incremental-reader

	1959 ** (if pReadr->pIncr==0), then this function is a no-op. Otherwise, it serves

	1960 ** to open and/or initialize the temp file related fields of the IncrMerge

	1961 ** object at (pReadr->pIncr).

	1962 **

	1963 ** If argument eMode is set to INCRINIT_NORMAL, then all PmaReaders

	1964 ** in the sub-tree headed by pReadr are also initialized. Data is then loaded

	1965 ** into the buffers belonging to pReadr and it is set to

	1966 ** point to the first key in its range.

	1967 **

	1968 ** If argument eMode is set to INCRINIT_TASK, then pReadr is guaranteed

	1969 ** to be a multi-threaded PmaReader and this function is being called in a

	1970 ** background thread. In this case all PmaReaders in the sub-tree are

	1971 ** initialized as for INCRINIT_NORMAL and the aFile[1] buffer belonging to

	1972 ** pReadr is populated. However, pReadr itself is not set up to point

	1973 ** to its first key. A call to vdbePmaReaderNext() is still required to do

	1974 ** that.

	1975 **

	1976 ** The reason this function does not call vdbePmaReaderNext() immediately

	1977 ** in the INCRINIT_TASK case is that vdbePmaReaderNext() assumes that it has

	1978 ** to block on thread (pTask->thread) before accessing aFile[1]. But, since

	1979 ** this entire function is being run by thread (pTask->thread), that will

	1980 ** lead to the current background thread attempting to join itself.

	1981 **

	1982 ** Finally, if argument eMode is set to INCRINIT_ROOT, it may be assumed

	1983 ** that pReadr->pIncr is a multi-threaded IncrMerge object, and that all

	1984 ** child-trees have already been initialized using IncrInit(INCRINIT_TASK).

	1985 ** In this case vdbePmaReaderNext() is called on all child PmaReaders and

	1986 ** the current PmaReader set to point to the first key in its range.

	1987 **

	1988 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	1989 */

	1990 static int vdbePmaReaderIncrMergeInit(PmaReader *pReadr, int eMode){

	1991 int rc = SQLITE_OK;

	1992 IncrMerger *pIncr = pReadr->pIncr;

	1993

	1994 /* eMode is always INCRINIT_NORMAL in single-threaded mode */

	1995 assert( SQLITE_MAX_WORKER_THREADS>0 || eMode==INCRINIT_NORMAL );

	1996

	1997 if( pIncr ){

	1998 SortSubtask *pTask = pIncr->pTask;

	1999 sqlite3 *db = pTask->pSorter->db;

	2000

	2001 rc = vdbeMergeEngineInit(pTask, pIncr->pMerger, eMode);

	2002

	2003 /* Set up the required files for pIncr. A multi-threaded IncrMerge object

	2004 ** requires two temp files to itself, whereas a single-threaded object

	2005 ** only requires a region of pTask->file2. */

	2006 if( rc==SQLITE_OK ){

	2007 int mxSz = pIncr->mxSz;

	2008 #if SQLITE_MAX_WORKER_THREADS>0

	2009 if( pIncr->bUseThread ){

	2010 rc = vdbeSorterOpenTempFile(db, mxSz, &pIncr->aFile[0].pFd);

	2011 if( rc==SQLITE_OK ){

	2012 rc = vdbeSorterOpenTempFile(db, mxSz, &pIncr->aFile[1].pFd);

	2013 }

	2014 }else

	2015 #endif

	2016 /*if( !pIncr->bUseThread )*/{

	2017 if( pTask->file2.pFd==0 ){

	2018 assert( pTask->file2.iEof>0 );

	2019 rc = vdbeSorterOpenTempFile(db, pTask->file2.iEof, &pTask->file2.pFd);

	2020 pTask->file2.iEof = 0;

	2021 }

	2022 if( rc==SQLITE_OK ){

	2023 pIncr->aFile[1].pFd = pTask->file2.pFd;

	2024 pIncr->iStartOff = pTask->file2.iEof;

	2025 pTask->file2.iEof += mxSz;

	2026 }

	2027 }

	2028 }

	2029

	2030 #if SQLITE_MAX_WORKER_THREADS>0

	2031 if( rc==SQLITE_OK && pIncr->bUseThread ){

	2032 /* Use the current thread to populate aFile[1], even though this

	2033 ** PmaReader is multi-threaded. The reason being that this function

	2034 ** is already running in background thread pIncr->pTask->thread. */

	2035 assert( eMode==INCRINIT_ROOT || eMode==INCRINIT_TASK );

	2036 rc = vdbeIncrPopulate(pIncr);

	2037 }

	2038 #endif

	2039

	2040 if( rc==SQLITE_OK

	2041 && (SQLITE_MAX_WORKER_THREADS==0 || eMode!=INCRINIT_TASK)

	2042 ){

	2043 rc = vdbePmaReaderNext(pReadr);

	2044 }

	2045 }

	2046 return rc;

	2047 }

	2048

	2049 #if SQLITE_MAX_WORKER_THREADS>0

	2050 /*

	2051 ** The main routine for vdbePmaReaderIncrMergeInit() operations run in

	2052 ** background threads.

	2053 */

	2054 static void *vdbePmaReaderBgInit(void *pCtx){

	2055 PmaReader *pReader = (PmaReader*)pCtx;

	2056 void *pRet = SQLITE_INT_TO_PTR(

	2057 vdbePmaReaderIncrMergeInit(pReader,INCRINIT_TASK)

	2058 );

	2059 pReader->pIncr->pTask->bDone = 1;

	2060 return pRet;

	2061 }

	2062

	2063 /*

	2064 ** Use a background thread to invoke vdbePmaReaderIncrMergeInit(INCRINIT_TASK)

	2065 ** on the PmaReader object passed as the first argument.

	2066 **

	2067 ** This call will initialize the various fields of the pReadr->pIncr

	2068 ** structure and, if it is a multi-threaded IncrMerger, launch a

	2069 ** background thread to populate aFile[1].

	2070 */

	2071 static int vdbePmaReaderBgIncrInit(PmaReader *pReadr){

	2072 void *pCtx = (void*)pReadr;

	2073 return vdbeSorterCreateThread(pReadr->pIncr->pTask, vdbePmaReaderBgInit, pCtx) ;

	2074 }

	2075 #endif

	2076

	2077 /*

	2078 ** Allocate a new MergeEngine object to merge the contents of nPMA level-0

	2079 ** PMAs from pTask->file. If no error occurs, set *ppOut to point to

	2080 ** the new object and return SQLITE_OK. Or, if an error does occur, set *ppOut

	2081 ** to NULL and return an SQLite error code.

	2082 **

	2083 ** When this function is called, *piOffset is set to the offset of the

	2084 ** first PMA to read from pTask->file. Assuming no error occurs, it is

	2085 ** set to the offset immediately following the last byte of the last

	2086 ** PMA before returning. If an error does occur, then the final value of

	2087 ** *piOffset is undefined.

	2088 */

	2089 static int vdbeMergeEngineLevel0(

	2090 SortSubtask *pTask, /* Sorter task to read from */

	2091 int nPMA, /* Number of PMAs to read */

	2092 i64 *piOffset, /* IN/OUT: Readr offset in pTask->file */

	2093 MergeEngine **ppOut /* OUT: New merge-engine */

	2094 ){

	2095 MergeEngine *pNew; /* Merge engine to return */

	2096 i64 iOff = *piOffset;

	2097 int i;

	2098 int rc = SQLITE_OK;

	2099

	2100 *ppOut = pNew = vdbeMergeEngineNew(nPMA);

	2101 if( pNew==0 ) rc = SQLITE_NOMEM;

	2102

	2103 for(i=0; i<nPMA && rc==SQLITE_OK; i++){

	2104 i64 nDummy;

	2105 PmaReader *pReadr = &pNew->aReadr[i];

	2106 rc = vdbePmaReaderInit(pTask, &pTask->file, iOff, pReadr, &nDummy);

	2107 iOff = pReadr->iEof;

	2108 }

	2109

	2110 if( rc!=SQLITE_OK ){

	2111 vdbeMergeEngineFree(pNew);

	2112 *ppOut = 0;

	2113 }

	2114 *piOffset = iOff;

	2115 return rc;

	2116 }

	2117

	2118 /*

	2119 ** Return the depth of a tree comprising nPMA PMAs, assuming a fanout of

	2120 ** SORTER_MAX_MERGE_COUNT. The returned value does not include leaf nodes.

	2121 **

	2122 ** i.e.

	2123 **

	2124 ** nPMA<=16 -> TreeDepth() == 0

	2125 ** nPMA<=256 -> TreeDepth() == 1

	2126 ** nPMA<=65536 -> TreeDepth() == 2

	2127 */

	2128 static int vdbeSorterTreeDepth(int nPMA){

	2129 int nDepth = 0;

	2130 i64 nDiv = SORTER_MAX_MERGE_COUNT;

	2131 while( nDiv < (i64)nPMA ){

	2132 nDiv = nDiv * SORTER_MAX_MERGE_COUNT;

	2133 nDepth++;

	2134 }

	2135 return nDepth;

	2136 }

	2137

	2138 /*

	2139 ** pRoot is the root of an incremental merge-tree with depth nDepth (according

	2140 ** to vdbeSorterTreeDepth()). pLeaf is the iSeq'th leaf to be added to the

	2141 ** tree, counting from zero. This function adds pLeaf to the tree.

	2142 **

	2143 ** If successful, SQLITE_OK is returned. If an error occurs, an SQLite error

	2144 ** code is returned and pLeaf is freed.

	2145 */

	2146 static int vdbeSorterAddToTree(

	2147 SortSubtask *pTask, /* Task context */

	2148 int nDepth, /* Depth of tree according to TreeDepth() */

	2149 int iSeq, /* Sequence number of leaf within tree */

	2150 MergeEngine *pRoot, /* Root of tree */

	2151 MergeEngine *pLeaf /* Leaf to add to tree */

	2152 ){

	2153 int rc = SQLITE_OK;

	2154 int nDiv = 1;

	2155 int i;

	2156 MergeEngine *p = pRoot;

	2157 IncrMerger *pIncr;

	2158

	2159 rc = vdbeIncrMergerNew(pTask, pLeaf, &pIncr);

	2160

	2161 for(i=1; i<nDepth; i++){

	2162 nDiv = nDiv * SORTER_MAX_MERGE_COUNT;

	2163 }

	2164

	2165 for(i=1; i<nDepth && rc==SQLITE_OK; i++){

	2166 int iIter = (iSeq / nDiv) % SORTER_MAX_MERGE_COUNT;

	2167 PmaReader *pReadr = &p->aReadr[iIter];

	2168

	2169 if( pReadr->pIncr==0 ){

	2170 MergeEngine *pNew = vdbeMergeEngineNew(SORTER_MAX_MERGE_COUNT);

	2171 if( pNew==0 ){

	2172 rc = SQLITE_NOMEM;

	2173 }else{

	2174 rc = vdbeIncrMergerNew(pTask, pNew, &pReadr->pIncr);

	2175 }

	2176 }

	2177 if( rc==SQLITE_OK ){

	2178 p = pReadr->pIncr->pMerger;

	2179 nDiv = nDiv / SORTER_MAX_MERGE_COUNT;

	2180 }

	2181 }

	2182

	2183 if( rc==SQLITE_OK ){

	2184 p->aReadr[iSeq % SORTER_MAX_MERGE_COUNT].pIncr = pIncr;

	2185 }else{

	2186 vdbeIncrFree(pIncr);

	2187 }

	2188 return rc;

	2189 }
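vdbeSorterAddToTree() locates the slot for the iSeq'th leaf by reading iSeq as a base-16 number, most significant digit first. The sketch below computes only that path of slot indexes, without allocating any merge engines. It assumes a fanout of 16; DEMO_FANOUT and demoLeafPath() are hypothetical names.

```c
#include <assert.h>

#define DEMO_FANOUT 16   /* Assumed value of SORTER_MAX_MERGE_COUNT */

/* Write into aSlot[] the aReadr[] index chosen at each level on the way
** from the root to the slot that receives leaf number iSeq. aSlot must
** have room for nDepth entries. Returns the number of entries written. */
static int demoLeafPath(int nDepth, int iSeq, int *aSlot){
  int nDiv = 1;
  int n = 0;
  int i;
  assert( nDepth>=1 && iSeq>=0 );
  for(i=1; i<nDepth; i++){
    nDiv = nDiv * DEMO_FANOUT;
  }
  /* Interior levels: one base-16 digit of iSeq per level. */
  for(i=1; i<nDepth; i++){
    aSlot[n++] = (iSeq / nDiv) % DEMO_FANOUT;
    nDiv = nDiv / DEMO_FANOUT;
  }
  /* Final level: the least significant digit selects the leaf slot. */
  aSlot[n++] = iSeq % DEMO_FANOUT;
  return n;
}
```

For example, with nDepth of 2 and iSeq of 37, the leaf goes under reader 2 of the root and then into slot 5 of that sub-merger.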

	2190

	2191 /*

	2192 ** This function is called as part of a SorterRewind() operation on a sorter

	2193 ** that has already written two or more level-0 PMAs to one or more temp

	2194 ** files. It builds a tree of MergeEngine/IncrMerger/PmaReader objects that

	2195 ** can be used to incrementally merge all PMAs on disk.

	2196 **

	2197 ** If successful, SQLITE_OK is returned and *ppOut set to point to the

	2198 ** MergeEngine object at the root of the tree before returning. Or, if an

	2199 ** error occurs, an SQLite error code is returned and the final value

	2200 ** of *ppOut is undefined.

	2201 */

	2202 static int vdbeSorterMergeTreeBuild(

	2203 VdbeSorter *pSorter, /* The VDBE cursor that implements the sort */

	2204 MergeEngine **ppOut /* Write the MergeEngine here */

	2205 ){

	2206 MergeEngine *pMain = 0;

	2207 int rc = SQLITE_OK;

	2208 int iTask;

	2209

	2210 #if SQLITE_MAX_WORKER_THREADS>0

	2211 /* If the sorter uses more than one task, then create the top-level

	2212 ** MergeEngine here. This MergeEngine will read data from exactly

	2213 ** one PmaReader per sub-task. */

	2214 assert( pSorter->bUseThreads || pSorter->nTask==1 );

	2215 if( pSorter->nTask>1 ){

	2216 pMain = vdbeMergeEngineNew(pSorter->nTask);

	2217 if( pMain==0 ) rc = SQLITE_NOMEM;

	2218 }

	2219 #endif

	2220

	2221 for(iTask=0; rc==SQLITE_OK && iTask<pSorter->nTask; iTask++){

	2222 SortSubtask *pTask = &pSorter->aTask[iTask];

	2223 assert( pTask->nPMA>0 || SQLITE_MAX_WORKER_THREADS>0 );

	2224 if( SQLITE_MAX_WORKER_THREADS==0 || pTask->nPMA ){

	2225 MergeEngine *pRoot = 0; /* Root node of tree for this task */

	2226 int nDepth = vdbeSorterTreeDepth(pTask->nPMA);

	2227 i64 iReadOff = 0;

	2228

	2229 if( pTask->nPMA<=SORTER_MAX_MERGE_COUNT ){

	2230 rc = vdbeMergeEngineLevel0(pTask, pTask->nPMA, &iReadOff, &pRoot);

	2231 }else{

	2232 int i;

	2233 int iSeq = 0;

	2234 pRoot = vdbeMergeEngineNew(SORTER_MAX_MERGE_COUNT);

	2235 if( pRoot==0 ) rc = SQLITE_NOMEM;

	2236 for(i=0; i<pTask->nPMA && rc==SQLITE_OK; i += SORTER_MAX_MERGE_COUNT){

	2237 MergeEngine *pMerger = 0; /* New level-0 PMA merger */

	2238 int nReader; /* Number of level-0 PMAs to merge */

	2239

	2240 nReader = MIN(pTask->nPMA - i, SORTER_MAX_MERGE_COUNT);

	2241 rc = vdbeMergeEngineLevel0(pTask, nReader, &iReadOff, &pMerger);

	2242 if( rc==SQLITE_OK ){

	2243 rc = vdbeSorterAddToTree(pTask, nDepth, iSeq++, pRoot, pMerger);

	2244 }

	2245 }

	2246 }

	2247

	2248 if( rc==SQLITE_OK ){

	2249 #if SQLITE_MAX_WORKER_THREADS>0

	2250 if( pMain!=0 ){

	2251 rc = vdbeIncrMergerNew(pTask, pRoot, &pMain->aReadr[iTask].pIncr);

	2252 }else

	2253 #endif

	2254 {

	2255 assert( pMain==0 );

	2256 pMain = pRoot;

	2257 }

	2258 }else{

	2259 vdbeMergeEngineFree(pRoot);

	2260 }

	2261 }

	2262 }

	2263

	2264 if( rc!=SQLITE_OK ){

	2265 vdbeMergeEngineFree(pMain);

	2266 pMain = 0;

	2267 }

	2268 *ppOut = pMain;

	2269 return rc;

	2270 }

	2271

	2272 /*

	2273 ** This function is called as part of an sqlite3VdbeSorterRewind() operation

	2274 ** on a sorter that has written two or more PMAs to temporary files. It sets

	2275 ** up either VdbeSorter.pMerger (for single threaded sorters) or pReader

	2276 ** (for multi-threaded sorters) so that it can be used to iterate through

	2277 ** all records stored in the sorter.

	2278 **

	2279 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	2280 */

	2281 static int vdbeSorterSetupMerge(VdbeSorter *pSorter){

	2282 int rc; /* Return code */

	2283 SortSubtask *pTask0 = &pSorter->aTask[0];

	2284 MergeEngine *pMain = 0;

	2285 #if SQLITE_MAX_WORKER_THREADS

	2286 sqlite3 *db = pTask0->pSorter->db;

	2287 #endif

	2288

	2289 rc = vdbeSorterMergeTreeBuild(pSorter, &pMain);

	2290 if( rc==SQLITE_OK ){

	2291 #if SQLITE_MAX_WORKER_THREADS

	2292 assert( pSorter->bUseThreads==0 || pSorter->nTask>1 );

	2293 if( pSorter->bUseThreads ){

	2294 int iTask;

	2295 PmaReader *pReadr = 0;

	2296 SortSubtask *pLast = &pSorter->aTask[pSorter->nTask-1];

	2297 rc = vdbeSortAllocUnpacked(pLast);

	2298 if( rc==SQLITE_OK ){

	2299 pReadr = (PmaReader*)sqlite3DbMallocZero(db, sizeof(PmaReader));

	2300 pSorter->pReader = pReadr;

	2301 if( pReadr==0 ) rc = SQLITE_NOMEM;

	2302 }

	2303 if( rc==SQLITE_OK ){

	2304 rc = vdbeIncrMergerNew(pLast, pMain, &pReadr->pIncr);

	2305 if( rc==SQLITE_OK ){

	2306 vdbeIncrMergerSetThreads(pReadr->pIncr);

	2307 for(iTask=0; iTask<(pSorter->nTask-1); iTask++){

	2308 IncrMerger *pIncr;

	2309 if( (pIncr = pMain->aReadr[iTask].pIncr) ){

	2310 vdbeIncrMergerSetThreads(pIncr);

	2311 assert( pIncr->pTask!=pLast );

	2312 }

	2313 }

	2314 for(iTask=0; rc==SQLITE_OK && iTask<pSorter->nTask; iTask++){

	2315 PmaReader *p = &pMain->aReadr[iTask];

	2316 assert( p->pIncr==0 || p->pIncr->pTask==&pSorter->aTask[iTask] );

	2317 if( p->pIncr ){

	2318 if( iTask==pSorter->nTask-1 ){

	2319 rc = vdbePmaReaderIncrMergeInit(p, INCRINIT_TASK);

	2320 }else{

	2321 rc = vdbePmaReaderBgIncrInit(p);

	2322 }

	2323 }

	2324 }

	2325 }

	2326 pMain = 0;

	2327 }

	2328 if( rc==SQLITE_OK ){

	2329 rc = vdbePmaReaderIncrMergeInit(pReadr, INCRINIT_ROOT);

	2330 }

	2331 }else

	2332 #endif

	2333 {

	2334 rc = vdbeMergeEngineInit(pTask0, pMain, INCRINIT_NORMAL);

	2335 pSorter->pMerger = pMain;

	2336 pMain = 0;

	2337 }

	2338 }

	2339

	2340 if( rc!=SQLITE_OK ){

	2341 vdbeMergeEngineFree(pMain);

	2342 }

	2343 return rc;

	2344 }

	2345

	2346

	2347 /*

	2348 ** Once the sorter has been populated by calls to sqlite3VdbeSorterWrite,

	2349 ** this function is called to prepare for iterating through the records

	2350 ** in sorted order.

	2351 */

	2352 int sqlite3VdbeSorterRewind(const VdbeCursor *pCsr, int *pbEof){

	2353 VdbeSorter *pSorter = pCsr->pSorter;

	2354 int rc = SQLITE_OK; /* Return code */

	2355

	2356 assert( pSorter );

	2357

	2358 /* If no data has been written to disk, then do not do so now. Instead,

	2359 ** sort the VdbeSorter.pRecord list. The vdbe layer will read data directly

	2360 ** from the in-memory list. */

	2361 if( pSorter->bUsePMA==0 ){

	2362 if( pSorter->list.pList ){

	2363 *pbEof = 0;

	2364 rc = vdbeSorterSort(&pSorter->aTask[0], &pSorter->list);

	2365 }else{

	2366 *pbEof = 1;

	2367 }

	2368 return rc;

	2369 }

	2370

	2371 /* Write the current in-memory list to a PMA. When the VdbeSorterWrite()

	2372 ** function flushes the contents of memory to disk, it always

	2373 ** creates a new list consisting of a single key immediately afterwards.

	2374 ** So the list is never empty at this point. */

	2375 assert( pSorter->list.pList );

	2376 rc = vdbeSorterFlushPMA(pSorter);

	2377

	2378 /* Join all threads */

	2379 rc = vdbeSorterJoinAll(pSorter, rc);

	2380

	2381 vdbeSorterRewindDebug("rewind");

	2382

	2383 /* Assuming no errors have occurred, set up a merger structure to

	2384 ** incrementally read and merge all remaining PMAs. */

	2385 assert( pSorter->pReader==0 );

	2386 if( rc==SQLITE_OK ){

	2387 rc = vdbeSorterSetupMerge(pSorter);

	2388 *pbEof = 0;

	2389 }

	2390

	2391 vdbeSorterRewindDebug("rewinddone");

	2392 return rc;

	2393 }

	2394

	2395 /*

	2396 ** Advance to the next element in the sorter.

	2397 */

	2398 int sqlite3VdbeSorterNext(sqlite3 *db, const VdbeCursor *pCsr, int *pbEof){

	2399 VdbeSorter *pSorter = pCsr->pSorter;

	2400 int rc; /* Return code */

	2401

	2402 assert( pSorter->bUsePMA || (pSorter->pReader==0 && pSorter->pMerger==0) );

	2403 if( pSorter->bUsePMA ){

	2404 assert( pSorter->pReader==0 || pSorter->pMerger==0 );

	2405 assert( pSorter->bUseThreads==0 || pSorter->pReader );

	2406 assert( pSorter->bUseThreads==1 || pSorter->pMerger );

	2407 #if SQLITE_MAX_WORKER_THREADS>0

	2408 if( pSorter->bUseThreads ){

	2409 rc = vdbePmaReaderNext(pSorter->pReader);

	2410 *pbEof = (pSorter->pReader->pFd==0);

	2411 }else

	2412 #endif

	2413 /*if( !pSorter->bUseThreads )*/ {

	2414 assert( pSorter->pMerger->pTask==(&pSorter->aTask[0]) );

	2415 rc = vdbeMergeEngineStep(pSorter->pMerger, pbEof);

	2416 }

	2417 }else{

	2418 SorterRecord *pFree = pSorter->list.pList;

	2419 pSorter->list.pList = pFree->u.pNext;

	2420 pFree->u.pNext = 0;

	2421 if( pSorter->list.aMemory==0 ) vdbeSorterRecordFree(db, pFree);

	2422 *pbEof = !pSorter->list.pList;

	2423 rc = SQLITE_OK;

	2424 }

	2425 return rc;

	2426 }

	2427

	2428 /*

	2429 ** Return a pointer to a buffer owned by the sorter that contains the

	2430 ** current key.

	2431 */

	2432 static void *vdbeSorterRowkey(

	2433 const VdbeSorter *pSorter, /* Sorter object */

	2434 int *pnKey /* OUT: Size of current key in bytes */

	2435 ){

	2436 void *pKey;

	2437 if( pSorter->bUsePMA ){

	2438 PmaReader *pReader;

	2439 #if SQLITE_MAX_WORKER_THREADS>0

	2440 if( pSorter->bUseThreads ){

	2441 pReader = pSorter->pReader;

	2442 }else

	2443 #endif

	2444 /*if( !pSorter->bUseThreads )*/{

	2445 pReader = &pSorter->pMerger->aReadr[pSorter->pMerger->aTree[1]];

	2446 }

	2447 *pnKey = pReader->nKey;

	2448 pKey = pReader->aKey;

	2449 }else{

	2450 *pnKey = pSorter->list.pList->nVal;

	2451 pKey = SRVAL(pSorter->list.pList);

	2452 }

	2453 return pKey;

	2454 }

	2455

	2456 /*

	2457 ** Copy the current sorter key into the memory cell pOut.

	2458 */

	2459 int sqlite3VdbeSorterRowkey(const VdbeCursor *pCsr, Mem *pOut){

	2460 VdbeSorter *pSorter = pCsr->pSorter;

	2461 void *pKey; int nKey; /* Sorter key to copy into pOut */

	2462

	2463 pKey = vdbeSorterRowkey(pSorter, &nKey);

	2464 if( sqlite3VdbeMemClearAndResize(pOut, nKey) ){

	2465 return SQLITE_NOMEM;

	2466 }

	2467 pOut->n = nKey;

	2468 MemSetTypeFlag(pOut, MEM_Blob);

	2469 memcpy(pOut->z, pKey, nKey);

	2470

	2471 return SQLITE_OK;

	2472 }

	2473

	2474 /*

	2475 ** Compare the key in memory cell pVal with the key that the sorter cursor

	2476 ** passed as the first argument currently points to. For the purposes of

	2477 ** the comparison, ignore the rowid field at the end of each record.

	2478 **

	2479 ** If the sorter cursor key contains any NULL values, consider it to be

	2480 ** less than pVal. Even if pVal also contains NULL values.

	2481 **

	2482 ** If an error occurs, return an SQLite error code (i.e. SQLITE_NOMEM).

	2483 ** Otherwise, set *pRes to a negative, zero or positive value if the

	2484 ** key in pVal is smaller than, equal to or larger than the current sorter

	2485 ** key.

	2486 **

	2487 ** This routine forms the core of the OP_SorterCompare opcode, which in

	2488 ** turn is used to verify uniqueness when constructing a UNIQUE INDEX.

	2489 */

	2490 int sqlite3VdbeSorterCompare(

	2491 const VdbeCursor *pCsr, /* Sorter cursor */

	2492 Mem *pVal, /* Value to compare to current sorter key */

	2493 int nKeyCol, /* Compare this many columns */

	2494 int *pRes /* OUT: Result of comparison */

	2495 ){

	2496 VdbeSorter *pSorter = pCsr->pSorter;

	2497 UnpackedRecord *r2 = pSorter->pUnpacked;

	2498 KeyInfo *pKeyInfo = pCsr->pKeyInfo;

	2499 int i;

	2500 void *pKey; int nKey; /* Sorter key to compare pVal with */

	2501

	2502 if( r2==0 ){

	2503 char *p;

	2504 r2 = pSorter->pUnpacked = sqlite3VdbeAllocUnpackedRecord(pKeyInfo,0,0,&p);

	2505 assert( pSorter->pUnpacked==(UnpackedRecord*)p );

	2506 if( r2==0 ) return SQLITE_NOMEM;

	2507 r2->nField = nKeyCol;

	2508 }

	2509 assert( r2->nField==nKeyCol );

	2510

	2511 pKey = vdbeSorterRowkey(pSorter, &nKey);

	2512 sqlite3VdbeRecordUnpack(pKeyInfo, nKey, pKey, r2);

	2513 for(i=0; i<nKeyCol; i++){

	2514 if( r2->aMem[i].flags & MEM_Null ){

	2515 *pRes = -1;

	2516 return SQLITE_OK;

	2517 }

	2518 }

	2519

	2520 *pRes = sqlite3VdbeRecordCompare(pVal->n, pVal->z, r2);

	2521 return SQLITE_OK;

	2522 }
