third_party/sqlite/sqlite-src-3170000/src/vdbesort.c - Issue 2747283002: [sql] Import reference version of SQLite 3.17..

Issue 2747283002: [sql] Import reference version of SQLite 3.17.. (Closed)

Patch Set: Created 3 years, 9 months ago


OLD	NEW
(Empty)
	1 /*

	2 ** 2011-07-09

	3 **

	4 ** The author disclaims copyright to this source code. In place of

	5 ** a legal notice, here is a blessing:

	6 **

	7 ** May you do good and not evil.

	8 ** May you find forgiveness for yourself and forgive others.

	9 ** May you share freely, never taking more than you give.

	10 **

	11 *************************************************************************

	12 ** This file contains code for the VdbeSorter object, used in concert with

	13 ** a VdbeCursor to sort large numbers of keys for CREATE INDEX statements

	14 ** or by SELECT statements with ORDER BY clauses that cannot be satisfied

	15 ** using indexes and without LIMIT clauses.

	16 **

	17 ** The VdbeSorter object implements a multi-threaded external merge sort

	18 ** algorithm that is efficient even if the number of elements being sorted

	19 ** exceeds the available memory.

	20 **

	21 ** Here is the (internal, non-API) interface between this module and the

	22 ** rest of the SQLite system:

	23 **

	24 ** sqlite3VdbeSorterInit() Create a new VdbeSorter object.

	25 **

	26 ** sqlite3VdbeSorterWrite() Add a single new row to the VdbeSorter

	27 ** object. The row is a binary blob in the

	28 ** OP_MakeRecord format that contains both

	29 ** the ORDER BY key columns and result columns

	30 ** in the case of a SELECT w/ ORDER BY, or

	31 ** the complete record for an index entry

	32 ** in the case of a CREATE INDEX.

	33 **

	34 ** sqlite3VdbeSorterRewind() Sort all content previously added.

	35 ** Position the read cursor on the

	36 ** first sorted element.

	37 **

	38 ** sqlite3VdbeSorterNext() Advance the read cursor to the next sorted

	39 ** element.

	40 **

	41 ** sqlite3VdbeSorterRowkey() Return the complete binary blob for the

	42 ** row currently under the read cursor.

	43 **

	44 ** sqlite3VdbeSorterCompare() Compare the binary blob for the row

	45 ** currently under the read cursor against

	46 ** another binary blob X and report if

	47 ** X is strictly less than the read cursor.

	48 ** Used to enforce uniqueness in a

	49 ** CREATE UNIQUE INDEX statement.

	50 **

	51 ** sqlite3VdbeSorterClose() Close the VdbeSorter object and reclaim

	52 ** all resources.

	53 **

	54 ** sqlite3VdbeSorterReset() Refurbish the VdbeSorter for reuse. This

	55 ** is like Close() followed by Init() only

	56 ** much faster.

	57 **

	58 ** The interfaces above must be called in a particular order. Write() can

	59 ** only occur in between Init()/Reset() and Rewind(). Next(), Rowkey(), and

	60 ** Compare() can only occur in between Rewind() and Close()/Reset(). i.e.

	61 **

	62 ** Init()

	63 ** for each record: Write()

	64 ** Rewind()

	65 ** Rowkey()/Compare()

	66 ** Next()

	67 ** Close()

	68 **

	69 ** Algorithm:

	70 **

	71 ** Records passed to the sorter via calls to Write() are initially held

	72 ** unsorted in main memory. Assuming the amount of memory used never exceeds

	73 ** a threshold, when Rewind() is called the set of records is sorted using

	74 ** an in-memory merge sort. In this case, no temporary files are required

	75 ** and subsequent calls to Rowkey(), Next() and Compare() read records

	76 ** directly from main memory.

	77 **

	78 ** If the amount of space used to store records in main memory exceeds the

	79 ** threshold, then the set of records currently in memory are sorted and

	80 ** written to a temporary file in "Packed Memory Array" (PMA) format.

	81 ** A PMA created at this point is known as a "level-0 PMA". Higher levels

	82 ** of PMAs may be created by merging existing PMAs together - for example

	83 ** merging two or more level-0 PMAs together creates a level-1 PMA.

	84 **

	85 ** The threshold for the amount of main memory to use before flushing

	86 ** records to a PMA is roughly the same as the limit configured for the

	87 ** page-cache of the main database. Specifically, the threshold is set to

	88 ** the value returned by "PRAGMA main.page_size" multiplied by

	89 ** that returned by "PRAGMA main.cache_size", in bytes.

	90 **

	91 ** If the sorter is running in single-threaded mode, then all PMAs generated

	92 ** are appended to a single temporary file. Or, if the sorter is running in

	93 ** multi-threaded mode then up to (N+1) temporary files may be opened, where

	94 ** N is the configured number of worker threads. In this case, instead of

	95 ** sorting the records and writing the PMA to a temporary file itself, the

	96 ** calling thread usually launches a worker thread to do so. Except, if

	97 ** there are already N worker threads running, the main thread does the work

	98 ** itself.

	99 **

	100 ** The sorter is running in multi-threaded mode if (a) the library was built

	101 ** with pre-processor symbol SQLITE_MAX_WORKER_THREADS set to a value greater

	102 ** than zero, and (b) worker threads have been enabled at runtime by calling

	103 ** "PRAGMA threads=N" with some value of N greater than 0.

	104 **

	105 ** When Rewind() is called, any data remaining in memory is flushed to a

	106 ** final PMA. So at this point the data is stored in some number of sorted

	107 ** PMAs within temporary files on disk.

	108 **

	109 ** If there are fewer than SORTER_MAX_MERGE_COUNT PMAs in total and the

	110 ** sorter is running in single-threaded mode, then these PMAs are merged

	111 ** incrementally as keys are retrieved from the sorter by the VDBE. The

	112 ** MergeEngine object, described in further detail below, performs this

	113 ** merge.

	114 **

	115 ** Or, if running in multi-threaded mode, then a background thread is

	116 ** launched to merge the existing PMAs. Once the background thread has

	117 ** merged T bytes of data into a single sorted PMA, the main thread

	118 ** begins reading keys from that PMA while the background thread proceeds

	119 ** with merging the next T bytes of data. And so on.

	120 **

	121 ** Parameter T is set to half the value of the memory threshold used

	122 ** by Write() above to determine when to create a new PMA.

	123 **

	124 ** If there are more than SORTER_MAX_MERGE_COUNT PMAs in total when

	125 ** Rewind() is called, then a hierarchy of incremental-merges is used.

	126 ** First, T bytes of data from the first SORTER_MAX_MERGE_COUNT PMAs on

	127 ** disk are merged together. Then T bytes of data from the second set, and

	128 ** so on, such that no operation ever merges more than SORTER_MAX_MERGE_COUNT

	129 ** PMAs at a time. This is done to improve locality.

	130 **

	131 ** If running in multi-threaded mode and there are more than

	132 ** SORTER_MAX_MERGE_COUNT PMAs on disk when Rewind() is called, then more

	133 ** than one background thread may be created. Specifically, there may be

	134 ** one background thread for each temporary file on disk, and one background

	135 ** thread to merge the output of each of the others to a single PMA for

	136 ** the main thread to read from.

	137 */
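
Review note: the call-order rules in the header comment above (lines 58-67) can be read as a small state machine. The sketch below is only an illustration of that reading, not code from this patch; `ToyState`, `ToyOp` and `toyStep` are hypothetical names, and it assumes that Close() and Reset() are legal from either the writing or the reading phase and that a second Rewind() is not.

```c
#include <assert.h>

/* Hypothetical phases of a sorter, as implied by the header comment. */
typedef enum { TOY_WRITING, TOY_READING, TOY_CLOSED } ToyState;

/* Hypothetical operation codes, one per interface routine. */
typedef enum {
  TOY_OP_WRITE, TOY_OP_REWIND, TOY_OP_ROWKEY, TOY_OP_COMPARE,
  TOY_OP_NEXT, TOY_OP_RESET, TOY_OP_CLOSE
} ToyOp;

/* Return 1 (and update *pState) if op is legal in the current phase,
** or 0 if the call would violate the documented ordering. */
static int toyStep(ToyState *pState, ToyOp op){
  if( *pState==TOY_CLOSED ) return 0;       /* nothing is legal after Close() */
  switch( op ){
    case TOY_OP_WRITE:                      /* only between Init/Reset and Rewind */
      return *pState==TOY_WRITING;
    case TOY_OP_REWIND:                     /* ends the writing phase */
      if( *pState!=TOY_WRITING ) return 0;
      *pState = TOY_READING;
      return 1;
    case TOY_OP_ROWKEY:
    case TOY_OP_COMPARE:
    case TOY_OP_NEXT:                       /* only between Rewind and Close/Reset */
      return *pState==TOY_READING;
    case TOY_OP_RESET:                      /* behaves like Close() then Init() */
      *pState = TOY_WRITING;
      return 1;
    case TOY_OP_CLOSE:
      *pState = TOY_CLOSED;
      return 1;
  }
  return 0;
}
```

A freshly initialized sorter would start in `TOY_WRITING`; a Write() after Rewind() is rejected until Reset() is called.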

	138 #include "sqliteInt.h"

	139 #include "vdbeInt.h"

	140

	141 /*

	142 ** If SQLITE_DEBUG_SORTER_THREADS is defined, this module outputs various

	143 ** messages to stderr that may be helpful in understanding the performance

	144 ** characteristics of the sorter in multi-threaded mode.

	145 */

	146 #if 0

	147 # define SQLITE_DEBUG_SORTER_THREADS 1

	148 #endif

	149

	150 /*

	151 ** Hard-coded maximum amount of data to accumulate in memory before flushing

	152 ** to a level 0 PMA. The purpose of this limit is to prevent various integer

	153 ** overflows. 512MiB.

	154 */

	155 #define SQLITE_MAX_PMASZ (1<<29)
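
Review note: a minimal sketch of how the flush threshold described in the header comment (page size multiplied by cache size) interacts with this hard cap. `toyPmaThreshold` and `TOY_MAX_PMASZ` are hypothetical names, and the sketch assumes `cache_size` is a positive page count; it does not model the minimum PMA size (`VdbeSorter.mnPmaSize`) or any other adjustment the real module may apply.

```c
#include <stdint.h>

#define TOY_MAX_PMASZ (1<<29)   /* mirrors SQLITE_MAX_PMASZ: 512MiB */

/* Return the number of bytes to accumulate in memory before flushing
** to a level-0 PMA, given the page size and the cache size in pages.
** The product is computed in 64 bits so that it cannot overflow
** before the cap is applied. */
static int toyPmaThreshold(int pgsz, int nCachePages){
  int64_t mx = (int64_t)pgsz * (int64_t)nCachePages;
  if( mx>TOY_MAX_PMASZ ) mx = TOY_MAX_PMASZ;
  return (int)mx;
}
```

With a 4096-byte page and a 2000-page cache the threshold is 8192000 bytes; a very large cache is clamped to the 512MiB limit.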

	156

	157 /*

	158 ** Private objects used by the sorter

	159 */

	160 typedef struct MergeEngine MergeEngine; /* Merge PMAs together */

	161 typedef struct PmaReader PmaReader; /* Incrementally read one PMA */

	162 typedef struct PmaWriter PmaWriter; /* Incrementally write one PMA */

	163 typedef struct SorterRecord SorterRecord; /* A record being sorted */

	164 typedef struct SortSubtask SortSubtask; /* A sub-task in the sort process */

	165 typedef struct SorterFile SorterFile; /* Temporary file object wrapper */

	166 typedef struct SorterList SorterList; /* In-memory list of records */

	167 typedef struct IncrMerger IncrMerger; /* Read & merge multiple PMAs */

	168

	169 /*

	170 ** A container for a temp file handle and the current amount of data

	171 ** stored in the file.

	172 */

	173 struct SorterFile {

	174 sqlite3_file *pFd; /* File handle */

	175 i64 iEof; /* Bytes of data stored in pFd */

	176 };

	177

	178 /*

	179 ** An in-memory list of objects to be sorted.

	180 **

	181 ** If aMemory==0 then each object is allocated separately and the objects

	182 ** are connected using SorterRecord.u.pNext. If aMemory!=0 then all objects

	183 ** are stored in the aMemory[] bulk memory, one right after the other, and

	184 ** are connected using SorterRecord.u.iNext.

	185 */

	186 struct SorterList {

	187 SorterRecord *pList; /* Linked list of records */

	188 u8 *aMemory; /* If non-NULL, bulk memory to hold pList */

	189 int szPMA; /* Size of pList as PMA in bytes */

	190 };

	191

	192 /*

	193 ** The MergeEngine object is used to combine two or more smaller PMAs into

	194 ** one big PMA using a merge operation. Separate PMAs all need to be

	195 ** combined into one big PMA in order to be able to step through the sorted

	196 ** records in order.

	197 **

	198 ** The aReadr[] array contains a PmaReader object for each of the PMAs being

	199 ** merged. An aReadr[] object either points to a valid key or else is at EOF.

	200 ** ("EOF" means "End Of File". When aReadr[] is at EOF there is no more data.)

	201 ** For the purposes of the paragraphs below, we assume that the array is

	202 ** actually N elements in size, where N is the smallest power of 2 greater

	203 ** than or equal to the number of PMAs being merged. The extra aReadr[] elements

	204 ** are treated as if they are empty (always at EOF).

	205 **

	206 ** The aTree[] array is also N elements in size. The value of N is stored in

	207 ** the MergeEngine.nTree variable.

	208 **

	209 ** The final (N/2) elements of aTree[] contain the results of comparing

	210 ** pairs of PMA keys together. Element i contains the result of

	211 ** comparing aReadr[2i-N] and aReadr[2i-N+1]. Whichever key is smaller, the

	212 ** aTree element is set to the index of it.

	213 **

	214 ** For the purposes of this comparison, EOF is considered greater than any

	215 ** other key value. If the keys are equal (only possible with two EOF

	216 ** values), it doesn't matter which index is stored.

	217 **

	218 ** The (N/4) elements of aTree[] that precede the final (N/2) described

	219 ** above contain the index of the smallest of each block of 4 PmaReaders.

	220 ** And so on. So that aTree[1] contains the index of the PmaReader that

	221 ** currently points to the smallest key value. aTree[0] is unused.

	222 **

	223 ** Example:

	224 **

	225 ** aReadr[0] -> Banana

	226 ** aReadr[1] -> Feijoa

	227 ** aReadr[2] -> Elderberry

	228 ** aReadr[3] -> Currant

	229 ** aReadr[4] -> Grapefruit

	230 ** aReadr[5] -> Apple

	231 ** aReadr[6] -> Durian

	232 ** aReadr[7] -> EOF

	233 **

	234 ** aTree[] = { X, 5 0, 5 0, 3, 5, 6 }

	235 **

	236 ** The current element is "Apple" (the value of the key indicated by

	237 ** PmaReader 5). When the Next() operation is invoked, PmaReader 5 will

	238 ** be advanced to the next key in its segment. Say the next key is

	239 ** "Eggplant":

	240 **

	241 ** aReadr[5] -> Eggplant

	242 **

	243 ** The contents of aTree[] are updated first by comparing the new PmaReader

	244 ** 5 key to the current key of PmaReader 4 (still "Grapefruit"). The PmaReader

	245 ** 5 value is still smaller, so aTree[6] is set to 5. And so on up the tree.

	246 ** The value of PmaReader 6 - "Durian" - is now smaller than that of PmaReader

	247 ** 5, so aTree[3] is set to 6. Key 0 is smaller than key 6 (Banana<Durian),

	248 ** so the value written into element 1 of the array is 0. As follows:

	249 **

	250 ** aTree[] = { X, 0 0, 6 0, 3, 5, 6 }

	251 **

	252 ** In other words, each time we advance to the next sorter element, log2(N)

	253 ** key comparison operations are required, where N is the number of segments

	254 ** being merged (rounded up to the next power of 2).

	255 */

	256 struct MergeEngine {

	257 int nTree; /* Used size of aTree/aReadr (power of 2) */

	258 SortSubtask *pTask; /* Used by this thread only */

	259 int *aTree; /* Current state of incremental merge */

	260 PmaReader *aReadr; /* Array of PmaReaders to merge data from */

	261 };
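
Review note: the aTree[] layout described in the comment above is a tournament tree. The sketch below rebuilds the fruit example with plain C strings so the two aTree[] states in the comment can be checked by hand. It is an illustration only: `ToyMerge` and the `toy*` functions are hypothetical, keys are compared with `strcmp()` rather than the sorter's record comparison, and a NULL key stands in for EOF.

```c
#include <stddef.h>
#include <string.h>

#define N_READERS 8              /* must be a power of 2 */

/* Hypothetical stand-in for MergeEngine. */
typedef struct ToyMerge {
  const char *aKey[N_READERS];   /* current key of each reader, NULL at EOF */
  int aTree[N_READERS];          /* aTree[1] is the winner; aTree[0] unused */
} ToyMerge;

/* Return whichever of readers i1 and i2 holds the smaller key.
** EOF (NULL) is treated as greater than any key. */
static int toySmaller(const ToyMerge *p, int i1, int i2){
  if( p->aKey[i1]==NULL ) return i2;
  if( p->aKey[i2]==NULL ) return i1;
  return strcmp(p->aKey[i1], p->aKey[i2])<=0 ? i1 : i2;
}

/* Recompute slot iOut of aTree[] from its two children. Slots in the
** final half compare readers directly; the rest compare child winners. */
static void toyCompare(ToyMerge *p, int iOut){
  int i1, i2;
  if( iOut>=N_READERS/2 ){
    i1 = (iOut - N_READERS/2)*2;
    i2 = i1 + 1;
  }else{
    i1 = p->aTree[iOut*2];
    i2 = p->aTree[iOut*2+1];
  }
  p->aTree[iOut] = toySmaller(p, i1, i2);
}

/* Populate every slot, working from the leaves up to aTree[1]. */
static void toyBuild(ToyMerge *p){
  int i;
  p->aTree[0] = -1;
  for(i=N_READERS-1; i>=1; i--) toyCompare(p, i);
}

/* Move reader iReader to key zNext, then redo only the log2(N) slots
** on the path from that reader up to the root. */
static void toyAdvance(ToyMerge *p, int iReader, const char *zNext){
  int i;
  p->aKey[iReader] = zNext;
  for(i=(N_READERS + iReader)/2; i>=1; i=i/2) toyCompare(p, i);
}
```

Building the tree from the eight keys in the comment leaves reader 5 ("Apple") in aTree[1]; advancing reader 5 to "Eggplant" touches only aTree[6], aTree[3] and aTree[1], which is the three-comparison cost the comment describes for N=8.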

	262

	263 /*

	264 ** This object represents a single thread of control in a sort operation.

	265 ** Exactly VdbeSorter.nTask instances of this object are allocated

	266 ** as part of each VdbeSorter object. Instances are never allocated any

	267 ** other way. VdbeSorter.nTask is set to the number of worker threads allowed

	268 ** (see SQLITE_CONFIG_WORKER_THREADS) plus one (the main thread). Thus for

	269 ** single-threaded operation, there is exactly one instance of this object

	270 ** and for multi-threaded operation there are two or more instances.

	271 **

	272 ** Essentially, this structure contains all those fields of the VdbeSorter

	273 ** structure for which each thread requires a separate instance. For example,

	274 ** each thread requires its own UnpackedRecord object to unpack records

	275 ** as part of comparison operations.

	276 **

	277 ** Before a background thread is launched, variable bDone is set to 0. Then,

	278 ** right before it exits, the thread itself sets bDone to 1. This is used for

	279 ** two purposes:

	280 **

	281 ** 1. When flushing the contents of memory to a level-0 PMA on disk, to

	282 ** attempt to select a SortSubtask for which there is not already an

	283 ** active background thread (since doing so causes the main thread

	284 ** to block until it finishes).

	285 **

	286 ** 2. If SQLITE_DEBUG_SORTER_THREADS is defined, to determine if a call

	287 ** to sqlite3ThreadJoin() is likely to block. Cases that are likely to

	288 ** block provoke debugging output.

	289 **

	290 ** In both cases, the effects of the main thread seeing (bDone==0) even

	291 ** after the thread has finished are not dire. So we don't worry about

	292 ** memory barriers and such here.

	293 */

	294 typedef int (*SorterCompare)(SortSubtask*,int*,const void*,int,const void*,int);

	295 struct SortSubtask {

	296 SQLiteThread *pThread; /* Background thread, if any */

	297 int bDone; /* Set if thread is finished but not joined */

	298 VdbeSorter *pSorter; /* Sorter that owns this sub-task */

	299 UnpackedRecord *pUnpacked; /* Space to unpack a record */

	300 SorterList list; /* List for thread to write to a PMA */

	301 int nPMA; /* Number of PMAs currently in file */

	302 SorterCompare xCompare; /* Compare function to use */

	303 SorterFile file; /* Temp file for level-0 PMAs */

	304 SorterFile file2; /* Space for other PMAs */

	305 };

	306

	307

	308 /*

	309 ** Main sorter structure. A single instance of this is allocated for each

	310 ** sorter cursor created by the VDBE.

	311 **

	312 ** mxKeysize:

	313 ** As records are added to the sorter by calls to sqlite3VdbeSorterWrite(),

	314 ** this variable is updated so as to be set to the size on disk of the

	315 ** largest record in the sorter.

	316 */

	317 struct VdbeSorter {

	318 int mnPmaSize; /* Minimum PMA size, in bytes */

	319 int mxPmaSize; /* Maximum PMA size, in bytes. 0==no limit */

	320 int mxKeysize; /* Largest serialized key seen so far */

	321 int pgsz; /* Main database page size */

	322 PmaReader *pReader; /* Read data from here after Rewind() */

	323 MergeEngine *pMerger; /* Or here, if bUseThreads==0 */

	324 sqlite3 *db; /* Database connection */

	325 KeyInfo *pKeyInfo; /* How to compare records */

	326 UnpackedRecord *pUnpacked; /* Used by VdbeSorterCompare() */

	327 SorterList list; /* List of in-memory records */

	328 int iMemory; /* Offset of free space in list.aMemory */

	329 int nMemory; /* Size of list.aMemory allocation in bytes */

	330 u8 bUsePMA; /* True if one or more PMAs created */

	331 u8 bUseThreads; /* True to use background threads */

	332 u8 iPrev; /* Previous thread used to flush PMA */

	333 u8 nTask; /* Size of aTask[] array */

	334 u8 typeMask;

	335 SortSubtask aTask[1]; /* One or more subtasks */

	336 };

	337

	338 #define SORTER_TYPE_INTEGER 0x01

	339 #define SORTER_TYPE_TEXT 0x02

	340

	341 /*

	342 ** An instance of the following object is used to read records out of a

	343 ** PMA, in sorted order. The next key to be read is cached in nKey/aKey.

	344 ** aKey might point into aMap or into aBuffer. If neither of those locations

	345 ** contain a contiguous representation of the key, then aAlloc is allocated

	346 ** and the key is copied into aAlloc and aKey is made to point to aAlloc.

	347 **

	348 ** pFd==0 at EOF.

	349 */

	350 struct PmaReader {

	351 i64 iReadOff; /* Current read offset */

	352 i64 iEof; /* 1 byte past EOF for this PmaReader */

	353 int nAlloc; /* Bytes of space at aAlloc */

	354 int nKey; /* Number of bytes in key */

	355 sqlite3_file *pFd; /* File handle we are reading from */

	356 u8 *aAlloc; /* Space for aKey if aBuffer and pMap won't work */

	357 u8 *aKey; /* Pointer to current key */

	358 u8 *aBuffer; /* Current read buffer */

	359 int nBuffer; /* Size of read buffer in bytes */

	360 u8 *aMap; /* Pointer to mapping of entire file */

	361 IncrMerger *pIncr; /* Incremental merger */

	362 };

	363

	364 /*

	365 ** Normally, a PmaReader object iterates through an existing PMA stored

	366 ** within a temp file. However, if the PmaReader.pIncr variable points to

	367 ** an object of the following type, it may be used to iterate/merge through

	368 ** multiple PMAs simultaneously.

	369 **

	370 ** There are two types of IncrMerger object - single (bUseThread==0) and

	371 ** multi-threaded (bUseThread==1).

	372 **

	373 ** A multi-threaded IncrMerger object uses two temporary files - aFile[0]

	374 ** and aFile[1]. Neither file is allowed to grow to more than mxSz bytes in

	375 ** size. When the IncrMerger is initialized, it reads enough data from

	376 ** pMerger to populate aFile[0]. It then sets variables within the

	377 ** corresponding PmaReader object to read from that file and kicks off

	378 ** a background thread to populate aFile[1] with the next mxSz bytes of

	379 ** sorted record data from pMerger.

	380 **

	381 ** When the PmaReader reaches the end of aFile[0], it blocks until the

	382 ** background thread has finished populating aFile[1]. It then exchanges

	383 ** the contents of the aFile[0] and aFile[1] variables within this structure,

	384 ** sets the PmaReader fields to read from the new aFile[0] and kicks off

	385 ** another background thread to populate the new aFile[1]. And so on, until

	386 ** the contents of pMerger are exhausted.

	387 **

	388 ** A single-threaded IncrMerger does not open any temporary files of its

	389 ** own. Instead, it has exclusive access to mxSz bytes of space beginning

	390 ** at offset iStartOff of file pTask->file2. And instead of using a

	391 ** background thread to prepare data for the PmaReader, with a single

	392 ** threaded IncrMerger the allocated part of pTask->file2 is "refilled" with

	393 ** keys from pMerger by the calling thread whenever the PmaReader runs out

	394 ** of data.

	395 */

	396 struct IncrMerger {

	397 SortSubtask *pTask; /* Task that owns this merger */

	398 MergeEngine *pMerger; /* Merge engine thread reads data from */

	399 i64 iStartOff; /* Offset to start writing file at */

	400 int mxSz; /* Maximum bytes of data to store */

	401 int bEof; /* Set to true when merge is finished */

	402 int bUseThread; /* True to use a bg thread for this object */

	403 SorterFile aFile[2]; /* aFile[0] for reading, [1] for writing */

	404 };

	405

	406 /*

	407 ** An instance of this object is used for writing a PMA.

	408 **

	409 ** The PMA is written one record at a time. Each record is of an arbitrary

	410 ** size. But I/O is more efficient if it occurs in page-sized blocks where

	411 ** each block is aligned on a page boundary. This object caches writes to

	412 ** the PMA so that aligned, page-size blocks are written.

	413 */

	414 struct PmaWriter {

	415 int eFWErr; /* Non-zero if in an error state */

	416 u8 *aBuffer; /* Pointer to write buffer */

	417 int nBuffer; /* Size of write buffer in bytes */

	418 int iBufStart; /* First byte of buffer to write */

	419 int iBufEnd; /* Last byte of buffer to write */

	420 i64 iWriteOff; /* Offset of start of buffer in file */

	421 sqlite3_file *pFd; /* File handle to write to */

	422 };

	423

	424 /*

	425 ** This object is the header on a single record while that record is being

	426 ** held in memory and prior to being written out as part of a PMA.

	427 **

	428 ** How the linked list is connected depends on how memory is being managed

	429 ** by this module. If using a separate allocation for each in-memory record

	430 ** (VdbeSorter.list.aMemory==0), then the list is always connected using the

	431 ** SorterRecord.u.pNext pointers.

	432 **

	433 ** Or, if using the single large allocation method (VdbeSorter.list.aMemory!=0),

	434 ** then while records are being accumulated the list is linked using the

	435 ** SorterRecord.u.iNext offset. This is because the aMemory[] array may

	436 ** be sqlite3Realloc()ed while records are being accumulated. Once the VM

	437 ** has finished passing records to the sorter, or when the in-memory buffer

	438 ** is full, the list is sorted. As part of the sorting process, it is

	439 ** converted to use the SorterRecord.u.pNext pointers. See function

	440 ** vdbeSorterSort() for details.

	441 */

	442 struct SorterRecord {

	443 int nVal; /* Size of the record in bytes */

	444 union {

	445 SorterRecord *pNext; /* Pointer to next record in list */

	446 int iNext; /* Offset within aMemory of next record */

	447 } u;

	448 /* The data for the record immediately follows this header */

	449 };

	450

	451 /* Return a pointer to the buffer containing the record data for SorterRecord

	452 ** object p. Should be used as if:

	453 **

	454 ** void *SRVAL(SorterRecord *p) { return (void*)&p[1]; }

	455 */

	456 #define SRVAL(p) ((void*)((SorterRecord*)(p) + 1))
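
Review note: SRVAL() relies on the record payload being stored immediately after the SorterRecord header in the same allocation. A minimal sketch of that layout, using plain `malloc()` in place of SQLite's allocator; `ToyRecord`, `TOY_SRVAL` and `toyRecordNew` are hypothetical names chosen for illustration, not identifiers from this file.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical copy of the SorterRecord header layout. */
typedef struct ToyRecord ToyRecord;
struct ToyRecord {
  int nVal;                        /* size of the payload in bytes */
  union {
    ToyRecord *pNext;              /* next record, when linked by pointer */
    int iNext;                     /* next record, when linked by offset */
  } u;
  /* payload bytes follow the header */
};

/* Same pointer arithmetic as SRVAL(): step over exactly one header. */
#define TOY_SRVAL(p) ((void*)((ToyRecord*)(p) + 1))

/* Allocate header and payload as a single block and copy nVal bytes
** from pVal into the payload area. Returns NULL if malloc() fails.
** The caller releases the record with free(). */
static ToyRecord *toyRecordNew(const void *pVal, int nVal){
  ToyRecord *p = malloc(sizeof(ToyRecord) + (size_t)nVal);
  if( p==NULL ) return NULL;
  p->nVal = nVal;
  p->u.pNext = NULL;
  memcpy(TOY_SRVAL(p), pVal, (size_t)nVal);
  return p;
}
```

Because the macro adds 1 to a header pointer, the payload address is always `sizeof(ToyRecord)` bytes past the start of the allocation, whichever member of the union is in use.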

	457

	458

	459 /* Maximum number of PMAs that a single MergeEngine can merge */

	460 #define SORTER_MAX_MERGE_COUNT 16

	461

	462 static int vdbeIncrSwap(IncrMerger*);

	463 static void vdbeIncrFree(IncrMerger *);

	464

	465 /*

	466 ** Free all memory belonging to the PmaReader object passed as the

	467 ** argument. All structure fields are set to zero before returning.

	468 */

	469 static void vdbePmaReaderClear(PmaReader *pReadr){

	470 sqlite3_free(pReadr->aAlloc);

	471 sqlite3_free(pReadr->aBuffer);

	472 if( pReadr->aMap ) sqlite3OsUnfetch(pReadr->pFd, 0, pReadr->aMap);

	473 vdbeIncrFree(pReadr->pIncr);

	474 memset(pReadr, 0, sizeof(PmaReader));

	475 }

	476

	477 /*

	478 ** Read the next nByte bytes of data from the PMA p.

	479 ** If successful, set *ppOut to point to a buffer containing the data

	480 ** and return SQLITE_OK. Otherwise, if an error occurs, return an SQLite

	481 ** error code.

	482 **

	483 ** The buffer returned in *ppOut is only valid until the

	484 ** next call to this function.

	485 */

	486 static int vdbePmaReadBlob(

	487 PmaReader *p, /* PmaReader from which to take the blob */

	488 int nByte, /* Bytes of data to read */

	489 u8 **ppOut /* OUT: Pointer to buffer containing data */

	490 ){

	491 int iBuf; /* Offset within buffer to read from */

	492 int nAvail; /* Bytes of data available in buffer */

	493

	494 if( p->aMap ){

	495 *ppOut = &p->aMap[p->iReadOff];

	496 p->iReadOff += nByte;

	497 return SQLITE_OK;

	498 }

	499

	500 assert( p->aBuffer );

	501

	502 /* If there is no more data to be read from the buffer, read the next

	503 ** p->nBuffer bytes of data from the file into it. Or, if there are less

	504 ** than p->nBuffer bytes remaining in the PMA, read all remaining data. */

	505 iBuf = p->iReadOff % p->nBuffer;

	506 if( iBuf==0 ){

	507 int nRead; /* Bytes to read from disk */

	508 int rc; /* sqlite3OsRead() return code */

	509

	510 /* Determine how many bytes of data to read. */

	511 if( (p->iEof - p->iReadOff) > (i64)p->nBuffer ){

	512 nRead = p->nBuffer;

	513 }else{

	514 nRead = (int)(p->iEof - p->iReadOff);

	515 }

	516 assert( nRead>0 );

	517

	518 /* Read data from the file. Return early if an error occurs. */

	519 rc = sqlite3OsRead(p->pFd, p->aBuffer, nRead, p->iReadOff);

	520 assert( rc!=SQLITE_IOERR_SHORT_READ );

	521 if( rc!=SQLITE_OK ) return rc;

	522 }

	523 nAvail = p->nBuffer - iBuf;

	524

	525 if( nByte<=nAvail ){

	526 /* The requested data is available in the in-memory buffer. In this

	527 ** case there is no need to make a copy of the data, just return a

	528 ** pointer into the buffer to the caller. */

	529 *ppOut = &p->aBuffer[iBuf];

	530 p->iReadOff += nByte;

	531 }else{

	532 /* The requested data is not all available in the in-memory buffer.

	533 ** In this case, allocate space at p->aAlloc[] to copy the requested

	534 ** range into. Then return a copy of pointer p->aAlloc to the caller. */

	535 int nRem; /* Bytes remaining to copy */

	536

	537 /* Extend the p->aAlloc[] allocation if required. */

	538 if( p->nAlloc<nByte ){

	539 u8 *aNew;

	540 int nNew = MAX(128, p->nAlloc*2);

	541 while( nByte>nNew ) nNew = nNew*2;

	542 aNew = sqlite3Realloc(p->aAlloc, nNew);

	543 if( !aNew ) return SQLITE_NOMEM_BKPT;

	544 p->nAlloc = nNew;

	545 p->aAlloc = aNew;

	546 }

	547

	548 /* Copy as much data as is available in the buffer into the start of

	549 ** p->aAlloc[]. */

	550 memcpy(p->aAlloc, &p->aBuffer[iBuf], nAvail);

	551 p->iReadOff += nAvail;

	552 nRem = nByte - nAvail;

	553

	554 /* The following loop copies up to p->nBuffer bytes per iteration into

	555 ** the p->aAlloc[] buffer. */

	556 while( nRem>0 ){

	557 int rc; /* vdbePmaReadBlob() return code */

	558 int nCopy; /* Number of bytes to copy */

	559 u8 *aNext; /* Pointer to buffer to copy data from */

	560

	561 nCopy = nRem;

	562 if( nRem>p->nBuffer ) nCopy = p->nBuffer;

	563 rc = vdbePmaReadBlob(p, nCopy, &aNext);

	564 if( rc!=SQLITE_OK ) return rc;

	565 assert( aNext!=p->aAlloc );

	566 memcpy(&p->aAlloc[nByte - nRem], aNext, nCopy);

	567 nRem -= nCopy;

	568 }

	569

	570 *ppOut = p->aAlloc;

	571 }

	572

	573 return SQLITE_OK;

	574 }
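
Review note: two pieces of arithmetic in vdbePmaReadBlob() decide whether a request is served straight from the read buffer or copied into aAlloc[]. The sketch below isolates them for the non-mmap path; `ToyReadPlan`, `toyPlanRead` and `toyGrow` are hypothetical helpers written for this note, and they perform no I/O.

```c
/* Hypothetical summary of how one read request relates to the buffer. */
typedef struct ToyReadPlan {
  int iBuf;        /* offset of the read position within the buffer */
  int nAvail;      /* bytes from iBuf to the end of the buffer */
  int bZeroCopy;   /* 1 if the request fits in the buffer as-is */
} ToyReadPlan;

/* Classify a request for nByte bytes at file offset iReadOff, given a
** buffer of nBuffer bytes that is aligned to multiples of nBuffer. */
static ToyReadPlan toyPlanRead(long long iReadOff, int nBuffer, int nByte){
  ToyReadPlan plan;
  plan.iBuf = (int)(iReadOff % nBuffer);
  plan.nAvail = nBuffer - plan.iBuf;
  plan.bZeroCopy = (nByte<=plan.nAvail);
  return plan;
}

/* Capacity chosen when aAlloc[] must grow to hold nByte bytes: at
** least 128, at least double the old size, then doubled until it fits. */
static int toyGrow(int nAlloc, int nByte){
  int nNew = nAlloc*2>128 ? nAlloc*2 : 128;
  while( nByte>nNew ) nNew = nNew*2;
  return nNew;
}
```

For example, with a 4096-byte buffer a 200-byte request at offset 8000 has only 192 bytes left in the buffer, so it takes the copy path; a first allocation for that request would be 256 bytes.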

	575

	576 /*

	577 ** Read a varint from the stream of data accessed by p. Set *pnOut to

	578 ** the value read.

	579 */

	580 static int vdbePmaReadVarint(PmaReader *p, u64 *pnOut){

	581 int iBuf;

	582

	583 if( p->aMap ){

	584 p->iReadOff += sqlite3GetVarint(&p->aMap[p->iReadOff], pnOut);

	585 }else{

	586 iBuf = p->iReadOff % p->nBuffer;

	587 if( iBuf && (p->nBuffer-iBuf)>=9 ){

	588 p->iReadOff += sqlite3GetVarint(&p->aBuffer[iBuf], pnOut);

	589 }else{

	590 u8 aVarint[16], *a;

	591 int i = 0, rc;

	592 do{

	593 rc = vdbePmaReadBlob(p, 1, &a);

	594 if( rc ) return rc;

	595 aVarint[(i++)&0xf] = a[0];

	596 }while( (a[0]&0x80)!=0 );

	597 sqlite3GetVarint(aVarint, pnOut);

	598 }

	599 }

	600

	601 return SQLITE_OK;

	602 }
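
Review note: for readers unfamiliar with the encoding consumed here, this is a minimal sketch of a varint decoder, written from my understanding of the SQLite varint format (big-endian, seven payload bits in each of the first eight bytes, all eight bits of a ninth byte). `toyGetVarint` is a hypothetical name; it is not the sqlite3GetVarint() implementation and has none of its fast paths.

```c
#include <stdint.h>

/* Decode one varint starting at z[0]. Store the value in *pOut and
** return the number of bytes consumed (1 to 9). The caller must make
** sure that enough bytes are readable at z. */
static int toyGetVarint(const unsigned char *z, uint64_t *pOut){
  uint64_t v = 0;
  int i;
  for(i=0; i<8; i++){
    v = (v<<7) | (uint64_t)(z[i] & 0x7f);
    if( (z[i] & 0x80)==0 ){        /* high bit clear: this is the last byte */
      *pOut = v;
      return i+1;
    }
  }
  v = (v<<8) | (uint64_t)z[8];     /* ninth byte contributes all 8 bits */
  *pOut = v;
  return 9;
}
```

The byte-at-a-time loop in vdbePmaReadVarint() uses the same termination rule (stop at the first byte whose high bit is clear) to gather a varint that may straddle a buffer boundary before decoding it.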

	603

	604 /*

	605 ** Attempt to memory map file pFile. If successful, set *pp to point to the

	606 ** new mapping and return SQLITE_OK. If the mapping is not attempted

	607 ** (because the file is too large or the VFS layer is configured not to use

	608 ** mmap), return SQLITE_OK and set *pp to NULL.

	609 **

	610 ** Or, if an error occurs, return an SQLite error code. The final value of

	611 ** *pp is undefined in this case.

	612 */

	613 static int vdbeSorterMapFile(SortSubtask *pTask, SorterFile *pFile, u8 **pp){

	614 int rc = SQLITE_OK;

	615 if( pFile->iEof<=(i64)(pTask->pSorter->db->nMaxSorterMmap) ){

	616 sqlite3_file *pFd = pFile->pFd;

	617 if( pFd->pMethods->iVersion>=3 ){

	618 rc = sqlite3OsFetch(pFd, 0, (int)pFile->iEof, (void**)pp);

	619 testcase( rc!=SQLITE_OK );

	620 }

	621 }

	622 return rc;

	623 }

	624

	625 /*

	626 ** Attach PmaReader pReadr to file pFile (if it is not already attached to

	627 ** that file) and seek it to offset iOff within the file. Return SQLITE_OK

	628 ** if successful, or an SQLite error code if an error occurs.

	629 */

	630 static int vdbePmaReaderSeek(

	631 SortSubtask *pTask, /* Task context */

	632 PmaReader *pReadr, /* Reader whose cursor is to be moved */

	633 SorterFile *pFile, /* Sorter file to read from */

	634 i64 iOff /* Offset in pFile */

	635 ){

	636 int rc = SQLITE_OK;

	637

	638 assert( pReadr->pIncr==0 || pReadr->pIncr->bEof==0 );

	639

	640 if( sqlite3FaultSim(201) ) return SQLITE_IOERR_READ;

	641 if( pReadr->aMap ){

	642 sqlite3OsUnfetch(pReadr->pFd, 0, pReadr->aMap);

	643 pReadr->aMap = 0;

	644 }

	645 pReadr->iReadOff = iOff;

	646 pReadr->iEof = pFile->iEof;

	647 pReadr->pFd = pFile->pFd;

	648

	649 rc = vdbeSorterMapFile(pTask, pFile, &pReadr->aMap);

	650 if( rc==SQLITE_OK && pReadr->aMap==0 ){

	651 int pgsz = pTask->pSorter->pgsz;

	652 int iBuf = pReadr->iReadOff % pgsz;

	653 if( pReadr->aBuffer==0 ){

	654 pReadr->aBuffer = (u8*)sqlite3Malloc(pgsz);

	655 if( pReadr->aBuffer==0 ) rc = SQLITE_NOMEM_BKPT;

	656 pReadr->nBuffer = pgsz;

	657 }

	658 if( rc==SQLITE_OK && iBuf ){

	659 int nRead = pgsz - iBuf;

	660 if( (pReadr->iReadOff + nRead) > pReadr->iEof ){

	661 nRead = (int)(pReadr->iEof - pReadr->iReadOff);

	662 }

	663 rc = sqlite3OsRead(

	664 pReadr->pFd, &pReadr->aBuffer[iBuf], nRead, pReadr->iReadOff

	665 );

	666 testcase( rc!=SQLITE_OK );

	667 }

	668 }

	669

	670 return rc;

	671 }

	672

	673 /*

	674 ** Advance PmaReader pReadr to the next key in its PMA. Return SQLITE_OK if

	675 ** no error occurs, or an SQLite error code if one does.

	676 */

	677 static int vdbePmaReaderNext(PmaReader *pReadr){

	678 int rc = SQLITE_OK; /* Return Code */

	679 u64 nRec = 0; /* Size of record in bytes */

	680

	681

	682 if( pReadr->iReadOff>=pReadr->iEof ){

	683 IncrMerger *pIncr = pReadr->pIncr;

	684 int bEof = 1;

	685 if( pIncr ){

	686 rc = vdbeIncrSwap(pIncr);

	687 if( rc==SQLITE_OK && pIncr->bEof==0 ){

	688 rc = vdbePmaReaderSeek(

	689 pIncr->pTask, pReadr, &pIncr->aFile[0], pIncr->iStartOff

	690 );

	691 bEof = 0;

	692 }

	693 }

	694

	695 if( bEof ){

	696 /* This is an EOF condition */

	697 vdbePmaReaderClear(pReadr);

	698 testcase( rc!=SQLITE_OK );

	699 return rc;

	700 }

	701 }

	702

	703 if( rc==SQLITE_OK ){

	704 rc = vdbePmaReadVarint(pReadr, &nRec);

	705 }

	706 if( rc==SQLITE_OK ){

	707 pReadr->nKey = (int)nRec;

	708 rc = vdbePmaReadBlob(pReadr, (int)nRec, &pReadr->aKey);

	709 testcase( rc!=SQLITE_OK );

	710 }

	711

	712 return rc;

	713 }

	714

	715 /*

	716 ** Initialize PmaReader pReadr to scan through the PMA stored in file pFile

	717 ** starting at offset iStart and ending at offset iEof-1. This function

	718 ** leaves the PmaReader pointing to the first key in the PMA (or EOF if the

	719 ** PMA is empty).

	720 **

	721 ** If the pnByte parameter is NULL, then it is assumed that the file

	722 ** contains a single PMA, and that that PMA omits the initial length varint.

	723 */

	724 static int vdbePmaReaderInit(

	725 SortSubtask *pTask, /* Task context */

	726 SorterFile *pFile, /* Sorter file to read from */

	727 i64 iStart, /* Start offset in pFile */

	728 PmaReader *pReadr, /* PmaReader to populate */

	729 i64 *pnByte /* IN/OUT: Increment this value by PMA size */

	730 ){

	731 int rc;

	732

	733 assert( pFile->iEof>iStart );

	734 assert( pReadr->aAlloc==0 && pReadr->nAlloc==0 );

	735 assert( pReadr->aBuffer==0 );

	736 assert( pReadr->aMap==0 );

	737

	738 rc = vdbePmaReaderSeek(pTask, pReadr, pFile, iStart);

	739 if( rc==SQLITE_OK ){

	740 u64 nByte = 0; /* Size of PMA in bytes */

	741 rc = vdbePmaReadVarint(pReadr, &nByte);

	742 pReadr->iEof = pReadr->iReadOff + nByte;

	743 *pnByte += nByte;

	744 }

	745

	746 if( rc==SQLITE_OK ){

	747 rc = vdbePmaReaderNext(pReadr);

	748 }

	749 return rc;

	750 }

	751

	752 /*

	753 ** A version of vdbeSorterCompare() that assumes that it has already been

	754 ** determined that the first field of key1 is equal to the first field of

	755 ** key2.

	756 */

	757 static int vdbeSorterCompareTail(

	758 SortSubtask *pTask, /* Subtask context (for pKeyInfo) */

	759 int *pbKey2Cached, /* True if pTask->pUnpacked is pKey2 */

	760 const void *pKey1, int nKey1, /* Left side of comparison */

	761 const void *pKey2, int nKey2 /* Right side of comparison */

	762 ){

	763 UnpackedRecord *r2 = pTask->pUnpacked;

	764 if( *pbKey2Cached==0 ){

	765 sqlite3VdbeRecordUnpack(pTask->pSorter->pKeyInfo, nKey2, pKey2, r2);

	766 *pbKey2Cached = 1;

	767 }

	768 return sqlite3VdbeRecordCompareWithSkip(nKey1, pKey1, r2, 1);

	769 }

	770

	771 /*

	772 ** Compare key1 (buffer pKey1, size nKey1 bytes) with key2 (buffer pKey2,

	773 ** size nKey2 bytes). Use (pTask->pKeyInfo) for the collation sequences

	774 ** used by the comparison. Return the result of the comparison.

	775 **

	776 ** If IN/OUT parameter *pbKey2Cached is true when this function is called,

	777 ** it is assumed that (pTask->pUnpacked) contains the unpacked version

	778 ** of key2. If it is false, (pTask->pUnpacked) is populated with the unpacked

	779 ** version of key2 and *pbKey2Cached set to true before returning.

	780 **

	781 ** If an OOM error is encountered, (pTask->pUnpacked->error_rc) is set

	782 ** to SQLITE_NOMEM.

	783 */

	784 static int vdbeSorterCompare(

	785 SortSubtask *pTask, /* Subtask context (for pKeyInfo) */

	786 int *pbKey2Cached, /* True if pTask->pUnpacked is pKey2 */

	787 const void *pKey1, int nKey1, /* Left side of comparison */

	788 const void *pKey2, int nKey2 /* Right side of comparison */

	789 ){

	790 UnpackedRecord *r2 = pTask->pUnpacked;

	791 if( !*pbKey2Cached ){

	792 sqlite3VdbeRecordUnpack(pTask->pSorter->pKeyInfo, nKey2, pKey2, r2);

	793 *pbKey2Cached = 1;

	794 }

	795 return sqlite3VdbeRecordCompare(nKey1, pKey1, r2);

	796 }

	797

	798 /*

	799 ** A specially optimized version of vdbeSorterCompare() that assumes that

	800 ** the first field of each key is a TEXT value and that the collation

	801 ** sequence to compare them with is BINARY.

	802 */

	803 static int vdbeSorterCompareText(

	804 SortSubtask *pTask, /* Subtask context (for pKeyInfo) */

	805 int *pbKey2Cached, /* True if pTask->pUnpacked is pKey2 */

	806 const void *pKey1, int nKey1, /* Left side of comparison */

	807 const void *pKey2, int nKey2 /* Right side of comparison */

	808 ){

	809 const u8 * const p1 = (const u8 * const)pKey1;

	810 const u8 * const p2 = (const u8 * const)pKey2;

	811 const u8 * const v1 = &p1[ p1[0] ]; /* Pointer to value 1 */

	812 const u8 * const v2 = &p2[ p2[0] ]; /* Pointer to value 2 */

	813

	814 int n1;

	815 int n2;

	816 int res;

	817

	818 getVarint32(&p1[1], n1); n1 = (n1 - 13) / 2;

	819 getVarint32(&p2[1], n2); n2 = (n2 - 13) / 2;

	820 res = memcmp(v1, v2, MIN(n1, n2));

	821 if( res==0 ){

	822 res = n1 - n2;

	823 }

	824

	825 if( res==0 ){

	826 if( pTask->pSorter->pKeyInfo->nField>1 ){

	827 res = vdbeSorterCompareTail(

	828 pTask, pbKey2Cached, pKey1, nKey1, pKey2, nKey2

	829 );

	830 }

	831 }else{

	832 if( pTask->pSorter->pKeyInfo->aSortOrder[0] ){

	833 res = res * -1;

	834 }

	835 }

	836

	837 return res;

	838 }
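
The text fast path above relies on the record-format rule that a TEXT value has an odd serial type of 13 or more, with the payload length given by (serial type - 13) / 2. The following is a minimal standalone sketch of that length calculation and of the memcmp-then-length comparison. The names sketch_text_len and sketch_cmp_text are hypothetical and do not appear in vdbesort.c, and the sketch ignores the header parsing, sort-order flip and tail comparison that the real function performs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: payload length in bytes of a TEXT value, given its
** serial type. Assumes the serial type is odd and >= 13. */
static int sketch_text_len(int serialType){
  assert( serialType>=13 && (serialType & 1) );
  return (serialType - 13) / 2;
}

/* Hypothetical helper: BINARY-collation comparison of two text payloads.
** Compare the common prefix with memcmp(); if the prefixes match, the
** shorter value sorts first. */
static int sketch_cmp_text(
  const unsigned char *v1, int n1,
  const unsigned char *v2, int n2
){
  int nMin = n1<n2 ? n1 : n2;
  int res = memcmp(v1, v2, (size_t)nMin);
  if( res==0 ){
    res = n1 - n2;
  }
  return res;
}
```

The sign of the result is what matters to a caller, not its magnitude.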

	839

	840 /*

	841 ** A specially optimized version of vdbeSorterCompare() that assumes that

	842 ** the first field of each key is an INTEGER value.

	843 */

	844 static int vdbeSorterCompareInt(

	845 SortSubtask *pTask, /* Subtask context (for pKeyInfo) */

	846 int *pbKey2Cached, /* True if pTask->pUnpacked is pKey2 */

	847 const void *pKey1, int nKey1, /* Left side of comparison */

	848 const void *pKey2, int nKey2 /* Right side of comparison */

	849 ){

	850 const u8 * const p1 = (const u8 * const)pKey1;

	851 const u8 * const p2 = (const u8 * const)pKey2;

	852 const int s1 = p1[1]; /* Left hand serial type */

	853 const int s2 = p2[1]; /* Right hand serial type */

	854 const u8 * const v1 = &p1[ p1[0] ]; /* Pointer to value 1 */

	855 const u8 * const v2 = &p2[ p2[0] ]; /* Pointer to value 2 */

	856 int res; /* Return value */

	857

	858 assert( (s1>0 && s1<7) || s1==8 || s1==9 );

	859 assert( (s2>0 && s2<7) || s2==8 || s2==9 );

	860

	861 if( s1>7 && s2>7 ){

	862 res = s1 - s2;

	863 }else{

	864 if( s1==s2 ){

	865 if( (*v1 ^ *v2) & 0x80 ){

	866 /* The two values have different signs */

	867 res = (*v1 & 0x80) ? -1 : +1;

	868 }else{

	869 /* The two values have the same sign. Compare using memcmp(). */

	870 static const u8 aLen[] = {0, 1, 2, 3, 4, 6, 8 };

	871 int i;

	872 res = 0;

	873 for(i=0; i<aLen[s1]; i++){

	874 if( (res = v1[i] - v2[i]) ) break;

	875 }

	876 }

	877 }else{

	878 if( s2>7 ){

	879 res = +1;

	880 }else if( s1>7 ){

	881 res = -1;

	882 }else{

	883 res = s1 - s2;

	884 }

	885 assert( res!=0 );

	886

	887 if( res>0 ){

	888 if( *v1 & 0x80 ) res = -1;

	889 }else{

	890 if( *v2 & 0x80 ) res = +1;

	891 }

	892 }

	893 }

	894

	895 if( res==0 ){

	896 if( pTask->pSorter->pKeyInfo->nField>1 ){

	897 res = vdbeSorterCompareTail(

	898 pTask, pbKey2Cached, pKey1, nKey1, pKey2, nKey2

	899 );

	900 }

	901 }else if( pTask->pSorter->pKeyInfo->aSortOrder[0] ){

	902 res = res * -1;

	903 }

	904

	905 return res;

	906 }
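
The same-serial-type branch above compares two big-endian two's-complement integers of equal width: if the sign bits differ, the negative value is smaller; otherwise a byte-by-byte comparison gives the right order. Below is a minimal sketch of just that step. The name sketch_cmp_be_int is hypothetical, and the width is passed in directly rather than looked up from the serial type as the real code does via aLen[].

```c
#include <assert.h>

/* Hypothetical helper: compare two big-endian two's-complement integers
** that are both nByte bytes wide. Returns <0, 0 or >0. */
static int sketch_cmp_be_int(
  const unsigned char *v1,
  const unsigned char *v2,
  int nByte
){
  int i;
  assert( nByte>0 );
  if( (v1[0] ^ v2[0]) & 0x80 ){
    /* Different signs: the value with the sign bit set is the smaller. */
    return (v1[0] & 0x80) ? -1 : +1;
  }
  /* Same sign: unsigned bytewise order matches numeric order. */
  for(i=0; i<nByte; i++){
    int d = (int)v1[i] - (int)v2[i];
    if( d ) return d;
  }
  return 0;
}
```

The bytewise loop works for two negative values as well, because two's-complement encodings of same-sign numbers keep their numeric order when read as unsigned big-endian bytes.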

	907

	908 /*

	909 ** Initialize the temporary index cursor just opened as a sorter cursor.

	910 **

	911 ** Usually, the sorter module uses the value of (pCsr->pKeyInfo->nField)

	912 ** to determine the number of fields that should be compared from the

	913 ** records being sorted. However, if the value passed as argument nField

	914 ** is non-zero and the sorter is able to guarantee a stable sort, nField

	915 ** is used instead. This is used when sorting records for a CREATE INDEX

	916 ** statement. In this case, keys are always delivered to the sorter in

	917 ** order of the primary key, which happens to make up the final part

	918 ** of the records being sorted. So if the sort is stable, there is never

	919 ** any reason to compare PK fields and they can be ignored for a small

	920 ** performance boost.

	921 **

	922 ** The sorter can guarantee a stable sort when running in single-threaded

	923 ** mode, but not in multi-threaded mode.

	924 **

	925 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	926 */

	927 int sqlite3VdbeSorterInit(

	928 sqlite3 *db, /* Database connection (for malloc()) */

	929 int nField, /* Number of key fields in each record */

	930 VdbeCursor *pCsr /* Cursor that holds the new sorter */

	931 ){

	932 int pgsz; /* Page size of main database */

	933 int i; /* Used to iterate through aTask[] */

	934 VdbeSorter *pSorter; /* The new sorter */

	935 KeyInfo *pKeyInfo; /* Copy of pCsr->pKeyInfo with db==0 */

	936 int szKeyInfo; /* Size of pCsr->pKeyInfo in bytes */

	937 int sz; /* Size of pSorter in bytes */

	938 int rc = SQLITE_OK;

	939 #if SQLITE_MAX_WORKER_THREADS==0

	940 # define nWorker 0

	941 #else

	942 int nWorker;

	943 #endif

	944

	945 /* Initialize the upper limit on the number of worker threads */

	946 #if SQLITE_MAX_WORKER_THREADS>0

	947 if( sqlite3TempInMemory(db) || sqlite3GlobalConfig.bCoreMutex==0 ){

	948 nWorker = 0;

	949 }else{

	950 nWorker = db->aLimit[SQLITE_LIMIT_WORKER_THREADS];

	951 }

	952 #endif

	953

	954 /* Do not allow the total number of threads (main thread + all workers)

	955 ** to exceed the maximum merge count */

	956 #if SQLITE_MAX_WORKER_THREADS>=SORTER_MAX_MERGE_COUNT

	957 if( nWorker>=SORTER_MAX_MERGE_COUNT ){

	958 nWorker = SORTER_MAX_MERGE_COUNT-1;

	959 }

	960 #endif

	961

	962 assert( pCsr->pKeyInfo && pCsr->pBtx==0 );

	963 assert( pCsr->eCurType==CURTYPE_SORTER );

	964 szKeyInfo = sizeof(KeyInfo) + (pCsr->pKeyInfo->nField-1)*sizeof(CollSeq*);

	965 sz = sizeof(VdbeSorter) + nWorker * sizeof(SortSubtask);

	966

	967 pSorter = (VdbeSorter*)sqlite3DbMallocZero(db, sz + szKeyInfo);

	968 pCsr->uc.pSorter = pSorter;

	969 if( pSorter==0 ){

	970 rc = SQLITE_NOMEM_BKPT;

	971 }else{

	972 pSorter->pKeyInfo = pKeyInfo = (KeyInfo*)((u8*)pSorter + sz);

	973 memcpy(pKeyInfo, pCsr->pKeyInfo, szKeyInfo);

	974 pKeyInfo->db = 0;

	975 if( nField && nWorker==0 ){

	976 pKeyInfo->nXField += (pKeyInfo->nField - nField);

	977 pKeyInfo->nField = nField;

	978 }

	979 pSorter->pgsz = pgsz = sqlite3BtreeGetPageSize(db->aDb[0].pBt);

	980 pSorter->nTask = nWorker + 1;

	981 pSorter->iPrev = (u8)(nWorker - 1);

	982 pSorter->bUseThreads = (pSorter->nTask>1);

	983 pSorter->db = db;

	984 for(i=0; i<pSorter->nTask; i++){

	985 SortSubtask *pTask = &pSorter->aTask[i];

	986 pTask->pSorter = pSorter;

	987 }

	988

	989 if( !sqlite3TempInMemory(db) ){

	990 i64 mxCache; /* Cache size in bytes*/

	991 u32 szPma = sqlite3GlobalConfig.szPma;

	992 pSorter->mnPmaSize = szPma * pgsz;

	993

	994 mxCache = db->aDb[0].pSchema->cache_size;

	995 if( mxCache<0 ){

	996 /* A negative cache-size value C indicates that the cache is abs(C)

	997 ** KiB in size. */

	998 mxCache = mxCache * -1024;

	999 }else{

	1000 mxCache = mxCache * pgsz;

	1001 }

	1002 mxCache = MIN(mxCache, SQLITE_MAX_PMASZ);

	1003 pSorter->mxPmaSize = MAX(pSorter->mnPmaSize, (int)mxCache);

	1004

	1005 /* EVIDENCE-OF: R-26747-61719 When the application provides any amount of

	1006 ** scratch memory using SQLITE_CONFIG_SCRATCH, SQLite avoids unnecessary

	1007 ** large heap allocations.

	1008 */

	1009 if( sqlite3GlobalConfig.pScratch==0 ){

	1010 assert( pSorter->iMemory==0 );

	1011 pSorter->nMemory = pgsz;

	1012 pSorter->list.aMemory = (u8*)sqlite3Malloc(pgsz);

	1013 if( !pSorter->list.aMemory ) rc = SQLITE_NOMEM_BKPT;

	1014 }

	1015 }

	1016

	1017 if( (pKeyInfo->nField+pKeyInfo->nXField)<13

	1018 && (pKeyInfo->aColl[0]==0 || pKeyInfo->aColl[0]==db->pDfltColl)

	1019 ){

	1020 pSorter->typeMask = SORTER_TYPE_INTEGER | SORTER_TYPE_TEXT;

	1021 }

	1022 }

	1023

	1024 return rc;

	1025 }

	1026 #undef nWorker /* Defined at the top of this function */
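
The PMA size limits computed in sqlite3VdbeSorterInit() follow the cache-size convention described in the comment: a negative cache size C means abs(C) KiB, a non-negative one is a page count. A minimal sketch of that arithmetic follows. The function names are hypothetical, and the maxPmaSz parameter is a stand-in for SQLITE_MAX_PMASZ, whose value is not shown in this part of the file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a cache-size setting into bytes.
** Negative values are KiB counts; non-negative values are page counts. */
static int64_t sketch_cache_bytes(int64_t cacheSize, int pgsz){
  assert( pgsz>0 );
  if( cacheSize<0 ){
    return cacheSize * -1024;
  }
  return cacheSize * pgsz;
}

/* Hypothetical helper: upper PMA size, i.e. the cache size in bytes,
** capped at maxPmaSz and never smaller than the minimum PMA size. */
static int64_t sketch_mx_pma_size(
  int64_t mnPmaSize,   /* Minimum PMA size in bytes (szPma * pgsz) */
  int64_t cacheSize,   /* Cache-size setting, in the convention above */
  int pgsz,            /* Page size in bytes */
  int64_t maxPmaSz     /* Assumed stand-in for SQLITE_MAX_PMASZ */
){
  int64_t mx = sketch_cache_bytes(cacheSize, pgsz);
  if( mx>maxPmaSz ) mx = maxPmaSz;
  return mx>mnPmaSize ? mx : mnPmaSize;
}
```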

	1027

	1028 /*

	1029 ** Free the list of sorted records starting at pRecord.

	1030 */

	1031 static void vdbeSorterRecordFree(sqlite3 *db, SorterRecord *pRecord){

	1032 SorterRecord *p;

	1033 SorterRecord *pNext;

	1034 for(p=pRecord; p; p=pNext){

	1035 pNext = p->u.pNext;

	1036 sqlite3DbFree(db, p);

	1037 }

	1038 }

	1039

	1040 /*

	1041 ** Free all resources owned by the object indicated by argument pTask. All

	1042 ** fields of *pTask are zeroed before returning.

	1043 */

	1044 static void vdbeSortSubtaskCleanup(sqlite3 *db, SortSubtask *pTask){

	1045 sqlite3DbFree(db, pTask->pUnpacked);

	1046 #if SQLITE_MAX_WORKER_THREADS>0

	1047 /* pTask->list.aMemory can only be non-zero if it was handed memory

	1048 ** from the main thread. That only occurs if SQLITE_MAX_WORKER_THREADS>0 */

	1049 if( pTask->list.aMemory ){

	1050 sqlite3_free(pTask->list.aMemory);

	1051 }else

	1052 #endif

	1053 {

	1054 assert( pTask->list.aMemory==0 );

	1055 vdbeSorterRecordFree(0, pTask->list.pList);

	1056 }

	1057 if( pTask->file.pFd ){

	1058 sqlite3OsCloseFree(pTask->file.pFd);

	1059 }

	1060 if( pTask->file2.pFd ){

	1061 sqlite3OsCloseFree(pTask->file2.pFd);

	1062 }

	1063 memset(pTask, 0, sizeof(SortSubtask));

	1064 }

	1065

	1066 #ifdef SQLITE_DEBUG_SORTER_THREADS

	1067 static void vdbeSorterWorkDebug(SortSubtask *pTask, const char *zEvent){

	1068 i64 t;

	1069 int iTask = (pTask - pTask->pSorter->aTask);

	1070 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	1071 fprintf(stderr, "%lld:%d %s\n", t, iTask, zEvent);

	1072 }

	1073 static void vdbeSorterRewindDebug(const char *zEvent){

	1074 i64 t;

	1075 sqlite3OsCurrentTimeInt64(sqlite3_vfs_find(0), &t);

	1076 fprintf(stderr, "%lld:X %s\n", t, zEvent);

	1077 }

	1078 static void vdbeSorterPopulateDebug(

	1079 SortSubtask *pTask,

	1080 const char *zEvent

	1081 ){

	1082 i64 t;

	1083 int iTask = (pTask - pTask->pSorter->aTask);

	1084 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	1085 fprintf(stderr, "%lld:bg%d %s\n", t, iTask, zEvent);

	1086 }

	1087 static void vdbeSorterBlockDebug(

	1088 SortSubtask *pTask,

	1089 int bBlocked,

	1090 const char *zEvent

	1091 ){

	1092 if( bBlocked ){

	1093 i64 t;

	1094 sqlite3OsCurrentTimeInt64(pTask->pSorter->db->pVfs, &t);

	1095 fprintf(stderr, "%lld:main %s\n", t, zEvent);

	1096 }

	1097 }

	1098 #else

	1099 # define vdbeSorterWorkDebug(x,y)

	1100 # define vdbeSorterRewindDebug(y)

	1101 # define vdbeSorterPopulateDebug(x,y)

	1102 # define vdbeSorterBlockDebug(x,y,z)

	1103 #endif

	1104

	1105 #if SQLITE_MAX_WORKER_THREADS>0

	1106 /*

	1107 ** Join thread pTask->thread.

	1108 */

	1109 static int vdbeSorterJoinThread(SortSubtask *pTask){

	1110 int rc = SQLITE_OK;

	1111 if( pTask->pThread ){

	1112 #ifdef SQLITE_DEBUG_SORTER_THREADS

	1113 int bDone = pTask->bDone;

	1114 #endif

	1115 void *pRet = SQLITE_INT_TO_PTR(SQLITE_ERROR);

	1116 vdbeSorterBlockDebug(pTask, !bDone, "enter");

	1117 (void)sqlite3ThreadJoin(pTask->pThread, &pRet);

	1118 vdbeSorterBlockDebug(pTask, !bDone, "exit");

	1119 rc = SQLITE_PTR_TO_INT(pRet);

	1120 assert( pTask->bDone==1 );

	1121 pTask->bDone = 0;

	1122 pTask->pThread = 0;

	1123 }

	1124 return rc;

	1125 }

	1126

	1127 /*

	1128 ** Launch a background thread to run xTask(pIn).

	1129 */

	1130 static int vdbeSorterCreateThread(

	1131 SortSubtask *pTask, /* Thread will use this task object */

	1132 void *(*xTask)(void*), /* Routine to run in a separate thread */

	1133 void *pIn /* Argument passed into xTask() */

	1134 ){

	1135 assert( pTask->pThread==0 && pTask->bDone==0 );

	1136 return sqlite3ThreadCreate(&pTask->pThread, xTask, pIn);

	1137 }

	1138

	1139 /*

	1140 ** Join all outstanding threads launched by SorterWrite() to create

	1141 ** level-0 PMAs.

	1142 */

	1143 static int vdbeSorterJoinAll(VdbeSorter *pSorter, int rcin){

	1144 int rc = rcin;

	1145 int i;

	1146

	1147 /* This function is always called by the main user thread.

	1148 **

	1149 ** If this function is being called after SorterRewind() has been called,

	1150 ** it is possible that thread pSorter->aTask[pSorter->nTask-1].pThread

	1151 ** is currently attempting to join one of the other threads. To avoid a race

	1152 ** condition where this thread also attempts to join the same object, join

	1153 ** thread pSorter->aTask[pSorter->nTask-1].pThread first. */

	1154 for(i=pSorter->nTask-1; i>=0; i--){

	1155 SortSubtask *pTask = &pSorter->aTask[i];

	1156 int rc2 = vdbeSorterJoinThread(pTask);

	1157 if( rc==SQLITE_OK ) rc = rc2;

	1158 }

	1159 return rc;

	1160 }

	1161 #else

	1162 # define vdbeSorterJoinAll(x,rcin) (rcin)

	1163 # define vdbeSorterJoinThread(pTask) SQLITE_OK

	1164 #endif

	1165

	1166 /*

	1167 ** Allocate a new MergeEngine object capable of handling up to

	1168 ** nReader PmaReader inputs.

	1169 **

	1170 ** nReader is automatically rounded up to the next power of two.

	1171 ** nReader may not exceed SORTER_MAX_MERGE_COUNT even after rounding up.

	1172 */

	1173 static MergeEngine *vdbeMergeEngineNew(int nReader){

	1174 int N = 2; /* Smallest power of two >= nReader */

	1175 int nByte; /* Total bytes of space to allocate */

	1176 MergeEngine *pNew; /* Pointer to allocated object to return */

	1177

	1178 assert( nReader<=SORTER_MAX_MERGE_COUNT );

	1179

	1180 while( N<nReader ) N += N;

	1181 nByte = sizeof(MergeEngine) + N * (sizeof(int) + sizeof(PmaReader));

	1182

	1183 pNew = sqlite3FaultSim(100) ? 0 : (MergeEngine*)sqlite3MallocZero(nByte);

	1184 if( pNew ){

	1185 pNew->nTree = N;

	1186 pNew->pTask = 0;

	1187 pNew->aReadr = (PmaReader*)&pNew[1];

	1188 pNew->aTree = (int*)&pNew->aReadr[N];

	1189 }

	1190 return pNew;

	1191 }
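
The rounding step in vdbeMergeEngineNew() can be looked at in isolation: start at 2 and double until the value is at least nReader. This is a small sketch of that loop only. The name sketch_round_up_pow2 is hypothetical, and it assumes nReader is small enough that doubling cannot overflow, as the SORTER_MAX_MERGE_COUNT assertion guarantees in the real function.

```c
#include <assert.h>

/* Hypothetical helper: smallest power of two that is >= nReader, with a
** floor of 2, using the same doubling loop as vdbeMergeEngineNew(). */
static int sketch_round_up_pow2(int nReader){
  int N = 2;
  assert( nReader>=0 && nReader<=(1<<30) );
  while( N<nReader ) N += N;
  return N;
}
```

A power-of-two leaf count lets the merge tree be stored as a flat array in which the parent of slot i is slot i/2.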

	1192

	1193 /*

	1194 ** Free the MergeEngine object passed as the only argument.

	1195 */

	1196 static void vdbeMergeEngineFree(MergeEngine *pMerger){

	1197 int i;

	1198 if( pMerger ){

	1199 for(i=0; i<pMerger->nTree; i++){

	1200 vdbePmaReaderClear(&pMerger->aReadr[i]);

	1201 }

	1202 }

	1203 sqlite3_free(pMerger);

	1204 }

	1205

	1206 /*

	1207 ** Free all resources associated with the IncrMerger object indicated by

	1208 ** the first argument.

	1209 */

	1210 static void vdbeIncrFree(IncrMerger *pIncr){

	1211 if( pIncr ){

	1212 #if SQLITE_MAX_WORKER_THREADS>0

	1213 if( pIncr->bUseThread ){

	1214 vdbeSorterJoinThread(pIncr->pTask);

	1215 if( pIncr->aFile[0].pFd ) sqlite3OsCloseFree(pIncr->aFile[0].pFd);

	1216 if( pIncr->aFile[1].pFd ) sqlite3OsCloseFree(pIncr->aFile[1].pFd);

	1217 }

	1218 #endif

	1219 vdbeMergeEngineFree(pIncr->pMerger);

	1220 sqlite3_free(pIncr);

	1221 }

	1222 }

	1223

	1224 /*

	1225 ** Reset a sorting cursor back to its original empty state.

	1226 */

	1227 void sqlite3VdbeSorterReset(sqlite3 *db, VdbeSorter *pSorter){

	1228 int i;

	1229 (void)vdbeSorterJoinAll(pSorter, SQLITE_OK);

	1230 assert( pSorter->bUseThreads || pSorter->pReader==0 );

	1231 #if SQLITE_MAX_WORKER_THREADS>0

	1232 if( pSorter->pReader ){

	1233 vdbePmaReaderClear(pSorter->pReader);

	1234 sqlite3DbFree(db, pSorter->pReader);

	1235 pSorter->pReader = 0;

	1236 }

	1237 #endif

	1238 vdbeMergeEngineFree(pSorter->pMerger);

	1239 pSorter->pMerger = 0;

	1240 for(i=0; i<pSorter->nTask; i++){

	1241 SortSubtask *pTask = &pSorter->aTask[i];

	1242 vdbeSortSubtaskCleanup(db, pTask);

	1243 pTask->pSorter = pSorter;

	1244 }

	1245 if( pSorter->list.aMemory==0 ){

	1246 vdbeSorterRecordFree(0, pSorter->list.pList);

	1247 }

	1248 pSorter->list.pList = 0;

	1249 pSorter->list.szPMA = 0;

	1250 pSorter->bUsePMA = 0;

	1251 pSorter->iMemory = 0;

	1252 pSorter->mxKeysize = 0;

	1253 sqlite3DbFree(db, pSorter->pUnpacked);

	1254 pSorter->pUnpacked = 0;

	1255 }

	1256

	1257 /*

	1258 ** Free any cursor components allocated by sqlite3VdbeSorterXXX routines.

	1259 */

	1260 void sqlite3VdbeSorterClose(sqlite3 *db, VdbeCursor *pCsr){

	1261 VdbeSorter *pSorter;

	1262 assert( pCsr->eCurType==CURTYPE_SORTER );

	1263 pSorter = pCsr->uc.pSorter;

	1264 if( pSorter ){

	1265 sqlite3VdbeSorterReset(db, pSorter);

	1266 sqlite3_free(pSorter->list.aMemory);

	1267 sqlite3DbFree(db, pSorter);

	1268 pCsr->uc.pSorter = 0;

	1269 }

	1270 }

	1271

	1272 #if SQLITE_MAX_MMAP_SIZE>0

	1273 /*

	1274 ** The first argument is a file-handle open on a temporary file. The file

	1275 ** is guaranteed to be nByte bytes or smaller in size. This function

	1276 ** attempts to extend the file to nByte bytes in size and to ensure that

	1277 ** the VFS has memory mapped it.

	1278 **

	1279 ** Whether or not the file does end up memory mapped of course depends on

	1280 ** the specific VFS implementation.

	1281 */

	1282 static void vdbeSorterExtendFile(sqlite3 *db, sqlite3_file *pFd, i64 nByte){

	1283 if( nByte<=(i64)(db->nMaxSorterMmap) && pFd->pMethods->iVersion>=3 ){

	1284 void *p = 0;

	1285 int chunksize = 4*1024;

	1286 sqlite3OsFileControlHint(pFd, SQLITE_FCNTL_CHUNK_SIZE, &chunksize);

	1287 sqlite3OsFileControlHint(pFd, SQLITE_FCNTL_SIZE_HINT, &nByte);

	1288 sqlite3OsFetch(pFd, 0, (int)nByte, &p);

	1289 sqlite3OsUnfetch(pFd, 0, p);

	1290 }

	1291 }

	1292 #else

	1293 # define vdbeSorterExtendFile(x,y,z)

	1294 #endif

	1295

	1296 /*

	1297 ** Allocate space for a file-handle and open a temporary file. If successful,

	1298 ** set *ppFd to point to the malloc'd file-handle and return SQLITE_OK.

	1299 ** Otherwise, set *ppFd to 0 and return an SQLite error code.

	1300 */

	1301 static int vdbeSorterOpenTempFile(

	1302 sqlite3 *db, /* Database handle doing sort */

	1303 i64 nExtend, /* Attempt to extend file to this size */

	1304 sqlite3_file **ppFd

	1305 ){

	1306 int rc;

	1307 if( sqlite3FaultSim(202) ) return SQLITE_IOERR_ACCESS;

	1308 rc = sqlite3OsOpenMalloc(db->pVfs, 0, ppFd,

	1309 SQLITE_OPEN_TEMP_JOURNAL |

	1310 SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |

	1311 SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_DELETEONCLOSE, &rc

	1312 );

	1313 if( rc==SQLITE_OK ){

	1314 i64 max = SQLITE_MAX_MMAP_SIZE;

	1315 sqlite3OsFileControlHint(*ppFd, SQLITE_FCNTL_MMAP_SIZE, (void*)&max);

	1316 if( nExtend>0 ){

	1317 vdbeSorterExtendFile(db, *ppFd, nExtend);

	1318 }

	1319 }

	1320 return rc;

	1321 }

	1322

	1323 /*

	1324 ** If it has not already been allocated, allocate the UnpackedRecord

	1325 ** structure at pTask->pUnpacked. Return SQLITE_OK if successful (or

	1326 ** if no allocation was required), or SQLITE_NOMEM otherwise.

	1327 */

	1328 static int vdbeSortAllocUnpacked(SortSubtask *pTask){

	1329 if( pTask->pUnpacked==0 ){

	1330 pTask->pUnpacked = sqlite3VdbeAllocUnpackedRecord(pTask->pSorter->pKeyInfo);

	1331 if( pTask->pUnpacked==0 ) return SQLITE_NOMEM_BKPT;

	1332 pTask->pUnpacked->nField = pTask->pSorter->pKeyInfo->nField;

	1333 pTask->pUnpacked->errCode = 0;

	1334 }

	1335 return SQLITE_OK;

	1336 }

	1337

	1338

	1339 /*

	1340 ** Merge the two sorted lists p1 and p2 into a single list.

	1341 */

	1342 static SorterRecord *vdbeSorterMerge(

	1343 SortSubtask *pTask, /* Calling thread context */

	1344 SorterRecord *p1, /* First list to merge */

	1345 SorterRecord *p2 /* Second list to merge */

	1346 ){

	1347 SorterRecord *pFinal = 0;

	1348 SorterRecord **pp = &pFinal;

	1349 int bCached = 0;

	1350

	1351 assert( p1!=0 && p2!=0 );

	1352 for(;;){

	1353 int res;

	1354 res = pTask->xCompare(

	1355 pTask, &bCached, SRVAL(p1), p1->nVal, SRVAL(p2), p2->nVal

	1356 );

	1357

	1358 if( res<=0 ){

	1359 *pp = p1;

	1360 pp = &p1->u.pNext;

	1361 p1 = p1->u.pNext;

	1362 if( p1==0 ){

	1363 *pp = p2;

	1364 break;

	1365 }

	1366 }else{

	1367 *pp = p2;

	1368 pp = &p2->u.pNext;

	1369 p2 = p2->u.pNext;

	1370 bCached = 0;

	1371 if( p2==0 ){

	1372 *pp = p1;

	1373 break;

	1374 }

	1375 }

	1376 }

	1377 return pFinal;

	1378 }

	1379

	1380 /*

	1381 ** Return the SorterCompare function to compare values collected by the

	1382 ** sorter object passed as the only argument.

	1383 */

	1384 static SorterCompare vdbeSorterGetCompare(VdbeSorter *p){

	1385 if( p->typeMask==SORTER_TYPE_INTEGER ){

	1386 return vdbeSorterCompareInt;

	1387 }else if( p->typeMask==SORTER_TYPE_TEXT ){

	1388 return vdbeSorterCompareText;

	1389 }

	1390 return vdbeSorterCompare;

	1391 }

	1392

	1393 /*

	1394 ** Sort the linked list of records headed at pTask->pList. Return

	1395 ** SQLITE_OK if successful, or an SQLite error code (i.e. SQLITE_NOMEM) if

	1396 ** an error occurs.

	1397 */

	1398 static int vdbeSorterSort(SortSubtask *pTask, SorterList *pList){

	1399 int i;

	1400 SorterRecord **aSlot;

	1401 SorterRecord *p;

	1402 int rc;

	1403

	1404 rc = vdbeSortAllocUnpacked(pTask);

	1405 if( rc!=SQLITE_OK ) return rc;

	1406

	1407 p = pList->pList;

	1408 pTask->xCompare = vdbeSorterGetCompare(pTask->pSorter);

	1409

	1410 aSlot = (SorterRecord **)sqlite3MallocZero(64 * sizeof(SorterRecord *));

	1411 if( !aSlot ){

	1412 return SQLITE_NOMEM_BKPT;

	1413 }

	1414

	1415 while( p ){

	1416 SorterRecord *pNext;

	1417 if( pList->aMemory ){

	1418 if( (u8*)p==pList->aMemory ){

	1419 pNext = 0;

	1420 }else{

	1421 assert( p->u.iNext<sqlite3MallocSize(pList->aMemory) );

	1422 pNext = (SorterRecord*)&pList->aMemory[p->u.iNext];

	1423 }

	1424 }else{

	1425 pNext = p->u.pNext;

	1426 }

	1427

	1428 p->u.pNext = 0;

	1429 for(i=0; aSlot[i]; i++){

	1430 p = vdbeSorterMerge(pTask, p, aSlot[i]);

	1431 aSlot[i] = 0;

	1432 }

	1433 aSlot[i] = p;

	1434 p = pNext;

	1435 }

	1436

	1437 p = 0;

	1438 for(i=0; i<64; i++){

	1439 if( aSlot[i]==0 ) continue;

	1440 p = p ? vdbeSorterMerge(pTask, p, aSlot[i]) : aSlot[i];

	1441 }

	1442 pList->pList = p;

	1443

	1444 sqlite3_free(aSlot);

	1445 assert( pTask->pUnpacked->errCode==SQLITE_OK

	1446 || pTask->pUnpacked->errCode==SQLITE_NOMEM

	1447 );

	1448 return pTask->pUnpacked->errCode;

	1449 }
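
vdbeSorterSort() is a bottom-up merge sort over a linked list. Each input node is merged with the runs in aSlot[0], aSlot[1], ... until an empty slot is found, much like carrying in a binary counter, and the remaining slots are merged at the end. The sketch below shows the same scheme on a plain list of ints. Node, sketch_merge and sketch_sort are hypothetical names, and the int comparison stands in for the pTask->xCompare callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the real code sorts SorterRecord objects. */
typedef struct Node Node;
struct Node { int val; Node *pNext; };

/* Merge two non-empty sorted lists into one sorted list. */
static Node *sketch_merge(Node *p1, Node *p2){
  Node *pFinal = 0;
  Node **pp = &pFinal;
  assert( p1!=0 && p2!=0 );
  for(;;){
    if( p1->val<=p2->val ){
      *pp = p1; pp = &p1->pNext; p1 = p1->pNext;
      if( p1==0 ){ *pp = p2; break; }
    }else{
      *pp = p2; pp = &p2->pNext; p2 = p2->pNext;
      if( p2==0 ){ *pp = p1; break; }
    }
  }
  return pFinal;
}

/* Sort a list. aSlot[i] is either empty or holds a sorted run; adding a
** node merges it upward through occupied slots until a free one is found. */
static Node *sketch_sort(Node *p){
  Node *aSlot[64] = {0};
  int i;
  while( p ){
    Node *pNext = p->pNext;
    p->pNext = 0;
    for(i=0; aSlot[i]; i++){
      p = sketch_merge(p, aSlot[i]);
      aSlot[i] = 0;
    }
    aSlot[i] = p;
    p = pNext;
  }
  /* Merge whatever runs remain, smallest slot first. */
  p = 0;
  for(i=0; i<64; i++){
    if( aSlot[i]==0 ) continue;
    p = p ? sketch_merge(p, aSlot[i]) : aSlot[i];
  }
  return p;
}
```

Because a run in slot i contains 2^i nodes, 64 slots are more than enough for any list that fits in memory.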

	1450

	1451 /*

	1452 ** Initialize a PMA-writer object.

	1453 */

	1454 static void vdbePmaWriterInit(

	1455 sqlite3_file *pFd, /* File handle to write to */

	1456 PmaWriter *p, /* Object to populate */

	1457 int nBuf, /* Buffer size */

	1458 i64 iStart /* Offset of pFd to begin writing at */

	1459 ){

	1460 memset(p, 0, sizeof(PmaWriter));

	1461 p->aBuffer = (u8*)sqlite3Malloc(nBuf);

	1462 if( !p->aBuffer ){

	1463 p->eFWErr = SQLITE_NOMEM_BKPT;

	1464 }else{

	1465 p->iBufEnd = p->iBufStart = (iStart % nBuf);

	1466 p->iWriteOff = iStart - p->iBufStart;

	1467 p->nBuffer = nBuf;

	1468 p->pFd = pFd;

	1469 }

	1470 }

	1471

	1472 /*

	1473 ** Write nData bytes of data to the PMA. Return SQLITE_OK

	1474 ** if successful, or an SQLite error code if an error occurs.

	1475 */

	1476 static void vdbePmaWriteBlob(PmaWriter *p, u8 *pData, int nData){

	1477 int nRem = nData;

	1478 while( nRem>0 && p->eFWErr==0 ){

	1479 int nCopy = nRem;

	1480 if( nCopy>(p->nBuffer - p->iBufEnd) ){

	1481 nCopy = p->nBuffer - p->iBufEnd;

	1482 }

	1483

	1484 memcpy(&p->aBuffer[p->iBufEnd], &pData[nData-nRem], nCopy);

	1485 p->iBufEnd += nCopy;

	1486 if( p->iBufEnd==p->nBuffer ){

	1487 p->eFWErr = sqlite3OsWrite(p->pFd,

	1488 &p->aBuffer[p->iBufStart], p->iBufEnd - p->iBufStart,

	1489 p->iWriteOff + p->iBufStart

	1490 );

	1491 p->iBufStart = p->iBufEnd = 0;

	1492 p->iWriteOff += p->nBuffer;

	1493 }

	1494 assert( p->iBufEnd<p->nBuffer );

	1495

	1496 nRem -= nCopy;

	1497 }

	1498 }

	1499

	1500 /*

	1501 ** Flush any buffered data to disk and clean up the PMA-writer object.

	1502 ** The results of using the PMA-writer after this call are undefined.

	1503 ** Return SQLITE_OK if flushing the buffered data succeeds or is not

	1504 ** required. Otherwise, return an SQLite error code.

	1505 **

	1506 ** Before returning, set *piEof to the offset immediately following the

	1507 ** last byte written to the file.

	1508 */

	1509 static int vdbePmaWriterFinish(PmaWriter *p, i64 *piEof){

	1510 int rc;

	1511 if( p->eFWErr==0 && ALWAYS(p->aBuffer) && p->iBufEnd>p->iBufStart ){

	1512 p->eFWErr = sqlite3OsWrite(p->pFd,

	1513 &p->aBuffer[p->iBufStart], p->iBufEnd - p->iBufStart,

	1514 p->iWriteOff + p->iBufStart

	1515 );

	1516 }

	1517 *piEof = (p->iWriteOff + p->iBufEnd);

	1518 sqlite3_free(p->aBuffer);

	1519 rc = p->eFWErr;

	1520 memset(p, 0, sizeof(PmaWriter));

	1521 return rc;

	1522 }

	1523

	1524 /*

	1525 ** Write value iVal encoded as a varint to the PMA. Return

	1526 ** SQLITE_OK if successful, or an SQLite error code if an error occurs.

	1527 */

	1528 static void vdbePmaWriteVarint(PmaWriter *p, u64 iVal){

	1529 int nByte;

	1530 u8 aByte[10];

	1531 nByte = sqlite3PutVarint(aByte, iVal);

	1532 vdbePmaWriteBlob(p, aByte, nByte);

	1533 }

	1534

	1535 /*

	1536 ** Write the current contents of in-memory linked-list pList to a level-0

	1537 ** PMA in the temp file belonging to sub-task pTask. Return SQLITE_OK if

	1538 ** successful, or an SQLite error code otherwise.

	1539 **

	1540 ** The format of a PMA is:

	1541 **

	1542 ** * A varint. This varint contains the total number of bytes of content

	1543 ** in the PMA (not including the varint itself).

	1544 **

	1545 ** * One or more records packed end-to-end in order of ascending keys.

	1546 ** Each record consists of a varint followed by a blob of data (the

	1547 ** key). The varint is the number of bytes in the blob of data.

	1548 */

	1549 static int vdbeSorterListToPMA(SortSubtask *pTask, SorterList *pList){

	1550 sqlite3 *db = pTask->pSorter->db;

	1551 int rc = SQLITE_OK; /* Return code */

	1552 PmaWriter writer; /* Object used to write to the file */

	1553

	1554 #ifdef SQLITE_DEBUG

	1555 /* Set iSz to the expected size of file pTask->file after writing the PMA.

	1556 ** This is used by an assert() statement at the end of this function. */

	1557 i64 iSz = pList->szPMA + sqlite3VarintLen(pList->szPMA) + pTask->file.iEof;

	1558 #endif

	1559

	1560 vdbeSorterWorkDebug(pTask, "enter");

	1561 memset(&writer, 0, sizeof(PmaWriter));

	1562 assert( pList->szPMA>0 );

	1563

	1564 /* If the first temporary PMA file has not been opened, open it now. */

	1565 if( pTask->file.pFd==0 ){

	1566 rc = vdbeSorterOpenTempFile(db, 0, &pTask->file.pFd);

	1567 assert( rc!=SQLITE_OK || pTask->file.pFd );

	1568 assert( pTask->file.iEof==0 );

	1569 assert( pTask->nPMA==0 );

	1570 }

	1571

	1572 /* Try to get the file to memory map */

	1573 if( rc==SQLITE_OK ){

	1574 vdbeSorterExtendFile(db, pTask->file.pFd, pTask->file.iEof+pList->szPMA+9);

	1575 }

	1576

	1577 /* Sort the list */

	1578 if( rc==SQLITE_OK ){

	1579 rc = vdbeSorterSort(pTask, pList);

	1580 }

	1581

	1582 if( rc==SQLITE_OK ){

	1583 SorterRecord *p;

	1584 SorterRecord *pNext = 0;

	1585

	1586 vdbePmaWriterInit(pTask->file.pFd, &writer, pTask->pSorter->pgsz,

	1587 pTask->file.iEof);

	1588 pTask->nPMA++;

	1589 vdbePmaWriteVarint(&writer, pList->szPMA);

	1590 for(p=pList->pList; p; p=pNext){

	1591 pNext = p->u.pNext;

	1592 vdbePmaWriteVarint(&writer, p->nVal);

	1593 vdbePmaWriteBlob(&writer, SRVAL(p), p->nVal);

	1594 if( pList->aMemory==0 ) sqlite3_free(p);

	1595 }

	1596 pList->pList = p;

	1597 rc = vdbePmaWriterFinish(&writer, &pTask->file.iEof);

	1598 }

	1599

	1600 vdbeSorterWorkDebug(pTask, "exit");

	1601 assert( rc!=SQLITE_OK || pList->pList==0 );

	1602 assert( rc!=SQLITE_OK || pTask->file.iEof==iSz );

	1603 return rc;

	1604 }
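
The PMA layout described in the header comment of vdbeSorterListToPMA() (a size varint, then records that are each a length varint followed by the key bytes) can be illustrated with an in-memory buffer. This is a sketch under stated assumptions: sketch_put_varint and sketch_build_pma are hypothetical names, the varint encoder only handles values below 2^56 (big-endian groups of 7 bits with a continuation bit), keys are NUL-terminated strings already in ascending order, and the caller supplies a large enough output buffer. The real code writes through a PmaWriter to a temp file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified varint encoder: big-endian base-128 with the
** high bit set on every byte except the last. Supports v < 2^56 only.
** Returns the number of bytes written (1 to 8). */
static int sketch_put_varint(unsigned char *z, uint64_t v){
  unsigned char buf[8];
  int n = 0;
  int i;
  assert( v < ((uint64_t)1<<56) );
  do{
    buf[n++] = (unsigned char)(v & 0x7f);
    v >>= 7;
  }while( v!=0 );
  for(i=0; i<n; i++){
    z[i] = (unsigned char)(buf[n-1-i] | (i<n-1 ? 0x80 : 0x00));
  }
  return n;
}

/* Hypothetical helper: serialize nRec keys into aOut as one PMA.
** Layout: varint(total content bytes), then varint(len)+bytes per key.
** Returns the total number of bytes written to aOut. */
static size_t sketch_build_pma(
  unsigned char *aOut, const char **azKey, int nRec
){
  uint64_t szPMA = 0;
  size_t off = 0;
  unsigned char tmp[8];
  int i;
  /* First pass: content size, excluding the leading size varint. */
  for(i=0; i<nRec; i++){
    size_t n = strlen(azKey[i]);
    szPMA += (uint64_t)sketch_put_varint(tmp, (uint64_t)n) + n;
  }
  off += (size_t)sketch_put_varint(&aOut[off], szPMA);
  /* Second pass: the records themselves, packed end-to-end. */
  for(i=0; i<nRec; i++){
    size_t n = strlen(azKey[i]);
    off += (size_t)sketch_put_varint(&aOut[off], (uint64_t)n);
    memcpy(&aOut[off], azKey[i], n);
    off += n;
  }
  return off;
}
```

The leading size varint is what lets a reader find the end of one PMA when several are stored back to back in the same temp file.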

	1605

	1606 /*

	1607 ** Advance the MergeEngine to its next entry.

	1608 ** Set *pbEof to true if there is no next entry because

	1609 ** the MergeEngine has reached the end of all its inputs.

	1610 **

	1611 ** Return SQLITE_OK if successful or an error code if an error occurs.

	1612 */

	1613 static int vdbeMergeEngineStep(

	1614 MergeEngine *pMerger, /* The merge engine to advance to the next row */

	1615 int *pbEof /* Set TRUE at EOF. Set false for more content */

	1616 ){

	1617 int rc;

	1618 int iPrev = pMerger->aTree[1];/* Index of PmaReader to advance */

	1619 SortSubtask *pTask = pMerger->pTask;

	1620

	1621 /* Advance the current PmaReader */

	1622 rc = vdbePmaReaderNext(&pMerger->aReadr[iPrev]);

	1623

	1624 /* Update contents of aTree[] */

	1625 if( rc==SQLITE_OK ){

	1626 int i; /* Index of aTree[] to recalculate */

	1627 PmaReader *pReadr1; /* First PmaReader to compare */

	1628 PmaReader *pReadr2; /* Second PmaReader to compare */

	1629 int bCached = 0;

	1630

	1631 /* Find the first two PmaReaders to compare. The one that was just

	1632 ** advanced (iPrev) and the one next to it in the array. */

	1633 pReadr1 = &pMerger->aReadr[(iPrev & 0xFFFE)];

	1634 pReadr2 = &pMerger->aReadr[(iPrev | 0x0001)];

	1635

	1636 for(i=(pMerger->nTree+iPrev)/2; i>0; i=i/2){

	1637 /* Compare pReadr1 and pReadr2. Store the result in variable iRes. */

	1638 int iRes;

	1639 if( pReadr1->pFd==0 ){

	1640 iRes = +1;

	1641 }else if( pReadr2->pFd==0 ){

	1642 iRes = -1;

	1643 }else{

	1644 iRes = pTask->xCompare(pTask, &bCached,

	1645 pReadr1->aKey, pReadr1->nKey, pReadr2->aKey, pReadr2->nKey

	1646 );

	1647 }

	1648

	1649 /* If pReadr1 contained the smaller value, set aTree[i] to its index.

	1650 ** Then set pReadr2 to the next PmaReader to compare to pReadr1. In this

	1651 ** case there is no cache of pReadr2 in pTask->pUnpacked, so set

	1652 ** pKey2 to point to the record belonging to pReadr2.

	1653 **

	1654 ** Alternatively, if pReadr2 contains the smaller of the two values,

	1655 ** set aTree[i] to its index and update pReadr1. If vdbeSorterCompare()

	1656 ** was actually called above, then pTask->pUnpacked now contains

	1657 ** a value equivalent to pReadr2. So set pKey2 to NULL to prevent

	1658 ** vdbeSorterCompare() from decoding pReadr2 again.

	1659 **

	1660 ** If the two values were equal, then the value from the oldest

	1661 ** PMA should be considered smaller. The VdbeSorter.aReadr[] array

	1662 ** is sorted from oldest to newest, so pReadr1 contains older values

	1663 ** than pReadr2 iff (pReadr1<pReadr2). */

	1664 if( iRes<0 || (iRes==0 && pReadr1<pReadr2) ){

	1665 pMerger->aTree[i] = (int)(pReadr1 - pMerger->aReadr);

	1666 pReadr2 = &pMerger->aReadr[ pMerger->aTree[i ^ 0x0001] ];

	1667 bCached = 0;

	1668 }else{

	1669 if( pReadr1->pFd ) bCached = 0;

	1670 pMerger->aTree[i] = (int)(pReadr2 - pMerger->aReadr);

	1671 pReadr1 = &pMerger->aReadr[ pMerger->aTree[i ^ 0x0001] ];

	1672 }

	1673 }

	1674 *pbEof = (pMerger->aReadr[pMerger->aTree[1]].pFd==0);

	1675 }

	1676

	1677 return (rc==SQLITE_OK ? pTask->pUnpacked->errCode : rc);

	1678 }
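
The tree update in vdbeMergeEngineStep() can be illustrated with a much smaller model. The following is a minimal sketch, not the SQLite implementation: it assumes four in-memory sorted int arrays in place of PmaReaders, and it recomputes each node on the path to the root from its two children instead of carrying the pReadr1/pReadr2 pair and the bCached optimization. The names Reader, Merger, merger_compare, merger_step and merge_all are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N_READER 4            /* Number of inputs; must be a power of two */

typedef struct Reader {
  const int *a;               /* Sorted input values */
  int n;                      /* Number of values in a[] */
  int i;                      /* Current position; i>=n means EOF */
} Reader;

typedef struct Merger {
  Reader r[N_READER];         /* The inputs being merged */
  int tree[N_READER];         /* tree[1] is the index of the winning reader */
} Merger;

static int at_eof(const Reader *p){ return p->i >= p->n; }

/* Recompute tree[iOut] from the two readers that feed it. Entries in the
** upper half of tree[] compare adjacent readers directly; entries in the
** lower half compare the winners stored in their two child entries. */
static void merger_compare(Merger *m, int iOut){
  int i1, i2;
  const Reader *p1, *p2;
  assert( iOut>0 && iOut<N_READER );
  if( iOut>=N_READER/2 ){
    i1 = (iOut - N_READER/2) * 2;
    i2 = i1 + 1;
  }else{
    i1 = m->tree[iOut*2];
    i2 = m->tree[iOut*2+1];
  }
  p1 = &m->r[i1];
  p2 = &m->r[i2];
  if( at_eof(p1) ){
    m->tree[iOut] = i2;
  }else if( at_eof(p2) ){
    m->tree[iOut] = i1;
  }else{
    /* On a tie the lower-numbered (older) input wins */
    m->tree[iOut] = (p1->a[p1->i] <= p2->a[p2->i]) ? i1 : i2;
  }
}

/* Advance the winning reader, then repair only the path from its leaf
** pair up to the root. Returns 1 once every input is exhausted. */
static int merger_step(Merger *m){
  int iPrev = m->tree[1];
  int i;
  m->r[iPrev].i++;
  for(i=(N_READER+iPrev)/2; i>0; i=i/2){
    merger_compare(m, i);
  }
  return at_eof(&m->r[m->tree[1]]);
}

/* Merge all inputs into out[] (at most nOut values). Returns the count. */
static int merge_all(Merger *m, int *out, int nOut){
  int n = 0;
  int i;
  for(i=N_READER-1; i>0; i--) merger_compare(m, i);
  while( n<nOut && !at_eof(&m->r[m->tree[1]]) ){
    const Reader *p = &m->r[m->tree[1]];
    out[n++] = p->a[p->i];
    merger_step(m);
  }
  return n;
}
```

Each step touches only log2(N_READER) tree entries, which is why the real routine starts at (nTree+iPrev)/2 and halves the index until it reaches the root.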

	1679

	1680 #if SQLITE_MAX_WORKER_THREADS>0

	1681 /*

	1682 ** The main routine for background threads that write level-0 PMAs.

	1683 */

	1684 static void *vdbeSorterFlushThread(void *pCtx){

	1685 SortSubtask *pTask = (SortSubtask*)pCtx;

	1686 int rc; /* Return code */

	1687 assert( pTask->bDone==0 );

	1688 rc = vdbeSorterListToPMA(pTask, &pTask->list);

	1689 pTask->bDone = 1;

	1690 return SQLITE_INT_TO_PTR(rc);

	1691 }

	1692 #endif /* SQLITE_MAX_WORKER_THREADS>0 */

	1693

	1694 /*

	1695 ** Flush the current contents of VdbeSorter.list to a new PMA, possibly

	1696 ** using a background thread.

	1697 */

	1698 static int vdbeSorterFlushPMA(VdbeSorter *pSorter){

	1699 #if SQLITE_MAX_WORKER_THREADS==0

	1700 pSorter->bUsePMA = 1;

	1701 return vdbeSorterListToPMA(&pSorter->aTask[0], &pSorter->list);

	1702 #else

	1703 int rc = SQLITE_OK;

	1704 int i;

	1705 SortSubtask *pTask = 0; /* Thread context used to create new PMA */

	1706 int nWorker = (pSorter->nTask-1);

	1707

	1708 /* Set the flag to indicate that at least one PMA has been written.

	1709 ** Or will be, anyhow. */

	1710 pSorter->bUsePMA = 1;

	1711

	1712 /* Select a sub-task to sort and flush the current list of in-memory

	1713 ** records to disk. If the sorter is running in multi-threaded mode,

	1714 ** round-robin between the first (pSorter->nTask-1) tasks. Except, if

	1715 ** the background thread from a sub-task's previous turn is still running,

	1716 ** skip it. If the first (pSorter->nTask-1) sub-tasks are all still busy,

	1717 ** fall back to using the final sub-task. The first (pSorter->nTask-1)

	1718 ** sub-tasks are preferred as they use background threads - the final

	1719 ** sub-task uses the main thread. */

	1720 for(i=0; i<nWorker; i++){

	1721 int iTest = (pSorter->iPrev + i + 1) % nWorker;

	1722 pTask = &pSorter->aTask[iTest];

	1723 if( pTask->bDone ){

	1724 rc = vdbeSorterJoinThread(pTask);

	1725 }

	1726 if( rc!=SQLITE_OK || pTask->pThread==0 ) break;

	1727 }

	1728

	1729 if( rc==SQLITE_OK ){

	1730 if( i==nWorker ){

	1731 /* Use the foreground thread for this operation */

	1732 rc = vdbeSorterListToPMA(&pSorter->aTask[nWorker], &pSorter->list);

	1733 }else{

	1734 /* Launch a background thread for this operation */

	1735 u8 *aMem = pTask->list.aMemory;

	1736 void *pCtx = (void*)pTask;

	1737

	1738 assert( pTask->pThread==0 && pTask->bDone==0 );

	1739 assert( pTask->list.pList==0 );

	1740 assert( pTask->list.aMemory==0 || pSorter->list.aMemory!=0 );

	1741

	1742 pSorter->iPrev = (u8)(pTask - pSorter->aTask);

	1743 pTask->list = pSorter->list;

	1744 pSorter->list.pList = 0;

	1745 pSorter->list.szPMA = 0;

	1746 if( aMem ){

	1747 pSorter->list.aMemory = aMem;

	1748 pSorter->nMemory = sqlite3MallocSize(aMem);

	1749 }else if( pSorter->list.aMemory ){

	1750 pSorter->list.aMemory = sqlite3Malloc(pSorter->nMemory);

	1751 if( !pSorter->list.aMemory ) return SQLITE_NOMEM_BKPT;

	1752 }

	1753

	1754 rc = vdbeSorterCreateThread(pTask, vdbeSorterFlushThread, pCtx);

	1755 }

	1756 }

	1757

	1758 return rc;

	1759 #endif /* SQLITE_MAX_WORKER_THREADS!=0 */

	1760 }
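
The round-robin selection described in vdbeSorterFlushPMA() can be sketched in isolation. This is a simplified model under stated assumptions: a plain array of busy flags stands in for the pThread/bDone fields, and no thread is joined or launched. The function name pick_worker and its parameters are hypothetical.

```c
#include <assert.h>

/* Choose the sub-task that should flush the next list of records.
**
** aBusy[i] is non-zero if worker i still has a background thread running.
** iPrev is the index of the worker used on the previous turn. The search
** starts at the worker after iPrev and wraps around. If every worker is
** busy, nWorker is returned, meaning "use the final, foreground sub-task".
*/
static int pick_worker(const unsigned char *aBusy, int nWorker, int iPrev){
  int i;
  assert( nWorker>=0 && iPrev>=0 );
  for(i=0; i<nWorker; i++){
    int iTest = (iPrev + i + 1) % nWorker;
    if( !aBusy[iTest] ) return iTest;
  }
  return nWorker;
}
```

With nWorker==0 the loop body never runs, so the modulo is never evaluated and the foreground sub-task is always selected.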

	1761

	1762 /*

	1763 ** Add a record to the sorter.

	1764 */

	1765 int sqlite3VdbeSorterWrite(

	1766 const VdbeCursor *pCsr, /* Sorter cursor */

	1767 Mem *pVal /* Memory cell containing record */

	1768 ){

	1769 VdbeSorter *pSorter;

	1770 int rc = SQLITE_OK; /* Return Code */

	1771 SorterRecord *pNew; /* New list element */

	1772 int bFlush; /* True to flush contents of memory to PMA */

	1773 int nReq; /* Bytes of memory required */

	1774 int nPMA; /* Bytes of PMA space required */

	1775 int t; /* serial type of first record field */

	1776

	1777 assert( pCsr->eCurType==CURTYPE_SORTER );

	1778 pSorter = pCsr->uc.pSorter;

	1779 getVarint32((const u8*)&pVal->z[1], t);

	1780 if( t>0 && t<10 && t!=7 ){

	1781 pSorter->typeMask &= SORTER_TYPE_INTEGER;

	1782 }else if( t>10 && (t & 0x01) ){

	1783 pSorter->typeMask &= SORTER_TYPE_TEXT;

	1784 }else{

	1785 pSorter->typeMask = 0;

	1786 }

	1787

	1788 assert( pSorter );

	1789

	1790 /* Figure out whether or not the current contents of memory should be

	1791 ** flushed to a PMA before continuing. If so, do so.

	1792 **

	1793 ** If using the single large allocation mode (pSorter->aMemory!=0), then

	1794 ** flush the contents of memory to a new PMA if (a) at least one value is

	1795 ** already in memory and (b) the new value will not fit in memory.

	1796 **

	1797 ** Or, if using separate allocations for each record, flush the contents

	1798 ** of memory to a PMA if either of the following are true:

	1799 **

	1800 ** * The total memory allocated for the in-memory list is greater

	1801 ** than (page-size * cache-size), or

	1802 **

	1803 ** * The total memory allocated for the in-memory list is greater

	1804 ** than (page-size * 10) and sqlite3HeapNearlyFull() returns true.

	1805 */

	1806 nReq = pVal->n + sizeof(SorterRecord);

	1807 nPMA = pVal->n + sqlite3VarintLen(pVal->n);

	1808 if( pSorter->mxPmaSize ){

	1809 if( pSorter->list.aMemory ){

	1810 bFlush = pSorter->iMemory && (pSorter->iMemory+nReq) > pSorter->mxPmaSize;

	1811 }else{

	1812 bFlush = (

	1813 (pSorter->list.szPMA > pSorter->mxPmaSize)

	1814 || (pSorter->list.szPMA > pSorter->mnPmaSize && sqlite3HeapNearlyFull())

	1815 );

	1816 }

	1817 if( bFlush ){

	1818 rc = vdbeSorterFlushPMA(pSorter);

	1819 pSorter->list.szPMA = 0;

	1820 pSorter->iMemory = 0;

	1821 assert( rc!=SQLITE_OK || pSorter->list.pList==0 );

	1822 }

	1823 }

	1824

	1825 pSorter->list.szPMA += nPMA;

	1826 if( nPMA>pSorter->mxKeysize ){

	1827 pSorter->mxKeysize = nPMA;

	1828 }

	1829

	1830 if( pSorter->list.aMemory ){

	1831 int nMin = pSorter->iMemory + nReq;

	1832

	1833 if( nMin>pSorter->nMemory ){

	1834 u8 *aNew;

	1835 int iListOff = (u8*)pSorter->list.pList - pSorter->list.aMemory;

	1836 int nNew = pSorter->nMemory * 2;

	1837 while( nNew < nMin ) nNew = nNew*2;

	1838 if( nNew > pSorter->mxPmaSize ) nNew = pSorter->mxPmaSize;

	1839 if( nNew < nMin ) nNew = nMin;

	1840

	1841 aNew = sqlite3Realloc(pSorter->list.aMemory, nNew);

	1842 if( !aNew ) return SQLITE_NOMEM_BKPT;

	1843 pSorter->list.pList = (SorterRecord*)&aNew[iListOff];

	1844 pSorter->list.aMemory = aNew;

	1845 pSorter->nMemory = nNew;

	1846 }

	1847

	1848 pNew = (SorterRecord*)&pSorter->list.aMemory[pSorter->iMemory];

	1849 pSorter->iMemory += ROUND8(nReq);

	1850 if( pSorter->list.pList ){

	1851 pNew->u.iNext = (int)((u8*)(pSorter->list.pList) - pSorter->list.aMemory);

	1852 }

	1853 }else{

	1854 pNew = (SorterRecord *)sqlite3Malloc(nReq);

	1855 if( pNew==0 ){

	1856 return SQLITE_NOMEM_BKPT;

	1857 }

	1858 pNew->u.pNext = pSorter->list.pList;

	1859 }

	1860

	1861 memcpy(SRVAL(pNew), pVal->z, pVal->n);

	1862 pNew->nVal = pVal->n;

	1863 pSorter->list.pList = pNew;

	1864

	1865 return rc;

	1866 }
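
The flush decision and the buffer growth policy in sqlite3VdbeSorterWrite() reduce to a few integer comparisons. The sketch below restates them as pure functions so the arithmetic can be checked by hand. The function names are hypothetical, the heap-pressure test is passed in as a flag rather than calling sqlite3HeapNearlyFull(), and integer overflow is assumed not to occur.

```c
#include <assert.h>

/* Single large allocation mode: flush if at least one record is already
** buffered (iMemory!=0) and the new record would exceed the PMA limit. */
static int should_flush_bulk(int iMemory, int nReq, int mxPmaSize){
  return iMemory && (iMemory + nReq) > mxPmaSize;
}

/* Separate allocation mode: flush if the list exceeds the maximum size,
** or exceeds the minimum size while the heap is nearly full. */
static int should_flush_list(int szPMA, int mxPmaSize, int mnPmaSize,
                             int bHeapNearlyFull){
  return (szPMA > mxPmaSize) || (szPMA > mnPmaSize && bHeapNearlyFull);
}

/* New buffer size when nMin bytes are needed: keep doubling the current
** size, cap the result at mxPmaSize, but never return less than nMin. */
static int grow_size(int nMemory, int nMin, int mxPmaSize){
  int nNew;
  assert( nMemory>0 );
  nNew = nMemory * 2;
  while( nNew < nMin ) nNew = nNew*2;
  if( nNew > mxPmaSize ) nNew = mxPmaSize;
  if( nNew < nMin ) nNew = nMin;
  return nNew;
}
```

The last clamp in grow_size() matters when a single record is larger than mxPmaSize: the buffer must still be able to hold it.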

	1867

	1868 /*

	1869 ** Read keys from pIncr->pMerger and populate pIncr->aFile[1]. The format

	1870 ** of the data stored in aFile[1] is the same as that used by regular PMAs,

	1871 ** except that the number-of-bytes varint is omitted from the start.

	1872 */

	1873 static int vdbeIncrPopulate(IncrMerger *pIncr){

	1874 int rc = SQLITE_OK;

	1875 int rc2;

	1876 i64 iStart = pIncr->iStartOff;

	1877 SorterFile *pOut = &pIncr->aFile[1];

	1878 SortSubtask *pTask = pIncr->pTask;

	1879 MergeEngine *pMerger = pIncr->pMerger;

	1880 PmaWriter writer;

	1881 assert( pIncr->bEof==0 );

	1882

	1883 vdbeSorterPopulateDebug(pTask, "enter");

	1884

	1885 vdbePmaWriterInit(pOut->pFd, &writer, pTask->pSorter->pgsz, iStart);

	1886 while( rc==SQLITE_OK ){

	1887 int dummy;

	1888 PmaReader *pReader = &pMerger->aReadr[ pMerger->aTree[1] ];

	1889 int nKey = pReader->nKey;

	1890 i64 iEof = writer.iWriteOff + writer.iBufEnd;

	1891

	1892 /* Check if the output file is full or if the input has been exhausted.

	1893 ** In either case exit the loop. */

	1894 if( pReader->pFd==0 ) break;

	1895 if( (iEof + nKey + sqlite3VarintLen(nKey))>(iStart + pIncr->mxSz) ) break;

	1896

	1897 /* Write the next key to the output. */

	1898 vdbePmaWriteVarint(&writer, nKey);

	1899 vdbePmaWriteBlob(&writer, pReader->aKey, nKey);

	1900 assert( pIncr->pMerger->pTask==pTask );

	1901 rc = vdbeMergeEngineStep(pIncr->pMerger, &dummy);

	1902 }

	1903

	1904 rc2 = vdbePmaWriterFinish(&writer, &pOut->iEof);

	1905 if( rc==SQLITE_OK ) rc = rc2;

	1906 vdbeSorterPopulateDebug(pTask, "exit");

	1907 return rc;

	1908 }
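
The loop in vdbeIncrPopulate() stops when the next key, plus its length prefix, would no longer fit in the mxSz bytes reserved for the output region. A minimal sketch of that space accounting follows. It assumes key sizes are given as an array of ints, models the length prefix with a simplified 7-bits-per-byte length function, and writes no data. The names varint_len and fill_region are hypothetical.

```c
#include <assert.h>

/* Number of bytes needed to encode v at 7 bits per byte. This is a
** simplified stand-in for a varint length routine and is only meant for
** the small values used in this sketch. */
static int varint_len(unsigned int v){
  int i;
  for(i=1; (v >>= 7)!=0; i++){}
  return i;
}

/* Account for writing length-prefixed keys into a region of mxSz bytes.
** aKeyLen[] holds the size of each key in merge order. Returns the number
** of keys that fit and stores the number of bytes used in *pnUsed. */
static int fill_region(const int *aKeyLen, int nKey, long long mxSz,
                       long long *pnUsed){
  long long iEof = 0;
  int i;
  for(i=0; i<nKey; i++){
    long long nNeed;
    assert( aKeyLen[i]>=0 );
    nNeed = aKeyLen[i] + varint_len((unsigned int)aKeyLen[i]);
    if( iEof + nNeed > mxSz ) break;   /* Region full: stop before this key */
    iEof += nNeed;
  }
  *pnUsed = iEof;
  return i;
}
```

If no key fits, the byte count stays at zero, which corresponds to the case where aFile[0].iEof equals iStartOff and vdbeIncrSwap() sets bEof.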

	1909

	1910 #if SQLITE_MAX_WORKER_THREADS>0

	1911 /*

	1912 ** The main routine for background threads that populate aFile[1] of

	1913 ** multi-threaded IncrMerger objects.

	1914 */

	1915 static void *vdbeIncrPopulateThread(void *pCtx){

	1916 IncrMerger *pIncr = (IncrMerger*)pCtx;

	1917 void *pRet = SQLITE_INT_TO_PTR( vdbeIncrPopulate(pIncr) );

	1918 pIncr->pTask->bDone = 1;

	1919 return pRet;

	1920 }

	1921

	1922 /*

	1923 ** Launch a background thread to populate aFile[1] of pIncr.

	1924 */

	1925 static int vdbeIncrBgPopulate(IncrMerger *pIncr){

	1926 void *p = (void*)pIncr;

	1927 assert( pIncr->bUseThread );

	1928 return vdbeSorterCreateThread(pIncr->pTask, vdbeIncrPopulateThread, p);

	1929 }

	1930 #endif

	1931

	1932 /*

	1933 ** This function is called when the PmaReader corresponding to pIncr has

	1934 ** finished reading the contents of aFile[0]. Its purpose is to "refill"

	1935 ** aFile[0] such that the PmaReader should start rereading it from the

	1936 ** beginning.

	1937 **

	1938 ** For single-threaded objects, this is accomplished by literally reading

	1939 ** keys from pIncr->pMerger and repopulating aFile[0].

	1940 **

	1941 ** For multi-threaded objects, all that is required is to wait until the

	1942 ** background thread is finished (if it is not already) and then swap

	1943 ** aFile[0] and aFile[1] in place. If the contents of pMerger have not

	1944 ** been exhausted, this function also launches a new background thread

	1945 ** to populate the new aFile[1].

	1946 **

	1947 ** SQLITE_OK is returned on success, or an SQLite error code otherwise.

	1948 */

	1949 static int vdbeIncrSwap(IncrMerger *pIncr){

	1950 int rc = SQLITE_OK;

	1951

	1952 #if SQLITE_MAX_WORKER_THREADS>0

	1953 if( pIncr->bUseThread ){

	1954 rc = vdbeSorterJoinThread(pIncr->pTask);

	1955

	1956 if( rc==SQLITE_OK ){

	1957 SorterFile f0 = pIncr->aFile[0];

	1958 pIncr->aFile[0] = pIncr->aFile[1];

	1959 pIncr->aFile[1] = f0;

	1960 }

	1961

	1962 if( rc==SQLITE_OK ){

	1963 if( pIncr->aFile[0].iEof==pIncr->iStartOff ){

	1964 pIncr->bEof = 1;

	1965 }else{

	1966 rc = vdbeIncrBgPopulate(pIncr);

	1967 }

	1968 }

	1969 }else

	1970 #endif

	1971 {

	1972 rc = vdbeIncrPopulate(pIncr);

	1973 pIncr->aFile[0] = pIncr->aFile[1];

	1974 if( pIncr->aFile[0].iEof==pIncr->iStartOff ){

	1975 pIncr->bEof = 1;

	1976 }

	1977 }

	1978

	1979 return rc;

	1980 }

	1981

	1982 /*

	1983 ** Allocate and return a new IncrMerger object to read data from pMerger.

	1984 **

	1985 ** If an OOM condition is encountered, set *ppOut to NULL and return

	1986 ** SQLITE_NOMEM. In this case free the pMerger argument before returning.

	1987 */

	1988 static int vdbeIncrMergerNew(

	1989 SortSubtask *pTask, /* The thread that will be using the new IncrMerger */

	1990 MergeEngine *pMerger, /* The MergeEngine that the IncrMerger will control */

	1991 IncrMerger **ppOut /* Write the new IncrMerger here */

	1992 ){

	1993 int rc = SQLITE_OK;

	1994 IncrMerger *pIncr = *ppOut = (IncrMerger*)

	1995 (sqlite3FaultSim(100) ? 0 : sqlite3MallocZero(sizeof(*pIncr)));

	1996 if( pIncr ){

	1997 pIncr->pMerger = pMerger;

	1998 pIncr->pTask = pTask;

	1999 pIncr->mxSz = MAX(pTask->pSorter->mxKeysize+9,pTask->pSorter->mxPmaSize/2);

	2000 pTask->file2.iEof += pIncr->mxSz;

	2001 }else{

	2002 vdbeMergeEngineFree(pMerger);

	2003 rc = SQLITE_NOMEM_BKPT;

	2004 }

	2005 return rc;

	2006 }

	2007

	2008 #if SQLITE_MAX_WORKER_THREADS>0

	2009 /*

	2010 ** Set the "use-threads" flag on object pIncr.

	2011 */

	2012 static void vdbeIncrMergerSetThreads(IncrMerger *pIncr){

	2013 pIncr->bUseThread = 1;

	2014 pIncr->pTask->file2.iEof -= pIncr->mxSz;

	2015 }

	2016 #endif /* SQLITE_MAX_WORKER_THREADS>0 */

	2017

	2018

	2019

	2020 /*

	2021 ** Recompute pMerger->aTree[iOut] by comparing the next keys on the

	2022 ** two PmaReaders that feed that entry. Neither of the PmaReaders

	2023 ** are advanced. This routine merely does the comparison.

	2024 */

	2025 static void vdbeMergeEngineCompare(

	2026 MergeEngine *pMerger, /* Merge engine containing PmaReaders to compare */

	2027 int iOut /* Store the result in pMerger->aTree[iOut] */

	2028 ){

	2029 int i1;

	2030 int i2;

	2031 int iRes;

	2032 PmaReader *p1;

	2033 PmaReader *p2;

	2034

	2035 assert( iOut<pMerger->nTree && iOut>0 );

	2036

	2037 if( iOut>=(pMerger->nTree/2) ){

	2038 i1 = (iOut - pMerger->nTree/2) * 2;

	2039 i2 = i1 + 1;

	2040 }else{

	2041 i1 = pMerger->aTree[iOut*2];

	2042 i2 = pMerger->aTree[iOut*2+1];

	2043 }

	2044

	2045 p1 = &pMerger->aReadr[i1];

	2046 p2 = &pMerger->aReadr[i2];

	2047

	2048 if( p1->pFd==0 ){

	2049 iRes = i2;

	2050 }else if( p2->pFd==0 ){

	2051 iRes = i1;

	2052 }else{

	2053 SortSubtask *pTask = pMerger->pTask;

	2054 int bCached = 0;

	2055 int res;

	2056 assert( pTask->pUnpacked!=0 ); /* from vdbeSortSubtaskMain() */

	2057 res = pTask->xCompare(

	2058 pTask, &bCached, p1->aKey, p1->nKey, p2->aKey, p2->nKey

	2059 );

	2060 if( res<=0 ){

	2061 iRes = i1;

	2062 }else{

	2063 iRes = i2;

	2064 }

	2065 }

	2066

	2067 pMerger->aTree[iOut] = iRes;

	2068 }

	2069

	2070 /*

	2071 ** Allowed values for the eMode parameter to vdbeMergeEngineInit()

	2072 ** and vdbePmaReaderIncrMergeInit().

	2073 **

	2074 ** Only INCRINIT_NORMAL is valid in single-threaded builds (when

	2075 ** SQLITE_MAX_WORKER_THREADS==0). The other values are only used

	2076 ** when there exists one or more separate worker threads.

	2077 */

	2078 #define INCRINIT_NORMAL 0

	2079 #define INCRINIT_TASK 1

	2080 #define INCRINIT_ROOT 2

	2081

	2082 /*

	2083 ** Forward reference required as the vdbeIncrMergeInit() and

	2084 ** vdbePmaReaderIncrInit() routines are called mutually recursively when

	2085 ** building a merge tree.

	2086 */

	2087 static int vdbePmaReaderIncrInit(PmaReader *pReadr, int eMode);

	2088

	2089 /*

	2090 ** Initialize the MergeEngine object passed as the second argument. Once this

	2091 ** function returns, the first key of merged data may be read from the

	2092 ** MergeEngine object in the usual fashion.

	2093 **

	2094 ** If argument eMode is INCRINIT_ROOT, then it is assumed that any IncrMerge

	2095 ** objects attached to the PmaReader objects that the merger reads from have

	2096 ** already been populated, but that they have not yet populated aFile[0] and

	2097 ** set the PmaReader objects up to read from it. In this case all that is

	2098 ** required is to call vdbePmaReaderNext() on each PmaReader to point it at

	2099 ** its first key.

	2100 **

	2101 ** Otherwise, if eMode is any value other than INCRINIT_ROOT, then use

	2102 ** vdbePmaReaderIncrMergeInit() to initialize each PmaReader that feeds data

	2103 ** to pMerger.

	2104 **

	2105 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	2106 */

	2107 static int vdbeMergeEngineInit(

	2108 SortSubtask *pTask, /* Thread that will run pMerger */

	2109 MergeEngine *pMerger, /* MergeEngine to initialize */

	2110 int eMode /* One of the INCRINIT_XXX constants */

	2111 ){

	2112 int rc = SQLITE_OK; /* Return code */

	2113 int i; /* For looping over PmaReader objects */

	2114 int nTree = pMerger->nTree;

	2115

	2116 /* eMode is always INCRINIT_NORMAL in single-threaded mode */

	2117 assert( SQLITE_MAX_WORKER_THREADS>0 || eMode==INCRINIT_NORMAL );

	2118

	2119 /* Verify that the MergeEngine is assigned to a single thread */

	2120 assert( pMerger->pTask==0 );

	2121 pMerger->pTask = pTask;

	2122

	2123 for(i=0; i<nTree; i++){

	2124 if( SQLITE_MAX_WORKER_THREADS>0 && eMode==INCRINIT_ROOT ){

	2125 /* PmaReaders should be normally initialized in order, as if they are

	2126 ** reading from the same temp file this makes for more linear file IO.

	2127 ** However, in the INCRINIT_ROOT case, if PmaReader aReadr[nTask-1] is

	2128 ** in use it will block the vdbePmaReaderNext() call while it uses

	2129 ** the main thread to fill its buffer. So calling PmaReaderNext()

	2130 ** on this PmaReader before any of the multi-threaded PmaReaders takes

	2131 ** better advantage of multi-processor hardware. */

	2132 rc = vdbePmaReaderNext(&pMerger->aReadr[nTree-i-1]);

	2133 }else{

	2134 rc = vdbePmaReaderIncrInit(&pMerger->aReadr[i], INCRINIT_NORMAL);

	2135 }

	2136 if( rc!=SQLITE_OK ) return rc;

	2137 }

	2138

	2139 for(i=pMerger->nTree-1; i>0; i--){

	2140 vdbeMergeEngineCompare(pMerger, i);

	2141 }

	2142 return pTask->pUnpacked->errCode;

	2143 }

	2144

	2145 /*

	2146 ** The PmaReader passed as the first argument is guaranteed to be an

	2147 ** incremental-reader (pReadr->pIncr!=0). This function serves to open

	2148 ** and/or initialize the temp file related fields of the IncrMerge

	2149 ** object at (pReadr->pIncr).

	2150 **

	2151 ** If argument eMode is set to INCRINIT_NORMAL, then all PmaReaders

	2152 ** in the sub-tree headed by pReadr are also initialized. Data is then

	2153 ** loaded into the buffers belonging to pReadr and it is set to point to

	2154 ** the first key in its range.

	2155 **

	2156 ** If argument eMode is set to INCRINIT_TASK, then pReadr is guaranteed

	2157 ** to be a multi-threaded PmaReader and this function is being called in a

	2158 ** background thread. In this case all PmaReaders in the sub-tree are

	2159 ** initialized as for INCRINIT_NORMAL and the aFile[1] buffer belonging to

	2160 ** pReadr is populated. However, pReadr itself is not set up to point

	2161 ** to its first key. A call to vdbePmaReaderNext() is still required to do

	2162 ** that.

	2163 **

	2164 ** The reason this function does not call vdbePmaReaderNext() immediately

	2165 ** in the INCRINIT_TASK case is that vdbePmaReaderNext() assumes that it has

	2166 ** to block on thread (pTask->thread) before accessing aFile[1]. But, since

	2167 ** this entire function is being run by thread (pTask->thread), that will

	2168 ** lead to the current background thread attempting to join itself.

	2169 **

	2170 ** Finally, if argument eMode is set to INCRINIT_ROOT, it may be assumed

	2171 ** that pReadr->pIncr is a multi-threaded IncrMerge object, and that all

	2172 ** child-trees have already been initialized using IncrInit(INCRINIT_TASK).

	2173 ** In this case vdbePmaReaderNext() is called on all child PmaReaders and

	2174 ** the current PmaReader set to point to the first key in its range.

	2175 **

	2176 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	2177 */

	2178 static int vdbePmaReaderIncrMergeInit(PmaReader *pReadr, int eMode){

	2179 int rc = SQLITE_OK;

	2180 IncrMerger *pIncr = pReadr->pIncr;

	2181 SortSubtask *pTask = pIncr->pTask;

	2182 sqlite3 *db = pTask->pSorter->db;

	2183

	2184 /* eMode is always INCRINIT_NORMAL in single-threaded mode */

	2185 assert( SQLITE_MAX_WORKER_THREADS>0 || eMode==INCRINIT_NORMAL );

	2186

	2187 rc = vdbeMergeEngineInit(pTask, pIncr->pMerger, eMode);

	2188

	2189 /* Set up the required files for pIncr. A multi-threaded IncrMerge object

	2190 ** requires two temp files to itself, whereas a single-threaded object

	2191 ** only requires a region of pTask->file2. */

	2192 if( rc==SQLITE_OK ){

	2193 int mxSz = pIncr->mxSz;

	2194 #if SQLITE_MAX_WORKER_THREADS>0

	2195 if( pIncr->bUseThread ){

	2196 rc = vdbeSorterOpenTempFile(db, mxSz, &pIncr->aFile[0].pFd);

	2197 if( rc==SQLITE_OK ){

	2198 rc = vdbeSorterOpenTempFile(db, mxSz, &pIncr->aFile[1].pFd);

	2199 }

	2200 }else

	2201 #endif

	2202 /*if( !pIncr->bUseThread )*/{

	2203 if( pTask->file2.pFd==0 ){

	2204 assert( pTask->file2.iEof>0 );

	2205 rc = vdbeSorterOpenTempFile(db, pTask->file2.iEof, &pTask->file2.pFd);

	2206 pTask->file2.iEof = 0;

	2207 }

	2208 if( rc==SQLITE_OK ){

	2209 pIncr->aFile[1].pFd = pTask->file2.pFd;

	2210 pIncr->iStartOff = pTask->file2.iEof;

	2211 pTask->file2.iEof += mxSz;

	2212 }

	2213 }

	2214 }

	2215

	2216 #if SQLITE_MAX_WORKER_THREADS>0

	2217 if( rc==SQLITE_OK && pIncr->bUseThread ){

	2218 /* Use the current thread to populate aFile[1], even though this

	2219 ** PmaReader is multi-threaded. If this is an INCRINIT_TASK object,

	2220 ** then this function is already running in background thread

	2221 ** pIncr->pTask->thread.

	2222 **

	2223 ** If this is the INCRINIT_ROOT object, then it is running in the

	2224 ** main VDBE thread. But that is Ok, as that thread cannot return

	2225 ** control to the VDBE or proceed with anything useful until the

	2226 ** first results are ready from this merger object anyway.

	2227 */

	2228 assert( eMode==INCRINIT_ROOT || eMode==INCRINIT_TASK );

	2229 rc = vdbeIncrPopulate(pIncr);

	2230 }

	2231 #endif

	2232

	2233 if( rc==SQLITE_OK && (SQLITE_MAX_WORKER_THREADS==0 || eMode!=INCRINIT_TASK) ){

	2234 rc = vdbePmaReaderNext(pReadr);

	2235 }

	2236

	2237 return rc;

	2238 }

	2239

	2240 #if SQLITE_MAX_WORKER_THREADS>0

	2241 /*

	2242 ** The main routine for vdbePmaReaderIncrMergeInit() operations run in

	2243 ** background threads.

	2244 */

	2245 static void *vdbePmaReaderBgIncrInit(void *pCtx){

	2246 PmaReader *pReader = (PmaReader*)pCtx;

	2247 void *pRet = SQLITE_INT_TO_PTR(

	2248 vdbePmaReaderIncrMergeInit(pReader,INCRINIT_TASK)

	2249 );

	2250 pReader->pIncr->pTask->bDone = 1;

	2251 return pRet;

	2252 }

	2253 #endif

	2254

	2255 /*

	2256 ** If the PmaReader passed as the first argument is not an incremental-reader

	2257 ** (if pReadr->pIncr==0), then this function is a no-op. Otherwise, it invokes

	2258 ** the vdbePmaReaderIncrMergeInit() function with the parameters passed to

	2259 ** this routine to initialize the incremental merge.

	2260 **

	2261 ** If the IncrMerger object is multi-threaded (IncrMerger.bUseThread==1),

	2262 ** then a background thread is launched to call vdbePmaReaderIncrMergeInit().

	2263 ** Or, if the IncrMerger is single threaded, the same function is called

	2264 ** using the current thread.

	2265 */

	2266 static int vdbePmaReaderIncrInit(PmaReader *pReadr, int eMode){

	2267 IncrMerger *pIncr = pReadr->pIncr; /* Incremental merger */

	2268 int rc = SQLITE_OK; /* Return code */

	2269 if( pIncr ){

	2270 #if SQLITE_MAX_WORKER_THREADS>0

	2271 assert( pIncr->bUseThread==0 || eMode==INCRINIT_TASK );

	2272 if( pIncr->bUseThread ){

	2273 void *pCtx = (void*)pReadr;

	2274 rc = vdbeSorterCreateThread(pIncr->pTask, vdbePmaReaderBgIncrInit, pCtx);

	2275 }else

	2276 #endif

	2277 {

	2278 rc = vdbePmaReaderIncrMergeInit(pReadr, eMode);

	2279 }

	2280 }

	2281 return rc;

	2282 }

	2283

	2284 /*

	2285 ** Allocate a new MergeEngine object to merge the contents of nPMA level-0

	2286 ** PMAs from pTask->file. If no error occurs, set *ppOut to point to

	2287 ** the new object and return SQLITE_OK. Or, if an error does occur, set *ppOut

	2288 ** to NULL and return an SQLite error code.

	2289 **

	2290 ** When this function is called, *piOffset is set to the offset of the

	2291 ** first PMA to read from pTask->file. Assuming no error occurs, it is

	2292 ** set to the offset immediately following the last byte of the last

	2293 ** PMA before returning. If an error does occur, then the final value of

	2294 ** *piOffset is undefined.

	2295 */

	2296 static int vdbeMergeEngineLevel0(

	2297 SortSubtask *pTask, /* Sorter task to read from */

	2298 int nPMA, /* Number of PMAs to read */

	2299 i64 *piOffset, /* IN/OUT: Readr offset in pTask->file */

	2300 MergeEngine **ppOut /* OUT: New merge-engine */

	2301 ){

	2302 MergeEngine *pNew; /* Merge engine to return */

	2303 i64 iOff = *piOffset;

	2304 int i;

	2305 int rc = SQLITE_OK;

	2306

	2307 *ppOut = pNew = vdbeMergeEngineNew(nPMA);

	2308 if( pNew==0 ) rc = SQLITE_NOMEM_BKPT;

	2309

	2310 for(i=0; i<nPMA && rc==SQLITE_OK; i++){

	2311 i64 nDummy = 0;

	2312 PmaReader *pReadr = &pNew->aReadr[i];

	2313 rc = vdbePmaReaderInit(pTask, &pTask->file, iOff, pReadr, &nDummy);

	2314 iOff = pReadr->iEof;

	2315 }

	2316

	2317 if( rc!=SQLITE_OK ){

	2318 vdbeMergeEngineFree(pNew);

	2319 *ppOut = 0;

	2320 }

	2321 *piOffset = iOff;

	2322 return rc;

	2323 }

	2324

	2325 /*

	2326 ** Return the depth of a tree comprising nPMA PMAs, assuming a fanout of

	2327 ** SORTER_MAX_MERGE_COUNT. The returned value does not include leaf nodes.

	2328 **

	2329 ** i.e.

	2330 **

	2331 ** nPMA<=16 -> TreeDepth() == 0

	2332 ** nPMA<=256 -> TreeDepth() == 1

	2333 ** nPMA<=4096 -> TreeDepth() == 2

	2334 */

	2335 static int vdbeSorterTreeDepth(int nPMA){

	2336 int nDepth = 0;

	2337 i64 nDiv = SORTER_MAX_MERGE_COUNT;

	2338 while( nDiv < (i64)nPMA ){

	2339 nDiv = nDiv * SORTER_MAX_MERGE_COUNT;

	2340 nDepth++;

	2341 }

	2342 return nDepth;

	2343 }

	2344

	2345 /*

	2346 ** pRoot is the root of an incremental merge-tree with depth nDepth (according

	2347 ** to vdbeSorterTreeDepth()). pLeaf is the iSeq'th leaf to be added to the

	2348 ** tree, counting from zero. This function adds pLeaf to the tree.

	2349 **

	2350 ** If successful, SQLITE_OK is returned. If an error occurs, an SQLite error

	2351 ** code is returned and pLeaf is freed.

	2352 */

	2353 static int vdbeSorterAddToTree(

	2354 SortSubtask *pTask, /* Task context */

	2355 int nDepth, /* Depth of tree according to TreeDepth() */

	2356 int iSeq, /* Sequence number of leaf within tree */

	2357 MergeEngine *pRoot, /* Root of tree */

	2358 MergeEngine *pLeaf /* Leaf to add to tree */

	2359 ){

	2360 int rc = SQLITE_OK;

	2361 int nDiv = 1;

	2362 int i;

	2363 MergeEngine *p = pRoot;

	2364 IncrMerger *pIncr;

	2365

	2366 rc = vdbeIncrMergerNew(pTask, pLeaf, &pIncr);

	2367

	2368 for(i=1; i<nDepth; i++){

	2369 nDiv = nDiv * SORTER_MAX_MERGE_COUNT;

	2370 }

	2371

	2372 for(i=1; i<nDepth && rc==SQLITE_OK; i++){

	2373 int iIter = (iSeq / nDiv) % SORTER_MAX_MERGE_COUNT;

	2374 PmaReader *pReadr = &p->aReadr[iIter];

	2375

	2376 if( pReadr->pIncr==0 ){

	2377 MergeEngine *pNew = vdbeMergeEngineNew(SORTER_MAX_MERGE_COUNT);

	2378 if( pNew==0 ){

	2379 rc = SQLITE_NOMEM_BKPT;

	2380 }else{

	2381 rc = vdbeIncrMergerNew(pTask, pNew, &pReadr->pIncr);

	2382 }

	2383 }

	2384 if( rc==SQLITE_OK ){

	2385 p = pReadr->pIncr->pMerger;

	2386 nDiv = nDiv / SORTER_MAX_MERGE_COUNT;

	2387 }

	2388 }

	2389

	2390 if( rc==SQLITE_OK ){

	2391 p->aReadr[iSeq % SORTER_MAX_MERGE_COUNT].pIncr = pIncr;

	2392 }else{

	2393 vdbeIncrFree(pIncr);

	2394 }

	2395 return rc;

	2396 }
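
The slot arithmetic used by vdbeSorterTreeDepth() and vdbeSorterAddToTree() can be checked with a small standalone model. The sketch below assumes a fanout of 16 (the value implied by the nPMA<=16 and nPMA<=256 examples in the comment above) and records only the slot chosen at each level instead of allocating MergeEngine objects. The names MAX_MERGE, tree_depth and leaf_path are hypothetical.

```c
#include <assert.h>

#define MAX_MERGE 16          /* Assumed fanout of each merge node */

/* Depth of the tree needed for nPMA PMAs, not counting the leaf level. */
static int tree_depth(int nPMA){
  int nDepth = 0;
  long long nDiv = MAX_MERGE;
  while( nDiv < (long long)nPMA ){
    nDiv = nDiv * MAX_MERGE;
    nDepth++;
  }
  return nDepth;
}

/* Compute the slot taken at each level when adding the iSeq'th leaf
** merger to a tree of depth nDepth (nDepth>=1). The slots are written to
** aPath[] from the root downwards; the last entry is the slot in the
** lowest interior node that receives the leaf. Returns the entry count. */
static int leaf_path(int nDepth, int iSeq, int *aPath){
  int nDiv = 1;
  int n = 0;
  int i;
  assert( nDepth>=1 && iSeq>=0 );
  for(i=1; i<nDepth; i++){
    nDiv = nDiv * MAX_MERGE;
  }
  for(i=1; i<nDepth; i++){
    aPath[n++] = (iSeq / nDiv) % MAX_MERGE;
    nDiv = nDiv / MAX_MERGE;
  }
  aPath[n++] = iSeq % MAX_MERGE;
  return n;
}
```

With this fanout the loop in tree_depth() gives depth 2 for anything from 257 to 4096 PMAs, and leaf number 37 in a depth-2 tree lands in slot 5 of the node held in root slot 2.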

	2397

	2398 /*

	2399 ** This function is called as part of a SorterRewind() operation on a sorter

	2400 ** that has already written two or more level-0 PMAs to one or more temp

	2401 ** files. It builds a tree of MergeEngine/IncrMerger/PmaReader objects that

	2402 ** can be used to incrementally merge all PMAs on disk.

	2403 **

	2404 ** If successful, SQLITE_OK is returned and *ppOut set to point to the

	2405 ** MergeEngine object at the root of the tree before returning. Or, if an

	2406 ** error occurs, an SQLite error code is returned and the final value

	2407 ** of *ppOut is undefined.

	2408 */

	2409 static int vdbeSorterMergeTreeBuild(

	2410 VdbeSorter *pSorter, /* The VDBE cursor that implements the sort */

	2411 MergeEngine **ppOut /* Write the MergeEngine here */

	2412 ){

	2413 MergeEngine *pMain = 0;

	2414 int rc = SQLITE_OK;

	2415 int iTask;

	2416

	2417 #if SQLITE_MAX_WORKER_THREADS>0

	2418 /* If the sorter uses more than one task, then create the top-level

	2419 ** MergeEngine here. This MergeEngine will read data from exactly

	2420 ** one PmaReader per sub-task. */

	2421 assert( pSorter->bUseThreads || pSorter->nTask==1 );

	2422 if( pSorter->nTask>1 ){

	2423 pMain = vdbeMergeEngineNew(pSorter->nTask);

	2424 if( pMain==0 ) rc = SQLITE_NOMEM_BKPT;

	2425 }

	2426 #endif

	2427

	2428 for(iTask=0; rc==SQLITE_OK && iTask<pSorter->nTask; iTask++){

	2429 SortSubtask *pTask = &pSorter->aTask[iTask];

	2430 assert( pTask->nPMA>0 || SQLITE_MAX_WORKER_THREADS>0 );

	2431 if( SQLITE_MAX_WORKER_THREADS==0 || pTask->nPMA ){

	2432 MergeEngine *pRoot = 0; /* Root node of tree for this task */

	2433 int nDepth = vdbeSorterTreeDepth(pTask->nPMA);

	2434 i64 iReadOff = 0;

	2435

	2436 if( pTask->nPMA<=SORTER_MAX_MERGE_COUNT ){

	2437 rc = vdbeMergeEngineLevel0(pTask, pTask->nPMA, &iReadOff, &pRoot);

	2438 }else{

	2439 int i;

	2440 int iSeq = 0;

	2441 pRoot = vdbeMergeEngineNew(SORTER_MAX_MERGE_COUNT);

	2442 if( pRoot==0 ) rc = SQLITE_NOMEM_BKPT;

	2443 for(i=0; i<pTask->nPMA && rc==SQLITE_OK; i += SORTER_MAX_MERGE_COUNT){

	2444 MergeEngine *pMerger = 0; /* New level-0 PMA merger */

	2445 int nReader; /* Number of level-0 PMAs to merge */

	2446

	2447 nReader = MIN(pTask->nPMA - i, SORTER_MAX_MERGE_COUNT);

	2448 rc = vdbeMergeEngineLevel0(pTask, nReader, &iReadOff, &pMerger);

	2449 if( rc==SQLITE_OK ){

	2450 rc = vdbeSorterAddToTree(pTask, nDepth, iSeq++, pRoot, pMerger);

	2451 }

	2452 }

	2453 }

	2454

	2455 if( rc==SQLITE_OK ){

	2456 #if SQLITE_MAX_WORKER_THREADS>0

	2457 if( pMain!=0 ){

	2458 rc = vdbeIncrMergerNew(pTask, pRoot, &pMain->aReadr[iTask].pIncr);

	2459 }else

	2460 #endif

	2461 {

	2462 assert( pMain==0 );

	2463 pMain = pRoot;

	2464 }

	2465 }else{

	2466 vdbeMergeEngineFree(pRoot);

	2467 }

	2468 }

	2469 }

	2470

	2471 if( rc!=SQLITE_OK ){

	2472 vdbeMergeEngineFree(pMain);

	2473 pMain = 0;

	2474 }

	2475 *ppOut = pMain;

	2476 return rc;

	2477 }

	2478

	2479 /*

	2480 ** This function is called as part of an sqlite3VdbeSorterRewind() operation

	2481 ** on a sorter that has written two or more PMAs to temporary files. It sets

	2482 ** up either VdbeSorter.pMerger (for single threaded sorters) or pReader

	2483 ** (for multi-threaded sorters) so that it can be used to iterate through

	2484 ** all records stored in the sorter.

	2485 **

	2486 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	2487 */

	2488 static int vdbeSorterSetupMerge(VdbeSorter *pSorter){

	2489 int rc; /* Return code */

	2490 SortSubtask *pTask0 = &pSorter->aTask[0];

	2491 MergeEngine *pMain = 0;

	2492 #if SQLITE_MAX_WORKER_THREADS

	2493 sqlite3 *db = pTask0->pSorter->db;

	2494 int i;

	2495 SorterCompare xCompare = vdbeSorterGetCompare(pSorter);

	2496 for(i=0; i<pSorter->nTask; i++){

	2497 pSorter->aTask[i].xCompare = xCompare;

	2498 }

	2499 #endif

	2500

	2501 rc = vdbeSorterMergeTreeBuild(pSorter, &pMain);

	2502 if( rc==SQLITE_OK ){

	2503 #if SQLITE_MAX_WORKER_THREADS

	2504 assert( pSorter->bUseThreads==0 || pSorter->nTask>1 );

	2505 if( pSorter->bUseThreads ){

	2506 int iTask;

	2507 PmaReader *pReadr = 0;

	2508 SortSubtask *pLast = &pSorter->aTask[pSorter->nTask-1];

	2509 rc = vdbeSortAllocUnpacked(pLast);

	2510 if( rc==SQLITE_OK ){

	2511 pReadr = (PmaReader*)sqlite3DbMallocZero(db, sizeof(PmaReader));

	2512 pSorter->pReader = pReadr;

	2513 if( pReadr==0 ) rc = SQLITE_NOMEM_BKPT;

	2514 }

	2515 if( rc==SQLITE_OK ){

	2516 rc = vdbeIncrMergerNew(pLast, pMain, &pReadr->pIncr);

	2517 if( rc==SQLITE_OK ){

	2518 vdbeIncrMergerSetThreads(pReadr->pIncr);

	2519 for(iTask=0; iTask<(pSorter->nTask-1); iTask++){

	2520 IncrMerger *pIncr;

	2521 if( (pIncr = pMain->aReadr[iTask].pIncr) ){

	2522 vdbeIncrMergerSetThreads(pIncr);

	2523 assert( pIncr->pTask!=pLast );

	2524 }

	2525 }

	2526 for(iTask=0; rc==SQLITE_OK && iTask<pSorter->nTask; iTask++){

	2527 /* Check that:

	2528 **

	2529 ** a) The incremental merge object is configured to use the

	2530 ** right task, and

	2531 ** b) If it is using task (nTask-1), it is configured to run

	2532 ** in single-threaded mode. This is important, as the

	2533 ** root merge (INCRINIT_ROOT) will be using the same task

	2534 ** object.

	2535 */

	2536 PmaReader *p = &pMain->aReadr[iTask];

	2537 assert( p->pIncr==0 || (

	2538 (p->pIncr->pTask==&pSorter->aTask[iTask]) /* a */

	2539 && (iTask!=pSorter->nTask-1 || p->pIncr->bUseThread==0) /* b */

	2540 ));

	2541 rc = vdbePmaReaderIncrInit(p, INCRINIT_TASK);

	2542 }

	2543 }

	2544 pMain = 0;

	2545 }

	2546 if( rc==SQLITE_OK ){

	2547 rc = vdbePmaReaderIncrMergeInit(pReadr, INCRINIT_ROOT);

	2548 }

	2549 }else

	2550 #endif

	2551 {

	2552 rc = vdbeMergeEngineInit(pTask0, pMain, INCRINIT_NORMAL);

	2553 pSorter->pMerger = pMain;

	2554 pMain = 0;

	2555 }

	2556 }

	2557

	2558 if( rc!=SQLITE_OK ){

	2559 vdbeMergeEngineFree(pMain);

	2560 }

	2561 return rc;

	2562 }

	2563

	2564

	2565 /*

	2566 ** Once the sorter has been populated by calls to sqlite3VdbeSorterWrite,

	2567 ** this function is called to prepare for iterating through the records

	2568 ** in sorted order.

	2569 */

	2570 int sqlite3VdbeSorterRewind(const VdbeCursor *pCsr, int *pbEof){

	2571 VdbeSorter *pSorter;

	2572 int rc = SQLITE_OK; /* Return code */

	2573

	2574 assert( pCsr->eCurType==CURTYPE_SORTER );

	2575 pSorter = pCsr->uc.pSorter;

	2576 assert( pSorter );

	2577

	2578 /* If no data has been written to disk, then do not do so now. Instead,

	2579 ** sort the VdbeSorter.pRecord list. The vdbe layer will read data directly

	2580 ** from the in-memory list. */

	2581 if( pSorter->bUsePMA==0 ){

	2582 if( pSorter->list.pList ){

	2583 *pbEof = 0;

	2584 rc = vdbeSorterSort(&pSorter->aTask[0], &pSorter->list);

	2585 }else{

	2586 *pbEof = 1;

	2587 }

	2588 return rc;

	2589 }

	2590

	2591 /* Write the current in-memory list to a PMA. When the VdbeSorterWrite()

	2592 ** function flushes the contents of memory to disk, it always

	2593 ** creates a new list consisting of a single key immediately afterwards.

	2594 ** So the list is never empty at this point. */

	2595 assert( pSorter->list.pList );

	2596 rc = vdbeSorterFlushPMA(pSorter);

	2597

	2598 /* Join all threads */

	2599 rc = vdbeSorterJoinAll(pSorter, rc);

	2600

	2601 vdbeSorterRewindDebug("rewind");

	2602

	2603 /* Assuming no errors have occurred, set up a merger structure to

	2604 ** incrementally read and merge all remaining PMAs. */

	2605 assert( pSorter->pReader==0 );

	2606 if( rc==SQLITE_OK ){

	2607 rc = vdbeSorterSetupMerge(pSorter);

	2608 *pbEof = 0;

	2609 }

	2610

	2611 vdbeSorterRewindDebug("rewinddone");

	2612 return rc;

	2613 }

	2614

	2615 /*

	2616 ** Advance to the next element in the sorter.

	2617 */

	2618 int sqlite3VdbeSorterNext(sqlite3 *db, const VdbeCursor *pCsr, int *pbEof){

	2619 VdbeSorter *pSorter;

	2620 int rc; /* Return code */

	2621

	2622 assert( pCsr->eCurType==CURTYPE_SORTER );

	2623 pSorter = pCsr->uc.pSorter;

	2624 assert( pSorter->bUsePMA || (pSorter->pReader==0 && pSorter->pMerger==0) );

	2625 if( pSorter->bUsePMA ){

	2626 assert( pSorter->pReader==0 || pSorter->pMerger==0 );

	2627 assert( pSorter->bUseThreads==0 || pSorter->pReader );

	2628 assert( pSorter->bUseThreads==1 || pSorter->pMerger );

	2629 #if SQLITE_MAX_WORKER_THREADS>0

	2630 if( pSorter->bUseThreads ){

	2631 rc = vdbePmaReaderNext(pSorter->pReader);

	2632 *pbEof = (pSorter->pReader->pFd==0);

	2633 }else

	2634 #endif

	2635 /*if( !pSorter->bUseThreads )*/ {

	2636 assert( pSorter->pMerger!=0 );

	2637 assert( pSorter->pMerger->pTask==(&pSorter->aTask[0]) );

	2638 rc = vdbeMergeEngineStep(pSorter->pMerger, pbEof);

	2639 }

	2640 }else{

	2641 SorterRecord *pFree = pSorter->list.pList;

	2642 pSorter->list.pList = pFree->u.pNext;

	2643 pFree->u.pNext = 0;

	2644 if( pSorter->list.aMemory==0 ) vdbeSorterRecordFree(db, pFree);

	2645 *pbEof = !pSorter->list.pList;

	2646 rc = SQLITE_OK;

	2647 }

	2648 return rc;

	2649 }
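
Reviewer note: the non-PMA branch of sqlite3VdbeSorterNext() (lines 2641-2646) pops the head of the in-memory list and reports EOF once the list is empty. A minimal sketch of that step follows, assuming a hypothetical `ToyRecord` node type and a caller that owns the nodes. The real code additionally frees the unlinked record when `list.aMemory` is 0, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SorterRecord; not an SQLite type. */
typedef struct ToyRecord ToyRecord;
struct ToyRecord {
  int val;            /* Payload, unused by the advance step */
  ToyRecord *pNext;   /* Next record in sorted order, or NULL */
};

/*
** Advance past the current head of *ppList. The old head is unlinked
** (its pNext is cleared) and *pbEof is set to 1 if no records remain,
** or 0 otherwise. The list must be non-empty on entry.
*/
void toyListNext(ToyRecord **ppList, int *pbEof){
  ToyRecord *pFree = *ppList;
  assert( pFree!=0 );
  *ppList = pFree->pNext;
  pFree->pNext = 0;
  *pbEof = (*ppList==0);
}
```

As in the original, the function assumes the caller never advances past EOF; calling it on an empty list would trip the assertion.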

	2650

	2651 /*

	2652 ** Return a pointer to a buffer owned by the sorter that contains the

	2653 ** current key.

	2654 */

	2655 static void *vdbeSorterRowkey(

	2656 const VdbeSorter *pSorter, /* Sorter object */

	2657 int *pnKey /* OUT: Size of current key in bytes */

	2658 ){

	2659 void *pKey;

	2660 if( pSorter->bUsePMA ){

	2661 PmaReader *pReader;

	2662 #if SQLITE_MAX_WORKER_THREADS>0

	2663 if( pSorter->bUseThreads ){

	2664 pReader = pSorter->pReader;

	2665 }else

	2666 #endif

	2667 /*if( !pSorter->bUseThreads )*/{

	2668 pReader = &pSorter->pMerger->aReadr[pSorter->pMerger->aTree[1]];

	2669 }

	2670 *pnKey = pReader->nKey;

	2671 pKey = pReader->aKey;

	2672 }else{

	2673 *pnKey = pSorter->list.pList->nVal;

	2674 pKey = SRVAL(pSorter->list.pList);

	2675 }

	2676 return pKey;

	2677 }

	2678

	2679 /*

	2680 ** Copy the current sorter key into the memory cell pOut.

	2681 */

	2682 int sqlite3VdbeSorterRowkey(const VdbeCursor *pCsr, Mem *pOut){

	2683 VdbeSorter *pSorter;

	2684 void *pKey; int nKey; /* Sorter key to copy into pOut */

	2685

	2686 assert( pCsr->eCurType==CURTYPE_SORTER );

	2687 pSorter = pCsr->uc.pSorter;

	2688 pKey = vdbeSorterRowkey(pSorter, &nKey);

	2689 if( sqlite3VdbeMemClearAndResize(pOut, nKey) ){

	2690 return SQLITE_NOMEM_BKPT;

	2691 }

	2692 pOut->n = nKey;

	2693 MemSetTypeFlag(pOut, MEM_Blob);

	2694 memcpy(pOut->z, pKey, nKey);

	2695

	2696 return SQLITE_OK;

	2697 }

	2698

	2699 /*

	2700 ** Compare the key in memory cell pVal with the key that the sorter cursor

	2701 ** passed as the first argument currently points to. For the purposes of

	2702 ** the comparison, ignore the rowid field at the end of each record.

	2703 **

	2704 ** If the sorter cursor key contains any NULL values, consider it to be

	2705 ** less than pVal. Even if pVal also contains NULL values.

	2706 **

	2707 ** If an error occurs, return an SQLite error code (i.e. SQLITE_NOMEM).

	2708 ** Otherwise, set *pRes to a negative, zero or positive value if the

	2709 ** key in pVal is smaller than, equal to or larger than the current sorter

	2710 ** key.

	2711 **

	2712 ** This routine forms the core of the OP_SorterCompare opcode, which in

	2713 ** turn is used to verify uniqueness when constructing a UNIQUE INDEX.

	2714 */

	2715 int sqlite3VdbeSorterCompare(

	2716 const VdbeCursor *pCsr, /* Sorter cursor */

	2717 Mem *pVal, /* Value to compare to current sorter key */

	2718 int nKeyCol, /* Compare this many columns */

	2719 int *pRes /* OUT: Result of comparison */

	2720 ){

	2721 VdbeSorter *pSorter;

	2722 UnpackedRecord *r2;

	2723 KeyInfo *pKeyInfo;

	2724 int i;

	2725 void *pKey; int nKey; /* Sorter key to compare pVal with */

	2726

	2727 assert( pCsr->eCurType==CURTYPE_SORTER );

	2728 pSorter = pCsr->uc.pSorter;

	2729 r2 = pSorter->pUnpacked;

	2730 pKeyInfo = pCsr->pKeyInfo;

	2731 if( r2==0 ){

	2732 r2 = pSorter->pUnpacked = sqlite3VdbeAllocUnpackedRecord(pKeyInfo);

	2733 if( r2==0 ) return SQLITE_NOMEM_BKPT;

	2734 r2->nField = nKeyCol;

	2735 }

	2736 assert( r2->nField==nKeyCol );

	2737

	2738 pKey = vdbeSorterRowkey(pSorter, &nKey);

	2739 sqlite3VdbeRecordUnpack(pKeyInfo, nKey, pKey, r2);

	2740 for(i=0; i<nKeyCol; i++){

	2741 if( r2->aMem[i].flags & MEM_Null ){

	2742 *pRes = -1;

	2743 return SQLITE_OK;

	2744 }

	2745 }

	2746

	2747 *pRes = sqlite3VdbeRecordCompare(pVal->n, pVal->z, r2);

	2748 return SQLITE_OK;

	2749 }
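
Reviewer note: the header comment of sqlite3VdbeSorterCompare() says that a sorter key containing any NULL in its first nKeyCol fields is treated as less than pVal, even if pVal also contains NULLs. A minimal sketch of that rule on decoded integer fields follows. `ToyField` and `toySorterCompare` are hypothetical names, the return value follows the sign convention stated in the header comment, and placing NULL before every integer in the fallback comparison is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a decoded key field; not an SQLite type. */
typedef struct ToyField {
  int isNull;   /* Non-zero if the field is NULL */
  int value;    /* Integer payload, ignored when isNull is set */
} ToyField;

/*
** Compare the first nKeyCol fields of aVal against aCur, the current
** sorter key. Return a negative, zero or positive value if aVal is
** smaller than, equal to or larger than aCur. If any of the first
** nKeyCol fields of aCur is NULL, return -1 without comparing values.
*/
int toySorterCompare(const ToyField *aVal, const ToyField *aCur, int nKeyCol){
  int i;
  assert( nKeyCol>=0 );
  /* NULL short-circuit: mirrors the MEM_Null loop in the original. */
  for(i=0; i<nKeyCol; i++){
    if( aCur[i].isNull ) return -1;
  }
  /* aCur has no NULLs here, so a NULL in aVal sorts first (assumption). */
  for(i=0; i<nKeyCol; i++){
    if( aVal[i].isNull ) return -1;
    if( aVal[i].value<aCur[i].value ) return -1;
    if( aVal[i].value>aCur[i].value ) return 1;
  }
  return 0;
}
```

The practical effect for a UNIQUE index build is that two keys that both contain NULL never compare equal, so they are not reported as duplicates.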
