third_party/sqlite/amalgamation/sqlite3.02.c - Issue 2755803002: NCI: trybot test for sqlite 3.17 import.

Side by Side Diff: third_party/sqlite/amalgamation/sqlite3.02.c

Issue 2755803002: NCI: trybot test for sqlite 3.17 import. (Closed)

Patch Set: also clang on Linux i386 Created 3 years, 9 months ago

OLD	NEW
(Empty)
	1 /************ Begin file pcache1.c ***************************************/

	2 /*

	3 ** 2008 November 05

	4 **

	5 ** The author disclaims copyright to this source code. In place of

	6 ** a legal notice, here is a blessing:

	7 **

	8 ** May you do good and not evil.

	9 ** May you find forgiveness for yourself and forgive others.

	10 ** May you share freely, never taking more than you give.

	11 **

	12 *************************************************************************

	13 **

	14 ** This file implements the default page cache implementation (the

	15 ** sqlite3_pcache interface). It also contains part of the implementation

	16 ** of the SQLITE_CONFIG_PAGECACHE and sqlite3_release_memory() features.

	17 ** If the default page cache implementation is overridden, then neither of

	18 ** these two features are available.

	19 **

	20 ** A Page cache line looks like this:

	21 **

	22 ** -------------------------------------------------------------

	23 ** | database page content | PgHdr1 | MemPage | PgHdr |

	24 ** -------------------------------------------------------------

	25 **

	26 ** The database page content is up front (so that buffer overreads tend to

	27 ** flow harmlessly into the PgHdr1, MemPage, and PgHdr extensions). MemPage

	28 ** is the extension added by the btree.c module containing information such

	29 ** as the database page number and how that database page is used. PgHdr

	30 ** is added by the pcache.c layer and contains information used to keep track

	31 ** of which pages are "dirty". PgHdr1 is an extension added by this

	32 ** module (pcache1.c). The PgHdr1 header is a subclass of sqlite3_pcache_page.

	33 ** PgHdr1 contains information needed to look up a page by its page number.

	34 ** The superclass sqlite3_pcache_page.pBuf points to the start of the

	35 ** database page content and sqlite3_pcache_page.pExtra points to PgHdr.

	36 **

	37 ** The size of the extension (MemPage+PgHdr+PgHdr1) can be determined at

	38 ** runtime using sqlite3_config(SQLITE_CONFIG_PCACHE_HDRSZ, &size). The

	39 ** sizes of the extensions sum to 272 bytes on x64 for 3.8.10, but this

	40 ** size can vary according to architecture, compile-time options, and

	41 ** SQLite library version number.

	42 **

	43 ** If SQLITE_PCACHE_SEPARATE_HEADER is defined, then the extension is obtained

	44 ** using a separate memory allocation from the database page content. This

	45 ** seeks to overcome the "clownshoe" problem (also called "internal

	46 ** fragmentation" in academic literature) of allocating a few bytes more

	47 ** than a power of two with the memory allocator rounding up to the next

	48 ** power of two, and leaving the rounded-up space unused.

	49 **

	50 ** This module tracks pointers to PgHdr1 objects. Only pcache.c communicates

	51 ** with this module. Information is passed back and forth as PgHdr1 pointers.

	52 **

	53 ** The pcache.c and pager.c modules deal with pointers to PgHdr objects.

	54 ** The btree.c module deals with pointers to MemPage objects.

	55 **

	56 ** SOURCE OF PAGE CACHE MEMORY:

	57 **

	58 ** Memory for a page might come from any of three sources:

	59 **

	60 ** (1) The general-purpose memory allocator - sqlite3Malloc()

	61 ** (2) Global page-cache memory provided using sqlite3_config() with

	62 ** SQLITE_CONFIG_PAGECACHE.

	63 ** (3) PCache-local bulk allocation.

	64 **

	65 ** The third case is a chunk of heap memory (defaulting to 100 pages worth)

	66 ** that is allocated when the page cache is created. The size of the local

	67 ** bulk allocation can be adjusted using

	68 **

	69 ** sqlite3_config(SQLITE_CONFIG_PAGECACHE, (void*)0, 0, N).

	70 **

	71 ** If N is positive, then N pages worth of memory are allocated using a single

	72 ** sqlite3Malloc() call and that memory is used for the first N pages allocated.

	73 ** Or if N is negative, then -1024*N bytes of memory are allocated and used

	74 ** for as many pages as can be accommodated.

	75 **

	76 ** Only one of (2) or (3) can be used. Once the memory available to (2) or

	77 ** (3) is exhausted, subsequent allocations fail over to the general-purpose

	78 ** memory allocator (1).

	79 **

	80 ** Earlier versions of SQLite used only methods (1) and (2). But experiments

	81 ** show that method (3) with N==100 provides about a 5% performance boost for

	82 ** common workloads.

	83 */
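The sizing rule for the pcache-local bulk allocation described above can be sketched in isolation. This is a minimal illustration, not the code from this patch: `demo_bulk_bytes` and its parameter names are hypothetical, and the cap at `nMax` cache lines mirrors what `pcache1InitBulk()` does further down in this file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to request for the bulk
** allocation, given the configured N (nInitPage), the size of one
** cache line (szAlloc), and the cache_size limit (nMax). */
static int64_t demo_bulk_bytes(int nInitPage, int szAlloc, unsigned nMax){
  int64_t szBulk;
  int64_t cap = (int64_t)szAlloc * (int64_t)nMax;
  if( nInitPage==0 ) return 0;               /* bulk allocation disabled */
  if( nInitPage>0 ){
    szBulk = (int64_t)szAlloc * nInitPage;   /* N pages worth of memory */
  }else{
    szBulk = -1024 * (int64_t)nInitPage;     /* -1024*N bytes */
  }
  if( szBulk>cap ) szBulk = cap;             /* never more than nMax lines */
  return szBulk;
}
```

With a positive N the result scales with the cache-line size; with a negative N it is a fixed byte budget that holds however many lines fit.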

	84 /* #include "sqliteInt.h" */

	85

	86 typedef struct PCache1 PCache1;

	87 typedef struct PgHdr1 PgHdr1;

	88 typedef struct PgFreeslot PgFreeslot;

	89 typedef struct PGroup PGroup;

	90

	91 /*

	92 ** Each cache entry is represented by an instance of the following

	93 ** structure. Unless SQLITE_PCACHE_SEPARATE_HEADER is defined, a buffer of

	94 ** PgHdr1.pCache->szPage bytes is allocated directly before this structure

	95 ** in memory.

	96 */

	97 struct PgHdr1 {

	98 sqlite3_pcache_page page; /* Base class. Must be first. pBuf & pExtra */

	99 unsigned int iKey; /* Key value (page number) */

	100 u8 isPinned; /* Page in use, not on the LRU list */

	101 u8 isBulkLocal; /* This page from bulk local storage */

	102 u8 isAnchor; /* This is the PGroup.lru element */

	103 PgHdr1 *pNext; /* Next in hash table chain */

	104 PCache1 *pCache; /* Cache that currently owns this page */

	105 PgHdr1 *pLruNext; /* Next in LRU list of unpinned pages */

	106 PgHdr1 *pLruPrev; /* Previous in LRU list of unpinned pages */

	107 };
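To make the "content first, header after" layout concrete, here is a small standalone sketch of a single cache line in the non-SEPARATE_HEADER case. `DemoHdr`, `demo_alloc_line`, and `demo_free_line` are hypothetical names, plain `malloc()` stands in for the SQLite allocator, and `szPage` is assumed to be a multiple of 8 so that the header is suitably aligned.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct DemoHdr DemoHdr;
struct DemoHdr {
  void *pBuf;      /* Start of the page content */
  void *pExtra;    /* Start of the trailing extension area */
  unsigned iKey;   /* Page number */
};

/* Allocate one line laid out as [ page content | DemoHdr | extra ].
** Returns a pointer to the header, or NULL on allocation failure. */
static DemoHdr *demo_alloc_line(size_t szPage, size_t szExtra){
  unsigned char *pPg = (unsigned char*)malloc(szPage + sizeof(DemoHdr) + szExtra);
  DemoHdr *p;
  if( pPg==0 ) return 0;
  p = (DemoHdr*)&pPg[szPage];   /* header sits directly after the content */
  p->pBuf = pPg;
  p->pExtra = &p[1];            /* extension area follows the header */
  p->iKey = 0;
  return p;
}

/* The whole line is one allocation that starts at pBuf. */
static void demo_free_line(DemoHdr *p){
  if( p ) free(p->pBuf);
}
```

Because the allocation begins at the page content, freeing goes through `pBuf` rather than through the header pointer, which is the same reason `pcache1FreePage()` passes `p->page.pBuf` to `pcache1Free()`.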

	108

	109 /* Each page cache (or PCache) belongs to a PGroup. A PGroup is a set

	110 ** of one or more PCaches that are able to recycle each other's unpinned

	111 ** pages when they are under memory pressure. A PGroup is an instance of

	112 ** the following object.

	113 **

	114 ** This page cache implementation works in one of two modes:

	115 **

	116 ** (1) Every PCache is the sole member of its own PGroup. There is

	117 ** one PGroup per PCache.

	118 **

	119 ** (2) There is a single global PGroup that all PCaches are a member

	120 ** of.

	121 **

	122 ** Mode 1 uses more memory (since PCache instances are not able to rob

	123 ** unused pages from other PCaches) but it also operates without a mutex,

	124 ** and is therefore often faster. Mode 2 requires a mutex in order to be

	125 ** threadsafe, but recycles pages more efficiently.

	126 **

	127 ** For mode (1), PGroup.mutex is NULL. For mode (2) there is only a single

	128 ** PGroup which is the pcache1.grp global variable and its mutex is

	129 ** SQLITE_MUTEX_STATIC_LRU.

	130 */

	131 struct PGroup {

	132 sqlite3_mutex *mutex; /* MUTEX_STATIC_LRU or NULL */

	133 unsigned int nMaxPage; /* Sum of nMax for purgeable caches */

	134 unsigned int nMinPage; /* Sum of nMin for purgeable caches */

	135 unsigned int mxPinned; /* nMaxPage + 10 - nMinPage */

	136 unsigned int nCurrentPage; /* Number of purgeable pages allocated */

	137 PgHdr1 lru; /* The beginning and end of the LRU list */

	138 };

	139

	140 /* Each page cache is an instance of the following object. Every

	141 ** open database file (including each in-memory database and each

	142 ** temporary or transient database) has a single page cache which

	143 ** is an instance of this object.

	144 **

	145 ** Pointers to structures of this type are cast and returned as

	146 ** opaque sqlite3_pcache* handles.

	147 */

	148 struct PCache1 {

	149 /* Cache configuration parameters. Page size (szPage) and the purgeable

	150 ** flag (bPurgeable) are set when the cache is created. nMax may be

	151 ** modified at any time by a call to the pcache1Cachesize() method.

	152 ** The PGroup mutex must be held when accessing nMax.

	153 */

	154 PGroup *pGroup; /* PGroup this cache belongs to */

	155 int szPage; /* Size of database content section */

	156 int szExtra; /* sizeof(MemPage)+sizeof(PgHdr) */

	157 int szAlloc; /* Total size of one pcache line */

	158 int bPurgeable; /* True if cache is purgeable */

	159 unsigned int nMin; /* Minimum number of pages reserved */

	160 unsigned int nMax; /* Configured "cache_size" value */

	161 unsigned int n90pct; /* nMax*9/10 */

	162 unsigned int iMaxKey; /* Largest key seen since xTruncate() */

	163

	164 /* Hash table of all pages. The following variables may only be accessed

	165 ** when the accessor is holding the PGroup mutex.

	166 */

	167 unsigned int nRecyclable; /* Number of pages in the LRU list */

	168 unsigned int nPage; /* Total number of pages in apHash */

	169 unsigned int nHash; /* Number of slots in apHash[] */

	170 PgHdr1 **apHash; /* Hash table for fast lookup by key */

	171 PgHdr1 *pFree; /* List of unused pcache-local pages */

	172 void *pBulk; /* Bulk memory used by pcache-local */

	173 };

	174

	175 /*

	176 ** Free slots in the allocator used to divide up the global page cache

	177 ** buffer provided using the SQLITE_CONFIG_PAGECACHE mechanism.

	178 */

	179 struct PgFreeslot {

	180 PgFreeslot *pNext; /* Next free slot */

	181 };

	182

	183 /*

	184 ** Global data used by this cache.

	185 */

	186 static SQLITE_WSD struct PCacheGlobal {

	187 PGroup grp; /* The global PGroup for mode (2) */

	188

	189 /* Variables related to SQLITE_CONFIG_PAGECACHE settings. The

	190 ** szSlot, nSlot, pStart, pEnd, nReserve, and isInit values are all

	191 ** fixed at sqlite3_initialize() time and do not require mutex protection.

	192 ** The nFreeSlot and pFree values do require mutex protection.

	193 */

	194 int isInit; /* True if initialized */

	195 int separateCache; /* Use a new PGroup for each PCache */

	196 int nInitPage; /* Initial bulk allocation size */

	197 int szSlot; /* Size of each free slot */

	198 int nSlot; /* The number of pcache slots */

	199 int nReserve; /* Try to keep nFreeSlot above this */

	200 void *pStart, *pEnd; /* Bounds of global page cache memory */

	201 /* Above requires no mutex. Use mutex below for variables that follow. */

	202 sqlite3_mutex *mutex; /* Mutex for accessing the following: */

	203 PgFreeslot *pFree; /* Free page blocks */

	204 int nFreeSlot; /* Number of unused pcache slots */

	205 /* The following value requires a mutex to change. We skip the mutex on

	206 ** reading because (1) most platforms read a 32-bit integer atomically and

	207 ** (2) even if an incorrect value is read, no great harm is done since this

	208 ** is really just an optimization. */

	209 int bUnderPressure; /* True if low on PAGECACHE memory */

	210 } pcache1_g;

	211

	212 /*

	213 ** All code in this file should access the global structure above via the

	214 ** alias "pcache1". This ensures that the WSD emulation is used when

	215 ** compiling for systems that do not support real WSD.

	216 */

	217 #define pcache1 (GLOBAL(struct PCacheGlobal, pcache1_g))

	218

	219 /*

	220 ** Macros to enter and leave the PCache LRU mutex.

	221 */

	222 #if !defined(SQLITE_ENABLE_MEMORY_MANAGEMENT) || SQLITE_THREADSAFE==0

	223 # define pcache1EnterMutex(X) assert((X)->mutex==0)

	224 # define pcache1LeaveMutex(X) assert((X)->mutex==0)

	225 # define PCACHE1_MIGHT_USE_GROUP_MUTEX 0

	226 #else

	227 # define pcache1EnterMutex(X) sqlite3_mutex_enter((X)->mutex)

	228 # define pcache1LeaveMutex(X) sqlite3_mutex_leave((X)->mutex)

	229 # define PCACHE1_MIGHT_USE_GROUP_MUTEX 1

	230 #endif

	231

	232 /******************************************************************************/

	233 /****** Page Allocation/SQLITE_CONFIG_PCACHE Related Functions ************/

	234

	235

	236 /*

	237 ** This function is called during initialization if a static buffer is

	238 ** supplied to use for the page-cache by passing the SQLITE_CONFIG_PAGECACHE

	239 ** verb to sqlite3_config(). Parameter pBuf points to an allocation large

	240 ** enough to contain 'n' buffers of 'sz' bytes each.

	241 **

	242 ** This routine is called from sqlite3_initialize() and so it is guaranteed

	243 ** to be serialized already. There is no need for further mutexing.

	244 */

	245 SQLITE_PRIVATE void sqlite3PCacheBufferSetup(void *pBuf, int sz, int n){

	246 if( pcache1.isInit ){

	247 PgFreeslot *p;

	248 if( pBuf==0 ) sz = n = 0;

	249 sz = ROUNDDOWN8(sz);

	250 pcache1.szSlot = sz;

	251 pcache1.nSlot = pcache1.nFreeSlot = n;

	252 pcache1.nReserve = n>90 ? 10 : (n/10 + 1);

	253 pcache1.pStart = pBuf;

	254 pcache1.pFree = 0;

	255 pcache1.bUnderPressure = 0;

	256 while( n-- ){

	257 p = (PgFreeslot*)pBuf;

	258 p->pNext = pcache1.pFree;

	259 pcache1.pFree = p;

	260 pBuf = (void*)&((char*)pBuf)[sz];

	261 }

	262 pcache1.pEnd = pBuf;

	263 }

	264 }
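The slot carving done by sqlite3PCacheBufferSetup(), together with the pop and push that pcache1Alloc() and pcache1Free() perform on the resulting list, can be sketched without any SQLite dependencies. Everything named `Demo*`/`demo_*` is hypothetical; mutexes and status counters are left out, and the caller is assumed to supply a suitably aligned buffer of at least `n*sz` bytes with `sz>=8`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoSlot DemoSlot;
struct DemoSlot { DemoSlot *pNext; };   /* overlays the start of a free slot */

typedef struct DemoPool {
  DemoSlot *pFree;     /* Head of the free-slot list */
  int nFree;           /* Number of free slots */
  int nReserve;        /* Try to keep nFree at or above this */
  int underPressure;   /* True if nFree has dropped below nReserve */
} DemoPool;

/* Carve pBuf into n slots of sz bytes and thread them onto a free list. */
static void demo_pool_init(DemoPool *pool, void *pBuf, int sz, int n){
  char *z = (char*)pBuf;
  sz &= ~7;                              /* round down to a multiple of 8 */
  pool->pFree = 0;
  pool->nFree = n;
  pool->nReserve = n>90 ? 10 : (n/10 + 1);
  pool->underPressure = 0;
  while( n-- > 0 ){
    DemoSlot *p = (DemoSlot*)z;
    p->pNext = pool->pFree;
    pool->pFree = p;
    z += sz;
  }
}

/* Pop a slot, or return NULL if the pool is exhausted. */
static void *demo_pool_alloc(DemoPool *pool){
  DemoSlot *p = pool->pFree;
  if( p ){
    pool->pFree = p->pNext;
    pool->nFree--;
    pool->underPressure = pool->nFree < pool->nReserve;
  }
  return p;
}

/* Push a slot back onto the free list. */
static void demo_pool_free(DemoPool *pool, void *pOld){
  DemoSlot *p = (DemoSlot*)pOld;
  p->pNext = pool->pFree;
  pool->pFree = p;
  pool->nFree++;
  pool->underPressure = pool->nFree < pool->nReserve;
}
```

Since each slot is pushed onto the head of the list as it is carved, the first allocation returns the last slot in the buffer. In the real code a NULL result is the cue to fall back to sqlite3Malloc().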

	265

	266 /*

	267 ** Try to initialize the pCache->pFree and pCache->pBulk fields. Return

	268 ** true if pCache->pFree ends up containing one or more free pages.

	269 */

	270 static int pcache1InitBulk(PCache1 *pCache){

	271 i64 szBulk;

	272 char *zBulk;

	273 if( pcache1.nInitPage==0 ) return 0;

	274 /* Do not bother with a bulk allocation if the cache size is very small */

	275 if( pCache->nMax<3 ) return 0;

	276 sqlite3BeginBenignMalloc();

	277 if( pcache1.nInitPage>0 ){

	278 szBulk = pCache->szAlloc * (i64)pcache1.nInitPage;

	279 }else{

	280 szBulk = -1024 * (i64)pcache1.nInitPage;

	281 }

	282 if( szBulk > pCache->szAlloc*(i64)pCache->nMax ){

	283 szBulk = pCache->szAlloc*(i64)pCache->nMax;

	284 }

	285 zBulk = pCache->pBulk = sqlite3Malloc( szBulk );

	286 sqlite3EndBenignMalloc();

	287 if( zBulk ){

	288 int nBulk = sqlite3MallocSize(zBulk)/pCache->szAlloc;

	289 int i;

	290 for(i=0; i<nBulk; i++){

	291 PgHdr1 *pX = (PgHdr1*)&zBulk[pCache->szPage];

	292 pX->page.pBuf = zBulk;

	293 pX->page.pExtra = &pX[1];

	294 pX->isBulkLocal = 1;

	295 pX->isAnchor = 0;

	296 pX->pNext = pCache->pFree;

	297 pCache->pFree = pX;

	298 zBulk += pCache->szAlloc;

	299 }

	300 }

	301 return pCache->pFree!=0;

	302 }

	303

	304 /*

	305 ** Malloc function used within this file to allocate space from the buffer

	306 ** configured using sqlite3_config(SQLITE_CONFIG_PAGECACHE) option. If no

	307 ** such buffer exists or there is no space left in it, this function falls

	308 ** back to sqlite3Malloc().

	309 **

	310 ** Multiple threads can run this routine at the same time. Global variables

	311 ** in pcache1 need to be protected via mutex.

	312 */

	313 static void *pcache1Alloc(int nByte){

	314 void *p = 0;

	315 assert( sqlite3_mutex_notheld(pcache1.grp.mutex) );

	316 if( nByte<=pcache1.szSlot ){

	317 sqlite3_mutex_enter(pcache1.mutex);

	318 p = (PgHdr1 *)pcache1.pFree;

	319 if( p ){

	320 pcache1.pFree = pcache1.pFree->pNext;

	321 pcache1.nFreeSlot--;

	322 pcache1.bUnderPressure = pcache1.nFreeSlot<pcache1.nReserve;

	323 assert( pcache1.nFreeSlot>=0 );

	324 sqlite3StatusHighwater(SQLITE_STATUS_PAGECACHE_SIZE, nByte);

	325 sqlite3StatusUp(SQLITE_STATUS_PAGECACHE_USED, 1);

	326 }

	327 sqlite3_mutex_leave(pcache1.mutex);

	328 }

	329 if( p==0 ){

	330 /* Memory is not available in the SQLITE_CONFIG_PAGECACHE pool. Get

	331 ** it from sqlite3Malloc instead.

	332 */

	333 p = sqlite3Malloc(nByte);

	334 #ifndef SQLITE_DISABLE_PAGECACHE_OVERFLOW_STATS

	335 if( p ){

	336 int sz = sqlite3MallocSize(p);

	337 sqlite3_mutex_enter(pcache1.mutex);

	338 sqlite3StatusHighwater(SQLITE_STATUS_PAGECACHE_SIZE, nByte);

	339 sqlite3StatusUp(SQLITE_STATUS_PAGECACHE_OVERFLOW, sz);

	340 sqlite3_mutex_leave(pcache1.mutex);

	341 }

	342 #endif

	343 sqlite3MemdebugSetType(p, MEMTYPE_PCACHE);

	344 }

	345 return p;

	346 }

	347

	348 /*

	349 ** Free an allocated buffer obtained from pcache1Alloc().

	350 */

	351 static void pcache1Free(void *p){

	352 if( p==0 ) return;

	353 if( SQLITE_WITHIN(p, pcache1.pStart, pcache1.pEnd) ){

	354 PgFreeslot *pSlot;

	355 sqlite3_mutex_enter(pcache1.mutex);

	356 sqlite3StatusDown(SQLITE_STATUS_PAGECACHE_USED, 1);

	357 pSlot = (PgFreeslot*)p;

	358 pSlot->pNext = pcache1.pFree;

	359 pcache1.pFree = pSlot;

	360 pcache1.nFreeSlot++;

	361 pcache1.bUnderPressure = pcache1.nFreeSlot<pcache1.nReserve;

	362 assert( pcache1.nFreeSlot<=pcache1.nSlot );

	363 sqlite3_mutex_leave(pcache1.mutex);

	364 }else{

	365 assert( sqlite3MemdebugHasType(p, MEMTYPE_PCACHE) );

	366 sqlite3MemdebugSetType(p, MEMTYPE_HEAP);

	367 #ifndef SQLITE_DISABLE_PAGECACHE_OVERFLOW_STATS

	368 {

	369 int nFreed = 0;

	370 nFreed = sqlite3MallocSize(p);

	371 sqlite3_mutex_enter(pcache1.mutex);

	372 sqlite3StatusDown(SQLITE_STATUS_PAGECACHE_OVERFLOW, nFreed);

	373 sqlite3_mutex_leave(pcache1.mutex);

	374 }

	375 #endif

	376 sqlite3_free(p);

	377 }

	378 }

	379

	380 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT

	381 /*

	382 ** Return the size of a pcache allocation

	383 */

	384 static int pcache1MemSize(void *p){

	385 if( p>=pcache1.pStart && p<pcache1.pEnd ){

	386 return pcache1.szSlot;

	387 }else{

	388 int iSize;

	389 assert( sqlite3MemdebugHasType(p, MEMTYPE_PCACHE) );

	390 sqlite3MemdebugSetType(p, MEMTYPE_HEAP);

	391 iSize = sqlite3MallocSize(p);

	392 sqlite3MemdebugSetType(p, MEMTYPE_PCACHE);

	393 return iSize;

	394 }

	395 }

	396 #endif /* SQLITE_ENABLE_MEMORY_MANAGEMENT */

	397

	398 /*

	399 ** Allocate a new page object initially associated with cache pCache.

	400 */

	401 static PgHdr1 *pcache1AllocPage(PCache1 *pCache, int benignMalloc){

	402 PgHdr1 *p = 0;

	403 void *pPg;

	404

	405 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );

	406 if( pCache->pFree || (pCache->nPage==0 && pcache1InitBulk(pCache)) ){

	407 p = pCache->pFree;

	408 pCache->pFree = p->pNext;

	409 p->pNext = 0;

	410 }else{

	411 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT

	412 /* The group mutex must be released before pcache1Alloc() is called. This

	413 ** is because it might call sqlite3_release_memory(), which assumes that

	414 ** this mutex is not held. */

	415 assert( pcache1.separateCache==0 );

	416 assert( pCache->pGroup==&pcache1.grp );

	417 pcache1LeaveMutex(pCache->pGroup);

	418 #endif

	419 if( benignMalloc ){ sqlite3BeginBenignMalloc(); }

	420 #ifdef SQLITE_PCACHE_SEPARATE_HEADER

	421 pPg = pcache1Alloc(pCache->szPage);

	422 p = sqlite3Malloc(sizeof(PgHdr1) + pCache->szExtra);

	423 if( !pPg || !p ){

	424 pcache1Free(pPg);

	425 sqlite3_free(p);

	426 pPg = 0;

	427 }

	428 #else

	429 pPg = pcache1Alloc(pCache->szAlloc);

	430 p = (PgHdr1 *)&((u8 *)pPg)[pCache->szPage];

	431 #endif

	432 if( benignMalloc ){ sqlite3EndBenignMalloc(); }

	433 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT

	434 pcache1EnterMutex(pCache->pGroup);

	435 #endif

	436 if( pPg==0 ) return 0;

	437 p->page.pBuf = pPg;

	438 p->page.pExtra = &p[1];

	439 p->isBulkLocal = 0;

	440 p->isAnchor = 0;

	441 }

	442 if( pCache->bPurgeable ){

	443 pCache->pGroup->nCurrentPage++;

	444 }

	445 return p;

	446 }

	447

	448 /*

	449 ** Free a page object allocated by pcache1AllocPage().

	450 */

	451 static void pcache1FreePage(PgHdr1 *p){

	452 PCache1 *pCache;

	453 assert( p!=0 );

	454 pCache = p->pCache;

	455 assert( sqlite3_mutex_held(p->pCache->pGroup->mutex) );

	456 if( p->isBulkLocal ){

	457 p->pNext = pCache->pFree;

	458 pCache->pFree = p;

	459 }else{

	460 pcache1Free(p->page.pBuf);

	461 #ifdef SQLITE_PCACHE_SEPARATE_HEADER

	462 sqlite3_free(p);

	463 #endif

	464 }

	465 if( pCache->bPurgeable ){

	466 pCache->pGroup->nCurrentPage--;

	467 }

	468 }

	469

	470 /*

	471 ** Malloc function used by SQLite to obtain space from the buffer configured

	472 ** using sqlite3_config(SQLITE_CONFIG_PAGECACHE) option. If no such buffer

	473 ** exists, this function falls back to sqlite3Malloc().

	474 */

	475 SQLITE_PRIVATE void *sqlite3PageMalloc(int sz){

	476 return pcache1Alloc(sz);

	477 }

	478

	479 /*

	480 ** Free an allocated buffer obtained from sqlite3PageMalloc().

	481 */

	482 SQLITE_PRIVATE void sqlite3PageFree(void *p){

	483 pcache1Free(p);

	484 }

	485

	486

	487 /*

	488 ** Return true if it is desirable to avoid allocating a new page cache

	489 ** entry.

	490 **

	491 ** If memory was allocated specifically to the page cache using

	492 ** SQLITE_CONFIG_PAGECACHE but that memory has all been used, then

	493 ** it is desirable to avoid allocating a new page cache entry because

	494 ** presumably SQLITE_CONFIG_PAGECACHE was supposed to be sufficient

	495 ** for all page cache needs and we should not need to spill the

	496 ** allocation onto the heap.

	497 **

	498 ** Or, the heap is used for all page cache memory but the heap is

	499 ** under memory pressure, then again it is desirable to avoid

	500 ** allocating a new page cache entry in order to avoid stressing

	501 ** the heap even further.

	502 */

	503 static int pcache1UnderMemoryPressure(PCache1 *pCache){

	504 if( pcache1.nSlot && (pCache->szPage+pCache->szExtra)<=pcache1.szSlot ){

	505 return pcache1.bUnderPressure;

	506 }else{

	507 return sqlite3HeapNearlyFull();

	508 }

	509 }

	510

	511 /******************************************************************************/

	512 /****** General Implementation Functions **********************************/

	513

	514 /*

	515 ** This function is used to resize the hash table used by the cache passed

	516 ** as the first argument.

	517 **

	518 ** The PCache mutex must be held when this function is called.

	519 */

	520 static void pcache1ResizeHash(PCache1 *p){

	521 PgHdr1 **apNew;

	522 unsigned int nNew;

	523 unsigned int i;

	524

	525 assert( sqlite3_mutex_held(p->pGroup->mutex) );

	526

	527 nNew = p->nHash*2;

	528 if( nNew<256 ){

	529 nNew = 256;

	530 }

	531

	532 pcache1LeaveMutex(p->pGroup);

	533 if( p->nHash ){ sqlite3BeginBenignMalloc(); }

	534 apNew = (PgHdr1 **)sqlite3MallocZero(sizeof(PgHdr1 *)*nNew);

	535 if( p->nHash ){ sqlite3EndBenignMalloc(); }

	536 pcache1EnterMutex(p->pGroup);

	537 if( apNew ){

	538 for(i=0; i<p->nHash; i++){

	539 PgHdr1 *pPage;

	540 PgHdr1 *pNext = p->apHash[i];

	541 while( (pPage = pNext)!=0 ){

	542 unsigned int h = pPage->iKey % nNew;

	543 pNext = pPage->pNext;

	544 pPage->pNext = apNew[h];

	545 apNew[h] = pPage;

	546 }

	547 }

	548 sqlite3_free(p->apHash);

	549 p->apHash = apNew;

	550 p->nHash = nNew;

	551 }

	552 }
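The rehash loop in pcache1ResizeHash() is easier to follow outside the mutex and benign-malloc handling. The following is a simplified standalone sketch under those omissions: `DemoNode`, `DemoTable`, and `demo_resize` are hypothetical names, and `calloc()` replaces sqlite3MallocZero().

```c
#include <assert.h>
#include <stdlib.h>

typedef struct DemoNode DemoNode;
struct DemoNode { unsigned iKey; DemoNode *pNext; };

typedef struct DemoTable {
  DemoNode **apHash;   /* Array of chain heads */
  unsigned nHash;      /* Number of slots in apHash[] */
} DemoTable;

/* Double the slot count (minimum 256) and relink every node into the
** new array. Returns 1 on success; on allocation failure returns 0 and
** leaves the table unchanged. */
static int demo_resize(DemoTable *t){
  unsigned nNew = t->nHash*2;
  unsigned i;
  DemoNode **apNew;
  if( nNew<256 ) nNew = 256;
  apNew = (DemoNode**)calloc(nNew, sizeof(DemoNode*));
  if( apNew==0 ) return 0;
  for(i=0; i<t->nHash; i++){
    DemoNode *p;
    DemoNode *pNext = t->apHash[i];
    while( (p = pNext)!=0 ){
      unsigned h = p->iKey % nNew;   /* slot in the new table */
      pNext = p->pNext;              /* save before relinking */
      p->pNext = apNew[h];
      apNew[h] = p;
    }
  }
  free(t->apHash);
  t->apHash = apNew;
  t->nHash = nNew;
  return 1;
}
```

As in the original, a failed allocation is not fatal: the old, smaller table keeps working, only with longer chains.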

	553

	554 /*

	555 ** This function is used internally to remove the page pPage from the

	556 ** PGroup LRU list, if it is part of it. If pPage is not part of the PGroup

	557 ** LRU list, then this function is a no-op.

	558 **

	559 ** The PGroup mutex must be held when this function is called.

	560 */

	561 static PgHdr1 *pcache1PinPage(PgHdr1 *pPage){

	562 PCache1 *pCache;

	563

	564 assert( pPage!=0 );

	565 assert( pPage->isPinned==0 );

	566 pCache = pPage->pCache;

	567 assert( pPage->pLruNext );

	568 assert( pPage->pLruPrev );

	569 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );

	570 pPage->pLruPrev->pLruNext = pPage->pLruNext;

	571 pPage->pLruNext->pLruPrev = pPage->pLruPrev;

	572 pPage->pLruNext = 0;

	573 pPage->pLruPrev = 0;

	574 pPage->isPinned = 1;

	575 assert( pPage->isAnchor==0 );

	576 assert( pCache->pGroup->lru.isAnchor==1 );

	577 pCache->nRecyclable--;

	578 return pPage;

	579 }
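The LRU list here is circular with a sentinel (the PGroup.lru anchor), which is why pcache1PinPage() can unlink without any NULL checks. A minimal sketch of that structure follows. All names are hypothetical, and `demo_unpin` inserting at the head of the list is an assumption, since the unpin path is not part of this chunk; it is at least consistent with pages being recycled from `lru.pLruPrev`.

```c
#include <assert.h>

typedef struct DemoLru DemoLru;
struct DemoLru {
  DemoLru *pLruNext, *pLruPrev;
  int isAnchor;    /* True only for the sentinel element */
  int isPinned;    /* True if not on the LRU list */
};

/* An empty list is an anchor that points at itself in both directions. */
static void demo_lru_init(DemoLru *anchor){
  anchor->isAnchor = 1;
  anchor->isPinned = 0;
  anchor->pLruNext = anchor->pLruPrev = anchor;
}

/* Assumed unpin behavior: insert right after the anchor (most recent). */
static void demo_unpin(DemoLru *anchor, DemoLru *p){
  p->pLruPrev = anchor;
  p->pLruNext = anchor->pLruNext;
  anchor->pLruNext->pLruPrev = p;
  anchor->pLruNext = p;
  p->isPinned = 0;
}

/* Unlink p from the list; neighbors always exist thanks to the anchor. */
static void demo_pin(DemoLru *p){
  p->pLruPrev->pLruNext = p->pLruNext;
  p->pLruNext->pLruPrev = p->pLruPrev;
  p->pLruNext = p->pLruPrev = 0;
  p->isPinned = 1;
}
```

The oldest unpinned page is always `anchor.pLruPrev`, and the list is empty exactly when that element is the anchor itself, which is the test pcache1EnforceMaxPage() uses to stop.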

	580

	581

	582 /*

	583 ** Remove the page supplied as an argument from the hash table

	584 ** (PCache1.apHash structure) that it is currently stored in.

	585 ** Also free the page if freeFlag is true.

	586 **

	587 ** The PGroup mutex must be held when this function is called.

	588 */

	589 static void pcache1RemoveFromHash(PgHdr1 *pPage, int freeFlag){

	590 unsigned int h;

	591 PCache1 *pCache = pPage->pCache;

	592 PgHdr1 **pp;

	593

	594 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );

	595 h = pPage->iKey % pCache->nHash;

	596 for(pp=&pCache->apHash[h]; (*pp)!=pPage; pp=&(*pp)->pNext);

	597 *pp = (*pp)->pNext;

	598

	599 pCache->nPage--;

	600 if( freeFlag ) pcache1FreePage(pPage);

	601 }
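The chain removal above relies on the pointer-to-pointer idiom: `pp` always addresses the link that points at the current node, so the head and interior cases need no special handling. A standalone sketch with hypothetical names is below. Unlike pcache1RemoveFromHash(), which assumes the page is present, this version also stops at the end of the chain and reports whether the entry was found.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoEntry DemoEntry;
struct DemoEntry { unsigned iKey; DemoEntry *pNext; };

/* Unlink pTarget from the chain whose head pointer is *ppHead.
** Returns 1 if pTarget was found and removed, 0 otherwise. */
static int demo_unlink(DemoEntry **ppHead, DemoEntry *pTarget){
  DemoEntry **pp;
  for(pp=ppHead; *pp!=0; pp=&(*pp)->pNext){
    if( *pp==pTarget ){
      *pp = pTarget->pNext;   /* bypass the target */
      pTarget->pNext = 0;
      return 1;
    }
  }
  return 0;
}
```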

	602

	603 /*

	604 ** If there are currently more than nMaxPage pages allocated, try

	605 ** to recycle pages to reduce the number allocated to nMaxPage.

	606 */

	607 static void pcache1EnforceMaxPage(PCache1 *pCache){

	608 PGroup *pGroup = pCache->pGroup;

	609 PgHdr1 *p;

	610 assert( sqlite3_mutex_held(pGroup->mutex) );

	611 while( pGroup->nCurrentPage>pGroup->nMaxPage

	612 && (p=pGroup->lru.pLruPrev)->isAnchor==0

	613 ){

	614 assert( p->pCache->pGroup==pGroup );

	615 assert( p->isPinned==0 );

	616 pcache1PinPage(p);

	617 pcache1RemoveFromHash(p, 1);

	618 }

	619 if( pCache->nPage==0 && pCache->pBulk ){

	620 sqlite3_free(pCache->pBulk);

	621 pCache->pBulk = pCache->pFree = 0;

	622 }

	623 }

	624

	625 /*

	626 ** Discard all pages from cache pCache with a page number (key value)

	627 ** greater than or equal to iLimit. Any pinned pages that meet these

	628 ** criteria are unpinned before they are discarded.

	629 **

	630 ** The PCache mutex must be held when this function is called.

	631 */

	632 static void pcache1TruncateUnsafe(

	633 PCache1 *pCache, /* The cache to truncate */

	634 unsigned int iLimit /* Drop pages with this pgno or larger */

	635 ){

	636 TESTONLY( int nPage = 0; ) /* To assert pCache->nPage is correct */

	637 unsigned int h, iStop;

	638 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );

	639 assert( pCache->iMaxKey >= iLimit );

	640 assert( pCache->nHash > 0 );

	641 if( pCache->iMaxKey - iLimit < pCache->nHash ){

	642 /* If we are just shaving the last few pages off the end of the

	643 ** cache, then there is no point in scanning the entire hash table.

	644 ** Only scan those hash slots that might contain pages that need to

	645 ** be removed. */

	646 h = iLimit % pCache->nHash;

	647 iStop = pCache->iMaxKey % pCache->nHash;

	648 TESTONLY( nPage = -10; ) /* Disable the pCache->nPage validity check */

	649 }else{

	650 /* This is the general case where many pages are being removed.

	651 ** It is necessary to scan the entire hash table */

	652 h = pCache->nHash/2;

	653 iStop = h - 1;

	654 }

	655 for(;;){

	656 PgHdr1 **pp;

	657 PgHdr1 *pPage;

	658 assert( h<pCache->nHash );

	659 pp = &pCache->apHash[h];

	660 while( (pPage = *pp)!=0 ){

	661 if( pPage->iKey>=iLimit ){

	662 pCache->nPage--;

	663 *pp = pPage->pNext;

	664 if( !pPage->isPinned ) pcache1PinPage(pPage);

	665 pcache1FreePage(pPage);

	666 }else{

	667 pp = &pPage->pNext;

	668 TESTONLY( if( nPage>=0 ) nPage++; )

	669 }

	670 }

	671 if( h==iStop ) break;

	672 h = (h+1) % pCache->nHash;

	673 }

	674 assert( nPage<0 || pCache->nPage==(unsigned)nPage );

	675 }
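The slot selection in pcache1TruncateUnsafe() is the subtle part: when only a few trailing pages are dropped, the scan covers just the slots from `iLimit % nHash` through `iMaxKey % nHash`, wrapping around the end of the table if needed. The sketch below only counts the slots such a scan would visit. `demo_slots_visited` is a hypothetical helper, and it assumes `iMaxKey>=iLimit` and `nHash>=2` (the real table never has fewer than 256 slots).

```c
#include <assert.h>

/* Count the hash slots that a truncate to iLimit would scan. */
static unsigned demo_slots_visited(unsigned iMaxKey, unsigned iLimit, unsigned nHash){
  unsigned h, iStop, n = 0;
  if( iMaxKey - iLimit < nHash ){
    /* Only the slots that can hold keys in iLimit..iMaxKey */
    h = iLimit % nHash;
    iStop = iMaxKey % nHash;
  }else{
    /* General case: start mid-table and go all the way around */
    h = nHash/2;
    iStop = h - 1;
  }
  for(;;){
    n++;
    if( h==iStop ) break;
    h = (h+1) % nHash;   /* wrap past the last slot */
  }
  return n;
}
```

For example, truncating keys 998..1000 in a 256-slot table touches 3 slots instead of all 256.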

	676

	677 /******************************************************************************/

	678 /****** sqlite3_pcache Methods ********************************************/

	679

	680 /*

	681 ** Implementation of the sqlite3_pcache.xInit method.

	682 */

	683 static int pcache1Init(void *NotUsed){

	684 UNUSED_PARAMETER(NotUsed);

	685 assert( pcache1.isInit==0 );

	686 memset(&pcache1, 0, sizeof(pcache1));

	687

	688

	689 /*

	690 ** The pcache1.separateCache variable is true if each PCache has its own

	691 ** private PGroup (mode-1). pcache1.separateCache is false if the single

	692 ** PGroup in pcache1.grp is used for all page caches (mode-2).

	693 **

	694 ** * Always use separate caches (mode-1) if SQLITE_SEPARATE_CACHE_POOLS

	695 **

	696 ** * Always use a unified cache (mode-2) if ENABLE_MEMORY_MANAGEMENT

	697 **

	698 ** * Use a unified cache in single-threaded applications that have

	699 ** configured a start-time buffer for use as page-cache memory using

	700 ** sqlite3_config(SQLITE_CONFIG_PAGECACHE, pBuf, sz, N) with non-NULL

	701 ** pBuf argument.

	702 **

	703 ** * Otherwise use separate caches (mode-1)

	704 */

	705 #ifdef SQLITE_SEPARATE_CACHE_POOLS

	706 pcache1.separateCache = 1;

	707 #elif defined(SQLITE_ENABLE_MEMORY_MANAGEMENT)

	708 pcache1.separateCache = 0;

	709 #elif SQLITE_THREADSAFE

	710 pcache1.separateCache = sqlite3GlobalConfig.pPage==0

	711 || sqlite3GlobalConfig.bCoreMutex>0;

	712 #else

	713 pcache1.separateCache = sqlite3GlobalConfig.pPage==0;

	714 #endif

	715

	716 #if SQLITE_THREADSAFE

	717 if( sqlite3GlobalConfig.bCoreMutex ){

	718 pcache1.grp.mutex = sqlite3MutexAlloc(SQLITE_MUTEX_STATIC_LRU);

	719 pcache1.mutex = sqlite3MutexAlloc(SQLITE_MUTEX_STATIC_PMEM);

	720 }

	721 #endif

	722 if( pcache1.separateCache

	723 && sqlite3GlobalConfig.nPage!=0

	724 && sqlite3GlobalConfig.pPage==0

	725 ){

	726 pcache1.nInitPage = sqlite3GlobalConfig.nPage;

	727 }else{

	728 pcache1.nInitPage = 0;

	729 }

	730 pcache1.grp.mxPinned = 10;

	731 pcache1.isInit = 1;

	732 return SQLITE_OK;

	733 }

	734

	735 /*

	736 ** Implementation of the sqlite3_pcache.xShutdown method.

	737 ** Note that the static mutex allocated in xInit does

	738 ** not need to be freed.

	739 */

	740 static void pcache1Shutdown(void *NotUsed){

	741 UNUSED_PARAMETER(NotUsed);

	742 assert( pcache1.isInit!=0 );

	743 memset(&pcache1, 0, sizeof(pcache1));

	744 }

	745

	746 /* forward declaration */

	747 static void pcache1Destroy(sqlite3_pcache *p);

	748

	749 /*

	750 ** Implementation of the sqlite3_pcache.xCreate method.

	751 **

	752 ** Allocate a new cache.

	753 */

	754 static sqlite3_pcache *pcache1Create(int szPage, int szExtra, int bPurgeable){

	755 PCache1 *pCache; /* The newly created page cache */

	756 PGroup *pGroup; /* The group the new page cache will belong to */

	757 int sz; /* Bytes of memory required to allocate the new cache */

	758

	759 assert( (szPage & (szPage-1))==0 && szPage>=512 && szPage<=65536 );

	760 assert( szExtra < 300 );

	761

	762 sz = sizeof(PCache1) + sizeof(PGroup)*pcache1.separateCache;

	763 pCache = (PCache1 *)sqlite3MallocZero(sz);

	764 if( pCache ){

	765 if( pcache1.separateCache ){

	766 pGroup = (PGroup*)&pCache[1];

	767 pGroup->mxPinned = 10;

	768 }else{

	769 pGroup = &pcache1.grp;

	770 }

	771 if( pGroup->lru.isAnchor==0 ){

	772 pGroup->lru.isAnchor = 1;

	773 pGroup->lru.pLruPrev = pGroup->lru.pLruNext = &pGroup->lru;

	774 }

	775 pCache->pGroup = pGroup;

	776 pCache->szPage = szPage;

	777 pCache->szExtra = szExtra;

	778 pCache->szAlloc = szPage + szExtra + ROUND8(sizeof(PgHdr1));

	779 pCache->bPurgeable = (bPurgeable ? 1 : 0);

	780 pcache1EnterMutex(pGroup);

	781 pcache1ResizeHash(pCache);

	782 if( bPurgeable ){

	783 pCache->nMin = 10;

	784 pGroup->nMinPage += pCache->nMin;

	785 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;

	786 }

	787 pcache1LeaveMutex(pGroup);

	788 if( pCache->nHash==0 ){

	789 pcache1Destroy((sqlite3_pcache*)pCache);

	790 pCache = 0;

	791 }

	792 }

	793 return (sqlite3_pcache *)pCache;

	794 }

	795

	796 /*

	797 ** Implementation of the sqlite3_pcache.xCachesize method.

	798 **

	799 ** Configure the cache_size limit for a cache.

	800 */

	801 static void pcache1Cachesize(sqlite3_pcache *p, int nMax){

	802 PCache1 *pCache = (PCache1 *)p;

	803 if( pCache->bPurgeable ){

	804 PGroup *pGroup = pCache->pGroup;

	805 pcache1EnterMutex(pGroup);

	806 pGroup->nMaxPage += (nMax - pCache->nMax);

	807 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;

	808 pCache->nMax = nMax;

	809 pCache->n90pct = pCache->nMax*9/10;

	810 pcache1EnforceMaxPage(pCache);

	811 pcache1LeaveMutex(pGroup);

	812 }

	813 }
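The group-level bookkeeping that pcache1Create() and pcache1Cachesize() maintain (`nMaxPage`, `nMinPage`, `mxPinned`, `n90pct`) can be checked with a few lines of arithmetic. This is a sketch with hypothetical `Demo*` types, covering purgeable caches only and omitting the mutex and the call to pcache1EnforceMaxPage().

```c
#include <assert.h>

typedef struct DemoGroup {
  unsigned nMaxPage;   /* Sum of nMax over member caches */
  unsigned nMinPage;   /* Sum of nMin over member caches */
  unsigned mxPinned;   /* nMaxPage + 10 - nMinPage */
} DemoGroup;

typedef struct DemoCache {
  DemoGroup *pGroup;
  unsigned nMin, nMax, n90pct;
} DemoCache;

/* Add a new purgeable cache to the group; it reserves 10 pages. */
static void demo_attach(DemoGroup *g, DemoCache *c){
  c->pGroup = g;
  c->nMin = 10;
  c->nMax = 0;
  c->n90pct = 0;
  g->nMinPage += c->nMin;
  g->mxPinned = g->nMaxPage + 10 - g->nMinPage;
}

/* Change one cache's limit and propagate the delta to the group. */
static void demo_set_cachesize(DemoCache *c, unsigned nMax){
  DemoGroup *g = c->pGroup;
  g->nMaxPage += nMax - c->nMax;   /* unsigned wraparound handles shrinking */
  g->mxPinned = g->nMaxPage + 10 - g->nMinPage;
  c->nMax = nMax;
  c->n90pct = c->nMax*9/10;
}
```

Applying the delta rather than recomputing the sum keeps the update constant-time regardless of how many caches share the group.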

	814

	815 /*

	816 ** Implementation of the sqlite3_pcache.xShrink method.

	817 **

	818 ** Free up as much memory as possible.

	819 */

	820 static void pcache1Shrink(sqlite3_pcache *p){

	821 PCache1 *pCache = (PCache1*)p;

	822 if( pCache->bPurgeable ){

	823 PGroup *pGroup = pCache->pGroup;

	824 int savedMaxPage;

	825 pcache1EnterMutex(pGroup);

	826 savedMaxPage = pGroup->nMaxPage;

	827 pGroup->nMaxPage = 0;

	828 pcache1EnforceMaxPage(pCache);

	829 pGroup->nMaxPage = savedMaxPage;

	830 pcache1LeaveMutex(pGroup);

	831 }

	832 }

	833

	834 /*

	835 ** Implementation of the sqlite3_pcache.xPagecount method.

	836 */

	837 static int pcache1Pagecount(sqlite3_pcache *p){

	838 int n;

	839 PCache1 *pCache = (PCache1*)p;

	840 pcache1EnterMutex(pCache->pGroup);

	841 n = pCache->nPage;

	842 pcache1LeaveMutex(pCache->pGroup);

	843 return n;

	844 }

	845

	846

	847 /*

	848 ** Implement steps 3, 4, and 5 of the pcache1Fetch() algorithm described

	849 ** in the header of the pcache1Fetch() procedure.

	850 **

	851 ** These steps are broken out into a separate procedure because they are

	852 ** usually not needed, and by avoiding the stack initialization required

	853 ** for these steps, the main pcache1Fetch() procedure can run faster.

	854 */

	855 static SQLITE_NOINLINE PgHdr1 *pcache1FetchStage2(

	856 PCache1 *pCache,

	857 unsigned int iKey,

	858 int createFlag

	859 ){

	860 unsigned int nPinned;

	861 PGroup *pGroup = pCache->pGroup;

	862 PgHdr1 *pPage = 0;

	863

	864 /* Step 3: Abort if createFlag is 1 but the cache is nearly full */

	865 assert( pCache->nPage >= pCache->nRecyclable );

	866 nPinned = pCache->nPage - pCache->nRecyclable;

	867 assert( pGroup->mxPinned == pGroup->nMaxPage + 10 - pGroup->nMinPage );

	868 assert( pCache->n90pct == pCache->nMax*9/10 );

	869 if( createFlag==1 && (

	870 nPinned>=pGroup->mxPinned

	871 || nPinned>=pCache->n90pct

	872 || (pcache1UnderMemoryPressure(pCache) && pCache->nRecyclable<nPinned)

	873 )){

	874 return 0;

	875 }

	876

	877 if( pCache->nPage>=pCache->nHash ) pcache1ResizeHash(pCache);

	878 assert( pCache->nHash>0 && pCache->apHash );

	879

	880 /* Step 4. Try to recycle a page. */

	881 if( pCache->bPurgeable

	882 && !pGroup->lru.pLruPrev->isAnchor

	883 && ((pCache->nPage+1>=pCache->nMax) || pcache1UnderMemoryPressure(pCache))

	884 ){

	885 PCache1 *pOther;

	886 pPage = pGroup->lru.pLruPrev;

	887 assert( pPage->isPinned==0 );

	888 pcache1RemoveFromHash(pPage, 0);

	889 pcache1PinPage(pPage);

	890 pOther = pPage->pCache;

	891 if( pOther->szAlloc != pCache->szAlloc ){

	892 pcache1FreePage(pPage);

	893 pPage = 0;

	894 }else{

	895 pGroup->nCurrentPage -= (pOther->bPurgeable - pCache->bPurgeable);

	896 }

	897 }

	898

	899 /* Step 5. If a usable page buffer has still not been found,

	900 ** attempt to allocate a new one.

	901 */

	902 if( !pPage ){

	903 pPage = pcache1AllocPage(pCache, createFlag==1);

	904 }

	905

	906 if( pPage ){

	907 unsigned int h = iKey % pCache->nHash;

	908 pCache->nPage++;

	909 pPage->iKey = iKey;

	910 pPage->pNext = pCache->apHash[h];

	911 pPage->pCache = pCache;

	912 pPage->pLruPrev = 0;

	913 pPage->pLruNext = 0;

	914 pPage->isPinned = 1;

	915 *(void **)pPage->page.pExtra = 0;

	916 pCache->apHash[h] = pPage;

	917 if( iKey>pCache->iMaxKey ){

	918 pCache->iMaxKey = iKey;

	919 }

	920 }

	921 return pPage;

	922 }

	923

	924 /*

	925 ** Implementation of the sqlite3_pcache.xFetch method.

	926 **

	927 ** Fetch a page by key value.

	928 **

	929 ** Whether or not a new page may be allocated by this function depends on

	930 ** the value of the createFlag argument. 0 means do not allocate a new

	931 ** page. 1 means allocate a new page if space is easily available. 2

	932 ** means to try really hard to allocate a new page.

	933 **

	934 ** For a non-purgeable cache (a cache used as the storage for an in-memory

	935 ** database) there is really no difference between createFlag 1 and 2. So

	936 ** the calling function (pcache.c) will never have a createFlag of 1 on

	937 ** a non-purgeable cache.

	938 **

	939 ** There are three different approaches to obtaining space for a page,

	940 ** depending on the value of parameter createFlag (which may be 0, 1 or 2).

	941 **

	942 ** 1. Regardless of the value of createFlag, the cache is searched for a

	943 ** copy of the requested page. If one is found, it is returned.

	944 **

	945 ** 2. If createFlag==0 and the page is not already in the cache, NULL is

	946 ** returned.

	947 **

	948 ** 3. If createFlag is 1, and the page is not already in the cache, then

	949 ** return NULL (do not allocate a new page) if any of the following

	950 ** conditions are true:

	951 **

	952 ** (a) the number of pages pinned by the cache is greater than

	953 ** PCache1.nMax, or

	954 **

	955 ** (b) the number of pages pinned by the cache is greater than

	956 ** the sum of nMax for all purgeable caches, less the sum of

	957 ** nMin for all other purgeable caches, or

	958 **

	959 ** 4. If none of the first three conditions apply and the cache is marked

	960 ** as purgeable, and if one of the following is true:

	961 **

	962 ** (a) The number of pages allocated for the cache is already

	963 ** PCache1.nMax, or

	964 **

	965 ** (b) The number of pages allocated for all purgeable caches is

	966 ** already equal to or greater than the sum of nMax for all

	967 ** purgeable caches,

	968 **

	969 ** (c) The system is under memory pressure and wants to avoid

	970 ** unnecessary pages cache entry allocations

	971 **

	972 ** then attempt to recycle a page from the LRU list. If it is the right

	973 ** size, return the recycled buffer. Otherwise, free the buffer and

	974 ** proceed to step 5.

	975 **

	976 ** 5. Otherwise, allocate and return a new page buffer.

	977 **

	978 ** There are two versions of this routine. pcache1FetchWithMutex() is

	979 ** the general case. pcache1FetchNoMutex() is a faster implementation for

	980 ** the common case where pGroup->mutex is NULL. The pcache1Fetch() wrapper

	981 ** invokes the appropriate routine.

	982 */

	983 static PgHdr1 *pcache1FetchNoMutex(

	984 sqlite3_pcache *p,

	985 unsigned int iKey,

	986 int createFlag

	987 ){

	988 PCache1 *pCache = (PCache1 *)p;

	989 PgHdr1 *pPage = 0;

	990

	991 /* Step 1: Search the hash table for an existing entry. */

	992 pPage = pCache->apHash[iKey % pCache->nHash];

	993 while( pPage && pPage->iKey!=iKey ){ pPage = pPage->pNext; }

	994

	995 /* Step 2: If the page was found in the hash table, then return it.

	996 ** If the page was not in the hash table and createFlag is 0, abort.

	997 ** Otherwise (page not in hash and createFlag!=0) continue with

	998 ** subsequent steps to try to create the page. */

	999 if( pPage ){

	1000 if( !pPage->isPinned ){

	1001 return pcache1PinPage(pPage);

	1002 }else{

	1003 return pPage;

	1004 }

	1005 }else if( createFlag ){

	1006 /* Steps 3, 4, and 5 implemented by this subroutine */

	1007 return pcache1FetchStage2(pCache, iKey, createFlag);

	1008 }else{

	1009 return 0;

	1010 }

	1011 }

	1012 #if PCACHE1_MIGHT_USE_GROUP_MUTEX

	1013 static PgHdr1 *pcache1FetchWithMutex(

	1014 sqlite3_pcache *p,

	1015 unsigned int iKey,

	1016 int createFlag

	1017 ){

	1018 PCache1 *pCache = (PCache1 *)p;

	1019 PgHdr1 *pPage;

	1020

	1021 pcache1EnterMutex(pCache->pGroup);

	1022 pPage = pcache1FetchNoMutex(p, iKey, createFlag);

	1023 assert( pPage==0 || pCache->iMaxKey>=iKey );

	1024 pcache1LeaveMutex(pCache->pGroup);

	1025 return pPage;

	1026 }

	1027 #endif

	1028 static sqlite3_pcache_page *pcache1Fetch(

	1029 sqlite3_pcache *p,

	1030 unsigned int iKey,

	1031 int createFlag

	1032 ){

	1033 #if PCACHE1_MIGHT_USE_GROUP_MUTEX || defined(SQLITE_DEBUG)

	1034 PCache1 *pCache = (PCache1 *)p;

	1035 #endif

	1036

	1037 assert( offsetof(PgHdr1,page)==0 );

	1038 assert( pCache->bPurgeable || createFlag!=1 );

	1039 assert( pCache->bPurgeable || pCache->nMin==0 );

	1040 assert( pCache->bPurgeable==0 || pCache->nMin==10 );

	1041 assert( pCache->nMin==0 || pCache->bPurgeable );

	1042 assert( pCache->nHash>0 );

	1043 #if PCACHE1_MIGHT_USE_GROUP_MUTEX

	1044 if( pCache->pGroup->mutex ){

	1045 return (sqlite3_pcache_page*)pcache1FetchWithMutex(p, iKey, createFlag);

	1046 }else

	1047 #endif

	1048 {

	1049 return (sqlite3_pcache_page*)pcache1FetchNoMutex(p, iKey, createFlag);

	1050 }

	1051 }
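
Step 3 of the algorithm (the createFlag==1 early exit in pcache1FetchStage2()) is easiest to reason about as a pure predicate. The following is a hedged sketch: ToyCounters and toyStep3Refuses are hypothetical names, and the memory-pressure state is passed in as a plain flag rather than computed the way pcache1UnderMemoryPressure() does.

```c
#include <assert.h>

/* Hypothetical snapshot of the counters that step 3 consults. */
typedef struct ToyCounters {
  unsigned int nPage;        /* Pages currently in this cache */
  unsigned int nRecyclable;  /* Unpinned pages on the LRU list */
  unsigned int mxPinned;     /* Group-wide pin limit */
  unsigned int n90pct;       /* 90% of this cache's nMax */
  int underMemoryPressure;   /* Assumed input; nonzero means "under pressure" */
} ToyCounters;

/* Return 1 if a createFlag==1 request should give up without allocating. */
static int toyStep3Refuses(const ToyCounters *c, int createFlag){
  unsigned int nPinned = c->nPage - c->nRecyclable;
  if( createFlag!=1 ) return 0;   /* createFlag==2 never bails out here */
  return nPinned>=c->mxPinned
      || nPinned>=c->n90pct
      || (c->underMemoryPressure && c->nRecyclable<nPinned);
}
```

Only a "soft" request (createFlag==1) is ever refused by this check; a createFlag of 2 always proceeds to the recycle and allocate steps.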

	1052

	1053

	1054 /*

	1055 ** Implementation of the sqlite3_pcache.xUnpin method.

	1056 **

	1057 ** Mark a page as unpinned (eligible for asynchronous recycling).

	1058 */

	1059 static void pcache1Unpin(

	1060 sqlite3_pcache *p,

	1061 sqlite3_pcache_page *pPg,

	1062 int reuseUnlikely

	1063 ){

	1064 PCache1 *pCache = (PCache1 *)p;

	1065 PgHdr1 *pPage = (PgHdr1 *)pPg;

	1066 PGroup *pGroup = pCache->pGroup;

	1067

	1068 assert( pPage->pCache==pCache );

	1069 pcache1EnterMutex(pGroup);

	1070

	1071 /* It is an error to call this function if the page is already

	1072 ** part of the PGroup LRU list.

	1073 */

	1074 assert( pPage->pLruPrev==0 && pPage->pLruNext==0 );

	1075 assert( pPage->isPinned==1 );

	1076

	1077 if( reuseUnlikely || pGroup->nCurrentPage>pGroup->nMaxPage ){

	1078 pcache1RemoveFromHash(pPage, 1);

	1079 }else{

	1080 /* Add the page to the PGroup LRU list. */

	1081 PgHdr1 **ppFirst = &pGroup->lru.pLruNext;

	1082 pPage->pLruPrev = &pGroup->lru;

	1083 (pPage->pLruNext = *ppFirst)->pLruPrev = pPage;

	1084 *ppFirst = pPage;

	1085 pCache->nRecyclable++;

	1086 pPage->isPinned = 0;

	1087 }

	1088

	1089 pcache1LeaveMutex(pCache->pGroup);

	1090 }
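
The LRU list that pcache1Unpin() pushes onto is circular and uses an anchor element (lru.isAnchor) so that neither insertion nor removal has to test for NULL neighbours. A minimal standalone sketch of that shape follows; ToyPage and the toyLru* helpers are hypothetical names, and hash-table and counter maintenance are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyPage ToyPage;
struct ToyPage {
  int isAnchor;                  /* True only for the list anchor */
  int id;                        /* Illustrative payload */
  ToyPage *pLruNext, *pLruPrev;
};

/* An empty list is an anchor that points at itself in both directions. */
static void toyLruInit(ToyPage *anchor){
  anchor->isAnchor = 1;
  anchor->id = -1;
  anchor->pLruPrev = anchor->pLruNext = anchor;
}

/* Insert p right after the anchor (most recently unpinned position). */
static void toyLruPush(ToyPage *anchor, ToyPage *p){
  ToyPage **ppFirst = &anchor->pLruNext;
  p->isAnchor = 0;
  p->pLruPrev = anchor;
  (p->pLruNext = *ppFirst)->pLruPrev = p;
  *ppFirst = p;
}

/* Remove and return the oldest entry (anchor->pLruPrev), or NULL if empty. */
static ToyPage *toyLruPopOldest(ToyPage *anchor){
  ToyPage *p = anchor->pLruPrev;
  if( p->isAnchor ) return NULL;
  p->pLruPrev->pLruNext = p->pLruNext;
  p->pLruNext->pLruPrev = p->pLruPrev;
  p->pLruNext = p->pLruPrev = NULL;
  return p;
}
```

Recycling in step 4 takes its candidate from the same end that toyLruPopOldest() reads, which is why the listing tests pGroup->lru.pLruPrev->isAnchor to detect an empty list.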

	1091

	1092 /*

	1093 ** Implementation of the sqlite3_pcache.xRekey method.

	1094 */

	1095 static void pcache1Rekey(

	1096 sqlite3_pcache *p,

	1097 sqlite3_pcache_page *pPg,

	1098 unsigned int iOld,

	1099 unsigned int iNew

	1100 ){

	1101 PCache1 *pCache = (PCache1 *)p;

	1102 PgHdr1 *pPage = (PgHdr1 *)pPg;

	1103 PgHdr1 **pp;

	1104 unsigned int h;

	1105 assert( pPage->iKey==iOld );

	1106 assert( pPage->pCache==pCache );

	1107

	1108 pcache1EnterMutex(pCache->pGroup);

	1109

	1110 h = iOld%pCache->nHash;

	1111 pp = &pCache->apHash[h];

	1112 while( (*pp)!=pPage ){

	1113 pp = &(*pp)->pNext;

	1114 }

	1115 *pp = pPage->pNext;

	1116

	1117 h = iNew%pCache->nHash;

	1118 pPage->iKey = iNew;

	1119 pPage->pNext = pCache->apHash[h];

	1120 pCache->apHash[h] = pPage;

	1121 if( iNew>pCache->iMaxKey ){

	1122 pCache->iMaxKey = iNew;

	1123 }

	1124

	1125 pcache1LeaveMutex(pCache->pGroup);

	1126 }

	1127

	1128 /*

	1129 ** Implementation of the sqlite3_pcache.xTruncate method.

	1130 **

	1131 ** Discard all unpinned pages in the cache with a page number equal to

	1132 ** or greater than parameter iLimit. Any pinned pages with a page number

	1133 ** equal to or greater than iLimit are implicitly unpinned.

	1134 */

	1135 static void pcache1Truncate(sqlite3_pcache *p, unsigned int iLimit){

	1136 PCache1 *pCache = (PCache1 *)p;

	1137 pcache1EnterMutex(pCache->pGroup);

	1138 if( iLimit<=pCache->iMaxKey ){

	1139 pcache1TruncateUnsafe(pCache, iLimit);

	1140 pCache->iMaxKey = iLimit-1;

	1141 }

	1142 pcache1LeaveMutex(pCache->pGroup);

	1143 }

	1144

	1145 /*

	1146 ** Implementation of the sqlite3_pcache.xDestroy method.

	1147 **

	1148 ** Destroy a cache allocated using pcache1Create().

	1149 */

	1150 static void pcache1Destroy(sqlite3_pcache *p){

	1151 PCache1 *pCache = (PCache1 *)p;

	1152 PGroup *pGroup = pCache->pGroup;

	1153 assert( pCache->bPurgeable || (pCache->nMax==0 && pCache->nMin==0) );

	1154 pcache1EnterMutex(pGroup);

	1155 if( pCache->nPage ) pcache1TruncateUnsafe(pCache, 0);

	1156 assert( pGroup->nMaxPage >= pCache->nMax );

	1157 pGroup->nMaxPage -= pCache->nMax;

	1158 assert( pGroup->nMinPage >= pCache->nMin );

	1159 pGroup->nMinPage -= pCache->nMin;

	1160 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;

	1161 pcache1EnforceMaxPage(pCache);

	1162 pcache1LeaveMutex(pGroup);

	1163 sqlite3_free(pCache->pBulk);

	1164 sqlite3_free(pCache->apHash);

	1165 sqlite3_free(pCache);

	1166 }

	1167

	1168 /*

	1169 ** This function is called during initialization (sqlite3_initialize()) to

	1170 ** install the default pluggable cache module, assuming the user has not

	1171 ** already provided an alternative.

	1172 */

	1173 SQLITE_PRIVATE void sqlite3PCacheSetDefault(void){

	1174 static const sqlite3_pcache_methods2 defaultMethods = {

	1175 1, /* iVersion */

	1176 0, /* pArg */

	1177 pcache1Init, /* xInit */

	1178 pcache1Shutdown, /* xShutdown */

	1179 pcache1Create, /* xCreate */

	1180 pcache1Cachesize, /* xCachesize */

	1181 pcache1Pagecount, /* xPagecount */

	1182 pcache1Fetch, /* xFetch */

	1183 pcache1Unpin, /* xUnpin */

	1184 pcache1Rekey, /* xRekey */

	1185 pcache1Truncate, /* xTruncate */

	1186 pcache1Destroy, /* xDestroy */

	1187 pcache1Shrink /* xShrink */

	1188 };

	1189 sqlite3_config(SQLITE_CONFIG_PCACHE2, &defaultMethods);

	1190 }

	1191

	1192 /*

	1193 ** Return the size of the header on each page of this PCACHE implementation.

	1194 */

	1195 SQLITE_PRIVATE int sqlite3HeaderSizePcache1(void){ return ROUND8(sizeof(PgHdr1)) ; }

	1196

	1197 /*

	1198 ** Return the global mutex used by this PCACHE implementation. The

	1199 ** sqlite3_status() routine needs access to this mutex.

	1200 */

	1201 SQLITE_PRIVATE sqlite3_mutex *sqlite3Pcache1Mutex(void){

	1202 return pcache1.mutex;

	1203 }

	1204

	1205 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT

	1206 /*

	1207 ** This function is called to free superfluous dynamically allocated memory

	1208 ** held by the pager system. Memory in use by any SQLite pager allocated

	1209 ** by the current thread may be sqlite3_free()ed.

	1210 **

	1211 ** nReq is the number of bytes of memory required. Once this much has

	1212 ** been released, the function returns. The return value is the total number

	1213 ** of bytes of memory released.

	1214 */

	1215 SQLITE_PRIVATE int sqlite3PcacheReleaseMemory(int nReq){

	1216 int nFree = 0;

	1217 assert( sqlite3_mutex_notheld(pcache1.grp.mutex) );

	1218 assert( sqlite3_mutex_notheld(pcache1.mutex) );

	1219 if( sqlite3GlobalConfig.nPage==0 ){

	1220 PgHdr1 *p;

	1221 pcache1EnterMutex(&pcache1.grp);

	1222 while( (nReq<0 || nFree<nReq)

	1223 && (p=pcache1.grp.lru.pLruPrev)!=0

	1224 && p->isAnchor==0

	1225 ){

	1226 nFree += pcache1MemSize(p->page.pBuf);

	1227 #ifdef SQLITE_PCACHE_SEPARATE_HEADER

	1228 nFree += sqlite3MemSize(p);

	1229 #endif

	1230 assert( p->isPinned==0 );

	1231 pcache1PinPage(p);

	1232 pcache1RemoveFromHash(p, 1);

	1233 }

	1234 pcache1LeaveMutex(&pcache1.grp);

	1235 }

	1236 return nFree;

	1237 }

	1238 #endif /* SQLITE_ENABLE_MEMORY_MANAGEMENT */

	1239

	1240 #ifdef SQLITE_TEST

	1241 /*

	1242 ** This function is used by test procedures to inspect the internal state

	1243 ** of the global cache.

	1244 */

	1245 SQLITE_PRIVATE void sqlite3PcacheStats(

	1246 int *pnCurrent, /* OUT: Total number of pages cached */

	1247 int *pnMax, /* OUT: Global maximum cache size */

	1248 int *pnMin, /* OUT: Sum of PCache1.nMin for purgeable caches */

	1249 int *pnRecyclable /* OUT: Total number of pages available for recycling */

	1250 ){

	1251 PgHdr1 *p;

	1252 int nRecyclable = 0;

	1253 for(p=pcache1.grp.lru.pLruNext; p && !p->isAnchor; p=p->pLruNext){

	1254 assert( p->isPinned==0 );

	1255 nRecyclable++;

	1256 }

	1257 *pnCurrent = pcache1.grp.nCurrentPage;

	1258 *pnMax = (int)pcache1.grp.nMaxPage;

	1259 *pnMin = (int)pcache1.grp.nMinPage;

	1260 *pnRecyclable = nRecyclable;

	1261 }

	1262 #endif

	1263

	1264 /************ End of pcache1.c *******************************************/

	1265 /************ Begin file rowset.c ****************************************/

	1266 /*

	1267 ** 2008 December 3

	1268 **

	1269 ** The author disclaims copyright to this source code. In place of

	1270 ** a legal notice, here is a blessing:

	1271 **

	1272 ** May you do good and not evil.

	1273 ** May you find forgiveness for yourself and forgive others.

	1274 ** May you share freely, never taking more than you give.

	1275 **

	1276 *************************************************************************

	1277 **

	1278 ** This module implements an object we call a "RowSet".

	1279 **

	1280 ** The RowSet object is a collection of rowids. Rowids

	1281 ** are inserted into the RowSet in an arbitrary order. Inserts

	1282 ** can be intermixed with tests to see if a given rowid has been

	1283 ** previously inserted into the RowSet.

	1284 **

	1285 ** After all inserts are finished, it is possible to extract the

	1286 ** elements of the RowSet in sorted order. Once this extraction

	1287 ** process has started, no new elements may be inserted.

	1288 **

	1289 ** Hence, the primitive operations for a RowSet are:

	1290 **

	1291 ** CREATE

	1292 ** INSERT

	1293 ** TEST

	1294 ** SMALLEST

	1295 ** DESTROY

	1296 **

	1297 ** The CREATE and DESTROY primitives are the constructor and destructor,

	1298 ** obviously. The INSERT primitive adds a new element to the RowSet.

	1299 ** TEST checks to see if an element is already in the RowSet. SMALLEST

	1300 ** extracts the least value from the RowSet.

	1301 **

	1302 ** The INSERT primitive might allocate additional memory. Memory is

	1303 ** allocated in chunks so most INSERTs do no allocation. There is an

	1304 ** upper bound on the size of allocated memory. No memory is freed

	1305 ** until DESTROY.

	1306 **

	1307 ** The TEST primitive includes a "batch" number. The TEST primitive

	1308 ** will only see elements that were inserted before the last change

	1309 ** in the batch number. In other words, if an INSERT occurs between

	1310 ** two TESTs where the TESTs have the same batch number, then the

	1311 ** value added by the INSERT will not be visible to the second TEST.

	1312 ** The initial batch number is zero, so if the very first TEST contains

	1313 ** a non-zero batch number, it will see all prior INSERTs.

	1314 **

	1315 ** No INSERTs may occur after a SMALLEST. An assertion will fail if

	1316 ** that is attempted.

	1317 **

	1318 ** The cost of an INSERT is roughly constant. (Sometimes new memory

	1319 ** has to be allocated on an INSERT.) The cost of a TEST with a new

	1320 ** batch number is O(NlogN) where N is the number of elements in the RowSet.

	1321 ** The cost of a TEST using the same batch number is O(logN). The cost

	1322 ** of the first SMALLEST is O(NlogN). Second and subsequent SMALLEST

	1323 ** primitives are constant time. The cost of DESTROY is O(N).

	1324 **

	1325 ** TEST and SMALLEST may not be used by the same RowSet. This used to

	1326 ** be possible, but the feature was not used, so it was removed in order

	1327 ** to simplify the code.

	1328 */

	1329 /* #include "sqliteInt.h" */

	1330

	1331

	1332 /*

	1333 ** Target size for allocation chunks.

	1334 */

	1335 #define ROWSET_ALLOCATION_SIZE 1024

	1336

	1337 /*

	1338 ** The number of rowset entries per allocation chunk.

	1339 */

	1340 #define ROWSET_ENTRY_PER_CHUNK \

	1341 ((ROWSET_ALLOCATION_SIZE-8)/sizeof(struct RowSetEntry))

	1342

	1343 /*

	1344 ** Each entry in a RowSet is an instance of the following object.

	1345 **

	1346 ** This same object is reused to store a linked list of trees of RowSetEntry

	1347 ** objects. In that alternative use, pRight points to the next entry

	1348 ** in the list, pLeft points to the tree, and v is unused. The

	1349 ** RowSet.pForest value points to the head of this forest list.

	1350 */

	1351 struct RowSetEntry {

	1352 i64 v; /* ROWID value for this entry */

	1353 struct RowSetEntry *pRight; /* Right subtree (larger entries) or list */

	1354 struct RowSetEntry *pLeft; /* Left subtree (smaller entries) */

	1355 };

	1356

	1357 /*

	1358 ** RowSetEntry objects are allocated in large chunks (instances of the

	1359 ** following structure) to reduce memory allocation overhead. The

	1360 ** chunks are kept on a linked list so that they can be deallocated

	1361 ** when the RowSet is destroyed.

	1362 */

	1363 struct RowSetChunk {

	1364 struct RowSetChunk *pNextChunk; /* Next chunk on list of them all */

	1365 struct RowSetEntry aEntry[ROWSET_ENTRY_PER_CHUNK]; /* Allocated entries */

	1366 };

	1367

	1368 /*

	1369 ** A RowSet is an instance of the following structure.

	1370 **

	1371 ** A typedef of this structure is found in sqliteInt.h.

	1372 */

	1373 struct RowSet {

	1374 struct RowSetChunk *pChunk; /* List of all chunk allocations */

	1375 sqlite3 *db; /* The database connection */

	1376 struct RowSetEntry *pEntry; /* List of entries using pRight */

	1377 struct RowSetEntry *pLast; /* Last entry on the pEntry list */

	1378 struct RowSetEntry *pFresh; /* Source of new entry objects */

	1379 struct RowSetEntry *pForest; /* List of binary trees of entries */

	1380 u16 nFresh; /* Number of objects on pFresh */

	1381 u16 rsFlags; /* Various flags */

	1382 int iBatch; /* Current insert batch */

	1383 };

	1384

	1385 /*

	1386 ** Allowed values for RowSet.rsFlags

	1387 */

	1388 #define ROWSET_SORTED 0x01 /* True if RowSet.pEntry is sorted */

	1389 #define ROWSET_NEXT 0x02 /* True if sqlite3RowSetNext() has been called */

	1390

	1391 /*

	1392 ** Turn bulk memory into a RowSet object. N bytes of memory

	1393 ** are available at pSpace. The db pointer is used as a memory context

	1394 ** for any subsequent allocations that need to occur.

	1395 ** Return a pointer to the new RowSet object.

	1396 **

	1397 ** It must be the case that N is sufficient to make a Rowset. If not

	1398 ** an assertion fault occurs.

	1399 **

	1400 ** If N is larger than the minimum, use the surplus as an initial

	1401 ** allocation of entries available to be filled.

	1402 */

	1403 SQLITE_PRIVATE RowSet *sqlite3RowSetInit(sqlite3 *db, void *pSpace, unsigned int N){

	1404 RowSet *p;

	1405 assert( N >= ROUND8(sizeof(*p)) );

	1406 p = pSpace;

	1407 p->pChunk = 0;

	1408 p->db = db;

	1409 p->pEntry = 0;

	1410 p->pLast = 0;

	1411 p->pForest = 0;

	1412 p->pFresh = (struct RowSetEntry*)(ROUND8(sizeof(*p)) + (char*)p);

	1413 p->nFresh = (u16)((N - ROUND8(sizeof(*p)))/sizeof(struct RowSetEntry));

	1414 p->rsFlags = ROWSET_SORTED;

	1415 p->iBatch = 0;

	1416 return p;

	1417 }

	1418

	1419 /*

	1420 ** Deallocate all chunks from a RowSet. This frees all memory that

	1421 ** the RowSet has allocated over its lifetime. This routine is

	1422 ** the destructor for the RowSet.

	1423 */

	1424 SQLITE_PRIVATE void sqlite3RowSetClear(RowSet *p){

	1425 struct RowSetChunk *pChunk, *pNextChunk;

	1426 for(pChunk=p->pChunk; pChunk; pChunk = pNextChunk){

	1427 pNextChunk = pChunk->pNextChunk;

	1428 sqlite3DbFree(p->db, pChunk);

	1429 }

	1430 p->pChunk = 0;

	1431 p->nFresh = 0;

	1432 p->pEntry = 0;

	1433 p->pLast = 0;

	1434 p->pForest = 0;

	1435 p->rsFlags = ROWSET_SORTED;

	1436 }

	1437

	1438 /*

	1439 ** Allocate a new RowSetEntry object that is associated with the

	1440 ** given RowSet. Return a pointer to the new and completely uninitialized

	1441 ** object.

	1442 **

	1443 ** In an OOM situation, the RowSet.db->mallocFailed flag is set and this

	1444 ** routine returns NULL.

	1445 */

	1446 static struct RowSetEntry *rowSetEntryAlloc(RowSet *p){

	1447 assert( p!=0 );

	1448 if( p->nFresh==0 ){ /*OPTIMIZATION-IF-FALSE*/

	1449 /* We could allocate a fresh RowSetEntry each time one is needed, but it

	1450 ** is more efficient to pull a preallocated entry from the pool */

	1451 struct RowSetChunk *pNew;

	1452 pNew = sqlite3DbMallocRawNN(p->db, sizeof(*pNew));

	1453 if( pNew==0 ){

	1454 return 0;

	1455 }

	1456 pNew->pNextChunk = p->pChunk;

	1457 p->pChunk = pNew;

	1458 p->pFresh = pNew->aEntry;

	1459 p->nFresh = ROWSET_ENTRY_PER_CHUNK;

	1460 }

	1461 p->nFresh--;

	1462 return p->pFresh++;

	1463 }

	1464

	1465 /*

	1466 ** Insert a new value into a RowSet.

	1467 **

	1468 ** The mallocFailed flag of the database connection is set if a

	1469 ** memory allocation fails.

	1470 */

	1471 SQLITE_PRIVATE void sqlite3RowSetInsert(RowSet *p, i64 rowid){

	1472 struct RowSetEntry *pEntry; /* The new entry */

	1473 struct RowSetEntry *pLast; /* The last prior entry */

	1474

	1475 /* This routine is never called after sqlite3RowSetNext() */

	1476 assert( p!=0 && (p->rsFlags & ROWSET_NEXT)==0 );

	1477

	1478 pEntry = rowSetEntryAlloc(p);

	1479 if( pEntry==0 ) return;

	1480 pEntry->v = rowid;

	1481 pEntry->pRight = 0;

	1482 pLast = p->pLast;

	1483 if( pLast ){

	1484 if( rowid<=pLast->v ){ /*OPTIMIZATION-IF-FALSE*/

	1485 /* Avoid unnecessary sorts by preserving the ROWSET_SORTED flags

	1486 ** where possible */

	1487 p->rsFlags &= ~ROWSET_SORTED;

	1488 }

	1489 pLast->pRight = pEntry;

	1490 }else{

	1491 p->pEntry = pEntry;

	1492 }

	1493 p->pLast = pEntry;

	1494 }

	1495

	1496 /*

	1497 ** Merge two lists of RowSetEntry objects. Remove duplicates.

	1498 **

	1499 ** The input lists are connected via pRight pointers and are

	1500 ** assumed to each already be in sorted order.

	1501 */

	1502 static struct RowSetEntry *rowSetEntryMerge(

	1503 struct RowSetEntry *pA, /* First sorted list to be merged */

	1504 struct RowSetEntry *pB /* Second sorted list to be merged */

	1505 ){

	1506 struct RowSetEntry head;

	1507 struct RowSetEntry *pTail;

	1508

	1509 pTail = &head;

	1510 assert( pA!=0 && pB!=0 );

	1511 for(;;){

	1512 assert( pA->pRight==0 || pA->v<=pA->pRight->v );

	1513 assert( pB->pRight==0 || pB->v<=pB->pRight->v );

	1514 if( pA->v<=pB->v ){

	1515 if( pA->v<pB->v ) pTail = pTail->pRight = pA;

	1516 pA = pA->pRight;

	1517 if( pA==0 ){

	1518 pTail->pRight = pB;

	1519 break;

	1520 }

	1521 }else{

	1522 pTail = pTail->pRight = pB;

	1523 pB = pB->pRight;

	1524 if( pB==0 ){

	1525 pTail->pRight = pA;

	1526 break;

	1527 }

	1528 }

	1529 }

	1530 return head.pRight;

	1531 }
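
The duplicate-dropping merge can be tried outside SQLite with a small standalone version. This is a sketch under assumptions: ToyEntry and toyMerge are hypothetical names, `long long` stands in for i64, and the loop is restructured slightly so that NULL inputs are tolerated (the routine above asserts both lists are non-empty).

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyEntry ToyEntry;
struct ToyEntry {
  long long v;        /* Stand-in for the i64 rowid */
  ToyEntry *pRight;   /* Next entry in the sorted list */
};

/* Merge two ascending lists linked by pRight. When both heads carry the
** same value, the entry from pA is skipped so only one copy survives. */
static ToyEntry *toyMerge(ToyEntry *pA, ToyEntry *pB){
  ToyEntry head;
  ToyEntry *pTail = &head;
  head.pRight = NULL;
  while( pA && pB ){
    if( pA->v<=pB->v ){
      if( pA->v<pB->v ) pTail = pTail->pRight = pA;
      pA = pA->pRight;
    }else{
      pTail = pTail->pRight = pB;
      pB = pB->pRight;
    }
  }
  pTail->pRight = pA ? pA : pB;   /* Append whichever list remains */
  return head.pRight;
}
```

Merging the lists 1,3,5 and 3,4 with this sketch yields 1,3,4,5: the shared value 3 appears once.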

	1532

	1533 /*

	1534 ** Sort all elements on the list of RowSetEntry objects into order of

	1535 ** increasing v.

	1536 */

	1537 static struct RowSetEntry *rowSetEntrySort(struct RowSetEntry *pIn){

	1538 unsigned int i;

	1539 struct RowSetEntry *pNext, *aBucket[40];

	1540

	1541 memset(aBucket, 0, sizeof(aBucket));

	1542 while( pIn ){

	1543 pNext = pIn->pRight;

	1544 pIn->pRight = 0;

	1545 for(i=0; aBucket[i]; i++){

	1546 pIn = rowSetEntryMerge(aBucket[i], pIn);

	1547 aBucket[i] = 0;

	1548 }

	1549 aBucket[i] = pIn;

	1550 pIn = pNext;

	1551 }

	1552 pIn = aBucket[0];

	1553 for(i=1; i<sizeof(aBucket)/sizeof(aBucket[0]); i++){

	1554 if( aBucket[i]==0 ) continue;

	1555 pIn = pIn ? rowSetEntryMerge(pIn, aBucket[i]) : aBucket[i];

	1556 }

	1557 return pIn;

	1558 }
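
The bucket scheme in rowSetEntrySort() is a bottom-up merge sort: bucket i holds either nothing or a sorted run built from 2^i inputs, and each new element "carries" upward through occupied buckets like binary addition. The standalone sketch below mirrors that structure; ToyNode, toyMergeLists and toyBucketSort are hypothetical names, and `long long` stands in for i64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyNode ToyNode;
struct ToyNode { long long v; ToyNode *pRight; };

/* Merge two ascending lists, keeping one copy of values found in both. */
static ToyNode *toyMergeLists(ToyNode *pA, ToyNode *pB){
  ToyNode head;
  ToyNode *pTail = &head;
  head.pRight = NULL;
  while( pA && pB ){
    if( pA->v<=pB->v ){
      if( pA->v<pB->v ) pTail = pTail->pRight = pA;
      pA = pA->pRight;
    }else{
      pTail = pTail->pRight = pB;
      pB = pB->pRight;
    }
  }
  pTail->pRight = pA ? pA : pB;
  return head.pRight;
}

/* Sort a pRight-linked list into ascending order using merge buckets. */
static ToyNode *toyBucketSort(ToyNode *pIn){
  unsigned int i;
  ToyNode *pNext, *aBucket[40];
  memset(aBucket, 0, sizeof(aBucket));
  while( pIn ){
    pNext = pIn->pRight;
    pIn->pRight = NULL;
    /* Carry: merge with each occupied bucket until an empty one is found */
    for(i=0; aBucket[i]; i++){
      pIn = toyMergeLists(aBucket[i], pIn);
      aBucket[i] = NULL;
    }
    aBucket[i] = pIn;
    pIn = pNext;
  }
  /* Fold the surviving runs together */
  pIn = NULL;
  for(i=0; i<sizeof(aBucket)/sizeof(aBucket[0]); i++){
    if( aBucket[i] ) pIn = pIn ? toyMergeLists(pIn, aBucket[i]) : aBucket[i];
  }
  return pIn;
}
```

Because every run starts as a single element and the merge drops values present in both inputs, sorting 4,2,4,1 with this sketch returns 1,2,4.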

	1559

	1560

	1561 /*

	1562 ** The input, pIn, is a binary tree (or subtree) of RowSetEntry objects.

	1563 ** Convert this tree into a linked list connected by the pRight pointers

	1564 ** and return pointers to the first and last elements of the new list.

	1565 */

	1566 static void rowSetTreeToList(

	1567 struct RowSetEntry *pIn, /* Root of the input tree */

	1568 struct RowSetEntry **ppFirst, /* Write head of the output list here */

	1569 struct RowSetEntry **ppLast /* Write tail of the output list here */

	1570 ){

	1571 assert( pIn!=0 );

	1572 if( pIn->pLeft ){

	1573 struct RowSetEntry *p;

	1574 rowSetTreeToList(pIn->pLeft, ppFirst, &p);

	1575 p->pRight = pIn;

	1576 }else{

	1577 *ppFirst = pIn;

	1578 }

	1579 if( pIn->pRight ){

	1580 rowSetTreeToList(pIn->pRight, &pIn->pRight, ppLast);

	1581 }else{

	1582 *ppLast = pIn;

	1583 }

	1584 assert( (*ppLast)->pRight==0 );

	1585 }

	1586

	1587

	1588 /*

	1589 ** Convert a sorted list of elements (connected by pRight) into a binary

	1590 ** tree with depth of iDepth. A depth of 1 means the tree contains a single

	1591 ** node taken from the head of *ppList. A depth of 2 means a tree with

	1592 ** three nodes. And so forth.

	1593 **

	1594 ** Use as many entries from the input list as required and update the

	1595 ** *ppList to point to the unused elements of the list. If the input

	1596 ** list contains too few elements, then construct an incomplete tree

	1597 ** and leave *ppList set to NULL.

	1598 **

	1599 ** Return a pointer to the root of the constructed binary tree.

	1600 */

	1601 static struct RowSetEntry *rowSetNDeepTree(

	1602 struct RowSetEntry **ppList,

	1603 int iDepth

	1604 ){

	1605 struct RowSetEntry *p; /* Root of the new tree */

	1606 struct RowSetEntry *pLeft; /* Left subtree */

	1607 if( *ppList==0 ){ /*OPTIMIZATION-IF-TRUE*/

	1608 /* Prevent unnecessary deep recursion when we run out of entries */

	1609 return 0;

	1610 }

	1611 if( iDepth>1 ){ /*OPTIMIZATION-IF-TRUE*/

	1612 /* This branch causes a balanced tree to be generated. A valid tree

	1613 ** is still generated without this branch, but the tree is wildly

	1614 ** unbalanced and inefficient. */

	1615 pLeft = rowSetNDeepTree(ppList, iDepth-1);

	1616 p = *ppList;

	1617 if( p==0 ){ /*OPTIMIZATION-IF-FALSE*/

	1618 /* It is safe to always return here, but the resulting tree

	1619 ** would be unbalanced */

	1620 return pLeft;

	1621 }

	1622 p->pLeft = pLeft;

	1623 *ppList = p->pRight;

	1624 p->pRight = rowSetNDeepTree(ppList, iDepth-1);

	1625 }else{

	1626 p = *ppList;

	1627 *ppList = p->pRight;

	1628 p->pLeft = p->pRight = 0;

	1629 }

	1630 return p;

	1631 }

	1632

	1633 /*

	1634 ** Convert a sorted list of elements into a binary tree. Make the tree

	1635 ** as deep as it needs to be in order to contain the entire list.

	1636 */

	1637 static struct RowSetEntry *rowSetListToTree(struct RowSetEntry *pList){

	1638 int iDepth; /* Depth of the tree so far */

	1639 struct RowSetEntry *p; /* Current tree root */

	1640 struct RowSetEntry *pLeft; /* Left subtree */

	1641

	1642 assert( pList!=0 );

	1643 p = pList;

	1644 pList = p->pRight;

	1645 p->pLeft = p->pRight = 0;

	1646 for(iDepth=1; pList; iDepth++){

	1647 pLeft = p;

	1648 p = pList;

	1649 pList = p->pRight;

	1650 p->pLeft = pLeft;

	1651 p->pRight = rowSetNDeepTree(&pList, iDepth);

	1652 }

	1653 return p;

	1654 }

	1655

	1656 /*

	1657 ** Extract the smallest element from the RowSet.

	1658 ** Write the element into *pRowid. Return 1 on success. Return

	1659 ** 0 if the RowSet is already empty.

	1660 **

	1661 ** After this routine has been called, the sqlite3RowSetInsert()

	1662 ** routine may not be called again.

	1663 **

	1664 ** This routine may not be called after sqlite3RowSetTest() has

	1665 ** been used. Older versions of RowSet allowed that, but as the

	1666 ** capability was not used by the code generator, it was removed

	1667 ** for code economy.

	1668 */

	1669 SQLITE_PRIVATE int sqlite3RowSetNext(RowSet *p, i64 *pRowid){

	1670 assert( p!=0 );

	1671 assert( p->pForest==0 ); /* Cannot be used with sqlite3RowSetTest() */

	1672

	1673 /* Merge the forest into a single sorted list on first call */

	1674 if( (p->rsFlags & ROWSET_NEXT)==0 ){ /*OPTIMIZATION-IF-FALSE*/

	1675 if( (p->rsFlags & ROWSET_SORTED)==0 ){ /*OPTIMIZATION-IF-FALSE*/

	1676 p->pEntry = rowSetEntrySort(p->pEntry);

	1677 }

	1678 p->rsFlags |= ROWSET_SORTED|ROWSET_NEXT;

	1679 }

	1680

	1681 /* Return the next entry on the list */

	1682 if( p->pEntry ){

	1683 *pRowid = p->pEntry->v;

	1684 p->pEntry = p->pEntry->pRight;

	1685 if( p->pEntry==0 ){ /*OPTIMIZATION-IF-TRUE*/

	1686 /* Free memory immediately, rather than waiting on sqlite3_finalize() */

	1687 sqlite3RowSetClear(p);

	1688 }

	1689 return 1;

	1690 }else{

	1691 return 0;

	1692 }

	1693 }

	1694

	1695 /*

	1696 ** Check to see if element iRowid was inserted into the rowset as

	1697 ** part of any insert batch prior to iBatch. Return 1 or 0.

	1698 **

	1699 ** If this is the first test of a new batch and if there exist entries

	1700 ** on pRowSet->pEntry, then sort those entries into the forest at

	1701 ** pRowSet->pForest so that they can be tested.

	1702 */

	1703 SQLITE_PRIVATE int sqlite3RowSetTest(RowSet *pRowSet, int iBatch, sqlite3_int64 iRowid){

	1704 struct RowSetEntry *p, *pTree;

	1705

	1706 /* This routine is never called after sqlite3RowSetNext() */

	1707 assert( pRowSet!=0 && (pRowSet->rsFlags & ROWSET_NEXT)==0 );

	1708

	1709 /* Sort entries into the forest on the first test of a new batch.

	1710 ** To save unnecessary work, only do this when the batch number changes.

	1711 */

	1712 if( iBatch!=pRowSet->iBatch ){ /*OPTIMIZATION-IF-FALSE*/

	1713 p = pRowSet->pEntry;

	1714 if( p ){

	1715 struct RowSetEntry **ppPrevTree = &pRowSet->pForest;

	1716 if( (pRowSet->rsFlags & ROWSET_SORTED)==0 ){ /*OPTIMIZATION-IF-FALSE*/

	1717 /* Only sort the current set of entries if they need it */

	1718 p = rowSetEntrySort(p);

	1719 }

	1720 for(pTree = pRowSet->pForest; pTree; pTree=pTree->pRight){

	1721 ppPrevTree = &pTree->pRight;

	1722 if( pTree->pLeft==0 ){

	1723 pTree->pLeft = rowSetListToTree(p);

	1724 break;

	1725 }else{

	1726 struct RowSetEntry *pAux, *pTail;

	1727 rowSetTreeToList(pTree->pLeft, &pAux, &pTail);

	1728 pTree->pLeft = 0;

	1729 p = rowSetEntryMerge(pAux, p);

	1730 }

	1731 }

	1732 if( pTree==0 ){

	1733 *ppPrevTree = pTree = rowSetEntryAlloc(pRowSet);

	1734 if( pTree ){

	1735 pTree->v = 0;

	1736 pTree->pRight = 0;

	1737 pTree->pLeft = rowSetListToTree(p);

	1738 }

	1739 }

	1740 pRowSet->pEntry = 0;

	1741 pRowSet->pLast = 0;

	1742 pRowSet->rsFlags |= ROWSET_SORTED;

	1743 }

	1744 pRowSet->iBatch = iBatch;

	1745 }

	1746

	1747 /* Test to see if the iRowid value appears anywhere in the forest.

	1748 ** Return 1 if it does and 0 if not.

	1749 */

	1750 for(pTree = pRowSet->pForest; pTree; pTree=pTree->pRight){

	1751 p = pTree->pLeft;

	1752 while( p ){

	1753 if( p->v<iRowid ){

	1754 p = p->pRight;

	1755 }else if( p->v>iRowid ){

	1756 p = p->pLeft;

	1757 }else{

	1758 return 1;

	1759 }

	1760 }

	1761 }

	1762 return 0;

	1763 }
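
The lookup loop at the end of sqlite3RowSetTest() can be sketched in isolation. The following is a minimal illustration, not the SQLite code itself: `Entry`, `forestContains()` and `demoForest()` are hypothetical names, and the forest is built by hand rather than by rowSetListToTree(). It assumes, as in the routine above, that each forest node is a header whose pLeft points at the root of a binary search tree and whose pRight points at the next header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct RowSetEntry. Inside a tree, pLeft and
** pRight are child links. At the forest level, pLeft is the root of one
** tree and pRight is the next tree header. */
typedef struct Entry Entry;
struct Entry {
  long long v;     /* Rowid value (unused in forest headers) */
  Entry *pLeft;    /* Left subtree, or tree root for a header */
  Entry *pRight;   /* Right subtree, or next header in the forest */
};

/* Return 1 if iRowid appears in any tree of the forest, 0 otherwise. */
int forestContains(const Entry *pForest, long long iRowid){
  const Entry *pTree;
  for(pTree=pForest; pTree; pTree=pTree->pRight){
    const Entry *p = pTree->pLeft;
    while( p ){
      if( p->v<iRowid ){
        p = p->pRight;          /* Target is larger: descend right */
      }else if( p->v>iRowid ){
        p = p->pLeft;           /* Target is smaller: descend left */
      }else{
        return 1;               /* Exact match */
      }
    }
  }
  return 0;
}

/* Build a two-tree forest by hand, {10,20,30} and {5}, and search it. */
int demoForest(long long iRowid){
  Entry a = {10, NULL, NULL};
  Entry c = {30, NULL, NULL};
  Entry b = {20, &a, &c};       /* Root of the first tree */
  Entry d = {5, NULL, NULL};    /* Root (and only node) of the second tree */
  Entry t2 = {0, &d, NULL};     /* Second forest header */
  Entry t1 = {0, &b, &t2};      /* First forest header */
  return forestContains(&t1, iRowid);
}
```

Each tree is searched independently, so the cost of one test is roughly the sum of the tree depths rather than the total number of entries.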

	1764

	1765 /************ End of rowset.c ********************************************/

	1766 /************ Begin file pager.c *****************************************/

	1767 /*

	1768 ** 2001 September 15

	1769 **

	1770 ** The author disclaims copyright to this source code. In place of

	1771 ** a legal notice, here is a blessing:

	1772 **

	1773 ** May you do good and not evil.

	1774 ** May you find forgiveness for yourself and forgive others.

	1775 ** May you share freely, never taking more than you give.

	1776 **

	1777 *************************************************************************

	1778 ** This is the implementation of the page cache subsystem or "pager".

	1779 **

	1780 ** The pager is used to access a database disk file. It implements

	1781 ** atomic commit and rollback through the use of a journal file that

	1782 ** is separate from the database file. The pager also implements file

	1783 ** locking to prevent two processes from writing the same database

	1784 ** file simultaneously, or one process from reading the database while

	1785 ** another is writing.

	1786 */

	1787 #ifndef SQLITE_OMIT_DISKIO

	1788 /* #include "sqliteInt.h" */

	1789 /************ Include wal.h in the middle of pager.c *********************/

	1790 /************ Begin file wal.h *******************************************/

	1791 /*

	1792 ** 2010 February 1

	1793 **

	1794 ** The author disclaims copyright to this source code. In place of

	1795 ** a legal notice, here is a blessing:

	1796 **

	1797 ** May you do good and not evil.

	1798 ** May you find forgiveness for yourself and forgive others.

	1799 ** May you share freely, never taking more than you give.

	1800 **

	1801 *************************************************************************

	1802 ** This header file defines the interface to the write-ahead logging

	1803 ** system. Refer to the comments below and the header comment attached to

	1804 ** the implementation of each function in log.c for further details.

	1805 */

	1806

	1807 #ifndef SQLITE_WAL_H

	1808 #define SQLITE_WAL_H

	1809

	1810 /* #include "sqliteInt.h" */

	1811

	1812 /* Additional values that can be added to the sync_flags argument of

	1813 ** sqlite3WalFrames():

	1814 */

	1815 #define WAL_SYNC_TRANSACTIONS 0x20 /* Sync at the end of each transaction */

	1816 #define SQLITE_SYNC_MASK 0x13 /* Mask off the SQLITE_SYNC_* values */
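
As a rough illustration of how the two values above fit together, the sketch below separates a combined sync_flags argument into its SQLITE_SYNC_* part and the WAL-specific bit. The helper names `demoBaseSyncFlags()` and `demoSyncEachTransaction()` are hypothetical; the SQLITE_SYNC_* values are repeated from the public sqlite3.h header so that the block compiles on its own.

```c
#include <assert.h>

/* Values copied from the header above. */
#define WAL_SYNC_TRANSACTIONS 0x20
#define SQLITE_SYNC_MASK      0x13

/* Public sync flag values from sqlite3.h, repeated for self-containment. */
#define SQLITE_SYNC_NORMAL    0x00002
#define SQLITE_SYNC_FULL      0x00003
#define SQLITE_SYNC_DATAONLY  0x00010

/* Hypothetical helper: keep only the SQLITE_SYNC_* bits. */
int demoBaseSyncFlags(int sync_flags){
  return sync_flags & SQLITE_SYNC_MASK;
}

/* Hypothetical helper: true if a sync was requested after each transaction. */
int demoSyncEachTransaction(int sync_flags){
  return (sync_flags & WAL_SYNC_TRANSACTIONS)!=0;
}
```

The mask 0x13 covers exactly the bits used by SQLITE_SYNC_NORMAL, SQLITE_SYNC_FULL and SQLITE_SYNC_DATAONLY, which is why 0x20 can be added without colliding with them.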

	1817

	1818 #ifdef SQLITE_OMIT_WAL

	1819 # define sqlite3WalOpen(x,y,z) 0

	1820 # define sqlite3WalLimit(x,y)

	1821 # define sqlite3WalClose(v,w,x,y,z) 0

	1822 # define sqlite3WalBeginReadTransaction(y,z) 0

	1823 # define sqlite3WalEndReadTransaction(z)

	1824 # define sqlite3WalDbsize(y) 0

	1825 # define sqlite3WalBeginWriteTransaction(y) 0

	1826 # define sqlite3WalEndWriteTransaction(x) 0

	1827 # define sqlite3WalUndo(x,y,z) 0

	1828 # define sqlite3WalSavepoint(y,z)

	1829 # define sqlite3WalSavepointUndo(y,z) 0

	1830 # define sqlite3WalFrames(u,v,w,x,y,z) 0

	1831 # define sqlite3WalCheckpoint(q,r,s,t,u,v,w,x,y,z) 0

	1832 # define sqlite3WalCallback(z) 0

	1833 # define sqlite3WalExclusiveMode(y,z) 0

	1834 # define sqlite3WalHeapMemory(z) 0

	1835 # define sqlite3WalFramesize(z) 0

	1836 # define sqlite3WalFindFrame(x,y,z) 0

	1837 # define sqlite3WalFile(x) 0

	1838 #else

	1839

	1840 #define WAL_SAVEPOINT_NDATA 4

	1841

	1842 /* Connection to a write-ahead log (WAL) file.

	1843 ** There is one object of this type for each pager.

	1844 */

	1845 typedef struct Wal Wal;

	1846

	1847 /* Open and close a connection to a write-ahead log. */

	1848 SQLITE_PRIVATE int sqlite3WalOpen(sqlite3_vfs*, sqlite3_file*, const char *, int, i64, Wal**);

	1849 SQLITE_PRIVATE int sqlite3WalClose(Wal *pWal, sqlite3*, int sync_flags, int, u8 *);

	1850

	1851 /* Set the limiting size of a WAL file. */

	1852 SQLITE_PRIVATE void sqlite3WalLimit(Wal*, i64);

	1853

	1854 /* Used by readers to open (lock) and close (unlock) a snapshot. A

	1855 ** snapshot is like a read-transaction. It is the state of the database

	1856 ** at an instant in time. sqlite3WalOpenSnapshot gets a read lock and

	1857 ** preserves the current state even if the other threads or processes

	1858 ** write to or checkpoint the WAL. sqlite3WalCloseSnapshot() closes the

	1859 ** transaction and releases the lock.

	1860 */

	1861 SQLITE_PRIVATE int sqlite3WalBeginReadTransaction(Wal *pWal, int *);

	1862 SQLITE_PRIVATE void sqlite3WalEndReadTransaction(Wal *pWal);

	1863

	1864 /* Read a page from the write-ahead log, if it is present. */

	1865 SQLITE_PRIVATE int sqlite3WalFindFrame(Wal *, Pgno, u32 *);

	1866 SQLITE_PRIVATE int sqlite3WalReadFrame(Wal *, u32, int, u8 *);

	1867

	1868 /* If the WAL is not empty, return the size of the database. */

	1869 SQLITE_PRIVATE Pgno sqlite3WalDbsize(Wal *pWal);

	1870

	1871 /* Obtain or release the WRITER lock. */

	1872 SQLITE_PRIVATE int sqlite3WalBeginWriteTransaction(Wal *pWal);

	1873 SQLITE_PRIVATE int sqlite3WalEndWriteTransaction(Wal *pWal);

	1874

	1875 /* Undo any frames written (but not committed) to the log */

	1876 SQLITE_PRIVATE int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx);

	1877

	1878 /* Return an integer that records the current (uncommitted) write

	1879 ** position in the WAL */

	1880 SQLITE_PRIVATE void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData);

	1881

	1882 /* Move the write position of the WAL back to iFrame. Called in

	1883 ** response to a ROLLBACK TO command. */

	1884 SQLITE_PRIVATE int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData);

	1885

	1886 /* Write a frame or frames to the log. */

	1887 SQLITE_PRIVATE int sqlite3WalFrames(Wal *pWal, int, PgHdr *, Pgno, int, int);

	1888

	1889 /* Copy pages from the log to the database file */

	1890 SQLITE_PRIVATE int sqlite3WalCheckpoint(

	1891 Wal *pWal, /* Write-ahead log connection */

	1892 sqlite3 *db, /* Check this handle's interrupt flag */

	1893 int eMode, /* One of PASSIVE, FULL and RESTART */

	1894 int (*xBusy)(void*), /* Function to call when busy */

	1895 void *pBusyArg, /* Context argument for xBusyHandler */

	1896 int sync_flags, /* Flags to sync db file with (or 0) */

	1897 int nBuf, /* Size of buffer nBuf */

	1898 u8 *zBuf, /* Temporary buffer to use */

	1899 int *pnLog, /* OUT: Number of frames in WAL */

	1900 int *pnCkpt /* OUT: Number of backfilled frames in WAL */

	1901 );

	1902

	1903 /* Return the value to pass to a sqlite3_wal_hook callback, the

	1904 ** number of frames in the WAL at the point of the last commit since

	1905 ** sqlite3WalCallback() was called. If no commits have occurred since

	1906 ** the last call, then return 0.

	1907 */

	1908 SQLITE_PRIVATE int sqlite3WalCallback(Wal *pWal);

	1909

	1910 /* Tell the wal layer that an EXCLUSIVE lock has been obtained (or released)

	1911 ** by the pager layer on the database file.

	1912 */

	1913 SQLITE_PRIVATE int sqlite3WalExclusiveMode(Wal *pWal, int op);

	1914

	1915 /* Return true if the argument is non-NULL and the WAL module is using

	1916 ** heap-memory for the wal-index. Otherwise, if the argument is NULL or the

	1917 ** WAL module is using shared-memory, return false.

	1918 */

	1919 SQLITE_PRIVATE int sqlite3WalHeapMemory(Wal *pWal);

	1920

	1921 #ifdef SQLITE_ENABLE_SNAPSHOT

	1922 SQLITE_PRIVATE int sqlite3WalSnapshotGet(Wal *pWal, sqlite3_snapshot **ppSnapshot);

	1923 SQLITE_PRIVATE void sqlite3WalSnapshotOpen(Wal *pWal, sqlite3_snapshot *pSnapshot);

	1924 SQLITE_PRIVATE int sqlite3WalSnapshotRecover(Wal *pWal);

	1925 #endif

	1926

	1927 #ifdef SQLITE_ENABLE_ZIPVFS

	1928 /* If the WAL file is not empty, return the number of bytes of content

	1929 ** stored in each frame (i.e. the db page-size when the WAL was created).

	1930 */

	1931 SQLITE_PRIVATE int sqlite3WalFramesize(Wal *pWal);

	1932 #endif

	1933

	1934 /* Return the sqlite3_file object for the WAL file */

	1935 SQLITE_PRIVATE sqlite3_file *sqlite3WalFile(Wal *pWal);

	1936

	1937 #endif /* ifndef SQLITE_OMIT_WAL */

	1938 #endif /* SQLITE_WAL_H */

	1939

	1940 /************ End of wal.h ***********************************************/

	1941 /************ Continuing where we left off in pager.c ********************/

	1942

	1943

	1944 /***************** NOTES ON THE DESIGN OF THE PAGER **********************

	1945 **

	1946 ** This comment block describes invariants that hold when using a rollback

	1947 ** journal. These invariants do not apply for journal_mode=WAL,

	1948 ** journal_mode=MEMORY, or journal_mode=OFF.

	1949 **

	1950 ** Within this comment block, a page is deemed to have been synced

	1951 ** automatically as soon as it is written when PRAGMA synchronous=OFF.

	1952 ** Otherwise, the page is not synced until the xSync method of the VFS

	1953 ** is called successfully on the file containing the page.

	1954 **

	1955 ** Definition: A page of the database file is said to be "overwriteable" if

	1956 ** one or more of the following are true about the page:

	1957 **

	1958 ** (a) The original content of the page as it was at the beginning of

	1959 ** the transaction has been written into the rollback journal and

	1960 ** synced.

	1961 **

	1962 ** (b) The page was a freelist leaf page at the start of the transaction.

	1963 **

	1964 ** (c) The page number is greater than the largest page that existed in

	1965 ** the database file at the start of the transaction.

	1966 **

	1967 ** (1) A page of the database file is never overwritten unless one of the

	1968 ** following are true:

	1969 **

	1970 ** (a) The page and all other pages on the same sector are overwriteable.

	1971 **

	1972 ** (b) The atomic page write optimization is enabled, and the entire

	1973 ** transaction other than the update of the transaction sequence

	1974 ** number consists of a single page change.

	1975 **

	1976 ** (2) The content of a page written into the rollback journal exactly matches

	1977 ** both the content in the database when the rollback journal was written

	1978 ** and the content in the database at the beginning of the current

	1979 ** transaction.

	1980 **

	1981 ** (3) Writes to the database file are an integer multiple of the page size

	1982 ** in length and are aligned on a page boundary.

	1983 **

	1984 ** (4) Reads from the database file are either aligned on a page boundary and

	1985 ** an integer multiple of the page size in length or are taken from the

	1986 ** first 100 bytes of the database file.

	1987 **

	1988 ** (5) All writes to the database file are synced prior to the rollback journal

	1989 ** being deleted, truncated, or zeroed.

	1990 **

	1991 ** (6) If a master journal file is used, then all writes to the database file

	1992 ** are synced prior to the master journal being deleted.

	1993 **

	1994 ** Definition: Two databases (or the same database at two points in time)

	1995 ** are said to be "logically equivalent" if they give the same answer to

	1996 ** all queries. Note in particular the content of freelist leaf

	1997 ** pages can be changed arbitrarily without affecting the logical equivalence

	1998 ** of the database.

	1999 **

	2000 ** (7) At any time, if any subset, including the empty set and the total set,

	2001 ** of the unsynced changes to a rollback journal are removed and the

	2002 ** journal is rolled back, the resulting database file will be logically

	2003 ** equivalent to the database file at the beginning of the transaction.

	2004 **

	2005 ** (8) When a transaction is rolled back, the xTruncate method of the VFS

	2006 ** is called to restore the database file to the same size it was at

	2007 ** the beginning of the transaction. (In some VFSes, the xTruncate

	2008 ** method is a no-op, but that does not change the fact that SQLite will

	2009 ** invoke it.)

	2010 **

	2011 ** (9) Whenever the database file is modified, at least one bit in the range

	2012 ** of bytes from 24 through 39 inclusive will be changed prior to releasing

	2013 ** the EXCLUSIVE lock, thus signaling other connections on the same

	2014 ** database to flush their caches.

	2015 **

	2016 ** (10) The pattern of bits in bytes 24 through 39 shall not repeat in less

	2017 ** than one billion transactions.

	2018 **

	2019 ** (11) A database file is well-formed at the beginning and at the conclusion

	2020 ** of every transaction.

	2021 **

	2022 ** (12) An EXCLUSIVE lock is held on the database file when writing to

	2023 ** the database file.

	2024 **

	2025 ** (13) A SHARED lock is held on the database file while reading any

	2026 ** content out of the database file.

	2027 **

	2028 ******************************************************************************/
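
Invariants (3) and (4) above are simple arithmetic conditions on file offsets and lengths. A minimal sketch of how they could be checked follows; `demoIsPageAligned()` and `demoIsPermittedRead()` are hypothetical helpers written for illustration and are not part of the pager.

```c
#include <assert.h>

/* Invariant (3): a write must start on a page boundary and cover a whole
** number of pages. Returns 1 if the request satisfies that, 0 otherwise. */
int demoIsPageAligned(long long iOfst, long long nByte, int pageSize){
  if( pageSize<=0 || iOfst<0 || nByte<=0 ) return 0;
  return (iOfst % pageSize)==0 && (nByte % pageSize)==0;
}

/* Invariant (4): a read is either page aligned in the same sense, or it
** lies entirely within the first 100 bytes of the database file. */
int demoIsPermittedRead(long long iOfst, long long nByte, int pageSize){
  if( iOfst>=0 && nByte>0 && iOfst+nByte<=100 ) return 1;
  return demoIsPageAligned(iOfst, nByte, pageSize);
}
```

For example, a 16-byte read at offset 24 (the range named in invariant (9)) passes the second check even though it is not page aligned, because it stays inside the 100-byte header.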

	2029

	2030 /*

	2031 ** Macros for troubleshooting. Normally turned off

	2032 */

	2033 #if 0

	2034 int sqlite3PagerTrace=1; /* True to enable tracing */

	2035 #define sqlite3DebugPrintf printf

	2036 #define PAGERTRACE(X) if( sqlite3PagerTrace ){ sqlite3DebugPrintf X; }

	2037 #else

	2038 #define PAGERTRACE(X)

	2039 #endif

	2040

	2041 /*

	2042 ** The following two macros are used within the PAGERTRACE() macros above

	2043 ** to print out file-descriptors.

	2044 **

	2045 ** PAGERID() takes a pointer to a Pager struct as its argument. The

	2046 ** associated file-descriptor is returned. FILEHANDLEID() takes an sqlite3_file

	2047 ** struct as its argument.

	2048 */

	2049 #define PAGERID(p) ((int)(p->fd))

	2050 #define FILEHANDLEID(fd) ((int)fd)

	2051

	2052 /*

	2053 ** The Pager.eState variable stores the current 'state' of a pager. A

	2054 ** pager may be in any one of the seven states shown in the following

	2055 ** state diagram.

	2056 **

	2057 **                            OPEN <------+------+

	2058 **                              |         |      |

	2059 **                              V         |      |

	2060 **               +---------> READER-------+      |

	2061 **               |              |                |

	2062 **               |              V                |

	2063 **               |<-------WRITER_LOCKED------> ERROR

	2064 **               |              |                ^

	2065 **               |              V                |

	2066 **               |<------WRITER_CACHEMOD-------->|

	2067 **               |              |                |

	2068 **               |              V                |

	2069 **               |<-------WRITER_DBMOD---------->|

	2070 **               |              |                |

	2071 **               |              V                |

	2072 **               +<------WRITER_FINISHED-------->+

	2073 **

	2074 **

	2075 ** List of state transitions and the C [function] that performs each:

	2076 **

	2077 ** OPEN -> READER [sqlite3PagerSharedLock]

	2078 ** READER -> OPEN [pager_unlock]

	2079 **

	2080 ** READER -> WRITER_LOCKED [sqlite3PagerBegin]

	2081 ** WRITER_LOCKED -> WRITER_CACHEMOD [pager_open_journal]

	2082 ** WRITER_CACHEMOD -> WRITER_DBMOD [syncJournal]

	2083 ** WRITER_DBMOD -> WRITER_FINISHED [sqlite3PagerCommitPhaseOne]

	2084 ** WRITER_*** -> READER [pager_end_transaction]

	2085 **

	2086 ** WRITER_*** -> ERROR [pager_error]

	2087 ** ERROR -> OPEN [pager_unlock]

	2088 **

	2089 **

	2090 ** OPEN:

	2091 **

	2092 ** The pager starts up in this state. Nothing is guaranteed in this

	2093 ** state - the file may or may not be locked and the database size is

	2094 ** unknown. The database may not be read or written.

	2095 **

	2096 ** * No read or write transaction is active.

	2097 ** * Any lock, or no lock at all, may be held on the database file.

	2098 ** * The dbSize, dbOrigSize and dbFileSize variables may not be trusted.

	2099 **

	2100 ** READER:

	2101 **

	2102 ** In this state all the requirements for reading the database in

	2103 ** rollback (non-WAL) mode are met. Unless the pager is (or recently

	2104 ** was) in exclusive-locking mode, a user-level read transaction is

	2105 ** open. The database size is known in this state.

	2106 **

	2107 ** A connection running with locking_mode=normal enters this state when

	2108 ** it opens a read-transaction on the database and returns to state

	2109 ** OPEN after the read-transaction is completed. However a connection

	2110 ** running in locking_mode=exclusive (including temp databases) remains in

	2111 ** this state even after the read-transaction is closed. The only way

	2112 ** a locking_mode=exclusive connection can transition from READER to OPEN

	2113 ** is via the ERROR state (see below).

	2114 **

	2115 ** * A read transaction may be active (but a write-transaction cannot).

	2116 ** * A SHARED or greater lock is held on the database file.

	2117 ** * The dbSize variable may be trusted (even if a user-level read

	2118 ** transaction is not active). The dbOrigSize and dbFileSize variables

	2119 ** may not be trusted at this point.

	2120 ** * If the database is a WAL database, then the WAL connection is open.

	2121 ** * Even if a read-transaction is not open, it is guaranteed that

	2122 ** there is no hot-journal in the file-system.

	2123 **

	2124 ** WRITER_LOCKED:

	2125 **

	2126 ** The pager moves to this state from READER when a write-transaction

	2127 ** is first opened on the database. In WRITER_LOCKED state, all locks

	2128 ** required to start a write-transaction are held, but no actual

	2129 ** modifications to the cache or database have taken place.

	2130 **

	2131 ** In rollback mode, a RESERVED or (if the transaction was opened with

	2132 ** BEGIN EXCLUSIVE) EXCLUSIVE lock is obtained on the database file when

	2133 ** moving to this state, but the journal file is not opened or written

	2134 ** to in this state. If the transaction is committed or rolled back while

	2135 ** in WRITER_LOCKED state, all that is required is to unlock the database

	2136 ** file.

	2137 **

	2138 ** In WAL mode, sqlite3WalBeginWriteTransaction() is called to lock the log file.

	2139 ** If the connection is running with locking_mode=exclusive, an attempt

	2140 ** is made to obtain an EXCLUSIVE lock on the database file.

	2141 **

	2142 ** * A write transaction is active.

	2143 ** * If the connection is open in rollback-mode, a RESERVED or greater

	2144 ** lock is held on the database file.

	2145 ** * If the connection is open in WAL-mode, a WAL write transaction

	2146 ** is open (i.e. sqlite3WalBeginWriteTransaction() has been successfully

	2147 ** called).

	2148 ** * The dbSize, dbOrigSize and dbFileSize variables are all valid.

	2149 ** * The contents of the pager cache have not been modified.

	2150 ** * The journal file may or may not be open.

	2151 ** * Nothing (not even the first header) has been written to the journal.

	2152 **

	2153 ** WRITER_CACHEMOD:

	2154 **

	2155 ** A pager moves from WRITER_LOCKED state to this state when a page is

	2156 ** first modified by the upper layer. In rollback mode the journal file

	2157 ** is opened (if it is not already open) and a header written to the

	2158 ** start of it. The database file on disk has not been modified.

	2159 **

	2160 ** * A write transaction is active.

	2161 ** * A RESERVED or greater lock is held on the database file.

	2162 ** * The journal file is open and the first header has been written

	2163 ** to it, but the header has not been synced to disk.

	2164 ** * The contents of the page cache have been modified.

	2165 **

	2166 ** WRITER_DBMOD:

	2167 **

	2168 ** The pager transitions from WRITER_CACHEMOD into WRITER_DBMOD state

	2169 ** when it modifies the contents of the database file. WAL connections

	2170 ** never enter this state (since they do not modify the database file,

	2171 ** just the log file).

	2172 **

	2173 ** * A write transaction is active.

	2174 ** * An EXCLUSIVE or greater lock is held on the database file.

	2175 ** * The journal file is open and the first header has been written

	2176 ** and synced to disk.

	2177 ** * The contents of the page cache have been modified (and possibly

	2178 ** written to disk).

	2179 **

	2180 ** WRITER_FINISHED:

	2181 **

	2182 ** It is not possible for a WAL connection to enter this state.

	2183 **

	2184 ** A rollback-mode pager changes to WRITER_FINISHED state from WRITER_DBMOD

	2185 ** state after the entire transaction has been successfully written into the

	2186 ** database file. In this state the transaction may be committed simply

	2187 ** by finalizing the journal file. Once in WRITER_FINISHED state, it is

	2188 ** not possible to modify the database further. At this point, the upper

	2189 ** layer must either commit or rollback the transaction.

	2190 **

	2191 ** * A write transaction is active.

	2192 ** * An EXCLUSIVE or greater lock is held on the database file.

	2193 ** * All writing and syncing of journal and database data has finished.

	2194 ** If no error occurred, all that remains is to finalize the journal to

	2195 ** commit the transaction. If an error did occur, the caller will need

	2196 ** to rollback the transaction.

	2197 **

	2198 ** ERROR:

	2199 **

	2200 ** The ERROR state is entered when an IO or disk-full error (including

	2201 ** SQLITE_IOERR_NOMEM) occurs at a point in the code that makes it

	2202 ** difficult to be sure that the in-memory pager state (cache contents,

	2203 ** db size etc.) are consistent with the contents of the file-system.

	2204 **

	2205 ** Temporary pager files may enter the ERROR state, but in-memory pagers

	2206 ** cannot.

	2207 **

	2208 ** For example, if an IO error occurs while performing a rollback,

	2209 ** the contents of the page-cache may be left in an inconsistent state.

	2210 ** At this point it would be dangerous to change back to READER state

	2211 ** (as usually happens after a rollback). Any subsequent readers might

	2212 ** report database corruption (due to the inconsistent cache), and if

	2213 ** they upgrade to writers, they may inadvertently corrupt the database

	2214 ** file. To avoid this hazard, the pager switches into the ERROR state

	2215 ** instead of READER following such an error.

	2216 **

	2217 ** Once it has entered the ERROR state, any attempt to use the pager

	2218 ** to read or write data returns an error. Eventually, once all

	2219 ** outstanding transactions have been abandoned, the pager is able to

	2220 ** transition back to OPEN state, discarding the contents of the

	2221 ** page-cache and any other in-memory state at the same time. Everything

	2222 ** is reloaded from disk (and, if necessary, hot-journal rollback performed)

	2223 ** when a read-transaction is next opened on the pager (transitioning

	2224 ** the pager into READER state). At that point the system has recovered

	2225 ** from the error.

	2226 **

	2227 ** Specifically, the pager jumps into the ERROR state if:

	2228 **

	2229 ** 1. An error occurs while attempting a rollback. This happens in

	2230 ** function sqlite3PagerRollback().

	2231 **

	2232 ** 2. An error occurs while attempting to finalize a journal file

	2233 ** following a commit in function sqlite3PagerCommitPhaseTwo().

	2234 **

	2235 ** 3. An error occurs while attempting to write to the journal or

	2236 ** database file in function pagerStress() in order to free up

	2237 ** memory.

	2238 **

	2239 ** In other cases, the error is returned to the b-tree layer. The b-tree

	2240 ** layer then attempts a rollback operation. If the error condition

	2241 ** persists, the pager enters the ERROR state via condition (1) above.

	2242 **

	2243 ** Condition (3) is necessary because it can be triggered by a read-only

	2244 ** statement executed within a transaction. In this case, if the error

	2245 ** code were simply returned to the user, the b-tree layer would not

	2246 ** automatically attempt a rollback, as it assumes that an error in a

	2247 ** read-only statement cannot leave the pager in an internally inconsistent

	2248 ** state.

	2249 **

	2250 ** * The Pager.errCode variable is set to something other than SQLITE_OK.

	2251 ** * There are one or more outstanding references to pages (after the

	2252 ** last reference is dropped the pager should move back to OPEN state).

	2253 ** * The pager is not an in-memory pager.

	2254 **

	2255 **

	2256 ** Notes:

	2257 **

	2258 ** * A pager is never in WRITER_DBMOD or WRITER_FINISHED state if the

	2259 ** connection is open in WAL mode. A WAL connection is always in one

	2260 ** of the first four states.

	2261 **

	2262 ** * Normally, a connection open in exclusive mode is never in PAGER_OPEN

	2263 ** state. There are two exceptions: immediately after exclusive-mode has

	2264 ** been turned on (and before any read or write transactions are

	2265 ** executed), and when the pager is leaving the "error state".

	2266 **

	2267 ** * See also: assert_pager_state().

	2268 */

	2269 #define PAGER_OPEN 0

	2270 #define PAGER_READER 1

	2271 #define PAGER_WRITER_LOCKED 2

	2272 #define PAGER_WRITER_CACHEMOD 3

	2273 #define PAGER_WRITER_DBMOD 4

	2274 #define PAGER_WRITER_FINISHED 5

	2275 #define PAGER_ERROR 6
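
The list of state transitions in the comment above can be encoded as a small predicate. This is only a sketch of that list, assuming nothing beyond what the comment states: `demoIsWriter()` and `demoTransitionListed()` are hypothetical names, and the real pager enforces its rules through assert_pager_state() rather than a table like this. The PAGER_* values are repeated so the block stands alone.

```c
#include <assert.h>

#define PAGER_OPEN            0
#define PAGER_READER          1
#define PAGER_WRITER_LOCKED   2
#define PAGER_WRITER_CACHEMOD 3
#define PAGER_WRITER_DBMOD    4
#define PAGER_WRITER_FINISHED 5
#define PAGER_ERROR           6

/* True for any of the four WRITER_* states. */
int demoIsWriter(int eState){
  return eState>=PAGER_WRITER_LOCKED && eState<=PAGER_WRITER_FINISHED;
}

/* Return 1 if "from -> to" appears in the documented transition list. */
int demoTransitionListed(int from, int to){
  /* WRITER_*** -> READER and WRITER_*** -> ERROR */
  if( demoIsWriter(from) && (to==PAGER_READER || to==PAGER_ERROR) ) return 1;
  switch( from ){
    case PAGER_OPEN:            return to==PAGER_READER;
    case PAGER_READER:          return to==PAGER_OPEN || to==PAGER_WRITER_LOCKED;
    case PAGER_WRITER_LOCKED:   return to==PAGER_WRITER_CACHEMOD;
    case PAGER_WRITER_CACHEMOD: return to==PAGER_WRITER_DBMOD;
    case PAGER_WRITER_DBMOD:    return to==PAGER_WRITER_FINISHED;
    case PAGER_ERROR:           return to==PAGER_OPEN;
    default:                    return 0;
  }
}
```

Because the writer states are numbered consecutively, the range test in demoIsWriter() is enough to cover the wildcard entries. The sketch ignores the WAL restriction noted above (a WAL connection never reaches WRITER_DBMOD or WRITER_FINISHED).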

	2276

	2277 /*

	2278 ** The Pager.eLock variable is almost always set to one of the

	2279 ** following locking-states, according to the lock currently held on

	2280 ** the database file: NO_LOCK, SHARED_LOCK, RESERVED_LOCK or EXCLUSIVE_LOCK.

	2281 ** This variable is kept up to date as locks are taken and released by

	2282 ** the pagerLockDb() and pagerUnlockDb() wrappers.

	2283 **

	2284 ** If the VFS xLock() or xUnlock() returns an error other than SQLITE_BUSY

	2285 ** (i.e. one of the SQLITE_IOERR subtypes), it is not clear whether or not

	2286 ** the operation was successful. In these circumstances pagerLockDb() and

	2287 ** pagerUnlockDb() take a conservative approach - eLock is always updated

	2288 ** when unlocking the file, and only updated when locking the file if the

	2289 ** VFS call is successful. This way, the Pager.eLock variable may be set

	2290 ** to a less exclusive (lower) value than the lock that is actually held

	2291 ** at the system level, but it is never set to a more exclusive value.

	2292 **

	2293 ** This is usually safe. If an xUnlock fails or appears to fail, there may

	2294 ** be a few redundant xLock() calls or a lock may be held for longer than

	2295 ** required, but nothing really goes wrong.

	2296 **

	2297 ** The exception is when the database file is unlocked as the pager moves

	2298 ** from ERROR to OPEN state. At this point there may be a hot-journal file

	2299 ** in the file-system that needs to be rolled back (as part of an OPEN->SHARED

	2300 ** transition, by the same pager or any other). If the call to xUnlock()

	2301 ** fails at this point and the pager is left holding an EXCLUSIVE lock, this

	2302 ** can confuse the xCheckReservedLock() call made later as part

	2303 ** of hot-journal detection.

	2304 **

	2305 ** xCheckReservedLock() is defined as returning true "if there is a RESERVED

	2306 ** lock held by this process or any others". So xCheckReservedLock may

	2307 ** return true because the caller itself is holding an EXCLUSIVE lock (but

	2308 ** doesn't know it because of a previous error in xUnlock). If this happens

	2309 ** a hot-journal may be mistaken for a journal being created by an active

	2310 ** transaction in another process, causing SQLite to read from the database

	2311 ** without rolling it back.

	2312 **

	2313 ** To work around this, if a call to xUnlock() fails when unlocking the

	2314 ** database in the ERROR state, Pager.eLock is set to UNKNOWN_LOCK. It

	2315 ** is only changed back to a real locking state after a successful call

	2316 ** to xLock(EXCLUSIVE). Also, the code to do the OPEN->SHARED state transition

	2317 ** omits the check for a hot-journal if Pager.eLock is set to UNKNOWN_LOCK.

	2318 ** Instead, it assumes a hot-journal exists and obtains an EXCLUSIVE

	2319 ** lock on the database file before attempting to roll it back. See function

	2320 ** PagerSharedLock() for more detail.

	2321 **

	2322 ** Pager.eLock may only be set to UNKNOWN_LOCK when the pager is in

	2323 ** PAGER_OPEN state.

	2324 */

	2325 #define UNKNOWN_LOCK (EXCLUSIVE_LOCK+1)
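
The conservative bookkeeping described in the comment above can be sketched as follows. This is an illustration under stated assumptions, not the pager's code: `DemoPager`, `demoLockDb()` and `demoUnlockDb()` are hypothetical, the result of the VFS call is passed in as `vfsRc` (0 meaning success) instead of being obtained from xLock()/xUnlock(), and the lock-level values are the ones SQLite conventionally uses.

```c
#include <assert.h>

/* Lock levels, lowest to highest (assumed to match SQLite's values). */
#define NO_LOCK        0
#define SHARED_LOCK    1
#define RESERVED_LOCK  2
#define PENDING_LOCK   3
#define EXCLUSIVE_LOCK 4
#define UNKNOWN_LOCK   (EXCLUSIVE_LOCK+1)

typedef struct DemoPager { int eLock; } DemoPager;

/* Locking: record the new level only if the VFS call succeeded. While the
** level is UNKNOWN_LOCK, only a successful EXCLUSIVE lock restores a real
** locking state. */
int demoLockDb(DemoPager *p, int eLock, int vfsRc){
  if( p->eLock<eLock || p->eLock==UNKNOWN_LOCK ){
    if( vfsRc==0 && (p->eLock!=UNKNOWN_LOCK || eLock==EXCLUSIVE_LOCK) ){
      p->eLock = eLock;
    }
    return vfsRc;
  }
  return 0;  /* Already hold a sufficient lock: nothing to do */
}

/* Unlocking: always record the lower level, even if the VFS call failed,
** so eLock never claims more than is actually held. */
int demoUnlockDb(DemoPager *p, int eLock, int vfsRc){
  if( p->eLock!=UNKNOWN_LOCK ) p->eLock = eLock;
  return vfsRc;
}

/* A failed upgrade from SHARED to RESERVED leaves eLock at SHARED. */
int demoFailedLock(void){
  DemoPager p = { SHARED_LOCK };
  demoLockDb(&p, RESERVED_LOCK, 1);
  return p.eLock;
}

/* A failed downgrade from EXCLUSIVE to SHARED still records SHARED. */
int demoFailedUnlock(void){
  DemoPager p = { EXCLUSIVE_LOCK };
  demoUnlockDb(&p, SHARED_LOCK, 1);
  return p.eLock;
}

/* From UNKNOWN_LOCK, a SHARED lock changes nothing; EXCLUSIVE recovers. */
int demoRecoverFromUnknown(void){
  DemoPager p = { UNKNOWN_LOCK };
  demoLockDb(&p, SHARED_LOCK, 0);
  if( p.eLock!=UNKNOWN_LOCK ) return -1;
  demoLockDb(&p, EXCLUSIVE_LOCK, 0);
  return p.eLock;
}
```

The asymmetry is deliberate: underestimating the held lock costs at most a redundant lock call, whereas overestimating it could let the pager write without holding the lock it believes it has.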

	2326

	2327 /*

	2328 ** A macro used for invoking the codec if there is one

	2329 */

	2330 #ifdef SQLITE_HAS_CODEC

	2331 # define CODEC1(P,D,N,X,E) \

	2332 if( P->xCodec && P->xCodec(P->pCodec,D,N,X)==0 ){ E; }

	2333 # define CODEC2(P,D,N,X,E,O) \

	2334 if( P->xCodec==0 ){ O=(char*)D; }else \

	2335 if( (O=(char*)(P->xCodec(P->pCodec,D,N,X)))==0 ){ E; }

	2336 #else

	2337 # define CODEC1(P,D,N,X,E) /* NO-OP */

	2338 # define CODEC2(P,D,N,X,E,O) O=(char*)D

	2339 #endif

	2340

	2341 /*

	2342 ** The maximum allowed sector size. 64KiB. If the xSectorsize() method

	2343 ** returns a value larger than this, then MAX_SECTOR_SIZE is used instead.

	2344 ** This could conceivably cause corruption following a power failure on

	2345 ** such a system. This is currently an undocumented limit.

	2346 */

	2347 #define MAX_SECTOR_SIZE 0x10000

	2348

	2349

	2350 /*

	2351 ** An instance of the following structure is allocated for each active

	2352 ** savepoint and statement transaction in the system. All such structures

	2353 ** are stored in the Pager.aSavepoint[] array, which is allocated and

	2354 ** resized using sqlite3Realloc().

	2355 **

	2356 ** When a savepoint is created, the PagerSavepoint.iHdrOffset field is

	2357 ** set to 0. If a journal-header is written into the main journal while

	2358 ** the savepoint is active, then iHdrOffset is set to the byte offset

	2359 ** immediately following the last journal record written into the main

	2360 ** journal before the journal-header. This is required during savepoint

	2361 ** rollback (see pagerPlaybackSavepoint()).

	2362 */

	2363 typedef struct PagerSavepoint PagerSavepoint;

	2364 struct PagerSavepoint {

	2365 i64 iOffset; /* Starting offset in main journal */

	2366 i64 iHdrOffset; /* See above */

	2367 Bitvec *pInSavepoint; /* Set of pages in this savepoint */

	2368 Pgno nOrig; /* Original number of pages in file */

	2369 Pgno iSubRec; /* Index of first record in sub-journal */

	2370 #ifndef SQLITE_OMIT_WAL

	2371 u32 aWalData[WAL_SAVEPOINT_NDATA]; /* WAL savepoint context */

	2372 #endif

	2373 };

	2374

	2375 /*

	2376 ** Bits of the Pager.doNotSpill flag. See further description below.

	2377 */

	2378 #define SPILLFLAG_OFF 0x01 /* Never spill cache. Set via pragma */

	2379 #define SPILLFLAG_ROLLBACK 0x02 /* Currently rolling back, so do not spill */

	2380 #define SPILLFLAG_NOSYNC 0x04 /* Spill is ok, but do not sync */

	2381

	2382 /*

	2383 ** An open page cache is an instance of struct Pager. A description of

	2384 ** some of the more important member variables follows:

	2385 **

	2386 ** eState

	2387 **

	2388 ** The current 'state' of the pager object. See the comment and state

	2389 ** diagram above for a description of the pager state.

	2390 **

	2391 ** eLock

	2392 **

	2393 ** For a real on-disk database, the current lock held on the database file -

	2394 ** NO_LOCK, SHARED_LOCK, RESERVED_LOCK or EXCLUSIVE_LOCK.

	2395 **

	2396 ** For a temporary or in-memory database (neither of which require any

	2397 ** locks), this variable is always set to EXCLUSIVE_LOCK. Since such

	2398 ** databases always have Pager.exclusiveMode==1, this tricks the pager

	2399 ** logic into thinking that it already has all the locks it will ever

	2400 ** need (and no reason to release them).

	2401 **

	2402 ** In some (obscure) circumstances, this variable may also be set to

	2403 ** UNKNOWN_LOCK. See the comment above the #define of UNKNOWN_LOCK for

	2404 ** details.

	2405 **

	2406 ** changeCountDone

	2407 **

	2408 ** This boolean variable is used to make sure that the change-counter

	2409 ** (the 4-byte header field at byte offset 24 of the database file) is

	2410 ** not updated more often than necessary.

	2411 **

	2412 ** It is set to true when the change-counter field is updated, which

	2413 ** can only happen if an exclusive lock is held on the database file.

	2414 ** It is cleared (set to false) whenever an exclusive lock is

	2415 ** relinquished on the database file. Each time a transaction is committed,

	2416 ** the changeCountDone flag is inspected. If it is true, the work of

	2417 ** updating the change-counter is omitted for the current transaction.

	2418 **

	2419 ** This mechanism means that when running in exclusive mode, a connection

	2420 ** need only update the change-counter once, for the first transaction

	2421 ** committed.

	2422 **

	2423 ** setMaster

	2424 **

	2425 ** When PagerCommitPhaseOne() is called to commit a transaction, it may

	2426 ** (or may not) specify a master-journal name to be written into the

	2427 ** journal file before it is synced to disk.

	2428 **

	2429 ** Whether or not a journal file contains a master-journal pointer affects

	2430 ** the way in which the journal file is finalized after the transaction is

	2431 ** committed or rolled back when running in "journal_mode=PERSIST" mode.

	2432 ** If a journal file does not contain a master-journal pointer, it is

	2433 ** finalized by overwriting the first journal header with zeroes. If

	2434 ** it does contain a master-journal pointer the journal file is finalized

	2435 ** by truncating it to zero bytes, just as if the connection were

	2436 ** running in "journal_mode=truncate" mode.

	2437 **

	2438 ** Journal files that contain master journal pointers cannot be finalized

	2439 ** simply by overwriting the first journal-header with zeroes, as the

	2440 ** master journal pointer could interfere with hot-journal rollback of any

	2441 ** subsequently interrupted transaction that reuses the journal file.

	2442 **

	2443 ** The flag is cleared as soon as the journal file is finalized (either

	2444 ** by PagerCommitPhaseTwo or PagerRollback). If an IO error prevents the

	2445 ** journal file from being successfully finalized, the setMaster flag

	2446 ** is cleared anyway (and the pager will move to ERROR state).

	2447 **

	2448 ** doNotSpill

	2449 **

	2450 ** This variable controls the behavior of cache-spills (calls made by

	2451 ** the pcache module to the pagerStress() routine to write cached data

	2452 ** to the file-system in order to free up memory).

	2453 **

	2454 ** When bits SPILLFLAG_OFF or SPILLFLAG_ROLLBACK of doNotSpill are set,

	2455 ** writing to the database from pagerStress() is disabled altogether.

	2456 ** The SPILLFLAG_ROLLBACK case is done in a very obscure case that

	2457 ** comes up during savepoint rollback that requires the pcache module

	2458 ** to allocate a new page to prevent the journal file from being written

	2459 ** while it is being traversed by code in pager_playback(). The SPILLFLAG_OFF

	2460 ** case is a user preference.

	2461 **

	2462 ** If the SPILLFLAG_NOSYNC bit is set, writing to the database from

	2463 ** pagerStress() is permitted, but syncing the journal file is not.

	2464 ** This flag is set by sqlite3PagerWrite() when the file-system sector-size

	2465 ** is larger than the database page-size in order to prevent a journal sync

	2466 ** from happening in between the journalling of two pages on the same sector.

	2467 **

	2468 ** subjInMemory

	2469 **

	2470 ** This is a boolean variable. If true, then any required sub-journal

	2471 ** is opened as an in-memory journal file. If false, then in-memory

	2472 ** sub-journals are only used for in-memory pager files.

	2473 **

	2474 ** This variable is updated by the upper layer each time a new

	2475 ** write-transaction is opened.

	2476 **

	2477 ** dbSize, dbOrigSize, dbFileSize

	2478 **

	2479 ** Variable dbSize is set to the number of pages in the database file.

	2480 ** It is valid in PAGER_READER and higher states (all states except for

	2481 ** OPEN and ERROR).

	2482 **

	2483 ** dbSize is set based on the size of the database file, which may be

	2484 ** larger than the size of the database (the value stored at offset

	2485 ** 28 of the database header by the btree). If the size of the file

	2486 ** is not an integer multiple of the page-size, the value stored in

	2487 ** dbSize is rounded down (i.e. a 5KB file with 2K page-size has dbSize==2).

	2488 ** Except, any file that is greater than 0 bytes in size is considered

	2489 ** to have at least one page. (i.e. a 1KB file with 2K page-size leads

	2490 ** to dbSize==1).

	2491 **

	2492 ** During a write-transaction, if pages with page-numbers greater than

	2493 ** dbSize are modified in the cache, dbSize is updated accordingly.

	2494 ** Similarly, if the database is truncated using PagerTruncateImage(),

	2495 ** dbSize is updated.

	2496 **

	2497 ** Variables dbOrigSize and dbFileSize are valid in states

	2498 ** PAGER_WRITER_LOCKED and higher. dbOrigSize is a copy of the dbSize

	2499 ** variable at the start of the transaction. It is used during rollback,

	2500 ** and to determine whether or not pages need to be journalled before

	2501 ** being modified.

	2502 **

	2503 ** Throughout a write-transaction, dbFileSize contains the size of

	2504 ** the file on disk in pages. It is set to a copy of dbSize when the

	2505 ** write-transaction is first opened, and updated when VFS calls are made

	2506 ** to write or truncate the database file on disk.

	2507 **

	2508 ** The only reason the dbFileSize variable is required is to suppress

	2509 ** unnecessary calls to xTruncate() after committing a transaction. If,

	2510 ** when a transaction is committed, the dbFileSize variable indicates

	2511 ** that the database file is larger than the database image (Pager.dbSize),

	2512 ** pager_truncate() is called. The pager_truncate() call uses xFilesize()

	2513 ** to measure the database file on disk, and then truncates it if required.

	2514 ** dbFileSize is not used when rolling back a transaction. In this case

	2515 ** pager_truncate() is called unconditionally (which means there may be

	2516 ** a call to xFilesize() that is not strictly required). In either case,

	2517 ** pager_truncate() may cause the file to become smaller or larger.

	2518 **

	2519 ** dbHintSize

	2520 **

	2521 ** The dbHintSize variable is used to limit the number of calls made to

	2522 ** the VFS xFileControl(FCNTL_SIZE_HINT) method.

	2523 **

	2524 ** dbHintSize is set to a copy of the dbSize variable when a

	2525 ** write-transaction is opened (at the same time as dbFileSize and

	2526 ** dbOrigSize). If the xFileControl(FCNTL_SIZE_HINT) method is called,

	2527 ** dbHintSize is increased to the number of pages that correspond to the

	2528 ** size-hint passed to the method call. See pager_write_pagelist() for

	2529 ** details.

	2530 **

	2531 ** errCode

	2532 **

	2533 ** The Pager.errCode variable is only ever used in PAGER_ERROR state. It

	2534 ** is set to zero in all other states. In PAGER_ERROR state, Pager.errCode

	2535 ** is always set to SQLITE_FULL, SQLITE_IOERR or one of the SQLITE_IOERR_XXX

	2536 ** sub-codes.

	2537 */

	2538 struct Pager {

	2539 sqlite3_vfs *pVfs; /* OS functions to use for IO */

	2540 u8 exclusiveMode; /* Boolean. True if locking_mode==EXCLUSIVE */

	2541 u8 journalMode; /* One of the PAGER_JOURNALMODE_* values */

	2542 u8 useJournal; /* Use a rollback journal on this file */

	2543 u8 noSync; /* Do not sync the journal if true */

	2544 u8 fullSync; /* Do extra syncs of the journal for robustness */

	2545 u8 extraSync; /* sync directory after journal delete */

	2546 u8 ckptSyncFlags; /* SYNC_NORMAL or SYNC_FULL for checkpoint */

	2547 u8 walSyncFlags; /* SYNC_NORMAL or SYNC_FULL for wal writes */

	2548 u8 syncFlags; /* SYNC_NORMAL or SYNC_FULL otherwise */

	2549 u8 tempFile; /* zFilename is a temporary or immutable file */

	2550 u8 noLock; /* Do not lock (except in WAL mode) */

	2551 u8 readOnly; /* True for a read-only database */

	2552 u8 memDb; /* True to inhibit all file I/O */

	2553

	2554 /**************************************************************************

	2555 ** The following block contains those class members that change during

	2556 ** routine operation. Class members not in this block are either fixed

	2557 ** when the pager is first created or else only change when there is a

	2558 ** significant mode change (such as changing the page_size, locking_mode,

	2559 ** or the journal_mode). From another view, these class members describe

	2560 ** the "state" of the pager, while other class members describe the

	2561 ** "configuration" of the pager.

	2562 */

	2563 u8 eState; /* Pager state (OPEN, READER, WRITER_LOCKED..) */

	2564 u8 eLock; /* Current lock held on database file */

	2565 u8 changeCountDone; /* Set after incrementing the change-counter */

	2566 u8 setMaster; /* True if a m-j name has been written to jrnl */

	2567 u8 doNotSpill; /* Do not spill the cache when non-zero */

	2568 u8 subjInMemory; /* True to use in-memory sub-journals */

	2569 u8 bUseFetch; /* True to use xFetch() */

	2570 u8 hasHeldSharedLock; /* True if a shared lock has ever been held */

	2571 Pgno dbSize; /* Number of pages in the database */

	2572 Pgno dbOrigSize; /* dbSize before the current transaction */

	2573 Pgno dbFileSize; /* Number of pages in the database file */

	2574 Pgno dbHintSize; /* Value passed to FCNTL_SIZE_HINT call */

	2575 int errCode; /* One of several kinds of errors */

	2576 int nRec; /* Pages journalled since last j-header written */

	2577 u32 cksumInit; /* Quasi-random value added to every checksum */

	2578 u32 nSubRec; /* Number of records written to sub-journal */

	2579 Bitvec *pInJournal; /* One bit for each page in the database file */

	2580 sqlite3_file *fd; /* File descriptor for database */

	2581 sqlite3_file *jfd; /* File descriptor for main journal */

	2582 sqlite3_file *sjfd; /* File descriptor for sub-journal */

	2583 i64 journalOff; /* Current write offset in the journal file */

	2584 i64 journalHdr; /* Byte offset to previous journal header */

	2585 sqlite3_backup *pBackup; /* Pointer to list of ongoing backup processes */

	2586 PagerSavepoint *aSavepoint; /* Array of active savepoints */

	2587 int nSavepoint; /* Number of elements in aSavepoint[] */

	2588 u32 iDataVersion; /* Changes whenever database content changes */

	2589 char dbFileVers[16]; /* Changes whenever database file changes */

	2590

	2591 int nMmapOut; /* Number of mmap pages currently outstanding */

	2592 sqlite3_int64 szMmap; /* Desired maximum mmap size */

	2593 PgHdr *pMmapFreelist; /* List of free mmap page headers (pDirty) */

	2594 /*

	2595 ** End of the routinely-changing class members

	2596 ***************************************************************************/

	2597

	2598 u16 nExtra; /* Add this many bytes to each in-memory page */

	2599 i16 nReserve; /* Number of unused bytes at end of each page */

	2600 u32 vfsFlags; /* Flags for sqlite3_vfs.xOpen() */

	2601 u32 sectorSize; /* Assumed sector size during rollback */

	2602 int pageSize; /* Number of bytes in a page */

	2603 Pgno mxPgno; /* Maximum allowed size of the database */

	2604 i64 journalSizeLimit; /* Size limit for persistent journal files */

	2605 char *zFilename; /* Name of the database file */

	2606 char *zJournal; /* Name of the journal file */

	2607 int (*xBusyHandler)(void*); /* Function to call when busy */

	2608 void *pBusyHandlerArg; /* Context argument for xBusyHandler */

	2609 int aStat[3]; /* Total cache hits, misses and writes */

	2610 #ifdef SQLITE_TEST

	2611 int nRead; /* Database pages read */

	2612 #endif

	2613 void (*xReiniter)(DbPage*); /* Call this routine when reloading pages */

	2614 int (*xGet)(Pager*,Pgno,DbPage**,int); /* Routine to fetch a page */

	2615 #ifdef SQLITE_HAS_CODEC

	2616 void *(*xCodec)(void*,void*,Pgno,int); /* Routine for en/decoding data */

	2617 void (*xCodecSizeChng)(void*,int,int); /* Notify of page size changes */

	2618 void (*xCodecFree)(void*); /* Destructor for the codec */

	2619 void *pCodec; /* First argument to xCodec... methods */

	2620 #endif

	2621 char *pTmpSpace; /* Pager.pageSize bytes of space for tmp use */

	2622 PCache *pPCache; /* Pointer to page cache object */

	2623 #ifndef SQLITE_OMIT_WAL

	2624 Wal *pWal; /* Write-ahead log used by "journal_mode=wal" */

	2625 char *zWal; /* File name for write-ahead log */

	2626 #endif

	2627 };
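
The dbSize rule documented in the Pager comment (round down to whole pages, but treat any non-empty file as at least one page) can be sketched as follows. The function name pages_from_file_size() is hypothetical and used only for illustration; it is not claimed to be how the pager itself computes the value:

```c
#include <stdint.h>

/* Hypothetical helper: number of pages implied by a file of nByte bytes
** with the given page size. The quotient is rounded down, except that
** any file larger than 0 bytes counts as at least one page. */
static uint32_t pages_from_file_size(int64_t nByte, int pageSize){
  uint32_t nPage = (uint32_t)(nByte / pageSize);
  if( nPage==0 && nByte>0 ){
    nPage = 1;
  }
  return nPage;
}
```

With a 2K page size this matches the two cases in the comment: a 5KB file gives 2 pages and a 1KB file gives 1 page.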

	2628

	2629 /*

	2630 ** Indexes for use with Pager.aStat[]. The Pager.aStat[] array contains

	2631 ** the values accessed by passing SQLITE_DBSTATUS_CACHE_HIT, CACHE_MISS

	2632 ** or CACHE_WRITE to sqlite3_db_status().

	2633 */

	2634 #define PAGER_STAT_HIT 0

	2635 #define PAGER_STAT_MISS 1

	2636 #define PAGER_STAT_WRITE 2

	2637

	2638 /*

	2639 ** The following global variables hold counters used for

	2640 ** testing purposes only. These variables do not exist in

	2641 ** a non-testing build. These variables are not thread-safe.

	2642 */

	2643 #ifdef SQLITE_TEST

	2644 SQLITE_API int sqlite3_pager_readdb_count = 0; /* Number of full pages read from DB */

	2645 SQLITE_API int sqlite3_pager_writedb_count = 0; /* Number of full pages written to DB */

	2646 SQLITE_API int sqlite3_pager_writej_count = 0; /* Number of pages written to journal */

	2647 # define PAGER_INCR(v) v++

	2648 #else

	2649 # define PAGER_INCR(v)

	2650 #endif

	2651

	2652

	2653

	2654 /*

	2655 ** Journal files begin with the following magic string. The data

	2656 ** was obtained from /dev/random. It is used only as a sanity check.

	2657 **

	2658 ** Since version 2.8.0, the journal format contains additional sanity

	2659 ** checking information. If the power fails while the journal is being

	2660 ** written, semi-random garbage data might appear in the journal

	2661 ** file after power is restored. If an attempt is then made

	2662 ** to roll the journal back, the database could be corrupted. The additional

	2663 ** sanity checking data is an attempt to discover the garbage in the

	2664 ** journal and ignore it.

	2665 **

	2666 ** The sanity checking information for the new journal format consists

	2667 ** of a 32-bit checksum on each page of data. The checksum covers both

	2668 ** the page number and the pPager->pageSize bytes of data for the page.

	2669 ** This cksum is initialized to a 32-bit random value that appears in the

	2670 ** journal file right after the header. The random initializer is important,

	2671 ** because garbage data that appears at the end of a journal is likely

	2672 ** data that was once in other files that have now been deleted. If the

	2673 ** garbage data came from an obsolete journal file, the checksums might

	2674 ** be correct. But by initializing the checksum to a random value which

	2675 ** is different for every journal, we minimize that risk.

	2676 */

	2677 static const unsigned char aJournalMagic[] = {

	2678 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7,

	2679 };

	2680

	2681 /*

	2682 ** The size of each page record in the journal is given by

	2683 ** the following macro.

	2684 */

	2685 #define JOURNAL_PG_SZ(pPager) ((pPager->pageSize) + 8)

	2686

	2687 /*

	2688 ** The journal header size for this pager. This is usually the same

	2689 ** size as a single disk sector. See also setSectorSize().

	2690 */

	2691 #define JOURNAL_HDR_SZ(pPager) (pPager->sectorSize)
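
As a worked example of the two size macros: each page record is the page data plus 8 bytes (a 4-byte page number before the data and a 4-byte checksum after it), and the header occupies one sector. The sketch below uses hypothetical standalone functions rather than the macros, and assumes a journal consisting of a single header followed by nRec records:

```c
#include <stdint.h>

/* Hypothetical helper: bytes used by one page record, mirroring
** JOURNAL_PG_SZ: 4-byte page number + page data + 4-byte checksum. */
static int64_t journal_record_size(int pageSize){
  return (int64_t)pageSize + 8;
}

/* Hypothetical helper: bytes used by one journal header (assumed to be
** one sector, mirroring JOURNAL_HDR_SZ) followed by nRec page records. */
static int64_t journal_bytes(int sectorSize, int pageSize, int nRec){
  return (int64_t)sectorSize + (int64_t)nRec * journal_record_size(pageSize);
}
```

For a 512-byte sector and 1024-byte pages, a journal holding one record occupies 512 + 1032 = 1544 bytes.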

	2692

	2693 /*

	2694 ** The macro MEMDB is true if we are dealing with an in-memory database.

	2695 ** We do this as a macro so that if the SQLITE_OMIT_MEMORYDB macro is set,

	2696 ** the value of MEMDB will be a constant and the compiler will optimize

	2697 ** out code that would never execute.

	2698 */

	2699 #ifdef SQLITE_OMIT_MEMORYDB

	2700 # define MEMDB 0

	2701 #else

	2702 # define MEMDB pPager->memDb

	2703 #endif

	2704

	2705 /*

	2706 ** The macro USEFETCH is true if we are allowed to use the xFetch and xUnfetch

	2707 ** interfaces to access the database using memory-mapped I/O.

	2708 */

	2709 #if SQLITE_MAX_MMAP_SIZE>0

	2710 # define USEFETCH(x) ((x)->bUseFetch)

	2711 #else

	2712 # define USEFETCH(x) 0

	2713 #endif

	2714

	2715 /*

	2716 ** The maximum legal page number is (2^31 - 1).

	2717 */

	2718 #define PAGER_MAX_PGNO 2147483647

	2719

	2720 /*

	2721 ** The argument to this macro is a file descriptor (type sqlite3_file*).

	2722 ** Return 0 if it is not open, or non-zero if it is.

	2723 **

	2724 ** This is so that expressions can be written as:

	2725 **

	2726 ** if( isOpen(pPager->jfd) ){ ...

	2727 **

	2728 ** instead of

	2729 **

	2730 ** if( pPager->jfd->pMethods ){ ...

	2731 */

	2732 #define isOpen(pFd) ((pFd)->pMethods!=0)

	2733

	2734 /*

	2735 ** Return true if this pager uses a write-ahead log to read page pgno.

	2736 ** Return false if the pager reads pgno directly from the database.

	2737 */

	2738 #if !defined(SQLITE_OMIT_WAL) && defined(SQLITE_DIRECT_OVERFLOW_READ)

	2739 SQLITE_PRIVATE int sqlite3PagerUseWal(Pager *pPager, Pgno pgno){

	2740 u32 iRead = 0;

	2741 int rc;

	2742 if( pPager->pWal==0 ) return 0;

	2743 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iRead);

	2744 return rc || iRead;

	2745 }

	2746 #endif

	2747 #ifndef SQLITE_OMIT_WAL

	2748 # define pagerUseWal(x) ((x)->pWal!=0)

	2749 #else

	2750 # define pagerUseWal(x) 0

	2751 # define pagerRollbackWal(x) 0

	2752 # define pagerWalFrames(v,w,x,y) 0

	2753 # define pagerOpenWalIfPresent(z) SQLITE_OK

	2754 # define pagerBeginReadTransaction(z) SQLITE_OK

	2755 #endif

	2756

	2757 #ifndef NDEBUG

	2758 /*

	2759 ** Usage:

	2760 **

	2761 ** assert( assert_pager_state(pPager) );

	2762 **

	2763 ** This function runs many asserts to try to find inconsistencies in

	2764 ** the internal state of the Pager object.

	2765 */

	2766 static int assert_pager_state(Pager *p){

	2767 Pager *pPager = p;

	2768

	2769 /* State must be valid. */

	2770 assert( p->eState==PAGER_OPEN

	2771 || p->eState==PAGER_READER

	2772 || p->eState==PAGER_WRITER_LOCKED

	2773 || p->eState==PAGER_WRITER_CACHEMOD

	2774 || p->eState==PAGER_WRITER_DBMOD

	2775 || p->eState==PAGER_WRITER_FINISHED

	2776 || p->eState==PAGER_ERROR

	2777 );

	2778

	2779 /* Regardless of the current state, a temp-file connection always behaves

	2780 ** as if it has an exclusive lock on the database file. It never updates

	2781 ** the change-counter field, so the changeCountDone flag is always set.

	2782 */

	2783 assert( p->tempFile==0 || p->eLock==EXCLUSIVE_LOCK );

	2784 assert( p->tempFile==0 || pPager->changeCountDone );

	2785

	2786 /* If the useJournal flag is clear, the journal-mode must be "OFF".

	2787 ** And if the journal-mode is "OFF", the journal file must not be open.

	2788 */

	2789 assert( p->journalMode==PAGER_JOURNALMODE_OFF || p->useJournal );

	2790 assert( p->journalMode!=PAGER_JOURNALMODE_OFF || !isOpen(p->jfd) );

	2791

	2792 /* Check that MEMDB implies noSync. And an in-memory journal. Since

	2793 ** this means an in-memory pager performs no IO at all, it cannot encounter

	2794 ** either SQLITE_IOERR or SQLITE_FULL during rollback or while finalizing

	2795 ** a journal file. (although the in-memory journal implementation may

	2796 ** return SQLITE_IOERR_NOMEM while the journal file is being written). It

	2797 ** is therefore not possible for an in-memory pager to enter the ERROR

	2798 ** state.

	2799 */

	2800 if( MEMDB ){

	2801 assert( !isOpen(p->fd) );

	2802 assert( p->noSync );

	2803 assert( p->journalMode==PAGER_JOURNALMODE_OFF

	2804 || p->journalMode==PAGER_JOURNALMODE_MEMORY

	2805 );

	2806 assert( p->eState!=PAGER_ERROR && p->eState!=PAGER_OPEN );

	2807 assert( pagerUseWal(p)==0 );

	2808 }

	2809

	2810 /* If changeCountDone is set, a RESERVED lock or greater must be held

	2811 ** on the file.

	2812 */

	2813 assert( pPager->changeCountDone==0 || pPager->eLock>=RESERVED_LOCK );

	2814 assert( p->eLock!=PENDING_LOCK );

	2815

	2816 switch( p->eState ){

	2817 case PAGER_OPEN:

	2818 assert( !MEMDB );

	2819 assert( pPager->errCode==SQLITE_OK );

	2820 assert( sqlite3PcacheRefCount(pPager->pPCache)==0 || pPager->tempFile );

	2821 break;

	2822

	2823 case PAGER_READER:

	2824 assert( pPager->errCode==SQLITE_OK );

	2825 assert( p->eLock!=UNKNOWN_LOCK );

	2826 assert( p->eLock>=SHARED_LOCK );

	2827 break;

	2828

	2829 case PAGER_WRITER_LOCKED:

	2830 assert( p->eLock!=UNKNOWN_LOCK );

	2831 assert( pPager->errCode==SQLITE_OK );

	2832 if( !pagerUseWal(pPager) ){

	2833 assert( p->eLock>=RESERVED_LOCK );

	2834 }

	2835 assert( pPager->dbSize==pPager->dbOrigSize );

	2836 assert( pPager->dbOrigSize==pPager->dbFileSize );

	2837 assert( pPager->dbOrigSize==pPager->dbHintSize );

	2838 assert( pPager->setMaster==0 );

	2839 break;

	2840

	2841 case PAGER_WRITER_CACHEMOD:

	2842 assert( p->eLock!=UNKNOWN_LOCK );

	2843 assert( pPager->errCode==SQLITE_OK );

	2844 if( !pagerUseWal(pPager) ){

	2845 /* It is possible that if journal_mode=wal here that neither the

	2846 ** journal file nor the WAL file are open. This happens during

	2847 ** a rollback transaction that switches from journal_mode=off

	2848 ** to journal_mode=wal.

	2849 */

	2850 assert( p->eLock>=RESERVED_LOCK );

	2851 assert( isOpen(p->jfd)

	2852 || p->journalMode==PAGER_JOURNALMODE_OFF

	2853 || p->journalMode==PAGER_JOURNALMODE_WAL

	2854 );

	2855 }

	2856 assert( pPager->dbOrigSize==pPager->dbFileSize );

	2857 assert( pPager->dbOrigSize==pPager->dbHintSize );

	2858 break;

	2859

	2860 case PAGER_WRITER_DBMOD:

	2861 assert( p->eLock==EXCLUSIVE_LOCK );

	2862 assert( pPager->errCode==SQLITE_OK );

	2863 assert( !pagerUseWal(pPager) );

	2864 assert( p->eLock>=EXCLUSIVE_LOCK );

	2865 assert( isOpen(p->jfd)

	2866 || p->journalMode==PAGER_JOURNALMODE_OFF

	2867 || p->journalMode==PAGER_JOURNALMODE_WAL

	2868 );

	2869 assert( pPager->dbOrigSize<=pPager->dbHintSize );

	2870 break;

	2871

	2872 case PAGER_WRITER_FINISHED:

	2873 assert( p->eLock==EXCLUSIVE_LOCK );

	2874 assert( pPager->errCode==SQLITE_OK );

	2875 assert( !pagerUseWal(pPager) );

	2876 assert( isOpen(p->jfd)

	2877 || p->journalMode==PAGER_JOURNALMODE_OFF

	2878 || p->journalMode==PAGER_JOURNALMODE_WAL

	2879 );

	2880 break;

	2881

	2882 case PAGER_ERROR:

	2883 /* There must be at least one outstanding reference to the pager if

	2884 ** in ERROR state. Otherwise the pager should have already dropped

	2885 ** back to OPEN state.

	2886 */

	2887 assert( pPager->errCode!=SQLITE_OK );

	2888 assert( sqlite3PcacheRefCount(pPager->pPCache)>0 || pPager->tempFile );

	2889 break;

	2890 }

	2891

	2892 return 1;

	2893 }

	2894 #endif /* ifndef NDEBUG */

	2895

	2896 #ifdef SQLITE_DEBUG

	2897 /*

	2898 ** Return a pointer to a human readable string in a static buffer

	2899 ** containing the state of the Pager object passed as an argument. This

	2900 ** is intended to be used within debuggers. For example, as an alternative

	2901 ** to "print *pPager" in gdb:

	2902 **

	2903 ** (gdb) printf "%s", print_pager_state(pPager)

	2904 */

	2905 static char *print_pager_state(Pager *p){

	2906 static char zRet[1024];

	2907

	2908 sqlite3_snprintf(1024, zRet,

	2909 "Filename: %s\n"

	2910 "State: %s errCode=%d\n"

	2911 "Lock: %s\n"

	2912 "Locking mode: locking_mode=%s\n"

	2913 "Journal mode: journal_mode=%s\n"

	2914 "Backing store: tempFile=%d memDb=%d useJournal=%d\n"

	2915 "Journal: journalOff=%lld journalHdr=%lld\n"

	2916 "Size: dbsize=%d dbOrigSize=%d dbFileSize=%d\n"

	2917 , p->zFilename

	2918 , p->eState==PAGER_OPEN ? "OPEN" :

	2919 p->eState==PAGER_READER ? "READER" :

	2920 p->eState==PAGER_WRITER_LOCKED ? "WRITER_LOCKED" :

	2921 p->eState==PAGER_WRITER_CACHEMOD ? "WRITER_CACHEMOD" :

	2922 p->eState==PAGER_WRITER_DBMOD ? "WRITER_DBMOD" :

	2923 p->eState==PAGER_WRITER_FINISHED ? "WRITER_FINISHED" :

	2924 p->eState==PAGER_ERROR ? "ERROR" : "?error?"

	2925 , (int)p->errCode

	2926 , p->eLock==NO_LOCK ? "NO_LOCK" :

	2927 p->eLock==RESERVED_LOCK ? "RESERVED" :

	2928 p->eLock==EXCLUSIVE_LOCK ? "EXCLUSIVE" :

	2929 p->eLock==SHARED_LOCK ? "SHARED" :

	2930 p->eLock==UNKNOWN_LOCK ? "UNKNOWN" : "?error?"

	2931 , p->exclusiveMode ? "exclusive" : "normal"

	2932 , p->journalMode==PAGER_JOURNALMODE_MEMORY ? "memory" :

	2933 p->journalMode==PAGER_JOURNALMODE_OFF ? "off" :

	2934 p->journalMode==PAGER_JOURNALMODE_DELETE ? "delete" :

	2935 p->journalMode==PAGER_JOURNALMODE_PERSIST ? "persist" :

	2936 p->journalMode==PAGER_JOURNALMODE_TRUNCATE ? "truncate" :

	2937 p->journalMode==PAGER_JOURNALMODE_WAL ? "wal" : "?error?"

	2938 , (int)p->tempFile, (int)p->memDb, (int)p->useJournal

	2939 , p->journalOff, p->journalHdr

	2940 , (int)p->dbSize, (int)p->dbOrigSize, (int)p->dbFileSize

	2941 );

	2942

	2943 return zRet;

	2944 }

	2945 #endif

	2946

	2947 /* Forward references to the various page getters */

	2948 static int getPageNormal(Pager*,Pgno,DbPage**,int);

	2949 static int getPageError(Pager*,Pgno,DbPage**,int);

	2950 #if SQLITE_MAX_MMAP_SIZE>0

	2951 static int getPageMMap(Pager*,Pgno,DbPage**,int);

	2952 #endif

	2953

	2954 /*

	2955 ** Set the Pager.xGet method for the appropriate routine used to fetch

	2956 ** content from the pager.

	2957 */

	2958 static void setGetterMethod(Pager *pPager){

	2959 if( pPager->errCode ){

	2960 pPager->xGet = getPageError;

	2961 #if SQLITE_MAX_MMAP_SIZE>0

	2962 }else if( USEFETCH(pPager)

	2963 #ifdef SQLITE_HAS_CODEC

	2964 && pPager->xCodec==0

	2965 #endif

	2966 ){

	2967 pPager->xGet = getPageMMap;

	2968 #endif /* SQLITE_MAX_MMAP_SIZE>0 */

	2969 }else{

	2970 pPager->xGet = getPageNormal;

	2971 }

	2972 }

	2973

	2974 /*

	2975 ** Return true if it is necessary to write page *pPg into the sub-journal.

	2976 ** A page needs to be written into the sub-journal if there exists one

	2977 ** or more open savepoints for which:

	2978 **

	2979 ** * The page-number is less than or equal to PagerSavepoint.nOrig, and

	2980 ** * The bit corresponding to the page-number is not set in

	2981 ** PagerSavepoint.pInSavepoint.

	2982 */

	2983 static int subjRequiresPage(PgHdr *pPg){

	2984 Pager *pPager = pPg->pPager;

	2985 PagerSavepoint *p;

	2986 Pgno pgno = pPg->pgno;

	2987 int i;

	2988 for(i=0; i<pPager->nSavepoint; i++){

	2989 p = &pPager->aSavepoint[i];

	2990 if( p->nOrig>=pgno && 0==sqlite3BitvecTestNotNull(p->pInSavepoint, pgno) ){

	2991 return 1;

	2992 }

	2993 }

	2994 return 0;

	2995 }

	2996

	2997 #ifdef SQLITE_DEBUG

	2998 /*

	2999 ** Return true if the page is already in the journal file.

	3000 */

	3001 static int pageInJournal(Pager *pPager, PgHdr *pPg){

	3002 return sqlite3BitvecTest(pPager->pInJournal, pPg->pgno);

	3003 }

	3004 #endif

	3005

	3006 /*

	3007 ** Read a 32-bit integer from the given file descriptor. Store the integer

	3008 ** that is read in *pRes. Return SQLITE_OK if everything worked, or an

	3009 ** error code if something goes wrong.

	3010 **

	3011 ** All values are stored on disk as big-endian.

	3012 */

	3013 static int read32bits(sqlite3_file *fd, i64 offset, u32 *pRes){

	3014 unsigned char ac[4];

	3015 int rc = sqlite3OsRead(fd, ac, sizeof(ac), offset);

	3016 if( rc==SQLITE_OK ){

	3017 *pRes = sqlite3Get4byte(ac);

	3018 }

	3019 return rc;

	3020 }

	3021

	3022 /*

	3023 ** Write a 32-bit integer into a string buffer in big-endian byte order.

	3024 */

	3025 #define put32bits(A,B) sqlite3Put4byte((u8*)A,B)

	3026

	3027

	3028 /*

	3029 ** Write a 32-bit integer into the given file descriptor. Return SQLITE_OK

	3030 ** on success or an error code if something goes wrong.

	3031 */

	3032 static int write32bits(sqlite3_file *fd, i64 offset, u32 val){

	3033 char ac[4];

	3034 put32bits(ac, val);

	3035 return sqlite3OsWrite(fd, ac, 4, offset);

	3036 }
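
read32bits() and write32bits() rely on a big-endian 4-byte encoding. A minimal standalone sketch of that encoding is given below; put_be32() and get_be32() are hypothetical stand-ins written for illustration and are not the SQLite routines sqlite3Put4byte() and sqlite3Get4byte() themselves:

```c
#include <stdint.h>

/* Store v into p[0..3], most significant byte first (big-endian). */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Read a big-endian 32-bit value from p[0..3]. */
static uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Encode v and decode it again; returns v if the pair is consistent. */
static uint32_t be32_roundtrip(uint32_t v){
  unsigned char a[4];
  put_be32(a, v);
  return get_be32(a);
}

/* Return byte i (0..3) of the encoded form of v. */
static unsigned char be32_byte(uint32_t v, int i){
  unsigned char a[4];
  put_be32(a, v);
  return a[i];
}
```

Encoding byte by byte with shifts keeps the on-disk format independent of the host byte order.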

	3037

	3038 /*

	3039 ** Unlock the database file to level eLock, which must be either NO_LOCK

	3040 ** or SHARED_LOCK. Regardless of whether or not the call to xUnlock()

	3041 ** succeeds, set the Pager.eLock variable to match the (attempted) new lock.

	3042 **

	3043 ** Except, if Pager.eLock is set to UNKNOWN_LOCK when this function is

	3044 ** called, do not modify it. See the comment above the #define of

	3045 ** UNKNOWN_LOCK for an explanation of this.

	3046 */

	3047 static int pagerUnlockDb(Pager *pPager, int eLock){

	3048 int rc = SQLITE_OK;

	3049

	3050 assert( !pPager->exclusiveMode || pPager->eLock==eLock );

	3051 assert( eLock==NO_LOCK || eLock==SHARED_LOCK );

	3052 assert( eLock!=NO_LOCK || pagerUseWal(pPager)==0 );

	3053 if( isOpen(pPager->fd) ){

	3054 assert( pPager->eLock>=eLock );

	3055 rc = pPager->noLock ? SQLITE_OK : sqlite3OsUnlock(pPager->fd, eLock);

	3056 if( pPager->eLock!=UNKNOWN_LOCK ){

	3057 pPager->eLock = (u8)eLock;

	3058 }

	3059 IOTRACE(("UNLOCK %p %d\n", pPager, eLock))

	3060 }

	3061 return rc;

	3062 }

	3063

	3064 /*

	3065 ** Lock the database file to level eLock, which must be either SHARED_LOCK,

	3066 ** RESERVED_LOCK or EXCLUSIVE_LOCK. If the caller is successful, set the

	3067 ** Pager.eLock variable to the new locking state.

	3068 **

	3069 ** Except, if Pager.eLock is set to UNKNOWN_LOCK when this function is

	3070 ** called, do not modify it unless the new locking state is EXCLUSIVE_LOCK.

	3071 ** See the comment above the #define of UNKNOWN_LOCK for an explanation

	3072 ** of this.

	3073 */

	3074 static int pagerLockDb(Pager *pPager, int eLock){

	3075 int rc = SQLITE_OK;

	3076

	3077 assert( eLock==SHARED_LOCK || eLock==RESERVED_LOCK || eLock==EXCLUSIVE_LOCK );

	3078 if( pPager->eLock<eLock || pPager->eLock==UNKNOWN_LOCK ){

	3079 rc = pPager->noLock ? SQLITE_OK : sqlite3OsLock(pPager->fd, eLock);

	3080 if( rc==SQLITE_OK && (pPager->eLock!=UNKNOWN_LOCK||eLock==EXCLUSIVE_LOCK) ){

	3081 pPager->eLock = (u8)eLock;

	3082 IOTRACE(("LOCK %p %d\n", pPager, eLock))

	3083 }

	3084 }

	3085 return rc;

	3086 }

	3087

	3088 /*

	3089 ** This function determines whether or not the atomic-write optimization

	3090 ** can be used with this pager. The optimization can be used if:

	3091 **

	3092 ** (a) the value returned by OsDeviceCharacteristics() indicates that

	3093 ** a database page may be written atomically, and

	3094 ** (b) the value returned by OsSectorSize() is less than or equal

	3095 ** to the page size.

	3096 **

	3097 ** The optimization is also always enabled for temporary files. It is

	3098 ** an error to call this function if pPager is opened on an in-memory

	3099 ** database.

	3100 **

	3101 ** If the optimization cannot be used, 0 is returned. If it can be used,

	3102 ** then the value returned is the size of the journal file when it

	3103 ** contains rollback data for exactly one page.

	3104 */

	3105 #ifdef SQLITE_ENABLE_ATOMIC_WRITE

	3106 static int jrnlBufferSize(Pager *pPager){

	3107 assert( !MEMDB );

	3108 if( !pPager->tempFile ){

	3109 int dc; /* Device characteristics */

	3110 int nSector; /* Sector size */

	3111 int szPage; /* Page size */

	3112

	3113 assert( isOpen(pPager->fd) );

	3114 dc = sqlite3OsDeviceCharacteristics(pPager->fd);

	3115 nSector = pPager->sectorSize;

	3116 szPage = pPager->pageSize;

	3117

	3118 assert(SQLITE_IOCAP_ATOMIC512==(512>>8));

	3119 assert(SQLITE_IOCAP_ATOMIC64K==(65536>>8));

	3120 if( 0==(dc&(SQLITE_IOCAP_ATOMIC|(szPage>>8)) || nSector>szPage) ){

	3121 return 0;

	3122 }

	3123 }

	3124

	3125 return JOURNAL_HDR_SZ(pPager) + JOURNAL_PG_SZ(pPager);

	3126 }

	3127 #else

	3128 # define jrnlBufferSize(x) 0

	3129 #endif

	3130

	3131 /*

	3132 ** If SQLITE_CHECK_PAGES is defined then we do some sanity checking

	3133 ** on the cache using a hash function. This is used for testing

	3134 ** and debugging only.

	3135 */

	3136 #ifdef SQLITE_CHECK_PAGES

	3137 /*

	3138 ** Return a 32-bit hash of the page data for pPage.

	3139 */

	3140 static u32 pager_datahash(int nByte, unsigned char *pData){

	3141 u32 hash = 0;

	3142 int i;

	3143 for(i=0; i<nByte; i++){

	3144 hash = (hash*1039) + pData[i];

	3145 }

	3146 return hash;

	3147 }

	3148 static u32 pager_pagehash(PgHdr *pPage){

	3149 return pager_datahash(pPage->pPager->pageSize, (unsigned char *)pPage->pData);

	3150 }

	3151 static void pager_set_pagehash(PgHdr *pPage){

	3152 pPage->pageHash = pager_pagehash(pPage);

	3153 }

	3154

	3155 /*

	3156 ** The CHECK_PAGE macro takes a PgHdr* as an argument. If SQLITE_CHECK_PAGES

	3157 ** is defined, and NDEBUG is not defined, an assert() statement checks

	3158 ** that the page is either dirty or still matches the calculated page-hash.

	3159 */

	3160 #define CHECK_PAGE(x) checkPage(x)

	3161 static void checkPage(PgHdr *pPg){

	3162 Pager *pPager = pPg->pPager;

	3163 assert( pPager->eState!=PAGER_ERROR );

	3164 assert( (pPg->flags&PGHDR_DIRTY) || pPg->pageHash==pager_pagehash(pPg) );

	3165 }

	3166

	3167 #else

	3168 #define pager_datahash(X,Y) 0

	3169 #define pager_pagehash(X) 0

	3170 #define pager_set_pagehash(X)

	3171 #define CHECK_PAGE(x)

	3172 #endif /* SQLITE_CHECK_PAGES */
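The page-hash check above can be sketched in isolation. This is a minimal illustration, not the pager's code: `data_hash` repeats the multiply-by-1039 loop from the listing, while the `Page` struct and the `page_set_hash` and `page_is_consistent` helpers are hypothetical stand-ins for `PgHdr`, `pager_set_pagehash()`, and the `CHECK_PAGE` assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiplicative hash over a byte buffer; arithmetic wraps modulo 2^32. */
static uint32_t data_hash(const unsigned char *data, size_t n){
  uint32_t h = 0;
  size_t i;
  assert( data!=NULL || n==0 );
  for(i=0; i<n; i++){
    h = h*1039u + data[i];
  }
  return h;
}

/* Hypothetical page record: stores the hash taken while the page was clean. */
typedef struct Page {
  unsigned char data[8];  /* Page content (tiny, for illustration only) */
  uint32_t hash;          /* Hash recorded by page_set_hash() */
  int dirty;              /* Non-zero once the page is marked dirty */
} Page;

/* Record the hash of the current page content. */
static void page_set_hash(Page *p){
  p->hash = data_hash(p->data, sizeof p->data);
}

/* A page is consistent if it is marked dirty or still matches its hash. */
static int page_is_consistent(const Page *p){
  return p->dirty || p->hash==data_hash(p->data, sizeof p->data);
}
```

The point of the check is to catch writes to a page that was never marked dirty: such a write changes the content hash without setting the dirty flag, so `page_is_consistent` returns 0.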

	3173

	3174 /*

	3175 ** When this is called the journal file for pager pPager must be open.

	3176 ** This function attempts to read a master journal file name from the

	3177 ** end of the file and, if successful, copies it into memory supplied

	3178 ** by the caller. See comments above writeMasterJournal() for the format

	3179 ** used to store a master journal file name at the end of a journal file.

	3180 **

	3181 ** zMaster must point to a buffer of at least nMaster bytes allocated by

	3182 ** the caller. This should be sqlite3_vfs.mxPathname+1 (to ensure there is

	3183 ** enough space to write the master journal name). If the master journal

	3184 ** name in the journal is longer than nMaster bytes (including a

	3185 ** nul-terminator), then this is handled as if no master journal name

	3186 ** were present in the journal.

	3187 **

	3188 ** If a master journal file name is present at the end of the journal

	3189 ** file, then it is copied into the buffer pointed to by zMaster. A

	3190 ** nul-terminator byte is appended to the buffer following the master

	3191 ** journal file name.

	3192 **

	3193 ** If it is determined that no master journal file name is present

	3194 ** zMaster[0] is set to 0 and SQLITE_OK returned.

	3195 **

	3196 ** If an error occurs while reading from the journal file, an SQLite

	3197 ** error code is returned.

	3198 */

	3199 static int readMasterJournal(sqlite3_file *pJrnl, char *zMaster, u32 nMaster){

	3200 int rc; /* Return code */

	3201 u32 len; /* Length in bytes of master journal name */

	3202 i64 szJ; /* Total size in bytes of journal file pJrnl */

	3203 u32 cksum; /* MJ checksum value read from journal */

	3204 u32 u; /* Unsigned loop counter */

	3205 unsigned char aMagic[8]; /* A buffer to hold the magic header */

	3206 zMaster[0] = '\0';

	3207

	3208 if( SQLITE_OK!=(rc = sqlite3OsFileSize(pJrnl, &szJ))

	3209 || szJ<16

	3210 || SQLITE_OK!=(rc = read32bits(pJrnl, szJ-16, &len))

	3211 || len>=nMaster

	3212 || len==0

	3213 || SQLITE_OK!=(rc = read32bits(pJrnl, szJ-12, &cksum))

	3214 || SQLITE_OK!=(rc = sqlite3OsRead(pJrnl, aMagic, 8, szJ-8))

	3215 || memcmp(aMagic, aJournalMagic, 8)

	3216 || SQLITE_OK!=(rc = sqlite3OsRead(pJrnl, zMaster, len, szJ-16-len))

	3217 ){

	3218 return rc;

	3219 }

	3220

	3221 /* See if the checksum matches the master journal name */

	3222 for(u=0; u<len; u++){

	3223 cksum -= zMaster[u];

	3224 }

	3225 if( cksum ){

	3226 /* If the checksum doesn't add up, then one or more of the disk sectors

	3227 ** containing the master journal filename is corrupted. This means

	3228 ** definitely roll back, so just return SQLITE_OK and report a (nul)

	3229 ** master-journal filename.

	3230 */

	3231 len = 0;

	3232 }

	3233 zMaster[len] = '\0';

	3234

	3235 return SQLITE_OK;

	3236 }
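The checksum test in readMasterJournal() works by subtraction: the stored value is the sum of the name bytes, so subtracting each byte back out leaves zero when the name is intact. A minimal sketch, assuming bytes are treated as signed 8-bit values as the writeMasterJournal() comment states; `name_checksum` and `name_checksum_ok` are hypothetical names, not SQLite functions.

```c
#include <assert.h>
#include <stdint.h>

/* Sum the first n bytes of z, each interpreted as a signed 8-bit integer.
** The sum wraps modulo 2^32. */
static uint32_t name_checksum(const char *z, uint32_t n){
  uint32_t sum = 0;
  uint32_t i;
  assert( z!=0 || n==0 );
  for(i=0; i<n; i++){
    sum += (uint32_t)(int32_t)(signed char)z[i];
  }
  return sum;
}

/* Verify by subtracting each byte from the stored checksum. A result of
** zero means the name matches the checksum that was written with it. */
static int name_checksum_ok(const char *z, uint32_t n, uint32_t stored){
  uint32_t i;
  assert( z!=0 || n==0 );
  for(i=0; i<n; i++){
    stored -= (uint32_t)(int32_t)(signed char)z[i];
  }
  return stored==0;
}
```

Casting through `signed char` makes the result independent of whether plain `char` is signed on the target platform, which the original loop leaves to the compiler.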

	3237

	3238 /*

	3239 ** Return the offset of the sector boundary at or immediately

	3240 ** following the value in pPager->journalOff, assuming a sector

	3241 ** size of pPager->sectorSize bytes.

	3242 **

	3243 ** i.e. for a sector size of 512:

	3244 **

	3245 **   Pager.journalOff    Return value

	3246 **   ---------------------------------------

	3247 **   0                   0

	3248 **   512                 512

	3249 **   100                 512

	3250 **   2000                2048

	3251 **

	3252 */

	3253 static i64 journalHdrOffset(Pager *pPager){

	3254 i64 offset = 0;

	3255 i64 c = pPager->journalOff;

	3256 if( c ){

	3257 offset = ((c-1)/JOURNAL_HDR_SZ(pPager) + 1) * JOURNAL_HDR_SZ(pPager);

	3258 }

	3259 assert( offset%JOURNAL_HDR_SZ(pPager)==0 );

	3260 assert( offset>=c );

	3261 assert( (offset-c)<JOURNAL_HDR_SZ(pPager) );

	3262 return offset;

	3263 }
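The rounding in journalHdrOffset() can be tried on its own. A minimal sketch, assuming a plain sector-size parameter in place of the `JOURNAL_HDR_SZ(pPager)` macro; `next_boundary` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Return the smallest multiple of sector that is >= off.
** An offset of 0 is already on a boundary and maps to 0. */
static int64_t next_boundary(int64_t off, int64_t sector){
  int64_t r = 0;
  assert( off>=0 && sector>0 );
  if( off>0 ){
    r = ((off-1)/sector + 1) * sector;
  }
  /* Same postconditions the original function asserts */
  assert( r%sector==0 );
  assert( r>=off && (r-off)<sector );
  return r;
}
```

With a sector size of 512 this reproduces the table in the comment above: 0, 512, 100, and 2000 map to 0, 512, 512, and 2048. Using `(off-1)/sector + 1` rather than `(off+sector-1)/sector` keeps an offset that already sits on a boundary unchanged without a separate test.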

	3264

	3265 /*

	3266 ** The journal file must be open when this function is called.

	3267 **

	3268 ** This function is a no-op if the journal file has not been written to

	3269 ** within the current transaction (i.e. if Pager.journalOff==0).

	3270 **

	3271 ** If doTruncate is non-zero or the Pager.journalSizeLimit variable is

	3272 ** set to 0, then truncate the journal file to zero bytes in size. Otherwise,

	3273 ** zero the 28-byte header at the start of the journal file. In either case,

	3274 ** if the pager is not in no-sync mode, sync the journal file immediately

	3275 ** after writing or truncating it.

	3276 **

	3277 ** If Pager.journalSizeLimit is set to a positive, non-zero value, and

	3278 ** following the truncation or zeroing described above the size of the

	3279 ** journal file in bytes is larger than this value, then truncate the

	3280 ** journal file to Pager.journalSizeLimit bytes. The journal file does

	3281 ** not need to be synced following this operation.

	3282 **

	3283 ** If an IO error occurs, abandon processing and return the IO error code.

	3284 ** Otherwise, return SQLITE_OK.

	3285 */

	3286 static int zeroJournalHdr(Pager *pPager, int doTruncate){

	3287 int rc = SQLITE_OK; /* Return code */

	3288 assert( isOpen(pPager->jfd) );

	3289 assert( !sqlite3JournalIsInMemory(pPager->jfd) );

	3290 if( pPager->journalOff ){

	3291 const i64 iLimit = pPager->journalSizeLimit; /* Local cache of jsl */

	3292

	3293 IOTRACE(("JZEROHDR %p\n", pPager))

	3294 if( doTruncate || iLimit==0 ){

	3295 rc = sqlite3OsTruncate(pPager->jfd, 0);

	3296 }else{

	3297 static const char zeroHdr[28] = {0};

	3298 rc = sqlite3OsWrite(pPager->jfd, zeroHdr, sizeof(zeroHdr), 0);

	3299 }

	3300 if( rc==SQLITE_OK && !pPager->noSync ){

	3301 rc = sqlite3OsSync(pPager->jfd, SQLITE_SYNC_DATAONLY|pPager->syncFlags);

	3302 }

	3303

	3304 /* At this point the transaction is committed but the write lock

	3305 ** is still held on the file. If there is a size limit configured for

	3306 ** the persistent journal and the journal file currently consumes more

	3307 ** space than that limit allows for, truncate it now. There is no need

	3308 ** to sync the file following this operation.

	3309 */

	3310 if( rc==SQLITE_OK && iLimit>0 ){

	3311 i64 sz;

	3312 rc = sqlite3OsFileSize(pPager->jfd, &sz);

	3313 if( rc==SQLITE_OK && sz>iLimit ){

	3314 rc = sqlite3OsTruncate(pPager->jfd, iLimit);

	3315 }

	3316 }

	3317 }

	3318 return rc;

	3319 }

	3320

	3321 /*

	3322 ** The journal file must be open when this routine is called. A journal

	3323 ** header (JOURNAL_HDR_SZ bytes) is written into the journal file at the

	3324 ** current location.

	3325 **

	3326 ** The format for the journal header is as follows:

	3327 ** - 8 bytes: Magic identifying journal format.

	3328 ** - 4 bytes: Number of records in journal, or -1 if no-sync mode is on.

	3329 ** - 4 bytes: Random number used for page hash.

	3330 ** - 4 bytes: Initial database page count.

	3331 ** - 4 bytes: Sector size used by the process that wrote this journal.

	3332 ** - 4 bytes: Database page size.

	3333 **

	3334 ** Followed by (JOURNAL_HDR_SZ - 28) bytes of unused space.

	3335 */

	3336 static int writeJournalHdr(Pager *pPager){

	3337 int rc = SQLITE_OK; /* Return code */

	3338 char *zHeader = pPager->pTmpSpace; /* Temporary space used to build header */

	3339 u32 nHeader = (u32)pPager->pageSize;/* Size of buffer pointed to by zHeader */

	3340 u32 nWrite; /* Bytes of header sector written */

	3341 int ii; /* Loop counter */

	3342

	3343 assert( isOpen(pPager->jfd) ); /* Journal file must be open. */

	3344

	3345 if( nHeader>JOURNAL_HDR_SZ(pPager) ){

	3346 nHeader = JOURNAL_HDR_SZ(pPager);

	3347 }

	3348

	3349 /* If there are active savepoints and any of them were created

	3350 ** since the most recent journal header was written, update the

	3351 ** PagerSavepoint.iHdrOffset fields now.

	3352 */

	3353 for(ii=0; ii<pPager->nSavepoint; ii++){

	3354 if( pPager->aSavepoint[ii].iHdrOffset==0 ){

	3355 pPager->aSavepoint[ii].iHdrOffset = pPager->journalOff;

	3356 }

	3357 }

	3358

	3359 pPager->journalHdr = pPager->journalOff = journalHdrOffset(pPager);

	3360

	3361 /*

	3362 ** Write the nRec Field - the number of page records that follow this

	3363 ** journal header. Normally, zero is written to this value at this time.

	3364 ** After the records are added to the journal (and the journal synced,

	3365 ** if in full-sync mode), the zero is overwritten with the true number

	3366 ** of records (see syncJournal()).

	3367 **

	3368 ** A faster alternative is to write 0xFFFFFFFF to the nRec field. When

	3369 ** reading the journal this value tells SQLite to assume that the

	3370 ** rest of the journal file contains valid page records. This assumption

	3371 ** is dangerous, as if a failure occurred whilst writing to the journal

	3372 ** file it may contain some garbage data. There are two scenarios

	3373 ** where this risk can be ignored:

	3374 **

	3375 ** * When the pager is in no-sync mode. Corruption can follow a

	3376 ** power failure in this case anyway.

	3377 **

	3378 ** * When the SQLITE_IOCAP_SAFE_APPEND flag is set. This guarantees

	3379 ** that garbage data is never appended to the journal file.

	3380 */

	3381 assert( isOpen(pPager->fd) || pPager->noSync );

	3382 if( pPager->noSync || (pPager->journalMode==PAGER_JOURNALMODE_MEMORY)

	3383 || (sqlite3OsDeviceCharacteristics(pPager->fd)&SQLITE_IOCAP_SAFE_APPEND)

	3384 ){

	3385 memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic));

	3386 put32bits(&zHeader[sizeof(aJournalMagic)], 0xffffffff);

	3387 }else{

	3388 memset(zHeader, 0, sizeof(aJournalMagic)+4);

	3389 }

	3390

	3391 /* The random check-hash initializer */

	3392 sqlite3_randomness(sizeof(pPager->cksumInit), &pPager->cksumInit);

	3393 put32bits(&zHeader[sizeof(aJournalMagic)+4], pPager->cksumInit);

	3394 /* The initial database size */

	3395 put32bits(&zHeader[sizeof(aJournalMagic)+8], pPager->dbOrigSize);

	3396 /* The assumed sector size for this process */

	3397 put32bits(&zHeader[sizeof(aJournalMagic)+12], pPager->sectorSize);

	3398

	3399 /* The page size */

	3400 put32bits(&zHeader[sizeof(aJournalMagic)+16], pPager->pageSize);

	3401

	3402 /* Initializing the tail of the buffer is not necessary. Everything

	3403 ** works fine if the following memset() is omitted. But initializing

	3404 ** the memory prevents valgrind from complaining, so we are willing to

	3405 ** take the performance hit.

	3406 */

	3407 memset(&zHeader[sizeof(aJournalMagic)+20], 0,

	3408 nHeader-(sizeof(aJournalMagic)+20));

	3409

	3410 /* In theory, it is only necessary to write the 28 bytes that the

	3411 ** journal header consumes to the journal file here. Then increment the

	3412 ** Pager.journalOff variable by JOURNAL_HDR_SZ so that the next

	3413 ** record is written to the following sector (leaving a gap in the file

	3414 ** that will be implicitly filled in by the OS).

	3415 **

	3416 ** However it has been discovered that on some systems this pattern can

	3417 ** be significantly slower than contiguously writing data to the file,

	3418 ** even if that means explicitly writing data to the block of

	3419 ** (JOURNAL_HDR_SZ - 28) bytes that will not be used. So that is what

	3420 ** is done.

	3421 **

	3422 ** The loop is required here in case the sector-size is larger than the

	3423 ** database page size. Since the zHeader buffer is only Pager.pageSize

	3424 ** bytes in size, more than one call to sqlite3OsWrite() may be required

	3425 ** to populate the entire journal header sector.

	3426 */

	3427 for(nWrite=0; rc==SQLITE_OK&&nWrite<JOURNAL_HDR_SZ(pPager); nWrite+=nHeader){

	3428 IOTRACE(("JHDR %p %lld %d\n", pPager, pPager->journalHdr, nHeader))

	3429 rc = sqlite3OsWrite(pPager->jfd, zHeader, nHeader, pPager->journalOff);

	3430 assert( pPager->journalHdr <= pPager->journalOff );

	3431 pPager->journalOff += nHeader;

	3432 }

	3433

	3434 return rc;

	3435 }
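The 28-byte header layout described above writeJournalHdr() can be sketched as a standalone encoder. This is an illustration under stated assumptions rather than the pager's code: `put32` is assumed to store big-endian as SQLite's put32bits() does, the `kMagic` bytes are taken from SQLite's file format documentation because this listing does not show `aJournalMagic`, and `build_header` is a hypothetical name. The sketch always writes the magic and nRec fields, whereas the function above zeroes them first in the synced case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HDR_FIELDS_SZ 28   /* 8 magic bytes plus five 32-bit fields */

/* Assumed journal magic (from the SQLite file format documentation). */
static const unsigned char kMagic[8] = {
  0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7
};

/* Store v at p as a 4-byte big-endian integer. */
static void put32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Fill the first HDR_FIELDS_SZ bytes of buf with a journal header. */
static void build_header(
  unsigned char *buf,    /* At least HDR_FIELDS_SZ bytes */
  uint32_t nRec,         /* Record count, or 0xffffffff for "read to end" */
  uint32_t cksumInit,    /* Random checksum initializer */
  uint32_t dbOrigSize,   /* Database size in pages before the transaction */
  uint32_t sectorSize,   /* Sector size assumed by the writer */
  uint32_t pageSize      /* Database page size */
){
  assert( buf!=0 );
  memcpy(buf, kMagic, sizeof(kMagic));
  put32(&buf[8],  nRec);
  put32(&buf[12], cksumInit);
  put32(&buf[16], dbOrigSize);
  put32(&buf[20], sectorSize);
  put32(&buf[24], pageSize);
}
```

The offsets 8, 12, 16, 20, and 24 match the ones readJournalHdr() reads back relative to the header start.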

	3436

	3437 /*

	3438 ** The journal file must be open when this is called. A journal header file

	3439 ** (JOURNAL_HDR_SZ bytes) is read from the current location in the journal

	3440 ** file. The current location in the journal file is given by

	3441 ** pPager->journalOff. See comments above function writeJournalHdr() for

	3442 ** a description of the journal header format.

	3443 **

	3444 ** If the header is read successfully, *pNRec is set to the number of

	3445 ** page records following this header and *pDbSize is set to the size of the

	3446 ** database before the transaction began, in pages. Also, pPager->cksumInit

	3447 ** is set to the value read from the journal header. SQLITE_OK is returned

	3448 ** in this case.

	3449 **

	3450 ** If the journal header file appears to be corrupted, SQLITE_DONE is

	3451 ** returned and pNRec and pDbSize are undefined. If JOURNAL_HDR_SZ bytes

	3452 ** cannot be read from the journal file an error code is returned.

	3453 */

	3454 static int readJournalHdr(

	3455 Pager *pPager, /* Pager object */

	3456 int isHot,

	3457 i64 journalSize, /* Size of the open journal file in bytes */

	3458 u32 *pNRec, /* OUT: Value read from the nRec field */

	3459 u32 *pDbSize /* OUT: Value of original database size field */

	3460 ){

	3461 int rc; /* Return code */

	3462 unsigned char aMagic[8]; /* A buffer to hold the magic header */

	3463 i64 iHdrOff; /* Offset of journal header being read */

	3464

	3465 assert( isOpen(pPager->jfd) ); /* Journal file must be open. */

	3466

	3467 /* Advance Pager.journalOff to the start of the next sector. If the

	3468 ** journal file is too small for there to be a header stored at this

	3469 ** point, return SQLITE_DONE.

	3470 */

	3471 pPager->journalOff = journalHdrOffset(pPager);

	3472 if( pPager->journalOff+JOURNAL_HDR_SZ(pPager) > journalSize ){

	3473 return SQLITE_DONE;

	3474 }

	3475 iHdrOff = pPager->journalOff;

	3476

	3477 /* Read in the first 8 bytes of the journal header. If they do not match

	3478 ** the magic string found at the start of each journal header, return

	3479 ** SQLITE_DONE. If an IO error occurs, return an error code. Otherwise,

	3480 ** proceed.

	3481 */

	3482 if( isHot || iHdrOff!=pPager->journalHdr ){

	3483 rc = sqlite3OsRead(pPager->jfd, aMagic, sizeof(aMagic), iHdrOff);

	3484 if( rc ){

	3485 return rc;

	3486 }

	3487 if( memcmp(aMagic, aJournalMagic, sizeof(aMagic))!=0 ){

	3488 return SQLITE_DONE;

	3489 }

	3490 }

	3491

	3492 /* Read the first three 32-bit fields of the journal header: The nRec

	3493 ** field, the checksum-initializer and the database size at the start

	3494 ** of the transaction. Return an error code if anything goes wrong.

	3495 */

	3496 if( SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+8, pNRec))

	3497 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+12, &pPager->cksumInit))

	3498 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+16, pDbSize))

	3499 ){

	3500 return rc;

	3501 }

	3502

	3503 if( pPager->journalOff==0 ){

	3504 u32 iPageSize; /* Page-size field of journal header */

	3505 u32 iSectorSize; /* Sector-size field of journal header */

	3506

	3507 /* Read the page-size and sector-size journal header fields. */

	3508 if( SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+20, &iSectorSize))

	3509 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+24, &iPageSize))

	3510 ){

	3511 return rc;

	3512 }

	3513

	3514 /* Versions of SQLite prior to 3.5.8 set the page-size field of the

	3515 ** journal header to zero. In this case, assume that the Pager.pageSize

	3516 ** variable is already set to the correct page size.

	3517 */

	3518 if( iPageSize==0 ){

	3519 iPageSize = pPager->pageSize;

	3520 }

	3521

	3522 /* Check that the values read from the page-size and sector-size fields

	3523 ** are within range. To be 'in range', both values need to be a power

	3524 ** of two greater than or equal to 512 or 32, and not greater than their

	3525 ** respective compile time maximum limits.

	3526 */

	3527 if( iPageSize<512 || iSectorSize<32

	3528 || iPageSize>SQLITE_MAX_PAGE_SIZE || iSectorSize>MAX_SECTOR_SIZE

	3529 || ((iPageSize-1)&iPageSize)!=0 || ((iSectorSize-1)&iSectorSize)!=0

	3530 ){

	3531 /* If either the page-size or sector-size in the journal-header is

	3532 ** invalid, then the process that wrote the journal-header must have

	3533 ** crashed before the header was synced. In this case stop reading

	3534 ** the journal file here.

	3535 */

	3536 return SQLITE_DONE;

	3537 }

	3538

	3539 /* Update the page-size to match the value read from the journal.

	3540 ** Use a testcase() macro to make sure that malloc failure within

	3541 ** PagerSetPagesize() is tested.

	3542 */

	3543 rc = sqlite3PagerSetPagesize(pPager, &iPageSize, -1);

	3544 testcase( rc!=SQLITE_OK );

	3545

	3546 /* Update the assumed sector-size to match the value used by

	3547 ** the process that created this journal. If this journal was

	3548 ** created by a process other than this one, then this routine

	3549 ** is being called from within pager_playback(). The local value

	3550 ** of Pager.sectorSize is restored at the end of that routine.

	3551 */

	3552 pPager->sectorSize = iSectorSize;

	3553 }

	3554

	3555 pPager->journalOff += JOURNAL_HDR_SZ(pPager);

	3556 return rc;

	3557 }
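The range check that readJournalHdr() applies to the page-size and sector-size fields combines a bounds test with the `(v-1)&v` power-of-two test. A minimal sketch; `size_in_range` is a hypothetical helper, and the upper bound of 65536 used in the example below is an assumption standing in for `SQLITE_MAX_PAGE_SIZE` and `MAX_SECTOR_SIZE`, whose values this listing does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if v is a power of two with min <= v <= max, else 0. */
static int size_in_range(uint32_t v, uint32_t min, uint32_t max){
  assert( min>0 && min<=max );
  if( v<min || v>max ) return 0;
  /* A power of two has exactly one bit set, so clearing the lowest
  ** set bit with v&(v-1) must leave zero. */
  return (v & (v-1))==0;
}
```

For example, `size_in_range(4096, 512, 65536)` accepts a typical page size, while `size_in_range(1000, 512, 65536)` rejects a value that is in bounds but not a power of two. In the function above, a header that fails this kind of check is treated as unsynced and reading stops with SQLITE_DONE.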

	3558

	3559

	3560 /*

	3561 ** Write the supplied master journal name into the journal file for pager

	3562 ** pPager at the current location. The master journal name must be the last

	3563 ** thing written to a journal file. If the pager is in full-sync mode, the

	3564 ** journal file descriptor is advanced to the next sector boundary before

	3565 ** anything is written. The format is:

	3566 **

	3567 ** + 4 bytes: PAGER_MJ_PGNO.

	3568 ** + N bytes: Master journal filename in utf-8.

	3569 ** + 4 bytes: N (length of master journal name in bytes, no nul-terminator).

	3570 ** + 4 bytes: Master journal name checksum.

	3571 ** + 8 bytes: aJournalMagic[].

	3572 **

	3573 ** The master journal page checksum is the sum of the bytes in the master

	3574 ** journal name, where each byte is interpreted as a signed 8-bit integer.

	3575 **

	3576 ** If zMaster is a NULL pointer (occurs for a single database transaction),

	3577 ** this call is a no-op.

	3578 */

	3579 static int writeMasterJournal(Pager *pPager, const char *zMaster){

	3580 int rc; /* Return code */

	3581 int nMaster; /* Length of string zMaster */

	3582 i64 iHdrOff; /* Offset of header in journal file */

	3583 i64 jrnlSize; /* Size of journal file on disk */

	3584 u32 cksum = 0; /* Checksum of string zMaster */

	3585

	3586 assert( pPager->setMaster==0 );

	3587 assert( !pagerUseWal(pPager) );

	3588

	3589 if( !zMaster

	3590 || pPager->journalMode==PAGER_JOURNALMODE_MEMORY

	3591 || !isOpen(pPager->jfd)

	3592 ){

	3593 return SQLITE_OK;

	3594 }

	3595 pPager->setMaster = 1;

	3596 assert( pPager->journalHdr <= pPager->journalOff );

	3597

	3598 /* Calculate the length in bytes and the checksum of zMaster */

	3599 for(nMaster=0; zMaster[nMaster]; nMaster++){

	3600 cksum += zMaster[nMaster];

	3601 }

	3602

	3603 /* If in full-sync mode, advance to the next disk sector before writing

	3604 ** the master journal name. This is in case the previous page written to

	3605 ** the journal has already been synced.

	3606 */

	3607 if( pPager->fullSync ){

	3608 pPager->journalOff = journalHdrOffset(pPager);

	3609 }

	3610 iHdrOff = pPager->journalOff;

	3611

	3612 /* Write the master journal data to the end of the journal file. If

	3613 ** an error occurs, return the error code to the caller.

	3614 */

	3615 if( (0 != (rc = write32bits(pPager->jfd, iHdrOff, PAGER_MJ_PGNO(pPager))))

	3616 || (0 != (rc = sqlite3OsWrite(pPager->jfd, zMaster, nMaster, iHdrOff+4)))

	3617 || (0 != (rc = write32bits(pPager->jfd, iHdrOff+4+nMaster, nMaster)))

	3618 || (0 != (rc = write32bits(pPager->jfd, iHdrOff+4+nMaster+4, cksum)))

	3619 || (0 != (rc = sqlite3OsWrite(pPager->jfd, aJournalMagic, 8,

	3620 iHdrOff+4+nMaster+8)))

	3621 ){

	3622 return rc;

	3623 }

	3624 pPager->journalOff += (nMaster+20);

	3625

	3626 /* If the pager is in persistent-journal mode, then the physical

	3627 ** journal-file may extend past the end of the master-journal name

	3628 ** and 8 bytes of magic data just written to the file. This is

	3629 ** dangerous because the code to rollback a hot-journal file

	3630 ** will not be able to find the master-journal name to determine

	3631 ** whether or not the journal is hot.

	3632 **

	3633 ** Easiest thing to do in this scenario is to truncate the journal

	3634 ** file to the required size.

	3635 */

	3636 if( SQLITE_OK==(rc = sqlite3OsFileSize(pPager->jfd, &jrnlSize))

	3637 && jrnlSize>pPager->journalOff

	3638 ){

	3639 rc = sqlite3OsTruncate(pPager->jfd, pPager->journalOff);

	3640 }

	3641 return rc;

	3642 }

	3643

	3644 /*

	3645 ** Discard the entire contents of the in-memory page-cache.

	3646 */

	3647 static void pager_reset(Pager *pPager){

	3648 pPager->iDataVersion++;

	3649 sqlite3BackupRestart(pPager->pBackup);

	3650 sqlite3PcacheClear(pPager->pPCache);

	3651 }

	3652

	3653 /*

	3654 ** Return the pPager->iDataVersion value

	3655 */

	3656 SQLITE_PRIVATE u32 sqlite3PagerDataVersion(Pager *pPager){

	3657 assert( pPager->eState>PAGER_OPEN );

	3658 return pPager->iDataVersion;

	3659 }

	3660

	3661 /*

	3662 ** Free all structures in the Pager.aSavepoint[] array and set both

	3663 ** Pager.aSavepoint and Pager.nSavepoint to zero. Close the sub-journal

	3664 ** if it is open and the pager is not in exclusive mode.

	3665 */

	3666 static void releaseAllSavepoints(Pager *pPager){

	3667 int ii; /* Iterator for looping through Pager.aSavepoint */

	3668 for(ii=0; ii<pPager->nSavepoint; ii++){

	3669 sqlite3BitvecDestroy(pPager->aSavepoint[ii].pInSavepoint);

	3670 }

	3671 if( !pPager->exclusiveMode || sqlite3JournalIsInMemory(pPager->sjfd) ){

	3672 sqlite3OsClose(pPager->sjfd);

	3673 }

	3674 sqlite3_free(pPager->aSavepoint);

	3675 pPager->aSavepoint = 0;

	3676 pPager->nSavepoint = 0;

	3677 pPager->nSubRec = 0;

	3678 }

	3679

	3680 /*

	3681 ** Set the bit number pgno in the PagerSavepoint.pInSavepoint

	3682 ** bitvecs of all open savepoints. Return SQLITE_OK if successful

	3683 ** or SQLITE_NOMEM if a malloc failure occurs.

	3684 */

	3685 static int addToSavepointBitvecs(Pager *pPager, Pgno pgno){

	3686 int ii; /* Loop counter */

	3687 int rc = SQLITE_OK; /* Result code */

	3688

	3689 for(ii=0; ii<pPager->nSavepoint; ii++){

	3690 PagerSavepoint *p = &pPager->aSavepoint[ii];

	3691 if( pgno<=p->nOrig ){

	3692 rc |= sqlite3BitvecSet(p->pInSavepoint, pgno);

	3693 testcase( rc==SQLITE_NOMEM );

	3694 assert( rc==SQLITE_OK || rc==SQLITE_NOMEM );

	3695 }

	3696 }

	3697 return rc;

	3698 }

	3699

	3700 /*

	3701 ** This function is a no-op if the pager is in exclusive mode and not

	3702 ** in the ERROR state. Otherwise, it switches the pager to PAGER_OPEN

	3703 ** state.

	3704 **

	3705 ** If the pager is not in exclusive-access mode, the database file is

	3706 ** completely unlocked. If the file is unlocked and the file-system does

	3707 ** not exhibit the UNDELETABLE_WHEN_OPEN property, the journal file is

	3708 ** closed (if it is open).

	3709 **

	3710 ** If the pager is in ERROR state when this function is called, the

	3711 ** contents of the pager cache are discarded before switching back to

	3712 ** the OPEN state. Regardless of whether the pager is in exclusive-mode

	3713 ** or not, any journal file left in the file-system will be treated

	3714 ** as a hot-journal and rolled back the next time a read-transaction

	3715 ** is opened (by this or by any other connection).

	3716 */

	3717 static void pager_unlock(Pager *pPager){

	3718

	3719 assert( pPager->eState==PAGER_READER

	3720 || pPager->eState==PAGER_OPEN

	3721 || pPager->eState==PAGER_ERROR

	3722 );

	3723

	3724 sqlite3BitvecDestroy(pPager->pInJournal);

	3725 pPager->pInJournal = 0;

	3726 releaseAllSavepoints(pPager);

	3727

	3728 if( pagerUseWal(pPager) ){

	3729 assert( !isOpen(pPager->jfd) );

	3730 sqlite3WalEndReadTransaction(pPager->pWal);

	3731 pPager->eState = PAGER_OPEN;

	3732 }else if( !pPager->exclusiveMode ){

	3733 int rc; /* Error code returned by pagerUnlockDb() */

	3734 int iDc = isOpen(pPager->fd)?sqlite3OsDeviceCharacteristics(pPager->fd):0;

	3735

	3736 /* If the operating system supports deletion of open files, then

	3737 ** close the journal file when dropping the database lock. Otherwise

	3738 ** another connection with journal_mode=delete might delete the file

	3739 ** out from under us.

	3740 */

	3741 assert( (PAGER_JOURNALMODE_MEMORY & 5)!=1 );

	3742 assert( (PAGER_JOURNALMODE_OFF & 5)!=1 );

	3743 assert( (PAGER_JOURNALMODE_WAL & 5)!=1 );

	3744 assert( (PAGER_JOURNALMODE_DELETE & 5)!=1 );

	3745 assert( (PAGER_JOURNALMODE_TRUNCATE & 5)==1 );

	3746 assert( (PAGER_JOURNALMODE_PERSIST & 5)==1 );

	3747 if( 0==(iDc & SQLITE_IOCAP_UNDELETABLE_WHEN_OPEN)

	3748 || 1!=(pPager->journalMode & 5)

	3749 ){

	3750 sqlite3OsClose(pPager->jfd);

	3751 }

	3752

	3753 /* If the pager is in the ERROR state and the call to unlock the database

	3754 ** file fails, set the current lock to UNKNOWN_LOCK. See the comment

	3755 ** above the #define for UNKNOWN_LOCK for an explanation of why this

	3756 ** is necessary.

	3757 */

	3758 rc = pagerUnlockDb(pPager, NO_LOCK);

	3759 if( rc!=SQLITE_OK && pPager->eState==PAGER_ERROR ){

	3760 pPager->eLock = UNKNOWN_LOCK;

	3761 }

	3762

	3763 /* The pager state may be changed from PAGER_ERROR to PAGER_OPEN here

	3764 ** without clearing the error code. This is intentional - the error

	3765 ** code is cleared and the cache reset in the block below.

	3766 */

	3767 assert( pPager->errCode || pPager->eState!=PAGER_ERROR );

	3768 pPager->changeCountDone = 0;

	3769 pPager->eState = PAGER_OPEN;

	3770 }

	3771

	3772 /* If Pager.errCode is set, the contents of the pager cache cannot be

	3773 ** trusted. Now that there are no outstanding references to the pager,

	3774 ** it can safely move back to PAGER_OPEN state. This happens in both

	3775 ** normal and exclusive-locking mode.

	3776 */

	3777 assert( pPager->errCode==SQLITE_OK || !MEMDB );

	3778 if( pPager->errCode ){

	3779 if( pPager->tempFile==0 ){

	3780 pager_reset(pPager);

	3781 pPager->changeCountDone = 0;

	3782 pPager->eState = PAGER_OPEN;

	3783 }else{

	3784 pPager->eState = (isOpen(pPager->jfd) ? PAGER_OPEN : PAGER_READER);

	3785 }

	3786 if( USEFETCH(pPager) ) sqlite3OsUnfetch(pPager->fd, 0, 0);

	3787 pPager->errCode = SQLITE_OK;

	3788 setGetterMethod(pPager);

	3789 }

	3790

	3791 pPager->journalOff = 0;

	3792 pPager->journalHdr = 0;

	3793 pPager->setMaster = 0;

	3794 }
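The `journalMode & 5` test in pager_unlock() selects the two journal modes that keep the journal file around after a transaction (TRUNCATE and PERSIST). A minimal sketch of why the mask works, assuming the numeric mode values SQLite assigns in its pager header (DELETE=0, PERSIST=1, OFF=2, TRUNCATE=3, MEMORY=4, WAL=5); the `JM_*` names and `journal_file_is_reused` are hypothetical.

```c
#include <assert.h>

/* Assumed values of the PAGER_JOURNALMODE_* constants. */
enum {
  JM_DELETE   = 0,
  JM_PERSIST  = 1,
  JM_OFF      = 2,
  JM_TRUNCATE = 3,
  JM_MEMORY   = 4,
  JM_WAL      = 5
};

/* Return 1 for the modes that leave the journal file in place between
** transactions. Masking with 5 (binary 101) ignores bit 1, so PERSIST (001)
** and TRUNCATE (011) both yield 1, while MEMORY (100) and WAL (101) do not
** because bit 2 is set, and DELETE (000) and OFF (010) yield 0. */
static int journal_file_is_reused(int mode){
  assert( mode>=JM_DELETE && mode<=JM_WAL );
  return (mode & 5)==1;
}
```

This is the property the six assert() statements before the test pin down, so a future renumbering of the modes would trip an assertion rather than silently change the close-on-unlock behavior.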

	3795

	3796 /*

	3797 ** This function is called whenever an IOERR or FULL error that requires

	3798 ** the pager to transition into the ERROR state may have occurred.

	3799 ** The first argument is a pointer to the pager structure, the second

	3800 ** the error-code about to be returned by a pager API function. The

	3801 ** value returned is a copy of the second argument to this function.

	3802 **

	3803 ** If the second argument is SQLITE_FULL, SQLITE_IOERR or one of the

	3804 ** IOERR sub-codes, the pager enters the ERROR state and the error code

	3805 ** is stored in Pager.errCode. While the pager remains in the ERROR state,

	3806 ** all major API calls on the Pager will immediately return Pager.errCode.

	3807 **

	3808 ** The ERROR state indicates that the contents of the pager-cache

	3809 ** cannot be trusted. This state can be cleared by completely discarding

	3810 ** the contents of the pager-cache. If a transaction was active when

	3811 ** the persistent error occurred, then the rollback journal may need

	3812 ** to be replayed to restore the contents of the database file (as if

	3813 ** it were a hot-journal).

	3814 */

	3815 static int pager_error(Pager *pPager, int rc){

	3816 int rc2 = rc & 0xff;

	3817 assert( rc==SQLITE_OK || !MEMDB );

	3818 assert(

	3819 pPager->errCode==SQLITE_FULL ||

	3820 pPager->errCode==SQLITE_OK ||

	3821 (pPager->errCode & 0xff)==SQLITE_IOERR

	3822 );

	3823 if( rc2==SQLITE_FULL || rc2==SQLITE_IOERR ){

	3824 pPager->errCode = rc;

	3825 pPager->eState = PAGER_ERROR;

	3826 setGetterMethod(pPager);

	3827 }

	3828 return rc;

	3829 }
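pager_error() masks the result code with `0xff` so that extended IOERR sub-codes are recognized by their primary code. A minimal sketch of that classification, assuming SQLite's documented numeric result codes (SQLITE_BUSY=5, SQLITE_IOERR=10, SQLITE_FULL=13, with extended codes carrying the primary code in the low byte); the `RC_*` names and `is_sticky_error` are hypothetical.

```c
#include <assert.h>

/* Assumed numeric values of the SQLite result codes used here. */
#define RC_OK          0
#define RC_BUSY        5
#define RC_IOERR       10
#define RC_FULL        13
#define RC_IOERR_READ  (RC_IOERR | (1<<8))   /* An extended IOERR sub-code */

/* Return 1 if rc is one of the errors that moves the pager into the
** ERROR state: SQLITE_FULL, SQLITE_IOERR, or any IOERR sub-code. */
static int is_sticky_error(int rc){
  int primary = rc & 0xff;   /* Strip the extended-code bits */
  assert( rc>=0 );
  return primary==RC_FULL || primary==RC_IOERR;
}
```

Note that the function above stores the full `rc`, not the masked value, in `Pager.errCode`, so callers still see the specific sub-code.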

	3830

	3831 static int pager_truncate(Pager *pPager, Pgno nPage);

	3832

	3833 /*

	3834 ** The write transaction open on pPager is being committed (bCommit==1)

	3835 ** or rolled back (bCommit==0).

	3836 **

	3837 ** Return TRUE if and only if all dirty pages should be flushed to disk.

	3838 **

	3839 ** Rules:

	3840 **

	3841 ** * For non-TEMP databases, always sync to disk. This is necessary

	3842 ** for transactions to be durable.

	3843 **

	3844 ** * Sync TEMP database only on a COMMIT (not a ROLLBACK) when the backing

	3845 ** file has been created already (via a spill on pagerStress()) and

	3846 ** when the number of dirty pages in memory exceeds 25% of the total

	3847 ** cache size.

	3848 */

	3849 static int pagerFlushOnCommit(Pager *pPager, int bCommit){

	3850 if( pPager->tempFile==0 ) return 1;

	3851 if( !bCommit ) return 0;

	3852 if( !isOpen(pPager->fd) ) return 0;

	3853 return (sqlite3PCachePercentDirty(pPager->pPCache)>=25);

	3854 }

	3855

	3856 /*

	3857 ** This routine ends a transaction. A transaction is usually ended by

	3858 ** either a COMMIT or a ROLLBACK operation. This routine may be called

	3859 ** after rollback of a hot-journal, or if an error occurs while opening

	3860 ** the journal file or writing the very first journal-header of a

	3861 ** database transaction.

	3862 **

	3863 ** This routine is never called in PAGER_ERROR state. If it is called

	3864 ** in PAGER_NONE or PAGER_SHARED state and the lock held is less

	3865 ** exclusive than a RESERVED lock, it is a no-op.

	3866 **

	3867 ** Otherwise, any active savepoints are released.

	3868 **

	3869 ** If the journal file is open, then it is "finalized". Once a journal

	3870 ** file has been finalized it is not possible to use it to roll back a

	3871 ** transaction. Nor will it be considered to be a hot-journal by this

	3872 ** or any other database connection. Exactly how a journal is finalized

	3873 ** depends on whether or not the pager is running in exclusive mode and

	3874 ** the current journal-mode (Pager.journalMode value), as follows:

	3875 **

	3876 ** journalMode==MEMORY

	3877 ** Journal file descriptor is simply closed. This destroys an

	3878 ** in-memory journal.

	3879 **

	3880 ** journalMode==TRUNCATE

	3881 ** Journal file is truncated to zero bytes in size.

	3882 **

	3883 ** journalMode==PERSIST

	3884 ** The first 28 bytes of the journal file are zeroed. This invalidates

	3885 ** the first journal header in the file, and hence the entire journal

	3886 ** file. An invalid journal file cannot be rolled back.

	3887 **

	3888 ** journalMode==DELETE

	3889 ** The journal file is closed and deleted using sqlite3OsDelete().

	3890 **

	3891 ** If the pager is running in exclusive mode, this method of finalizing

	3892 ** the journal file is never used. Instead, if the journalMode is

	3893 ** DELETE and the pager is in exclusive mode, the method described under

	3894 ** journalMode==PERSIST is used instead.

	3895 **

	3896 ** After the journal is finalized, the pager moves to PAGER_READER state.

	3897 ** If running in non-exclusive rollback mode, the lock on the file is

	3898 ** downgraded to a SHARED_LOCK.

	3899 **

	3900 ** SQLITE_OK is returned if no error occurs. If an error occurs during

	3901 ** any of the IO operations to finalize the journal file or unlock the

	3902 ** database then the IO error code is returned to the user. If the

	3903 ** operation to finalize the journal file fails, then the code still

	3904 ** tries to unlock the database file if not in exclusive mode. If the

	3905 ** unlock operation fails as well, then the first error code related

	3906 ** to the first error encountered (the journal finalization one) is

	3907 ** returned.

	3908 */

	3909 static int pager_end_transaction(Pager *pPager, int hasMaster, int bCommit){

	3910 int rc = SQLITE_OK; /* Error code from journal finalization operation */

	3911 int rc2 = SQLITE_OK; /* Error code from db file unlock operation */

	3912

	3913 /* Do nothing if the pager does not have an open write transaction

	3914 ** or at least a RESERVED lock. This function may be called when there

	3915 ** is no write-transaction active but a RESERVED or greater lock is

	3916 ** held under two circumstances:

	3917 **

	3918 ** 1. After a successful hot-journal rollback, it is called with

	3919 ** eState==PAGER_NONE and eLock==EXCLUSIVE_LOCK.

	3920 **

	3921 ** 2. If a connection with locking_mode=exclusive holding an EXCLUSIVE

	3922 ** lock switches back to locking_mode=normal and then executes a

	3923 ** read-transaction, this function is called with eState==PAGER_READER

	3924 ** and eLock==EXCLUSIVE_LOCK when the read-transaction is closed.

	3925 */

	3926 assert( assert_pager_state(pPager) );

	3927 assert( pPager->eState!=PAGER_ERROR );

	3928 if( pPager->eState<PAGER_WRITER_LOCKED && pPager->eLock<RESERVED_LOCK ){

	3929 return SQLITE_OK;

	3930 }

	3931

	3932 releaseAllSavepoints(pPager);

	3933 assert( isOpen(pPager->jfd) || pPager->pInJournal==0 );

	3934 if( isOpen(pPager->jfd) ){

	3935 assert( !pagerUseWal(pPager) );

	3936

	3937 /* Finalize the journal file. */

	3938 if( sqlite3JournalIsInMemory(pPager->jfd) ){

	3939 /* assert( pPager->journalMode==PAGER_JOURNALMODE_MEMORY ); */

	3940 sqlite3OsClose(pPager->jfd);

	3941 }else if( pPager->journalMode==PAGER_JOURNALMODE_TRUNCATE ){

	3942 if( pPager->journalOff==0 ){

	3943 rc = SQLITE_OK;

	3944 }else{

	3945 rc = sqlite3OsTruncate(pPager->jfd, 0);

	3946 if( rc==SQLITE_OK && pPager->fullSync ){

	3947 /* Make sure the new file size is written into the inode right away.

	3948 ** Otherwise the journal might resurrect following a power loss and

	3949 ** cause the last transaction to roll back. See

	3950 ** https://bugzilla.mozilla.org/show_bug.cgi?id=1072773

	3951 */

	3952 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags);

	3953 }

	3954 }

	3955 pPager->journalOff = 0;

	3956 }else if( pPager->journalMode==PAGER_JOURNALMODE_PERSIST

	3957 || (pPager->exclusiveMode && pPager->journalMode!=PAGER_JOURNALMODE_WAL)

	3958 ){

	3959 rc = zeroJournalHdr(pPager, hasMaster||pPager->tempFile);

	3960 pPager->journalOff = 0;

	3961 }else{

	3962 /* This branch may be executed with Pager.journalMode==MEMORY if

	3963 ** a hot-journal was just rolled back. In this case the journal

	3964 ** file should be closed and deleted. If this connection writes to

	3965 ** the database file, it will do so using an in-memory journal.

	3966 */

	3967 int bDelete = !pPager->tempFile;

	3968 assert( sqlite3JournalIsInMemory(pPager->jfd)==0 );

	3969 assert( pPager->journalMode==PAGER_JOURNALMODE_DELETE

	3970 || pPager->journalMode==PAGER_JOURNALMODE_MEMORY

	3971 || pPager->journalMode==PAGER_JOURNALMODE_WAL

	3972 );

	3973 sqlite3OsClose(pPager->jfd);

	3974 if( bDelete ){

	3975 rc = sqlite3OsDelete(pPager->pVfs, pPager->zJournal, pPager->extraSync);

	3976 }

	3977 }

	3978 }

	3979

	3980 #ifdef SQLITE_CHECK_PAGES

	3981 sqlite3PcacheIterateDirty(pPager->pPCache, pager_set_pagehash);

	3982 if( pPager->dbSize==0 && sqlite3PcacheRefCount(pPager->pPCache)>0 ){

	3983 PgHdr *p = sqlite3PagerLookup(pPager, 1);

	3984 if( p ){

	3985 p->pageHash = 0;

	3986 sqlite3PagerUnrefNotNull(p);

	3987 }

	3988 }

	3989 #endif

	3990

	3991 sqlite3BitvecDestroy(pPager->pInJournal);

	3992 pPager->pInJournal = 0;

	3993 pPager->nRec = 0;

	3994 if( rc==SQLITE_OK ){

	3995 if( MEMDB || pagerFlushOnCommit(pPager, bCommit) ){

	3996 sqlite3PcacheCleanAll(pPager->pPCache);

	3997 }else{

	3998 sqlite3PcacheClearWritable(pPager->pPCache);

	3999 }

	4000 sqlite3PcacheTruncate(pPager->pPCache, pPager->dbSize);

	4001 }

	4002

	4003 if( pagerUseWal(pPager) ){

	4004 /* Drop the WAL write-lock, if any. Also, if the connection was in

	4005 ** locking_mode=exclusive mode but is no longer, drop the EXCLUSIVE

	4006 ** lock held on the database file.

	4007 */

	4008 rc2 = sqlite3WalEndWriteTransaction(pPager->pWal);

	4009 assert( rc2==SQLITE_OK );

	4010 }else if( rc==SQLITE_OK && bCommit && pPager->dbFileSize>pPager->dbSize ){

	4011 /* This branch is taken when committing a transaction in rollback-journal

	4012 ** mode if the database file on disk is larger than the database image.

	4013 ** At this point the journal has been finalized and the transaction

	4014 ** successfully committed, but the EXCLUSIVE lock is still held on the

	4015 ** file. So it is safe to truncate the database file to its minimum

	4016 ** required size. */

	4017 assert( pPager->eLock==EXCLUSIVE_LOCK );

	4018 rc = pager_truncate(pPager, pPager->dbSize);

	4019 }

	4020

	4021 if( rc==SQLITE_OK && bCommit && isOpen(pPager->fd) ){

	4022 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_COMMIT_PHASETWO, 0);

	4023 if( rc==SQLITE_NOTFOUND ) rc = SQLITE_OK;

	4024 }

	4025

	4026 if( !pPager->exclusiveMode

	4027 && (!pagerUseWal(pPager) || sqlite3WalExclusiveMode(pPager->pWal, 0))

	4028 ){

	4029 rc2 = pagerUnlockDb(pPager, SHARED_LOCK);

	4030 pPager->changeCountDone = 0;

	4031 }

	4032 pPager->eState = PAGER_READER;

	4033 pPager->setMaster = 0;

	4034

	4035 return (rc==SQLITE_OK?rc2:rc);

	4036 }
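
Note on journal finalization: the branch order in pager_end_transaction() can be summarized by a small selector. This is only a sketch under stated assumptions; the enums `JournalMode` and `Finalize` and the function `choose_finalize()` are hypothetical and do not exist in the source.

```c
#include <assert.h>

/* Hypothetical enums standing in for Pager.journalMode and the four
** finalization actions described in the header comment. */
typedef enum { JM_DELETE, JM_PERSIST, JM_TRUNCATE, JM_MEMORY, JM_WAL } JournalMode;
typedef enum {
  FIN_CLOSE,            /* in-memory journal: just close it */
  FIN_TRUNCATE,         /* truncate the journal to zero bytes */
  FIN_ZERO_HEADER,      /* zero the first journal header */
  FIN_CLOSE_AND_DELETE  /* close, then delete unless it is a temp file */
} Finalize;

/* Pick the action in the same order the if/else chain tests it. */
static Finalize choose_finalize(JournalMode mode, int inMemory, int exclusive){
  if( inMemory ) return FIN_CLOSE;
  if( mode==JM_TRUNCATE ) return FIN_TRUNCATE;
  if( mode==JM_PERSIST || (exclusive && mode!=JM_WAL) ) return FIN_ZERO_HEADER;
  assert( mode==JM_DELETE || mode==JM_MEMORY || mode==JM_WAL );
  return FIN_CLOSE_AND_DELETE;
}
```

Because the TRUNCATE test comes before the exclusive-mode test, an exclusive-mode pager in TRUNCATE mode still truncates; only DELETE (and an on-disk journal under MEMORY mode) is redirected to the header-zeroing path.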

	4037

	4038 /*

	4039 ** Execute a rollback if a transaction is active and unlock the

	4040 ** database file.

	4041 **

	4042 ** If the pager has already entered the ERROR state, do not attempt

	4043 ** the rollback at this time. Instead, pager_unlock() is called. The

	4044 ** call to pager_unlock() will discard all in-memory pages, unlock

	4045 ** the database file and move the pager back to OPEN state. If this

	4046 ** means that there is a hot-journal left in the file-system, the next

	4047 ** connection to obtain a shared lock on the pager (which may be this one)

	4048 ** will roll it back.

	4049 **

	4050 ** If the pager has not already entered the ERROR state, but an IO or

	4051 ** malloc error occurs during a rollback, then this will itself cause

	4052 ** the pager to enter the ERROR state. Which will be cleared by the

	4053 ** call to pager_unlock(), as described above.

	4054 */

	4055 static void pagerUnlockAndRollback(Pager *pPager){

	4056 if( pPager->eState!=PAGER_ERROR && pPager->eState!=PAGER_OPEN ){

	4057 assert( assert_pager_state(pPager) );

	4058 if( pPager->eState>=PAGER_WRITER_LOCKED ){

	4059 sqlite3BeginBenignMalloc();

	4060 sqlite3PagerRollback(pPager);

	4061 sqlite3EndBenignMalloc();

	4062 }else if( !pPager->exclusiveMode ){

	4063 assert( pPager->eState==PAGER_READER );

	4064 pager_end_transaction(pPager, 0, 0);

	4065 }

	4066 }

	4067 pager_unlock(pPager);

	4068 }

	4069

	4070 /*

	4071 ** Parameter aData must point to a buffer of pPager->pageSize bytes

	4072 ** of data. Compute and return a checksum based on the contents of the

	4073 ** page of data and the current value of pPager->cksumInit.

	4074 **

	4075 ** This is not a real checksum. It is really just the sum of the

	4076 ** random initial value (pPager->cksumInit) and every 200th byte

	4077 ** of the page data, starting with byte offset (pPager->pageSize%200).

	4078 ** Each byte is interpreted as an 8-bit unsigned integer.

	4079 **

	4080 ** Changing the formula used to compute this checksum results in an

	4081 ** incompatible journal file format.

	4082 **

	4083 ** If journal corruption occurs due to a power failure, the most likely

	4084 ** scenario is that one end or the other of the record will be changed.

	4085 ** It is much less likely that the two ends of the journal record will be

	4086 ** correct and the middle be corrupt. Thus, this "checksum" scheme,

	4087 ** though fast and simple, catches the most likely kind of corruption.

	4088 */

	4089 static u32 pager_cksum(Pager *pPager, const u8 *aData){

	4090 u32 cksum = pPager->cksumInit; /* Checksum value to return */

	4091 int i = pPager->pageSize-200; /* Loop counter */

	4092 while( i>0 ){

	4093 cksum += aData[i];

	4094 i -= 200;

	4095 }

	4096 return cksum;

	4097 }
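
Note on the sampling scheme: a self-contained sketch of the same arithmetic, with the seed and page size passed as plain arguments. The name `sample_cksum()` is hypothetical; it mirrors the loop above but is not part of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add every 200th byte of the page to the seed, walking down from
** offset pageSize-200 and stopping before offset 0 is reached. */
static uint32_t sample_cksum(uint32_t seed, const uint8_t *aData, int pageSize){
  uint32_t cksum = seed;
  int i;
  assert( aData!=NULL && pageSize>0 );
  for(i=pageSize-200; i>0; i-=200){
    cksum += aData[i];   /* each byte is treated as unsigned 8-bit */
  }
  return cksum;
}
```

For a 1024-byte page the sampled offsets are 824, 624, 424, 224 and 24, so the lowest offset equals pageSize%200 as the comment states. When pageSize is an exact multiple of 200 the loop stops at offset 200, because offset 0 fails the `i>0` test.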

	4098

	4099 /*

	4100 ** Report the current page size and number of reserved bytes back

	4101 ** to the codec.

	4102 */

	4103 #ifdef SQLITE_HAS_CODEC

	4104 static void pagerReportSize(Pager *pPager){

	4105 if( pPager->xCodecSizeChng ){

	4106 pPager->xCodecSizeChng(pPager->pCodec, pPager->pageSize,

	4107 (int)pPager->nReserve);

	4108 }

	4109 }

	4110 #else

	4111 # define pagerReportSize(X) /* No-op if we do not support a codec */

	4112 #endif

	4113

	4114 #ifdef SQLITE_HAS_CODEC

	4115 /*

	4116 ** Make sure the number of reserved bits is the same in the destination

	4117 ** pager as it is in the source. This comes up when a VACUUM changes the

	4118 ** number of reserved bits to the "optimal" amount.

	4119 */

	4120 SQLITE_PRIVATE void sqlite3PagerAlignReserve(Pager *pDest, Pager *pSrc){

	4121 if( pDest->nReserve!=pSrc->nReserve ){

	4122 pDest->nReserve = pSrc->nReserve;

	4123 pagerReportSize(pDest);

	4124 }

	4125 }

	4126 #endif

	4127

	4128 /*

	4129 ** Read a single page from either the journal file (if isMainJrnl==1) or

	4130 ** from the sub-journal (if isMainJrnl==0) and playback that page.

	4131 ** The page begins at offset *pOffset into the file. The *pOffset

	4132 ** value is increased to the start of the next page in the journal.

	4133 **

	4134 ** The main rollback journal uses checksums - the statement journal does

	4135 ** not.

	4136 **

	4137 ** If the page number of the page record read from the (sub-)journal file

	4138 ** is greater than the current value of Pager.dbSize, then playback is

	4139 ** skipped and SQLITE_OK is returned.

	4140 **

	4141 ** If pDone is not NULL, then it is a record of pages that have already

	4142 ** been played back. If the page at *pOffset has already been played back

	4143 ** (if the corresponding pDone bit is set) then skip the playback.

	4144 ** Make sure the pDone bit corresponding to the *pOffset page is set

	4145 ** prior to returning.

	4146 **

	4147 ** If the page record is successfully read from the (sub-)journal file

	4148 ** and played back, then SQLITE_OK is returned. If an IO error occurs

	4149 ** while reading the record from the (sub-)journal file or while writing

	4150 ** to the database file, then the IO error code is returned. If data

	4151 ** is successfully read from the (sub-)journal file but appears to be

	4152 ** corrupted, SQLITE_DONE is returned. Data is considered corrupted in

	4153 ** two circumstances:

	4154 **

	4155 ** * If the record page-number is illegal (0 or PAGER_MJ_PGNO), or

	4156 ** * If the record is being rolled back from the main journal file

	4157 ** and the checksum field does not match the record content.

	4158 **

	4159 ** Neither of these two scenarios are possible during a savepoint rollback.

	4160 **

	4161 ** If this is a savepoint rollback, then memory may have to be dynamically

	4162 ** allocated by this function. If this is the case and an allocation fails,

	4163 ** SQLITE_NOMEM is returned.

	4164 */

	4165 static int pager_playback_one_page(

	4166 Pager *pPager, /* The pager being played back */

	4167 i64 *pOffset, /* Offset of record to playback */

	4168 Bitvec *pDone, /* Bitvec of pages already played back */

	4169 int isMainJrnl, /* 1 -> main journal. 0 -> sub-journal. */

	4170 int isSavepnt /* True for a savepoint rollback */

	4171 ){

	4172 int rc;

	4173 PgHdr *pPg; /* An existing page in the cache */

	4174 Pgno pgno; /* The page number of a page in journal */

	4175 u32 cksum; /* Checksum used for sanity checking */

	4176 char *aData; /* Temporary storage for the page */

	4177 sqlite3_file *jfd; /* The file descriptor for the journal file */

	4178 int isSynced; /* True if journal page is synced */

	4179

	4180 assert( (isMainJrnl&~1)==0 ); /* isMainJrnl is 0 or 1 */

	4181 assert( (isSavepnt&~1)==0 ); /* isSavepnt is 0 or 1 */

	4182 assert( isMainJrnl || pDone ); /* pDone always used on sub-journals */

	4183 assert( isSavepnt || pDone==0 ); /* pDone never used on non-savepoint */

	4184

	4185 aData = pPager->pTmpSpace;

	4186 assert( aData ); /* Temp storage must have already been allocated */

	4187 assert( pagerUseWal(pPager)==0 || (!isMainJrnl && isSavepnt) );

	4188

	4189 /* Either the state is greater than PAGER_WRITER_CACHEMOD (a transaction

	4190 ** or savepoint rollback done at the request of the caller) or this is

	4191 ** a hot-journal rollback. If it is a hot-journal rollback, the pager

	4192 ** is in state OPEN and holds an EXCLUSIVE lock. Hot-journal rollback

	4193 ** only reads from the main journal, not the sub-journal.

	4194 */

	4195 assert( pPager->eState>=PAGER_WRITER_CACHEMOD

	4196 || (pPager->eState==PAGER_OPEN && pPager->eLock==EXCLUSIVE_LOCK)

	4197 );

	4198 assert( pPager->eState>=PAGER_WRITER_CACHEMOD || isMainJrnl );

	4199

	4200 /* Read the page number and page data from the journal or sub-journal

	4201 ** file. Return an error code to the caller if an IO error occurs.

	4202 */

	4203 jfd = isMainJrnl ? pPager->jfd : pPager->sjfd;

	4204 rc = read32bits(jfd, *pOffset, &pgno);

	4205 if( rc!=SQLITE_OK ) return rc;

	4206 rc = sqlite3OsRead(jfd, (u8*)aData, pPager->pageSize, (*pOffset)+4);

	4207 if( rc!=SQLITE_OK ) return rc;

	4208 *pOffset += pPager->pageSize + 4 + isMainJrnl*4;

	4209

	4210 /* Sanity checking on the page. This is more important than I originally

	4211 ** thought. If a power failure occurs while the journal is being written,

	4212 ** it could cause invalid data to be written into the journal. We need to

	4213 ** detect this invalid data (with high probability) and ignore it.

	4214 */

	4215 if( pgno==0 || pgno==PAGER_MJ_PGNO(pPager) ){

	4216 assert( !isSavepnt );

	4217 return SQLITE_DONE;

	4218 }

	4219 if( pgno>(Pgno)pPager->dbSize || sqlite3BitvecTest(pDone, pgno) ){

	4220 return SQLITE_OK;

	4221 }

	4222 if( isMainJrnl ){

	4223 rc = read32bits(jfd, (*pOffset)-4, &cksum);

	4224 if( rc ) return rc;

	4225 if( !isSavepnt && pager_cksum(pPager, (u8*)aData)!=cksum ){

	4226 return SQLITE_DONE;

	4227 }

	4228 }

	4229

	4230 /* If this page has already been played back before during the current

	4231 ** rollback, then don't bother to play it back again.

	4232 */

	4233 if( pDone && (rc = sqlite3BitvecSet(pDone, pgno))!=SQLITE_OK ){

	4234 return rc;

	4235 }

	4236

	4237 /* When playing back page 1, restore the nReserve setting

	4238 */

	4239 if( pgno==1 && pPager->nReserve!=((u8*)aData)[20] ){

	4240 pPager->nReserve = ((u8*)aData)[20];

	4241 pagerReportSize(pPager);

	4242 }

	4243

	4244 /* If the pager is in CACHEMOD state, then there must be a copy of this

	4245 ** page in the pager cache. In this case just update the pager cache,

	4246 ** not the database file. The page is left marked dirty in this case.

	4247 **

	4248 ** An exception to the above rule: If the database is in no-sync mode

	4249 ** and a page is moved during an incremental vacuum then the page may

	4250 ** not be in the pager cache. Later: if a malloc() or IO error occurs

	4251 ** during a Movepage() call, then the page may not be in the cache

	4252 ** either. So the condition described in the above paragraph is not

	4253 ** assert()able.

	4254 **

	4255 ** If in WRITER_DBMOD, WRITER_FINISHED or OPEN state, then we update the

	4256 ** pager cache if it exists and the main file. The page is then marked

	4257 ** not dirty. Since this code is only executed in PAGER_OPEN state for

	4258 ** a hot-journal rollback, it is guaranteed that the page-cache is empty

	4259 ** if the pager is in OPEN state.

	4260 **

	4261 ** Ticket #1171: The statement journal might contain page content that is

	4262 ** different from the page content at the start of the transaction.

	4263 ** This occurs when a page is changed prior to the start of a statement

	4264 ** then changed again within the statement. When rolling back such a

	4265 ** statement we must not write to the original database unless we know

	4266 ** for certain that original page contents are synced into the main rollback

	4267 ** journal. Otherwise, a power loss might leave modified data in the

	4268 ** database file without an entry in the rollback journal that can

	4269 ** restore the database to its original form. Two conditions must be

	4270 ** met before writing to the database files. (1) the database must be

	4271 ** locked. (2) we know that the original page content is fully synced

	4272 ** in the main journal either because the page is not in cache or else

	4273 ** the page is marked as needSync==0.

	4274 **

	4275 ** 2008-04-14: When attempting to vacuum a corrupt database file, it

	4276 ** is possible to fail a statement on a database that does not yet exist.

	4277 ** Do not attempt to write if database file has never been opened.

	4278 */

	4279 if( pagerUseWal(pPager) ){

	4280 pPg = 0;

	4281 }else{

	4282 pPg = sqlite3PagerLookup(pPager, pgno);

	4283 }

	4284 assert( pPg || !MEMDB );

	4285 assert( pPager->eState!=PAGER_OPEN || pPg==0 || pPager->tempFile );

	4286 PAGERTRACE(("PLAYBACK %d page %d hash(%08x) %s\n",

	4287 PAGERID(pPager), pgno, pager_datahash(pPager->pageSize, (u8*)aData),

	4288 (isMainJrnl?"main-journal":"sub-journal")

	4289 ));

	4290 if( isMainJrnl ){

	4291 isSynced = pPager->noSync || (*pOffset <= pPager->journalHdr);

	4292 }else{

	4293 isSynced = (pPg==0 || 0==(pPg->flags & PGHDR_NEED_SYNC));

	4294 }

	4295 if( isOpen(pPager->fd)

	4296 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)

	4297 && isSynced

	4298 ){

	4299 i64 ofst = (pgno-1)*(i64)pPager->pageSize;

	4300 testcase( !isSavepnt && pPg!=0 && (pPg->flags&PGHDR_NEED_SYNC)!=0 );

	4301 assert( !pagerUseWal(pPager) );

	4302 rc = sqlite3OsWrite(pPager->fd, (u8 *)aData, pPager->pageSize, ofst);

	4303 if( pgno>pPager->dbFileSize ){

	4304 pPager->dbFileSize = pgno;

	4305 }

	4306 if( pPager->pBackup ){

	4307 CODEC1(pPager, aData, pgno, 3, rc=SQLITE_NOMEM_BKPT);

	4308 sqlite3BackupUpdate(pPager->pBackup, pgno, (u8*)aData);

	4309 CODEC2(pPager, aData, pgno, 7, rc=SQLITE_NOMEM_BKPT, aData);

	4310 }

	4311 }else if( !isMainJrnl && pPg==0 ){

	4312 /* If this is a rollback of a savepoint and data was not written to

	4313 ** the database and the page is not in-memory, there is a potential

	4314 ** problem. When the page is next fetched by the b-tree layer, it

	4315 ** will be read from the database file, which may or may not be

	4316 ** current.

	4317 **

	4318 ** There are a couple of different ways this can happen. All are quite

	4319 ** obscure. When running in synchronous mode, this can only happen

	4320 ** if the page is on the free-list at the start of the transaction, then

	4321 ** populated, then moved using sqlite3PagerMovepage().

	4322 **

	4323 ** The solution is to add an in-memory page to the cache containing

	4324 ** the data just read from the sub-journal. Mark the page as dirty

	4325 ** and if the pager requires a journal-sync, then mark the page as

	4326 ** requiring a journal-sync before it is written.

	4327 */

	4328 assert( isSavepnt );

	4329 assert( (pPager->doNotSpill & SPILLFLAG_ROLLBACK)==0 );

	4330 pPager->doNotSpill |= SPILLFLAG_ROLLBACK;

	4331 rc = sqlite3PagerGet(pPager, pgno, &pPg, 1);

	4332 assert( (pPager->doNotSpill & SPILLFLAG_ROLLBACK)!=0 );

	4333 pPager->doNotSpill &= ~SPILLFLAG_ROLLBACK;

	4334 if( rc!=SQLITE_OK ) return rc;

	4335 sqlite3PcacheMakeDirty(pPg);

	4336 }

	4337 if( pPg ){

	4338 /* No page should ever be explicitly rolled back that is in use, except

	4339 ** for page 1 which is held in use in order to keep the lock on the

	4340 ** database active. However such a page may be rolled back as a result

	4341 ** of an internal error resulting in an automatic call to

	4342 ** sqlite3PagerRollback().

	4343 */

	4344 void *pData;

	4345 pData = pPg->pData;

	4346 memcpy(pData, (u8*)aData, pPager->pageSize);

	4347 pPager->xReiniter(pPg);

	4348 /* It used to be that sqlite3PcacheMakeClean(pPg) was called here. But

	4349 ** that call was dangerous and had no detectable benefit since the cache

	4350 ** is normally cleaned by sqlite3PcacheCleanAll() after rollback and so

	4351 ** has been removed. */

	4352 pager_set_pagehash(pPg);

	4353

	4354 /* If this was page 1, then restore the value of Pager.dbFileVers.

	4355 ** Do this before any decoding. */

	4356 if( pgno==1 ){

	4357 memcpy(&pPager->dbFileVers, &((u8*)pData)[24],sizeof(pPager->dbFileVers));

	4358 }

	4359

	4360 /* Decode the page just read from disk */

	4361 CODEC1(pPager, pData, pPg->pgno, 3, rc=SQLITE_NOMEM_BKPT);

	4362 sqlite3PcacheRelease(pPg);

	4363 }

	4364 return rc;

	4365 }
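
Note on record layout: the offset arithmetic in pager_playback_one_page() implies that each record is a 4-byte page number, then pageSize bytes of page data, then (main journal only) a 4-byte checksum. A minimal sketch of that arithmetic follows; `next_record_offset()` and `cksum_offset()` are hypothetical helper names, not functions in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the record that follows the one starting at 'offset'.
** Main-journal records carry a trailing 4-byte checksum; sub-journal
** records do not. */
static int64_t next_record_offset(int64_t offset, int pageSize, int isMainJrnl){
  assert( (isMainJrnl&~1)==0 );   /* isMainJrnl is 0 or 1 */
  return offset + 4 + pageSize + isMainJrnl*4;
}

/* Offset of the checksum field of a main-journal record: the last four
** bytes before the next record begins. */
static int64_t cksum_offset(int64_t offset, int pageSize){
  return next_record_offset(offset, pageSize, 1) - 4;
}
```

This matches the way the function reads the checksum at `(*pOffset)-4` after `*pOffset` has already been advanced past the record.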

	4366

	4367 /*

	4368 ** Parameter zMaster is the name of a master journal file. A single journal

	4369 ** file that referred to the master journal file has just been rolled back.

	4370 ** This routine checks if it is possible to delete the master journal file,

	4371 ** and does so if it is.

	4372 **

	4373 ** Argument zMaster may point to Pager.pTmpSpace. So that buffer is not

	4374 ** available for use within this function.

	4375 **

	4376 ** When a master journal file is created, it is populated with the names

	4377 ** of all of its child journals, one after another, formatted as utf-8

	4378 ** encoded text. The end of each child journal file is marked with a

	4379 ** nul-terminator byte (0x00). i.e. the entire contents of a master journal

	4380 ** file for a transaction involving two databases might be:

	4381 **

	4382 ** "/home/bill/a.db-journal\x00/home/bill/b.db-journal\x00"

	4383 **

	4384 ** A master journal file may only be deleted once all of its child

	4385 ** journals have been rolled back.

	4386 **

	4387 ** This function reads the contents of the master-journal file into

	4388 ** memory and loops through each of the child journal names. For

	4389 ** each child journal, it checks if:

	4390 **

	4391 ** * if the child journal exists, and if so

	4392 ** * if the child journal contains a reference to master journal

	4393 ** file zMaster

	4394 **

	4395 ** If a child journal can be found that matches both of the criteria

	4396 ** above, this function returns without doing anything. Otherwise, if

	4397 ** no such child journal can be found, file zMaster is deleted from

	4398 ** the file-system using sqlite3OsDelete().

	4399 **

	4400 ** If an IO error occurs within this function, an error code is returned. This

	4401 ** function allocates memory by calling sqlite3Malloc(). If an allocation

	4402 ** fails, SQLITE_NOMEM is returned. Otherwise, if no IO or malloc errors

	4403 ** occur, SQLITE_OK is returned.

	4404 **

	4405 ** TODO: This function allocates a single block of memory to load

	4406 ** the entire contents of the master journal file. This could be

	4407 ** a couple of kilobytes or so - potentially larger than the page

	4408 ** size.

	4409 */

	4410 static int pager_delmaster(Pager *pPager, const char *zMaster){

	4411 sqlite3_vfs *pVfs = pPager->pVfs;

	4412 int rc; /* Return code */

	4413 sqlite3_file *pMaster; /* Malloc'd master-journal file descriptor */

	4414 sqlite3_file *pJournal; /* Malloc'd child-journal file descriptor */

	4415 char *zMasterJournal = 0; /* Contents of master journal file */

	4416 i64 nMasterJournal; /* Size of master journal file */

	4417 char *zJournal; /* Pointer to one journal within MJ file */

	4418 char *zMasterPtr; /* Space to hold MJ filename from a journal file */

	4419 int nMasterPtr; /* Amount of space allocated to zMasterPtr[] */

	4420

	4421 /* Allocate space for both the pJournal and pMaster file descriptors.

	4422 ** If successful, open the master journal file for reading.

	4423 */

	4424 pMaster = (sqlite3_file *)sqlite3MallocZero(pVfs->szOsFile * 2);

	4425 pJournal = (sqlite3_file *)(((u8 *)pMaster) + pVfs->szOsFile);

	4426 if( !pMaster ){

	4427 rc = SQLITE_NOMEM_BKPT;

	4428 }else{

	4429 const int flags = (SQLITE_OPEN_READONLY|SQLITE_OPEN_MASTER_JOURNAL);

	4430 rc = sqlite3OsOpen(pVfs, zMaster, pMaster, flags, 0);

	4431 }

	4432 if( rc!=SQLITE_OK ) goto delmaster_out;

	4433

	4434 /* Load the entire master journal file into space obtained from

	4435 ** sqlite3_malloc() and pointed to by zMasterJournal. Also obtain

	4436 ** sufficient space (in zMasterPtr) to hold the names of master

	4437 ** journal files extracted from regular rollback-journals.

	4438 */

	4439 rc = sqlite3OsFileSize(pMaster, &nMasterJournal);

	4440 if( rc!=SQLITE_OK ) goto delmaster_out;

	4441 nMasterPtr = pVfs->mxPathname+1;

	4442 zMasterJournal = sqlite3Malloc(nMasterJournal + nMasterPtr + 1);

	4443 if( !zMasterJournal ){

	4444 rc = SQLITE_NOMEM_BKPT;

	4445 goto delmaster_out;

	4446 }

	4447 zMasterPtr = &zMasterJournal[nMasterJournal+1];

	4448 rc = sqlite3OsRead(pMaster, zMasterJournal, (int)nMasterJournal, 0);

	4449 if( rc!=SQLITE_OK ) goto delmaster_out;

	4450 zMasterJournal[nMasterJournal] = 0;

	4451

	4452 zJournal = zMasterJournal;

	4453 while( (zJournal-zMasterJournal)<nMasterJournal ){

	4454 int exists;

	4455 rc = sqlite3OsAccess(pVfs, zJournal, SQLITE_ACCESS_EXISTS, &exists);

	4456 if( rc!=SQLITE_OK ){

	4457 goto delmaster_out;

	4458 }

	4459 if( exists ){

	4460 /* One of the journals pointed to by the master journal exists.

	4461 ** Open it and check if it points at the master journal. If

	4462 ** so, return without deleting the master journal file.

	4463 */

	4464 int c;

	4465 int flags = (SQLITE_OPEN_READONLY|SQLITE_OPEN_MAIN_JOURNAL);

	4466 rc = sqlite3OsOpen(pVfs, zJournal, pJournal, flags, 0);

	4467 if( rc!=SQLITE_OK ){

	4468 goto delmaster_out;

	4469 }

	4470

	4471 rc = readMasterJournal(pJournal, zMasterPtr, nMasterPtr);

	4472 sqlite3OsClose(pJournal);

	4473 if( rc!=SQLITE_OK ){

	4474 goto delmaster_out;

	4475 }

	4476

	4477 c = zMasterPtr[0]!=0 && strcmp(zMasterPtr, zMaster)==0;

	4478 if( c ){

	4479 /* We have a match. Do not delete the master journal file. */

	4480 goto delmaster_out;

	4481 }

	4482 }

	4483 zJournal += (sqlite3Strlen30(zJournal)+1);

	4484 }

	4485

	4486 sqlite3OsClose(pMaster);

	4487 rc = sqlite3OsDelete(pVfs, zMaster, 0);

	4488

	4489 delmaster_out:

	4490 sqlite3_free(zMasterJournal);

	4491 if( pMaster ){

	4492 sqlite3OsClose(pMaster);

	4493 assert( !isOpen(pJournal) );

	4494 sqlite3_free(pMaster);

	4495 }

	4496 return rc;

	4497 }
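
Note on the master journal format: the loop in pager_delmaster() walks a buffer of nul-terminated names packed back to back. The sketch below shows only that traversal, assuming the caller has stored an extra nul at buf[n] just as the function does with zMasterJournal[nMasterJournal]. The name `count_child_journals()` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the nul-terminated names packed into buf[0..n-1].
** Assumption: buf[n] is readable and holds 0, so strlen() cannot run
** past the allocation even if the file content lacks a final nul. */
static int count_child_journals(const char *buf, size_t n){
  const char *z = buf;
  int count = 0;
  assert( buf!=NULL );
  while( (size_t)(z-buf)<n ){
    count++;
    z += strlen(z)+1;   /* skip the name and its terminator */
  }
  return count;
}
```

With the two-database example from the comment, the buffer holds two 23-character names plus their terminators, so n is 48 and the loop runs twice.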

	4498

	4499

	4500 /*

	4501 ** This function is used to change the actual size of the database

	4502 ** file in the file-system. This only happens when committing a transaction,

	4503 ** or rolling back a transaction (including rolling back a hot-journal).

	4504 **

	4505 ** If the main database file is not open, or the pager is not in either

	4506 ** DBMOD or OPEN state, this function is a no-op. Otherwise, the size

	4507 ** of the file is changed to nPage pages (nPage*pPager->pageSize bytes).

	4508 ** If the file on disk is currently larger than nPage pages, then use the VFS

	4509 ** xTruncate() method to truncate it.

	4510 **

	4511 ** Or, it might be the case that the file on disk is smaller than

	4512 ** nPage pages. Some operating system implementations can get confused if

	4513 ** you try to truncate a file to some size that is larger than it

	4514 ** currently is, so detect this case and write a single page of zeros at

	4515 ** the end of the new file instead.

	4516 **

	4517 ** If successful, return SQLITE_OK. If an IO error occurs while modifying

	4518 ** the database file, return the error code to the caller.

	4519 */

	4520 static int pager_truncate(Pager *pPager, Pgno nPage){

	4521 int rc = SQLITE_OK;

	4522 assert( pPager->eState!=PAGER_ERROR );

	4523 assert( pPager->eState!=PAGER_READER );

	4524

	4525 if( isOpen(pPager->fd)

	4526 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)

	4527 ){

	4528 i64 currentSize, newSize;

	4529 int szPage = pPager->pageSize;

	4530 assert( pPager->eLock==EXCLUSIVE_LOCK );

	4531 /* TODO: Is it safe to use Pager.dbFileSize here? */

	4532 rc = sqlite3OsFileSize(pPager->fd, &currentSize);

	4533 newSize = szPage*(i64)nPage;

	4534 if( rc==SQLITE_OK && currentSize!=newSize ){

	4535 if( currentSize>newSize ){

	4536 rc = sqlite3OsTruncate(pPager->fd, newSize);

	4537 }else if( (currentSize+szPage)<=newSize ){

	4538 char *pTmp = pPager->pTmpSpace;

	4539 memset(pTmp, 0, szPage);

	4540 testcase( (newSize-szPage) == currentSize );

	4541 testcase( (newSize-szPage) > currentSize );

	4542 rc = sqlite3OsWrite(pPager->fd, pTmp, szPage, newSize-szPage);

	4543 }

	4544 if( rc==SQLITE_OK ){

	4545 pPager->dbFileSize = nPage;

	4546 }

	4547 }

	4548 }

	4549 return rc;

	4550 }
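
Note on the resize decision: pager_truncate() shrinks the file with a truncate call, but grows it by writing one zeroed page at the position of the final page. A sketch of just the decision, with no I/O, is given below; `ResizeAction` and `plan_resize()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
  RESIZE_NONE,            /* nothing to write or truncate */
  RESIZE_TRUNCATE,        /* file is too large: truncate to newSize */
  RESIZE_WRITE_LAST_PAGE  /* file is too small: write a zeroed last page */
} ResizeAction;

/* Decide how to bring a file of currentSize bytes to nPage pages.
** On RESIZE_WRITE_LAST_PAGE, *pWriteOffset receives the offset at
** which the zeroed page would be written; otherwise it is set to -1. */
static ResizeAction plan_resize(int64_t currentSize, int pageSize,
                                uint32_t nPage, int64_t *pWriteOffset){
  int64_t newSize = (int64_t)pageSize * nPage;
  assert( pageSize>0 && pWriteOffset!=0 );
  *pWriteOffset = -1;
  if( currentSize==newSize ) return RESIZE_NONE;
  if( currentSize>newSize ) return RESIZE_TRUNCATE;
  if( currentSize+pageSize<=newSize ){
    *pWriteOffset = newSize - pageSize;
    return RESIZE_WRITE_LAST_PAGE;
  }
  return RESIZE_NONE;     /* short by less than one page: no write */
}
```

The last case corresponds to a file that is smaller than the target by less than one page; the function above performs no write there either, although it still records nPage in dbFileSize.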

	4551

	4552 /*

	4553 ** Return a sanitized version of the sector-size of OS file pFile. The

	4554 ** return value is guaranteed to lie between 32 and MAX_SECTOR_SIZE.

	4555 */

	4556 SQLITE_PRIVATE int sqlite3SectorSize(sqlite3_file *pFile){

	4557 int iRet = sqlite3OsSectorSize(pFile);

	4558 if( iRet<32 ){

	4559 iRet = 512;

	4560 }else if( iRet>MAX_SECTOR_SIZE ){

	4561 assert( MAX_SECTOR_SIZE>=512 );

	4562 iRet = MAX_SECTOR_SIZE;

	4563 }

	4564 return iRet;

	4565 }
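
Note on the clamp: a standalone sketch of the same sanitizing rule, with the upper bound passed in as a parameter since the value of MAX_SECTOR_SIZE is not shown in this part of the file. `sanitize_sector_size()` is a hypothetical name.

```c
#include <assert.h>

/* Return a usable sector size: values below 32 are replaced by 512,
** values above maxSectorSize are clamped down to it. */
static int sanitize_sector_size(int reported, int maxSectorSize){
  assert( maxSectorSize>=512 );
  if( reported<32 ) return 512;
  if( reported>maxSectorSize ) return maxSectorSize;
  return reported;
}
```

Note that a too-small value is replaced by 512 rather than raised to 32, so the result is never in the range below 32 but may legitimately be any value from 32 up to the maximum.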

	4566

	4567 /*

	4568 ** Set the value of the Pager.sectorSize variable for the given

	4569 ** pager based on the value returned by the xSectorSize method

	4570 ** of the open database file. The sector size will be used

	4571 ** to determine the size and alignment of journal header and

	4572 ** master journal pointers within created journal files.

	4573 **

	4574 ** For temporary files the effective sector size is always 512 bytes.

	4575 **

	4576 ** Otherwise, for non-temporary files, the effective sector size is

	4577 ** the value returned by the xSectorSize() method, replaced by 512 if

	4578 ** it is less than 32, or rounded down to MAX_SECTOR_SIZE if it

	4579 ** is greater than MAX_SECTOR_SIZE.

	4580 **

	4581 ** If the file has the SQLITE_IOCAP_POWERSAFE_OVERWRITE property, then set

	4582 ** the effective sector size to its minimum value (512). The purpose of

	4583 ** pPager->sectorSize is to define the "blast radius" of bytes that

	4584 ** might change if a crash occurs while writing to a single byte in

	4585 ** that range. But with POWERSAFE_OVERWRITE, the blast radius is zero

	4586 ** (that is what POWERSAFE_OVERWRITE means), so we minimize the sector

	4587 ** size. For backwards compatibility of the rollback journal file format,

	4588 ** we cannot reduce the effective sector size below 512.

	4589 */

	4590 static void setSectorSize(Pager *pPager){

	4591 assert( isOpen(pPager->fd) || pPager->tempFile );

	4592

	4593 if( pPager->tempFile

	4594 || (sqlite3OsDeviceCharacteristics(pPager->fd) &

	4595 SQLITE_IOCAP_POWERSAFE_OVERWRITE)!=0

	4596 ){

	4597 /* Sector size doesn't matter for temporary files. Also, the file

	4598 ** may not have been opened yet, in which case the OsSectorSize()

	4599 ** call will segfault. */

	4600 pPager->sectorSize = 512;

	4601 }else{

	4602 pPager->sectorSize = sqlite3SectorSize(pPager->fd);

	4603 }

	4604 }

	4605

	4606 /*

	4607 ** Playback the journal and thus restore the database file to

	4608 ** the state it was in before we started making changes.

	4609 **

	4610 ** The journal file format is as follows:

	4611 **

	4612 ** (1) 8 byte prefix. A copy of aJournalMagic[].

	4613 ** (2) 4 byte big-endian integer which is the number of valid page records

	4614 ** in the journal. If this value is 0xffffffff, then compute the

	4615 ** number of page records from the journal size.

	4616 ** (3) 4 byte big-endian integer which is the initial value for the

	4617 ** sanity checksum.

	4618 ** (4) 4 byte integer which is the number of pages to truncate the

	4619 ** database to during a rollback.

	4620 ** (5) 4 byte big-endian integer which is the sector size. The header

	4621 ** is this many bytes in size.

	4622 ** (6) 4 byte big-endian integer which is the page size.

	4623 ** (7) zero padding out to the next sector size.

	4624 ** (8) Zero or more page instances, each as follows:

	4625 ** + 4 byte page number.

	4626 ** + pPager->pageSize bytes of data.

	4627 ** + 4 byte checksum

	4628 **

	4629 ** When we speak of the journal header, we mean the first 7 items above.

	4630 ** Each entry in the journal is an instance of the 8th item.

	4631 **

	4632 ** Call the value from the second bullet "nRec". nRec is the number of

	4633 ** valid page entries in the journal. In most cases, you can compute the

	4634 ** value of nRec from the size of the journal file. But if a power

	4635 ** failure occurred while the journal was being written, it could be the

	4636 ** case that the size of the journal file had already been increased but

	4637 ** the extra entries had not yet made it safely to disk. In such a case,

	4638 ** the value of nRec computed from the file size would be too large. For

	4639 ** that reason, we always use the nRec value in the header.

	4640 **

	4641 ** If the nRec value is 0xffffffff it means that nRec should be computed

	4642 ** from the file size. This value is used when the user selects the

	4643 ** no-sync option for the journal. A power failure could lead to corruption

	4644 ** in this case. But for things like temporary tables (which will be

	4645 ** deleted when the power is restored) we don't care.

	4646 **

	4647 ** If the file opened as the journal file is not a well-formed

	4648 ** journal file then all pages up to the first corrupted page are rolled

	4649 ** back (or no pages if the journal header is corrupted). The journal file

	4650 ** is then deleted and SQLITE_OK returned, just as if no corruption had

	4651 ** been encountered.

	4652 **

	4653 ** If an I/O or malloc() error occurs, the journal-file is not deleted

	4654 ** and an error code is returned.

	4655 **

	4656 ** The isHot parameter indicates that we are trying to rollback a journal

	4657 ** that might be a hot journal. Or, it could be that the journal is

	4658 ** preserved because of JOURNALMODE_PERSIST or JOURNALMODE_TRUNCATE.

	4659 ** If the journal really is hot, reset the pager cache prior to rolling

	4660 ** back any content. If the journal is merely persistent, no reset is

	4661 ** needed.

	4662 */

	4663 static int pager_playback(Pager *pPager, int isHot){

	4664 sqlite3_vfs *pVfs = pPager->pVfs;

	4665 i64 szJ; /* Size of the journal file in bytes */

	4666 u32 nRec; /* Number of Records in the journal */

	4667 u32 u; /* Unsigned loop counter */

	4668 Pgno mxPg = 0; /* Size of the original file in pages */

	4669 int rc; /* Result code of a subroutine */

	4670 int res = 1; /* Value returned by sqlite3OsAccess() */

	4671 char *zMaster = 0; /* Name of master journal file if any */

	4672 int needPagerReset; /* True to reset page prior to first page rollback */

	4673 int nPlayback = 0; /* Total number of pages restored from journal */

	4674

	4675 /* Figure out how many records are in the journal. Abort early if

	4676 ** the journal is empty.

	4677 */

	4678 assert( isOpen(pPager->jfd) );

	4679 rc = sqlite3OsFileSize(pPager->jfd, &szJ);

	4680 if( rc!=SQLITE_OK ){

	4681 goto end_playback;

	4682 }

	4683

	4684 /* Read the master journal name from the journal, if it is present.

	4685 ** If a master journal file name is specified, but the file is not

	4686 ** present on disk, then the journal is not hot and does not need to be

	4687 ** played back.

	4688 **

	4689 ** TODO: Technically the following is an error because it assumes that

	4690 ** buffer Pager.pTmpSpace is (mxPathname+1) bytes or larger. i.e. that

	4691 ** (pPager->pageSize >= pPager->pVfs->mxPathname+1). Using os_unix.c,

	4692 ** mxPathname is 512, which is the same as the minimum allowable value

	4693 ** for pageSize.

	4694 */

	4695 zMaster = pPager->pTmpSpace;

	4696 rc = readMasterJournal(pPager->jfd, zMaster, pPager->pVfs->mxPathname+1);

	4697 if( rc==SQLITE_OK && zMaster[0] ){

	4698 rc = sqlite3OsAccess(pVfs, zMaster, SQLITE_ACCESS_EXISTS, &res);

	4699 }

	4700 zMaster = 0;

	4701 if( rc!=SQLITE_OK || !res ){

	4702 goto end_playback;

	4703 }

	4704 pPager->journalOff = 0;

	4705 needPagerReset = isHot;

	4706

	4707 /* This loop terminates either when a readJournalHdr() or

	4708 ** pager_playback_one_page() call returns SQLITE_DONE or an IO error

	4709 ** occurs.

	4710 */

	4711 while( 1 ){

	4712 /* Read the next journal header from the journal file. If there are

	4713 ** not enough bytes left in the journal file for a complete header, or

	4714 ** it is corrupted, then a process must have failed while writing it.

	4715 ** This indicates nothing more needs to be rolled back.

	4716 */

	4717 rc = readJournalHdr(pPager, isHot, szJ, &nRec, &mxPg);

	4718 if( rc!=SQLITE_OK ){

	4719 if( rc==SQLITE_DONE ){

	4720 rc = SQLITE_OK;

	4721 }

	4722 goto end_playback;

	4723 }

	4724

	4725 /* If nRec is 0xffffffff, then this journal was created by a process

	4726 ** working in no-sync mode. This means that the rest of the journal

	4727 ** file consists of pages, there are no more journal headers. Compute

	4728 ** the value of nRec based on this assumption.

	4729 */

	4730 if( nRec==0xffffffff ){

	4731 assert( pPager->journalOff==JOURNAL_HDR_SZ(pPager) );

	4732 nRec = (int)((szJ - JOURNAL_HDR_SZ(pPager))/JOURNAL_PG_SZ(pPager));

	4733 }

	4734

	4735 /* If nRec is 0 and this rollback is of a transaction created by this

	4736 ** process and if this is the final header in the journal, then it means

	4737 ** that this part of the journal was being filled but has not yet been

	4738 ** synced to disk. Compute the number of pages based on the remaining

	4739 ** size of the file.

	4740 **

	4741 ** The third term of the test was added to fix ticket #2565.

	4742 ** When rolling back a hot journal, nRec==0 always means that the next

	4743 ** chunk of the journal contains zero pages to be rolled back. But

	4744 ** when doing a ROLLBACK and the nRec==0 chunk is the last chunk in

	4745 ** the journal, it means that the journal might contain additional

	4746 ** pages that need to be rolled back and that the number of pages

	4747 ** should be computed based on the journal file size.

	4748 */

	4749 if( nRec==0 && !isHot &&

	4750 pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff ){

	4751 nRec = (int)((szJ - pPager->journalOff) / JOURNAL_PG_SZ(pPager));

	4752 }

	4753

	4754 /* If this is the first header read from the journal, truncate the

	4755 ** database file back to its original size.

	4756 */

	4757 if( pPager->journalOff==JOURNAL_HDR_SZ(pPager) ){

	4758 rc = pager_truncate(pPager, mxPg);

	4759 if( rc!=SQLITE_OK ){

	4760 goto end_playback;

	4761 }

	4762 pPager->dbSize = mxPg;

	4763 }

	4764

	4765 /* Copy original pages out of the journal and back into the

	4766 ** database file and/or page cache.

	4767 */

	4768 for(u=0; u<nRec; u++){

	4769 if( needPagerReset ){

	4770 pager_reset(pPager);

	4771 needPagerReset = 0;

	4772 }

	4773 rc = pager_playback_one_page(pPager,&pPager->journalOff,0,1,0);

	4774 if( rc==SQLITE_OK ){

	4775 nPlayback++;

	4776 }else{

	4777 if( rc==SQLITE_DONE ){

	4778 pPager->journalOff = szJ;

	4779 break;

	4780 }else if( rc==SQLITE_IOERR_SHORT_READ ){

	4781 /* If the journal has been truncated, simply stop reading and

	4782 ** processing the journal. This might happen if the journal was

	4783 ** not completely written and synced prior to a crash. In that

	4784 ** case, the database should have never been written in the

	4785 ** first place so it is OK to simply abandon the rollback. */

	4786 rc = SQLITE_OK;

	4787 goto end_playback;

	4788 }else{

	4789 /* If we are unable to rollback, quit and return the error

	4790 ** code. This will cause the pager to enter the error state

	4791 ** so that no further harm will be done. Perhaps the next

	4792 ** process to come along will be able to rollback the database.

	4793 */

	4794 goto end_playback;

	4795 }

	4796 }

	4797 }

	4798 }

	4799 /*NOTREACHED*/

	4800 assert( 0 );

	4801

	4802 end_playback:

	4803 /* Following a rollback, the database file should be back in its original

	4804 ** state prior to the start of the transaction, so invoke the

	4805 ** SQLITE_FCNTL_DB_UNCHANGED file-control method to disable the

	4806 ** assertion that the transaction counter was modified.

	4807 */

	4808 #ifdef SQLITE_DEBUG

	4809 if( pPager->fd->pMethods ){

	4810 sqlite3OsFileControlHint(pPager->fd,SQLITE_FCNTL_DB_UNCHANGED,0);

	4811 }

	4812 #endif

	4813

	4814 /* If this playback is happening automatically as a result of an IO or

	4815 ** malloc error that occurred after the change-counter was updated but

	4816 ** before the transaction was committed, then the change-counter

	4817 ** modification may just have been reverted. If this happens in exclusive

	4818 ** mode, then subsequent transactions performed by the connection will not

	4819 ** update the change-counter at all. This may lead to cache inconsistency

	4820 ** problems for other processes at some point in the future. So, just

	4821 ** in case this has happened, clear the changeCountDone flag now.

	4822 */

	4823 pPager->changeCountDone = pPager->tempFile;

	4824

	4825 if( rc==SQLITE_OK ){

	4826 zMaster = pPager->pTmpSpace;

	4827 rc = readMasterJournal(pPager->jfd, zMaster, pPager->pVfs->mxPathname+1);

	4828 testcase( rc!=SQLITE_OK );

	4829 }

	4830 if( rc==SQLITE_OK

	4831 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)

	4832 ){

	4833 rc = sqlite3PagerSync(pPager, 0);

	4834 }

	4835 if( rc==SQLITE_OK ){

	4836 rc = pager_end_transaction(pPager, zMaster[0]!='\0', 0);

	4837 testcase( rc!=SQLITE_OK );

	4838 }

	4839 if( rc==SQLITE_OK && zMaster[0] && res ){

	4840 /* If there was a master journal and this routine will return success,

	4841 ** see if it is possible to delete the master journal.

	4842 */

	4843 rc = pager_delmaster(pPager, zMaster);

	4844 testcase( rc!=SQLITE_OK );

	4845 }

	4846 if( isHot && nPlayback ){

	4847 sqlite3_log(SQLITE_NOTICE_RECOVER_ROLLBACK, "recovered %d pages from %s",

	4848 nPlayback, pPager->zJournal);

	4849 }

	4850

	4851 /* The Pager.sectorSize variable may have been updated while rolling

	4852 ** back a journal created by a process with a different sector size

	4853 ** value. Reset it to the correct value for this process.

	4854 */

	4855 setSectorSize(pPager);

	4856 return rc;

	4857 }
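
Review note: when nRec is 0xffffffff, pager_playback() derives the record count from the journal size. The sketch below illustrates that arithmetic using only the layout in the header comment: a header that occupies one sector, followed by records of 4 + pageSize + 4 bytes. The names `example_get_be32` and `example_records_from_size` are hypothetical, and the function is a simplified stand-in rather than the pager's own code.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 4-byte big-endian integer, the encoding used by the
** journal header fields described above. */
static uint32_t example_get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: number of whole page records that fit after a
** header of sector_size bytes in a journal of journal_bytes bytes. */
static uint32_t example_records_from_size(
  int64_t journal_bytes,   /* Total journal size in bytes */
  uint32_t sector_size,    /* Header size, item (5) of the format */
  uint32_t page_size       /* Page size, item (6) of the format */
){
  int64_t record_bytes = (int64_t)page_size + 8;  /* pgno + data + cksum */
  assert( page_size>0 );
  if( journal_bytes<=(int64_t)sector_size ) return 0;
  return (uint32_t)((journal_bytes - sector_size)/record_bytes);
}
```

A trailing partial record is ignored by the integer division, which matches the intent of abandoning an incompletely written tail.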

	4858

	4859

	4860 /*

	4861 ** Read the content for page pPg out of the database file and into

	4862 ** pPg->pData. A shared lock or greater must be held on the database

	4863 ** file before this function is called.

	4864 **

	4865 ** If page 1 is read, then the value of Pager.dbFileVers[] is set to

	4866 ** the value read from the database file.

	4867 **

	4868 ** If an IO error occurs, then the IO error is returned to the caller.

	4869 ** Otherwise, SQLITE_OK is returned.

	4870 */

	4871 static int readDbPage(PgHdr *pPg, u32 iFrame){

	4872 Pager *pPager = pPg->pPager; /* Pager object associated with page pPg */

	4873 Pgno pgno = pPg->pgno; /* Page number to read */

	4874 int rc = SQLITE_OK; /* Return code */

	4875 int pgsz = pPager->pageSize; /* Number of bytes to read */

	4876

	4877 assert( pPager->eState>=PAGER_READER && !MEMDB );

	4878 assert( isOpen(pPager->fd) );

	4879

	4880 #ifndef SQLITE_OMIT_WAL

	4881 if( iFrame ){

	4882 /* Try to pull the page from the write-ahead log. */

	4883 rc = sqlite3WalReadFrame(pPager->pWal, iFrame, pgsz, pPg->pData);

	4884 }else

	4885 #endif

	4886 {

	4887 i64 iOffset = (pgno-1)*(i64)pPager->pageSize;

	4888 rc = sqlite3OsRead(pPager->fd, pPg->pData, pgsz, iOffset);

	4889 if( rc==SQLITE_IOERR_SHORT_READ ){

	4890 rc = SQLITE_OK;

	4891 }

	4892 }

	4893

	4894 if( pgno==1 ){

	4895 if( rc ){

	4896 /* If the read is unsuccessful, set the dbFileVers[] to something

	4897 ** that will never be a valid file version. dbFileVers[] is a copy

	4898 ** of bytes 24..39 of the database. Bytes 28..31 should always be

	4899 ** zero or the size of the database in pages. Bytes 32..35 and 36..39

	4900 ** should be page numbers which are never 0xffffffff. So filling

	4901 ** pPager->dbFileVers[] with all 0xff bytes should suffice.

	4902 **

	4903 ** For an encrypted database, the situation is more complex: bytes

	4904 ** 24..39 of the database are white noise. But the probability of

	4905 ** white noise equaling 16 bytes of 0xff is vanishingly small so

	4906 ** we should still be ok.

	4907 */

	4908 memset(pPager->dbFileVers, 0xff, sizeof(pPager->dbFileVers));

	4909 }else{

	4910 u8 *dbFileVers = &((u8*)pPg->pData)[24];

	4911 memcpy(&pPager->dbFileVers, dbFileVers, sizeof(pPager->dbFileVers));

	4912 }

	4913 }

	4914 CODEC1(pPager, pPg->pData, pgno, 3, rc = SQLITE_NOMEM_BKPT);

	4915

	4916 PAGER_INCR(sqlite3_pager_readdb_count);

	4917 PAGER_INCR(pPager->nRead);

	4918 IOTRACE(("PGIN %p %d\n", pPager, pgno));

	4919 PAGERTRACE(("FETCH %d page %d hash(%08x)\n",

	4920 PAGERID(pPager), pgno, pager_pagehash(pPg)));

	4921

	4922 return rc;

	4923 }

	4924

	4925 /*

	4926 ** Update the value of the change-counter at offsets 24 and 92 in

	4927 ** the header and the sqlite version number at offset 96.

	4928 **

	4929 ** This is an unconditional update. See also the pager_incr_changecounter()

	4930 ** routine which only updates the change-counter if the update is actually

	4931 ** needed, as determined by the pPager->changeCountDone state variable.

	4932 */

	4933 static void pager_write_changecounter(PgHdr *pPg){

	4934 u32 change_counter;

	4935

	4936 /* Increment the value just read and write it back to byte 24. */

	4937 change_counter = sqlite3Get4byte((u8*)pPg->pPager->dbFileVers)+1;

	4938 put32bits(((char*)pPg->pData)+24, change_counter);

	4939

	4940 /* Also store the SQLite version number in bytes 96..99 and in

	4941 ** bytes 92..95 store the change counter for which the version number

	4942 ** is valid. */

	4943 put32bits(((char*)pPg->pData)+92, change_counter);

	4944 put32bits(((char*)pPg->pData)+96, SQLITE_VERSION_NUMBER);

	4945 }
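
Review note: the three header writes in pager_write_changecounter() can be tried in isolation with the sketch below. It assumes big-endian 4-byte stores and a page-1 buffer of at least 100 bytes; `cc_put_be32`, `cc_get_be32`, and `cc_bump` are hypothetical names introduced here, and the version argument is an arbitrary caller-supplied number.

```c
#include <assert.h>
#include <stdint.h>

/* Store v at p as a 4-byte big-endian integer. */
static void cc_put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Read a 4-byte big-endian integer from p. */
static uint32_t cc_get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Hypothetical stand-in for the routine above: write prev_counter+1 at
** offsets 24 and 92, and the version number at offset 96. */
static void cc_bump(unsigned char *page1, uint32_t prev_counter,
                    uint32_t version){
  uint32_t next = prev_counter + 1;
  assert( page1!=0 );
  cc_put_be32(page1+24, next);
  cc_put_be32(page1+92, next);
  cc_put_be32(page1+96, version);
}
```

Writing the same counter at offsets 24 and 92 is what lets a reader tell whether the version number at offset 96 belongs to the current change counter.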

	4946

	4947 #ifndef SQLITE_OMIT_WAL

	4948 /*

	4949 ** This function is invoked once for each page that has already been

	4950 ** written into the log file when a WAL transaction is rolled back.

	4951 ** Parameter iPg is the page number of said page. The pCtx argument

	4952 ** is actually a pointer to the Pager structure.

	4953 **

	4954 ** If page iPg is present in the cache, and has no outstanding references,

	4955 ** it is discarded. Otherwise, if there are one or more outstanding

	4956 ** references, the page content is reloaded from the database. If the

	4957 ** attempt to reload content from the database is required and fails,

	4958 ** return an SQLite error code. Otherwise, SQLITE_OK.

	4959 */

	4960 static int pagerUndoCallback(void *pCtx, Pgno iPg){

	4961 int rc = SQLITE_OK;

	4962 Pager *pPager = (Pager *)pCtx;

	4963 PgHdr *pPg;

	4964

	4965 assert( pagerUseWal(pPager) );

	4966 pPg = sqlite3PagerLookup(pPager, iPg);

	4967 if( pPg ){

	4968 if( sqlite3PcachePageRefcount(pPg)==1 ){

	4969 sqlite3PcacheDrop(pPg);

	4970 }else{

	4971 u32 iFrame = 0;

	4972 rc = sqlite3WalFindFrame(pPager->pWal, pPg->pgno, &iFrame);

	4973 if( rc==SQLITE_OK ){

	4974 rc = readDbPage(pPg, iFrame);

	4975 }

	4976 if( rc==SQLITE_OK ){

	4977 pPager->xReiniter(pPg);

	4978 }

	4979 sqlite3PagerUnrefNotNull(pPg);

	4980 }

	4981 }

	4982

	4983 /* Normally, if a transaction is rolled back, any backup processes are

	4984 ** updated as data is copied out of the rollback journal and into the

	4985 ** database. This is not generally possible with a WAL database, as

	4986 ** rollback involves simply truncating the log file. Therefore, if one

	4987 ** or more frames have already been written to the log (and therefore

	4988 ** also copied into the backup databases) as part of this transaction,

	4989 ** the backups must be restarted.

	4990 */

	4991 sqlite3BackupRestart(pPager->pBackup);

	4992

	4993 return rc;

	4994 }

	4995

	4996 /*

	4997 ** This function is called to rollback a transaction on a WAL database.

	4998 */

	4999 static int pagerRollbackWal(Pager *pPager){

	5000 int rc; /* Return Code */

	5001 PgHdr *pList; /* List of dirty pages to revert */

	5002

	5003 /* For all pages in the cache that are currently dirty or have already

	5004 ** been written (but not committed) to the log file, do one of the

	5005 ** following:

	5006 **

	5007 ** + Discard the cached page (if refcount==0), or

	5008 ** + Reload page content from the database (if refcount>0).

	5009 */

	5010 pPager->dbSize = pPager->dbOrigSize;

	5011 rc = sqlite3WalUndo(pPager->pWal, pagerUndoCallback, (void *)pPager);

	5012 pList = sqlite3PcacheDirtyList(pPager->pPCache);

	5013 while( pList && rc==SQLITE_OK ){

	5014 PgHdr *pNext = pList->pDirty;

	5015 rc = pagerUndoCallback((void *)pPager, pList->pgno);

	5016 pList = pNext;

	5017 }

	5018

	5019 return rc;

	5020 }

	5021

	5022 /*

	5023 ** This function is a wrapper around sqlite3WalFrames(). As well as logging

	5024 ** the contents of the list of pages headed by pList (connected by pDirty),

	5025 ** this function notifies any active backup processes that the pages have

	5026 ** changed.

	5027 **

	5028 ** The list of pages passed into this routine is always sorted by page number.

	5029 ** Hence, if page 1 appears anywhere on the list, it will be the first page.

	5030 */

	5031 static int pagerWalFrames(

	5032 Pager *pPager, /* Pager object */

	5033 PgHdr *pList, /* List of frames to log */

	5034 Pgno nTruncate, /* Database size after this commit */

	5035 int isCommit /* True if this is a commit */

	5036 ){

	5037 int rc; /* Return code */

	5038 int nList; /* Number of pages in pList */

	5039 PgHdr *p; /* For looping over pages */

	5040

	5041 assert( pPager->pWal );

	5042 assert( pList );

	5043 #ifdef SQLITE_DEBUG

	5044 /* Verify that the page list is in ascending order */

	5045 for(p=pList; p && p->pDirty; p=p->pDirty){

	5046 assert( p->pgno < p->pDirty->pgno );

	5047 }

	5048 #endif

	5049

	5050 assert( pList->pDirty==0 || isCommit );

	5051 if( isCommit ){

	5052 /* If a WAL transaction is being committed, there is no point in writing

	5053 ** any pages with page numbers greater than nTruncate into the WAL file.

	5054 ** They will never be read by any client. So remove them from the pDirty

	5055 ** list here. */

	5056 PgHdr **ppNext = &pList;

	5057 nList = 0;

	5058 for(p=pList; (*ppNext = p)!=0; p=p->pDirty){

	5059 if( p->pgno<=nTruncate ){

	5060 ppNext = &p->pDirty;

	5061 nList++;

	5062 }

	5063 }

	5064 assert( pList );

	5065 }else{

	5066 nList = 1;

	5067 }

	5068 pPager->aStat[PAGER_STAT_WRITE] += nList;

	5069

	5070 if( pList->pgno==1 ) pager_write_changecounter(pList);

	5071 rc = sqlite3WalFrames(pPager->pWal,

	5072 pPager->pageSize, pList, nTruncate, isCommit, pPager->walSyncFlags

	5073 );

	5074 if( rc==SQLITE_OK && pPager->pBackup ){

	5075 for(p=pList; p; p=p->pDirty){

	5076 sqlite3BackupUpdate(pPager->pBackup, p->pgno, (u8 *)p->pData);

	5077 }

	5078 }

	5079

	5080 #ifdef SQLITE_CHECK_PAGES

	5081 pList = sqlite3PcacheDirtyList(pPager->pPCache);

	5082 for(p=pList; p; p=p->pDirty){

	5083 pager_set_pagehash(p);

	5084 }

	5085 #endif

	5086

	5087 return rc;

	5088 }
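
Review note: the commit branch of pagerWalFrames() prunes the dirty list with a pointer-to-pointer walk, which is easy to misread. The sketch below reproduces that idiom on a stripped-down node type. `ExPage` and `ex_prune_dirty_list` are hypothetical names, and the struct keeps only the two fields the loop touches.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for PgHdr: just a page number and the list link. */
typedef struct ExPage ExPage;
struct ExPage {
  unsigned pgno;     /* Page number */
  ExPage *pDirty;    /* Next page on the dirty list */
};

/* Unlink every node whose pgno exceeds nTruncate. Returns the number of
** nodes kept and updates *ppHead to the new list head (possibly NULL). */
static int ex_prune_dirty_list(ExPage **ppHead, unsigned nTruncate){
  ExPage **ppNext = ppHead;  /* Link that the next kept node is stored in */
  ExPage *p;
  int nKept = 0;
  assert( ppHead!=0 );
  for(p=*ppHead; (*ppNext = p)!=0; p=p->pDirty){
    if( p->pgno<=nTruncate ){
      ppNext = &p->pDirty;   /* Keep p; later nodes attach after it */
      nKept++;
    }
  }
  return nKept;
}
```

Because each iteration overwrites the link held in `ppNext` with the current node, a skipped node is dropped as soon as a later node (or the terminating NULL) is stored in its place, so no separate unlink step is needed.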

	5089

	5090 /*

	5091 ** Begin a read transaction on the WAL.

	5092 **

	5093 ** This routine used to be called "pagerOpenSnapshot()" because it essentially

	5094 ** makes a snapshot of the database at the current point in time and preserves

	5095 ** that snapshot for use by the reader in spite of concurrent changes by

	5096 ** other writers or checkpointers.

	5097 */

	5098 static int pagerBeginReadTransaction(Pager *pPager){

	5099 int rc; /* Return code */

	5100 int changed = 0; /* True if cache must be reset */

	5101

	5102 assert( pagerUseWal(pPager) );

	5103 assert( pPager->eState==PAGER_OPEN || pPager->eState==PAGER_READER );

	5104

	5105 /* sqlite3WalEndReadTransaction() was not called for the previous

	5106 ** transaction in locking_mode=EXCLUSIVE. So call it now. If we

	5107 ** are in locking_mode=NORMAL and EndRead() was previously called,

	5108 ** the duplicate call is harmless.

	5109 */

	5110 sqlite3WalEndReadTransaction(pPager->pWal);

	5111

	5112 rc = sqlite3WalBeginReadTransaction(pPager->pWal, &changed);

	5113 if( rc!=SQLITE_OK || changed ){

	5114 pager_reset(pPager);

	5115 if( USEFETCH(pPager) ) sqlite3OsUnfetch(pPager->fd, 0, 0);

	5116 }

	5117

	5118 return rc;

	5119 }

	5120 #endif

	5121

	5122 /*

	5123 ** This function is called as part of the transition from PAGER_OPEN

	5124 ** to PAGER_READER state to determine the size of the database file

	5125 ** in pages (assuming the page size currently stored in Pager.pageSize).

	5126 **

	5127 ** If no error occurs, SQLITE_OK is returned and the size of the database

	5128 ** in pages is stored in *pnPage. Otherwise, an error code (perhaps

	5129 ** SQLITE_IOERR_FSTAT) is returned and *pnPage is left unmodified.

	5130 */

	5131 static int pagerPagecount(Pager *pPager, Pgno *pnPage){

	5132 Pgno nPage; /* Value to return via *pnPage */

	5133

	5134 /* Query the WAL sub-system for the database size. The WalDbsize()

	5135 ** function returns zero if the WAL is not open (i.e. Pager.pWal==0), or

	5136 ** if the database size is not available. The database size is not

	5137 ** available from the WAL sub-system if the log file is empty or

	5138 ** contains no valid committed transactions.

	5139 */

	5140 assert( pPager->eState==PAGER_OPEN );

	5141 assert( pPager->eLock>=SHARED_LOCK );

	5142 assert( isOpen(pPager->fd) );

	5143 assert( pPager->tempFile==0 );

	5144 nPage = sqlite3WalDbsize(pPager->pWal);

	5145

	5146 /* If the number of pages in the database is not available from the

	5147 ** WAL sub-system, determine the page count based on the size of

	5148 ** the database file. If the size of the database file is not an

	5149 ** integer multiple of the page-size, round up the result.

	5150 */

	5151 if( nPage==0 && ALWAYS(isOpen(pPager->fd)) ){

	5152 i64 n = 0; /* Size of db file in bytes */

	5153 int rc = sqlite3OsFileSize(pPager->fd, &n);

	5154 if( rc!=SQLITE_OK ){

	5155 return rc;

	5156 }

	5157 nPage = (Pgno)((n+pPager->pageSize-1) / pPager->pageSize);

	5158 }

	5159

	5160 /* If the current number of pages in the file is greater than the

	5161 ** configured maximum page number, increase the allowed limit so

	5162 ** that the file can be read.

	5163 */

	5164 if( nPage>pPager->mxPgno ){

	5165 pPager->mxPgno = (Pgno)nPage;

	5166 }

	5167

	5168 *pnPage = nPage;

	5169 return SQLITE_OK;

	5170 }
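
Review note: the fallback in pagerPagecount() rounds the file size up to a whole number of pages. A minimal sketch of that ceiling division follows; `ex_page_count` is a hypothetical name and the function ignores the WAL lookup and the mxPgno adjustment that the real routine also performs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of pages needed to cover file_bytes,
** counting a trailing partial page as a full page. */
static uint32_t ex_page_count(int64_t file_bytes, int64_t page_size){
  assert( page_size>0 && file_bytes>=0 );
  return (uint32_t)((file_bytes + page_size - 1)/page_size);
}
```

With a 4096-byte page, a 4096-byte file counts as one page and a 4097-byte file counts as two.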

	5171

	5172 #ifndef SQLITE_OMIT_WAL

	5173 /*

	5174 ** Check if the *-wal file that corresponds to the database opened by pPager

	5175 ** exists if the database is not empty, or verify that the *-wal file does

	5176 ** not exist (by deleting it) if the database file is empty.

	5177 **

	5178 ** If the database is not empty and the *-wal file exists, open the pager

	5179 ** in WAL mode. If the database is empty or if no *-wal file exists and

	5180 ** if no error occurs, make sure Pager.journalMode is not set to

	5181 ** PAGER_JOURNALMODE_WAL.

	5182 **

	5183 ** Return SQLITE_OK or an error code.

	5184 **

	5185 ** The caller must hold a SHARED lock on the database file to call this

	5186 ** function. Because an EXCLUSIVE lock on the db file is required to delete

	5187 ** a WAL on a non-empty database, this ensures there is no race condition

	5188 ** between the xAccess() below and an xDelete() being executed by some

	5189 ** other connection.

	5190 */

	5191 static int pagerOpenWalIfPresent(Pager *pPager){

	5192 int rc = SQLITE_OK;

	5193 assert( pPager->eState==PAGER_OPEN );

	5194 assert( pPager->eLock>=SHARED_LOCK );

	5195

	5196 if( !pPager->tempFile ){

	5197 int isWal; /* True if WAL file exists */

	5198 Pgno nPage; /* Size of the database file */

	5199

	5200 rc = pagerPagecount(pPager, &nPage);

	5201 if( rc ) return rc;

	5202 if( nPage==0 ){

	5203 rc = sqlite3OsDelete(pPager->pVfs, pPager->zWal, 0);

	5204 if( rc==SQLITE_IOERR_DELETE_NOENT ) rc = SQLITE_OK;

	5205 isWal = 0;

	5206 }else{

	5207 rc = sqlite3OsAccess(

	5208 pPager->pVfs, pPager->zWal, SQLITE_ACCESS_EXISTS, &isWal

	5209 );

	5210 }

	5211 if( rc==SQLITE_OK ){

	5212 if( isWal ){

	5213 testcase( sqlite3PcachePagecount(pPager->pPCache)==0 );

	5214 rc = sqlite3PagerOpenWal(pPager, 0);

	5215 }else if( pPager->journalMode==PAGER_JOURNALMODE_WAL ){

	5216 pPager->journalMode = PAGER_JOURNALMODE_DELETE;

	5217 }

	5218 }

	5219 }

	5220 return rc;

	5221 }

	5222 #endif

	5223

	5224 /*

	5225 ** Playback savepoint pSavepoint. Or, if pSavepoint==NULL, then playback

	5226 ** the entire master journal file. The case pSavepoint==NULL occurs when

	5227 ** a ROLLBACK TO command is invoked on a SAVEPOINT that is a transaction

	5228 ** savepoint.

	5229 **

	5230 ** When pSavepoint is not NULL (meaning a non-transaction savepoint is

	5231 ** being rolled back), then the rollback consists of up to three stages,

	5232 ** performed in the order specified:

	5233 **

	5234 ** * Pages are played back from the main journal starting at byte

	5235 ** offset PagerSavepoint.iOffset and continuing to

	5236 ** PagerSavepoint.iHdrOffset, or to the end of the main journal

	5237 ** file if PagerSavepoint.iHdrOffset is zero.

	5238 **

	5239 ** * If PagerSavepoint.iHdrOffset is not zero, then pages are played

	5240 ** back starting from the journal header immediately following

	5241 ** PagerSavepoint.iHdrOffset to the end of the main journal file.

	5242 **

	5243 ** * Pages are then played back from the sub-journal file, starting

	5244 ** with the PagerSavepoint.iSubRec and continuing to the end of

	5245 ** the journal file.

	5246 **

	5247 ** Throughout the rollback process, each time a page is rolled back, the

	5248 ** corresponding bit is set in a bitvec structure (variable pDone in the

	5249 ** implementation below). This is used to ensure that a page is only

	5250 ** rolled back the first time it is encountered in either journal.

	5251 **

	5252 ** If pSavepoint is NULL, then pages are only played back from the main

	5253 ** journal file. There is no need for a bitvec in this case.

	5254 **

	5255 ** In either case, before playback commences the Pager.dbSize variable

	5256 ** is reset to the value that it held at the start of the savepoint

	5257 ** (or transaction). No page with a page-number greater than this value

	5258 ** is played back. If one is encountered it is simply skipped.

	5259 */

	5260 static int pagerPlaybackSavepoint(Pager *pPager, PagerSavepoint *pSavepoint){

	5261 i64 szJ; /* Effective size of the main journal */

	5262 i64 iHdrOff; /* End of first segment of main-journal records */

	5263 int rc = SQLITE_OK; /* Return code */

	5264 Bitvec *pDone = 0; /* Bitvec to ensure pages played back only once */

	5265

	5266 assert( pPager->eState!=PAGER_ERROR );

	5267 assert( pPager->eState>=PAGER_WRITER_LOCKED );

	5268

	5269 /* Allocate a bitvec to use to store the set of pages rolled back */

	5270 if( pSavepoint ){

	5271 pDone = sqlite3BitvecCreate(pSavepoint->nOrig);

	5272 if( !pDone ){

	5273 return SQLITE_NOMEM_BKPT;

	5274 }

	5275 }

	5276

	5277 /* Set the database size back to the value it was before the savepoint

	5278 ** being reverted was opened.

	5279 */

	5280 pPager->dbSize = pSavepoint ? pSavepoint->nOrig : pPager->dbOrigSize;

	5281 pPager->changeCountDone = pPager->tempFile;

	5282

	5283 if( !pSavepoint && pagerUseWal(pPager) ){

	5284 return pagerRollbackWal(pPager);

	5285 }

	5286

	5287 /* Use pPager->journalOff as the effective size of the main rollback

	5288 ** journal. The actual file might be larger than this in

	5289 ** PAGER_JOURNALMODE_TRUNCATE or PAGER_JOURNALMODE_PERSIST. But anything

	5290 ** past pPager->journalOff is off-limits to us.

	5291 */

	5292 szJ = pPager->journalOff;

	5293 assert( pagerUseWal(pPager)==0 || szJ==0 );

	5294

	5295 /* Begin by rolling back records from the main journal starting at

	5296 ** PagerSavepoint.iOffset and continuing to the next journal header.

	5297 ** There might be records in the main journal that have a page number

	5298 ** greater than the current database size (pPager->dbSize) but those

	5299 ** will be skipped automatically. Pages are added to pDone as they

	5300 ** are played back.

	5301 */

	5302 if( pSavepoint && !pagerUseWal(pPager) ){

	5303 iHdrOff = pSavepoint->iHdrOffset ? pSavepoint->iHdrOffset : szJ;

	5304 pPager->journalOff = pSavepoint->iOffset;

	5305 while( rc==SQLITE_OK && pPager->journalOff<iHdrOff ){

	5306 rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);

	5307 }

	5308 assert( rc!=SQLITE_DONE );

	5309 }else{

	5310 pPager->journalOff = 0;

	5311 }

	5312

	5313 /* Continue rolling back records out of the main journal starting at

	5314 ** the first journal header seen and continuing until the effective end

	5315 ** of the main journal file. Continue to skip out-of-range pages and

	5316 ** continue adding pages rolled back to pDone.

	5317 */

	5318 while( rc==SQLITE_OK && pPager->journalOff<szJ ){

	5319 u32 ii; /* Loop counter */

	5320 u32 nJRec = 0; /* Number of Journal Records */

	5321 u32 dummy;

	5322 rc = readJournalHdr(pPager, 0, szJ, &nJRec, &dummy);

	5323 assert( rc!=SQLITE_DONE );

	5324

	5325 /*

	5326 ** The "pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff"

	5327 ** test is related to ticket #2565. See the discussion in the

	5328 ** pager_playback() function for additional information.

	5329 */

	5330 if( nJRec==0

	5331 && pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff

	5332 ){

	5333 nJRec = (u32)((szJ - pPager->journalOff)/JOURNAL_PG_SZ(pPager));

	5334 }

	5335 for(ii=0; rc==SQLITE_OK && ii<nJRec && pPager->journalOff<szJ; ii++){

	5336 rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);

	5337 }

	5338 assert( rc!=SQLITE_DONE );

	5339 }

	5340 assert( rc!=SQLITE_OK || pPager->journalOff>=szJ );

	5341

	5342 /* Finally, rollback pages from the sub-journal. Pages that were

	5343 ** previously rolled back out of the main journal (and are hence in pDone)

	5344 ** will be skipped. Out-of-range pages are also skipped.

	5345 */

	5346 if( pSavepoint ){

	5347 u32 ii; /* Loop counter */

	5348 i64 offset = (i64)pSavepoint->iSubRec*(4+pPager->pageSize);

	5349

	5350 if( pagerUseWal(pPager) ){

	5351 rc = sqlite3WalSavepointUndo(pPager->pWal, pSavepoint->aWalData);

	5352 }

	5353 for(ii=pSavepoint->iSubRec; rc==SQLITE_OK && ii<pPager->nSubRec; ii++){

	5354 assert( offset==(i64)ii*(4+pPager->pageSize) );

	5355 rc = pager_playback_one_page(pPager, &offset, pDone, 0, 1);

	5356 }

	5357 assert( rc!=SQLITE_DONE );

	5358 }

	5359

	5360 sqlite3BitvecDestroy(pDone);

	5361 if( rc==SQLITE_OK ){

	5362 pPager->journalOff = szJ;

	5363 }

	5364

	5365 return rc;

	5366 }
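
The sub-journal loop above relies on fixed-size records: each record is addressed at `ii*(4+pageSize)`, which suggests a 4-byte page number followed by one page of content. A minimal sketch of that arithmetic follows; the helper names `subjournal_offset` and `subjournal_records_to_replay` are hypothetical and are not part of the pager.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of sub-journal record iRec, assuming every record is a
** 4-byte page number followed by pageSize bytes of page content. */
static int64_t subjournal_offset(uint32_t iRec, uint32_t pageSize){
  assert( pageSize>0 );
  return (int64_t)iRec * (4 + (int64_t)pageSize);
}

/* Number of records a savepoint rollback would visit: everything from
** the savepoint's first record (iSubRec) up to the current count (nSubRec). */
static uint32_t subjournal_records_to_replay(uint32_t iSubRec, uint32_t nSubRec){
  return nSubRec>iSubRec ? nSubRec - iSubRec : 0;
}
```

With a 4096-byte page, record 3 would start at byte 3*4100 = 12300, which is the invariant the `assert( offset==(i64)ii*(4+pPager->pageSize) )` line checks on each iteration.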

	5367

	5368 /*

	5369 ** Change the maximum number of in-memory pages that are allowed

	5370 ** before attempting to recycle clean and unused pages.

	5371 */

	5372 SQLITE_PRIVATE void sqlite3PagerSetCachesize(Pager *pPager, int mxPage){

	5373 sqlite3PcacheSetCachesize(pPager->pPCache, mxPage);

	5374 }

	5375

	5376 /*

	5377 ** Change the maximum number of in-memory pages that are allowed

	5378 ** before attempting to spill pages to journal.

	5379 */

	5380 SQLITE_PRIVATE int sqlite3PagerSetSpillsize(Pager *pPager, int mxPage){

	5381 return sqlite3PcacheSetSpillsize(pPager->pPCache, mxPage);

	5382 }

	5383

	5384 /*

	5385 ** Invoke SQLITE_FCNTL_MMAP_SIZE based on the current value of szMmap.

	5386 */

	5387 static void pagerFixMaplimit(Pager *pPager){

	5388 #if SQLITE_MAX_MMAP_SIZE>0

	5389 sqlite3_file *fd = pPager->fd;

	5390 if( isOpen(fd) && fd->pMethods->iVersion>=3 ){

	5391 sqlite3_int64 sz;

	5392 sz = pPager->szMmap;

	5393 pPager->bUseFetch = (sz>0);

	5394 setGetterMethod(pPager);

	5395 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_MMAP_SIZE, &sz);

	5396 }

	5397 #endif

	5398 }

	5399

	5400 /*

	5401 ** Change the maximum size of any memory mapping made of the database file.

	5402 */

	5403 SQLITE_PRIVATE void sqlite3PagerSetMmapLimit(Pager *pPager, sqlite3_int64 szMmap ){

	5404 pPager->szMmap = szMmap;

	5405 pagerFixMaplimit(pPager);

	5406 }

	5407

	5408 /*

	5409 ** Free as much memory as possible from the pager.

	5410 */

	5411 SQLITE_PRIVATE void sqlite3PagerShrink(Pager *pPager){

	5412 sqlite3PcacheShrink(pPager->pPCache);

	5413 }

	5414

	5415 /*

	5416 ** Adjust settings of the pager to those specified in the pgFlags parameter.

	5417 **

	5418 ** The "level" in pgFlags & PAGER_SYNCHRONOUS_MASK sets the robustness

	5419 ** of the database to damage due to OS crashes or power failures by

	5420 ** changing the number of sync()s when writing the journals.

	5421 ** There are four levels:

	5422 **

	5423 ** OFF sqlite3OsSync() is never called. This is the default

	5424 ** for temporary and transient files.

	5425 **

	5426 ** NORMAL The journal is synced once before writes begin on the

	5427 ** database. This is normally adequate protection, but

	5428 ** it is theoretically possible, though very unlikely,

	5429 ** that an inopportune power failure could leave the journal

	5430 ** in a state which would cause damage to the database

	5431 ** when it is rolled back.

	5432 **

	5433 ** FULL The journal is synced twice before writes begin on the

	5434 ** database (with some additional information - the nRec field

	5435 ** of the journal header - being written in between the two

	5436 ** syncs). If we assume that writing a

	5437 ** single disk sector is atomic, then this mode provides

	5438 ** assurance that the journal will not be corrupted to the

	5439 ** point of causing damage to the database during rollback.

	5440 **

	5441 ** EXTRA This is like FULL except that it also syncs the directory

	5442 ** that contains the rollback journal after the rollback

	5443 ** journal is unlinked.

	5444 **

	5445 ** The above is for a rollback-journal mode. For WAL mode, OFF continues

	5446 ** to mean that no syncs ever occur. NORMAL means that the WAL is synced

	5447 ** prior to the start of checkpoint and that the database file is synced

	5448 ** at the conclusion of the checkpoint if the entire content of the WAL

	5449 ** was written back into the database. But no sync operations occur for

	5450 ** an ordinary commit in NORMAL mode with WAL. FULL means that the WAL

	5451 ** file is synced following each commit operation, in addition to the

	5452 ** syncs associated with NORMAL. There is no difference between FULL

	5453 ** and EXTRA for WAL mode.

	5454 **

	5455 ** Do not confuse synchronous=FULL with SQLITE_SYNC_FULL. The

	5456 ** SQLITE_SYNC_FULL macro means to use the MacOSX-style full-fsync

	5457 ** using fcntl(F_FULLFSYNC). SQLITE_SYNC_NORMAL means to do an

	5458 ** ordinary fsync() call. There is no difference between SQLITE_SYNC_FULL

	5459 ** and SQLITE_SYNC_NORMAL on platforms other than MacOSX. But the

	5460 ** synchronous=FULL versus synchronous=NORMAL setting determines when

	5461 ** the xSync primitive is called and is relevant to all platforms.

	5462 **

	5463 ** Numeric values associated with these states are OFF==1, NORMAL=2,

	5464 ** and FULL=3.

	5465 */

	5466 #ifndef SQLITE_OMIT_PAGER_PRAGMAS

	5467 SQLITE_PRIVATE void sqlite3PagerSetFlags(

	5468 Pager *pPager, /* The pager to set safety level for */

	5469 unsigned pgFlags /* Various flags */

	5470 ){

	5471 unsigned level = pgFlags & PAGER_SYNCHRONOUS_MASK;

	5472 if( pPager->tempFile ){

	5473 pPager->noSync = 1;

	5474 pPager->fullSync = 0;

	5475 pPager->extraSync = 0;

	5476 }else{

	5477 pPager->noSync = level==PAGER_SYNCHRONOUS_OFF ?1:0;

	5478 pPager->fullSync = level>=PAGER_SYNCHRONOUS_FULL ?1:0;

	5479 pPager->extraSync = level==PAGER_SYNCHRONOUS_EXTRA ?1:0;

	5480 }

	5481 if( pPager->noSync ){

	5482 pPager->syncFlags = 0;

	5483 pPager->ckptSyncFlags = 0;

	5484 }else if( pgFlags & PAGER_FULLFSYNC ){

	5485 pPager->syncFlags = SQLITE_SYNC_FULL;

	5486 pPager->ckptSyncFlags = SQLITE_SYNC_FULL;

	5487 }else if( pgFlags & PAGER_CKPT_FULLFSYNC ){

	5488 pPager->syncFlags = SQLITE_SYNC_NORMAL;

	5489 pPager->ckptSyncFlags = SQLITE_SYNC_FULL;

	5490 }else{

	5491 pPager->syncFlags = SQLITE_SYNC_NORMAL;

	5492 pPager->ckptSyncFlags = SQLITE_SYNC_NORMAL;

	5493 }

	5494 pPager->walSyncFlags = pPager->syncFlags;

	5495 if( pPager->fullSync ){

	5496 pPager->walSyncFlags |= WAL_SYNC_TRANSACTIONS;

	5497 }

	5498 if( pgFlags & PAGER_CACHESPILL ){

	5499 pPager->doNotSpill &= ~SPILLFLAG_OFF;

	5500 }else{

	5501 pPager->doNotSpill |= SPILLFLAG_OFF;

	5502 }

	5503 }

	5504 #endif
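
The first half of sqlite3PagerSetFlags() reduces the synchronous level to three booleans. The sketch below isolates that decoding. The `SYNC_*` macros are hypothetical stand-ins for the PAGER_SYNCHRONOUS_* constants: OFF=1, NORMAL=2 and FULL=3 follow the comment above, while EXTRA=4 and the mask value 7 are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for PAGER_SYNCHRONOUS_*; EXTRA and MASK are assumed. */
#define SYNC_OFF    1u
#define SYNC_NORMAL 2u
#define SYNC_FULL   3u
#define SYNC_EXTRA  4u
#define SYNC_MASK   7u

typedef struct SyncSettings {
  int noSync;     /* Never sync */
  int fullSync;   /* Extra sync before the nRec update */
  int extraSync;  /* Also sync the directory after the journal is unlinked */
} SyncSettings;

/* Decode the synchronous level the same way the function above does:
** temporary files never sync; otherwise the level selects the flags. */
static SyncSettings decode_sync_level(unsigned pgFlags, int isTempFile){
  SyncSettings s = {0, 0, 0};
  unsigned level = pgFlags & SYNC_MASK;
  assert( level>=SYNC_OFF && level<=SYNC_EXTRA );
  if( isTempFile ){
    s.noSync = 1;
  }else{
    s.noSync = (level==SYNC_OFF);
    s.fullSync = (level>=SYNC_FULL);
    s.extraSync = (level==SYNC_EXTRA);
  }
  return s;
}
```

Note that EXTRA implies FULL because `fullSync` is set with `>=`, which matches the description of EXTRA as "like FULL" plus a directory sync.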

	5505

	5506 /*

	5507 ** The following global variable is incremented whenever the library

	5508 ** attempts to open a temporary file. This information is used for

	5509 ** testing and analysis only.

	5510 */

	5511 #ifdef SQLITE_TEST

	5512 SQLITE_API int sqlite3_opentemp_count = 0;

	5513 #endif

	5514

	5515 /*

	5516 ** Open a temporary file.

	5517 **

	5518 ** Write the file descriptor into *pFile. Return SQLITE_OK on success

	5519 ** or some other error code if we fail. The OS will automatically

	5520 ** delete the temporary file when it is closed.

	5521 **

	5522 ** The flags passed to the VFS layer xOpen() call are those specified

	5523 ** by parameter vfsFlags ORed with the following:

	5524 **

	5525 ** SQLITE_OPEN_READWRITE

	5526 ** SQLITE_OPEN_CREATE

	5527 ** SQLITE_OPEN_EXCLUSIVE

	5528 ** SQLITE_OPEN_DELETEONCLOSE

	5529 */

	5530 static int pagerOpentemp(

	5531 Pager *pPager, /* The pager object */

	5532 sqlite3_file *pFile, /* Write the file descriptor here */

	5533 int vfsFlags /* Flags passed through to the VFS */

	5534 ){

	5535 int rc; /* Return code */

	5536

	5537 #ifdef SQLITE_TEST

	5538 sqlite3_opentemp_count++; /* Used for testing and analysis only */

	5539 #endif

	5540

	5541 vfsFlags |= SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |

	5542 SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_DELETEONCLOSE;

	5543 rc = sqlite3OsOpen(pPager->pVfs, 0, pFile, vfsFlags, 0);

	5544 assert( rc!=SQLITE_OK || isOpen(pFile) );

	5545 return rc;

	5546 }

	5547

	5548 /*

	5549 ** Set the busy handler function.

	5550 **

	5551 ** The pager invokes the busy-handler if sqlite3OsLock() returns

	5552 ** SQLITE_BUSY when trying to upgrade from no-lock to a SHARED lock,

	5553 ** or when trying to upgrade from a RESERVED lock to an EXCLUSIVE

	5554 ** lock. It does not invoke the busy handler when upgrading from

	5555 ** SHARED to RESERVED, or when upgrading from SHARED to EXCLUSIVE

	5556 ** (which occurs during hot-journal rollback). Summary:

	5557 **

	5558 ** Transition | Invokes xBusyHandler

	5559 ** --------------------------------------------------------

	5560 ** NO_LOCK -> SHARED_LOCK | Yes

	5561 ** SHARED_LOCK -> RESERVED_LOCK | No

	5562 ** SHARED_LOCK -> EXCLUSIVE_LOCK | No

	5563 ** RESERVED_LOCK -> EXCLUSIVE_LOCK | Yes

	5564 **

	5565 ** If the busy-handler callback returns non-zero, the lock is

	5566 ** retried. If it returns zero, then the SQLITE_BUSY error is

	5567 ** returned to the caller of the pager API function.

	5568 */

	5569 SQLITE_PRIVATE void sqlite3PagerSetBusyhandler(

	5570 Pager *pPager, /* Pager object */

	5571 int (*xBusyHandler)(void *), /* Pointer to busy-handler function */

	5572 void *pBusyHandlerArg /* Argument to pass to xBusyHandler */

	5573 ){

	5574 pPager->xBusyHandler = xBusyHandler;

	5575 pPager->pBusyHandlerArg = pBusyHandlerArg;

	5576

	5577 if( isOpen(pPager->fd) ){

	5578 void **ap = (void **)&pPager->xBusyHandler;

	5579 assert( ((int(*)(void *))(ap[0]))==xBusyHandler );

	5580 assert( ap[1]==pBusyHandlerArg );

	5581 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_BUSYHANDLER, (void *)ap);

	5582 }

	5583 }

	5584

	5585 /*

	5586 ** Change the page size used by the Pager object. The new page size

	5587 ** is passed in *pPageSize.

	5588 **

	5589 ** If the pager is in the error state when this function is called, it

	5590 ** is a no-op. The value returned is the error state error code (i.e.

	5591 ** one of SQLITE_IOERR, an SQLITE_IOERR_xxx sub-code or SQLITE_FULL).

	5592 **

	5593 ** Otherwise, if all of the following are true:

	5594 **

	5595 ** * the new page size (value of *pPageSize) is valid (a power

	5596 ** of two between 512 and SQLITE_MAX_PAGE_SIZE, inclusive), and

	5597 **

	5598 ** * there are no outstanding page references, and

	5599 **

	5600 ** * the database is either not an in-memory database or it is

	5601 ** an in-memory database that currently consists of zero pages.

	5602 **

	5603 ** then the pager object page size is set to *pPageSize.

	5604 **

	5605 ** If the page size is changed, then this function uses sqlite3PageMalloc()

	5606 ** to obtain a new Pager.pTmpSpace buffer. If this allocation attempt

	5607 ** fails, SQLITE_NOMEM is returned and the page size remains unchanged.

	5608 ** In all other cases, SQLITE_OK is returned.

	5609 **

	5610 ** If the page size is not changed, either because one of the enumerated

	5611 ** conditions above is not true, the pager was in error state when this

	5612 ** function was called, or because the memory allocation attempt failed,

	5613 ** then *pPageSize is set to the old, retained page size before returning.

	5614 */

	5615 SQLITE_PRIVATE int sqlite3PagerSetPagesize(Pager *pPager, u32 *pPageSize, int nReserve){

	5616 int rc = SQLITE_OK;

	5617

	5618 /* It is not possible to do a full assert_pager_state() here, as this

	5619 ** function may be called from within PagerOpen(), before the state

	5620 ** of the Pager object is internally consistent.

	5621 **

	5622 ** At one point this function returned an error if the pager was in

	5623 ** PAGER_ERROR state. But since PAGER_ERROR state guarantees that

	5624 ** there is at least one outstanding page reference, this function

	5625 ** is a no-op for that case anyhow.

	5626 */

	5627

	5628 u32 pageSize = *pPageSize;

	5629 assert( pageSize==0 || (pageSize>=512 && pageSize<=SQLITE_MAX_PAGE_SIZE) );

	5630 if( (pPager->memDb==0 || pPager->dbSize==0)

	5631 && sqlite3PcacheRefCount(pPager->pPCache)==0

	5632 && pageSize && pageSize!=(u32)pPager->pageSize

	5633 ){

	5634 char *pNew = NULL; /* New temp space */

	5635 i64 nByte = 0;

	5636

	5637 if( pPager->eState>PAGER_OPEN && isOpen(pPager->fd) ){

	5638 rc = sqlite3OsFileSize(pPager->fd, &nByte);

	5639 }

	5640 if( rc==SQLITE_OK ){

	5641 pNew = (char *)sqlite3PageMalloc(pageSize);

	5642 if( !pNew ) rc = SQLITE_NOMEM_BKPT;

	5643 }

	5644

	5645 if( rc==SQLITE_OK ){

	5646 pager_reset(pPager);

	5647 rc = sqlite3PcacheSetPageSize(pPager->pPCache, pageSize);

	5648 }

	5649 if( rc==SQLITE_OK ){

	5650 sqlite3PageFree(pPager->pTmpSpace);

	5651 pPager->pTmpSpace = pNew;

	5652 pPager->dbSize = (Pgno)((nByte+pageSize-1)/pageSize);

	5653 pPager->pageSize = pageSize;

	5654 }else{

	5655 sqlite3PageFree(pNew);

	5656 }

	5657 }

	5658

	5659 *pPageSize = pPager->pageSize;

	5660 if( rc==SQLITE_OK ){

	5661 if( nReserve<0 ) nReserve = pPager->nReserve;

	5662 assert( nReserve>=0 && nReserve<1000 );

	5663 pPager->nReserve = (i16)nReserve;

	5664 pagerReportSize(pPager);

	5665 pagerFixMaplimit(pPager);

	5666 }

	5667 return rc;

	5668 }
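
Two small pieces of arithmetic in sqlite3PagerSetPagesize() are easy to check in isolation: the validity rule for a page size (a power of two between 512 and the maximum, inclusive) and the rounded-up page count `(nByte+pageSize-1)/pageSize`. The sketch below is illustrative only; `MAX_PAGE_SIZE` is a hypothetical stand-in for SQLITE_MAX_PAGE_SIZE with an assumed value of 65536, and both helper names are invented here.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PAGE_SIZE 65536u  /* assumed stand-in for SQLITE_MAX_PAGE_SIZE */

/* A page size is valid if it is a power of two in [512, MAX_PAGE_SIZE]. */
static int is_valid_page_size(uint32_t pageSize){
  return pageSize>=512 && pageSize<=MAX_PAGE_SIZE
      && (pageSize & (pageSize-1))==0;
}

/* Number of pages needed to cover nByte bytes, rounding a partial
** trailing page up to a whole page. */
static uint32_t page_count_for_file(int64_t nByte, uint32_t pageSize){
  assert( nByte>=0 );
  assert( is_valid_page_size(pageSize) );
  return (uint32_t)((nByte + pageSize - 1) / pageSize);
}
```

The round-up means a file of 1 to pageSize bytes counts as one page, and an empty file counts as zero pages.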

	5669

	5670 /*

	5671 ** Return a pointer to the "temporary page" buffer held internally

	5672 ** by the pager. This is a buffer that is big enough to hold the

	5673 ** entire content of a database page. This buffer is used internally

	5674 ** during rollback and will be overwritten whenever a rollback

	5675 ** occurs. But other modules are free to use it too, as long as

	5676 ** no rollbacks are happening.

	5677 */

	5678 SQLITE_PRIVATE void *sqlite3PagerTempSpace(Pager *pPager){

	5679 return pPager->pTmpSpace;

	5680 }

	5681

	5682 /*

	5683 ** Attempt to set the maximum database page count if mxPage is positive.

	5684 ** Make no changes if mxPage is zero or negative. And never reduce the

	5685 ** maximum page count below the current size of the database.

	5686 **

	5687 ** Regardless of mxPage, return the current maximum page count.

	5688 */

	5689 SQLITE_PRIVATE int sqlite3PagerMaxPageCount(Pager *pPager, int mxPage){

	5690 if( mxPage>0 ){

	5691 pPager->mxPgno = mxPage;

	5692 }

	5693 assert( pPager->eState!=PAGER_OPEN ); /* Called only by OP_MaxPgcnt */

	5694 assert( pPager->mxPgno>=pPager->dbSize ); /* OP_MaxPgcnt enforces this */

	5695 return pPager->mxPgno;

	5696 }

	5697

	5698 /*

	5699 ** The following set of routines are used to disable the simulated

	5700 ** I/O error mechanism. These routines are used to avoid simulated

	5701 ** errors in places where we do not care about errors.

	5702 **

	5703 ** Unless -DSQLITE_TEST=1 is used, these routines are all no-ops

	5704 ** and generate no code.

	5705 */

	5706 #ifdef SQLITE_TEST

	5707 SQLITE_API extern int sqlite3_io_error_pending;

	5708 SQLITE_API extern int sqlite3_io_error_hit;

	5709 static int saved_cnt;

	5710 void disable_simulated_io_errors(void){

	5711 saved_cnt = sqlite3_io_error_pending;

	5712 sqlite3_io_error_pending = -1;

	5713 }

	5714 void enable_simulated_io_errors(void){

	5715 sqlite3_io_error_pending = saved_cnt;

	5716 }

	5717 #else

	5718 # define disable_simulated_io_errors()

	5719 # define enable_simulated_io_errors()

	5720 #endif

	5721

	5722 /*

	5723 ** Read the first N bytes from the beginning of the file into memory

	5724 ** that pDest points to.

	5725 **

	5726 ** If the pager was opened on a transient file (zFilename==""), or

	5727 ** opened on a file less than N bytes in size, the output buffer is

	5728 ** zeroed and SQLITE_OK returned. The rationale for this is that this

	5729 ** function is used to read database headers, and a new transient or

	5730 ** zero sized database has a header that consists entirely of zeroes.

	5731 **

	5732 ** If any IO error apart from SQLITE_IOERR_SHORT_READ is encountered,

	5733 ** the error code is returned to the caller and the contents of the

	5734 ** output buffer are undefined.

	5735 */

	5736 SQLITE_PRIVATE int sqlite3PagerReadFileheader(Pager *pPager, int N, unsigned char *pDest){

	5737 int rc = SQLITE_OK;

	5738 memset(pDest, 0, N);

	5739 assert( isOpen(pPager->fd) || pPager->tempFile );

	5740

	5741 /* This routine is only called by btree immediately after creating

	5742 ** the Pager object. There has not been an opportunity to transition

	5743 ** to WAL mode yet.

	5744 */

	5745 assert( !pagerUseWal(pPager) );

	5746

	5747 if( isOpen(pPager->fd) ){

	5748 IOTRACE(("DBHDR %p 0 %d\n", pPager, N))

	5749 rc = sqlite3OsRead(pPager->fd, pDest, N, 0);

	5750 if( rc==SQLITE_IOERR_SHORT_READ ){

	5751 rc = SQLITE_OK;

	5752 }

	5753 }

	5754 return rc;

	5755 }
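
The zero-fill behaviour of sqlite3PagerReadFileheader() can be modelled without any file I/O. In the sketch below the "file" is just a byte array, and `read_header` is a hypothetical helper, not an SQLite function. It clears the destination first, then copies whatever bytes exist, so a short or absent file yields trailing zeroes rather than an error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the first n bytes of an in-memory "file" into pDest. If the file
** is shorter than n bytes (or absent), the remainder of pDest stays zero,
** mirroring how a short read is treated as success above. */
static void read_header(const unsigned char *pFile, size_t nFile,
                        unsigned char *pDest, size_t n){
  size_t nCopy = nFile<n ? nFile : n;
  assert( pDest!=NULL );
  memset(pDest, 0, n);
  if( pFile!=NULL && nCopy>0 ){
    memcpy(pDest, pFile, nCopy);
  }
}
```

Zeroing before the read is the design choice that makes the short-read case safe: the caller never sees stale buffer content.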

	5756

	5757 /*

	5758 ** This function may only be called when a read-transaction is open on

	5759 ** the pager. It returns the total number of pages in the database.

	5760 **

	5761 ** However, if the file is between 1 and <page-size> bytes in size, then

	5762 ** this is considered a 1 page file.

	5763 */

	5764 SQLITE_PRIVATE void sqlite3PagerPagecount(Pager *pPager, int *pnPage){

	5765 assert( pPager->eState>=PAGER_READER );

	5766 assert( pPager->eState!=PAGER_WRITER_FINISHED );

	5767 *pnPage = (int)pPager->dbSize;

	5768 }

	5769

	5770

	5771 /*

	5772 ** Try to obtain a lock of type locktype on the database file. If

	5773 ** a similar or greater lock is already held, this function is a no-op

	5774 ** (returning SQLITE_OK immediately).

	5775 **

	5776 ** Otherwise, attempt to obtain the lock using sqlite3OsLock(). Invoke

	5777 ** the busy callback if the lock is currently not available. Repeat

	5778 ** until the busy callback returns false or until the attempt to

	5779 ** obtain the lock succeeds.

	5780 **

	5781 ** Return SQLITE_OK on success and an error code if we cannot obtain

	5782 ** the lock. If the lock is obtained successfully, set the Pager.eLock

	5783 ** variable to locktype before returning.

	5784 */

	5785 static int pager_wait_on_lock(Pager *pPager, int locktype){

	5786 int rc; /* Return code */

	5787

	5788 /* Check that this is either a no-op (because the requested lock is

	5789 ** already held), or one of the transitions that the busy-handler

	5790 ** may be invoked during, according to the comment above

	5791 ** sqlite3PagerSetBusyhandler().

	5792 */

	5793 assert( (pPager->eLock>=locktype)

	5794 || (pPager->eLock==NO_LOCK && locktype==SHARED_LOCK)

	5795 || (pPager->eLock==RESERVED_LOCK && locktype==EXCLUSIVE_LOCK)

	5796 );

	5797

	5798 do {

	5799 rc = pagerLockDb(pPager, locktype);

	5800 }while( rc==SQLITE_BUSY && pPager->xBusyHandler(pPager->pBusyHandlerArg) );

	5801 return rc;

	5802 }
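
The do/while loop in pager_wait_on_lock() is a general retry pattern: attempt the lock, and on a busy result ask a callback whether to try again. The sketch below reproduces the pattern with hypothetical names (`wait_on_lock`, `lock_after_n`, `retry_up_to`). `RC_OK` and `RC_BUSY` mirror the numeric values of SQLITE_OK (0) and SQLITE_BUSY (5), and the lock function is a stub that reports busy a fixed number of times.

```c
#include <assert.h>

#define RC_OK   0  /* same value as SQLITE_OK */
#define RC_BUSY 5  /* same value as SQLITE_BUSY */

typedef int (*LockFn)(void *);       /* Attempts the lock; returns RC_* */
typedef int (*BusyHandler)(void *);  /* Non-zero means "retry" */

/* Retry xLock until it stops returning RC_BUSY or xBusy declines. */
static int wait_on_lock(LockFn xLock, void *pLockArg,
                        BusyHandler xBusy, void *pBusyArg){
  int rc;
  assert( xLock!=0 && xBusy!=0 );
  do{
    rc = xLock(pLockArg);
  }while( rc==RC_BUSY && xBusy(pBusyArg) );
  return rc;
}

/* Stub lock: reports busy *p more times, then succeeds. */
static int lock_after_n(void *p){
  int *pn = (int *)p;
  if( *pn>0 ){ (*pn)--; return RC_BUSY; }
  return RC_OK;
}

/* Stub busy handler: allows *p more retries, then gives up. */
static int retry_up_to(void *p){
  int *pn = (int *)p;
  if( *pn>0 ){ (*pn)--; return 1; }
  return 0;
}
```

Because the busy handler is only consulted after a busy result, a lock that succeeds on the first attempt never invokes the callback at all.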

	5803

	5804 /*

	5805 ** Function assertTruncateConstraint(pPager) checks that one of the

	5806 ** following is true for all dirty pages currently in the page-cache:

	5807 **

	5808 ** a) The page number is less than or equal to the size of the

	5809 ** current database image, in pages, OR

	5810 **

	5811 ** b) if the page content were written at this time, it would not

	5812 ** be necessary to write the current content out to the sub-journal

	5813 ** (as determined by function subjRequiresPage()).

	5814 **

	5815 ** If the condition asserted by this function were not true, and the

	5816 ** dirty page were to be discarded from the cache via the pagerStress()

	5817 ** routine, pagerStress() would not write the current page content to

	5818 ** the database file. If a savepoint transaction were rolled back after

	5819 ** this happened, the correct behavior would be to restore the current

	5820 ** content of the page. However, since this content is not present in either

	5821 ** the database file or the portion of the rollback journal and

	5822 ** sub-journal rolled back the content could not be restored and the

	5823 ** database image would become corrupt. It is therefore fortunate that

	5824 ** this circumstance cannot arise.

	5825 */

	5826 #if defined(SQLITE_DEBUG)

	5827 static void assertTruncateConstraintCb(PgHdr *pPg){

	5828 assert( pPg->flags&PGHDR_DIRTY );

	5829 assert( !subjRequiresPage(pPg) || pPg->pgno<=pPg->pPager->dbSize );

	5830 }

	5831 static void assertTruncateConstraint(Pager *pPager){

	5832 sqlite3PcacheIterateDirty(pPager->pPCache, assertTruncateConstraintCb);

	5833 }

	5834 #else

	5835 # define assertTruncateConstraint(pPager)

	5836 #endif

	5837

	5838 /*

	5839 ** Truncate the in-memory database file image to nPage pages. This

	5840 ** function does not actually modify the database file on disk. It

	5841 ** just sets the internal state of the pager object so that the

	5842 ** truncation will be done when the current transaction is committed.

	5843 **

	5844 ** This function is only called right before committing a transaction.

	5845 ** Once this function has been called, the transaction must either be

	5846 ** rolled back or committed. It is not safe to call this function and

	5847 ** then continue writing to the database.

	5848 */

	5849 SQLITE_PRIVATE void sqlite3PagerTruncateImage(Pager *pPager, Pgno nPage){

	5850 assert( pPager->dbSize>=nPage );

	5851 assert( pPager->eState>=PAGER_WRITER_CACHEMOD );

	5852 pPager->dbSize = nPage;

	5853

	5854 /* At one point the code here called assertTruncateConstraint() to

	5855 ** ensure that all pages being truncated away by this operation are,

	5856 ** if one or more savepoints are open, present in the savepoint

	5857 ** journal so that they can be restored if the savepoint is rolled

	5858 ** back. This is no longer necessary as this function is now only

	5859 ** called right before committing a transaction. So although the

	5860 ** Pager object may still have open savepoints (Pager.nSavepoint!=0),

	5861 ** they cannot be rolled back. So the assertTruncateConstraint() call

	5862 ** is no longer correct. */

	5863 }

	5864

	5865

	5866 /*

	5867 ** This function is called before attempting a hot-journal rollback. It

	5868 ** syncs the journal file to disk, then sets pPager->journalHdr to the

	5869 ** size of the journal file so that the pager_playback() routine knows

	5870 ** that the entire journal file has been synced.

	5871 **

	5872 ** Syncing a hot-journal to disk before attempting to roll it back ensures

	5873 ** that if a power-failure occurs during the rollback, the process that

	5874 ** attempts rollback following system recovery sees the same journal

	5875 ** content as this process.

	5876 **

	5877 ** If everything goes as planned, SQLITE_OK is returned. Otherwise,

	5878 ** an SQLite error code.

	5879 */

	5880 static int pagerSyncHotJournal(Pager *pPager){

	5881 int rc = SQLITE_OK;

	5882 if( !pPager->noSync ){

	5883 rc = sqlite3OsSync(pPager->jfd, SQLITE_SYNC_NORMAL);

	5884 }

	5885 if( rc==SQLITE_OK ){

	5886 rc = sqlite3OsFileSize(pPager->jfd, &pPager->journalHdr);

	5887 }

	5888 return rc;

	5889 }

	5890

	5891 #if SQLITE_MAX_MMAP_SIZE>0

	5892 /*

	5893 ** Obtain a reference to a memory mapped page object for page number pgno.

	5894 ** The new object will use the pointer pData, obtained from xFetch().

	5895 ** If successful, set *ppPage to point to the new page reference

	5896 ** and return SQLITE_OK. Otherwise, return an SQLite error code and set

	5897 ** *ppPage to zero.

	5898 **

	5899 ** Page references obtained by calling this function should be released

	5900 ** by calling pagerReleaseMapPage().

	5901 */

	5902 static int pagerAcquireMapPage(

	5903 Pager *pPager, /* Pager object */

	5904 Pgno pgno, /* Page number */

	5905 void *pData, /* xFetch()'d data for this page */

	5906 PgHdr **ppPage /* OUT: Acquired page object */

	5907 ){

	5908 PgHdr *p; /* Memory mapped page to return */

	5909

	5910 if( pPager->pMmapFreelist ){

	5911 *ppPage = p = pPager->pMmapFreelist;

	5912 pPager->pMmapFreelist = p->pDirty;

	5913 p->pDirty = 0;

	5914 assert( pPager->nExtra>=8 );

	5915 memset(p->pExtra, 0, 8);

	5916 }else{

	5917 *ppPage = p = (PgHdr *)sqlite3MallocZero(sizeof(PgHdr) + pPager->nExtra);

	5918 if( p==0 ){

	5919 sqlite3OsUnfetch(pPager->fd, (i64)(pgno-1) * pPager->pageSize, pData);

	5920 return SQLITE_NOMEM_BKPT;

	5921 }

	5922 p->pExtra = (void *)&p[1];

	5923 p->flags = PGHDR_MMAP;

	5924 p->nRef = 1;

	5925 p->pPager = pPager;

	5926 }

	5927

	5928 assert( p->pExtra==(void *)&p[1] );

	5929 assert( p->pPage==0 );

	5930 assert( p->flags==PGHDR_MMAP );

	5931 assert( p->pPager==pPager );

	5932 assert( p->nRef==1 );

	5933

	5934 p->pgno = pgno;

	5935 p->pData = pData;

	5936 pPager->nMmapOut++;

	5937

	5938 return SQLITE_OK;

	5939 }

	5940 #endif

	5941

	5942 /*

	5943 ** Release a reference to page pPg. pPg must have been returned by an

	5944 ** earlier call to pagerAcquireMapPage().

	5945 */

	5946 static void pagerReleaseMapPage(PgHdr *pPg){

	5947 Pager *pPager = pPg->pPager;

	5948 pPager->nMmapOut--;

	5949 pPg->pDirty = pPager->pMmapFreelist;

	5950 pPager->pMmapFreelist = pPg;

	5951

	5952 assert( pPager->fd->pMethods->iVersion>=3 );

	5953 sqlite3OsUnfetch(pPager->fd, (i64)(pPg->pgno-1)*pPager->pageSize, pPg->pData);

	5954 }

	5955

	5956 /*

	5957 ** Free all PgHdr objects stored in the Pager.pMmapFreelist list.

	5958 */

	5959 static void pagerFreeMapHdrs(Pager *pPager){

	5960 PgHdr *p;

	5961 PgHdr *pNext;

	5962 for(p=pPager->pMmapFreelist; p; p=pNext){

	5963 pNext = p->pDirty;

	5964 sqlite3_free(p);

	5965 }

	5966 }
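
pagerAcquireMapPage(), pagerReleaseMapPage() and pagerFreeMapHdrs() together implement a simple free list: released headers are pushed onto a singly linked list and reused before any new allocation is made. A minimal sketch of the same pattern follows. The `Hdr` and `Pool` types and the `pool_*` functions are hypothetical, and the standard allocator stands in for the SQLite one.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Hdr Hdr;
struct Hdr {
  Hdr *pNext;     /* Next free header (plays the role of pDirty above) */
  unsigned pgno;  /* Page number this header currently describes */
};

typedef struct Pool {
  Hdr *pFree;  /* Head of the free list */
  int nOut;    /* Number of headers currently handed out */
} Pool;

/* Reuse a header from the free list if possible, else allocate one.
** Returns NULL if allocation fails. */
static Hdr *pool_acquire(Pool *p, unsigned pgno){
  Hdr *h = p->pFree;
  if( h ){
    p->pFree = h->pNext;
    h->pNext = 0;
  }else{
    h = (Hdr *)calloc(1, sizeof(*h));
    if( h==0 ) return 0;
  }
  h->pgno = pgno;
  p->nOut++;
  return h;
}

/* Return a header to the free list instead of freeing it. */
static void pool_release(Pool *p, Hdr *h){
  assert( p->nOut>0 );
  p->nOut--;
  h->pNext = p->pFree;
  p->pFree = h;
}

/* Free every header on the free list; save pNext before each free(). */
static void pool_free_all(Pool *p){
  Hdr *h;
  Hdr *pNext;
  for(h=p->pFree; h; h=pNext){
    pNext = h->pNext;
    free(h);
  }
  p->pFree = 0;
}
```

Reusing an existing link field for the free list, as the pager does with pDirty, avoids growing the header just to support recycling.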

	5967

	5968

	5969 /*

	5970 ** Shutdown the page cache. Free all memory and close all files.

	5971 **

	5972 ** If a transaction was in progress when this routine is called, that

	5973 ** transaction is rolled back. All outstanding pages are invalidated

	5974 ** and their memory is freed. Any attempt to use a page associated

	5975 ** with this page cache after this function returns will likely

	5976 ** result in a coredump.

	5977 **

	5978 ** This function always succeeds. If a transaction is active an attempt

	5979 ** is made to roll it back. If an error occurs during the rollback

	5980 ** a hot journal may be left in the filesystem but no error is returned

	5981 ** to the caller.

	5982 */

	5983 SQLITE_PRIVATE int sqlite3PagerClose(Pager *pPager, sqlite3 *db){

	5984 u8 *pTmp = (u8 *)pPager->pTmpSpace;

	5985

	5986 assert( db || pagerUseWal(pPager)==0 );

	5987 assert( assert_pager_state(pPager) );

	5988 disable_simulated_io_errors();

	5989 sqlite3BeginBenignMalloc();

	5990 pagerFreeMapHdrs(pPager);

	5991 /* pPager->errCode = 0; */

	5992 pPager->exclusiveMode = 0;

	5993 #ifndef SQLITE_OMIT_WAL

	5994 assert( db || pPager->pWal==0 );

	5995 sqlite3WalClose(pPager->pWal, db, pPager->ckptSyncFlags, pPager->pageSize,

	5996 (db && (db->flags & SQLITE_NoCkptOnClose) ? 0 : pTmp)

	5997 );

	5998 pPager->pWal = 0;

	5999 #endif

	6000 pager_reset(pPager);

	6001 if( MEMDB ){

	6002 pager_unlock(pPager);

	6003 }else{

	6004 /* If it is open, sync the journal file before calling UnlockAndRollback.

	6005 ** If this is not done, then an unsynced portion of the open journal

	6006 ** file may be played back into the database. If a power failure occurs

	6007 ** while this is happening, the database could become corrupt.

	6008 **

	6009 ** If an error occurs while trying to sync the journal, shift the pager

	6010 ** into the ERROR state. This causes UnlockAndRollback to unlock the

	6011 ** database and close the journal file without attempting to roll it

	6012 ** back or finalize it. The next database user will have to do hot-journal

	6013 ** rollback before accessing the database file.

	6014 */

	6015 if( isOpen(pPager->jfd) ){

	6016 pager_error(pPager, pagerSyncHotJournal(pPager));

	6017 }

	6018 pagerUnlockAndRollback(pPager);

	6019 }

	6020 sqlite3EndBenignMalloc();

	6021 enable_simulated_io_errors();

	6022 PAGERTRACE(("CLOSE %d\n", PAGERID(pPager)));

	6023 IOTRACE(("CLOSE %p\n", pPager))

	6024 sqlite3OsClose(pPager->jfd);

	6025 sqlite3OsClose(pPager->fd);

	6026 sqlite3PageFree(pTmp);

	6027 sqlite3PcacheClose(pPager->pPCache);

	6028

	6029 #ifdef SQLITE_HAS_CODEC

	6030 if( pPager->xCodecFree ) pPager->xCodecFree(pPager->pCodec);

	6031 #endif

	6032

	6033 assert( !pPager->aSavepoint && !pPager->pInJournal );

	6034 assert( !isOpen(pPager->jfd) && !isOpen(pPager->sjfd) );

	6035

	6036 sqlite3_free(pPager);

	6037 return SQLITE_OK;

	6038 }

	6039

	6040 #if !defined(NDEBUG) || defined(SQLITE_TEST)

	6041 /*

	6042 ** Return the page number for page pPg.

	6043 */

	6044 SQLITE_PRIVATE Pgno sqlite3PagerPagenumber(DbPage *pPg){

	6045 return pPg->pgno;

	6046 }

	6047 #endif

	6048

	6049 /*

	6050 ** Increment the reference count for page pPg.

	6051 */

	6052 SQLITE_PRIVATE void sqlite3PagerRef(DbPage *pPg){

	6053 sqlite3PcacheRef(pPg);

	6054 }

	6055

	6056 /*

	6057 ** Sync the journal. In other words, make sure all the pages that have

	6058 ** been written to the journal have actually reached the surface of the

	6059 ** disk and can be restored in the event of a hot-journal rollback.

	6060 **

	6061 ** If the Pager.noSync flag is set, then this function is a no-op.

	6062 ** Otherwise, the actions required depend on the journal-mode and the

	6063 ** device characteristics of the file-system, as follows:

	6064 **

	6065 ** * If the journal file is an in-memory journal file, no action need

	6066 ** be taken.

	6067 **

	6068 ** * Otherwise, if the device does not support the SAFE_APPEND property,

	6069 ** then the nRec field of the most recently written journal header

	6070 ** is updated to contain the number of journal records that have

	6071 ** been written following it. If the pager is operating in full-sync

	6072 ** mode, then the journal file is synced before this field is updated.

	6073 **

	6074 ** * If the device does not support the SEQUENTIAL property, then

	6075 ** the journal file is synced.

	6076 **

	6077 ** Or, in pseudo-code:

	6078 **

	6079 ** if( NOT <in-memory journal> ){

	6080 ** if( NOT SAFE_APPEND ){

	6081 ** if( <full-sync mode> ) xSync(<journal file>);

	6082 ** <update nRec field>

	6083 ** }

	6084 ** if( NOT SEQUENTIAL ) xSync(<journal file>);

	6085 ** }

	6086 **

	6087 ** If successful, this routine clears the PGHDR_NEED_SYNC flag of every

	6088 ** page currently held in memory before returning SQLITE_OK. If an IO

	6089 ** error is encountered, then the IO error code is returned to the caller.

	6090 */

	6091 static int syncJournal(Pager *pPager, int newHdr){

	6092 int rc; /* Return code */

	6093

	6094 assert( pPager->eState==PAGER_WRITER_CACHEMOD

	6095 || pPager->eState==PAGER_WRITER_DBMOD

	6096 );

	6097 assert( assert_pager_state(pPager) );

	6098 assert( !pagerUseWal(pPager) );

	6099

	6100 rc = sqlite3PagerExclusiveLock(pPager);

	6101 if( rc!=SQLITE_OK ) return rc;

	6102

	6103 if( !pPager->noSync ){

	6104 assert( !pPager->tempFile );

	6105 if( isOpen(pPager->jfd) && pPager->journalMode!=PAGER_JOURNALMODE_MEMORY ){

	6106 const int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);

	6107 assert( isOpen(pPager->jfd) );

	6108

	6109 if( 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){

	6110 /* This block deals with an obscure problem. If the last connection

	6111 ** that wrote to this database was operating in persistent-journal

	6112 ** mode, then the journal file may at this point actually be larger

	6113 ** than Pager.journalOff bytes. If the next thing in the journal

	6114 ** file happens to be a journal-header (written as part of the

	6115 ** previous connection's transaction), and a crash or power-failure

	6116 ** occurs after nRec is updated but before this connection writes

	6117 ** anything else to the journal file (or commits/rolls back its

	6118 ** transaction), then SQLite may become confused when doing the

	6119 ** hot-journal rollback following recovery. It may roll back all

	6120 ** of this connection's data, then proceed to rolling back the old,

	6121 ** out-of-date data that follows it. Database corruption.

	6122 **

	6123 ** To work around this, if the journal file does appear to contain

	6124 ** a valid header following Pager.journalOff, then write a 0x00

	6125 ** byte to the start of it to prevent it from being recognized.

	6126 **

	6127 ** Variable iNextHdrOffset is set to the offset at which this

	6128 ** problematic header will occur, if it exists. aMagic is used

	6129 ** as a temporary buffer to inspect the first couple of bytes of

	6130 ** the potential journal header.

	6131 */

	6132 i64 iNextHdrOffset;

	6133 u8 aMagic[8];

	6134 u8 zHeader[sizeof(aJournalMagic)+4];

	6135

	6136 memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic));

	6137 put32bits(&zHeader[sizeof(aJournalMagic)], pPager->nRec);

	6138

	6139 iNextHdrOffset = journalHdrOffset(pPager);

	6140 rc = sqlite3OsRead(pPager->jfd, aMagic, 8, iNextHdrOffset);

	6141 if( rc==SQLITE_OK && 0==memcmp(aMagic, aJournalMagic, 8) ){

	6142 static const u8 zerobyte = 0;

	6143 rc = sqlite3OsWrite(pPager->jfd, &zerobyte, 1, iNextHdrOffset);

	6144 }

	6145 if( rc!=SQLITE_OK && rc!=SQLITE_IOERR_SHORT_READ ){

	6146 return rc;

	6147 }

	6148

	6149 /* Write the nRec value into the journal file header. If in

	6150 ** full-synchronous mode, sync the journal first. This ensures that

	6151 ** all data has really hit the disk before nRec is updated to mark

	6152 ** it as a candidate for rollback.

	6153 **

	6154 ** This is not required if the persistent media supports the

	6155 ** SAFE_APPEND property. Because in this case it is not possible

	6156 ** for garbage data to be appended to the file, the nRec field

	6157 ** is populated with 0xFFFFFFFF when the journal header is written

	6158 ** and never needs to be updated.

	6159 */

	6160 if( pPager->fullSync && 0==(iDc&SQLITE_IOCAP_SEQUENTIAL) ){

	6161 PAGERTRACE(("SYNC journal of %d\n", PAGERID(pPager)));

	6162 IOTRACE(("JSYNC %p\n", pPager))

	6163 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags);

	6164 if( rc!=SQLITE_OK ) return rc;

	6165 }

	6166 IOTRACE(("JHDR %p %lld\n", pPager, pPager->journalHdr));

	6167 rc = sqlite3OsWrite(

	6168 pPager->jfd, zHeader, sizeof(zHeader), pPager->journalHdr

	6169 );

	6170 if( rc!=SQLITE_OK ) return rc;

	6171 }

	6172 if( 0==(iDc&SQLITE_IOCAP_SEQUENTIAL) ){

	6173 PAGERTRACE(("SYNC journal of %d\n", PAGERID(pPager)));

	6174 IOTRACE(("JSYNC %p\n", pPager))

	6175 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags\|

	6176 (pPager->syncFlags==SQLITE_SYNC_FULL?SQLITE_SYNC_DATAONLY:0)

	6177 );

	6178 if( rc!=SQLITE_OK ) return rc;

	6179 }

	6180

	6181 pPager->journalHdr = pPager->journalOff;

	6182 if( newHdr && 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){

	6183 pPager->nRec = 0;

	6184 rc = writeJournalHdr(pPager);

	6185 if( rc!=SQLITE_OK ) return rc;

	6186 }

	6187 }else{

	6188 pPager->journalHdr = pPager->journalOff;

	6189 }

	6190 }

	6191

	6192 /* Unless the pager is in noSync mode, the journal file was just

	6193 ** successfully synced. Either way, clear the PGHDR_NEED_SYNC flag on

	6194 ** all pages.

	6195 */

	6196 sqlite3PcacheClearSyncFlags(pPager->pPCache);

	6197 pPager->eState = PAGER_WRITER_DBMOD;

	6198 assert( assert_pager_state(pPager) );

	6199 return SQLITE_OK;

	6200 }
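
The sync decisions made by syncJournal() can be modelled without any file I/O. The following is a minimal sketch, not the pager's own code: `demoSyncPlan`, `SyncPlan`, and the `DEMO_IOCAP_*` constants are hypothetical names chosen for illustration, and the function only counts how many journal syncs the branches above would issue for a given combination of settings.

```c
#include <assert.h>

/* Hypothetical stand-ins for the SQLITE_IOCAP_* bits tested above. */
#define DEMO_IOCAP_SAFE_APPEND 0x0200
#define DEMO_IOCAP_SEQUENTIAL  0x0400

typedef struct SyncPlan {
  int nSync;       /* Number of xSync() calls issued on the journal */
  int writesNRec;  /* True if the nRec header field is rewritten */
} SyncPlan;

/* Model the branch structure of syncJournal(): no I/O is performed,
** the function only reports what the real routine would do. */
static SyncPlan demoSyncPlan(int noSync, int inMemJournal, int fullSync, int iDc){
  SyncPlan plan = {0, 0};
  if( noSync || inMemJournal ) return plan;
  if( 0==(iDc & DEMO_IOCAP_SAFE_APPEND) ){
    /* Sync before updating nRec only in full-sync mode on a
    ** device that does not guarantee sequential writes. */
    if( fullSync && 0==(iDc & DEMO_IOCAP_SEQUENTIAL) ) plan.nSync++;
    plan.writesNRec = 1;
  }
  /* Final sync, skipped on devices with the SEQUENTIAL property. */
  if( 0==(iDc & DEMO_IOCAP_SEQUENTIAL) ) plan.nSync++;
  return plan;
}
```

Under these assumptions, a full-sync pager on a device with neither property syncs the journal twice, while a SEQUENTIAL device is never synced here.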

	6201

	6202 /*

	6203 ** The argument is the first in a linked list of dirty pages connected

	6204 ** by the PgHdr.pDirty pointer. This function writes each one of the

	6205 ** in-memory pages in the list to the database file. The argument may

	6206 ** be NULL, representing an empty list. In this case this function is

	6207 ** a no-op.

	6208 **

	6209 ** The pager must hold at least a RESERVED lock when this function

	6210 ** is called. Before writing anything to the database file, this lock

	6211 ** is upgraded to an EXCLUSIVE lock. If the lock cannot be obtained,

	6212 ** SQLITE_BUSY is returned and no data is written to the database file.

	6213 **

	6214 ** If the pager is a temp-file pager and the actual file-system file

	6215 ** is not yet open, it is created and opened before any data is

	6216 ** written out.

	6217 **

	6218 ** Once the lock has been upgraded and, if necessary, the file opened,

	6219 ** the pages are written out to the database file in list order. Writing

	6220 ** a page is skipped if it meets either of the following criteria:

	6221 **

	6222 ** * The page number is greater than Pager.dbSize, or

	6223 ** * The PGHDR_DONT_WRITE flag is set on the page.

	6224 **

	6225 ** If writing out a page causes the database file to grow, Pager.dbFileSize

	6226 ** is updated accordingly. If page 1 is written out, then the value cached

	6227 ** in Pager.dbFileVers[] is updated to match the new value stored in

	6228 ** the database file.

	6229 **

	6230 ** If everything is successful, SQLITE_OK is returned. If an IO error

	6231 ** occurs, an IO error code is returned. Or, if the EXCLUSIVE lock cannot

	6232 ** be obtained, SQLITE_BUSY is returned.

	6233 */

	6234 static int pager_write_pagelist(Pager *pPager, PgHdr *pList){

	6235 int rc = SQLITE_OK; /* Return code */

	6236

	6237 /* This function is only called for rollback pagers in WRITER_DBMOD state. */

	6238 assert( !pagerUseWal(pPager) );

	6239 assert( pPager->tempFile \|\| pPager->eState==PAGER_WRITER_DBMOD );

	6240 assert( pPager->eLock==EXCLUSIVE_LOCK );

	6241 assert( isOpen(pPager->fd) \|\| pList->pDirty==0 );

	6242

	6243 /* If the file is a temp-file that has not yet been opened, open it now. It

	6244 ** is not possible for rc to be other than SQLITE_OK if this branch

	6245 ** is taken, as pager_wait_on_lock() is a no-op for temp-files.

	6246 */

	6247 if( !isOpen(pPager->fd) ){

	6248 assert( pPager->tempFile && rc==SQLITE_OK );

	6249 rc = pagerOpentemp(pPager, pPager->fd, pPager->vfsFlags);

	6250 }

	6251

	6252 /* Before the first write, give the VFS a hint of what the final

	6253 ** file size will be.

	6254 */

	6255 assert( rc!=SQLITE_OK \|\| isOpen(pPager->fd) );

	6256 if( rc==SQLITE_OK

	6257 && pPager->dbHintSize<pPager->dbSize

	6258 && (pList->pDirty \|\| pList->pgno>pPager->dbHintSize)

	6259 ){

	6260 sqlite3_int64 szFile = pPager->pageSize * (sqlite3_int64)pPager->dbSize;

	6261 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_SIZE_HINT, &szFile);

	6262 pPager->dbHintSize = pPager->dbSize;

	6263 }

	6264

	6265 while( rc==SQLITE_OK && pList ){

	6266 Pgno pgno = pList->pgno;

	6267

	6268 /* If there are dirty pages in the page cache with page numbers greater

	6269 ** than Pager.dbSize, this means sqlite3PagerTruncateImage() was called to

	6270 ** make the file smaller (presumably by auto-vacuum code). Do not write

	6271 ** any such pages to the file.

	6272 **

	6273 ** Also, do not write out any page that has the PGHDR_DONT_WRITE flag

	6274 ** set (set by sqlite3PagerDontWrite()).

	6275 */

	6276 if( pgno<=pPager->dbSize && 0==(pList->flags&PGHDR_DONT_WRITE) ){

	6277 i64 offset = (pgno-1)*(i64)pPager->pageSize; /* Offset to write */

	6278 char *pData; /* Data to write */

	6279

	6280 assert( (pList->flags&PGHDR_NEED_SYNC)==0 );

	6281 if( pList->pgno==1 ) pager_write_changecounter(pList);

	6282

	6283 /* Encode the database */

	6284 CODEC2(pPager, pList->pData, pgno, 6, return SQLITE_NOMEM_BKPT, pData);

	6285

	6286 /* Write out the page data. */

	6287 rc = sqlite3OsWrite(pPager->fd, pData, pPager->pageSize, offset);

	6288

	6289 /* If page 1 was just written, update Pager.dbFileVers to match

	6290 ** the value now stored in the database file. If writing this

	6291 ** page caused the database file to grow, update dbFileSize.

	6292 */

	6293 if( pgno==1 ){

	6294 memcpy(&pPager->dbFileVers, &pData[24], sizeof(pPager->dbFileVers));

	6295 }

	6296 if( pgno>pPager->dbFileSize ){

	6297 pPager->dbFileSize = pgno;

	6298 }

	6299 pPager->aStat[PAGER_STAT_WRITE]++;

	6300

	6301 /* Update any backup objects copying the contents of this pager. */

	6302 sqlite3BackupUpdate(pPager->pBackup, pgno, (u8*)pList->pData);

	6303

	6304 PAGERTRACE(("STORE %d page %d hash(%08x)\n",

	6305 PAGERID(pPager), pgno, pager_pagehash(pList)));

	6306 IOTRACE(("PGOUT %p %d\n", pPager, pgno));

	6307 PAGER_INCR(sqlite3_pager_writedb_count);

	6308 }else{

	6309 PAGERTRACE(("NOSTORE %d page %d\n", PAGERID(pPager), pgno));

	6310 }

	6311 pager_set_pagehash(pList);

	6312 pList = pList->pDirty;

	6313 }

	6314

	6315 return rc;

	6316 }
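
The two per-page decisions in pager_write_pagelist() (whether to write a page, and at which file offset) can be shown in isolation. This is a sketch under stated assumptions: `demoPageOffset`, `demoShouldWrite`, and the value of `DEMO_PGHDR_DONT_WRITE` are hypothetical and only mirror the expressions used in the loop above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PGHDR_DONT_WRITE 0x10  /* Hypothetical flag bit for illustration */

/* Byte offset of a 1-based page number. The multiplication is done in
** 64 bits so that large databases do not overflow a 32-bit integer. */
static int64_t demoPageOffset(uint32_t pgno, int pageSize){
  return (int64_t)(pgno-1) * (int64_t)pageSize;
}

/* A page is written only if it lies within the current database image
** and has not been marked as "do not write". */
static int demoShouldWrite(uint32_t pgno, uint32_t dbSize, unsigned flags){
  return pgno<=dbSize && 0==(flags & DEMO_PGHDR_DONT_WRITE);
}
```

Page 1 therefore lands at offset 0, and a dirty page beyond dbSize (for example after a truncation) is skipped rather than written.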

	6317

	6318 /*

	6319 ** Ensure that the sub-journal file is open. If it is already open, this

	6320 ** function is a no-op.

	6321 **

	6322 ** SQLITE_OK is returned if everything goes according to plan. An

	6323 ** SQLITE_IOERR_XXX error code is returned if a call to sqlite3OsOpen()

	6324 ** fails.

	6325 */

	6326 static int openSubJournal(Pager *pPager){

	6327 int rc = SQLITE_OK;

	6328 if( !isOpen(pPager->sjfd) ){

	6329 const int flags = SQLITE_OPEN_SUBJOURNAL \| SQLITE_OPEN_READWRITE

	6330 \| SQLITE_OPEN_CREATE \| SQLITE_OPEN_EXCLUSIVE

	6331 \| SQLITE_OPEN_DELETEONCLOSE;

	6332 int nStmtSpill = sqlite3Config.nStmtSpill;

	6333 if( pPager->journalMode==PAGER_JOURNALMODE_MEMORY \|\| pPager->subjInMemory ){

	6334 nStmtSpill = -1;

	6335 }

	6336 rc = sqlite3JournalOpen(pPager->pVfs, 0, pPager->sjfd, flags, nStmtSpill);

	6337 }

	6338 return rc;

	6339 }

	6340

	6341 /*

	6342 ** Append a record of the current state of page pPg to the sub-journal.

	6343 **

	6344 ** If successful, set the bit corresponding to pPg->pgno in the bitvecs

	6345 ** for all open savepoints before returning.

	6346 **

	6347 ** This function returns SQLITE_OK if everything is successful, an IO

	6348 ** error code if the attempt to write to the sub-journal fails, or

	6349 ** SQLITE_NOMEM if a malloc fails while setting a bit in a savepoint

	6350 ** bitvec.

	6351 */

	6352 static int subjournalPage(PgHdr *pPg){

	6353 int rc = SQLITE_OK;

	6354 Pager *pPager = pPg->pPager;

	6355 if( pPager->journalMode!=PAGER_JOURNALMODE_OFF ){

	6356

	6357 /* Open the sub-journal, if it has not already been opened */

	6358 assert( pPager->useJournal );

	6359 assert( isOpen(pPager->jfd) \|\| pagerUseWal(pPager) );

	6360 assert( isOpen(pPager->sjfd) \|\| pPager->nSubRec==0 );

	6361 assert( pagerUseWal(pPager)

	6362 \|\| pageInJournal(pPager, pPg)

	6363 \|\| pPg->pgno>pPager->dbOrigSize

	6364 );

	6365 rc = openSubJournal(pPager);

	6366

	6367 /* If the sub-journal was opened successfully (or was already open),

	6368 ** write the journal record into the file. */

	6369 if( rc==SQLITE_OK ){

	6370 void *pData = pPg->pData;

	6371 i64 offset = (i64)pPager->nSubRec*(4+pPager->pageSize);

	6372 char *pData2;

	6373

	6374 CODEC2(pPager, pData, pPg->pgno, 7, return SQLITE_NOMEM_BKPT, pData2);

	6375 PAGERTRACE(("STMT-JOURNAL %d page %d\n", PAGERID(pPager), pPg->pgno));

	6376 rc = write32bits(pPager->sjfd, offset, pPg->pgno);

	6377 if( rc==SQLITE_OK ){

	6378 rc = sqlite3OsWrite(pPager->sjfd, pData2, pPager->pageSize, offset+4);

	6379 }

	6380 }

	6381 }

	6382 if( rc==SQLITE_OK ){

	6383 pPager->nSubRec++;

	6384 assert( pPager->nSavepoint>0 );

	6385 rc = addToSavepointBitvecs(pPager, pPg->pgno);

	6386 }

	6387 return rc;

	6388 }

	6389 static int subjournalPageIfRequired(PgHdr *pPg){

	6390 if( subjRequiresPage(pPg) ){

	6391 return subjournalPage(pPg);

	6392 }else{

	6393 return SQLITE_OK;

	6394 }

	6395 }
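
Each sub-journal record written by subjournalPage() is a 4-byte page number followed by the page image, so record N starts at N*(4+pageSize). The sketch below reproduces that layout in a memory buffer instead of a file. The helper names (`demoSubjOffset`, `demoPut32`, `demoAppendSubjRecord`) are hypothetical, and the big-endian encoding is an assumption about what write32bits() stores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offset of the nSubRec-th record in the sub-journal. */
static int64_t demoSubjOffset(uint32_t nSubRec, int pageSize){
  return (int64_t)nSubRec * (4 + pageSize);
}

/* Store a 32-bit value most-significant byte first. */
static void demoPut32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Write one record (page number, then page content) into aBuf, which
** the caller must size to hold at least nSubRec+1 records. */
static void demoAppendSubjRecord(
  unsigned char *aBuf, uint32_t nSubRec, int pageSize,
  uint32_t pgno, const unsigned char *aPage
){
  unsigned char *p = aBuf + demoSubjOffset(nSubRec, pageSize);
  demoPut32(p, pgno);
  memcpy(p+4, aPage, (size_t)pageSize);
}
```

Because every record has the same size, a savepoint rollback can seek directly to any record from its index alone.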

	6396

	6397 /*

	6398 ** This function is called by the pcache layer when it has reached some

	6399 ** soft memory limit. The first argument is a pointer to a Pager object

	6400 ** (cast as a void*). The pager is always 'purgeable' (not an in-memory

	6401 ** database). The second argument is a reference to a page that is

	6402 ** currently dirty but has no outstanding references. The page

	6403 ** is always associated with the Pager object passed as the first

	6404 ** argument.

	6405 **

	6406 ** The job of this function is to make pPg clean by writing its contents

	6407 ** out to the database file, if possible. This may involve syncing the

	6408 ** journal file.

	6409 **

	6410 ** If successful, sqlite3PcacheMakeClean() is called on the page and

	6411 ** SQLITE_OK returned. If an IO error occurs while trying to make the

	6412 ** page clean, the IO error code is returned. If the page cannot be

	6413 ** made clean for some other reason, but no error occurs, then SQLITE_OK

	6414 ** is returned but sqlite3PcacheMakeClean() is not called.

	6415 */

	6416 static int pagerStress(void *p, PgHdr *pPg){

	6417 Pager *pPager = (Pager *)p;

	6418 int rc = SQLITE_OK;

	6419

	6420 assert( pPg->pPager==pPager );

	6421 assert( pPg->flags&PGHDR_DIRTY );

	6422

	6423 /* The doNotSpill NOSYNC bit is set during times when doing a sync of

	6424 ** journal (and adding a new header) is not allowed. This occurs

	6425 ** during calls to sqlite3PagerWrite() while trying to journal multiple

	6426 ** pages belonging to the same sector.

	6427 **

	6428 ** The doNotSpill ROLLBACK and OFF bits inhibit all cache spilling

	6429 ** regardless of whether or not a sync is required. This is set during

	6430 ** a rollback or by user request, respectively.

	6431 **

	6432 ** Spilling is also prohibited when in an error state since that could

	6433 ** lead to database corruption. In the current implementation it

	6434 ** is impossible for sqlite3PcacheFetch() to be called with createFlag==3

	6435 ** while in the error state, hence it is impossible for this routine to

	6436 ** be called in the error state. Nevertheless, we include a NEVER()

	6437 ** test for the error state as a safeguard against future changes.

	6438 */

	6439 if( NEVER(pPager->errCode) ) return SQLITE_OK;

	6440 testcase( pPager->doNotSpill & SPILLFLAG_ROLLBACK );

	6441 testcase( pPager->doNotSpill & SPILLFLAG_OFF );

	6442 testcase( pPager->doNotSpill & SPILLFLAG_NOSYNC );

	6443 if( pPager->doNotSpill

	6444 && ((pPager->doNotSpill & (SPILLFLAG_ROLLBACK\|SPILLFLAG_OFF))!=0

	6445 \|\| (pPg->flags & PGHDR_NEED_SYNC)!=0)

	6446 ){

	6447 return SQLITE_OK;

	6448 }

	6449

	6450 pPg->pDirty = 0;

	6451 if( pagerUseWal(pPager) ){

	6452 /* Write a single frame for this page to the log. */

	6453 rc = subjournalPageIfRequired(pPg);

	6454 if( rc==SQLITE_OK ){

	6455 rc = pagerWalFrames(pPager, pPg, 0, 0);

	6456 }

	6457 }else{

	6458

	6459 /* Sync the journal file if required. */

	6460 if( pPg->flags&PGHDR_NEED_SYNC

	6461 \|\| pPager->eState==PAGER_WRITER_CACHEMOD

	6462 ){

	6463 rc = syncJournal(pPager, 1);

	6464 }

	6465

	6466 /* Write the contents of the page out to the database file. */

	6467 if( rc==SQLITE_OK ){

	6468 assert( (pPg->flags&PGHDR_NEED_SYNC)==0 );

	6469 rc = pager_write_pagelist(pPager, pPg);

	6470 }

	6471 }

	6472

	6473 /* Mark the page as clean. */

	6474 if( rc==SQLITE_OK ){

	6475 PAGERTRACE(("STRESS %d page %d\n", PAGERID(pPager), pPg->pgno));

	6476 sqlite3PcacheMakeClean(pPg);

	6477 }

	6478

	6479 return pager_error(pPager, rc);

	6480 }
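
The early-exit test in pagerStress() combines three doNotSpill bits with the page's need-sync state. As an illustration only, the predicate can be lifted into a standalone function. `demoSpillBlocked` and the `DEMO_SPILLFLAG_*` values are hypothetical names and values picked for this sketch.

```c
#include <assert.h>

/* Hypothetical stand-ins for the SPILLFLAG_* bits tested above. */
#define DEMO_SPILLFLAG_OFF      0x01
#define DEMO_SPILLFLAG_ROLLBACK 0x02
#define DEMO_SPILLFLAG_NOSYNC   0x04

/* Return 1 if spilling the page must be refused, 0 if it may proceed.
** ROLLBACK and OFF block every spill; NOSYNC blocks only pages that
** would require a journal sync first. */
static int demoSpillBlocked(unsigned doNotSpill, int pageNeedsSync){
  return doNotSpill
      && ((doNotSpill & (DEMO_SPILLFLAG_ROLLBACK|DEMO_SPILLFLAG_OFF))!=0
          || pageNeedsSync);
}
```

With only the NOSYNC bit set, a page that is already safe to write can still be spilled; this is the case the separate PGHDR_NEED_SYNC check preserves.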

	6481

	6482 /*

	6483 ** Flush all unreferenced dirty pages to disk.

	6484 */

	6485 SQLITE_PRIVATE int sqlite3PagerFlush(Pager *pPager){

	6486 int rc = pPager->errCode;

	6487 if( !MEMDB ){

	6488 PgHdr *pList = sqlite3PcacheDirtyList(pPager->pPCache);

	6489 assert( assert_pager_state(pPager) );

	6490 while( rc==SQLITE_OK && pList ){

	6491 PgHdr *pNext = pList->pDirty;

	6492 if( pList->nRef==0 ){

	6493 rc = pagerStress((void*)pPager, pList);

	6494 }

	6495 pList = pNext;

	6496 }

	6497 }

	6498

	6499 return rc;

	6500 }

	6501

	6502 /*

	6503 ** Allocate and initialize a new Pager object and put a pointer to it

	6504 ** in *ppPager. The pager should eventually be freed by passing it

	6505 ** to sqlite3PagerClose().

	6506 **

	6507 ** The zFilename argument is the path to the database file to open.

	6508 ** If zFilename is NULL then a randomly-named temporary file is created

	6509 ** and used as the file to be cached. Temporary files are deleted

	6510 ** automatically when they are closed. If zFilename is ":memory:" then

	6511 ** all information is held in cache. It is never written to disk.

	6512 ** This can be used to implement an in-memory database.

	6513 **

	6514 ** The nExtra parameter specifies the number of bytes of space allocated

	6515 ** along with each page reference. This space is available to the user

	6516 ** via the sqlite3PagerGetExtra() API. When a new page is allocated, the

	6517 ** first 8 bytes of this space are zeroed but the remainder is uninitialized.

	6518 ** (The extra space is used by btree as the MemPage object.)

	6519 **

	6520 ** The flags argument is used to specify properties that affect the

	6521 ** operation of the pager. It should be passed some bitwise combination

	6522 ** of the PAGER_* flags.

	6523 **

	6524 ** The vfsFlags parameter is a bitmask to pass to the flags parameter

	6525 ** of the xOpen() method of the supplied VFS when opening files.

	6526 **

	6527 ** If the pager object is allocated and the specified file opened

	6528 ** successfully, SQLITE_OK is returned and *ppPager set to point to

	6529 ** the new pager object. If an error occurs, *ppPager is set to NULL

	6530 ** and error code returned. This function may return SQLITE_NOMEM

	6531 ** (sqlite3Malloc() is used to allocate memory), SQLITE_CANTOPEN or

	6532 ** various SQLITE_IOERR_XXX errors.

	6533 */

	6534 SQLITE_PRIVATE int sqlite3PagerOpen(

	6535 sqlite3_vfs *pVfs, /* The virtual file system to use */

	6536 Pager **ppPager, /* OUT: Return the Pager structure here */

	6537 const char *zFilename, /* Name of the database file to open */

	6538 int nExtra, /* Extra bytes appended to each in-memory page */

	6539 int flags, /* flags controlling this file */

	6540 int vfsFlags, /* flags passed through to sqlite3_vfs.xOpen() */

	6541 void (*xReinit)(DbPage*) /* Function to reinitialize pages */

	6542 ){

	6543 u8 *pPtr;

	6544 Pager *pPager = 0; /* Pager object to allocate and return */

	6545 int rc = SQLITE_OK; /* Return code */

	6546 int tempFile = 0; /* True for temp files (incl. in-memory files) */

	6547 int memDb = 0; /* True if this is an in-memory file */

	6548 int readOnly = 0; /* True if this is a read-only file */

	6549 int journalFileSize; /* Bytes to allocate for each journal fd */

	6550 char *zPathname = 0; /* Full path to database file */

	6551 int nPathname = 0; /* Number of bytes in zPathname */

	6552 int useJournal = (flags & PAGER_OMIT_JOURNAL)==0; /* False to omit journal */

	6553 int pcacheSize = sqlite3PcacheSize(); /* Bytes to allocate for PCache */

	6554 u32 szPageDflt = SQLITE_DEFAULT_PAGE_SIZE; /* Default page size */

	6555 const char *zUri = 0; /* URI args to copy */

	6556 int nUri = 0; /* Number of bytes of URI args at *zUri */

	6557

	6558 /* Figure out how much space is required for each journal file-handle

	6559 ** (there are two of them, the main journal and the sub-journal). */

	6560 journalFileSize = ROUND8(sqlite3JournalSize(pVfs));

	6561

	6562 /* Set the output variable to NULL in case an error occurs. */

	6563 *ppPager = 0;

	6564

	6565 #ifndef SQLITE_OMIT_MEMORYDB

	6566 if( flags & PAGER_MEMORY ){

	6567 memDb = 1;

	6568 if( zFilename && zFilename[0] ){

	6569 zPathname = sqlite3DbStrDup(0, zFilename);

	6570 if( zPathname==0 ) return SQLITE_NOMEM_BKPT;

	6571 nPathname = sqlite3Strlen30(zPathname);

	6572 zFilename = 0;

	6573 }

	6574 }

	6575 #endif

	6576

	6577 /* Compute and store the full pathname in an allocated buffer pointed

	6578 ** to by zPathname, length nPathname. Or, if this is a temporary file,

	6579 ** leave both nPathname and zPathname set to 0.

	6580 */

	6581 if( zFilename && zFilename[0] ){

	6582 const char *z;

	6583 nPathname = pVfs->mxPathname+1;

	6584 zPathname = sqlite3DbMallocRaw(0, nPathname*2);

	6585 if( zPathname==0 ){

	6586 return SQLITE_NOMEM_BKPT;

	6587 }

	6588 zPathname[0] = 0; /* Make sure initialized even if FullPathname() fails */

	6589 rc = sqlite3OsFullPathname(pVfs, zFilename, nPathname, zPathname);

	6590 nPathname = sqlite3Strlen30(zPathname);

	6591 z = zUri = &zFilename[sqlite3Strlen30(zFilename)+1];

	6592 while( *z ){

	6593 z += sqlite3Strlen30(z)+1;

	6594 z += sqlite3Strlen30(z)+1;

	6595 }

	6596 nUri = (int)(&z[1] - zUri);

	6597 assert( nUri>=0 );

	6598 if( rc==SQLITE_OK && nPathname+8>pVfs->mxPathname ){

	6599 /* This branch is taken when the journal path required by

	6600 ** the database being opened will be more than pVfs->mxPathname

	6601 ** bytes in length. This means the database cannot be opened,

	6602 ** as it will not be possible to open the journal file or even

	6603 ** check for a hot-journal before reading.

	6604 */

	6605 rc = SQLITE_CANTOPEN_BKPT;

	6606 }

	6607 if( rc!=SQLITE_OK ){

	6608 sqlite3DbFree(0, zPathname);

	6609 return rc;

	6610 }

	6611 }

	6612

	6613 /* Allocate memory for the Pager structure, PCache object, the

	6614 ** three file descriptors, the database file name and the journal

	6615 ** file name. The layout in memory is as follows:

	6616 **

	6617 ** Pager object (sizeof(Pager) bytes)

	6618 ** PCache object (sqlite3PcacheSize() bytes)

	6619 ** Database file handle (pVfs->szOsFile bytes)

	6620 ** Sub-journal file handle (journalFileSize bytes)

	6621 ** Main journal file handle (journalFileSize bytes)

	6622 ** Database file name (nPathname+1 bytes)

	6623 ** Journal file name (nPathname+8+1 bytes)

	6624 */

	6625 pPtr = (u8 *)sqlite3MallocZero(

	6626 ROUND8(sizeof(*pPager)) + /* Pager structure */

	6627 ROUND8(pcacheSize) + /* PCache object */

	6628 ROUND8(pVfs->szOsFile) + /* The main db file */

	6629 journalFileSize * 2 + /* The two journal files */

	6630 nPathname + 1 + nUri + /* zFilename */

	6631 nPathname + 8 + 2 /* zJournal */

	6632 #ifndef SQLITE_OMIT_WAL

	6633 + nPathname + 4 + 2 /* zWal */

	6634 #endif

	6635 );

	6636 assert( EIGHT_BYTE_ALIGNMENT(SQLITE_INT_TO_PTR(journalFileSize)) );

	6637 if( !pPtr ){

	6638 sqlite3DbFree(0, zPathname);

	6639 return SQLITE_NOMEM_BKPT;

	6640 }

	6641 pPager = (Pager*)(pPtr);

	6642 pPager->pPCache = (PCache*)(pPtr += ROUND8(sizeof(*pPager)));

	6643 pPager->fd = (sqlite3_file*)(pPtr += ROUND8(pcacheSize));

	6644 pPager->sjfd = (sqlite3_file*)(pPtr += ROUND8(pVfs->szOsFile));

	6645 pPager->jfd = (sqlite3_file*)(pPtr += journalFileSize);

	6646 pPager->zFilename = (char*)(pPtr += journalFileSize);

	6647 assert( EIGHT_BYTE_ALIGNMENT(pPager->jfd) );

	6648

	6649 /* Fill in the Pager.zFilename and Pager.zJournal buffers, if required. */

	6650 if( zPathname ){

	6651 assert( nPathname>0 );

	6652 pPager->zJournal = (char*)(pPtr += nPathname + 1 + nUri);

	6653 memcpy(pPager->zFilename, zPathname, nPathname);

	6654 if( nUri ) memcpy(&pPager->zFilename[nPathname+1], zUri, nUri);

	6655 memcpy(pPager->zJournal, zPathname, nPathname);

	6656 memcpy(&pPager->zJournal[nPathname], "-journal\000", 8+2);

	6657 sqlite3FileSuffix3(pPager->zFilename, pPager->zJournal);

	6658 #ifndef SQLITE_OMIT_WAL

	6659 pPager->zWal = &pPager->zJournal[nPathname+8+1];

	6660 memcpy(pPager->zWal, zPathname, nPathname);

	6661 memcpy(&pPager->zWal[nPathname], "-wal\000", 4+1);

	6662 sqlite3FileSuffix3(pPager->zFilename, pPager->zWal);

	6663 #endif

	6664 sqlite3DbFree(0, zPathname);

	6665 }

	6666 pPager->pVfs = pVfs;

	6667 pPager->vfsFlags = vfsFlags;

	6668

	6669 /* Open the pager file.

	6670 */

	6671 if( zFilename && zFilename[0] ){

	6672 int fout = 0; /* VFS flags returned by xOpen() */

	6673 rc = sqlite3OsOpen(pVfs, pPager->zFilename, pPager->fd, vfsFlags, &fout);

	6674 assert( !memDb );

	6675 readOnly = (fout&SQLITE_OPEN_READONLY);

	6676

	6677 /* If the file was successfully opened for read/write access,

	6678 ** choose a default page size in case we have to create the

	6679 ** database file. The default page size is the maximum of:

	6680 **

	6681 ** + SQLITE_DEFAULT_PAGE_SIZE,

	6682 ** + The value returned by sqlite3OsSectorSize()

	6683 ** + The largest page size that can be written atomically.

	6684 */

	6685 if( rc==SQLITE_OK ){

	6686 int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);

	6687 if( !readOnly ){

	6688 setSectorSize(pPager);

	6689 assert(SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE);

	6690 if( szPageDflt<pPager->sectorSize ){

	6691 if( pPager->sectorSize>SQLITE_MAX_DEFAULT_PAGE_SIZE ){

	6692 szPageDflt = SQLITE_MAX_DEFAULT_PAGE_SIZE;

	6693 }else{

	6694 szPageDflt = (u32)pPager->sectorSize;

	6695 }

	6696 }

	6697 #ifdef SQLITE_ENABLE_ATOMIC_WRITE

	6698 {

	6699 int ii;

	6700 assert(SQLITE_IOCAP_ATOMIC512==(512>>8));

	6701 assert(SQLITE_IOCAP_ATOMIC64K==(65536>>8));

	6702 assert(SQLITE_MAX_DEFAULT_PAGE_SIZE<=65536);

	6703 for(ii=szPageDflt; ii<=SQLITE_MAX_DEFAULT_PAGE_SIZE; ii=ii*2){

	6704 if( iDc&(SQLITE_IOCAP_ATOMIC\|(ii>>8)) ){

	6705 szPageDflt = ii;

	6706 }

	6707 }

	6708 }

	6709 #endif

	6710 }

	6711 pPager->noLock = sqlite3_uri_boolean(zFilename, "nolock", 0);

	6712 if( (iDc & SQLITE_IOCAP_IMMUTABLE)!=0

	6713 \|\| sqlite3_uri_boolean(zFilename, "immutable", 0) ){

	6714 vfsFlags \|= SQLITE_OPEN_READONLY;

	6715 goto act_like_temp_file;

	6716 }

	6717 }

	6718 }else{

	6719 /* If a temporary file is requested, it is not opened immediately.

	6720 ** In this case we accept the default page size and delay actually

	6721 ** opening the file until the first call to OsWrite().

	6722 **

	6723 ** This branch is also run for an in-memory database. An in-memory

	6724 ** database is the same as a temp-file that is never written out to

	6725 ** disk and uses an in-memory rollback journal.

	6726 **

	6727 ** This branch also runs for files marked as immutable.

	6728 */

	6729 act_like_temp_file:

	6730 tempFile = 1;

	6731 pPager->eState = PAGER_READER; /* Pretend we already have a lock */

	6732 pPager->eLock = EXCLUSIVE_LOCK; /* Pretend we are in EXCLUSIVE mode */

	6733 pPager->noLock = 1; /* Do no locking */

	6734 readOnly = (vfsFlags&SQLITE_OPEN_READONLY);

	6735 }

	6736

	6737 /* The following call to PagerSetPagesize() serves to set the value of

	6738 ** Pager.pageSize and to allocate the Pager.pTmpSpace buffer.

	6739 */

	6740 if( rc==SQLITE_OK ){

	6741 assert( pPager->memDb==0 );

	6742 rc = sqlite3PagerSetPagesize(pPager, &szPageDflt, -1);

	6743 testcase( rc!=SQLITE_OK );

	6744 }

	6745

	6746 /* Initialize the PCache object. */

	6747 if( rc==SQLITE_OK ){

	6748 nExtra = ROUND8(nExtra);

	6749 assert( nExtra>=8 && nExtra<1000 );

	6750 rc = sqlite3PcacheOpen(szPageDflt, nExtra, !memDb,

	6751 !memDb?pagerStress:0, (void *)pPager, pPager->pPCache);

	6752 }

	6753

	6754 /* If an error occurred above, free the Pager structure and close the file.

	6755 */

	6756 if( rc!=SQLITE_OK ){

	6757 sqlite3OsClose(pPager->fd);

	6758 sqlite3PageFree(pPager->pTmpSpace);

	6759 sqlite3_free(pPager);

	6760 return rc;

	6761 }

	6762

	6763 PAGERTRACE(("OPEN %d %s\n", FILEHANDLEID(pPager->fd), pPager->zFilename));

	6764 IOTRACE(("OPEN %p %s\n", pPager, pPager->zFilename))

	6765

	6766 pPager->useJournal = (u8)useJournal;

	6767 /* pPager->stmtOpen = 0; */

	6768 /* pPager->stmtInUse = 0; */

	6769 /* pPager->nRef = 0; */

	6770 /* pPager->stmtSize = 0; */

	6771 /* pPager->stmtJSize = 0; */

	6772 /* pPager->nPage = 0; */

	6773 pPager->mxPgno = SQLITE_MAX_PAGE_COUNT;

	6774 /* pPager->state = PAGER_UNLOCK; */

	6775 /* pPager->errMask = 0; */

	6776 pPager->tempFile = (u8)tempFile;

	6777 assert( tempFile==PAGER_LOCKINGMODE_NORMAL

	6778 \|\| tempFile==PAGER_LOCKINGMODE_EXCLUSIVE );

	6779 assert( PAGER_LOCKINGMODE_EXCLUSIVE==1 );

	6780 pPager->exclusiveMode = (u8)tempFile;

	6781 pPager->changeCountDone = pPager->tempFile;

	6782 pPager->memDb = (u8)memDb;

	6783 pPager->readOnly = (u8)readOnly;

	6784 assert( useJournal \|\| pPager->tempFile );

	6785 pPager->noSync = pPager->tempFile;

	6786 if( pPager->noSync ){

	6787 assert( pPager->fullSync==0 );

	6788 assert( pPager->extraSync==0 );

	6789 assert( pPager->syncFlags==0 );

	6790 assert( pPager->walSyncFlags==0 );

	6791 assert( pPager->ckptSyncFlags==0 );

	6792 }else{

	6793 pPager->fullSync = 1;

	6794 pPager->extraSync = 0;

	6795 pPager->syncFlags = SQLITE_SYNC_NORMAL;

	6796 pPager->walSyncFlags = SQLITE_SYNC_NORMAL \| WAL_SYNC_TRANSACTIONS;

	6797 pPager->ckptSyncFlags = SQLITE_SYNC_NORMAL;

	6798 }

	6799 /* pPager->pFirst = 0; */

	6800 /* pPager->pFirstSynced = 0; */

	6801 /* pPager->pLast = 0; */

	6802 pPager->nExtra = (u16)nExtra;

	6803 pPager->journalSizeLimit = SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT;

	6804 assert( isOpen(pPager->fd) \|\| tempFile );

	6805 setSectorSize(pPager);

	6806 if( !useJournal ){

	6807 pPager->journalMode = PAGER_JOURNALMODE_OFF;

	6808 }else if( memDb ){

	6809 pPager->journalMode = PAGER_JOURNALMODE_MEMORY;

	6810 }

	6811 /* pPager->xBusyHandler = 0; */

	6812 /* pPager->pBusyHandlerArg = 0; */

	6813 pPager->xReiniter = xReinit;

	6814 setGetterMethod(pPager);

	6815 /* memset(pPager->aHash, 0, sizeof(pPager->aHash)); */

	6816 /* pPager->szMmap = SQLITE_DEFAULT_MMAP_SIZE // will be set by btree.c */

	6817

	6818 *ppPager = pPager;

	6819 return SQLITE_OK;

	6820 }

	6821

	6822

	6823 /* Verify that the database file has not been deleted or renamed out from

	6824 ** under the pager. Return SQLITE_OK if the database is still where it ought

	6825 ** to be on disk. Return non-zero (SQLITE_READONLY_DBMOVED or some other error

	6826 ** code from sqlite3OsAccess()) if the database has gone missing.

	6827 */

	6828 static int databaseIsUnmoved(Pager *pPager){

	6829 int bHasMoved = 0;

	6830 int rc;

	6831

	6832 if( pPager->tempFile ) return SQLITE_OK;

	6833 if( pPager->dbSize==0 ) return SQLITE_OK;

	6834 assert( pPager->zFilename && pPager->zFilename[0] );

	6835 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_HAS_MOVED, &bHasMoved);

	6836 if( rc==SQLITE_NOTFOUND ){

	6837 /* If the HAS_MOVED file-control is unimplemented, assume that the file

	6838 ** has not been moved. That is the historical behavior of SQLite: prior to

	6839 ** version 3.8.3, it never checked */

	6840 rc = SQLITE_OK;

	6841 }else if( rc==SQLITE_OK && bHasMoved ){

	6842 rc = SQLITE_READONLY_DBMOVED;

	6843 }

	6844 return rc;

	6845 }

	6846

	6847

	6848 /*

	6849 ** This function is called after transitioning from PAGER_UNLOCK to

	6850 ** PAGER_SHARED state. It tests if there is a hot journal present in

	6851 ** the file-system for the given pager. A hot journal is one that

	6852 ** needs to be played back. According to this function, a hot-journal

	6853 ** file exists if the following criteria are met:

	6854 **

	6855 ** * The journal file exists in the file system, and

	6856 ** * No process holds a RESERVED or greater lock on the database file, and

	6857 ** * The database file itself is greater than 0 bytes in size, and

	6858 ** * The first byte of the journal file exists and is not 0x00.

	6859 **

	6860 ** If the current size of the database file is 0 but a journal file

	6861 ** exists, that is probably an old journal left over from a prior

	6862 ** database with the same name. In this case the journal file is

	6863 ** just deleted using OsDelete, *pExists is set to 0 and SQLITE_OK

	6864 ** is returned.

	6865 **

	6866 ** This routine does not check if there is a master journal filename

	6867 ** at the end of the file. If there is, and that master journal file

	6868 ** does not exist, then the journal file is not really hot. In this

	6869 ** case this routine will return a false-positive. The pager_playback()

	6870 ** routine will discover that the journal file is not really hot and

	6871 ** will not roll it back.

	6872 **

	6873 ** If a hot-journal file is found to exist, *pExists is set to 1 and

	6874 ** SQLITE_OK returned. If no hot-journal file is present, *pExists is

	6875 ** set to 0 and SQLITE_OK returned. If an IO error occurs while trying

	6876 ** to determine whether or not a hot-journal file exists, the IO error

	6877 ** code is returned and the value of *pExists is undefined.

	6878 */

	6879 static int hasHotJournal(Pager *pPager, int *pExists){

	6880 sqlite3_vfs * const pVfs = pPager->pVfs;

	6881 int rc = SQLITE_OK; /* Return code */

	6882 int exists = 1; /* True if a journal file is present */

	6883 int jrnlOpen = !!isOpen(pPager->jfd);

	6884

	6885 assert( pPager->useJournal );

	6886 assert( isOpen(pPager->fd) );

	6887 assert( pPager->eState==PAGER_OPEN );

	6888

	6889 assert( jrnlOpen==0 || ( sqlite3OsDeviceCharacteristics(pPager->jfd) &

	6890 SQLITE_IOCAP_UNDELETABLE_WHEN_OPEN

	6891 ));

	6892

	6893 *pExists = 0;

	6894 if( !jrnlOpen ){

	6895 rc = sqlite3OsAccess(pVfs, pPager->zJournal, SQLITE_ACCESS_EXISTS, &exists);

	6896 }

	6897 if( rc==SQLITE_OK && exists ){

	6898 int locked = 0; /* True if some process holds a RESERVED lock */

	6899

	6900 /* Race condition here: Another process might have been holding the

	6901 ** RESERVED lock and have a journal open at the sqlite3OsAccess()

	6902 ** call above, but then delete the journal and drop the lock before

	6903 ** we get to the following sqlite3OsCheckReservedLock() call. If that

	6904 ** is the case, this routine might think there is a hot journal when

	6905 ** in fact there is none. This results in a false-positive which will

	6906 ** be dealt with by the playback routine. Ticket #3883.

	6907 */

	6908 rc = sqlite3OsCheckReservedLock(pPager->fd, &locked);

	6909 if( rc==SQLITE_OK && !locked ){

	6910 Pgno nPage; /* Number of pages in database file */

	6911

	6912 assert( pPager->tempFile==0 );

	6913 rc = pagerPagecount(pPager, &nPage);

	6914 if( rc==SQLITE_OK ){

	6915 /* If the database is zero pages in size, that means that either (1) the

	6916 ** journal is a remnant from a prior database with the same name where

	6917 ** the database file but not the journal was deleted, or (2) the initial

	6918 ** transaction that populates a new database is being rolled back.

	6919 ** In either case, the journal file can be deleted. However, take care

	6920 ** not to delete the journal file if it is already open due to

	6921 ** journal_mode=PERSIST.

	6922 */

	6923 if( nPage==0 && !jrnlOpen ){

	6924 sqlite3BeginBenignMalloc();

	6925 if( pagerLockDb(pPager, RESERVED_LOCK)==SQLITE_OK ){

	6926 sqlite3OsDelete(pVfs, pPager->zJournal, 0);

	6927 if( !pPager->exclusiveMode ) pagerUnlockDb(pPager, SHARED_LOCK);

	6928 }

	6929 sqlite3EndBenignMalloc();

	6930 }else{

	6931 /* The journal file exists and no other connection has a reserved

	6932 ** or greater lock on the database file. Now check that there is

	6933 ** at least one non-zero byte at the start of the journal file.

	6934 ** If there is, then we consider this journal to be hot. If not,

	6935 ** it can be ignored.

	6936 */

	6937 if( !jrnlOpen ){

	6938 int f = SQLITE_OPEN_READONLY|SQLITE_OPEN_MAIN_JOURNAL;

	6939 rc = sqlite3OsOpen(pVfs, pPager->zJournal, pPager->jfd, f, &f);

	6940 }

	6941 if( rc==SQLITE_OK ){

	6942 u8 first = 0;

	6943 rc = sqlite3OsRead(pPager->jfd, (void *)&first, 1, 0);

	6944 if( rc==SQLITE_IOERR_SHORT_READ ){

	6945 rc = SQLITE_OK;

	6946 }

	6947 if( !jrnlOpen ){

	6948 sqlite3OsClose(pPager->jfd);

	6949 }

	6950 *pExists = (first!=0);

	6951 }else if( rc==SQLITE_CANTOPEN ){

	6952 /* If we cannot open the rollback journal file in order to see if

	6953 ** it has a zero header, that might be due to an I/O error, or

	6954 ** it might be due to the race condition described above and in

	6955 ** ticket #3883. Either way, assume that the journal is hot.

	6956 ** This might be a false positive. But if it is, then the

	6957 ** automatic journal playback and recovery mechanism will deal

	6958 ** with it under an EXCLUSIVE lock where we do not need to

	6959 ** worry so much about race conditions.

	6960 */

	6961 *pExists = 1;

	6962 rc = SQLITE_OK;

	6963 }

	6964 }

	6965 }

	6966 }

	6967 }

	6968

	6969 return rc;

	6970 }
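
The criteria listed in the header comment of hasHotJournal() can be condensed into a small pure function. The sketch below is only an illustration: `JournalProbe` and `journal_is_hot()` are hypothetical names that do not exist in SQLite, and the sketch ignores the journal_mode=PERSIST case in which the journal is already open and the zero-page database branch is skipped.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the facts hasHotJournal() gathers.
** This is not an SQLite type. */
typedef struct JournalProbe {
  bool journalExists;   /* The journal file is present in the file system */
  bool reservedLocked;  /* Some process holds a RESERVED or greater lock */
  uint32_t nPage;       /* Size of the database file in pages */
  uint8_t firstByte;    /* First byte of the journal, 0 if the journal is empty */
} JournalProbe;

/* Return true if the journal should be treated as hot. */
static bool journal_is_hot(const JournalProbe *p){
  if( !p->journalExists ) return false;  /* Nothing to play back */
  if( p->reservedLocked ) return false;  /* A live writer owns the journal */
  if( p->nPage==0 ) return false;        /* Stale journal; the real code deletes it */
  return p->firstByte!=0;                /* A zeroed header means "not hot" */
}
```

As in the real routine, a true result may still be a false positive (for example, when a master journal named in the file no longer exists); the playback step is what makes the final decision.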

	6971

	6972 /*

	6973 ** This function is called to obtain a shared lock on the database file.

	6974 ** It is illegal to call sqlite3PagerGet() until after this function

	6975 ** has been successfully called. If a shared-lock is already held when

	6976 ** this function is called, it is a no-op.

	6977 **

	6978 ** The following operations are also performed by this function.

	6979 **

	6980 ** 1) If the pager is currently in PAGER_OPEN state (no lock held

	6981 ** on the database file), then an attempt is made to obtain a

	6982 ** SHARED lock on the database file. Immediately after obtaining

	6983 ** the SHARED lock, the file-system is checked for a hot-journal,

	6984 ** which is played back if present. Following any hot-journal

	6985 ** rollback, the contents of the cache are validated by checking

	6986 ** the 'change-counter' field of the database file header and

	6987 ** discarded if they are found to be invalid.

	6988 **

	6989 ** 2) If the pager is running in exclusive-mode, and there are currently

	6990 ** no outstanding references to any pages, and is in the error state,

	6991 ** then an attempt is made to clear the error state by discarding

	6992 ** the contents of the page cache and rolling back any open journal

	6993 ** file.

	6994 **

	6995 ** If everything is successful, SQLITE_OK is returned. If an IO error

	6996 ** occurs while locking the database, checking for a hot-journal file or

	6997 ** rolling back a journal file, the IO error code is returned.

	6998 */

	6999 SQLITE_PRIVATE int sqlite3PagerSharedLock(Pager *pPager){

	7000 int rc = SQLITE_OK; /* Return code */

	7001

	7002 /* This routine is only called from b-tree and only when there are no

	7003 ** outstanding pages. This implies that the pager state should either

	7004 ** be OPEN or READER. READER is only possible if the pager is or was in

	7005 ** exclusive access mode. */

	7006 assert( sqlite3PcacheRefCount(pPager->pPCache)==0 );

	7007 assert( assert_pager_state(pPager) );

	7008 assert( pPager->eState==PAGER_OPEN || pPager->eState==PAGER_READER );

	7009 assert( pPager->errCode==SQLITE_OK );

	7010

	7011 if( !pagerUseWal(pPager) && pPager->eState==PAGER_OPEN ){

	7012 int bHotJournal = 1; /* True if there exists a hot journal-file */

	7013

	7014 assert( !MEMDB );

	7015 assert( pPager->tempFile==0 || pPager->eLock==EXCLUSIVE_LOCK );

	7016

	7017 rc = pager_wait_on_lock(pPager, SHARED_LOCK);

	7018 if( rc!=SQLITE_OK ){

	7019 assert( pPager->eLock==NO_LOCK || pPager->eLock==UNKNOWN_LOCK );

	7020 goto failed;

	7021 }

	7022

	7023 /* If a journal file exists, and there is no RESERVED lock on the

	7024 ** database file, then it either needs to be played back or deleted.

	7025 */

	7026 if( pPager->eLock<=SHARED_LOCK ){

	7027 rc = hasHotJournal(pPager, &bHotJournal);

	7028 }

	7029 if( rc!=SQLITE_OK ){

	7030 goto failed;

	7031 }

	7032 if( bHotJournal ){

	7033 if( pPager->readOnly ){

	7034 rc = SQLITE_READONLY_ROLLBACK;

	7035 goto failed;

	7036 }

	7037

	7038 /* Get an EXCLUSIVE lock on the database file. At this point it is

	7039 ** important that a RESERVED lock is not obtained on the way to the

	7040 ** EXCLUSIVE lock. If it were, another process might open the

	7041 ** database file, detect the RESERVED lock, and conclude that the

	7042 ** database is safe to read while this process is still rolling the

	7043 ** hot-journal back.

	7044 **

	7045 ** Because the intermediate RESERVED lock is not requested, any

	7046 ** other process attempting to access the database file will get to

	7047 ** this point in the code and fail to obtain its own EXCLUSIVE lock

	7048 ** on the database file.

	7049 **

	7050 ** Unless the pager is in locking_mode=exclusive mode, the lock is

	7051 ** downgraded to SHARED_LOCK before this function returns.

	7052 */

	7053 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);

	7054 if( rc!=SQLITE_OK ){

	7055 goto failed;

	7056 }

	7057

	7058 /* If it is not already open and the file exists on disk, open the

	7059 ** journal for read/write access. Write access is required because

	7060 ** in exclusive-access mode the file descriptor will be kept open

	7061 ** and possibly used for a transaction later on. Also, write-access

	7062 ** is usually required to finalize the journal in journal_mode=persist

	7063 ** mode (and also for journal_mode=truncate on some systems).

	7064 **

	7065 ** If the journal does not exist, it usually means that some

	7066 ** other connection managed to get in and roll it back before

	7067 ** this connection obtained the exclusive lock above. Or, it

	7068 ** may mean that the pager was in the error-state when this

	7069 ** function was called and the journal file does not exist.

	7070 */

	7071 if( !isOpen(pPager->jfd) ){

	7072 sqlite3_vfs * const pVfs = pPager->pVfs;

	7073 int bExists; /* True if journal file exists */

	7074 rc = sqlite3OsAccess(

	7075 pVfs, pPager->zJournal, SQLITE_ACCESS_EXISTS, &bExists);

	7076 if( rc==SQLITE_OK && bExists ){

	7077 int fout = 0;

	7078 int f = SQLITE_OPEN_READWRITE|SQLITE_OPEN_MAIN_JOURNAL;

	7079 assert( !pPager->tempFile );

	7080 rc = sqlite3OsOpen(pVfs, pPager->zJournal, pPager->jfd, f, &fout);

	7081 assert( rc!=SQLITE_OK || isOpen(pPager->jfd) );

	7082 if( rc==SQLITE_OK && fout&SQLITE_OPEN_READONLY ){

	7083 rc = SQLITE_CANTOPEN_BKPT;

	7084 sqlite3OsClose(pPager->jfd);

	7085 }

	7086 }

	7087 }

	7088

	7089 /* Playback and delete the journal. Drop the database write

	7090 ** lock and reacquire the read lock. Purge the cache before

	7091 ** playing back the hot-journal so that we don't end up with

	7092 ** an inconsistent cache. Sync the hot journal before playing

	7093 ** it back since the process that crashed and left the hot journal

	7094 ** probably did not sync it and we are required to always sync

	7095 ** the journal before playing it back.

	7096 */

	7097 if( isOpen(pPager->jfd) ){

	7098 assert( rc==SQLITE_OK );

	7099 rc = pagerSyncHotJournal(pPager);

	7100 if( rc==SQLITE_OK ){

	7101 rc = pager_playback(pPager, !pPager->tempFile);

	7102 pPager->eState = PAGER_OPEN;

	7103 }

	7104 }else if( !pPager->exclusiveMode ){

	7105 pagerUnlockDb(pPager, SHARED_LOCK);

	7106 }

	7107

	7108 if( rc!=SQLITE_OK ){

	7109 /* This branch is taken if an error occurs while trying to open

	7110 ** or roll back a hot-journal while holding an EXCLUSIVE lock. The

	7111 ** pager_unlock() routine will be called before returning to unlock

	7112 ** the file. If the unlock attempt fails, then Pager.eLock must be

	7113 ** set to UNKNOWN_LOCK (see the comment above the #define for

	7114 ** UNKNOWN_LOCK above for an explanation).

	7115 **

	7116 ** In order to get pager_unlock() to do this, set Pager.eState to

	7117 ** PAGER_ERROR now. This is not actually counted as a transition

	7118 ** to ERROR state in the state diagram at the top of this file,

	7119 ** since we know that the same call to pager_unlock() will very

	7120 ** shortly transition the pager object to the OPEN state. Calling

	7121 ** assert_pager_state() would fail now, as it should not be possible

	7122 ** to be in ERROR state when there are zero outstanding page

	7123 ** references.

	7124 */

	7125 pager_error(pPager, rc);

	7126 goto failed;

	7127 }

	7128

	7129 assert( pPager->eState==PAGER_OPEN );

	7130 assert( (pPager->eLock==SHARED_LOCK)

	7131 || (pPager->exclusiveMode && pPager->eLock>SHARED_LOCK)

	7132 );

	7133 }

	7134

	7135 if( !pPager->tempFile && pPager->hasHeldSharedLock ){

	7136 /* The shared-lock has just been acquired, so check to

	7137 ** see if the database has been modified. If the database has changed,

	7138 ** flush the cache. The hasHeldSharedLock flag prevents this from

	7139 ** occurring on the very first access to a file, in order to save a

	7140 ** single unnecessary sqlite3OsRead() call at the start-up.

	7141 **

	7142 ** Database changes are detected by looking at 16 bytes beginning

	7143 ** at offset 24 into the file. The first 4 of these 16 bytes are

	7144 ** a 32-bit counter that is incremented with each change. The

	7145 ** other bytes change randomly with each file change when

	7146 ** a codec is in use.

	7147 **

	7148 ** There is a vanishingly small chance that a change will not be

	7149 ** detected. The chance of an undetected change is so small that

	7150 ** it can be neglected.

	7151 */

	7152 Pgno nPage = 0;

	7153 char dbFileVers[sizeof(pPager->dbFileVers)];

	7154

	7155 rc = pagerPagecount(pPager, &nPage);

	7156 if( rc ) goto failed;

	7157

	7158 if( nPage>0 ){

	7159 IOTRACE(("CKVERS %p %d\n", pPager, sizeof(dbFileVers)));

	7160 rc = sqlite3OsRead(pPager->fd, &dbFileVers, sizeof(dbFileVers), 24);

	7161 if( rc!=SQLITE_OK && rc!=SQLITE_IOERR_SHORT_READ ){

	7162 goto failed;

	7163 }

	7164 }else{

	7165 memset(dbFileVers, 0, sizeof(dbFileVers));

	7166 }

	7167

	7168 if( memcmp(pPager->dbFileVers, dbFileVers, sizeof(dbFileVers))!=0 ){

	7169 pager_reset(pPager);

	7170

	7171 /* Unmap the database file. It is possible that external processes

	7172 ** may have truncated the database file and then extended it back

	7173 ** to its original size while this process was not holding a lock.

	7174 ** In this case there may exist a Pager.pMap mapping that appears

	7175 ** to be the right size but is not actually valid. Avoid this

	7176 ** possibility by unmapping the db here. */

	7177 if( USEFETCH(pPager) ){

	7178 sqlite3OsUnfetch(pPager->fd, 0, 0);

	7179 }

	7180 }

	7181 }

	7182

	7183 /* If there is a WAL file in the file-system, open this database in WAL

	7184 ** mode. Otherwise, the following function call is a no-op.

	7185 */

	7186 rc = pagerOpenWalIfPresent(pPager);

	7187 #ifndef SQLITE_OMIT_WAL

	7188 assert( pPager->pWal==0 || rc==SQLITE_OK );

	7189 #endif

	7190 }

	7191

	7192 if( pagerUseWal(pPager) ){

	7193 assert( rc==SQLITE_OK );

	7194 rc = pagerBeginReadTransaction(pPager);

	7195 }

	7196

	7197 if( pPager->tempFile==0 && pPager->eState==PAGER_OPEN && rc==SQLITE_OK ){

	7198 rc = pagerPagecount(pPager, &pPager->dbSize);

	7199 }

	7200

	7201 failed:

	7202 if( rc!=SQLITE_OK ){

	7203 assert( !MEMDB );

	7204 pager_unlock(pPager);

	7205 assert( pPager->eState==PAGER_OPEN );

	7206 }else{

	7207 pPager->eState = PAGER_READER;

	7208 pPager->hasHeldSharedLock = 1;

	7209 }

	7210 return rc;

	7211 }
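
The change check in sqlite3PagerSharedLock() compares the version bytes saved from the last read against the bytes currently stored at offset 24 of the file. A minimal sketch of that comparison over an in-memory copy of the header follows. `db_header_changed()` is a hypothetical helper, not an SQLite function, and it treats a buffer that is too short as an empty database (all-zero version bytes), mirroring the memset() branch above.

```c
#include <stdbool.h>
#include <string.h>

enum { VERS_OFFSET = 24, VERS_SIZE = 16 };  /* Values taken from the comment above */

/* Return true if the version bytes in hdr[] differ from the saved copy.
** hdr points to the first nHdr bytes of the database file. */
static bool db_header_changed(
  const unsigned char *hdr,     /* Start of the database file image */
  size_t nHdr,                  /* Number of valid bytes in hdr[] */
  const unsigned char *saved    /* VERS_SIZE bytes saved earlier */
){
  unsigned char cur[VERS_SIZE];
  memset(cur, 0, sizeof(cur));  /* An empty database compares as all zeros */
  if( nHdr>=(size_t)(VERS_OFFSET+VERS_SIZE) ){
    memcpy(cur, hdr+VERS_OFFSET, VERS_SIZE);
  }
  return memcmp(cur, saved, VERS_SIZE)!=0;
}
```

When this returns true, the real code discards the page cache with pager_reset() and unmaps any memory mapping of the file.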

	7212

	7213 /*

	7214 ** If the reference count has reached zero, rollback any active

	7215 ** transaction and unlock the pager.

	7216 **

	7217 ** Except, in locking_mode=EXCLUSIVE when there is nothing in

	7218 ** the rollback journal, the unlock is not performed and there is

	7219 ** nothing to rollback, so this routine is a no-op.

	7220 */

	7221 static void pagerUnlockIfUnused(Pager *pPager){

	7222 if( pPager->nMmapOut==0 && (sqlite3PcacheRefCount(pPager->pPCache)==0) ){

	7223 pagerUnlockAndRollback(pPager);

	7224 }

	7225 }

	7226

	7227 /*

	7228 ** The page getter methods each try to acquire a reference to a

	7229 ** page with page number pgno. If the requested reference is

	7230 ** successfully obtained, it is copied to *ppPage and SQLITE_OK returned.

	7231 **

	7232 ** There are different implementations of the getter method depending

	7233 ** on the current state of the pager.

	7234 **

	7235 ** getPageNormal() -- The normal getter

	7236 ** getPageError() -- Used if the pager is in an error state

	7237 ** getPageMMap() -- Used if memory-mapped I/O is enabled

	7238 **

	7239 ** If the requested page is already in the cache, it is returned.

	7240 ** Otherwise, a new page object is allocated and populated with data

	7241 ** read from the database file. In some cases, the pcache module may

	7242 ** choose not to allocate a new page object and may reuse an existing

	7243 ** object with no outstanding references.

	7244 **

	7245 ** The extra data appended to a page is always initialized to zeros the

	7246 ** first time a page is loaded into memory. If the page requested is

	7247 ** already in the cache when this function is called, then the extra

	7248 ** data is left as it was when the page object was last used.

	7249 **

	7250 ** If the database image is smaller than the requested page or if

	7251 ** the flags parameter contains the PAGER_GET_NOCONTENT bit and the

	7252 ** requested page is not already stored in the cache, then no

	7253 ** actual disk read occurs. In this case the memory image of the

	7254 ** page is initialized to all zeros.

	7255 **

	7256 ** If PAGER_GET_NOCONTENT is true, it means that we do not care about

	7257 ** the contents of the page. This occurs in two scenarios:

	7258 **

	7259 ** a) When reading a free-list leaf page from the database, and

	7260 **

	7261 ** b) When a savepoint is being rolled back and we need to load

	7262 ** a new page into the cache to be filled with the data read

	7263 ** from the savepoint journal.

	7264 **

	7265 ** If PAGER_GET_NOCONTENT is true, then the data returned is zeroed instead

	7266 ** of being read from the database. Additionally, the bits corresponding

	7267 ** to pgno in Pager.pInJournal (bitvec of pages already written to the

	7268 ** journal file) and the PagerSavepoint.pInSavepoint bitvecs of any open

	7269 ** savepoints are set. This means if the page is made writable at any

	7270 ** point in the future, using a call to sqlite3PagerWrite(), its contents

	7271 ** will not be journaled. This saves IO.

	7272 **

	7273 ** The acquisition might fail for several reasons. In all cases,

	7274 ** an appropriate error code is returned and *ppPage is set to NULL.

	7275 **

	7276 ** See also sqlite3PagerLookup(). Both this routine and Lookup() attempt

	7277 ** to find a page in the in-memory cache first. If the page is not already

	7278 ** in memory, this routine goes to disk to read it in whereas Lookup()

	7279 ** just returns 0. This routine acquires a read-lock the first time it

	7280 ** has to go to disk, and could also playback an old journal if necessary.

	7281 ** Since Lookup() never goes to disk, it never has to deal with locks

	7282 ** or journal files.

	7283 */

	7284 static int getPageNormal(

	7285 Pager *pPager, /* The pager open on the database file */

	7286 Pgno pgno, /* Page number to fetch */

	7287 DbPage **ppPage, /* Write a pointer to the page here */

	7288 int flags /* PAGER_GET_XXX flags */

	7289 ){

	7290 int rc = SQLITE_OK;

	7291 PgHdr *pPg;

	7292 u8 noContent; /* True if PAGER_GET_NOCONTENT is set */

	7293 sqlite3_pcache_page *pBase;

	7294

	7295 assert( pPager->errCode==SQLITE_OK );

	7296 assert( pPager->eState>=PAGER_READER );

	7297 assert( assert_pager_state(pPager) );

	7298 assert( pPager->hasHeldSharedLock==1 );

	7299

	7300 if( pgno==0 ) return SQLITE_CORRUPT_BKPT;

	7301 pBase = sqlite3PcacheFetch(pPager->pPCache, pgno, 3);

	7302 if( pBase==0 ){

	7303 pPg = 0;

	7304 rc = sqlite3PcacheFetchStress(pPager->pPCache, pgno, &pBase);

	7305 if( rc!=SQLITE_OK ) goto pager_acquire_err;

	7306 if( pBase==0 ){

	7307 rc = SQLITE_NOMEM_BKPT;

	7308 goto pager_acquire_err;

	7309 }

	7310 }

	7311 pPg = *ppPage = sqlite3PcacheFetchFinish(pPager->pPCache, pgno, pBase);

	7312 assert( pPg==(*ppPage) );

	7313 assert( pPg->pgno==pgno );

	7314 assert( pPg->pPager==pPager || pPg->pPager==0 );

	7315

	7316 noContent = (flags & PAGER_GET_NOCONTENT)!=0;

	7317 if( pPg->pPager && !noContent ){

	7318 /* In this case the pcache already contains an initialized copy of

	7319 ** the page. Return without further ado. */

	7320 assert( pgno<=PAGER_MAX_PGNO && pgno!=PAGER_MJ_PGNO(pPager) );

	7321 pPager->aStat[PAGER_STAT_HIT]++;

	7322 return SQLITE_OK;

	7323

	7324 }else{

	7325 /* The pager cache has created a new page. Its content needs to

	7326 ** be initialized. But first some error checks:

	7327 **

	7328 ** (1) The maximum page number is 2^31

	7329 ** (2) Never try to fetch the locking page

	7330 */

	7331 if( pgno>PAGER_MAX_PGNO || pgno==PAGER_MJ_PGNO(pPager) ){

	7332 rc = SQLITE_CORRUPT_BKPT;

	7333 goto pager_acquire_err;

	7334 }

	7335

	7336 pPg->pPager = pPager;

	7337

	7338 assert( !isOpen(pPager->fd) || !MEMDB );

	7339 if( !isOpen(pPager->fd) || pPager->dbSize<pgno || noContent ){

	7340 if( pgno>pPager->mxPgno ){

	7341 rc = SQLITE_FULL;

	7342 goto pager_acquire_err;

	7343 }

	7344 if( noContent ){

	7345 /* Failure to set the bits in the InJournal bit-vectors is benign.

	7346 ** It merely means that we might do some extra work to journal a

	7347 ** page that does not need to be journaled. Nevertheless, be sure

	7348 ** to test the case where a malloc error occurs while trying to set

	7349 ** a bit in a bit vector.

	7350 */

	7351 sqlite3BeginBenignMalloc();

	7352 if( pgno<=pPager->dbOrigSize ){

	7353 TESTONLY( rc = ) sqlite3BitvecSet(pPager->pInJournal, pgno);

	7354 testcase( rc==SQLITE_NOMEM );

	7355 }

	7356 TESTONLY( rc = ) addToSavepointBitvecs(pPager, pgno);

	7357 testcase( rc==SQLITE_NOMEM );

	7358 sqlite3EndBenignMalloc();

	7359 }

	7360 memset(pPg->pData, 0, pPager->pageSize);

	7361 IOTRACE(("ZERO %p %d\n", pPager, pgno));

	7362 }else{

	7363 u32 iFrame = 0; /* Frame to read from WAL file */

	7364 if( pagerUseWal(pPager) ){

	7365 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iFrame);

	7366 if( rc!=SQLITE_OK ) goto pager_acquire_err;

	7367 }

	7368 assert( pPg->pPager==pPager );

	7369 pPager->aStat[PAGER_STAT_MISS]++;

	7370 rc = readDbPage(pPg, iFrame);

	7371 if( rc!=SQLITE_OK ){

	7372 goto pager_acquire_err;

	7373 }

	7374 }

	7375 pager_set_pagehash(pPg);

	7376 }

	7377 return SQLITE_OK;

	7378

	7379 pager_acquire_err:

	7380 assert( rc!=SQLITE_OK );

	7381 if( pPg ){

	7382 sqlite3PcacheDrop(pPg);

	7383 }

	7384 pagerUnlockIfUnused(pPager);

	7385 *ppPage = 0;

	7386 return rc;

	7387 }

	7388

	7389 #if SQLITE_MAX_MMAP_SIZE>0

	7390 /* The page getter for when memory-mapped I/O is enabled */

	7391 static int getPageMMap(

	7392 Pager *pPager, /* The pager open on the database file */

	7393 Pgno pgno, /* Page number to fetch */

	7394 DbPage **ppPage, /* Write a pointer to the page here */

	7395 int flags /* PAGER_GET_XXX flags */

	7396 ){

	7397 int rc = SQLITE_OK;

	7398 PgHdr *pPg = 0;

	7399 u32 iFrame = 0; /* Frame to read from WAL file */

	7400

	7401 /* It is acceptable to use a read-only (mmap) page for any page except

	7402 ** page 1 if there is no write-transaction open or the PAGER_GET_READONLY

	7403 ** flag was specified by the caller. And so long as the db is not a

	7404 ** temporary or in-memory database. */

	7405 const int bMmapOk = (pgno>1

	7406 && (pPager->eState==PAGER_READER || (flags & PAGER_GET_READONLY))

	7407 );

	7408

	7409 assert( USEFETCH(pPager) );

	7410 #ifdef SQLITE_HAS_CODEC

	7411 assert( pPager->xCodec==0 );

	7412 #endif

	7413

	7414 /* Optimization note: Adding the "pgno<=1" term before "pgno==0" here

	7415 ** allows the compiler optimizer to reuse the results of the "pgno>1"

	7416 ** test in the previous statement, and avoid testing pgno==0 in the

	7417 ** common case where pgno is large. */

	7418 if( pgno<=1 && pgno==0 ){

	7419 return SQLITE_CORRUPT_BKPT;

	7420 }

	7421 assert( pPager->eState>=PAGER_READER );

	7422 assert( assert_pager_state(pPager) );

	7423 assert( pPager->hasHeldSharedLock==1 );

	7424 assert( pPager->errCode==SQLITE_OK );

	7425

	7426 if( bMmapOk && pagerUseWal(pPager) ){

	7427 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iFrame);

	7428 if( rc!=SQLITE_OK ){

	7429 *ppPage = 0;

	7430 return rc;

	7431 }

	7432 }

	7433 if( bMmapOk && iFrame==0 ){

	7434 void *pData = 0;

	7435 rc = sqlite3OsFetch(pPager->fd,

	7436 (i64)(pgno-1) * pPager->pageSize, pPager->pageSize, &pData

	7437 );

	7438 if( rc==SQLITE_OK && pData ){

	7439 if( pPager->eState>PAGER_READER || pPager->tempFile ){

	7440 pPg = sqlite3PagerLookup(pPager, pgno);

	7441 }

	7442 if( pPg==0 ){

	7443 rc = pagerAcquireMapPage(pPager, pgno, pData, &pPg);

	7444 }else{

	7445 sqlite3OsUnfetch(pPager->fd, (i64)(pgno-1)*pPager->pageSize, pData);

	7446 }

	7447 if( pPg ){

	7448 assert( rc==SQLITE_OK );

	7449 *ppPage = pPg;

	7450 return SQLITE_OK;

	7451 }

	7452 }

	7453 if( rc!=SQLITE_OK ){

	7454 *ppPage = 0;

	7455 return rc;

	7456 }

	7457 }

	7458 return getPageNormal(pPager, pgno, ppPage, flags);

	7459 }

	7460 #endif /* SQLITE_MAX_MMAP_SIZE>0 */
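
getPageMMap() computes the byte offset of a page as (i64)(pgno-1) * pageSize. The cast matters because the product of two 32-bit values can exceed 32 bits. The sketch below shows the same arithmetic using standard fixed-width types. `page_offset()` and `page_offset_wrapped()` are hypothetical helpers, and the caller is assumed to have rejected pgno==0 already, as the routine above does.

```c
#include <stdint.h>

/* Byte offset of page pgno (1-based) in the database file.
** Widening to 64 bits before the multiply prevents 32-bit wraparound. */
static int64_t page_offset(uint32_t pgno, uint32_t pageSize){
  return (int64_t)(pgno-1) * pageSize;
}

/* The same product computed entirely in 32 bits, for comparison only.
** Unsigned arithmetic wraps modulo 2^32, so large offsets are lost. */
static uint32_t page_offset_wrapped(uint32_t pgno, uint32_t pageSize){
  return (pgno-1) * pageSize;
}
```

With 4096-byte pages, page 1048577 starts exactly 4 GiB into the file: the 64-bit version yields 4294967296, while the 32-bit version wraps to 0.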

	7461

	7462 /* The page getter method for when the pager is in an error state */

	7463 static int getPageError(

	7464 Pager *pPager, /* The pager open on the database file */

	7465 Pgno pgno, /* Page number to fetch */

	7466 DbPage **ppPage, /* Write a pointer to the page here */

	7467 int flags /* PAGER_GET_XXX flags */

	7468 ){

	7469 UNUSED_PARAMETER(pgno);

	7470 UNUSED_PARAMETER(flags);

	7471 assert( pPager->errCode!=SQLITE_OK );

	7472 *ppPage = 0;

	7473 return pPager->errCode;

	7474 }

	7475

	7476

	7477 /* Dispatch all page fetch requests to the appropriate getter method.

	7478 */

	7479 SQLITE_PRIVATE int sqlite3PagerGet(

	7480 Pager *pPager, /* The pager open on the database file */

	7481 Pgno pgno, /* Page number to fetch */

	7482 DbPage **ppPage, /* Write a pointer to the page here */

	7483 int flags /* PAGER_GET_XXX flags */

	7484 ){

	7485 return pPager->xGet(pPager, pgno, ppPage, flags);

	7486 }
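
sqlite3PagerGet() forwards every request through the xGet function pointer, so the choice of getter is made once when the pager state changes rather than on every fetch. The sketch below illustrates that pattern in isolation. `MiniPager`, `set_getter()`, and the `mini_*` functions are hypothetical names, and the rule used to pick a getter here is an assumption, since the code that assigns xGet is not part of this section.

```c
typedef struct MiniPager MiniPager;
typedef int (*GetFn)(MiniPager*, unsigned pgno, int *pOut);

struct MiniPager {
  int errCode;  /* Non-zero once the pager has entered an error state */
  GetFn xGet;   /* Currently selected getter */
};

/* Normal path: stands in for a real page fetch by echoing the page number. */
static int mini_get_normal(MiniPager *p, unsigned pgno, int *pOut){
  (void)p;
  *pOut = (int)pgno;
  return 0;
}

/* Error path: always fails with the stored error code. */
static int mini_get_error(MiniPager *p, unsigned pgno, int *pOut){
  (void)pgno;
  *pOut = 0;
  return p->errCode;
}

/* Re-select the getter; call this whenever errCode changes. */
static void set_getter(MiniPager *p){
  p->xGet = p->errCode ? mini_get_error : mini_get_normal;
}

/* Dispatch, in the style of sqlite3PagerGet(). */
static int mini_get(MiniPager *p, unsigned pgno, int *pOut){
  return p->xGet(p, pgno, pOut);
}
```

The benefit of this layout is that the hot path contains no state tests: the normal getter can assert that no error is pending instead of checking for one.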

	7487

	7488 /*

	7489 ** Acquire a page if it is already in the in-memory cache. Do

	7490 ** not read the page from disk. Return a pointer to the page,

	7491 ** or 0 if the page is not in cache.

	7492 **

	7493 ** See also sqlite3PagerGet(). The difference between this routine

	7494 ** and sqlite3PagerGet() is that _get() will go to the disk and read

	7495 ** in the page if the page is not already in cache. This routine

	7496 ** returns NULL if the page is not in cache or if a disk I/O error

	7497 ** has ever happened.

	7498 */

	7499 SQLITE_PRIVATE DbPage *sqlite3PagerLookup(Pager *pPager, Pgno pgno){

	7500 sqlite3_pcache_page *pPage;

	7501 assert( pPager!=0 );

	7502 assert( pgno!=0 );

	7503 assert( pPager->pPCache!=0 );

	7504 pPage = sqlite3PcacheFetch(pPager->pPCache, pgno, 0);

	7505 assert( pPage==0 || pPager->hasHeldSharedLock );

	7506 if( pPage==0 ) return 0;

	7507 return sqlite3PcacheFetchFinish(pPager->pPCache, pgno, pPage);

	7508 }

	7509

	7510 /*

	7511 ** Release a page reference.

	7512 **

	7513 ** If the number of references to the page drops to zero, then the

	7514 ** page is added to the LRU list. When all references to all pages

	7515 ** are released, a rollback occurs and the lock on the database is

	7516 ** removed.

	7517 */

	7518 SQLITE_PRIVATE void sqlite3PagerUnrefNotNull(DbPage *pPg){

	7519 Pager *pPager;

	7520 assert( pPg!=0 );

	7521 pPager = pPg->pPager;

	7522 if( pPg->flags & PGHDR_MMAP ){

	7523 pagerReleaseMapPage(pPg);

	7524 }else{

	7525 sqlite3PcacheRelease(pPg);

	7526 }

	7527 pagerUnlockIfUnused(pPager);

	7528 }

	7529 SQLITE_PRIVATE void sqlite3PagerUnref(DbPage *pPg){

	7530 if( pPg ) sqlite3PagerUnrefNotNull(pPg);

	7531 }

	7532

	7533 /*

	7534 ** This function is called at the start of every write transaction.

	7535 ** There must already be a RESERVED or EXCLUSIVE lock on the database

	7536 ** file when this routine is called.

	7537 **

	7538 ** Open the journal file for pager pPager and write a journal header

	7539 ** to the start of it. If there are active savepoints, open the sub-journal

	7540 ** as well. This function is only used when the journal file is being

	7541 ** opened to write a rollback log for a transaction. It is not used

	7542 ** when opening a hot journal file to roll it back.

	7543 **

	7544 ** If the journal file is already open (as it may be in exclusive mode),

	7545 ** then this function just writes a journal header to the start of the

	7546 ** already open file.

	7547 **

	7548 ** Whether or not the journal file is opened by this function, the

	7549 ** Pager.pInJournal bitvec structure is allocated.

	7550 **

	7551 ** Return SQLITE_OK if everything is successful. Otherwise, return

	7552 ** SQLITE_NOMEM if the attempt to allocate Pager.pInJournal fails, or

	7553 ** an IO error code if opening or writing the journal file fails.

	7554 */

	7555 static int pager_open_journal(Pager *pPager){

	7556 int rc = SQLITE_OK; /* Return code */

	7557 sqlite3_vfs * const pVfs = pPager->pVfs; /* Local cache of vfs pointer */

	7558

	7559 assert( pPager->eState==PAGER_WRITER_LOCKED );

	7560 assert( assert_pager_state(pPager) );

	7561 assert( pPager->pInJournal==0 );

	7562

	7563 /* If already in the error state, this function is a no-op. But on

	7564 ** the other hand, this routine is never called if we are already in

	7565 ** an error state. */

	7566 if( NEVER(pPager->errCode) ) return pPager->errCode;

	7567

	7568 if( !pagerUseWal(pPager) && pPager->journalMode!=PAGER_JOURNALMODE_OFF ){

	7569 pPager->pInJournal = sqlite3BitvecCreate(pPager->dbSize);

	7570 if( pPager->pInJournal==0 ){

	7571 return SQLITE_NOMEM_BKPT;

	7572 }

	7573

	7574 /* Open the journal file if it is not already open. */

	7575 if( !isOpen(pPager->jfd) ){

	7576 if( pPager->journalMode==PAGER_JOURNALMODE_MEMORY ){

	7577 sqlite3MemJournalOpen(pPager->jfd);

	7578 }else{

	7579 int flags = SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE;

	7580 int nSpill;

	7581

	7582 if( pPager->tempFile ){

	7583 flags |= (SQLITE_OPEN_DELETEONCLOSE|SQLITE_OPEN_TEMP_JOURNAL);

	7584 nSpill = sqlite3Config.nStmtSpill;

	7585 }else{

	7586 flags |= SQLITE_OPEN_MAIN_JOURNAL;

	7587 nSpill = jrnlBufferSize(pPager);

	7588 }

	7589

	7590 /* Verify that the database still has the same name as it did when

	7591 ** it was originally opened. */

	7592 rc = databaseIsUnmoved(pPager);

	7593 if( rc==SQLITE_OK ){

	7594 rc = sqlite3JournalOpen (

	7595 pVfs, pPager->zJournal, pPager->jfd, flags, nSpill

	7596 );

	7597 }

	7598 }

	7599 assert( rc!=SQLITE_OK || isOpen(pPager->jfd) );

	7600 }

	7601

	7602

	7603 /* Write the first journal header to the journal file and open

	7604 ** the sub-journal if necessary.

	7605 */

	7606 if( rc==SQLITE_OK ){

	7607 /* TODO: Check if all of these are really required. */

	7608 pPager->nRec = 0;

	7609 pPager->journalOff = 0;

	7610 pPager->setMaster = 0;

	7611 pPager->journalHdr = 0;

	7612 rc = writeJournalHdr(pPager);

	7613 }

	7614 }

	7615

	7616 if( rc!=SQLITE_OK ){

	7617 sqlite3BitvecDestroy(pPager->pInJournal);

	7618 pPager->pInJournal = 0;

	7619 }else{

	7620 assert( pPager->eState==PAGER_WRITER_LOCKED );

	7621 pPager->eState = PAGER_WRITER_CACHEMOD;

	7622 }

	7623

	7624 return rc;

	7625 }

	7626

	7627 /*

	7628 ** Begin a write-transaction on the specified pager object. If a

	7629 ** write-transaction has already been opened, this function is a no-op.

	7630 **

	7631 ** If the exFlag argument is false, then acquire at least a RESERVED

	7632 ** lock on the database file. If exFlag is true, then acquire at least

	7633 ** an EXCLUSIVE lock. If such a lock is already held, no locking

	7634 ** functions need be called.

	7635 **

	7636 ** If the subjInMemory argument is non-zero, then any sub-journal opened

	7637 ** within this transaction will be opened as an in-memory file. This

	7638 ** has no effect if the sub-journal is already opened (as it may be when

	7639 ** running in exclusive mode) or if the transaction does not require a

	7640 ** sub-journal. If the subjInMemory argument is zero, then any required

	7641 ** sub-journal is implemented in-memory if pPager is an in-memory database,

	7642 ** or using a temporary file otherwise.

	7643 */

	7644 SQLITE_PRIVATE int sqlite3PagerBegin(Pager *pPager, int exFlag, int subjInMemory ){

	7645 int rc = SQLITE_OK;

	7646

	7647 if( pPager->errCode ) return pPager->errCode;

	7648 assert( pPager->eState>=PAGER_READER && pPager->eState<PAGER_ERROR );

	7649 pPager->subjInMemory = (u8)subjInMemory;

	7650

	7651 if( ALWAYS(pPager->eState==PAGER_READER) ){

	7652 assert( pPager->pInJournal==0 );

	7653

	7654 if( pagerUseWal(pPager) ){

	7655 /* If the pager is configured to use locking_mode=exclusive, and an

	7656 ** exclusive lock on the database is not already held, obtain it now.

	7657 */

	7658 if( pPager->exclusiveMode && sqlite3WalExclusiveMode(pPager->pWal, -1) ){

	7659 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);

	7660 if( rc!=SQLITE_OK ){

	7661 return rc;

	7662 }

	7663 (void)sqlite3WalExclusiveMode(pPager->pWal, 1);

	7664 }

	7665

	7666 /* Grab the write lock on the log file. If successful, upgrade to

	7667 ** PAGER_RESERVED state. Otherwise, return an error code to the caller.

	7668 ** The busy-handler is not invoked if another connection already

	7669 ** holds the write-lock. If possible, the upper layer will call it.

	7670 */

	7671 rc = sqlite3WalBeginWriteTransaction(pPager->pWal);

	7672 }else{

	7673 /* Obtain a RESERVED lock on the database file. If the exFlag parameter

	7674 ** is true, then immediately upgrade this to an EXCLUSIVE lock. The

	7675 ** busy-handler callback can be used when upgrading to the EXCLUSIVE

	7676 ** lock, but not when obtaining the RESERVED lock.

	7677 */

	7678 rc = pagerLockDb(pPager, RESERVED_LOCK);

	7679 if( rc==SQLITE_OK && exFlag ){

	7680 rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);

	7681 }

	7682 }

	7683

	7684 if( rc==SQLITE_OK ){

	7685 /* Change to WRITER_LOCKED state.

	7686 **

	7687 ** WAL mode sets Pager.eState to PAGER_WRITER_LOCKED or CACHEMOD

	7688 ** when it has an open transaction, but never to DBMOD or FINISHED.

	7689 ** This is because in those states the code to roll back savepoint

	7690 ** transactions may copy data from the sub-journal into the database

	7691 ** file as well as into the page cache. Which would be incorrect in

	7692 ** WAL mode.

	7693 */

	7694 pPager->eState = PAGER_WRITER_LOCKED;

	7695 pPager->dbHintSize = pPager->dbSize;

	7696 pPager->dbFileSize = pPager->dbSize;

	7697 pPager->dbOrigSize = pPager->dbSize;

	7698 pPager->journalOff = 0;

	7699 }

	7700

	7701 assert( rc==SQLITE_OK \|\| pPager->eState==PAGER_READER );

	7702 assert( rc!=SQLITE_OK \|\| pPager->eState==PAGER_WRITER_LOCKED );

	7703 assert( assert_pager_state(pPager) );

	7704 }

	7705

	7706 PAGERTRACE(("TRANSACTION %d\n", PAGERID(pPager)));

	7707 return rc;

	7708 }
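The rollback-mode branch of sqlite3PagerBegin() above can be read as a small state transition: starting from READER, take RESERVED (and EXCLUSIVE when exFlag is set), then move to WRITER_LOCKED; on failure the state stays READER. A minimal toy model of that flow follows. Every name in it (ToyPager, toy_begin, the TOY_* constants) is hypothetical and the lock contention is simulated by a flag; it is a sketch of the control flow, not SQLite's implementation, and it ignores WAL mode and a failed EXCLUSIVE upgrade.

```c
#include <assert.h>

#define TOY_OK   0
#define TOY_BUSY 5   /* hypothetical "lock not available" code */

enum ToyState { TOY_READER, TOY_WRITER_LOCKED };
enum ToyLock  { TOY_SHARED, TOY_RESERVED, TOY_EXCLUSIVE };

typedef struct ToyPager {
  enum ToyState eState;   /* current pager state */
  enum ToyLock eLock;     /* strongest lock currently held */
} ToyPager;

/* Begin a write-transaction on the toy pager. otherHoldsReserved
** simulates another connection that already holds RESERVED. */
static int toy_begin(ToyPager *p, int exFlag, int otherHoldsReserved){
  if( p->eState!=TOY_READER ) return TOY_OK;  /* already a writer: no-op */
  if( otherHoldsReserved ) return TOY_BUSY;   /* state stays READER */
  p->eLock = exFlag ? TOY_EXCLUSIVE : TOY_RESERVED;
  p->eState = TOY_WRITER_LOCKED;
  return TOY_OK;
}
```

Calling toy_begin() a second time on the same pager returns TOY_OK without touching the lock, which mirrors the "no-op if already opened" wording in the header comment.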

	7709

	7710 /*

	7711 ** Write page pPg onto the end of the rollback journal.

	7712 */

	7713 static SQLITE_NOINLINE int pagerAddPageToRollbackJournal(PgHdr *pPg){

	7714 Pager *pPager = pPg->pPager;

	7715 int rc;

	7716 u32 cksum;

	7717 char *pData2;

	7718 i64 iOff = pPager->journalOff;

	7719

	7720 /* We should never write to the journal file the page that

	7721 ** contains the database locks. The following assert verifies

	7722 ** that we do not. */

	7723 assert( pPg->pgno!=PAGER_MJ_PGNO(pPager) );

	7724

	7725 assert( pPager->journalHdr<=pPager->journalOff );

	7726 CODEC2(pPager, pPg->pData, pPg->pgno, 7, return SQLITE_NOMEM_BKPT, pData2);

	7727 cksum = pager_cksum(pPager, (u8*)pData2);

	7728

	7729 /* Even if an IO or diskfull error occurs while journalling the

	7730 ** page in the block above, set the need-sync flag for the page.

	7731 ** Otherwise, when the transaction is rolled back, the logic in

	7732 ** playback_one_page() will think that the page needs to be restored

	7733 ** in the database file. And if an IO error occurs while doing so,

	7734 ** then corruption may follow.

	7735 */

	7736 pPg->flags \|= PGHDR_NEED_SYNC;

	7737

	7738 rc = write32bits(pPager->jfd, iOff, pPg->pgno);

	7739 if( rc!=SQLITE_OK ) return rc;

	7740 rc = sqlite3OsWrite(pPager->jfd, pData2, pPager->pageSize, iOff+4);

	7741 if( rc!=SQLITE_OK ) return rc;

	7742 rc = write32bits(pPager->jfd, iOff+pPager->pageSize+4, cksum);

	7743 if( rc!=SQLITE_OK ) return rc;

	7744

	7745 IOTRACE(("JOUT %p %d %lld %d\n", pPager, pPg->pgno,

	7746 pPager->journalOff, pPager->pageSize));

	7747 PAGER_INCR(sqlite3_pager_writej_count);

	7748 PAGERTRACE(("JOURNAL %d page %d needSync=%d hash(%08x)\n",

	7749 PAGERID(pPager), pPg->pgno,

	7750 ((pPg->flags&PGHDR_NEED_SYNC)?1:0), pager_pagehash(pPg)));

	7751

	7752 pPager->journalOff += 8 + pPager->pageSize;

	7753 pPager->nRec++;

	7754 assert( pPager->pInJournal!=0 );

	7755 rc = sqlite3BitvecSet(pPager->pInJournal, pPg->pgno);

	7756 testcase( rc==SQLITE_NOMEM );

	7757 assert( rc==SQLITE_OK \|\| rc==SQLITE_NOMEM );

	7758 rc \|= addToSavepointBitvecs(pPager, pPg->pgno);

	7759 assert( rc==SQLITE_OK \|\| rc==SQLITE_NOMEM );

	7760 return rc;

	7761 }
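pagerAddPageToRollbackJournal() above appends one record per page: a 4-byte page number, the page image, then a 4-byte checksum, which is why journalOff advances by 8 + pageSize. A minimal sketch of that record layout follows, assuming the 32-bit fields are stored big-endian. The helper names (put_be32, journal_record_size, build_journal_record) are hypothetical, and the checksum is passed in rather than computed as pager_cksum() would.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in big-endian byte order. */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Bytes occupied by one record: page number + page image + checksum. */
static size_t journal_record_size(size_t pageSize){
  return 4 + pageSize + 4;
}

/* Assemble one record into zOut, which must have room for
** journal_record_size(pageSize) bytes. Returns the bytes written. */
static size_t build_journal_record(
  unsigned char *zOut,          /* Output buffer */
  uint32_t pgno,                /* Page number being journalled */
  const unsigned char *pData,   /* Original page content */
  size_t pageSize,              /* Size of pData in bytes */
  uint32_t cksum                /* Checksum supplied by the caller */
){
  put_be32(zOut, pgno);
  memcpy(zOut+4, pData, pageSize);
  put_be32(zOut+4+pageSize, cksum);
  return journal_record_size(pageSize);
}
```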

	7762

	7763 /*

	7764 ** Mark a single data page as writeable. The page is written into the

	7765 ** main journal or sub-journal as required. If the page is written into

	7766 ** one of the journals, the corresponding bit is set in the

	7767 ** Pager.pInJournal bitvec and the PagerSavepoint.pInSavepoint bitvecs

	7768 ** of any open savepoints as appropriate.

	7769 */

	7770 static int pager_write(PgHdr *pPg){

	7771 Pager *pPager = pPg->pPager;

	7772 int rc = SQLITE_OK;

	7773

	7774 /* This routine is not called unless a write-transaction has already

	7775 ** been started. The journal file may or may not be open at this point.

	7776 ** It is never called in the ERROR state.

	7777 */

	7778 assert( pPager->eState==PAGER_WRITER_LOCKED

	7779 \|\| pPager->eState==PAGER_WRITER_CACHEMOD

	7780 \|\| pPager->eState==PAGER_WRITER_DBMOD

	7781 );

	7782 assert( assert_pager_state(pPager) );

	7783 assert( pPager->errCode==0 );

	7784 assert( pPager->readOnly==0 );

	7785 CHECK_PAGE(pPg);

	7786

	7787 /* The journal file needs to be opened. Higher level routines have already

	7788 ** obtained the necessary locks to begin the write-transaction, but the

	7789 ** rollback journal might not yet be open. Open it now if this is the case.

	7790 **

	7791 ** This is done before calling sqlite3PcacheMakeDirty() on the page.

	7792 ** Otherwise, if it were done after calling sqlite3PcacheMakeDirty(), then

	7793 ** an error might occur and the pager would end up in WRITER_LOCKED state

	7794 ** with pages marked as dirty in the cache.

	7795 */

	7796 if( pPager->eState==PAGER_WRITER_LOCKED ){

	7797 rc = pager_open_journal(pPager);

	7798 if( rc!=SQLITE_OK ) return rc;

	7799 }

	7800 assert( pPager->eState>=PAGER_WRITER_CACHEMOD );

	7801 assert( assert_pager_state(pPager) );

	7802

	7803 /* Mark the page that is about to be modified as dirty. */

	7804 sqlite3PcacheMakeDirty(pPg);

	7805

	7806 /* If a rollback journal is in use, then make sure the page that is about

	7807 ** to change is in the rollback journal, or if the page is a new page off

	7808 ** the end of the file, make sure it is marked as PGHDR_NEED_SYNC.

	7809 */

	7810 assert( (pPager->pInJournal!=0) == isOpen(pPager->jfd) );

	7811 if( pPager->pInJournal!=0

	7812 && sqlite3BitvecTestNotNull(pPager->pInJournal, pPg->pgno)==0

	7813 ){

	7814 assert( pagerUseWal(pPager)==0 );

	7815 if( pPg->pgno<=pPager->dbOrigSize ){

	7816 rc = pagerAddPageToRollbackJournal(pPg);

	7817 if( rc!=SQLITE_OK ){

	7818 return rc;

	7819 }

	7820 }else{

	7821 if( pPager->eState!=PAGER_WRITER_DBMOD ){

	7822 pPg->flags \|= PGHDR_NEED_SYNC;

	7823 }

	7824 PAGERTRACE(("APPEND %d page %d needSync=%d\n",

	7825 PAGERID(pPager), pPg->pgno,

	7826 ((pPg->flags&PGHDR_NEED_SYNC)?1:0)));

	7827 }

	7828 }

	7829

	7830 /* The PGHDR_DIRTY bit is set above when the page was added to the dirty-list

	7831 ** and before writing the page into the rollback journal. Wait until now,

	7832 ** after the page has been successfully journalled, before setting the

	7833 ** PGHDR_WRITEABLE bit that indicates that the page can be safely modified.

	7834 */

	7835 pPg->flags \|= PGHDR_WRITEABLE;

	7836

	7837 /* If the statement journal is open and the page is not in it,

	7838 ** then write the page into the statement journal.

	7839 */

	7840 if( pPager->nSavepoint>0 ){

	7841 rc = subjournalPageIfRequired(pPg);

	7842 }

	7843

	7844 /* Update the database size and return. */

	7845 if( pPager->dbSize<pPg->pgno ){

	7846 pPager->dbSize = pPg->pgno;

	7847 }

	7848 return rc;

	7849 }

	7850

	7851 /*

	7852 ** This is a variant of sqlite3PagerWrite() that runs when the sector size

	7853 ** is larger than the page size. SQLite makes the (reasonable) assumption that

	7854 ** all bytes of a sector are written together by hardware. Hence, all bytes of

	7855 ** a sector need to be journalled in case of a power loss in the middle of

	7856 ** a write.

	7857 **

	7858 ** Usually, the sector size is less than or equal to the page size, in which

	7859 ** case pages can be individually written. This routine only runs in the

	7860 ** exceptional case where the page size is smaller than the sector size.

	7861 */

	7862 static SQLITE_NOINLINE int pagerWriteLargeSector(PgHdr *pPg){

	7863 int rc = SQLITE_OK; /* Return code */

	7864 Pgno nPageCount; /* Total number of pages in database file */

	7865 Pgno pg1; /* First page of the sector pPg is located on. */

	7866 int nPage = 0; /* Number of pages starting at pg1 to journal */

	7867 int ii; /* Loop counter */

	7868 int needSync = 0; /* True if any page has PGHDR_NEED_SYNC */

	7869 Pager *pPager = pPg->pPager; /* The pager that owns pPg */

	7870 Pgno nPagePerSector = (pPager->sectorSize/pPager->pageSize);

	7871

	7872 /* Set the doNotSpill NOSYNC bit to 1. This is because we cannot allow

	7873 ** a journal header to be written between the pages journaled by

	7874 ** this function.

	7875 */

	7876 assert( !MEMDB );

	7877 assert( (pPager->doNotSpill & SPILLFLAG_NOSYNC)==0 );

	7878 pPager->doNotSpill \|= SPILLFLAG_NOSYNC;

	7879

	7880 /* This trick assumes that both the page-size and sector-size are

	7881 ** an integer power of 2. It sets variable pg1 to the identifier

	7882 ** of the first page of the sector pPg is located on.

	7883 */

	7884 pg1 = ((pPg->pgno-1) & ~(nPagePerSector-1)) + 1;

	7885

	7886 nPageCount = pPager->dbSize;

	7887 if( pPg->pgno>nPageCount ){

	7888 nPage = (pPg->pgno - pg1)+1;

	7889 }else if( (pg1+nPagePerSector-1)>nPageCount ){

	7890 nPage = nPageCount+1-pg1;

	7891 }else{

	7892 nPage = nPagePerSector;

	7893 }

	7894 assert(nPage>0);

	7895 assert(pg1<=pPg->pgno);

	7896 assert((pg1+nPage)>pPg->pgno);

	7897

	7898 for(ii=0; ii<nPage && rc==SQLITE_OK; ii++){

	7899 Pgno pg = pg1+ii;

	7900 PgHdr *pPage;

	7901 if( pg==pPg->pgno \|\| !sqlite3BitvecTest(pPager->pInJournal, pg) ){

	7902 if( pg!=PAGER_MJ_PGNO(pPager) ){

	7903 rc = sqlite3PagerGet(pPager, pg, &pPage, 0);

	7904 if( rc==SQLITE_OK ){

	7905 rc = pager_write(pPage);

	7906 if( pPage->flags&PGHDR_NEED_SYNC ){

	7907 needSync = 1;

	7908 }

	7909 sqlite3PagerUnrefNotNull(pPage);

	7910 }

	7911 }

	7912 }else if( (pPage = sqlite3PagerLookup(pPager, pg))!=0 ){

	7913 if( pPage->flags&PGHDR_NEED_SYNC ){

	7914 needSync = 1;

	7915 }

	7916 sqlite3PagerUnrefNotNull(pPage);

	7917 }

	7918 }

	7919

	7920 /* If the PGHDR_NEED_SYNC flag is set for any of the nPage pages

	7921 ** starting at pg1, then it needs to be set for all of them. Because

	7922 ** writing to any of these nPage pages may damage the others, the

	7923 ** journal file must contain sync()ed copies of all of them

	7924 ** before any of them can be written out to the database file.

	7925 */

	7926 if( rc==SQLITE_OK && needSync ){

	7927 assert( !MEMDB );

	7928 for(ii=0; ii<nPage; ii++){

	7929 PgHdr *pPage = sqlite3PagerLookup(pPager, pg1+ii);

	7930 if( pPage ){

	7931 pPage->flags \|= PGHDR_NEED_SYNC;

	7932 sqlite3PagerUnrefNotNull(pPage);

	7933 }

	7934 }

	7935 }

	7936

	7937 assert( (pPager->doNotSpill & SPILLFLAG_NOSYNC)!=0 );

	7938 pPager->doNotSpill &= ~SPILLFLAG_NOSYNC;

	7939 return rc;

	7940 }
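The pg1 computation in pagerWriteLargeSector() relies on the pages-per-sector count being a power of two, so masking the zero-based page index rounds it down to a sector boundary. The sketch below isolates that arithmetic together with the three-way nPage calculation. The function names are hypothetical; the expressions follow the code above.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Pgno;

/* First page of the sector that contains pgno. Page numbers are
** 1-based and nPagePerSector must be a power of two. */
static Pgno sector_first_page(Pgno pgno, Pgno nPagePerSector){
  assert( pgno>=1 );
  assert( nPagePerSector>0 && (nPagePerSector & (nPagePerSector-1))==0 );
  return ((pgno-1) & ~(nPagePerSector-1)) + 1;
}

/* Number of pages, starting at the sector's first page, that need
** to be journalled when page pgno is written. dbSize is the current
** database size in pages. */
static int sector_page_count(Pgno pgno, Pgno dbSize, Pgno nPagePerSector){
  Pgno pg1 = sector_first_page(pgno, nPagePerSector);
  if( pgno>dbSize ){
    return (int)(pgno - pg1) + 1;        /* page lies past end of file */
  }else if( pg1+nPagePerSector-1>dbSize ){
    return (int)(dbSize + 1 - pg1);      /* sector is cut off by end of file */
  }
  return (int)nPagePerSector;            /* whole sector is inside the file */
}
```

With four pages per sector, page 6 belongs to the sector that starts at page 5; if the database holds only 6 pages, just pages 5 and 6 are journalled.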

	7941

	7942 /*

	7943 ** Mark a data page as writeable. This routine must be called before

	7944 ** making changes to a page. The caller must check the return value

	7945 ** of this function and be careful not to change any page data unless

	7946 ** this routine returns SQLITE_OK.

	7947 **

	7948 ** The difference between this function and pager_write() is that this

	7949 ** function also deals with the special case where 2 or more pages

	7950 ** fit on a single disk sector. In this case all co-resident pages

	7951 ** must have been written to the journal file before returning.

	7952 **

	7953 ** If an error occurs, SQLITE_NOMEM or an IO error code is returned

	7954 ** as appropriate. Otherwise, SQLITE_OK.

	7955 */

	7956 SQLITE_PRIVATE int sqlite3PagerWrite(PgHdr *pPg){

	7957 Pager *pPager = pPg->pPager;

	7958 assert( (pPg->flags & PGHDR_MMAP)==0 );

	7959 assert( pPager->eState>=PAGER_WRITER_LOCKED );

	7960 assert( assert_pager_state(pPager) );

	7961 if( (pPg->flags & PGHDR_WRITEABLE)!=0 && pPager->dbSize>=pPg->pgno ){

	7962 if( pPager->nSavepoint ) return subjournalPageIfRequired(pPg);

	7963 return SQLITE_OK;

	7964 }else if( pPager->errCode ){

	7965 return pPager->errCode;

	7966 }else if( pPager->sectorSize > (u32)pPager->pageSize ){

	7967 assert( pPager->tempFile==0 );

	7968 return pagerWriteLargeSector(pPg);

	7969 }else{

	7970 return pager_write(pPg);

	7971 }

	7972 }
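sqlite3PagerWrite() above is essentially a four-way dispatch. The sketch below restates the decision as a pure function so the order of the checks is easy to see. ToyPager, WritePath, TOY_WRITEABLE and choose_write_path are hypothetical names for illustration only, and the flag value is not PGHDR_WRITEABLE's real value.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WRITEABLE 0x004  /* hypothetical flag bit */

typedef enum {
  WRITE_ALREADY_DONE,  /* page already writeable (the real code may still
                       ** sub-journal it when savepoints are open) */
  WRITE_FAIL_ERRCODE,  /* pager carries a prior error: return it */
  WRITE_LARGE_SECTOR,  /* sector larger than page: journal whole sector */
  WRITE_SINGLE_PAGE    /* common case: journal just this page */
} WritePath;

typedef struct ToyPager {
  uint32_t dbSize;      /* database size in pages */
  int errCode;          /* non-zero after an unrecoverable error */
  uint32_t sectorSize;  /* device sector size in bytes */
  uint32_t pageSize;    /* database page size in bytes */
} ToyPager;

/* Pick the path that the dispatch above would take for page pgno. */
static WritePath choose_write_path(const ToyPager *p, uint32_t pgno,
                                   unsigned flags){
  if( (flags & TOY_WRITEABLE)!=0 && p->dbSize>=pgno ){
    return WRITE_ALREADY_DONE;
  }else if( p->errCode ){
    return WRITE_FAIL_ERRCODE;
  }else if( p->sectorSize > p->pageSize ){
    return WRITE_LARGE_SECTOR;
  }
  return WRITE_SINGLE_PAGE;
}
```

Note that the "already writeable" test comes first, so a page that was made writeable earlier takes the fast path even if the pager has since recorded an error.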

	7973

	7974 /*

	7975 ** Return TRUE if the page given in the argument was previously passed

	7976 ** to sqlite3PagerWrite(). In other words, return TRUE if it is ok

	7977 ** to change the content of the page.

	7978 */

	7979 #ifndef NDEBUG

	7980 SQLITE_PRIVATE int sqlite3PagerIswriteable(DbPage *pPg){

	7981 return pPg->flags & PGHDR_WRITEABLE;

	7982 }

	7983 #endif

	7984

	7985 /*

	7986 ** A call to this routine tells the pager that it is not necessary to

	7987 ** write the information on page pPg back to the disk, even though

	7988 ** that page might be marked as dirty. This happens, for example, when

	7989 ** the page has been added as a leaf of the freelist and so its

	7990 ** content no longer matters.

	7991 **

	7992 ** The overlying software layer calls this routine when all of the data

	7993 ** on the given page is unused. The pager marks the page as clean so

	7994 ** that it does not get written to disk.

	7995 **

	7996 ** Tests show that this optimization can quadruple the speed of large

	7997 ** DELETE operations.

	7998 **

	7999 ** This optimization cannot be used with a temp-file, as the page may

	8000 ** have been dirty at the start of the transaction. In that case, if

	8001 ** memory pressure forces page pPg out of the cache, the data does need

	8002 ** to be written out to disk so that it may be read back in if the

	8003 ** current transaction is rolled back.

	8004 */

	8005 SQLITE_PRIVATE void sqlite3PagerDontWrite(PgHdr *pPg){

	8006 Pager *pPager = pPg->pPager;

	8007 if( !pPager->tempFile && (pPg->flags&PGHDR_DIRTY) && pPager->nSavepoint==0 ){

	8008 PAGERTRACE(("DONT_WRITE page %d of %d\n", pPg->pgno, PAGERID(pPager)));

	8009 IOTRACE(("CLEAN %p %d\n", pPager, pPg->pgno))

	8010 pPg->flags \|= PGHDR_DONT_WRITE;

	8011 pPg->flags &= ~PGHDR_WRITEABLE;

	8012 testcase( pPg->flags & PGHDR_NEED_SYNC );

	8013 pager_set_pagehash(pPg);

	8014 }

	8015 }

	8016

	8017 /*

	8018 ** This routine is called to increment the value of the database file

	8019 ** change-counter, stored as a 4-byte big-endian integer starting at

	8020 ** byte offset 24 of the pager file. The secondary change counter at

	8021 ** 92 is also updated, as is the SQLite version number at offset 96.

	8022 **

	8023 ** But this only happens if the pPager->changeCountDone flag is false.

	8024 ** To avoid excess churning of page 1, the update only happens once.

	8025 ** See also the pager_write_changecounter() routine that does an

	8026 ** unconditional update of the change counters.

	8027 **

	8028 ** If the isDirectMode flag is zero, then this is done by calling

	8029 ** sqlite3PagerWrite() on page 1, then modifying the contents of the

	8030 ** page data. In this case the file will be updated when the current

	8031 ** transaction is committed.

	8032 **

	8033 ** The isDirectMode flag may only be non-zero if the library was compiled

	8034 ** with the SQLITE_ENABLE_ATOMIC_WRITE macro defined. In this case,

	8035 ** if isDirect is non-zero, then the database file is updated directly

	8036 ** by writing an updated version of page 1 using a call to the

	8037 ** sqlite3OsWrite() function.

	8038 */

	8039 static int pager_incr_changecounter(Pager *pPager, int isDirectMode){

	8040 int rc = SQLITE_OK;

	8041

	8042 assert( pPager->eState==PAGER_WRITER_CACHEMOD

	8043 \|\| pPager->eState==PAGER_WRITER_DBMOD

	8044 );

	8045 assert( assert_pager_state(pPager) );

	8046

	8047 /* Declare and initialize constant integer 'isDirect'. If the

	8048 ** atomic-write optimization is enabled in this build, then isDirect

	8049 ** is initialized to the value passed as the isDirectMode parameter

	8050 ** to this function. Otherwise, it is always set to zero.

	8051 **

	8052 ** The idea is that if the atomic-write optimization is not

	8053 ** enabled at compile time, the compiler can omit the tests of

	8054 ** 'isDirect' below, as well as the block enclosed in the

	8055 ** "if( isDirect )" condition.

	8056 */

	8057 #ifndef SQLITE_ENABLE_ATOMIC_WRITE

	8058 # define DIRECT_MODE 0

	8059 assert( isDirectMode==0 );

	8060 UNUSED_PARAMETER(isDirectMode);

	8061 #else

	8062 # define DIRECT_MODE isDirectMode

	8063 #endif

	8064

	8065 if( !pPager->changeCountDone && ALWAYS(pPager->dbSize>0) ){

	8066 PgHdr *pPgHdr; /* Reference to page 1 */

	8067

	8068 assert( !pPager->tempFile && isOpen(pPager->fd) );

	8069

	8070 /* Open page 1 of the file for writing. */

	8071 rc = sqlite3PagerGet(pPager, 1, &pPgHdr, 0);

	8072 assert( pPgHdr==0 \|\| rc==SQLITE_OK );

	8073

	8074 /* If page one was fetched successfully, and this function is not

	8075 ** operating in direct-mode, make page 1 writable. When not in

	8076 ** direct mode, page 1 is always held in cache and hence the PagerGet()

	8077 ** above is always successful - hence the ALWAYS on rc==SQLITE_OK.

	8078 */

	8079 if( !DIRECT_MODE && ALWAYS(rc==SQLITE_OK) ){

	8080 rc = sqlite3PagerWrite(pPgHdr);

	8081 }

	8082

	8083 if( rc==SQLITE_OK ){

	8084 /* Actually do the update of the change counter */

	8085 pager_write_changecounter(pPgHdr);

	8086

	8087 /* If running in direct mode, write the contents of page 1 to the file. */

	8088 if( DIRECT_MODE ){

	8089 const void *zBuf;

	8090 assert( pPager->dbFileSize>0 );

	8091 CODEC2(pPager, pPgHdr->pData, 1, 6, rc=SQLITE_NOMEM_BKPT, zBuf);

	8092 if( rc==SQLITE_OK ){

	8093 rc = sqlite3OsWrite(pPager->fd, zBuf, pPager->pageSize, 0);

	8094 pPager->aStat[PAGER_STAT_WRITE]++;

	8095 }

	8096 if( rc==SQLITE_OK ){

	8097 /* Update the pager's copy of the change-counter. Otherwise, the

	8098 ** next time a read transaction is opened the cache will be

	8099 ** flushed (as the change-counter values will not match). */

	8100 const void *pCopy = (const void *)&((const char *)zBuf)[24];

	8101 memcpy(&pPager->dbFileVers, pCopy, sizeof(pPager->dbFileVers));

	8102 pPager->changeCountDone = 1;

	8103 }

	8104 }else{

	8105 pPager->changeCountDone = 1;

	8106 }

	8107 }

	8108

	8109 /* Release the page reference. */

	8110 sqlite3PagerUnref(pPgHdr);

	8111 }

	8112 return rc;

	8113 }
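The header comment of pager_incr_changecounter() describes three fields on page 1: the change counter as a 4-byte big-endian integer at byte offset 24, a secondary copy at offset 92, and the library version number at offset 96. A minimal sketch of that update on a raw buffer follows. The helper names are hypothetical, and the sketch reads the old counter from the buffer itself; the real pager_write_changecounter() is not shown in this chunk and may take its starting value from elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit big-endian value. */
static uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Write a 32-bit big-endian value. */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Increment the change counter at offset 24, mirror it at offset 92
** and stamp versionNumber at offset 96. aPage1 must hold at least
** 100 bytes. Returns the new counter value. */
static uint32_t bump_change_counter(unsigned char *aPage1,
                                    uint32_t versionNumber){
  uint32_t n = get_be32(&aPage1[24]) + 1;
  put_be32(&aPage1[24], n);
  put_be32(&aPage1[92], n);
  put_be32(&aPage1[96], versionNumber);
  return n;
}
```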

	8114

	8115 /*

	8116 ** Sync the database file to disk. This is a no-op for in-memory databases

	8117 ** or pagers with the Pager.noSync flag set.

	8118 **

	8119 ** If successful, or if called on a pager for which it is a no-op, this

	8120 ** function returns SQLITE_OK. Otherwise, an IO error code is returned.

	8121 */

	8122 SQLITE_PRIVATE int sqlite3PagerSync(Pager *pPager, const char *zMaster){

	8123 int rc = SQLITE_OK;

	8124

	8125 if( isOpen(pPager->fd) ){

	8126 void *pArg = (void*)zMaster;

	8127 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_SYNC, pArg);

	8128 if( rc==SQLITE_NOTFOUND ) rc = SQLITE_OK;

	8129 }

	8130 if( rc==SQLITE_OK && !pPager->noSync ){

	8131 assert( !MEMDB );

	8132 rc = sqlite3OsSync(pPager->fd, pPager->syncFlags);

	8133 }

	8134 return rc;

	8135 }

	8136

	8137 /*

	8138 ** This function may only be called while a write-transaction is active in

	8139 ** rollback. If the connection is in WAL mode, this call is a no-op.

	8140 ** Otherwise, if the connection does not already have an EXCLUSIVE lock on

	8141 ** the database file, an attempt is made to obtain one.

	8142 **

	8143 ** If the EXCLUSIVE lock is already held or the attempt to obtain it is

	8144 ** successful, or the connection is in WAL mode, SQLITE_OK is returned.

	8145 ** Otherwise, either SQLITE_BUSY or an SQLITE_IOERR_XXX error code is

	8146 ** returned.

	8147 */

	8148 SQLITE_PRIVATE int sqlite3PagerExclusiveLock(Pager *pPager){

	8149 int rc = pPager->errCode;

	8150 assert( assert_pager_state(pPager) );

	8151 if( rc==SQLITE_OK ){

	8152 assert( pPager->eState==PAGER_WRITER_CACHEMOD

	8153 \|\| pPager->eState==PAGER_WRITER_DBMOD

	8154 \|\| pPager->eState==PAGER_WRITER_LOCKED

	8155 );

	8156 assert( assert_pager_state(pPager) );

	8157 if( 0==pagerUseWal(pPager) ){

	8158 rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);

	8159 }

	8160 }

	8161 return rc;

	8162 }

	8163

	8164 /*

	8165 ** Sync the database file for the pager pPager. zMaster points to the name

	8166 ** of a master journal file that should be written into the individual

	8167 ** journal file. zMaster may be NULL, which is interpreted as no master

	8168 ** journal (a single database transaction).

	8169 **

	8170 ** This routine ensures that:

	8171 **

	8172 ** * The database file change-counter is updated,

	8173 ** * the journal is synced (unless the atomic-write optimization is used),

	8174 ** * all dirty pages are written to the database file,

	8175 ** * the database file is truncated (if required), and

	8176 ** * the database file synced.

	8177 **

	8178 ** The only thing that remains to commit the transaction is to finalize

	8179 ** (delete, truncate or zero the first part of) the journal file (or

	8180 ** delete the master journal file if specified).

	8181 **

	8182 ** Note that if zMaster==NULL, this does not overwrite a previous value

	8183 ** passed to an sqlite3PagerCommitPhaseOne() call.

	8184 **

	8185 ** If the final parameter - noSync - is true, then the database file itself

	8186 ** is not synced. The caller must call sqlite3PagerSync() directly to

	8187 ** sync the database file before calling CommitPhaseTwo() to delete the

	8188 ** journal file in this case.

	8189 */

	8190 SQLITE_PRIVATE int sqlite3PagerCommitPhaseOne(

	8191 Pager *pPager, /* Pager object */

	8192 const char *zMaster, /* If not NULL, the master journal name */

	8193 int noSync /* True to omit the xSync on the db file */

	8194 ){

	8195 int rc = SQLITE_OK; /* Return code */

	8196

	8197 assert( pPager->eState==PAGER_WRITER_LOCKED

	8198 \|\| pPager->eState==PAGER_WRITER_CACHEMOD

	8199 \|\| pPager->eState==PAGER_WRITER_DBMOD

	8200 \|\| pPager->eState==PAGER_ERROR

	8201 );

	8202 assert( assert_pager_state(pPager) );

	8203

	8204 /* If a prior error occurred, report that error again. */

	8205 if( NEVER(pPager->errCode) ) return pPager->errCode;

	8206

	8207 /* Provide the ability to easily simulate an I/O error during testing */

	8208 if( sqlite3FaultSim(400) ) return SQLITE_IOERR;

	8209

	8210 PAGERTRACE(("DATABASE SYNC: File=%s zMaster=%s nSize=%d\n",

	8211 pPager->zFilename, zMaster, pPager->dbSize));

	8212

	8213 /* If no database changes have been made, return early. */

	8214 if( pPager->eState<PAGER_WRITER_CACHEMOD ) return SQLITE_OK;

	8215

	8216 assert( MEMDB==0 \|\| pPager->tempFile );

	8217 assert( isOpen(pPager->fd) \|\| pPager->tempFile );

	8218 if( 0==pagerFlushOnCommit(pPager, 1) ){

	8219 /* If this is an in-memory db, or no pages have been written to, or this

	8220 ** function has already been called, it is mostly a no-op. However, any

	8221 ** backup in progress needs to be restarted. */

	8222 sqlite3BackupRestart(pPager->pBackup);

	8223 }else{

	8224 if( pagerUseWal(pPager) ){

	8225 PgHdr *pList = sqlite3PcacheDirtyList(pPager->pPCache);

	8226 PgHdr *pPageOne = 0;

	8227 if( pList==0 ){

	8228 /* Must have at least one page for the WAL commit flag.

	8229 ** Ticket [2d1a5c67dfc2363e44f29d9bbd57f] 2011-05-18 */

	8230 rc = sqlite3PagerGet(pPager, 1, &pPageOne, 0);

	8231 pList = pPageOne;

	8232 pList->pDirty = 0;

	8233 }

	8234 assert( rc==SQLITE_OK );

	8235 if( ALWAYS(pList) ){

	8236 rc = pagerWalFrames(pPager, pList, pPager->dbSize, 1);

	8237 }

	8238 sqlite3PagerUnref(pPageOne);

	8239 if( rc==SQLITE_OK ){

	8240 sqlite3PcacheCleanAll(pPager->pPCache);

	8241 }

	8242 }else{

	8243 /* The following block updates the change-counter. Exactly how it

	8244 ** does this depends on whether or not the atomic-update optimization

	8245 ** was enabled at compile time, and if this transaction meets the

	8246 ** runtime criteria to use the operation:

	8247 **

	8248 ** * The file-system supports the atomic-write property for

	8249 ** blocks of size page-size, and

	8250 ** * This commit is not part of a multi-file transaction, and

	8251 ** * Exactly one page has been modified and stored in the journal file.

	8252 **

	8253 ** If the optimization was not enabled at compile time, then the

	8254 ** pager_incr_changecounter() function is called to update the change

	8255 ** counter in 'indirect-mode'. If the optimization is compiled in but

	8256 ** is not applicable to this transaction, call sqlite3JournalCreate()

	8257 ** to make sure the journal file has actually been created, then call

	8258 ** pager_incr_changecounter() to update the change-counter in indirect

	8259 ** mode.

	8260 **

	8261 ** Otherwise, if the optimization is both enabled and applicable,

	8262 ** then call pager_incr_changecounter() to update the change-counter

	8263 ** in 'direct' mode. In this case the journal file will never be

	8264 ** created for this transaction.

	8265 */

	8266 #ifdef SQLITE_ENABLE_ATOMIC_WRITE

	8267 PgHdr *pPg;

	8268 assert( isOpen(pPager->jfd)

	8269 \|\| pPager->journalMode==PAGER_JOURNALMODE_OFF

	8270 \|\| pPager->journalMode==PAGER_JOURNALMODE_WAL

	8271 );

	8272 if( !zMaster && isOpen(pPager->jfd)

	8273 && pPager->journalOff==jrnlBufferSize(pPager)

	8274 && pPager->dbSize>=pPager->dbOrigSize

	8275 && (0==(pPg = sqlite3PcacheDirtyList(pPager->pPCache)) \|\| 0==pPg->pDirty)

	8276 ){

	8277 /* Update the db file change counter via the direct-write method. The

	8278 ** following call will modify the in-memory representation of page 1

	8279 ** to include the updated change counter and then write page 1

	8280 ** directly to the database file. Because of the atomic-write

	8281 ** property of the host file-system, this is safe.

	8282 */

	8283 rc = pager_incr_changecounter(pPager, 1);

	8284 }else{

	8285 rc = sqlite3JournalCreate(pPager->jfd);

	8286 if( rc==SQLITE_OK ){

	8287 rc = pager_incr_changecounter(pPager, 0);

	8288 }

	8289 }

	8290 #else

	8291 rc = pager_incr_changecounter(pPager, 0);

	8292 #endif

	8293 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;

	8294

	8295 /* Write the master journal name into the journal file. If a master

	8296 ** journal file name has already been written to the journal file,

	8297 ** or if zMaster is NULL (no master journal), then this call is a no-op.

	8298 */

	8299 rc = writeMasterJournal(pPager, zMaster);

	8300 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;

	8301

	8302 /* Sync the journal file and write all dirty pages to the database.

	8303 ** If the atomic-update optimization is being used, this sync will not

	8304 ** create the journal file or perform any real IO.

	8305 **

	8306 ** Because the change-counter page was just modified, unless the

	8307 ** atomic-update optimization is used it is almost certain that the

	8308 ** journal requires a sync here. However, in locking_mode=exclusive

	8309 ** on a system under memory pressure it is just possible that this is

	8310 ** not the case. In this case it is likely enough that the redundant

	8311 ** xSync() call will be changed to a no-op by the OS anyhow.

	8312 */

	8313 rc = syncJournal(pPager, 0);

	8314 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;

	8315

	8316 rc = pager_write_pagelist(pPager,sqlite3PcacheDirtyList(pPager->pPCache));

	8317 if( rc!=SQLITE_OK ){

	8318 assert( rc!=SQLITE_IOERR_BLOCKED );

	8319 goto commit_phase_one_exit;

	8320 }

	8321 sqlite3PcacheCleanAll(pPager->pPCache);

	8322

	8323 /* If the file on disk is smaller than the database image, use

	8324 ** pager_truncate to grow the file here. This can happen if the database

	8325 ** image was extended as part of the current transaction and then the

	8326 ** last page in the db image moved to the free-list. In this case the

	8327 ** last page is never written out to disk, leaving the database file

	8328 ** undersized. Fix this now if it is the case. */

	8329 if( pPager->dbSize>pPager->dbFileSize ){

	8330 Pgno nNew = pPager->dbSize - (pPager->dbSize==PAGER_MJ_PGNO(pPager));

	8331 assert( pPager->eState==PAGER_WRITER_DBMOD );

	8332 rc = pager_truncate(pPager, nNew);

	8333 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;

	8334 }

	8335

	8336 /* Finally, sync the database file. */

	8337 if( !noSync ){

	8338 rc = sqlite3PagerSync(pPager, zMaster);

	8339 }

	8340 IOTRACE(("DBSYNC %p\n", pPager))

	8341 }

	8342 }

	8343

	8344 commit_phase_one_exit:

	8345 if( rc==SQLITE_OK && !pagerUseWal(pPager) ){

	8346 pPager->eState = PAGER_WRITER_FINISHED;

	8347 }

	8348 return rc;

	8349 }
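The expression `pPager->dbSize - (pPager->dbSize==PAGER_MJ_PGNO(pPager))` in the commit path above uses the 0/1 result of a comparison to step back one page when the last page of the image is the lock page, which is never written. The sketch below reproduces that arithmetic. It assumes the lock page is the page containing byte offset 0x40000000 (SQLite's default pending-byte location); TOY_PENDING_BYTE, lock_page_number and pages_to_grow_to are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Pgno;

#define TOY_PENDING_BYTE 0x40000000u  /* assumption: default lock-byte offset */

/* 1-based number of the page that contains the lock bytes. */
static Pgno lock_page_number(uint32_t pageSize){
  return (Pgno)(TOY_PENDING_BYTE/pageSize) + 1;
}

/* Size, in pages, that the file should be extended to when the
** database image is dbSize pages long. If the final page is the
** lock page, stop one page short. */
static Pgno pages_to_grow_to(Pgno dbSize, uint32_t pageSize){
  return dbSize - (Pgno)(dbSize==lock_page_number(pageSize));
}
```

With 4096-byte pages the lock page is page 262145, so an image of exactly that many pages is grown to only 262144 pages on disk.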

	8350

	8351

	8352 /*

	8353 ** When this function is called, the database file has been completely

	8354 ** updated to reflect the changes made by the current transaction and

	8355 ** synced to disk. The journal file still exists in the file-system

	8356 ** though, and if a failure occurs at this point it will eventually

	8357 ** be used as a hot-journal and the current transaction rolled back.

	8358 **

	8359 ** This function finalizes the journal file, either by deleting,

	8360 ** truncating or partially zeroing it, so that it cannot be used

	8361 ** for hot-journal rollback. Once this is done the transaction is

	8362 ** irrevocably committed.

	8363 **

	8364 ** If an error occurs, an IO error code is returned and the pager

	8365 ** moves into the error state. Otherwise, SQLITE_OK is returned.

	8366 */

	8367 SQLITE_PRIVATE int sqlite3PagerCommitPhaseTwo(Pager *pPager){

	8368 int rc = SQLITE_OK; /* Return code */

	8369

	8370 /* This routine should not be called if a prior error has occurred.

	8371 ** But if (due to a coding error elsewhere in the system) it does get

	8372 ** called, just return the same error code without doing anything. */

	8373 if( NEVER(pPager->errCode) ) return pPager->errCode;

	8374

	8375 assert( pPager->eState==PAGER_WRITER_LOCKED

	8376 \|\| pPager->eState==PAGER_WRITER_FINISHED

	8377 \|\| (pagerUseWal(pPager) && pPager->eState==PAGER_WRITER_CACHEMOD)

	8378 );

	8379 assert( assert_pager_state(pPager) );

	8380

	8381 /* An optimization. If the database was not actually modified during

	8382 ** this transaction, the pager is running in exclusive-mode and is

	8383 ** using persistent journals, then this function is a no-op.

	8384 **

	8385 ** The start of the journal file currently contains a single journal

	8386 ** header with the nRec field set to 0. If such a journal is used as

	8387 ** a hot-journal during hot-journal rollback, 0 changes will be made

	8388 ** to the database file. So there is no need to zero the journal

	8389 ** header. Since the pager is in exclusive mode, there is no need

	8390 ** to drop any locks either.

	8391 */

	8392 if( pPager->eState==PAGER_WRITER_LOCKED

	8393 && pPager->exclusiveMode

	8394 && pPager->journalMode==PAGER_JOURNALMODE_PERSIST

	8395 ){

	8396 assert( pPager->journalOff==JOURNAL_HDR_SZ(pPager) \|\| !pPager->journalOff );

	8397 pPager->eState = PAGER_READER;

	8398 return SQLITE_OK;

	8399 }

	8400

	8401 PAGERTRACE(("COMMIT %d\n", PAGERID(pPager)));

	8402 pPager->iDataVersion++;

	8403 rc = pager_end_transaction(pPager, pPager->setMaster, 1);

	8404 return pager_error(pPager, rc);

	8405 }

	8406

	8407 /*

	8408 ** If a write transaction is open, then all changes made within the

	8409 ** transaction are reverted and the current write-transaction is closed.

	8410 ** The pager falls back to PAGER_READER state if successful, or PAGER_ERROR

	8411 ** state if an error occurs.

	8412 **

	8413 ** If the pager is already in PAGER_ERROR state when this function is called,

	8414 ** it returns Pager.errCode immediately. No work is performed in this case.

	8415 **

	8416 ** Otherwise, in rollback mode, this function performs two functions:

	8417 **

	8418 ** 1) It rolls back the journal file, restoring all database file and

	8419 ** in-memory cache pages to the state they were in when the transaction

	8420 ** was opened, and

	8421 **

	8422 ** 2) It finalizes the journal file, so that it is not used for hot

	8423 ** rollback at any point in the future.

	8424 **

	8425 ** Finalization of the journal file (task 2) is only performed if the

	8426 ** rollback is successful.

	8427 **

	8428 ** In WAL mode, all cache-entries containing data modified within the

	8429 ** current transaction are either expelled from the cache or reverted to

	8430 ** their pre-transaction state by re-reading data from the database or

	8431 ** WAL files. The WAL transaction is then closed.

	8432 */

	8433 SQLITE_PRIVATE int sqlite3PagerRollback(Pager *pPager){

	8434 int rc = SQLITE_OK; /* Return code */

	8435 PAGERTRACE(("ROLLBACK %d\n", PAGERID(pPager)));

	8436

	8437 /* PagerRollback() is a no-op if called in READER or OPEN state. If

	8438 ** the pager is already in the ERROR state, the rollback is not

	8439 ** attempted here. Instead, the error code is returned to the caller.

	8440 */

	8441 assert( assert_pager_state(pPager) );

	8442 if( pPager->eState==PAGER_ERROR ) return pPager->errCode;

	8443 if( pPager->eState<=PAGER_READER ) return SQLITE_OK;

	8444

	8445 if( pagerUseWal(pPager) ){

	8446 int rc2;

	8447 rc = sqlite3PagerSavepoint(pPager, SAVEPOINT_ROLLBACK, -1);

	8448 rc2 = pager_end_transaction(pPager, pPager->setMaster, 0);

	8449 if( rc==SQLITE_OK ) rc = rc2;

	8450 }else if( !isOpen(pPager->jfd) || pPager->eState==PAGER_WRITER_LOCKED ){

	8451 int eState = pPager->eState;

	8452 rc = pager_end_transaction(pPager, 0, 0);

	8453 if( !MEMDB && eState>PAGER_WRITER_LOCKED ){

	8454 /* This can happen using journal_mode=off. Move the pager to the error

	8455 ** state to indicate that the contents of the cache may not be trusted.

	8456 ** Any active readers will get SQLITE_ABORT.

	8457 */

	8458 pPager->errCode = SQLITE_ABORT;

	8459 pPager->eState = PAGER_ERROR;

	8460 setGetterMethod(pPager);

	8461 return rc;

	8462 }

	8463 }else{

	8464 rc = pager_playback(pPager, 0);

	8465 }

	8466

	8467 assert( pPager->eState==PAGER_READER || rc!=SQLITE_OK );

	8468 assert( rc==SQLITE_OK || rc==SQLITE_FULL || rc==SQLITE_CORRUPT

	8469 || rc==SQLITE_NOMEM || (rc&0xFF)==SQLITE_IOERR

	8470 || rc==SQLITE_CANTOPEN

	8471 );

	8472

	8473 /* If an error occurs during a ROLLBACK, we can no longer trust the pager

	8474 ** cache. So call pager_error() on the way out to make any error persistent.

	8475 */

	8476 return pager_error(pPager, rc);

	8477 }

	8478

	8479 /*

	8480 ** Return TRUE if the database file is opened read-only. Return FALSE

	8481 ** if the database is (in theory) writable.

	8482 */

	8483 SQLITE_PRIVATE u8 sqlite3PagerIsreadonly(Pager *pPager){

	8484 return pPager->readOnly;

	8485 }

	8486

	8487 #ifdef SQLITE_DEBUG

	8488 /*

	8489 ** Return the sum of the reference counts for all pages held by pPager.

	8490 */

	8491 SQLITE_PRIVATE int sqlite3PagerRefcount(Pager *pPager){

	8492 return sqlite3PcacheRefCount(pPager->pPCache);

	8493 }

	8494 #endif

	8495

	8496 /*

	8497 ** Return the approximate number of bytes of memory currently

	8498 ** used by the pager and its associated cache.

	8499 */

	8500 SQLITE_PRIVATE int sqlite3PagerMemUsed(Pager *pPager){

	8501 int perPageSize = pPager->pageSize + pPager->nExtra + sizeof(PgHdr)

	8502 + 5*sizeof(void*);

	8503 return perPageSize*sqlite3PcachePagecount(pPager->pPCache)

	8504 + sqlite3MallocSize(pPager)

	8505 + pPager->pageSize;

	8506 }

	8507

	8508 /*

	8509 ** Return the number of references to the specified page.

	8510 */

	8511 SQLITE_PRIVATE int sqlite3PagerPageRefcount(DbPage *pPage){

	8512 return sqlite3PcachePageRefcount(pPage);

	8513 }

	8514

	8515 #ifdef SQLITE_TEST

	8516 /*

	8517 ** This routine is used for testing and analysis only.

	8518 */

	8519 SQLITE_PRIVATE int *sqlite3PagerStats(Pager *pPager){

	8520 static int a[11];

	8521 a[0] = sqlite3PcacheRefCount(pPager->pPCache);

	8522 a[1] = sqlite3PcachePagecount(pPager->pPCache);

	8523 a[2] = sqlite3PcacheGetCachesize(pPager->pPCache);

	8524 a[3] = pPager->eState==PAGER_OPEN ? -1 : (int) pPager->dbSize;

	8525 a[4] = pPager->eState;

	8526 a[5] = pPager->errCode;

	8527 a[6] = pPager->aStat[PAGER_STAT_HIT];

	8528 a[7] = pPager->aStat[PAGER_STAT_MISS];

	8529 a[8] = 0; /* Used to be pPager->nOvfl */

	8530 a[9] = pPager->nRead;

	8531 a[10] = pPager->aStat[PAGER_STAT_WRITE];

	8532 return a;

	8533 }

	8534 #endif

	8535

	8536 /*

	8537 ** Parameter eStat must be one of SQLITE_DBSTATUS_CACHE_HIT,

	8538 ** SQLITE_DBSTATUS_CACHE_MISS or SQLITE_DBSTATUS_CACHE_WRITE. Before

	8539 ** returning, *pnVal is incremented by the current value of the counter

	8540 ** selected by eStat. If the reset parameter is non-zero, that counter is

	8541 ** zeroed before returning.

	8542 */

	8543 SQLITE_PRIVATE void sqlite3PagerCacheStat(Pager *pPager, int eStat, int reset, int *pnVal){

	8544

	8545 assert( eStat==SQLITE_DBSTATUS_CACHE_HIT

	8546 || eStat==SQLITE_DBSTATUS_CACHE_MISS

	8547 || eStat==SQLITE_DBSTATUS_CACHE_WRITE

	8548 );

	8549

	8550 assert( SQLITE_DBSTATUS_CACHE_HIT+1==SQLITE_DBSTATUS_CACHE_MISS );

	8551 assert( SQLITE_DBSTATUS_CACHE_HIT+2==SQLITE_DBSTATUS_CACHE_WRITE );

	8552 assert( PAGER_STAT_HIT==0 && PAGER_STAT_MISS==1 && PAGER_STAT_WRITE==2 );

	8553

	8554 *pnVal += pPager->aStat[eStat - SQLITE_DBSTATUS_CACHE_HIT];

	8555 if( reset ){

	8556 pPager->aStat[eStat - SQLITE_DBSTATUS_CACHE_HIT] = 0;

	8557 }

	8558 }
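
The counter lookup above relies on the three status codes being consecutive, so that subtracting the first code gives an array index. A minimal sketch of that idea follows; the `Demo*` names and the numeric values are hypothetical stand-ins, not the real SQLite definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the status codes; only their consecutive
** ordering matters for this sketch. */
enum { DEMO_STATUS_HIT = 7, DEMO_STATUS_MISS = 8, DEMO_STATUS_WRITE = 9 };

/* Hypothetical counter block, indexed 0..2 in the same way as aStat[]. */
typedef struct DemoPager { int aStat[3]; } DemoPager;

/* Add the selected counter to *pnVal, optionally zeroing it afterwards. */
static void demoCacheStat(DemoPager *p, int eStat, int reset, int *pnVal){
  int idx = eStat - DEMO_STATUS_HIT;   /* map status code to array slot */
  assert( idx>=0 && idx<3 );
  *pnVal += p->aStat[idx];
  if( reset ) p->aStat[idx] = 0;
}
```

Note that the value is accumulated into `*pnVal` rather than assigned, so a caller can sum the same counter across several pagers by passing the same output variable each time.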

	8559

	8560 /*

	8561 ** Return true if this is an in-memory or temp-file backed pager.

	8562 */

	8563 SQLITE_PRIVATE int sqlite3PagerIsMemdb(Pager *pPager){

	8564 return pPager->tempFile;

	8565 }

	8566

	8567 /*

	8568 ** Check that there are at least nSavepoint savepoints open. If there are

	8569 ** currently less than nSavepoints open, then open one or more savepoints

	8570 ** to make up the difference. If the number of savepoints is already

	8571 ** equal to nSavepoint, then this function is a no-op.

	8572 **

	8573 ** If a memory allocation fails, SQLITE_NOMEM is returned. If an error

	8574 ** occurs while opening the sub-journal file, then an IO error code is

	8575 ** returned. Otherwise, SQLITE_OK.

	8576 */

	8577 static SQLITE_NOINLINE int pagerOpenSavepoint(Pager *pPager, int nSavepoint){

	8578 int rc = SQLITE_OK; /* Return code */

	8579 int nCurrent = pPager->nSavepoint; /* Current number of savepoints */

	8580 int ii; /* Iterator variable */

	8581 PagerSavepoint *aNew; /* New Pager.aSavepoint array */

	8582

	8583 assert( pPager->eState>=PAGER_WRITER_LOCKED );

	8584 assert( assert_pager_state(pPager) );

	8585 assert( nSavepoint>nCurrent && pPager->useJournal );

	8586

	8587 /* Grow the Pager.aSavepoint array using realloc(). Return SQLITE_NOMEM

	8588 ** if the allocation fails. Otherwise, zero the new portion in case a

	8589 ** malloc failure occurs while populating it in the for(...) loop below.

	8590 */

	8591 aNew = (PagerSavepoint *)sqlite3Realloc(

	8592 pPager->aSavepoint, sizeof(PagerSavepoint)*nSavepoint

	8593 );

	8594 if( !aNew ){

	8595 return SQLITE_NOMEM_BKPT;

	8596 }

	8597 memset(&aNew[nCurrent], 0, (nSavepoint-nCurrent) * sizeof(PagerSavepoint));

	8598 pPager->aSavepoint = aNew;

	8599

	8600 /* Populate the PagerSavepoint structures just allocated. */

	8601 for(ii=nCurrent; ii<nSavepoint; ii++){

	8602 aNew[ii].nOrig = pPager->dbSize;

	8603 if( isOpen(pPager->jfd) && pPager->journalOff>0 ){

	8604 aNew[ii].iOffset = pPager->journalOff;

	8605 }else{

	8606 aNew[ii].iOffset = JOURNAL_HDR_SZ(pPager);

	8607 }

	8608 aNew[ii].iSubRec = pPager->nSubRec;

	8609 aNew[ii].pInSavepoint = sqlite3BitvecCreate(pPager->dbSize);

	8610 if( !aNew[ii].pInSavepoint ){

	8611 return SQLITE_NOMEM_BKPT;

	8612 }

	8613 if( pagerUseWal(pPager) ){

	8614 sqlite3WalSavepoint(pPager->pWal, aNew[ii].aWalData);

	8615 }

	8616 pPager->nSavepoint = ii+1;

	8617 }

	8618 assert( pPager->nSavepoint==nSavepoint );

	8619 assertTruncateConstraint(pPager);

	8620 return rc;

	8621 }

	8622 SQLITE_PRIVATE int sqlite3PagerOpenSavepoint(Pager *pPager, int nSavepoint){

	8623 assert( pPager->eState>=PAGER_WRITER_LOCKED );

	8624 assert( assert_pager_state(pPager) );

	8625

	8626 if( nSavepoint>pPager->nSavepoint && pPager->useJournal ){

	8627 return pagerOpenSavepoint(pPager, nSavepoint);

	8628 }else{

	8629 return SQLITE_OK;

	8630 }

	8631 }
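
The grow-then-zero step in pagerOpenSavepoint() can be sketched in isolation. This version uses the standard `realloc()` in place of sqlite3Realloc(), and `DemoSavepoint` and `demoGrowSavepoints` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified savepoint record. */
typedef struct DemoSavepoint { int nOrig; void *pBitvec; } DemoSavepoint;

/* Grow *paSlot from nCurrent to nWanted entries and zero-fill the new tail.
** Returns 0 on success, or -1 if the allocation fails, in which case the
** old array is left untouched and still owned by the caller. */
static int demoGrowSavepoints(DemoSavepoint **paSlot, int nCurrent, int nWanted){
  DemoSavepoint *aNew;
  assert( nWanted>nCurrent && nCurrent>=0 );
  aNew = realloc(*paSlot, sizeof(DemoSavepoint)*(size_t)nWanted);
  if( aNew==0 ) return -1;
  /* Zero only the entries that did not exist before the call. */
  memset(&aNew[nCurrent], 0, (size_t)(nWanted-nCurrent)*sizeof(DemoSavepoint));
  *paSlot = aNew;
  return 0;
}
```

Zeroing the tail before populating it means that if a later per-entry allocation fails, cleanup code can walk the array and safely skip entries whose pointers are still null.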

	8632

	8633

	8634 /*

	8635 ** This function is called to rollback or release (commit) a savepoint.

	8636 ** The savepoint to release or rollback need not be the most recently

	8637 ** created savepoint.

	8638 **

	8639 ** Parameter op is always either SAVEPOINT_ROLLBACK or SAVEPOINT_RELEASE.

	8640 ** If it is SAVEPOINT_RELEASE, then release and destroy the savepoint with

	8641 ** index iSavepoint. If it is SAVEPOINT_ROLLBACK, then rollback all changes

	8642 ** that have occurred since the specified savepoint was created.

	8643 **

	8644 ** The savepoint to rollback or release is identified by parameter

	8645 ** iSavepoint. A value of 0 means to operate on the outermost savepoint

	8646 ** (the first created). A value of (Pager.nSavepoint-1) means operate

	8647 ** on the most recently created savepoint. If iSavepoint is greater than

	8648 ** (Pager.nSavepoint-1), then this function is a no-op.

	8649 **

	8650 ** If a negative value is passed to this function, then the current

	8651 ** transaction is rolled back. This is different to calling

	8652 ** sqlite3PagerRollback() because this function does not terminate

	8653 ** the transaction or unlock the database, it just restores the

	8654 ** contents of the database to its original state.

	8655 **

	8656 ** In any case, all savepoints with an index greater than iSavepoint

	8657 ** are destroyed. If this is a release operation (op==SAVEPOINT_RELEASE),

	8658 ** then savepoint iSavepoint is also destroyed.

	8659 **

	8660 ** This function may return SQLITE_NOMEM if a memory allocation fails,

	8661 ** or an IO error code if an IO error occurs while rolling back a

	8662 ** savepoint. If no errors occur, SQLITE_OK is returned.

	8663 */

	8664 SQLITE_PRIVATE int sqlite3PagerSavepoint(Pager *pPager, int op, int iSavepoint){

	8665 int rc = pPager->errCode;

	8666

	8667 #ifdef SQLITE_ENABLE_ZIPVFS

	8668 if( op==SAVEPOINT_RELEASE ) rc = SQLITE_OK;

	8669 #endif

	8670

	8671 assert( op==SAVEPOINT_RELEASE || op==SAVEPOINT_ROLLBACK );

	8672 assert( iSavepoint>=0 || op==SAVEPOINT_ROLLBACK );

	8673

	8674 if( rc==SQLITE_OK && iSavepoint<pPager->nSavepoint ){

	8675 int ii; /* Iterator variable */

	8676 int nNew; /* Number of remaining savepoints after this op. */

	8677

	8678 /* Figure out how many savepoints will still be active after this

	8679 ** operation. Store this value in nNew. Then free resources associated

	8680 ** with any savepoints that are destroyed by this operation.

	8681 */

	8682 nNew = iSavepoint + (( op==SAVEPOINT_RELEASE ) ? 0 : 1);

	8683 for(ii=nNew; ii<pPager->nSavepoint; ii++){

	8684 sqlite3BitvecDestroy(pPager->aSavepoint[ii].pInSavepoint);

	8685 }

	8686 pPager->nSavepoint = nNew;

	8687

	8688 /* If this is a release of the outermost savepoint, truncate

	8689 ** the sub-journal to zero bytes in size. */

	8690 if( op==SAVEPOINT_RELEASE ){

	8691 if( nNew==0 && isOpen(pPager->sjfd) ){

	8692 /* Only truncate if it is an in-memory sub-journal. */

	8693 if( sqlite3JournalIsInMemory(pPager->sjfd) ){

	8694 rc = sqlite3OsTruncate(pPager->sjfd, 0);

	8695 assert( rc==SQLITE_OK );

	8696 }

	8697 pPager->nSubRec = 0;

	8698 }

	8699 }

	8700 /* Else this is a rollback operation, playback the specified savepoint.

	8701 ** If this is a temp-file, it is possible that the journal file has

	8702 ** not yet been opened. In this case there have been no changes to

	8703 ** the database file, so the playback operation can be skipped.

	8704 */

	8705 else if( pagerUseWal(pPager) || isOpen(pPager->jfd) ){

	8706 PagerSavepoint *pSavepoint = (nNew==0)?0:&pPager->aSavepoint[nNew-1];

	8707 rc = pagerPlaybackSavepoint(pPager, pSavepoint);

	8708 assert(rc!=SQLITE_DONE);

	8709 }

	8710

	8711 #ifdef SQLITE_ENABLE_ZIPVFS

	8712 /* If the cache has been modified but the savepoint cannot be rolled

	8713 ** back journal_mode=off, put the pager in the error state. This way,

	8714 ** if the VFS used by this pager includes ZipVFS, the entire transaction

	8715 ** can be rolled back at the ZipVFS level. */

	8716 else if(

	8717 pPager->journalMode==PAGER_JOURNALMODE_OFF

	8718 && pPager->eState>=PAGER_WRITER_CACHEMOD

	8719 ){

	8720 pPager->errCode = SQLITE_ABORT;

	8721 pPager->eState = PAGER_ERROR;

	8722 setGetterMethod(pPager);

	8723 }

	8724 #endif

	8725 }

	8726

	8727 return rc;

	8728 }
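
The savepoint bookkeeping above reduces to one line of arithmetic: a release destroys savepoint iSavepoint and everything after it, while a rollback keeps savepoint iSavepoint and destroys only the later ones. A small sketch, with hypothetical `DEMO_*` op codes and function name:

```c
#include <assert.h>

enum { DEMO_RELEASE = 1, DEMO_ROLLBACK = 2 };  /* hypothetical op codes */

/* Return how many savepoints remain open after applying op to the
** savepoint with index iSavepoint, given nOpen savepoints currently open.
** A negative iSavepoint is only meaningful for a rollback. */
static int demoRemainingSavepoints(int nOpen, int op, int iSavepoint){
  assert( op==DEMO_RELEASE || op==DEMO_ROLLBACK );
  assert( iSavepoint>=0 || op==DEMO_ROLLBACK );
  if( iSavepoint>=nOpen ) return nOpen;   /* out of range: no-op */
  return iSavepoint + (op==DEMO_RELEASE ? 0 : 1);
}
```

With three savepoints open, releasing index 1 leaves one savepoint, rolling back to index 1 leaves two, and a rollback with index -1 leaves none.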

	8729

	8730 /*

	8731 ** Return the full pathname of the database file.

	8732 **

	8733 ** Except, if the pager is in-memory only, then return an empty string if

	8734 ** nullIfMemDb is true. This routine is called with nullIfMemDb==1 when

	8735 ** used to report the filename to the user, for compatibility with legacy

	8736 ** behavior. But when the Btree needs to know the filename for matching to

	8737 ** shared cache, it uses nullIfMemDb==0 so that in-memory databases can

	8738 ** participate in shared-cache.

	8739 */

	8740 SQLITE_PRIVATE const char *sqlite3PagerFilename(Pager *pPager, int nullIfMemDb){

	8741 return (nullIfMemDb && pPager->memDb) ? "" : pPager->zFilename;

	8742 }

	8743

	8744 /*

	8745 ** Return the VFS structure for the pager.

	8746 */

	8747 SQLITE_PRIVATE sqlite3_vfs *sqlite3PagerVfs(Pager *pPager){

	8748 return pPager->pVfs;

	8749 }

	8750

	8751 /*

	8752 ** Return the file handle for the database file associated

	8753 ** with the pager. This might return NULL if the file has

	8754 ** not yet been opened.

	8755 */

	8756 SQLITE_PRIVATE sqlite3_file *sqlite3PagerFile(Pager *pPager){

	8757 return pPager->fd;

	8758 }

	8759

	8760 /*

	8761 ** Return the file handle for the journal file (if it exists).

	8762 ** This will be either the rollback journal or the WAL file.

	8763 */

	8764 SQLITE_PRIVATE sqlite3_file *sqlite3PagerJrnlFile(Pager *pPager){

	8765 #if SQLITE_OMIT_WAL

	8766 return pPager->jfd;

	8767 #else

	8768 return pPager->pWal ? sqlite3WalFile(pPager->pWal) : pPager->jfd;

	8769 #endif

	8770 }

	8771

	8772 /*

	8773 ** Return the full pathname of the journal file.

	8774 */

	8775 SQLITE_PRIVATE const char *sqlite3PagerJournalname(Pager *pPager){

	8776 return pPager->zJournal;

	8777 }

	8778

	8779 #ifdef SQLITE_HAS_CODEC

	8780 /*

	8781 ** Set or retrieve the codec for this pager

	8782 */

	8783 SQLITE_PRIVATE void sqlite3PagerSetCodec(

	8784 Pager *pPager,

	8785 void *(*xCodec)(void*,void*,Pgno,int),

	8786 void (*xCodecSizeChng)(void*,int,int),

	8787 void (*xCodecFree)(void*),

	8788 void *pCodec

	8789 ){

	8790 if( pPager->xCodecFree ) pPager->xCodecFree(pPager->pCodec);

	8791 pPager->xCodec = pPager->memDb ? 0 : xCodec;

	8792 pPager->xCodecSizeChng = xCodecSizeChng;

	8793 pPager->xCodecFree = xCodecFree;

	8794 pPager->pCodec = pCodec;

	8795 setGetterMethod(pPager);

	8796 pagerReportSize(pPager);

	8797 }

	8798 SQLITE_PRIVATE void *sqlite3PagerGetCodec(Pager *pPager){

	8799 return pPager->pCodec;

	8800 }

	8801

	8802 /*

	8803 ** This function is called by the wal module when writing page content

	8804 ** into the log file.

	8805 **

	8806 ** This function returns a pointer to a buffer containing the encrypted

	8807 ** page content. If a malloc fails, this function may return NULL.

	8808 */

	8809 SQLITE_PRIVATE void *sqlite3PagerCodec(PgHdr *pPg){

	8810 void *aData = 0;

	8811 CODEC2(pPg->pPager, pPg->pData, pPg->pgno, 6, return 0, aData);

	8812 return aData;

	8813 }

	8814

	8815 /*

	8816 ** Return the current pager state

	8817 */

	8818 SQLITE_PRIVATE int sqlite3PagerState(Pager *pPager){

	8819 return pPager->eState;

	8820 }

	8821 #endif /* SQLITE_HAS_CODEC */

	8822

	8823 #ifndef SQLITE_OMIT_AUTOVACUUM

	8824 /*

	8825 ** Move the page pPg to location pgno in the file.

	8826 **

	8827 ** There must be no references to the page previously located at

	8828 ** pgno (which we call pPgOld) though that page is allowed to be

	8829 ** in cache. If the page previously located at pgno is not already

	8830 ** in the rollback journal, it is not put there by this routine.

	8831 **

	8832 ** References to the page pPg remain valid. Updating any

	8833 ** meta-data associated with pPg (i.e. data stored in the nExtra bytes

	8834 ** allocated along with the page) is the responsibility of the caller.

	8835 **

	8836 ** A transaction must be active when this routine is called. It used to be

	8837 ** required that a statement transaction was not active, but this restriction

	8838 ** has been removed (CREATE INDEX needs to move a page when a statement

	8839 ** transaction is active).

	8840 **

	8841 ** If the fourth argument, isCommit, is non-zero, then this page is being

	8842 ** moved as part of a database reorganization just before the transaction

	8843 ** is being committed. In this case, it is guaranteed that the database page

	8844 ** pPg refers to will not be written to again within this transaction.

	8845 **

	8846 ** This function may return SQLITE_NOMEM or an IO error code if an error

	8847 ** occurs. Otherwise, it returns SQLITE_OK.

	8848 */

	8849 SQLITE_PRIVATE int sqlite3PagerMovepage(Pager *pPager, DbPage *pPg, Pgno pgno, int isCommit){

	8850 PgHdr *pPgOld; /* The page being overwritten. */

	8851 Pgno needSyncPgno = 0; /* Old value of pPg->pgno, if sync is required */

	8852 int rc; /* Return code */

	8853 Pgno origPgno; /* The original page number */

	8854

	8855 assert( pPg->nRef>0 );

	8856 assert( pPager->eState==PAGER_WRITER_CACHEMOD

	8857 || pPager->eState==PAGER_WRITER_DBMOD

	8858 );

	8859 assert( assert_pager_state(pPager) );

	8860

	8861 /* In order to be able to rollback, an in-memory database must journal

	8862 ** the page we are moving from.

	8863 */

	8864 assert( pPager->tempFile || !MEMDB );

	8865 if( pPager->tempFile ){

	8866 rc = sqlite3PagerWrite(pPg);

	8867 if( rc ) return rc;

	8868 }

	8869

	8870 /* If the page being moved is dirty and has not been saved by the latest

	8871 ** savepoint, then save the current contents of the page into the

	8872 ** sub-journal now. This is required to handle the following scenario:

	8873 **

	8874 ** BEGIN;

	8875 ** <journal page X, then modify it in memory>

	8876 ** SAVEPOINT one;

	8877 ** <Move page X to location Y>

	8878 ** ROLLBACK TO one;

	8879 **

	8880 ** If page X were not written to the sub-journal here, it would not

	8881 ** be possible to restore its contents when the "ROLLBACK TO one"

	8882 ** statement is processed.

	8883 **

	8884 ** subjournalPage() may need to allocate space to store pPg->pgno into

	8885 ** one or more savepoint bitvecs. This is the reason this function

	8886 ** may return SQLITE_NOMEM.

	8887 */

	8888 if( (pPg->flags & PGHDR_DIRTY)!=0

	8889 && SQLITE_OK!=(rc = subjournalPageIfRequired(pPg))

	8890 ){

	8891 return rc;

	8892 }

	8893

	8894 PAGERTRACE(("MOVE %d page %d (needSync=%d) moves to %d\n",

	8895 PAGERID(pPager), pPg->pgno, (pPg->flags&PGHDR_NEED_SYNC)?1:0, pgno));

	8896 IOTRACE(("MOVE %p %d %d\n", pPager, pPg->pgno, pgno))

	8897

	8898 /* If the journal needs to be sync()ed before page pPg->pgno can

	8899 ** be written to, store pPg->pgno in local variable needSyncPgno.

	8900 **

	8901 ** If the isCommit flag is set, there is no need to remember that

	8902 ** the journal needs to be sync()ed before database page pPg->pgno

	8903 ** can be written to. The caller has already promised not to write to it.

	8904 */

	8905 if( (pPg->flags&PGHDR_NEED_SYNC) && !isCommit ){

	8906 needSyncPgno = pPg->pgno;

	8907 assert( pPager->journalMode==PAGER_JOURNALMODE_OFF ||

	8908 pageInJournal(pPager, pPg) || pPg->pgno>pPager->dbOrigSize );

	8909 assert( pPg->flags&PGHDR_DIRTY );

	8910 }

	8911

	8912 /* If the cache contains a page with page-number pgno, remove it

	8913 ** from its hash chain. Also, if the PGHDR_NEED_SYNC flag was set for

	8914 ** page pgno before the 'move' operation, it needs to be retained

	8915 ** for the page moved there.

	8916 */

	8917 pPg->flags &= ~PGHDR_NEED_SYNC;

	8918 pPgOld = sqlite3PagerLookup(pPager, pgno);

	8919 assert( !pPgOld || pPgOld->nRef==1 );

	8920 if( pPgOld ){

	8921 pPg->flags |= (pPgOld->flags&PGHDR_NEED_SYNC);

	8922 if( pPager->tempFile ){

	8923 /* Do not discard pages from an in-memory database since we might

	8924 ** need to rollback later. Just move the page out of the way. */

	8925 sqlite3PcacheMove(pPgOld, pPager->dbSize+1);

	8926 }else{

	8927 sqlite3PcacheDrop(pPgOld);

	8928 }

	8929 }

	8930

	8931 origPgno = pPg->pgno;

	8932 sqlite3PcacheMove(pPg, pgno);

	8933 sqlite3PcacheMakeDirty(pPg);

	8934

	8935 /* For an in-memory database, make sure the original page continues

	8936 ** to exist, in case the transaction needs to roll back. Use pPgOld

	8937 ** as the original page since it has already been allocated.

	8938 */

	8939 if( pPager->tempFile && pPgOld ){

	8940 sqlite3PcacheMove(pPgOld, origPgno);

	8941 sqlite3PagerUnrefNotNull(pPgOld);

	8942 }

	8943

	8944 if( needSyncPgno ){

	8945 /* If needSyncPgno is non-zero, then the journal file needs to be

	8946 ** sync()ed before any data is written to database file page needSyncPgno.

	8947 ** Currently, no such page exists in the page-cache and the

	8948 ** "is journaled" bitvec flag has been set. This needs to be remedied by

	8949 ** loading the page into the pager-cache and setting the PGHDR_NEED_SYNC

	8950 ** flag.

	8951 **

	8952 ** If the attempt to load the page into the page-cache fails, (due

	8953 ** to a malloc() or IO failure), clear the bit in the pInJournal[]

	8954 ** array. Otherwise, if the page is loaded and written again in

	8955 ** this transaction, it may be written to the database file before

	8956 ** it is synced into the journal file. This way, it may end up in

	8957 ** the journal file twice, but that is not a problem.

	8958 */

	8959 PgHdr *pPgHdr;

	8960 rc = sqlite3PagerGet(pPager, needSyncPgno, &pPgHdr, 0);

	8961 if( rc!=SQLITE_OK ){

	8962 if( needSyncPgno<=pPager->dbOrigSize ){

	8963 assert( pPager->pTmpSpace!=0 );

	8964 sqlite3BitvecClear(pPager->pInJournal, needSyncPgno, pPager->pTmpSpace);

	8965 }

	8966 return rc;

	8967 }

	8968 pPgHdr->flags |= PGHDR_NEED_SYNC;

	8969 sqlite3PcacheMakeDirty(pPgHdr);

	8970 sqlite3PagerUnrefNotNull(pPgHdr);

	8971 }

	8972

	8973 return SQLITE_OK;

	8974 }

	8975 #endif

	8976

	8977 /*

	8978 ** The page handle passed as the first argument refers to a dirty page

	8979 ** with a page number other than iNew. This function changes the page's

	8980 ** page number to iNew and sets the value of the PgHdr.flags field to

	8981 ** the value passed as the third parameter.

	8982 */

	8983 SQLITE_PRIVATE void sqlite3PagerRekey(DbPage *pPg, Pgno iNew, u16 flags){

	8984 assert( pPg->pgno!=iNew );

	8985 pPg->flags = flags;

	8986 sqlite3PcacheMove(pPg, iNew);

	8987 }

	8988

	8989 /*

	8990 ** Return a pointer to the data for the specified page.

	8991 */

	8992 SQLITE_PRIVATE void *sqlite3PagerGetData(DbPage *pPg){

	8993 assert( pPg->nRef>0 || pPg->pPager->memDb );

	8994 return pPg->pData;

	8995 }

	8996

	8997 /*

	8998 ** Return a pointer to the Pager.nExtra bytes of "extra" space

	8999 ** allocated along with the specified page.

	9000 */

	9001 SQLITE_PRIVATE void *sqlite3PagerGetExtra(DbPage *pPg){

	9002 return pPg->pExtra;

	9003 }

	9004

	9005 /*

	9006 ** Get/set the locking-mode for this pager. Parameter eMode must be one

	9007 ** of PAGER_LOCKINGMODE_QUERY, PAGER_LOCKINGMODE_NORMAL or

	9008 ** PAGER_LOCKINGMODE_EXCLUSIVE. If the parameter is not _QUERY, then

	9009 ** the locking-mode is set to the value specified.

	9010 **

	9011 ** The returned value is either PAGER_LOCKINGMODE_NORMAL or

	9012 ** PAGER_LOCKINGMODE_EXCLUSIVE, indicating the current (possibly updated)

	9013 ** locking-mode.

	9014 */

	9015 SQLITE_PRIVATE int sqlite3PagerLockingMode(Pager *pPager, int eMode){

	9016 assert( eMode==PAGER_LOCKINGMODE_QUERY

	9017 || eMode==PAGER_LOCKINGMODE_NORMAL

	9018 || eMode==PAGER_LOCKINGMODE_EXCLUSIVE );

	9019 assert( PAGER_LOCKINGMODE_QUERY<0 );

	9020 assert( PAGER_LOCKINGMODE_NORMAL>=0 && PAGER_LOCKINGMODE_EXCLUSIVE>=0 );

	9021 assert( pPager->exclusiveMode || 0==sqlite3WalHeapMemory(pPager->pWal) );

	9022 if( eMode>=0 && !pPager->tempFile && !sqlite3WalHeapMemory(pPager->pWal) ){

	9023 pPager->exclusiveMode = (u8)eMode;

	9024 }

	9025 return (int)pPager->exclusiveMode;

	9026 }

	9027

	9028 /*

	9029 ** Set the journal-mode for this pager. Parameter eMode must be one of:

	9030 **

	9031 ** PAGER_JOURNALMODE_DELETE

	9032 ** PAGER_JOURNALMODE_TRUNCATE

	9033 ** PAGER_JOURNALMODE_PERSIST

	9034 ** PAGER_JOURNALMODE_OFF

	9035 ** PAGER_JOURNALMODE_MEMORY

	9036 ** PAGER_JOURNALMODE_WAL

	9037 **

	9038 ** The journalmode is set to the value specified if the change is allowed.

	9039 ** The change may be disallowed for the following reasons:

	9040 **

	9041 ** * An in-memory database can only have its journal_mode set to _OFF

	9042 ** or _MEMORY.

	9043 **

	9044 ** * Temporary databases cannot have _WAL journalmode.

	9045 **

	9046 ** The returned value indicates the current (possibly updated) journal-mode.

	9047 */

	9048 SQLITE_PRIVATE int sqlite3PagerSetJournalMode(Pager *pPager, int eMode){

	9049 u8 eOld = pPager->journalMode; /* Prior journalmode */

	9050

	9051 #ifdef SQLITE_DEBUG

	9052 /* The print_pager_state() routine is intended to be used by the debugger

	9053 ** only. We invoke it once here to suppress a compiler warning. */

	9054 print_pager_state(pPager);

	9055 #endif

	9056

	9057

	9058 /* The eMode parameter is always valid */

	9059 assert( eMode==PAGER_JOURNALMODE_DELETE

	9060 || eMode==PAGER_JOURNALMODE_TRUNCATE

	9061 || eMode==PAGER_JOURNALMODE_PERSIST

	9062 || eMode==PAGER_JOURNALMODE_OFF

	9063 || eMode==PAGER_JOURNALMODE_WAL

	9064 || eMode==PAGER_JOURNALMODE_MEMORY );

	9065

	9066 /* This routine is only called from the OP_JournalMode opcode, and

	9067 ** the logic there will never allow a temporary file to be changed

	9068 ** to WAL mode.

	9069 */

	9070 assert( pPager->tempFile==0 || eMode!=PAGER_JOURNALMODE_WAL );

	9071

	9072 /* Do not allow the journalmode of an in-memory database to be set to

	9073 ** anything other than MEMORY or OFF

	9074 */

	9075 if( MEMDB ){

	9076 assert( eOld==PAGER_JOURNALMODE_MEMORY || eOld==PAGER_JOURNALMODE_OFF );

	9077 if( eMode!=PAGER_JOURNALMODE_MEMORY && eMode!=PAGER_JOURNALMODE_OFF ){

	9078 eMode = eOld;

	9079 }

	9080 }

	9081

	9082 if( eMode!=eOld ){

	9083

	9084 /* Change the journal mode. */

	9085 assert( pPager->eState!=PAGER_ERROR );

	9086 pPager->journalMode = (u8)eMode;

	9087

	9088 /* When transitioning from TRUNCATE or PERSIST to any other journal

	9089 ** mode except WAL, unless the pager is in locking_mode=exclusive mode,

	9090 ** delete the journal file.

	9091 */

	9092 assert( (PAGER_JOURNALMODE_TRUNCATE & 5)==1 );

	9093 assert( (PAGER_JOURNALMODE_PERSIST & 5)==1 );

	9094 assert( (PAGER_JOURNALMODE_DELETE & 5)==0 );

	9095 assert( (PAGER_JOURNALMODE_MEMORY & 5)==4 );

	9096 assert( (PAGER_JOURNALMODE_OFF & 5)==0 );

	9097 assert( (PAGER_JOURNALMODE_WAL & 5)==5 );

	9098

	9099 assert( isOpen(pPager->fd) || pPager->exclusiveMode );

	9100 if( !pPager->exclusiveMode && (eOld & 5)==1 && (eMode & 1)==0 ){

	9101

	9102 /* In this case we would like to delete the journal file. If it is

	9103 ** not possible, then that is not a problem. Deleting the journal file

	9104 ** here is an optimization only.

	9105 **

	9106 ** Before deleting the journal file, obtain a RESERVED lock on the

	9107 ** database file. This ensures that the journal file is not deleted

	9108 ** while it is in use by some other client.

	9109 */

	9110 sqlite3OsClose(pPager->jfd);

	9111 if( pPager->eLock>=RESERVED_LOCK ){

	9112 sqlite3OsDelete(pPager->pVfs, pPager->zJournal, 0);

	9113 }else{

	9114 int rc = SQLITE_OK;

	9115 int state = pPager->eState;

	9116 assert( state==PAGER_OPEN || state==PAGER_READER );

	9117 if( state==PAGER_OPEN ){

	9118 rc = sqlite3PagerSharedLock(pPager);

	9119 }

	9120 if( pPager->eState==PAGER_READER ){

	9121 assert( rc==SQLITE_OK );

	9122 rc = pagerLockDb(pPager, RESERVED_LOCK);

	9123 }

	9124 if( rc==SQLITE_OK ){

	9125 sqlite3OsDelete(pPager->pVfs, pPager->zJournal, 0);

	9126 }

	9127 if( rc==SQLITE_OK && state==PAGER_READER ){

	9128 pagerUnlockDb(pPager, SHARED_LOCK);

	9129 }else if( state==PAGER_OPEN ){

	9130 pager_unlock(pPager);

	9131 }

	9132 assert( state==pPager->eState );

	9133 }

	9134 }else if( eMode==PAGER_JOURNALMODE_OFF ){

	9135 sqlite3OsClose(pPager->jfd);

	9136 }

	9137 }

	9138

	9139 /* Return the new journal mode */

	9140 return (int)pPager->journalMode;

	9141 }
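
The `(eOld & 5)==1 && (eMode & 1)==0` test in sqlite3PagerSetJournalMode() encodes "old mode is TRUNCATE or PERSIST, new mode is DELETE, OFF or MEMORY" in two mask operations. The sketch below spells that out; the numeric values are an assumption chosen to satisfy the assert()s in the function above, and the `DEMO_*` names are hypothetical.

```c
#include <assert.h>

/* Assumed numeric values, consistent with the mask assert()s in
** sqlite3PagerSetJournalMode(); treat them as illustrative. */
enum {
  DEMO_JM_DELETE = 0, DEMO_JM_PERSIST = 1, DEMO_JM_OFF = 2,
  DEMO_JM_TRUNCATE = 3, DEMO_JM_MEMORY = 4, DEMO_JM_WAL = 5
};

/* Return 1 if switching from eOld to eNew should try to delete the
** journal file, 0 otherwise. */
static int demoShouldDeleteJournal(int exclusiveMode, int eOld, int eNew){
  assert( eOld>=0 && eOld<=5 && eNew>=0 && eNew<=5 );
  /* (eOld & 5)==1 holds only for PERSIST (1) and TRUNCATE (3).
  ** (eNew & 1)==0 holds only for DELETE (0), OFF (2) and MEMORY (4). */
  return !exclusiveMode && (eOld & 5)==1 && (eNew & 1)==0;
}
```

Under these values, PERSIST to DELETE and TRUNCATE to MEMORY both qualify, while TRUNCATE to WAL and PERSIST to TRUNCATE do not, and nothing qualifies while the pager is in exclusive mode.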

	9142

	9143 /*

	9144 ** Return the current journal mode.

	9145 */

	9146 SQLITE_PRIVATE int sqlite3PagerGetJournalMode(Pager *pPager){

	9147 return (int)pPager->journalMode;

	9148 }

	9149

	9150 /*

	9151 ** Return TRUE if the pager is in a state where it is OK to change the

	9152 ** journalmode. Journalmode changes can only happen when the database

	9153 ** is unmodified.

	9154 */

	9155 SQLITE_PRIVATE int sqlite3PagerOkToChangeJournalMode(Pager *pPager){

	9156 assert( assert_pager_state(pPager) );

	9157 if( pPager->eState>=PAGER_WRITER_CACHEMOD ) return 0;

	9158 if( NEVER(isOpen(pPager->jfd) && pPager->journalOff>0) ) return 0;

	9159 return 1;

	9160 }

	9161

	9162 /*

	9163 ** Get/set the size-limit used for persistent journal files.

	9164 **

	9165 ** Setting the size limit to -1 means no limit is enforced.

	9166 ** An attempt to set a limit smaller than -1 is a no-op.

	9167 */

	9168 SQLITE_PRIVATE i64 sqlite3PagerJournalSizeLimit(Pager *pPager, i64 iLimit){

	9169 if( iLimit>=-1 ){

	9170 pPager->journalSizeLimit = iLimit;

	9171 sqlite3WalLimit(pPager->pWal, iLimit);

	9172 }

	9173 return pPager->journalSizeLimit;

	9174 }
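
The get/set convention used by sqlite3PagerJournalSizeLimit() (any value of -1 or greater sets the limit, anything smaller only queries it) can be sketched as follows. `DemoPager` and `demoJournalSizeLimit` are hypothetical names, and the WAL side of the real function is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pager holding only the field this sketch needs. */
typedef struct DemoPager { int64_t journalSizeLimit; } DemoPager;

/* Set the limit when iLimit>=-1 (-1 meaning "no limit"); values below -1
** leave it unchanged. Always return the limit currently in effect. */
static int64_t demoJournalSizeLimit(DemoPager *p, int64_t iLimit){
  assert( p!=0 );
  if( iLimit>=-1 ){
    p->journalSizeLimit = iLimit;
  }
  return p->journalSizeLimit;
}
```

A caller that only wants to read the current limit can therefore pass any value below -1, such as -2, without risk of changing it.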

	9175

	9176 /*

	9177 ** Return a pointer to the pPager->pBackup variable. The backup module

	9178 ** in backup.c maintains the content of this variable. This module

	9179 ** uses it opaquely as an argument to sqlite3BackupRestart() and

	9180 ** sqlite3BackupUpdate() only.

	9181 */

	9182 SQLITE_PRIVATE sqlite3_backup **sqlite3PagerBackupPtr(Pager *pPager){

	9183 return &pPager->pBackup;

	9184 }

	9185

	9186 #ifndef SQLITE_OMIT_VACUUM

	9187 /*

	9188 ** Unless this is an in-memory or temporary database, clear the pager cache.

	9189 */

	9190 SQLITE_PRIVATE void sqlite3PagerClearCache(Pager *pPager){

	9191 assert( MEMDB==0 || pPager->tempFile );

	9192 if( pPager->tempFile==0 ) pager_reset(pPager);

	9193 }

	9194 #endif

	9195

	9196

	9197 #ifndef SQLITE_OMIT_WAL

	9198 /*

	9199 ** This function is called when the user invokes "PRAGMA wal_checkpoint",

	9200 ** "PRAGMA wal_blocking_checkpoint" or calls the sqlite3_wal_checkpoint()

	9201 ** or wal_blocking_checkpoint() API functions.

	9202 **

	9203 ** Parameter eMode is one of SQLITE_CHECKPOINT_PASSIVE, FULL or RESTART.

	9204 */

	9205 SQLITE_PRIVATE int sqlite3PagerCheckpoint(

	9206 Pager *pPager, /* Checkpoint on this pager */

	9207 sqlite3 *db, /* Db handle used to check for interrupts */

	9208 int eMode, /* Type of checkpoint */

	9209 int *pnLog, /* OUT: Final number of frames in log */

	9210 int *pnCkpt /* OUT: Final number of checkpointed frames */

	9211 ){

	9212 int rc = SQLITE_OK;

	9213 if( pPager->pWal ){

	9214 rc = sqlite3WalCheckpoint(pPager->pWal, db, eMode,

	9215 (eMode==SQLITE_CHECKPOINT_PASSIVE ? 0 : pPager->xBusyHandler),

	9216 pPager->pBusyHandlerArg,

	9217 pPager->ckptSyncFlags, pPager->pageSize, (u8 *)pPager->pTmpSpace,

	9218 pnLog, pnCkpt

	9219 );

	9220 }

	9221 return rc;

	9222 }

	9223

	9224 SQLITE_PRIVATE int sqlite3PagerWalCallback(Pager *pPager){

	9225 return sqlite3WalCallback(pPager->pWal);

	9226 }

	9227

	9228 /*

	9229 ** Return true if the underlying VFS for the given pager supports the

	9230 ** primitives necessary for write-ahead logging.

	9231 */

	9232 SQLITE_PRIVATE int sqlite3PagerWalSupported(Pager *pPager){

	9233 const sqlite3_io_methods *pMethods = pPager->fd->pMethods;

	9234 if( pPager->noLock ) return 0;

	9235 return pPager->exclusiveMode || (pMethods->iVersion>=2 && pMethods->xShmMap);

	9236 }

	9237

	9238 /*

	9239 ** Attempt to take an exclusive lock on the database file. If a PENDING lock

	9240 ** is obtained instead, immediately release it.

	9241 */

	9242 static int pagerExclusiveLock(Pager *pPager){

	9243 int rc; /* Return code */

	9244

	9245 assert( pPager->eLock==SHARED_LOCK || pPager->eLock==EXCLUSIVE_LOCK );

	9246 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);

	9247 if( rc!=SQLITE_OK ){

	9248 /* If the attempt to grab the exclusive lock failed, release the

	9249 ** pending lock that may have been obtained instead. */

	9250 pagerUnlockDb(pPager, SHARED_LOCK);

	9251 }

	9252

	9253 return rc;

	9254 }

	9255

	9256 /*

	9257 ** Call sqlite3WalOpen() to open the WAL handle. If the pager is in

	9258 ** exclusive-locking mode when this function is called, take an EXCLUSIVE

	9259 ** lock on the database file and use heap-memory to store the wal-index

	9260 ** in. Otherwise, use the normal shared-memory.

	9261 */

	9262 static int pagerOpenWal(Pager *pPager){

	9263 int rc = SQLITE_OK;

	9264

	9265 assert( pPager->pWal==0 && pPager->tempFile==0 );

	9266 assert( pPager->eLock==SHARED_LOCK || pPager->eLock==EXCLUSIVE_LOCK );

	9267

	9268 /* If the pager is already in exclusive-mode, the WAL module will use

	9269 ** heap-memory for the wal-index instead of the VFS shared-memory

	9270 ** implementation. Take the exclusive lock now, before opening the WAL

	9271 ** file, to make sure this is safe.

	9272 */

	9273 if( pPager->exclusiveMode ){

	9274 rc = pagerExclusiveLock(pPager);

	9275 }

	9276

	9277 /* Open the connection to the log file. If this operation fails,

	9278 ** (e.g. due to malloc() failure), return an error code.

	9279 */

	9280 if( rc==SQLITE_OK ){

	9281 rc = sqlite3WalOpen(pPager->pVfs,

	9282 pPager->fd, pPager->zWal, pPager->exclusiveMode,

	9283 pPager->journalSizeLimit, &pPager->pWal

	9284 );

	9285 }

	9286 pagerFixMaplimit(pPager);

	9287

	9288 return rc;

	9289 }

	9290

	9291

	9292 /*

	9293 ** The caller must be holding a SHARED lock on the database file to call

	9294 ** this function.

	9295 **

	9296 ** If the pager passed as the first argument is open on a real database

	9297 ** file (not a temp file or an in-memory database), and the WAL file

	9298 ** is not already open, make an attempt to open it now. If successful,

	9299 ** return SQLITE_OK. If an error occurs or the VFS used by the pager does

	9300 ** not support the xShmXXX() methods, return an error code. *pbOpen is

	9301 ** not modified in either case.

	9302 **

	9303 ** If the pager is open on a temp-file (or in-memory database), or if

	9304 ** the WAL file is already open, set *pbOpen to 1 and return SQLITE_OK

	9305 ** without doing anything.

	9306 */

	9307 SQLITE_PRIVATE int sqlite3PagerOpenWal(

	9308 Pager *pPager, /* Pager object */

	9309 int *pbOpen /* OUT: Set to true if call is a no-op */

	9310 ){

	9311 int rc = SQLITE_OK; /* Return code */

	9312

	9313 assert( assert_pager_state(pPager) );

	9314 assert( pPager->eState==PAGER_OPEN || pbOpen );

	9315 assert( pPager->eState==PAGER_READER || !pbOpen );

	9316 assert( pbOpen==0 || *pbOpen==0 );

	9317 assert( pbOpen!=0 || (!pPager->tempFile && !pPager->pWal) );

	9318

	9319 if( !pPager->tempFile && !pPager->pWal ){

	9320 if( !sqlite3PagerWalSupported(pPager) ) return SQLITE_CANTOPEN;

	9321

	9322 /* Close any rollback journal previously open */

	9323 sqlite3OsClose(pPager->jfd);

	9324

	9325 rc = pagerOpenWal(pPager);

	9326 if( rc==SQLITE_OK ){

	9327 pPager->journalMode = PAGER_JOURNALMODE_WAL;

	9328 pPager->eState = PAGER_OPEN;

	9329 }

	9330 }else{

	9331 *pbOpen = 1;

	9332 }

	9333

	9334 return rc;

	9335 }

	9336

	9337 /*

	9338 ** This function is called to close the connection to the log file prior

	9339 ** to switching from WAL to rollback mode.

	9340 **

	9341 ** Before closing the log file, this function attempts to take an

	9342 ** EXCLUSIVE lock on the database file. If this cannot be obtained, an

	9343 ** error (SQLITE_BUSY) is returned and the log connection is not closed.

	9344 ** If successful, the EXCLUSIVE lock is not released before returning.

	9345 */

	9346 SQLITE_PRIVATE int sqlite3PagerCloseWal(Pager *pPager, sqlite3 *db){

	9347 int rc = SQLITE_OK;

	9348

	9349 assert( pPager->journalMode==PAGER_JOURNALMODE_WAL );

	9350

	9351 /* If the log file is not already open, but does exist in the file-system,

	9352 ** it may need to be checkpointed before the connection can switch to

	9353 ** rollback mode. Open it now so this can happen.

	9354 */

	9355 if( !pPager->pWal ){

	9356 int logexists = 0;

	9357 rc = pagerLockDb(pPager, SHARED_LOCK);

	9358 if( rc==SQLITE_OK ){

	9359 rc = sqlite3OsAccess(

	9360 pPager->pVfs, pPager->zWal, SQLITE_ACCESS_EXISTS, &logexists

	9361 );

	9362 }

	9363 if( rc==SQLITE_OK && logexists ){

	9364 rc = pagerOpenWal(pPager);

	9365 }

	9366 }

	9367

	9368 /* Checkpoint and close the log. Because an EXCLUSIVE lock is held on

	9369 ** the database file, the log and log-summary files will be deleted.

	9370 */

	9371 if( rc==SQLITE_OK && pPager->pWal ){

	9372 rc = pagerExclusiveLock(pPager);

	9373 if( rc==SQLITE_OK ){

	9374 rc = sqlite3WalClose(pPager->pWal, db, pPager->ckptSyncFlags,

	9375 pPager->pageSize, (u8*)pPager->pTmpSpace);

	9376 pPager->pWal = 0;

	9377 pagerFixMaplimit(pPager);

	9378 if( rc && !pPager->exclusiveMode ) pagerUnlockDb(pPager, SHARED_LOCK);

	9379 }

	9380 }

	9381 return rc;

	9382 }

	9383

	9384 #ifdef SQLITE_ENABLE_SNAPSHOT

	9385 /*

	9386 ** If this is a WAL database, obtain a snapshot handle for the snapshot

	9387 ** currently open. Otherwise, return an error.

	9388 */

	9389 SQLITE_PRIVATE int sqlite3PagerSnapshotGet(Pager *pPager, sqlite3_snapshot **ppSnapshot){

	9390 int rc = SQLITE_ERROR;

	9391 if( pPager->pWal ){

	9392 rc = sqlite3WalSnapshotGet(pPager->pWal, ppSnapshot);

	9393 }

	9394 return rc;

	9395 }

	9396

	9397 /*

	9398 ** If this is a WAL database, store a pointer to pSnapshot. Next time a

	9399 ** read transaction is opened, attempt to read from the snapshot it

	9400 ** identifies. If this is not a WAL database, return an error.

	9401 */

	9402 SQLITE_PRIVATE int sqlite3PagerSnapshotOpen(Pager *pPager, sqlite3_snapshot *pSnapshot){

	9403 int rc = SQLITE_OK;

	9404 if( pPager->pWal ){

	9405 sqlite3WalSnapshotOpen(pPager->pWal, pSnapshot);

	9406 }else{

	9407 rc = SQLITE_ERROR;

	9408 }

	9409 return rc;

	9410 }

	9411

	9412 /*

	9413 ** If this is a WAL database, call sqlite3WalSnapshotRecover(). If this

	9414 ** is not a WAL database, return an error.

	9415 */

	9416 SQLITE_PRIVATE int sqlite3PagerSnapshotRecover(Pager *pPager){

	9417 int rc;

	9418 if( pPager->pWal ){

	9419 rc = sqlite3WalSnapshotRecover(pPager->pWal);

	9420 }else{

	9421 rc = SQLITE_ERROR;

	9422 }

	9423 return rc;

	9424 }

	9425 #endif /* SQLITE_ENABLE_SNAPSHOT */

	9426 #endif /* !SQLITE_OMIT_WAL */

	9427

	9428 #ifdef SQLITE_ENABLE_ZIPVFS

	9429 /*

	9430 ** A read-lock must be held on the pager when this function is called. If

	9431 ** the pager is in WAL mode and the WAL file currently contains one or more

	9432 ** frames, return the size in bytes of the page images stored within the

	9433 ** WAL frames. Otherwise, if this is not a WAL database or the WAL file

	9434 ** is empty, return 0.

	9435 */

	9436 SQLITE_PRIVATE int sqlite3PagerWalFramesize(Pager *pPager){

	9437 assert( pPager->eState>=PAGER_READER );

	9438 return sqlite3WalFramesize(pPager->pWal);

	9439 }

	9440 #endif

	9441

	9442 #endif /* SQLITE_OMIT_DISKIO */

	9443

	9444 /************ End of pager.c *********************************************/

	9445 /************ Begin file wal.c *******************************************/

	9446 /*

	9447 ** 2010 February 1

	9448 **

	9449 ** The author disclaims copyright to this source code. In place of

	9450 ** a legal notice, here is a blessing:

	9451 **

	9452 ** May you do good and not evil.

	9453 ** May you find forgiveness for yourself and forgive others.

	9454 ** May you share freely, never taking more than you give.

	9455 **

	9456 *************************************************************************

	9457 **

	9458 ** This file contains the implementation of a write-ahead log (WAL) used in

	9459 ** "journal_mode=WAL" mode.

	9460 **

	9461 ** WRITE-AHEAD LOG (WAL) FILE FORMAT

	9462 **

	9463 ** A WAL file consists of a header followed by zero or more "frames".

	9464 ** Each frame records the revised content of a single page from the

	9465 ** database file. All changes to the database are recorded by writing

	9466 ** frames into the WAL. Transactions commit when a frame is written that

	9467 ** contains a commit marker. A single WAL can and usually does record

	9468 ** multiple transactions. Periodically, the content of the WAL is

	9469 ** transferred back into the database file in an operation called a

	9470 ** "checkpoint".

	9471 **

	9472 ** A single WAL file can be used multiple times. In other words, the

	9473 ** WAL can fill up with frames and then be checkpointed and then new

	9474 ** frames can overwrite the old ones. A WAL always grows from beginning

	9475 ** toward the end. Checksums and counters attached to each frame are

	9476 ** used to determine which frames within the WAL are valid and which

	9477 ** are leftovers from prior checkpoints.

	9478 **

	9479 ** The WAL header is 32 bytes in size and consists of the following eight

	9480 ** big-endian 32-bit unsigned integer values:

	9481 **

	9482 ** 0: Magic number. 0x377f0682 or 0x377f0683

	9483 ** 4: File format version. Currently 3007000

	9484 ** 8: Database page size. Example: 1024

	9485 ** 12: Checkpoint sequence number

	9486 ** 16: Salt-1, random integer incremented with each checkpoint

	9487 ** 20: Salt-2, a different random integer changing with each ckpt

	9488 ** 24: Checksum-1 (first part of checksum for first 24 bytes of header).

	9489 ** 28: Checksum-2 (second part of checksum for first 24 bytes of header).

	9490 **

	9491 ** Immediately following the wal-header are zero or more frames. Each

	9492 ** frame consists of a 24-byte frame-header followed by <page-size> bytes

	9493 ** of page data. The frame-header is six big-endian 32-bit unsigned

	9494 ** integer values, as follows:

	9495 **

	9496 ** 0: Page number.

	9497 ** 4: For commit records, the size of the database image in pages

	9498 ** after the commit. For all other records, zero.

	9499 ** 8: Salt-1 (copied from the header)

	9500 ** 12: Salt-2 (copied from the header)

	9501 ** 16: Checksum-1.

	9502 ** 20: Checksum-2.

	9503 **

	9504 ** A frame is considered valid if and only if the following conditions are

	9505 ** true:

	9506 **

	9507 ** (1) The salt-1 and salt-2 values in the frame-header match

	9508 ** salt values in the wal-header

	9509 **

	9510 ** (2) The checksum values in the final 8 bytes of the frame-header

	9511 ** exactly match the checksum computed consecutively on the

	9512 ** WAL header and the first 8 bytes and the content of all frames

	9513 ** up to and including the current frame.

	9514 **

	9515 ** The checksum is computed using 32-bit big-endian integers if the

	9516 ** magic number in the first 4 bytes of the WAL is 0x377f0683 and it

	9517 ** is computed using little-endian if the magic number is 0x377f0682.

	9518 ** The checksum values are always stored in the frame header in a

	9519 ** big-endian format regardless of which byte order is used to compute

	9520 ** the checksum. The checksum is computed by interpreting the input as

	9521 ** an even number of unsigned 32-bit integers: x[0] through x[N]. The

	9522 ** algorithm used for the checksum is as follows:

	9523 **

	9524 ** for i from 0 to n-1 step 2:

	9525 ** s0 += x[i] + s1;

	9526 ** s1 += x[i+1] + s0;

	9527 ** endfor

	9528 **

	9529 ** Note that s0 and s1 are both weighted checksums using fibonacci weights

	9530 ** in reverse order (the largest fibonacci weight occurs on the first element

	9531 ** of the sequence being summed.) The s1 value spans all 32-bit

	9532 ** terms of the sequence whereas s0 omits the final term.

	9533 **

	9534 ** On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the

	9535 ** WAL is transferred into the database, then the database is VFS.xSync-ed.

	9536 ** The VFS.xSync operations serve as write barriers - all writes launched

	9537 ** before the xSync must complete before any write that launches after the

	9538 ** xSync begins.

	9539 **

	9540 ** After each checkpoint, the salt-1 value is incremented and the salt-2

	9541 ** value is randomized. This prevents old and new frames in the WAL from

	9542 ** being considered valid at the same time and being checkpointed together

	9543 ** following a crash.

	9544 **

	9545 ** READER ALGORITHM

	9546 **

	9547 ** To read a page from the database (call it page number P), a reader

	9548 ** first checks the WAL to see if it contains page P. If so, then the

	9549 ** last valid instance of page P that is followed by a commit frame

	9550 ** or is a commit frame itself becomes the value read. If the WAL

	9551 ** contains no copies of page P that are valid and which are a commit

	9552 ** frame or are followed by a commit frame, then page P is read from

	9553 ** the database file.

	9554 **

	9555 ** To start a read transaction, the reader records the index of the last

	9556 ** valid frame in the WAL. The reader uses this recorded "mxFrame" value

	9557 ** for all subsequent read operations. New transactions can be appended

	9558 ** to the WAL, but as long as the reader uses its original mxFrame value

	9559 ** and ignores the newly appended content, it will see a consistent snapshot

	9560 ** of the database from a single point in time. This technique allows

	9561 ** multiple concurrent readers to view different versions of the database

	9562 ** content simultaneously.

	9563 **

	9564 ** The reader algorithm in the previous paragraphs works correctly, but

	9565 ** because frames for page P can appear anywhere within the WAL, the

	9566 ** reader has to scan the entire WAL looking for page P frames. If the

	9567 ** WAL is large (multiple megabytes is typical) that scan can be slow,

	9568 ** and read performance suffers. To overcome this problem, a separate

	9569 ** data structure called the wal-index is maintained to expedite the

	9570 ** search for frames of a particular page.

	9571 **

	9572 ** WAL-INDEX FORMAT

	9573 **

	9574 ** Conceptually, the wal-index is shared memory, though VFS implementations

	9575 ** might choose to implement the wal-index using a mmapped file. Because

	9576 ** the wal-index is shared memory, SQLite does not support journal_mode=WAL

	9577 ** on a network filesystem. All users of the database must be able to

	9578 ** share memory.

	9579 **

	9580 ** The wal-index is transient. After a crash, the wal-index can (and should

	9581 ** be) reconstructed from the original WAL file. In fact, the VFS is required

	9582 ** to either truncate or zero the header of the wal-index when the last

	9583 ** connection to it closes. Because the wal-index is transient, it can

	9584 ** use an architecture-specific format; it does not have to be cross-platform.

	9585 ** Hence, unlike the database and WAL file formats which store all values

	9586 ** as big endian, the wal-index can store multi-byte values in the native

	9587 ** byte order of the host computer.

	9588 **

	9589 ** The purpose of the wal-index is to answer this question quickly: Given

	9590 ** a page number P and a maximum frame index M, return the index of the

	9591 ** last frame in the wal before frame M for page P in the WAL, or return

	9592 ** NULL if there are no frames for page P in the WAL prior to M.

	9593 **

	9594 ** The wal-index consists of a header region, followed by one or

	9595 ** more index blocks.

	9596 **

	9597 ** The wal-index header contains the total number of frames within the WAL

	9598 ** in the mxFrame field.

	9599 **

	9600 ** Each index block except for the first contains information on

	9601 ** HASHTABLE_NPAGE frames. The first index block contains information on

	9602 ** HASHTABLE_NPAGE_ONE frames. The values of HASHTABLE_NPAGE_ONE and

	9603 ** HASHTABLE_NPAGE are selected so that together the wal-index header and

	9604 ** first index block are the same size as all other index blocks in the

	9605 ** wal-index.

	9606 **

	9607 ** Each index block contains two sections, a page-mapping that contains the

	9608 ** database page number associated with each wal frame, and a hash-table

	9609 ** that allows readers to query an index block for a specific page number.

	9610 ** The page-mapping is an array of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE

	9611 ** for the first index block) 32-bit page numbers. The first entry in the

	9612 ** first index-block contains the database page number corresponding to the

	9613 ** first frame in the WAL file. The first entry in the second index block

	9614 ** in the WAL file corresponds to the (HASHTABLE_NPAGE_ONE+1)th frame in

	9615 ** the log, and so on.

	9616 **

	9617 ** The last index block in a wal-index usually contains less than the full

	9618 ** complement of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE) page-numbers,

	9619 ** depending on the contents of the WAL file. This does not change the

	9620 ** allocated size of the page-mapping array - the page-mapping array merely

	9621 ** contains unused entries.

	9622 **

	9623 ** Even without using the hash table, the last frame for page P

	9624 ** can be found by scanning the page-mapping sections of each index block

	9625 ** starting with the last index block and moving toward the first, and

	9626 ** within each index block, starting at the end and moving toward the

	9627 ** beginning. The first entry that equals P corresponds to the frame

	9628 ** holding the content for that page.

	9629 **

	9630 ** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers.

	9631 ** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the

	9632 ** hash table for each page number in the mapping section, so the hash

	9633 ** table is never more than half full. The expected number of collisions

	9634 ** prior to finding a match is 1. Each entry of the hash table is a

	9635 ** 1-based index of an entry in the mapping section of the same

	9636 ** index block. Let K be the 1-based index of the largest entry in

	9637 ** the mapping section. (For index blocks other than the last, K will

	9638 ** always be exactly HASHTABLE_NPAGE (4096) and for the last index block

	9639 ** K will be (mxFrame%HASHTABLE_NPAGE).) Unused slots of the hash table

	9640 ** contain a value of 0.

	9641 **

	9642 ** To look for page P in the hash table, first compute a hash iKey on

	9643 ** P as follows:

	9644 **

	9645 ** iKey = (P * 383) % HASHTABLE_NSLOT

	9646 **

	9647 ** Then start scanning entries of the hash table, starting with iKey

	9648 ** (wrapping around to the beginning when the end of the hash table is

	9649 ** reached) until an unused hash slot is found. Let the first unused slot

	9650 ** be at index iUnused. (iUnused might be less than iKey if there was

	9651 ** wrap-around.) Because the hash table is never more than half full,

	9652 ** the search is guaranteed to eventually hit an unused entry. Let

	9653 ** iMax be the value between iKey and iUnused, closest to iUnused,

	9654 ** where aHash[iMax]==P. If there is no iMax entry (if there exists

	9655 ** no hash slot such that aHash[i]==P) then page P is not in the

	9656 ** current index block. Otherwise the iMax-th mapping entry of the

	9657 ** current index block corresponds to the last entry that references

	9658 ** page P.

	9659 **

	9660 ** A hash search begins with the last index block and moves toward the

	9661 ** first index block, looking for entries corresponding to page P. On

	9662 ** average, only two or three slots in each index block need to be

	9663 ** examined in order to either find the last entry for page P, or to

	9664 ** establish that no such entry exists in the block. Each index block

	9665 ** holds over 4000 entries. So two or three index blocks are sufficient

	9666 ** to cover a typical 10 megabyte WAL file, assuming 1K pages. 8 or 10

	9667 ** comparisons (on average) suffice to either locate a frame in the

	9668 ** WAL or to establish that the frame does not exist in the WAL. This

	9669 ** is much faster than scanning the entire 10MB WAL.

	9670 **

	9671 ** Note that entries are added in order of increasing K. Hence, one

	9672 ** reader might be using some value K0 and a second reader that started

	9673 ** at a later time (after additional transactions were added to the WAL

	9674 ** and to the wal-index) might be using a different value K1, where K1>K0.

	9675 ** Both readers can use the same hash table and mapping section to get

	9676 ** the correct result. There may be entries in the hash table with

	9677 ** K>K0 but to the first reader, those entries will appear to be unused

	9678 ** slots in the hash table and so the first reader will get an answer as

	9679 ** if no values greater than K0 had ever been inserted into the hash table

	9680 ** in the first place - which is what reader one wants. Meanwhile, the

	9681 ** second reader using K1 will see additional values that were inserted

	9682 ** later, which is exactly what reader two wants.

	9683 **

	9684 ** When a rollback occurs, the value of K is decreased. Hash table entries

	9685 ** that correspond to frames greater than the new K value are removed

	9686 ** from the hash table at this point.

	9687 */
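Reviewer note: the hash-table lookup described in the header comment above can be sketched for a single index block as follows. This is a simplified illustration and not the code in wal.c. The type `IndexBlockSketch` and the functions `hash_key`, `block_append` and `block_lookup` are hypothetical names. The constants follow the values quoted in the comment (4096 mapping entries, twice as many hash slots, multiplier 383).

```c
#include <assert.h>
#include <stdint.h>

#define NPAGE_SKETCH 4096                /* HASHTABLE_NPAGE as quoted in the comment */
#define NSLOT_SKETCH (2*NPAGE_SKETCH)    /* HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE */

/* Hypothetical model of one index block (not the real in-memory layout). */
typedef struct IndexBlockSketch {
  uint32_t aPgno[NPAGE_SKETCH];  /* Page-mapping: aPgno[k-1] is the page of entry k */
  uint16_t aHash[NSLOT_SKETCH];  /* 1-based indexes into aPgno[]; 0 means unused */
  uint32_t nEntry;               /* Number of mapping entries in use */
} IndexBlockSketch;

/* iKey = (P * 383) % HASHTABLE_NSLOT */
static unsigned hash_key(uint32_t pgno){
  return (pgno*383u) % NSLOT_SKETCH;
}

/* Append a mapping entry for pgno and link it into the hash table. */
static void block_append(IndexBlockSketch *p, uint32_t pgno){
  unsigned i;
  assert( p->nEntry<NPAGE_SKETCH );
  p->aPgno[p->nEntry++] = pgno;
  /* Linear probing with wrap-around until an unused slot is found. */
  for(i=hash_key(pgno); p->aHash[i]!=0; i=(i+1)%NSLOT_SKETCH){}
  p->aHash[i] = (uint16_t)p->nEntry;
}

/* Return the 1-based index of the last mapping entry for pgno that is
** no greater than K, or 0 if the block holds no such entry. Entries
** larger than K are skipped, so a reader with a smaller K sees the
** block as it was when that reader started. */
static uint32_t block_lookup(const IndexBlockSketch *p, uint32_t pgno, uint32_t K){
  uint32_t iMax = 0;
  unsigned i;
  for(i=hash_key(pgno); p->aHash[i]!=0; i=(i+1)%NSLOT_SKETCH){
    uint32_t k = p->aHash[i];
    if( k<=K && p->aPgno[k-1]==pgno ) iMax = k;
  }
  return iMax;
}
```

Because entries are appended in increasing order and each one lands in the first free slot after its hash key, the last match found while probing is also the most recent one. A full search would repeat this per block, starting from the last index block.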

	9688 #ifndef SQLITE_OMIT_WAL

	9689

	9690 /* #include "wal.h" */

	9691

	9692 /*

	9693 ** Trace output macros

	9694 */

	9695 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)

	9696 SQLITE_PRIVATE int sqlite3WalTrace = 0;

	9697 # define WALTRACE(X) if(sqlite3WalTrace) sqlite3DebugPrintf X

	9698 #else

	9699 # define WALTRACE(X)

	9700 #endif

	9701

	9702 /*

	9703 ** The maximum (and only) versions of the wal and wal-index formats

	9704 ** that may be interpreted by this version of SQLite.

	9705 **

	9706 ** If a client begins recovering a WAL file and finds that (a) the checksum

	9707 ** values in the wal-header are correct and (b) the version field is not

	9708 ** WAL_MAX_VERSION, recovery fails and SQLite returns SQLITE_CANTOPEN.

	9709 **

	9710 ** Similarly, if a client successfully reads a wal-index header (i.e. the

	9711 ** checksum test is successful) and finds that the version field is not

	9712 ** WALINDEX_MAX_VERSION, then no read-transaction is opened and SQLite

	9713 ** returns SQLITE_CANTOPEN.

	9714 */

	9715 #define WAL_MAX_VERSION 3007000

	9716 #define WALINDEX_MAX_VERSION 3007000

	9717

	9718 /*

	9719 ** Indices of various locking bytes. WAL_NREADER is the number

	9720 ** of available reader locks and should be at least 3. The default

	9721 ** is SQLITE_SHM_NLOCK==8 and WAL_NREADER==5.

	9722 */

	9723 #define WAL_WRITE_LOCK 0

	9724 #define WAL_ALL_BUT_WRITE 1

	9725 #define WAL_CKPT_LOCK 1

	9726 #define WAL_RECOVER_LOCK 2

	9727 #define WAL_READ_LOCK(I) (3+(I))

	9728 #define WAL_NREADER (SQLITE_SHM_NLOCK-3)

	9729

	9730

	9731 /* Object declarations */

	9732 typedef struct WalIndexHdr WalIndexHdr;

	9733 typedef struct WalIterator WalIterator;

	9734 typedef struct WalCkptInfo WalCkptInfo;

	9735

	9736

	9737 /*

	9738 ** The following object holds a copy of the wal-index header content.

	9739 **

	9740 ** The actual header in the wal-index consists of two copies of this

	9741 ** object followed by one instance of the WalCkptInfo object.

	9742 ** For all versions of SQLite through 3.10.0 and probably beyond,

	9743 ** the locking bytes (WalCkptInfo.aLock) start at offset 120 and

	9744 ** the total header size is 136 bytes.

	9745 **

	9746 ** The szPage value can be any power of 2 between 512 and 32768, inclusive.

	9747 ** Or it can be 1 to represent a 65536-byte page. The latter case was

	9748 ** added in 3.7.1 when support for 64K pages was added.

	9749 */

	9750 struct WalIndexHdr {

	9751 u32 iVersion; /* Wal-index version */

	9752 u32 unused; /* Unused (padding) field */

	9753 u32 iChange; /* Counter incremented each transaction */

	9754 u8 isInit; /* 1 when initialized */

	9755 u8 bigEndCksum; /* True if checksums in WAL are big-endian */

	9756 u16 szPage; /* Database page size in bytes. 1==64K */

	9757 u32 mxFrame; /* Index of last valid frame in the WAL */

	9758 u32 nPage; /* Size of database in pages */

	9759 u32 aFrameCksum[2]; /* Checksum of last frame in log */

	9760 u32 aSalt[2]; /* Two salt values copied from WAL header */

	9761 u32 aCksum[2]; /* Checksum over all prior fields */

	9762 };
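Reviewer note: the szPage encoding described in the comment above (a 16-bit field where the value 1 stands for a 65536-byte page) can be decoded as in this small sketch. The function name `decode_szpage_sketch` is hypothetical. It follows only the rule stated in the comment and is not necessarily how wal.c performs the conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Convert the stored 16-bit szPage value into a page size in bytes.
** Valid stored values are powers of two from 512 through 32768, or 1. */
static uint32_t decode_szpage_sketch(uint16_t stored){
  assert( stored==1 || (stored>=512 && (stored & (stored-1))==0) );
  return stored==1 ? 65536u : (uint32_t)stored;
}
```

The special case exists because 65536 does not fit in a u16, while 1 is otherwise not a legal page size.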

	9763

	9764 /*

	9765 ** A copy of the following object occurs in the wal-index immediately

	9766 ** following the second copy of the WalIndexHdr. This object stores

	9767 ** information used by checkpoint.

	9768 **

	9769 ** nBackfill is the number of frames in the WAL that have been written

	9770 ** back into the database. (We call the act of moving content from WAL to

	9771 ** database "backfilling".) The nBackfill number is never greater than

	9772 ** WalIndexHdr.mxFrame. nBackfill can only be increased by threads

	9773 ** holding the WAL_CKPT_LOCK lock (which includes a recovery thread).

	9774 ** However, a WAL_WRITE_LOCK thread can move the value of nBackfill from

	9775 ** mxFrame back to zero when the WAL is reset.

	9776 **

	9777 ** nBackfillAttempted is the largest value of nBackfill that a checkpoint

	9778 ** has attempted to achieve. Normally nBackfill==nBackfillAttempted, however

	9779 ** the nBackfillAttempted is set before any backfilling is done and the

	9780 ** nBackfill is only set after all backfilling completes. So if a checkpoint

	9781 ** crashes, nBackfillAttempted might be larger than nBackfill. The

	9782 ** WalIndexHdr.mxFrame must never be less than nBackfillAttempted.

	9783 **

	9784 ** The aLock[] field is a set of bytes used for locking. These bytes should

	9785 ** never be read or written.

	9786 **

	9787 ** There is one entry in aReadMark[] for each reader lock. If a reader

	9788 ** holds read-lock K, then the value in aReadMark[K] is no greater than

	9789 ** the mxFrame for that reader. The value READMARK_NOT_USED (0xffffffff)

	9790 ** for any aReadMark[] means that entry is unused. aReadMark[0] is

	9791 ** a special case; its value is never used and it exists as a place-holder

	9792 ** to avoid having to offset aReadMark[] indexes by one. Readers holding

	9793 ** WAL_READ_LOCK(0) always ignore the entire WAL and read all content

	9794 ** directly from the database.

	9795 **

	9796 ** The value of aReadMark[K] may only be changed by a thread that

	9797 ** is holding an exclusive lock on WAL_READ_LOCK(K). Thus, the value of

	9798 ** aReadMark[K] cannot change while a reader is using that mark

	9799 ** since the reader will be holding a shared lock on WAL_READ_LOCK(K).

	9800 **

	9801 ** The checkpointer may only transfer frames from WAL to database where

	9802 ** the frame numbers are less than or equal to every aReadMark[] that is

	9803 ** in use (that is, every aReadMark[j] for which there is a corresponding

	9804 ** WAL_READ_LOCK(j)). New readers (usually) pick the aReadMark[] with the

	9805 ** largest value and will increase an unused aReadMark[] to mxFrame if there

	9806 ** is not already an aReadMark[] equal to mxFrame. The exception to the

	9807 ** previous sentence is when nBackfill equals mxFrame (meaning that everything

	9808 ** in the WAL has been backfilled into the database) then new readers

	9809 ** will choose aReadMark[0] which has value 0 and hence such readers will

	9810 ** get all their content directly from the database file and ignore

	9811 ** the WAL.

	9812 **

	9813 ** Writers normally append new frames to the end of the WAL. However,

	9814 ** if nBackfill equals mxFrame (meaning that all WAL content has been

	9815 ** written back into the database) and if no readers are using the WAL

	9816 ** (in other words, if there are no WAL_READ_LOCK(i) where i>0) then

	9817 ** the writer will first "reset" the WAL back to the beginning and start

	9818 ** writing new content beginning at frame 1.

	9819 **

	9820 ** We assume that 32-bit loads are atomic and so no locks are needed in

	9821 ** order to read from any aReadMark[] entries.

	9822 */

	9823 struct WalCkptInfo {

	9824 u32 nBackfill; /* Number of WAL frames backfilled into DB */

	9825 u32 aReadMark[WAL_NREADER]; /* Reader marks */

	9826 u8 aLock[SQLITE_SHM_NLOCK]; /* Reserved space for locks */

	9827 u32 nBackfillAttempted; /* WAL frames perhaps written, or maybe not */

	9828 u32 notUsed0; /* Available for future enhancements */

	9829 };

	9830 #define READMARK_NOT_USED 0xffffffff

	9831

	9832

	9833 /* A block of WALINDEX_LOCK_RESERVED bytes beginning at

	9834 ** WALINDEX_LOCK_OFFSET is reserved for locks. Since some systems

	9835 ** only support mandatory file-locks, we do not read or write data

	9836 ** from the region of the file on which locks are applied.

	9837 */

	9838 #define WALINDEX_LOCK_OFFSET (sizeof(WalIndexHdr)*2+offsetof(WalCkptInfo,aLock))

	9839 #define WALINDEX_HDR_SIZE (sizeof(WalIndexHdr)*2+sizeof(WalCkptInfo))
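Reviewer note: the two macros above can be checked against the figures quoted earlier in the file (lock bytes at offset 120, total header size 136). The sketch below mirrors the two structs with fixed-width types and assumes the default SQLITE_SHM_NLOCK of 8 and no compiler-inserted padding. The names `HdrSketch`, `CkptSketch` and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SHM_NLOCK 8                    /* Assumed default SQLITE_SHM_NLOCK */
#define SKETCH_NREADER (SKETCH_SHM_NLOCK-3)   /* Mirrors WAL_NREADER */

/* Mirror of WalIndexHdr using fixed-width types. */
typedef struct HdrSketch {
  uint32_t iVersion, unused, iChange;
  uint8_t isInit, bigEndCksum;
  uint16_t szPage;
  uint32_t mxFrame, nPage;
  uint32_t aFrameCksum[2], aSalt[2], aCksum[2];
} HdrSketch;

/* Mirror of WalCkptInfo using fixed-width types. */
typedef struct CkptSketch {
  uint32_t nBackfill;
  uint32_t aReadMark[SKETCH_NREADER];
  uint8_t aLock[SKETCH_SHM_NLOCK];
  uint32_t nBackfillAttempted;
  uint32_t notUsed0;
} CkptSketch;

/* Same arithmetic as WALINDEX_LOCK_OFFSET. */
static size_t lock_offset_sketch(void){
  return sizeof(HdrSketch)*2 + offsetof(CkptSketch, aLock);
}

/* Same arithmetic as WALINDEX_HDR_SIZE. */
static size_t hdr_size_sketch(void){
  return sizeof(HdrSketch)*2 + sizeof(CkptSketch);
}

/* Check the layout against the numbers quoted in the WalIndexHdr comment. */
static void check_layout_sketch(void){
  assert( lock_offset_sketch()==120 );
  assert( hdr_size_sketch()==136 );
}
```

With these definitions each header copy is 48 bytes, so the two copies occupy 96 bytes and aLock begins 24 bytes into the checkpoint-info object.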

	9840

	9841 /* Size of header before each frame in wal */

	9842 #define WAL_FRAME_HDRSIZE 24

	9843

	9844 /* Size of write ahead log header, including checksum. */

	9845 /* #define WAL_HDRSIZE 24 */

	9846 #define WAL_HDRSIZE 32
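Reviewer note: the 32-byte WAL header layout listed in the file's header comment (eight big-endian 32-bit fields) can be decoded as in the sketch below. The struct `WalHeaderSketch` and the functions `get_be32` and `wal_header_decode` are hypothetical names for illustration. The sketch only checks the magic number and does not verify the header checksum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of the 32-byte WAL header. */
typedef struct WalHeaderSketch {
  uint32_t magic;     /* Offset 0:  0x377f0682 or 0x377f0683 */
  uint32_t version;   /* Offset 4:  file format version */
  uint32_t szPage;    /* Offset 8:  database page size */
  uint32_t nCkpt;     /* Offset 12: checkpoint sequence number */
  uint32_t salt1;     /* Offset 16 */
  uint32_t salt2;     /* Offset 20 */
  uint32_t cksum1;    /* Offset 24 */
  uint32_t cksum2;    /* Offset 28 */
} WalHeaderSketch;

/* Read one big-endian 32-bit unsigned integer. */
static uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8) | (uint32_t)p[3];
}

/* Decode aHdr[0..31] into *p. Return 1 if the magic number is one of
** the two recognized values, 0 otherwise. */
static int wal_header_decode(const unsigned char *aHdr, WalHeaderSketch *p){
  assert( aHdr!=0 && p!=0 );
  p->magic   = get_be32(&aHdr[0]);
  p->version = get_be32(&aHdr[4]);
  p->szPage  = get_be32(&aHdr[8]);
  p->nCkpt   = get_be32(&aHdr[12]);
  p->salt1   = get_be32(&aHdr[16]);
  p->salt2   = get_be32(&aHdr[20]);
  p->cksum1  = get_be32(&aHdr[24]);
  p->cksum2  = get_be32(&aHdr[28]);
  return (p->magic & 0xFFFFFFFEu)==0x377f0682u;
}
```

Masking off the least significant bit accepts both magic values, since they differ only in that bit.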

	9847

	9848 /* WAL magic value. Either this value, or the same value with the least

	9849 ** significant bit also set (WAL_MAGIC | 0x00000001) is stored in 32-bit

	9850 ** big-endian format in the first 4 bytes of a WAL file.

	9851 **

	9852 ** If the LSB is set, then the checksums for each frame within the WAL

	9853 ** file are calculated by treating all data as an array of 32-bit

	9854 ** big-endian words. Otherwise, they are calculated by interpreting

	9855 ** all data as 32-bit little-endian words.

	9856 */

	9857 #define WAL_MAGIC 0x377f0682
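Reviewer note: the byte-order rule in the comment above, combined with the checksum loop given in the wal.c header comment, can be sketched as follows. This is an unoptimized illustration. The names `read_u32` and `wal_checksum_sketch` are hypothetical, and the `bigEndian` flag is assumed to be derived from the least significant bit of the magic number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit word from 4 bytes in the requested byte order. */
static uint32_t read_u32(const unsigned char *p, int bigEndian){
  if( bigEndian ){
    return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
         | ((uint32_t)p[2]<<8) | (uint32_t)p[3];
  }
  return ((uint32_t)p[3]<<24) | ((uint32_t)p[2]<<16)
       | ((uint32_t)p[1]<<8) | (uint32_t)p[0];
}

/* Running two-word checksum over nByte bytes of a[]. nByte must be a
** multiple of 8 (an even number of 32-bit words). aCksum[] holds the
** running state on entry and the updated state on return, so calls can
** be chained across the header and successive frames. */
static void wal_checksum_sketch(
  int bigEndian, const unsigned char *a, size_t nByte, uint32_t aCksum[2]
){
  uint32_t s0 = aCksum[0];
  uint32_t s1 = aCksum[1];
  size_t i;
  assert( nByte%8==0 );
  for(i=0; i<nByte; i+=8){
    s0 += read_u32(&a[i], bigEndian) + s1;
    s1 += read_u32(&a[i+4], bigEndian) + s0;
  }
  aCksum[0] = s0;
  aCksum[1] = s1;
}
```

For the words 1, 2, 3, 4 starting from a zero state, the loop yields s0=7 and s1=14. Arithmetic wraps modulo 2^32 because the accumulators are unsigned.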

	9858

	9859 /*

	9860 ** Return the offset of frame iFrame in the write-ahead log file,

	9861 ** assuming a database page size of szPage bytes. The offset returned

	9862 ** is to the start of the write-ahead log frame-header.

	9863 */

	9864 #define walFrameOffset(iFrame, szPage) ( \

	9865 WAL_HDRSIZE + ((iFrame)-1)*(i64)((szPage)+WAL_FRAME_HDRSIZE) \

	9866 )
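To make the offset arithmetic concrete, here is a minimal standalone sketch of the same computation. `frame_offset` is a hypothetical name used only for illustration, the two size constants are copied from the defines above, and `int64_t` stands in for SQLite's `i64`.

```c
#include <assert.h>
#include <stdint.h>

#define WAL_HDRSIZE        32  /* bytes in the WAL file header */
#define WAL_FRAME_HDRSIZE  24  /* bytes in each frame header */

/* Hypothetical helper mirroring walFrameOffset(): byte offset of the
** frame-header of frame iFrame (frames are numbered from 1) when the
** database page size is szPage bytes. */
static int64_t frame_offset(uint32_t iFrame, uint32_t szPage){
  assert( iFrame>=1 );
  return WAL_HDRSIZE
       + (int64_t)(iFrame-1) * (int64_t)(szPage + WAL_FRAME_HDRSIZE);
}
```

With 4096-byte pages, frame 1 starts right after the 32-byte file header and each later frame is a further 4120 bytes along. The 64-bit cast matters because the product can exceed 32 bits for large WAL files.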

	9867

	9868 /*

	9869 ** An open write-ahead log file is represented by an instance of the

	9870 ** following object.

	9871 */

	9872 struct Wal {

	9873 sqlite3_vfs *pVfs; /* The VFS used to create pDbFd */

	9874 sqlite3_file *pDbFd; /* File handle for the database file */

	9875 sqlite3_file *pWalFd; /* File handle for WAL file */

	9876 u32 iCallback; /* Value to pass to log callback (or 0) */

	9877 i64 mxWalSize; /* Truncate WAL to this size upon reset */

	9878 int nWiData; /* Size of array apWiData */

	9879 int szFirstBlock; /* Size of first block written to WAL file */

	9880 volatile u32 **apWiData; /* Pointer to wal-index content in memory */

	9881 u32 szPage; /* Database page size */

	9882 i16 readLock; /* Which read lock is being held. -1 for none */

	9883 u8 syncFlags; /* Flags to use to sync header writes */

	9884 u8 exclusiveMode; /* Non-zero if connection is in exclusive mode */

	9885 u8 writeLock; /* True if in a write transaction */

	9886 u8 ckptLock; /* True if holding a checkpoint lock */

	9887 u8 readOnly; /* WAL_RDWR, WAL_RDONLY, or WAL_SHM_RDONLY */

	9888 u8 truncateOnCommit; /* True to truncate WAL file on commit */

	9889 u8 syncHeader; /* Fsync the WAL header if true */

	9890 u8 padToSectorBoundary; /* Pad transactions out to the next sector */

	9891 WalIndexHdr hdr; /* Wal-index header for current transaction */

	9892 u32 minFrame; /* Ignore wal frames before this one */

	9893 u32 iReCksum; /* On commit, recalculate checksums from here */

	9894 const char *zWalName; /* Name of WAL file */

	9895 u32 nCkpt; /* Checkpoint sequence counter in the wal-header */

	9896 #ifdef SQLITE_DEBUG

	9897 u8 lockError; /* True if a locking error has occurred */

	9898 #endif

	9899 #ifdef SQLITE_ENABLE_SNAPSHOT

	9900 WalIndexHdr *pSnapshot; /* Start transaction here if not NULL */

	9901 #endif

	9902 };

	9903

	9904 /*

	9905 ** Candidate values for Wal.exclusiveMode.

	9906 */

	9907 #define WAL_NORMAL_MODE 0

	9908 #define WAL_EXCLUSIVE_MODE 1

	9909 #define WAL_HEAPMEMORY_MODE 2

	9910

	9911 /*

	9912 ** Possible values for WAL.readOnly

	9913 */

	9914 #define WAL_RDWR 0 /* Normal read/write connection */

	9915 #define WAL_RDONLY 1 /* The WAL file is readonly */

	9916 #define WAL_SHM_RDONLY 2 /* The SHM file is readonly */

	9917

	9918 /*

	9919 ** Each page of the wal-index mapping contains a hash-table made up of

	9920 ** an array of HASHTABLE_NSLOT elements of the following type.

	9921 */

	9922 typedef u16 ht_slot;

	9923

	9924 /*

	9925 ** This structure is used to implement an iterator that loops through

	9926 ** all frames in the WAL in database page order. Where two or more frames

	9927 ** correspond to the same database page, the iterator visits only the

	9928 ** frame most recently written to the WAL (in other words, the frame with

	9929 ** the largest index).

	9930 **

	9931 ** The internals of this structure are only accessed by:

	9932 **

	9933 ** walIteratorInit() - Create a new iterator,

	9934 ** walIteratorNext() - Step an iterator,

	9935 ** walIteratorFree() - Free an iterator.

	9936 **

	9937 ** This functionality is used by the checkpoint code (see walCheckpoint()).

	9938 */

	9939 struct WalIterator {

	9940 int iPrior; /* Last result returned from the iterator */

	9941 int nSegment; /* Number of entries in aSegment[] */

	9942 struct WalSegment {

	9943 int iNext; /* Next slot in aIndex[] not yet returned */

	9944 ht_slot *aIndex; /* i0, i1, i2... such that aPgno[iN] ascend */

	9945 u32 *aPgno; /* Array of page numbers. */

	9946 int nEntry; /* Nr. of entries in aPgno[] and aIndex[] */

	9947 int iZero; /* Frame number associated with aPgno[0] */

	9948 } aSegment[1]; /* One for every 32KB page in the wal-index */

	9949 };

	9950

	9951 /*

	9952 ** Define the parameters of the hash tables in the wal-index file. There

	9953 ** is a hash-table following every HASHTABLE_NPAGE page numbers in the

	9954 ** wal-index.

	9955 **

	9956 ** Changing any of these constants will alter the wal-index format and

	9957 ** create incompatibilities.

	9958 */

	9959 #define HASHTABLE_NPAGE 4096 /* Must be power of 2 */

	9960 #define HASHTABLE_HASH_1 383 /* Should be prime */

	9961 #define HASHTABLE_NSLOT (HASHTABLE_NPAGE*2) /* Must be a power of 2 */

	9962

	9963 /*

	9964 ** The block of page numbers associated with the first hash-table in a

	9965 ** wal-index is smaller than usual. This is so that there is a complete

	9966 ** hash-table on each aligned 32KB page of the wal-index.

	9967 */

	9968 #define HASHTABLE_NPAGE_ONE (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/sizeof(u32)))

	9969

	9970 /* The wal-index is divided into pages of WALINDEX_PGSZ bytes each. */

	9971 #define WALINDEX_PGSZ ( \

	9972 sizeof(ht_slot)*HASHTABLE_NSLOT + HASHTABLE_NPAGE*sizeof(u32) \

	9973 )

	9974

	9975 /*

	9976 ** Obtain a pointer to the iPage'th page of the wal-index. The wal-index

	9977 ** is broken into pages of WALINDEX_PGSZ bytes. Wal-index pages are

	9978 ** numbered from zero.

	9979 **

	9980 ** If this call is successful, *ppPage is set to point to the wal-index

	9981 ** page and SQLITE_OK is returned. If an error (an OOM or VFS error) occurs,

	9982 ** then an SQLite error code is returned and *ppPage is set to 0.

	9983 */

	9984 static int walIndexPage(Wal *pWal, int iPage, volatile u32 **ppPage){

	9985 int rc = SQLITE_OK;

	9986

	9987 /* Enlarge the pWal->apWiData[] array if required */

	9988 if( pWal->nWiData<=iPage ){

	9989 int nByte = sizeof(u32*)*(iPage+1);

	9990 volatile u32 **apNew;

	9991 apNew = (volatile u32 **)sqlite3_realloc64((void *)pWal->apWiData, nByte);

	9992 if( !apNew ){

	9993 *ppPage = 0;

	9994 return SQLITE_NOMEM_BKPT;

	9995 }

	9996 memset((void*)&apNew[pWal->nWiData], 0,

	9997 sizeof(u32*)*(iPage+1-pWal->nWiData));

	9998 pWal->apWiData = apNew;

	9999 pWal->nWiData = iPage+1;

	10000 }

	10001

	10002 /* Request a pointer to the required page from the VFS */

	10003 if( pWal->apWiData[iPage]==0 ){

	10004 if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ){

	10005 pWal->apWiData[iPage] = (u32 volatile *)sqlite3MallocZero(WALINDEX_PGSZ);

	10006 if( !pWal->apWiData[iPage] ) rc = SQLITE_NOMEM_BKPT;

	10007 }else{

	10008 rc = sqlite3OsShmMap(pWal->pDbFd, iPage, WALINDEX_PGSZ,

	10009 pWal->writeLock, (void volatile **)&pWal->apWiData[iPage]

	10010 );

	10011 if( rc==SQLITE_READONLY ){

	10012 pWal->readOnly \|= WAL_SHM_RDONLY;

	10013 rc = SQLITE_OK;

	10014 }

	10015 }

	10016 }

	10017

	10018 *ppPage = pWal->apWiData[iPage];

	10019 assert( iPage==0 \|\| *ppPage \|\| rc!=SQLITE_OK );

	10020 return rc;

	10021 }

	10022

	10023 /*

	10024 ** Return a pointer to the WalCkptInfo structure in the wal-index.

	10025 */

	10026 static volatile WalCkptInfo *walCkptInfo(Wal *pWal){

	10027 assert( pWal->nWiData>0 && pWal->apWiData[0] );

	10028 return (volatile WalCkptInfo*)&(pWal->apWiData[0][sizeof(WalIndexHdr)/2]);

	10029 }

	10030

	10031 /*

	10032 ** Return a pointer to the WalIndexHdr structure in the wal-index.

	10033 */

	10034 static volatile WalIndexHdr *walIndexHdr(Wal *pWal){

	10035 assert( pWal->nWiData>0 && pWal->apWiData[0] );

	10036 return (volatile WalIndexHdr*)pWal->apWiData[0];

	10037 }

	10038

	10039 /*

	10040 ** The argument to this macro must be of type u32. On a little-endian

	10041 ** architecture, it returns the u32 value that results from interpreting

	10042 ** the 4 bytes as a big-endian value. On a big-endian architecture, it

	10043 ** returns the value that would be produced by interpreting the 4 bytes

	10044 ** of the input value as a little-endian integer.

	10045 */

	10046 #define BYTESWAP32(x) ( \

	10047 (((x)&0x000000FF)<<24) + (((x)&0x0000FF00)<<8) \

	10048 + (((x)&0x00FF0000)>>8) + (((x)&0xFF000000)>>24) \

	10049 )
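As a quick illustration of what the macro computes, the sketch below restates it as a function. `byteswap32` is a hypothetical name, not part of the file under review; the masks and shifts are the same as in BYTESWAP32 above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function form of BYTESWAP32(): reverse the order of
** the four bytes in x. Applying it twice gives back the original. */
static uint32_t byteswap32(uint32_t x){
  return ((x & 0x000000FFu) << 24) + ((x & 0x0000FF00u) << 8)
       + ((x & 0x00FF0000u) >> 8)  + ((x & 0xFF000000u) >> 24);
}
```

Because the four masked fields never overlap after shifting, adding them is equivalent to OR-ing them.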

	10050

	10051 /*

	10052 ** Generate or extend an 8 byte checksum based on the data in

	10053 ** array aByte[] and the initial values of aIn[0] and aIn[1] (or

	10054 ** initial values of 0 and 0 if aIn==NULL).

	10055 **

	10056 ** The checksum is written back into aOut[] before returning.

	10057 **

	10058 ** nByte must be a positive multiple of 8.

	10059 */

	10060 static void walChecksumBytes(

	10061 int nativeCksum, /* True for native byte-order, false for non-native */

	10062 u8 *a, /* Content to be checksummed */

	10063 int nByte, /* Bytes of content in a[]. Must be a multiple of 8. */

	10064 const u32 *aIn, /* Initial checksum value input */

	10065 u32 *aOut /* OUT: Final checksum value output */

	10066 ){

	10067 u32 s1, s2;

	10068 u32 *aData = (u32 *)a;

	10069 u32 *aEnd = (u32 *)&a[nByte];

	10070

	10071 if( aIn ){

	10072 s1 = aIn[0];

	10073 s2 = aIn[1];

	10074 }else{

	10075 s1 = s2 = 0;

	10076 }

	10077

	10078 assert( nByte>=8 );

	10079 assert( (nByte&0x00000007)==0 );

	10080

	10081 if( nativeCksum ){

	10082 do {

	10083 s1 += *aData++ + s2;

	10084 s2 += *aData++ + s1;

	10085 }while( aData<aEnd );

	10086 }else{

	10087 do {

	10088 s1 += BYTESWAP32(aData[0]) + s2;

	10089 s2 += BYTESWAP32(aData[1]) + s1;

	10090 aData += 2;

	10091 }while( aData<aEnd );

	10092 }

	10093

	10094 aOut[0] = s1;

	10095 aOut[1] = s2;

	10096 }
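The running-checksum behaviour (feeding the previous result back in as aIn) may be easier to see in isolation. The following is a minimal sketch of the native-byte-order branch only; `checksum_words` is a hypothetical name, and it takes a word count rather than a byte count to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restatement of the native branch of walChecksumBytes().
** nWord must be a positive even number (8 bytes are consumed per step).
** aIn may be NULL, meaning a starting value of (0, 0). aIn and aOut may
** point at the same array, which is how a running checksum is extended. */
static void checksum_words(
  const uint32_t *aData,   /* Words to be checksummed */
  int nWord,               /* Number of 32-bit words in aData[] */
  const uint32_t *aIn,     /* Initial checksum, or NULL */
  uint32_t *aOut           /* OUT: resulting checksum (two words) */
){
  uint32_t s1 = aIn ? aIn[0] : 0;
  uint32_t s2 = aIn ? aIn[1] : 0;
  int i;
  assert( nWord>=2 && (nWord & 1)==0 );
  for(i=0; i<nWord; i+=2){
    s1 += aData[i] + s2;
    s2 += aData[i+1] + s1;
  }
  aOut[0] = s1;
  aOut[1] = s2;
}
```

Checksumming a buffer in one call gives the same result as checksumming it in pieces while carrying the intermediate value forward. walEncodeFrame() and walDecodeFrame() rely on this when they checksum the frame header and then the page data.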

	10097

	10098 static void walShmBarrier(Wal *pWal){

	10099 if( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE ){

	10100 sqlite3OsShmBarrier(pWal->pDbFd);

	10101 }

	10102 }

	10103

	10104 /*

	10105 ** Write the header information in pWal->hdr into the wal-index.

	10106 **

	10107 ** The checksum on pWal->hdr is updated before it is written.

	10108 */

	10109 static void walIndexWriteHdr(Wal *pWal){

	10110 volatile WalIndexHdr *aHdr = walIndexHdr(pWal);

	10111 const int nCksum = offsetof(WalIndexHdr, aCksum);

	10112

	10113 assert( pWal->writeLock );

	10114 pWal->hdr.isInit = 1;

	10115 pWal->hdr.iVersion = WALINDEX_MAX_VERSION;

	10116 walChecksumBytes(1, (u8*)&pWal->hdr, nCksum, 0, pWal->hdr.aCksum);

	10117 memcpy((void *)&aHdr[1], (const void *)&pWal->hdr, sizeof(WalIndexHdr));

	10118 walShmBarrier(pWal);

	10119 memcpy((void *)&aHdr[0], (const void *)&pWal->hdr, sizeof(WalIndexHdr));

	10120 }

	10121

	10122 /*

	10123 ** This function encodes a single frame header and writes it to a buffer

	10124 ** supplied by the caller. A frame-header is made up of a series of

	10125 ** 4-byte big-endian integers, as follows:

	10126 **

	10127 ** 0: Page number.

	10128 ** 4: For commit records, the size of the database image in pages

	10129 ** after the commit. For all other records, zero.

	10130 ** 8: Salt-1 (copied from the wal-header)

	10131 ** 12: Salt-2 (copied from the wal-header)

	10132 ** 16: Checksum-1.

	10133 ** 20: Checksum-2.

	10134 */

	10135 static void walEncodeFrame(

	10136 Wal *pWal, /* The write-ahead log */

	10137 u32 iPage, /* Database page number for frame */

	10138 u32 nTruncate, /* New db size (or 0 for non-commit frames) */

	10139 u8 *aData, /* Pointer to page data */

	10140 u8 *aFrame /* OUT: Write encoded frame here */

	10141 ){

	10142 int nativeCksum; /* True for native byte-order checksums */

	10143 u32 *aCksum = pWal->hdr.aFrameCksum;

	10144 assert( WAL_FRAME_HDRSIZE==24 );

	10145 sqlite3Put4byte(&aFrame[0], iPage);

	10146 sqlite3Put4byte(&aFrame[4], nTruncate);

	10147 if( pWal->iReCksum==0 ){

	10148 memcpy(&aFrame[8], pWal->hdr.aSalt, 8);

	10149

	10150 nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);

	10151 walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);

	10152 walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);

	10153

	10154 sqlite3Put4byte(&aFrame[16], aCksum[0]);

	10155 sqlite3Put4byte(&aFrame[20], aCksum[1]);

	10156 }else{

	10157 memset(&aFrame[8], 0, 16);

	10158 }

	10159 }

	10160

	10161 /*

	10162 ** Check to see if the frame with header in aFrame[] and content

	10163 ** in aData[] is valid. If it is a valid frame, fill *piPage and

	10164 ** *pnTruncate and return true. Return false if the frame is not valid.

	10165 */

	10166 static int walDecodeFrame(

	10167 Wal *pWal, /* The write-ahead log */

	10168 u32 *piPage, /* OUT: Database page number for frame */

	10169 u32 *pnTruncate, /* OUT: New db size (or 0 if not commit) */

	10170 u8 *aData, /* Pointer to page data (for checksum) */

	10171 u8 *aFrame /* Frame data */

	10172 ){

	10173 int nativeCksum; /* True for native byte-order checksums */

	10174 u32 *aCksum = pWal->hdr.aFrameCksum;

	10175 u32 pgno; /* Page number of the frame */

	10176 assert( WAL_FRAME_HDRSIZE==24 );

	10177

	10178 /* A frame is only valid if the salt values in the frame-header

	10179 ** match the salt values in the wal-header.

	10180 */

	10181 if( memcmp(&pWal->hdr.aSalt, &aFrame[8], 8)!=0 ){

	10182 return 0;

	10183 }

	10184

	10185 /* A frame is only valid if the page number is greater than zero.

	10186 */

	10187 pgno = sqlite3Get4byte(&aFrame[0]);

	10188 if( pgno==0 ){

	10189 return 0;

	10190 }

	10191

	10192 /* A frame is only valid if a checksum of the WAL header,

	10193 ** all prior frames, the first 16 bytes of this frame-header,

	10194 ** and the frame-data matches the checksum in the last 8

	10195 ** bytes of this frame-header.

	10196 */

	10197 nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);

	10198 walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);

	10199 walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);

	10200 if( aCksum[0]!=sqlite3Get4byte(&aFrame[16])

	10201 \|\| aCksum[1]!=sqlite3Get4byte(&aFrame[20])

	10202 ){

	10203 /* Checksum failed. */

	10204 return 0;

	10205 }

	10206

	10207 /* If we reach this point, the frame is valid. Return the page number

	10208 ** and the new database size.

	10209 */

	10210 *piPage = pgno;

	10211 *pnTruncate = sqlite3Get4byte(&aFrame[4]);

	10212 return 1;

	10213 }

	10214

	10215

	10216 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)

	10217 /*

	10218 ** Names of locks. This routine is used to provide debugging output and is not

	10219 ** a part of an ordinary build.

	10220 */

	10221 static const char *walLockName(int lockIdx){

	10222 if( lockIdx==WAL_WRITE_LOCK ){

	10223 return "WRITE-LOCK";

	10224 }else if( lockIdx==WAL_CKPT_LOCK ){

	10225 return "CKPT-LOCK";

	10226 }else if( lockIdx==WAL_RECOVER_LOCK ){

	10227 return "RECOVER-LOCK";

	10228 }else{

	10229 static char zName[15];

	10230 sqlite3_snprintf(sizeof(zName), zName, "READ-LOCK[%d]",

	10231 lockIdx-WAL_READ_LOCK(0));

	10232 return zName;

	10233 }

	10234 }

	10235 #endif /*defined(SQLITE_TEST) \|\| defined(SQLITE_DEBUG) */

	10236

	10237

	10238 /*

	10239 ** Set or release locks on the WAL. Locks are either shared or exclusive.

	10240 ** A lock cannot be moved directly between shared and exclusive - it must go

	10241 ** through the unlocked state first.

	10242 **

	10243 ** In locking_mode=EXCLUSIVE, all of these routines become no-ops.

	10244 */

	10245 static int walLockShared(Wal *pWal, int lockIdx){

	10246 int rc;

	10247 if( pWal->exclusiveMode ) return SQLITE_OK;

	10248 rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,

	10249 SQLITE_SHM_LOCK \| SQLITE_SHM_SHARED);

	10250 WALTRACE(("WAL%p: acquire SHARED-%s %s\n", pWal,

	10251 walLockName(lockIdx), rc ? "failed" : "ok"));

	10252 VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )

	10253 return rc;

	10254 }

	10255 static void walUnlockShared(Wal *pWal, int lockIdx){

	10256 if( pWal->exclusiveMode ) return;

	10257 (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,

	10258 SQLITE_SHM_UNLOCK \| SQLITE_SHM_SHARED);

	10259 WALTRACE(("WAL%p: release SHARED-%s\n", pWal, walLockName(lockIdx)));

	10260 }

	10261 static int walLockExclusive(Wal *pWal, int lockIdx, int n){

	10262 int rc;

	10263 if( pWal->exclusiveMode ) return SQLITE_OK;

	10264 rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,

	10265 SQLITE_SHM_LOCK \| SQLITE_SHM_EXCLUSIVE);

	10266 WALTRACE(("WAL%p: acquire EXCLUSIVE-%s cnt=%d %s\n", pWal,

	10267 walLockName(lockIdx), n, rc ? "failed" : "ok"));

	10268 VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )

	10269 return rc;

	10270 }

	10271 static void walUnlockExclusive(Wal *pWal, int lockIdx, int n){

	10272 if( pWal->exclusiveMode ) return;

	10273 (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,

	10274 SQLITE_SHM_UNLOCK \| SQLITE_SHM_EXCLUSIVE);

	10275 WALTRACE(("WAL%p: release EXCLUSIVE-%s cnt=%d\n", pWal,

	10276 walLockName(lockIdx), n));

	10277 }

	10278

	10279 /*

	10280 ** Compute a hash on a page number. The resulting hash value must land

	10281 ** between 0 and (HASHTABLE_NSLOT-1). The walNextHash() function advances

	10282 ** the hash to the next value in the event of a collision.

	10283 */

	10284 static int walHash(u32 iPage){

	10285 assert( iPage>0 );

	10286 assert( (HASHTABLE_NSLOT & (HASHTABLE_NSLOT-1))==0 );

	10287 return (iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1);

	10288 }

	10289 static int walNextHash(int iPriorHash){

	10290 return (iPriorHash+1)&(HASHTABLE_NSLOT-1);

	10291 }
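The two helpers above implement open addressing with linear probing. A minimal sketch of how a slot is claimed is given below; `hash_page`, `next_hash` and `hash_insert` are hypothetical names, the constants are copied from the defines earlier in this file, and the probe loop follows the pattern used later in walIndexAppend().

```c
#include <assert.h>
#include <stdint.h>

#define HASHTABLE_NPAGE   4096
#define HASHTABLE_HASH_1  383
#define HASHTABLE_NSLOT   (HASHTABLE_NPAGE*2)

/* Hash a page number into the range 0..HASHTABLE_NSLOT-1. */
static int hash_page(uint32_t iPage){
  assert( iPage>0 );
  return (int)((iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1));
}

/* Advance to the next slot, wrapping at the end of the table. */
static int next_hash(int iPriorHash){
  return (iPriorHash+1) & (HASHTABLE_NSLOT-1);
}

/* Hypothetical helper: store idx (a 1-based entry number) in the first
** free slot for iPage. Returns the slot used, or -1 if more than idx
** occupied slots are probed, which would indicate a corrupt table. */
static int hash_insert(uint16_t *aHash, uint32_t iPage, uint16_t idx){
  int nCollide = idx;
  int iKey;
  for(iKey=hash_page(iPage); aHash[iKey]; iKey=next_hash(iKey)){
    if( (nCollide--)==0 ) return -1;
  }
  aHash[iKey] = idx;
  return iKey;
}
```

Pages 1 and 8193 hash to the same slot, so the second insert lands in the following slot. A slot value of 0 means "empty", which is why entry numbers start at 1.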

	10292

	10293 /*

	10294 ** Return pointers to the hash table and page number array stored on

	10295 ** page iHash of the wal-index. The wal-index is broken into 32KB pages

	10296 ** numbered starting from 0.

	10297 **

	10298 ** Set output variable *paHash to point to the start of the hash table

	10299 ** in the wal-index file. Set *piZero to one less than the frame

	10300 ** number of the first frame indexed by this hash table. If a

	10301 ** slot in the hash table is set to N, it refers to frame number

	10302 ** (*piZero+N) in the log.

	10303 **

	10304 ** Finally, set *paPgno so that *paPgno[1] is the page number of the

	10305 ** first frame indexed by the hash table, frame (*piZero+1).

	10306 */

	10307 static int walHashGet(

	10308 Wal *pWal, /* WAL handle */

	10309 int iHash, /* Find the iHash'th table */

	10310 volatile ht_slot **paHash, /* OUT: Pointer to hash index */

	10311 volatile u32 **paPgno, /* OUT: Pointer to page number array */

	10312 u32 *piZero /* OUT: Frame associated with *paPgno[0] */

	10313 ){

	10314 int rc; /* Return code */

	10315 volatile u32 *aPgno;

	10316

	10317 rc = walIndexPage(pWal, iHash, &aPgno);

	10318 assert( rc==SQLITE_OK \|\| iHash>0 );

	10319

	10320 if( rc==SQLITE_OK ){

	10321 u32 iZero;

	10322 volatile ht_slot *aHash;

	10323

	10324 aHash = (volatile ht_slot *)&aPgno[HASHTABLE_NPAGE];

	10325 if( iHash==0 ){

	10326 aPgno = &aPgno[WALINDEX_HDR_SIZE/sizeof(u32)];

	10327 iZero = 0;

	10328 }else{

	10329 iZero = HASHTABLE_NPAGE_ONE + (iHash-1)*HASHTABLE_NPAGE;

	10330 }

	10331

	10332 *paPgno = &aPgno[-1];

	10333 *paHash = aHash;

	10334 *piZero = iZero;

	10335 }

	10336 return rc;

	10337 }

	10338

	10339 /*

	10340 ** Return the number of the wal-index page that contains the hash-table

	10341 ** and page-number array that contain entries corresponding to WAL frame

	10342 ** iFrame. The wal-index is broken up into 32KB pages. Wal-index pages

	10343 ** are numbered starting from 0.

	10344 */

	10345 static int walFramePage(u32 iFrame){

	10346 int iHash = (iFrame+HASHTABLE_NPAGE-HASHTABLE_NPAGE_ONE-1) / HASHTABLE_NPAGE;

	10347 assert( (iHash==0 \|\| iFrame>HASHTABLE_NPAGE_ONE)

	10348 && (iHash>=1 \|\| iFrame<=HASHTABLE_NPAGE_ONE)

	10349 && (iHash<=1 \|\| iFrame>(HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE))

	10350 && (iHash>=2 \|\| iFrame<=HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE)

	10351 && (iHash<=2 \|\| iFrame>(HASHTABLE_NPAGE_ONE+2*HASHTABLE_NPAGE))

	10352 );

	10353 return iHash;

	10354 }
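The rounding in walFramePage() is easier to follow with numbers. The sketch below assumes WALINDEX_HDR_SIZE is 136 bytes (two 48-byte WalIndexHdr copies plus a 40-byte WalCkptInfo); that figure is an assumption about the struct layouts, since WalIndexHdr is not shown in this part of the diff. `frame_page` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define HASHTABLE_NPAGE      4096
#define WALINDEX_HDR_SIZE    136   /* ASSUMED: 2*48 + 40 bytes */
#define HASHTABLE_NPAGE_ONE  (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/4))

/* Hypothetical restatement of walFramePage(): which wal-index page
** holds the entry for WAL frame iFrame (frames are numbered from 1).
** Page 0 holds fewer entries because it also carries the header. */
static int frame_page(uint32_t iFrame){
  assert( iFrame>=1 );
  return (int)((iFrame + HASHTABLE_NPAGE - HASHTABLE_NPAGE_ONE - 1)
               / HASHTABLE_NPAGE);
}
```

Under that assumption the first wal-index page covers frames 1 through 4062, and every later page covers the next 4096 frames.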

	10355

	10356 /*

	10357 ** Return the page number associated with frame iFrame in this WAL.

	10358 */

	10359 static u32 walFramePgno(Wal *pWal, u32 iFrame){

	10360 int iHash = walFramePage(iFrame);

	10361 if( iHash==0 ){

	10362 return pWal->apWiData[0][WALINDEX_HDR_SIZE/sizeof(u32) + iFrame - 1];

	10363 }

	10364 return pWal->apWiData[iHash][(iFrame-1-HASHTABLE_NPAGE_ONE)%HASHTABLE_NPAGE];

	10365 }

	10366

	10367 /*

	10368 ** Remove entries from the hash table that point to WAL slots greater

	10369 ** than pWal->hdr.mxFrame.

	10370 **

	10371 ** This function is called whenever pWal->hdr.mxFrame is decreased due

	10372 ** to a rollback or savepoint.

	10373 **

	10374 ** At most only the hash table containing pWal->hdr.mxFrame needs to be

	10375 ** updated. Any later hash tables will be automatically cleared when

	10376 ** pWal->hdr.mxFrame advances to the point where those hash tables are

	10377 ** actually needed.

	10378 */

	10379 static void walCleanupHash(Wal *pWal){

	10380 volatile ht_slot *aHash = 0; /* Pointer to hash table to clear */

	10381 volatile u32 *aPgno = 0; /* Page number array for hash table */

	10382 u32 iZero = 0; /* frame == (aHash[x]+iZero) */

	10383 int iLimit = 0; /* Zero values greater than this */

	10384 int nByte; /* Number of bytes to zero in aPgno[] */

	10385 int i; /* Used to iterate through aHash[] */

	10386

	10387 assert( pWal->writeLock );

	10388 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE-1 );

	10389 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE );

	10390 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE+1 );

	10391

	10392 if( pWal->hdr.mxFrame==0 ) return;

	10393

	10394 /* Obtain pointers to the hash-table and page-number array containing

	10395 ** the entry that corresponds to frame pWal->hdr.mxFrame. It is guaranteed

	10396 ** that the page said hash-table and array reside on is already mapped.

	10397 */

	10398 assert( pWal->nWiData>walFramePage(pWal->hdr.mxFrame) );

	10399 assert( pWal->apWiData[walFramePage(pWal->hdr.mxFrame)] );

	10400 walHashGet(pWal, walFramePage(pWal->hdr.mxFrame), &aHash, &aPgno, &iZero);

	10401

	10402 /* Zero all hash-table entries that correspond to frame numbers greater

	10403 ** than pWal->hdr.mxFrame.

	10404 */

	10405 iLimit = pWal->hdr.mxFrame - iZero;

	10406 assert( iLimit>0 );

	10407 for(i=0; i<HASHTABLE_NSLOT; i++){

	10408 if( aHash[i]>iLimit ){

	10409 aHash[i] = 0;

	10410 }

	10411 }

	10412

	10413 /* Zero the entries in the aPgno array that correspond to frames with

	10414 ** frame numbers greater than pWal->hdr.mxFrame.

	10415 */

	10416 nByte = (int)((char *)aHash - (char *)&aPgno[iLimit+1]);

	10417 memset((void *)&aPgno[iLimit+1], 0, nByte);

	10418

	10419 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT

	10420 /* Verify that every entry in the mapping region is still reachable

	10421 ** via the hash table even after the cleanup.

	10422 */

	10423 if( iLimit ){

	10424 int j; /* Loop counter */

	10425 int iKey; /* Hash key */

	10426 for(j=1; j<=iLimit; j++){

	10427 for(iKey=walHash(aPgno[j]); aHash[iKey]; iKey=walNextHash(iKey)){

	10428 if( aHash[iKey]==j ) break;

	10429 }

	10430 assert( aHash[iKey]==j );

	10431 }

	10432 }

	10433 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */

	10434 }

	10435

	10436

	10437 /*

	10438 ** Set an entry in the wal-index that will map database page number

	10439 ** iPage into WAL frame iFrame.

	10440 */

	10441 static int walIndexAppend(Wal *pWal, u32 iFrame, u32 iPage){

	10442 int rc; /* Return code */

	10443 u32 iZero = 0; /* One less than frame number of aPgno[1] */

	10444 volatile u32 *aPgno = 0; /* Page number array */

	10445 volatile ht_slot *aHash = 0; /* Hash table */

	10446

	10447 rc = walHashGet(pWal, walFramePage(iFrame), &aHash, &aPgno, &iZero);

	10448

	10449 /* Assuming the wal-index file was successfully mapped, populate the

	10450 ** page number array and hash table entry.

	10451 */

	10452 if( rc==SQLITE_OK ){

	10453 int iKey; /* Hash table key */

	10454 int idx; /* Value to write to hash-table slot */

	10455 int nCollide; /* Number of hash collisions */

	10456

	10457 idx = iFrame - iZero;

	10458 assert( idx <= HASHTABLE_NSLOT/2 + 1 );

	10459

	10460 /* If this is the first entry to be added to this hash-table, zero the

	10461 ** entire hash table and aPgno[] array before proceeding.

	10462 */

	10463 if( idx==1 ){

	10464 int nByte = (int)((u8 *)&aHash[HASHTABLE_NSLOT] - (u8 *)&aPgno[1]);

	10465 memset((void*)&aPgno[1], 0, nByte);

	10466 }

	10467

	10468 /* If the entry in aPgno[] is already set, then the previous writer

	10469 ** must have exited unexpectedly in the middle of a transaction (after

	10470 ** writing one or more dirty pages to the WAL to free up memory).

	10471 ** Remove the remnants of that writer's uncommitted transaction from

	10472 ** the hash-table before writing any new entries.

	10473 */

	10474 if( aPgno[idx] ){

	10475 walCleanupHash(pWal);

	10476 assert( !aPgno[idx] );

	10477 }

	10478

	10479 /* Write the aPgno[] array entry and the hash-table slot. */

	10480 nCollide = idx;

	10481 for(iKey=walHash(iPage); aHash[iKey]; iKey=walNextHash(iKey)){

	10482 if( (nCollide--)==0 ) return SQLITE_CORRUPT_BKPT;

	10483 }

	10484 aPgno[idx] = iPage;

	10485 aHash[iKey] = (ht_slot)idx;

	10486

	10487 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT

	10488 /* Verify that the number of entries in the hash table exactly equals

	10489 ** the number of entries in the mapping region.

	10490 */

	10491 {

	10492 int i; /* Loop counter */

	10493 int nEntry = 0; /* Number of entries in the hash table */

	10494 for(i=0; i<HASHTABLE_NSLOT; i++){ if( aHash[i] ) nEntry++; }

	10495 assert( nEntry==idx );

	10496 }

	10497

	10498 /* Verify that every entry in the mapping region is reachable

	10499 ** via the hash table. This turns out to be a really, really expensive

	10500 ** thing to check, so only do this occasionally - not on every

	10501 ** iteration.

	10502 */

	10503 if( (idx&0x3ff)==0 ){

	10504 int i; /* Loop counter */

	10505 for(i=1; i<=idx; i++){

	10506 for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){

	10507 if( aHash[iKey]==i ) break;

	10508 }

	10509 assert( aHash[iKey]==i );

	10510 }

	10511 }

	10512 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */

	10513 }

	10514

	10515

	10516 return rc;

	10517 }

	10518

	10519

	10520 /*

	10521 ** Recover the wal-index by reading the write-ahead log file.

	10522 **

	10523 ** This routine first tries to establish an exclusive lock on the

	10524 ** wal-index to prevent other threads/processes from doing anything

	10525 ** with the WAL or wal-index while recovery is running. The

	10526 ** WAL_RECOVER_LOCK is also held so that other threads will know

	10527 ** that this thread is running recovery. If unable to establish

	10528 ** the necessary locks, this routine returns SQLITE_BUSY.

	10529 */

	10530 static int walIndexRecover(Wal *pWal){

	10531 int rc; /* Return Code */

	10532 i64 nSize; /* Size of log file */

	10533 u32 aFrameCksum[2] = {0, 0};

	10534 int iLock; /* Lock offset to lock for checkpoint */

	10535 int nLock; /* Number of locks to hold */

	10536

	10537 /* Obtain an exclusive lock on all bytes in the locking range not already

	10538 ** locked by the caller. The caller is guaranteed to have locked the

	10539 ** WAL_WRITE_LOCK byte, and may have also locked the WAL_CKPT_LOCK byte.

	10540 ** If successful, the same bytes that are locked here are unlocked before

	10541 ** this function returns.

	10542 */

	10543 assert( pWal->ckptLock==1 \|\| pWal->ckptLock==0 );

	10544 assert( WAL_ALL_BUT_WRITE==WAL_WRITE_LOCK+1 );

	10545 assert( WAL_CKPT_LOCK==WAL_ALL_BUT_WRITE );

	10546 assert( pWal->writeLock );

	10547 iLock = WAL_ALL_BUT_WRITE + pWal->ckptLock;

	10548 nLock = SQLITE_SHM_NLOCK - iLock;

	10549 rc = walLockExclusive(pWal, iLock, nLock);

	10550 if( rc ){

	10551 return rc;

	10552 }

	10553 WALTRACE(("WAL%p: recovery begin...\n", pWal));

	10554

	10555 memset(&pWal->hdr, 0, sizeof(WalIndexHdr));

	10556

	10557 rc = sqlite3OsFileSize(pWal->pWalFd, &nSize);

	10558 if( rc!=SQLITE_OK ){

	10559 goto recovery_error;

	10560 }

	10561

	10562 if( nSize>WAL_HDRSIZE ){

	10563 u8 aBuf[WAL_HDRSIZE]; /* Buffer to load WAL header into */

	10564 u8 *aFrame = 0; /* Malloc'd buffer to load entire frame */

	10565 int szFrame; /* Number of bytes in buffer aFrame[] */

	10566 u8 *aData; /* Pointer to data part of aFrame buffer */

	10567 int iFrame; /* Index of last frame read */

	10568 i64 iOffset; /* Next offset to read from log file */

	10569 int szPage; /* Page size according to the log */

	10570 u32 magic; /* Magic value read from WAL header */

	10571 u32 version; /* Version number read from WAL header */

	10572 int isValid; /* True if this frame is valid */

	10573

	10574 /* Read in the WAL header. */

	10575 rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0);

	10576 if( rc!=SQLITE_OK ){

	10577 goto recovery_error;

	10578 }

	10579

	10580 /* If the database page size is not a power of two, or is greater than

	10581 ** SQLITE_MAX_PAGE_SIZE, conclude that the WAL file contains no valid

	10582 ** data. Similarly, if the 'magic' value is invalid, ignore the whole

	10583 ** WAL file.

	10584 */

	10585 magic = sqlite3Get4byte(&aBuf[0]);

	10586 szPage = sqlite3Get4byte(&aBuf[8]);

	10587 if( (magic&0xFFFFFFFE)!=WAL_MAGIC

	10588 \|\| szPage&(szPage-1)

	10589 \|\| szPage>SQLITE_MAX_PAGE_SIZE

	10590 \|\| szPage<512

	10591 ){

	10592 goto finished;

	10593 }

	10594 pWal->hdr.bigEndCksum = (u8)(magic&0x00000001);

	10595 pWal->szPage = szPage;

	10596 pWal->nCkpt = sqlite3Get4byte(&aBuf[12]);

	10597 memcpy(&pWal->hdr.aSalt, &aBuf[16], 8);

	10598

	10599 /* Verify that the WAL header checksum is correct */

	10600 walChecksumBytes(pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN,

	10601 aBuf, WAL_HDRSIZE-2*4, 0, pWal->hdr.aFrameCksum

	10602 );

	10603 if( pWal->hdr.aFrameCksum[0]!=sqlite3Get4byte(&aBuf[24])

	10604 \|\| pWal->hdr.aFrameCksum[1]!=sqlite3Get4byte(&aBuf[28])

	10605 ){

	10606 goto finished;

	10607 }

	10608

	10609 /* Verify that the version number on the WAL format is one that

	10610 ** we are able to understand */

	10611 version = sqlite3Get4byte(&aBuf[4]);

	10612 if( version!=WAL_MAX_VERSION ){

	10613 rc = SQLITE_CANTOPEN_BKPT;

	10614 goto finished;

	10615 }

	10616

	10617 /* Malloc a buffer to read frames into. */

	10618 szFrame = szPage + WAL_FRAME_HDRSIZE;

	10619 aFrame = (u8 *)sqlite3_malloc64(szFrame);

	10620 if( !aFrame ){

	10621 rc = SQLITE_NOMEM_BKPT;

	10622 goto recovery_error;

	10623 }

	10624 aData = &aFrame[WAL_FRAME_HDRSIZE];

	10625

	10626 /* Read all frames from the log file. */

	10627 iFrame = 0;

	10628 for(iOffset=WAL_HDRSIZE; (iOffset+szFrame)<=nSize; iOffset+=szFrame){

	10629 u32 pgno; /* Database page number for frame */

	10630 u32 nTruncate; /* dbsize field from frame header */

	10631

	10632 /* Read and decode the next log frame. */

	10633 iFrame++;

	10634 rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset);

	10635 if( rc!=SQLITE_OK ) break;

	10636 isValid = walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame);

	10637 if( !isValid ) break;

	10638 rc = walIndexAppend(pWal, iFrame, pgno);

	10639 if( rc!=SQLITE_OK ) break;

	10640

	10641 /* If nTruncate is non-zero, this is a commit record. */

	10642 if( nTruncate ){

	10643 pWal->hdr.mxFrame = iFrame;

	10644 pWal->hdr.nPage = nTruncate;

	10645 pWal->hdr.szPage = (u16)((szPage&0xff00) \| (szPage>>16));

	10646 testcase( szPage<=32768 );

	10647 testcase( szPage>=65536 );

	10648 aFrameCksum[0] = pWal->hdr.aFrameCksum[0];

	10649 aFrameCksum[1] = pWal->hdr.aFrameCksum[1];

	10650 }

	10651 }

	10652

	10653 sqlite3_free(aFrame);

	10654 }

	10655

	10656 finished:

	10657 if( rc==SQLITE_OK ){

	10658 volatile WalCkptInfo *pInfo;

	10659 int i;

	10660 pWal->hdr.aFrameCksum[0] = aFrameCksum[0];

	10661 pWal->hdr.aFrameCksum[1] = aFrameCksum[1];

	10662 walIndexWriteHdr(pWal);

	10663

	10664 /* Reset the checkpoint-header. This is safe because this thread is

	10665 ** currently holding locks that exclude all other readers, writers and

	10666 ** checkpointers.

	10667 */

	10668 pInfo = walCkptInfo(pWal);

	10669 pInfo->nBackfill = 0;

	10670 pInfo->nBackfillAttempted = pWal->hdr.mxFrame;

	10671 pInfo->aReadMark[0] = 0;

	10672 for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;

	10673 if( pWal->hdr.mxFrame ) pInfo->aReadMark[1] = pWal->hdr.mxFrame;

	10674

	10675 /* If more than one frame was recovered from the log file, report an

	10676 ** event via sqlite3_log(). This is to help with identifying performance

	10677 ** problems caused by applications routinely shutting down without

	10678 ** checkpointing the log file.

	10679 */

	10680 if( pWal->hdr.nPage ){

	10681 sqlite3_log(SQLITE_NOTICE_RECOVER_WAL,

	10682 "recovered %d frames from WAL file %s",

	10683 pWal->hdr.mxFrame, pWal->zWalName

	10684 );

	10685 }

	10686 }

	10687

	10688 recovery_error:

	10689 WALTRACE(("WAL%p: recovery %s\n", pWal, rc ? "failed" : "ok"));

	10690 walUnlockExclusive(pWal, iLock, nLock);

	10691 return rc;

	10692 }

	10693

	10694 /*

	10695 ** Close an open wal-index.

	10696 */

	10697 static void walIndexClose(Wal *pWal, int isDelete){

	10698 if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ){

	10699 int i;

	10700 for(i=0; i<pWal->nWiData; i++){

	10701 sqlite3_free((void *)pWal->apWiData[i]);

	10702 pWal->apWiData[i] = 0;

	10703 }

	10704 }else{

	10705 sqlite3OsShmUnmap(pWal->pDbFd, isDelete);

	10706 }

	10707 }

	10708

	10709 /*

	10710 ** Open a connection to the WAL file zWalName. The database file must

	10711 ** already be opened on connection pDbFd. The buffer that zWalName points

	10712 ** to must remain valid for the lifetime of the returned Wal* handle.

	10713 **

	10714 ** A SHARED lock should be held on the database file when this function

	10715 ** is called. The purpose of this SHARED lock is to prevent any other

	10716 ** client from unlinking the WAL or wal-index file. If another process

	10717 ** were to do this just after this client opened one of these files, the

	10718 ** system would be badly broken.

	10719 **

	10720 ** If the log file is successfully opened, SQLITE_OK is returned and

	10721 ** *ppWal is set to point to a new WAL handle. If an error occurs,

	10722 ** an SQLite error code is returned and *ppWal is left unmodified.

	10723 */

	10724 SQLITE_PRIVATE int sqlite3WalOpen(

	10725 sqlite3_vfs *pVfs, /* vfs module to open wal and wal-index */

	10726 sqlite3_file *pDbFd, /* The open database file */

	10727 const char *zWalName, /* Name of the WAL file */

	10728 int bNoShm, /* True to run in heap-memory mode */

	10729 i64 mxWalSize, /* Truncate WAL to this size on reset */

	10730 Wal **ppWal /* OUT: Allocated Wal handle */

	10731 ){

	10732 int rc; /* Return Code */

	10733 Wal *pRet; /* Object to allocate and return */

	10734 int flags; /* Flags passed to OsOpen() */

	10735

	10736 assert( zWalName && zWalName[0] );

	10737 assert( pDbFd );

	10738

	10739 /* In the amalgamation, the os_unix.c and os_win.c source files come before

	10740 ** this source file. Verify that the #defines of the locking byte offsets

	10741 ** in os_unix.c and os_win.c agree with the WALINDEX_LOCK_OFFSET value.

	10742 ** For that matter, if the lock offset ever changes from its initial design

	10743 ** value of 120, we need to know that so there is an assert() to check it.

	10744 */

	10745 assert( 120==WALINDEX_LOCK_OFFSET );

	10746 assert( 136==WALINDEX_HDR_SIZE );

	10747 #ifdef WIN_SHM_BASE

	10748 assert( WIN_SHM_BASE==WALINDEX_LOCK_OFFSET );

	10749 #endif

	10750 #ifdef UNIX_SHM_BASE

	10751 assert( UNIX_SHM_BASE==WALINDEX_LOCK_OFFSET );

	10752 #endif

	10753

	10754

	10755 /* Allocate an instance of struct Wal to return. */

	10756 *ppWal = 0;

	10757 pRet = (Wal*)sqlite3MallocZero(sizeof(Wal) + pVfs->szOsFile);

	10758 if( !pRet ){

	10759 return SQLITE_NOMEM_BKPT;

	10760 }

	10761

	10762 pRet->pVfs = pVfs;

	10763 pRet->pWalFd = (sqlite3_file *)&pRet[1];

	10764 pRet->pDbFd = pDbFd;

	10765 pRet->readLock = -1;

	10766 pRet->mxWalSize = mxWalSize;

	10767 pRet->zWalName = zWalName;

	10768 pRet->syncHeader = 1;

	10769 pRet->padToSectorBoundary = 1;

	10770 pRet->exclusiveMode = (bNoShm ? WAL_HEAPMEMORY_MODE: WAL_NORMAL_MODE);

	10771

	10772 /* Open file handle on the write-ahead log file. */

	10773 flags = (SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|SQLITE_OPEN_WAL);

	10774 rc = sqlite3OsOpen(pVfs, zWalName, pRet->pWalFd, flags, &flags);

	10775 if( rc==SQLITE_OK && flags&SQLITE_OPEN_READONLY ){

	10776 pRet->readOnly = WAL_RDONLY;

	10777 }

	10778

	10779 if( rc!=SQLITE_OK ){

	10780 walIndexClose(pRet, 0);

	10781 sqlite3OsClose(pRet->pWalFd);

	10782 sqlite3_free(pRet);

	10783 }else{

	10784 int iDC = sqlite3OsDeviceCharacteristics(pDbFd);

	10785 if( iDC & SQLITE_IOCAP_SEQUENTIAL ){ pRet->syncHeader = 0; }

	10786 if( iDC & SQLITE_IOCAP_POWERSAFE_OVERWRITE ){

	10787 pRet->padToSectorBoundary = 0;

	10788 }

	10789 *ppWal = pRet;

	10790 WALTRACE(("WAL%d: opened\n", pRet));

	10791 }

	10792 return rc;

	10793 }

	10794

	10795 /*

	10796 ** Change the size to which the WAL file is truncated on each reset.

	10797 */

	10798 SQLITE_PRIVATE void sqlite3WalLimit(Wal *pWal, i64 iLimit){

	10799 if( pWal ) pWal->mxWalSize = iLimit;

	10800 }

	10801

	10802 /*

	10803 ** Find the smallest page number out of all pages held in the WAL that

	10804 ** has not been returned by any prior invocation of this method on the

	10805 ** same WalIterator object. Write into *piFrame the frame index where

	10806 ** that page was last written into the WAL. Write into *piPage the page

	10807 ** number.

	10808 **

	10809 ** Return 0 on success. If there are no pages in the WAL with a page

	10810 ** number larger than *piPage, then return 1.

	10811 */

	10812 static int walIteratorNext(

	10813 WalIterator *p, /* Iterator */

	10814 u32 *piPage, /* OUT: The page number of the next page */

	10815 u32 *piFrame /* OUT: Wal frame index of next page */

	10816 ){

	10817 u32 iMin; /* Result pgno must be greater than iMin */

	10818 u32 iRet = 0xFFFFFFFF; /* 0xffffffff is never a valid page number */

	10819 int i; /* For looping through segments */

	10820

	10821 iMin = p->iPrior;

	10822 assert( iMin<0xffffffff );

	10823 for(i=p->nSegment-1; i>=0; i--){

	10824 struct WalSegment *pSegment = &p->aSegment[i];

	10825 while( pSegment->iNext<pSegment->nEntry ){

	10826 u32 iPg = pSegment->aPgno[pSegment->aIndex[pSegment->iNext]];

	10827 if( iPg>iMin ){

	10828 if( iPg<iRet ){

	10829 iRet = iPg;

	10830 *piFrame = pSegment->iZero + pSegment->aIndex[pSegment->iNext];

	10831 }

	10832 break;

	10833 }

	10834 pSegment->iNext++;

	10835 }

	10836 }

	10837

	10838 *piPage = p->iPrior = iRet;

	10839 return (iRet==0xFFFFFFFF);

	10840 }

	10841

	10842 /*

	10843 ** This function merges two sorted lists into a single sorted list.

	10844 **

	10845 ** aLeft[] and aRight[] are arrays of indices. The sort key is

	10846 ** aContent[aLeft[]] and aContent[aRight[]]. Upon entry, the following

	10847 ** is guaranteed for all J<K:

	10848 **

	10849 ** aContent[aLeft[J]] < aContent[aLeft[K]]

	10850 ** aContent[aRight[J]] < aContent[aRight[K]]

	10851 **

	10852 ** This routine overwrites aRight[] with a new (probably longer) sequence

	10853 ** of indices such that the aRight[] contains every index that appears in

	10854 ** either aLeft[] or the old aRight[] and such that the second condition

	10855 ** above is still met.

	10856 **

	10857 ** The aContent[aLeft[X]] values will be unique for all X. And the

	10858 ** aContent[aRight[X]] values will be unique too. But there might be

	10859 ** one or more combinations of X and Y such that

	10860 **

	10861 ** aLeft[X]!=aRight[Y] && aContent[aLeft[X]] == aContent[aRight[Y]]

	10862 **

	10863 ** When that happens, omit the aLeft[X] and use the aRight[Y] index.

	10864 */

	10865 static void walMerge(

	10866 const u32 *aContent, /* Pages in wal - keys for the sort */

	10867 ht_slot *aLeft, /* IN: Left hand input list */

	10868 int nLeft, /* IN: Elements in array *paLeft */

	10869 ht_slot **paRight, /* IN/OUT: Right hand input list */

	10870 int *pnRight, /* IN/OUT: Elements in *paRight */

	10871 ht_slot *aTmp /* Temporary buffer */

	10872 ){

	10873 int iLeft = 0; /* Current index in aLeft */

	10874 int iRight = 0; /* Current index in aRight */

	10875 int iOut = 0; /* Current index in output buffer */

	10876 int nRight = *pnRight;

	10877 ht_slot *aRight = *paRight;

	10878

	10879 assert( nLeft>0 && nRight>0 );

	10880 while( iRight<nRight || iLeft<nLeft ){

	10881 ht_slot logpage;

	10882 Pgno dbpage;

	10883

	10884 if( (iLeft<nLeft)

	10885 && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])

	10886 ){

	10887 logpage = aLeft[iLeft++];

	10888 }else{

	10889 logpage = aRight[iRight++];

	10890 }

	10891 dbpage = aContent[logpage];

	10892

	10893 aTmp[iOut++] = logpage;

	10894 if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;

	10895

	10896 assert( iLeft>=nLeft || aContent[aLeft[iLeft]]>dbpage );

	10897 assert( iRight>=nRight || aContent[aRight[iRight]]>dbpage );

	10898 }

	10899

	10900 *paRight = aLeft;

	10901 *pnRight = iOut;

	10902 memcpy(aLeft, aTmp, sizeof(aTmp[0])*iOut);

	10903 }
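
Reviewer's note: the merge step in walMerge() above can be read in isolation as the sketch below. It is a simplified stand-in, not the code under review: `merge_indices`, the fixed-width `uint16_t` index type, and the separate output buffer are assumptions made for illustration. It keeps the one property that matters here: when both lists hold an index for the same key, the right-hand index is emitted and the left-hand one is dropped.

```c
#include <stdint.h>

/* Merge two index lists that are each sorted by aContent[index].
** Writes the merged list into aOut and returns its length.
** On a key collision the right-hand index wins and the left-hand
** index is skipped (illustrative sketch, not the SQLite routine). */
static int merge_indices(
  const uint32_t *aContent,   /* Sort keys */
  const uint16_t *aLeft,      /* Left-hand sorted index list */
  int nLeft,                  /* Number of entries in aLeft */
  const uint16_t *aRight,     /* Right-hand sorted index list */
  int nRight,                 /* Number of entries in aRight */
  uint16_t *aOut              /* Output, room for nLeft+nRight entries */
){
  int iLeft = 0, iRight = 0, iOut = 0;
  while( iRight<nRight || iLeft<nLeft ){
    uint16_t idx;
    uint32_t key;
    if( iLeft<nLeft
     && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
    ){
      idx = aLeft[iLeft++];     /* Left key is strictly smaller */
    }else{
      idx = aRight[iRight++];   /* Right key is smaller or equal */
    }
    key = aContent[idx];
    aOut[iOut++] = idx;
    /* Drop a left entry that duplicates the key just emitted. */
    if( iLeft<nLeft && aContent[aLeft[iLeft]]==key ) iLeft++;
  }
  return iOut;
}
```

With keys `{5, 9, 7, 5, 12}`, a left list `{0, 2, 1}` and a right list `{3, 4}`, indices 0 and 3 both refer to key 5, so only index 3 survives in the output.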

	10904

	10905 /*

	10906 ** Sort the elements in list aList using aContent[] as the sort key.

	10907 ** Remove elements with duplicate keys, preferring to keep the

	10908 ** larger aList[] values.

	10909 **

	10910 ** The aList[] entries are indices into aContent[]. The values in

	10911 ** aList[] are to be sorted so that for all J<K:

	10912 **

	10913 ** aContent[aList[J]] < aContent[aList[K]]

	10914 **

	10915 ** For any X and Y such that

	10916 **

	10917 ** aContent[aList[X]] == aContent[aList[Y]]

	10918 **

	10919 ** Keep the larger of the two values aList[X] and aList[Y] and discard

	10920 ** the smaller.

	10921 */

	10922 static void walMergesort(

	10923 const u32 *aContent, /* Pages in wal */

	10924 ht_slot *aBuffer, /* Buffer of at least *pnList items to use */

	10925 ht_slot *aList, /* IN/OUT: List to sort */

	10926 int *pnList /* IN/OUT: Number of elements in aList[] */

	10927 ){

	10928 struct Sublist {

	10929 int nList; /* Number of elements in aList */

	10930 ht_slot *aList; /* Pointer to sub-list content */

	10931 };

	10932

	10933 const int nList = *pnList; /* Size of input list */

	10934 int nMerge = 0; /* Number of elements in list aMerge */

	10935 ht_slot *aMerge = 0; /* List to be merged */

	10936 int iList; /* Index into input list */

	10937 u32 iSub = 0; /* Index into aSub array */

	10938 struct Sublist aSub[13]; /* Array of sub-lists */

	10939

	10940 memset(aSub, 0, sizeof(aSub));

	10941 assert( nList<=HASHTABLE_NPAGE && nList>0 );

	10942 assert( HASHTABLE_NPAGE==(1<<(ArraySize(aSub)-1)) );

	10943

	10944 for(iList=0; iList<nList; iList++){

	10945 nMerge = 1;

	10946 aMerge = &aList[iList];

	10947 for(iSub=0; iList & (1<<iSub); iSub++){

	10948 struct Sublist *p;

	10949 assert( iSub<ArraySize(aSub) );

	10950 p = &aSub[iSub];

	10951 assert( p->aList && p->nList<=(1<<iSub) );

	10952 assert( p->aList==&aList[iList&~((2<<iSub)-1)] );

	10953 walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);

	10954 }

	10955 aSub[iSub].aList = aMerge;

	10956 aSub[iSub].nList = nMerge;

	10957 }

	10958

	10959 for(iSub++; iSub<ArraySize(aSub); iSub++){

	10960 if( nList & (1<<iSub) ){

	10961 struct Sublist *p;

	10962 assert( iSub<ArraySize(aSub) );

	10963 p = &aSub[iSub];

	10964 assert( p->nList<=(1<<iSub) );

	10965 assert( p->aList==&aList[nList&~((2<<iSub)-1)] );

	10966 walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);

	10967 }

	10968 }

	10969 assert( aMerge==aList );

	10970 *pnList = nMerge;

	10971

	10972 #ifdef SQLITE_DEBUG

	10973 {

	10974 int i;

	10975 for(i=1; i<*pnList; i++){

	10976 assert( aContent[aList[i]] > aContent[aList[i-1]] );

	10977 }

	10978 }

	10979 #endif

	10980 }

	10981

	10982 /*

	10983 ** Free an iterator allocated by walIteratorInit().

	10984 */

	10985 static void walIteratorFree(WalIterator *p){

	10986 sqlite3_free(p);

	10987 }

	10988

	10989 /*

	10990 ** Construct a WalIterator object that can be used to loop over all

	10991 ** pages in the WAL in ascending order. The caller must hold the checkpoint

	10992 ** lock.

	10993 **

	10994 ** On success, make *pp point to the newly allocated WalIterator object

	10995 ** and return SQLITE_OK. Otherwise, return an error code. If this routine

	10996 ** returns an error, the value of *pp is undefined.

	10997 **

	10998 ** The calling routine should invoke walIteratorFree() to destroy the

	10999 ** WalIterator object when it has finished with it.

	11000 */

	11001 static int walIteratorInit(Wal *pWal, WalIterator **pp){

	11002 WalIterator *p; /* Return value */

	11003 int nSegment; /* Number of segments to merge */

	11004 u32 iLast; /* Last frame in log */

	11005 int nByte; /* Number of bytes to allocate */

	11006 int i; /* Iterator variable */

	11007 ht_slot *aTmp; /* Temp space used by merge-sort */

	11008 int rc = SQLITE_OK; /* Return Code */

	11009

	11010 /* This routine only runs while holding the checkpoint lock. And

	11011 ** it only runs if there is actually content in the log (mxFrame>0).

	11012 */

	11013 assert( pWal->ckptLock && pWal->hdr.mxFrame>0 );

	11014 iLast = pWal->hdr.mxFrame;

	11015

	11016 /* Allocate space for the WalIterator object. */

	11017 nSegment = walFramePage(iLast) + 1;

	11018 nByte = sizeof(WalIterator)

	11019 + (nSegment-1)*sizeof(struct WalSegment)

	11020 + iLast*sizeof(ht_slot);

	11021 p = (WalIterator *)sqlite3_malloc64(nByte);

	11022 if( !p ){

	11023 return SQLITE_NOMEM_BKPT;

	11024 }

	11025 memset(p, 0, nByte);

	11026 p->nSegment = nSegment;

	11027

	11028 /* Allocate temporary space used by the merge-sort routine. This block

	11029 ** of memory will be freed before this function returns.

	11030 */

	11031 aTmp = (ht_slot *)sqlite3_malloc64(

	11032 sizeof(ht_slot) * (iLast>HASHTABLE_NPAGE?HASHTABLE_NPAGE:iLast)

	11033 );

	11034 if( !aTmp ){

	11035 rc = SQLITE_NOMEM_BKPT;

	11036 }

	11037

	11038 for(i=0; rc==SQLITE_OK && i<nSegment; i++){

	11039 volatile ht_slot *aHash;

	11040 u32 iZero;

	11041 volatile u32 *aPgno;

	11042

	11043 rc = walHashGet(pWal, i, &aHash, &aPgno, &iZero);

	11044 if( rc==SQLITE_OK ){

	11045 int j; /* Counter variable */

	11046 int nEntry; /* Number of entries in this segment */

	11047 ht_slot *aIndex; /* Sorted index for this segment */

	11048

	11049 aPgno++;

	11050 if( (i+1)==nSegment ){

	11051 nEntry = (int)(iLast - iZero);

	11052 }else{

	11053 nEntry = (int)((u32*)aHash - (u32*)aPgno);

	11054 }

	11055 aIndex = &((ht_slot *)&p->aSegment[p->nSegment])[iZero];

	11056 iZero++;

	11057

	11058 for(j=0; j<nEntry; j++){

	11059 aIndex[j] = (ht_slot)j;

	11060 }

	11061 walMergesort((u32 *)aPgno, aTmp, aIndex, &nEntry);

	11062 p->aSegment[i].iZero = iZero;

	11063 p->aSegment[i].nEntry = nEntry;

	11064 p->aSegment[i].aIndex = aIndex;

	11065 p->aSegment[i].aPgno = (u32 *)aPgno;

	11066 }

	11067 }

	11068 sqlite3_free(aTmp);

	11069

	11070 if( rc!=SQLITE_OK ){

	11071 walIteratorFree(p);

	11072 }

	11073 *pp = p;

	11074 return rc;

	11075 }

	11076

	11077 /*

	11078 ** Attempt to obtain the exclusive WAL lock defined by parameters lockIdx and

	11079 ** n. If the attempt fails and parameter xBusy is not NULL, then it is a

	11080 ** busy-handler function. Invoke it and retry the lock until either the

	11081 ** lock is successfully obtained or the busy-handler returns 0.

	11082 */

	11083 static int walBusyLock(

	11084 Wal *pWal, /* WAL connection */

	11085 int (*xBusy)(void*), /* Function to call when busy */

	11086 void *pBusyArg, /* Context argument for xBusyHandler */

	11087 int lockIdx, /* Offset of first byte to lock */

	11088 int n /* Number of bytes to lock */

	11089 ){

	11090 int rc;

	11091 do {

	11092 rc = walLockExclusive(pWal, lockIdx, n);

	11093 }while( xBusy && rc==SQLITE_BUSY && xBusy(pBusyArg) );

	11094 return rc;

	11095 }
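
Reviewer's note: the retry loop in walBusyLock() above follows a common pattern, sketched here outside SQLite. Everything named below (`busy_lock`, `DEMO_OK`, `DEMO_BUSY`, the two demo callbacks) is hypothetical and exists only for illustration; the real routine calls walLockExclusive() directly rather than taking a lock callback.

```c
#include <stddef.h>

#define DEMO_OK   0   /* Hypothetical "lock obtained" code */
#define DEMO_BUSY 5   /* Hypothetical "lock held elsewhere" code */

typedef int (*try_lock_fn)(void *ctx);
typedef int (*busy_fn)(void *arg);

/* Try the lock once. While it reports DEMO_BUSY and a busy handler
** exists and returns non-zero, try again. Any other result, a NULL
** handler, or a handler returning 0 ends the loop. */
static int busy_lock(
  try_lock_fn xTry, void *pTryCtx,
  busy_fn xBusy, void *pBusyArg
){
  int rc;
  do{
    rc = xTry(pTryCtx);
  }while( xBusy && rc==DEMO_BUSY && xBusy(pBusyArg) );
  return rc;
}

/* Demo lock: reports busy *pnFail times, then succeeds. */
static int demo_try(void *ctx){
  int *pnFail = (int*)ctx;
  if( *pnFail>0 ){ (*pnFail)--; return DEMO_BUSY; }
  return DEMO_OK;
}

/* Demo busy handler: allows a retry while the budget is positive. */
static int demo_retry_limit(void *arg){
  int *pnLeft = (int*)arg;
  if( *pnLeft<=0 ) return 0;
  (*pnLeft)--;
  return 1;
}
```

Because the handler is consulted only after a busy result, a lock that succeeds on the first attempt never invokes it, and passing a NULL handler turns the call into a single attempt.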

	11096

	11097 /*

	11098 ** The cache of the wal-index header must be valid to call this function.

	11099 ** Return the page-size in bytes used by the database.

	11100 */

	11101 static int walPagesize(Wal *pWal){

	11102 return (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);

	11103 }
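
Reviewer's note: walPagesize() above decodes a 16-bit field that cannot hold 65536 directly. The sketch below pairs that decode expression with the encode expression used in the recovery code earlier in this file (`(szPage&0xff00) | (szPage>>16)`). The function names are assumptions for illustration, and the sketch assumes the page size is a power of two between 512 and 65536.

```c
#include <stdint.h>

/* Pack a page size into 16 bits. Sizes up to 32768 are stored as-is;
** 65536 does not fit, so it is stored as the value 1. */
static uint16_t encode_page_size(uint32_t szPage){
  return (uint16_t)((szPage & 0xff00) | (szPage >> 16));
}

/* Reverse the packing: bit 0 set means 65536, otherwise the upper
** bits hold the size directly. */
static uint32_t decode_page_size(uint16_t stored){
  return (uint32_t)(stored & 0xfe00) + ((uint32_t)(stored & 0x0001) << 16);
}
```

The round trip is lossless for every legal page size because those sizes never set bit 0, which leaves it free to flag the 64 KiB case.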

	11104

	11105 /*

	11106 ** The following is guaranteed when this function is called:

	11107 **

	11108 ** a) the WRITER lock is held,

	11109 ** b) the entire log file has been checkpointed, and

	11110 ** c) any existing readers are reading exclusively from the database

	11111 ** file - there are no readers that may attempt to read a frame from

	11112 ** the log file.

	11113 **

	11114 ** This function updates the shared-memory structures so that the next

	11115 ** client to write to the database (which may be this one) does so by

	11116 ** writing frames into the start of the log file.

	11117 **

	11118 ** The value of parameter salt1 is used as the aSalt[1] value in the

	11119 ** new wal-index header. It should be passed a pseudo-random value (i.e.

	11120 ** one obtained from sqlite3_randomness()).

	11121 */

	11122 static void walRestartHdr(Wal *pWal, u32 salt1){

	11123 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);

	11124 int i; /* Loop counter */

	11125 u32 *aSalt = pWal->hdr.aSalt; /* Big-endian salt values */

	11126 pWal->nCkpt++;

	11127 pWal->hdr.mxFrame = 0;

	11128 sqlite3Put4byte((u8*)&aSalt[0], 1 + sqlite3Get4byte((u8*)&aSalt[0]));

	11129 memcpy(&pWal->hdr.aSalt[1], &salt1, 4);

	11130 walIndexWriteHdr(pWal);

	11131 pInfo->nBackfill = 0;

	11132 pInfo->nBackfillAttempted = 0;

	11133 pInfo->aReadMark[1] = 0;

	11134 for(i=2; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;

	11135 assert( pInfo->aReadMark[0]==0 );

	11136 }

	11137

	11138 /*

	11139 ** Copy as much content as we can from the WAL back into the database file

	11140 ** in response to an sqlite3_wal_checkpoint() request or the equivalent.

	11141 **

	11142 ** The amount of information copied from WAL to database might be limited

	11143 ** by active readers. This routine will never overwrite a database page

	11144 ** that a concurrent reader might be using.

	11145 **

	11146 ** All I/O barrier operations (a.k.a fsyncs) occur in this routine when

	11147 ** SQLite is in WAL-mode in synchronous=NORMAL. That means that if

	11148 ** checkpoints are always run by a background thread or background

	11149 ** process, foreground threads will never block on a lengthy fsync call.

	11150 **

	11151 ** Fsync is called on the WAL before writing content out of the WAL and

	11152 ** into the database. This ensures that the new content is persistent

	11153 ** in the WAL and can be recovered following a power-loss or hard reset.

	11154 **

	11155 ** Fsync is also called on the database file if (and only if) the entire

	11156 ** WAL content is copied into the database file. This second fsync makes

	11157 ** it safe to delete the WAL since the new content will persist in the

	11158 ** database file.

	11159 **

	11160 ** This routine uses and updates the nBackfill field of the wal-index header.

	11161 ** This is the only routine that will increase the value of nBackfill.

	11162 ** (A WAL reset or recovery will revert nBackfill to zero, but not increase

	11163 ** its value.)

	11164 **

	11165 ** The caller must be holding sufficient locks to ensure that no other

	11166 ** checkpoint is running (in any other thread or process) at the same

	11167 ** time.

	11168 */

	11169 static int walCheckpoint(

	11170 Wal *pWal, /* Wal connection */

	11171 sqlite3 *db, /* Check for interrupts on this handle */

	11172 int eMode, /* One of PASSIVE, FULL or RESTART */

	11173 int (*xBusy)(void*), /* Function to call when busy */

	11174 void *pBusyArg, /* Context argument for xBusyHandler */

	11175 int sync_flags, /* Flags for OsSync() (or 0) */

	11176 u8 *zBuf /* Temporary buffer to use */

	11177 ){

	11178 int rc = SQLITE_OK; /* Return code */

	11179 int szPage; /* Database page-size */

	11180 WalIterator *pIter = 0; /* Wal iterator context */

	11181 u32 iDbpage = 0; /* Next database page to write */

	11182 u32 iFrame = 0; /* Wal frame containing data for iDbpage */

	11183 u32 mxSafeFrame; /* Max frame that can be backfilled */

	11184 u32 mxPage; /* Max database page to write */

	11185 int i; /* Loop counter */

	11186 volatile WalCkptInfo *pInfo; /* The checkpoint status information */

	11187

	11188 szPage = walPagesize(pWal);

	11189 testcase( szPage<=32768 );

	11190 testcase( szPage>=65536 );

	11191 pInfo = walCkptInfo(pWal);

	11192 if( pInfo->nBackfill<pWal->hdr.mxFrame ){

	11193

	11194 /* Allocate the iterator */

	11195 rc = walIteratorInit(pWal, &pIter);

	11196 if( rc!=SQLITE_OK ){

	11197 return rc;

	11198 }

	11199 assert( pIter );

	11200

	11201 /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked

	11202 ** in the SQLITE_CHECKPOINT_PASSIVE mode. */

	11203 assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 );

	11204

	11205 /* Compute in mxSafeFrame the index of the last frame of the WAL that is

	11206 ** safe to write into the database. Frames beyond mxSafeFrame might

	11207 ** overwrite database pages that are in use by active readers and thus

	11208 ** cannot be backfilled from the WAL.

	11209 */

	11210 mxSafeFrame = pWal->hdr.mxFrame;

	11211 mxPage = pWal->hdr.nPage;

	11212 for(i=1; i<WAL_NREADER; i++){

	11213 /* Thread-sanitizer reports that the following is an unsafe read,

	11214 ** as some other thread may be in the process of updating the value

	11215 ** of the aReadMark[] slot. The assumption here is that if that is

	11216 ** happening, the other client may only be increasing the value,

	11217 ** not decreasing it. So assuming that either the "old" or

	11218 ** "new" version of the value is read, and not some arbitrary value

	11219 ** that would never be written by a real client, things are still

	11220 ** safe. */

	11221 u32 y = pInfo->aReadMark[i];

	11222 if( mxSafeFrame>y ){

	11223 assert( y<=pWal->hdr.mxFrame );

	11224 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(i), 1);

	11225 if( rc==SQLITE_OK ){

	11226 pInfo->aReadMark[i] = (i==1 ? mxSafeFrame : READMARK_NOT_USED);

	11227 walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);

	11228 }else if( rc==SQLITE_BUSY ){

	11229 mxSafeFrame = y;

	11230 xBusy = 0;

	11231 }else{

	11232 goto walcheckpoint_out;

	11233 }

	11234 }

	11235 }

	11236

	11237 if( pInfo->nBackfill<mxSafeFrame

	11238 && (rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(0),1))==SQLITE_OK

	11239 ){

	11240 i64 nSize; /* Current size of database file */

	11241 u32 nBackfill = pInfo->nBackfill;

	11242

	11243 pInfo->nBackfillAttempted = mxSafeFrame;

	11244

	11245 /* Sync the WAL to disk */

	11246 if( sync_flags ){

	11247 rc = sqlite3OsSync(pWal->pWalFd, sync_flags);

	11248 }

	11249

	11250 /* If the database may grow as a result of this checkpoint, hint

	11251 ** about the eventual size of the db file to the VFS layer.

	11252 */

	11253 if( rc==SQLITE_OK ){

	11254 i64 nReq = ((i64)mxPage * szPage);

	11255 rc = sqlite3OsFileSize(pWal->pDbFd, &nSize);

	11256 if( rc==SQLITE_OK && nSize<nReq ){

	11257 sqlite3OsFileControlHint(pWal->pDbFd, SQLITE_FCNTL_SIZE_HINT, &nReq);

	11258 }

	11259 }

	11260

	11261

	11262 /* Iterate through the contents of the WAL, copying data to the db file */

	11263 while( rc==SQLITE_OK && 0==walIteratorNext(pIter, &iDbpage, &iFrame) ){

	11264 i64 iOffset;

	11265 assert( walFramePgno(pWal, iFrame)==iDbpage );

	11266 if( db->u1.isInterrupted ){

	11267 rc = db->mallocFailed ? SQLITE_NOMEM_BKPT : SQLITE_INTERRUPT;

	11268 break;

	11269 }

	11270 if( iFrame<=nBackfill || iFrame>mxSafeFrame || iDbpage>mxPage ){

	11271 continue;

	11272 }

	11273 iOffset = walFrameOffset(iFrame, szPage) + WAL_FRAME_HDRSIZE;

	11274 /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL file */

	11275 rc = sqlite3OsRead(pWal->pWalFd, zBuf, szPage, iOffset);

	11276 if( rc!=SQLITE_OK ) break;

	11277 iOffset = (iDbpage-1)*(i64)szPage;

	11278 testcase( IS_BIG_INT(iOffset) );

	11279 rc = sqlite3OsWrite(pWal->pDbFd, zBuf, szPage, iOffset);

	11280 if( rc!=SQLITE_OK ) break;

	11281 }

	11282

	11283 /* If work was actually accomplished... */

	11284 if( rc==SQLITE_OK ){

	11285 if( mxSafeFrame==walIndexHdr(pWal)->mxFrame ){

	11286 i64 szDb = pWal->hdr.nPage*(i64)szPage;

	11287 testcase( IS_BIG_INT(szDb) );

	11288 rc = sqlite3OsTruncate(pWal->pDbFd, szDb);

	11289 if( rc==SQLITE_OK && sync_flags ){

	11290 rc = sqlite3OsSync(pWal->pDbFd, sync_flags);

	11291 }

	11292 }

	11293 if( rc==SQLITE_OK ){

	11294 pInfo->nBackfill = mxSafeFrame;

	11295 }

	11296 }

	11297

	11298 /* Release the reader lock held while backfilling */

	11299 walUnlockExclusive(pWal, WAL_READ_LOCK(0), 1);

	11300 }

	11301

	11302 if( rc==SQLITE_BUSY ){

	11303 /* Reset the return code so as not to report a checkpoint failure

	11304 ** just because there are active readers. */

	11305 rc = SQLITE_OK;

	11306 }

	11307 }

	11308

	11309 /* If this is an SQLITE_CHECKPOINT_RESTART or TRUNCATE operation, and the

	11310 ** entire wal file has been copied into the database file, then block

	11311 ** until all readers have finished using the wal file. This ensures that

	11312 ** the next process to write to the database restarts the wal file.

	11313 */

	11314 if( rc==SQLITE_OK && eMode!=SQLITE_CHECKPOINT_PASSIVE ){

	11315 assert( pWal->writeLock );

	11316 if( pInfo->nBackfill<pWal->hdr.mxFrame ){

	11317 rc = SQLITE_BUSY;

	11318 }else if( eMode>=SQLITE_CHECKPOINT_RESTART ){

	11319 u32 salt1;

	11320 sqlite3_randomness(4, &salt1);

	11321 assert( pInfo->nBackfill==pWal->hdr.mxFrame );

	11322 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(1), WAL_NREADER-1);

	11323 if( rc==SQLITE_OK ){

	11324 if( eMode==SQLITE_CHECKPOINT_TRUNCATE ){

	11325 /* IMPLEMENTATION-OF: R-44699-57140 This mode works the same way as

	11326 ** SQLITE_CHECKPOINT_RESTART with the addition that it also

	11327 ** truncates the log file to zero bytes just prior to a

	11328 ** successful return.

	11329 **

	11330 ** In theory, it might be safe to do this without updating the

	11331 ** wal-index header in shared memory, as all subsequent reader or

	11332 ** writer clients should see that the entire log file has been

	11333 ** checkpointed and behave accordingly. This seems unsafe though,

	11334 ** as it would leave the system in a state where the contents of

	11335 ** the wal-index header do not match the contents of the

	11336 ** file-system. To avoid this, update the wal-index header to

	11337 ** indicate that the log file contains zero valid frames. */

	11338 walRestartHdr(pWal, salt1);

	11339 rc = sqlite3OsTruncate(pWal->pWalFd, 0);

	11340 }

	11341 walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);

	11342 }

	11343 }

	11344 }

	11345

	11346 walcheckpoint_out:

	11347 walIteratorFree(pIter);

	11348 return rc;

	11349 }

	11350

	11351 /*

	11352 ** If the WAL file is currently larger than nMax bytes in size, truncate

	11353 ** it to exactly nMax bytes. If an error occurs while doing so, ignore it.

	11354 */

	11355 static void walLimitSize(Wal *pWal, i64 nMax){

	11356 i64 sz;

	11357 int rx;

	11358 sqlite3BeginBenignMalloc();

	11359 rx = sqlite3OsFileSize(pWal->pWalFd, &sz);

	11360 if( rx==SQLITE_OK && (sz > nMax ) ){

	11361 rx = sqlite3OsTruncate(pWal->pWalFd, nMax);

	11362 }

	11363 sqlite3EndBenignMalloc();

	11364 if( rx ){

	11365 sqlite3_log(rx, "cannot limit WAL size: %s", pWal->zWalName);

	11366 }

	11367 }

	11368

	11369 /*

	11370 ** Close a connection to a log file.

	11371 */

	11372 SQLITE_PRIVATE int sqlite3WalClose(

	11373 Wal *pWal, /* Wal to close */

	11374 sqlite3 *db, /* For interrupt flag */

	11375 int sync_flags, /* Flags to pass to OsSync() (or 0) */

	11376 int nBuf,

	11377 u8 *zBuf /* Buffer of at least nBuf bytes */

	11378 ){

	11379 int rc = SQLITE_OK;

	11380 if( pWal ){

	11381 int isDelete = 0; /* True to unlink wal and wal-index files */

	11382

	11383 /* If an EXCLUSIVE lock can be obtained on the database file (using the

	11384 ** ordinary, rollback-mode locking methods), this guarantees that the

	11385 ** connection associated with this log file is the only connection to

	11386 ** the database. In this case checkpoint the database and unlink both

	11387 ** the wal and wal-index files.

	11388 **

	11389 ** The EXCLUSIVE lock is not released before returning.

	11390 */

	11391 if( zBuf!=0

	11392 && SQLITE_OK==(rc = sqlite3OsLock(pWal->pDbFd, SQLITE_LOCK_EXCLUSIVE))

	11393 ){

	11394 if( pWal->exclusiveMode==WAL_NORMAL_MODE ){

	11395 pWal->exclusiveMode = WAL_EXCLUSIVE_MODE;

	11396 }

	11397 rc = sqlite3WalCheckpoint(pWal, db,

	11398 SQLITE_CHECKPOINT_PASSIVE, 0, 0, sync_flags, nBuf, zBuf, 0, 0

	11399 );

	11400 if( rc==SQLITE_OK ){

	11401 int bPersist = -1;

	11402 sqlite3OsFileControlHint(

	11403 pWal->pDbFd, SQLITE_FCNTL_PERSIST_WAL, &bPersist

	11404 );

	11405 if( bPersist!=1 ){

	11406 /* Try to delete the WAL file if the checkpoint completed and

	11407 ** fsynced (rc==SQLITE_OK) and if we are not in persistent-wal

	11408 ** mode (!bPersist) */

	11409 isDelete = 1;

	11410 }else if( pWal->mxWalSize>=0 ){

	11411 /* Try to truncate the WAL file to zero bytes if the checkpoint

	11412 ** completed and fsynced (rc==SQLITE_OK) and we are in persistent

	11413 ** WAL mode (bPersist) and if the PRAGMA journal_size_limit is a

	11414 ** non-negative value (pWal->mxWalSize>=0). Note that we truncate

	11415 ** to zero bytes as truncating to the journal_size_limit might

	11416 ** leave a corrupt WAL file on disk. */

	11417 walLimitSize(pWal, 0);

	11418 }

	11419 }

	11420 }

	11421

	11422 walIndexClose(pWal, isDelete);

	11423 sqlite3OsClose(pWal->pWalFd);

	11424 if( isDelete ){

	11425 sqlite3BeginBenignMalloc();

	11426 sqlite3OsDelete(pWal->pVfs, pWal->zWalName, 0);

	11427 sqlite3EndBenignMalloc();

	11428 }

	11429 WALTRACE(("WAL%p: closed\n", pWal));

	11430 sqlite3_free((void *)pWal->apWiData);

	11431 sqlite3_free(pWal);

	11432 }

	11433 return rc;

	11434 }

	11435

	11436 /*

	11437 ** Try to read the wal-index header. Return 0 on success and 1 if

	11438 ** there is a problem.

	11439 **

	11440 ** The wal-index is in shared memory. Another thread or process might

	11441 ** be writing the header at the same time this procedure is trying to

	11442 ** read it, which might result in inconsistency. A dirty read is detected

	11443 ** by verifying that both copies of the header are the same and also by

	11444 ** a checksum on the header.

	11445 **

	11446 ** If and only if the read is consistent and the header is different from

	11447 ** pWal->hdr, then pWal->hdr is updated to the content of the new header

	11448 ** and *pChanged is set to 1.

	11449 **

	11450 ** If the checksum cannot be verified return non-zero. If the header

	11451 ** is read successfully and the checksum verified, return zero.

	11452 */

	11453 static int walIndexTryHdr(Wal *pWal, int *pChanged){

	11454 u32 aCksum[2]; /* Checksum on the header content */

	11455 WalIndexHdr h1, h2; /* Two copies of the header content */

	11456 WalIndexHdr volatile *aHdr; /* Header in shared memory */

	11457

	11458 /* The first page of the wal-index must be mapped at this point. */

	11459 assert( pWal->nWiData>0 && pWal->apWiData[0] );

	11460

	11461 /* Read the header. This might happen concurrently with a write to the

	11462 ** same area of shared memory on a different CPU in a SMP,

	11463 ** meaning it is possible that an inconsistent snapshot is read

	11464 ** from the file. If this happens, return non-zero.

	11465 **

	11466 ** There are two copies of the header at the beginning of the wal-index.

	11467 ** When reading, read [0] first then [1]. Writes are in the reverse order.

	11468 ** Memory barriers are used to prevent the compiler or the hardware from

	11469 ** reordering the reads and writes.

	11470 */

	11471 aHdr = walIndexHdr(pWal);

	11472 memcpy(&h1, (void *)&aHdr[0], sizeof(h1));

	11473 walShmBarrier(pWal);

	11474 memcpy(&h2, (void *)&aHdr[1], sizeof(h2));

	11475

	11476 if( memcmp(&h1, &h2, sizeof(h1))!=0 ){

	11477 return 1; /* Dirty read */

	11478 }

	11479 if( h1.isInit==0 ){

	11480 return 1; /* Malformed header - probably all zeros */

	11481 }

	11482 walChecksumBytes(1, (u8*)&h1, sizeof(h1)-sizeof(h1.aCksum), 0, aCksum);

	11483 if( aCksum[0]!=h1.aCksum[0] || aCksum[1]!=h1.aCksum[1] ){

	11484 return 1; /* Checksum does not match */

	11485 }

	11486

	11487 if( memcmp(&pWal->hdr, &h1, sizeof(WalIndexHdr)) ){

	11488 *pChanged = 1;

	11489 memcpy(&pWal->hdr, &h1, sizeof(WalIndexHdr));

	11490 pWal->szPage = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);

	11491 testcase( pWal->szPage<=32768 );

	11492 testcase( pWal->szPage>=65536 );

	11493 }

	11494

	11495 /* The header was successfully read. Return zero. */

	11496 return 0;

	11497 }
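
The double-read check in walIndexTryHdr() can be illustrated in isolation. The following is a minimal single-threaded sketch, not the SQLite implementation: `DemoHdr`, `demo_write_hdr()`, `demo_try_read_hdr()` and `demo_torn_read_detected()` are hypothetical names, and `atomic_thread_fence()` is only an assumed stand-in for walShmBarrier(). It shows the writer updating copy [1] before copy [0], the reader reading [0] before [1], and a half-finished write being reported as a dirty read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for WalIndexHdr (two u32 fields, no padding). */
typedef struct DemoHdr {
  uint32_t iChange;   /* Change counter */
  uint32_t mxFrame;   /* Last valid frame */
} DemoHdr;

/* Writer order: copy [1] first, then copy [0]. */
static void demo_write_hdr(DemoHdr aShared[2], const DemoHdr *pNew){
  memcpy(&aShared[1], pNew, sizeof(*pNew));
  atomic_thread_fence(memory_order_seq_cst);  /* assumed barrier stand-in */
  memcpy(&aShared[0], pNew, sizeof(*pNew));
}

/* Reader order: copy [0] first, then copy [1].
** Return 0 and fill *pOut on a consistent read, 1 on a dirty read. */
static int demo_try_read_hdr(const DemoHdr aShared[2], DemoHdr *pOut){
  DemoHdr h1, h2;
  assert( pOut!=0 );
  memcpy(&h1, &aShared[0], sizeof(h1));
  atomic_thread_fence(memory_order_seq_cst);  /* assumed barrier stand-in */
  memcpy(&h2, &aShared[1], sizeof(h2));
  if( memcmp(&h1, &h2, sizeof(h1))!=0 ) return 1;  /* Dirty read */
  *pOut = h1;
  return 0;
}

/* Simulate a writer that was interrupted after updating only copy [1]. */
static int demo_torn_read_detected(void){
  DemoHdr aShared[2] = {{1, 10}, {1, 10}};
  DemoHdr next = {2, 11};
  DemoHdr out = {0, 0};
  memcpy(&aShared[1], &next, sizeof(next));  /* first half of the write */
  return demo_try_read_hdr(aShared, &out);
}
```

The sketch omits the isInit test and the checksum verification that the real routine performs after the two copies compare equal.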

	11498

	11499 /*

	11500 ** Read the wal-index header from the wal-index and into pWal->hdr.

	11501 ** If the wal-header appears to be corrupt, try to reconstruct the

	11502 ** wal-index from the WAL before returning.

	11503 **

	11504 ** Set *pChanged to 1 if the wal-index header value in pWal->hdr is

	11505 ** changed by this operation. If pWal->hdr is unchanged, set *pChanged

	11506 ** to 0.

	11507 **

	11508 ** If the wal-index header is successfully read, return SQLITE_OK.

	11509 ** Otherwise return an SQLite error code.

	11510 */

	11511 static int walIndexReadHdr(Wal *pWal, int *pChanged){

	11512 int rc; /* Return code */

	11513 int badHdr; /* True if a header read failed */

	11514 volatile u32 *page0; /* Chunk of wal-index containing header */

	11515

	11516 /* Ensure that page 0 of the wal-index (the page that contains the

	11517 ** wal-index header) is mapped. Return early if an error occurs here.

	11518 */

	11519 assert( pChanged );

	11520 rc = walIndexPage(pWal, 0, &page0);

	11521 if( rc!=SQLITE_OK ){

	11522 return rc;

	11523 };

	11524 assert( page0 || pWal->writeLock==0 );

	11525

	11526 /* If the first page of the wal-index has been mapped, try to read the

	11527 ** wal-index header immediately, without holding any lock. This usually

	11528 ** works, but may fail if the wal-index header is corrupt or currently

	11529 ** being modified by another thread or process.

	11530 */

	11531 badHdr = (page0 ? walIndexTryHdr(pWal, pChanged) : 1);

	11532

	11533 /* If the first attempt failed, it might have been due to a race

	11534 ** with a writer. So get a WRITE lock and try again.

	11535 */

	11536 assert( badHdr==0 || pWal->writeLock==0 );

	11537 if( badHdr ){

	11538 if( pWal->readOnly & WAL_SHM_RDONLY ){

	11539 if( SQLITE_OK==(rc = walLockShared(pWal, WAL_WRITE_LOCK)) ){

	11540 walUnlockShared(pWal, WAL_WRITE_LOCK);

	11541 rc = SQLITE_READONLY_RECOVERY;

	11542 }

	11543 }else if( SQLITE_OK==(rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1)) ){

	11544 pWal->writeLock = 1;

	11545 if( SQLITE_OK==(rc = walIndexPage(pWal, 0, &page0)) ){

	11546 badHdr = walIndexTryHdr(pWal, pChanged);

	11547 if( badHdr ){

	11548 /* If the wal-index header is still malformed even while holding

	11549 ** a WRITE lock, it can only mean that the header is corrupted and

	11550 ** needs to be reconstructed. So run recovery to do exactly that.

	11551 */

	11552 rc = walIndexRecover(pWal);

	11553 *pChanged = 1;

	11554 }

	11555 }

	11556 pWal->writeLock = 0;

	11557 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);

	11558 }

	11559 }

	11560

	11561 /* If the header is read successfully, check the version number to make

	11562 ** sure the wal-index was not constructed with some future format that

	11563 ** this version of SQLite cannot understand.

	11564 */

	11565 if( badHdr==0 && pWal->hdr.iVersion!=WALINDEX_MAX_VERSION ){

	11566 rc = SQLITE_CANTOPEN_BKPT;

	11567 }

	11568

	11569 return rc;

	11570 }

	11571

	11572 /*

	11573 ** This is the value that walTryBeginRead returns when it needs to

	11574 ** be retried.

	11575 */

	11576 #define WAL_RETRY (-1)

	11577

	11578 /*

	11579 ** Attempt to start a read transaction. This might fail due to a race or

	11580 ** other transient condition. When that happens, it returns WAL_RETRY to

	11581 ** indicate to the caller that it is safe to retry immediately.

	11582 **

	11583 ** On success return SQLITE_OK. On a permanent failure (such as an

	11584 ** I/O error or an SQLITE_BUSY because another process is running

	11585 ** recovery) return a positive error code.

	11586 **

	11587 ** The useWal parameter is true to force the use of the WAL and disable

	11588 ** the case where the WAL is bypassed because it has been completely

	11589 ** checkpointed. If useWal==0 then this routine calls walIndexReadHdr()

	11590 ** to make a copy of the wal-index header into pWal->hdr. If the

	11591 ** wal-index header has changed, *pChanged is set to 1 (as an indication

	11592 ** to the caller that the local page cache is obsolete and needs to be

	11593 ** flushed.) When useWal==1, the wal-index header is assumed to already

	11594 ** be loaded and the pChanged parameter is unused.

	11595 **

	11596 ** The caller must set the cnt parameter to the number of prior calls to

	11597 ** this routine during the current read attempt that returned WAL_RETRY.

	11598 ** This routine will start taking more aggressive measures to clear the

	11599 ** race conditions after multiple WAL_RETRY returns, and after an excessive

	11600 ** number of errors will ultimately return SQLITE_PROTOCOL. The

	11601 ** SQLITE_PROTOCOL return indicates that some other process has gone rogue

	11602 ** and is not honoring the locking protocol. There is a vanishingly small

	11603 ** chance that SQLITE_PROTOCOL could be returned because of a run of really

	11604 ** bad luck when there is lots of contention for the wal-index, but that

	11605 ** possibility is so small that it can be safely neglected, we believe.

	11606 **

	11607 ** On success, this routine obtains a read lock on

	11608 ** WAL_READ_LOCK(pWal->readLock). The pWal->readLock integer is

	11609 ** in the range 0 <= pWal->readLock < WAL_NREADER. If pWal->readLock==(-1)

	11610 ** that means the Wal does not hold any read lock. The reader must not

	11611 ** access any database page that is modified by a WAL frame up to and

	11612 ** including frame number aReadMark[pWal->readLock]. The reader will

	11613 ** use WAL frames up to and including pWal->hdr.mxFrame if pWal->readLock>0

	11614 ** Or if pWal->readLock==0, then the reader will ignore the WAL

	11615 ** completely and get all content directly from the database file.

	11616 ** If the useWal parameter is 1 then the WAL will never be ignored and

	11617 ** this routine will always set pWal->readLock>0 on success.

	11618 ** When the read transaction is completed, the caller must release the

	11619 ** lock on WAL_READ_LOCK(pWal->readLock) and set pWal->readLock to -1.

	11620 **

	11621 ** This routine uses the nBackfill and aReadMark[] fields of the header

	11622 ** to select a particular WAL_READ_LOCK() that strives to let the

	11623 ** checkpoint process do as much work as possible. This routine might

	11624 ** update values of the aReadMark[] array in the header, but if it does

	11625 ** so it takes care to hold an exclusive lock on the corresponding

	11626 ** WAL_READ_LOCK() while changing values.

	11627 */

	11628 static int walTryBeginRead(Wal *pWal, int *pChanged, int useWal, int cnt){

	11629 volatile WalCkptInfo *pInfo; /* Checkpoint information in wal-index */

	11630 u32 mxReadMark; /* Largest aReadMark[] value */

	11631 int mxI; /* Index of largest aReadMark[] value */

	11632 int i; /* Loop counter */

	11633 int rc = SQLITE_OK; /* Return code */

	11634 u32 mxFrame; /* Wal frame to lock to */

	11635

	11636 assert( pWal->readLock<0 ); /* Not currently locked */

	11637

	11638 /* Take steps to avoid spinning forever if there is a protocol error.

	11639 **

	11640 ** Circumstances that cause a RETRY should only last for the briefest

	11641 ** instants of time. No I/O or other system calls are done while the

	11642 ** locks are held, so the locks should not be held for very long. But

	11643 ** if we are unlucky, another process that is holding a lock might get

	11644 ** paged out or take a page-fault that is time-consuming to resolve,

	11645 ** during the few nanoseconds that it is holding the lock. In that case,

	11646 ** it might take longer than normal for the lock to free.

	11647 **

	11648 ** After 5 RETRYs, we begin calling sqlite3OsSleep(). The first few

	11649 ** calls to sqlite3OsSleep() have a delay of 1 microsecond. Really this

	11650 ** is more of a scheduler yield than an actual delay. But on the 10th

	11651 ** and subsequent retries, the delays start becoming longer and longer,

	11652 ** so that on the 100th (and last) RETRY we delay for 323 milliseconds.

	11653 ** The total delay time before giving up is less than 10 seconds.

	11654 */

	11655 if( cnt>5 ){

	11656 int nDelay = 1; /* Pause time in microseconds */

	11657 if( cnt>100 ){

	11658 VVA_ONLY( pWal->lockError = 1; )

	11659 return SQLITE_PROTOCOL;

	11660 }

	11661 if( cnt>=10 ) nDelay = (cnt-9)*(cnt-9)*39;

	11662 sqlite3OsSleep(pWal->pVfs, nDelay);

	11663 }

	11664

	11665 if( !useWal ){

	11666 rc = walIndexReadHdr(pWal, pChanged);

	11667 if( rc==SQLITE_BUSY ){

	11668 /* If there is not a recovery running in another thread or process

	11669 ** then convert BUSY errors to WAL_RETRY. If recovery is known to

	11670 ** be running, convert BUSY to BUSY_RECOVERY. There is a race here

	11671 ** which might cause WAL_RETRY to be returned even if BUSY_RECOVERY

	11672 ** would be technically correct. But the race is benign since with

	11673 ** WAL_RETRY this routine will be called again and will probably be

	11674 ** right on the second iteration.

	11675 */

	11676 if( pWal->apWiData[0]==0 ){

	11677 /* This branch is taken when the xShmMap() method returns SQLITE_BUSY.

	11678 ** We assume this is a transient condition, so return WAL_RETRY. The

	11679 ** xShmMap() implementation used by the default unix and win32 VFS

	11680 ** modules may return SQLITE_BUSY due to a race condition in the

	11681 ** code that determines whether or not the shared-memory region

	11682 ** must be zeroed before the requested page is returned.

	11683 */

	11684 rc = WAL_RETRY;

	11685 }else if( SQLITE_OK==(rc = walLockShared(pWal, WAL_RECOVER_LOCK)) ){

	11686 walUnlockShared(pWal, WAL_RECOVER_LOCK);

	11687 rc = WAL_RETRY;

	11688 }else if( rc==SQLITE_BUSY ){

	11689 rc = SQLITE_BUSY_RECOVERY;

	11690 }

	11691 }

	11692 if( rc!=SQLITE_OK ){

	11693 return rc;

	11694 }

	11695 }

	11696

	11697 pInfo = walCkptInfo(pWal);

	11698 if( !useWal && pInfo->nBackfill==pWal->hdr.mxFrame

	11699 #ifdef SQLITE_ENABLE_SNAPSHOT

	11700 && (pWal->pSnapshot==0 || pWal->hdr.mxFrame==0

	11701 || 0==memcmp(&pWal->hdr, pWal->pSnapshot, sizeof(WalIndexHdr)))

	11702 #endif

	11703 ){

	11704 /* The WAL has been completely backfilled (or it is empty)

	11705 ** and can be safely ignored.

	11706 */

	11707 rc = walLockShared(pWal, WAL_READ_LOCK(0));

	11708 walShmBarrier(pWal);

	11709 if( rc==SQLITE_OK ){

	11710 if( memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) ){

	11711 /* It is not safe to allow the reader to continue here if frames

	11712 ** may have been appended to the log before READ_LOCK(0) was obtained.

	11713 ** When holding READ_LOCK(0), the reader ignores the entire log file,

	11714 ** which implies that the database file contains a trustworthy

	11715 ** snapshot. Since holding READ_LOCK(0) prevents a checkpoint from

	11716 ** happening, this is usually correct.

	11717 **

	11718 ** However, if frames have been appended to the log (or if the log

	11719 ** is wrapped and written for that matter) before the READ_LOCK(0)

	11720 ** is obtained, that is not necessarily true. A checkpointer may

	11721 ** have started to backfill the appended frames but crashed before

	11722 ** it finished, leaving a corrupt image in the database file.

	11723 */

	11724 walUnlockShared(pWal, WAL_READ_LOCK(0));

	11725 return WAL_RETRY;

	11726 }

	11727 pWal->readLock = 0;

	11728 return SQLITE_OK;

	11729 }else if( rc!=SQLITE_BUSY ){

	11730 return rc;

	11731 }

	11732 }

	11733

	11734 /* If we get this far, it means that the reader will want to use

	11735 ** the WAL to get at content from recent commits. The job now is

	11736 ** to select one of the aReadMark[] entries that is closest to

	11737 ** but not exceeding pWal->hdr.mxFrame and lock that entry.

	11738 */

	11739 mxReadMark = 0;

	11740 mxI = 0;

	11741 mxFrame = pWal->hdr.mxFrame;

	11742 #ifdef SQLITE_ENABLE_SNAPSHOT

	11743 if( pWal->pSnapshot && pWal->pSnapshot->mxFrame<mxFrame ){

	11744 mxFrame = pWal->pSnapshot->mxFrame;

	11745 }

	11746 #endif

	11747 for(i=1; i<WAL_NREADER; i++){

	11748 u32 thisMark = pInfo->aReadMark[i];

	11749 if( mxReadMark<=thisMark && thisMark<=mxFrame ){

	11750 assert( thisMark!=READMARK_NOT_USED );

	11751 mxReadMark = thisMark;

	11752 mxI = i;

	11753 }

	11754 }

	11755 if( (pWal->readOnly & WAL_SHM_RDONLY)==0

	11756 && (mxReadMark<mxFrame || mxI==0)

	11757 ){

	11758 for(i=1; i<WAL_NREADER; i++){

	11759 rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);

	11760 if( rc==SQLITE_OK ){

	11761 mxReadMark = pInfo->aReadMark[i] = mxFrame;

	11762 mxI = i;

	11763 walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);

	11764 break;

	11765 }else if( rc!=SQLITE_BUSY ){

	11766 return rc;

	11767 }

	11768 }

	11769 }

	11770 if( mxI==0 ){

	11771 assert( rc==SQLITE_BUSY || (pWal->readOnly & WAL_SHM_RDONLY)!=0 );

	11772 return rc==SQLITE_BUSY ? WAL_RETRY : SQLITE_READONLY_CANTLOCK;

	11773 }

	11774

	11775 rc = walLockShared(pWal, WAL_READ_LOCK(mxI));

	11776 if( rc ){

	11777 return rc==SQLITE_BUSY ? WAL_RETRY : rc;

	11778 }

	11779 /* Now that the read-lock has been obtained, check that neither the

	11780 ** value in the aReadMark[] array or the contents of the wal-index

	11781 ** header have changed.

	11782 **

	11783 ** It is necessary to check that the wal-index header did not change

	11784 ** between the time it was read and when the shared-lock was obtained

	11785 ** on WAL_READ_LOCK(mxI) to account for the possibility

	11786 ** that the log file may have been wrapped by a writer, or that frames

	11787 ** that occur later in the log than pWal->hdr.mxFrame may have been

	11788 ** copied into the database by a checkpointer. If either of these things

	11789 ** happened, then reading the database with the current value of

	11790 ** pWal->hdr.mxFrame risks reading a corrupted snapshot. So, retry

	11791 ** instead.

	11792 **

	11793 ** Before checking that the live wal-index header has not changed

	11794 ** since it was read, set Wal.minFrame to the first frame in the wal

	11795 ** file that has not yet been checkpointed. This client will not need

	11796 ** to read any frames earlier than minFrame from the wal file - they

	11797 ** can be safely read directly from the database file.

	11798 **

	11799 ** Because a ShmBarrier() call is made between taking the copy of

	11800 ** nBackfill and checking that the wal-header in shared-memory still

	11801 ** matches the one cached in pWal->hdr, it is guaranteed that the

	11802 ** checkpointer that set nBackfill was not working with a wal-index

	11803 ** header newer than that cached in pWal->hdr. If it were, that could

	11804 ** cause a problem. The checkpointer could omit to checkpoint

	11805 ** a version of page X that lies before pWal->minFrame (call that version

	11806 ** A) on the basis that there is a newer version (version B) of the same

	11807 ** page later in the wal file. But if version B happens to lie past

	11808 ** frame pWal->hdr.mxFrame - then the client would incorrectly assume

	11809 ** that it can read version A from the database file. However, since

	11810 ** we can guarantee that the checkpointer that set nBackfill could not

	11811 ** see any pages past pWal->hdr.mxFrame, this problem does not come up.

	11812 */

	11813 pWal->minFrame = pInfo->nBackfill+1;

	11814 walShmBarrier(pWal);

	11815 if( pInfo->aReadMark[mxI]!=mxReadMark

	11816 || memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr))

	11817 ){

	11818 walUnlockShared(pWal, WAL_READ_LOCK(mxI));

	11819 return WAL_RETRY;

	11820 }else{

	11821 assert( mxReadMark<=pWal->hdr.mxFrame );

	11822 pWal->readLock = (i16)mxI;

	11823 }

	11824 return rc;

	11825 }
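
The retry delays described in the comment at the top of walTryBeginRead() can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions rather than part of the patch: `demo_retry_delay_us()` and `demo_total_delay_us()` are hypothetical helpers that reproduce the `cnt>5` and `cnt>=10` branches above and sum the delays over attempts 1 through 100.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_RETRY 100  /* cnt>100 returns SQLITE_PROTOCOL in the code above */

/* Microseconds slept at the start of attempt number cnt (0 means no sleep). */
static int demo_retry_delay_us(int cnt){
  assert( cnt>=1 && cnt<=DEMO_MAX_RETRY );
  if( cnt<=5 ) return 0;          /* first five attempts do not sleep */
  if( cnt<10 ) return 1;          /* attempts 6..9: scheduler yield */
  return (cnt-9)*(cnt-9)*39;      /* attempts 10..100: quadratic backoff */
}

/* Total microseconds slept if every attempt up to the last one is made. */
static int64_t demo_total_delay_us(void){
  int64_t total = 0;
  int cnt;
  for(cnt=1; cnt<=DEMO_MAX_RETRY; cnt++){
    total += demo_retry_delay_us(cnt);
  }
  return total;
}
```

With these definitions the 100th attempt sleeps for 91*91*39 = 322959 microseconds (about 323 milliseconds), and the total is 9958498 microseconds, which is consistent with the "less than 10 seconds" bound stated in the comment.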

	11826

	11827 #ifdef SQLITE_ENABLE_SNAPSHOT

	11828 /*

	11829 ** Attempt to reduce the value of the WalCkptInfo.nBackfillAttempted

	11830 ** variable so that older snapshots can be accessed. To do this, loop

	11831 ** through all wal frames from nBackfillAttempted to (nBackfill+1),

	11832 ** comparing their content to the corresponding page with the database

	11833 ** file, if any. Set nBackfillAttempted to the frame number of the

	11834 ** first frame for which the wal file content matches the db file.

	11835 **

	11836 ** This is only really safe if the file-system is such that any page

	11837 ** writes made by earlier checkpointers were atomic operations, which

	11838 ** is not always true. It is also possible that nBackfillAttempted

	11839 ** may be left set to a value larger than expected, if a wal frame

	11840 ** contains content that is a duplicate of an earlier version of the same

	11841 ** page.

	11842 **

	11843 ** SQLITE_OK is returned if successful, or an SQLite error code if an

	11844 ** error occurs. It is not an error if nBackfillAttempted cannot be

	11845 ** decreased at all.

	11846 */

	11847 SQLITE_PRIVATE int sqlite3WalSnapshotRecover(Wal *pWal){

	11848 int rc;

	11849

	11850 assert( pWal->readLock>=0 );

	11851 rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);

	11852 if( rc==SQLITE_OK ){

	11853 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);

	11854 int szPage = (int)pWal->szPage;

	11855 i64 szDb; /* Size of db file in bytes */

	11856

	11857 rc = sqlite3OsFileSize(pWal->pDbFd, &szDb);

	11858 if( rc==SQLITE_OK ){

	11859 void *pBuf1 = sqlite3_malloc(szPage);

	11860 void *pBuf2 = sqlite3_malloc(szPage);

	11861 if( pBuf1==0 || pBuf2==0 ){

	11862 rc = SQLITE_NOMEM;

	11863 }else{

	11864 u32 i = pInfo->nBackfillAttempted;

	11865 for(i=pInfo->nBackfillAttempted; i>pInfo->nBackfill; i--){

	11866 volatile ht_slot *dummy;

	11867 volatile u32 *aPgno; /* Array of page numbers */

	11868 u32 iZero; /* Frame corresponding to aPgno[0] */

	11869 u32 pgno; /* Page number in db file */

	11870 i64 iDbOff; /* Offset of db file entry */

	11871 i64 iWalOff; /* Offset of wal file entry */

	11872

	11873 rc = walHashGet(pWal, walFramePage(i), &dummy, &aPgno, &iZero);

	11874 if( rc!=SQLITE_OK ) break;

	11875 pgno = aPgno[i-iZero];

	11876 iDbOff = (i64)(pgno-1) * szPage;

	11877

	11878 if( iDbOff+szPage<=szDb ){

	11879 iWalOff = walFrameOffset(i, szPage) + WAL_FRAME_HDRSIZE;

	11880 rc = sqlite3OsRead(pWal->pWalFd, pBuf1, szPage, iWalOff);

	11881

	11882 if( rc==SQLITE_OK ){

	11883 rc = sqlite3OsRead(pWal->pDbFd, pBuf2, szPage, iDbOff);

	11884 }

	11885

	11886 if( rc!=SQLITE_OK || 0==memcmp(pBuf1, pBuf2, szPage) ){

	11887 break;

	11888 }

	11889 }

	11890

	11891 pInfo->nBackfillAttempted = i-1;

	11892 }

	11893 }

	11894

	11895 sqlite3_free(pBuf1);

	11896 sqlite3_free(pBuf2);

	11897 }

	11898 walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);

	11899 }

	11900

	11901 return rc;

	11902 }

	11903 #endif /* SQLITE_ENABLE_SNAPSHOT */

	11904

	11905 /*

	11906 ** Begin a read transaction on the database.

	11907 **

	11908 ** This routine used to be called sqlite3OpenSnapshot() and with good reason:

	11909 ** it takes a snapshot of the state of the WAL and wal-index for the current

	11910 ** instant in time. The current thread will continue to use this snapshot.

	11911 ** Other threads might append new content to the WAL and wal-index but

	11912 ** that extra content is ignored by the current thread.

	11913 **

	11914 ** If the database contents have changed since the previous read

	11915 ** transaction, then *pChanged is set to 1 before returning. The

	11916 ** Pager layer will use this to know that its cache is stale and

	11917 ** needs to be flushed.

	11918 */

	11919 SQLITE_PRIVATE int sqlite3WalBeginReadTransaction(Wal *pWal, int *pChanged){

	11920 int rc; /* Return code */

	11921 int cnt = 0; /* Number of TryBeginRead attempts */

	11922

	11923 #ifdef SQLITE_ENABLE_SNAPSHOT

	11924 int bChanged = 0;

	11925 WalIndexHdr *pSnapshot = pWal->pSnapshot;

	11926 if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){

	11927 bChanged = 1;

	11928 }

	11929 #endif

	11930

	11931 do{

	11932 rc = walTryBeginRead(pWal, pChanged, 0, ++cnt);

	11933 }while( rc==WAL_RETRY );

	11934 testcase( (rc&0xff)==SQLITE_BUSY );

	11935 testcase( (rc&0xff)==SQLITE_IOERR );

	11936 testcase( rc==SQLITE_PROTOCOL );

	11937 testcase( rc==SQLITE_OK );

	11938

	11939 #ifdef SQLITE_ENABLE_SNAPSHOT

	11940 if( rc==SQLITE_OK ){

	11941 if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){

	11942 /* At this point the client has a lock on an aReadMark[] slot holding

	11943 ** a value equal to or smaller than pSnapshot->mxFrame, but pWal->hdr

	11944 ** is populated with the wal-index header corresponding to the head

	11945 ** of the wal file. Verify that pSnapshot is still valid before

	11946 ** continuing. Reasons why pSnapshot might no longer be valid:

	11947 **

	11948 ** (1) The WAL file has been reset since the snapshot was taken.

	11949 ** In this case, the salt will have changed.

	11950 **

	11951 ** (2) A checkpoint has been attempted that wrote frames past

	11952 ** pSnapshot->mxFrame into the database file. Note that the

	11953 ** checkpoint need not have completed for this to cause problems.

	11954 */

	11955 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);

	11956

	11957 assert( pWal->readLock>0 || pWal->hdr.mxFrame==0 );

	11958 assert( pInfo->aReadMark[pWal->readLock]<=pSnapshot->mxFrame );

	11959

	11960 /* It is possible that there is a checkpointer thread running

	11961 ** concurrent with this code. If this is the case, it may be that the

	11962 ** checkpointer has already determined that it will checkpoint

	11963 ** snapshot X, where X is later in the wal file than pSnapshot, but

	11964 ** has not yet set the pInfo->nBackfillAttempted variable to indicate

	11965 ** its intent. To avoid the race condition this leads to, ensure that

	11966 ** there is no checkpointer process by taking a shared CKPT lock

	11967 ** before checking pInfo->nBackfillAttempted.

	11968 **

	11969 ** TODO: Does the aReadMark[] lock prevent a checkpointer from doing

	11970 ** this already?

	11971 */

	11972 rc = walLockShared(pWal, WAL_CKPT_LOCK);

	11973

	11974 if( rc==SQLITE_OK ){

	11975 /* Check that the wal file has not been wrapped. Assuming that it has

	11976 ** not, also check that no checkpointer has attempted to checkpoint any

	11977 ** frames beyond pSnapshot->mxFrame. If either of these checks fails,

	11978 ** return SQLITE_BUSY_SNAPSHOT. Otherwise, overwrite pWal->hdr

	11979 ** with pSnapshot and set pChanged as appropriate for opening the

	11980 ** snapshot. */

	11981 if( !memcmp(pSnapshot->aSalt, pWal->hdr.aSalt, sizeof(pWal->hdr.aSalt))

	11982 && pSnapshot->mxFrame>=pInfo->nBackfillAttempted

	11983 ){

	11984 assert( pWal->readLock>0 );

	11985 memcpy(&pWal->hdr, pSnapshot, sizeof(WalIndexHdr));

	11986 *pChanged = bChanged;

	11987 }else{

	11988 rc = SQLITE_BUSY_SNAPSHOT;

	11989 }

	11990

	11991 /* Release the shared CKPT lock obtained above. */

	11992 walUnlockShared(pWal, WAL_CKPT_LOCK);

	11993 }

	11994

	11995

	11996 if( rc!=SQLITE_OK ){

	11997 sqlite3WalEndReadTransaction(pWal);

	11998 }

	11999 }

	12000 }

	12001 #endif

	12002 return rc;

	12003 }

	12004

	12005 /*

	12006 ** Finish with a read transaction. All this does is release the

	12007 ** read-lock.

	12008 */

	12009 SQLITE_PRIVATE void sqlite3WalEndReadTransaction(Wal *pWal){

	12010 sqlite3WalEndWriteTransaction(pWal);

	12011 if( pWal->readLock>=0 ){

	12012 walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));

	12013 pWal->readLock = -1;

	12014 }

	12015 }

	12016

	12017 /*

	12018 ** Search the wal file for page pgno. If found, set *piRead to the frame that

	12019 ** contains the page. Otherwise, if pgno is not in the wal file, set *piRead

	12020 ** to zero.

	12021 **

	12022 ** Return SQLITE_OK if successful, or an error code if an error occurs. If an

	12023 ** error does occur, the final value of *piRead is undefined.

	12024 */

	12025 SQLITE_PRIVATE int sqlite3WalFindFrame(

	12026 Wal *pWal, /* WAL handle */

	12027 Pgno pgno, /* Database page number to read data for */

	12028 u32 *piRead /* OUT: Frame number (or zero) */

	12029 ){

	12030 u32 iRead = 0; /* If !=0, WAL frame to return data from */

	12031 u32 iLast = pWal->hdr.mxFrame; /* Last page in WAL for this reader */

	12032 int iHash; /* Used to loop through N hash tables */

	12033 int iMinHash;

	12034

	12035 /* This routine is only called from within a read transaction. */

	12036 assert( pWal->readLock>=0 || pWal->lockError );

	12037

	12038 /* If the "last page" field of the wal-index header snapshot is 0, then

	12039 ** no data will be read from the wal under any circumstances. Return early

	12040 ** in this case as an optimization. Likewise, if pWal->readLock==0,

	12041 ** then the WAL is ignored by the reader so return early, as if the

	12042 ** WAL were empty.

	12043 */

	12044 if( iLast==0 || pWal->readLock==0 ){

	12045 *piRead = 0;

	12046 return SQLITE_OK;

	12047 }

	12048

	12049 /* Search the hash table or tables for an entry matching page number

	12050 ** pgno. Each iteration of the following for() loop searches one

	12051 ** hash table (each hash table indexes up to HASHTABLE_NPAGE frames).

	12052 **

	12053 ** This code might run concurrently with the code in walIndexAppend()

	12054 ** that adds entries to the wal-index (and possibly to this hash

	12055 ** table). This means the value just read from the hash

	12056 ** slot (aHash[iKey]) may have been added before or after the

	12057 ** current read transaction was opened. Values added after the

	12058 ** read transaction was opened may have been written incorrectly -

	12059 ** i.e. these slots may contain garbage data. However, we assume

	12060 ** that any slots written before the current read transaction was

	12061 ** opened remain unmodified.

	12062 **

	12063 ** For the reasons above, the if(...) condition featured in the inner

	12064 ** loop of the following block is more stringent than would be required

	12065 ** if we had exclusive access to the hash-table:

	12066 **

	12067 ** (aPgno[iFrame]==pgno):

	12068 ** This condition filters out normal hash-table collisions.

	12069 **

	12070 ** (iFrame<=iLast):

	12071 ** This condition filters out entries that were added to the hash

	12072 ** table after the current read-transaction had started.

	12073 */

	12074 iMinHash = walFramePage(pWal->minFrame);

	12075 for(iHash=walFramePage(iLast); iHash>=iMinHash && iRead==0; iHash--){

	12076 volatile ht_slot *aHash; /* Pointer to hash table */

	12077 volatile u32 *aPgno; /* Pointer to array of page numbers */

	12078 u32 iZero; /* Frame number corresponding to aPgno[0] */

	12079 int iKey; /* Hash slot index */

	12080 int nCollide; /* Number of hash collisions remaining */

	12081 int rc; /* Error code */

	12082

	12083 rc = walHashGet(pWal, iHash, &aHash, &aPgno, &iZero);

	12084 if( rc!=SQLITE_OK ){

	12085 return rc;

	12086 }

	12087 nCollide = HASHTABLE_NSLOT;

	12088 for(iKey=walHash(pgno); aHash[iKey]; iKey=walNextHash(iKey)){

	12089 u32 iFrame = aHash[iKey] + iZero;

	12090 if( iFrame<=iLast && iFrame>=pWal->minFrame && aPgno[aHash[iKey]]==pgno ){

	12091 assert( iFrame>iRead || CORRUPT_DB );

	12092 iRead = iFrame;

	12093 }

	12094 if( (nCollide--)==0 ){

	12095 return SQLITE_CORRUPT_BKPT;

	12096 }

	12097 }

	12098 }

	12099

	12100 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT

	12101 /* If expensive assert() statements are available, do a linear search

	12102 ** of the wal-index file content. Make sure the results agree with the

	12103 ** result obtained using the hash indexes above. */

	12104 {

	12105 u32 iRead2 = 0;

	12106 u32 iTest;

	12107 assert( pWal->minFrame>0 );

	12108 for(iTest=iLast; iTest>=pWal->minFrame; iTest--){

	12109 if( walFramePgno(pWal, iTest)==pgno ){

	12110 iRead2 = iTest;

	12111 break;

	12112 }

	12113 }

	12114 assert( iRead==iRead2 );

	12115 }

	12116 #endif

	12117

	12118 *piRead = iRead;

	12119 return SQLITE_OK;

	12120 }
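
The lookup loop in sqlite3WalFindFrame() follows a probe chain and keeps the last match that is not newer than the reader's snapshot. The following is a reduced sketch of that idea only. `DemoIndex`, `DEMO_HASH`, `DEMO_NEXT`, `demo_append()` and `demo_find()` are hypothetical names with an assumed hash function and a single tiny table; the real code spans several hash tables, applies an iZero offset and the minFrame bound, and limits the number of collisions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSLOT 16                        /* hash slots (power of two) */
#define DEMO_NFRAME (DEMO_NSLOT/2)           /* max frames, keeps slots free */
#define DEMO_HASH(pg) (((pg)*383u) & (DEMO_NSLOT-1))   /* assumed hash */
#define DEMO_NEXT(k)  (((k)+1) & (DEMO_NSLOT-1))       /* linear probing */

typedef struct DemoIndex {
  uint16_t aHash[DEMO_NSLOT];      /* 0 = empty, else 1-based frame number */
  uint32_t aPgno[DEMO_NFRAME+1];   /* aPgno[frame] = page stored in frame */
  uint32_t nFrame;                 /* frames appended so far */
} DemoIndex;

/* Append a frame holding page pgno; it takes the first free slot in the
** probe chain, so newer frames for a page sit later in the chain. */
static void demo_append(DemoIndex *p, uint32_t pgno){
  uint32_t k;
  assert( p->nFrame<DEMO_NFRAME );
  p->nFrame++;
  p->aPgno[p->nFrame] = pgno;
  for(k=DEMO_HASH(pgno); p->aHash[k]; k=DEMO_NEXT(k)){}
  p->aHash[k] = (uint16_t)p->nFrame;
}

/* Return the newest frame <= iLast that holds pgno, or 0 if there is none. */
static uint32_t demo_find(const DemoIndex *p, uint32_t pgno, uint32_t iLast){
  uint32_t iRead = 0;
  uint32_t k;
  for(k=DEMO_HASH(pgno); p->aHash[k]; k=DEMO_NEXT(k)){
    uint32_t iFrame = p->aHash[k];
    /* The page-number test filters collisions; the iLast test filters
    ** frames appended after the reader's snapshot was taken. */
    if( iFrame<=iLast && p->aPgno[iFrame]==pgno ) iRead = iFrame;
  }
  return iRead;
}
```

For example, after appending pages 3, 7 and 3 to a zero-initialized `DemoIndex`, a reader whose snapshot ends at frame 2 finds page 3 in frame 1, while a reader whose snapshot ends at frame 3 finds it in frame 3.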

	12121

	12122 /*

	12123 ** Read the contents of frame iRead from the wal file into buffer pOut

	12124 ** (which is nOut bytes in size). Return SQLITE_OK if successful, or an

	12125 ** error code otherwise.

	12126 */

	12127 SQLITE_PRIVATE int sqlite3WalReadFrame(

	12128 Wal *pWal, /* WAL handle */

	12129 u32 iRead, /* Frame to read */

	12130 int nOut, /* Size of buffer pOut in bytes */

	12131 u8 *pOut /* Buffer to write page data to */

	12132 ){

	12133 int sz;

	12134 i64 iOffset;

	12135 sz = pWal->hdr.szPage;

	12136 sz = (sz&0xfe00) + ((sz&0x0001)<<16);

	12137 testcase( sz<=32768 );

	12138 testcase( sz>=65536 );

	12139 iOffset = walFrameOffset(iRead, sz) + WAL_FRAME_HDRSIZE;

	12140 /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */

	12141 return sqlite3OsRead(pWal->pWalFd, pOut, (nOut>sz ? sz : nOut), iOffset);

	12142 }
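
Both walIndexTryHdr() and sqlite3WalReadFrame() expand the 16-bit szPage header field with the same expression, which lets the value 1 stand for a 65536-byte page. A minimal sketch of that decoding follows. `demo_decode_page_size()` mirrors the expression used above; `demo_encode_page_size()` is a hypothetical inverse added only for the round trip, and the assumption that valid page sizes are powers of two from 512 to 65536 is not stated in this part of the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Expand the 16-bit header field into a page size in bytes. */
static uint32_t demo_decode_page_size(uint16_t szField){
  return (uint32_t)(szField & 0xfe00) + ((uint32_t)(szField & 0x0001) << 16);
}

/* Hypothetical inverse: pack a page size into 16 bits (65536 becomes 1). */
static uint16_t demo_encode_page_size(uint32_t szPage){
  /* Assumed valid range: a power of two between 512 and 65536. */
  assert( szPage>=512 && szPage<=65536 && (szPage & (szPage-1))==0 );
  return (uint16_t)((szPage & 0xff00) | (szPage >> 16));
}
```

Under these assumptions a field value of 4096 decodes to 4096, and a field value of 1 decodes to 65536, which would not otherwise fit in 16 bits.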

	12143

	12144 /*

	12145 ** Return the size of the database in pages (or zero, if unknown).

	12146 */

	12147 SQLITE_PRIVATE Pgno sqlite3WalDbsize(Wal *pWal){

	12148 if( pWal && ALWAYS(pWal->readLock>=0) ){

	12149 return pWal->hdr.nPage;

	12150 }

	12151 return 0;

	12152 }

	12153

	12154

	12155 /*

	12156 ** This function starts a write transaction on the WAL.

	12157 **

	12158 ** A read transaction must have already been started by a prior call

	12159 ** to sqlite3WalBeginReadTransaction().

	12160 **

	12161 ** If another thread or process has written into the database since

	12162 ** the read transaction was started, then it is not possible for this

	12163 ** thread to write as doing so would cause a fork. So this routine

	12164 ** returns SQLITE_BUSY in that case and no write transaction is started.

	12165 **

	12166 ** There can only be a single writer active at a time.

	12167 */

	12168 SQLITE_PRIVATE int sqlite3WalBeginWriteTransaction(Wal *pWal){

	12169 int rc;

	12170

	12171 /* Cannot start a write transaction without first holding a read

	12172 ** transaction. */

	12173 assert( pWal->readLock>=0 );

	12174 assert( pWal->writeLock==0 && pWal->iReCksum==0 );

	12175

	12176 if( pWal->readOnly ){

	12177 return SQLITE_READONLY;

	12178 }

	12179

	12180 /* Only one writer allowed at a time. Get the write lock. Return

	12181 ** SQLITE_BUSY if unable.

	12182 */

	12183 rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1);

	12184 if( rc ){

	12185 return rc;

	12186 }

	12187 pWal->writeLock = 1;

	12188

	12189 /* If another connection has written to the database file since the

	12190 ** time the read transaction on this connection was started, then

	12191 ** the write is disallowed.

	12192 */

	12193 if( memcmp(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr))!=0 ){

	12194 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);

	12195 pWal->writeLock = 0;

	12196 rc = SQLITE_BUSY_SNAPSHOT;

	12197 }

	12198

	12199 return rc;

	12200 }

	12201

	12202 /*

	12203 ** End a write transaction. The commit has already been done. This

	12204 ** routine merely releases the lock.

	12205 */

	12206 SQLITE_PRIVATE int sqlite3WalEndWriteTransaction(Wal *pWal){

	12207 if( pWal->writeLock ){

	12208 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);

	12209 pWal->writeLock = 0;

	12210 pWal->iReCksum = 0;

	12211 pWal->truncateOnCommit = 0;

	12212 }

	12213 return SQLITE_OK;

	12214 }

	12215

	12216 /*

	12217 ** If any data has been written (but not committed) to the log file, this

	12218 ** function moves the write-pointer back to the start of the transaction.

	12219 **

	12220 ** Additionally, the callback function is invoked for each frame written

	12221 ** to the WAL since the start of the transaction. If the callback returns

	12222 ** other than SQLITE_OK, it is not invoked again and the error code is

	12223 ** returned to the caller.

	12224 **

	12225 ** Otherwise, if the callback function does not return an error, this

	12226 ** function returns SQLITE_OK.

	12227 */

	12228 SQLITE_PRIVATE int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){

	12229 int rc = SQLITE_OK;

	12230 if( ALWAYS(pWal->writeLock) ){

	12231 Pgno iMax = pWal->hdr.mxFrame;

	12232 Pgno iFrame;

	12233

	12234 /* Restore the client's cache of the wal-index header to the state it

	12235 ** was in before the client began writing to the database.

	12236 */

	12237 memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));

	12238

	12239 for(iFrame=pWal->hdr.mxFrame+1;

	12240 ALWAYS(rc==SQLITE_OK) && iFrame<=iMax;

	12241 iFrame++

	12242 ){

	12243 /* This call cannot fail. Unless the page for which the page number

	12244 ** is passed as the second argument is (a) in the cache and

	12245 ** (b) has an outstanding reference, then xUndo is either a no-op

	12246 ** (if (a) is false) or simply expels the page from the cache (if (b)

	12247 ** is false).

	12248 **

	12249 ** If the upper layer is doing a rollback, it is guaranteed that there

	12250 ** are no outstanding references to any page other than page 1. And

	12251 ** page 1 is never written to the log until the transaction is

	12252 ** committed. As a result, the call to xUndo may not fail.

	12253 */

	12254 assert( walFramePgno(pWal, iFrame)!=1 );

	12255 rc = xUndo(pUndoCtx, walFramePgno(pWal, iFrame));

	12256 }

	12257 if( iMax!=pWal->hdr.mxFrame ) walCleanupHash(pWal);

	12258 }

	12259 return rc;

	12260 }

	12261

	12262 /*

	12263 ** Argument aWalData must point to an array of WAL_SAVEPOINT_NDATA u32

	12264 ** values. This function populates the array with values required to

	12265 ** "rollback" the write position of the WAL handle back to the current

	12266 ** point in the event of a savepoint rollback (via WalSavepointUndo()).

	12267 */

	12268 SQLITE_PRIVATE void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData){

	12269 assert( pWal->writeLock );

	12270 aWalData[0] = pWal->hdr.mxFrame;

	12271 aWalData[1] = pWal->hdr.aFrameCksum[0];

	12272 aWalData[2] = pWal->hdr.aFrameCksum[1];

	12273 aWalData[3] = pWal->nCkpt;

	12274 }

	12275

	12276 /*

	12277 ** Move the write position of the WAL back to the point identified by

	12278 ** the values in the aWalData[] array. aWalData must point to an array

	12279 ** of WAL_SAVEPOINT_NDATA u32 values that has been previously populated

	12280 ** by a call to WalSavepoint().

	12281 */

	12282 SQLITE_PRIVATE int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData){

	12283 int rc = SQLITE_OK;

	12284

	12285 assert( pWal->writeLock );

	12286 assert( aWalData[3]!=pWal->nCkpt || aWalData[0]<=pWal->hdr.mxFrame );

	12287

	12288 if( aWalData[3]!=pWal->nCkpt ){

	12289 /* This savepoint was opened immediately after the write-transaction

	12290 ** was started. Right after that, the writer decided to wrap around

	12291 ** to the start of the log. Update the savepoint values to match.

	12292 */

	12293 aWalData[0] = 0;

	12294 aWalData[3] = pWal->nCkpt;

	12295 }

	12296

	12297 if( aWalData[0]<pWal->hdr.mxFrame ){

	12298 pWal->hdr.mxFrame = aWalData[0];

	12299 pWal->hdr.aFrameCksum[0] = aWalData[1];

	12300 pWal->hdr.aFrameCksum[1] = aWalData[2];

	12301 walCleanupHash(pWal);

	12302 }

	12303

	12304 return rc;

	12305 }
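The savepoint pair above can be illustrated outside SQLite. The following is a minimal sketch under stated assumptions, not the library's code: `DemoWal`, `demoSavepoint()` and `demoSavepointUndo()` are hypothetical names that keep only the four values the real routines save (frame count, two checksum words, checkpoint counter), and the hash-table cleanup is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for the fields of Wal that the
** savepoint routines touch. Not an SQLite type. */
typedef struct DemoWal {
  uint32_t mxFrame;      /* Index of the last frame written */
  uint32_t aCksum[2];    /* Running checksum after frame mxFrame */
  uint32_t nCkpt;        /* Bumped each time the log is restarted */
} DemoWal;

/* Record the current write position in a[0..3]. */
static void demoSavepoint(const DemoWal *w, uint32_t a[4]){
  a[0] = w->mxFrame;
  a[1] = w->aCksum[0];
  a[2] = w->aCksum[1];
  a[3] = w->nCkpt;
}

/* Move the write position back to the one recorded in a[0..3]. */
static void demoSavepointUndo(DemoWal *w, uint32_t a[4]){
  assert( a[3]!=w->nCkpt || a[0]<=w->mxFrame );
  if( a[3]!=w->nCkpt ){
    /* The log was restarted after the savepoint was taken, so the
    ** saved frame index now refers to the start of the new log. */
    a[0] = 0;
    a[3] = w->nCkpt;
  }
  if( a[0]<w->mxFrame ){
    w->mxFrame = a[0];
    w->aCksum[0] = a[1];
    w->aCksum[1] = a[2];
  }
}
```

Taking a savepoint at frame 10, appending five frames and calling the undo routine returns `mxFrame` to 10 with the saved checksums. If the checkpoint counter changed in between, the saved frame index is first reset to 0.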

	12306

	12307 /*

	12308 ** This function is called just before writing a set of frames to the log

	12309 ** file (see sqlite3WalFrames()). It checks to see if, instead of appending

	12310 ** to the current log file, it is possible to overwrite the start of the

	12311 ** existing log file with the new frames (i.e. "reset" the log). If so,

	12312 ** it sets pWal->hdr.mxFrame to 0. Otherwise, pWal->hdr.mxFrame is left

	12313 ** unchanged.

	12314 **

	12315 ** SQLITE_OK is returned if no error is encountered (regardless of whether

	12316 ** or not pWal->hdr.mxFrame is modified). An SQLite error code is returned

	12317 ** if an error occurs.

	12318 */

	12319 static int walRestartLog(Wal *pWal){

	12320 int rc = SQLITE_OK;

	12321 int cnt;

	12322

	12323 if( pWal->readLock==0 ){

	12324 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);

	12325 assert( pInfo->nBackfill==pWal->hdr.mxFrame );

	12326 if( pInfo->nBackfill>0 ){

	12327 u32 salt1;

	12328 sqlite3_randomness(4, &salt1);

	12329 rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);

	12330 if( rc==SQLITE_OK ){

	12331 /* If all readers are using WAL_READ_LOCK(0) (in other words if no

	12332 ** readers are currently using the WAL), then the transactions

	12333 ** frames will overwrite the start of the existing log. Update the

	12334 ** wal-index header to reflect this.

	12335 **

	12336 ** In theory it would be Ok to update the cache of the header only

	12337 ** at this point. But updating the actual wal-index header is also

	12338 ** safe and means there is no special case for sqlite3WalUndo()

	12339 ** to handle if this transaction is rolled back. */

	12340 walRestartHdr(pWal, salt1);

	12341 walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);

	12342 }else if( rc!=SQLITE_BUSY ){

	12343 return rc;

	12344 }

	12345 }

	12346 walUnlockShared(pWal, WAL_READ_LOCK(0));

	12347 pWal->readLock = -1;

	12348 cnt = 0;

	12349 do{

	12350 int notUsed;

	12351 rc = walTryBeginRead(pWal, &notUsed, 1, ++cnt);

	12352 }while( rc==WAL_RETRY );

	12353 assert( (rc&0xff)!=SQLITE_BUSY ); /* BUSY not possible when useWal==1 */

	12354 testcase( (rc&0xff)==SQLITE_IOERR );

	12355 testcase( rc==SQLITE_PROTOCOL );

	12356 testcase( rc==SQLITE_OK );

	12357 }

	12358 return rc;

	12359 }

	12360

	12361 /*

	12362 ** Information about the current state of the WAL file and where

	12363 ** the next fsync should occur - passed from sqlite3WalFrames() into

	12364 ** walWriteToLog().

	12365 */

	12366 typedef struct WalWriter {

	12367 Wal *pWal; /* The complete WAL information */

	12368 sqlite3_file *pFd; /* The WAL file to which we write */

	12369 sqlite3_int64 iSyncPoint; /* Fsync at this offset */

	12370 int syncFlags; /* Flags for the fsync */

	12371 int szPage; /* Size of one page */

	12372 } WalWriter;

	12373

	12374 /*

	12375 ** Write iAmt bytes of content into the WAL file beginning at iOffset.

	12376 ** Do a sync when crossing the p->iSyncPoint boundary.

	12377 **

	12378 ** In other words, if iSyncPoint is in between iOffset and iOffset+iAmt,

	12379 ** first write the part before iSyncPoint, then sync, then write the

	12380 ** rest.

	12381 */

	12382 static int walWriteToLog(

	12383 WalWriter *p, /* WAL to write to */

	12384 void *pContent, /* Content to be written */

	12385 int iAmt, /* Number of bytes to write */

	12386 sqlite3_int64 iOffset /* Start writing at this offset */

	12387 ){

	12388 int rc;

	12389 if( iOffset<p->iSyncPoint && iOffset+iAmt>=p->iSyncPoint ){

	12390 int iFirstAmt = (int)(p->iSyncPoint - iOffset);

	12391 rc = sqlite3OsWrite(p->pFd, pContent, iFirstAmt, iOffset);

	12392 if( rc ) return rc;

	12393 iOffset += iFirstAmt;

	12394 iAmt -= iFirstAmt;

	12395 pContent = (void*)(iFirstAmt + (char*)pContent);

	12396 assert( p->syncFlags & (SQLITE_SYNC_NORMAL|SQLITE_SYNC_FULL) );

	12397 rc = sqlite3OsSync(p->pFd, p->syncFlags & SQLITE_SYNC_MASK);

	12398 if( iAmt==0 || rc ) return rc;

	12399 }

	12400 rc = sqlite3OsWrite(p->pFd, pContent, iAmt, iOffset);

	12401 return rc;

	12402 }
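The split-write rule described for walWriteToLog() (write up to the sync point, sync, then write the rest) can be tried in isolation. This is a sketch under stated assumptions: `DemoFile`, `demoWrite()`, `demoSync()` and `demoWriteWithSync()` are hypothetical names, and an in-memory buffer stands in for the WAL file and the VFS calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the WAL file. Not an SQLite type. */
typedef struct DemoFile {
  unsigned char aData[64];   /* File content */
  int nSync;                 /* Number of sync calls so far */
  long long iSyncedUpTo;     /* File size at the time of the last sync */
  long long iHighWater;      /* Current file size */
} DemoFile;

static void demoWrite(DemoFile *f, const void *p, int n, long long off){
  assert( off>=0 && off+n<=(long long)sizeof(f->aData) );
  memcpy(&f->aData[off], p, (size_t)n);
  if( off+n>f->iHighWater ) f->iHighWater = off+n;
}

static void demoSync(DemoFile *f){
  f->nSync++;
  f->iSyncedUpTo = f->iHighWater;
}

/* Write n bytes at offset off. If the write reaches or crosses
** iSyncPoint, write the part before it, sync, then write the rest. */
static void demoWriteWithSync(
  DemoFile *f, long long iSyncPoint, const void *p, int n, long long off
){
  if( off<iSyncPoint && off+n>=iSyncPoint ){
    int nFirst = (int)(iSyncPoint - off);
    demoWrite(f, p, nFirst, off);
    off += nFirst;
    n -= nFirst;
    p = (const char*)p + nFirst;
    demoSync(f);
    if( n==0 ) return;
  }
  demoWrite(f, p, n, off);
}
```

Writing 24 bytes at offset 8 with a sync point of 16 produces one sync that covers the first 16 bytes of the file. The remaining 16 bytes land after the sync, which matches the "part of the last frame that extends past the sector boundary is written after the sync" behaviour described for commits.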

	12403

	12404 /*

	12405 ** Write out a single frame of the WAL

	12406 */

	12407 static int walWriteOneFrame(

	12408 WalWriter *p, /* Where to write the frame */

	12409 PgHdr *pPage, /* The page of the frame to be written */

	12410 int nTruncate, /* The commit flag. Usually 0. >0 for commit */

	12411 sqlite3_int64 iOffset /* Byte offset at which to write */

	12412 ){

	12413 int rc; /* Result code from subfunctions */

	12414 void *pData; /* Data actually written */

	12415 u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-header in */

	12416 #if defined(SQLITE_HAS_CODEC)

	12417 if( (pData = sqlite3PagerCodec(pPage))==0 ) return SQLITE_NOMEM_BKPT;

	12418 #else

	12419 pData = pPage->pData;

	12420 #endif

	12421 walEncodeFrame(p->pWal, pPage->pgno, nTruncate, pData, aFrame);

	12422 rc = walWriteToLog(p, aFrame, sizeof(aFrame), iOffset);

	12423 if( rc ) return rc;

	12424 /* Write the page data */

	12425 rc = walWriteToLog(p, pData, p->szPage, iOffset+sizeof(aFrame));

	12426 return rc;

	12427 }

	12428

	12429 /*

	12430 ** This function is called as part of committing a transaction within which

	12431 ** one or more frames have been overwritten. It updates the checksums for

	12432 ** all frames written to the wal file by the current transaction starting

	12433 ** with the earliest to have been overwritten.

	12434 **

	12435 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.

	12436 */

	12437 static int walRewriteChecksums(Wal *pWal, u32 iLast){

	12438 const int szPage = pWal->szPage;/* Database page size */

	12439 int rc = SQLITE_OK; /* Return code */

	12440 u8 *aBuf; /* Buffer to load data from wal file into */

	12441 u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-headers in */

	12442 u32 iRead; /* Next frame to read from wal file */

	12443 i64 iCksumOff;

	12444

	12445 aBuf = sqlite3_malloc(szPage + WAL_FRAME_HDRSIZE);

	12446 if( aBuf==0 ) return SQLITE_NOMEM_BKPT;

	12447

	12448 /* Find the checksum values to use as input for recalculating the

	12449 ** first checksum. If the first frame is frame 1 (implying that the current

	12450 ** transaction restarted the wal file), these values must be read from the

	12451 ** wal-file header. Otherwise, read them from the frame header of the

	12452 ** previous frame. */

	12453 assert( pWal->iReCksum>0 );

	12454 if( pWal->iReCksum==1 ){

	12455 iCksumOff = 24;

	12456 }else{

	12457 iCksumOff = walFrameOffset(pWal->iReCksum-1, szPage) + 16;

	12458 }

	12459 rc = sqlite3OsRead(pWal->pWalFd, aBuf, sizeof(u32)*2, iCksumOff);

	12460 pWal->hdr.aFrameCksum[0] = sqlite3Get4byte(aBuf);

	12461 pWal->hdr.aFrameCksum[1] = sqlite3Get4byte(&aBuf[sizeof(u32)]);

	12462

	12463 iRead = pWal->iReCksum;

	12464 pWal->iReCksum = 0;

	12465 for(; rc==SQLITE_OK && iRead<=iLast; iRead++){

	12466 i64 iOff = walFrameOffset(iRead, szPage);

	12467 rc = sqlite3OsRead(pWal->pWalFd, aBuf, szPage+WAL_FRAME_HDRSIZE, iOff);

	12468 if( rc==SQLITE_OK ){

	12469 u32 iPgno, nDbSize;

	12470 iPgno = sqlite3Get4byte(aBuf);

	12471 nDbSize = sqlite3Get4byte(&aBuf[4]);

	12472

	12473 walEncodeFrame(pWal, iPgno, nDbSize, &aBuf[WAL_FRAME_HDRSIZE], aFrame);

	12474 rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOff);

	12475 }

	12476 }

	12477

	12478 sqlite3_free(aBuf);

	12479 return rc;

	12480 }

	12481

	12482 /*

	12483 ** Write a set of frames to the log. The caller must hold the write-lock

	12484 ** on the log file (obtained using sqlite3WalBeginWriteTransaction()).

	12485 */

	12486 SQLITE_PRIVATE int sqlite3WalFrames(

	12487 Wal *pWal, /* Wal handle to write to */

	12488 int szPage, /* Database page-size in bytes */

	12489 PgHdr *pList, /* List of dirty pages to write */

	12490 Pgno nTruncate, /* Database size after this commit */

	12491 int isCommit, /* True if this is a commit */

	12492 int sync_flags /* Flags to pass to OsSync() (or 0) */

	12493 ){

	12494 int rc; /* Used to catch return codes */

	12495 u32 iFrame; /* Next frame address */

	12496 PgHdr *p; /* Iterator to run through pList with. */

	12497 PgHdr *pLast = 0; /* Last frame in list */

	12498 int nExtra = 0; /* Number of extra copies of last page */

	12499 int szFrame; /* The size of a single frame */

	12500 i64 iOffset; /* Next byte to write in WAL file */

	12501 WalWriter w; /* The writer */

	12502 u32 iFirst = 0; /* First frame that may be overwritten */

	12503 WalIndexHdr *pLive; /* Pointer to shared header */

	12504

	12505 assert( pList );

	12506 assert( pWal->writeLock );

	12507

	12508 /* If this frame set completes a transaction, then nTruncate>0. If

	12509 ** nTruncate==0 then this frame set does not complete the transaction. */

	12510 assert( (isCommit!=0)==(nTruncate!=0) );

	12511

	12512 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)

	12513 { int cnt; for(cnt=0, p=pList; p; p=p->pDirty, cnt++){}

	12514 WALTRACE(("WAL%p: frame write begin. %d frames. mxFrame=%d. %s\n",

	12515 pWal, cnt, pWal->hdr.mxFrame, isCommit ? "Commit" : "Spill"));

	12516 }

	12517 #endif

	12518

	12519 pLive = (WalIndexHdr*)walIndexHdr(pWal);

	12520 if( memcmp(&pWal->hdr, (void *)pLive, sizeof(WalIndexHdr))!=0 ){

	12521 iFirst = pLive->mxFrame+1;

	12522 }

	12523

	12524 /* See if it is possible to write these frames into the start of the

	12525 ** log file, instead of appending to it at pWal->hdr.mxFrame.

	12526 */

	12527 if( SQLITE_OK!=(rc = walRestartLog(pWal)) ){

	12528 return rc;

	12529 }

	12530

	12531 /* If this is the first frame written into the log, write the WAL

	12532 ** header to the start of the WAL file. See comments at the top of

	12533 ** this source file for a description of the WAL header format.

	12534 */

	12535 iFrame = pWal->hdr.mxFrame;

	12536 if( iFrame==0 ){

	12537 u8 aWalHdr[WAL_HDRSIZE]; /* Buffer to assemble wal-header in */

	12538 u32 aCksum[2]; /* Checksum for wal-header */

	12539

	12540 sqlite3Put4byte(&aWalHdr[0], (WAL_MAGIC | SQLITE_BIGENDIAN));

	12541 sqlite3Put4byte(&aWalHdr[4], WAL_MAX_VERSION);

	12542 sqlite3Put4byte(&aWalHdr[8], szPage);

	12543 sqlite3Put4byte(&aWalHdr[12], pWal->nCkpt);

	12544 if( pWal->nCkpt==0 ) sqlite3_randomness(8, pWal->hdr.aSalt);

	12545 memcpy(&aWalHdr[16], pWal->hdr.aSalt, 8);

	12546 walChecksumBytes(1, aWalHdr, WAL_HDRSIZE-2*4, 0, aCksum);

	12547 sqlite3Put4byte(&aWalHdr[24], aCksum[0]);

	12548 sqlite3Put4byte(&aWalHdr[28], aCksum[1]);

	12549

	12550 pWal->szPage = szPage;

	12551 pWal->hdr.bigEndCksum = SQLITE_BIGENDIAN;

	12552 pWal->hdr.aFrameCksum[0] = aCksum[0];

	12553 pWal->hdr.aFrameCksum[1] = aCksum[1];

	12554 pWal->truncateOnCommit = 1;

	12555

	12556 rc = sqlite3OsWrite(pWal->pWalFd, aWalHdr, sizeof(aWalHdr), 0);

	12557 WALTRACE(("WAL%p: wal-header write %s\n", pWal, rc ? "failed" : "ok"));

	12558 if( rc!=SQLITE_OK ){

	12559 return rc;

	12560 }

	12561

	12562 /* Sync the header (unless SQLITE_IOCAP_SEQUENTIAL is true or unless

	12563 ** all syncing is turned off by PRAGMA synchronous=OFF). Otherwise

	12564 ** an out-of-order write following a WAL restart could result in

	12565 ** database corruption. See the ticket:

	12566 **

	12567 ** http://localhost:591/sqlite/info/ff5be73dee

	12568 */

	12569 if( pWal->syncHeader && sync_flags ){

	12570 rc = sqlite3OsSync(pWal->pWalFd, sync_flags & SQLITE_SYNC_MASK);

	12571 if( rc ) return rc;

	12572 }

	12573 }

	12574 assert( (int)pWal->szPage==szPage );

	12575

	12576 /* Setup information needed to write frames into the WAL */

	12577 w.pWal = pWal;

	12578 w.pFd = pWal->pWalFd;

	12579 w.iSyncPoint = 0;

	12580 w.syncFlags = sync_flags;

	12581 w.szPage = szPage;

	12582 iOffset = walFrameOffset(iFrame+1, szPage);

	12583 szFrame = szPage + WAL_FRAME_HDRSIZE;

	12584

	12585 /* Write all frames into the log file exactly once */

	12586 for(p=pList; p; p=p->pDirty){

	12587 int nDbSize; /* 0 normally. Positive == commit flag */

	12588

	12589 /* Check if this page has already been written into the wal file by

	12590 ** the current transaction. If so, overwrite the existing frame and

	12591 ** set Wal.writeLock to WAL_WRITELOCK_RECKSUM - indicating that

	12592 ** checksums must be recomputed when the transaction is committed. */

	12593 if( iFirst && (p->pDirty || isCommit==0) ){

	12594 u32 iWrite = 0;

	12595 VVA_ONLY(rc =) sqlite3WalFindFrame(pWal, p->pgno, &iWrite);

	12596 assert( rc==SQLITE_OK || iWrite==0 );

	12597 if( iWrite>=iFirst ){

	12598 i64 iOff = walFrameOffset(iWrite, szPage) + WAL_FRAME_HDRSIZE;

	12599 void *pData;

	12600 if( pWal->iReCksum==0 || iWrite<pWal->iReCksum ){

	12601 pWal->iReCksum = iWrite;

	12602 }

	12603 #if defined(SQLITE_HAS_CODEC)

	12604 if( (pData = sqlite3PagerCodec(p))==0 ) return SQLITE_NOMEM;

	12605 #else

	12606 pData = p->pData;

	12607 #endif

	12608 rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOff);

	12609 if( rc ) return rc;

	12610 p->flags &= ~PGHDR_WAL_APPEND;

	12611 continue;

	12612 }

	12613 }

	12614

	12615 iFrame++;

	12616 assert( iOffset==walFrameOffset(iFrame, szPage) );

	12617 nDbSize = (isCommit && p->pDirty==0) ? nTruncate : 0;

	12618 rc = walWriteOneFrame(&w, p, nDbSize, iOffset);

	12619 if( rc ) return rc;

	12620 pLast = p;

	12621 iOffset += szFrame;

	12622 p->flags |= PGHDR_WAL_APPEND;

	12623 }

	12624

	12625 /* Recalculate checksums within the wal file if required. */

	12626 if( isCommit && pWal->iReCksum ){

	12627 rc = walRewriteChecksums(pWal, iFrame);

	12628 if( rc ) return rc;

	12629 }

	12630

	12631 /* If this is the end of a transaction, then we might need to pad

	12632 ** the transaction and/or sync the WAL file.

	12633 **

	12634 ** Padding and syncing only occur if this set of frames complete a

	12635 ** transaction and if PRAGMA synchronous=FULL. If synchronous==NORMAL

	12636 ** or synchronous==OFF, then no padding or syncing are needed.

	12637 **

	12638 ** If SQLITE_IOCAP_POWERSAFE_OVERWRITE is defined, then padding is not

	12639 ** needed and only the sync is done. If padding is needed, then the

	12640 ** final frame is repeated (with its commit mark) until the next sector

	12641 ** boundary is crossed. Only the part of the WAL prior to the last

	12642 ** sector boundary is synced; the part of the last frame that extends

	12643 ** past the sector boundary is written after the sync.

	12644 */

	12645 if( isCommit && (sync_flags & WAL_SYNC_TRANSACTIONS)!=0 ){

	12646 int bSync = 1;

	12647 if( pWal->padToSectorBoundary ){

	12648 int sectorSize = sqlite3SectorSize(pWal->pWalFd);

	12649 w.iSyncPoint = ((iOffset+sectorSize-1)/sectorSize)*sectorSize;

	12650 bSync = (w.iSyncPoint==iOffset);

	12651 testcase( bSync );

	12652 while( iOffset<w.iSyncPoint ){

	12653 rc = walWriteOneFrame(&w, pLast, nTruncate, iOffset);

	12654 if( rc ) return rc;

	12655 iOffset += szFrame;

	12656 nExtra++;

	12657 }

	12658 }

	12659 if( bSync ){

	12660 assert( rc==SQLITE_OK );

	12661 rc = sqlite3OsSync(w.pFd, sync_flags & SQLITE_SYNC_MASK);

	12662 }

	12663 }

	12664

	12665 /* If this frame set completes the first transaction in the WAL and

	12666 ** if PRAGMA journal_size_limit is set, then truncate the WAL to the

	12667 ** journal size limit, if possible.

	12668 */

	12669 if( isCommit && pWal->truncateOnCommit && pWal->mxWalSize>=0 ){

	12670 i64 sz = pWal->mxWalSize;

	12671 if( walFrameOffset(iFrame+nExtra+1, szPage)>pWal->mxWalSize ){

	12672 sz = walFrameOffset(iFrame+nExtra+1, szPage);

	12673 }

	12674 walLimitSize(pWal, sz);

	12675 pWal->truncateOnCommit = 0;

	12676 }

	12677

	12678 /* Append data to the wal-index. It is not necessary to lock the

	12679 ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index

	12680 ** guarantees that there are no other writers, and no data that may

	12681 ** be in use by existing readers is being overwritten.

	12682 */

	12683 iFrame = pWal->hdr.mxFrame;

	12684 for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){

	12685 if( (p->flags & PGHDR_WAL_APPEND)==0 ) continue;

	12686 iFrame++;

	12687 rc = walIndexAppend(pWal, iFrame, p->pgno);

	12688 }

	12689 while( rc==SQLITE_OK && nExtra>0 ){

	12690 iFrame++;

	12691 nExtra--;

	12692 rc = walIndexAppend(pWal, iFrame, pLast->pgno);

	12693 }

	12694

	12695 if( rc==SQLITE_OK ){

	12696 /* Update the private copy of the header. */

	12697 pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16));

	12698 testcase( szPage<=32768 );

	12699 testcase( szPage>=65536 );

	12700 pWal->hdr.mxFrame = iFrame;

	12701 if( isCommit ){

	12702 pWal->hdr.iChange++;

	12703 pWal->hdr.nPage = nTruncate;

	12704 }

	12705 /* If this is a commit, update the wal-index header too. */

	12706 if( isCommit ){

	12707 walIndexWriteHdr(pWal);

	12708 pWal->iCallback = iFrame;

	12709 }

	12710 }

	12711

	12712 WALTRACE(("WAL%p: frame write %s\n", pWal, rc ? "failed" : "ok"));

	12713 return rc;

	12714 }
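Two pieces of arithmetic in sqlite3WalFrames() are easy to misread: rounding the commit offset up to the next sector boundary to get the sync point, and packing the page size into a 16-bit header field. The sketch below reproduces both under stated assumptions. The helper names are hypothetical, and the decoder is written here as the inverse of the encoding on line 12697 rather than copied from this section.

```c
#include <assert.h>
#include <stdint.h>

/* Round iOffset up to the next multiple of sectorSize. An offset that is
** already on a boundary is returned unchanged. */
static int64_t demoSyncPoint(int64_t iOffset, int sectorSize){
  assert( sectorSize>0 && iOffset>=0 );
  return ((iOffset + sectorSize - 1)/sectorSize)*sectorSize;
}

/* Pack a power-of-two page size in the range 512..65536 into 16 bits.
** 65536 does not fit, so it is stored as 1 (bit 16 moved to bit 0). */
static uint16_t demoEncodePageSize(int szPage){
  assert( szPage>=512 && szPage<=65536 && (szPage & (szPage-1))==0 );
  return (uint16_t)((szPage & 0xff00) | (szPage>>16));
}

/* Inverse of demoEncodePageSize(). */
static int demoDecodePageSize(uint16_t v){
  return (v & 0xfe00) + ((v & 0x0001)<<16);
}
```

With a 4096-byte sector, an offset of 4128 rounds up to 8192, while an offset of exactly 8192 stays put. In the listing above that second case is the one where `bSync` remains true and no padding frames are written.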

	12715

	12716 /*

	12717 ** This routine is called to implement sqlite3_wal_checkpoint() and

	12718 ** related interfaces.

	12719 **

	12720 ** Obtain a CHECKPOINT lock and then backfill as much information as

	12721 ** we can from WAL into the database.

	12722 **

	12723 ** If parameter xBusy is not NULL, it is a pointer to a busy-handler

	12724 ** callback. In this case this function runs a blocking checkpoint.

	12725 */

	12726 SQLITE_PRIVATE int sqlite3WalCheckpoint(

	12727 Wal *pWal, /* Wal connection */

	12728 sqlite3 *db, /* Check this handle's interrupt flag */

	12729 int eMode, /* PASSIVE, FULL, RESTART, or TRUNCATE */

	12730 int (*xBusy)(void*), /* Function to call when busy */

	12731 void *pBusyArg, /* Context argument for xBusyHandler */

	12732 int sync_flags, /* Flags to sync db file with (or 0) */

	12733 int nBuf, /* Size of temporary buffer */

	12734 u8 *zBuf, /* Temporary buffer to use */

	12735 int *pnLog, /* OUT: Number of frames in WAL */

	12736 int *pnCkpt /* OUT: Number of backfilled frames in WAL */

	12737 ){

	12738 int rc; /* Return code */

	12739 int isChanged = 0; /* True if a new wal-index header is loaded */

	12740 int eMode2 = eMode; /* Mode to pass to walCheckpoint() */

	12741 int (*xBusy2)(void*) = xBusy; /* Busy handler for eMode2 */

	12742

	12743 assert( pWal->ckptLock==0 );

	12744 assert( pWal->writeLock==0 );

	12745

	12746 /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked

	12747 ** in the SQLITE_CHECKPOINT_PASSIVE mode. */

	12748 assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 );

	12749

	12750 if( pWal->readOnly ) return SQLITE_READONLY;

	12751 WALTRACE(("WAL%p: checkpoint begins\n", pWal));

	12752

	12753 /* IMPLEMENTATION-OF: R-62028-47212 All calls obtain an exclusive

	12754 ** "checkpoint" lock on the database file. */

	12755 rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);

	12756 if( rc ){

	12757 /* EVIDENCE-OF: R-10421-19736 If any other process is running a

	12758 ** checkpoint operation at the same time, the lock cannot be obtained and

	12759 ** SQLITE_BUSY is returned.

	12760 ** EVIDENCE-OF: R-53820-33897 Even if there is a busy-handler configured,

	12761 ** it will not be invoked in this case.

	12762 */

	12763 testcase( rc==SQLITE_BUSY );

	12764 testcase( xBusy!=0 );

	12765 return rc;

	12766 }

	12767 pWal->ckptLock = 1;

	12768

	12769 /* IMPLEMENTATION-OF: R-59782-36818 The SQLITE_CHECKPOINT_FULL, RESTART and

	12770 ** TRUNCATE modes also obtain the exclusive "writer" lock on the database

	12771 ** file.

	12772 **

	12773 ** EVIDENCE-OF: R-60642-04082 If the writer lock cannot be obtained

	12774 ** immediately, and a busy-handler is configured, it is invoked and the

	12775 ** writer lock retried until either the busy-handler returns 0 or the

	12776 ** lock is successfully obtained.

	12777 */

	12778 if( eMode!=SQLITE_CHECKPOINT_PASSIVE ){

	12779 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_WRITE_LOCK, 1);

	12780 if( rc==SQLITE_OK ){

	12781 pWal->writeLock = 1;

	12782 }else if( rc==SQLITE_BUSY ){

	12783 eMode2 = SQLITE_CHECKPOINT_PASSIVE;

	12784 xBusy2 = 0;

	12785 rc = SQLITE_OK;

	12786 }

	12787 }

	12788

	12789 /* Read the wal-index header. */

	12790 if( rc==SQLITE_OK ){

	12791 rc = walIndexReadHdr(pWal, &isChanged);

	12792 if( isChanged && pWal->pDbFd->pMethods->iVersion>=3 ){

	12793 sqlite3OsUnfetch(pWal->pDbFd, 0, 0);

	12794 }

	12795 }

	12796

	12797 /* Copy data from the log to the database file. */

	12798 if( rc==SQLITE_OK ){

	12799

	12800 if( pWal->hdr.mxFrame && walPagesize(pWal)!=nBuf ){

	12801 rc = SQLITE_CORRUPT_BKPT;

	12802 }else{

	12803 rc = walCheckpoint(pWal, db, eMode2, xBusy2, pBusyArg, sync_flags, zBuf);

	12804 }

	12805

	12806 /* If no error occurred, set the output variables. */

	12807 if( rc==SQLITE_OK || rc==SQLITE_BUSY ){

	12808 if( pnLog ) *pnLog = (int)pWal->hdr.mxFrame;

	12809 if( pnCkpt ) *pnCkpt = (int)(walCkptInfo(pWal)->nBackfill);

	12810 }

	12811 }

	12812

	12813 if( isChanged ){

	12814 /* If a new wal-index header was loaded before the checkpoint was

	12815 ** performed, then the pager-cache associated with pWal is now

	12816 ** out of date. So zero the cached wal-index header to ensure that

	12817 ** next time the pager opens a snapshot on this database it knows that

	12818 ** the cache needs to be reset.

	12819 */

	12820 memset(&pWal->hdr, 0, sizeof(WalIndexHdr));

	12821 }

	12822

	12823 /* Release the locks. */

	12824 sqlite3WalEndWriteTransaction(pWal);

	12825 walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);

	12826 pWal->ckptLock = 0;

	12827 WALTRACE(("WAL%p: checkpoint %s\n", pWal, rc ? "failed" : "ok"));

	12828 return (rc==SQLITE_OK && eMode!=eMode2 ? SQLITE_BUSY : rc);

	12829 }

	12830

	12831 /* Return the value to pass to a sqlite3_wal_hook callback, the

	12832 ** number of frames in the WAL at the point of the last commit since

	12833 ** sqlite3WalCallback() was called. If no commits have occurred since

	12834 ** the last call, then return 0.

	12835 */

	12836 SQLITE_PRIVATE int sqlite3WalCallback(Wal *pWal){

	12837 u32 ret = 0;

	12838 if( pWal ){

	12839 ret = pWal->iCallback;

	12840 pWal->iCallback = 0;

	12841 }

	12842 return (int)ret;

	12843 }

	12844

	12845 /*

	12846 ** This function is called to change the WAL subsystem into or out

	12847 ** of locking_mode=EXCLUSIVE.

	12848 **

	12849 ** If op is zero, then attempt to change from locking_mode=EXCLUSIVE

	12850 ** into locking_mode=NORMAL. This means that we must acquire a lock

	12851 ** on the pWal->readLock byte. If the WAL is already in locking_mode=NORMAL

	12852 ** or if the acquisition of the lock fails, then return 0. If the

	12853 ** transition out of exclusive-mode is successful, return 1. This

	12854 ** operation must occur while the pager is still holding the exclusive

	12855 ** lock on the main database file.

	12856 **

	12857 ** If op is one, then change from locking_mode=NORMAL into

	12858 ** locking_mode=EXCLUSIVE. This means that the pWal->readLock must

	12859 ** be released. Return 1 if the transition is made and 0 if the

	12860 ** WAL is already in exclusive-locking mode - meaning that this

	12861 ** routine is a no-op. The pager must already hold the exclusive lock

	12862 ** on the main database file before invoking this operation.

	12863 **

	12864 ** If op is negative, then do a dry-run of the op==1 case but do

	12865 ** not actually change anything. The pager uses this to see if it

	12866 ** should acquire the database exclusive lock prior to invoking

	12867 ** the op==1 case.

	12868 */

	12869 SQLITE_PRIVATE int sqlite3WalExclusiveMode(Wal *pWal, int op){

	12870 int rc;

	12871 assert( pWal->writeLock==0 );

	12872 assert( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE || op==-1 );

	12873

	12874 /* pWal->readLock is usually set, but might be -1 if there was a

	12875 ** prior error while attempting to acquire a read-lock. This cannot

	12876 ** happen if the connection is actually in exclusive mode (as no xShmLock

	12877 ** locks are taken in this case). Nor should the pager attempt to

	12878 ** upgrade to exclusive-mode following such an error.

	12879 */

	12880 assert( pWal->readLock>=0 || pWal->lockError );

	12881 assert( pWal->readLock>=0 || (op<=0 && pWal->exclusiveMode==0) );

	12882

	12883 if( op==0 ){

	12884 if( pWal->exclusiveMode ){

	12885 pWal->exclusiveMode = 0;

	12886 if( walLockShared(pWal, WAL_READ_LOCK(pWal->readLock))!=SQLITE_OK ){

	12887 pWal->exclusiveMode = 1;

	12888 }

	12889 rc = pWal->exclusiveMode==0;

	12890 }else{

	12891 /* Already in locking_mode=NORMAL */

	12892 rc = 0;

	12893 }

	12894 }else if( op>0 ){

	12895 assert( pWal->exclusiveMode==0 );

	12896 assert( pWal->readLock>=0 );

	12897 walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));

	12898 pWal->exclusiveMode = 1;

	12899 rc = 1;

	12900 }else{

	12901 rc = pWal->exclusiveMode==0;

	12902 }

	12903 return rc;

	12904 }

	12905

	12906 /*

	12907 ** Return true if the argument is non-NULL and the WAL module is using

	12908 ** heap-memory for the wal-index. Otherwise, if the argument is NULL or the

	12909 ** WAL module is using shared-memory, return false.

	12910 */

	12911 SQLITE_PRIVATE int sqlite3WalHeapMemory(Wal *pWal){

	12912 return (pWal && pWal->exclusiveMode==WAL_HEAPMEMORY_MODE );

	12913 }

	12914

	12915 #ifdef SQLITE_ENABLE_SNAPSHOT

	12916 /* Create a snapshot object. The content of a snapshot is opaque to

	12917 ** every other subsystem, so the WAL module can put whatever it needs

	12918 ** in the object.

	12919 */

	12920 SQLITE_PRIVATE int sqlite3WalSnapshotGet(Wal *pWal, sqlite3_snapshot **ppSnapshot){

	12921 int rc = SQLITE_OK;

	12922 WalIndexHdr *pRet;

	12923 static const u32 aZero[4] = { 0, 0, 0, 0 };

	12924

	12925 assert( pWal->readLock>=0 && pWal->writeLock==0 );

	12926

	12927 if( memcmp(&pWal->hdr.aFrameCksum[0],aZero,16)==0 ){

	12928 *ppSnapshot = 0;

	12929 return SQLITE_ERROR;

	12930 }

	12931 pRet = (WalIndexHdr*)sqlite3_malloc(sizeof(WalIndexHdr));

	12932 if( pRet==0 ){

	12933 rc = SQLITE_NOMEM_BKPT;

	12934 }else{

	12935 memcpy(pRet, &pWal->hdr, sizeof(WalIndexHdr));

	12936 *ppSnapshot = (sqlite3_snapshot*)pRet;

	12937 }

	12938

	12939 return rc;

	12940 }

	12941

	12942 /* Try to open on pSnapshot when the next read-transaction starts

	12943 */

	12944 SQLITE_PRIVATE void sqlite3WalSnapshotOpen(Wal *pWal, sqlite3_snapshot *pSnapshot){

	12945 pWal->pSnapshot = (WalIndexHdr*)pSnapshot;

	12946 }

	12947

	12948 /*

	12949 ** Return a +ve value if snapshot p1 is newer than p2. A -ve value if

	12950 ** p1 is older than p2 and zero if p1 and p2 are the same snapshot.

	12951 */

	12952 SQLITE_API int sqlite3_snapshot_cmp(sqlite3_snapshot *p1, sqlite3_snapshot *p2){

	12953 WalIndexHdr *pHdr1 = (WalIndexHdr*)p1;

	12954 WalIndexHdr *pHdr2 = (WalIndexHdr*)p2;

	12955

	12956 /* aSalt[0] is a copy of the value stored in the wal file header. It

	12957 ** is incremented each time the wal file is restarted. */

	12958 if( pHdr1->aSalt[0]<pHdr2->aSalt[0] ) return -1;

	12959 if( pHdr1->aSalt[0]>pHdr2->aSalt[0] ) return +1;

	12960 if( pHdr1->mxFrame<pHdr2->mxFrame ) return -1;

	12961 if( pHdr1->mxFrame>pHdr2->mxFrame ) return +1;

	12962 return 0;

	12963 }

	12964 #endif /* SQLITE_ENABLE_SNAPSHOT */
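The ordering used by sqlite3_snapshot_cmp() is a two-key comparison: the first salt word (which, per the comment above, is incremented each time the wal file is restarted) decides first, and the frame count breaks ties. A minimal sketch, assuming a hypothetical `DemoSnapshot` struct that keeps only those two fields:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced snapshot: only the two fields that are compared. */
typedef struct DemoSnapshot {
  uint32_t salt0;     /* Bumped on every WAL restart */
  uint32_t mxFrame;   /* Number of valid frames in the WAL */
} DemoSnapshot;

/* Return +1 if a is newer than b, -1 if a is older, 0 if they match. */
static int demoSnapshotCmp(const DemoSnapshot *a, const DemoSnapshot *b){
  assert( a!=0 && b!=0 );
  if( a->salt0<b->salt0 ) return -1;
  if( a->salt0>b->salt0 ) return +1;
  if( a->mxFrame<b->mxFrame ) return -1;
  if( a->mxFrame>b->mxFrame ) return +1;
  return 0;
}
```

A snapshot taken after a WAL restart therefore compares as newer than one taken before it, even when its frame count is smaller.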

	12965

	12966 #ifdef SQLITE_ENABLE_ZIPVFS

	12967 /*

	12968 ** If the argument is not NULL, it points to a Wal object that holds a

	12969 ** read-lock. This function returns the database page-size if it is known,

	12970 ** or zero if it is not (or if pWal is NULL).

	12971 */

	12972 SQLITE_PRIVATE int sqlite3WalFramesize(Wal *pWal){

	12973 assert( pWal==0 || pWal->readLock>=0 );

	12974 return (pWal ? pWal->szPage : 0);

	12975 }

	12976 #endif

	12977

	12978 /* Return the sqlite3_file object for the WAL file

	12979 */

	12980 SQLITE_PRIVATE sqlite3_file *sqlite3WalFile(Wal *pWal){

	12981 return pWal->pWalFd;

	12982 }

	12983

	12984 #endif /* #ifndef SQLITE_OMIT_WAL */

	12985

	12986 /************ End of wal.c ***********************************************/

	12987 /************ Begin file btmutex.c ***************************************/

	12988 /*

	12989 ** 2007 August 27

	12990 **

	12991 ** The author disclaims copyright to this source code. In place of

	12992 ** a legal notice, here is a blessing:

	12993 **

	12994 ** May you do good and not evil.

	12995 ** May you find forgiveness for yourself and forgive others.

	12996 ** May you share freely, never taking more than you give.

	12997 **

	12998 *************************************************************************

	12999 **

	13000 ** This file contains code used to implement mutexes on Btree objects.

	13001 ** This code really belongs in btree.c. But btree.c is getting too

	13002 ** big and we want to break it down some. This package seemed like

	13003 ** a good breakout.

	13004 */

	13005 /************ Include btreeInt.h in the middle of btmutex.c **************/

	13006 /************ Begin file btreeInt.h **************************************/

	13007 /*

	13008 ** 2004 April 6

	13009 **

	13010 ** The author disclaims copyright to this source code. In place of

	13011 ** a legal notice, here is a blessing:

	13012 **

	13013 ** May you do good and not evil.

	13014 ** May you find forgiveness for yourself and forgive others.

	13015 ** May you share freely, never taking more than you give.

	13016 **

	13017 *************************************************************************

	13018 ** This file implements an external (disk-based) database using BTrees.

	13019 ** For a detailed discussion of BTrees, refer to

	13020 **

	13021 ** Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3:

	13022 ** "Sorting And Searching", pages 473-480. Addison-Wesley

	13023 ** Publishing Company, Reading, Massachusetts.

	13024 **

	13025 ** The basic idea is that each page of the file contains N database

	13026 ** entries and N+1 pointers to subpages.

	13027 **

	13028 ** ----------------------------------------------------------------

	13029 ** | Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N-1) | Ptr(N) |

	13030 ** ----------------------------------------------------------------

	13031 **

	13032 ** All of the keys on the page that Ptr(0) points to have values less

	13033 ** than Key(0). All of the keys on page Ptr(1) and its subpages have

	13034 ** values greater than Key(0) and less than Key(1). All of the keys

	13035 ** on Ptr(N) and its subpages have values greater than Key(N-1). And

	13036 ** so forth.

	13037 **

	13038 ** Finding a particular key requires reading O(log(M)) pages from the

	13039 ** disk where M is the number of entries in the tree.

	13040 **

	13041 ** In this implementation, a single file can hold one or more separate

	13042 ** BTrees. Each BTree is identified by the index of its root page. The

	13043 ** key and data for any entry are combined to form the "payload". A

	13044 ** fixed amount of payload can be carried directly on the database

	13045 ** page. If the payload is larger than the preset amount then surplus

	13046 ** bytes are stored on overflow pages. The payload for an entry

	13047 ** and the preceding pointer are combined to form a "Cell". Each

	13048 ** page has a small header which contains the Ptr(N) pointer and other

	13049 ** information such as the size of key and data.

	13050 **

	13051 ** FORMAT DETAILS

	13052 **

	13053 ** The file is divided into pages. The first page is called page 1,

	13054 ** the second is page 2, and so forth. A page number of zero indicates

	13055 ** "no such page". The page size can be any power of 2 between 512 and 65536.

	13056 ** Each page can be either a btree page, a freelist page, an overflow

	13057 ** page, or a pointer-map page.

	13058 **

	13059 ** The first page is always a btree page. The first 100 bytes of the first

	13060 ** page contain a special header (the "file header") that describes the file.

	13061 ** The format of the file header is as follows:

	13062 **

	13063 **   OFFSET   SIZE    DESCRIPTION

	13064 **      0      16     Header string: "SQLite format 3\000"

	13065 **     16       2     Page size in bytes. (1 means 65536)

	13066 **     18       1     File format write version

	13067 **     19       1     File format read version

	13068 **     20       1     Bytes of unused space at the end of each page

	13069 **     21       1     Max embedded payload fraction (must be 64)

	13070 **     22       1     Min embedded payload fraction (must be 32)

	13071 **     23       1     Min leaf payload fraction (must be 32)

	13072 **     24       4     File change counter

	13073 **     28       4     Reserved for future use

	13074 **     32       4     First freelist page

	13075 **     36       4     Number of freelist pages in the file

	13076 **     40      60     15 4-byte meta values passed to higher layers

	13077 **

	13078 **     40       4     Schema cookie

	13079 **     44       4     File format of schema layer

	13080 **     48       4     Size of page cache

	13081 **     52       4     Largest root-page (auto/incr_vacuum)

	13082 **     56       4     1=UTF-8 2=UTF16le 3=UTF16be

	13083 **     60       4     User version

	13084 **     64       4     Incremental vacuum mode

	13085 **     68       4     Application-ID

	13086 **     72      20     unused

	13087 **     92       4     The version-valid-for number

	13088 **     96       4     SQLITE_VERSION_NUMBER

	13089 **

	13090 ** All of the integer values are big-endian (most significant byte first).

	13091 **

	13092 ** The file change counter is incremented when the database is changed.

	13093 ** This counter allows other processes to know when the file has changed

	13094 ** and thus when they need to flush their cache.

	13095 **

	13096 ** The max embedded payload fraction is the amount of the total usable

	13097 ** space in a page that can be consumed by a single cell for standard

	13098 ** B-tree (non-LEAFDATA) tables. A value of 255 means 100%. The default

	13099 ** is to limit the maximum cell size so that at least 4 cells will fit

	13100 ** on one page. Thus the default max embedded payload fraction is 64.

	13101 **

	13102 ** If the payload for a cell is larger than the max payload, then extra

	13103 ** payload is spilled to overflow pages. Once an overflow page is allocated,

	13104 ** as many bytes as possible are moved into the overflow pages without letting

	13105 ** the cell size drop below the min embedded payload fraction.

	13106 **

	13107 ** The min leaf payload fraction is like the min embedded payload fraction

	13108 ** except that it applies to leaf nodes in a LEAFDATA tree. The maximum

	13109 ** payload fraction for a LEAFDATA tree is always 100% (or 255) and it

	13110 ** is not specified in the header.

	13111 **

	13112 ** Each btree page is divided into three sections: The header, the

	13113 ** cell pointer array, and the cell content area. Page 1 also has a 100-byte

	13114 ** file header that occurs before the page header.

	13115 **

	13116 **      |----------------|

	13117 **      | file header    |   100 bytes. Page 1 only.

	13118 **      |----------------|

	13119 **      | page header    |   8 bytes for leaves. 12 bytes for interior nodes

	13120 **      |----------------|

	13121 **      | cell pointer   |   |  2 bytes per cell. Sorted order.

	13122 **      | array          |   |  Grows downward

	13123 **      |                |   v

	13124 **      |----------------|

	13125 **      | unallocated    |

	13126 **      | space          |

	13127 **      |----------------|   ^  Grows upwards

	13128 **      | cell content   |   |  Arbitrary order interspersed with freeblocks.

	13129 **      | area           |   |  and free space fragments.

	13130 **      |----------------|

	13131 **

	13132 ** The page header looks like this:

	13133 **

	13134 **   OFFSET   SIZE     DESCRIPTION

	13135 **      0       1      Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf

	13136 **      1       2      byte offset to the first freeblock

	13137 **      3       2      number of cells on this page

	13138 **      5       2      first byte of the cell content area

	13139 **      7       1      number of fragmented free bytes

	13140 **      8       4      Right child (the Ptr(N) value). Omitted on leaves.

	13141 **

	13142 ** The flags define the format of this btree page. The leaf flag means that

	13143 ** this page has no children. The zerodata flag means that this page carries

	13144 ** only keys and no data. The intkey flag means that the key is an integer

	13145 ** which is stored in the key size entry of the cell header rather than in

	13146 ** the payload area.

	13147 **

	13148 ** The cell pointer array begins on the first byte after the page header.

	13149 ** The cell pointer array contains zero or more 2-byte numbers which are

	13150 ** offsets from the beginning of the page to the cell content in the cell

	13151 ** content area. The cell pointers occur in sorted order. The system strives

	13152 ** to keep free space after the last cell pointer so that new cells can

	13153 ** be easily added without having to defragment the page.

	13154 **

	13155 ** Cell content is stored at the very end of the page and grows toward the

	13156 ** beginning of the page.

	13157 **

	13158 ** Unused space within the cell content area is collected into a linked list of

	13159 ** freeblocks. Each freeblock is at least 4 bytes in size. The byte offset

	13160 ** to the first freeblock is given in the header. Freeblocks occur in

	13161 ** increasing order. Because a freeblock must be at least 4 bytes in size,

	13162 ** any group of 3 or fewer unused bytes in the cell content area cannot

	13163 ** exist on the freeblock chain. A group of 3 or fewer free bytes is called

	13164 ** a fragment. The total number of bytes in all fragments is recorded

	13165 ** in the page header at offset 7.

	13166 **

	13167 ** SIZE DESCRIPTION

	13168 ** 2 Byte offset of the next freeblock

	13169 ** 2 Bytes in this freeblock

	13170 **

	13171 ** Cells are of variable length. Cells are stored in the cell content area at

	13172 ** the end of the page. Pointers to the cells are in the cell pointer array

	13173 ** that immediately follows the page header. Cells are not necessarily

	13174 ** contiguous or in order, but cell pointers are contiguous and in order.

	13175 **

	13176 ** Cell content makes use of variable length integers. A variable

	13177 ** length integer is 1 to 9 bytes where the lower 7 bits of each

	13178 ** byte are used. The integer consists of all bytes that have bit 8 set and

	13179 ** the first byte with bit 8 clear. The most significant byte of the integer

	13180 ** appears first. A variable-length integer may not be more than 9 bytes long.

	13181 ** As a special case, all 8 bits of the 9th byte are used as data. This

	13182 ** allows a 64-bit integer to be encoded in 9 bytes.

	13183 **

	13184 ** 0x00 becomes 0x00000000

	13185 ** 0x7f becomes 0x0000007f

	13186 ** 0x81 0x00 becomes 0x00000080

	13187 ** 0x82 0x00 becomes 0x00000100

	13188 ** 0x80 0x7f becomes 0x0000007f

	13189 ** 0x81 0x91 0xd1 0xac 0x78 becomes 0x12345678

	13190 ** 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081

	13191 **

	13192 ** Variable length integers are used for rowids and to hold the number of

	13193 ** bytes of key and data in a btree cell.

	13194 **

	13195 ** The content of a cell looks like this:

	13196 **

	13197 ** SIZE DESCRIPTION

	13198 ** 4 Page number of the left child. Omitted if leaf flag is set.

	13199 ** var Number of bytes of data. Omitted if the zerodata flag is set.

	13200 ** var Number of bytes of key. Or the key itself if intkey flag is set.

	13201 ** * Payload

	13202 ** 4 First page of the overflow chain. Omitted if no overflow

	13203 **

	13204 ** Overflow pages form a linked list. Each page except the last is completely

	13205 ** filled with data (pagesize - 4 bytes). The last page can have as little

	13206 ** as 1 byte of data.

	13207 **

	13208 ** SIZE DESCRIPTION

	13209 ** 4 Page number of next overflow page

	13210 ** * Data

	13211 **

	13212 ** Freelist pages come in two subtypes: trunk pages and leaf pages. The

	13213 ** file header points to the first in a linked list of trunk pages. Each trunk

	13214 ** page points to multiple leaf pages. The content of a leaf page is

	13215 ** unspecified. A trunk page looks like this:

	13216 **

	13217 ** SIZE DESCRIPTION

	13218 ** 4 Page number of next trunk page

	13219 ** 4 Number of leaf pointers on this page

	13220 ** * zero or more page numbers of leaves

	13221 */
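
The variable-length integer rule described in the format comment above can be illustrated with a small standalone decoder. This is only a sketch for readers of the review, not the amalgamation's own varint routine; the name `demo_get_varint` and its bounds-checking behaviour are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of SQLite): decode one variable-length
** integer from buf[0..n-1] into *pVal.  Returns the number of bytes
** consumed (1 to 9), or 0 if the buffer ends before the integer does. */
static int demo_get_varint(const unsigned char *buf, size_t n, uint64_t *pVal){
  uint64_t v = 0;
  int i;
  assert( buf!=0 && pVal!=0 );
  for(i=0; i<8; i++){
    if( (size_t)i>=n ) return 0;
    v = (v<<7) | (uint64_t)(buf[i] & 0x7f);  /* lower 7 bits carry data */
    if( (buf[i] & 0x80)==0 ){                /* high bit clear: last byte */
      *pVal = v;
      return i+1;
    }
  }
  if( n<9 ) return 0;
  *pVal = (v<<8) | buf[8];                   /* 9th byte: all 8 bits are data */
  return 9;
}
```

Feeding it the two bytes 0x81 0x00 from the table in the comment yields 0x80 and a length of 2; the most significant group comes first, so each step shifts the accumulator left by 7 bits.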

	13222 /* #include "sqliteInt.h" */

	13223

	13224

	13225 /* The following value is the maximum cell size assuming a maximum page

	13226 ** size given above.

	13227 */

	13228 #define MX_CELL_SIZE(pBt) ((int)(pBt->pageSize-8))

	13229

	13230 /* The maximum number of cells on a single page of the database. This

	13231 ** assumes a minimum cell size of 6 bytes (4 bytes for the cell itself

	13232 ** plus 2 bytes for the index to the cell in the page header). Such

	13233 ** small cells will be rare, but they are possible.

	13234 */

	13235 #define MX_CELL(pBt) ((pBt->pageSize-8)/6)

	13236

	13237 /* Forward declarations */

	13238 typedef struct MemPage MemPage;

	13239 typedef struct BtLock BtLock;

	13240 typedef struct CellInfo CellInfo;

	13241

	13242 /*

	13243 ** This is a magic string that appears at the beginning of every

	13244 ** SQLite database in order to identify the file as a real database.

	13245 **

	13246 ** You can change this value at compile-time by specifying a

	13247 ** -DSQLITE_FILE_HEADER="..." on the compiler command-line. The

	13248 ** header must be exactly 16 bytes including the zero-terminator so

	13249 ** the string itself should be 15 characters long. If you change

	13250 ** the header, then your custom library will not be able to read

	13251 ** databases generated by the standard tools and the standard tools

	13252 ** will not be able to read databases created by your custom library.

	13253 */

	13254 #ifndef SQLITE_FILE_HEADER /* 123456789 123456 */

	13255 # define SQLITE_FILE_HEADER "SQLite format 3"

	13256 #endif
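
As a reading aid, the magic string and the 2-byte page-size field at offset 16 of the file header could be checked roughly as follows. This is a sketch under the assumptions stated in the format comment (big-endian integers, 1 standing for 65536, sizes being powers of two from 512 up); `demo_header_page_size` is a hypothetical name and not a function in this file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of SQLite): inspect the first 100 bytes of
** a database file.  Returns the page size in bytes, or 0 if the magic
** string is missing or the size field does not hold a valid value. */
static unsigned demo_header_page_size(const unsigned char *hdr){
  static const char magic[16] = "SQLite format 3";  /* 15 chars + NUL */
  unsigned sz;
  assert( hdr!=0 );
  if( memcmp(hdr, magic, 16)!=0 ) return 0;
  sz = ((unsigned)hdr[16]<<8) | hdr[17];      /* big-endian 2-byte field */
  if( sz==1 ) return 65536;                   /* 1 is the code for 65536 */
  if( sz<512 || (sz & (sz-1))!=0 ) return 0;  /* must be a power of two */
  return sz;
}
```

The special value 1 is needed because 65536 does not fit in the 16-bit field.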

	13257

	13258 /*

	13259 ** Page type flags. An ORed combination of these flags appears as the

	13260 ** first byte of the on-disk image of every BTree page.

	13261 */

	13262 #define PTF_INTKEY 0x01

	13263 #define PTF_ZERODATA 0x02

	13264 #define PTF_LEAFDATA 0x04

	13265 #define PTF_LEAF 0x08
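
To show how the flag byte drives the rest of the page header, here is a minimal decoding sketch based on the page-header table in the format comment. The struct `DemoPageHdr` and the helper names are hypothetical; the real code keeps this information in MemPage and handles more cases (corruption checks in particular) than this example does.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PTF_LEAF 0x08  /* same value as PTF_LEAF above */

/* Hypothetical struct (not part of SQLite) holding the decoded header. */
typedef struct DemoPageHdr {
  unsigned char flags;      /* offset 0: PTF_* bits */
  unsigned firstFreeblock;  /* offset 1: byte offset of first freeblock */
  unsigned nCell;           /* offset 3: number of cells */
  unsigned cellContent;     /* offset 5: first byte of cell content area */
  unsigned nFragBytes;      /* offset 7: fragmented free bytes */
  uint32_t rightChild;      /* offset 8: Ptr(N); set to 0 here for leaves */
  unsigned hdrSize;         /* 8 for leaves, 12 for interior pages */
} DemoPageHdr;

/* Read a 2-byte big-endian integer. */
static unsigned demo_get2(const unsigned char *p){
  return ((unsigned)p[0]<<8) | p[1];
}

/* p points at the page header: byte 100 of page 1, byte 0 of other pages. */
static void demo_parse_page_header(const unsigned char *p, DemoPageHdr *h){
  assert( p!=0 && h!=0 );
  h->flags = p[0];
  h->firstFreeblock = demo_get2(&p[1]);
  h->nCell = demo_get2(&p[3]);
  h->cellContent = demo_get2(&p[5]);
  h->nFragBytes = p[7];
  if( h->flags & DEMO_PTF_LEAF ){
    h->hdrSize = 8;          /* right-child pointer is omitted on leaves */
    h->rightChild = 0;
  }else{
    h->hdrSize = 12;
    h->rightChild = ((uint32_t)p[8]<<24) | ((uint32_t)p[9]<<16)
                  | ((uint32_t)p[10]<<8) | p[11];
  }
}
```

The cell pointer array then starts `hdrSize` bytes after `p`, which matches the 8-byte and 12-byte figures in the layout diagram.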

	13266

	13267 /*

	13268 ** An instance of this object stores information about a single database

	13269 ** page that has been loaded into memory. The information in this object

	13270 ** is derived from the raw on-disk page content.

	13271 **

	13272 ** As each database page is loaded into memory, the pager allocates an

	13273 ** instance of this object and zeros the first 8 bytes. (This is the

	13274 ** "extra" information associated with each page of the pager.)

	13275 **

	13276 ** Access to all fields of this structure is controlled by the mutex

	13277 ** stored in MemPage.pBt->mutex.

	13278 */

	13279 struct MemPage {

	13280 u8 isInit; /* True if previously initialized. MUST BE FIRST! */

	13281 u8 bBusy; /* Prevent endless loops on corrupt database files */

	13282 u8 intKey; /* True if table b-trees. False for index b-trees */

	13283 u8 intKeyLeaf; /* True if the leaf of an intKey table */

	13284 Pgno pgno; /* Page number for this page */

	13285 /* Only the first 8 bytes (above) are zeroed by pager.c when a new page

	13286 ** is allocated. All fields that follow must be initialized before use */

	13287 u8 leaf; /* True if a leaf page */

	13288 u8 hdrOffset; /* 100 for page 1. 0 otherwise */

	13289 u8 childPtrSize; /* 0 if leaf==1. 4 if leaf==0 */

	13290 u8 max1bytePayload; /* min(maxLocal,127) */

	13291 u8 nOverflow; /* Number of overflow cell bodies in aCell[] */

	13292 u16 maxLocal; /* Copy of BtShared.maxLocal or BtShared.maxLeaf */

	13293 u16 minLocal; /* Copy of BtShared.minLocal or BtShared.minLeaf */

	13294 u16 cellOffset; /* Index in aData of first cell pointer */

	13295 u16 nFree; /* Number of free bytes on the page */

	13296 u16 nCell; /* Number of cells on this page, local and ovfl */

	13297 u16 maskPage; /* Mask for page offset */

	13298 u16 aiOvfl[4]; /* Insert the i-th overflow cell before the aiOvfl-th

	13299 ** non-overflow cell */

	13300 u8 *apOvfl[4]; /* Pointers to the body of overflow cells */

	13301 BtShared *pBt; /* Pointer to BtShared that this page is part of */

	13302 u8 *aData; /* Pointer to disk image of the page data */

	13303 u8 *aDataEnd; /* One byte past the end of usable data */

	13304 u8 *aCellIdx; /* The cell index area */

	13305 u8 *aDataOfst; /* Same as aData for leaves. aData+4 for interior */

	13306 DbPage *pDbPage; /* Pager page handle */

	13307 u16 (*xCellSize)(MemPage*,u8*); /* cellSizePtr method */

	13308 void (*xParseCell)(MemPage*,u8*,CellInfo*); /* btreeParseCell method */

	13309 };
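
MemPage.nFree counts the free bytes on a page, and part of that total comes from the freeblock list described in the format comment (a 2-byte offset of the next freeblock followed by a 2-byte size, in increasing order of offset). The following is a hedged sketch of walking that list on a raw page image. It assumes an offset of 0 terminates the chain; `demo_freeblock_bytes` is a hypothetical name and is not how the amalgamation computes nFree.

```c
#include <assert.h>

/* Hypothetical helper (not part of SQLite): sum the sizes of all
** freeblocks on a page.  "page" is the start of the page image, "usable"
** the number of usable bytes on it, and "first" the offset of the first
** freeblock taken from the page header (0 if there is none).
** Returns the total, or -1 if the chain is malformed. */
static long demo_freeblock_bytes(const unsigned char *page,
                                 unsigned usable, unsigned first){
  long total = 0;
  unsigned pc = first;
  assert( page!=0 );
  while( pc!=0 ){
    unsigned next, size;
    if( pc+4>usable ) return -1;               /* header must fit on page */
    next = ((unsigned)page[pc]<<8) | page[pc+1];
    size = ((unsigned)page[pc+2]<<8) | page[pc+3];
    if( size<4 || pc+size>usable ) return -1;  /* at least 4 bytes each */
    if( next!=0 && next<pc+size ) return -1;   /* increasing order only */
    total += size;
    pc = next;
  }
  return total;
}
```

Because every hop must land past the end of the current block, the loop terminates even on a corrupt page.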

	13310

	13311 /*

	13312 ** A linked list of the following structures is stored at BtShared.pLock.

	13313 ** Locks are added (or upgraded from READ_LOCK to WRITE_LOCK) when a cursor

	13314 ** is opened on the table with root page BtShared.iTable. Locks are removed

	13315 ** from this list when a transaction is committed or rolled back, or when

	13316 ** a btree handle is closed.

	13317 */

	13318 struct BtLock {

	13319 Btree *pBtree; /* Btree handle holding this lock */

	13320 Pgno iTable; /* Root page of table */

	13321 u8 eLock; /* READ_LOCK or WRITE_LOCK */

	13322 BtLock *pNext; /* Next in BtShared.pLock list */

	13323 };

	13324

	13325 /* Candidate values for BtLock.eLock */

	13326 #define READ_LOCK 1

	13327 #define WRITE_LOCK 2

	13328

	13329 /* A Btree handle

	13330 **

	13331 ** A database connection contains a pointer to an instance of

	13332 ** this object for every database file that it has open. This structure

	13333 ** is opaque to the database connection. The database connection cannot

	13334 ** see the internals of this structure and only deals with pointers to

	13335 ** this structure.

	13336 **

	13337 ** For some database files, the same underlying database cache might be

	13338 ** shared between multiple connections. In that case, each connection

	13339 ** has its own instance of this object. But each instance of this object

	13340 ** points to the same BtShared object. The database cache and the

	13341 ** schema associated with the database file are all contained within

	13342 ** the BtShared object.

	13343 **

	13344 ** All fields in this structure are accessed under sqlite3.mutex.

	13345 ** The pBt pointer itself may not be changed while there exist cursors

	13346 ** in the referenced BtShared that point back to this Btree since those

	13347 ** cursors have to go through this Btree to find their BtShared and

	13348 ** they often do so without holding sqlite3.mutex.

	13349 */

	13350 struct Btree {

	13351 sqlite3 *db; /* The database connection holding this btree */

	13352 BtShared *pBt; /* Sharable content of this btree */

	13353 u8 inTrans; /* TRANS_NONE, TRANS_READ or TRANS_WRITE */

	13354 u8 sharable; /* True if we can share pBt with another db */

	13355 u8 locked; /* True if db currently has pBt locked */

	13356 u8 hasIncrblobCur; /* True if there are one or more Incrblob cursors */

	13357 int wantToLock; /* Number of nested calls to sqlite3BtreeEnter() */

	13358 int nBackup; /* Number of backup operations reading this btree */

	13359 u32 iDataVersion; /* Combines with pBt->pPager->iDataVersion */

	13360 Btree *pNext; /* List of other sharable Btrees from the same db */

	13361 Btree *pPrev; /* Back pointer of the same list */

	13362 #ifndef SQLITE_OMIT_SHARED_CACHE

	13363 BtLock lock; /* Object used to lock page 1 */

	13364 #endif

	13365 };

	13366

	13367 /*

	13368 ** Btree.inTrans may take one of the following values.

	13369 **

	13370 ** If the shared-data extension is enabled, there may be multiple users

	13371 ** of the Btree structure. At most one of these may open a write transaction,

	13372 ** but any number may have active read transactions.

	13373 */

	13374 #define TRANS_NONE 0

	13375 #define TRANS_READ 1

	13376 #define TRANS_WRITE 2

	13377

	13378 /*

	13379 ** An instance of this object represents a single database file.

	13380 **

	13381 ** A single database file can be in use at the same time by two

	13382 ** or more database connections. When two or more connections are

	13383 ** sharing the same database file, each connection has its own

	13384 ** private Btree object for the file and each of those Btrees points

	13385 ** to this one BtShared object. BtShared.nRef is the number of

	13386 ** connections currently sharing this database file.

	13387 **

	13388 ** Fields in this structure are accessed under the BtShared.mutex

	13389 ** mutex, except for nRef and pNext which are accessed under the

	13390 ** global SQLITE_MUTEX_STATIC_MASTER mutex. The pPager field

	13391 ** may not be modified once it is initially set as long as nRef>0.

	13392 ** The pSchema field may be set once under BtShared.mutex and

	13393 ** thereafter is unchanged as long as nRef>0.

	13394 **

	13395 ** isPending:

	13396 **

	13397 ** If a BtShared client fails to obtain a write-lock on a database

	13398 ** table (because there exists one or more read-locks on the table),

	13399 ** the shared-cache enters 'pending-lock' state and isPending is

	13400 ** set to true.

	13401 **

	13402 ** The shared-cache leaves the 'pending lock' state when either of

	13403 ** the following occur:

	13404 **

	13405 ** 1) The current writer (BtShared.pWriter) concludes its transaction, OR

	13406 ** 2) The number of locks held by other connections drops to zero.

	13407 **

	13408 ** While in the 'pending-lock' state, no connection may start a new

	13409 ** transaction.

	13410 **

	13411 ** This feature is included to help prevent writer-starvation.

	13412 */

	13413 struct BtShared {

	13414 Pager *pPager; /* The page cache */

	13415 sqlite3 *db; /* Database connection currently using this Btree */

	13416 BtCursor *pCursor; /* A list of all open cursors */

	13417 MemPage *pPage1; /* First page of the database */

	13418 u8 openFlags; /* Flags to sqlite3BtreeOpen() */

	13419 #ifndef SQLITE_OMIT_AUTOVACUUM

	13420 u8 autoVacuum; /* True if auto-vacuum is enabled */

	13421 u8 incrVacuum; /* True if incr-vacuum is enabled */

	13422 u8 bDoTruncate; /* True to truncate db on commit */

	13423 #endif

	13424 u8 inTransaction; /* Transaction state */

	13425 u8 max1bytePayload; /* Maximum first byte of cell for a 1-byte payload */

	13426 #ifdef SQLITE_HAS_CODEC

	13427 u8 optimalReserve; /* Desired amount of reserved space per page */

	13428 #endif

	13429 u16 btsFlags; /* Boolean parameters. See BTS_* macros below */

	13430 u16 maxLocal; /* Maximum local payload in non-LEAFDATA tables */

	13431 u16 minLocal; /* Minimum local payload in non-LEAFDATA tables */

	13432 u16 maxLeaf; /* Maximum local payload in a LEAFDATA table */

	13433 u16 minLeaf; /* Minimum local payload in a LEAFDATA table */

	13434 u32 pageSize; /* Total number of bytes on a page */

	13435 u32 usableSize; /* Number of usable bytes on each page */

	13436 int nTransaction; /* Number of open transactions (read + write) */

	13437 u32 nPage; /* Number of pages in the database */

	13438 void *pSchema; /* Pointer to space allocated by sqlite3BtreeSchema() */

	13439 void (*xFreeSchema)(void*); /* Destructor for BtShared.pSchema */

	13440 sqlite3_mutex *mutex; /* Non-recursive mutex required to access this object */

	13441 Bitvec *pHasContent; /* Set of pages moved to free-list this transaction */

	13442 #ifndef SQLITE_OMIT_SHARED_CACHE

	13443 int nRef; /* Number of references to this structure */

	13444 BtShared *pNext; /* Next on a list of sharable BtShared structs */

	13445 BtLock *pLock; /* List of locks held on this shared-btree struct */

	13446 Btree *pWriter; /* Btree with currently open write transaction */

	13447 #endif

	13448 u8 *pTmpSpace; /* Temp space sufficient to hold a single cell */

	13449 };

	13450

	13451 /*

	13452 ** Allowed values for BtShared.btsFlags

	13453 */

	13454 #define BTS_READ_ONLY 0x0001 /* Underlying file is readonly */

	13455 #define BTS_PAGESIZE_FIXED 0x0002 /* Page size can no longer be changed */

	13456 #define BTS_SECURE_DELETE 0x0004 /* PRAGMA secure_delete is enabled */

	13457 #define BTS_INITIALLY_EMPTY 0x0008 /* Database was empty at trans start */

	13458 #define BTS_NO_WAL 0x0010 /* Do not open write-ahead-log files */

	13459 #define BTS_EXCLUSIVE 0x0020 /* pWriter has an exclusive lock */

	13460 #define BTS_PENDING 0x0040 /* Waiting for read-locks to clear */

	13461

	13462 /*

	13463 ** An instance of the following structure is used to hold information

	13464 ** about a cell. The parseCellPtr() function fills in this structure

	13465 ** based on information extracted from the raw disk page.

	13466 */

	13467 struct CellInfo {

	13468 i64 nKey; /* The key for INTKEY tables, or nPayload otherwise */

	13469 u8 *pPayload; /* Pointer to the start of payload */

	13470 u32 nPayload; /* Bytes of payload */

	13471 u16 nLocal; /* Amount of payload held locally, not on overflow */

	13472 u16 nSize; /* Size of the cell content on the main b-tree page */

	13473 };
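
CellInfo separates the total payload (nPayload) from the part kept on the btree page (nLocal); the remainder goes to the overflow chain, where each page gives up 4 bytes to the next-page number. A rough sketch of the resulting page count follows. The helper name `demo_overflow_pages` is hypothetical, and the example assumes the per-page capacity is the usable size minus 4, which equals the "pagesize - 4" figure in the format comment when no bytes are reserved at the end of each page.

```c
#include <assert.h>

/* Hypothetical helper (not part of SQLite): number of overflow pages
** needed for a cell carrying nPayload bytes in total, of which nLocal
** bytes are stored directly on the btree page. */
static unsigned demo_overflow_pages(unsigned nPayload, unsigned nLocal,
                                    unsigned usableSize){
  unsigned perPage, spill;
  assert( nLocal<=nPayload && usableSize>4 );
  perPage = usableSize - 4;              /* 4 bytes hold the next-page number */
  spill = nPayload - nLocal;             /* bytes that do not fit locally */
  return (spill + perPage - 1)/perPage;  /* round up; 0 when nothing spills */
}
```

Every page in the chain except the last is completely filled, so rounding up is all that is needed.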

	13474

	13475 /*

	13476 ** Maximum depth of an SQLite B-Tree structure. Any B-Tree deeper than

	13477 ** this will be declared corrupt. This value is calculated based on a

	13478 ** maximum database size of 2^31 pages, a minimum fanout of 2 for a

	13479 ** root-node and 3 for all other internal nodes.

	13480 **

	13481 ** If a tree that appears to be taller than this is encountered, it is

	13482 ** assumed that the database is corrupt.

	13483 */

	13484 #define BTCURSOR_MAX_DEPTH 20
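
One way to arrive at the figure of 20 is to ask how deep the skinniest possible tree can get before it needs more pages than the file can hold. The sketch below is an illustration under assumptions that the comment does not spell out: "fanout" is read as the number of children per page, and every page of the tree (not just the leaves) counts against the limit. `demo_max_depth` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of SQLite): deepest tree, in levels, that
** fits in maxPages pages when the root has rootFanout children and every
** other interior page has innerFanout children. */
static int demo_max_depth(uint64_t maxPages, unsigned rootFanout,
                          unsigned innerFanout){
  uint64_t level = 1;  /* pages on the current (deepest) level */
  uint64_t total = 1;  /* pages in the whole tree so far */
  int depth = 1;
  assert( rootFanout>=2 && innerFanout>=2 && maxPages>=1 );
  for(;;){
    uint64_t next = level * (depth==1 ? rootFanout : innerFanout);
    if( total+next>maxPages ) break;   /* one more level would not fit */
    level = next;
    total += next;
    depth++;
  }
  return depth;
}
```

With a limit of 2^31 pages, a root fanout of 2 and a fanout of 3 elsewhere, a tree of depth d uses 3^(d-1) pages under these assumptions, so the loop stops at depth 20.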

	13485

	13486 /*

	13487 ** A cursor is a pointer to a particular entry within a particular

	13488 ** b-tree within a database file.

	13489 **

	13490 ** The entry is identified by its MemPage and the index in

	13491 ** MemPage.aCell[] of the entry.

	13492 **

	13493 ** A single database file can be shared by two or more database connections,

	13494 ** but cursors cannot be shared. Each cursor is associated with a

	13495 ** particular database connection identified by BtCursor.pBtree.db.

	13496 **

	13497 ** Fields in this structure are accessed under the BtShared.mutex

	13498 ** found at self->pBt->mutex.

	13499 **

	13500 ** skipNext meaning:

	13501 ** eState==SKIPNEXT && skipNext>0: Next sqlite3BtreeNext() is no-op.

	13502 ** eState==SKIPNEXT && skipNext<0: Next sqlite3BtreePrevious() is no-op.

	13503 ** eState==FAULT: Cursor fault with skipNext as error code.

	13504 */

	13505 struct BtCursor {

	13506 Btree *pBtree; /* The Btree to which this cursor belongs */

	13507 BtShared *pBt; /* The BtShared this cursor points to */

	13508 BtCursor *pNext; /* Forms a linked list of all cursors */

	13509 Pgno *aOverflow; /* Cache of overflow page locations */

	13510 CellInfo info; /* A parse of the cell we are pointing at */

	13511 i64 nKey; /* Size of pKey, or last integer key */

	13512 void *pKey; /* Saved key that was cursor last known position */

	13513 Pgno pgnoRoot; /* The root page of this tree */

	13514 int nOvflAlloc; /* Allocated size of aOverflow[] array */

	13515 int skipNext; /* Prev() is noop if negative. Next() is noop if positive.

	13516 ** Error code if eState==CURSOR_FAULT */

	13517 u8 curFlags; /* zero or more BTCF_* flags defined below */

	13518 u8 curPagerFlags; /* Flags to send to sqlite3PagerGet() */

	13519 u8 eState; /* One of the CURSOR_XXX constants (see below) */

	13520 u8 hints; /* As configured by CursorSetHints() */

	13521 /* All fields above are zeroed when the cursor is allocated. See

	13522 ** sqlite3BtreeCursorZero(). Fields that follow must be manually

	13523 ** initialized. */

	13524 i8 iPage; /* Index of current page in apPage */

	13525 u8 curIntKey; /* Value of apPage[0]->intKey */

	13526 struct KeyInfo *pKeyInfo; /* Argument passed to comparison function */

	13527 void *padding1; /* Make object size a multiple of 16 */

	13528 u16 aiIdx[BTCURSOR_MAX_DEPTH]; /* Current index in apPage[i] */

	13529 MemPage *apPage[BTCURSOR_MAX_DEPTH]; /* Pages from root to current page */

	13530 };

	13531

	13532 /*

	13533 ** Legal values for BtCursor.curFlags

	13534 */

	13535 #define BTCF_WriteFlag 0x01 /* True if a write cursor */

	13536 #define BTCF_ValidNKey 0x02 /* True if info.nKey is valid */

	13537 #define BTCF_ValidOvfl 0x04 /* True if aOverflow is valid */

	13538 #define BTCF_AtLast 0x08 /* Cursor is pointing to the last entry */

	13539 #define BTCF_Incrblob 0x10 /* True if an incremental I/O handle */

	13540 #define BTCF_Multiple 0x20 /* Maybe another cursor on the same btree */

	13541

	13542 /*

	13543 ** Potential values for BtCursor.eState.

	13544 **

	13545 ** CURSOR_INVALID:

	13546 ** Cursor does not point to a valid entry. This can happen (for example)

	13547 ** because the table is empty or because BtreeCursorFirst() has not been

	13548 ** called.

	13549 **

	13550 ** CURSOR_VALID:

	13551 ** Cursor points to a valid entry. getPayload() etc. may be called.

	13552 **

	13553 ** CURSOR_SKIPNEXT:

	13554 ** Cursor is valid except that the Cursor.skipNext field is non-zero

	13555 ** indicating that the next sqlite3BtreeNext() or sqlite3BtreePrevious()

	13556 ** operation should be a no-op.

	13557 **

	13558 ** CURSOR_REQUIRESEEK:

	13559 ** The table that this cursor was opened on still exists, but has been

	13560 ** modified since the cursor was last used. The cursor position is saved

	13561 ** in variables BtCursor.pKey and BtCursor.nKey. When a cursor is in

	13562 ** this state, restoreCursorPosition() can be called to attempt to

	13563 ** seek the cursor to the saved position.

	13564 **

	13565 ** CURSOR_FAULT:

	13566 ** An unrecoverable error (an I/O error or a malloc failure) has occurred

	13567 ** on a different connection that shares the BtShared cache with this

	13568 ** cursor. The error has left the cache in an inconsistent state.

	13569 ** Do nothing else with this cursor. Any attempt to use the cursor

	13570 ** should return the error code stored in BtCursor.skipNext.

	13571 */

	13572 #define CURSOR_INVALID 0

	13573 #define CURSOR_VALID 1

	13574 #define CURSOR_SKIPNEXT 2

	13575 #define CURSOR_REQUIRESEEK 3

	13576 #define CURSOR_FAULT 4

	13577

	13578 /*

	13579 ** The database page the PENDING_BYTE occupies. This page is never used.

	13580 */

	13581 # define PENDING_BYTE_PAGE(pBt) PAGER_MJ_PGNO(pBt)

	13582

	13583 /*

	13584 ** These macros define the location of the pointer-map entry for a

	13585 ** database page. The first argument to each is the number of usable

	13586 ** bytes on each page of the database (often 1024). The second is the

	13587 ** page number to look up in the pointer map.

	13588 **

	13589 ** PTRMAP_PAGENO returns the database page number of the pointer-map

	13590 ** page that stores the required pointer. PTRMAP_PTROFFSET returns

	13591 ** the offset of the requested map entry.

	13592 **

	13593 ** If the pgno argument passed to PTRMAP_PAGENO is a pointer-map page,

	13594 ** then pgno is returned. So (pgno==PTRMAP_PAGENO(pgsz, pgno)) can be

	13595 ** used to test if pgno is a pointer-map page. PTRMAP_ISPAGE implements

	13596 ** this test.

	13597 */

	13598 #define PTRMAP_PAGENO(pBt, pgno) ptrmapPageno(pBt, pgno)

	13599 #define PTRMAP_PTROFFSET(pgptrmap, pgno) (5*(pgno-pgptrmap-1))

	13600 #define PTRMAP_ISPAGE(pBt, pgno) (PTRMAP_PAGENO((pBt),(pgno))==(pgno))
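
The macros above delegate to ptrmapPageno(), whose body is not part of this excerpt. The sketch below shows one plausible form of the arithmetic, under these assumptions: the first pointer-map page is page 2, each pointer-map page holds usableSize/5 five-byte entries for the pages that follow it, and the special handling of the PENDING_BYTE page is left out. The `demo_` names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (not part of SQLite): page number of the pointer-map
** page that holds the entry for pgno.  Returns pgno itself when pgno is a
** pointer-map page, and 0 for page 1, which has no entry. */
static unsigned demo_ptrmap_pageno(unsigned usableSize, unsigned pgno){
  unsigned nPagesPerMapPage, iPtrMap;
  if( pgno<2 ) return 0;
  nPagesPerMapPage = usableSize/5 + 1;    /* the map page plus its entries */
  iPtrMap = (pgno-2)/nPagesPerMapPage;    /* index of the map page group */
  return iPtrMap*nPagesPerMapPage + 2;
}

/* Same arithmetic as the PTRMAP_PTROFFSET macro above. */
static unsigned demo_ptrmap_ptroffset(unsigned pgptrmap, unsigned pgno){
  assert( pgno>pgptrmap );
  return 5*(pgno-pgptrmap-1);
}
```

With 1024 usable bytes, page 2 maps pages 3 through 206 and page 207 is the next pointer-map page; comparing the result with pgno is the test that PTRMAP_ISPAGE performs.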

	13601

	13602 /*

	13603 ** The pointer map is a lookup table that identifies the parent page for

	13604 ** each child page in the database file. The parent page is the page that

	13605 ** contains a pointer to the child. Every page in the database contains

	13606 ** 0 or 1 parent pages. (In this context 'database page' refers

	13607 ** to any page that is not part of the pointer map itself.) Each pointer map

	13608 ** entry consists of a single byte 'type' and a 4 byte parent page number.

	13609 ** The PTRMAP_XXX identifiers below are the valid types.

	13610 **

	13611 ** The purpose of the pointer map is to facilitate moving pages from one

	13612 ** position in the file to another as part of autovacuum. When a page

	13613 ** is moved, the pointer in its parent must be updated to point to the

	13614 ** new location. The pointer map is used to locate the parent page quickly.

	13615 **

	13616 ** PTRMAP_ROOTPAGE: The database page is a root-page. The page-number is not

	13617 ** used in this case.

	13618 **

	13619 ** PTRMAP_FREEPAGE: The database page is an unused (free) page. The page-number

	13620 ** is not used in this case.

	13621 **

	13622 ** PTRMAP_OVERFLOW1: The database page is the first page in a list of

	13623 ** overflow pages. The page number identifies the page that

	13624 ** contains the cell with a pointer to this overflow page.

	13625 **

	13626 ** PTRMAP_OVERFLOW2: The database page is the second or later page in a list of

	13627 ** overflow pages. The page-number identifies the previous

	13628 ** page in the overflow page list.

	13629 **

	13630 ** PTRMAP_BTREE: The database page is a non-root btree page. The page number

	13631 ** identifies the parent page in the btree.

	13632 */

	13633 #define PTRMAP_ROOTPAGE 1

	13634 #define PTRMAP_FREEPAGE 2

	13635 #define PTRMAP_OVERFLOW1 3

	13636 #define PTRMAP_OVERFLOW2 4

	13637 #define PTRMAP_BTREE 5
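
A short sketch of the 5-byte entry layout described above: one type byte followed by a 4-byte parent page number. It assumes the page number is stored big-endian, as with the get4byte/put4byte helpers. The toy_* names are hypothetical.

```c
#include <assert.h>

typedef unsigned char u8;
typedef unsigned int u32;

/* Hypothetical helper: write one pointer-map entry into aEntry[0..4]. */
static void toy_ptrmap_put(u8 *aEntry, u8 eType, u32 parent){
  assert( eType>=1 && eType<=5 );   /* PTRMAP_ROOTPAGE .. PTRMAP_BTREE */
  aEntry[0] = eType;
  aEntry[1] = (u8)(parent>>24);
  aEntry[2] = (u8)(parent>>16);
  aEntry[3] = (u8)(parent>>8);
  aEntry[4] = (u8)parent;
}

/* Hypothetical helper: read the type and parent page number back out. */
static void toy_ptrmap_get(const u8 *aEntry, u8 *peType, u32 *pParent){
  *peType = aEntry[0];
  *pParent = ((u32)aEntry[1]<<24) | ((u32)aEntry[2]<<16)
           | ((u32)aEntry[3]<<8)  | (u32)aEntry[4];
}
```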

	13638

	13639 /* A bunch of assert() statements to check the transaction state variables

	13640 ** of handle p (type Btree*) are internally consistent.

	13641 */

	13642 #define btreeIntegrity(p) \

	13643 assert( p->pBt->inTransaction!=TRANS_NONE || p->pBt->nTransaction==0 ); \

	13644 assert( p->pBt->inTransaction>=p->inTrans );

	13645

	13646

	13647 /*

	13648 ** The ISAUTOVACUUM macro is used within balance_nonroot() to determine

	13649 ** if the database supports auto-vacuum or not. Because it is used

	13650 ** within an expression that is an argument to another macro

	13651 ** (sqliteMallocRaw), it is not possible to use conditional compilation.

	13652 ** So, this macro is defined instead.

	13653 */

	13654 #ifndef SQLITE_OMIT_AUTOVACUUM

	13655 #define ISAUTOVACUUM (pBt->autoVacuum)

	13656 #else

	13657 #define ISAUTOVACUUM 0

	13658 #endif

	13659

	13660

	13661 /*

	13662 ** This structure is passed around through all the sanity checking routines

	13663 ** in order to keep track of some global state information.

	13664 **

	13665 ** The aRef[] array is allocated so that there is 1 bit for each page in

	13666 ** the database. As the integrity-check proceeds, for each page used in

	13667 ** the database the corresponding bit is set. This allows integrity-check to

	13668 ** detect pages that are used twice and orphaned pages (both of which

	13669 ** indicate corruption).

	13670 */

	13671 typedef struct IntegrityCk IntegrityCk;

	13672 struct IntegrityCk {

	13673 BtShared *pBt; /* The tree being checked out */

	13674 Pager *pPager; /* The associated pager. Also accessible by pBt->pPager */

	13675 u8 *aPgRef; /* 1 bit per page in the db (see above) */

	13676 Pgno nPage; /* Number of pages in the database */

	13677 int mxErr; /* Stop accumulating errors when this reaches zero */

	13678 int nErr; /* Number of messages written to zErrMsg so far */

	13679 int mallocFailed; /* A memory allocation error has occurred */

	13680 const char *zPfx; /* Error message prefix */

	13681 int v1, v2; /* Values for up to two %d fields in zPfx */

	13682 StrAccum errMsg; /* Accumulate the error message text here */

	13683 u32 *heap; /* Min-heap used for analyzing cell coverage */

	13684 };
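
A minimal sketch of the one-bit-per-page bookkeeping described in the comment above, assuming a second reference to the same page is what signals corruption. ToyCheck and the toy_* functions are hypothetical stand-ins, not the routines used by the integrity checker.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned char u8;
typedef unsigned int Pgno;

typedef struct ToyCheck {
  u8 *aPgRef;   /* 1 bit per page, indexed by page number */
  Pgno nPage;   /* Number of pages in the database */
} ToyCheck;

/* Allocate a zeroed bitmap large enough for pages 0..nPage.
** Returns 1 on success, 0 if the allocation failed. */
static int toy_check_init(ToyCheck *p, Pgno nPage){
  p->nPage = nPage;
  p->aPgRef = (u8*)calloc(nPage/8 + 1, 1);
  return p->aPgRef!=0;
}

/* Return true if the bit for page iPg is already set. */
static int toy_page_referenced(const ToyCheck *p, Pgno iPg){
  assert( iPg<=p->nPage );
  return (p->aPgRef[iPg/8] & (1 << (iPg & 0x07)))!=0;
}

/* Record a reference to page iPg. Returns 0 on the first reference and
** 1 if the page is out of range or was already referenced. */
static int toy_ref_page(ToyCheck *p, Pgno iPg){
  if( iPg==0 || iPg>p->nPage ) return 1;
  if( toy_page_referenced(p, iPg) ) return 1;
  p->aPgRef[iPg/8] |= (u8)(1 << (iPg & 0x07));
  return 0;
}
```

After every b-tree and freelist page has been visited, any page whose bit is still clear was never referenced, which is how an orphaned page would show up.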

	13685

	13686 /*

	13687 ** Routines to read or write two- and four-byte big-endian integer values.

	13688 */

	13689 #define get2byte(x) ((x)[0]<<8 | (x)[1])

	13690 #define put2byte(p,v) ((p)[0] = (u8)((v)>>8), (p)[1] = (u8)(v))

	13691 #define get4byte sqlite3Get4byte

	13692 #define put4byte sqlite3Put4byte
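
A small sketch of what the two-byte macros compute, written as functions so the argument is evaluated once. The toy_* names are hypothetical; the logic mirrors get2byte() and put2byte().

```c
#include <assert.h>

typedef unsigned char u8;

/* Read a big-endian 16-bit value from x[0..1]. */
static unsigned toy_get2byte(const u8 *x){
  return ((unsigned)x[0]<<8) | (unsigned)x[1];
}

/* Write v as a big-endian 16-bit value into p[0..1]. */
static void toy_put2byte(u8 *p, unsigned v){
  assert( v<=0xFFFF );
  p[0] = (u8)(v>>8);
  p[1] = (u8)v;
}
```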

	13693

	13694 /*

	13695 ** get2byteAligned(), unlike get2byte(), requires that its argument point to a

	13696 ** two-byte aligned address. get2byteAligned() is only used for accessing the

	13697 ** cell addresses in a btree header.

	13698 */

	13699 #if SQLITE_BYTEORDER==4321

	13700 # define get2byteAligned(x) (*(u16*)(x))

	13701 #elif SQLITE_BYTEORDER==1234 && GCC_VERSION>=4008000

	13702 # define get2byteAligned(x) __builtin_bswap16(*(u16*)(x))

	13703 #elif SQLITE_BYTEORDER==1234 && MSVC_VERSION>=1300

	13704 # define get2byteAligned(x) _byteswap_ushort(*(u16*)(x))

	13705 #else

	13706 # define get2byteAligned(x) ((x)[0]<<8 | (x)[1])

	13707 #endif

	13708

	13709 /************ End of btreeInt.h ******************************************/

	13710 /************ Continuing where we left off in btmutex.c ******************/

	13711 #ifndef SQLITE_OMIT_SHARED_CACHE

	13712 #if SQLITE_THREADSAFE

	13713

	13714 /*

	13715 ** Obtain the BtShared mutex associated with B-Tree handle p. Also,

	13716 ** set BtShared.db to the database handle associated with p and the

	13717 ** p->locked boolean to true.

	13718 */

	13719 static void lockBtreeMutex(Btree *p){

	13720 assert( p->locked==0 );

	13721 assert( sqlite3_mutex_notheld(p->pBt->mutex) );

	13722 assert( sqlite3_mutex_held(p->db->mutex) );

	13723

	13724 sqlite3_mutex_enter(p->pBt->mutex);

	13725 p->pBt->db = p->db;

	13726 p->locked = 1;

	13727 }

	13728

	13729 /*

	13730 ** Release the BtShared mutex associated with B-Tree handle p and

	13731 ** clear the p->locked boolean.

	13732 */

	13733 static void SQLITE_NOINLINE unlockBtreeMutex(Btree *p){

	13734 BtShared *pBt = p->pBt;

	13735 assert( p->locked==1 );

	13736 assert( sqlite3_mutex_held(pBt->mutex) );

	13737 assert( sqlite3_mutex_held(p->db->mutex) );

	13738 assert( p->db==pBt->db );

	13739

	13740 sqlite3_mutex_leave(pBt->mutex);

	13741 p->locked = 0;

	13742 }

	13743

	13744 /* Forward reference */

	13745 static void SQLITE_NOINLINE btreeLockCarefully(Btree *p);

	13746

	13747 /*

	13748 ** Enter a mutex on the given BTree object.

	13749 **

	13750 ** If the object is not sharable, then no mutex is ever required

	13751 ** and this routine is a no-op. The underlying mutex is non-recursive.

	13752 ** But we keep a reference count in Btree.wantToLock so the behavior

	13753 ** of this interface is recursive.

	13754 **

	13755 ** To avoid deadlocks, multiple Btrees are locked in the same order

	13756 ** by all database connections. The p->pNext is a list of other

	13757 ** Btrees belonging to the same database connection as the p Btree

	13758 ** which need to be locked after p. If we cannot get a lock on

	13759 ** p, then first unlock all of the others on p->pNext, then wait

	13760 ** for the lock to become available on p, then relock all of the

	13761 ** subsequent Btrees that desire a lock.

	13762 */

	13763 SQLITE_PRIVATE void sqlite3BtreeEnter(Btree *p){

	13764 /* Some basic sanity checking on the Btree. The list of Btrees

	13765 ** connected by pNext and pPrev should be in sorted order by

	13766 ** Btree.pBt value. All elements of the list should belong to

	13767 ** the same connection. Only shared Btrees are on the list. */

	13768 assert( p->pNext==0 || p->pNext->pBt>p->pBt );

	13769 assert( p->pPrev==0 || p->pPrev->pBt<p->pBt );

	13770 assert( p->pNext==0 || p->pNext->db==p->db );

	13771 assert( p->pPrev==0 || p->pPrev->db==p->db );

	13772 assert( p->sharable || (p->pNext==0 && p->pPrev==0) );

	13773

	13774 /* Check for locking consistency */

	13775 assert( !p->locked || p->wantToLock>0 );

	13776 assert( p->sharable || p->wantToLock==0 );

	13777

	13778 /* We should already hold a lock on the database connection */

	13779 assert( sqlite3_mutex_held(p->db->mutex) );

	13780

	13781 /* Unless the database is sharable and unlocked, then BtShared.db

	13782 ** should already be set correctly. */

	13783 assert( (p->locked==0 && p->sharable) || p->pBt->db==p->db );

	13784

	13785 if( !p->sharable ) return;

	13786 p->wantToLock++;

	13787 if( p->locked ) return;

	13788 btreeLockCarefully(p);

	13789 }

	13790

	13791 /* This is a helper function for sqlite3BtreeEnter(). By moving

	13792 ** complex, but seldom used logic, out of sqlite3BtreeEnter() and

	13793 ** into this routine, we avoid unnecessary stack pointer changes

	13794 ** and thus help the sqlite3BtreeEnter() routine to run much faster

	13795 ** in the common case.

	13796 */

	13797 static void SQLITE_NOINLINE btreeLockCarefully(Btree *p){

	13798 Btree *pLater;

	13799

	13800 /* In most cases, we should be able to acquire the lock we

	13801 ** want without having to go through the ascending lock

	13802 ** procedure that follows. Just be sure not to block.

	13803 */

	13804 if( sqlite3_mutex_try(p->pBt->mutex)==SQLITE_OK ){

	13805 p->pBt->db = p->db;

	13806 p->locked = 1;

	13807 return;

	13808 }

	13809

	13810 /* To avoid deadlock, first release all locks with a larger

	13811 ** BtShared address. Then acquire our lock. Then reacquire

	13812 ** the other BtShared locks that we used to hold in ascending

	13813 ** order.

	13814 */

	13815 for(pLater=p->pNext; pLater; pLater=pLater->pNext){

	13816 assert( pLater->sharable );

	13817 assert( pLater->pNext==0 || pLater->pNext->pBt>pLater->pBt );

	13818 assert( !pLater->locked || pLater->wantToLock>0 );

	13819 if( pLater->locked ){

	13820 unlockBtreeMutex(pLater);

	13821 }

	13822 }

	13823 lockBtreeMutex(p);

	13824 for(pLater=p->pNext; pLater; pLater=pLater->pNext){

	13825 if( pLater->wantToLock ){

	13826 lockBtreeMutex(pLater);

	13827 }

	13828 }

	13829 }

	13830

	13831

	13832 /*

	13833 ** Exit the recursive mutex on a Btree.

	13834 */

	13835 SQLITE_PRIVATE void sqlite3BtreeLeave(Btree *p){

	13836 assert( sqlite3_mutex_held(p->db->mutex) );

	13837 if( p->sharable ){

	13838 assert( p->wantToLock>0 );

	13839 p->wantToLock--;

	13840 if( p->wantToLock==0 ){

	13841 unlockBtreeMutex(p);

	13842 }

	13843 }

	13844 }
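
A minimal single-threaded sketch of how the wantToLock counter gives recursive enter/leave behavior on top of a non-recursive mutex. ToyMutex and ToyBtree are hypothetical stand-ins. The sketch leaves out the database-connection mutex and the ordered relocking done by btreeLockCarefully().

```c
#include <assert.h>

/* Toy non-recursive lock: entering it twice trips an assert. */
typedef struct ToyMutex { int held; } ToyMutex;
static void toy_mutex_enter(ToyMutex *m){ assert( !m->held ); m->held = 1; }
static void toy_mutex_leave(ToyMutex *m){ assert( m->held ); m->held = 0; }

typedef struct ToyBtree {
  ToyMutex *pMutex;   /* Underlying non-recursive lock */
  int sharable;       /* True if a lock is needed at all */
  int wantToLock;     /* Number of nested enter calls */
  int locked;         /* True if pMutex is currently held */
} ToyBtree;

/* Enter: count the request, and take the real lock only on the first one. */
static void toy_btree_enter(ToyBtree *p){
  if( !p->sharable ) return;
  p->wantToLock++;
  if( p->locked ) return;
  toy_mutex_enter(p->pMutex);
  p->locked = 1;
}

/* Leave: drop the real lock only when the last nested enter is undone. */
static void toy_btree_leave(ToyBtree *p){
  if( !p->sharable ) return;
  assert( p->wantToLock>0 );
  p->wantToLock--;
  if( p->wantToLock==0 ){
    toy_mutex_leave(p->pMutex);
    p->locked = 0;
  }
}
```

Two nested enter calls take the toy mutex once, and the mutex is released only by the second leave.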

	13845

	13846 #ifndef NDEBUG

	13847 /*

	13848 ** Return true if the BtShared mutex is held on the btree, or if the

	13849 ** B-Tree is not marked as sharable.

	13850 **

	13851 ** This routine is used only from within assert() statements.

	13852 */

	13853 SQLITE_PRIVATE int sqlite3BtreeHoldsMutex(Btree *p){

	13854 assert( p->sharable==0 || p->locked==0 || p->wantToLock>0 );

	13855 assert( p->sharable==0 || p->locked==0 || p->db==p->pBt->db );

	13856 assert( p->sharable==0 || p->locked==0 || sqlite3_mutex_held(p->pBt->mutex) );

	13857 assert( p->sharable==0 || p->locked==0 || sqlite3_mutex_held(p->db->mutex) );

	13858

	13859 return (p->sharable==0 || p->locked);

	13860 }

	13861 #endif

	13862

	13863

	13864 /*

	13865 ** Enter the mutex on every Btree associated with a database

	13866 ** connection. This is needed (for example) prior to parsing

	13867 ** a statement since we will be comparing table and column names

	13868 ** against all schemas and we do not want those schemas being

	13869 ** reset out from under us.

	13870 **

	13871 ** There is a corresponding leave-all procedure.

	13872 **

	13873 ** Enter the mutexes in ascending order by BtShared pointer address

	13874 ** to avoid the possibility of deadlock when two threads with

	13875 ** two or more btrees in common both try to lock all their btrees

	13876 ** at the same instant.

	13877 */

	13878 static void SQLITE_NOINLINE btreeEnterAll(sqlite3 *db){

	13879 int i;

	13880 int skipOk = 1;

	13881 Btree *p;

	13882 assert( sqlite3_mutex_held(db->mutex) );

	13883 for(i=0; i<db->nDb; i++){

	13884 p = db->aDb[i].pBt;

	13885 if( p && p->sharable ){

	13886 sqlite3BtreeEnter(p);

	13887 skipOk = 0;

	13888 }

	13889 }

	13890 db->skipBtreeMutex = skipOk;

	13891 }

	13892 SQLITE_PRIVATE void sqlite3BtreeEnterAll(sqlite3 *db){

	13893 if( db->skipBtreeMutex==0 ) btreeEnterAll(db);

	13894 }

	13895 static void SQLITE_NOINLINE btreeLeaveAll(sqlite3 *db){

	13896 int i;

	13897 Btree *p;

	13898 assert( sqlite3_mutex_held(db->mutex) );

	13899 for(i=0; i<db->nDb; i++){

	13900 p = db->aDb[i].pBt;

	13901 if( p ) sqlite3BtreeLeave(p);

	13902 }

	13903 }

	13904 SQLITE_PRIVATE void sqlite3BtreeLeaveAll(sqlite3 *db){

	13905 if( db->skipBtreeMutex==0 ) btreeLeaveAll(db);

	13906 }

	13907

	13908 #ifndef NDEBUG

	13909 /*

	13910 ** Return true if the current thread holds the database connection

	13911 ** mutex and all required BtShared mutexes.

	13912 **

	13913 ** This routine is used inside assert() statements only.

	13914 */

	13915 SQLITE_PRIVATE int sqlite3BtreeHoldsAllMutexes(sqlite3 *db){

	13916 int i;

	13917 if( !sqlite3_mutex_held(db->mutex) ){

	13918 return 0;

	13919 }

	13920 for(i=0; i<db->nDb; i++){

	13921 Btree *p;

	13922 p = db->aDb[i].pBt;

	13923 if( p && p->sharable &&

	13924 (p->wantToLock==0 || !sqlite3_mutex_held(p->pBt->mutex)) ){

	13925 return 0;

	13926 }

	13927 }

	13928 return 1;

	13929 }

	13930 #endif /* NDEBUG */

	13931

	13932 #ifndef NDEBUG

	13933 /*

	13934 ** Return true if the correct mutexes are held for accessing the

	13935 ** db->aDb[iDb].pSchema structure. The mutexes required for schema

	13936 ** access are:

	13937 **

	13938 ** (1) The mutex on db

	13939 ** (2) if iDb!=1, then the mutex on db->aDb[iDb].pBt.

	13940 **

	13941 ** If pSchema is not NULL, then iDb is computed from pSchema and

	13942 ** db using sqlite3SchemaToIndex().

	13943 */

	13944 SQLITE_PRIVATE int sqlite3SchemaMutexHeld(sqlite3 *db, int iDb, Schema *pSchema){

	13945 Btree *p;

	13946 assert( db!=0 );

	13947 if( pSchema ) iDb = sqlite3SchemaToIndex(db, pSchema);

	13948 assert( iDb>=0 && iDb<db->nDb );

	13949 if( !sqlite3_mutex_held(db->mutex) ) return 0;

	13950 if( iDb==1 ) return 1;

	13951 p = db->aDb[iDb].pBt;

	13952 assert( p!=0 );

	13953 return p->sharable==0 || p->locked==1;

	13954 }

	13955 #endif /* NDEBUG */

	13956

	13957 #else /* SQLITE_THREADSAFE>0 above. SQLITE_THREADSAFE==0 below */

	13958 /*

	13959 ** The following are special cases for mutex enter routines for use

	13960 ** in single threaded applications that use shared cache. Except for

	13961 ** these two routines, all mutex operations are no-ops in that case and

	13962 ** are null #defines in btree.h.

	13963 **

	13964 ** If shared cache is disabled, then all btree mutex routines, including

	13965 ** the ones below, are no-ops and are null #defines in btree.h.

	13966 */

	13967

	13968 SQLITE_PRIVATE void sqlite3BtreeEnter(Btree *p){

	13969 p->pBt->db = p->db;

	13970 }

	13971 SQLITE_PRIVATE void sqlite3BtreeEnterAll(sqlite3 *db){

	13972 int i;

	13973 for(i=0; i<db->nDb; i++){

	13974 Btree *p = db->aDb[i].pBt;

	13975 if( p ){

	13976 p->pBt->db = p->db;

	13977 }

	13978 }

	13979 }

	13980 #endif /* if SQLITE_THREADSAFE */

	13981

	13982 #ifndef SQLITE_OMIT_INCRBLOB

	13983 /*

	13984 ** Enter a mutex on a Btree given a cursor owned by that Btree.

	13985 **

	13986 ** These entry points are used by incremental I/O only. Enter() is required

	13987 ** any time OMIT_SHARED_CACHE is not defined, regardless of whether or not

	13988 ** the build is threadsafe. Leave() is only required by threadsafe builds.

	13989 */

	13990 SQLITE_PRIVATE void sqlite3BtreeEnterCursor(BtCursor *pCur){

	13991 sqlite3BtreeEnter(pCur->pBtree);

	13992 }

	13993 # if SQLITE_THREADSAFE

	13994 SQLITE_PRIVATE void sqlite3BtreeLeaveCursor(BtCursor *pCur){

	13995 sqlite3BtreeLeave(pCur->pBtree);

	13996 }

	13997 # endif

	13998 #endif /* ifndef SQLITE_OMIT_INCRBLOB */

	13999

	14000 #endif /* ifndef SQLITE_OMIT_SHARED_CACHE */

	14001

	14002 /************ End of btmutex.c *******************************************/

	14003

	14004 /* Chain include. */

	14005 #include "sqlite3.03.c"
