Chromium Code Reviews

Side by Side Diff: third_party/sqlite/amalgamation/sqlite3.02.c

Issue 2755803002: NCI: trybot test for sqlite 3.17 import. (Closed)
Patch Set: also clang on Linux i386 (Created 3 years, 9 months ago)
OLD: (Empty)
NEW:
1 /************** Begin file pcache1.c *****************************************/
2 /*
3 ** 2008 November 05
4 **
5 ** The author disclaims copyright to this source code. In place of
6 ** a legal notice, here is a blessing:
7 **
8 ** May you do good and not evil.
9 ** May you find forgiveness for yourself and forgive others.
10 ** May you share freely, never taking more than you give.
11 **
12 *************************************************************************
13 **
14 ** This file implements the default page cache implementation (the
15 ** sqlite3_pcache interface). It also contains part of the implementation
16 ** of the SQLITE_CONFIG_PAGECACHE and sqlite3_release_memory() features.
17 ** If the default page cache implementation is overridden, then neither of
18 ** these two features is available.
19 **
20 ** A Page cache line looks like this:
21 **
22 ** -------------------------------------------------------------
23 ** | database page content | PgHdr1 | MemPage | PgHdr |
24 ** -------------------------------------------------------------
25 **
26 ** The database page content is up front (so that buffer overreads tend to
27 ** flow harmlessly into the PgHdr1, MemPage, and PgHdr extensions). MemPage
28 ** is the extension added by the btree.c module containing information such
29 ** as the database page number and how that database page is used. PgHdr
30 ** is added by the pcache.c layer and contains information used to keep track
31 ** of which pages are "dirty". PgHdr1 is an extension added by this
32 ** module (pcache1.c). The PgHdr1 header is a subclass of sqlite3_pcache_page.
33 ** PgHdr1 contains information needed to look up a page by its page number.
34 ** The superclass sqlite3_pcache_page.pBuf points to the start of the
35 ** database page content and sqlite3_pcache_page.pExtra points to PgHdr.
36 **
37 ** The size of the extension (MemPage+PgHdr+PgHdr1) can be determined at
38 ** runtime using sqlite3_config(SQLITE_CONFIG_PCACHE_HDRSZ, &size). The
39 ** sizes of the extensions sum to 272 bytes on x64 for 3.8.10, but this
40 ** size can vary according to architecture, compile-time options, and
41 ** SQLite library version number.
42 **
43 ** If SQLITE_PCACHE_SEPARATE_HEADER is defined, then the extension is obtained
44 ** using a separate memory allocation from the database page content. This
45 ** seeks to overcome the "clownshoe" problem (also called "internal
46 ** fragmentation" in academic literature) of allocating a few bytes more
47 ** than a power of two with the memory allocator rounding up to the next
48 ** power of two, and leaving the rounded-up space unused.
49 **
50 ** This module tracks pointers to PgHdr1 objects. Only pcache.c communicates
51 ** with this module. Information is passed back and forth as PgHdr1 pointers.
52 **
53 ** The pcache.c and pager.c modules deal with pointers to PgHdr objects.
54 ** The btree.c module deals with pointers to MemPage objects.
55 **
56 ** SOURCE OF PAGE CACHE MEMORY:
57 **
58 ** Memory for a page might come from any of three sources:
59 **
60 ** (1) The general-purpose memory allocator - sqlite3Malloc()
61 ** (2) Global page-cache memory provided using sqlite3_config() with
62 ** SQLITE_CONFIG_PAGECACHE.
63 ** (3) PCache-local bulk allocation.
64 **
65 ** The third case is a chunk of heap memory (defaulting to 100 pages worth)
66 ** that is allocated when the page cache is created. The size of the local
67 ** bulk allocation can be adjusted using
68 **
69 ** sqlite3_config(SQLITE_CONFIG_PAGECACHE, (void*)0, 0, N).
70 **
71 ** If N is positive, then N pages worth of memory are allocated using a single
72 ** sqlite3Malloc() call and that memory is used for the first N pages allocated.
73 ** Or if N is negative, then -1024*N bytes of memory are allocated and used
74 ** for as many pages as can be accommodated.
75 **
76 ** Only one of (2) or (3) can be used. Once the memory available to (2) or
77 ** (3) is exhausted, subsequent allocations fail over to the general-purpose
78 ** memory allocator (1).
79 **
80 ** Earlier versions of SQLite used only methods (1) and (2). But experiments
81 ** show that method (3) with N==100 provides about a 5% performance boost for
82 ** common workloads.
83 */
84 /* #include "sqliteInt.h" */
85
86 typedef struct PCache1 PCache1;
87 typedef struct PgHdr1 PgHdr1;
88 typedef struct PgFreeslot PgFreeslot;
89 typedef struct PGroup PGroup;
90
91 /*
92 ** Each cache entry is represented by an instance of the following
93 ** structure. Unless SQLITE_PCACHE_SEPARATE_HEADER is defined, a buffer of
94 ** PgHdr1.pCache->szPage bytes is allocated directly before this structure
95 ** in memory.
96 */
97 struct PgHdr1 {
98 sqlite3_pcache_page page; /* Base class. Must be first. pBuf & pExtra */
99 unsigned int iKey; /* Key value (page number) */
100 u8 isPinned; /* Page in use, not on the LRU list */
101 u8 isBulkLocal; /* This page from bulk local storage */
102 u8 isAnchor; /* This is the PGroup.lru element */
103 PgHdr1 *pNext; /* Next in hash table chain */
104 PCache1 *pCache; /* Cache that currently owns this page */
105 PgHdr1 *pLruNext; /* Next in LRU list of unpinned pages */
106 PgHdr1 *pLruPrev; /* Previous in LRU list of unpinned pages */
107 };
108
109 /* Each page cache (or PCache) belongs to a PGroup. A PGroup is a set
110 ** of one or more PCaches that are able to recycle each other's unpinned
111 ** pages when they are under memory pressure. A PGroup is an instance of
112 ** the following object.
113 **
114 ** This page cache implementation works in one of two modes:
115 **
116 ** (1) Every PCache is the sole member of its own PGroup. There is
117 ** one PGroup per PCache.
118 **
119 ** (2) There is a single global PGroup that all PCaches are a member
120 ** of.
121 **
122 ** Mode 1 uses more memory (since PCache instances are not able to rob
123 ** unused pages from other PCaches) but it also operates without a mutex,
124 ** and is therefore often faster. Mode 2 requires a mutex in order to be
125 ** threadsafe, but recycles pages more efficiently.
126 **
127 ** For mode (1), PGroup.mutex is NULL. For mode (2) there is only a single
128 ** PGroup which is the pcache1.grp global variable and its mutex is
129 ** SQLITE_MUTEX_STATIC_LRU.
130 */
131 struct PGroup {
132 sqlite3_mutex *mutex; /* MUTEX_STATIC_LRU or NULL */
133 unsigned int nMaxPage; /* Sum of nMax for purgeable caches */
134 unsigned int nMinPage; /* Sum of nMin for purgeable caches */
135 unsigned int mxPinned; /* nMaxPage + 10 - nMinPage */
136 unsigned int nCurrentPage; /* Number of purgeable pages allocated */
137 PgHdr1 lru; /* The beginning and end of the LRU list */
138 };
139
140 /* Each page cache is an instance of the following object. Every
141 ** open database file (including each in-memory database and each
142 ** temporary or transient database) has a single page cache which
143 ** is an instance of this object.
144 **
145 ** Pointers to structures of this type are cast and returned as
146 ** opaque sqlite3_pcache* handles.
147 */
148 struct PCache1 {
149 /* Cache configuration parameters. Page size (szPage) and the purgeable
150 ** flag (bPurgeable) are set when the cache is created. nMax may be
151 ** modified at any time by a call to the pcache1Cachesize() method.
152 ** The PGroup mutex must be held when accessing nMax.
153 */
154 PGroup *pGroup; /* PGroup this cache belongs to */
155 int szPage; /* Size of database content section */
156 int szExtra; /* sizeof(MemPage)+sizeof(PgHdr) */
157 int szAlloc; /* Total size of one pcache line */
158 int bPurgeable; /* True if cache is purgeable */
159 unsigned int nMin; /* Minimum number of pages reserved */
160 unsigned int nMax; /* Configured "cache_size" value */
161 unsigned int n90pct; /* nMax*9/10 */
162 unsigned int iMaxKey; /* Largest key seen since xTruncate() */
163
164 /* Hash table of all pages. The following variables may only be accessed
165 ** when the accessor is holding the PGroup mutex.
166 */
167 unsigned int nRecyclable; /* Number of pages in the LRU list */
168 unsigned int nPage; /* Total number of pages in apHash */
169 unsigned int nHash; /* Number of slots in apHash[] */
170 PgHdr1 **apHash; /* Hash table for fast lookup by key */
171 PgHdr1 *pFree; /* List of unused pcache-local pages */
172 void *pBulk; /* Bulk memory used by pcache-local */
173 };
174
175 /*
176 ** Free slots in the allocator used to divide up the global page cache
177 ** buffer provided using the SQLITE_CONFIG_PAGECACHE mechanism.
178 */
179 struct PgFreeslot {
180 PgFreeslot *pNext; /* Next free slot */
181 };
182
183 /*
184 ** Global data used by this cache.
185 */
186 static SQLITE_WSD struct PCacheGlobal {
187 PGroup grp; /* The global PGroup for mode (2) */
188
189 /* Variables related to SQLITE_CONFIG_PAGECACHE settings. The
190 ** szSlot, nSlot, pStart, pEnd, nReserve, and isInit values are all
191 ** fixed at sqlite3_initialize() time and do not require mutex protection.
192 ** The nFreeSlot and pFree values do require mutex protection.
193 */
194 int isInit; /* True if initialized */
195 int separateCache; /* Use a new PGroup for each PCache */
196 int nInitPage; /* Initial bulk allocation size */
197 int szSlot; /* Size of each free slot */
198 int nSlot; /* The number of pcache slots */
199 int nReserve; /* Try to keep nFreeSlot above this */
200 void *pStart, *pEnd; /* Bounds of global page cache memory */
201 /* Above requires no mutex. Use mutex below for variables that follow. */
202 sqlite3_mutex *mutex; /* Mutex for accessing the following: */
203 PgFreeslot *pFree; /* Free page blocks */
204 int nFreeSlot; /* Number of unused pcache slots */
205 /* The following value requires a mutex to change. We skip the mutex on
206 ** reading because (1) most platforms read a 32-bit integer atomically and
207 ** (2) even if an incorrect value is read, no great harm is done since this
208 ** is really just an optimization. */
209 int bUnderPressure; /* True if low on PAGECACHE memory */
210 } pcache1_g;
211
212 /*
213 ** All code in this file should access the global structure above via the
214 ** alias "pcache1". This ensures that the WSD emulation is used when
215 ** compiling for systems that do not support real WSD.
216 */
217 #define pcache1 (GLOBAL(struct PCacheGlobal, pcache1_g))
218
219 /*
220 ** Macros to enter and leave the PCache LRU mutex.
221 */
222 #if !defined(SQLITE_ENABLE_MEMORY_MANAGEMENT) || SQLITE_THREADSAFE==0
223 # define pcache1EnterMutex(X) assert((X)->mutex==0)
224 # define pcache1LeaveMutex(X) assert((X)->mutex==0)
225 # define PCACHE1_MIGHT_USE_GROUP_MUTEX 0
226 #else
227 # define pcache1EnterMutex(X) sqlite3_mutex_enter((X)->mutex)
228 # define pcache1LeaveMutex(X) sqlite3_mutex_leave((X)->mutex)
229 # define PCACHE1_MIGHT_USE_GROUP_MUTEX 1
230 #endif
231
232 /******************************************************************************/
233 /******** Page Allocation/SQLITE_CONFIG_PCACHE Related Functions **************/
234
235
236 /*
237 ** This function is called during initialization if a static buffer is
238 ** supplied to use for the page-cache by passing the SQLITE_CONFIG_PAGECACHE
239 ** verb to sqlite3_config(). Parameter pBuf points to an allocation large
240 ** enough to contain 'n' buffers of 'sz' bytes each.
241 **
242 ** This routine is called from sqlite3_initialize() and so it is guaranteed
243 ** to be serialized already. There is no need for further mutexing.
244 */
245 SQLITE_PRIVATE void sqlite3PCacheBufferSetup(void *pBuf, int sz, int n){
246 if( pcache1.isInit ){
247 PgFreeslot *p;
248 if( pBuf==0 ) sz = n = 0;
249 sz = ROUNDDOWN8(sz);
250 pcache1.szSlot = sz;
251 pcache1.nSlot = pcache1.nFreeSlot = n;
252 pcache1.nReserve = n>90 ? 10 : (n/10 + 1);
253 pcache1.pStart = pBuf;
254 pcache1.pFree = 0;
255 pcache1.bUnderPressure = 0;
256 while( n-- ){
257 p = (PgFreeslot*)pBuf;
258 p->pNext = pcache1.pFree;
259 pcache1.pFree = p;
260 pBuf = (void*)&((char*)pBuf)[sz];
261 }
262 pcache1.pEnd = pBuf;
263 }
264 }
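The slot-carving loop above can be sketched in isolation. This is a minimal illustration, not the SQLite code: `SlotPool`, `FreeSlot`, `slotPoolSetup`, `slotPoolAlloc`, and `slotPoolUnderPressure` are hypothetical names, there is no mutex, and the caller is assumed to pass a suitably aligned buffer of at least `n` slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PgFreeslot: a free slot stores only a link. */
typedef struct FreeSlot FreeSlot;
struct FreeSlot { FreeSlot *pNext; };

/* Hypothetical stand-in for the SQLITE_CONFIG_PAGECACHE fields of pcache1. */
typedef struct SlotPool {
  FreeSlot *pFree;      /* Head of the free-slot list */
  int szSlot;           /* Slot size, rounded down to a multiple of 8 */
  int nSlot;            /* Total number of slots */
  int nFreeSlot;        /* Slots currently on the free list */
  int nReserve;         /* Pressure threshold */
  void *pStart, *pEnd;  /* Bounds of the caller-supplied buffer */
} SlotPool;

/* Carve pBuf into n slots of sz bytes and thread them onto a free list. */
static void slotPoolSetup(SlotPool *pool, void *pBuf, int sz, int n){
  char *z = (char*)pBuf;
  if( pBuf==0 ) sz = n = 0;
  sz &= ~7;                                  /* same effect as ROUNDDOWN8 */
  pool->szSlot = sz;
  pool->nSlot = pool->nFreeSlot = n;
  pool->nReserve = n>90 ? 10 : (n/10 + 1);
  pool->pStart = pBuf;
  pool->pFree = 0;
  while( n-- > 0 ){
    FreeSlot *p = (FreeSlot*)z;
    p->pNext = pool->pFree;                  /* push slot on the list head */
    pool->pFree = p;
    z += sz;
  }
  pool->pEnd = z;
}

/* Pop one slot, or return NULL when the pool is exhausted. */
static void *slotPoolAlloc(SlotPool *pool){
  FreeSlot *p = pool->pFree;
  if( p ){
    pool->pFree = p->pNext;
    pool->nFreeSlot--;
  }
  return p;
}

/* True once the number of free slots drops below the reserve. */
static int slotPoolUnderPressure(const SlotPool *pool){
  return pool->nFreeSlot < pool->nReserve;
}
```

Because each slot is pushed on the head of the list, the last slot carved is the first one handed out.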
265
266 /*
267 ** Try to initialize the pCache->pFree and pCache->pBulk fields. Return
268 ** true if pCache->pFree ends up containing one or more free pages.
269 */
270 static int pcache1InitBulk(PCache1 *pCache){
271 i64 szBulk;
272 char *zBulk;
273 if( pcache1.nInitPage==0 ) return 0;
274 /* Do not bother with a bulk allocation if the cache size is very small */
275 if( pCache->nMax<3 ) return 0;
276 sqlite3BeginBenignMalloc();
277 if( pcache1.nInitPage>0 ){
278 szBulk = pCache->szAlloc * (i64)pcache1.nInitPage;
279 }else{
280 szBulk = -1024 * (i64)pcache1.nInitPage;
281 }
282 if( szBulk > pCache->szAlloc*(i64)pCache->nMax ){
283 szBulk = pCache->szAlloc*(i64)pCache->nMax;
284 }
285 zBulk = pCache->pBulk = sqlite3Malloc( szBulk );
286 sqlite3EndBenignMalloc();
287 if( zBulk ){
288 int nBulk = sqlite3MallocSize(zBulk)/pCache->szAlloc;
289 int i;
290 for(i=0; i<nBulk; i++){
291 PgHdr1 *pX = (PgHdr1*)&zBulk[pCache->szPage];
292 pX->page.pBuf = zBulk;
293 pX->page.pExtra = &pX[1];
294 pX->isBulkLocal = 1;
295 pX->isAnchor = 0;
296 pX->pNext = pCache->pFree;
297 pCache->pFree = pX;
298 zBulk += pCache->szAlloc;
299 }
300 }
301 return pCache->pFree!=0;
302 }
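The size selection in pcache1InitBulk() can be restated as a pure function, which makes the sign convention for N easier to check. This is a sketch only: `bulkBytes` is a hypothetical helper, and the `szAlloc` value of 4368 used in the note below is an illustrative figure, not one taken from the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes the bulk allocation would request.
** nInitPage > 0 : that many pages worth of memory
** nInitPage < 0 : -1024*nInitPage bytes
** In both cases the request is capped at nMax pages, and no bulk
** allocation is made when nInitPage is 0 or the cache is very small. */
static int64_t bulkBytes(int nInitPage, int szAlloc, unsigned int nMax){
  int64_t szBulk;
  int64_t szCap = (int64_t)szAlloc * (int64_t)nMax;
  if( nInitPage==0 || nMax<3 ) return 0;
  if( nInitPage>0 ){
    szBulk = (int64_t)szAlloc * (int64_t)nInitPage;
  }else{
    szBulk = -1024 * (int64_t)nInitPage;
  }
  if( szBulk>szCap ) szBulk = szCap;
  return szBulk;
}
```

With `szAlloc` of 4368 and `nMax` of 2000, N==100 requests 436800 bytes and N==-2000 requests 2048000 bytes. Lowering `nMax` to 50 caps the N==100 request at 218400 bytes.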
303
304 /*
305 ** Malloc function used within this file to allocate space from the buffer
306 ** configured using sqlite3_config(SQLITE_CONFIG_PAGECACHE) option. If no
307 ** such buffer exists or there is no space left in it, this function falls
308 ** back to sqlite3Malloc().
309 **
310 ** Multiple threads can run this routine at the same time. Global variables
311 ** in pcache1 need to be protected via mutex.
312 */
313 static void *pcache1Alloc(int nByte){
314 void *p = 0;
315 assert( sqlite3_mutex_notheld(pcache1.grp.mutex) );
316 if( nByte<=pcache1.szSlot ){
317 sqlite3_mutex_enter(pcache1.mutex);
318 p = (PgHdr1 *)pcache1.pFree;
319 if( p ){
320 pcache1.pFree = pcache1.pFree->pNext;
321 pcache1.nFreeSlot--;
322 pcache1.bUnderPressure = pcache1.nFreeSlot<pcache1.nReserve;
323 assert( pcache1.nFreeSlot>=0 );
324 sqlite3StatusHighwater(SQLITE_STATUS_PAGECACHE_SIZE, nByte);
325 sqlite3StatusUp(SQLITE_STATUS_PAGECACHE_USED, 1);
326 }
327 sqlite3_mutex_leave(pcache1.mutex);
328 }
329 if( p==0 ){
330 /* Memory is not available in the SQLITE_CONFIG_PAGECACHE pool. Get
331 ** it from sqlite3Malloc instead.
332 */
333 p = sqlite3Malloc(nByte);
334 #ifndef SQLITE_DISABLE_PAGECACHE_OVERFLOW_STATS
335 if( p ){
336 int sz = sqlite3MallocSize(p);
337 sqlite3_mutex_enter(pcache1.mutex);
338 sqlite3StatusHighwater(SQLITE_STATUS_PAGECACHE_SIZE, nByte);
339 sqlite3StatusUp(SQLITE_STATUS_PAGECACHE_OVERFLOW, sz);
340 sqlite3_mutex_leave(pcache1.mutex);
341 }
342 #endif
343 sqlite3MemdebugSetType(p, MEMTYPE_PCACHE);
344 }
345 return p;
346 }
347
348 /*
349 ** Free an allocated buffer obtained from pcache1Alloc().
350 */
351 static void pcache1Free(void *p){
352 if( p==0 ) return;
353 if( SQLITE_WITHIN(p, pcache1.pStart, pcache1.pEnd) ){
354 PgFreeslot *pSlot;
355 sqlite3_mutex_enter(pcache1.mutex);
356 sqlite3StatusDown(SQLITE_STATUS_PAGECACHE_USED, 1);
357 pSlot = (PgFreeslot*)p;
358 pSlot->pNext = pcache1.pFree;
359 pcache1.pFree = pSlot;
360 pcache1.nFreeSlot++;
361 pcache1.bUnderPressure = pcache1.nFreeSlot<pcache1.nReserve;
362 assert( pcache1.nFreeSlot<=pcache1.nSlot );
363 sqlite3_mutex_leave(pcache1.mutex);
364 }else{
365 assert( sqlite3MemdebugHasType(p, MEMTYPE_PCACHE) );
366 sqlite3MemdebugSetType(p, MEMTYPE_HEAP);
367 #ifndef SQLITE_DISABLE_PAGECACHE_OVERFLOW_STATS
368 {
369 int nFreed = 0;
370 nFreed = sqlite3MallocSize(p);
371 sqlite3_mutex_enter(pcache1.mutex);
372 sqlite3StatusDown(SQLITE_STATUS_PAGECACHE_OVERFLOW, nFreed);
373 sqlite3_mutex_leave(pcache1.mutex);
374 }
375 #endif
376 sqlite3_free(p);
377 }
378 }
379
380 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT
381 /*
382 ** Return the size of a pcache allocation
383 */
384 static int pcache1MemSize(void *p){
385 if( p>=pcache1.pStart && p<pcache1.pEnd ){
386 return pcache1.szSlot;
387 }else{
388 int iSize;
389 assert( sqlite3MemdebugHasType(p, MEMTYPE_PCACHE) );
390 sqlite3MemdebugSetType(p, MEMTYPE_HEAP);
391 iSize = sqlite3MallocSize(p);
392 sqlite3MemdebugSetType(p, MEMTYPE_PCACHE);
393 return iSize;
394 }
395 }
396 #endif /* SQLITE_ENABLE_MEMORY_MANAGEMENT */
397
398 /*
399 ** Allocate a new page object initially associated with cache pCache.
400 */
401 static PgHdr1 *pcache1AllocPage(PCache1 *pCache, int benignMalloc){
402 PgHdr1 *p = 0;
403 void *pPg;
404
405 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );
406 if( pCache->pFree || (pCache->nPage==0 && pcache1InitBulk(pCache)) ){
407 p = pCache->pFree;
408 pCache->pFree = p->pNext;
409 p->pNext = 0;
410 }else{
411 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT
412 /* The group mutex must be released before pcache1Alloc() is called. This
413 ** is because it might call sqlite3_release_memory(), which assumes that
414 ** this mutex is not held. */
415 assert( pcache1.separateCache==0 );
416 assert( pCache->pGroup==&pcache1.grp );
417 pcache1LeaveMutex(pCache->pGroup);
418 #endif
419 if( benignMalloc ){ sqlite3BeginBenignMalloc(); }
420 #ifdef SQLITE_PCACHE_SEPARATE_HEADER
421 pPg = pcache1Alloc(pCache->szPage);
422 p = sqlite3Malloc(sizeof(PgHdr1) + pCache->szExtra);
423 if( !pPg || !p ){
424 pcache1Free(pPg);
425 sqlite3_free(p);
426 pPg = 0;
427 }
428 #else
429 pPg = pcache1Alloc(pCache->szAlloc);
430 p = (PgHdr1 *)&((u8 *)pPg)[pCache->szPage];
431 #endif
432 if( benignMalloc ){ sqlite3EndBenignMalloc(); }
433 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT
434 pcache1EnterMutex(pCache->pGroup);
435 #endif
436 if( pPg==0 ) return 0;
437 p->page.pBuf = pPg;
438 p->page.pExtra = &p[1];
439 p->isBulkLocal = 0;
440 p->isAnchor = 0;
441 }
442 if( pCache->bPurgeable ){
443 pCache->pGroup->nCurrentPage++;
444 }
445 return p;
446 }
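When SQLITE_PCACHE_SEPARATE_HEADER is not defined, the header sits directly after the page content in the same allocation. The pointer arithmetic can be sketched as follows. `LineHdr`, `lineSize`, and `lineInit` are hypothetical names, and `LineHdr` is a cut-down stand-in for PgHdr1 that keeps only the fields needed to show the layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ROUND8(x) (((x)+7) & ~(size_t)7)  /* round up to a multiple of 8 */

/* Hypothetical cut-down header; the real PgHdr1 has more fields. */
typedef struct LineHdr {
  void *pBuf;         /* Start of the page content */
  void *pExtra;       /* Start of the extension area after the header */
  unsigned int iKey;  /* Page number */
} LineHdr;

/* Total bytes for one cache line: content + extension + header. */
static size_t lineSize(size_t szPage, size_t szExtra){
  return szPage + szExtra + ROUND8(sizeof(LineHdr));
}

/* Locate the header inside a line: content first, header at offset
** szPage, extension immediately after the header. */
static LineHdr *lineInit(void *pLine, size_t szPage){
  LineHdr *p = (LineHdr*)((unsigned char*)pLine + szPage);
  p->pBuf = pLine;
  p->pExtra = &p[1];
  return p;
}
```

Since szPage is a power of two no smaller than 512, the header lands on an aligned address as long as the line itself is aligned.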
447
448 /*
449 ** Free a page object allocated by pcache1AllocPage().
450 */
451 static void pcache1FreePage(PgHdr1 *p){
452 PCache1 *pCache;
453 assert( p!=0 );
454 pCache = p->pCache;
455 assert( sqlite3_mutex_held(p->pCache->pGroup->mutex) );
456 if( p->isBulkLocal ){
457 p->pNext = pCache->pFree;
458 pCache->pFree = p;
459 }else{
460 pcache1Free(p->page.pBuf);
461 #ifdef SQLITE_PCACHE_SEPARATE_HEADER
462 sqlite3_free(p);
463 #endif
464 }
465 if( pCache->bPurgeable ){
466 pCache->pGroup->nCurrentPage--;
467 }
468 }
469
470 /*
471 ** Malloc function used by SQLite to obtain space from the buffer configured
472 ** using sqlite3_config(SQLITE_CONFIG_PAGECACHE) option. If no such buffer
473 ** exists, this function falls back to sqlite3Malloc().
474 */
475 SQLITE_PRIVATE void *sqlite3PageMalloc(int sz){
476 return pcache1Alloc(sz);
477 }
478
479 /*
480 ** Free an allocated buffer obtained from sqlite3PageMalloc().
481 */
482 SQLITE_PRIVATE void sqlite3PageFree(void *p){
483 pcache1Free(p);
484 }
485
486
487 /*
488 ** Return true if it is desirable to avoid allocating a new page cache
489 ** entry.
490 **
491 ** If memory was allocated specifically to the page cache using
492 ** SQLITE_CONFIG_PAGECACHE but that memory has all been used, then
493 ** it is desirable to avoid allocating a new page cache entry because
494 ** presumably SQLITE_CONFIG_PAGECACHE was supposed to be sufficient
495 ** for all page cache needs and we should not need to spill the
496 ** allocation onto the heap.
497 **
498 ** Or, the heap is used for all page cache memory but the heap is
499 ** under memory pressure, then again it is desirable to avoid
500 ** allocating a new page cache entry in order to avoid stressing
501 ** the heap even further.
502 */
503 static int pcache1UnderMemoryPressure(PCache1 *pCache){
504 if( pcache1.nSlot && (pCache->szPage+pCache->szExtra)<=pcache1.szSlot ){
505 return pcache1.bUnderPressure;
506 }else{
507 return sqlite3HeapNearlyFull();
508 }
509 }
510
511 /******************************************************************************/
512 /******** General Implementation Functions ************************************/
513
514 /*
515 ** This function is used to resize the hash table used by the cache passed
516 ** as the first argument.
517 **
518 ** The PGroup mutex must be held when this function is called.
519 */
520 static void pcache1ResizeHash(PCache1 *p){
521 PgHdr1 **apNew;
522 unsigned int nNew;
523 unsigned int i;
524
525 assert( sqlite3_mutex_held(p->pGroup->mutex) );
526
527 nNew = p->nHash*2;
528 if( nNew<256 ){
529 nNew = 256;
530 }
531
532 pcache1LeaveMutex(p->pGroup);
533 if( p->nHash ){ sqlite3BeginBenignMalloc(); }
534 apNew = (PgHdr1 **)sqlite3MallocZero(sizeof(PgHdr1 *)*nNew);
535 if( p->nHash ){ sqlite3EndBenignMalloc(); }
536 pcache1EnterMutex(p->pGroup);
537 if( apNew ){
538 for(i=0; i<p->nHash; i++){
539 PgHdr1 *pPage;
540 PgHdr1 *pNext = p->apHash[i];
541 while( (pPage = pNext)!=0 ){
542 unsigned int h = pPage->iKey % nNew;
543 pNext = pPage->pNext;
544 pPage->pNext = apNew[h];
545 apNew[h] = pPage;
546 }
547 }
548 sqlite3_free(p->apHash);
549 p->apHash = apNew;
550 p->nHash = nNew;
551 }
552 }
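The rehash loop above moves every chained entry into a table of twice the size, with a floor of 256 slots. A standalone sketch, without the mutex handling or benign-malloc bracketing, might look like this. `Node`, `Table`, `tableResize`, `tableInsert`, and `tableFind` are hypothetical names, and `calloc`/`free` stand in for sqlite3MallocZero()/sqlite3_free().

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node Node;
struct Node { unsigned int iKey; Node *pNext; };

typedef struct Table {
  Node **apHash;        /* Array of chain heads */
  unsigned int nHash;   /* Number of slots in apHash[] */
} Table;

/* Double the table (minimum 256 slots) and rehash every entry.
** Returns 1 on success. On allocation failure returns 0 and leaves
** the existing table untouched. */
static int tableResize(Table *t){
  unsigned int nNew = t->nHash*2;
  unsigned int i;
  Node **apNew;
  if( nNew<256 ) nNew = 256;
  apNew = calloc(nNew, sizeof(Node*));
  if( apNew==0 ) return 0;
  for(i=0; i<t->nHash; i++){
    Node *pNext = t->apHash[i];
    Node *p;
    while( (p = pNext)!=0 ){
      unsigned int h = p->iKey % nNew;
      pNext = p->pNext;        /* remember the rest of the old chain */
      p->pNext = apNew[h];     /* push onto the new chain */
      apNew[h] = p;
    }
  }
  free(t->apHash);
  t->apHash = apNew;
  t->nHash = nNew;
  return 1;
}

/* Push a node onto the head of its chain. Requires nHash>0. */
static void tableInsert(Table *t, Node *p){
  unsigned int h = p->iKey % t->nHash;
  p->pNext = t->apHash[h];
  t->apHash[h] = p;
}

/* Look up a node by key, or return NULL. */
static Node *tableFind(const Table *t, unsigned int iKey){
  Node *p;
  if( t->nHash==0 ) return 0;
  for(p=t->apHash[iKey % t->nHash]; p && p->iKey!=iKey; p=p->pNext){}
  return p;
}
```

Keys 1, 257, and 513 share slot 1 in a 256-slot table. After one resize to 512 slots, key 257 moves to its own slot while 1 and 513 stay chained together.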
553
554 /*
555 ** This function is used internally to remove the page pPage from the
556 ** PGroup LRU list, if it is part of it. If pPage is not part of the PGroup
557 ** LRU list, then this function is a no-op.
558 **
559 ** The PGroup mutex must be held when this function is called.
560 */
561 static PgHdr1 *pcache1PinPage(PgHdr1 *pPage){
562 PCache1 *pCache;
563
564 assert( pPage!=0 );
565 assert( pPage->isPinned==0 );
566 pCache = pPage->pCache;
567 assert( pPage->pLruNext );
568 assert( pPage->pLruPrev );
569 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );
570 pPage->pLruPrev->pLruNext = pPage->pLruNext;
571 pPage->pLruNext->pLruPrev = pPage->pLruPrev;
572 pPage->pLruNext = 0;
573 pPage->pLruPrev = 0;
574 pPage->isPinned = 1;
575 assert( pPage->isAnchor==0 );
576 assert( pCache->pGroup->lru.isAnchor==1 );
577 pCache->nRecyclable--;
578 return pPage;
579 }
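The unlink in pcache1PinPage() needs no NULL checks because the LRU list is circular and always contains the anchor element. A small sketch of that pattern follows. `LruNode` and the `lru*` functions are hypothetical, and the head insertion in `lruUnpin` is an assumption, since the unpin routine is not part of this section. Taking the eviction candidate from the anchor's `pLruPrev` does follow the code shown here.

```c
#include <assert.h>
#include <stddef.h>

typedef struct LruNode LruNode;
struct LruNode {
  LruNode *pLruNext, *pLruPrev;
  unsigned char isAnchor;   /* True only for the list anchor */
  unsigned char isPinned;   /* True when the node is off the list */
};

/* An empty list is an anchor that points at itself in both directions. */
static void lruInit(LruNode *anchor){
  anchor->isAnchor = 1;
  anchor->isPinned = 0;
  anchor->pLruNext = anchor->pLruPrev = anchor;
}

/* Assumed policy: a newly unpinned node goes right after the anchor. */
static void lruUnpin(LruNode *anchor, LruNode *p){
  p->pLruPrev = anchor;
  p->pLruNext = anchor->pLruNext;
  anchor->pLruNext->pLruPrev = p;
  anchor->pLruNext = p;
  p->isPinned = 0;
}

/* Unlink a node. Both neighbours always exist, thanks to the anchor. */
static void lruPin(LruNode *p){
  p->pLruPrev->pLruNext = p->pLruNext;
  p->pLruNext->pLruPrev = p->pLruPrev;
  p->pLruNext = p->pLruPrev = 0;
  p->isPinned = 1;
}

/* Eviction candidate: the node just before the anchor, or NULL if empty. */
static LruNode *lruOldest(LruNode *anchor){
  LruNode *p = anchor->pLruPrev;
  return p->isAnchor ? 0 : p;
}
```

The `isAnchor` test in `lruOldest` plays the same role as the `isAnchor==0` loop condition in pcache1EnforceMaxPage().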
580
581
582 /*
583 ** Remove the page supplied as an argument from the hash table
584 ** (PCache1.apHash structure) that it is currently stored in.
585 ** Also free the page if freeFlag is true.
586 **
587 ** The PGroup mutex must be held when this function is called.
588 */
589 static void pcache1RemoveFromHash(PgHdr1 *pPage, int freeFlag){
590 unsigned int h;
591 PCache1 *pCache = pPage->pCache;
592 PgHdr1 **pp;
593
594 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );
595 h = pPage->iKey % pCache->nHash;
596 for(pp=&pCache->apHash[h]; (*pp)!=pPage; pp=&(*pp)->pNext);
597 *pp = (*pp)->pNext;
598
599 pCache->nPage--;
600 if( freeFlag ) pcache1FreePage(pPage);
601 }
602
603 /*
604 ** If there are currently more than nMaxPage pages allocated, try
605 ** to recycle pages to reduce the number allocated to nMaxPage.
606 */
607 static void pcache1EnforceMaxPage(PCache1 *pCache){
608 PGroup *pGroup = pCache->pGroup;
609 PgHdr1 *p;
610 assert( sqlite3_mutex_held(pGroup->mutex) );
611 while( pGroup->nCurrentPage>pGroup->nMaxPage
612 && (p=pGroup->lru.pLruPrev)->isAnchor==0
613 ){
614 assert( p->pCache->pGroup==pGroup );
615 assert( p->isPinned==0 );
616 pcache1PinPage(p);
617 pcache1RemoveFromHash(p, 1);
618 }
619 if( pCache->nPage==0 && pCache->pBulk ){
620 sqlite3_free(pCache->pBulk);
621 pCache->pBulk = pCache->pFree = 0;
622 }
623 }
624
625 /*
626 ** Discard all pages from cache pCache with a page number (key value)
627 ** greater than or equal to iLimit. Any unpinned pages that meet this
628 ** criterion are removed from the LRU list before they are discarded.
629 **
630 ** The PGroup mutex must be held when this function is called.
631 */
632 static void pcache1TruncateUnsafe(
633 PCache1 *pCache, /* The cache to truncate */
634 unsigned int iLimit /* Drop pages with this pgno or larger */
635 ){
636 TESTONLY( int nPage = 0; ) /* To assert pCache->nPage is correct */
637 unsigned int h, iStop;
638 assert( sqlite3_mutex_held(pCache->pGroup->mutex) );
639 assert( pCache->iMaxKey >= iLimit );
640 assert( pCache->nHash > 0 );
641 if( pCache->iMaxKey - iLimit < pCache->nHash ){
642 /* If we are just shaving the last few pages off the end of the
643 ** cache, then there is no point in scanning the entire hash table.
644 ** Only scan those hash slots that might contain pages that need to
645 ** be removed. */
646 h = iLimit % pCache->nHash;
647 iStop = pCache->iMaxKey % pCache->nHash;
648 TESTONLY( nPage = -10; ) /* Disable the pCache->nPage validity check */
649 }else{
650 /* This is the general case where many pages are being removed.
651 ** It is necessary to scan the entire hash table */
652 h = pCache->nHash/2;
653 iStop = h - 1;
654 }
655 for(;;){
656 PgHdr1 **pp;
657 PgHdr1 *pPage;
658 assert( h<pCache->nHash );
659 pp = &pCache->apHash[h];
660 while( (pPage = *pp)!=0 ){
661 if( pPage->iKey>=iLimit ){
662 pCache->nPage--;
663 *pp = pPage->pNext;
664 if( !pPage->isPinned ) pcache1PinPage(pPage);
665 pcache1FreePage(pPage);
666 }else{
667 pp = &pPage->pNext;
668 TESTONLY( if( nPage>=0 ) nPage++; )
669 }
670 }
671 if( h==iStop ) break;
672 h = (h+1) % pCache->nHash;
673 }
674 assert( nPage<0 || pCache->nPage==(unsigned)nPage );
675 }
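The short-scan optimisation relies on the fact that keys in the range iLimit..iMaxKey, when that range spans fewer than nHash values, map to one run of consecutive hash slots (possibly wrapping past the end of the table). The sketch below isolates the range selection. `ScanRange`, `truncateRange`, and `slotsVisited` are hypothetical names, and it assumes iMaxKey>=iLimit and nHash>=256, which matches the asserts and the resize floor in the code above.

```c
#include <assert.h>

typedef struct ScanRange {
  unsigned int h;      /* First slot to scan */
  unsigned int iStop;  /* Last slot to scan (inclusive) */
} ScanRange;

/* Pick the slots to scan when dropping keys >= iLimit. */
static ScanRange truncateRange(
  unsigned int iMaxKey,   /* Largest key ever stored */
  unsigned int iLimit,    /* Drop keys at or above this value */
  unsigned int nHash      /* Number of hash slots */
){
  ScanRange r;
  if( iMaxKey - iLimit < nHash ){
    /* Few keys affected: only their slots can hold matching pages */
    r.h = iLimit % nHash;
    r.iStop = iMaxKey % nHash;
  }else{
    /* General case: start mid-table and wrap around to cover every slot */
    r.h = nHash/2;
    r.iStop = r.h - 1;
  }
  return r;
}

/* Count the slots visited by the wrap-around loop for a given range. */
static unsigned int slotsVisited(ScanRange r, unsigned int nHash){
  unsigned int n = 0;
  unsigned int h = r.h;
  for(;;){
    n++;
    if( h==r.iStop ) break;
    h = (h+1) % nHash;
  }
  return n;
}
```

For a 256-slot table, truncating at key 990 with a maximum key of 1000 scans slots 222 through 232. Truncating at 250 with a maximum key of 260 scans slots 250 through 255 and then 0 through 4.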
676
677 /******************************************************************************/
678 /******** sqlite3_pcache Methods **********************************************/
679
680 /*
681 ** Implementation of the sqlite3_pcache.xInit method.
682 */
683 static int pcache1Init(void *NotUsed){
684 UNUSED_PARAMETER(NotUsed);
685 assert( pcache1.isInit==0 );
686 memset(&pcache1, 0, sizeof(pcache1));
687
688
689 /*
690 ** The pcache1.separateCache variable is true if each PCache has its own
691 ** private PGroup (mode-1). pcache1.separateCache is false if the single
692 ** PGroup in pcache1.grp is used for all page caches (mode-2).
693 **
694 ** * Always use separate caches (mode-1) if SQLITE_SEPARATE_CACHE_POOLS
695 **
696 ** * Always use a unified cache (mode-2) if ENABLE_MEMORY_MANAGEMENT
697 **
698 ** * Use a unified cache in single-threaded applications that have
699 ** configured a start-time buffer for use as page-cache memory using
700 ** sqlite3_config(SQLITE_CONFIG_PAGECACHE, pBuf, sz, N) with non-NULL
701 ** pBuf argument.
702 **
703 ** * Otherwise use separate caches (mode-1)
704 */
705 #ifdef SQLITE_SEPARATE_CACHE_POOLS
706 pcache1.separateCache = 1;
707 #elif defined(SQLITE_ENABLE_MEMORY_MANAGEMENT)
708 pcache1.separateCache = 0;
709 #elif SQLITE_THREADSAFE
710 pcache1.separateCache = sqlite3GlobalConfig.pPage==0
711 || sqlite3GlobalConfig.bCoreMutex>0;
712 #else
713 pcache1.separateCache = sqlite3GlobalConfig.pPage==0;
714 #endif
715
716 #if SQLITE_THREADSAFE
717 if( sqlite3GlobalConfig.bCoreMutex ){
718 pcache1.grp.mutex = sqlite3MutexAlloc(SQLITE_MUTEX_STATIC_LRU);
719 pcache1.mutex = sqlite3MutexAlloc(SQLITE_MUTEX_STATIC_PMEM);
720 }
721 #endif
722 if( pcache1.separateCache
723 && sqlite3GlobalConfig.nPage!=0
724 && sqlite3GlobalConfig.pPage==0
725 ){
726 pcache1.nInitPage = sqlite3GlobalConfig.nPage;
727 }else{
728 pcache1.nInitPage = 0;
729 }
730 pcache1.grp.mxPinned = 10;
731 pcache1.isInit = 1;
732 return SQLITE_OK;
733 }
734
735 /*
736 ** Implementation of the sqlite3_pcache.xShutdown method.
737 ** Note that the static mutex allocated in xInit does
738 ** not need to be freed.
739 */
740 static void pcache1Shutdown(void *NotUsed){
741 UNUSED_PARAMETER(NotUsed);
742 assert( pcache1.isInit!=0 );
743 memset(&pcache1, 0, sizeof(pcache1));
744 }
745
746 /* forward declaration */
747 static void pcache1Destroy(sqlite3_pcache *p);
748
749 /*
750 ** Implementation of the sqlite3_pcache.xCreate method.
751 **
752 ** Allocate a new cache.
753 */
754 static sqlite3_pcache *pcache1Create(int szPage, int szExtra, int bPurgeable){
755 PCache1 *pCache; /* The newly created page cache */
756 PGroup *pGroup; /* The group the new page cache will belong to */
757 int sz; /* Bytes of memory required to allocate the new cache */
758
759 assert( (szPage & (szPage-1))==0 && szPage>=512 && szPage<=65536 );
760 assert( szExtra < 300 );
761
762 sz = sizeof(PCache1) + sizeof(PGroup)*pcache1.separateCache;
763 pCache = (PCache1 *)sqlite3MallocZero(sz);
764 if( pCache ){
765 if( pcache1.separateCache ){
766 pGroup = (PGroup*)&pCache[1];
767 pGroup->mxPinned = 10;
768 }else{
769 pGroup = &pcache1.grp;
770 }
771 if( pGroup->lru.isAnchor==0 ){
772 pGroup->lru.isAnchor = 1;
773 pGroup->lru.pLruPrev = pGroup->lru.pLruNext = &pGroup->lru;
774 }
775 pCache->pGroup = pGroup;
776 pCache->szPage = szPage;
777 pCache->szExtra = szExtra;
778 pCache->szAlloc = szPage + szExtra + ROUND8(sizeof(PgHdr1));
779 pCache->bPurgeable = (bPurgeable ? 1 : 0);
780 pcache1EnterMutex(pGroup);
781 pcache1ResizeHash(pCache);
782 if( bPurgeable ){
783 pCache->nMin = 10;
784 pGroup->nMinPage += pCache->nMin;
785 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;
786 }
787 pcache1LeaveMutex(pGroup);
788 if( pCache->nHash==0 ){
789 pcache1Destroy((sqlite3_pcache*)pCache);
790 pCache = 0;
791 }
792 }
793 return (sqlite3_pcache *)pCache;
794 }
795
796 /*
797 ** Implementation of the sqlite3_pcache.xCachesize method.
798 **
799 ** Configure the cache_size limit for a cache.
800 */
801 static void pcache1Cachesize(sqlite3_pcache *p, int nMax){
802 PCache1 *pCache = (PCache1 *)p;
803 if( pCache->bPurgeable ){
804 PGroup *pGroup = pCache->pGroup;
805 pcache1EnterMutex(pGroup);
806 pGroup->nMaxPage += (nMax - pCache->nMax);
807 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;
808 pCache->nMax = nMax;
809 pCache->n90pct = pCache->nMax*9/10;
810 pcache1EnforceMaxPage(pCache);
811 pcache1LeaveMutex(pGroup);
812 }
813 }
814
815 /*
816 ** Implementation of the sqlite3_pcache.xShrink method.
817 **
818 ** Free up as much memory as possible.
819 */
820 static void pcache1Shrink(sqlite3_pcache *p){
821 PCache1 *pCache = (PCache1*)p;
822 if( pCache->bPurgeable ){
823 PGroup *pGroup = pCache->pGroup;
824 int savedMaxPage;
825 pcache1EnterMutex(pGroup);
826 savedMaxPage = pGroup->nMaxPage;
827 pGroup->nMaxPage = 0;
828 pcache1EnforceMaxPage(pCache);
829 pGroup->nMaxPage = savedMaxPage;
830 pcache1LeaveMutex(pGroup);
831 }
832 }
833
834 /*
835 ** Implementation of the sqlite3_pcache.xPagecount method.
836 */
837 static int pcache1Pagecount(sqlite3_pcache *p){
838 int n;
839 PCache1 *pCache = (PCache1*)p;
840 pcache1EnterMutex(pCache->pGroup);
841 n = pCache->nPage;
842 pcache1LeaveMutex(pCache->pGroup);
843 return n;
844 }
845
846
847 /*
848 ** Implement steps 3, 4, and 5 of the pcache1Fetch() algorithm described
849 ** in the header of the pcache1Fetch() procedure.
850 **
851 ** These steps are broken out into a separate procedure because they are
852 ** usually not needed, and by avoiding the stack initialization required
853 ** for these steps, the main pcache1Fetch() procedure can run faster.
854 */
855 static SQLITE_NOINLINE PgHdr1 *pcache1FetchStage2(
856 PCache1 *pCache,
857 unsigned int iKey,
858 int createFlag
859 ){
860 unsigned int nPinned;
861 PGroup *pGroup = pCache->pGroup;
862 PgHdr1 *pPage = 0;
863
864 /* Step 3: Abort if createFlag is 1 but the cache is nearly full */
865 assert( pCache->nPage >= pCache->nRecyclable );
866 nPinned = pCache->nPage - pCache->nRecyclable;
867 assert( pGroup->mxPinned == pGroup->nMaxPage + 10 - pGroup->nMinPage );
868 assert( pCache->n90pct == pCache->nMax*9/10 );
869 if( createFlag==1 && (
870 nPinned>=pGroup->mxPinned
871 || nPinned>=pCache->n90pct
872 || (pcache1UnderMemoryPressure(pCache) && pCache->nRecyclable<nPinned)
873 )){
874 return 0;
875 }
876
877 if( pCache->nPage>=pCache->nHash ) pcache1ResizeHash(pCache);
878 assert( pCache->nHash>0 && pCache->apHash );
879
880 /* Step 4. Try to recycle a page. */
881 if( pCache->bPurgeable
882 && !pGroup->lru.pLruPrev->isAnchor
883 && ((pCache->nPage+1>=pCache->nMax) || pcache1UnderMemoryPressure(pCache))
884 ){
885 PCache1 *pOther;
886 pPage = pGroup->lru.pLruPrev;
887 assert( pPage->isPinned==0 );
888 pcache1RemoveFromHash(pPage, 0);
889 pcache1PinPage(pPage);
890 pOther = pPage->pCache;
891 if( pOther->szAlloc != pCache->szAlloc ){
892 pcache1FreePage(pPage);
893 pPage = 0;
894 }else{
895 pGroup->nCurrentPage -= (pOther->bPurgeable - pCache->bPurgeable);
896 }
897 }
898
899 /* Step 5. If a usable page buffer has still not been found,
900 ** attempt to allocate a new one.
901 */
902 if( !pPage ){
903 pPage = pcache1AllocPage(pCache, createFlag==1);
904 }
905
906 if( pPage ){
907 unsigned int h = iKey % pCache->nHash;
908 pCache->nPage++;
909 pPage->iKey = iKey;
910 pPage->pNext = pCache->apHash[h];
911 pPage->pCache = pCache;
912 pPage->pLruPrev = 0;
913 pPage->pLruNext = 0;
914 pPage->isPinned = 1;
915 *(void **)pPage->page.pExtra = 0;
916 pCache->apHash[h] = pPage;
917 if( iKey>pCache->iMaxKey ){
918 pCache->iMaxKey = iKey;
919 }
920 }
921 return pPage;
922 }
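
Reviewer aid, not part of the SQLite source: the Step 3 admission test in pcache1FetchStage2() can be read in isolation as a small predicate. The sketch below assumes a hypothetical `CacheCounts` snapshot and a hypothetical `should_refuse_create()` helper, and it takes the memory-pressure signal as a plain flag instead of calling pcache1UnderMemoryPressure().

```c
#include <assert.h>

/* Hypothetical snapshot of the counters that Step 3 consults. */
typedef struct CacheCounts {
  unsigned int nPage;        /* Pages currently held by this cache */
  unsigned int nRecyclable;  /* Unpinned pages sitting on the LRU list */
  unsigned int nMax;         /* Configured size of this cache */
  unsigned int mxPinned;     /* Group limit: nMaxPage + 10 - nMinPage */
} CacheCounts;

/* Return 1 if a createFlag==1 request should be refused, else 0.
** underPressure stands in for pcache1UnderMemoryPressure(). */
static int should_refuse_create(
  const CacheCounts *c,
  int createFlag,
  int underPressure
){
  unsigned int nPinned;
  unsigned int n90pct;
  assert( c->nPage >= c->nRecyclable );
  nPinned = c->nPage - c->nRecyclable;
  n90pct = c->nMax*9/10;
  if( createFlag!=1 ) return 0;   /* createFlag==2 always tries harder */
  return nPinned>=c->mxPinned
      || nPinned>=n90pct
      || (underPressure && c->nRecyclable<nPinned);
}
```

With nPage=100, nRecyclable=20 and nMax=100, 80 pages are pinned, which is below the 90% threshold, so the request is refused only when the pressure flag is set.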
923
924 /*
925 ** Implementation of the sqlite3_pcache.xFetch method.
926 **
927 ** Fetch a page by key value.
928 **
929 ** Whether or not a new page may be allocated by this function depends on
930 ** the value of the createFlag argument. 0 means do not allocate a new
931 ** page. 1 means allocate a new page if space is easily available. 2
932 ** means to try really hard to allocate a new page.
933 **
934 ** For a non-purgeable cache (a cache used as the storage for an in-memory
935 ** database) there is really no difference between createFlag 1 and 2. So
936 ** the calling function (pcache.c) will never have a createFlag of 1 on
937 ** a non-purgeable cache.
938 **
939 ** There are three different approaches to obtaining space for a page,
940 ** depending on the value of parameter createFlag (which may be 0, 1 or 2).
941 **
942 ** 1. Regardless of the value of createFlag, the cache is searched for a
943 ** copy of the requested page. If one is found, it is returned.
944 **
945 ** 2. If createFlag==0 and the page is not already in the cache, NULL is
946 ** returned.
947 **
948 ** 3. If createFlag is 1, and the page is not already in the cache, then
949 ** return NULL (do not allocate a new page) if any of the following
950 ** conditions are true:
951 **
952 ** (a) the number of pages pinned by the cache is greater than
953 ** PCache1.nMax, or
954 **
955 ** (b) the number of pages pinned by the cache is greater than
956 ** the sum of nMax for all purgeable caches, less the sum of
957 ** nMin for all other purgeable caches.
958 **
959 ** 4. If none of the first three conditions apply and the cache is marked
960 ** as purgeable, and if one of the following is true:
961 **
962 ** (a) The number of pages allocated for the cache is already
963 ** PCache1.nMax, or
964 **
965 ** (b) The number of pages allocated for all purgeable caches is
966 ** already equal to or greater than the sum of nMax for all
967 ** purgeable caches, or
968 **
969 ** (c) The system is under memory pressure and wants to avoid
970 ** unnecessary page cache entry allocations
971 **
972 ** then attempt to recycle a page from the LRU list. If it is the right
973 ** size, return the recycled buffer. Otherwise, free the buffer and
974 ** proceed to step 5.
975 **
976 ** 5. Otherwise, allocate and return a new page buffer.
977 **
978 ** There are two versions of this routine. pcache1FetchWithMutex() is
979 ** the general case. pcache1FetchNoMutex() is a faster implementation for
980 ** the common case where pGroup->mutex is NULL. The pcache1Fetch() wrapper
981 ** invokes the appropriate routine.
982 */
983 static PgHdr1 *pcache1FetchNoMutex(
984 sqlite3_pcache *p,
985 unsigned int iKey,
986 int createFlag
987 ){
988 PCache1 *pCache = (PCache1 *)p;
989 PgHdr1 *pPage = 0;
990
991 /* Step 1: Search the hash table for an existing entry. */
992 pPage = pCache->apHash[iKey % pCache->nHash];
993 while( pPage && pPage->iKey!=iKey ){ pPage = pPage->pNext; }
994
995 /* Step 2: If the page was found in the hash table, then return it.
996 ** If the page was not in the hash table and createFlag is 0, abort.
997 ** Otherwise (page not in hash and createFlag!=0) continue with
998 ** subsequent steps to try to create the page. */
999 if( pPage ){
1000 if( !pPage->isPinned ){
1001 return pcache1PinPage(pPage);
1002 }else{
1003 return pPage;
1004 }
1005 }else if( createFlag ){
1006 /* Steps 3, 4, and 5 implemented by this subroutine */
1007 return pcache1FetchStage2(pCache, iKey, createFlag);
1008 }else{
1009 return 0;
1010 }
1011 }
1012 #if PCACHE1_MIGHT_USE_GROUP_MUTEX
1013 static PgHdr1 *pcache1FetchWithMutex(
1014 sqlite3_pcache *p,
1015 unsigned int iKey,
1016 int createFlag
1017 ){
1018 PCache1 *pCache = (PCache1 *)p;
1019 PgHdr1 *pPage;
1020
1021 pcache1EnterMutex(pCache->pGroup);
1022 pPage = pcache1FetchNoMutex(p, iKey, createFlag);
1023 assert( pPage==0 || pCache->iMaxKey>=iKey );
1024 pcache1LeaveMutex(pCache->pGroup);
1025 return pPage;
1026 }
1027 #endif
1028 static sqlite3_pcache_page *pcache1Fetch(
1029 sqlite3_pcache *p,
1030 unsigned int iKey,
1031 int createFlag
1032 ){
1033 #if PCACHE1_MIGHT_USE_GROUP_MUTEX || defined(SQLITE_DEBUG)
1034 PCache1 *pCache = (PCache1 *)p;
1035 #endif
1036
1037 assert( offsetof(PgHdr1,page)==0 );
1038 assert( pCache->bPurgeable || createFlag!=1 );
1039 assert( pCache->bPurgeable || pCache->nMin==0 );
1040 assert( pCache->bPurgeable==0 || pCache->nMin==10 );
1041 assert( pCache->nMin==0 || pCache->bPurgeable );
1042 assert( pCache->nHash>0 );
1043 #if PCACHE1_MIGHT_USE_GROUP_MUTEX
1044 if( pCache->pGroup->mutex ){
1045 return (sqlite3_pcache_page*)pcache1FetchWithMutex(p, iKey, createFlag);
1046 }else
1047 #endif
1048 {
1049 return (sqlite3_pcache_page*)pcache1FetchNoMutex(p, iKey, createFlag);
1050 }
1051 }
1052
1053
1054 /*
1055 ** Implementation of the sqlite3_pcache.xUnpin method.
1056 **
1057 ** Mark a page as unpinned (eligible for asynchronous recycling).
1058 */
1059 static void pcache1Unpin(
1060 sqlite3_pcache *p,
1061 sqlite3_pcache_page *pPg,
1062 int reuseUnlikely
1063 ){
1064 PCache1 *pCache = (PCache1 *)p;
1065 PgHdr1 *pPage = (PgHdr1 *)pPg;
1066 PGroup *pGroup = pCache->pGroup;
1067
1068 assert( pPage->pCache==pCache );
1069 pcache1EnterMutex(pGroup);
1070
1071 /* It is an error to call this function if the page is already
1072 ** part of the PGroup LRU list.
1073 */
1074 assert( pPage->pLruPrev==0 && pPage->pLruNext==0 );
1075 assert( pPage->isPinned==1 );
1076
1077 if( reuseUnlikely || pGroup->nCurrentPage>pGroup->nMaxPage ){
1078 pcache1RemoveFromHash(pPage, 1);
1079 }else{
1080 /* Add the page to the PGroup LRU list. */
1081 PgHdr1 **ppFirst = &pGroup->lru.pLruNext;
1082 pPage->pLruPrev = &pGroup->lru;
1083 (pPage->pLruNext = *ppFirst)->pLruPrev = pPage;
1084 *ppFirst = pPage;
1085 pCache->nRecyclable++;
1086 pPage->isPinned = 0;
1087 }
1088
1089 pcache1LeaveMutex(pCache->pGroup);
1090 }
1091
1092 /*
1093 ** Implementation of the sqlite3_pcache.xRekey method.
1094 */
1095 static void pcache1Rekey(
1096 sqlite3_pcache *p,
1097 sqlite3_pcache_page *pPg,
1098 unsigned int iOld,
1099 unsigned int iNew
1100 ){
1101 PCache1 *pCache = (PCache1 *)p;
1102 PgHdr1 *pPage = (PgHdr1 *)pPg;
1103 PgHdr1 **pp;
1104 unsigned int h;
1105 assert( pPage->iKey==iOld );
1106 assert( pPage->pCache==pCache );
1107
1108 pcache1EnterMutex(pCache->pGroup);
1109
1110 h = iOld%pCache->nHash;
1111 pp = &pCache->apHash[h];
1112 while( (*pp)!=pPage ){
1113 pp = &(*pp)->pNext;
1114 }
1115 *pp = pPage->pNext;
1116
1117 h = iNew%pCache->nHash;
1118 pPage->iKey = iNew;
1119 pPage->pNext = pCache->apHash[h];
1120 pCache->apHash[h] = pPage;
1121 if( iNew>pCache->iMaxKey ){
1122 pCache->iMaxKey = iNew;
1123 }
1124
1125 pcache1LeaveMutex(pCache->pGroup);
1126 }
1127
1128 /*
1129 ** Implementation of the sqlite3_pcache.xTruncate method.
1130 **
1131 ** Discard all unpinned pages in the cache with a page number equal to
1132 ** or greater than parameter iLimit. Any pinned pages with a page number
1133 ** equal to or greater than iLimit are implicitly unpinned.
1134 */
1135 static void pcache1Truncate(sqlite3_pcache *p, unsigned int iLimit){
1136 PCache1 *pCache = (PCache1 *)p;
1137 pcache1EnterMutex(pCache->pGroup);
1138 if( iLimit<=pCache->iMaxKey ){
1139 pcache1TruncateUnsafe(pCache, iLimit);
1140 pCache->iMaxKey = iLimit-1;
1141 }
1142 pcache1LeaveMutex(pCache->pGroup);
1143 }
1144
1145 /*
1146 ** Implementation of the sqlite3_pcache.xDestroy method.
1147 **
1148 ** Destroy a cache allocated using pcache1Create().
1149 */
1150 static void pcache1Destroy(sqlite3_pcache *p){
1151 PCache1 *pCache = (PCache1 *)p;
1152 PGroup *pGroup = pCache->pGroup;
1153 assert( pCache->bPurgeable || (pCache->nMax==0 && pCache->nMin==0) );
1154 pcache1EnterMutex(pGroup);
1155 if( pCache->nPage ) pcache1TruncateUnsafe(pCache, 0);
1156 assert( pGroup->nMaxPage >= pCache->nMax );
1157 pGroup->nMaxPage -= pCache->nMax;
1158 assert( pGroup->nMinPage >= pCache->nMin );
1159 pGroup->nMinPage -= pCache->nMin;
1160 pGroup->mxPinned = pGroup->nMaxPage + 10 - pGroup->nMinPage;
1161 pcache1EnforceMaxPage(pCache);
1162 pcache1LeaveMutex(pGroup);
1163 sqlite3_free(pCache->pBulk);
1164 sqlite3_free(pCache->apHash);
1165 sqlite3_free(pCache);
1166 }
1167
1168 /*
1169 ** This function is called during initialization (sqlite3_initialize()) to
1170 ** install the default pluggable cache module, assuming the user has not
1171 ** already provided an alternative.
1172 */
1173 SQLITE_PRIVATE void sqlite3PCacheSetDefault(void){
1174 static const sqlite3_pcache_methods2 defaultMethods = {
1175 1, /* iVersion */
1176 0, /* pArg */
1177 pcache1Init, /* xInit */
1178 pcache1Shutdown, /* xShutdown */
1179 pcache1Create, /* xCreate */
1180 pcache1Cachesize, /* xCachesize */
1181 pcache1Pagecount, /* xPagecount */
1182 pcache1Fetch, /* xFetch */
1183 pcache1Unpin, /* xUnpin */
1184 pcache1Rekey, /* xRekey */
1185 pcache1Truncate, /* xTruncate */
1186 pcache1Destroy, /* xDestroy */
1187 pcache1Shrink /* xShrink */
1188 };
1189 sqlite3_config(SQLITE_CONFIG_PCACHE2, &defaultMethods);
1190 }
1191
1192 /*
1193 ** Return the size of the header on each page of this PCACHE implementation.
1194 */
1195 SQLITE_PRIVATE int sqlite3HeaderSizePcache1(void){ return ROUND8(sizeof(PgHdr1)) ; }
1196
1197 /*
1198 ** Return the global mutex used by this PCACHE implementation. The
1199 ** sqlite3_status() routine needs access to this mutex.
1200 */
1201 SQLITE_PRIVATE sqlite3_mutex *sqlite3Pcache1Mutex(void){
1202 return pcache1.mutex;
1203 }
1204
1205 #ifdef SQLITE_ENABLE_MEMORY_MANAGEMENT
1206 /*
1207 ** This function is called to free superfluous dynamically allocated memory
1208 ** held by the pager system. Memory in use by any SQLite pager allocated
1209 ** by the current thread may be sqlite3_free()ed.
1210 **
1211 ** nReq is the number of bytes of memory required. Once this much has
1212 ** been released, the function returns. The return value is the total number
1213 ** of bytes of memory released.
1214 */
1215 SQLITE_PRIVATE int sqlite3PcacheReleaseMemory(int nReq){
1216 int nFree = 0;
1217 assert( sqlite3_mutex_notheld(pcache1.grp.mutex) );
1218 assert( sqlite3_mutex_notheld(pcache1.mutex) );
1219 if( sqlite3GlobalConfig.nPage==0 ){
1220 PgHdr1 *p;
1221 pcache1EnterMutex(&pcache1.grp);
1222 while( (nReq<0 || nFree<nReq)
1223 && (p=pcache1.grp.lru.pLruPrev)!=0
1224 && p->isAnchor==0
1225 ){
1226 nFree += pcache1MemSize(p->page.pBuf);
1227 #ifdef SQLITE_PCACHE_SEPARATE_HEADER
1228 nFree += sqlite3MemSize(p);
1229 #endif
1230 assert( p->isPinned==0 );
1231 pcache1PinPage(p);
1232 pcache1RemoveFromHash(p, 1);
1233 }
1234 pcache1LeaveMutex(&pcache1.grp);
1235 }
1236 return nFree;
1237 }
1238 #endif /* SQLITE_ENABLE_MEMORY_MANAGEMENT */
1239
1240 #ifdef SQLITE_TEST
1241 /*
1242 ** This function is used by test procedures to inspect the internal state
1243 ** of the global cache.
1244 */
1245 SQLITE_PRIVATE void sqlite3PcacheStats(
1246 int *pnCurrent, /* OUT: Total number of pages cached */
1247 int *pnMax, /* OUT: Global maximum cache size */
1248 int *pnMin, /* OUT: Sum of PCache1.nMin for purgeable caches */
1249 int *pnRecyclable /* OUT: Total number of pages available for recycling */
1250 ){
1251 PgHdr1 *p;
1252 int nRecyclable = 0;
1253 for(p=pcache1.grp.lru.pLruNext; p && !p->isAnchor; p=p->pLruNext){
1254 assert( p->isPinned==0 );
1255 nRecyclable++;
1256 }
1257 *pnCurrent = pcache1.grp.nCurrentPage;
1258 *pnMax = (int)pcache1.grp.nMaxPage;
1259 *pnMin = (int)pcache1.grp.nMinPage;
1260 *pnRecyclable = nRecyclable;
1261 }
1262 #endif
1263
1264 /************** End of pcache1.c *********************************************/
1265 /************** Begin file rowset.c ******************************************/
1266 /*
1267 ** 2008 December 3
1268 **
1269 ** The author disclaims copyright to this source code. In place of
1270 ** a legal notice, here is a blessing:
1271 **
1272 ** May you do good and not evil.
1273 ** May you find forgiveness for yourself and forgive others.
1274 ** May you share freely, never taking more than you give.
1275 **
1276 *************************************************************************
1277 **
1278 ** This module implements an object we call a "RowSet".
1279 **
1280 ** The RowSet object is a collection of rowids. Rowids
1281 ** are inserted into the RowSet in an arbitrary order. Inserts
1282 ** can be intermixed with tests to see if a given rowid has been
1283 ** previously inserted into the RowSet.
1284 **
1285 ** After all inserts are finished, it is possible to extract the
1286 ** elements of the RowSet in sorted order. Once this extraction
1287 ** process has started, no new elements may be inserted.
1288 **
1289 ** Hence, the primitive operations for a RowSet are:
1290 **
1291 ** CREATE
1292 ** INSERT
1293 ** TEST
1294 ** SMALLEST
1295 ** DESTROY
1296 **
1297 ** The CREATE and DESTROY primitives are the constructor and destructor,
1298 ** obviously. The INSERT primitive adds a new element to the RowSet.
1299 ** TEST checks to see if an element is already in the RowSet. SMALLEST
1300 ** extracts the least value from the RowSet.
1301 **
1302 ** The INSERT primitive might allocate additional memory. Memory is
1303 ** allocated in chunks so most INSERTs do no allocation. There is an
1304 ** upper bound on the size of allocated memory. No memory is freed
1305 ** until DESTROY.
1306 **
1307 ** The TEST primitive includes a "batch" number. The TEST primitive
1308 ** will only see elements that were inserted before the last change
1309 ** in the batch number. In other words, if an INSERT occurs between
1310 ** two TESTs where the TESTs have the same batch number, then the
1311 ** value added by the INSERT will not be visible to the second TEST.
1312 ** The initial batch number is zero, so if the very first TEST contains
1313 ** a non-zero batch number, it will see all prior INSERTs.
1314 **
1315 ** No INSERTs may occur after a SMALLEST. An assertion will fail if
1316 ** that is attempted.
1317 **
1318 ** The cost of an INSERT is roughly constant. (Sometimes new memory
1319 ** has to be allocated on an INSERT.) The cost of a TEST with a new
1320 ** batch number is O(NlogN) where N is the number of elements in the RowSet.
1321 ** The cost of a TEST using the same batch number is O(logN). The cost
1322 ** of the first SMALLEST is O(NlogN). Second and subsequent SMALLEST
1323 ** primitives are constant time. The cost of DESTROY is O(N).
1324 **
1325 ** TEST and SMALLEST may not be used by the same RowSet. This used to
1326 ** be possible, but the feature was not used, so it was removed in order
1327 ** to simplify the code.
1328 */
1329 /* #include "sqliteInt.h" */
1330
1331
1332 /*
1333 ** Target size for allocation chunks.
1334 */
1335 #define ROWSET_ALLOCATION_SIZE 1024
1336
1337 /*
1338 ** The number of rowset entries per allocation chunk.
1339 */
1340 #define ROWSET_ENTRY_PER_CHUNK \
1341 ((ROWSET_ALLOCATION_SIZE-8)/sizeof(struct RowSetEntry))
1342
1343 /*
1344 ** Each entry in a RowSet is an instance of the following object.
1345 **
1346 ** This same object is reused to store a linked list of trees of RowSetEntry
1347 ** objects. In that alternative use, pRight points to the next entry
1348 ** in the list, pLeft points to the tree, and v is unused. The
1349 ** RowSet.pForest value points to the head of this forest list.
1350 */
1351 struct RowSetEntry {
1352 i64 v; /* ROWID value for this entry */
1353 struct RowSetEntry *pRight; /* Right subtree (larger entries) or list */
1354 struct RowSetEntry *pLeft; /* Left subtree (smaller entries) */
1355 };
1356
1357 /*
1358 ** RowSetEntry objects are allocated in large chunks (instances of the
1359 ** following structure) to reduce memory allocation overhead. The
1360 ** chunks are kept on a linked list so that they can be deallocated
1361 ** when the RowSet is destroyed.
1362 */
1363 struct RowSetChunk {
1364 struct RowSetChunk *pNextChunk; /* Next chunk on list of them all */
1365 struct RowSetEntry aEntry[ROWSET_ENTRY_PER_CHUNK]; /* Allocated entries */
1366 };
1367
1368 /*
1369 ** A RowSet is an instance of the following structure.
1370 **
1371 ** A typedef of this structure is found in sqliteInt.h.
1372 */
1373 struct RowSet {
1374 struct RowSetChunk *pChunk; /* List of all chunk allocations */
1375 sqlite3 *db; /* The database connection */
1376 struct RowSetEntry *pEntry; /* List of entries using pRight */
1377 struct RowSetEntry *pLast; /* Last entry on the pEntry list */
1378 struct RowSetEntry *pFresh; /* Source of new entry objects */
1379 struct RowSetEntry *pForest; /* List of binary trees of entries */
1380 u16 nFresh; /* Number of objects on pFresh */
1381 u16 rsFlags; /* Various flags */
1382 int iBatch; /* Current insert batch */
1383 };
1384
1385 /*
1386 ** Allowed values for RowSet.rsFlags
1387 */
1388 #define ROWSET_SORTED 0x01 /* True if RowSet.pEntry is sorted */
1389 #define ROWSET_NEXT 0x02 /* True if sqlite3RowSetNext() has been called */
1390
1391 /*
1392 ** Turn bulk memory into a RowSet object. N bytes of memory
1393 ** are available at pSpace. The db pointer is used as a memory context
1394 ** for any subsequent allocations that need to occur.
1395 ** Return a pointer to the new RowSet object.
1396 **
1397 ** It must be the case that N is sufficient to make a RowSet. If not
1398 ** an assertion fault occurs.
1399 **
1400 ** If N is larger than the minimum, use the surplus as an initial
1401 ** allocation of entries available to be filled.
1402 */
1403 SQLITE_PRIVATE RowSet *sqlite3RowSetInit(sqlite3 *db, void *pSpace, unsigned int N){
1404 RowSet *p;
1405 assert( N >= ROUND8(sizeof(*p)) );
1406 p = pSpace;
1407 p->pChunk = 0;
1408 p->db = db;
1409 p->pEntry = 0;
1410 p->pLast = 0;
1411 p->pForest = 0;
1412 p->pFresh = (struct RowSetEntry*)(ROUND8(sizeof(*p)) + (char*)p);
1413 p->nFresh = (u16)((N - ROUND8(sizeof(*p)))/sizeof(struct RowSetEntry));
1414 p->rsFlags = ROWSET_SORTED;
1415 p->iBatch = 0;
1416 return p;
1417 }
1418
1419 /*
1420 ** Deallocate all chunks from a RowSet. This frees all memory that
1421 ** the RowSet has allocated over its lifetime. This routine is
1422 ** the destructor for the RowSet.
1423 */
1424 SQLITE_PRIVATE void sqlite3RowSetClear(RowSet *p){
1425 struct RowSetChunk *pChunk, *pNextChunk;
1426 for(pChunk=p->pChunk; pChunk; pChunk = pNextChunk){
1427 pNextChunk = pChunk->pNextChunk;
1428 sqlite3DbFree(p->db, pChunk);
1429 }
1430 p->pChunk = 0;
1431 p->nFresh = 0;
1432 p->pEntry = 0;
1433 p->pLast = 0;
1434 p->pForest = 0;
1435 p->rsFlags = ROWSET_SORTED;
1436 }
1437
1438 /*
1439 ** Allocate a new RowSetEntry object that is associated with the
1440 ** given RowSet. Return a pointer to the new and completely uninitialized
1441 ** object.
1442 **
1443 ** In an OOM situation, the RowSet.db->mallocFailed flag is set and this
1444 ** routine returns NULL.
1445 */
1446 static struct RowSetEntry *rowSetEntryAlloc(RowSet *p){
1447 assert( p!=0 );
1448 if( p->nFresh==0 ){ /*OPTIMIZATION-IF-FALSE*/
1449 /* We could allocate a fresh RowSetEntry each time one is needed, but it
1450 ** is more efficient to pull a preallocated entry from the pool */
1451 struct RowSetChunk *pNew;
1452 pNew = sqlite3DbMallocRawNN(p->db, sizeof(*pNew));
1453 if( pNew==0 ){
1454 return 0;
1455 }
1456 pNew->pNextChunk = p->pChunk;
1457 p->pChunk = pNew;
1458 p->pFresh = pNew->aEntry;
1459 p->nFresh = ROWSET_ENTRY_PER_CHUNK;
1460 }
1461 p->nFresh--;
1462 return p->pFresh++;
1463 }
1464
1465 /*
1466 ** Insert a new value into a RowSet.
1467 **
1468 ** The mallocFailed flag of the database connection is set if a
1469 ** memory allocation fails.
1470 */
1471 SQLITE_PRIVATE void sqlite3RowSetInsert(RowSet *p, i64 rowid){
1472 struct RowSetEntry *pEntry; /* The new entry */
1473 struct RowSetEntry *pLast; /* The last prior entry */
1474
1475 /* This routine is never called after sqlite3RowSetNext() */
1476 assert( p!=0 && (p->rsFlags & ROWSET_NEXT)==0 );
1477
1478 pEntry = rowSetEntryAlloc(p);
1479 if( pEntry==0 ) return;
1480 pEntry->v = rowid;
1481 pEntry->pRight = 0;
1482 pLast = p->pLast;
1483 if( pLast ){
1484 if( rowid<=pLast->v ){ /*OPTIMIZATION-IF-FALSE*/
1485 /* Avoid unnecessary sorts by preserving the ROWSET_SORTED flag
1486 ** where possible */
1487 p->rsFlags &= ~ROWSET_SORTED;
1488 }
1489 pLast->pRight = pEntry;
1490 }else{
1491 p->pEntry = pEntry;
1492 }
1493 p->pLast = pEntry;
1494 }
1495
1496 /*
1497 ** Merge two lists of RowSetEntry objects. Remove duplicates.
1498 **
1499 ** The input lists are connected via pRight pointers and are
1500 ** assumed to each already be in sorted order.
1501 */
1502 static struct RowSetEntry *rowSetEntryMerge(
1503 struct RowSetEntry *pA, /* First sorted list to be merged */
1504 struct RowSetEntry *pB /* Second sorted list to be merged */
1505 ){
1506 struct RowSetEntry head;
1507 struct RowSetEntry *pTail;
1508
1509 pTail = &head;
1510 assert( pA!=0 && pB!=0 );
1511 for(;;){
1512 assert( pA->pRight==0 || pA->v<=pA->pRight->v );
1513 assert( pB->pRight==0 || pB->v<=pB->pRight->v );
1514 if( pA->v<=pB->v ){
1515 if( pA->v<pB->v ) pTail = pTail->pRight = pA;
1516 pA = pA->pRight;
1517 if( pA==0 ){
1518 pTail->pRight = pB;
1519 break;
1520 }
1521 }else{
1522 pTail = pTail->pRight = pB;
1523 pB = pB->pRight;
1524 if( pB==0 ){
1525 pTail->pRight = pA;
1526 break;
1527 }
1528 }
1529 }
1530 return head.pRight;
1531 }
1532
1533 /*
1534 ** Sort all elements on the list of RowSetEntry objects into order of
1535 ** increasing v.
1536 */
1537 static struct RowSetEntry *rowSetEntrySort(struct RowSetEntry *pIn){
1538 unsigned int i;
1539 struct RowSetEntry *pNext, *aBucket[40];
1540
1541 memset(aBucket, 0, sizeof(aBucket));
1542 while( pIn ){
1543 pNext = pIn->pRight;
1544 pIn->pRight = 0;
1545 for(i=0; aBucket[i]; i++){
1546 pIn = rowSetEntryMerge(aBucket[i], pIn);
1547 aBucket[i] = 0;
1548 }
1549 aBucket[i] = pIn;
1550 pIn = pNext;
1551 }
1552 pIn = aBucket[0];
1553 for(i=1; i<sizeof(aBucket)/sizeof(aBucket[0]); i++){
1554 if( aBucket[i]==0 ) continue;
1555 pIn = pIn ? rowSetEntryMerge(pIn, aBucket[i]) : aBucket[i];
1556 }
1557 return pIn;
1558 }
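
Reviewer aid, not part of the SQLite source: rowSetEntryMerge() and rowSetEntrySort() together form a bottom-up merge sort that also removes duplicates. The standalone sketch below mirrors that structure on a hypothetical `Node` type; the names `merge_unique` and `sort_list` are illustrative only. Slot i of the bucket array holds a sorted run built from up to 2^i inputs, so 40 slots cover any list that fits in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list node; next plays the role of RowSetEntry.pRight. */
typedef struct Node {
  long long v;
  struct Node *next;
} Node;

/* Merge two non-empty sorted lists, keeping one copy of equal values. */
static Node *merge_unique(Node *a, Node *b){
  Node head;
  Node *tail = &head;
  assert( a!=0 && b!=0 );
  for(;;){
    if( a->v<=b->v ){
      if( a->v<b->v ){        /* On a tie, skip a; b carries the value */
        tail->next = a;
        tail = a;
      }
      a = a->next;
      if( a==0 ){ tail->next = b; break; }
    }else{
      tail->next = b;
      tail = b;
      b = b->next;
      if( b==0 ){ tail->next = a; break; }
    }
  }
  return head.next;
}

/* Sort a list into increasing order of v, dropping duplicates. */
static Node *sort_list(Node *in){
  Node *bucket[40];
  Node *next;
  size_t i;
  memset(bucket, 0, sizeof(bucket));
  while( in ){
    next = in->next;
    in->next = 0;
    /* Carry the run upward like binary addition until a free slot. */
    for(i=0; bucket[i]; i++){
      in = merge_unique(bucket[i], in);
      bucket[i] = 0;
    }
    bucket[i] = in;
    in = next;
  }
  /* Fold the remaining runs together, smallest slot first. */
  in = bucket[0];
  for(i=1; i<sizeof(bucket)/sizeof(bucket[0]); i++){
    if( bucket[i]==0 ) continue;
    in = in ? merge_unique(in, bucket[i]) : bucket[i];
  }
  return in;
}
```

Sorting the values 5, 1, 4, 1, 3 with this sketch yields the list 1, 3, 4, 5: the repeated 1 is merged away during the carry step.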
1559
1560
1561 /*
1562 ** The input, pIn, is a binary tree (or subtree) of RowSetEntry objects.
1563 ** Convert this tree into a linked list connected by the pRight pointers
1564 ** and return pointers to the first and last elements of the new list.
1565 */
1566 static void rowSetTreeToList(
1567 struct RowSetEntry *pIn, /* Root of the input tree */
1568 struct RowSetEntry **ppFirst, /* Write head of the output list here */
1569 struct RowSetEntry **ppLast /* Write tail of the output list here */
1570 ){
1571 assert( pIn!=0 );
1572 if( pIn->pLeft ){
1573 struct RowSetEntry *p;
1574 rowSetTreeToList(pIn->pLeft, ppFirst, &p);
1575 p->pRight = pIn;
1576 }else{
1577 *ppFirst = pIn;
1578 }
1579 if( pIn->pRight ){
1580 rowSetTreeToList(pIn->pRight, &pIn->pRight, ppLast);
1581 }else{
1582 *ppLast = pIn;
1583 }
1584 assert( (*ppLast)->pRight==0 );
1585 }
1586
1587
1588 /*
1589 ** Convert a sorted list of elements (connected by pRight) into a binary
1590 ** tree with depth of iDepth. A depth of 1 means the tree contains a single
1591 ** node taken from the head of *ppList. A depth of 2 means a tree with
1592 ** three nodes. And so forth.
1593 **
1594 ** Use as many entries from the input list as required and update the
1595 ** *ppList to point to the unused elements of the list. If the input
1596 ** list contains too few elements, then construct an incomplete tree
1597 ** and leave *ppList set to NULL.
1598 **
1599 ** Return a pointer to the root of the constructed binary tree.
1600 */
1601 static struct RowSetEntry *rowSetNDeepTree(
1602 struct RowSetEntry **ppList,
1603 int iDepth
1604 ){
1605 struct RowSetEntry *p; /* Root of the new tree */
1606 struct RowSetEntry *pLeft; /* Left subtree */
1607 if( *ppList==0 ){ /*OPTIMIZATION-IF-TRUE*/
1608 /* Prevent unnecessary deep recursion when we run out of entries */
1609 return 0;
1610 }
1611 if( iDepth>1 ){ /*OPTIMIZATION-IF-TRUE*/
1612 /* This branch causes a *balanced* tree to be generated. A valid tree
1613 ** is still generated without this branch, but the tree is wildly
1614 ** unbalanced and inefficient. */
1615 pLeft = rowSetNDeepTree(ppList, iDepth-1);
1616 p = *ppList;
1617 if( p==0 ){ /*OPTIMIZATION-IF-FALSE*/
1618 /* It is safe to always return here, but the resulting tree
1619 ** would be unbalanced */
1620 return pLeft;
1621 }
1622 p->pLeft = pLeft;
1623 *ppList = p->pRight;
1624 p->pRight = rowSetNDeepTree(ppList, iDepth-1);
1625 }else{
1626 p = *ppList;
1627 *ppList = p->pRight;
1628 p->pLeft = p->pRight = 0;
1629 }
1630 return p;
1631 }
1632
1633 /*
1634 ** Convert a sorted list of elements into a binary tree. Make the tree
1635 ** as deep as it needs to be in order to contain the entire list.
1636 */
1637 static struct RowSetEntry *rowSetListToTree(struct RowSetEntry *pList){
1638 int iDepth; /* Depth of the tree so far */
1639 struct RowSetEntry *p; /* Current tree root */
1640 struct RowSetEntry *pLeft; /* Left subtree */
1641
1642 assert( pList!=0 );
1643 p = pList;
1644 pList = p->pRight;
1645 p->pLeft = p->pRight = 0;
1646 for(iDepth=1; pList; iDepth++){
1647 pLeft = p;
1648 p = pList;
1649 pList = p->pRight;
1650 p->pLeft = pLeft;
1651 p->pRight = rowSetNDeepTree(&pList, iDepth);
1652 }
1653 return p;
1654 }
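
Reviewer aid, not part of the SQLite source: the pair rowSetNDeepTree() and rowSetListToTree() turns a sorted list into a balanced binary tree without knowing the list length in advance. The sketch below restates the same idea on a hypothetical `TNode` type; `n_deep_tree`, `list_to_tree`, `tree_contains` and `tree_height` are illustrative names. Each pass of the outer loop takes the tree built so far as the left child of the next list element and builds a right subtree of the same depth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node; right doubles as the list link, as pRight does. */
typedef struct TNode {
  long long v;
  struct TNode *right;
  struct TNode *left;
} TNode;

/* Build a tree of at most the given depth from the head of *pp. */
static TNode *n_deep_tree(TNode **pp, int depth){
  TNode *p;
  TNode *left;
  if( *pp==0 ) return 0;
  if( depth>1 ){
    left = n_deep_tree(pp, depth-1);
    p = *pp;
    if( p==0 ) return left;     /* List ran out: tree is incomplete */
    p->left = left;
    *pp = p->right;
    p->right = n_deep_tree(pp, depth-1);
  }else{
    p = *pp;
    *pp = p->right;
    p->left = p->right = 0;
  }
  return p;
}

/* Convert a non-empty sorted list into a balanced search tree. */
static TNode *list_to_tree(TNode *list){
  int depth;
  TNode *p;
  TNode *left;
  assert( list!=0 );
  p = list;
  list = p->right;
  p->left = p->right = 0;
  for(depth=1; list; depth++){
    left = p;                   /* Tree so far becomes the left child */
    p = list;
    list = p->right;
    p->left = left;
    p->right = n_deep_tree(&list, depth);
  }
  return p;
}

/* Binary search, as in the final loop of sqlite3RowSetTest(). */
static int tree_contains(const TNode *p, long long v){
  while( p ){
    if( p->v<v ) p = p->right;
    else if( p->v>v ) p = p->left;
    else return 1;
  }
  return 0;
}

/* Height in nodes; 0 for an empty tree. */
static int tree_height(const TNode *p){
  int l, r;
  if( p==0 ) return 0;
  l = tree_height(p->left);
  r = tree_height(p->right);
  return 1 + (l>r ? l : r);
}
```

For a seven-element list 1..7 the sketch produces a tree rooted at 4 with height 3, so a lookup visits at most three nodes.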
1655
1656 /*
1657 ** Extract the smallest element from the RowSet.
1658 ** Write the element into *pRowid. Return 1 on success. Return
1659 ** 0 if the RowSet is already empty.
1660 **
1661 ** After this routine has been called, the sqlite3RowSetInsert()
1662 ** routine may not be called again.
1663 **
1664 ** This routine may not be called after sqlite3RowSetTest() has
1665 ** been used. Older versions of RowSet allowed that, but as the
1666 ** capability was not used by the code generator, it was removed
1667 ** for code economy.
1668 */
1669 SQLITE_PRIVATE int sqlite3RowSetNext(RowSet *p, i64 *pRowid){
1670 assert( p!=0 );
1671 assert( p->pForest==0 ); /* Cannot be used with sqlite3RowSetTest() */
1672
1673 /* Merge the forest into a single sorted list on first call */
1674 if( (p->rsFlags & ROWSET_NEXT)==0 ){ /*OPTIMIZATION-IF-FALSE*/
1675 if( (p->rsFlags & ROWSET_SORTED)==0 ){ /*OPTIMIZATION-IF-FALSE*/
1676 p->pEntry = rowSetEntrySort(p->pEntry);
1677 }
1678 p->rsFlags |= ROWSET_SORTED|ROWSET_NEXT;
1679 }
1680
1681 /* Return the next entry on the list */
1682 if( p->pEntry ){
1683 *pRowid = p->pEntry->v;
1684 p->pEntry = p->pEntry->pRight;
1685 if( p->pEntry==0 ){ /*OPTIMIZATION-IF-TRUE*/
1686 /* Free memory immediately, rather than waiting on sqlite3_finalize() */
1687 sqlite3RowSetClear(p);
1688 }
1689 return 1;
1690 }else{
1691 return 0;
1692 }
1693 }
1694
1695 /*
1696 ** Check to see if element iRowid was inserted into the rowset as
1697 ** part of any insert batch prior to iBatch. Return 1 or 0.
1698 **
1699 ** If this is the first test of a new batch and if there exist entries
1700 ** on pRowSet->pEntry, then sort those entries into the forest at
1701 ** pRowSet->pForest so that they can be tested.
1702 */
1703 SQLITE_PRIVATE int sqlite3RowSetTest(RowSet *pRowSet, int iBatch, sqlite3_int64 iRowid){
1704 struct RowSetEntry *p, *pTree;
1705
1706 /* This routine is never called after sqlite3RowSetNext() */
1707 assert( pRowSet!=0 && (pRowSet->rsFlags & ROWSET_NEXT)==0 );
1708
1709 /* Sort entries into the forest on the first test of a new batch.
1710 ** To save unnecessary work, only do this when the batch number changes.
1711 */
1712 if( iBatch!=pRowSet->iBatch ){ /*OPTIMIZATION-IF-FALSE*/
1713 p = pRowSet->pEntry;
1714 if( p ){
1715 struct RowSetEntry **ppPrevTree = &pRowSet->pForest;
1716 if( (pRowSet->rsFlags & ROWSET_SORTED)==0 ){ /*OPTIMIZATION-IF-FALSE*/
1717 /* Only sort the current set of entries if they need it */
1718 p = rowSetEntrySort(p);
1719 }
1720 for(pTree = pRowSet->pForest; pTree; pTree=pTree->pRight){
1721 ppPrevTree = &pTree->pRight;
1722 if( pTree->pLeft==0 ){
1723 pTree->pLeft = rowSetListToTree(p);
1724 break;
1725 }else{
1726 struct RowSetEntry *pAux, *pTail;
1727 rowSetTreeToList(pTree->pLeft, &pAux, &pTail);
1728 pTree->pLeft = 0;
1729 p = rowSetEntryMerge(pAux, p);
1730 }
1731 }
1732 if( pTree==0 ){
1733 *ppPrevTree = pTree = rowSetEntryAlloc(pRowSet);
1734 if( pTree ){
1735 pTree->v = 0;
1736 pTree->pRight = 0;
1737 pTree->pLeft = rowSetListToTree(p);
1738 }
1739 }
1740 pRowSet->pEntry = 0;
1741 pRowSet->pLast = 0;
1742 pRowSet->rsFlags |= ROWSET_SORTED;
1743 }
1744 pRowSet->iBatch = iBatch;
1745 }
1746
1747 /* Test to see if the iRowid value appears anywhere in the forest.
1748 ** Return 1 if it does and 0 if not.
1749 */
1750 for(pTree = pRowSet->pForest; pTree; pTree=pTree->pRight){
1751 p = pTree->pLeft;
1752 while( p ){
1753 if( p->v<iRowid ){
1754 p = p->pRight;
1755 }else if( p->v>iRowid ){
1756 p = p->pLeft;
1757 }else{
1758 return 1;
1759 }
1760 }
1761 }
1762 return 0;
1763 }
1764
1765 /************** End of rowset.c **********************************************/
1766 /************** Begin file pager.c *******************************************/
1767 /*
1768 ** 2001 September 15
1769 **
1770 ** The author disclaims copyright to this source code. In place of
1771 ** a legal notice, here is a blessing:
1772 **
1773 ** May you do good and not evil.
1774 ** May you find forgiveness for yourself and forgive others.
1775 ** May you share freely, never taking more than you give.
1776 **
1777 *************************************************************************
1778 ** This is the implementation of the page cache subsystem or "pager".
1779 **
1780 ** The pager is used to access a database disk file. It implements
1781 ** atomic commit and rollback through the use of a journal file that
1782 ** is separate from the database file. The pager also implements file
1783 ** locking to prevent two processes from writing the same database
1784 ** file simultaneously, or one process from reading the database while
1785 ** another is writing.
1786 */
1787 #ifndef SQLITE_OMIT_DISKIO
1788 /* #include "sqliteInt.h" */
1789 /************** Include wal.h in the middle of pager.c ***********************/
1790 /************** Begin file wal.h *********************************************/
1791 /*
1792 ** 2010 February 1
1793 **
1794 ** The author disclaims copyright to this source code. In place of
1795 ** a legal notice, here is a blessing:
1796 **
1797 ** May you do good and not evil.
1798 ** May you find forgiveness for yourself and forgive others.
1799 ** May you share freely, never taking more than you give.
1800 **
1801 *************************************************************************
1802 ** This header file defines the interface to the write-ahead logging
1803 ** system. Refer to the comments below and the header comment attached to
1804 ** the implementation of each function in wal.c for further details.
1805 */
1806
1807 #ifndef SQLITE_WAL_H
1808 #define SQLITE_WAL_H
1809
1810 /* #include "sqliteInt.h" */
1811
1812 /* Additional values that can be added to the sync_flags argument of
1813 ** sqlite3WalFrames():
1814 */
1815 #define WAL_SYNC_TRANSACTIONS 0x20 /* Sync at the end of each transaction */
1816 #define SQLITE_SYNC_MASK 0x13 /* Mask off the SQLITE_SYNC_* values */
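
As a rough illustration of how these two constants partition the sync_flags argument, the sketch below splits a combined value into the part covered by the mask and the WAL-specific bit. The DEMO_ names and both helper functions are hypothetical and are not part of SQLite; only the numeric values 0x20 and 0x13 are copied from the defines above.

```c
#include <assert.h>

/* Values copied from the defines above; the DEMO_ names are illustrative. */
#define DEMO_WAL_SYNC_TRANSACTIONS 0x20
#define DEMO_SQLITE_SYNC_MASK      0x13

/* Keep only the bits covered by the mask (the SQLITE_SYNC_* part). */
static int demo_vfs_sync_flags(int sync_flags){
  return sync_flags & DEMO_SQLITE_SYNC_MASK;
}

/* True if the caller asked for a sync at the end of each transaction. */
static int demo_sync_each_transaction(int sync_flags){
  return (sync_flags & DEMO_WAL_SYNC_TRANSACTIONS)!=0;
}
```

For example, demo_vfs_sync_flags(0x23) yields 0x03 while demo_sync_each_transaction(0x23) is true, since the 0x20 bit lies outside the mask.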
1817
1818 #ifdef SQLITE_OMIT_WAL
1819 # define sqlite3WalOpen(x,y,z) 0
1820 # define sqlite3WalLimit(x,y)
1821 # define sqlite3WalClose(v,w,x,y,z) 0
1822 # define sqlite3WalBeginReadTransaction(y,z) 0
1823 # define sqlite3WalEndReadTransaction(z)
1824 # define sqlite3WalDbsize(y) 0
1825 # define sqlite3WalBeginWriteTransaction(y) 0
1826 # define sqlite3WalEndWriteTransaction(x) 0
1827 # define sqlite3WalUndo(x,y,z) 0
1828 # define sqlite3WalSavepoint(y,z)
1829 # define sqlite3WalSavepointUndo(y,z) 0
1830 # define sqlite3WalFrames(u,v,w,x,y,z) 0
1831 # define sqlite3WalCheckpoint(q,r,s,t,u,v,w,x,y,z) 0
1832 # define sqlite3WalCallback(z) 0
1833 # define sqlite3WalExclusiveMode(y,z) 0
1834 # define sqlite3WalHeapMemory(z) 0
1835 # define sqlite3WalFramesize(z) 0
1836 # define sqlite3WalFindFrame(x,y,z) 0
1837 # define sqlite3WalFile(x) 0
1838 #else
1839
1840 #define WAL_SAVEPOINT_NDATA 4
1841
1842 /* Connection to a write-ahead log (WAL) file.
1843 ** There is one object of this type for each pager.
1844 */
1845 typedef struct Wal Wal;
1846
1847 /* Open and close a connection to a write-ahead log. */
1848 SQLITE_PRIVATE int sqlite3WalOpen(sqlite3_vfs*, sqlite3_file*, const char *, int , i64, Wal**);
1849 SQLITE_PRIVATE int sqlite3WalClose(Wal *pWal, sqlite3*, int sync_flags, int, u8 *);
1850
1851 /* Set the limiting size of a WAL file. */
1852 SQLITE_PRIVATE void sqlite3WalLimit(Wal*, i64);
1853
1854 /* Used by readers to open (lock) and close (unlock) a snapshot. A
1855 ** snapshot is like a read-transaction. It is the state of the database
1856 ** at an instant in time. sqlite3WalBeginReadTransaction() gets a read lock
1857 ** and preserves the current state even if other threads or processes
1858 ** write to or checkpoint the WAL. sqlite3WalEndReadTransaction() closes
1859 ** the transaction and releases the lock.
1860 */
1861 SQLITE_PRIVATE int sqlite3WalBeginReadTransaction(Wal *pWal, int *);
1862 SQLITE_PRIVATE void sqlite3WalEndReadTransaction(Wal *pWal);
1863
1864 /* Read a page from the write-ahead log, if it is present. */
1865 SQLITE_PRIVATE int sqlite3WalFindFrame(Wal *, Pgno, u32 *);
1866 SQLITE_PRIVATE int sqlite3WalReadFrame(Wal *, u32, int, u8 *);
1867
1868 /* If the WAL is not empty, return the size of the database. */
1869 SQLITE_PRIVATE Pgno sqlite3WalDbsize(Wal *pWal);
1870
1871 /* Obtain or release the WRITER lock. */
1872 SQLITE_PRIVATE int sqlite3WalBeginWriteTransaction(Wal *pWal);
1873 SQLITE_PRIVATE int sqlite3WalEndWriteTransaction(Wal *pWal);
1874
1875 /* Undo any frames written (but not committed) to the log */
1876 SQLITE_PRIVATE int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx);
1877
1878 /* Return an integer that records the current (uncommitted) write
1879 ** position in the WAL */
1880 SQLITE_PRIVATE void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData);
1881
1882 /* Move the write position of the WAL back to iFrame. Called in
1883 ** response to a ROLLBACK TO command. */
1884 SQLITE_PRIVATE int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData);
1885
1886 /* Write a frame or frames to the log. */
1887 SQLITE_PRIVATE int sqlite3WalFrames(Wal *pWal, int, PgHdr *, Pgno, int, int);
1888
1889 /* Copy pages from the log to the database file */
1890 SQLITE_PRIVATE int sqlite3WalCheckpoint(
1891 Wal *pWal, /* Write-ahead log connection */
1892 sqlite3 *db, /* Check this handle's interrupt flag */
1893 int eMode, /* One of PASSIVE, FULL and RESTART */
1894 int (*xBusy)(void*), /* Function to call when busy */
1895 void *pBusyArg, /* Context argument for xBusyHandler */
1896 int sync_flags, /* Flags to sync db file with (or 0) */
1897 int nBuf, /* Size of buffer zBuf */
1898 u8 *zBuf, /* Temporary buffer to use */
1899 int *pnLog, /* OUT: Number of frames in WAL */
1900 int *pnCkpt /* OUT: Number of backfilled frames in WAL */
1901 );
1902
1903 /* Return the value to pass to a sqlite3_wal_hook callback, the
1904 ** number of frames in the WAL at the point of the last commit since
1905 ** sqlite3WalCallback() was called. If no commits have occurred since
1906 ** the last call, then return 0.
1907 */
1908 SQLITE_PRIVATE int sqlite3WalCallback(Wal *pWal);
1909
1910 /* Tell the wal layer that an EXCLUSIVE lock has been obtained (or released)
1911 ** by the pager layer on the database file.
1912 */
1913 SQLITE_PRIVATE int sqlite3WalExclusiveMode(Wal *pWal, int op);
1914
1915 /* Return true if the argument is non-NULL and the WAL module is using
1916 ** heap-memory for the wal-index. Otherwise, if the argument is NULL or the
1917 ** WAL module is using shared-memory, return false.
1918 */
1919 SQLITE_PRIVATE int sqlite3WalHeapMemory(Wal *pWal);
1920
1921 #ifdef SQLITE_ENABLE_SNAPSHOT
1922 SQLITE_PRIVATE int sqlite3WalSnapshotGet(Wal *pWal, sqlite3_snapshot **ppSnapshot);
1923 SQLITE_PRIVATE void sqlite3WalSnapshotOpen(Wal *pWal, sqlite3_snapshot *pSnapshot);
1924 SQLITE_PRIVATE int sqlite3WalSnapshotRecover(Wal *pWal);
1925 #endif
1926
1927 #ifdef SQLITE_ENABLE_ZIPVFS
1928 /* If the WAL file is not empty, return the number of bytes of content
1929 ** stored in each frame (i.e. the db page-size when the WAL was created).
1930 */
1931 SQLITE_PRIVATE int sqlite3WalFramesize(Wal *pWal);
1932 #endif
1933
1934 /* Return the sqlite3_file object for the WAL file */
1935 SQLITE_PRIVATE sqlite3_file *sqlite3WalFile(Wal *pWal);
1936
1937 #endif /* ifndef SQLITE_OMIT_WAL */
1938 #endif /* SQLITE_WAL_H */
1939
1940 /************** End of wal.h *************************************************/
1941 /************** Continuing where we left off in pager.c **********************/
1942
1943
1944 /******************* NOTES ON THE DESIGN OF THE PAGER ************************
1945 **
1946 ** This comment block describes invariants that hold when using a rollback
1947 ** journal. These invariants do not apply for journal_mode=WAL,
1948 ** journal_mode=MEMORY, or journal_mode=OFF.
1949 **
1950 ** Within this comment block, a page is deemed to have been synced
1951 ** automatically as soon as it is written when PRAGMA synchronous=OFF.
1952 ** Otherwise, the page is not synced until the xSync method of the VFS
1953 ** is called successfully on the file containing the page.
1954 **
1955 ** Definition: A page of the database file is said to be "overwriteable" if
1956 ** one or more of the following are true about the page:
1957 **
1958 ** (a) The original content of the page as it was at the beginning of
1959 ** the transaction has been written into the rollback journal and
1960 ** synced.
1961 **
1962 ** (b) The page was a freelist leaf page at the start of the transaction.
1963 **
1964 ** (c) The page number is greater than the largest page that existed in
1965 ** the database file at the start of the transaction.
1966 **
1967 ** (1) A page of the database file is never overwritten unless one of the
1968 ** following are true:
1969 **
1970 ** (a) The page and all other pages on the same sector are overwriteable.
1971 **
1972 ** (b) The atomic page write optimization is enabled, and the entire
1973 ** transaction other than the update of the transaction sequence
1974 ** number consists of a single page change.
1975 **
1976 ** (2) The content of a page written into the rollback journal exactly matches
1977 ** both the content in the database when the rollback journal was written
1978 ** and the content in the database at the beginning of the current
1979 ** transaction.
1980 **
1981 ** (3) Writes to the database file are an integer multiple of the page size
1982 ** in length and are aligned on a page boundary.
1983 **
1984 ** (4) Reads from the database file are either aligned on a page boundary and
1985 ** an integer multiple of the page size in length or are taken from the
1986 ** first 100 bytes of the database file.
1987 **
1988 ** (5) All writes to the database file are synced prior to the rollback journal
1989 ** being deleted, truncated, or zeroed.
1990 **
1991 ** (6) If a master journal file is used, then all writes to the database file
1992 ** are synced prior to the master journal being deleted.
1993 **
1994 ** Definition: Two databases (or the same database at two points in time)
1995 ** are said to be "logically equivalent" if they give the same answer to
1996 ** all queries. Note in particular the content of freelist leaf
1997 ** pages can be changed arbitrarily without affecting the logical equivalence
1998 ** of the database.
1999 **
2000 ** (7) At any time, if any subset, including the empty set and the total set,
2001 ** of the unsynced changes to a rollback journal are removed and the
2002 ** journal is rolled back, the resulting database file will be logically
2003 ** equivalent to the database file at the beginning of the transaction.
2004 **
2005 ** (8) When a transaction is rolled back, the xTruncate method of the VFS
2006 ** is called to restore the database file to the same size it was at
2007 ** the beginning of the transaction. (In some VFSes, the xTruncate
2008 ** method is a no-op, but that does not change the fact that SQLite will
2009 ** invoke it.)
2010 **
2011 ** (9) Whenever the database file is modified, at least one bit in the range
2012 ** of bytes from 24 through 39 inclusive will be changed prior to releasing
2013 ** the EXCLUSIVE lock, thus signaling other connections on the same
2014 ** database to flush their caches.
2015 **
2016 ** (10) The pattern of bits in bytes 24 through 39 shall not repeat in less
2017 ** than one billion transactions.
2018 **
2019 ** (11) A database file is well-formed at the beginning and at the conclusion
2020 ** of every transaction.
2021 **
2022 ** (12) An EXCLUSIVE lock is held on the database file when writing to
2023 ** the database file.
2024 **
2025 ** (13) A SHARED lock is held on the database file while reading any
2026 ** content out of the database file.
2027 **
2028 ******************************************************************************/
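
Invariants (3) and (4) above can be stated as simple predicates on an I/O request. The sketch below is only an illustration of those two rules: the function names and the DEMO_ constant are hypothetical, and the pager itself does not contain these helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the region that invariant (4) allows to be read unaligned. */
#define DEMO_HEADER_BYTES 100

/* Invariant (3): a write must start on a page boundary and cover a
** whole number of pages. */
static int demo_write_is_page_aligned(int64_t offset, int64_t nByte,
                                      int64_t pageSize){
  if( pageSize<=0 || nByte<=0 || offset<0 ) return 0;
  return (offset % pageSize)==0 && (nByte % pageSize)==0;
}

/* Invariant (4): a read is either page-aligned in the same sense, or
** lies entirely within the first 100 bytes of the file. */
static int demo_read_is_permitted(int64_t offset, int64_t nByte,
                                  int64_t pageSize){
  if( demo_write_is_page_aligned(offset, nByte, pageSize) ) return 1;
  return offset>=0 && nByte>0 && offset+nByte<=DEMO_HEADER_BYTES;
}
```

With a 4096-byte page, a 16-byte read at offset 24 passes demo_read_is_permitted() because it stays inside the header region, whereas the same request would fail the write check.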
2029
2030 /*
2031 ** Macros for troubleshooting. Normally turned off
2032 */
2033 #if 0
2034 int sqlite3PagerTrace=1; /* True to enable tracing */
2035 #define sqlite3DebugPrintf printf
2036 #define PAGERTRACE(X) if( sqlite3PagerTrace ){ sqlite3DebugPrintf X; }
2037 #else
2038 #define PAGERTRACE(X)
2039 #endif
2040
2041 /*
2042 ** The following two macros are used within the PAGERTRACE() macros above
2043 ** to print out file-descriptors.
2044 **
2045 ** PAGERID() takes a pointer to a Pager struct as its argument. The
2046 ** associated file-descriptor is returned. FILEHANDLEID() takes an sqlite3_file
2047 ** struct as its argument.
2048 */
2049 #define PAGERID(p) ((int)(p->fd))
2050 #define FILEHANDLEID(fd) ((int)fd)
2051
2052 /*
2053 ** The Pager.eState variable stores the current 'state' of a pager. A
2054 ** pager may be in any one of the seven states shown in the following
2055 ** state diagram.
2056 **
2057 ** OPEN <------+------+
2058 ** | | |
2059 ** V | |
2060 ** +---------> READER-------+ |
2061 ** | | |
2062 ** | V |
2063 ** |<-------WRITER_LOCKED------> ERROR
2064 ** | | ^
2065 ** | V |
2066 ** |<------WRITER_CACHEMOD-------->|
2067 ** | | |
2068 ** | V |
2069 ** |<-------WRITER_DBMOD---------->|
2070 ** | | |
2071 ** | V |
2072 ** +<------WRITER_FINISHED-------->+
2073 **
2074 **
2075 ** List of state transitions and the C [function] that performs each:
2076 **
2077 ** OPEN -> READER [sqlite3PagerSharedLock]
2078 ** READER -> OPEN [pager_unlock]
2079 **
2080 ** READER -> WRITER_LOCKED [sqlite3PagerBegin]
2081 ** WRITER_LOCKED -> WRITER_CACHEMOD [pager_open_journal]
2082 ** WRITER_CACHEMOD -> WRITER_DBMOD [syncJournal]
2083 ** WRITER_DBMOD -> WRITER_FINISHED [sqlite3PagerCommitPhaseOne]
2084 ** WRITER_*** -> READER [pager_end_transaction]
2085 **
2086 ** WRITER_*** -> ERROR [pager_error]
2087 ** ERROR -> OPEN [pager_unlock]
2088 **
2089 **
2090 ** OPEN:
2091 **
2092 ** The pager starts up in this state. Nothing is guaranteed in this
2093 ** state - the file may or may not be locked and the database size is
2094 ** unknown. The database may not be read or written.
2095 **
2096 ** * No read or write transaction is active.
2097 ** * Any lock, or no lock at all, may be held on the database file.
2098 ** * The dbSize, dbOrigSize and dbFileSize variables may not be trusted.
2099 **
2100 ** READER:
2101 **
2102 ** In this state all the requirements for reading the database in
2103 ** rollback (non-WAL) mode are met. Unless the pager is (or recently
2104 ** was) in exclusive-locking mode, a user-level read transaction is
2105 ** open. The database size is known in this state.
2106 **
2107 ** A connection running with locking_mode=normal enters this state when
2108 ** it opens a read-transaction on the database and returns to state
2109 ** OPEN after the read-transaction is completed. However a connection
2110 ** running in locking_mode=exclusive (including temp databases) remains in
2111 ** this state even after the read-transaction is closed. The only way
2112 ** a locking_mode=exclusive connection can transition from READER to OPEN
2113 ** is via the ERROR state (see below).
2114 **
2115 ** * A read transaction may be active (but a write-transaction cannot).
2116 ** * A SHARED or greater lock is held on the database file.
2117 ** * The dbSize variable may be trusted (even if a user-level read
2118 ** transaction is not active). The dbOrigSize and dbFileSize variables
2119 ** may not be trusted at this point.
2120 ** * If the database is a WAL database, then the WAL connection is open.
2121 ** * Even if a read-transaction is not open, it is guaranteed that
2122 ** there is no hot-journal in the file-system.
2123 **
2124 ** WRITER_LOCKED:
2125 **
2126 ** The pager moves to this state from READER when a write-transaction
2127 ** is first opened on the database. In WRITER_LOCKED state, all locks
2128 ** required to start a write-transaction are held, but no actual
2129 ** modifications to the cache or database have taken place.
2130 **
2131 ** In rollback mode, a RESERVED or (if the transaction was opened with
2132 ** BEGIN EXCLUSIVE) EXCLUSIVE lock is obtained on the database file when
2133 ** moving to this state, but the journal file is not written to or opened
2134 ** in this state. If the transaction is committed or rolled back while
2135 ** in WRITER_LOCKED state, all that is required is to unlock the database
2136 ** file.
2137 **
2138 ** In WAL mode, WalBeginWriteTransaction() is called to lock the log file.
2139 ** If the connection is running with locking_mode=exclusive, an attempt
2140 ** is made to obtain an EXCLUSIVE lock on the database file.
2141 **
2142 ** * A write transaction is active.
2143 ** * If the connection is open in rollback-mode, a RESERVED or greater
2144 ** lock is held on the database file.
2145 ** * If the connection is open in WAL-mode, a WAL write transaction
2146 ** is open (i.e. sqlite3WalBeginWriteTransaction() has been successfully
2147 ** called).
2148 ** * The dbSize, dbOrigSize and dbFileSize variables are all valid.
2149 ** * The contents of the pager cache have not been modified.
2150 ** * The journal file may or may not be open.
2151 ** * Nothing (not even the first header) has been written to the journal.
2152 **
2153 ** WRITER_CACHEMOD:
2154 **
2155 ** A pager moves from WRITER_LOCKED state to this state when a page is
2156 ** first modified by the upper layer. In rollback mode the journal file
2157 ** is opened (if it is not already open) and a header written to the
2158 ** start of it. The database file on disk has not been modified.
2159 **
2160 ** * A write transaction is active.
2161 ** * A RESERVED or greater lock is held on the database file.
2162 ** * The journal file is open and the first header has been written
2163 ** to it, but the header has not been synced to disk.
2164 ** * The contents of the page cache have been modified.
2165 **
2166 ** WRITER_DBMOD:
2167 **
2168 ** The pager transitions from WRITER_CACHEMOD into WRITER_DBMOD state
2169 ** when it modifies the contents of the database file. WAL connections
2170 ** never enter this state (since they do not modify the database file,
2171 ** just the log file).
2172 **
2173 ** * A write transaction is active.
2174 ** * An EXCLUSIVE or greater lock is held on the database file.
2175 ** * The journal file is open and the first header has been written
2176 ** and synced to disk.
2177 ** * The contents of the page cache have been modified (and possibly
2178 ** written to disk).
2179 **
2180 ** WRITER_FINISHED:
2181 **
2182 ** It is not possible for a WAL connection to enter this state.
2183 **
2184 ** A rollback-mode pager changes to WRITER_FINISHED state from WRITER_DBMOD
2185 ** state after the entire transaction has been successfully written into the
2186 ** database file. In this state the transaction may be committed simply
2187 ** by finalizing the journal file. Once in WRITER_FINISHED state, it is
2188 ** not possible to modify the database further. At this point, the upper
2189 ** layer must either commit or rollback the transaction.
2190 **
2191 ** * A write transaction is active.
2192 ** * An EXCLUSIVE or greater lock is held on the database file.
2193 ** * All writing and syncing of journal and database data has finished.
2194 ** If no error occurred, all that remains is to finalize the journal to
2195 ** commit the transaction. If an error did occur, the caller will need
2196 ** to rollback the transaction.
2197 **
2198 ** ERROR:
2199 **
2200 ** The ERROR state is entered when an IO or disk-full error (including
2201 ** SQLITE_IOERR_NOMEM) occurs at a point in the code that makes it
2202 ** difficult to be sure that the in-memory pager state (cache contents,
2203 ** db size etc.) are consistent with the contents of the file-system.
2204 **
2205 ** Temporary pager files may enter the ERROR state, but in-memory pagers
2206 ** cannot.
2207 **
2208 ** For example, if an IO error occurs while performing a rollback,
2209 ** the contents of the page-cache may be left in an inconsistent state.
2210 ** At this point it would be dangerous to change back to READER state
2211 ** (as usually happens after a rollback). Any subsequent readers might
2212 ** report database corruption (due to the inconsistent cache), and if
2213 ** they upgrade to writers, they may inadvertently corrupt the database
2214 ** file. To avoid this hazard, the pager switches into the ERROR state
2215 ** instead of READER following such an error.
2216 **
2217 ** Once it has entered the ERROR state, any attempt to use the pager
2218 ** to read or write data returns an error. Eventually, once all
2219 ** outstanding transactions have been abandoned, the pager is able to
2220 ** transition back to OPEN state, discarding the contents of the
2221 ** page-cache and any other in-memory state at the same time. Everything
2222 ** is reloaded from disk (and, if necessary, hot-journal rollback performed)
2223 ** when a read-transaction is next opened on the pager (transitioning
2224 ** the pager into READER state). At that point the system has recovered
2225 ** from the error.
2226 **
2227 ** Specifically, the pager jumps into the ERROR state if:
2228 **
2229 ** 1. An error occurs while attempting a rollback. This happens in
2230 ** function sqlite3PagerRollback().
2231 **
2232 ** 2. An error occurs while attempting to finalize a journal file
2233 ** following a commit in function sqlite3PagerCommitPhaseTwo().
2234 **
2235 ** 3. An error occurs while attempting to write to the journal or
2236 ** database file in function pagerStress() in order to free up
2237 ** memory.
2238 **
2239 ** In other cases, the error is returned to the b-tree layer. The b-tree
2240 ** layer then attempts a rollback operation. If the error condition
2241 ** persists, the pager enters the ERROR state via condition (1) above.
2242 **
2243 ** Condition (3) is necessary because it can be triggered by a read-only
2244 ** statement executed within a transaction. In this case, if the error
2245 ** code were simply returned to the user, the b-tree layer would not
2246 ** automatically attempt a rollback, as it assumes that an error in a
2247 ** read-only statement cannot leave the pager in an internally inconsistent
2248 ** state.
2249 **
2250 ** * The Pager.errCode variable is set to something other than SQLITE_OK.
2251 ** * There are one or more outstanding references to pages (after the
2252 ** last reference is dropped the pager should move back to OPEN state).
2253 ** * The pager is not an in-memory pager.
2254 **
2255 **
2256 ** Notes:
2257 **
2258 ** * A pager is never in WRITER_DBMOD or WRITER_FINISHED state if the
2259 ** connection is open in WAL mode. A WAL connection is always in one
2260 ** of the first four states.
2261 **
2262 ** * Normally, a connection open in exclusive mode is never in PAGER_OPEN
2263 ** state. There are two exceptions: immediately after exclusive-mode has
2264 ** been turned on (and before any read or write transactions are
2265 ** executed), and when the pager is leaving the "error state".
2266 **
2267 ** * See also: assert_pager_state().
2268 */
2269 #define PAGER_OPEN 0
2270 #define PAGER_READER 1
2271 #define PAGER_WRITER_LOCKED 2
2272 #define PAGER_WRITER_CACHEMOD 3
2273 #define PAGER_WRITER_DBMOD 4
2274 #define PAGER_WRITER_FINISHED 5
2275 #define PAGER_ERROR 6
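
The transition list in the comment above can be encoded as a small lookup, which may help when reading assertions about Pager.eState. This is a sketch under stated assumptions: it covers only the transitions listed in that comment, the DEMO_ names and both functions are hypothetical, and the numeric values simply mirror the defines above. See assert_pager_state() for the checks SQLite actually performs.

```c
#include <assert.h>

/* Same numeric values as the PAGER_* defines; names are illustrative. */
enum {
  DEMO_OPEN=0, DEMO_READER=1, DEMO_WRITER_LOCKED=2, DEMO_WRITER_CACHEMOD=3,
  DEMO_WRITER_DBMOD=4, DEMO_WRITER_FINISHED=5, DEMO_ERROR=6
};

/* True for any of the four WRITER_*** states. */
static int demo_is_writer(int s){
  return s>=DEMO_WRITER_LOCKED && s<=DEMO_WRITER_FINISHED;
}

/* Return 1 if from->to appears in the transition list, else 0. */
static int demo_transition_listed(int from, int to){
  /* WRITER_*** -> READER and WRITER_*** -> ERROR */
  if( demo_is_writer(from) && (to==DEMO_READER || to==DEMO_ERROR) ) return 1;
  switch( from ){
    case DEMO_OPEN:            return to==DEMO_READER;
    case DEMO_READER:          return to==DEMO_OPEN || to==DEMO_WRITER_LOCKED;
    case DEMO_WRITER_LOCKED:   return to==DEMO_WRITER_CACHEMOD;
    case DEMO_WRITER_CACHEMOD: return to==DEMO_WRITER_DBMOD;
    case DEMO_WRITER_DBMOD:    return to==DEMO_WRITER_FINISHED;
    case DEMO_ERROR:           return to==DEMO_OPEN;
    default:                   return 0;
  }
}
```

Because the states are numbered in the order a write transaction passes through them, comparisons such as eState>=PAGER_WRITER_LOCKED are enough to test for "any writer state", which is what demo_is_writer() imitates.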
2276
2277 /*
2278 ** The Pager.eLock variable is almost always set to one of the
2279 ** following locking-states, according to the lock currently held on
2280 ** the database file: NO_LOCK, SHARED_LOCK, RESERVED_LOCK or EXCLUSIVE_LOCK.
2281 ** This variable is kept up to date as locks are taken and released by
2282 ** the pagerLockDb() and pagerUnlockDb() wrappers.
2283 **
2284 ** If the VFS xLock() or xUnlock() returns an error other than SQLITE_BUSY
2285 ** (i.e. one of the SQLITE_IOERR subtypes), it is not clear whether or not
2286 ** the operation was successful. In these circumstances pagerLockDb() and
2287 ** pagerUnlockDb() take a conservative approach - eLock is always updated
2288 ** when unlocking the file, and only updated when locking the file if the
2289 ** VFS call is successful. This way, the Pager.eLock variable may be set
2290 ** to a less exclusive (lower) value than the lock that is actually held
2291 ** at the system level, but it is never set to a more exclusive value.
2292 **
2293 ** This is usually safe. If an xUnlock fails or appears to fail, there may
2294 ** be a few redundant xLock() calls or a lock may be held for longer than
2295 ** required, but nothing really goes wrong.
2296 **
2297 ** The exception is when the database file is unlocked as the pager moves
2298 ** from ERROR to OPEN state. At this point there may be a hot-journal file
2299 ** in the file-system that needs to be rolled back (as part of an OPEN->SHARED
2300 ** transition, by the same pager or any other). If the call to xUnlock()
2301 ** fails at this point and the pager is left holding an EXCLUSIVE lock, this
2302 ** can confuse the xCheckReservedLock() call made later as part
2303 ** of hot-journal detection.
2304 **
2305 ** xCheckReservedLock() is defined as returning true "if there is a RESERVED
2306 ** lock held by this process or any others". So xCheckReservedLock may
2307 ** return true because the caller itself is holding an EXCLUSIVE lock (but
2308 ** doesn't know it because of a previous error in xUnlock). If this happens
2309 ** a hot-journal may be mistaken for a journal being created by an active
2310 ** transaction in another process, causing SQLite to read from the database
2311 ** without rolling it back.
2312 **
2313 ** To work around this, if a call to xUnlock() fails when unlocking the
2314 ** database in the ERROR state, Pager.eLock is set to UNKNOWN_LOCK. It
2315 ** is only changed back to a real locking state after a successful call
2316 ** to xLock(EXCLUSIVE). Also, the code to do the OPEN->SHARED state transition
2317 ** omits the check for a hot-journal if Pager.eLock is set to UNKNOWN_LOCK.
2318 ** Instead, it assumes a hot-journal exists and obtains an EXCLUSIVE
2319 ** lock on the database file before attempting to roll it back. See function
2320 ** PagerSharedLock() for more detail.
2321 **
2322 ** Pager.eLock may only be set to UNKNOWN_LOCK when the pager is in
2323 ** PAGER_OPEN state.
2324 */
2325 #define UNKNOWN_LOCK (EXCLUSIVE_LOCK+1)
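
The conservative bookkeeping described above can be sketched with a toy pager whose VFS result codes are passed in by the caller. Everything here is hypothetical (the DemoPager type, the demo_ functions, and the vfsRc parameter that stands in for the return value of xLock() or xUnlock()); it is meant to show the update rules, not to reproduce pagerLockDb() or pagerUnlockDb().

```c
#include <assert.h>

/* Lock levels in increasing order of exclusivity; names are illustrative. */
enum {
  DEMO_NO_LOCK=0, DEMO_SHARED_LOCK=1, DEMO_RESERVED_LOCK=2,
  DEMO_EXCLUSIVE_LOCK=4, DEMO_UNKNOWN_LOCK=5   /* EXCLUSIVE+1 */
};

typedef struct DemoPager { int eLock; } DemoPager;

/* Unlock: eLock is always lowered, whether or not the VFS call worked,
** unless the lock state is already unknown. vfsRc==0 means success. */
static int demo_unlock(DemoPager *p, int eLock, int vfsRc){
  if( p->eLock!=DEMO_UNKNOWN_LOCK ) p->eLock = eLock;
  return vfsRc;
}

/* Lock: eLock is raised only if the VFS call succeeded. From the unknown
** state, only a successful EXCLUSIVE lock restores a real value. */
static int demo_lock(DemoPager *p, int eLock, int vfsRc){
  if( vfsRc==0
   && (p->eLock!=DEMO_UNKNOWN_LOCK || eLock==DEMO_EXCLUSIVE_LOCK) ){
    p->eLock = eLock;
  }
  return vfsRc;
}

/* Unlock while leaving the ERROR state: a failure marks eLock unknown. */
static void demo_unlock_in_error_state(DemoPager *p, int vfsRc){
  demo_unlock(p, DEMO_NO_LOCK, vfsRc);
  if( vfsRc!=0 ) p->eLock = DEMO_UNKNOWN_LOCK;
}
```

In this sketch a failed unlock in the error state leaves eLock at DEMO_UNKNOWN_LOCK, a later successful SHARED lock does not change it, and only a successful EXCLUSIVE lock brings it back to a real locking state.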
2326
2327 /*
2328 ** A macro used for invoking the codec if there is one
2329 */
2330 #ifdef SQLITE_HAS_CODEC
2331 # define CODEC1(P,D,N,X,E) \
2332 if( P->xCodec && P->xCodec(P->pCodec,D,N,X)==0 ){ E; }
2333 # define CODEC2(P,D,N,X,E,O) \
2334 if( P->xCodec==0 ){ O=(char*)D; }else \
2335 if( (O=(char*)(P->xCodec(P->pCodec,D,N,X)))==0 ){ E; }
2336 #else
2337 # define CODEC1(P,D,N,X,E) /* NO-OP */
2338 # define CODEC2(P,D,N,X,E,O) O=(char*)D
2339 #endif
2340
2341 /*
2342 ** The maximum allowed sector size. 64KiB. If the xSectorsize() method
2343 ** returns a value larger than this, then MAX_SECTOR_SIZE is used instead.
2344 ** This could conceivably cause corruption following a power failure on
2345 ** such a system. This is currently an undocumented limit.
2346 */
2347 #define MAX_SECTOR_SIZE 0x10000
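
The upper limit described in the comment amounts to a one-line clamp. The helper below is a hypothetical sketch of just that rule; the pager's real sector-size logic may apply further adjustments that are not shown here.

```c
#include <assert.h>

#define DEMO_MAX_SECTOR_SIZE 0x10000   /* 64KiB, as in the define above */

/* Return the sector size to use, given the value reported by the VFS. */
static int demo_clamp_sector_size(int reported){
  return reported>DEMO_MAX_SECTOR_SIZE ? DEMO_MAX_SECTOR_SIZE : reported;
}
```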
2348
2349
2350 /*
2351 ** An instance of the following structure is allocated for each active
2352 ** savepoint and statement transaction in the system. All such structures
2353 ** are stored in the Pager.aSavepoint[] array, which is allocated and
2354 ** resized using sqlite3Realloc().
2355 **
2356 ** When a savepoint is created, the PagerSavepoint.iHdrOffset field is
2357 ** set to 0. If a journal-header is written into the main journal while
2358 ** the savepoint is active, then iHdrOffset is set to the byte offset
2359 ** immediately following the last journal record written into the main
2360 ** journal before the journal-header. This is required during savepoint
2361 ** rollback (see pagerPlaybackSavepoint()).
2362 */
2363 typedef struct PagerSavepoint PagerSavepoint;
2364 struct PagerSavepoint {
2365 i64 iOffset; /* Starting offset in main journal */
2366 i64 iHdrOffset; /* See above */
2367 Bitvec *pInSavepoint; /* Set of pages in this savepoint */
2368 Pgno nOrig; /* Original number of pages in file */
2369 Pgno iSubRec; /* Index of first record in sub-journal */
2370 #ifndef SQLITE_OMIT_WAL
2371 u32 aWalData[WAL_SAVEPOINT_NDATA]; /* WAL savepoint context */
2372 #endif
2373 };
2374
2375 /*
2376 ** Bits of the Pager.doNotSpill flag. See further description below.
2377 */
2378 #define SPILLFLAG_OFF 0x01 /* Never spill cache. Set via pragma */
2379 #define SPILLFLAG_ROLLBACK 0x02 /* Currently rolling back, so do not spill */
2380 #define SPILLFLAG_NOSYNC 0x04 /* Spill is ok, but do not sync */
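
A minimal sketch of how these three bits might be tested, following the doNotSpill description in the Pager struct comment: OFF and ROLLBACK forbid spilling altogether, while NOSYNC permits the spill but not a journal sync. The DEMO_ names and both functions are hypothetical; only the bit values are taken from the defines above.

```c
#include <assert.h>

#define DEMO_SPILLFLAG_OFF      0x01
#define DEMO_SPILLFLAG_ROLLBACK 0x02
#define DEMO_SPILLFLAG_NOSYNC   0x04

/* A spill may write to the database unless OFF or ROLLBACK is set. */
static int demo_spill_allowed(int doNotSpill){
  return (doNotSpill & (DEMO_SPILLFLAG_OFF|DEMO_SPILLFLAG_ROLLBACK))==0;
}

/* A spill may sync the journal only if it is allowed at all and the
** NOSYNC bit is clear. */
static int demo_spill_may_sync(int doNotSpill){
  return demo_spill_allowed(doNotSpill)
      && (doNotSpill & DEMO_SPILLFLAG_NOSYNC)==0;
}
```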
2381
2382 /*
2383 ** An open page cache is an instance of struct Pager. A description of
2384 ** some of the more important member variables follows:
2385 **
2386 ** eState
2387 **
2388 ** The current 'state' of the pager object. See the comment and state
2389 ** diagram above for a description of the pager state.
2390 **
2391 ** eLock
2392 **
2393 ** For a real on-disk database, the current lock held on the database file -
2394 ** NO_LOCK, SHARED_LOCK, RESERVED_LOCK or EXCLUSIVE_LOCK.
2395 **
2396 ** For a temporary or in-memory database (neither of which require any
2397 ** locks), this variable is always set to EXCLUSIVE_LOCK. Since such
2398 ** databases always have Pager.exclusiveMode==1, this tricks the pager
2399 ** logic into thinking that it already has all the locks it will ever
2400 ** need (and no reason to release them).
2401 **
2402 ** In some (obscure) circumstances, this variable may also be set to
2403 ** UNKNOWN_LOCK. See the comment above the #define of UNKNOWN_LOCK for
2404 ** details.
2405 **
2406 ** changeCountDone
2407 **
2408 ** This boolean variable is used to make sure that the change-counter
2409 ** (the 4-byte header field at byte offset 24 of the database file) is
2410 ** not updated more often than necessary.
2411 **
2412 ** It is set to true when the change-counter field is updated, which
2413 ** can only happen if an exclusive lock is held on the database file.
2414 ** It is cleared (set to false) whenever an exclusive lock is
2415 ** relinquished on the database file. Each time a transaction is committed,
2416 ** the changeCountDone flag is inspected. If it is true, the work of
2417 ** updating the change-counter is omitted for the current transaction.
2418 **
2419 ** This mechanism means that when running in exclusive mode, a connection
2420 ** need only update the change-counter once, for the first transaction
2421 ** committed.
2422 **
2423 ** setMaster
2424 **
2425 ** When PagerCommitPhaseOne() is called to commit a transaction, it may
2426 ** (or may not) specify a master-journal name to be written into the
2427 ** journal file before it is synced to disk.
2428 **
2429 ** Whether or not a journal file contains a master-journal pointer affects
2430 ** the way in which the journal file is finalized after the transaction is
2431 ** committed or rolled back when running in "journal_mode=PERSIST" mode.
2432 ** If a journal file does not contain a master-journal pointer, it is
2433 ** finalized by overwriting the first journal header with zeroes. If
2434 ** it does contain a master-journal pointer the journal file is finalized
2435 ** by truncating it to zero bytes, just as if the connection were
2436 ** running in "journal_mode=truncate" mode.
2437 **
2438 ** Journal files that contain master journal pointers cannot be finalized
2439 ** simply by overwriting the first journal-header with zeroes, as the
2440 ** master journal pointer could interfere with hot-journal rollback of any
2441 ** subsequently interrupted transaction that reuses the journal file.
2442 **
2443 ** The flag is cleared as soon as the journal file is finalized (either
2444 ** by PagerCommitPhaseTwo or PagerRollback). If an IO error prevents the
2445 ** journal file from being successfully finalized, the setMaster flag
2446 ** is cleared anyway (and the pager will move to ERROR state).
2447 **
2448 ** doNotSpill
2449 **
2450 ** This variable controls the behavior of cache-spills (calls made by
2451 ** the pcache module to the pagerStress() routine to write cached data
2452 ** to the file-system in order to free up memory).
2453 **
2454 ** When bits SPILLFLAG_OFF or SPILLFLAG_ROLLBACK of doNotSpill are set,
2455 ** writing to the database from pagerStress() is disabled altogether.
2456 ** The SPILLFLAG_ROLLBACK bit is set in a very obscure case that
2457 ** comes up during savepoint rollback that requires the pcache module
2458 ** to allocate a new page to prevent the journal file from being written
2459 ** while it is being traversed by code in pager_playback(). The SPILLFLAG_OFF
2460 ** case is a user preference.
2461 **
2462 ** If the SPILLFLAG_NOSYNC bit is set, writing to the database from
2463 ** pagerStress() is permitted, but syncing the journal file is not.
2464 ** This flag is set by sqlite3PagerWrite() when the file-system sector-size
2465 ** is larger than the database page-size in order to prevent a journal sync
2466 ** from happening in between the journalling of two pages on the same sector.
2467 **
2468 ** subjInMemory
2469 **
2470 ** This is a boolean variable. If true, then any required sub-journal
2471 ** is opened as an in-memory journal file. If false, then in-memory
2472 ** sub-journals are only used for in-memory pager files.
2473 **
2474 ** This variable is updated by the upper layer each time a new
2475 ** write-transaction is opened.
2476 **
2477 ** dbSize, dbOrigSize, dbFileSize
2478 **
2479 ** Variable dbSize is set to the number of pages in the database file.
2480 ** It is valid in PAGER_READER and higher states (all states except for
2481 ** OPEN and ERROR).
2482 **
2483 ** dbSize is set based on the size of the database file, which may be
2484 ** larger than the size of the database (the value stored at offset
2485 ** 28 of the database header by the btree). If the size of the file
2486 ** is not an integer multiple of the page-size, the value stored in
2487 ** dbSize is rounded down (i.e. a 5KB file with 2K page-size has dbSize==2).
2488 ** Except, any file that is greater than 0 bytes in size is considered
2489 ** to have at least one page. (i.e. a 1KB file with 2K page-size leads
2490 ** to dbSize==1).
2491 **
2492 ** During a write-transaction, if pages with page-numbers greater than
2493 ** dbSize are modified in the cache, dbSize is updated accordingly.
2494 ** Similarly, if the database is truncated using PagerTruncateImage(),
2495 ** dbSize is updated.
2496 **
2497 ** Variables dbOrigSize and dbFileSize are valid in states
2498 ** PAGER_WRITER_LOCKED and higher. dbOrigSize is a copy of the dbSize
2499 ** variable at the start of the transaction. It is used during rollback,
2500 ** and to determine whether or not pages need to be journalled before
2501 ** being modified.
2502 **
2503 ** Throughout a write-transaction, dbFileSize contains the size of
2504 ** the file on disk in pages. It is set to a copy of dbSize when the
2505 ** write-transaction is first opened, and updated when VFS calls are made
2506 ** to write or truncate the database file on disk.
2507 **
2508 ** The only reason the dbFileSize variable is required is to suppress
2509 ** unnecessary calls to xTruncate() after committing a transaction. If,
2510 ** when a transaction is committed, the dbFileSize variable indicates
2511 ** that the database file is larger than the database image (Pager.dbSize),
2512 ** pager_truncate() is called. The pager_truncate() call uses xFilesize()
2513 ** to measure the database file on disk, and then truncates it if required.
2514 ** dbFileSize is not used when rolling back a transaction. In this case
2515 ** pager_truncate() is called unconditionally (which means there may be
2516 ** a call to xFilesize() that is not strictly required). In either case,
2517 ** pager_truncate() may cause the file to become smaller or larger.
2518 **
2519 ** dbHintSize
2520 **
2521 ** The dbHintSize variable is used to limit the number of calls made to
2522 ** the VFS xFileControl(FCNTL_SIZE_HINT) method.
2523 **
2524 ** dbHintSize is set to a copy of the dbSize variable when a
2525 ** write-transaction is opened (at the same time as dbFileSize and
2526 ** dbOrigSize). If the xFileControl(FCNTL_SIZE_HINT) method is called,
2527 ** dbHintSize is increased to the number of pages that correspond to the
2528 ** size-hint passed to the method call. See pager_write_pagelist() for
2529 ** details.
2530 **
2531 ** errCode
2532 **
2533 ** The Pager.errCode variable is only ever used in PAGER_ERROR state. It
2534 ** is set to zero in all other states. In PAGER_ERROR state, Pager.errCode
2535 ** is always set to SQLITE_FULL, SQLITE_IOERR or one of the SQLITE_IOERR_XXX
2536 ** sub-codes.
2537 */
2538 struct Pager {
2539 sqlite3_vfs *pVfs; /* OS functions to use for IO */
2540 u8 exclusiveMode; /* Boolean. True if locking_mode==EXCLUSIVE */
2541 u8 journalMode; /* One of the PAGER_JOURNALMODE_* values */
2542 u8 useJournal; /* Use a rollback journal on this file */
2543 u8 noSync; /* Do not sync the journal if true */
2544 u8 fullSync; /* Do extra syncs of the journal for robustness */
2545 u8 extraSync; /* sync directory after journal delete */
2546 u8 ckptSyncFlags; /* SYNC_NORMAL or SYNC_FULL for checkpoint */
2547 u8 walSyncFlags; /* SYNC_NORMAL or SYNC_FULL for wal writes */
2548 u8 syncFlags; /* SYNC_NORMAL or SYNC_FULL otherwise */
2549 u8 tempFile; /* zFilename is a temporary or immutable file */
2550 u8 noLock; /* Do not lock (except in WAL mode) */
2551 u8 readOnly; /* True for a read-only database */
2552 u8 memDb; /* True to inhibit all file I/O */
2553
2554 /**************************************************************************
2555 ** The following block contains those class members that change during
2556 ** routine operation. Class members not in this block are either fixed
2557 ** when the pager is first created or else only change when there is a
2558 ** significant mode change (such as changing the page_size, locking_mode,
2559 ** or the journal_mode). From another view, these class members describe
2560 ** the "state" of the pager, while other class members describe the
2561 ** "configuration" of the pager.
2562 */
2563 u8 eState; /* Pager state (OPEN, READER, WRITER_LOCKED..) */
2564 u8 eLock; /* Current lock held on database file */
2565 u8 changeCountDone; /* Set after incrementing the change-counter */
2566 u8 setMaster; /* True if a m-j name has been written to jrnl */
2567 u8 doNotSpill; /* Do not spill the cache when non-zero */
2568 u8 subjInMemory; /* True to use in-memory sub-journals */
2569 u8 bUseFetch; /* True to use xFetch() */
2570 u8 hasHeldSharedLock; /* True if a shared lock has ever been held */
2571 Pgno dbSize; /* Number of pages in the database */
2572 Pgno dbOrigSize; /* dbSize before the current transaction */
2573 Pgno dbFileSize; /* Number of pages in the database file */
2574 Pgno dbHintSize; /* Value passed to FCNTL_SIZE_HINT call */
2575 int errCode; /* One of several kinds of errors */
2576 int nRec; /* Pages journalled since last j-header written */
2577 u32 cksumInit; /* Quasi-random value added to every checksum */
2578 u32 nSubRec; /* Number of records written to sub-journal */
2579 Bitvec *pInJournal; /* One bit for each page in the database file */
2580 sqlite3_file *fd; /* File descriptor for database */
2581 sqlite3_file *jfd; /* File descriptor for main journal */
2582 sqlite3_file *sjfd; /* File descriptor for sub-journal */
2583 i64 journalOff; /* Current write offset in the journal file */
2584 i64 journalHdr; /* Byte offset to previous journal header */
2585 sqlite3_backup *pBackup; /* Pointer to list of ongoing backup processes */
2586 PagerSavepoint *aSavepoint; /* Array of active savepoints */
2587 int nSavepoint; /* Number of elements in aSavepoint[] */
2588 u32 iDataVersion; /* Changes whenever database content changes */
2589 char dbFileVers[16]; /* Changes whenever database file changes */
2590
2591 int nMmapOut; /* Number of mmap pages currently outstanding */
2592 sqlite3_int64 szMmap; /* Desired maximum mmap size */
2593 PgHdr *pMmapFreelist; /* List of free mmap page headers (pDirty) */
2594 /*
2595 ** End of the routinely-changing class members
2596 ***************************************************************************/
2597
2598 u16 nExtra; /* Add this many bytes to each in-memory page */
2599 i16 nReserve; /* Number of unused bytes at end of each page */
2600 u32 vfsFlags; /* Flags for sqlite3_vfs.xOpen() */
2601 u32 sectorSize; /* Assumed sector size during rollback */
2602 int pageSize; /* Number of bytes in a page */
2603 Pgno mxPgno; /* Maximum allowed size of the database */
2604 i64 journalSizeLimit; /* Size limit for persistent journal files */
2605 char *zFilename; /* Name of the database file */
2606 char *zJournal; /* Name of the journal file */
2607 int (*xBusyHandler)(void*); /* Function to call when busy */
2608 void *pBusyHandlerArg; /* Context argument for xBusyHandler */
2609 int aStat[3]; /* Total cache hits, misses and writes */
2610 #ifdef SQLITE_TEST
2611 int nRead; /* Database pages read */
2612 #endif
2613 void (*xReiniter)(DbPage*); /* Call this routine when reloading pages */
2614 int (*xGet)(Pager*,Pgno,DbPage**,int); /* Routine to fetch a page */
2615 #ifdef SQLITE_HAS_CODEC
2616 void *(*xCodec)(void*,void*,Pgno,int); /* Routine for en/decoding data */
2617 void (*xCodecSizeChng)(void*,int,int); /* Notify of page size changes */
2618 void (*xCodecFree)(void*); /* Destructor for the codec */
2619 void *pCodec; /* First argument to xCodec... methods */
2620 #endif
2621 char *pTmpSpace; /* Pager.pageSize bytes of space for tmp use */
2622 PCache *pPCache; /* Pointer to page cache object */
2623 #ifndef SQLITE_OMIT_WAL
2624 Wal *pWal; /* Write-ahead log used by "journal_mode=wal" */
2625 char *zWal; /* File name for write-ahead log */
2626 #endif
2627 };
2628
2629 /*
2630 ** Indexes for use with Pager.aStat[]. The Pager.aStat[] array contains
2631 ** the values accessed by passing SQLITE_DBSTATUS_CACHE_HIT, CACHE_MISS
2632 ** or CACHE_WRITE to sqlite3_db_status().
2633 */
2634 #define PAGER_STAT_HIT 0
2635 #define PAGER_STAT_MISS 1
2636 #define PAGER_STAT_WRITE 2
2637
2638 /*
2639 ** The following global variables hold counters used for
2640 ** testing purposes only. These variables do not exist in
2641 ** a non-testing build. These variables are not thread-safe.
2642 */
2643 #ifdef SQLITE_TEST
2644 SQLITE_API int sqlite3_pager_readdb_count = 0; /* Number of full pages read from DB */
2645 SQLITE_API int sqlite3_pager_writedb_count = 0; /* Number of full pages written to DB */
2646 SQLITE_API int sqlite3_pager_writej_count = 0; /* Number of pages written to journal */
2647 # define PAGER_INCR(v) v++
2648 #else
2649 # define PAGER_INCR(v)
2650 #endif
2651
2652
2653
2654 /*
2655 ** Journal files begin with the following magic string. The data
2656 ** was obtained from /dev/random. It is used only as a sanity check.
2657 **
2658 ** Since version 2.8.0, the journal format contains additional sanity
2659 ** checking information. If the power fails while the journal is being
2660 ** written, semi-random garbage data might appear in the journal
2661 ** file after power is restored. If an attempt is then made
2662 ** to roll the journal back, the database could be corrupted. The additional
2663 ** sanity checking data is an attempt to discover the garbage in the
2664 ** journal and ignore it.
2665 **
2666 ** The sanity checking information for the new journal format consists
2667 ** of a 32-bit checksum on each page of data. The checksum covers both
2668 ** the page number and the pPager->pageSize bytes of data for the page.
2669 ** This cksum is initialized to a 32-bit random value that appears in the
2670 ** journal file right after the header. The random initializer is important,
2671 ** because garbage data that appears at the end of a journal is likely
2672 ** data that was once in other files that have now been deleted. If the
2673 ** garbage data came from an obsolete journal file, the checksums might
2674 ** be correct. But by initializing the checksum to a random value which
2675 ** is different for every journal, we minimize that risk.
2676 */
2677 static const unsigned char aJournalMagic[] = {
2678 0xd9, 0xd5, 0x05, 0xf9, 0x20, 0xa1, 0x63, 0xd7,
2679 };
2680
2681 /*
2682 ** The size of each page record in the journal is given by
2683 ** the following macro.
2684 */
2685 #define JOURNAL_PG_SZ(pPager) ((pPager->pageSize) + 8)
2686
2687 /*
2688 ** The journal header size for this pager. This is usually the same
2689 ** size as a single disk sector. See also setSectorSize().
2690 */
2691 #define JOURNAL_HDR_SZ(pPager) (pPager->sectorSize)
2692
2693 /*
2694 ** The macro MEMDB is true if we are dealing with an in-memory database.
2695 ** We do this as a macro so that if the SQLITE_OMIT_MEMORYDB macro is set,
2696 ** the value of MEMDB will be a constant and the compiler will optimize
2697 ** out code that would never execute.
2698 */
2699 #ifdef SQLITE_OMIT_MEMORYDB
2700 # define MEMDB 0
2701 #else
2702 # define MEMDB pPager->memDb
2703 #endif
2704
2705 /*
2706 ** The macro USEFETCH is true if we are allowed to use the xFetch and xUnfetch
2707 ** interfaces to access the database using memory-mapped I/O.
2708 */
2709 #if SQLITE_MAX_MMAP_SIZE>0
2710 # define USEFETCH(x) ((x)->bUseFetch)
2711 #else
2712 # define USEFETCH(x) 0
2713 #endif
2714
2715 /*
2716 ** The maximum legal page number is (2^31 - 1).
2717 */
2718 #define PAGER_MAX_PGNO 2147483647
2719
2720 /*
2721 ** The argument to this macro is a file descriptor (type sqlite3_file*).
2722 ** Return 0 if it is not open, or 1 if it is.
2723 **
2724 ** This is so that expressions can be written as:
2725 **
2726 ** if( isOpen(pPager->jfd) ){ ...
2727 **
2728 ** instead of
2729 **
2730 ** if( pPager->jfd->pMethods ){ ...
2731 */
2732 #define isOpen(pFd) ((pFd)->pMethods!=0)
2733
2734 /*
2735 ** Return true if this pager uses a write-ahead log to read page pgno.
2736 ** Return false if the pager reads pgno directly from the database.
2737 */
2738 #if !defined(SQLITE_OMIT_WAL) && defined(SQLITE_DIRECT_OVERFLOW_READ)
2739 SQLITE_PRIVATE int sqlite3PagerUseWal(Pager *pPager, Pgno pgno){
2740 u32 iRead = 0;
2741 int rc;
2742 if( pPager->pWal==0 ) return 0;
2743 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iRead);
2744 return rc || iRead;
2745 }
2746 #endif
2747 #ifndef SQLITE_OMIT_WAL
2748 # define pagerUseWal(x) ((x)->pWal!=0)
2749 #else
2750 # define pagerUseWal(x) 0
2751 # define pagerRollbackWal(x) 0
2752 # define pagerWalFrames(v,w,x,y) 0
2753 # define pagerOpenWalIfPresent(z) SQLITE_OK
2754 # define pagerBeginReadTransaction(z) SQLITE_OK
2755 #endif
2756
2757 #ifndef NDEBUG
2758 /*
2759 ** Usage:
2760 **
2761 ** assert( assert_pager_state(pPager) );
2762 **
2763 ** This function runs many asserts to try to find inconsistencies in
2764 ** the internal state of the Pager object.
2765 */
2766 static int assert_pager_state(Pager *p){
2767 Pager *pPager = p;
2768
2769 /* State must be valid. */
2770 assert( p->eState==PAGER_OPEN
2771 || p->eState==PAGER_READER
2772 || p->eState==PAGER_WRITER_LOCKED
2773 || p->eState==PAGER_WRITER_CACHEMOD
2774 || p->eState==PAGER_WRITER_DBMOD
2775 || p->eState==PAGER_WRITER_FINISHED
2776 || p->eState==PAGER_ERROR
2777 );
2778
2779 /* Regardless of the current state, a temp-file connection always behaves
2780 ** as if it has an exclusive lock on the database file. It never updates
2781 ** the change-counter field, so the changeCountDone flag is always set.
2782 */
2783 assert( p->tempFile==0 || p->eLock==EXCLUSIVE_LOCK );
2784 assert( p->tempFile==0 || pPager->changeCountDone );
2785
2786 /* If the useJournal flag is clear, the journal-mode must be "OFF".
2787 ** And if the journal-mode is "OFF", the journal file must not be open.
2788 */
2789 assert( p->journalMode==PAGER_JOURNALMODE_OFF || p->useJournal );
2790 assert( p->journalMode!=PAGER_JOURNALMODE_OFF || !isOpen(p->jfd) );
2791
2792 /* Check that MEMDB implies noSync. And an in-memory journal. Since
2793 ** this means an in-memory pager performs no IO at all, it cannot encounter
2794 ** either SQLITE_IOERR or SQLITE_FULL during rollback or while finalizing
2795 ** a journal file. (although the in-memory journal implementation may
2796 ** return SQLITE_IOERR_NOMEM while the journal file is being written). It
2797 ** is therefore not possible for an in-memory pager to enter the ERROR
2798 ** state.
2799 */
2800 if( MEMDB ){
2801 assert( !isOpen(p->fd) );
2802 assert( p->noSync );
2803 assert( p->journalMode==PAGER_JOURNALMODE_OFF
2804 || p->journalMode==PAGER_JOURNALMODE_MEMORY
2805 );
2806 assert( p->eState!=PAGER_ERROR && p->eState!=PAGER_OPEN );
2807 assert( pagerUseWal(p)==0 );
2808 }
2809
2810 /* If changeCountDone is set, a RESERVED lock or greater must be held
2811 ** on the file.
2812 */
2813 assert( pPager->changeCountDone==0 || pPager->eLock>=RESERVED_LOCK );
2814 assert( p->eLock!=PENDING_LOCK );
2815
2816 switch( p->eState ){
2817 case PAGER_OPEN:
2818 assert( !MEMDB );
2819 assert( pPager->errCode==SQLITE_OK );
2820 assert( sqlite3PcacheRefCount(pPager->pPCache)==0 || pPager->tempFile );
2821 break;
2822
2823 case PAGER_READER:
2824 assert( pPager->errCode==SQLITE_OK );
2825 assert( p->eLock!=UNKNOWN_LOCK );
2826 assert( p->eLock>=SHARED_LOCK );
2827 break;
2828
2829 case PAGER_WRITER_LOCKED:
2830 assert( p->eLock!=UNKNOWN_LOCK );
2831 assert( pPager->errCode==SQLITE_OK );
2832 if( !pagerUseWal(pPager) ){
2833 assert( p->eLock>=RESERVED_LOCK );
2834 }
2835 assert( pPager->dbSize==pPager->dbOrigSize );
2836 assert( pPager->dbOrigSize==pPager->dbFileSize );
2837 assert( pPager->dbOrigSize==pPager->dbHintSize );
2838 assert( pPager->setMaster==0 );
2839 break;
2840
2841 case PAGER_WRITER_CACHEMOD:
2842 assert( p->eLock!=UNKNOWN_LOCK );
2843 assert( pPager->errCode==SQLITE_OK );
2844 if( !pagerUseWal(pPager) ){
2845 /* If journal_mode=wal here, it is possible that neither the
2846 ** journal file nor the WAL file is open. This happens during
2847 ** a rollback transaction that switches from journal_mode=off
2848 ** to journal_mode=wal.
2849 */
2850 assert( p->eLock>=RESERVED_LOCK );
2851 assert( isOpen(p->jfd)
2852 || p->journalMode==PAGER_JOURNALMODE_OFF
2853 || p->journalMode==PAGER_JOURNALMODE_WAL
2854 );
2855 }
2856 assert( pPager->dbOrigSize==pPager->dbFileSize );
2857 assert( pPager->dbOrigSize==pPager->dbHintSize );
2858 break;
2859
2860 case PAGER_WRITER_DBMOD:
2861 assert( p->eLock==EXCLUSIVE_LOCK );
2862 assert( pPager->errCode==SQLITE_OK );
2863 assert( !pagerUseWal(pPager) );
2864 assert( p->eLock>=EXCLUSIVE_LOCK );
2865 assert( isOpen(p->jfd)
2866 || p->journalMode==PAGER_JOURNALMODE_OFF
2867 || p->journalMode==PAGER_JOURNALMODE_WAL
2868 );
2869 assert( pPager->dbOrigSize<=pPager->dbHintSize );
2870 break;
2871
2872 case PAGER_WRITER_FINISHED:
2873 assert( p->eLock==EXCLUSIVE_LOCK );
2874 assert( pPager->errCode==SQLITE_OK );
2875 assert( !pagerUseWal(pPager) );
2876 assert( isOpen(p->jfd)
2877 || p->journalMode==PAGER_JOURNALMODE_OFF
2878 || p->journalMode==PAGER_JOURNALMODE_WAL
2879 );
2880 break;
2881
2882 case PAGER_ERROR:
2883 /* There must be at least one outstanding reference to the pager if
2884 ** in ERROR state. Otherwise the pager should have already dropped
2885 ** back to OPEN state.
2886 */
2887 assert( pPager->errCode!=SQLITE_OK );
2888 assert( sqlite3PcacheRefCount(pPager->pPCache)>0 || pPager->tempFile );
2889 break;
2890 }
2891
2892 return 1;
2893 }
2894 #endif /* ifndef NDEBUG */
2895
2896 #ifdef SQLITE_DEBUG
2897 /*
2898 ** Return a pointer to a human readable string in a static buffer
2899 ** containing the state of the Pager object passed as an argument. This
2900 ** is intended to be used within debuggers. For example, as an alternative
2901 ** to "print *pPager" in gdb:
2902 **
2903 ** (gdb) printf "%s", print_pager_state(pPager)
2904 */
2905 static char *print_pager_state(Pager *p){
2906 static char zRet[1024];
2907
2908 sqlite3_snprintf(1024, zRet,
2909 "Filename: %s\n"
2910 "State: %s errCode=%d\n"
2911 "Lock: %s\n"
2912 "Locking mode: locking_mode=%s\n"
2913 "Journal mode: journal_mode=%s\n"
2914 "Backing store: tempFile=%d memDb=%d useJournal=%d\n"
2915 "Journal: journalOff=%lld journalHdr=%lld\n"
2916 "Size: dbsize=%d dbOrigSize=%d dbFileSize=%d\n"
2917 , p->zFilename
2918 , p->eState==PAGER_OPEN ? "OPEN" :
2919 p->eState==PAGER_READER ? "READER" :
2920 p->eState==PAGER_WRITER_LOCKED ? "WRITER_LOCKED" :
2921 p->eState==PAGER_WRITER_CACHEMOD ? "WRITER_CACHEMOD" :
2922 p->eState==PAGER_WRITER_DBMOD ? "WRITER_DBMOD" :
2923 p->eState==PAGER_WRITER_FINISHED ? "WRITER_FINISHED" :
2924 p->eState==PAGER_ERROR ? "ERROR" : "?error?"
2925 , (int)p->errCode
2926 , p->eLock==NO_LOCK ? "NO_LOCK" :
2927 p->eLock==RESERVED_LOCK ? "RESERVED" :
2928 p->eLock==EXCLUSIVE_LOCK ? "EXCLUSIVE" :
2929 p->eLock==SHARED_LOCK ? "SHARED" :
2930 p->eLock==UNKNOWN_LOCK ? "UNKNOWN" : "?error?"
2931 , p->exclusiveMode ? "exclusive" : "normal"
2932 , p->journalMode==PAGER_JOURNALMODE_MEMORY ? "memory" :
2933 p->journalMode==PAGER_JOURNALMODE_OFF ? "off" :
2934 p->journalMode==PAGER_JOURNALMODE_DELETE ? "delete" :
2935 p->journalMode==PAGER_JOURNALMODE_PERSIST ? "persist" :
2936 p->journalMode==PAGER_JOURNALMODE_TRUNCATE ? "truncate" :
2937 p->journalMode==PAGER_JOURNALMODE_WAL ? "wal" : "?error?"
2938 , (int)p->tempFile, (int)p->memDb, (int)p->useJournal
2939 , p->journalOff, p->journalHdr
2940 , (int)p->dbSize, (int)p->dbOrigSize, (int)p->dbFileSize
2941 );
2942
2943 return zRet;
2944 }
2945 #endif
2946
2947 /* Forward references to the various page getters */
2948 static int getPageNormal(Pager*,Pgno,DbPage**,int);
2949 static int getPageError(Pager*,Pgno,DbPage**,int);
2950 #if SQLITE_MAX_MMAP_SIZE>0
2951 static int getPageMMap(Pager*,Pgno,DbPage**,int);
2952 #endif
2953
2954 /*
2955 ** Set the Pager.xGet method for the appropriate routine used to fetch
2956 ** content from the pager.
2957 */
2958 static void setGetterMethod(Pager *pPager){
2959 if( pPager->errCode ){
2960 pPager->xGet = getPageError;
2961 #if SQLITE_MAX_MMAP_SIZE>0
2962 }else if( USEFETCH(pPager)
2963 #ifdef SQLITE_HAS_CODEC
2964 && pPager->xCodec==0
2965 #endif
2966 ){
2967 pPager->xGet = getPageMMap;
2968 #endif /* SQLITE_MAX_MMAP_SIZE>0 */
2969 }else{
2970 pPager->xGet = getPageNormal;
2971 }
2972 }
2973
2974 /*
2975 ** Return true if it is necessary to write page *pPg into the sub-journal.
2976 ** A page needs to be written into the sub-journal if there exists one
2977 ** or more open savepoints for which:
2978 **
2979 ** * The page-number is less than or equal to PagerSavepoint.nOrig, and
2980 ** * The bit corresponding to the page-number is not set in
2981 ** PagerSavepoint.pInSavepoint.
2982 */
2983 static int subjRequiresPage(PgHdr *pPg){
2984 Pager *pPager = pPg->pPager;
2985 PagerSavepoint *p;
2986 Pgno pgno = pPg->pgno;
2987 int i;
2988 for(i=0; i<pPager->nSavepoint; i++){
2989 p = &pPager->aSavepoint[i];
2990 if( p->nOrig>=pgno && 0==sqlite3BitvecTestNotNull(p->pInSavepoint, pgno) ){
2991 return 1;
2992 }
2993 }
2994 return 0;
2995 }
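/*
** Illustration only, not part of the SQLite sources. A minimal sketch of
** the two-part test performed by subjRequiresPage(), assuming a plain byte
** array in place of the Bitvec object. The names DemoSavepoint and
** demoSubjRequiresPage() are hypothetical and exist only for this example.
*/
```c
/* Hypothetical stand-in for PagerSavepoint. nOrig is the number of pages
** that existed when the savepoint was opened. aSaved[pgno] is non-zero
** once page pgno has already been saved on behalf of this savepoint. */
typedef struct DemoSavepoint {
  unsigned nOrig;                /* Database size when savepoint opened */
  const unsigned char *aSaved;   /* Indexed by page number, 1..nOrig */
} DemoSavepoint;

/* Return 1 if page pgno must be written to the sub-journal, that is, if
** at least one of the n open savepoints covers the page and has not yet
** saved it. Return 0 otherwise. */
int demoSubjRequiresPage(const DemoSavepoint *a, int n, unsigned pgno){
  int i;
  for(i=0; i<n; i++){
    /* The size test comes first so aSaved[] is never indexed past nOrig */
    if( a[i].nOrig>=pgno && a[i].aSaved[pgno]==0 ) return 1;
  }
  return 0;
}
```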
2996
2997 #ifdef SQLITE_DEBUG
2998 /*
2999 ** Return true if the page is already in the journal file.
3000 */
3001 static int pageInJournal(Pager *pPager, PgHdr *pPg){
3002 return sqlite3BitvecTest(pPager->pInJournal, pPg->pgno);
3003 }
3004 #endif
3005
3006 /*
3007 ** Read a 32-bit integer from the given file descriptor. Store the integer
3008 ** that is read in *pRes. Return SQLITE_OK if everything worked, or an
3009 ** error code if something goes wrong.
3010 **
3011 ** All values are stored on disk as big-endian.
3012 */
3013 static int read32bits(sqlite3_file *fd, i64 offset, u32 *pRes){
3014 unsigned char ac[4];
3015 int rc = sqlite3OsRead(fd, ac, sizeof(ac), offset);
3016 if( rc==SQLITE_OK ){
3017 *pRes = sqlite3Get4byte(ac);
3018 }
3019 return rc;
3020 }
3021
3022 /*
3023 ** Write a 32-bit integer into a string buffer in big-endian byte order.
3024 */
3025 #define put32bits(A,B) sqlite3Put4byte((u8*)A,B)
3026
3027
3028 /*
3029 ** Write a 32-bit integer into the given file descriptor. Return SQLITE_OK
3030 ** on success or an error code if something goes wrong.
3031 */
3032 static int write32bits(sqlite3_file *fd, i64 offset, u32 val){
3033 char ac[4];
3034 put32bits(ac, val);
3035 return sqlite3OsWrite(fd, ac, 4, offset);
3036 }
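/*
** Illustration only, not part of the SQLite sources. read32bits() and
** write32bits() rely on sqlite3Get4byte() and sqlite3Put4byte(), which are
** defined elsewhere in the amalgamation. The sketch below shows what
** big-endian encoding of a 32-bit value means in practice. The names
** demoPut4byte() and demoGet4byte() are hypothetical stand-ins, not the
** real routines.
*/
```c
#include <stdint.h>

/* Store v into p[0..3] with the most significant byte first. */
void demoPut4byte(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Decode four big-endian bytes back into a 32-bit value. */
uint32_t demoGet4byte(const unsigned char *p){
  return ((uint32_t)p[0]<<24)
       | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)
       |  (uint32_t)p[3];
}
```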
3037
3038 /*
3039 ** Unlock the database file to level eLock, which must be either NO_LOCK
3040 ** or SHARED_LOCK. Regardless of whether or not the call to xUnlock()
3041 ** succeeds, set the Pager.eLock variable to match the (attempted) new lock.
3042 **
3043 ** Except, if Pager.eLock is set to UNKNOWN_LOCK when this function is
3044 ** called, do not modify it. See the comment above the #define of
3045 ** UNKNOWN_LOCK for an explanation of this.
3046 */
3047 static int pagerUnlockDb(Pager *pPager, int eLock){
3048 int rc = SQLITE_OK;
3049
3050 assert( !pPager->exclusiveMode || pPager->eLock==eLock );
3051 assert( eLock==NO_LOCK || eLock==SHARED_LOCK );
3052 assert( eLock!=NO_LOCK || pagerUseWal(pPager)==0 );
3053 if( isOpen(pPager->fd) ){
3054 assert( pPager->eLock>=eLock );
3055 rc = pPager->noLock ? SQLITE_OK : sqlite3OsUnlock(pPager->fd, eLock);
3056 if( pPager->eLock!=UNKNOWN_LOCK ){
3057 pPager->eLock = (u8)eLock;
3058 }
3059 IOTRACE(("UNLOCK %p %d\n", pPager, eLock))
3060 }
3061 return rc;
3062 }
3063
3064 /*
3065 ** Lock the database file to level eLock, which must be either SHARED_LOCK,
3066 ** RESERVED_LOCK or EXCLUSIVE_LOCK. If the call is successful, set the
3067 ** Pager.eLock variable to the new locking state.
3068 **
3069 ** Except, if Pager.eLock is set to UNKNOWN_LOCK when this function is
3070 ** called, do not modify it unless the new locking state is EXCLUSIVE_LOCK.
3071 ** See the comment above the #define of UNKNOWN_LOCK for an explanation
3072 ** of this.
3073 */
3074 static int pagerLockDb(Pager *pPager, int eLock){
3075 int rc = SQLITE_OK;
3076
3077 assert( eLock==SHARED_LOCK || eLock==RESERVED_LOCK || eLock==EXCLUSIVE_LOCK );
3078 if( pPager->eLock<eLock || pPager->eLock==UNKNOWN_LOCK ){
3079 rc = pPager->noLock ? SQLITE_OK : sqlite3OsLock(pPager->fd, eLock);
3080 if( rc==SQLITE_OK && (pPager->eLock!=UNKNOWN_LOCK||eLock==EXCLUSIVE_LOCK) ){
3081 pPager->eLock = (u8)eLock;
3082 IOTRACE(("LOCK %p %d\n", pPager, eLock))
3083 }
3084 }
3085 return rc;
3086 }
3087
3088 /*
3089 ** This function determines whether or not the atomic-write optimization
3090 ** can be used with this pager. The optimization can be used if:
3091 **
3092 ** (a) the value returned by OsDeviceCharacteristics() indicates that
3093 ** a database page may be written atomically, and
3094 ** (b) the value returned by OsSectorSize() is less than or equal
3095 ** to the page size.
3096 **
3097 ** The optimization is also always enabled for temporary files. It is
3098 ** an error to call this function if pPager is opened on an in-memory
3099 ** database.
3100 **
3101 ** If the optimization cannot be used, 0 is returned. If it can be used,
3102 ** then the value returned is the size of the journal file when it
3103 ** contains rollback data for exactly one page.
3104 */
3105 #ifdef SQLITE_ENABLE_ATOMIC_WRITE
3106 static int jrnlBufferSize(Pager *pPager){
3107 assert( !MEMDB );
3108 if( !pPager->tempFile ){
3109 int dc; /* Device characteristics */
3110 int nSector; /* Sector size */
3111 int szPage; /* Page size */
3112
3113 assert( isOpen(pPager->fd) );
3114 dc = sqlite3OsDeviceCharacteristics(pPager->fd);
3115 nSector = pPager->sectorSize;
3116 szPage = pPager->pageSize;
3117
3118 assert(SQLITE_IOCAP_ATOMIC512==(512>>8));
3119 assert(SQLITE_IOCAP_ATOMIC64K==(65536>>8));
3120 if( 0==(dc&(SQLITE_IOCAP_ATOMIC|(szPage>>8)) || nSector>szPage) ){
3121 return 0;
3122 }
3123 }
3124
3125 return JOURNAL_HDR_SZ(pPager) + JOURNAL_PG_SZ(pPager);
3126 }
3127 #else
3128 # define jrnlBufferSize(x) 0
3129 #endif
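/*
** Illustration only, not part of the SQLite sources. A sketch of
** conditions (a) and (b) as they are worded in the comment above
** jrnlBufferSize(). It is not a transcription of the expression used in
** that function. The DEMO_IOCAP_* values mirror the SQLITE_IOCAP_*
** constants from sqlite3.h. demoCanAtomicWrite() is a hypothetical name.
** The trick is that each SQLITE_IOCAP_ATOMICnnn bit equals (nnn>>8), so
** shifting the page size right by 8 selects the matching capability bit.
*/
```c
#define DEMO_IOCAP_ATOMIC     0x00000001  /* Any write is atomic */
#define DEMO_IOCAP_ATOMIC512  0x00000002  /* == 512>>8 */
#define DEMO_IOCAP_ATOMIC4K   0x00000010  /* == 4096>>8 */
#define DEMO_IOCAP_ATOMIC64K  0x00000100  /* == 65536>>8 */

/* Return 1 if a page of szPage bytes can be written atomically on a
** device with characteristics dc and sector size nSector, else 0. */
int demoCanAtomicWrite(int dc, int nSector, int szPage){
  /* (a) device is fully atomic, or atomic for exactly this page size */
  int atomicPage = (dc & (DEMO_IOCAP_ATOMIC | (szPage>>8)))!=0;
  /* (b) the sector is no larger than the page */
  return atomicPage && nSector<=szPage;
}
```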
3130
3131 /*
3132 ** If SQLITE_CHECK_PAGES is defined then we do some sanity checking
3133 ** on the cache using a hash function. This is used for testing
3134 ** and debugging only.
3135 */
3136 #ifdef SQLITE_CHECK_PAGES
3137 /*
3138 ** Return a 32-bit hash of the page data for pPage.
3139 */
3140 static u32 pager_datahash(int nByte, unsigned char *pData){
3141 u32 hash = 0;
3142 int i;
3143 for(i=0; i<nByte; i++){
3144 hash = (hash*1039) + pData[i];
3145 }
3146 return hash;
3147 }
3148 static u32 pager_pagehash(PgHdr *pPage){
3149 return pager_datahash(pPage->pPager->pageSize, (unsigned char *)pPage->pData);
3150 }
3151 static void pager_set_pagehash(PgHdr *pPage){
3152 pPage->pageHash = pager_pagehash(pPage);
3153 }
3154
3155 /*
3156 ** The CHECK_PAGE macro takes a PgHdr* as an argument. If SQLITE_CHECK_PAGES
3157 ** is defined, and NDEBUG is not defined, an assert() statement checks
3158 ** that the page is either dirty or still matches the calculated page-hash.
3159 */
3160 #define CHECK_PAGE(x) checkPage(x)
3161 static void checkPage(PgHdr *pPg){
3162 Pager *pPager = pPg->pPager;
3163 assert( pPager->eState!=PAGER_ERROR );
3164 assert( (pPg->flags&PGHDR_DIRTY) || pPg->pageHash==pager_pagehash(pPg) );
3165 }
3166
3167 #else
3168 #define pager_datahash(X,Y) 0
3169 #define pager_pagehash(X) 0
3170 #define pager_set_pagehash(X)
3171 #define CHECK_PAGE(x)
3172 #endif /* SQLITE_CHECK_PAGES */
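/*
** Illustration only, not part of the SQLite sources. A standalone sketch
** of the SQLITE_CHECK_PAGES idea: remember a hash of a clean page and
** later verify that the page is either marked dirty or still hashes to
** the same value. The hash loop copies pager_datahash() above. The names
** demoDataHash() and demoPageIsConsistent() are hypothetical.
*/
```c
#include <stdint.h>

/* Multiplicative hash over nByte bytes, wrapping modulo 2^32. */
uint32_t demoDataHash(int nByte, const unsigned char *pData){
  uint32_t hash = 0;
  int i;
  for(i=0; i<nByte; i++){
    hash = (hash*1039) + pData[i];
  }
  return hash;
}

/* Return 1 if the page is dirty or its content still matches savedHash. */
int demoPageIsConsistent(
  int isDirty,                   /* Non-zero if the page is flagged dirty */
  uint32_t savedHash,            /* Hash recorded while the page was clean */
  int nByte,                     /* Size of the page in bytes */
  const unsigned char *pData     /* Current page content */
){
  return isDirty || savedHash==demoDataHash(nByte, pData);
}
```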
3173
3174 /*
3175 ** When this is called the journal file for pager pPager must be open.
3176 ** This function attempts to read a master journal file name from the
3177 ** end of the file and, if successful, copies it into memory supplied
3178 ** by the caller. See comments above writeMasterJournal() for the format
3179 ** used to store a master journal file name at the end of a journal file.
3180 **
3181 ** zMaster must point to a buffer of at least nMaster bytes allocated by
3182 ** the caller. This should be sqlite3_vfs.mxPathname+1 (to ensure there is
3183 ** enough space to write the master journal name). If the master journal
3184 ** name in the journal is longer than nMaster bytes (including a
3185 ** nul-terminator), then this is handled as if no master journal name
3186 ** were present in the journal.
3187 **
3188 ** If a master journal file name is present at the end of the journal
3189 ** file, then it is copied into the buffer pointed to by zMaster. A
3190 ** nul-terminator byte is appended to the buffer following the master
3191 ** journal file name.
3192 **
3193 ** If it is determined that no master journal file name is present
3194 ** zMaster[0] is set to 0 and SQLITE_OK returned.
3195 **
3196 ** If an error occurs while reading from the journal file, an SQLite
3197 ** error code is returned.
3198 */
3199 static int readMasterJournal(sqlite3_file *pJrnl, char *zMaster, u32 nMaster){
3200 int rc; /* Return code */
3201 u32 len; /* Length in bytes of master journal name */
3202 i64 szJ; /* Total size in bytes of journal file pJrnl */
3203 u32 cksum; /* MJ checksum value read from journal */
3204 u32 u; /* Unsigned loop counter */
3205 unsigned char aMagic[8]; /* A buffer to hold the magic header */
3206 zMaster[0] = '\0';
3207
3208 if( SQLITE_OK!=(rc = sqlite3OsFileSize(pJrnl, &szJ))
3209 || szJ<16
3210 || SQLITE_OK!=(rc = read32bits(pJrnl, szJ-16, &len))
3211 || len>=nMaster
3212 || len==0
3213 || SQLITE_OK!=(rc = read32bits(pJrnl, szJ-12, &cksum))
3214 || SQLITE_OK!=(rc = sqlite3OsRead(pJrnl, aMagic, 8, szJ-8))
3215 || memcmp(aMagic, aJournalMagic, 8)
3216 || SQLITE_OK!=(rc = sqlite3OsRead(pJrnl, zMaster, len, szJ-16-len))
3217 ){
3218 return rc;
3219 }
3220
3221 /* See if the checksum matches the master journal name */
3222 for(u=0; u<len; u++){
3223 cksum -= zMaster[u];
3224 }
3225 if( cksum ){
3226 /* If the checksum doesn't add up, then one or more of the disk sectors
3227 ** containing the master journal filename is corrupted. This means
3228 ** definitely roll back, so just return SQLITE_OK and report a (nul)
3229 ** master-journal filename.
3230 */
3231 len = 0;
3232 }
3233 zMaster[len] = '\0';
3234
3235 return SQLITE_OK;
3236 }
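/*
** Illustration only, not part of the SQLite sources. The verification
** loop in readMasterJournal() subtracts every byte of the name from the
** stored checksum and expects zero. That implies the stored value is the
** sum of the name's bytes, which is the assumption made here. The names
** demoNameCksum() and demoNameCksumOk() are hypothetical.
*/
```c
#include <stdint.h>
#include <string.h>

/* Sum of the first len bytes of zName, wrapping modulo 2^32. */
uint32_t demoNameCksum(const char *zName, uint32_t len){
  uint32_t cksum = 0;
  uint32_t u;
  for(u=0; u<len; u++) cksum += zName[u];
  return cksum;
}

/* Mirror of the reader's check: subtract each byte from the stored
** checksum. A result of zero means the name matches the checksum. */
int demoNameCksumOk(const char *zName, uint32_t len, uint32_t cksum){
  uint32_t u;
  for(u=0; u<len; u++) cksum -= zName[u];
  return cksum==0;
}
```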
3237
3238 /*
3239 ** Return the offset of the sector boundary at or immediately
3240 ** following the value in pPager->journalOff, assuming a sector
3241 ** size of pPager->sectorSize bytes.
3242 **
3243 ** e.g. for a sector size of 512:
3244 **
3245 ** Pager.journalOff Return value
3246 ** ---------------------------------------
3247 ** 0 0
3248 ** 512 512
3249 ** 100 512
3250 ** 2000 2048
3251 **
3252 */
3253 static i64 journalHdrOffset(Pager *pPager){
3254 i64 offset = 0;
3255 i64 c = pPager->journalOff;
3256 if( c ){
3257 offset = ((c-1)/JOURNAL_HDR_SZ(pPager) + 1) * JOURNAL_HDR_SZ(pPager);
3258 }
3259 assert( offset%JOURNAL_HDR_SZ(pPager)==0 );
3260 assert( offset>=c );
3261 assert( (offset-c)<JOURNAL_HDR_SZ(pPager) );
3262 return offset;
3263 }
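/*
** Illustration only, not part of the SQLite sources. The rounding done by
** journalHdrOffset() can be tried in isolation with the sketch below,
** which takes the offset and sector size as plain arguments instead of
** reading them from a Pager. demoNextSectorBoundary() is a hypothetical
** name. For a sector size of 512 it reproduces the table in the comment
** above: 0 -> 0, 512 -> 512, 100 -> 512, 2000 -> 2048.
*/
```c
#include <stdint.h>

/* Return the smallest multiple of sectorSize that is >= off. */
int64_t demoNextSectorBoundary(int64_t off, int64_t sectorSize){
  if( off==0 ) return 0;
  /* (off-1)/sectorSize counts the complete sectors strictly before the
  ** last byte written, so adding 1 moves to the boundary at or after off. */
  return ((off-1)/sectorSize + 1) * sectorSize;
}
```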
3264
3265 /*
3266 ** The journal file must be open when this function is called.
3267 **
3268 ** This function is a no-op if the journal file has not been written to
3269 ** within the current transaction (i.e. if Pager.journalOff==0).
3270 **
3271 ** If doTruncate is non-zero or the Pager.journalSizeLimit variable is
3272 ** set to 0, then truncate the journal file to zero bytes in size. Otherwise,
3273 ** zero the 28-byte header at the start of the journal file. In either case,
3274 ** if the pager is not in no-sync mode, sync the journal file immediately
3275 ** after writing or truncating it.
3276 **
3277 ** If Pager.journalSizeLimit is set to a positive, non-zero value, and
3278 ** following the truncation or zeroing described above the size of the
3279 ** journal file in bytes is larger than this value, then truncate the
3280 ** journal file to Pager.journalSizeLimit bytes. The journal file does
3281 ** not need to be synced following this operation.
3282 **
3283 ** If an IO error occurs, abandon processing and return the IO error code.
3284 ** Otherwise, return SQLITE_OK.
3285 */
3286 static int zeroJournalHdr(Pager *pPager, int doTruncate){
3287 int rc = SQLITE_OK; /* Return code */
3288 assert( isOpen(pPager->jfd) );
3289 assert( !sqlite3JournalIsInMemory(pPager->jfd) );
3290 if( pPager->journalOff ){
3291 const i64 iLimit = pPager->journalSizeLimit; /* Local cache of jsl */
3292
3293 IOTRACE(("JZEROHDR %p\n", pPager))
3294 if( doTruncate || iLimit==0 ){
3295 rc = sqlite3OsTruncate(pPager->jfd, 0);
3296 }else{
3297 static const char zeroHdr[28] = {0};
3298 rc = sqlite3OsWrite(pPager->jfd, zeroHdr, sizeof(zeroHdr), 0);
3299 }
3300 if( rc==SQLITE_OK && !pPager->noSync ){
3301 rc = sqlite3OsSync(pPager->jfd, SQLITE_SYNC_DATAONLY|pPager->syncFlags);
3302 }
3303
3304 /* At this point the transaction is committed but the write lock
3305 ** is still held on the file. If there is a size limit configured for
3306 ** the persistent journal and the journal file currently consumes more
3307 ** space than that limit allows for, truncate it now. There is no need
3308 ** to sync the file following this operation.
3309 */
3310 if( rc==SQLITE_OK && iLimit>0 ){
3311 i64 sz;
3312 rc = sqlite3OsFileSize(pPager->jfd, &sz);
3313 if( rc==SQLITE_OK && sz>iLimit ){
3314 rc = sqlite3OsTruncate(pPager->jfd, iLimit);
3315 }
3316 }
3317 }
3318 return rc;
3319 }
3320
3321 /*
3322 ** The journal file must be open when this routine is called. A journal
3323 ** header (JOURNAL_HDR_SZ bytes) is written into the journal file at the
3324 ** current location.
3325 **
3326 ** The format for the journal header is as follows:
3327 ** - 8 bytes: Magic identifying journal format.
3328 ** - 4 bytes: Number of records in journal, or -1 if no-sync mode is on.
3329 ** - 4 bytes: Random number used for page hash.
3330 ** - 4 bytes: Initial database page count.
3331 ** - 4 bytes: Sector size used by the process that wrote this journal.
3332 ** - 4 bytes: Database page size.
3333 **
3334 ** Followed by (JOURNAL_HDR_SZ - 28) bytes of unused space.
3335 */
3336 static int writeJournalHdr(Pager *pPager){
3337 int rc = SQLITE_OK; /* Return code */
3338 char *zHeader = pPager->pTmpSpace; /* Temporary space used to build header */
3339 u32 nHeader = (u32)pPager->pageSize;/* Size of buffer pointed to by zHeader */
3340 u32 nWrite; /* Bytes of header sector written */
3341 int ii; /* Loop counter */
3342
3343 assert( isOpen(pPager->jfd) ); /* Journal file must be open. */
3344
3345 if( nHeader>JOURNAL_HDR_SZ(pPager) ){
3346 nHeader = JOURNAL_HDR_SZ(pPager);
3347 }
3348
3349 /* If there are active savepoints and any of them were created
3350 ** since the most recent journal header was written, update the
3351 ** PagerSavepoint.iHdrOffset fields now.
3352 */
3353 for(ii=0; ii<pPager->nSavepoint; ii++){
3354 if( pPager->aSavepoint[ii].iHdrOffset==0 ){
3355 pPager->aSavepoint[ii].iHdrOffset = pPager->journalOff;
3356 }
3357 }
3358
3359 pPager->journalHdr = pPager->journalOff = journalHdrOffset(pPager);
3360
3361 /*
3362 ** Write the nRec Field - the number of page records that follow this
3363 ** journal header. Normally, zero is written to this value at this time.
3364 ** After the records are added to the journal (and the journal synced,
3365 ** if in full-sync mode), the zero is overwritten with the true number
3366 ** of records (see syncJournal()).
3367 **
3368 ** A faster alternative is to write 0xFFFFFFFF to the nRec field. When
3369 ** reading the journal this value tells SQLite to assume that the
3370 ** rest of the journal file contains valid page records. This assumption
3371 ** is dangerous, as if a failure occurred whilst writing to the journal
3372 ** file it may contain some garbage data. There are two scenarios
3373 ** where this risk can be ignored:
3374 **
3375 ** * When the pager is in no-sync mode. Corruption can follow a
3376 ** power failure in this case anyway.
3377 **
3378 ** * When the SQLITE_IOCAP_SAFE_APPEND flag is set. This guarantees
3379 ** that garbage data is never appended to the journal file.
3380 */
3381 assert( isOpen(pPager->fd) || pPager->noSync );
3382 if( pPager->noSync || (pPager->journalMode==PAGER_JOURNALMODE_MEMORY)
3383 || (sqlite3OsDeviceCharacteristics(pPager->fd)&SQLITE_IOCAP_SAFE_APPEND)
3384 ){
3385 memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic));
3386 put32bits(&zHeader[sizeof(aJournalMagic)], 0xffffffff);
3387 }else{
3388 memset(zHeader, 0, sizeof(aJournalMagic)+4);
3389 }
3390
3391 /* The random check-hash initializer */
3392 sqlite3_randomness(sizeof(pPager->cksumInit), &pPager->cksumInit);
3393 put32bits(&zHeader[sizeof(aJournalMagic)+4], pPager->cksumInit);
3394 /* The initial database size */
3395 put32bits(&zHeader[sizeof(aJournalMagic)+8], pPager->dbOrigSize);
3396 /* The assumed sector size for this process */
3397 put32bits(&zHeader[sizeof(aJournalMagic)+12], pPager->sectorSize);
3398
3399 /* The page size */
3400 put32bits(&zHeader[sizeof(aJournalMagic)+16], pPager->pageSize);
3401
3402 /* Initializing the tail of the buffer is not necessary. Everything
3403 ** works fine if the following memset() is omitted. But initializing
3404 ** the memory prevents valgrind from complaining, so we are willing to
3405 ** take the performance hit.
3406 */
3407 memset(&zHeader[sizeof(aJournalMagic)+20], 0,
3408 nHeader-(sizeof(aJournalMagic)+20));
3409
3410 /* In theory, it is only necessary to write the 28 bytes that the
3411 ** journal header consumes to the journal file here. Then increment the
3412 ** Pager.journalOff variable by JOURNAL_HDR_SZ so that the next
3413 ** record is written to the following sector (leaving a gap in the file
3414 ** that will be implicitly filled in by the OS).
3415 **
3416 ** However it has been discovered that on some systems this pattern can
3417 ** be significantly slower than contiguously writing data to the file,
3418 ** even if that means explicitly writing data to the block of
3419 ** (JOURNAL_HDR_SZ - 28) bytes that will not be used. So that is what
3420 ** is done.
3421 **
3422 ** The loop is required here in case the sector-size is larger than the
3423 ** database page size. Since the zHeader buffer is only Pager.pageSize
3424 ** bytes in size, more than one call to sqlite3OsWrite() may be required
3425 ** to populate the entire journal header sector.
3426 */
3427 for(nWrite=0; rc==SQLITE_OK&&nWrite<JOURNAL_HDR_SZ(pPager); nWrite+=nHeader){
3428 IOTRACE(("JHDR %p %lld %d\n", pPager, pPager->journalHdr, nHeader))
3429 rc = sqlite3OsWrite(pPager->jfd, zHeader, nHeader, pPager->journalOff);
3430 assert( pPager->journalHdr <= pPager->journalOff );
3431 pPager->journalOff += nHeader;
3432 }
3433
3434 return rc;
3435 }
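Note: the 28-byte header layout described above writeJournalHdr() can be sketched without a Pager. The helper names below (put_be32, build_journal_hdr) are hypothetical, the magic bytes are passed in by the caller rather than taken from aJournalMagic, and the sketch assumes put32bits() stores values big-endian, as the rest of the SQLite file format does.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value big-endian (assumed to match put32bits()). */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Fill the first 28 bytes of aHdr with a journal header.
** If nRecUnknown is non-zero, the magic and an nRec of 0xffffffff are
** written at once (the no-sync / in-memory / safe-append case).
** Otherwise the magic and nRec are left zeroed, to be overwritten
** later when the journal is synced. */
static void build_journal_hdr(
  unsigned char *aHdr,          /* OUT: at least 28 bytes */
  const unsigned char *aMagic,  /* 8 magic bytes supplied by caller */
  int nRecUnknown,
  uint32_t cksumInit,
  uint32_t dbOrigSize,
  uint32_t sectorSize,
  uint32_t pageSize
){
  if( nRecUnknown ){
    memcpy(aHdr, aMagic, 8);
    put_be32(&aHdr[8], 0xffffffff);
  }else{
    memset(aHdr, 0, 12);
  }
  put_be32(&aHdr[12], cksumInit);   /* random check-hash initializer */
  put_be32(&aHdr[16], dbOrigSize);  /* initial database size in pages */
  put_be32(&aHdr[20], sectorSize);  /* sector size assumed by writer */
  put_be32(&aHdr[24], pageSize);    /* database page size */
}
```

The real function additionally pads the header out to JOURNAL_HDR_SZ bytes and writes it in nHeader-sized pieces; the sketch covers only the field layout.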
3436
3437 /*
3438 ** The journal file must be open when this is called. A journal header
3439 ** (JOURNAL_HDR_SZ bytes) is read from the current location in the journal
3440 ** file. The current location in the journal file is given by
3441 ** pPager->journalOff. See comments above function writeJournalHdr() for
3442 ** a description of the journal header format.
3443 **
3444 ** If the header is read successfully, *pNRec is set to the number of
3445 ** page records following this header and *pDbSize is set to the size of the
3446 ** database before the transaction began, in pages. Also, pPager->cksumInit
3447 ** is set to the value read from the journal header. SQLITE_OK is returned
3448 ** in this case.
3449 **
3450 ** If the journal header appears to be corrupted, SQLITE_DONE is
3451 ** returned and *pNRec and *pDbSize are undefined. If JOURNAL_HDR_SZ bytes
3452 ** cannot be read from the journal file an error code is returned.
3453 */
3454 static int readJournalHdr(
3455 Pager *pPager, /* Pager object */
3456 int isHot,
3457 i64 journalSize, /* Size of the open journal file in bytes */
3458 u32 *pNRec, /* OUT: Value read from the nRec field */
3459 u32 *pDbSize /* OUT: Value of original database size field */
3460 ){
3461 int rc; /* Return code */
3462 unsigned char aMagic[8]; /* A buffer to hold the magic header */
3463 i64 iHdrOff; /* Offset of journal header being read */
3464
3465 assert( isOpen(pPager->jfd) ); /* Journal file must be open. */
3466
3467 /* Advance Pager.journalOff to the start of the next sector. If the
3468 ** journal file is too small for there to be a header stored at this
3469 ** point, return SQLITE_DONE.
3470 */
3471 pPager->journalOff = journalHdrOffset(pPager);
3472 if( pPager->journalOff+JOURNAL_HDR_SZ(pPager) > journalSize ){
3473 return SQLITE_DONE;
3474 }
3475 iHdrOff = pPager->journalOff;
3476
3477 /* Read in the first 8 bytes of the journal header. If they do not match
3478 ** the magic string found at the start of each journal header, return
3479 ** SQLITE_DONE. If an IO error occurs, return an error code. Otherwise,
3480 ** proceed.
3481 */
3482 if( isHot || iHdrOff!=pPager->journalHdr ){
3483 rc = sqlite3OsRead(pPager->jfd, aMagic, sizeof(aMagic), iHdrOff);
3484 if( rc ){
3485 return rc;
3486 }
3487 if( memcmp(aMagic, aJournalMagic, sizeof(aMagic))!=0 ){
3488 return SQLITE_DONE;
3489 }
3490 }
3491
3492 /* Read the first three 32-bit fields of the journal header: The nRec
3493 ** field, the checksum-initializer and the database size at the start
3494 ** of the transaction. Return an error code if anything goes wrong.
3495 */
3496 if( SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+8, pNRec))
3497 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+12, &pPager->cksumInit))
3498 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+16, pDbSize))
3499 ){
3500 return rc;
3501 }
3502
3503 if( pPager->journalOff==0 ){
3504 u32 iPageSize; /* Page-size field of journal header */
3505 u32 iSectorSize; /* Sector-size field of journal header */
3506
3507 /* Read the page-size and sector-size journal header fields. */
3508 if( SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+20, &iSectorSize))
3509 || SQLITE_OK!=(rc = read32bits(pPager->jfd, iHdrOff+24, &iPageSize))
3510 ){
3511 return rc;
3512 }
3513
3514 /* Versions of SQLite prior to 3.5.8 set the page-size field of the
3515 ** journal header to zero. In this case, assume that the Pager.pageSize
3516 ** variable is already set to the correct page size.
3517 */
3518 if( iPageSize==0 ){
3519 iPageSize = pPager->pageSize;
3520 }
3521
3522 /* Check that the values read from the page-size and sector-size fields
3523 ** are within range. To be 'in range', each value must be a power of two,
3524 ** at least 512 for the page size and at least 32 for the sector size, and
3525 ** not greater than its respective compile-time maximum limit.
3526 */
3527 if( iPageSize<512 || iSectorSize<32
3528 || iPageSize>SQLITE_MAX_PAGE_SIZE || iSectorSize>MAX_SECTOR_SIZE
3529 || ((iPageSize-1)&iPageSize)!=0 || ((iSectorSize-1)&iSectorSize)!=0
3530 ){
3531 /* If either the page-size or sector-size in the journal-header is
3532 ** invalid, then the process that wrote the journal-header must have
3533 ** crashed before the header was synced. In this case stop reading
3534 ** the journal file here.
3535 */
3536 return SQLITE_DONE;
3537 }
3538
3539 /* Update the page-size to match the value read from the journal.
3540 ** Use a testcase() macro to make sure that malloc failure within
3541 ** PagerSetPagesize() is tested.
3542 */
3543 rc = sqlite3PagerSetPagesize(pPager, &iPageSize, -1);
3544 testcase( rc!=SQLITE_OK );
3545
3546 /* Update the assumed sector-size to match the value used by
3547 ** the process that created this journal. If this journal was
3548 ** created by a process other than this one, then this routine
3549 ** is being called from within pager_playback(). The local value
3550 ** of Pager.sectorSize is restored at the end of that routine.
3551 */
3552 pPager->sectorSize = iSectorSize;
3553 }
3554
3555 pPager->journalOff += JOURNAL_HDR_SZ(pPager);
3556 return rc;
3557 }
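Note: the range check that readJournalHdr() applies to the page-size and sector-size fields can be written as a small predicate. This is a sketch only: the function names are hypothetical, and the two upper limits are passed as parameters instead of using SQLITE_MAX_PAGE_SIZE and MAX_SECTOR_SIZE, whose values are not shown in this part of the diff.

```c
#include <stdint.h>

/* True if v is a non-zero power of two. */
static int is_pow2(uint32_t v){
  return v!=0 && ((v-1)&v)==0;
}

/* Mirror of the header sanity check: both sizes must be powers of
** two, pageSize>=512, sectorSize>=32, and neither may exceed its
** caller-supplied maximum. Returns 1 if the header fields are usable. */
static int hdr_sizes_in_range(
  uint32_t pageSize,
  uint32_t sectorSize,
  uint32_t maxPageSize,
  uint32_t maxSectorSize
){
  if( pageSize<512 || sectorSize<32 ) return 0;
  if( pageSize>maxPageSize || sectorSize>maxSectorSize ) return 0;
  return is_pow2(pageSize) && is_pow2(sectorSize);
}
```

A zero result corresponds to the SQLITE_DONE return in the original: the header is treated as never having been fully synced, and journal playback stops there.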
3558
3559
3560 /*
3561 ** Write the supplied master journal name into the journal file for pager
3562 ** pPager at the current location. The master journal name must be the last
3563 ** thing written to a journal file. If the pager is in full-sync mode, the
3564 ** journal file descriptor is advanced to the next sector boundary before
3565 ** anything is written. The format is:
3566 **
3567 ** + 4 bytes: PAGER_MJ_PGNO.
3568 ** + N bytes: Master journal filename in utf-8.
3569 ** + 4 bytes: N (length of master journal name in bytes, no nul-terminator).
3570 ** + 4 bytes: Master journal name checksum.
3571 ** + 8 bytes: aJournalMagic[].
3572 **
3573 ** The master journal page checksum is the sum of the bytes in the master
3574 ** journal name, where each byte is interpreted as a signed 8-bit integer.
3575 **
3576 ** If zMaster is a NULL pointer (occurs for a single database transaction),
3577 ** this call is a no-op.
3578 */
3579 static int writeMasterJournal(Pager *pPager, const char *zMaster){
3580 int rc; /* Return code */
3581 int nMaster; /* Length of string zMaster */
3582 i64 iHdrOff; /* Offset of header in journal file */
3583 i64 jrnlSize; /* Size of journal file on disk */
3584 u32 cksum = 0; /* Checksum of string zMaster */
3585
3586 assert( pPager->setMaster==0 );
3587 assert( !pagerUseWal(pPager) );
3588
3589 if( !zMaster
3590 || pPager->journalMode==PAGER_JOURNALMODE_MEMORY
3591 || !isOpen(pPager->jfd)
3592 ){
3593 return SQLITE_OK;
3594 }
3595 pPager->setMaster = 1;
3596 assert( pPager->journalHdr <= pPager->journalOff );
3597
3598 /* Calculate the length in bytes and the checksum of zMaster */
3599 for(nMaster=0; zMaster[nMaster]; nMaster++){
3600 cksum += zMaster[nMaster];
3601 }
3602
3603 /* If in full-sync mode, advance to the next disk sector before writing
3604 ** the master journal name. This is in case the previous page written to
3605 ** the journal has already been synced.
3606 */
3607 if( pPager->fullSync ){
3608 pPager->journalOff = journalHdrOffset(pPager);
3609 }
3610 iHdrOff = pPager->journalOff;
3611
3612 /* Write the master journal data to the end of the journal file. If
3613 ** an error occurs, return the error code to the caller.
3614 */
3615 if( (0 != (rc = write32bits(pPager->jfd, iHdrOff, PAGER_MJ_PGNO(pPager))))
3616 || (0 != (rc = sqlite3OsWrite(pPager->jfd, zMaster, nMaster, iHdrOff+4)))
3617 || (0 != (rc = write32bits(pPager->jfd, iHdrOff+4+nMaster, nMaster)))
3618 || (0 != (rc = write32bits(pPager->jfd, iHdrOff+4+nMaster+4, cksum)))
3619 || (0 != (rc = sqlite3OsWrite(pPager->jfd, aJournalMagic, 8,
3620 iHdrOff+4+nMaster+8)))
3621 ){
3622 return rc;
3623 }
3624 pPager->journalOff += (nMaster+20);
3625
3626 /* If the pager is in persistent-journal mode, then the physical
3627 ** journal-file may extend past the end of the master-journal name
3628 ** and 8 bytes of magic data just written to the file. This is
3629 ** dangerous because the code to rollback a hot-journal file
3630 ** will not be able to find the master-journal name to determine
3631 ** whether or not the journal is hot.
3632 **
3633 ** Easiest thing to do in this scenario is to truncate the journal
3634 ** file to the required size.
3635 */
3636 if( SQLITE_OK==(rc = sqlite3OsFileSize(pPager->jfd, &jrnlSize))
3637 && jrnlSize>pPager->journalOff
3638 ){
3639 rc = sqlite3OsTruncate(pPager->jfd, pPager->journalOff);
3640 }
3641 return rc;
3642 }
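Note: the master journal name checksum is a plain byte sum, and the reader verifies it by subtracting each byte from the stored value and checking for zero. A minimal sketch with hypothetical helper names follows; it casts each byte to signed char explicitly, following the "signed 8-bit integer" wording in the comment above, whereas the original code relies on the platform's char type.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the bytes of zName (each taken as a signed 8-bit value) and
** optionally report the length, excluding the nul-terminator. */
static uint32_t master_name_cksum(const char *zName, size_t *pLen){
  uint32_t cksum = 0;
  size_t n;
  for(n=0; zName[n]; n++){
    cksum += (uint32_t)(int32_t)(signed char)zName[n];
  }
  if( pLen ) *pLen = n;
  return cksum;
}

/* Reader-side check: subtract every byte from the stored checksum.
** A non-zero remainder means the name on disk is corrupt. */
static int master_name_matches(uint32_t stored, const char *zName, size_t n){
  size_t i;
  for(i=0; i<n; i++){
    stored -= (uint32_t)(int32_t)(signed char)zName[i];
  }
  return stored==0;
}

/* Bytes appended to the journal for a name of length nName:
** 4 (page number) + nName + 4 (length) + 4 (checksum) + 8 (magic). */
static size_t master_record_size(size_t nName){
  return nName + 20;
}
```

The record size matches the `pPager->journalOff += (nMaster+20)` step in writeMasterJournal().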
3643
3644 /*
3645 ** Discard the entire contents of the in-memory page-cache.
3646 */
3647 static void pager_reset(Pager *pPager){
3648 pPager->iDataVersion++;
3649 sqlite3BackupRestart(pPager->pBackup);
3650 sqlite3PcacheClear(pPager->pPCache);
3651 }
3652
3653 /*
3654 ** Return the pPager->iDataVersion value
3655 */
3656 SQLITE_PRIVATE u32 sqlite3PagerDataVersion(Pager *pPager){
3657 assert( pPager->eState>PAGER_OPEN );
3658 return pPager->iDataVersion;
3659 }
3660
3661 /*
3662 ** Free all structures in the Pager.aSavepoint[] array and set both
3663 ** Pager.aSavepoint and Pager.nSavepoint to zero. Close the sub-journal
3664 ** if it is open and the pager is not in exclusive mode.
3665 */
3666 static void releaseAllSavepoints(Pager *pPager){
3667 int ii; /* Iterator for looping through Pager.aSavepoint */
3668 for(ii=0; ii<pPager->nSavepoint; ii++){
3669 sqlite3BitvecDestroy(pPager->aSavepoint[ii].pInSavepoint);
3670 }
3671 if( !pPager->exclusiveMode || sqlite3JournalIsInMemory(pPager->sjfd) ){
3672 sqlite3OsClose(pPager->sjfd);
3673 }
3674 sqlite3_free(pPager->aSavepoint);
3675 pPager->aSavepoint = 0;
3676 pPager->nSavepoint = 0;
3677 pPager->nSubRec = 0;
3678 }
3679
3680 /*
3681 ** Set the bit number pgno in the PagerSavepoint.pInSavepoint
3682 ** bitvecs of all open savepoints. Return SQLITE_OK if successful
3683 ** or SQLITE_NOMEM if a malloc failure occurs.
3684 */
3685 static int addToSavepointBitvecs(Pager *pPager, Pgno pgno){
3686 int ii; /* Loop counter */
3687 int rc = SQLITE_OK; /* Result code */
3688
3689 for(ii=0; ii<pPager->nSavepoint; ii++){
3690 PagerSavepoint *p = &pPager->aSavepoint[ii];
3691 if( pgno<=p->nOrig ){
3692 rc |= sqlite3BitvecSet(p->pInSavepoint, pgno);
3693 testcase( rc==SQLITE_NOMEM );
3694 assert( rc==SQLITE_OK || rc==SQLITE_NOMEM );
3695 }
3696 }
3697 return rc;
3698 }
3699
3700 /*
3701 ** This function is a no-op if the pager is in exclusive mode and not
3702 ** in the ERROR state. Otherwise, it switches the pager to PAGER_OPEN
3703 ** state.
3704 **
3705 ** If the pager is not in exclusive-access mode, the database file is
3706 ** completely unlocked. If the file is unlocked and the file-system does
3707 ** not exhibit the UNDELETABLE_WHEN_OPEN property, the journal file is
3708 ** closed (if it is open).
3709 **
3710 ** If the pager is in ERROR state when this function is called, the
3711 ** contents of the pager cache are discarded before switching back to
3712 ** the OPEN state. Regardless of whether the pager is in exclusive-mode
3713 ** or not, any journal file left in the file-system will be treated
3714 ** as a hot-journal and rolled back the next time a read-transaction
3715 ** is opened (by this or by any other connection).
3716 */
3717 static void pager_unlock(Pager *pPager){
3718
3719 assert( pPager->eState==PAGER_READER
3720 || pPager->eState==PAGER_OPEN
3721 || pPager->eState==PAGER_ERROR
3722 );
3723
3724 sqlite3BitvecDestroy(pPager->pInJournal);
3725 pPager->pInJournal = 0;
3726 releaseAllSavepoints(pPager);
3727
3728 if( pagerUseWal(pPager) ){
3729 assert( !isOpen(pPager->jfd) );
3730 sqlite3WalEndReadTransaction(pPager->pWal);
3731 pPager->eState = PAGER_OPEN;
3732 }else if( !pPager->exclusiveMode ){
3733 int rc; /* Error code returned by pagerUnlockDb() */
3734 int iDc = isOpen(pPager->fd)?sqlite3OsDeviceCharacteristics(pPager->fd):0;
3735
3736 /* If the operating system supports deletion of open files, then
3737 ** close the journal file when dropping the database lock. Otherwise
3738 ** another connection with journal_mode=delete might delete the file
3739 ** out from under us.
3740 */
3741 assert( (PAGER_JOURNALMODE_MEMORY & 5)!=1 );
3742 assert( (PAGER_JOURNALMODE_OFF & 5)!=1 );
3743 assert( (PAGER_JOURNALMODE_WAL & 5)!=1 );
3744 assert( (PAGER_JOURNALMODE_DELETE & 5)!=1 );
3745 assert( (PAGER_JOURNALMODE_TRUNCATE & 5)==1 );
3746 assert( (PAGER_JOURNALMODE_PERSIST & 5)==1 );
3747 if( 0==(iDc & SQLITE_IOCAP_UNDELETABLE_WHEN_OPEN)
3748 || 1!=(pPager->journalMode & 5)
3749 ){
3750 sqlite3OsClose(pPager->jfd);
3751 }
3752
3753 /* If the pager is in the ERROR state and the call to unlock the database
3754 ** file fails, set the current lock to UNKNOWN_LOCK. See the comment
3755 ** above the #define for UNKNOWN_LOCK for an explanation of why this
3756 ** is necessary.
3757 */
3758 rc = pagerUnlockDb(pPager, NO_LOCK);
3759 if( rc!=SQLITE_OK && pPager->eState==PAGER_ERROR ){
3760 pPager->eLock = UNKNOWN_LOCK;
3761 }
3762
3763 /* The pager state may be changed from PAGER_ERROR to PAGER_OPEN here
3764 ** without clearing the error code. This is intentional - the error
3765 ** code is cleared and the cache reset in the block below.
3766 */
3767 assert( pPager->errCode || pPager->eState!=PAGER_ERROR );
3768 pPager->changeCountDone = 0;
3769 pPager->eState = PAGER_OPEN;
3770 }
3771
3772 /* If Pager.errCode is set, the contents of the pager cache cannot be
3773 ** trusted. Now that there are no outstanding references to the pager,
3774 ** it can safely move back to PAGER_OPEN state. This happens in both
3775 ** normal and exclusive-locking mode.
3776 */
3777 assert( pPager->errCode==SQLITE_OK || !MEMDB );
3778 if( pPager->errCode ){
3779 if( pPager->tempFile==0 ){
3780 pager_reset(pPager);
3781 pPager->changeCountDone = 0;
3782 pPager->eState = PAGER_OPEN;
3783 }else{
3784 pPager->eState = (isOpen(pPager->jfd) ? PAGER_OPEN : PAGER_READER);
3785 }
3786 if( USEFETCH(pPager) ) sqlite3OsUnfetch(pPager->fd, 0, 0);
3787 pPager->errCode = SQLITE_OK;
3788 setGetterMethod(pPager);
3789 }
3790
3791 pPager->journalOff = 0;
3792 pPager->journalHdr = 0;
3793 pPager->setMaster = 0;
3794 }
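Note: the `(journalMode & 5)==1` test in pager_unlock() singles out the two modes that leave a journal file on disk after commit. The sketch below assumes the usual numeric values of the PAGER_JOURNALMODE_* constants (they are defined elsewhere in the file, not in this part of the diff), and the names are hypothetical.

```c
/* Assumed values of the PAGER_JOURNALMODE_* constants. */
enum {
  JMODE_DELETE   = 0,
  JMODE_PERSIST  = 1,
  JMODE_OFF      = 2,
  JMODE_TRUNCATE = 3,
  JMODE_MEMORY   = 4,
  JMODE_WAL      = 5
};

/* True only for PERSIST (binary 001) and TRUNCATE (binary 011):
** masking with 5 (binary 101) yields 1 for exactly those two. */
static int journal_survives_commit(int mode){
  return (mode & 5)==1;
}

/* Mirror of the close decision in pager_unlock(): keep the journal
** open only if the file system cannot delete open files AND the
** mode is one that keeps the journal file around. */
static int close_journal_on_unlock(int undeletableWhenOpen, int mode){
  return !undeletableWhenOpen || !journal_survives_commit(mode);
}
```

If the constants had different values, the six assert() lines in the original would fail in a debug build, which is presumably why they are there.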
3795
3796 /*
3797 ** This function is called whenever an IOERR or FULL error that requires
3798 ** the pager to transition into the ERROR state may have occurred.
3799 ** The first argument is a pointer to the pager structure, the second
3800 ** the error-code about to be returned by a pager API function. The
3801 ** value returned is a copy of the second argument to this function.
3802 **
3803 ** If the second argument is SQLITE_FULL, SQLITE_IOERR or one of the
3804 ** IOERR sub-codes, the pager enters the ERROR state and the error code
3805 ** is stored in Pager.errCode. While the pager remains in the ERROR state,
3806 ** all major API calls on the Pager will immediately return Pager.errCode.
3807 **
3808 ** The ERROR state indicates that the contents of the pager-cache
3809 ** cannot be trusted. This state can be cleared by completely discarding
3810 ** the contents of the pager-cache. If a transaction was active when
3811 ** the persistent error occurred, then the rollback journal may need
3812 ** to be replayed to restore the contents of the database file (as if
3813 ** it were a hot-journal).
3814 */
3815 static int pager_error(Pager *pPager, int rc){
3816 int rc2 = rc & 0xff;
3817 assert( rc==SQLITE_OK || !MEMDB );
3818 assert(
3819 pPager->errCode==SQLITE_FULL ||
3820 pPager->errCode==SQLITE_OK ||
3821 (pPager->errCode & 0xff)==SQLITE_IOERR
3822 );
3823 if( rc2==SQLITE_FULL || rc2==SQLITE_IOERR ){
3824 pPager->errCode = rc;
3825 pPager->eState = PAGER_ERROR;
3826 setGetterMethod(pPager);
3827 }
3828 return rc;
3829 }
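Note: pager_error() masks the result code with 0xff so that extended IOERR sub-codes are recognised by their primary code. A small sketch of that classification, using assumed numeric values for the public result codes and hypothetical names:

```c
/* Assumed values of the public SQLite result codes. */
enum { RC_OK = 0, RC_BUSY = 5, RC_IOERR = 10, RC_FULL = 13 };

/* Extended codes keep the primary code in the low 8 bits. */
#define RC_IOERR_READ (RC_IOERR | (1<<8))

/* True if rc should move the pager into the ERROR state:
** SQLITE_FULL, SQLITE_IOERR, or any IOERR sub-code. */
static int enters_error_state(int rc){
  int primary = rc & 0xff;
  return primary==RC_FULL || primary==RC_IOERR;
}
```

The original stores the full, unmasked rc in Pager.errCode, so the sub-code is still available to callers.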
3830
3831 static int pager_truncate(Pager *pPager, Pgno nPage);
3832
3833 /*
3834 ** The write transaction open on pPager is being committed (bCommit==1)
3835 ** or rolled back (bCommit==0).
3836 **
3837 ** Return TRUE if and only if all dirty pages should be flushed to disk.
3838 **
3839 ** Rules:
3840 **
3841 ** * For non-TEMP databases, always sync to disk. This is necessary
3842 ** for transactions to be durable.
3843 **
3844 ** * Sync TEMP database only on a COMMIT (not a ROLLBACK) when the backing
3845 ** file has been created already (via a spill on pagerStress()) and
3846 ** when the number of dirty pages in memory exceeds 25% of the total
3847 ** cache size.
3848 */
3849 static int pagerFlushOnCommit(Pager *pPager, int bCommit){
3850 if( pPager->tempFile==0 ) return 1;
3851 if( !bCommit ) return 0;
3852 if( !isOpen(pPager->fd) ) return 0;
3853 return (sqlite3PCachePercentDirty(pPager->pPCache)>=25);
3854 }
3855
3856 /*
3857 ** This routine ends a transaction. A transaction is usually ended by
3858 ** either a COMMIT or a ROLLBACK operation. This routine may be called
3859 ** after rollback of a hot-journal, or if an error occurs while opening
3860 ** the journal file or writing the very first journal-header of a
3861 ** database transaction.
3862 **
3863 ** This routine is never called in PAGER_ERROR state. If it is called
3864 ** in PAGER_NONE or PAGER_SHARED state and the lock held is less
3865 ** exclusive than a RESERVED lock, it is a no-op.
3866 **
3867 ** Otherwise, any active savepoints are released.
3868 **
3869 ** If the journal file is open, then it is "finalized". Once a journal
3870 ** file has been finalized it is not possible to use it to roll back a
3871 ** transaction. Nor will it be considered to be a hot-journal by this
3872 ** or any other database connection. Exactly how a journal is finalized
3873 ** depends on whether or not the pager is running in exclusive mode and
3874 ** the current journal-mode (Pager.journalMode value), as follows:
3875 **
3876 ** journalMode==MEMORY
3877 ** Journal file descriptor is simply closed. This destroys an
3878 ** in-memory journal.
3879 **
3880 ** journalMode==TRUNCATE
3881 ** Journal file is truncated to zero bytes in size.
3882 **
3883 ** journalMode==PERSIST
3884 ** The first 28 bytes of the journal file are zeroed. This invalidates
3885 ** the first journal header in the file, and hence the entire journal
3886 ** file. An invalid journal file cannot be rolled back.
3887 **
3888 ** journalMode==DELETE
3889 ** The journal file is closed and deleted using sqlite3OsDelete().
3890 **
3891 ** If the pager is running in exclusive mode, this method of finalizing
3892 ** the journal file is never used. Instead, if the journalMode is
3893 ** DELETE and the pager is in exclusive mode, the method described under
3894 ** journalMode==PERSIST is used instead.
3895 **
3896 ** After the journal is finalized, the pager moves to PAGER_READER state.
3897 ** If running in non-exclusive rollback mode, the lock on the file is
3898 ** downgraded to a SHARED_LOCK.
3899 **
3900 ** SQLITE_OK is returned if no error occurs. If an error occurs during
3901 ** any of the IO operations to finalize the journal file or unlock the
3902 ** database then the IO error code is returned to the user. If the
3903 ** operation to finalize the journal file fails, then the code still
3904 ** tries to unlock the database file if not in exclusive mode. If the
3905 ** unlock operation fails as well, then the error code related
3906 ** to the first error encountered (the journal finalization one) is
3907 ** returned.
3908 */
3909 static int pager_end_transaction(Pager *pPager, int hasMaster, int bCommit){
3910 int rc = SQLITE_OK; /* Error code from journal finalization operation */
3911 int rc2 = SQLITE_OK; /* Error code from db file unlock operation */
3912
3913 /* Do nothing if the pager does not have an open write transaction
3914 ** or at least a RESERVED lock. This function may be called when there
3915 ** is no write-transaction active but a RESERVED or greater lock is
3916 ** held under two circumstances:
3917 **
3918 ** 1. After a successful hot-journal rollback, it is called with
3919 ** eState==PAGER_NONE and eLock==EXCLUSIVE_LOCK.
3920 **
3921 ** 2. If a connection with locking_mode=exclusive holding an EXCLUSIVE
3922 ** lock switches back to locking_mode=normal and then executes a
3923 ** read-transaction, this function is called with eState==PAGER_READER
3924 ** and eLock==EXCLUSIVE_LOCK when the read-transaction is closed.
3925 */
3926 assert( assert_pager_state(pPager) );
3927 assert( pPager->eState!=PAGER_ERROR );
3928 if( pPager->eState<PAGER_WRITER_LOCKED && pPager->eLock<RESERVED_LOCK ){
3929 return SQLITE_OK;
3930 }
3931
3932 releaseAllSavepoints(pPager);
3933 assert( isOpen(pPager->jfd) || pPager->pInJournal==0 );
3934 if( isOpen(pPager->jfd) ){
3935 assert( !pagerUseWal(pPager) );
3936
3937 /* Finalize the journal file. */
3938 if( sqlite3JournalIsInMemory(pPager->jfd) ){
3939 /* assert( pPager->journalMode==PAGER_JOURNALMODE_MEMORY ); */
3940 sqlite3OsClose(pPager->jfd);
3941 }else if( pPager->journalMode==PAGER_JOURNALMODE_TRUNCATE ){
3942 if( pPager->journalOff==0 ){
3943 rc = SQLITE_OK;
3944 }else{
3945 rc = sqlite3OsTruncate(pPager->jfd, 0);
3946 if( rc==SQLITE_OK && pPager->fullSync ){
3947 /* Make sure the new file size is written into the inode right away.
3948 ** Otherwise the journal might resurrect following a power loss and
3949 ** cause the last transaction to roll back. See
3950 ** https://bugzilla.mozilla.org/show_bug.cgi?id=1072773
3951 */
3952 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags);
3953 }
3954 }
3955 pPager->journalOff = 0;
3956 }else if( pPager->journalMode==PAGER_JOURNALMODE_PERSIST
3957 || (pPager->exclusiveMode && pPager->journalMode!=PAGER_JOURNALMODE_WAL)
3958 ){
3959 rc = zeroJournalHdr(pPager, hasMaster||pPager->tempFile);
3960 pPager->journalOff = 0;
3961 }else{
3962 /* This branch may be executed with Pager.journalMode==MEMORY if
3963 ** a hot-journal was just rolled back. In this case the journal
3964 ** file should be closed and deleted. If this connection writes to
3965 ** the database file, it will do so using an in-memory journal.
3966 */
3967 int bDelete = !pPager->tempFile;
3968 assert( sqlite3JournalIsInMemory(pPager->jfd)==0 );
3969 assert( pPager->journalMode==PAGER_JOURNALMODE_DELETE
3970 || pPager->journalMode==PAGER_JOURNALMODE_MEMORY
3971 || pPager->journalMode==PAGER_JOURNALMODE_WAL
3972 );
3973 sqlite3OsClose(pPager->jfd);
3974 if( bDelete ){
3975 rc = sqlite3OsDelete(pPager->pVfs, pPager->zJournal, pPager->extraSync);
3976 }
3977 }
3978 }
3979
3980 #ifdef SQLITE_CHECK_PAGES
3981 sqlite3PcacheIterateDirty(pPager->pPCache, pager_set_pagehash);
3982 if( pPager->dbSize==0 && sqlite3PcacheRefCount(pPager->pPCache)>0 ){
3983 PgHdr *p = sqlite3PagerLookup(pPager, 1);
3984 if( p ){
3985 p->pageHash = 0;
3986 sqlite3PagerUnrefNotNull(p);
3987 }
3988 }
3989 #endif
3990
3991 sqlite3BitvecDestroy(pPager->pInJournal);
3992 pPager->pInJournal = 0;
3993 pPager->nRec = 0;
3994 if( rc==SQLITE_OK ){
3995 if( MEMDB || pagerFlushOnCommit(pPager, bCommit) ){
3996 sqlite3PcacheCleanAll(pPager->pPCache);
3997 }else{
3998 sqlite3PcacheClearWritable(pPager->pPCache);
3999 }
4000 sqlite3PcacheTruncate(pPager->pPCache, pPager->dbSize);
4001 }
4002
4003 if( pagerUseWal(pPager) ){
4004 /* Drop the WAL write-lock, if any. Also, if the connection was in
4005 ** locking_mode=exclusive mode but is no longer, drop the EXCLUSIVE
4006 ** lock held on the database file.
4007 */
4008 rc2 = sqlite3WalEndWriteTransaction(pPager->pWal);
4009 assert( rc2==SQLITE_OK );
4010 }else if( rc==SQLITE_OK && bCommit && pPager->dbFileSize>pPager->dbSize ){
4011 /* This branch is taken when committing a transaction in rollback-journal
4012 ** mode if the database file on disk is larger than the database image.
4013 ** At this point the journal has been finalized and the transaction
4014 ** successfully committed, but the EXCLUSIVE lock is still held on the
4015 ** file. So it is safe to truncate the database file to its minimum
4016 ** required size. */
4017 assert( pPager->eLock==EXCLUSIVE_LOCK );
4018 rc = pager_truncate(pPager, pPager->dbSize);
4019 }
4020
4021 if( rc==SQLITE_OK && bCommit && isOpen(pPager->fd) ){
4022 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_COMMIT_PHASETWO, 0);
4023 if( rc==SQLITE_NOTFOUND ) rc = SQLITE_OK;
4024 }
4025
4026 if( !pPager->exclusiveMode
4027 && (!pagerUseWal(pPager) || sqlite3WalExclusiveMode(pPager->pWal, 0))
4028 ){
4029 rc2 = pagerUnlockDb(pPager, SHARED_LOCK);
4030 pPager->changeCountDone = 0;
4031 }
4032 pPager->eState = PAGER_READER;
4033 pPager->setMaster = 0;
4034
4035 return (rc==SQLITE_OK?rc2:rc);
4036 }
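Note: the journal finalization branches in pager_end_transaction() reduce to a decision on four inputs. The sketch below restates that decision as a pure function; the enum and function names are hypothetical labels for the branches and perform no I/O.

```c
/* Hypothetical labels for the journal modes that matter here. */
typedef enum { JM_DELETE, JM_PERSIST, JM_TRUNCATE, JM_MEMORY, JM_WAL } JMode;

/* Hypothetical labels for the four finalization strategies. */
typedef enum {
  FIN_CLOSE_ONLY,       /* close the descriptor, nothing else */
  FIN_TRUNCATE,         /* truncate the journal to zero bytes */
  FIN_ZERO_HEADER,      /* zeroJournalHdr(): invalidate first header */
  FIN_CLOSE_AND_DELETE  /* close, then delete the journal file */
} FinalizeAction;

/* Same branch order as the original: in-memory journal first, then
** TRUNCATE, then PERSIST or exclusive mode, else close and delete
** (the delete is skipped for temp files). */
static FinalizeAction choose_finalize_action(
  JMode mode,
  int journalInMemory,
  int exclusiveMode,
  int tempFile
){
  if( journalInMemory ) return FIN_CLOSE_ONLY;
  if( mode==JM_TRUNCATE ) return FIN_TRUNCATE;
  if( mode==JM_PERSIST || (exclusiveMode && mode!=JM_WAL) ){
    return FIN_ZERO_HEADER;
  }
  return tempFile ? FIN_CLOSE_ONLY : FIN_CLOSE_AND_DELETE;
}
```

This makes the rule from the header comment easy to see: in exclusive mode a DELETE-mode journal is finalized the PERSIST way, by zeroing its header.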
4037
4038 /*
4039 ** Execute a rollback if a transaction is active and unlock the
4040 ** database file.
4041 **
4042 ** If the pager has already entered the ERROR state, do not attempt
4043 ** the rollback at this time. Instead, pager_unlock() is called. The
4044 ** call to pager_unlock() will discard all in-memory pages, unlock
4045 ** the database file and move the pager back to OPEN state. If this
4046 ** means that there is a hot-journal left in the file-system, the next
4047 ** connection to obtain a shared lock on the pager (which may be this one)
4048 ** will roll it back.
4049 **
4050 ** If the pager has not already entered the ERROR state, but an IO or
4051 ** malloc error occurs during a rollback, then this will itself cause
4052 ** the pager to enter the ERROR state. Which will be cleared by the
4053 ** call to pager_unlock(), as described above.
4054 */
4055 static void pagerUnlockAndRollback(Pager *pPager){
4056 if( pPager->eState!=PAGER_ERROR && pPager->eState!=PAGER_OPEN ){
4057 assert( assert_pager_state(pPager) );
4058 if( pPager->eState>=PAGER_WRITER_LOCKED ){
4059 sqlite3BeginBenignMalloc();
4060 sqlite3PagerRollback(pPager);
4061 sqlite3EndBenignMalloc();
4062 }else if( !pPager->exclusiveMode ){
4063 assert( pPager->eState==PAGER_READER );
4064 pager_end_transaction(pPager, 0, 0);
4065 }
4066 }
4067 pager_unlock(pPager);
4068 }
4069
4070 /*
4071 ** Parameter aData must point to a buffer of pPager->pageSize bytes
4072 ** of data. Compute and return a checksum based on the contents of the
4073 ** page of data and the current value of pPager->cksumInit.
4074 **
4075 ** This is not a real checksum. It is really just the sum of the
4076 ** random initial value (pPager->cksumInit) and every 200th byte
4077 ** of the page data, starting with byte offset (pPager->pageSize%200).
4078 ** Each byte is interpreted as an 8-bit unsigned integer.
4079 **
4080 ** Changing the formula used to compute this checksum results in an
4081 ** incompatible journal file format.
4082 **
4083 ** If journal corruption occurs due to a power failure, the most likely
4084 ** scenario is that one end or the other of the record will be changed.
4085 ** It is much less likely that the two ends of the journal record will be
4086 ** correct and the middle be corrupt. Thus, this "checksum" scheme,
4087 ** though fast and simple, catches the most likely kind of corruption.
4088 */
4089 static u32 pager_cksum(Pager *pPager, const u8 *aData){
4090 u32 cksum = pPager->cksumInit; /* Checksum value to return */
4091 int i = pPager->pageSize-200; /* Loop counter */
4092 while( i>0 ){
4093 cksum += aData[i];
4094 i -= 200;
4095 }
4096 return cksum;
4097 }
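
The sampling scheme described above can be tried outside the pager. The following is a minimal sketch, not the SQLite routine itself: `sampled_cksum` and its parameters are hypothetical names, and the page size is passed explicitly instead of being read from a Pager object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical standalone version of the sampled "checksum": the sum of an
** initial value and every 200th byte of the page, walking down from offset
** page_size-200 and stopping before offset 0 is reached. */
static uint32_t sampled_cksum(uint32_t init, const uint8_t *data, int page_size){
  uint32_t sum = init;
  int i = page_size - 200;
  assert( data!=NULL && page_size>0 );
  while( i>0 ){
    sum += data[i];   /* each byte is treated as an unsigned 8-bit value */
    i -= 200;
  }
  return sum;
}
```

For a 1024-byte page this samples offsets 824, 624, 424, 224 and 24, so a change confined to any other byte is not detected. That is the trade-off the comment describes.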
4098
4099 /*
4100 ** Report the current page size and number of reserved bytes back
4101 ** to the codec.
4102 */
4103 #ifdef SQLITE_HAS_CODEC
4104 static void pagerReportSize(Pager *pPager){
4105 if( pPager->xCodecSizeChng ){
4106 pPager->xCodecSizeChng(pPager->pCodec, pPager->pageSize,
4107 (int)pPager->nReserve);
4108 }
4109 }
4110 #else
4111 # define pagerReportSize(X) /* No-op if we do not support a codec */
4112 #endif
4113
4114 #ifdef SQLITE_HAS_CODEC
4115 /*
4116 ** Make sure the number of reserved bytes is the same in the destination
4117 ** pager as it is in the source. This comes up when a VACUUM changes the
4118 ** number of reserved bytes to the "optimal" amount.
4119 */
4120 SQLITE_PRIVATE void sqlite3PagerAlignReserve(Pager *pDest, Pager *pSrc){
4121 if( pDest->nReserve!=pSrc->nReserve ){
4122 pDest->nReserve = pSrc->nReserve;
4123 pagerReportSize(pDest);
4124 }
4125 }
4126 #endif
4127
4128 /*
4129 ** Read a single page from either the journal file (if isMainJrnl==1) or
4130 ** from the sub-journal (if isMainJrnl==0) and playback that page.
4131 ** The page begins at offset *pOffset into the file. The *pOffset
4132 ** value is increased to the start of the next page in the journal.
4133 **
4134 ** The main rollback journal uses checksums - the statement journal does
4135 ** not.
4136 **
4137 ** If the page number of the page record read from the (sub-)journal file
4138 ** is greater than the current value of Pager.dbSize, then playback is
4139 ** skipped and SQLITE_OK is returned.
4140 **
4141 ** If pDone is not NULL, then it is a record of pages that have already
4142 ** been played back. If the page at *pOffset has already been played back
4143 ** (if the corresponding pDone bit is set) then skip the playback.
4144 ** Make sure the pDone bit corresponding to the *pOffset page is set
4145 ** prior to returning.
4146 **
4147 ** If the page record is successfully read from the (sub-)journal file
4148 ** and played back, then SQLITE_OK is returned. If an IO error occurs
4149 ** while reading the record from the (sub-)journal file or while writing
4150 ** to the database file, then the IO error code is returned. If data
4151 ** is successfully read from the (sub-)journal file but appears to be
4152 ** corrupted, SQLITE_DONE is returned. Data is considered corrupted in
4153 ** two circumstances:
4154 **
4155 ** * If the record page-number is illegal (0 or PAGER_MJ_PGNO), or
4156 ** * If the record is being rolled back from the main journal file
4157 ** and the checksum field does not match the record content.
4158 **
4159 ** Neither of these two scenarios is possible during a savepoint rollback.
4160 **
4161 ** If this is a savepoint rollback, then memory may have to be dynamically
4162 ** allocated by this function. If this is the case and an allocation fails,
4163 ** SQLITE_NOMEM is returned.
4164 */
4165 static int pager_playback_one_page(
4166 Pager *pPager, /* The pager being played back */
4167 i64 *pOffset, /* Offset of record to playback */
4168 Bitvec *pDone, /* Bitvec of pages already played back */
4169 int isMainJrnl, /* 1 -> main journal. 0 -> sub-journal. */
4170 int isSavepnt /* True for a savepoint rollback */
4171 ){
4172 int rc;
4173 PgHdr *pPg; /* An existing page in the cache */
4174 Pgno pgno; /* The page number of a page in journal */
4175 u32 cksum; /* Checksum used for sanity checking */
4176 char *aData; /* Temporary storage for the page */
4177 sqlite3_file *jfd; /* The file descriptor for the journal file */
4178 int isSynced; /* True if journal page is synced */
4179
4180 assert( (isMainJrnl&~1)==0 ); /* isMainJrnl is 0 or 1 */
4181 assert( (isSavepnt&~1)==0 ); /* isSavepnt is 0 or 1 */
4182 assert( isMainJrnl || pDone ); /* pDone always used on sub-journals */
4183 assert( isSavepnt || pDone==0 ); /* pDone never used on non-savepoint */
4184
4185 aData = pPager->pTmpSpace;
4186 assert( aData ); /* Temp storage must have already been allocated */
4187 assert( pagerUseWal(pPager)==0 || (!isMainJrnl && isSavepnt) );
4188
4189 /* Either the state is PAGER_WRITER_CACHEMOD or greater (a transaction
4190 ** or savepoint rollback done at the request of the caller) or this is
4191 ** a hot-journal rollback. If it is a hot-journal rollback, the pager
4192 ** is in state OPEN and holds an EXCLUSIVE lock. Hot-journal rollback
4193 ** only reads from the main journal, not the sub-journal.
4194 */
4195 assert( pPager->eState>=PAGER_WRITER_CACHEMOD
4196 || (pPager->eState==PAGER_OPEN && pPager->eLock==EXCLUSIVE_LOCK)
4197 );
4198 assert( pPager->eState>=PAGER_WRITER_CACHEMOD || isMainJrnl );
4199
4200 /* Read the page number and page data from the journal or sub-journal
4201 ** file. Return an error code to the caller if an IO error occurs.
4202 */
4203 jfd = isMainJrnl ? pPager->jfd : pPager->sjfd;
4204 rc = read32bits(jfd, *pOffset, &pgno);
4205 if( rc!=SQLITE_OK ) return rc;
4206 rc = sqlite3OsRead(jfd, (u8*)aData, pPager->pageSize, (*pOffset)+4);
4207 if( rc!=SQLITE_OK ) return rc;
4208 *pOffset += pPager->pageSize + 4 + isMainJrnl*4;
4209
4210 /* Sanity checking on the page. This is more important than I originally
4211 ** thought. If a power failure occurs while the journal is being written,
4212 ** it could cause invalid data to be written into the journal. We need to
4213 ** detect this invalid data (with high probability) and ignore it.
4214 */
4215 if( pgno==0 || pgno==PAGER_MJ_PGNO(pPager) ){
4216 assert( !isSavepnt );
4217 return SQLITE_DONE;
4218 }
4219 if( pgno>(Pgno)pPager->dbSize || sqlite3BitvecTest(pDone, pgno) ){
4220 return SQLITE_OK;
4221 }
4222 if( isMainJrnl ){
4223 rc = read32bits(jfd, (*pOffset)-4, &cksum);
4224 if( rc ) return rc;
4225 if( !isSavepnt && pager_cksum(pPager, (u8*)aData)!=cksum ){
4226 return SQLITE_DONE;
4227 }
4228 }
4229
4230 /* If this page has already been played back before during the current
4231 ** rollback, then don't bother to play it back again.
4232 */
4233 if( pDone && (rc = sqlite3BitvecSet(pDone, pgno))!=SQLITE_OK ){
4234 return rc;
4235 }
4236
4237 /* When playing back page 1, restore the nReserve setting
4238 */
4239 if( pgno==1 && pPager->nReserve!=((u8*)aData)[20] ){
4240 pPager->nReserve = ((u8*)aData)[20];
4241 pagerReportSize(pPager);
4242 }
4243
4244 /* If the pager is in CACHEMOD state, then there must be a copy of this
4245 ** page in the pager cache. In this case just update the pager cache,
4246 ** not the database file. The page is left marked dirty in this case.
4247 **
4248 ** An exception to the above rule: If the database is in no-sync mode
4249 ** and a page is moved during an incremental vacuum then the page may
4250 ** not be in the pager cache. Later: if a malloc() or IO error occurs
4251 ** during a Movepage() call, then the page may not be in the cache
4252 ** either. So the condition described in the above paragraph is not
4253 ** assert()able.
4254 **
4255 ** If in WRITER_DBMOD, WRITER_FINISHED or OPEN state, then we update the
4256 ** pager cache if it exists and the main file. The page is then marked
4257 ** not dirty. Since this code is only executed in PAGER_OPEN state for
4258 ** a hot-journal rollback, it is guaranteed that the page-cache is empty
4259 ** if the pager is in OPEN state.
4260 **
4261 ** Ticket #1171: The statement journal might contain page content that is
4262 ** different from the page content at the start of the transaction.
4263 ** This occurs when a page is changed prior to the start of a statement
4264 ** then changed again within the statement. When rolling back such a
4265 ** statement we must not write to the original database unless we know
4266 ** for certain that original page contents are synced into the main rollback
4267 ** journal. Otherwise, a power loss might leave modified data in the
4268 ** database file without an entry in the rollback journal that can
4269 ** restore the database to its original form. Two conditions must be
4270 ** met before writing to the database files. (1) the database must be
4271 ** locked. (2) we know that the original page content is fully synced
4272 ** in the main journal either because the page is not in cache or else
4273 ** the page is marked as needSync==0.
4274 **
4275 ** 2008-04-14: When attempting to vacuum a corrupt database file, it
4276 ** is possible to fail a statement on a database that does not yet exist.
4277 ** Do not attempt to write if database file has never been opened.
4278 */
4279 if( pagerUseWal(pPager) ){
4280 pPg = 0;
4281 }else{
4282 pPg = sqlite3PagerLookup(pPager, pgno);
4283 }
4284 assert( pPg || !MEMDB );
4285 assert( pPager->eState!=PAGER_OPEN || pPg==0 || pPager->tempFile );
4286 PAGERTRACE(("PLAYBACK %d page %d hash(%08x) %s\n",
4287 PAGERID(pPager), pgno, pager_datahash(pPager->pageSize, (u8*)aData),
4288 (isMainJrnl?"main-journal":"sub-journal")
4289 ));
4290 if( isMainJrnl ){
4291 isSynced = pPager->noSync || (*pOffset <= pPager->journalHdr);
4292 }else{
4293 isSynced = (pPg==0 || 0==(pPg->flags & PGHDR_NEED_SYNC));
4294 }
4295 if( isOpen(pPager->fd)
4296 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)
4297 && isSynced
4298 ){
4299 i64 ofst = (pgno-1)*(i64)pPager->pageSize;
4300 testcase( !isSavepnt && pPg!=0 && (pPg->flags&PGHDR_NEED_SYNC)!=0 );
4301 assert( !pagerUseWal(pPager) );
4302 rc = sqlite3OsWrite(pPager->fd, (u8 *)aData, pPager->pageSize, ofst);
4303 if( pgno>pPager->dbFileSize ){
4304 pPager->dbFileSize = pgno;
4305 }
4306 if( pPager->pBackup ){
4307 CODEC1(pPager, aData, pgno, 3, rc=SQLITE_NOMEM_BKPT);
4308 sqlite3BackupUpdate(pPager->pBackup, pgno, (u8*)aData);
4309 CODEC2(pPager, aData, pgno, 7, rc=SQLITE_NOMEM_BKPT, aData);
4310 }
4311 }else if( !isMainJrnl && pPg==0 ){
4312 /* If this is a rollback of a savepoint and data was not written to
4313 ** the database and the page is not in-memory, there is a potential
4314 ** problem. When the page is next fetched by the b-tree layer, it
4315 ** will be read from the database file, which may or may not be
4316 ** current.
4317 **
4318 ** There are a couple of different ways this can happen. All are quite
4319 ** obscure. When running in synchronous mode, this can only happen
4320 ** if the page is on the free-list at the start of the transaction, then
4321 ** populated, then moved using sqlite3PagerMovepage().
4322 **
4323 ** The solution is to add an in-memory page to the cache containing
4324 ** the data just read from the sub-journal. Mark the page as dirty
4325 ** and if the pager requires a journal-sync, then mark the page as
4326 ** requiring a journal-sync before it is written.
4327 */
4328 assert( isSavepnt );
4329 assert( (pPager->doNotSpill & SPILLFLAG_ROLLBACK)==0 );
4330 pPager->doNotSpill |= SPILLFLAG_ROLLBACK;
4331 rc = sqlite3PagerGet(pPager, pgno, &pPg, 1);
4332 assert( (pPager->doNotSpill & SPILLFLAG_ROLLBACK)!=0 );
4333 pPager->doNotSpill &= ~SPILLFLAG_ROLLBACK;
4334 if( rc!=SQLITE_OK ) return rc;
4335 sqlite3PcacheMakeDirty(pPg);
4336 }
4337 if( pPg ){
4338 /* No page should ever be explicitly rolled back that is in use, except
4339 ** for page 1 which is held in use in order to keep the lock on the
4340 ** database active. However such a page may be rolled back as a result
4341 ** of an internal error resulting in an automatic call to
4342 ** sqlite3PagerRollback().
4343 */
4344 void *pData;
4345 pData = pPg->pData;
4346 memcpy(pData, (u8*)aData, pPager->pageSize);
4347 pPager->xReiniter(pPg);
4348 /* It used to be that sqlite3PcacheMakeClean(pPg) was called here. But
4349 ** that call was dangerous and had no detectable benefit since the cache
4350 ** is normally cleaned by sqlite3PcacheCleanAll() after rollback and so
4351 ** has been removed. */
4352 pager_set_pagehash(pPg);
4353
4354 /* If this was page 1, then restore the value of Pager.dbFileVers.
4355 ** Do this before any decoding. */
4356 if( pgno==1 ){
4357 memcpy(&pPager->dbFileVers, &((u8*)pData)[24],sizeof(pPager->dbFileVers));
4358 }
4359
4360 /* Decode the page just read from disk */
4361 CODEC1(pPager, pData, pPg->pgno, 3, rc=SQLITE_NOMEM_BKPT);
4362 sqlite3PcacheRelease(pPg);
4363 }
4364 return rc;
4365 }
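
The offset arithmetic used by pager_playback_one_page() can be written out separately. This is a sketch under stated assumptions: `journal_record_size` and `db_page_offset` are hypothetical helper names, not part of SQLite. They only restate the expressions `pageSize + 4 + isMainJrnl*4` and `(pgno-1)*(i64)pageSize` from the function above.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one page record: a 4-byte page number, the page data,
** and (main journal only) a 4-byte checksum. */
static int64_t journal_record_size(int page_size, int is_main_jrnl){
  assert( page_size>0 );
  assert( (is_main_jrnl & ~1)==0 );   /* must be 0 or 1 */
  return (int64_t)page_size + 4 + is_main_jrnl*4;
}

/* Byte offset of page pgno in the database file. Page numbers start at 1. */
static int64_t db_page_offset(uint32_t pgno, int page_size){
  assert( pgno>=1 && page_size>0 );
  return ((int64_t)pgno - 1) * (int64_t)page_size;
}
```

With 4096-byte pages, a main-journal record occupies 4104 bytes and a sub-journal record 4100 bytes. This is why the checksum is read from `(*pOffset)-4` after the offset has been advanced.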
4366
4367 /*
4368 ** Parameter zMaster is the name of a master journal file. A single journal
4369 ** file that referred to the master journal file has just been rolled back.
4370 ** This routine checks if it is possible to delete the master journal file,
4371 ** and does so if it is.
4372 **
4373 ** Argument zMaster may point to Pager.pTmpSpace. So that buffer is not
4374 ** available for use within this function.
4375 **
4376 ** When a master journal file is created, it is populated with the names
4377 ** of all of its child journals, one after another, formatted as utf-8
4378 ** encoded text. The end of each child journal file is marked with a
4379 ** nul-terminator byte (0x00). i.e. the entire contents of a master journal
4380 ** file for a transaction involving two databases might be:
4381 **
4382 ** "/home/bill/a.db-journal\x00/home/bill/b.db-journal\x00"
4383 **
4384 ** A master journal file may only be deleted once all of its child
4385 ** journals have been rolled back.
4386 **
4387 ** This function reads the contents of the master-journal file into
4388 ** memory and loops through each of the child journal names. For
4389 ** each child journal, it checks if:
4390 **
4391 ** * if the child journal exists, and if so
4392 ** * if the child journal contains a reference to master journal
4393 ** file zMaster
4394 **
4395 ** If a child journal can be found that matches both of the criteria
4396 ** above, this function returns without doing anything. Otherwise, if
4397 ** no such child journal can be found, file zMaster is deleted from
4398 ** the file-system using sqlite3OsDelete().
4399 **
4400 ** If an IO error occurs within this function, an error code is returned. This
4401 ** function allocates memory by calling sqlite3Malloc(). If an allocation
4402 ** fails, SQLITE_NOMEM is returned. Otherwise, if no IO or malloc errors
4403 ** occur, SQLITE_OK is returned.
4404 **
4405 ** TODO: This function allocates a single block of memory to load
4406 ** the entire contents of the master journal file. This could be
4407 ** a couple of kilobytes or so - potentially larger than the page
4408 ** size.
4409 */
4410 static int pager_delmaster(Pager *pPager, const char *zMaster){
4411 sqlite3_vfs *pVfs = pPager->pVfs;
4412 int rc; /* Return code */
4413 sqlite3_file *pMaster; /* Malloc'd master-journal file descriptor */
4414 sqlite3_file *pJournal; /* Malloc'd child-journal file descriptor */
4415 char *zMasterJournal = 0; /* Contents of master journal file */
4416 i64 nMasterJournal; /* Size of master journal file */
4417 char *zJournal; /* Pointer to one journal within MJ file */
4418 char *zMasterPtr; /* Space to hold MJ filename from a journal file */
4419 int nMasterPtr; /* Amount of space allocated to zMasterPtr[] */
4420
4421 /* Allocate space for both the pJournal and pMaster file descriptors.
4422 ** If successful, open the master journal file for reading.
4423 */
4424 pMaster = (sqlite3_file *)sqlite3MallocZero(pVfs->szOsFile * 2);
4425 pJournal = (sqlite3_file *)(((u8 *)pMaster) + pVfs->szOsFile);
4426 if( !pMaster ){
4427 rc = SQLITE_NOMEM_BKPT;
4428 }else{
4429 const int flags = (SQLITE_OPEN_READONLY|SQLITE_OPEN_MASTER_JOURNAL);
4430 rc = sqlite3OsOpen(pVfs, zMaster, pMaster, flags, 0);
4431 }
4432 if( rc!=SQLITE_OK ) goto delmaster_out;
4433
4434 /* Load the entire master journal file into space obtained from
4435 ** sqlite3_malloc() and pointed to by zMasterJournal. Also obtain
4436 ** sufficient space (in zMasterPtr) to hold the names of master
4437 ** journal files extracted from regular rollback-journals.
4438 */
4439 rc = sqlite3OsFileSize(pMaster, &nMasterJournal);
4440 if( rc!=SQLITE_OK ) goto delmaster_out;
4441 nMasterPtr = pVfs->mxPathname+1;
4442 zMasterJournal = sqlite3Malloc(nMasterJournal + nMasterPtr + 1);
4443 if( !zMasterJournal ){
4444 rc = SQLITE_NOMEM_BKPT;
4445 goto delmaster_out;
4446 }
4447 zMasterPtr = &zMasterJournal[nMasterJournal+1];
4448 rc = sqlite3OsRead(pMaster, zMasterJournal, (int)nMasterJournal, 0);
4449 if( rc!=SQLITE_OK ) goto delmaster_out;
4450 zMasterJournal[nMasterJournal] = 0;
4451
4452 zJournal = zMasterJournal;
4453 while( (zJournal-zMasterJournal)<nMasterJournal ){
4454 int exists;
4455 rc = sqlite3OsAccess(pVfs, zJournal, SQLITE_ACCESS_EXISTS, &exists);
4456 if( rc!=SQLITE_OK ){
4457 goto delmaster_out;
4458 }
4459 if( exists ){
4460 /* One of the journals pointed to by the master journal exists.
4461 ** Open it and check if it points at the master journal. If
4462 ** so, return without deleting the master journal file.
4463 */
4464 int c;
4465 int flags = (SQLITE_OPEN_READONLY|SQLITE_OPEN_MAIN_JOURNAL);
4466 rc = sqlite3OsOpen(pVfs, zJournal, pJournal, flags, 0);
4467 if( rc!=SQLITE_OK ){
4468 goto delmaster_out;
4469 }
4470
4471 rc = readMasterJournal(pJournal, zMasterPtr, nMasterPtr);
4472 sqlite3OsClose(pJournal);
4473 if( rc!=SQLITE_OK ){
4474 goto delmaster_out;
4475 }
4476
4477 c = zMasterPtr[0]!=0 && strcmp(zMasterPtr, zMaster)==0;
4478 if( c ){
4479 /* We have a match. Do not delete the master journal file. */
4480 goto delmaster_out;
4481 }
4482 }
4483 zJournal += (sqlite3Strlen30(zJournal)+1);
4484 }
4485
4486 sqlite3OsClose(pMaster);
4487 rc = sqlite3OsDelete(pVfs, zMaster, 0);
4488
4489 delmaster_out:
4490 sqlite3_free(zMasterJournal);
4491 if( pMaster ){
4492 sqlite3OsClose(pMaster);
4493 assert( !isOpen(pJournal) );
4494 sqlite3_free(pMaster);
4495 }
4496 return rc;
4497 }
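
The loop over child journal names relies on the master journal being a sequence of NUL-terminated strings packed back to back. A minimal sketch of that traversal follows. It assumes a hypothetical helper `count_packed_names` and, as in the function above, a buffer whose final name is NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the NUL-terminated names packed into buf[0..n). The last name must
** be terminated within the buffer, which pager_delmaster() guarantees by
** writing a 0 byte just past the data it reads. */
static int count_packed_names(const char *buf, size_t n){
  const char *z = buf;
  int count = 0;
  assert( buf!=NULL );
  while( (size_t)(z - buf) < n ){
    count++;
    z += strlen(z) + 1;   /* skip the name and its terminator */
  }
  return count;
}
```

In pager_delmaster() each step of this walk also checks whether the named child journal exists and whether it still points back at zMaster.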
4498
4499
4500 /*
4501 ** This function is used to change the actual size of the database
4502 ** file in the file-system. This only happens when committing a transaction,
4503 ** or rolling back a transaction (including rolling back a hot-journal).
4504 **
4505 ** If the main database file is not open, or the pager is not in either
4506 ** DBMOD or OPEN state, this function is a no-op. Otherwise, the size
4507 ** of the file is changed to nPage pages (nPage*pPager->pageSize bytes).
4508 ** If the file on disk is currently larger than nPage pages, then use the VFS
4509 ** xTruncate() method to truncate it.
4510 **
4511 ** Or, it might be the case that the file on disk is smaller than
4512 ** nPage pages. Some operating system implementations can get confused if
4513 ** you try to truncate a file to some size that is larger than it
4514 ** currently is, so detect this case and write a single page of zeros at
4515 ** the end of the new file instead.
4516 **
4517 ** If successful, return SQLITE_OK. If an IO error occurs while modifying
4518 ** the database file, return the error code to the caller.
4519 */
4520 static int pager_truncate(Pager *pPager, Pgno nPage){
4521 int rc = SQLITE_OK;
4522 assert( pPager->eState!=PAGER_ERROR );
4523 assert( pPager->eState!=PAGER_READER );
4524
4525 if( isOpen(pPager->fd)
4526 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)
4527 ){
4528 i64 currentSize, newSize;
4529 int szPage = pPager->pageSize;
4530 assert( pPager->eLock==EXCLUSIVE_LOCK );
4531 /* TODO: Is it safe to use Pager.dbFileSize here? */
4532 rc = sqlite3OsFileSize(pPager->fd, &currentSize);
4533 newSize = szPage*(i64)nPage;
4534 if( rc==SQLITE_OK && currentSize!=newSize ){
4535 if( currentSize>newSize ){
4536 rc = sqlite3OsTruncate(pPager->fd, newSize);
4537 }else if( (currentSize+szPage)<=newSize ){
4538 char *pTmp = pPager->pTmpSpace;
4539 memset(pTmp, 0, szPage);
4540 testcase( (newSize-szPage) == currentSize );
4541 testcase( (newSize-szPage) > currentSize );
4542 rc = sqlite3OsWrite(pPager->fd, pTmp, szPage, newSize-szPage);
4543 }
4544 if( rc==SQLITE_OK ){
4545 pPager->dbFileSize = nPage;
4546 }
4547 }
4548 }
4549 return rc;
4550 }
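
The size decision in pager_truncate() can be separated from the I/O. The sketch below is illustrative only: `resize_action` and `plan_resize` are hypothetical names. It mirrors the two branches above, truncating when the file is too large and otherwise writing one zero-filled page at `newSize-szPage` when the file is at least one page short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
  RESIZE_NONE,            /* leave the file alone */
  RESIZE_TRUNCATE,        /* shrink the file to the new size */
  RESIZE_WRITE_LAST_PAGE  /* extend by writing a zeroed final page */
} resize_action;

/* Decide how to bring a file of `current` bytes to n_page pages. When the
** result is RESIZE_WRITE_LAST_PAGE, *write_ofst is the offset of the page
** to write; otherwise it is set to -1. */
static resize_action plan_resize(
  int64_t current, int64_t page_size, int64_t n_page, int64_t *write_ofst
){
  int64_t new_size = page_size * n_page;
  assert( page_size>0 && write_ofst!=NULL );
  *write_ofst = -1;
  if( current>new_size ) return RESIZE_TRUNCATE;
  if( current+page_size<=new_size ){
    *write_ofst = new_size - page_size;
    return RESIZE_WRITE_LAST_PAGE;
  }
  return RESIZE_NONE;
}
```

A file that is short by less than one page falls through to RESIZE_NONE here, matching the `(currentSize+szPage)<=newSize` test in the original.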
4551
4552 /*
4553 ** Return a sanitized version of the sector-size of OS file pFile. The
4554 ** return value is guaranteed to lie between 32 and MAX_SECTOR_SIZE.
4555 */
4556 SQLITE_PRIVATE int sqlite3SectorSize(sqlite3_file *pFile){
4557 int iRet = sqlite3OsSectorSize(pFile);
4558 if( iRet<32 ){
4559 iRet = 512;
4560 }else if( iRet>MAX_SECTOR_SIZE ){
4561 assert( MAX_SECTOR_SIZE>=512 );
4562 iRet = MAX_SECTOR_SIZE;
4563 }
4564 return iRet;
4565 }
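
The clamping rule in sqlite3SectorSize() can be exercised on its own. In this sketch `clamp_sector_size` is a hypothetical name, and the upper bound is passed as a parameter because the value of MAX_SECTOR_SIZE is not shown in this part of the file.

```c
#include <assert.h>

/* Sanitize a sector size reported by the VFS: values below 32 are replaced
** by 512, values above max_sector are clamped to max_sector, and anything
** in between is returned unchanged. */
static int clamp_sector_size(int reported, int max_sector){
  assert( max_sector>=512 );
  if( reported<32 ) return 512;
  if( reported>max_sector ) return max_sector;
  return reported;
}
```

A too-small value is replaced by 512, not raised to 32, so a VFS that reports 0 or a negative size still gets a usable sector size.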
4566
4567 /*
4568 ** Set the value of the Pager.sectorSize variable for the given
4569 ** pager based on the value returned by the xSectorSize method
4570 ** of the open database file. The sector size will be used
4571 ** to determine the size and alignment of journal header and
4572 ** master journal pointers within created journal files.
4573 **
4574 ** For temporary files the effective sector size is always 512 bytes.
4575 **
4576 ** Otherwise, for non-temporary files, the effective sector size is
4577 ** the value returned by the xSectorSize() method, replaced by 512 if
4578 ** it is less than 32, or rounded down to MAX_SECTOR_SIZE if it
4579 ** is greater than MAX_SECTOR_SIZE.
4580 **
4581 ** If the file has the SQLITE_IOCAP_POWERSAFE_OVERWRITE property, then set
4582 ** the effective sector size to its minimum value (512). The purpose of
4583 ** pPager->sectorSize is to define the "blast radius" of bytes that
4584 ** might change if a crash occurs while writing to a single byte in
4585 ** that range. But with POWERSAFE_OVERWRITE, the blast radius is zero
4586 ** (that is what POWERSAFE_OVERWRITE means), so we minimize the sector
4587 ** size. For backwards compatibility of the rollback journal file format,
4588 ** we cannot reduce the effective sector size below 512.
4589 */
4590 static void setSectorSize(Pager *pPager){
4591 assert( isOpen(pPager->fd) || pPager->tempFile );
4592
4593 if( pPager->tempFile
4594 || (sqlite3OsDeviceCharacteristics(pPager->fd) &
4595 SQLITE_IOCAP_POWERSAFE_OVERWRITE)!=0
4596 ){
4597 /* Sector size doesn't matter for temporary files. Also, the file
4598 ** may not have been opened yet, in which case the OsSectorSize()
4599 ** call will segfault. */
4600 pPager->sectorSize = 512;
4601 }else{
4602 pPager->sectorSize = sqlite3SectorSize(pPager->fd);
4603 }
4604 }
4605
4606 /*
4607 ** Playback the journal and thus restore the database file to
4608 ** the state it was in before we started making changes.
4609 **
4610 ** The journal file format is as follows:
4611 **
4612 ** (1) 8 byte prefix. A copy of aJournalMagic[].
4613 ** (2) 4 byte big-endian integer which is the number of valid page records
4614 ** in the journal. If this value is 0xffffffff, then compute the
4615 ** number of page records from the journal size.
4616 ** (3) 4 byte big-endian integer which is the initial value for the
4617 ** sanity checksum.
4618 ** (4) 4 byte integer which is the number of pages to truncate the
4619 ** database to during a rollback.
4620 ** (5) 4 byte big-endian integer which is the sector size. The header
4621 ** is this many bytes in size.
4622 ** (6) 4 byte big-endian integer which is the page size.
4623 ** (7) zero padding out to the next sector size.
4624 ** (8) Zero or more page instances, each as follows:
4625 ** + 4 byte page number.
4626 ** + pPager->pageSize bytes of data.
4627 ** + 4 byte checksum
4628 **
4629 ** When we speak of the journal header, we mean the first 7 items above.
4630 ** Each entry in the journal is an instance of the 8th item.
4631 **
4632 ** Call the value from the second bullet "nRec". nRec is the number of
4633 ** valid page entries in the journal. In most cases, you can compute the
4634 ** value of nRec from the size of the journal file. But if a power
4635 ** failure occurred while the journal was being written, it could be the
4636 ** case that the size of the journal file had already been increased but
4637 ** the extra entries had not yet made it safely to disk. In such a case,
4638 ** the value of nRec computed from the file size would be too large. For
4639 ** that reason, we always use the nRec value in the header.
4640 **
4641 ** If the nRec value is 0xffffffff it means that nRec should be computed
4642 ** from the file size. This value is used when the user selects the
4643 ** no-sync option for the journal. A power failure could lead to corruption
4644 ** in this case. But for things like temporary tables (which will be
4645 ** deleted when the power is restored) we don't care.
4646 **
4647 ** If the file opened as the journal file is not a well-formed
4648 ** journal file then all pages up to the first corrupted page are rolled
4649 ** back (or no pages if the journal header is corrupted). The journal file
4650 ** is then deleted and SQLITE_OK returned, just as if no corruption had
4651 ** been encountered.
4652 **
4653 ** If an I/O or malloc() error occurs, the journal-file is not deleted
4654 ** and an error code is returned.
4655 **
4656 ** The isHot parameter indicates that we are trying to rollback a journal
4657 ** that might be a hot journal. Or, it could be that the journal is
4658 ** preserved because of JOURNALMODE_PERSIST or JOURNALMODE_TRUNCATE.
4659 ** If the journal really is hot, reset the pager cache prior to rolling
4660 ** back any content. If the journal is merely persistent, no reset is
4661 ** needed.
4662 */
4663 static int pager_playback(Pager *pPager, int isHot){
4664 sqlite3_vfs *pVfs = pPager->pVfs;
4665 i64 szJ; /* Size of the journal file in bytes */
4666 u32 nRec; /* Number of Records in the journal */
4667 u32 u; /* Unsigned loop counter */
4668 Pgno mxPg = 0; /* Size of the original file in pages */
4669 int rc; /* Result code of a subroutine */
4670 int res = 1; /* Value returned by sqlite3OsAccess() */
4671 char *zMaster = 0; /* Name of master journal file if any */
4672 int needPagerReset; /* True to reset page prior to first page rollback */
4673 int nPlayback = 0; /* Total number of pages restored from journal */
4674
4675 /* Figure out how many records are in the journal. Abort early if
4676 ** the journal is empty.
4677 */
4678 assert( isOpen(pPager->jfd) );
4679 rc = sqlite3OsFileSize(pPager->jfd, &szJ);
4680 if( rc!=SQLITE_OK ){
4681 goto end_playback;
4682 }
4683
4684 /* Read the master journal name from the journal, if it is present.
4685 ** If a master journal file name is specified, but the file is not
4686 ** present on disk, then the journal is not hot and does not need to be
4687 ** played back.
4688 **
4689 ** TODO: Technically the following is an error because it assumes that
4690 ** buffer Pager.pTmpSpace is (mxPathname+1) bytes or larger. i.e. that
4691 ** (pPager->pageSize >= pPager->pVfs->mxPathname+1). Using os_unix.c,
4692 ** mxPathname is 512, which is the same as the minimum allowable value
4693 ** for pageSize.
4694 */
4695 zMaster = pPager->pTmpSpace;
4696 rc = readMasterJournal(pPager->jfd, zMaster, pPager->pVfs->mxPathname+1);
4697 if( rc==SQLITE_OK && zMaster[0] ){
4698 rc = sqlite3OsAccess(pVfs, zMaster, SQLITE_ACCESS_EXISTS, &res);
4699 }
4700 zMaster = 0;
4701 if( rc!=SQLITE_OK || !res ){
4702 goto end_playback;
4703 }
4704 pPager->journalOff = 0;
4705 needPagerReset = isHot;
4706
4707 /* This loop terminates either when a readJournalHdr() or
4708 ** pager_playback_one_page() call returns SQLITE_DONE or an IO error
4709 ** occurs.
4710 */
4711 while( 1 ){
4712 /* Read the next journal header from the journal file. If there are
4713 ** not enough bytes left in the journal file for a complete header, or
4714 ** it is corrupted, then a process must have failed while writing it.
4715 ** This indicates nothing more needs to be rolled back.
4716 */
4717 rc = readJournalHdr(pPager, isHot, szJ, &nRec, &mxPg);
4718 if( rc!=SQLITE_OK ){
4719 if( rc==SQLITE_DONE ){
4720 rc = SQLITE_OK;
4721 }
4722 goto end_playback;
4723 }
4724
4725 /* If nRec is 0xffffffff, then this journal was created by a process
4726 ** working in no-sync mode. This means that the rest of the journal
4727 ** file consists of pages, there are no more journal headers. Compute
4728 ** the value of nRec based on this assumption.
4729 */
4730 if( nRec==0xffffffff ){
4731 assert( pPager->journalOff==JOURNAL_HDR_SZ(pPager) );
4732 nRec = (int)((szJ - JOURNAL_HDR_SZ(pPager))/JOURNAL_PG_SZ(pPager));
4733 }
4734
4735 /* If nRec is 0 and this rollback is of a transaction created by this
4736 ** process and if this is the final header in the journal, then it means
4737 ** that this part of the journal was being filled but has not yet been
4738 ** synced to disk. Compute the number of pages based on the remaining
4739 ** size of the file.
4740 **
4741 ** The third term of the test was added to fix ticket #2565.
4742 ** When rolling back a hot journal, nRec==0 always means that the next
4743 ** chunk of the journal contains zero pages to be rolled back. But
4744 ** when doing a ROLLBACK and the nRec==0 chunk is the last chunk in
4745 ** the journal, it means that the journal might contain additional
4746 ** pages that need to be rolled back and that the number of pages
4747 ** should be computed based on the journal file size.
4748 */
4749 if( nRec==0 && !isHot &&
4750 pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff ){
4751 nRec = (int)((szJ - pPager->journalOff) / JOURNAL_PG_SZ(pPager));
4752 }
4753
4754 /* If this is the first header read from the journal, truncate the
4755 ** database file back to its original size.
4756 */
4757 if( pPager->journalOff==JOURNAL_HDR_SZ(pPager) ){
4758 rc = pager_truncate(pPager, mxPg);
4759 if( rc!=SQLITE_OK ){
4760 goto end_playback;
4761 }
4762 pPager->dbSize = mxPg;
4763 }
4764
4765 /* Copy original pages out of the journal and back into the
4766 ** database file and/or page cache.
4767 */
4768 for(u=0; u<nRec; u++){
4769 if( needPagerReset ){
4770 pager_reset(pPager);
4771 needPagerReset = 0;
4772 }
4773 rc = pager_playback_one_page(pPager,&pPager->journalOff,0,1,0);
4774 if( rc==SQLITE_OK ){
4775 nPlayback++;
4776 }else{
4777 if( rc==SQLITE_DONE ){
4778 pPager->journalOff = szJ;
4779 break;
4780 }else if( rc==SQLITE_IOERR_SHORT_READ ){
4781 /* If the journal has been truncated, simply stop reading and
4782 ** processing the journal. This might happen if the journal was
4783 ** not completely written and synced prior to a crash. In that
4784 ** case, the database should have never been written in the
4785 ** first place so it is OK to simply abandon the rollback. */
4786 rc = SQLITE_OK;
4787 goto end_playback;
4788 }else{
4789 /* If we are unable to rollback, quit and return the error
4790 ** code. This will cause the pager to enter the error state
4791 ** so that no further harm will be done. Perhaps the next
4792 ** process to come along will be able to rollback the database.
4793 */
4794 goto end_playback;
4795 }
4796 }
4797 }
4798 }
4799 /*NOTREACHED*/
4800 assert( 0 );
4801
4802 end_playback:
4803 /* Following a rollback, the database file should be back in its original
4804 ** state prior to the start of the transaction, so invoke the
4805 ** SQLITE_FCNTL_DB_UNCHANGED file-control method to disable the
4806 ** assertion that the transaction counter was modified.
4807 */
4808 #ifdef SQLITE_DEBUG
4809 if( pPager->fd->pMethods ){
4810 sqlite3OsFileControlHint(pPager->fd,SQLITE_FCNTL_DB_UNCHANGED,0);
4811 }
4812 #endif
4813
4814 /* If this playback is happening automatically as a result of an IO or
4815 ** malloc error that occurred after the change-counter was updated but
4816 ** before the transaction was committed, then the change-counter
4817 ** modification may just have been reverted. If this happens in exclusive
4818 ** mode, then subsequent transactions performed by the connection will not
4819 ** update the change-counter at all. This may lead to cache inconsistency
4820 ** problems for other processes at some point in the future. So, just
4821 ** in case this has happened, clear the changeCountDone flag now.
4822 */
4823 pPager->changeCountDone = pPager->tempFile;
4824
4825 if( rc==SQLITE_OK ){
4826 zMaster = pPager->pTmpSpace;
4827 rc = readMasterJournal(pPager->jfd, zMaster, pPager->pVfs->mxPathname+1);
4828 testcase( rc!=SQLITE_OK );
4829 }
4830 if( rc==SQLITE_OK
4831 && (pPager->eState>=PAGER_WRITER_DBMOD || pPager->eState==PAGER_OPEN)
4832 ){
4833 rc = sqlite3PagerSync(pPager, 0);
4834 }
4835 if( rc==SQLITE_OK ){
4836 rc = pager_end_transaction(pPager, zMaster[0]!='\0', 0);
4837 testcase( rc!=SQLITE_OK );
4838 }
4839 if( rc==SQLITE_OK && zMaster[0] && res ){
4840 /* If there was a master journal and this routine will return success,
4841 ** see if it is possible to delete the master journal.
4842 */
4843 rc = pager_delmaster(pPager, zMaster);
4844 testcase( rc!=SQLITE_OK );
4845 }
4846 if( isHot && nPlayback ){
4847 sqlite3_log(SQLITE_NOTICE_RECOVER_ROLLBACK, "recovered %d pages from %s",
4848 nPlayback, pPager->zJournal);
4849 }
4850
4851 /* The Pager.sectorSize variable may have been updated while rolling
4852 ** back a journal created by a process with a different sector size
4853 ** value. Reset it to the correct value for this process.
4854 */
4855 setSectorSize(pPager);
4856 return rc;
4857 }
4858
4859
4860 /*
4861 ** Read the content for page pPg out of the database file and into
4862 ** pPg->pData. A shared lock or greater must be held on the database
4863 ** file before this function is called.
4864 **
4865 ** If page 1 is read, then the value of Pager.dbFileVers[] is set to
4866 ** the value read from the database file.
4867 **
4868 ** If an IO error occurs, then the IO error is returned to the caller.
4869 ** Otherwise, SQLITE_OK is returned.
4870 */
4871 static int readDbPage(PgHdr *pPg, u32 iFrame){
4872 Pager *pPager = pPg->pPager; /* Pager object associated with page pPg */
4873 Pgno pgno = pPg->pgno; /* Page number to read */
4874 int rc = SQLITE_OK; /* Return code */
4875 int pgsz = pPager->pageSize; /* Number of bytes to read */
4876
4877 assert( pPager->eState>=PAGER_READER && !MEMDB );
4878 assert( isOpen(pPager->fd) );
4879
4880 #ifndef SQLITE_OMIT_WAL
4881 if( iFrame ){
4882 /* Try to pull the page from the write-ahead log. */
4883 rc = sqlite3WalReadFrame(pPager->pWal, iFrame, pgsz, pPg->pData);
4884 }else
4885 #endif
4886 {
4887 i64 iOffset = (pgno-1)*(i64)pPager->pageSize;
4888 rc = sqlite3OsRead(pPager->fd, pPg->pData, pgsz, iOffset);
4889 if( rc==SQLITE_IOERR_SHORT_READ ){
4890 rc = SQLITE_OK;
4891 }
4892 }
4893
4894 if( pgno==1 ){
4895 if( rc ){
4896 /* If the read is unsuccessful, set the dbFileVers[] to something
4897 ** that will never be a valid file version. dbFileVers[] is a copy
4898 ** of bytes 24..39 of the database. Bytes 28..31 should always be
4899 ** zero or the size of the database in pages. Bytes 32..35 and 36..39
4900 ** should be page numbers which are never 0xffffffff. So filling
4901 ** pPager->dbFileVers[] with all 0xff bytes should suffice.
4902 **
4903 ** For an encrypted database, the situation is more complex: bytes
4904 ** 24..39 of the database are white noise. But the probability of
4905 ** white noise equaling 16 bytes of 0xff is vanishingly small so
4906 ** we should still be ok.
4907 */
4908 memset(pPager->dbFileVers, 0xff, sizeof(pPager->dbFileVers));
4909 }else{
4910 u8 *dbFileVers = &((u8*)pPg->pData)[24];
4911 memcpy(&pPager->dbFileVers, dbFileVers, sizeof(pPager->dbFileVers));
4912 }
4913 }
4914 CODEC1(pPager, pPg->pData, pgno, 3, rc = SQLITE_NOMEM_BKPT);
4915
4916 PAGER_INCR(sqlite3_pager_readdb_count);
4917 PAGER_INCR(pPager->nRead);
4918 IOTRACE(("PGIN %p %d\n", pPager, pgno));
4919 PAGERTRACE(("FETCH %d page %d hash(%08x)\n",
4920 PAGERID(pPager), pgno, pager_pagehash(pPg)));
4921
4922 return rc;
4923 }
4924
4925 /*
4926 ** Update the value of the change-counter at offsets 24 and 92 in
4927 ** the header and the sqlite version number at offset 96.
4928 **
4929 ** This is an unconditional update. See also the pager_incr_changecounter()
4930 ** routine which only updates the change-counter if the update is actually
4931 ** needed, as determined by the pPager->changeCountDone state variable.
4932 */
4933 static void pager_write_changecounter(PgHdr *pPg){
4934 u32 change_counter;
4935
4936 /* Increment the value just read and write it back to byte 24. */
4937 change_counter = sqlite3Get4byte((u8*)pPg->pPager->dbFileVers)+1;
4938 put32bits(((char*)pPg->pData)+24, change_counter);
4939
4940 /* Also store the SQLite version number in bytes 96..99 and in
4941 ** bytes 92..95 store the change counter for which the version number
4942 ** is valid. */
4943 put32bits(((char*)pPg->pData)+92, change_counter);
4944 put32bits(((char*)pPg->pData)+96, SQLITE_VERSION_NUMBER);
4945 }
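The header update performed by pager_write_changecounter() can be illustrated in isolation. The following is a minimal sketch, not the pager's own code: the helper names put_be32(), get_be32() and write_change_counter() are hypothetical, and the sketch assumes the header fields are stored big-endian at byte offsets 24, 92 and 96 as the comments above describe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value in big-endian byte order. */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Hypothetical helper: load a 32-bit big-endian value. */
static uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Hypothetical stand-in for the unconditional update: take the counter
** cached in fileVers (a copy of header bytes 24..39), add one, and write
** it to offsets 24 and 92 of page 1, with the version number at 96. */
static void write_change_counter(
  unsigned char *page1,            /* Buffer holding page 1 (>=100 bytes) */
  const unsigned char *fileVers,   /* Cached copy of header bytes 24..39 */
  uint32_t versionNumber           /* Library version number to record */
){
  uint32_t cc = get_be32(fileVers) + 1;
  put_be32(&page1[24], cc);
  put_be32(&page1[92], cc);
  put_be32(&page1[96], versionNumber);
}
```

Writing the same counter at offset 92 records which change-counter value the version number at offset 96 was written for.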
4946
4947 #ifndef SQLITE_OMIT_WAL
4948 /*
4949 ** This function is invoked once for each page that has already been
4950 ** written into the log file when a WAL transaction is rolled back.
4951 ** Parameter iPg is the page number of said page. The pCtx argument
4952 ** is actually a pointer to the Pager structure.
4953 **
4954 ** If page iPg is present in the cache, and has no outstanding references,
4955 ** it is discarded. Otherwise, if there are one or more outstanding
4956 ** references, the page content is reloaded from the database. If the
4957 ** attempt to reload content from the database is required and fails,
4958 ** return an SQLite error code. Otherwise, SQLITE_OK.
4959 */
4960 static int pagerUndoCallback(void *pCtx, Pgno iPg){
4961 int rc = SQLITE_OK;
4962 Pager *pPager = (Pager *)pCtx;
4963 PgHdr *pPg;
4964
4965 assert( pagerUseWal(pPager) );
4966 pPg = sqlite3PagerLookup(pPager, iPg);
4967 if( pPg ){
4968 if( sqlite3PcachePageRefcount(pPg)==1 ){
4969 sqlite3PcacheDrop(pPg);
4970 }else{
4971 u32 iFrame = 0;
4972 rc = sqlite3WalFindFrame(pPager->pWal, pPg->pgno, &iFrame);
4973 if( rc==SQLITE_OK ){
4974 rc = readDbPage(pPg, iFrame);
4975 }
4976 if( rc==SQLITE_OK ){
4977 pPager->xReiniter(pPg);
4978 }
4979 sqlite3PagerUnrefNotNull(pPg);
4980 }
4981 }
4982
4983 /* Normally, if a transaction is rolled back, any backup processes are
4984 ** updated as data is copied out of the rollback journal and into the
4985 ** database. This is not generally possible with a WAL database, as
4986 ** rollback involves simply truncating the log file. Therefore, if one
4987 ** or more frames have already been written to the log (and therefore
4988 ** also copied into the backup databases) as part of this transaction,
4989 ** the backups must be restarted.
4990 */
4991 sqlite3BackupRestart(pPager->pBackup);
4992
4993 return rc;
4994 }
4995
4996 /*
4997 ** This function is called to rollback a transaction on a WAL database.
4998 */
4999 static int pagerRollbackWal(Pager *pPager){
5000 int rc; /* Return Code */
5001 PgHdr *pList; /* List of dirty pages to revert */
5002
5003 /* For all pages in the cache that are currently dirty or have already
5004 ** been written (but not committed) to the log file, do one of the
5005 ** following:
5006 **
5007 ** + Discard the cached page (if refcount==0), or
5008 ** + Reload page content from the database (if refcount>0).
5009 */
5010 pPager->dbSize = pPager->dbOrigSize;
5011 rc = sqlite3WalUndo(pPager->pWal, pagerUndoCallback, (void *)pPager);
5012 pList = sqlite3PcacheDirtyList(pPager->pPCache);
5013 while( pList && rc==SQLITE_OK ){
5014 PgHdr *pNext = pList->pDirty;
5015 rc = pagerUndoCallback((void *)pPager, pList->pgno);
5016 pList = pNext;
5017 }
5018
5019 return rc;
5020 }
5021
5022 /*
5023 ** This function is a wrapper around sqlite3WalFrames(). As well as logging
5024 ** the contents of the list of pages headed by pList (connected by pDirty),
5025 ** this function notifies any active backup processes that the pages have
5026 ** changed.
5027 **
5028 ** The list of pages passed into this routine is always sorted by page number.
5029 ** Hence, if page 1 appears anywhere on the list, it will be the first page.
5030 */
5031 static int pagerWalFrames(
5032 Pager *pPager, /* Pager object */
5033 PgHdr *pList, /* List of frames to log */
5034 Pgno nTruncate, /* Database size after this commit */
5035 int isCommit /* True if this is a commit */
5036 ){
5037 int rc; /* Return code */
5038 int nList; /* Number of pages in pList */
5039 PgHdr *p; /* For looping over pages */
5040
5041 assert( pPager->pWal );
5042 assert( pList );
5043 #ifdef SQLITE_DEBUG
5044 /* Verify that the page list is in ascending order */
5045 for(p=pList; p && p->pDirty; p=p->pDirty){
5046 assert( p->pgno < p->pDirty->pgno );
5047 }
5048 #endif
5049
5050 assert( pList->pDirty==0 || isCommit );
5051 if( isCommit ){
5052 /* If a WAL transaction is being committed, there is no point in writing
5053 ** any pages with page numbers greater than nTruncate into the WAL file.
5054 ** They will never be read by any client. So remove them from the pDirty
5055 ** list here. */
5056 PgHdr **ppNext = &pList;
5057 nList = 0;
5058 for(p=pList; (*ppNext = p)!=0; p=p->pDirty){
5059 if( p->pgno<=nTruncate ){
5060 ppNext = &p->pDirty;
5061 nList++;
5062 }
5063 }
5064 assert( pList );
5065 }else{
5066 nList = 1;
5067 }
5068 pPager->aStat[PAGER_STAT_WRITE] += nList;
5069
5070 if( pList->pgno==1 ) pager_write_changecounter(pList);
5071 rc = sqlite3WalFrames(pPager->pWal,
5072 pPager->pageSize, pList, nTruncate, isCommit, pPager->walSyncFlags
5073 );
5074 if( rc==SQLITE_OK && pPager->pBackup ){
5075 for(p=pList; p; p=p->pDirty){
5076 sqlite3BackupUpdate(pPager->pBackup, p->pgno, (u8 *)p->pData);
5077 }
5078 }
5079
5080 #ifdef SQLITE_CHECK_PAGES
5081 pList = sqlite3PcacheDirtyList(pPager->pPCache);
5082 for(p=pList; p; p=p->pDirty){
5083 pager_set_pagehash(p);
5084 }
5085 #endif
5086
5087 return rc;
5088 }
5089
5090 /*
5091 ** Begin a read transaction on the WAL.
5092 **
5093 ** This routine used to be called "pagerOpenSnapshot()" because it essentially
5094 ** makes a snapshot of the database at the current point in time and preserves
5095 ** that snapshot for use by the reader in spite of concurrent changes by
5096 ** other writers or checkpointers.
5097 */
5098 static int pagerBeginReadTransaction(Pager *pPager){
5099 int rc; /* Return code */
5100 int changed = 0; /* True if cache must be reset */
5101
5102 assert( pagerUseWal(pPager) );
5103 assert( pPager->eState==PAGER_OPEN || pPager->eState==PAGER_READER );
5104
5105 /* sqlite3WalEndReadTransaction() was not called for the previous
5106 ** transaction in locking_mode=EXCLUSIVE. So call it now. If we
5107 ** are in locking_mode=NORMAL and EndRead() was previously called,
5108 ** the duplicate call is harmless.
5109 */
5110 sqlite3WalEndReadTransaction(pPager->pWal);
5111
5112 rc = sqlite3WalBeginReadTransaction(pPager->pWal, &changed);
5113 if( rc!=SQLITE_OK || changed ){
5114 pager_reset(pPager);
5115 if( USEFETCH(pPager) ) sqlite3OsUnfetch(pPager->fd, 0, 0);
5116 }
5117
5118 return rc;
5119 }
5120 #endif
5121
5122 /*
5123 ** This function is called as part of the transition from PAGER_OPEN
5124 ** to PAGER_READER state to determine the size of the database file
5125 ** in pages (assuming the page size currently stored in Pager.pageSize).
5126 **
5127 ** If no error occurs, SQLITE_OK is returned and the size of the database
5128 ** in pages is stored in *pnPage. Otherwise, an error code (perhaps
5129 ** SQLITE_IOERR_FSTAT) is returned and *pnPage is left unmodified.
5130 */
5131 static int pagerPagecount(Pager *pPager, Pgno *pnPage){
5132 Pgno nPage; /* Value to return via *pnPage */
5133
5134 /* Query the WAL sub-system for the database size. The WalDbsize()
5135 ** function returns zero if the WAL is not open (i.e. Pager.pWal==0), or
5136 ** if the database size is not available. The database size is not
5137 ** available from the WAL sub-system if the log file is empty or
5138 ** contains no valid committed transactions.
5139 */
5140 assert( pPager->eState==PAGER_OPEN );
5141 assert( pPager->eLock>=SHARED_LOCK );
5142 assert( isOpen(pPager->fd) );
5143 assert( pPager->tempFile==0 );
5144 nPage = sqlite3WalDbsize(pPager->pWal);
5145
5146 /* If the number of pages in the database is not available from the
5147 ** WAL sub-system, determine the page count based on the size of
5148 ** the database file. If the size of the database file is not an
5149 ** integer multiple of the page-size, round up the result.
5150 */
5151 if( nPage==0 && ALWAYS(isOpen(pPager->fd)) ){
5152 i64 n = 0; /* Size of db file in bytes */
5153 int rc = sqlite3OsFileSize(pPager->fd, &n);
5154 if( rc!=SQLITE_OK ){
5155 return rc;
5156 }
5157 nPage = (Pgno)((n+pPager->pageSize-1) / pPager->pageSize);
5158 }
5159
5160 /* If the current number of pages in the file is greater than the
5161 ** configured maximum page number, increase the allowed limit so
5162 ** that the file can be read.
5163 */
5164 if( nPage>pPager->mxPgno ){
5165 pPager->mxPgno = (Pgno)nPage;
5166 }
5167
5168 *pnPage = nPage;
5169 return SQLITE_OK;
5170 }
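The round-up rule used by pagerPagecount() when the WAL cannot supply a size is plain ceiling division. A minimal sketch, assuming a hypothetical helper name page_count_round_up() and a positive page size:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of pageSize-byte pages needed to cover a
** file of nByte bytes. A trailing partial page counts as a whole page. */
static uint32_t page_count_round_up(int64_t nByte, int64_t pageSize){
  return (uint32_t)((nByte + pageSize - 1) / pageSize);
}
```

For example, with a 4096-byte page size a 4097-byte file counts as 2 pages, while an empty file counts as 0.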
5171
5172 #ifndef SQLITE_OMIT_WAL
5173 /*
5174 ** Check if the *-wal file that corresponds to the database opened by pPager
5175 ** exists if the database is not empty, or verify that the *-wal file does
5176 ** not exist (by deleting it) if the database file is empty.
5177 **
5178 ** If the database is not empty and the *-wal file exists, open the pager
5179 ** in WAL mode. If the database is empty or if no *-wal file exists and
5180 ** if no error occurs, make sure Pager.journalMode is not set to
5181 ** PAGER_JOURNALMODE_WAL.
5182 **
5183 ** Return SQLITE_OK or an error code.
5184 **
5185 ** The caller must hold a SHARED lock on the database file to call this
5186 ** function. Because an EXCLUSIVE lock on the db file is required to delete
5187 ** a WAL on a non-empty database, this ensures there is no race condition
5188 ** between the xAccess() below and an xDelete() being executed by some
5189 ** other connection.
5190 */
5191 static int pagerOpenWalIfPresent(Pager *pPager){
5192 int rc = SQLITE_OK;
5193 assert( pPager->eState==PAGER_OPEN );
5194 assert( pPager->eLock>=SHARED_LOCK );
5195
5196 if( !pPager->tempFile ){
5197 int isWal; /* True if WAL file exists */
5198 Pgno nPage; /* Size of the database file */
5199
5200 rc = pagerPagecount(pPager, &nPage);
5201 if( rc ) return rc;
5202 if( nPage==0 ){
5203 rc = sqlite3OsDelete(pPager->pVfs, pPager->zWal, 0);
5204 if( rc==SQLITE_IOERR_DELETE_NOENT ) rc = SQLITE_OK;
5205 isWal = 0;
5206 }else{
5207 rc = sqlite3OsAccess(
5208 pPager->pVfs, pPager->zWal, SQLITE_ACCESS_EXISTS, &isWal
5209 );
5210 }
5211 if( rc==SQLITE_OK ){
5212 if( isWal ){
5213 testcase( sqlite3PcachePagecount(pPager->pPCache)==0 );
5214 rc = sqlite3PagerOpenWal(pPager, 0);
5215 }else if( pPager->journalMode==PAGER_JOURNALMODE_WAL ){
5216 pPager->journalMode = PAGER_JOURNALMODE_DELETE;
5217 }
5218 }
5219 }
5220 return rc;
5221 }
5222 #endif
5223
5224 /*
5225 ** Playback savepoint pSavepoint. Or, if pSavepoint==NULL, then playback
5226 ** the entire master journal file. The case pSavepoint==NULL occurs when
5227 ** a ROLLBACK TO command is invoked on a SAVEPOINT that is a transaction
5228 ** savepoint.
5229 **
5230 ** When pSavepoint is not NULL (meaning a non-transaction savepoint is
5231 ** being rolled back), then the rollback consists of up to three stages,
5232 ** performed in the order specified:
5233 **
5234 ** * Pages are played back from the main journal starting at byte
5235 ** offset PagerSavepoint.iOffset and continuing to
5236 ** PagerSavepoint.iHdrOffset, or to the end of the main journal
5237 ** file if PagerSavepoint.iHdrOffset is zero.
5238 **
5239 ** * If PagerSavepoint.iHdrOffset is not zero, then pages are played
5240 ** back starting from the journal header immediately following
5241 ** PagerSavepoint.iHdrOffset to the end of the main journal file.
5242 **
5243 ** * Pages are then played back from the sub-journal file, starting
5244 ** with the PagerSavepoint.iSubRec and continuing to the end of
5245 ** the journal file.
5246 **
5247 ** Throughout the rollback process, each time a page is rolled back, the
5248 ** corresponding bit is set in a bitvec structure (variable pDone in the
5249 ** implementation below). This is used to ensure that a page is only
5250 ** rolled back the first time it is encountered in either journal.
5251 **
5252 ** If pSavepoint is NULL, then pages are only played back from the main
5253 ** journal file. There is no need for a bitvec in this case.
5254 **
5255 ** In either case, before playback commences the Pager.dbSize variable
5256 ** is reset to the value that it held at the start of the savepoint
5257 ** (or transaction). No page with a page-number greater than this value
5258 ** is played back. If one is encountered it is simply skipped.
5259 */
5260 static int pagerPlaybackSavepoint(Pager *pPager, PagerSavepoint *pSavepoint){
5261 i64 szJ; /* Effective size of the main journal */
5262 i64 iHdrOff; /* End of first segment of main-journal records */
5263 int rc = SQLITE_OK; /* Return code */
5264 Bitvec *pDone = 0; /* Bitvec to ensure pages played back only once */
5265
5266 assert( pPager->eState!=PAGER_ERROR );
5267 assert( pPager->eState>=PAGER_WRITER_LOCKED );
5268
5269 /* Allocate a bitvec to use to store the set of pages rolled back */
5270 if( pSavepoint ){
5271 pDone = sqlite3BitvecCreate(pSavepoint->nOrig);
5272 if( !pDone ){
5273 return SQLITE_NOMEM_BKPT;
5274 }
5275 }
5276
5277 /* Set the database size back to the value it was before the savepoint
5278 ** being reverted was opened.
5279 */
5280 pPager->dbSize = pSavepoint ? pSavepoint->nOrig : pPager->dbOrigSize;
5281 pPager->changeCountDone = pPager->tempFile;
5282
5283 if( !pSavepoint && pagerUseWal(pPager) ){
5284 return pagerRollbackWal(pPager);
5285 }
5286
5287 /* Use pPager->journalOff as the effective size of the main rollback
5288 ** journal. The actual file might be larger than this in
5289 ** PAGER_JOURNALMODE_TRUNCATE or PAGER_JOURNALMODE_PERSIST. But anything
5290 ** past pPager->journalOff is off-limits to us.
5291 */
5292 szJ = pPager->journalOff;
5293 assert( pagerUseWal(pPager)==0 || szJ==0 );
5294
5295 /* Begin by rolling back records from the main journal starting at
5296 ** PagerSavepoint.iOffset and continuing to the next journal header.
5297 ** There might be records in the main journal that have a page number
5298 ** greater than the current database size (pPager->dbSize) but those
5299 ** will be skipped automatically. Pages are added to pDone as they
5300 ** are played back.
5301 */
5302 if( pSavepoint && !pagerUseWal(pPager) ){
5303 iHdrOff = pSavepoint->iHdrOffset ? pSavepoint->iHdrOffset : szJ;
5304 pPager->journalOff = pSavepoint->iOffset;
5305 while( rc==SQLITE_OK && pPager->journalOff<iHdrOff ){
5306 rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);
5307 }
5308 assert( rc!=SQLITE_DONE );
5309 }else{
5310 pPager->journalOff = 0;
5311 }
5312
5313 /* Continue rolling back records out of the main journal starting at
5314 ** the first journal header seen and continuing until the effective end
5315 ** of the main journal file. Continue to skip out-of-range pages and
5316 ** continue adding pages rolled back to pDone.
5317 */
5318 while( rc==SQLITE_OK && pPager->journalOff<szJ ){
5319 u32 ii; /* Loop counter */
5320 u32 nJRec = 0; /* Number of Journal Records */
5321 u32 dummy;
5322 rc = readJournalHdr(pPager, 0, szJ, &nJRec, &dummy);
5323 assert( rc!=SQLITE_DONE );
5324
5325 /*
5326 ** The "pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff"
5327 ** test is related to ticket #2565. See the discussion in the
5328 ** pager_playback() function for additional information.
5329 */
5330 if( nJRec==0
5331 && pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff
5332 ){
5333 nJRec = (u32)((szJ - pPager->journalOff)/JOURNAL_PG_SZ(pPager));
5334 }
5335 for(ii=0; rc==SQLITE_OK && ii<nJRec && pPager->journalOff<szJ; ii++){
5336 rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);
5337 }
5338 assert( rc!=SQLITE_DONE );
5339 }
5340 assert( rc!=SQLITE_OK || pPager->journalOff>=szJ );
5341
5342 /* Finally, rollback pages from the sub-journal. Pages that were
5343 ** previously rolled back out of the main journal (and are hence in pDone)
5344 ** will be skipped. Out-of-range pages are also skipped.
5345 */
5346 if( pSavepoint ){
5347 u32 ii; /* Loop counter */
5348 i64 offset = (i64)pSavepoint->iSubRec*(4+pPager->pageSize);
5349
5350 if( pagerUseWal(pPager) ){
5351 rc = sqlite3WalSavepointUndo(pPager->pWal, pSavepoint->aWalData);
5352 }
5353 for(ii=pSavepoint->iSubRec; rc==SQLITE_OK && ii<pPager->nSubRec; ii++){
5354 assert( offset==(i64)ii*(4+pPager->pageSize) );
5355 rc = pager_playback_one_page(pPager, &offset, pDone, 0, 1);
5356 }
5357 assert( rc!=SQLITE_DONE );
5358 }
5359
5360 sqlite3BitvecDestroy(pDone);
5361 if( rc==SQLITE_OK ){
5362 pPager->journalOff = szJ;
5363 }
5364
5365 return rc;
5366 }
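Two details of pagerPlaybackSavepoint() can be sketched on their own: the "restore each page only the first time it is seen" rule enforced through pDone, and the sub-journal offset arithmetic. This is an illustration only. The fixed-size bitmap DoneSet, SKETCH_MAX_PAGES, and the helpers should_play_back() and subjournal_offset() are hypothetical and stand in for the real Bitvec and pager_playback_one_page(). The record size of 4+pageSize bytes is taken from the offset computation in the function above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_PAGES 64  /* Assumption: tiny fixed bitmap for the sketch */

typedef struct DoneSet {
  unsigned char bits[SKETCH_MAX_PAGES/8];  /* One bit per page number */
} DoneSet;

/* Decide whether a journal record for page pgno should be restored.
** Returns 1 the first time an in-range page is seen, 0 if the page lies
** beyond dbSize or was already rolled back from either journal. */
static int should_play_back(DoneSet *pDone, uint32_t pgno, uint32_t dbSize){
  uint32_t i;
  if( pgno==0 || pgno>dbSize || pgno>SKETCH_MAX_PAGES ) return 0;
  i = pgno - 1;
  if( (pDone->bits[i/8] >> (i%8)) & 1 ) return 0;   /* Already done */
  pDone->bits[i/8] |= (unsigned char)(1u << (i%8)); /* Mark as done */
  return 1;
}

/* Byte offset of sub-journal record iSubRec. Each record is taken to be a
** 4-byte page number followed by pageSize bytes of page content. */
static int64_t subjournal_offset(uint32_t iSubRec, int64_t pageSize){
  return (int64_t)iSubRec * (4 + pageSize);
}
```

Because the main journal is scanned before the sub-journal, the first record seen for a page is the one that gets restored, and any later record for the same page is ignored.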
5367
5368 /*
5369 ** Change the maximum number of in-memory pages that are allowed
5370 ** before attempting to recycle clean and unused pages.
5371 */
5372 SQLITE_PRIVATE void sqlite3PagerSetCachesize(Pager *pPager, int mxPage){
5373 sqlite3PcacheSetCachesize(pPager->pPCache, mxPage);
5374 }
5375
5376 /*
5377 ** Change the maximum number of in-memory pages that are allowed
5378 ** before attempting to spill pages to journal.
5379 */
5380 SQLITE_PRIVATE int sqlite3PagerSetSpillsize(Pager *pPager, int mxPage){
5381 return sqlite3PcacheSetSpillsize(pPager->pPCache, mxPage);
5382 }
5383
5384 /*
5385 ** Invoke SQLITE_FCNTL_MMAP_SIZE based on the current value of szMmap.
5386 */
5387 static void pagerFixMaplimit(Pager *pPager){
5388 #if SQLITE_MAX_MMAP_SIZE>0
5389 sqlite3_file *fd = pPager->fd;
5390 if( isOpen(fd) && fd->pMethods->iVersion>=3 ){
5391 sqlite3_int64 sz;
5392 sz = pPager->szMmap;
5393 pPager->bUseFetch = (sz>0);
5394 setGetterMethod(pPager);
5395 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_MMAP_SIZE, &sz);
5396 }
5397 #endif
5398 }
5399
5400 /*
5401 ** Change the maximum size of any memory mapping made of the database file.
5402 */
5403 SQLITE_PRIVATE void sqlite3PagerSetMmapLimit(Pager *pPager, sqlite3_int64 szMmap ){
5404 pPager->szMmap = szMmap;
5405 pagerFixMaplimit(pPager);
5406 }
5407
5408 /*
5409 ** Free as much memory as possible from the pager.
5410 */
5411 SQLITE_PRIVATE void sqlite3PagerShrink(Pager *pPager){
5412 sqlite3PcacheShrink(pPager->pPCache);
5413 }
5414
5415 /*
5416 ** Adjust settings of the pager to those specified in the pgFlags parameter.
5417 **
5418 ** The "level" in pgFlags & PAGER_SYNCHRONOUS_MASK sets the robustness
5419 ** of the database to damage due to OS crashes or power failures by
5420 ** changing the number of sync()s when writing the journals.
5421 ** There are four levels:
5422 **
5423 ** OFF sqlite3OsSync() is never called. This is the default
5424 ** for temporary and transient files.
5425 **
5426 ** NORMAL The journal is synced once before writes begin on the
5427 ** database. This is normally adequate protection, but
5428 ** it is theoretically possible, though very unlikely,
5429 ** that an inopportune power failure could leave the journal
5430 ** in a state which would cause damage to the database
5431 ** when it is rolled back.
5432 **
5433 ** FULL The journal is synced twice before writes begin on the
5434 ** database (with some additional information - the nRec field
5435 ** of the journal header - being written in between the two
5436 ** syncs). If we assume that writing a
5437 ** single disk sector is atomic, then this mode provides
5438 ** assurance that the journal will not be corrupted to the
5439 ** point of causing damage to the database during rollback.
5440 **
5441 ** EXTRA This is like FULL except that it also syncs the directory
5442 ** that contains the rollback journal after the rollback
5443 ** journal is unlinked.
5444 **
5445 ** The above is for a rollback-journal mode. For WAL mode, OFF continues
5446 ** to mean that no syncs ever occur. NORMAL means that the WAL is synced
5447 ** prior to the start of checkpoint and that the database file is synced
5448 ** at the conclusion of the checkpoint if the entire content of the WAL
5449 ** was written back into the database. But no sync operations occur for
5450 ** an ordinary commit in NORMAL mode with WAL. FULL means that the WAL
5451 ** file is synced following each commit operation, in addition to the
5452 ** syncs associated with NORMAL. There is no difference between FULL
5453 ** and EXTRA for WAL mode.
5454 **
5455 ** Do not confuse synchronous=FULL with SQLITE_SYNC_FULL. The
5456 ** SQLITE_SYNC_FULL macro means to use the MacOSX-style full-fsync
5457 ** using fcntl(F_FULLFSYNC). SQLITE_SYNC_NORMAL means to do an
5458 ** ordinary fsync() call. There is no difference between SQLITE_SYNC_FULL
5459 ** and SQLITE_SYNC_NORMAL on platforms other than MacOSX. But the
5460 ** synchronous=FULL versus synchronous=NORMAL setting determines when
5461 ** the xSync primitive is called and is relevant to all platforms.
5462 **
5463 ** Numeric values associated with these states are OFF==1, NORMAL==2,
5464 ** FULL==3, and EXTRA==4.
5465 */
5466 #ifndef SQLITE_OMIT_PAGER_PRAGMAS
5467 SQLITE_PRIVATE void sqlite3PagerSetFlags(
5468 Pager *pPager, /* The pager to set safety level for */
5469 unsigned pgFlags /* Various flags */
5470 ){
5471 unsigned level = pgFlags & PAGER_SYNCHRONOUS_MASK;
5472 if( pPager->tempFile ){
5473 pPager->noSync = 1;
5474 pPager->fullSync = 0;
5475 pPager->extraSync = 0;
5476 }else{
5477 pPager->noSync = level==PAGER_SYNCHRONOUS_OFF ?1:0;
5478 pPager->fullSync = level>=PAGER_SYNCHRONOUS_FULL ?1:0;
5479 pPager->extraSync = level==PAGER_SYNCHRONOUS_EXTRA ?1:0;
5480 }
5481 if( pPager->noSync ){
5482 pPager->syncFlags = 0;
5483 pPager->ckptSyncFlags = 0;
5484 }else if( pgFlags & PAGER_FULLFSYNC ){
5485 pPager->syncFlags = SQLITE_SYNC_FULL;
5486 pPager->ckptSyncFlags = SQLITE_SYNC_FULL;
5487 }else if( pgFlags & PAGER_CKPT_FULLFSYNC ){
5488 pPager->syncFlags = SQLITE_SYNC_NORMAL;
5489 pPager->ckptSyncFlags = SQLITE_SYNC_FULL;
5490 }else{
5491 pPager->syncFlags = SQLITE_SYNC_NORMAL;
5492 pPager->ckptSyncFlags = SQLITE_SYNC_NORMAL;
5493 }
5494 pPager->walSyncFlags = pPager->syncFlags;
5495 if( pPager->fullSync ){
5496 pPager->walSyncFlags |= WAL_SYNC_TRANSACTIONS;
5497 }
5498 if( pgFlags & PAGER_CACHESPILL ){
5499 pPager->doNotSpill &= ~SPILLFLAG_OFF;
5500 }else{
5501 pPager->doNotSpill |= SPILLFLAG_OFF;
5502 }
5503 }
5504 #endif
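The way sqlite3PagerSetFlags() turns the synchronous level into the three boolean pager flags can be shown as a small pure function. This is a sketch under stated assumptions: the SyncFlags struct, the SYNC_* enum and sync_flags_for() are hypothetical names, and the value 4 for EXTRA is assumed to follow FULL==3.

```c
#include <assert.h>

/* Assumed numeric levels; EXTRA==4 is an assumption of this sketch. */
enum { SYNC_OFF = 1, SYNC_NORMAL = 2, SYNC_FULL = 3, SYNC_EXTRA = 4 };

typedef struct SyncFlags {
  int noSync;     /* Never sync */
  int fullSync;   /* Extra journal sync before database writes begin */
  int extraSync;  /* Also sync the directory after the journal is unlinked */
} SyncFlags;

/* Map a synchronous level to the three flags. Temporary files never sync,
** whatever level was requested. */
static SyncFlags sync_flags_for(int level, int isTempFile){
  SyncFlags f = {0, 0, 0};
  if( isTempFile ){
    f.noSync = 1;
    return f;
  }
  f.noSync    = (level==SYNC_OFF);
  f.fullSync  = (level>=SYNC_FULL);
  f.extraSync = (level==SYNC_EXTRA);
  return f;
}
```

Since fullSync is set by a ">=" comparison, EXTRA implies everything FULL does plus the directory sync.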
5505
5506 /*
5507 ** The following global variable is incremented whenever the library
5508 ** attempts to open a temporary file. This information is used for
5509 ** testing and analysis only.
5510 */
5511 #ifdef SQLITE_TEST
5512 SQLITE_API int sqlite3_opentemp_count = 0;
5513 #endif
5514
5515 /*
5516 ** Open a temporary file.
5517 **
5518 ** Write the file descriptor into *pFile. Return SQLITE_OK on success
5519 ** or some other error code if we fail. The OS will automatically
5520 ** delete the temporary file when it is closed.
5521 **
5522 ** The flags passed to the VFS layer xOpen() call are those specified
5523 ** by parameter vfsFlags ORed with the following:
5524 **
5525 ** SQLITE_OPEN_READWRITE
5526 ** SQLITE_OPEN_CREATE
5527 ** SQLITE_OPEN_EXCLUSIVE
5528 ** SQLITE_OPEN_DELETEONCLOSE
5529 */
5530 static int pagerOpentemp(
5531 Pager *pPager, /* The pager object */
5532 sqlite3_file *pFile, /* Write the file descriptor here */
5533 int vfsFlags /* Flags passed through to the VFS */
5534 ){
5535 int rc; /* Return code */
5536
5537 #ifdef SQLITE_TEST
5538 sqlite3_opentemp_count++; /* Used for testing and analysis only */
5539 #endif
5540
5541 vfsFlags |= SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
5542 SQLITE_OPEN_EXCLUSIVE | SQLITE_OPEN_DELETEONCLOSE;
5543 rc = sqlite3OsOpen(pPager->pVfs, 0, pFile, vfsFlags, 0);
5544 assert( rc!=SQLITE_OK || isOpen(pFile) );
5545 return rc;
5546 }
5547
5548 /*
5549 ** Set the busy handler function.
5550 **
5551 ** The pager invokes the busy-handler if sqlite3OsLock() returns
5552 ** SQLITE_BUSY when trying to upgrade from no-lock to a SHARED lock,
5553 ** or when trying to upgrade from a RESERVED lock to an EXCLUSIVE
5554 ** lock. It does *not* invoke the busy handler when upgrading from
5555 ** SHARED to RESERVED, or when upgrading from SHARED to EXCLUSIVE
5556 ** (which occurs during hot-journal rollback). Summary:
5557 **
5558 ** Transition | Invokes xBusyHandler
5559 ** --------------------------------------------------------
5560 ** NO_LOCK -> SHARED_LOCK | Yes
5561 ** SHARED_LOCK -> RESERVED_LOCK | No
5562 ** SHARED_LOCK -> EXCLUSIVE_LOCK | No
5563 ** RESERVED_LOCK -> EXCLUSIVE_LOCK | Yes
5564 **
5565 ** If the busy-handler callback returns non-zero, the lock is
5566 ** retried. If it returns zero, then the SQLITE_BUSY error is
5567 ** returned to the caller of the pager API function.
5568 */
5569 SQLITE_PRIVATE void sqlite3PagerSetBusyhandler(
5570 Pager *pPager, /* Pager object */
5571 int (*xBusyHandler)(void *), /* Pointer to busy-handler function */
5572 void *pBusyHandlerArg /* Argument to pass to xBusyHandler */
5573 ){
5574 pPager->xBusyHandler = xBusyHandler;
5575 pPager->pBusyHandlerArg = pBusyHandlerArg;
5576
5577 if( isOpen(pPager->fd) ){
5578 void **ap = (void **)&pPager->xBusyHandler;
5579 assert( ((int(*)(void *))(ap[0]))==xBusyHandler );
5580 assert( ap[1]==pBusyHandlerArg );
5581 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_BUSYHANDLER, (void *)ap);
5582 }
5583 }
5584
5585 /*
5586 ** Change the page size used by the Pager object. The new page size
5587 ** is passed in *pPageSize.
5588 **
5589 ** If the pager is in the error state when this function is called, it
5590 ** is a no-op. The value returned is the error state error code (i.e.
5591 ** one of SQLITE_IOERR, an SQLITE_IOERR_xxx sub-code or SQLITE_FULL).
5592 **
5593 ** Otherwise, if all of the following are true:
5594 **
5595 ** * the new page size (value of *pPageSize) is valid (a power
5596 ** of two between 512 and SQLITE_MAX_PAGE_SIZE, inclusive), and
5597 **
5598 ** * there are no outstanding page references, and
5599 **
5600 ** * the database is either not an in-memory database or it is
5601 ** an in-memory database that currently consists of zero pages.
5602 **
5603 ** then the pager object page size is set to *pPageSize.
5604 **
5605 ** If the page size is changed, then this function uses sqlite3PageMalloc()
5606 ** to obtain a new Pager.pTmpSpace buffer. If this allocation attempt
5607 ** fails, SQLITE_NOMEM is returned and the page size remains unchanged.
5608 ** In all other cases, SQLITE_OK is returned.
5609 **
5610 ** If the page size is not changed, either because one of the enumerated
5611 ** conditions above is not true, the pager was in error state when this
5612 ** function was called, or because the memory allocation attempt failed,
5613 ** then *pPageSize is set to the old, retained page size before returning.
5614 */
5615 SQLITE_PRIVATE int sqlite3PagerSetPagesize(Pager *pPager, u32 *pPageSize, int nReserve){
5616 int rc = SQLITE_OK;
5617
5618 /* It is not possible to do a full assert_pager_state() here, as this
5619 ** function may be called from within PagerOpen(), before the state
5620 ** of the Pager object is internally consistent.
5621 **
5622 ** At one point this function returned an error if the pager was in
5623 ** PAGER_ERROR state. But since PAGER_ERROR state guarantees that
5624 ** there is at least one outstanding page reference, this function
5625 ** is a no-op for that case anyhow.
5626 */
5627
5628 u32 pageSize = *pPageSize;
5629 assert( pageSize==0 || (pageSize>=512 && pageSize<=SQLITE_MAX_PAGE_SIZE) );
5630 if( (pPager->memDb==0 || pPager->dbSize==0)
5631 && sqlite3PcacheRefCount(pPager->pPCache)==0
5632 && pageSize && pageSize!=(u32)pPager->pageSize
5633 ){
5634 char *pNew = NULL; /* New temp space */
5635 i64 nByte = 0;
5636
5637 if( pPager->eState>PAGER_OPEN && isOpen(pPager->fd) ){
5638 rc = sqlite3OsFileSize(pPager->fd, &nByte);
5639 }
5640 if( rc==SQLITE_OK ){
5641 pNew = (char *)sqlite3PageMalloc(pageSize);
5642 if( !pNew ) rc = SQLITE_NOMEM_BKPT;
5643 }
5644
5645 if( rc==SQLITE_OK ){
5646 pager_reset(pPager);
5647 rc = sqlite3PcacheSetPageSize(pPager->pPCache, pageSize);
5648 }
5649 if( rc==SQLITE_OK ){
5650 sqlite3PageFree(pPager->pTmpSpace);
5651 pPager->pTmpSpace = pNew;
5652 pPager->dbSize = (Pgno)((nByte+pageSize-1)/pageSize);
5653 pPager->pageSize = pageSize;
5654 }else{
5655 sqlite3PageFree(pNew);
5656 }
5657 }
5658
5659 *pPageSize = pPager->pageSize;
5660 if( rc==SQLITE_OK ){
5661 if( nReserve<0 ) nReserve = pPager->nReserve;
5662 assert( nReserve>=0 && nReserve<1000 );
5663 pPager->nReserve = (i16)nReserve;
5664 pagerReportSize(pPager);
5665 pagerFixMaplimit(pPager);
5666 }
5667 return rc;
5668 }
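
The page-size validity rule and the round-up page count computed by sqlite3PagerSetPagesize() can be sketched in isolation. This is a minimal illustration, not SQLite code: the `sketch_` helpers and `SKETCH_MAX_PAGE_SIZE` are hypothetical names, and the limit of 65536 is an assumed stand-in for SQLITE_MAX_PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for SQLITE_MAX_PAGE_SIZE (assumed value). */
#define SKETCH_MAX_PAGE_SIZE 65536

/* Return 1 if sz is a power of two between 512 and the maximum,
** inclusive. Otherwise return 0. */
static int sketch_is_valid_page_size(uint32_t sz){
  if( sz<512 || sz>SKETCH_MAX_PAGE_SIZE ) return 0;
  return (sz & (sz-1))==0;
}

/* Number of pages needed to hold nByte bytes, rounding up. This mirrors
** the (nByte+pageSize-1)/pageSize expression in the function above, so a
** file of between 1 and pageSize bytes counts as one page. */
static uint32_t sketch_pages_for_size(int64_t nByte, uint32_t pageSize){
  assert( sketch_is_valid_page_size(pageSize) );
  assert( nByte>=0 );
  return (uint32_t)((nByte + pageSize - 1) / pageSize);
}
```

Under these assumptions a 4097-byte file with 4096-byte pages is reported as two pages, and an empty file as zero pages.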
5669
5670 /*
5671 ** Return a pointer to the "temporary page" buffer held internally
5672 ** by the pager. This is a buffer that is big enough to hold the
5673 ** entire content of a database page. This buffer is used internally
5674 ** during rollback and will be overwritten whenever a rollback
5675 ** occurs. But other modules are free to use it too, as long as
5676 ** no rollbacks are happening.
5677 */
5678 SQLITE_PRIVATE void *sqlite3PagerTempSpace(Pager *pPager){
5679 return pPager->pTmpSpace;
5680 }
5681
5682 /*
5683 ** Attempt to set the maximum database page count if mxPage is positive.
5684 ** Make no changes if mxPage is zero or negative. And never reduce the
5685 ** maximum page count below the current size of the database.
5686 **
5687 ** Regardless of mxPage, return the current maximum page count.
5688 */
5689 SQLITE_PRIVATE int sqlite3PagerMaxPageCount(Pager *pPager, int mxPage){
5690 if( mxPage>0 ){
5691 pPager->mxPgno = mxPage;
5692 }
5693 assert( pPager->eState!=PAGER_OPEN ); /* Called only by OP_MaxPgcnt */
5694 assert( pPager->mxPgno>=pPager->dbSize ); /* OP_MaxPgcnt enforces this */
5695 return pPager->mxPgno;
5696 }
5697
5698 /*
5699 ** The following set of routines are used to disable the simulated
5700 ** I/O error mechanism. These routines are used to avoid simulated
5701 ** errors in places where we do not care about errors.
5702 **
5703 ** Unless -DSQLITE_TEST=1 is used, these routines are all no-ops
5704 ** and generate no code.
5705 */
5706 #ifdef SQLITE_TEST
5707 SQLITE_API extern int sqlite3_io_error_pending;
5708 SQLITE_API extern int sqlite3_io_error_hit;
5709 static int saved_cnt;
5710 void disable_simulated_io_errors(void){
5711 saved_cnt = sqlite3_io_error_pending;
5712 sqlite3_io_error_pending = -1;
5713 }
5714 void enable_simulated_io_errors(void){
5715 sqlite3_io_error_pending = saved_cnt;
5716 }
5717 #else
5718 # define disable_simulated_io_errors()
5719 # define enable_simulated_io_errors()
5720 #endif
5721
5722 /*
5723 ** Read the first N bytes from the beginning of the file into memory
5724 ** that pDest points to.
5725 **
5726 ** If the pager was opened on a transient file (zFilename==""), or
5727 ** opened on a file less than N bytes in size, the output buffer is
5728 ** zeroed and SQLITE_OK returned. The rationale for this is that this
5729 ** function is used to read database headers, and a new transient or
5730 ** zero sized database has a header that consists entirely of zeroes.
5731 **
5732 ** If any IO error apart from SQLITE_IOERR_SHORT_READ is encountered,
5733 ** the error code is returned to the caller and the contents of the
5734 ** output buffer are undefined.
5735 */
5736 SQLITE_PRIVATE int sqlite3PagerReadFileheader(Pager *pPager, int N, unsigned char *pDest){
5737 int rc = SQLITE_OK;
5738 memset(pDest, 0, N);
5739 assert( isOpen(pPager->fd) || pPager->tempFile );
5740
5741 /* This routine is only called by btree immediately after creating
5742 ** the Pager object. There has not been an opportunity to transition
5743 ** to WAL mode yet.
5744 */
5745 assert( !pagerUseWal(pPager) );
5746
5747 if( isOpen(pPager->fd) ){
5748 IOTRACE(("DBHDR %p 0 %d\n", pPager, N))
5749 rc = sqlite3OsRead(pPager->fd, pDest, N, 0);
5750 if( rc==SQLITE_IOERR_SHORT_READ ){
5751 rc = SQLITE_OK;
5752 }
5753 }
5754 return rc;
5755 }
5756
5757 /*
5758 ** This function may only be called when a read-transaction is open on
5759 ** the pager. It returns the total number of pages in the database.
5760 **
5761 ** However, if the file is between 1 and <page-size> bytes in size, then
5762 ** this is considered a 1 page file.
5763 */
5764 SQLITE_PRIVATE void sqlite3PagerPagecount(Pager *pPager, int *pnPage){
5765 assert( pPager->eState>=PAGER_READER );
5766 assert( pPager->eState!=PAGER_WRITER_FINISHED );
5767 *pnPage = (int)pPager->dbSize;
5768 }
5769
5770
5771 /*
5772 ** Try to obtain a lock of type locktype on the database file. If
5773 ** a similar or greater lock is already held, this function is a no-op
5774 ** (returning SQLITE_OK immediately).
5775 **
5776 ** Otherwise, attempt to obtain the lock using sqlite3OsLock(). Invoke
5777 ** the busy callback if the lock is currently not available. Repeat
5778 ** until the busy callback returns false or until the attempt to
5779 ** obtain the lock succeeds.
5780 **
5781 ** Return SQLITE_OK on success and an error code if we cannot obtain
5782 ** the lock. If the lock is obtained successfully, set the Pager.eLock
5783 ** variable to locktype before returning.
5784 */
5785 static int pager_wait_on_lock(Pager *pPager, int locktype){
5786 int rc; /* Return code */
5787
5788 /* Check that this is either a no-op (because the requested lock is
5789 ** already held), or one of the transitions that the busy-handler
5790 ** may be invoked during, according to the comment above
5791 ** sqlite3PagerSetBusyhandler().
5792 */
5793 assert( (pPager->eLock>=locktype)
5794 || (pPager->eLock==NO_LOCK && locktype==SHARED_LOCK)
5795 || (pPager->eLock==RESERVED_LOCK && locktype==EXCLUSIVE_LOCK)
5796 );
5797
5798 do {
5799 rc = pagerLockDb(pPager, locktype);
5800 }while( rc==SQLITE_BUSY && pPager->xBusyHandler(pPager->pBusyHandlerArg) );
5801 return rc;
5802 }
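
The retry loop in pager_wait_on_lock() can be illustrated with a small stand-alone sketch. All names here (`SketchLock`, `SketchRetry`, `sketch_try_lock`, `sketch_busy_handler`, `sketch_wait_on_lock`) and the numeric result codes are hypothetical. The fake lock simply reports "busy" a fixed number of times, and the handler gives up after a fixed number of calls.

```c
#include <assert.h>

#define SKETCH_OK   0   /* Hypothetical result codes for this sketch */
#define SKETCH_BUSY 5

typedef struct SketchLock {
  int nBusyLeft;   /* Attempts that will still report "busy" */
  int nAttempt;    /* Total lock attempts made so far */
} SketchLock;

typedef struct SketchRetry {
  int nCall;       /* Times the busy handler has agreed to retry */
  int mxCall;      /* Give up once nCall reaches this limit */
} SketchRetry;

/* Simulated lock attempt: busy until nBusyLeft reaches zero. */
static int sketch_try_lock(SketchLock *p){
  p->nAttempt++;
  if( p->nBusyLeft>0 ){ p->nBusyLeft--; return SKETCH_BUSY; }
  return SKETCH_OK;
}

/* Busy handler: non-zero means "retry", zero means "give up". */
static int sketch_busy_handler(void *pArg){
  SketchRetry *p = (SketchRetry*)pArg;
  if( p->nCall>=p->mxCall ) return 0;
  p->nCall++;
  return 1;
}

/* Same loop shape as above: retry while the lock is busy and the
** handler asks for another attempt. */
static int sketch_wait_on_lock(SketchLock *pLock, int (*xBusy)(void*), void *pArg){
  int rc;
  assert( xBusy!=0 );
  do{
    rc = sketch_try_lock(pLock);
  }while( rc==SKETCH_BUSY && xBusy(pArg) );
  return rc;
}
```

With a lock that is busy twice and a handler that allows five retries, the loop succeeds on the third attempt. With a lock that stays busy and a handler limited to three retries, the busy code is returned after four attempts.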
5803
5804 /*
5805 ** Function assertTruncateConstraint(pPager) checks that one of the
5806 ** following is true for all dirty pages currently in the page-cache:
5807 **
5808 ** a) The page number is less than or equal to the size of the
5809 ** current database image, in pages, OR
5810 **
5811 ** b) if the page content were written at this time, it would not
5812 ** be necessary to write the current content out to the sub-journal
5813 ** (as determined by function subjRequiresPage()).
5814 **
5815 ** If the condition asserted by this function were not true, and the
5816 ** dirty page were to be discarded from the cache via the pagerStress()
5817 ** routine, pagerStress() would not write the current page content to
5818 ** the database file. If a savepoint transaction were rolled back after
5819 ** this happened, the correct behavior would be to restore the current
5820 ** content of the page. However, since this content is not present in either
5821 ** the database file or the portion of the rollback journal and
5822 ** sub-journal rolled back, the content could not be restored and the
5823 ** database image would become corrupt. It is therefore fortunate that
5824 ** this circumstance cannot arise.
5825 */
5826 #if defined(SQLITE_DEBUG)
5827 static void assertTruncateConstraintCb(PgHdr *pPg){
5828 assert( pPg->flags&PGHDR_DIRTY );
5829 assert( !subjRequiresPage(pPg) || pPg->pgno<=pPg->pPager->dbSize );
5830 }
5831 static void assertTruncateConstraint(Pager *pPager){
5832 sqlite3PcacheIterateDirty(pPager->pPCache, assertTruncateConstraintCb);
5833 }
5834 #else
5835 # define assertTruncateConstraint(pPager)
5836 #endif
5837
5838 /*
5839 ** Truncate the in-memory database file image to nPage pages. This
5840 ** function does not actually modify the database file on disk. It
5841 ** just sets the internal state of the pager object so that the
5842 ** truncation will be done when the current transaction is committed.
5843 **
5844 ** This function is only called right before committing a transaction.
5845 ** Once this function has been called, the transaction must either be
5846 ** rolled back or committed. It is not safe to call this function and
5847 ** then continue writing to the database.
5848 */
5849 SQLITE_PRIVATE void sqlite3PagerTruncateImage(Pager *pPager, Pgno nPage){
5850 assert( pPager->dbSize>=nPage );
5851 assert( pPager->eState>=PAGER_WRITER_CACHEMOD );
5852 pPager->dbSize = nPage;
5853
5854 /* At one point the code here called assertTruncateConstraint() to
5855 ** ensure that all pages being truncated away by this operation are,
5856 ** if one or more savepoints are open, present in the savepoint
5857 ** journal so that they can be restored if the savepoint is rolled
5858 ** back. This is no longer necessary as this function is now only
5859 ** called right before committing a transaction. So although the
5860 ** Pager object may still have open savepoints (Pager.nSavepoint!=0),
5861 ** they cannot be rolled back. So the assertTruncateConstraint() call
5862 ** is no longer correct. */
5863 }
5864
5865
5866 /*
5867 ** This function is called before attempting a hot-journal rollback. It
5868 ** syncs the journal file to disk, then sets pPager->journalHdr to the
5869 ** size of the journal file so that the pager_playback() routine knows
5870 ** that the entire journal file has been synced.
5871 **
5872 ** Syncing a hot-journal to disk before attempting to roll it back ensures
5873 ** that if a power-failure occurs during the rollback, the process that
5874 ** attempts rollback following system recovery sees the same journal
5875 ** content as this process.
5876 **
5877 ** If everything goes as planned, SQLITE_OK is returned. Otherwise,
5878 ** an SQLite error code is returned.
5879 */
5880 static int pagerSyncHotJournal(Pager *pPager){
5881 int rc = SQLITE_OK;
5882 if( !pPager->noSync ){
5883 rc = sqlite3OsSync(pPager->jfd, SQLITE_SYNC_NORMAL);
5884 }
5885 if( rc==SQLITE_OK ){
5886 rc = sqlite3OsFileSize(pPager->jfd, &pPager->journalHdr);
5887 }
5888 return rc;
5889 }
5890
5891 #if SQLITE_MAX_MMAP_SIZE>0
5892 /*
5893 ** Obtain a reference to a memory mapped page object for page number pgno.
5894 ** The new object will use the pointer pData, obtained from xFetch().
5895 ** If successful, set *ppPage to point to the new page reference
5896 ** and return SQLITE_OK. Otherwise, return an SQLite error code and set
5897 ** *ppPage to zero.
5898 **
5899 ** Page references obtained by calling this function should be released
5900 ** by calling pagerReleaseMapPage().
5901 */
5902 static int pagerAcquireMapPage(
5903 Pager *pPager, /* Pager object */
5904 Pgno pgno, /* Page number */
5905 void *pData, /* xFetch()'d data for this page */
5906 PgHdr **ppPage /* OUT: Acquired page object */
5907 ){
5908 PgHdr *p; /* Memory mapped page to return */
5909
5910 if( pPager->pMmapFreelist ){
5911 *ppPage = p = pPager->pMmapFreelist;
5912 pPager->pMmapFreelist = p->pDirty;
5913 p->pDirty = 0;
5914 assert( pPager->nExtra>=8 );
5915 memset(p->pExtra, 0, 8);
5916 }else{
5917 *ppPage = p = (PgHdr *)sqlite3MallocZero(sizeof(PgHdr) + pPager->nExtra);
5918 if( p==0 ){
5919 sqlite3OsUnfetch(pPager->fd, (i64)(pgno-1) * pPager->pageSize, pData);
5920 return SQLITE_NOMEM_BKPT;
5921 }
5922 p->pExtra = (void *)&p[1];
5923 p->flags = PGHDR_MMAP;
5924 p->nRef = 1;
5925 p->pPager = pPager;
5926 }
5927
5928 assert( p->pExtra==(void *)&p[1] );
5929 assert( p->pPage==0 );
5930 assert( p->flags==PGHDR_MMAP );
5931 assert( p->pPager==pPager );
5932 assert( p->nRef==1 );
5933
5934 p->pgno = pgno;
5935 p->pData = pData;
5936 pPager->nMmapOut++;
5937
5938 return SQLITE_OK;
5939 }
5940 #endif
5941
5942 /*
5943 ** Release a reference to page pPg. pPg must have been returned by an
5944 ** earlier call to pagerAcquireMapPage().
5945 */
5946 static void pagerReleaseMapPage(PgHdr *pPg){
5947 Pager *pPager = pPg->pPager;
5948 pPager->nMmapOut--;
5949 pPg->pDirty = pPager->pMmapFreelist;
5950 pPager->pMmapFreelist = pPg;
5951
5952 assert( pPager->fd->pMethods->iVersion>=3 );
5953 sqlite3OsUnfetch(pPager->fd, (i64)(pPg->pgno-1)*pPager->pageSize, pPg->pData);
5954 }
5955
5956 /*
5957 ** Free all PgHdr objects stored in the Pager.pMmapFreelist list.
5958 */
5959 static void pagerFreeMapHdrs(Pager *pPager){
5960 PgHdr *p;
5961 PgHdr *pNext;
5962 for(p=pPager->pMmapFreelist; p; p=pNext){
5963 pNext = p->pDirty;
5964 sqlite3_free(p);
5965 }
5966 }
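
The three mmap-header routines above recycle PgHdr objects through a singly linked freelist threaded on PgHdr.pDirty. The sketch below shows the same pattern with hypothetical types (`SketchHdr`, `SketchPool`) and plain calloc()/free() in place of the SQLite allocator. It is an illustration of the freelist idea only and omits the xFetch()/xUnfetch() calls.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct SketchHdr SketchHdr;
struct SketchHdr {
  SketchHdr *pNext;   /* Plays the role of PgHdr.pDirty on the freelist */
  unsigned pgno;      /* Page number this header currently describes */
};

typedef struct SketchPool {
  SketchHdr *pFree;   /* Head of the freelist (cf. Pager.pMmapFreelist) */
  int nOut;           /* Headers currently handed out (cf. Pager.nMmapOut) */
} SketchPool;

/* Reuse a header from the freelist if one exists, else allocate one.
** Returns NULL if the allocation fails. */
static SketchHdr *sketch_acquire(SketchPool *pPool, unsigned pgno){
  SketchHdr *p = pPool->pFree;
  if( p ){
    pPool->pFree = p->pNext;
    p->pNext = 0;
  }else{
    p = (SketchHdr*)calloc(1, sizeof(*p));
    if( p==0 ) return 0;
  }
  p->pgno = pgno;
  pPool->nOut++;
  return p;
}

/* Push a header back onto the freelist instead of freeing it. */
static void sketch_release(SketchPool *pPool, SketchHdr *p){
  assert( pPool->nOut>0 );
  pPool->nOut--;
  p->pNext = pPool->pFree;
  pPool->pFree = p;
}

/* Free every header on the freelist. */
static void sketch_free_all(SketchPool *pPool){
  SketchHdr *p, *pNext;
  assert( pPool->nOut==0 );
  for(p=pPool->pFree; p; p=pNext){
    pNext = p->pNext;
    free(p);
  }
  pPool->pFree = 0;
}
```

Acquiring, releasing, and acquiring again returns the same header object, which is the point of the freelist: steady-state page mapping avoids a malloc per page.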
5967
5968
5969 /*
5970 ** Shutdown the page cache. Free all memory and close all files.
5971 **
5972 ** If a transaction was in progress when this routine is called, that
5973 ** transaction is rolled back. All outstanding pages are invalidated
5974 ** and their memory is freed. Any attempt to use a page associated
5975 ** with this page cache after this function returns will likely
5976 ** result in a coredump.
5977 **
5978 ** This function always succeeds. If a transaction is active an attempt
5979 ** is made to roll it back. If an error occurs during the rollback
5980 ** a hot journal may be left in the filesystem but no error is returned
5981 ** to the caller.
5982 */
5983 SQLITE_PRIVATE int sqlite3PagerClose(Pager *pPager, sqlite3 *db){
5984 u8 *pTmp = (u8 *)pPager->pTmpSpace;
5985
5986 assert( db || pagerUseWal(pPager)==0 );
5987 assert( assert_pager_state(pPager) );
5988 disable_simulated_io_errors();
5989 sqlite3BeginBenignMalloc();
5990 pagerFreeMapHdrs(pPager);
5991 /* pPager->errCode = 0; */
5992 pPager->exclusiveMode = 0;
5993 #ifndef SQLITE_OMIT_WAL
5994 assert( db || pPager->pWal==0 );
5995 sqlite3WalClose(pPager->pWal, db, pPager->ckptSyncFlags, pPager->pageSize,
5996 (db && (db->flags & SQLITE_NoCkptOnClose) ? 0 : pTmp)
5997 );
5998 pPager->pWal = 0;
5999 #endif
6000 pager_reset(pPager);
6001 if( MEMDB ){
6002 pager_unlock(pPager);
6003 }else{
6004 /* If it is open, sync the journal file before calling UnlockAndRollback.
6005 ** If this is not done, then an unsynced portion of the open journal
6006 ** file may be played back into the database. If a power failure occurs
6007 ** while this is happening, the database could become corrupt.
6008 **
6009 ** If an error occurs while trying to sync the journal, shift the pager
6010 ** into the ERROR state. This causes UnlockAndRollback to unlock the
6011 ** database and close the journal file without attempting to roll it
6012 ** back or finalize it. The next database user will have to do hot-journal
6013 ** rollback before accessing the database file.
6014 */
6015 if( isOpen(pPager->jfd) ){
6016 pager_error(pPager, pagerSyncHotJournal(pPager));
6017 }
6018 pagerUnlockAndRollback(pPager);
6019 }
6020 sqlite3EndBenignMalloc();
6021 enable_simulated_io_errors();
6022 PAGERTRACE(("CLOSE %d\n", PAGERID(pPager)));
6023 IOTRACE(("CLOSE %p\n", pPager))
6024 sqlite3OsClose(pPager->jfd);
6025 sqlite3OsClose(pPager->fd);
6026 sqlite3PageFree(pTmp);
6027 sqlite3PcacheClose(pPager->pPCache);
6028
6029 #ifdef SQLITE_HAS_CODEC
6030 if( pPager->xCodecFree ) pPager->xCodecFree(pPager->pCodec);
6031 #endif
6032
6033 assert( !pPager->aSavepoint && !pPager->pInJournal );
6034 assert( !isOpen(pPager->jfd) && !isOpen(pPager->sjfd) );
6035
6036 sqlite3_free(pPager);
6037 return SQLITE_OK;
6038 }
6039
6040 #if !defined(NDEBUG) || defined(SQLITE_TEST)
6041 /*
6042 ** Return the page number for page pPg.
6043 */
6044 SQLITE_PRIVATE Pgno sqlite3PagerPagenumber(DbPage *pPg){
6045 return pPg->pgno;
6046 }
6047 #endif
6048
6049 /*
6050 ** Increment the reference count for page pPg.
6051 */
6052 SQLITE_PRIVATE void sqlite3PagerRef(DbPage *pPg){
6053 sqlite3PcacheRef(pPg);
6054 }
6055
6056 /*
6057 ** Sync the journal. In other words, make sure all the pages that have
6058 ** been written to the journal have actually reached the surface of the
6059 ** disk and can be restored in the event of a hot-journal rollback.
6060 **
6061 ** If the Pager.noSync flag is set, then this function is a no-op.
6062 ** Otherwise, the actions required depend on the journal-mode and the
6063 ** device characteristics of the file-system, as follows:
6064 **
6065 ** * If the journal file is an in-memory journal file, no action need
6066 ** be taken.
6067 **
6068 ** * Otherwise, if the device does not support the SAFE_APPEND property,
6069 ** then the nRec field of the most recently written journal header
6070 ** is updated to contain the number of journal records that have
6071 ** been written following it. If the pager is operating in full-sync
6072 ** mode, then the journal file is synced before this field is updated.
6073 **
6074 ** * If the device does not support the SEQUENTIAL property, then
6075 ** the journal file is synced.
6076 **
6077 ** Or, in pseudo-code:
6078 **
6079 ** if( NOT <in-memory journal> ){
6080 ** if( NOT SAFE_APPEND ){
6081 ** if( <full-sync mode> ) xSync(<journal file>);
6082 ** <update nRec field>
6083 ** }
6084 ** if( NOT SEQUENTIAL ) xSync(<journal file>);
6085 ** }
6086 **
6087 ** If successful, this routine clears the PGHDR_NEED_SYNC flag of every
6088 ** page currently held in memory before returning SQLITE_OK. If an IO
6089 ** error is encountered, then the IO error code is returned to the caller.
6090 */
6091 static int syncJournal(Pager *pPager, int newHdr){
6092 int rc; /* Return code */
6093
6094 assert( pPager->eState==PAGER_WRITER_CACHEMOD
6095 || pPager->eState==PAGER_WRITER_DBMOD
6096 );
6097 assert( assert_pager_state(pPager) );
6098 assert( !pagerUseWal(pPager) );
6099
6100 rc = sqlite3PagerExclusiveLock(pPager);
6101 if( rc!=SQLITE_OK ) return rc;
6102
6103 if( !pPager->noSync ){
6104 assert( !pPager->tempFile );
6105 if( isOpen(pPager->jfd) && pPager->journalMode!=PAGER_JOURNALMODE_MEMORY ){
6106 const int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);
6107 assert( isOpen(pPager->jfd) );
6108
6109 if( 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){
6110 /* This block deals with an obscure problem. If the last connection
6111 ** that wrote to this database was operating in persistent-journal
6112 ** mode, then the journal file may at this point actually be larger
6113 ** than Pager.journalOff bytes. If the next thing in the journal
6114 ** file happens to be a journal-header (written as part of the
6115 ** previous connection's transaction), and a crash or power-failure
6116 ** occurs after nRec is updated but before this connection writes
6117 ** anything else to the journal file (or commits/rolls back its
6118 ** transaction), then SQLite may become confused when doing the
6119 ** hot-journal rollback following recovery. It may roll back all
6120 ** of this connection's data, then proceed to rolling back the old,
6121 ** out-of-date data that follows it. Database corruption.
6122 **
6123 ** To work around this, if the journal file does appear to contain
6124 ** a valid header following Pager.journalOff, then write a 0x00
6125 ** byte to the start of it to prevent it from being recognized.
6126 **
6127 ** Variable iNextHdrOffset is set to the offset at which this
6128 ** problematic header will occur, if it exists. aMagic is used
6129 ** as a temporary buffer to inspect the first couple of bytes of
6130 ** the potential journal header.
6131 */
6132 i64 iNextHdrOffset;
6133 u8 aMagic[8];
6134 u8 zHeader[sizeof(aJournalMagic)+4];
6135
6136 memcpy(zHeader, aJournalMagic, sizeof(aJournalMagic));
6137 put32bits(&zHeader[sizeof(aJournalMagic)], pPager->nRec);
6138
6139 iNextHdrOffset = journalHdrOffset(pPager);
6140 rc = sqlite3OsRead(pPager->jfd, aMagic, 8, iNextHdrOffset);
6141 if( rc==SQLITE_OK && 0==memcmp(aMagic, aJournalMagic, 8) ){
6142 static const u8 zerobyte = 0;
6143 rc = sqlite3OsWrite(pPager->jfd, &zerobyte, 1, iNextHdrOffset);
6144 }
6145 if( rc!=SQLITE_OK && rc!=SQLITE_IOERR_SHORT_READ ){
6146 return rc;
6147 }
6148
6149 /* Write the nRec value into the journal file header. If in
6150 ** full-synchronous mode, sync the journal first. This ensures that
6151 ** all data has really hit the disk before nRec is updated to mark
6152 ** it as a candidate for rollback.
6153 **
6154 ** This is not required if the persistent media supports the
6155 ** SAFE_APPEND property. Because in this case it is not possible
6156 ** for garbage data to be appended to the file, the nRec field
6157 ** is populated with 0xFFFFFFFF when the journal header is written
6158 ** and never needs to be updated.
6159 */
6160 if( pPager->fullSync && 0==(iDc&SQLITE_IOCAP_SEQUENTIAL) ){
6161 PAGERTRACE(("SYNC journal of %d\n", PAGERID(pPager)));
6162 IOTRACE(("JSYNC %p\n", pPager))
6163 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags);
6164 if( rc!=SQLITE_OK ) return rc;
6165 }
6166 IOTRACE(("JHDR %p %lld\n", pPager, pPager->journalHdr));
6167 rc = sqlite3OsWrite(
6168 pPager->jfd, zHeader, sizeof(zHeader), pPager->journalHdr
6169 );
6170 if( rc!=SQLITE_OK ) return rc;
6171 }
6172 if( 0==(iDc&SQLITE_IOCAP_SEQUENTIAL) ){
6173 PAGERTRACE(("SYNC journal of %d\n", PAGERID(pPager)));
6174 IOTRACE(("JSYNC %p\n", pPager))
6175 rc = sqlite3OsSync(pPager->jfd, pPager->syncFlags|
6176 (pPager->syncFlags==SQLITE_SYNC_FULL?SQLITE_SYNC_DATAONLY:0)
6177 );
6178 if( rc!=SQLITE_OK ) return rc;
6179 }
6180
6181 pPager->journalHdr = pPager->journalOff;
6182 if( newHdr && 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){
6183 pPager->nRec = 0;
6184 rc = writeJournalHdr(pPager);
6185 if( rc!=SQLITE_OK ) return rc;
6186 }
6187 }else{
6188 pPager->journalHdr = pPager->journalOff;
6189 }
6190 }
6191
6192 /* Unless the pager is in noSync mode, the journal file was just
6193 ** successfully synced. Either way, clear the PGHDR_NEED_SYNC flag on
6194 ** all pages.
6195 */
6196 sqlite3PcacheClearSyncFlags(pPager->pPCache);
6197 pPager->eState = PAGER_WRITER_DBMOD;
6198 assert( assert_pager_state(pPager) );
6199 return SQLITE_OK;
6200 }
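
The decision structure of syncJournal() can be reduced to a small pure function that reports how many syncs would be issued and whether nRec would be rewritten. This sketch follows the pseudo-code in the header comment, plus the extra SEQUENTIAL check that the code applies to the first sync. The `SketchPlan` type, the function name, and the flag values are hypothetical and do not match the real SQLITE_IOCAP_* constants.

```c
#include <assert.h>

/* Hypothetical device-characteristic bits for this sketch only. */
#define SKETCH_SAFE_APPEND 0x01
#define SKETCH_SEQUENTIAL  0x02

typedef struct SketchPlan {
  int nSync;        /* Number of journal syncs that would be issued */
  int updateNRec;   /* True if the nRec header field would be rewritten */
} SketchPlan;

/* Work out what a journal sync would do for the given settings.
** No I/O is performed; only the plan is returned. */
static SketchPlan sketch_sync_plan(int noSync, int inMemJournal, int fullSync, int iDc){
  SketchPlan plan = {0, 0};
  assert( (iDc & ~(SKETCH_SAFE_APPEND|SKETCH_SEQUENTIAL))==0 );
  if( noSync || inMemJournal ) return plan;
  if( 0==(iDc & SKETCH_SAFE_APPEND) ){
    /* Sync before touching nRec only in full-sync mode on a device
    ** that does not guarantee sequential writes. */
    if( fullSync && 0==(iDc & SKETCH_SEQUENTIAL) ) plan.nSync++;
    plan.updateNRec = 1;
  }
  if( 0==(iDc & SKETCH_SEQUENTIAL) ) plan.nSync++;
  return plan;
}
```

For an ordinary device in full-sync mode the plan is two syncs around one nRec update. A SAFE_APPEND device needs a single sync and no nRec update, and a SEQUENTIAL device needs the nRec update but no sync.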
6201
6202 /*
6203 ** The argument is the first in a linked list of dirty pages connected
6204 ** by the PgHdr.pDirty pointer. This function writes each one of the
6205 ** in-memory pages in the list to the database file. The argument may
6206 ** be NULL, representing an empty list. In this case this function is
6207 ** a no-op.
6208 **
6209 ** The pager must hold at least a RESERVED lock when this function
6210 ** is called. Before writing anything to the database file, this lock
6211 ** is upgraded to an EXCLUSIVE lock. If the lock cannot be obtained,
6212 ** SQLITE_BUSY is returned and no data is written to the database file.
6213 **
6214 ** If the pager is a temp-file pager and the actual file-system file
6215 ** is not yet open, it is created and opened before any data is
6216 ** written out.
6217 **
6218 ** Once the lock has been upgraded and, if necessary, the file opened,
6219 ** the pages are written out to the database file in list order. Writing
6220 ** a page is skipped if it meets either of the following criteria:
6221 **
6222 ** * The page number is greater than Pager.dbSize, or
6223 ** * The PGHDR_DONT_WRITE flag is set on the page.
6224 **
6225 ** If writing out a page causes the database file to grow, Pager.dbFileSize
6226 ** is updated accordingly. If page 1 is written out, then the value cached
6227 ** in Pager.dbFileVers[] is updated to match the new value stored in
6228 ** the database file.
6229 **
6230 ** If everything is successful, SQLITE_OK is returned. If an IO error
6231 ** occurs, an IO error code is returned. Or, if the EXCLUSIVE lock cannot
6232 ** be obtained, SQLITE_BUSY is returned.
6233 */
6234 static int pager_write_pagelist(Pager *pPager, PgHdr *pList){
6235 int rc = SQLITE_OK; /* Return code */
6236
6237 /* This function is only called for rollback pagers in WRITER_DBMOD state. */
6238 assert( !pagerUseWal(pPager) );
6239 assert( pPager->tempFile || pPager->eState==PAGER_WRITER_DBMOD );
6240 assert( pPager->eLock==EXCLUSIVE_LOCK );
6241 assert( isOpen(pPager->fd) || pList->pDirty==0 );
6242
6243 /* If the file is a temp-file that has not yet been opened, open it now. It
6244 ** is not possible for rc to be other than SQLITE_OK if this branch
6245 ** is taken, as pager_wait_on_lock() is a no-op for temp-files.
6246 */
6247 if( !isOpen(pPager->fd) ){
6248 assert( pPager->tempFile && rc==SQLITE_OK );
6249 rc = pagerOpentemp(pPager, pPager->fd, pPager->vfsFlags);
6250 }
6251
6252 /* Before the first write, give the VFS a hint of what the final
6253 ** file size will be.
6254 */
6255 assert( rc!=SQLITE_OK || isOpen(pPager->fd) );
6256 if( rc==SQLITE_OK
6257 && pPager->dbHintSize<pPager->dbSize
6258 && (pList->pDirty || pList->pgno>pPager->dbHintSize)
6259 ){
6260 sqlite3_int64 szFile = pPager->pageSize * (sqlite3_int64)pPager->dbSize;
6261 sqlite3OsFileControlHint(pPager->fd, SQLITE_FCNTL_SIZE_HINT, &szFile);
6262 pPager->dbHintSize = pPager->dbSize;
6263 }
6264
6265 while( rc==SQLITE_OK && pList ){
6266 Pgno pgno = pList->pgno;
6267
6268 /* If there are dirty pages in the page cache with page numbers greater
6269 ** than Pager.dbSize, this means sqlite3PagerTruncateImage() was called to
6270 ** make the file smaller (presumably by auto-vacuum code). Do not write
6271 ** any such pages to the file.
6272 **
6273 ** Also, do not write out any page that has the PGHDR_DONT_WRITE flag
6274 ** set (set by sqlite3PagerDontWrite()).
6275 */
6276 if( pgno<=pPager->dbSize && 0==(pList->flags&PGHDR_DONT_WRITE) ){
6277 i64 offset = (pgno-1)*(i64)pPager->pageSize; /* Offset to write */
6278 char *pData; /* Data to write */
6279
6280 assert( (pList->flags&PGHDR_NEED_SYNC)==0 );
6281 if( pList->pgno==1 ) pager_write_changecounter(pList);
6282
6283 /* Encode the database */
6284 CODEC2(pPager, pList->pData, pgno, 6, return SQLITE_NOMEM_BKPT, pData);
6285
6286 /* Write out the page data. */
6287 rc = sqlite3OsWrite(pPager->fd, pData, pPager->pageSize, offset);
6288
6289 /* If page 1 was just written, update Pager.dbFileVers to match
6290 ** the value now stored in the database file. If writing this
6291 ** page caused the database file to grow, update dbFileSize.
6292 */
6293 if( pgno==1 ){
6294 memcpy(&pPager->dbFileVers, &pData[24], sizeof(pPager->dbFileVers));
6295 }
6296 if( pgno>pPager->dbFileSize ){
6297 pPager->dbFileSize = pgno;
6298 }
6299 pPager->aStat[PAGER_STAT_WRITE]++;
6300
6301 /* Update any backup objects copying the contents of this pager. */
6302 sqlite3BackupUpdate(pPager->pBackup, pgno, (u8*)pList->pData);
6303
6304 PAGERTRACE(("STORE %d page %d hash(%08x)\n",
6305 PAGERID(pPager), pgno, pager_pagehash(pList)));
6306 IOTRACE(("PGOUT %p %d\n", pPager, pgno));
6307 PAGER_INCR(sqlite3_pager_writedb_count);
6308 }else{
6309 PAGERTRACE(("NOSTORE %d page %d\n", PAGERID(pPager), pgno));
6310 }
6311 pager_set_pagehash(pList);
6312 pList = pList->pDirty;
6313 }
6314
6315 return rc;
6316 }
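
Two small rules drive the loop in pager_write_pagelist(): whether a dirty page is written at all, and at which file offset. The sketch below isolates both. The helper names and the `SKETCH_DONT_WRITE` bit are hypothetical. The 64-bit cast matters because (pgno-1)*pageSize can exceed 32 bits for large databases.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the PGHDR_DONT_WRITE flag bit. */
#define SKETCH_DONT_WRITE 0x01

/* A page is written only if it lies inside the current database image
** and has not been marked as "do not write". */
static int sketch_should_write(uint32_t pgno, uint32_t dbSize, unsigned flags){
  assert( pgno>=1 );
  return pgno<=dbSize && (flags & SKETCH_DONT_WRITE)==0;
}

/* Byte offset of page pgno in the database file. Pages are numbered
** from 1, and the multiplication is done in 64 bits. */
static int64_t sketch_page_offset(uint32_t pgno, uint32_t pageSize){
  assert( pgno>=1 );
  return (int64_t)(pgno-1) * (int64_t)pageSize;
}
```

For example, page 3 with 4096-byte pages starts at offset 8192, and a page beyond dbSize (left over after a truncate) is skipped.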
6317
6318 /*
6319 ** Ensure that the sub-journal file is open. If it is already open, this
6320 ** function is a no-op.
6321 **
6322 ** SQLITE_OK is returned if everything goes according to plan. An
6323 ** SQLITE_IOERR_XXX error code is returned if a call to sqlite3OsOpen()
6324 ** fails.
6325 */
6326 static int openSubJournal(Pager *pPager){
6327 int rc = SQLITE_OK;
6328 if( !isOpen(pPager->sjfd) ){
6329 const int flags = SQLITE_OPEN_SUBJOURNAL | SQLITE_OPEN_READWRITE
6330 | SQLITE_OPEN_CREATE | SQLITE_OPEN_EXCLUSIVE
6331 | SQLITE_OPEN_DELETEONCLOSE;
6332 int nStmtSpill = sqlite3Config.nStmtSpill;
6333 if( pPager->journalMode==PAGER_JOURNALMODE_MEMORY || pPager->subjInMemory ){
6334 nStmtSpill = -1;
6335 }
6336 rc = sqlite3JournalOpen(pPager->pVfs, 0, pPager->sjfd, flags, nStmtSpill);
6337 }
6338 return rc;
6339 }
6340
6341 /*
6342 ** Append a record of the current state of page pPg to the sub-journal.
6343 **
6344 ** If successful, set the bit corresponding to pPg->pgno in the bitvecs
6345 ** for all open savepoints before returning.
6346 **
6347 ** This function returns SQLITE_OK if everything is successful, an IO
6348 ** error code if the attempt to write to the sub-journal fails, or
6349 ** SQLITE_NOMEM if a malloc fails while setting a bit in a savepoint
6350 ** bitvec.
6351 */
6352 static int subjournalPage(PgHdr *pPg){
6353 int rc = SQLITE_OK;
6354 Pager *pPager = pPg->pPager;
6355 if( pPager->journalMode!=PAGER_JOURNALMODE_OFF ){
6356
6357 /* Open the sub-journal, if it has not already been opened */
6358 assert( pPager->useJournal );
6359 assert( isOpen(pPager->jfd) || pagerUseWal(pPager) );
6360 assert( isOpen(pPager->sjfd) || pPager->nSubRec==0 );
6361 assert( pagerUseWal(pPager)
6362 || pageInJournal(pPager, pPg)
6363 || pPg->pgno>pPager->dbOrigSize
6364 );
6365 rc = openSubJournal(pPager);
6366
6367 /* If the sub-journal was opened successfully (or was already open),
6368 ** write the journal record into the file. */
6369 if( rc==SQLITE_OK ){
6370 void *pData = pPg->pData;
6371 i64 offset = (i64)pPager->nSubRec*(4+pPager->pageSize);
6372 char *pData2;
6373
6374 CODEC2(pPager, pData, pPg->pgno, 7, return SQLITE_NOMEM_BKPT, pData2);
6375 PAGERTRACE(("STMT-JOURNAL %d page %d\n", PAGERID(pPager), pPg->pgno));
6376 rc = write32bits(pPager->sjfd, offset, pPg->pgno);
6377 if( rc==SQLITE_OK ){
6378 rc = sqlite3OsWrite(pPager->sjfd, pData2, pPager->pageSize, offset+4);
6379 }
6380 }
6381 }
6382 if( rc==SQLITE_OK ){
6383 pPager->nSubRec++;
6384 assert( pPager->nSavepoint>0 );
6385 rc = addToSavepointBitvecs(pPager, pPg->pgno);
6386 }
6387 return rc;
6388 }
6389 static int subjournalPageIfRequired(PgHdr *pPg){
6390 if( subjRequiresPage(pPg) ){
6391 return subjournalPage(pPg);
6392 }else{
6393 return SQLITE_OK;
6394 }
6395 }
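
Each sub-journal record written by subjournalPage() is a 4-byte page number followed by one page of data, so record i starts at offset i*(4+pageSize). The sketch below appends such records to an in-memory buffer. The function names are hypothetical, and storing the page number big-endian is an assumption of this sketch, since the byte order used by write32bits() is not shown in this section.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offset of sub-journal record number nSubRec (counting from 0). */
static int64_t sketch_subrec_offset(uint32_t nSubRec, uint32_t pageSize){
  return (int64_t)nSubRec * (4 + (int64_t)pageSize);
}

/* Append one record (page number, then page content) to aBuf and
** return the new record count. The caller must supply a buffer that
** is large enough for the record. */
static uint32_t sketch_append_subrec(
  uint8_t *aBuf, size_t nBuf,      /* Destination buffer and its size */
  uint32_t nSubRec,                /* Records already in the buffer */
  uint32_t pgno,                   /* Page number to record */
  const uint8_t *aPage,            /* Page content */
  uint32_t pageSize                /* Size of aPage in bytes */
){
  int64_t off = sketch_subrec_offset(nSubRec, pageSize);
  assert( (uint64_t)off + 4 + pageSize <= nBuf );
  /* Assumed big-endian encoding of the page number. */
  aBuf[off+0] = (uint8_t)(pgno>>24);
  aBuf[off+1] = (uint8_t)(pgno>>16);
  aBuf[off+2] = (uint8_t)(pgno>>8);
  aBuf[off+3] = (uint8_t)(pgno);
  memcpy(&aBuf[off+4], aPage, pageSize);
  return nSubRec+1;
}
```

With 8-byte pages, the second record begins at offset 12, directly after the first record's 4-byte header and 8 bytes of content.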
6396
6397 /*
6398 ** This function is called by the pcache layer when it has reached some
6399 ** soft memory limit. The first argument is a pointer to a Pager object
6400 ** (cast as a void*). The pager is always 'purgeable' (not an in-memory
6401 ** database). The second argument is a reference to a page that is
6402 ** currently dirty but has no outstanding references. The page
6403 ** is always associated with the Pager object passed as the first
6404 ** argument.
6405 **
6406 ** The job of this function is to make pPg clean by writing its contents
6407 ** out to the database file, if possible. This may involve syncing the
6408 ** journal file.
6409 **
6410 ** If successful, sqlite3PcacheMakeClean() is called on the page and
6411 ** SQLITE_OK returned. If an IO error occurs while trying to make the
6412 ** page clean, the IO error code is returned. If the page cannot be
6413 ** made clean for some other reason, but no error occurs, then SQLITE_OK
6414 ** is returned but sqlite3PcacheMakeClean() is not called.
6415 */
6416 static int pagerStress(void *p, PgHdr *pPg){
6417 Pager *pPager = (Pager *)p;
6418 int rc = SQLITE_OK;
6419
6420 assert( pPg->pPager==pPager );
6421 assert( pPg->flags&PGHDR_DIRTY );
6422
6423 /* The doNotSpill NOSYNC bit is set during times when doing a sync of
6424 ** journal (and adding a new header) is not allowed. This occurs
6425 ** during calls to sqlite3PagerWrite() while trying to journal multiple
6426 ** pages belonging to the same sector.
6427 **
6428 ** The doNotSpill ROLLBACK and OFF bits inhibit all cache spilling
6429 ** regardless of whether or not a sync is required. This is set during
6430 ** a rollback or by user request, respectively.
6431 **
6432 ** Spilling is also prohibited when in an error state since that could
6433 ** lead to database corruption. In the current implementation it
6434 ** is impossible for sqlite3PcacheFetch() to be called with createFlag==3
6435 ** while in the error state, hence it is impossible for this routine to
6436 ** be called in the error state. Nevertheless, we include a NEVER()
6437 ** test for the error state as a safeguard against future changes.
6438 */
6439 if( NEVER(pPager->errCode) ) return SQLITE_OK;
6440 testcase( pPager->doNotSpill & SPILLFLAG_ROLLBACK );
6441 testcase( pPager->doNotSpill & SPILLFLAG_OFF );
6442 testcase( pPager->doNotSpill & SPILLFLAG_NOSYNC );
6443 if( pPager->doNotSpill
6444 && ((pPager->doNotSpill & (SPILLFLAG_ROLLBACK|SPILLFLAG_OFF))!=0
6445 || (pPg->flags & PGHDR_NEED_SYNC)!=0)
6446 ){
6447 return SQLITE_OK;
6448 }
6449
6450 pPg->pDirty = 0;
6451 if( pagerUseWal(pPager) ){
6452 /* Write a single frame for this page to the log. */
6453 rc = subjournalPageIfRequired(pPg);
6454 if( rc==SQLITE_OK ){
6455 rc = pagerWalFrames(pPager, pPg, 0, 0);
6456 }
6457 }else{
6458
6459 /* Sync the journal file if required. */
6460 if( pPg->flags&PGHDR_NEED_SYNC
6461 || pPager->eState==PAGER_WRITER_CACHEMOD
6462 ){
6463 rc = syncJournal(pPager, 1);
6464 }
6465
6466 /* Write the contents of the page out to the database file. */
6467 if( rc==SQLITE_OK ){
6468 assert( (pPg->flags&PGHDR_NEED_SYNC)==0 );
6469 rc = pager_write_pagelist(pPager, pPg);
6470 }
6471 }
6472
6473 /* Mark the page as clean. */
6474 if( rc==SQLITE_OK ){
6475 PAGERTRACE(("STRESS %d page %d\n", PAGERID(pPager), pPg->pgno));
6476 sqlite3PcacheMakeClean(pPg);
6477 }
6478
6479 return pager_error(pPager, rc);
6480 }
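The doNotSpill test at the top of pagerStress() can be read as a pure predicate: ROLLBACK or OFF blocks every spill, while NOSYNC blocks only pages that still need a journal sync. A minimal sketch under stated assumptions follows. The EX_ constants and exampleMaySpill are hypothetical names with illustrative bit values; they are not taken from the SQLite headers.

```c
/* Illustrative bit values only; assumptions, not SQLite's definitions. */
#define EX_SPILLFLAG_OFF      0x01
#define EX_SPILLFLAG_ROLLBACK 0x02
#define EX_SPILLFLAG_NOSYNC   0x04
#define EX_PGHDR_NEED_SYNC    0x08

/* Hypothetical helper: return 1 if a dirty page may be spilled,
** 0 if the stress callback should return without writing it. */
static int exampleMaySpill(unsigned doNotSpill, unsigned pageFlags){
  if( doNotSpill==0 ) return 1;                 /* no restriction set */
  if( doNotSpill & (EX_SPILLFLAG_ROLLBACK|EX_SPILLFLAG_OFF) ){
    return 0;                                   /* all spilling inhibited */
  }
  /* Only NOSYNC is set: spill unless the page requires a journal sync. */
  return (pageFlags & EX_PGHDR_NEED_SYNC)==0;
}
```

Note that a page with the need-sync flag is still eligible when doNotSpill is zero; in that case the real routine syncs the journal before writing the page.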
6481
6482 /*
6483 ** Flush all unreferenced dirty pages to disk.
6484 */
6485 SQLITE_PRIVATE int sqlite3PagerFlush(Pager *pPager){
6486 int rc = pPager->errCode;
6487 if( !MEMDB ){
6488 PgHdr *pList = sqlite3PcacheDirtyList(pPager->pPCache);
6489 assert( assert_pager_state(pPager) );
6490 while( rc==SQLITE_OK && pList ){
6491 PgHdr *pNext = pList->pDirty;
6492 if( pList->nRef==0 ){
6493 rc = pagerStress((void*)pPager, pList);
6494 }
6495 pList = pNext;
6496 }
6497 }
6498
6499 return rc;
6500 }
6501
6502 /*
6503 ** Allocate and initialize a new Pager object and put a pointer to it
6504 ** in *ppPager. The pager should eventually be freed by passing it
6505 ** to sqlite3PagerClose().
6506 **
6507 ** The zFilename argument is the path to the database file to open.
6508 ** If zFilename is NULL then a randomly-named temporary file is created
6509 ** and used as the file to be cached. Temporary files are deleted
6510 ** automatically when they are closed. If zFilename is ":memory:" then
6511 ** all information is held in cache. It is never written to disk.
6512 ** This can be used to implement an in-memory database.
6513 **
6514 ** The nExtra parameter specifies the number of bytes of space allocated
6515 ** along with each page reference. This space is available to the user
6516 ** via the sqlite3PagerGetExtra() API. When a new page is allocated, the
6517 ** first 8 bytes of this space are zeroed but the remainder is uninitialized.
6518 ** (The extra space is used by btree as the MemPage object.)
6519 **
6520 ** The flags argument is used to specify properties that affect the
6521 ** operation of the pager. It should be passed some bitwise combination
6522 ** of the PAGER_* flags.
6523 **
6524 ** The vfsFlags parameter is a bitmask to pass to the flags parameter
6525 ** of the xOpen() method of the supplied VFS when opening files.
6526 **
6527 ** If the pager object is allocated and the specified file opened
6528 ** successfully, SQLITE_OK is returned and *ppPager set to point to
6529 ** the new pager object. If an error occurs, *ppPager is set to NULL
6530 ** and an error code is returned. This function may return SQLITE_NOMEM
6531 ** (sqlite3Malloc() is used to allocate memory), SQLITE_CANTOPEN or
6532 ** various SQLITE_IOERR_XXX errors.
6533 */
6534 SQLITE_PRIVATE int sqlite3PagerOpen(
6535 sqlite3_vfs *pVfs, /* The virtual file system to use */
6536 Pager **ppPager, /* OUT: Return the Pager structure here */
6537 const char *zFilename, /* Name of the database file to open */
6538 int nExtra, /* Extra bytes appended to each in-memory page */
6539 int flags, /* flags controlling this file */
6540 int vfsFlags, /* flags passed through to sqlite3_vfs.xOpen() */
6541 void (*xReinit)(DbPage*) /* Function to reinitialize pages */
6542 ){
6543 u8 *pPtr;
6544 Pager *pPager = 0; /* Pager object to allocate and return */
6545 int rc = SQLITE_OK; /* Return code */
6546 int tempFile = 0; /* True for temp files (incl. in-memory files) */
6547 int memDb = 0; /* True if this is an in-memory file */
6548 int readOnly = 0; /* True if this is a read-only file */
6549 int journalFileSize; /* Bytes to allocate for each journal fd */
6550 char *zPathname = 0; /* Full path to database file */
6551 int nPathname = 0; /* Number of bytes in zPathname */
6552 int useJournal = (flags & PAGER_OMIT_JOURNAL)==0; /* False to omit journal */
6553 int pcacheSize = sqlite3PcacheSize(); /* Bytes to allocate for PCache */
6554 u32 szPageDflt = SQLITE_DEFAULT_PAGE_SIZE; /* Default page size */
6555 const char *zUri = 0; /* URI args to copy */
6556 int nUri = 0; /* Number of bytes of URI args at *zUri */
6557
6558 /* Figure out how much space is required for each journal file-handle
6559 ** (there are two of them, the main journal and the sub-journal). */
6560 journalFileSize = ROUND8(sqlite3JournalSize(pVfs));
6561
6562 /* Set the output variable to NULL in case an error occurs. */
6563 *ppPager = 0;
6564
6565 #ifndef SQLITE_OMIT_MEMORYDB
6566 if( flags & PAGER_MEMORY ){
6567 memDb = 1;
6568 if( zFilename && zFilename[0] ){
6569 zPathname = sqlite3DbStrDup(0, zFilename);
6570 if( zPathname==0 ) return SQLITE_NOMEM_BKPT;
6571 nPathname = sqlite3Strlen30(zPathname);
6572 zFilename = 0;
6573 }
6574 }
6575 #endif
6576
6577 /* Compute and store the full pathname in an allocated buffer pointed
6578 ** to by zPathname, length nPathname. Or, if this is a temporary file,
6579 ** leave both nPathname and zPathname set to 0.
6580 */
6581 if( zFilename && zFilename[0] ){
6582 const char *z;
6583 nPathname = pVfs->mxPathname+1;
6584 zPathname = sqlite3DbMallocRaw(0, nPathname*2);
6585 if( zPathname==0 ){
6586 return SQLITE_NOMEM_BKPT;
6587 }
6588 zPathname[0] = 0; /* Make sure initialized even if FullPathname() fails */
6589 rc = sqlite3OsFullPathname(pVfs, zFilename, nPathname, zPathname);
6590 nPathname = sqlite3Strlen30(zPathname);
6591 z = zUri = &zFilename[sqlite3Strlen30(zFilename)+1];
6592 while( *z ){
6593 z += sqlite3Strlen30(z)+1;
6594 z += sqlite3Strlen30(z)+1;
6595 }
6596 nUri = (int)(&z[1] - zUri);
6597 assert( nUri>=0 );
6598 if( rc==SQLITE_OK && nPathname+8>pVfs->mxPathname ){
6599 /* This branch is taken when the journal path required by
6600 ** the database being opened will be more than pVfs->mxPathname
6601 ** bytes in length. This means the database cannot be opened,
6602 ** as it will not be possible to open the journal file or even
6603 ** check for a hot-journal before reading.
6604 */
6605 rc = SQLITE_CANTOPEN_BKPT;
6606 }
6607 if( rc!=SQLITE_OK ){
6608 sqlite3DbFree(0, zPathname);
6609 return rc;
6610 }
6611 }
6612
6613 /* Allocate memory for the Pager structure, PCache object, the
6614 ** three file descriptors, the database file name and the journal
6615 ** file name. The layout in memory is as follows:
6616 **
6617 ** Pager object (sizeof(Pager) bytes)
6618 ** PCache object (sqlite3PcacheSize() bytes)
6619 ** Database file handle (pVfs->szOsFile bytes)
6620 ** Sub-journal file handle (journalFileSize bytes)
6621 ** Main journal file handle (journalFileSize bytes)
6622 ** Database file name (nPathname+1 bytes)
6623 ** Journal file name (nPathname+8+1 bytes)
6624 */
6625 pPtr = (u8 *)sqlite3MallocZero(
6626 ROUND8(sizeof(*pPager)) + /* Pager structure */
6627 ROUND8(pcacheSize) + /* PCache object */
6628 ROUND8(pVfs->szOsFile) + /* The main db file */
6629 journalFileSize * 2 + /* The two journal files */
6630 nPathname + 1 + nUri + /* zFilename */
6631 nPathname + 8 + 2 /* zJournal */
6632 #ifndef SQLITE_OMIT_WAL
6633 + nPathname + 4 + 2 /* zWal */
6634 #endif
6635 );
6636 assert( EIGHT_BYTE_ALIGNMENT(SQLITE_INT_TO_PTR(journalFileSize)) );
6637 if( !pPtr ){
6638 sqlite3DbFree(0, zPathname);
6639 return SQLITE_NOMEM_BKPT;
6640 }
6641 pPager = (Pager*)(pPtr);
6642 pPager->pPCache = (PCache*)(pPtr += ROUND8(sizeof(*pPager)));
6643 pPager->fd = (sqlite3_file*)(pPtr += ROUND8(pcacheSize));
6644 pPager->sjfd = (sqlite3_file*)(pPtr += ROUND8(pVfs->szOsFile));
6645 pPager->jfd = (sqlite3_file*)(pPtr += journalFileSize);
6646 pPager->zFilename = (char*)(pPtr += journalFileSize);
6647 assert( EIGHT_BYTE_ALIGNMENT(pPager->jfd) );
6648
6649 /* Fill in the Pager.zFilename and Pager.zJournal buffers, if required. */
6650 if( zPathname ){
6651 assert( nPathname>0 );
6652 pPager->zJournal = (char*)(pPtr += nPathname + 1 + nUri);
6653 memcpy(pPager->zFilename, zPathname, nPathname);
6654 if( nUri ) memcpy(&pPager->zFilename[nPathname+1], zUri, nUri);
6655 memcpy(pPager->zJournal, zPathname, nPathname);
6656 memcpy(&pPager->zJournal[nPathname], "-journal\000", 8+2);
6657 sqlite3FileSuffix3(pPager->zFilename, pPager->zJournal);
6658 #ifndef SQLITE_OMIT_WAL
6659 pPager->zWal = &pPager->zJournal[nPathname+8+1];
6660 memcpy(pPager->zWal, zPathname, nPathname);
6661 memcpy(&pPager->zWal[nPathname], "-wal\000", 4+1);
6662 sqlite3FileSuffix3(pPager->zFilename, pPager->zWal);
6663 #endif
6664 sqlite3DbFree(0, zPathname);
6665 }
6666 pPager->pVfs = pVfs;
6667 pPager->vfsFlags = vfsFlags;
6668
6669 /* Open the pager file.
6670 */
6671 if( zFilename && zFilename[0] ){
6672 int fout = 0; /* VFS flags returned by xOpen() */
6673 rc = sqlite3OsOpen(pVfs, pPager->zFilename, pPager->fd, vfsFlags, &fout);
6674 assert( !memDb );
6675 readOnly = (fout&SQLITE_OPEN_READONLY);
6676
6677 /* If the file was successfully opened for read/write access,
6678 ** choose a default page size in case we have to create the
6679 ** database file. The default page size is the maximum of:
6680 **
6681 ** + SQLITE_DEFAULT_PAGE_SIZE,
6682 ** + The value returned by sqlite3OsSectorSize()
6683 ** + The largest page size that can be written atomically.
6684 */
6685 if( rc==SQLITE_OK ){
6686 int iDc = sqlite3OsDeviceCharacteristics(pPager->fd);
6687 if( !readOnly ){
6688 setSectorSize(pPager);
6689 assert(SQLITE_DEFAULT_PAGE_SIZE<=SQLITE_MAX_DEFAULT_PAGE_SIZE);
6690 if( szPageDflt<pPager->sectorSize ){
6691 if( pPager->sectorSize>SQLITE_MAX_DEFAULT_PAGE_SIZE ){
6692 szPageDflt = SQLITE_MAX_DEFAULT_PAGE_SIZE;
6693 }else{
6694 szPageDflt = (u32)pPager->sectorSize;
6695 }
6696 }
6697 #ifdef SQLITE_ENABLE_ATOMIC_WRITE
6698 {
6699 int ii;
6700 assert(SQLITE_IOCAP_ATOMIC512==(512>>8));
6701 assert(SQLITE_IOCAP_ATOMIC64K==(65536>>8));
6702 assert(SQLITE_MAX_DEFAULT_PAGE_SIZE<=65536);
6703 for(ii=szPageDflt; ii<=SQLITE_MAX_DEFAULT_PAGE_SIZE; ii=ii*2){
6704 if( iDc&(SQLITE_IOCAP_ATOMIC|(ii>>8)) ){
6705 szPageDflt = ii;
6706 }
6707 }
6708 }
6709 #endif
6710 }
6711 pPager->noLock = sqlite3_uri_boolean(zFilename, "nolock", 0);
6712 if( (iDc & SQLITE_IOCAP_IMMUTABLE)!=0
6713 || sqlite3_uri_boolean(zFilename, "immutable", 0) ){
6714 vfsFlags |= SQLITE_OPEN_READONLY;
6715 goto act_like_temp_file;
6716 }
6717 }
6718 }else{
6719 /* If a temporary file is requested, it is not opened immediately.
6720 ** In this case we accept the default page size and delay actually
6721 ** opening the file until the first call to OsWrite().
6722 **
6723 ** This branch is also run for an in-memory database. An in-memory
6724 ** database is the same as a temp-file that is never written out to
6725 ** disk and uses an in-memory rollback journal.
6726 **
6727 ** This branch also runs for files marked as immutable.
6728 */
6729 act_like_temp_file:
6730 tempFile = 1;
6731 pPager->eState = PAGER_READER; /* Pretend we already have a lock */
6732 pPager->eLock = EXCLUSIVE_LOCK; /* Pretend we are in EXCLUSIVE mode */
6733 pPager->noLock = 1; /* Do no locking */
6734 readOnly = (vfsFlags&SQLITE_OPEN_READONLY);
6735 }
6736
6737 /* The following call to PagerSetPagesize() serves to set the value of
6738 ** Pager.pageSize and to allocate the Pager.pTmpSpace buffer.
6739 */
6740 if( rc==SQLITE_OK ){
6741 assert( pPager->memDb==0 );
6742 rc = sqlite3PagerSetPagesize(pPager, &szPageDflt, -1);
6743 testcase( rc!=SQLITE_OK );
6744 }
6745
6746 /* Initialize the PCache object. */
6747 if( rc==SQLITE_OK ){
6748 nExtra = ROUND8(nExtra);
6749 assert( nExtra>=8 && nExtra<1000 );
6750 rc = sqlite3PcacheOpen(szPageDflt, nExtra, !memDb,
6751 !memDb?pagerStress:0, (void *)pPager, pPager->pPCache);
6752 }
6753
6754 /* If an error occurred above, free the Pager structure and close the file.
6755 */
6756 if( rc!=SQLITE_OK ){
6757 sqlite3OsClose(pPager->fd);
6758 sqlite3PageFree(pPager->pTmpSpace);
6759 sqlite3_free(pPager);
6760 return rc;
6761 }
6762
6763 PAGERTRACE(("OPEN %d %s\n", FILEHANDLEID(pPager->fd), pPager->zFilename));
6764 IOTRACE(("OPEN %p %s\n", pPager, pPager->zFilename))
6765
6766 pPager->useJournal = (u8)useJournal;
6767 /* pPager->stmtOpen = 0; */
6768 /* pPager->stmtInUse = 0; */
6769 /* pPager->nRef = 0; */
6770 /* pPager->stmtSize = 0; */
6771 /* pPager->stmtJSize = 0; */
6772 /* pPager->nPage = 0; */
6773 pPager->mxPgno = SQLITE_MAX_PAGE_COUNT;
6774 /* pPager->state = PAGER_UNLOCK; */
6775 /* pPager->errMask = 0; */
6776 pPager->tempFile = (u8)tempFile;
6777 assert( tempFile==PAGER_LOCKINGMODE_NORMAL
6778 || tempFile==PAGER_LOCKINGMODE_EXCLUSIVE );
6779 assert( PAGER_LOCKINGMODE_EXCLUSIVE==1 );
6780 pPager->exclusiveMode = (u8)tempFile;
6781 pPager->changeCountDone = pPager->tempFile;
6782 pPager->memDb = (u8)memDb;
6783 pPager->readOnly = (u8)readOnly;
6784 assert( useJournal || pPager->tempFile );
6785 pPager->noSync = pPager->tempFile;
6786 if( pPager->noSync ){
6787 assert( pPager->fullSync==0 );
6788 assert( pPager->extraSync==0 );
6789 assert( pPager->syncFlags==0 );
6790 assert( pPager->walSyncFlags==0 );
6791 assert( pPager->ckptSyncFlags==0 );
6792 }else{
6793 pPager->fullSync = 1;
6794 pPager->extraSync = 0;
6795 pPager->syncFlags = SQLITE_SYNC_NORMAL;
6796 pPager->walSyncFlags = SQLITE_SYNC_NORMAL | WAL_SYNC_TRANSACTIONS;
6797 pPager->ckptSyncFlags = SQLITE_SYNC_NORMAL;
6798 }
6799 /* pPager->pFirst = 0; */
6800 /* pPager->pFirstSynced = 0; */
6801 /* pPager->pLast = 0; */
6802 pPager->nExtra = (u16)nExtra;
6803 pPager->journalSizeLimit = SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT;
6804 assert( isOpen(pPager->fd) || tempFile );
6805 setSectorSize(pPager);
6806 if( !useJournal ){
6807 pPager->journalMode = PAGER_JOURNALMODE_OFF;
6808 }else if( memDb ){
6809 pPager->journalMode = PAGER_JOURNALMODE_MEMORY;
6810 }
6811 /* pPager->xBusyHandler = 0; */
6812 /* pPager->pBusyHandlerArg = 0; */
6813 pPager->xReiniter = xReinit;
6814 setGetterMethod(pPager);
6815 /* memset(pPager->aHash, 0, sizeof(pPager->aHash)); */
6816 /* pPager->szMmap = SQLITE_DEFAULT_MMAP_SIZE // will be set by btree.c */
6817
6818 *ppPager = pPager;
6819 return SQLITE_OK;
6820 }
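The loop in sqlite3PagerOpen() that computes nUri relies on the filename being followed by zero or more key/value strings, each NUL-terminated, with an empty string marking the end. A minimal sketch of that walk follows, using strlen() in place of sqlite3Strlen30(). The name exampleUriBytes is hypothetical, and the buffer layout is assumed from the loop itself.

```c
#include <string.h>

/* Hypothetical helper (not part of SQLite): number of bytes occupied by
** the URI parameters stored after zFilename, including the terminating
** empty string. Assumes the layout "name\0key\0value\0...\0". */
static int exampleUriBytes(const char *zFilename){
  const char *zUri = zFilename + strlen(zFilename) + 1;
  const char *z = zUri;
  while( *z ){
    z += strlen(z) + 1;   /* skip the key and its terminator */
    z += strlen(z) + 1;   /* skip the value and its terminator */
  }
  return (int)(&z[1] - zUri);
}
```

With no parameters the result is 1, which accounts for the single terminator byte that is still copied after the filename.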
6821
6822
6823 /* Verify that the database file has not been deleted or renamed out from
6824 ** under the pager. Return SQLITE_OK if the database is still where it ought
6825 ** to be on disk. Return non-zero (SQLITE_READONLY_DBMOVED or some other error
6826 ** code from sqlite3OsAccess()) if the database has gone missing.
6827 */
6828 static int databaseIsUnmoved(Pager *pPager){
6829 int bHasMoved = 0;
6830 int rc;
6831
6832 if( pPager->tempFile ) return SQLITE_OK;
6833 if( pPager->dbSize==0 ) return SQLITE_OK;
6834 assert( pPager->zFilename && pPager->zFilename[0] );
6835 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_HAS_MOVED, &bHasMoved);
6836 if( rc==SQLITE_NOTFOUND ){
6837 /* If the HAS_MOVED file-control is unimplemented, assume that the file
6838 ** has not been moved. That is the historical behavior of SQLite: prior to
6839 ** version 3.8.3, it never checked */
6840 rc = SQLITE_OK;
6841 }else if( rc==SQLITE_OK && bHasMoved ){
6842 rc = SQLITE_READONLY_DBMOVED;
6843 }
6844 return rc;
6845 }
6846
6847
6848 /*
6849 ** This function is called after transitioning from PAGER_UNLOCK to
6850 ** PAGER_SHARED state. It tests if there is a hot journal present in
6851 ** the file-system for the given pager. A hot journal is one that
6852 ** needs to be played back. According to this function, a hot-journal
6853 ** file exists if the following criteria are met:
6854 **
6855 ** * The journal file exists in the file system, and
6856 ** * No process holds a RESERVED or greater lock on the database file, and
6857 ** * The database file itself is greater than 0 bytes in size, and
6858 ** * The first byte of the journal file exists and is not 0x00.
6859 **
6860 ** If the current size of the database file is 0 but a journal file
6861 ** exists, that is probably an old journal left over from a prior
6862 ** database with the same name. In this case the journal file is
6863 ** just deleted using OsDelete, *pExists is set to 0 and SQLITE_OK
6864 ** is returned.
6865 **
6866 ** This routine does not check if there is a master journal filename
6867 ** at the end of the file. If there is, and that master journal file
6868 ** does not exist, then the journal file is not really hot. In this
6869 ** case this routine will return a false-positive. The pager_playback()
6870 ** routine will discover that the journal file is not really hot and
6871 ** will not roll it back.
6872 **
6873 ** If a hot-journal file is found to exist, *pExists is set to 1 and
6874 ** SQLITE_OK returned. If no hot-journal file is present, *pExists is
6875 ** set to 0 and SQLITE_OK returned. If an IO error occurs while trying
6876 ** to determine whether or not a hot-journal file exists, the IO error
6877 ** code is returned and the value of *pExists is undefined.
6878 */
6879 static int hasHotJournal(Pager *pPager, int *pExists){
6880 sqlite3_vfs * const pVfs = pPager->pVfs;
6881 int rc = SQLITE_OK; /* Return code */
6882 int exists = 1; /* True if a journal file is present */
6883 int jrnlOpen = !!isOpen(pPager->jfd);
6884
6885 assert( pPager->useJournal );
6886 assert( isOpen(pPager->fd) );
6887 assert( pPager->eState==PAGER_OPEN );
6888
6889 assert( jrnlOpen==0 || ( sqlite3OsDeviceCharacteristics(pPager->jfd) &
6890 SQLITE_IOCAP_UNDELETABLE_WHEN_OPEN
6891 ));
6892
6893 *pExists = 0;
6894 if( !jrnlOpen ){
6895 rc = sqlite3OsAccess(pVfs, pPager->zJournal, SQLITE_ACCESS_EXISTS, &exists);
6896 }
6897 if( rc==SQLITE_OK && exists ){
6898 int locked = 0; /* True if some process holds a RESERVED lock */
6899
6900 /* Race condition here: Another process might have been holding the
6901 ** RESERVED lock and have a journal open at the sqlite3OsAccess()
6902 ** call above, but then delete the journal and drop the lock before
6903 ** we get to the following sqlite3OsCheckReservedLock() call. If that
6904 ** is the case, this routine might think there is a hot journal when
6905 ** in fact there is none. This results in a false-positive which will
6906 ** be dealt with by the playback routine. Ticket #3883.
6907 */
6908 rc = sqlite3OsCheckReservedLock(pPager->fd, &locked);
6909 if( rc==SQLITE_OK && !locked ){
6910 Pgno nPage; /* Number of pages in database file */
6911
6912 assert( pPager->tempFile==0 );
6913 rc = pagerPagecount(pPager, &nPage);
6914 if( rc==SQLITE_OK ){
6915 /* If the database is zero pages in size, that means that either (1) the
6916 ** journal is a remnant from a prior database with the same name where
6917 ** the database file but not the journal was deleted, or (2) the initial
6918 ** transaction that populates a new database is being rolled back.
6919 ** In either case, the journal file can be deleted. However, take care
6920 ** not to delete the journal file if it is already open due to
6921 ** journal_mode=PERSIST.
6922 */
6923 if( nPage==0 && !jrnlOpen ){
6924 sqlite3BeginBenignMalloc();
6925 if( pagerLockDb(pPager, RESERVED_LOCK)==SQLITE_OK ){
6926 sqlite3OsDelete(pVfs, pPager->zJournal, 0);
6927 if( !pPager->exclusiveMode ) pagerUnlockDb(pPager, SHARED_LOCK);
6928 }
6929 sqlite3EndBenignMalloc();
6930 }else{
6931 /* The journal file exists and no other connection has a reserved
6932 ** or greater lock on the database file. Now check that there is
6933 ** at least one non-zero byte at the start of the journal file.
6934 ** If there is, then we consider this journal to be hot. If not,
6935 ** it can be ignored.
6936 */
6937 if( !jrnlOpen ){
6938 int f = SQLITE_OPEN_READONLY|SQLITE_OPEN_MAIN_JOURNAL;
6939 rc = sqlite3OsOpen(pVfs, pPager->zJournal, pPager->jfd, f, &f);
6940 }
6941 if( rc==SQLITE_OK ){
6942 u8 first = 0;
6943 rc = sqlite3OsRead(pPager->jfd, (void *)&first, 1, 0);
6944 if( rc==SQLITE_IOERR_SHORT_READ ){
6945 rc = SQLITE_OK;
6946 }
6947 if( !jrnlOpen ){
6948 sqlite3OsClose(pPager->jfd);
6949 }
6950 *pExists = (first!=0);
6951 }else if( rc==SQLITE_CANTOPEN ){
6952 /* If we cannot open the rollback journal file in order to see if
6953 ** it has a zero header, that might be due to an I/O error, or
6954 ** it might be due to the race condition described above and in
6955 ** ticket #3883. Either way, assume that the journal is hot.
6956 ** This might be a false positive. But if it is, then the
6957 ** automatic journal playback and recovery mechanism will deal
6958 ** with it under an EXCLUSIVE lock where we do not need to
6959 ** worry so much about race conditions.
6960 */
6961 *pExists = 1;
6962 rc = SQLITE_OK;
6963 }
6964 }
6965 }
6966 }
6967 }
6968
6969 return rc;
6970 }
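The four criteria listed in the header comment of hasHotJournal() reduce to a simple conjunction once the I/O results are known. A minimal sketch follows. The name exampleIsHotJournal and its parameters are hypothetical, and the sketch deliberately leaves out the I/O calls, the deletion of stale journals, the already-open journal case, and the SQLITE_CANTOPEN fallback that the real routine handles.

```c
/* Hypothetical predicate (not part of SQLite) mirroring the four listed
** criteria. The inputs are assumed to have been gathered by the caller. */
static int exampleIsHotJournal(
  int journalExists,        /* journal file is present in the file system */
  int reservedLockHeld,     /* some process holds RESERVED or greater */
  unsigned nPage,           /* size of the database file in pages */
  unsigned char firstByte   /* first byte of the journal, 0 if unreadable */
){
  if( !journalExists ) return 0;
  if( reservedLockHeld ) return 0;   /* a writer is active; not hot */
  if( nPage==0 ) return 0;           /* stale journal from an old database */
  return firstByte!=0;               /* zeroed header means finalized */
}
```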
6971
6972 /*
6973 ** This function is called to obtain a shared lock on the database file.
6974 ** It is illegal to call sqlite3PagerGet() until after this function
6975 ** has been successfully called. If a shared-lock is already held when
6976 ** this function is called, it is a no-op.
6977 **
6978 ** The following operations are also performed by this function.
6979 **
6980 ** 1) If the pager is currently in PAGER_OPEN state (no lock held
6981 ** on the database file), then an attempt is made to obtain a
6982 ** SHARED lock on the database file. Immediately after obtaining
6983 ** the SHARED lock, the file-system is checked for a hot-journal,
6984 ** which is played back if present. Following any hot-journal
6985 ** rollback, the contents of the cache are validated by checking
6986 ** the 'change-counter' field of the database file header and
6987 ** discarded if they are found to be invalid.
6988 **
6989 ** 2) If the pager is running in exclusive-mode, and there are currently
6990 ** no outstanding references to any pages, and is in the error state,
6991 ** then an attempt is made to clear the error state by discarding
6992 ** the contents of the page cache and rolling back any open journal
6993 ** file.
6994 **
6995 ** If everything is successful, SQLITE_OK is returned. If an IO error
6996 ** occurs while locking the database, checking for a hot-journal file or
6997 ** rolling back a journal file, the IO error code is returned.
6998 */
6999 SQLITE_PRIVATE int sqlite3PagerSharedLock(Pager *pPager){
7000 int rc = SQLITE_OK; /* Return code */
7001
7002 /* This routine is only called from b-tree and only when there are no
7003 ** outstanding pages. This implies that the pager state should either
7004 ** be OPEN or READER. READER is only possible if the pager is or was in
7005 ** exclusive access mode. */
7006 assert( sqlite3PcacheRefCount(pPager->pPCache)==0 );
7007 assert( assert_pager_state(pPager) );
7008 assert( pPager->eState==PAGER_OPEN || pPager->eState==PAGER_READER );
7009 assert( pPager->errCode==SQLITE_OK );
7010
7011 if( !pagerUseWal(pPager) && pPager->eState==PAGER_OPEN ){
7012 int bHotJournal = 1; /* True if there exists a hot journal-file */
7013
7014 assert( !MEMDB );
7015 assert( pPager->tempFile==0 || pPager->eLock==EXCLUSIVE_LOCK );
7016
7017 rc = pager_wait_on_lock(pPager, SHARED_LOCK);
7018 if( rc!=SQLITE_OK ){
7019 assert( pPager->eLock==NO_LOCK || pPager->eLock==UNKNOWN_LOCK );
7020 goto failed;
7021 }
7022
7023 /* If a journal file exists, and there is no RESERVED lock on the
7024 ** database file, then it either needs to be played back or deleted.
7025 */
7026 if( pPager->eLock<=SHARED_LOCK ){
7027 rc = hasHotJournal(pPager, &bHotJournal);
7028 }
7029 if( rc!=SQLITE_OK ){
7030 goto failed;
7031 }
7032 if( bHotJournal ){
7033 if( pPager->readOnly ){
7034 rc = SQLITE_READONLY_ROLLBACK;
7035 goto failed;
7036 }
7037
7038 /* Get an EXCLUSIVE lock on the database file. At this point it is
7039 ** important that a RESERVED lock is not obtained on the way to the
7040 ** EXCLUSIVE lock. If it were, another process might open the
7041 ** database file, detect the RESERVED lock, and conclude that the
7042 ** database is safe to read while this process is still rolling the
7043 ** hot-journal back.
7044 **
7045 ** Because the intermediate RESERVED lock is not requested, any
7046 ** other process attempting to access the database file will get to
7047 ** this point in the code and fail to obtain its own EXCLUSIVE lock
7048 ** on the database file.
7049 **
7050 ** Unless the pager is in locking_mode=exclusive mode, the lock is
7051 ** downgraded to SHARED_LOCK before this function returns.
7052 */
7053 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);
7054 if( rc!=SQLITE_OK ){
7055 goto failed;
7056 }
7057
7058 /* If it is not already open and the file exists on disk, open the
7059 ** journal for read/write access. Write access is required because
7060 ** in exclusive-access mode the file descriptor will be kept open
7061 ** and possibly used for a transaction later on. Also, write-access
7062 ** is usually required to finalize the journal in journal_mode=persist
7063 ** mode (and also for journal_mode=truncate on some systems).
7064 **
7065 ** If the journal does not exist, it usually means that some
7066 ** other connection managed to get in and roll it back before
7067 ** this connection obtained the exclusive lock above. Or, it
7068 ** may mean that the pager was in the error-state when this
7069 ** function was called and the journal file does not exist.
7070 */
7071 if( !isOpen(pPager->jfd) ){
7072 sqlite3_vfs * const pVfs = pPager->pVfs;
7073 int bExists; /* True if journal file exists */
7074 rc = sqlite3OsAccess(
7075 pVfs, pPager->zJournal, SQLITE_ACCESS_EXISTS, &bExists);
7076 if( rc==SQLITE_OK && bExists ){
7077 int fout = 0;
7078 int f = SQLITE_OPEN_READWRITE|SQLITE_OPEN_MAIN_JOURNAL;
7079 assert( !pPager->tempFile );
7080 rc = sqlite3OsOpen(pVfs, pPager->zJournal, pPager->jfd, f, &fout);
7081 assert( rc!=SQLITE_OK || isOpen(pPager->jfd) );
7082 if( rc==SQLITE_OK && fout&SQLITE_OPEN_READONLY ){
7083 rc = SQLITE_CANTOPEN_BKPT;
7084 sqlite3OsClose(pPager->jfd);
7085 }
7086 }
7087 }
7088
7089 /* Playback and delete the journal. Drop the database write
7090 ** lock and reacquire the read lock. Purge the cache before
7091 ** playing back the hot-journal so that we don't end up with
7092 ** an inconsistent cache. Sync the hot journal before playing
7093 ** it back since the process that crashed and left the hot journal
7094 ** probably did not sync it and we are required to always sync
7095 ** the journal before playing it back.
7096 */
7097 if( isOpen(pPager->jfd) ){
7098 assert( rc==SQLITE_OK );
7099 rc = pagerSyncHotJournal(pPager);
7100 if( rc==SQLITE_OK ){
7101 rc = pager_playback(pPager, !pPager->tempFile);
7102 pPager->eState = PAGER_OPEN;
7103 }
7104 }else if( !pPager->exclusiveMode ){
7105 pagerUnlockDb(pPager, SHARED_LOCK);
7106 }
7107
7108 if( rc!=SQLITE_OK ){
7109 /* This branch is taken if an error occurs while trying to open
7110 ** or roll back a hot-journal while holding an EXCLUSIVE lock. The
7111 ** pager_unlock() routine will be called before returning to unlock
7112 ** the file. If the unlock attempt fails, then Pager.eLock must be
7113 ** set to UNKNOWN_LOCK (see the comment above the #define for
7114 ** UNKNOWN_LOCK above for an explanation).
7115 **
7116 ** In order to get pager_unlock() to do this, set Pager.eState to
7117 ** PAGER_ERROR now. This is not actually counted as a transition
7118 ** to ERROR state in the state diagram at the top of this file,
7119 ** since we know that the same call to pager_unlock() will very
7120 ** shortly transition the pager object to the OPEN state. Calling
7121 ** assert_pager_state() would fail now, as it should not be possible
7122 ** to be in ERROR state when there are zero outstanding page
7123 ** references.
7124 */
7125 pager_error(pPager, rc);
7126 goto failed;
7127 }
7128
7129 assert( pPager->eState==PAGER_OPEN );
7130 assert( (pPager->eLock==SHARED_LOCK)
7131 || (pPager->exclusiveMode && pPager->eLock>SHARED_LOCK)
7132 );
7133 }
7134
7135 if( !pPager->tempFile && pPager->hasHeldSharedLock ){
7136 /* The shared-lock has just been acquired, so check to
7137 ** see if the database has been modified. If the database has changed,
7138 ** flush the cache. The hasHeldSharedLock flag prevents this from
7139 ** occurring on the very first access to a file, in order to save a
7140 ** single unnecessary sqlite3OsRead() call at the start-up.
7141 **
7142 ** Database changes are detected by looking at 16 bytes beginning
7143 ** at offset 24 into the file. The first 4 of these 16 bytes are
7144 ** a 32-bit counter that is incremented with each change. The
7145 ** other bytes change randomly with each file change when
7146 ** a codec is in use.
7147 **
7148 ** There is a vanishingly small chance that a change will not be
7149 ** detected. The chance of an undetected change is so small that
7150 ** it can be neglected.
7151 */
7152 Pgno nPage = 0;
7153 char dbFileVers[sizeof(pPager->dbFileVers)];
7154
7155 rc = pagerPagecount(pPager, &nPage);
7156 if( rc ) goto failed;
7157
7158 if( nPage>0 ){
7159 IOTRACE(("CKVERS %p %d\n", pPager, sizeof(dbFileVers)));
7160 rc = sqlite3OsRead(pPager->fd, &dbFileVers, sizeof(dbFileVers), 24);
7161 if( rc!=SQLITE_OK && rc!=SQLITE_IOERR_SHORT_READ ){
7162 goto failed;
7163 }
7164 }else{
7165 memset(dbFileVers, 0, sizeof(dbFileVers));
7166 }
7167
7168 if( memcmp(pPager->dbFileVers, dbFileVers, sizeof(dbFileVers))!=0 ){
7169 pager_reset(pPager);
7170
7171 /* Unmap the database file. It is possible that external processes
7172 ** may have truncated the database file and then extended it back
7173 ** to its original size while this process was not holding a lock.
7174 ** In this case there may exist a Pager.pMap mapping that appears
7175 ** to be the right size but is not actually valid. Avoid this
7176 ** possibility by unmapping the db here. */
7177 if( USEFETCH(pPager) ){
7178 sqlite3OsUnfetch(pPager->fd, 0, 0);
7179 }
7180 }
7181 }
7182
7183 /* If there is a WAL file in the file-system, open this database in WAL
7184 ** mode. Otherwise, the following function call is a no-op.
7185 */
7186 rc = pagerOpenWalIfPresent(pPager);
7187 #ifndef SQLITE_OMIT_WAL
7188 assert( pPager->pWal==0 || rc==SQLITE_OK );
7189 #endif
7190 }
7191
7192 if( pagerUseWal(pPager) ){
7193 assert( rc==SQLITE_OK );
7194 rc = pagerBeginReadTransaction(pPager);
7195 }
7196
7197 if( pPager->tempFile==0 && pPager->eState==PAGER_OPEN && rc==SQLITE_OK ){
7198 rc = pagerPagecount(pPager, &pPager->dbSize);
7199 }
7200
7201 failed:
7202 if( rc!=SQLITE_OK ){
7203 assert( !MEMDB );
7204 pager_unlock(pPager);
7205 assert( pPager->eState==PAGER_OPEN );
7206 }else{
7207 pPager->eState = PAGER_READER;
7208 pPager->hasHeldSharedLock = 1;
7209 }
7210 return rc;
7211 }
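
Review note on the change check above: a minimal sketch of the comparison, assuming the version bytes are the 16 bytes starting at offset 24, as the comment describes. The names `db_changed`, `DB_VERS_OFFSET`, and `DB_VERS_SIZE` are hypothetical and not part of SQLite. Treating bytes past the end of a short or empty file as zero follows the memset() branch above, but is otherwise an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DB_VERS_OFFSET 24  /* hypothetical name: offset of the version bytes */
#define DB_VERS_SIZE   16  /* hypothetical name: number of bytes compared */

/* Return 1 if the version bytes in `file` differ from `cached`, else 0.
** `file` holds the first `nFile` bytes of the database file. Bytes past
** the end of a short or empty file are treated as zero. */
static int db_changed(const unsigned char *cached,
                      const unsigned char *file, size_t nFile){
  unsigned char cur[DB_VERS_SIZE];
  size_t nAvail = 0;
  assert( cached!=0 );
  memset(cur, 0, sizeof(cur));
  if( file!=0 && nFile>DB_VERS_OFFSET ){
    nAvail = nFile - DB_VERS_OFFSET;
    if( nAvail>sizeof(cur) ) nAvail = sizeof(cur);
    memcpy(cur, file+DB_VERS_OFFSET, nAvail);
  }
  return memcmp(cached, cur, sizeof(cur))!=0;
}
```

A caller would keep `cached` from the previous read and drop its page cache whenever `db_changed()` returns 1, which is the role pager_reset() plays in the code above.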
7212
7213 /*
7214 ** If the reference count has reached zero, rollback any active
7215 ** transaction and unlock the pager.
7216 **
7217 ** Except, in locking_mode=EXCLUSIVE when there is nothing in
7218 ** the rollback journal, the unlock is not performed and there is
7219 ** nothing to roll back, so this routine is a no-op.
7220 */
7221 static void pagerUnlockIfUnused(Pager *pPager){
7222 if( pPager->nMmapOut==0 && (sqlite3PcacheRefCount(pPager->pPCache)==0) ){
7223 pagerUnlockAndRollback(pPager);
7224 }
7225 }
7226
7227 /*
7228 ** The page getter methods each try to acquire a reference to a
7229 ** page with page number pgno. If the requested reference is
7230 ** successfully obtained, it is copied to *ppPage and SQLITE_OK returned.
7231 **
7232 ** There are different implementations of the getter method depending
7233 ** on the current state of the pager.
7234 **
7235 ** getPageNormal() -- The normal getter
7236 ** getPageError() -- Used if the pager is in an error state
7237 ** getPageMMap() -- Used if memory-mapped I/O is enabled
7238 **
7239 ** If the requested page is already in the cache, it is returned.
7240 ** Otherwise, a new page object is allocated and populated with data
7241 ** read from the database file. In some cases, the pcache module may
7242 ** choose not to allocate a new page object and may reuse an existing
7243 ** object with no outstanding references.
7244 **
7245 ** The extra data appended to a page is always initialized to zeros the
7246 ** first time a page is loaded into memory. If the page requested is
7247 ** already in the cache when this function is called, then the extra
7248 ** data is left as it was when the page object was last used.
7249 **
7250 ** If the database image is smaller than the requested page or if
7251 ** the flags parameter contains the PAGER_GET_NOCONTENT bit and the
7252 ** requested page is not already stored in the cache, then no
7253 ** actual disk read occurs. In this case the memory image of the
7254 ** page is initialized to all zeros.
7255 **
7256 ** If PAGER_GET_NOCONTENT is true, it means that we do not care about
7257 ** the contents of the page. This occurs in two scenarios:
7258 **
7259 ** a) When reading a free-list leaf page from the database, and
7260 **
7261 ** b) When a savepoint is being rolled back and we need to load
7262 ** a new page into the cache to be filled with the data read
7263 ** from the savepoint journal.
7264 **
7265 ** If PAGER_GET_NOCONTENT is true, then the data returned is zeroed instead
7266 ** of being read from the database. Additionally, the bits corresponding
7267 ** to pgno in Pager.pInJournal (bitvec of pages already written to the
7268 ** journal file) and the PagerSavepoint.pInSavepoint bitvecs of any open
7269 ** savepoints are set. This means if the page is made writable at any
7270 ** point in the future, using a call to sqlite3PagerWrite(), its contents
7271 ** will not be journaled. This saves IO.
7272 **
7273 ** The acquisition might fail for several reasons. In all cases,
7274 ** an appropriate error code is returned and *ppPage is set to NULL.
7275 **
7276 ** See also sqlite3PagerLookup(). Both this routine and Lookup() attempt
7277 ** to find a page in the in-memory cache first. If the page is not already
7278 ** in memory, this routine goes to disk to read it in whereas Lookup()
7279 ** just returns 0. This routine acquires a read-lock the first time it
7280 ** has to go to disk, and could also playback an old journal if necessary.
7281 ** Since Lookup() never goes to disk, it never has to deal with locks
7282 ** or journal files.
7283 */
7284 static int getPageNormal(
7285 Pager *pPager, /* The pager open on the database file */
7286 Pgno pgno, /* Page number to fetch */
7287 DbPage **ppPage, /* Write a pointer to the page here */
7288 int flags /* PAGER_GET_XXX flags */
7289 ){
7290 int rc = SQLITE_OK;
7291 PgHdr *pPg;
7292 u8 noContent; /* True if PAGER_GET_NOCONTENT is set */
7293 sqlite3_pcache_page *pBase;
7294
7295 assert( pPager->errCode==SQLITE_OK );
7296 assert( pPager->eState>=PAGER_READER );
7297 assert( assert_pager_state(pPager) );
7298 assert( pPager->hasHeldSharedLock==1 );
7299
7300 if( pgno==0 ) return SQLITE_CORRUPT_BKPT;
7301 pBase = sqlite3PcacheFetch(pPager->pPCache, pgno, 3);
7302 if( pBase==0 ){
7303 pPg = 0;
7304 rc = sqlite3PcacheFetchStress(pPager->pPCache, pgno, &pBase);
7305 if( rc!=SQLITE_OK ) goto pager_acquire_err;
7306 if( pBase==0 ){
7307 rc = SQLITE_NOMEM_BKPT;
7308 goto pager_acquire_err;
7309 }
7310 }
7311 pPg = *ppPage = sqlite3PcacheFetchFinish(pPager->pPCache, pgno, pBase);
7312 assert( pPg==(*ppPage) );
7313 assert( pPg->pgno==pgno );
7314 assert( pPg->pPager==pPager || pPg->pPager==0 );
7315
7316 noContent = (flags & PAGER_GET_NOCONTENT)!=0;
7317 if( pPg->pPager && !noContent ){
7318 /* In this case the pcache already contains an initialized copy of
7319 ** the page. Return without further ado. */
7320 assert( pgno<=PAGER_MAX_PGNO && pgno!=PAGER_MJ_PGNO(pPager) );
7321 pPager->aStat[PAGER_STAT_HIT]++;
7322 return SQLITE_OK;
7323
7324 }else{
7325 /* The pager cache has created a new page. Its content needs to
7326 ** be initialized. But first some error checks:
7327 **
7328 ** (1) The maximum page number is 2^31-1
7329 ** (2) Never try to fetch the locking page
7330 */
7331 if( pgno>PAGER_MAX_PGNO || pgno==PAGER_MJ_PGNO(pPager) ){
7332 rc = SQLITE_CORRUPT_BKPT;
7333 goto pager_acquire_err;
7334 }
7335
7336 pPg->pPager = pPager;
7337
7338 assert( !isOpen(pPager->fd) || !MEMDB );
7339 if( !isOpen(pPager->fd) || pPager->dbSize<pgno || noContent ){
7340 if( pgno>pPager->mxPgno ){
7341 rc = SQLITE_FULL;
7342 goto pager_acquire_err;
7343 }
7344 if( noContent ){
7345 /* Failure to set the bits in the InJournal bit-vectors is benign.
7346 ** It merely means that we might do some extra work to journal a
7347 ** page that does not need to be journaled. Nevertheless, be sure
7348 ** to test the case where a malloc error occurs while trying to set
7349 ** a bit in a bit vector.
7350 */
7351 sqlite3BeginBenignMalloc();
7352 if( pgno<=pPager->dbOrigSize ){
7353 TESTONLY( rc = ) sqlite3BitvecSet(pPager->pInJournal, pgno);
7354 testcase( rc==SQLITE_NOMEM );
7355 }
7356 TESTONLY( rc = ) addToSavepointBitvecs(pPager, pgno);
7357 testcase( rc==SQLITE_NOMEM );
7358 sqlite3EndBenignMalloc();
7359 }
7360 memset(pPg->pData, 0, pPager->pageSize);
7361 IOTRACE(("ZERO %p %d\n", pPager, pgno));
7362 }else{
7363 u32 iFrame = 0; /* Frame to read from WAL file */
7364 if( pagerUseWal(pPager) ){
7365 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iFrame);
7366 if( rc!=SQLITE_OK ) goto pager_acquire_err;
7367 }
7368 assert( pPg->pPager==pPager );
7369 pPager->aStat[PAGER_STAT_MISS]++;
7370 rc = readDbPage(pPg, iFrame);
7371 if( rc!=SQLITE_OK ){
7372 goto pager_acquire_err;
7373 }
7374 }
7375 pager_set_pagehash(pPg);
7376 }
7377 return SQLITE_OK;
7378
7379 pager_acquire_err:
7380 assert( rc!=SQLITE_OK );
7381 if( pPg ){
7382 sqlite3PcacheDrop(pPg);
7383 }
7384 pagerUnlockIfUnused(pPager);
7385 *ppPage = 0;
7386 return rc;
7387 }
7388
7389 #if SQLITE_MAX_MMAP_SIZE>0
7390 /* The page getter for when memory-mapped I/O is enabled */
7391 static int getPageMMap(
7392 Pager *pPager, /* The pager open on the database file */
7393 Pgno pgno, /* Page number to fetch */
7394 DbPage **ppPage, /* Write a pointer to the page here */
7395 int flags /* PAGER_GET_XXX flags */
7396 ){
7397 int rc = SQLITE_OK;
7398 PgHdr *pPg = 0;
7399 u32 iFrame = 0; /* Frame to read from WAL file */
7400
7401 /* It is acceptable to use a read-only (mmap) page for any page except
7402 ** page 1 if there is no write-transaction open or the PAGER_GET_READONLY
7403 ** flag was specified by the caller. And so long as the db is not a
7404 ** temporary or in-memory database. */
7405 const int bMmapOk = (pgno>1
7406 && (pPager->eState==PAGER_READER || (flags & PAGER_GET_READONLY))
7407 );
7408
7409 assert( USEFETCH(pPager) );
7410 #ifdef SQLITE_HAS_CODEC
7411 assert( pPager->xCodec==0 );
7412 #endif
7413
7414 /* Optimization note: Adding the "pgno<=1" term before "pgno==0" here
7415 ** allows the compiler optimizer to reuse the results of the "pgno>1"
7416 ** test in the previous statement, and avoid testing pgno==0 in the
7417 ** common case where pgno is large. */
7418 if( pgno<=1 && pgno==0 ){
7419 return SQLITE_CORRUPT_BKPT;
7420 }
7421 assert( pPager->eState>=PAGER_READER );
7422 assert( assert_pager_state(pPager) );
7423 assert( pPager->hasHeldSharedLock==1 );
7424 assert( pPager->errCode==SQLITE_OK );
7425
7426 if( bMmapOk && pagerUseWal(pPager) ){
7427 rc = sqlite3WalFindFrame(pPager->pWal, pgno, &iFrame);
7428 if( rc!=SQLITE_OK ){
7429 *ppPage = 0;
7430 return rc;
7431 }
7432 }
7433 if( bMmapOk && iFrame==0 ){
7434 void *pData = 0;
7435 rc = sqlite3OsFetch(pPager->fd,
7436 (i64)(pgno-1) * pPager->pageSize, pPager->pageSize, &pData
7437 );
7438 if( rc==SQLITE_OK && pData ){
7439 if( pPager->eState>PAGER_READER || pPager->tempFile ){
7440 pPg = sqlite3PagerLookup(pPager, pgno);
7441 }
7442 if( pPg==0 ){
7443 rc = pagerAcquireMapPage(pPager, pgno, pData, &pPg);
7444 }else{
7445 sqlite3OsUnfetch(pPager->fd, (i64)(pgno-1)*pPager->pageSize, pData);
7446 }
7447 if( pPg ){
7448 assert( rc==SQLITE_OK );
7449 *ppPage = pPg;
7450 return SQLITE_OK;
7451 }
7452 }
7453 if( rc!=SQLITE_OK ){
7454 *ppPage = 0;
7455 return rc;
7456 }
7457 }
7458 return getPageNormal(pPager, pgno, ppPage, flags);
7459 }
7460 #endif /* SQLITE_MAX_MMAP_SIZE>0 */
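
Review note on the sqlite3OsFetch() call in getPageMMap(): the byte offset of a page is `(pgno-1) * pageSize`, and the cast to a 64-bit type has to happen before the multiply so the product does not wrap for large databases. The sketch below shows that arithmetic on its own. The name `page_offset` is hypothetical, and the fixed-width types stand in for SQLite's `Pgno` and `i64`.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of page `pgno` (1-based) in a file made of `pageSize`-byte
** pages. The cast is applied before the multiply, so the product is
** computed in 64 bits. */
static int64_t page_offset(uint32_t pgno, uint32_t pageSize){
  assert( pgno>=1 );
  return (int64_t)(pgno-1) * pageSize;
}
```

For example, page 2,000,000 with 4096-byte pages starts at byte 8,191,995,904, which does not fit in 32 bits.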
7461
7462 /* The page getter method for when the pager is in an error state */
7463 static int getPageError(
7464 Pager *pPager, /* The pager open on the database file */
7465 Pgno pgno, /* Page number to fetch */
7466 DbPage **ppPage, /* Write a pointer to the page here */
7467 int flags /* PAGER_GET_XXX flags */
7468 ){
7469 UNUSED_PARAMETER(pgno);
7470 UNUSED_PARAMETER(flags);
7471 assert( pPager->errCode!=SQLITE_OK );
7472 *ppPage = 0;
7473 return pPager->errCode;
7474 }
7475
7476
7477 /* Dispatch all page fetch requests to the appropriate getter method.
7478 */
7479 SQLITE_PRIVATE int sqlite3PagerGet(
7480 Pager *pPager, /* The pager open on the database file */
7481 Pgno pgno, /* Page number to fetch */
7482 DbPage **ppPage, /* Write a pointer to the page here */
7483 int flags /* PAGER_GET_XXX flags */
7484 ){
7485 return pPager->xGet(pPager, pgno, ppPage, flags);
7486 }
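
Review note on the dispatch above: sqlite3PagerGet() simply calls through `pPager->xGet`, so which getter runs depends on where that pointer was last set. The code that sets it is not part of this section. The sketch below is therefore an assumption about how such a selection could look, with the error state checked first and memory-mapped I/O second. All `Demo*` and `demo*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoPager DemoPager;   /* hypothetical stand-in for Pager */
typedef int (*DemoGetter)(DemoPager*, unsigned pgno, void **ppPage);

enum { DEMO_NONE = 0, DEMO_NORMAL = 1, DEMO_MMAP = 2 };

struct DemoPager {
  int errCode;       /* non-zero when the pager is in the error state */
  int useMmap;       /* non-zero when memory-mapped I/O is enabled */
  int lastGetter;    /* records which getter ran, for illustration only */
  DemoGetter xGet;   /* currently selected getter */
};

static int demoGetNormal(DemoPager *p, unsigned pgno, void **ppPage){
  (void)pgno;
  p->lastGetter = DEMO_NORMAL;
  *ppPage = p;                 /* any non-NULL value stands in for a page */
  return 0;
}

static int demoGetMMap(DemoPager *p, unsigned pgno, void **ppPage){
  (void)pgno;
  p->lastGetter = DEMO_MMAP;
  *ppPage = p;
  return 0;
}

static int demoGetError(DemoPager *p, unsigned pgno, void **ppPage){
  (void)pgno;
  *ppPage = NULL;              /* no page is returned in the error state */
  return p->errCode;
}

/* Re-select the getter after any state change (assumed priority order). */
static void demoSetGetter(DemoPager *p){
  assert( p!=0 );
  if( p->errCode ){
    p->xGet = demoGetError;
  }else if( p->useMmap ){
    p->xGet = demoGetMMap;
  }else{
    p->xGet = demoGetNormal;
  }
}

/* The dispatcher itself has no branches: it just calls the pointer. */
static int demoGet(DemoPager *p, unsigned pgno, void **ppPage){
  return p->xGet(p, pgno, ppPage);
}
```

The point of the design is that the per-call path carries no state checks. The cost of choosing a getter is paid once, when the pager changes state.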
7487
7488 /*
7489 ** Acquire a page if it is already in the in-memory cache. Do
7490 ** not read the page from disk. Return a pointer to the page,
7491 ** or 0 if the page is not in cache.
7492 **
7493 ** See also sqlite3PagerGet(). The difference between this routine
7494 ** and sqlite3PagerGet() is that _get() will go to the disk and read
7495 ** in the page if the page is not already in cache. This routine
7496 ** returns NULL if the page is not in cache or if a disk I/O error
7497 ** has ever happened.
7498 */
7499 SQLITE_PRIVATE DbPage *sqlite3PagerLookup(Pager *pPager, Pgno pgno){
7500 sqlite3_pcache_page *pPage;
7501 assert( pPager!=0 );
7502 assert( pgno!=0 );
7503 assert( pPager->pPCache!=0 );
7504 pPage = sqlite3PcacheFetch(pPager->pPCache, pgno, 0);
7505 assert( pPage==0 || pPager->hasHeldSharedLock );
7506 if( pPage==0 ) return 0;
7507 return sqlite3PcacheFetchFinish(pPager->pPCache, pgno, pPage);
7508 }
7509
7510 /*
7511 ** Release a page reference.
7512 **
7513 ** If the number of references to the page drops to zero, then the
7514 ** page is added to the LRU list. When all references to all pages
7515 ** are released, a rollback occurs and the lock on the database is
7516 ** removed.
7517 */
7518 SQLITE_PRIVATE void sqlite3PagerUnrefNotNull(DbPage *pPg){
7519 Pager *pPager;
7520 assert( pPg!=0 );
7521 pPager = pPg->pPager;
7522 if( pPg->flags & PGHDR_MMAP ){
7523 pagerReleaseMapPage(pPg);
7524 }else{
7525 sqlite3PcacheRelease(pPg);
7526 }
7527 pagerUnlockIfUnused(pPager);
7528 }
7529 SQLITE_PRIVATE void sqlite3PagerUnref(DbPage *pPg){
7530 if( pPg ) sqlite3PagerUnrefNotNull(pPg);
7531 }
7532
7533 /*
7534 ** This function is called at the start of every write transaction.
7535 ** There must already be a RESERVED or EXCLUSIVE lock on the database
7536 ** file when this routine is called.
7537 **
7538 ** Open the journal file for pager pPager and write a journal header
7539 ** to the start of it. If there are active savepoints, open the sub-journal
7540 ** as well. This function is only used when the journal file is being
7541 ** opened to write a rollback log for a transaction. It is not used
7542 ** when opening a hot journal file to roll it back.
7543 **
7544 ** If the journal file is already open (as it may be in exclusive mode),
7545 ** then this function just writes a journal header to the start of the
7546 ** already open file.
7547 **
7548 ** Whether or not the journal file is opened by this function, the
7549 ** Pager.pInJournal bitvec structure is allocated.
7550 **
7551 ** Return SQLITE_OK if everything is successful. Otherwise, return
7552 ** SQLITE_NOMEM if the attempt to allocate Pager.pInJournal fails, or
7553 ** an IO error code if opening or writing the journal file fails.
7554 */
7555 static int pager_open_journal(Pager *pPager){
7556 int rc = SQLITE_OK; /* Return code */
7557 sqlite3_vfs * const pVfs = pPager->pVfs; /* Local cache of vfs pointer */
7558
7559 assert( pPager->eState==PAGER_WRITER_LOCKED );
7560 assert( assert_pager_state(pPager) );
7561 assert( pPager->pInJournal==0 );
7562
7563 /* If already in the error state, this function is a no-op. But on
7564 ** the other hand, this routine is never called if we are already in
7565 ** an error state. */
7566 if( NEVER(pPager->errCode) ) return pPager->errCode;
7567
7568 if( !pagerUseWal(pPager) && pPager->journalMode!=PAGER_JOURNALMODE_OFF ){
7569 pPager->pInJournal = sqlite3BitvecCreate(pPager->dbSize);
7570 if( pPager->pInJournal==0 ){
7571 return SQLITE_NOMEM_BKPT;
7572 }
7573
7574 /* Open the journal file if it is not already open. */
7575 if( !isOpen(pPager->jfd) ){
7576 if( pPager->journalMode==PAGER_JOURNALMODE_MEMORY ){
7577 sqlite3MemJournalOpen(pPager->jfd);
7578 }else{
7579 int flags = SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE;
7580 int nSpill;
7581
7582 if( pPager->tempFile ){
7583 flags |= (SQLITE_OPEN_DELETEONCLOSE|SQLITE_OPEN_TEMP_JOURNAL);
7584 nSpill = sqlite3Config.nStmtSpill;
7585 }else{
7586 flags |= SQLITE_OPEN_MAIN_JOURNAL;
7587 nSpill = jrnlBufferSize(pPager);
7588 }
7589
7590 /* Verify that the database still has the same name as it did when
7591 ** it was originally opened. */
7592 rc = databaseIsUnmoved(pPager);
7593 if( rc==SQLITE_OK ){
7594 rc = sqlite3JournalOpen (
7595 pVfs, pPager->zJournal, pPager->jfd, flags, nSpill
7596 );
7597 }
7598 }
7599 assert( rc!=SQLITE_OK || isOpen(pPager->jfd) );
7600 }
7601
7602
7603 /* Write the first journal header to the journal file and open
7604 ** the sub-journal if necessary.
7605 */
7606 if( rc==SQLITE_OK ){
7607 /* TODO: Check if all of these are really required. */
7608 pPager->nRec = 0;
7609 pPager->journalOff = 0;
7610 pPager->setMaster = 0;
7611 pPager->journalHdr = 0;
7612 rc = writeJournalHdr(pPager);
7613 }
7614 }
7615
7616 if( rc!=SQLITE_OK ){
7617 sqlite3BitvecDestroy(pPager->pInJournal);
7618 pPager->pInJournal = 0;
7619 }else{
7620 assert( pPager->eState==PAGER_WRITER_LOCKED );
7621 pPager->eState = PAGER_WRITER_CACHEMOD;
7622 }
7623
7624 return rc;
7625 }
7626
7627 /*
7628 ** Begin a write-transaction on the specified pager object. If a
7629 ** write-transaction has already been opened, this function is a no-op.
7630 **
7631 ** If the exFlag argument is false, then acquire at least a RESERVED
7632 ** lock on the database file. If exFlag is true, then acquire at least
7633 ** an EXCLUSIVE lock. If such a lock is already held, no locking
7634 ** functions need be called.
7635 **
7636 ** If the subjInMemory argument is non-zero, then any sub-journal opened
7637 ** within this transaction will be opened as an in-memory file. This
7638 ** has no effect if the sub-journal is already opened (as it may be when
7639 ** running in exclusive mode) or if the transaction does not require a
7640 ** sub-journal. If the subjInMemory argument is zero, then any required
7641 ** sub-journal is implemented in-memory if pPager is an in-memory database,
7642 ** or using a temporary file otherwise.
7643 */
7644 SQLITE_PRIVATE int sqlite3PagerBegin(Pager *pPager, int exFlag, int subjInMemory ){
7645 int rc = SQLITE_OK;
7646
7647 if( pPager->errCode ) return pPager->errCode;
7648 assert( pPager->eState>=PAGER_READER && pPager->eState<PAGER_ERROR );
7649 pPager->subjInMemory = (u8)subjInMemory;
7650
7651 if( ALWAYS(pPager->eState==PAGER_READER) ){
7652 assert( pPager->pInJournal==0 );
7653
7654 if( pagerUseWal(pPager) ){
7655 /* If the pager is configured to use locking_mode=exclusive, and an
7656 ** exclusive lock on the database is not already held, obtain it now.
7657 */
7658 if( pPager->exclusiveMode && sqlite3WalExclusiveMode(pPager->pWal, -1) ){
7659 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);
7660 if( rc!=SQLITE_OK ){
7661 return rc;
7662 }
7663 (void)sqlite3WalExclusiveMode(pPager->pWal, 1);
7664 }
7665
7666 /* Grab the write lock on the log file. If successful, upgrade to
7667 ** PAGER_WRITER_LOCKED state. Otherwise, return an error code to the caller.
7668 ** The busy-handler is not invoked if another connection already
7669 ** holds the write-lock. If possible, the upper layer will call it.
7670 */
7671 rc = sqlite3WalBeginWriteTransaction(pPager->pWal);
7672 }else{
7673 /* Obtain a RESERVED lock on the database file. If the exFlag parameter
7674 ** is true, then immediately upgrade this to an EXCLUSIVE lock. The
7675 ** busy-handler callback can be used when upgrading to the EXCLUSIVE
7676 ** lock, but not when obtaining the RESERVED lock.
7677 */
7678 rc = pagerLockDb(pPager, RESERVED_LOCK);
7679 if( rc==SQLITE_OK && exFlag ){
7680 rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
7681 }
7682 }
7683
7684 if( rc==SQLITE_OK ){
7685 /* Change to WRITER_LOCKED state.
7686 **
7687 ** WAL mode sets Pager.eState to PAGER_WRITER_LOCKED or CACHEMOD
7688 ** when it has an open transaction, but never to DBMOD or FINISHED.
7689 ** This is because in those states the code to roll back savepoint
7690 ** transactions may copy data from the sub-journal into the database
7691 ** file as well as into the page cache. Which would be incorrect in
7692 ** WAL mode.
7693 */
7694 pPager->eState = PAGER_WRITER_LOCKED;
7695 pPager->dbHintSize = pPager->dbSize;
7696 pPager->dbFileSize = pPager->dbSize;
7697 pPager->dbOrigSize = pPager->dbSize;
7698 pPager->journalOff = 0;
7699 }
7700
7701 assert( rc==SQLITE_OK || pPager->eState==PAGER_READER );
7702 assert( rc!=SQLITE_OK || pPager->eState==PAGER_WRITER_LOCKED );
7703 assert( assert_pager_state(pPager) );
7704 }
7705
7706 PAGERTRACE(("TRANSACTION %d\n", PAGERID(pPager)));
7707 return rc;
7708 }
7709
7710 /*
7711 ** Write page pPg onto the end of the rollback journal.
7712 */
7713 static SQLITE_NOINLINE int pagerAddPageToRollbackJournal(PgHdr *pPg){
7714 Pager *pPager = pPg->pPager;
7715 int rc;
7716 u32 cksum;
7717 char *pData2;
7718 i64 iOff = pPager->journalOff;
7719
7720 /* We should never write to the journal file the page that
7721 ** contains the database locks. The following assert verifies
7722 ** that we do not. */
7723 assert( pPg->pgno!=PAGER_MJ_PGNO(pPager) );
7724
7725 assert( pPager->journalHdr<=pPager->journalOff );
7726 CODEC2(pPager, pPg->pData, pPg->pgno, 7, return SQLITE_NOMEM_BKPT, pData2);
7727 cksum = pager_cksum(pPager, (u8*)pData2);
7728
7729 /* Even if an IO or diskfull error occurs while journalling the
7730 ** page in the block above, set the need-sync flag for the page.
7731 ** Otherwise, when the transaction is rolled back, the logic in
7732 ** playback_one_page() will think that the page needs to be restored
7733 ** in the database file. And if an IO error occurs while doing so,
7734 ** then corruption may follow.
7735 */
7736 pPg->flags |= PGHDR_NEED_SYNC;
7737
7738 rc = write32bits(pPager->jfd, iOff, pPg->pgno);
7739 if( rc!=SQLITE_OK ) return rc;
7740 rc = sqlite3OsWrite(pPager->jfd, pData2, pPager->pageSize, iOff+4);
7741 if( rc!=SQLITE_OK ) return rc;
7742 rc = write32bits(pPager->jfd, iOff+pPager->pageSize+4, cksum);
7743 if( rc!=SQLITE_OK ) return rc;
7744
7745 IOTRACE(("JOUT %p %d %lld %d\n", pPager, pPg->pgno,
7746 pPager->journalOff, pPager->pageSize));
7747 PAGER_INCR(sqlite3_pager_writej_count);
7748 PAGERTRACE(("JOURNAL %d page %d needSync=%d hash(%08x)\n",
7749 PAGERID(pPager), pPg->pgno,
7750 ((pPg->flags&PGHDR_NEED_SYNC)?1:0), pager_pagehash(pPg)));
7751
7752 pPager->journalOff += 8 + pPager->pageSize;
7753 pPager->nRec++;
7754 assert( pPager->pInJournal!=0 );
7755 rc = sqlite3BitvecSet(pPager->pInJournal, pPg->pgno);
7756 testcase( rc==SQLITE_NOMEM );
7757 assert( rc==SQLITE_OK || rc==SQLITE_NOMEM );
7758 rc |= addToSavepointBitvecs(pPager, pPg->pgno);
7759 assert( rc==SQLITE_OK || rc==SQLITE_NOMEM );
7760 return rc;
7761 }
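
Review note on the record written by pagerAddPageToRollbackJournal(): the three writes above lay down a 4-byte page number at `iOff`, the page image at `iOff+4`, and a 4-byte checksum at `iOff+pageSize+4`, after which `journalOff` advances by `8 + pageSize`. The sketch below reproduces that layout in a memory buffer. The `demo_*` names are hypothetical, big-endian byte order is assumed for the 32-bit fields, and the checksum is taken as an input rather than computed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value big-endian (the byte order assumed in this sketch). */
static void demo_put32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Append one record at offset `off` of `buf` and return the offset of the
** next record, which is off + 8 + pageSize.
**
**   off            : 4-byte page number
**   off+4          : pageSize bytes of page content
**   off+4+pageSize : 4-byte checksum
*/
static size_t demo_append_record(
  unsigned char *buf, size_t nBuf,   /* destination buffer and its size */
  size_t off,                        /* where this record starts */
  uint32_t pgno,                     /* page number to record */
  const unsigned char *data,         /* page content */
  uint32_t pageSize,                 /* size of the page content */
  uint32_t cksum                     /* checksum supplied by the caller */
){
  assert( off + 8 + (size_t)pageSize <= nBuf );
  demo_put32(&buf[off], pgno);
  memcpy(&buf[off+4], data, pageSize);
  demo_put32(&buf[off+4+pageSize], cksum);
  return off + 8 + pageSize;
}
```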
7762
7763 /*
7764 ** Mark a single data page as writeable. The page is written into the
7765 ** main journal or sub-journal as required. If the page is written into
7766 ** one of the journals, the corresponding bit is set in the
7767 ** Pager.pInJournal bitvec and the PagerSavepoint.pInSavepoint bitvecs
7768 ** of any open savepoints as appropriate.
7769 */
7770 static int pager_write(PgHdr *pPg){
7771 Pager *pPager = pPg->pPager;
7772 int rc = SQLITE_OK;
7773
7774 /* This routine is not called unless a write-transaction has already
7775 ** been started. The journal file may or may not be open at this point.
7776 ** It is never called in the ERROR state.
7777 */
7778 assert( pPager->eState==PAGER_WRITER_LOCKED
7779 || pPager->eState==PAGER_WRITER_CACHEMOD
7780 || pPager->eState==PAGER_WRITER_DBMOD
7781 );
7782 assert( assert_pager_state(pPager) );
7783 assert( pPager->errCode==0 );
7784 assert( pPager->readOnly==0 );
7785 CHECK_PAGE(pPg);
7786
7787 /* The journal file needs to be opened. Higher level routines have already
7788 ** obtained the necessary locks to begin the write-transaction, but the
7789 ** rollback journal might not yet be open. Open it now if this is the case.
7790 **
7791 ** This is done before calling sqlite3PcacheMakeDirty() on the page.
7792 ** Otherwise, if it were done after calling sqlite3PcacheMakeDirty(), then
7793 ** an error might occur and the pager would end up in WRITER_LOCKED state
7794 ** with pages marked as dirty in the cache.
7795 */
7796 if( pPager->eState==PAGER_WRITER_LOCKED ){
7797 rc = pager_open_journal(pPager);
7798 if( rc!=SQLITE_OK ) return rc;
7799 }
7800 assert( pPager->eState>=PAGER_WRITER_CACHEMOD );
7801 assert( assert_pager_state(pPager) );
7802
7803 /* Mark the page that is about to be modified as dirty. */
7804 sqlite3PcacheMakeDirty(pPg);
7805
7806 /* If a rollback journal is in use, then make sure the page that is about
7807 ** to change is in the rollback journal, or if the page is a new page off
7808 ** the end of the file, make sure it is marked as PGHDR_NEED_SYNC.
7809 */
7810 assert( (pPager->pInJournal!=0) == isOpen(pPager->jfd) );
7811 if( pPager->pInJournal!=0
7812 && sqlite3BitvecTestNotNull(pPager->pInJournal, pPg->pgno)==0
7813 ){
7814 assert( pagerUseWal(pPager)==0 );
7815 if( pPg->pgno<=pPager->dbOrigSize ){
7816 rc = pagerAddPageToRollbackJournal(pPg);
7817 if( rc!=SQLITE_OK ){
7818 return rc;
7819 }
7820 }else{
7821 if( pPager->eState!=PAGER_WRITER_DBMOD ){
7822 pPg->flags |= PGHDR_NEED_SYNC;
7823 }
7824 PAGERTRACE(("APPEND %d page %d needSync=%d\n",
7825 PAGERID(pPager), pPg->pgno,
7826 ((pPg->flags&PGHDR_NEED_SYNC)?1:0)));
7827 }
7828 }
7829
7830 /* The PGHDR_DIRTY bit is set above when the page was added to the dirty-list
7831 ** and before writing the page into the rollback journal. Wait until now,
7832 ** after the page has been successfully journalled, before setting the
7833 ** PGHDR_WRITEABLE bit that indicates that the page can be safely modified.
7834 */
7835 pPg->flags |= PGHDR_WRITEABLE;
7836
7837 /* If the statement journal is open and the page is not in it,
7838 ** then write the page into the statement journal.
7839 */
7840 if( pPager->nSavepoint>0 ){
7841 rc = subjournalPageIfRequired(pPg);
7842 }
7843
7844 /* Update the database size and return. */
7845 if( pPager->dbSize<pPg->pgno ){
7846 pPager->dbSize = pPg->pgno;
7847 }
7848 return rc;
7849 }
7850
7851 /*
7852 ** This is a variant of sqlite3PagerWrite() that runs when the sector size
7853 ** is larger than the page size. SQLite makes the (reasonable) assumption that
7854 ** all bytes of a sector are written together by hardware. Hence, all bytes of
7855 ** a sector need to be journalled in case of a power loss in the middle of
7856 ** a write.
7857 **
7858 ** Usually, the sector size is less than or equal to the page size, in which
7859 ** case pages can be individually written. This routine only runs in the
7860 ** exceptional case where the page size is smaller than the sector size.
7861 */
7862 static SQLITE_NOINLINE int pagerWriteLargeSector(PgHdr *pPg){
7863 int rc = SQLITE_OK; /* Return code */
7864 Pgno nPageCount; /* Total number of pages in database file */
7865 Pgno pg1; /* First page of the sector pPg is located on. */
7866 int nPage = 0; /* Number of pages starting at pg1 to journal */
7867 int ii; /* Loop counter */
7868 int needSync = 0; /* True if any page has PGHDR_NEED_SYNC */
7869 Pager *pPager = pPg->pPager; /* The pager that owns pPg */
7870 Pgno nPagePerSector = (pPager->sectorSize/pPager->pageSize);
7871
7872 /* Set the doNotSpill NOSYNC bit to 1. This is because we cannot allow
7873 ** a journal header to be written between the pages journaled by
7874 ** this function.
7875 */
7876 assert( !MEMDB );
7877 assert( (pPager->doNotSpill & SPILLFLAG_NOSYNC)==0 );
7878 pPager->doNotSpill |= SPILLFLAG_NOSYNC;
7879
7880 /* This trick assumes that both the page-size and sector-size are
7881 ** an integer power of 2. It sets variable pg1 to the identifier
7882 ** of the first page of the sector pPg is located on.
7883 */
7884 pg1 = ((pPg->pgno-1) & ~(nPagePerSector-1)) + 1;
7885
7886 nPageCount = pPager->dbSize;
7887 if( pPg->pgno>nPageCount ){
7888 nPage = (pPg->pgno - pg1)+1;
7889 }else if( (pg1+nPagePerSector-1)>nPageCount ){
7890 nPage = nPageCount+1-pg1;
7891 }else{
7892 nPage = nPagePerSector;
7893 }
7894 assert(nPage>0);
7895 assert(pg1<=pPg->pgno);
7896 assert((pg1+nPage)>pPg->pgno);
7897
7898 for(ii=0; ii<nPage && rc==SQLITE_OK; ii++){
7899 Pgno pg = pg1+ii;
7900 PgHdr *pPage;
7901 if( pg==pPg->pgno || !sqlite3BitvecTest(pPager->pInJournal, pg) ){
7902 if( pg!=PAGER_MJ_PGNO(pPager) ){
7903 rc = sqlite3PagerGet(pPager, pg, &pPage, 0);
7904 if( rc==SQLITE_OK ){
7905 rc = pager_write(pPage);
7906 if( pPage->flags&PGHDR_NEED_SYNC ){
7907 needSync = 1;
7908 }
7909 sqlite3PagerUnrefNotNull(pPage);
7910 }
7911 }
7912 }else if( (pPage = sqlite3PagerLookup(pPager, pg))!=0 ){
7913 if( pPage->flags&PGHDR_NEED_SYNC ){
7914 needSync = 1;
7915 }
7916 sqlite3PagerUnrefNotNull(pPage);
7917 }
7918 }
7919
7920 /* If the PGHDR_NEED_SYNC flag is set for any of the nPage pages
7921 ** starting at pg1, then it needs to be set for all of them. Because
7922 ** writing to any of these nPage pages may damage the others, the
7923 ** journal file must contain sync()ed copies of all of them
7924 ** before any of them can be written out to the database file.
7925 */
7926 if( rc==SQLITE_OK && needSync ){
7927 assert( !MEMDB );
7928 for(ii=0; ii<nPage; ii++){
7929 PgHdr *pPage = sqlite3PagerLookup(pPager, pg1+ii);
7930 if( pPage ){
7931 pPage->flags |= PGHDR_NEED_SYNC;
7932 sqlite3PagerUnrefNotNull(pPage);
7933 }
7934 }
7935 }
7936
7937 assert( (pPager->doNotSpill & SPILLFLAG_NOSYNC)!=0 );
7938 pPager->doNotSpill &= ~SPILLFLAG_NOSYNC;
7939 return rc;
7940 }
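
Review note on the sector arithmetic in pagerWriteLargeSector(): the mask trick and the three-way choice of `nPage` are easier to check in isolation. The sketch below copies both computations into standalone helpers. The names `sector_first_page` and `sector_page_count` are hypothetical, and `uint32_t` stands in for `Pgno`.

```c
#include <assert.h>
#include <stdint.h>

/* First page of the sector that holds page `pgno`. Page numbers are
** 1-based, so the mask is applied to pgno-1 and 1 is added back.
** Requires nPagePerSector to be a power of two. */
static uint32_t sector_first_page(uint32_t pgno, uint32_t nPagePerSector){
  assert( pgno>=1 );
  assert( nPagePerSector>0 && (nPagePerSector & (nPagePerSector-1))==0 );
  return ((pgno-1) & ~(nPagePerSector-1)) + 1;
}

/* Number of pages, starting at the first page of the sector, that must be
** journalled together with page `pgno`. `nPageCount` is the current size
** of the database in pages. */
static uint32_t sector_page_count(
  uint32_t pgno, uint32_t nPageCount, uint32_t nPagePerSector
){
  uint32_t pg1 = sector_first_page(pgno, nPagePerSector);
  if( pgno>nPageCount ){
    return (pgno - pg1) + 1;          /* page lies past the end of the file */
  }else if( (pg1+nPagePerSector-1)>nPageCount ){
    return nPageCount + 1 - pg1;      /* sector is cut short by end of file */
  }
  return nPagePerSector;              /* sector lies wholly inside the file */
}
```

With 4 pages per sector, pages 5 through 8 share a sector, so `sector_first_page(6, 4)` is 5. In a 6-page database only pages 5 and 6 of that sector exist, so `sector_page_count(6, 6, 4)` is 2.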
7941
7942 /*
7943 ** Mark a data page as writeable. This routine must be called before
7944 ** making changes to a page. The caller must check the return value
7945 ** of this function and be careful not to change any page data unless
7946 ** this routine returns SQLITE_OK.
7947 **
7948 ** The difference between this function and pager_write() is that this
7949 ** function also deals with the special case where 2 or more pages
7950 ** fit on a single disk sector. In this case all co-resident pages
7951 ** must have been written to the journal file before returning.
7952 **
7953 ** If an error occurs, SQLITE_NOMEM or an IO error code is returned
7954 ** as appropriate. Otherwise, SQLITE_OK.
7955 */
7956 SQLITE_PRIVATE int sqlite3PagerWrite(PgHdr *pPg){
7957 Pager *pPager = pPg->pPager;
7958 assert( (pPg->flags & PGHDR_MMAP)==0 );
7959 assert( pPager->eState>=PAGER_WRITER_LOCKED );
7960 assert( assert_pager_state(pPager) );
7961 if( (pPg->flags & PGHDR_WRITEABLE)!=0 && pPager->dbSize>=pPg->pgno ){
7962 if( pPager->nSavepoint ) return subjournalPageIfRequired(pPg);
7963 return SQLITE_OK;
7964 }else if( pPager->errCode ){
7965 return pPager->errCode;
7966 }else if( pPager->sectorSize > (u32)pPager->pageSize ){
7967 assert( pPager->tempFile==0 );
7968 return pagerWriteLargeSector(pPg);
7969 }else{
7970 return pager_write(pPg);
7971 }
7972 }
7973
7974 /*
7975 ** Return TRUE if the page given in the argument was previously passed
7976 ** to sqlite3PagerWrite(). In other words, return TRUE if it is ok
7977 ** to change the content of the page.
7978 */
7979 #ifndef NDEBUG
7980 SQLITE_PRIVATE int sqlite3PagerIswriteable(DbPage *pPg){
7981 return pPg->flags & PGHDR_WRITEABLE;
7982 }
7983 #endif
7984
7985 /*
7986 ** A call to this routine tells the pager that it is not necessary to
7987 ** write the information on page pPg back to the disk, even though
7988 ** that page might be marked as dirty. This happens, for example, when
7989 ** the page has been added as a leaf of the freelist and so its
7990 ** content no longer matters.
7991 **
7992 ** The overlying software layer calls this routine when all of the data
7993 ** on the given page is unused. The pager marks the page as clean so
7994 ** that it does not get written to disk.
7995 **
7996 ** Tests show that this optimization can quadruple the speed of large
7997 ** DELETE operations.
7998 **
7999 ** This optimization cannot be used with a temp-file, as the page may
8000 ** have been dirty at the start of the transaction. In that case, if
8001 ** memory pressure forces page pPg out of the cache, the data does need
8002 ** to be written out to disk so that it may be read back in if the
8003 ** current transaction is rolled back.
8004 */
8005 SQLITE_PRIVATE void sqlite3PagerDontWrite(PgHdr *pPg){
8006 Pager *pPager = pPg->pPager;
8007 if( !pPager->tempFile && (pPg->flags&PGHDR_DIRTY) && pPager->nSavepoint==0 ){
8008 PAGERTRACE(("DONT_WRITE page %d of %d\n", pPg->pgno, PAGERID(pPager)));
8009 IOTRACE(("CLEAN %p %d\n", pPager, pPg->pgno))
8010 pPg->flags |= PGHDR_DONT_WRITE;
8011 pPg->flags &= ~PGHDR_WRITEABLE;
8012 testcase( pPg->flags & PGHDR_NEED_SYNC );
8013 pager_set_pagehash(pPg);
8014 }
8015 }
8016
8017 /*
8018 ** This routine is called to increment the value of the database file
8019 ** change-counter, stored as a 4-byte big-endian integer starting at
8020 ** byte offset 24 of the pager file. The secondary change counter at
8021 ** 92 is also updated, as is the SQLite version number at offset 96.
8022 **
8023 ** But this only happens if the pPager->changeCountDone flag is false.
8024 ** To avoid excess churning of page 1, the update only happens once.
8025 ** See also the pager_write_changecounter() routine that does an
8026 ** unconditional update of the change counters.
8027 **
8028 ** If the isDirectMode flag is zero, then this is done by calling
8029 ** sqlite3PagerWrite() on page 1, then modifying the contents of the
8030 ** page data. In this case the file will be updated when the current
8031 ** transaction is committed.
8032 **
8033 ** The isDirectMode flag may only be non-zero if the library was compiled
8034 ** with the SQLITE_ENABLE_ATOMIC_WRITE macro defined. In this case,
8035 ** if isDirect is non-zero, then the database file is updated directly
8036 ** by writing an updated version of page 1 using a call to the
8037 ** sqlite3OsWrite() function.
8038 */
8039 static int pager_incr_changecounter(Pager *pPager, int isDirectMode){
8040 int rc = SQLITE_OK;
8041
8042 assert( pPager->eState==PAGER_WRITER_CACHEMOD
8043 || pPager->eState==PAGER_WRITER_DBMOD
8044 );
8045 assert( assert_pager_state(pPager) );
8046
8047 /* Declare and initialize constant integer 'isDirect'. If the
8048 ** atomic-write optimization is enabled in this build, then isDirect
8049 ** is initialized to the value passed as the isDirectMode parameter
8050 ** to this function. Otherwise, it is always set to zero.
8051 **
8052 ** The idea is that if the atomic-write optimization is not
8053 ** enabled at compile time, the compiler can omit the tests of
8054 ** 'isDirect' below, as well as the block enclosed in the
8055 ** "if( isDirect )" condition.
8056 */
8057 #ifndef SQLITE_ENABLE_ATOMIC_WRITE
8058 # define DIRECT_MODE 0
8059 assert( isDirectMode==0 );
8060 UNUSED_PARAMETER(isDirectMode);
8061 #else
8062 # define DIRECT_MODE isDirectMode
8063 #endif
8064
8065 if( !pPager->changeCountDone && ALWAYS(pPager->dbSize>0) ){
8066 PgHdr *pPgHdr; /* Reference to page 1 */
8067
8068 assert( !pPager->tempFile && isOpen(pPager->fd) );
8069
8070 /* Open page 1 of the file for writing. */
8071 rc = sqlite3PagerGet(pPager, 1, &pPgHdr, 0);
8072 assert( pPgHdr==0 || rc==SQLITE_OK );
8073
8074 /* If page one was fetched successfully, and this function is not
8075 ** operating in direct-mode, make page 1 writable. When not in
8076 ** direct mode, page 1 is always held in cache and hence the PagerGet()
8077 ** above is always successful - hence the ALWAYS on rc==SQLITE_OK.
8078 */
8079 if( !DIRECT_MODE && ALWAYS(rc==SQLITE_OK) ){
8080 rc = sqlite3PagerWrite(pPgHdr);
8081 }
8082
8083 if( rc==SQLITE_OK ){
8084 /* Actually do the update of the change counter */
8085 pager_write_changecounter(pPgHdr);
8086
8087 /* If running in direct mode, write the contents of page 1 to the file. */
8088 if( DIRECT_MODE ){
8089 const void *zBuf;
8090 assert( pPager->dbFileSize>0 );
8091 CODEC2(pPager, pPgHdr->pData, 1, 6, rc=SQLITE_NOMEM_BKPT, zBuf);
8092 if( rc==SQLITE_OK ){
8093 rc = sqlite3OsWrite(pPager->fd, zBuf, pPager->pageSize, 0);
8094 pPager->aStat[PAGER_STAT_WRITE]++;
8095 }
8096 if( rc==SQLITE_OK ){
8097 /* Update the pager's copy of the change-counter. Otherwise, the
8098 ** next time a read transaction is opened the cache will be
8099 ** flushed (as the change-counter values will not match). */
8100 const void *pCopy = (const void *)&((const char *)zBuf)[24];
8101 memcpy(&pPager->dbFileVers, pCopy, sizeof(pPager->dbFileVers));
8102 pPager->changeCountDone = 1;
8103 }
8104 }else{
8105 pPager->changeCountDone = 1;
8106 }
8107 }
8108
8109 /* Release the page reference. */
8110 sqlite3PagerUnref(pPgHdr);
8111 }
8112 return rc;
8113 }
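/*
** Reviewer sketch, not part of the SQLite sources: the header comment for
** pager_incr_changecounter() describes a 4-byte big-endian change counter at
** byte offset 24 of page 1, a secondary counter at offset 92 and a version
** number at offset 96. The block below illustrates that byte layout only.
** The helper names put_be32(), get_be32() and bump_change_counter() are
** hypothetical, and reading the old counter from the page buffer and
** mirroring the same value at offset 92 are assumptions of this sketch.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v at p[0..3], most significant byte first (big-endian). */
static void put_be32(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Read a big-endian 32-bit value from p[0..3]. */
static uint32_t get_be32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/*
** Increment the counter stored at offset 24 of the page-1 image aPage1,
** copy the new value to offset 92 and record versionNumber at offset 96.
** aPage1 must be at least 100 bytes long.
*/
static void bump_change_counter(unsigned char *aPage1, uint32_t versionNumber){
  uint32_t n = get_be32(&aPage1[24]) + 1;
  assert( aPage1!=0 );
  put_be32(&aPage1[24], n);
  put_be32(&aPage1[92], n);
  put_be32(&aPage1[96], versionNumber);
}
```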
8114
8115 /*
8116 ** Sync the database file to disk. This is a no-op for in-memory databases
8117 ** or pagers with the Pager.noSync flag set.
8118 **
8119 ** If successful, or if called on a pager for which it is a no-op, this
8120 ** function returns SQLITE_OK. Otherwise, an IO error code is returned.
8121 */
8122 SQLITE_PRIVATE int sqlite3PagerSync(Pager *pPager, const char *zMaster){
8123 int rc = SQLITE_OK;
8124
8125 if( isOpen(pPager->fd) ){
8126 void *pArg = (void*)zMaster;
8127 rc = sqlite3OsFileControl(pPager->fd, SQLITE_FCNTL_SYNC, pArg);
8128 if( rc==SQLITE_NOTFOUND ) rc = SQLITE_OK;
8129 }
8130 if( rc==SQLITE_OK && !pPager->noSync ){
8131 assert( !MEMDB );
8132 rc = sqlite3OsSync(pPager->fd, pPager->syncFlags);
8133 }
8134 return rc;
8135 }
8136
8137 /*
8138 ** This function may only be called while a write-transaction is active in
8139 ** rollback. If the connection is in WAL mode, this call is a no-op.
8140 ** Otherwise, if the connection does not already have an EXCLUSIVE lock on
8141 ** the database file, an attempt is made to obtain one.
8142 **
8143 ** If the EXCLUSIVE lock is already held or the attempt to obtain it is
8144 ** successful, or the connection is in WAL mode, SQLITE_OK is returned.
8145 ** Otherwise, either SQLITE_BUSY or an SQLITE_IOERR_XXX error code is
8146 ** returned.
8147 */
8148 SQLITE_PRIVATE int sqlite3PagerExclusiveLock(Pager *pPager){
8149 int rc = pPager->errCode;
8150 assert( assert_pager_state(pPager) );
8151 if( rc==SQLITE_OK ){
8152 assert( pPager->eState==PAGER_WRITER_CACHEMOD
8153 || pPager->eState==PAGER_WRITER_DBMOD
8154 || pPager->eState==PAGER_WRITER_LOCKED
8155 );
8156 assert( assert_pager_state(pPager) );
8157 if( 0==pagerUseWal(pPager) ){
8158 rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
8159 }
8160 }
8161 return rc;
8162 }
8163
8164 /*
8165 ** Sync the database file for the pager pPager. zMaster points to the name
8166 ** of a master journal file that should be written into the individual
8167 ** journal file. zMaster may be NULL, which is interpreted as no master
8168 ** journal (a single database transaction).
8169 **
8170 ** This routine ensures that:
8171 **
8172 ** * The database file change-counter is updated,
8173 ** * the journal is synced (unless the atomic-write optimization is used),
8174 ** * all dirty pages are written to the database file,
8175 ** * the database file is truncated (if required), and
8176 ** * the database file synced.
8177 **
8178 ** The only thing that remains to commit the transaction is to finalize
8179 ** (delete, truncate or zero the first part of) the journal file (or
8180 ** delete the master journal file if specified).
8181 **
8182 ** Note that if zMaster==NULL, this does not overwrite a previous value
8183 ** passed to an sqlite3PagerCommitPhaseOne() call.
8184 **
8185 ** If the final parameter - noSync - is true, then the database file itself
8186 ** is not synced. The caller must call sqlite3PagerSync() directly to
8187 ** sync the database file before calling CommitPhaseTwo() to delete the
8188 ** journal file in this case.
8189 */
8190 SQLITE_PRIVATE int sqlite3PagerCommitPhaseOne(
8191 Pager *pPager, /* Pager object */
8192 const char *zMaster, /* If not NULL, the master journal name */
8193 int noSync /* True to omit the xSync on the db file */
8194 ){
8195 int rc = SQLITE_OK; /* Return code */
8196
8197 assert( pPager->eState==PAGER_WRITER_LOCKED
8198 || pPager->eState==PAGER_WRITER_CACHEMOD
8199 || pPager->eState==PAGER_WRITER_DBMOD
8200 || pPager->eState==PAGER_ERROR
8201 );
8202 assert( assert_pager_state(pPager) );
8203
8204 /* If a prior error occurred, report that error again. */
8205 if( NEVER(pPager->errCode) ) return pPager->errCode;
8206
8207 /* Provide the ability to easily simulate an I/O error during testing */
8208 if( sqlite3FaultSim(400) ) return SQLITE_IOERR;
8209
8210 PAGERTRACE(("DATABASE SYNC: File=%s zMaster=%s nSize=%d\n",
8211 pPager->zFilename, zMaster, pPager->dbSize));
8212
8213 /* If no database changes have been made, return early. */
8214 if( pPager->eState<PAGER_WRITER_CACHEMOD ) return SQLITE_OK;
8215
8216 assert( MEMDB==0 || pPager->tempFile );
8217 assert( isOpen(pPager->fd) || pPager->tempFile );
8218 if( 0==pagerFlushOnCommit(pPager, 1) ){
8219 /* If this is an in-memory db, or no pages have been written to, or this
8220 ** function has already been called, it is mostly a no-op. However, any
8221 ** backup in progress needs to be restarted. */
8222 sqlite3BackupRestart(pPager->pBackup);
8223 }else{
8224 if( pagerUseWal(pPager) ){
8225 PgHdr *pList = sqlite3PcacheDirtyList(pPager->pPCache);
8226 PgHdr *pPageOne = 0;
8227 if( pList==0 ){
8228 /* Must have at least one page for the WAL commit flag.
8229 ** Ticket [2d1a5c67dfc2363e44f29d9bbd57f] 2011-05-18 */
8230 rc = sqlite3PagerGet(pPager, 1, &pPageOne, 0);
8231 pList = pPageOne;
8232 pList->pDirty = 0;
8233 }
8234 assert( rc==SQLITE_OK );
8235 if( ALWAYS(pList) ){
8236 rc = pagerWalFrames(pPager, pList, pPager->dbSize, 1);
8237 }
8238 sqlite3PagerUnref(pPageOne);
8239 if( rc==SQLITE_OK ){
8240 sqlite3PcacheCleanAll(pPager->pPCache);
8241 }
8242 }else{
8243 /* The following block updates the change-counter. Exactly how it
8244 ** does this depends on whether or not the atomic-update optimization
8245 ** was enabled at compile time, and if this transaction meets the
8246 ** runtime criteria to use the operation:
8247 **
8248 ** * The file-system supports the atomic-write property for
8249 ** blocks of size page-size, and
8250 ** * This commit is not part of a multi-file transaction, and
8251 ** * Exactly one page has been modified and stored in the journal file.
8252 **
8253 ** If the optimization was not enabled at compile time, then the
8254 ** pager_incr_changecounter() function is called to update the change
8255 ** counter in 'indirect-mode'. If the optimization is compiled in but
8256 ** is not applicable to this transaction, call sqlite3JournalCreate()
8257 ** to make sure the journal file has actually been created, then call
8258 ** pager_incr_changecounter() to update the change-counter in indirect
8259 ** mode.
8260 **
8261 ** Otherwise, if the optimization is both enabled and applicable,
8262 ** then call pager_incr_changecounter() to update the change-counter
8263 ** in 'direct' mode. In this case the journal file will never be
8264 ** created for this transaction.
8265 */
8266 #ifdef SQLITE_ENABLE_ATOMIC_WRITE
8267 PgHdr *pPg;
8268 assert( isOpen(pPager->jfd)
8269 || pPager->journalMode==PAGER_JOURNALMODE_OFF
8270 || pPager->journalMode==PAGER_JOURNALMODE_WAL
8271 );
8272 if( !zMaster && isOpen(pPager->jfd)
8273 && pPager->journalOff==jrnlBufferSize(pPager)
8274 && pPager->dbSize>=pPager->dbOrigSize
8275 && (0==(pPg = sqlite3PcacheDirtyList(pPager->pPCache)) || 0==pPg->pDirty)
8276 ){
8277 /* Update the db file change counter via the direct-write method. The
8278 ** following call will modify the in-memory representation of page 1
8279 ** to include the updated change counter and then write page 1
8280 ** directly to the database file. Because of the atomic-write
8281 ** property of the host file-system, this is safe.
8282 */
8283 rc = pager_incr_changecounter(pPager, 1);
8284 }else{
8285 rc = sqlite3JournalCreate(pPager->jfd);
8286 if( rc==SQLITE_OK ){
8287 rc = pager_incr_changecounter(pPager, 0);
8288 }
8289 }
8290 #else
8291 rc = pager_incr_changecounter(pPager, 0);
8292 #endif
8293 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;
8294
8295 /* Write the master journal name into the journal file. If a master
8296 ** journal file name has already been written to the journal file,
8297 ** or if zMaster is NULL (no master journal), then this call is a no-op.
8298 */
8299 rc = writeMasterJournal(pPager, zMaster);
8300 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;
8301
8302 /* Sync the journal file and write all dirty pages to the database.
8303 ** If the atomic-update optimization is being used, this sync will not
8304 ** create the journal file or perform any real IO.
8305 **
8306 ** Because the change-counter page was just modified, unless the
8307 ** atomic-update optimization is used it is almost certain that the
8308 ** journal requires a sync here. However, in locking_mode=exclusive
8309 ** on a system under memory pressure it is just possible that this is
8310 ** not the case. In this case it is likely enough that the redundant
8311 ** xSync() call will be changed to a no-op by the OS anyhow.
8312 */
8313 rc = syncJournal(pPager, 0);
8314 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;
8315
8316 rc = pager_write_pagelist(pPager,sqlite3PcacheDirtyList(pPager->pPCache));
8317 if( rc!=SQLITE_OK ){
8318 assert( rc!=SQLITE_IOERR_BLOCKED );
8319 goto commit_phase_one_exit;
8320 }
8321 sqlite3PcacheCleanAll(pPager->pPCache);
8322
8323 /* If the file on disk is smaller than the database image, use
8324 ** pager_truncate to grow the file here. This can happen if the database
8325 ** image was extended as part of the current transaction and then the
8326 ** last page in the db image moved to the free-list. In this case the
8327 ** last page is never written out to disk, leaving the database file
8328 ** undersized. Fix this now if it is the case. */
8329 if( pPager->dbSize>pPager->dbFileSize ){
8330 Pgno nNew = pPager->dbSize - (pPager->dbSize==PAGER_MJ_PGNO(pPager));
8331 assert( pPager->eState==PAGER_WRITER_DBMOD );
8332 rc = pager_truncate(pPager, nNew);
8333 if( rc!=SQLITE_OK ) goto commit_phase_one_exit;
8334 }
8335
8336 /* Finally, sync the database file. */
8337 if( !noSync ){
8338 rc = sqlite3PagerSync(pPager, zMaster);
8339 }
8340 IOTRACE(("DBSYNC %p\n", pPager))
8341 }
8342 }
8343
8344 commit_phase_one_exit:
8345 if( rc==SQLITE_OK && !pagerUseWal(pPager) ){
8346 pPager->eState = PAGER_WRITER_FINISHED;
8347 }
8348 return rc;
8349 }
8350
8351
8352 /*
8353 ** When this function is called, the database file has been completely
8354 ** updated to reflect the changes made by the current transaction and
8355 ** synced to disk. The journal file still exists in the file-system
8356 ** though, and if a failure occurs at this point it will eventually
8357 ** be used as a hot-journal and the current transaction rolled back.
8358 **
8359 ** This function finalizes the journal file, either by deleting,
8360 ** truncating or partially zeroing it, so that it cannot be used
8361 ** for hot-journal rollback. Once this is done the transaction is
8362 ** irrevocably committed.
8363 **
8364 ** If an error occurs, an IO error code is returned and the pager
8365 ** moves into the error state. Otherwise, SQLITE_OK is returned.
8366 */
8367 SQLITE_PRIVATE int sqlite3PagerCommitPhaseTwo(Pager *pPager){
8368 int rc = SQLITE_OK; /* Return code */
8369
8370 /* This routine should not be called if a prior error has occurred.
8371 ** But if (due to a coding error elsewhere in the system) it does get
8372 ** called, just return the same error code without doing anything. */
8373 if( NEVER(pPager->errCode) ) return pPager->errCode;
8374
8375 assert( pPager->eState==PAGER_WRITER_LOCKED
8376 || pPager->eState==PAGER_WRITER_FINISHED
8377 || (pagerUseWal(pPager) && pPager->eState==PAGER_WRITER_CACHEMOD)
8378 );
8379 assert( assert_pager_state(pPager) );
8380
8381 /* An optimization. If the database was not actually modified during
8382 ** this transaction, the pager is running in exclusive-mode and is
8383 ** using persistent journals, then this function is a no-op.
8384 **
8385 ** The start of the journal file currently contains a single journal
8386 ** header with the nRec field set to 0. If such a journal is used as
8387 ** a hot-journal during hot-journal rollback, 0 changes will be made
8388 ** to the database file. So there is no need to zero the journal
8389 ** header. Since the pager is in exclusive mode, there is no need
8390 ** to drop any locks either.
8391 */
8392 if( pPager->eState==PAGER_WRITER_LOCKED
8393 && pPager->exclusiveMode
8394 && pPager->journalMode==PAGER_JOURNALMODE_PERSIST
8395 ){
8396 assert( pPager->journalOff==JOURNAL_HDR_SZ(pPager) || !pPager->journalOff );
8397 pPager->eState = PAGER_READER;
8398 return SQLITE_OK;
8399 }
8400
8401 PAGERTRACE(("COMMIT %d\n", PAGERID(pPager)));
8402 pPager->iDataVersion++;
8403 rc = pager_end_transaction(pPager, pPager->setMaster, 1);
8404 return pager_error(pPager, rc);
8405 }
8406
8407 /*
8408 ** If a write transaction is open, then all changes made within the
8409 ** transaction are reverted and the current write-transaction is closed.
8410 ** The pager falls back to PAGER_READER state if successful, or PAGER_ERROR
8411 ** state if an error occurs.
8412 **
8413 ** If the pager is already in PAGER_ERROR state when this function is called,
8414 ** it returns Pager.errCode immediately. No work is performed in this case.
8415 **
8416 ** Otherwise, in rollback mode, this function performs two functions:
8417 **
8418 ** 1) It rolls back the journal file, restoring all database file and
8419 ** in-memory cache pages to the state they were in when the transaction
8420 ** was opened, and
8421 **
8422 ** 2) It finalizes the journal file, so that it is not used for hot
8423 ** rollback at any point in the future.
8424 **
8425 ** Finalization of the journal file (task 2) is only performed if the
8426 ** rollback is successful.
8427 **
8428 ** In WAL mode, all cache-entries containing data modified within the
8429 ** current transaction are either expelled from the cache or reverted to
8430 ** their pre-transaction state by re-reading data from the database or
8431 ** WAL files. The WAL transaction is then closed.
8432 */
8433 SQLITE_PRIVATE int sqlite3PagerRollback(Pager *pPager){
8434 int rc = SQLITE_OK; /* Return code */
8435 PAGERTRACE(("ROLLBACK %d\n", PAGERID(pPager)));
8436
8437 /* PagerRollback() is a no-op if called in READER or OPEN state. If
8438 ** the pager is already in the ERROR state, the rollback is not
8439 ** attempted here. Instead, the error code is returned to the caller.
8440 */
8441 assert( assert_pager_state(pPager) );
8442 if( pPager->eState==PAGER_ERROR ) return pPager->errCode;
8443 if( pPager->eState<=PAGER_READER ) return SQLITE_OK;
8444
8445 if( pagerUseWal(pPager) ){
8446 int rc2;
8447 rc = sqlite3PagerSavepoint(pPager, SAVEPOINT_ROLLBACK, -1);
8448 rc2 = pager_end_transaction(pPager, pPager->setMaster, 0);
8449 if( rc==SQLITE_OK ) rc = rc2;
8450 }else if( !isOpen(pPager->jfd) || pPager->eState==PAGER_WRITER_LOCKED ){
8451 int eState = pPager->eState;
8452 rc = pager_end_transaction(pPager, 0, 0);
8453 if( !MEMDB && eState>PAGER_WRITER_LOCKED ){
8454 /* This can happen using journal_mode=off. Move the pager to the error
8455 ** state to indicate that the contents of the cache may not be trusted.
8456 ** Any active readers will get SQLITE_ABORT.
8457 */
8458 pPager->errCode = SQLITE_ABORT;
8459 pPager->eState = PAGER_ERROR;
8460 setGetterMethod(pPager);
8461 return rc;
8462 }
8463 }else{
8464 rc = pager_playback(pPager, 0);
8465 }
8466
8467 assert( pPager->eState==PAGER_READER || rc!=SQLITE_OK );
8468 assert( rc==SQLITE_OK || rc==SQLITE_FULL || rc==SQLITE_CORRUPT
8469 || rc==SQLITE_NOMEM || (rc&0xFF)==SQLITE_IOERR
8470 || rc==SQLITE_CANTOPEN
8471 );
8472
8473 /* If an error occurs during a ROLLBACK, we can no longer trust the pager
8474 ** cache. So call pager_error() on the way out to make any error persistent.
8475 */
8476 return pager_error(pPager, rc);
8477 }
8478
8479 /*
8480 ** Return TRUE if the database file is opened read-only. Return FALSE
8481 ** if the database is (in theory) writable.
8482 */
8483 SQLITE_PRIVATE u8 sqlite3PagerIsreadonly(Pager *pPager){
8484 return pPager->readOnly;
8485 }
8486
8487 #ifdef SQLITE_DEBUG
8488 /*
8489 ** Return the sum of the reference counts for all pages held by pPager.
8490 */
8491 SQLITE_PRIVATE int sqlite3PagerRefcount(Pager *pPager){
8492 return sqlite3PcacheRefCount(pPager->pPCache);
8493 }
8494 #endif
8495
8496 /*
8497 ** Return the approximate number of bytes of memory currently
8498 ** used by the pager and its associated cache.
8499 */
8500 SQLITE_PRIVATE int sqlite3PagerMemUsed(Pager *pPager){
8501 int perPageSize = pPager->pageSize + pPager->nExtra + sizeof(PgHdr)
8502 + 5*sizeof(void*);
8503 return perPageSize*sqlite3PcachePagecount(pPager->pPCache)
8504 + sqlite3MallocSize(pPager)
8505 + pPager->pageSize;
8506 }
8507
8508 /*
8509 ** Return the number of references to the specified page.
8510 */
8511 SQLITE_PRIVATE int sqlite3PagerPageRefcount(DbPage *pPage){
8512 return sqlite3PcachePageRefcount(pPage);
8513 }
8514
8515 #ifdef SQLITE_TEST
8516 /*
8517 ** This routine is used for testing and analysis only.
8518 */
8519 SQLITE_PRIVATE int *sqlite3PagerStats(Pager *pPager){
8520 static int a[11];
8521 a[0] = sqlite3PcacheRefCount(pPager->pPCache);
8522 a[1] = sqlite3PcachePagecount(pPager->pPCache);
8523 a[2] = sqlite3PcacheGetCachesize(pPager->pPCache);
8524 a[3] = pPager->eState==PAGER_OPEN ? -1 : (int) pPager->dbSize;
8525 a[4] = pPager->eState;
8526 a[5] = pPager->errCode;
8527 a[6] = pPager->aStat[PAGER_STAT_HIT];
8528 a[7] = pPager->aStat[PAGER_STAT_MISS];
8529 a[8] = 0; /* Used to be pPager->nOvfl */
8530 a[9] = pPager->nRead;
8531 a[10] = pPager->aStat[PAGER_STAT_WRITE];
8532 return a;
8533 }
8534 #endif
8535
8536 /*
8537 ** Parameter eStat must be one of SQLITE_DBSTATUS_CACHE_HIT,
8538 ** SQLITE_DBSTATUS_CACHE_MISS or SQLITE_DBSTATUS_CACHE_WRITE. Before returning,
8539 ** *pnVal is incremented by the current cache hit, miss or write count,
8540 ** according to the value of eStat. If the reset parameter is non-zero, that
8541 ** count is zeroed before returning.
8542 */
8543 SQLITE_PRIVATE void sqlite3PagerCacheStat(Pager *pPager, int eStat, int reset, int *pnVal){
8544
8545 assert( eStat==SQLITE_DBSTATUS_CACHE_HIT
8546 || eStat==SQLITE_DBSTATUS_CACHE_MISS
8547 || eStat==SQLITE_DBSTATUS_CACHE_WRITE
8548 );
8549
8550 assert( SQLITE_DBSTATUS_CACHE_HIT+1==SQLITE_DBSTATUS_CACHE_MISS );
8551 assert( SQLITE_DBSTATUS_CACHE_HIT+2==SQLITE_DBSTATUS_CACHE_WRITE );
8552 assert( PAGER_STAT_HIT==0 && PAGER_STAT_MISS==1 && PAGER_STAT_WRITE==2 );
8553
8554 *pnVal += pPager->aStat[eStat - SQLITE_DBSTATUS_CACHE_HIT];
8555 if( reset ){
8556 pPager->aStat[eStat - SQLITE_DBSTATUS_CACHE_HIT] = 0;
8557 }
8558 }
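/*
** Reviewer sketch, not part of the SQLite sources: sqlite3PagerCacheStat()
** relies on the three status codes being consecutive, so that subtracting
** the first one yields an index into the aStat[] array. The block below
** shows the same offset-indexing idea in isolation. DemoPager,
** demo_cache_stat() and the STATUS_ and STAT_ constants are hypothetical
** stand-ins, and the numeric values chosen here are assumptions.
*/

```c
#include <assert.h>

/* Externally visible status codes; only their being consecutive matters. */
enum { STATUS_HIT = 7, STATUS_MISS, STATUS_WRITE };

/* Internal array slots, in the same order as the status codes. */
enum { STAT_HIT = 0, STAT_MISS, STAT_WRITE, STAT_COUNT };

typedef struct DemoPager {
  int aStat[STAT_COUNT];   /* Hit, miss and write counters */
} DemoPager;

/*
** Add the counter selected by eStat to *pnVal. If reset is non-zero,
** zero that counter afterwards.
*/
static void demo_cache_stat(DemoPager *p, int eStat, int reset, int *pnVal){
  int i = eStat - STATUS_HIT;          /* Map status code to array slot */
  assert( i>=0 && i<STAT_COUNT );
  *pnVal += p->aStat[i];
  if( reset ){
    p->aStat[i] = 0;
  }
}
```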
8559
8560 /*
8561 ** Return true if this is an in-memory or temp-file backed pager.
8562 */
8563 SQLITE_PRIVATE int sqlite3PagerIsMemdb(Pager *pPager){
8564 return pPager->tempFile;
8565 }
8566
8567 /*
8568 ** Check that there are at least nSavepoint savepoints open. If there are
8569 ** currently fewer than nSavepoint open, then open one or more savepoints
8570 ** to make up the difference. If the number of savepoints is already
8571 ** equal to nSavepoint, then this function is a no-op.
8572 **
8573 ** If a memory allocation fails, SQLITE_NOMEM is returned. If an error
8574 ** occurs while opening the sub-journal file, then an IO error code is
8575 ** returned. Otherwise, SQLITE_OK.
8576 */
8577 static SQLITE_NOINLINE int pagerOpenSavepoint(Pager *pPager, int nSavepoint){
8578 int rc = SQLITE_OK; /* Return code */
8579 int nCurrent = pPager->nSavepoint; /* Current number of savepoints */
8580 int ii; /* Iterator variable */
8581 PagerSavepoint *aNew; /* New Pager.aSavepoint array */
8582
8583 assert( pPager->eState>=PAGER_WRITER_LOCKED );
8584 assert( assert_pager_state(pPager) );
8585 assert( nSavepoint>nCurrent && pPager->useJournal );
8586
8587 /* Grow the Pager.aSavepoint array using realloc(). Return SQLITE_NOMEM
8588 ** if the allocation fails. Otherwise, zero the new portion in case a
8589 ** malloc failure occurs while populating it in the for(...) loop below.
8590 */
8591 aNew = (PagerSavepoint *)sqlite3Realloc(
8592 pPager->aSavepoint, sizeof(PagerSavepoint)*nSavepoint
8593 );
8594 if( !aNew ){
8595 return SQLITE_NOMEM_BKPT;
8596 }
8597 memset(&aNew[nCurrent], 0, (nSavepoint-nCurrent) * sizeof(PagerSavepoint));
8598 pPager->aSavepoint = aNew;
8599
8600 /* Populate the PagerSavepoint structures just allocated. */
8601 for(ii=nCurrent; ii<nSavepoint; ii++){
8602 aNew[ii].nOrig = pPager->dbSize;
8603 if( isOpen(pPager->jfd) && pPager->journalOff>0 ){
8604 aNew[ii].iOffset = pPager->journalOff;
8605 }else{
8606 aNew[ii].iOffset = JOURNAL_HDR_SZ(pPager);
8607 }
8608 aNew[ii].iSubRec = pPager->nSubRec;
8609 aNew[ii].pInSavepoint = sqlite3BitvecCreate(pPager->dbSize);
8610 if( !aNew[ii].pInSavepoint ){
8611 return SQLITE_NOMEM_BKPT;
8612 }
8613 if( pagerUseWal(pPager) ){
8614 sqlite3WalSavepoint(pPager->pWal, aNew[ii].aWalData);
8615 }
8616 pPager->nSavepoint = ii+1;
8617 }
8618 assert( pPager->nSavepoint==nSavepoint );
8619 assertTruncateConstraint(pPager);
8620 return rc;
8621 }
8622 SQLITE_PRIVATE int sqlite3PagerOpenSavepoint(Pager *pPager, int nSavepoint){
8623 assert( pPager->eState>=PAGER_WRITER_LOCKED );
8624 assert( assert_pager_state(pPager) );
8625
8626 if( nSavepoint>pPager->nSavepoint && pPager->useJournal ){
8627 return pagerOpenSavepoint(pPager, nSavepoint);
8628 }else{
8629 return SQLITE_OK;
8630 }
8631 }
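/*
** Reviewer sketch, not part of the SQLite sources: pagerOpenSavepoint()
** grows the savepoint array with a realloc and zeroes only the newly added
** tail, so that a later failure part-way through populating it leaves the
** unpopulated entries in a known state. The block below shows that
** grow-then-zero pattern using the C library realloc(). DemoSavepoint and
** grow_savepoints() are hypothetical names, and the -1 error return is an
** assumption of this sketch rather than SQLite's error code.
*/

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoSavepoint {
  int nOrig;        /* Database size when the savepoint was opened */
  int iSubRec;      /* Index of first sub-journal record */
} DemoSavepoint;

/*
** Grow *paSp from nCurrent to nWanted entries and zero the new entries.
** Return 0 on success. Return -1 if the allocation fails, in which case
** *paSp and its existing contents are left untouched.
*/
static int grow_savepoints(DemoSavepoint **paSp, int nCurrent, int nWanted){
  DemoSavepoint *aNew;
  assert( nCurrent>=0 && nWanted>nCurrent );
  aNew = (DemoSavepoint*)realloc(*paSp, sizeof(DemoSavepoint)*(size_t)nWanted);
  if( aNew==0 ) return -1;
  memset(&aNew[nCurrent], 0,
         sizeof(DemoSavepoint)*(size_t)(nWanted-nCurrent));
  *paSp = aNew;
  return 0;
}
```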
8632
8633
8634 /*
8635 ** This function is called to rollback or release (commit) a savepoint.
8636 ** The savepoint to release or rollback need not be the most recently
8637 ** created savepoint.
8638 **
8639 ** Parameter op is always either SAVEPOINT_ROLLBACK or SAVEPOINT_RELEASE.
8640 ** If it is SAVEPOINT_RELEASE, then release and destroy the savepoint with
8641 ** index iSavepoint. If it is SAVEPOINT_ROLLBACK, then rollback all changes
8642 ** that have occurred since the specified savepoint was created.
8643 **
8644 ** The savepoint to rollback or release is identified by parameter
8645 ** iSavepoint. A value of 0 means to operate on the outermost savepoint
8646 ** (the first created). A value of (Pager.nSavepoint-1) means operate
8647 ** on the most recently created savepoint. If iSavepoint is greater than
8648 ** (Pager.nSavepoint-1), then this function is a no-op.
8649 **
8650 ** If a negative value is passed to this function, then the current
8651 ** transaction is rolled back. This is different to calling
8652 ** sqlite3PagerRollback() because this function does not terminate
8653 ** the transaction or unlock the database, it just restores the
8654 ** contents of the database to its original state.
8655 **
8656 ** In any case, all savepoints with an index greater than iSavepoint
8657 ** are destroyed. If this is a release operation (op==SAVEPOINT_RELEASE),
8658 ** then savepoint iSavepoint is also destroyed.
8659 **
8660 ** This function may return SQLITE_NOMEM if a memory allocation fails,
8661 ** or an IO error code if an IO error occurs while rolling back a
8662 ** savepoint. If no errors occur, SQLITE_OK is returned.
8663 */
8664 SQLITE_PRIVATE int sqlite3PagerSavepoint(Pager *pPager, int op, int iSavepoint){
8665 int rc = pPager->errCode;
8666
8667 #ifdef SQLITE_ENABLE_ZIPVFS
8668 if( op==SAVEPOINT_RELEASE ) rc = SQLITE_OK;
8669 #endif
8670
8671 assert( op==SAVEPOINT_RELEASE || op==SAVEPOINT_ROLLBACK );
8672 assert( iSavepoint>=0 || op==SAVEPOINT_ROLLBACK );
8673
8674 if( rc==SQLITE_OK && iSavepoint<pPager->nSavepoint ){
8675 int ii; /* Iterator variable */
8676 int nNew; /* Number of remaining savepoints after this op. */
8677
8678 /* Figure out how many savepoints will still be active after this
8679 ** operation. Store this value in nNew. Then free resources associated
8680 ** with any savepoints that are destroyed by this operation.
8681 */
8682 nNew = iSavepoint + (( op==SAVEPOINT_RELEASE ) ? 0 : 1);
8683 for(ii=nNew; ii<pPager->nSavepoint; ii++){
8684 sqlite3BitvecDestroy(pPager->aSavepoint[ii].pInSavepoint);
8685 }
8686 pPager->nSavepoint = nNew;
8687
8688 /* If this is a release of the outermost savepoint, truncate
8689 ** the sub-journal to zero bytes in size. */
8690 if( op==SAVEPOINT_RELEASE ){
8691 if( nNew==0 && isOpen(pPager->sjfd) ){
8692 /* Only truncate if it is an in-memory sub-journal. */
8693 if( sqlite3JournalIsInMemory(pPager->sjfd) ){
8694 rc = sqlite3OsTruncate(pPager->sjfd, 0);
8695 assert( rc==SQLITE_OK );
8696 }
8697 pPager->nSubRec = 0;
8698 }
8699 }
8700 /* Else this is a rollback operation, playback the specified savepoint.
8701 ** If this is a temp-file, it is possible that the journal file has
8702 ** not yet been opened. In this case there have been no changes to
8703 ** the database file, so the playback operation can be skipped.
8704 */
8705 else if( pagerUseWal(pPager) || isOpen(pPager->jfd) ){
8706 PagerSavepoint *pSavepoint = (nNew==0)?0:&pPager->aSavepoint[nNew-1];
8707 rc = pagerPlaybackSavepoint(pPager, pSavepoint);
8708 assert(rc!=SQLITE_DONE);
8709 }
8710
8711 #ifdef SQLITE_ENABLE_ZIPVFS
8712 /* If the cache has been modified but the savepoint cannot be rolled
8713 ** back because journal_mode=off, put the pager in the error state. This way,
8714 ** if the VFS used by this pager includes ZipVFS, the entire transaction
8715 ** can be rolled back at the ZipVFS level. */
8716 else if(
8717 pPager->journalMode==PAGER_JOURNALMODE_OFF
8718 && pPager->eState>=PAGER_WRITER_CACHEMOD
8719 ){
8720 pPager->errCode = SQLITE_ABORT;
8721 pPager->eState = PAGER_ERROR;
8722 setGetterMethod(pPager);
8723 }
8724 #endif
8725 }
8726
8727 return rc;
8728 }
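/*
** Reviewer sketch, not part of the SQLite sources: the header comment for
** sqlite3PagerSavepoint() says that every savepoint with an index greater
** than iSavepoint is destroyed, and that a release also destroys savepoint
** iSavepoint itself. The block below isolates that arithmetic.
** savepoints_remaining() and the OP_ constants are hypothetical names.
*/

```c
#include <assert.h>

enum { OP_RELEASE = 1, OP_ROLLBACK = 2 };   /* Stand-ins for the op codes */

/*
** Return the number of savepoints still open after applying op to
** savepoint iSavepoint when nOpen savepoints are currently open.
** An iSavepoint at or beyond nOpen is a no-op. A negative iSavepoint
** (rollback only) leaves no savepoints open.
*/
static int savepoints_remaining(int nOpen, int op, int iSavepoint){
  assert( op==OP_RELEASE || op==OP_ROLLBACK );
  assert( iSavepoint>=0 || op==OP_ROLLBACK );
  if( iSavepoint>=nOpen ) return nOpen;
  return iSavepoint + (op==OP_RELEASE ? 0 : 1);
}
```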
8729
8730 /*
8731 ** Return the full pathname of the database file.
8732 **
8733 ** Except, if the pager is in-memory only, then return an empty string if
8734 ** nullIfMemDb is true. This routine is called with nullIfMemDb==1 when
8735 ** used to report the filename to the user, for compatibility with legacy
8736 ** behavior. But when the Btree needs to know the filename for matching to
8737 ** shared cache, it uses nullIfMemDb==0 so that in-memory databases can
8738 ** participate in shared-cache.
8739 */
8740 SQLITE_PRIVATE const char *sqlite3PagerFilename(Pager *pPager, int nullIfMemDb){
8741 return (nullIfMemDb && pPager->memDb) ? "" : pPager->zFilename;
8742 }
8743
8744 /*
8745 ** Return the VFS structure for the pager.
8746 */
8747 SQLITE_PRIVATE sqlite3_vfs *sqlite3PagerVfs(Pager *pPager){
8748 return pPager->pVfs;
8749 }
8750
8751 /*
8752 ** Return the file handle for the database file associated
8753 ** with the pager. This might return NULL if the file has
8754 ** not yet been opened.
8755 */
8756 SQLITE_PRIVATE sqlite3_file *sqlite3PagerFile(Pager *pPager){
8757 return pPager->fd;
8758 }
8759
8760 /*
8761 ** Return the file handle for the journal file (if it exists).
8762 ** This will be either the rollback journal or the WAL file.
8763 */
8764 SQLITE_PRIVATE sqlite3_file *sqlite3PagerJrnlFile(Pager *pPager){
8765 #if SQLITE_OMIT_WAL
8766 return pPager->jfd;
8767 #else
8768 return pPager->pWal ? sqlite3WalFile(pPager->pWal) : pPager->jfd;
8769 #endif
8770 }
8771
8772 /*
8773 ** Return the full pathname of the journal file.
8774 */
8775 SQLITE_PRIVATE const char *sqlite3PagerJournalname(Pager *pPager){
8776 return pPager->zJournal;
8777 }
8778
8779 #ifdef SQLITE_HAS_CODEC
8780 /*
8781 ** Set or retrieve the codec for this pager
8782 */
8783 SQLITE_PRIVATE void sqlite3PagerSetCodec(
8784 Pager *pPager,
8785 void *(*xCodec)(void*,void*,Pgno,int),
8786 void (*xCodecSizeChng)(void*,int,int),
8787 void (*xCodecFree)(void*),
8788 void *pCodec
8789 ){
8790 if( pPager->xCodecFree ) pPager->xCodecFree(pPager->pCodec);
8791 pPager->xCodec = pPager->memDb ? 0 : xCodec;
8792 pPager->xCodecSizeChng = xCodecSizeChng;
8793 pPager->xCodecFree = xCodecFree;
8794 pPager->pCodec = pCodec;
8795 setGetterMethod(pPager);
8796 pagerReportSize(pPager);
8797 }
8798 SQLITE_PRIVATE void *sqlite3PagerGetCodec(Pager *pPager){
8799 return pPager->pCodec;
8800 }
8801
8802 /*
8803 ** This function is called by the wal module when writing page content
8804 ** into the log file.
8805 **
8806 ** This function returns a pointer to a buffer containing the encrypted
8807 ** page content. If a malloc fails, this function may return NULL.
8808 */
8809 SQLITE_PRIVATE void *sqlite3PagerCodec(PgHdr *pPg){
8810 void *aData = 0;
8811 CODEC2(pPg->pPager, pPg->pData, pPg->pgno, 6, return 0, aData);
8812 return aData;
8813 }
8814
8815 /*
8816 ** Return the current pager state
8817 */
8818 SQLITE_PRIVATE int sqlite3PagerState(Pager *pPager){
8819 return pPager->eState;
8820 }
8821 #endif /* SQLITE_HAS_CODEC */
8822
8823 #ifndef SQLITE_OMIT_AUTOVACUUM
8824 /*
8825 ** Move the page pPg to location pgno in the file.
8826 **
8827 ** There must be no references to the page previously located at
8828 ** pgno (which we call pPgOld) though that page is allowed to be
8829 ** in cache. If the page previously located at pgno is not already
8830 ** in the rollback journal, it is not put there by this routine.
8831 **
8832 ** References to the page pPg remain valid. Updating any
8833 ** meta-data associated with pPg (i.e. data stored in the nExtra bytes
8834 ** allocated along with the page) is the responsibility of the caller.
8835 **
8836 ** A transaction must be active when this routine is called. It used to be
8837 ** required that a statement transaction was not active, but this restriction
8838 ** has been removed (CREATE INDEX needs to move a page when a statement
8839 ** transaction is active).
8840 **
8841 ** If the fourth argument, isCommit, is non-zero, then this page is being
8842 ** moved as part of a database reorganization just before the transaction
8843 ** is being committed. In this case, it is guaranteed that the database page
8844 ** pPg refers to will not be written to again within this transaction.
8845 **
8846 ** This function may return SQLITE_NOMEM or an IO error code if an error
8847 ** occurs. Otherwise, it returns SQLITE_OK.
8848 */
8849 SQLITE_PRIVATE int sqlite3PagerMovepage(Pager *pPager, DbPage *pPg, Pgno pgno, int isCommit){
8850 PgHdr *pPgOld; /* The page being overwritten. */
8851 Pgno needSyncPgno = 0; /* Old value of pPg->pgno, if sync is required */
8852 int rc; /* Return code */
8853 Pgno origPgno; /* The original page number */
8854
8855 assert( pPg->nRef>0 );
8856 assert( pPager->eState==PAGER_WRITER_CACHEMOD
8857 || pPager->eState==PAGER_WRITER_DBMOD
8858 );
8859 assert( assert_pager_state(pPager) );
8860
8861 /* In order to be able to rollback, an in-memory database must journal
8862 ** the page we are moving from.
8863 */
8864 assert( pPager->tempFile || !MEMDB );
8865 if( pPager->tempFile ){
8866 rc = sqlite3PagerWrite(pPg);
8867 if( rc ) return rc;
8868 }
8869
8870 /* If the page being moved is dirty and has not been saved by the latest
8871 ** savepoint, then save the current contents of the page into the
8872 ** sub-journal now. This is required to handle the following scenario:
8873 **
8874 ** BEGIN;
8875 ** <journal page X, then modify it in memory>
8876 ** SAVEPOINT one;
8877 ** <Move page X to location Y>
8878 ** ROLLBACK TO one;
8879 **
8880 ** If page X were not written to the sub-journal here, it would not
8881 ** be possible to restore its contents when the "ROLLBACK TO one"
8882 ** statement is processed.
8883 **
8884 ** subjournalPage() may need to allocate space to store pPg->pgno into
8885 ** one or more savepoint bitvecs. This is the reason this function
8886 ** may return SQLITE_NOMEM.
8887 */
8888 if( (pPg->flags & PGHDR_DIRTY)!=0
8889 && SQLITE_OK!=(rc = subjournalPageIfRequired(pPg))
8890 ){
8891 return rc;
8892 }
8893
8894 PAGERTRACE(("MOVE %d page %d (needSync=%d) moves to %d\n",
8895 PAGERID(pPager), pPg->pgno, (pPg->flags&PGHDR_NEED_SYNC)?1:0, pgno));
8896 IOTRACE(("MOVE %p %d %d\n", pPager, pPg->pgno, pgno))
8897
8898 /* If the journal needs to be sync()ed before page pPg->pgno can
8899 ** be written to, store pPg->pgno in local variable needSyncPgno.
8900 **
8901 ** If the isCommit flag is set, there is no need to remember that
8902 ** the journal needs to be sync()ed before database page pPg->pgno
8903 ** can be written to. The caller has already promised not to write to it.
8904 */
8905 if( (pPg->flags&PGHDR_NEED_SYNC) && !isCommit ){
8906 needSyncPgno = pPg->pgno;
8907 assert( pPager->journalMode==PAGER_JOURNALMODE_OFF ||
8908 pageInJournal(pPager, pPg) || pPg->pgno>pPager->dbOrigSize );
8909 assert( pPg->flags&PGHDR_DIRTY );
8910 }
8911
8912 /* If the cache contains a page with page-number pgno, remove it
8913 ** from its hash chain. Also, if the PGHDR_NEED_SYNC flag was set for
8914 ** page pgno before the 'move' operation, it needs to be retained
8915 ** for the page moved there.
8916 */
8917 pPg->flags &= ~PGHDR_NEED_SYNC;
8918 pPgOld = sqlite3PagerLookup(pPager, pgno);
8919 assert( !pPgOld || pPgOld->nRef==1 );
8920 if( pPgOld ){
8921 pPg->flags |= (pPgOld->flags&PGHDR_NEED_SYNC);
8922 if( pPager->tempFile ){
8923 /* Do not discard pages from an in-memory database since we might
8924 ** need to rollback later. Just move the page out of the way. */
8925 sqlite3PcacheMove(pPgOld, pPager->dbSize+1);
8926 }else{
8927 sqlite3PcacheDrop(pPgOld);
8928 }
8929 }
8930
8931 origPgno = pPg->pgno;
8932 sqlite3PcacheMove(pPg, pgno);
8933 sqlite3PcacheMakeDirty(pPg);
8934
8935 /* For an in-memory database, make sure the original page continues
8936 ** to exist, in case the transaction needs to roll back. Use pPgOld
8937 ** as the original page since it has already been allocated.
8938 */
8939 if( pPager->tempFile && pPgOld ){
8940 sqlite3PcacheMove(pPgOld, origPgno);
8941 sqlite3PagerUnrefNotNull(pPgOld);
8942 }
8943
8944 if( needSyncPgno ){
8945 /* If needSyncPgno is non-zero, then the journal file needs to be
8946 ** sync()ed before any data is written to database file page needSyncPgno.
8947 ** Currently, no such page exists in the page-cache and the
8948 ** "is journaled" bitvec flag has been set. This needs to be remedied by
8949 ** loading the page into the pager-cache and setting the PGHDR_NEED_SYNC
8950 ** flag.
8951 **
8952 ** If the attempt to load the page into the page-cache fails, (due
8953 ** to a malloc() or IO failure), clear the bit in the pInJournal[]
8954 ** array. Otherwise, if the page is loaded and written again in
8955 ** this transaction, it may be written to the database file before
8956 ** it is synced into the journal file. This way, it may end up in
8957 ** the journal file twice, but that is not a problem.
8958 */
8959 PgHdr *pPgHdr;
8960 rc = sqlite3PagerGet(pPager, needSyncPgno, &pPgHdr, 0);
8961 if( rc!=SQLITE_OK ){
8962 if( needSyncPgno<=pPager->dbOrigSize ){
8963 assert( pPager->pTmpSpace!=0 );
8964 sqlite3BitvecClear(pPager->pInJournal, needSyncPgno, pPager->pTmpSpace);
8965 }
8966 return rc;
8967 }
8968 pPgHdr->flags |= PGHDR_NEED_SYNC;
8969 sqlite3PcacheMakeDirty(pPgHdr);
8970 sqlite3PagerUnrefNotNull(pPgHdr);
8971 }
8972
8973 return SQLITE_OK;
8974 }
8975 #endif
8976
8977 /*
8978 ** The page handle passed as the first argument refers to a dirty page
8979 ** with a page number other than iNew. This function changes the page's
8980 ** page number to iNew and sets the value of the PgHdr.flags field to
8981 ** the value passed as the third parameter.
8982 */
8983 SQLITE_PRIVATE void sqlite3PagerRekey(DbPage *pPg, Pgno iNew, u16 flags){
8984 assert( pPg->pgno!=iNew );
8985 pPg->flags = flags;
8986 sqlite3PcacheMove(pPg, iNew);
8987 }
8988
8989 /*
8990 ** Return a pointer to the data for the specified page.
8991 */
8992 SQLITE_PRIVATE void *sqlite3PagerGetData(DbPage *pPg){
8993 assert( pPg->nRef>0 || pPg->pPager->memDb );
8994 return pPg->pData;
8995 }
8996
8997 /*
8998 ** Return a pointer to the Pager.nExtra bytes of "extra" space
8999 ** allocated along with the specified page.
9000 */
9001 SQLITE_PRIVATE void *sqlite3PagerGetExtra(DbPage *pPg){
9002 return pPg->pExtra;
9003 }
9004
9005 /*
9006 ** Get/set the locking-mode for this pager. Parameter eMode must be one
9007 ** of PAGER_LOCKINGMODE_QUERY, PAGER_LOCKINGMODE_NORMAL or
9008 ** PAGER_LOCKINGMODE_EXCLUSIVE. If the parameter is not _QUERY, then
9009 ** the locking-mode is set to the value specified.
9010 **
9011 ** The returned value is either PAGER_LOCKINGMODE_NORMAL or
9012 ** PAGER_LOCKINGMODE_EXCLUSIVE, indicating the current (possibly updated)
9013 ** locking-mode.
9014 */
9015 SQLITE_PRIVATE int sqlite3PagerLockingMode(Pager *pPager, int eMode){
9016 assert( eMode==PAGER_LOCKINGMODE_QUERY
9017 || eMode==PAGER_LOCKINGMODE_NORMAL
9018 || eMode==PAGER_LOCKINGMODE_EXCLUSIVE );
9019 assert( PAGER_LOCKINGMODE_QUERY<0 );
9020 assert( PAGER_LOCKINGMODE_NORMAL>=0 && PAGER_LOCKINGMODE_EXCLUSIVE>=0 );
9021 assert( pPager->exclusiveMode || 0==sqlite3WalHeapMemory(pPager->pWal) );
9022 if( eMode>=0 && !pPager->tempFile && !sqlite3WalHeapMemory(pPager->pWal) ){
9023 pPager->exclusiveMode = (u8)eMode;
9024 }
9025 return (int)pPager->exclusiveMode;
9026 }
9027
9028 /*
9029 ** Set the journal-mode for this pager. Parameter eMode must be one of:
9030 **
9031 ** PAGER_JOURNALMODE_DELETE
9032 ** PAGER_JOURNALMODE_TRUNCATE
9033 ** PAGER_JOURNALMODE_PERSIST
9034 ** PAGER_JOURNALMODE_OFF
9035 ** PAGER_JOURNALMODE_MEMORY
9036 ** PAGER_JOURNALMODE_WAL
9037 **
9038 ** The journalmode is set to the value specified if the change is allowed.
9039 ** The change may be disallowed for the following reasons:
9040 **
9041 ** * An in-memory database can only have its journal_mode set to _OFF
9042 ** or _MEMORY.
9043 **
9044 ** * Temporary databases cannot have _WAL journalmode.
9045 **
9046 ** The returned value indicates the current (possibly updated) journal-mode.
9047 */
9048 SQLITE_PRIVATE int sqlite3PagerSetJournalMode(Pager *pPager, int eMode){
9049 u8 eOld = pPager->journalMode; /* Prior journalmode */
9050
9051 #ifdef SQLITE_DEBUG
9052 /* The print_pager_state() routine is intended to be used by the debugger
9053 ** only. We invoke it once here to suppress a compiler warning. */
9054 print_pager_state(pPager);
9055 #endif
9056
9057
9058 /* The eMode parameter is always valid */
9059 assert( eMode==PAGER_JOURNALMODE_DELETE
9060 || eMode==PAGER_JOURNALMODE_TRUNCATE
9061 || eMode==PAGER_JOURNALMODE_PERSIST
9062 || eMode==PAGER_JOURNALMODE_OFF
9063 || eMode==PAGER_JOURNALMODE_WAL
9064 || eMode==PAGER_JOURNALMODE_MEMORY );
9065
9066 /* This routine is only called from the OP_JournalMode opcode, and
9067 ** the logic there will never allow a temporary file to be changed
9068 ** to WAL mode.
9069 */
9070 assert( pPager->tempFile==0 || eMode!=PAGER_JOURNALMODE_WAL );
9071
9072 /* Do not allow the journalmode of an in-memory database to be set to
9073 ** anything other than MEMORY or OFF
9074 */
9075 if( MEMDB ){
9076 assert( eOld==PAGER_JOURNALMODE_MEMORY || eOld==PAGER_JOURNALMODE_OFF );
9077 if( eMode!=PAGER_JOURNALMODE_MEMORY && eMode!=PAGER_JOURNALMODE_OFF ){
9078 eMode = eOld;
9079 }
9080 }
9081
9082 if( eMode!=eOld ){
9083
9084 /* Change the journal mode. */
9085 assert( pPager->eState!=PAGER_ERROR );
9086 pPager->journalMode = (u8)eMode;
9087
9088 /* When transitioning from TRUNCATE or PERSIST to any other journal
9089 ** mode except WAL, unless the pager is in locking_mode=exclusive mode,
9090 ** delete the journal file.
9091 */
9092 assert( (PAGER_JOURNALMODE_TRUNCATE & 5)==1 );
9093 assert( (PAGER_JOURNALMODE_PERSIST & 5)==1 );
9094 assert( (PAGER_JOURNALMODE_DELETE & 5)==0 );
9095 assert( (PAGER_JOURNALMODE_MEMORY & 5)==4 );
9096 assert( (PAGER_JOURNALMODE_OFF & 5)==0 );
9097 assert( (PAGER_JOURNALMODE_WAL & 5)==5 );
9098
9099 assert( isOpen(pPager->fd) || pPager->exclusiveMode );
9100 if( !pPager->exclusiveMode && (eOld & 5)==1 && (eMode & 1)==0 ){
9101
9102 /* In this case we would like to delete the journal file. If it is
9103 ** not possible, then that is not a problem. Deleting the journal file
9104 ** here is an optimization only.
9105 **
9106 ** Before deleting the journal file, obtain a RESERVED lock on the
9107 ** database file. This ensures that the journal file is not deleted
9108 ** while it is in use by some other client.
9109 */
9110 sqlite3OsClose(pPager->jfd);
9111 if( pPager->eLock>=RESERVED_LOCK ){
9112 sqlite3OsDelete(pPager->pVfs, pPager->zJournal, 0);
9113 }else{
9114 int rc = SQLITE_OK;
9115 int state = pPager->eState;
9116 assert( state==PAGER_OPEN || state==PAGER_READER );
9117 if( state==PAGER_OPEN ){
9118 rc = sqlite3PagerSharedLock(pPager);
9119 }
9120 if( pPager->eState==PAGER_READER ){
9121 assert( rc==SQLITE_OK );
9122 rc = pagerLockDb(pPager, RESERVED_LOCK);
9123 }
9124 if( rc==SQLITE_OK ){
9125 sqlite3OsDelete(pPager->pVfs, pPager->zJournal, 0);
9126 }
9127 if( rc==SQLITE_OK && state==PAGER_READER ){
9128 pagerUnlockDb(pPager, SHARED_LOCK);
9129 }else if( state==PAGER_OPEN ){
9130 pager_unlock(pPager);
9131 }
9132 assert( state==pPager->eState );
9133 }
9134 }else if( eMode==PAGER_JOURNALMODE_OFF ){
9135 sqlite3OsClose(pPager->jfd);
9136 }
9137 }
9138
9139 /* Return the new journal mode */
9140 return (int)pPager->journalMode;
9141 }
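/*
** A minimal sketch of the "& 5" bit test used above to decide whether the
** journal file should be deleted. The JM_* values are an assumption chosen
** to satisfy the assert() statements in the routine; the names JM_* and
** wouldDeleteJournal() are hypothetical and are not part of the pager.
*/
```c
/* Journal-mode codes (assumed values; they satisfy the "& 5" asserts). */
enum {
  JM_DELETE   = 0,
  JM_PERSIST  = 1,
  JM_OFF      = 2,
  JM_TRUNCATE = 3,
  JM_MEMORY   = 4,
  JM_WAL      = 5
};

/*
** Return 1 if a switch from eOld to eNew would attempt to delete the
** journal file, or 0 otherwise.
**
**   (eOld & 5)==1  is true only for PERSIST and TRUNCATE.
**   (eNew & 1)==0  is true only for DELETE, OFF and MEMORY.
*/
static int wouldDeleteJournal(int eOld, int eNew, int exclusiveMode){
  return !exclusiveMode && (eOld & 5)==1 && (eNew & 1)==0;
}
```
/*
** For example, wouldDeleteJournal(JM_PERSIST, JM_DELETE, 0) is 1, whereas a
** switch from JM_PERSIST to JM_WAL, or any switch made while in exclusive
** locking mode, yields 0.
*/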
9142
9143 /*
9144 ** Return the current journal mode.
9145 */
9146 SQLITE_PRIVATE int sqlite3PagerGetJournalMode(Pager *pPager){
9147 return (int)pPager->journalMode;
9148 }
9149
9150 /*
9151 ** Return TRUE if the pager is in a state where it is OK to change the
9152 ** journalmode. Journalmode changes can only happen when the database
9153 ** is unmodified.
9154 */
9155 SQLITE_PRIVATE int sqlite3PagerOkToChangeJournalMode(Pager *pPager){
9156 assert( assert_pager_state(pPager) );
9157 if( pPager->eState>=PAGER_WRITER_CACHEMOD ) return 0;
9158 if( NEVER(isOpen(pPager->jfd) && pPager->journalOff>0) ) return 0;
9159 return 1;
9160 }
9161
9162 /*
9163 ** Get/set the size-limit used for persistent journal files.
9164 **
9165 ** Setting the size limit to -1 means no limit is enforced.
9166 ** An attempt to set a limit smaller than -1 is a no-op.
9167 */
9168 SQLITE_PRIVATE i64 sqlite3PagerJournalSizeLimit(Pager *pPager, i64 iLimit){
9169 if( iLimit>=-1 ){
9170 pPager->journalSizeLimit = iLimit;
9171 sqlite3WalLimit(pPager->pWal, iLimit);
9172 }
9173 return pPager->journalSizeLimit;
9174 }
9175
9176 /*
9177 ** Return a pointer to the pPager->pBackup variable. The backup module
9178 ** in backup.c maintains the content of this variable. This module
9179 ** uses it opaquely as an argument to sqlite3BackupRestart() and
9180 ** sqlite3BackupUpdate() only.
9181 */
9182 SQLITE_PRIVATE sqlite3_backup **sqlite3PagerBackupPtr(Pager *pPager){
9183 return &pPager->pBackup;
9184 }
9185
9186 #ifndef SQLITE_OMIT_VACUUM
9187 /*
9188 ** Unless this is an in-memory or temporary database, clear the pager cache.
9189 */
9190 SQLITE_PRIVATE void sqlite3PagerClearCache(Pager *pPager){
9191 assert( MEMDB==0 || pPager->tempFile );
9192 if( pPager->tempFile==0 ) pager_reset(pPager);
9193 }
9194 #endif
9195
9196
9197 #ifndef SQLITE_OMIT_WAL
9198 /*
9199 ** This function is called when the user invokes "PRAGMA wal_checkpoint",
9200 ** "PRAGMA wal_blocking_checkpoint" or calls the sqlite3_wal_checkpoint()
9201 ** or wal_blocking_checkpoint() API functions.
9202 **
9203 ** Parameter eMode is one of SQLITE_CHECKPOINT_PASSIVE, FULL or RESTART.
9204 */
9205 SQLITE_PRIVATE int sqlite3PagerCheckpoint(
9206 Pager *pPager, /* Checkpoint on this pager */
9207 sqlite3 *db, /* Db handle used to check for interrupts */
9208 int eMode, /* Type of checkpoint */
9209 int *pnLog, /* OUT: Final number of frames in log */
9210 int *pnCkpt /* OUT: Final number of checkpointed frames */
9211 ){
9212 int rc = SQLITE_OK;
9213 if( pPager->pWal ){
9214 rc = sqlite3WalCheckpoint(pPager->pWal, db, eMode,
9215 (eMode==SQLITE_CHECKPOINT_PASSIVE ? 0 : pPager->xBusyHandler),
9216 pPager->pBusyHandlerArg,
9217 pPager->ckptSyncFlags, pPager->pageSize, (u8 *)pPager->pTmpSpace,
9218 pnLog, pnCkpt
9219 );
9220 }
9221 return rc;
9222 }
9223
9224 SQLITE_PRIVATE int sqlite3PagerWalCallback(Pager *pPager){
9225 return sqlite3WalCallback(pPager->pWal);
9226 }
9227
9228 /*
9229 ** Return true if the underlying VFS for the given pager supports the
9230 ** primitives necessary for write-ahead logging.
9231 */
9232 SQLITE_PRIVATE int sqlite3PagerWalSupported(Pager *pPager){
9233 const sqlite3_io_methods *pMethods = pPager->fd->pMethods;
9234 if( pPager->noLock ) return 0;
9235 return pPager->exclusiveMode || (pMethods->iVersion>=2 && pMethods->xShmMap);
9236 }
9237
9238 /*
9239 ** Attempt to take an exclusive lock on the database file. If a PENDING lock
9240 ** is obtained instead, immediately release it.
9241 */
9242 static int pagerExclusiveLock(Pager *pPager){
9243 int rc; /* Return code */
9244
9245 assert( pPager->eLock==SHARED_LOCK || pPager->eLock==EXCLUSIVE_LOCK );
9246 rc = pagerLockDb(pPager, EXCLUSIVE_LOCK);
9247 if( rc!=SQLITE_OK ){
9248 /* If the attempt to grab the exclusive lock failed, release the
9249 ** pending lock that may have been obtained instead. */
9250 pagerUnlockDb(pPager, SHARED_LOCK);
9251 }
9252
9253 return rc;
9254 }
9255
9256 /*
9257 ** Call sqlite3WalOpen() to open the WAL handle. If the pager is in
9258 ** exclusive-locking mode when this function is called, take an EXCLUSIVE
9259 ** lock on the database file and use heap-memory to store the wal-index
9260 ** in. Otherwise, use the normal shared-memory.
9261 */
9262 static int pagerOpenWal(Pager *pPager){
9263 int rc = SQLITE_OK;
9264
9265 assert( pPager->pWal==0 && pPager->tempFile==0 );
9266 assert( pPager->eLock==SHARED_LOCK || pPager->eLock==EXCLUSIVE_LOCK );
9267
9268 /* If the pager is already in exclusive-mode, the WAL module will use
9269 ** heap-memory for the wal-index instead of the VFS shared-memory
9270 ** implementation. Take the exclusive lock now, before opening the WAL
9271 ** file, to make sure this is safe.
9272 */
9273 if( pPager->exclusiveMode ){
9274 rc = pagerExclusiveLock(pPager);
9275 }
9276
9277 /* Open the connection to the log file. If this operation fails,
9278 ** (e.g. due to malloc() failure), return an error code.
9279 */
9280 if( rc==SQLITE_OK ){
9281 rc = sqlite3WalOpen(pPager->pVfs,
9282 pPager->fd, pPager->zWal, pPager->exclusiveMode,
9283 pPager->journalSizeLimit, &pPager->pWal
9284 );
9285 }
9286 pagerFixMaplimit(pPager);
9287
9288 return rc;
9289 }
9290
9291
9292 /*
9293 ** The caller must be holding a SHARED lock on the database file to call
9294 ** this function.
9295 **
9296 ** If the pager passed as the first argument is open on a real database
9297 ** file (not a temp file or an in-memory database), and the WAL file
9298 ** is not already open, make an attempt to open it now. If successful,
9299 ** return SQLITE_OK. If an error occurs or the VFS used by the pager does
9300 ** not support the xShmXXX() methods, return an error code. *pbOpen is
9301 ** not modified in either case.
9302 **
9303 ** If the pager is open on a temp-file (or in-memory database), or if
9304 ** the WAL file is already open, set *pbOpen to 1 and return SQLITE_OK
9305 ** without doing anything.
9306 */
9307 SQLITE_PRIVATE int sqlite3PagerOpenWal(
9308 Pager *pPager, /* Pager object */
9309 int *pbOpen /* OUT: Set to true if call is a no-op */
9310 ){
9311 int rc = SQLITE_OK; /* Return code */
9312
9313 assert( assert_pager_state(pPager) );
9314 assert( pPager->eState==PAGER_OPEN || pbOpen );
9315 assert( pPager->eState==PAGER_READER || !pbOpen );
9316 assert( pbOpen==0 || *pbOpen==0 );
9317 assert( pbOpen!=0 || (!pPager->tempFile && !pPager->pWal) );
9318
9319 if( !pPager->tempFile && !pPager->pWal ){
9320 if( !sqlite3PagerWalSupported(pPager) ) return SQLITE_CANTOPEN;
9321
9322 /* Close any rollback journal previously open */
9323 sqlite3OsClose(pPager->jfd);
9324
9325 rc = pagerOpenWal(pPager);
9326 if( rc==SQLITE_OK ){
9327 pPager->journalMode = PAGER_JOURNALMODE_WAL;
9328 pPager->eState = PAGER_OPEN;
9329 }
9330 }else{
9331 *pbOpen = 1;
9332 }
9333
9334 return rc;
9335 }
9336
9337 /*
9338 ** This function is called to close the connection to the log file prior
9339 ** to switching from WAL to rollback mode.
9340 **
9341 ** Before closing the log file, this function attempts to take an
9342 ** EXCLUSIVE lock on the database file. If this cannot be obtained, an
9343 ** error (SQLITE_BUSY) is returned and the log connection is not closed.
9344 ** If successful, the EXCLUSIVE lock is not released before returning.
9345 */
9346 SQLITE_PRIVATE int sqlite3PagerCloseWal(Pager *pPager, sqlite3 *db){
9347 int rc = SQLITE_OK;
9348
9349 assert( pPager->journalMode==PAGER_JOURNALMODE_WAL );
9350
9351 /* If the log file is not already open, but does exist in the file-system,
9352 ** it may need to be checkpointed before the connection can switch to
9353 ** rollback mode. Open it now so this can happen.
9354 */
9355 if( !pPager->pWal ){
9356 int logexists = 0;
9357 rc = pagerLockDb(pPager, SHARED_LOCK);
9358 if( rc==SQLITE_OK ){
9359 rc = sqlite3OsAccess(
9360 pPager->pVfs, pPager->zWal, SQLITE_ACCESS_EXISTS, &logexists
9361 );
9362 }
9363 if( rc==SQLITE_OK && logexists ){
9364 rc = pagerOpenWal(pPager);
9365 }
9366 }
9367
9368 /* Checkpoint and close the log. Because an EXCLUSIVE lock is held on
9369 ** the database file, the log and log-summary files will be deleted.
9370 */
9371 if( rc==SQLITE_OK && pPager->pWal ){
9372 rc = pagerExclusiveLock(pPager);
9373 if( rc==SQLITE_OK ){
9374 rc = sqlite3WalClose(pPager->pWal, db, pPager->ckptSyncFlags,
9375 pPager->pageSize, (u8*)pPager->pTmpSpace);
9376 pPager->pWal = 0;
9377 pagerFixMaplimit(pPager);
9378 if( rc && !pPager->exclusiveMode ) pagerUnlockDb(pPager, SHARED_LOCK);
9379 }
9380 }
9381 return rc;
9382 }
9383
9384 #ifdef SQLITE_ENABLE_SNAPSHOT
9385 /*
9386 ** If this is a WAL database, obtain a snapshot handle for the snapshot
9387 ** currently open. Otherwise, return an error.
9388 */
9389 SQLITE_PRIVATE int sqlite3PagerSnapshotGet(Pager *pPager, sqlite3_snapshot **ppSnapshot){
9390 int rc = SQLITE_ERROR;
9391 if( pPager->pWal ){
9392 rc = sqlite3WalSnapshotGet(pPager->pWal, ppSnapshot);
9393 }
9394 return rc;
9395 }
9396
9397 /*
9398 ** If this is a WAL database, store a pointer to pSnapshot. Next time a
9399 ** read transaction is opened, attempt to read from the snapshot it
9400 ** identifies. If this is not a WAL database, return an error.
9401 */
9402 SQLITE_PRIVATE int sqlite3PagerSnapshotOpen(Pager *pPager, sqlite3_snapshot *pSnapshot){
9403 int rc = SQLITE_OK;
9404 if( pPager->pWal ){
9405 sqlite3WalSnapshotOpen(pPager->pWal, pSnapshot);
9406 }else{
9407 rc = SQLITE_ERROR;
9408 }
9409 return rc;
9410 }
9411
9412 /*
9413 ** If this is a WAL database, call sqlite3WalSnapshotRecover(). If this
9414 ** is not a WAL database, return an error.
9415 */
9416 SQLITE_PRIVATE int sqlite3PagerSnapshotRecover(Pager *pPager){
9417 int rc;
9418 if( pPager->pWal ){
9419 rc = sqlite3WalSnapshotRecover(pPager->pWal);
9420 }else{
9421 rc = SQLITE_ERROR;
9422 }
9423 return rc;
9424 }
9425 #endif /* SQLITE_ENABLE_SNAPSHOT */
9426 #endif /* !SQLITE_OMIT_WAL */
9427
9428 #ifdef SQLITE_ENABLE_ZIPVFS
9429 /*
9430 ** A read-lock must be held on the pager when this function is called. If
9431 ** the pager is in WAL mode and the WAL file currently contains one or more
9432 ** frames, return the size in bytes of the page images stored within the
9433 ** WAL frames. Otherwise, if this is not a WAL database or the WAL file
9434 ** is empty, return 0.
9435 */
9436 SQLITE_PRIVATE int sqlite3PagerWalFramesize(Pager *pPager){
9437 assert( pPager->eState>=PAGER_READER );
9438 return sqlite3WalFramesize(pPager->pWal);
9439 }
9440 #endif
9441
9442 #endif /* SQLITE_OMIT_DISKIO */
9443
9444 /************** End of pager.c ***********************************************/
9445 /************** Begin file wal.c *********************************************/
9446 /*
9447 ** 2010 February 1
9448 **
9449 ** The author disclaims copyright to this source code. In place of
9450 ** a legal notice, here is a blessing:
9451 **
9452 ** May you do good and not evil.
9453 ** May you find forgiveness for yourself and forgive others.
9454 ** May you share freely, never taking more than you give.
9455 **
9456 *************************************************************************
9457 **
9458 ** This file contains the implementation of a write-ahead log (WAL) used in
9459 ** "journal_mode=WAL" mode.
9460 **
9461 ** WRITE-AHEAD LOG (WAL) FILE FORMAT
9462 **
9463 ** A WAL file consists of a header followed by zero or more "frames".
9464 ** Each frame records the revised content of a single page from the
9465 ** database file. All changes to the database are recorded by writing
9466 ** frames into the WAL. Transactions commit when a frame is written that
9467 ** contains a commit marker. A single WAL can and usually does record
9468 ** multiple transactions. Periodically, the content of the WAL is
9469 ** transferred back into the database file in an operation called a
9470 ** "checkpoint".
9471 **
9472 ** A single WAL file can be used multiple times. In other words, the
9473 ** WAL can fill up with frames and then be checkpointed and then new
9474 ** frames can overwrite the old ones. A WAL always grows from beginning
9475 ** toward the end. Checksums and counters attached to each frame are
9476 ** used to determine which frames within the WAL are valid and which
9477 ** are leftovers from prior checkpoints.
9478 **
9479 ** The WAL header is 32 bytes in size and consists of the following eight
9480 ** big-endian 32-bit unsigned integer values:
9481 **
9482 ** 0: Magic number. 0x377f0682 or 0x377f0683
9483 ** 4: File format version. Currently 3007000
9484 ** 8: Database page size. Example: 1024
9485 ** 12: Checkpoint sequence number
9486 ** 16: Salt-1, random integer incremented with each checkpoint
9487 ** 20: Salt-2, a different random integer changing with each ckpt
9488 ** 24: Checksum-1 (first part of checksum for first 24 bytes of header).
9489 ** 28: Checksum-2 (second part of checksum for first 24 bytes of header).
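**
** The header layout listed above can be sketched as a small decoder. This
** is an illustration only: the names WalHeaderSketch, getBe32() and
** parseWalHeader() are hypothetical, and the sketch checks nothing beyond
** the magic number (in particular it does not verify the header checksum).
**
```c
#include <stdint.h>

// Hypothetical container for the eight 32-bit header fields.
typedef struct WalHeaderSketch {
  uint32_t magic;     // offset  0: 0x377f0682 or 0x377f0683
  uint32_t version;   // offset  4: file format version
  uint32_t pageSize;  // offset  8: database page size
  uint32_t ckptSeq;   // offset 12: checkpoint sequence number
  uint32_t salt1;     // offset 16
  uint32_t salt2;     // offset 20
  uint32_t cksum1;    // offset 24
  uint32_t cksum2;    // offset 28
} WalHeaderSketch;

// Decode one big-endian 32-bit unsigned integer.
static uint32_t getBe32(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

// Decode the 32-byte header in a[]. Returns 0 on success, or -1 if the
// magic number is neither of the two documented values.
static int parseWalHeader(const unsigned char *a, WalHeaderSketch *h){
  h->magic    = getBe32(a+0);
  h->version  = getBe32(a+4);
  h->pageSize = getBe32(a+8);
  h->ckptSeq  = getBe32(a+12);
  h->salt1    = getBe32(a+16);
  h->salt2    = getBe32(a+20);
  h->cksum1   = getBe32(a+24);
  h->cksum2   = getBe32(a+28);
  // The two magic values differ only in the least significant bit.
  if( (h->magic & 0xFFFFFFFEu)!=0x377f0682u ) return -1;
  return 0;
}
```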
9490 **
9491 ** Immediately following the wal-header are zero or more frames. Each
9492 ** frame consists of a 24-byte frame-header followed by <page-size> bytes
9493 ** of page data. The frame-header is six big-endian 32-bit unsigned
9494 ** integer values, as follows:
9495 **
9496 ** 0: Page number.
9497 ** 4: For commit records, the size of the database image in pages
9498 ** after the commit. For all other records, zero.
9499 ** 8: Salt-1 (copied from the header)
9500 ** 12: Salt-2 (copied from the header)
9501 ** 16: Checksum-1.
9502 ** 20: Checksum-2.
9503 **
9504 ** A frame is considered valid if and only if the following conditions are
9505 ** true:
9506 **
9507 ** (1) The salt-1 and salt-2 values in the frame-header match
9508 ** salt values in the wal-header
9509 **
9510 ** (2) The checksum values in the final 8 bytes of the frame-header
9511 ** exactly match the checksum computed consecutively on the
9512 ** WAL header and the first 8 bytes and the content of all frames
9513 ** up to and including the current frame.
9514 **
9515 ** The checksum is computed using 32-bit big-endian integers if the
9516 ** magic number in the first 4 bytes of the WAL is 0x377f0683 and it
9517 ** is computed using little-endian if the magic number is 0x377f0682.
9518 ** The checksum values are always stored in the frame header in a
9519 ** big-endian format regardless of which byte order is used to compute
9520 ** the checksum. The checksum is computed by interpreting the input as
9521 ** an even number of unsigned 32-bit integers: x[0] through x[n-1]. The
9522 ** algorithm used for the checksum is as follows:
9523 **
9524 ** for i from 0 to n-1 step 2:
9525 ** s0 += x[i] + s1;
9526 ** s1 += x[i+1] + s0;
9527 ** endfor
9528 **
9529 ** Note that s0 and s1 are both weighted checksums using fibonacci weights
9530 ** in reverse order (the largest fibonacci weight occurs on the first element
9531 ** of the sequence being summed.) The s1 value spans all 32-bit
9532 ** terms of the sequence whereas s0 omits the final term.
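**
** The checksum loop above can be sketched in C as follows. This is a
** minimal illustration, assuming the input words have already been
** converted to the byte order selected by the magic number; the name
** walChecksumSketch() is hypothetical. The running values s0 and s1 are
** passed in and out so that the checksum can be continued from the header
** across successive frames.
**
```c
#include <stddef.h>
#include <stdint.h>

// Fold n 32-bit words (n must be even) into the running checksum pair
// (*s0, *s1). Unsigned arithmetic wraps modulo 2^32.
static void walChecksumSketch(
  const uint32_t *x,   // Input words, already in the chosen byte order
  size_t n,            // Number of words; assumed to be even
  uint32_t *s0,        // IN/OUT: first checksum word
  uint32_t *s1         // IN/OUT: second checksum word
){
  size_t i;
  for(i=0; i+1<n; i+=2){
    *s0 += x[i] + *s1;
    *s1 += x[i+1] + *s0;
  }
}
```
**
** Starting from s0 = s1 = 0, the input {1, 2, 3, 4} gives s0 = 2*1 + 2 + 3
** = 7 and s1 = 3*1 + 2*2 + 3 + 4 = 14, which shows the weights 3, 2, 1, 1
** applied in reverse order.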
9533 **
9534 ** On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the
9535 ** WAL is transferred into the database, then the database is VFS.xSync-ed.
9536 ** The VFS.xSync operations serve as write barriers - all writes launched
9537 ** before the xSync must complete before any write that launches after the
9538 ** xSync begins.
9539 **
9540 ** After each checkpoint, the salt-1 value is incremented and the salt-2
9541 ** value is randomized. This prevents old and new frames in the WAL from
9542 ** being considered valid at the same time and being checkpointed together
9543 ** following a crash.
9544 **
9545 ** READER ALGORITHM
9546 **
9547 ** To read a page from the database (call it page number P), a reader
9548 ** first checks the WAL to see if it contains page P. If so, then the
9549 ** last valid instance of page P that is followed by a commit frame
9550 ** or is a commit frame itself becomes the value read. If the WAL
9551 ** contains no copies of page P that are valid and which are a commit
9552 ** frame or are followed by a commit frame, then page P is read from
9553 ** the database file.
9554 **
9555 ** To start a read transaction, the reader records the index of the last
9556 ** valid frame in the WAL. The reader uses this recorded "mxFrame" value
9557 ** for all subsequent read operations. New transactions can be appended
9558 ** to the WAL, but as long as the reader uses its original mxFrame value
9559 ** and ignores the newly appended content, it will see a consistent snapshot
9560 ** of the database from a single point in time. This technique allows
9561 ** multiple concurrent readers to view different versions of the database
9562 ** content simultaneously.
9563 **
9564 ** The reader algorithm in the previous paragraphs works correctly, but
9565 ** because frames for page P can appear anywhere within the WAL, the
9566 ** reader has to scan the entire WAL looking for page P frames. If the
9567 ** WAL is large (multiple megabytes is typical) that scan can be slow,
9568 ** and read performance suffers. To overcome this problem, a separate
9569 ** data structure called the wal-index is maintained to expedite the
9570 ** search for frames of a particular page.
9571 **
9572 ** WAL-INDEX FORMAT
9573 **
9574 ** Conceptually, the wal-index is shared memory, though VFS implementations
9575 ** might choose to implement the wal-index using a mmapped file. Because
9576 ** the wal-index is shared memory, SQLite does not support journal_mode=WAL
9577 ** on a network filesystem. All users of the database must be able to
9578 ** share memory.
9579 **
9580 ** The wal-index is transient. After a crash, the wal-index can (and should
9581 ** be) reconstructed from the original WAL file. In fact, the VFS is required
9582 ** to either truncate or zero the header of the wal-index when the last
9583 ** connection to it closes. Because the wal-index is transient, it can
9584 ** use an architecture-specific format; it does not have to be cross-platform.
9585 ** Hence, unlike the database and WAL file formats which store all values
9586 ** as big endian, the wal-index can store multi-byte values in the native
9587 ** byte order of the host computer.
9588 **
9589 ** The purpose of the wal-index is to answer this question quickly: Given
9590 ** a page number P and a maximum frame index M, return the index of the
9591 ** last frame for page P in the WAL before frame M, or return
9592 ** NULL if there are no frames for page P in the WAL prior to M.
9593 **
9594 ** The wal-index consists of a header region, followed by one or
9595 ** more index blocks.
9596 **
9597 ** The wal-index header contains the total number of frames within the WAL
9598 ** in the mxFrame field.
9599 **
9600 ** Each index block except for the first contains information on
9601 ** HASHTABLE_NPAGE frames. The first index block contains information on
9602 ** HASHTABLE_NPAGE_ONE frames. The values of HASHTABLE_NPAGE_ONE and
9603 ** HASHTABLE_NPAGE are selected so that together the wal-index header and
9604 ** first index block are the same size as all other index blocks in the
9605 ** wal-index.
9606 **
9607 ** Each index block contains two sections, a page-mapping that contains the
9608 ** database page number associated with each wal frame, and a hash-table
9609 ** that allows readers to query an index block for a specific page number.
9610 ** The page-mapping is an array of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE
9611 ** for the first index block) 32-bit page numbers. The first entry in the
9612 ** first index-block contains the database page number corresponding to the
9613 ** first frame in the WAL file. The first entry in the second index block
9614 ** in the WAL file corresponds to the (HASHTABLE_NPAGE_ONE+1)th frame in
9615 ** the log, and so on.
9616 **
9617 ** The last index block in a wal-index usually contains less than the full
9618 ** complement of HASHTABLE_NPAGE (or HASHTABLE_NPAGE_ONE) page-numbers,
9619 ** depending on the contents of the WAL file. This does not change the
9620 ** allocated size of the page-mapping array - the page-mapping array merely
9621 ** contains unused entries.
9622 **
9623 ** Even without using the hash table, the last frame for page P
9624 ** can be found by scanning the page-mapping sections of each index block
9625 ** starting with the last index block and moving toward the first, and
9626 ** within each index block, starting at the end and moving toward the
9627 ** beginning. The first entry that equals P corresponds to the frame
9628 ** holding the content for that page.
9629 **
9630 ** The hash table consists of HASHTABLE_NSLOT 16-bit unsigned integers.
9631 ** HASHTABLE_NSLOT = 2*HASHTABLE_NPAGE, and there is one entry in the
9632 ** hash table for each page number in the mapping section, so the hash
9633 ** table is never more than half full. The expected number of collisions
9634 ** prior to finding a match is 1. Each entry of the hash table is a
9635 ** 1-based index of an entry in the mapping section of the same
9636 ** index block. Let K be the 1-based index of the largest entry in
9637 ** the mapping section. (For index blocks other than the last, K will
9638 ** always be exactly HASHTABLE_NPAGE (4096) and for the last index block
9639 ** K will be (mxFrame%HASHTABLE_NPAGE).) Unused slots of the hash table
9640 ** contain a value of 0.
9641 **
9642 ** To look for page P in the hash table, first compute a hash iKey on
9643 ** P as follows:
9644 **
9645 ** iKey = (P * 383) % HASHTABLE_NSLOT
9646 **
9647 ** Then start scanning entries of the hash table, starting with iKey
9648 ** (wrapping around to the beginning when the end of the hash table is
9649 ** reached) until an unused hash slot is found. Let the first unused slot
9650 ** be at index iUnused. (iUnused might be less than iKey if there was
9651 ** wrap-around.) Because the hash table is never more than half full,
9652 ** the search is guaranteed to eventually hit an unused entry. Let
9653 ** iMax be the slot between iKey and iUnused, closest to iUnused,
9654 ** whose value aHash[iMax] indexes a mapping entry equal to P. If
9655 ** there is no such slot then page P is not in the
9656 ** current index block. Otherwise mapping entry aHash[iMax] of the
9657 ** current index block is the last entry that references
9658 ** page P.
9659 **
9660 ** A hash search begins with the last index block and moves toward the
9661 ** first index block, looking for entries corresponding to page P. On
9662 ** average, only two or three slots in each index block need to be
9663 ** examined in order to either find the last entry for page P, or to
9664 ** establish that no such entry exists in the block. Each index block
9665 ** holds over 4000 entries. So two or three index blocks are sufficient
9666 ** to cover a typical 10 megabyte WAL file, assuming 1K pages. 8 or 10
9667 ** comparisons (on average) suffice to either locate a frame in the
9668 ** WAL or to establish that the frame does not exist in the WAL. This
9669 ** is much faster than scanning the entire 10MB WAL.
9670 **
9671 ** Note that entries are added in order of increasing K. Hence, one
9672 ** reader might be using some value K0 and a second reader that started
9673 ** at a later time (after additional transactions were added to the WAL
9674 ** and to the wal-index) might be using a different value K1, where K1>K0.
9675 ** Both readers can use the same hash table and mapping section to get
9676 ** the correct result. There may be entries in the hash table with
9677 ** K>K0 but to the first reader, those entries will appear to be unused
9678 ** slots in the hash table and so the first reader will get an answer as
9679 ** if no values greater than K0 had ever been inserted into the hash table
9680 ** in the first place - which is what reader one wants. Meanwhile, the
9681 ** second reader using K1 will see additional values that were inserted
9682 ** later, which is exactly what reader two wants.
9683 **
9684 ** When a rollback occurs, the value of K is decreased. Hash table entries
9685 ** that correspond to frames greater than the new K value are removed
9686 ** from the hash table at this point.
9687 */
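Review note: the lookup described in the comment above can be modelled for a single index block roughly as follows. This is a minimal standalone sketch, not SQLite's code. DemoBlock, demoAppend() and demoFind() are hypothetical names; the constants mirror HASHTABLE_NPAGE, HASHTABLE_NSLOT and the multiplier 383; the K argument stands in for a reader's snapshot limit. Note that each candidate slot is checked against the mapping entry it points to, not against the slot value itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NPAGE  4096              /* mirrors HASHTABLE_NPAGE */
#define DEMO_NSLOT  (DEMO_NPAGE*2)    /* mirrors HASHTABLE_NSLOT */
#define DEMO_HASH_1 383               /* mirrors HASHTABLE_HASH_1 */

/* Hypothetical in-memory model of one index block. */
typedef struct DemoBlock {
  uint32_t aPgno[DEMO_NPAGE+1];  /* 1-based page-mapping; aPgno[0] unused */
  uint16_t aHash[DEMO_NSLOT];    /* 1-based index into aPgno[]; 0 == unused */
  uint32_t nEntry;               /* Number of mapping entries in use */
} DemoBlock;

static uint32_t demoHash(uint32_t pgno){
  return (pgno*DEMO_HASH_1) & (DEMO_NSLOT-1);
}
static uint32_t demoNext(uint32_t iKey){
  return (iKey+1) & (DEMO_NSLOT-1);
}

/* Append a mapping entry for pgno and link it into the hash table. */
static void demoAppend(DemoBlock *p, uint32_t pgno){
  uint32_t iKey;
  assert( pgno>0 && p->nEntry<DEMO_NPAGE );
  p->nEntry++;
  p->aPgno[p->nEntry] = pgno;
  /* Linear probing: the table is at most half full, so this terminates. */
  for(iKey=demoHash(pgno); p->aHash[iKey]; iKey=demoNext(iKey)){}
  p->aHash[iKey] = (uint16_t)p->nEntry;
}

/* Return the largest 1-based mapping index <= K that holds pgno, or 0. */
static uint32_t demoFind(const DemoBlock *p, uint32_t pgno, uint32_t K){
  uint32_t iKey;
  uint32_t iFound = 0;
  for(iKey=demoHash(pgno); p->aHash[iKey]; iKey=demoNext(iKey)){
    uint32_t idx = p->aHash[iKey];
    /* Entries above K are treated as if they had never been inserted. */
    if( idx<=K && p->aPgno[idx]==pgno ) iFound = idx;
  }
  return iFound;
}
```

Because entries are appended in increasing index order, the last match seen along the probe chain is also the newest one, which is why the sketch keeps overwriting iFound instead of stopping at the first hit. Passing a smaller K models an older reader sharing the same table.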
9688 #ifndef SQLITE_OMIT_WAL
9689
9690 /* #include "wal.h" */
9691
9692 /*
9693 ** Trace output macros
9694 */
9695 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
9696 SQLITE_PRIVATE int sqlite3WalTrace = 0;
9697 # define WALTRACE(X) if(sqlite3WalTrace) sqlite3DebugPrintf X
9698 #else
9699 # define WALTRACE(X)
9700 #endif
9701
9702 /*
9703 ** The maximum (and only) versions of the wal and wal-index formats
9704 ** that may be interpreted by this version of SQLite.
9705 **
9706 ** If a client begins recovering a WAL file and finds that (a) the checksum
9707 ** values in the wal-header are correct and (b) the version field is not
9708 ** WAL_MAX_VERSION, recovery fails and SQLite returns SQLITE_CANTOPEN.
9709 **
9710 ** Similarly, if a client successfully reads a wal-index header (i.e. the
9711 ** checksum test is successful) and finds that the version field is not
9712 ** WALINDEX_MAX_VERSION, then no read-transaction is opened and SQLite
9713 ** returns SQLITE_CANTOPEN.
9714 */
9715 #define WAL_MAX_VERSION 3007000
9716 #define WALINDEX_MAX_VERSION 3007000
9717
9718 /*
9719 ** Indices of various locking bytes. WAL_NREADER is the number
9720 ** of available reader locks and should be at least 3. The default
9721 ** is SQLITE_SHM_NLOCK==8 and WAL_NREADER==5.
9722 */
9723 #define WAL_WRITE_LOCK 0
9724 #define WAL_ALL_BUT_WRITE 1
9725 #define WAL_CKPT_LOCK 1
9726 #define WAL_RECOVER_LOCK 2
9727 #define WAL_READ_LOCK(I) (3+(I))
9728 #define WAL_NREADER (SQLITE_SHM_NLOCK-3)
9729
9730
9731 /* Object declarations */
9732 typedef struct WalIndexHdr WalIndexHdr;
9733 typedef struct WalIterator WalIterator;
9734 typedef struct WalCkptInfo WalCkptInfo;
9735
9736
9737 /*
9738 ** The following object holds a copy of the wal-index header content.
9739 **
9740 ** The actual header in the wal-index consists of two copies of this
9741 ** object followed by one instance of the WalCkptInfo object.
9742 ** For all versions of SQLite through 3.10.0 and probably beyond,
9743 ** the locking bytes (WalCkptInfo.aLock) start at offset 120 and
9744 ** the total header size is 136 bytes.
9745 **
9746 ** The szPage value can be any power of 2 between 512 and 32768, inclusive.
9747 ** Or it can be 1 to represent a 65536-byte page. The latter case was
9748 ** added in 3.7.1 when support for 64K pages was added.
9749 */
9750 struct WalIndexHdr {
9751 u32 iVersion; /* Wal-index version */
9752 u32 unused; /* Unused (padding) field */
9753 u32 iChange; /* Counter incremented each transaction */
9754 u8 isInit; /* 1 when initialized */
9755 u8 bigEndCksum; /* True if checksums in WAL are big-endian */
9756 u16 szPage; /* Database page size in bytes. 1==64K */
9757 u32 mxFrame; /* Index of last valid frame in the WAL */
9758 u32 nPage; /* Size of database in pages */
9759 u32 aFrameCksum[2]; /* Checksum of last frame in log */
9760 u32 aSalt[2]; /* Two salt values copied from WAL header */
9761 u32 aCksum[2]; /* Checksum over all prior fields */
9762 };
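Review note: the szPage convention described above (a power of two from 512 to 32768 stored directly, and the value 1 standing for 65536) can be illustrated with a pair of helpers. The function names are hypothetical and the bit masks are one way to implement the convention under that assumption; this is a sketch, not necessarily how the rest of the file does it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack a page size into a 16-bit szPage field.
** 65536 does not fit in 16 bits, so it is stored as 1. */
static uint16_t demoEncodePageSize(uint32_t szPage){
  assert( szPage>=512 && szPage<=65536 && (szPage&(szPage-1))==0 );
  return (uint16_t)((szPage & 0xff00) | (szPage >> 16));
}

/* Hypothetical helper: recover the page size from the 16-bit field. */
static uint32_t demoDecodePageSize(uint16_t v){
  return ((uint32_t)v & 0xfe00) + (((uint32_t)v & 0x0001) << 16);
}
```

For example, demoEncodePageSize(65536) yields 1 and demoDecodePageSize(1) yields 65536, while every smaller legal page size round-trips unchanged.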
9763
9764 /*
9765 ** A copy of the following object occurs in the wal-index immediately
9766 ** following the second copy of the WalIndexHdr. This object stores
9767 ** information used by checkpoint.
9768 **
9769 ** nBackfill is the number of frames in the WAL that have been written
9770 ** back into the database. (We call the act of moving content from WAL to
9771 ** database "backfilling".) The nBackfill number is never greater than
9772 ** WalIndexHdr.mxFrame. nBackfill can only be increased by threads
9773 ** holding the WAL_CKPT_LOCK lock (which includes a recovery thread).
9774 ** However, a WAL_WRITE_LOCK thread can move the value of nBackfill from
9775 ** mxFrame back to zero when the WAL is reset.
9776 **
9777 ** nBackfillAttempted is the largest value of nBackfill that a checkpoint
9778 ** has attempted to achieve. Normally nBackfill==nBackfillAttempted, however
9779 ** the nBackfillAttempted is set before any backfilling is done and the
9780 ** nBackfill is only set after all backfilling completes. So if a checkpoint
9781 ** crashes, nBackfillAttempted might be larger than nBackfill. The
9782 ** WalIndexHdr.mxFrame must never be less than nBackfillAttempted.
9783 **
9784 ** The aLock[] field is a set of bytes used for locking. These bytes should
9785 ** never be read or written.
9786 **
9787 ** There is one entry in aReadMark[] for each reader lock. If a reader
9788 ** holds read-lock K, then the value in aReadMark[K] is no greater than
9789 ** the mxFrame for that reader. The value READMARK_NOT_USED (0xffffffff)
9790 ** for any aReadMark[] means that entry is unused. aReadMark[0] is
9791 ** a special case; its value is never used and it exists as a place-holder
9792 ** to avoid having to offset aReadMark[] indexes by one. Readers holding
9793 ** WAL_READ_LOCK(0) always ignore the entire WAL and read all content
9794 ** directly from the database.
9795 **
9796 ** The value of aReadMark[K] may only be changed by a thread that
9797 ** is holding an exclusive lock on WAL_READ_LOCK(K). Thus, the value of
9798 ** aReadMark[K] cannot change while a reader is using that mark
9799 ** since the reader will be holding a shared lock on WAL_READ_LOCK(K).
9800 **
9801 ** The checkpointer may only transfer frames from WAL to database where
9802 ** the frame numbers are less than or equal to every aReadMark[] that is
9803 ** in use (that is, every aReadMark[j] for which there is a corresponding
9804 ** WAL_READ_LOCK(j)). New readers (usually) pick the aReadMark[] with the
9805 ** largest value and will increase an unused aReadMark[] to mxFrame if there
9806 ** is not already an aReadMark[] equal to mxFrame. The exception to the
9807 ** previous sentence is when nBackfill equals mxFrame (meaning that everything
9808 ** in the WAL has been backfilled into the database) then new readers
9809 ** will choose aReadMark[0] which has value 0 and hence such readers will
9810 ** get all their content directly from the database file and ignore
9811 ** the WAL.
9812 **
9813 ** Writers normally append new frames to the end of the WAL. However,
9814 ** if nBackfill equals mxFrame (meaning that all WAL content has been
9815 ** written back into the database) and if no readers are using the WAL
9816 ** (in other words, if there are no WAL_READ_LOCK(i) where i>0) then
9817 ** the writer will first "reset" the WAL back to the beginning and start
9818 ** writing new content beginning at frame 1.
9819 **
9820 ** We assume that 32-bit loads are atomic and so no locks are needed in
9821 ** order to read from any aReadMark[] entries.
9822 */
9823 struct WalCkptInfo {
9824 u32 nBackfill; /* Number of WAL frames backfilled into DB */
9825 u32 aReadMark[WAL_NREADER]; /* Reader marks */
9826 u8 aLock[SQLITE_SHM_NLOCK]; /* Reserved space for locks */
9827 u32 nBackfillAttempted; /* WAL frames perhaps written, or maybe not */
9828 u32 notUsed0; /* Available for future enhancements */
9829 };
9830 #define READMARK_NOT_USED 0xffffffff
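Review note: the rule that a checkpoint may only backfill frames up to the smallest in-use read mark can be sketched as below. This is a simplified model under a loud assumption: a mark counts as "in use" whenever its value is not the unused sentinel, whereas the real checkpointer decides this by probing the corresponding WAL_READ_LOCK(i). demoSafeFrame() and the DEMO_ constants are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NREADER           5            /* mirrors the default WAL_NREADER */
#define DEMO_READMARK_NOT_USED 0xffffffffu  /* mirrors READMARK_NOT_USED */

/* Return the largest frame number a checkpoint could safely backfill,
** assuming every mark that is not DEMO_READMARK_NOT_USED is held by a reader.
** aReadMark[0] is skipped: readers on slot 0 ignore the WAL entirely. */
static uint32_t demoSafeFrame(uint32_t mxFrame, const uint32_t *aReadMark){
  uint32_t mxSafe = mxFrame;
  int i;
  assert( aReadMark!=0 );
  for(i=1; i<DEMO_NREADER; i++){
    uint32_t y = aReadMark[i];
    if( y!=DEMO_READMARK_NOT_USED && y<mxSafe ) mxSafe = y;
  }
  return mxSafe;
}
```

With marks {0, 10, unused, 7, unused} and mxFrame of 12 the sketch returns 7: frames 8 through 12 must stay in the WAL because some reader may still need the older page versions they supersede.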
9831
9832
9833 /* A block of WALINDEX_LOCK_RESERVED bytes beginning at
9834 ** WALINDEX_LOCK_OFFSET is reserved for locks. Since some systems
9835 ** only support mandatory file-locks, we do not read or write data
9836 ** from the region of the file on which locks are applied.
9837 */
9838 #define WALINDEX_LOCK_OFFSET (sizeof(WalIndexHdr)*2+offsetof(WalCkptInfo,aLock))
9839 #define WALINDEX_HDR_SIZE (sizeof(WalIndexHdr)*2+sizeof(WalCkptInfo))
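Review note: the figures quoted earlier (lock bytes at offset 120, total header size 136) follow from the two macros above when SQLITE_SHM_NLOCK is 8. The sketch below re-declares the two structures with fixed-width types to check the arithmetic. The Demo names are hypothetical, and the result assumes a typical ABI with no padding inside these structs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SHM_NLOCK 8                   /* assumed default SQLITE_SHM_NLOCK */
#define DEMO_NREADER   (DEMO_SHM_NLOCK-3)  /* mirrors WAL_NREADER */

typedef struct DemoIndexHdr {   /* mirrors WalIndexHdr: 48 bytes */
  uint32_t iVersion, unused, iChange;
  uint8_t  isInit, bigEndCksum;
  uint16_t szPage;
  uint32_t mxFrame, nPage;
  uint32_t aFrameCksum[2], aSalt[2], aCksum[2];
} DemoIndexHdr;

typedef struct DemoCkptInfo {   /* mirrors WalCkptInfo: 40 bytes */
  uint32_t nBackfill;
  uint32_t aReadMark[DEMO_NREADER];
  uint8_t  aLock[DEMO_SHM_NLOCK];
  uint32_t nBackfillAttempted;
  uint32_t notUsed0;
} DemoCkptInfo;

/* Same expressions as WALINDEX_LOCK_OFFSET and WALINDEX_HDR_SIZE. */
static size_t demoLockOffset(void){
  return sizeof(DemoIndexHdr)*2 + offsetof(DemoCkptInfo, aLock);
}
static size_t demoHdrSize(void){
  return sizeof(DemoIndexHdr)*2 + sizeof(DemoCkptInfo);
}

/* Abort if the layout differs from the documented 120/136 bytes. */
static void demoCheckLayout(void){
  assert( demoLockOffset()==120 );
  assert( demoHdrSize()==136 );
}
```

The breakdown is 2*48 bytes of header copies, then 4 + 5*4 = 24 bytes of WalCkptInfo before aLock, giving 96 + 24 = 120; the remaining 8 + 4 + 4 bytes bring the total to 136.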
9840
9841 /* Size of header before each frame in wal */
9842 #define WAL_FRAME_HDRSIZE 24
9843
9844 /* Size of write ahead log header, including checksum. */
9845 /* #define WAL_HDRSIZE 24 */
9846 #define WAL_HDRSIZE 32
9847
9848 /* WAL magic value. Either this value, or the same value with the least
9849 ** significant bit also set (WAL_MAGIC | 0x00000001) is stored in 32-bit
9850 ** big-endian format in the first 4 bytes of a WAL file.
9851 **
9852 ** If the LSB is set, then the checksums for each frame within the WAL
9853 ** file are calculated by treating all data as an array of 32-bit
9854 ** big-endian words. Otherwise, they are calculated by interpreting
9855 ** all data as 32-bit little-endian words.
9856 */
9857 #define WAL_MAGIC 0x377f0682
9858
9859 /*
9860 ** Return the offset of frame iFrame in the write-ahead log file,
9861 ** assuming a database page size of szPage bytes. The offset returned
9862 ** is to the start of the write-ahead log frame-header.
9863 */
9864 #define walFrameOffset(iFrame, szPage) ( \
9865 WAL_HDRSIZE + ((iFrame)-1)*(i64)((szPage)+WAL_FRAME_HDRSIZE) \
9866 )
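Review note: a worked example of the offset arithmetic in walFrameOffset(). The function below is a hypothetical function-style equivalent of the macro, using the two header sizes defined above (32 and 24 bytes).

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WAL_HDRSIZE       32   /* mirrors WAL_HDRSIZE */
#define DEMO_WAL_FRAME_HDRSIZE 24   /* mirrors WAL_FRAME_HDRSIZE */

/* Byte offset of the frame-header of 1-based frame iFrame. */
static int64_t demoFrameOffset(uint32_t iFrame, uint32_t szPage){
  assert( iFrame>=1 );
  return DEMO_WAL_HDRSIZE
       + (int64_t)(iFrame-1) * (int64_t)(szPage + DEMO_WAL_FRAME_HDRSIZE);
}
```

With 4096-byte pages each frame occupies 4120 bytes, so frame 1 starts at offset 32 and frame 2 at offset 4152. The widening to 64 bits happens before the multiplication, which matters once the WAL grows past 4GB.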
9867
9868 /*
9869 ** An open write-ahead log file is represented by an instance of the
9870 ** following object.
9871 */
9872 struct Wal {
9873 sqlite3_vfs *pVfs; /* The VFS used to create pDbFd */
9874 sqlite3_file *pDbFd; /* File handle for the database file */
9875 sqlite3_file *pWalFd; /* File handle for WAL file */
9876 u32 iCallback; /* Value to pass to log callback (or 0) */
9877 i64 mxWalSize; /* Truncate WAL to this size upon reset */
9878 int nWiData; /* Size of array apWiData */
9879 int szFirstBlock; /* Size of first block written to WAL file */
9880 volatile u32 **apWiData; /* Pointer to wal-index content in memory */
9881 u32 szPage; /* Database page size */
9882 i16 readLock; /* Which read lock is being held. -1 for none */
9883 u8 syncFlags; /* Flags to use to sync header writes */
9884 u8 exclusiveMode; /* Non-zero if connection is in exclusive mode */
9885 u8 writeLock; /* True if in a write transaction */
9886 u8 ckptLock; /* True if holding a checkpoint lock */
9887 u8 readOnly; /* WAL_RDWR, WAL_RDONLY, or WAL_SHM_RDONLY */
9888 u8 truncateOnCommit; /* True to truncate WAL file on commit */
9889 u8 syncHeader; /* Fsync the WAL header if true */
9890 u8 padToSectorBoundary; /* Pad transactions out to the next sector */
9891 WalIndexHdr hdr; /* Wal-index header for current transaction */
9892 u32 minFrame; /* Ignore wal frames before this one */
9893 u32 iReCksum; /* On commit, recalculate checksums from here */
9894 const char *zWalName; /* Name of WAL file */
9895 u32 nCkpt; /* Checkpoint sequence counter in the wal-header */
9896 #ifdef SQLITE_DEBUG
9897 u8 lockError; /* True if a locking error has occurred */
9898 #endif
9899 #ifdef SQLITE_ENABLE_SNAPSHOT
9900 WalIndexHdr *pSnapshot; /* Start transaction here if not NULL */
9901 #endif
9902 };
9903
9904 /*
9905 ** Candidate values for Wal.exclusiveMode.
9906 */
9907 #define WAL_NORMAL_MODE 0
9908 #define WAL_EXCLUSIVE_MODE 1
9909 #define WAL_HEAPMEMORY_MODE 2
9910
9911 /*
9912 ** Possible values for Wal.readOnly
9913 */
9914 #define WAL_RDWR 0 /* Normal read/write connection */
9915 #define WAL_RDONLY 1 /* The WAL file is readonly */
9916 #define WAL_SHM_RDONLY 2 /* The SHM file is readonly */
9917
9918 /*
9919 ** Each page of the wal-index mapping contains a hash-table made up of
9920 ** an array of HASHTABLE_NSLOT elements of the following type.
9921 */
9922 typedef u16 ht_slot;
9923
9924 /*
9925 ** This structure is used to implement an iterator that loops through
9926 ** all frames in the WAL in database page order. Where two or more frames
9927 ** correspond to the same database page, the iterator visits only the
9928 ** frame most recently written to the WAL (in other words, the frame with
9929 ** the largest index).
9930 **
9931 ** The internals of this structure are only accessed by:
9932 **
9933 ** walIteratorInit() - Create a new iterator,
9934 ** walIteratorNext() - Step an iterator,
9935 ** walIteratorFree() - Free an iterator.
9936 **
9937 ** This functionality is used by the checkpoint code (see walCheckpoint()).
9938 */
9939 struct WalIterator {
9940 int iPrior; /* Last result returned from the iterator */
9941 int nSegment; /* Number of entries in aSegment[] */
9942 struct WalSegment {
9943 int iNext; /* Next slot in aIndex[] not yet returned */
9944 ht_slot *aIndex; /* i0, i1, i2... such that aPgno[iN] ascend */
9945 u32 *aPgno; /* Array of page numbers. */
9946 int nEntry; /* Nr. of entries in aPgno[] and aIndex[] */
9947 int iZero; /* Frame number associated with aPgno[0] */
9948 } aSegment[1]; /* One for every 32KB page in the wal-index */
9949 };
9950
9951 /*
9952 ** Define the parameters of the hash tables in the wal-index file. There
9953 ** is a hash-table following every HASHTABLE_NPAGE page numbers in the
9954 ** wal-index.
9955 **
9956 ** Changing any of these constants will alter the wal-index format and
9957 ** create incompatibilities.
9958 */
9959 #define HASHTABLE_NPAGE 4096 /* Must be power of 2 */
9960 #define HASHTABLE_HASH_1 383 /* Should be prime */
9961 #define HASHTABLE_NSLOT (HASHTABLE_NPAGE*2) /* Must be a power of 2 */
9962
9963 /*
9964 ** The block of page numbers associated with the first hash-table in a
9965 ** wal-index is smaller than usual. This is so that there is a complete
9966 ** hash-table on each aligned 32KB page of the wal-index.
9967 */
9968 #define HASHTABLE_NPAGE_ONE (HASHTABLE_NPAGE - (WALINDEX_HDR_SIZE/sizeof(u32)))
9969
9970 /* The wal-index is divided into pages of WALINDEX_PGSZ bytes each. */
9971 #define WALINDEX_PGSZ ( \
9972 sizeof(ht_slot)*HASHTABLE_NSLOT + HASHTABLE_NPAGE*sizeof(u32) \
9973 )
9974
9975 /*
9976 ** Obtain a pointer to the iPage'th page of the wal-index. The wal-index
9977 ** is broken into pages of WALINDEX_PGSZ bytes. Wal-index pages are
9978 ** numbered from zero.
9979 **
9980 ** If this call is successful, *ppPage is set to point to the wal-index
9981 ** page and SQLITE_OK is returned. If an error (an OOM or VFS error) occurs,
9982 ** then an SQLite error code is returned and *ppPage is set to 0.
9983 */
9984 static int walIndexPage(Wal *pWal, int iPage, volatile u32 **ppPage){
9985 int rc = SQLITE_OK;
9986
9987 /* Enlarge the pWal->apWiData[] array if required */
9988 if( pWal->nWiData<=iPage ){
9989 int nByte = sizeof(u32*)*(iPage+1);
9990 volatile u32 **apNew;
9991 apNew = (volatile u32 **)sqlite3_realloc64((void *)pWal->apWiData, nByte);
9992 if( !apNew ){
9993 *ppPage = 0;
9994 return SQLITE_NOMEM_BKPT;
9995 }
9996 memset((void*)&apNew[pWal->nWiData], 0,
9997 sizeof(u32*)*(iPage+1-pWal->nWiData));
9998 pWal->apWiData = apNew;
9999 pWal->nWiData = iPage+1;
10000 }
10001
10002 /* Request a pointer to the required page from the VFS */
10003 if( pWal->apWiData[iPage]==0 ){
10004 if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ){
10005 pWal->apWiData[iPage] = (u32 volatile *)sqlite3MallocZero(WALINDEX_PGSZ);
10006 if( !pWal->apWiData[iPage] ) rc = SQLITE_NOMEM_BKPT;
10007 }else{
10008 rc = sqlite3OsShmMap(pWal->pDbFd, iPage, WALINDEX_PGSZ,
10009 pWal->writeLock, (void volatile **)&pWal->apWiData[iPage]
10010 );
10011 if( rc==SQLITE_READONLY ){
10012 pWal->readOnly |= WAL_SHM_RDONLY;
10013 rc = SQLITE_OK;
10014 }
10015 }
10016 }
10017
10018 *ppPage = pWal->apWiData[iPage];
10019 assert( iPage==0 || *ppPage || rc!=SQLITE_OK );
10020 return rc;
10021 }
10022
10023 /*
10024 ** Return a pointer to the WalCkptInfo structure in the wal-index.
10025 */
10026 static volatile WalCkptInfo *walCkptInfo(Wal *pWal){
10027 assert( pWal->nWiData>0 && pWal->apWiData[0] );
10028 return (volatile WalCkptInfo*)&(pWal->apWiData[0][sizeof(WalIndexHdr)/2]);
10029 }
10030
10031 /*
10032 ** Return a pointer to the WalIndexHdr structure in the wal-index.
10033 */
10034 static volatile WalIndexHdr *walIndexHdr(Wal *pWal){
10035 assert( pWal->nWiData>0 && pWal->apWiData[0] );
10036 return (volatile WalIndexHdr*)pWal->apWiData[0];
10037 }
10038
10039 /*
10040 ** The argument to this macro must be of type u32. On a little-endian
10041 ** architecture, it returns the u32 value that results from interpreting
10042 ** the 4 bytes as a big-endian value. On a big-endian architecture, it
10043 ** returns the value that would be produced by interpreting the 4 bytes
10044 ** of the input value as a little-endian integer.
10045 */
10046 #define BYTESWAP32(x) ( \
10047 (((x)&0x000000FF)<<24) + (((x)&0x0000FF00)<<8) \
10048 + (((x)&0x00FF0000)>>8) + (((x)&0xFF000000)>>24) \
10049 )
10050
10051 /*
10052 ** Generate or extend an 8 byte checksum based on the data in
10053 ** array aByte[] and the initial values of aIn[0] and aIn[1] (or
10054 ** initial values of 0 and 0 if aIn==NULL).
10055 **
10056 ** The checksum is written back into aOut[] before returning.
10057 **
10058 ** nByte must be a positive multiple of 8.
10059 */
10060 static void walChecksumBytes(
10061 int nativeCksum, /* True for native byte-order, false for non-native */
10062 u8 *a, /* Content to be checksummed */
10063 int nByte, /* Bytes of content in a[]. Must be a multiple of 8. */
10064 const u32 *aIn, /* Initial checksum value input */
10065 u32 *aOut /* OUT: Final checksum value output */
10066 ){
10067 u32 s1, s2;
10068 u32 *aData = (u32 *)a;
10069 u32 *aEnd = (u32 *)&a[nByte];
10070
10071 if( aIn ){
10072 s1 = aIn[0];
10073 s2 = aIn[1];
10074 }else{
10075 s1 = s2 = 0;
10076 }
10077
10078 assert( nByte>=8 );
10079 assert( (nByte&0x00000007)==0 );
10080
10081 if( nativeCksum ){
10082 do {
10083 s1 += *aData++ + s2;
10084 s2 += *aData++ + s1;
10085 }while( aData<aEnd );
10086 }else{
10087 do {
10088 s1 += BYTESWAP32(aData[0]) + s2;
10089 s2 += BYTESWAP32(aData[1]) + s1;
10090 aData += 2;
10091 }while( aData<aEnd );
10092 }
10093
10094 aOut[0] = s1;
10095 aOut[1] = s2;
10096 }
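Review note: the running checksum above can be reproduced in a standalone form to see how it chains across buffers. demoSwap32() and demoChecksum() are hypothetical names. Unlike the original, this sketch loads words with memcpy() rather than casting the byte pointer, which is an assumption made here for alignment safety and not something the source does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit word (same effect as BYTESWAP32). */
static uint32_t demoSwap32(uint32_t x){
  return ((x&0x000000FFu)<<24) | ((x&0x0000FF00u)<<8)
       | ((x&0x00FF0000u)>>8)  | ((x&0xFF000000u)>>24);
}

/* Extend the checksum (aIn[0],aIn[1]), or (0,0) if aIn is NULL, over
** nByte bytes of a[]. nByte must be a positive multiple of 8.
** aOut may point at the same array as aIn. */
static void demoChecksum(int native, const unsigned char *a, int nByte,
                         const uint32_t *aIn, uint32_t *aOut){
  uint32_t s1 = aIn ? aIn[0] : 0;
  uint32_t s2 = aIn ? aIn[1] : 0;
  int i;
  assert( nByte>=8 && (nByte&7)==0 );
  for(i=0; i<nByte; i+=8){
    uint32_t w0, w1;
    memcpy(&w0, &a[i], 4);
    memcpy(&w1, &a[i+4], 4);
    if( !native ){
      w0 = demoSwap32(w0);
      w1 = demoSwap32(w1);
    }
    s1 += w0 + s2;
    s2 += w1 + s1;
  }
  aOut[0] = s1;
  aOut[1] = s2;
}
```

Feeding the output of one call in as aIn for the next gives the same result as checksumming the concatenated data in one call. That property is what lets each frame checksum cover the WAL header and all prior frames without re-reading them.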
10097
10098 static void walShmBarrier(Wal *pWal){
10099 if( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE ){
10100 sqlite3OsShmBarrier(pWal->pDbFd);
10101 }
10102 }
10103
10104 /*
10105 ** Write the header information in pWal->hdr into the wal-index.
10106 **
10107 ** The checksum on pWal->hdr is updated before it is written.
10108 */
10109 static void walIndexWriteHdr(Wal *pWal){
10110 volatile WalIndexHdr *aHdr = walIndexHdr(pWal);
10111 const int nCksum = offsetof(WalIndexHdr, aCksum);
10112
10113 assert( pWal->writeLock );
10114 pWal->hdr.isInit = 1;
10115 pWal->hdr.iVersion = WALINDEX_MAX_VERSION;
10116 walChecksumBytes(1, (u8*)&pWal->hdr, nCksum, 0, pWal->hdr.aCksum);
10117 memcpy((void*)&aHdr[1], (const void*)&pWal->hdr, sizeof(WalIndexHdr));
10118 walShmBarrier(pWal);
10119 memcpy((void*)&aHdr[0], (const void*)&pWal->hdr, sizeof(WalIndexHdr));
10120 }
10121
10122 /*
10123 ** This function encodes a single frame header and writes it to a buffer
10124 ** supplied by the caller. A frame-header is made up of a series of
10125 ** 4-byte big-endian integers, as follows:
10126 **
10127 ** 0: Page number.
10128 ** 4: For commit records, the size of the database image in pages
10129 ** after the commit. For all other records, zero.
10130 ** 8: Salt-1 (copied from the wal-header)
10131 ** 12: Salt-2 (copied from the wal-header)
10132 ** 16: Checksum-1.
10133 ** 20: Checksum-2.
10134 */
10135 static void walEncodeFrame(
10136 Wal *pWal, /* The write-ahead log */
10137 u32 iPage, /* Database page number for frame */
10138 u32 nTruncate, /* New db size (or 0 for non-commit frames) */
10139 u8 *aData, /* Pointer to page data */
10140 u8 *aFrame /* OUT: Write encoded frame here */
10141 ){
10142 int nativeCksum; /* True for native byte-order checksums */
10143 u32 *aCksum = pWal->hdr.aFrameCksum;
10144 assert( WAL_FRAME_HDRSIZE==24 );
10145 sqlite3Put4byte(&aFrame[0], iPage);
10146 sqlite3Put4byte(&aFrame[4], nTruncate);
10147 if( pWal->iReCksum==0 ){
10148 memcpy(&aFrame[8], pWal->hdr.aSalt, 8);
10149
10150 nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
10151 walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
10152 walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);
10153
10154 sqlite3Put4byte(&aFrame[16], aCksum[0]);
10155 sqlite3Put4byte(&aFrame[20], aCksum[1]);
10156 }else{
10157 memset(&aFrame[8], 0, 16);
10158 }
10159 }
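Review note: the 24-byte frame-header layout listed above can be exercised with a small standalone encoder. demoPut4(), demoGet4() and demoEncodeHdr() are hypothetical stand-ins; the sketch fills only the first 16 bytes and leaves the two checksum words zero, so it corresponds to the header before checksums are computed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v at p[0..3] as a big-endian 32-bit integer. */
static void demoPut4(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/* Read a big-endian 32-bit integer from p[0..3]. */
static uint32_t demoGet4(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Write page number, db size and salt into a 24-byte header buffer.
** Bytes 16..23 (the checksum words) are left as zero. */
static void demoEncodeHdr(unsigned char *aFrame, uint32_t pgno,
                          uint32_t nTruncate, const unsigned char *aSalt){
  assert( pgno>0 );
  memset(aFrame, 0, 24);
  demoPut4(&aFrame[0], pgno);
  demoPut4(&aFrame[4], nTruncate);
  memcpy(&aFrame[8], aSalt, 8);   /* salt bytes are copied verbatim */
}
```

A non-zero value at offset 4 marks a commit frame; a decoder can recover both fields with demoGet4() regardless of host byte order.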
10160
10161 /*
10162 ** Check to see if the frame with header in aFrame[] and content
10163 ** in aData[] is valid. If it is a valid frame, fill *piPage and
10164 ** *pnTruncate and return true. Return false if the frame is not valid.
10165 */
10166 static int walDecodeFrame(
10167 Wal *pWal, /* The write-ahead log */
10168 u32 *piPage, /* OUT: Database page number for frame */
10169 u32 *pnTruncate, /* OUT: New db size (or 0 if not commit) */
10170 u8 *aData, /* Pointer to page data (for checksum) */
10171 u8 *aFrame /* Frame data */
10172 ){
10173 int nativeCksum; /* True for native byte-order checksums */
10174 u32 *aCksum = pWal->hdr.aFrameCksum;
10175 u32 pgno; /* Page number of the frame */
10176 assert( WAL_FRAME_HDRSIZE==24 );
10177
10178 /* A frame is only valid if the salt values in the frame-header
10179 ** match the salt values in the wal-header.
10180 */
10181 if( memcmp(&pWal->hdr.aSalt, &aFrame[8], 8)!=0 ){
10182 return 0;
10183 }
10184
10185 /* A frame is only valid if the page number is greater than zero.
10186 */
10187 pgno = sqlite3Get4byte(&aFrame[0]);
10188 if( pgno==0 ){
10189 return 0;
10190 }
10191
10192 /* A frame is only valid if a checksum of the WAL header,
10193 ** all prior frames, the first 16 bytes of this frame-header,
10194 ** and the frame-data matches the checksum in the last 8
10195 ** bytes of this frame-header.
10196 */
10197 nativeCksum = (pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN);
10198 walChecksumBytes(nativeCksum, aFrame, 8, aCksum, aCksum);
10199 walChecksumBytes(nativeCksum, aData, pWal->szPage, aCksum, aCksum);
10200 if( aCksum[0]!=sqlite3Get4byte(&aFrame[16])
10201 || aCksum[1]!=sqlite3Get4byte(&aFrame[20])
10202 ){
10203 /* Checksum failed. */
10204 return 0;
10205 }
10206
10207 /* If we reach this point, the frame is valid. Return the page number
10208 ** and the new database size.
10209 */
10210 *piPage = pgno;
10211 *pnTruncate = sqlite3Get4byte(&aFrame[4]);
10212 return 1;
10213 }
10214
10215
10216 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
10217 /*
10218 ** Names of locks. This routine is used to provide debugging output and is not
10219 ** a part of an ordinary build.
10220 */
10221 static const char *walLockName(int lockIdx){
10222 if( lockIdx==WAL_WRITE_LOCK ){
10223 return "WRITE-LOCK";
10224 }else if( lockIdx==WAL_CKPT_LOCK ){
10225 return "CKPT-LOCK";
10226 }else if( lockIdx==WAL_RECOVER_LOCK ){
10227 return "RECOVER-LOCK";
10228 }else{
10229 static char zName[15];
10230 sqlite3_snprintf(sizeof(zName), zName, "READ-LOCK[%d]",
10231 lockIdx-WAL_READ_LOCK(0));
10232 return zName;
10233 }
10234 }
10235 #endif /* defined(SQLITE_TEST) && defined(SQLITE_DEBUG) */
10236
10237
10238 /*
10239 ** Set or release locks on the WAL. Locks are either shared or exclusive.
10240 ** A lock cannot be moved directly between shared and exclusive - it must go
10241 ** through the unlocked state first.
10242 **
10243 ** In locking_mode=EXCLUSIVE, all of these routines become no-ops.
10244 */
10245 static int walLockShared(Wal *pWal, int lockIdx){
10246 int rc;
10247 if( pWal->exclusiveMode ) return SQLITE_OK;
10248 rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
10249 SQLITE_SHM_LOCK | SQLITE_SHM_SHARED);
10250 WALTRACE(("WAL%p: acquire SHARED-%s %s\n", pWal,
10251 walLockName(lockIdx), rc ? "failed" : "ok"));
10252 VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
10253 return rc;
10254 }
10255 static void walUnlockShared(Wal *pWal, int lockIdx){
10256 if( pWal->exclusiveMode ) return;
10257 (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, 1,
10258 SQLITE_SHM_UNLOCK | SQLITE_SHM_SHARED);
10259 WALTRACE(("WAL%p: release SHARED-%s\n", pWal, walLockName(lockIdx)));
10260 }
10261 static int walLockExclusive(Wal *pWal, int lockIdx, int n){
10262 int rc;
10263 if( pWal->exclusiveMode ) return SQLITE_OK;
10264 rc = sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
10265 SQLITE_SHM_LOCK | SQLITE_SHM_EXCLUSIVE);
10266 WALTRACE(("WAL%p: acquire EXCLUSIVE-%s cnt=%d %s\n", pWal,
10267 walLockName(lockIdx), n, rc ? "failed" : "ok"));
10268 VVA_ONLY( pWal->lockError = (u8)(rc!=SQLITE_OK && rc!=SQLITE_BUSY); )
10269 return rc;
10270 }
10271 static void walUnlockExclusive(Wal *pWal, int lockIdx, int n){
10272 if( pWal->exclusiveMode ) return;
10273 (void)sqlite3OsShmLock(pWal->pDbFd, lockIdx, n,
10274 SQLITE_SHM_UNLOCK | SQLITE_SHM_EXCLUSIVE);
10275 WALTRACE(("WAL%p: release EXCLUSIVE-%s cnt=%d\n", pWal,
10276 walLockName(lockIdx), n));
10277 }
10278
10279 /*
10280 ** Compute a hash on a page number. The resulting hash value must land
10281 ** between 0 and (HASHTABLE_NSLOT-1). The walNextHash() function advances
10282 ** the hash to the next value in the event of a collision.
10283 */
10284 static int walHash(u32 iPage){
10285 assert( iPage>0 );
10286 assert( (HASHTABLE_NSLOT & (HASHTABLE_NSLOT-1))==0 );
10287 return (iPage*HASHTABLE_HASH_1) & (HASHTABLE_NSLOT-1);
10288 }
10289 static int walNextHash(int iPriorHash){
10290 return (iPriorHash+1)&(HASHTABLE_NSLOT-1);
10291 }
10292
10293 /*
10294 ** Return pointers to the hash table and page number array stored on
10295 ** page iHash of the wal-index. The wal-index is broken into 32KB pages
10296 ** numbered starting from 0.
10297 **
10298 ** Set output variable *paHash to point to the start of the hash table
10299 ** in the wal-index file. Set *piZero to one less than the frame
10300 ** number of the first frame indexed by this hash table. If a
10301 ** slot in the hash table is set to N, it refers to frame number
10302 ** (*piZero+N) in the log.
10303 **
10304 ** Finally, set *paPgno so that (*paPgno)[1] is the page number of the
10305 ** first frame indexed by the hash table, frame (*piZero+1).
10306 */
10307 static int walHashGet(
10308 Wal *pWal, /* WAL handle */
10309 int iHash, /* Find the iHash'th table */
10310 volatile ht_slot **paHash, /* OUT: Pointer to hash index */
10311 volatile u32 **paPgno, /* OUT: Pointer to page number array */
10312 u32 *piZero /* OUT: Frame associated with *paPgno[0] */
10313 ){
10314 int rc; /* Return code */
10315 volatile u32 *aPgno;
10316
10317 rc = walIndexPage(pWal, iHash, &aPgno);
10318 assert( rc==SQLITE_OK || iHash>0 );
10319
10320 if( rc==SQLITE_OK ){
10321 u32 iZero;
10322 volatile ht_slot *aHash;
10323
10324 aHash = (volatile ht_slot *)&aPgno[HASHTABLE_NPAGE];
10325 if( iHash==0 ){
10326 aPgno = &aPgno[WALINDEX_HDR_SIZE/sizeof(u32)];
10327 iZero = 0;
10328 }else{
10329 iZero = HASHTABLE_NPAGE_ONE + (iHash-1)*HASHTABLE_NPAGE;
10330 }
10331
10332 *paPgno = &aPgno[-1];
10333 *paHash = aHash;
10334 *piZero = iZero;
10335 }
10336 return rc;
10337 }
10338
10339 /*
10340 ** Return the number of the wal-index page that contains the hash-table
10341 ** and page-number array that contain entries corresponding to WAL frame
10342 ** iFrame. The wal-index is broken up into 32KB pages. Wal-index pages
10343 ** are numbered starting from 0.
10344 */
10345 static int walFramePage(u32 iFrame){
10346 int iHash = (iFrame+HASHTABLE_NPAGE-HASHTABLE_NPAGE_ONE-1) / HASHTABLE_NPAGE;
10347 assert( (iHash==0 || iFrame>HASHTABLE_NPAGE_ONE)
10348 && (iHash>=1 || iFrame<=HASHTABLE_NPAGE_ONE)
10349 && (iHash<=1 || iFrame>(HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE))
10350 && (iHash>=2 || iFrame<=HASHTABLE_NPAGE_ONE+HASHTABLE_NPAGE)
10351 && (iHash<=2 || iFrame>(HASHTABLE_NPAGE_ONE+2*HASHTABLE_NPAGE))
10352 );
10353 return iHash;
10354 }
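Review note: the frame-to-block mapping in walFramePage() is easier to follow with concrete numbers. The sketch below hard-codes a 136-byte wal-index header (the size quoted earlier in this file), which gives a first block of 4062 entries. demoFramePage() and demoFrameSlot() are hypothetical names; the second helper applies the same iZero rule as walHashGet().

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NPAGE     4096                            /* HASHTABLE_NPAGE */
#define DEMO_HDR_SIZE  136                             /* assumed WALINDEX_HDR_SIZE */
#define DEMO_NPAGE_ONE (DEMO_NPAGE - DEMO_HDR_SIZE/4)  /* 4062 */

/* Index of the wal-index page (block) that covers frame iFrame. */
static int demoFramePage(uint32_t iFrame){
  assert( iFrame>0 );
  return (int)((iFrame + DEMO_NPAGE - DEMO_NPAGE_ONE - 1) / DEMO_NPAGE);
}

/* 1-based position of iFrame inside its block's page-mapping array. */
static uint32_t demoFrameSlot(uint32_t iFrame){
  int iHash = demoFramePage(iFrame);
  uint32_t iZero = 0;
  if( iHash>0 ){
    iZero = DEMO_NPAGE_ONE + (uint32_t)(iHash-1)*DEMO_NPAGE;
  }
  return iFrame - iZero;
}
```

Frames 1 through 4062 land in block 0, frames 4063 through 8158 in block 1, and frame 8159 is the first entry of block 2. The slot value is what a hash-table entry for that frame would hold.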
10355
10356 /*
10357 ** Return the page number associated with frame iFrame in this WAL.
10358 */
10359 static u32 walFramePgno(Wal *pWal, u32 iFrame){
10360 int iHash = walFramePage(iFrame);
10361 if( iHash==0 ){
10362 return pWal->apWiData[0][WALINDEX_HDR_SIZE/sizeof(u32) + iFrame - 1];
10363 }
10364 return pWal->apWiData[iHash][(iFrame-1-HASHTABLE_NPAGE_ONE)%HASHTABLE_NPAGE];
10365 }
10366
10367 /*
10368 ** Remove entries from the hash table that point to WAL slots greater
10369 ** than pWal->hdr.mxFrame.
10370 **
10371 ** This function is called whenever pWal->hdr.mxFrame is decreased due
10372 ** to a rollback or savepoint.
10373 **
10374 ** At most only the hash table containing pWal->hdr.mxFrame needs to be
10375 ** updated. Any later hash tables will be automatically cleared when
10376 ** pWal->hdr.mxFrame advances to the point where those hash tables are
10377 ** actually needed.
10378 */
10379 static void walCleanupHash(Wal *pWal){
10380 volatile ht_slot *aHash = 0; /* Pointer to hash table to clear */
10381 volatile u32 *aPgno = 0; /* Page number array for hash table */
10382 u32 iZero = 0; /* frame == (aHash[x]+iZero) */
10383 int iLimit = 0; /* Zero values greater than this */
10384 int nByte; /* Number of bytes to zero in aPgno[] */
10385 int i; /* Used to iterate through aHash[] */
10386
10387 assert( pWal->writeLock );
10388 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE-1 );
10389 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE );
10390 testcase( pWal->hdr.mxFrame==HASHTABLE_NPAGE_ONE+1 );
10391
10392 if( pWal->hdr.mxFrame==0 ) return;
10393
10394 /* Obtain pointers to the hash-table and page-number array containing
10395 ** the entry that corresponds to frame pWal->hdr.mxFrame. It is guaranteed
10396 ** that the page holding that hash-table and array is already mapped.
10397 */
10398 assert( pWal->nWiData>walFramePage(pWal->hdr.mxFrame) );
10399 assert( pWal->apWiData[walFramePage(pWal->hdr.mxFrame)] );
10400 walHashGet(pWal, walFramePage(pWal->hdr.mxFrame), &aHash, &aPgno, &iZero);
10401
10402 /* Zero all hash-table entries that correspond to frame numbers greater
10403 ** than pWal->hdr.mxFrame.
10404 */
10405 iLimit = pWal->hdr.mxFrame - iZero;
10406 assert( iLimit>0 );
10407 for(i=0; i<HASHTABLE_NSLOT; i++){
10408 if( aHash[i]>iLimit ){
10409 aHash[i] = 0;
10410 }
10411 }
10412
10413 /* Zero the entries in the aPgno array that correspond to frames with
10414 ** frame numbers greater than pWal->hdr.mxFrame.
10415 */
10416 nByte = (int)((char *)aHash - (char *)&aPgno[iLimit+1]);
10417 memset((void *)&aPgno[iLimit+1], 0, nByte);
10418
10419 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
10420 /* Verify that every entry in the mapping region is still reachable
10421 ** via the hash table even after the cleanup.
10422 */
10423 if( iLimit ){
10424 int j; /* Loop counter */
10425 int iKey; /* Hash key */
10426 for(j=1; j<=iLimit; j++){
10427 for(iKey=walHash(aPgno[j]); aHash[iKey]; iKey=walNextHash(iKey)){
10428 if( aHash[iKey]==j ) break;
10429 }
10430 assert( aHash[iKey]==j );
10431 }
10432 }
10433 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
10434 }
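The cleanup step amounts to two passes: clear every hash slot that points past the new limit, then zero the tail of the page-number array. A minimal sketch on a tiny table, where the `DemoTable` layout, the sizes, and `demo_cleanup()` are all hypothetical stand-ins for the real wal-index structures:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NSLOT 16            /* hash slots (hypothetical, tiny) */
#define DEMO_NPGNO 8             /* page-number entries, 1-based */

typedef struct DemoTable {
  uint32_t aPgno[DEMO_NPGNO+1];  /* aPgno[0] unused; aPgno[i] = page in frame i */
  uint16_t aHash[DEMO_NSLOT];    /* 0 = empty, else index into aPgno[] */
} DemoTable;

/* Forget every frame with index greater than iLimit. */
static void demo_cleanup(DemoTable *p, int iLimit){
  int i;
  assert( iLimit>=0 && iLimit<=DEMO_NPGNO );
  /* Pass 1: empty the slots that refer to discarded frames. */
  for(i=0; i<DEMO_NSLOT; i++){
    if( p->aHash[i]>iLimit ) p->aHash[i] = 0;
  }
  /* Pass 2: zero the page numbers of the discarded frames. */
  memset(&p->aPgno[iLimit+1], 0, sizeof(uint32_t)*(DEMO_NPGNO-iLimit));
}
```

Emptying slots in a linear-probing table is safe here because only the most recently added entries are removed: an older entry's probe chain passes only through slots that were occupied before it was inserted.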
10435
10436
10437 /*
10438 ** Set an entry in the wal-index that will map database page number
10439 ** iPage into WAL frame iFrame.
10440 */
10441 static int walIndexAppend(Wal *pWal, u32 iFrame, u32 iPage){
10442 int rc; /* Return code */
10443 u32 iZero = 0; /* One less than frame number of aPgno[1] */
10444 volatile u32 *aPgno = 0; /* Page number array */
10445 volatile ht_slot *aHash = 0; /* Hash table */
10446
10447 rc = walHashGet(pWal, walFramePage(iFrame), &aHash, &aPgno, &iZero);
10448
10449 /* Assuming the wal-index file was successfully mapped, populate the
10450 ** page number array and hash table entry.
10451 */
10452 if( rc==SQLITE_OK ){
10453 int iKey; /* Hash table key */
10454 int idx; /* Value to write to hash-table slot */
10455 int nCollide; /* Number of hash collisions */
10456
10457 idx = iFrame - iZero;
10458 assert( idx <= HASHTABLE_NSLOT/2 + 1 );
10459
10460 /* If this is the first entry to be added to this hash-table, zero the
10461 ** entire hash table and aPgno[] array before proceeding.
10462 */
10463 if( idx==1 ){
10464 int nByte = (int)((u8 *)&aHash[HASHTABLE_NSLOT] - (u8 *)&aPgno[1]);
10465 memset((void*)&aPgno[1], 0, nByte);
10466 }
10467
10468 /* If the entry in aPgno[] is already set, then the previous writer
10469 ** must have exited unexpectedly in the middle of a transaction (after
10470 ** writing one or more dirty pages to the WAL to free up memory).
10471 ** Remove the remnants of that writer's uncommitted transaction from
10472 ** the hash-table before writing any new entries.
10473 */
10474 if( aPgno[idx] ){
10475 walCleanupHash(pWal);
10476 assert( !aPgno[idx] );
10477 }
10478
10479 /* Write the aPgno[] array entry and the hash-table slot. */
10480 nCollide = idx;
10481 for(iKey=walHash(iPage); aHash[iKey]; iKey=walNextHash(iKey)){
10482 if( (nCollide--)==0 ) return SQLITE_CORRUPT_BKPT;
10483 }
10484 aPgno[idx] = iPage;
10485 aHash[iKey] = (ht_slot)idx;
10486
10487 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
10488 /* Verify that the number of entries in the hash table exactly equals
10489 ** the number of entries in the mapping region.
10490 */
10491 {
10492 int i; /* Loop counter */
10493 int nEntry = 0; /* Number of entries in the hash table */
10494 for(i=0; i<HASHTABLE_NSLOT; i++){ if( aHash[i] ) nEntry++; }
10495 assert( nEntry==idx );
10496 }
10497
10498 /* Verify that every entry in the mapping region is reachable
10499 ** via the hash table. This turns out to be a really, really expensive
10500 ** thing to check, so only do this occasionally - not on every
10501 ** iteration.
10502 */
10503 if( (idx&0x3ff)==0 ){
10504 int i; /* Loop counter */
10505 for(i=1; i<=idx; i++){
10506 for(iKey=walHash(aPgno[i]); aHash[iKey]; iKey=walNextHash(iKey)){
10507 if( aHash[iKey]==i ) break;
10508 }
10509 assert( aHash[iKey]==i );
10510 }
10511 }
10512 #endif /* SQLITE_ENABLE_EXPENSIVE_ASSERT */
10513 }
10514
10515
10516 return rc;
10517 }
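The append path is an open-addressing insert with linear probing, bounded by a collision counter. The sketch below illustrates that technique on a tiny table. It is not the real wal-index layout: `IdxTable`, the sizes, and the `idx_*` names are hypothetical, and the multiply-by-383 hash and +1 probe step are assumptions about walHash() and walNextHash(), which are defined outside this section.

```c
#include <assert.h>
#include <stdint.h>

#define IDX_NSLOT 16                 /* power of two, twice IDX_NPGNO */
#define IDX_NPGNO 8

typedef struct IdxTable {
  uint32_t aPgno[IDX_NPGNO+1];       /* 1-based: page stored in frame i */
  uint16_t aHash[IDX_NSLOT];         /* 0 = empty slot */
} IdxTable;

/* Assumed hash: multiply by a prime and mask; the probe step is +1. */
static int idx_hash(uint32_t pgno){ return (int)((pgno*383u) & (IDX_NSLOT-1)); }
static int idx_next(int iKey){ return (iKey+1) & (IDX_NSLOT-1); }

/* Record that frame idx (1-based) holds page pgno. Returns 0 on success,
** or -1 if the probe sequence runs longer than idx steps, which cannot
** happen in a consistent table holding idx-1 entries. */
static int idx_append(IdxTable *p, int idx, uint32_t pgno){
  int iKey;
  int nCollide = idx;
  assert( idx>=1 && idx<=IDX_NPGNO );
  for(iKey=idx_hash(pgno); p->aHash[iKey]; iKey=idx_next(iKey)){
    if( (nCollide--)==0 ) return -1;
  }
  p->aPgno[idx] = pgno;
  p->aHash[iKey] = (uint16_t)idx;
  return 0;
}

/* Return the largest frame index holding pgno, or 0 if there is none. */
static int idx_find(const IdxTable *p, uint32_t pgno){
  int iKey, best = 0;
  for(iKey=idx_hash(pgno); p->aHash[iKey]; iKey=idx_next(iKey)){
    int i = p->aHash[iKey];
    if( p->aPgno[i]==pgno && i>best ) best = i;
  }
  return best;
}
```

Keeping the table at most half full guarantees that every probe sequence reaches an empty slot, so both loops terminate. The same page may be appended more than once, and a lookup then prefers the newest frame.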
10518
10519
10520 /*
10521 ** Recover the wal-index by reading the write-ahead log file.
10522 **
10523 ** This routine first tries to establish an exclusive lock on the
10524 ** wal-index to prevent other threads/processes from doing anything
10525 ** with the WAL or wal-index while recovery is running. The
10526 ** WAL_RECOVER_LOCK is also held so that other threads will know
10527 ** that this thread is running recovery. If unable to establish
10528 ** the necessary locks, this routine returns SQLITE_BUSY.
10529 */
10530 static int walIndexRecover(Wal *pWal){
10531 int rc; /* Return Code */
10532 i64 nSize; /* Size of log file */
10533 u32 aFrameCksum[2] = {0, 0};
10534 int iLock; /* Lock offset to lock for checkpoint */
10535 int nLock; /* Number of locks to hold */
10536
10537 /* Obtain an exclusive lock on all bytes in the locking range not already
10538 ** locked by the caller. The caller is guaranteed to have locked the
10539 ** WAL_WRITE_LOCK byte, and may have also locked the WAL_CKPT_LOCK byte.
10540 ** If successful, the same bytes that are locked here are unlocked before
10541 ** this function returns.
10542 */
10543 assert( pWal->ckptLock==1 || pWal->ckptLock==0 );
10544 assert( WAL_ALL_BUT_WRITE==WAL_WRITE_LOCK+1 );
10545 assert( WAL_CKPT_LOCK==WAL_ALL_BUT_WRITE );
10546 assert( pWal->writeLock );
10547 iLock = WAL_ALL_BUT_WRITE + pWal->ckptLock;
10548 nLock = SQLITE_SHM_NLOCK - iLock;
10549 rc = walLockExclusive(pWal, iLock, nLock);
10550 if( rc ){
10551 return rc;
10552 }
10553 WALTRACE(("WAL%p: recovery begin...\n", pWal));
10554
10555 memset(&pWal->hdr, 0, sizeof(WalIndexHdr));
10556
10557 rc = sqlite3OsFileSize(pWal->pWalFd, &nSize);
10558 if( rc!=SQLITE_OK ){
10559 goto recovery_error;
10560 }
10561
10562 if( nSize>WAL_HDRSIZE ){
10563 u8 aBuf[WAL_HDRSIZE]; /* Buffer to load WAL header into */
10564 u8 *aFrame = 0; /* Malloc'd buffer to load entire frame */
10565 int szFrame; /* Number of bytes in buffer aFrame[] */
10566 u8 *aData; /* Pointer to data part of aFrame buffer */
10567 int iFrame; /* Index of last frame read */
10568 i64 iOffset; /* Next offset to read from log file */
10569 int szPage; /* Page size according to the log */
10570 u32 magic; /* Magic value read from WAL header */
10571 u32 version; /* Version number read from WAL header */
10572 int isValid; /* True if this frame is valid */
10573
10574 /* Read in the WAL header. */
10575 rc = sqlite3OsRead(pWal->pWalFd, aBuf, WAL_HDRSIZE, 0);
10576 if( rc!=SQLITE_OK ){
10577 goto recovery_error;
10578 }
10579
10580 /* If the database page size is not a power of two, or is greater than
10581 ** SQLITE_MAX_PAGE_SIZE, conclude that the WAL file contains no valid
10582 ** data. Similarly, if the 'magic' value is invalid, ignore the whole
10583 ** WAL file.
10584 */
10585 magic = sqlite3Get4byte(&aBuf[0]);
10586 szPage = sqlite3Get4byte(&aBuf[8]);
10587 if( (magic&0xFFFFFFFE)!=WAL_MAGIC
10588 || szPage&(szPage-1)
10589 || szPage>SQLITE_MAX_PAGE_SIZE
10590 || szPage<512
10591 ){
10592 goto finished;
10593 }
10594 pWal->hdr.bigEndCksum = (u8)(magic&0x00000001);
10595 pWal->szPage = szPage;
10596 pWal->nCkpt = sqlite3Get4byte(&aBuf[12]);
10597 memcpy(&pWal->hdr.aSalt, &aBuf[16], 8);
10598
10599 /* Verify that the WAL header checksum is correct */
10600 walChecksumBytes(pWal->hdr.bigEndCksum==SQLITE_BIGENDIAN,
10601 aBuf, WAL_HDRSIZE-2*4, 0, pWal->hdr.aFrameCksum
10602 );
10603 if( pWal->hdr.aFrameCksum[0]!=sqlite3Get4byte(&aBuf[24])
10604 || pWal->hdr.aFrameCksum[1]!=sqlite3Get4byte(&aBuf[28])
10605 ){
10606 goto finished;
10607 }
10608
10609 /* Verify that the version number on the WAL format is one that
10610 ** we are able to understand */
10611 version = sqlite3Get4byte(&aBuf[4]);
10612 if( version!=WAL_MAX_VERSION ){
10613 rc = SQLITE_CANTOPEN_BKPT;
10614 goto finished;
10615 }
10616
10617 /* Malloc a buffer to read frames into. */
10618 szFrame = szPage + WAL_FRAME_HDRSIZE;
10619 aFrame = (u8 *)sqlite3_malloc64(szFrame);
10620 if( !aFrame ){
10621 rc = SQLITE_NOMEM_BKPT;
10622 goto recovery_error;
10623 }
10624 aData = &aFrame[WAL_FRAME_HDRSIZE];
10625
10626 /* Read all frames from the log file. */
10627 iFrame = 0;
10628 for(iOffset=WAL_HDRSIZE; (iOffset+szFrame)<=nSize; iOffset+=szFrame){
10629 u32 pgno; /* Database page number for frame */
10630 u32 nTruncate; /* dbsize field from frame header */
10631
10632 /* Read and decode the next log frame. */
10633 iFrame++;
10634 rc = sqlite3OsRead(pWal->pWalFd, aFrame, szFrame, iOffset);
10635 if( rc!=SQLITE_OK ) break;
10636 isValid = walDecodeFrame(pWal, &pgno, &nTruncate, aData, aFrame);
10637 if( !isValid ) break;
10638 rc = walIndexAppend(pWal, iFrame, pgno);
10639 if( rc!=SQLITE_OK ) break;
10640
10641 /* If nTruncate is non-zero, this is a commit record. */
10642 if( nTruncate ){
10643 pWal->hdr.mxFrame = iFrame;
10644 pWal->hdr.nPage = nTruncate;
10645 pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16));
10646 testcase( szPage<=32768 );
10647 testcase( szPage>=65536 );
10648 aFrameCksum[0] = pWal->hdr.aFrameCksum[0];
10649 aFrameCksum[1] = pWal->hdr.aFrameCksum[1];
10650 }
10651 }
10652
10653 sqlite3_free(aFrame);
10654 }
10655
10656 finished:
10657 if( rc==SQLITE_OK ){
10658 volatile WalCkptInfo *pInfo;
10659 int i;
10660 pWal->hdr.aFrameCksum[0] = aFrameCksum[0];
10661 pWal->hdr.aFrameCksum[1] = aFrameCksum[1];
10662 walIndexWriteHdr(pWal);
10663
10664 /* Reset the checkpoint-header. This is safe because this thread is
10665 ** currently holding locks that exclude all other readers, writers and
10666 ** checkpointers.
10667 */
10668 pInfo = walCkptInfo(pWal);
10669 pInfo->nBackfill = 0;
10670 pInfo->nBackfillAttempted = pWal->hdr.mxFrame;
10671 pInfo->aReadMark[0] = 0;
10672 for(i=1; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
10673 if( pWal->hdr.mxFrame ) pInfo->aReadMark[1] = pWal->hdr.mxFrame;
10674
10675 /* If at least one frame was recovered from the log file, report an
10676 ** event via sqlite3_log(). This is to help with identifying performance
10677 ** problems caused by applications routinely shutting down without
10678 ** checkpointing the log file.
10679 */
10680 if( pWal->hdr.nPage ){
10681 sqlite3_log(SQLITE_NOTICE_RECOVER_WAL,
10682 "recovered %d frames from WAL file %s",
10683 pWal->hdr.mxFrame, pWal->zWalName
10684 );
10685 }
10686 }
10687
10688 recovery_error:
10689 WALTRACE(("WAL%p: recovery %s\n", pWal, rc ? "failed" : "ok"));
10690 walUnlockExclusive(pWal, iLock, nLock);
10691 return rc;
10692 }
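The first gate in recovery is a sanity check on the WAL header: the magic number (ignoring its low bit, which selects the checksum byte order) and a page size that is a power of two in the permitted range. A minimal sketch of just that check follows. The `demo_*` names are hypothetical, and the constants 0x377f0682 and 65536 are assumed values for WAL_MAGIC and SQLITE_MAX_PAGE_SIZE, which are defined outside this section.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WAL_MAGIC     0x377f0682u   /* assumed value of WAL_MAGIC */
#define DEMO_MAX_PAGE_SIZE 65536         /* assumed SQLITE_MAX_PAGE_SIZE */

/* Decode a 4-byte big-endian integer. */
static uint32_t demo_get4byte(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  |  (uint32_t)p[3];
}

/* Return 1 if the first 12 bytes of a WAL header carry a plausible magic
** number and page size, else 0. Checksum and version checks are omitted. */
static int demo_wal_header_ok(const unsigned char *aBuf){
  uint32_t magic = demo_get4byte(&aBuf[0]);
  uint32_t szPage = demo_get4byte(&aBuf[8]);
  if( (magic & 0xFFFFFFFEu)!=DEMO_WAL_MAGIC ) return 0;
  if( szPage & (szPage-1) ) return 0;        /* not a power of two */
  if( szPage<512 || szPage>DEMO_MAX_PAGE_SIZE ) return 0;
  return 1;
}
```

A page size of zero passes the bit test but is rejected by the lower bound, which is why both conditions are needed.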
10693
10694 /*
10695 ** Close an open wal-index.
10696 */
10697 static void walIndexClose(Wal *pWal, int isDelete){
10698 if( pWal->exclusiveMode==WAL_HEAPMEMORY_MODE ){
10699 int i;
10700 for(i=0; i<pWal->nWiData; i++){
10701 sqlite3_free((void *)pWal->apWiData[i]);
10702 pWal->apWiData[i] = 0;
10703 }
10704 }else{
10705 sqlite3OsShmUnmap(pWal->pDbFd, isDelete);
10706 }
10707 }
10708
10709 /*
10710 ** Open a connection to the WAL file zWalName. The database file must
10711 ** already be opened on connection pDbFd. The buffer that zWalName points
10712 ** to must remain valid for the lifetime of the returned Wal* handle.
10713 **
10714 ** A SHARED lock should be held on the database file when this function
10715 ** is called. The purpose of this SHARED lock is to prevent any other
10716 ** client from unlinking the WAL or wal-index file. If another process
10717 ** were to do this just after this client opened one of these files, the
10718 ** system would be badly broken.
10719 **
10720 ** If the log file is successfully opened, SQLITE_OK is returned and
10721 ** *ppWal is set to point to a new WAL handle. If an error occurs,
10722 ** an SQLite error code is returned and *ppWal is left unmodified.
10723 */
10724 SQLITE_PRIVATE int sqlite3WalOpen(
10725 sqlite3_vfs *pVfs, /* vfs module to open wal and wal-index */
10726 sqlite3_file *pDbFd, /* The open database file */
10727 const char *zWalName, /* Name of the WAL file */
10728 int bNoShm, /* True to run in heap-memory mode */
10729 i64 mxWalSize, /* Truncate WAL to this size on reset */
10730 Wal **ppWal /* OUT: Allocated Wal handle */
10731 ){
10732 int rc; /* Return Code */
10733 Wal *pRet; /* Object to allocate and return */
10734 int flags; /* Flags passed to OsOpen() */
10735
10736 assert( zWalName && zWalName[0] );
10737 assert( pDbFd );
10738
10739 /* In the amalgamation, the os_unix.c and os_win.c source files come before
10740 ** this source file. Verify that the #defines of the locking byte offsets
10741 ** in os_unix.c and os_win.c agree with the WALINDEX_LOCK_OFFSET value.
10742 ** For that matter, if the lock offset ever changes from its initial design
10743 ** value of 120, we need to know that so there is an assert() to check it.
10744 */
10745 assert( 120==WALINDEX_LOCK_OFFSET );
10746 assert( 136==WALINDEX_HDR_SIZE );
10747 #ifdef WIN_SHM_BASE
10748 assert( WIN_SHM_BASE==WALINDEX_LOCK_OFFSET );
10749 #endif
10750 #ifdef UNIX_SHM_BASE
10751 assert( UNIX_SHM_BASE==WALINDEX_LOCK_OFFSET );
10752 #endif
10753
10754
10755 /* Allocate an instance of struct Wal to return. */
10756 *ppWal = 0;
10757 pRet = (Wal*)sqlite3MallocZero(sizeof(Wal) + pVfs->szOsFile);
10758 if( !pRet ){
10759 return SQLITE_NOMEM_BKPT;
10760 }
10761
10762 pRet->pVfs = pVfs;
10763 pRet->pWalFd = (sqlite3_file *)&pRet[1];
10764 pRet->pDbFd = pDbFd;
10765 pRet->readLock = -1;
10766 pRet->mxWalSize = mxWalSize;
10767 pRet->zWalName = zWalName;
10768 pRet->syncHeader = 1;
10769 pRet->padToSectorBoundary = 1;
10770 pRet->exclusiveMode = (bNoShm ? WAL_HEAPMEMORY_MODE: WAL_NORMAL_MODE);
10771
10772 /* Open file handle on the write-ahead log file. */
10773 flags = (SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|SQLITE_OPEN_WAL);
10774 rc = sqlite3OsOpen(pVfs, zWalName, pRet->pWalFd, flags, &flags);
10775 if( rc==SQLITE_OK && flags&SQLITE_OPEN_READONLY ){
10776 pRet->readOnly = WAL_RDONLY;
10777 }
10778
10779 if( rc!=SQLITE_OK ){
10780 walIndexClose(pRet, 0);
10781 sqlite3OsClose(pRet->pWalFd);
10782 sqlite3_free(pRet);
10783 }else{
10784 int iDC = sqlite3OsDeviceCharacteristics(pDbFd);
10785 if( iDC & SQLITE_IOCAP_SEQUENTIAL ){ pRet->syncHeader = 0; }
10786 if( iDC & SQLITE_IOCAP_POWERSAFE_OVERWRITE ){
10787 pRet->padToSectorBoundary = 0;
10788 }
10789 *ppWal = pRet;
10790 WALTRACE(("WAL%p: opened\n", pRet));
10791 }
10792 return rc;
10793 }
10794
10795 /*
10796 ** Change the size to which the WAL file is truncated on each reset.
10797 */
10798 SQLITE_PRIVATE void sqlite3WalLimit(Wal *pWal, i64 iLimit){
10799 if( pWal ) pWal->mxWalSize = iLimit;
10800 }
10801
10802 /*
10803 ** Find the smallest page number out of all pages held in the WAL that
10804 ** has not been returned by any prior invocation of this method on the
10805 ** same WalIterator object. Write into *piFrame the frame index where
10806 ** that page was last written into the WAL. Write into *piPage the page
10807 ** number.
10808 **
10809 ** Return 0 on success. If there are no pages in the WAL with a page
10810 ** number larger than *piPage, then return 1.
10811 */
10812 static int walIteratorNext(
10813 WalIterator *p, /* Iterator */
10814 u32 *piPage, /* OUT: The page number of the next page */
10815 u32 *piFrame /* OUT: Wal frame index of next page */
10816 ){
10817 u32 iMin; /* Result pgno must be greater than iMin */
10818 u32 iRet = 0xFFFFFFFF; /* 0xffffffff is never a valid page number */
10819 int i; /* For looping through segments */
10820
10821 iMin = p->iPrior;
10822 assert( iMin<0xffffffff );
10823 for(i=p->nSegment-1; i>=0; i--){
10824 struct WalSegment *pSegment = &p->aSegment[i];
10825 while( pSegment->iNext<pSegment->nEntry ){
10826 u32 iPg = pSegment->aPgno[pSegment->aIndex[pSegment->iNext]];
10827 if( iPg>iMin ){
10828 if( iPg<iRet ){
10829 iRet = iPg;
10830 *piFrame = pSegment->iZero + pSegment->aIndex[pSegment->iNext];
10831 }
10832 break;
10833 }
10834 pSegment->iNext++;
10835 }
10836 }
10837
10838 *piPage = p->iPrior = iRet;
10839 return (iRet==0xFFFFFFFF);
10840 }
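The iterator step is a multi-way "next smallest key" scan in which the newest segment wins ties. The sketch below shows that idea with a simplified layout: each hypothetical `DemoSeg` stores its pages already sorted, together with the frame that holds each page, instead of going through a separate index array as WalSegment does.

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoSeg {
  const uint32_t *aPage;   /* page numbers, strictly ascending */
  const uint32_t *aFrame;  /* aFrame[k] is the frame holding aPage[k] */
  int nEntry;              /* entries in aPage[] and aFrame[] */
  int iNext;               /* cursor into aPage[] */
} DemoSeg;

typedef struct DemoIter {
  DemoSeg *aSeg;           /* oldest segment first */
  int nSeg;
  uint32_t iPrior;         /* last page returned (0 initially) */
} DemoIter;

/* Return 0 and fill the outputs for the next page in ascending order,
** or return 1 once every page has been returned. */
static int demo_iter_next(DemoIter *p, uint32_t *piPage, uint32_t *piFrame){
  uint32_t iMin = p->iPrior;
  uint32_t iRet = 0xFFFFFFFFu;
  int i;
  for(i=p->nSeg-1; i>=0; i--){          /* newest segment first */
    DemoSeg *s = &p->aSeg[i];
    while( s->iNext<s->nEntry ){
      uint32_t iPg = s->aPage[s->iNext];
      if( iPg>iMin ){
        /* Strict "<" means an older segment never replaces a newer one. */
        if( iPg<iRet ){ iRet = iPg; *piFrame = s->aFrame[s->iNext]; }
        break;
      }
      s->iNext++;                       /* skip pages already returned */
    }
  }
  *piPage = p->iPrior = iRet;
  return iRet==0xFFFFFFFFu;
}
```

With an old segment holding pages {2,5} in frames {1,2} and a newer one holding pages {3,5} in frames {7,9}, successive calls yield (2,1), (3,7), (5,9): page 5 is reported once, from the newer frame.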
10841
10842 /*
10843 ** This function merges two sorted lists into a single sorted list.
10844 **
10845 ** aLeft[] and aRight[] are arrays of indices. The sort key is
10846 ** aContent[aLeft[]] and aContent[aRight[]]. Upon entry, the following
10847 ** is guaranteed for all J<K:
10848 **
10849 ** aContent[aLeft[J]] < aContent[aLeft[K]]
10850 ** aContent[aRight[J]] < aContent[aRight[K]]
10851 **
10852 ** This routine overwrites aRight[] with a new (probably longer) sequence
10853 ** of indices such that the aRight[] contains every index that appears in
10854 ** either aLeft[] or the old aRight[] and such that the second condition
10855 ** above is still met.
10856 **
10857 ** The aContent[aLeft[X]] values will be unique for all X. And the
10858 ** aContent[aRight[X]] values will be unique too. But there might be
10859 ** one or more combinations of X and Y such that
10860 **
10861 ** aLeft[X]!=aRight[Y] && aContent[aLeft[X]] == aContent[aRight[Y]]
10862 **
10863 ** When that happens, omit the aLeft[X] and use the aRight[Y] index.
10864 */
10865 static void walMerge(
10866 const u32 *aContent, /* Pages in wal - keys for the sort */
10867 ht_slot *aLeft, /* IN: Left hand input list */
10868 int nLeft, /* IN: Elements in array aLeft */
10869 ht_slot **paRight, /* IN/OUT: Right hand input list */
10870 int *pnRight, /* IN/OUT: Elements in *paRight */
10871 ht_slot *aTmp /* Temporary buffer */
10872 ){
10873 int iLeft = 0; /* Current index in aLeft */
10874 int iRight = 0; /* Current index in aRight */
10875 int iOut = 0; /* Current index in output buffer */
10876 int nRight = *pnRight;
10877 ht_slot *aRight = *paRight;
10878
10879 assert( nLeft>0 && nRight>0 );
10880 while( iRight<nRight || iLeft<nLeft ){
10881 ht_slot logpage;
10882 Pgno dbpage;
10883
10884 if( (iLeft<nLeft)
10885 && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
10886 ){
10887 logpage = aLeft[iLeft++];
10888 }else{
10889 logpage = aRight[iRight++];
10890 }
10891 dbpage = aContent[logpage];
10892
10893 aTmp[iOut++] = logpage;
10894 if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;
10895
10896 assert( iLeft>=nLeft || aContent[aLeft[iLeft]]>dbpage );
10897 assert( iRight>=nRight || aContent[aRight[iRight]]>dbpage );
10898 }
10899
10900 *paRight = aLeft;
10901 *pnRight = iOut;
10902 memcpy(aLeft, aTmp, sizeof(aTmp[0])*iOut);
10903 }
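The merge rule described above (sorted by key, duplicates resolved in favour of the right-hand list) can be tried on small inputs. This sketch is a simplified variant with hypothetical names: it writes into a caller-supplied output array rather than copying the result back over aLeft[] as walMerge() does.

```c
#include <assert.h>
#include <stdint.h>

/* Merge index lists aLeft[] and aRight[], each sorted by aContent[index]
** with unique keys, into aOut[]. When both lists hold the same key, only
** the aRight[] index is kept. Returns the number of indices written. */
static int demo_merge(
  const uint32_t *aContent,           /* sort keys */
  const uint16_t *aLeft, int nLeft,   /* left-hand input list */
  const uint16_t *aRight, int nRight, /* right-hand input list */
  uint16_t *aOut                      /* room for nLeft+nRight entries */
){
  int iLeft = 0, iRight = 0, iOut = 0;
  assert( nLeft>0 && nRight>0 );
  while( iRight<nRight || iLeft<nLeft ){
    uint16_t logpage;
    uint32_t dbpage;
    if( iLeft<nLeft
     && (iRight>=nRight || aContent[aLeft[iLeft]]<aContent[aRight[iRight]])
    ){
      logpage = aLeft[iLeft++];
    }else{
      logpage = aRight[iRight++];     /* ties go to the right-hand list */
    }
    dbpage = aContent[logpage];
    aOut[iOut++] = logpage;
    /* Drop a left-hand entry that duplicates the key just emitted. */
    if( iLeft<nLeft && aContent[aLeft[iLeft]]==dbpage ) iLeft++;
  }
  return iOut;
}
```

For keys {7,3,9,3,5}, a left list {1,0} and a right list {3,4,2}, the result is {3,4,0,2}: key 3 appears in both inputs and only the right-hand index 3 survives.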
10904
10905 /*
10906 ** Sort the elements in list aList using aContent[] as the sort key.
10907 ** Remove elements with duplicate keys, preferring to keep the
10908 ** larger aList[] values.
10909 **
10910 ** The aList[] entries are indices into aContent[]. The values in
10911 ** aList[] are to be sorted so that for all J<K:
10912 **
10913 ** aContent[aList[J]] < aContent[aList[K]]
10914 **
10915 ** For any X and Y such that
10916 **
10917 ** aContent[aList[X]] == aContent[aList[Y]]
10918 **
10919 ** Keep the larger of the two values aList[X] and aList[Y] and discard
10920 ** the smaller.
10921 */
10922 static void walMergesort(
10923 const u32 *aContent, /* Pages in wal */
10924 ht_slot *aBuffer, /* Buffer of at least *pnList items to use */
10925 ht_slot *aList, /* IN/OUT: List to sort */
10926 int *pnList /* IN/OUT: Number of elements in aList[] */
10927 ){
10928 struct Sublist {
10929 int nList; /* Number of elements in aList */
10930 ht_slot *aList; /* Pointer to sub-list content */
10931 };
10932
10933 const int nList = *pnList; /* Size of input list */
10934 int nMerge = 0; /* Number of elements in list aMerge */
10935 ht_slot *aMerge = 0; /* List to be merged */
10936 int iList; /* Index into input list */
10937 u32 iSub = 0; /* Index into aSub array */
10938 struct Sublist aSub[13]; /* Array of sub-lists */
10939
10940 memset(aSub, 0, sizeof(aSub));
10941 assert( nList<=HASHTABLE_NPAGE && nList>0 );
10942 assert( HASHTABLE_NPAGE==(1<<(ArraySize(aSub)-1)) );
10943
10944 for(iList=0; iList<nList; iList++){
10945 nMerge = 1;
10946 aMerge = &aList[iList];
10947 for(iSub=0; iList & (1<<iSub); iSub++){
10948 struct Sublist *p;
10949 assert( iSub<ArraySize(aSub) );
10950 p = &aSub[iSub];
10951 assert( p->aList && p->nList<=(1<<iSub) );
10952 assert( p->aList==&aList[iList&~((2<<iSub)-1)] );
10953 walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);
10954 }
10955 aSub[iSub].aList = aMerge;
10956 aSub[iSub].nList = nMerge;
10957 }
10958
10959 for(iSub++; iSub<ArraySize(aSub); iSub++){
10960 if( nList & (1<<iSub) ){
10961 struct Sublist *p;
10962 assert( iSub<ArraySize(aSub) );
10963 p = &aSub[iSub];
10964 assert( p->nList<=(1<<iSub) );
10965 assert( p->aList==&aList[nList&~((2<<iSub)-1)] );
10966 walMerge(aContent, p->aList, p->nList, &aMerge, &nMerge, aBuffer);
10967 }
10968 }
10969 assert( aMerge==aList );
10970 *pnList = nMerge;
10971
10972 #ifdef SQLITE_DEBUG
10973 {
10974 int i;
10975 for(i=1; i<*pnList; i++){
10976 assert( aContent[aList[i]] > aContent[aList[i-1]] );
10977 }
10978 }
10979 #endif
10980 }
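The contract of walMergesort() (ascending keys, one index per key, larger index kept) can be illustrated with a much simpler routine. The sketch below deliberately swaps in an in-place insertion sort for the bottom-up merge sort used above. It has quadratic cost, and `demo_sort_dedup()` is a hypothetical name, so treat it as a reference for the expected output rather than as the technique in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Sort aList[0..*pnList-1] so that aContent[aList[i]] is strictly
** ascending. Of several indices sharing a key, keep only the largest.
** On return *pnList holds the new, possibly smaller, element count. */
static void demo_sort_dedup(
  const uint32_t *aContent,   /* sort keys */
  uint16_t *aList,            /* IN/OUT: indices into aContent[] */
  int *pnList                 /* IN/OUT: number of entries in aList[] */
){
  int nOut = 0;               /* aList[0..nOut-1] is the sorted output */
  int i;
  assert( *pnList>0 );
  for(i=0; i<*pnList; i++){
    uint16_t x = aList[i];
    int j = 0;
    /* Find the first output entry whose key is not smaller than x's. */
    while( j<nOut && aContent[aList[j]]<aContent[x] ) j++;
    if( j<nOut && aContent[aList[j]]==aContent[x] ){
      if( x>aList[j] ) aList[j] = x;          /* same key: keep larger index */
    }else{
      int k;
      for(k=nOut; k>j; k--) aList[k] = aList[k-1];   /* open a gap at j */
      aList[j] = x;
      nOut++;
    }
  }
  *pnList = nOut;
}
```

Since a larger index corresponds to a later WAL frame, keeping the larger index means that the most recent copy of each page is the one that remains.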
10981
10982 /*
10983 ** Free an iterator allocated by walIteratorInit().
10984 */
10985 static void walIteratorFree(WalIterator *p){
10986 sqlite3_free(p);
10987 }
10988
10989 /*
10990 ** Construct a WalIterator object that can be used to loop over all
10991 ** pages in the WAL in ascending order. The caller must hold the checkpoint
10992 ** lock.
10993 **
10994 ** On success, make *pp point to the newly allocated WalIterator object
10995 ** and return SQLITE_OK. Otherwise, return an error code. If this routine
10996 ** returns an error, the value of *pp is undefined.
10997 **
10998 ** The calling routine should invoke walIteratorFree() to destroy the
10999 ** WalIterator object when it has finished with it.
11000 */
11001 static int walIteratorInit(Wal *pWal, WalIterator **pp){
11002 WalIterator *p; /* Return value */
11003 int nSegment; /* Number of segments to merge */
11004 u32 iLast; /* Last frame in log */
11005 int nByte; /* Number of bytes to allocate */
11006 int i; /* Iterator variable */
11007 ht_slot *aTmp; /* Temp space used by merge-sort */
11008 int rc = SQLITE_OK; /* Return Code */
11009
11010 /* This routine only runs while holding the checkpoint lock. And
11011 ** it only runs if there is actually content in the log (mxFrame>0).
11012 */
11013 assert( pWal->ckptLock && pWal->hdr.mxFrame>0 );
11014 iLast = pWal->hdr.mxFrame;
11015
11016 /* Allocate space for the WalIterator object. */
11017 nSegment = walFramePage(iLast) + 1;
11018 nByte = sizeof(WalIterator)
11019 + (nSegment-1)*sizeof(struct WalSegment)
11020 + iLast*sizeof(ht_slot);
11021 p = (WalIterator *)sqlite3_malloc64(nByte);
11022 if( !p ){
11023 return SQLITE_NOMEM_BKPT;
11024 }
11025 memset(p, 0, nByte);
11026 p->nSegment = nSegment;
11027
11028 /* Allocate temporary space used by the merge-sort routine. This block
11029 ** of memory will be freed before this function returns.
11030 */
11031 aTmp = (ht_slot *)sqlite3_malloc64(
11032 sizeof(ht_slot) * (iLast>HASHTABLE_NPAGE?HASHTABLE_NPAGE:iLast)
11033 );
11034 if( !aTmp ){
11035 rc = SQLITE_NOMEM_BKPT;
11036 }
11037
11038 for(i=0; rc==SQLITE_OK && i<nSegment; i++){
11039 volatile ht_slot *aHash;
11040 u32 iZero;
11041 volatile u32 *aPgno;
11042
11043 rc = walHashGet(pWal, i, &aHash, &aPgno, &iZero);
11044 if( rc==SQLITE_OK ){
11045 int j; /* Counter variable */
11046 int nEntry; /* Number of entries in this segment */
11047 ht_slot *aIndex; /* Sorted index for this segment */
11048
11049 aPgno++;
11050 if( (i+1)==nSegment ){
11051 nEntry = (int)(iLast - iZero);
11052 }else{
11053 nEntry = (int)((u32*)aHash - (u32*)aPgno);
11054 }
11055 aIndex = &((ht_slot *)&p->aSegment[p->nSegment])[iZero];
11056 iZero++;
11057
11058 for(j=0; j<nEntry; j++){
11059 aIndex[j] = (ht_slot)j;
11060 }
11061 walMergesort((u32 *)aPgno, aTmp, aIndex, &nEntry);
11062 p->aSegment[i].iZero = iZero;
11063 p->aSegment[i].nEntry = nEntry;
11064 p->aSegment[i].aIndex = aIndex;
11065 p->aSegment[i].aPgno = (u32 *)aPgno;
11066 }
11067 }
11068 sqlite3_free(aTmp);
11069
11070 if( rc!=SQLITE_OK ){
11071 walIteratorFree(p);
11072 }
11073 *pp = p;
11074 return rc;
11075 }
11076
11077 /*
11078 ** Attempt to obtain the exclusive WAL lock defined by parameters lockIdx and
11079 ** n. If the attempt fails and parameter xBusy is not NULL, then it is a
11080 ** busy-handler function. Invoke it and retry the lock until either the
11081 ** lock is successfully obtained or the busy-handler returns 0.
11082 */
11083 static int walBusyLock(
11084 Wal *pWal, /* WAL connection */
11085 int (*xBusy)(void*), /* Function to call when busy */
11086 void *pBusyArg, /* Context argument for xBusy */
11087 int lockIdx, /* Offset of first byte to lock */
11088 int n /* Number of bytes to lock */
11089 ){
11090 int rc;
11091 do {
11092 rc = walLockExclusive(pWal, lockIdx, n);
11093 }while( xBusy && rc==SQLITE_BUSY && xBusy(pBusyArg) );
11094 return rc;
11095 }
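The retry loop in walBusyLock() can be exercised with a fake lock. In the sketch below everything prefixed `demo_` or `Demo` is hypothetical: the lock simply reports busy a fixed number of times, and the busy handler permits a fixed number of retries. DEMO_BUSY uses the numeric value of SQLITE_BUSY (5).

```c
#include <assert.h>

#define DEMO_OK   0
#define DEMO_BUSY 5     /* same numeric value as SQLITE_BUSY */

typedef struct DemoLock { int nBusyLeft; } DemoLock;  /* busy this many times */
typedef struct DemoBusy { int nCall; int nMax; } DemoBusy;

/* Stand-in for walLockExclusive(): busy until the counter runs out. */
static int demo_try_lock(DemoLock *p){
  if( p->nBusyLeft>0 ){ p->nBusyLeft--; return DEMO_BUSY; }
  return DEMO_OK;
}

/* Busy handler: count the call and allow up to nMax retries. */
static int demo_busy_handler(void *pArg){
  DemoBusy *p = (DemoBusy*)pArg;
  return (p->nCall++) < p->nMax;
}

/* Same loop shape as walBusyLock(): retry while the lock is busy and the
** handler, if there is one, returns non-zero. */
static int demo_busy_lock(DemoLock *pLock, int (*xBusy)(void*), void *pBusyArg){
  int rc;
  do{
    rc = demo_try_lock(pLock);
  }while( xBusy && rc==DEMO_BUSY && xBusy(pBusyArg) );
  return rc;
}
```

Because of short-circuit evaluation the handler is consulted only after a busy result, so a lock that succeeds on the first attempt never invokes it, and a NULL handler yields a single attempt.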
11096
11097 /*
11098 ** The cache of the wal-index header must be valid to call this function.
11099 ** Return the page-size in bytes used by the database.
11100 */
11101 static int walPagesize(Wal *pWal){
11102 return (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);
11103 }
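The 16-bit page-size field relies on a small packing trick: a size of 65536 does not fit, so it is stored as 1. The sketch below pairs the expression used in walIndexRecover() with the one in walPagesize() to show that they round-trip. The `demo_*` function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a page size (a power of two between 512 and 65536) into 16 bits,
** using the same expression as the recovery code: 65536 becomes 1. */
static uint16_t demo_pack_pagesize(int szPage){
  assert( szPage>=512 && szPage<=65536 && (szPage&(szPage-1))==0 );
  return (uint16_t)((szPage&0xff00) | (szPage>>16));
}

/* Inverse mapping, same arithmetic as walPagesize(). */
static int demo_unpack_pagesize(uint16_t v){
  return (v&0xfe00) + ((v&0x0001)<<16);
}
```

Every valid page size below 65536 has its low 9 bits clear, so bit 0 is free to act as the flag for 65536.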
11104
11105 /*
11106 ** The following is guaranteed when this function is called:
11107 **
11108 ** a) the WRITER lock is held,
11109 ** b) the entire log file has been checkpointed, and
11110 ** c) any existing readers are reading exclusively from the database
11111 ** file - there are no readers that may attempt to read a frame from
11112 ** the log file.
11113 **
11114 ** This function updates the shared-memory structures so that the next
11115 ** client to write to the database (which may be this one) does so by
11116 ** writing frames into the start of the log file.
11117 **
11118 ** The value of parameter salt1 is used as the aSalt[1] value in the
11119 ** new wal-index header. It should be passed a pseudo-random value (i.e.
11120 ** one obtained from sqlite3_randomness()).
11121 */
11122 static void walRestartHdr(Wal *pWal, u32 salt1){
11123 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
11124 int i; /* Loop counter */
11125 u32 *aSalt = pWal->hdr.aSalt; /* Big-endian salt values */
11126 pWal->nCkpt++;
11127 pWal->hdr.mxFrame = 0;
11128 sqlite3Put4byte((u8*)&aSalt[0], 1 + sqlite3Get4byte((u8*)&aSalt[0]));
11129 memcpy(&pWal->hdr.aSalt[1], &salt1, 4);
11130 walIndexWriteHdr(pWal);
11131 pInfo->nBackfill = 0;
11132 pInfo->nBackfillAttempted = 0;
11133 pInfo->aReadMark[1] = 0;
11134 for(i=2; i<WAL_NREADER; i++) pInfo->aReadMark[i] = READMARK_NOT_USED;
11135 assert( pInfo->aReadMark[0]==0 );
11136 }
11137
11138 /*
11139 ** Copy as much content as we can from the WAL back into the database file
11140 ** in response to an sqlite3_wal_checkpoint() request or the equivalent.
11141 **
11142 ** The amount of information copied from WAL to database might be limited
11143 ** by active readers. This routine will never overwrite a database page
11144 ** that a concurrent reader might be using.
11145 **
11146 ** All I/O barrier operations (a.k.a. fsyncs) occur in this routine when
11147 ** SQLite is in WAL-mode in synchronous=NORMAL. That means that if
11148 ** checkpoints are always run by a background thread or background
11149 ** process, foreground threads will never block on a lengthy fsync call.
11150 **
11151 ** Fsync is called on the WAL before writing content out of the WAL and
11152 ** into the database. This ensures that the new content is persistent
11153 ** in the WAL and can be recovered following a power-loss or hard reset.
11154 **
11155 ** Fsync is also called on the database file if (and only if) the entire
11156 ** WAL content is copied into the database file. This second fsync makes
11157 ** it safe to delete the WAL since the new content will persist in the
11158 ** database file.
11159 **
11160 ** This routine uses and updates the nBackfill field of the wal-index header.
11161 ** This is the only routine that will increase the value of nBackfill.
11162 ** (A WAL reset or recovery will revert nBackfill to zero, but not increase
11163 ** its value.)
11164 **
11165 ** The caller must be holding sufficient locks to ensure that no other
11166 ** checkpoint is running (in any other thread or process) at the same
11167 ** time.
11168 */
11169 static int walCheckpoint(
11170 Wal *pWal, /* Wal connection */
11171 sqlite3 *db, /* Check for interrupts on this handle */
11172 int eMode, /* One of PASSIVE, FULL or RESTART */
11173 int (*xBusy)(void*), /* Function to call when busy */
11174 void *pBusyArg, /* Context argument for xBusy */
11175 int sync_flags, /* Flags for OsSync() (or 0) */
11176 u8 *zBuf /* Temporary buffer to use */
11177 ){
11178 int rc = SQLITE_OK; /* Return code */
11179 int szPage; /* Database page-size */
11180 WalIterator *pIter = 0; /* Wal iterator context */
11181 u32 iDbpage = 0; /* Next database page to write */
11182 u32 iFrame = 0; /* Wal frame containing data for iDbpage */
11183 u32 mxSafeFrame; /* Max frame that can be backfilled */
11184 u32 mxPage; /* Max database page to write */
11185 int i; /* Loop counter */
11186 volatile WalCkptInfo *pInfo; /* The checkpoint status information */
11187
11188 szPage = walPagesize(pWal);
11189 testcase( szPage<=32768 );
11190 testcase( szPage>=65536 );
11191 pInfo = walCkptInfo(pWal);
11192 if( pInfo->nBackfill<pWal->hdr.mxFrame ){
11193
11194 /* Allocate the iterator */
11195 rc = walIteratorInit(pWal, &pIter);
11196 if( rc!=SQLITE_OK ){
11197 return rc;
11198 }
11199 assert( pIter );
11200
11201 /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked
11202 ** in the SQLITE_CHECKPOINT_PASSIVE mode. */
11203 assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 );
11204
11205 /* Compute in mxSafeFrame the index of the last frame of the WAL that is
11206 ** safe to write into the database. Frames beyond mxSafeFrame might
11207 ** overwrite database pages that are in use by active readers and thus
11208 ** cannot be backfilled from the WAL.
11209 */
11210 mxSafeFrame = pWal->hdr.mxFrame;
11211 mxPage = pWal->hdr.nPage;
11212 for(i=1; i<WAL_NREADER; i++){
11213 /* Thread-sanitizer reports that the following is an unsafe read,
11214 ** as some other thread may be in the process of updating the value
11215 ** of the aReadMark[] slot. The assumption here is that if that is
11216 ** happening, the other client may only be increasing the value,
11217 ** not decreasing it. So assuming that either the "old" or
11218 ** "new" version of the value is read, and not some arbitrary value
11219 ** that would never be written by a real client, things are still
11220 ** safe. */
11221 u32 y = pInfo->aReadMark[i];
11222 if( mxSafeFrame>y ){
11223 assert( y<=pWal->hdr.mxFrame );
11224 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(i), 1);
11225 if( rc==SQLITE_OK ){
11226 pInfo->aReadMark[i] = (i==1 ? mxSafeFrame : READMARK_NOT_USED);
11227 walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
11228 }else if( rc==SQLITE_BUSY ){
11229 mxSafeFrame = y;
11230 xBusy = 0;
11231 }else{
11232 goto walcheckpoint_out;
11233 }
11234 }
11235 }
11236
11237 if( pInfo->nBackfill<mxSafeFrame
11238 && (rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(0),1))==SQLITE_OK
11239 ){
11240 i64 nSize; /* Current size of database file */
11241 u32 nBackfill = pInfo->nBackfill;
11242
11243 pInfo->nBackfillAttempted = mxSafeFrame;
11244
11245 /* Sync the WAL to disk */
11246 if( sync_flags ){
11247 rc = sqlite3OsSync(pWal->pWalFd, sync_flags);
11248 }
11249
11250 /* If the database may grow as a result of this checkpoint, hint
11251 ** about the eventual size of the db file to the VFS layer.
11252 */
11253 if( rc==SQLITE_OK ){
11254 i64 nReq = ((i64)mxPage * szPage);
11255 rc = sqlite3OsFileSize(pWal->pDbFd, &nSize);
11256 if( rc==SQLITE_OK && nSize<nReq ){
11257 sqlite3OsFileControlHint(pWal->pDbFd, SQLITE_FCNTL_SIZE_HINT, &nReq);
11258 }
11259 }
11260
11261
11262 /* Iterate through the contents of the WAL, copying data to the db file */
11263 while( rc==SQLITE_OK && 0==walIteratorNext(pIter, &iDbpage, &iFrame) ){
11264 i64 iOffset;
11265 assert( walFramePgno(pWal, iFrame)==iDbpage );
11266 if( db->u1.isInterrupted ){
11267 rc = db->mallocFailed ? SQLITE_NOMEM_BKPT : SQLITE_INTERRUPT;
11268 break;
11269 }
11270 if( iFrame<=nBackfill || iFrame>mxSafeFrame || iDbpage>mxPage ){
11271 continue;
11272 }
11273 iOffset = walFrameOffset(iFrame, szPage) + WAL_FRAME_HDRSIZE;
11274 /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL file */
11275 rc = sqlite3OsRead(pWal->pWalFd, zBuf, szPage, iOffset);
11276 if( rc!=SQLITE_OK ) break;
11277 iOffset = (iDbpage-1)*(i64)szPage;
11278 testcase( IS_BIG_INT(iOffset) );
11279 rc = sqlite3OsWrite(pWal->pDbFd, zBuf, szPage, iOffset);
11280 if( rc!=SQLITE_OK ) break;
11281 }
11282
11283 /* If work was actually accomplished... */
11284 if( rc==SQLITE_OK ){
11285 if( mxSafeFrame==walIndexHdr(pWal)->mxFrame ){
11286 i64 szDb = pWal->hdr.nPage*(i64)szPage;
11287 testcase( IS_BIG_INT(szDb) );
11288 rc = sqlite3OsTruncate(pWal->pDbFd, szDb);
11289 if( rc==SQLITE_OK && sync_flags ){
11290 rc = sqlite3OsSync(pWal->pDbFd, sync_flags);
11291 }
11292 }
11293 if( rc==SQLITE_OK ){
11294 pInfo->nBackfill = mxSafeFrame;
11295 }
11296 }
11297
11298 /* Release the reader lock held while backfilling */
11299 walUnlockExclusive(pWal, WAL_READ_LOCK(0), 1);
11300 }
11301
11302 if( rc==SQLITE_BUSY ){
11303 /* Reset the return code so as not to report a checkpoint failure
11304 ** just because there are active readers. */
11305 rc = SQLITE_OK;
11306 }
11307 }
11308
11309 /* If this is an SQLITE_CHECKPOINT_RESTART or TRUNCATE operation, and the
11310 ** entire wal file has been copied into the database file, then block
11311 ** until all readers have finished using the wal file. This ensures that
11312 ** the next process to write to the database restarts the wal file.
11313 */
11314 if( rc==SQLITE_OK && eMode!=SQLITE_CHECKPOINT_PASSIVE ){
11315 assert( pWal->writeLock );
11316 if( pInfo->nBackfill<pWal->hdr.mxFrame ){
11317 rc = SQLITE_BUSY;
11318 }else if( eMode>=SQLITE_CHECKPOINT_RESTART ){
11319 u32 salt1;
11320 sqlite3_randomness(4, &salt1);
11321 assert( pInfo->nBackfill==pWal->hdr.mxFrame );
11322 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_READ_LOCK(1), WAL_NREADER-1);
11323 if( rc==SQLITE_OK ){
11324 if( eMode==SQLITE_CHECKPOINT_TRUNCATE ){
11325 /* IMPLEMENTATION-OF: R-44699-57140 This mode works the same way as
11326 ** SQLITE_CHECKPOINT_RESTART with the addition that it also
11327 ** truncates the log file to zero bytes just prior to a
11328 ** successful return.
11329 **
11330 ** In theory, it might be safe to do this without updating the
11331 ** wal-index header in shared memory, as all subsequent reader or
11332 ** writer clients should see that the entire log file has been
11333 ** checkpointed and behave accordingly. This seems unsafe though,
11334 ** as it would leave the system in a state where the contents of
11335 ** the wal-index header do not match the contents of the
11336 ** file-system. To avoid this, update the wal-index header to
11337 ** indicate that the log file contains zero valid frames. */
11338 walRestartHdr(pWal, salt1);
11339 rc = sqlite3OsTruncate(pWal->pWalFd, 0);
11340 }
11341 walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
11342 }
11343 }
11344 }
11345
11346 walcheckpoint_out:
11347 walIteratorFree(pIter);
11348 return rc;
11349 }
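
/*
** Illustrative sketch, not part of SQLite: a simplified model of how the
** checkpoint loop above bounds mxSafeFrame. The names demo_max_safe_frame
** and aSlotBusy are hypothetical. aSlotBusy[i] stands in for the case where
** walBusyLock() on WAL_READ_LOCK(i) returns SQLITE_BUSY, meaning a reader
** still holds that slot. The sketch leaves out the read-mark reset that the
** real code performs when the lock is obtained.
*/
```c
#include <stddef.h>
#include <stdint.h>

/* Return the last WAL frame that may be copied into the database file.
** Start from the newest frame and lower the limit to the read mark of
** every slot that is still held by a reader. Slot 0 is skipped, as in
** the loop above, which starts at i=1. */
uint32_t demo_max_safe_frame(
  uint32_t mxFrame,            /* Newest valid frame in the WAL */
  const uint32_t *aReadMark,   /* Read marks, one per reader slot */
  const int *aSlotBusy,        /* Non-zero if the slot lock was busy */
  size_t nReader               /* Number of reader slots */
){
  uint32_t mxSafe = mxFrame;
  for(size_t i=1; i<nReader; i++){
    uint32_t y = aReadMark[i];
    if( mxSafe>y && aSlotBusy[i] ){
      mxSafe = y;   /* A live reader pins the WAL at frame y */
    }
  }
  return mxSafe;
}
```
/*
** An unused slot holds the largest u32 value, so the mxSafe>y test never
** selects it and no special case is needed.
*/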
11350
11351 /*
11352 ** If the WAL file is currently larger than nMax bytes in size, truncate
11353 ** it to exactly nMax bytes. If an error occurs while doing so, ignore it.
11354 */
11355 static void walLimitSize(Wal *pWal, i64 nMax){
11356 i64 sz;
11357 int rx;
11358 sqlite3BeginBenignMalloc();
11359 rx = sqlite3OsFileSize(pWal->pWalFd, &sz);
11360 if( rx==SQLITE_OK && (sz > nMax ) ){
11361 rx = sqlite3OsTruncate(pWal->pWalFd, nMax);
11362 }
11363 sqlite3EndBenignMalloc();
11364 if( rx ){
11365 sqlite3_log(rx, "cannot limit WAL size: %s", pWal->zWalName);
11366 }
11367 }
11368
11369 /*
11370 ** Close a connection to a log file.
11371 */
11372 SQLITE_PRIVATE int sqlite3WalClose(
11373 Wal *pWal, /* Wal to close */
11374 sqlite3 *db, /* For interrupt flag */
11375 int sync_flags, /* Flags to pass to OsSync() (or 0) */
11376 int nBuf,
11377 u8 *zBuf /* Buffer of at least nBuf bytes */
11378 ){
11379 int rc = SQLITE_OK;
11380 if( pWal ){
11381 int isDelete = 0; /* True to unlink wal and wal-index files */
11382
11383 /* If an EXCLUSIVE lock can be obtained on the database file (using the
11384 ** ordinary, rollback-mode locking methods), this guarantees that the
11385 ** connection associated with this log file is the only connection to
11386 ** the database. In this case checkpoint the database and unlink both
11387 ** the wal and wal-index files.
11388 **
11389 ** The EXCLUSIVE lock is not released before returning.
11390 */
11391 if( zBuf!=0
11392 && SQLITE_OK==(rc = sqlite3OsLock(pWal->pDbFd, SQLITE_LOCK_EXCLUSIVE))
11393 ){
11394 if( pWal->exclusiveMode==WAL_NORMAL_MODE ){
11395 pWal->exclusiveMode = WAL_EXCLUSIVE_MODE;
11396 }
11397 rc = sqlite3WalCheckpoint(pWal, db,
11398 SQLITE_CHECKPOINT_PASSIVE, 0, 0, sync_flags, nBuf, zBuf, 0, 0
11399 );
11400 if( rc==SQLITE_OK ){
11401 int bPersist = -1;
11402 sqlite3OsFileControlHint(
11403 pWal->pDbFd, SQLITE_FCNTL_PERSIST_WAL, &bPersist
11404 );
11405 if( bPersist!=1 ){
11406 /* Try to delete the WAL file if the checkpoint completed and
11407 ** fsynced (rc==SQLITE_OK) and if we are not in persistent-wal
11408 ** mode (!bPersist) */
11409 isDelete = 1;
11410 }else if( pWal->mxWalSize>=0 ){
11411 /* Try to truncate the WAL file to zero bytes if the checkpoint
11412 ** completed and fsynced (rc==SQLITE_OK) and we are in persistent
11413 ** WAL mode (bPersist) and if the PRAGMA journal_size_limit is a
11414 ** non-negative value (pWal->mxWalSize>=0). Note that we truncate
11415 ** to zero bytes as truncating to the journal_size_limit might
11416 ** leave a corrupt WAL file on disk. */
11417 walLimitSize(pWal, 0);
11418 }
11419 }
11420 }
11421
11422 walIndexClose(pWal, isDelete);
11423 sqlite3OsClose(pWal->pWalFd);
11424 if( isDelete ){
11425 sqlite3BeginBenignMalloc();
11426 sqlite3OsDelete(pWal->pVfs, pWal->zWalName, 0);
11427 sqlite3EndBenignMalloc();
11428 }
11429 WALTRACE(("WAL%p: closed\n", pWal));
11430 sqlite3_free((void *)pWal->apWiData);
11431 sqlite3_free(pWal);
11432 }
11433 return rc;
11434 }
11435
11436 /*
11437 ** Try to read the wal-index header. Return 0 on success and 1 if
11438 ** there is a problem.
11439 **
11440 ** The wal-index is in shared memory. Another thread or process might
11441 ** be writing the header at the same time this procedure is trying to
11442 ** read it, which might result in inconsistency. A dirty read is detected
11443 ** by verifying that both copies of the header are the same and also by
11444 ** a checksum on the header.
11445 **
11446 ** If and only if the read is consistent and the header is different from
11447 ** pWal->hdr, then pWal->hdr is updated to the content of the new header
11448 ** and *pChanged is set to 1.
11449 **
11450 ** If the checksum cannot be verified return non-zero. If the header
11451 ** is read successfully and the checksum verified, return zero.
11452 */
11453 static int walIndexTryHdr(Wal *pWal, int *pChanged){
11454 u32 aCksum[2]; /* Checksum on the header content */
11455 WalIndexHdr h1, h2; /* Two copies of the header content */
11456 WalIndexHdr volatile *aHdr; /* Header in shared memory */
11457
11458 /* The first page of the wal-index must be mapped at this point. */
11459 assert( pWal->nWiData>0 && pWal->apWiData[0] );
11460
11461 /* Read the header. This might happen concurrently with a write to the
11462 ** same area of shared memory on a different CPU in a SMP,
11463 ** meaning it is possible that an inconsistent snapshot is read
11464 ** from the file. If this happens, return non-zero.
11465 **
11466 ** There are two copies of the header at the beginning of the wal-index.
11467 ** When reading, read [0] first then [1]. Writes are in the reverse order.
11468 ** Memory barriers are used to prevent the compiler or the hardware from
11469 ** reordering the reads and writes.
11470 */
11471 aHdr = walIndexHdr(pWal);
11472 memcpy(&h1, (void *)&aHdr[0], sizeof(h1));
11473 walShmBarrier(pWal);
11474 memcpy(&h2, (void *)&aHdr[1], sizeof(h2));
11475
11476 if( memcmp(&h1, &h2, sizeof(h1))!=0 ){
11477 return 1; /* Dirty read */
11478 }
11479 if( h1.isInit==0 ){
11480 return 1; /* Malformed header - probably all zeros */
11481 }
11482 walChecksumBytes(1, (u8*)&h1, sizeof(h1)-sizeof(h1.aCksum), 0, aCksum);
11483 if( aCksum[0]!=h1.aCksum[0] || aCksum[1]!=h1.aCksum[1] ){
11484 return 1; /* Checksum does not match */
11485 }
11486
11487 if( memcmp(&pWal->hdr, &h1, sizeof(WalIndexHdr)) ){
11488 *pChanged = 1;
11489 memcpy(&pWal->hdr, &h1, sizeof(WalIndexHdr));
11490 pWal->szPage = (pWal->hdr.szPage&0xfe00) + ((pWal->hdr.szPage&0x0001)<<16);
11491 testcase( pWal->szPage<=32768 );
11492 testcase( pWal->szPage>=65536 );
11493 }
11494
11495 /* The header was successfully read. Return zero. */
11496 return 0;
11497 }
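
/*
** Illustrative sketch, not part of SQLite: the page-size decoding used at
** the end of walIndexTryHdr(), pulled out into a stand-alone helper. The
** name demo_decode_page_size is hypothetical. The sketch assumes the header
** stores the page size in a 16-bit field, so 65536 cannot be stored
** directly and is represented by the value 1.
*/
```c
#include <stdint.h>

/* Convert the 16-bit page-size field of a wal-index header into a page
** size in bytes. The high bits carry sizes from 512 to 32768 unchanged,
** and the low bit, shifted left by 16, yields 65536. */
uint32_t demo_decode_page_size(uint16_t stored){
  return (uint32_t)(stored & 0xfe00) + ((uint32_t)(stored & 0x0001) << 16);
}
```
/*
** Under that assumption a stored value of 4096 decodes to 4096 and a
** stored value of 1 decodes to 65536, which is what the two testcase()
** macros above are meant to exercise.
*/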
11498
11499 /*
11500 ** Read the wal-index header from the wal-index and into pWal->hdr.
11501 ** If the wal-header appears to be corrupt, try to reconstruct the
11502 ** wal-index from the WAL before returning.
11503 **
11504 ** Set *pChanged to 1 if the wal-index header value in pWal->hdr is
11505 ** changed by this operation. If pWal->hdr is unchanged, set *pChanged
11506 ** to 0.
11507 **
11508 ** If the wal-index header is successfully read, return SQLITE_OK.
11509 ** Otherwise an SQLite error code.
11510 */
11511 static int walIndexReadHdr(Wal *pWal, int *pChanged){
11512 int rc; /* Return code */
11513 int badHdr; /* True if a header read failed */
11514 volatile u32 *page0; /* Chunk of wal-index containing header */
11515
11516 /* Ensure that page 0 of the wal-index (the page that contains the
11517 ** wal-index header) is mapped. Return early if an error occurs here.
11518 */
11519 assert( pChanged );
11520 rc = walIndexPage(pWal, 0, &page0);
11521 if( rc!=SQLITE_OK ){
11522 return rc;
11523 };
11524 assert( page0 || pWal->writeLock==0 );
11525
11526 /* If the first page of the wal-index has been mapped, try to read the
11527 ** wal-index header immediately, without holding any lock. This usually
11528 ** works, but may fail if the wal-index header is corrupt or currently
11529 ** being modified by another thread or process.
11530 */
11531 badHdr = (page0 ? walIndexTryHdr(pWal, pChanged) : 1);
11532
11533 /* If the first attempt failed, it might have been due to a race
11534 ** with a writer. So get a WRITE lock and try again.
11535 */
11536 assert( badHdr==0 || pWal->writeLock==0 );
11537 if( badHdr ){
11538 if( pWal->readOnly & WAL_SHM_RDONLY ){
11539 if( SQLITE_OK==(rc = walLockShared(pWal, WAL_WRITE_LOCK)) ){
11540 walUnlockShared(pWal, WAL_WRITE_LOCK);
11541 rc = SQLITE_READONLY_RECOVERY;
11542 }
11543 }else if( SQLITE_OK==(rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1)) ){
11544 pWal->writeLock = 1;
11545 if( SQLITE_OK==(rc = walIndexPage(pWal, 0, &page0)) ){
11546 badHdr = walIndexTryHdr(pWal, pChanged);
11547 if( badHdr ){
11548 /* If the wal-index header is still malformed even while holding
11549 ** a WRITE lock, it can only mean that the header is corrupted and
11550 ** needs to be reconstructed. So run recovery to do exactly that.
11551 */
11552 rc = walIndexRecover(pWal);
11553 *pChanged = 1;
11554 }
11555 }
11556 pWal->writeLock = 0;
11557 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
11558 }
11559 }
11560
11561 /* If the header is read successfully, check the version number to make
11562 ** sure the wal-index was not constructed with some future format that
11563 ** this version of SQLite cannot understand.
11564 */
11565 if( badHdr==0 && pWal->hdr.iVersion!=WALINDEX_MAX_VERSION ){
11566 rc = SQLITE_CANTOPEN_BKPT;
11567 }
11568
11569 return rc;
11570 }
11571
11572 /*
11573 ** This is the value that walTryBeginRead returns when it needs to
11574 ** be retried.
11575 */
11576 #define WAL_RETRY (-1)
11577
11578 /*
11579 ** Attempt to start a read transaction. This might fail due to a race or
11580 ** other transient condition. When that happens, it returns WAL_RETRY to
11581 ** indicate to the caller that it is safe to retry immediately.
11582 **
11583 ** On success return SQLITE_OK. On a permanent failure (such as an
11584 ** I/O error or an SQLITE_BUSY because another process is running
11585 ** recovery) return a positive error code.
11586 **
11587 ** The useWal parameter is true to force the use of the WAL and disable
11588 ** the case where the WAL is bypassed because it has been completely
11589 ** checkpointed. If useWal==0 then this routine calls walIndexReadHdr()
11590 ** to make a copy of the wal-index header into pWal->hdr. If the
11591 ** wal-index header has changed, *pChanged is set to 1 (as an indication
11592 ** to the caller that the local page cache is obsolete and needs to be
11593 ** flushed.) When useWal==1, the wal-index header is assumed to already
11594 ** be loaded and the pChanged parameter is unused.
11595 **
11596 ** The caller must set the cnt parameter to the number of prior calls to
11597 ** this routine during the current read attempt that returned WAL_RETRY.
11598 ** This routine will start taking more aggressive measures to clear the
11599 ** race conditions after multiple WAL_RETRY returns, and after an excessive
11600 ** number of errors will ultimately return SQLITE_PROTOCOL. The
11601 ** SQLITE_PROTOCOL return indicates that some other process has gone rogue
11602 ** and is not honoring the locking protocol. There is a vanishingly small
11603 ** chance that SQLITE_PROTOCOL could be returned because of a run of really
11604 ** bad luck when there is lots of contention for the wal-index, but that
11605 ** possibility is so small that it can be safely neglected, we believe.
11606 **
11607 ** On success, this routine obtains a read lock on
11608 ** WAL_READ_LOCK(pWal->readLock). The pWal->readLock integer is
11609 ** in the range 0 <= pWal->readLock < WAL_NREADER. If pWal->readLock==(-1)
11610 ** that means the Wal does not hold any read lock. The reader must not
11611 ** access any database page that is modified by a WAL frame up to and
11612 ** including frame number aReadMark[pWal->readLock]. The reader will
11613 ** use WAL frames up to and including pWal->hdr.mxFrame if pWal->readLock>0
11614 ** Or if pWal->readLock==0, then the reader will ignore the WAL
11615 ** completely and get all content directly from the database file.
11616 ** If the useWal parameter is 1 then the WAL will never be ignored and
11617 ** this routine will always set pWal->readLock>0 on success.
11618 ** When the read transaction is completed, the caller must release the
11619 ** lock on WAL_READ_LOCK(pWal->readLock) and set pWal->readLock to -1.
11620 **
11621 ** This routine uses the nBackfill and aReadMark[] fields of the header
11622 ** to select a particular WAL_READ_LOCK() that strives to let the
11623 ** checkpoint process do as much work as possible. This routine might
11624 ** update values of the aReadMark[] array in the header, but if it does
11625 ** so it takes care to hold an exclusive lock on the corresponding
11626 ** WAL_READ_LOCK() while changing values.
11627 */
11628 static int walTryBeginRead(Wal *pWal, int *pChanged, int useWal, int cnt){
11629 volatile WalCkptInfo *pInfo; /* Checkpoint information in wal-index */
11630 u32 mxReadMark; /* Largest aReadMark[] value */
11631 int mxI; /* Index of largest aReadMark[] value */
11632 int i; /* Loop counter */
11633 int rc = SQLITE_OK; /* Return code */
11634 u32 mxFrame; /* Wal frame to lock to */
11635
11636 assert( pWal->readLock<0 ); /* Not currently locked */
11637
11638 /* Take steps to avoid spinning forever if there is a protocol error.
11639 **
11640 ** Circumstances that cause a RETRY should only last for the briefest
11641 ** instants of time. No I/O or other system calls are done while the
11642 ** locks are held, so the locks should not be held for very long. But
11643 ** if we are unlucky, another process that is holding a lock might get
11644 ** paged out or take a page-fault that is time-consuming to resolve,
11645 ** during the few nanoseconds that it is holding the lock. In that case,
11646 ** it might take longer than normal for the lock to free.
11647 **
11648 ** After 5 RETRYs, we begin calling sqlite3OsSleep(). The first few
11649 ** calls to sqlite3OsSleep() have a delay of 1 microsecond. Really this
11650 ** is more of a scheduler yield than an actual delay. But on the 10th
11651 ** and subsequent retries, the delays start becoming longer and longer,
11652 ** so that on the 100th (and last) RETRY we delay for 323 milliseconds.
11653 ** The total delay time before giving up is less than 10 seconds.
11654 */
11655 if( cnt>5 ){
11656 int nDelay = 1; /* Pause time in microseconds */
11657 if( cnt>100 ){
11658 VVA_ONLY( pWal->lockError = 1; )
11659 return SQLITE_PROTOCOL;
11660 }
11661 if( cnt>=10 ) nDelay = (cnt-9)*(cnt-9)*39;
11662 sqlite3OsSleep(pWal->pVfs, nDelay);
11663 }
11664
11665 if( !useWal ){
11666 rc = walIndexReadHdr(pWal, pChanged);
11667 if( rc==SQLITE_BUSY ){
11668 /* If there is not a recovery running in another thread or process
11669 ** then convert BUSY errors to WAL_RETRY. If recovery is known to
11670 ** be running, convert BUSY to BUSY_RECOVERY. There is a race here
11671 ** which might cause WAL_RETRY to be returned even if BUSY_RECOVERY
11672 ** would be technically correct. But the race is benign since with
11673 ** WAL_RETRY this routine will be called again and will probably be
11674 ** right on the second iteration.
11675 */
11676 if( pWal->apWiData[0]==0 ){
11677 /* This branch is taken when the xShmMap() method returns SQLITE_BUSY.
11678 ** We assume this is a transient condition, so return WAL_RETRY. The
11679 ** xShmMap() implementation used by the default unix and win32 VFS
11680 ** modules may return SQLITE_BUSY due to a race condition in the
11681 ** code that determines whether or not the shared-memory region
11682 ** must be zeroed before the requested page is returned.
11683 */
11684 rc = WAL_RETRY;
11685 }else if( SQLITE_OK==(rc = walLockShared(pWal, WAL_RECOVER_LOCK)) ){
11686 walUnlockShared(pWal, WAL_RECOVER_LOCK);
11687 rc = WAL_RETRY;
11688 }else if( rc==SQLITE_BUSY ){
11689 rc = SQLITE_BUSY_RECOVERY;
11690 }
11691 }
11692 if( rc!=SQLITE_OK ){
11693 return rc;
11694 }
11695 }
11696
11697 pInfo = walCkptInfo(pWal);
11698 if( !useWal && pInfo->nBackfill==pWal->hdr.mxFrame
11699 #ifdef SQLITE_ENABLE_SNAPSHOT
11700 && (pWal->pSnapshot==0 || pWal->hdr.mxFrame==0
11701 || 0==memcmp(&pWal->hdr, pWal->pSnapshot, sizeof(WalIndexHdr)))
11702 #endif
11703 ){
11704 /* The WAL has been completely backfilled (or it is empty)
11705 ** and can be safely ignored.
11706 */
11707 rc = walLockShared(pWal, WAL_READ_LOCK(0));
11708 walShmBarrier(pWal);
11709 if( rc==SQLITE_OK ){
11710 if( memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr)) ){
11711 /* It is not safe to allow the reader to continue here if frames
11712 ** may have been appended to the log before READ_LOCK(0) was obtained.
11713 ** When holding READ_LOCK(0), the reader ignores the entire log file,
11714 ** which implies that the database file contains a trustworthy
11715 ** snapshot. Since holding READ_LOCK(0) prevents a checkpoint from
11716 ** happening, this is usually correct.
11717 **
11718 ** However, if frames have been appended to the log (or if the log
11719 ** is wrapped and written for that matter) before the READ_LOCK(0)
11720 ** is obtained, that is not necessarily true. A checkpointer may
11721 ** have started to backfill the appended frames but crashed before
11722 ** it finished, leaving a corrupt image in the database file.
11723 */
11724 walUnlockShared(pWal, WAL_READ_LOCK(0));
11725 return WAL_RETRY;
11726 }
11727 pWal->readLock = 0;
11728 return SQLITE_OK;
11729 }else if( rc!=SQLITE_BUSY ){
11730 return rc;
11731 }
11732 }
11733
11734 /* If we get this far, it means that the reader will want to use
11735 ** the WAL to get at content from recent commits. The job now is
11736 ** to select one of the aReadMark[] entries that is closest to
11737 ** but not exceeding pWal->hdr.mxFrame and lock that entry.
11738 */
11739 mxReadMark = 0;
11740 mxI = 0;
11741 mxFrame = pWal->hdr.mxFrame;
11742 #ifdef SQLITE_ENABLE_SNAPSHOT
11743 if( pWal->pSnapshot && pWal->pSnapshot->mxFrame<mxFrame ){
11744 mxFrame = pWal->pSnapshot->mxFrame;
11745 }
11746 #endif
11747 for(i=1; i<WAL_NREADER; i++){
11748 u32 thisMark = pInfo->aReadMark[i];
11749 if( mxReadMark<=thisMark && thisMark<=mxFrame ){
11750 assert( thisMark!=READMARK_NOT_USED );
11751 mxReadMark = thisMark;
11752 mxI = i;
11753 }
11754 }
11755 if( (pWal->readOnly & WAL_SHM_RDONLY)==0
11756 && (mxReadMark<mxFrame || mxI==0)
11757 ){
11758 for(i=1; i<WAL_NREADER; i++){
11759 rc = walLockExclusive(pWal, WAL_READ_LOCK(i), 1);
11760 if( rc==SQLITE_OK ){
11761 mxReadMark = pInfo->aReadMark[i] = mxFrame;
11762 mxI = i;
11763 walUnlockExclusive(pWal, WAL_READ_LOCK(i), 1);
11764 break;
11765 }else if( rc!=SQLITE_BUSY ){
11766 return rc;
11767 }
11768 }
11769 }
11770 if( mxI==0 ){
11771 assert( rc==SQLITE_BUSY || (pWal->readOnly & WAL_SHM_RDONLY)!=0 );
11772 return rc==SQLITE_BUSY ? WAL_RETRY : SQLITE_READONLY_CANTLOCK;
11773 }
11774
11775 rc = walLockShared(pWal, WAL_READ_LOCK(mxI));
11776 if( rc ){
11777 return rc==SQLITE_BUSY ? WAL_RETRY : rc;
11778 }
11779 /* Now that the read-lock has been obtained, check that neither the
11780 ** value in the aReadMark[] array nor the contents of the wal-index
11781 ** header have changed.
11782 **
11783 ** It is necessary to check that the wal-index header did not change
11784 ** between the time it was read and when the shared-lock was obtained
11785 ** on WAL_READ_LOCK(mxI), to account for the possibility
11786 ** that the log file may have been wrapped by a writer, or that frames
11787 ** that occur later in the log than pWal->hdr.mxFrame may have been
11788 ** copied into the database by a checkpointer. If either of these things
11789 ** happened, then reading the database with the current value of
11790 ** pWal->hdr.mxFrame risks reading a corrupted snapshot. So, retry
11791 ** instead.
11792 **
11793 ** Before checking that the live wal-index header has not changed
11794 ** since it was read, set Wal.minFrame to the first frame in the wal
11795 ** file that has not yet been checkpointed. This client will not need
11796 ** to read any frames earlier than minFrame from the wal file - they
11797 ** can be safely read directly from the database file.
11798 **
11799 ** Because a ShmBarrier() call is made between taking the copy of
11800 ** nBackfill and checking that the wal-header in shared-memory still
11801 ** matches the one cached in pWal->hdr, it is guaranteed that the
11802 ** checkpointer that set nBackfill was not working with a wal-index
11803 ** header newer than that cached in pWal->hdr. If it were, that could
11804 ** cause a problem. The checkpointer could omit to checkpoint
11805 ** a version of page X that lies before pWal->minFrame (call that version
11806 ** A) on the basis that there is a newer version (version B) of the same
11807 ** page later in the wal file. But if version B happens to lie past
11808 ** frame pWal->hdr.mxFrame - then the client would incorrectly assume
11809 ** that it can read version A from the database file. However, since
11810 ** we can guarantee that the checkpointer that set nBackfill could not
11811 ** see any pages past pWal->hdr.mxFrame, this problem does not come up.
11812 */
11813 pWal->minFrame = pInfo->nBackfill+1;
11814 walShmBarrier(pWal);
11815 if( pInfo->aReadMark[mxI]!=mxReadMark
11816 || memcmp((void *)walIndexHdr(pWal), &pWal->hdr, sizeof(WalIndexHdr))
11817 ){
11818 walUnlockShared(pWal, WAL_READ_LOCK(mxI));
11819 return WAL_RETRY;
11820 }else{
11821 assert( mxReadMark<=pWal->hdr.mxFrame );
11822 pWal->readLock = (i16)mxI;
11823 }
11824 return rc;
11825 }
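
/*
** Illustrative sketch, not part of SQLite: the retry delay schedule that
** the header comment of walTryBeginRead() describes, written as two
** stand-alone helpers so the arithmetic can be checked. The names
** demo_retry_delay_us and demo_total_delay_us are hypothetical, and a
** return of -1 is used here in place of the SQLITE_PROTOCOL give-up path.
*/
```c
#include <stdint.h>

/* Delay in microseconds applied before attempt number cnt (1-based).
** Returns 0 when no sleep is requested and -1 when the caller would
** give up instead of retrying. */
int64_t demo_retry_delay_us(int cnt){
  if( cnt>100 ) return -1;          /* Too many retries: give up */
  if( cnt<=5 ) return 0;            /* First attempts: no sleep at all */
  if( cnt>=10 ) return (int64_t)(cnt-9)*(cnt-9)*39;  /* Quadratic growth */
  return 1;                         /* Attempts 6..9: a scheduler yield */
}

/* Sum of every delay across attempts 1 through 100. */
int64_t demo_total_delay_us(void){
  int64_t total = 0;
  for(int cnt=1; cnt<=100; cnt++){
    total += demo_retry_delay_us(cnt);
  }
  return total;
}
```
/*
** With this formula the 100th attempt sleeps for 322959 microseconds
** (about 323 milliseconds) and the total comes to 9958498 microseconds,
** which agrees with the "less than 10 seconds" figure quoted above.
*/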
11826
11827 #ifdef SQLITE_ENABLE_SNAPSHOT
11828 /*
11829 ** Attempt to reduce the value of the WalCkptInfo.nBackfillAttempted
11830 ** variable so that older snapshots can be accessed. To do this, loop
11831 ** through all wal frames from nBackfillAttempted to (nBackfill+1),
11832 ** comparing their content to the corresponding page with the database
11833 ** file, if any. Set nBackfillAttempted to the frame number of the
11834 ** first frame for which the wal file content matches the db file.
11835 **
11836 ** This is only really safe if the file-system is such that any page
11837 ** writes made by earlier checkpointers were atomic operations, which
11838 ** is not always true. It is also possible that nBackfillAttempted
11839 ** may be left set to a value larger than expected, if a wal frame
11840 ** contains content that is a duplicate of an earlier version of the same
11841 ** page.
11842 **
11843 ** SQLITE_OK is returned if successful, or an SQLite error code if an
11844 ** error occurs. It is not an error if nBackfillAttempted cannot be
11845 ** decreased at all.
11846 */
11847 SQLITE_PRIVATE int sqlite3WalSnapshotRecover(Wal *pWal){
11848 int rc;
11849
11850 assert( pWal->readLock>=0 );
11851 rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);
11852 if( rc==SQLITE_OK ){
11853 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
11854 int szPage = (int)pWal->szPage;
11855 i64 szDb; /* Size of db file in bytes */
11856
11857 rc = sqlite3OsFileSize(pWal->pDbFd, &szDb);
11858 if( rc==SQLITE_OK ){
11859 void *pBuf1 = sqlite3_malloc(szPage);
11860 void *pBuf2 = sqlite3_malloc(szPage);
11861 if( pBuf1==0 || pBuf2==0 ){
11862 rc = SQLITE_NOMEM;
11863 }else{
11864 u32 i = pInfo->nBackfillAttempted;
11865 for(i=pInfo->nBackfillAttempted; i>pInfo->nBackfill; i--){
11866 volatile ht_slot *dummy;
11867 volatile u32 *aPgno; /* Array of page numbers */
11868 u32 iZero; /* Frame corresponding to aPgno[0] */
11869 u32 pgno; /* Page number in db file */
11870 i64 iDbOff; /* Offset of db file entry */
11871 i64 iWalOff; /* Offset of wal file entry */
11872
11873 rc = walHashGet(pWal, walFramePage(i), &dummy, &aPgno, &iZero);
11874 if( rc!=SQLITE_OK ) break;
11875 pgno = aPgno[i-iZero];
11876 iDbOff = (i64)(pgno-1) * szPage;
11877
11878 if( iDbOff+szPage<=szDb ){
11879 iWalOff = walFrameOffset(i, szPage) + WAL_FRAME_HDRSIZE;
11880 rc = sqlite3OsRead(pWal->pWalFd, pBuf1, szPage, iWalOff);
11881
11882 if( rc==SQLITE_OK ){
11883 rc = sqlite3OsRead(pWal->pDbFd, pBuf2, szPage, iDbOff);
11884 }
11885
11886 if( rc!=SQLITE_OK || 0==memcmp(pBuf1, pBuf2, szPage) ){
11887 break;
11888 }
11889 }
11890
11891 pInfo->nBackfillAttempted = i-1;
11892 }
11893 }
11894
11895 sqlite3_free(pBuf1);
11896 sqlite3_free(pBuf2);
11897 }
11898 walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);
11899 }
11900
11901 return rc;
11902 }
11903 #endif /* SQLITE_ENABLE_SNAPSHOT */
11904
11905 /*
11906 ** Begin a read transaction on the database.
11907 **
11908 ** This routine used to be called sqlite3OpenSnapshot() and with good reason:
11909 ** it takes a snapshot of the state of the WAL and wal-index for the current
11910 ** instant in time. The current thread will continue to use this snapshot.
11911 ** Other threads might append new content to the WAL and wal-index but
11912 ** that extra content is ignored by the current thread.
11913 **
11914 ** If the database contents have changed since the previous read
11915 ** transaction, then *pChanged is set to 1 before returning. The
11916 ** Pager layer will use this to know that its cache is stale and
11917 ** needs to be flushed.
11918 */
11919 SQLITE_PRIVATE int sqlite3WalBeginReadTransaction(Wal *pWal, int *pChanged){
11920 int rc; /* Return code */
11921 int cnt = 0; /* Number of TryBeginRead attempts */
11922
11923 #ifdef SQLITE_ENABLE_SNAPSHOT
11924 int bChanged = 0;
11925 WalIndexHdr *pSnapshot = pWal->pSnapshot;
11926 if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){
11927 bChanged = 1;
11928 }
11929 #endif
11930
11931 do{
11932 rc = walTryBeginRead(pWal, pChanged, 0, ++cnt);
11933 }while( rc==WAL_RETRY );
11934 testcase( (rc&0xff)==SQLITE_BUSY );
11935 testcase( (rc&0xff)==SQLITE_IOERR );
11936 testcase( rc==SQLITE_PROTOCOL );
11937 testcase( rc==SQLITE_OK );
11938
11939 #ifdef SQLITE_ENABLE_SNAPSHOT
11940 if( rc==SQLITE_OK ){
11941 if( pSnapshot && memcmp(pSnapshot, &pWal->hdr, sizeof(WalIndexHdr))!=0 ){
11942 /* At this point the client has a lock on an aReadMark[] slot holding
11943 ** a value equal to or smaller than pSnapshot->mxFrame, but pWal->hdr
11944 ** is populated with the wal-index header corresponding to the head
11945 ** of the wal file. Verify that pSnapshot is still valid before
11946 ** continuing. Reasons why pSnapshot might no longer be valid:
11947 **
11948 ** (1) The WAL file has been reset since the snapshot was taken.
11949 ** In this case, the salt will have changed.
11950 **
11951 ** (2) A checkpoint has been attempted that wrote frames past
11952 ** pSnapshot->mxFrame into the database file. Note that the
11953 ** checkpoint need not have completed for this to cause problems.
11954 */
11955 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
11956
11957 assert( pWal->readLock>0 || pWal->hdr.mxFrame==0 );
11958 assert( pInfo->aReadMark[pWal->readLock]<=pSnapshot->mxFrame );
11959
11960 /* It is possible that there is a checkpointer thread running
11961 ** concurrent with this code. If this is the case, it may be that the
11962 ** checkpointer has already determined that it will checkpoint
11963 ** snapshot X, where X is later in the wal file than pSnapshot, but
11964 ** has not yet set the pInfo->nBackfillAttempted variable to indicate
11965 ** its intent. To avoid the race condition this leads to, ensure that
11966 ** there is no checkpointer process by taking a shared CKPT lock
11967 ** before checking pInfo->nBackfillAttempted.
11968 **
11969 ** TODO: Does the aReadMark[] lock prevent a checkpointer from doing
11970 ** this already?
11971 */
11972 rc = walLockShared(pWal, WAL_CKPT_LOCK);
11973
11974 if( rc==SQLITE_OK ){
11975 /* Check that the wal file has not been wrapped. Assuming that it has
11976 ** not, also check that no checkpointer has attempted to checkpoint any
11977 ** frames beyond pSnapshot->mxFrame. If either of these checks fails,
11978 ** return SQLITE_BUSY_SNAPSHOT. Otherwise, overwrite pWal->hdr
11979 ** with *pSnapshot and set *pChanged as appropriate for opening the
11980 ** snapshot. */
11981 if( !memcmp(pSnapshot->aSalt, pWal->hdr.aSalt, sizeof(pWal->hdr.aSalt))
11982 && pSnapshot->mxFrame>=pInfo->nBackfillAttempted
11983 ){
11984 assert( pWal->readLock>0 );
11985 memcpy(&pWal->hdr, pSnapshot, sizeof(WalIndexHdr));
11986 *pChanged = bChanged;
11987 }else{
11988 rc = SQLITE_BUSY_SNAPSHOT;
11989 }
11990
11991 /* Release the shared CKPT lock obtained above. */
11992 walUnlockShared(pWal, WAL_CKPT_LOCK);
11993 }
11994
11995
11996 if( rc!=SQLITE_OK ){
11997 sqlite3WalEndReadTransaction(pWal);
11998 }
11999 }
12000 }
12001 #endif
12002 return rc;
12003 }
12004
12005 /*
12006 ** Finish with a read transaction. All this does is release the
12007 ** read-lock.
12008 */
12009 SQLITE_PRIVATE void sqlite3WalEndReadTransaction(Wal *pWal){
12010 sqlite3WalEndWriteTransaction(pWal);
12011 if( pWal->readLock>=0 ){
12012 walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
12013 pWal->readLock = -1;
12014 }
12015 }
12016
12017 /*
12018 ** Search the wal file for page pgno. If found, set *piRead to the frame that
12019 ** contains the page. Otherwise, if pgno is not in the wal file, set *piRead
12020 ** to zero.
12021 **
12022 ** Return SQLITE_OK if successful, or an error code if an error occurs. If an
12023 ** error does occur, the final value of *piRead is undefined.
12024 */
12025 SQLITE_PRIVATE int sqlite3WalFindFrame(
12026 Wal *pWal, /* WAL handle */
12027 Pgno pgno, /* Database page number to read data for */
12028 u32 *piRead /* OUT: Frame number (or zero) */
12029 ){
12030 u32 iRead = 0; /* If !=0, WAL frame to return data from */
12031 u32 iLast = pWal->hdr.mxFrame; /* Last page in WAL for this reader */
12032 int iHash; /* Used to loop through N hash tables */
12033 int iMinHash;
12034
12035 /* This routine is only called from within a read transaction. */
12036 assert( pWal->readLock>=0 || pWal->lockError );
12037
12038 /* If the "last page" field of the wal-index header snapshot is 0, then
12039 ** no data will be read from the wal under any circumstances. Return early
12040 ** in this case as an optimization. Likewise, if pWal->readLock==0,
12041 ** then the WAL is ignored by the reader so return early, as if the
12042 ** WAL were empty.
12043 */
12044 if( iLast==0 || pWal->readLock==0 ){
12045 *piRead = 0;
12046 return SQLITE_OK;
12047 }
12048
12049 /* Search the hash table or tables for an entry matching page number
12050 ** pgno. Each iteration of the following for() loop searches one
12051 ** hash table (each hash table indexes up to HASHTABLE_NPAGE frames).
12052 **
12053 ** This code might run concurrently with the code in walIndexAppend()
12054 ** that adds entries to the wal-index (and possibly to this hash
12055 ** table). This means the value just read from the hash
12056 ** slot (aHash[iKey]) may have been added before or after the
12057 ** current read transaction was opened. Values added after the
12058 ** read transaction was opened may have been written incorrectly -
12059 ** i.e. these slots may contain garbage data. However, we assume
12060 ** that any slots written before the current read transaction was
12061 ** opened remain unmodified.
12062 **
12063 ** For the reasons above, the if(...) condition featured in the inner
12064 ** loop of the following block is more stringent than would be required
12065 ** if we had exclusive access to the hash-table:
12066 **
12067 ** (aPgno[iFrame]==pgno):
12068 ** This condition filters out normal hash-table collisions.
12069 **
12070 ** (iFrame<=iLast):
12071 ** This condition filters out entries that were added to the hash
12072 ** table after the current read-transaction had started.
12073 */
12074 iMinHash = walFramePage(pWal->minFrame);
12075 for(iHash=walFramePage(iLast); iHash>=iMinHash && iRead==0; iHash--){
12076 volatile ht_slot *aHash; /* Pointer to hash table */
12077 volatile u32 *aPgno; /* Pointer to array of page numbers */
12078 u32 iZero; /* Frame number corresponding to aPgno[0] */
12079 int iKey; /* Hash slot index */
12080 int nCollide; /* Number of hash collisions remaining */
12081 int rc; /* Error code */
12082
12083 rc = walHashGet(pWal, iHash, &aHash, &aPgno, &iZero);
12084 if( rc!=SQLITE_OK ){
12085 return rc;
12086 }
12087 nCollide = HASHTABLE_NSLOT;
12088 for(iKey=walHash(pgno); aHash[iKey]; iKey=walNextHash(iKey)){
12089 u32 iFrame = aHash[iKey] + iZero;
12090 if( iFrame<=iLast && iFrame>=pWal->minFrame && aPgno[aHash[iKey]]==pgno ){
12091 assert( iFrame>iRead || CORRUPT_DB );
12092 iRead = iFrame;
12093 }
12094 if( (nCollide--)==0 ){
12095 return SQLITE_CORRUPT_BKPT;
12096 }
12097 }
12098 }
12099
12100 #ifdef SQLITE_ENABLE_EXPENSIVE_ASSERT
12101 /* If expensive assert() statements are available, do a linear search
12102 ** of the wal-index file content. Make sure the results agree with the
12103 ** result obtained using the hash indexes above. */
12104 {
12105 u32 iRead2 = 0;
12106 u32 iTest;
12107 assert( pWal->minFrame>0 );
12108 for(iTest=iLast; iTest>=pWal->minFrame; iTest--){
12109 if( walFramePgno(pWal, iTest)==pgno ){
12110 iRead2 = iTest;
12111 break;
12112 }
12113 }
12114 assert( iRead==iRead2 );
12115 }
12116 #endif
12117
12118 *piRead = iRead;
12119 return SQLITE_OK;
12120 }
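The lookup loop in sqlite3WalFindFrame() can be illustrated with a much smaller stand-alone table. The following is only a sketch under stated assumptions: the DemoIndex type, the demo_* functions, and the table sizes are hypothetical, and the multiplicative hash (multiplier 383) with step-by-one probing is an assumption, since walHash() and walNextHash() are not shown in this part of the diff. It shows why the newest matching frame wins and how the iFrame<=iLast test hides frames appended after the reader's snapshot.

```c
#include <stdint.h>

#define DEMO_NSLOT 16   /* hash slots; a power of two (hypothetical size) */
#define DEMO_NPAGE 8    /* max frames indexed; kept below DEMO_NSLOT */

typedef struct DemoIndex {
  uint32_t aPgno[DEMO_NPAGE+1];  /* aPgno[i] is the page held by frame i (1-based) */
  uint16_t aHash[DEMO_NSLOT];    /* 0 means empty, otherwise a frame number */
  uint32_t mxFrame;              /* number of frames appended so far */
} DemoIndex;

/* Assumed hash and probe functions (not taken from this excerpt). */
static int demo_hash(uint32_t pgno){ return (int)((pgno*383u) & (DEMO_NSLOT-1)); }
static int demo_next(int iKey){ return (iKey+1) & (DEMO_NSLOT-1); }

/* Append a frame for page pgno. Returns 0 on success, -1 if the table is full. */
static int demo_append(DemoIndex *p, uint32_t pgno){
  int iKey;
  if( p->mxFrame>=DEMO_NPAGE ) return -1;
  p->mxFrame++;
  p->aPgno[p->mxFrame] = pgno;
  /* Linear probing: take the first empty slot after the home slot. */
  for(iKey=demo_hash(pgno); p->aHash[iKey]; iKey=demo_next(iKey)){}
  p->aHash[iKey] = (uint16_t)p->mxFrame;
  return 0;
}

/* Find the newest frame <= iLast that holds pgno; *piRead is 0 if none.
** Returns 0 normally, -1 if the probe budget runs out (treated as corruption). */
static int demo_find(const DemoIndex *p, uint32_t pgno, uint32_t iLast,
                     uint32_t *piRead){
  uint32_t iRead = 0;
  int nCollide = DEMO_NSLOT;
  int iKey;
  for(iKey=demo_hash(pgno); p->aHash[iKey]; iKey=demo_next(iKey)){
    uint32_t iFrame = p->aHash[iKey];
    if( iFrame<=iLast && p->aPgno[iFrame]==pgno ){
      iRead = iFrame;   /* later slots in the probe chain hold newer frames */
    }
    if( (nCollide--)==0 ) return -1;
  }
  *piRead = iRead;
  return 0;
}
```

A reader whose snapshot ends at frame 2 calls demo_find() with iLast==2 and never sees frame 3, even though frame 3 is already present in the table.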
12121
12122 /*
12123 ** Read the contents of frame iRead from the wal file into buffer pOut
12124 ** (which is nOut bytes in size). Return SQLITE_OK if successful, or an
12125 ** error code otherwise.
12126 */
12127 SQLITE_PRIVATE int sqlite3WalReadFrame(
12128 Wal *pWal, /* WAL handle */
12129 u32 iRead, /* Frame to read */
12130 int nOut, /* Size of buffer pOut in bytes */
12131 u8 *pOut /* Buffer to write page data to */
12132 ){
12133 int sz;
12134 i64 iOffset;
12135 sz = pWal->hdr.szPage;
12136 sz = (sz&0xfe00) + ((sz&0x0001)<<16);
12137 testcase( sz<=32768 );
12138 testcase( sz>=65536 );
12139 iOffset = walFrameOffset(iRead, sz) + WAL_FRAME_HDRSIZE;
12140 /* testcase( IS_BIG_INT(iOffset) ); // requires a 4GiB WAL */
12141 return sqlite3OsRead(pWal->pWalFd, pOut, (nOut>sz ? sz : nOut), iOffset);
12142 }
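The two bit expressions that convert the page size to and from its 16-bit header form (the decode in sqlite3WalReadFrame() and the encode near the end of sqlite3WalFrames()) can be checked in isolation. A minimal sketch, assuming the page size is a power of two between 512 and 65536; the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Encode a page size into 16 bits, mirroring (szPage&0xff00) | (szPage>>16).
** 65536 does not fit in 16 bits, so it is stored as the value 1. */
static uint16_t demo_encode_page_size(uint32_t szPage){
  return (uint16_t)((szPage & 0xff00) | (szPage >> 16));
}

/* Decode the stored value, mirroring (sz&0xfe00) + ((sz&0x0001)<<16). */
static uint32_t demo_decode_page_size(uint16_t stored){
  uint32_t sz = stored;
  return (sz & 0xfe00) + ((sz & 0x0001) << 16);
}
```

The round trip is the identity for every supported size, which is why the testcase() macros in the source probe both the sz<=32768 and sz>=65536 sides.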
12143
12144 /*
12145 ** Return the size of the database in pages (or zero, if unknown).
12146 */
12147 SQLITE_PRIVATE Pgno sqlite3WalDbsize(Wal *pWal){
12148 if( pWal && ALWAYS(pWal->readLock>=0) ){
12149 return pWal->hdr.nPage;
12150 }
12151 return 0;
12152 }
12153
12154
12155 /*
12156 ** This function starts a write transaction on the WAL.
12157 **
12158 ** A read transaction must have already been started by a prior call
12159 ** to sqlite3WalBeginReadTransaction().
12160 **
12161 ** If another thread or process has written into the database since
12162 ** the read transaction was started, then it is not possible for this
12163 ** thread to write as doing so would cause a fork. So this routine
12164 ** returns SQLITE_BUSY in that case and no write transaction is started.
12165 **
12166 ** There can only be a single writer active at a time.
12167 */
12168 SQLITE_PRIVATE int sqlite3WalBeginWriteTransaction(Wal *pWal){
12169 int rc;
12170
12171 /* Cannot start a write transaction without first holding a read
12172 ** transaction. */
12173 assert( pWal->readLock>=0 );
12174 assert( pWal->writeLock==0 && pWal->iReCksum==0 );
12175
12176 if( pWal->readOnly ){
12177 return SQLITE_READONLY;
12178 }
12179
12180 /* Only one writer allowed at a time. Get the write lock. Return
12181 ** SQLITE_BUSY if unable.
12182 */
12183 rc = walLockExclusive(pWal, WAL_WRITE_LOCK, 1);
12184 if( rc ){
12185 return rc;
12186 }
12187 pWal->writeLock = 1;
12188
12189 /* If another connection has written to the database file since the
12190 ** time the read transaction on this connection was started, then
12191 ** the write is disallowed.
12192 */
12193 if( memcmp(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr))!=0 ){
12194 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
12195 pWal->writeLock = 0;
12196 rc = SQLITE_BUSY_SNAPSHOT;
12197 }
12198
12199 return rc;
12200 }
12201
12202 /*
12203 ** End a write transaction. The commit has already been done. This
12204 ** routine merely releases the lock.
12205 */
12206 SQLITE_PRIVATE int sqlite3WalEndWriteTransaction(Wal *pWal){
12207 if( pWal->writeLock ){
12208 walUnlockExclusive(pWal, WAL_WRITE_LOCK, 1);
12209 pWal->writeLock = 0;
12210 pWal->iReCksum = 0;
12211 pWal->truncateOnCommit = 0;
12212 }
12213 return SQLITE_OK;
12214 }
12215
12216 /*
12217 ** If any data has been written (but not committed) to the log file, this
12218 ** function moves the write-pointer back to the start of the transaction.
12219 **
12220 ** Additionally, the callback function is invoked for each frame written
12221 ** to the WAL since the start of the transaction. If the callback returns
12222 ** other than SQLITE_OK, it is not invoked again and the error code is
12223 ** returned to the caller.
12224 **
12225 ** Otherwise, if the callback function does not return an error, this
12226 ** function returns SQLITE_OK.
12227 */
12228 SQLITE_PRIVATE int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){
12229 int rc = SQLITE_OK;
12230 if( ALWAYS(pWal->writeLock) ){
12231 Pgno iMax = pWal->hdr.mxFrame;
12232 Pgno iFrame;
12233
12234 /* Restore the client's cache of the wal-index header to the state it
12235 ** was in before the client began writing to the database.
12236 */
12237 memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));
12238
12239 for(iFrame=pWal->hdr.mxFrame+1;
12240 ALWAYS(rc==SQLITE_OK) && iFrame<=iMax;
12241 iFrame++
12242 ){
12243 /* This call cannot fail. Unless the page for which the page number
12244 ** is passed as the second argument is (a) in the cache and
12245 ** (b) has an outstanding reference, then xUndo is either a no-op
12246 ** (if (a) is false) or simply expels the page from the cache (if (b)
12247 ** is false).
12248 **
12249 ** If the upper layer is doing a rollback, it is guaranteed that there
12250 ** are no outstanding references to any page other than page 1. And
12251 ** page 1 is never written to the log until the transaction is
12252 ** committed. As a result, the call to xUndo cannot fail.
12253 */
12254 assert( walFramePgno(pWal, iFrame)!=1 );
12255 rc = xUndo(pUndoCtx, walFramePgno(pWal, iFrame));
12256 }
12257 if( iMax!=pWal->hdr.mxFrame ) walCleanupHash(pWal);
12258 }
12259 return rc;
12260 }
12261
12262 /*
12263 ** Argument aWalData must point to an array of WAL_SAVEPOINT_NDATA u32
12264 ** values. This function populates the array with values required to
12265 ** "rollback" the write position of the WAL handle back to the current
12266 ** point in the event of a savepoint rollback (via WalSavepointUndo()).
12267 */
12268 SQLITE_PRIVATE void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData){
12269 assert( pWal->writeLock );
12270 aWalData[0] = pWal->hdr.mxFrame;
12271 aWalData[1] = pWal->hdr.aFrameCksum[0];
12272 aWalData[2] = pWal->hdr.aFrameCksum[1];
12273 aWalData[3] = pWal->nCkpt;
12274 }
12275
12276 /*
12277 ** Move the write position of the WAL back to the point identified by
12278 ** the values in the aWalData[] array. aWalData must point to an array
12279 ** of WAL_SAVEPOINT_NDATA u32 values that has been previously populated
12280 ** by a call to WalSavepoint().
12281 */
12282 SQLITE_PRIVATE int sqlite3WalSavepointUndo(Wal *pWal, u32 *aWalData){
12283 int rc = SQLITE_OK;
12284
12285 assert( pWal->writeLock );
12286 assert( aWalData[3]!=pWal->nCkpt || aWalData[0]<=pWal->hdr.mxFrame );
12287
12288 if( aWalData[3]!=pWal->nCkpt ){
12289 /* This savepoint was opened immediately after the write-transaction
12290 ** was started. Right after that, the writer decided to wrap around
12291 ** to the start of the log. Update the savepoint values to match.
12292 */
12293 aWalData[0] = 0;
12294 aWalData[3] = pWal->nCkpt;
12295 }
12296
12297 if( aWalData[0]<pWal->hdr.mxFrame ){
12298 pWal->hdr.mxFrame = aWalData[0];
12299 pWal->hdr.aFrameCksum[0] = aWalData[1];
12300 pWal->hdr.aFrameCksum[1] = aWalData[2];
12301 walCleanupHash(pWal);
12302 }
12303
12304 return rc;
12305 }
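The savepoint pair above can be exercised on a stripped-down structure. This is a sketch only: DemoWal and the demo_* functions are hypothetical stand-ins that keep just the four fields sqlite3WalSavepoint() records, and the hash-table cleanup is omitted.

```c
#include <stdint.h>

typedef struct DemoWal {
  uint32_t mxFrame;          /* last valid frame */
  uint32_t aFrameCksum[2];   /* running checksum of the last frame */
  uint32_t nCkpt;            /* bumped each time the log is restarted */
} DemoWal;

/* Record the current write position in aData[0..3]. */
static void demo_savepoint(const DemoWal *p, uint32_t aData[4]){
  aData[0] = p->mxFrame;
  aData[1] = p->aFrameCksum[0];
  aData[2] = p->aFrameCksum[1];
  aData[3] = p->nCkpt;
}

/* Move the write position back to the recorded point. */
static void demo_savepoint_undo(DemoWal *p, uint32_t aData[4]){
  if( aData[3]!=p->nCkpt ){
    /* The log was restarted after the savepoint was taken, so the
    ** savepoint now refers to the start of the new log. */
    aData[0] = 0;
    aData[3] = p->nCkpt;
  }
  if( aData[0]<p->mxFrame ){
    p->mxFrame = aData[0];
    p->aFrameCksum[0] = aData[1];
    p->aFrameCksum[1] = aData[2];
  }
}
```

The nCkpt comparison is what distinguishes an ordinary rollback from one that crosses a log restart; without it, a stale mxFrame from the old log would be compared against frame numbers in the new one.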
12306
12307 /*
12308 ** This function is called just before writing a set of frames to the log
12309 ** file (see sqlite3WalFrames()). It checks to see if, instead of appending
12310 ** to the current log file, it is possible to overwrite the start of the
12311 ** existing log file with the new frames (i.e. "reset" the log). If so,
12312 ** it sets pWal->hdr.mxFrame to 0. Otherwise, pWal->hdr.mxFrame is left
12313 ** unchanged.
12314 **
12315 ** SQLITE_OK is returned if no error is encountered (regardless of whether
12316 ** or not pWal->hdr.mxFrame is modified). An SQLite error code is returned
12317 ** if an error occurs.
12318 */
12319 static int walRestartLog(Wal *pWal){
12320 int rc = SQLITE_OK;
12321 int cnt;
12322
12323 if( pWal->readLock==0 ){
12324 volatile WalCkptInfo *pInfo = walCkptInfo(pWal);
12325 assert( pInfo->nBackfill==pWal->hdr.mxFrame );
12326 if( pInfo->nBackfill>0 ){
12327 u32 salt1;
12328 sqlite3_randomness(4, &salt1);
12329 rc = walLockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
12330 if( rc==SQLITE_OK ){
12331 /* If all readers are using WAL_READ_LOCK(0) (in other words if no
12332 ** readers are currently using the WAL), then the transaction's
12333 ** frames will overwrite the start of the existing log. Update the
12334 ** wal-index header to reflect this.
12335 **
12336 ** In theory it would be Ok to update the cache of the header only
12337 ** at this point. But updating the actual wal-index header is also
12338 ** safe and means there is no special case for sqlite3WalUndo()
12339 ** to handle if this transaction is rolled back. */
12340 walRestartHdr(pWal, salt1);
12341 walUnlockExclusive(pWal, WAL_READ_LOCK(1), WAL_NREADER-1);
12342 }else if( rc!=SQLITE_BUSY ){
12343 return rc;
12344 }
12345 }
12346 walUnlockShared(pWal, WAL_READ_LOCK(0));
12347 pWal->readLock = -1;
12348 cnt = 0;
12349 do{
12350 int notUsed;
12351 rc = walTryBeginRead(pWal, &notUsed, 1, ++cnt);
12352 }while( rc==WAL_RETRY );
12353 assert( (rc&0xff)!=SQLITE_BUSY ); /* BUSY not possible when useWal==1 */
12354 testcase( (rc&0xff)==SQLITE_IOERR );
12355 testcase( rc==SQLITE_PROTOCOL );
12356 testcase( rc==SQLITE_OK );
12357 }
12358 return rc;
12359 }
12360
12361 /*
12362 ** Information about the current state of the WAL file and where
12363 ** the next fsync should occur - passed from sqlite3WalFrames() into
12364 ** walWriteToLog().
12365 */
12366 typedef struct WalWriter {
12367 Wal *pWal; /* The complete WAL information */
12368 sqlite3_file *pFd; /* The WAL file to which we write */
12369 sqlite3_int64 iSyncPoint; /* Fsync at this offset */
12370 int syncFlags; /* Flags for the fsync */
12371 int szPage; /* Size of one page */
12372 } WalWriter;
12373
12374 /*
12375 ** Write iAmt bytes of content into the WAL file beginning at iOffset.
12376 ** Do a sync when crossing the p->iSyncPoint boundary.
12377 **
12378 ** In other words, if iSyncPoint is in between iOffset and iOffset+iAmt,
12379 ** first write the part before iSyncPoint, then sync, then write the
12380 ** rest.
12381 */
12382 static int walWriteToLog(
12383 WalWriter *p, /* WAL to write to */
12384 void *pContent, /* Content to be written */
12385 int iAmt, /* Number of bytes to write */
12386 sqlite3_int64 iOffset /* Start writing at this offset */
12387 ){
12388 int rc;
12389 if( iOffset<p->iSyncPoint && iOffset+iAmt>=p->iSyncPoint ){
12390 int iFirstAmt = (int)(p->iSyncPoint - iOffset);
12391 rc = sqlite3OsWrite(p->pFd, pContent, iFirstAmt, iOffset);
12392 if( rc ) return rc;
12393 iOffset += iFirstAmt;
12394 iAmt -= iFirstAmt;
12395 pContent = (void*)(iFirstAmt + (char*)pContent);
12396 assert( p->syncFlags & (SQLITE_SYNC_NORMAL|SQLITE_SYNC_FULL) );
12397 rc = sqlite3OsSync(p->pFd, p->syncFlags & SQLITE_SYNC_MASK);
12398 if( iAmt==0 || rc ) return rc;
12399 }
12400 rc = sqlite3OsWrite(p->pFd, pContent, iAmt, iOffset);
12401 return rc;
12402 }
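The split-at-sync-point behaviour of walWriteToLog() can be sketched against an in-memory buffer. Everything here is a hypothetical stand-in: DemoLog replaces the WalWriter and file handle, demo_os_write() replaces sqlite3OsWrite(), and the sync is simulated by recording the offset at which it would happen. Error handling is left out, and the caller is assumed to keep offsets inside the 64-byte buffer.

```c
#include <stdint.h>
#include <string.h>

typedef struct DemoLog {
  unsigned char aData[64];  /* stands in for the WAL file */
  int64_t iSyncPoint;       /* sync when a write crosses this offset */
  int64_t iSyncedAt;        /* offset at which the simulated sync ran, or -1 */
  int nWrite;               /* number of low-level writes issued */
} DemoLog;

/* Stand-in for the OS write call. */
static void demo_os_write(DemoLog *p, const void *pBuf, int n, int64_t iOff){
  memcpy(&p->aData[iOff], pBuf, (size_t)n);
  p->nWrite++;
}

/* Write iAmt bytes at iOffset, syncing when the write reaches iSyncPoint. */
static void demo_write_to_log(DemoLog *p, const void *pContent, int iAmt,
                              int64_t iOffset){
  if( iOffset<p->iSyncPoint && iOffset+iAmt>=p->iSyncPoint ){
    int iFirstAmt = (int)(p->iSyncPoint - iOffset);
    demo_os_write(p, pContent, iFirstAmt, iOffset);  /* part before the point */
    iOffset += iFirstAmt;
    iAmt -= iFirstAmt;
    pContent = (const char*)pContent + iFirstAmt;
    p->iSyncedAt = iOffset;                          /* simulated fsync */
    if( iAmt==0 ) return;
  }
  demo_os_write(p, pContent, iAmt, iOffset);         /* the remainder */
}
```

A 10-byte write at offset 4 with the sync point at 8 becomes two writes with the sync between them; a write that lies entirely past the sync point goes through as a single write.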
12403
12404 /*
12405 ** Write out a single frame of the WAL
12406 */
12407 static int walWriteOneFrame(
12408 WalWriter *p, /* Where to write the frame */
12409 PgHdr *pPage, /* The page of the frame to be written */
12410 int nTruncate, /* The commit flag. Usually 0. >0 for commit */
12411 sqlite3_int64 iOffset /* Byte offset at which to write */
12412 ){
12413 int rc; /* Result code from subfunctions */
12414 void *pData; /* Data actually written */
12415 u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-header in */
12416 #if defined(SQLITE_HAS_CODEC)
12417 if( (pData = sqlite3PagerCodec(pPage))==0 ) return SQLITE_NOMEM_BKPT;
12418 #else
12419 pData = pPage->pData;
12420 #endif
12421 walEncodeFrame(p->pWal, pPage->pgno, nTruncate, pData, aFrame);
12422 rc = walWriteToLog(p, aFrame, sizeof(aFrame), iOffset);
12423 if( rc ) return rc;
12424 /* Write the page data */
12425 rc = walWriteToLog(p, pData, p->szPage, iOffset+sizeof(aFrame));
12426 return rc;
12427 }
12428
12429 /*
12430 ** This function is called as part of committing a transaction within which
12431 ** one or more frames have been overwritten. It updates the checksums for
12432 ** all frames written to the wal file by the current transaction starting
12433 ** with the earliest to have been overwritten.
12434 **
12435 ** SQLITE_OK is returned if successful, or an SQLite error code otherwise.
12436 */
12437 static int walRewriteChecksums(Wal *pWal, u32 iLast){
12438 const int szPage = pWal->szPage;/* Database page size */
12439 int rc = SQLITE_OK; /* Return code */
12440 u8 *aBuf; /* Buffer to load data from wal file into */
12441 u8 aFrame[WAL_FRAME_HDRSIZE]; /* Buffer to assemble frame-headers in */
12442 u32 iRead; /* Next frame to read from wal file */
12443 i64 iCksumOff;
12444
12445 aBuf = sqlite3_malloc(szPage + WAL_FRAME_HDRSIZE);
12446 if( aBuf==0 ) return SQLITE_NOMEM_BKPT;
12447
12448 /* Find the checksum values to use as input for recalculating the
12449 ** first checksum. If the first frame is frame 1 (implying that the current
12450 ** transaction restarted the wal file), these values must be read from the
12451 ** wal-file header. Otherwise, read them from the frame header of the
12452 ** previous frame. */
12453 assert( pWal->iReCksum>0 );
12454 if( pWal->iReCksum==1 ){
12455 iCksumOff = 24;
12456 }else{
12457 iCksumOff = walFrameOffset(pWal->iReCksum-1, szPage) + 16;
12458 }
12459 rc = sqlite3OsRead(pWal->pWalFd, aBuf, sizeof(u32)*2, iCksumOff);
12460 pWal->hdr.aFrameCksum[0] = sqlite3Get4byte(aBuf);
12461 pWal->hdr.aFrameCksum[1] = sqlite3Get4byte(&aBuf[sizeof(u32)]);
12462
12463 iRead = pWal->iReCksum;
12464 pWal->iReCksum = 0;
12465 for(; rc==SQLITE_OK && iRead<=iLast; iRead++){
12466 i64 iOff = walFrameOffset(iRead, szPage);
12467 rc = sqlite3OsRead(pWal->pWalFd, aBuf, szPage+WAL_FRAME_HDRSIZE, iOff);
12468 if( rc==SQLITE_OK ){
12469 u32 iPgno, nDbSize;
12470 iPgno = sqlite3Get4byte(aBuf);
12471 nDbSize = sqlite3Get4byte(&aBuf[4]);
12472
12473 walEncodeFrame(pWal, iPgno, nDbSize, &aBuf[WAL_FRAME_HDRSIZE], aFrame);
12474 rc = sqlite3OsWrite(pWal->pWalFd, aFrame, sizeof(aFrame), iOff);
12475 }
12476 }
12477
12478 sqlite3_free(aBuf);
12479 return rc;
12480 }
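The offsets that walRewriteChecksums() reads its seed checksum from follow directly from the file layout. The sketch below assumes a 32-byte WAL header and a 24-byte frame header, as described in the SQLite WAL file-format documentation; neither constant's value, nor the definition of walFrameOffset(), appears in this part of the diff, so the DEMO_* names and both functions are hypothetical.

```c
#include <stdint.h>

#define DEMO_WAL_HDRSIZE        32   /* assumed size of the WAL file header */
#define DEMO_WAL_FRAME_HDRSIZE  24   /* assumed size of each frame header */

/* Byte offset of frame iFrame (1-based) in the WAL file. */
static int64_t demo_frame_offset(uint32_t iFrame, int szPage){
  return DEMO_WAL_HDRSIZE
       + (int64_t)(iFrame-1) * (szPage + DEMO_WAL_FRAME_HDRSIZE);
}

/* Offset of the two checksum words that seed the checksum of frame
** iReCksum: bytes 24..31 of the WAL header for frame 1, otherwise
** bytes 16..23 of the previous frame's header. */
static int64_t demo_cksum_seed_offset(uint32_t iReCksum, int szPage){
  if( iReCksum==1 ) return 24;
  return demo_frame_offset(iReCksum-1, szPage) + 16;
}
```

With 4096-byte pages, frame 2 starts at byte 4152, and the seed for recomputing its checksum sits at byte 48, inside the header of frame 1.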
12481
12482 /*
12483 ** Write a set of frames to the log. The caller must hold the write-lock
12484 ** on the log file (obtained using sqlite3WalBeginWriteTransaction()).
12485 */
12486 SQLITE_PRIVATE int sqlite3WalFrames(
12487 Wal *pWal, /* Wal handle to write to */
12488 int szPage, /* Database page-size in bytes */
12489 PgHdr *pList, /* List of dirty pages to write */
12490 Pgno nTruncate, /* Database size after this commit */
12491 int isCommit, /* True if this is a commit */
12492 int sync_flags /* Flags to pass to OsSync() (or 0) */
12493 ){
12494 int rc; /* Used to catch return codes */
12495 u32 iFrame; /* Next frame address */
12496 PgHdr *p; /* Iterator to run through pList with. */
12497 PgHdr *pLast = 0; /* Last frame in list */
12498 int nExtra = 0; /* Number of extra copies of last page */
12499 int szFrame; /* The size of a single frame */
12500 i64 iOffset; /* Next byte to write in WAL file */
12501 WalWriter w; /* The writer */
12502 u32 iFirst = 0; /* First frame that may be overwritten */
12503 WalIndexHdr *pLive; /* Pointer to shared header */
12504
12505 assert( pList );
12506 assert( pWal->writeLock );
12507
12508 /* If this frame set completes a transaction, then nTruncate>0. If
12509 ** nTruncate==0 then this frame set does not complete the transaction. */
12510 assert( (isCommit!=0)==(nTruncate!=0) );
12511
12512 #if defined(SQLITE_TEST) && defined(SQLITE_DEBUG)
12513 { int cnt; for(cnt=0, p=pList; p; p=p->pDirty, cnt++){}
12514 WALTRACE(("WAL%p: frame write begin. %d frames. mxFrame=%d. %s\n",
12515 pWal, cnt, pWal->hdr.mxFrame, isCommit ? "Commit" : "Spill"));
12516 }
12517 #endif
12518
12519 pLive = (WalIndexHdr*)walIndexHdr(pWal);
12520 if( memcmp(&pWal->hdr, (void *)pLive, sizeof(WalIndexHdr))!=0 ){
12521 iFirst = pLive->mxFrame+1;
12522 }
12523
12524 /* See if it is possible to write these frames into the start of the
12525 ** log file, instead of appending to it at pWal->hdr.mxFrame.
12526 */
12527 if( SQLITE_OK!=(rc = walRestartLog(pWal)) ){
12528 return rc;
12529 }
12530
12531 /* If this is the first frame written into the log, write the WAL
12532 ** header to the start of the WAL file. See comments at the top of
12533 ** this source file for a description of the WAL header format.
12534 */
12535 iFrame = pWal->hdr.mxFrame;
12536 if( iFrame==0 ){
12537 u8 aWalHdr[WAL_HDRSIZE]; /* Buffer to assemble wal-header in */
12538 u32 aCksum[2]; /* Checksum for wal-header */
12539
12540 sqlite3Put4byte(&aWalHdr[0], (WAL_MAGIC | SQLITE_BIGENDIAN));
12541 sqlite3Put4byte(&aWalHdr[4], WAL_MAX_VERSION);
12542 sqlite3Put4byte(&aWalHdr[8], szPage);
12543 sqlite3Put4byte(&aWalHdr[12], pWal->nCkpt);
12544 if( pWal->nCkpt==0 ) sqlite3_randomness(8, pWal->hdr.aSalt);
12545 memcpy(&aWalHdr[16], pWal->hdr.aSalt, 8);
12546 walChecksumBytes(1, aWalHdr, WAL_HDRSIZE-2*4, 0, aCksum);
12547 sqlite3Put4byte(&aWalHdr[24], aCksum[0]);
12548 sqlite3Put4byte(&aWalHdr[28], aCksum[1]);
12549
12550 pWal->szPage = szPage;
12551 pWal->hdr.bigEndCksum = SQLITE_BIGENDIAN;
12552 pWal->hdr.aFrameCksum[0] = aCksum[0];
12553 pWal->hdr.aFrameCksum[1] = aCksum[1];
12554 pWal->truncateOnCommit = 1;
12555
12556 rc = sqlite3OsWrite(pWal->pWalFd, aWalHdr, sizeof(aWalHdr), 0);
12557 WALTRACE(("WAL%p: wal-header write %s\n", pWal, rc ? "failed" : "ok"));
12558 if( rc!=SQLITE_OK ){
12559 return rc;
12560 }
12561
12562 /* Sync the header (unless SQLITE_IOCAP_SEQUENTIAL is true or unless
12563 ** all syncing is turned off by PRAGMA synchronous=OFF). Otherwise
12564 ** an out-of-order write following a WAL restart could result in
12565 ** database corruption. See the ticket:
12566 **
12567 ** http://localhost:591/sqlite/info/ff5be73dee
12568 */
12569 if( pWal->syncHeader && sync_flags ){
12570 rc = sqlite3OsSync(pWal->pWalFd, sync_flags & SQLITE_SYNC_MASK);
12571 if( rc ) return rc;
12572 }
12573 }
12574 assert( (int)pWal->szPage==szPage );
12575
12576 /* Setup information needed to write frames into the WAL */
12577 w.pWal = pWal;
12578 w.pFd = pWal->pWalFd;
12579 w.iSyncPoint = 0;
12580 w.syncFlags = sync_flags;
12581 w.szPage = szPage;
12582 iOffset = walFrameOffset(iFrame+1, szPage);
12583 szFrame = szPage + WAL_FRAME_HDRSIZE;
12584
12585 /* Write all frames into the log file exactly once */
12586 for(p=pList; p; p=p->pDirty){
12587 int nDbSize; /* 0 normally. Positive == commit flag */
12588
12589 /* Check if this page has already been written into the wal file by
12590 ** the current transaction. If so, overwrite the existing frame and
12591 ** record the earliest such frame in pWal->iReCksum - indicating that
12592 ** checksums must be recomputed when the transaction is committed. */
12593 if( iFirst && (p->pDirty || isCommit==0) ){
12594 u32 iWrite = 0;
12595 VVA_ONLY(rc =) sqlite3WalFindFrame(pWal, p->pgno, &iWrite);
12596 assert( rc==SQLITE_OK || iWrite==0 );
12597 if( iWrite>=iFirst ){
12598 i64 iOff = walFrameOffset(iWrite, szPage) + WAL_FRAME_HDRSIZE;
12599 void *pData;
12600 if( pWal->iReCksum==0 || iWrite<pWal->iReCksum ){
12601 pWal->iReCksum = iWrite;
12602 }
12603 #if defined(SQLITE_HAS_CODEC)
12604 if( (pData = sqlite3PagerCodec(p))==0 ) return SQLITE_NOMEM;
12605 #else
12606 pData = p->pData;
12607 #endif
12608 rc = sqlite3OsWrite(pWal->pWalFd, pData, szPage, iOff);
12609 if( rc ) return rc;
12610 p->flags &= ~PGHDR_WAL_APPEND;
12611 continue;
12612 }
12613 }
12614
12615 iFrame++;
12616 assert( iOffset==walFrameOffset(iFrame, szPage) );
12617 nDbSize = (isCommit && p->pDirty==0) ? nTruncate : 0;
12618 rc = walWriteOneFrame(&w, p, nDbSize, iOffset);
12619 if( rc ) return rc;
12620 pLast = p;
12621 iOffset += szFrame;
12622 p->flags |= PGHDR_WAL_APPEND;
12623 }
12624
12625 /* Recalculate checksums within the wal file if required. */
12626 if( isCommit && pWal->iReCksum ){
12627 rc = walRewriteChecksums(pWal, iFrame);
12628 if( rc ) return rc;
12629 }
12630
12631 /* If this is the end of a transaction, then we might need to pad
12632 ** the transaction and/or sync the WAL file.
12633 **
12634 ** Padding and syncing only occur if this set of frames completes a
12635 ** transaction and if PRAGMA synchronous=FULL. If synchronous==NORMAL
12636 ** or synchronous==OFF, then no padding or syncing is needed.
12637 **
12638 ** If SQLITE_IOCAP_POWERSAFE_OVERWRITE is defined, then padding is not
12639 ** needed and only the sync is done. If padding is needed, then the
12640 ** final frame is repeated (with its commit mark) until the next sector
12641 ** boundary is crossed. Only the part of the WAL prior to the last
12642 ** sector boundary is synced; the part of the last frame that extends
12643 ** past the sector boundary is written after the sync.
12644 */
12645 if( isCommit && (sync_flags & WAL_SYNC_TRANSACTIONS)!=0 ){
12646 int bSync = 1;
12647 if( pWal->padToSectorBoundary ){
12648 int sectorSize = sqlite3SectorSize(pWal->pWalFd);
12649 w.iSyncPoint = ((iOffset+sectorSize-1)/sectorSize)*sectorSize;
12650 bSync = (w.iSyncPoint==iOffset);
12651 testcase( bSync );
12652 while( iOffset<w.iSyncPoint ){
12653 rc = walWriteOneFrame(&w, pLast, nTruncate, iOffset);
12654 if( rc ) return rc;
12655 iOffset += szFrame;
12656 nExtra++;
12657 }
12658 }
12659 if( bSync ){
12660 assert( rc==SQLITE_OK );
12661 rc = sqlite3OsSync(w.pFd, sync_flags & SQLITE_SYNC_MASK);
12662 }
12663 }
12664
12665 /* If this frame set completes the first transaction in the WAL and
12666 ** if PRAGMA journal_size_limit is set, then truncate the WAL to the
12667 ** journal size limit, if possible.
12668 */
12669 if( isCommit && pWal->truncateOnCommit && pWal->mxWalSize>=0 ){
12670 i64 sz = pWal->mxWalSize;
12671 if( walFrameOffset(iFrame+nExtra+1, szPage)>pWal->mxWalSize ){
12672 sz = walFrameOffset(iFrame+nExtra+1, szPage);
12673 }
12674 walLimitSize(pWal, sz);
12675 pWal->truncateOnCommit = 0;
12676 }
12677
12678 /* Append data to the wal-index. It is not necessary to lock the
12679 ** wal-index to do this as the SQLITE_SHM_WRITE lock held on the wal-index
12680 ** guarantees that there are no other writers, and no data that may
12681 ** be in use by existing readers is being overwritten.
12682 */
12683 iFrame = pWal->hdr.mxFrame;
12684 for(p=pList; p && rc==SQLITE_OK; p=p->pDirty){
12685 if( (p->flags & PGHDR_WAL_APPEND)==0 ) continue;
12686 iFrame++;
12687 rc = walIndexAppend(pWal, iFrame, p->pgno);
12688 }
12689 while( rc==SQLITE_OK && nExtra>0 ){
12690 iFrame++;
12691 nExtra--;
12692 rc = walIndexAppend(pWal, iFrame, pLast->pgno);
12693 }
12694
12695 if( rc==SQLITE_OK ){
12696 /* Update the private copy of the header. */
12697 pWal->hdr.szPage = (u16)((szPage&0xff00) | (szPage>>16));
12698 testcase( szPage<=32768 );
12699 testcase( szPage>=65536 );
12700 pWal->hdr.mxFrame = iFrame;
12701 if( isCommit ){
12702 pWal->hdr.iChange++;
12703 pWal->hdr.nPage = nTruncate;
12704 }
12705 /* If this is a commit, update the wal-index header too. */
12706 if( isCommit ){
12707 walIndexWriteHdr(pWal);
12708 pWal->iCallback = iFrame;
12709 }
12710 }
12711
12712 WALTRACE(("WAL%p: frame write %s\n", pWal, rc ? "failed" : "ok"));
12713 return rc;
12714 }
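The sector-padding arithmetic in the commit path of sqlite3WalFrames() is easier to follow with concrete numbers. A minimal sketch with hypothetical helper names; it reproduces only the rounding and the padding loop, not the frame writes themselves.

```c
#include <stdint.h>

/* Round iOffset up to the next multiple of sectorSize (the sync point). */
static int64_t demo_sync_point(int64_t iOffset, int sectorSize){
  return ((iOffset + sectorSize - 1) / sectorSize) * sectorSize;
}

/* Number of extra copies of the last frame needed so that the written
** region reaches or crosses the sync point. */
static int demo_pad_frames(int64_t iOffset, int szFrame, int sectorSize){
  int64_t iSyncPoint = demo_sync_point(iOffset, sectorSize);
  int nExtra = 0;
  while( iOffset<iSyncPoint ){
    iOffset += szFrame;
    nExtra++;
  }
  return nExtra;
}
```

When the commit already ends on a sector boundary, no padding frames are written and the explicit sync at the end runs (bSync is 1). Otherwise the sync is triggered inside the padding writes, as the write crosses the sync point.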
12715
12716 /*
12717 ** This routine is called to implement sqlite3_wal_checkpoint() and
12718 ** related interfaces.
12719 **
12720 ** Obtain a CHECKPOINT lock and then backfill as much information as
12721 ** we can from WAL into the database.
12722 **
12723 ** If parameter xBusy is not NULL, it is a pointer to a busy-handler
12724 ** callback. In this case this function runs a blocking checkpoint.
12725 */
12726 SQLITE_PRIVATE int sqlite3WalCheckpoint(
12727 Wal *pWal, /* Wal connection */
12728 sqlite3 *db, /* Check this handle's interrupt flag */
12729 int eMode, /* PASSIVE, FULL, RESTART, or TRUNCATE */
12730 int (*xBusy)(void*), /* Function to call when busy */
12731 void *pBusyArg, /* Context argument for xBusy */
12732 int sync_flags, /* Flags to sync db file with (or 0) */
12733 int nBuf, /* Size of temporary buffer */
12734 u8 *zBuf, /* Temporary buffer to use */
12735 int *pnLog, /* OUT: Number of frames in WAL */
12736 int *pnCkpt /* OUT: Number of backfilled frames in WAL */
12737 ){
12738 int rc; /* Return code */
12739 int isChanged = 0; /* True if a new wal-index header is loaded */
12740 int eMode2 = eMode; /* Mode to pass to walCheckpoint() */
12741 int (*xBusy2)(void*) = xBusy; /* Busy handler for eMode2 */
12742
12743 assert( pWal->ckptLock==0 );
12744 assert( pWal->writeLock==0 );
12745
12746 /* EVIDENCE-OF: R-62920-47450 The busy-handler callback is never invoked
12747 ** in the SQLITE_CHECKPOINT_PASSIVE mode. */
12748 assert( eMode!=SQLITE_CHECKPOINT_PASSIVE || xBusy==0 );
12749
12750 if( pWal->readOnly ) return SQLITE_READONLY;
12751 WALTRACE(("WAL%p: checkpoint begins\n", pWal));
12752
12753 /* IMPLEMENTATION-OF: R-62028-47212 All calls obtain an exclusive
12754 ** "checkpoint" lock on the database file. */
12755 rc = walLockExclusive(pWal, WAL_CKPT_LOCK, 1);
12756 if( rc ){
12757 /* EVIDENCE-OF: R-10421-19736 If any other process is running a
12758 ** checkpoint operation at the same time, the lock cannot be obtained and
12759 ** SQLITE_BUSY is returned.
12760 ** EVIDENCE-OF: R-53820-33897 Even if there is a busy-handler configured,
12761 ** it will not be invoked in this case.
12762 */
12763 testcase( rc==SQLITE_BUSY );
12764 testcase( xBusy!=0 );
12765 return rc;
12766 }
12767 pWal->ckptLock = 1;
12768
12769 /* IMPLEMENTATION-OF: R-59782-36818 The SQLITE_CHECKPOINT_FULL, RESTART and
12770 ** TRUNCATE modes also obtain the exclusive "writer" lock on the database
12771 ** file.
12772 **
12773 ** EVIDENCE-OF: R-60642-04082 If the writer lock cannot be obtained
12774 ** immediately, and a busy-handler is configured, it is invoked and the
12775 ** writer lock retried until either the busy-handler returns 0 or the
12776 ** lock is successfully obtained.
12777 */
12778 if( eMode!=SQLITE_CHECKPOINT_PASSIVE ){
12779 rc = walBusyLock(pWal, xBusy, pBusyArg, WAL_WRITE_LOCK, 1);
12780 if( rc==SQLITE_OK ){
12781 pWal->writeLock = 1;
12782 }else if( rc==SQLITE_BUSY ){
12783 eMode2 = SQLITE_CHECKPOINT_PASSIVE;
12784 xBusy2 = 0;
12785 rc = SQLITE_OK;
12786 }
12787 }
12788
12789 /* Read the wal-index header. */
12790 if( rc==SQLITE_OK ){
12791 rc = walIndexReadHdr(pWal, &isChanged);
12792 if( isChanged && pWal->pDbFd->pMethods->iVersion>=3 ){
12793 sqlite3OsUnfetch(pWal->pDbFd, 0, 0);
12794 }
12795 }
12796
12797 /* Copy data from the log to the database file. */
12798 if( rc==SQLITE_OK ){
12799
12800 if( pWal->hdr.mxFrame && walPagesize(pWal)!=nBuf ){
12801 rc = SQLITE_CORRUPT_BKPT;
12802 }else{
12803 rc = walCheckpoint(pWal, db, eMode2, xBusy2, pBusyArg, sync_flags, zBuf);
12804 }
12805
12806 /* If no error occurred, set the output variables. */
12807 if( rc==SQLITE_OK || rc==SQLITE_BUSY ){
12808 if( pnLog ) *pnLog = (int)pWal->hdr.mxFrame;
12809 if( pnCkpt ) *pnCkpt = (int)(walCkptInfo(pWal)->nBackfill);
12810 }
12811 }
12812
12813 if( isChanged ){
12814 /* If a new wal-index header was loaded before the checkpoint was
12815 ** performed, then the pager-cache associated with pWal is now
12816 ** out of date. So zero the cached wal-index header to ensure that
12817 ** next time the pager opens a snapshot on this database it knows that
12818 ** the cache needs to be reset.
12819 */
12820 memset(&pWal->hdr, 0, sizeof(WalIndexHdr));
12821 }
12822
12823 /* Release the locks. */
12824 sqlite3WalEndWriteTransaction(pWal);
12825 walUnlockExclusive(pWal, WAL_CKPT_LOCK, 1);
12826 pWal->ckptLock = 0;
12827 WALTRACE(("WAL%p: checkpoint %s\n", pWal, rc ? "failed" : "ok"));
12828 return (rc==SQLITE_OK && eMode!=eMode2 ? SQLITE_BUSY : rc);
12829 }
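The mode fallback and final return code above can be sketched in isolation. This is a minimal sketch, not SQLite's code: `ckpt_result`, the `RC_*` and `CKPT_*` constants, and the `writer_lock_rc` parameter are hypothetical stand-ins, and the checkpoint itself is assumed to succeed in whatever mode ends up being used.

```c
#include <assert.h>

/* Hypothetical stand-ins for the result codes and checkpoint modes. */
enum { RC_OK = 0, RC_BUSY = 5 };
enum { CKPT_PASSIVE = 0, CKPT_FULL = 1 };

/* Given the requested mode and the outcome of the attempt to take the
** writer lock, return the code the caller sees when the checkpoint
** itself (run in the effective mode) succeeds. */
static int ckpt_result(int eMode, int writer_lock_rc){
  int eMode2 = eMode;            /* Effective mode after any downgrade */
  int rc = RC_OK;
  assert( eMode>=CKPT_PASSIVE );
  if( eMode!=CKPT_PASSIVE ){
    rc = writer_lock_rc;
    if( rc==RC_BUSY ){
      eMode2 = CKPT_PASSIVE;     /* Fall back to a passive checkpoint */
      rc = RC_OK;
    }
  }
  /* A downgraded checkpoint that otherwise succeeded still reports BUSY. */
  return (rc==RC_OK && eMode!=eMode2) ? RC_BUSY : rc;
}
```

In this sketch a passive request never touches the writer lock, so it cannot be downgraded and never reports BUSY for that reason.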
12830
12831 /* Return the value to pass to a sqlite3_wal_hook callback, the
12832 ** number of frames in the WAL at the point of the last commit since
12833 ** sqlite3WalCallback() was called. If no commits have occurred since
12834 ** the last call, then return 0.
12835 */
12836 SQLITE_PRIVATE int sqlite3WalCallback(Wal *pWal){
12837 u32 ret = 0;
12838 if( pWal ){
12839 ret = pWal->iCallback;
12840 pWal->iCallback = 0;
12841 }
12842 return (int)ret;
12843 }
12844
12845 /*
12846 ** This function is called to change the WAL subsystem into or out
12847 ** of locking_mode=EXCLUSIVE.
12848 **
12849 ** If op is zero, then attempt to change from locking_mode=EXCLUSIVE
12850 ** into locking_mode=NORMAL. This means that we must acquire a lock
12851 ** on the pWal->readLock byte. If the WAL is already in locking_mode=NORMAL
12852 ** or if the acquisition of the lock fails, then return 0. If the
12853 ** transition out of exclusive-mode is successful, return 1. This
12854 ** operation must occur while the pager is still holding the exclusive
12855 ** lock on the main database file.
12856 **
12857 ** If op is one, then change from locking_mode=NORMAL into
12858 ** locking_mode=EXCLUSIVE. This means that the pWal->readLock must
12859 ** be released. Return 1 if the transition is made and 0 if the
12860 ** WAL is already in exclusive-locking mode - meaning that this
12861 ** routine is a no-op. The pager must already hold the exclusive lock
12862 ** on the main database file before invoking this operation.
12863 **
12864 ** If op is negative, then do a dry-run of the op==1 case but do
12865 ** not actually change anything. The pager uses this to see if it
12866 ** should acquire the database exclusive lock prior to invoking
12867 ** the op==1 case.
12868 */
12869 SQLITE_PRIVATE int sqlite3WalExclusiveMode(Wal *pWal, int op){
12870 int rc;
12871 assert( pWal->writeLock==0 );
12872 assert( pWal->exclusiveMode!=WAL_HEAPMEMORY_MODE || op==-1 );
12873
12874 /* pWal->readLock is usually set, but might be -1 if there was a
12875 ** prior error while attempting to acquire a read-lock. This cannot
12876 ** happen if the connection is actually in exclusive mode (as no xShmLock
12877 ** locks are taken in this case). Nor should the pager attempt to
12878 ** upgrade to exclusive-mode following such an error.
12879 */
12880 assert( pWal->readLock>=0 || pWal->lockError );
12881 assert( pWal->readLock>=0 || (op<=0 && pWal->exclusiveMode==0) );
12882
12883 if( op==0 ){
12884 if( pWal->exclusiveMode ){
12885 pWal->exclusiveMode = 0;
12886 if( walLockShared(pWal, WAL_READ_LOCK(pWal->readLock))!=SQLITE_OK ){
12887 pWal->exclusiveMode = 1;
12888 }
12889 rc = pWal->exclusiveMode==0;
12890 }else{
12891 /* Already in locking_mode=NORMAL */
12892 rc = 0;
12893 }
12894 }else if( op>0 ){
12895 assert( pWal->exclusiveMode==0 );
12896 assert( pWal->readLock>=0 );
12897 walUnlockShared(pWal, WAL_READ_LOCK(pWal->readLock));
12898 pWal->exclusiveMode = 1;
12899 rc = 1;
12900 }else{
12901 rc = pWal->exclusiveMode==0;
12902 }
12903 return rc;
12904 }
12905
12906 /*
12907 ** Return true if the argument is non-NULL and the WAL module is using
12908 ** heap-memory for the wal-index. Otherwise, if the argument is NULL or the
12909 ** WAL module is using shared-memory, return false.
12910 */
12911 SQLITE_PRIVATE int sqlite3WalHeapMemory(Wal *pWal){
12912 return (pWal && pWal->exclusiveMode==WAL_HEAPMEMORY_MODE );
12913 }
12914
12915 #ifdef SQLITE_ENABLE_SNAPSHOT
12916 /* Create a snapshot object. The content of a snapshot is opaque to
12917 ** every other subsystem, so the WAL module can put whatever it needs
12918 ** in the object.
12919 */
12920 SQLITE_PRIVATE int sqlite3WalSnapshotGet(Wal *pWal, sqlite3_snapshot **ppSnapshot){
12921 int rc = SQLITE_OK;
12922 WalIndexHdr *pRet;
12923 static const u32 aZero[4] = { 0, 0, 0, 0 };
12924
12925 assert( pWal->readLock>=0 && pWal->writeLock==0 );
12926
12927 if( memcmp(&pWal->hdr.aFrameCksum[0],aZero,16)==0 ){
12928 *ppSnapshot = 0;
12929 return SQLITE_ERROR;
12930 }
12931 pRet = (WalIndexHdr*)sqlite3_malloc(sizeof(WalIndexHdr));
12932 if( pRet==0 ){
12933 rc = SQLITE_NOMEM_BKPT;
12934 }else{
12935 memcpy(pRet, &pWal->hdr, sizeof(WalIndexHdr));
12936 *ppSnapshot = (sqlite3_snapshot*)pRet;
12937 }
12938
12939 return rc;
12940 }
12941
12942 /* Try to open on pSnapshot when the next read-transaction starts
12943 */
12944 SQLITE_PRIVATE void sqlite3WalSnapshotOpen(Wal *pWal, sqlite3_snapshot *pSnapshot){
12945 pWal->pSnapshot = (WalIndexHdr*)pSnapshot;
12946 }
12947
12948 /*
12949 ** Return a +ve value if snapshot p1 is newer than p2. A -ve value if
12950 ** p1 is older than p2 and zero if p1 and p2 are the same snapshot.
12951 */
12952 SQLITE_API int sqlite3_snapshot_cmp(sqlite3_snapshot *p1, sqlite3_snapshot *p2){
12953 WalIndexHdr *pHdr1 = (WalIndexHdr*)p1;
12954 WalIndexHdr *pHdr2 = (WalIndexHdr*)p2;
12955
12956 /* aSalt[0] is a copy of the value stored in the wal file header. It
12957 ** is incremented each time the wal file is restarted. */
12958 if( pHdr1->aSalt[0]<pHdr2->aSalt[0] ) return -1;
12959 if( pHdr1->aSalt[0]>pHdr2->aSalt[0] ) return +1;
12960 if( pHdr1->mxFrame<pHdr2->mxFrame ) return -1;
12961 if( pHdr1->mxFrame>pHdr2->mxFrame ) return +1;
12962 return 0;
12963 }
12964 #endif /* SQLITE_ENABLE_SNAPSHOT */
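The ordering used by sqlite3_snapshot_cmp() is a two-key comparison: first the restart counter copied from the wal header, then the frame count. A standalone sketch of that ordering follows. The `SnapKey` type and `snap_cmp` name are hypothetical and hold only the two fields the comparison reads; they are not the real WalIndexHdr layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced snapshot: just the two fields that are compared. */
typedef struct SnapKey {
  uint32_t salt0;    /* Incremented each time the wal file is restarted */
  uint32_t mxFrame;  /* Number of valid frames at snapshot time */
} SnapKey;

/* Return -1, 0 or +1 as a is older than, the same as, or newer than b. */
static int snap_cmp(const SnapKey *a, const SnapKey *b){
  assert( a!=0 && b!=0 );
  if( a->salt0!=b->salt0 ) return a->salt0<b->salt0 ? -1 : +1;
  if( a->mxFrame!=b->mxFrame ) return a->mxFrame<b->mxFrame ? -1 : +1;
  return 0;
}
```

The frame count only breaks ties: a snapshot taken after a wal restart compares as newer even when it has fewer frames.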
12965
12966 #ifdef SQLITE_ENABLE_ZIPVFS
12967 /*
12968 ** If the argument is not NULL, it points to a Wal object that holds a
12969 ** read-lock. This function returns the database page-size if it is known,
12970 ** or zero if it is not (or if pWal is NULL).
12971 */
12972 SQLITE_PRIVATE int sqlite3WalFramesize(Wal *pWal){
12973 assert( pWal==0 || pWal->readLock>=0 );
12974 return (pWal ? pWal->szPage : 0);
12975 }
12976 #endif
12977
12978 /* Return the sqlite3_file object for the WAL file
12979 */
12980 SQLITE_PRIVATE sqlite3_file *sqlite3WalFile(Wal *pWal){
12981 return pWal->pWalFd;
12982 }
12983
12984 #endif /* #ifndef SQLITE_OMIT_WAL */
12985
12986 /************** End of wal.c *************************************************/
12987 /************** Begin file btmutex.c *****************************************/
12988 /*
12989 ** 2007 August 27
12990 **
12991 ** The author disclaims copyright to this source code. In place of
12992 ** a legal notice, here is a blessing:
12993 **
12994 ** May you do good and not evil.
12995 ** May you find forgiveness for yourself and forgive others.
12996 ** May you share freely, never taking more than you give.
12997 **
12998 *************************************************************************
12999 **
13000 ** This file contains code used to implement mutexes on Btree objects.
13001 ** This code really belongs in btree.c. But btree.c is getting too
13002 ** big and we want to break it down some. This package seemed like
13003 ** a good breakout.
13004 */
13005 /************** Include btreeInt.h in the middle of btmutex.c ****************/
13006 /************** Begin file btreeInt.h ****************************************/
13007 /*
13008 ** 2004 April 6
13009 **
13010 ** The author disclaims copyright to this source code. In place of
13011 ** a legal notice, here is a blessing:
13012 **
13013 ** May you do good and not evil.
13014 ** May you find forgiveness for yourself and forgive others.
13015 ** May you share freely, never taking more than you give.
13016 **
13017 *************************************************************************
13018 ** This file implements an external (disk-based) database using BTrees.
13019 ** For a detailed discussion of BTrees, refer to
13020 **
13021 ** Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3:
13022 ** "Sorting And Searching", pages 473-480. Addison-Wesley
13023 ** Publishing Company, Reading, Massachusetts.
13024 **
13025 ** The basic idea is that each page of the file contains N database
13026 ** entries and N+1 pointers to subpages.
13027 **
13028 ** ----------------------------------------------------------------
13029 ** | Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N-1) | Ptr(N) |
13030 ** ----------------------------------------------------------------
13031 **
13032 ** All of the keys on the page that Ptr(0) points to have values less
13033 ** than Key(0). All of the keys on page Ptr(1) and its subpages have
13034 ** values greater than Key(0) and less than Key(1). All of the keys
13035 ** on Ptr(N) and its subpages have values greater than Key(N-1). And
13036 ** so forth.
13037 **
13038 ** Finding a particular key requires reading O(log(M)) pages from the
13039 ** disk where M is the number of entries in the tree.
13040 **
13041 ** In this implementation, a single file can hold one or more separate
13042 ** BTrees. Each BTree is identified by the index of its root page. The
13043 ** key and data for any entry are combined to form the "payload". A
13044 ** fixed amount of payload can be carried directly on the database
13045 ** page. If the payload is larger than the preset amount then surplus
13046 ** bytes are stored on overflow pages. The payload for an entry
13047 ** and the preceding pointer are combined to form a "Cell". Each
13048 ** page has a small header which contains the Ptr(N) pointer and other
13049 ** information such as the size of key and data.
13050 **
13051 ** FORMAT DETAILS
13052 **
13053 ** The file is divided into pages. The first page is called page 1,
13054 ** the second is page 2, and so forth. A page number of zero indicates
13055 ** "no such page". The page size can be any power of 2 between 512 and 65536.
13056 ** Each page can be either a btree page, a freelist page, an overflow
13057 ** page, or a pointer-map page.
13058 **
13059 ** The first page is always a btree page. The first 100 bytes of the first
13060 ** page contain a special header (the "file header") that describes the file.
13061 ** The format of the file header is as follows:
13062 **
13063 ** OFFSET SIZE DESCRIPTION
13064 ** 0 16 Header string: "SQLite format 3\000"
13065 ** 16 2 Page size in bytes. (1 means 65536)
13066 ** 18 1 File format write version
13067 ** 19 1 File format read version
13068 ** 20 1 Bytes of unused space at the end of each page
13069 ** 21 1 Max embedded payload fraction (must be 64)
13070 ** 22 1 Min embedded payload fraction (must be 32)
13071 ** 23 1 Min leaf payload fraction (must be 32)
13072 ** 24 4 File change counter
13073 ** 28 4 Reserved for future use
13074 ** 32 4 First freelist page
13075 ** 36 4 Number of freelist pages in the file
13076 ** 40 60 15 4-byte meta values passed to higher layers
13077 **
13078 ** 40 4 Schema cookie
13079 ** 44 4 File format of schema layer
13080 ** 48 4 Size of page cache
13081 ** 52 4 Largest root-page (auto/incr_vacuum)
13082 ** 56 4 1=UTF-8 2=UTF16le 3=UTF16be
13083 ** 60 4 User version
13084 ** 64 4 Incremental vacuum mode
13085 ** 68 4 Application-ID
13086 ** 72 20 unused
13087 ** 92 4 The version-valid-for number
13088 ** 96 4 SQLITE_VERSION_NUMBER
13089 **
13090 ** All of the integer values are big-endian (most significant byte first).
13091 **
13092 ** The file change counter is incremented when the database is changed.
13093 ** This counter allows other processes to know when the file has changed
13094 ** and thus when they need to flush their cache.
13095 **
13096 ** The max embedded payload fraction is the amount of the total usable
13097 ** space in a page that can be consumed by a single cell for standard
13098 ** B-tree (non-LEAFDATA) tables. A value of 255 means 100%. The default
13099 ** is to limit the maximum cell size so that at least 4 cells will fit
13100 ** on one page. Thus the default max embedded payload fraction is 64.
13101 **
13102 ** If the payload for a cell is larger than the max payload, then extra
13103 ** payload is spilled to overflow pages. Once an overflow page is allocated,
13104 ** as many bytes as possible are moved into the overflow pages without letting
13105 ** the cell size drop below the min embedded payload fraction.
13106 **
13107 ** The min leaf payload fraction is like the min embedded payload fraction
13108 ** except that it applies to leaf nodes in a LEAFDATA tree. The maximum
13109 ** payload fraction for a LEAFDATA tree is always 100% (or 255) and it
13110 ** is not specified in the header.
13111 **
13112 ** Each btree page is divided into three sections: The header, the
13113 ** cell pointer array, and the cell content area. Page 1 also has a 100-byte
13114 ** file header that occurs before the page header.
13115 **
13116 ** |----------------|
13117 ** | file header | 100 bytes. Page 1 only.
13118 ** |----------------|
13119 ** | page header | 8 bytes for leaves. 12 bytes for interior nodes
13120 ** |----------------|
13121 ** | cell pointer | | 2 bytes per cell. Sorted order.
13122 ** | array | | Grows downward
13123 ** | | v
13124 ** |----------------|
13125 ** | unallocated |
13126 ** | space |
13127 ** |----------------| ^ Grows upwards
13128 ** | cell content | | Arbitrary order interspersed with freeblocks.
13129 ** | area | | and free space fragments.
13130 ** |----------------|
13131 **
13132 ** The page header looks like this:
13133 **
13134 ** OFFSET SIZE DESCRIPTION
13135 ** 0 1 Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf
13136 ** 1 2 byte offset to the first freeblock
13137 ** 3 2 number of cells on this page
13138 ** 5 2 first byte of the cell content area
13139 ** 7 1 number of fragmented free bytes
13140 ** 8 4 Right child (the Ptr(N) value). Omitted on leaves.
13141 **
13142 ** The flags define the format of this btree page. The leaf flag means that
13143 ** this page has no children. The zerodata flag means that this page carries
13144 ** only keys and no data. The intkey flag means that the key is an integer
13145 ** which is stored in the key size entry of the cell header rather than in
13146 ** the payload area.
13147 **
13148 ** The cell pointer array begins on the first byte after the page header.
13149 ** The cell pointer array contains zero or more 2-byte numbers which are
13150 ** offsets from the beginning of the page to the cell content in the cell
13151 ** content area. The cell pointers occur in sorted order. The system strives
13152 ** to keep free space after the last cell pointer so that new cells can
13153 ** be easily added without having to defragment the page.
13154 **
13155 ** Cell content is stored at the very end of the page and grows toward the
13156 ** beginning of the page.
13157 **
13158 ** Unused space within the cell content area is collected into a linked list of
13159 ** freeblocks. Each freeblock is at least 4 bytes in size. The byte offset
13160 ** to the first freeblock is given in the header. Freeblocks occur in
13161 ** increasing order. Because a freeblock must be at least 4 bytes in size,
13162 ** any group of 3 or fewer unused bytes in the cell content area cannot
13163 ** exist on the freeblock chain. A group of 3 or fewer free bytes is called
13164 ** a fragment. The total number of bytes in all fragments is recorded
13165 ** in the page header at offset 7.
13166 **
13167 ** SIZE DESCRIPTION
13168 ** 2 Byte offset of the next freeblock
13169 ** 2 Bytes in this freeblock
13170 **
13171 ** Cells are of variable length. Cells are stored in the cell content area at
13172 ** the end of the page. Pointers to the cells are in the cell pointer array
13173 ** that immediately follows the page header. Cells are not necessarily
13174 ** contiguous or in order, but cell pointers are contiguous and in order.
13175 **
13176 ** Cell content makes use of variable length integers. A variable
13177 ** length integer is 1 to 9 bytes where the lower 7 bits of each
13178 ** byte are used. The integer consists of all bytes that have bit 8 set and
13179 ** the first byte with bit 8 clear. The most significant byte of the integer
13180 ** appears first. A variable-length integer may not be more than 9 bytes long.
13181 ** As a special case, all 8 bits of the 9th byte are used as data. This
13182 ** allows a 64-bit integer to be encoded in 9 bytes.
13183 **
13184 ** 0x00 becomes 0x00000000
13185 ** 0x7f becomes 0x0000007f
13186 ** 0x81 0x00 becomes 0x00000080
13187 ** 0x82 0x00 becomes 0x00000100
13188 ** 0x80 0x7f becomes 0x0000007f
13189 ** 0x81 0x91 0xd1 0xac 0x78 becomes 0x12345678
13190 ** 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081
13191 **
13192 ** Variable length integers are used for rowids and to hold the number of
13193 ** bytes of key and data in a btree cell.
13194 **
13195 ** The content of a cell looks like this:
13196 **
13197 ** SIZE DESCRIPTION
13198 ** 4 Page number of the left child. Omitted if leaf flag is set.
13199 ** var Number of bytes of data. Omitted if the zerodata flag is set.
13200 ** var Number of bytes of key. Or the key itself if intkey flag is set.
13201 ** * Payload
13202 ** 4 First page of the overflow chain. Omitted if no overflow
13203 **
13204 ** Overflow pages form a linked list. Each page except the last is completely
13205 ** filled with data (pagesize - 4 bytes). The last page can have as little
13206 ** as 1 byte of data.
13207 **
13208 ** SIZE DESCRIPTION
13209 ** 4 Page number of next overflow page
13210 ** * Data
13211 **
13212 ** Freelist pages come in two subtypes: trunk pages and leaf pages. The
13213 ** file header points to the first in a linked list of trunk pages. Each trunk
13214 ** page points to multiple leaf pages. The content of a leaf page is
13215 ** unspecified. A trunk page looks like this:
13216 **
13217 ** SIZE DESCRIPTION
13218 ** 4 Page number of next trunk page
13219 ** 4 Number of leaf pointers on this page
13220 ** * zero or more page numbers of leaves
13221 */
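The variable-length integer format described in the comment above can be decoded with a short loop. This is a minimal sketch under the rules stated there (7 data bits per byte, most significant byte first, all 8 bits of a ninth byte used as data). The names `varint_decode` and `varint_value` are hypothetical and this is not the decoder used by the library.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one variable-length integer from z[] into *pValue and return
** the number of bytes consumed (1 to 9). */
static int varint_decode(const unsigned char *z, uint64_t *pValue){
  uint64_t v = 0;
  int i;
  assert( z!=0 && pValue!=0 );
  for(i=0; i<8; i++){
    v = (v<<7) | (uint64_t)(z[i] & 0x7f);   /* Append the low 7 bits */
    if( (z[i] & 0x80)==0 ){                 /* High bit clear: last byte */
      *pValue = v;
      return i+1;
    }
  }
  /* Ninth byte: all 8 bits are data, giving 8*7 + 8 = 64 bits in total. */
  *pValue = (v<<8) | z[8];
  return 9;
}

/* Convenience wrapper that returns only the decoded value. */
static uint64_t varint_value(const unsigned char *z){
  uint64_t v;
  (void)varint_decode(z, &v);
  return v;
}
```

For example, the two-byte sequence 0x81 0x00 decodes to 0x80, and the non-minimal sequence 0x80 0x7f decodes to 0x7f, matching the table in the comment.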
13222 /* #include "sqliteInt.h" */
13223
13224
13225 /* The following value is the maximum cell size assuming a maximum page
13226 ** size given above.
13227 */
13228 #define MX_CELL_SIZE(pBt) ((int)(pBt->pageSize-8))
13229
13230 /* The maximum number of cells on a single page of the database. This
13231 ** assumes a minimum cell size of 6 bytes (4 bytes for the cell itself
13232 ** plus 2 bytes for the index to the cell in the page header). Such
13233 ** small cells will be rare, but they are possible.
13234 */
13235 #define MX_CELL(pBt) ((pBt->pageSize-8)/6)
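To make the two limits concrete, the same arithmetic can be evaluated for a few page sizes. This is a sketch only: `mx_cell_size` and `mx_cell` are hypothetical function versions of the macros that take the page size directly instead of a BtShared pointer.

```c
#include <assert.h>

/* Largest possible cell: the page minus the 8-byte leaf page header. */
static int mx_cell_size(unsigned pageSize){
  assert( pageSize>=512 && pageSize<=65536 );
  return (int)(pageSize-8);
}

/* Most cells that fit on a page, at 6 bytes per cell (4 bytes of cell
** content plus a 2-byte entry in the cell pointer array). */
static int mx_cell(unsigned pageSize){
  assert( pageSize>=512 && pageSize<=65536 );
  return (int)((pageSize-8)/6);
}
```

With a 4096-byte page this gives a maximum cell size of 4088 bytes and at most 681 cells per page.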
13236
13237 /* Forward declarations */
13238 typedef struct MemPage MemPage;
13239 typedef struct BtLock BtLock;
13240 typedef struct CellInfo CellInfo;
13241
13242 /*
13243 ** This is a magic string that appears at the beginning of every
13244 ** SQLite database in order to identify the file as a real database.
13245 **
13246 ** You can change this value at compile-time by specifying a
13247 ** -DSQLITE_FILE_HEADER="..." on the compiler command-line. The
13248 ** header must be exactly 16 bytes including the zero-terminator so
13249 ** the string itself should be 15 characters long. If you change
13250 ** the header, then your custom library will not be able to read
13251 ** databases generated by the standard tools and the standard tools
13252 ** will not be able to read databases created by your custom library.
13253 */
13254 #ifndef SQLITE_FILE_HEADER /* 123456789 123456 */
13255 # define SQLITE_FILE_HEADER "SQLite format 3"
13256 #endif
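A reader of the file format might check the magic string and then pull fields out of the 100-byte file header roughly as follows. This is a sketch assuming the default header string and the offsets listed in the format notes (page size at offset 16, change counter at offset 24, both big-endian). The function names are hypothetical, and the power-of-two check is an illustrative validation based on the stated page size range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Big-endian field readers. */
static uint32_t get2be(const unsigned char *p){
  return ((uint32_t)p[0]<<8) | p[1];
}
static uint32_t get4be(const unsigned char *p){
  return (get2be(p)<<16) | get2be(p+2);
}

/* Return the page size recorded in the 100-byte file header aHdr[], or 0
** if the magic string or the page size field is not valid. */
static uint32_t header_page_size(const unsigned char *aHdr){
  uint32_t sz;
  assert( aHdr!=0 );
  /* 15 characters plus the zero terminator make up the 16-byte magic. */
  if( memcmp(aHdr, "SQLite format 3", 16)!=0 ) return 0;
  sz = get2be(&aHdr[16]);
  if( sz==1 ) return 65536;                  /* Stored value 1 means 65536 */
  if( sz<512 || (sz & (sz-1))!=0 ) return 0; /* Must be a power of two */
  return sz;
}

/* Return the file change counter stored at offset 24. */
static uint32_t header_change_counter(const unsigned char *aHdr){
  assert( aHdr!=0 );
  return get4be(&aHdr[24]);
}
```

Comparing all 16 bytes, including the terminator, is what makes a custom SQLITE_FILE_HEADER incompatible with files written by the standard tools.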
13257
13258 /*
13259 ** Page type flags. An ORed combination of these flags appears as the
13260 ** first byte of on-disk image of every BTree page.
13261 */
13262 #define PTF_INTKEY 0x01
13263 #define PTF_ZERODATA 0x02
13264 #define PTF_LEAFDATA 0x04
13265 #define PTF_LEAF 0x08
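The flag byte is the first field of the btree page header, and the leaf flag decides whether the header is 8 or 12 bytes long. The sketch below decodes the header fields in the order given in the format notes. `PageHdr`, `page_header_parse` and the `F_*` constants are hypothetical names for illustration, and `a` is assumed to point at the start of the page header (that is, past the 100-byte file header on page 1).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copies of the page type flag bits. */
enum { F_INTKEY = 0x01, F_ZERODATA = 0x02, F_LEAFDATA = 0x04, F_LEAF = 0x08 };

typedef struct PageHdr {
  int isLeaf;               /* True if the page has no children */
  int isIntKey;             /* True if keys are integers */
  int hdrSize;              /* 8 for leaves, 12 for interior pages */
  uint32_t firstFreeblock;  /* Byte offset to the first freeblock */
  uint32_t nCell;           /* Number of cells on the page */
  uint32_t cellContent;     /* First byte of the cell content area */
  uint32_t nFragBytes;      /* Number of fragmented free bytes */
  uint32_t rightChild;      /* Ptr(N); 0 here when the page is a leaf */
} PageHdr;

static uint32_t be16(const unsigned char *p){
  return ((uint32_t)p[0]<<8) | p[1];
}
static uint32_t be32(const unsigned char *p){
  return (be16(p)<<16) | be16(p+2);
}

/* Decode the page header that starts at a[0]. */
static void page_header_parse(const unsigned char *a, PageHdr *p){
  assert( a!=0 && p!=0 );
  p->isLeaf = (a[0] & F_LEAF)!=0;
  p->isIntKey = (a[0] & F_INTKEY)!=0;
  p->hdrSize = p->isLeaf ? 8 : 12;
  p->firstFreeblock = be16(&a[1]);
  p->nCell = be16(&a[3]);
  p->cellContent = be16(&a[5]);
  p->nFragBytes = a[7];
  /* The right-child pointer is omitted on leaves. */
  p->rightChild = p->isLeaf ? 0 : be32(&a[8]);
}
```

The cell pointer array would then begin at `a + hdrSize`, which is why the header size has to be derived from the leaf flag before anything else on the page is read.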
13266
13267 /*
13268 ** An instance of this object stores information about a single database
13269 ** page that has been loaded into memory. The information in this object
13270 ** is derived from the raw on-disk page content.
13271 **
13272 ** As each database page is loaded into memory, the pager allocates an
13273 ** instance of this object and zeros the first 8 bytes. (This is the
13274 ** "extra" information associated with each page of the pager.)
13275 **
13276 ** Access to all fields of this structure is controlled by the mutex
13277 ** stored in MemPage.pBt->mutex.
13278 */
13279 struct MemPage {
13280 u8 isInit; /* True if previously initialized. MUST BE FIRST! */
13281 u8 bBusy; /* Prevent endless loops on corrupt database files */
13282 u8 intKey; /* True if table b-trees. False for index b-trees */
13283 u8 intKeyLeaf; /* True if the leaf of an intKey table */
13284 Pgno pgno; /* Page number for this page */
13285 /* Only the first 8 bytes (above) are zeroed by pager.c when a new page
13286 ** is allocated. All fields that follow must be initialized before use */
13287 u8 leaf; /* True if a leaf page */
13288 u8 hdrOffset; /* 100 for page 1. 0 otherwise */
13289 u8 childPtrSize; /* 0 if leaf==1. 4 if leaf==0 */
13290 u8 max1bytePayload; /* min(maxLocal,127) */
13291 u8 nOverflow; /* Number of overflow cell bodies in aCell[] */
13292 u16 maxLocal; /* Copy of BtShared.maxLocal or BtShared.maxLeaf */
13293 u16 minLocal; /* Copy of BtShared.minLocal or BtShared.minLeaf */
13294 u16 cellOffset; /* Index in aData of first cell pointer */
13295 u16 nFree; /* Number of free bytes on the page */
13296 u16 nCell; /* Number of cells on this page, local and ovfl */
13297 u16 maskPage; /* Mask for page offset */
13298 u16 aiOvfl[4]; /* Insert the i-th overflow cell before the aiOvfl-th
13299 ** non-overflow cell */
13300 u8 *apOvfl[4]; /* Pointers to the body of overflow cells */
13301 BtShared *pBt; /* Pointer to BtShared that this page is part of */
13302 u8 *aData; /* Pointer to disk image of the page data */
13303 u8 *aDataEnd; /* One byte past the end of usable data */
13304 u8 *aCellIdx; /* The cell index area */
13305 u8 *aDataOfst; /* Same as aData for leaves. aData+4 for interior */
13306 DbPage *pDbPage; /* Pager page handle */
13307 u16 (*xCellSize)(MemPage*,u8*); /* cellSizePtr method */
13308 void (*xParseCell)(MemPage*,u8*,CellInfo*); /* btreeParseCell method */
13309 };
13310
13311 /*
13312 ** A linked list of the following structures is stored at BtShared.pLock.
13313 ** Locks are added (or upgraded from READ_LOCK to WRITE_LOCK) when a cursor
13314 ** is opened on the table with root page BtShared.iTable. Locks are removed
13315 ** from this list when a transaction is committed or rolled back, or when
13316 ** a btree handle is closed.
13317 */
13318 struct BtLock {
13319 Btree *pBtree; /* Btree handle holding this lock */
13320 Pgno iTable; /* Root page of table */
13321 u8 eLock; /* READ_LOCK or WRITE_LOCK */
13322 BtLock *pNext; /* Next in BtShared.pLock list */
13323 };
13324
13325 /* Candidate values for BtLock.eLock */
13326 #define READ_LOCK 1
13327 #define WRITE_LOCK 2
13328
13329 /* A Btree handle
13330 **
13331 ** A database connection contains a pointer to an instance of
13332 ** this object for every database file that it has open. This structure
13333 ** is opaque to the database connection. The database connection cannot
13334 ** see the internals of this structure and only deals with pointers to
13335 ** this structure.
13336 **
13337 ** For some database files, the same underlying database cache might be
13338 ** shared between multiple connections. In that case, each connection
13339 ** has its own instance of this object. But each instance of this object
13340 ** points to the same BtShared object. The database cache and the
13341 ** schema associated with the database file are all contained within
13342 ** the BtShared object.
13343 **
13344 ** All fields in this structure are accessed under sqlite3.mutex.
13345 ** The pBt pointer itself may not be changed while there exist cursors
13346 ** in the referenced BtShared that point back to this Btree since those
13347 ** cursors have to go through this Btree to find their BtShared and
13348 ** they often do so without holding sqlite3.mutex.
13349 */
13350 struct Btree {
13351 sqlite3 *db; /* The database connection holding this btree */
13352 BtShared *pBt; /* Sharable content of this btree */
13353 u8 inTrans; /* TRANS_NONE, TRANS_READ or TRANS_WRITE */
13354 u8 sharable; /* True if we can share pBt with another db */
13355 u8 locked; /* True if db currently has pBt locked */
13356 u8 hasIncrblobCur; /* True if there are one or more Incrblob cursors */
13357 int wantToLock; /* Number of nested calls to sqlite3BtreeEnter() */
13358 int nBackup; /* Number of backup operations reading this btree */
13359 u32 iDataVersion; /* Combines with pBt->pPager->iDataVersion */
13360 Btree *pNext; /* List of other sharable Btrees from the same db */
13361 Btree *pPrev; /* Back pointer of the same list */
13362 #ifndef SQLITE_OMIT_SHARED_CACHE
13363 BtLock lock; /* Object used to lock page 1 */
13364 #endif
13365 };
13366
13367 /*
13368 ** Btree.inTrans may take one of the following values.
13369 **
13370 ** If the shared-data extension is enabled, there may be multiple users
13371 ** of the Btree structure. At most one of these may open a write transaction,
13372 ** but any number may have active read transactions.
13373 */
13374 #define TRANS_NONE 0
13375 #define TRANS_READ 1
13376 #define TRANS_WRITE 2
13377
13378 /*
13379 ** An instance of this object represents a single database file.
13380 **
13381 ** A single database file can be in use at the same time by two
13382 ** or more database connections. When two or more connections are
13383 ** sharing the same database file, each connection has its own
13384 ** private Btree object for the file and each of those Btrees points
13385 ** to this one BtShared object. BtShared.nRef is the number of
13386 ** connections currently sharing this database file.
13387 **
13388 ** Fields in this structure are accessed under the BtShared.mutex
13389 ** mutex, except for nRef and pNext which are accessed under the
13390 ** global SQLITE_MUTEX_STATIC_MASTER mutex. The pPager field
13391 ** may not be modified once it is initially set as long as nRef>0.
13392 ** The pSchema field may be set once under BtShared.mutex and
13393 ** thereafter is unchanged as long as nRef>0.
13394 **
13395 ** isPending:
13396 **
13397 ** If a BtShared client fails to obtain a write-lock on a database
13398 ** table (because there exists one or more read-locks on the table),
13399 ** the shared-cache enters 'pending-lock' state and isPending is
13400 ** set to true.
13401 **
13402 ** The shared-cache leaves the 'pending lock' state when either of
13403 ** the following occur:
13404 **
13405 ** 1) The current writer (BtShared.pWriter) concludes its transaction, OR
13406 ** 2) The number of locks held by other connections drops to zero.
13407 **
13408 ** While in the 'pending-lock' state, no connection may start a new
13409 ** transaction.
13410 **
13411 ** This feature is included to help prevent writer-starvation.
13412 */
13413 struct BtShared {
13414 Pager *pPager; /* The page cache */
13415 sqlite3 *db; /* Database connection currently using this Btree */
13416 BtCursor *pCursor; /* A list of all open cursors */
13417 MemPage *pPage1; /* First page of the database */
13418 u8 openFlags; /* Flags to sqlite3BtreeOpen() */
13419 #ifndef SQLITE_OMIT_AUTOVACUUM
13420 u8 autoVacuum; /* True if auto-vacuum is enabled */
13421 u8 incrVacuum; /* True if incr-vacuum is enabled */
13422 u8 bDoTruncate; /* True to truncate db on commit */
13423 #endif
13424 u8 inTransaction; /* Transaction state */
13425 u8 max1bytePayload; /* Maximum first byte of cell for a 1-byte payload */
13426 #ifdef SQLITE_HAS_CODEC
13427 u8 optimalReserve; /* Desired amount of reserved space per page */
13428 #endif
13429 u16 btsFlags; /* Boolean parameters. See BTS_* macros below */
13430 u16 maxLocal; /* Maximum local payload in non-LEAFDATA tables */
13431 u16 minLocal; /* Minimum local payload in non-LEAFDATA tables */
13432 u16 maxLeaf; /* Maximum local payload in a LEAFDATA table */
13433 u16 minLeaf; /* Minimum local payload in a LEAFDATA table */
13434 u32 pageSize; /* Total number of bytes on a page */
13435 u32 usableSize; /* Number of usable bytes on each page */
13436 int nTransaction; /* Number of open transactions (read + write) */
13437 u32 nPage; /* Number of pages in the database */
13438 void *pSchema; /* Pointer to space allocated by sqlite3BtreeSchema() */
13439 void (*xFreeSchema)(void*); /* Destructor for BtShared.pSchema */
13440 sqlite3_mutex *mutex; /* Non-recursive mutex required to access this object */
13441 Bitvec *pHasContent; /* Set of pages moved to free-list this transaction */
13442 #ifndef SQLITE_OMIT_SHARED_CACHE
13443 int nRef; /* Number of references to this structure */
13444 BtShared *pNext; /* Next on a list of sharable BtShared structs */
13445 BtLock *pLock; /* List of locks held on this shared-btree struct */
13446 Btree *pWriter; /* Btree with currently open write transaction */
13447 #endif
13448 u8 *pTmpSpace; /* Temp space sufficient to hold a single cell */
13449 };
13450
13451 /*
13452 ** Allowed values for BtShared.btsFlags
13453 */
13454 #define BTS_READ_ONLY 0x0001 /* Underlying file is readonly */
13455 #define BTS_PAGESIZE_FIXED 0x0002 /* Page size can no longer be changed */
13456 #define BTS_SECURE_DELETE 0x0004 /* PRAGMA secure_delete is enabled */
13457 #define BTS_INITIALLY_EMPTY 0x0008 /* Database was empty at trans start */
13458 #define BTS_NO_WAL 0x0010 /* Do not open write-ahead-log files */
13459 #define BTS_EXCLUSIVE 0x0020 /* pWriter has an exclusive lock */
13460 #define BTS_PENDING 0x0040 /* Waiting for read-locks to clear */
13461
13462 /*
13463 ** An instance of the following structure is used to hold information
13464 ** about a cell. The parseCellPtr() function fills in this structure
13465 ** based on information extracted from the raw disk page.
13466 */
13467 struct CellInfo {
13468 i64 nKey; /* The key for INTKEY tables, or nPayload otherwise */
13469 u8 *pPayload; /* Pointer to the start of payload */
13470 u32 nPayload; /* Bytes of payload */
13471 u16 nLocal; /* Amount of payload held locally, not on overflow */
13472 u16 nSize; /* Size of the cell content on the main b-tree page */
13473 };
13474
13475 /*
13476 ** Maximum depth of an SQLite B-Tree structure. Any B-Tree deeper than
13477 ** this will be declared corrupt. This value is calculated based on a
13478 ** maximum database size of 2^31 pages and a minimum fanout of 2 for a
13479 ** root-node and 3 for all other internal nodes.
13480 **
13481 ** If a tree that appears to be taller than this is encountered, it is
13482 ** assumed that the database is corrupt.
13483 */
13484 #define BTCURSOR_MAX_DEPTH 20
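One way to check the depth limit is to count the fewest pages a tree of a given height can contain when the root has 2 children and every other interior page has 3. This is a sketch of that arithmetic under one reading of the comment above (counting every page in the tree). `min_pages_for_depth` is a hypothetical helper, not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Return the smallest number of pages in a tree with nLevel levels when
** the root has fanout 2 and all other interior pages have fanout 3. */
static uint64_t min_pages_for_depth(int nLevel){
  uint64_t total = 1;   /* The root page */
  uint64_t width = 1;   /* Number of pages on the current level */
  int i;
  assert( nLevel>=1 );
  for(i=2; i<=nLevel; i++){
    width *= (i==2) ? 2 : 3;
    total += width;
  }
  return total;
}
```

Under these assumptions a 20-level tree needs at least 3^19 = 1162261467 pages, which fits within 2^31, while a 21-level tree needs at least 3^20 pages, which does not. A tree that appears deeper than 20 levels therefore cannot come from a well-formed file.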
13485
13486 /*
13487 ** A cursor is a pointer to a particular entry within a particular
13488 ** b-tree within a database file.
13489 **
13490 ** The entry is identified by its MemPage and the index in
13491 ** MemPage.aCell[] of the entry.
13492 **
13493 ** A single database file can be shared by two or more database connections,
13494 ** but cursors cannot be shared. Each cursor is associated with a
13495 ** particular database connection identified by BtCursor.pBtree.db.
13496 **
13497 ** Fields in this structure are accessed under the BtShared.mutex
13498 ** found at self->pBt->mutex.
13499 **
13500 ** skipNext meaning:
13501 ** eState==SKIPNEXT && skipNext>0: Next sqlite3BtreeNext() is no-op.
13502 ** eState==SKIPNEXT && skipNext<0: Next sqlite3BtreePrevious() is no-op.
13503 ** eState==FAULT: Cursor fault with skipNext as error code.
13504 */
13505 struct BtCursor {
13506 Btree *pBtree; /* The Btree to which this cursor belongs */
13507 BtShared *pBt; /* The BtShared this cursor points to */
13508 BtCursor *pNext; /* Forms a linked list of all cursors */
13509 Pgno *aOverflow; /* Cache of overflow page locations */
13510 CellInfo info; /* A parse of the cell we are pointing at */
13511 i64 nKey; /* Size of pKey, or last integer key */
13512 void *pKey; /* Saved key that was cursor last known position */
13513 Pgno pgnoRoot; /* The root page of this tree */
13514 int nOvflAlloc; /* Allocated size of aOverflow[] array */
13515 int skipNext; /* Prev() is noop if negative. Next() is noop if positive.
13516 ** Error code if eState==CURSOR_FAULT */
13517 u8 curFlags; /* zero or more BTCF_* flags defined below */
13518 u8 curPagerFlags; /* Flags to send to sqlite3PagerGet() */
13519 u8 eState; /* One of the CURSOR_XXX constants (see below) */
13520 u8 hints; /* As configured by CursorSetHints() */
13521 /* All fields above are zeroed when the cursor is allocated. See
13522 ** sqlite3BtreeCursorZero(). Fields that follow must be manually
13523 ** initialized. */
13524 i8 iPage; /* Index of current page in apPage */
13525 u8 curIntKey; /* Value of apPage[0]->intKey */
13526 struct KeyInfo *pKeyInfo; /* Argument passed to comparison function */
13527 void *padding1; /* Make object size a multiple of 16 */
13528 u16 aiIdx[BTCURSOR_MAX_DEPTH]; /* Current index in apPage[i] */
13529 MemPage *apPage[BTCURSOR_MAX_DEPTH]; /* Pages from root to current page */
13530 };
13531
13532 /*
13533 ** Legal values for BtCursor.curFlags
13534 */
13535 #define BTCF_WriteFlag 0x01 /* True if a write cursor */
13536 #define BTCF_ValidNKey 0x02 /* True if info.nKey is valid */
13537 #define BTCF_ValidOvfl 0x04 /* True if aOverflow is valid */
13538 #define BTCF_AtLast 0x08 /* Cursor is pointing to the last entry */
13539 #define BTCF_Incrblob 0x10 /* True if an incremental I/O handle */
13540 #define BTCF_Multiple 0x20 /* Maybe another cursor on the same btree */
13541
13542 /*
13543 ** Potential values for BtCursor.eState.
13544 **
13545 ** CURSOR_INVALID:
13546 ** Cursor does not point to a valid entry. This can happen (for example)
13547 ** because the table is empty or because BtreeCursorFirst() has not been
13548 ** called.
13549 **
13550 ** CURSOR_VALID:
13551 ** Cursor points to a valid entry. getPayload() etc. may be called.
13552 **
13553 ** CURSOR_SKIPNEXT:
13554 ** Cursor is valid except that the BtCursor.skipNext field is non-zero
13555 ** indicating that the next sqlite3BtreeNext() or sqlite3BtreePrevious()
13556 ** operation should be a no-op.
13557 **
13558 ** CURSOR_REQUIRESEEK:
13559 ** The table that this cursor was opened on still exists, but has been
13560 ** modified since the cursor was last used. The cursor position is saved
13561 ** in variables BtCursor.pKey and BtCursor.nKey. When a cursor is in
13562 ** this state, restoreCursorPosition() can be called to attempt to
13563 ** seek the cursor to the saved position.
13564 **
13565 ** CURSOR_FAULT:
13566 ** An unrecoverable error (an I/O error or a malloc failure) has occurred
13567 ** on a different connection that shares the BtShared cache with this
13568 ** cursor. The error has left the cache in an inconsistent state.
13569 ** Do nothing else with this cursor. Any attempt to use the cursor
13570 ** should return the error code stored in BtCursor.skipNext.
13571 */
13572 #define CURSOR_INVALID 0
13573 #define CURSOR_VALID 1
13574 #define CURSOR_SKIPNEXT 2
13575 #define CURSOR_REQUIRESEEK 3
13576 #define CURSOR_FAULT 4
13577
13578 /*
13579 ** The database page the PENDING_BYTE occupies. This page is never used.
13580 */
13581 # define PENDING_BYTE_PAGE(pBt) PAGER_MJ_PGNO(pBt)
13582
13583 /*
13584 ** These macros define the location of the pointer-map entry for a
13585 ** database page. The first argument to each is the number of usable
13586 ** bytes on each page of the database (often 1024). The second is the
13587 ** page number to look up in the pointer map.
13588 **
13589 ** PTRMAP_PAGENO returns the database page number of the pointer-map
13590 ** page that stores the required pointer. PTRMAP_PTROFFSET returns
13591 ** the offset of the requested map entry.
13592 **
13593 ** If the pgno argument passed to PTRMAP_PAGENO is a pointer-map page,
13594 ** then pgno is returned. So (pgno==PTRMAP_PAGENO(pgsz, pgno)) can be
13595 ** used to test if pgno is a pointer-map page. PTRMAP_ISPAGE implements
13596 ** this test.
13597 */
13598 #define PTRMAP_PAGENO(pBt, pgno) ptrmapPageno(pBt, pgno)
13599 #define PTRMAP_PTROFFSET(pgptrmap, pgno) (5*(pgno-pgptrmap-1))
13600 #define PTRMAP_ISPAGE(pBt, pgno) (PTRMAP_PAGENO((pBt),(pgno))==(pgno))
13601
13602 /*
13603 ** The pointer map is a lookup table that identifies the parent page for
13604 ** each child page in the database file. The parent page is the page that
13605 ** contains a pointer to the child. Every page in the database has
13606 ** 0 or 1 parent pages. (In this context 'database page' refers
13607 ** to any page that is not part of the pointer map itself.) Each pointer map
13608 ** entry consists of a single byte 'type' and a 4 byte parent page number.
13609 ** The PTRMAP_XXX identifiers below are the valid types.
13610 **
13611 ** The purpose of the pointer map is to facilitate moving pages from one
13612 ** position in the file to another as part of autovacuum. When a page
13613 ** is moved, the pointer in its parent must be updated to point to the
13614 ** new location. The pointer map is used to locate the parent page quickly.
13615 **
13616 ** PTRMAP_ROOTPAGE: The database page is a root-page. The page-number is not
13617 ** used in this case.
13618 **
13619 ** PTRMAP_FREEPAGE: The database page is an unused (free) page. The page-number
13620 ** is not used in this case.
13621 **
13622 ** PTRMAP_OVERFLOW1: The database page is the first page in a list of
13623 ** overflow pages. The page number identifies the page that
13624 ** contains the cell with a pointer to this overflow page.
13625 **
13626 ** PTRMAP_OVERFLOW2: The database page is the second or later page in a list of
13627 ** overflow pages. The page-number identifies the previous
13628 ** page in the overflow page list.
13629 **
13630 ** PTRMAP_BTREE: The database page is a non-root btree page. The page number
13631 ** identifies the parent page in the btree.
13632 */
13633 #define PTRMAP_ROOTPAGE 1
13634 #define PTRMAP_FREEPAGE 2
13635 #define PTRMAP_OVERFLOW1 3
13636 #define PTRMAP_OVERFLOW2 4
13637 #define PTRMAP_BTREE 5
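The entry layout described above (one type byte followed by a 4-byte parent page number, five bytes per entry) can be sketched as follows. This is an illustration only: the helper names are hypothetical, the offset formula mirrors PTRMAP_PTROFFSET, and the parent page number is assumed to be stored big-endian like the other multi-byte integers in this file. The caller is assumed to pass pgno greater than pgptrmap.

```c
#include <stdint.h>

#define PTRMAP_ENTRY_SIZE 5u  /* 1 type byte + 4-byte parent page number */

/* Byte offset, within pointer-map page pgptrmap, of the entry that
** describes page pgno. Same arithmetic as PTRMAP_PTROFFSET. */
static uint32_t ptrmap_entry_offset(uint32_t pgptrmap, uint32_t pgno){
  return PTRMAP_ENTRY_SIZE * (pgno - pgptrmap - 1u);
}

/* Store an entry: the type byte, then the parent page number with the
** most significant byte first (big-endian is an assumption here). */
static void ptrmap_write_entry(uint8_t *aData, uint32_t off,
                               uint8_t eType, uint32_t parent){
  aData[off]   = eType;
  aData[off+1] = (uint8_t)(parent>>24);
  aData[off+2] = (uint8_t)(parent>>16);
  aData[off+3] = (uint8_t)(parent>>8);
  aData[off+4] = (uint8_t)parent;
}

/* Read back the parent page number of the entry at byte offset off. */
static uint32_t ptrmap_read_parent(const uint8_t *aData, uint32_t off){
  return ((uint32_t)aData[off+1]<<24) | ((uint32_t)aData[off+2]<<16)
       | ((uint32_t)aData[off+3]<<8)  |  (uint32_t)aData[off+4];
}
```

With a pointer-map page at page 2, the entry for page 3 starts at offset 0 and the entry for page 5 starts at offset 10.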
13638
13639 /* A bunch of assert() statements to check the transaction state variables
13640 ** of handle p (type Btree*) are internally consistent.
13641 */
13642 #define btreeIntegrity(p) \
13643 assert( p->pBt->inTransaction!=TRANS_NONE || p->pBt->nTransaction==0 ); \
13644 assert( p->pBt->inTransaction>=p->inTrans );
13645
13646
13647 /*
13648 ** The ISAUTOVACUUM macro is used within balance_nonroot() to determine
13649 ** if the database supports auto-vacuum or not. Because it is used
13650 ** within an expression that is an argument to another macro
13651 ** (sqliteMallocRaw), it is not possible to use conditional compilation.
13652 ** So, this macro is defined instead.
13653 */
13654 #ifndef SQLITE_OMIT_AUTOVACUUM
13655 #define ISAUTOVACUUM (pBt->autoVacuum)
13656 #else
13657 #define ISAUTOVACUUM 0
13658 #endif
13659
13660
13661 /*
13662 ** This structure is passed around through all the sanity checking routines
13663 ** in order to keep track of some global state information.
13664 **
13665 ** The aPgRef[] array is allocated so that there is 1 bit for each page in
13666 ** the database. As the integrity-check proceeds, for each page used in
13667 ** the database the corresponding bit is set. This allows integrity-check to
13668 ** detect pages that are used twice and orphaned pages (both of which
13669 ** indicate corruption).
13670 */
13671 typedef struct IntegrityCk IntegrityCk;
13672 struct IntegrityCk {
13673 BtShared *pBt; /* The tree being checked out */
13674 Pager *pPager; /* The associated pager. Also accessible by pBt->pPager */
13675 u8 *aPgRef; /* 1 bit per page in the db (see above) */
13676 Pgno nPage; /* Number of pages in the database */
13677 int mxErr; /* Stop accumulating errors when this reaches zero */
13678 int nErr; /* Number of messages written to zErrMsg so far */
13679 int mallocFailed; /* A memory allocation error has occurred */
13680 const char *zPfx; /* Error message prefix */
13681 int v1, v2; /* Values for up to two %d fields in zPfx */
13682 StrAccum errMsg; /* Accumulate the error message text here */
13683 u32 *heap; /* Min-heap used for analyzing cell coverage */
13684 };
13685
13686 /*
13687 ** Routines to read or write two- and four-byte big-endian integer values.
13688 */
13689 #define get2byte(x) ((x)[0]<<8 | (x)[1])
13690 #define put2byte(p,v) ((p)[0] = (u8)((v)>>8), (p)[1] = (u8)(v))
13691 #define get4byte sqlite3Get4byte
13692 #define put4byte sqlite3Put4byte
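For readers who prefer functions to macros, the two-byte accessors can be sketched as below. The names be16_get, be16_put and be16_roundtrip are hypothetical and exist only for this illustration; the byte order follows the get2byte() and put2byte() macros above.

```c
#include <stdint.h>

/* Read a big-endian 16-bit value: high byte first, as get2byte() does. */
static uint16_t be16_get(const uint8_t *p){
  return (uint16_t)((p[0]<<8) | p[1]);
}

/* Write v as two bytes, most significant byte first, as put2byte() does. */
static void be16_put(uint8_t *p, uint16_t v){
  p[0] = (uint8_t)(v>>8);
  p[1] = (uint8_t)v;
}

/* Encode and immediately decode a value; the result equals v
** regardless of the byte order of the host CPU. */
static uint16_t be16_roundtrip(uint16_t v){
  uint8_t buf[2];
  be16_put(buf, v);
  return be16_get(buf);
}
```

Because the bytes are assembled with shifts rather than a pointer cast, these helpers work on unaligned addresses and on either host byte order, which is the property that distinguishes get2byte() from get2byteAligned().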
13693
13694 /*
13695 ** get2byteAligned(), unlike get2byte(), requires that its argument point to a
13696 ** two-byte aligned address. get2byteAligned() is only used for accessing the
13697 ** cell addresses in a btree header.
13698 */
13699 #if SQLITE_BYTEORDER==4321
13700 # define get2byteAligned(x) (*(u16*)(x))
13701 #elif SQLITE_BYTEORDER==1234 && GCC_VERSION>=4008000
13702 # define get2byteAligned(x) __builtin_bswap16(*(u16*)(x))
13703 #elif SQLITE_BYTEORDER==1234 && MSVC_VERSION>=1300
13704 # define get2byteAligned(x) _byteswap_ushort(*(u16*)(x))
13705 #else
13706 # define get2byteAligned(x) ((x)[0]<<8 | (x)[1])
13707 #endif
13708
13709 /************** End of btreeInt.h ********************************************/
13710 /************** Continuing where we left off in btmutex.c ********************/
13711 #ifndef SQLITE_OMIT_SHARED_CACHE
13712 #if SQLITE_THREADSAFE
13713
13714 /*
13715 ** Obtain the BtShared mutex associated with B-Tree handle p. Also,
13716 ** set BtShared.db to the database handle associated with p and the
13717 ** p->locked boolean to true.
13718 */
13719 static void lockBtreeMutex(Btree *p){
13720 assert( p->locked==0 );
13721 assert( sqlite3_mutex_notheld(p->pBt->mutex) );
13722 assert( sqlite3_mutex_held(p->db->mutex) );
13723
13724 sqlite3_mutex_enter(p->pBt->mutex);
13725 p->pBt->db = p->db;
13726 p->locked = 1;
13727 }
13728
13729 /*
13730 ** Release the BtShared mutex associated with B-Tree handle p and
13731 ** clear the p->locked boolean.
13732 */
13733 static void SQLITE_NOINLINE unlockBtreeMutex(Btree *p){
13734 BtShared *pBt = p->pBt;
13735 assert( p->locked==1 );
13736 assert( sqlite3_mutex_held(pBt->mutex) );
13737 assert( sqlite3_mutex_held(p->db->mutex) );
13738 assert( p->db==pBt->db );
13739
13740 sqlite3_mutex_leave(pBt->mutex);
13741 p->locked = 0;
13742 }
13743
13744 /* Forward reference */
13745 static void SQLITE_NOINLINE btreeLockCarefully(Btree *p);
13746
13747 /*
13748 ** Enter a mutex on the given BTree object.
13749 **
13750 ** If the object is not sharable, then no mutex is ever required
13751 ** and this routine is a no-op. The underlying mutex is non-recursive.
13752 ** But we keep a reference count in Btree.wantToLock so the behavior
13753 ** of this interface is recursive.
13754 **
13755 ** To avoid deadlocks, multiple Btrees are locked in the same order
13756 ** by all database connections. The p->pNext is a list of other
13757 ** Btrees belonging to the same database connection as the p Btree
13758 ** which need to be locked after p. If we cannot get a lock on
13759 ** p, then first unlock all of the others on p->pNext, then wait
13760 ** for the lock to become available on p, then relock all of the
13761 ** subsequent Btrees that desire a lock.
13762 */
13763 SQLITE_PRIVATE void sqlite3BtreeEnter(Btree *p){
13764 /* Some basic sanity checking on the Btree. The list of Btrees
13765 ** connected by pNext and pPrev should be in sorted order by
13766 ** Btree.pBt value. All elements of the list should belong to
13767 ** the same connection. Only shared Btrees are on the list. */
13768 assert( p->pNext==0 || p->pNext->pBt>p->pBt );
13769 assert( p->pPrev==0 || p->pPrev->pBt<p->pBt );
13770 assert( p->pNext==0 || p->pNext->db==p->db );
13771 assert( p->pPrev==0 || p->pPrev->db==p->db );
13772 assert( p->sharable || (p->pNext==0 && p->pPrev==0) );
13773
13774 /* Check for locking consistency */
13775 assert( !p->locked || p->wantToLock>0 );
13776 assert( p->sharable || p->wantToLock==0 );
13777
13778 /* We should already hold a lock on the database connection */
13779 assert( sqlite3_mutex_held(p->db->mutex) );
13780
13781 /* Unless the database is sharable and unlocked, then BtShared.db
13782 ** should already be set correctly. */
13783 assert( (p->locked==0 && p->sharable) || p->pBt->db==p->db );
13784
13785 if( !p->sharable ) return;
13786 p->wantToLock++;
13787 if( p->locked ) return;
13788 btreeLockCarefully(p);
13789 }
13790
13791 /* This is a helper function for sqlite3BtreeEnter(). By moving
13792 ** complex, but seldom used logic, out of sqlite3BtreeEnter() and
13793 ** into this routine, we avoid unnecessary stack pointer changes
13794 ** and thus help the sqlite3BtreeEnter() routine to run much faster
13795 ** in the common case.
13796 */
13797 static void SQLITE_NOINLINE btreeLockCarefully(Btree *p){
13798 Btree *pLater;
13799
13800 /* In most cases, we should be able to acquire the lock we
13801 ** want without having to go through the ascending lock
13802 ** procedure that follows. Just be sure not to block.
13803 */
13804 if( sqlite3_mutex_try(p->pBt->mutex)==SQLITE_OK ){
13805 p->pBt->db = p->db;
13806 p->locked = 1;
13807 return;
13808 }
13809
13810 /* To avoid deadlock, first release all locks with a larger
13811 ** BtShared address. Then acquire our lock. Then reacquire
13812 ** the other BtShared locks that we used to hold in ascending
13813 ** order.
13814 */
13815 for(pLater=p->pNext; pLater; pLater=pLater->pNext){
13816 assert( pLater->sharable );
13817 assert( pLater->pNext==0 || pLater->pNext->pBt>pLater->pBt );
13818 assert( !pLater->locked || pLater->wantToLock>0 );
13819 if( pLater->locked ){
13820 unlockBtreeMutex(pLater);
13821 }
13822 }
13823 lockBtreeMutex(p);
13824 for(pLater=p->pNext; pLater; pLater=pLater->pNext){
13825 if( pLater->wantToLock ){
13826 lockBtreeMutex(pLater);
13827 }
13828 }
13829 }
13830
13831
13832 /*
13833 ** Exit the recursive mutex on a Btree.
13834 */
13835 SQLITE_PRIVATE void sqlite3BtreeLeave(Btree *p){
13836 assert( sqlite3_mutex_held(p->db->mutex) );
13837 if( p->sharable ){
13838 assert( p->wantToLock>0 );
13839 p->wantToLock--;
13840 if( p->wantToLock==0 ){
13841 unlockBtreeMutex(p);
13842 }
13843 }
13844 }
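The recursive behavior that sqlite3BtreeEnter() and sqlite3BtreeLeave() build on top of a non-recursive mutex can be sketched in isolation. The toy below is single-threaded and replaces the real mutex with a counter, so it shows only the wantToLock bookkeeping, not the deadlock-avoidance ordering; ToyBtree, toy_enter, toy_leave and nAcquire are hypothetical names.

```c
/* Hypothetical stand-in for the fields of Btree used by Enter/Leave. */
typedef struct ToyBtree {
  int sharable;    /* Non-zero if a mutex is needed at all */
  int wantToLock;  /* Number of nested enter calls outstanding */
  int locked;      /* True while the underlying lock is held */
  int nAcquire;    /* Times the underlying lock was actually taken */
} ToyBtree;

/* Enter: only the outermost call takes the underlying lock. */
static void toy_enter(ToyBtree *p){
  if( !p->sharable ) return;
  p->wantToLock++;
  if( p->locked ) return;
  p->locked = 1;   /* Stands in for acquiring the real mutex */
  p->nAcquire++;
}

/* Leave: only the call that balances the outermost enter releases it. */
static void toy_leave(ToyBtree *p){
  if( p->sharable ){
    p->wantToLock--;
    if( p->wantToLock==0 ){
      p->locked = 0;  /* Stands in for releasing the real mutex */
    }
  }
}
```

Calling toy_enter() twice on a sharable ToyBtree takes the underlying lock once; the lock is released only after the second toy_leave().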
13845
13846 #ifndef NDEBUG
13847 /*
13848 ** Return true if the BtShared mutex is held on the btree, or if the
13849 ** B-Tree is not marked as sharable.
13850 **
13851 ** This routine is used only from within assert() statements.
13852 */
13853 SQLITE_PRIVATE int sqlite3BtreeHoldsMutex(Btree *p){
13854 assert( p->sharable==0 || p->locked==0 || p->wantToLock>0 );
13855 assert( p->sharable==0 || p->locked==0 || p->db==p->pBt->db );
13856 assert( p->sharable==0 || p->locked==0 || sqlite3_mutex_held(p->pBt->mutex) );
13857 assert( p->sharable==0 || p->locked==0 || sqlite3_mutex_held(p->db->mutex) );
13858
13859 return (p->sharable==0 || p->locked);
13860 }
13861 #endif
13862
13863
13864 /*
13865 ** Enter the mutex on every Btree associated with a database
13866 ** connection. This is needed (for example) prior to parsing
13867 ** a statement since we will be comparing table and column names
13868 ** against all schemas and we do not want those schemas being
13869 ** reset out from under us.
13870 **
13871 ** There is a corresponding leave-all procedure.
13872 **
13873 ** Enter the mutexes in ascending order by BtShared pointer address
13874 ** to avoid the possibility of deadlock when two threads with
13875 ** two or more btrees in common both try to lock all their btrees
13876 ** at the same instant.
13877 */
13878 static void SQLITE_NOINLINE btreeEnterAll(sqlite3 *db){
13879 int i;
13880 int skipOk = 1;
13881 Btree *p;
13882 assert( sqlite3_mutex_held(db->mutex) );
13883 for(i=0; i<db->nDb; i++){
13884 p = db->aDb[i].pBt;
13885 if( p && p->sharable ){
13886 sqlite3BtreeEnter(p);
13887 skipOk = 0;
13888 }
13889 }
13890 db->skipBtreeMutex = skipOk;
13891 }
13892 SQLITE_PRIVATE void sqlite3BtreeEnterAll(sqlite3 *db){
13893 if( db->skipBtreeMutex==0 ) btreeEnterAll(db);
13894 }
13895 static void SQLITE_NOINLINE btreeLeaveAll(sqlite3 *db){
13896 int i;
13897 Btree *p;
13898 assert( sqlite3_mutex_held(db->mutex) );
13899 for(i=0; i<db->nDb; i++){
13900 p = db->aDb[i].pBt;
13901 if( p ) sqlite3BtreeLeave(p);
13902 }
13903 }
13904 SQLITE_PRIVATE void sqlite3BtreeLeaveAll(sqlite3 *db){
13905 if( db->skipBtreeMutex==0 ) btreeLeaveAll(db);
13906 }
13907
13908 #ifndef NDEBUG
13909 /*
13910 ** Return true if the current thread holds the database connection
13911 ** mutex and all required BtShared mutexes.
13912 **
13913 ** This routine is used inside assert() statements only.
13914 */
13915 SQLITE_PRIVATE int sqlite3BtreeHoldsAllMutexes(sqlite3 *db){
13916 int i;
13917 if( !sqlite3_mutex_held(db->mutex) ){
13918 return 0;
13919 }
13920 for(i=0; i<db->nDb; i++){
13921 Btree *p;
13922 p = db->aDb[i].pBt;
13923 if( p && p->sharable &&
13924 (p->wantToLock==0 || !sqlite3_mutex_held(p->pBt->mutex)) ){
13925 return 0;
13926 }
13927 }
13928 return 1;
13929 }
13930 #endif /* NDEBUG */
13931
13932 #ifndef NDEBUG
13933 /*
13934 ** Return true if the correct mutexes are held for accessing the
13935 ** db->aDb[iDb].pSchema structure. The mutexes required for schema
13936 ** access are:
13937 **
13938 ** (1) The mutex on db
13939 ** (2) if iDb!=1, then the mutex on db->aDb[iDb].pBt.
13940 **
13941 ** If pSchema is not NULL, then iDb is computed from pSchema and
13942 ** db using sqlite3SchemaToIndex().
13943 */
13944 SQLITE_PRIVATE int sqlite3SchemaMutexHeld(sqlite3 *db, int iDb, Schema *pSchema) {
13945 Btree *p;
13946 assert( db!=0 );
13947 if( pSchema ) iDb = sqlite3SchemaToIndex(db, pSchema);
13948 assert( iDb>=0 && iDb<db->nDb );
13949 if( !sqlite3_mutex_held(db->mutex) ) return 0;
13950 if( iDb==1 ) return 1;
13951 p = db->aDb[iDb].pBt;
13952 assert( p!=0 );
13953 return p->sharable==0 || p->locked==1;
13954 }
13955 #endif /* NDEBUG */
13956
13957 #else /* SQLITE_THREADSAFE>0 above. SQLITE_THREADSAFE==0 below */
13958 /*
13959 ** The following are special cases for mutex enter routines for use
13960 ** in single threaded applications that use shared cache. Except for
13961 ** these two routines, all mutex operations are no-ops in that case and
13962 ** are null #defines in btree.h.
13963 **
13964 ** If shared cache is disabled, then all btree mutex routines, including
13965 ** the ones below, are no-ops and are null #defines in btree.h.
13966 */
13967
13968 SQLITE_PRIVATE void sqlite3BtreeEnter(Btree *p){
13969 p->pBt->db = p->db;
13970 }
13971 SQLITE_PRIVATE void sqlite3BtreeEnterAll(sqlite3 *db){
13972 int i;
13973 for(i=0; i<db->nDb; i++){
13974 Btree *p = db->aDb[i].pBt;
13975 if( p ){
13976 p->pBt->db = p->db;
13977 }
13978 }
13979 }
13980 #endif /* if SQLITE_THREADSAFE */
13981
13982 #ifndef SQLITE_OMIT_INCRBLOB
13983 /*
13984 ** Enter a mutex on a Btree given a cursor owned by that Btree.
13985 **
13986 ** These entry points are used by incremental I/O only. Enter() is required
13987 ** any time OMIT_SHARED_CACHE is not defined, regardless of whether or not
13988 ** the build is threadsafe. Leave() is only required by threadsafe builds.
13989 */
13990 SQLITE_PRIVATE void sqlite3BtreeEnterCursor(BtCursor *pCur){
13991 sqlite3BtreeEnter(pCur->pBtree);
13992 }
13993 # if SQLITE_THREADSAFE
13994 SQLITE_PRIVATE void sqlite3BtreeLeaveCursor(BtCursor *pCur){
13995 sqlite3BtreeLeave(pCur->pBtree);
13996 }
13997 # endif
13998 #endif /* ifndef SQLITE_OMIT_INCRBLOB */
13999
14000 #endif /* ifndef SQLITE_OMIT_SHARED_CACHE */
14001
14002 /************** End of btmutex.c *********************************************/
14003
14004 /* Chain include. */
14005 #include "sqlite3.03.c"