|
|
Created:
5 years, 1 month ago by cmumford Modified:
5 years, 1 month ago CC:
chromium-reviews, chromium-apps-reviews_chromium.org, asvitkine+watch_chromium.org, extensions-reviews_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionLeveldbValueStore: Deleting db when open fails due to corruption.
Corrupted databases would never be deleted. This change will delete
the leveldb database iff an open attempt fails as a result of
corruption.
This class is currently used by the Extensions component in Chrome.
BUG=549177
Committed: https://crrev.com/b3e2c306a1adcb117f4134281dfcf0cb6ac2467f
Cr-Commit-Position: refs/heads/master@{#358722}
Patch Set 1 #Patch Set 2 : base::File::DeleteFile -> base::DeleteFile #
Total comments: 3
Patch Set 3 : Handling database Get errors (when corrupted). #Patch Set 4 : EnsureDbIsOpen calling Restore when corrupted, also Recover -> Restore #
Total comments: 5
Messages
Total messages: 23 (2 generated)
cmumford@chromium.org changed reviewers: + rdevlin.cronin@chromium.org, rkaplow@chromium.org
rkaplow@chromium.org: Please review changes in histograms.xml rdevlin.cronin@chromium.org: Please review of all changes. Two notes: 1) I chose to use DeleteFile as most dev's prefer over leveldb::DestroyDB(). 2) I don't know if any of the Extensions db's would be candidates to use leveldb::RepairDB(). I'd love to use that if possible.
> 2) I don't know if any of the Extensions db's would be candidates to use > leveldb::RepairDB(). I'd love to use that if possible. I don't know about leveldb::RepairDB(), but it sounds awesome, and I don't see any reason extensions couldn't benefit. Of course, I don't know anything about it. ;) Are there some criteria (or catches) for being able to use it on a database? https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... extensions/browser/value_store/leveldb_value_store.cc:360: // We do not currently recover from corrupted databases, so completely We do try to recover. See ValueStore::Restore(). This sounds like it could blow away the whole database when maybe it could be saved?
Devlin: Just to give you a preview of what I think is missing (error wise). This latest patch handles the Get failures, in all four places, and calls Restore() to reset the entire database. This is probably not exactly what you want to do upon error, but is a step in the right direction. Let me know what you think. https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... extensions/browser/value_store/leveldb_value_store.cc:360: // We do not currently recover from corrupted databases, so completely On 2015/10/29 17:57:26, Devlin wrote: > We do try to recover. See ValueStore::Restore(). This sounds like it could > blow away the whole database when maybe it could be saved? Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the first caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does not pass that error back to the callback, so it get's dropped on the floor, and ValueStore::Restore() is never called. So my worry is that there's a lot of plumbing in between the actual error, and the function that handles it, with plenty of opportunity for a mistake. I think we should still keep the existing functions as-is, because corruption can happen post open, but whenever it does happen we have three options: 1) Blow it away (either delete or leveldb::DestroyDB). 2) leveldb::RepairDB (which could still fail). 3) Do nothing - but it *will* fail on next open attempt So my vote would be to: 1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the database upon open when corruption is detected. 2) Modify ValueStoreFrontend::Backend::Get to pass back the ValueStore::ReadResult to the callback (or have a second callback). Does this sound reasonable. I'm still new to this code so am likely overlooking something... P.S. I'll respond to the question of RepairDB() on the bug.
https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... extensions/browser/value_store/leveldb_value_store.cc:360: // We do not currently recover from corrupted databases, so completely On 2015/10/29 22:34:22, cmumford wrote: > On 2015/10/29 17:57:26, Devlin wrote: > > We do try to recover. See ValueStore::Restore(). This sounds like it could > > blow away the whole database when maybe it could be saved? > > Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the first > caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does not > pass that error back to the callback, so it get's dropped on the floor, and > ValueStore::Restore() is never called. > > So my worry is that there's a lot of plumbing in between the actual error, and > the function that handles it, with plenty of opportunity for a mistake. > > I think we should still keep the existing functions as-is, because corruption > can happen post open, but whenever it does happen we have three options: > > 1) Blow it away (either delete or leveldb::DestroyDB). > 2) leveldb::RepairDB (which could still fail). > 3) Do nothing - but it *will* fail on next open attempt > > So my vote would be to: > > 1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the > database upon open when corruption is detected. > 2) Modify ValueStoreFrontend::Backend::Get to pass back the > ValueStore::ReadResult to the callback (or have a second callback). > > Does this sound reasonable. I'm still new to this code so am likely overlooking > something... > > P.S. I'll respond to the question of RepairDB() on the bug. We try to restore when a call fails - this is in storage_api.cc. E.g., we'll try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and then, if that fails, we'll try to handle the error by restore a particular key or the entire database (ExtensionFunction::ResponseValue SettingsFunction::HandleError), and then re-run. Of course, "Restore" is really just "Delete carefully", starting with the single key and advancing to the whole DB.
On 2015/10/30 20:13:17, Devlin wrote: > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... > File extensions/browser/value_store/leveldb_value_store.cc (right): > > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... > extensions/browser/value_store/leveldb_value_store.cc:360: // We do not > currently recover from corrupted databases, so completely > On 2015/10/29 22:34:22, cmumford wrote: > > On 2015/10/29 17:57:26, Devlin wrote: > > > We do try to recover. See ValueStore::Restore(). This sounds like it could > > > blow away the whole database when maybe it could be saved? > > > > Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the first > > caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does > not > > pass that error back to the callback, so it get's dropped on the floor, and > > ValueStore::Restore() is never called. > > > > So my worry is that there's a lot of plumbing in between the actual error, and > > the function that handles it, with plenty of opportunity for a mistake. > > > > I think we should still keep the existing functions as-is, because corruption > > can happen post open, but whenever it does happen we have three options: > > > > 1) Blow it away (either delete or leveldb::DestroyDB). > > 2) leveldb::RepairDB (which could still fail). > > 3) Do nothing - but it *will* fail on next open attempt > > > > So my vote would be to: > > > > 1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the > > database upon open when corruption is detected. > > 2) Modify ValueStoreFrontend::Backend::Get to pass back the > > ValueStore::ReadResult to the callback (or have a second callback). > > > > Does this sound reasonable. I'm still new to this code so am likely > overlooking > > something... > > > > P.S. I'll respond to the question of RepairDB() on the bug. > > We try to restore when a call fails - this is in storage_api.cc. E.g., we'll > try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and > then, if that fails, we'll try to handle the error by restore a particular key > or the entire database (ExtensionFunction::ResponseValue > SettingsFunction::HandleError), and then re-run. > > Of course, "Restore" is really just "Delete carefully", starting with the single > key and advancing to the whole DB. Yes, that is one flow which appears to call Restore, but it's just one (of many) paths to EnsureDbIsOpen(). If, as an experiment, you modify EnsureDbIsOpen() to return a corrupted status error, then you can verify that, during starup, that Restore is never resulting in *many* calls to EnsureDbIsOpen() and many loggings of those failures.
On 2015/11/02 17:28:28, cmumford wrote: > On 2015/10/30 20:13:17, Devlin wrote: > > > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... > > File extensions/browser/value_store/leveldb_value_store.cc (right): > > > > > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu... > > extensions/browser/value_store/leveldb_value_store.cc:360: // We do not > > currently recover from corrupted databases, so completely > > On 2015/10/29 22:34:22, cmumford wrote: > > > On 2015/10/29 17:57:26, Devlin wrote: > > > > We do try to recover. See ValueStore::Restore(). This sounds like it > could > > > > blow away the whole database when maybe it could be saved? > > > > > > Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the > first > > > caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does > > not > > > pass that error back to the callback, so it get's dropped on the floor, and > > > ValueStore::Restore() is never called. > > > > > > So my worry is that there's a lot of plumbing in between the actual error, > and > > > the function that handles it, with plenty of opportunity for a mistake. > > > > > > I think we should still keep the existing functions as-is, because > corruption > > > can happen post open, but whenever it does happen we have three options: > > > > > > 1) Blow it away (either delete or leveldb::DestroyDB). > > > 2) leveldb::RepairDB (which could still fail). > > > 3) Do nothing - but it *will* fail on next open attempt > > > > > > So my vote would be to: > > > > > > 1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the > > > database upon open when corruption is detected. > > > 2) Modify ValueStoreFrontend::Backend::Get to pass back the > > > ValueStore::ReadResult to the callback (or have a second callback). > > > > > > Does this sound reasonable. I'm still new to this code so am likely > > overlooking > > > something... > > > > > > P.S. I'll respond to the question of RepairDB() on the bug. > > > > We try to restore when a call fails - this is in storage_api.cc. E.g., we'll > > try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and > > then, if that fails, we'll try to handle the error by restore a particular key > > or the entire database (ExtensionFunction::ResponseValue > > SettingsFunction::HandleError), and then re-run. > > > > Of course, "Restore" is really just "Delete carefully", starting with the > single > > key and advancing to the whole DB. > > Yes, that is one flow which appears to call Restore, but it's just one (of many) > paths to EnsureDbIsOpen(). If, as an experiment, you modify EnsureDbIsOpen() to > return a corrupted status error, then you can verify that, during starup, that > Restore is never resulting in *many* calls to EnsureDbIsOpen() and many loggings > of those failures. Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I think were originally only called from the API. The fact that there are more paths probably means that we should be trying to recover at those times, too (and possibly looking at why we're hitting the DB without extension action). In any case, the right path here should be to try and recover safely. We shouldn't be blowing away databases if there's a chance they could be saved. Can we try and ensure that each path to open the database tries to recover and, if it makes sense, just use the leveldb::RepairDb() functionality instead of Restore()?
> Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I think > were originally only called from the API. The fact that there are more paths > probably means that we should be trying to recover at those times, too (and > possibly looking at why we're hitting the DB without extension action). > > In any case, the right path here should be to try and recover safely. We > shouldn't be blowing away databases if there's a chance they could be saved. > Can we try and ensure that each path to open the database tries to recover and, > if it makes sense, just use the leveldb::RepairDb() functionality instead of > Restore()? I agree with you on modifying all paths - if for no other reason those functions need to know that their read/write failed, to take the appropriate action. By removing this proposed repair out of EnsureDbIsOpen I'm concerned with three things. 1. There is an awfully lot of moving parts that need to be connected correctly to ensure that Restore() is called correctly. It's a fragile approach. 2. I’m confident, but still not 100% sure I’ve found every way that Restore() might not be called. 3. There may be race conditions with multiple in-flight read/write's which could fail and result in multiple calls to Restore(). Don't worry, that's my last appeal for this approach. Let me know if you want to stick with the existing design, and I’ll strip the db recovery code from EnsureDbIsOpen. You are the owner! I have local changes to add a result callback to all methods in StateStore that read/write to the database (and indirectly call EnsureDbIsOpen()). These are GetExtensionValue() (already in this CL), SetExtensionValue(), RemoveExtensionValue(), and Restore(). It's a big set of changes, so best to land in separate CL's IMO, but if you prefer to have one big commit let me know.
On 2015/11/03 00:44:31, cmumford wrote: > > Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I think > > were originally only called from the API. The fact that there are more paths > > probably means that we should be trying to recover at those times, too (and > > possibly looking at why we're hitting the DB without extension action). > > > > In any case, the right path here should be to try and recover safely. We > > shouldn't be blowing away databases if there's a chance they could be saved. > > Can we try and ensure that each path to open the database tries to recover > and, > > if it makes sense, just use the leveldb::RepairDb() functionality instead of > > Restore()? > > I agree with you on modifying all paths - if for no other reason those > functions need to know that their read/write failed, to take the appropriate > action. > > By removing this proposed repair out of EnsureDbIsOpen I'm concerned with > three things. > > 1. There is an awfully lot of moving parts that need to be connected > correctly to ensure that Restore() is called correctly. It's a fragile > approach. > 2. I’m confident, but still not 100% sure I’ve found every way that > Restore() might not be called. > 3. There may be race conditions with multiple in-flight read/write's which > could fail and result in multiple calls to Restore(). > > Don't worry, that's my last appeal for this approach. Let me know if you > want to stick with the existing design, and I’ll strip the db recovery > code from EnsureDbIsOpen. You are the owner! > > I have local changes to add a result callback to all methods in StateStore > that read/write to the database (and indirectly call EnsureDbIsOpen()). > These are GetExtensionValue() (already in this CL), SetExtensionValue(), > RemoveExtensionValue(), and Restore(). > > It's a big set of changes, so best to land in separate CL's IMO, but if you > prefer to have one big commit let me know. Sorry, might have been unclear. What I'd like to see is a single place we try to recover/repair the database. Right now, that's in the API, but I'm fine with moving it to EnsureDbIsOpen (assuming this wouldn't be too much of a performance regression) - but then we should remove the Restore() calls, because they're superfluous (the DB will be repaired or not, and trying again won't do anything). Also, we *should* try to repair, before blowing the DB away completely, just as we do now (otherwise, this will be a regression, because we would previously try and restore it, whereas now we'd just kill it). Finally, I would assume that the internal RepairDb() function is probably smarter than Restore(), and should probably be used instead. So, to sum up, I'd basically expect this CL to end up as: - Remove the Restore() calls. - Put RepairDb() calls in EnsureDbIsOpen. I think that meets all our needs, right? We still try and repair, and we do so at first sign of corruption, so other callers are happy.
> > Sorry, might have been unclear. What I'd like to see is a single place we try > to recover/repair the database. Right now, that's in the API, but I'm fine with > moving it to EnsureDbIsOpen (assuming this wouldn't be too much of a performance > regression) - but then we should remove the Restore() calls, because they're > superfluous (the DB will be repaired or not, and trying again won't do > anything). Also, we *should* try to repair, before blowing the DB away > completely, just as we do now (otherwise, this will be a regression, because we > would previously try and restore it, whereas now we'd just kill it). Finally, I > would assume that the internal RepairDb() function is probably smarter than > Restore(), and should probably be used instead. > > So, to sum up, I'd basically expect this CL to end up as: > - Remove the Restore() calls. > - Put RepairDb() calls in EnsureDbIsOpen. > > I think that meets all our needs, right? We still try and repair, and we do so > at first sign of corruption, so other callers are happy. Yeah - I think we're basically in agreement. Unfortunately LeveldbValueStore is one of six classes derived from ValueStore where Restore() is defined. I don't think we can remove Restore from the other value stores where it probably makes sense. What do you think about having EnsureDbIsOpen call Restore directly, but still return the error which (after these changes) will still call Restore a second time. That will be inefficient, but only about 5 in a million db's get corrupted so I think that should be tolerable.
On 2015/11/05 21:37:38, cmumford wrote: > > > > Sorry, might have been unclear. What I'd like to see is a single place we try > > to recover/repair the database. Right now, that's in the API, but I'm fine > with > > moving it to EnsureDbIsOpen (assuming this wouldn't be too much of a > performance > > regression) - but then we should remove the Restore() calls, because they're > > superfluous (the DB will be repaired or not, and trying again won't do > > anything). Also, we *should* try to repair, before blowing the DB away > > completely, just as we do now (otherwise, this will be a regression, because > we > > would previously try and restore it, whereas now we'd just kill it). Finally, > I > > would assume that the internal RepairDb() function is probably smarter than > > Restore(), and should probably be used instead. > > > > So, to sum up, I'd basically expect this CL to end up as: > > - Remove the Restore() calls. > > - Put RepairDb() calls in EnsureDbIsOpen. > > > > I think that meets all our needs, right? We still try and repair, and we do > so > > at first sign of corruption, so other callers are happy. > > Yeah - I think we're basically in agreement. Unfortunately LeveldbValueStore is > one of six classes derived from ValueStore where Restore() is defined. I don't > think we can remove Restore from the other value stores where it probably makes > sense. What do you think about having EnsureDbIsOpen call Restore directly, but > still return the error which (after these changes) will still call Restore a > second time. That will be inefficient, but only about 5 in a million db's get > corrupted so I think that should be tolerable. All those classes actually eventually reach a LevelDBValueStore (except the testing ones, of course). We can leave in the Restore() method, but would it still make sense to replace the implementation with RepairDb? But otherwise, yeah, just putting a Restore() call in EnsureDbIsOpen() SGTM.
> All those classes actually eventually reach a LevelDBValueStore (except the > testing ones, of course). We can leave in the Restore() method, but would it > still make sense to replace the implementation with RepairDb? > > But otherwise, yeah, just putting a Restore() call in EnsureDbIsOpen() SGTM. Yes, I'll modify Restore to call leveldb::RepairDB(). I'll the CL modified.
Devlin: Here's the version with EnsureDbIsOpen calling Restore. I decided to move the ::Get() error checking into a separate CL to keep things simple.
Much cleaner! :) Restore() is really only called in two circumstances: 1. When something (a Get()/Set()) fails. 2. From within a Restore() implementation (i.e., delegating the call). Since Get()/Set() calls also call EnsureDbIsOpen(), it seems like 1 is now unnecessary, right? Can we remove it?
On 2015/11/06 18:55:34, Devlin wrote: > Much cleaner! :) > > Restore() is really only called in two circumstances: > 1. When something (a Get()/Set()) fails. > 2. From within a Restore() implementation (i.e., delegating the call). > > Since Get()/Set() calls also call EnsureDbIsOpen(), it seems like 1 is now > unnecessary, right? Can we remove it? I don't think so. Because corruption can either be undetected during open, or can happen to a currently open db we still need to detect and recover in that scenario.
On 2015/11/06 19:16:05, cmumford wrote: > On 2015/11/06 18:55:34, Devlin wrote: > > Much cleaner! :) > > > > Restore() is really only called in two circumstances: > > 1. When something (a Get()/Set()) fails. > > 2. From within a Restore() implementation (i.e., delegating the call). > > > > Since Get()/Set() calls also call EnsureDbIsOpen(), it seems like 1 is now > > unnecessary, right? Can we remove it? > > I don't think so. Because corruption can either be undetected during open, or > can happen to a currently open db we still need to detect and recover in that > scenario. Okay, that make sense. LGTM if the bots are happy. https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/valu... File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/valu... extensions/browser/value_store/leveldb_value_store.cc:76: restore_histogram_ = base::LinearHistogram::FactoryGet( This is histogram black magic to me, so I'll trust the histogram owner to verify it. :) https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:64970: + <int value="2" label="Repair Success"/> Missing one, right?
On 2015/11/06 19:16:05, cmumford wrote: > On 2015/11/06 18:55:34, Devlin wrote: > > Much cleaner! :) > > > > Restore() is really only called in two circumstances: > > 1. When something (a Get()/Set()) fails. > > 2. From within a Restore() implementation (i.e., delegating the call). > > > > Since Get()/Set() calls also call EnsureDbIsOpen(), it seems like 1 is now > > unnecessary, right? Can we remove it? > > I don't think so. Because corruption can either be undetected during open, or > can happen to a currently open db we still need to detect and recover in that > scenario. Okay, that make sense. LGTM if the bots are happy.
lgtm https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:64970: + <int value="2" label="Repair Success"/> On 2015/11/06 19:54:46, Devlin wrote: > Missing one, right? this is all that is tracked, actually.
Should be landing shortly. https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/valu... File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/valu... extensions/browser/value_store/leveldb_value_store.cc:76: restore_histogram_ = base::LinearHistogram::FactoryGet( On 2015/11/06 19:54:46, Devlin wrote: > This is histogram black magic to me, so I'll trust the histogram owner to verify > it. :) I manually hack-ed in a call to "restore_histogram_->Add(LEVELDB_RESTORE_REPAIR_SUCCESS);" and verified it was in about:histograms. https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:64970: + <int value="2" label="Repair Success"/> On 2015/11/06 19:54:46, Devlin wrote: > Missing one, right? These correlate to the vales in LevelDBCorruptionRecoveryValue (from leveldb_value_store.cc).
The CQ bit was checked by cmumford@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1428843002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1428843002/60001
Message was sent while issue was closed.
Committed patchset #4 (id:60001)
Message was sent while issue was closed.
Patchset 4 (id:??) landed as https://crrev.com/b3e2c306a1adcb117f4134281dfcf0cb6ac2467f Cr-Commit-Position: refs/heads/master@{#358722} |