Issue 1428843002: LeveldbValueStore: Deleting db when open fails due to corruption.

Issue 1428843002: LeveldbValueStore: Deleting db when open fails due to corruption. (Closed)

Created:
5 years, 1 month ago by cmumford

Modified:
5 years, 1 month ago

Reviewers:
Devlin, rkaplow

CC:
chromium-reviews, chromium-apps-reviews_chromium.org, asvitkine+watch_chromium.org, extensions-reviews_chromium.org

Base URL:
https://chromium.googlesource.com/chromium/src.git@master

Target Ref:
refs/pending/heads/master

Project:
chromium

Visibility:
Public.

More Reviews

Description

LeveldbValueStore: Deleting db when open fails due to corruption. Corrupted databases would never be deleted. This change will delete the leveldb database iff an open attempt fails as a result of corruption. This class is currently used by the Extensions component in Chrome. BUG=549177 Committed: https://crrev.com/b3e2c306a1adcb117f4134281dfcf0cb6ac2467f Cr-Commit-Position: refs/heads/master@{#358722}

Patch Set 1 #

Patch Set 2 : base::File::DeleteFile -> base::DeleteFile #

Total comments: 3

Patch Set 3 : Handling database Get errors (when corrupted). #

Patch Set 4 : EnsureDbIsOpen calling Restore when corrupted, also Recover -> Restore #

Total comments: 5

Created: 5 years, 1 month ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+64 lines, -22 lines)			Patch
M	extensions/browser/value_store/leveldb_value_store.h	View	1 2 3	2 chunks	+2 lines, -1 line	0 comments	Download
M	extensions/browser/value_store/leveldb_value_store.cc	View	1 2 3	6 chunks	+40 lines, -21 lines	2 comments	Download
M	tools/metrics/histograms/histograms.xml	View	1 2 3	3 chunks	+22 lines, -0 lines	3 comments	Download

Messages

Total messages: 23 (2 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

cmumford

rkaplow@chromium.org: Please review changes in histograms.xml rdevlin.cronin@chromium.org: Please review of all changes. Two notes: 1) ...

5 years, 1 month ago (2015-10-29 16:57:38 UTC) #2

Devlin

> 2) I don't know if any of the Extensions db's would be candidates to ...

5 years, 1 month ago (2015-10-29 17:57:26 UTC) #3

cmumford

Devlin: Just to give you a preview of what I think is missing (error wise). ...

5 years, 1 month ago (2015-10-29 22:34:22 UTC) #4

Devlin:

Just to give you a preview of what I think is missing (error wise). This latest
patch handles the Get failures, in all four places, and calls Restore() to reset
the entire database. This is probably not exactly what you want to do upon
error, but is a step in the right direction. Let me know what you think.

https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
File extensions/browser/value_store/leveldb_value_store.cc (right):

https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
extensions/browser/value_store/leveldb_value_store.cc:360: // We do not
currently recover from corrupted databases, so completely
On 2015/10/29 17:57:26, Devlin wrote:
> We do try to recover.  See ValueStore::Restore().  This sounds like it could
> blow away the whole database when maybe it could be saved?

Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the first
caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does not
pass that error back to the callback, so it get's dropped on the floor, and
ValueStore::Restore() is never called.

So my worry is that there's a lot of plumbing in between the actual error, and
the function that handles it, with plenty of opportunity for a mistake.

I think we should still keep the existing functions as-is, because corruption
can happen post open, but whenever it does happen we have three options:

  1) Blow it away (either delete or leveldb::DestroyDB).
  2) leveldb::RepairDB (which could still fail).
  3) Do nothing - but it *will* fail on next open attempt

So my vote would be to:

  1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the
database upon open when corruption is detected.
  2) Modify ValueStoreFrontend::Backend::Get to pass back the
ValueStore::ReadResult to the callback (or have a second callback).

Does this sound reasonable. I'm still new to this code so am likely overlooking
something...

P.S. I'll respond to the question of RepairDB() on the bug.

Devlin

https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/value_store/leveldb_value_store.cc File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/value_store/leveldb_value_store.cc#newcode360 extensions/browser/value_store/leveldb_value_store.cc:360: // We do not currently recover from corrupted databases, ...

5 years, 1 month ago (2015-10-30 20:13:17 UTC) #5

https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
File extensions/browser/value_store/leveldb_value_store.cc (right):

https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
extensions/browser/value_store/leveldb_value_store.cc:360: // We do not
currently recover from corrupted databases, so completely
On 2015/10/29 22:34:22, cmumford wrote:
> On 2015/10/29 17:57:26, Devlin wrote:
> > We do try to recover.  See ValueStore::Restore().  This sounds like it could
> > blow away the whole database when maybe it could be saved?
> 
> Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the first
> caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does
not
> pass that error back to the callback, so it get's dropped on the floor, and
> ValueStore::Restore() is never called.
> 
> So my worry is that there's a lot of plumbing in between the actual error, and
> the function that handles it, with plenty of opportunity for a mistake.
> 
> I think we should still keep the existing functions as-is, because corruption
> can happen post open, but whenever it does happen we have three options:
> 
>   1) Blow it away (either delete or leveldb::DestroyDB).
>   2) leveldb::RepairDB (which could still fail).
>   3) Do nothing - but it *will* fail on next open attempt
> 
> So my vote would be to:
> 
>   1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the
> database upon open when corruption is detected.
>   2) Modify ValueStoreFrontend::Backend::Get to pass back the
> ValueStore::ReadResult to the callback (or have a second callback).
> 
> Does this sound reasonable. I'm still new to this code so am likely
overlooking
> something...
> 
> P.S. I'll respond to the question of RepairDB() on the bug.

We try to restore when a call fails - this is in storage_api.cc.  E.g., we'll
try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and
then, if that fails, we'll try to handle the error by restore a particular key
or the entire database (ExtensionFunction::ResponseValue
SettingsFunction::HandleError), and then re-run.

Of course, "Restore" is really just "Delete carefully", starting with the single
key and advancing to the whole DB.

cmumford

On 2015/10/30 20:13:17, Devlin wrote: > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/value_store/leveldb_value_store.cc > File extensions/browser/value_store/leveldb_value_store.cc (right): > > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/value_store/leveldb_value_store.cc#newcode360 > ...

5 years, 1 month ago (2015-11-02 17:28:28 UTC) #6

On 2015/10/30 20:13:17, Devlin wrote:
>
https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
> File extensions/browser/value_store/leveldb_value_store.cc (right):
> 
>
https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
> extensions/browser/value_store/leveldb_value_store.cc:360: // We do not
> currently recover from corrupted databases, so completely
> On 2015/10/29 22:34:22, cmumford wrote:
> > On 2015/10/29 17:57:26, Devlin wrote:
> > > We do try to recover.  See ValueStore::Restore().  This sounds like it
could
> > > blow away the whole database when maybe it could be saved?
> > 
> > Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the
first
> > caller to ValueStore::Get() logs the failure, but (unless I'm mistaken) does
> not
> > pass that error back to the callback, so it get's dropped on the floor, and
> > ValueStore::Restore() is never called.
> > 
> > So my worry is that there's a lot of plumbing in between the actual error,
and
> > the function that handles it, with plenty of opportunity for a mistake.
> > 
> > I think we should still keep the existing functions as-is, because
corruption
> > can happen post open, but whenever it does happen we have three options:
> > 
> >   1) Blow it away (either delete or leveldb::DestroyDB).
> >   2) leveldb::RepairDB (which could still fail).
> >   3) Do nothing - but it *will* fail on next open attempt
> > 
> > So my vote would be to:
> > 
> >   1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the
> > database upon open when corruption is detected.
> >   2) Modify ValueStoreFrontend::Backend::Get to pass back the
> > ValueStore::ReadResult to the callback (or have a second callback).
> > 
> > Does this sound reasonable. I'm still new to this code so am likely
> overlooking
> > something...
> > 
> > P.S. I'll respond to the question of RepairDB() on the bug.
> 
> We try to restore when a call fails - this is in storage_api.cc.  E.g., we'll
> try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and
> then, if that fails, we'll try to handle the error by restore a particular key
> or the entire database (ExtensionFunction::ResponseValue
> SettingsFunction::HandleError), and then re-run.
> 
> Of course, "Restore" is really just "Delete carefully", starting with the
single
> key and advancing to the whole DB.

Yes, that is one flow which appears to call Restore, but it's just one (of many)
paths to EnsureDbIsOpen(). If, as an experiment, you modify EnsureDbIsOpen() to
return a corrupted status error, then you can verify that, during starup, that
Restore is never resulting in *many* calls to EnsureDbIsOpen() and many loggings
of those failures.

Devlin

On 2015/11/02 17:28:28, cmumford wrote: > On 2015/10/30 20:13:17, Devlin wrote: > > > https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/value_store/leveldb_value_store.cc ...

5 years, 1 month ago (2015-11-02 17:55:26 UTC) #7

On 2015/11/02 17:28:28, cmumford wrote:
> On 2015/10/30 20:13:17, Devlin wrote:
> >
>
https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
> > File extensions/browser/value_store/leveldb_value_store.cc (right):
> > 
> >
>
https://codereview.chromium.org/1428843002/diff/20001/extensions/browser/valu...
> > extensions/browser/value_store/leveldb_value_store.cc:360: // We do not
> > currently recover from corrupted databases, so completely
> > On 2015/10/29 22:34:22, cmumford wrote:
> > > On 2015/10/29 17:57:26, Devlin wrote:
> > > > We do try to recover.  See ValueStore::Restore().  This sounds like it
> could
> > > > blow away the whole database when maybe it could be saved?
> > > 
> > > Yes, it appears as though ValueStoreFrontend::Backend::Get (which is the
> first
> > > caller to ValueStore::Get() logs the failure, but (unless I'm mistaken)
does
> > not
> > > pass that error back to the callback, so it get's dropped on the floor,
and
> > > ValueStore::Restore() is never called.
> > > 
> > > So my worry is that there's a lot of plumbing in between the actual error,
> and
> > > the function that handles it, with plenty of opportunity for a mistake.
> > > 
> > > I think we should still keep the existing functions as-is, because
> corruption
> > > can happen post open, but whenever it does happen we have three options:
> > > 
> > >   1) Blow it away (either delete or leveldb::DestroyDB).
> > >   2) leveldb::RepairDB (which could still fail).
> > >   3) Do nothing - but it *will* fail on next open attempt
> > > 
> > > So my vote would be to:
> > > 
> > >   1) keep ValueStore::Restore() as-is, but to delete (possibly repair) the
> > > database upon open when corruption is detected.
> > >   2) Modify ValueStoreFrontend::Backend::Get to pass back the
> > > ValueStore::ReadResult to the callback (or have a second callback).
> > > 
> > > Does this sound reasonable. I'm still new to this code so am likely
> > overlooking
> > > something...
> > > 
> > > P.S. I'll respond to the question of RepairDB() on the bug.
> > 
> > We try to restore when a call fails - this is in storage_api.cc.  E.g.,
we'll
> > try to perform a Get() (StorageStorageAreaGetFunction::RunWithStorage), and
> > then, if that fails, we'll try to handle the error by restore a particular
key
> > or the entire database (ExtensionFunction::ResponseValue
> > SettingsFunction::HandleError), and then re-run.
> > 
> > Of course, "Restore" is really just "Delete carefully", starting with the
> single
> > key and advancing to the whole DB.
> 
> Yes, that is one flow which appears to call Restore, but it's just one (of
many)
> paths to EnsureDbIsOpen(). If, as an experiment, you modify EnsureDbIsOpen()
to
> return a corrupted status error, then you can verify that, during starup, that
> Restore is never resulting in *many* calls to EnsureDbIsOpen() and many
loggings
> of those failures.

Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I think
were originally only called from the API.  The fact that there are more paths
probably means that we should be trying to recover at those times, too (and
possibly looking at why we're hitting the DB without extension action).

In any case, the right path here should be to try and recover safely.  We
shouldn't be blowing away databases if there's a chance they could be saved. 
Can we try and ensure that each path to open the database tries to recover and,
if it makes sense, just use the leveldb::RepairDb() functionality instead of
Restore()?

cmumford

> Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I think > were ...

5 years, 1 month ago (2015-11-03 00:44:31 UTC) #8

Devlin

On 2015/11/03 00:44:31, cmumford wrote: > > Hmm, EnsureDbIsOpen seems to only be called from ...

5 years, 1 month ago (2015-11-03 17:00:06 UTC) #9

On 2015/11/03 00:44:31, cmumford wrote:
> > Hmm, EnsureDbIsOpen seems to only be called from Get/Set/Remove, which I
think
> > were originally only called from the API.  The fact that there are more
paths
> > probably means that we should be trying to recover at those times, too (and
> > possibly looking at why we're hitting the DB without extension action).
> > 
> > In any case, the right path here should be to try and recover safely.  We
> > shouldn't be blowing away databases if there's a chance they could be saved.

> > Can we try and ensure that each path to open the database tries to recover
> and,
> > if it makes sense, just use the leveldb::RepairDb() functionality instead of
> > Restore()?
> 
> I agree with you on modifying all paths - if for no other reason those
> functions need to know that their read/write failed, to take the appropriate
> action.
> 
> By removing this proposed repair out of EnsureDbIsOpen I'm concerned with
> three things.
> 
> 1. There is an awfully lot of moving parts that need to be connected
>    correctly to ensure that Restore() is called correctly. It's a fragile
>    approach.
> 2. I’m confident, but still not 100% sure I’ve found every way that
>    Restore() might not be called.
> 3. There may be race conditions with multiple in-flight read/write's which
>    could fail and result in multiple calls to Restore().
> 
> Don't worry, that's my last appeal for this approach. Let me know if you
> want to stick with the existing design, and I’ll strip the db recovery
> code from EnsureDbIsOpen. You are the owner!
> 
> I have local changes to add a result callback to all methods in StateStore
> that read/write to the database (and indirectly call EnsureDbIsOpen()).
> These are GetExtensionValue() (already in this CL), SetExtensionValue(),
> RemoveExtensionValue(), and Restore().
> 
> It's a big set of changes, so best to land in separate CL's IMO, but if you
> prefer to have one big commit let me know.

Sorry, might have been unclear.  What I'd like to see is a single place we try
to recover/repair the database.  Right now, that's in the API, but I'm fine with
moving it to EnsureDbIsOpen (assuming this wouldn't be too much of a performance
regression) - but then we should remove the Restore() calls, because they're
superfluous (the DB will be repaired or not, and trying again won't do
anything).  Also, we *should* try to repair, before blowing the DB away
completely, just as we do now (otherwise, this will be a regression, because we
would previously try and restore it, whereas now we'd just kill it).  Finally, I
would assume that the internal RepairDb() function is probably smarter than
Restore(), and should probably be used instead.

So, to sum up, I'd basically expect this CL to end up as:
- Remove the Restore() calls.
- Put RepairDb() calls in EnsureDbIsOpen.

I think that meets all our needs, right?  We still try and repair, and we do so
at first sign of corruption, so other callers are happy.

cmumford

> > Sorry, might have been unclear. What I'd like to see is a single ...

5 years, 1 month ago (2015-11-05 21:37:38 UTC) #10

Devlin

On 2015/11/05 21:37:38, cmumford wrote: > > > > Sorry, might have been unclear. What ...

5 years, 1 month ago (2015-11-05 22:04:02 UTC) #11

On 2015/11/05 21:37:38, cmumford wrote:
> > 
> > Sorry, might have been unclear.  What I'd like to see is a single place we
try
> > to recover/repair the database.  Right now, that's in the API, but I'm fine
> with
> > moving it to EnsureDbIsOpen (assuming this wouldn't be too much of a
> performance
> > regression) - but then we should remove the Restore() calls, because they're
> > superfluous (the DB will be repaired or not, and trying again won't do
> > anything).  Also, we *should* try to repair, before blowing the DB away
> > completely, just as we do now (otherwise, this will be a regression, because
> we
> > would previously try and restore it, whereas now we'd just kill it). 
Finally,
> I
> > would assume that the internal RepairDb() function is probably smarter than
> > Restore(), and should probably be used instead.
> > 
> > So, to sum up, I'd basically expect this CL to end up as:
> > - Remove the Restore() calls.
> > - Put RepairDb() calls in EnsureDbIsOpen.
> > 
> > I think that meets all our needs, right?  We still try and repair, and we do
> so
> > at first sign of corruption, so other callers are happy.
> 
> Yeah - I think we're basically in agreement. Unfortunately LeveldbValueStore
is
> one of six classes derived from ValueStore where Restore() is defined. I don't
> think we can remove Restore from the other value stores where it probably
makes
> sense. What do you think about having EnsureDbIsOpen call Restore directly,
but
> still return the error which (after these changes) will still call Restore a
> second time. That will be inefficient, but only about 5 in a million db's get
> corrupted so I think that should be tolerable.

All those classes actually eventually reach a LevelDBValueStore (except the
testing ones, of course).  We can leave in the Restore() method, but would it
still make sense to replace the implementation with RepairDb?

But otherwise, yeah, just putting a Restore() call in EnsureDbIsOpen() SGTM.

cmumford

> All those classes actually eventually reach a LevelDBValueStore (except the > testing ones, of ...

5 years, 1 month ago (2015-11-05 22:05:38 UTC) #12

cmumford

Devlin: Here's the version with EnsureDbIsOpen calling Restore. I decided to move the ::Get() error ...

5 years, 1 month ago (2015-11-06 18:37:50 UTC) #13

Devlin

Much cleaner! :) Restore() is really only called in two circumstances: 1. When something (a ...

5 years, 1 month ago (2015-11-06 18:55:34 UTC) #14

cmumford

On 2015/11/06 18:55:34, Devlin wrote: > Much cleaner! :) > > Restore() is really only ...

5 years, 1 month ago (2015-11-06 19:16:05 UTC) #15

Devlin

On 2015/11/06 19:16:05, cmumford wrote: > On 2015/11/06 18:55:34, Devlin wrote: > > Much cleaner! ...

5 years, 1 month ago (2015-11-06 19:54:46 UTC) #16

Devlin

On 2015/11/06 19:16:05, cmumford wrote: > On 2015/11/06 18:55:34, Devlin wrote: > > Much cleaner! ...

5 years, 1 month ago (2015-11-06 19:54:47 UTC) #17

rkaplow

lgtm https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histograms/histograms.xml File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1428843002/diff/60001/tools/metrics/histograms/histograms.xml#newcode64970 tools/metrics/histograms/histograms.xml:64970: + <int value="2" label="Repair Success"/> On 2015/11/06 19:54:46, ...

5 years, 1 month ago (2015-11-06 23:06:51 UTC) #18

cmumford

Should be landing shortly. https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/value_store/leveldb_value_store.cc File extensions/browser/value_store/leveldb_value_store.cc (right): https://codereview.chromium.org/1428843002/diff/60001/extensions/browser/value_store/leveldb_value_store.cc#newcode76 extensions/browser/value_store/leveldb_value_store.cc:76: restore_histogram_ = base::LinearHistogram::FactoryGet( On 2015/11/06 ...

5 years, 1 month ago (2015-11-09 23:04:22 UTC) #19

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1428843002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1428843002/60001

5 years, 1 month ago (2015-11-09 23:06:12 UTC) #21

commit-bot: I haz the power

5 years, 1 month ago (2015-11-10 00:55:30 UTC) #23

Message was sent while issue was closed.

Patchset 4 (id:??) landed as
https://crrev.com/b3e2c306a1adcb117f4134281dfcf0cb6ac2467f
Cr-Commit-Position: refs/heads/master@{#358722}

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages