Issue 1832173002: [sql] Database recovery system for Shortcuts.

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-25 21:20:29 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1832173002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1832173002/20001

4 years, 9 months ago (2016-03-25 21:20:51 UTC) #2

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-25 22:35:49 UTC) #3

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 9 months ago (2016-03-25 22:35:51 UTC) #4

Scott Hess - ex-Googler

shess@chromium.org changed reviewers: + mpearson@chromium.org

4 years, 9 months ago (2016-03-25 23:24:32 UTC) #5

Scott Hess - ex-Googler

I realize this is an SQLite-heavy review, but, unfortunately, the collection of SQLite-centric people on ...

4 years, 9 months ago (2016-03-25 23:24:32 UTC) #6

Mark P

On 2016/03/25 23:24:32, Scott Hess wrote: > I realize this is an SQLite-heavy review, but, ...

4 years, 8 months ago (2016-03-28 19:27:59 UTC) #7

Scott Hess - ex-Googler

On 2016/03/28 19:27:59, Mark P wrote: > On 2016/03/25 23:24:32, Scott Hess wrote: > > ...

4 years, 8 months ago (2016-03-28 20:01:54 UTC) #8

On 2016/03/28 19:27:59, Mark P wrote:
> On 2016/03/25 23:24:32, Scott Hess wrote:
> > I realize this is an SQLite-heavy review, but, unfortunately, the collection
> of
> > SQLite-centric people on Chromium is pretty slim (mostly me, actually).  I
> don't
> > think there's anything in here that's super subtle.
> > 
> > components/history/core/browser/thumbnail_database.cc contains an existing
> > example of sql::Recovery being used.  vacuum.c references are to
> > third_party/sqlite/src/src/vacuum.c , which is pretty short.
> > 
> > Thanks.
> 
> I'm out sick and this looks complicated.  I'll aim to get to this near the end
> of the week; I hope that's okay.

No problem - it's not time-sensitive.  Though I think I'll be going OOT for a
week on Friday :-).  I can start working on queueing up whichever database I
want to hit next while waiting.

> I'm curious: do you have any idea of the effect this'll have?

For the databases in question, right now it's highly likely that those that are
corrupt are stuck in that state.  Corruption is weird, it's detected at the
point where SQLite notices something wrong, so it's possible that one query
works but a similar query fails based on what data is accessed.  It is
_possible_ to have a corruption which gradually is updated away, but IMHO pretty
unlikely, my experience with people suffering from corruption is that it just
persists until action is taken (for instance delete the profile).  The way most
Chromium code is written, you end up with a kind of half-working state which
gets further and further out of line over time (in this case old shortcuts might
continue to work, but they can't be updated or deleted and new ones can't be
added).  With this change, any data which was readable before _should_ end up in
the resulting database, the primary key should prevent any duplication, and then
going forward the database should work correctly.

In terms of efficacy, my experience is that corruptions are mostly uncorrelated,
they're from things like losing power suddenly or BSOD, so for most users fixing
things once is all they'll need.  Of course, that's for legitimate cases, if the
user has flaky hardware or malware messing things up, this isn't going to help
them in the end.

It is likely that there will still be databases which end up in an unusable
state.  I have some code to upload diagnostics via the crash server.  I might
circle back with that for this database, in the past I've found that error cases
often had a few actionable buckets (for instance an incorrectly-coded schema
upgrade) before getting to the irreducible long tail.  That will depend on how
many cases are left.

My long-term goal is to have things be more self-healing, because we can't
really rely on a DBA to manually fix things, and relying on users to notice and
delete their databases is only going to address a few percent at most.

Right now the two databases which already have similar code in place are
recovering ~125k databases per month.  It's been in place long enough that this
is the steady state.  Hard to estimate results for this database, I mainly
picked it based on Sqlite.Error.Shortcuts having enough hits that changes to
SQLITE_CORRUPT will be noticeable, even on dev channel.

Mark P

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/browser/shortcuts_database.cc File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/browser/shortcuts_database.cc#newcode68 components/omnibox/browser/shortcuts_database.cc:68: // contradicts itself, but SQLite is able to read ...

4 years, 8 months ago (2016-04-07 23:09:44 UTC) #9

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:68: // contradicts itself, but
SQLite is able to read it.
Do you want this explanation here rather than than by the ShouldRecoverOrRaze()
function?  Having it in two places seems redundant and likely to get out of
sync.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:71: // itself.
ditto about this TODO

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:82: // return errors until the
handle is re-opened.
This comment duplicates much of the comment by the function definition.  Is any
of it necessary?

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:87: if
(!sql::Connection::ShouldIgnoreSqliteError(extended_error))
Do you still want to run this code / possibly display the message if the above
code decides to recover the database?

If yes, consider whether it would be better for developers to display this
message first or the (possibly failed) recovery message first.  (I'd think the
original error first would be better.)

If no, then you can struct this function differently.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:175: // addressed.  Someone
needs to handle invalid schema.
This shouldn't be a comment.  File a bug against pkasting@ to either do
something with the return value (and comment in the .h file the meaning of the
return value) or change the return value to void.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87
sql/recovery.cc:87: // Automatically recovered entire database successfully.
This comment sounds like it could equally apply to the earlier enum
RECOVERY_SUCCESS_AUTORECOVER.  Please figure out a way to make their differences
clearer.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode540
sql/recovery.cc:540: // |prefix|, and apply it to [main].  Skip
'sqlite_sequence', which will be
This comment is too broad.  Below you explicitly assume the prefix is the right
amount of text so if you add a "main." right after it, it makes sense as a sql
statement.
SELECT foo from BAR;
->
SELmain.ECT foo from BAR;
Ugh, you don't mean this.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode541
sql/recovery.cc:541: // created on demand by SQLite if any tables use
AUTOINCREMENT.
This comment should mention the meaning of the return value.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode546
sql/recovery.cc:546: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'"));
What happens if prefix_length > length(sql)?  What does substr return in this
case?
The sqlite3 documentation I found doesn't define this.
https://www.sqlite.org/lang_corefunc.html
Perhaps you want to add another another condition ensuring that sql is long
enough?
(Though I can't imagine this causing any problems unless substr() actually fails
on this input, as the equality test still applies)

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode552
sql/recovery.cc:552: // to be worth the complexity, though.
Can you expand this slightly
to be worth complexity to skip statements with IF NOT EXISTS?
to be worth complexity to identify IF NOT EXISTS statement and then ...?

Also, if you don't plan to do it, please don't label it as a TODO. :-P

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode553
sql/recovery.cc:553: 
This blank line here feels unnecessary (but then I could be misunderstanding the
scope of the TODO comment above.)

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode554
sql/recovery.cc:554: while (s.Step()) {
Should all these executions be wrapped in a transaction?

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode684
sql/recovery.cc:684: switch (extended_error & 0xFF) {
You do the same & 0xFF in connection.cc as well.  Consider making this bitmask a
constant somewhere.  Or at least comment it better (in both places), maybe
something as simple as
// Mask down to the primary result code.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode19
sql/recovery.h:19: // Recovery module for sql/.  The basic idea is to create a
fresh
As a new function you introduced does almost exactly what the example below
does, perhaps you want to update this text and example if that's what you expect
the common usage to be?

Also, if that's the case, you should put the function declaration higher in this
class.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode159
sql/recovery.h:159: // database can be razed or deleted.  The goal is for this
function to make
Uh, did you mean "will be", not "can be"?  And will the result be razed (i.e.,
deleted) or an empty database?
If you didn't mean that, then how will the caller know whether the recovery
failed and the database is okay to be razed?  This function has no return value.
If you're intentionally vague because you plan to change this from
leave-corrupt-database-around to something else, just give it the right name for
now.  Later when you make the intended change, you can rename the function.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode166
sql/recovery.h:166: //
Please provide an transition sentence here.  It appears everything else in the
comment explains how this function works, i.e., which error code you consider
recoverable and why.  Another sentence would make this comment more readable.

Also, consider whether these comments better belong in the implementation, not
the .h.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode169
sql/recovery.h:169: // is probably to delete the file and start over.
probably?
When this is not the best fix?
(consider omitting the word probably)

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode175
sql/recovery.h:175: // fix is probably to delete the file and start over.
This last sentence is confusing.  Did you mean that there are other times this
error code can be returned and those are probably not recoverable?  If so, then
this comment fails to explain what this function will return for this error
code.
Perhaps you want to drop the last sentence if appropriate?

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode181
sql/recovery.h:181: // which is no longer in the table.
Again, this explanation doesn't state what this function will return for this
error code.  In fact, it implies that the function will say to raze the
database, when that's the opposite of what the function does.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode188
sql/recovery.h:188: static bool ShouldRecoverOrRaze(int extended_error);
nit: given your explanation of what this does, I think it should be named
something like IsRecoverable or IsLikelyRecoverable or IsRecoverableGivenError
or ShouldTryRecovering or something.

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:81886: +    In autorecover database was
too broken to begin.
optional nit (has better emphasis in my opinion):
Database was too broken to attempt recovery.

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:81901: +    Autorecover failed
recovering a table.
For something like this, which is probably a more common type of recovery
failure, do you want to record the failed type of error code in another
histogram?  (or extra values within this histogram I suppose)

Mark P

Phew. There are all your comments. It's unclear whether I found any bugs the process... ...

4 years, 8 months ago (2016-04-08 20:12:07 UTC) #10

Scott Hess - ex-Googler

Rebase, scoped_ptr to unique_ptr in new code

4 years, 8 months ago (2016-04-14 23:55:34 UTC) #11

Scott Hess - ex-Googler

Address review comments from #9 and #10.

4 years, 8 months ago (2016-04-15 00:35:40 UTC) #12

Scott Hess - ex-Googler

On 2016/04/08 20:12:07, Mark P wrote: > Phew. > > There are all your comments. ...

4 years, 8 months ago (2016-04-15 00:38:16 UTC) #13

On 2016/04/08 20:12:07, Mark P wrote:
> Phew.
> 
> There are all your comments.
> 
> It's unclear whether I found any bugs the process...  My questions point out
> possible issues I see from my unexperienced viewpoint--whether they're real
> issues you'll hopefully be able to figure out.

THANK YOU FOR THE THOROUGH REVIEW!  It may backfire on you, as in the past
getting anyone to review (ie, help improve) this stuff has been like pulling
teeth.  Hopefully my answers aren't completely facile and evasive. 
Unfortunately, to some extent they are, because there are bits of this that I'm
not completely certain of how things work in the fleet, and other bits where I'm
trying to evade shaving yaks elsewhere.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:68: // contradicts itself, but
SQLite is able to read it.
On 2016/04/07 23:09:43, Mark P wrote:
> Do you want this explanation here rather than than by the
ShouldRecoverOrRaze()
> function?  Having it in two places seems redundant and likely to get out of
> sync.

Acknowledged.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:71: // itself.
On 2016/04/07 23:09:43, Mark P wrote:
> ditto about this TODO

Leftover from when the code checked directly against SQLITE_CORRUPT and other
errors.  That required adding a dep on third_party/sqlite which seemed
inappropriate.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:82: // return errors until the
handle is re-opened.
On 2016/04/07 23:09:43, Mark P wrote:
> This comment duplicates much of the comment by the function definition.  Is
any
> of it necessary?

OK.  Leaving the last line, because IMHO it's pertinent enough to the local
stack and likely unexpected.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:87: if
(!sql::Connection::ShouldIgnoreSqliteError(extended_error))
On 2016/04/07 23:09:43, Mark P wrote:
> Do you still want to run this code / possibly display the message if the above
> code decides to recover the database?
> 
> If yes, consider whether it would be better for developers to display this
> message first or the (possibly failed) recovery message first.  (I'd think the
> original error first would be better.)
> 
> If no, then you can struct this function differently.

I'll have to think about this.  Corruption generally stems from things like
unexpected power failures, so a developer is unlikely to see that.  This DLOG is
expected to fire for cases like SQLITE_ERROR, which would indicate that a
statement had a syntax error, or an inappropriate schema change happened,
something like that.  So is there an interesting case at the intersection where
a developer caused a corruption?  Hard to say.

If a developer was unlucky enough to have a legitimate corruption, I would
expect this to FATAL once, then work going forward.  Not a great outcome, but
most developers would just ignore it or log a bug against me.  That's find, then
I can discuss it with them and determine if there's a real problem or a red
herring.  Usually it's been a red herring, but it has caught real bugs.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:175: // addressed.  Someone
needs to handle invalid schema.
On 2016/04/07 23:09:43, Mark P wrote:
> This shouldn't be a comment.  File a bug against pkasting@ to either do
> something with the return value (and comment in the .h file the meaning of the
> return value) or change the return value to void.

OK.  Adding http://crbug.com/603304 about this.  I'm fine with adding some
instrumentation and working to resolve this, as I'm also looking for ways to
generalize the diagnostics for other databases.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath
db_path(temp_dir.path().AppendASCII("TestShortcuts1.db"));
On 2016/04/08 20:12:06, Mark P wrote:
> This probably should be "3", not "1".  (Otherwise why would other tests use
> unique numbers?  There must be a reason.)

Copy/pasta.  Since it's in a unique temp dir, I'll just strip the 1.  To the
best of my knowledge none of these need to be unique names.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:318: // The database
is broken.
On 2016/04/08 20:12:06, Mark P wrote:
> How about putting a blank line after 315, then adding
> // Break the database.
> then corrupt the header,
> then a blank line, then a comment such as
> // Verify the database is broken.

Done.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:325:
ASSERT_FALSE(connection.Execute("SELECT count(*) FROM omni_box_shortcuts"));
On 2016/04/08 20:12:06, Mark P wrote:
> Why does this return failure if we're ignoring the error?  Comment.

Production code in debug mode fails with a FATAL, because there should not be
cases where that is acceptable.  Execute() is WARN_UNUSED_RESULT to force
developers to pay attention to the result, because in actual production there
will be failures that are not due to how the code is written.  The ignorer tells
sql::Connection that we're testing the code to handle an error case, so don't do
the FATAL thing.

[To be clear, a lot of the failure code paths for sql returns are entirely
untested in Chromium.  A lot of the code is written to assume nothing can fail,
which is true in development but false when deployed.  I've tried to force the
issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for
failure paths, in part because it's challenging to inject failures.  So I agree,
much of this is dubious.]

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:327:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/08 20:12:06, Mark P wrote:
> Side comment: the comment by the declaration of CheckIgnoredErrors() is awful.

> It doesn't say what the return type means.  Does it mean we encountered an
error
> of the type that we ignored?  That's my best guess.  But it could mean that we
> encountered an error that didn't ignore, i.e., check status, taking into
account
> ignored errors...

I don't really follow.  It says it's used to check if the errors were
encountered, "True" seems like the most likely response to that question.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:330: //
sql::Connection::Open() will hit the corruption, the error callback will
On 2016/04/08 20:12:06, Mark P wrote:
> sql::Connection::Open() will hit
> ->
> The sql::Connection::Open() call in ShortcutsDatabase::Init() will hit

Done.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:340:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/08 20:12:06, Mark P wrote:
> I think it's odd that we still see this error.  Shouldn't it be cleared after
> the first failure and then not have any errors for the second attempt at
> opening?

The error will be thrown in the first Open(), the ignorer allows it to fail
gracefully, then the retry will succeed.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:348: sql::Statement
statement(connection.GetUniqueStatement(
On 2016/04/08 20:12:06, Mark P wrote:
> Please use an identical call to the one you used to verify the database was
> broken, don't switch from execute to getuniquestatement.

Done.

https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:351: EXPECT_EQ(5,
statement.ColumnInt(0));
On 2016/04/08 20:12:06, Mark P wrote:
> Consider using something like
> EXPECT_EQ(arraysize(shortcut_test_db), CountRecords())
> a la the other tests in this file.

CountRecords() runs against the sql::Connection contained in the
ShortcutsDatabase created by SetUp().  That database cannot easily be used at
this point.  I could do this:
  db_ = new ShortcutsDatabase(db_path);
but note that this isn't necessarily describing the same database as SetUp().

I could shift the CountRecords() code out to a helper running against a
sql::Connection, then have CountRecords() call that.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87
sql/recovery.cc:87: // Automatically recovered entire database successfully.
On 2016/04/07 23:09:43, Mark P wrote:
> This comment sounds like it could equally apply to the earlier enum
> RECOVERY_SUCCESS_AUTORECOVER.  Please figure out a way to make their
differences
> clearer.

The earlier applies to a specific table, this applies to the entire database. 
So would you prefer to rename the earlier ones to reflect their specific scope?

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode96
sql/recovery.cc:96: // Distinguish failure in querying schema from failure in
creating schema.
On 2016/04/08 20:12:06, Mark P wrote:
> Please clearly state these are subcategories of
> RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode540
sql/recovery.cc:540: // |prefix|, and apply it to [main].  Skip
'sqlite_sequence', which will be
On 2016/04/07 23:09:43, Mark P wrote:
> This comment is too broad.  Below you explicitly assume the prefix is the
right
> amount of text so if you add a "main." right after it, it makes sense as a sql
> statement.
> SELECT foo from BAR;
> ->
> SELmain.ECT foo from BAR;
> Ugh, you don't mean this.

I agree that the example case isn't what I want, but I'm not sure what to do
about it.  The prefix has to be some sort of valid SQL prefix thingie, but I'm
not sure what to call it, and I'm not going to be writing a parser to validate
it.  I guess I can assert that it ends in a space, but even that doesn't mean
much, since I'm using string manipulation to dynamically construct statements.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode541
sql/recovery.cc:541: // created on demand by SQLite if any tables use
AUTOINCREMENT.
On 2016/04/07 23:09:43, Mark P wrote:
> This comment should mention the meaning of the return value.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode546
sql/recovery.cc:546: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'"));
On 2016/04/07 23:09:43, Mark P wrote:
> What happens if prefix_length > length(sql)?  What does substr return in this
> case?
> The sqlite3 documentation I found doesn't define this.
> https://www.sqlite.org/lang_corefunc.html
> Perhaps you want to add another another condition ensuring that sql is long
> enough?
> (Though I can't imagine this causing any problems unless substr() actually
fails
> on this input, as the equality test still applies)

I think I'll just rewrite this as a full table scan and do the filtering in C++.
 It won't change the efficiency at all, and I expect it will be approximately
the same weight in code.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode552
sql/recovery.cc:552: // to be worth the complexity, though.
On 2016/04/07 23:09:43, Mark P wrote:
> Can you expand this slightly
> to be worth complexity to skip statements with IF NOT EXISTS?
> to be worth complexity to identify IF NOT EXISTS statement and then ...?
> 
> Also, if you don't plan to do it, please don't label it as a TODO. :-P

I think I'll remove the comment entirely.  A common corruption is having a
transaction broken when a page is being rebalanced, which can lead to seeing the
same data in multiple places.  This could only happen to the sqlite_master btree
if a transaction was broken while making a schema change, which is already
unlikely, but it could only be successfully coded against if the result was a
valid schema, which is probably even more unlikely.

OK, and it just occurred to me that I could add a DISTINCT to the selection
statement and ignore the problem entirely.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode553
sql/recovery.cc:553: 
On 2016/04/07 23:09:43, Mark P wrote:
> This blank line here feels unnecessary (but then I could be misunderstanding
the
> scope of the TODO comment above.)

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode554
sql/recovery.cc:554: while (s.Step()) {
On 2016/04/07 23:09:43, Mark P wrote:
> Should all these executions be wrapped in a transaction?

Recovery happens to a temporary database, which is only committed to the main
database in Recovered().  So transactions would only be necessary insofar as the
(local) caller needs to do controlled rollback of one approach in order to try
another approach.  I don't think that can really come up, here, mostly because I
can't think of a second simple approach (either the schema will work for this
kind of recovery, or it won't, and the client shouldn't be calling this code if
the schema is too complex or subtle).

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode592
sql/recovery.cc:592: // in the channel (175k/month).  Lastly, there were 3
failures in a month
On 2016/04/08 20:12:06, Mark P wrote:
> Please don't put non-normalized numbers in comments that will become out of
date
> over time.
> e.g., say something like X failures per million database opens in M-49.
> 
> Also, if you're referencing an issues that fixed, please mention when, e.g.,
> fixed for M-50.

Maybe I'll keep the comment in my local client, since it's mostly just to inform
my thinking about whether this is safe to flip to Raze() or Delete() or
something.  The specific numbers aren't interesting, I was just thinking "BIG",
"Less big", and "trivial or maybe noise".

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode594
sql/recovery.cc:594: // practice 3 in 1M might be a reasonable false-positive
rate.
On 2016/04/08 20:12:06, Mark P wrote:
> I think this last sentence isn't worth mentioning anyway.  Things that occur
> three times in a month aren't worth thinking about, let alone documenting in a
> place that will force thinking about it to cross other people's mind.

Acknowledged.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode596
sql/recovery.cc:596: db->Poison();
On 2016/04/08 20:12:06, Mark P wrote:
> In the header comment, you write the function will poison the database so it
> "will return errors until the handle is re-opened."  This implies the database
> is closed.  Yet, you don't close it here.  The comment by the definition for
> Poison says
>   // This is intended as an alternative to Close() in error callbacks.
>   // Close() should still be called at some point.
> so shouldn't you close it...?

The originating caller's Open()/Close() calls should balance out.  This call can
happen deep in a stack of calls in a case handling an exceptional error, making
it impractical for the client to track the Close() status.  So Poison() poisons
the sql::Connection handle, but logically the handle is still open.

For instance, warnings are thrown if you try to run Execute() on a handle which
was never opened, or which has been closed.  But if the handle was poisoned,
then the Execute() will fail without warnings.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode600
sql/recovery.cc:600: #if DCHECK_IS_ON()
On 2016/04/08 20:12:06, Mark P wrote:
> I don't understand this block of code.  That said, I am willing to approve the
> changelist without any changes to this block; I understand that it's okay that
I
> don't have the background for it.

fts3 is a virtual table which we used to use in history, and remains compiled
into Chrome for WebSQL.  I don't think anyone in the browser process is using
it.  AutorecoverTable() fails on fts3 tables by successfully dropping all of the
data, so I want to prevent anyone from calling this code on an fts3 virtual
table.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode625
sql/recovery.cc:625: Recovery::Rollback(std::move(recovery));
On 2016/04/08 20:12:06, Mark P wrote:
> Header talks about poisoning; why are you using a rollback here?  Should you
> poison after this rollback?
> ditto multiple places below

See comment I made about the RecoverDatabaseOrRaze() header-file comment.  I
intend all of these to because Unrecoverable() (leading to clearing the database
data).

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode635
sql/recovery.cc:635: "AND name!='sqlite_sequence'"));
On 2016/04/08 20:12:06, Mark P wrote:
> AND != corrupt.sqlite_sequence ?

sqlite_master is implicitly main.sqlite_master .  SQLite doesn't
interpret/interpolate the columns, so selecting from corrupt.sqlite_master would
still have 'sqlite_sequence' as the name and the sql would not have db.table
naming.

[I think?  OMG, my precious brain!  Yes, AFAICT from testing it.]

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode666
sql/recovery.cc:666: //   sql LIKE "CREATE VIRTUAL TABLE %"
On 2016/04/08 20:12:06, Mark P wrote:
> Do you have a sense of whether this will be different in some cases?  Or are
you
> nervous, which is why you left it like vacuum.c?

I do not fully understand why vacuum.c uses the rootpage like this.  SQLite
creates tables with a root page even if empty, and I believe it could also
validly use a null root page.  Most likely the assumption is that you created a
table, so you'll populate it, though I've certainly seen that assumption broken
w/in Chromium.  Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it can't
accidentally match 'CREATE TABLE' or any other sort of faux table.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode684
sql/recovery.cc:684: switch (extended_error & 0xFF) {
On 2016/04/07 23:09:43, Mark P wrote:
> You do the same & 0xFF in connection.cc as well.  Consider making this bitmask
a
> constant somewhere.  Or at least comment it better (in both places), maybe
> something as simple as
> // Mask down to the primary result code.

Copied language from connection.cc.  Unfortunately SQLite doesn't provide a
really distinct name for the basic error codes, so I'm phrasing in terms of
removing extended error codes.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode19
sql/recovery.h:19: // Recovery module for sql/.  The basic idea is to create a
fresh
On 2016/04/07 23:09:44, Mark P wrote:
> As a new function you introduced does almost exactly what the example below
> does, perhaps you want to update this text and example if that's what you
expect
> the common usage to be?
> 
> Also, if that's the case, you should put the function declaration higher in
this
> class.

I rephrased it to put RecoverDatabaseOrRaze() first, then the skeleton code for
cases that doesn't cover.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode159
sql/recovery.h:159: // database can be razed or deleted.  The goal is for this
function to make
On 2016/04/07 23:09:44, Mark P wrote:
> Uh, did you mean "will be", not "can be"?  And will the result be razed (i.e.,
> deleted) or an empty database?
> If you didn't mean that, then how will the caller know whether the recovery
> failed and the database is okay to be razed?  This function has no return
value.
> If you're intentionally vague because you plan to change this from
> leave-corrupt-database-around to something else, just give it the right name
for
> now.  Later when you make the intended change, you can rename the function.

I'm not sure how to answer!

Right now, it only applies the safe cases.  It either thinks it recovered the
database, or it rolls things back.  But I've seen results from the fleet which
make me believe that running recovery to success didn't actually work (possibly
because of shenanigans in the system, like malware or profile's being
overwritten or something).  The intention is to monitor the histograms to see if
any of the failure outcomes is due to a broken assumption rather than broken
user data.  If the failures cases look weirdly high, I'll instrument to debug
the fleet, but if they look sensible, I'll switch them over from Rollback() to
Unrecoverable() (leading to Raze()).

sql::Connection::Raze() means that it overwrites the database with a single-page
empty database.  The delete language is because if the Raze() doesn't work, then
what?  Probably attempt to Delete().  I'm nervous about landing either of those
up front, though, on the off chance that I write a bug which works right in my
tests but in the field blows away a bunch of data.

Anyhow, my intention with this function is that it consumes the database handle
in all cases, while making a best effort to recover the data, and failing that a
best effort to clear the database of data, and failing that a best effort to
delete the database.  On next run the database should either be a valid
data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should be
a valid empty SQLite database, or it should be missing (SQLite would create the
database empty).

Raze() versus Delete() is because Raze() works using SQLite locking and I/O.  It
can clear the database in cases where Delete() won't work.  Likewise, Delete()
may work in cases where Raze() does not.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode166
sql/recovery.h:166: //
On 2016/04/07 23:09:44, Mark P wrote:
> Please provide an transition sentence here.  It appears everything else in the
> comment explains how this function works, i.e., which error code you consider
> recoverable and why.  Another sentence would make this comment more readable.
> 
> Also, consider whether these comments better belong in the implementation, not
> the .h.

Moved cases into the implementation file.

I'm not sure how to make this fully hermetic WRT SQLite.  Some SQLite errors
stem from obviously broken database files, some stem from obviously broken
client code, and some are ambiguous.  My goal is to get rid of the ambiguity
over time, but I'd rather approach it piecemeal, fixing the known-possible cases
first.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode169
sql/recovery.h:169: // is probably to delete the file and start over.
On 2016/04/07 23:09:44, Mark P wrote:
> probably?
> When this is not the best fix?
> (consider omitting the word probably)

I guess this is expressing my certainty level.  SQLite is very layered, so
sometimes the error you get is unexpected - when you get that error and decode
why, it makes sense how it happened, but reading the code it's challenging to
verify which specific errors are possible at which specific API calls. 
SQLITE_CANTOPEN is returned for the example cases given, but I don't know how
far the unknown unknowns run.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode175
sql/recovery.h:175: // fix is probably to delete the file and start over.
On 2016/04/07 23:09:44, Mark P wrote:
> This last sentence is confusing.  Did you mean that there are other times this
> error code can be returned and those are probably not recoverable?  If so,
then
> this comment fails to explain what this function will return for this error
> code.
> Perhaps you want to drop the last sentence if appropriate?

When the recovery code was first written, there was a case where SQLITE_NOTADB
was recoverable.  Now that case returns SQLITE_CORRUPT (which is a more sensible
return value).  So it made sense to run the recovery code, then delete it if
that failed.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode181
sql/recovery.h:181: // which is no longer in the table.
On 2016/04/07 23:09:44, Mark P wrote:
> Again, this explanation doesn't state what this function will return for this
> error code.  In fact, it implies that the function will say to raze the
> database, when that's the opposite of what the function does.

Revised in implementation.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode188
sql/recovery.h:188: static bool ShouldRecoverOrRaze(int extended_error);
On 2016/04/07 23:09:44, Mark P wrote:
> nit: given your explanation of what this does, I think it should be named
> something like IsRecoverable or IsLikelyRecoverable or IsRecoverableGivenError
> or ShouldTryRecovering or something.

Hmm.  None of the first three work with me - there's no way to guarantee that it
is recoverable short of trying, and there isn't really any way to productively
quantify the level of success recovery had.  ShouldTry makes it sound like you
should try recovery, and if that doesn't work maybe something else, but I'm not
sure there is anything else to try.

The reason for the separation is to let the calling code be structured like:

if (ShouldTry(error)) {
  // Cleanup local state before the destructive recover call.
  Recover(db, db_path);
}

I could instead pass the error into the operative function, like:

if (AutomatedErrorHandler(db, error, db_path)) {
  // Cleanup local state because the database is now poisoned.
}

I think phrasing that name would also be challenging, and I don't like having
the return value signal whether serious side effects have happened or not.

PROBABLY in all cases getting into the error handler should lead to a poisoned
handle.  In the best cases, it was a benign syntax error which inadvertently got
shipped to production and doesn't harm anyone, but in the worst cases you're
messing up your data and then messing up the mess.  So it should be like an OOM
CHECK(), where once you notice something awry, you minimize the damage. 
Unfortunately, there's code in Chromium which implicitly depends on working
degraded because things were too relaxed when that code landed (which is why I'm
even here, my long-term goal is to have no errors in the static sense, only in
the dynamic sense, and then have things to automatically handle the dynamic
errors).

["static sense" meaning incorrectly-coded clients, "dynamic sense" meaning
problems due to the operating environment.]

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.cc
File sql/recovery_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:751: // This table needs sqlite_sequence to work.
On 2016/04/08 20:12:06, Mark P wrote:
> ... because of the autoincrement.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:753: "CREATE TABLE a (id INTEGER PRIMARY KEY
AUTOINCREMENT, v TEXT)"));
On 2016/04/08 20:12:06, Mark P wrote:
> Everywhere else in this file uses tables x and y.  Any reason not to use those
> names?

Eh, it was just [a]utoincrement and [i]ndex and [v]alue.  Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:781: 
On 2016/04/08 20:12:07, Mark P wrote:
> I think this blank line is unnecessary.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:784: 
On 2016/04/08 20:12:06, Mark P wrote:
> ditto

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:786: EXPECT_TRUE(db().IsSQLValid(kTrivialSql));
On 2016/04/08 20:12:06, Mark P wrote:
> Comment such as
> // Verify the database handle is valid.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:786: EXPECT_TRUE(db().IsSQLValid(kTrivialSql));
On 2016/04/08 20:12:06, Mark P wrote:
> Comment such as
> // Verify the database handle is valid.

Done.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:795: ASSERT_TRUE(Reopen());
On 2016/04/08 20:12:07, Mark P wrote:
> This block of tests makes me nervous because you're comparing the new data
with
> the old data, yet you have no check that the old data is plausible.  Suppose
> there was an uncaught error earlier that prevented any data from being
inserted
> into the database.  This test would still pass.  I suggest either testing the
> results of the original data calls or removing that original data test and
> simply compare the new results with the actual hard-coded values you'd expect.

Done.

In testing the schema, it seemed like a fair bit of annoying text to format, and
a little brittle (the actual format of the schema seems like a SQLite internal
kind of thing, to me).  So I tested that the expected number of schema items was
present.  That's still a bit specific, since sqlite_sequence seems like an
internal thing clients have no reason to see, but explicitly omitting it is also
unnatural.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:801: // Test that the view works.
On 2016/04/08 20:12:07, Mark P wrote:
> I don't understand this block.  I don't see how this checks that the view
works.
>  There is no expect or assert statements in here, and ExecuteWithResults()
> doesn't have any expected or assert inside it that would trigger the view was
> broken.
> 
> If nothing else, this block is more correctly combined with the trigger block
> below, with no comment making an explicit claim about checking the view.

It's possible that the block is dumb.  It's gone now.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:812: ExecuteWithResults(&db(), kAWithoutSql, "|",
"\n"));
On 2016/04/08 20:12:07, Mark P wrote:
> Shouldn't this be kASql, i.e., you're checking that after the above deletion,
a
> _normal_ select statement will return the same result as a select statement
that
> excludes 'truck' explicitly?
> ditto with kISql below

Sorry, the above capture from the view _is_ the same result excluding 'truck'. 
The trigger has the side effect of changing the view's results.

I'll just go through and make the expectations static.

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:81886: +    In autorecover database was
too broken to begin.
On 2016/04/07 23:09:44, Mark P wrote:
> optional nit (has better emphasis in my opinion):
> Database was too broken to attempt recovery.

Done.

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:81892: +    Autorecover could not create
new db schema.
On 2016/04/08 20:12:07, Mark P wrote:
> Consider mentioning this is a subset of RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.
> ditto below.
> That said, I see you have weird subset relationships elsewhere in this enum. 
I
> wonder if mentioning this in some places but not others (unless you want to
make
> this list complete) somewhat confusing.
> Maybe you should simply add something to the histogram description.

It's intended to be more like code coverage, where I'm tracking which decision
points were taken.  Being able to add things up is only weakly useful, mostly
everything other than SUCCESS should normalize against SUCCESS.  The only reason
I'd care if one failure case was greater than another is in prioritizing which
to look at first.

https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:81901: +    Autorecover failed
recovering a table.
On 2016/04/07 23:09:44, Mark P wrote:
> For something like this, which is probably a more common type of recovery
> failure, do you want to record the failed type of error code in another
> histogram?  (or extra values within this histogram I suppose)

My approach was since I already have this histogram to track flow of control,
I'll hit all of those points since it's "free".  Then if any failure cases seem
bothersome, I'll circle back and add something to debug those.

Mostly I'm just being lazy, and somewhat trying to avoid analysis paralysis. 
Most of the failure points in AutorecoverTable() shouldn't be possible, because
it's operating against a temporary database, and, indeed, the histograms show
low occurance in there.  I guess SQLITE_IOERR or SQLITE_FULL could happen, and
in either case it's kind of a give-up situation.  RECOVERY_FAILED_ATTACH fails
enough that logging that error code may make sense before flipping that case to
raze.

Mark P

Here's another full pass through the code. Apologies that I didn't bother replying to your ...

4 years, 8 months ago (2016-04-19 23:34:29 UTC) #14

Here's another full pass through the code.

Apologies that I didn't bother replying to your comments that require no further
changes.  (Thank you for replying or explaining those things; I just don't want
to clutter up this review with too many unimportant comments.)

--mark

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87
sql/recovery.cc:87: // Automatically recovered entire database successfully.
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/07 23:09:43, Mark P wrote:
> > This comment sounds like it could equally apply to the earlier enum
> > RECOVERY_SUCCESS_AUTORECOVER.  Please figure out a way to make their
> differences
> > clearer.
> 
> The earlier applies to a specific table, this applies to the entire database. 
> So would you prefer to rename the earlier ones to reflect their specific
scope?

No, I thought you were being sloppy about database versus table here (as the
autorecovery code is currently only used for a single table).  I was wrong. 
It's fine as is.

https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode635
sql/recovery.cc:635: "AND name!='sqlite_sequence'"));
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > AND != corrupt.sqlite_sequence ?
> 
> sqlite_master is implicitly main.sqlite_master .  SQLite doesn't
> interpret/interpolate the columns, so selecting from corrupt.sqlite_master
would
> still have 'sqlite_sequence' as the name and the sql would not have db.table
> naming.
> 
> [I think?  OMG, my precious brain!  Yes, AFAICT from testing it.]

I'm sorry for making your brain explode. ;-)

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:74: if
(!sql::Connection::ShouldIgnoreSqliteError(extended_error))
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/07 23:09:43, Mark P wrote:
> > Do you still want to run this code / possibly display the message if the
above
> > code decides to recover the database?
> > 
> > If yes, consider whether it would be better for developers to display this
> > message first or the (possibly failed) recovery message first.  (I'd think
the
> > original error first would be better.)
> > 
> > If no, then you can struct this function differently.
> 
> I'll have to think about this.  Corruption generally stems from things like
> unexpected power failures, so a developer is unlikely to see that.  This DLOG
is
> expected to fire for cases like SQLITE_ERROR, which would indicate that a
> statement had a syntax error, or an inappropriate schema change happened,
> something like that.  So is there an interesting case at the intersection
where
> a developer caused a corruption?  Hard to say.
> 
> If a developer was unlucky enough to have a legitimate corruption, I would
> expect this to FATAL once, then work going forward.  Not a great outcome, but
> most developers would just ignore it or log a bug against me.  That's find,
then
> I can discuss it with them and determine if there's a real problem or a red
> herring.  Usually it's been a red herring, but it has caught real bugs.

You misunderstood my comment.

I meant in the current revision of the code, you call try to open the database. 
If it fails, you then call RecoverDatabase(), which may display an error.  Then
at the end, you display the original error with opening the database.  This last
error I'd think can be easily interpreted as the final error, i.e., the error
with recovering the database (if any); if recovery succeeded, display this error
here may make it harder to recognize that fact.

I'd think you should display the original error, then attempt recovery and
display the result of the recovery attempt next, i.e., display errors in the
order they are encountered.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath
db_path(temp_dir.path().AppendASCII("TestShortcuts.db"));
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > This probably should be "3", not "1".  (Otherwise why would other tests use
> > unique numbers?  There must be a reason.)
> 
> Copy/pasta.  Since it's in a unique temp dir, I'll just strip the 1.  To the
> best of my knowledge none of these need to be unique names.

If it doesn't matter that they're different, please revise the other tests in
this file to use the same name.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:327: sql::Statement
statement(connection.GetUniqueStatement(
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > Why does this return failure if we're ignoring the error?  Comment.
> 
> Production code in debug mode fails with a FATAL, because there should not be
> cases where that is acceptable.  Execute() is WARN_UNUSED_RESULT to force
> developers to pay attention to the result, because in actual production there
> will be failures that are not due to how the code is written.  The ignorer
tells
> sql::Connection that we're testing the code to handle an error case, so don't
do
> the FATAL thing.
> 
> [To be clear, a lot of the failure code paths for sql returns are entirely
> untested in Chromium.  A lot of the code is written to assume nothing can
fail,
> which is true in development but false when deployed.  I've tried to force the
> issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for
> failure paths, in part because it's challenging to inject failures.  So I
agree,
> much of this is dubious.]

Thank you for the explanation.  That said, my original request still applies. 
Please add a comment here, in particular about why the statement.is_valid() test
should be false.  To a naive eye, I'd think it should be true.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:331:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > Side comment: the comment by the declaration of CheckIgnoredErrors() is
awful.
> 
> > It doesn't say what the return type means.  Does it mean we encountered an
> error
> > of the type that we ignored?  That's my best guess.  But it could mean that
we
> > encountered an error that didn't ignore, i.e., check status, taking into
> account
> > ignored errors...
> 
> I don't really follow.  It says it's used to check if the errors were
> encountered, "True" seems like the most likely response to that question.

I don't agree.  I see at least two ways to understand CheckIgnoredErrors().  One
could simply be "return whether we have an error that was ignored".  Another
could be "return whether everything is okay, i.e., check that it's okay, taking
in account ignored errors".

Furthermore, given that this function declaration is in the context of a test
// Allow containing test to check if the errors were encountered.
// ...
it's very easy to read the "check" as one of the debug macros like ASSERT or
EXPECT, not simply find the value and return.

  // Allow containing test to check if the errors were encountered.
  // Failure to call results in ADD_FAILURE() in destructor.
  bool CheckIgnoredErrors();

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:334: // The
sql::Connection::Open() in ShortcutsDatabase::Init() will hit the
nit: in -> call in

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:344:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > I think it's odd that we still see this error.  Shouldn't it be cleared
after
> > the first failure and then not have any errors for the second attempt at
> > opening?
> 
> The error will be thrown in the first Open(), the ignorer allows it to fail
> gracefully, then the retry will succeed.

Given this, I don't like this test's structure.  I want a test that verifies the
database opens correctly when not in the context of an IgnoreError regime. 
Either fix this so it clears the error in the middle as I originally expected or
add another block below to verify that.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:354: "SELECT COUNT(*)
FROM omni_box_shortcuts"));
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > Please use an identical call to the one you used to verify the database was
> > broken, don't switch from execute to getuniquestatement.
> 
> Done.

Almost done.  I'd like exactly the same tests, i.e., also add a 
ASSERT_TRUE(statement.is_valid());

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:356: EXPECT_EQ(5,
statement.ColumnInt(0));
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > Consider using something like
> > EXPECT_EQ(arraysize(shortcut_test_db), CountRecords())
> > a la the other tests in this file.
> 
> CountRecords() runs against the sql::Connection contained in the
> ShortcutsDatabase created by SetUp().  That database cannot easily be used at
> this point.  I could do this:
>   db_ = new ShortcutsDatabase(db_path);
> but note that this isn't necessarily describing the same database as SetUp().
> 
> I could shift the CountRecords() code out to a helper running against a
> sql::Connection, then have CountRecords() call that.

I don't have a strong preference for how you fix it, but fundamentally I don't
want a magic number "5" here because it's not clear where it comes from and
which / whether other changes in this file should cause this number to change
and which shouldn't.  Please solve this code.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode541
sql/recovery.cc:541: // which is created on demand by SQLite if any tables use
AUTOINCREMENT.
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/07 23:09:43, Mark P wrote:
> > This comment is too broad.  Below you explicitly assume the prefix is the
> right
> > amount of text so if you add a "main." right after it, it makes sense as a
sql
> > statement.
> > SELECT foo from BAR;
> > ->
> > SELmain.ECT foo from BAR;
> > Ugh, you don't mean this.
> 
> I agree that the example case isn't what I want, but I'm not sure what to do
> about it.  The prefix has to be some sort of valid SQL prefix thingie, but I'm
> not sure what to call it, and I'm not going to be writing a parser to validate
> it.  I guess I can assert that it ends in a space, but even that doesn't mean
> much, since I'm using string manipulation to dynamically construct statements.

You still should allude to this limitation here, if only to say something like
|prefix| must be the beginning of valid SQL string so that we can insert "main."
after it in SQL commands that start with prefix and have the result be another
valid SQL command.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode551
sql/recovery.cc:551: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'"));
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/07 23:09:43, Mark P wrote:
> > What happens if prefix_length > length(sql)?  What does substr return in
this
> > case?
> > The sqlite3 documentation I found doesn't define this.
> > https://www.sqlite.org/lang_corefunc.html
> > Perhaps you want to add another another condition ensuring that sql is long
> > enough?
> > (Though I can't imagine this causing any problems unless substr() actually
> fails
> > on this input, as the equality test still applies)
> 
> I think I'll just rewrite this as a full table scan and do the filtering in
C++.
>  It won't change the efficiency at all, and I expect it will be approximately
> the same weight in code.

You didn't do this rewrite you promised.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode555
sql/recovery.cc:555: while (s.Step()) {
On 2016/04/15 00:38:14, Scott Hess wrote:
> On 2016/04/07 23:09:43, Mark P wrote:
> > Should all these executions be wrapped in a transaction?
> 
> Recovery happens to a temporary database, which is only committed to the main
> database in Recovered().  So transactions would only be necessary insofar as
the
> (local) caller needs to do controlled rollback of one approach in order to try
> another approach.  I don't think that can really come up, here, mostly because
I
> can't think of a second simple approach (either the schema will work for this
> kind of recovery, or it won't, and the client shouldn't be calling this code
if
> the schema is too complex or subtle).

You're not concerned with having a failed attempt leave additional commands in
corrupt.sqlite_master?  Or does that not happen?

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588
sql/recovery.cc:588: db->Poison();
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > In the header comment, you write the function will poison the database so it
> > "will return errors until the handle is re-opened."  This implies the
database
> > is closed.  Yet, you don't close it here.  The comment by the definition for
> > Poison says
> >   // This is intended as an alternative to Close() in error callbacks.
> >   // Close() should still be called at some point.
> > so shouldn't you close it...?
> 
> The originating caller's Open()/Close() calls should balance out.  This call
can
> happen deep in a stack of calls in a case handling an exceptional error,
making
> it impractical for the client to track the Close() status.  So Poison()
poisons
> the sql::Connection handle, but logically the handle is still open.
In this case, I think it'd be better in the header to say, after you say that it
poisons the database "(though the handle technically remains open)" before
saying "will return errors until the handle is re-opened."
> 
> For instance, warnings are thrown if you try to run Execute() on a handle
which
> was never opened, or which has been closed.  But if the handle was poisoned,
> then the Execute() will fail without warnings.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode661
sql/recovery.cc:661: //   sql LIKE "CREATE VIRTUAL TABLE %"
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/08 20:12:06, Mark P wrote:
> > Do you have a sense of whether this will be different in some cases?  Or are
> you
> > nervous, which is why you left it like vacuum.c?
> 
> I do not fully understand why vacuum.c uses the rootpage like this.  SQLite
> creates tables with a root page even if empty, and I believe it could also
> validly use a null root page.  Most likely the assumption is that you created
a
> table, so you'll populate it, though I've certainly seen that assumption
broken
> w/in Chromium.  Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it
can't
> accidentally match 'CREATE TABLE' or any other sort of faux table.

If it seems safer to you, why don't you do it (and comment on why you deviate
from vacuum.c in this case)?

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode684
sql/recovery.cc:684: // instance a symlink to a non-existent path, or the file
is a directory).
How is this recoverable?  This comment implies you should be returning false.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689
sql/recovery.cc:689: // SQLite return this where other versions return
SQLITE_CORRUPT.
How is this likely to be recoverable?

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode89
sql/recovery.h:89: static std::unique_ptr<Recovery> Begin(Connection*
connection,
I'm surprised that you're changing the type of this function and the ones below
it and don't need to add additional files to this changelist.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode172
sql/recovery.h:172: static void RecoverDatabaseOrRaze(Connection* db,
On 2016/04/15 00:38:15, Scott Hess wrote:
> On 2016/04/07 23:09:44, Mark P wrote:
> > Uh, did you mean "will be", not "can be"?  And will the result be razed
(i.e.,
> > deleted) or an empty database?
> > If you didn't mean that, then how will the caller know whether the recovery
> > failed and the database is okay to be razed?  This function has no return
> value.
> > If you're intentionally vague because you plan to change this from
> > leave-corrupt-database-around to something else, just give it the right name
> for
> > now.  Later when you make the intended change, you can rename the function.
> 
> I'm not sure how to answer!
> 
> Right now, it only applies the safe cases.  It either thinks it recovered the
> database, or it rolls things back.  But I've seen results from the fleet which
> make me believe that running recovery to success didn't actually work
(possibly
> because of shenanigans in the system, like malware or profile's being
> overwritten or something).  The intention is to monitor the histograms to see
if
> any of the failure outcomes is due to a broken assumption rather than broken
> user data.  If the failures cases look weirdly high, I'll instrument to debug
> the fleet, but if they look sensible, I'll switch them over from Rollback() to
> Unrecoverable() (leading to Raze()).
> 
> sql::Connection::Raze() means that it overwrites the database with a
single-page
> empty database.  The delete language is because if the Raze() doesn't work,
then
> what?  Probably attempt to Delete().  I'm nervous about landing either of
those
> up front, though, on the off chance that I write a bug which works right in my
> tests but in the field blows away a bunch of data.
> 
> Anyhow, my intention with this function is that it consumes the database
handle
> in all cases, while making a best effort to recover the data, and failing that
a
> best effort to clear the database of data, and failing that a best effort to
> delete the database.  On next run the database should either be a valid
> data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should
be
> a valid empty SQLite database, or it should be missing (SQLite would create
the
> database empty).
> 
> Raze() versus Delete() is because Raze() works using SQLite locking and I/O. 
It
> can clear the database in cases where Delete() won't work.  Likewise, Delete()
> may work in cases where Raze() does not.

Thanks for the long response.  Fundamentally, my complaint is that the comment
above and indeed the function name doesn't correspond with what the function
does.  There is no razing going on here.

Why not just call it RecoverDatabase()?  And remove comments that are entirely
wrong such as "If the database is entirely unreadable, the new database will be
empty.", and, for that matter, the whole second paragraph (each sentence in it
is currently wrong).

You can add your raze or delete language back into the function name and
comments once you actually switch the code from doing Rollback to Unrecoverable.
 And at that time add comments about the goal of the function (leaving a working
database, even if empty) at that time.  That's not the goal at this time.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode176
sql/recovery.h:176: // or which cannot plausibly be recovered (and should be
razed).  This does
nit: or which -> versus
You don't mean "or", at least in the boolean sense.
Consider making a similar change to the function name.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.cc
File sql/recovery_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:771: // As a side effect AUTOINCREMENT creates the
sqlite_sequence table for
optional nit: comma after effect

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (left):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83764: <int value="19"
label="RECOVERY_FAILED_META_NO_VERSION">
Please add a sentence to the end of the Sqlite.RecoveryEvents description,
something to effect that those of these enum entries are subsets / included in
the counts of other enum entries.

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
recover.
"to recover" or "to attempt recovery"?
It looks like this error does not include the other detailed failures about
problems that can happen along the recovery process

Scott Hess - ex-Googler

Sorry about the long delay - I had kind of a poor April experience. For ...

4 years, 7 months ago (2016-05-13 21:24:37 UTC) #15

Sorry about the long delay - I had kind of a poor April experience.

For the ScopedErrorIgnorer comments, I still don't really understand how a
simple function-name change helps anything.  I could see changing the entire
thing from "ScopedErrorIgnorer" to "ScopedErrorExpecter", used basically the
same:

  ScopedErrorExpecter expecter;
  expecter.ExpectError(SQLITE_CORRUPT);
  // Code under test.
  ASSERT_TRUE(expecter.SawExpectedErrors());

WDYT?  The "ignore" naming is centered around "How it changes operation of
sql::Connection", whereas "expect" naming would be centered around "How you use
it."  That implies a filename change, and various changes outside the focus of
this CL, so I'd rather move it to an on-topic CL this CL can depend on.

Thanks,
scott

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database.cc:74: if
(!sql::Connection::ShouldIgnoreSqliteError(extended_error))
On 2016/04/19 23:34:27, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/07 23:09:43, Mark P wrote:
> > > Do you still want to run this code / possibly display the message if the
> above
> > > code decides to recover the database?
> > > 
> > > If yes, consider whether it would be better for developers to display this
> > > message first or the (possibly failed) recovery message first.  (I'd think
> the
> > > original error first would be better.)
> > > 
> > > If no, then you can struct this function differently.
> > 
> > I'll have to think about this.  Corruption generally stems from things like
> > unexpected power failures, so a developer is unlikely to see that.  This
DLOG
> is
> > expected to fire for cases like SQLITE_ERROR, which would indicate that a
> > statement had a syntax error, or an inappropriate schema change happened,
> > something like that.  So is there an interesting case at the intersection
> where
> > a developer caused a corruption?  Hard to say.
> > 
> > If a developer was unlucky enough to have a legitimate corruption, I would
> > expect this to FATAL once, then work going forward.  Not a great outcome,
but
> > most developers would just ignore it or log a bug against me.  That's find,
> then
> > I can discuss it with them and determine if there's a real problem or a red
> > herring.  Usually it's been a red herring, but it has caught real bugs.
> 
> You misunderstood my comment.
> 
> I meant in the current revision of the code, you call try to open the
database. 
> If it fails, you then call RecoverDatabase(), which may display an error. 
Then
> at the end, you display the original error with opening the database.  This
last
> error I'd think can be easily interpreted as the final error, i.e., the error
> with recovering the database (if any); if recovery succeeded, display this
error
> here may make it harder to recognize that fact.
> 
> I'd think you should display the original error, then attempt recovery and
> display the result of the recovery attempt next, i.e., display errors in the
> order they are encountered.

I'll just have the recovery path return.  Connection::OnSqliteError() already
logs at ERROR level, if anyone is paying attention.  I think the errors which
lead to recovery attempts should be distinct from the kinds of development
errors the DLOG(FATAL) is intended to expose.  To the extent that someone
manages to write code that causes corruption, crashing here wouldn't be helpful
(as compared to a statement error, where the crash happens compiling the
statement).

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
File components/omnibox/browser/shortcuts_database_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath
db_path(temp_dir.path().AppendASCII("TestShortcuts.db"));
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > This probably should be "3", not "1".  (Otherwise why would other tests
use
> > > unique numbers?  There must be a reason.)
> > 
> > Copy/pasta.  Since it's in a unique temp dir, I'll just strip the 1.  To the
> > best of my knowledge none of these need to be unique names.
> 
> If it doesn't matter that they're different, please revise the other tests in
> this file to use the same name.

Done.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:327: sql::Statement
statement(connection.GetUniqueStatement(
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > Why does this return failure if we're ignoring the error?  Comment.
> > 
> > Production code in debug mode fails with a FATAL, because there should not
be
> > cases where that is acceptable.  Execute() is WARN_UNUSED_RESULT to force
> > developers to pay attention to the result, because in actual production
there
> > will be failures that are not due to how the code is written.  The ignorer
> tells
> > sql::Connection that we're testing the code to handle an error case, so
don't
> do
> > the FATAL thing.
> > 
> > [To be clear, a lot of the failure code paths for sql returns are entirely
> > untested in Chromium.  A lot of the code is written to assume nothing can
> fail,
> > which is true in development but false when deployed.  I've tried to force
the
> > issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for
> > failure paths, in part because it's challenging to inject failures.  So I
> agree,
> > much of this is dubious.]
> 
> Thank you for the explanation.  That said, my original request still applies. 
> Please add a comment here, in particular about why the statement.is_valid()
test
> should be false.  To a naive eye, I'd think it should be true.

Done.  Pulled up the statement sql, noted that the post-recovery version
succeeds.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:331:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > Side comment: the comment by the declaration of CheckIgnoredErrors() is
> awful.
> > 
> > > It doesn't say what the return type means.  Does it mean we encountered an
> > error
> > > of the type that we ignored?  That's my best guess.  But it could mean
that
> we
> > > encountered an error that didn't ignore, i.e., check status, taking into
> > account
> > > ignored errors...
> > 
> > I don't really follow.  It says it's used to check if the errors were
> > encountered, "True" seems like the most likely response to that question.
> 
> I don't agree.  I see at least two ways to understand CheckIgnoredErrors(). 
One
> could simply be "return whether we have an error that was ignored".  Another
> could be "return whether everything is okay, i.e., check that it's okay,
taking
> in account ignored errors".

Obviously there's a terminology problem, here, but I'm a little befuddled how to
resolve it.  ScopedErrorIgnorer exists entirely/strictly to track errors within
a scope.  I'm not sure what "everything" might refer to in that context, other
than the errors the ScopedErrorIgnorer has been enlisted to track.

I'd be happy to refactor that a little, I'll open up a discussion thread offline
and this CL can wait on that.  The original reviewer of that code maybe wasn't
enthusiastic enough about being roped into this.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:334: // The
sql::Connection::Open() in ShortcutsDatabase::Init() will hit the
On 2016/04/19 23:34:28, Mark P wrote:
> nit: in -> call in

Done.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:344:
ASSERT_TRUE(ignore_errors.CheckIgnoredErrors());
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > I think it's odd that we still see this error.  Shouldn't it be cleared
> after
> > > the first failure and then not have any errors for the second attempt at
> > > opening?
> > 
> > The error will be thrown in the first Open(), the ignorer allows it to fail
> > gracefully, then the retry will succeed.
> 
> Given this, I don't like this test's structure.  I want a test that verifies
the
> database opens correctly when not in the context of an IgnoreError regime. 
> Either fix this so it clears the error in the middle as I originally expected
or
> add another block below to verify that.

I'm not sure what you're referring to with "clears the error".  Nobody is
clearing errors.  The database can't open correctly when it is corrupt and not
in the context of an error ignorer, there will be a crash.  As far as whether
the Open() retry works when the error callback fixes things, that's already
being tested by SQLConnectionTest.RazeCallbackReopen.  This test is testing that
the Open() retry logic works when the shortcuts error handler is in the loop.

I don't think there is a point where ShortcutsDatabase can crack things open to
allow seeing the flow of control for failure retry within
sql::Connection::Open().  I could add a test that uses the shortcuts error
handler against a plain sql::Connection, but that would imply exporting the
error handler, which may-or-may-not be appropriate (right now it would work
because it's written in terms of only the sql::Connection, but that wouldn't
work if it were revised to reference ShortcutsDatabase).

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:354: "SELECT COUNT(*)
FROM omni_box_shortcuts"));
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > Please use an identical call to the one you used to verify the database
was
> > > broken, don't switch from execute to getuniquestatement.
> > 
> > Done.
> 
> Almost done.  I'd like exactly the same tests, i.e., also add a 
> ASSERT_TRUE(statement.is_valid());

Done.

https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow...
components/omnibox/browser/shortcuts_database_unittest.cc:356: EXPECT_EQ(5,
statement.ColumnInt(0));
On 2016/04/19 23:34:27, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > Consider using something like
> > > EXPECT_EQ(arraysize(shortcut_test_db), CountRecords())
> > > a la the other tests in this file.
> > 
> > CountRecords() runs against the sql::Connection contained in the
> > ShortcutsDatabase created by SetUp().  That database cannot easily be used
at
> > this point.  I could do this:
> >   db_ = new ShortcutsDatabase(db_path);
> > but note that this isn't necessarily describing the same database as
SetUp().
> > 
> > I could shift the CountRecords() code out to a helper running against a
> > sql::Connection, then have CountRecords() call that.
> 
> I don't have a strong preference for how you fix it, but fundamentally I don't
> want a magic number "5" here because it's not clear where it comes from and
> which / whether other changes in this file should cause this number to change
> and which shouldn't.  Please solve this code.

It's dependent on the contents of the golden file.  I'll generate the expected
value from the uncorrupted database.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode541
sql/recovery.cc:541: // which is created on demand by SQLite if any tables use
AUTOINCREMENT.
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:15, Scott Hess wrote:
> > On 2016/04/07 23:09:43, Mark P wrote:
> > > This comment is too broad.  Below you explicitly assume the prefix is the
> > right
> > > amount of text so if you add a "main." right after it, it makes sense as a
> sql
> > > statement.
> > > SELECT foo from BAR;
> > > ->
> > > SELmain.ECT foo from BAR;
> > > Ugh, you don't mean this.
> > 
> > I agree that the example case isn't what I want, but I'm not sure what to do
> > about it.  The prefix has to be some sort of valid SQL prefix thingie, but
I'm
> > not sure what to call it, and I'm not going to be writing a parser to
validate
> > it.  I guess I can assert that it ends in a space, but even that doesn't
mean
> > much, since I'm using string manipulation to dynamically construct
statements.
> 
> You still should allude to this limitation here, if only to say something like
> |prefix| must be the beginning of valid SQL string so that we can insert
"main."
> after it in SQL commands that start with prefix and have the result be another
> valid SQL command.

Done.

I'm reluctant to get too elaborate with the comment, because at some point it
will just be an English version of the precise series of steps the code
implements.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode551
sql/recovery.cc:551: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'"));
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/07 23:09:43, Mark P wrote:
> > > What happens if prefix_length > length(sql)?  What does substr return in
> this
> > > case?
> > > The sqlite3 documentation I found doesn't define this.
> > > https://www.sqlite.org/lang_corefunc.html
> > > Perhaps you want to add another another condition ensuring that sql is
long
> > > enough?
> > > (Though I can't imagine this causing any problems unless substr() actually
> > fails
> > > on this input, as the equality test still applies)
> > 
> > I think I'll just rewrite this as a full table scan and do the filtering in
> C++.
> >  It won't change the efficiency at all, and I expect it will be
approximately
> > the same weight in code.
> 
> You didn't do this rewrite you promised.

Was thinking out loud, earlier.  Done.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode555
sql/recovery.cc:555: while (s.Step()) {
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:14, Scott Hess wrote:
> > On 2016/04/07 23:09:43, Mark P wrote:
> > > Should all these executions be wrapped in a transaction?
> > 
> > Recovery happens to a temporary database, which is only committed to the
main
> > database in Recovered().  So transactions would only be necessary insofar as
> the
> > (local) caller needs to do controlled rollback of one approach in order to
try
> > another approach.  I don't think that can really come up, here, mostly
because
> I
> > can't think of a second simple approach (either the schema will work for
this
> > kind of recovery, or it won't, and the client shouldn't be calling this code
> if
> > the schema is too complex or subtle).
> 
> You're not concerned with having a failed attempt leave additional commands in
> corrupt.sqlite_master?  Or does that not happen?

In case of failure the recovery database [main] does not replace the original
database [corrupt].

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588
sql/recovery.cc:588: db->Poison();
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:15, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > In the header comment, you write the function will poison the database so
it
> > > "will return errors until the handle is re-opened."  This implies the
> database
> > > is closed.  Yet, you don't close it here.  The comment by the definition
for
> > > Poison says
> > >   // This is intended as an alternative to Close() in error callbacks.
> > >   // Close() should still be called at some point.
> > > so shouldn't you close it...?
> > 
> > The originating caller's Open()/Close() calls should balance out.  This call
> can
> > happen deep in a stack of calls in a case handling an exceptional error,
> making
> > it impractical for the client to track the Close() status.  So Poison()
> poisons
> > the sql::Connection handle, but logically the handle is still open.
> In this case, I think it'd be better in the header to say, after you say that
it
> poisons the database "(though the handle technically remains open)" before
> saying "will return errors until the handle is re-opened."
> > 
> > For instance, warnings are thrown if you try to run Execute() on a handle
> which
> > was never opened, or which has been closed.  But if the handle was poisoned,
> > then the Execute() will fail without warnings.

Done.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode661
sql/recovery.cc:661: //   sql LIKE "CREATE VIRTUAL TABLE %"
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:15, Scott Hess wrote:
> > On 2016/04/08 20:12:06, Mark P wrote:
> > > Do you have a sense of whether this will be different in some cases?  Or
are
> > you
> > > nervous, which is why you left it like vacuum.c?
> > 
> > I do not fully understand why vacuum.c uses the rootpage like this.  SQLite
> > creates tables with a root page even if empty, and I believe it could also
> > validly use a null root page.  Most likely the assumption is that you
created
> a
> > table, so you'll populate it, though I've certainly seen that assumption
> broken
> > w/in Chromium.  Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it
> can't
> > accidentally match 'CREATE TABLE' or any other sort of faux table.
> 
> If it seems safer to you, why don't you do it (and comment on why you deviate
> from vacuum.c in this case)?

Assuming my reasoning is correct, then the DCHECK_IS_ON() section above implies
that virtual tables shouldn't work at all, so I'll just remove the part I
believe to be associated with virtual tables.

I'm leaving the DCHECK_IS_ON() section in place, though.  My experience is that
error-handling code sometimes does not have good testing coverage, so I'd rather
it actively failed in development rather than falsely succeeding.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode684
sql/recovery.cc:684: // instance a symlink to a non-existent path, or the file
is a directory).
On 2016/04/19 23:34:28, Mark P wrote:
> How is this recoverable?  This comment implies you should be returning false.

I intended this for the delete option, which you asked me to remove, so I'll
remove it.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689
sql/recovery.cc:689: // SQLite return this where other versions return
SQLITE_CORRUPT.
On 2016/04/19 23:34:28, Mark P wrote:
> How is this likely to be recoverable?

In the case where it was incorrectly returning SQLITE_NOTADB instead of
SQLITE_CORRUPT, it would be recoverable.  For the version of SQLite currently
checked into Chromium's third_party repo, I think SQLITE_NOTADB is never
recoverable, but some iOS versions of SQLite are old enough to hit this.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode89
sql/recovery.h:89: static std::unique_ptr<Recovery> Begin(Connection*
connection,
On 2016/04/19 23:34:28, Mark P wrote:
> I'm surprised that you're changing the type of this function and the ones
below
> it and don't need to add additional files to this changelist.

The change from base::scoped_ptr<> to std::unique_ptr<> happened since the CL
was originally posted.  So any changes you see there are probably just the
rebase, and wouldn't show in a diff from trunk (except any new code following
the conversion, of course).

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode172
sql/recovery.h:172: static void RecoverDatabaseOrRaze(Connection* db,
On 2016/04/19 23:34:28, Mark P wrote:
> On 2016/04/15 00:38:15, Scott Hess wrote:
> > On 2016/04/07 23:09:44, Mark P wrote:
> > > Uh, did you mean "will be", not "can be"?  And will the result be razed
> (i.e.,
> > > deleted) or an empty database?
> > > If you didn't mean that, then how will the caller know whether the
recovery
> > > failed and the database is okay to be razed?  This function has no return
> > value.
> > > If you're intentionally vague because you plan to change this from
> > > leave-corrupt-database-around to something else, just give it the right
name
> > for
> > > now.  Later when you make the intended change, you can rename the
function.
> > 
> > I'm not sure how to answer!
> > 
> > Right now, it only applies the safe cases.  It either thinks it recovered
the
> > database, or it rolls things back.  But I've seen results from the fleet
which
> > make me believe that running recovery to success didn't actually work
> (possibly
> > because of shenanigans in the system, like malware or profile's being
> > overwritten or something).  The intention is to monitor the histograms to
see
> if
> > any of the failure outcomes is due to a broken assumption rather than broken
> > user data.  If the failures cases look weirdly high, I'll instrument to
debug
> > the fleet, but if they look sensible, I'll switch them over from Rollback()
to
> > Unrecoverable() (leading to Raze()).
> > 
> > sql::Connection::Raze() means that it overwrites the database with a
> single-page
> > empty database.  The delete language is because if the Raze() doesn't work,
> then
> > what?  Probably attempt to Delete().  I'm nervous about landing either of
> those
> > up front, though, on the off chance that I write a bug which works right in
my
> > tests but in the field blows away a bunch of data.
> > 
> > Anyhow, my intention with this function is that it consumes the database
> handle
> > in all cases, while making a best effort to recover the data, and failing
that
> a
> > best effort to clear the database of data, and failing that a best effort to
> > delete the database.  On next run the database should either be a valid
> > data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should
> be
> > a valid empty SQLite database, or it should be missing (SQLite would create
> the
> > database empty).
> > 
> > Raze() versus Delete() is because Raze() works using SQLite locking and I/O.

> It
> > can clear the database in cases where Delete() won't work.  Likewise,
Delete()
> > may work in cases where Raze() does not.
> 
> Thanks for the long response.  Fundamentally, my complaint is that the comment
> above and indeed the function name doesn't correspond with what the function
> does.  There is no razing going on here.
> 
> Why not just call it RecoverDatabase()?  And remove comments that are entirely
> wrong such as "If the database is entirely unreadable, the new database will
be
> empty.", and, for that matter, the whole second paragraph (each sentence in it
> is currently wrong).
> 
> You can add your raze or delete language back into the function name and
> comments once you actually switch the code from doing Rollback to
Unrecoverable.
>  And at that time add comments about the goal of the function (leaving a
working
> database, even if empty) at that time.  That's not the goal at this time.
> 
> 

Done.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode176
sql/recovery.h:176: // or which cannot plausibly be recovered (and should be
razed).  This does
On 2016/04/19 23:34:28, Mark P wrote:
> nit: or which -> versus
> You don't mean "or", at least in the boolean sense.
> Consider making a similar change to the function name.

Just removed the comparison clause entirely.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.cc
File sql/recovery_unittest.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.c...
sql/recovery_unittest.cc:771: // As a side effect AUTOINCREMENT creates the
sqlite_sequence table for
On 2016/04/19 23:34:28, Mark P wrote:
> optional nit: comma after effect

Done.

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (left):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83764: <int value="19"
label="RECOVERY_FAILED_META_NO_VERSION">
On 2016/04/19 23:34:28, Mark P wrote:
> Please add a sentence to the end of the Sqlite.RecoveryEvents description,
> something to effect that those of these enum entries are subsets / included in
> the counts of other enum entries.

Done.

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
recover.
On 2016/04/19 23:34:28, Mark P wrote:
> "to recover" or "to attempt recovery"?

Recovery was attempted, and it failed in setup.  I'll rephrase to be similar to
the other descriptions.

> It looks like this error does not include the other detailed failures about
> problems that can happen along the recovery process

I don't know what this means.  Are you asking for someone to enumerate all of
the cases which prevented Begin() from succeeding?  Most of them are already
logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
RECOVERY_FAILED_ATTACH.

Which obviously implies you may want me to annotate those various items to
describe which other items are supersets and subsets of each other, etc.  I'm
going to pre-emptively argue against this (and maybe that I should unwind the
annotations just added).  I'm basically recording these as code coverage, so
there's literally no utility to understanding them in the abstract without
reference to the code.  It is easy enough to enumerate the places where a
failure can be detected, but many of them are challenging to quantify in the
abstract, and I'd rather not spend a week understanding how a particular failure
can happen if it has only ever happened 4 times in the fleet.

Scott Hess - ex-Googler

Apologies - I see some unrelated rebase action in there. I'm going to back out ...

4 years, 7 months ago (2016-05-13 21:26:52 UTC) #16

Scott Hess - ex-Googler

Patchset #5 (id:100001) has been deleted

4 years, 7 months ago (2016-05-13 21:33:27 UTC) #19

Scott Hess - ex-Googler

Patchset #5 (id:120001) has been deleted

4 years, 7 months ago (2016-05-13 21:35:34 UTC) #21

Scott Hess - ex-Googler

Address review comments through #14. I think.

4 years, 7 months ago (2016-05-13 21:39:06 UTC) #23

Mark P

I assume it's okay to wait until both other changelists are submitted before coming back ...

4 years, 6 months ago (2016-06-08 22:15:38 UTC) #24

Scott Hess - ex-Googler

On 2016/06/08 22:15:38, Mark P wrote: > I assume it's okay to wait until both ...

4 years, 6 months ago (2016-06-08 23:34:34 UTC) #25

Scott Hess - ex-Googler

On 2016/06/25 00:35:32, Scott Hess wrote: > rebase If that emailed, note that I haven't ...

4 years, 6 months ago (2016-06-25 00:43:07 UTC) #27

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org to run a CQ dry run

4 years, 6 months ago (2016-06-25 00:43:17 UTC) #28

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/1832173002/180001

4 years, 6 months ago (2016-06-25 00:43:33 UTC) #29

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 6 months ago (2016-06-25 00:57:21 UTC) #30

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device-gn on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-gn/builds/26855)

4 years, 6 months ago (2016-06-25 00:57:22 UTC) #31

Scott Hess - ex-Googler

std::count returns ptrdiff_t, various comment changes.

4 years, 5 months ago (2016-06-27 21:33:26 UTC) #32

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-06-27 21:34:11 UTC) #33

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/1832173002/200001

4 years, 5 months ago (2016-06-27 21:35:26 UTC) #34

Scott Hess - ex-Googler

OK, I think the open comments have been addressed, except the histograms.xml one I've noted ...

4 years, 5 months ago (2016-06-27 21:35:38 UTC) #35

OK, I think the open comments have been addressed, except the histograms.xml one
I've noted as being in disagreement.

Sorry about the extended arc of this CL.  Hopefully now the skids are greased.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588
sql/recovery.cc:588: db->Poison();
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > On 2016/04/15 00:38:15, Scott Hess wrote:
> > > On 2016/04/08 20:12:06, Mark P wrote:
> > > > In the header comment, you write the function will poison the database
so
> it
> > > > "will return errors until the handle is re-opened."  This implies the
> > database
> > > > is closed.  Yet, you don't close it here.  The comment by the definition
> for
> > > > Poison says
> > > >   // This is intended as an alternative to Close() in error callbacks.
> > > >   // Close() should still be called at some point.
> > > > so shouldn't you close it...?
> > > 
> > > The originating caller's Open()/Close() calls should balance out.  This
call
> > can
> > > happen deep in a stack of calls in a case handling an exceptional error,
> > making
> > > it impractical for the client to track the Close() status.  So Poison()
> > poisons
> > > the sql::Connection handle, but logically the handle is still open.
> > In this case, I think it'd be better in the header to say, after you say
that
> it
> > poisons the database "(though the handle technically remains open)" before
> > saying "will return errors until the handle is re-opened."
> > > 
> > > For instance, warnings are thrown if you try to run Execute() on a handle
> > which
> > > was never opened, or which has been closed.  But if the handle was
poisoned,
> > > then the Execute() will fail without warnings.
> 
> Done.

Header-comment-change really done, this time.

https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689
sql/recovery.cc:689: // SQLite return this where other versions return
SQLITE_CORRUPT.
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > How is this likely to be recoverable?
> 
> In the case where it was incorrectly returning SQLITE_NOTADB instead of
> SQLITE_CORRUPT, it would be recoverable.  For the version of SQLite currently
> checked into Chromium's third_party repo, I think SQLITE_NOTADB is never
> recoverable, but some iOS versions of SQLite are old enough to hit this.

Adjusted the comment a bit more.  Long-term, this case will lead to recovery for
some versions of SQLite, and to deletion for others.  I don't think it would be
productive to isolate which versions of SQLite have the different return code,
because the only change to the worst-case outcome is that the handle is poisoned
(which is actually preferable, since SQLITE_NOTADB will never resolve itself).

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
recover.
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > "to recover" or "to attempt recovery"?
> 
> Recovery was attempted, and it failed in setup.  I'll rephrase to be similar
to
> the other descriptions.
> 
> > It looks like this error does not include the other detailed failures about
> > problems that can happen along the recovery process
> 
> I don't know what this means.  Are you asking for someone to enumerate all of
> the cases which prevented Begin() from succeeding?  Most of them are already
> logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
> RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
> RECOVERY_FAILED_ATTACH.
> 
> Which obviously implies you may want me to annotate those various items to
> describe which other items are supersets and subsets of each other, etc.  I'm
> going to pre-emptively argue against this (and maybe that I should unwind the
> annotations just added).  I'm basically recording these as code coverage, so
> there's literally no utility to understanding them in the abstract without
> reference to the code.  It is easy enough to enumerate the places where a
> failure can be detected, but many of them are challenging to quantify in the
> abstract, and I'd rather not spend a week understanding how a particular
failure
> can happen if it has only ever happened 4 times in the fleet.

Just FYI, I don't consider this comment resolved, there's still disagreement.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 5 months ago (2016-06-27 22:39:06 UTC) #36

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/192521)

4 years, 5 months ago (2016-06-27 22:39:07 UTC) #37

Mark P

comments are on both patchset #7 and #8. (Apologies, I was in the middle of ...

4 years, 5 months ago (2016-06-27 23:20:13 UTC) #38

comments are on both patchset #7 and #8.
(Apologies, I was in the middle of reviewing patchset 7 when you uploaded patch
8 with additional revisions.)

--mark

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
recover.
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > "to recover" or "to attempt recovery"?
> 
> Recovery was attempted, and it failed in setup.  I'll rephrase to be similar
to
> the other descriptions.
> 
> > It looks like this error does not include the other detailed failures about
> > problems that can happen along the recovery process
> 
> I don't know what this means.  Are you asking for someone to enumerate all of
> the cases which prevented Begin() from succeeding?  Most of them are already
> logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
> RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
> RECOVERY_FAILED_ATTACH.
> 
> Which obviously implies you may want me to annotate those various items to
> describe which other items are supersets and subsets of each other, etc.  I'm
> going to pre-emptively argue against this (and maybe that I should unwind the
> annotations just added).  I'm basically recording these as code coverage, so
> there's literally no utility to understanding them in the abstract without
> reference to the code.  It is easy enough to enumerate the places where a
> failure can be detected, but many of them are challenging to quantify in the
> abstract, and I'd rather not spend a week understanding how a particular
failure
> can happen if it has only ever happened 4 times in the fleet.

Acknowledged.

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
sql/recovery.cc:555: while (s.Step()) {
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > On 2016/04/15 00:38:14, Scott Hess wrote:
> > > On 2016/04/07 23:09:43, Mark P wrote:
> > > > Should all these executions be wrapped in a transaction?
> > > 
> > > Recovery happens to a temporary database, which is only committed to the
> main
> > > database in Recovered().  So transactions would only be necessary insofar
as
> > the
> > > (local) caller needs to do controlled rollback of one approach in order to
> try
> > > another approach.  I don't think that can really come up, here, mostly
> because
> > I
> > > can't think of a second simple approach (either the schema will work for
> this
> > > kind of recovery, or it won't, and the client shouldn't be calling this
code
> > if
> > > the schema is too complex or subtle).
> > 
> > You're not concerned with having a failed attempt leave additional commands
in
> > corrupt.sqlite_master?  Or does that not happen?
> 
> In case of failure the recovery database [main] does not replace the original
> database [corrupt].

I don't understand.  It seems in on each step you're running an executing a
command.  If one command fails but the other succeeds, won't you leave the the
"main." database in a weird state?  I don't see function wrapped in a
transaction.

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0)
nit: spaces around binary operators

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173
sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR
or SQLITE_FULL).
nit: recovered -> recoverable ?

https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
tools/metrics/histograms/histograms.xml:54088: +   
RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.
On 2016/05/13 21:24:36, Scott Hess wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > Please add a sentence to the end of the Sqlite.RecoveryEvents description,
> > something to effect that those of these enum entries are subsets / included
in
> > the counts of other enum entries.
> 
> Done.

Kind of done.  These are subsets, true.  But I _think_ there are other subsets
here too.
Please either
- be complete and list all subsets.
- be vague and simply state that there are certain types of errors that cause an
emission to multiple buckets.  (Read code for details.)
The current phrasing implies this list is complete (which I'm skeptical of given
your other comments).

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should
not be correlated with programmer error, so the
I have trouble understanding this sentence.  I don't see why it shouldn't be
"correlated" in some sense, nor is it obvious why the DCHECK wouldn't be
helpful.  Perhaps consider replacing the sentence with:
The DLOG(FATAL) below should not normally be relevant to developers. (Developers
do not often test against corrupt-yet-recoverable databases.)  Displaying the
error would probably lead to more confusion than help.

https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod...
sql/recovery.cc:689: // will fail with no changes to the database.
consider adding ... so there's no harm in returning true here.

Scott Hess - ex-Googler

My day is looking such that if I'm lucky, I can circle back this evening. ...

4 years, 5 months ago (2016-06-28 22:28:28 UTC) #39

My day is looking such that if I'm lucky, I can circle back this evening.  WRT
the histogram coverage, right now, I'm just loading up the same histogram like
histograms cost money.  Would it help to just break out a separate histogram for
the new DB-level API?  In that case, I can probably get that API to where each
exit point has a single enum, and the failures will overlap the table-level
failures in the other histogram, rather than overlapping within the same
histogram.


On 2016/06/27 23:20:13, Mark P wrote:
> comments are on both patchset #7 and #8.
> (Apologies, I was in the middle of reviewing patchset 7 when you uploaded
patch
> 8 with additional revisions.)
> 
> --mark
> 
>
https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
> File tools/metrics/histograms/histograms.xml (right):
> 
>
https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
> tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
> recover.
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > "to recover" or "to attempt recovery"?
> > 
> > Recovery was attempted, and it failed in setup.  I'll rephrase to be similar
> to
> > the other descriptions.
> > 
> > > It looks like this error does not include the other detailed failures
about
> > > problems that can happen along the recovery process
> > 
> > I don't know what this means.  Are you asking for someone to enumerate all
of
> > the cases which prevented Begin() from succeeding?  Most of them are already
> > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
> > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
> > RECOVERY_FAILED_ATTACH.
> > 
> > Which obviously implies you may want me to annotate those various items to
> > describe which other items are supersets and subsets of each other, etc. 
I'm
> > going to pre-emptively argue against this (and maybe that I should unwind
the
> > annotations just added).  I'm basically recording these as code coverage, so
> > there's literally no utility to understanding them in the abstract without
> > reference to the code.  It is easy enough to enumerate the places where a
> > failure can be detected, but many of them are challenging to quantify in the
> > abstract, and I'd rather not spend a week understanding how a particular
> failure
> > can happen if it has only ever happened 4 times in the fleet.
> 
> Acknowledged.
> 
> https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc
> File sql/recovery.cc (right):
> 
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
> sql/recovery.cc:555: while (s.Step()) {
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > On 2016/04/15 00:38:14, Scott Hess wrote:
> > > > On 2016/04/07 23:09:43, Mark P wrote:
> > > > > Should all these executions be wrapped in a transaction?
> > > > 
> > > > Recovery happens to a temporary database, which is only committed to the
> > main
> > > > database in Recovered().  So transactions would only be necessary
insofar
> as
> > > the
> > > > (local) caller needs to do controlled rollback of one approach in order
to
> > try
> > > > another approach.  I don't think that can really come up, here, mostly
> > because
> > > I
> > > > can't think of a second simple approach (either the schema will work for
> > this
> > > > kind of recovery, or it won't, and the client shouldn't be calling this
> code
> > > if
> > > > the schema is too complex or subtle).
> > > 
> > > You're not concerned with having a failed attempt leave additional
commands
> in
> > > corrupt.sqlite_master?  Or does that not happen?
> > 
> > In case of failure the recovery database [main] does not replace the
original
> > database [corrupt].
> 
> I don't understand.  It seems in on each step you're running an executing a
> command.  If one command fails but the other succeeds, won't you leave the the
> "main." database in a weird state?  I don't see function wrapped in a
> transaction.
> 
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
> sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0)
> nit: spaces around binary operators
> 
> https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h
> File sql/recovery.h (right):
> 
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173
> sql/recovery.h:173: // verified to be recovered correctly (such as
SQLITE_IOERR
> or SQLITE_FULL).
> nit: recovered -> recoverable ?
> 
>
https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
> File tools/metrics/histograms/histograms.xml (right):
> 
>
https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
> tools/metrics/histograms/histograms.xml:54088: +   
> RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > Please add a sentence to the end of the Sqlite.RecoveryEvents description,
> > > something to effect that those of these enum entries are subsets /
included
> in
> > > the counts of other enum entries.
> > 
> > Done.
> 
> Kind of done.  These are subsets, true.  But I _think_ there are other subsets
> here too.
> Please either
> - be complete and list all subsets.
> - be vague and simply state that there are certain types of errors that cause
an
> emission to multiple buckets.  (Read code for details.)
> The current phrasing implies this list is complete (which I'm skeptical of
given
> your other comments).
> 
>
https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
> File components/omnibox/browser/shortcuts_database.cc (right):
> 
>
https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
> components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases
should
> not be correlated with programmer error, so the
> I have trouble understanding this sentence.  I don't see why it shouldn't be
> "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be
> helpful.  Perhaps consider replacing the sentence with:
> The DLOG(FATAL) below should not normally be relevant to developers.
(Developers
> do not often test against corrupt-yet-recoverable databases.)  Displaying the
> error would probably lead to more confusion than help.
> 
> https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc
> File sql/recovery.cc (right):
> 
>
https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod...
> sql/recovery.cc:689: // will fail with no changes to the database.
> consider adding ... so there's no harm in returning true here.

Mark P

On Tue, Jun 28, 2016 at 3:28 PM, <shess@chromium.org> wrote: > My day is looking ...

4 years, 5 months ago (2016-06-28 22:38:29 UTC) #40

On Tue, Jun 28, 2016 at 3:28 PM, <shess@chromium.org> wrote:

> My day is looking such that if I'm lucky, I can circle back this evening.
> WRT
> the histogram coverage, right now, I'm just loading up the same histogram
> like
> histograms cost money. Would it help to just break out a separate
> histogram for
> the new DB-level API? In that case, I can probably get that API to where
> each
> exit point has a single enum, and the failures will overlap the table-level
> failures in the other histogram, rather than overlapping within the same
> histogram.
>

Frankly, I don't care that much.  I don't expect many people other than you
to look at this.  I just want whatever behavior there is to be documented
in case someone else looks then they'll either understand or at least know
they have to delve into code to look deeper.  Do whichever solution (one or
multiple histograms) that would be better for you to work with.

--mark


>
>
>
> On 2016/06/27 23:20:13, Mark P wrote:
> > comments are on both patchset #7 and #8.
> > (Apologies, I was in the middle of reviewing patchset 7 when you uploaded
> patch
> > 8 with additional revisions.)
> >
> > --mark
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
> > File tools/metrics/histograms/histograms.xml (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
> > tools/metrics/histograms/histograms.xml:83771: + Database was too broken
> to
> > recover.
> > On 2016/05/13 21:24:36, Scott Hess wrote:
> > > On 2016/04/19 23:34:28, Mark P wrote:
> > > > "to recover" or "to attempt recovery"?
> > >
> > > Recovery was attempted, and it failed in setup. I'll rephrase to be
> similar
> > to
> > > the other descriptions.
> > >
> > > > It looks like this error does not include the other detailed failures
> about
> > > > problems that can happen along the recovery process
> > >
> > > I don't know what this means. Are you asking for someone to enumerate
> all
> of
> > > the cases which prevented Begin() from succeeding? Most of them are
> already
> > > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
> > > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
> > > RECOVERY_FAILED_ATTACH.
> > >
> > > Which obviously implies you may want me to annotate those various
> items to
> > > describe which other items are supersets and subsets of each other,
> etc.
> I'm
> > > going to pre-emptively argue against this (and maybe that I should
> unwind
> the
> > > annotations just added). I'm basically recording these as code
> coverage, so
> > > there's literally no utility to understanding them in the abstract
> without
> > > reference to the code. It is easy enough to enumerate the places where
> a
> > > failure can be detected, but many of them are challenging to quantify
> in the
> > > abstract, and I'd rather not spend a week understanding how a
> particular
> > failure
> > > can happen if it has only ever happened 4 times in the fleet.
> >
> > Acknowledged.
> >
> > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc
> > File sql/recovery.cc (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
> > sql/recovery.cc:555: while (s.Step()) {
> > On 2016/05/13 21:24:36, Scott Hess wrote:
> > > On 2016/04/19 23:34:28, Mark P wrote:
> > > > On 2016/04/15 00:38:14, Scott Hess wrote:
> > > > > On 2016/04/07 23:09:43, Mark P wrote:
> > > > > > Should all these executions be wrapped in a transaction?
> > > > >
> > > > > Recovery happens to a temporary database, which is only committed
> to the
> > > main
> > > > > database in Recovered(). So transactions would only be necessary
> insofar
> > as
> > > > the
> > > > > (local) caller needs to do controlled rollback of one approach in
> order
> to
> > > try
> > > > > another approach. I don't think that can really come up, here,
> mostly
> > > because
> > > > I
> > > > > can't think of a second simple approach (either the schema will
> work for
> > > this
> > > > > kind of recovery, or it won't, and the client shouldn't be calling
> this
> > code
> > > > if
> > > > > the schema is too complex or subtle).
> > > >
> > > > You're not concerned with having a failed attempt leave additional
> commands
> > in
> > > > corrupt.sqlite_master? Or does that not happen?
> > >
> > > In case of failure the recovery database [main] does not replace the
> original
> > > database [corrupt].
> >
> > I don't understand. It seems in on each step you're running an executing
> a
> > command. If one command fails but the other succeeds, won't you leave
> the the
> > "main." database in a weird state? I don't see function wrapped in a
> > transaction.
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
> > sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0)
> > nit: spaces around binary operators
> >
> > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h
> > File sql/recovery.h (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173
> > sql/recovery.h:173: // verified to be recovered correctly (such as
> SQLITE_IOERR
> > or SQLITE_FULL).
> > nit: recovered -> recoverable ?
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
> > File tools/metrics/histograms/histograms.xml (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
> > tools/metrics/histograms/histograms.xml:54088: +
> > RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.
> > On 2016/05/13 21:24:36, Scott Hess wrote:
> > > On 2016/04/19 23:34:28, Mark P wrote:
> > > > Please add a sentence to the end of the Sqlite.RecoveryEvents
> description,
> > > > something to effect that those of these enum entries are subsets /
> included
> > in
> > > > the counts of other enum entries.
> > >
> > > Done.
> >
> > Kind of done. These are subsets, true. But I _think_ there are other
> subsets
> > here too.
> > Please either
> > - be complete and list all subsets.
> > - be vague and simply state that there are certain types of errors that
> cause
> an
> > emission to multiple buckets. (Read code for details.)
> > The current phrasing implies this list is complete (which I'm skeptical
> of
> given
> > your other comments).
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
> > File components/omnibox/browser/shortcuts_database.cc (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
> > components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases
> should
> > not be correlated with programmer error, so the
> > I have trouble understanding this sentence. I don't see why it shouldn't
> be
> > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be
> > helpful. Perhaps consider replacing the sentence with:
> > The DLOG(FATAL) below should not normally be relevant to developers.
> (Developers
> > do not often test against corrupt-yet-recoverable databases.) Displaying
> the
> > error would probably lead to more confusion than help.
> >
> > https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc
> > File sql/recovery.cc (right):
> >
> >
>
>
https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod...
> > sql/recovery.cc:689: // will fail with no changes to the database.
> > consider adding ... so there's no harm in returning true here.
>
>
>
> https://codereview.chromium.org/1832173002/
>

-- 
You received this message because you are subscribed to the Google Groups
"Chromium-reviews" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Scott Hess - ex-Googler

Trim comments, histogram comments, drop enum from histogram.

4 years, 5 months ago (2016-06-29 21:37:38 UTC) #41

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-06-29 21:39:19 UTC) #42

Scott Hess - ex-Googler

ptal, thx. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histograms/histograms.xml File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histograms/histograms.xml#oldcode83764 tools/metrics/histograms/histograms.xml:83764: <int value="19" label="RECOVERY_FAILED_META_NO_VERSION"> On 2016/05/13 21:24:36, Scott ...

4 years, 5 months ago (2016-06-29 21:39:30 UTC) #43

ptal, thx.

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (left):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83764: <int value="19"
label="RECOVERY_FAILED_META_NO_VERSION">
On 2016/05/13 21:24:36, Scott Hess (OOO Jul 1-Aug 7) wrote:
> On 2016/04/19 23:34:28, Mark P wrote:
> > Please add a sentence to the end of the Sqlite.RecoveryEvents description,
> > something to effect that those of these enum entries are subsets / included
in
> > the counts of other enum entries.
> 
> Done.

Undone in favor of referring to the code.

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram...
tools/metrics/histograms/histograms.xml:83771: +    Database was too broken to
recover.
On 2016/06/27 23:20:13, Mark P wrote:
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > "to recover" or "to attempt recovery"?
> > 
> > Recovery was attempted, and it failed in setup.  I'll rephrase to be similar
> to
> > the other descriptions.
> > 
> > > It looks like this error does not include the other detailed failures
about
> > > problems that can happen along the recovery process
> > 
> > I don't know what this means.  Are you asking for someone to enumerate all
of
> > the cases which prevented Begin() from succeeding?  Most of them are already
> > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY,
> > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and
> > RECOVERY_FAILED_ATTACH.
> > 
> > Which obviously implies you may want me to annotate those various items to
> > describe which other items are supersets and subsets of each other, etc. 
I'm
> > going to pre-emptively argue against this (and maybe that I should unwind
the
> > annotations just added).  I'm basically recording these as code coverage, so
> > there's literally no utility to understanding them in the abstract without
> > reference to the code.  It is easy enough to enumerate the places where a
> > failure can be detected, but many of them are challenging to quantify in the
> > abstract, and I'd rather not spend a week understanding how a particular
> failure
> > can happen if it has only ever happened 4 times in the fleet.
> 
> Acknowledged.

OK, WRT re-writing the histogram comment to indicate "this is coverage, look at
the code", I also omitted RECOVERY_FAILED_AUTORECOVERDB_SCHEMA in favor of the
independent subsets.  So those enums are now independent, but the entire set of
enums are not (due to table and db errors overlapping).

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
sql/recovery.cc:555: while (s.Step()) {
On 2016/06/27 23:20:13, Mark P wrote:
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > On 2016/04/15 00:38:14, Scott Hess wrote:
> > > > On 2016/04/07 23:09:43, Mark P wrote:
> > > > > Should all these executions be wrapped in a transaction?
> > > > 
> > > > Recovery happens to a temporary database, which is only committed to the
> > main
> > > > database in Recovered().  So transactions would only be necessary
insofar
> as
> > > the
> > > > (local) caller needs to do controlled rollback of one approach in order
to
> > try
> > > > another approach.  I don't think that can really come up, here, mostly
> > because
> > > I
> > > > can't think of a second simple approach (either the schema will work for
> > this
> > > > kind of recovery, or it won't, and the client shouldn't be calling this
> code
> > > if
> > > > the schema is too complex or subtle).
> > > 
> > > You're not concerned with having a failed attempt leave additional
commands
> in
> > > corrupt.sqlite_master?  Or does that not happen?
> > 
> > In case of failure the recovery database [main] does not replace the
original
> > database [corrupt].
> 
> I don't understand.  It seems in on each step you're running an executing a
> command.  If one command fails but the other succeeds, won't you leave the the
> "main." database in a weird state?  I don't see function wrapped in a
> transaction.

At the outer level, the database [corrupt] is the real database we started with,
"Shortcuts" in this case.  This system accesses that database read-only.  The
database [main] is a temporary database (on POSIX I think it's an open fd to an
unlinked file).  If recovery runs through to completion, the final Recovered()
call will copy the blocks of the recovered database back the original database. 
In all other cases, it will be closed, and thus discarded.

Regular transactions aren't appropriate, since the corrupt database is presumed
untrustworthy.  Any regular write activity against tables or rows in that
database could propagate corruption (thus the read-only), and the recover
virtual table is used to scan things out in a safe fashion.  The final copy is
at the block level, which uses OS-level operations so corruption should no
longer matter.

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0)
On 2016/06/27 23:20:13, Mark P wrote:
> nit: spaces around binary operators

Done.

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173
sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR
or SQLITE_FULL).
On 2016/06/27 23:20:13, Mark P wrote:
> nit: recovered -> recoverable ?

I'm just going to remove the entire "why false" case, and just let the "why
true" case stand alone.

I intentionally avoided language like "Do not call RecoverDatabase() unless this
returns true", because it actually doesn't matter.  The code doesn't work better
or worse based on this error code, the caller just shouldn't dump all errors
into recovery because of the goal of strongly flagging newly-introduced errors
like SQLITE_ERROR or SQLITE_MISUSE.

https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
File tools/metrics/histograms/histograms.xml (right):

https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra...
tools/metrics/histograms/histograms.xml:54088: +   
RECOVERY_FAILED_AUTORECOVERDB_SCHEMA.
On 2016/06/27 23:20:13, Mark P wrote:
> On 2016/05/13 21:24:36, Scott Hess wrote:
> > On 2016/04/19 23:34:28, Mark P wrote:
> > > Please add a sentence to the end of the Sqlite.RecoveryEvents description,
> > > something to effect that those of these enum entries are subsets /
included
> in
> > > the counts of other enum entries.
> > 
> > Done.
> 
> Kind of done.  These are subsets, true.  But I _think_ there are other subsets
> here too.
> Please either
> - be complete and list all subsets.
> - be vague and simply state that there are certain types of errors that cause
an
> emission to multiple buckets.  (Read code for details.)
> The current phrasing implies this list is complete (which I'm skeptical of
given
> your other comments).

Reworded the description to emphasis that this is to inform the development
cycle, not indicate that anything actionable is going wrong.  And point at
recovery.cc.

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should
not be correlated with programmer error, so the
On 2016/06/27 23:20:13, Mark P wrote:
> I have trouble understanding this sentence.  I don't see why it shouldn't be
> "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be
> helpful.  Perhaps consider replacing the sentence with:
> The DLOG(FATAL) below should not normally be relevant to developers.
(Developers
> do not often test against corrupt-yet-recoverable databases.)  Displaying the
> error would probably lead to more confusion than help.

Probably nerdview.  Made an attempt at revision.

https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod...
sql/recovery.cc:689: // will fail with no changes to the database.
On 2016/06/27 23:20:13, Mark P wrote:
> consider adding ... so there's no harm in returning true here.

Done.

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/1832173002/220001

4 years, 5 months ago (2016-06-29 21:47:52 UTC) #44

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 5 months ago (2016-06-29 22:16:18 UTC) #45

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_android_rel_ng/builds/96216)

4 years, 5 months ago (2016-06-29 22:16:19 UTC) #46

Mark P

lgtm (with one trivial nit below) (at last!) Happy Friday! (and enjoy your time away) ...

4 years, 5 months ago (2016-07-01 22:40:55 UTC) #47

lgtm (with one trivial nit below)
(at last!)

Happy Friday! (and enjoy your time away)

--mark

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod...
sql/recovery.cc:555: while (s.Step()) {
On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote:
> On 2016/06/27 23:20:13, Mark P wrote:
> > On 2016/05/13 21:24:36, Scott Hess wrote:
> > > On 2016/04/19 23:34:28, Mark P wrote:
> > > > On 2016/04/15 00:38:14, Scott Hess wrote:
> > > > > On 2016/04/07 23:09:43, Mark P wrote:
> > > > > > Should all these executions be wrapped in a transaction?
> > > > > 
> > > > > Recovery happens to a temporary database, which is only committed to
the
> > > main
> > > > > database in Recovered().  So transactions would only be necessary
> insofar
> > as
> > > > the
> > > > > (local) caller needs to do controlled rollback of one approach in
order
> to
> > > try
> > > > > another approach.  I don't think that can really come up, here, mostly
> > > because
> > > > I
> > > > > can't think of a second simple approach (either the schema will work
for
> > > this
> > > > > kind of recovery, or it won't, and the client shouldn't be calling
this
> > code
> > > > if
> > > > > the schema is too complex or subtle).
> > > > 
> > > > You're not concerned with having a failed attempt leave additional
> commands
> > in
> > > > corrupt.sqlite_master?  Or does that not happen?
> > > 
> > > In case of failure the recovery database [main] does not replace the
> original
> > > database [corrupt].
> > 
> > I don't understand.  It seems in on each step you're running an executing a
> > command.  If one command fails but the other succeeds, won't you leave the
the
> > "main." database in a weird state?  I don't see function wrapped in a
> > transaction.
> 
> At the outer level, the database [corrupt] is the real database we started
with,
> "Shortcuts" in this case.  This system accesses that database read-only.  The
> database [main] is a temporary database (on POSIX I think it's an open fd to
an
> unlinked file).  If recovery runs through to completion, the final Recovered()
> call will copy the blocks of the recovered database back the original
database. 
> In all other cases, it will be closed, and thus discarded.
> 
> Regular transactions aren't appropriate, since the corrupt database is
presumed
> untrustworthy.  Any regular write activity against tables or rows in that
> database could propagate corruption (thus the read-only), and the recover
> virtual table is used to scan things out in a safe fashion.  The final copy is
> at the block level, which uses OS-level operations so corruption should no
> longer matter.

Thanks for the explanation.

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h
File sql/recovery.h (right):

https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173
sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR
or SQLITE_FULL).
On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote:
> On 2016/06/27 23:20:13, Mark P wrote:
> > nit: recovered -> recoverable ?
> 
> I'm just going to remove the entire "why false" case, and just let the "why
> true" case stand alone.

Good move.

> I intentionally avoided language like "Do not call RecoverDatabase() unless
this
> returns true", because it actually doesn't matter.  The code doesn't work
better
> or worse based on this error code, the caller just shouldn't dump all errors
> into recovery because of the goal of strongly flagging newly-introduced errors
> like SQLITE_ERROR or SQLITE_MISUSE.

Acknowledged.

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
File components/omnibox/browser/shortcuts_database.cc (right):

https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro...
components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should
not be correlated with programmer error, so the
On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote:
> On 2016/06/27 23:20:13, Mark P wrote:
> > I have trouble understanding this sentence.  I don't see why it shouldn't be
> > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be
> > helpful.  Perhaps consider replacing the sentence with:
> > The DLOG(FATAL) below should not normally be relevant to developers.
> (Developers
> > do not often test against corrupt-yet-recoverable databases.)  Displaying
the
> > error would probably lead to more confusion than help.
> 
> Probably nerdview.  Made an attempt at revision.

Your new comment is much better, thanks.

https://codereview.chromium.org/1832173002/diff/220001/sql/recovery.cc
File sql/recovery.cc (right):

https://codereview.chromium.org/1832173002/diff/220001/sql/recovery.cc#newcod...
sql/recovery.cc:687: // will fail with no changes to the database, so there's no
harm to
nit: "to attempt" or "in attempting", not "to attempting"

Scott Hess - ex-Googler

The CQ bit was unchecked by shess@chromium.org

4 years, 5 months ago (2016-07-01 23:22:36 UTC) #49

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org

4 years, 5 months ago (2016-07-01 23:22:36 UTC) #50

Scott Hess - ex-Googler

The CQ bit was checked by shess@chromium.org

4 years, 5 months ago (2016-07-01 23:22:37 UTC) #51

Scott Hess - ex-Googler

The patchset sent to the CQ was uploaded after l-g-t-m from mpearson@chromium.org Link to the ...

4 years, 5 months ago (2016-07-01 23:22:37 UTC) #52

Scott Hess - ex-Googler

The patchset sent to the CQ was uploaded after l-g-t-m from mpearson@chromium.org Link to the ...

4 years, 5 months ago (2016-07-01 23:22:37 UTC) #53

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/1832173002/240001

4 years, 5 months ago (2016-07-01 23:23:03 UTC) #54

commit-bot: I haz the power

Description was changed from ========== [sql] Database recovery system for Shortcuts. sql::Recovery provides a conservative ...

4 years, 5 months ago (2016-07-02 00:28:39 UTC) #57

commit-bot: I haz the power

Patchset 10 (id:??) landed as https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5 Cr-Commit-Position: refs/heads/master@{#403576}

4 years, 5 months ago (2016-07-02 00:28:40 UTC) #58

Scott Hess - ex-Googler

4 years, 5 months ago (2016-07-02 04:29:29 UTC) #59

Message was sent while issue was closed.

On 2016/07/02 00:28:40, commit-bot: I haz the power wrote:
> Patchset 10 (id:??) landed as
> https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5
> Cr-Commit-Position: refs/heads/master@{#403576}

To future build or stability sheriffs,

I'm going to be in-and-out of town for a bit, here.  I have a couple points
where I can make time to check-in on things, and I intend to mostly to check on
the progress of this CL.  If this is definitely breaking the thing, revert it! 
But if it just needs to be disabled, I put up a CL for that:
   https://codereview.chromium.org/2114823003/

Issue 1832173002: [sql] Database recovery system for Shortcuts. (Closed)

Description

Patch Set 1 #

Patch Set 2 : iOS SQLite doesn't support column names in view definition. #

Patch Set 3 : Rebase, scoped_ptr to unique_ptr in new code #

Patch Set 4 : Address review comments from #9 and #10. #

Patch Set 5 : Rebase #

Patch Set 6 : Address review comments through #14. I think. #

Patch Set 7 : rebase #

Patch Set 8 : std::count returns ptrdiff_t, various comment changes. #

Patch Set 9 : Trim comments, histogram comments, drop enum from histogram. #

Patch Set 10 : grammar #

Messages