|
|
Created:
4 years, 9 months ago by Scott Hess - ex-Googler Modified:
4 years, 5 months ago Reviewers:
Mark P CC:
chromium-reviews, asvitkine+watch_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
Description[sql] Database recovery system for Shortcuts.
sql::Recovery provides a conservative recovery system which requires
some overhead in the client code. Such code exists for the "Top Sites"
and Favicons databases, and has shown that the easy-to-handle cases
dominate. Implement a helper function to recover an entire database,
and apply it to the Shortcuts database.
BUG=597785
Committed: https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5
Cr-Commit-Position: refs/heads/master@{#403576}
Patch Set 1 #Patch Set 2 : iOS SQLite doesn't support column names in view definition. #
Total comments: 97
Patch Set 3 : Rebase, scoped_ptr to unique_ptr in new code #Patch Set 4 : Address review comments from #9 and #10. #
Total comments: 48
Patch Set 5 : Rebase #Patch Set 6 : Address review comments through #14. I think. #Patch Set 7 : rebase #
Total comments: 10
Patch Set 8 : std::count returns ptrdiff_t, various comment changes. #
Total comments: 5
Patch Set 9 : Trim comments, histogram comments, drop enum from histogram. #
Total comments: 1
Patch Set 10 : grammar #
Dependent Patchsets: Messages
Total messages: 59 (18 generated)
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1832173002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1832173002/20001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
shess@chromium.org changed reviewers: + mpearson@chromium.org
I realize this is an SQLite-heavy review, but, unfortunately, the collection of SQLite-centric people on Chromium is pretty slim (mostly me, actually). I don't think there's anything in here that's super subtle. components/history/core/browser/thumbnail_database.cc contains an existing example of sql::Recovery being used. vacuum.c references are to third_party/sqlite/src/src/vacuum.c , which is pretty short. Thanks.
On 2016/03/25 23:24:32, Scott Hess wrote: > I realize this is an SQLite-heavy review, but, unfortunately, the collection of > SQLite-centric people on Chromium is pretty slim (mostly me, actually). I don't > think there's anything in here that's super subtle. > > components/history/core/browser/thumbnail_database.cc contains an existing > example of sql::Recovery being used. vacuum.c references are to > third_party/sqlite/src/src/vacuum.c , which is pretty short. > > Thanks. I'm out sick and this looks complicated. I'll aim to get to this near the end of the week; I hope that's okay. I'm curious: do you have any idea of the effect this'll have? --mark
On 2016/03/28 19:27:59, Mark P wrote: > On 2016/03/25 23:24:32, Scott Hess wrote: > > I realize this is an SQLite-heavy review, but, unfortunately, the collection > of > > SQLite-centric people on Chromium is pretty slim (mostly me, actually). I > don't > > think there's anything in here that's super subtle. > > > > components/history/core/browser/thumbnail_database.cc contains an existing > > example of sql::Recovery being used. vacuum.c references are to > > third_party/sqlite/src/src/vacuum.c , which is pretty short. > > > > Thanks. > > I'm out sick and this looks complicated. I'll aim to get to this near the end > of the week; I hope that's okay. No problem - it's not time-sensitive. Though I think I'll be going OOT for a week on Friday :-). I can start working on queueing up whichever database I want to hit next while waiting. > I'm curious: do you have any idea of the effect this'll have? For the databases in question, right now it's highly likely that those that are corrupt are stuck in that state. Corruption is weird, it's detected at the point where SQLite notices something wrong, so it's possible that one query works but a similar query fails based on what data is accessed. It is _possible_ to have a corruption which gradually is updated away, but IMHO pretty unlikely, my experience with people suffering from corruption is that it just persists until action is taken (for instance delete the profile). The way most Chromium code is written, you end up with a kind of half-working state which gets further and further out of line over time (in this case old shortcuts might continue to work, but they can't be updated or deleted and new ones can't be added). With this change, any data which was readable before _should_ end up in the resulting database, the primary key should prevent any duplication, and then going forward the database should work correctly. In terms of efficacy, my experience is that corruptions are mostly uncorrelated, they're from things like losing power suddenly or BSOD, so for most users fixing things once is all they'll need. Of course, that's for legitimate cases, if the user has flaky hardware or malware messing things up, this isn't going to help them in the end. It is likely that there will still be databases which end up in an unusable state. I have some code to upload diagnostics via the crash server. I might circle back with that for this database, in the past I've found that error cases often had a few actionable buckets (for instance an incorrectly-coded schema upgrade) before getting to the irreducible long tail. That will depend on how many cases are left. My long-term goal is to have things be more self-healing, because we can't really rely on a DBA to manually fix things, and relying on users to notice and delete their databases is only going to address a few percent at most. Right now the two databases which already have similar code in place are recovering ~125k databases per month. It's been in place long enough that this is the steady state. Hard to estimate results for this database, I mainly picked it based on Sqlite.Error.Shortcuts having enough hits that changes to SQLITE_CORRUPT will be noticeable, even on dev channel.
https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:68: // contradicts itself, but SQLite is able to read it. Do you want this explanation here rather than than by the ShouldRecoverOrRaze() function? Having it in two places seems redundant and likely to get out of sync. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:71: // itself. ditto about this TODO https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:82: // return errors until the handle is re-opened. This comment duplicates much of the comment by the function definition. Is any of it necessary? https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:87: if (!sql::Connection::ShouldIgnoreSqliteError(extended_error)) Do you still want to run this code / possibly display the message if the above code decides to recover the database? If yes, consider whether it would be better for developers to display this message first or the (possibly failed) recovery message first. (I'd think the original error first would be better.) If no, then you can struct this function differently. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:175: // addressed. Someone needs to handle invalid schema. This shouldn't be a comment. File a bug against pkasting@ to either do something with the return value (and comment in the .h file the meaning of the return value) or change the return value to void. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87 sql/recovery.cc:87: // Automatically recovered entire database successfully. This comment sounds like it could equally apply to the earlier enum RECOVERY_SUCCESS_AUTORECOVER. Please figure out a way to make their differences clearer. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode540 sql/recovery.cc:540: // |prefix|, and apply it to [main]. Skip 'sqlite_sequence', which will be This comment is too broad. Below you explicitly assume the prefix is the right amount of text so if you add a "main." right after it, it makes sense as a sql statement. SELECT foo from BAR; -> SELmain.ECT foo from BAR; Ugh, you don't mean this. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode541 sql/recovery.cc:541: // created on demand by SQLite if any tables use AUTOINCREMENT. This comment should mention the meaning of the return value. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode546 sql/recovery.cc:546: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'")); What happens if prefix_length > length(sql)? What does substr return in this case? The sqlite3 documentation I found doesn't define this. https://www.sqlite.org/lang_corefunc.html Perhaps you want to add another another condition ensuring that sql is long enough? (Though I can't imagine this causing any problems unless substr() actually fails on this input, as the equality test still applies) https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode552 sql/recovery.cc:552: // to be worth the complexity, though. Can you expand this slightly to be worth complexity to skip statements with IF NOT EXISTS? to be worth complexity to identify IF NOT EXISTS statement and then ...? Also, if you don't plan to do it, please don't label it as a TODO. :-P https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode553 sql/recovery.cc:553: This blank line here feels unnecessary (but then I could be misunderstanding the scope of the TODO comment above.) https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode554 sql/recovery.cc:554: while (s.Step()) { Should all these executions be wrapped in a transaction? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode684 sql/recovery.cc:684: switch (extended_error & 0xFF) { You do the same & 0xFF in connection.cc as well. Consider making this bitmask a constant somewhere. Or at least comment it better (in both places), maybe something as simple as // Mask down to the primary result code. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode19 sql/recovery.h:19: // Recovery module for sql/. The basic idea is to create a fresh As a new function you introduced does almost exactly what the example below does, perhaps you want to update this text and example if that's what you expect the common usage to be? Also, if that's the case, you should put the function declaration higher in this class. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode159 sql/recovery.h:159: // database can be razed or deleted. The goal is for this function to make Uh, did you mean "will be", not "can be"? And will the result be razed (i.e., deleted) or an empty database? If you didn't mean that, then how will the caller know whether the recovery failed and the database is okay to be razed? This function has no return value. If you're intentionally vague because you plan to change this from leave-corrupt-database-around to something else, just give it the right name for now. Later when you make the intended change, you can rename the function. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode166 sql/recovery.h:166: // Please provide an transition sentence here. It appears everything else in the comment explains how this function works, i.e., which error code you consider recoverable and why. Another sentence would make this comment more readable. Also, consider whether these comments better belong in the implementation, not the .h. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode169 sql/recovery.h:169: // is probably to delete the file and start over. probably? When this is not the best fix? (consider omitting the word probably) https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode175 sql/recovery.h:175: // fix is probably to delete the file and start over. This last sentence is confusing. Did you mean that there are other times this error code can be returned and those are probably not recoverable? If so, then this comment fails to explain what this function will return for this error code. Perhaps you want to drop the last sentence if appropriate? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode181 sql/recovery.h:181: // which is no longer in the table. Again, this explanation doesn't state what this function will return for this error code. In fact, it implies that the function will say to raze the database, when that's the opposite of what the function does. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode188 sql/recovery.h:188: static bool ShouldRecoverOrRaze(int extended_error); nit: given your explanation of what this does, I think it should be named something like IsRecoverable or IsLikelyRecoverable or IsRecoverableGivenError or ShouldTryRecovering or something. https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81886: + In autorecover database was too broken to begin. optional nit (has better emphasis in my opinion): Database was too broken to attempt recovery. https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81901: + Autorecover failed recovering a table. For something like this, which is probably a more common type of recovery failure, do you want to record the failed type of error code in another histogram? (or extra values within this histogram I suppose)
Phew. There are all your comments. It's unclear whether I found any bugs the process... My questions point out possible issues I see from my unexperienced viewpoint--whether they're real issues you'll hopefully be able to figure out. --mark https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath db_path(temp_dir.path().AppendASCII("TestShortcuts1.db")); This probably should be "3", not "1". (Otherwise why would other tests use unique numbers? There must be a reason.) https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:318: // The database is broken. How about putting a blank line after 315, then adding // Break the database. then corrupt the header, then a blank line, then a comment such as // Verify the database is broken. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:325: ASSERT_FALSE(connection.Execute("SELECT count(*) FROM omni_box_shortcuts")); Why does this return failure if we're ignoring the error? Comment. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:327: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); Side comment: the comment by the declaration of CheckIgnoredErrors() is awful. It doesn't say what the return type means. Does it mean we encountered an error of the type that we ignored? That's my best guess. But it could mean that we encountered an error that didn't ignore, i.e., check status, taking into account ignored errors... https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:330: // sql::Connection::Open() will hit the corruption, the error callback will sql::Connection::Open() will hit -> The sql::Connection::Open() call in ShortcutsDatabase::Init() will hit https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:340: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); I think it's odd that we still see this error. Shouldn't it be cleared after the first failure and then not have any errors for the second attempt at opening? https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:348: sql::Statement statement(connection.GetUniqueStatement( Please use an identical call to the one you used to verify the database was broken, don't switch from execute to getuniquestatement. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:351: EXPECT_EQ(5, statement.ColumnInt(0)); Consider using something like EXPECT_EQ(arraysize(shortcut_test_db), CountRecords()) a la the other tests in this file. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode96 sql/recovery.cc:96: // Distinguish failure in querying schema from failure in creating schema. Please clearly state these are subcategories of RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode592 sql/recovery.cc:592: // in the channel (175k/month). Lastly, there were 3 failures in a month Please don't put non-normalized numbers in comments that will become out of date over time. e.g., say something like X failures per million database opens in M-49. Also, if you're referencing an issues that fixed, please mention when, e.g., fixed for M-50. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode594 sql/recovery.cc:594: // practice 3 in 1M might be a reasonable false-positive rate. I think this last sentence isn't worth mentioning anyway. Things that occur three times in a month aren't worth thinking about, let alone documenting in a place that will force thinking about it to cross other people's mind. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode596 sql/recovery.cc:596: db->Poison(); In the header comment, you write the function will poison the database so it "will return errors until the handle is re-opened." This implies the database is closed. Yet, you don't close it here. The comment by the definition for Poison says // This is intended as an alternative to Close() in error callbacks. // Close() should still be called at some point. so shouldn't you close it...? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode600 sql/recovery.cc:600: #if DCHECK_IS_ON() I don't understand this block of code. That said, I am willing to approve the changelist without any changes to this block; I understand that it's okay that I don't have the background for it. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode625 sql/recovery.cc:625: Recovery::Rollback(std::move(recovery)); Header talks about poisoning; why are you using a rollback here? Should you poison after this rollback? ditto multiple places below https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode635 sql/recovery.cc:635: "AND name!='sqlite_sequence'")); AND != corrupt.sqlite_sequence ? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode666 sql/recovery.cc:666: // sql LIKE "CREATE VIRTUAL TABLE %" Do you have a sense of whether this will be different in some cases? Or are you nervous, which is why you left it like vacuum.c? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.cc File sql/recovery_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:751: // This table needs sqlite_sequence to work. ... because of the autoincrement. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:753: "CREATE TABLE a (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")); Everywhere else in this file uses tables x and y. Any reason not to use those names? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:781: I think this blank line is unnecessary. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:784: ditto https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:786: EXPECT_TRUE(db().IsSQLValid(kTrivialSql)); Comment such as // Verify the database handle is valid. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:795: ASSERT_TRUE(Reopen()); This block of tests makes me nervous because you're comparing the new data with the old data, yet you have no check that the old data is plausible. Suppose there was an uncaught error earlier that prevented any data from being inserted into the database. This test would still pass. I suggest either testing the results of the original data calls or removing that original data test and simply compare the new results with the actual hard-coded values you'd expect. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:801: // Test that the view works. I don't understand this block. I don't see how this checks that the view works. There is no expect or assert statements in here, and ExecuteWithResults() doesn't have any expected or assert inside it that would trigger the view was broken. If nothing else, this block is more correctly combined with the trigger block below, with no comment making an explicit claim about checking the view. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:812: ExecuteWithResults(&db(), kAWithoutSql, "|", "\n")); Shouldn't this be kASql, i.e., you're checking that after the above deletion, a _normal_ select statement will return the same result as a select statement that excludes 'truck' explicitly? ditto with kISql below https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81892: + Autorecover could not create new db schema. Consider mentioning this is a subset of RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. ditto below. That said, I see you have weird subset relationships elsewhere in this enum. I wonder if mentioning this in some places but not others (unless you want to make this list complete) somewhat confusing. Maybe you should simply add something to the histogram description.
Rebase, scoped_ptr to unique_ptr in new code
Address review comments from #9 and #10.
On 2016/04/08 20:12:07, Mark P wrote: > Phew. > > There are all your comments. > > It's unclear whether I found any bugs the process... My questions point out > possible issues I see from my unexperienced viewpoint--whether they're real > issues you'll hopefully be able to figure out. THANK YOU FOR THE THOROUGH REVIEW! It may backfire on you, as in the past getting anyone to review (ie, help improve) this stuff has been like pulling teeth. Hopefully my answers aren't completely facile and evasive. Unfortunately, to some extent they are, because there are bits of this that I'm not completely certain of how things work in the fleet, and other bits where I'm trying to evade shaving yaks elsewhere. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:68: // contradicts itself, but SQLite is able to read it. On 2016/04/07 23:09:43, Mark P wrote: > Do you want this explanation here rather than than by the ShouldRecoverOrRaze() > function? Having it in two places seems redundant and likely to get out of > sync. Acknowledged. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:71: // itself. On 2016/04/07 23:09:43, Mark P wrote: > ditto about this TODO Leftover from when the code checked directly against SQLITE_CORRUPT and other errors. That required adding a dep on third_party/sqlite which seemed inappropriate. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:82: // return errors until the handle is re-opened. On 2016/04/07 23:09:43, Mark P wrote: > This comment duplicates much of the comment by the function definition. Is any > of it necessary? OK. Leaving the last line, because IMHO it's pertinent enough to the local stack and likely unexpected. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:87: if (!sql::Connection::ShouldIgnoreSqliteError(extended_error)) On 2016/04/07 23:09:43, Mark P wrote: > Do you still want to run this code / possibly display the message if the above > code decides to recover the database? > > If yes, consider whether it would be better for developers to display this > message first or the (possibly failed) recovery message first. (I'd think the > original error first would be better.) > > If no, then you can struct this function differently. I'll have to think about this. Corruption generally stems from things like unexpected power failures, so a developer is unlikely to see that. This DLOG is expected to fire for cases like SQLITE_ERROR, which would indicate that a statement had a syntax error, or an inappropriate schema change happened, something like that. So is there an interesting case at the intersection where a developer caused a corruption? Hard to say. If a developer was unlucky enough to have a legitimate corruption, I would expect this to FATAL once, then work going forward. Not a great outcome, but most developers would just ignore it or log a bug against me. That's find, then I can discuss it with them and determine if there's a real problem or a red herring. Usually it's been a red herring, but it has caught real bugs. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:175: // addressed. Someone needs to handle invalid schema. On 2016/04/07 23:09:43, Mark P wrote: > This shouldn't be a comment. File a bug against pkasting@ to either do > something with the return value (and comment in the .h file the meaning of the > return value) or change the return value to void. OK. Adding http://crbug.com/603304 about this. I'm fine with adding some instrumentation and working to resolve this, as I'm also looking for ways to generalize the diagnostics for other databases. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath db_path(temp_dir.path().AppendASCII("TestShortcuts1.db")); On 2016/04/08 20:12:06, Mark P wrote: > This probably should be "3", not "1". (Otherwise why would other tests use > unique numbers? There must be a reason.) Copy/pasta. Since it's in a unique temp dir, I'll just strip the 1. To the best of my knowledge none of these need to be unique names. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:318: // The database is broken. On 2016/04/08 20:12:06, Mark P wrote: > How about putting a blank line after 315, then adding > // Break the database. > then corrupt the header, > then a blank line, then a comment such as > // Verify the database is broken. Done. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:325: ASSERT_FALSE(connection.Execute("SELECT count(*) FROM omni_box_shortcuts")); On 2016/04/08 20:12:06, Mark P wrote: > Why does this return failure if we're ignoring the error? Comment. Production code in debug mode fails with a FATAL, because there should not be cases where that is acceptable. Execute() is WARN_UNUSED_RESULT to force developers to pay attention to the result, because in actual production there will be failures that are not due to how the code is written. The ignorer tells sql::Connection that we're testing the code to handle an error case, so don't do the FATAL thing. [To be clear, a lot of the failure code paths for sql returns are entirely untested in Chromium. A lot of the code is written to assume nothing can fail, which is true in development but false when deployed. I've tried to force the issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for failure paths, in part because it's challenging to inject failures. So I agree, much of this is dubious.] https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:327: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/08 20:12:06, Mark P wrote: > Side comment: the comment by the declaration of CheckIgnoredErrors() is awful. > It doesn't say what the return type means. Does it mean we encountered an error > of the type that we ignored? That's my best guess. But it could mean that we > encountered an error that didn't ignore, i.e., check status, taking into account > ignored errors... I don't really follow. It says it's used to check if the errors were encountered, "True" seems like the most likely response to that question. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:330: // sql::Connection::Open() will hit the corruption, the error callback will On 2016/04/08 20:12:06, Mark P wrote: > sql::Connection::Open() will hit > -> > The sql::Connection::Open() call in ShortcutsDatabase::Init() will hit Done. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:340: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/08 20:12:06, Mark P wrote: > I think it's odd that we still see this error. Shouldn't it be cleared after > the first failure and then not have any errors for the second attempt at > opening? The error will be thrown in the first Open(), the ignorer allows it to fail gracefully, then the retry will succeed. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:348: sql::Statement statement(connection.GetUniqueStatement( On 2016/04/08 20:12:06, Mark P wrote: > Please use an identical call to the one you used to verify the database was > broken, don't switch from execute to getuniquestatement. Done. https://codereview.chromium.org/1832173002/diff/20001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:351: EXPECT_EQ(5, statement.ColumnInt(0)); On 2016/04/08 20:12:06, Mark P wrote: > Consider using something like > EXPECT_EQ(arraysize(shortcut_test_db), CountRecords()) > a la the other tests in this file. CountRecords() runs against the sql::Connection contained in the ShortcutsDatabase created by SetUp(). That database cannot easily be used at this point. I could do this: db_ = new ShortcutsDatabase(db_path); but note that this isn't necessarily describing the same database as SetUp(). I could shift the CountRecords() code out to a helper running against a sql::Connection, then have CountRecords() call that. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87 sql/recovery.cc:87: // Automatically recovered entire database successfully. On 2016/04/07 23:09:43, Mark P wrote: > This comment sounds like it could equally apply to the earlier enum > RECOVERY_SUCCESS_AUTORECOVER. Please figure out a way to make their differences > clearer. The earlier applies to a specific table, this applies to the entire database. So would you prefer to rename the earlier ones to reflect their specific scope? https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode96 sql/recovery.cc:96: // Distinguish failure in querying schema from failure in creating schema. On 2016/04/08 20:12:06, Mark P wrote: > Please clearly state these are subcategories of > RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode540 sql/recovery.cc:540: // |prefix|, and apply it to [main]. Skip 'sqlite_sequence', which will be On 2016/04/07 23:09:43, Mark P wrote: > This comment is too broad. Below you explicitly assume the prefix is the right > amount of text so if you add a "main." right after it, it makes sense as a sql > statement. > SELECT foo from BAR; > -> > SELmain.ECT foo from BAR; > Ugh, you don't mean this. I agree that the example case isn't what I want, but I'm not sure what to do about it. The prefix has to be some sort of valid SQL prefix thingie, but I'm not sure what to call it, and I'm not going to be writing a parser to validate it. I guess I can assert that it ends in a space, but even that doesn't mean much, since I'm using string manipulation to dynamically construct statements. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode541 sql/recovery.cc:541: // created on demand by SQLite if any tables use AUTOINCREMENT. On 2016/04/07 23:09:43, Mark P wrote: > This comment should mention the meaning of the return value. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode546 sql/recovery.cc:546: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'")); On 2016/04/07 23:09:43, Mark P wrote: > What happens if prefix_length > length(sql)? What does substr return in this > case? > The sqlite3 documentation I found doesn't define this. > https://www.sqlite.org/lang_corefunc.html > Perhaps you want to add another another condition ensuring that sql is long > enough? > (Though I can't imagine this causing any problems unless substr() actually fails > on this input, as the equality test still applies) I think I'll just rewrite this as a full table scan and do the filtering in C++. It won't change the efficiency at all, and I expect it will be approximately the same weight in code. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode552 sql/recovery.cc:552: // to be worth the complexity, though. On 2016/04/07 23:09:43, Mark P wrote: > Can you expand this slightly > to be worth complexity to skip statements with IF NOT EXISTS? > to be worth complexity to identify IF NOT EXISTS statement and then ...? > > Also, if you don't plan to do it, please don't label it as a TODO. :-P I think I'll remove the comment entirely. A common corruption is having a transaction broken when a page is being rebalanced, which can lead to seeing the same data in multiple places. This could only happen to the sqlite_master btree if a transaction was broken while making a schema change, which is already unlikely, but it could only be successfully coded against if the result was a valid schema, which is probably even more unlikely. OK, and it just occurred to me that I could add a DISTINCT to the selection statement and ignore the problem entirely. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode553 sql/recovery.cc:553: On 2016/04/07 23:09:43, Mark P wrote: > This blank line here feels unnecessary (but then I could be misunderstanding the > scope of the TODO comment above.) Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode554 sql/recovery.cc:554: while (s.Step()) { On 2016/04/07 23:09:43, Mark P wrote: > Should all these executions be wrapped in a transaction? Recovery happens to a temporary database, which is only committed to the main database in Recovered(). So transactions would only be necessary insofar as the (local) caller needs to do controlled rollback of one approach in order to try another approach. I don't think that can really come up, here, mostly because I can't think of a second simple approach (either the schema will work for this kind of recovery, or it won't, and the client shouldn't be calling this code if the schema is too complex or subtle). https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode592 sql/recovery.cc:592: // in the channel (175k/month). Lastly, there were 3 failures in a month On 2016/04/08 20:12:06, Mark P wrote: > Please don't put non-normalized numbers in comments that will become out of date > over time. > e.g., say something like X failures per million database opens in M-49. > > Also, if you're referencing an issues that fixed, please mention when, e.g., > fixed for M-50. Maybe I'll keep the comment in my local client, since it's mostly just to inform my thinking about whether this is safe to flip to Raze() or Delete() or something. The specific numbers aren't interesting, I was just thinking "BIG", "Less big", and "trivial or maybe noise". https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode594 sql/recovery.cc:594: // practice 3 in 1M might be a reasonable false-positive rate. On 2016/04/08 20:12:06, Mark P wrote: > I think this last sentence isn't worth mentioning anyway. Things that occur > three times in a month aren't worth thinking about, let alone documenting in a > place that will force thinking about it to cross other people's mind. Acknowledged. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode596 sql/recovery.cc:596: db->Poison(); On 2016/04/08 20:12:06, Mark P wrote: > In the header comment, you write the function will poison the database so it > "will return errors until the handle is re-opened." This implies the database > is closed. Yet, you don't close it here. The comment by the definition for > Poison says > // This is intended as an alternative to Close() in error callbacks. > // Close() should still be called at some point. > so shouldn't you close it...? The originating caller's Open()/Close() calls should balance out. This call can happen deep in a stack of calls in a case handling an exceptional error, making it impractical for the client to track the Close() status. So Poison() poisons the sql::Connection handle, but logically the handle is still open. For instance, warnings are thrown if you try to run Execute() on a handle which was never opened, or which has been closed. But if the handle was poisoned, then the Execute() will fail without warnings. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode600 sql/recovery.cc:600: #if DCHECK_IS_ON() On 2016/04/08 20:12:06, Mark P wrote: > I don't understand this block of code. That said, I am willing to approve the > changelist without any changes to this block; I understand that it's okay that I > don't have the background for it. fts3 is a virtual table which we used to use in history, and remains compiled into Chrome for WebSQL. I don't think anyone in the browser process is using it. AutorecoverTable() fails on fts3 tables by successfully dropping all of the data, so I want to prevent anyone from calling this code on an fts3 virtual table. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode625 sql/recovery.cc:625: Recovery::Rollback(std::move(recovery)); On 2016/04/08 20:12:06, Mark P wrote: > Header talks about poisoning; why are you using a rollback here? Should you > poison after this rollback? > ditto multiple places below See comment I made about the RecoverDatabaseOrRaze() header-file comment. I intend all of these to because Unrecoverable() (leading to clearing the database data). https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode635 sql/recovery.cc:635: "AND name!='sqlite_sequence'")); On 2016/04/08 20:12:06, Mark P wrote: > AND != corrupt.sqlite_sequence ? sqlite_master is implicitly main.sqlite_master . SQLite doesn't interpret/interpolate the columns, so selecting from corrupt.sqlite_master would still have 'sqlite_sequence' as the name and the sql would not have db.table naming. [I think? OMG, my precious brain! Yes, AFAICT from testing it.] https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode666 sql/recovery.cc:666: // sql LIKE "CREATE VIRTUAL TABLE %" On 2016/04/08 20:12:06, Mark P wrote: > Do you have a sense of whether this will be different in some cases? Or are you > nervous, which is why you left it like vacuum.c? I do not fully understand why vacuum.c uses the rootpage like this. SQLite creates tables with a root page even if empty, and I believe it could also validly use a null root page. Most likely the assumption is that you created a table, so you'll populate it, though I've certainly seen that assumption broken w/in Chromium. Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it can't accidentally match 'CREATE TABLE' or any other sort of faux table. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode684 sql/recovery.cc:684: switch (extended_error & 0xFF) { On 2016/04/07 23:09:43, Mark P wrote: > You do the same & 0xFF in connection.cc as well. Consider making this bitmask a > constant somewhere. Or at least comment it better (in both places), maybe > something as simple as > // Mask down to the primary result code. Copied language from connection.cc. Unfortunately SQLite doesn't provide a really distinct name for the basic error codes, so I'm phrasing in terms of removing extended error codes. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode19 sql/recovery.h:19: // Recovery module for sql/. The basic idea is to create a fresh On 2016/04/07 23:09:44, Mark P wrote: > As a new function you introduced does almost exactly what the example below > does, perhaps you want to update this text and example if that's what you expect > the common usage to be? > > Also, if that's the case, you should put the function declaration higher in this > class. I rephrased it to put RecoverDatabaseOrRaze() first, then the skeleton code for cases that doesn't cover. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode159 sql/recovery.h:159: // database can be razed or deleted. The goal is for this function to make On 2016/04/07 23:09:44, Mark P wrote: > Uh, did you mean "will be", not "can be"? And will the result be razed (i.e., > deleted) or an empty database? > If you didn't mean that, then how will the caller know whether the recovery > failed and the database is okay to be razed? This function has no return value. > If you're intentionally vague because you plan to change this from > leave-corrupt-database-around to something else, just give it the right name for > now. Later when you make the intended change, you can rename the function. I'm not sure how to answer! Right now, it only applies the safe cases. It either thinks it recovered the database, or it rolls things back. But I've seen results from the fleet which make me believe that running recovery to success didn't actually work (possibly because of shenanigans in the system, like malware or profile's being overwritten or something). The intention is to monitor the histograms to see if any of the failure outcomes is due to a broken assumption rather than broken user data. If the failures cases look weirdly high, I'll instrument to debug the fleet, but if they look sensible, I'll switch them over from Rollback() to Unrecoverable() (leading to Raze()). sql::Connection::Raze() means that it overwrites the database with a single-page empty database. The delete language is because if the Raze() doesn't work, then what? Probably attempt to Delete(). I'm nervous about landing either of those up front, though, on the off chance that I write a bug which works right in my tests but in the field blows away a bunch of data. Anyhow, my intention with this function is that it consumes the database handle in all cases, while making a best effort to recover the data, and failing that a best effort to clear the database of data, and failing that a best effort to delete the database. On next run the database should either be a valid data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should be a valid empty SQLite database, or it should be missing (SQLite would create the database empty). Raze() versus Delete() is because Raze() works using SQLite locking and I/O. It can clear the database in cases where Delete() won't work. Likewise, Delete() may work in cases where Raze() does not. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode166 sql/recovery.h:166: // On 2016/04/07 23:09:44, Mark P wrote: > Please provide an transition sentence here. It appears everything else in the > comment explains how this function works, i.e., which error code you consider > recoverable and why. Another sentence would make this comment more readable. > > Also, consider whether these comments better belong in the implementation, not > the .h. Moved cases into the implementation file. I'm not sure how to make this fully hermetic WRT SQLite. Some SQLite errors stem from obviously broken database files, some stem from obviously broken client code, and some are ambiguous. My goal is to get rid of the ambiguity over time, but I'd rather approach it piecemeal, fixing the known-possible cases first. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode169 sql/recovery.h:169: // is probably to delete the file and start over. On 2016/04/07 23:09:44, Mark P wrote: > probably? > When this is not the best fix? > (consider omitting the word probably) I guess this is expressing my certainty level. SQLite is very layered, so sometimes the error you get is unexpected - when you get that error and decode why, it makes sense how it happened, but reading the code it's challenging to verify which specific errors are possible at which specific API calls. SQLITE_CANTOPEN is returned for the example cases given, but I don't know how far the unknown unknowns run. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode175 sql/recovery.h:175: // fix is probably to delete the file and start over. On 2016/04/07 23:09:44, Mark P wrote: > This last sentence is confusing. Did you mean that there are other times this > error code can be returned and those are probably not recoverable? If so, then > this comment fails to explain what this function will return for this error > code. > Perhaps you want to drop the last sentence if appropriate? When the recovery code was first written, there was a case where SQLITE_NOTADB was recoverable. Now that case returns SQLITE_CORRUPT (which is a more sensible return value). So it made sense to run the recovery code, then delete it if that failed. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode181 sql/recovery.h:181: // which is no longer in the table. On 2016/04/07 23:09:44, Mark P wrote: > Again, this explanation doesn't state what this function will return for this > error code. In fact, it implies that the function will say to raze the > database, when that's the opposite of what the function does. Revised in implementation. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.h#newcode188 sql/recovery.h:188: static bool ShouldRecoverOrRaze(int extended_error); On 2016/04/07 23:09:44, Mark P wrote: > nit: given your explanation of what this does, I think it should be named > something like IsRecoverable or IsLikelyRecoverable or IsRecoverableGivenError > or ShouldTryRecovering or something. Hmm. None of the first three work with me - there's no way to guarantee that it is recoverable short of trying, and there isn't really any way to productively quantify the level of success recovery had. ShouldTry makes it sound like you should try recovery, and if that doesn't work maybe something else, but I'm not sure there is anything else to try. The reason for the separation is to let the calling code be structured like: if (ShouldTry(error)) { // Cleanup local state before the destructive recover call. Recover(db, db_path); } I could instead pass the error into the operative function, like: if (AutomatedErrorHandler(db, error, db_path)) { // Cleanup local state because the database is now poisoned. } I think phrasing that name would also be challenging, and I don't like having the return value signal whether serious side effects have happened or not. PROBABLY in all cases getting into the error handler should lead to a poisoned handle. In the best cases, it was a benign syntax error which inadvertently got shipped to production and doesn't harm anyone, but in the worst cases you're messing up your data and then messing up the mess. So it should be like an OOM CHECK(), where once you notice something awry, you minimize the damage. Unfortunately, there's code in Chromium which implicitly depends on working degraded because things were too relaxed when that code landed (which is why I'm even here, my long-term goal is to have no errors in the static sense, only in the dynamic sense, and then have things to automatically handle the dynamic errors). ["static sense" meaning incorrectly-coded clients, "dynamic sense" meaning problems due to the operating environment.] https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.cc File sql/recovery_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:751: // This table needs sqlite_sequence to work. On 2016/04/08 20:12:06, Mark P wrote: > ... because of the autoincrement. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:753: "CREATE TABLE a (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")); On 2016/04/08 20:12:06, Mark P wrote: > Everywhere else in this file uses tables x and y. Any reason not to use those > names? Eh, it was just [a]utoincrement and [i]ndex and [v]alue. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:781: On 2016/04/08 20:12:07, Mark P wrote: > I think this blank line is unnecessary. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:784: On 2016/04/08 20:12:06, Mark P wrote: > ditto Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:786: EXPECT_TRUE(db().IsSQLValid(kTrivialSql)); On 2016/04/08 20:12:06, Mark P wrote: > Comment such as > // Verify the database handle is valid. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:786: EXPECT_TRUE(db().IsSQLValid(kTrivialSql)); On 2016/04/08 20:12:06, Mark P wrote: > Comment such as > // Verify the database handle is valid. Done. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:795: ASSERT_TRUE(Reopen()); On 2016/04/08 20:12:07, Mark P wrote: > This block of tests makes me nervous because you're comparing the new data with > the old data, yet you have no check that the old data is plausible. Suppose > there was an uncaught error earlier that prevented any data from being inserted > into the database. This test would still pass. I suggest either testing the > results of the original data calls or removing that original data test and > simply compare the new results with the actual hard-coded values you'd expect. Done. In testing the schema, it seemed like a fair bit of annoying text to format, and a little brittle (the actual format of the schema seems like a SQLite internal kind of thing, to me). So I tested that the expected number of schema items was present. That's still a bit specific, since sqlite_sequence seems like an internal thing clients have no reason to see, but explicitly omitting it is also unnatural. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:801: // Test that the view works. On 2016/04/08 20:12:07, Mark P wrote: > I don't understand this block. I don't see how this checks that the view works. > There is no expect or assert statements in here, and ExecuteWithResults() > doesn't have any expected or assert inside it that would trigger the view was > broken. > > If nothing else, this block is more correctly combined with the trigger block > below, with no comment making an explicit claim about checking the view. It's possible that the block is dumb. It's gone now. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery_unittest.c... sql/recovery_unittest.cc:812: ExecuteWithResults(&db(), kAWithoutSql, "|", "\n")); On 2016/04/08 20:12:07, Mark P wrote: > Shouldn't this be kASql, i.e., you're checking that after the above deletion, a > _normal_ select statement will return the same result as a select statement that > excludes 'truck' explicitly? > ditto with kISql below Sorry, the above capture from the view _is_ the same result excluding 'truck'. The trigger has the side effect of changing the view's results. I'll just go through and make the expectations static. https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81886: + In autorecover database was too broken to begin. On 2016/04/07 23:09:44, Mark P wrote: > optional nit (has better emphasis in my opinion): > Database was too broken to attempt recovery. Done. https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81892: + Autorecover could not create new db schema. On 2016/04/08 20:12:07, Mark P wrote: > Consider mentioning this is a subset of RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. > ditto below. > That said, I see you have weird subset relationships elsewhere in this enum. I > wonder if mentioning this in some places but not others (unless you want to make > this list complete) somewhat confusing. > Maybe you should simply add something to the histogram description. It's intended to be more like code coverage, where I'm tracking which decision points were taken. Being able to add things up is only weakly useful, mostly everything other than SUCCESS should normalize against SUCCESS. The only reason I'd care if one failure case was greater than another is in prioritizing which to look at first. https://codereview.chromium.org/1832173002/diff/20001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:81901: + Autorecover failed recovering a table. On 2016/04/07 23:09:44, Mark P wrote: > For something like this, which is probably a more common type of recovery > failure, do you want to record the failed type of error code in another > histogram? (or extra values within this histogram I suppose) My approach was since I already have this histogram to track flow of control, I'll hit all of those points since it's "free". Then if any failure cases seem bothersome, I'll circle back and add something to debug those. Mostly I'm just being lazy, and somewhat trying to avoid analysis paralysis. Most of the failure points in AutorecoverTable() shouldn't be possible, because it's operating against a temporary database, and, indeed, the histograms show low occurance in there. I guess SQLITE_IOERR or SQLITE_FULL could happen, and in either case it's kind of a give-up situation. RECOVERY_FAILED_ATTACH fails enough that logging that error code may make sense before flipping that case to raze.
Here's another full pass through the code. Apologies that I didn't bother replying to your comments that require no further changes. (Thank you for replying or explaining those things; I just don't want to clutter up this review with too many unimportant comments.) --mark https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode87 sql/recovery.cc:87: // Automatically recovered entire database successfully. On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/07 23:09:43, Mark P wrote: > > This comment sounds like it could equally apply to the earlier enum > > RECOVERY_SUCCESS_AUTORECOVER. Please figure out a way to make their > differences > > clearer. > > The earlier applies to a specific table, this applies to the entire database. > So would you prefer to rename the earlier ones to reflect their specific scope? No, I thought you were being sloppy about database versus table here (as the autorecovery code is currently only used for a single table). I was wrong. It's fine as is. https://codereview.chromium.org/1832173002/diff/20001/sql/recovery.cc#newcode635 sql/recovery.cc:635: "AND name!='sqlite_sequence'")); On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > AND != corrupt.sqlite_sequence ? > > sqlite_master is implicitly main.sqlite_master . SQLite doesn't > interpret/interpolate the columns, so selecting from corrupt.sqlite_master would > still have 'sqlite_sequence' as the name and the sql would not have db.table > naming. > > [I think? OMG, my precious brain! Yes, AFAICT from testing it.] I'm sorry for making your brain explode. ;-) https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:74: if (!sql::Connection::ShouldIgnoreSqliteError(extended_error)) On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/07 23:09:43, Mark P wrote: > > Do you still want to run this code / possibly display the message if the above > > code decides to recover the database? > > > > If yes, consider whether it would be better for developers to display this > > message first or the (possibly failed) recovery message first. (I'd think the > > original error first would be better.) > > > > If no, then you can struct this function differently. > > I'll have to think about this. Corruption generally stems from things like > unexpected power failures, so a developer is unlikely to see that. This DLOG is > expected to fire for cases like SQLITE_ERROR, which would indicate that a > statement had a syntax error, or an inappropriate schema change happened, > something like that. So is there an interesting case at the intersection where > a developer caused a corruption? Hard to say. > > If a developer was unlucky enough to have a legitimate corruption, I would > expect this to FATAL once, then work going forward. Not a great outcome, but > most developers would just ignore it or log a bug against me. That's find, then > I can discuss it with them and determine if there's a real problem or a red > herring. Usually it's been a red herring, but it has caught real bugs. You misunderstood my comment. I meant in the current revision of the code, you call try to open the database. If it fails, you then call RecoverDatabase(), which may display an error. Then at the end, you display the original error with opening the database. This last error I'd think can be easily interpreted as the final error, i.e., the error with recovering the database (if any); if recovery succeeded, display this error here may make it harder to recognize that fact. I'd think you should display the original error, then attempt recovery and display the result of the recovery attempt next, i.e., display errors in the order they are encountered. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath db_path(temp_dir.path().AppendASCII("TestShortcuts.db")); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > This probably should be "3", not "1". (Otherwise why would other tests use > > unique numbers? There must be a reason.) > > Copy/pasta. Since it's in a unique temp dir, I'll just strip the 1. To the > best of my knowledge none of these need to be unique names. If it doesn't matter that they're different, please revise the other tests in this file to use the same name. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:327: sql::Statement statement(connection.GetUniqueStatement( On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > Why does this return failure if we're ignoring the error? Comment. > > Production code in debug mode fails with a FATAL, because there should not be > cases where that is acceptable. Execute() is WARN_UNUSED_RESULT to force > developers to pay attention to the result, because in actual production there > will be failures that are not due to how the code is written. The ignorer tells > sql::Connection that we're testing the code to handle an error case, so don't do > the FATAL thing. > > [To be clear, a lot of the failure code paths for sql returns are entirely > untested in Chromium. A lot of the code is written to assume nothing can fail, > which is true in development but false when deployed. I've tried to force the > issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for > failure paths, in part because it's challenging to inject failures. So I agree, > much of this is dubious.] Thank you for the explanation. That said, my original request still applies. Please add a comment here, in particular about why the statement.is_valid() test should be false. To a naive eye, I'd think it should be true. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:331: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > Side comment: the comment by the declaration of CheckIgnoredErrors() is awful. > > > It doesn't say what the return type means. Does it mean we encountered an > error > > of the type that we ignored? That's my best guess. But it could mean that we > > encountered an error that didn't ignore, i.e., check status, taking into > account > > ignored errors... > > I don't really follow. It says it's used to check if the errors were > encountered, "True" seems like the most likely response to that question. I don't agree. I see at least two ways to understand CheckIgnoredErrors(). One could simply be "return whether we have an error that was ignored". Another could be "return whether everything is okay, i.e., check that it's okay, taking in account ignored errors". Furthermore, given that this function declaration is in the context of a test // Allow containing test to check if the errors were encountered. // ... it's very easy to read the "check" as one of the debug macros like ASSERT or EXPECT, not simply find the value and return. // Allow containing test to check if the errors were encountered. // Failure to call results in ADD_FAILURE() in destructor. bool CheckIgnoredErrors(); https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:334: // The sql::Connection::Open() in ShortcutsDatabase::Init() will hit the nit: in -> call in https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:344: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > I think it's odd that we still see this error. Shouldn't it be cleared after > > the first failure and then not have any errors for the second attempt at > > opening? > > The error will be thrown in the first Open(), the ignorer allows it to fail > gracefully, then the retry will succeed. Given this, I don't like this test's structure. I want a test that verifies the database opens correctly when not in the context of an IgnoreError regime. Either fix this so it clears the error in the middle as I originally expected or add another block below to verify that. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:354: "SELECT COUNT(*) FROM omni_box_shortcuts")); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > Please use an identical call to the one you used to verify the database was > > broken, don't switch from execute to getuniquestatement. > > Done. Almost done. I'd like exactly the same tests, i.e., also add a ASSERT_TRUE(statement.is_valid()); https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:356: EXPECT_EQ(5, statement.ColumnInt(0)); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > Consider using something like > > EXPECT_EQ(arraysize(shortcut_test_db), CountRecords()) > > a la the other tests in this file. > > CountRecords() runs against the sql::Connection contained in the > ShortcutsDatabase created by SetUp(). That database cannot easily be used at > this point. I could do this: > db_ = new ShortcutsDatabase(db_path); > but note that this isn't necessarily describing the same database as SetUp(). > > I could shift the CountRecords() code out to a helper running against a > sql::Connection, then have CountRecords() call that. I don't have a strong preference for how you fix it, but fundamentally I don't want a magic number "5" here because it's not clear where it comes from and which / whether other changes in this file should cause this number to change and which shouldn't. Please solve this code. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode541 sql/recovery.cc:541: // which is created on demand by SQLite if any tables use AUTOINCREMENT. On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/07 23:09:43, Mark P wrote: > > This comment is too broad. Below you explicitly assume the prefix is the > right > > amount of text so if you add a "main." right after it, it makes sense as a sql > > statement. > > SELECT foo from BAR; > > -> > > SELmain.ECT foo from BAR; > > Ugh, you don't mean this. > > I agree that the example case isn't what I want, but I'm not sure what to do > about it. The prefix has to be some sort of valid SQL prefix thingie, but I'm > not sure what to call it, and I'm not going to be writing a parser to validate > it. I guess I can assert that it ends in a space, but even that doesn't mean > much, since I'm using string manipulation to dynamically construct statements. You still should allude to this limitation here, if only to say something like |prefix| must be the beginning of valid SQL string so that we can insert "main." after it in SQL commands that start with prefix and have the result be another valid SQL command. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode551 sql/recovery.cc:551: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'")); On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/07 23:09:43, Mark P wrote: > > What happens if prefix_length > length(sql)? What does substr return in this > > case? > > The sqlite3 documentation I found doesn't define this. > > https://www.sqlite.org/lang_corefunc.html > > Perhaps you want to add another another condition ensuring that sql is long > > enough? > > (Though I can't imagine this causing any problems unless substr() actually > fails > > on this input, as the equality test still applies) > > I think I'll just rewrite this as a full table scan and do the filtering in C++. > It won't change the efficiency at all, and I expect it will be approximately > the same weight in code. You didn't do this rewrite you promised. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode555 sql/recovery.cc:555: while (s.Step()) { On 2016/04/15 00:38:14, Scott Hess wrote: > On 2016/04/07 23:09:43, Mark P wrote: > > Should all these executions be wrapped in a transaction? > > Recovery happens to a temporary database, which is only committed to the main > database in Recovered(). So transactions would only be necessary insofar as the > (local) caller needs to do controlled rollback of one approach in order to try > another approach. I don't think that can really come up, here, mostly because I > can't think of a second simple approach (either the schema will work for this > kind of recovery, or it won't, and the client shouldn't be calling this code if > the schema is too complex or subtle). You're not concerned with having a failed attempt leave additional commands in corrupt.sqlite_master? Or does that not happen? https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588 sql/recovery.cc:588: db->Poison(); On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > In the header comment, you write the function will poison the database so it > > "will return errors until the handle is re-opened." This implies the database > > is closed. Yet, you don't close it here. The comment by the definition for > > Poison says > > // This is intended as an alternative to Close() in error callbacks. > > // Close() should still be called at some point. > > so shouldn't you close it...? > > The originating caller's Open()/Close() calls should balance out. This call can > happen deep in a stack of calls in a case handling an exceptional error, making > it impractical for the client to track the Close() status. So Poison() poisons > the sql::Connection handle, but logically the handle is still open. In this case, I think it'd be better in the header to say, after you say that it poisons the database "(though the handle technically remains open)" before saying "will return errors until the handle is re-opened." > > For instance, warnings are thrown if you try to run Execute() on a handle which > was never opened, or which has been closed. But if the handle was poisoned, > then the Execute() will fail without warnings. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode661 sql/recovery.cc:661: // sql LIKE "CREATE VIRTUAL TABLE %" On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/08 20:12:06, Mark P wrote: > > Do you have a sense of whether this will be different in some cases? Or are > you > > nervous, which is why you left it like vacuum.c? > > I do not fully understand why vacuum.c uses the rootpage like this. SQLite > creates tables with a root page even if empty, and I believe it could also > validly use a null root page. Most likely the assumption is that you created a > table, so you'll populate it, though I've certainly seen that assumption broken > w/in Chromium. Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it can't > accidentally match 'CREATE TABLE' or any other sort of faux table. If it seems safer to you, why don't you do it (and comment on why you deviate from vacuum.c in this case)? https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode684 sql/recovery.cc:684: // instance a symlink to a non-existent path, or the file is a directory). How is this recoverable? This comment implies you should be returning false. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689 sql/recovery.cc:689: // SQLite return this where other versions return SQLITE_CORRUPT. How is this likely to be recoverable? https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode89 sql/recovery.h:89: static std::unique_ptr<Recovery> Begin(Connection* connection, I'm surprised that you're changing the type of this function and the ones below it and don't need to add additional files to this changelist. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode172 sql/recovery.h:172: static void RecoverDatabaseOrRaze(Connection* db, On 2016/04/15 00:38:15, Scott Hess wrote: > On 2016/04/07 23:09:44, Mark P wrote: > > Uh, did you mean "will be", not "can be"? And will the result be razed (i.e., > > deleted) or an empty database? > > If you didn't mean that, then how will the caller know whether the recovery > > failed and the database is okay to be razed? This function has no return > value. > > If you're intentionally vague because you plan to change this from > > leave-corrupt-database-around to something else, just give it the right name > for > > now. Later when you make the intended change, you can rename the function. > > I'm not sure how to answer! > > Right now, it only applies the safe cases. It either thinks it recovered the > database, or it rolls things back. But I've seen results from the fleet which > make me believe that running recovery to success didn't actually work (possibly > because of shenanigans in the system, like malware or profile's being > overwritten or something). The intention is to monitor the histograms to see if > any of the failure outcomes is due to a broken assumption rather than broken > user data. If the failures cases look weirdly high, I'll instrument to debug > the fleet, but if they look sensible, I'll switch them over from Rollback() to > Unrecoverable() (leading to Raze()). > > sql::Connection::Raze() means that it overwrites the database with a single-page > empty database. The delete language is because if the Raze() doesn't work, then > what? Probably attempt to Delete(). I'm nervous about landing either of those > up front, though, on the off chance that I write a bug which works right in my > tests but in the field blows away a bunch of data. > > Anyhow, my intention with this function is that it consumes the database handle > in all cases, while making a best effort to recover the data, and failing that a > best effort to clear the database of data, and failing that a best effort to > delete the database. On next run the database should either be a valid > data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should be > a valid empty SQLite database, or it should be missing (SQLite would create the > database empty). > > Raze() versus Delete() is because Raze() works using SQLite locking and I/O. It > can clear the database in cases where Delete() won't work. Likewise, Delete() > may work in cases where Raze() does not. Thanks for the long response. Fundamentally, my complaint is that the comment above and indeed the function name doesn't correspond with what the function does. There is no razing going on here. Why not just call it RecoverDatabase()? And remove comments that are entirely wrong such as "If the database is entirely unreadable, the new database will be empty.", and, for that matter, the whole second paragraph (each sentence in it is currently wrong). You can add your raze or delete language back into the function name and comments once you actually switch the code from doing Rollback to Unrecoverable. And at that time add comments about the goal of the function (leaving a working database, even if empty) at that time. That's not the goal at this time. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode176 sql/recovery.h:176: // or which cannot plausibly be recovered (and should be razed). This does nit: or which -> versus You don't mean "or", at least in the boolean sense. Consider making a similar change to the function name. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.cc File sql/recovery_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.c... sql/recovery_unittest.cc:771: // As a side effect AUTOINCREMENT creates the sqlite_sequence table for optional nit: comma after effect https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83764: <int value="19" label="RECOVERY_FAILED_META_NO_VERSION"> Please add a sentence to the end of the Sqlite.RecoveryEvents description, something to effect that those of these enum entries are subsets / included in the counts of other enum entries. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83771: + Database was too broken to recover. "to recover" or "to attempt recovery"? It looks like this error does not include the other detailed failures about problems that can happen along the recovery process
Sorry about the long delay - I had kind of a poor April experience. For the ScopedErrorIgnorer comments, I still don't really understand how a simple function-name change helps anything. I could see changing the entire thing from "ScopedErrorIgnorer" to "ScopedErrorExpecter", used basically the same: ScopedErrorExpecter expecter; expecter.ExpectError(SQLITE_CORRUPT); // Code under test. ASSERT_TRUE(expecter.SawExpectedErrors()); WDYT? The "ignore" naming is centered around "How it changes operation of sql::Connection", whereas "expect" naming would be centered around "How you use it." That implies a filename change, and various changes outside the focus of this CL, so I'd rather move it to an on-topic CL this CL can depend on. Thanks, scott https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database.cc:74: if (!sql::Connection::ShouldIgnoreSqliteError(extended_error)) On 2016/04/19 23:34:27, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/07 23:09:43, Mark P wrote: > > > Do you still want to run this code / possibly display the message if the > above > > > code decides to recover the database? > > > > > > If yes, consider whether it would be better for developers to display this > > > message first or the (possibly failed) recovery message first. (I'd think > the > > > original error first would be better.) > > > > > > If no, then you can struct this function differently. > > > > I'll have to think about this. Corruption generally stems from things like > > unexpected power failures, so a developer is unlikely to see that. This DLOG > is > > expected to fire for cases like SQLITE_ERROR, which would indicate that a > > statement had a syntax error, or an inappropriate schema change happened, > > something like that. So is there an interesting case at the intersection > where > > a developer caused a corruption? Hard to say. > > > > If a developer was unlucky enough to have a legitimate corruption, I would > > expect this to FATAL once, then work going forward. Not a great outcome, but > > most developers would just ignore it or log a bug against me. That's find, > then > > I can discuss it with them and determine if there's a real problem or a red > > herring. Usually it's been a red herring, but it has caught real bugs. > > You misunderstood my comment. > > I meant in the current revision of the code, you call try to open the database. > If it fails, you then call RecoverDatabase(), which may display an error. Then > at the end, you display the original error with opening the database. This last > error I'd think can be easily interpreted as the final error, i.e., the error > with recovering the database (if any); if recovery succeeded, display this error > here may make it harder to recognize that fact. > > I'd think you should display the original error, then attempt recovery and > display the result of the recovery attempt next, i.e., display errors in the > order they are encountered. I'll just have the recovery path return. Connection::OnSqliteError() already logs at ERROR level, if anyone is paying attention. I think the errors which lead to recovery attempts should be distinct from the kinds of development errors the DLOG(FATAL) is intended to expose. To the extent that someone manages to write code that causes corruption, crashing here wouldn't be helpful (as compared to a statement error, where the crash happens compiling the statement). https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... File components/omnibox/browser/shortcuts_database_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:314: base::FilePath db_path(temp_dir.path().AppendASCII("TestShortcuts.db")); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > This probably should be "3", not "1". (Otherwise why would other tests use > > > unique numbers? There must be a reason.) > > > > Copy/pasta. Since it's in a unique temp dir, I'll just strip the 1. To the > > best of my knowledge none of these need to be unique names. > > If it doesn't matter that they're different, please revise the other tests in > this file to use the same name. Done. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:327: sql::Statement statement(connection.GetUniqueStatement( On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > Why does this return failure if we're ignoring the error? Comment. > > > > Production code in debug mode fails with a FATAL, because there should not be > > cases where that is acceptable. Execute() is WARN_UNUSED_RESULT to force > > developers to pay attention to the result, because in actual production there > > will be failures that are not due to how the code is written. The ignorer > tells > > sql::Connection that we're testing the code to handle an error case, so don't > do > > the FATAL thing. > > > > [To be clear, a lot of the failure code paths for sql returns are entirely > > untested in Chromium. A lot of the code is written to assume nothing can > fail, > > which is true in development but false when deployed. I've tried to force the > > issue a bit with WARN_UNUSED_RESULT, but generally no tests are written for > > failure paths, in part because it's challenging to inject failures. So I > agree, > > much of this is dubious.] > > Thank you for the explanation. That said, my original request still applies. > Please add a comment here, in particular about why the statement.is_valid() test > should be false. To a naive eye, I'd think it should be true. Done. Pulled up the statement sql, noted that the post-recovery version succeeds. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:331: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > Side comment: the comment by the declaration of CheckIgnoredErrors() is > awful. > > > > > It doesn't say what the return type means. Does it mean we encountered an > > error > > > of the type that we ignored? That's my best guess. But it could mean that > we > > > encountered an error that didn't ignore, i.e., check status, taking into > > account > > > ignored errors... > > > > I don't really follow. It says it's used to check if the errors were > > encountered, "True" seems like the most likely response to that question. > > I don't agree. I see at least two ways to understand CheckIgnoredErrors(). One > could simply be "return whether we have an error that was ignored". Another > could be "return whether everything is okay, i.e., check that it's okay, taking > in account ignored errors". Obviously there's a terminology problem, here, but I'm a little befuddled how to resolve it. ScopedErrorIgnorer exists entirely/strictly to track errors within a scope. I'm not sure what "everything" might refer to in that context, other than the errors the ScopedErrorIgnorer has been enlisted to track. I'd be happy to refactor that a little, I'll open up a discussion thread offline and this CL can wait on that. The original reviewer of that code maybe wasn't enthusiastic enough about being roped into this. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:334: // The sql::Connection::Open() in ShortcutsDatabase::Init() will hit the On 2016/04/19 23:34:28, Mark P wrote: > nit: in -> call in Done. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:344: ASSERT_TRUE(ignore_errors.CheckIgnoredErrors()); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > I think it's odd that we still see this error. Shouldn't it be cleared > after > > > the first failure and then not have any errors for the second attempt at > > > opening? > > > > The error will be thrown in the first Open(), the ignorer allows it to fail > > gracefully, then the retry will succeed. > > Given this, I don't like this test's structure. I want a test that verifies the > database opens correctly when not in the context of an IgnoreError regime. > Either fix this so it clears the error in the middle as I originally expected or > add another block below to verify that. I'm not sure what you're referring to with "clears the error". Nobody is clearing errors. The database can't open correctly when it is corrupt and not in the context of an error ignorer, there will be a crash. As far as whether the Open() retry works when the error callback fixes things, that's already being tested by SQLConnectionTest.RazeCallbackReopen. This test is testing that the Open() retry logic works when the shortcuts error handler is in the loop. I don't think there is a point where ShortcutsDatabase can crack things open to allow seeing the flow of control for failure retry within sql::Connection::Open(). I could add a test that uses the shortcuts error handler against a plain sql::Connection, but that would imply exporting the error handler, which may-or-may-not be appropriate (right now it would work because it's written in terms of only the sql::Connection, but that wouldn't work if it were revised to reference ShortcutsDatabase). https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:354: "SELECT COUNT(*) FROM omni_box_shortcuts")); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > Please use an identical call to the one you used to verify the database was > > > broken, don't switch from execute to getuniquestatement. > > > > Done. > > Almost done. I'd like exactly the same tests, i.e., also add a > ASSERT_TRUE(statement.is_valid()); Done. https://codereview.chromium.org/1832173002/diff/60001/components/omnibox/brow... components/omnibox/browser/shortcuts_database_unittest.cc:356: EXPECT_EQ(5, statement.ColumnInt(0)); On 2016/04/19 23:34:27, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > Consider using something like > > > EXPECT_EQ(arraysize(shortcut_test_db), CountRecords()) > > > a la the other tests in this file. > > > > CountRecords() runs against the sql::Connection contained in the > > ShortcutsDatabase created by SetUp(). That database cannot easily be used at > > this point. I could do this: > > db_ = new ShortcutsDatabase(db_path); > > but note that this isn't necessarily describing the same database as SetUp(). > > > > I could shift the CountRecords() code out to a helper running against a > > sql::Connection, then have CountRecords() call that. > > I don't have a strong preference for how you fix it, but fundamentally I don't > want a magic number "5" here because it's not clear where it comes from and > which / whether other changes in this file should cause this number to change > and which shouldn't. Please solve this code. It's dependent on the contents of the golden file. I'll generate the expected value from the uncorrupted database. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode541 sql/recovery.cc:541: // which is created on demand by SQLite if any tables use AUTOINCREMENT. On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:15, Scott Hess wrote: > > On 2016/04/07 23:09:43, Mark P wrote: > > > This comment is too broad. Below you explicitly assume the prefix is the > > right > > > amount of text so if you add a "main." right after it, it makes sense as a > sql > > > statement. > > > SELECT foo from BAR; > > > -> > > > SELmain.ECT foo from BAR; > > > Ugh, you don't mean this. > > > > I agree that the example case isn't what I want, but I'm not sure what to do > > about it. The prefix has to be some sort of valid SQL prefix thingie, but I'm > > not sure what to call it, and I'm not going to be writing a parser to validate > > it. I guess I can assert that it ends in a space, but even that doesn't mean > > much, since I'm using string manipulation to dynamically construct statements. > > You still should allude to this limitation here, if only to say something like > |prefix| must be the beginning of valid SQL string so that we can insert "main." > after it in SQL commands that start with prefix and have the result be another > valid SQL command. Done. I'm reluctant to get too elaborate with the comment, because at some point it will just be an English version of the precise series of steps the code implements. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode551 sql/recovery.cc:551: "WHERE substr(sql, 1, ?)=? AND name<>'sqlite_sequence'")); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/07 23:09:43, Mark P wrote: > > > What happens if prefix_length > length(sql)? What does substr return in > this > > > case? > > > The sqlite3 documentation I found doesn't define this. > > > https://www.sqlite.org/lang_corefunc.html > > > Perhaps you want to add another another condition ensuring that sql is long > > > enough? > > > (Though I can't imagine this causing any problems unless substr() actually > > fails > > > on this input, as the equality test still applies) > > > > I think I'll just rewrite this as a full table scan and do the filtering in > C++. > > It won't change the efficiency at all, and I expect it will be approximately > > the same weight in code. > > You didn't do this rewrite you promised. Was thinking out loud, earlier. Done. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode555 sql/recovery.cc:555: while (s.Step()) { On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:14, Scott Hess wrote: > > On 2016/04/07 23:09:43, Mark P wrote: > > > Should all these executions be wrapped in a transaction? > > > > Recovery happens to a temporary database, which is only committed to the main > > database in Recovered(). So transactions would only be necessary insofar as > the > > (local) caller needs to do controlled rollback of one approach in order to try > > another approach. I don't think that can really come up, here, mostly because > I > > can't think of a second simple approach (either the schema will work for this > > kind of recovery, or it won't, and the client shouldn't be calling this code > if > > the schema is too complex or subtle). > > You're not concerned with having a failed attempt leave additional commands in > corrupt.sqlite_master? Or does that not happen? In case of failure the recovery database [main] does not replace the original database [corrupt]. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588 sql/recovery.cc:588: db->Poison(); On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:15, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > In the header comment, you write the function will poison the database so it > > > "will return errors until the handle is re-opened." This implies the > database > > > is closed. Yet, you don't close it here. The comment by the definition for > > > Poison says > > > // This is intended as an alternative to Close() in error callbacks. > > > // Close() should still be called at some point. > > > so shouldn't you close it...? > > > > The originating caller's Open()/Close() calls should balance out. This call > can > > happen deep in a stack of calls in a case handling an exceptional error, > making > > it impractical for the client to track the Close() status. So Poison() > poisons > > the sql::Connection handle, but logically the handle is still open. > In this case, I think it'd be better in the header to say, after you say that it > poisons the database "(though the handle technically remains open)" before > saying "will return errors until the handle is re-opened." > > > > For instance, warnings are thrown if you try to run Execute() on a handle > which > > was never opened, or which has been closed. But if the handle was poisoned, > > then the Execute() will fail without warnings. Done. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode661 sql/recovery.cc:661: // sql LIKE "CREATE VIRTUAL TABLE %" On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:15, Scott Hess wrote: > > On 2016/04/08 20:12:06, Mark P wrote: > > > Do you have a sense of whether this will be different in some cases? Or are > > you > > > nervous, which is why you left it like vacuum.c? > > > > I do not fully understand why vacuum.c uses the rootpage like this. SQLite > > creates tables with a root page even if empty, and I believe it could also > > validly use a null root page. Most likely the assumption is that you created > a > > table, so you'll populate it, though I've certainly seen that assumption > broken > > w/in Chromium. Testing 'CREATE VIRTUAL TABLE' seems safer to me, since it > can't > > accidentally match 'CREATE TABLE' or any other sort of faux table. > > If it seems safer to you, why don't you do it (and comment on why you deviate > from vacuum.c in this case)? Assuming my reasoning is correct, then the DCHECK_IS_ON() section above implies that virtual tables shouldn't work at all, so I'll just remove the part I believe to be associated with virtual tables. I'm leaving the DCHECK_IS_ON() section in place, though. My experience is that error-handling code sometimes does not have good testing coverage, so I'd rather it actively failed in development rather than falsely succeeding. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode684 sql/recovery.cc:684: // instance a symlink to a non-existent path, or the file is a directory). On 2016/04/19 23:34:28, Mark P wrote: > How is this recoverable? This comment implies you should be returning false. I intended this for the delete option, which you asked me to remove, so I'll remove it. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689 sql/recovery.cc:689: // SQLite return this where other versions return SQLITE_CORRUPT. On 2016/04/19 23:34:28, Mark P wrote: > How is this likely to be recoverable? In the case where it was incorrectly returning SQLITE_NOTADB instead of SQLITE_CORRUPT, it would be recoverable. For the version of SQLite currently checked into Chromium's third_party repo, I think SQLITE_NOTADB is never recoverable, but some iOS versions of SQLite are old enough to hit this. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode89 sql/recovery.h:89: static std::unique_ptr<Recovery> Begin(Connection* connection, On 2016/04/19 23:34:28, Mark P wrote: > I'm surprised that you're changing the type of this function and the ones below > it and don't need to add additional files to this changelist. The change from base::scoped_ptr<> to std::unique_ptr<> happened since the CL was originally posted. So any changes you see there are probably just the rebase, and wouldn't show in a diff from trunk (except any new code following the conversion, of course). https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode172 sql/recovery.h:172: static void RecoverDatabaseOrRaze(Connection* db, On 2016/04/19 23:34:28, Mark P wrote: > On 2016/04/15 00:38:15, Scott Hess wrote: > > On 2016/04/07 23:09:44, Mark P wrote: > > > Uh, did you mean "will be", not "can be"? And will the result be razed > (i.e., > > > deleted) or an empty database? > > > If you didn't mean that, then how will the caller know whether the recovery > > > failed and the database is okay to be razed? This function has no return > > value. > > > If you're intentionally vague because you plan to change this from > > > leave-corrupt-database-around to something else, just give it the right name > > for > > > now. Later when you make the intended change, you can rename the function. > > > > I'm not sure how to answer! > > > > Right now, it only applies the safe cases. It either thinks it recovered the > > database, or it rolls things back. But I've seen results from the fleet which > > make me believe that running recovery to success didn't actually work > (possibly > > because of shenanigans in the system, like malware or profile's being > > overwritten or something). The intention is to monitor the histograms to see > if > > any of the failure outcomes is due to a broken assumption rather than broken > > user data. If the failures cases look weirdly high, I'll instrument to debug > > the fleet, but if they look sensible, I'll switch them over from Rollback() to > > Unrecoverable() (leading to Raze()). > > > > sql::Connection::Raze() means that it overwrites the database with a > single-page > > empty database. The delete language is because if the Raze() doesn't work, > then > > what? Probably attempt to Delete(). I'm nervous about landing either of > those > > up front, though, on the off chance that I write a bug which works right in my > > tests but in the field blows away a bunch of data. > > > > Anyhow, my intention with this function is that it consumes the database > handle > > in all cases, while making a best effort to recover the data, and failing that > a > > best effort to clear the database of data, and failing that a best effort to > > delete the database. On next run the database should either be a valid > > data-containing SQLite database (doesn't throw SQLITE_CORRUPT), or it should > be > > a valid empty SQLite database, or it should be missing (SQLite would create > the > > database empty). > > > > Raze() versus Delete() is because Raze() works using SQLite locking and I/O. > It > > can clear the database in cases where Delete() won't work. Likewise, Delete() > > may work in cases where Raze() does not. > > Thanks for the long response. Fundamentally, my complaint is that the comment > above and indeed the function name doesn't correspond with what the function > does. There is no razing going on here. > > Why not just call it RecoverDatabase()? And remove comments that are entirely > wrong such as "If the database is entirely unreadable, the new database will be > empty.", and, for that matter, the whole second paragraph (each sentence in it > is currently wrong). > > You can add your raze or delete language back into the function name and > comments once you actually switch the code from doing Rollback to Unrecoverable. > And at that time add comments about the goal of the function (leaving a working > database, even if empty) at that time. That's not the goal at this time. > > Done. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.h#newcode176 sql/recovery.h:176: // or which cannot plausibly be recovered (and should be razed). This does On 2016/04/19 23:34:28, Mark P wrote: > nit: or which -> versus > You don't mean "or", at least in the boolean sense. > Consider making a similar change to the function name. Just removed the comparison clause entirely. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.cc File sql/recovery_unittest.cc (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery_unittest.c... sql/recovery_unittest.cc:771: // As a side effect AUTOINCREMENT creates the sqlite_sequence table for On 2016/04/19 23:34:28, Mark P wrote: > optional nit: comma after effect Done. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83764: <int value="19" label="RECOVERY_FAILED_META_NO_VERSION"> On 2016/04/19 23:34:28, Mark P wrote: > Please add a sentence to the end of the Sqlite.RecoveryEvents description, > something to effect that those of these enum entries are subsets / included in > the counts of other enum entries. Done. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83771: + Database was too broken to recover. On 2016/04/19 23:34:28, Mark P wrote: > "to recover" or "to attempt recovery"? Recovery was attempted, and it failed in setup. I'll rephrase to be similar to the other descriptions. > It looks like this error does not include the other detailed failures about > problems that can happen along the recovery process I don't know what this means. Are you asking for someone to enumerate all of the cases which prevented Begin() from succeeding? Most of them are already logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and RECOVERY_FAILED_ATTACH. Which obviously implies you may want me to annotate those various items to describe which other items are supersets and subsets of each other, etc. I'm going to pre-emptively argue against this (and maybe that I should unwind the annotations just added). I'm basically recording these as code coverage, so there's literally no utility to understanding them in the abstract without reference to the code. It is easy enough to enumerate the places where a failure can be detected, but many of them are challenging to quantify in the abstract, and I'd rather not spend a week understanding how a particular failure can happen if it has only ever happened 4 times in the fleet.
Apologies - I see some unrelated rebase action in there. I'm going to back out the latest patch and split it into a rebase followed by my changes.
Patchset #5 (id:80001) has been deleted
Rebase
Patchset #5 (id:100001) has been deleted
Rebase
Patchset #5 (id:120001) has been deleted
Rebase
Address review comments through #14. I think.
I assume it's okay to wait until both other changelists are submitted before coming back to this one, right? --mark
On 2016/06/08 22:15:38, Mark P wrote: > I assume it's okay to wait until both other changelists are submitted before > coming back to this one, right? Yeah, let's wait until the offshoot threads are all ground away. I'll ping once I think things should be back in business.
rebase
On 2016/06/25 00:35:32, Scott Hess wrote: > rebase If that emailed, note that I haven't reviewed the past comments to see if everything is in order. It looks like in May I _thought_ things were in order, but ...
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device-gn on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-gn/bui...)
std::count returns ptrdiff_t, various comment changes.
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
OK, I think the open comments have been addressed, except the histograms.xml one I've noted as being in disagreement. Sorry about the extended arc of this CL. Hopefully now the skids are greased. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode588 sql/recovery.cc:588: db->Poison(); On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > On 2016/04/15 00:38:15, Scott Hess wrote: > > > On 2016/04/08 20:12:06, Mark P wrote: > > > > In the header comment, you write the function will poison the database so > it > > > > "will return errors until the handle is re-opened." This implies the > > database > > > > is closed. Yet, you don't close it here. The comment by the definition > for > > > > Poison says > > > > // This is intended as an alternative to Close() in error callbacks. > > > > // Close() should still be called at some point. > > > > so shouldn't you close it...? > > > > > > The originating caller's Open()/Close() calls should balance out. This call > > can > > > happen deep in a stack of calls in a case handling an exceptional error, > > making > > > it impractical for the client to track the Close() status. So Poison() > > poisons > > > the sql::Connection handle, but logically the handle is still open. > > In this case, I think it'd be better in the header to say, after you say that > it > > poisons the database "(though the handle technically remains open)" before > > saying "will return errors until the handle is re-opened." > > > > > > For instance, warnings are thrown if you try to run Execute() on a handle > > which > > > was never opened, or which has been closed. But if the handle was poisoned, > > > then the Execute() will fail without warnings. > > Done. Header-comment-change really done, this time. https://codereview.chromium.org/1832173002/diff/60001/sql/recovery.cc#newcode689 sql/recovery.cc:689: // SQLite return this where other versions return SQLITE_CORRUPT. On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > How is this likely to be recoverable? > > In the case where it was incorrectly returning SQLITE_NOTADB instead of > SQLITE_CORRUPT, it would be recoverable. For the version of SQLite currently > checked into Chromium's third_party repo, I think SQLITE_NOTADB is never > recoverable, but some iOS versions of SQLite are old enough to hit this. Adjusted the comment a bit more. Long-term, this case will lead to recovery for some versions of SQLite, and to deletion for others. I don't think it would be productive to isolate which versions of SQLite have the different return code, because the only change to the worst-case outcome is that the handle is poisoned (which is actually preferable, since SQLITE_NOTADB will never resolve itself). https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83771: + Database was too broken to recover. On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > "to recover" or "to attempt recovery"? > > Recovery was attempted, and it failed in setup. I'll rephrase to be similar to > the other descriptions. > > > It looks like this error does not include the other detailed failures about > > problems that can happen along the recovery process > > I don't know what this means. Are you asking for someone to enumerate all of > the cases which prevented Begin() from succeeding? Most of them are already > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and > RECOVERY_FAILED_ATTACH. > > Which obviously implies you may want me to annotate those various items to > describe which other items are supersets and subsets of each other, etc. I'm > going to pre-emptively argue against this (and maybe that I should unwind the > annotations just added). I'm basically recording these as code coverage, so > there's literally no utility to understanding them in the abstract without > reference to the code. It is easy enough to enumerate the places where a > failure can be detected, but many of them are challenging to quantify in the > abstract, and I'd rather not spend a week understanding how a particular failure > can happen if it has only ever happened 4 times in the fleet. Just FYI, I don't consider this comment resolved, there's still disagreement.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
comments are on both patchset #7 and #8. (Apologies, I was in the middle of reviewing patchset 7 when you uploaded patch 8 with additional revisions.) --mark https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83771: + Database was too broken to recover. On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > "to recover" or "to attempt recovery"? > > Recovery was attempted, and it failed in setup. I'll rephrase to be similar to > the other descriptions. > > > It looks like this error does not include the other detailed failures about > > problems that can happen along the recovery process > > I don't know what this means. Are you asking for someone to enumerate all of > the cases which prevented Begin() from succeeding? Most of them are already > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and > RECOVERY_FAILED_ATTACH. > > Which obviously implies you may want me to annotate those various items to > describe which other items are supersets and subsets of each other, etc. I'm > going to pre-emptively argue against this (and maybe that I should unwind the > annotations just added). I'm basically recording these as code coverage, so > there's literally no utility to understanding them in the abstract without > reference to the code. It is easy enough to enumerate the places where a > failure can be detected, but many of them are challenging to quantify in the > abstract, and I'd rather not spend a week understanding how a particular failure > can happen if it has only ever happened 4 times in the fleet. Acknowledged. https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... sql/recovery.cc:555: while (s.Step()) { On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > On 2016/04/15 00:38:14, Scott Hess wrote: > > > On 2016/04/07 23:09:43, Mark P wrote: > > > > Should all these executions be wrapped in a transaction? > > > > > > Recovery happens to a temporary database, which is only committed to the > main > > > database in Recovered(). So transactions would only be necessary insofar as > > the > > > (local) caller needs to do controlled rollback of one approach in order to > try > > > another approach. I don't think that can really come up, here, mostly > because > > I > > > can't think of a second simple approach (either the schema will work for > this > > > kind of recovery, or it won't, and the client shouldn't be calling this code > > if > > > the schema is too complex or subtle). > > > > You're not concerned with having a failed attempt leave additional commands in > > corrupt.sqlite_master? Or does that not happen? > > In case of failure the recovery database [main] does not replace the original > database [corrupt]. I don't understand. It seems in on each step you're running an executing a command. If one command fails but the other succeeds, won't you leave the the "main." database in a weird state? I don't see function wrapped in a transaction. https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0) nit: spaces around binary operators https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173 sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR or SQLITE_FULL). nit: recovered -> recoverable ? https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... tools/metrics/histograms/histograms.xml:54088: + RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. On 2016/05/13 21:24:36, Scott Hess wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > Please add a sentence to the end of the Sqlite.RecoveryEvents description, > > something to effect that those of these enum entries are subsets / included in > > the counts of other enum entries. > > Done. Kind of done. These are subsets, true. But I _think_ there are other subsets here too. Please either - be complete and list all subsets. - be vague and simply state that there are certain types of errors that cause an emission to multiple buckets. (Read code for details.) The current phrasing implies this list is complete (which I'm skeptical of given your other comments). https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should not be correlated with programmer error, so the I have trouble understanding this sentence. I don't see why it shouldn't be "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be helpful. Perhaps consider replacing the sentence with: The DLOG(FATAL) below should not normally be relevant to developers. (Developers do not often test against corrupt-yet-recoverable databases.) Displaying the error would probably lead to more confusion than help. https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod... sql/recovery.cc:689: // will fail with no changes to the database. consider adding ... so there's no harm in returning true here.
My day is looking such that if I'm lucky, I can circle back this evening. WRT the histogram coverage, right now, I'm just loading up the same histogram like histograms cost money. Would it help to just break out a separate histogram for the new DB-level API? In that case, I can probably get that API to where each exit point has a single enum, and the failures will overlap the table-level failures in the other histogram, rather than overlapping within the same histogram. On 2016/06/27 23:20:13, Mark P wrote: > comments are on both patchset #7 and #8. > (Apologies, I was in the middle of reviewing patchset 7 when you uploaded patch > 8 with additional revisions.) > > --mark > > https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... > File tools/metrics/histograms/histograms.xml (right): > > https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... > tools/metrics/histograms/histograms.xml:83771: + Database was too broken to > recover. > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > "to recover" or "to attempt recovery"? > > > > Recovery was attempted, and it failed in setup. I'll rephrase to be similar > to > > the other descriptions. > > > > > It looks like this error does not include the other detailed failures about > > > problems that can happen along the recovery process > > > > I don't know what this means. Are you asking for someone to enumerate all of > > the cases which prevented Begin() from succeeding? Most of them are already > > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, > > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and > > RECOVERY_FAILED_ATTACH. > > > > Which obviously implies you may want me to annotate those various items to > > describe which other items are supersets and subsets of each other, etc. I'm > > going to pre-emptively argue against this (and maybe that I should unwind the > > annotations just added). I'm basically recording these as code coverage, so > > there's literally no utility to understanding them in the abstract without > > reference to the code. It is easy enough to enumerate the places where a > > failure can be detected, but many of them are challenging to quantify in the > > abstract, and I'd rather not spend a week understanding how a particular > failure > > can happen if it has only ever happened 4 times in the fleet. > > Acknowledged. > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc > File sql/recovery.cc (right): > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... > sql/recovery.cc:555: while (s.Step()) { > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > On 2016/04/15 00:38:14, Scott Hess wrote: > > > > On 2016/04/07 23:09:43, Mark P wrote: > > > > > Should all these executions be wrapped in a transaction? > > > > > > > > Recovery happens to a temporary database, which is only committed to the > > main > > > > database in Recovered(). So transactions would only be necessary insofar > as > > > the > > > > (local) caller needs to do controlled rollback of one approach in order to > > try > > > > another approach. I don't think that can really come up, here, mostly > > because > > > I > > > > can't think of a second simple approach (either the schema will work for > > this > > > > kind of recovery, or it won't, and the client shouldn't be calling this > code > > > if > > > > the schema is too complex or subtle). > > > > > > You're not concerned with having a failed attempt leave additional commands > in > > > corrupt.sqlite_master? Or does that not happen? > > > > In case of failure the recovery database [main] does not replace the original > > database [corrupt]. > > I don't understand. It seems in on each step you're running an executing a > command. If one command fails but the other succeeds, won't you leave the the > "main." database in a weird state? I don't see function wrapped in a > transaction. > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... > sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0) > nit: spaces around binary operators > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h > File sql/recovery.h (right): > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173 > sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR > or SQLITE_FULL). > nit: recovered -> recoverable ? > > https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... > File tools/metrics/histograms/histograms.xml (right): > > https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... > tools/metrics/histograms/histograms.xml:54088: + > RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > Please add a sentence to the end of the Sqlite.RecoveryEvents description, > > > something to effect that those of these enum entries are subsets / included > in > > > the counts of other enum entries. > > > > Done. > > Kind of done. These are subsets, true. But I _think_ there are other subsets > here too. > Please either > - be complete and list all subsets. > - be vague and simply state that there are certain types of errors that cause an > emission to multiple buckets. (Read code for details.) > The current phrasing implies this list is complete (which I'm skeptical of given > your other comments). > > https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... > File components/omnibox/browser/shortcuts_database.cc (right): > > https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... > components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should > not be correlated with programmer error, so the > I have trouble understanding this sentence. I don't see why it shouldn't be > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be > helpful. Perhaps consider replacing the sentence with: > The DLOG(FATAL) below should not normally be relevant to developers. (Developers > do not often test against corrupt-yet-recoverable databases.) Displaying the > error would probably lead to more confusion than help. > > https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc > File sql/recovery.cc (right): > > https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod... > sql/recovery.cc:689: // will fail with no changes to the database. > consider adding ... so there's no harm in returning true here.
On Tue, Jun 28, 2016 at 3:28 PM, <shess@chromium.org> wrote: > My day is looking such that if I'm lucky, I can circle back this evening. > WRT > the histogram coverage, right now, I'm just loading up the same histogram > like > histograms cost money. Would it help to just break out a separate > histogram for > the new DB-level API? In that case, I can probably get that API to where > each > exit point has a single enum, and the failures will overlap the table-level > failures in the other histogram, rather than overlapping within the same > histogram. > Frankly, I don't care that much. I don't expect many people other than you to look at this. I just want whatever behavior there is to be documented in case someone else looks then they'll either understand or at least know they have to delve into code to look deeper. Do whichever solution (one or multiple histograms) that would be better for you to work with. --mark > > > > On 2016/06/27 23:20:13, Mark P wrote: > > comments are on both patchset #7 and #8. > > (Apologies, I was in the middle of reviewing patchset 7 when you uploaded > patch > > 8 with additional revisions.) > > > > --mark > > > > > > https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... > > File tools/metrics/histograms/histograms.xml (right): > > > > > > https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... > > tools/metrics/histograms/histograms.xml:83771: + Database was too broken > to > > recover. > > On 2016/05/13 21:24:36, Scott Hess wrote: > > > On 2016/04/19 23:34:28, Mark P wrote: > > > > "to recover" or "to attempt recovery"? > > > > > > Recovery was attempted, and it failed in setup. I'll rephrase to be > similar > > to > > > the other descriptions. > > > > > > > It looks like this error does not include the other detailed failures > about > > > > problems that can happen along the recovery process > > > > > > I don't know what this means. Are you asking for someone to enumerate > all > of > > > the cases which prevented Begin() from succeeding? Most of them are > already > > > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, > > > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and > > > RECOVERY_FAILED_ATTACH. > > > > > > Which obviously implies you may want me to annotate those various > items to > > > describe which other items are supersets and subsets of each other, > etc. > I'm > > > going to pre-emptively argue against this (and maybe that I should > unwind > the > > > annotations just added). I'm basically recording these as code > coverage, so > > > there's literally no utility to understanding them in the abstract > without > > > reference to the code. It is easy enough to enumerate the places where > a > > > failure can be detected, but many of them are challenging to quantify > in the > > > abstract, and I'd rather not spend a week understanding how a > particular > > failure > > > can happen if it has only ever happened 4 times in the fleet. > > > > Acknowledged. > > > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc > > File sql/recovery.cc (right): > > > > > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... > > sql/recovery.cc:555: while (s.Step()) { > > On 2016/05/13 21:24:36, Scott Hess wrote: > > > On 2016/04/19 23:34:28, Mark P wrote: > > > > On 2016/04/15 00:38:14, Scott Hess wrote: > > > > > On 2016/04/07 23:09:43, Mark P wrote: > > > > > > Should all these executions be wrapped in a transaction? > > > > > > > > > > Recovery happens to a temporary database, which is only committed > to the > > > main > > > > > database in Recovered(). So transactions would only be necessary > insofar > > as > > > > the > > > > > (local) caller needs to do controlled rollback of one approach in > order > to > > > try > > > > > another approach. I don't think that can really come up, here, > mostly > > > because > > > > I > > > > > can't think of a second simple approach (either the schema will > work for > > > this > > > > > kind of recovery, or it won't, and the client shouldn't be calling > this > > code > > > > if > > > > > the schema is too complex or subtle). > > > > > > > > You're not concerned with having a failed attempt leave additional > commands > > in > > > > corrupt.sqlite_master? Or does that not happen? > > > > > > In case of failure the recovery database [main] does not replace the > original > > > database [corrupt]. > > > > I don't understand. It seems in on each step you're running an executing > a > > command. If one command fails but the other succeeds, won't you leave > the the > > "main." database in a weird state? I don't see function wrapped in a > > transaction. > > > > > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... > > sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0) > > nit: spaces around binary operators > > > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h > > File sql/recovery.h (right): > > > > > > https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173 > > sql/recovery.h:173: // verified to be recovered correctly (such as > SQLITE_IOERR > > or SQLITE_FULL). > > nit: recovered -> recoverable ? > > > > > > https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... > > File tools/metrics/histograms/histograms.xml (right): > > > > > > https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... > > tools/metrics/histograms/histograms.xml:54088: + > > RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. > > On 2016/05/13 21:24:36, Scott Hess wrote: > > > On 2016/04/19 23:34:28, Mark P wrote: > > > > Please add a sentence to the end of the Sqlite.RecoveryEvents > description, > > > > something to effect that those of these enum entries are subsets / > included > > in > > > > the counts of other enum entries. > > > > > > Done. > > > > Kind of done. These are subsets, true. But I _think_ there are other > subsets > > here too. > > Please either > > - be complete and list all subsets. > > - be vague and simply state that there are certain types of errors that > cause > an > > emission to multiple buckets. (Read code for details.) > > The current phrasing implies this list is complete (which I'm skeptical > of > given > > your other comments). > > > > > > https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... > > File components/omnibox/browser/shortcuts_database.cc (right): > > > > > > https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... > > components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases > should > > not be correlated with programmer error, so the > > I have trouble understanding this sentence. I don't see why it shouldn't > be > > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be > > helpful. Perhaps consider replacing the sentence with: > > The DLOG(FATAL) below should not normally be relevant to developers. > (Developers > > do not often test against corrupt-yet-recoverable databases.) Displaying > the > > error would probably lead to more confusion than help. > > > > https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc > > File sql/recovery.cc (right): > > > > > > https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod... > > sql/recovery.cc:689: // will fail with no changes to the database. > > consider adding ... so there's no harm in returning true here. > > > > https://codereview.chromium.org/1832173002/ > -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
Trim comments, histogram comments, drop enum from histogram.
The CQ bit was checked by shess@chromium.org to run a CQ dry run
ptal, thx. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83764: <int value="19" label="RECOVERY_FAILED_META_NO_VERSION"> On 2016/05/13 21:24:36, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/04/19 23:34:28, Mark P wrote: > > Please add a sentence to the end of the Sqlite.RecoveryEvents description, > > something to effect that those of these enum entries are subsets / included in > > the counts of other enum entries. > > Done. Undone in favor of referring to the code. https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/60001/tools/metrics/histogram... tools/metrics/histograms/histograms.xml:83771: + Database was too broken to recover. On 2016/06/27 23:20:13, Mark P wrote: > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > "to recover" or "to attempt recovery"? > > > > Recovery was attempted, and it failed in setup. I'll rephrase to be similar > to > > the other descriptions. > > > > > It looks like this error does not include the other detailed failures about > > > problems that can happen along the recovery process > > > > I don't know what this means. Are you asking for someone to enumerate all of > > the cases which prevented Begin() from succeeding? Most of them are already > > logged in the Init() section, namely RECOVERY_FAILED_OPEN_TEMPORARY, > > RECOVERY_FAILED_VIRTUAL_TABLE_INIT, RECOVERY_FAILED_WRITABLE_SCHEMA and > > RECOVERY_FAILED_ATTACH. > > > > Which obviously implies you may want me to annotate those various items to > > describe which other items are supersets and subsets of each other, etc. I'm > > going to pre-emptively argue against this (and maybe that I should unwind the > > annotations just added). I'm basically recording these as code coverage, so > > there's literally no utility to understanding them in the abstract without > > reference to the code. It is easy enough to enumerate the places where a > > failure can be detected, but many of them are challenging to quantify in the > > abstract, and I'd rather not spend a week understanding how a particular > failure > > can happen if it has only ever happened 4 times in the fleet. > > Acknowledged. OK, WRT re-writing the histogram comment to indicate "this is coverage, look at the code", I also omitted RECOVERY_FAILED_AUTORECOVERDB_SCHEMA in favor of the independent subsets. So those enums are now independent, but the entire set of enums are not (due to table and db errors overlapping). https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... sql/recovery.cc:555: while (s.Step()) { On 2016/06/27 23:20:13, Mark P wrote: > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > On 2016/04/15 00:38:14, Scott Hess wrote: > > > > On 2016/04/07 23:09:43, Mark P wrote: > > > > > Should all these executions be wrapped in a transaction? > > > > > > > > Recovery happens to a temporary database, which is only committed to the > > main > > > > database in Recovered(). So transactions would only be necessary insofar > as > > > the > > > > (local) caller needs to do controlled rollback of one approach in order to > > try > > > > another approach. I don't think that can really come up, here, mostly > > because > > > I > > > > can't think of a second simple approach (either the schema will work for > > this > > > > kind of recovery, or it won't, and the client shouldn't be calling this > code > > > if > > > > the schema is too complex or subtle). > > > > > > You're not concerned with having a failed attempt leave additional commands > in > > > corrupt.sqlite_master? Or does that not happen? > > > > In case of failure the recovery database [main] does not replace the original > > database [corrupt]. > > I don't understand. It seems in on each step you're running an executing a > command. If one command fails but the other succeeds, won't you leave the the > "main." database in a weird state? I don't see function wrapped in a > transaction. At the outer level, the database [corrupt] is the real database we started with, "Shortcuts" in this case. This system accesses that database read-only. The database [main] is a temporary database (on POSIX I think it's an open fd to an unlinked file). If recovery runs through to completion, the final Recovered() call will copy the blocks of the recovered database back the original database. In all other cases, it will be closed, and thus discarded. Regular transactions aren't appropriate, since the corrupt database is presumed untrustworthy. Any regular write activity against tables or rows in that database could propagate corruption (thus the read-only), and the recover virtual table is used to scan things out in a safe fashion. The final copy is at the block level, which uses OS-level operations so corruption should no longer matter. https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... sql/recovery.cc:559: if (sql.compare(0, prefix_len, prefix)!=0) On 2016/06/27 23:20:13, Mark P wrote: > nit: spaces around binary operators Done. https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173 sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR or SQLITE_FULL). On 2016/06/27 23:20:13, Mark P wrote: > nit: recovered -> recoverable ? I'm just going to remove the entire "why false" case, and just let the "why true" case stand alone. I intentionally avoided language like "Do not call RecoverDatabase() unless this returns true", because it actually doesn't matter. The code doesn't work better or worse based on this error code, the caller just shouldn't dump all errors into recovery because of the goal of strongly flagging newly-introduced errors like SQLITE_ERROR or SQLITE_MISUSE. https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... File tools/metrics/histograms/histograms.xml (right): https://codereview.chromium.org/1832173002/diff/180001/tools/metrics/histogra... tools/metrics/histograms/histograms.xml:54088: + RECOVERY_FAILED_AUTORECOVERDB_SCHEMA. On 2016/06/27 23:20:13, Mark P wrote: > On 2016/05/13 21:24:36, Scott Hess wrote: > > On 2016/04/19 23:34:28, Mark P wrote: > > > Please add a sentence to the end of the Sqlite.RecoveryEvents description, > > > something to effect that those of these enum entries are subsets / included > in > > > the counts of other enum entries. > > > > Done. > > Kind of done. These are subsets, true. But I _think_ there are other subsets > here too. > Please either > - be complete and list all subsets. > - be vague and simply state that there are certain types of errors that cause an > emission to multiple buckets. (Read code for details.) > The current phrasing implies this list is complete (which I'm skeptical of given > your other comments). Reworded the description to emphasis that this is to inform the development cycle, not indicate that anything actionable is going wrong. And point at recovery.cc. https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should not be correlated with programmer error, so the On 2016/06/27 23:20:13, Mark P wrote: > I have trouble understanding this sentence. I don't see why it shouldn't be > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be > helpful. Perhaps consider replacing the sentence with: > The DLOG(FATAL) below should not normally be relevant to developers. (Developers > do not often test against corrupt-yet-recoverable databases.) Displaying the > error would probably lead to more confusion than help. Probably nerdview. Made an attempt at revision. https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/200001/sql/recovery.cc#newcod... sql/recovery.cc:689: // will fail with no changes to the database. On 2016/06/27 23:20:13, Mark P wrote: > consider adding ... so there's no harm in returning true here. Done.
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_androi...)
lgtm (with one trivial nit below) (at last!) Happy Friday! (and enjoy your time away) --mark https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.cc#newcod... sql/recovery.cc:555: while (s.Step()) { On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/06/27 23:20:13, Mark P wrote: > > On 2016/05/13 21:24:36, Scott Hess wrote: > > > On 2016/04/19 23:34:28, Mark P wrote: > > > > On 2016/04/15 00:38:14, Scott Hess wrote: > > > > > On 2016/04/07 23:09:43, Mark P wrote: > > > > > > Should all these executions be wrapped in a transaction? > > > > > > > > > > Recovery happens to a temporary database, which is only committed to the > > > main > > > > > database in Recovered(). So transactions would only be necessary > insofar > > as > > > > the > > > > > (local) caller needs to do controlled rollback of one approach in order > to > > > try > > > > > another approach. I don't think that can really come up, here, mostly > > > because > > > > I > > > > > can't think of a second simple approach (either the schema will work for > > > this > > > > > kind of recovery, or it won't, and the client shouldn't be calling this > > code > > > > if > > > > > the schema is too complex or subtle). > > > > > > > > You're not concerned with having a failed attempt leave additional > commands > > in > > > > corrupt.sqlite_master? Or does that not happen? > > > > > > In case of failure the recovery database [main] does not replace the > original > > > database [corrupt]. > > > > I don't understand. It seems in on each step you're running an executing a > > command. If one command fails but the other succeeds, won't you leave the the > > "main." database in a weird state? I don't see function wrapped in a > > transaction. > > At the outer level, the database [corrupt] is the real database we started with, > "Shortcuts" in this case. This system accesses that database read-only. The > database [main] is a temporary database (on POSIX I think it's an open fd to an > unlinked file). If recovery runs through to completion, the final Recovered() > call will copy the blocks of the recovered database back the original database. > In all other cases, it will be closed, and thus discarded. > > Regular transactions aren't appropriate, since the corrupt database is presumed > untrustworthy. Any regular write activity against tables or rows in that > database could propagate corruption (thus the read-only), and the recover > virtual table is used to scan things out in a safe fashion. The final copy is > at the block level, which uses OS-level operations so corruption should no > longer matter. Thanks for the explanation. https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h File sql/recovery.h (right): https://codereview.chromium.org/1832173002/diff/180001/sql/recovery.h#newcode173 sql/recovery.h:173: // verified to be recovered correctly (such as SQLITE_IOERR or SQLITE_FULL). On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/06/27 23:20:13, Mark P wrote: > > nit: recovered -> recoverable ? > > I'm just going to remove the entire "why false" case, and just let the "why > true" case stand alone. Good move. > I intentionally avoided language like "Do not call RecoverDatabase() unless this > returns true", because it actually doesn't matter. The code doesn't work better > or worse based on this error code, the caller just shouldn't dump all errors > into recovery because of the goal of strongly flagging newly-introduced errors > like SQLITE_ERROR or SQLITE_MISUSE. Acknowledged. https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... File components/omnibox/browser/shortcuts_database.cc (right): https://codereview.chromium.org/1832173002/diff/200001/components/omnibox/bro... components/omnibox/browser/shortcuts_database.cc:72: // Recoverable cases should not be correlated with programmer error, so the On 2016/06/29 21:39:30, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/06/27 23:20:13, Mark P wrote: > > I have trouble understanding this sentence. I don't see why it shouldn't be > > "correlated" in some sense, nor is it obvious why the DCHECK wouldn't be > > helpful. Perhaps consider replacing the sentence with: > > The DLOG(FATAL) below should not normally be relevant to developers. > (Developers > > do not often test against corrupt-yet-recoverable databases.) Displaying the > > error would probably lead to more confusion than help. > > Probably nerdview. Made an attempt at revision. Your new comment is much better, thanks. https://codereview.chromium.org/1832173002/diff/220001/sql/recovery.cc File sql/recovery.cc (right): https://codereview.chromium.org/1832173002/diff/220001/sql/recovery.cc#newcod... sql/recovery.cc:687: // will fail with no changes to the database, so there's no harm to nit: "to attempt" or "in attempting", not "to attempting"
grammar
The CQ bit was unchecked by shess@chromium.org
The CQ bit was checked by shess@chromium.org
The CQ bit was checked by shess@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from mpearson@chromium.org Link to the patchset: https://codereview.chromium.org/1832173002/#ps240001 (title: "grammar")
The patchset sent to the CQ was uploaded after l-g-t-m from mpearson@chromium.org Link to the patchset: https://codereview.chromium.org/1832173002/#ps240001 (title: "grammar")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Committed patchset #10 (id:240001)
Message was sent while issue was closed.
CQ bit was unchecked.
Message was sent while issue was closed.
Description was changed from ========== [sql] Database recovery system for Shortcuts. sql::Recovery provides a conservative recovery system which requires some overhead in the client code. Such code exists for the "Top Sites" and Favicons databases, and has shown that the easy-to-handle cases dominate. Implement a helper function to recover an entire database, and apply it to the Shortcuts database. BUG=597785 ========== to ========== [sql] Database recovery system for Shortcuts. sql::Recovery provides a conservative recovery system which requires some overhead in the client code. Such code exists for the "Top Sites" and Favicons databases, and has shown that the easy-to-handle cases dominate. Implement a helper function to recover an entire database, and apply it to the Shortcuts database. BUG=597785 Committed: https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5 Cr-Commit-Position: refs/heads/master@{#403576} ==========
Message was sent while issue was closed.
Patchset 10 (id:??) landed as https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5 Cr-Commit-Position: refs/heads/master@{#403576}
Message was sent while issue was closed.
On 2016/07/02 00:28:40, commit-bot: I haz the power wrote: > Patchset 10 (id:??) landed as > https://crrev.com/a402e75a80c13fab013100447b31a00ff6b3a0f5 > Cr-Commit-Position: refs/heads/master@{#403576} To future build or stability sheriffs, I'm going to be in-and-out of town for a bit, here. I have a couple points where I can make time to check-in on things, and I intend to mostly to check on the progress of this CL. If this is definitely breaking the thing, revert it! But if it just needs to be disabled, I put up a CL for that: https://codereview.chromium.org/2114823003/ |