|
|
Created:
5 years, 2 months ago by Scott Hess - ex-Googler Modified:
5 years, 2 months ago CC:
chromium-reviews Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
Description[sql] Track uploads of diagnostic data to prevent duplication.
The ThumbnailDatabase error handler attempts to capture generic data
about the error, which is then uploaded for diagnostics with
DumpWithoutCrashing(). Database with errors which are not repaired
upload new dumps repeatedly. This change adds a tracking file alongside
the database file, which has the dual purpose of recording whether a
previous dump was attempted, and also verifying that files can be
created in the profile directory. If the profile directory is too
broken to create files, then diagnostic data is unlikely to inform a
fix.
BUG=543321, 362505, 526614, 240396
Committed: https://crrev.com/c8cd2a161947886dab169caa0aa656230f2928ec
Cr-Commit-Position: refs/heads/master@{#355632}
Patch Set 1 #Patch Set 2 : missing FILE_PATH_LITERAL #
Total comments: 16
Patch Set 3 : Move sql::Connection methods with no changes. #Patch Set 4 : Other changes from pkotwicz #
Total comments: 4
Patch Set 5 : Another comment tweak #
Total comments: 2
Patch Set 6 : Shift diagnostic generation and reporting to sql::Connection. #
Total comments: 13
Patch Set 7 : Collect{Error/Corruption}Info test #Patch Set 8 : Mojo can't get the size, sigh. #
Total comments: 6
Patch Set 9 : review nits. #Patch Set 10 : Bail out EVEN MORE. #Patch Set 11 : Oh, mojo #
Messages
Total messages: 43 (10 generated)
missing FILE_PATH_LITERAL
shess@chromium.org changed reviewers: + pkotwicz@chromium.org
I think I had you review the original code, so... I'm polishing this off both to address complaints about soiling the crash stream, and to start moving towards doing this kind of debugging on other databases. When written, this code depended on features in chrome/ which have now moved into base/, so it is reasonable to shift some of it into sql::Connection instead.
LGTM https://codereview.chromium.org/1393393007/diff/20001/components/history/core... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/20001/components/history/core... components/history/core/browser/thumbnail_database.cc:258: return; Nit - I think that this is slightly clearer: if (db->ShouldUploadDiagnosticDump()) { // Corrupt cases currently dominate, report them very infrequently. const uint64 kCorruptReportsPerMillion = 10000; if (rand < kCorruptReportsPerMillion) ReportCorrupt(db, startup_kb); } https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1560: bool Connection::ShouldUploadDiagnosticDump() const { Nit: Can you please move this function so that it better matches the order of the functions in the .h file? https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1608: scoped_ptr<base::Value> read_root(deserializer.Deserialize(NULL, NULL)); Nit: NULL -> nullptr https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1621: base::ListValue* dumps = NULL; Nit: NULL -> nullptr https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1640: breadcrumb_path.AddExtension(FILE_PATH_LITERAL("new")); Should we use base::CreateTemporaryFile() instead? https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h File sql/connection.h (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h#newcod... sql/connection.h:478: // each unique histogram tag. Nit: In the comment can you mention that calling ShouldUploadDiagnosticDump() sets state and that calling ShouldUploadDiagnosticDump() twice will return true at most once? I am not a big fan of the method name but I cannot think of a better method name. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h#newcod... sql/connection.h:665: // passed to Open(). Nit: Return -> Returns
Move sql::Connection methods with no changes.
Other changes from pkotwicz
For purposes of reviewing changes incrementally, moving the functions is in one patch, then editing things is in the other. Since the rename and commenting were kind of "Delete draft and rewrite", probably best to review those again. https://codereview.chromium.org/1393393007/diff/20001/components/history/core... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/20001/components/history/core... components/history/core/browser/thumbnail_database.cc:258: return; On 2015/10/15 18:02:29, pkotwicz wrote: > Nit - I think that this is slightly clearer: > > if (db->ShouldUploadDiagnosticDump()) { > // Corrupt cases currently dominate, report them very infrequently. > const uint64 kCorruptReportsPerMillion = 10000; > if (rand < kCorruptReportsPerMillion) > ReportCorrupt(db, startup_kb); > } OK, I pulled all of these to just before the Report line. I could push the check into the Report functions, but I'm thinking the entire ball of wax should be refactored. If this change successfully moderates the reporting rate, then it would make sense to remove much of the special-casing in this function. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1560: bool Connection::ShouldUploadDiagnosticDump() const { On 2015/10/15 18:02:29, pkotwicz wrote: > Nit: Can you please move this function so that it better matches the order of > the functions in the .h file? I can put it after OnMemoryDump(), but the functions in this file are not well-ordered in the first place... https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1608: scoped_ptr<base::Value> read_root(deserializer.Deserialize(NULL, NULL)); On 2015/10/15 18:02:30, pkotwicz wrote: > Nit: NULL -> nullptr Acknowledged. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1621: base::ListValue* dumps = NULL; On 2015/10/15 18:02:29, pkotwicz wrote: > Nit: NULL -> nullptr Acknowledged. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1640: breadcrumb_path.AddExtension(FILE_PATH_LITERAL("new")); On 2015/10/15 18:02:29, pkotwicz wrote: > Should we use base::CreateTemporaryFile() instead? I'll assume you intended CreateTemporaryFileInDir(), due to the rename. I used a reproducible path because I didn't want to worry about writing cleanup code for orphaned temporary files, and I could not find such cleanup code for CreateTemporaryFile() and kin. The following delete line is not strictly necessary, since the underlying WriteFile() should truncate if it already exists. Also, CreateTemporaryFileInDir() on POSIX calls mkstemp() and then closees the resulting file descriptor. In this case I definitely don't need a unique file, so that seems excessive (and a little weird). I don't even know what to make of the Windows implementation. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h File sql/connection.h (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h#newcod... sql/connection.h:478: // each unique histogram tag. On 2015/10/15 18:02:30, pkotwicz wrote: > Nit: In the comment can you mention that calling ShouldUploadDiagnosticDump() > sets state and that calling ShouldUploadDiagnosticDump() twice will return true > at most once? > > I am not a big fan of the method name but I cannot think of a better method > name. Changed to RegisterIntentToUpload(), with the return value interpreted as whether registration was successful. Overhauled the comments. https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h#newcod... sql/connection.h:665: // passed to Open(). On 2015/10/15 18:02:30, pkotwicz wrote: > Nit: Return -> Returns Acknowledged. Also moved impl to after ReleaseCacheMemoryIfNeeded().
https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc#newco... sql/connection.cc:290: db_path.DirName().Append(FILE_PATH_LITERAL("sqlite-diag"))); Hmm. I am considering using this file to also track certain out-of-band status about databases, the main short-term case being to track if a database shows errors which are known to cause problems with memory-mapped I/O. To the extent that this info is optional and reflects records about the operation of the database, I'm comfortable with "sqlite-diag", but I could see room for dispute. Another case which has occurred to me would be to run periodic integrity checks. I had considered that at various points in the past, but kept running against the question of how to prevent it from running continuously in the worst-case scenarios. If the technique used here works, then I think a very similar approach would work for rate-limiting integrity checks.
Still LGTM https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.cc#newco... sql/connection.cc:1640: breadcrumb_path.AddExtension(FILE_PATH_LITERAL("new")); Fair enough https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h File sql/connection.h (right): https://codereview.chromium.org/1393393007/diff/20001/sql/connection.h#newcod... sql/connection.h:478: // each unique histogram tag. I like this function name better. Thanks for changing https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc#newco... sql/connection.cc:262: // Any error reading or writing the file is considered failure. This seems to be covered by the comment in the .h file https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc#newco... sql/connection.cc:265: // tags for databases which have been dumped. It is unclear what "Top level" refers to. How about: "The sqlite-diag file contains a dictionary with the version number and an array of ..." https://codereview.chromium.org/1393393007/diff/60001/sql/connection.cc#newco... sql/connection.cc:290: db_path.DirName().Append(FILE_PATH_LITERAL("sqlite-diag"))); Fair enough
Another comment tweak
shess@chromium.org changed reviewers: + sky@chromium.org
Thanks for the review! Adding sky@ as owner on thumbnail_database.cc. This is my alternative to that other review which just disabled diagnostic dumps entirely.
https://codereview.chromium.org/1393393007/diff/80001/components/history/core... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/80001/components/history/core... components/history/core/browser/thumbnail_database.cc:259: if (rand < kCorruptReportsPerMillion && db->RegisterIntentToUpload()) Naming is hard, but looking at this code it isn't clear why something called 'register intent to upload' would dictate whether an error is reported. 'register intent to upload' reads as a verb, and nothing something that you need to check the return value from or dictates why it would effect reporting an error. OTOH if RegisterIntentToUpload was named ShouldReportError, then this code makes total sense. WDYT?
https://codereview.chromium.org/1393393007/diff/80001/components/history/core... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/80001/components/history/core... components/history/core/browser/thumbnail_database.cc:259: if (rand < kCorruptReportsPerMillion && db->RegisterIntentToUpload()) On 2015/10/16 15:07:24, sky wrote: > Naming is hard, but looking at this code it isn't clear why something called > 'register intent to upload' would dictate whether an error is reported. > 'register intent to upload' reads as a verb, and nothing something that you need > to check the return value from or dictates why it would effect reporting an > error. OTOH if RegisterIntentToUpload was named ShouldReportError, then this > code makes total sense. WDYT? Previously it was more like ShouldReportError(), but that seemed confusing (to both of us) that it had a side effect which changed the future return value. I changed to this under the theory that if it failed to register intent, then it shouldn't report. It is shaped a little like a test-and-set function, but that didn't really help with naming. I suppose it doesn't have to be that way, it could have separate query and set calls, maybe with the contents cached between them. But the set would have to happen before the report, otherwise failure to set wouldn't prevent uploads, so the code would look like: if (one_test && db->ShouldReportError()) { db->GoingToReportError(); ReportError(...); } Peter, WDYT? Back to ShouldReportError(), side effect and all? Long-term,
if (one_test && db->ShouldReportError()) { db->GoingToReportError(); ReportError(...); } sounds good to me. I think it is the clearest solution. It is annoying because ShouldReportError() and GoingToReportError() are going to do a lot of duplicate work. However, given that ReportError() will be called at most once for each histogram_tag_ I think that this is ok. I think that it is possible to have helper methods such that ShouldReportError() and GoingToReportError() do not duplicate too much code.
On 2015/10/16 17:35:07, pkotwicz wrote: > if (one_test && db->ShouldReportError()) { > db->GoingToReportError(); > ReportError(...); > } > > sounds good to me. I think it is the clearest solution. Bother, this was wrong, it needs to check the return value of the new method, because otherwise it is likely to end up back in the stream-of-errors case. So it would look like: if (one_test && db->ShouldReportError()) { if (db->GoingToReportError()) ReportError(...); } or something like that, which feels just as bad as the current code. As a short-term solution, what if I push the DumpWithoutCrashing() helper into sql::Connection, and have it handle this test internally? The ReportError() data collection isn't very expensive at all, so running that for no reason won't hurt. ReportCorrupt() is more expensive, but it's only a negative when there's corruption, the corruption isn't fixed by recovery, and the reporting code can't report it.
That sounds like a reasonable proposal. ReportError() and ReportCorrupt() should run at most once per "run of Chrome" because of the |reported| static variable.
Patchset #6 (id:100001) has been deleted
Shift diagnostic generation and reporting to sql::Connection.
On 2015/10/16 18:33:14, pkotwicz wrote: > That sounds like a reasonable proposal. ReportError() and ReportCorrupt() should > run at most once per "run of Chrome" because of the |reported| static variable. OK, PS6 is trying it out. Apologies if things look like they could be improved further - I found myself worried that I was letting one more refactor stand in the way of just getting things improved right now (WRT the crash server noise) without messing up too much. I was thinking "Maybe I should test the collection routines", but found myself unable to really quantify what I'd want to test. The only thing I'm positive about with those is that I'm going to change them. Hmm, I guess that I could write basic "function runs to completion without crashing" tests. Those will be obvious, I'll pop them up in a few minutes. https://codereview.chromium.org/1393393007/diff/120001/components/history/cor... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/120001/components/history/cor... components/history/core/browser/thumbnail_database.cc:141: db->ReportDiagnosticInfo(extended_error, stmt); I considered pushing this into sql::Connection, but the most obvious place was the intent-to-report function, which would then require adaptation for testing purposes. Since I'm not even certain this is the right solution, I decided to leave it near where it already was. Previously, the goal of this threshold was to keep the daily load approachable. Now it has changed to spread the incoming bump over time. I think it is probably not perfect for this, but it is probably Good Enough For Now. My major concern is actually that the report rate will be too low, but I still need to grind through my backlog of data from this, so there's time to tweak going forward. https://codereview.chromium.org/1393393007/diff/120001/components/history/cor... components/history/core/browser/thumbnail_database.cc:420: GenerateDiagnostics(db, extended_error, stmt); Removed |startup_kb| because DbPath() lets it just query the size as needed. Previously |stmt| wasn't logged on the theory that the backtrace would provide it - I've found that to be just annoying enough to decode sometimes that I think it was a false savings. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:275: char debug_buf[2000]; SQLITE_ERROR uploads for Favicons tend to be in the range of 1100 bytes, which includes the schema, so there's not much incentive to increase this at this time. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:743: base::StringAppendF(&debug_info, "reported error: %d\n", error); Previously just recorded GetErrorCode(). I couldn't convince myself that that was 100% safe. It should be, though, which is why I had it record only if different. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:746: // from posix. This is slightly different from before. Detecting how to interpret this number before was annoying, with this the source is easier to distinguish. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:760: } Previously the statement wasn't recorded, under the theory that the backtrace would provide it. It usually does, except when it's ambiguous and the line number doesn't match trunk any more. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:767: int rc = sqlite3_prepare_v2(db_, kVersionSql, -1, &s, nullptr); I changed this to be direct sqlite3 access, because most sql::Connection functions are undefined when errors are happening. Square that consideration because this is running in the context of an error callback. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:814: // prudent to rewrite in terms of SQLite API calls, and mark the function const. I didn't do this at this time because it seemed more intrusive than the CollectErrorInfo() case. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:843: "integrity_check %" PRId64 " ms, %" PRIuS " records:\n", The timed value was previous reported as PRIx64. <eyeroll/> https://codereview.chromium.org/1393393007/diff/120001/sql/connection.h File sql/connection.h (right): https://codereview.chromium.org/1393393007/diff/120001/sql/connection.h#newco... sql/connection.h:515: void AssertIOAllowed() const { So it can be used from const methods. https://codereview.chromium.org/1393393007/diff/120001/sql/connection_unittes... File sql/connection_unittest.cc (left): https://codereview.chromium.org/1393393007/diff/120001/sql/connection_unittes... sql/connection_unittest.cc:139: } // namespace sql Slight namespace rearrangement to allow test-friending for RegisterIntentToUpload. https://codereview.chromium.org/1393393007/diff/120001/sql/connection_unittes... File sql/connection_unittest.cc (right): https://codereview.chromium.org/1393393007/diff/120001/sql/connection_unittes... sql/connection_unittest.cc:212: // SQLite function to adjust mock time by |argv[0]| milliseconds. Moved here with no changes to keep it in an anonymous namespace.
Collect{Error/Corruption}Info test
On 2015/10/19 22:23:41, Scott Hess wrote: > I was thinking "Maybe I should test the collection routines", but found myself > unable to really quantify what I'd want to test. The only thing I'm positive > about with those is that I'm going to change them. Hmm, I guess that I could > write basic "function runs to completion without crashing" tests. Those will be > obvious, I'll pop them up in a few minutes. This is in PS#7. https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/120001/sql/connection.cc#newc... sql/connection.cc:799: error = false; Oops.
Mojo can't get the size, sigh.
LGTM https://codereview.chromium.org/1393393007/diff/160001/components/history/cor... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/160001/components/history/cor... components/history/core/browser/thumbnail_database.cc:139: uint64 rand = base::RandGenerator(1000000); Nit: Would it be clearer if the RandGenerator had a max value of 100? https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc#newc... sql/connection.cc:277: debug_buf[arraysize(debug_buf) - 1] = '\0'; This null assignment to the last character seems unnecessary based on the implementation of base::strlcpy() https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc#newc... sql/connection.cc:825: db_size = -1; Nit: Setting |db_size| to -1 seems redundant according to source of GetFileSize()
LGTM
review nits.
Thanks. https://codereview.chromium.org/1393393007/diff/160001/components/history/cor... File components/history/core/browser/thumbnail_database.cc (right): https://codereview.chromium.org/1393393007/diff/160001/components/history/cor... components/history/core/browser/thumbnail_database.cc:139: uint64 rand = base::RandGenerator(1000000); On 2015/10/21 15:23:33, pkotwicz wrote: > Nit: Would it be clearer if the RandGenerator had a max value of 100? Done. Originally I wasn't sure if it would want to go below 1% for certain types of errors, but that wasn't necessary. I don't think it is likely to become necessary, but it can always be changed again if so. https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc File sql/connection.cc (right): https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc#newc... sql/connection.cc:277: debug_buf[arraysize(debug_buf) - 1] = '\0'; On 2015/10/21 15:23:33, pkotwicz wrote: > This null assignment to the last character seems unnecessary based on the > implementation of base::strlcpy() Done. When reviewing the strlcat() section of the man page confused me on this issue. https://codereview.chromium.org/1393393007/diff/160001/sql/connection.cc#newc... sql/connection.cc:825: db_size = -1; On 2015/10/21 15:23:33, pkotwicz wrote: > Nit: Setting |db_size| to -1 seems redundant according to source of > GetFileSize() Header comment says "Returns file size. Returns true on success." Not excited to rely on undefined behavior. The previous code recorded the size at open, so this case wasn't possible. I'm inclined to think that if this case ever happens, the system is beyond help, so no useful information would be generated - if the file doesn't exist, the corrupt is solved, if the code can't access the file, how can it fix it? Modifying to just bail out in this case.
Bail out EVEN MORE.
The CQ bit was checked by shess@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkotwicz@chromium.org, sky@chromium.org Link to the patchset: https://codereview.chromium.org/1393393007/#ps200001 (title: "Bail out EVEN MORE.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1393393007/200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1393393007/200001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_chromium_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
Oh, mojo
The CQ bit was checked by shess@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkotwicz@chromium.org, sky@chromium.org Link to the patchset: https://codereview.chromium.org/1393393007/#ps220001 (title: "Oh, mojo")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1393393007/220001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1393393007/220001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: win_chromium_rel_ng on tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
The CQ bit was checked by shess@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1393393007/220001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1393393007/220001
Message was sent while issue was closed.
Committed patchset #11 (id:220001)
Message was sent while issue was closed.
Patchset 11 (id:??) landed as https://crrev.com/c8cd2a161947886dab169caa0aa656230f2928ec Cr-Commit-Position: refs/heads/master@{#355632} |