OLD | NEW |
(Empty) | |
| 1 <!-- |
| 2 Copyright 2017 The Crashpad Authors. All rights reserved. |
| 3 |
| 4 Licensed under the Apache License, Version 2.0 (the "License"); |
| 5 you may not use this file except in compliance with the License. |
| 6 You may obtain a copy of the License at |
| 7 |
| 8 http://www.apache.org/licenses/LICENSE-2.0 |
| 9 |
| 10 Unless required by applicable law or agreed to in writing, software |
| 11 distributed under the License is distributed on an "AS IS" BASIS, |
| 12 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 13 See the License for the specific language governing permissions and |
| 14 limitations under the License. |
| 15 --> |
| 16 |
| 17 # Crashpad Overview Design |
| 18 |
| 19 [TOC] |
| 20 |
| 21 ## Objective |
| 22 |
| 23 Crashpad is a library for capturing, storing and transmitting postmortem crash |
| 24 reports from a client to an upstream collection server. Crashpad aims to make it |
| 25 possible for clients to capture process state at the time of crash with the best |
| 26 possible fidelity and coverage, with the minimum of fuss. |
| 27 |
| 28 Crashpad also provides a facility for clients to capture dumps of process state |
| 29 on-demand for diagnostic purposes. |
| 30 |
| 31 Crashpad additionally provides minimal facilities for clients to adorn their |
| 32 crashes with application-specific metadata in the form of per-process key/value |
| 33 pairs. More sophisticated clients are able to adorn crash reports further |
| 34 through extensibility points that allow the embedder to augment the crash report |
| 35 with application-specific metadata. |
| 36 |
| 37 ## Background |
| 38 |
| 39 It’s an unfortunate truth that any large piece of software will contain bugs |
| 40 that will cause it to occasionally crash. Even in the absence of bugs, software |
| 41 incompatibilities can cause program instability. |
| 42 |
| 43 Fixing bugs and incompatibilities in client software that ships to millions of |
| 44 users around the world is a daunting task. User reports and manual reproduction |
| 45 of crashes can work, but even given a user report, the problem is often not |
| 46 readily reproducible. Reasons include system version or third-party software |
| 47 incompatibilities, or the problem may depend on a race of some sort. Users are |
| 48 also unlikely to report problems they |
| 49 encounter, and user reports are often of poor quality, as unfortunately most |
| 50 users don’t have experience with making good bug reports. |
| 51 |
| 52 Automatic crash telemetry has been the best solution to the problem so far, as |
| 53 this relieves the burden of manual reporting from users, while capturing the |
| 54 hardware and software state at the time of crash. |
| 55 |
| 56 TODO(siggi): examples of this? |
| 57 |
| 58 Crash telemetry involves capturing postmortem crash dumps and transmitting them |
| 59 to a backend collection server. On the server they can be stackwalked and |
| 60 symbolized, and evaluated and aggregated in various ways. Stackwalking and |
| 61 symbolizing the reports on an upstream server has several benefits over |
| 62 performing these tasks on the client. High-fidelity stackwalking requires access |
| 63 to bulky unwind data, and it may be desirable to not ship this to end users out |
| 64 of concern for the application size. The process of symbolization requires |
| 65 access to debugging symbols, which can be quite large, and the symbolization |
| 66 process can consume considerable other resources. Transmitting un-stackwalked |
| 67 and un-symbolized postmortem dumps to the collection server also allows deep |
| 68 analysis of individual dumps, which is often necessary to resolve the bug |
| 69 causing the crash. |
| 70 |
| 71 Transmitting reports to the collection server allows aggregating crashes by |
| 72 cause, which in turn allows assessing the importance of different crashes in |
| 73 terms of the occurrence rate and e.g. the potential security impact. |
| 74 |
| 75 A postmortem crash dump must contain the program state at the time of crash |
| 76 with sufficient fidelity to allow diagnosing and fixing the problem. As the full |
| 77 program state is usually too large to transmit to an upstream server, the |
| 78 postmortem dump captures a heuristic subset of the full state. |
| 79 |
| 80 The crashed program is in an indeterminate state and, in fact, has often crashed |
| 81 because of corrupt global state, such as a corrupt heap. It’s therefore important to |
| 82 generate crash reports with as little execution in the crashed process as |
| 83 possible. Different operating systems vary in the facilities they provide for |
| 84 this. |
| 85 |
| 86 ## Overview |
| 87 |
| 88 Crashpad is a client-side library that focuses on capturing machine and program |
| 89 state in a postmortem crash report, and transmitting this report to a backend |
| 90 server - a “collection server”. The Crashpad library is embedded by the client |
| 91 application. Conceptually, Crashpad breaks down into the handler and the client. |
| 92 The handler runs in a separate process from the client or clients. It is |
| 93 responsible for snapshotting the crashing client process’ state on a crash, |
| 94 saving it to a crash dump, and transmitting the crash dump to an upstream |
| 95 server. Clients register with the handler to allow it to capture and upload |
| 96 their crashes. |
| 97 |
| 98 ### The Crashpad handler |
| 99 |
| 100 The Crashpad handler is instantiated in a process supplied by the embedding |
| 101 application. It lets clients register themselves through some form of IPC or, |
| 102 where operating system support is available, by taking advantage of |
| 103 such support to cause crash notifications to be delivered to the handler. On |
| 104 crash, the handler snapshots the crashed client process’ state, writes it to a |
| 105 postmortem dump in a database, and may also transmit the dump to an upstream |
| 106 server if so configured. |
| 107 |
| 108 The Crashpad handler is able to handle cross-bitted requests and generate crash |
| 109 dumps across bitness, where e.g. the handler is a 64-bit process while the |
| 110 client is a 32-bit process or vice versa. In the case of Windows, this is |
| 111 limited by the OS such that a 32-bit handler can only generate crash dumps for |
| 112 32-bit clients, but a 64-bit handler can acquire nearly all of the detail for a |
| 113 32-bit process. |
| 114 |
| 115 ### The Crashpad client |
| 116 |
| 117 The Crashpad client provides two main facilities. |
| 118 1. Registration with the Crashpad handler. |
| 119 2. Metadata communication to the Crashpad handler on crash. |
| 120 |
| 121 A Crashpad embedder links the Crashpad client library into one or more |
| 122 executables, whether a loadable library or a program file. The client process |
| 123 then registers with the Crashpad handler through some mode of IPC or other |
| 124 operating system-specific support. |
| 125 |
| 126 On crash, metadata is communicated to the Crashpad handler via the CrashpadInfo |
| 127 structure. Each client executable module linking the Crashpad client library |
| 128 embeds a CrashpadInfo structure, which can be updated by the client with |
| 129 whatever state the client wishes to record with a crash. |
| 130 |
| 131 ![Overview image](overview.png) |
| 132 |
| 133 Here is an overview picture of the conceptual relationships between embedder (in |
| 134 light blue), client modules (darker blue), and Crashpad (in green). Note that |
| 135 multiple client modules can contain a CrashpadInfo structure, but only one |
| 136 registration is necessary. |
| 137 |
| 138 ## Detailed Design |
| 139 |
| 140 ### Requirements |
| 141 |
| 142 The purpose of Crashpad is to capture machine, OS and application state in |
| 143 sufficient detail and fidelity to allow developers to diagnose and, where |
| 144 possible, fix the issue causing the crash. |
| 145 |
| 146 Each distinct crash report is assigned a globally unique ID, so that users can |
| 147 associate it with a user report, cite it in bug reports and so on. |
| 148 |
| 149 It’s critical to safeguard the user’s privacy by ensuring that no crash report |
| 150 is ever uploaded without user consent. Likewise it’s important to ensure that |
| 151 Crashpad never captures or uploads reports from non-client processes. |
| 152 |
| 153 ### Concepts |
| 154 |
| 155 * **Client ID**. A UUID tied to a single instance of a Crashpad database. When |
| 156 creating a crash report, the Crashpad handler includes the client ID stored |
| 157 in the database. This provides a means to determine how many individual end |
| 158 users are affected by a specific crash signature. |
| 159 |
| 160 * **Crash ID**. A UUID representing a single crash report. Uploaded crash |
| 161 reports also receive a “server ID.” The Crashpad database indexes both the |
| 162 locally-generated and server-generated IDs. |
| 163 |
| 164 * **Collection Server**. See [crash server documentation.]( |
| 165 https://goto.google.com/crash-server-overview) |
| 166 |
| 167 * **Client Process**. Any process that has registered with a Crashpad handler. |
| 168 |
| 169 * **Handler process**. A process hosting the Crashpad handler library. This may |
| 170 be a dedicated executable, or it may be hosted within a client executable |
| 171 with control passed to it based on special signaling under the client’s |
| 172 control, such as a command-line parameter. |
| 173 |
| 174 * **CrashpadInfo**. A structure used by client modules to provide information to |
| 175 the handler. |
| 176 |
| 177 * **Annotations**. Each CrashpadInfo structure points to a dictionary of |
| 178 {string, string} annotations that the client can use to communicate |
| 179 application state in the case of crash. |
| 180 |
| 181 * **Database**. The Crashpad database contains persistent client settings as |
| 182 well as crash dumps pending upload. |
| 183 |
| 184 TODO(siggi): moar concepts? |
| 185 |
| 186 ### Overview Picture |
| 187 |
| 188 Here is a rough overview picture of the various Crashpad constructs, their |
| 189 layering and intended use by clients. |
| 190 |
| 191 ![Layering image](layering.png) |
| 192 |
| 193 Dark blue boxes are interfaces, light blue boxes are implementation. Gray is the |
| 194 embedding client application. Note that, wherever possible, implementation that |
| 195 is necessarily OS-specific exposes OS-agnostic interfaces to the rest of |
| 196 Crashpad and the client. |
| 197 |
| 198 ### Registration |
| 199 |
| 200 The particulars of how a client registers with the handler vary across |
| 201 operating systems. |
| 202 |
| 203 #### macOS |
| 204 |
| 205 At registration time, the client designates a Mach port monitored by the |
| 206 Crashpad handler as the EXC_CRASH exception port for the client. The port may be |
| 207 acquired by launching a new handler process or by retrieving a service already |
| 208 registered with the system. The registration is maintained by the kernel and is |
| 209 inherited by subprocesses at creation time by default, so only the topmost |
| 210 process of a process tree need register. |
| 211 |
| 212 Crashpad provides a facility for a process to disassociate (unregister) from an |
| 213 existing crash handler, which can be necessary when an older client spawns an |
| 214 updated version. |
| 215 |
| 216 #### Windows |
| 217 |
| 218 There are two modes of registration on Windows. In both cases the handler is |
| 219 advised of the address of a set of structures in the client process’ address |
| 220 space. These structures include a pair of ExceptionInformation structs, one for |
| 221 generating a postmortem dump for a crashing process, and another one for |
| 222 generating a dump for a non-crashing process. |
| 223 |
| 224 ##### Normal registration |
| 225 |
| 226 In the normal registration mode, the client connects to a named pipe by a |
| 227 pre-arranged name. A registration request is written to the pipe. During |
| 228 registration, the handler creates a set of events, duplicates them to the |
| 229 registering client, then returns the handle values in the registration response. |
| 230 This is a blocking process. |
| 231 |
| 232 ##### Initial Handler Creation |
| 233 |
| 234 In order to avoid blocking client startup for the creation and initialization of |
| 235 the handler, a different mode of registration can be used for the handler |
| 236 creation. In this mode, the client creates a set of event handles and inherits |
| 237 them into the newly created handler process. The handler process is advised of |
| 238 the handle values and the location of the ExceptionInformation structures by way |
| 239 of command line arguments in this mode. |
| 240 |
| 241 #### Linux/Android |
| 242 |
| 243 TODO(mmentovai): describe this. See this preliminary doc. |
| 244 |
| 245 ### Capturing Exceptions |
| 246 |
| 247 The details of how Crashpad captures the exceptions leading to crashes vary |
| 248 between operating systems. |
| 249 |
| 250 #### macOS |
| 251 |
| 252 On macOS, the operating system will notify the handler of client crashes via the |
| 253 Mach port set as the client process’ exception port. As exceptions are |
| 254 dispatched to the Mach port by the kernel, on macOS, exceptions can be handled |
| 255 entirely from the Crashpad handler without the need to run any code in the |
| 256 crashing process at the time of the exception. |
| 257 |
| 258 #### Windows |
| 259 |
| 260 On Windows, the OS dispatches exceptions in the context of the crashing thread. |
| 261 To notify the handler of exceptions, the Crashpad client registers an |
| 262 UnhandledExceptionFilter (UEF) in the client process. When an exception trickles |
| 263 up to the UEF, it stores the exception information and the crashing thread’s ID |
| 264 in the ExceptionInformation structure registered with the handler. It then sets |
| 265 an event handle to signal the handler to go ahead and process the exception. |
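
The ordering described above (store the exception details first, then signal) can be sketched as follows. This is only an illustrative model: the real mechanism crosses a process boundary and uses Windows event handles, whereas this sketch uses two threads and `threading.Event`. The class and function names are hypothetical and are not Crashpad's API.

```python
import threading
from dataclasses import dataclass


@dataclass
class ExceptionInformation:
    """Hypothetical stand-in for the structure registered with the handler."""
    exception_code: int = 0
    thread_id: int = 0


class Handler(threading.Thread):
    """Models the handler side: wait for the signal, then read the structure."""

    def __init__(self, info, crash_event):
        super().__init__()
        self.info = info
        self.crash_event = crash_event
        self.captured = None

    def run(self):
        # Block until the client signals that exception data is ready.
        self.crash_event.wait()
        self.captured = (self.info.exception_code, self.info.thread_id)


def unhandled_exception_filter(info, crash_event, exception_code, thread_id):
    # Record what happened first, then signal; the handler only reads
    # the structure after it has been woken up.
    info.exception_code = exception_code
    info.thread_id = thread_id
    crash_event.set()


info = ExceptionInformation()
crash_event = threading.Event()
handler = Handler(info, crash_event)
handler.start()
unhandled_exception_filter(info, crash_event, 0xC0000005, 4242)
handler.join()
```

Writing the data before setting the event is what lets the handler trust the contents of the structure once it wakes.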
| 266 |
| 267 ##### Caveats |
| 268 |
| 269 * If the crashing thread’s stack is smashed when an exception occurs, the |
| 270 exception cannot be dispatched. In this case the OS will summarily terminate |
| 271 the process, without the handler having an opportunity to generate a crash |
| 272 report. |
| 273 * If an exception is handled in the crashing thread, it will never propagate |
| 274 to the UEF, and thus a crash report won’t be generated. This is fairly |
| 275 common on Windows, as system libraries will often dispatch callbacks under a |
| 276 structured exception handler. This occurs during window message dispatching |
| 277 on some system configurations, as well as during e.g. DLL entry point |
| 278 notifications. |
| 279 * A growing number of conditions in the system and runtime exist where |
| 280 detected corruption or illegal calls result in summary termination of the |
| 281 process, in which case no crash report will be generated. |
| 282 |
| 283 ###### Out-Of-Process Exception Handling |
| 284 |
| 285 There exists a mechanism in Windows Error Reporting (WER) that allows a client |
| 286 process to register for handling client exceptions out of the crashing process. |
| 287 Unfortunately this mechanism is difficult to use, and doesn’t provide coverage |
| 288 for many of the caveats above. [Details |
| 289 here.](https://crashpad.chromium.org/bug/133) |
| 290 |
| 291 #### Linux/Android |
| 292 |
| 293 TODO(mmentovai): describe this. See [this preliminary |
| 294 doc.](https://goto.google.com/crashpad-android-dd) |
| 295 |
| 296 ### The CrashpadInfo structure |
| 297 |
| 298 The CrashpadInfo structure is used to communicate information from the client to |
| 299 the handler. Each executable module in a client process can contain a |
| 300 CrashpadInfo structure. On a crash, the handler crawls all modules in the |
| 301 crashing process to locate all CrashpadInfo structures present. The CrashpadInfo |
| 302 structures are linked into a special, named section of the executable, where the |
| 303 handler can readily find them. |
| 304 |
| 305 The CrashpadInfo structure has a magic signature, and contains a size and a |
| 306 version field. The intent is to allow backwards compatibility from older client |
| 307 modules to a newer handler. It may also be necessary to provide forwards |
| 308 compatibility from newer clients to an older handler, though this hasn’t occurred |
| 309 yet. |
| 310 |
| 311 The CrashpadInfo structure contains such properties as the cap for how much |
| 312 memory to include in the crash dump, some tristate flags for controlling the |
| 313 handler’s behavior, a pointer to an annotation dictionary and so on. |
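
As a rough illustration of how a magic signature, a size and a version field can be used together, here is a minimal sketch of a reader that validates such a header before trusting the rest of the structure. The field layout, the magic value and the `TriState` numbering are all assumptions made for this example; they are not Crashpad's actual definitions.

```python
import struct
from enum import IntEnum

# Hypothetical layout for illustration only: magic, size and version
# (little-endian uint32 each), followed by a single tristate byte.
HEADER = struct.Struct("<IIIB")
MAGIC = 0x43504144  # arbitrary example value, not Crashpad's real signature
SUPPORTED_VERSION = 1


class TriState(IntEnum):
    UNSET = 0
    ENABLED = 1
    DISABLED = 2


def parse_info(blob):
    """Return (version, flag), or None if the blob is not a usable structure."""
    if len(blob) < HEADER.size:
        return None
    magic, size, version, flag = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return None
    # Reject structures that claim to be smaller than the fields read here,
    # or larger than the bytes that are actually available.
    if size < HEADER.size or size > len(blob):
        return None
    if version > SUPPORTED_VERSION:
        return None  # written by a client newer than this reader understands
    try:
        return version, TriState(flag)
    except ValueError:
        return None  # flag byte is not one of the three known states
```

A reader built this way can ignore a structure it does not recognize instead of misinterpreting it.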
| 314 |
| 315 ### Snapshot |
| 316 |
| 317 Snapshot is a layer of interfaces that represent the machine and OS entities |
| 318 that Crashpad cares about. Different concrete implementations of snapshot can |
| 319 then be backed in different ways, for example by the in-memory representation of |
| 320 a crashed process, or by the contents of a minidump. |
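
The idea of one interface with several backings can be sketched as below. The class names and the two accessor methods are hypothetical and far smaller than a real snapshot interface would be; the point is only that consumers are written against the abstract type.

```python
from abc import ABC, abstractmethod


class ProcessSnapshot(ABC):
    """Hypothetical OS-agnostic view of a process, for illustration only."""

    @abstractmethod
    def module_names(self):
        ...

    @abstractmethod
    def thread_ids(self):
        ...


class LiveProcessSnapshot(ProcessSnapshot):
    """Backed by data read from a (here simulated) crashed process."""

    def __init__(self, modules, threads):
        self._modules = list(modules)
        self._threads = list(threads)

    def module_names(self):
        return list(self._modules)

    def thread_ids(self):
        return list(self._threads)


class MinidumpProcessSnapshot(ProcessSnapshot):
    """Backed by an already-parsed dump, modelled here as a plain dict."""

    def __init__(self, parsed):
        self._parsed = parsed

    def module_names(self):
        return [m["name"] for m in self._parsed["modules"]]

    def thread_ids(self):
        return [t["id"] for t in self._parsed["threads"]]


def summarize(snapshot):
    # Consumers depend only on the interface, not on how it is backed.
    return f"{len(snapshot.module_names())} modules, {len(snapshot.thread_ids())} threads"
```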
| 321 |
| 322 ### Crash Dump Creation |
| 323 |
| 324 To create a crash dump, a subset of the machine, OS and application state is |
| 325 grabbed from the crashed process into an in-memory snapshot structure in the |
| 326 handler process. Since the full application state is typically too large for |
| 327 capturing to disk and transmitting to an upstream server, the snapshot contains |
| 328 a heuristically selected subset of the full state. |
| 329 |
| 330 The precise details of what’s captured vary between operating systems, but |
| 331 generally include the following: |
| 332 * The set of modules (executable, shared libraries) that are loaded into the |
| 333 crashing process. |
| 334 * An enumeration of the threads running in the crashing process, including the |
| 335 register contents and the contents of stack memory of each thread. |
| 336 * A selection of the OS-related state of the process, such as the command |
| 337 line, environment and so on. |
| 338 * A selection of memory potentially referenced from registers and from stack. |
| 339 |
| 340 To capture a crash dump, the crashing process is first suspended, then a |
| 341 snapshot is created in the handler process. The snapshot includes the |
| 342 CrashpadInfo structures of the modules loaded into the process, and their contents |
| 343 are used to control the level of detail captured for the crash dump. |
| 344 |
| 345 Once the snapshot has been constructed, it is then written to a minidump file, |
| 346 which is added to the database. The process is un-suspended after the minidump |
| 347 file has been written. In the case of a crash (as opposed to a client request to |
| 348 produce a dump without crashing), it is then killed either by the operating |
| 349 system or by the Crashpad handler. |
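
The order of operations described in this section (suspend, snapshot, write, resume, then terminate only in the crash case) can be modelled with a small sketch. `FakeProcess` and `capture_dump` are hypothetical names; the sketch records the sequence of steps rather than touching a real process.

```python
from contextlib import contextmanager


class FakeProcess:
    """Stand-in for a client process; records the order of operations."""

    def __init__(self):
        self.log = []

    def suspend(self):
        self.log.append("suspend")

    def resume(self):
        self.log.append("resume")

    def kill(self):
        self.log.append("kill")


@contextmanager
def suspended(process):
    # Resume the process even if snapshotting or writing fails.
    process.suspend()
    try:
        yield
    finally:
        process.resume()


def capture_dump(process, database, is_crash):
    with suspended(process):
        process.log.append("snapshot")
        snapshot = {"modules": ["app"], "threads": [1]}
        process.log.append("write")
        database.append(snapshot)
    # Only a crashed process is terminated; an on-demand dump lets it continue.
    if is_crash:
        process.kill()
```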
| 350 |
| 351 In general the snapshotting process has to be very intimate with the operating |
| 352 system it’s working with, so there will be a set of concrete implementation |
| 353 classes, many deriving from the snapshot interfaces, doing this for each |
| 354 operating system. |
| 355 |
| 356 ### Minidump |
| 357 |
| 358 The minidump implementation is responsible for writing a snapshot to a |
| 359 serialized on-disk file in the minidump format. The minidump implementation is |
| 360 OS-agnostic, as it works on an OS-agnostic Snapshot interface. |
| 361 |
| 362 TODO(siggi): Talk about two-phase writes and contents ordering here. |
| 363 |
| 364 ### Database |
| 365 |
| 366 The Crashpad database contains persistent client settings, including a unique |
| 367 crash client identifier and the upload-enabled bit. Note that the crash client |
| 368 identifier is assigned by Crashpad, and is distinct from any identifiers the |
| 369 client application uses to identify users, installs, machines or such - if any. |
| 370 The expectation is that the client application will manage the user’s upload |
| 371 consent, and inform Crashpad of changes in consent. |
| 372 |
| 373 The unique client identifier is set at the time of database creation. It is then |
| 374 recorded into every crash report collected by the handler and communicated to |
| 375 the upstream server. |
| 376 |
| 377 The database stores a configurable number of recorded crash dumps to a |
| 378 configurable maximum aggregate size. For each crash dump it stores annotations |
| 379 relating to whether the crash dumps have been uploaded. For successfully |
| 380 uploaded crash dumps it also stores their server-assigned ID. |
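
One way to enforce both a maximum report count and a maximum aggregate size is to keep the newest reports until either limit would be exceeded. The sketch below assumes that policy; the text above does not say which reports are discarded first, so the newest-first ordering and the names `StoredReport` and `prune` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class StoredReport:
    report_id: str
    created: float  # seconds since the epoch
    size: int       # bytes on disk


def prune(reports, max_count, max_total_bytes):
    """Return the reports to keep, preferring the newest ones."""
    kept = []
    total = 0
    for report in sorted(reports, key=lambda r: r.created, reverse=True):
        # Stop at the first report that would break either limit; it and
        # everything older than it is dropped.
        if len(kept) >= max_count or total + report.size > max_total_bytes:
            break
        kept.append(report)
        total += report.size
    return kept
```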
| 381 |
| 382 The database consists of a settings file, named "settings.dat" with binary |
| 383 contents (see crashpad::Settings::Data for the file format), as well as a |
| 384 directory containing the crash dumps. Additionally each crash dump is adorned |
| 385 with properties relating to the state of the dump for upload and such. The |
| 386 details of how these properties are stored vary between platforms. |
| 387 |
| 388 #### macOS |
| 389 |
| 390 The macOS implementation simply stores database properties on the minidump files |
| 391 in filesystem extended attributes. |
| 392 |
| 393 #### Windows |
| 394 |
| 395 The Windows implementation stores database properties in a binary file named |
| 396 “metadata” at the top level of the database directory. |
| 397 |
| 398 ### Report Format |
| 399 |
| 400 Crash reports are recorded in the Windows minidump format with |
| 401 extensions to support Crashpad additions, such as Annotations. |
| 402 |
| 403 ### Upload to collection server |
| 404 |
| 405 #### Wire Format |
| 406 |
| 407 For the time being, Crashpad uses the Breakpad wire protocol, which is |
| 408 essentially a MIME multipart message communicated over HTTP(S). To support this, |
| 409 the annotations from all the CrashpadInfo structures found in the crashing |
| 410 process are merged to create the Breakpad “crash keys” as form data. The |
| 411 postmortem minidump is then attached as an “application/octet-stream” |
| 412 attachment with the name “upload_file_minidump”. The entirety of the request |
| 413 body, including the minidump, can be gzip-compressed to reduce transmission time |
| 414 and increase transmission reliability. Note that by convention there is a set of |
| 415 “crash keys” that are used to communicate the product, version, client ID and |
| 416 other relevant data about the client, to the server. Crashpad normally stores |
| 417 these values in the minidump file itself, but retrieves them from the minidump |
| 418 and supplies them as form data for compatibility with the Breakpad-style server. |
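
To make the shape of such a request concrete, here is a minimal sketch that merges annotation dictionaries into form fields and attaches a minidump under the name `upload_file_minidump`. The helper names, the `minidump.dmp` filename, the boundary string and the rule that later dictionaries win on conflicting keys are assumptions made for the example, and the minidump bytes are a placeholder rather than a valid file.

```python
import gzip


def merge_annotations(dictionaries):
    """Merge per-module annotations; later dictionaries win on conflict."""
    merged = {}
    for annotations in dictionaries:
        merged.update(annotations)
    return merged


def build_multipart(crash_keys, minidump, boundary):
    """Build a multipart/form-data body: one part per key, then the dump."""
    out = bytearray()
    for key, value in crash_keys.items():
        out += f"--{boundary}\r\n".encode()
        out += f'Content-Disposition: form-data; name="{key}"\r\n\r\n'.encode()
        out += value.encode() + b"\r\n"
    out += f"--{boundary}\r\n".encode()
    out += (b'Content-Disposition: form-data; name="upload_file_minidump"; '
            b'filename="minidump.dmp"\r\n')
    out += b"Content-Type: application/octet-stream\r\n\r\n"
    out += minidump + b"\r\n"
    out += f"--{boundary}--\r\n".encode()
    return bytes(out)


keys = merge_annotations([{"prod": "Example"}, {"ver": "1.0"}])
body = build_multipart(keys, b"MDMP", "example-boundary")  # placeholder dump bytes
compressed = gzip.compress(body)  # the whole body is compressed, not just the dump
headers = {
    "Content-Type": "multipart/form-data; boundary=example-boundary",
    "Content-Encoding": "gzip",
}
```

In a real upload the boundary has to be a string that does not occur anywhere in the payload.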
| 419 |
| 420 This is a temporary compatibility measure to allow the current Breakpad-based |
| 421 upstream server to handle Crashpad reports. In the fullness of time, the wire |
| 422 protocol is expected to change to remove this redundant transmission and |
| 423 processing of the Annotations. |
| 424 |
| 425 #### Transport |
| 426 |
| 427 The embedding client controls the URL of the collection server by the command |
| 428 line passed to the handler. The handler can upload crashes with HTTP or HTTPS, |
| 429 depending on the client’s preference. It’s strongly suggested to use HTTPS transport |
| 430 for crash uploads to protect the user’s privacy against man-in-the-middle |
| 431 snoopers. |
| 432 |
| 433 TODO(mmentovai): Certificate pinning. |
| 434 |
| 435 #### Throttling & Retry Strategy |
| 436 |
| 437 To protect the collection server from DDoS, and to protect |
| 438 clients from unreasonable data transfer demands, the handler implements a |
| 439 client-side throttling strategy. At the moment, the strategy is very simplistic: |
| 440 it limits uploads to one per hour, and failed uploads are aborted rather than retried. |
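
A one-per-hour limit with no retry can be sketched as a small gate that records the time of the last attempt. The class name and the choice to pass the current time in explicitly are assumptions made to keep the example deterministic.

```python
class UploadThrottle:
    """Allows at most one upload attempt per interval; no retry on failure."""

    def __init__(self, min_interval_seconds=3600):
        self.min_interval = min_interval_seconds
        self.last_attempt = None

    def should_upload(self, now):
        """Return True and record the attempt if enough time has passed."""
        if self.last_attempt is not None and now - self.last_attempt < self.min_interval:
            return False
        # The attempt counts against the limit whether or not it succeeds,
        # which matches a strategy without retries.
        self.last_attempt = now
        return True
```

A caller would consult `should_upload` once per pending report and skip the upload whenever it returns `False`.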
| 441 |
| 442 An experiment has been conducted to lift all throttling. Analysis on the |
| 443 aggregate data this produced shows that multiple crashes within a short timespan |
| 444 on the same client are nearly always due to the same cause. Therefore there is |
| 445 very little loss of signal due to the throttling, though the ability to |
| 446 reconstruct at least the full crash count is highly desirable. |
| 447 |
| 448 The lack of retry is expected to [change |
| 449 soon](https://crashpad.chromium.org/bug/23), as this creates blind spots for |
| 450 client crashes that exclusively occur on e.g. network down events, during |
| 451 suspend and resume and such. |
| 452 |
| 453 ### Extensibility |
| 454 |
| 455 Clients are able to extend the generated crash reports in two ways, both by |
| 456 manipulating their CrashpadInfo structure. |
| 457 The two extensibility points are: |
| 458 1. Nominating a set of address ranges for inclusion in the crash report. |
| 459 2. Adding user-defined minidump streams for inclusion in the crash report. |
| 460 |
| 461 In both cases the CrashpadInfo structure has to be updated before a crash |
| 462 occurs. |
| 463 |
| 464 ### Dependencies |
| 465 |
| 466 Aside from system headers and APIs, when used outside of Chromium, Crashpad has |
| 467 a dependency on “mini_chromium”, which is a subset of the Chromium base library. |
| 468 This is to allow non-Chromium clients to use Crashpad, without taking a direct |
| 469 dependency on the Chromium base, while allowing Chromium projects to use |
| 470 Crashpad with minimum code duplication or hassle. When using Crashpad as part of |
| 471 Chromium, Chromium’s own copy of the base library is used instead of |
| 472 mini_chromium. |
| 473 |
| 474 The downside to this is that mini_chromium must be kept up to date with |
| 475 interface and implementation changes in Chromium base, for the subset of |
| 476 functionality used by Crashpad. |
| 477 |
| 478 ## Caveats |
| 479 |
| 480 TODO(anyone): You may need to describe what you did not do or why simpler |
| 481 approaches don't work. Mention other things to watch out for (if any). |
| 482 |
| 483 ## Security Considerations |
| 484 |
| 485 Crashpad may be used to capture the state of sandboxed processes and it writes |
| 486 minidumps to disk. It may therefore straddle security boundaries, so it’s |
| 487 important that Crashpad handle all data it reads out of the crashed process with |
| 488 extreme care. The Crashpad handler takes care to access client address spaces |
| 489 through specially-designed accessors that check pointer validity and enforce |
| 490 accesses within prescribed bounds. The flow of information into the Crashpad |
| 491 handler is exclusively one-way: Crashpad never communicates anything back to |
| 492 its clients, aside from providing single-bit indications of completion. |
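
The kind of accessor described above can be illustrated with a reader that refuses any request falling outside a known region or exceeding a size cap. This is a simplified model over an in-memory byte string; the class name, the `max_length` default and the return-`None`-on-failure convention are assumptions, not Crashpad's implementation.

```python
class BoundedMemoryReader:
    """Reads from a simulated client address space with explicit checks."""

    def __init__(self, base_address, data):
        self.base = base_address
        self.data = bytes(data)

    def read(self, address, length, max_length=4096):
        """Return the requested bytes, or None if the request is not valid."""
        # Lengths come from untrusted data, so cap them before using them.
        if length < 0 or length > max_length:
            return None
        offset = address - self.base
        # Reject ranges that start before or end after the known region.
        if offset < 0 or offset + length > len(self.data):
            return None
        return self.data[offset:offset + length]
```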
| 493 |
| 494 ## Privacy Considerations |
| 495 |
| 496 Crashpad may capture arbitrary contents from a crashed process’ memory, including |
| 497 user IDs and passwords, credit card information, URLs and whatever other content |
| 498 users have trusted the crashing program with. The client program must acquire |
| 499 and honor the user’s consent to upload crash reports, and appropriately manage |
| 500 the upload state in Crashpad’s database. |
| 501 |
| 502 Crashpad must also be careful not to upload crashes for arbitrary processes on |
| 503 the user’s system. To this end, Crashpad will never upload a report for a process that hasn’t |
| 504 registered with the handler, but note that registrations are inherited by child |
| 505 processes on some operating systems. |