Index: third_party/afl/src/docs/technical_details.txt |
diff --git a/third_party/afl/src/docs/technical_details.txt b/third_party/afl/src/docs/technical_details.txt |
index ec789a31999cb807a1063abf0dbca59540bedc43..3ec487414e7d87698cb1933d6b9023186fc58256 100644 |
--- a/third_party/afl/src/docs/technical_details.txt |
+++ b/third_party/afl/src/docs/technical_details.txt |
@@ -85,8 +85,8 @@ of dword- or qword-wide instructions and a simple loop. |
When a mutated input produces an execution trace containing new tuples, the |
corresponding input file is preserved and routed for additional processing |
later on (see section #3). Inputs that do not trigger new local-scale state |
-transitions in the execution trace are discarded, even if their overall |
-instrumentation output pattern is unique. |
+transitions in the execution trace (i.e., produce no new tuples) are discarded, |
+even if their overall control flow sequence is unique. |
This approach allows for a very fine-grained and long-term exploration of |
program state while not having to perform any computationally intensive and |
@@ -101,7 +101,7 @@ new tuples (CA, AE): |
#2: A -> B -> C -> A -> E |
At the same time, with #2 processed, the following pattern will not be seen |
-as unique, despite having a markedly different execution path: |
+as unique, despite having a markedly different overall execution path: |
#3: A -> B -> C -> A -> B -> C -> A -> B -> C -> D -> E |
@@ -142,9 +142,9 @@ Mutated test cases that produced new state transitions within the program are |
added to the input queue and used as a starting point for future rounds of |
fuzzing. They supplement, but do not automatically replace, existing finds. |
-This approach allows the tool to progressively explore various disjoint and |
-possibly mutually incompatible features of the underlying data format, as |
-shown in this image: |
+In contrast to more greedy genetic algorithms, this approach allows the tool |
+to progressively explore various disjoint and possibly mutually incompatible |
+features of the underlying data format, as shown in this image: |
http://lcamtuf.coredump.cx/afl/afl_gzip.png |
@@ -201,10 +201,10 @@ the sessions were seeded with a valid unified diff: |
Edge coverage | 1,259 | 1,734 | 1.72 | 0 |
AFL model | 1,452 | 2,040 | 3.16 | 1 |
-Some of the earlier work on evolutionary fuzzing suggested maintaining just a |
-single test case and selecting for mutations that improve coverage. At least |
-in the tests described above, this "greedy" method appeared to offer no |
-substantial benefits over blind fuzzing. |
+At noted earlier on, some of the prior work on genetic fuzzing relied on |
+maintaining a single test case and evolving it to maximize coverage. At least |
+in the tests described above, this "greedy" approach appears to confer no |
+substantial benefits over blind fuzzing strategies. |
4) Culling the corpus |
--------------------- |
@@ -263,9 +263,9 @@ files make the target binary slower, and because they reduce the likelihood |
that a mutation would touch important format control structures, rather than |
redundant data blocks. This is discussed in more detail in perf_tips.txt. |
-The possibility of a bad starting corpus provided by the user aside, some |
-types of mutations can have the effect of iteratively increasing the size of |
-the generated files, so it is important to counter this trend. |
+The possibility that the user will provide a low-quality starting corpus aside, |
+some types of mutations can have the effect of iteratively increasing the size |
+of the generated files, so it is important to counter this trend. |
Luckily, the instrumentation feedback provides a simple way to automatically |
trim down input files while ensuring that the changes made to the files have no |
@@ -275,11 +275,11 @@ The built-in trimmer in afl-fuzz attempts to sequentially remove blocks of data |
with variable length and stepover; any deletion that doesn't affect the checksum |
of the trace map is committed to disk. The trimmer is not designed to be |
particularly thorough; instead, it tries to strike a balance between precision |
-and the number of execve() calls spent on the process. The average per-file |
-gains are around 5-20%. |
+and the number of execve() calls spent on the process, selecting the block size |
+and stepover to match. The average per-file gains are around 5-20%. |
The standalone afl-tmin tool uses a more exhaustive, iterative algorithm, and |
-also attempts to perform alphabet normalization on the trimmed files. |
+also attempts to perform alphabet normalization on the trimmed files. |
6) Fuzzing strategies |
--------------------- |
@@ -302,11 +302,16 @@ strategies include: |
- Sequential insertion of known interesting integers (0, 1, INT_MAX, etc), |
-The non-deterministic steps include stacked bit flips, insertions, deletions, |
-arithmetics, and splicing of different test cases. |
+The purpose of opening with deterministic steps is related to their tendency to |
+produce compact test cases and small diffs between the non-crashing and crashing |
+inputs. |
-Their relative yields and execve() costs have been investigated and are |
-discussed in the aforementioned blog post. |
+With deterministic fuzzing out of the way, the non-deterministic steps include |
+stacked bit flips, insertions, deletions, arithmetics, and splicing of different |
+test cases. |
+ |
+The relative yields and execve() costs of all these strategies have been |
+investigated and are discussed in the aforementioned blog post. |
For the reasons discussed in historical_notes.txt (chiefly, performance, |
simplicity, and reliability), AFL generally does not try to reason about the |
@@ -315,13 +320,14 @@ are nominally blind, and are guided only by the evolutionary design of the |
input queue. |
That said, there is one (trivial) exception to this rule: when a new queue |
-entry goes through the initial set of deterministic fuzzing steps, and some |
-regions in the file are observed to have no effect on the checksum of the |
+entry goes through the initial set of deterministic fuzzing steps, and tweaks to |
+some regions in the file are observed to have no effect on the checksum of the |
execution path, they may be excluded from the remaining phases of |
-deterministic fuzzing - and proceed straight to random tweaks. Especially for |
-verbose, human-readable data formats, this can reduce the number of execs by |
-10-40% or so without an appreciable drop in coverage. In extreme cases, such |
-as normally block-aligned tar archives, the gains can be as high as 90%. |
+deterministic fuzzing - and the fuzzer may proceed straight to random tweaks. |
+Especially for verbose, human-readable data formats, this can reduce the number |
+of execs by 10-40% or so without an appreciable drop in coverage. In extreme |
+cases, such as normally block-aligned tar archives, the gains can be as high as |
+90%. |
Because the underlying "effector maps" are local every queue entry and remain |
in force only during deterministic stages that do not alter the size or the |
@@ -353,6 +359,14 @@ the grammar of highly verbose and complex languages such as JavaScript, SQL, |
or XML; several examples of generated SQL statements are given in the blog |
post mentioned above. |
+Interestingly, the AFL instrumentation also allows the fuzzer to automatically |
+isolate syntax tokens already present in an input file. It can do so by looking |
+for run of bytes that, when flipped, produce a consistent change to the |
+program's execution path; this is suggestive of an underlying atomic comparison |
+to a predefined value baked into the code. The fuzzer relies on this signal |
+to build compact "auto dictionaries" that are then used in conjunction with |
+other fuzzing strategies. |
+ |
8) De-duping crashes |
-------------------- |
@@ -416,22 +430,25 @@ With fast targets, the fork server can offer considerable performance gains, |
usually between 1.5x and 2x. It is also possible to: |
- Use the fork server in manual ("deferred") mode, skipping over larger, |
- user-selected chunks of initialization code. With some targets, this can |
+ user-selected chunks of initialization code. It requires very modest |
+ code changes to the targeted program, and With some targets, can |
produce 10x+ performance gains. |
- Enable "persistent" mode, where a single process is used to try out |
multiple inputs, greatly limiting the overhead of repetitive fork() |
- calls. As with the previous mode, this requires custom modifications, |
+ calls. This generally requires some code changes to the targeted program, |
but can improve the performance of fast targets by a factor of 5 or more |
- - approximating the benefits of in-process fuzzing jobs. |
+ - approximating the benefits of in-process fuzzing jobs while still |
+ maintaining very robust isolation between the fuzzer process and the |
+ targeted binary. |
11) Parallelization |
------------------- |
The parallelization mechanism relies on periodically examining the queues |
produced by independently-running instances on other CPU cores or on remote |
-machines, and then selectively pulling in the test cases that produce behaviors |
-not yet seen by the fuzzer at hand. |
+machines, and then selectively pulling in the test cases that, when tried |
+out locally, produce behaviors not yet seen by the fuzzer at hand. |
This allows for extreme flexibility in fuzzer setup, including running synced |
instances against different parsers of a common data format, often with |