Index: heuristics/distillable/README.md |
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md |
index 6f4a1571e4fedd3ce8e3046a3a1ab13f1eba830c..cabdf3985c219e10fa99b9800dadcc38009c98d4 100644 |
--- a/heuristics/distillable/README.md |
+++ b/heuristics/distillable/README.md |
@@ -4,14 +4,16 @@ |
We would like to know whether it's useful to run DOM distiller on a page. This |
signal could be used in places like browser UI. Since this test would run on all |
-the page navigations, it needs to be cheap to compute. Running DOM distiller and |
+page navigations, it needs to be cheap to compute. Running DOM distiller to |
see if the output is empty would be too slow, and whether DOM distiller returns |
results isn't necessarily equivalent to whether the page should be distilled. |
Considering all the constraints, we decided to train a machine learning model |
that takes features from a page and classifies it. The trained AdaBoost model is |
-added to Chrome in http://crrev.com/1405233009/. The pipeline except for the |
-model training part is described below. |
+added to Chrome in http://crrev.com/1405233009/ to predict whether a page |
+contains an article. Another model added in http://crrev.com/1703313003 predicts |
+whether the article is long enough. The pipeline, except for the model |
+training part, is described below. |
## URL gathering |
@@ -19,7 +21,7 @@ Gather a bunch of popular URLs that are representative of sites users frequent. |
Put these URLs in a file, one per line. It might make sense to start with a |
short list for dry run. |
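+ |
+For illustration only, a minimal `urls.txt` (the file name is the placeholder |
+used by the commands below) could look like: |
+ |
+``` |
+https://en.wikipedia.org/wiki/Readability |
+https://www.bbc.com/news |
+https://example.com/ |
+``` |
+ |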
-## Data preparation for labeling |
+## Data crawling |
Use `get_screenshots.py` to generate the screenshots of the original and |
distilled web page, and extract the features by running `extract_features.js`. |
@@ -29,6 +31,9 @@ You can see how it works by running the following command. |
./get_screenshots.py --out out_dir --urls-file urls.txt |
``` |
+Append the option `--emulate-mobile` if mobile-friendliness is important, and |
+`--save-mhtml` to keep a copy of each page in MHTML format. |
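+ |
+For example, a crawl that emulates mobile and keeps MHTML copies might look |
+like this (the options are the ones described above; `out_dir` and `urls.txt` |
+are the placeholder names used throughout this document): |
+ |
+```bash |
+./get_screenshots.py --out out_dir --urls-file urls.txt --emulate-mobile --save-mhtml |
+``` |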
+ |
If everything goes fine, run it inside xvfb. Specifying the screen resolution |
makes the size of the screenshots consistent. It also prevents the Chrome window |
from interrupting your work on the main monitor. |
@@ -57,19 +62,32 @@ all: $(ALL) |
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume |
``` |
-And then run `make -j20`. Adjust the parallelism according to how beefy your |
-machine is. |
+And then run `nice make -j10 -k`. Adjust the parallelism according to how beefy |
+your machine is. The `-k` option is essential so that `make` keeps going when |
+individual entries fail. |
+ |
+**Tips and caveats:** |
+ |
+- Use tmpfs for /tmp to avoid thrashing your disk (see the sketch after this |
+  list). The job easily becomes IO-bound even if /tmp is on an SSD. |
+- `atop` is useful when experimenting with the parallelism. |
+- For a 40-core, 64GB workstation, `-j80` can keep it CPU-bound, with a |
+  throughput of ~100 entries/minute. |
+- You might need to manually kill a few stray Chrome or xvfb-run processes |
+  after interrupting `make` with `Ctrl-C`. |
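+ |
+A minimal sketch of putting /tmp on tmpfs, assuming you have root and enough |
+RAM (the size below is illustrative): |
+ |
+```bash |
+sudo mount -t tmpfs -o size=16G tmpfs /tmp |
+``` |
+ |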
A small proportion of URLs will time out or fail for other reasons. When |
you've collected enough data, run the command again with option `--write-index` |
to export data for the next stage. |
```bash |
-./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index |
+./get_screenshots.py --out out_dir --urls-file urls.txt --write-index |
``` |
## Labeling |
+This section is only needed for the distillability model, not for the |
+long-article model. |
+ |
Use `server.py` to serve the web site for data labeling. Human effort is around |
10~20 seconds per entry. |
@@ -89,16 +107,94 @@ The labels would be written to `out_dir/archive/` periodically. |
## Data preparation for training |
-In the step with `--write-index`, `get_screenshots.py` writes the extracted raw |
-features to `out_dir/feature`. We can use `calculate_derived_features.py` to |
-convert it to the final derived features. |
+### Feature re-extraction from MHTML archive |
+ |
+When experimenting with feature extraction, it is useful to be able to extract |
+new features from the saved MHTML archives without crawling everything again. |
+After modifying `extract_features.js`, change the command in the Makefile to: |
```bash |
-./calculate_derived_features.py --core out_dir/feature --out derived.txt |
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --load-mhtml --skip-distillation |
``` |
-Then use `write_features_csv.py` to combine with the label. |
+Then rerun `nice make -j10 -k`. |
+ |
+### Recalculating derived features without extracting again |
+ |
+This section is usually optional. You might need it if you've changed how |
+features are derived and want to recalculate the derived features from the raw |
+features without extracting the raw features again. This can be useful because |
+features re-extracted from the MHTML archive can sometimes differ from those of |
+the original web page. It is also necessary if you are dealing with an older |
+dataset where the derived features were not calculated when crawling. |
+ |
+The derived features are saved to `out_dir/*.feature-derived` when crawling |
+each entry. In the step with `--write-index`, `get_screenshots.py` writes the |
+derived features to `out_dir/feature-derived`. To save time, raw features are |
+not aggregated by default, but you can uncomment the line in the function |
+`writeFeature()` to write the raw features to `out_dir/feature` as well. We can |
+then use `calculate_derived_features.py` to convert them into derived features. |
```bash |
-./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled |
+./calculate_derived_features.py --core out_dir/feature --out out_dir/feature-derived |
``` |
+ |
+### Sanity check |
+ |
+This step is optional. |
+ |
+`check_derived_features.py` compares the derived features between the |
+JavaScript implementation and the native implementation in Chrome. This only |
+works if your Chrome is new enough to support distillability JSON dumping |
+(with the command-line argument `--distillability-dev`). |
+ |
+``` |
+./check_derived_features.py --features out_dir/feature-derived |
+``` |
+ |
+Or if you want to compare the features derived from MHTML archive, do this: |
+ |
+``` |
+./check_derived_features.py --features out_dir/mfeature-derived --from-mhtml |
+``` |
+ |
+When comparing the features extracted from the original page, the error rate |
+would be higher because the feature extractions by JS and native code are done |
+at different events, and the DOM could change dynamically. On the other hand, |
+features extracted from the MHTML archive should be exactly the same. However, |
+due to issues like https://crbug.com/586034, MHTML is not fully offline, and the |
+results can be non-deterministic. Sadly there is currently no good way in |
+WebDriver to force offline behavior. Other than that, mismatches between the two |
+implementations should be regarded as bugs. |
+ |
+`check_distilled_mhtml.py` compares the distilled content from the original page |
+with the distilled content from the MHTML archive. |
+ |
+``` |
+./check_distilled_mhtml.py --dir out_dir |
+``` |
+ |
+These two should be exactly the same. Known differences include: |
+ |
+- The original page has the next page stitched in. |
+- In some rare cases, distilling the MHTML archive fails and produces no data. |
+ |
+We still have inconsistencies that need investigation. |
+ |
+### Final output for training |
+ |
+Use `write_features_csv.py` to combine the derived features with the labels. |
+ |
+For the distillability model, run: |
+ |
+``` |
+./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features out_dir/feature-derived --out labelled |
+``` |
+ |
+Or for the long-article model, run: |
+ |
+``` |
+./write_features_csv.py --distilled out_dir/dfeature-derived --features out_dir/feature-derived --out labelled |
+``` |
+ |
+This creates a number of files named `labelled-*.csv`. |
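+ |
+A quick way to eyeball the output (plain shell; nothing here is specific to |
+the scripts above): |
+ |
+```bash |
+ls labelled-*.csv |
+wc -l labelled-*.csv |
+``` |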