Index: heuristics/distillable/README.md |
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md |
index 6f4a1571e4fedd3ce8e3046a3a1ab13f1eba830c..cabdf3985c219e10fa99b9800dadcc38009c98d4 100644 |
--- a/heuristics/distillable/README.md |
+++ b/heuristics/distillable/README.md |
@@ -4,14 +4,16 @@ |
We would like to know whether it's useful to run DOM distiller on a page. This |
signal could be used in places like browser UI. Since this test would run on all |
-the page navigations, it needs to be cheap to compute. Running DOM distiller and |
+page navigations, it needs to be cheap to compute. Running DOM distiller to |
see if the output is empty would be too slow, and whether DOM distiller returns |
results isn't necessarily equivalent to whether the page should be distilled. |
Considering all the constraints, we decided to train a machine learning model |
that takes features from a page and classifies it. The trained AdaBoost model is |
-added to Chrome in http://crrev.com/1405233009/. The pipeline except for the |
-model training part is described below. |
+added to Chrome in http://crrev.com/1405233009/ to predict whether a page |
+contains an article. Another model added in http://crrev.com/1703313003 predicts |
+whether the article is long enough. The pipeline, except for the model |
+training part, is described below. |
## URL gathering |
@@ -19,7 +21,7 @@ Gather a bunch of popular URLs that are representative of sites users frequent. |
Put these URLs in a file, one per line. It might make sense to start with a |
short list for dry run. |
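+ |
+For illustration only, a minimal `urls.txt` (the file name is the placeholder |
+used by the commands below) could look like: |
+ |
+``` |
+https://en.wikipedia.org/wiki/Readability |
+https://www.bbc.com/news |
+https://example.com/ |
+``` |
+ |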
-## Data preparation for labeling |
+## Data crawling |
Use `get_screenshots.py` to generate the screenshots of the original and |
distilled web page, and extract the features by running `extract_features.js`. |
@@ -29,6 +31,9 @@ You can see how it works by running the following command. |
./get_screenshots.py --out out_dir --urls-file urls.txt |
``` |
+Append the option `--emulate-mobile` if mobile-friendliness is important, and |
+`--save-mhtml` to keep a copy of each page in MHTML format. |
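+ |
+For example, a crawl that emulates mobile and keeps MHTML copies might look |
+like this (the options are the ones described above; `out_dir` and `urls.txt` |
+are the placeholder names used throughout this document): |
+ |
+```bash |
+./get_screenshots.py --out out_dir --urls-file urls.txt --emulate-mobile --save-mhtml |
+``` |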
+ |
If everything goes fine, run it inside xvfb. Specifying the screen resolution |
makes the size of the screenshots consistent. It also prevents the Chrome window |
from interrupting your work on the main monitor. |
@@ -57,19 +62,32 @@ all: $(ALL) |
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume |
``` |
-And then run `make -j20`. Adjust the parallelism according to how beefy your |
-machine is. |
+And then run `nice make -j10 -k`. Adjust the parallelism according to how beefy |
+your machine is. The `-k` option is essential so that `make` keeps going when |
+individual entries fail. |
+ |
+**Tips and caveats:** |
+ |
+- Use tmpfs for /tmp to avoid thrashing your disk (see the sketch after this |
+  list). The job easily becomes IO-bound even if /tmp is on an SSD. |
+- `atop` is useful when experimenting with the parallelism. |
+- For a 40-core, 64GB workstation, `-j80` can keep it CPU-bound, with a |
+  throughput of ~100 entries/minute. |
+- You might need to manually kill a few stray Chrome or xvfb-run processes |
+  after interrupting `make` with `Ctrl-C`. |
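+ |
+A minimal sketch of putting /tmp on tmpfs, assuming you have root and enough |
+RAM (the size below is illustrative): |
+ |
+```bash |
+sudo mount -t tmpfs -o size=16G tmpfs /tmp |
+``` |
+ |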
A small proportion of URLs will time out or fail for other reasons. When |
you've collected enough data, run the command again with option `--write-index` |
to export data for the next stage. |
```bash |
-./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index |
+./get_screenshots.py --out out_dir --urls-file urls.txt --write-index |
``` |
## Labeling |
+This section is only needed for the distillability model, not for the |
+long-article model. |
+ |
Use `server.py` to serve the web site for data labeling. Human effort is around |
10~20 seconds per entry. |
@@ -89,16 +107,94 @@ The labels would be written to `out_dir/archive/` periodically. |
## Data preparation for training |
-In the step with `--write-index`, `get_screenshots.py` writes the extracted raw |
-features to `out_dir/feature`. We can use `calculate_derived_features.py` to |
-convert it to the final derived features. |
+### Feature re-extraction from MHTML archive |
+ |
+When experimenting with feature extraction, it is useful to be able to extract |
+new features from the saved MHTML archives without crawling everything again. |
+After modifying `extract_features.js`, change the command in the Makefile to: |
```bash |
-./calculate_derived_features.py --core out_dir/feature --out derived.txt |
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --load-mhtml --skip-distillation |
``` |
-Then use `write_features_csv.py` to combine with the label. |
+Then rerun `nice make -j10 -k`. |
+ |
+### Recalculating derived features without extracting again |
+ |
+This section is usually optional. You might need it if you've changed how |
+features are derived and want to recalculate the derived features from the raw |
+features without extracting the raw features again. This can be useful because |
+features re-extracted from the MHTML archive can sometimes differ from those of |
+the original web page. It is also necessary if you are dealing with an older |
+dataset where the derived features were not calculated when crawling. |
+ |
+The derived features are saved to `out_dir/*.feature-derived` when crawling |
+each entry. In the step with `--write-index`, `get_screenshots.py` writes the |
+derived features to `out_dir/feature-derived`. To save time, raw features are |
+not aggregated by default, but you can uncomment the line in the function |
+`writeFeature()` to write the raw features to `out_dir/feature` as well. We can |
+then use `calculate_derived_features.py` to convert them into derived features. |
```bash |
-./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled |
+./calculate_derived_features.py --core out_dir/feature --out out_dir/feature-derived |
``` |
+ |
+### Sanity check |
+ |
+This step is optional. |
+ |
+`check_derived_features.py` compares the derived features between the |
+JavaScript implementation and the native implementation in Chrome. This only |
+works if your Chrome is new enough to support distillability JSON dumping |
+(with the command-line argument `--distillability-dev`). |
+ |
+``` |
+./check_derived_features.py --features out_dir/feature-derived |
+``` |
+ |
+Or if you want to compare the features derived from MHTML archive, do this: |
+ |
+``` |
+./check_derived_features.py --features out_dir/mfeature-derived --from-mhtml |
+``` |
+ |
+When comparing the features extracted from the original page, the error rate |
+would be higher because the feature extractions by JS and native code are done |
+at different events, and the DOM could change dynamically. On the other hand, |
+features extracted from the MHTML archive should be exactly the same. However, |
+due to issues like https://crbug.com/586034, MHTML is not fully offline, and the |
+results can be non-deterministic. Sadly there is currently no good way in |
+WebDriver to force offline behavior. Other than that, mismatches between the two |
+implementations should be regarded as bugs. |
+ |
+`check_distilled_mhtml.py` compares the distilled content from the original page |
+with the distilled content from the MHTML archive. |
+ |
+``` |
+./check_distilled_mhtml.py --dir out_dir |
+``` |
+ |
+These two should be exactly the same. Known differences include: |
+ |
+- The original page has the next page stitched in. |
+- In some rare cases, distilling the MHTML archive fails and produces no data. |
+ |
+We still have inconsistencies that need investigation. |
+ |
+### Final output for training |
+ |
+Use `write_features_csv.py` to combine the derived features with the labels. |
+ |
+For the distillability model, run: |
+ |
+``` |
+./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features out_dir/feature-derived --out labelled |
+``` |
+ |
+Or for the long-article model, run: |
+ |
+``` |
+./write_features_csv.py --distilled out_dir/dfeature-derived --features out_dir/feature-derived --out labelled |
+``` |
+ |
+This creates a number of files named `labelled-*.csv`. |
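+ |
+A quick way to eyeball the output (plain shell; nothing here is specific to |
+the scripts above): |
+ |
+```bash |
+ls labelled-*.csv |
+wc -l labelled-*.csv |
+``` |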