Index: heuristics/distillable/README.md
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..685c6f7c6cc2d99e2384f4735b60bc782ac24b51
--- /dev/null
+++ b/heuristics/distillable/README.md
@@ -0,0 +1,105 @@
+# Distillability Heuristics
+
+## Goal
+
+We would like to know whether it's useful to run DOM distiller on a page. This
+signal could be used in places like the browser UI. Since this test would run
+on every page navigation, it needs to be cheap to compute. Running DOM
+distiller and checking whether the output is empty would be too slow, and
+whether DOM distiller returns results isn't necessarily equivalent to whether
+the page should be distilled.
+
+Considering all the constraints, we decided to train a machine learning model
+that takes features from a page and classifies it. The trained AdaBoost model
+was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
+the model training part, is described below.
+
+## URL gathering
+
+Gather a bunch of popular URLs that are representative of the sites users
+frequent. Put these URLs in a file, one per line. It might make sense to start
+with a short list for a dry run.
+
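+For a dry run, ```urls.txt``` might look like this (hypothetical entries):
+
+```
+https://en.wikipedia.org/wiki/Readability
+https://www.example.com/news/some-article
+https://blog.example.org/a-post
+```
+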
+## Data preparation for labeling
+
+Use ```get_screenshots.py``` to generate screenshots of the original and
+distilled versions of each page, and to extract the raw features by running
+```extract_features.js```. You can see how it works by running the following
+command:
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+Once everything works, run it inside xvfb. Specifying the screen resolution
+keeps the size of the screenshots consistent. It also prevents the Chrome
+window from interrupting your work on the main monitor.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+One entry takes about 30 seconds, so depending on the number of entries this
+can be a lengthy process. If the run is interrupted, use the ```--resume```
+option to continue where it left off.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+Running multiple instances concurrently is recommended if the list is long
+enough. You can create a Makefile like this:
+
+```
+# 1000 dummy targets; each invocation resumes work on the shared URL list,
+# so make can schedule many instances in parallel.
+ALL=$(addsuffix .target,$(shell seq 1000))
+
+all: $(ALL)
+
+%.target :
+	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+And then run ```make -j20```. Adjust the parallelism according to how beefy
+your machine is.
+
+A small proportion of URLs will time out or fail for other reasons. When
+you've collected enough data, run the command again with the
+```--write-index``` option to export the data for the next stage.
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
+```
+
+## Labeling
+
+Use ```server.py``` to serve the web site for data labeling. Expect around
+10~20 seconds of human effort per entry.
+
+```
+./server.py --data-dir out_dir
+```
+
+It should print something like:
+
+```
+[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
+```
+
+Then visit that address in your browser.
+
+The labels are written to ```out_dir/archive/``` periodically.
+
+## Data preparation for training
+
+When run with ```--write-index```, ```get_screenshots.py``` writes the
+extracted raw features to ```out_dir/feature```. We can use
+```calculate_derived_features.py``` to convert them into the final derived
+features.
+
+```
+./calculate_derived_features.py --core out_dir/feature --out derived.txt
+```
+
+Then use ```write_features_csv.py``` to combine the derived features with the
+labels. The ```$(ls -rt ...|tail -n1)``` subshell picks the most recently
+written label archive.
+
+```
+./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
+```