Chromium Code Reviews

Unified Diff: heuristics/distillable/README.md

Issue 1620043002: Add scripts for distillability modelling (Closed) Base URL: git@github.com:chromium/dom-distiller.git@master
Patch Set: set upstream patchset, identical to patch set 2 (created 4 years, 10 months ago)
Index: heuristics/distillable/README.md
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..685c6f7c6cc2d99e2384f4735b60bc782ac24b51
--- /dev/null
+++ b/heuristics/distillable/README.md
@@ -0,0 +1,105 @@
+# Distillability Heuristics
+
+## Goal
+
+We would like to know whether it's useful to run DOM distiller on a page. This
+signal could be used in places like the browser UI. Since the check would run
+on every page navigation, it needs to be cheap to compute. Running DOM
+distiller and checking whether the output is empty would be too slow, and
+whether DOM distiller returns results isn't necessarily equivalent to whether
+the page should be distilled.
+
+Given these constraints, we decided to train a machine learning model that
+takes features from a page and classifies it. The trained AdaBoost model was
+added to Chrome in http://crrev.com/1405233009/. The pipeline, except for the
+model training part, is described below.
+
+## URL gathering
+
+Gather a set of popular URLs that is representative of the sites users
+frequent. Put these URLs in a file, one per line. It might make sense to start
+with a short list for a dry run.
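+
+For illustration only, a small ```urls.txt``` could be created like this; the
+URLs below are placeholders rather than part of any real data set:
+
+```
+cat > urls.txt <<EOF
+https://en.wikipedia.org/wiki/Readability
+https://www.example.com/news/some-article
+EOF
+```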
+
+## Data preparation for labeling
+
+Use ```get_screenshots.py``` to generate screenshots of the original and
+distilled versions of each page, and to extract the features by running
+```extract_features.js```. You can see how it works by running the following
+command.
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+If everything works, run it inside xvfb. Specifying the screen resolution keeps
+the size of the screenshots consistent. It also prevents the Chrome window from
+interrupting your work on the main monitor.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+One entry takes about 30 seconds, so depending on the number of entries this
+could be a lengthy process. If it is interrupted, you can use the
+```--resume``` option to continue.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+Running multiple instances concurrently is recommended if the list is long
+enough. You can create a Makefile like this:
+
+```
+ALL=$(addsuffix .target,$(shell seq 1000))
+
+all: $(ALL)
+
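+# Note: the recipe line below must be indented with a tab, not spaces.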
+%.target :
+	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+Then run ```make -j20```, adjusting the parallelism to what your machine can
+handle.
+
+A small proportion of URLs will time out or fail for other reasons. When you've
+collected enough data, run the command again with the ```--write-index```
+option to export the data for the next stage.
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
+```
+
+## Labeling
+
+Use ```server.py``` to serve the web site for data labeling. Expect around
+10~20 seconds of human effort per entry.
+
+```
+./server.py --data-dir out_dir
+```
+
+It should print something like:
+
+```
+[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
+```
+
+Then visit that address (e.g. http://localhost:8081) in your browser.
+
+The labels are written to ```out_dir/archive/``` periodically.
+
+## Data preparation for training
+
+In the ```--write-index``` step, ```get_screenshots.py``` writes the extracted
+raw features to ```out_dir/feature```. Use ```calculate_derived_features.py```
+to convert them to the final derived features.
+
+```
+./calculate_derived_features.py --core out_dir/feature --out derived.txt
+```
+
+Then use ```write_features_csv.py``` to combine the derived features with the
+labels.
+
+```
+./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
+```
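+
+The ```$(ls -rt out_dir/archive/*|tail -n1)``` substitution selects the most
+recently written label archive. To see which file that is before combining, you
+can run the same listing on its own:
+
+```
+ls -rt out_dir/archive/* | tail -n1
+```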
