Index: heuristics/distillable/README.md
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..685c6f7c6cc2d99e2384f4735b60bc782ac24b51
--- /dev/null
+++ b/heuristics/distillable/README.md
@@ -0,0 +1,105 @@
+# Distillability Heuristics
+
+## Goal
+
+We would like to know whether it's useful to run DOM distiller on a page. This
+signal could be used in places like the browser UI. Since this test would run
+on every page navigation, it needs to be cheap to compute. Running DOM
+distiller and checking whether the output is empty would be too slow, and
+whether DOM distiller returns results isn't necessarily equivalent to whether
+the page should be distilled.
+
+Considering all the constraints, we decided to train a machine learning model
+that takes features from a page and classifies it. The trained AdaBoost model
+was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
+the model training part, is described below.
+
+## URL gathering
+
+Gather a bunch of popular URLs that are representative of the sites users
+frequent. Put these URLs in a file, one per line. It might make sense to start
+with a short list for a dry run.
+
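+For a dry run, ```urls.txt``` might look like this (hypothetical entries):
+
+```
+https://en.wikipedia.org/wiki/Readability
+https://www.example.com/news/some-article
+https://blog.example.org/a-post
+```
+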
+## Data preparation for labeling
+
+Use ```get_screenshots.py``` to generate screenshots of the original and
+distilled versions of each page, and to extract the raw features by running
+```extract_features.js```. You can see how it works by running the following
+command:
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+Once everything works, run it inside xvfb. Specifying the screen resolution
+keeps the size of the screenshots consistent. It also prevents the Chrome
+window from interrupting your work on the main monitor.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
+```
+
+One entry takes about 30 seconds, so depending on the number of entries this
+can be a lengthy process. If the run is interrupted, use the ```--resume```
+option to continue where it left off.
+
+```
+xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+Running multiple instances concurrently is recommended if the list is long
+enough. You can create a Makefile like this:
+
+```
+# 1000 dummy targets; each invocation resumes work on the shared URL list,
+# so make can schedule many instances in parallel.
+ALL=$(addsuffix .target,$(shell seq 1000))
+
+all: $(ALL)
+
+%.target :
+	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
+```
+
+And then run ```make -j20```. Adjust the parallelism according to how beefy
+your machine is.
+
+A small proportion of URLs will time out or fail for other reasons. When
+you've collected enough data, run the command again with the
+```--write-index``` option to export the data for the next stage.
+
+```
+./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
+```
+
+## Labeling
+
+Use ```server.py``` to serve the web site for data labeling. Expect around
+10~20 seconds of human effort per entry.
+
+```
+./server.py --data-dir out_dir
+```
+
+It should print something like:
+
+```
+[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
+```
+
+Then visit that address in your browser.
+
+The labels are written to ```out_dir/archive/``` periodically.
+
+## Data preparation for training
+
+When run with ```--write-index```, ```get_screenshots.py``` writes the
+extracted raw features to ```out_dir/feature```. We can use
+```calculate_derived_features.py``` to convert them into the final derived
+features.
+
+```
+./calculate_derived_features.py --core out_dir/feature --out derived.txt
+```
+
+Then use ```write_features_csv.py``` to combine the derived features with the
+labels. The ```$(ls -rt ...|tail -n1)``` subshell picks the most recently
+written label archive.
+
+```
+./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
+```