OLD | NEW |
(Empty) | |
| 1 # Distillability Heuristics |
| 2 |
| 3 ## Goal |
| 4 |
| 5 We would like to know whether it's useful to run DOM distiller on a page. This |
| 6 signal could be used in places like the browser UI. Since this test would run on |
| 7 every page navigation, it needs to be cheap to compute. Running DOM distiller and |
| 8 checking whether the output is empty would be too slow, and whether DOM distiller returns |
| 9 results isn't necessarily equivalent to whether the page should be distilled. |
| 10 |
| 11 Considering all the constraints, we decided to train a machine learning model |
| 12 that takes features from a page and classifies it. The trained AdaBoost model is |
| 13 added to Chrome in http://crrev.com/1405233009/. The pipeline except for the |
| 14 model training part is described below. |
| 15 |
| 16 ## URL gathering |
| 17 |
| 18 Gather a bunch of popular URLs that are representative of sites users frequent. |
| 19 Put these URLs in a file, one per line. It might make sense to start with a |
| 20 short list for a dry run. |
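
As a minimal sketch of this step, the helper below writes a de-duplicated URL list, one per line, and can optionally cut it down to a short random sample for a dry run. The function name, its parameters, and the sampling approach are illustrative assumptions; none of them are part of the scripts described in this document.

```python
import random


def write_url_list(urls, path, sample_size=None, seed=0):
    """Write unique, non-empty URLs to `path`, one per line, and return them.

    Hypothetical helper: `sample_size` and `seed` are illustrative only.
    """
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        # Skip blank lines and duplicates, keeping first-seen order.
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    # Optionally keep only a small reproducible sample for a dry run.
    if sample_size is not None and sample_size < len(unique):
        unique = random.Random(seed).sample(unique, sample_size)
    with open(path, "w") as f:
        for url in unique:
            f.write(url + "\n")
    return unique
```

For a dry run, something like `write_url_list(all_urls, "urls.txt", sample_size=20)` would produce a short file in the one-URL-per-line layout described above.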
| 21 |
| 22 ## Data preparation for labeling |
| 23 |
| 24 Use ```get_screenshots.py``` to generate the screenshots of the original and |
| 25 distilled web page, and extract the features by running |
| 26 ```extract_features.js```. You can see how it works by running the following |
| 27 command. |
| 28 |
| 29 ``` |
| 30 ./get_screenshots.py --out out_dir --urls-file urls.txt |
| 31 ``` |
| 32 |
| 33 If everything goes fine, run it inside xvfb. Specifying the screen resolution |
| 34 makes the size of the screenshots consistent. It also prevents the Chrome window |
| 35 from interrupting your work on the main monitor. |
| 36 |
| 37 ``` |
| 38 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt |
| 39 ``` |
| 40 |
| 41 One entry takes about 30 seconds. Depending on the number of entries, it could |
| 42 be a lengthy process. If it is interrupted, you can use the ```--resume``` option |
| 43 to continue. |
| 44 |
| 45 ``` |
| 46 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume |
| 47 ``` |
| 48 |
| 49 Running multiple instances concurrently is recommended if the list is long |
| 50 enough. You can create a Makefile like this: |
| 51 |
| 52 ``` |
| 53 ALL=$(addsuffix .target,$(shell seq 1000)) |
| 54 |
| 55 all: $(ALL) |
| 56 |
| 57 %.target : |
| 58 	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume |
| 59 ``` |
| 60 |
| 61 And then run ```make -j20```. Adjust the parallelism according to how beefy your |
| 62 machine is. |
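
If you would rather not use ```make```, the same idea can be sketched in Python. This is only an illustration, and it assumes (as the Makefile above implies) that several ```--resume``` instances can safely share one output directory. The names `build_command` and `run_workers` are hypothetical. Unlike the Makefile, this sketch starts a fixed number of instances once and does not launch replacements as earlier ones exit.

```python
import subprocess

# Same xvfb-run wrapper and screen size as in the commands above.
XVFB = ["xvfb-run", "-a", "-s", "-screen 0 1600x5000x24"]


def build_command(out_dir, urls_file):
    """Return the argument list for one resumable get_screenshots.py run."""
    return XVFB + [
        "./get_screenshots.py",
        "--out", out_dir,
        "--urls-file", urls_file,
        "--resume",
    ]


def run_workers(command, workers, launch=subprocess.Popen):
    """Start `workers` copies of `command` and return their exit codes.

    `launch` is injectable so the function can be exercised without
    actually starting Chrome (an assumption made for testability).
    """
    procs = [launch(command) for _ in range(workers)]
    return [proc.wait() for proc in procs]
```

Calling `run_workers(build_command("out_dir", "urls.txt"), 20)` would roughly correspond to a single round of ```make -j20```.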
| 63 |
| 64 A small proportion of URLs will time out or fail for other reasons. When |
| 65 you've collected enough data, run the command again with option |
| 66 ```--write-index``` to export data for the next stage. |
| 67 |
| 68 ``` |
| 69 ./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index |
| 70 ``` |
| 71 |
| 72 ## Labeling |
| 73 |
| 74 Use ```server.py``` to serve the web site for data labeling. Labeling takes |
| 75 around 10 to 20 seconds of human effort per entry. |
| 76 |
| 77 ``` |
| 78 ./server.py --data-dir out_dir |
| 79 ``` |
| 80 |
| 81 It should print something like: |
| 82 |
| 83 ``` |
| 84 [21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081 |
| 85 ``` |
| 86 |
| 87 Then visit that address in your browser. |
| 88 |
| 89 The labels are written to ```out_dir/archive/``` periodically. |
| 90 |
| 91 ## Data preparation for training |
| 92 |
| 93 In the step with ```--write-index```, ```get_screenshots.py``` writes the |
| 94 extracted raw features to ```out_dir/feature```. We can use |
| 95 ```calculate_derived_features.py``` to convert it to the final derived features. |
| 96 |
| 97 ``` |
| 98 ./calculate_derived_features.py --core out_dir/feature --out derived.txt |
| 99 ``` |
| 100 |
| 101 Then use ```write_features_csv.py``` to combine them with the labels. |
| 102 |
| 103 ``` |
| 104 ./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled |
| 105 ``` |
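
The `$(ls -rt out_dir/archive/*|tail -n1)` part of the command picks the most recently modified file in the archive directory, so the newest set of labels is used. A rough Python equivalent is sketched below; the function name `latest_label_file` is hypothetical and is not part of the scripts described here.

```python
import glob
import os


def latest_label_file(archive_dir):
    """Return the most recently modified regular file in `archive_dir`."""
    paths = [
        p for p in glob.glob(os.path.join(archive_dir, "*"))
        if os.path.isfile(p)
    ]
    if not paths:
        raise FileNotFoundError("no label files in " + archive_dir)
    # Newest modification time wins, like `ls -rt ... | tail -n1`.
    return max(paths, key=os.path.getmtime)
```

The returned path could then be passed as the value of `--marked`.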