# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on
all page navigations, it needs to be cheap to compute. Running DOM distiller
and checking whether the output is empty would be too slow, and whether DOM
distiller returns results isn't necessarily equivalent to whether the page
should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
the model training part, is described below.

## URL gathering

Gather a set of popular URLs that are representative of the sites users
frequent. Put these URLs in a file, one per line. It might make sense to start
with a short list for a dry run.

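For a dry run, the URL file can be as small as a couple of lines. A minimal
sketch (the URLs below are placeholders, not part of the original pipeline;
substitute your own list):

```shell
# Write a tiny urls.txt for a dry run; one URL per line.
# These example URLs are placeholders -- use your own list.
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Machine_learning
https://www.chromium.org/developers/
EOF
wc -l < urls.txt
```
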
## Data preparation for labeling

Use `get_screenshots.py` to generate screenshots of the original and distilled
web pages, and extract the features by running `extract_features.js`. You can
see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome
window from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you can use the `--resume` option
to continue.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `make -j20`. Adjust the parallelism according to how beefy your
machine is.

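The `-j20` above is just an example. One rule of thumb (an assumption on my
part, not from the original tooling) is to start from the number of CPU cores,
since each job spawns its own Chrome instance, and cap it because memory per
instance may be the real limit:

```shell
# Derive a starting job count from the CPU count (a heuristic, not
# part of the original tooling); cap it so many concurrent Chrome
# instances don't exhaust memory.
JOBS=$(nproc)
[ "$JOBS" -gt 20 ] && JOBS=20
echo "suggested parallelism: ${JOBS}"
```
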
A small proportion of URLs would time out or fail for other reasons. When
you've collected enough data, run the command again with the `--write-index`
option to export data for the next stage.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
```

## Labeling

Use `server.py` to serve the web site for data labeling. Human effort is
around 10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels are written to `out_dir/archive/` periodically.

## Data preparation for training

In the step with `--write-index`, `get_screenshots.py` writes the extracted
raw features to `out_dir/feature`. We can use `calculate_derived_features.py`
to convert them to the final derived features.

```bash
./calculate_derived_features.py --core out_dir/feature --out derived.txt
```

Then use `write_features_csv.py` to combine them with the labels.

```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
```
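
The `$(ls -rt out_dir/archive/*|tail -n1)` substitution above selects the most
recently written label archive. A small stand-alone demonstration of that
idiom, using throwaway files rather than the real archive layout:

```shell
# `ls -rt` sorts by modification time, oldest first, so the last
# line from `tail -n1` is the newest file.
mkdir -p demo_archive
touch demo_archive/older
sleep 1
touch demo_archive/newer
ls -rt demo_archive/* | tail -n1   # prints demo_archive/newer
```
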