# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on all
page navigations, it needs to be cheap to compute. Running DOM distiller to see
if the output is empty would be too slow, and whether DOM distiller returns
results isn't necessarily equivalent to whether the page should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/ to predict whether a page
contains an article. Another model, added in http://crrev.com/1703313003,
predicts whether the article is long enough. The pipeline, except for the model
training part, is described below.

## URL gathering

Gather a bunch of popular URLs that are representative of sites users frequent.
Put these URLs in a file, one per line. It might make sense to start with a
short list for a dry run.

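For a dry run, such a file can be created directly from the shell. The URLs
below are placeholders; substitute sites you actually want to sample.

```shell
# Write a small urls.txt for a dry run, one URL per line.
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Boiling_point
https://www.bbc.com/news
https://example.com/
EOF
```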
## Data crawling

Use `get_screenshots.py` to generate screenshots of the original and distilled
web pages, and extract the features by running `extract_features.js`. You can
see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

Append option `--emulate-mobile` if mobile-friendliness is important, and use
`--save-mhtml` to keep a copy in MHTML format.

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome window
from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you could use option `--resume` to
continue.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `nice make -j10 -k`. Adjust the parallelism according to how beefy
your machine is. The `-k` option is essential so that make keeps going when
individual targets fail.

**Tips and caveats:**

- Use tmpfs for /tmp to avoid thrashing your disk. The process would easily be
  IO-bound even if you use an SSD for /tmp.
- `atop` is useful when experimenting with the parallelism.
- For a 40-core, 64GB workstation, `-j80` can keep it CPU-bound, with a
  throughput of ~100 entries/minute.
- You might need to manually kill a few stray Chrome or xvfb-run processes
  after hitting `Ctrl-C` for `make`.

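The cleanup in the last tip can be scripted. The `-f` patterns below are
assumptions about how the stray processes show up; check `ps` output on your
machine first.

```shell
# Kill stray processes left behind after interrupting make with Ctrl-C.
# `|| true` keeps the script from failing when nothing matches.
pkill -f xvfb-run || true
pkill -f get_screenshots.py || true
```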
A small proportion of URLs would time out, or fail for some other reason. When
you've collected enough data, run the command again with option `--write-index`
to export the data for the next stage.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --write-index
```

## Labeling

This section is only needed for the distillability model, not for the
long-article model.

Use `server.py` to serve the web site for data labeling. Human effort is around
10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels would be written to `out_dir/archive/` periodically.

## Data preparation for training

### Feature re-extraction from MHTML archive

When experimenting with feature extraction, being able to extract new features
is useful. After modifying `extract_features.js`, modify the Makefile and change
the command to:

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --load-mhtml --skip-distillation
```

Then rerun `nice make -j10 -k`.

### Recalculating derived features without extracting again

This section is usually optional. You might need it if you've changed how
features are derived and want to recalculate the derived features from the raw
features without extracting the raw features again. This can be useful because
feature re-extraction from an MHTML archive can sometimes differ from the
original web page. It is also necessary if you are dealing with an older dataset
where the derived features were not calculated when crawling.

The derived features are saved to `out_dir/*.feature-derived` when crawling
each entry. In the step with `--write-index`, `get_screenshots.py` writes the
derived features to `out_dir/feature-derived`. To save time, raw features are
not aggregated by default, but you can uncomment the line in function
`writeFeature()` to write the raw features to `out_dir/feature` as well. We can
then use `calculate_derived_features.py` to convert them to the derived
features.

```bash
./calculate_derived_features.py --core out_dir/feature --out out_dir/feature-derived
```

### Sanity check

This step is optional.

`check_derived_features.py` compares the derived features between the JavaScript
implementation and the native implementation in Chrome. This only works if your
Chrome is new enough to support distillability JSON dumping (with command line
argument `--distillability-dev`).

```bash
./check_derived_features.py --features out_dir/feature-derived
```

Or if you want to compare the features derived from the MHTML archive, do this:

```bash
./check_derived_features.py --features out_dir/mfeature-derived --from-mhtml
```

When comparing the features extracted from the original page, the error rate
would be higher because the feature extractions by JS and native code are done
at different events, and the DOM could change dynamically. On the other hand,
features extracted from the MHTML archive should be exactly the same. However,
due to issues like https://crbug.com/586034, MHTML is not fully offline, and
the results can be non-deterministic. Sadly, there is currently no good way in
webdriver to force offline behavior. Other than that, mismatches between the
two implementations should be regarded as bugs.

`check_distilled_mhtml.py` compares the distilled content from the original
page with the distilled content from the MHTML archive.

```bash
./check_distilled_mhtml.py --dir out_dir
```

These two should be exactly the same. Known differences include:

- The original page has the next page stitched.
- In some rare cases, MHTML would fail to distill and get no data.

We still have inconsistencies that need investigation.

### Final output for training

Use `write_features_csv.py` to combine the derived features with the label.

For the distillability model, run:

```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features out_dir/feature-derived --out labelled
```

Or for the long-article model, run:

```bash
./write_features_csv.py --distilled out_dir/dfeature-derived --features out_dir/feature-derived --out labelled
```

Lots of files named `labelled-*.csv` would then be created.
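
If your training setup wants a single file rather than shards, the pieces can
be concatenated. This assumes the `labelled-*.csv` files carry no header rows;
check the first shard before doing this.

```shell
# Merge all labelled shards into one CSV for training.
cat labelled-*.csv > labelled-all.csv
wc -l labelled-all.csv
```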