# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on
every page navigation, it needs to be cheap to compute. Running DOM distiller
and checking whether the output is empty would be too slow, and whether DOM
distiller returns results isn't necessarily equivalent to whether the page
should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
the model training part, is described below.

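As an illustration only (the real Chrome model, its page features, and its
training code live elsewhere), here is a minimal AdaBoost of decision stumps on
a single made-up feature, sketched in Python:

```python
import math

def train_adaboost(xs, ys, rounds=3):
    """Boosted decision stumps on one numeric feature; labels are +/-1."""
    n = len(xs)
    w = [1.0 / n] * n  # example weights, uniform at first
    model = []         # list of (threshold, polarity, alpha)
    for _ in range(rounds):
        # Pick the stump (threshold + polarity) with the lowest weighted error.
        best = None
        for t in sorted(set(xs)):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                          if (pol if xi >= t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        model.append((t, pol, alpha))
        # Boost the weights of misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * (pol if xi >= t else -pol))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(alpha * (pol if x >= t else -pol) for t, pol, alpha in model)
    return 1 if score >= 0 else -1

# Toy feature: say, the fraction of visible text inside <p> tags (made up).
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [-1, -1, -1, 1, 1, 1]  # +1 = distillable, -1 = not
model = train_adaboost(xs, ys)
```

On this toy data, `predict(model, 0.05)` returns `-1` and `predict(model,
0.95)` returns `1`; the real model works the same way but over many features.
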
## URL gathering

Gather a bunch of popular URLs that are representative of sites users frequent.
Put these URLs in a file, one per line. It might make sense to start with a
short list for a dry run.

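For example, a dry-run `urls.txt` could contain just a handful of entries (the
URLs below are placeholders; use sites representative of your users):

```shell
# Write a three-entry URLs file for a dry run (placeholder URLs).
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Machine_learning
https://www.bbc.com/news
https://example.com/
EOF
```
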
## Data preparation for labeling

Use `get_screenshots.py` to generate screenshots of the original and distilled
web pages, and extract the features by running `extract_features.js`. You can
see how it works by running the following command:

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome
window from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you can use the `--resume` option
to continue:

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `make -j20`. Adjust the parallelism according to how beefy your
machine is.

A small proportion of URLs will time out or fail for other reasons. When
you've collected enough data, run the command again with the `--write-index`
option to export data for the next stage:

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
```

## Labeling

Use `server.py` to serve the web site for data labeling. Human effort is
around 10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels will be written to `out_dir/archive/` periodically.

## Data preparation for training

In the step with `--write-index`, `get_screenshots.py` writes the extracted
raw features to `out_dir/feature`. We can use `calculate_derived_features.py`
to convert them to the final derived features:

```bash
./calculate_derived_features.py --core out_dir/feature --out derived.txt
```

Then use `write_features_csv.py` to combine the features with the labels:

```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
```
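
The `$(ls -rt out_dir/archive/*|tail -n1)` substitution picks the most
recently written label file: `ls -rt` sorts by modification time, oldest
first, and `tail -n1` keeps the last entry. A toy illustration of the idiom
(using a hypothetical `demo_archive` directory):

```shell
# Create two files a second apart, then select the newest one.
mkdir -p demo_archive
touch demo_archive/labels-old
sleep 1
touch demo_archive/labels-new
newest=$(ls -rt demo_archive/* | tail -n1)
echo "$newest"  # -> demo_archive/labels-new
```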