Chromium Code Reviews

Side by Side Diff: heuristics/distillable/README.md

Issue 1728863002: Reformat README.md to Google style (Closed) Base URL: git@github.com:chromium/dom-distiller.git@master
Patch Set: Created 4 years, 10 months ago
# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like the browser UI. Since this test would run on
every page navigation, it needs to be cheap to compute. Running DOM distiller
and checking whether the output is empty would be too slow, and whether DOM
distiller returns results isn't necessarily equivalent to whether the page
should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
the model training part, is described below.
## URL gathering

Gather a bunch of popular URLs that are representative of the sites users
frequent. Put these URLs in a file, one per line. It might make sense to start
with a short list for a dry run.
## Data preparation for labeling

Use `get_screenshots.py` to generate the screenshots of the original and
distilled web page, and extract the features by running `extract_features.js`.
You can see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```
32 31
33 If everything goes fine, run it inside xvfb. Specifying the screen resolution 32 If everything goes fine, run it inside xvfb. Specifying the screen resolution
34 makes the size of the screenshots consistent. It also prevent the Chrome window 33 makes the size of the screenshots consistent. It also prevent the Chrome window
35 from interrupting your work on the main monitor. 34 from interrupting your work on the main monitor.
36 35
37 ``` 36 ```bash
38 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --url s-file urls.txt 37 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --url s-file urls.txt
39 ``` 38 ```
40 39
41 One entry takes about 30 seconds. Depending on the number of entries, it could 40 One entry takes about 30 seconds. Depending on the number of entries, it could
42 be a lengthy process. If it is interrupted, you could use option ```--resume``` 41 be a lengthy process. If it is interrupted, you could use option `--resume` to
43 to continue. 42 continue.
44 43
45 ``` 44 ```bash
46 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --url s-file urls.txt --resume 45 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --url s-file urls.txt --resume
47 ``` 46 ```
48 47
49 Running multiple instances concurrently is recommended if the list is long 48 Running multiple instances concurrently is recommended if the list is long
50 enough. You can create a Makefile like this: 49 enough. You can create a Makefile like this:
51 50
52 ``` 51 ```make
53 ALL=$(addsuffix .target,$(shell seq 1000)) 52 ALL=$(addsuffix .target,$(shell seq 1000))
54 53
55 all: $(ALL) 54 all: $(ALL)
56 55
57 %.target : 56 %.target :
58 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_d ir --urls-file urls.txt --resume 57 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_d ir --urls-file urls.txt --resume
59 ``` 58 ```
60 59
61 And then run ```make -j20```. Adjust the parallelism according to how beefy your 60 And then run `make -j20`. Adjust the parallelism according to how beefy your
62 machine is. 61 machine is.
63 62
64 A small proportion of URLs would time out, or fail for some other reasons. When 63 A small proportion of URLs would time out, or fail for some other reasons. When
65 you've collected enough data, run the command again with option 64 you've collected enough data, run the command again with option `--write-index`
66 ```--write-index``` to export data for the next stage. 65 to export data for the next stage.
67 66
68 ``` 67 ```bash
69 ./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index 68 ./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
70 ``` 69 ```
71 70
72 ## Labeling 71 ## Labeling
73 72
74 Use ```server.py``` to serve the web site for data labeling. Human effort is 73 Use `server.py` to serve the web site for data labeling. Human effort is around
75 around 10~20 seconds per entry. 74 10~20 seconds per entry.
76 75
77 ``` 76 ```bash
78 ./server.py --data-dir out_dir 77 ./server.py --data-dir out_dir
79 ``` 78 ```
80 79
81 It should print something like: 80 It should print something like:
82 81
83 ``` 82 ```
84 [21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081 83 [21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
85 ``` 84 ```
86 85
87 Then visit that address in your browser. 86 Then visit that address in your browser.
88 87
89 The labels would be written to ```out_dir/archive/``` periodically. 88 The labels would be written to `out_dir/archive/` periodically.
90 89
91 ## Data preparation for training 90 ## Data preparation for training
92 91
93 In the step with ```--write-index```, ```get_screenshots.py``` writes the 92 In the step with `--write-index`, `get_screenshots.py` writes the extracted raw
94 extracted raw features to ```out_dir/feature```. We can use 93 features to `out_dir/feature`. We can use `calculate_derived_features.py` to
95 ```calculate_derived_features.py``` to convert it to the final derived features. 94 convert it to the final derived features.
96 95
97 ``` 96 ```bash
98 ./calculate_derived_features.py --core out_dir/feature --out derived.txt 97 ./calculate_derived_features.py --core out_dir/feature --out derived.txt
99 ``` 98 ```
100 99
101 Then use ```write_features_csv.py``` to combine with the label. 100 Then use `write_features_csv.py` to combine with the label.
102 101
103 ``` 102 ```bash
104 ./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled 103 ./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
105 ``` 104 ```
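The model training step itself is out of scope for this pipeline, but to
illustrate the kind of classifier involved, here is a minimal decision-stump
AdaBoost sketch in plain Python. The feature vectors and labels below are made
up (e.g. paragraph count and link density are only hypothetical features); the
actual Chrome model was trained with its own toolchain on the real derived
features.

```python
# Minimal AdaBoost over decision stumps; an illustrative sketch only,
# not the trainer used for the Chrome model.
import math

def train_adaboost(X, y, rounds=20):
    """X: list of feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n          # example weights, start uniform
    stumps = []                # (feature index, threshold, polarity, alpha)
    for _ in range(rounds):
        best = None            # (weighted error, feat, thr, pol, preds)
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for pol in (1, -1):
                    preds = [pol if x[f] >= thr else -pol for x in X]
                    err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # stump weight
        stumps.append((f, thr, pol, alpha))
        # Re-weight: misclassified examples get heavier for the next round.
        w = [wi * math.exp(-alpha * p * yi) for wi, p, yi in zip(w, preds, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return stumps

def predict(stumps, x):
    score = sum(a * (pol if x[f] >= thr else -pol) for f, thr, pol, a in stumps)
    return 1 if score >= 0 else -1

# Toy data with two hypothetical features: [paragraph count, link density].
X = [[30, 0.1], [40, 0.2], [5, 0.9], [3, 0.8], [25, 0.15], [2, 0.7]]
y = [1, 1, -1, -1, 1, -1]   # +1 = distillable, -1 = not
model = train_adaboost(X, y, rounds=5)
```

In practice the `labeled` output above would supply the real feature vectors
and labels instead of the toy lists.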