Side by Side Diff: heuristics/distillable/README.md

Issue 1808503002: Update distillability modeling scripts to predict long articles (Closed) Base URL: git@github.com:chromium/dom-distiller.git@ml-visible
Patch Set: update docs Created 4 years, 7 months ago
# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on all
page navigations, it needs to be cheap to compute. Running DOM distiller to
see if the output is empty would be too slow, and whether DOM distiller returns
results isn't necessarily equivalent to whether the page should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/ to predict whether a page
contains an article. Another model added in http://crrev.com/1703313003 predicts
whether the article is long enough. The pipeline, except for the model training
part, is described below.

## URL gathering

Gather a bunch of popular URLs that are representative of sites users frequent.
Put these URLs in a file, one per line. It might make sense to start with a
short list for a dry run.

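For illustration only, here is a hedged sketch of what the URL file might look like; the URLs below are placeholders, not part of any real list:

```bash
# Hypothetical example of urls.txt: one URL per line, no headers or extra columns.
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Readability
https://www.example.com/news/some-long-article
https://www.example.com/photo-gallery
EOF
```
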
## Data crawling

Use `get_screenshots.py` to generate the screenshots of the original and
distilled web page, and extract the features by running `extract_features.js`.
You can see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

Append option `--emulate-mobile` if mobile-friendliness is important, and use
`--save-mhtml` to keep a copy in MHTML format.
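
For example, assuming the two options can be combined with the basic invocation above (same `out_dir` and `urls.txt`):

```bash
# Mobile emulation plus an MHTML copy of each crawled page.
./get_screenshots.py --out out_dir --urls-file urls.txt --emulate-mobile --save-mhtml
```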

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome window
from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you could use option `--resume` to
continue.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `nice make -j10 -k`. Adjust the parallelism according to how beefy
your machine is. The `-k` option is essential so that make keeps going when
individual entries fail.

**Tips and caveats:**

- Use tmpfs for /tmp to avoid thrashing your disk. It would easily be IO-bound
  even if you use an SSD for /tmp (a minimal mount sketch follows this list).
- `atop` is useful when experimenting with the parallelism.
- For a 40-core, 64 GB workstation, `-j80` can keep it CPU-bound, with a
  throughput of ~100 entries/minute.
- You might need to manually kill a few stray Chrome or xvfb-run processes
  after hitting `Ctrl-C` for `make`.

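A minimal sketch of the tmpfs tip above, plus a rough way to watch crawl progress. The mount size is an arbitrary assumption, and using the per-entry `*.feature-derived` files (described under "Data preparation for training" below) as a completion marker is just a convention of this sketch:

```bash
# Back /tmp with RAM before starting the crawl; 16G is an arbitrary size choice.
sudo mount -t tmpfs -o size=16G tmpfs /tmp

# Approximate progress: each finished entry leaves a *.feature-derived file in out_dir.
echo "$(ls out_dir/*.feature-derived 2>/dev/null | wc -l) of $(wc -l < urls.txt) entries done"
```
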
A small proportion of URLs would time out, or fail for some other reasons. When
you've collected enough data, run the command again with option `--write-index`
to export data for the next stage.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --write-index
```

## Labeling

This section is only needed for the distillability model, not for the
long-article model.

Use `server.py` to serve the web site for data labeling. Human effort is around
10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels would be written to `out_dir/archive/` periodically.

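The most recent file in that archive is what the final `write_features_csv.py` step (described below) consumes; a quick way to peek at it, assuming the same `out_dir`:

```bash
# Show the most recently written label archive (the one used for training later).
ls -rt out_dir/archive/* | tail -n1
```
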
## Data preparation for training

### Feature re-extraction from MHTML archive

When experimenting with feature extraction, being able to extract new features
is useful. After modifying `extract_features.js`, modify the Makefile and change
the command to:

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --load-mhtml --skip-distillation
```

Then rerun `nice make -j10 -k`.

### Recalculating derived features without extracting again

This section is usually optional. You might need it if you've changed how
features are derived, and want to recalculate the derived features from the raw
features without extracting the raw features again. This can be useful because
feature re-extraction from the MHTML archive can sometimes differ from the
original web page. It is also necessary if you are dealing with an older dataset
where the derived features were not calculated when crawling.

The derived features are saved to `out_dir/*.feature-derived` when crawling
each entry. In the step with `--write-index`, `get_screenshots.py` writes the
derived features to `out_dir/feature-derived`. To save time, raw features are
not aggregated by default, but you can uncomment the line in function
`writeFeature()` to write the raw features to `out_dir/feature` as well. We can
then use `calculate_derived_features.py` to convert them to derived features.

```bash
./calculate_derived_features.py --core out_dir/feature --out out_dir/feature-derived
```

### Sanity check

This step is optional.

`check_derived_features.py` compares the derived features between the JavaScript
implementation and the native implementation in Chrome. This only works if your
Chrome is new enough to support distillability JSON dumping (with command line
argument `--distillability-dev`).

```
./check_derived_features.py --features out_dir/feature-derived
```

Or if you want to compare the features derived from the MHTML archive, do this:

```
./check_derived_features.py --features out_dir/mfeature-derived --from-mhtml
```

When comparing the features extracted from the original page, the error rate
would be higher because the feature extractions by JS and native code are done
at different events, and the DOM could change dynamically. On the other hand,
features extracted from the MHTML archive should be exactly the same. However,
due to issues like https://crbug.com/586034, MHTML is not fully offline, and the
results can be non-deterministic. Sadly, there is currently no good way in
webdriver to force offline behavior. Other than that, mismatches between the two
implementations should be regarded as bugs.

`check_distilled_mhtml.py` compares the distilled content from the original page
with the distilled content from the MHTML archive.

```
./check_distilled_mhtml.py --dir out_dir
```

These two should be exactly the same. Known differences include:

- The original page has the next page stitched.
- In some rare cases, MHTML would fail to distill and get no data.

We still have inconsistencies that need investigation.

### Final output for training

Use `write_features_csv.py` to combine the derived features with the label.

For the distillability model, run:

```
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features out_dir/feature-derived --out labelled
```

Or for the long-article model, run:

```
./write_features_csv.py --distilled out_dir/dfeature-derived --features out_dir/feature-derived --out labelled
```

Then lots of files named `labelled-*.csv` would be created.
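
As a quick sanity check on the output (not part of the original scripts, just a hedged way to confirm the shards exist and gauge their size):

```bash
# List the generated shards and count the total number of lines across them.
ls labelled-*.csv
cat labelled-*.csv | wc -l
```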