Chromium Code Reviews

Side by Side Diff: heuristics/distillable/README.md

Issue 1620043002: Add scripts for distillability modelling (Closed) Base URL: git@github.com:chromium/dom-distiller.git@master
Patch Set: set upstream patchset, identical to patch set 2 Created 4 years, 10 months ago
1 # Distillability Heuristics
2
3 ## Goal
4
5 We would like to know whether it's useful to run DOM distiller on a page. This
6 signal could be used in places like the browser UI. Since this test would run on
7 every page navigation, it needs to be cheap to compute. Running DOM distiller and
8 checking whether the output is empty would be too slow, and whether DOM distiller
9 returns results isn't necessarily equivalent to whether the page should be distilled.
10
11 Considering all the constraints, we decided to train a machine learning model
12 that takes features from a page and classifies it. The trained AdaBoost model was
13 added to Chrome in http://crrev.com/1405233009/. The pipeline, except for the
14 model training step, is described below.
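
To make the "cheap to compute" requirement concrete, here is a minimal sketch of how a boosted ensemble of decision stumps can score a feature vector. It is not the model that shipped in Chrome: the `Stump` type, the feature names, the thresholds, and the weights are all invented for illustration, and the real feature set comes from `extract_features.js`.

```python
from typing import Dict, List, NamedTuple


class Stump(NamedTuple):
    """One weak learner: a threshold test on a single feature."""
    feature: str      # hypothetical feature name
    threshold: float  # split point
    polarity: float   # +1.0 or -1.0: the vote cast when value > threshold
    weight: float     # positive vote weight (alpha) assigned by boosting


def adaboost_score(features: Dict[str, float], stumps: List[Stump]) -> float:
    """Return the weighted sum of stump votes; each vote is +1 or -1."""
    score = 0.0
    for stump in stumps:
        above = features[stump.feature] > stump.threshold
        vote = stump.polarity if above else -stump.polarity
        score += stump.weight * vote
    return score


def is_distillable(features: Dict[str, float], stumps: List[Stump]) -> bool:
    """Classify a page as distillable when the ensemble score is positive."""
    return adaboost_score(features, stumps) > 0.0


# Invented stumps, for illustration only.
EXAMPLE_STUMPS = [
    Stump("num_paragraphs", 5.0, 1.0, 0.8),
    Stump("text_length", 1000.0, 1.0, 0.5),
    Stump("num_forms", 2.0, -1.0, 0.3),
]
```

Scoring is one comparison and one addition per stump, which is why a model of this shape can plausibly run on every navigation.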
15
16 ## URL gathering
17
18 Gather a bunch of popular URLs that are representative of sites users frequent.
19 Put these URLs in a file, one per line. It might make sense to start with a
20 short list for a dry run.
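
One possible way to prepare that file is sketched below. The helper names `build_url_list` and `write_url_file` are hypothetical and not part of this change; the only thing taken from the text above is the one-URL-per-line layout and the idea of a short list for a dry run.

```python
from typing import Iterable, List, Optional


def build_url_list(raw_urls: Iterable[str], limit: Optional[int] = None) -> List[str]:
    """Strip whitespace, drop blanks and duplicates, and keep first-seen order."""
    seen = set()
    urls = []
    for raw in raw_urls:
        url = raw.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    # A small limit gives the short list suggested for a dry run.
    return urls if limit is None else urls[:limit]


def write_url_file(urls: Iterable[str], path: str) -> None:
    """Write one URL per line."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

Calling `write_url_file(build_url_list(candidates, limit=20), "urls.txt")` would produce a 20-entry file suitable for a first pass, where `candidates` is whatever raw list you have gathered.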
21
22 ## Data preparation for labeling
23
24 Use ```get_screenshots.py``` to generate screenshots of the original and
25 distilled web pages, and to extract the features by running
26 ```extract_features.js```. You can see how it works by running the following
27 command:
28
29 ```
30 ./get_screenshots.py --out out_dir --urls-file urls.txt
31 ```
32
33 If that works, run it inside xvfb. Specifying the screen resolution
34 makes the size of the screenshots consistent. It also prevents the Chrome window
35 from interrupting your work on the main monitor.
36
37 ```
38 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
39 ```
40
41 One entry takes about 30 seconds, so depending on the number of entries this can
42 be a lengthy process. If it is interrupted, use the ```--resume``` option
43 to continue.
44
45 ```
46 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
47 ```
48
49 Running multiple instances concurrently is recommended if the list is long
50 enough. You can create a Makefile like this:
51
52 ```
53 ALL=$(addsuffix .target,$(shell seq 1000))
54
55 all: $(ALL)
56
57 %.target :
58 	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
59 ```
60
61 And then run ```make -j20```. Adjust the parallelism according to how beefy your
62 machine is.
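
To pick a parallelism level, a back-of-the-envelope estimate can help. The sketch below uses the roughly 30 seconds per entry quoted above and assumes, optimistically, that concurrent instances split the list evenly and do not slow each other down; `estimate_crawl_hours` is a hypothetical helper, not one of the scripts in this change.

```python
import math


def estimate_crawl_hours(num_urls: int, seconds_per_url: float = 30.0,
                         jobs: int = 1) -> float:
    """Rough wall-clock hours, assuming jobs share the URL list evenly."""
    if num_urls < 0 or jobs < 1:
        raise ValueError("num_urls must be >= 0 and jobs must be >= 1")
    # The slowest job handles the rounded-up share of the list.
    urls_per_job = math.ceil(num_urls / jobs)
    return urls_per_job * seconds_per_url / 3600.0
```

Under these assumptions, a list of 10,000 URLs comes to roughly 83 hours with a single instance and a little over 4 hours with 20 jobs. Real runs will differ because of timeouts and machine load.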
63
64 A small proportion of URLs will time out or fail for other reasons. When
65 you've collected enough data, run the command again with the
66 ```--write-index``` option to export data for the next stage.
67
68 ```
69 ./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
70 ```
71
72 ## Labeling
73
74 Use ```server.py``` to serve the web site for data labeling. Labeling takes a
75 person around 10 to 20 seconds per entry.
76
77 ```
78 ./server.py --data-dir out_dir
79 ```
80
81 It should print something like:
82
83 ```
84 [21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
85 ```
86
87 Then visit that address in your browser (for example, http://localhost:8081 when browsing from the same machine).
88
89 The labels are written to ```out_dir/archive/``` periodically.
90
91 ## Data preparation for training
92
93 In the step with ```--write-index```, ```get_screenshots.py``` writes the
94 extracted raw features to ```out_dir/feature```. Use
95 ```calculate_derived_features.py``` to convert them to the final derived features.
96
97 ```
98 ./calculate_derived_features.py --core out_dir/feature --out derived.txt
99 ```
100
101 Then use ```write_features_csv.py``` to combine the derived features with the labels.
102
103 ```
104 ./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
105 ```
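
The `$(ls -rt out_dir/archive/*|tail -n1)` expression in the command above selects the most recently modified file in the archive directory, that is, the latest snapshot of the labels. For readers who would rather do the selection in Python, a rough equivalent might look like this; `latest_label_file` is a hypothetical helper, and unlike the shell version it skips subdirectories.

```python
import glob
import os
from typing import Optional


def latest_label_file(archive_dir: str) -> Optional[str]:
    """Return the most recently modified file in archive_dir, or None if empty."""
    candidates = [
        path for path in glob.glob(os.path.join(archive_dir, "*"))
        if os.path.isfile(path)
    ]
    if not candidates:
        return None
    # Same ordering key as `ls -t`: the file modification time.
    return max(candidates, key=os.path.getmtime)
```

The returned path could then be passed as the `--marked` argument.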