heuristics/distillable/README.md - Issue 1620043002: Add scripts for distillability modelling

Side by Side Diff: heuristics/distillable/README.md

Issue 1620043002: Add scripts for distillability modelling (Closed) Base URL: git@github.com:chromium/dom-distiller.git@master

Patch Set: set upstream patchset, identical to patch set 2 Created 4 years, 10 months ago


OLD	NEW
(Empty)
	1 # Distillability Heuristics

	2

	3 ## Goal

	4

	5 We would like to know whether it's useful to run DOM distiller on a page. This

	6 signal could be used in places such as the browser UI. Since the test would run on

	7 every page navigation, it needs to be cheap to compute. Running DOM distiller and

	8 checking whether the output is empty would be too slow, and whether DOM distiller

	9 returns results isn't necessarily equivalent to whether the page should be distilled.

	10

	11 Considering all the constraints, we decided to train a machine learning model

	12 that takes features extracted from a page and classifies it. The trained AdaBoost

	13 model was added to Chrome in http://crrev.com/1405233009/. The pipeline, except

	14 for the model training part, is described below.

	15

	16 ## URL gathering

	17

	18 Gather a bunch of popular URLs that are representative of sites users frequent.

	19 Put these URLs in a file, one per line. It might make sense to start with a

	20 short list for a dry run.
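
A minimal sketch of preparing such a file, assuming the candidate URLs are already available as a list of strings. The helper names `prepare_url_list` and `write_url_file` are hypothetical and not part of this patch; the only requirement taken from the text above is the one-URL-per-line layout.

```python
from urllib.parse import urlsplit


def prepare_url_list(candidates, limit=None):
    """Return cleaned, de-duplicated http(s) URLs in their original order.

    `limit` (hypothetical) keeps only the first N URLs, e.g. for a dry run.
    """
    seen = set()
    urls = []
    for raw in candidates:
        url = raw.strip()
        # Skip blank lines and anything that is not an absolute http(s) URL.
        if not url or urlsplit(url).scheme not in ("http", "https"):
            continue
        # Keep only the first occurrence of each URL.
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls if limit is None else urls[:limit]


def write_url_file(urls, path):
    """Write one URL per line."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

For a dry run, something like `write_url_file(prepare_url_list(candidates, limit=20), "urls.txt")` would produce a short list; the limit of 20 is an arbitrary illustration.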

	21

	22 ## Data preparation for labeling

	23

	24 Use ```get_screenshots.py``` to take screenshots of the original and distilled

	25 versions of each page, and to extract the features by running

	26 ```extract_features.js```. You can see how it works by running the following

	27 command.

	28

	29 ```

	30 ./get_screenshots.py --out out_dir --urls-file urls.txt

	31 ```

	32

	33 If everything goes fine, run it inside xvfb. Specifying the screen resolution

	34 makes the size of the screenshots consistent. It also prevents the Chrome window

	35 from interrupting your work on the main monitor.

	36

	37 ```

	38 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt

	39 ```

	40

	41 One entry takes about 30 seconds, so depending on the number of entries, this can

	42 be a lengthy process. If it is interrupted, use the ```--resume``` option

	43 to continue.

	44

	45 ```

	46 xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume

	47 ```

	48

	49 Running multiple instances concurrently is recommended if the list is long

	50 enough. You can create a Makefile like this:

	51

	52 ```

	53 ALL=$(addsuffix .target,$(shell seq 1000))

	54

	55 all: $(ALL)

	56

	57 %.target :

	58 	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume

	59 ```

	60

	61 And then run ```make -j20```. Adjust the parallelism according to how beefy your

	62 machine is.
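
If Make is not convenient, the same idea can be sketched in Python: start the same `--resume` command many times, with a cap on how many run at once. This is only an illustration. The helpers `build_command` and `run_instances` are hypothetical names, and the sketch assumes, as the Makefile does, that concurrent `--resume` runs can share one output directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(out_dir, urls_file, resolution="1600x5000x24"):
    """Build the xvfb-run command line shown in the README examples."""
    return [
        "xvfb-run", "-a", "-s", f"-screen 0 {resolution}",
        "./get_screenshots.py",
        "--out", out_dir,
        "--urls-file", urls_file,
        "--resume",
    ]


def run_instances(command, runs, workers, runner=subprocess.call):
    """Run `command` `runs` times with at most `workers` running at once.

    `runner` is injectable so the scheduling can be exercised without
    launching Chrome. Returns one exit code per run.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker thread blocks on one child process at a time.
        return list(pool.map(lambda _: runner(command), range(runs)))
```

Calling `run_instances(build_command("out_dir", "urls.txt"), runs=1000, workers=20)` would roughly mirror `make -j20` over the 1000 targets above. Unlike Make, it keeps going after a failed run and simply reports the exit codes.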

	63

	64 A small proportion of URLs will time out or fail for other reasons. When

	65 you've collected enough data, run the command again with the

	66 ```--write-index``` option to export data for the next stage.

	67

	68 ```

	69 ./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index

	70 ```

	71

	72 ## Labeling

	73

	74 Use ```server.py``` to serve the web site for data labeling. Labeling takes

	75 around 10 to 20 seconds of human effort per entry.

	76

	77 ```

	78 ./server.py --data-dir out_dir

	79 ```

	80

	81 It should print something like:

	82

	83 ```

	84 [21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081

	85 ```

	86

	87 Then visit that address in your browser.

	88

	89 The labels are written to ```out_dir/archive/``` periodically.

	90

	91 ## Data preparation for training

	92

	93 In the step with ```--write-index```, ```get_screenshots.py``` writes the

	94 extracted raw features to ```out_dir/feature```. We can use

	95 ```calculate_derived_features.py``` to convert them to the final derived features.

	96

	97 ```

	98 ./calculate_derived_features.py --core out_dir/feature --out derived.txt

	99 ```

	100

	101 Then use ```write_features_csv.py``` to combine them with the labels.

	102

	103 ```

	104 ./write_features_csv.py --marked $(ls -rt out_dir/archive/* | tail -n1) --features derived.txt --out labeled

	105 ```
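
The `$(ls -rt ... | tail -n1)` substitution picks the most recently modified file in ```out_dir/archive/```, i.e. the latest snapshot of the labels. A rough Python equivalent, assuming the archive directory contains only regular files, could look like the sketch below. `newest_file` and `build_csv_command` are hypothetical helpers; only the flags already shown above are used.

```python
import glob
import os


def newest_file(pattern):
    """Return the most recently modified path matching `pattern`.

    Rough equivalent of `ls -rt PATTERN | tail -n1`; raises if nothing matches.
    """
    paths = glob.glob(pattern)
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    return max(paths, key=os.path.getmtime)


def build_csv_command(archive_dir, features, out):
    """Build the write_features_csv.py command line using the newest label file."""
    marked = newest_file(os.path.join(archive_dir, "*"))
    return [
        "./write_features_csv.py",
        "--marked", marked,
        "--features", features,
        "--out", out,
    ]
```

Failing loudly when the archive is empty avoids passing an empty ```--marked``` argument, which is what the shell version would do silently.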
