# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on all
page navigations, it needs to be cheap to compute. Running DOM distiller to see
if the output is empty would be too slow, and whether DOM distiller returns
results isn't necessarily equivalent to whether the page should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/ to predict whether a page
contains an article. Another model, added in http://crrev.com/1703313003,
predicts whether the article is long enough. The pipeline, except for the model
training part, is described below.

## URL gathering

Gather a bunch of popular URLs that are representative of sites users frequent.
Put these URLs in a file, one per line. It might make sense to start with a
short list for a dry run.

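For a dry run, such a file can be created directly from the shell. The URLs
below are placeholders; substitute sites you actually want to sample.

```shell
# Write a small urls.txt for a dry run, one URL per line.
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Boiling_point
https://www.bbc.com/news
https://example.com/
EOF
```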
## Data crawling

Use `get_screenshots.py` to generate screenshots of the original and distilled
web pages, and extract the features by running `extract_features.js`. You can
see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

Append option `--emulate-mobile` if mobile-friendliness is important, and use
`--save-mhtml` to keep a copy in MHTML format.

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome window
from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you could use option `--resume` to
continue.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `nice make -j10 -k`. Adjust the parallelism according to how beefy
your machine is. The `-k` option is essential so that make keeps going when
individual targets fail.

**Tips and caveats:**

- Use tmpfs for /tmp to avoid thrashing your disk. The process would easily be
  IO-bound even if you use an SSD for /tmp.
- `atop` is useful when experimenting with the parallelism.
- For a 40-core, 64GB workstation, `-j80` can keep it CPU-bound, with a
  throughput of ~100 entries/minute.
- You might need to manually kill a few stray Chrome or xvfb-run processes
  after hitting `Ctrl-C` for `make`.

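The cleanup in the last tip can be scripted. The `-f` patterns below are
assumptions about how the stray processes show up; check `ps` output on your
machine first.

```shell
# Kill stray processes left behind after interrupting make with Ctrl-C.
# `|| true` keeps the script from failing when nothing matches.
pkill -f xvfb-run || true
pkill -f get_screenshots.py || true
```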
A small proportion of URLs would time out, or fail for some other reason. When
you've collected enough data, run the command again with option `--write-index`
to export the data for the next stage.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --write-index
```

## Labeling

This section is only needed for the distillability model, not for the
long-article model.

Use `server.py` to serve the web site for data labeling. Human effort is around
10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels would be written to `out_dir/archive/` periodically.

## Data preparation for training

### Feature re-extraction from MHTML archive

When experimenting with feature extraction, being able to extract new features
is useful. After modifying `extract_features.js`, modify the Makefile and change
the command to:

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --load-mhtml --skip-distillation
```

Then rerun `nice make -j10 -k`.

### Recalculating derived features without extracting again

This section is usually optional. You might need it if you've changed how
features are derived and want to recalculate the derived features from the raw
features without extracting the raw features again. This can be useful because
feature re-extraction from an MHTML archive can sometimes differ from the
original web page. It is also necessary if you are dealing with an older dataset
where the derived features were not calculated when crawling.

The derived features are saved to `out_dir/*.feature-derived` when crawling
each entry. In the step with `--write-index`, `get_screenshots.py` writes the
derived features to `out_dir/feature-derived`. To save time, raw features are
not aggregated by default, but you can uncomment the line in function
`writeFeature()` to write the raw features to `out_dir/feature` as well. We can
then use `calculate_derived_features.py` to convert them to the derived
features.

```bash
./calculate_derived_features.py --core out_dir/feature --out out_dir/feature-derived
```

### Sanity check

This step is optional.

`check_derived_features.py` compares the derived features between the JavaScript
implementation and the native implementation in Chrome. This only works if your
Chrome is new enough to support distillability JSON dumping (with command line
argument `--distillability-dev`).

```bash
./check_derived_features.py --features out_dir/feature-derived
```

Or if you want to compare the features derived from the MHTML archive, do this:

```bash
./check_derived_features.py --features out_dir/mfeature-derived --from-mhtml
```

When comparing the features extracted from the original page, the error rate
would be higher because the feature extractions by JS and native code are done
at different events, and the DOM could change dynamically. On the other hand,
features extracted from the MHTML archive should be exactly the same. However,
due to issues like https://crbug.com/586034, MHTML is not fully offline, and
the results can be non-deterministic. Sadly, there is currently no good way in
webdriver to force offline behavior. Other than that, mismatches between the
two implementations should be regarded as bugs.

`check_distilled_mhtml.py` compares the distilled content from the original
page with the distilled content from the MHTML archive.

```bash
./check_distilled_mhtml.py --dir out_dir
```

These two should be exactly the same. Known differences include:

- The original page has the next page stitched.
- In some rare cases, MHTML would fail to distill and get no data.

We still have inconsistencies that need investigation.

### Final output for training

Use `write_features_csv.py` to combine the derived features with the label.

For the distillability model, run:

```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features out_dir/feature-derived --out labelled
```

Or for the long-article model, run:

```bash
./write_features_csv.py --distilled out_dir/dfeature-derived --features out_dir/feature-derived --out labelled
```

Lots of files named `labelled-*.csv` would then be created.
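
If your training setup wants a single file rather than shards, the pieces can
be concatenated. This assumes the `labelled-*.csv` files carry no header rows;
check the first shard before doing this.

```shell
# Merge all labelled shards into one CSV for training.
cat labelled-*.csv > labelled-all.csv
wc -l labelled-all.csv
```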