# Distillability Heuristics

## Goal

We would like to know whether it's useful to run DOM distiller on a page. This
signal could be used in places like browser UI. Since this test would run on
all page navigations, it needs to be cheap to compute. Running DOM distiller
and checking whether the output is empty would be too slow, and whether DOM
distiller returns results isn't necessarily equivalent to whether the page
should be distilled.

Considering all the constraints, we decided to train a machine learning model
that takes features from a page and classifies it. The trained AdaBoost model
was added to Chrome in http://crrev.com/1405233009/. The pipeline, except for
the model training part, is described below.

## URL gathering

Gather a set of popular URLs that are representative of the sites users
frequent. Put these URLs in a file, one per line. It might make sense to start
with a short list for a dry run.

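For a dry run, the URL file can be as small as a couple of lines. A minimal
sketch (the URLs below are placeholders, not part of the original pipeline;
substitute your own list):

```shell
# Write a tiny urls.txt for a dry run; one URL per line.
# These example URLs are placeholders -- use your own list.
cat > urls.txt <<'EOF'
https://en.wikipedia.org/wiki/Machine_learning
https://www.chromium.org/developers/
EOF
wc -l < urls.txt
```
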
## Data preparation for labeling

Use `get_screenshots.py` to generate screenshots of the original and distilled
web pages, and extract the features by running `extract_features.js`. You can
see how it works by running the following command.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```

If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome
window from interrupting your work on the main monitor.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```

One entry takes about 30 seconds. Depending on the number of entries, it could
be a lengthy process. If it is interrupted, you can use the `--resume` option
to continue.

```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:

```make
ALL=$(addsuffix .target,$(shell seq 1000))

all: $(ALL)

%.target :
	xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```

And then run `make -j20`. Adjust the parallelism according to how beefy your
machine is.

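The `-j20` above is just an example. One rule of thumb (an assumption on my
part, not from the original tooling) is to start from the number of CPU cores,
since each job spawns its own Chrome instance, and cap it because memory per
instance may be the real limit:

```shell
# Derive a starting job count from the CPU count (a heuristic, not
# part of the original tooling); cap it so many concurrent Chrome
# instances don't exhaust memory.
JOBS=$(nproc)
[ "$JOBS" -gt 20 ] && JOBS=20
echo "suggested parallelism: ${JOBS}"
```
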
A small proportion of URLs would time out or fail for other reasons. When
you've collected enough data, run the command again with the `--write-index`
option to export data for the next stage.

```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
```

## Labeling

Use `server.py` to serve the web site for data labeling. Human effort is
around 10~20 seconds per entry.

```bash
./server.py --data-dir out_dir
```

It should print something like:

```
[21/Jan/2016:22:53:53] ENGINE Serving on 0.0.0.0:8081
```

Then visit that address in your browser.

The labels are written to `out_dir/archive/` periodically.

## Data preparation for training

In the step with `--write-index`, `get_screenshots.py` writes the extracted
raw features to `out_dir/feature`. We can use `calculate_derived_features.py`
to convert them to the final derived features.

```bash
./calculate_derived_features.py --core out_dir/feature --out derived.txt
```

Then use `write_features_csv.py` to combine them with the labels.

```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
```
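
The `$(ls -rt out_dir/archive/*|tail -n1)` substitution above selects the most
recently written label archive. A small stand-alone demonstration of that
idiom, using throwaway files rather than the real archive layout:

```shell
# `ls -rt` sorts by modification time, oldest first, so the last
# line from `tail -n1` is the newest file.
mkdir -p demo_archive
touch demo_archive/older
sleep 1
touch demo_archive/newer
ls -rt demo_archive/* | tail -n1   # prints demo_archive/newer
```
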