Index: heuristics/distillable/README.md
diff --git a/heuristics/distillable/README.md b/heuristics/distillable/README.md
index 685c6f7c6cc2d99e2384f4735b60bc782ac24b51..6f4a1571e4fedd3ce8e3046a3a1ab13f1eba830c 100644
--- a/heuristics/distillable/README.md
+++ b/heuristics/distillable/README.md
@@ -21,12 +21,11 @@ short list for dry run.
## Data preparation for labeling
-Use ```get_screenshots.py``` to generate the screenshots of the original and
-distilled web page, and extract the features by running
-```extract_features.js```. You can see how it works by running the following
-command.
+Use `get_screenshots.py` to generate screenshots of the original and distilled
+web pages, and to extract the features by running `extract_features.js`. You
+can see how it works by running the following command:
-```
+```bash
./get_screenshots.py --out out_dir --urls-file urls.txt
```
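+
+The `--urls-file` argument is a plain list of pages to process. A hypothetical
+`urls.txt` for a dry run (one URL per line is an assumption about the expected
+format, and the URLs are placeholders):
+
+```bash
+# Create a small urls.txt for a dry run (example URLs, one per line).
+cat > urls.txt <<EOF
+https://example.com/news/some-article.html
+https://example.com/blog/another-post.html
+EOF
+```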
@@ -34,22 +33,22 @@ If everything goes fine, run it inside xvfb. Specifying the screen resolution
makes the size of the screenshots consistent. It also prevents the Chrome window
from interrupting your work on the main monitor.
-```
+```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt
```
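+
+If `xvfb-run` is not available, it is typically provided by the xvfb package
+(assuming a Debian or Ubuntu machine; other distributions package it under a
+similar name):
+
+```bash
+# Install Xvfb, which ships the xvfb-run wrapper on Debian-based systems.
+sudo apt-get install xvfb
+```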
One entry takes about 30 seconds. Depending on the number of entries, it could
-be a lengthy process. If it is interrupted, you could use option ```--resume```
-to continue.
+be a lengthy process. If it is interrupted, you can use the `--resume` option
+to continue.
-```
+```bash
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```
Running multiple instances concurrently is recommended if the list is long
enough. You can create a Makefile like this:
-```
+```make
ALL=$(addsuffix .target,$(shell seq 1000))
all: $(ALL)
@@ -58,23 +57,23 @@ all: $(ALL)
xvfb-run -a -s "-screen 0 1600x5000x24" ./get_screenshots.py --out out_dir --urls-file urls.txt --resume
```
-And then run ```make -j20```. Adjust the parallelism according to how beefy your
+And then run `make -j20`. Adjust the parallelism according to how beefy your
machine is.
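+
+A reasonable starting point is one job per CPU core (assuming GNU coreutils
+`nproc` is available); tune the number up or down from there:
+
+```bash
+# Run one screenshot job per CPU core.
+make -j"$(nproc)"
+```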
A small proportion of URLs will time out or fail for other reasons. When
-you've collected enough data, run the command again with option
-```--write-index``` to export data for the next stage.
+you've collected enough data, run the command again with the `--write-index`
+option to export data for the next stage.
-```
+```bash
./get_screenshots.py --out out_dir --urls-file urls.txt --resume --write-index
```
## Labeling
-Use ```server.py``` to serve the web site for data labeling. Human effort is
-around 10~20 seconds per entry.
+Use `server.py` to serve the website for data labeling. Human effort is around
+10-20 seconds per entry.
-```
+```bash
./server.py --data-dir out_dir
```
@@ -86,20 +85,20 @@ It should print something like:
Then visit that address in your browser.
-The labels would be written to ```out_dir/archive/``` periodically.
+The labels are written to `out_dir/archive/` periodically.
## Data preparation for training
-In the step with ```--write-index```, ```get_screenshots.py``` writes the
-extracted raw features to ```out_dir/feature```. We can use
-```calculate_derived_features.py``` to convert it to the final derived features.
+In the `--write-index` step, `get_screenshots.py` writes the extracted raw
+features to `out_dir/feature`. Use `calculate_derived_features.py` to convert
+them into the final derived features.
-```
+```bash
./calculate_derived_features.py --core out_dir/feature --out derived.txt
```
-Then use ```write_features_csv.py``` to combine with the label.
+Then use `write_features_csv.py` to combine them with the labels.
-```
+```bash
./write_features_csv.py --marked $(ls -rt out_dir/archive/*|tail -n1) --features derived.txt --out labeled
```
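+
+The `$(ls -rt out_dir/archive/*|tail -n1)` substitution selects the most
+recently written label archive. An equivalent two-step form of the same
+command (a sketch, assuming the directory layout above):
+
+```bash
+# Pick the newest label archive produced by server.py, then combine it with
+# the derived features into the labeled output.
+latest_labels=$(ls -rt out_dir/archive/* | tail -n 1)
+./write_features_csv.py --marked "$latest_labels" --features derived.txt --out labeled
+```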