***** Current status

Currently it runs all the way through, but the database.json has all
members[] lists empty. Most entries are skipped for "Suspect title";
some have ".pageText not found".

Currently only works on Linux; OS X (or other) will need minor path changes.

You will need a reasonably modern node.js installed.
0.5.9 is too old; 0.8.8 is not too old.
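
The version bound above can be checked mechanically at startup; a
minimal sketch (treating 0.8.8 as the minimum is an assumption read
off the line above):

```javascript
// Rough version gate based on the bounds above: 0.5.9 is too old,
// 0.8.8 is new enough. Using 0.8.8 as the exact cutoff is an assumption.
var parts = process.versions.node.split('.').map(Number);
var ok = parts[0] > 0 || parts[1] > 8 || (parts[1] === 8 && parts[2] >= 8);
if (!ok) {
  console.error('node ' + process.versions.node + ' is too old; need >= 0.8.8');
  process.exit(1);
}
```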

I needed to add my own "DumpRenderTree_resources/missingImage.gif",
for some reason.

For the reasons above, we're currently just using the checked-in
database.json from Feb 2012, but it has some bogus entries. In
particular, the one for UnknownElement would inject irrelevant German
text into our docs. So a hack in apidoc.dart (_mdnTypeNamesToSkip)
works around this.

***** Overview

Here's a rough walkthrough of how this works. The ultimate output file is
database.filtered.json.

full_run.sh executes all of the scripts in the correct order.

search.js
- read data/domTypes.json
- for each dom type:
  - search for page on www.googleapis.com
  - write search results to output/search/<type>.json
    . this is a list of search results and urls to pages

crawl.js
- read data/domTypes.json
- for each dom type:
  - for each output/search/<type>.json:
    - for each result in the file:
      - try to scrape that cached MDN page from webcache.googleusercontent.com
      - write mdn page to output/crawl/<type><index of result>.html
- write output/crawl/cache.json
  . it maps types -> search result page urls and titles
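
For illustration, the cache.json mapping described above might look
like this; the exact field names are assumptions inferred from the
description, not taken from the real file:

```javascript
// Hypothetical shape of output/crawl/cache.json: each dom type maps to
// the search-result pages that were scraped. Field names are assumptions.
var cache = {
  'Element': [
    {url: 'https://developer.mozilla.org/en/DOM/element',
     title: 'element - MDN'},
    {url: 'https://developer.mozilla.org/en/DOM/Element',
     title: 'Element - MDN'}
  ]
};
// The crawler writes each scraped page under an index into this list,
// e.g. output/crawl/Element0.html for the first result above.
```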

extract.sh
- compile extract.dart to js
- run extractRunner.js
  - read data/domTypes.json
  - read output/crawl/cache.json
  - read data/dartIdl.json
  - for each scraped search result page:
    - create a cleaned up html page in output/extract/<type><index>.html that
      contains the scraped content + a script tag that includes extract.dart.js.
    - create an args file in output/extract/<type><index>.html.json with some
      data on how that file should be processed
    - invoke DumpRenderTree on that file
    - when that returns, parse the console output and add it to database.json
    - add any errors to output/errors.json
  - save output/database.json
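
The "cleaned up html page" step might be sketched like this; the
wrapper markup is an assumption, and the real extractRunner.js output
may differ:

```javascript
// Sketch of building the cleaned-up page for one scraped result: the
// scraped content plus a script tag pulling in the compiled extract.dart.js.
// The exact wrapper markup is an assumption.
function buildExtractPage(scrapedBodyHtml) {
  return '<!DOCTYPE html>\n<html><body>\n' +
         scrapedBodyHtml + '\n' +
         '<script type="text/javascript" src="extract.dart.js"></script>\n' +
         '</body></html>\n';
}
```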

extract.dart
- xhr output/extract/<type><index>.html.json
- all sorts of shenanigans to actually pull the content out of the html
- build a JSON object with the results
- do a postmessage with that object so extractRunner.js can pull it out
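
On the runner side, the posted object comes back through
DumpRenderTree's console output, which then has to be parsed; a sketch
of that parsing, where the EXTRACT-RESULT marker prefix is an
assumption rather than the real protocol:

```javascript
// Sketch of fishing the extract.dart result back out of DumpRenderTree's
// console output. The 'EXTRACT-RESULT:' marker is an assumption; the real
// runner may use a different convention.
function parseExtractOutput(consoleText) {
  var lines = consoleText.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var m = lines[i].indexOf('EXTRACT-RESULT:');
    if (m !== -1) {
      // Everything after the marker is the JSON payload for this page.
      return JSON.parse(lines[i].slice(m + 'EXTRACT-RESULT:'.length));
    }
  }
  return null;  // no result line found; treat as an extraction error
}
```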

- run postProcess.dart
  - go through the results for each type looking for the best match
  - write output/database.html
  - write output/examples.html
  - write output/obsolete.html
  - write output/database.filtered.json which is the best matches
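
The best-match filtering could be sketched as below; the real
postProcess.dart is Dart and more involved, and the score field here is
a stand-in for whatever ranking it actually uses:

```javascript
// Sketch of the "best match" step: for each type, keep the highest-scoring
// candidate. The score field is an assumption; the real ranking logic in
// postProcess.dart may differ.
function filterBestMatches(database) {
  var filtered = {};
  Object.keys(database).forEach(function(type) {
    var best = null;
    database[type].forEach(function(candidate) {
      if (!best || candidate.score > best.score) best = candidate;
    });
    if (best) filtered[type] = best;
  });
  return filtered;
}
```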

***** Process for updating database.json using these scripts.

TODO(eub) when I get the scripts to work all the way through.