| OLD | NEW |
| 1 Parsing | 1 Parsing |
| 2 ======= | 2 ======= |
| 3 | 3 |
| 4 Parsing in Sky is a strict pipeline consisting of four stages: | 4 Parsing in Sky is a strict pipeline consisting of five stages: |
| 5 | 5 |
| 6 - decoding, which converts incoming bytes into Unicode characters | 6 - decoding, which converts incoming bytes into Unicode characters |
| 7 using UTF-8 | 7 using UTF-8. |
| 8 | 8 |
| 9 - normalising, which converts certain sequences of characters | 9 - normalising, which manipulates the sequence of characters. |
| 10 | 10 |
| 11 - tokenising, which converts these characters into tokens | 11 - tokenising, which converts these characters into three kinds of |
| 12 tokens: character tokens, start tag tokens, and end tag tokens. |
| 13 Character tokens have a single character value. Tag tokens have a |
| 14 tag name, and a list of name/value pairs known as attributes. |
| 12 | 15 |
| 13 - tree construction, which converts these tokens into a tree of nodes | 16 - token cleanup, which converts sequences of character tokens into |
| 17 string tokens, and removes duplicate attributes in tag tokens. |
| 18 |
| 19 - tree construction, which converts these tokens into a tree of nodes. |
| 14 | 20 |
| 15 Later stages cannot affect earlier stages. | 21 Later stages cannot affect earlier stages. |
| 16 | 22 |
| 17 When a sequence of bytes is to be parsed, there is always a defined | 23 When a sequence of bytes is to be parsed, there is always a defined |
| 18 _parsing context_, which is either "application" or "module". | 24 _parsing context_, which is either an Application object or a Module |
| 25 object. |
| 19 | 26 |
| 20 | 27 |
| 21 Decoding stage | 28 Decoding stage |
| 22 -------------- | 29 -------------- |
| 23 | 30 |
| 24 To decode a sequence of bytes _bytes_ for parsing, the [UTF-8 | 31 To decode a sequence of bytes _bytes_ for parsing, the [UTF-8 |
| 25 decoder](https://encoding.spec.whatwg.org/#utf-8-decoder) must be used | 32 decoder](https://encoding.spec.whatwg.org/#utf-8-decoder) must be used |
| 26 to transform _bytes_ into a sequence of characters _characters_. | 33 to transform _bytes_ into a sequence of characters _characters_. |
| 27 | 34 |
| 28 This sequence must then be passed to the normalisation stage. | 35 This sequence must then be passed to the normalisation stage. |
| (...skipping 18 matching lines...) |
| 47 Tokenisation stage | 54 Tokenisation stage |
| 48 ------------------ | 55 ------------------ |
| 49 | 56 |
| 50 To tokenise a sequence of characters, a state machine is used. | 57 To tokenise a sequence of characters, a state machine is used. |
| 51 | 58 |
| 52 Initially, the state machine must begin in the **signature** state. | 59 Initially, the state machine must begin in the **signature** state. |
| 53 | 60 |
| 54 Each character in turn must be processed according to the rules of the | 61 Each character in turn must be processed according to the rules of the |
| 55 state at the time the character is processed. A character is processed | 62 state at the time the character is processed. A character is processed |
| 56 once it has been _consumed_. This produces a stream of tokens; the | 63 once it has been _consumed_. This produces a stream of tokens; the |
| 57 tokens must be passed to the tree construction stage. | 64 tokens must be passed to the token cleanup stage. |
| 58 | 65 |
| 59 When the last character is consumed, the tokeniser ends. | 66 When the last character is consumed, the tokeniser ends. |
| 60 | 67 |
| 61 | 68 |
| 62 ### Expecting a string ### | 69 ### Expecting a string ### |
| 63 | 70 |
| 64 When the user agent is to _expect a string_, it must run these steps: | 71 When the user agent is to _expect a string_, it must run these steps: |
| 65 | 72 |
| 66 1. Let _expectation_ be the string to expect. When this string is | 73 1. Let _expectation_ be the string to expect. When this string is |
| 67 indexed, the first character has index 0. | 74 indexed, the first character has index 0. |
| (...skipping 10 matching lines...) |
| 78 | 85 |
| 79 6. Switch to the **expect a string** state. | 86 6. Switch to the **expect a string** state. |
| 80 | 87 |
| 81 | 88 |
| 82 ### Tokeniser states ### | 89 ### Tokeniser states ### |
| 83 | 90 |
| 84 #### **Signature** state #### | 91 #### **Signature** state #### |
| 85 | 92 |
| 86 If the current character is... | 93 If the current character is... |
| 87 | 94 |
| 88 * '```#```': If the _parsing context_ is not "application", switch to | 95 * '```#```': If the _parsing context_ is not an Application, switch to |
| 89 the _failed signature_ state. Otherwise, expect the string | 96 the _failed signature_ state. Otherwise, expect the string |
| 90 "```#!mojo mojo:sky```", with _after signature_ as the _success_ | 97 "```#!mojo mojo:sky```", with _after signature_ as the _success_ |
| 91 state and _failed signature_ as the _failure_ state. | 98 state and _failed signature_ as the _failure_ state. |
| 92 | 99 |
| 93 * '```S```': If the _parsing context_ is not "module", switch to the | 100 * '```S```': If the _parsing context_ is not a Module, switch to the |
| 94 _failed signature_ state. Otherwise, expect the string | 101 _failed signature_ state. Otherwise, expect the string |
| 95 "```SKY MODULE```", with _after signature_ as the _success_ state, | 102 "```SKY MODULE```", with _after signature_ as the _success_ state, |
| 96 and _failed signature_ as the _failure_ state. | 103 and _failed signature_ as the _failure_ state. |
| 97 | 104 |
| 98 * Anything else: Jump to the **failed signature** state. | 105 * Anything else: Jump to the **failed signature** state. |
| 99 | 106 |
| 100 | 107 |
| 101 #### **Expect a string** state #### | 108 #### **Expect a string** state #### |
| 102 | 109 |
| 103 If the current character is not the same as the <i>index</i>th character in | 110 If the current character is not the same as the <i>index</i>th character in |
| (...skipping 284 matching lines...) |
| 388 tag** state. | 395 tag** state. |
| 389 | 396 |
| 390 * Anything else: Append the current character to the tag name, and | 397 * Anything else: Append the current character to the tag name, and |
| 391 consume the current character. Stay in this state. | 398 consume the current character. Stay in this state. |
| 392 | 399 |
| 393 | 400 |
| 394 ### **Void tag** state ### | 401 ### **Void tag** state ### |
| 395 | 402 |
| 396 If the current character is... | 403 If the current character is... |
| 397 | 404 |
| 398 * '```>```': Consume the current character. Switch to the **after | 405 * '```>```': Consume the current character. Switch to the **after void |
| 399 tag** state. | 406 tag** state. |
| 400 | 407 |
| 401 * Anything else: Switch to the **before attribute name** state without | 408 * Anything else: Switch to the **before attribute name** state without |
| 402 consuming the current character. | 409 consuming the current character. |
| 403 | 410 |
| 404 | 411 |
| 405 ### **Before attribute name** state ### | 412 ### **Before attribute name** state ### |
| 406 | 413 |
| 407 If the current character is... | 414 If the current character is... |
| 408 | 415 |
| (...skipping 137 matching lines...) |
| 546 | 553 |
| 547 If the tag token was a start tag token and the tag name was | 554 If the tag token was a start tag token and the tag name was |
| 548 '```script```', then switch to the **script raw data** state. | 555 '```script```', then switch to the **script raw data** state. |
| 549 | 556 |
| 550 If the tag token was a start tag token and the tag name was | 557 If the tag token was a start tag token and the tag name was |
| 551 '```style```', then switch to the **style raw data** state. | 558 '```style```', then switch to the **style raw data** state. |
| 552 | 559 |
| 553 Otherwise, switch to the **data** state. | 560 Otherwise, switch to the **data** state. |
| 554 | 561 |
| 555 | 562 |
| 563 ### **After void tag** state ### |
| 564 |
| 565 Emit the tag token. |
| 566 |
| 567 If the tag token is a start tag token, emit an end tag token with the |
| 568 same tag name. |
| 569 |
| 570 Switch to the **data** state. |
| 571 |
| 572 |
| 556 ### **Comment start 1** state ### | 573 ### **Comment start 1** state ### |
| 557 | 574 |
| 558 If the current character is... | 575 If the current character is... |
| 559 | 576 |
| 560 * '```-```': Consume the character and switch to the **comment start | 577 * '```-```': Consume the character and switch to the **comment start |
| 561 2** state. | 578 2** state. |
| 562 | 579 |
| 563 * '```>```': Emit character tokens for '```<!>```'. Consume the | 580 * '```>```': Emit character tokens for '```<!>```'. Consume the |
| 564 current character. Switch to the **data** state. | 581 current character. Switch to the **data** state. |
| 565 | 582 |
| (...skipping 294 matching lines...) |
| 860 | 877 |
| 861 * Any other character in the range '```0```'..'```9```', | 878 * Any other character in the range '```0```'..'```9```', |
| 862 '```a```'..'```f```', '```A```'..'```F```': Consume the character | 879 '```a```'..'```f```', '```A```'..'```F```': Consume the character |
| 863 and stay in this state. | 880 and stay in this state. |
| 864 | 881 |
| 865 * Anything else: Run the _emitting operation_ for all but the last | 882 * Anything else: Run the _emitting operation_ for all but the last |
| 866 character in _raw value_, and switch to the **data** state without | 883 character in _raw value_, and switch to the **data** state without |
| 867 consuming the current character. | 884 consuming the current character. |
| 868 | 885 |
| 869 | 886 |
| 870 Tree construction | 887 Token cleanup stage |
| 871 ----------------- | 888 ------------------- |
| 872 | 889 |
| 873 To construct a node tree from a _sequence of tokens_ and a document _document_: | 890 Replace each sequence of character tokens with a single string token |
| 891 whose value is the concatenation of all the characters in the |
| 892 character tokens. |
| 893 |
| 894 For each start tag token, remove all but the first name/value pair for |
| 895 each name (i.e. remove duplicate attributes, keeping only the first |
| 896 one). |
| 897 |
| 898 For each end tag token, remove the attributes entirely. |
| 899 |
| 900 If the token is a start tag token, notify the JavaScript token stream |
| 901 callback of the token. |
| 902 |
| 903 Then, pass the tokens to the tree construction stage. |
| 904 |
| 905 |
| 906 Tree construction stage |
| 907 ----------------------- |
| 908 |
| 909 To construct a node tree from a _sequence of tokens_ and a document |
| 910 _document_: |
| 874 | 911 |
| 875 1. Initialize the _stack of open nodes_ to be _document_. | 912 1. Initialize the _stack of open nodes_ to be _document_. |
| 876 2. Consider each token _token_ in the _sequence of tokens_ in turn. | 913 2. Consider each token _token_ in the _sequence of tokens_ in turn, as |
| 877 - If _token_ is a text token, | 914 follows. If a token is to be skipped, then jump straight to the |
| 878 1. Create a text node _node_ with character data _token.data_. | 915 next token, without doing any more work with the skipped token. |
| 916 - If _token_ is a string token, |
| 917 1. If the value of the token contains only U+0020 and U+000A |
| 918 characters, and there is no ```t``` element on the _stack of |
| 919 open nodes_, then skip the token. |
| 920 2. Create a text node _node_ whose character data is the value of |
| 921 the token. |
| 922 3. Append _node_ to the top node in the _stack of open nodes_. |
| 923 - If _token_ is a start tag token, |
| 924 1. Create an element _node_ with tag name and attributes given by |
| 925 the token. |
| 879 2. Append _node_ to the top node in the _stack of open nodes_. | 926 2. Append _node_ to the top node in the _stack of open nodes_. |
| 880 - If _token_ is a start tag token, | 927 - If _token_ is an end tag token, |
| 881 1. Create an element _node_ with tag name _token.tagName_ and attributes | 928 1. Let _node_ be the topmost node in the _stack of open nodes_ |
| 882 _token.attributes_. | 929 whose tag name is the same as the token's tag name, if any. If |
| 883 2. Append _node_ to the top node in the _stack of open nodes_. | 930 there isn't one, skip this token. |
| 884 3. If the _token.selfClosing_ flag is not set, push _node_ onto the | 931 2. If there's a ```template``` element in the _stack of open |
| 885 _stack of open elements_. | 932 nodes_ above _node_, then skip this token. |
| 886 4. If _token.tagName_ is _script_, TODO: Execute the script. | 933 3. Pop nodes from the _stack of open nodes_ until _node_ has been |
| 887 - If _token_ is an end tag token, | 934 popped. |
| 888 1. If the _stack of open nodes_ contains a node whose _tagName_ is | 935 4. If _node_'s tag name is ```script```, then yield until there |
| 889 _token.tagName_, | 936 are no pending import loads, then execute the script given by |
| 890 - Pop nodes from the _stack of open nodes_ until a node with | 937 the element's contents. |
| 891 a _tagName_ equal to _token.tagName_ has been popped. | 938 3. Yield until there are no pending import loads. |
| 892 2. Otherwise, ignore _token_. | 939 4. Fire a ```load``` event at the _parsing context_ object. |
| 893 - If _token_ is a comment token, | |
| 894 1. Ignore _token_. | |
| 895 - If _token_ is an EOF token, | |
| 896 1. Pop all the nodes from the _stack of open nodes_. | |
| 897 2. Signal _document_ that parsing is complete. | |
| 898 | |
| 899 TODO(ianh): <template>, <t> | |
| OLD | NEW |
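
The token cleanup stage introduced in the NEW column can be illustrated with a minimal Python sketch. This is not Sky's implementation: the token classes (`CharacterToken`, `StringToken`, `StartTagToken`, `EndTagToken`), the `cleanup` function, and the `on_start_tag` parameter (standing in for the JavaScript token stream callback) are all hypothetical names chosen for illustration. Attributes are assumed to be an ordered list of name/value pairs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical token types; the spec only describes them in prose.
@dataclass
class CharacterToken:
    value: str  # a single character

@dataclass
class StringToken:
    value: str  # concatenation of a run of character tokens

@dataclass
class StartTagToken:
    name: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)

@dataclass
class EndTagToken:
    name: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)

def cleanup(tokens, on_start_tag: Optional[Callable] = None):
    """Return the cleaned-up token list for the tree construction stage."""
    out = []
    pending = []  # characters buffered from consecutive character tokens

    def flush():
        # Replace the buffered run of character tokens with one string token.
        if pending:
            out.append(StringToken("".join(pending)))
            pending.clear()

    for token in tokens:
        if isinstance(token, CharacterToken):
            pending.append(token.value)
            continue
        flush()
        if isinstance(token, StartTagToken):
            # Keep only the first name/value pair for each attribute name.
            seen = set()
            kept = []
            for name, value in token.attributes:
                if name not in seen:
                    seen.add(name)
                    kept.append((name, value))
            token = StartTagToken(token.name, kept)
            # Assumption: the callback sees the token after deduplication.
            if on_start_tag is not None:
                on_start_tag(token)
        elif isinstance(token, EndTagToken):
            # End tags lose their attributes entirely.
            token = EndTagToken(token.name, [])
        out.append(token)
    flush()
    return out
```

Buffering characters and flushing on the next tag token keeps the pass single-sweep, which matches the rule that later stages cannot affect earlier ones.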
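
The stack handling in the NEW tree construction steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual algorithm: the `Node` class, the `construct` function, and the tuple token shapes are hypothetical. Script execution, pending import loads, and the ```load``` event are omitted. The sketch also assumes that each element created for a start tag token is pushed onto the _stack of open nodes_, which the NEW text does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    tag: Optional[str] = None   # None for the document and for text nodes
    text: Optional[str] = None  # character data, for text nodes only
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

def construct(tokens, document):
    """Build a node tree from cleaned-up tokens.

    Hypothetical token shapes: ("string", value),
    ("start", name, [(attr_name, attr_value), ...]), ("end", name).
    """
    stack = [document]  # the stack of open nodes; the last item is the top
    for token in tokens:
        kind = token[0]
        if kind == "string":
            value = token[1]
            only_space = all(c in " \n" for c in value)
            # Skip whitespace-only text unless a `t` element is open.
            if only_space and not any(n.tag == "t" for n in stack):
                continue
            stack[-1].children.append(Node(text=value))
        elif kind == "start":
            node = Node(tag=token[1], attributes=dict(token[2]))
            stack[-1].children.append(node)
            stack.append(node)  # assumption: start tags open a node
        elif kind == "end":
            # Find the topmost open node with a matching tag name.
            index = next(
                (i for i in range(len(stack) - 1, 0, -1)
                 if stack[i].tag == token[1]),
                None,
            )
            if index is None:
                continue  # no matching open node: skip the token
            if any(n.tag == "template" for n in stack[index + 1:]):
                continue  # a template above the match blocks the end tag
            del stack[index:]  # pop nodes until the match has been popped
    return document
```

For example, with a ```template``` open above a ```div```, an end tag for ```div``` is skipped until the ```template``` itself has been closed.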