| OLD | NEW |
| 1 Parsing | 1 Parsing |
| 2 ======= | 2 ======= |
| 3 | 3 |
| 4 Parsing in Sky is a strict pipeline consisting of four stages: | 4 Parsing in Sky is a strict pipeline consisting of four stages: |
| 5 | 5 |
| 6 - decoding, which converts incoming bytes into Unicode characters | 6 - decoding, which converts incoming bytes into Unicode characters |
| 7 using UTF-8 | 7 using UTF-8 |
| 8 | 8 |
| 9 - normalising, which converts certain sequences of characters | 9 - normalising, which converts certain sequences of characters |
| 10 | 10 |
| (...skipping 113 matching lines...) |
| 124 | 124 |
| 125 | 125 |
| 126 #### **Consume rest of line** state #### | 126 #### **Consume rest of line** state #### |
| 127 | 127 |
| 128 If the current character is... | 128 If the current character is... |
| 129 | 129 |
| 130 * U+000A: Consume the character and switch to the **data** state. | 130 * U+000A: Consume the character and switch to the **data** state. |
| 131 * Anything else: Consume the character and stay in this state. | 131 * Anything else: Consume the character and stay in this state. |
| 132 | 132 |
| 133 | 133 |
| 134 ### Data state ### | 134 ### **Data** state ### |
| 135 | 135 |
| 136 If the current character is... | 136 If the current character is... |
| 137 |
| 138 * '```<```': Consume the character and switch to the **tag open** state. |
| 137 | 139 |
| 138 * '```&```': Consume the character and switch to the **character | 140 * '```&```': Consume the character and switch to the **character |
| 139 reference** state. | 141 reference** state, with the _return state_ set to the **data** |
| 140 | 142 state, the _extra terminating character_ unset (or set to U+0000, |
| 141 * '```<```': Consume the character and switch to the **tag open** state. | 143 which has the same effect), and the _emitting operation_ being to |
| 144 emit a character token for the given character. |
| 142 | 145 |
| 143 * Anything else: Emit the current input character as a character | 146 * Anything else: Emit the current input character as a character |
| 144 token. Consume the character. Stay in this state. | 147 token. Consume the character. Stay in this state. |
| 145 | 148 |
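For readers following the new **data** state, its dispatch can be sketched in Python as below. The function name `data_state_step` and the `(next_state, tokens)` return shape are assumptions made for this illustration and do not come from the spec; the bookkeeping for the _return state_ and _emitting operation_ is left out.

```python
def data_state_step(ch):
    """Return (next_state, emitted_tokens) for one character in the data state.

    Hypothetical helper: only the state names are taken from the spec text.
    """
    if ch == "<":
        # Consumed; nothing is emitted yet.
        return "tag open", []
    if ch == "&":
        # The spec also records the return state and emitting operation here.
        return "character reference", []
    # Anything else is emitted as a character token.
    return "data", [("character", ch)]
```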
| 146 | 149 |
| 147 TODO(ianh): Add the remaining tokenizer states. | 150 ### **Script raw data** state ### |
| 148 | 151 |
| 149 TODO(ianh): <script>, <style> | 152 TODO(ianh): spec this |
| 153 |
| 154 |
| 155 ### **Style raw data** state ### |
| 156 |
| 157 TODO(ianh): spec this |
| 158 |
| 159 |
| 160 ### **After tag** state ### |
| 161 |
| 162 Emit the tag token. |
| 163 |
| 164 If the tag token was a start tag token and the tag name was |
| 165 '```script```', then switch to the **script raw data** state. |
| 166 |
| 167 If the tag token was a start tag token and the tag name was |
| 168 '```style```', then switch to the **style raw data** state. |
| 169 |
| 170 Otherwise, switch to the **data** state. |
| 171 |
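The **after tag** state amounts to a three-way choice once the tag token has been emitted. A minimal Python sketch, assuming a hypothetical helper `after_tag_next_state` that receives whether the token was a start tag together with its tag name (neither the name nor the signature appears in the spec):

```python
def after_tag_next_state(is_start_tag, tag_name):
    """Pick the state that follows an emitted tag token.

    Hypothetical helper; only the state names come from the spec text.
    """
    if is_start_tag and tag_name == "script":
        return "script raw data"
    if is_start_tag and tag_name == "style":
        return "style raw data"
    # End tags and every other start tag return to the data state.
    return "data"
```

Note that an end tag named `script` falls through to the **data** state, since both special cases require a start tag token.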
| 172 |
| 173 ### **Tag open** state ### |
| 174 |
| 175 If the current character is... |
| 176 |
| 177 * '```!```': Consume the character and switch to the **comment start |
| 178 1** state. |
| 179 |
| 180 * '```/```': Consume the character and switch to the **close tag** |
| 181 state. |
| 182 |
| 183 * '```>```': Emit character tokens for '```<>```'. Consume the current |
| 184 character. Switch to the **data** state. |
| 185 |
| 186 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```', |
| 187 '```-```', '```_```', '```.```': Create a start tag token, let its |
| 188 tag name be the current character, consume the current character and |
| 189 switch to the **tag name** state. |
| 190 |
| 191 * Anything else: Emit the character token for '```<```'. Switch to the |
| 192 **data** state without consuming the current character. |
| 193 |
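The **tag open** rules can be read as a small decision function. The Python sketch below is illustrative only: `tag_open_step`, `TAG_NAME_START`, and the `(next_state, emitted_text, consumed)` tuple are assumed names and shapes, and creating the start tag token itself is omitted.

```python
import string

# Characters allowed to begin a tag name: 0-9, a-z, A-Z, '-', '_', '.'.
TAG_NAME_START = set(string.ascii_letters + string.digits + "-_.")


def tag_open_step(ch):
    """Return (next_state, emitted_text, consumed) for the tag open state.

    Hypothetical helper; only the state names come from the spec text.
    """
    if ch == "!":
        return "comment start 1", "", True
    if ch == "/":
        return "close tag", "", True
    if ch == ">":
        # '<>' is not a tag, so both characters are emitted as text.
        return "data", "<>", True
    if ch in TAG_NAME_START:
        return "tag name", "", True
    # Anything else: emit '<' and let the data state reprocess ch.
    return "data", "<", False
```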
| 194 |
| 195 ### **Close tag** state ### |
| 196 |
| 197 If the current character is... |
| 198 |
| 199 * '```>```': Emit character tokens for '```</>```'. Consume the current |
| 200 character. Switch to the **data** state. |
| 201 |
| 202 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```', |
| 203 '```-```', '```_```', '```.```': Create an end tag token, let its |
| 204 tag name be the current character, consume the current character and |
| 205 switch to the **tag name** state. |
| 206 |
| 207 * Anything else: Emit the character tokens for '```</```'. Switch to |
| 208 the **data** state without consuming the current character. |
| 209 |
| 210 |
| 211 ### **Tag name** state ### |
| 212 |
| 213 If the current character is... |
| 214 |
| 215 * U+0020, U+000A: Consume the current character. Switch to the |
| 216 **before attribute name** state. |
| 217 |
| 218 * '```/```': Consume the current character. Switch to the **void tag** |
| 219 state. |
| 220 |
| 221 * '```>```': Consume the current character. Switch to the **after |
| 222 tag** state. |
| 223 |
| 224 * Anything else: Append the current character to the tag name, and |
| 225 consume the current character. Stay in this state. |
| 226 |
| 227 |
| 228 ### **Void tag** state ### |
| 229 |
| 230 If the current character is... |
| 231 |
| 232 * '```>```': Consume the current character. Switch to the **after |
| 233 tag** state. |
| 234 |
| 235 * Anything else: Switch to the **before attribute name** state without |
| 236 consuming the current character. |
| 237 |
| 238 |
| 239 ### **Before attribute name** state ### |
| 240 |
| 241 If the current character is... |
| 242 |
| 243 * U+0020, U+000A: Consume the current character. Stay in this state. |
| 244 |
| 245 * '```/```': Consume the current character. Switch to the **void tag** |
| 246 state. |
| 247 |
| 248 * '```>```': Consume the current character. Switch to the **after |
| 249 tag** state. |
| 250 |
| 251 * Anything else: Create a new attribute in the tag token, and set its |
| 252 name to the current character. Consume the current character. Switch |
| 253 to the **attribute name** state. |
| 254 |
| 255 |
| 256 ### **Attribute name** state ### |
| 257 |
| 258 If the current character is... |
| 259 |
| 260 * U+0020, U+000A: Consume the current character. Switch to the **after |
| 261 attribute name** state. |
| 262 |
| 263 * '```/```': Consume the current character. Switch to the **void tag** |
| 264 state. |
| 265 |
| 266 * '```=```': Consume the current character. Switch to the **before |
| 267 attribute value** state. |
| 268 |
| 269 * '```>```': Consume the current character. Switch to the **after |
| 270 tag** state. |
| 271 |
| 272 * Anything else: Append the current character to the most recently |
| 273 added attribute's name, and consume the current character. Stay in |
| 274 this state. |
| 275 |
| 276 |
| 277 ### **After attribute name** state ### |
| 278 |
| 279 If the current character is... |
| 280 |
| 281 * U+0020, U+000A: Consume the current character. Stay in this state. |
| 282 |
| 283 * '```/```': Consume the current character. Switch to the **void tag** |
| 284 state. |
| 285 |
| 286 * '```=```': Consume the current character. Switch to the **before |
| 287 attribute value** state. |
| 288 |
| 289 * '```>```': Consume the current character. Switch to the **after |
| 290 tag** state. |
| 291 |
| 292 * Anything else: Create a new attribute in the tag token, and set its |
| 293 name to the current character. Consume the current character. Switch |
| 294 to the **attribute name** state. |
| 295 |
| 296 |
| 297 ### **Before attribute value** state ### |
| 298 |
| 299 If the current character is... |
| 300 |
| 301 * U+0020, U+000A: Consume the current character. Stay in this state. |
| 302 |
| 303 * '```>```': Consume the current character. Switch to the **after |
| 304 tag** state. |
| 305 |
| 306 * '```'```': Consume the current character. Switch to the |
| 307 **single-quoted attribute value** state. |
| 308 |
| 309 * '```"```': Consume the current character. Switch to the |
| 310 **double-quoted attribute value** state. |
| 311 |
| 312 * Anything else: Set the value of the most recently added attribute to |
| 313 the current character. Consume the current character. Switch to the |
| 314 **unquoted attribute value** state. |
| 315 |
| 316 |
| 317 ### **Single-quoted attribute value** state ### |
| 318 |
| 319 If the current character is... |
| 320 |
| 321 * '```'```': Consume the current character. Switch to the |
| 322 **before attribute name** state. |
| 323 |
| 324 * '```&```': Consume the character and switch to the **character |
| 325 reference** state, with the _return state_ set to the |
| 326 **single-quoted attribute value** state, the _extra terminating |
| 327 character_ set to '```'```', and the _emitting operation_ being to |
| 328 append the given character to the value of the most recently added |
| 329 attribute. |
| 330 |
| 331 * Anything else: Append the current character to the value of the most |
| 332 recently added attribute. Consume the current character. Stay in |
| 333 this state. |
| 334 |
| 335 |
| 336 ### **Double-quoted attribute value** state ### |
| 337 |
| 338 If the current character is... |
| 339 |
| 340 * '```"```': Consume the current character. Switch to the |
| 341 **before attribute name** state. |
| 342 |
| 343 * '```&```': Consume the character and switch to the **character |
| 344 reference** state, with the _return state_ set to the |
| 345 **double-quoted attribute value** state, the _extra terminating |
| 346 character_ set to '```"```', and the _emitting operation_ being to |
| 347 append the given character to the value of the most recently added |
| 348 attribute. |
| 349 |
| 350 * Anything else: Append the current character to the value of the most |
| 351 recently added attribute. Consume the current character. Stay in |
| 352 this state. |
| 353 |
| 354 |
| 355 ### **Unquoted attribute value** state ### |
| 356 |
| 357 If the current character is... |
| 358 |
| 359 * U+0020, U+000A: Consume the current character. Switch to the |
| 360 **before attribute name** state. |
| 361 |
| 362 * '```>```': Consume the current character. Switch to the **after |
| 363 tag** state. |
| 364 |
| 365 * '```&```': Consume the character and switch to the **character |
| 366 reference** state, with the _return state_ set to the **unquoted |
| 367 attribute value** state, the _extra terminating character_ unset (or |
| 368 set to U+0000, which has the same effect), and the _emitting |
| 369 operation_ being to append the given character to the value of the |
| 370 most recently added attribute. |
| 371 |
| 372 * Anything else: Append the current character to the value of the most |
| 373 recently added attribute. Consume the current character. Stay in |
| 374 this state. |
| 375 |
| 376 |
| 377 ### **Comment start 1** state ### |
| 378 |
| 379 If the current character is... |
| 380 |
| 381 * '```-```': Consume the character and switch to the **comment start |
| 382 2** state. |
| 383 |
| 384 * '```>```': Emit character tokens for '```<!>```'. Consume the |
| 385 current character. Switch to the **data** state. |
| 386 |
| 387 |
| 388 ### **Comment start 2** state ### |
| 389 |
| 390 If the current character is... |
| 391 |
| 392 * '```-```': Consume the character and switch to the **comment** |
| 393 state. |
| 394 |
| 395 * '```>```': Emit character tokens for '```<!->```'. Consume the |
| 396 current character. Switch to the **data** state. |
| 397 |
| 398 |
| 399 ### **Comment** state ### |
| 400 |
| 401 If the current character is... |
| 402 |
| 403 * '```-```': Consume the character and switch to the **comment end 1** |
| 404 state. |
| 405 |
| 406 * Anything else: Consume the character and stay in this |
| 407 state. |
| 408 |
| 409 |
| 410 ### **Comment end 1** state ### |
| 411 |
| 412 If the current character is... |
| 413 |
| 414 * '```-```': Consume the character, switch to the **comment end 2** |
| 415 state. |
| 416 |
| 417 * Anything else: Consume the character, and switch to the **comment** |
| 418 state. |
| 419 |
| 420 |
| 421 ### **Comment end 2** state ### |
| 422 |
| 423 If the current character is... |
| 424 |
| 425 * '```>```': Consume the character and switch to the **data** state. |
| 426 |
| 427 * '```-```': Consume the character, but stay in this state. |
| 428 |
| 429 * Anything else: Consume the character, and switch to the **comment** |
| 430 state. |
| 431 |
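Taken together, the **comment**, **comment end 1**, and **comment end 2** states skip everything up to the first `-->`. A Python sketch of that loop, assuming a hypothetical `skip_comment` function that starts just past the opening `<!--` and returns the index at which the **data** state would resume:

```python
def skip_comment(text, pos):
    """Return the index just past the closing '-->' of a comment.

    Hypothetical helper: pos must point just past '<!--'. If the comment
    is never closed, len(text) is returned.
    """
    state = "comment"
    while pos < len(text):
        ch = text[pos]
        pos += 1  # every branch in these three states consumes the character
        if state == "comment":
            if ch == "-":
                state = "comment end 1"
        elif state == "comment end 1":
            state = "comment end 2" if ch == "-" else "comment"
        else:  # comment end 2
            if ch == ">":
                return pos
            if ch != "-":
                state = "comment"
            # A further '-' stays in comment end 2.
    return pos
```

Because an extra `-` keeps the tokenizer in **comment end 2**, a run such as `--->` still closes the comment.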
| 432 |
| 433 ### **Character reference** state ### |
| 434 |
| 435 Let _raw value_ be the string '```&```'. |
| 436 |
| 437 Append the current character to _raw value_. |
| 438 |
| 439 If the current character is... |
| 440 |
| 441 * '```#```': Consume the character, and switch to the **numeric |
| 442 character reference** state. |
| 443 |
| 444 * '```l```': Consume the character and switch to the **named character |
| 445 reference L** state. |
| 446 |
| 447 * '```a```': Consume the character and switch to the **named character |
| 448 reference A** state. |
| 449 |
| 450 * '```g```': Consume the character and switch to the **named character |
| 451 reference G** state. |
| 452 |
| 453 * '```q```': Consume the character and switch to the **named character |
| 454 reference Q** state. |
| 455 |
| 456 * Any other character in the range '```0```'..'```9```', |
| 457 '```a```'..'```f```', '```A```'..'```F```': Consume the character |
| 458 and switch to the **bad named character reference** state. |
| 459 |
| 460 * Anything else: Run the _emitting operation_ for all but the last |
| 461 character in _raw value_, and switch to the **data** state without |
| 462 consuming the current character. |
| 463 |
| 464 |
| 465 ### **Numeric character reference** state ### |
| 466 |
| 467 Append the current character to _raw value_. |
| 468 |
| 469 If the current character is... |
| 470 |
| 471 * '```x```', '```X```': Let _value_ be zero, consume the character, |
| 472 and switch to the **hexadecimal numeric character reference** state. |
| 473 |
| 474 * '```0```'..'```9```': Let _value_ be the numeric value of the |
| 475 current character interpreted as a decimal digit, consume the |
| 476 character, and switch to the **decimal numeric character reference** |
| 477 state. |
| 478 |
| 479 * Anything else: Run the _emitting operation_ for all but the last |
| 480 character in _raw value_, and switch to the **data** state without |
| 481 consuming the current character. |
| 482 |
| 483 |
| 484 ### **Hexadecimal numeric character reference** state ### |
| 485 |
| 486 Append the current character to _raw value_. |
| 487 |
| 488 If the current character is... |
| 489 |
| 490 * '```0```'..'```9```', '```a```'..'```f```', '```A```'..'```F```': |
| 491 Let _value_ be sixteen times _value_ plus the numeric value of the |
| 492 current character interpreted as a hexadecimal digit. Consume the character and stay in this state. |
| 493 |
| 494 * '```;```': Consume the character. If _value_ is between 0x0001 and |
| 495 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, |
| 496 run the _emitting operation_ with a Unicode character having the |
| 497 scalar value _value_; otherwise, run the _emitting operation_ with |
| 498 the character U+FFFD. Then, in either case, switch to the _return |
| 499 state_. |
| 500 |
| 501 * Anything else: Run the _emitting operation_ for all but the last |
| 502 character in _raw value_, and switch to the **data** state without |
| 503 consuming the current character. |
| 504 |
| 505 |
| 506 ### **Decimal numeric character reference** state ### |
| 507 |
| 508 Append the current character to _raw value_. |
| 509 |
| 510 If the current character is... |
| 511 |
| 512 * '```0```'..'```9```': Let _value_ be ten times _value_ plus the |
| 513 numeric value of the current character interpreted as a decimal |
| 514 digit. Consume the character and stay in this state. |
| 515 |
| 516 * '```;```': Consume the character. If _value_ is between 0x0001 and |
| 517 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive, |
| 518 run the _emitting operation_ with a Unicode character having the |
| 519 scalar value _value_; otherwise, run the _emitting operation_ with |
| 520 the character U+FFFD. Then, in either case, switch to the _return |
| 521 state_. |
| 522 |
| 523 * Anything else: Run the _emitting operation_ for all but the last |
| 524 character in _raw value_, and switch to the **data** state without |
| 525 consuming the current character. |
| 526 |
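The arithmetic in the two numeric character reference states can be checked with a short Python sketch. The helper names `accumulate` and `numeric_reference_char` are assumptions for illustration; the range test mirrors the '```;```' branch (0x0001 to 0x10FFFF inclusive, excluding the surrogate range 0xD800 to 0xDFFF).

```python
def accumulate(digits, base):
    """Fold digits into a value the way the digit branches do.

    Hypothetical helper: base is 10 or 16, digits is a string of valid digits.
    """
    value = 0
    for d in digits:
        value = base * value + int(d, base)
    return value


def numeric_reference_char(value):
    """Return the character to pass to the emitting operation on ';'."""
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    # Zero, surrogates, and out-of-range values become U+FFFD.
    return "\ufffd"
```

For example, `&#x41;` and `&#65;` both accumulate to 65 and yield `A`, while `&#0;` and `&#xD800;` yield U+FFFD.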
| 527 |
| 528 ### **Named character reference L** state ### |
| 529 |
| 530 Append the current character to _raw value_. |
| 531 |
| 532 If the current character is... |
| 533 |
| 534 * '```t```': Let _character_ be '```<```', consume the current |
| 535 character, and switch to the **after named character reference** |
| 536 state. |
| 537 |
| 538 * Anything else: Switch to the **bad named character reference** state |
| 539 without consuming the character. |
| 540 |
| 541 |
| 542 ### **Named character reference A** state ### |
| 543 |
| 544 Append the current character to _raw value_. |
| 545 |
| 546 If the current character is... |
| 547 |
| 548 * '```p```': Consume the current character and switch to the **named |
| 549 character reference AP** state. |
| 550 |
| 551 * '```m```': Consume the current character and switch to the **named |
| 552 character reference AM** state. |
| 553 |
| 554 * Anything else: Switch to the **bad named character reference** state |
| 555 without consuming the character. |
| 556 |
| 557 |
| 558 ### **Named character reference AM** state ### |
| 559 |
| 560 Append the current character to _raw value_. |
| 561 |
| 562 If the current character is... |
| 563 |
| 564 * '```p```': Let _character_ be '```&```', consume the current |
| 565 character, and switch to the **after named character reference** |
| 566 state. |
| 567 |
| 568 * Anything else: Switch to the **bad named character reference** state |
| 569 without consuming the character. |
| 570 |
| 571 |
| 572 ### **Named character reference AP** state ### |
| 573 |
| 574 Append the current character to _raw value_. |
| 575 |
| 576 If the current character is... |
| 577 |
| 578 * '```o```': Consume the current character and switch to the **named |
| 579 character reference APO** state. |
| 580 |
| 581 * Anything else: Switch to the **bad named character reference** state |
| 582 without consuming the character. |
| 583 |
| 584 |
| 585 ### **Named character reference APO** state ### |
| 586 |
| 587 Append the current character to _raw value_. |
| 588 |
| 589 If the current character is... |
| 590 |
| 591 * '```s```': Let _character_ be '```'```', consume the current |
| 592 character, and switch to the **after named character reference** |
| 593 state. |
| 594 |
| 595 * Anything else: Switch to the **bad named character reference** state |
| 596 without consuming the character. |
| 597 |
| 598 |
| 599 ### **Named character reference G** state ### |
| 600 |
| 601 Append the current character to _raw value_. |
| 602 |
| 603 If the current character is... |
| 604 |
| 605 * '```t```': Let _character_ be '```>```', consume the current |
| 606 character, and switch to the **after named character reference** |
| 607 state. |
| 608 |
| 609 * Anything else: Switch to the **bad named character reference** state |
| 610 without consuming the character. |
| 611 |
| 612 |
| 613 ### **Named character reference Q** state ### |
| 614 |
| 615 Append the current character to _raw value_. |
| 616 |
| 617 If the current character is... |
| 618 |
| 619 * '```u```': Consume the current character and switch to the **named |
| 620 character reference QU** state. |
| 621 |
| 622 * Anything else: Switch to the **bad named character reference** state |
| 623 without consuming the character. |
| 624 |
| 625 |
| 626 ### **Named character reference QU** state ### |
| 627 |
| 628 Append the current character to _raw value_. |
| 629 |
| 630 If the current character is... |
| 631 |
| 632 * '```o```': Consume the current character and switch to the **named |
| 633 character reference QUO** state. |
| 634 |
| 635 * Anything else: Switch to the **bad named character reference** state |
| 636 without consuming the character. |
| 637 |
| 638 |
| 639 ### **Named character reference QUO** state ### |
| 640 |
| 641 Append the current character to _raw value_. |
| 642 |
| 643 If the current character is... |
| 644 |
| 645 * '```t```': Let _character_ be '```"```', consume the current |
| 646 character, and switch to the **after named character reference** |
| 647 state. |
| 648 |
| 649 * Anything else: Switch to the **bad named character reference** state |
| 650 without consuming the character. |
| 651 |
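The L, A, AM, AP, APO, G, Q, QU, and QUO states together recognise exactly five names: `lt`, `amp`, `apos`, `gt`, and `quot`. One way to sanity-check the transitions is a prefix walk over a lookup table, sketched below in Python. `NAMED_REFERENCES` and `match_named_reference` are hypothetical names, and the handling of '```;```' and the _extra terminating character_ is left to the **after named character reference** state.

```python
# The five names spelled out by the per-letter states, with their characters.
NAMED_REFERENCES = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def match_named_reference(name):
    """Return the character for a complete name, or None.

    Hypothetical helper: walks name one letter at a time, as the per-letter
    states do, and gives up as soon as no known name has that prefix (the
    point where the spec switches to the bad named character reference state).
    """
    prefix = ""
    for ch in name:
        prefix += ch
        if not any(known.startswith(prefix) for known in NAMED_REFERENCES):
            return None
    # An incomplete prefix such as 'am' is not a match either.
    return NAMED_REFERENCES.get(prefix)
```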
| 652 |
| 653 ### **After named character reference** state ### |
| 654 |
| 655 Append the current character to _raw value_. |
| 656 |
| 657 If the current character is... |
| 658 |
| 659 * '```;```': Consume the character. Run the _emitting operation_ with |
| 660 the character _character_. Switch to the _return state_. |
| 661 |
| 662 * The _extra terminating character_: Run the _emitting operation_ with |
| 663 the character U+FFFD. Switch to the _return state_ without consuming |
| 664 the current character. |
| 665 |
| 666 * Anything else: Switch to the **bad named character reference** state |
| 667 without consuming the current character. |
| 668 |
| 669 |
| 670 ### **Bad named character reference** state ### |
| 671 |
| 672 Append the current character to _raw value_. |
| 673 |
| 674 If the current character is... |
| 675 |
| 676 * '```;```': Consume the character. Run the _emitting operation_ with |
| 677 the character U+FFFD. Switch to the _return state_. |
| 678 |
| 679 * The _extra terminating character_: Switch to the _return state_ |
| 680 without consuming the current character. |
| 681 |
| 682 * Any other character in the range '```0```'..'```9```', |
| 683 '```a```'..'```f```', '```A```'..'```F```': Consume the character |
| 684 and stay in this state. |
| 685 |
| 686 * Anything else: Run the _emitting operation_ for all but the last |
| 687 character in _raw value_, and switch to the **data** state without |
| 688 consuming the current character. |
| 689 |
| 150 | 690 |
| 151 Tree construction | 691 Tree construction |
| 152 ----------------- | 692 ----------------- |
| 153 | 693 |
| 154 To construct a node tree from a _sequence of tokens_ and a document _document_: | 694 To construct a node tree from a _sequence of tokens_ and a document _document_: |
| 155 | 695 |
| 156 1. Initialize the _stack of open nodes_ to be _document_. | 696 1. Initialize the _stack of open nodes_ to be _document_. |
| 157 2. Consider each token _token_ in the _sequence of tokens_ in turn. | 697 2. Consider each token _token_ in the _sequence of tokens_ in turn. |
| 158 - If _token_ is a text token, | 698 - If _token_ is a text token, |
| 159 1. Create a text node _node_ with character data _token.data_. | 699 1. Create a text node _node_ with character data _token.data_. |
| (...skipping 11 matching lines...) |
| 171 - Pop nodes from the _stack of open nodes_ until a node with | 711 - Pop nodes from the _stack of open nodes_ until a node with |
| 172 a _tagName_ equal to _token.tagName_ has been popped. | 712 a _tagName_ equal to _token.tagName_ has been popped. |
| 173 2. Otherwise, ignore _token_. | 713 2. Otherwise, ignore _token_. |
| 174 - If _token_ is a comment token, | 714 - If _token_ is a comment token, |
| 175 1. Ignore _token_. | 715 1. Ignore _token_. |
| 176 - If _token_ is an EOF token, | 716 - If _token_ is an EOF token, |
| 177 1. Pop all the nodes from the _stack of open nodes_. | 717 1. Pop all the nodes from the _stack of open nodes_. |
| 178 2. Signal _document_ that parsing is complete. | 718 2. Signal _document_ that parsing is complete. |
| 179 | 719 |
| 180 TODO(ianh): <template>, <t> | 720 TODO(ianh): <template>, <t> |