Side by Side Diff: sky/specs/parsing.md

Issue 657393004: Parser tokeniser states (WIP, doesn't yet do script/style) (Closed) Base URL: https://github.com/domokit/mojo.git@master
Patch Set: Parser spec updates Created 6 years, 1 month ago
OLD | NEW
1 Parsing 1 Parsing
2 ======= 2 =======
3 3
4 Parsing in Sky is a strict pipeline consisting of four stages: 4 Parsing in Sky is a strict pipeline consisting of four stages:
5 5
6 - decoding, which converts incoming bytes into Unicode characters 6 - decoding, which converts incoming bytes into Unicode characters
7 using UTF-8 7 using UTF-8
8 8
9 - normalising, which converts certain sequences of characters 9 - normalising, which converts certain sequences of characters
10 10
(...skipping 113 matching lines...)
124 124
125 125
126 #### **Consume rest of line** state #### 126 #### **Consume rest of line** state ####
127 127
128 If the current character is... 128 If the current character is...
129 129
130 * U+000A: Consume the character and switch to the **data** state. 130 * U+000A: Consume the character and switch to the **data** state.
131 * Anything else: Consume the character and stay in this state. 131 * Anything else: Consume the character and stay in this state.
132 132
133 133
134 ### Data state ### 134 ### **Data** state ###
135 135
136 If the current character is... 136 If the current character is...
137
138 * '```<```': Consume the character and switch to the **tag open** state.
137 139
138 * '```&```': Consume the character and switch to the **character 140 * '```&```': Consume the character and switch to the **character
139 reference** state. 141 reference** state, with the _return state_ set to the **data**
140 142 state, the _extra terminating character_ unset (or set to U+0000,
141 * '```<```': Consume the character and switch to the **tag open** state. 143 which has the same effect), and the _emitting operation_ being to
144 emit a character token for the given character.
142 145
143 * Anything else: Emit the current input character as a character 146 * Anything else: Emit the current input character as a character
144 token. Consume the character. Stay in this state. 147 token. Consume the character. Stay in this state.
145 148
146 149
147 TODO(ianh): Add the remaining tokenizer states. 150 ### **Script raw data** state ###
148 151
149 TODO(ianh): &lt;script>, &lt;style> 152 TODO(ianh): spec this
153
154
155 ### **Style raw data** state ###
156
157 TODO(ianh): spec this
158
159
160 ### **After tag** state ###
161
162 Emit the tag token.
163
164 If the tag token was a start tag token and the tag name was
165 '```script```', then switch to the **script raw data** state.
166
167 If the tag token was a start tag token and the tag name was
168 '```style```', then switch to the **style raw data** state.
169
170 Otherwise, switch to the **data** state.
171
172
173 ### **Tag open** state ###
174
175 If the current character is...
176
177 * '```!```': Consume the character and switch to the **comment start
178 1** state.
179
180 * '```/```': Consume the character and switch to the **close tag**
181 state.
182
183 * '```>```': Emit character tokens for '```<>```'. Consume the current
184 character. Switch to the **data** state.
185
186 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```',
187 '```-```', '```_```', '```.```': Create a start tag token, let its
188 tag name be the current character, consume the current character and
189 switch to the **tag name** state.
190
191 * Anything else: Emit the character token for '```<```'. Switch to the
192 **data** state without consuming the current character.
193
194
195 ### **Close tag** state ###
196
197 If the current character is...
198
199 * '```>```': Emit character tokens for '```</>```'. Consume the current
200 character. Switch to the **data** state.
201
202 * '```0```'..'```9```', '```a```'..'```z```', '```A```'..'```Z```',
203 '```-```', '```_```', '```.```': Create an end tag token, let its
204 tag name be the current character, consume the current character and
205 switch to the **tag name** state.
206
207 * Anything else: Emit the character tokens for '```</```'. Switch to
208 the **data** state without consuming the current character.
209
210
211 ### **Tag name** state ###
212
213 If the current character is...
214
215 * U+0020, U+000A: Consume the current character. Switch to the
216 **before attribute name** state.
217
218 * '```/```': Consume the current character. Switch to the **void tag**
219 state.
220
221 * '```>```': Consume the current character. Switch to the **after
222 tag** state.
223
224 * Anything else: Append the current character to the tag name, and
225 consume the current character. Stay in this state.
226
227
228 ### **Void tag** state ###
229
230 If the current character is...
231
232 * '```>```': Consume the current character. Switch to the **after
233 tag** state.
234
235 * Anything else: Switch to the **before attribute name** state without
236 consuming the current character.
237
238
239 ### **Before attribute name** state ###
240
241 If the current character is...
242
243 * U+0020, U+000A: Consume the current character. Stay in this state.
244
245 * '```/```': Consume the current character. Switch to the **void tag**
246 state.
247
248 * '```>```': Consume the current character. Switch to the **after
249 tag** state.
250
251 * Anything else: Create a new attribute in the tag token, and set its
252 name to the current character. Consume the current character. Switch
253 to the **attribute name** state.
254
255
256 ### **Attribute name** state ###
257
258 If the current character is...
259
260 * U+0020, U+000A: Consume the current character. Switch to the **after
261 attribute name** state.
262
263 * '```/```': Consume the current character. Switch to the **void tag**
264 state.
265
266 * '```=```': Consume the current character. Switch to the **before
267 attribute value** state.
268
269 * '```>```': Consume the current character. Switch to the **after
270 tag** state.
271
272 * Anything else: Append the current character to the most recently
273 added attribute's name, and consume the current character. Stay in
274 this state.
275
276
277 ### **After attribute name** state ###
278
279 If the current character is...
280
281 * U+0020, U+000A: Consume the current character. Stay in this state.
282
283 * '```/```': Consume the current character. Switch to the **void tag**
284 state.
285
286 * '```=```': Consume the current character. Switch to the **before
287 attribute value** state.
288
289 * '```>```': Consume the current character. Switch to the **after
290 tag** state.
291
292 * Anything else: Create a new attribute in the tag token, and set its
293 name to the current character. Consume the current character. Switch
294 to the **attribute name** state.
295
296
297 ### **Before attribute value** state ###
298
299 If the current character is...
300
301 * U+0020, U+000A: Consume the current character. Stay in this state.
302
303 * '```>```': Consume the current character. Switch to the **after
304 tag** state.
305
306 * '```'```': Consume the current character. Switch to the
307 **single-quoted attribute value** state.
308
309 * '```"```': Consume the current character. Switch to the
310 **double-quoted attribute value** state.
311
312 * Anything else: Set the value of the most recently added attribute to
313 the current character. Consume the current character. Switch to the
314 **unquoted attribute value** state.
315
316
317 ### **Single-quoted attribute value** state ###
318
319 If the current character is...
320
321 * '```'```': Consume the current character. Switch to the
322 **before attribute name** state.
323
324 * '```&```': Consume the character and switch to the **character
325 reference** state, with the _return state_ set to the
326 **single-quoted attribute value** state, the _extra terminating
327 character_ set to '```'```', and the _emitting operation_ being to
328 append the given character to the value of the most recently added
329 attribute.
330
331 * Anything else: Append the current character to the value of the most
332 recently added attribute. Consume the current character. Stay in
333 this state.
334
335
336 ### **Double-quoted attribute value** state ###
337
338 If the current character is...
339
340 * '```"```': Consume the current character. Switch to the
341 **before attribute name** state.
342
343 * '```&```': Consume the character and switch to the **character
344 reference** state, with the _return state_ set to the
345 **double-quoted attribute value** state, the _extra terminating
346 character_ set to '```"```', and the _emitting operation_ being to
347 append the given character to the value of the most recently added
348 attribute.
349
350 * Anything else: Append the current character to the value of the most
351 recently added attribute. Consume the current character. Stay in
352 this state.
353
354
355 ### **Unquoted attribute value** state ###
356
357 If the current character is...
358
359 * U+0020, U+000A: Consume the current character. Switch to the
360 **before attribute name** state.
361
362 * '```>```': Consume the current character. Switch to the **after
363 tag** state.
364
365 * '```&```': Consume the character and switch to the **character
366 reference** state, with the _return state_ set to the **unquoted
367 attribute value** state, the _extra terminating character_ unset (or
368 set to U+0000, which has the same effect), and the _emitting
369 operation_ being to append the given character to the value of the
370 most recently added attribute.
371
372 * Anything else: Append the current character to the value of the most
373 recently added attribute. Consume the current character. Stay in
374 this state.
375
376
377 ### **Comment start 1** state ###
378
379 If the current character is...
380
381 * '```-```': Consume the character and switch to the **comment start
382 2** state.
383
384 * '```>```': Emit character tokens for '```<!>```'. Consume the
385 current character. Switch to the **data** state.
386
387
388 ### **Comment start 2** state ###
389
390 If the current character is...
391
392 * '```-```': Consume the character and switch to the **comment**
393 state.
394
395 * '```>```': Emit character tokens for '```<!->```'. Consume the
396 current character. Switch to the **data** state.
397
398
399 ### **Comment** state ###
400
401 If the current character is...
402
403 * '```-```': Consume the character and switch to the **comment end 1**
404 state.
405
406 * Anything else: Consume the character and stay in this
407 state.
408
409
410 ### **Comment end 1** state ###
411
412 If the current character is...
413
414 * '```-```': Consume the character and switch to the **comment end 2**
415 state.
416
417 * Anything else: Consume the character, and switch to the **comment**
418 state.
419
420
421 ### **Comment end 2** state ###
422
423 If the current character is...
424
425 * '```>```': Consume the character and switch to the **data** state.
426
427 * '```-```': Consume the character, but stay in this state.
428
429 * Anything else: Consume the character, and switch to the **comment**
430 state.
431
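For illustration only (this is not part of the patch): the **comment**, **comment end 1** and **comment end 2** states above can be read as a small loop over the input. The sketch below is a minimal Python rendering under stated assumptions. The function name `skip_comment` is hypothetical, the input position is assumed to be just after '```<!--```', and running out of input simply ends the loop, because this excerpt does not say how end of input is handled.

```python
def skip_comment(text: str, pos: int) -> int:
    """Return the index just past the closing '-->' of a comment.

    `pos` is assumed to point just after '<!--'. If the comment is never
    closed, len(text) is returned (an assumption; the excerpt does not
    specify end-of-input behaviour).
    """
    state = "comment"
    while pos < len(text):
        ch = text[pos]
        pos += 1  # every branch of these three states consumes the character
        if state == "comment":
            if ch == "-":
                state = "comment end 1"
        elif state == "comment end 1":
            state = "comment end 2" if ch == "-" else "comment"
        else:  # "comment end 2"
            if ch == ">":
                return pos  # the spec switches back to the data state here
            if ch != "-":
                state = "comment"
            # a further '-' stays in "comment end 2"
    return pos
```

With this reading, a run of three or more hyphens before '```>```' still closes the comment, because **comment end 2** stays put on each extra '```-```'.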
432
433 ### **Character reference** state ###
434
435 Let _raw value_ be the string '```&```'.
436
437 Append the current character to _raw value_.
438
439 If the current character is...
440
441 * '```#```': Consume the character, and switch to the **numeric
442 character reference** state.
443
444 * '```l```': Consume the character and switch to the **named character
445 reference L** state.
446
447 * '```a```': Consume the character and switch to the **named character
448 reference A** state.
449
450 * '```g```': Consume the character and switch to the **named character
451 reference G** state.
452
453 * '```q```': Consume the character and switch to the **named character
454 reference Q** state.
455
456 * Any other character in the range '```0```'..'```9```',
457 '```a```'..'```f```', '```A```'..'```F```': Consume the character
458 and switch to the **bad named character reference** state.
459
460 * Anything else: Run the _emitting operation_ for all but the last
461 character in _raw value_, and switch to the **data** state without
462 consuming the current character.
463
464
465 ### **Numeric character reference** state ###
466
467 Append the current character to _raw value_.
468
469 If the current character is...
470
471 * '```x```', '```X```': Let _value_ be zero, consume the character,
472 and switch to the **hexadecimal numeric character reference** state.
473
474 * '```0```'..'```9```': Let _value_ be the numeric value of the
475 current character interpreted as a decimal digit, consume the
476 character, and switch to the **decimal numeric character reference**
477 state.
478
479 * Anything else: Run the _emitting operation_ for all but the last
480 character in _raw value_, and switch to the **data** state without
481 consuming the current character.
482
483
484 ### **Hexadecimal numeric character reference** state ###
485
486 Append the current character to _raw value_.
487
488 If the current character is...
489
490 * '```0```'..'```9```', '```a```'..'```f```', '```A```'..'```F```':
491 Let _value_ be sixteen times _value_ plus the numeric value of the
492 current character interpreted as a hexadecimal digit. Consume the character and stay in this state.
493
494 * '```;```': Consume the character. If _value_ is between 0x0001 and
495 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive,
496 run the _emitting operation_ with a Unicode character having the
497 scalar value _value_; otherwise, run the _emitting operation_ with
498 the character U+FFFD. Then, in either case, switch to the _return
499 state_.
500
501 * Anything else: Run the _emitting operation_ for all but the last
502 character in _raw value_, and switch to the **data** state without
503 consuming the current character.
504
505
506 ### **Decimal numeric character reference** state ###
507
508 Append the current character to _raw value_.
509
510 If the current character is...
511
512 * '```0```'..'```9```': Let _value_ be ten times _value_ plus the
513 numeric value of the current character interpreted as a decimal
514 digit. Consume the character and stay in this state.
515
516 * '```;```': Consume the character. If _value_ is between 0x0001 and
517 0x10FFFF inclusive, but is not between 0xD800 and 0xDFFF inclusive,
518 run the _emitting operation_ with a Unicode character having the
519 scalar value _value_; otherwise, run the _emitting operation_ with
520 the character U+FFFD. Then, in either case, switch to the _return
521 state_.
522
523 * Anything else: Run the _emitting operation_ for all but the last
524 character in _raw value_, and switch to the **data** state without
525 consuming the current character.
526
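For illustration only (this is not part of the patch): the hexadecimal and decimal states accumulate _value_ digit by digit, and the '```;```' branch then either emits the character with that scalar value or falls back to U+FFFD. The sketch below shows that arithmetic in Python. The function name `resolve_numeric_reference` and its arguments are hypothetical, and it assumes the digits have already been collected by the states above.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def resolve_numeric_reference(digits: str, base: int) -> str:
    """Map the digits of a numeric character reference to one character.

    `base` is 16 for references such as '&#x41;' and 10 for '&#65;'.
    """
    value = 0
    for ch in digits:
        # "Let value be <base> times value plus the numeric value of the digit"
        value = value * base + int(ch, base)
    in_range = 0x0001 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if in_range and not is_surrogate:
        return chr(value)
    return REPLACEMENT_CHARACTER
```

For example, both `resolve_numeric_reference("41", 16)` and `resolve_numeric_reference("65", 10)` yield 'A', while zero, surrogates and values above 0x10FFFF yield U+FFFD.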
527
528 ### **Named character reference L** state ###
529
530 Append the current character to _raw value_.
531
532 If the current character is...
533
534 * '```t```': Let _character_ be '```<```', consume the current
535 character, and switch to the **after named character reference**
536 state.
537
538 * Anything else: Switch to the **bad named character reference** state
539 without consuming the character.
540
541
542 ### **Named character reference A** state ###
543
544 Append the current character to _raw value_.
545
546 If the current character is...
547
548 * '```p```': Consume the current character and switch to the **named
549 character reference AP** state.
550
551 * '```m```': Consume the current character and switch to the **named
552 character reference AM** state.
553
554 * Anything else: Switch to the **bad named character reference** state
555 without consuming the character.
556
557
558 ### **Named character reference AM** state ###
559
560 Append the current character to _raw value_.
561
562 If the current character is...
563
564 * '```p```': Let _character_ be '```&```', consume the current
565 character, and switch to the **after named character reference**
566 state.
567
568 * Anything else: Switch to the **bad named character reference** state
569 without consuming the character.
570
571
572 ### **Named character reference AP** state ###
573
574 Append the current character to _raw value_.
575
576 If the current character is...
577
578 * '```o```': Consume the current character and switch to the **named
579 character reference APO** state.
580
581 * Anything else: Switch to the **bad named character reference** state
582 without consuming the character.
583
584
585 ### **Named character reference APO** state ###
586
587 Append the current character to _raw value_.
588
589 If the current character is...
590
591 * '```s```': Let _character_ be '```'```', consume the current
592 character, and switch to the **after named character reference**
593 state.
594
595 * Anything else: Switch to the **bad named character reference** state
596 without consuming the character.
597
598
599 ### **Named character reference G** state ###
600
601 Append the current character to _raw value_.
602
603 If the current character is...
604
605 * '```t```': Let _character_ be '```>```', consume the current
606 character, and switch to the **after named character reference**
607 state.
608
609 * Anything else: Switch to the **bad named character reference** state
610 without consuming the character.
611
612
613 ### **Named character reference Q** state ###
614
615 Append the current character to _raw value_.
616
617 If the current character is...
618
619 * '```u```': Consume the current character and switch to the **named
620 character reference QU** state.
621
622 * Anything else: Switch to the **bad named character reference** state
623 without consuming the character.
624
625
626 ### **Named character reference QU** state ###
627
628 Append the current character to _raw value_.
629
630 If the current character is...
631
632 * '```o```': Consume the current character and switch to the **named
633 character reference QUO** state.
634
635 * Anything else: Switch to the **bad named character reference** state
636 without consuming the character.
637
638
639 ### **Named character reference QUO** state ###
640
641 Append the current character to _raw value_.
642
643 If the current character is...
644
645 * '```t```': Let _character_ be '```"```', consume the current
646 character, and switch to the **after named character reference**
647 state.
648
649 * Anything else: Switch to the **bad named character reference** state
650 without consuming the character.
651
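For illustration only (this is not part of the patch): taken together, the **named character reference** L, A, AM, AP, APO, G, Q, QU and QUO states recognise exactly five names, one character at a time. The sketch below expresses the same walk as a prefix match over a table. The names `NAMED_REFERENCES` and `walk_named_reference` are hypothetical, and the handling of the trailing '```;```' is left to the **after named character reference** state, as in the text.

```python
# The five names spelled out by the per-letter states, with their characters.
NAMED_REFERENCES = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def walk_named_reference(text: str, pos: int):
    """Walk a reference name starting at text[pos] (just after '&').

    Returns (character, next_pos) once a full name has been consumed, or
    (None, pos_of_first_unexpected_character) when the walk fails, which
    is where the spec hands over to the bad named character reference state
    without consuming the character.
    """
    matched = ""
    while pos < len(text):
        candidate = matched + text[pos]
        if not any(name.startswith(candidate) for name in NAMED_REFERENCES):
            return None, pos  # mismatch: do not consume this character
        matched = candidate
        pos += 1
        if matched in NAMED_REFERENCES:
            return NAMED_REFERENCES[matched], pos
    return None, pos  # input ended mid-name (not covered by the excerpt)
```

Because no name in the table is a prefix of another, returning on the first full match is equivalent to the per-letter states; a larger table would need a longest-match rule instead.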
652
653 ### **After named character reference** state ###
654
655 Append the current character to _raw value_.
656
657 If the current character is...
658
659 * '```;```': Consume the character. Run the _emitting operation_ with
660 the character _character_. Switch to the _return state_.
661
662 * The _extra terminating character_: Run the _emitting operation_ with
663 the character U+FFFD. Switch to the _return state_ without consuming
664 the current character.
665
666 * Anything else: Switch to the **bad named character reference** state
667 without consuming the current character.
668
669
670 ### **Bad named character reference** state ###
671
672 Append the current character to _raw value_.
673
674 If the current character is...
675
676 * '```;```': Consume the character. Run the _emitting operation_ with
677 the character U+FFFD. Switch to the _return state_.
678
679 * The _extra terminating character_: Switch to the _return state_
680 without consuming the current character.
681
682 * Any other character in the range '```0```'..'```9```',
683 '```a```'..'```f```', '```A```'..'```F```': Consume the character
684 and stay in this state.
685
686 * Anything else: Run the _emitting operation_ for all but the last
687 character in _raw value_, and switch to the **data** state without
688 consuming the current character.
689
150 690
151 Tree construction 691 Tree construction
152 ----------------- 692 -----------------
153 693
154 To construct a node tree from a _sequence of tokens_ and a document _document_: 694 To construct a node tree from a _sequence of tokens_ and a document _document_:
155 695
156 1. Initialize the _stack of open nodes_ to be _document_. 696 1. Initialize the _stack of open nodes_ to be _document_.
157 2. Consider each token _token_ in the _sequence of tokens_ in turn. 697 2. Consider each token _token_ in the _sequence of tokens_ in turn.
158 - If _token_ is a text token, 698 - If _token_ is a text token,
159 1. Create a text node _node_ with character data _token.data_. 699 1. Create a text node _node_ with character data _token.data_.
(...skipping 11 matching lines...)
171 - Pop nodes from the _stack of open nodes_ until a node with 711 - Pop nodes from the _stack of open nodes_ until a node with
172 a _tagName_ equal to _token.tagName_ has been popped. 712 a _tagName_ equal to _token.tagName_ has been popped.
173 2. Otherwise, ignore _token_. 713 2. Otherwise, ignore _token_.
174 - If _token_ is a comment token, 714 - If _token_ is a comment token,
175 1. Ignore _token_. 715 1. Ignore _token_.
176 - If _token_ is an EOF token, 716 - If _token_ is an EOF token,
177 1. Pop all the nodes from the _stack of open nodes_. 717 1. Pop all the nodes from the _stack of open nodes_.
178 2. Signal _document_ that parsing is complete. 718 2. Signal _document_ that parsing is complete.
179 719
180 TODO(ianh): &lt;template>, &lt;t> 720 TODO(ianh): &lt;template>, &lt;t>