Chromium Code Reviews

Side by Side Diff: third_party/google-endpoints/future/backports/email/_header_value_parser.py

Issue 2666783008: Add google-endpoints to third_party/. (Closed)
Patch Set: Created 3 years, 10 months ago
1 """Header value parser implementing various email-related RFC parsing rules.
2
3 The parsing methods defined in this module implement various email-related
4 parsing rules. Principal among them is RFC 5322, which is the follow-on
5 to RFC 2822 and primarily a clarification of the former. It also implements
6 RFC 2047 encoded word decoding.
7
8 RFC 5322 goes to considerable trouble to maintain backward compatibility with
9 RFC 822 in the parse phase, while cleaning up the structure on the generation
10 phase. This parser supports correct RFC 5322 generation by tagging white space
11 as folding white space only when folding is allowed in the non-obsolete rule
12 sets. Actually, the parser is even more generous when accepting input than RFC
13 5322 mandates, following the spirit of Postel's Law, which RFC 5322 encourages.
14 Where possible, deviations from the standard are annotated on the 'defects'
15 attribute of tokens that deviate.
16
17 The general structure of the parser follows RFC 5322, and uses its terminology
18 where there is a direct correspondence. Where the implementation requires a
19 somewhat different structure than that used by the formal grammar, new terms
20 that mimic the closest existing terms are used. Thus, it really helps to have
21 a copy of RFC 5322 handy when studying this code.
22
23 Input to the parser is a string that has already been unfolded according to
24 RFC 5322 rules. According to the RFC this unfolding is the very first step, and
25 this parser leaves the unfolding step to a higher level message parser, which
26 will have already detected the line breaks that need unfolding while
27 determining the beginning and end of each header.
28
29 The output of the parser is a TokenList object, which is a list subclass. A
30 TokenList is a recursive data structure. The terminal nodes of the structure
31 are Terminal objects, which are subclasses of str. These do not correspond
32 directly to terminal objects in the formal grammar, but are instead more
33 practical higher level combinations of true terminals.
34
35 All TokenList and Terminal objects have a 'value' attribute, which produces the
36 semantically meaningful value of that part of the parse subtree. The value of
37 all whitespace tokens (no matter how many sub-tokens they may contain) is a
38 single space, as per the RFC rules. This includes 'CFWS', which is herein
39 included in the general class of whitespace tokens. There is one exception to
40 the rule that whitespace tokens are collapsed into single spaces in values: in
41 the value of a 'bare-quoted-string' (a quoted-string with no leading or
42 trailing whitespace), any whitespace that appeared between the quotation marks
43 is preserved in the returned value. Note that in all Terminal strings quoted
44 pairs are turned into their unquoted values.
45
46 All TokenList and Terminal objects also have a string value, which attempts to
47 be a "canonical" representation of the RFC-compliant form of the substring that
48 produced the parsed subtree, including minimal use of quoted pair quoting.
49 Whitespace runs are not collapsed.
50
51 Comment tokens also have a 'content' attribute providing the string found
52 between the parens (including any nested comments) with whitespace preserved.
53
54 All TokenList and Terminal objects have a 'defects' attribute which is a
55 possibly empty list of all the defects found while creating the token. Defects
56 may appear on any token in the tree, and a composite list of all defects in the
57 subtree is available through the 'all_defects' attribute of any node. (For
58 Terminal nodes x.defects == x.all_defects.)
59
60 Each object in a parse tree is called a 'token', and each has a 'token_type'
61 attribute that gives the name from the RFC 5322 grammar that it represents.
62 Not all RFC 5322 nodes are produced, and there is one non-RFC 5322 node that
63 may be produced: 'ptext'. A 'ptext' is a string of printable ascii characters.
64 It is returned in place of lists of (ctext/quoted-pair) and
65 (qtext/quoted-pair).
66
67 XXX: provide complete list of token types.
68 """
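[Reviewer note: the tree structure the docstring describes can be seen directly. This backport tracks the CPython 3.3 stdlib module, so the stdlib counterpart (`email._header_value_parser` on Python 3) illustrates the same API; a small example, assuming that private stdlib module is importable:]

```python
# Illustration of the TokenList API via the stdlib counterpart of this
# backported module (same parser, private CPython module).
from email._header_value_parser import get_unstructured

tokens = get_unstructured('Hello,\t  world')
print(tokens.token_type)   # 'unstructured'
# Whitespace runs collapse to a single space in the semantic value...
print(tokens.value)        # 'Hello, world'
# ...while str() keeps the original run intact.
print(str(tokens))         # 'Hello,\t  world'
print(tokens.all_defects)  # [] for well-formed input
```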
69 from __future__ import print_function
70 from __future__ import unicode_literals
71 from __future__ import division
72 from __future__ import absolute_import
73 from future.builtins import int, range, str, super, list
74
75 import re
76 from collections import namedtuple, OrderedDict
77
78 from future.backports.urllib.parse import (unquote, unquote_to_bytes)
79 from future.backports.email import _encoded_words as _ew
80 from future.backports.email import errors
81 from future.backports.email import utils
82
83 #
84 # Useful constants and functions
85 #
86
87 WSP = set(' \t')
88 CFWS_LEADER = WSP | set('(')
89 SPECIALS = set(r'()<>@,:;.\"[]')
90 ATOM_ENDS = SPECIALS | WSP
91 DOT_ATOM_ENDS = ATOM_ENDS - set('.')
92 # '.', '"', and '(' do not end phrases in order to support obs-phrase
93 PHRASE_ENDS = SPECIALS - set('."(')
94 TSPECIALS = (SPECIALS | set('/?=')) - set('.')
95 TOKEN_ENDS = TSPECIALS | WSP
96 ASPECIALS = TSPECIALS | set("*'%")
97 ATTRIBUTE_ENDS = ASPECIALS | WSP
98 EXTENDED_ATTRIBUTE_ENDS = ATTRIBUTE_ENDS - set('%')
99
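[Reviewer note: these end-set constants drive the module's scanner functions (get_atext, get_attrtext, and friends), each of which consumes characters until it hits one in the relevant set. A standalone sketch; `run_until` is a hypothetical helper, not part of this module:]

```python
# Hypothetical helper mirroring how the get_* scanners consume a run of
# characters, stopping at the first character in an end-set.
SPECIALS = set(r'()<>@,:;.\"[]')      # same definitions as above
WSP = set(' \t')
ATOM_ENDS = SPECIALS | WSP
DOT_ATOM_ENDS = ATOM_ENDS - set('.')  # '.' is legal inside dot-atom-text

def run_until(value, ends):
    for i, ch in enumerate(value):
        if ch in ends:
            return value[:i], value[i:]
    return value, ''

# dot-atom-text keeps the dots; a plain atom stops at the first one.
assert run_until('user.name@example.com', DOT_ATOM_ENDS) == ('user.name', '@example.com')
assert run_until('user.name@example.com', ATOM_ENDS) == ('user', '.name@example.com')
```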
100 def quote_string(value):
101 return '"'+str(value).replace('\\', '\\\\').replace('"', r'\"')+'"'
102
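[Reviewer note: the escaping quote_string performs can be exercised standalone; the function body below is reproduced from the definition above:]

```python
def quote_string(value):
    # Backslash-escape existing backslashes and double quotes, then wrap
    # the whole value in double quotes (RFC 5322 quoted-string syntax).
    return '"' + str(value).replace('\\', '\\\\').replace('"', r'\"') + '"'

print(quote_string('simple'))      # "simple"
print(quote_string('say "hi"'))    # "say \"hi\""
```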
103 #
104 # Accumulator for header folding
105 #
106
107 class _Folded(object):
108
109 def __init__(self, maxlen, policy):
110 self.maxlen = maxlen
111 self.policy = policy
112 self.lastlen = 0
113 self.stickyspace = None
114 self.firstline = True
115 self.done = []
116 self.current = list() # uses l.clear()
117
118 def newline(self):
119 self.done.extend(self.current)
120 self.done.append(self.policy.linesep)
121 self.current.clear()
122 self.lastlen = 0
123
124 def finalize(self):
125 if self.current:
126 self.newline()
127
128 def __str__(self):
129 return ''.join(self.done)
130
131 def append(self, stoken):
132 self.current.append(stoken)
133
134 def append_if_fits(self, token, stoken=None):
135 if stoken is None:
136 stoken = str(token)
137 l = len(stoken)
138 if self.stickyspace is not None:
139 stickyspace_len = len(self.stickyspace)
140 if self.lastlen + stickyspace_len + l <= self.maxlen:
141 self.current.append(self.stickyspace)
142 self.lastlen += stickyspace_len
143 self.current.append(stoken)
144 self.lastlen += l
145 self.stickyspace = None
146 self.firstline = False
147 return True
148 if token.has_fws:
149 ws = token.pop_leading_fws()
150 if ws is not None:
151 self.stickyspace += str(ws)
152 stickyspace_len += len(ws)
153 token._fold(self)
154 return True
155 if stickyspace_len and l + 1 <= self.maxlen:
156 margin = self.maxlen - l
157 if 0 < margin < stickyspace_len:
158 trim = stickyspace_len - margin
159 self.current.append(self.stickyspace[:trim])
160 self.stickyspace = self.stickyspace[trim:]
161 stickyspace_len = trim
162 self.newline()
163 self.current.append(self.stickyspace)
164 self.current.append(stoken)
165 self.lastlen = l + stickyspace_len
166 self.stickyspace = None
167 self.firstline = False
168 return True
169 if not self.firstline:
170 self.newline()
171 self.current.append(self.stickyspace)
172 self.current.append(stoken)
173 self.stickyspace = None
174 self.firstline = False
175 return True
176 if self.lastlen + l <= self.maxlen:
177 self.current.append(stoken)
178 self.lastlen += l
179 return True
180 if l < self.maxlen:
181 self.newline()
182 self.current.append(stoken)
183 self.lastlen = l
184 return True
185 return False
186
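[Reviewer note: a sketch of the accumulator pattern _Folded implements. Completed lines plus separators collect in `done`, the line being built in `current`, and `lastlen` tracks the current line length so append_if_fits can decide whether a token still fits under maxlen. `FoldedSketch` is a hypothetical simplification that omits the stickyspace machinery:]

```python
class FoldedSketch:
    def __init__(self, maxlen, linesep='\n'):
        self.maxlen = maxlen
        self.linesep = linesep
        self.lastlen = 0
        self.done = []       # completed lines, with separators
        self.current = []    # string tokens on the line being built

    def newline(self):
        self.done.extend(self.current)
        self.done.append(self.linesep)
        del self.current[:]  # portable spelling of list.clear()
        self.lastlen = 0

    def append_if_fits(self, stoken):
        if self.lastlen + len(stoken) <= self.maxlen:
            self.current.append(stoken)
            self.lastlen += len(stoken)
            return True
        return False

    def __str__(self):
        return ''.join(self.done + self.current)

f = FoldedSketch(maxlen=10)
for word in ('Subject:', ' a', ' long', ' value'):
    if not f.append_if_fits(word):
        f.newline()
        f.append_if_fits(word)
print(str(f))   # 'Subject: a\n long\n value'
```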
187 #
188 # TokenList and its subclasses
189 #
190
191 class TokenList(list):
192
193 token_type = None
194
195 def __init__(self, *args, **kw):
196 super(TokenList, self).__init__(*args, **kw)
197 self.defects = []
198
199 def __str__(self):
200 return ''.join(str(x) for x in self)
201
202 def __repr__(self):
203 return '{}({})'.format(self.__class__.__name__,
204 super(TokenList, self).__repr__())
205
206 @property
207 def value(self):
208 return ''.join(x.value for x in self if x.value)
209
210 @property
211 def all_defects(self):
212 return sum((x.all_defects for x in self), self.defects)
213
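[Reviewer note: all_defects is plain list concatenation over the subtree via sum(). A toy stand-in showing the recursion; `Node` is hypothetical, and in the real module the Terminal leaves are str subclasses that also expose all_defects:]

```python
class Node(list):
    def __init__(self, *args):
        super(Node, self).__init__(*args)
        self.defects = []

    @property
    def all_defects(self):
        # A node's defects, followed by every descendant's defects.
        return sum((x.all_defects for x in self
                    if hasattr(x, 'all_defects')), self.defects)

leaf = Node()
leaf.defects = ['bad-char']
root = Node([leaf, 'plain-terminal-text'])
root.defects = ['obsolete-syntax']
print(root.all_defects)   # ['obsolete-syntax', 'bad-char']
```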
214 #
215 # Folding API
216 #
217 # parts():
218 #
219 # return a list of objects that constitute the "higher level syntactic
220 # objects" specified by the RFC as the best places to fold a header line.
221 # The returned objects must include leading folding white space, even if
222 # this means mutating the underlying parse tree of the object. Each object
223 # is only responsible for returning *its* parts, and should not drill down
224 # to any lower level except as required to meet the leading folding white
225 # space constraint.
226 #
227 # _fold(folded):
228 #
229 # folded: the result accumulator. This is an instance of _Folded.
230 # (XXX: I haven't finished factoring this out yet, the folding code
231 # pretty much uses this as a state object.) When the folded.current
232 # contains as much text as will fit, the _fold method should call
233 # folded.newline.
234 # folded.lastlen: the current length of the text stored in folded.current.
235 # folded.maxlen: The maximum number of characters that may appear on a
236 # folded line. Differs from the policy setting in that "no limit" is
237 # represented by +inf, which means it can be used in the trivially
238 # logical fashion in comparisons.
239 #
240 # Currently no subclasses implement parts, and I think this will remain
241 # true. A subclass only needs to implement _fold when the generic version
242 # isn't sufficient. _fold will need to be implemented primarily when it is
243 # possible for encoded words to appear in the specialized token-list, since
244 # there is no generic algorithm that can know where exactly the encoded
245 # words are allowed. A _fold implementation is responsible for filling
246 # lines in the same general way that the top level _fold does. It may, and
247 # should, call the _fold method of sub-objects in a similar fashion to that
248 # of the top level _fold.
249 #
250 # XXX: I'm hoping it will be possible to factor the existing code further
251 # to reduce redundancy and make the logic clearer.
252
253 @property
254 def parts(self):
255 klass = self.__class__
256 this = list()
257 for token in self:
258 if token.startswith_fws():
259 if this:
260 yield this[0] if len(this)==1 else klass(this)
261 this.clear()
262 end_ws = token.pop_trailing_ws()
263 this.append(token)
264 if end_ws:
265 yield klass(this)
266 this = [end_ws]
267 if this:
268 yield this[0] if len(this)==1 else klass(this)
269
270 def startswith_fws(self):
271 return self[0].startswith_fws()
272
273 def pop_leading_fws(self):
274 if self[0].token_type == 'fws':
275 return self.pop(0)
276 return self[0].pop_leading_fws()
277
278 def pop_trailing_ws(self):
279 if self[-1].token_type == 'cfws':
280 return self.pop(-1)
281 return self[-1].pop_trailing_ws()
282
283 @property
284 def has_fws(self):
285 for part in self:
286 if part.has_fws:
287 return True
288 return False
289
290 def has_leading_comment(self):
291 return self[0].has_leading_comment()
292
293 @property
294 def comments(self):
295 comments = []
296 for token in self:
297 comments.extend(token.comments)
298 return comments
299
300 def fold(self, **_3to2kwargs):
301 # max_line_length 0/None means no limit, i.e., infinitely long.
302 policy = _3to2kwargs['policy']; del _3to2kwargs['policy']
303 maxlen = policy.max_line_length or float("+inf")
304 folded = _Folded(maxlen, policy)
305 self._fold(folded)
306 folded.finalize()
307 return str(folded)
308
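[Reviewer note: the `or float("+inf")` in fold is the trick the Folding API comment above alludes to: "no limit" becomes positive infinity so the ordinary length comparisons need no special case:]

```python
for max_line_length in (0, None):          # policy values meaning "no limit"
    maxlen = max_line_length or float('+inf')
    # Any candidate line length now trivially "fits".
    assert 10**9 <= maxlen
assert (78 or float('+inf')) == 78         # a real limit passes through unchanged
```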
309 def as_encoded_word(self, charset):
310 # This works only for things returned by 'parts', which include
311 # the leading fws, if any, that should be used.
312 res = []
313 ws = self.pop_leading_fws()
314 if ws:
315 res.append(ws)
316 trailer = self.pop(-1) if self[-1].token_type=='fws' else ''
317 res.append(_ew.encode(str(self), charset))
318 res.append(trailer)
319 return ''.join(res)
320
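[Reviewer note: as_encoded_word delegates to _ew.encode, which produces RFC 2047 encoded words of the form `=?charset?cte?encoded-text?=`. A minimal base64 ('b' cte) sketch; `encode_b` is a hypothetical stand-in, not this module's _ew.encode, which also supports the 'q' encoding:]

```python
import base64

def encode_b(text, charset='utf-8'):
    # RFC 2047 encoded-word with base64 content-transfer-encoding.
    payload = base64.b64encode(text.encode(charset)).decode('ascii')
    return '=?{}?b?{}?='.format(charset, payload)

print(encode_b('héllo'))   # =?utf-8?b?aMOpbGxv?=
```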
321 def cte_encode(self, charset, policy):
322 res = []
323 for part in self:
324 res.append(part.cte_encode(charset, policy))
325 return ''.join(res)
326
327 def _fold(self, folded):
328 for part in self.parts:
329 tstr = str(part)
330 tlen = len(tstr)
331 try:
332 str(part).encode('us-ascii')
333 except UnicodeEncodeError:
334 if any(isinstance(x, errors.UndecodableBytesDefect)
335 for x in part.all_defects):
336 charset = 'unknown-8bit'
337 else:
338 # XXX: this should be a policy setting
339 charset = 'utf-8'
340 tstr = part.cte_encode(charset, folded.policy)
341 tlen = len(tstr)
342 if folded.append_if_fits(part, tstr):
343 continue
344 # Peel off the leading whitespace if any and make it sticky, to
345 # avoid infinite recursion.
346 ws = part.pop_leading_fws()
347 if ws is not None:
348 # Peel off the leading whitespace and make it sticky, to
349 # avoid infinite recursion.
350 folded.stickyspace = str(part.pop(0))
351 if folded.append_if_fits(part):
352 continue
353 if part.has_fws:
354 part._fold(folded)
355 continue
356 # There are no fold points in this one; it is too long for a single
357 # line and can't be split...we just have to put it on its own line.
358 folded.append(tstr)
359 folded.newline()
360
361 def pprint(self, indent=''):
362 print('\n'.join(self._pp(indent='')))
363
364 def ppstr(self, indent=''):
365 return '\n'.join(self._pp(indent=''))
366
367 def _pp(self, indent=''):
368 yield '{}{}/{}('.format(
369 indent,
370 self.__class__.__name__,
371 self.token_type)
372 for token in self:
373 if not hasattr(token, '_pp'):
374 yield (indent + ' !! invalid element in token '
375 'list: {!r}'.format(token))
376 else:
377 for line in token._pp(indent+' '):
378 yield line
379 if self.defects:
380 extra = ' Defects: {}'.format(self.defects)
381 else:
382 extra = ''
383 yield '{}){}'.format(indent, extra)
384
385
386 class WhiteSpaceTokenList(TokenList):
387
388 @property
389 def value(self):
390 return ' '
391
392 @property
393 def comments(self):
394 return [x.content for x in self if x.token_type=='comment']
395
396
397 class UnstructuredTokenList(TokenList):
398
399 token_type = 'unstructured'
400
401 def _fold(self, folded):
402 if any(x.token_type=='encoded-word' for x in self):
403 return self._fold_encoded(folded)
404 # Here we can have either a pure ASCII string that may or may not
405 # have surrogateescape encoded bytes, or a unicode string.
406 last_ew = None
407 for part in self.parts:
408 tstr = str(part)
409 is_ew = False
410 try:
411 str(part).encode('us-ascii')
412 except UnicodeEncodeError:
413 if any(isinstance(x, errors.UndecodableBytesDefect)
414 for x in part.all_defects):
415 charset = 'unknown-8bit'
416 else:
417 charset = 'utf-8'
418 if last_ew is not None:
419 # We've already done an EW, combine this one with it
420 # if there's room.
421 chunk = get_unstructured(
422 ''.join(folded.current[last_ew:]+[tstr])).as_encoded_word(charset)
423 oldlastlen = sum(len(x) for x in folded.current[:last_ew])
424 schunk = str(chunk)
425 lchunk = len(schunk)
426 if oldlastlen + lchunk <= folded.maxlen:
427 del folded.current[last_ew:]
428 folded.append(schunk)
429 folded.lastlen = oldlastlen + lchunk
430 continue
431 tstr = part.as_encoded_word(charset)
432 is_ew = True
433 if folded.append_if_fits(part, tstr):
434 if is_ew:
435 last_ew = len(folded.current) - 1
436 continue
437 if is_ew or last_ew:
438 # It's too big to fit on the line, but since we've
439 # got encoded words we can use encoded word folding.
440 part._fold_as_ew(folded)
441 continue
442 # Peel off the leading whitespace if any and make it sticky, to
443 # avoid infinite recursion.
444 ws = part.pop_leading_fws()
445 if ws is not None:
446 folded.stickyspace = str(ws)
447 if folded.append_if_fits(part):
448 continue
449 if part.has_fws:
450 part._fold(folded)
451 continue
452 # It can't be split...we just have to put it on its own line.
453 folded.append(tstr)
454 folded.newline()
455 last_ew = None
456
457 def cte_encode(self, charset, policy):
458 res = []
459 last_ew = None
460 for part in self:
461 spart = str(part)
462 try:
463 spart.encode('us-ascii')
464 res.append(spart)
465 except UnicodeEncodeError:
466 if last_ew is None:
467 res.append(part.cte_encode(charset, policy))
468 last_ew = len(res)
469 else:
470 tl = get_unstructured(''.join(res[last_ew:] + [spart]))
471 res.append(tl.as_encoded_word(charset))
472 return ''.join(res)
473
474
475 class Phrase(TokenList):
476
477 token_type = 'phrase'
478
479 def _fold(self, folded):
480 # As with Unstructured, we can have pure ASCII with or without
481 # surrogateescape encoded bytes, or we could have unicode. But this
482 # case is more complicated, since we have to deal with the various
483 # sub-token types and how they can be composed in the face of
484 # unicode-that-needs-CTE-encoding, and the fact that if a token has a
485 # comment, that comment becomes a barrier across which we can't compose encoded
486 # words.
487 last_ew = None
488 for part in self.parts:
489 tstr = str(part)
490 tlen = len(tstr)
491 has_ew = False
492 try:
493 str(part).encode('us-ascii')
494 except UnicodeEncodeError:
495 if any(isinstance(x, errors.UndecodableBytesDefect)
496 for x in part.all_defects):
497 charset = 'unknown-8bit'
498 else:
499 charset = 'utf-8'
500 if last_ew is not None and not part.has_leading_comment():
501 # We've already done an EW, let's see if we can combine
502 # this one with it. The last_ew logic ensures that all we
503 # have at this point is atoms, no comments or quoted
504 # strings. So we can treat the text between the last
505 # encoded word and the content of this token as
506 # unstructured text, and things will work correctly. But
507 # we have to strip off any trailing comment on this token
508 # first, and if it is a quoted string we have to pull out
509 # the content (we're encoding it, so it no longer needs to
510 # be quoted).
511 if part[-1].token_type == 'cfws' and part.comments:
512 remainder = part.pop(-1)
513 else:
514 remainder = ''
515 for i, token in enumerate(part):
516 if token.token_type == 'bare-quoted-string':
517 part[i] = UnstructuredTokenList(token[:])
518 chunk = get_unstructured(
519 ''.join(folded.current[last_ew:]+[tstr])).as_encoded_word(charset)
520 schunk = str(chunk)
521 lchunk = len(schunk)
522 if last_ew + lchunk <= folded.maxlen:
523 del folded.current[last_ew:]
524 folded.append(schunk)
525 folded.lastlen = sum(len(x) for x in folded.current)
526 continue
527 tstr = part.as_encoded_word(charset)
528 tlen = len(tstr)
529 has_ew = True
530 if folded.append_if_fits(part, tstr):
531 if has_ew and not part.comments:
532 last_ew = len(folded.current) - 1
533 elif part.comments or part.token_type == 'quoted-string':
534 # If a comment is involved we can't combine EWs. And if a
535 # quoted string is involved, it's not worth the effort to
536 # try to combine them.
537 last_ew = None
538 continue
539 part._fold(folded)
540
541 def cte_encode(self, charset, policy):
542 res = []
543 last_ew = None
544 is_ew = False
545 for part in self:
546 spart = str(part)
547 try:
548 spart.encode('us-ascii')
549 res.append(spart)
550 except UnicodeEncodeError:
551 is_ew = True
552 if last_ew is None:
553 if not part.comments:
554 last_ew = len(res)
555 res.append(part.cte_encode(charset, policy))
556 elif not part.has_leading_comment():
557 if part[-1].token_type == 'cfws' and part.comments:
558 remainder = part.pop(-1)
559 else:
560 remainder = ''
561 for i, token in enumerate(part):
562 if token.token_type == 'bare-quoted-string':
563 part[i] = UnstructuredTokenList(token[:])
564 tl = get_unstructured(''.join(res[last_ew:] + [spart]))
565 res[last_ew:] = [tl.as_encoded_word(charset)]
566 if part.comments or (not is_ew and part.token_type == 'quoted-string'):
567 last_ew = None
568 return ''.join(res)
569
570 class Word(TokenList):
571
572 token_type = 'word'
573
574
575 class CFWSList(WhiteSpaceTokenList):
576
577 token_type = 'cfws'
578
579 def has_leading_comment(self):
580 return bool(self.comments)
581
582
583 class Atom(TokenList):
584
585 token_type = 'atom'
586
587
588 class Token(TokenList):
589
590 token_type = 'token'
591
592
593 class EncodedWord(TokenList):
594
595 token_type = 'encoded-word'
596 cte = None
597 charset = None
598 lang = None
599
600 @property
601 def encoded(self):
602 if self.cte is not None:
603 return self.cte
604 return _ew.encode(str(self), self.charset)
605
606
607
608 class QuotedString(TokenList):
609
610 token_type = 'quoted-string'
611
612 @property
613 def content(self):
614 for x in self:
615 if x.token_type == 'bare-quoted-string':
616 return x.value
617
618 @property
619 def quoted_value(self):
620 res = []
621 for x in self:
622 if x.token_type == 'bare-quoted-string':
623 res.append(str(x))
624 else:
625 res.append(x.value)
626 return ''.join(res)
627
628 @property
629 def stripped_value(self):
630 for token in self:
631 if token.token_type == 'bare-quoted-string':
632 return token.value
633
634
635 class BareQuotedString(QuotedString):
636
637 token_type = 'bare-quoted-string'
638
639 def __str__(self):
640 return quote_string(''.join(str(x) for x in self))
641
642 @property
643 def value(self):
644 return ''.join(str(x) for x in self)
645
646
647 class Comment(WhiteSpaceTokenList):
648
649 token_type = 'comment'
650
651 def __str__(self):
652 return ''.join(sum([
653 ["("],
654 [self.quote(x) for x in self],
655 [")"],
656 ], []))
657
658 def quote(self, value):
659 if value.token_type == 'comment':
660 return str(value)
661 return str(value).replace('\\', '\\\\').replace(
662 '(', r'\(').replace(
663 ')', r'\)')
664
665 @property
666 def content(self):
667 return ''.join(str(x) for x in self)
668
669 @property
670 def comments(self):
671 return [self.content]
672
673 class AddressList(TokenList):
674
675 token_type = 'address-list'
676
677 @property
678 def addresses(self):
679 return [x for x in self if x.token_type=='address']
680
681 @property
682 def mailboxes(self):
683 return sum((x.mailboxes
684 for x in self if x.token_type=='address'), [])
685
686 @property
687 def all_mailboxes(self):
688 return sum((x.all_mailboxes
689 for x in self if x.token_type=='address'), [])
690
691
692 class Address(TokenList):
693
694 token_type = 'address'
695
696 @property
697 def display_name(self):
698 if self[0].token_type == 'group':
699 return self[0].display_name
700
701 @property
702 def mailboxes(self):
703 if self[0].token_type == 'mailbox':
704 return [self[0]]
705 elif self[0].token_type == 'invalid-mailbox':
706 return []
707 return self[0].mailboxes
708
709 @property
710 def all_mailboxes(self):
711 if self[0].token_type == 'mailbox':
712 return [self[0]]
713 elif self[0].token_type == 'invalid-mailbox':
714 return [self[0]]
715 return self[0].all_mailboxes
716
717 class MailboxList(TokenList):
718
719 token_type = 'mailbox-list'
720
721 @property
722 def mailboxes(self):
723 return [x for x in self if x.token_type=='mailbox']
724
725 @property
726 def all_mailboxes(self):
727 return [x for x in self
728 if x.token_type in ('mailbox', 'invalid-mailbox')]
729
730
731 class GroupList(TokenList):
732
733 token_type = 'group-list'
734
735 @property
736 def mailboxes(self):
737 if not self or self[0].token_type != 'mailbox-list':
738 return []
739 return self[0].mailboxes
740
741 @property
742 def all_mailboxes(self):
743 if not self or self[0].token_type != 'mailbox-list':
744 return []
745 return self[0].all_mailboxes
746
747
748 class Group(TokenList):
749
750 token_type = "group"
751
752 @property
753 def mailboxes(self):
754 if self[2].token_type != 'group-list':
755 return []
756 return self[2].mailboxes
757
758 @property
759 def all_mailboxes(self):
760 if self[2].token_type != 'group-list':
761 return []
762 return self[2].all_mailboxes
763
764 @property
765 def display_name(self):
766 return self[0].display_name
767
768
769 class NameAddr(TokenList):
770
771 token_type = 'name-addr'
772
773 @property
774 def display_name(self):
775 if len(self) == 1:
776 return None
777 return self[0].display_name
778
779 @property
780 def local_part(self):
781 return self[-1].local_part
782
783 @property
784 def domain(self):
785 return self[-1].domain
786
787 @property
788 def route(self):
789 return self[-1].route
790
791 @property
792 def addr_spec(self):
793 return self[-1].addr_spec
794
795
796 class AngleAddr(TokenList):
797
798 token_type = 'angle-addr'
799
800 @property
801 def local_part(self):
802 for x in self:
803 if x.token_type == 'addr-spec':
804 return x.local_part
805
806 @property
807 def domain(self):
808 for x in self:
809 if x.token_type == 'addr-spec':
810 return x.domain
811
812 @property
813 def route(self):
814 for x in self:
815 if x.token_type == 'obs-route':
816 return x.domains
817
818 @property
819 def addr_spec(self):
820 for x in self:
821 if x.token_type == 'addr-spec':
822 return x.addr_spec
823 else:
824 return '<>'
825
826
827 class ObsRoute(TokenList):
828
829 token_type = 'obs-route'
830
831 @property
832 def domains(self):
833 return [x.domain for x in self if x.token_type == 'domain']
834
835
836 class Mailbox(TokenList):
837
838 token_type = 'mailbox'
839
840 @property
841 def display_name(self):
842 if self[0].token_type == 'name-addr':
843 return self[0].display_name
844
845 @property
846 def local_part(self):
847 return self[0].local_part
848
849 @property
850 def domain(self):
851 return self[0].domain
852
853 @property
854 def route(self):
855 if self[0].token_type == 'name-addr':
856 return self[0].route
857
858 @property
859 def addr_spec(self):
860 return self[0].addr_spec
861
862
863 class InvalidMailbox(TokenList):
864
865 token_type = 'invalid-mailbox'
866
867 @property
868 def display_name(self):
869 return None
870
871 local_part = domain = route = addr_spec = display_name
872
873
874 class Domain(TokenList):
875
876 token_type = 'domain'
877
878 @property
879 def domain(self):
880 return ''.join(super(Domain, self).value.split())
881
882
883 class DotAtom(TokenList):
884
885 token_type = 'dot-atom'
886
887
888 class DotAtomText(TokenList):
889
890 token_type = 'dot-atom-text'
891
892
893 class AddrSpec(TokenList):
894
895 token_type = 'addr-spec'
896
897 @property
898 def local_part(self):
899 return self[0].local_part
900
901 @property
902 def domain(self):
903 if len(self) < 3:
904 return None
905 return self[-1].domain
906
907 @property
908 def value(self):
909 if len(self) < 3:
910 return self[0].value
911 return self[0].value.rstrip()+self[1].value+self[2].value.lstrip()
912
913 @property
914 def addr_spec(self):
915 nameset = set(self.local_part)
916 if len(nameset) > len(nameset-DOT_ATOM_ENDS):
917 lp = quote_string(self.local_part)
918 else:
919 lp = self.local_part
920 if self.domain is not None:
921 return lp + '@' + self.domain
922 return lp
923
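[Reviewer note: the nameset check in addr_spec above is an emptiness test on a set intersection: the local part is quoted only if it contains a character from DOT_ATOM_ENDS. A standalone sketch; `format_addr_spec` is hypothetical and reuses quote_string's escaping inline:]

```python
SPECIALS = set(r'()<>@,:;.\"[]')
WSP = set(' \t')
DOT_ATOM_ENDS = (SPECIALS | WSP) - set('.')

def format_addr_spec(local_part, domain):
    # Quote the local part only when it can't stand as dot-atom-text.
    if set(local_part) & DOT_ATOM_ENDS:
        local_part = ('"' + local_part.replace('\\', '\\\\')
                                      .replace('"', r'\"') + '"')
    return local_part + '@' + domain

print(format_addr_spec('john.doe', 'example.com'))  # john.doe@example.com
print(format_addr_spec('john doe', 'example.com'))  # "john doe"@example.com
```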
924
925 class ObsLocalPart(TokenList):
926
927 token_type = 'obs-local-part'
928
929
930 class DisplayName(Phrase):
931
932 token_type = 'display-name'
933
934 @property
935 def display_name(self):
936 res = TokenList(self)
937 if res[0].token_type == 'cfws':
938 res.pop(0)
939 else:
940 if res[0][0].token_type == 'cfws':
941 res[0] = TokenList(res[0][1:])
942 if res[-1].token_type == 'cfws':
943 res.pop()
944 else:
945 if res[-1][-1].token_type == 'cfws':
946 res[-1] = TokenList(res[-1][:-1])
947 return res.value
948
949 @property
950 def value(self):
951 quote = False
952 if self.defects:
953 quote = True
954 else:
955 for x in self:
956 if x.token_type == 'quoted-string':
957 quote = True
958 if quote:
959 pre = post = ''
960 if self[0].token_type=='cfws' or self[0][0].token_type=='cfws':
961 pre = ' '
962 if self[-1].token_type=='cfws' or self[-1][-1].token_type=='cfws':
963 post = ' '
964 return pre+quote_string(self.display_name)+post
965 else:
966 return super(DisplayName, self).value
967
968
969 class LocalPart(TokenList):
970
971 token_type = 'local-part'
972
973 @property
974 def value(self):
975 if self[0].token_type == "quoted-string":
976 return self[0].quoted_value
977 else:
978 return self[0].value
979
980 @property
981 def local_part(self):
982 # Strip whitespace from front, back, and around dots.
983 res = [DOT]
984 last = DOT
985 last_is_tl = False
986 for tok in self[0] + [DOT]:
987 if tok.token_type == 'cfws':
988 continue
989 if (last_is_tl and tok.token_type == 'dot' and
990 last[-1].token_type == 'cfws'):
991 res[-1] = TokenList(last[:-1])
992 is_tl = isinstance(tok, TokenList)
993 if (is_tl and last.token_type == 'dot' and
994 tok[0].token_type == 'cfws'):
995 res.append(TokenList(tok[1:]))
996 else:
997 res.append(tok)
998 last = res[-1]
999 last_is_tl = is_tl
1000 res = TokenList(res[1:-1])
1001 return res.value
1002
1003
1004 class DomainLiteral(TokenList):
1005
1006 token_type = 'domain-literal'
1007
1008 @property
1009 def domain(self):
1010 return ''.join(super(DomainLiteral, self).value.split())
1011
1012 @property
1013 def ip(self):
1014 for x in self:
1015 if x.token_type == 'ptext':
1016 return x.value
1017
1018
1019 class MIMEVersion(TokenList):
1020
1021 token_type = 'mime-version'
1022 major = None
1023 minor = None
1024
1025
1026 class Parameter(TokenList):
1027
1028 token_type = 'parameter'
1029 sectioned = False
1030 extended = False
1031 charset = 'us-ascii'
1032
1033 @property
1034 def section_number(self):
1035 # Because the first token (the attribute name) eats CFWS, the second
1036 # token is always the section if there is one.
1037 return self[1].number if self.sectioned else 0
1038
1039 @property
1040 def param_value(self):
1041 # This is part of the "handle quoted extended parameters" hack.
1042 for token in self:
1043 if token.token_type == 'value':
1044 return token.stripped_value
1045 if token.token_type == 'quoted-string':
1046 for token in token:
1047 if token.token_type == 'bare-quoted-string':
1048 for token in token:
1049 if token.token_type == 'value':
1050 return token.stripped_value
1051 return ''
1052
1053
1054 class InvalidParameter(Parameter):
1055
1056 token_type = 'invalid-parameter'
1057
1058
1059 class Attribute(TokenList):
1060
1061 token_type = 'attribute'
1062
1063 @property
1064 def stripped_value(self):
1065 for token in self:
1066 if token.token_type.endswith('attrtext'):
1067 return token.value
1068
1069 class Section(TokenList):
1070
1071 token_type = 'section'
1072 number = None
1073
1074
1075 class Value(TokenList):
1076
1077 token_type = 'value'
1078
1079 @property
1080 def stripped_value(self):
1081 token = self[0]
1082 if token.token_type == 'cfws':
1083 token = self[1]
1084 if token.token_type.endswith(
1085 ('quoted-string', 'attribute', 'extended-attribute')):
1086 return token.stripped_value
1087 return self.value
1088
1089
1090 class MimeParameters(TokenList):
1091
1092 token_type = 'mime-parameters'
1093
1094 @property
1095 def params(self):
1096 # The RFC specifically states that the ordering of parameters is not
1097 # guaranteed and may be reordered by the transport layer. So we have
1098 # to assume the RFC 2231 pieces can come in any order. However, we
1099 # output them in the order that we first see a given name, which gives
1100 # us a stable __str__.
1101 params = OrderedDict()
1102 for token in self:
1103 if not token.token_type.endswith('parameter'):
1104 continue
1105 if token[0].token_type != 'attribute':
1106 continue
1107 name = token[0].value.strip()
1108 if name not in params:
1109 params[name] = []
1110 params[name].append((token.section_number, token))
1111 for name, parts in params.items():
1112 parts = sorted(parts)
1113 # XXX: there might be more recovery we could do here if, for
1114 # example, this is really a case of a duplicate attribute name.
1115 value_parts = []
1116 charset = parts[0][1].charset
1117 for i, (section_number, param) in enumerate(parts):
1118 if section_number != i:
1119 param.defects.append(errors.InvalidHeaderDefect(
1120 "inconsistent multipart parameter numbering"))
1121 value = param.param_value
1122 if param.extended:
1123 try:
1124 value = unquote_to_bytes(value)
1125 except UnicodeEncodeError:
1126 # source had surrogate escaped bytes. What we do now
1127 # is a bit of an open question. I'm not sure this is
1128 # the best choice, but it is what the old algorithm did
1129 value = unquote(value, encoding='latin-1')
1130 else:
1131 try:
1132 value = value.decode(charset, 'surrogateescape')
1133 except LookupError:
1134 # XXX: there should really be a custom defect for
1135 # unknown character set to make it easy to find,
1136 # because otherwise unknown charset is a silent
1137 # failure.
1138 value = value.decode('us-ascii', 'surrogateescape')
1139 if utils._has_surrogates(value):
1140 param.defects.append(errors.UndecodableBytesDefect())
1141 value_parts.append(value)
1142 value = ''.join(value_parts)
1143 yield name, value
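The RFC 2231 reassembly described in the comments above can be sketched standalone: numbered sections of one parameter name are sorted by section number and concatenated, while first-seen order of names is preserved. This is a hypothetical simplified helper (the name `join_2231_sections` is mine); the real property also tracks charsets, percent-decodes extended values, and records defects.

```python
from collections import OrderedDict

def join_2231_sections(pairs):
    """Reassemble (name, section_number, value) triples into parameters."""
    params = OrderedDict()
    for name, section, value in pairs:
        # Sections may arrive in any order; remember each with its number.
        params.setdefault(name, []).append((section, value))
    for name, parts in params.items():
        # Sort by section number and join the pieces into one value.
        yield name, ''.join(v for _, v in sorted(parts))
```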
1144
1145 def __str__(self):
1146 params = []
1147 for name, value in self.params:
1148 if value:
1149 params.append('{}={}'.format(name, quote_string(value)))
1150 else:
1151 params.append(name)
1152 params = '; '.join(params)
1153 return ' ' + params if params else ''
1154
1155
1156 class ParameterizedHeaderValue(TokenList):
1157
1158 @property
1159 def params(self):
1160 for token in reversed(self):
1161 if token.token_type == 'mime-parameters':
1162 return token.params
1163 return {}
1164
1165 @property
1166 def parts(self):
1167 if self and self[-1].token_type == 'mime-parameters':
1168 # We don't want to start a new line if all of the params don't fit
1169 # after the value, so unwrap the parameter list.
1170 return TokenList(self[:-1] + self[-1])
1171 return TokenList(self).parts
1172
1173
1174 class ContentType(ParameterizedHeaderValue):
1175
1176 token_type = 'content-type'
1177 maintype = 'text'
1178 subtype = 'plain'
1179
1180
1181 class ContentDisposition(ParameterizedHeaderValue):
1182
1183 token_type = 'content-disposition'
1184 content_disposition = None
1185
1186
1187 class ContentTransferEncoding(TokenList):
1188
1189 token_type = 'content-transfer-encoding'
1190 cte = '7bit'
1191
1192
1193 class HeaderLabel(TokenList):
1194
1195 token_type = 'header-label'
1196
1197
1198 class Header(TokenList):
1199
1200 token_type = 'header'
1201
1202 def _fold(self, folded):
1203 folded.append(str(self.pop(0)))
1204 folded.lastlen = len(folded.current[0])
1205 # The first line of the header is different from all others: we don't
1206 # want to start a new object on a new line if it has any fold points in
1207 # it that would allow part of it to be on the first header line.
1208 # Further, if the first fold point would fit on the new line, we want
1209 # to do that, but if it doesn't we want to put it on the first line.
1210 # Folded supports this via the stickyspace attribute. If this
1211 # attribute is not None, it does the special handling.
1212 folded.stickyspace = str(self.pop(0)) if self[0].token_type == 'cfws' else ''
1213 rest = self.pop(0)
1214 if self:
1215 raise ValueError("Malformed Header token list")
1216 rest._fold(folded)
1217
1218
1219 #
1220 # Terminal classes and instances
1221 #
1222
1223 class Terminal(str):
1224
1225 def __new__(cls, value, token_type):
1226 self = super(Terminal, cls).__new__(cls, value)
1227 self.token_type = token_type
1228 self.defects = []
1229 return self
1230
1231 def __repr__(self):
1232 return "{}({})".format(self.__class__.__name__, super(Terminal, self).__repr__())
1233
1234 @property
1235 def all_defects(self):
1236 return list(self.defects)
1237
1238 def _pp(self, indent=''):
1239 return ["{}{}/{}({}){}".format(
1240 indent,
1241 self.__class__.__name__,
1242 self.token_type,
1243 super(Terminal, self).__repr__(),
1244 '' if not self.defects else ' {}'.format(self.defects),
1245 )]
1246
1247 def cte_encode(self, charset, policy):
1248 value = str(self)
1249 try:
1250 value.encode('us-ascii')
1251 return value
1252 except UnicodeEncodeError:
1253 return _ew.encode(value, charset)
1254
1255 def pop_trailing_ws(self):
1256 # This terminates the recursion.
1257 return None
1258
1259 def pop_leading_fws(self):
1260 # This terminates the recursion.
1261 return None
1262
1263 @property
1264 def comments(self):
1265 return []
1266
1267 def has_leading_comment(self):
1268 return False
1269
1270 def __getnewargs__(self):
1271 return (str(self), self.token_type)
1272
1273
1274 class WhiteSpaceTerminal(Terminal):
1275
1276 @property
1277 def value(self):
1278 return ' '
1279
1280 def startswith_fws(self):
1281 return True
1282
1283 has_fws = True
1284
1285
1286 class ValueTerminal(Terminal):
1287
1288 @property
1289 def value(self):
1290 return self
1291
1292 def startswith_fws(self):
1293 return False
1294
1295 has_fws = False
1296
1297 def as_encoded_word(self, charset):
1298 return _ew.encode(str(self), charset)
1299
1300
1301 class EWWhiteSpaceTerminal(WhiteSpaceTerminal):
1302
1303 @property
1304 def value(self):
1305 return ''
1306
1307 @property
1308 def encoded(self):
1309 return self[:]
1310
1311 def __str__(self):
1312 return ''
1313
1314 has_fws = True
1315
1316
1317 # XXX these need to become classes and be used as instances so
1318 # that a program can't change them in a parse tree and screw
1319 # up other parse trees. Maybe should have tests for that, too.
1320 DOT = ValueTerminal('.', 'dot')
1321 ListSeparator = ValueTerminal(',', 'list-separator')
1322 RouteComponentMarker = ValueTerminal('@', 'route-component-marker')
1323
1324 #
1325 # Parser
1326 #
1327
1328 """Parse strings according to RFC822/2047/2822/5322 rules.
1329
1330 This is a stateless parser. Each get_XXX function accepts a string and
1331 returns either a Terminal or a TokenList representing the RFC object named
1332 by the method and a string containing the remaining unparsed characters
1333 from the input. Thus a parser method consumes the next syntactic construct
1334 of a given type and returns a token representing the construct plus the
1335 unparsed remainder of the input string.
1336
1337 For example, if the first element of a structured header is a 'phrase',
1338 then:
1339
1340 phrase, value = get_phrase(value)
1341
1342 returns the complete phrase from the start of the string value, plus any
1343 characters left in the string after the phrase is removed.
1344
1345 """
1346
1347 _wsp_splitter = re.compile(r'([{}]+)'.format(''.join(WSP))).split
1348 _non_atom_end_matcher = re.compile(r"[^{}]+".format(
1349 ''.join(ATOM_ENDS).replace('\\','\\\\').replace(']','\]'))).match
1350 _non_printable_finder = re.compile(r"[\x00-\x20\x7F]").findall
1351 _non_token_end_matcher = re.compile(r"[^{}]+".format(
1352 ''.join(TOKEN_ENDS).replace('\\','\\\\').replace(']','\]'))).match
1353 _non_attribute_end_matcher = re.compile(r"[^{}]+".format(
1354 ''.join(ATTRIBUTE_ENDS).replace('\\','\\\\').replace(']','\]'))).match
1355 _non_extended_attribute_end_matcher = re.compile(r"[^{}]+".format(
1356 ''.join(EXTENDED_ATTRIBUTE_ENDS).replace(
1357 '\\','\\\\').replace(']','\]'))).match
1358
1359 def _validate_xtext(xtext):
1360 """If input token contains ASCII non-printables, register a defect."""
1361
1362 non_printables = _non_printable_finder(xtext)
1363 if non_printables:
1364 xtext.defects.append(errors.NonPrintableDefect(non_printables))
1365 if utils._has_surrogates(xtext):
1366 xtext.defects.append(errors.UndecodableBytesDefect(
1367 "Non-ASCII characters found in header token"))
1368
1369 def _get_ptext_to_endchars(value, endchars):
1370 """Scan printables/quoted-pairs until endchars and return unquoted ptext.
1371
1372 This function turns a run of qcontent, ccontent-without-comments, or
1373 dtext-with-quoted-printables into a single string by unquoting any
1374 quoted printables. It returns the string, the remaining value, and
1375 a flag that is True iff there were any quoted printables decoded.
1376
1377 """
1378 _3to2list = list(_wsp_splitter(value, 1))
1379 fragment, remainder, = _3to2list[:1] + [_3to2list[1:]]
1380 vchars = []
1381 escape = False
1382 had_qp = False
1383 for pos in range(len(fragment)):
1384 if fragment[pos] == '\\':
1385 if escape:
1386 escape = False
1387 had_qp = True
1388 else:
1389 escape = True
1390 continue
1391 if escape:
1392 escape = False
1393 elif fragment[pos] in endchars:
1394 break
1395 vchars.append(fragment[pos])
1396 else:
1397 pos = pos + 1
1398 return ''.join(vchars), ''.join([fragment[pos:]] + remainder), had_qp
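The quoted-pair scan above can be illustrated with a simplified standalone sketch (a hypothetical `unquote_ptext`, assuming the whitespace split has already happened; unlike the real function, it flags any decoded quoted-pair, not only an escaped backslash):

```python
def unquote_ptext(fragment, endchars):
    """Collect chars up to an endchar, decoding backslash quoted-pairs."""
    vchars = []
    escape = False
    had_qp = False
    pos = 0
    for pos, ch in enumerate(fragment):
        if ch == '\\' and not escape:
            escape = True  # next character is taken literally
            continue
        if escape:
            escape = False
            had_qp = True
        elif ch in endchars:
            break  # unescaped end character terminates the run
        vchars.append(ch)
    else:
        pos += 1  # consumed the whole fragment
    return ''.join(vchars), fragment[pos:], had_qp
```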
1399
1400 def _decode_ew_run(value):
1401 """ Decode a run of RFC2047 encoded words.
1402
1403 _decode_ew_run(value) -> (text, value, defects)
1404
1405 Scans the supplied value for a run of tokens that look like they are RFC
1406 2047 encoded words, decodes those words into text according to RFC 2047
1407 rules (whitespace between encoded words is discarded), and returns the text
1408 and the remaining value (including any leading whitespace on the remaining
1409 value), as well as a list of any defects encountered while decoding. The
1410 input value may not have any leading whitespace.
1411
1412 """
1413 res = []
1414 defects = []
1415 last_ws = ''
1416 while value:
1417 try:
1418 tok, ws, value = _wsp_splitter(value, 1)
1419 except ValueError:
1420 tok, ws, value = value, '', ''
1421 if not (tok.startswith('=?') and tok.endswith('?=')):
1422 return ''.join(res), last_ws + tok + ws + value, defects
1423 text, charset, lang, new_defects = _ew.decode(tok)
1424 res.append(text)
1425 defects.extend(new_defects)
1426 last_ws = ws
1427 return ''.join(res), last_ws, defects
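Recognizing the run itself can be sketched without any decoding: split on whitespace and stop at the first token that does not look like `=?...?=`. This hypothetical `split_ew_run` drops the inter-word whitespace, which is roughly what RFC 2047 requires; the real function also decodes each word and collects defects.

```python
import re

_wsp_split = re.compile(r'([ \t]+)').split

def split_ew_run(value):
    """Collect leading whitespace-separated encoded-word-shaped tokens."""
    words = []
    while value:
        parts = _wsp_split(value, 1)
        tok = parts[0]
        tail = parts[2] if len(parts) > 2 else ''
        if not (tok.startswith('=?') and tok.endswith('?=')):
            break  # first non-encoded-word token ends the run
        words.append(tok)
        value = tail
    return words, value
```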
1428
1429 def get_fws(value):
1430 """FWS = 1*WSP
1431
1432 This isn't the RFC definition. We're using fws to represent tokens where
1433 folding can be done, but when we are parsing the *un*folding has already
1434 been done so we don't need to watch out for CRLF.
1435
1436 """
1437 newvalue = value.lstrip()
1438 fws = WhiteSpaceTerminal(value[:len(value)-len(newvalue)], 'fws')
1439 return fws, newvalue
1440
1441 def get_encoded_word(value):
1442 """ encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
1443
1444 """
1445 ew = EncodedWord()
1446 if not value.startswith('=?'):
1447 raise errors.HeaderParseError(
1448 "expected encoded word but found {}".format(value))
1449 _3to2list1 = list(value[2:].split('?=', 1))
1450 tok, remainder, = _3to2list1[:1] + [_3to2list1[1:]]
1451 if tok == value[2:]:
1452 raise errors.HeaderParseError(
1453 "expected encoded word but found {}".format(value))
1454 remstr = ''.join(remainder)
1455 if remstr[:2].isdigit():
1456 _3to2list3 = list(remstr.split('?=', 1))
1457 rest, remainder, = _3to2list3[:1] + [_3to2list3[1:]]
1458 tok = tok + '?=' + rest
1459 if len(tok.split()) > 1:
1460 ew.defects.append(errors.InvalidHeaderDefect(
1461 "whitespace inside encoded word"))
1462 ew.cte = value
1463 value = ''.join(remainder)
1464 try:
1465 text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
1466 except ValueError:
1467 raise errors.HeaderParseError(
1468 "encoded word format invalid: '{}'".format(ew.cte))
1469 ew.charset = charset
1470 ew.lang = lang
1471 ew.defects.extend(defects)
1472 while text:
1473 if text[0] in WSP:
1474 token, text = get_fws(text)
1475 ew.append(token)
1476 continue
1477 _3to2list5 = list(_wsp_splitter(text, 1))
1478 chars, remainder, = _3to2list5[:1] + [_3to2list5[1:]]
1479 vtext = ValueTerminal(chars, 'vtext')
1480 _validate_xtext(vtext)
1481 ew.append(vtext)
1482 text = ''.join(remainder)
1483 return ew, value
1484
1485 def get_unstructured(value):
1486 """unstructured = (*([FWS] vchar) *WSP) / obs-unstruct
1487 obs-unstruct = *((*LF *CR *(obs-utext *LF *CR)) / FWS)
1488 obs-utext = %d0 / obs-NO-WS-CTL / VCHAR
1489
1490 obs-NO-WS-CTL is control characters except WSP/CR/LF.
1491
1492 So, basically, we have printable runs, plus control characters or nulls in
1493 the obsolete syntax, separated by whitespace. Since RFC 2047 uses the
1494 obsolete syntax in its specification, but requires whitespace on either
1495 side of the encoded words, I can see no reason to need to separate the
1496 non-printable-non-whitespace from the printable runs if they occur, so we
1497 parse this into xtext tokens separated by WSP tokens.
1498
1499 Because an 'unstructured' value must by definition constitute the entire
1500 value, this 'get' routine does not return a remaining value, only the
1501 parsed TokenList.
1502
1503 """
1504 # XXX: but what about bare CR and LF? They might signal the start or
1505 # end of an encoded word. YAGNI for now, since our current parsers
1506 # will never send us strings with bare CR or LF.
1507
1508 unstructured = UnstructuredTokenList()
1509 while value:
1510 if value[0] in WSP:
1511 token, value = get_fws(value)
1512 unstructured.append(token)
1513 continue
1514 if value.startswith('=?'):
1515 try:
1516 token, value = get_encoded_word(value)
1517 except errors.HeaderParseError:
1518 pass
1519 else:
1520 have_ws = True
1521 if len(unstructured) > 0:
1522 if unstructured[-1].token_type != 'fws':
1523 unstructured.defects.append(errors.InvalidHeaderDefect(
1524 "missing whitespace before encoded word"))
1525 have_ws = False
1526 if have_ws and len(unstructured) > 1:
1527 if unstructured[-2].token_type == 'encoded-word':
1528 unstructured[-1] = EWWhiteSpaceTerminal(
1529 unstructured[-1], 'fws')
1530 unstructured.append(token)
1531 continue
1532 _3to2list7 = list(_wsp_splitter(value, 1))
1533 tok, remainder, = _3to2list7[:1] + [_3to2list7[1:]]
1534 vtext = ValueTerminal(tok, 'vtext')
1535 _validate_xtext(vtext)
1536 unstructured.append(vtext)
1537 value = ''.join(remainder)
1538 return unstructured
1539
1540 def get_qp_ctext(value):
1541 """ctext = <printable ascii except \ ( )>
1542
1543 This is not the RFC ctext, since we are handling nested comments in comment
1544 and unquoting quoted-pairs here. We allow anything except the '()'
1545 characters, but if we find any ASCII other than the RFC defined printable
1546 ASCII a NonPrintableDefect is added to the token's defects list. Since
1547 quoted pairs are converted to their unquoted values, what is returned is
1548 a 'ptext' token. In this case it is a WhiteSpaceTerminal, so its value
1549 is ' '.
1550
1551 """
1552 ptext, value, _ = _get_ptext_to_endchars(value, '()')
1553 ptext = WhiteSpaceTerminal(ptext, 'ptext')
1554 _validate_xtext(ptext)
1555 return ptext, value
1556
1557 def get_qcontent(value):
1558 """qcontent = qtext / quoted-pair
1559
1560 We allow anything except the DQUOTE character, but if we find any ASCII
1561 other than the RFC defined printable ASCII a NonPrintableDefect is
1562 added to the token's defects list. Any quoted pairs are converted to their
1563 unquoted values, so what is returned is a 'ptext' token. In this case it
1564 is a ValueTerminal.
1565
1566 """
1567 ptext, value, _ = _get_ptext_to_endchars(value, '"')
1568 ptext = ValueTerminal(ptext, 'ptext')
1569 _validate_xtext(ptext)
1570 return ptext, value
1571
1572 def get_atext(value):
1573 """atext = <matches _atext_matcher>
1574
1575 We allow any non-ATOM_ENDS in atext, but add an InvalidATextDefect to
1576 the token's defects list if we find non-atext characters.
1577 """
1578 m = _non_atom_end_matcher(value)
1579 if not m:
1580 raise errors.HeaderParseError(
1581 "expected atext but found '{}'".format(value))
1582 atext = m.group()
1583 value = value[len(atext):]
1584 atext = ValueTerminal(atext, 'atext')
1585 _validate_xtext(atext)
1586 return atext, value
1587
1588 def get_bare_quoted_string(value):
1589 """bare-quoted-string = DQUOTE *([FWS] qcontent) [FWS] DQUOTE
1590
1591 A quoted-string without the leading or trailing white space. Its
1592 value is the text between the quote marks, with whitespace
1593 preserved and quoted pairs decoded.
1594 """
1595 if value[0] != '"':
1596 raise errors.HeaderParseError(
1597 "expected '\"' but found '{}'".format(value))
1598 bare_quoted_string = BareQuotedString()
1599 value = value[1:]
1600 while value and value[0] != '"':
1601 if value[0] in WSP:
1602 token, value = get_fws(value)
1603 else:
1604 token, value = get_qcontent(value)
1605 bare_quoted_string.append(token)
1606 if not value:
1607 bare_quoted_string.defects.append(errors.InvalidHeaderDefect(
1608 "end of header inside quoted string"))
1609 return bare_quoted_string, value
1610 return bare_quoted_string, value[1:]
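The behavior documented above, content between the quotes with quoted-pairs decoded, can be sketched as a flat scan (a hypothetical `parse_bare_quoted`; the real routine builds FWS and qcontent tokens and records a defect for a missing closing quote):

```python
def parse_bare_quoted(value):
    """Return (content, remainder) for a leading quoted string."""
    if not value.startswith('"'):
        raise ValueError("expected '\"' but found %r" % value)
    chars = []
    i = 1
    while i < len(value) and value[i] != '"':
        if value[i] == '\\' and i + 1 < len(value):
            i += 1  # quoted-pair: keep the escaped character
        chars.append(value[i])
        i += 1
    return ''.join(chars), value[i + 1:]
```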
1611
1612 def get_comment(value):
1613 """comment = "(" *([FWS] ccontent) [FWS] ")"
1614 ccontent = ctext / quoted-pair / comment
1615
1616 We handle nested comments here, and quoted-pair in our qp-ctext routine.
1617 """
1618 if value and value[0] != '(':
1619 raise errors.HeaderParseError(
1620 "expected '(' but found '{}'".format(value))
1621 comment = Comment()
1622 value = value[1:]
1623 while value and value[0] != ")":
1624 if value[0] in WSP:
1625 token, value = get_fws(value)
1626 elif value[0] == '(':
1627 token, value = get_comment(value)
1628 else:
1629 token, value = get_qp_ctext(value)
1630 comment.append(token)
1631 if not value:
1632 comment.defects.append(errors.InvalidHeaderDefect(
1633 "end of header inside comment"))
1634 return comment, value
1635 return comment, value[1:]
1636
1637 def get_cfws(value):
1638 """CFWS = (1*([FWS] comment) [FWS]) / FWS
1639
1640 """
1641 cfws = CFWSList()
1642 while value and value[0] in CFWS_LEADER:
1643 if value[0] in WSP:
1644 token, value = get_fws(value)
1645 else:
1646 token, value = get_comment(value)
1647 cfws.append(token)
1648 return cfws, value
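A compact way to visualize what CFWS consumption removes is a skip function (a hypothetical `skip_cfws` that handles nested comments and quoted-pairs but discards everything instead of building tokens, as the real routine does):

```python
def skip_cfws(value):
    """Skip leading folding white space and (nested) comments."""
    i = 0
    depth = 0  # comment nesting level
    while i < len(value):
        ch = value[i]
        if depth == 0 and ch in ' \t':
            i += 1
        elif ch == '(':
            depth += 1
            i += 1
        elif depth:
            if ch == ')':
                depth -= 1
            elif ch == '\\':
                i += 1  # quoted-pair inside a comment
            i += 1
        else:
            break
    return value[i:]
```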
1649
1650 def get_quoted_string(value):
1651 """quoted-string = [CFWS] <bare-quoted-string> [CFWS]
1652
1653 'bare-quoted-string' is an intermediate class defined by this
1654 parser and not by the RFC grammar. It is the quoted string
1655 without any attached CFWS.
1656 """
1657 quoted_string = QuotedString()
1658 if value and value[0] in CFWS_LEADER:
1659 token, value = get_cfws(value)
1660 quoted_string.append(token)
1661 token, value = get_bare_quoted_string(value)
1662 quoted_string.append(token)
1663 if value and value[0] in CFWS_LEADER:
1664 token, value = get_cfws(value)
1665 quoted_string.append(token)
1666 return quoted_string, value
1667
1668 def get_atom(value):
1669 """atom = [CFWS] 1*atext [CFWS]
1670
1671 """
1672 atom = Atom()
1673 if value and value[0] in CFWS_LEADER:
1674 token, value = get_cfws(value)
1675 atom.append(token)
1676 if value and value[0] in ATOM_ENDS:
1677 raise errors.HeaderParseError(
1678 "expected atom but found '{}'".format(value))
1679 token, value = get_atext(value)
1680 atom.append(token)
1681 if value and value[0] in CFWS_LEADER:
1682 token, value = get_cfws(value)
1683 atom.append(token)
1684 return atom, value
1685
1686 def get_dot_atom_text(value):
1687 """ dot-text = 1*atext *("." 1*atext)
1688
1689 """
1690 dot_atom_text = DotAtomText()
1691 if not value or value[0] in ATOM_ENDS:
1692 raise errors.HeaderParseError("expected atom at start of "
1693 "dot-atom-text but found '{}'".format(value))
1694 while value and value[0] not in ATOM_ENDS:
1695 token, value = get_atext(value)
1696 dot_atom_text.append(token)
1697 if value and value[0] == '.':
1698 dot_atom_text.append(DOT)
1699 value = value[1:]
1700 if dot_atom_text[-1] is DOT:
1701 raise errors.HeaderParseError("expected atom at end of dot-atom-text "
1702 "but found '{}'".format('.'+value))
1703 return dot_atom_text, value
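The shape this function accepts, runs of atext separated by single dots with no leading, trailing, or doubled dot, can also be expressed as one regular expression (a hypothetical validator built from the RFC 5322 atext alphabet, not from the parser's own ATOM_ENDS tables):

```python
import re

# atext per RFC 5322: ALPHA / DIGIT / most printable specials.
_ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]+"
_dot_atom_text_re = re.compile(r"%s(?:\.%s)*$" % (_ATEXT, _ATEXT))

def is_dot_atom_text(s):
    """True if s is a complete RFC 5322 dot-atom-text."""
    return bool(_dot_atom_text_re.match(s))
```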
1704
1705 def get_dot_atom(value):
1706 """ dot-atom = [CFWS] dot-atom-text [CFWS]
1707
1708 """
1709 dot_atom = DotAtom()
1710 if value[0] in CFWS_LEADER:
1711 token, value = get_cfws(value)
1712 dot_atom.append(token)
1713 token, value = get_dot_atom_text(value)
1714 dot_atom.append(token)
1715 if value and value[0] in CFWS_LEADER:
1716 token, value = get_cfws(value)
1717 dot_atom.append(token)
1718 return dot_atom, value
1719
1720 def get_word(value):
1721 """word = atom / quoted-string
1722
1723 Either atom or quoted-string may start with CFWS. We have to peel off this
1724 CFWS first to determine which type of word to parse. Afterward we splice
1725 the leading CFWS, if any, into the parsed sub-token.
1726
1727 If neither an atom nor a quoted-string is found before the next special, a
1728 HeaderParseError is raised.
1729
1730 The token returned is either an Atom or a QuotedString, as appropriate.
1731 This means the 'word' level of the formal grammar is not represented in the
1732 parse tree; this is because having that extra layer when manipulating the
1733 parse tree is more confusing than it is helpful.
1734
1735 """
1736 if value[0] in CFWS_LEADER:
1737 leader, value = get_cfws(value)
1738 else:
1739 leader = None
1740 if value[0]=='"':
1741 token, value = get_quoted_string(value)
1742 elif value[0] in SPECIALS:
1743 raise errors.HeaderParseError("Expected 'atom' or 'quoted-string' "
1744 "but found '{}'".format(value))
1745 else:
1746 token, value = get_atom(value)
1747 if leader is not None:
1748 token[:0] = [leader]
1749 return token, value
1750
1751 def get_phrase(value):
1752 """ phrase = 1*word / obs-phrase
1753 obs-phrase = word *(word / "." / CFWS)
1754
1755 This means a phrase can be a sequence of words, periods, and CFWS in any
1756 order as long as it starts with at least one word. If anything other than
1757 words is detected, an ObsoleteHeaderDefect is added to the token's defect
1758 list. We also accept a phrase that starts with CFWS followed by a dot;
1759 this is registered as an InvalidHeaderDefect, since it is not supported by
1760 even the obsolete grammar.
1761
1762 """
1763 phrase = Phrase()
1764 try:
1765 token, value = get_word(value)
1766 phrase.append(token)
1767 except errors.HeaderParseError:
1768 phrase.defects.append(errors.InvalidHeaderDefect(
1769 "phrase does not start with word"))
1770 while value and value[0] not in PHRASE_ENDS:
1771 if value[0]=='.':
1772 phrase.append(DOT)
1773 phrase.defects.append(errors.ObsoleteHeaderDefect(
1774 "period in 'phrase'"))
1775 value = value[1:]
1776 else:
1777 try:
1778 token, value = get_word(value)
1779 except errors.HeaderParseError:
1780 if value[0] in CFWS_LEADER:
1781 token, value = get_cfws(value)
1782 phrase.defects.append(errors.ObsoleteHeaderDefect(
1783 "comment found without atom"))
1784 else:
1785 raise
1786 phrase.append(token)
1787 return phrase, value
1788
1789 def get_local_part(value):
1790 """ local-part = dot-atom / quoted-string / obs-local-part
1791
1792 """
1793 local_part = LocalPart()
1794 leader = None
1795 if value[0] in CFWS_LEADER:
1796 leader, value = get_cfws(value)
1797 if not value:
1798 raise errors.HeaderParseError(
1799 "expected local-part but found '{}'".format(value))
1800 try:
1801 token, value = get_dot_atom(value)
1802 except errors.HeaderParseError:
1803 try:
1804 token, value = get_word(value)
1805 except errors.HeaderParseError:
1806 if value[0] != '\\' and value[0] in PHRASE_ENDS:
1807 raise
1808 token = TokenList()
1809 if leader is not None:
1810 token[:0] = [leader]
1811 local_part.append(token)
1812 if value and (value[0]=='\\' or value[0] not in PHRASE_ENDS):
1813 obs_local_part, value = get_obs_local_part(str(local_part) + value)
1814 if obs_local_part.token_type == 'invalid-obs-local-part':
1815 local_part.defects.append(errors.InvalidHeaderDefect(
1816 "local-part is not dot-atom, quoted-string, or obs-local-part"))
1817 else:
1818 local_part.defects.append(errors.ObsoleteHeaderDefect(
1819 "local-part is not a dot-atom (contains CFWS)"))
1820 local_part[0] = obs_local_part
1821 try:
1822 local_part.value.encode('ascii')
1823 except UnicodeEncodeError:
1824 local_part.defects.append(errors.NonASCIILocalPartDefect(
1825 "local-part contains non-ASCII characters"))
1826 return local_part, value
1827
1828 def get_obs_local_part(value):
1829 """ obs-local-part = word *("." word)
1830 """
1831 obs_local_part = ObsLocalPart()
1832 last_non_ws_was_dot = False
1833 while value and (value[0]=='\\' or value[0] not in PHRASE_ENDS):
1834 if value[0] == '.':
1835 if last_non_ws_was_dot:
1836 obs_local_part.defects.append(errors.InvalidHeaderDefect(
1837 "invalid repeated '.'"))
1838 obs_local_part.append(DOT)
1839 last_non_ws_was_dot = True
1840 value = value[1:]
1841 continue
1842 elif value[0]=='\\':
1843 obs_local_part.append(ValueTerminal(value[0],
1844 'misplaced-special'))
1845 value = value[1:]
1846 obs_local_part.defects.append(errors.InvalidHeaderDefect(
1847 "'\\' character outside of quoted-string/ccontent"))
1848 last_non_ws_was_dot = False
1849 continue
1850 if obs_local_part and obs_local_part[-1].token_type != 'dot':
1851 obs_local_part.defects.append(errors.InvalidHeaderDefect(
1852 "missing '.' between words"))
1853 try:
1854 token, value = get_word(value)
1855 last_non_ws_was_dot = False
1856 except errors.HeaderParseError:
1857 if value[0] not in CFWS_LEADER:
1858 raise
1859 token, value = get_cfws(value)
1860 obs_local_part.append(token)
1861 if (obs_local_part[0].token_type == 'dot' or
1862 obs_local_part[0].token_type=='cfws' and
1863 obs_local_part[1].token_type=='dot'):
1864 obs_local_part.defects.append(errors.InvalidHeaderDefect(
1865 "Invalid leading '.' in local part"))
1866 if (obs_local_part[-1].token_type == 'dot' or
1867 obs_local_part[-1].token_type=='cfws' and
1868 obs_local_part[-2].token_type=='dot'):
1869 obs_local_part.defects.append(errors.InvalidHeaderDefect(
1870 "Invalid trailing '.' in local part"))
1871 if obs_local_part.defects:
1872 obs_local_part.token_type = 'invalid-obs-local-part'
1873 return obs_local_part, value
1874
1875 def get_dtext(value):
1876 """ dtext = <printable ascii except \ [ ]> / obs-dtext
1877 obs-dtext = obs-NO-WS-CTL / quoted-pair
1878
1879 We allow anything except the excluded characters, but if we find any
1880 ASCII other than the RFC defined printable ASCII a NonPrintableDefect is
1881 added to the token's defects list. Quoted pairs are converted to their
1882 unquoted values, so what is returned is a ptext token, in this case a
1883 ValueTerminal. If there were quoted-printables, an ObsoleteHeaderDefect is
1884 added to the returned token's defect list.
1885
1886 """
1887 ptext, value, had_qp = _get_ptext_to_endchars(value, '[]')
1888 ptext = ValueTerminal(ptext, 'ptext')
1889 if had_qp:
1890 ptext.defects.append(errors.ObsoleteHeaderDefect(
1891 "quoted printable found in domain-literal"))
1892 _validate_xtext(ptext)
1893 return ptext, value
1894
1895 def _check_for_early_dl_end(value, domain_literal):
1896 if value:
1897 return False
1898 domain_literal.append(errors.InvalidHeaderDefect(
1899 "end of input inside domain-literal"))
1900 domain_literal.append(ValueTerminal(']', 'domain-literal-end'))
1901 return True
1902
1903 def get_domain_literal(value):
1904 """ domain-literal = [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]
1905
1906 """
1907 domain_literal = DomainLiteral()
1908 if value[0] in CFWS_LEADER:
1909 token, value = get_cfws(value)
1910 domain_literal.append(token)
1911 if not value:
1912 raise errors.HeaderParseError("expected domain-literal")
1913 if value[0] != '[':
1914 raise errors.HeaderParseError("expected '[' at start of domain-literal "
1915 "but found '{}'".format(value))
1916 value = value[1:]
1917 if _check_for_early_dl_end(value, domain_literal):
1918 return domain_literal, value
1919 domain_literal.append(ValueTerminal('[', 'domain-literal-start'))
1920 if value[0] in WSP:
1921 token, value = get_fws(value)
1922 domain_literal.append(token)
1923 token, value = get_dtext(value)
1924 domain_literal.append(token)
1925 if _check_for_early_dl_end(value, domain_literal):
1926 return domain_literal, value
1927 if value[0] in WSP:
1928 token, value = get_fws(value)
1929 domain_literal.append(token)
1930 if _check_for_early_dl_end(value, domain_literal):
1931 return domain_literal, value
1932 if value[0] != ']':
1933 raise errors.HeaderParseError("expected ']' at end of domain-literal "
1934 "but found '{}'".format(value))
1935 domain_literal.append(ValueTerminal(']', 'domain-literal-end'))
1936 value = value[1:]
1937 if value and value[0] in CFWS_LEADER:
1938 token, value = get_cfws(value)
1939 domain_literal.append(token)
1940 return domain_literal, value
1941
1942 def get_domain(value):
1943 """ domain = dot-atom / domain-literal / obs-domain
1944 obs-domain = atom *("." atom)
1945
1946 """
1947 domain = Domain()
1948 leader = None
1949 if value[0] in CFWS_LEADER:
1950 leader, value = get_cfws(value)
1951 if not value:
1952 raise errors.HeaderParseError(
1953 "expected domain but found '{}'".format(value))
1954 if value[0] == '[':
1955 token, value = get_domain_literal(value)
1956 if leader is not None:
1957 token[:0] = [leader]
1958 domain.append(token)
1959 return domain, value
1960 try:
1961 token, value = get_dot_atom(value)
1962 except errors.HeaderParseError:
1963 token, value = get_atom(value)
1964 if leader is not None:
1965 token[:0] = [leader]
1966 domain.append(token)
1967 if value and value[0] == '.':
1968 domain.defects.append(errors.ObsoleteHeaderDefect(
1969 "domain is not a dot-atom (contains CFWS)"))
1970 if domain[0].token_type == 'dot-atom':
1971 domain[:] = domain[0]
1972 while value and value[0] == '.':
1973 domain.append(DOT)
1974 token, value = get_atom(value[1:])
1975 domain.append(token)
1976 return domain, value
1977
1978 def get_addr_spec(value):
1979 """ addr-spec = local-part "@" domain
1980
1981 """
1982 addr_spec = AddrSpec()
1983 token, value = get_local_part(value)
1984 addr_spec.append(token)
1985 if not value or value[0] != '@':
1986 addr_spec.defects.append(errors.InvalidHeaderDefect(
1987 "addr-spec local part with no domain"))
1988 return addr_spec, value
1989 addr_spec.append(ValueTerminal('@', 'address-at-symbol'))
1990 token, value = get_domain(value[1:])
1991 addr_spec.append(token)
1992 return addr_spec, value
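The local-part/domain split this implements can be sketched for the common cases: find the '@' that separates the two, skipping over a quoted local part first (a hypothetical `split_addr_spec`; the real parser also tolerates CFWS and obsolete syntax, and records a defect instead of raising when the domain is missing):

```python
def split_addr_spec(addr):
    """Split 'local@domain', respecting a quoted local part."""
    start = 0
    if addr.startswith('"'):
        i = 1
        while i < len(addr) and addr[i] != '"':
            i += 2 if addr[i] == '\\' else 1  # skip quoted-pairs
        start = i
    at = addr.find('@', start)
    if at == -1:
        raise ValueError('addr-spec local part with no domain')
    return addr[:at], addr[at + 1:]
```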
1993
1994 def get_obs_route(value):
1995 """ obs-route = obs-domain-list ":"
1996 obs-domain-list = *(CFWS / ",") "@" domain *("," [CFWS] ["@" domain])
1997
1998 Returns an obs-route token with the appropriate sub-tokens (that is,
1999 there is no obs-domain-list in the parse tree).
2000 """
2001 obs_route = ObsRoute()
2002 while value and (value[0]==',' or value[0] in CFWS_LEADER):
2003 if value[0] in CFWS_LEADER:
2004 token, value = get_cfws(value)
2005 obs_route.append(token)
2006 elif value[0] == ',':
2007 obs_route.append(ListSeparator)
2008 value = value[1:]
2009 if not value or value[0] != '@':
2010 raise errors.HeaderParseError(
2011 "expected obs-route domain but found '{}'".format(value))
2012 obs_route.append(RouteComponentMarker)
2013 token, value = get_domain(value[1:])
2014 obs_route.append(token)
2015 while value and value[0]==',':
2016 obs_route.append(ListSeparator)
2017 value = value[1:]
2018 if not value:
2019 break
2020 if value[0] in CFWS_LEADER:
2021 token, value = get_cfws(value)
2022 obs_route.append(token)
2023 if value[0] == '@':
2024 obs_route.append(RouteComponentMarker)
2025 token, value = get_domain(value[1:])
2026 obs_route.append(token)
2027 if not value:
2028 raise errors.HeaderParseError("end of header while parsing obs-route")
2029 if value[0] != ':':
2030 raise errors.HeaderParseError("expected ':' marking end of "
2031 "obs-route but found '{}'".format(value))
2032 obs_route.append(ValueTerminal(':', 'end-of-obs-route-marker'))
2033 return obs_route, value[1:]
2034
2035 def get_angle_addr(value):
2036 """ angle-addr = [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
2037 obs-angle-addr = [CFWS] "<" obs-route addr-spec ">" [CFWS]
2038
2039 """
2040 angle_addr = AngleAddr()
2041 if value[0] in CFWS_LEADER:
2042 token, value = get_cfws(value)
2043 angle_addr.append(token)
2044 if not value or value[0] != '<':
2045 raise errors.HeaderParseError(
2046 "expected angle-addr but found '{}'".format(value))
2047 angle_addr.append(ValueTerminal('<', 'angle-addr-start'))
2048 value = value[1:]
2049 # Although it is not legal per RFC5322, SMTP uses '<>' in certain
2050 # circumstances.
2051 if value[0] == '>':
2052 angle_addr.append(ValueTerminal('>', 'angle-addr-end'))
2053 angle_addr.defects.append(errors.InvalidHeaderDefect(
2054 "null addr-spec in angle-addr"))
2055 value = value[1:]
2056 return angle_addr, value
2057 try:
2058 token, value = get_addr_spec(value)
2059 except errors.HeaderParseError:
2060 try:
2061 token, value = get_obs_route(value)
2062 angle_addr.defects.append(errors.ObsoleteHeaderDefect(
2063 "obsolete route specification in angle-addr"))
2064 except errors.HeaderParseError:
2065 raise errors.HeaderParseError(
2066 "expected addr-spec or obs-route but found '{}'".format(value))
2067 angle_addr.append(token)
2068 token, value = get_addr_spec(value)
2069 angle_addr.append(token)
2070 if value and value[0] == '>':
2071 value = value[1:]
2072 else:
2073 angle_addr.defects.append(errors.InvalidHeaderDefect(
2074 "missing trailing '>' on angle-addr"))
2075 angle_addr.append(ValueTerminal('>', 'angle-addr-end'))
2076 if value and value[0] in CFWS_LEADER:
2077 token, value = get_cfws(value)
2078 angle_addr.append(token)
2079 return angle_addr, value
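As a quick sanity check, the behavior above can be exercised against the stdlib twin of this module (this file is a backport of CPython's `email._header_value_parser`, so the import path below is an assumption about your environment):

```python
# Minimal sketch: get_angle_addr returns the parsed token plus the
# unconsumed remainder of the input string.
from email._header_value_parser import get_angle_addr

# A well-formed angle-addr; trailing CFWS after '>' is consumed too.
token, rest = get_angle_addr('<dinsdale@example.com> trailing')
print(token.token_type)   # angle-addr
print(token.addr_spec)    # dinsdale@example.com
print(rest)               # trailing

# The SMTP-style null addr-spec '<>' parses, but is flagged as a defect.
null_token, _ = get_angle_addr('<>')
print(len(null_token.defects))   # 1
```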
2080
2081 def get_display_name(value):
2082 """ display-name = phrase
2083
2084 Because this is simply a name-rule, we don't return a display-name
2085 token containing a phrase, but rather a display-name token with
2086 the content of the phrase.
2087
2088 """
2089 display_name = DisplayName()
2090 token, value = get_phrase(value)
2091 display_name.extend(token[:])
2092 display_name.defects = token.defects[:]
2093 return display_name, value
2094
2095
2096 def get_name_addr(value):
2097 """ name-addr = [display-name] angle-addr
2098
2099 """
2100 name_addr = NameAddr()
2101 # Both the optional display name and the angle-addr can start with cfws.
2102 leader = None
2103 if value[0] in CFWS_LEADER:
2104 leader, value = get_cfws(value)
2105 if not value:
2106 raise errors.HeaderParseError(
2107 "expected name-addr but found '{}'".format(leader))
2108 if value[0] != '<':
2109 if value[0] in PHRASE_ENDS:
2110 raise errors.HeaderParseError(
2111 "expected name-addr but found '{}'".format(value))
2112 token, value = get_display_name(value)
2113 if not value:
2114 raise errors.HeaderParseError(
2115 "expected name-addr but found '{}'".format(token))
2116 if leader is not None:
2117 token[0][:0] = [leader]
2118 leader = None
2119 name_addr.append(token)
2120 token, value = get_angle_addr(value)
2121 if leader is not None:
2122 token[:0] = [leader]
2123 name_addr.append(token)
2124 return name_addr, value
2125
2126 def get_mailbox(value):
2127 """ mailbox = name-addr / addr-spec
2128
2129 """
2130 # The only way to figure out if we are dealing with a name-addr or an
2131 # addr-spec is to try parsing each one.
2132 mailbox = Mailbox()
2133 try:
2134 token, value = get_name_addr(value)
2135 except errors.HeaderParseError:
2136 try:
2137 token, value = get_addr_spec(value)
2138 except errors.HeaderParseError:
2139 raise errors.HeaderParseError(
2140 "expected mailbox but found '{}'".format(value))
2141 if any(isinstance(x, errors.InvalidHeaderDefect)
2142 for x in token.all_defects):
2143 mailbox.token_type = 'invalid-mailbox'
2144 mailbox.append(token)
2145 return mailbox, value
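Both branches of the try/except above can be seen in action via the stdlib copy of this module (an assumption about your environment; this file backports it):

```python
# Sketch: get_mailbox tries name-addr first, then falls back to addr-spec.
from email._header_value_parser import get_mailbox

# name-addr form: display name plus angle-addr.
mbox, rest = get_mailbox('Fred Bloggs <fred@example.com>')
print(mbox.token_type)     # mailbox
print(mbox.display_name)   # Fred Bloggs
print(mbox.addr_spec)      # fred@example.com

# Bare addr-spec form works too, via the fallback parse.
mbox2, _ = get_mailbox('fred@example.com')
print(mbox2.addr_spec)     # fred@example.com
```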
2146
2147 def get_invalid_mailbox(value, endchars):
2148 """ Read everything up to one of the chars in endchars.
2149
2150 This is outside the formal grammar. The InvalidMailbox TokenList that is
2151 returned acts like a Mailbox, but the data attributes are None.
2152
2153 """
2154 invalid_mailbox = InvalidMailbox()
2155 while value and value[0] not in endchars:
2156 if value[0] in PHRASE_ENDS:
2157 invalid_mailbox.append(ValueTerminal(value[0],
2158 'misplaced-special'))
2159 value = value[1:]
2160 else:
2161 token, value = get_phrase(value)
2162 invalid_mailbox.append(token)
2163 return invalid_mailbox, value
2164
2165 def get_mailbox_list(value):
2166 """ mailbox-list = (mailbox *("," mailbox)) / obs-mbox-list
2167 obs-mbox-list = *([CFWS] ",") mailbox *("," [mailbox / CFWS])
2168
2169 For this routine we go outside the formal grammar in order to improve error
2170 handling. We recognize the end of the mailbox list only at the end of the
2171 value or at a ';' (the group terminator). This is so that we can turn
2172 invalid mailboxes into InvalidMailbox tokens and continue parsing any
2173 remaining valid mailboxes. We also allow all mailbox entries to be null,
2174 and this condition is handled appropriately at a higher level.
2175
2176 """
2177 mailbox_list = MailboxList()
2178 while value and value[0] != ';':
2179 try:
2180 token, value = get_mailbox(value)
2181 mailbox_list.append(token)
2182 except errors.HeaderParseError:
2183 leader = None
2184 if value[0] in CFWS_LEADER:
2185 leader, value = get_cfws(value)
2186 if not value or value[0] in ',;':
2187 mailbox_list.append(leader)
2188 mailbox_list.defects.append(errors.ObsoleteHeaderDefect(
2189 "empty element in mailbox-list"))
2190 else:
2191 token, value = get_invalid_mailbox(value, ',;')
2192 if leader is not None:
2193 token[:0] = [leader]
2194 mailbox_list.append(token)
2195 mailbox_list.defects.append(errors.InvalidHeaderDefect(
2196 "invalid mailbox in mailbox-list"))
2197 elif value[0] == ',':
2198 mailbox_list.defects.append(errors.ObsoleteHeaderDefect(
2199 "empty element in mailbox-list"))
2200 else:
2201 token, value = get_invalid_mailbox(value, ',;')
2202 if leader is not None:
2203 token[:0] = [leader]
2204 mailbox_list.append(token)
2205 mailbox_list.defects.append(errors.InvalidHeaderDefect(
2206 "invalid mailbox in mailbox-list"))
2207 if value and value[0] not in ',;':
2208 # Crap after mailbox; treat it as an invalid mailbox.
2209 # The mailbox info will still be available.
2210 mailbox = mailbox_list[-1]
2211 mailbox.token_type = 'invalid-mailbox'
2212 token, value = get_invalid_mailbox(value, ',;')
2213 mailbox.extend(token)
2214 mailbox_list.defects.append(errors.InvalidHeaderDefect(
2215 "invalid mailbox in mailbox-list"))
2216 if value and value[0] == ',':
2217 mailbox_list.append(ListSeparator)
2218 value = value[1:]
2219 return mailbox_list, value
2220
2221
2222 def get_group_list(value):
2223 """ group-list = mailbox-list / CFWS / obs-group-list
2224 obs-group-list = 1*([CFWS] ",") [CFWS]
2225
2226 """
2227 group_list = GroupList()
2228 if not value:
2229 group_list.defects.append(errors.InvalidHeaderDefect(
2230 "end of header before group-list"))
2231 return group_list, value
2232 leader = None
2233 if value and value[0] in CFWS_LEADER:
2234 leader, value = get_cfws(value)
2235 if not value:
2236 # This should never happen in email parsing, since CFWS-only is a
2237 # legal alternative to group-list in a group, which is the only
2238 # place group-list appears.
2239 group_list.defects.append(errors.InvalidHeaderDefect(
2240 "end of header in group-list"))
2241 group_list.append(leader)
2242 return group_list, value
2243 if value[0] == ';':
2244 group_list.append(leader)
2245 return group_list, value
2246 token, value = get_mailbox_list(value)
2247 if len(token.all_mailboxes)==0:
2248 if leader is not None:
2249 group_list.append(leader)
2250 group_list.extend(token)
2251 group_list.defects.append(errors.ObsoleteHeaderDefect(
2252 "group-list with empty entries"))
2253 return group_list, value
2254 if leader is not None:
2255 token[:0] = [leader]
2256 group_list.append(token)
2257 return group_list, value
2258
2259 def get_group(value):
2260 """ group = display-name ":" [group-list] ";" [CFWS]
2261
2262 """
2263 group = Group()
2264 token, value = get_display_name(value)
2265 if not value or value[0] != ':':
2266 raise errors.HeaderParseError("expected ':' at end of group "
2267 "display name but found '{}'".format(value))
2268 group.append(token)
2269 group.append(ValueTerminal(':', 'group-display-name-terminator'))
2270 value = value[1:]
2271 if value and value[0] == ';':
2272 group.append(ValueTerminal(';', 'group-terminator'))
2273 return group, value[1:]
2274 token, value = get_group_list(value)
2275 group.append(token)
2276 if not value:
2277 group.defects.append(errors.InvalidHeaderDefect(
2278 "end of header in group"))
2279 elif value[0] != ';':
2280 raise errors.HeaderParseError(
2281 "expected ';' at end of group but found {}".format(value))
2282 group.append(ValueTerminal(';', 'group-terminator'))
2283 value = value[1:]
2284 if value and value[0] in CFWS_LEADER:
2285 token, value = get_cfws(value)
2286 group.append(token)
2287 return group, value
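A sketch of group parsing, run against the stdlib `email._header_value_parser` (which this file backports; the import path is an assumption about your environment):

```python
from email._header_value_parser import get_group

# An RFC 5322 group: display-name ":" mailbox-list ";"
group, rest = get_group('Friends:fred@example.com,joe@example.com;')
print(group.token_type)        # group
print(group.display_name)      # Friends
print(len(group.mailboxes))    # 2

# An empty group (the common 'undisclosed recipients' idiom) is also legal.
empty, _ = get_group('undisclosed-recipients:;')
print(len(empty.mailboxes))    # 0
```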
2288
2289 def get_address(value):
2290 """ address = mailbox / group
2291
2292 Note that counter-intuitively, an address can be either a single address or
2293 a list of addresses (a group). This is why the returned Address object has
2294 a 'mailboxes' attribute which treats a single address as a list of length
2295 one. When you need to differentiate between the two cases, extract the single
2296 element, which is either a mailbox or a group token.
2297
2298 """
2299 # The formal grammar isn't very helpful when parsing an address. mailbox
2300 # and group, especially when allowing for obsolete forms, start off very
2301 # similarly. It is only when you reach one of @, <, or : that you know
2302 # what you've got. So, we try each one in turn, starting with the more
2303 # likely of the two. We could perhaps make this more efficient by looking
2304 # for a phrase and then branching based on the next character, but that
2305 # would be a premature optimization.
2306 address = Address()
2307 try:
2308 token, value = get_group(value)
2309 except errors.HeaderParseError:
2310 try:
2311 token, value = get_mailbox(value)
2312 except errors.HeaderParseError:
2313 raise errors.HeaderParseError(
2314 "expected address but found '{}'".format(value))
2315 address.append(token)
2316 return address, value
2317
2318 def get_address_list(value):
2319 """ address_list = (address *("," address)) / obs-addr-list
2320 obs-addr-list = *([CFWS] ",") address *("," [address / CFWS])
2321
2322 We depart from the formal grammar here by continuing to parse until the end
2323 of the input, assuming the input to be entirely composed of an
2324 address-list. This is always true in email parsing, and allows us
2325 to skip invalid addresses to parse additional valid ones.
2326
2327 """
2328 address_list = AddressList()
2329 while value:
2330 try:
2331 token, value = get_address(value)
2332 address_list.append(token)
2333 except errors.HeaderParseError as err:
2334 leader = None
2335 if value[0] in CFWS_LEADER:
2336 leader, value = get_cfws(value)
2337 if not value or value[0] == ',':
2338 address_list.append(leader)
2339 address_list.defects.append(errors.ObsoleteHeaderDefect(
2340 "address-list entry with no content"))
2341 else:
2342 token, value = get_invalid_mailbox(value, ',')
2343 if leader is not None:
2344 token[:0] = [leader]
2345 address_list.append(Address([token]))
2346 address_list.defects.append(errors.InvalidHeaderDefect(
2347 "invalid address in address-list"))
2348 elif value[0] == ',':
2349 address_list.defects.append(errors.ObsoleteHeaderDefect(
2350 "empty element in address-list"))
2351 else:
2352 token, value = get_invalid_mailbox(value, ',')
2353 if leader is not None:
2354 token[:0] = [leader]
2355 address_list.append(Address([token]))
2356 address_list.defects.append(errors.InvalidHeaderDefect(
2357 "invalid address in address-list"))
2358 if value and value[0] != ',':
2359 # Crap after address; treat it as an invalid mailbox.
2360 # The mailbox info will still be available.
2361 mailbox = address_list[-1][0]
2362 mailbox.token_type = 'invalid-mailbox'
2363 token, value = get_invalid_mailbox(value, ',')
2364 mailbox.extend(token)
2365 address_list.defects.append(errors.InvalidHeaderDefect(
2366 "invalid address in address-list"))
2367 if value: # Must be a , at this point.
2368 address_list.append(ValueTerminal(',', 'list-separator'))
2369 value = value[1:]
2370 return address_list, value
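The whole-value behavior described in the docstring can be sketched via the stdlib copy of this module (an environment assumption; this file is its backport):

```python
from email._header_value_parser import get_address_list

# The entire input is consumed as an address-list.
addrs, rest = get_address_list('Fred <fred@example.com>, joe@example.com')
print(len(addrs.addresses))             # 2
print(addrs.mailboxes[0].display_name)  # Fred
print(addrs.mailboxes[1].addr_spec)     # joe@example.com
print(repr(rest))                       # '' (nothing left over)
```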
2371
2372 #
2373 # XXX: As I begin to add additional header parsers, I'm realizing we probably
2374 # have two level of parser routines: the get_XXX methods that get a token in
2375 # the grammar, and parse_XXX methods that parse an entire field value. So
2376 # get_address_list above should really be a parse_ method, as probably should
2377 # be get_unstructured.
2378 #
2379
2380 def parse_mime_version(value):
2381 """ mime-version = [CFWS] 1*digit [CFWS] "." [CFWS] 1*digit [CFWS]
2382
2383 """
2384 # The [CFWS] is implicit in the RFC 2045 BNF.
2385 # XXX: This routine is a bit verbose, should factor out a get_int method.
2386 mime_version = MIMEVersion()
2387 if not value:
2388 mime_version.defects.append(errors.HeaderMissingRequiredValue(
2389 "Missing MIME version number (eg: 1.0)"))
2390 return mime_version
2391 if value[0] in CFWS_LEADER:
2392 token, value = get_cfws(value)
2393 mime_version.append(token)
2394 if not value:
2395 mime_version.defects.append(errors.HeaderMissingRequiredValue(
2396 "Expected MIME version number but found only CFWS"))
2397 digits = ''
2398 while value and value[0] != '.' and value[0] not in CFWS_LEADER:
2399 digits += value[0]
2400 value = value[1:]
2401 if not digits.isdigit():
2402 mime_version.defects.append(errors.InvalidHeaderDefect(
2403 "Expected MIME major version number but found {!r}".format(digits)))
2404 mime_version.append(ValueTerminal(digits, 'xtext'))
2405 else:
2406 mime_version.major = int(digits)
2407 mime_version.append(ValueTerminal(digits, 'digits'))
2408 if value and value[0] in CFWS_LEADER:
2409 token, value = get_cfws(value)
2410 mime_version.append(token)
2411 if not value or value[0] != '.':
2412 if mime_version.major is not None:
2413 mime_version.defects.append(errors.InvalidHeaderDefect(
2414 "Incomplete MIME version; found only major number"))
2415 if value:
2416 mime_version.append(ValueTerminal(value, 'xtext'))
2417 return mime_version
2418 mime_version.append(ValueTerminal('.', 'version-separator'))
2419 value = value[1:]
2420 if value and value[0] in CFWS_LEADER:
2421 token, value = get_cfws(value)
2422 mime_version.append(token)
2423 if not value:
2424 if mime_version.major is not None:
2425 mime_version.defects.append(errors.InvalidHeaderDefect(
2426 "Incomplete MIME version; found only major number"))
2427 return mime_version
2428 digits = ''
2429 while value and value[0] not in CFWS_LEADER:
2430 digits += value[0]
2431 value = value[1:]
2432 if not digits.isdigit():
2433 mime_version.defects.append(errors.InvalidHeaderDefect(
2434 "Expected MIME minor version number but found {!r}".format(digits)))
2435 mime_version.append(ValueTerminal(digits, 'xtext'))
2436 else:
2437 mime_version.minor = int(digits)
2438 mime_version.append(ValueTerminal(digits, 'digits'))
2439 if value and value[0] in CFWS_LEADER:
2440 token, value = get_cfws(value)
2441 mime_version.append(token)
2442 if value:
2443 mime_version.defects.append(errors.InvalidHeaderDefect(
2444 "Excess non-CFWS text after MIME version"))
2445 mime_version.append(ValueTerminal(value, 'xtext'))
2446 return mime_version
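A short sketch of the defect-vs-value behavior above, using the stdlib `email._header_value_parser` (which this file backports):

```python
from email._header_value_parser import parse_mime_version

mv = parse_mime_version('1.0')
print(mv.major, mv.minor)    # 1 0
print(len(mv.defects))       # 0

# A missing minor number is recorded as a defect rather than an exception.
bad = parse_mime_version('1')
print(bad.major)             # 1
print(len(bad.defects))      # 1
```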
2447
2448 def get_invalid_parameter(value):
2449 """ Read everything up to the next ';'.
2450
2451 This is outside the formal grammar. The InvalidParameter TokenList that is
2452 returned acts like a Parameter, but the data attributes are None.
2453
2454 """
2455 invalid_parameter = InvalidParameter()
2456 while value and value[0] != ';':
2457 if value[0] in PHRASE_ENDS:
2458 invalid_parameter.append(ValueTerminal(value[0],
2459 'misplaced-special'))
2460 value = value[1:]
2461 else:
2462 token, value = get_phrase(value)
2463 invalid_parameter.append(token)
2464 return invalid_parameter, value
2465
2466 def get_ttext(value):
2467 """ttext = <matches _ttext_matcher>
2468
2469 We allow any non-TOKEN_ENDS in ttext, but add defects to the token's
2470 defects list if we find non-ttext characters. We also register defects for
2471 *any* non-printables even though the RFC doesn't exclude all of them,
2472 because we follow the spirit of RFC 5322.
2473
2474 """
2475 m = _non_token_end_matcher(value)
2476 if not m:
2477 raise errors.HeaderParseError(
2478 "expected ttext but found '{}'".format(value))
2479 ttext = m.group()
2480 value = value[len(ttext):]
2481 ttext = ValueTerminal(ttext, 'ttext')
2482 _validate_xtext(ttext)
2483 return ttext, value
2484
2485 def get_token(value):
2486 """token = [CFWS] 1*ttext [CFWS]
2487
2488 The RFC equivalent of ttext is any US-ASCII chars except space, ctls, or
2489 tspecials. We also exclude tabs even though the RFC doesn't.
2490
2491 The RFC implies the CFWS but is not explicit about it in the BNF.
2492
2493 """
2494 mtoken = Token()
2495 if value and value[0] in CFWS_LEADER:
2496 token, value = get_cfws(value)
2497 mtoken.append(token)
2498 if value and value[0] in TOKEN_ENDS:
2499 raise errors.HeaderParseError(
2500 "expected token but found '{}'".format(value))
2501 token, value = get_ttext(value)
2502 mtoken.append(token)
2503 if value and value[0] in CFWS_LEADER:
2504 token, value = get_cfws(value)
2505 mtoken.append(token)
2506 return mtoken, value
2507
2508 def get_attrtext(value):
2509 """attrtext = 1*(any non-ATTRIBUTE_ENDS character)
2510
2511 We allow any non-ATTRIBUTE_ENDS in attrtext, but add defects to the
2512 token's defects list if we find non-attrtext characters. We also register
2513 defects for *any* non-printables even though the RFC doesn't exclude all of
2514 them, because we follow the spirit of RFC 5322.
2515
2516 """
2517 m = _non_attribute_end_matcher(value)
2518 if not m:
2519 raise errors.HeaderParseError(
2520 "expected attrtext but found {!r}".format(value))
2521 attrtext = m.group()
2522 value = value[len(attrtext):]
2523 attrtext = ValueTerminal(attrtext, 'attrtext')
2524 _validate_xtext(attrtext)
2525 return attrtext, value
2526
2527 def get_attribute(value):
2528 """ [CFWS] 1*attrtext [CFWS]
2529
2530 This version of the BNF makes the CFWS explicit, and as usual we use a
2531 value terminal for the actual run of characters. The RFC equivalent of
2532 attrtext is the token characters, with the subtraction of '*', "'", and '%'.
2533 We include tab in the excluded set just as we do for token.
2534
2535 """
2536 attribute = Attribute()
2537 if value and value[0] in CFWS_LEADER:
2538 token, value = get_cfws(value)
2539 attribute.append(token)
2540 if value and value[0] in ATTRIBUTE_ENDS:
2541 raise errors.HeaderParseError(
2542 "expected token but found '{}'".format(value))
2543 token, value = get_attrtext(value)
2544 attribute.append(token)
2545 if value and value[0] in CFWS_LEADER:
2546 token, value = get_cfws(value)
2547 attribute.append(token)
2548 return attribute, value
2549
2550 def get_extended_attrtext(value):
2551 """attrtext = 1*(any non-ATTRIBUTE_ENDS character plus '%')
2552
2553 This is a special parsing routine so that we get a value that
2554 includes % escapes as a single string (which we decode as a single
2555 string later).
2556
2557 """
2558 m = _non_extended_attribute_end_matcher(value)
2559 if not m:
2560 raise errors.HeaderParseError(
2561 "expected extended attrtext but found {!r}".format(value))
2562 attrtext = m.group()
2563 value = value[len(attrtext):]
2564 attrtext = ValueTerminal(attrtext, 'extended-attrtext')
2565 _validate_xtext(attrtext)
2566 return attrtext, value
2567
2568 def get_extended_attribute(value):
2569 """ [CFWS] 1*extended_attrtext [CFWS]
2570
2571 This is like the non-extended version except we allow % characters, so that
2572 we can pick up an encoded value as a single string.
2573
2574 """
2575 # XXX: should we have an ExtendedAttribute TokenList?
2576 attribute = Attribute()
2577 if value and value[0] in CFWS_LEADER:
2578 token, value = get_cfws(value)
2579 attribute.append(token)
2580 if value and value[0] in EXTENDED_ATTRIBUTE_ENDS:
2581 raise errors.HeaderParseError(
2582 "expected token but found '{}'".format(value))
2583 token, value = get_extended_attrtext(value)
2584 attribute.append(token)
2585 if value and value[0] in CFWS_LEADER:
2586 token, value = get_cfws(value)
2587 attribute.append(token)
2588 return attribute, value
2589
2590 def get_section(value):
2591 """ '*' digits
2592
2593 The formal BNF is more complicated because leading 0s are not allowed. We
2594 check for that and add a defect. We also assume no CFWS is allowed between
2595 the '*' and the digits, though the RFC is not crystal clear on that.
2596 The caller should already have dealt with leading CFWS.
2597
2598 """
2599 section = Section()
2600 if not value or value[0] != '*':
2601 raise errors.HeaderParseError("Expected section but found {}".format(
2602 value))
2603 section.append(ValueTerminal('*', 'section-marker'))
2604 value = value[1:]
2605 if not value or not value[0].isdigit():
2606 raise errors.HeaderParseError("Expected section number but "
2607 "found {}".format(value))
2608 digits = ''
2609 while value and value[0].isdigit():
2610 digits += value[0]
2611 value = value[1:]
2612 if digits[0] == '0' and digits != '0':
2613 section.defects.append(errors.InvalidHeaderDefect("section number "
2614 "has an invalid leading 0"))
2615 section.number = int(digits)
2616 section.append(ValueTerminal(digits, 'digits'))
2617 return section, value
2618
2619
2620 def get_value(value):
2621 """ quoted-string / attribute
2622
2623 """
2624 v = Value()
2625 if not value:
2626 raise errors.HeaderParseError("Expected value but found end of string")
2627 leader = None
2628 if value[0] in CFWS_LEADER:
2629 leader, value = get_cfws(value)
2630 if not value:
2631 raise errors.HeaderParseError("Expected value but found "
2632 "only {}".format(leader))
2633 if value[0] == '"':
2634 token, value = get_quoted_string(value)
2635 else:
2636 token, value = get_extended_attribute(value)
2637 if leader is not None:
2638 token[:0] = [leader]
2639 v.append(token)
2640 return v, value
2641
2642 def get_parameter(value):
2643 """ attribute [section] ["*"] [CFWS] "=" value
2644
2645 The CFWS is implied by the RFC but not made explicit in the BNF. This
2646 simplified form of the BNF from the RFC is made to conform with the RFC BNF
2647 through some extra checks. We do it this way because it makes both error
2648 recovery and working with the resulting parse tree easier.
2649 """
2650 # It is possible CFWS would also be implicitly allowed between the section
2651 # and the 'extended-attribute' marker (the '*') , but we've never seen that
2652 # in the wild and we will therefore ignore the possibility.
2653 param = Parameter()
2654 token, value = get_attribute(value)
2655 param.append(token)
2656 if not value or value[0] == ';':
2657 param.defects.append(errors.InvalidHeaderDefect("Parameter contains "
2658 "name ({}) but no value".format(token)))
2659 return param, value
2660 if value[0] == '*':
2661 try:
2662 token, value = get_section(value)
2663 param.sectioned = True
2664 param.append(token)
2665 except errors.HeaderParseError:
2666 pass
2667 if not value:
2668 raise errors.HeaderParseError("Incomplete parameter")
2669 if value[0] == '*':
2670 param.append(ValueTerminal('*', 'extended-parameter-marker'))
2671 value = value[1:]
2672 param.extended = True
2673 if value[0] != '=':
2674 raise errors.HeaderParseError("Parameter not followed by '='")
2675 param.append(ValueTerminal('=', 'parameter-separator'))
2676 value = value[1:]
2677 leader = None
2678 if value and value[0] in CFWS_LEADER:
2679 token, value = get_cfws(value)
2680 param.append(token)
2681 remainder = None
2682 appendto = param
2683 if param.extended and value and value[0] == '"':
2684 # Now for some serious hackery to handle the common invalid case of
2685 # double quotes around an extended value. We also accept (with defect)
2686 # a value marked as encoded that isn't really.
2687 qstring, remainder = get_quoted_string(value)
2688 inner_value = qstring.stripped_value
2689 semi_valid = False
2690 if param.section_number == 0:
2691 if inner_value and inner_value[0] == "'":
2692 semi_valid = True
2693 else:
2694 token, rest = get_attrtext(inner_value)
2695 if rest and rest[0] == "'":
2696 semi_valid = True
2697 else:
2698 try:
2699 token, rest = get_extended_attrtext(inner_value)
2700 except errors.HeaderParseError:
2701 pass
2702 else:
2703 if not rest:
2704 semi_valid = True
2705 if semi_valid:
2706 param.defects.append(errors.InvalidHeaderDefect(
2707 "Quoted string value for extended parameter is invalid"))
2708 param.append(qstring)
2709 for t in qstring:
2710 if t.token_type == 'bare-quoted-string':
2711 t[:] = []
2712 appendto = t
2713 break
2714 value = inner_value
2715 else:
2716 remainder = None
2717 param.defects.append(errors.InvalidHeaderDefect(
2718 "Parameter marked as extended but appears to have a "
2719 "quoted string value that is non-encoded"))
2720 if value and value[0] == "'":
2721 token = None
2722 else:
2723 token, value = get_value(value)
2724 if not param.extended or param.section_number > 0:
2725 if not value or value[0] != "'":
2726 appendto.append(token)
2727 if remainder is not None:
2728 assert not value, value
2729 value = remainder
2730 return param, value
2731 param.defects.append(errors.InvalidHeaderDefect(
2732 "Apparent initial-extended-value but attribute "
2733 "was not marked as extended or was not initial section"))
2734 if not value:
2735 # Assume the charset/lang is missing and the token is the value.
2736 param.defects.append(errors.InvalidHeaderDefect(
2737 "Missing required charset/lang delimiters"))
2738 appendto.append(token)
2739 if remainder is None:
2740 return param, value
2741 else:
2742 if token is not None:
2743 for t in token:
2744 if t.token_type == 'extended-attrtext':
2745 break
2746 t.token_type = 'attrtext'
2747 appendto.append(t)
2748 param.charset = t.value
2749 if value[0] != "'":
2750 raise errors.HeaderParseError("Expected RFC2231 char/lang encoding "
2751 "delimiter, but found {!r}".format(value))
2752 appendto.append(ValueTerminal("'", 'RFC2231 delimiter'))
2753 value = value[1:]
2754 if value and value[0] != "'":
2755 token, value = get_attrtext(value)
2756 appendto.append(token)
2757 param.lang = token.value
2758 if not value or value[0] != "'":
2759 raise errors.HeaderParseError("Expected RFC2231 char/lang encoding "
2760 "delimiter, but found {}".format(value))
2761 appendto.append(ValueTerminal("'", 'RFC2231 delimiter'))
2762 value = value[1:]
2763 if remainder is not None:
2764 # Treat the rest of value as bare quoted string content.
2765 v = Value()
2766 while value:
2767 if value[0] in WSP:
2768 token, value = get_fws(value)
2769 else:
2770 token, value = get_qcontent(value)
2771 v.append(token)
2772 token = v
2773 else:
2774 token, value = get_value(value)
2775 appendto.append(token)
2776 if remainder is not None:
2777 assert not value, value
2778 value = remainder
2779 return param, value
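The RFC 2231 charset/lang handling above is easiest to see end-to-end through parse_mime_parameters, here run against the stdlib twin of this module (an environment assumption). The input is the classic extended-parameter example from RFC 2231 itself:

```python
from email._header_value_parser import parse_mime_parameters

# charset and language are split off by the two "'" delimiters,
# and the %XX runs are decoded using the declared charset.
params = parse_mime_parameters(
    " title*=us-ascii'en'This%20is%20%2A%2A%2Afun%2A%2A%2A")
print(dict(params.params))   # {'title': 'This is ***fun***'}
```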
2780
2781 def parse_mime_parameters(value):
2782 """ parameter *( ";" parameter )
2783
2784 That BNF is meant to indicate this routine should only be called after
2785 finding and handling the leading ';'. There is no corresponding rule in
2786 the formal RFC grammar, but it is more convenient for us for the set of
2787 parameters to be treated as its own TokenList.
2788
2789 This is a 'parse' routine because it consumes the remaining value, but it
2790 would never be called to parse a full header. Instead it is called to
2791 parse everything after the non-parameter value of a specific MIME header.
2792
2793 """
2794 mime_parameters = MimeParameters()
2795 while value:
2796 try:
2797 token, value = get_parameter(value)
2798 mime_parameters.append(token)
2799 except errors.HeaderParseError as err:
2800 leader = None
2801 if value[0] in CFWS_LEADER:
2802 leader, value = get_cfws(value)
2803 if not value:
2804 mime_parameters.append(leader)
2805 return mime_parameters
2806 if value[0] == ';':
2807 if leader is not None:
2808 mime_parameters.append(leader)
2809 mime_parameters.defects.append(errors.InvalidHeaderDefect(
2810 "parameter entry with no content"))
2811 else:
2812 token, value = get_invalid_parameter(value)
2813 if leader:
2814 token[:0] = [leader]
2815 mime_parameters.append(token)
2816 mime_parameters.defects.append(errors.InvalidHeaderDefect(
2817 "invalid parameter {!r}".format(token)))
2818 if value and value[0] != ';':
2819 # Junk after the otherwise valid parameter. Mark it as
2820 # invalid, but it will have a value.
2821 param = mime_parameters[-1]
2822 param.token_type = 'invalid-parameter'
2823 token, value = get_invalid_parameter(value)
2824 param.extend(token)
2825 mime_parameters.defects.append(errors.InvalidHeaderDefect(
2826 "parameter with invalid trailing text {!r}".format(token)))
2827 if value:
2828 # Must be a ';' at this point.
2829 mime_parameters.append(ValueTerminal(';', 'parameter-separator'))
2830 value = value[1:]
2831 return mime_parameters
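For the simple (non-extended) case, a sketch using the stdlib copy of this module (an environment assumption; note the input starts after the ';' that introduced the parameter section, as the docstring says):

```python
from email._header_value_parser import parse_mime_parameters

params = parse_mime_parameters(' charset="utf-8"; format=flowed')
print(dict(params.params))   # {'charset': 'utf-8', 'format': 'flowed'}
```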
2832
2833 def _find_mime_parameters(tokenlist, value):
2834 """Do our best to find the parameters in an invalid MIME header
2835
2836 """
2837 while value and value[0] != ';':
2838 if value[0] in PHRASE_ENDS:
2839 tokenlist.append(ValueTerminal(value[0], 'misplaced-special'))
2840 value = value[1:]
2841 else:
2842 token, value = get_phrase(value)
2843 tokenlist.append(token)
2844 if not value:
2845 return
2846 tokenlist.append(ValueTerminal(';', 'parameter-separator'))
2847 tokenlist.append(parse_mime_parameters(value[1:]))
2848
2849 def parse_content_type_header(value):
2850 """ maintype "/" subtype *( ";" parameter )
2851
2852 The maintype and subtype are tokens. Theoretically they could
2853 be checked against the official IANA list + x-token, but we
2854 don't do that.
2855 """
2856 ctype = ContentType()
2857 recover = False
2858 if not value:
2859 ctype.defects.append(errors.HeaderMissingRequiredValue(
2860 "Missing content type specification"))
2861 return ctype
2862 try:
2863 token, value = get_token(value)
2864 except errors.HeaderParseError:
2865 ctype.defects.append(errors.InvalidHeaderDefect(
2866 "Expected content maintype but found {!r}".format(value)))
2867 _find_mime_parameters(ctype, value)
2868 return ctype
2869 ctype.append(token)
2870 # XXX: If we really want to follow the formal grammar we should make
2871 # maintype and subtype specialized TokenLists here. Probably not worth it.
2872 if not value or value[0] != '/':
2873 ctype.defects.append(errors.InvalidHeaderDefect(
2874 "Invalid content type"))
2875 if value:
2876 _find_mime_parameters(ctype, value)
2877 return ctype
2878 ctype.maintype = token.value.strip().lower()
2879 ctype.append(ValueTerminal('/', 'content-type-separator'))
2880 value = value[1:]
2881 try:
2882 token, value = get_token(value)
2883 except errors.HeaderParseError:
2884 ctype.defects.append(errors.InvalidHeaderDefect(
2885 "Expected content subtype but found {!r}".format(value)))
2886 _find_mime_parameters(ctype, value)
2887 return ctype
2888 ctype.append(token)
2889 ctype.subtype = token.value.strip().lower()
2890 if not value:
2891 return ctype
2892 if value[0] != ';':
2893 ctype.defects.append(errors.InvalidHeaderDefect(
2894 "Only parameters are valid after content type, but "
2895 "found {!r}".format(value)))
2896 # The RFC requires that a syntactically invalid content-type be treated
2897 # as text/plain. Perhaps we should postel this, but we should probably
2898 # only do that if we were checking the subtype value against IANA.
2899 del ctype.maintype, ctype.subtype
2900 _find_mime_parameters(ctype, value)
2901 return ctype
2902 ctype.append(ValueTerminal(';', 'parameter-separator'))
2903 ctype.append(parse_mime_parameters(value[1:]))
2904 return ctype
2905
2906 def parse_content_disposition_header(value):
2907 """ disposition-type *( ";" parameter )
2908
2909 """
2910 disp_header = ContentDisposition()
2911 if not value:
2912 disp_header.defects.append(errors.HeaderMissingRequiredValue(
2913 "Missing content disposition"))
2914 return disp_header
2915 try:
2916 token, value = get_token(value)
2917 except errors.HeaderParseError:
2918 disp_header.defects.append(errors.InvalidHeaderDefect(
2919 "Expected content disposition but found {!r}".format(value)))
2920 _find_mime_parameters(disp_header, value)
2921 return disp_header
2922 disp_header.append(token)
2923 disp_header.content_disposition = token.value.strip().lower()
2924 if not value:
2925 return disp_header
2926 if value[0] != ';':
2927 disp_header.defects.append(errors.InvalidHeaderDefect(
2928 "Only parameters are valid after content disposition, but "
2929 "found {!r}".format(value)))
2930 _find_mime_parameters(disp_header, value)
2931 return disp_header
2932 disp_header.append(ValueTerminal(';', 'parameter-separator'))
2933 disp_header.append(parse_mime_parameters(value[1:]))
2934 return disp_header
2935
2936 def parse_content_transfer_encoding_header(value):
2937 """ mechanism
2938
2939 """
2940 # We should probably validate the values, since the list is fixed.
2941 cte_header = ContentTransferEncoding()
2942 if not value:
2943 cte_header.defects.append(errors.HeaderMissingRequiredValue(
2944 "Missing content transfer encoding"))
2945 return cte_header
2946 try:
2947 token, value = get_token(value)
2948 except errors.HeaderParseError:
2949 cte_header.defects.append(errors.InvalidHeaderDefect(
2950 "Expected content transfer encoding but found {!r}".format(value)))
2951 else:
2952 cte_header.append(token)
2953 cte_header.cte = token.value.strip().lower()
2954 if not value:
2955 return cte_header
2956 while value:
2957 cte_header.defects.append(errors.InvalidHeaderDefect(
2958 "Extra text after content transfer encoding"))
2959 if value[0] in PHRASE_ENDS:
2960 cte_header.append(ValueTerminal(value[0], 'misplaced-special'))
2961 value = value[1:]
2962 else:
2963 token, value = get_phrase(value)
2964 cte_header.append(token)
2965 return cte_header
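Finally, a sketch of the trailing-junk handling above, again via the stdlib copy of this module (an environment assumption):

```python
from email._header_value_parser import parse_content_transfer_encoding_header

cte = parse_content_transfer_encoding_header('base64')
print(cte.cte)            # base64
print(len(cte.defects))   # 0

# Extra text after the mechanism is kept in the token but flagged.
noisy = parse_content_transfer_encoding_header('base64 extra')
print(noisy.cte)                # base64
print(len(noisy.defects) > 0)   # True
```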