Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(20)

Side by Side Diff: runtime/vm/regexp.cc

Issue 678193004: Copy irregexp related code from V8. (Closed) Base URL: https://dart.googlecode.com/svn/branches/bleeding_edge/dart
Patch Set: rebase Created 6 years, 1 month ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View unified diff | Download patch | Annotate | Revision Log
« no previous file with comments | « runtime/vm/regexp.h ('k') | runtime/vm/regexp_assembler.h » ('j') | no next file with comments »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
OLDNEW
(Empty)
1 // Copyright (c) 2014, the Dart project authors. Please see the AUTHORS file
2 // for details. All rights reserved. Use of this source code is governed by a
3 // BSD-style license that can be found in the LICENSE file.
4
5 #include "vm/regexp.h"
6
7 // SNIP
8
9 namespace dart {
10
11 // SNIP
12
13 // -------------------------------------------------------------------
14 // Implementation of the Irregexp regular expression engine.
15 //
16 // The Irregexp regular expression engine is intended to be a complete
17 // implementation of ECMAScript regular expressions. It generates either
18 // bytecodes or native code.
19
20 // The Irregexp regexp engine is structured in three steps.
21 // 1) The parser generates an abstract syntax tree. See ast.cc.
22 // 2) From the AST a node network is created. The nodes are all
23 // subclasses of RegExpNode. The nodes represent states when
24 // executing a regular expression. Several optimizations are
25 // performed on the node network.
26 // 3) From the nodes we generate either byte codes or native code
27 // that can actually execute the regular expression (perform
28 // the search). The code generation step is described in more
29 // detail below.
30
31 // Code generation.
32 //
33 // The nodes are divided into four main categories.
34 // * Choice nodes
35 // These represent places where the regular expression can
36 // match in more than one way. For example on entry to an
37 // alternation (foo|bar) or a repetition (*, +, ? or {}).
38 // * Action nodes
39 // These represent places where some action should be
40 // performed. Examples include recording the current position
41 // in the input string to a register (in order to implement
42 // captures) or other actions on register for example in order
43 // to implement the counters needed for {} repetitions.
44 // * Matching nodes
45 // These attempt to match some element part of the input string.
46 // Examples of elements include character classes, plain strings
47 // or back references.
48 // * End nodes
49 // These are used to implement the actions required on finding
50 // a successful match or failing to find a match.
51 //
52 // The code generated (whether as byte codes or native code) maintains
53 // some state as it runs. This consists of the following elements:
54 //
55 // * The capture registers. Used for string captures.
56 // * Other registers. Used for counters etc.
57 // * The current position.
58 // * The stack of backtracking information. Used when a matching node
59 // fails to find a match and needs to try an alternative.
60 //
61 // Conceptual regular expression execution model:
62 //
63 // There is a simple conceptual model of regular expression execution
64 // which will be presented first. The actual code generated is a more
65 // efficient simulation of the simple conceptual model:
66 //
67 // * Choice nodes are implemented as follows:
68 // For each choice except the last {
69 // push current position
70 // push backtrack code location
71 // <generate code to test for choice>
72 // backtrack code location:
73 // pop current position
74 // }
75 // <generate code to test for last choice>
76 //
77 // * Actions nodes are generated as follows
78 // <push affected registers on backtrack stack>
79 // <generate code to perform action>
80 // push backtrack code location
81 // <generate code to test for following nodes>
82 // backtrack code location:
83 // <pop affected registers to restore their state>
84 // <pop backtrack location from stack and go to it>
85 //
86 // * Matching nodes are generated as follows:
87 // if input string matches at current position
88 // update current position
89 // <generate code to test for following nodes>
90 // else
91 // <pop backtrack location from stack and go to it>
92 //
93 // Thus it can be seen that the current position is saved and restored
94 // by the choice nodes, whereas the registers are saved and restored by
95 // by the action nodes that manipulate them.
96 //
97 // The other interesting aspect of this model is that nodes are generated
98 // at the point where they are needed by a recursive call to Emit(). If
99 // the node has already been code generated then the Emit() call will
100 // generate a jump to the previously generated code instead. In order to
101 // limit recursion it is possible for the Emit() function to put the node
102 // on a work list for later generation and instead generate a jump. The
103 // destination of the jump is resolved later when the code is generated.
104 //
105 // Actual regular expression code generation.
106 //
107 // Code generation is actually more complicated than the above. In order
108 // to improve the efficiency of the generated code some optimizations are
109 // performed
110 //
111 // * Choice nodes have 1-character lookahead.
112 // A choice node looks at the following character and eliminates some of
113 // the choices immediately based on that character. This is not yet
114 // implemented.
115 // * Simple greedy loops store reduced backtracking information.
116 // A quantifier like /.*foo/m will greedily match the whole input. It will
117 // then need to backtrack to a point where it can match "foo". The naive
118 // implementation of this would push each character position onto the
119 // backtracking stack, then pop them off one by one. This would use space
120 // proportional to the length of the input string. However since the "."
121 // can only match in one way and always has a constant length (in this case
122 // of 1) it suffices to store the current position on the top of the stack
123 // once. Matching now becomes merely incrementing the current position and
124 // backtracking becomes decrementing the current position and checking the
125 // result against the stored current position. This is faster and saves
126 // space.
127 // * The current state is virtualized.
128 // This is used to defer expensive operations until it is clear that they
129 // are needed and to generate code for a node more than once, allowing
130 // specialized an efficient versions of the code to be created. This is
131 // explained in the section below.
132 //
133 // Execution state virtualization.
134 //
135 // Instead of emitting code, nodes that manipulate the state can record their
136 // manipulation in an object called the Trace. The Trace object can record a
137 // current position offset, an optional backtrack code location on the top of
138 // the virtualized backtrack stack and some register changes. When a node is
139 // to be emitted it can flush the Trace or update it. Flushing the Trace
140 // will emit code to bring the actual state into line with the virtual state.
141 // Avoiding flushing the state can postpone some work (e.g. updates of capture
142 // registers). Postponing work can save time when executing the regular
143 // expression since it may be found that the work never has to be done as a
144 // failure to match can occur. In addition it is much faster to jump to a
145 // known backtrack code location than it is to pop an unknown backtrack
146 // location from the stack and jump there.
147 //
148 // The virtual state found in the Trace affects code generation. For example
149 // the virtual state contains the difference between the actual current
150 // position and the virtual current position, and matching code needs to use
151 // this offset to attempt a match in the correct location of the input
152 // string. Therefore code generated for a non-trivial trace is specialized
153 // to that trace. The code generator therefore has the ability to generate
154 // code for each node several times. In order to limit the size of the
155 // generated code there is an arbitrary limit on how many specialized sets of
156 // code may be generated for a given node. If the limit is reached, the
157 // trace is flushed and a generic version of the code for a node is emitted.
158 // This is subsequently used for that node. The code emitted for non-generic
159 // trace is not recorded in the node and so it cannot currently be reused in
160 // the event that code generation is requested for an identical trace.
161
162
163 void RegExpTree::AppendToText(RegExpText* text, Zone* zone) {
164 UNREACHABLE();
165 }
166
167
168 void RegExpAtom::AppendToText(RegExpText* text, Zone* zone) {
169 text->AddElement(TextElement::Atom(this), zone);
170 }
171
172
173 void RegExpCharacterClass::AppendToText(RegExpText* text, Zone* zone) {
174 text->AddElement(TextElement::CharClass(this), zone);
175 }
176
177
178 void RegExpText::AppendToText(RegExpText* text, Zone* zone) {
179 for (int i = 0; i < elements()->length(); i++)
180 text->AddElement(elements()->at(i), zone);
181 }
182
183
184 TextElement TextElement::Atom(RegExpAtom* atom) {
185 return TextElement(ATOM, atom);
186 }
187
188
189 TextElement TextElement::CharClass(RegExpCharacterClass* char_class) {
190 return TextElement(CHAR_CLASS, char_class);
191 }
192
193
194 int TextElement::length() const {
195 switch (text_type()) {
196 case ATOM:
197 return atom()->length();
198
199 case CHAR_CLASS:
200 return 1;
201 }
202 UNREACHABLE();
203 return 0;
204 }
205
206
207 DispatchTable* ChoiceNode::GetTable(bool ignore_case) {
208 if (table_ == NULL) {
209 table_ = new(zone()) DispatchTable(zone());
210 DispatchTableConstructor cons(table_, ignore_case, zone());
211 cons.BuildTable(this);
212 }
213 return table_;
214 }
215
216
217 class FrequencyCollator {
218 public:
219 FrequencyCollator() : total_samples_(0) {
220 for (int i = 0; i < RegExpMacroAssembler::kTableSize; i++) {
221 frequencies_[i] = CharacterFrequency(i);
222 }
223 }
224
225 void CountCharacter(int character) {
226 int index = (character & RegExpMacroAssembler::kTableMask);
227 frequencies_[index].Increment();
228 total_samples_++;
229 }
230
231 // Does not measure in percent, but rather per-128 (the table size from the
232 // regexp macro assembler).
233 int Frequency(int in_character) {
234 DCHECK((in_character & RegExpMacroAssembler::kTableMask) == in_character);
235 if (total_samples_ < 1) return 1; // Division by zero.
236 int freq_in_per128 =
237 (frequencies_[in_character].counter() * 128) / total_samples_;
238 return freq_in_per128;
239 }
240
241 private:
242 class CharacterFrequency {
243 public:
244 CharacterFrequency() : counter_(0), character_(-1) { }
245 explicit CharacterFrequency(int character)
246 : counter_(0), character_(character) { }
247
248 void Increment() { counter_++; }
249 int counter() { return counter_; }
250 int character() { return character_; }
251
252 private:
253 int counter_;
254 int character_;
255 };
256
257
258 private:
259 CharacterFrequency frequencies_[RegExpMacroAssembler::kTableSize];
260 int total_samples_;
261 };
262
263
264 class RegExpCompiler {
265 public:
266 RegExpCompiler(int capture_count, bool ignore_case, bool is_one_byte,
267 Zone* zone);
268
269 int AllocateRegister() {
270 if (next_register_ >= RegExpMacroAssembler::kMaxRegister) {
271 reg_exp_too_big_ = true;
272 return next_register_;
273 }
274 return next_register_++;
275 }
276
277 RegExpEngine::CompilationResult Assemble(RegExpMacroAssembler* assembler,
278 RegExpNode* start,
279 int capture_count,
280 Handle<String> pattern);
281
282 inline void AddWork(RegExpNode* node) { work_list_->Add(node); }
283
284 static const int kImplementationOffset = 0;
285 static const int kNumberOfRegistersOffset = 0;
286 static const int kCodeOffset = 1;
287
288 RegExpMacroAssembler* macro_assembler() { return macro_assembler_; }
289 EndNode* accept() { return accept_; }
290
291 static const int kMaxRecursion = 100;
292 inline int recursion_depth() { return recursion_depth_; }
293 inline void IncrementRecursionDepth() { recursion_depth_++; }
294 inline void DecrementRecursionDepth() { recursion_depth_--; }
295
296 void SetRegExpTooBig() { reg_exp_too_big_ = true; }
297
298 inline bool ignore_case() { return ignore_case_; }
299 inline bool one_byte() { return one_byte_; }
300 FrequencyCollator* frequency_collator() { return &frequency_collator_; }
301
302 int current_expansion_factor() { return current_expansion_factor_; }
303 void set_current_expansion_factor(int value) {
304 current_expansion_factor_ = value;
305 }
306
307 Zone* zone() const { return zone_; }
308
309 static const int kNoRegister = -1;
310
311 private:
312 EndNode* accept_;
313 int next_register_;
314 List<RegExpNode*>* work_list_;
315 int recursion_depth_;
316 RegExpMacroAssembler* macro_assembler_;
317 bool ignore_case_;
318 bool one_byte_;
319 bool reg_exp_too_big_;
320 int current_expansion_factor_;
321 FrequencyCollator frequency_collator_;
322 Zone* zone_;
323 };
324
325
326 class RecursionCheck {
327 public:
328 explicit RecursionCheck(RegExpCompiler* compiler) : compiler_(compiler) {
329 compiler->IncrementRecursionDepth();
330 }
331 ~RecursionCheck() { compiler_->DecrementRecursionDepth(); }
332 private:
333 RegExpCompiler* compiler_;
334 };
335
336
337 static RegExpEngine::CompilationResult IrregexpRegExpTooBig(Isolate* isolate) {
338 return RegExpEngine::CompilationResult(isolate, "RegExp too big");
339 }
340
341
342 // Attempts to compile the regexp using an Irregexp code generator. Returns
343 // a fixed array or a null handle depending on whether it succeeded.
344 RegExpCompiler::RegExpCompiler(int capture_count, bool ignore_case,
345 bool one_byte, Zone* zone)
346 : next_register_(2 * (capture_count + 1)),
347 work_list_(NULL),
348 recursion_depth_(0),
349 ignore_case_(ignore_case),
350 one_byte_(one_byte),
351 reg_exp_too_big_(false),
352 current_expansion_factor_(1),
353 frequency_collator_(),
354 zone_(zone) {
355 accept_ = new(zone) EndNode(EndNode::ACCEPT, zone);
356 DCHECK(next_register_ - 1 <= RegExpMacroAssembler::kMaxRegister);
357 }
358
359
360 RegExpEngine::CompilationResult RegExpCompiler::Assemble(
361 RegExpMacroAssembler* macro_assembler,
362 RegExpNode* start,
363 int capture_count,
364 Handle<String> pattern) {
365 Heap* heap = pattern->GetHeap();
366
367 bool use_slow_safe_regexp_compiler = false;
368 if (heap->total_regexp_code_generated() >
369 RegExpImpl::kRegWxpCompiledLimit &&
370 heap->isolate()->memory_allocator()->SizeExecutable() >
371 RegExpImpl::kRegExpExecutableMemoryLimit) {
372 use_slow_safe_regexp_compiler = true;
373 }
374
375 macro_assembler->set_slow_safe(use_slow_safe_regexp_compiler);
376
377 #ifdef DEBUG
378 if (FLAG_trace_regexp_assembler)
379 macro_assembler_ = new RegExpMacroAssemblerTracer(macro_assembler);
380 else
381 #endif
382 macro_assembler_ = macro_assembler;
383
384 List <RegExpNode*> work_list(0);
385 work_list_ = &work_list;
386 Label fail;
387 macro_assembler_->PushBacktrack(&fail);
388 Trace new_trace;
389 start->Emit(this, &new_trace);
390 macro_assembler_->Bind(&fail);
391 macro_assembler_->Fail();
392 while (!work_list.is_empty()) {
393 work_list.RemoveLast()->Emit(this, &new_trace);
394 }
395 if (reg_exp_too_big_) return IrregexpRegExpTooBig(zone_->isolate());
396
397 Handle<HeapObject> code = macro_assembler_->GetCode(pattern);
398 heap->IncreaseTotalRegexpCodeGenerated(code->Size());
399 work_list_ = NULL;
400 #ifdef DEBUG
401 if (FLAG_print_code) {
402 CodeTracer::Scope trace_scope(heap->isolate()->GetCodeTracer());
403 OFStream os(trace_scope.file());
404 Handle<Code>::cast(code)->Disassemble(pattern->ToCString().get(), os);
405 }
406 if (FLAG_trace_regexp_assembler) {
407 delete macro_assembler_;
408 }
409 #endif
410 return RegExpEngine::CompilationResult(*code, next_register_);
411 }
412
413
414 bool Trace::DeferredAction::Mentions(int that) {
415 if (action_type() == ActionNode::CLEAR_CAPTURES) {
416 Interval range = static_cast<DeferredClearCaptures*>(this)->range();
417 return range.Contains(that);
418 } else {
419 return reg() == that;
420 }
421 }
422
423
424 bool Trace::mentions_reg(int reg) {
425 for (DeferredAction* action = actions_;
426 action != NULL;
427 action = action->next()) {
428 if (action->Mentions(reg))
429 return true;
430 }
431 return false;
432 }
433
434
435 bool Trace::GetStoredPosition(int reg, int* cp_offset) {
436 DCHECK_EQ(0, *cp_offset);
437 for (DeferredAction* action = actions_;
438 action != NULL;
439 action = action->next()) {
440 if (action->Mentions(reg)) {
441 if (action->action_type() == ActionNode::STORE_POSITION) {
442 *cp_offset = static_cast<DeferredCapture*>(action)->cp_offset();
443 return true;
444 } else {
445 return false;
446 }
447 }
448 }
449 return false;
450 }
451
452
453 int Trace::FindAffectedRegisters(OutSet* affected_registers,
454 Zone* zone) {
455 int max_register = RegExpCompiler::kNoRegister;
456 for (DeferredAction* action = actions_;
457 action != NULL;
458 action = action->next()) {
459 if (action->action_type() == ActionNode::CLEAR_CAPTURES) {
460 Interval range = static_cast<DeferredClearCaptures*>(action)->range();
461 for (int i = range.from(); i <= range.to(); i++)
462 affected_registers->Set(i, zone);
463 if (range.to() > max_register) max_register = range.to();
464 } else {
465 affected_registers->Set(action->reg(), zone);
466 if (action->reg() > max_register) max_register = action->reg();
467 }
468 }
469 return max_register;
470 }
471
472
473 void Trace::RestoreAffectedRegisters(RegExpMacroAssembler* assembler,
474 int max_register,
475 const OutSet& registers_to_pop,
476 const OutSet& registers_to_clear) {
477 for (int reg = max_register; reg >= 0; reg--) {
478 if (registers_to_pop.Get(reg)) {
479 assembler->PopRegister(reg);
480 } else if (registers_to_clear.Get(reg)) {
481 int clear_to = reg;
482 while (reg > 0 && registers_to_clear.Get(reg - 1)) {
483 reg--;
484 }
485 assembler->ClearRegisters(reg, clear_to);
486 }
487 }
488 }
489
490
491 void Trace::PerformDeferredActions(RegExpMacroAssembler* assembler,
492 int max_register,
493 const OutSet& affected_registers,
494 OutSet* registers_to_pop,
495 OutSet* registers_to_clear,
496 Zone* zone) {
497 // The "+1" is to avoid a push_limit of zero if stack_limit_slack() is 1.
498 const int push_limit = (assembler->stack_limit_slack() + 1) / 2;
499
500 // Count pushes performed to force a stack limit check occasionally.
501 int pushes = 0;
502
503 for (int reg = 0; reg <= max_register; reg++) {
504 if (!affected_registers.Get(reg)) {
505 continue;
506 }
507
508 // The chronologically first deferred action in the trace
509 // is used to infer the action needed to restore a register
510 // to its previous state (or not, if it's safe to ignore it).
511 enum DeferredActionUndoType { IGNORE, RESTORE, CLEAR };
512 DeferredActionUndoType undo_action = IGNORE;
513
514 int value = 0;
515 bool absolute = false;
516 bool clear = false;
517 int store_position = -1;
518 // This is a little tricky because we are scanning the actions in reverse
519 // historical order (newest first).
520 for (DeferredAction* action = actions_;
521 action != NULL;
522 action = action->next()) {
523 if (action->Mentions(reg)) {
524 switch (action->action_type()) {
525 case ActionNode::SET_REGISTER: {
526 Trace::DeferredSetRegister* psr =
527 static_cast<Trace::DeferredSetRegister*>(action);
528 if (!absolute) {
529 value += psr->value();
530 absolute = true;
531 }
532 // SET_REGISTER is currently only used for newly introduced loop
533 // counters. They can have a significant previous value if they
534 // occour in a loop. TODO(lrn): Propagate this information, so
535 // we can set undo_action to IGNORE if we know there is no value to
536 // restore.
537 undo_action = RESTORE;
538 DCHECK_EQ(store_position, -1);
539 DCHECK(!clear);
540 break;
541 }
542 case ActionNode::INCREMENT_REGISTER:
543 if (!absolute) {
544 value++;
545 }
546 DCHECK_EQ(store_position, -1);
547 DCHECK(!clear);
548 undo_action = RESTORE;
549 break;
550 case ActionNode::STORE_POSITION: {
551 Trace::DeferredCapture* pc =
552 static_cast<Trace::DeferredCapture*>(action);
553 if (!clear && store_position == -1) {
554 store_position = pc->cp_offset();
555 }
556
557 // For captures we know that stores and clears alternate.
558 // Other register, are never cleared, and if the occur
559 // inside a loop, they might be assigned more than once.
560 if (reg <= 1) {
561 // Registers zero and one, aka "capture zero", is
562 // always set correctly if we succeed. There is no
563 // need to undo a setting on backtrack, because we
564 // will set it again or fail.
565 undo_action = IGNORE;
566 } else {
567 undo_action = pc->is_capture() ? CLEAR : RESTORE;
568 }
569 DCHECK(!absolute);
570 DCHECK_EQ(value, 0);
571 break;
572 }
573 case ActionNode::CLEAR_CAPTURES: {
574 // Since we're scanning in reverse order, if we've already
575 // set the position we have to ignore historically earlier
576 // clearing operations.
577 if (store_position == -1) {
578 clear = true;
579 }
580 undo_action = RESTORE;
581 DCHECK(!absolute);
582 DCHECK_EQ(value, 0);
583 break;
584 }
585 default:
586 UNREACHABLE();
587 break;
588 }
589 }
590 }
591 // Prepare for the undo-action (e.g., push if it's going to be popped).
592 if (undo_action == RESTORE) {
593 pushes++;
594 RegExpMacroAssembler::StackCheckFlag stack_check =
595 RegExpMacroAssembler::kNoStackLimitCheck;
596 if (pushes == push_limit) {
597 stack_check = RegExpMacroAssembler::kCheckStackLimit;
598 pushes = 0;
599 }
600
601 assembler->PushRegister(reg, stack_check);
602 registers_to_pop->Set(reg, zone);
603 } else if (undo_action == CLEAR) {
604 registers_to_clear->Set(reg, zone);
605 }
606 // Perform the chronologically last action (or accumulated increment)
607 // for the register.
608 if (store_position != -1) {
609 assembler->WriteCurrentPositionToRegister(reg, store_position);
610 } else if (clear) {
611 assembler->ClearRegisters(reg, reg);
612 } else if (absolute) {
613 assembler->SetRegister(reg, value);
614 } else if (value != 0) {
615 assembler->AdvanceRegister(reg, value);
616 }
617 }
618 }
619
620
621 // This is called as we come into a loop choice node and some other tricky
622 // nodes. It normalizes the state of the code generator to ensure we can
623 // generate generic code.
624 void Trace::Flush(RegExpCompiler* compiler, RegExpNode* successor) {
625 RegExpMacroAssembler* assembler = compiler->macro_assembler();
626
627 DCHECK(!is_trivial());
628
629 if (actions_ == NULL && backtrack() == NULL) {
630 // Here we just have some deferred cp advances to fix and we are back to
631 // a normal situation. We may also have to forget some information gained
632 // through a quick check that was already performed.
633 if (cp_offset_ != 0) assembler->AdvanceCurrentPosition(cp_offset_);
634 // Create a new trivial state and generate the node with that.
635 Trace new_state;
636 successor->Emit(compiler, &new_state);
637 return;
638 }
639
640 // Generate deferred actions here along with code to undo them again.
641 OutSet affected_registers;
642
643 if (backtrack() != NULL) {
644 // Here we have a concrete backtrack location. These are set up by choice
645 // nodes and so they indicate that we have a deferred save of the current
646 // position which we may need to emit here.
647 assembler->PushCurrentPosition();
648 }
649
650 int max_register = FindAffectedRegisters(&affected_registers,
651 compiler->zone());
652 OutSet registers_to_pop;
653 OutSet registers_to_clear;
654 PerformDeferredActions(assembler,
655 max_register,
656 affected_registers,
657 &registers_to_pop,
658 &registers_to_clear,
659 compiler->zone());
660 if (cp_offset_ != 0) {
661 assembler->AdvanceCurrentPosition(cp_offset_);
662 }
663
664 // Create a new trivial state and generate the node with that.
665 Label undo;
666 assembler->PushBacktrack(&undo);
667 Trace new_state;
668 successor->Emit(compiler, &new_state);
669
670 // On backtrack we need to restore state.
671 assembler->Bind(&undo);
672 RestoreAffectedRegisters(assembler,
673 max_register,
674 registers_to_pop,
675 registers_to_clear);
676 if (backtrack() == NULL) {
677 assembler->Backtrack();
678 } else {
679 assembler->PopCurrentPosition();
680 assembler->GoTo(backtrack());
681 }
682 }
683
684
685 void NegativeSubmatchSuccess::Emit(RegExpCompiler* compiler, Trace* trace) {
686 RegExpMacroAssembler* assembler = compiler->macro_assembler();
687
688 // Omit flushing the trace. We discard the entire stack frame anyway.
689
690 if (!label()->is_bound()) {
691 // We are completely independent of the trace, since we ignore it,
692 // so this code can be used as the generic version.
693 assembler->Bind(label());
694 }
695
696 // Throw away everything on the backtrack stack since the start
697 // of the negative submatch and restore the character position.
698 assembler->ReadCurrentPositionFromRegister(current_position_register_);
699 assembler->ReadStackPointerFromRegister(stack_pointer_register_);
700 if (clear_capture_count_ > 0) {
701 // Clear any captures that might have been performed during the success
702 // of the body of the negative look-ahead.
703 int clear_capture_end = clear_capture_start_ + clear_capture_count_ - 1;
704 assembler->ClearRegisters(clear_capture_start_, clear_capture_end);
705 }
706 // Now that we have unwound the stack we find at the top of the stack the
707 // backtrack that the BeginSubmatch node got.
708 assembler->Backtrack();
709 }
710
711
712 void EndNode::Emit(RegExpCompiler* compiler, Trace* trace) {
713 if (!trace->is_trivial()) {
714 trace->Flush(compiler, this);
715 return;
716 }
717 RegExpMacroAssembler* assembler = compiler->macro_assembler();
718 if (!label()->is_bound()) {
719 assembler->Bind(label());
720 }
721 switch (action_) {
722 case ACCEPT:
723 assembler->Succeed();
724 return;
725 case BACKTRACK:
726 assembler->GoTo(trace->backtrack());
727 return;
728 case NEGATIVE_SUBMATCH_SUCCESS:
729 // This case is handled in a different virtual method.
730 UNREACHABLE();
731 }
732 UNIMPLEMENTED();
733 }
734
735
736 void GuardedAlternative::AddGuard(Guard* guard, Zone* zone) {
737 if (guards_ == NULL)
738 guards_ = new(zone) ZoneList<Guard*>(1, zone);
739 guards_->Add(guard, zone);
740 }
741
742
743 ActionNode* ActionNode::SetRegister(int reg,
744 int val,
745 RegExpNode* on_success) {
746 ActionNode* result =
747 new(on_success->zone()) ActionNode(SET_REGISTER, on_success);
748 result->data_.u_store_register.reg = reg;
749 result->data_.u_store_register.value = val;
750 return result;
751 }
752
753
754 ActionNode* ActionNode::IncrementRegister(int reg, RegExpNode* on_success) {
755 ActionNode* result =
756 new(on_success->zone()) ActionNode(INCREMENT_REGISTER, on_success);
757 result->data_.u_increment_register.reg = reg;
758 return result;
759 }
760
761
762 ActionNode* ActionNode::StorePosition(int reg,
763 bool is_capture,
764 RegExpNode* on_success) {
765 ActionNode* result =
766 new(on_success->zone()) ActionNode(STORE_POSITION, on_success);
767 result->data_.u_position_register.reg = reg;
768 result->data_.u_position_register.is_capture = is_capture;
769 return result;
770 }
771
772
773 ActionNode* ActionNode::ClearCaptures(Interval range,
774 RegExpNode* on_success) {
775 ActionNode* result =
776 new(on_success->zone()) ActionNode(CLEAR_CAPTURES, on_success);
777 result->data_.u_clear_captures.range_from = range.from();
778 result->data_.u_clear_captures.range_to = range.to();
779 return result;
780 }
781
782
783 ActionNode* ActionNode::BeginSubmatch(int stack_reg,
784 int position_reg,
785 RegExpNode* on_success) {
786 ActionNode* result =
787 new(on_success->zone()) ActionNode(BEGIN_SUBMATCH, on_success);
788 result->data_.u_submatch.stack_pointer_register = stack_reg;
789 result->data_.u_submatch.current_position_register = position_reg;
790 return result;
791 }
792
793
794 ActionNode* ActionNode::PositiveSubmatchSuccess(int stack_reg,
795 int position_reg,
796 int clear_register_count,
797 int clear_register_from,
798 RegExpNode* on_success) {
799 ActionNode* result =
800 new(on_success->zone()) ActionNode(POSITIVE_SUBMATCH_SUCCESS, on_success);
801 result->data_.u_submatch.stack_pointer_register = stack_reg;
802 result->data_.u_submatch.current_position_register = position_reg;
803 result->data_.u_submatch.clear_register_count = clear_register_count;
804 result->data_.u_submatch.clear_register_from = clear_register_from;
805 return result;
806 }
807
808
809 ActionNode* ActionNode::EmptyMatchCheck(int start_register,
810 int repetition_register,
811 int repetition_limit,
812 RegExpNode* on_success) {
813 ActionNode* result =
814 new(on_success->zone()) ActionNode(EMPTY_MATCH_CHECK, on_success);
815 result->data_.u_empty_match_check.start_register = start_register;
816 result->data_.u_empty_match_check.repetition_register = repetition_register;
817 result->data_.u_empty_match_check.repetition_limit = repetition_limit;
818 return result;
819 }
820
821
822 #define DEFINE_ACCEPT(Type) \
823 void Type##Node::Accept(NodeVisitor* visitor) { \
824 visitor->Visit##Type(this); \
825 }
826 FOR_EACH_NODE_TYPE(DEFINE_ACCEPT)
827 #undef DEFINE_ACCEPT
828
829
830 void LoopChoiceNode::Accept(NodeVisitor* visitor) {
831 visitor->VisitLoopChoice(this);
832 }
833
834
835 // -------------------------------------------------------------------
836 // Emit code.
837
838
839 void ChoiceNode::GenerateGuard(RegExpMacroAssembler* macro_assembler,
840 Guard* guard,
841 Trace* trace) {
842 switch (guard->op()) {
843 case Guard::LT:
844 DCHECK(!trace->mentions_reg(guard->reg()));
845 macro_assembler->IfRegisterGE(guard->reg(),
846 guard->value(),
847 trace->backtrack());
848 break;
849 case Guard::GEQ:
850 DCHECK(!trace->mentions_reg(guard->reg()));
851 macro_assembler->IfRegisterLT(guard->reg(),
852 guard->value(),
853 trace->backtrack());
854 break;
855 }
856 }
857
858
859 // Returns the number of characters in the equivalence class, omitting those
860 // that cannot occur in the source string because it is ASCII.
861 static int GetCaseIndependentLetters(Isolate* isolate, uc16 character,
862 bool one_byte_subject,
863 unibrow::uchar* letters) {
864 int length =
865 isolate->jsregexp_uncanonicalize()->get(character, '\0', letters);
866 // Unibrow returns 0 or 1 for characters where case independence is
867 // trivial.
868 if (length == 0) {
869 letters[0] = character;
870 length = 1;
871 }
872 if (!one_byte_subject || character <= String::kMaxOneByteCharCode) {
873 return length;
874 }
875
876 // The standard requires that non-ASCII characters cannot have ASCII
877 // character codes in their equivalence class.
878 // TODO(dcarney): issue 3550 this is not actually true for Latin1 anymore,
879 // is it? For example, \u00C5 is equivalent to \u212B.
880 return 0;
881 }
882
883
884 static inline bool EmitSimpleCharacter(Isolate* isolate,
885 RegExpCompiler* compiler,
886 uc16 c,
887 Label* on_failure,
888 int cp_offset,
889 bool check,
890 bool preloaded) {
891 RegExpMacroAssembler* assembler = compiler->macro_assembler();
892 bool bound_checked = false;
893 if (!preloaded) {
894 assembler->LoadCurrentCharacter(
895 cp_offset,
896 on_failure,
897 check);
898 bound_checked = true;
899 }
900 assembler->CheckNotCharacter(c, on_failure);
901 return bound_checked;
902 }
903
904
905 // Only emits non-letters (things that don't have case). Only used for case
906 // independent matches.
907 static inline bool EmitAtomNonLetter(Isolate* isolate,
908 RegExpCompiler* compiler,
909 uc16 c,
910 Label* on_failure,
911 int cp_offset,
912 bool check,
913 bool preloaded) {
914 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
915 bool one_byte = compiler->one_byte();
916 unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
917 int length = GetCaseIndependentLetters(isolate, c, one_byte, chars);
918 if (length < 1) {
919 // This can't match. Must be an one-byte subject and a non-one-byte
920 // character. We do not need to do anything since the one-byte pass
921 // already handled this.
922 return false; // Bounds not checked.
923 }
924 bool checked = false;
925 // We handle the length > 1 case in a later pass.
926 if (length == 1) {
927 if (one_byte && c > String::kMaxOneByteCharCodeU) {
928 // Can't match - see above.
929 return false; // Bounds not checked.
930 }
931 if (!preloaded) {
932 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check);
933 checked = check;
934 }
935 macro_assembler->CheckNotCharacter(c, on_failure);
936 }
937 return checked;
938 }
939
940
941 static bool ShortCutEmitCharacterPair(RegExpMacroAssembler* macro_assembler,
942 bool one_byte, uc16 c1, uc16 c2,
943 Label* on_failure) {
944 uc16 char_mask;
945 if (one_byte) {
946 char_mask = String::kMaxOneByteCharCode;
947 } else {
948 char_mask = String::kMaxUtf16CodeUnit;
949 }
950 uc16 exor = c1 ^ c2;
951 // Check whether exor has only one bit set.
952 if (((exor - 1) & exor) == 0) {
953 // If c1 and c2 differ only by one bit.
954 // Ecma262UnCanonicalize always gives the highest number last.
955 DCHECK(c2 > c1);
956 uc16 mask = char_mask ^ exor;
957 macro_assembler->CheckNotCharacterAfterAnd(c1, mask, on_failure);
958 return true;
959 }
960 DCHECK(c2 > c1);
961 uc16 diff = c2 - c1;
962 if (((diff - 1) & diff) == 0 && c1 >= diff) {
963 // If the characters differ by 2^n but don't differ by one bit then
964 // subtract the difference from the found character, then do the or
965 // trick. We avoid the theoretical case where negative numbers are
966 // involved in order to simplify code generation.
967 uc16 mask = char_mask ^ diff;
968 macro_assembler->CheckNotCharacterAfterMinusAnd(c1 - diff,
969 diff,
970 mask,
971 on_failure);
972 return true;
973 }
974 return false;
975 }
976
977
978 typedef bool EmitCharacterFunction(Isolate* isolate,
979 RegExpCompiler* compiler,
980 uc16 c,
981 Label* on_failure,
982 int cp_offset,
983 bool check,
984 bool preloaded);
985
986 // Only emits letters (things that have case). Only used for case independent
987 // matches.
988 static inline bool EmitAtomLetter(Isolate* isolate,
989 RegExpCompiler* compiler,
990 uc16 c,
991 Label* on_failure,
992 int cp_offset,
993 bool check,
994 bool preloaded) {
995 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
996 bool one_byte = compiler->one_byte();
997 unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
998 int length = GetCaseIndependentLetters(isolate, c, one_byte, chars);
999 if (length <= 1) return false;
1000 // We may not need to check against the end of the input string
1001 // if this character lies before a character that matched.
1002 if (!preloaded) {
1003 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check);
1004 }
1005 Label ok;
1006 DCHECK(unibrow::Ecma262UnCanonicalize::kMaxWidth == 4);
1007 switch (length) {
1008 case 2: {
1009 if (ShortCutEmitCharacterPair(macro_assembler, one_byte, chars[0],
1010 chars[1], on_failure)) {
1011 } else {
1012 macro_assembler->CheckCharacter(chars[0], &ok);
1013 macro_assembler->CheckNotCharacter(chars[1], on_failure);
1014 macro_assembler->Bind(&ok);
1015 }
1016 break;
1017 }
1018 case 4:
1019 macro_assembler->CheckCharacter(chars[3], &ok);
1020 // Fall through!
1021 case 3:
1022 macro_assembler->CheckCharacter(chars[0], &ok);
1023 macro_assembler->CheckCharacter(chars[1], &ok);
1024 macro_assembler->CheckNotCharacter(chars[2], on_failure);
1025 macro_assembler->Bind(&ok);
1026 break;
1027 default:
1028 UNREACHABLE();
1029 break;
1030 }
1031 return true;
1032 }
1033
1034
1035 static void EmitBoundaryTest(RegExpMacroAssembler* masm,
1036 int border,
1037 Label* fall_through,
1038 Label* above_or_equal,
1039 Label* below) {
1040 if (below != fall_through) {
1041 masm->CheckCharacterLT(border, below);
1042 if (above_or_equal != fall_through) masm->GoTo(above_or_equal);
1043 } else {
1044 masm->CheckCharacterGT(border - 1, above_or_equal);
1045 }
1046 }
1047
1048
1049 static void EmitDoubleBoundaryTest(RegExpMacroAssembler* masm,
1050 int first,
1051 int last,
1052 Label* fall_through,
1053 Label* in_range,
1054 Label* out_of_range) {
1055 if (in_range == fall_through) {
1056 if (first == last) {
1057 masm->CheckNotCharacter(first, out_of_range);
1058 } else {
1059 masm->CheckCharacterNotInRange(first, last, out_of_range);
1060 }
1061 } else {
1062 if (first == last) {
1063 masm->CheckCharacter(first, in_range);
1064 } else {
1065 masm->CheckCharacterInRange(first, last, in_range);
1066 }
1067 if (out_of_range != fall_through) masm->GoTo(out_of_range);
1068 }
1069 }
1070
1071
1072 // even_label is for ranges[i] to ranges[i + 1] where i - start_index is even.
1073 // odd_label is for ranges[i] to ranges[i + 1] where i - start_index is odd.
1074 static void EmitUseLookupTable(
1075 RegExpMacroAssembler* masm,
1076 ZoneList<int>* ranges,
1077 int start_index,
1078 int end_index,
1079 int min_char,
1080 Label* fall_through,
1081 Label* even_label,
1082 Label* odd_label) {
1083 static const int kSize = RegExpMacroAssembler::kTableSize;
1084 static const int kMask = RegExpMacroAssembler::kTableMask;
1085
1086 int base = (min_char & ~kMask);
1087 USE(base);
1088
1089 // Assert that everything is on one kTableSize page.
1090 for (int i = start_index; i <= end_index; i++) {
1091 DCHECK_EQ(ranges->at(i) & ~kMask, base);
1092 }
1093 DCHECK(start_index == 0 || (ranges->at(start_index - 1) & ~kMask) <= base);
1094
1095 char templ[kSize];
1096 Label* on_bit_set;
1097 Label* on_bit_clear;
1098 int bit;
1099 if (even_label == fall_through) {
1100 on_bit_set = odd_label;
1101 on_bit_clear = even_label;
1102 bit = 1;
1103 } else {
1104 on_bit_set = even_label;
1105 on_bit_clear = odd_label;
1106 bit = 0;
1107 }
1108 for (int i = 0; i < (ranges->at(start_index) & kMask) && i < kSize; i++) {
1109 templ[i] = bit;
1110 }
1111 int j = 0;
1112 bit ^= 1;
1113 for (int i = start_index; i < end_index; i++) {
1114 for (j = (ranges->at(i) & kMask); j < (ranges->at(i + 1) & kMask); j++) {
1115 templ[j] = bit;
1116 }
1117 bit ^= 1;
1118 }
1119 for (int i = j; i < kSize; i++) {
1120 templ[i] = bit;
1121 }
1122 Factory* factory = masm->zone()->isolate()->factory();
1123 // TODO(erikcorry): Cache these.
1124 Handle<ByteArray> ba = factory->NewByteArray(kSize, TENURED);
1125 for (int i = 0; i < kSize; i++) {
1126 ba->set(i, templ[i]);
1127 }
1128 masm->CheckBitInTable(ba, on_bit_set);
1129 if (on_bit_clear != fall_through) masm->GoTo(on_bit_clear);
1130 }
1131
1132
1133 static void CutOutRange(RegExpMacroAssembler* masm,
1134 ZoneList<int>* ranges,
1135 int start_index,
1136 int end_index,
1137 int cut_index,
1138 Label* even_label,
1139 Label* odd_label) {
1140 bool odd = (((cut_index - start_index) & 1) == 1);
1141 Label* in_range_label = odd ? odd_label : even_label;
1142 Label dummy;
1143 EmitDoubleBoundaryTest(masm,
1144 ranges->at(cut_index),
1145 ranges->at(cut_index + 1) - 1,
1146 &dummy,
1147 in_range_label,
1148 &dummy);
1149 DCHECK(!dummy.is_linked());
1150 // Cut out the single range by rewriting the array. This creates a new
1151 // range that is a merger of the two ranges on either side of the one we
1152 // are cutting out. The oddity of the labels is preserved.
1153 for (int j = cut_index; j > start_index; j--) {
1154 ranges->at(j) = ranges->at(j - 1);
1155 }
1156 for (int j = cut_index + 1; j < end_index; j++) {
1157 ranges->at(j) = ranges->at(j + 1);
1158 }
1159 }
1160
1161
1162 // Unicode case. Split the search space into kSize spaces that are handled
1163 // with recursion.
1164 static void SplitSearchSpace(ZoneList<int>* ranges,
1165 int start_index,
1166 int end_index,
1167 int* new_start_index,
1168 int* new_end_index,
1169 int* border) {
1170 static const int kSize = RegExpMacroAssembler::kTableSize;
1171 static const int kMask = RegExpMacroAssembler::kTableMask;
1172
1173 int first = ranges->at(start_index);
1174 int last = ranges->at(end_index) - 1;
1175
1176 *new_start_index = start_index;
1177 *border = (ranges->at(start_index) & ~kMask) + kSize;
1178 while (*new_start_index < end_index) {
1179 if (ranges->at(*new_start_index) > *border) break;
1180 (*new_start_index)++;
1181 }
1182 // new_start_index is the index of the first edge that is beyond the
1183 // current kSize space.
1184
1185 // For very large search spaces we do a binary chop search of the non-Latin1
1186 // space instead of just going to the end of the current kSize space. The
1187 // heuristics are complicated a little by the fact that any 128-character
1188 // encoding space can be quickly tested with a table lookup, so we don't
1189 // wish to do binary chop search at a smaller granularity than that. A
1190 // 128-character space can take up a lot of space in the ranges array if,
1191 // for example, we only want to match every second character (eg. the lower
1192 // case characters on some Unicode pages).
1193 int binary_chop_index = (end_index + start_index) / 2;
1194 // The first test ensures that we get to the code that handles the Latin1
1195 // range with a single not-taken branch, speeding up this important
1196 // character range (even non-Latin1 charset-based text has spaces and
1197 // punctuation).
1198 if (*border - 1 > String::kMaxOneByteCharCode && // Latin1 case.
1199 end_index - start_index > (*new_start_index - start_index) * 2 &&
1200 last - first > kSize * 2 && binary_chop_index > *new_start_index &&
1201 ranges->at(binary_chop_index) >= first + 2 * kSize) {
1202 int scan_forward_for_section_border = binary_chop_index;;
1203 int new_border = (ranges->at(binary_chop_index) | kMask) + 1;
1204
1205 while (scan_forward_for_section_border < end_index) {
1206 if (ranges->at(scan_forward_for_section_border) > new_border) {
1207 *new_start_index = scan_forward_for_section_border;
1208 *border = new_border;
1209 break;
1210 }
1211 scan_forward_for_section_border++;
1212 }
1213 }
1214
1215 DCHECK(*new_start_index > start_index);
1216 *new_end_index = *new_start_index - 1;
1217 if (ranges->at(*new_end_index) == *border) {
1218 (*new_end_index)--;
1219 }
1220 if (*border >= ranges->at(end_index)) {
1221 *border = ranges->at(end_index);
1222 *new_start_index = end_index; // Won't be used.
1223 *new_end_index = end_index - 1;
1224 }
1225 }
1226
1227
1228 // Gets a series of segment boundaries representing a character class. If the
1229 // character is in the range between an even and an odd boundary (counting from
1230 // start_index) then go to even_label, otherwise go to odd_label. We already
1231 // know that the character is in the range of min_char to max_char inclusive.
1232 // Either label can be NULL indicating backtracking. Either label can also be
1233 // equal to the fall_through label.
1234 static void GenerateBranches(RegExpMacroAssembler* masm,
1235 ZoneList<int>* ranges,
1236 int start_index,
1237 int end_index,
1238 uc16 min_char,
1239 uc16 max_char,
1240 Label* fall_through,
1241 Label* even_label,
1242 Label* odd_label) {
1243 int first = ranges->at(start_index);
1244 int last = ranges->at(end_index) - 1;
1245
1246 DCHECK_LT(min_char, first);
1247
1248 // Just need to test if the character is before or on-or-after
1249 // a particular character.
1250 if (start_index == end_index) {
1251 EmitBoundaryTest(masm, first, fall_through, even_label, odd_label);
1252 return;
1253 }
1254
1255 // Another almost trivial case: There is one interval in the middle that is
1256 // different from the end intervals.
1257 if (start_index + 1 == end_index) {
1258 EmitDoubleBoundaryTest(
1259 masm, first, last, fall_through, even_label, odd_label);
1260 return;
1261 }
1262
1263 // It's not worth using table lookup if there are very few intervals in the
1264 // character class.
1265 if (end_index - start_index <= 6) {
1266 // It is faster to test for individual characters, so we look for those
1267 // first, then try arbitrary ranges in the second round.
1268 static int kNoCutIndex = -1;
1269 int cut = kNoCutIndex;
1270 for (int i = start_index; i < end_index; i++) {
1271 if (ranges->at(i) == ranges->at(i + 1) - 1) {
1272 cut = i;
1273 break;
1274 }
1275 }
1276 if (cut == kNoCutIndex) cut = start_index;
1277 CutOutRange(
1278 masm, ranges, start_index, end_index, cut, even_label, odd_label);
1279 DCHECK_GE(end_index - start_index, 2);
1280 GenerateBranches(masm,
1281 ranges,
1282 start_index + 1,
1283 end_index - 1,
1284 min_char,
1285 max_char,
1286 fall_through,
1287 even_label,
1288 odd_label);
1289 return;
1290 }
1291
1292 // If there are a lot of intervals in the regexp, then we will use tables to
1293 // determine whether the character is inside or outside the character class.
1294 static const int kBits = RegExpMacroAssembler::kTableSizeBits;
1295
1296 if ((max_char >> kBits) == (min_char >> kBits)) {
1297 EmitUseLookupTable(masm,
1298 ranges,
1299 start_index,
1300 end_index,
1301 min_char,
1302 fall_through,
1303 even_label,
1304 odd_label);
1305 return;
1306 }
1307
1308 if ((min_char >> kBits) != (first >> kBits)) {
1309 masm->CheckCharacterLT(first, odd_label);
1310 GenerateBranches(masm,
1311 ranges,
1312 start_index + 1,
1313 end_index,
1314 first,
1315 max_char,
1316 fall_through,
1317 odd_label,
1318 even_label);
1319 return;
1320 }
1321
1322 int new_start_index = 0;
1323 int new_end_index = 0;
1324 int border = 0;
1325
1326 SplitSearchSpace(ranges,
1327 start_index,
1328 end_index,
1329 &new_start_index,
1330 &new_end_index,
1331 &border);
1332
1333 Label handle_rest;
1334 Label* above = &handle_rest;
1335 if (border == last + 1) {
1336 // We didn't find any section that started after the limit, so everything
1337 // above the border is one of the terminal labels.
1338 above = (end_index & 1) != (start_index & 1) ? odd_label : even_label;
1339 DCHECK(new_end_index == end_index - 1);
1340 }
1341
1342 DCHECK_LE(start_index, new_end_index);
1343 DCHECK_LE(new_start_index, end_index);
1344 DCHECK_LT(start_index, new_start_index);
1345 DCHECK_LT(new_end_index, end_index);
1346 DCHECK(new_end_index + 1 == new_start_index ||
1347 (new_end_index + 2 == new_start_index &&
1348 border == ranges->at(new_end_index + 1)));
1349 DCHECK_LT(min_char, border - 1);
1350 DCHECK_LT(border, max_char);
1351 DCHECK_LT(ranges->at(new_end_index), border);
1352 DCHECK(border < ranges->at(new_start_index) ||
1353 (border == ranges->at(new_start_index) &&
1354 new_start_index == end_index &&
1355 new_end_index == end_index - 1 &&
1356 border == last + 1));
1357 DCHECK(new_start_index == 0 || border >= ranges->at(new_start_index - 1));
1358
1359 masm->CheckCharacterGT(border - 1, above);
1360 Label dummy;
1361 GenerateBranches(masm,
1362 ranges,
1363 start_index,
1364 new_end_index,
1365 min_char,
1366 border - 1,
1367 &dummy,
1368 even_label,
1369 odd_label);
1370 if (handle_rest.is_linked()) {
1371 masm->Bind(&handle_rest);
1372 bool flip = (new_start_index & 1) != (start_index & 1);
1373 GenerateBranches(masm,
1374 ranges,
1375 new_start_index,
1376 end_index,
1377 border,
1378 max_char,
1379 &dummy,
1380 flip ? odd_label : even_label,
1381 flip ? even_label : odd_label);
1382 }
1383 }
1384
1385
1386 static void EmitCharClass(RegExpMacroAssembler* macro_assembler,
1387 RegExpCharacterClass* cc, bool one_byte,
1388 Label* on_failure, int cp_offset, bool check_offset,
1389 bool preloaded, Zone* zone) {
1390 ZoneList<CharacterRange>* ranges = cc->ranges(zone);
1391 if (!CharacterRange::IsCanonical(ranges)) {
1392 CharacterRange::Canonicalize(ranges);
1393 }
1394
1395 int max_char;
1396 if (one_byte) {
1397 max_char = String::kMaxOneByteCharCode;
1398 } else {
1399 max_char = String::kMaxUtf16CodeUnit;
1400 }
1401
1402 int range_count = ranges->length();
1403
1404 int last_valid_range = range_count - 1;
1405 while (last_valid_range >= 0) {
1406 CharacterRange& range = ranges->at(last_valid_range);
1407 if (range.from() <= max_char) {
1408 break;
1409 }
1410 last_valid_range--;
1411 }
1412
1413 if (last_valid_range < 0) {
1414 if (!cc->is_negated()) {
1415 macro_assembler->GoTo(on_failure);
1416 }
1417 if (check_offset) {
1418 macro_assembler->CheckPosition(cp_offset, on_failure);
1419 }
1420 return;
1421 }
1422
1423 if (last_valid_range == 0 &&
1424 ranges->at(0).IsEverything(max_char)) {
1425 if (cc->is_negated()) {
1426 macro_assembler->GoTo(on_failure);
1427 } else {
1428 // This is a common case hit by non-anchored expressions.
1429 if (check_offset) {
1430 macro_assembler->CheckPosition(cp_offset, on_failure);
1431 }
1432 }
1433 return;
1434 }
1435 if (last_valid_range == 0 &&
1436 !cc->is_negated() &&
1437 ranges->at(0).IsEverything(max_char)) {
1438 // This is a common case hit by non-anchored expressions.
1439 if (check_offset) {
1440 macro_assembler->CheckPosition(cp_offset, on_failure);
1441 }
1442 return;
1443 }
1444
1445 if (!preloaded) {
1446 macro_assembler->LoadCurrentCharacter(cp_offset, on_failure, check_offset);
1447 }
1448
1449 if (cc->is_standard(zone) &&
1450 macro_assembler->CheckSpecialCharacterClass(cc->standard_type(),
1451 on_failure)) {
1452 return;
1453 }
1454
1455
1456 // A new list with ascending entries. Each entry is a code unit
1457 // where there is a boundary between code units that are part of
1458 // the class and code units that are not. Normally we insert an
1459 // entry at zero which goes to the failure label, but if there
1460 // was already one there we fall through for success on that entry.
1461 // Subsequent entries have alternating meaning (success/failure).
1462 ZoneList<int>* range_boundaries =
1463 new(zone) ZoneList<int>(last_valid_range, zone);
1464
1465 bool zeroth_entry_is_failure = !cc->is_negated();
1466
1467 for (int i = 0; i <= last_valid_range; i++) {
1468 CharacterRange& range = ranges->at(i);
1469 if (range.from() == 0) {
1470 DCHECK_EQ(i, 0);
1471 zeroth_entry_is_failure = !zeroth_entry_is_failure;
1472 } else {
1473 range_boundaries->Add(range.from(), zone);
1474 }
1475 range_boundaries->Add(range.to() + 1, zone);
1476 }
1477 int end_index = range_boundaries->length() - 1;
1478 if (range_boundaries->at(end_index) > max_char) {
1479 end_index--;
1480 }
1481
1482 Label fall_through;
1483 GenerateBranches(macro_assembler,
1484 range_boundaries,
1485 0, // start_index.
1486 end_index,
1487 0, // min_char.
1488 max_char,
1489 &fall_through,
1490 zeroth_entry_is_failure ? &fall_through : on_failure,
1491 zeroth_entry_is_failure ? on_failure : &fall_through);
1492 macro_assembler->Bind(&fall_through);
1493 }
1494
1495
1496 RegExpNode::~RegExpNode() {
1497 }
1498
1499
1500 RegExpNode::LimitResult RegExpNode::LimitVersions(RegExpCompiler* compiler,
1501 Trace* trace) {
1502 // If we are generating a greedy loop then don't stop and don't reuse code.
1503 if (trace->stop_node() != NULL) {
1504 return CONTINUE;
1505 }
1506
1507 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
1508 if (trace->is_trivial()) {
1509 if (label_.is_bound()) {
1510 // We are being asked to generate a generic version, but that's already
1511 // been done so just go to it.
1512 macro_assembler->GoTo(&label_);
1513 return DONE;
1514 }
1515 if (compiler->recursion_depth() >= RegExpCompiler::kMaxRecursion) {
1516 // To avoid too deep recursion we push the node to the work queue and just
1517 // generate a goto here.
1518 compiler->AddWork(this);
1519 macro_assembler->GoTo(&label_);
1520 return DONE;
1521 }
1522 // Generate generic version of the node and bind the label for later use.
1523 macro_assembler->Bind(&label_);
1524 return CONTINUE;
1525 }
1526
1527 // We are being asked to make a non-generic version. Keep track of how many
1528 // non-generic versions we generate so as not to overdo it.
1529 trace_count_++;
1530 if (FLAG_regexp_optimization &&
1531 trace_count_ < kMaxCopiesCodeGenerated &&
1532 compiler->recursion_depth() <= RegExpCompiler::kMaxRecursion) {
1533 return CONTINUE;
1534 }
1535
1536 // If we get here code has been generated for this node too many times or
1537 // recursion is too deep. Time to switch to a generic version. The code for
1538 // generic versions above can handle deep recursion properly.
1539 trace->Flush(compiler, this);
1540 return DONE;
1541 }
1542
1543
1544 int ActionNode::EatsAtLeast(int still_to_find,
1545 int budget,
1546 bool not_at_start) {
1547 if (budget <= 0) return 0;
1548 if (action_type_ == POSITIVE_SUBMATCH_SUCCESS) return 0; // Rewinds input!
1549 return on_success()->EatsAtLeast(still_to_find,
1550 budget - 1,
1551 not_at_start);
1552 }
1553
1554
1555 void ActionNode::FillInBMInfo(int offset,
1556 int budget,
1557 BoyerMooreLookahead* bm,
1558 bool not_at_start) {
1559 if (action_type_ == BEGIN_SUBMATCH) {
1560 bm->SetRest(offset);
1561 } else if (action_type_ != POSITIVE_SUBMATCH_SUCCESS) {
1562 on_success()->FillInBMInfo(offset, budget - 1, bm, not_at_start);
1563 }
1564 SaveBMInfo(bm, not_at_start, offset);
1565 }
1566
1567
1568 int AssertionNode::EatsAtLeast(int still_to_find,
1569 int budget,
1570 bool not_at_start) {
1571 if (budget <= 0) return 0;
1572 // If we know we are not at the start and we are asked "how many characters
1573 // will you match if you succeed?" then we can answer anything since false
1574 // implies false. So lets just return the max answer (still_to_find) since
1575 // that won't prevent us from preloading a lot of characters for the other
1576 // branches in the node graph.
1577 if (assertion_type() == AT_START && not_at_start) return still_to_find;
1578 return on_success()->EatsAtLeast(still_to_find,
1579 budget - 1,
1580 not_at_start);
1581 }
1582
1583
1584 void AssertionNode::FillInBMInfo(int offset,
1585 int budget,
1586 BoyerMooreLookahead* bm,
1587 bool not_at_start) {
1588 // Match the behaviour of EatsAtLeast on this node.
1589 if (assertion_type() == AT_START && not_at_start) return;
1590 on_success()->FillInBMInfo(offset, budget - 1, bm, not_at_start);
1591 SaveBMInfo(bm, not_at_start, offset);
1592 }
1593
1594
1595 int BackReferenceNode::EatsAtLeast(int still_to_find,
1596 int budget,
1597 bool not_at_start) {
1598 if (budget <= 0) return 0;
1599 return on_success()->EatsAtLeast(still_to_find,
1600 budget - 1,
1601 not_at_start);
1602 }
1603
1604
1605 int TextNode::EatsAtLeast(int still_to_find,
1606 int budget,
1607 bool not_at_start) {
1608 int answer = Length();
1609 if (answer >= still_to_find) return answer;
1610 if (budget <= 0) return answer;
1611 // We are not at start after this node so we set the last argument to 'true'.
1612 return answer + on_success()->EatsAtLeast(still_to_find - answer,
1613 budget - 1,
1614 true);
1615 }
1616
1617
1618 int NegativeLookaheadChoiceNode::EatsAtLeast(int still_to_find,
1619 int budget,
1620 bool not_at_start) {
1621 if (budget <= 0) return 0;
1622 // Alternative 0 is the negative lookahead, alternative 1 is what comes
1623 // afterwards.
1624 RegExpNode* node = alternatives_->at(1).node();
1625 return node->EatsAtLeast(still_to_find, budget - 1, not_at_start);
1626 }
1627
1628
1629 void NegativeLookaheadChoiceNode::GetQuickCheckDetails(
1630 QuickCheckDetails* details,
1631 RegExpCompiler* compiler,
1632 int filled_in,
1633 bool not_at_start) {
1634 // Alternative 0 is the negative lookahead, alternative 1 is what comes
1635 // afterwards.
1636 RegExpNode* node = alternatives_->at(1).node();
1637 return node->GetQuickCheckDetails(details, compiler, filled_in, not_at_start);
1638 }
1639
1640
1641 int ChoiceNode::EatsAtLeastHelper(int still_to_find,
1642 int budget,
1643 RegExpNode* ignore_this_node,
1644 bool not_at_start) {
1645 if (budget <= 0) return 0;
1646 int min = 100;
1647 int choice_count = alternatives_->length();
1648 budget = (budget - 1) / choice_count;
1649 for (int i = 0; i < choice_count; i++) {
1650 RegExpNode* node = alternatives_->at(i).node();
1651 if (node == ignore_this_node) continue;
1652 int node_eats_at_least =
1653 node->EatsAtLeast(still_to_find, budget, not_at_start);
1654 if (node_eats_at_least < min) min = node_eats_at_least;
1655 if (min == 0) return 0;
1656 }
1657 return min;
1658 }
1659
1660
1661 int LoopChoiceNode::EatsAtLeast(int still_to_find,
1662 int budget,
1663 bool not_at_start) {
1664 return EatsAtLeastHelper(still_to_find,
1665 budget - 1,
1666 loop_node_,
1667 not_at_start);
1668 }
1669
1670
1671 int ChoiceNode::EatsAtLeast(int still_to_find,
1672 int budget,
1673 bool not_at_start) {
1674 return EatsAtLeastHelper(still_to_find,
1675 budget,
1676 NULL,
1677 not_at_start);
1678 }
1679
1680
1681 // Takes the left-most 1-bit and smears it out, setting all bits to its right.
1682 static inline uint32_t SmearBitsRight(uint32_t v) {
1683 v |= v >> 1;
1684 v |= v >> 2;
1685 v |= v >> 4;
1686 v |= v >> 8;
1687 v |= v >> 16;
1688 return v;
1689 }
1690
1691
1692 bool QuickCheckDetails::Rationalize(bool asc) {
1693 bool found_useful_op = false;
1694 uint32_t char_mask;
1695 if (asc) {
1696 char_mask = String::kMaxOneByteCharCode;
1697 } else {
1698 char_mask = String::kMaxUtf16CodeUnit;
1699 }
1700 mask_ = 0;
1701 value_ = 0;
1702 int char_shift = 0;
1703 for (int i = 0; i < characters_; i++) {
1704 Position* pos = &positions_[i];
1705 if ((pos->mask & String::kMaxOneByteCharCode) != 0) {
1706 found_useful_op = true;
1707 }
1708 mask_ |= (pos->mask & char_mask) << char_shift;
1709 value_ |= (pos->value & char_mask) << char_shift;
1710 char_shift += asc ? 8 : 16;
1711 }
1712 return found_useful_op;
1713 }
1714
1715
1716 bool RegExpNode::EmitQuickCheck(RegExpCompiler* compiler,
1717 Trace* bounds_check_trace,
1718 Trace* trace,
1719 bool preload_has_checked_bounds,
1720 Label* on_possible_success,
1721 QuickCheckDetails* details,
1722 bool fall_through_on_failure) {
1723 if (details->characters() == 0) return false;
1724 GetQuickCheckDetails(
1725 details, compiler, 0, trace->at_start() == Trace::FALSE_VALUE);
1726 if (details->cannot_match()) return false;
1727 if (!details->Rationalize(compiler->one_byte())) return false;
1728 DCHECK(details->characters() == 1 ||
1729 compiler->macro_assembler()->CanReadUnaligned());
1730 uint32_t mask = details->mask();
1731 uint32_t value = details->value();
1732
1733 RegExpMacroAssembler* assembler = compiler->macro_assembler();
1734
1735 if (trace->characters_preloaded() != details->characters()) {
1736 DCHECK(trace->cp_offset() == bounds_check_trace->cp_offset());
1737 // We are attempting to preload the minimum number of characters
1738 // any choice would eat, so if the bounds check fails, then none of the
1739 // choices can succeed, so we can just immediately backtrack, rather
1740 // than go to the next choice.
1741 assembler->LoadCurrentCharacter(trace->cp_offset(),
1742 bounds_check_trace->backtrack(),
1743 !preload_has_checked_bounds,
1744 details->characters());
1745 }
1746
1747
1748 bool need_mask = true;
1749
1750 if (details->characters() == 1) {
1751 // If number of characters preloaded is 1 then we used a byte or 16 bit
1752 // load so the value is already masked down.
1753 uint32_t char_mask;
1754 if (compiler->one_byte()) {
1755 char_mask = String::kMaxOneByteCharCode;
1756 } else {
1757 char_mask = String::kMaxUtf16CodeUnit;
1758 }
1759 if ((mask & char_mask) == char_mask) need_mask = false;
1760 mask &= char_mask;
1761 } else {
1762 // For 2-character preloads in one-byte mode or 1-character preloads in
1763 // two-byte mode we also use a 16 bit load with zero extend.
1764 if (details->characters() == 2 && compiler->one_byte()) {
1765 if ((mask & 0xffff) == 0xffff) need_mask = false;
1766 } else if (details->characters() == 1 && !compiler->one_byte()) {
1767 if ((mask & 0xffff) == 0xffff) need_mask = false;
1768 } else {
1769 if (mask == 0xffffffff) need_mask = false;
1770 }
1771 }
1772
1773 if (fall_through_on_failure) {
1774 if (need_mask) {
1775 assembler->CheckCharacterAfterAnd(value, mask, on_possible_success);
1776 } else {
1777 assembler->CheckCharacter(value, on_possible_success);
1778 }
1779 } else {
1780 if (need_mask) {
1781 assembler->CheckNotCharacterAfterAnd(value, mask, trace->backtrack());
1782 } else {
1783 assembler->CheckNotCharacter(value, trace->backtrack());
1784 }
1785 }
1786 return true;
1787 }
1788
1789
1790 // Here is the meat of GetQuickCheckDetails (see also the comment on the
1791 // super-class in the .h file).
1792 //
1793 // We iterate along the text object, building up for each character a
1794 // mask and value that can be used to test for a quick failure to match.
1795 // The masks and values for the positions will be combined into a single
1796 // machine word for the current character width in order to be used in
1797 // generating a quick check.
1798 void TextNode::GetQuickCheckDetails(QuickCheckDetails* details,
1799 RegExpCompiler* compiler,
1800 int characters_filled_in,
1801 bool not_at_start) {
1802 Isolate* isolate = compiler->macro_assembler()->zone()->isolate();
1803 DCHECK(characters_filled_in < details->characters());
1804 int characters = details->characters();
1805 int char_mask;
1806 if (compiler->one_byte()) {
1807 char_mask = String::kMaxOneByteCharCode;
1808 } else {
1809 char_mask = String::kMaxUtf16CodeUnit;
1810 }
1811 for (int k = 0; k < elms_->length(); k++) {
1812 TextElement elm = elms_->at(k);
1813 if (elm.text_type() == TextElement::ATOM) {
1814 Vector<const uc16> quarks = elm.atom()->data();
1815 for (int i = 0; i < characters && i < quarks.length(); i++) {
1816 QuickCheckDetails::Position* pos =
1817 details->positions(characters_filled_in);
1818 uc16 c = quarks[i];
1819 if (c > char_mask) {
1820 // If we expect a non-Latin1 character from an one-byte string,
1821 // there is no way we can match. Not even case-independent
1822 // matching can turn an Latin1 character into non-Latin1 or
1823 // vice versa.
1824 // TODO(dcarney): issue 3550. Verify that this works as expected.
1825 // For example, \u0178 is uppercase of \u00ff (y-umlaut).
1826 details->set_cannot_match();
1827 pos->determines_perfectly = false;
1828 return;
1829 }
1830 if (compiler->ignore_case()) {
1831 unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
1832 int length = GetCaseIndependentLetters(isolate, c,
1833 compiler->one_byte(), chars);
1834 DCHECK(length != 0); // Can only happen if c > char_mask (see above).
1835 if (length == 1) {
1836 // This letter has no case equivalents, so it's nice and simple
1837 // and the mask-compare will determine definitely whether we have
1838 // a match at this character position.
1839 pos->mask = char_mask;
1840 pos->value = c;
1841 pos->determines_perfectly = true;
1842 } else {
1843 uint32_t common_bits = char_mask;
1844 uint32_t bits = chars[0];
1845 for (int j = 1; j < length; j++) {
1846 uint32_t differing_bits = ((chars[j] & common_bits) ^ bits);
1847 common_bits ^= differing_bits;
1848 bits &= common_bits;
1849 }
1850 // If length is 2 and common bits has only one zero in it then
1851 // our mask and compare instruction will determine definitely
1852 // whether we have a match at this character position. Otherwise
1853 // it can only be an approximate check.
1854 uint32_t one_zero = (common_bits | ~char_mask);
1855 if (length == 2 && ((~one_zero) & ((~one_zero) - 1)) == 0) {
1856 pos->determines_perfectly = true;
1857 }
1858 pos->mask = common_bits;
1859 pos->value = bits;
1860 }
1861 } else {
1862 // Don't ignore case. Nice simple case where the mask-compare will
1863 // determine definitely whether we have a match at this character
1864 // position.
1865 pos->mask = char_mask;
1866 pos->value = c;
1867 pos->determines_perfectly = true;
1868 }
1869 characters_filled_in++;
1870 DCHECK(characters_filled_in <= details->characters());
1871 if (characters_filled_in == details->characters()) {
1872 return;
1873 }
1874 }
1875 } else {
1876 QuickCheckDetails::Position* pos =
1877 details->positions(characters_filled_in);
1878 RegExpCharacterClass* tree = elm.char_class();
1879 ZoneList<CharacterRange>* ranges = tree->ranges(zone());
1880 if (tree->is_negated()) {
1881 // A quick check uses multi-character mask and compare. There is no
1882 // useful way to incorporate a negative char class into this scheme
1883 // so we just conservatively create a mask and value that will always
1884 // succeed.
1885 pos->mask = 0;
1886 pos->value = 0;
1887 } else {
1888 int first_range = 0;
1889 while (ranges->at(first_range).from() > char_mask) {
1890 first_range++;
1891 if (first_range == ranges->length()) {
1892 details->set_cannot_match();
1893 pos->determines_perfectly = false;
1894 return;
1895 }
1896 }
1897 CharacterRange range = ranges->at(first_range);
1898 uc16 from = range.from();
1899 uc16 to = range.to();
1900 if (to > char_mask) {
1901 to = char_mask;
1902 }
1903 uint32_t differing_bits = (from ^ to);
1904 // A mask and compare is only perfect if the differing bits form a
1905 // number like 00011111 with one single block of trailing 1s.
1906 if ((differing_bits & (differing_bits + 1)) == 0 &&
1907 from + differing_bits == to) {
1908 pos->determines_perfectly = true;
1909 }
1910 uint32_t common_bits = ~SmearBitsRight(differing_bits);
1911 uint32_t bits = (from & common_bits);
1912 for (int i = first_range + 1; i < ranges->length(); i++) {
1913 CharacterRange range = ranges->at(i);
1914 uc16 from = range.from();
1915 uc16 to = range.to();
1916 if (from > char_mask) continue;
1917 if (to > char_mask) to = char_mask;
1918 // Here we are combining more ranges into the mask and compare
1919 // value. With each new range the mask becomes more sparse and
1920 // so the chances of a false positive rise. A character class
1921 // with multiple ranges is assumed never to be equivalent to a
1922 // mask and compare operation.
1923 pos->determines_perfectly = false;
1924 uint32_t new_common_bits = (from ^ to);
1925 new_common_bits = ~SmearBitsRight(new_common_bits);
1926 common_bits &= new_common_bits;
1927 bits &= new_common_bits;
1928 uint32_t differing_bits = (from & common_bits) ^ bits;
1929 common_bits ^= differing_bits;
1930 bits &= common_bits;
1931 }
1932 pos->mask = common_bits;
1933 pos->value = bits;
1934 }
1935 characters_filled_in++;
1936 DCHECK(characters_filled_in <= details->characters());
1937 if (characters_filled_in == details->characters()) {
1938 return;
1939 }
1940 }
1941 }
1942 DCHECK(characters_filled_in != details->characters());
1943 if (!details->cannot_match()) {
1944 on_success()-> GetQuickCheckDetails(details,
1945 compiler,
1946 characters_filled_in,
1947 true);
1948 }
1949 }
1950
1951
1952 void QuickCheckDetails::Clear() {
1953 for (int i = 0; i < characters_; i++) {
1954 positions_[i].mask = 0;
1955 positions_[i].value = 0;
1956 positions_[i].determines_perfectly = false;
1957 }
1958 characters_ = 0;
1959 }
1960
1961
1962 void QuickCheckDetails::Advance(int by, bool one_byte) {
1963 DCHECK(by >= 0);
1964 if (by >= characters_) {
1965 Clear();
1966 return;
1967 }
1968 for (int i = 0; i < characters_ - by; i++) {
1969 positions_[i] = positions_[by + i];
1970 }
1971 for (int i = characters_ - by; i < characters_; i++) {
1972 positions_[i].mask = 0;
1973 positions_[i].value = 0;
1974 positions_[i].determines_perfectly = false;
1975 }
1976 characters_ -= by;
1977 // We could change mask_ and value_ here but we would never advance unless
1978 // they had already been used in a check and they won't be used again because
1979 // it would gain us nothing. So there's no point.
1980 }
1981
1982
1983 void QuickCheckDetails::Merge(QuickCheckDetails* other, int from_index) {
1984 DCHECK(characters_ == other->characters_);
1985 if (other->cannot_match_) {
1986 return;
1987 }
1988 if (cannot_match_) {
1989 *this = *other;
1990 return;
1991 }
1992 for (int i = from_index; i < characters_; i++) {
1993 QuickCheckDetails::Position* pos = positions(i);
1994 QuickCheckDetails::Position* other_pos = other->positions(i);
1995 if (pos->mask != other_pos->mask ||
1996 pos->value != other_pos->value ||
1997 !other_pos->determines_perfectly) {
1998 // Our mask-compare operation will be approximate unless we have the
1999 // exact same operation on both sides of the alternation.
2000 pos->determines_perfectly = false;
2001 }
2002 pos->mask &= other_pos->mask;
2003 pos->value &= pos->mask;
2004 other_pos->value &= pos->mask;
2005 uc16 differing_bits = (pos->value ^ other_pos->value);
2006 pos->mask &= ~differing_bits;
2007 pos->value &= pos->mask;
2008 }
2009 }
2010
2011
2012 class VisitMarker {
2013 public:
2014 explicit VisitMarker(NodeInfo* info) : info_(info) {
2015 DCHECK(!info->visited);
2016 info->visited = true;
2017 }
2018 ~VisitMarker() {
2019 info_->visited = false;
2020 }
2021 private:
2022 NodeInfo* info_;
2023 };
2024
2025
2026 RegExpNode* SeqRegExpNode::FilterOneByte(int depth, bool ignore_case) {
2027 if (info()->replacement_calculated) return replacement();
2028 if (depth < 0) return this;
2029 DCHECK(!info()->visited);
2030 VisitMarker marker(info());
2031 return FilterSuccessor(depth - 1, ignore_case);
2032 }
2033
2034
2035 RegExpNode* SeqRegExpNode::FilterSuccessor(int depth, bool ignore_case) {
2036 RegExpNode* next = on_success_->FilterOneByte(depth - 1, ignore_case);
2037 if (next == NULL) return set_replacement(NULL);
2038 on_success_ = next;
2039 return set_replacement(this);
2040 }
2041
2042
2043 // We need to check for the following characters: 0x39c 0x3bc 0x178.
2044 static inline bool RangeContainsLatin1Equivalents(CharacterRange range) {
2045 // TODO(dcarney): this could be a lot more efficient.
2046 return range.Contains(0x39c) ||
2047 range.Contains(0x3bc) || range.Contains(0x178);
2048 }
2049
2050
2051 static bool RangesContainLatin1Equivalents(ZoneList<CharacterRange>* ranges) {
2052 for (int i = 0; i < ranges->length(); i++) {
2053 // TODO(dcarney): this could be a lot more efficient.
2054 if (RangeContainsLatin1Equivalents(ranges->at(i))) return true;
2055 }
2056 return false;
2057 }
2058
2059
2060 RegExpNode* TextNode::FilterOneByte(int depth, bool ignore_case) {
2061 if (info()->replacement_calculated) return replacement();
2062 if (depth < 0) return this;
2063 DCHECK(!info()->visited);
2064 VisitMarker marker(info());
2065 int element_count = elms_->length();
2066 for (int i = 0; i < element_count; i++) {
2067 TextElement elm = elms_->at(i);
2068 if (elm.text_type() == TextElement::ATOM) {
2069 Vector<const uc16> quarks = elm.atom()->data();
2070 for (int j = 0; j < quarks.length(); j++) {
2071 uint16_t c = quarks[j];
2072 if (c <= String::kMaxOneByteCharCode) continue;
2073 if (!ignore_case) return set_replacement(NULL);
2074 // Here, we need to check for characters whose upper and lower cases
2075 // are outside the Latin-1 range.
2076 uint16_t converted = unibrow::Latin1::ConvertNonLatin1ToLatin1(c);
2077 // Character is outside Latin-1 completely
2078 if (converted == 0) return set_replacement(NULL);
2079 // Convert quark to Latin-1 in place.
2080 uint16_t* copy = const_cast<uint16_t*>(quarks.start());
2081 copy[j] = converted;
2082 }
2083 } else {
2084 DCHECK(elm.text_type() == TextElement::CHAR_CLASS);
2085 RegExpCharacterClass* cc = elm.char_class();
2086 ZoneList<CharacterRange>* ranges = cc->ranges(zone());
2087 if (!CharacterRange::IsCanonical(ranges)) {
2088 CharacterRange::Canonicalize(ranges);
2089 }
2090 // Now they are in order so we only need to look at the first.
2091 int range_count = ranges->length();
2092 if (cc->is_negated()) {
2093 if (range_count != 0 &&
2094 ranges->at(0).from() == 0 &&
2095 ranges->at(0).to() >= String::kMaxOneByteCharCode) {
2096 // This will be handled in a later filter.
2097 if (ignore_case && RangesContainLatin1Equivalents(ranges)) continue;
2098 return set_replacement(NULL);
2099 }
2100 } else {
2101 if (range_count == 0 ||
2102 ranges->at(0).from() > String::kMaxOneByteCharCode) {
2103 // This will be handled in a later filter.
2104 if (ignore_case && RangesContainLatin1Equivalents(ranges)) continue;
2105 return set_replacement(NULL);
2106 }
2107 }
2108 }
2109 }
2110 return FilterSuccessor(depth - 1, ignore_case);
2111 }
2112
2113
2114 RegExpNode* LoopChoiceNode::FilterOneByte(int depth, bool ignore_case) {
2115 if (info()->replacement_calculated) return replacement();
2116 if (depth < 0) return this;
2117 if (info()->visited) return this;
2118 {
2119 VisitMarker marker(info());
2120
2121 RegExpNode* continue_replacement =
2122 continue_node_->FilterOneByte(depth - 1, ignore_case);
2123 // If we can't continue after the loop then there is no sense in doing the
2124 // loop.
2125 if (continue_replacement == NULL) return set_replacement(NULL);
2126 }
2127
2128 return ChoiceNode::FilterOneByte(depth - 1, ignore_case);
2129 }
2130
2131
2132 RegExpNode* ChoiceNode::FilterOneByte(int depth, bool ignore_case) {
2133 if (info()->replacement_calculated) return replacement();
2134 if (depth < 0) return this;
2135 if (info()->visited) return this;
2136 VisitMarker marker(info());
2137 int choice_count = alternatives_->length();
2138
2139 for (int i = 0; i < choice_count; i++) {
2140 GuardedAlternative alternative = alternatives_->at(i);
2141 if (alternative.guards() != NULL && alternative.guards()->length() != 0) {
2142 set_replacement(this);
2143 return this;
2144 }
2145 }
2146
2147 int surviving = 0;
2148 RegExpNode* survivor = NULL;
2149 for (int i = 0; i < choice_count; i++) {
2150 GuardedAlternative alternative = alternatives_->at(i);
2151 RegExpNode* replacement =
2152 alternative.node()->FilterOneByte(depth - 1, ignore_case);
2153 DCHECK(replacement != this); // No missing EMPTY_MATCH_CHECK.
2154 if (replacement != NULL) {
2155 alternatives_->at(i).set_node(replacement);
2156 surviving++;
2157 survivor = replacement;
2158 }
2159 }
2160 if (surviving < 2) return set_replacement(survivor);
2161
2162 set_replacement(this);
2163 if (surviving == choice_count) {
2164 return this;
2165 }
2166 // Only some of the nodes survived the filtering. We need to rebuild the
2167 // alternatives list.
2168 ZoneList<GuardedAlternative>* new_alternatives =
2169 new(zone()) ZoneList<GuardedAlternative>(surviving, zone());
2170 for (int i = 0; i < choice_count; i++) {
2171 RegExpNode* replacement =
2172 alternatives_->at(i).node()->FilterOneByte(depth - 1, ignore_case);
2173 if (replacement != NULL) {
2174 alternatives_->at(i).set_node(replacement);
2175 new_alternatives->Add(alternatives_->at(i), zone());
2176 }
2177 }
2178 alternatives_ = new_alternatives;
2179 return this;
2180 }
2181
2182
2183 RegExpNode* NegativeLookaheadChoiceNode::FilterOneByte(int depth,
2184 bool ignore_case) {
2185 if (info()->replacement_calculated) return replacement();
2186 if (depth < 0) return this;
2187 if (info()->visited) return this;
2188 VisitMarker marker(info());
2189 // Alternative 0 is the negative lookahead, alternative 1 is what comes
2190 // afterwards.
2191 RegExpNode* node = alternatives_->at(1).node();
2192 RegExpNode* replacement = node->FilterOneByte(depth - 1, ignore_case);
2193 if (replacement == NULL) return set_replacement(NULL);
2194 alternatives_->at(1).set_node(replacement);
2195
2196 RegExpNode* neg_node = alternatives_->at(0).node();
2197 RegExpNode* neg_replacement = neg_node->FilterOneByte(depth - 1, ignore_case);
2198 // If the negative lookahead is always going to fail then
2199 // we don't need to check it.
2200 if (neg_replacement == NULL) return set_replacement(replacement);
2201 alternatives_->at(0).set_node(neg_replacement);
2202 return set_replacement(this);
2203 }
2204
2205
2206 void LoopChoiceNode::GetQuickCheckDetails(QuickCheckDetails* details,
2207 RegExpCompiler* compiler,
2208 int characters_filled_in,
2209 bool not_at_start) {
2210 if (body_can_be_zero_length_ || info()->visited) return;
2211 VisitMarker marker(info());
2212 return ChoiceNode::GetQuickCheckDetails(details,
2213 compiler,
2214 characters_filled_in,
2215 not_at_start);
2216 }
2217
2218
2219 void LoopChoiceNode::FillInBMInfo(int offset,
2220 int budget,
2221 BoyerMooreLookahead* bm,
2222 bool not_at_start) {
2223 if (body_can_be_zero_length_ || budget <= 0) {
2224 bm->SetRest(offset);
2225 SaveBMInfo(bm, not_at_start, offset);
2226 return;
2227 }
2228 ChoiceNode::FillInBMInfo(offset, budget - 1, bm, not_at_start);
2229 SaveBMInfo(bm, not_at_start, offset);
2230 }
2231
2232
2233 void ChoiceNode::GetQuickCheckDetails(QuickCheckDetails* details,
2234 RegExpCompiler* compiler,
2235 int characters_filled_in,
2236 bool not_at_start) {
2237 not_at_start = (not_at_start || not_at_start_);
2238 int choice_count = alternatives_->length();
2239 DCHECK(choice_count > 0);
2240 alternatives_->at(0).node()->GetQuickCheckDetails(details,
2241 compiler,
2242 characters_filled_in,
2243 not_at_start);
2244 for (int i = 1; i < choice_count; i++) {
2245 QuickCheckDetails new_details(details->characters());
2246 RegExpNode* node = alternatives_->at(i).node();
2247 node->GetQuickCheckDetails(&new_details, compiler,
2248 characters_filled_in,
2249 not_at_start);
2250 // Here we merge the quick match details of the two branches.
2251 details->Merge(&new_details, characters_filled_in);
2252 }
2253 }
2254
2255
2256 // Check for [0-9A-Z_a-z].
2257 static void EmitWordCheck(RegExpMacroAssembler* assembler,
2258 Label* word,
2259 Label* non_word,
2260 bool fall_through_on_word) {
2261 if (assembler->CheckSpecialCharacterClass(
2262 fall_through_on_word ? 'w' : 'W',
2263 fall_through_on_word ? non_word : word)) {
2264 // Optimized implementation available.
2265 return;
2266 }
2267 assembler->CheckCharacterGT('z', non_word);
2268 assembler->CheckCharacterLT('0', non_word);
2269 assembler->CheckCharacterGT('a' - 1, word);
2270 assembler->CheckCharacterLT('9' + 1, word);
2271 assembler->CheckCharacterLT('A', non_word);
2272 assembler->CheckCharacterLT('Z' + 1, word);
2273 if (fall_through_on_word) {
2274 assembler->CheckNotCharacter('_', non_word);
2275 } else {
2276 assembler->CheckCharacter('_', word);
2277 }
2278 }
2279
2280
2281 // Emit the code to check for a ^ in multiline mode (1-character lookbehind
2282 // that matches newline or the start of input).
2283 static void EmitHat(RegExpCompiler* compiler,
2284 RegExpNode* on_success,
2285 Trace* trace) {
2286 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2287 // We will be loading the previous character into the current character
2288 // register.
2289 Trace new_trace(*trace);
2290 new_trace.InvalidateCurrentCharacter();
2291
2292 Label ok;
2293 if (new_trace.cp_offset() == 0) {
2294 // The start of input counts as a newline in this context, so skip to
2295 // ok if we are at the start.
2296 assembler->CheckAtStart(&ok);
2297 }
2298 // We already checked that we are not at the start of input so it must be
2299 // OK to load the previous character.
2300 assembler->LoadCurrentCharacter(new_trace.cp_offset() -1,
2301 new_trace.backtrack(),
2302 false);
2303 if (!assembler->CheckSpecialCharacterClass('n',
2304 new_trace.backtrack())) {
2305 // Newline means \n, \r, 0x2028 or 0x2029.
2306 if (!compiler->one_byte()) {
2307 assembler->CheckCharacterAfterAnd(0x2028, 0xfffe, &ok);
2308 }
2309 assembler->CheckCharacter('\n', &ok);
2310 assembler->CheckNotCharacter('\r', new_trace.backtrack());
2311 }
2312 assembler->Bind(&ok);
2313 on_success->Emit(compiler, &new_trace);
2314 }
2315
2316
2317 // Emit the code to handle \b and \B (word-boundary or non-word-boundary).
2318 void AssertionNode::EmitBoundaryCheck(RegExpCompiler* compiler, Trace* trace) {
2319 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2320 Trace::TriBool next_is_word_character = Trace::UNKNOWN;
2321 bool not_at_start = (trace->at_start() == Trace::FALSE_VALUE);
2322 BoyerMooreLookahead* lookahead = bm_info(not_at_start);
2323 if (lookahead == NULL) {
2324 int eats_at_least =
2325 Min(kMaxLookaheadForBoyerMoore, EatsAtLeast(kMaxLookaheadForBoyerMoore,
2326 kRecursionBudget,
2327 not_at_start));
2328 if (eats_at_least >= 1) {
2329 BoyerMooreLookahead* bm =
2330 new(zone()) BoyerMooreLookahead(eats_at_least, compiler, zone());
2331 FillInBMInfo(0, kRecursionBudget, bm, not_at_start);
2332 if (bm->at(0)->is_non_word())
2333 next_is_word_character = Trace::FALSE_VALUE;
2334 if (bm->at(0)->is_word()) next_is_word_character = Trace::TRUE_VALUE;
2335 }
2336 } else {
2337 if (lookahead->at(0)->is_non_word())
2338 next_is_word_character = Trace::FALSE_VALUE;
2339 if (lookahead->at(0)->is_word())
2340 next_is_word_character = Trace::TRUE_VALUE;
2341 }
2342 bool at_boundary = (assertion_type_ == AssertionNode::AT_BOUNDARY);
2343 if (next_is_word_character == Trace::UNKNOWN) {
2344 Label before_non_word;
2345 Label before_word;
2346 if (trace->characters_preloaded() != 1) {
2347 assembler->LoadCurrentCharacter(trace->cp_offset(), &before_non_word);
2348 }
2349 // Fall through on non-word.
2350 EmitWordCheck(assembler, &before_word, &before_non_word, false);
2351 // Next character is not a word character.
2352 assembler->Bind(&before_non_word);
2353 Label ok;
2354 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsNonWord : kIsWord);
2355 assembler->GoTo(&ok);
2356
2357 assembler->Bind(&before_word);
2358 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsWord : kIsNonWord);
2359 assembler->Bind(&ok);
2360 } else if (next_is_word_character == Trace::TRUE_VALUE) {
2361 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsWord : kIsNonWord);
2362 } else {
2363 DCHECK(next_is_word_character == Trace::FALSE_VALUE);
2364 BacktrackIfPrevious(compiler, trace, at_boundary ? kIsNonWord : kIsWord);
2365 }
2366 }
2367
2368
2369 void AssertionNode::BacktrackIfPrevious(
2370 RegExpCompiler* compiler,
2371 Trace* trace,
2372 AssertionNode::IfPrevious backtrack_if_previous) {
2373 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2374 Trace new_trace(*trace);
2375 new_trace.InvalidateCurrentCharacter();
2376
2377 Label fall_through, dummy;
2378
2379 Label* non_word = backtrack_if_previous == kIsNonWord ?
2380 new_trace.backtrack() :
2381 &fall_through;
2382 Label* word = backtrack_if_previous == kIsNonWord ?
2383 &fall_through :
2384 new_trace.backtrack();
2385
2386 if (new_trace.cp_offset() == 0) {
2387 // The start of input counts as a non-word character, so the question is
2388 // decided if we are at the start.
2389 assembler->CheckAtStart(non_word);
2390 }
2391 // We already checked that we are not at the start of input so it must be
2392 // OK to load the previous character.
2393 assembler->LoadCurrentCharacter(new_trace.cp_offset() - 1, &dummy, false);
2394 EmitWordCheck(assembler, word, non_word, backtrack_if_previous == kIsNonWord);
2395
2396 assembler->Bind(&fall_through);
2397 on_success()->Emit(compiler, &new_trace);
2398 }
2399
2400
2401 void AssertionNode::GetQuickCheckDetails(QuickCheckDetails* details,
2402 RegExpCompiler* compiler,
2403 int filled_in,
2404 bool not_at_start) {
2405 if (assertion_type_ == AT_START && not_at_start) {
2406 details->set_cannot_match();
2407 return;
2408 }
2409 return on_success()->GetQuickCheckDetails(details,
2410 compiler,
2411 filled_in,
2412 not_at_start);
2413 }
2414
2415
2416 void AssertionNode::Emit(RegExpCompiler* compiler, Trace* trace) {
2417 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2418 switch (assertion_type_) {
2419 case AT_END: {
2420 Label ok;
2421 assembler->CheckPosition(trace->cp_offset(), &ok);
2422 assembler->GoTo(trace->backtrack());
2423 assembler->Bind(&ok);
2424 break;
2425 }
2426 case AT_START: {
2427 if (trace->at_start() == Trace::FALSE_VALUE) {
2428 assembler->GoTo(trace->backtrack());
2429 return;
2430 }
2431 if (trace->at_start() == Trace::UNKNOWN) {
2432 assembler->CheckNotAtStart(trace->backtrack());
2433 Trace at_start_trace = *trace;
2434 at_start_trace.set_at_start(true);
2435 on_success()->Emit(compiler, &at_start_trace);
2436 return;
2437 }
2438 }
2439 break;
2440 case AFTER_NEWLINE:
2441 EmitHat(compiler, on_success(), trace);
2442 return;
2443 case AT_BOUNDARY:
2444 case AT_NON_BOUNDARY: {
2445 EmitBoundaryCheck(compiler, trace);
2446 return;
2447 }
2448 }
2449 on_success()->Emit(compiler, trace);
2450 }
2451
2452
2453 static bool DeterminedAlready(QuickCheckDetails* quick_check, int offset) {
2454 if (quick_check == NULL) return false;
2455 if (offset >= quick_check->characters()) return false;
2456 return quick_check->positions(offset)->determines_perfectly;
2457 }
2458
2459
2460 static void UpdateBoundsCheck(int index, int* checked_up_to) {
2461 if (index > *checked_up_to) {
2462 *checked_up_to = index;
2463 }
2464 }
2465
2466
2467 // We call this repeatedly to generate code for each pass over the text node.
2468 // The passes are in increasing order of difficulty because we hope one
2469 // of the first passes will fail in which case we are saved the work of the
2470 // later passes. for example for the case independent regexp /%[asdfghjkl]a/
2471 // we will check the '%' in the first pass, the case independent 'a' in the
2472 // second pass and the character class in the last pass.
2473 //
2474 // The passes are done from right to left, so for example to test for /bar/
2475 // we will first test for an 'r' with offset 2, then an 'a' with offset 1
2476 // and then a 'b' with offset 0. This means we can avoid the end-of-input
2477 // bounds check most of the time. In the example we only need to check for
2478 // end-of-input when loading the putative 'r'.
2479 //
2480 // A slight complication involves the fact that the first character may already
2481 // be fetched into a register by the previous node. In this case we want to
2482 // do the test for that character first. We do this in separate passes. The
2483 // 'preloaded' argument indicates that we are doing such a 'pass'. If such a
2484 // pass has been performed then subsequent passes will have true in
2485 // first_element_checked to indicate that that character does not need to be
2486 // checked again.
2487 //
2488 // In addition to all this we are passed a Trace, which can
2489 // contain an AlternativeGeneration object. In this AlternativeGeneration
2490 // object we can see details of any quick check that was already passed in
2491 // order to get to the code we are now generating. The quick check can involve
2492 // loading characters, which means we do not need to recheck the bounds
2493 // up to the limit the quick check already checked. In addition the quick
2494 // check can have involved a mask and compare operation which may simplify
2495 // or obviate the need for further checks at some character positions.
2496 void TextNode::TextEmitPass(RegExpCompiler* compiler,
2497 TextEmitPassType pass,
2498 bool preloaded,
2499 Trace* trace,
2500 bool first_element_checked,
2501 int* checked_up_to) {
2502 RegExpMacroAssembler* assembler = compiler->macro_assembler();
2503 Isolate* isolate = assembler->zone()->isolate();
2504 bool one_byte = compiler->one_byte();
2505 Label* backtrack = trace->backtrack();
2506 QuickCheckDetails* quick_check = trace->quick_check_performed();
2507 int element_count = elms_->length();
2508 for (int i = preloaded ? 0 : element_count - 1; i >= 0; i--) {
2509 TextElement elm = elms_->at(i);
2510 int cp_offset = trace->cp_offset() + elm.cp_offset();
2511 if (elm.text_type() == TextElement::ATOM) {
2512 Vector<const uc16> quarks = elm.atom()->data();
2513 for (int j = preloaded ? 0 : quarks.length() - 1; j >= 0; j--) {
2514 if (first_element_checked && i == 0 && j == 0) continue;
2515 if (DeterminedAlready(quick_check, elm.cp_offset() + j)) continue;
2516 EmitCharacterFunction* emit_function = NULL;
2517 switch (pass) {
2518 case NON_LATIN1_MATCH:
2519 DCHECK(one_byte);
2520 if (quarks[j] > String::kMaxOneByteCharCode) {
2521 assembler->GoTo(backtrack);
2522 return;
2523 }
2524 break;
2525 case NON_LETTER_CHARACTER_MATCH:
2526 emit_function = &EmitAtomNonLetter;
2527 break;
2528 case SIMPLE_CHARACTER_MATCH:
2529 emit_function = &EmitSimpleCharacter;
2530 break;
2531 case CASE_CHARACTER_MATCH:
2532 emit_function = &EmitAtomLetter;
2533 break;
2534 default:
2535 break;
2536 }
2537 if (emit_function != NULL) {
2538 bool bound_checked = emit_function(isolate,
2539 compiler,
2540 quarks[j],
2541 backtrack,
2542 cp_offset + j,
2543 *checked_up_to < cp_offset + j,
2544 preloaded);
2545 if (bound_checked) UpdateBoundsCheck(cp_offset + j, checked_up_to);
2546 }
2547 }
2548 } else {
2549 DCHECK_EQ(TextElement::CHAR_CLASS, elm.text_type());
2550 if (pass == CHARACTER_CLASS_MATCH) {
2551 if (first_element_checked && i == 0) continue;
2552 if (DeterminedAlready(quick_check, elm.cp_offset())) continue;
2553 RegExpCharacterClass* cc = elm.char_class();
2554 EmitCharClass(assembler, cc, one_byte, backtrack, cp_offset,
2555 *checked_up_to < cp_offset, preloaded, zone());
2556 UpdateBoundsCheck(cp_offset, checked_up_to);
2557 }
2558 }
2559 }
2560 }
2561
2562
2563 int TextNode::Length() {
2564 TextElement elm = elms_->last();
2565 DCHECK(elm.cp_offset() >= 0);
2566 return elm.cp_offset() + elm.length();
2567 }
2568
2569
2570 bool TextNode::SkipPass(int int_pass, bool ignore_case) {
2571 TextEmitPassType pass = static_cast<TextEmitPassType>(int_pass);
2572 if (ignore_case) {
2573 return pass == SIMPLE_CHARACTER_MATCH;
2574 } else {
2575 return pass == NON_LETTER_CHARACTER_MATCH || pass == CASE_CHARACTER_MATCH;
2576 }
2577 }
2578
2579
2580 // This generates the code to match a text node. A text node can contain
2581 // straight character sequences (possibly to be matched in a case-independent
2582 // way) and character classes. For efficiency we do not do this in a single
2583 // pass from left to right. Instead we pass over the text node several times,
2584 // emitting code for some character positions every time. See the comment on
2585 // TextEmitPass for details.
2586 void TextNode::Emit(RegExpCompiler* compiler, Trace* trace) {
2587 LimitResult limit_result = LimitVersions(compiler, trace);
2588 if (limit_result == DONE) return;
2589 DCHECK(limit_result == CONTINUE);
2590
2591 if (trace->cp_offset() + Length() > RegExpMacroAssembler::kMaxCPOffset) {
2592 compiler->SetRegExpTooBig();
2593 return;
2594 }
2595
2596 if (compiler->one_byte()) {
2597 int dummy = 0;
2598 TextEmitPass(compiler, NON_LATIN1_MATCH, false, trace, false, &dummy);
2599 }
2600
2601 bool first_elt_done = false;
2602 int bound_checked_to = trace->cp_offset() - 1;
2603 bound_checked_to += trace->bound_checked_up_to();
2604
2605 // If a character is preloaded into the current character register then
2606 // check that now.
2607 if (trace->characters_preloaded() == 1) {
2608 for (int pass = kFirstRealPass; pass <= kLastPass; pass++) {
2609 if (!SkipPass(pass, compiler->ignore_case())) {
2610 TextEmitPass(compiler,
2611 static_cast<TextEmitPassType>(pass),
2612 true,
2613 trace,
2614 false,
2615 &bound_checked_to);
2616 }
2617 }
2618 first_elt_done = true;
2619 }
2620
2621 for (int pass = kFirstRealPass; pass <= kLastPass; pass++) {
2622 if (!SkipPass(pass, compiler->ignore_case())) {
2623 TextEmitPass(compiler,
2624 static_cast<TextEmitPassType>(pass),
2625 false,
2626 trace,
2627 first_elt_done,
2628 &bound_checked_to);
2629 }
2630 }
2631
2632 Trace successor_trace(*trace);
2633 successor_trace.set_at_start(false);
2634 successor_trace.AdvanceCurrentPositionInTrace(Length(), compiler);
2635 RecursionCheck rc(compiler);
2636 on_success()->Emit(compiler, &successor_trace);
2637 }
2638
2639
2640 void Trace::InvalidateCurrentCharacter() {
2641 characters_preloaded_ = 0;
2642 }
2643
2644
2645 void Trace::AdvanceCurrentPositionInTrace(int by, RegExpCompiler* compiler) {
2646 DCHECK(by > 0);
2647 // We don't have an instruction for shifting the current character register
2648 // down or for using a shifted value for anything so lets just forget that
2649 // we preloaded any characters into it.
2650 characters_preloaded_ = 0;
2651 // Adjust the offsets of the quick check performed information. This
2652 // information is used to find out what we already determined about the
2653 // characters by means of mask and compare.
2654 quick_check_performed_.Advance(by, compiler->one_byte());
2655 cp_offset_ += by;
2656 if (cp_offset_ > RegExpMacroAssembler::kMaxCPOffset) {
2657 compiler->SetRegExpTooBig();
2658 cp_offset_ = 0;
2659 }
2660 bound_checked_up_to_ = Max(0, bound_checked_up_to_ - by);
2661 }
2662
2663
2664 void TextNode::MakeCaseIndependent(bool is_one_byte) {
2665 int element_count = elms_->length();
2666 for (int i = 0; i < element_count; i++) {
2667 TextElement elm = elms_->at(i);
2668 if (elm.text_type() == TextElement::CHAR_CLASS) {
2669 RegExpCharacterClass* cc = elm.char_class();
2670 // None of the standard character classes is different in the case
2671 // independent case and it slows us down if we don't know that.
2672 if (cc->is_standard(zone())) continue;
2673 ZoneList<CharacterRange>* ranges = cc->ranges(zone());
2674 int range_count = ranges->length();
2675 for (int j = 0; j < range_count; j++) {
2676 ranges->at(j).AddCaseEquivalents(ranges, is_one_byte, zone());
2677 }
2678 }
2679 }
2680 }
2681
2682
2683 int TextNode::GreedyLoopTextLength() {
2684 TextElement elm = elms_->at(elms_->length() - 1);
2685 return elm.cp_offset() + elm.length();
2686 }
2687
2688
2689 RegExpNode* TextNode::GetSuccessorOfOmnivorousTextNode(
2690 RegExpCompiler* compiler) {
2691 if (elms_->length() != 1) return NULL;
2692 TextElement elm = elms_->at(0);
2693 if (elm.text_type() != TextElement::CHAR_CLASS) return NULL;
2694 RegExpCharacterClass* node = elm.char_class();
2695 ZoneList<CharacterRange>* ranges = node->ranges(zone());
2696 if (!CharacterRange::IsCanonical(ranges)) {
2697 CharacterRange::Canonicalize(ranges);
2698 }
2699 if (node->is_negated()) {
2700 return ranges->length() == 0 ? on_success() : NULL;
2701 }
2702 if (ranges->length() != 1) return NULL;
2703 uint32_t max_char;
2704 if (compiler->one_byte()) {
2705 max_char = String::kMaxOneByteCharCode;
2706 } else {
2707 max_char = String::kMaxUtf16CodeUnit;
2708 }
2709 return ranges->at(0).IsEverything(max_char) ? on_success() : NULL;
2710 }
2711
2712
2713 // Finds the fixed match length of a sequence of nodes that goes from
2714 // this alternative and back to this choice node. If there are variable
2715 // length nodes or other complications in the way then return a sentinel
2716 // value indicating that a greedy loop cannot be constructed.
2717 int ChoiceNode::GreedyLoopTextLengthForAlternative(
2718 GuardedAlternative* alternative) {
2719 int length = 0;
2720 RegExpNode* node = alternative->node();
2721 // Later we will generate code for all these text nodes using recursion
2722 // so we have to limit the max number.
2723 int recursion_depth = 0;
2724 while (node != this) {
2725 if (recursion_depth++ > RegExpCompiler::kMaxRecursion) {
2726 return kNodeIsTooComplexForGreedyLoops;
2727 }
2728 int node_length = node->GreedyLoopTextLength();
2729 if (node_length == kNodeIsTooComplexForGreedyLoops) {
2730 return kNodeIsTooComplexForGreedyLoops;
2731 }
2732 length += node_length;
2733 SeqRegExpNode* seq_node = static_cast<SeqRegExpNode*>(node);
2734 node = seq_node->on_success();
2735 }
2736 return length;
2737 }
2738
2739
2740 void LoopChoiceNode::AddLoopAlternative(GuardedAlternative alt) {
2741 DCHECK_EQ(loop_node_, NULL);
2742 AddAlternative(alt);
2743 loop_node_ = alt.node();
2744 }
2745
2746
2747 void LoopChoiceNode::AddContinueAlternative(GuardedAlternative alt) {
2748 DCHECK_EQ(continue_node_, NULL);
2749 AddAlternative(alt);
2750 continue_node_ = alt.node();
2751 }
2752
2753
2754 void LoopChoiceNode::Emit(RegExpCompiler* compiler, Trace* trace) {
2755 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
2756 if (trace->stop_node() == this) {
2757 // Back edge of greedy optimized loop node graph.
2758 int text_length =
2759 GreedyLoopTextLengthForAlternative(&(alternatives_->at(0)));
2760 DCHECK(text_length != kNodeIsTooComplexForGreedyLoops);
2761 // Update the counter-based backtracking info on the stack. This is an
2762 // optimization for greedy loops (see below).
2763 DCHECK(trace->cp_offset() == text_length);
2764 macro_assembler->AdvanceCurrentPosition(text_length);
2765 macro_assembler->GoTo(trace->loop_label());
2766 return;
2767 }
2768 DCHECK(trace->stop_node() == NULL);
2769 if (!trace->is_trivial()) {
2770 trace->Flush(compiler, this);
2771 return;
2772 }
2773 ChoiceNode::Emit(compiler, trace);
2774 }
2775
2776
2777 int ChoiceNode::CalculatePreloadCharacters(RegExpCompiler* compiler,
2778 int eats_at_least) {
2779 int preload_characters = Min(4, eats_at_least);
2780 if (compiler->macro_assembler()->CanReadUnaligned()) {
2781 bool one_byte = compiler->one_byte();
2782 if (one_byte) {
2783 if (preload_characters > 4) preload_characters = 4;
2784 // We can't preload 3 characters because there is no machine instruction
2785 // to do that. We can't just load 4 because we could be reading
2786 // beyond the end of the string, which could cause a memory fault.
2787 if (preload_characters == 3) preload_characters = 2;
2788 } else {
2789 if (preload_characters > 2) preload_characters = 2;
2790 }
2791 } else {
2792 if (preload_characters > 1) preload_characters = 1;
2793 }
2794 return preload_characters;
2795 }
2796
2797
2798 // This class is used when generating the alternatives in a choice node. It
2799 // records the way the alternative is being code generated.
2800 class AlternativeGeneration: public Malloced {
2801 public:
2802 AlternativeGeneration()
2803 : possible_success(),
2804 expects_preload(false),
2805 after(),
2806 quick_check_details() { }
2807 Label possible_success;
2808 bool expects_preload;
2809 Label after;
2810 QuickCheckDetails quick_check_details;
2811 };
2812
2813
2814 // Creates a list of AlternativeGenerations. If the list has a reasonable
2815 // size then it is on the stack, otherwise the excess is on the heap.
2816 class AlternativeGenerationList {
2817 public:
2818 AlternativeGenerationList(int count, Zone* zone)
2819 : alt_gens_(count, zone) {
2820 for (int i = 0; i < count && i < kAFew; i++) {
2821 alt_gens_.Add(a_few_alt_gens_ + i, zone);
2822 }
2823 for (int i = kAFew; i < count; i++) {
2824 alt_gens_.Add(new AlternativeGeneration(), zone);
2825 }
2826 }
2827 ~AlternativeGenerationList() {
2828 for (int i = kAFew; i < alt_gens_.length(); i++) {
2829 delete alt_gens_[i];
2830 alt_gens_[i] = NULL;
2831 }
2832 }
2833
2834 AlternativeGeneration* at(int i) {
2835 return alt_gens_[i];
2836 }
2837
2838 private:
2839 static const int kAFew = 10;
2840 ZoneList<AlternativeGeneration*> alt_gens_;
2841 AlternativeGeneration a_few_alt_gens_[kAFew];
2842 };
2843
2844
2845 // The '2' variant is has inclusive from and exclusive to.
2846 // This covers \s as defined in ECMA-262 5.1, 15.10.2.12,
2847 // which include WhiteSpace (7.2) or LineTerminator (7.3) values.
2848 static const int kSpaceRanges[] = { '\t', '\r' + 1, ' ', ' ' + 1,
2849 0x00A0, 0x00A1, 0x1680, 0x1681, 0x180E, 0x180F, 0x2000, 0x200B,
2850 0x2028, 0x202A, 0x202F, 0x2030, 0x205F, 0x2060, 0x3000, 0x3001,
2851 0xFEFF, 0xFF00, 0x10000 };
2852 static const int kSpaceRangeCount = arraysize(kSpaceRanges);
2853
2854 static const int kWordRanges[] = {
2855 '0', '9' + 1, 'A', 'Z' + 1, '_', '_' + 1, 'a', 'z' + 1, 0x10000 };
2856 static const int kWordRangeCount = arraysize(kWordRanges);
2857 static const int kDigitRanges[] = { '0', '9' + 1, 0x10000 };
2858 static const int kDigitRangeCount = arraysize(kDigitRanges);
2859 static const int kSurrogateRanges[] = { 0xd800, 0xe000, 0x10000 };
2860 static const int kSurrogateRangeCount = arraysize(kSurrogateRanges);
2861 static const int kLineTerminatorRanges[] = { 0x000A, 0x000B, 0x000D, 0x000E,
2862 0x2028, 0x202A, 0x10000 };
2863 static const int kLineTerminatorRangeCount = arraysize(kLineTerminatorRanges);
2864
2865
2866 void BoyerMoorePositionInfo::Set(int character) {
2867 SetInterval(Interval(character, character));
2868 }
2869
2870
2871 void BoyerMoorePositionInfo::SetInterval(const Interval& interval) {
2872 s_ = AddRange(s_, kSpaceRanges, kSpaceRangeCount, interval);
2873 w_ = AddRange(w_, kWordRanges, kWordRangeCount, interval);
2874 d_ = AddRange(d_, kDigitRanges, kDigitRangeCount, interval);
2875 surrogate_ =
2876 AddRange(surrogate_, kSurrogateRanges, kSurrogateRangeCount, interval);
2877 if (interval.to() - interval.from() >= kMapSize - 1) {
2878 if (map_count_ != kMapSize) {
2879 map_count_ = kMapSize;
2880 for (int i = 0; i < kMapSize; i++) map_->at(i) = true;
2881 }
2882 return;
2883 }
2884 for (int i = interval.from(); i <= interval.to(); i++) {
2885 int mod_character = (i & kMask);
2886 if (!map_->at(mod_character)) {
2887 map_count_++;
2888 map_->at(mod_character) = true;
2889 }
2890 if (map_count_ == kMapSize) return;
2891 }
2892 }
2893
2894
2895 void BoyerMoorePositionInfo::SetAll() {
2896 s_ = w_ = d_ = kLatticeUnknown;
2897 if (map_count_ != kMapSize) {
2898 map_count_ = kMapSize;
2899 for (int i = 0; i < kMapSize; i++) map_->at(i) = true;
2900 }
2901 }
2902
2903
2904 BoyerMooreLookahead::BoyerMooreLookahead(
2905 int length, RegExpCompiler* compiler, Zone* zone)
2906 : length_(length),
2907 compiler_(compiler) {
2908 if (compiler->one_byte()) {
2909 max_char_ = String::kMaxOneByteCharCode;
2910 } else {
2911 max_char_ = String::kMaxUtf16CodeUnit;
2912 }
2913 bitmaps_ = new(zone) ZoneList<BoyerMoorePositionInfo*>(length, zone);
2914 for (int i = 0; i < length; i++) {
2915 bitmaps_->Add(new(zone) BoyerMoorePositionInfo(zone), zone);
2916 }
2917 }
2918
2919
2920 // Find the longest range of lookahead that has the fewest number of different
2921 // characters that can occur at a given position. Since we are optimizing two
2922 // different parameters at once this is a tradeoff.
2923 bool BoyerMooreLookahead::FindWorthwhileInterval(int* from, int* to) {
2924 int biggest_points = 0;
2925 // If more than 32 characters out of 128 can occur it is unlikely that we can
2926 // be lucky enough to step forwards much of the time.
2927 const int kMaxMax = 32;
2928 for (int max_number_of_chars = 4;
2929 max_number_of_chars < kMaxMax;
2930 max_number_of_chars *= 2) {
2931 biggest_points =
2932 FindBestInterval(max_number_of_chars, biggest_points, from, to);
2933 }
2934 if (biggest_points == 0) return false;
2935 return true;
2936 }
2937
2938
2939 // Find the highest-points range between 0 and length_ where the character
2940 // information is not too vague. 'Too vague' means that there are more than
2941 // max_number_of_chars that can occur at this position. Calculates the number
2942 // of points as the product of width-of-the-range and
2943 // probability-of-finding-one-of-the-characters, where the probability is
2944 // calculated using the frequency distribution of the sample subject string.
2945 int BoyerMooreLookahead::FindBestInterval(
2946 int max_number_of_chars, int old_biggest_points, int* from, int* to) {
2947 int biggest_points = old_biggest_points;
2948 static const int kSize = RegExpMacroAssembler::kTableSize;
2949 for (int i = 0; i < length_; ) {
2950 while (i < length_ && Count(i) > max_number_of_chars) i++;
2951 if (i == length_) break;
2952 int remembered_from = i;
2953 bool union_map[kSize];
2954 for (int j = 0; j < kSize; j++) union_map[j] = false;
2955 while (i < length_ && Count(i) <= max_number_of_chars) {
2956 BoyerMoorePositionInfo* map = bitmaps_->at(i);
2957 for (int j = 0; j < kSize; j++) union_map[j] |= map->at(j);
2958 i++;
2959 }
2960 int frequency = 0;
2961 for (int j = 0; j < kSize; j++) {
2962 if (union_map[j]) {
2963 // Add 1 to the frequency to give a small per-character boost for
2964 // the cases where our sampling is not good enough and many
2965 // characters have a frequency of zero. This means the frequency
2966 // can theoretically be up to 2*kSize though we treat it mostly as
2967 // a fraction of kSize.
2968 frequency += compiler_->frequency_collator()->Frequency(j) + 1;
2969 }
2970 }
2971 // We use the probability of skipping times the distance we are skipping to
2972 // judge the effectiveness of this. Actually we have a cut-off: By
2973 // dividing by 2 we switch off the skipping if the probability of skipping
2974 // is less than 50%. This is because the multibyte mask-and-compare
2975 // skipping in quickcheck is more likely to do well on this case.
2976 bool in_quickcheck_range =
2977 ((i - remembered_from < 4) ||
2978 (compiler_->one_byte() ? remembered_from <= 4 : remembered_from <= 2));
2979 // Called 'probability' but it is only a rough estimate and can actually
2980 // be outside the 0-kSize range.
2981 int probability = (in_quickcheck_range ? kSize / 2 : kSize) - frequency;
2982 int points = (i - remembered_from) * probability;
2983 if (points > biggest_points) {
2984 *from = remembered_from;
2985 *to = i - 1;
2986 biggest_points = points;
2987 }
2988 }
2989 return biggest_points;
2990 }
2991
2992
2993 // Take all the characters that will not prevent a successful match if they
2994 // occur in the subject string in the range between min_lookahead and
2995 // max_lookahead (inclusive) measured from the current position. If the
2996 // character at max_lookahead offset is not one of these characters, then we
2997 // can safely skip forwards by the number of characters in the range.
2998 int BoyerMooreLookahead::GetSkipTable(int min_lookahead,
2999 int max_lookahead,
3000 Handle<ByteArray> boolean_skip_table) {
3001 const int kSize = RegExpMacroAssembler::kTableSize;
3002
3003 const int kSkipArrayEntry = 0;
3004 const int kDontSkipArrayEntry = 1;
3005
3006 for (int i = 0; i < kSize; i++) {
3007 boolean_skip_table->set(i, kSkipArrayEntry);
3008 }
3009 int skip = max_lookahead + 1 - min_lookahead;
3010
3011 for (int i = max_lookahead; i >= min_lookahead; i--) {
3012 BoyerMoorePositionInfo* map = bitmaps_->at(i);
3013 for (int j = 0; j < kSize; j++) {
3014 if (map->at(j)) {
3015 boolean_skip_table->set(j, kDontSkipArrayEntry);
3016 }
3017 }
3018 }
3019
3020 return skip;
3021 }
3022
3023
3024 // See comment above on the implementation of GetSkipTable.
3025 void BoyerMooreLookahead::EmitSkipInstructions(RegExpMacroAssembler* masm) {
3026 const int kSize = RegExpMacroAssembler::kTableSize;
3027
3028 int min_lookahead = 0;
3029 int max_lookahead = 0;
3030
3031 if (!FindWorthwhileInterval(&min_lookahead, &max_lookahead)) return;
3032
3033 bool found_single_character = false;
3034 int single_character = 0;
3035 for (int i = max_lookahead; i >= min_lookahead; i--) {
3036 BoyerMoorePositionInfo* map = bitmaps_->at(i);
3037 if (map->map_count() > 1 ||
3038 (found_single_character && map->map_count() != 0)) {
3039 found_single_character = false;
3040 break;
3041 }
3042 for (int j = 0; j < kSize; j++) {
3043 if (map->at(j)) {
3044 found_single_character = true;
3045 single_character = j;
3046 break;
3047 }
3048 }
3049 }
3050
3051 int lookahead_width = max_lookahead + 1 - min_lookahead;
3052
3053 if (found_single_character && lookahead_width == 1 && max_lookahead < 3) {
3054 // The mask-compare can probably handle this better.
3055 return;
3056 }
3057
3058 if (found_single_character) {
3059 Label cont, again;
3060 masm->Bind(&again);
3061 masm->LoadCurrentCharacter(max_lookahead, &cont, true);
3062 if (max_char_ > kSize) {
3063 masm->CheckCharacterAfterAnd(single_character,
3064 RegExpMacroAssembler::kTableMask,
3065 &cont);
3066 } else {
3067 masm->CheckCharacter(single_character, &cont);
3068 }
3069 masm->AdvanceCurrentPosition(lookahead_width);
3070 masm->GoTo(&again);
3071 masm->Bind(&cont);
3072 return;
3073 }
3074
3075 Factory* factory = masm->zone()->isolate()->factory();
3076 Handle<ByteArray> boolean_skip_table = factory->NewByteArray(kSize, TENURED);
3077 int skip_distance = GetSkipTable(
3078 min_lookahead, max_lookahead, boolean_skip_table);
3079 DCHECK(skip_distance != 0);
3080
3081 Label cont, again;
3082 masm->Bind(&again);
3083 masm->LoadCurrentCharacter(max_lookahead, &cont, true);
3084 masm->CheckBitInTable(boolean_skip_table, &cont);
3085 masm->AdvanceCurrentPosition(skip_distance);
3086 masm->GoTo(&again);
3087 masm->Bind(&cont);
3088 }
3089
3090
3091 /* Code generation for choice nodes.
3092 *
3093 * We generate quick checks that do a mask and compare to eliminate a
3094 * choice. If the quick check succeeds then it jumps to the continuation to
3095 * do slow checks and check subsequent nodes. If it fails (the common case)
3096 * it falls through to the next choice.
3097 *
3098 * Here is the desired flow graph. Nodes directly below each other imply
3099 * fallthrough. Alternatives 1 and 2 have quick checks. Alternative
3100 * 3 doesn't have a quick check so we have to call the slow check.
3101 * Nodes are marked Qn for quick checks and Sn for slow checks. The entire
3102 * regexp continuation is generated directly after the Sn node, up to the
3103 * next GoTo if we decide to reuse some already generated code. Some
3104 * nodes expect preload_characters to be preloaded into the current
3105 * character register. R nodes do this preloading. Vertices are marked
3106 * F for failures and S for success (possible success in the case of quick
3107 * nodes). L, V, < and > are used as arrow heads.
3108 *
3109 * ----------> R
3110 * |
3111 * V
3112 * Q1 -----> S1
3113 * | S /
3114 * F| /
3115 * | F/
3116 * | /
3117 * | R
3118 * | /
3119 * V L
3120 * Q2 -----> S2
3121 * | S /
3122 * F| /
3123 * | F/
3124 * | /
3125 * | R
3126 * | /
3127 * V L
3128 * S3
3129 * |
3130 * F|
3131 * |
3132 * R
3133 * |
3134 * backtrack V
3135 * <----------Q4
3136 * \ F |
3137 * \ |S
3138 * \ F V
3139 * \-----S4
3140 *
3141 * For greedy loops we push the current position, then generate the code that
3142 * eats the input specially in EmitGreedyLoop. The other choice (the
3143 * continuation) is generated by the normal code in EmitChoices, and steps back
3144 * in the input to the starting position when it fails to match. The loop code
3145 * looks like this (U is the unwind code that steps back in the greedy loop).
3146 *
3147 * _____
3148 * / \
3149 * V |
3150 * ----------> S1 |
3151 * /| |
3152 * / |S |
3153 * F/ \_____/
3154 * /
3155 * |<-----
3156 * | \
3157 * V |S
3158 * Q2 ---> U----->backtrack
3159 * | F /
3160 * S| /
3161 * V F /
3162 * S2--/
3163 */
3164
3165 GreedyLoopState::GreedyLoopState(bool not_at_start) {
3166 counter_backtrack_trace_.set_backtrack(&label_);
3167 if (not_at_start) counter_backtrack_trace_.set_at_start(false);
3168 }
3169
3170
3171 void ChoiceNode::AssertGuardsMentionRegisters(Trace* trace) {
3172 #ifdef DEBUG
3173 int choice_count = alternatives_->length();
3174 for (int i = 0; i < choice_count - 1; i++) {
3175 GuardedAlternative alternative = alternatives_->at(i);
3176 ZoneList<Guard*>* guards = alternative.guards();
3177 int guard_count = (guards == NULL) ? 0 : guards->length();
3178 for (int j = 0; j < guard_count; j++) {
3179 DCHECK(!trace->mentions_reg(guards->at(j)->reg()));
3180 }
3181 }
3182 #endif
3183 }
3184
3185
3186 void ChoiceNode::SetUpPreLoad(RegExpCompiler* compiler,
3187 Trace* current_trace,
3188 PreloadState* state) {
3189 if (state->eats_at_least_ == PreloadState::kEatsAtLeastNotYetInitialized) {
3190 // Save some time by looking at most one machine word ahead.
3191 state->eats_at_least_ =
3192 EatsAtLeast(compiler->one_byte() ? 4 : 2, kRecursionBudget,
3193 current_trace->at_start() == Trace::FALSE_VALUE);
3194 }
3195 state->preload_characters_ =
3196 CalculatePreloadCharacters(compiler, state->eats_at_least_);
3197
3198 state->preload_is_current_ =
3199 (current_trace->characters_preloaded() == state->preload_characters_);
3200 state->preload_has_checked_bounds_ = state->preload_is_current_;
3201 }
3202
3203
3204 void ChoiceNode::Emit(RegExpCompiler* compiler, Trace* trace) {
3205 int choice_count = alternatives_->length();
3206
3207 AssertGuardsMentionRegisters(trace);
3208
3209 LimitResult limit_result = LimitVersions(compiler, trace);
3210 if (limit_result == DONE) return;
3211 DCHECK(limit_result == CONTINUE);
3212
3213 // For loop nodes we already flushed (see LoopChoiceNode::Emit), but for
3214 // other choice nodes we only flush if we are out of code size budget.
3215 if (trace->flush_budget() == 0 && trace->actions() != NULL) {
3216 trace->Flush(compiler, this);
3217 return;
3218 }
3219
3220 RecursionCheck rc(compiler);
3221
3222 PreloadState preload;
3223 preload.init();
3224 GreedyLoopState greedy_loop_state(not_at_start());
3225
3226 int text_length = GreedyLoopTextLengthForAlternative(&alternatives_->at(0));
3227 AlternativeGenerationList alt_gens(choice_count, zone());
3228
3229 if (choice_count > 1 && text_length != kNodeIsTooComplexForGreedyLoops) {
3230 trace = EmitGreedyLoop(compiler,
3231 trace,
3232 &alt_gens,
3233 &preload,
3234 &greedy_loop_state,
3235 text_length);
3236 } else {
3237 // TODO(erikcorry): Delete this. We don't need this label, but it makes us
3238 // match the traces produced pre-cleanup.
3239 Label second_choice;
3240 compiler->macro_assembler()->Bind(&second_choice);
3241
3242 preload.eats_at_least_ = EmitOptimizedUnanchoredSearch(compiler, trace);
3243
3244 EmitChoices(compiler,
3245 &alt_gens,
3246 0,
3247 trace,
3248 &preload);
3249 }
3250
3251 // At this point we need to generate slow checks for the alternatives where
3252 // the quick check was inlined. We can recognize these because the associated
3253 // label was bound.
3254 int new_flush_budget = trace->flush_budget() / choice_count;
3255 for (int i = 0; i < choice_count; i++) {
3256 AlternativeGeneration* alt_gen = alt_gens.at(i);
3257 Trace new_trace(*trace);
3258 // If there are actions to be flushed we have to limit how many times
3259 // they are flushed. Take the budget of the parent trace and distribute
3260 // it fairly amongst the children.
3261 if (new_trace.actions() != NULL) {
3262 new_trace.set_flush_budget(new_flush_budget);
3263 }
3264 bool next_expects_preload =
3265 i == choice_count - 1 ? false : alt_gens.at(i + 1)->expects_preload;
3266 EmitOutOfLineContinuation(compiler,
3267 &new_trace,
3268 alternatives_->at(i),
3269 alt_gen,
3270 preload.preload_characters_,
3271 next_expects_preload);
3272 }
3273 }
3274
3275
3276 Trace* ChoiceNode::EmitGreedyLoop(RegExpCompiler* compiler,
3277 Trace* trace,
3278 AlternativeGenerationList* alt_gens,
3279 PreloadState* preload,
3280 GreedyLoopState* greedy_loop_state,
3281 int text_length) {
3282 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3283 // Here we have special handling for greedy loops containing only text nodes
3284 // and other simple nodes. These are handled by pushing the current
3285 // position on the stack and then incrementing the current position each
3286 // time around the switch. On backtrack we decrement the current position
3287 // and check it against the pushed value. This avoids pushing backtrack
3288 // information for each iteration of the loop, which could take up a lot of
3289 // space.
3290 DCHECK(trace->stop_node() == NULL);
3291 macro_assembler->PushCurrentPosition();
3292 Label greedy_match_failed;
3293 Trace greedy_match_trace;
3294 if (not_at_start()) greedy_match_trace.set_at_start(false);
3295 greedy_match_trace.set_backtrack(&greedy_match_failed);
3296 Label loop_label;
3297 macro_assembler->Bind(&loop_label);
3298 greedy_match_trace.set_stop_node(this);
3299 greedy_match_trace.set_loop_label(&loop_label);
3300 alternatives_->at(0).node()->Emit(compiler, &greedy_match_trace);
3301 macro_assembler->Bind(&greedy_match_failed);
3302
3303 Label second_choice; // For use in greedy matches.
3304 macro_assembler->Bind(&second_choice);
3305
3306 Trace* new_trace = greedy_loop_state->counter_backtrack_trace();
3307
3308 EmitChoices(compiler,
3309 alt_gens,
3310 1,
3311 new_trace,
3312 preload);
3313
3314 macro_assembler->Bind(greedy_loop_state->label());
3315 // If we have unwound to the bottom then backtrack.
3316 macro_assembler->CheckGreedyLoop(trace->backtrack());
3317 // Otherwise try the second priority at an earlier position.
3318 macro_assembler->AdvanceCurrentPosition(-text_length);
3319 macro_assembler->GoTo(&second_choice);
3320 return new_trace;
3321 }
3322
3323 int ChoiceNode::EmitOptimizedUnanchoredSearch(RegExpCompiler* compiler,
3324 Trace* trace) {
3325 int eats_at_least = PreloadState::kEatsAtLeastNotYetInitialized;
3326 if (alternatives_->length() != 2) return eats_at_least;
3327
3328 GuardedAlternative alt1 = alternatives_->at(1);
3329 if (alt1.guards() != NULL && alt1.guards()->length() != 0) {
3330 return eats_at_least;
3331 }
3332 RegExpNode* eats_anything_node = alt1.node();
3333 if (eats_anything_node->GetSuccessorOfOmnivorousTextNode(compiler) != this) {
3334 return eats_at_least;
3335 }
3336
3337 // Really we should be creating a new trace when we execute this function,
3338 // but there is no need, because the code it generates cannot backtrack, and
3339 // we always arrive here with a trivial trace (since it's the entry to a
3340 // loop. That also implies that there are no preloaded characters, which is
3341 // good, because it means we won't be violating any assumptions by
3342 // overwriting those characters with new load instructions.
3343 DCHECK(trace->is_trivial());
3344
3345 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3346 // At this point we know that we are at a non-greedy loop that will eat
3347 // any character one at a time. Any non-anchored regexp has such a
3348 // loop prepended to it in order to find where it starts. We look for
3349 // a pattern of the form ...abc... where we can look 6 characters ahead
3350 // and step forwards 3 if the character is not one of abc. Abc need
3351 // not be atoms, they can be any reasonably limited character class or
3352 // small alternation.
3353 BoyerMooreLookahead* bm = bm_info(false);
3354 if (bm == NULL) {
3355 eats_at_least = Min(kMaxLookaheadForBoyerMoore,
3356 EatsAtLeast(kMaxLookaheadForBoyerMoore,
3357 kRecursionBudget,
3358 false));
3359 if (eats_at_least >= 1) {
3360 bm = new(zone()) BoyerMooreLookahead(eats_at_least,
3361 compiler,
3362 zone());
3363 GuardedAlternative alt0 = alternatives_->at(0);
3364 alt0.node()->FillInBMInfo(0, kRecursionBudget, bm, false);
3365 }
3366 }
3367 if (bm != NULL) {
3368 bm->EmitSkipInstructions(macro_assembler);
3369 }
3370 return eats_at_least;
3371 }
3372
3373
3374 void ChoiceNode::EmitChoices(RegExpCompiler* compiler,
3375 AlternativeGenerationList* alt_gens,
3376 int first_choice,
3377 Trace* trace,
3378 PreloadState* preload) {
3379 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3380 SetUpPreLoad(compiler, trace, preload);
3381
3382 // For now we just call all choices one after the other. The idea ultimately
3383 // is to use the Dispatch table to try only the relevant ones.
3384 int choice_count = alternatives_->length();
3385
3386 int new_flush_budget = trace->flush_budget() / choice_count;
3387
3388 for (int i = first_choice; i < choice_count; i++) {
3389 bool is_last = i == choice_count - 1;
3390 bool fall_through_on_failure = !is_last;
3391 GuardedAlternative alternative = alternatives_->at(i);
3392 AlternativeGeneration* alt_gen = alt_gens->at(i);
3393 alt_gen->quick_check_details.set_characters(preload->preload_characters_);
3394 ZoneList<Guard*>* guards = alternative.guards();
3395 int guard_count = (guards == NULL) ? 0 : guards->length();
3396 Trace new_trace(*trace);
3397 new_trace.set_characters_preloaded(preload->preload_is_current_ ?
3398 preload->preload_characters_ :
3399 0);
3400 if (preload->preload_has_checked_bounds_) {
3401 new_trace.set_bound_checked_up_to(preload->preload_characters_);
3402 }
3403 new_trace.quick_check_performed()->Clear();
3404 if (not_at_start_) new_trace.set_at_start(Trace::FALSE_VALUE);
3405 if (!is_last) {
3406 new_trace.set_backtrack(&alt_gen->after);
3407 }
3408 alt_gen->expects_preload = preload->preload_is_current_;
3409 bool generate_full_check_inline = false;
3410 if (FLAG_regexp_optimization &&
3411 try_to_emit_quick_check_for_alternative(i == 0) &&
3412 alternative.node()->EmitQuickCheck(compiler,
3413 trace,
3414 &new_trace,
3415 preload->preload_has_checked_bounds_,
3416 &alt_gen->possible_success,
3417 &alt_gen->quick_check_details,
3418 fall_through_on_failure)) {
3419 // Quick check was generated for this choice.
3420 preload->preload_is_current_ = true;
3421 preload->preload_has_checked_bounds_ = true;
3422 // If we generated the quick check to fall through on possible success,
3423 // we now need to generate the full check inline.
3424 if (!fall_through_on_failure) {
3425 macro_assembler->Bind(&alt_gen->possible_success);
3426 new_trace.set_quick_check_performed(&alt_gen->quick_check_details);
3427 new_trace.set_characters_preloaded(preload->preload_characters_);
3428 new_trace.set_bound_checked_up_to(preload->preload_characters_);
3429 generate_full_check_inline = true;
3430 }
3431 } else if (alt_gen->quick_check_details.cannot_match()) {
3432 if (!fall_through_on_failure) {
3433 macro_assembler->GoTo(trace->backtrack());
3434 }
3435 continue;
3436 } else {
3437 // No quick check was generated. Put the full code here.
3438 // If this is not the first choice then there could be slow checks from
3439 // previous cases that go here when they fail. There's no reason to
3440 // insist that they preload characters since the slow check we are about
3441 // to generate probably can't use it.
3442 if (i != first_choice) {
3443 alt_gen->expects_preload = false;
3444 new_trace.InvalidateCurrentCharacter();
3445 }
3446 generate_full_check_inline = true;
3447 }
3448 if (generate_full_check_inline) {
3449 if (new_trace.actions() != NULL) {
3450 new_trace.set_flush_budget(new_flush_budget);
3451 }
3452 for (int j = 0; j < guard_count; j++) {
3453 GenerateGuard(macro_assembler, guards->at(j), &new_trace);
3454 }
3455 alternative.node()->Emit(compiler, &new_trace);
3456 preload->preload_is_current_ = false;
3457 }
3458 macro_assembler->Bind(&alt_gen->after);
3459 }
3460 }
3461
3462
3463 void ChoiceNode::EmitOutOfLineContinuation(RegExpCompiler* compiler,
3464 Trace* trace,
3465 GuardedAlternative alternative,
3466 AlternativeGeneration* alt_gen,
3467 int preload_characters,
3468 bool next_expects_preload) {
3469 if (!alt_gen->possible_success.is_linked()) return;
3470
3471 RegExpMacroAssembler* macro_assembler = compiler->macro_assembler();
3472 macro_assembler->Bind(&alt_gen->possible_success);
3473 Trace out_of_line_trace(*trace);
3474 out_of_line_trace.set_characters_preloaded(preload_characters);
3475 out_of_line_trace.set_quick_check_performed(&alt_gen->quick_check_details);
3476 if (not_at_start_) out_of_line_trace.set_at_start(Trace::FALSE_VALUE);
3477 ZoneList<Guard*>* guards = alternative.guards();
3478 int guard_count = (guards == NULL) ? 0 : guards->length();
3479 if (next_expects_preload) {
3480 Label reload_current_char;
3481 out_of_line_trace.set_backtrack(&reload_current_char);
3482 for (int j = 0; j < guard_count; j++) {
3483 GenerateGuard(macro_assembler, guards->at(j), &out_of_line_trace);
3484 }
3485 alternative.node()->Emit(compiler, &out_of_line_trace);
3486 macro_assembler->Bind(&reload_current_char);
3487 // Reload the current character, since the next quick check expects that.
3488 // We don't need to check bounds here because we only get into this
3489 // code through a quick check which already did the checked load.
3490 macro_assembler->LoadCurrentCharacter(trace->cp_offset(),
3491 NULL,
3492 false,
3493 preload_characters);
3494 macro_assembler->GoTo(&(alt_gen->after));
3495 } else {
3496 out_of_line_trace.set_backtrack(&(alt_gen->after));
3497 for (int j = 0; j < guard_count; j++) {
3498 GenerateGuard(macro_assembler, guards->at(j), &out_of_line_trace);
3499 }
3500 alternative.node()->Emit(compiler, &out_of_line_trace);
3501 }
3502 }
3503
3504
3505 void ActionNode::Emit(RegExpCompiler* compiler, Trace* trace) {
3506 RegExpMacroAssembler* assembler = compiler->macro_assembler();
3507 LimitResult limit_result = LimitVersions(compiler, trace);
3508 if (limit_result == DONE) return;
3509 DCHECK(limit_result == CONTINUE);
3510
3511 RecursionCheck rc(compiler);
3512
3513 switch (action_type_) {
3514 case STORE_POSITION: {
3515 Trace::DeferredCapture
3516 new_capture(data_.u_position_register.reg,
3517 data_.u_position_register.is_capture,
3518 trace);
3519 Trace new_trace = *trace;
3520 new_trace.add_action(&new_capture);
3521 on_success()->Emit(compiler, &new_trace);
3522 break;
3523 }
3524 case INCREMENT_REGISTER: {
3525 Trace::DeferredIncrementRegister
3526 new_increment(data_.u_increment_register.reg);
3527 Trace new_trace = *trace;
3528 new_trace.add_action(&new_increment);
3529 on_success()->Emit(compiler, &new_trace);
3530 break;
3531 }
3532 case SET_REGISTER: {
3533 Trace::DeferredSetRegister
3534 new_set(data_.u_store_register.reg, data_.u_store_register.value);
3535 Trace new_trace = *trace;
3536 new_trace.add_action(&new_set);
3537 on_success()->Emit(compiler, &new_trace);
3538 break;
3539 }
3540 case CLEAR_CAPTURES: {
3541 Trace::DeferredClearCaptures
3542 new_capture(Interval(data_.u_clear_captures.range_from,
3543 data_.u_clear_captures.range_to));
3544 Trace new_trace = *trace;
3545 new_trace.add_action(&new_capture);
3546 on_success()->Emit(compiler, &new_trace);
3547 break;
3548 }
3549 case BEGIN_SUBMATCH:
3550 if (!trace->is_trivial()) {
3551 trace->Flush(compiler, this);
3552 } else {
3553 assembler->WriteCurrentPositionToRegister(
3554 data_.u_submatch.current_position_register, 0);
3555 assembler->WriteStackPointerToRegister(
3556 data_.u_submatch.stack_pointer_register);
3557 on_success()->Emit(compiler, trace);
3558 }
3559 break;
3560 case EMPTY_MATCH_CHECK: {
3561 int start_pos_reg = data_.u_empty_match_check.start_register;
3562 int stored_pos = 0;
3563 int rep_reg = data_.u_empty_match_check.repetition_register;
3564 bool has_minimum = (rep_reg != RegExpCompiler::kNoRegister);
3565 bool know_dist = trace->GetStoredPosition(start_pos_reg, &stored_pos);
3566 if (know_dist && !has_minimum && stored_pos == trace->cp_offset()) {
3567 // If we know we haven't advanced and there is no minimum we
3568 // can just backtrack immediately.
3569 assembler->GoTo(trace->backtrack());
3570 } else if (know_dist && stored_pos < trace->cp_offset()) {
3571 // If we know we've advanced we can generate the continuation
3572 // immediately.
3573 on_success()->Emit(compiler, trace);
3574 } else if (!trace->is_trivial()) {
3575 trace->Flush(compiler, this);
3576 } else {
3577 Label skip_empty_check;
3578 // If we have a minimum number of repetitions we check the current
3579 // number first and skip the empty check if it's not enough.
3580 if (has_minimum) {
3581 int limit = data_.u_empty_match_check.repetition_limit;
3582 assembler->IfRegisterLT(rep_reg, limit, &skip_empty_check);
3583 }
3584 // If the match is empty we bail out, otherwise we fall through
3585 // to the on-success continuation.
3586 assembler->IfRegisterEqPos(data_.u_empty_match_check.start_register,
3587 trace->backtrack());
3588 assembler->Bind(&skip_empty_check);
3589 on_success()->Emit(compiler, trace);
3590 }
3591 break;
3592 }
3593 case POSITIVE_SUBMATCH_SUCCESS: {
3594 if (!trace->is_trivial()) {
3595 trace->Flush(compiler, this);
3596 return;
3597 }
3598 assembler->ReadCurrentPositionFromRegister(
3599 data_.u_submatch.current_position_register);
3600 assembler->ReadStackPointerFromRegister(
3601 data_.u_submatch.stack_pointer_register);
3602 int clear_register_count = data_.u_submatch.clear_register_count;
3603 if (clear_register_count == 0) {
3604 on_success()->Emit(compiler, trace);
3605 return;
3606 }
3607 int clear_registers_from = data_.u_submatch.clear_register_from;
3608 Label clear_registers_backtrack;
3609 Trace new_trace = *trace;
3610 new_trace.set_backtrack(&clear_registers_backtrack);
3611 on_success()->Emit(compiler, &new_trace);
3612
3613 assembler->Bind(&clear_registers_backtrack);
3614 int clear_registers_to = clear_registers_from + clear_register_count - 1;
3615 assembler->ClearRegisters(clear_registers_from, clear_registers_to);
3616
3617 DCHECK(trace->backtrack() == NULL);
3618 assembler->Backtrack();
3619 return;
3620 }
3621 default:
3622 UNREACHABLE();
3623 }
3624 }
3625
3626
3627 void BackReferenceNode::Emit(RegExpCompiler* compiler, Trace* trace) {
3628 RegExpMacroAssembler* assembler = compiler->macro_assembler();
3629 if (!trace->is_trivial()) {
3630 trace->Flush(compiler, this);
3631 return;
3632 }
3633
3634 LimitResult limit_result = LimitVersions(compiler, trace);
3635 if (limit_result == DONE) return;
3636 DCHECK(limit_result == CONTINUE);
3637
3638 RecursionCheck rc(compiler);
3639
3640 DCHECK_EQ(start_reg_ + 1, end_reg_);
3641 if (compiler->ignore_case()) {
3642 assembler->CheckNotBackReferenceIgnoreCase(start_reg_,
3643 trace->backtrack());
3644 } else {
3645 assembler->CheckNotBackReference(start_reg_, trace->backtrack());
3646 }
3647 on_success()->Emit(compiler, trace);
3648 }
3649
3650
3651 // -------------------------------------------------------------------
3652 // Dot/dotty output
3653
3654
3655 #ifdef DEBUG
3656
3657
3658 class DotPrinter: public NodeVisitor {
3659 public:
3660 DotPrinter(OStream& os, bool ignore_case) // NOLINT
3661 : os_(os),
3662 ignore_case_(ignore_case) {}
3663 void PrintNode(const char* label, RegExpNode* node);
3664 void Visit(RegExpNode* node);
3665 void PrintAttributes(RegExpNode* from);
3666 void PrintOnFailure(RegExpNode* from, RegExpNode* to);
3667 #define DECLARE_VISIT(Type) \
3668 virtual void Visit##Type(Type##Node* that);
3669 FOR_EACH_NODE_TYPE(DECLARE_VISIT)
3670 #undef DECLARE_VISIT
3671 private:
3672 OStream& os_;
3673 bool ignore_case_;
3674 };
3675
3676
3677 void DotPrinter::PrintNode(const char* label, RegExpNode* node) {
3678 os_ << "digraph G {\n graph [label=\"";
3679 for (int i = 0; label[i]; i++) {
3680 switch (label[i]) {
3681 case '\\':
3682 os_ << "\\\\";
3683 break;
3684 case '"':
3685 os_ << "\"";
3686 break;
3687 default:
3688 os_ << label[i];
3689 break;
3690 }
3691 }
3692 os_ << "\"];\n";
3693 Visit(node);
3694 os_ << "}" << endl;
3695 }
3696
3697
3698 void DotPrinter::Visit(RegExpNode* node) {
3699 if (node->info()->visited) return;
3700 node->info()->visited = true;
3701 node->Accept(this);
3702 }
3703
3704
3705 void DotPrinter::PrintOnFailure(RegExpNode* from, RegExpNode* on_failure) {
3706 os_ << " n" << from << " -> n" << on_failure << " [style=dotted];\n";
3707 Visit(on_failure);
3708 }
3709
3710
3711 class TableEntryBodyPrinter {
3712 public:
3713 TableEntryBodyPrinter(OStream& os, ChoiceNode* choice) // NOLINT
3714 : os_(os),
3715 choice_(choice) {}
3716 void Call(uc16 from, DispatchTable::Entry entry) {
3717 OutSet* out_set = entry.out_set();
3718 for (unsigned i = 0; i < OutSet::kFirstLimit; i++) {
3719 if (out_set->Get(i)) {
3720 os_ << " n" << choice() << ":s" << from << "o" << i << " -> n"
3721 << choice()->alternatives()->at(i).node() << ";\n";
3722 }
3723 }
3724 }
3725 private:
3726 ChoiceNode* choice() { return choice_; }
3727 OStream& os_;
3728 ChoiceNode* choice_;
3729 };
3730
3731
3732 class TableEntryHeaderPrinter {
3733 public:
3734 explicit TableEntryHeaderPrinter(OStream& os) // NOLINT
3735 : first_(true),
3736 os_(os) {}
3737 void Call(uc16 from, DispatchTable::Entry entry) {
3738 if (first_) {
3739 first_ = false;
3740 } else {
3741 os_ << "|";
3742 }
3743 os_ << "{\\" << AsUC16(from) << "-\\" << AsUC16(entry.to()) << "|{";
3744 OutSet* out_set = entry.out_set();
3745 int priority = 0;
3746 for (unsigned i = 0; i < OutSet::kFirstLimit; i++) {
3747 if (out_set->Get(i)) {
3748 if (priority > 0) os_ << "|";
3749 os_ << "<s" << from << "o" << i << "> " << priority;
3750 priority++;
3751 }
3752 }
3753 os_ << "}}";
3754 }
3755
3756 private:
3757 bool first_;
3758 OStream& os_;
3759 };
3760
3761
3762 class AttributePrinter {
3763 public:
3764 explicit AttributePrinter(OStream& os) // NOLINT
3765 : os_(os),
3766 first_(true) {}
3767 void PrintSeparator() {
3768 if (first_) {
3769 first_ = false;
3770 } else {
3771 os_ << "|";
3772 }
3773 }
3774 void PrintBit(const char* name, bool value) {
3775 if (!value) return;
3776 PrintSeparator();
3777 os_ << "{" << name << "}";
3778 }
3779 void PrintPositive(const char* name, int value) {
3780 if (value < 0) return;
3781 PrintSeparator();
3782 os_ << "{" << name << "|" << value << "}";
3783 }
3784
3785 private:
3786 OStream& os_;
3787 bool first_;
3788 };
3789
3790
3791 void DotPrinter::PrintAttributes(RegExpNode* that) {
3792 os_ << " a" << that << " [shape=Mrecord, color=grey, fontcolor=grey, "
3793 << "margin=0.1, fontsize=10, label=\"{";
3794 AttributePrinter printer(os_);
3795 NodeInfo* info = that->info();
3796 printer.PrintBit("NI", info->follows_newline_interest);
3797 printer.PrintBit("WI", info->follows_word_interest);
3798 printer.PrintBit("SI", info->follows_start_interest);
3799 Label* label = that->label();
3800 if (label->is_bound())
3801 printer.PrintPositive("@", label->pos());
3802 os_ << "}\"];\n"
3803 << " a" << that << " -> n" << that
3804 << " [style=dashed, color=grey, arrowhead=none];\n";
3805 }
3806
3807
3808 static const bool kPrintDispatchTable = false;
3809 void DotPrinter::VisitChoice(ChoiceNode* that) {
3810 if (kPrintDispatchTable) {
3811 os_ << " n" << that << " [shape=Mrecord, label=\"";
3812 TableEntryHeaderPrinter header_printer(os_);
3813 that->GetTable(ignore_case_)->ForEach(&header_printer);
3814 os_ << "\"]\n";
3815 PrintAttributes(that);
3816 TableEntryBodyPrinter body_printer(os_, that);
3817 that->GetTable(ignore_case_)->ForEach(&body_printer);
3818 } else {
3819 os_ << " n" << that << " [shape=Mrecord, label=\"?\"];\n";
3820 for (int i = 0; i < that->alternatives()->length(); i++) {
3821 GuardedAlternative alt = that->alternatives()->at(i);
3822 os_ << " n" << that << " -> n" << alt.node();
3823 }
3824 }
3825 for (int i = 0; i < that->alternatives()->length(); i++) {
3826 GuardedAlternative alt = that->alternatives()->at(i);
3827 alt.node()->Accept(this);
3828 }
3829 }
3830
3831
3832 void DotPrinter::VisitText(TextNode* that) {
3833 Zone* zone = that->zone();
3834 os_ << " n" << that << " [label=\"";
3835 for (int i = 0; i < that->elements()->length(); i++) {
3836 if (i > 0) os_ << " ";
3837 TextElement elm = that->elements()->at(i);
3838 switch (elm.text_type()) {
3839 case TextElement::ATOM: {
3840 Vector<const uc16> data = elm.atom()->data();
3841 for (int i = 0; i < data.length(); i++) {
3842 os_ << static_cast<char>(data[i]);
3843 }
3844 break;
3845 }
3846 case TextElement::CHAR_CLASS: {
3847 RegExpCharacterClass* node = elm.char_class();
3848 os_ << "[";
3849 if (node->is_negated()) os_ << "^";
3850 for (int j = 0; j < node->ranges(zone)->length(); j++) {
3851 CharacterRange range = node->ranges(zone)->at(j);
3852 os_ << AsUC16(range.from()) << "-" << AsUC16(range.to());
3853 }
3854 os_ << "]";
3855 break;
3856 }
3857 default:
3858 UNREACHABLE();
3859 }
3860 }
3861 os_ << "\", shape=box, peripheries=2];\n";
3862 PrintAttributes(that);
3863 os_ << " n" << that << " -> n" << that->on_success() << ";\n";
3864 Visit(that->on_success());
3865 }
3866
3867
3868 void DotPrinter::VisitBackReference(BackReferenceNode* that) {
3869 os_ << " n" << that << " [label=\"$" << that->start_register() << "..$"
3870 << that->end_register() << "\", shape=doubleoctagon];\n";
3871 PrintAttributes(that);
3872 os_ << " n" << that << " -> n" << that->on_success() << ";\n";
3873 Visit(that->on_success());
3874 }
3875
3876
3877 void DotPrinter::VisitEnd(EndNode* that) {
3878 os_ << " n" << that << " [style=bold, shape=point];\n";
3879 PrintAttributes(that);
3880 }
3881
3882
3883 void DotPrinter::VisitAssertion(AssertionNode* that) {
3884 os_ << " n" << that << " [";
3885 switch (that->assertion_type()) {
3886 case AssertionNode::AT_END:
3887 os_ << "label=\"$\", shape=septagon";
3888 break;
3889 case AssertionNode::AT_START:
3890 os_ << "label=\"^\", shape=septagon";
3891 break;
3892 case AssertionNode::AT_BOUNDARY:
3893 os_ << "label=\"\\b\", shape=septagon";
3894 break;
3895 case AssertionNode::AT_NON_BOUNDARY:
3896 os_ << "label=\"\\B\", shape=septagon";
3897 break;
3898 case AssertionNode::AFTER_NEWLINE:
3899 os_ << "label=\"(?<=\\n)\", shape=septagon";
3900 break;
3901 }
3902 os_ << "];\n";
3903 PrintAttributes(that);
3904 RegExpNode* successor = that->on_success();
3905 os_ << " n" << that << " -> n" << successor << ";\n";
3906 Visit(successor);
3907 }
3908
3909
3910 void DotPrinter::VisitAction(ActionNode* that) {
3911 os_ << " n" << that << " [";
3912 switch (that->action_type_) {
3913 case ActionNode::SET_REGISTER:
3914 os_ << "label=\"$" << that->data_.u_store_register.reg
3915 << ":=" << that->data_.u_store_register.value << "\", shape=octagon";
3916 break;
3917 case ActionNode::INCREMENT_REGISTER:
3918 os_ << "label=\"$" << that->data_.u_increment_register.reg
3919 << "++\", shape=octagon";
3920 break;
3921 case ActionNode::STORE_POSITION:
3922 os_ << "label=\"$" << that->data_.u_position_register.reg
3923 << ":=$pos\", shape=octagon";
3924 break;
3925 case ActionNode::BEGIN_SUBMATCH:
3926 os_ << "label=\"$" << that->data_.u_submatch.current_position_register
3927 << ":=$pos,begin\", shape=septagon";
3928 break;
3929 case ActionNode::POSITIVE_SUBMATCH_SUCCESS:
3930 os_ << "label=\"escape\", shape=septagon";
3931 break;
3932 case ActionNode::EMPTY_MATCH_CHECK:
3933 os_ << "label=\"$" << that->data_.u_empty_match_check.start_register
3934 << "=$pos?,$" << that->data_.u_empty_match_check.repetition_register
3935 << "<" << that->data_.u_empty_match_check.repetition_limit
3936 << "?\", shape=septagon";
3937 break;
3938 case ActionNode::CLEAR_CAPTURES: {
3939 os_ << "label=\"clear $" << that->data_.u_clear_captures.range_from
3940 << " to $" << that->data_.u_clear_captures.range_to
3941 << "\", shape=septagon";
3942 break;
3943 }
3944 }
3945 os_ << "];\n";
3946 PrintAttributes(that);
3947 RegExpNode* successor = that->on_success();
3948 os_ << " n" << that << " -> n" << successor << ";\n";
3949 Visit(successor);
3950 }
3951
3952
3953 class DispatchTableDumper {
3954 public:
3955 explicit DispatchTableDumper(OStream& os) : os_(os) {}
3956 void Call(uc16 key, DispatchTable::Entry entry);
3957 private:
3958 OStream& os_;
3959 };
3960
3961
3962 void DispatchTableDumper::Call(uc16 key, DispatchTable::Entry entry) {
3963 os_ << "[" << AsUC16(key) << "-" << AsUC16(entry.to()) << "]: {";
3964 OutSet* set = entry.out_set();
3965 bool first = true;
3966 for (unsigned i = 0; i < OutSet::kFirstLimit; i++) {
3967 if (set->Get(i)) {
3968 if (first) {
3969 first = false;
3970 } else {
3971 os_ << ", ";
3972 }
3973 os_ << i;
3974 }
3975 }
3976 os_ << "}\n";
3977 }
3978
3979
3980 void DispatchTable::Dump() {
3981 OFStream os(stderr);
3982 DispatchTableDumper dumper(os);
3983 tree()->ForEach(&dumper);
3984 }
3985
3986
3987 void RegExpEngine::DotPrint(const char* label,
3988 RegExpNode* node,
3989 bool ignore_case) {
3990 OFStream os(stdout);
3991 DotPrinter printer(os, ignore_case);
3992 printer.PrintNode(label, node);
3993 }
3994
3995
3996 #endif // DEBUG
3997
3998
3999 // -------------------------------------------------------------------
4000 // Tree to graph conversion
4001
4002 RegExpNode* RegExpAtom::ToNode(RegExpCompiler* compiler,
4003 RegExpNode* on_success) {
4004 ZoneList<TextElement>* elms =
4005 new(compiler->zone()) ZoneList<TextElement>(1, compiler->zone());
4006 elms->Add(TextElement::Atom(this), compiler->zone());
4007 return new(compiler->zone()) TextNode(elms, on_success);
4008 }
4009
4010
4011 RegExpNode* RegExpText::ToNode(RegExpCompiler* compiler,
4012 RegExpNode* on_success) {
4013 return new(compiler->zone()) TextNode(elements(), on_success);
4014 }
4015
4016
4017 static bool CompareInverseRanges(ZoneList<CharacterRange>* ranges,
4018 const int* special_class,
4019 int length) {
4020 length--; // Remove final 0x10000.
4021 DCHECK(special_class[length] == 0x10000);
4022 DCHECK(ranges->length() != 0);
4023 DCHECK(length != 0);
4024 DCHECK(special_class[0] != 0);
4025 if (ranges->length() != (length >> 1) + 1) {
4026 return false;
4027 }
4028 CharacterRange range = ranges->at(0);
4029 if (range.from() != 0) {
4030 return false;
4031 }
4032 for (int i = 0; i < length; i += 2) {
4033 if (special_class[i] != (range.to() + 1)) {
4034 return false;
4035 }
4036 range = ranges->at((i >> 1) + 1);
4037 if (special_class[i+1] != range.from()) {
4038 return false;
4039 }
4040 }
4041 if (range.to() != 0xffff) {
4042 return false;
4043 }
4044 return true;
4045 }
4046
4047
4048 static bool CompareRanges(ZoneList<CharacterRange>* ranges,
4049 const int* special_class,
4050 int length) {
4051 length--; // Remove final 0x10000.
4052 DCHECK(special_class[length] == 0x10000);
4053 if (ranges->length() * 2 != length) {
4054 return false;
4055 }
4056 for (int i = 0; i < length; i += 2) {
4057 CharacterRange range = ranges->at(i >> 1);
4058 if (range.from() != special_class[i] ||
4059 range.to() != special_class[i + 1] - 1) {
4060 return false;
4061 }
4062 }
4063 return true;
4064 }
4065
4066
4067 bool RegExpCharacterClass::is_standard(Zone* zone) {
4068 // TODO(lrn): Remove need for this function, by not throwing away information
4069 // along the way.
4070 if (is_negated_) {
4071 return false;
4072 }
4073 if (set_.is_standard()) {
4074 return true;
4075 }
4076 if (CompareRanges(set_.ranges(zone), kSpaceRanges, kSpaceRangeCount)) {
4077 set_.set_standard_set_type('s');
4078 return true;
4079 }
4080 if (CompareInverseRanges(set_.ranges(zone), kSpaceRanges, kSpaceRangeCount)) {
4081 set_.set_standard_set_type('S');
4082 return true;
4083 }
4084 if (CompareInverseRanges(set_.ranges(zone),
4085 kLineTerminatorRanges,
4086 kLineTerminatorRangeCount)) {
4087 set_.set_standard_set_type('.');
4088 return true;
4089 }
4090 if (CompareRanges(set_.ranges(zone),
4091 kLineTerminatorRanges,
4092 kLineTerminatorRangeCount)) {
4093 set_.set_standard_set_type('n');
4094 return true;
4095 }
4096 if (CompareRanges(set_.ranges(zone), kWordRanges, kWordRangeCount)) {
4097 set_.set_standard_set_type('w');
4098 return true;
4099 }
4100 if (CompareInverseRanges(set_.ranges(zone), kWordRanges, kWordRangeCount)) {
4101 set_.set_standard_set_type('W');
4102 return true;
4103 }
4104 return false;
4105 }
4106
4107
4108 RegExpNode* RegExpCharacterClass::ToNode(RegExpCompiler* compiler,
4109 RegExpNode* on_success) {
4110 return new(compiler->zone()) TextNode(this, on_success);
4111 }
4112
4113
4114 RegExpNode* RegExpDisjunction::ToNode(RegExpCompiler* compiler,
4115 RegExpNode* on_success) {
4116 ZoneList<RegExpTree*>* alternatives = this->alternatives();
4117 int length = alternatives->length();
4118 ChoiceNode* result =
4119 new(compiler->zone()) ChoiceNode(length, compiler->zone());
4120 for (int i = 0; i < length; i++) {
4121 GuardedAlternative alternative(alternatives->at(i)->ToNode(compiler,
4122 on_success));
4123 result->AddAlternative(alternative);
4124 }
4125 return result;
4126 }
4127
4128
4129 RegExpNode* RegExpQuantifier::ToNode(RegExpCompiler* compiler,
4130 RegExpNode* on_success) {
4131 return ToNode(min(),
4132 max(),
4133 is_greedy(),
4134 body(),
4135 compiler,
4136 on_success);
4137 }
4138
4139
4140 // Scoped object to keep track of how much we unroll quantifier loops in the
4141 // regexp graph generator.
4142 class RegExpExpansionLimiter {
4143 public:
4144 static const int kMaxExpansionFactor = 6;
4145 RegExpExpansionLimiter(RegExpCompiler* compiler, int factor)
4146 : compiler_(compiler),
4147 saved_expansion_factor_(compiler->current_expansion_factor()),
4148 ok_to_expand_(saved_expansion_factor_ <= kMaxExpansionFactor) {
4149 DCHECK(factor > 0);
4150 if (ok_to_expand_) {
4151 if (factor > kMaxExpansionFactor) {
4152 // Avoid integer overflow of the current expansion factor.
4153 ok_to_expand_ = false;
4154 compiler->set_current_expansion_factor(kMaxExpansionFactor + 1);
4155 } else {
4156 int new_factor = saved_expansion_factor_ * factor;
4157 ok_to_expand_ = (new_factor <= kMaxExpansionFactor);
4158 compiler->set_current_expansion_factor(new_factor);
4159 }
4160 }
4161 }
4162
4163 ~RegExpExpansionLimiter() {
4164 compiler_->set_current_expansion_factor(saved_expansion_factor_);
4165 }
4166
4167 bool ok_to_expand() { return ok_to_expand_; }
4168
4169 private:
4170 RegExpCompiler* compiler_;
4171 int saved_expansion_factor_;
4172 bool ok_to_expand_;
4173
4174 DISALLOW_IMPLICIT_CONSTRUCTORS(RegExpExpansionLimiter);
4175 };
4176
4177
4178 RegExpNode* RegExpQuantifier::ToNode(int min,
4179 int max,
4180 bool is_greedy,
4181 RegExpTree* body,
4182 RegExpCompiler* compiler,
4183 RegExpNode* on_success,
4184 bool not_at_start) {
4185 // x{f, t} becomes this:
4186 //
4187 // (r++)<-.
4188 // | `
4189 // | (x)
4190 // v ^
4191 // (r=0)-->(?)---/ [if r < t]
4192 // |
4193 // [if r >= f] \----> ...
4194 //
4195
4196 // 15.10.2.5 RepeatMatcher algorithm.
4197 // The parser has already eliminated the case where max is 0. In the case
4198 // where max_match is zero the parser has removed the quantifier if min was
4199 // > 0 and removed the atom if min was 0. See AddQuantifierToAtom.
4200
4201 // If we know that we cannot match zero length then things are a little
4202 // simpler since we don't need to make the special zero length match check
4203 // from step 2.1. If the min and max are small we can unroll a little in
4204 // this case.
4205 static const int kMaxUnrolledMinMatches = 3; // Unroll (foo)+ and (foo){3,}
4206 static const int kMaxUnrolledMaxMatches = 3; // Unroll (foo)? and (foo){x,3}
4207 if (max == 0) return on_success; // This can happen due to recursion.
4208 bool body_can_be_empty = (body->min_match() == 0);
4209 int body_start_reg = RegExpCompiler::kNoRegister;
4210 Interval capture_registers = body->CaptureRegisters();
4211 bool needs_capture_clearing = !capture_registers.is_empty();
4212 Zone* zone = compiler->zone();
4213
4214 if (body_can_be_empty) {
4215 body_start_reg = compiler->AllocateRegister();
4216 } else if (FLAG_regexp_optimization && !needs_capture_clearing) {
4217 // Only unroll if there are no captures and the body can't be
4218 // empty.
4219 {
4220 RegExpExpansionLimiter limiter(
4221 compiler, min + ((max != min) ? 1 : 0));
4222 if (min > 0 && min <= kMaxUnrolledMinMatches && limiter.ok_to_expand()) {
4223 int new_max = (max == kInfinity) ? max : max - min;
4224 // Recurse once to get the loop or optional matches after the fixed
4225 // ones.
4226 RegExpNode* answer = ToNode(
4227 0, new_max, is_greedy, body, compiler, on_success, true);
4228 // Unroll the forced matches from 0 to min. This can cause chains of
4229 // TextNodes (which the parser does not generate). These should be
4230 // combined if it turns out they hinder good code generation.
4231 for (int i = 0; i < min; i++) {
4232 answer = body->ToNode(compiler, answer);
4233 }
4234 return answer;
4235 }
4236 }
4237 if (max <= kMaxUnrolledMaxMatches && min == 0) {
4238 DCHECK(max > 0); // Due to the 'if' above.
4239 RegExpExpansionLimiter limiter(compiler, max);
4240 if (limiter.ok_to_expand()) {
4241 // Unroll the optional matches up to max.
4242 RegExpNode* answer = on_success;
4243 for (int i = 0; i < max; i++) {
4244 ChoiceNode* alternation = new(zone) ChoiceNode(2, zone);
4245 if (is_greedy) {
4246 alternation->AddAlternative(
4247 GuardedAlternative(body->ToNode(compiler, answer)));
4248 alternation->AddAlternative(GuardedAlternative(on_success));
4249 } else {
4250 alternation->AddAlternative(GuardedAlternative(on_success));
4251 alternation->AddAlternative(
4252 GuardedAlternative(body->ToNode(compiler, answer)));
4253 }
4254 answer = alternation;
4255 if (not_at_start) alternation->set_not_at_start();
4256 }
4257 return answer;
4258 }
4259 }
4260 }
4261 bool has_min = min > 0;
4262 bool has_max = max < RegExpTree::kInfinity;
4263 bool needs_counter = has_min || has_max;
4264 int reg_ctr = needs_counter
4265 ? compiler->AllocateRegister()
4266 : RegExpCompiler::kNoRegister;
4267 LoopChoiceNode* center = new(zone) LoopChoiceNode(body->min_match() == 0,
4268 zone);
4269 if (not_at_start) center->set_not_at_start();
4270 RegExpNode* loop_return = needs_counter
4271 ? static_cast<RegExpNode*>(ActionNode::IncrementRegister(reg_ctr, center))
4272 : static_cast<RegExpNode*>(center);
4273 if (body_can_be_empty) {
4274 // If the body can be empty we need to check if it was and then
4275 // backtrack.
4276 loop_return = ActionNode::EmptyMatchCheck(body_start_reg,
4277 reg_ctr,
4278 min,
4279 loop_return);
4280 }
4281 RegExpNode* body_node = body->ToNode(compiler, loop_return);
4282 if (body_can_be_empty) {
4283 // If the body can be empty we need to store the start position
4284 // so we can bail out if it was empty.
4285 body_node = ActionNode::StorePosition(body_start_reg, false, body_node);
4286 }
4287 if (needs_capture_clearing) {
4288 // Before entering the body of this loop we need to clear captures.
4289 body_node = ActionNode::ClearCaptures(capture_registers, body_node);
4290 }
4291 GuardedAlternative body_alt(body_node);
4292 if (has_max) {
4293 Guard* body_guard =
4294 new(zone) Guard(reg_ctr, Guard::LT, max);
4295 body_alt.AddGuard(body_guard, zone);
4296 }
4297 GuardedAlternative rest_alt(on_success);
4298 if (has_min) {
4299 Guard* rest_guard = new(compiler->zone()) Guard(reg_ctr, Guard::GEQ, min);
4300 rest_alt.AddGuard(rest_guard, zone);
4301 }
4302 if (is_greedy) {
4303 center->AddLoopAlternative(body_alt);
4304 center->AddContinueAlternative(rest_alt);
4305 } else {
4306 center->AddContinueAlternative(rest_alt);
4307 center->AddLoopAlternative(body_alt);
4308 }
4309 if (needs_counter) {
4310 return ActionNode::SetRegister(reg_ctr, 0, center);
4311 } else {
4312 return center;
4313 }
4314 }
4315
4316
4317 RegExpNode* RegExpAssertion::ToNode(RegExpCompiler* compiler,
4318 RegExpNode* on_success) {
4319 NodeInfo info;
4320 Zone* zone = compiler->zone();
4321
4322 switch (assertion_type()) {
4323 case START_OF_LINE:
4324 return AssertionNode::AfterNewline(on_success);
4325 case START_OF_INPUT:
4326 return AssertionNode::AtStart(on_success);
4327 case BOUNDARY:
4328 return AssertionNode::AtBoundary(on_success);
4329 case NON_BOUNDARY:
4330 return AssertionNode::AtNonBoundary(on_success);
4331 case END_OF_INPUT:
4332 return AssertionNode::AtEnd(on_success);
4333 case END_OF_LINE: {
4334 // Compile $ in multiline regexps as an alternation with a positive
4335 // lookahead in one side and an end-of-input on the other side.
4336 // We need two registers for the lookahead.
4337 int stack_pointer_register = compiler->AllocateRegister();
4338 int position_register = compiler->AllocateRegister();
4339 // The ChoiceNode to distinguish between a newline and end-of-input.
4340 ChoiceNode* result = new(zone) ChoiceNode(2, zone);
4341 // Create a newline atom.
4342 ZoneList<CharacterRange>* newline_ranges =
4343 new(zone) ZoneList<CharacterRange>(3, zone);
4344 CharacterRange::AddClassEscape('n', newline_ranges, zone);
4345 RegExpCharacterClass* newline_atom = new(zone) RegExpCharacterClass('n');
4346 TextNode* newline_matcher = new(zone) TextNode(
4347 newline_atom,
4348 ActionNode::PositiveSubmatchSuccess(stack_pointer_register,
4349 position_register,
4350 0, // No captures inside.
4351 -1, // Ignored if no captures.
4352 on_success));
4353 // Create an end-of-input matcher.
4354 RegExpNode* end_of_line = ActionNode::BeginSubmatch(
4355 stack_pointer_register,
4356 position_register,
4357 newline_matcher);
4358 // Add the two alternatives to the ChoiceNode.
4359 GuardedAlternative eol_alternative(end_of_line);
4360 result->AddAlternative(eol_alternative);
4361 GuardedAlternative end_alternative(AssertionNode::AtEnd(on_success));
4362 result->AddAlternative(end_alternative);
4363 return result;
4364 }
4365 default:
4366 UNREACHABLE();
4367 }
4368 return on_success;
4369 }
4370
4371
4372 RegExpNode* RegExpBackReference::ToNode(RegExpCompiler* compiler,
4373 RegExpNode* on_success) {
4374 return new(compiler->zone())
4375 BackReferenceNode(RegExpCapture::StartRegister(index()),
4376 RegExpCapture::EndRegister(index()),
4377 on_success);
4378 }
4379
4380
4381 RegExpNode* RegExpEmpty::ToNode(RegExpCompiler* compiler,
4382 RegExpNode* on_success) {
4383 return on_success;
4384 }
4385
4386
4387 RegExpNode* RegExpLookahead::ToNode(RegExpCompiler* compiler,
4388 RegExpNode* on_success) {
4389 int stack_pointer_register = compiler->AllocateRegister();
4390 int position_register = compiler->AllocateRegister();
4391
4392 const int registers_per_capture = 2;
4393 const int register_of_first_capture = 2;
4394 int register_count = capture_count_ * registers_per_capture;
4395 int register_start =
4396 register_of_first_capture + capture_from_ * registers_per_capture;
4397
4398 RegExpNode* success;
4399 if (is_positive()) {
4400 RegExpNode* node = ActionNode::BeginSubmatch(
4401 stack_pointer_register,
4402 position_register,
4403 body()->ToNode(
4404 compiler,
4405 ActionNode::PositiveSubmatchSuccess(stack_pointer_register,
4406 position_register,
4407 register_count,
4408 register_start,
4409 on_success)));
4410 return node;
4411 } else {
4412 // We use a ChoiceNode for a negative lookahead because it has most of
4413 // the characteristics we need. It has the body of the lookahead as its
4414 // first alternative and the expression after the lookahead of the second
4415 // alternative. If the first alternative succeeds then the
4416 // NegativeSubmatchSuccess will unwind the stack including everything the
4417 // choice node set up and backtrack. If the first alternative fails then
4418 // the second alternative is tried, which is exactly the desired result
4419 // for a negative lookahead. The NegativeLookaheadChoiceNode is a special
4420 // ChoiceNode that knows to ignore the first exit when calculating quick
4421 // checks.
4422 Zone* zone = compiler->zone();
4423
4424 GuardedAlternative body_alt(
4425 body()->ToNode(
4426 compiler,
4427 success = new(zone) NegativeSubmatchSuccess(stack_pointer_register,
4428 position_register,
4429 register_count,
4430 register_start,
4431 zone)));
4432 ChoiceNode* choice_node =
4433 new(zone) NegativeLookaheadChoiceNode(body_alt,
4434 GuardedAlternative(on_success),
4435 zone);
4436 return ActionNode::BeginSubmatch(stack_pointer_register,
4437 position_register,
4438 choice_node);
4439 }
4440 }
4441
4442
4443 RegExpNode* RegExpCapture::ToNode(RegExpCompiler* compiler,
4444 RegExpNode* on_success) {
4445 return ToNode(body(), index(), compiler, on_success);
4446 }
4447
4448
4449 RegExpNode* RegExpCapture::ToNode(RegExpTree* body,
4450 int index,
4451 RegExpCompiler* compiler,
4452 RegExpNode* on_success) {
4453 int start_reg = RegExpCapture::StartRegister(index);
4454 int end_reg = RegExpCapture::EndRegister(index);
4455 RegExpNode* store_end = ActionNode::StorePosition(end_reg, true, on_success);
4456 RegExpNode* body_node = body->ToNode(compiler, store_end);
4457 return ActionNode::StorePosition(start_reg, true, body_node);
4458 }
4459
4460
4461 RegExpNode* RegExpAlternative::ToNode(RegExpCompiler* compiler,
4462 RegExpNode* on_success) {
4463 ZoneList<RegExpTree*>* children = nodes();
4464 RegExpNode* current = on_success;
4465 for (int i = children->length() - 1; i >= 0; i--) {
4466 current = children->at(i)->ToNode(compiler, current);
4467 }
4468 return current;
4469 }
4470
4471
4472 static void AddClass(const int* elmv,
4473 int elmc,
4474 ZoneList<CharacterRange>* ranges,
4475 Zone* zone) {
4476 elmc--;
4477 DCHECK(elmv[elmc] == 0x10000);
4478 for (int i = 0; i < elmc; i += 2) {
4479 DCHECK(elmv[i] < elmv[i + 1]);
4480 ranges->Add(CharacterRange(elmv[i], elmv[i + 1] - 1), zone);
4481 }
4482 }
4483
4484
4485 static void AddClassNegated(const int *elmv,
4486 int elmc,
4487 ZoneList<CharacterRange>* ranges,
4488 Zone* zone) {
4489 elmc--;
4490 DCHECK(elmv[elmc] == 0x10000);
4491 DCHECK(elmv[0] != 0x0000);
4492 DCHECK(elmv[elmc-1] != String::kMaxUtf16CodeUnit);
4493 uc16 last = 0x0000;
4494 for (int i = 0; i < elmc; i += 2) {
4495 DCHECK(last <= elmv[i] - 1);
4496 DCHECK(elmv[i] < elmv[i + 1]);
4497 ranges->Add(CharacterRange(last, elmv[i] - 1), zone);
4498 last = elmv[i + 1];
4499 }
4500 ranges->Add(CharacterRange(last, String::kMaxUtf16CodeUnit), zone);
4501 }
4502
4503
4504 void CharacterRange::AddClassEscape(uc16 type,
4505 ZoneList<CharacterRange>* ranges,
4506 Zone* zone) {
4507 switch (type) {
4508 case 's':
4509 AddClass(kSpaceRanges, kSpaceRangeCount, ranges, zone);
4510 break;
4511 case 'S':
4512 AddClassNegated(kSpaceRanges, kSpaceRangeCount, ranges, zone);
4513 break;
4514 case 'w':
4515 AddClass(kWordRanges, kWordRangeCount, ranges, zone);
4516 break;
4517 case 'W':
4518 AddClassNegated(kWordRanges, kWordRangeCount, ranges, zone);
4519 break;
4520 case 'd':
4521 AddClass(kDigitRanges, kDigitRangeCount, ranges, zone);
4522 break;
4523 case 'D':
4524 AddClassNegated(kDigitRanges, kDigitRangeCount, ranges, zone);
4525 break;
4526 case '.':
4527 AddClassNegated(kLineTerminatorRanges,
4528 kLineTerminatorRangeCount,
4529 ranges,
4530 zone);
4531 break;
4532 // This is not a character range as defined by the spec but a
4533 // convenient shorthand for a character class that matches any
4534 // character.
4535 case '*':
4536 ranges->Add(CharacterRange::Everything(), zone);
4537 break;
4538 // This is the set of characters matched by the $ and ^ symbols
4539 // in multiline mode.
4540 case 'n':
4541 AddClass(kLineTerminatorRanges,
4542 kLineTerminatorRangeCount,
4543 ranges,
4544 zone);
4545 break;
4546 default:
4547 UNREACHABLE();
4548 }
4549 }
4550
4551
4552 Vector<const int> CharacterRange::GetWordBounds() {
4553 return Vector<const int>(kWordRanges, kWordRangeCount - 1);
4554 }
4555
4556
4557 class CharacterRangeSplitter {
4558 public:
4559 CharacterRangeSplitter(ZoneList<CharacterRange>** included,
4560 ZoneList<CharacterRange>** excluded,
4561 Zone* zone)
4562 : included_(included),
4563 excluded_(excluded),
4564 zone_(zone) { }
4565 void Call(uc16 from, DispatchTable::Entry entry);
4566
4567 static const int kInBase = 0;
4568 static const int kInOverlay = 1;
4569
4570 private:
4571 ZoneList<CharacterRange>** included_;
4572 ZoneList<CharacterRange>** excluded_;
4573 Zone* zone_;
4574 };
4575
4576
4577 void CharacterRangeSplitter::Call(uc16 from, DispatchTable::Entry entry) {
4578 if (!entry.out_set()->Get(kInBase)) return;
4579 ZoneList<CharacterRange>** target = entry.out_set()->Get(kInOverlay)
4580 ? included_
4581 : excluded_;
4582 if (*target == NULL) *target = new(zone_) ZoneList<CharacterRange>(2, zone_);
4583 (*target)->Add(CharacterRange(entry.from(), entry.to()), zone_);
4584 }
4585
4586
4587 void CharacterRange::Split(ZoneList<CharacterRange>* base,
4588 Vector<const int> overlay,
4589 ZoneList<CharacterRange>** included,
4590 ZoneList<CharacterRange>** excluded,
4591 Zone* zone) {
4592 DCHECK_EQ(NULL, *included);
4593 DCHECK_EQ(NULL, *excluded);
4594 DispatchTable table(zone);
4595 for (int i = 0; i < base->length(); i++)
4596 table.AddRange(base->at(i), CharacterRangeSplitter::kInBase, zone);
4597 for (int i = 0; i < overlay.length(); i += 2) {
4598 table.AddRange(CharacterRange(overlay[i], overlay[i + 1] - 1),
4599 CharacterRangeSplitter::kInOverlay, zone);
4600 }
4601 CharacterRangeSplitter callback(included, excluded, zone);
4602 table.ForEach(&callback);
4603 }
4604
4605
4606 void CharacterRange::AddCaseEquivalents(ZoneList<CharacterRange>* ranges,
4607 bool is_one_byte, Zone* zone) {
4608 Isolate* isolate = zone->isolate();
4609 uc16 bottom = from();
4610 uc16 top = to();
4611 if (is_one_byte && !RangeContainsLatin1Equivalents(*this)) {
4612 if (bottom > String::kMaxOneByteCharCode) return;
4613 if (top > String::kMaxOneByteCharCode) top = String::kMaxOneByteCharCode;
4614 }
4615 unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
4616 if (top == bottom) {
4617 // If this is a singleton we just expand the one character.
4618 int length = isolate->jsregexp_uncanonicalize()->get(bottom, '\0', chars);
4619 for (int i = 0; i < length; i++) {
4620 uc32 chr = chars[i];
4621 if (chr != bottom) {
4622 ranges->Add(CharacterRange::Singleton(chars[i]), zone);
4623 }
4624 }
4625 } else {
4626 // If this is a range we expand the characters block by block,
4627 // expanding contiguous subranges (blocks) one at a time.
4628 // The approach is as follows. For a given start character we
4629 // look up the remainder of the block that contains it (represented
4630 // by the end point), for instance we find 'z' if the character
4631 // is 'c'. A block is characterized by the property
4632 // that all characters uncanonicalize in the same way, except that
4633 // each entry in the result is incremented by the distance from the first
4634 // element. So a-z is a block because 'a' uncanonicalizes to ['a', 'A'] and
4635 // the k'th letter uncanonicalizes to ['a' + k, 'A' + k].
4636 // Once we've found the end point we look up its uncanonicalization
4637 // and produce a range for each element. For instance for [c-f]
4638 // we look up ['z', 'Z'] and produce [c-f] and [C-F]. We then only
4639 // add a range if it is not already contained in the input, so [c-f]
4640 // will be skipped but [C-F] will be added. If this range is not
4641 // completely contained in a block we do this for all the blocks
4642 // covered by the range (handling characters that is not in a block
4643 // as a "singleton block").
4644 unibrow::uchar range[unibrow::Ecma262UnCanonicalize::kMaxWidth];
4645 int pos = bottom;
4646 while (pos <= top) {
4647 int length = isolate->jsregexp_canonrange()->get(pos, '\0', range);
4648 uc16 block_end;
4649 if (length == 0) {
4650 block_end = pos;
4651 } else {
4652 DCHECK_EQ(1, length);
4653 block_end = range[0];
4654 }
4655 int end = (block_end > top) ? top : block_end;
4656 length = isolate->jsregexp_uncanonicalize()->get(block_end, '\0', range);
4657 for (int i = 0; i < length; i++) {
4658 uc32 c = range[i];
4659 uc16 range_from = c - (block_end - pos);
4660 uc16 range_to = c - (block_end - end);
4661 if (!(bottom <= range_from && range_to <= top)) {
4662 ranges->Add(CharacterRange(range_from, range_to), zone);
4663 }
4664 }
4665 pos = end + 1;
4666 }
4667 }
4668 }
4669
4670
4671 bool CharacterRange::IsCanonical(ZoneList<CharacterRange>* ranges) {
4672 DCHECK_NOT_NULL(ranges);
4673 int n = ranges->length();
4674 if (n <= 1) return true;
4675 int max = ranges->at(0).to();
4676 for (int i = 1; i < n; i++) {
4677 CharacterRange next_range = ranges->at(i);
4678 if (next_range.from() <= max + 1) return false;
4679 max = next_range.to();
4680 }
4681 return true;
4682 }
4683
4684
4685 ZoneList<CharacterRange>* CharacterSet::ranges(Zone* zone) {
4686 if (ranges_ == NULL) {
4687 ranges_ = new(zone) ZoneList<CharacterRange>(2, zone);
4688 CharacterRange::AddClassEscape(standard_set_type_, ranges_, zone);
4689 }
4690 return ranges_;
4691 }
4692
4693
4694 // Move a number of elements in a zonelist to another position
4695 // in the same list. Handles overlapping source and target areas.
4696 static void MoveRanges(ZoneList<CharacterRange>* list,
4697 int from,
4698 int to,
4699 int count) {
4700 // Ranges are potentially overlapping.
4701 if (from < to) {
4702 for (int i = count - 1; i >= 0; i--) {
4703 list->at(to + i) = list->at(from + i);
4704 }
4705 } else {
4706 for (int i = 0; i < count; i++) {
4707 list->at(to + i) = list->at(from + i);
4708 }
4709 }
4710 }
4711
4712
4713 static int InsertRangeInCanonicalList(ZoneList<CharacterRange>* list,
4714 int count,
4715 CharacterRange insert) {
4716 // Inserts a range into list[0..count[, which must be sorted
4717 // by from value and non-overlapping and non-adjacent, using at most
4718 // list[0..count] for the result. Returns the number of resulting
4719 // canonicalized ranges. Inserting a range may collapse existing ranges into
4720 // fewer ranges, so the return value can be anything in the range 1..count+1.
4721 uc16 from = insert.from();
4722 uc16 to = insert.to();
4723 int start_pos = 0;
4724 int end_pos = count;
4725 for (int i = count - 1; i >= 0; i--) {
4726 CharacterRange current = list->at(i);
4727 if (current.from() > to + 1) {
4728 end_pos = i;
4729 } else if (current.to() + 1 < from) {
4730 start_pos = i + 1;
4731 break;
4732 }
4733 }
4734
4735 // Inserted range overlaps, or is adjacent to, ranges at positions
4736 // [start_pos..end_pos[. Ranges before start_pos or at or after end_pos are
4737 // not affected by the insertion.
4738 // If start_pos == end_pos, the range must be inserted before start_pos.
4739 // if start_pos < end_pos, the entire range from start_pos to end_pos
4740 // must be merged with the insert range.
4741
4742 if (start_pos == end_pos) {
4743 // Insert between existing ranges at position start_pos.
4744 if (start_pos < count) {
4745 MoveRanges(list, start_pos, start_pos + 1, count - start_pos);
4746 }
4747 list->at(start_pos) = insert;
4748 return count + 1;
4749 }
4750 if (start_pos + 1 == end_pos) {
4751 // Replace single existing range at position start_pos.
4752 CharacterRange to_replace = list->at(start_pos);
4753 int new_from = Min(to_replace.from(), from);
4754 int new_to = Max(to_replace.to(), to);
4755 list->at(start_pos) = CharacterRange(new_from, new_to);
4756 return count;
4757 }
4758 // Replace a number of existing ranges from start_pos to end_pos - 1.
4759 // Move the remaining ranges down.
4760
4761 int new_from = Min(list->at(start_pos).from(), from);
4762 int new_to = Max(list->at(end_pos - 1).to(), to);
4763 if (end_pos < count) {
4764 MoveRanges(list, end_pos, start_pos + 1, count - end_pos);
4765 }
4766 list->at(start_pos) = CharacterRange(new_from, new_to);
4767 return count - (end_pos - start_pos) + 1;
4768 }
4769
4770
4771 void CharacterSet::Canonicalize() {
4772 // Special/default classes are always considered canonical. The result
4773 // of calling ranges() will be sorted.
4774 if (ranges_ == NULL) return;
4775 CharacterRange::Canonicalize(ranges_);
4776 }
4777
4778
4779 void CharacterRange::Canonicalize(ZoneList<CharacterRange>* character_ranges) {
4780 if (character_ranges->length() <= 1) return;
4781 // Check whether ranges are already canonical (increasing, non-overlapping,
4782 // non-adjacent).
4783 int n = character_ranges->length();
4784 int max = character_ranges->at(0).to();
4785 int i = 1;
4786 while (i < n) {
4787 CharacterRange current = character_ranges->at(i);
4788 if (current.from() <= max + 1) {
4789 break;
4790 }
4791 max = current.to();
4792 i++;
4793 }
4794 // Canonical until the i'th range. If that's all of them, we are done.
4795 if (i == n) return;
4796
4797 // The ranges at index i and forward are not canonicalized. Make them so by
4798 // doing the equivalent of insertion sort (inserting each into the previous
4799 // list, in order).
4800 // Notice that inserting a range can reduce the number of ranges in the
4801 // result due to combining of adjacent and overlapping ranges.
4802 int read = i; // Range to insert.
4803 int num_canonical = i; // Length of canonicalized part of list.
4804 do {
4805 num_canonical = InsertRangeInCanonicalList(character_ranges,
4806 num_canonical,
4807 character_ranges->at(read));
4808 read++;
4809 } while (read < n);
4810 character_ranges->Rewind(num_canonical);
4811
4812 DCHECK(CharacterRange::IsCanonical(character_ranges));
4813 }
4814
4815
4816 void CharacterRange::Negate(ZoneList<CharacterRange>* ranges,
4817 ZoneList<CharacterRange>* negated_ranges,
4818 Zone* zone) {
4819 DCHECK(CharacterRange::IsCanonical(ranges));
4820 DCHECK_EQ(0, negated_ranges->length());
4821 int range_count = ranges->length();
4822 uc16 from = 0;
4823 int i = 0;
4824 if (range_count > 0 && ranges->at(0).from() == 0) {
4825 from = ranges->at(0).to();
4826 i = 1;
4827 }
4828 while (i < range_count) {
4829 CharacterRange range = ranges->at(i);
4830 negated_ranges->Add(CharacterRange(from + 1, range.from() - 1), zone);
4831 from = range.to();
4832 i++;
4833 }
4834 if (from < String::kMaxUtf16CodeUnit) {
4835 negated_ranges->Add(CharacterRange(from + 1, String::kMaxUtf16CodeUnit),
4836 zone);
4837 }
4838 }
4839
4840
4841 // -------------------------------------------------------------------
4842 // Splay tree
4843
4844
4845 OutSet* OutSet::Extend(unsigned value, Zone* zone) {
4846 if (Get(value))
4847 return this;
4848 if (successors(zone) != NULL) {
4849 for (int i = 0; i < successors(zone)->length(); i++) {
4850 OutSet* successor = successors(zone)->at(i);
4851 if (successor->Get(value))
4852 return successor;
4853 }
4854 } else {
4855 successors_ = new(zone) ZoneList<OutSet*>(2, zone);
4856 }
4857 OutSet* result = new(zone) OutSet(first_, remaining_);
4858 result->Set(value, zone);
4859 successors(zone)->Add(result, zone);
4860 return result;
4861 }
4862
4863
4864 void OutSet::Set(unsigned value, Zone *zone) {
4865 if (value < kFirstLimit) {
4866 first_ |= (1 << value);
4867 } else {
4868 if (remaining_ == NULL)
4869 remaining_ = new(zone) ZoneList<unsigned>(1, zone);
4870 if (remaining_->is_empty() || !remaining_->Contains(value))
4871 remaining_->Add(value, zone);
4872 }
4873 }
4874
4875
4876 bool OutSet::Get(unsigned value) const {
4877 if (value < kFirstLimit) {
4878 return (first_ & (1 << value)) != 0;
4879 } else if (remaining_ == NULL) {
4880 return false;
4881 } else {
4882 return remaining_->Contains(value);
4883 }
4884 }
4885
4886
4887 const uc16 DispatchTable::Config::kNoKey = unibrow::Utf8::kBadChar;
4888
4889
4890 void DispatchTable::AddRange(CharacterRange full_range, int value,
4891 Zone* zone) {
4892 CharacterRange current = full_range;
4893 if (tree()->is_empty()) {
4894 // If this is the first range we just insert into the table.
4895 ZoneSplayTree<Config>::Locator loc;
4896 DCHECK_RESULT(tree()->Insert(current.from(), &loc));
4897 loc.set_value(Entry(current.from(), current.to(),
4898 empty()->Extend(value, zone)));
4899 return;
4900 }
4901 // First see if there is a range to the left of this one that
4902 // overlaps.
4903 ZoneSplayTree<Config>::Locator loc;
4904 if (tree()->FindGreatestLessThan(current.from(), &loc)) {
4905 Entry* entry = &loc.value();
4906 // If we've found a range that overlaps with this one, and it
4907 // starts strictly to the left of this one, we have to fix it
4908 // because the following code only handles ranges that start on
4909 // or after the start point of the range we're adding.
4910 if (entry->from() < current.from() && entry->to() >= current.from()) {
4911 // Snap the overlapping range in half around the start point of
4912 // the range we're adding.
4913 CharacterRange left(entry->from(), current.from() - 1);
4914 CharacterRange right(current.from(), entry->to());
4915 // The left part of the overlapping range doesn't overlap.
4916 // Truncate the whole entry to be just the left part.
4917 entry->set_to(left.to());
4918 // The right part is the one that overlaps. We add this part
4919 // to the map and let the next step deal with merging it with
4920 // the range we're adding.
4921 ZoneSplayTree<Config>::Locator loc;
4922 DCHECK_RESULT(tree()->Insert(right.from(), &loc));
4923 loc.set_value(Entry(right.from(),
4924 right.to(),
4925 entry->out_set()));
4926 }
4927 }
4928 while (current.is_valid()) {
4929 if (tree()->FindLeastGreaterThan(current.from(), &loc) &&
4930 (loc.value().from() <= current.to()) &&
4931 (loc.value().to() >= current.from())) {
4932 Entry* entry = &loc.value();
4933 // We have overlap. If there is space between the start point of
4934 // the range we're adding and where the overlapping range starts
4935 // then we have to add a range covering just that space.
4936 if (current.from() < entry->from()) {
4937 ZoneSplayTree<Config>::Locator ins;
4938 DCHECK_RESULT(tree()->Insert(current.from(), &ins));
4939 ins.set_value(Entry(current.from(),
4940 entry->from() - 1,
4941 empty()->Extend(value, zone)));
4942 current.set_from(entry->from());
4943 }
4944 DCHECK_EQ(current.from(), entry->from());
4945 // If the overlapping range extends beyond the one we want to add
4946 // we have to snap the right part off and add it separately.
4947 if (entry->to() > current.to()) {
4948 ZoneSplayTree<Config>::Locator ins;
4949 DCHECK_RESULT(tree()->Insert(current.to() + 1, &ins));
4950 ins.set_value(Entry(current.to() + 1,
4951 entry->to(),
4952 entry->out_set()));
4953 entry->set_to(current.to());
4954 }
4955 DCHECK(entry->to() <= current.to());
4956 // The overlapping range is now completely contained by the range
4957 // we're adding so we can just update it and move the start point
4958 // of the range we're adding just past it.
4959 entry->AddValue(value, zone);
4960 // Bail out if the last interval ended at 0xFFFF since otherwise
4961 // adding 1 will wrap around to 0.
4962 if (entry->to() == String::kMaxUtf16CodeUnit)
4963 break;
4964 DCHECK(entry->to() + 1 > current.from());
4965 current.set_from(entry->to() + 1);
4966 } else {
4967 // There is no overlap so we can just add the range
4968 ZoneSplayTree<Config>::Locator ins;
4969 DCHECK_RESULT(tree()->Insert(current.from(), &ins));
4970 ins.set_value(Entry(current.from(),
4971 current.to(),
4972 empty()->Extend(value, zone)));
4973 break;
4974 }
4975 }
4976 }
4977
4978
4979 OutSet* DispatchTable::Get(uc16 value) {
4980 ZoneSplayTree<Config>::Locator loc;
4981 if (!tree()->FindGreatestLessThan(value, &loc))
4982 return empty();
4983 Entry* entry = &loc.value();
4984 if (value <= entry->to())
4985 return entry->out_set();
4986 else
4987 return empty();
4988 }
4989
4990
4991 // -------------------------------------------------------------------
4992 // Analysis
4993
4994
4995 void Analysis::EnsureAnalyzed(RegExpNode* that) {
4996 StackLimitCheck check(that->zone()->isolate());
4997 if (check.HasOverflowed()) {
4998 fail("Stack overflow");
4999 return;
5000 }
5001 if (that->info()->been_analyzed || that->info()->being_analyzed)
5002 return;
5003 that->info()->being_analyzed = true;
5004 that->Accept(this);
5005 that->info()->being_analyzed = false;
5006 that->info()->been_analyzed = true;
5007 }
5008
5009
5010 void Analysis::VisitEnd(EndNode* that) {
5011 // nothing to do
5012 }
5013
5014
5015 void TextNode::CalculateOffsets() {
5016 int element_count = elements()->length();
5017 // Set up the offsets of the elements relative to the start. This is a fixed
5018 // quantity since a TextNode can only contain fixed-width things.
5019 int cp_offset = 0;
5020 for (int i = 0; i < element_count; i++) {
5021 TextElement& elm = elements()->at(i);
5022 elm.set_cp_offset(cp_offset);
5023 cp_offset += elm.length();
5024 }
5025 }
5026
5027
5028 void Analysis::VisitText(TextNode* that) {
5029 if (ignore_case_) {
5030 that->MakeCaseIndependent(is_one_byte_);
5031 }
5032 EnsureAnalyzed(that->on_success());
5033 if (!has_failed()) {
5034 that->CalculateOffsets();
5035 }
5036 }
5037
5038
5039 void Analysis::VisitAction(ActionNode* that) {
5040 RegExpNode* target = that->on_success();
5041 EnsureAnalyzed(target);
5042 if (!has_failed()) {
5043 // If the next node is interested in what it follows then this node
5044 // has to be interested too so it can pass the information on.
5045 that->info()->AddFromFollowing(target->info());
5046 }
5047 }
5048
5049
5050 void Analysis::VisitChoice(ChoiceNode* that) {
5051 NodeInfo* info = that->info();
5052 for (int i = 0; i < that->alternatives()->length(); i++) {
5053 RegExpNode* node = that->alternatives()->at(i).node();
5054 EnsureAnalyzed(node);
5055 if (has_failed()) return;
5056 // Anything the following nodes need to know has to be known by
5057 // this node also, so it can pass it on.
5058 info->AddFromFollowing(node->info());
5059 }
5060 }
5061
5062
5063 void Analysis::VisitLoopChoice(LoopChoiceNode* that) {
5064 NodeInfo* info = that->info();
5065 for (int i = 0; i < that->alternatives()->length(); i++) {
5066 RegExpNode* node = that->alternatives()->at(i).node();
5067 if (node != that->loop_node()) {
5068 EnsureAnalyzed(node);
5069 if (has_failed()) return;
5070 info->AddFromFollowing(node->info());
5071 }
5072 }
5073 // Check the loop last since it may need the value of this node
5074 // to get a correct result.
5075 EnsureAnalyzed(that->loop_node());
5076 if (!has_failed()) {
5077 info->AddFromFollowing(that->loop_node()->info());
5078 }
5079 }
5080
5081
5082 void Analysis::VisitBackReference(BackReferenceNode* that) {
5083 EnsureAnalyzed(that->on_success());
5084 }
5085
5086
5087 void Analysis::VisitAssertion(AssertionNode* that) {
5088 EnsureAnalyzed(that->on_success());
5089 }
5090
5091
5092 void BackReferenceNode::FillInBMInfo(int offset,
5093 int budget,
5094 BoyerMooreLookahead* bm,
5095 bool not_at_start) {
5096 // Working out the set of characters that a backreference can match is too
5097 // hard, so we just say that any character can match.
5098 bm->SetRest(offset);
5099 SaveBMInfo(bm, not_at_start, offset);
5100 }
5101
5102
5103 STATIC_ASSERT(BoyerMoorePositionInfo::kMapSize ==
5104 RegExpMacroAssembler::kTableSize);
5105
5106
5107 void ChoiceNode::FillInBMInfo(int offset,
5108 int budget,
5109 BoyerMooreLookahead* bm,
5110 bool not_at_start) {
5111 ZoneList<GuardedAlternative>* alts = alternatives();
5112 budget = (budget - 1) / alts->length();
5113 for (int i = 0; i < alts->length(); i++) {
5114 GuardedAlternative& alt = alts->at(i);
5115 if (alt.guards() != NULL && alt.guards()->length() != 0) {
5116 bm->SetRest(offset); // Give up trying to fill in info.
5117 SaveBMInfo(bm, not_at_start, offset);
5118 return;
5119 }
5120 alt.node()->FillInBMInfo(offset, budget, bm, not_at_start);
5121 }
5122 SaveBMInfo(bm, not_at_start, offset);
5123 }
5124
5125
5126 void TextNode::FillInBMInfo(int initial_offset,
5127 int budget,
5128 BoyerMooreLookahead* bm,
5129 bool not_at_start) {
5130 if (initial_offset >= bm->length()) return;
5131 int offset = initial_offset;
5132 int max_char = bm->max_char();
5133 for (int i = 0; i < elements()->length(); i++) {
5134 if (offset >= bm->length()) {
5135 if (initial_offset == 0) set_bm_info(not_at_start, bm);
5136 return;
5137 }
5138 TextElement text = elements()->at(i);
5139 if (text.text_type() == TextElement::ATOM) {
5140 RegExpAtom* atom = text.atom();
5141 for (int j = 0; j < atom->length(); j++, offset++) {
5142 if (offset >= bm->length()) {
5143 if (initial_offset == 0) set_bm_info(not_at_start, bm);
5144 return;
5145 }
5146 uc16 character = atom->data()[j];
5147 if (bm->compiler()->ignore_case()) {
5148 unibrow::uchar chars[unibrow::Ecma262UnCanonicalize::kMaxWidth];
5149 int length = GetCaseIndependentLetters(
5150 Isolate::Current(),
5151 character,
5152 bm->max_char() == String::kMaxOneByteCharCode,
5153 chars);
5154 for (int j = 0; j < length; j++) {
5155 bm->Set(offset, chars[j]);
5156 }
5157 } else {
5158 if (character <= max_char) bm->Set(offset, character);
5159 }
5160 }
5161 } else {
5162 DCHECK_EQ(TextElement::CHAR_CLASS, text.text_type());
5163 RegExpCharacterClass* char_class = text.char_class();
5164 ZoneList<CharacterRange>* ranges = char_class->ranges(zone());
5165 if (char_class->is_negated()) {
5166 bm->SetAll(offset);
5167 } else {
5168 for (int k = 0; k < ranges->length(); k++) {
5169 CharacterRange& range = ranges->at(k);
5170 if (range.from() > max_char) continue;
5171 int to = Min(max_char, static_cast<int>(range.to()));
5172 bm->SetInterval(offset, Interval(range.from(), to));
5173 }
5174 }
5175 offset++;
5176 }
5177 }
5178 if (offset >= bm->length()) {
5179 if (initial_offset == 0) set_bm_info(not_at_start, bm);
5180 return;
5181 }
5182 on_success()->FillInBMInfo(offset,
5183 budget - 1,
5184 bm,
5185 true); // Not at start after a text node.
5186 if (initial_offset == 0) set_bm_info(not_at_start, bm);
5187 }
5188
5189
5190 // -------------------------------------------------------------------
5191 // Dispatch table construction
5192
5193
5194 void DispatchTableConstructor::VisitEnd(EndNode* that) {
5195 AddRange(CharacterRange::Everything());
5196 }
5197
5198
5199 void DispatchTableConstructor::BuildTable(ChoiceNode* node) {
5200 node->set_being_calculated(true);
5201 ZoneList<GuardedAlternative>* alternatives = node->alternatives();
5202 for (int i = 0; i < alternatives->length(); i++) {
5203 set_choice_index(i);
5204 alternatives->at(i).node()->Accept(this);
5205 }
5206 node->set_being_calculated(false);
5207 }
5208
5209
5210 class AddDispatchRange {
5211 public:
5212 explicit AddDispatchRange(DispatchTableConstructor* constructor)
5213 : constructor_(constructor) { }
5214 void Call(uc32 from, DispatchTable::Entry entry);
5215 private:
5216 DispatchTableConstructor* constructor_;
5217 };
5218
5219
5220 void AddDispatchRange::Call(uc32 from, DispatchTable::Entry entry) {
5221 CharacterRange range(from, entry.to());
5222 constructor_->AddRange(range);
5223 }
5224
5225
5226 void DispatchTableConstructor::VisitChoice(ChoiceNode* node) {
5227 if (node->being_calculated())
5228 return;
5229 DispatchTable* table = node->GetTable(ignore_case_);
5230 AddDispatchRange adder(this);
5231 table->ForEach(&adder);
5232 }
5233
5234
5235 void DispatchTableConstructor::VisitBackReference(BackReferenceNode* that) {
5236 // TODO(160): Find the node that we refer back to and propagate its start
5237 // set back to here. For now we just accept anything.
5238 AddRange(CharacterRange::Everything());
5239 }
5240
5241
5242 void DispatchTableConstructor::VisitAssertion(AssertionNode* that) {
5243 RegExpNode* target = that->on_success();
5244 target->Accept(this);
5245 }
5246
5247
5248 static int CompareRangeByFrom(const CharacterRange* a,
5249 const CharacterRange* b) {
5250 return Compare<uc16>(a->from(), b->from());
5251 }
5252
5253
5254 void DispatchTableConstructor::AddInverse(ZoneList<CharacterRange>* ranges) {
5255 ranges->Sort(CompareRangeByFrom);
5256 uc16 last = 0;
5257 for (int i = 0; i < ranges->length(); i++) {
5258 CharacterRange range = ranges->at(i);
5259 if (last < range.from())
5260 AddRange(CharacterRange(last, range.from() - 1));
5261 if (range.to() >= last) {
5262 if (range.to() == String::kMaxUtf16CodeUnit) {
5263 return;
5264 } else {
5265 last = range.to() + 1;
5266 }
5267 }
5268 }
5269 AddRange(CharacterRange(last, String::kMaxUtf16CodeUnit));
5270 }
5271
5272
5273 void DispatchTableConstructor::VisitText(TextNode* that) {
5274 TextElement elm = that->elements()->at(0);
5275 switch (elm.text_type()) {
5276 case TextElement::ATOM: {
5277 uc16 c = elm.atom()->data()[0];
5278 AddRange(CharacterRange(c, c));
5279 break;
5280 }
5281 case TextElement::CHAR_CLASS: {
5282 RegExpCharacterClass* tree = elm.char_class();
5283 ZoneList<CharacterRange>* ranges = tree->ranges(that->zone());
5284 if (tree->is_negated()) {
5285 AddInverse(ranges);
5286 } else {
5287 for (int i = 0; i < ranges->length(); i++)
5288 AddRange(ranges->at(i));
5289 }
5290 break;
5291 }
5292 default: {
5293 UNIMPLEMENTED();
5294 }
5295 }
5296 }
5297
5298
5299 void DispatchTableConstructor::VisitAction(ActionNode* that) {
5300 RegExpNode* target = that->on_success();
5301 target->Accept(this);
5302 }
5303
5304
5305 RegExpEngine::CompilationResult RegExpEngine::Compile(
5306 RegExpCompileData* data, bool ignore_case, bool is_global,
5307 bool is_multiline, bool is_sticky, Handle<String> pattern,
5308 Handle<String> sample_subject, bool is_one_byte, Zone* zone) {
5309 if ((data->capture_count + 1) * 2 - 1 > RegExpMacroAssembler::kMaxRegister) {
5310 return IrregexpRegExpTooBig(zone->isolate());
5311 }
5312 RegExpCompiler compiler(data->capture_count, ignore_case, is_one_byte, zone);
5313
5314 // Sample some characters from the middle of the string.
5315 static const int kSampleSize = 128;
5316
5317 sample_subject = String::Flatten(sample_subject);
5318 int chars_sampled = 0;
5319 int half_way = (sample_subject->length() - kSampleSize) / 2;
5320 for (int i = Max(0, half_way);
5321 i < sample_subject->length() && chars_sampled < kSampleSize;
5322 i++, chars_sampled++) {
5323 compiler.frequency_collator()->CountCharacter(sample_subject->Get(i));
5324 }
5325
5326 // Wrap the body of the regexp in capture #0.
5327 RegExpNode* captured_body = RegExpCapture::ToNode(data->tree,
5328 0,
5329 &compiler,
5330 compiler.accept());
5331 RegExpNode* node = captured_body;
5332 bool is_end_anchored = data->tree->IsAnchoredAtEnd();
5333 bool is_start_anchored = data->tree->IsAnchoredAtStart();
5334 int max_length = data->tree->max_match();
5335 if (!is_start_anchored && !is_sticky) {
5336 // Add a .*? at the beginning, outside the body capture, unless
5337 // this expression is anchored at the beginning or sticky.
5338 RegExpNode* loop_node =
5339 RegExpQuantifier::ToNode(0,
5340 RegExpTree::kInfinity,
5341 false,
5342 new(zone) RegExpCharacterClass('*'),
5343 &compiler,
5344 captured_body,
5345 data->contains_anchor);
5346
5347 if (data->contains_anchor) {
5348 // Unroll loop once, to take care of the case that might start
5349 // at the start of input.
5350 ChoiceNode* first_step_node = new(zone) ChoiceNode(2, zone);
5351 first_step_node->AddAlternative(GuardedAlternative(captured_body));
5352 first_step_node->AddAlternative(GuardedAlternative(
5353 new(zone) TextNode(new(zone) RegExpCharacterClass('*'), loop_node)));
5354 node = first_step_node;
5355 } else {
5356 node = loop_node;
5357 }
5358 }
5359 if (is_one_byte) {
5360 node = node->FilterOneByte(RegExpCompiler::kMaxRecursion, ignore_case);
5361 // Do it again to propagate the new nodes to places where they were not
5362 // put because they had not been calculated yet.
5363 if (node != NULL) {
5364 node = node->FilterOneByte(RegExpCompiler::kMaxRecursion, ignore_case);
5365 }
5366 }
5367
5368 if (node == NULL) node = new(zone) EndNode(EndNode::BACKTRACK, zone);
5369 data->node = node;
5370 Analysis analysis(ignore_case, is_one_byte);
5371 analysis.EnsureAnalyzed(node);
5372 if (analysis.has_failed()) {
5373 const char* error_message = analysis.error_message();
5374 return CompilationResult(zone->isolate(), error_message);
5375 }
5376
5377 // Create the correct assembler for the architecture.
5378 #ifndef V8_INTERPRETED_REGEXP
5379 // Native regexp implementation.
5380
5381 NativeRegExpMacroAssembler::Mode mode =
5382 is_one_byte ? NativeRegExpMacroAssembler::LATIN1
5383 : NativeRegExpMacroAssembler::UC16;
5384
5385 #if V8_TARGET_ARCH_IA32
5386 RegExpMacroAssemblerIA32 macro_assembler(mode, (data->capture_count + 1) * 2,
5387 zone);
5388 #elif V8_TARGET_ARCH_X64
5389 RegExpMacroAssemblerX64 macro_assembler(mode, (data->capture_count + 1) * 2,
5390 zone);
5391 #elif V8_TARGET_ARCH_ARM
5392 RegExpMacroAssemblerARM macro_assembler(mode, (data->capture_count + 1) * 2,
5393 zone);
5394 #elif V8_TARGET_ARCH_ARM64
5395 RegExpMacroAssemblerARM64 macro_assembler(mode, (data->capture_count + 1) * 2,
5396 zone);
5397 #elif V8_TARGET_ARCH_MIPS
5398 RegExpMacroAssemblerMIPS macro_assembler(mode, (data->capture_count + 1) * 2,
5399 zone);
5400 #elif V8_TARGET_ARCH_MIPS64
5401 RegExpMacroAssemblerMIPS macro_assembler(mode, (data->capture_count + 1) * 2,
5402 zone);
5403 #elif V8_TARGET_ARCH_X87
5404 RegExpMacroAssemblerX87 macro_assembler(mode, (data->capture_count + 1) * 2,
5405 zone);
5406 #else
5407 #error "Unsupported architecture"
5408 #endif
5409
5410 #else // V8_INTERPRETED_REGEXP
5411 // Interpreted regexp implementation.
5412 EmbeddedVector<byte, 1024> codes;
5413 RegExpMacroAssemblerIrregexp macro_assembler(codes, zone);
5414 #endif // V8_INTERPRETED_REGEXP
5415
5416 // Inserted here, instead of in Assembler, because it depends on information
5417 // in the AST that isn't replicated in the Node structure.
5418 static const int kMaxBacksearchLimit = 1024;
5419 if (is_end_anchored &&
5420 !is_start_anchored &&
5421 max_length < kMaxBacksearchLimit) {
5422 macro_assembler.SetCurrentPositionFromEnd(max_length);
5423 }
5424
5425 if (is_global) {
5426 macro_assembler.set_global_mode(
5427 (data->tree->min_match() > 0)
5428 ? RegExpMacroAssembler::GLOBAL_NO_ZERO_LENGTH_CHECK
5429 : RegExpMacroAssembler::GLOBAL);
5430 }
5431
5432 return compiler.Assemble(&macro_assembler,
5433 node,
5434 data->capture_count,
5435 pattern);
5436 }
5437
5438 } // namespace dart
OLDNEW
« no previous file with comments | « runtime/vm/regexp.h ('k') | runtime/vm/regexp_assembler.h » ('j') | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698