
Stage 3 Draft / October 8, 2024

Regular Expression Pattern Modifiers for ECMAScript

Introduction

See the proposal repository for background material and discussion.

1 Text Processing

1.1 RegExp (Regular Expression) Objects

A RegExp object contains a regular expression and the associated flags.

Note

The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language.

1.1.1 Patterns

The RegExp constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of Pattern.

Syntax

Pattern[UnicodeMode, N] ::
  Disjunction[?UnicodeMode, ?N]

Disjunction[UnicodeMode, N] ::
  Alternative[?UnicodeMode, ?N]
  Alternative[?UnicodeMode, ?N] | Disjunction[?UnicodeMode, ?N]

Alternative[UnicodeMode, N] ::
  [empty]
  Alternative[?UnicodeMode, ?N] Term[?UnicodeMode, ?N]

Term[UnicodeMode, N] ::
  Assertion[?UnicodeMode, ?N]
  Atom[?UnicodeMode, ?N]
  Atom[?UnicodeMode, ?N] Quantifier

Assertion[UnicodeMode, N] ::
  ^
  $
  \ b
  \ B
  ( ? = Disjunction[?UnicodeMode, ?N] )
  ( ? ! Disjunction[?UnicodeMode, ?N] )
  ( ? <= Disjunction[?UnicodeMode, ?N] )
  ( ? <! Disjunction[?UnicodeMode, ?N] )

Quantifier ::
  QuantifierPrefix
  QuantifierPrefix ?

QuantifierPrefix ::
  *
  +
  ?
  { DecimalDigits[~Sep] }
  { DecimalDigits[~Sep] , }
  { DecimalDigits[~Sep] , DecimalDigits[~Sep] }

Atom[UnicodeMode, N] ::
  PatternCharacter
  .
  \ AtomEscape[?UnicodeMode, ?N]
  CharacterClass[?UnicodeMode]
  ( GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] )
  ( ? : Disjunction[?UnicodeMode, ?N] )
  ( ? RegularExpressionFlags : Disjunction[?UnicodeMode, ?N] )
  ( ? RegularExpressionFlags - RegularExpressionFlags : Disjunction[?UnicodeMode, ?N] )

SyntaxCharacter :: one of
  ^ $ \ . * + ? ( ) [ ] { } |

PatternCharacter ::
  SourceCharacter but not SyntaxCharacter

AtomEscape[UnicodeMode, N] ::
  DecimalEscape
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode]
  [+N] k GroupName[?UnicodeMode]

CharacterEscape[UnicodeMode] ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?UnicodeMode]
  IdentityEscape[?UnicodeMode]

ControlEscape :: one of
  f n r t v

ControlLetter :: one of
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

GroupSpecifier[UnicodeMode] ::
  [empty]
  ? GroupName[?UnicodeMode]

GroupName[UnicodeMode] ::
  < RegExpIdentifierName[?UnicodeMode] >

RegExpIdentifierName[UnicodeMode] ::
  RegExpIdentifierStart[?UnicodeMode]
  RegExpIdentifierName[?UnicodeMode] RegExpIdentifierPart[?UnicodeMode]

RegExpIdentifierStart[UnicodeMode] ::
  IdentifierStartChar
  \ RegExpUnicodeEscapeSequence[+UnicodeMode]
  [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate

RegExpIdentifierPart[UnicodeMode] ::
  IdentifierPartChar
  \ RegExpUnicodeEscapeSequence[+UnicodeMode]
  [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate

RegExpUnicodeEscapeSequence[UnicodeMode] ::
  [+UnicodeMode] u HexLeadSurrogate \u HexTrailSurrogate
  [+UnicodeMode] u HexLeadSurrogate
  [+UnicodeMode] u HexTrailSurrogate
  [+UnicodeMode] u HexNonSurrogate
  [~UnicodeMode] u Hex4Digits
  [+UnicodeMode] u{ CodePoint }

UnicodeLeadSurrogate ::
  any Unicode code point in the inclusive range 0xD800 to 0xDBFF

UnicodeTrailSurrogate ::
  any Unicode code point in the inclusive range 0xDC00 to 0xDFFF

Each \u HexTrailSurrogate for which the choice of associated u HexLeadSurrogate is ambiguous shall be associated with the nearest possible u HexLeadSurrogate that would otherwise have no corresponding \u HexTrailSurrogate.

HexLeadSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is in the inclusive range 0xD800 to 0xDBFF

HexTrailSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is in the inclusive range 0xDC00 to 0xDFFF

HexNonSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is not in the inclusive range 0xD800 to 0xDFFF

IdentityEscape[UnicodeMode] ::
  [+UnicodeMode] SyntaxCharacter
  [+UnicodeMode] /
  [~UnicodeMode] SourceCharacter but not UnicodeIDContinue

DecimalEscape ::
  NonZeroDigit DecimalDigits[~Sep]opt [lookahead ∉ DecimalDigit]

CharacterClassEscape[UnicodeMode] ::
  d
  D
  s
  S
  w
  W
  [+UnicodeMode] p{ UnicodePropertyValueExpression }
  [+UnicodeMode] P{ UnicodePropertyValueExpression }

UnicodePropertyValueExpression ::
  UnicodePropertyName = UnicodePropertyValue
  LoneUnicodePropertyNameOrValue

UnicodePropertyName ::
  UnicodePropertyNameCharacters

UnicodePropertyNameCharacters ::
  UnicodePropertyNameCharacter UnicodePropertyNameCharactersopt

UnicodePropertyValue ::
  UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
  UnicodePropertyValueCharacters

UnicodePropertyValueCharacters ::
  UnicodePropertyValueCharacter UnicodePropertyValueCharactersopt

UnicodePropertyValueCharacter ::
  UnicodePropertyNameCharacter
  DecimalDigit

UnicodePropertyNameCharacter ::
  ControlLetter
  _

CharacterClass[UnicodeMode] ::
  [ [lookahead ≠ ^] ClassRanges[?UnicodeMode] ]
  [ ^ ClassRanges[?UnicodeMode] ]

ClassRanges[UnicodeMode] ::
  [empty]
  NonemptyClassRanges[?UnicodeMode]

NonemptyClassRanges[UnicodeMode] ::
  ClassAtom[?UnicodeMode]
  ClassAtom[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
  ClassAtom[?UnicodeMode] - ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]

NonemptyClassRangesNoDash[UnicodeMode] ::
  ClassAtom[?UnicodeMode]
  ClassAtomNoDash[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
  ClassAtomNoDash[?UnicodeMode] - ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]

ClassAtom[UnicodeMode] ::
  -
  ClassAtomNoDash[?UnicodeMode]

ClassAtomNoDash[UnicodeMode] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?UnicodeMode]

ClassEscape[UnicodeMode] ::
  b
  [+UnicodeMode] -
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode]

Note

A number of productions in this section are given alternative definitions in section A.1.1.

1.1.2 Pattern Semantics

1.1.2.1 Notation

The descriptions below use the following aliases:

  • Input is a List whose elements are the characters of the String being matched by the regular expression pattern. Each character is either a code unit or a code point, depending upon the kind of pattern involved. The notation Input[n] means the nth character of Input, where n can range between 0 (inclusive) and InputLength (exclusive).
  • InputLength is the number of characters in Input.
  • NcapturingParens is the total number of left-capturing parentheses (i.e. the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes) in the pattern. A left-capturing parenthesis is any ( pattern character that is matched by the ( terminal of the Atom :: ( GroupSpecifier Disjunction ) production.
  • DotAll is true if the RegExp object's [[OriginalFlags]] internal slot contains "s" and otherwise is false.
  • IgnoreCase is true if the RegExp object's [[OriginalFlags]] internal slot contains "i" and otherwise is false.
  • Multiline is true if the RegExp object's [[OriginalFlags]] internal slot contains "m" and otherwise is false.
  • Unicode is true if the RegExp object's [[OriginalFlags]] internal slot contains "u" and otherwise is false.
  • WordCharacters is the mathematical set that is the union of all sixty-three characters in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_" (letters, numbers, and U+005F (LOW LINE) in the Unicode Basic Latin block) and all characters c for which c is not in that set but Canonicalize(c) is. WordCharacters cannot contain more than sixty-three characters unless Unicode and IgnoreCase are both true.

Furthermore, the descriptions below use the following internal data structures:

  • A CharSet is a mathematical set of characters. When the Unicode flag is true, “all characters” means the CharSet containing all code point values; otherwise “all characters” means the CharSet containing all code unit values.
  • A State is an ordered pair (endIndex, captures) where endIndex is an integer and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The nth element of captures is either a List of characters that represents the value obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token failure that indicates that the match failed.
  • A Continuation is an Abstract Closure that takes one State argument and returns a MatchResult result. The Continuation attempts to match the remaining portion (specified by the closure's captured values) of the pattern against Input, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
  • A Matcher is an Abstract Closure that takes two arguments—a State and a Continuation—and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's captured values) of the pattern against Input, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.
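The State, Continuation, and Matcher structures described above can be modelled in JavaScript roughly as follows. This is an illustrative sketch, not the specification's algorithm: `failure`, `charMatcher`, `sequence`, and `identity` are hypothetical names, and Input is simplified to a fixed array of characters.

```javascript
// Illustrative model only. A State is { endIndex, captures };
// `failure` is a unique token standing in for the failure value.
const failure = Symbol("failure");
const input = Array.from("abc"); // hypothetical Input, one character per element

// Matcher for a single character: (State, Continuation) -> MatchResult.
function charMatcher(ch) {
  return (x, c) => {
    if (x.endIndex >= input.length || input[x.endIndex] !== ch) return failure;
    // Hand the advanced State to the continuation, which matches the rest.
    return c({ endIndex: x.endIndex + 1, captures: x.captures });
  };
}

// Matcher that runs m1, then m2, then the rest of the pattern (c).
function sequence(m1, m2) {
  return (x, c) => m1(x, (y) => m2(y, c));
}

// Continuation for "nothing left to match": return the State it was given.
const identity = (y) => y;

const ab = sequence(charMatcher("a"), charMatcher("b"));
const matched = ab({ endIndex: 0, captures: [] }, identity);
const missed = ab({ endIndex: 1, captures: [] }, identity);
```

In this sketch `matched` is a State whose endIndex is 2, while `missed` is the failure token because the character at index 1 is not "a".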

1.1.2.2 Static Semantics: Early Errors

Atom :: ( ? RegularExpressionFlags : Disjunction )
Atom :: ( ? RegularExpressionFlags - RegularExpressionFlags : Disjunction )

1.1.2.3 Modifiers Records

A Modifiers Record is a Record value used to encapsulate information about the regular expression flags that apply to a subpattern.

Modifiers Records have the fields listed in Table 1.

Table 1: Modifiers Record Fields
Field Name       Value       Meaning
[[DotAll]]       a Boolean   Indicates whether the "s" flag is currently enabled.
[[IgnoreCase]]   a Boolean   Indicates whether the "i" flag is currently enabled.
[[Multiline]]    a Boolean   Indicates whether the "m" flag is currently enabled.
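
As a rough illustration, a Modifiers Record can be modelled as an immutable object with three Boolean fields. The helper `withModifiers` below is hypothetical; it only approximates how a modifier group such as (?s-i: ... ) might derive a new record from the enclosing one, and it is not the UpdateModifiers operation that this specification defines elsewhere.

```javascript
// A Modifiers Record modelled as a frozen plain object (illustrative only).
const outer = Object.freeze({ dotAll: false, ignoreCase: true, multiline: false });

// Hypothetical helper: derive the record for a subpattern such as (?s-i:...).
// `add` and `remove` are strings of flag letters drawn from "i", "m", "s".
function withModifiers(modifiers, add, remove) {
  const pick = (letter, current) => {
    if (add.includes(letter)) return true;
    if (remove.includes(letter)) return false;
    return current; // flag not mentioned: inherit from the enclosing pattern
  };
  return Object.freeze({
    dotAll: pick("s", modifiers.dotAll),
    ignoreCase: pick("i", modifiers.ignoreCase),
    multiline: pick("m", modifiers.multiline),
  });
}

const inner = withModifiers(outer, "s", "i");
```

Here `inner` enables dotAll and disables ignoreCase for the subpattern, while `outer` is left unchanged for the rest of the pattern.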

1.1.2.4 Runtime Semantics: CompilePattern

The syntax-directed operation CompilePattern takes no arguments. It returns an Abstract Closure that takes a String and a non-negative integer and returns a MatchResult. It is defined piecewise over the following productions:

Pattern :: Disjunction
  1. Let modifiers be the Modifiers Record { [[DotAll]]: DotAll, [[IgnoreCase]]: IgnoreCase, [[Multiline]]: Multiline }.
  2. Let m be CompileSubpattern of Disjunction with arguments forward and modifiers.
  3. Return a new Abstract Closure with parameters (str, index) that captures m and performs the following steps when called:
    1. Assert: Type(str) is String.
    2. Assert: index is a non-negative integer which is ≤ the length of str.
    3. If Unicode is true, let Input be StringToCodePoints(str). Otherwise, let Input be a List whose elements are the code units that are the elements of str. Input will be used throughout the algorithms in 1.1.2. Each element of Input is considered to be a character.
    4. Let InputLength be the number of characters contained in Input. This alias will be used throughout the algorithms in 1.1.2.
    5. Let listIndex be the index into Input of the character that was obtained from element index of str.
    6. Let c be a new Continuation with parameters (y) that captures nothing and performs the following steps when called:
      1. Assert: y is a State.
      2. Return y.
    7. Let cap be a List of NcapturingParens undefined values, indexed 1 through NcapturingParens.
    8. Let x be the State (listIndex, cap).
    9. Return m(x, c).
Note

A Pattern compiles to an Abstract Closure value. RegExpBuiltinExec can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String, and, if it does match, what the values of the capturing parentheses would be. The algorithms in 1.1.2 are designed so that compiling a pattern may throw a SyntaxError exception; on the other hand, once the pattern is successfully compiled, applying the resulting Abstract Closure to find a match in a String cannot throw an exception (except for any implementation-defined exceptions that can occur anywhere such as out-of-memory).
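
The difference between the two kinds of Input can be seen with a small sketch. The helper `toInput` is a hypothetical name that approximates the step that builds Input from str; it is not part of the specification.

```javascript
// Hypothetical helper approximating how Input is built from str.
function toInput(str, unicode) {
  // Array.from iterates by code point; split("") yields UTF-16 code units.
  return unicode ? Array.from(str) : str.split("");
}

const sample = "a\u{1F600}b"; // U+1F600 occupies two UTF-16 code units
const codeUnitInput = toInput(sample, false);
const codePointInput = toInput(sample, true);

// The same distinction is observable through the built-in matcher:
// "." consumes one character of Input, whichever kind that is.
const dotWithoutU = /^.$/.test("\u{1F600}");
const dotWithU = /^.$/u.test("\u{1F600}");
```

With these inputs, `codeUnitInput` has four elements and `codePointInput` has three; `dotWithoutU` is false and `dotWithU` is true.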

1.1.2.5 Runtime Semantics: CompileSubpattern

The syntax-directed operation CompileSubpattern takes arguments direction (forward or backward) and modifiers (a Modifiers Record) and returns a Matcher.

Note 1

This section is amended in B.1.2.4.

It is defined piecewise over the following productions:

Disjunction :: Alternative | Disjunction
  1. Let m1 be CompileSubpattern of Alternative with arguments direction and modifiers.
  2. Let m2 be CompileSubpattern of Disjunction with arguments direction and modifiers.
  3. Return a new Matcher with parameters (x, c) that captures m1 and m2 and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let r be m1(x, c).
    4. If r is not failure, return r.
    5. Return m2(x, c).
Note 2

The | regular expression operator separates two alternatives. The pattern first tries to match the left Alternative (followed by the sequel of the regular expression); if it fails, it tries to match the right Disjunction (followed by the sequel of the regular expression). If the left Alternative, the right Disjunction, and the sequel all have choice points, all choices in the sequel are tried before moving on to the next choice in the left Alternative. If choices in the left Alternative are exhausted, the right Disjunction is tried instead of the left Alternative. Any capturing parentheses inside a portion of the pattern skipped by | produce undefined values instead of Strings. Thus, for example,

/a|ab/.exec("abc")

returns the result "a" and not "ab". Moreover,

/((a)|(ab))((c)|(bc))/.exec("abc")

returns the array

["abc", "a", "a", undefined, "bc", undefined, "bc"]

and not

["abc", "ab", undefined, "ab", "c", "c", undefined]

The order in which the two alternatives are tried is independent of the value of direction.
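
The ordering described in this note can be sketched with closures. This is a simplified model under stated assumptions: `literal` and `disjunction` are hypothetical helpers, captures are ignored, and Input is a fixed array.

```javascript
const failure = Symbol("failure");
const input = Array.from("abc");

// Matches a literal run of characters (a stand-in for an Alternative).
function literal(text) {
  return (x, c) => {
    const end = x.endIndex + text.length;
    if (input.slice(x.endIndex, end).join("") !== text) return failure;
    return c({ endIndex: end, captures: x.captures });
  };
}

// Disjunction :: Alternative | Disjunction
function disjunction(m1, m2) {
  return (x, c) => {
    const r = m1(x, c); // try the left alternative together with the sequel
    if (r !== failure) return r;
    return m2(x, c); // only then try the right disjunction
  };
}

const start = { endIndex: 0, captures: [] };
const aOrAb = disjunction(literal("a"), literal("ab"));

// With an empty sequel the left alternative wins, as in /a|ab/.
const leftWins = aOrAb(start, (y) => y);
// With a sequel that requires "c", the left alternative fails and "ab" is used.
const rightUsed = aOrAb(start, (y) => literal("c")(y, (z) => z));
const builtin = /a|ab/.exec("abc")[0];
```

In this model `leftWins` ends at index 1, `rightUsed` ends at index 3, and the built-in matcher agrees that /a|ab/ matches "a".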

Alternative :: [empty]
  1. Return a new Matcher with parameters (x, c) that captures nothing and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Return c(x).
Alternative :: Alternative Term
  1. Let m1 be CompileSubpattern of Alternative with arguments direction and modifiers.
  2. Let m2 be CompileSubpattern of Term with arguments direction and modifiers.
  3. If direction is forward, then
    1. Let m be a new Matcher with parameters (x, c) that captures m1 and m2 and performs the following steps when called:
      1. Assert: x is a State.
      2. Assert: c is a Continuation.
      3. Let d be a new Continuation with parameters (y) that captures c and m2 and performs the following steps when called:
        1. Assert: y is a State.
        2. Return m2(y, c).
      4. Return m1(x, d).
  4. Else,
    1. Assert: direction is backward.
    2. Let m be a new Matcher with parameters (x, c) that captures m1 and m2 and performs the following steps when called:
      1. Assert: x is a State.
      2. Assert: c is a Continuation.
      3. Let d be a new Continuation with parameters (y) that captures c and m1 and performs the following steps when called:
        1. Assert: y is a State.
        2. Return m1(y, c).
      4. Return m2(x, d).
Note 3

Consecutive Terms try to simultaneously match consecutive portions of Input. When direction is forward, if the left Alternative, the right Term, and the sequel of the regular expression all have choice points, all choices in the sequel are tried before moving on to the next choice in the right Term, and all choices in the right Term are tried before moving on to the next choice in the left Alternative. When direction is backward, the evaluation order of Alternative and Term is reversed.
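
The effect of direction is observable through greedy capture groups. The following illustration uses only the built-in matcher; the input string is an arbitrary choice.

```javascript
// Forward: the left group is matched first and, being greedy, takes as much as it can.
const forward = /^(\d+)(\d+)/.exec("1053");

// Backward (inside a lookbehind): the right group is matched first instead,
// so it is the one that takes as much as it can.
const backward = /(?<=(\d+)(\d+))$/.exec("1053");
```

Here `forward` is ["1053", "105", "3"], whereas `backward` is ["", "1", "053"].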

Term :: Assertion
  1. Return CompileAssertion of Assertion with argument modifiers.
Note 4

The resulting Matcher is independent of direction.

Term :: Atom
  1. Return CompileAtom of Atom with arguments direction and modifiers.
Term :: Atom Quantifier
  1. Let m be CompileAtom of Atom with arguments direction and modifiers.
  2. Let q be CompileQuantifier of Quantifier.
  3. Assert: q.[[Min]] ≤ q.[[Max]].
  4. Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Term. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing this Term.
  5. Let parenCount be the number of left-capturing parentheses in Atom. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes enclosed by Atom.
  6. Return a new Matcher with parameters (x, c) that captures m, q, parenIndex, and parenCount and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Return RepeatMatcher(m, q.[[Min]], q.[[Max]], q.[[Greedy]], x, c, parenIndex, parenCount).

1.1.2.6 Runtime Semantics: CompileAssertion

The syntax-directed operation CompileAssertion takes argument modifiers (a Modifiers Record) and returns a Matcher.

Note 1

This section is amended in B.1.2.5.

It is defined piecewise over the following productions:

Assertion :: ^
  1. Return a new Matcher with parameters (x, c) that captures modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. If e = 0, or if modifiers.[[Multiline]] is true and the character Input[e - 1] is one of LineTerminator, then
      1. Return c(x).
    5. Return failure.
Note 2

Even when the y flag is used with a pattern, ^ always matches only at the beginning of Input, or (if modifiers.[[Multiline]] is true) at the beginning of a line.
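
A short illustration of this note using the built-in matcher follows. The variable names are arbitrary.

```javascript
// Sticky matching starts at lastIndex, but ^ still requires the start of Input.
const sticky = /^b/y;
sticky.lastIndex = 1;
const stickyAlone = sticky.test("ab");

// With the m flag, ^ also succeeds immediately after a LineTerminator.
const stickyMultiline = /^b/my;
stickyMultiline.lastIndex = 2;
const afterNewline = stickyMultiline.test("a\nb");
```

`stickyAlone` is false because index 1 is neither the start of Input nor the start of a line; `afterNewline` is true because index 2 follows a line feed.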

Assertion :: $
  1. Return a new Matcher with parameters (x, c) that captures modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. If e = InputLength, or if modifiers.[[Multiline]] is true and the character Input[e] is one of LineTerminator, then
      1. Return c(x).
    5. Return failure.
Assertion :: \ b
  1. Return a new Matcher with parameters (x, c) that captures modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. Let a be IsWordChar(e - 1, modifiers).
    5. Let b be IsWordChar(e, modifiers).
    6. If a is true and b is false, or if a is false and b is true, return c(x).
    7. Return failure.
Assertion :: \ B
  1. Return a new Matcher with parameters (x, c) that captures modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. Let a be IsWordChar(e - 1, modifiers).
    5. Let b be IsWordChar(e, modifiers).
    6. If a is true and b is true, or if a is false and b is false, return c(x).
    7. Return failure.
Assertion :: ( ? = Disjunction )
  1. Let m be CompileSubpattern of Disjunction with arguments forward and modifiers.
  2. Return a new Matcher with parameters (x, c) that captures m and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let d be a new Continuation with parameters (y) that captures nothing and performs the following steps when called:
      1. Assert: y is a State.
      2. Return y.
    4. Let r be m(x, d).
    5. If r is failure, return failure.
    6. Let y be r's State.
    7. Let cap be y's captures List.
    8. Let xe be x's endIndex.
    9. Let z be the State (xe, cap).
    10. Return c(z).
Assertion :: ( ? ! Disjunction )
  1. Let m be CompileSubpattern of Disjunction with arguments forward and modifiers.
  2. Return a new Matcher with parameters (x, c) that captures m and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let d be a new Continuation with parameters (y) that captures nothing and performs the following steps when called:
      1. Assert: y is a State.
      2. Return y.
    4. Let r be m(x, d).
    5. If r is not failure, return failure.
    6. Return c(x).
Assertion :: ( ? <= Disjunction )
  1. Let m be CompileSubpattern of Disjunction with arguments backward and modifiers.
  2. Return a new Matcher with parameters (x, c) that captures m and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let d be a new Continuation with parameters (y) that captures nothing and performs the following steps when called:
      1. Assert: y is a State.
      2. Return y.
    4. Let r be m(x, d).
    5. If r is failure, return failure.
    6. Let y be r's State.
    7. Let cap be y's captures List.
    8. Let xe be x's endIndex.
    9. Let z be the State (xe, cap).
    10. Return c(z).
Assertion :: ( ? <! Disjunction )
  1. Let m be CompileSubpattern of Disjunction with arguments backward and modifiers.
  2. Return a new Matcher with parameters (x, c) that captures m and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let d be a new Continuation with parameters (y) that captures nothing and performs the following steps when called:
      1. Assert: y is a State.
      2. Return y.
    4. Let r be m(x, d).
    5. If r is not failure, return failure.
    6. Return c(x).

1.1.2.6.1 IsWordChar ( e, modifiers )

The abstract operation IsWordChar takes arguments e (an integer) and modifiers (a Modifiers Record) and returns a Boolean. It performs the following steps when called:

  1. If e = -1 or e is InputLength, return false.
  2. Let c be the character Input[e].
  3. Let wordCharacters be GetWordCharacters(modifiers).
  4. If c is in wordCharacters, return true.
  5. Return false.
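
The steps above, together with the \b assertion that uses them, can be sketched as follows. This sketch assumes the word-character set is just the sixty-three basic characters, with none of the additions that case folding may contribute; `isWordChar`, `isWordBoundary`, and `BASIC_WORD_CHARACTERS` are hypothetical names.

```javascript
// Assumption: only the sixty-three basic word characters are in the set.
const BASIC_WORD_CHARACTERS = new Set(
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"
);

function isWordChar(input, e) {
  // Positions just outside Input never count as word characters.
  if (e === -1 || e === input.length) return false;
  return BASIC_WORD_CHARACTERS.has(input[e]);
}

// \b succeeds at position e when exactly one neighbour is a word character.
function isWordBoundary(input, e) {
  return isWordChar(input, e - 1) !== isWordChar(input, e);
}

const input = Array.from("hi there");
const boundaries = [];
for (let e = 0; e <= input.length; e++) {
  if (isWordBoundary(input, e)) boundaries.push(e);
}
```

For the input "hi there", `boundaries` is [0, 2, 3, 8].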

1.1.2.7 Runtime Semantics: CompileAtom

The syntax-directed operation CompileAtom takes arguments direction (forward or backward) and modifiers (a Modifiers Record) and returns a Matcher.

Note 1

This section is amended in B.1.2.6.

It is defined piecewise over the following productions:

Atom :: PatternCharacter
  1. Let ch be the character matched by PatternCharacter.
  2. Let A be a one-element CharSet containing the character ch.
  3. Return CharacterSetMatcher(A, false, direction, modifiers).
Atom :: .
  1. Let A be the CharSet of all characters.
  2. If modifiers.[[DotAll]] is not true, then
    1. Remove from A all characters corresponding to a code point on the right-hand side of the LineTerminator production.
  3. Return CharacterSetMatcher(A, false, direction, modifiers).
Atom :: CharacterClass
  1. Let cc be CompileCharacterClass of CharacterClass.
  2. Return CharacterSetMatcher(cc.[[CharSet]], cc.[[Invert]], direction, modifiers).
Atom :: ( GroupSpecifier Disjunction )
  1. Let m be CompileSubpattern of Disjunction with arguments direction and modifiers.
  2. Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Atom. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing this Atom.
  3. Return a new Matcher with parameters (x, c) that captures direction, m, and parenIndex and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let d be a new Continuation with parameters (y) that captures x, c, direction, and parenIndex and performs the following steps when called:
      1. Assert: y is a State.
      2. Let cap be a copy of y's captures List.
      3. Let xe be x's endIndex.
      4. Let ye be y's endIndex.
      5. If direction is forward, then
        1. Assert: xe ≤ ye.
        2. Let s be a List whose elements are the characters of Input at indices xe (inclusive) through ye (exclusive).
      6. Else,
        1. Assert: direction is backward.
        2. Assert: ye ≤ xe.
        3. Let s be a List whose elements are the characters of Input at indices ye (inclusive) through xe (exclusive).
      7. Set cap[parenIndex + 1] to s.
      8. Let z be the State (ye, cap).
      9. Return c(z).
    4. Return m(x, d).
Atom :: ( ? : Disjunction )
  1. Return CompileSubpattern of Disjunction with arguments direction and modifiers.
Atom :: ( ? RegularExpressionFlags : Disjunction )
  1. Let addModifiers be the source text matched by RegularExpressionFlags.
  2. Let removeModifiers be the empty String.
  3. Let newModifiers be UpdateModifiers(modifiers, CodePointsToString(addModifiers), removeModifiers).
  4. Return CompileSubpattern of Disjunction with arguments direction and newModifiers.
Atom :: ( ? RegularExpressionFlags - RegularExpressionFlags : Disjunction )
  1. Let addModifiers be the source text matched by the first RegularExpressionFlags.
  2. Let removeModifiers be the source text matched by the second RegularExpressionFlags.
  3. Let newModifiers be UpdateModifiers(modifiers, CodePointsToString(addModifiers), CodePointsToString(removeModifiers)).
  4. Return CompileSubpattern of Disjunction with arguments direction and newModifiers.
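
The scoping that these two productions give to modifiers can be checked against an engine. Because not every engine implements this syntax yet, the sketch below compiles the pattern at run time and treats a SyntaxError as "unsupported"; `tryCompile` is a hypothetical helper.

```javascript
// Returns a RegExp, or null when the engine rejects the pattern syntax.
function tryCompile(source, flags = "") {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    if (err instanceof SyntaxError) return null;
    throw err;
  }
}

// "i" is enabled only inside the group, so "a" is case-insensitive
// while the trailing "b" remains case-sensitive.
const scoped = tryCompile("(?i:a)b");
const modifierResults =
  scoped === null
    ? null // engine without pattern-modifier support
    : {
        lowerA: scoped.test("ab"),
        upperA: scoped.test("Ab"),
        upperB: scoped.test("aB"),
      };
```

On an engine that supports modifiers, the semantics above imply that `lowerA` and `upperA` are true and `upperB` is false. On other engines `modifierResults` is null.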
AtomEscape :: DecimalEscape
  1. Let n be the CapturingGroupNumber of DecimalEscape.
  2. Assert: n ≤ NcapturingParens.
  3. Return BackreferenceMatcher(n, direction, modifiers).
Note 2

An escape sequence of the form \ followed by a non-zero decimal number n matches the result of the nth set of capturing parentheses (1.1.2.1). It is an error if the regular expression has fewer than n capturing parentheses. If the regular expression has n or more capturing parentheses but the nth one is undefined because it has not captured anything, then the backreference always succeeds.
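
The last sentence of this note can be illustrated with the built-in matcher. The patterns are arbitrary examples.

```javascript
// Group 1 sits in the alternative that was not taken, so it is undefined
// and \1 succeeds without consuming anything.
const skippedGroup = /(?:(a)|b)\1/.exec("b");

// \1 is evaluated before group 1 has captured anything, so it also succeeds.
const forwardReference = /\1(a)/.exec("a");
```

`skippedGroup` is ["b", undefined] and `forwardReference` is ["a", "a"].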

AtomEscape :: CharacterEscape
  1. Let cv be the CharacterValue of CharacterEscape.
  2. Let ch be the character whose character value is cv.
  3. Let A be a one-element CharSet containing the character ch.
  4. Return CharacterSetMatcher(A, false, direction, modifiers).
AtomEscape :: CharacterClassEscape
  1. Let A be CompileToCharSet of CharacterClassEscape.
  2. Return CharacterSetMatcher(A, false, direction, modifiers).
AtomEscape :: k GroupName
  1. Search the enclosing Pattern for an instance of a GroupSpecifier containing a RegExpIdentifierName which has a CapturingGroupName equal to the CapturingGroupName of the RegExpIdentifierName contained in GroupName.
  2. Assert: A unique such GroupSpecifier is found.
  3. Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of the located GroupSpecifier. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing the located GroupSpecifier, including its immediately enclosing Atom.
  4. Return BackreferenceMatcher(parenIndex, direction, modifiers).

1.1.2.7.1 CharacterSetMatcher ( A, invert, direction, modifiers )

The abstract operation CharacterSetMatcher takes arguments A (a CharSet), invert (a Boolean), direction (forward or backward), and modifiers (a Modifiers Record) and returns a Matcher. It performs the following steps when called:

  1. Return a new Matcher with parameters (x, c) that captures A, invert, direction, and modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. If direction is forward, let f be e + 1.
    5. Else, let f be e - 1.
    6. If f < 0 or f > InputLength, return failure.
    7. Let index be min(e, f).
    8. Let ch be the character Input[index].
    9. Let cc be Canonicalize(ch, modifiers).
    10. If there exists a member a of A such that Canonicalize(a, modifiers) is cc, let found be true. Otherwise, let found be false.
    11. If invert is false and found is false, return failure.
    12. If invert is true and found is true, return failure.
    13. Let cap be x's captures List.
    14. Let y be the State (f, cap).
    15. Return c(y).
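
The steps above can be sketched in JavaScript as follows. This is a simplified model: Input is a fixed array, the `canonicalize` function is a deliberately simplified stand-in that upper-cases when a hypothetical `ignoreCase` field is set, and the Modifiers Record is a plain object.

```javascript
const failure = Symbol("failure");
const input = Array.from("xyz");

// Simplified stand-in for Canonicalize (an assumption, not the real operation).
function canonicalize(ch, modifiers) {
  return modifiers.ignoreCase ? ch.toUpperCase() : ch;
}

function characterSetMatcher(A, invert, direction, modifiers) {
  return (x, c) => {
    const e = x.endIndex;
    const f = direction === "forward" ? e + 1 : e - 1;
    if (f < 0 || f > input.length) return failure;
    const ch = input[Math.min(e, f)]; // the character between positions e and f
    const cc = canonicalize(ch, modifiers);
    const found = [...A].some((a) => canonicalize(a, modifiers) === cc);
    // Fail when (not inverted and not found) or (inverted and found).
    if (found === invert) return failure;
    return c({ endIndex: f, captures: x.captures });
  };
}

const identity = (y) => y;
const matchY = characterSetMatcher(new Set(["Y"]), false, "forward", { ignoreCase: true });
const forwardResult = matchY({ endIndex: 1, captures: [] }, identity);

const backwardY = characterSetMatcher(new Set(["y"]), false, "backward", { ignoreCase: false });
const backwardResult = backwardY({ endIndex: 2, captures: [] }, identity);
const atStart = backwardY({ endIndex: 0, captures: [] }, identity);
```

In this model `forwardResult` ends at index 2, `backwardResult` ends at index 1, and `atStart` is the failure token because there is no character before index 0.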

1.1.2.7.2 BackreferenceMatcher ( n, direction, modifiers )

The abstract operation BackreferenceMatcher takes arguments n (a positive integer), direction (forward or backward), and modifiers (a Modifiers Record) and returns a Matcher. It performs the following steps when called:

  1. Assert: n ≥ 1.
  2. Return a new Matcher with parameters (x, c) that captures n, direction, and modifiers and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let cap be x's captures List.
    4. Let s be cap[n].
    5. If s is undefined, return c(x).
    6. Let e be x's endIndex.
    7. Let len be the number of elements in s.
    8. If direction is forward, let f be e + len.
    9. Else, let f be e - len.
    10. If f < 0 or f > InputLength, return failure.
    11. Let g be min(e, f).
    12. If there exists an integer i between 0 (inclusive) and len (exclusive) such that Canonicalize(s[i], modifiers) is not the same character value as Canonicalize(Input[g + i], modifiers), return failure.
    13. Let y be the State (f, cap).
    14. Return c(y).

1.1.2.7.3 Canonicalize ( ch, modifiers )

The abstract operation Canonicalize takes arguments ch (a character) and modifiers (a Modifiers Record) and returns a character. It performs the following steps when called:

  1. If Unicode is true and modifiers.[[IgnoreCase]] is true, then
    1. If the file CaseFolding.txt of the Unicode Character Database provides a simple or common case folding mapping for ch, return the result of applying that mapping to ch.
    2. Return ch.
  2. If modifiers.[[IgnoreCase]] is false, return ch.
  3. Assert: ch is a UTF-16 code unit.
  4. Let cp be the code point whose numeric value is that of ch.
  5. Let u be the result of toUppercase(« cp »), according to the Unicode Default Case Conversion algorithm.
  6. Let uStr be CodePointsToString(u).
  7. If uStr does not consist of a single code unit, return ch.
  8. Let cu be uStr's single code unit element.
  9. If the numeric value of ch ≥ 128 and the numeric value of cu < 128, return ch.
  10. Return cu.
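
The steps that apply when Unicode is false can be sketched as follows. The function name `canonicalizeCodeUnit` and the `ignoreCase` field are hypothetical; the sketch relies on String.prototype.toUpperCase, which performs locale-independent upper-casing, and it does not cover the Unicode-mode case-folding branch.

```javascript
// Sketch of the non-Unicode branch only. `ch` is a one-code-unit string.
function canonicalizeCodeUnit(ch, modifiers) {
  if (!modifiers.ignoreCase) return ch;
  const upper = ch.toUpperCase(); // may expand, e.g. "ß" becomes "SS"
  if (upper.length !== 1) return ch; // not a single code unit: keep ch
  // Never map a non-ASCII code unit onto an ASCII one.
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

const ignoreCase = { ignoreCase: true };
const plainLetter = canonicalizeCodeUnit("a", ignoreCase);
const sharpS = canonicalizeCodeUnit("\u00DF", ignoreCase);
const longS = canonicalizeCodeUnit("\u017F", ignoreCase);
const eAcute = canonicalizeCodeUnit("\u00E9", ignoreCase);
const caseSensitive = canonicalizeCodeUnit("a", { ignoreCase: false });
```

In this sketch "a" maps to "A" and U+00E9 maps to U+00C9, while U+00DF and U+017F are returned unchanged.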
Note 1

Parentheses of the form ( Disjunction ) serve both to group the components of the Disjunction pattern together and to save the result of the match. The result can be used either in a backreference (\ followed by a non-zero decimal number), referenced in a replace String, or returned as part of an array from the regular expression matching Abstract Closure. To inhibit the capturing behaviour of parentheses, use the form (?: Disjunction ) instead.

Note 2

The form (?= Disjunction ) specifies a zero-width positive lookahead. In order for it to succeed, the pattern inside Disjunction must match at the current position, but the current position is not advanced before matching the sequel. If Disjunction can match at the current position in several ways, only the first one is tried. Unlike other regular expression operators, there is no backtracking into a (?= form (this unusual behaviour is inherited from Perl). This only matters when the Disjunction contains capturing parentheses and the sequel of the pattern contains backreferences to those captures.

For example,

/(?=(a+))/.exec("baaabac")

matches the empty String immediately after the first b and therefore returns the array:

["", "aaa"]

To illustrate the lack of backtracking into the lookahead, consider:

/(?=(a+))a*b\1/.exec("baaabac")

This expression returns

["aba", "a"]

and not:

["aaaba", "a"]
Note 3

The form (?! Disjunction ) specifies a zero-width negative lookahead. In order for it to succeed, the pattern inside Disjunction must fail to match at the current position. The current position is not advanced before matching the sequel. Disjunction can contain capturing parentheses, but backreferences to them only make sense from within Disjunction itself. Backreferences to these capturing parentheses from elsewhere in the pattern always return undefined because the negative lookahead must fail for the pattern to succeed. For example,

/(.*?)a(?!(a+)b\2c)\2(.*)/.exec("baaabaac")

looks for an a not immediately followed by some positive number n of a's, a b, another n a's (specified by the first \2) and a c. The second \2 is outside the negative lookahead, so it matches against undefined and therefore always succeeds. The whole expression returns the array:

["baaabaac", "ba", undefined, "abaac"]
Note 4

In case-insignificant matches when Unicode is true, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared. The simple mapping always maps to a single code point, so it does not map, for example, ß (U+00DF) to SS. It may however map a code point outside the Basic Latin range to a character within, for example, ſ (U+017F) to s. Such characters are not mapped if Unicode is false. This prevents Unicode code points such as U+017F and U+212A from matching regular expressions such as /[a-z]/i, but they will match /[a-z]/ui.
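
The difference described in this note can be observed directly with the u and i flags. The variable names below are illustrative only.

```javascript
const longS = "\u017F";  // ſ LATIN SMALL LETTER LONG S
const kelvin = "\u212A"; // KELVIN SIGN

const caseFolding = {
  // Without the u flag, non-ASCII characters are not mapped into Basic Latin.
  longSLegacy: /[a-z]/i.test(longS),
  kelvinLegacy: /[a-z]/i.test(kelvin),
  // With the u flag, simple case folding maps ſ to s and the Kelvin sign to k.
  longSUnicode: /[a-z]/ui.test(longS),
  kelvinUnicode: /[a-z]/ui.test(kelvin),
  // Simple case folding never maps ß to "ss".
  sharpS: /ss/ui.test("\u00DF"),
};
```

Here the two legacy results are false, the two Unicode results are true, and sharpS is false.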

1.1.2.8 Runtime Semantics: CompileToCharSet

The syntax-directed operation CompileToCharSet takes no arguments and returns a CharSet.

Note 1

This section is amended in B.1.2.8.

It is defined piecewise over the following productions:

ClassRanges :: [empty]
  1. Return the empty CharSet.
NonemptyClassRanges :: ClassAtom NonemptyClassRangesNoDash
  1. Let A be CompileToCharSet of ClassAtom.
  2. Let B be CompileToCharSet of NonemptyClassRangesNoDash.
  3. Return the union of CharSets A and B.
NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges
  1. Let A be CompileToCharSet of the first ClassAtom.
  2. Let B be CompileToCharSet of the second ClassAtom.
  3. Let C be CompileToCharSet of ClassRanges.
  4. Let D be CharacterRange(A, B).
  5. Return the union of D and C.
NonemptyClassRangesNoDash :: ClassAtomNoDash NonemptyClassRangesNoDash
  1. Let A be CompileToCharSet of ClassAtomNoDash.
  2. Let B be CompileToCharSet of NonemptyClassRangesNoDash.
  3. Return the union of CharSets A and B.
NonemptyClassRangesNoDash :: ClassAtomNoDash - ClassAtom ClassRanges
  1. Let A be CompileToCharSet of ClassAtomNoDash.
  2. Let B be CompileToCharSet of ClassAtom.
  3. Let C be CompileToCharSet of ClassRanges.
  4. Let D be CharacterRange(A, B).
  5. Return the union of D and C.
Note 2

ClassRanges can expand into a single ClassAtom and/or ranges of two ClassAtom separated by dashes. In the latter case the ClassRanges includes all characters between the first ClassAtom and the second ClassAtom, inclusive; an error occurs if either ClassAtom does not represent a single character (for example, if one is \w) or if the first ClassAtom's character value is greater than the second ClassAtom's character value.

Note 3

Even if the pattern ignores case, the case of the two ends of a range is significant in determining which characters belong to the range. Thus, for example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while the pattern /[E-f]/i matches all upper and lower-case letters in the Unicode Basic Latin block as well as the symbols [, \, ], ^, _, and `.
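
As a quick check of the two patterns in this note, the following sketch probes a few characters. The object and property names are illustrative.

```javascript
const narrow = /[E-F]/i; // range covers only E and F before case is ignored
const wide = /[E-f]/i;   // range runs from E (U+0045) through f (U+0066)

const rangeChecks = {
  narrowLowerE: narrow.test("e"), // true
  narrowG: narrow.test("g"),      // false
  wideZ: wide.test("z"),          // true: Z lies inside E-f
  wideA: wide.test("A"),          // true: a lies inside E-f
  wideCaret: wide.test("^"),      // true: ^ lies between Z and a
  wideAt: wide.test("@"),         // false: @ (U+0040) precedes E
};
```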

Note 4

A - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification.
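
The positions in which - is literal can be illustrated with three small character classes. The names below are illustrative.

```javascript
const dashChecks = {
  // First character of the class: literal.
  leading: /[-a]/.test("-"),
  // Last character of the class: literal.
  trailing: /[a-]/.test("-"),
  // Immediately after the range a-c: literal, so the class is a, b, c, -, e.
  afterRangeDash: /[a-c-e]/.test("-"),
  afterRangeD: /[a-c-e]/.test("d"),
  afterRangeE: /[a-c-e]/.test("e"),
};
```

Only afterRangeD is false, because c-e is not parsed as a second range.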

ClassAtom :: -
  1. Return the CharSet containing the single character - U+002D (HYPHEN-MINUS).
ClassAtomNoDash :: SourceCharacter but not one of \ or ] or -
  1. Return the CharSet containing the character matched by SourceCharacter.
ClassEscape :: b
ClassEscape :: -
ClassEscape :: CharacterEscape
  1. Let cv be the CharacterValue of this ClassEscape.
  2. Let c be the character whose character value is cv.
  3. Return the CharSet containing the single character c.
Note 5

A ClassAtom can use any of the escape sequences that are allowed in the rest of the regular expression except for \b, \B, and backreferences. Inside a CharacterClass, \b means the backspace character, while \B and backreferences raise errors.
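
A short sketch of this behaviour follows. The helper name compilesClass is hypothetical, and the \B check uses the u flag because the web-browser grammar extensions relax the rule for non-Unicode patterns.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compilesClass(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

const classEscapes = {
  // Inside a class, \b is the backspace character U+0008.
  backspace: /[\b]/.test("\u0008"),
  notLetterB: /[\b]/.test("b"),
  // \B has no meaning inside a class and is rejected in Unicode mode.
  upperBUnicode: compilesClass("[\\B]", "u"),
};
```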

CharacterClassEscape :: d
  1. Return the ten-element CharSet containing the characters 0 through 9 inclusive.
CharacterClassEscape :: D
  1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: d .
CharacterClassEscape :: s
  1. Return the CharSet containing all characters corresponding to a code point on the right-hand side of the WhiteSpace or LineTerminator productions.
CharacterClassEscape :: S
  1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: s .
CharacterClassEscape :: w
  1. Return GetWordCharacters(modifiers).
CharacterClassEscape :: W
  1. Return the CharSet containing all characters not in the CharSet returned by CharacterClassEscape :: w .
CharacterClassEscape :: p{ UnicodePropertyValueExpression }
  1. Return the CharSet containing all Unicode code points included in CompileToCharSet of UnicodePropertyValueExpression.
CharacterClassEscape :: P{ UnicodePropertyValueExpression }
  1. Return the CharSet containing all Unicode code points not included in CompileToCharSet of UnicodePropertyValueExpression.
UnicodePropertyValueExpression :: UnicodePropertyName = UnicodePropertyValue
  1. Let ps be SourceText of UnicodePropertyName.
  2. Let p be UnicodeMatchProperty(ps).
  3. Assert: p is a Unicode property name or property alias listed in the “Property name and aliases” column of Table 66.
  4. Let vs be SourceText of UnicodePropertyValue.
  5. Let v be UnicodeMatchPropertyValue(p, vs).
  6. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value v.
UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue
  1. Let s be SourceText of LoneUnicodePropertyNameOrValue.
  2. If UnicodeMatchPropertyValue(General_Category, s) is identical to a List of Unicode code points that is the name of a Unicode general category or general category alias listed in the “Property value and aliases” column of Table 68, then
    1. Return the CharSet containing all Unicode code points whose character database definition includes the property “General_Category” with value s.
  3. Let p be UnicodeMatchProperty(s).
  4. Assert: p is a binary Unicode property or binary property alias listed in the “Property name and aliases” column of Table 67.
  5. Return the CharSet containing all Unicode code points whose character database definition includes the property p with value “True”.
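
The two UnicodePropertyValueExpression forms correspond to the following patterns. This is a usage sketch with illustrative names; property escapes require the u flag.

```javascript
const propertyChecks = {
  // UnicodePropertyName = UnicodePropertyValue
  greek: /\p{Script=Greek}/u.test("\u03B1"), // α
  // Lone value, resolved first as a General_Category value.
  upper: /\p{Lu}/u.test("A"),
  notUpper: /\p{Lu}/u.test("a"),
  // Lone name, resolved as a binary property.
  alphabetic: /\p{Alphabetic}/u.test("a"),
  // \P{...} is the complement of \p{...}.
  digitNotAlphabetic: /\P{Alphabetic}/u.test("1"),
};

// An unknown property name is an early error, surfaced here as a SyntaxError.
let unknownPropertyThrows = false;
try {
  new RegExp("\\p{NotAProperty}", "u");
} catch (e) {
  unknownPropertyThrows = e instanceof SyntaxError;
}
```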

1.1.2.9 GetWordCharacters ( modifiers )

The abstract operation GetWordCharacters takes argument modifiers (a Modifiers Record) and returns a CharSet. It performs the following steps when called:

  1. Let wordCharacters be the mathematical set that is the union of all sixty-three characters in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_" (letters, numbers, and U+005F (LOW LINE) in the Unicode Basic Latin block) and all characters c for which c is not in that set but Canonicalize(c, modifiers) is.
  2. Return wordCharacters.
Note
wordCharacters cannot contain more than sixty-three characters unless Unicode and modifiers.[[IgnoreCase]] are both true.
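
The effect of this note is visible with \w under different flag combinations. The sketch below uses whole-pattern flags only; within a modifier group such as (?i: ... ) the same rule is expected to apply in engines that implement this proposal.

```javascript
const longSChar = "\u017F"; // ſ canonicalizes to s only under u + i

const wordChecks = {
  plain: /\w/.test(longSChar),            // false
  ignoreCaseOnly: /\w/i.test(longSChar),  // false
  unicodeOnly: /\w/u.test(longSChar),     // false
  unicodeIgnoreCase: /\w/iu.test(longSChar), // true: set grows past 63
};
```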

1.1.2.10 UpdateModifiers ( modifiers, add, remove )

The abstract operation UpdateModifiers takes arguments modifiers (a Modifiers Record), add (a String), and remove (a String) and returns a Modifiers Record. It performs the following steps when called:

  1. Let dotAll be modifiers.[[DotAll]].
  2. Let ignoreCase be modifiers.[[IgnoreCase]].
  3. Let multiline be modifiers.[[Multiline]].
  4. If add contains "s", set dotAll to true.
  5. If add contains "i", set ignoreCase to true.
  6. If add contains "m", set multiline to true.
  7. If remove contains "s", set dotAll to false.
  8. If remove contains "i", set ignoreCase to false.
  9. If remove contains "m", set multiline to false.
  10. Return the Modifiers Record { [[DotAll]]: dotAll, [[IgnoreCase]]: ignoreCase, [[Multiline]]: multiline }.
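
The steps above can be sketched as a small JavaScript function. A plain object with dotAll, ignoreCase, and multiline properties stands in for a Modifiers Record; that representation is an assumption made for illustration.

```javascript
// Sketch of UpdateModifiers; returns a new record and leaves the input untouched.
function updateModifiers(modifiers, add, remove) {
  // Steps 1-3: start from the enclosing modifiers.
  let { dotAll, ignoreCase, multiline } = modifiers;
  // Steps 4-6: flags listed before the "-" are switched on.
  if (add.includes("s")) dotAll = true;
  if (add.includes("i")) ignoreCase = true;
  if (add.includes("m")) multiline = true;
  // Steps 7-9: flags listed after the "-" are switched off.
  if (remove.includes("s")) dotAll = false;
  if (remove.includes("i")) ignoreCase = false;
  if (remove.includes("m")) multiline = false;
  // Step 10.
  return { dotAll, ignoreCase, multiline };
}

// Modifiers for the body of a group written as (?i-m: ... ) inside a pattern
// whose enclosing modifiers have only multiline set.
const outerModifiers = { dotAll: false, ignoreCase: false, multiline: true };
const innerModifiers = updateModifiers(outerModifiers, "i", "m");
```

Here innerModifiers has ignoreCase set and multiline cleared, while outerModifiers is unchanged and continues to apply after the group closes.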

A Additional ECMAScript Features for Web Browsers

A.1 Additional Syntax

A.1.1 Regular Expressions Patterns

The syntax of 1.1.1 is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [UnicodeMode] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [UnicodeMode] parameter present on the goal symbol.

Syntax

Term[UnicodeMode, N] ::
  [+UnicodeMode] Assertion[+UnicodeMode, ?N]
  [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier
  [+UnicodeMode] Atom[+UnicodeMode, ?N]
  [~UnicodeMode] QuantifiableAssertion[?N] Quantifier
  [~UnicodeMode] Assertion[~UnicodeMode, ?N]
  [~UnicodeMode] ExtendedAtom[?N] Quantifier
  [~UnicodeMode] ExtendedAtom[?N]
Assertion[UnicodeMode, N] ::
  ^
  $
  \ b
  \ B
  [+UnicodeMode] ( ? = Disjunction[+UnicodeMode, ?N] )
  [+UnicodeMode] ( ? ! Disjunction[+UnicodeMode, ?N] )
  [~UnicodeMode] QuantifiableAssertion[?N]
  ( ? <= Disjunction[?UnicodeMode, ?N] )
  ( ? <! Disjunction[?UnicodeMode, ?N] )
QuantifiableAssertion[N] ::
  ( ? = Disjunction[~UnicodeMode, ?N] )
  ( ? ! Disjunction[~UnicodeMode, ?N] )
ExtendedAtom[N] ::
  .
  \ AtomEscape[~UnicodeMode, ?N]
  \ [lookahead = c]
  CharacterClass[~UnicodeMode]
  ( Disjunction[~UnicodeMode, ?N] )
  ( ? : Disjunction[~UnicodeMode, ?N] )
  ( ? RegularExpressionFlags : Disjunction[?UnicodeMode, ?N] )
  ( ? RegularExpressionFlags - RegularExpressionFlags : Disjunction[?UnicodeMode, ?N] )
  InvalidBracedQuantifier
  ExtendedPatternCharacter
InvalidBracedQuantifier ::
  { DecimalDigits[~Sep] }
  { DecimalDigits[~Sep] , }
  { DecimalDigits[~Sep] , DecimalDigits[~Sep] }
ExtendedPatternCharacter ::
  SourceCharacter but not one of ^ $ \ . * + ? ( ) [ |
AtomEscape[UnicodeMode, N] ::
  [+UnicodeMode] DecimalEscape
  [~UnicodeMode] DecimalEscape but only if the CapturingGroupNumber of DecimalEscape is ≤ NcapturingParens
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode, ?N]
  [+N] k GroupName[?UnicodeMode]
CharacterEscape[UnicodeMode, N] ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?UnicodeMode]
  [~UnicodeMode] LegacyOctalEscapeSequence
  IdentityEscape[?UnicodeMode, ?N]
IdentityEscape[UnicodeMode, N] ::
  [+UnicodeMode] SyntaxCharacter
  [+UnicodeMode] /
  [~UnicodeMode] SourceCharacterIdentityEscape[?N]
SourceCharacterIdentityEscape[N] ::
  [~N] SourceCharacter but not c
  [+N] SourceCharacter but not one of c or k
ClassAtomNoDash[UnicodeMode, N] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?UnicodeMode, ?N]
  \ [lookahead = c]
ClassEscape[UnicodeMode, N] ::
  b
  [+UnicodeMode] -
  [~UnicodeMode] c ClassControlLetter
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode, ?N]
ClassControlLetter ::
  DecimalDigit
  _

Note

When the same left-hand side occurs with both [+UnicodeMode] and [~UnicodeMode] guards, it is to control the disambiguation priority.
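
Two of the extensions in this grammar can be observed in an engine that implements the web-browser annex: a lone { as an ExtendedPatternCharacter and a quantified lookahead as a QuantifiableAssertion. The sketch below is a minimal check under that assumption; the helper name compilesPattern is hypothetical.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compilesPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

const annexChecks = {
  // "{" that does not begin a quantifier is a literal in non-Unicode patterns.
  braceLegacy: compilesPattern("a{", ""),
  braceUnicode: compilesPattern("a{", "u"),
  // A lookahead may take a quantifier only in non-Unicode patterns.
  quantifiedLookaheadLegacy: compilesPattern("(?=a)*", ""),
  quantifiedLookaheadUnicode: compilesPattern("(?=a)*", "u"),
};
```

Both legacy entries are true and both Unicode entries are false, consistent with the statement that these extensions do not change the syntax of Unicode patterns.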

B Copyright & Software License

Copyright Notice

© 2024 Ron Buckton, Ecma International

Software License

All Software contained in this document ("Software") is protected by copyright and is being made available under the "BSD License", included below. This Software may be subject to third party rights (rights from parties other than Ecma International), including patent rights, and no licenses under such third party rights are granted under this license even if the third party concerned is a member of Ecma International. SEE THE ECMA CODE OF CONDUCT IN PATENT MATTERS AVAILABLE AT https://ecma-international.org/memento/codeofconduct.htm FOR INFORMATION REGARDING THE LICENSING OF PATENT CLAIMS THAT ARE REQUIRED TO IMPLEMENT ECMA INTERNATIONAL STANDARDS.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the authors nor Ecma International may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE ECMA INTERNATIONAL "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ECMA INTERNATIONAL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.