Keywords are all singleton tokens with pre calculated lengths. Booleans are pre-calculated (rather than evaluating the strings “false” “true” repeatedly.
Reverse lookup of keyword name to string
The NAME and CLASSREF in 4x are strict. Each segment must start with a letter a-z and may not contain dashes (w includes letters, digits and _).
The single line comment includes the line ending.
PERFORMANCE NOTE: Comparison against a frozen string is faster (than unfrozen).
EPP_START is currently a marker token, may later get syntax
HEREDOC has syntax as an argument.
ALl tokens have three slots, the token name (a Symbol), the token text (String), and a token text length. All operator and punctuation tokens reuse singleton arrays Tokens that require unique values create a unique array per token.
PEFORMANCE NOTES: This construct reduces the amount of object that needs to be created for operators and punctuation. The length is pre-calculated for all singleton tokens. The length is used both to signal the length of the token, and to advance the scanner position (without having to advance it with a scan(regexp)).
This is used for unrecognized tokens, will always be a single character. This particular instance is not used, but is kept here for documentation purposes.
Tokens that are always unique to what has been lexed
Clears the lexer state (it is not required to call this as it will be garbage collected and the next lex call (#lex_string, #lex_file) will reset the internal state.
# File lib/puppet/pops/parser/lexer2.rb, line 189 def clear() # not really needed, but if someone wants to ensure garbage is collected as early as possible @scanner = nil @locator = nil @lexing_context = nil end
Emits (produces) a token [:tokensymbol, TokenValue] and moves the scanner’s position past the token
# File lib/puppet/pops/parser/lexer2.rb, line 650 def emit(token, byte_offset) @scanner.pos = byte_offset + token[2] [token[0], TokenValue.new(token, byte_offset, @locator)] end
Emits the completed token on the form [:tokensymbol, TokenValue. This method does not alter the scanner’s position.
# File lib/puppet/pops/parser/lexer2.rb, line 658 def emit_completed(token, byte_offset) [token[0], TokenValue.new(token, byte_offset, @locator)] end
Allows subprocessors for heredoc etc to enqueue tokens that are tokenized by a different lexer instance
# File lib/puppet/pops/parser/lexer2.rb, line 669 def enqueue(emitted_token) @token_queue << emitted_token end
Enqueues a completed token at the given offset
# File lib/puppet/pops/parser/lexer2.rb, line 663 def enqueue_completed(token, byte_offset) @token_queue << emit_completed(token, byte_offset) end
TODO: This method should not be used, callers should get the locator since it is most likely required to compute line, position etc given offsets.
# File lib/puppet/pops/parser/lexer2.rb, line 235 def file @locator ? @locator.file : nil end
Convenience method, and for compatibility with older lexer. Use the #lex_file instead. (Bad form to use overloading of assignment operator for something that is not really an assignment).
# File lib/puppet/pops/parser/lexer2.rb, line 228 def file=(file) lex_file(file) end
Scans all of the content and returns it in an array Note that the terminating [false, false] token is included in the result.
# File lib/puppet/pops/parser/lexer2.rb, line 260 def fullscan result = [] scan {|token, value| result.push([token, value]) } result end
# File lib/puppet/pops/parser/lexer2.rb, line 248 def initvars @token_queue = [] # NOTE: additional keys are used; :escapes, :uq_slurp_pattern, :newline_jump, :epp_* @lexing_context = { :brace_count => 0, :after => nil, } end
Initializes lexing of the content of the given file. An empty string is used if the file does not exist.
# File lib/puppet/pops/parser/lexer2.rb, line 241 def lex_file(file) initvars contents = Puppet::FileSystem.exist?(file) ? Puppet::FileSystem.read(file) : "" @scanner = StringScanner.new(contents.freeze) @locator = Puppet::Pops::Parser::Locator.locator(contents, file) end
# File lib/puppet/pops/parser/lexer2.rb, line 205 def lex_string(string, path='') initvars @scanner = StringScanner.new(string) @locator = Puppet::Pops::Parser::Locator.locator(string, path) end
This lexes one token at the current position of the scanner. PERFORMANCE NOTE: Any change to this logic should be performance measured.
# File lib/puppet/pops/parser/lexer2.rb, line 302 def lex_token # Using three char look ahead (may be faster to do 2 char look ahead since only 2 tokens require a third scn = @scanner ctx = @lexing_context before = @scanner.pos # A look ahead of 3 characters is used since the longest operator ambiguity is resolved at that point. # PERFORMANCE NOTE: It is faster to peek once and use three separate variables for lookahead 0, 1 and 2. # la = scn.peek(3) return nil if la.empty? # Ruby 1.8.7 requires using offset and length (or integers are returned. # PERFORMANCE NOTE. # It is slightly faster to use these local variables than accessing la[0], la[1] etc. in ruby 1.9.3 # But not big enough to warrant two completely different implementations. # la0 = la[0,1] la1 = la[1,1] la2 = la[2,1] # PERFORMANCE NOTE: # A case when, where all the cases are literal values is the fastest way to map from data to code. # It is much faster than using a hash with lambdas, hash with symbol used to then invoke send etc. # This case statement is evaluated for most character positions in puppet source, and great care must # be taken to not introduce performance regressions. # case la0 when '.' emit(TOKEN_DOT, before) when ',' emit(TOKEN_COMMA, before) when '[' if (before == 0 || scn.string[locator.char_offset(before)-1,1] =~ /[[:blank:]\r\n]+/) emit(TOKEN_LISTSTART, before) else emit(TOKEN_LBRACK, before) end when ']' emit(TOKEN_RBRACK, before) when '(' emit(TOKEN_LPAREN, before) when ')' emit(TOKEN_RPAREN, before) when ';' emit(TOKEN_SEMIC, before) when '?' emit(TOKEN_QMARK, before) when '*' emit(TOKEN_TIMES, before) when '%' if la1 == '>' && ctx[:epp_mode] scn.pos += 2 if ctx[:epp_mode] == :expr enqueue_completed(TOKEN_EPPEND, before) end ctx[:epp_mode] = :text interpolate_epp else emit(TOKEN_MODULO, before) end when '{' # The lexer needs to help the parser since the technology used cannot deal with # lookahead of same token with different precedence. This is solved by making left brace # after ? into a separate token. # ctx[:brace_count] += 1 emit(if ctx[:after] == :QMARK TOKEN_SELBRACE else TOKEN_LBRACE end, before) when '}' ctx[:brace_count] -= 1 emit(TOKEN_RBRACE, before) # TOKENS @, @@, @( when '@' case la1 when '@' emit(TOKEN_ATAT, before) # TODO; Check if this is good for the grammar when '(' heredoc else emit(TOKEN_AT, before) end # TOKENS |, |>, |>> when '|' emit(case la1 when '>' la2 == '>' ? TOKEN_RRCOLLECT : TOKEN_RCOLLECT else TOKEN_PIPE end, before) # TOKENS =, =>, ==, =~ when '=' emit(case la1 when '=' TOKEN_ISEQUAL when '>' TOKEN_FARROW when '~' TOKEN_MATCH else TOKEN_EQUALS end, before) # TOKENS '+', '+=', and '+>' when '+' emit(case la1 when '=' TOKEN_APPENDS when '>' TOKEN_PARROW else TOKEN_PLUS end, before) # TOKENS '-', '->', and epp '-%>' (end of interpolation with trim) when '-' if ctx[:epp_mode] && la1 == '%' && la2 == '>' scn.pos += 3 if ctx[:epp_mode] == :expr enqueue_completed(TOKEN_EPPEND_TRIM, before) end interpolate_epp(:with_trim) else emit(case la1 when '>' TOKEN_IN_EDGE when '=' TOKEN_DELETES else TOKEN_MINUS end, before) end # TOKENS !, !=, !~ when '!' emit(case la1 when '=' TOKEN_NOTEQUAL when '~' TOKEN_NOMATCH else TOKEN_NOT end, before) # TOKENS ~>, ~ when '~' emit(la1 == '>' ? TOKEN_IN_EDGE_SUB : TOKEN_TILDE, before) when '#' scn.skip(PATTERN_COMMENT) nil # TOKENS '/', '/*' and '/ regexp /' when '/' case la1 when '*' scn.skip(PATTERN_MLCOMMENT) nil else # regexp position is a regexp, else a div if regexp_acceptable? && value = scn.scan(PATTERN_REGEX) # Ensure an escaped / was not matched while value[-2..-2] == STRING_BSLASH_BSLASH # i.e. \\ value += scn.scan_until(PATTERN_REGEX_END) end regex = value.sub(PATTERN_REGEX_A, '').sub(PATTERN_REGEX_Z, '').gsub(PATTERN_REGEX_ESC, '/') emit_completed([:REGEX, Regexp.new(regex), scn.pos-before], before) else emit(TOKEN_DIV, before) end end # TOKENS <, <=, <|, <<|, <<, <-, <~ when '<' emit(case la1 when '<' if la2 == '|' TOKEN_LLCOLLECT else TOKEN_LSHIFT end when '=' TOKEN_LESSEQUAL when '|' TOKEN_LCOLLECT when '-' TOKEN_OUT_EDGE when '~' TOKEN_OUT_EDGE_SUB else TOKEN_LESSTHAN end, before) # TOKENS >, >=, >> when '>' emit(case la1 when '>' TOKEN_RSHIFT when '=' TOKEN_GREATEREQUAL else TOKEN_GREATERTHAN end, before) # TOKENS :, ::CLASSREF, ::NAME when ':' if la1 == ':' before = scn.pos # PERFORMANCE NOTE: This could potentially be speeded up by using a case/when listing all # upper case letters. Alternatively, the 'A', and 'Z' comparisons may be faster if they are # frozen. # if la2 >= 'A' && la2 <= 'Z' # CLASSREF or error value = scn.scan(PATTERN_CLASSREF) if value after = scn.pos emit_completed([:CLASSREF, value.freeze, after-before], before) else # move to faulty position ('::<uc-letter>' was ok) scn.pos = scn.pos + 3 lex_error(Puppet::Pops::Issues::ILLEGAL_FULLY_QUALIFIED_CLASS_REFERENCE) end else value = scn.scan(PATTERN_BARE_WORD) if value if value =~ PATTERN_NAME emit_completed([:NAME, value.freeze, scn.pos-before], before) else emit_completed([:WORD, value.freeze, scn.pos - before], before) end else # move to faulty position ('::' was ok) scn.pos = scn.pos + 2 lex_error(Puppet::Pops::Issues::ILLEGAL_FULLY_QUALIFIED_NAME) end end else emit(TOKEN_COLON, before) end when '$' if value = scn.scan(PATTERN_DOLLAR_VAR) emit_completed([:VARIABLE, value[1..-1].freeze, scn.pos - before], before) else # consume the $ and let higher layer complain about the error instead of getting a syntax error emit(TOKEN_VARIABLE_EMPTY, before) end when '"' # Recursive string interpolation, 'interpolate' either returns a STRING token, or # a DQPRE with the rest of the string's tokens placed in the @token_queue interpolate_dq when "'" emit_completed([:STRING, slurp_sqstring.freeze, scn.pos - before], before) when '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' value = scn.scan(PATTERN_NUMBER) if value length = scn.pos - before assert_numeric(value, length) emit_completed([:NUMBER, value.freeze, length], before) else # move to faulty position ([0-9] was ok) scn.pos = scn.pos + 1 lex_error(Puppet::Pops::Issues::ILLEGAL_NUMBER) end when 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_' value = scn.scan(PATTERN_BARE_WORD) if value && value =~ PATTERN_NAME emit_completed(KEYWORDS[value] || [:NAME, value.freeze, scn.pos - before], before) elsif value emit_completed([:WORD, value.freeze, scn.pos - before], before) else # move to faulty position ([a-z_] was ok) scn.pos = scn.pos + 1 fully_qualified = scn.match?(/::/) if fully_qualified lex_error(Puppet::Pops::Issues::ILLEGAL_FULLY_QUALIFIED_NAME) else lex_error(Puppet::Pops::Issues::ILLEGAL_NAME_OR_BARE_WORD) end end when 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z' value = scn.scan(PATTERN_CLASSREF) if value emit_completed([:CLASSREF, value.freeze, scn.pos - before], before) else # move to faulty position ([A-Z] was ok) scn.pos = scn.pos + 1 lex_error(Puppet::Pops::Issues::ILLEGAL_CLASS_REFERENCE) end when "\n" # If heredoc_cont is in effect there are heredoc text lines to skip over # otherwise just skip the newline. # if ctx[:newline_jump] scn.pos = ctx[:newline_jump] ctx[:newline_jump] = nil else scn.pos += 1 end return nil when ' ', "\t", "\r" scn.skip(PATTERN_WS) return nil else # In case of unicode spaces of various kinds that are captured by a regexp, but not by the # simpler case expression above (not worth handling those special cases with better performance). if scn.skip(PATTERN_WS) nil else # "unrecognized char" emit([:OTHER, la0, 1], before) end end end
Lexes an unquoted string. @param string [String] the string to lex @param locator [Puppet::Pops::Parser::Locator] the locator to use (a default is used if nil is given) @param escapes [Array<String>] array of character strings representing the escape sequences to transform @param interpolate [Boolean] whether interpolation of expressions should be made or not.
# File lib/puppet/pops/parser/lexer2.rb, line 217 def lex_unquoted_string(string, locator, escapes, interpolate) initvars @scanner = StringScanner.new(string) @locator = locator || Puppet::Pops::Parser::Locator.locator(string, '') @lexing_context[:escapes] = escapes || UQ_ESCAPES @lexing_context[:uq_slurp_pattern] = interpolate ? (escapes.include?('$') ? SLURP_UQ_PATTERN : SLURP_UQNE_PATTERN) : SLURP_ALL_PATTERN end
Answers after which tokens it is acceptable to lex a regular expression. PERFORMANCE NOTE: It may be beneficial to turn this into a hash with default value of true for missing entries. A case expression with literal values will however create a hash internally. Since a reference is always needed to the hash, this access is almost as costly as a method call.
# File lib/puppet/pops/parser/lexer2.rb, line 679 def regexp_acceptable? case @lexing_context[:after] # Ends of (potential) R-value generating expressions when :RPAREN, :RBRACK, :RRCOLLECT, :RCOLLECT false # End of (potential) R-value - but must be allowed because of case expressions # Called out here to not be mistaken for a bug. when :RBRACE true # Operands (that can be followed by DIV (even if illegal in grammar) when :NAME, :CLASSREF, :NUMBER, :STRING, :BOOLEAN, :DQPRE, :DQMID, :DQPOST, :HEREDOC, :REGEX, :VARIABLE, :WORD false else true end end
A block must be passed to scan. It will be called with two arguments, a symbol for the token, and an instance of LexerSupport::TokenValue PERFORMANCE NOTE: The TokenValue is designed to reduce the amount of garbage / temporary data and to only convert the lexer’s internal tokens on demand. It is slightly more costly to create an instance of a class defined in Ruby than an Array or Hash, but the gain is much bigger since transformation logic is avoided for many of its members (most are never used (e.g. line/pos information which is only of value in general for error messages, and for some expressions (which the lexer does not know about).
# File lib/puppet/pops/parser/lexer2.rb, line 274 def scan # PERFORMANCE note: it is faster to access local variables than instance variables. # This makes a small but notable difference since instance member access is avoided for # every token in the lexed content. # scn = @scanner ctx = @lexing_context queue = @token_queue lex_error_without_pos(Puppet::Pops::Issues::NO_INPUT_TO_LEXER) unless scn scn.skip(PATTERN_WS) # This is the lexer's main loop until queue.empty? && scn.eos? do if token = queue.shift || lex_token ctx[:after] = token[0] yield token end end # Signals end of input yield [false, false] end
Convenience method, and for compatibility with older lexer. Use the #lex_string instead which allows passing the path to use without first having to call file= (which reads the file if it exists). (Bad form to use overloading of assignment operator for something that is not really an assignment. Also, overloading of = does not allow passing more than one argument).
# File lib/puppet/pops/parser/lexer2.rb, line 201 def string=(string) lex_string(string, '') end
# File lib/puppet/pops/parser/lexer2.rb, line 183 def initialize() end