DIY: Simple Ruby Tokenizer and Parser that recognize Polish notation

These days I had to create a parser that recognized a particular type of string.

With this article I’ll propose you a simple approach to write a simple Tokenizer and Parser, in Ruby, that recognize the Polish notation

We want to obtain a result from a string like this:

+ 2 2

obviosly 4

In order to do this we need to split the string in “tokens” that we will be able to use.

After the work of the Tokenizer we will have a result like this:

[ <Token @type=:operator, @value=‘+’, <Token @type=:number, @value=2.0>, <Token @type=:number, @value=2.0> ]

With this array of tokens, we can loop and choose which action to do for every single element based on its type.

First of all let’s create our Token class. Our string will be splitted in tokens, each one will be an instance of this class.

token.rb


class Token
  attr_reader :type, :value

  def initialize type, value
    @type, @value = type, value
  end
end

Now we can create the Tokenizer class that will split the string.

tokenizer.rb


class Tokenizer
  OPERATORS = ['+', '-', '*', '/']
  NUMBER_REGEXP = /^d*(.d+)?$/

  attr_reader :tokens, :cursor

  def initialize expression
    tokenize expression
    @cursor = -1
  end

  def current_token
    return if cursor == -1
    tokens[cursor]
  end

  def next_token
    cursor += 1
    current_token
  end

  def look_next_token
    tokens[cursor + 1]
  end

  def operator_included?
    tokens.map(&:type).include? :operator
  end

  private

  def tokenize expression
    @tokens = []
    expression_to_parse = expression.strip
    while expression_to_parse.size > 0
      tokens << read_next_token(expression_to_parse)
    end
  end

  def read_next_token expression_to_parse
    expression_to_parse.strip!
    next_char = expression_to_parse.slice 0
    if OPERATORS.include? next_char
      read_next_operator_token expression_to_parse
    elsif next_char =~ NUMBER_REGEXP
      read_next_numeric_token expression_to_parse
    else
      raise "Unknown token starting with '#{next_char}'"
    end
  end

  def read_next_operator_token expression_to_parse
    Token.new :operator, expression_to_parse.slice!(0)
  end

  def read_next_numeric_token expression_to_parse
    numeric_value = expression_to_parse.slice! 0
    while next_char = expression_to_parse.slice(0) && "#{numeric_value}#{next_char}" =~ NUMBER_REGEXP
      numeric_value << expression_to_parse.slice!(0)
    end
    Token.new :number, numeric_value.to_f
  end
end

This class takes the string and loops through every single char and tries to recognize them. Every time a char is analyzed it is removed from the main string. In this way we avoid the possibility of errors during the tokenizer process.

Pay attetion to read_next_numeric_token: this method will continue the loop on expression_to_parse until the char already parsed joined with next character match the NUMBER_REGEXP.

Once the expression has been tokenized we are ready to parse the result. Now the code of the Parser class:

parser.rb


class Parser

  attr_reader :tokenizer

  # Parse a polish notation expression and return the result
  def parse expression
    @tokenizer = Tokenizer.new expression

    raise 'Expression must start with an operator' if tokenizer.look_next_token.type == :number

    operate
  end

  def tokens
    tokenizer.tokens
  end

  def operate
    number_stack = []
    # continue while the last operator are popped from tokens
    while tokenizer.operator_included?
      current_token = tokens.pop
      if current_token.type == :number
        # push the number in the number_stack
        number_stack << current_token.value
      elsif current_token.type == :operator
        raise "Not enough operands" if number_stack.size < 2
        # take the last 2 number added to the stack
        first_value, second_value = number_stack.pop(2)
        # Add a new :number Token with the result of th operation
        tokens << Token.new( :number, first_value.send( current_token.value, second_value ) )
      end
    end
    raise "Not enough operators" if number_stack.size > 0

    tokens.first.value
  end
end

Following the Polish Notation’s rules, the Parser loops through the array of tokens starting from the last element. When a :number is found it will be pushed in the number_stack, whereas when we found an :operator we take (pop) the two last elements from the number_stack and then we send the operation (defined by the :operator Token), positioning it in the middle between the two numbers. Once we have the result, we can push it as a new :number Token in the tokens array.

In this specific case the loop through the tokens continues until there are no more :operator

In the end we get an array with a single element that contains the final result.

I hope you liked the post, this is my first post on this blog!

I would like to thank Nicola for most of the code and the theory behind it.

Thanks for reading and see you soon!

Leave a Reply

wpDiscuz