how to make a parser?

#parser #howto #explainlikeimfive

Ok, i just want to parse stuff and make a small language of my own. so i am wondering what is a simple way of parsing stuff.

i have managed to parse stuff like 4 + 4 * 2 without operation order.

(i mainly couldn't parse parentheses)

Top comments (5)

Tamás Szelei • Dec 23 '20

This is a big topic and the approach you need to use depends on your end goal. I'll try to explain it shortly and hopefully you'll get some keywords that you can google.

The old traditional way is tokenizing, then lexing, then parsing. In each step, you basically annotate your input to be more structured and then use that annotation in the next. The output of tokenizing is a token stream, which typically means that you split up the incoming characters into bigger, meaningful items. In your example above, this would mean you get number, operator, number, operator, number.

Then, lexing will take that token stream and use this information and the rules of your language to yield an abstract syntax tree. At this point, the tree can be traversed and the expression can be understood by a program (i.e. parsed).

Tokenizers and lexers can be auto-generated from formal grammar definitions, such as LR(1). yacc/bison and flex are tools that can be used for this. However, the generated code might not be very maintainable if your language is too complex. There is a certain middle ground of complexity where it's worth generating this stuff and where you don't get a monster of generated code that you can't touch.

Another, more modern approach is using something like a PEG grammar / PEG parser generator. PEG grammars bunch the tokenizing and lexing steps together and generally give you more fidelity in expressing the rules of your toy language. You didn't mention the language you were using to implement your parser, but I had some success with parsimonious using Python: github.com/erikrose/parsimonious (you can also search "xyz PEG parser" where xyz is the language of your choice).

Finally, not every language can be expressed with a formal grammar, but if you are creating something from scratch, it's a really good idea to at least start with that. Otherwise, you will have to find context manually in the lower levels (e.g. in the token stream or the text itself), or even from other sources (like the output of parsing other files).

Dinu Blanovschi • Dec 23 '20

tokenizing, then lexing, then parsing

Then, lexing will take that token stream and use this information and the rules of your language to yield an abstract syntax tree.

Note that tokenizing = lexing = getting a token stream from the input text, while parsing refers to building up the AST.

Patrick Smith • Dec 23 '20

I’ve made a library for JavaScript to make parsers called parcook. github.com/RoyalIcing/parcook

You write generator functions that yield the different parts you are interested in reading next. You can yield strings or regular expressions. You can compose those functions together to parse more complex things.

To parse a plus you could yield ‘+’
To parse any operator you could const operator = yield [‘+’, ‘-‘, ‘*’, ‘/‘]
To parse an integer you could yield /^-?\d+/
To parse using the result of another generator function, yield ThatFunctionName

(Generator functions have been part of JavaScript for a few years. They are quite unusual at first but once you get familiar, very powerful)

I’d be happy to make an example parser of maths expressions if you like. Or if you want to have a go I’m happy to answer any questions you have.

SimpleGeek • Dec 27 '20

I highly recommend this site: craftinginterpreters.com/

It has sections on parsing, but they are part of the larger overall tutorial on building a toy language interpreter. (It teaches you how to parse parantheses.)