Lexing Viper
After settling on Viper’s syntax (at least somewhat), I decided to set about building a lexer. I’ve never written one of these from scratch before, so it was a learning experience (which is true of pretty much this whole project, really). When writing it I wasn’t overly concerned with performance; I don’t mind if it’s considered “slow”, at least for now. Maybe eventually I’ll write a better one, but for now this is what I’ve got.
For some background, a lexer is a program which takes in some arbitrary text and divides it up into various tokens, called lexemes. You then use the lexemes to perform a parse of the text. So an analogy in English might be that lexing is breaking up the words in a sentence by looking for whitespace, but you also separate punctuation into separate lexemes in your head — after all, the period at the end of a sentence isn’t literally part of the last word of the sentence, right?
Lexing has nothing to do with determining the meaning of a set of tokens, though. It’s purely about splitting the tokens up. The meaning will come later.
Viper’s lexer has a few different modes of operation:
- Lex some arbitrary text (newlines included)
- Lex a file
- Lex a line
- Lex a token (no spaces)
The first two modes handle splitting the input text into separate lines. They then pass these lines along to the third
mode, which splits the lines based on whitespace. Finally, the individual tokens are passed to the fourth mode to be
turned into Lexeme
objects, and at the end of it we get a list of lexemes.
Lexemes
The Lexeme
objects for Viper can be subdivided as:
Indent
— four spacesNewLine
— the end of a line (or beginning of another line, depending on perspective)Comma
— a separator (not allowed to be used in operators)Number
— integers, floats, scientific notationName
— a lowercase name of a function or variableClass
— an uppercase name of a class or typeOperator
— a combination of symbols
From my understanding, most lexers use regular expressions to determine which tokens belong to what class of lexeme. I’m
fairly familiar with regular expressions, so this part wasn’t too bad. The Indent
, NewLine
, and Comma
lexemes were
all very straightforward to implement; they’re either individual characters (like the comma ,
and newline \n
) or
they’re a sequence of spaces at the front of a line (the indent).
Numbers
The Number
took a little more work, but I mostly copied the rules for Python’s own lexer to write it (since I intend
to support the same numerical representations):
Names
Name
lexemes were a little tricky. I wanted to enforce that all function and variable names had to start with either
an underscore or else a lowercase letter. (Haskell enforces the lowercase/uppercase distinction, and I find it
useful there.) I also wanted the ability to separate either by underscore or by hyphen, and then I wanted the name to be
able to end with certain symbols such as ?
or !
(like in Scheme).
The end result is a rather complicated-looking regex:
Essentially, a “name” is either a string of underscores, or else it’s a series of characters which:
- May start with any number of underscores (or not)
- Followed by a lowercase letter
- Followed by any number of letters, numbers, or underscores
- Followed by any number of similar constructions, but beginning with a hyphen
- Optionally followed by a symbol, such a
!
,@
,?
, etc (though not all symbols are supported, such as parens)
Valid Names | Invalid Names |
---|---|
foo |
Foo |
_foo |
FOO |
_ |
_Foo |
___ |
_FOO-BAR |
__foo_bar |
-foo |
foo-bar |
foo-? |
foo? |
?-foo |
foo-bar! |
!foo-bar |
Classes
The Class
lexeme (which will also represent types) is a bit simpler:
- Start with a capital letter
- Followed by any number of letters, numbers, or underscores
- Followed by any number of similar constructions, but beginning with a hpyen
Here’s the Python regex for it:
Valid Classes | Invalid Classes |
---|---|
FOO |
_FOO |
Foo |
foo |
FOO_BAR |
_foo |
FOO-BAR |
-foo |
FooBar |
Foo? |
Foo_Bar |
fOO |
Foo_Bar_ |
Operators
The last type of lexeme for Viper is the Operator
. This represents the various symbols which, as you may have guessed,
can constitute operators in Viper.
I wanted the language to be customizable in this respect, so I tried not to restrict the possible symbols for operators more than I had to. Here’s the regex I ended up with:
As you can see, you can pretty much stick any symbols together to form a new operator. There are no restrictions on, for example, whether the parentheses need to be balanced or something. (Whether this will end up being a point of pain for me remains to be seen, haha.)
Ambiguity
Of course, there are some potential sources of ambiguity in the language. For example, consider the token foo?!bar
.
Is this the same as foo ?! bar
, or should it be lexed as foo? ! bar
? I opted to take the
maximal munch approach: we look for the longest potential tokens from
left to right. The end result is that the above example would be correctly parsed as foo? ! bar
(where !
is used as
an operator).
Regex Matching Chain in Python
Unfortunately, Python’s regex library re
is a little difficult to deal with when you want to make a bunch of
comparisons to a single token.
Consider you have a token, token
, and a set of regular expressions, say RE_1
, RE_2
, and RE_3
. These regular
expressions have parenthetical groups inside them that you would want to extract from if token
matches.
The regular way of doing this in Python is roundabout and somewhat convoluted:
As you can see, this is repetitive and not very useful. Why do we have to keep assigning a new match
variable? I
looked around and eventually found
this solution.
However, I decided to rework it a bit to fit with my style more.
The basic idea is to create a class which you can use to do the comparisons. This class will return a truthy value if the match is successful. The end result is that you can use a simpler if/else system to achieve more readable code.
My class ended up looking like this:
It’s not very lengthy, which I think is good. (You’ll also notice that I like to use Python 3’s type annotations, though those can be left out if desired.) Now to use it, simply do:
Maybe it doesn’t look like much to you, but I prefer this version considerably. It just makes more sense to me. So… that’s what I went with!
Summary
In conclusion, I have a lexer now. It appears to run correctly (or so my tests would indicate, I think), so I’m happy with it. I imagine it’ll change somewhat over the coming months as I work more on Viper, but for the moment it’s solid and that’s what I wanted.
Stay tuned for more development on Viper!