A simple C++ tokenizer

Loading and parsing text files is not a trivial topic in computer science, and has been studied for a very long time. In this little post I won’t be delving into theory, formal grammars and languages, or the corresponding automata, but will instead describe a simple implementation of a pure C++ tokenizer, as used in the possumwood project.

Parsing text files

Parsing a text file in modern computer science consists of two steps – lexical analysis (or tokenization) and the parsing step proper (building a parse tree and interpreting it). Trivial parsers can combine these two aspects in a single class or set of functions, but more complex frameworks usually find this separation beneficial.

A classical tool for creating parsers in C++ is the combination of LEX and Yacc. LEX is a lexer – it translates the input source code into a sequence of symbols, or tokens. The symbols read are then assembled into a tree structure by a formal grammar described using Yacc. While these tools are time-tested and proven to work well, they require specific build steps and source files in their respective special formats, and they generate code that just does not look great. On the other hand, they are extremely powerful and efficient.

For the possumwood project, I wanted something simpler, written purely in C++. The file formats it needs to deal with are relatively straightforward (parseable using an LL(k) grammar), so the parsing step can be implemented quite easily using a classical recursive descent parser. The trickier part is the tokenizer, which can either be implemented directly in the parsing functions (a naive choice leading to a lot of unnecessarily complex and duplicated code), or as a separate object, moving the cumbersome lexical analysis step into a class of its own. A tokenizer is essentially just a state machine, but even such a simple concept can be implemented in different ways suiting different applications.

An interface of a tokenizer

A tokenizer class implemented in C++ will actually consist of two distinct interfaces:

  • the external interface, used by the parser to request new tokens
  • and the internal interface, used to define states of a lexing state machine.

Our tokenizer will be driven by the parser via the external interface, and each call will trigger one or more calls into the internal interface (changing states and processing characters) until at least one new token has been read, or the end of file has been encountered. Once a token is available, it is returned via the external interface.

Let’s have a look at both interfaces in more detail.

External interface

The external interface is the simpler of the two. Essentially, we just need a way of querying the “current” token, requesting a “next” token and determining the end of file:
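
A minimal version of this interface might look as follows (a sketch – the exact names in the possumwood sources may differ slightly):

    #include <istream>

    struct Token;  // described in the next section

    class Tokenizer {
      public:
        Tokenizer(std::istream& in);

        // the last token emitted by the state machine
        const Token& current() const;

        // reads input characters until a new token is emitted (or the
        // end of file is reached), and returns it
        const Token& next();

        // true once the input and all pending tokens are exhausted
        bool eof() const;
    };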

The use of this interface is quite straightforward – if the parser needs to test the current token’s value against a set of options, it would use the current() call; if it needs to read the next token and, optionally, compare it with an expected value, it would use the next() call.

Token class

The tokens of this simple parser implementation will just be structs holding a string token value and the line number it came from (useful for error reporting). For convenience, we can also add string comparison operators:
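
A sketch of such a struct (with illustrative member names) could look like this:

    #include <cstddef>
    #include <string>

    struct Token {
        // the string value of the token
        std::string value;

        // the line the token was read from, for error reporting
        std::size_t line;

        // convenience comparisons against expected string values
        bool operator==(const std::string& val) const { return value == val; }
        bool operator!=(const std::string& val) const { return value != val; }
    };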

A more sophisticated tokenizer would require at least a set of token types, and replace the string value with concrete token enum values, but none of these are necessary for this simple version.

Internal interface

The internal interface, used by the implementations of states, should allow each state function to accept (or convert) a character, reject a character, and emit a token. Assuming that our specific tokenizers will be derived from the base tokenizer, the internal interface will be protected:
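
One possible shape of this protected interface (again a sketch – the signatures in the actual sources may differ):

    #include <string>

    class Tokenizer {
      protected:
        // skips the current input character without adding it to the
        // token in progress
        void reject();

        // adds the current character to the token in progress,
        // optionally replacing it with a different character or string
        void accept();
        void accept(char c);
        void accept(const std::string& s);

        // finishes the token in progress, making it available via the
        // public interface; does nothing if no character has been
        // accepted since the last emit
        void emit();

        // ... public interface as described above ...
    };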

This interface is used during parsing, where an active state will process each input character. As a response to this character, it can:

  • reject the character (shifting the input reader to the next character)
  • accept the character, optionally converting it to a different character or string (upper/lower case handling; special characters)
  • emit a token, made of the characters accepted since the last emission (making a new token retrievable via the public interface)

The emit() method should only emit a new token if at least one character has been accepted already. This simplifies the client code, as it does not have to create extra states to handle starts and ends of tokens precisely.

At this point, the last thing we are missing for a full implementation of a state machine is the definition of states and state transitions.

State class

The first intuition would be to implement states as instances of classes derived from a common base class. However, looking at their requirements, we can find a few details that make that model tricky:

  • the states should not have any “internal state” (member variables) of their own, otherwise our state machine becomes something else entirely; the implementation should discourage the client code from adding such state as much as possible
  • they should have access to the internal interface of the tokenizer class
  • they should be easy to implement, preferably inline in the body of the derived tokenizer class

All of the above can be satisfied by a method defined on the derived class, or, in C++11, by a lambda stored in an std::function. If we then store these in a container, and allow them to be set “active” to represent state transitions, we get an interface like this:
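
A sketch of such a State class, nested in the Tokenizer (the member names are illustrative):

    #include <functional>

    class Tokenizer {
      public:
        class State {
          public:
            // each state is fully described by a single function,
            // processing one input character at a time
            State(Tokenizer* tokenizer, std::function<void(char)> fn)
                : m_tokenizer(tokenizer), m_fn(std::move(fn)) {}

            // makes this state the active one - a state transition
            void setActive() {
                m_tokenizer->m_activeState = this;
            }

          private:
            Tokenizer* m_tokenizer;
            std::function<void(char)> m_fn;

            // allows the tokenizer to call m_fn without exposing it
            friend class Tokenizer;
        };

        // ... external and internal interfaces as described above ...

      private:
        // the currently active state, called for each input character
        State* m_activeState;

        // allows State instances to change the active state
        friend class State;
    };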

By making the State class a friend of the Tokenizer class, we can allow it to set the tokenizer’s current state to a different State instance, implementing state transitions. Making Tokenizer a friend of State in turn allows the tokenizer to call a State’s function without the need to explicitly expose it in the interface.

Simple tokenizer example

At this point, it might not be entirely clear how to use the interfaces described above. Let’s have a look at a simple (but fully functional) tokenizer for the BVH file format. Tokenization of BVH files is trivial – words and brackets are handled as separate tokens, with whitespace and the brackets themselves acting as separators. All of this can be handled using a single state (strictly speaking, our tokenizer is then not a plain state machine anymore, as a single State instance can cover several states of a formal machine).
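
A sketch of such a tokenizer, built on the interface described above, might look like this (the real possumwood implementation differs in details):

    #include <cctype>
    #include <istream>

    class BvhTokenizer : public Tokenizer {
      public:
        BvhTokenizer(std::istream& in) : Tokenizer(in),
            m_start(this, [this](char c) {
                if(std::isspace(static_cast<unsigned char>(c))) {
                    // whitespace terminates the current token; emit()
                    // does nothing if no characters were accepted yet
                    emit();
                    reject();
                }
                else if(c == '{' || c == '}') {
                    // a bracket terminates the current token, and
                    // forms a single-character token of its own
                    emit();
                    accept();
                    emit();
                }
                else
                    // any other character is part of a word token
                    accept();
            })
        {
            // start in (and never leave) the single state
            m_start.setActive();
        }

      private:
        State m_start;
    };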

A slightly more complex example, involving multiple states, is the ASF tokenizer. ASF files can contain comments, which requires at least one additional state. The implementation below also uses a separate state to represent a “keyword”, even though that is not strictly necessary.
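
A sketch of this tokenizer (again with illustrative state names) might look like this:

    #include <cctype>
    #include <istream>

    class AsfTokenizer : public Tokenizer {
      public:
        AsfTokenizer(std::istream& in) : Tokenizer(in),
            m_start(this, [this](char c) {
                if(c == '#')
                    // a comment - switch states without consuming the
                    // '#', letting the comment state reject it
                    m_comment.setActive();
                else if(c == ':') {
                    // a colon starts a keyword token
                    emit();
                    accept();
                    m_keyword.setActive();
                }
                else if(std::isspace(static_cast<unsigned char>(c))) {
                    emit();
                    reject();
                }
                else
                    accept();
            }),
            m_comment(this, [this](char c) {
                // skip everything until the end of the line
                reject();
                if(c == '\n')
                    m_start.setActive();
            }),
            m_keyword(this, [this](char c) {
                // a keyword runs until the first whitespace character
                if(std::isspace(static_cast<unsigned char>(c))) {
                    emit();
                    reject();
                    m_start.setActive();
                }
                else
                    accept();
            })
        {
            m_start.setActive();
        }

      private:
        State m_start, m_comment, m_keyword;
    };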

Apart from the handling of brackets, this code maps quite nicely onto a state machine performing the same task.

Convenience methods

In practice, parsers often need to check a particular sequence of tokens, test them either for value or for correct type, and throw an exception if something goes wrong. With the interface described above, this is entirely possible, but by implementing two simple operators, it can be made significantly simpler:
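
A sketch of these two operators (plus a small const char* forwarding overload, so that string literals pick the comparison version; the exception messages are illustrative):

    #include <sstream>
    #include <stdexcept>
    #include <string>

    // reads the next token and checks it against an expected value,
    // throwing an exception on a mismatch
    Tokenizer& operator>>(Tokenizer& tokenizer, const std::string& expected) {
        if(tokenizer.next() != expected)
            throw std::runtime_error("Expected '" + expected + "', found '" +
                                     tokenizer.current().value + "'.");
        return tokenizer;
    }

    Tokenizer& operator>>(Tokenizer& tokenizer, const char* expected) {
        return tokenizer >> std::string(expected);
    }

    // reads the next token and converts it to a typed value, throwing
    // an exception if the conversion fails
    template <typename T>
    Tokenizer& operator>>(Tokenizer& tokenizer, T& value) {
        std::istringstream ss(tokenizer.next().value);
        ss >> value;
        if(ss.fail())
            throw std::runtime_error("Cannot convert '" +
                                     tokenizer.current().value + "'.");
        return tokenizer;
    }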

Using this code, reading a formatted three-element vector such as vect: [0, 1, 0] would become a statement like this (assuming the tokenizer splits brackets and commas into separate tokens):
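
    std::string name;
    float x, y, z;

    // reads the vector name, checks the brackets and commas, and
    // fills in the three float values
    tokenizer >> name >> "[" >> x >> "," >> y >> "," >> z >> "]";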

This will read the vector name and its values, test for the expected file structure, and throw an exception if anything doesn’t fit.

Implementation

While the implementation of the above is relatively straightforward, there are a few “gotchas” that need to be addressed. To name a few:

  • sometimes, it is necessary to emit multiple tokens at once – reading a bracket just after a keyword requires emitting both the keyword token and the bracket as a separate token; a simple token queue solves this
  • in some cases, skipping characters until a particular one is seen makes processing a token significantly simpler – this could be modelled as separate states, but a simpler way is to allow accepting multiple characters at once; hash-bangs and C-style comments are a good example
  • to implement tricks with epsilon transitions, it has to be possible to process a single character multiple times, until a state explicitly accepts or rejects it – this is easier to achieve using istream::peek() instead of istream::get() (see the sketch below)
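
To illustrate, here is a sketch of a next() implementation addressing all three points (m_input, m_tokens and m_activeState are assumed member variables, with m_tokens, a std::deque<Token>, acting as the token queue):

    #include <deque>
    #include <istream>

    const Token& Tokenizer::next() {
        // drop the token returned by the previous call
        if(!m_tokens.empty())
            m_tokens.pop_front();

        // feed characters to the active state until at least one token
        // is emitted; peek() leaves the character in the stream, so a
        // state that neither accepts nor rejects it will see the same
        // character again (accept() and reject() call get() internally)
        while(m_tokens.empty() && m_input.peek() != std::char_traits<char>::eof())
            m_activeState->m_fn(static_cast<char>(m_input.peek()));

        // flush any token still in progress at the end of the input
        if(m_tokens.empty())
            emit();

        return current();
    }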

A full implementation with all these details addressed can be found in the anim plugin for possumwood, together with several more examples of both tokenizers and parsers using them (bvh, amc, asf, x). Have a look, use it, and let me know if you can think of any improvements!