Thursday, September 8, 2011

A C++11 preprocessor in 2000 lines of code

Here's something different…

GCC has been gaining C++11 (formerly C++0x) features for some time, and support is now mostly complete. There are still several things missing, such as user-defined literals and raw strings. These depend on the preprocessor as well as the core of the compiler. So, I decided to take the new language for a spin and implement a C++11 compliant preprocessor with such features. I call it Cplus, "because merely making the grade, following the standard, still isn't easy."

See the code here.

It took a couple of weeks and almost 2000 lines of code to implement what the Standard describes as phases 1-4, from reading the characters through executing macros. Constant-expression evaluation for #if directives is missing, because that should be handled by an instance of a later compiler phase… doing that already would be getting ahead of myself. Also, error messages aren't tagged with the filename and need more work. And it requires testing, lots more testing. But still, raw string literals, the _Pragma operator, better universal-character-name (UCN) support than anything else out there, all the required directives, and far-as-I-can-tell perfectly compliant macro substitution are all there.

The program either outputs tokens ready for the "real" compilation or exits with one of 60 distinct error messages. I like to think of the ratio of error messages to lines of code as a measure of parser quality. At 60 messages in almost 2000 lines, this gets a score of 3%, which is half as good as the lambda calculus parser (which I promise I'll share). Not bad, considering it's a far more complicated process.

What sets it apart? For one thing, it properly stringizes UCNs. So whereas GCC doesn't support a variable named niño at all, Cplus will stringize it to "niño", and stringizing that will yield "\"ni\\u00F1o\"", which may be unintuitive but is demanded by the Standard. For another, it supports raw strings, like R"""(xyz)""". Catenating R onto "(x)" is valid, but onto "x" generates an error. Alright, these aren't that exciting, but standard-compliance is the name of the game.
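To make the stringizing rule concrete, here is a small sketch that any C++11 compiler should accept. It sticks to ASCII, since UCN handling is exactly where implementations differ. The macro names STR and XSTR and the three helper functions are made up for this illustration and are not part of Cplus.

```cpp
#include <string>

#define STR( x ) #x         // stringize the argument exactly as spelled
#define XSTR( x ) STR( x )  // macro-expand the argument first, then stringize

// Stringizing once wraps the spelling in quotes.
std::string stringized_once() { return STR( abc ); }

// Stringizing the result escapes the quotes the first pass added, which is
// the same rule that turns "niño" into "\"ni\\u00F1o\"" in the text above.
std::string stringized_twice() { return XSTR( STR( abc ) ); }

// A raw string with the (arbitrary) delimiter x: quotes and backslashes
// between the parentheses are taken literally.
std::string raw_literal() { return R"x(a "quoted" \n)x"; }
```

The extra XSTR level is needed because an operand of # is not macro-expanded; passing STR( abc ) straight to STR would stringize the text of the invocation itself rather than its result.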

Two C++11 features had a huge impact on its architecture: perfect forwarding and std::move. For example, here is the shortest stage of its processing pipeline:

template< typename output_iterator >
class macro_filter {
    output_iterator cont; // in-place instantiation of all succeeding stages
public:
    template< typename ... args > // pass whatever initialization those stages need
    macro_filter( args && ... a )
        : cont( std::forward< args >( a ) ... ) {}
    void operator() ( pp_token &&in ) // Receive and send data without churning malloc().
        { if ( ! in.s.empty() ) * cont ++ = std::move( in ); } // Filter out placemarkers and recursion stops.
    friend void finalize( macro_filter &o ) { finalize( o.cont ); }
};

That's fairly low overhead for a dedicated compiler stage. And the whole thing is put together like that, as interconnected functors adapted to look like output iterators. And of course, there are scoped enumerations, lambdas, constexpr functions to walk the tables that describe what Unicode characters are allowed where in identifiers, and other fun.
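For anyone who wants to play with the idea outside the preprocessor, here is a minimal, self-contained sketch of the same pattern. Everything in it (token, collector, blank_whitespace, drop_empty, run_pipeline) is invented for illustration and is not Cplus code. It also calls the next stage directly instead of going through the output-iterator interface, just to keep it short.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for pp_token: nothing but a spelling.
struct token { std::string s; };

// Last stage: appends whatever arrives to a vector owned by the caller.
class collector {
    std::vector< token > *out;
public:
    explicit collector( std::vector< token > &v ) : out( &v ) {}
    void operator() ( token &&in ) { out->push_back( std::move( in ) ); }
};

// Turns whitespace-only tokens into empty ones, then passes everything on.
template< typename next_stage >
class blank_whitespace {
    next_stage next; // the rest of the pipeline is a member, not a pointer
public:
    template< typename ... args > // forward constructor arguments down the chain
    explicit blank_whitespace( args && ... a )
        : next( std::forward< args >( a ) ... ) {}
    void operator() ( token &&in ) {
        if ( in.s.find_first_not_of( " \t" ) == std::string::npos ) in.s.clear();
        next( std::move( in ) );
    }
};

// Drops empty tokens and moves the rest along without copying the string.
template< typename next_stage >
class drop_empty {
    next_stage next;
public:
    template< typename ... args >
    explicit drop_empty( args && ... a )
        : next( std::forward< args >( a ) ... ) {}
    void operator() ( token &&in )
        { if ( ! in.s.empty() ) next( std::move( in ) ); }
};

// Builds the three-stage pipeline as one object and feeds it the spellings.
std::vector< std::string > run_pipeline( std::vector< std::string > spellings ) {
    std::vector< token > sink;
    blank_whitespace< drop_empty< collector > > pipeline( sink );
    for ( std::string &s : spellings ) pipeline( token{ std::move( s ) } );

    std::vector< std::string > result;
    for ( token &t : sink ) result.push_back( std::move( t.s ) );
    return result;
}
```

Feeding it the spellings "int", " ", "x", "" and ";" should leave just "int", "x" and ";" in the result. The single constructor argument (the sink) travels through two forwarding constructors before it reaches collector, and each token's string buffer is moved, never copied, on its way down.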

The macro engine is particularly elegant. At 450 lines including validation of macro definitions, it's a tiny fraction of the size of Boost.Wave and others. I haven't tested performance, but it should be fairly fast… there are only three std::vector objects constructed (read: malloc calls) upon entering each macro invocation with at least one argument, and no others. Anyway, being written at a high level, it should be quite possible to optimize without disturbing it too much.

Well, that's my first fully-C++11 project. Hopefully something good comes of it :v) .
