Tuesday, August 27, 2013

Closing the source

Today is the second birthday of the preprocessor project. I'm still a bit conflicted, but the best choice for now seems to be to proceed with it on a closed source basis.

A closed-source standalone preprocessor is fairly ludicrous, but for most of the project lifetime the focus has been on the underlying framework. (This is the part which would be of any interest to anyone.) The problem is that it's very promising, and a lot of effort. If I'm working very hard on something which should become very valuable, it's not really useful to work even harder to properly document and support it so others can realize that value for free.

Hopefully this can be turned back around at some point. The framework now consists of several independent modules:

stage.h — Links templates into a modular data-processing pipeline.
construct.h — Provides a uniform interface for mapping between source and target code.
parameter.h — Represents metadata regarding format and desired semantics, i.e. compiler options.

The last one has taken almost a month to get fully working, because those options have to propagate to the stages that need them, and stages can be combined and instantiated in various ways. Applied uniformly, it should decouple the implementation of the stages from the output types the user desires. The whole compiler may be fully templated, like a generic algorithm, without needing explicit template arguments to be passed in or global typedefs.

A month spent on purely abstract templates that don't "do" anything is crazy-making. Not starting with a comprehensive theory, I tried to shoot for just getting the right effects. It fell short semantically, resulting in infinite template recursion when applied to a (non-infinitely) recursive stage. So I ended up rewriting almost all of it.

Still to do are more modules: for memory allocation (previously drafted as a string class), small-scale decisions, and serialization. These should be more instrumental in the creation of a faster, more flexible compiler.

The stage and construct libraries are just a couple hundred lines each, and should be instructive and useful to others by themselves. As for the parameter library, I'm not sure if it's too advanced and special-purpose to serve the open-source community. The future libraries look like they will sit squarely in trade-secret territory.

Hopefully there will be an update here before long, with some kind of online demo where the preprocessor colors your code interactively. I need to focus on functionality rather than build a community. But I do support open source, and open-source updates will return.

Saturday, July 13, 2013

Duplex printing from OS X to a dessicated HP zombie

My humble DeskJet all-in-one was left in the Arizona desert for an entire year, and emerged slightly melted and dry but no worse for wear. Being obsessed with making marginal machines function, I've learned a few things in the last few days about printer operation.


1. Restoring ink cartridges

HP sells cheap printers and expensive cartridges, so it's nice not to throw away good ink. But what if it's been sitting for a long time, at high temperature?

According to WikiHow, immersion in hot water can restore ink flow. But once it starts flowing, you lose a lot. Plugging the business end of the cartridge into a steam kettle is more efficient. (Maybe not in terms of energy, though.) For best results, balance the cartridge in a roughly horizontal position. Check for leaking ink and remove when ready. I found that color ink likes to be steamed for a few minutes, perhaps to melt solidified contents. Black on the other hand quickly gets flowing, and should not be left on the kettle.

In my case, this process may be repeated daily or before each use. Mercifully the ink will run out relatively fast, and then you can go for a refill or replacement.


2. Two-sided printing in Mac OS X

Resurrecting sturdy hardware is easy, but printing on both sides is incredibly tricky on Mac OS X. The printer manual gave some general tips, which seemed to assume the software had a built-in process for double-sided printing. Not on the Mac; there's a two-sided printing option but it's greyed out. The relevant options in the print dialog are:

Layout → Reverse Page Orientation

Checking this box turns the page upside down in the printer output. For an inkjet printer, the default is already upside down, so this turns it right side up from your perspective. (I guess it's reversed from the printer's perspective. The bottom of the page comes out first.)

Paper Handling → Page Order

The options are Automatic, Normal, and Reverse. Normal causes the lowest-numbered page to emerge first from the printer, leading to a back-to-front document. Reverse means you get to turn the pages forward. Automatic equals reverse as far as I can tell. I cannot fathom what information the printer driver receives that would helpfully inform the automatic decision-making.

Paper Handling → Pages to Print

Options are Odd-only and Even-only. They managed not to make this one confusing.


The task at hand.

Once the process is finished, we should have a stack of papers in the output tray, with Page 1 right-side up and on top. Is that too much to ask? Maybe. Anyway, here is the algorithm.
  1. Print even pages, normal order, non-reversed orientation. The even pages come out backwards in every way, which is nice because all you did was change “automatic” to “normal.”
  2. If there are an odd number of pages in the document, add a blank sheet on top of the output. This will be the last page of the finished product.
  3. Move the half-printed stack from the output tray to the top of the input tray.
  4. Print odd pages, reverse order (or “automatic” if you like to tempt fate), reversed orientation.
  5. Hope for the best.
If an upside-down stack is acceptable, the process is slightly simpler.
  1. Print odd pages, reverse (i.e. correct) order, reversed orientation.
  2. Move the half-printed stack back to the input tray.
  3. Print even pages, normal (i.e. back-to-front) order, non-reversed orientation.
  4. If the document has an odd number of pages, retrieve the last page from the input tray.
I hope this saves someone out there a bunch of time, aggravation, and paper.

Saturday, November 19, 2011

Preprocessor update

I've just checked-in a couple months' worth of progress on the preprocessor project. Here are some bullet points for you.
  • Clean separation of the "framework" from the processing stages, including generalization of the tricky business of configuring compiler flags
  • Pragmas for setting header search paths, trigraphs, whitespace preservation (including emitting #line directives), and including headers "once"
  • Very nice error messages, tracing macro expansions like Clang does, but better
  • Header guard optimization and a fast string class, now only ~2.5x slower than GCC
  • Numerous small fixes and tweaks
Most of the work lately has been to polish the framework. A lot remains to be done, but what's there is looking pretty good!

The source is still available at Google Code.

Thursday, September 8, 2011

A C++11 preprocessor in 2000 lines of code

Here's something different…

GCC has been gaining C++11 (formerly C++0x) features for some time, and support is now mostly complete. There are still several things missing, such as user-defined literals and raw strings. These depend on the preprocessor as well as the core of the compiler. So, I decided to take the new language for a spin and implement a C++11 compliant preprocessor with such features. I call it Cplus, "because merely making the grade, following the standard, still isn't easy."

See the code here.

It took a couple weeks and almost 2000 lines of code to implement what the Standard describes as phases 1-4, from reading the characters through executing macros. Constant-expression evaluation for #if directives is missing, because that should be handled by an instance object of a later compiler phase… doing that already would be getting ahead of myself. Also, error messages aren't tagged with the filename and need more work. And it requires testing, lots more testing. But still, raw string literals, the _Pragma operator, better universal-character-name (UCN) support than anything else out there, all the required directives, and far-as-I-can-tell perfectly compliant macro substitution are all there.

The program either outputs tokens ready for the "real" compilation or exits by one of 60 distinct error messages. I like to think of the ratio of error messages to lines of code as a measure of parser quality. This gets a score of 3%, which is half as good as the lambda calculus parser (which I promise I'll share), not bad considering it's a far more complicated process.

What sets it apart? For one thing, it properly stringizes UCNs. So whereas GCC doesn't support a variable named niño at all, Cplus will stringize it to "niño", and stringizing that will yield "\"ni\\u00F1o\"", which may be unintuitive but is demanded by the Standard. For another, it supports raw strings, like R"""(xyz)""". Catenating R onto "(x)" is valid, but onto "x" generates an error. Alright, these aren't that exciting, but standard-compliance is the name of the game.

Two C++11 features had a huge impact on its architecture, perfect forwarding and std::move. For example, here is the shortest stage of its processing pipeline:

template< typename output_iterator >
class macro_filter {
    output_iterator cont; // in-place instantiation of all succeeding stages
    template< typename ... args >
// pass whatever initialization those stages need
    macro_filter( args && ... a )
        : cont( std::forward< args >( a ) ... ) {}
    void operator() ( pp_token &&in ) // Receive and send data without churning malloc().
        { if ( ! in.s.empty() ) * cont ++ = std::move( in ); } // Filter out placemarkers and recursion stops.
    friend void finalize( macro_filter &o ) { finalize( o.cont ); }

That's fairly low overhead for a dedicated compiler stage. And the whole thing is put together like that, as interconnected functors adapted to look like output iterators. And of course, there are scoped enumerations, lambdas, constexpr functions to walk the tables that describe what Unicode characters are allowed where in identifiers, and other fun.

The macro engine is particularly elegant. At 450 lines including validation of macro definitions, it's a tiny fraction of the size of Boost.Wave and others. I haven't tested performance, but it should be fairly fast… there are only three std::vector objects constructed (read: malloc calls) upon entering each macro invokation with at least one argument, and no others. Anyway, being written at a high level, it should be very possible to optimize, without disturbing it too much.

Well, that's my first fully-C++11 project. Hopefully something good comes of it :v) .

Tuesday, August 16, 2011

Introducing PULC

Hi, everybody! Welcome to my new blog. It's sure to be chock full of random ramblings, long gaps without updates, and of course, that most popular of esoteric languages, C++.

Let's start by kicking off the Palatable Untyped Lambda Calculus project, or PULC, pronounced like "pulse." A while ago I was inspired by John Tromp's Binary Lambda Calculus encoding of the untyped lambda calculus. The interesting property of BLC is its extremely compact, 210-bit universal machine. That's 26¼ bytes, an entire language interpreter within two Hello, world!s. This is due to its compactness and simplicity as a language. There is little to express, and it is very expressive. But what about more sophisticated programs? Does the expressiveness remain when the ideas start piling up? The only way to find out is to try and see.

Sophisticated programs written in untyped lambda calculus are few, however. Every language is bound together with its idioms and usage, and lambda calculus shows its heritage as a mathematical formalism, not a programming language. It is typically used with a "unary" number system, akin to counting on your fingers. In common terms, the time and storage required to add two numbers in this system grow exponentially with the numbers' lengths. This is completely impractical. Lambda calculus is usually used to annotate a mathematical "narrative" in natural language, or in a figurative sense for anonymous functions in a computer program (hence "lambdas" in Python and C++). It is used to write phrases, not stories.

Is there a good reason not to build larger programs in untyped lambda calculus? Poor idioms can be left behind. Tromp suggests a couple binary natural-number encodings in his paper, to solve the unary issue. It is also easy to build definitions upon definitions, to avoid embedding everything within another language. One obvious problem is that it is untyped, hence missing a cornerstone of programming practice. Type verification can be moved outside the language proper, though. It can be treated as part of software testing, manipulated by annotation and interactive tools. Well, I'm beginning to speculate. But it's certainly possible to get far enough in an untyped language to be able to write a typechecker in it. And such a program might expose the high-level qualities of the language.

So I see a challenge, and one which is true to the original spirit of lambda calculus. It is the foundation of all functional programming, but usually it is buried under a load of syntactic sugar. Modifying language syntax is the easy way out, but most problems can be solved with a good idiom instead. What is the most efficient way to represent the essential concepts? When the best idioms, tools, and representation formats are used, how efficient and elegant can this language be? How far do you get with only the most fundamental formalization of the notion of abstraction?

To meet this challenge and answer these questions is the object of PULC. At the least, it will be educational.