Fahhem's Blog – Wikipedia "templates" Part 1

Wikipedia "templates" Part 1

Wikipedia templates seem to have grown organically, and that's the nicest way to describe it. They look simple at first glance: just a bunch of squiggly brackets to denote everything. How bad can it be?

In this multi-part series, I'll be writing about the complicated bits of the Wikipedia template syntax. I'll start off by explaining what templates look like, with all their variations included, and then combine them to show how day-to-day articles are really torturous bundles of thousands of squiggly and square brackets, lumped together with pipes, colons, and pound signs. Parsing it all feels like a drive down a pothole-laden street with bicyclists and pedestrians making sudden appearances and hour-long street lights, followed by miles of empty road with a stop sign every hundred feet for good measure.

At the end, I'll try to provide an AST for parsing Wikipedia articles into a sane structure to render from. Looking at many of the Wikipedia parsing engines, I've noticed they're mainly built to parse and understand only some of the syntax, and that most are rather haphazardly built to accept the ever-evolving Wikipedia template syntax. To fight this code-organization problem, one practice is to parse the source syntax into an AST (abstract syntax tree) as an intermediary, in-memory format, and then to convert that AST into the format you expect, whether it's HTML for a website, a PDF for a printable version, or LaTeX because you're a grad student with an insane professor. There is a performance loss due to the multiple passes required, but after this series you'll see that's not really a concern with Wikipedia template parsing. The gain in programmer productivity is rather large: you can sustainably add support for new syntax to the parser and to the renderers without needing to change them in tandem, so multiple programmers can work on them independently, and each new piece of syntax doesn't add O(N^2) complexity.
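To make the parse-then-render idea concrete, here is a minimal sketch in Python. It is not the AST this series will end up with: the node types (`Text`, `Template`), the function names, and the output formats are all hypothetical, and the parser assumes flat, non-nested `{{name|arg|arg}}` templates, which real articles certainly don't limit themselves to. The point is only that both renderers consume the same tree and know nothing about the source syntax.

```python
import html
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    """A run of plain text between templates (hypothetical node type)."""
    value: str


@dataclass
class Template:
    """A flat {{name|arg1|arg2}} invocation (hypothetical node type)."""
    name: str
    args: List[str] = field(default_factory=list)


Node = Union[Text, Template]

# Assumption: templates are not nested, so the body contains no braces.
_TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def parse(source: str) -> List[Node]:
    """Pass 1: turn the source text into a list of AST nodes."""
    nodes: List[Node] = []
    pos = 0
    for match in _TEMPLATE_RE.finditer(source):
        if match.start() > pos:
            nodes.append(Text(source[pos:match.start()]))
        name, *args = [part.strip() for part in match.group(1).split("|")]
        nodes.append(Template(name, args))
        pos = match.end()
    if pos < len(source):
        nodes.append(Text(source[pos:]))
    return nodes


def render_html(nodes: List[Node]) -> str:
    """Pass 2a: render the AST as HTML, without touching the source syntax."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(html.escape(node.value))
        else:
            args = html.escape(", ".join(node.args))
            name = html.escape(node.name)
            out.append(f'<span class="template" data-name="{name}">{args}</span>')
    return "".join(out)


def render_plain(nodes: List[Node]) -> str:
    """Pass 2b: a second renderer over the same AST, written independently."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(node.value)
        else:
            out.append("[" + node.name + ": " + ", ".join(node.args) + "]")
    return "".join(out)


tree = parse("Born {{birth date|1879|3|14}} in Ulm")
print(render_html(tree))
print(render_plain(tree))
```

Adding a new output format here means writing one more function over the node types, and adding new syntax means one more node type plus a case in each renderer. Neither change requires touching the regular expression and the HTML strings in the same place.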

Posted on: Mon 25 February 2013

Category: Wikipedia