ANTLR3 – Ready for prime time?
Over the last couple of weeks I’ve been playing fairly intensively with Antlr 3.1. Antlr 3 is the latest upgrade to Terence Parr’s Antlr parsing tool and was released in May last year. However, ‘upgrade’ is the wrong word – it’s a total rewrite and well and truly breaks backwards compatibility. Normally, I have a real downer on things that break backwards compatibility, but in this case I think it’s justified. Still, the question is: 9 months after its initial launch, is Antlr 3 ready for heavy production use?
Actually, I started playing with Antlr 3.0 in August last year. I got quite a long way in converting my existing Antlr 2.7 programs to Antlr3 before coming to the conclusion that there were just too many bugs, difficulties and missing features in Antlr3 to proceed much further. I decided to wait until the next release, 3.1. So, over Christmas, I downloaded an early release of Antlr 3.1, built it and started serious work.
One of the things that anyone upgrading to Antlr3 realises pretty soon is that it’s not an afternoon’s work. It’s more of a complete rewrite. It’s not so much that the syntax of Antlr3 is different from Antlr2 – it isn’t really, although it is much tidier and clearer, and a big improvement to boot – it’s more that Antlr3 behaves in a fundamentally different way to Antlr2. Sure, both are LL parsers, but Antlr3 introduces what Dr. Parr calls ‘LL(*)’ – variable lookahead, in other words.
I was under the impression that you just turned on LL(*) and off you went. All your parsing problems would be solved with a magic *. Big mistake! It’s more like the conversion from steam to the internal combustion engine: you have to make a few conceptual adjustments. It simply takes time to become familiar with the quirks and foibles of LL(*) and the underlying DFAs (Deterministic Finite Automata).
I won’t go into all the trouble I had converting the Ruby, ERb and HTML parsers from Antlr2 to Antlr3, but I will say that it was far from trivial. But having done it (mostly), the results are simply amazing. With Antlr2, there were constructs that I could not parse in a reasonable time or in a sensible manner. The original Ruby language parser is written for an LALR parser generator (yacc), and that just does not translate into an efficient LL(2) grammar. I had numerous workarounds, and some things I couldn’t parse at all easily – expressions on the left hand side of an assignment and multiple assignment statements are just two examples.
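To make the multiple assignment problem a little more concrete, here is a minimal sketch of the kind of decision involved. It is a hand-written toy, not Antlr's generated code and not taken from my grammars: the method name predict_statement, the token format (an array of strings) and the IDENT pattern are all assumptions made up for the illustration. The point is only that the number of tokens you have to look at before you can choose an alternative is not fixed, which is why two tokens of lookahead are not enough, while a small looping state machine copes with any length.

```ruby
# Toy model only: tokens are plain strings, and IDENT is an assumed,
# simplified pattern for a Ruby local variable name.
IDENT = /\A[a-z_]\w*\z/

# Scan ahead over `ident (',' ident)*` and decide from what follows.
# This is a two-state loop, i.e. a tiny cyclic DFA, so the lookahead
# depth grows with the input instead of being capped at k tokens.
def predict_statement(tokens)
  i = 0
  state = :want_ident
  loop do
    tok = tokens[i]
    case state
    when :want_ident
      # Anything other than an identifier ends the list early.
      return :expression unless tok && tok.match?(IDENT)
      state = :after_ident
    when :after_ident
      return :assignment if tok == "="    # `a, b, c = ...`
      return :expression unless tok == "," # end of input or other token
      state = :want_ident
    end
    i += 1
  end
end

p predict_statement(%w[a , b , c = 1 , 2 , 3])  # => :assignment
p predict_statement(%w[a , b , c])              # => :expression
```

After the first two tokens (`a` and `,`) the two inputs are indistinguishable; the decision only becomes possible at the sixth token. A real grammar has far more cases than this, so treat it as a picture of the idea rather than a recipe.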
Another problem I had was a construct like this:
a b c d e f g h i j
This reads (in English) as ‘a is a function call with b as its argument. b is also a function call with c as its argument. And so on’. Believe it or not, this is valid Ruby, from a syntax viewpoint at any rate. The problem was (and is) that beyond about 7 function calls, the time taken to parse it becomes unacceptable – the parser has exponential rather than linear performance. Now, in case you are wondering what sort of idiot would write a piece of code like that, consider this:
starwars = 'once upon a time in a galaxy far far away'
A trivial parse! Until you accidentally leave out the first quote. The Antlr2 parser will take a long, long time to tell you that simple fact. With Antlr2 I never found a way round that. But with Antlr3, I’ve solved it.
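For anyone wondering where exponential behaviour can come from, here is a toy model. It is an assumption about the general mechanism, not a description of my Antlr2 grammar or of how I fixed it in Antlr3: the class ToyParser and its command_call method are invented for the illustration. The parser first speculates that an argument follows (much as a syntactic predicate would) and then parses the same argument again for real, so the work doubles at every level of nesting. Remembering the result for each position makes it linear again.

```ruby
# Toy model only: a command call is an identifier followed by an
# optional argument that is itself a command call.
class ToyParser
  attr_reader :calls

  def initialize(tokens, memoize: false)
    @tokens = tokens
    @memo = memoize ? {} : nil  # position => result, when enabled
    @calls = 0                  # counts real (non-cached) rule invocations
  end

  # Returns the position just past the call starting at pos, or nil.
  def command_call(pos)
    return @memo[pos] if @memo && @memo.key?(pos)
    @calls += 1
    result = nil
    if pos < @tokens.length
      nxt = pos + 1
      if nxt < @tokens.length && command_call(nxt)
        # Speculation succeeded, so the argument is parsed a second time.
        result = command_call(nxt)
      else
        result = nxt
      end
    end
    @memo[pos] = result if @memo
    result
  end
end

tokens = %w[a b c d e f g h i j]

slow = ToyParser.new(tokens)
slow.command_call(0)
puts slow.calls  # => 1023

fast = ToyParser.new(tokens, memoize: true)
fast.command_call(0)
puts fast.calls  # => 10
```

Ten tokens cost 1023 rule invocations without the cache and 10 with it; every extra token doubles the first figure. The unquoted right-hand side above is likewise a run of ten bare words, which is why one missing quote can be so expensive.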
My resulting LL(*) grammars are far cleaner and simpler to follow than the original LL(2) ones. The parse time seems a bit longer, though I haven’t done any real timing tests yet (and 3.1 is still beta). In any case, the time to parse is usually irrelevant. A C++ yacc parser will be a lot faster than Antlr, but who will notice a few milliseconds on a Core Duo?
The reason I’m looking at Antlr3 is that I want to rewrite the IntelliSense system to properly incorporate and improve on all the things I’ve learnt in doing the original version(s). The end result will be a much more powerful system with greater inference capabilities and the ability to do accurate Ruby refactoring. I have decided against incorporating refactoring into the current system since I can’t guarantee that it will be bulletproof – and for me (personally at any rate) it’s very important to be 100% confident that the refactoring is right. A 90% refactoring is just not worthwhile – you might as well use a regular expression and do the rest by hand.
Now the question is – is Antlr3 up to the job? I’m not 100% sure yet, but I’m ‘quietly confident’ as they say. And with the addition of a real ‘tree grammar’ system in Antlr3, the work in actually implementing the new IntelliSense code should be a lot easier.
Finally, Antlr 3 comes with a pretty neat development environment called AntlrWorks. Here’s a picture:
To be honest, it hasn’t helped me that much – it doesn’t do any thinking for you. But it really is nice to stand back at the end of the day and admire your work in it!