[ProgSoc] Context-affinitive parsing w/ context-free grammar

Roland Turner raz at raz.cx
Thu Sep 24 17:08:20 EST 2009


John Elliot wrote:

> Roland Turner wrote:

>> Why not simply specify "this is a reserved word; that means it's 
>> reserved so you can't use it as an identifier"?
> 
> That's what I've done. However, there are two answers to your question. 
> The first is that I'm curious as to what might be involved in supporting 
> such a scenario (maybe there's an easy way that I'm ignorant of?);

I suspect that the answer to your question is fortran.

Chomksy gives us three interesting groups of languages to think about:

- regular,
- context-free and
- context-sensitive.

Regular languages can be described using regular expressions and 
recognised using a finite-state automaton. This provides for blistering 
speed (think: one memory-access-cycle per byte of input), fixed response 
time and particularly compact description. Note that perl expressions 
aren't regular (in particular, they include back-reference) despite 
perl's use of the term.

Context-free languages can be described using productions and recognised 
using a push-down automaton. The need for back-reference (where is the 
'{' that this '}' refers to?) introduces the stack which brings with it 
worse performance and more variable performance. It does keep the 
description of the parser's rules exceptionally simple. Note that most 
programming languages aren't strictly context-free - they include some 
sort of identifier management - however this is pretty well trodden 
ground. The handling of identifiers is very similar across all widely 
used languages, so little per-project complexity is introduced.

Context-sensitive languages essentially require the full capability of a 
Turing machine to describe and to recognise. This introduces a great 
deal of complexity into the description, into the semantics of the 
language - often in non-obvious ways (again, think fortran) - and into 
runtime behaviour of the result.

In effect, a context-free grammar for a language with identifiers 
actually describes a context-sensitive language, but one that we can 
pretend most of the time is context free. The moment you "cross the 
streams" and take away the ability of the parser to know which camp a 
token belongs to (context-free parser or identifier dictionary), the 
entire edifice collapses: you're left with a fortran-style 
context-sensitive grammar with magnificent, but counter-productive, 
ambiguities in it.

> and 
> the second is that the use of reserved words as identifiers is required 
> in my language,

That doesn't answer the question. Why is this required in your language 
(and in particular, given that almost all widely-used languages live 
without it, this seems likely to be a rather spurious "requirement")?

- Raz



More information about the Progsoc mailing list