[ProgSoc] on regularity was: Context-affinitive parsing w/ context-free grammar

Roland Turner raz at raz.cx
Fri Sep 25 12:15:58 EST 2009


Anand Kumria wrote:

> On Thu, Sep 24, 2009 at 8:08 AM, Roland Turner <raz at raz.cx 
> <mailto:raz at raz.cx>> wrote:
> 
>     I suspect that the answer to your question is fortran.
> 
> 
> Or jumanji. That is pretty much always a valid answer for most questions.

Indeed, parsing jumanji is probably even more difficult than parsing 
fortran.

>     Chomksy gives us three interesting groups of languages to think about:
> 
>     - regular,
>     - context-free and
>     - context-sensitive.
> 
>     Regular languages can be described using regular expressions and
>     recognised using a finite-state automaton. This provides for blistering
>     speed (think: one memory-access-cycle per byte of input), fixed response
>     time and particularly compact description. Note that perl expressions
>     aren't regular (in particular, they include back-reference) despite
>     perl's use of the term.
> 
> 
> Hmm, I was not aware that a regex syntax that included back references 
> is not a regular language. Is this because looking for the matches 
> requires a stack?

I suspect that the answer is actually "because that's how Chomsky 
defined it", but he wasn't being arbitrary: this class of languages is 
interesting specifically because it can be recognised without a stack 
and, yes, back-references require a stack to recognise.

> Actually, would it necessarily -- if you define a unit of input (e.g. a 
> line terminated with either CR, LF or some random combination of the 
> two)  -- then you can perform the matching without a stack.

Perl expressions are a supserset of regular expressions. So, yes, you 
can use perl's expression syntax to describe an expression which is also 
regular (the test is easy: if the perl expression doesn't contain 
back-reference then it's regular, unless perl has grown some new 
extensions that I'm unaware of). In concrete terms:

     [^\r\n](\r[^\r]|\n|\r\n)

is a perl expression and a regular expression which, more or less, 
matches what you described while this:

     (a+)b+\1

is a perl expression but not a regular expression and can't be matched 
without a stack or similar (because all of those "a"s - or at least a 
count of how many there are - have to go somewhere).

So, when I said that perl expressions aren't regular I didn't mean to 
imply that the set of languages that can be expressed by regular 
expressions and the set of languages that can be expressed by perl 
expressions were disjoint (no overlap), just that the perl set contains 
members that cannot be expressed as regular expressions (it's a "there 
exists" vs. "for all" distinction; some, but not all, perl expressions 
are not regular) and that, therefore, a complete implementation of perl 
expressions without a stack or similar is not possible.

- Raz



More information about the Progsoc mailing list