Re: RegExps

Daniel Mahler (mahler@nospam.socs.uts.edu.au)
Mon, 1 Jul 1996 18:53:44 +1000 (EST)

Dennis is right about the HTML grammar not being regular,
but the examples he gives are red herrings.
Separation of proper HTML text, comment text
and quoted text is a regular problem.
The real reason HTML is not regular is that
some elements can be nested.
One must then keep track of the nesting levels
to work out where a given element stops.
Jimmy's particular problem,
finding the value of the alt attribute of an image element,
can be solved correctly with regexps,
because, as far as I know, image elements cannot contain
any nested elements.
Basically the problem only requires tokenising HTML,
rather than actually parsing it.
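
To make the tokenising point concrete, here is a rough sketch in Python
(my choice of language, not anything from the thread). The names TOKEN,
ATTR and img_alts are made up for illustration. It assumes comments are
the simple <!-- ... --> form and that attribute values are either quoted
or free of whitespace; it is not a validating HTML lexer.

```python
import re

# One alternative per token kind. Comments are tried first so that
# markup inside a comment is never mistaken for a real tag.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)
  | (?P<tag><(?:[^>"']|"[^"]*"|'[^']*')*>)
  | (?P<text>[^<]+)
    """,
    re.VERBOSE | re.DOTALL,
)

# An attribute name, optionally followed by a double-quoted,
# single-quoted, or unquoted value.
ATTR = re.compile(
    r"""([^\s=/>"']+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?"""
)


def img_alts(html):
    """Return the alt values of all image elements, in document order."""
    alts = []
    for token in TOKEN.finditer(html):
        tag = token.group("tag")
        if not tag or not re.match(r"<img\b", tag, re.IGNORECASE):
            continue
        # Strip the leading "<img" and the trailing ">" and scan attributes.
        for attr in ATTR.finditer(tag[4:-1]):
            if attr.group(1).lower() == "alt":
                values = [g for g in attr.groups()[1:] if g is not None]
                alts.append(values[0] if values else "")
    return alts


sample = (
    '<!-- <img alt="hidden"> -->'
    '<p>Text <img src="a>b.gif" alt="A > B"> and <IMG ALT=logo src=x.gif></p>'
)
print(img_alts(sample))
```

No nesting depth is tracked anywhere: each token is recognised on its own,
which is why regexps are enough here. The image inside the comment is
skipped, and a ">" inside a quoted value does not end the tag early.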

Daniel