Re: RegExp Conundrum

Daniel Mahler (mahler@nospam.socs.uts.edu.au)
Wed, 26 Jun 1996 18:19:52 +1000 (EST)

How would one specify a regular expression to extract the alt
parameter from an img tag in html? I keep falling into the "find
the longest possible match" hole that regular expressions provide,
either from trying to find matches between double quotes or between
opening and closing <>'s, whenever there were multiple tags in the
file.

Simply say that the bit between the delimiters is not allowed to contain
any delimiters. You just have to watch out for escaped delimiters.

Using EMACS regexp notation, "[^"]*" will match any text between two
successive quotes (quotes included). This is good enough for HTML since
an escaped quote is &quot; rather than \".
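
As a rough sketch of the same idea in Python (my choice of language for
illustration, not something from the original question): the names ALT_RE
and extract_alts are made up, and the pattern assumes the alt value is
written in double quotes and that no attribute value contains a ">".

```python
import re

# Hypothetical helper for illustration. [^>]* keeps the match inside one tag,
# and [^"]* keeps the captured value inside one pair of quotes.
ALT_RE = re.compile(r'<img\b[^>]*\balt\s*=\s*"([^"]*)"', re.IGNORECASE)

def extract_alts(html):
    """Return the alt text of every <img> tag that has a double-quoted alt."""
    return ALT_RE.findall(html)

sample = '<img src="a.gif" alt="first"> text <img alt="second one" src="b.gif">'
greedy = re.findall(r'".*"', sample)      # one match spanning both tags
bounded = re.findall(r'"[^"]*"', sample)  # one match per quoted string
alts = extract_alts(sample)
```

Here the greedy ".*" version falls into the "longest possible match" hole,
while the [^"]* version stops at the next quote. Single-quoted or unquoted
attribute values would need extra alternatives in the pattern.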

For other languages that use backslashes to escape characters, you
would have to make sure that the quote you match is not preceded by an
odd number of backslashes.
In EMACS \(^\|[^\\]\)\(\\\\\)*" will match any unescaped quote (along with
the character and any pairs of backslashes immediately before it).
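
A possible Python rendering of the same check, as a sketch only: instead
of consuming the character before the backslashes, this variation uses a
negative lookbehind, so two adjacent quotes are both found. The names
UNESCAPED_QUOTE and unescaped_quote_positions are invented for the example.

```python
import re

# Variation on the Emacs pattern above: (?<!\\) requires that the run of
# backslashes does not start in the middle of a longer run, and (?:\\\\)*
# then allows only whole pairs of backslashes before the quote.
UNESCAPED_QUOTE = re.compile(r'(?<!\\)(?:\\\\)*"')

def unescaped_quote_positions(text):
    """Return the index of every quote preceded by an even number of backslashes."""
    return [m.end() - 1 for m in UNESCAPED_QUOTE.finditer(text)]

positions = unescaped_quote_positions(r'say "a \"b\" c" end')
```

For that input, only the quotes at the start and end of the string literal
are reported; the two \" sequences inside it are skipped.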

Daniel

Well?

jimmy

ps Anyone know the location of an html parser? C++ preferred, but C
accepted. There's probably a Java one around, no?