[wplug] regex help

Brian Sammon brians+ at cs.cmu.edu
Mon Jan 12 16:09:29 EST 2004


> /<td.*?>/
> /<td[^>]*>/

Ooh!  This is a topic I was just dealing with recently myself.
Of these two, I would recommend the second much more strongly, provided it
matches what you are looking for.
Of course, you could make it non-greedy as follows:
  /<td[^>]*?>/
but it makes no difference: [^>] can never run past the closing >, so the
greedy and non-greedy versions match exactly the same text.
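
To illustrate the point about greediness, here is a minimal sketch. The sample
string is made up for the example and is not from the original question; all
three patterns should capture the same opening tag when nothing else follows
them in the expression.

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $html = '<tr><td class="x" width="10">cell</td></tr>';

# In list context a match returns its captures, so each variable
# receives the text of the opening <td ...> tag.
my ($greedy)     = $html =~ /(<td[^>]*>)/;   # negated class, greedy
my ($non_greedy) = $html =~ /(<td[^>]*?>)/;  # negated class, non-greedy
my ($dot_lazy)   = $html =~ /(<td.*?>)/;     # lazy dot

print "greedy:     $greedy\n";
print "non-greedy: $non_greedy\n";
print "dot lazy:   $dot_lazy\n";
```

On their own the three behave alike here; the differences only appear once
more of the expression follows the tag.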

The reason I recommend [^>]* over .*? shows up mainly when you are building
more complex expressions.
Assume for the moment that you are searching a medium-sized HTML document and
you are using the pattern
   /<td.*?><a href= "(.*?)"><\/td>/s
(I'm assuming we're talking about Perl here.)

This will probably fail because of the space in the <a> tag of the pattern
(after href=).
It will take much longer to fail than it should, because with /s the <td.*?>
can end at any later > in the document (e.g. it matches "<td>" and also
"<td>foo</td>"), and the regex engine has to try the rest of the expression
for each of those possibilities.  If you replace the <td.*?> with <td[^>]*>,
there is only one way to match each <td> tag, so it fails more quickly.
Here the difference is something like 100 milliseconds vs 5 milliseconds, but
it can get much worse with larger expressions...
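
A small sketch of that stretching behaviour, using a made-up two-cell string
(an assumption for illustration, not data from the original problem).  Both
patterns are forced to be followed by <b>; the lazy dot gets there by running
across the first cell, while the negated class can only ever match a single
tag.

```perl
use strict;
use warnings;

# Hypothetical sample input: the first cell has no <b>, the second does.
my $html = '<td>foo</td><td><b>bar</b></td>';

# Lazy dot: when "<td>" followed by "<b>" fails at the first cell, the
# engine backtracks and lets .*? grow past "foo</td><td" until a ">"
# followed by "<b>" turns up.
my ($lazy) = $html =~ /(<td.*?>)<b>/s;

# Negated class: [^>]* cannot cross a ">", so the first <td> is simply
# rejected and the match restarts at the second one.
my ($negated) = $html =~ /(<td[^>]*>)<b>/;

print "lazy dot:      $lazy\n";      # <td>foo</td><td>
print "negated class: $negated\n";   # <td>
```

Every one of those extra stretches is work the engine does before it can give
up, which is where the wasted time comes from.
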
I was recently working with the following expression, which ran for over a 
minute on a PIII before I killed the process.  Replacing .*? with [^>]* where 
appropriate made it fail in a few seconds, so I could then proceed to figure 
out why it failed.

     m|
     <tr.*?>.*?<td.*?>(.*?)</td>.*?
       <td.*?>((?:<table.*?>.*?</table>)?.*?)</td>.*?</tr>.*?
     <tr.*?>(.*?)</tr>.*?
     <tr.*?><table.*?>.*?mailto:(.*?)<.*?</table>.*?
       <table.*?>.*?</table>.*?</tr>
     |xgs
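
For comparison, here is a sketch of what the substitution might look like.
This is a guess at "where appropriate" (only the .*? inside opening tags is
replaced), not the exact expression I ended up with, and the sample HTML is
invented purely to show the captures.  The /g flag is dropped because the
sketch only looks for one match.

```perl
use strict;
use warnings;

# Same structure as the expression above, with .*? replaced by [^>]*
# inside each opening tag (an assumed reading of "where appropriate").
my $re = qr|
    <tr[^>]*>.*?<td[^>]*>(.*?)</td>.*?
      <td[^>]*>((?:<table[^>]*>.*?</table>)?.*?)</td>.*?</tr>.*?
    <tr[^>]*>(.*?)</tr>.*?
    <tr[^>]*><table[^>]*>.*?mailto:(.*?)<.*?</table>.*?
      <table[^>]*>.*?</table>.*?</tr>
|xs;

# Hypothetical input shaped to fit the pattern; single-quoted heredoc
# so that the @ in the address is not interpolated.
my $html = <<'END_HTML';
<tr><td>Name</td><td>Desc</td></tr>
<tr>middle</tr>
<tr><table><tr><td>mailto:someone@example.com</td></tr></table><table><tr><td>x</td></tr></table></tr>
END_HTML

# In list context the match returns the four captured groups.
my @fields = $html =~ $re;

print join(" | ", @fields), "\n";   # Name | Desc | middle | someone@example.com
```

The .*? between tags is left alone, since there it really does need to cross
other tags.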





