[wplug] regex help
Brian Sammon
brians+ at cs.cmu.edu
Mon Jan 12 16:09:29 EST 2004
> /<td.*?>/
> /<td[^>]*>/
Ooh! This is a topic I was just dealing with recently myself.
Of these two, the second is the one I'd recommend much more strongly, provided
it matches what you are looking for.
Of course, you could make it non-greedy as follows:
/<td[^>]*?>/
but it doesn't matter: [^>]* can never run past the first ">", so the greedy
and non-greedy forms match the same text.
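
To see that the greedy and non-greedy forms of the negated class behave the
same, here is a minimal sketch. The sample string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a single cell with an attribute.
my $html = '<td align="left">cell</td>';

# [^>]* cannot cross the first ">", so greediness makes no difference here.
my ($greedy) = $html =~ /(<td[^>]*>)/;
my ($lazy)   = $html =~ /(<td[^>]*?>)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Both captures should come out as the opening tag only.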
The main reason I recommend [^>]* over .*? shows up when you are building more
complex expressions.
Assume for the moment that you are searching a medium-sized HTML document and
you are using the search term
/<td.*?><a href= "(.*?)"><\/td>/s
(I'm assuming we're talking about perl here)
This will probably fail because of the space after href= in the <a> tag.
It will take much longer to fail than it should, because with /s the <td.*?>
can match in a lot of ways (e.g. it matches "<td>" and "<td>foo</td>"), and the
regex engine will have to try the rest of the expression for each of them. If
you replace the <td.*?> with <td[^>]*>, it will fail more quickly.
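
A small sketch of why <td.*?> has so many candidate matches: when the rest of
the pattern does not fit right after the first ">", the non-greedy .*? simply
keeps stretching across tag boundaries, while [^>]* cannot. The input string
and the trailing "<a " are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: a plain cell followed by a cell that holds a link.
my $html = '<td>foo</td><td class="x"><a href="u">';

# .*? starts at the first <td and stretches until ">" is followed by "<a ".
my ($lazy) = $html =~ /(<td.*?>)<a /s;

# [^>]* has to stay inside one tag, so only the second <td ...> can match.
my ($negated) = $html =~ /(<td[^>]*>)<a /s;

print "lazy:    $lazy\n";
print "negated: $negated\n";
```

The .*? version swallows the whole first cell to get a match, and on a real
document the engine tries every such stretch before it gives up.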
Here the difference is something like 100 milliseconds vs 5 milliseconds, but
it can get much worse with larger expressions...
I was recently working with the following expression, which ran for over a
minute on a PIII before I killed the process. Replacing .*? with [^>]* where
appropriate made it fail in a few seconds, so I could then proceed to figure
out why it failed.
m|
<tr.*?>.*?<td.*?>(.*?)</td>.*?
<td.*?>((?:<table.*?>.*?</table>)?.*?)</td>.*?</tr>.*?
<tr.*?>(.*?)</tr>.*?
<tr.*?><table.*?>.*?mailto:(.*?)<.*?</table>.*?
<table.*?>.*?</table>.*?</tr>
|xgs
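
The rewritten expression is not shown above, so the following is only a guess
at what "replacing .*? with [^>]* where appropriate" could look like: every
.*? that sits inside a tag becomes [^>]*, and the mailto capture stops at the
next "<". The sample input is invented to fit the shape of the expression, and
the /g flag is left out because the sketch only looks for one match.

```perl
use strict;
use warnings;

# Hypothetical sample input, made up to fit the shape the expression expects.
my $html = join "\n",
    '<tr><td>Name</td><td>Info</td></tr>',
    '<tr>row two</tr>',
    '<tr><table><td>mailto:someone@example.com<br></td></table>',
    '<table><td>x</td></table></tr>';

# Same structure as the original expression; only the in-tag wildcards changed.
my $re = qr{
    <tr[^>]*>.*?<td[^>]*>(.*?)</td>.*?
    <td[^>]*>((?:<table[^>]*>.*?</table>)?.*?)</td>.*?</tr>.*?
    <tr[^>]*>(.*?)</tr>.*?
    <tr[^>]*><table[^>]*>.*?mailto:([^<]*)<.*?</table>.*?
    <table[^>]*>.*?</table>.*?</tr>
}xs;

# In list context the match returns the four captured groups.
my @fields = $html =~ $re;
print join(' | ', @fields), "\n" if @fields;
```

The .*? pieces between tags are left alone, since those really do need to
cross tag boundaries.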