Sunday, October 27, 2019

Python, regex and html: match final tag on line



I'm confused about Python's greedy/non-greedy quantifiers.



"Given multi-line html, return the final tag on each line."



I would think this would be correct:



re.findall('<.*?>$', html, re.MULTILINE)


I'm irked because I expected a list of single tags like:



"", "
    ", "".


My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."



So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?


Answer



Your problem stems from the end-of-line anchor ('$'). A regex match always begins at the leftmost position where a match is possible, so the engine starts at the first '<' on the line. The non-greedy '.*?' then consumes as few characters as it can, but '.' also matches '>', and the pattern as a whole only succeeds once a '>' is immediately followed by the end of the line. The engine therefore keeps extending the match past any earlier '>' characters until it reaches the one at the end of the line. So a non-greedy * is no different from a greedy * in this situation.
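To see this behaviour concretely, here is a minimal sketch. The sample `html` string is made up for illustration (the HTML from the question is not shown), so treat it as an assumption rather than the asker's data.

```python
import re

# Hypothetical sample input, not the HTML from the original question.
html = "<ul><li>one</li>\n<p>two</p>\n</ul>"

# Non-greedy version from the question.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)

# Greedy version, for comparison.
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<ul><li>one</li>', '<p>two</p>', '</ul>']
print(greedy)  # ['<ul><li>one</li>', '<p>two</p>', '</ul>']
```

Both patterns return the same list here. Lines that contain several tags produce multi-tag matches, while a line with a single tag produces just that tag, which matches the "some (but not all)" symptom described in the question.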



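Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer. '<[^><]*>$' will work, because the character class cannot match a '<' or '>', so the match is forced to start at the last '<' on the line.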
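A short sketch of the suggested pattern in use, again with a made-up sample `html` string (an assumption, since the original input is not shown):

```python
import re

# Hypothetical sample input, not the HTML from the original question.
html = "<ul><li>one</li>\n<p>two</p>\n</ul>"

# [^><]* cannot cross a '<' or '>', so only the last tag on each line can
# reach the '$' anchor.
last_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(last_tags)  # ['</li>', '</p>', '</ul>']
```

Note that this only finds a tag when the line actually ends with '>'; lines with trailing whitespace or text after the last tag would yield no match with this pattern.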

