Tuesday, October 22, 2019

python - Regex include line breaks





I have the following XML file:




A




B
C





D



Picture number 3?




and I just want to get the text between

and
.

So I've tried this code:



import os, re

html = open("2.xml", "r")
text = html.read()
lon = re.compile(r'
\n(.+)\n
', re.MULTILINE)
lon = lon.search(text).group(1)
print lon



but it doesn't seem to work.


Answer



1) Don't parse XML with regex. It just doesn't work. Use an XML parser.



2) If you do use regex for this, you don't want re.MULTILINE, which controls how ^ and $ work in a multiple-line string. You want re.DOTALL, which controls whether . matches \n or not.



3) You probably also want your pattern to return the shortest possible match, using the non-greedy +? operator.



lon = re.compile(r'
\n(.+?)\n
', re.DOTALL)
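
To make the three points above concrete, here is a minimal sketch in Python 3. The sample document and its element names (`<data>`, `<en>`) are assumptions chosen for illustration, not the tags from the original file. It compares re.MULTILINE, greedy re.DOTALL, and non-greedy re.DOTALL on the same text, then reads the same elements with the standard-library xml.etree.ElementTree parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the element names <data> and <en> are assumptions,
# used only to illustrate the behaviour of the flags.
text = """<data>
<en>
Picture number 3?
Second line
</en>
<en>
Another entry
More text
</en>
</data>"""

pattern = r'<en>\n(.+)\n</en>'

# re.MULTILINE only changes how ^ and $ behave. "." still stops at a
# newline, so a body that spans two lines is not matched at all.
multiline = re.search(pattern, text, re.MULTILINE)

# re.DOTALL lets "." match newlines, but the greedy ".+" runs on to the
# LAST closing tag, so the capture swallows the markup in between.
greedy = re.search(pattern, text, re.DOTALL)

# The non-greedy ".+?" stops at the first closing tag it can reach.
lazy = re.search(r'<en>\n(.+?)\n</en>', text, re.DOTALL)

# The parser-based alternative: no regex, and it does not depend on
# where the line breaks fall inside the element.
root = ET.fromstring(text)
parsed = [el.text.strip() for el in root.iter('en')]

print(multiline)
print(repr(greedy.group(1)))
print(repr(lazy.group(1)))
print(parsed)
```

The regex versions rely on the opening and closing tags sitting on their own lines with no indentation or `\r\n` line endings. The parser version does not care about any of that, which is the main reason for point 1.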

