Saturday, August 31, 2019

linux - grep with regex containing pipe character



I am trying to grep with a regex that contains the pipe character |. However, it doesn't work as expected: the match does not include the | or anything after it.




This is my bash command:



cat data | grep "{{flag\|[a-z|A-Z\s]+}}"



The sample data is the following:



| 155||NA||{{flag|Central African Republic}}||2.693||NA||0.000||0.000||0.019||0.271||0.281||0.057||2.066
|{{flagicon|Kosovo}} ''[[Kosovo]]'' {{Kosovo-note}}
|{{flagicon|Somaliland}} [[Somaliland|Somaliland region]]
|{{flagicon|Palestine}} ''[[Palestinian Territories]]''{{refn|See the following on statehood criteria:



The expected output is:



| 155||NA||{{flag|Central African Republic}}||2.693||NA||0.000||0.000||0.019||0.271||0.281||0.057||2.066


However, when I tested the same regex on Regex101.com, the result came out as expected.


Answer



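In plain grep (at least GNU grep), \| is the separator between alternative search expressions and an unescaped | matches a literal |. It is the other way round in egrep, where | separates alternatives and \| matches a literal |. Your pattern is therefore read as "{{flag" or "[a-z|A-Z\s]+}}", so every line containing {{flag matches, including the {{flagicon lines.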
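A small sketch of the difference, assuming GNU grep (treating \| as "or" in a basic regex is a GNU extension, so other grep implementations may behave differently). The two sample lines and the variable names are made up for illustration:

```shell
# Two sample lines: one contains a literal pipe, the other does not.
sample='a|b
ab'

# Plain grep (basic regex): an unescaped | is just a pipe character.
bre_plain=$(printf '%s\n' "$sample" | grep 'a|b')

# Plain grep with \| (GNU extension): "a" OR "b", so both lines match.
bre_escaped=$(printf '%s\n' "$sample" | grep 'a\|b')

# grep -E (extended regex): the roles are swapped, \| is the literal pipe.
ere_escaped=$(printf '%s\n' "$sample" | grep -E 'a\|b')

printf 'grep    a|b   -> %s\n' "$bre_plain"
printf 'grep    a\\|b  -> %s\n' "$(printf '%s' "$bre_escaped" | tr '\n' ' ')"
printf 'grep -E a\\|b  -> %s\n' "$ere_escaped"
```

With GNU grep, the first and third commands should print only the a|b line, while the second prints both lines.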




Apart from that, your expression has other problems:-




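  • + is a repetition operator in egrep (or grep -E) only; in plain grep it is an ordinary character (GNU grep accepts \+ instead).

  • \s is not supported within a [] character group. There the backslash is an ordinary character, so [\s] matches a backslash or the letter s; use a literal space or [[:space:]] instead.

  • I don't see the need for | in the character group; inside [] it only adds a literal | to the set.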
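The first two points can be checked with a few throwaway inputs. This is only a sketch, assuming GNU grep; the sample strings and variable names are invented for the demonstration:

```shell
# Inside [...] the backslash is literal, so [\s] means "backslash or s".
bracket_s=$(printf 'a b\nasb\n' | grep 'a[\s]b')

# A POSIX character class does match the space.
bracket_space=$(printf 'a b\nasb\n' | grep 'a[[:space:]]b')

# In a basic regex, + is an ordinary character: only the literal "a+" matches.
bre_plus=$(printf 'aaa\na+\n' | grep 'a+')

# In an extended regex, + means "one or more": both lines contain an "a".
ere_plus=$(printf 'aaa\na+\n' | grep -E 'a+')

printf '[\\s]        matched: %s\n' "$bracket_s"
printf '[[:space:]] matched: %s\n' "$bracket_space"
printf 'grep a+     matched: %s\n' "$bre_plus"
printf 'grep -E a+  matched: %s\n' "$(printf '%s' "$ere_plus" | tr '\n' ' ')"
```

With GNU grep, [\s] should pick out asb rather than the line with the space, and the plain a+ should match only the line that literally contains a+.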



So the following works for grep:-




grep "{{flag|[a-zA-Z ][a-zA-Z ]*}}" data


Or, using the GNU extension \+ (thanks to Glenn Jackman's input):-



grep "{{flag|[a-zA-Z ]\+}}" data


In egrep (or grep -E) the {} characters have special significance, since they delimit repetition counts such as a{2,3}, so they need to be escaped, and so does the literal |:-




egrep "\{\{flag\|[a-zA-Z ]+\}\}" data


Note that I have removed the unnecessary use of cat: grep takes the file name as an argument.
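As a rough check, the three patterns can be run against the sample data from the question. This sketch assumes GNU grep and writes the sample to a temporary file (the variable names are arbitrary); grep -c is used to count the matching lines:

```shell
# Put the question's four sample lines into a temporary file.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
| 155||NA||{{flag|Central African Republic}}||2.693||NA||0.000||0.000||0.019||0.271||0.281||0.057||2.066
|{{flagicon|Kosovo}} ''[[Kosovo]]'' {{Kosovo-note}}
|{{flagicon|Somaliland}} [[Somaliland|Somaliland region]]
|{{flagicon|Palestine}} ''[[Palestinian Territories]]''{{refn|See the following on statehood criteria:
EOF

# Basic regex, spelling out "one or more" as X followed by X*.
count_bre=$(grep -c '{{flag|[a-zA-Z ][a-zA-Z ]*}}' "$datafile")

# Basic regex with the GNU \+ extension.
count_gnu=$(grep -c '{{flag|[a-zA-Z ]\+}}' "$datafile")

# Extended regex: braces and the literal pipe are escaped.
count_ere=$(grep -E -c '\{\{flag\|[a-zA-Z ]+\}\}' "$datafile")

# Print the single matching line, then the three counts.
grep '{{flag|[a-zA-Z ][a-zA-Z ]*}}' "$datafile"
echo "matching lines: $count_bre $count_gnu $count_ere"

rm -f "$datafile"
```

Each variant should report exactly one matching line, the {{flag|Central African Republic}} row, because the {{flagicon lines have no | directly after flag.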

