Saturday, September 2, 2017

How to copy multi-line regex matches to the clipboard using Notepad++



I have a FASTA file containing genome sequences of multiple viruses.



Example:



>gi_138375030_Human_papillomavirus
GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA...


>gi_94481944_Human_herpesvirus_3
GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA...


I want to extract only the herpesvirus entries, including the actual sequence, which (in this file) is always the line following the description.



The following regex works:



>.*herpes.*\n.*\n



It selects the description and the sequence lines.
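Outside Notepad++, the same pattern can be applied with a short script. The following is only a sketch in Python, assuming each record is exactly two lines as described; the sample text and the names `fasta_text`, `HERPES_RECORD` and `extract_records` are made up for illustration.

```python
import re

# Hypothetical sample in the two-lines-per-record layout described above.
fasta_text = (
    ">gi_138375030_Human_papillomavirus\n"
    "GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA\n"
    ">gi_94481944_Human_herpesvirus_3\n"
    "GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA\n"
)

# Same pattern as in the question. By default '.' does not match '\n',
# so each match spans exactly the description line and the line after it.
HERPES_RECORD = re.compile(r">.*herpes.*\n.*\n")


def extract_records(text):
    """Return every description+sequence pair whose description mentions 'herpes'."""
    return HERPES_RECORD.findall(text)


matches = extract_records(fasta_text)
# Joining the matches gives text that can be written to a new file.
herpes_only = "".join(matches)
```

When a real file is opened in Python's default text mode, Windows line endings are translated to `\n` on read, so the pattern should not need changing for files saved on Windows.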



I have found similar questions but all make use of the "bookmark line" function:
Export all regular expression matches in Textpad or Notepad++ as a list



However, this only bookmarks the first line of the regex output, so I am unable to use the described solutions. If I use "find all in current document", it also only lists the first lines.



All I want to do is copy the output of regex into a new file. It is especially frustrating since it finds just above a hundred entries, which is just above the margin under which I would be willing to do it manually.




I would prefer a solution that works on Windows.


Answer



You could make a copy of the file and then, on the copy, search and replace the negation of what you want:



(?!>.*herpes.*)^(>.*\R)([ATGC]+\R)



The above will (or ought to) find paired lines whose description does not contain herpes. Couple this with a blank replace field and you will wind up with a file that contains only what you are looking for.
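The same remove-everything-else idea can be tried in a script. This is a rough sketch in Python rather than Notepad++, with two assumptions: Python's `re` module has no `\R`, so `\r?\n` stands in for it, and `re.MULTILINE` is needed for `^` to match at every line start. The sample text and the names `sample`, `NON_HERPES_PAIR` and `drop_non_herpes` are hypothetical.

```python
import re

# Hypothetical two-record sample; sequences contain only A, T, G, C.
sample = (
    ">gi_138375030_Human_papillomavirus\n"
    "GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA\n"
    ">gi_94481944_Human_herpesvirus_3\n"
    "GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA\n"
)

# A description line that does NOT contain 'herpes', plus its sequence line.
# '\r?\n' replaces Notepad++'s '\R', which Python's re module does not support.
NON_HERPES_PAIR = re.compile(r"^(?!>.*herpes)>.*\r?\n[ATGC]+\r?\n", re.MULTILINE)


def drop_non_herpes(text):
    """Delete every non-herpes pair, leaving only the herpes records."""
    return NON_HERPES_PAIR.sub("", text)


remaining = drop_non_herpes(sample)
```

Note that `[ATGC]+` only matches sequence lines made purely of those four letters; records containing other symbols such as `N` would be left in place by this pattern.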

