I have a FASTA file containing genome sequences of multiple viruses.
Example:
>gi_138375030_Human_papillomavirus
GAAAGTTTCAATCATACTTTATTATATTGGGAGTAAAAAAAA...
>gi_94481944_Human_herpesvirus_3
GGCCCAGCCCTCTCGCGGCCCCCTCGAGAGAGAAAAAAA...
I want to extract only herpes virus entries, including the actual sequence, which is (in this file) always the line following the description.
The following regex works:
>.*herpes.*\n.*\n
It selects the description and the sequence lines.
I have found similar questions but all make use of the "bookmark line" function:
Export all regular expression matches in Textpad or Notepad++ as a list
However, this only bookmarks the first line of the regex output, so I am unable to use the described solutions. If I use "find all in current document", it also only lists the first lines.
All I want to do is copy the output of regex into a new file. It is especially frustrating since it finds just above a hundred entries, which is just above the margin under which I would be willing to do it manually.
I would prefer a solution in Windows OS.
Answer
You could make a copy of the file and then, on the copy, search and replace the negation of what you want:
(?!>.*herpes.*)^(>.*\R)([ATGC]+\R)
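The above should match each description/sequence pair whose description does not contain herpes. Replace all matches with an empty string and the copy will be left with only the entries you are looking for. Note that [ATGC]+ only matches sequence lines made up entirely of A, T, G and C, so a pair whose sequence contains any other character (for example N) will not be matched and will stay in the file.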
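If the editor route gets awkward, the same selection can be done with a short script, which also runs on Windows. This is only a minimal sketch under the assumption stated in the question, namely that every sequence sits on the single line after its description. The function names extract_records and write_matches are made up for illustration, and the pattern is the one from the question, adapted because Python's re module has no \R.

```python
import re


def extract_records(fasta_text, keyword="herpes"):
    """Return each description+sequence pair whose description contains keyword."""
    # Same idea as >.*herpes.*\n.*\n from the question.
    # \r?\n also accepts Windows line endings, and (?:\r?\n|$) lets the
    # last sequence line match even if the file has no trailing newline.
    pattern = re.compile(
        r"^>.*" + re.escape(keyword) + r".*\r?\n.*(?:\r?\n|$)",
        re.MULTILINE,
    )
    return pattern.findall(fasta_text)


def write_matches(src_path, dst_path, keyword="herpes"):
    """Copy the matching records from src_path into a new file at dst_path."""
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, newline="") as src:
        records = extract_records(src.read(), keyword)
    with open(dst_path, "w", newline="") as dst:
        dst.writelines(records)
    return len(records)
```

Calling write_matches("viruses.fasta", "herpes_only.fasta") would then write the matching pairs to a new file and return how many were found; both file names are placeholders. The match is case-sensitive, like the original regex with "Match case" enabled.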