I have a FASTA file like this:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>hik \PName=EERT abs 1 \GName=EERT \Type=2 \Processed=(1|181:mature protein)
MMKFKPNPGDREGFKKRAACLCFRSEQEDEVLLVSSQTRTYSRYPDQWIVPGGGMEPEEE
>dmd \PName=YYHY abs 1 \GName=YYHY \Type=0 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
I want to extract the records whose header contains Type=1, so that my output looks like this:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
I tried the grep command:
grep 'Type=1' file.fasta
It returns only the header lines, without the sequences:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
How can I get the desired output?
Best answer
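If awk is an option (note that RS="" is awk's paragraph mode, so this assumes the records in file.fasta are separated by blank lines):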
awk '
BEGIN { RS="" } # redefine input record separator as empty string => treat consecutive non-blank lines as single record
/Type=1 / { print $0 ORS } # if record contains string "Type=1 " then print record plus default output record separator; ORS provides the blank line
' file.fasta
This generates:
>abc \PName=Did abs 1 \GName=NUDT \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
GAAVREVYEEAGVKGKLGRLLGIFEQNQDRKHRTYVYVLTVTEILEDWEDSVNIGRKREW
R
>dmd \PName=REWW abs 1 \GName=REWW \Type=1 \Processed=(1|181:mature protein)
MMKFKPNQTRTYSRYPDQWIVPGGGMEPEEEPGDREGFKKRAACLCFRSEQEDEVLLVSS
G
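Since the question is tagged python, here is a minimal Python sketch of the same idea that does not depend on blank lines between records. It decides at each header line (a line starting with ">") whether to keep the record, then passes the following sequence lines through until the next header. The function name filter_fasta, the TYPE_1 pattern, and the shortened sample records are illustrative assumptions, not part of the original answer.

```python
import re
from typing import Iterable, Iterator

# Hypothetical pattern: a literal "\Type=1" not followed by another digit,
# so that headers such as "\Type=10" are not matched by accident.
TYPE_1 = re.compile(r"\\Type=1(?!\d)")


def filter_fasta(lines: Iterable[str], pattern: re.Pattern = TYPE_1) -> Iterator[str]:
    """Yield the header and sequence lines of records whose header matches pattern."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A new record starts here: decide once, based on the header only.
            keep = pattern.search(line) is not None
        # Emit lines of kept records; blank separator lines are dropped.
        if keep and line.strip():
            yield line


# Shortened sample records (headers and sequences abbreviated for illustration).
sample = [
    ">abc \\GName=NUDT \\Type=1 \\Processed=(1|181:mature protein)\n",
    "MMKFKPNQTRTY\n",
    "R\n",
    ">hik \\GName=EERT \\Type=2 \\Processed=(1|181:mature protein)\n",
    "MMKFKPNPGDRE\n",
    ">dmd \\GName=REWW \\Type=1 \\Processed=(1|181:mature protein)\n",
    "MMKFKPNQTRTY\n",
    "G\n",
]

print("".join(filter_fasta(sample)), end="")
```

To run it on a real file, pass an open file handle instead of the sample list, for example sys.stdout.writelines(filter_fasta(open("file.fasta"))). Matching only on header lines also avoids false positives if the search text ever happened to appear inside a sequence line.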
Regarding python - How to extract the specific fasta file with header and sequence in a given file?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/73174820/