python - Unique list of nodes from an edge list

I have a large edge list (about 26 million rows) whose first two columns are nodes, followed by a variable number of optional columns:

Node1    Node2    OptionalCol1    OptionalCol2   ...

Gene A    Gene D   --             --
Gene C    Gene F   --             --
Gene D    Gene C   --             --
Gene F    Gene A   --             --

I want a text file containing a non-redundant list of the nodes from both columns. Output:

Gene A
Gene D
Gene C
Gene F

My Python code:

node_list = []

with open("input.txt", "r") as file1:
    for i in file1:
        # Assumes tab-separated columns, since node names such as "Gene A" contain spaces
        node_info = i.rstrip("\n").split("\t")
        if len(node_info) < 2:
            continue  # skip blank or malformed lines
        # Append each node only if it has not been seen yet (linear scan of the list)
        for node in (node_info[0].strip(), node_info[1].strip()):
            if node not in node_list:
                node_list.append(node)

print(node_list)

Is it possible to do this with awk? Thanks.

Best answer

Assuming the separator is a tab (\t). If it is instead a run of spaces (two or more), replace -F"\t" with -F"  +" (two spaces followed by +):

$ awk -F"\t" 'NR>2{a[$1];a[$2]}END{for(i in a)print i}' file
Gene A
Gene C
Gene D
Gene F

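The output is in no particular order, because awk does not define the iteration order of for (i in a), but it could be ordered if needed. Explanation: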

$ awk -F"\t" '
NR>2 {           # starting on the third record
    a[$1]        # hash first...
    a[$2]        # and second columns
}
END {            # after all that hashing
    for(i in a)  # iterate whole hash
        print i  # and output
}' file
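
For comparison, the same hashing idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name unique_nodes and its delimiter and skip parameters are hypothetical, and it assumes tab-separated input with two leading lines to skip, mirroring NR>2 above. A dict is used as the hash so that lookups stay constant-time and the nodes come out in first-seen order.

```python
def unique_nodes(lines, delimiter="\t", skip=2):
    """Return the distinct values of the first two columns, in first-seen order.

    `lines` is any iterable of text lines (an open file works).
    `skip` is the number of leading lines to ignore (assumed header + blank line).
    """
    seen = {}  # dict keys preserve insertion order (Python 3.7+)
    for line_number, line in enumerate(lines, start=1):
        if line_number <= skip:
            continue  # same effect as NR>2 in the awk command
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        seen[fields[0].strip()] = None  # hash the first column
        seen[fields[1].strip()] = None  # hash the second column
    return list(seen)


# Small in-memory sample shaped like the edge list in the question
sample = [
    "Node1\tNode2\tOptionalCol1\tOptionalCol2\n",
    "\n",
    "Gene A\tGene D\t--\t--\n",
    "Gene C\tGene F\t--\t--\n",
    "Gene D\tGene C\t--\t--\n",
    "Gene F\tGene A\t--\t--\n",
]

for node in unique_nodes(sample):
    print(node)
```

For a real file, an open file handle can be passed in place of sample, so the edge list is read line by line and only the distinct node names are held in memory.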

Regarding "python - Unique list of nodes from an edge list", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53279445/
