python - Extracting specific rows from a large file

I have a large file (5,000,000 lines) in the format:

'User ID,Mov ID,Rating,Timestamp'

I have another, much smaller file (200,000 lines) with records of the form:

'User ID, Mov ID'

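I have to generate a new file: if a (User ID, Mov ID) pair from the second file matches any record among the 5,000,000 lines of the first file, that record must not be included in the new file. In other words, the new file contains only the records whose (User ID, Mov ID) pair has nothing in common with file 2 (200,000 lines).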

I am trying this naive approach, but it takes too much time. Is there a faster algorithm for this?

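from sys import argv
import re

script, filename1, filename2 = argv

# Open the input files: the small (User ID, Mov ID) file and the large ratings file
testing_small = open(filename1)
ratings = open(filename2)
# Open the output file for the filtered data
ratings_training = open("ratings_training.csv", 'w')

# For every rating line, rescan the whole small file: O(n * m) comparisons
for line_rating in ratings:
    flag = 0
    testing_small.seek(0)
    for line_test in testing_small:
        # Uses the "User ID,Mov ID" line as a regex anchored at the start of the rating line
        matched_line = re.match(line_test.rstrip(), line_rating)
        if matched_line:
            flag = 1
            break
    if flag == 0:
        ratings_training.write(line_rating)

testing_small.close()
ratings.close()
ratings_training.close()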

I can also use any Spark-based approach.

Best answer

For example:

# df1:
User_ID,Mov_ID,Rating,Timestamp
sam,apple,0.6,2017-03-17 09:04:39
sam,banana,0.7,2017-03-17 09:04:39
tom,apple,0.3,2017-03-17 09:04:39
tom,pear,0.9,2017-03-17 09:04:39

# df2:
User_ID,Mov_ID
sam,apple
sam,pear
tom,apple

In Pandas:

import pandas as pd

df1 = pd.read_csv('./disk_file')
df2 = pd.read_csv('./tmp_file')
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only']
print(res)

Or in Spark:

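from pyspark import SparkConf
from pyspark.sql import SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

df1 = spark.read.load(path='file:///home/zht/PycharmProjects/test/disk_file', format='csv', sep=',', header=True)
df2 = spark.read.load(path='file:///home/zht/PycharmProjects/test/tmp_file', format='csv', sep=',', header=True)
# Left outer join: df1 rows with no match in df2 get nulls in the df2 columns
res = df1.join(df2, on=[df1['User_ID'] == df2['User_ID'], df1['Mov_ID'] == df2['Mov_ID']], how='left_outer')
# Keep only the df1 rows that have no match in df2
res = res.filter(df2['User_ID'].isNull())
# Alternative in one step: df1.join(df2, on=['User_ID', 'Mov_ID'], how='left_anti')
res.show()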
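
If neither pandas nor Spark is at hand, the same anti-join can be sketched with the standard library alone. This is not part of the original answer. The idea is to load the 200,000 (User ID, Mov ID) pairs into a set once, then stream the large file a single time and keep only the rows whose pair is not in the set. The function name filter_ratings and the has_header parameter are hypothetical, and the sketch assumes both files are comma-separated with the user ID and movie ID in the first two columns.

```python
import csv


def filter_ratings(ratings_path, exclude_path, out_path, has_header=False):
    """Copy rows from ratings_path to out_path, skipping every row whose
    (User ID, Mov ID) pair appears in exclude_path. Returns the number of
    data rows written. Hypothetical helper, not from the original answer."""
    # Load the pairs to exclude into a set so each lookup is O(1) on average.
    with open(exclude_path, newline='') as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)  # skip the header row of the small file
        exclude = {(row[0].strip(), row[1].strip())
                   for row in reader if len(row) >= 2}

    kept = 0
    with open(ratings_path, newline='') as src, \
            open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)  # carry the header over unchanged
        # Single pass over the large file; only one row is held at a time.
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if len(row) >= 2 and (row[0].strip(), row[1].strip()) in exclude:
                continue  # pair is present in the small file: drop the row
            writer.writerow(row)
            kept += 1
    return kept
```

Calling filter_ratings('ratings.csv', 'testing_small.csv', 'ratings_training.csv') (file names assumed here) reads each file once, so the work grows with the sum of the two file sizes rather than their product as in the nested-loop version. Memory use is bounded by the set of 200,000 pairs.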

Regarding "python - Extracting specific rows from a large file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42845292/
