I have a large file (5,000,000 lines) with records of the form:
'User ID,Mov ID,Rating,Timestamp'
I have another, smaller file (200,000 lines) with records of the form:
'User ID, Mov ID'
I have to generate a new file: whenever a (User ID, Mov ID) pair from the second file matches any of the 5,000,000 records in the first file, that record must be excluded. In other words, the new file consists of the (User ID, Mov ID) records that have nothing in common with file 2 (200,000 lines).
I am trying this naive approach, but it takes far too long. Is there a faster algorithm?:
from sys import argv
import re

script, filename1, filename2 = argv

# open the input files
testing_small = open(filename1)
ratings = open(filename2)

# open the file to write the data
ratings_training = open("ratings_training.csv", 'w')

for line_rating in ratings:
    flag = 0
    testing_small.seek(0)  # rescan the whole test file for every rating line
    for line_test in testing_small:
        matched_line = re.match(line_test.rstrip(), line_rating)
        if matched_line:
            flag = 1
            break
    if flag == 0:
        ratings_training.write(line_rating)

testing_small.close()
ratings.close()
ratings_training.close()
A Spark-based approach would also work for me.
Best Answer
For example:
# df1:
User_ID,Mov_ID,Rating,Timestamp
sam,apple,0.6,2017-03-17 09:04:39
sam,banana,0.7,2017-03-17 09:04:39
tom,apple,0.3,2017-03-17 09:04:39
tom,pear,0.9,2017-03-17 09:04:39
# df2:
User_ID,Mov_ID
sam,apple
sam,pear
tom,apple
In Pandas:
import pandas as pd

df1 = pd.read_csv('./disk_file')
df2 = pd.read_csv('./tmp_file')

# a left join with indicator=True marks each row 'left_only' or 'both';
# keeping only 'left_only' rows gives the anti-join (rows of df1 not in df2)
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only']
print(res)
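With the sample data above, the same anti-join can be run end to end (using `io.StringIO` in place of the files on disk; only the two kept rows survive the filter):

```python
import io
import pandas as pd

# inline copies of the df1/df2 sample data shown above
df1 = pd.read_csv(io.StringIO(
    "User_ID,Mov_ID,Rating,Timestamp\n"
    "sam,apple,0.6,2017-03-17 09:04:39\n"
    "sam,banana,0.7,2017-03-17 09:04:39\n"
    "tom,apple,0.3,2017-03-17 09:04:39\n"
    "tom,pear,0.9,2017-03-17 09:04:39\n"))
df2 = pd.read_csv(io.StringIO(
    "User_ID,Mov_ID\n"
    "sam,apple\n"
    "sam,pear\n"
    "tom,apple\n"))

# anti-join: keep rows of df1 whose (User_ID, Mov_ID) is absent from df2
res = pd.merge(df1, df2, on=['User_ID', 'Mov_ID'], how='left', indicator=True)
res = res[res['_merge'] == 'left_only'].drop(columns='_merge')
print(res)  # (sam, banana) and (tom, pear) remain
```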
Or in Spark:
from pyspark import SparkConf
from pyspark.sql import SparkSession

cfg = SparkConf().setAppName('MyApp')
spark = SparkSession.builder.config(conf=cfg).getOrCreate()

df1 = spark.read.load(path='file:///home/zht/PycharmProjects/test/disk_file', format='csv', sep=',', header=True)
df2 = spark.read.load(path='file:///home/zht/PycharmProjects/test/tmp_file', format='csv', sep=',', header=True)

# left outer join, then keep only the df1 rows that found no match in df2
res = df1.join(df2, on=[df1['User_ID'] == df2['User_ID'], df1['Mov_ID'] == df2['Mov_ID']], how='left_outer')
res = res.filter(df2['User_ID'].isNull())
res.show()
The original question, "python - Extract specific lines from a large file", is on Stack Overflow: https://stackoverflow.com/questions/42845292/