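I have two dataframes. One is an actors df with a feature that is a list of movie identifiers for the films each actor has worked on. The other is a list of movies, each with an identifier that shows up in an actor's list if that actor was in the movie.
I tried iterating through the movies dataframe, which does produce results but is far too slow.
Iterating through the list of movies in the actors dataframe seems like it would mean less looping, but I can't get the results to save.
Here is the actors dataframe: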
print(actors[['primaryName', 'knownForTitles']].head())
primaryName knownForTitles
0 Rowan Atkinson tt0109831,tt0118689,tt0110357,tt0274166
1 Bill Paxton tt0112384,tt0117998,tt0264616,tt0090605
2 Juliette Binoche tt1219827,tt0108394,tt0116209,tt0241303
3 Linda Fiorentino tt0110308,tt0119654,tt0088680,tt0120655
4 Richard Linklater tt0243017,tt1065073,tt2209418,tt0405296
And the movies dataframe:
print(movies[['tconst', 'primaryTitle']].head())
tconst primaryTitle
0 tt0001604 The Fatal Wedding
1 tt0002467 Romani, the Brigand
2 tt0003037 Fantomas: The Man in Black
3 tt0003593 Across America by Motor Car
4 tt0003830 Detective Craig's Coup
As you can see, the movies['tconst'] identifier shows up in a list in the actors dataframe.
My very slow iteration through the movies dataframe is as follows:
import logging

def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    # create an empty feature
    results['cast'] = ""
    # iterate through the movie identifiers
    for index, value in results['tconst'].items():
        # create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        # check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            # set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        # delete cast df to free up memory
        del cast
    return results
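This generates some results, but it is not fast enough to be useful. One observation is that by creating a new dataframe of all the actors who have the movie identifier in their knownForTitles, that list can be put into a single feature of the movies dataframe.
My attempt to loop through the actors dataframe is below, but I can't seem to append items to the movies dataframe: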
import logging

def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    # create an empty feature
    results['cast'] = ""
    # iterate through all actors
    for index, value in actor_df['knownForTitles'].items():
        # skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        # generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        # iterate through every movie the actor has been in
        for movie in cinemetography:
            # pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            # continue if empty
            if len(movie_info) == 0:
                continue
            # set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            # delete the df to save space ?maybe
            del movie_info
        # logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
So if I run the code above, I get a result very quickly, but the 'cast' field remains empty.
Best answer
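I figured out the problem I was having with the actors_loop(movie_df, actor_df) function. The line
results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
sets the value on a copy of the results dataframe, not on results itself. It is better to use the df.set_value() method or the df.at[] accessor. (df.set_value() was deprecated and later removed in pandas 1.0, so df.at[] is the one to use on current versions.)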
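A minimal sketch of the difference, using a tiny made-up frame (the identifiers and names here are placeholders, not rows from the real data). The chained form is left commented out because it does not reliably write back to the original frame:

```python
import pandas as pd

# Toy data, made up for illustration only.
results = pd.DataFrame({"tconst": ["tt0000001", "tt0000002"], "cast": ["", ""]})

# Chained indexing: the boolean filter returns a new object, so the
# assignment may never reach `results` (pandas usually warns about this).
# results[results["tconst"] == "tt0000002"]["cast"] = "Some Actor"

# A single .loc call with a row mask and a column label writes in place.
results.loc[results["tconst"] == "tt0000002", "cast"] = "Some Actor"

# .at does the same for one cell, addressed by index label and column label.
results.at[0, "cast"] = "Another Actor"

print(results)
```

.at is the faster choice when the row label is already known, which is why the lookup dictionary further down is useful.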
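I also found a much faster solution: rather than iterating over both dataframes in a nested loop, it is better to iterate once. So I created a list of tuples: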
def actor_tuples(actor_df):
    tuples = []
    # one (actor name, movie id) tuple per title in each actor's list
    for index, value in actor_df['knownForTitles'].items():
        cinemetography = [x.strip() for x in value.split(',')]
        for movie in cinemetography:
            tuples.append((actor_df['primaryName'].loc[index], movie))
    return tuples
This created a list of tuples of the following form:
[('Fred Astaire', 'tt0043044'),
('Lauren Bacall', 'tt0117057')]
Then I created a dictionary mapping each movie identifier to its index position in the movies dataframe, of the following form:
{'tt0000009': 0,
'tt0000147': 1,
'tt0000335': 2,
'tt0000502': 3,
'tt0000574': 4,
'tt0000615': 5,
'tt0000630': 6,
'tt0000675': 7,
'tt0000676': 8,
'tt0000679': 9}
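The code that builds this dictionary is not shown above. One way it could be done, as a sketch that assumes the movies dataframe has a unique tconst column and a default integer index (the name movie_index is my own):

```python
import pandas as pd

# Small stand-in for the movies dataframe.
movies = pd.DataFrame({
    "tconst": ["tt0000009", "tt0000147", "tt0000335"],
    "primaryTitle": ["Miss Jerry", "The Corbett-Fitzsimmons Fight",
                     "Soldiers of the Cross"],
})

# Pair each identifier with its index label; dict lookups are then O(1),
# instead of scanning the whole column for every actor tuple.
movie_index = dict(zip(movies["tconst"], movies.index))

print(movie_index)
```

If tconst values were duplicated, later rows would silently overwrite earlier ones in the dictionary, so it is worth checking uniqueness first.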
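Then I used the function below to iterate over the actor tuples, using the movie identifier as the key into the movie dictionary. This returns the correct movie index, which I use to append the actor's name to the target dataframe: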
import logging

def add_cast(movie_df, Atuples, Mtuples):
    # Atuples: list of (actor name, movie id) tuples
    # Mtuples: dict mapping movie id -> row index in movie_df
    results_df = movie_df.copy()
    results_df['cast'] = ''
    counter = 0
    total = len(Atuples)
    for tup in Atuples:
        # this passes the movie ID into the movie dict, which returns an index
        try:
            movie_index = Mtuples[tup[1]]
            if results_df.at[movie_index, 'cast'] == '':
                results_df.at[movie_index, 'cast'] += tup[0]
            else:
                results_df.at[movie_index, 'cast'] += ',' + tup[0]
        except KeyError:
            # the movie id is not in the movies dataframe, so skip it
            pass
        # logging
        counter += 1
        if counter % 1000000 == 0:
            logging.warning(f'Index {counter} out of {total}, {counter/total:.1%} finished')
    return results_df
It ran through 16.5 million actor tuples in 10 minutes (building the 2 sets of tuples, then running the adding function). The results are below:
0 tt0000009 Miss Jerry 1894 Romance
1 tt0000147 The Corbett-Fitzsimmons Fight 1897 Documentary,News,Sport
2 tt0000335 Soldiers of the Cross 1900 Biography,Drama
3 tt0000502 Bohemios 1905 \N
4 tt0000574 The Story of the Kelly Gang 1906 Biography,Crime,Drama
cast
0 Blanche Bayliss,Alexander Black,William Courte...
1 Bob Fitzsimmons,Enoch J. Rector,John L. Sulliv...
2 Herbert Booth,Joseph Perry,Orrie Perry,Reg Per...
3 Antonio del Pozo,El Mochuelo,Guillermo Perrín,...
4 Bella Cola,Sam Crewes,W.A. Gibson,Millard John...
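For comparison, the same comma-separated cast column can probably be built without any explicit Python loop. This is a different technique from the one above (split, explode, groupby, merge), sketched on made-up actor names and assuming pandas 0.25 or newer for explode; I have not timed it against the full dataset:

```python
import pandas as pd

# Toy stand-ins for the two dataframes; actor names are placeholders.
actors = pd.DataFrame({
    "primaryName": ["Actor A", "Actor B", "Actor C"],
    "knownForTitles": ["tt0000009,tt0000147", "tt0000009", r"\N"],
})
movies = pd.DataFrame({
    "tconst": ["tt0000009", "tt0000147", "tt0000335"],
    "primaryTitle": ["Miss Jerry", "The Corbett-Fitzsimmons Fight",
                     "Soldiers of the Cross"],
})

# One row per (actor, movie id) pair.
pairs = actors.assign(tconst=actors["knownForTitles"].str.split(",")).explode("tconst")
pairs["tconst"] = pairs["tconst"].str.strip()

# Join each movie's actor names into one comma-separated string.
cast = (pairs.groupby("tconst")["primaryName"]
             .agg(",".join)
             .rename("cast")
             .reset_index())

# A left merge keeps every movie; movies with no matching actor get "".
result = movies.merge(cast, on="tconst", how="left")
result["cast"] = result["cast"].fillna("")

print(result)
```

The r"\N" placeholder needs no special handling here, because it never matches a tconst in the merge.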
Thanks, Stack Overflow!
Source: the Stack Overflow question "python - How to append a list of elements into a single feature of a dataframe?": https://stackoverflow.com/questions/56740420/