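Data structure:
A pandas DataFrame (business_df) with a column (categories) whose values are lists of categories, plus a list of restaurant categories (restaurant_categories_list).
What I am trying to do:
Filter the businesses in business_df on the categories column (which holds lists), classifying a business as a restaurant if at least one of its listed categories matches at least one restaurant category.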
I checked these two questions, but they do not answer my problem:
Filter dataframe rows if value in column is in a set list of values
use a list of values to select rows from a pandas dataframe
This is the code I am currently running:
restaurant_categories_list = ['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']
print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])
This is the column I am interested in:
0 ['Fast Food', 'Restaurants']
1 ['Nightlife']
2 ['Auto Repair', 'Automotive']
3 ['Active Life', 'Mini Golf', 'Golf']
4 ['Shopping', 'Home Services', 'Internet Servic...
5 ['Bars', 'American (New)', 'Nightlife', 'Loung...
6 ['Active Life', 'Trainers', 'Fitness & Instruc...
7 ['Bars', 'American (Traditional)', 'Nightlife'...
8 ['Auto Repair', 'Automotive', 'Tires']
9 ['Active Life', 'Mini Golf']
10 ['Home Services', 'Contractors']
11 ['Veterinarians', 'Pets']
12 ['Libraries', 'Public Services & Government']
13 ['Automotive', 'Auto Parts & Supplies']
14 ['Burgers', 'Breakfast & Brunch', 'American (T...
So, if I were working with only these rows, my expected DataFrame would contain just rows 0 and 14.
Best answer
UPDATE:
This version uses ast.literal_eval() to deserialize the lists from their string form, and it seems to work properly:
import ast
import pandas as pd
restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants', 'Burgers', 'Breakfast & Brunch']
df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)
# Drop rows that have no categories at all
df = df_orig[pd.notnull(df_orig['categories'])]
# Parse each string into a list, expand to one column per category, then flag rows with at least one match
mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
# Show only the categories column of the matching rows
print(df.loc[mask, ['categories']])
df[mask].to_csv('result.csv', index=False)
But as @CorleyBrigman has already said, working with data structures like this in pandas is very difficult and very inefficient...
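One possible way to reduce that overhead is to skip the wide intermediate DataFrame and test each parsed list against a set instead. The following is only a minimal sketch of that idea, not the approach tested above: the small inline DataFrame is a hypothetical stand-in for the CSV, and the names restaurant_categories, parsed and result are introduced here for illustration.

```python
import ast
import pandas as pd

# A set gives fast membership tests; same categories as in the question
restaurant_categories = {'Soup', 'Sandwiches', 'Salad', 'Restaurants',
                         'Burgers', 'Breakfast & Brunch'}

# Hypothetical stand-in for the CSV: each cell holds the string form of a list
df = pd.DataFrame({'categories': [
    "['Fast Food', 'Restaurants']",
    "['Nightlife']",
    None,
    "['Burgers', 'Breakfast & Brunch', 'American']",
]})

# Drop rows without categories, then parse the strings into real lists
df = df[df['categories'].notnull()]
parsed = df['categories'].apply(ast.literal_eval)

# A row matches when its list shares at least one element with the set
mask = parsed.apply(lambda cats: not restaurant_categories.isdisjoint(cats))
result = df[mask]
print(result)
```

This still calls a Python function per row, so it is not vectorized, but it avoids building one column per category.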
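OLD answer, based on the sample data:
You can convert the lists into columns/Series and then use the DataFrame.isin() method to produce a matrix of True/False values, which can be summed along each row (because in Python False == 0 and True == 1):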
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])
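To try this without the full dataset, here is a small self-contained sketch. It assumes a DataFrame with a business column that already holds real Python lists; the four rows are a shortened, hand-built subset of the sample data, and the variable wide is a name introduced here.

```python
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

# Assumed sample: a shortened subset of the rows shown in the question
df = pd.DataFrame({'business': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Auto Repair', 'Automotive'],
    ['Burgers', 'Breakfast & Brunch', 'American'],
]})

# One column per list position; shorter lists are padded with NaN
wide = df['business'].apply(pd.Series)

# Count the matches in each row and keep rows with at least one
mask = wide.isin(restaurant_categories_list).sum(axis=1) > 0
print(df[mask])
```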
Explanation:
print(df['business'].apply(pd.Series))
    0              1                             2                  3
0   Fast Food      Restaurants                   NaN                NaN
1   Nightlife      NaN                           NaN                NaN
2   Auto Repair    Automotive                    NaN                NaN
3   Active Life    Mini Golf                     Golf               NaN
4   Shopping       Home Services                 Internet Servic    NaN
5   Bars           American (New)                Nightlife          Loung
6   Active Life    Trainers                      Fitness & Instruc  NaN
7   Bars           American (Traditional)        Nightlife          NaN
8   Auto Repair    Automotive                    Tires              NaN
9   Active Life    Mini Golf                     NaN                NaN
10  Home Services  Contractors                   NaN                NaN
11  Veterinarians  Pets                          NaN                NaN
12  Libraries      Public Services & Government  NaN                NaN
13  Automotive     Auto Parts & Supplies         NaN                NaN
14  Burgers        Breakfast & Brunch            American           NaN
Then:
print(df['business'].apply(pd.Series).isin(restaurant_categories_list))
Output:
    0      1      2      3
0   False  True   False  False
1   False  False  False  False
2   False  False  False  False
3   False  False  False  False
4   False  False  False  False
5   False  False  False  False
6   False  False  False  False
7   False  False  False  False
8   False  False  False  False
9   False  False  False  False
10  False  False  False  False
11  False  False  False  False
12  False  False  False  False
13  False  False  False  False
14  True   True   False  False
Then:
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)
Output:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
dtype: bool
Finally:
print(df[(mask)])
Output:
business
0 [Fast Food, Restaurants]
14 [Burgers, Breakfast & Brunch, American]
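On newer pandas versions (0.25 and later), Series.explode() offers another route to the same mask without creating one column per list position. This is a sketch of that alternative rather than part of the answer above; the sample rows are an assumed subset of the question's data, and exploded is a name introduced here.

```python
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

# Assumed sample: a shortened subset of the rows shown in the question
df = pd.DataFrame({'business': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Auto Repair', 'Automotive'],
    ['Burgers', 'Breakfast & Brunch', 'American'],
]})

# One row per category; the original row label is repeated in the index
exploded = df['business'].explode()

# Test every category, then collapse back to one boolean per original row
mask = exploded.isin(restaurant_categories_list).groupby(level=0).any()
print(df[mask])
```

Grouping on index level 0 relies on the row labels being unique, which holds for the default RangeIndex used here.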
Regarding "python - filtering DataFrame rows that are lists", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35971335/