I have a reproducible example, a toy DataFrame:
import pandas as pd

df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'other_column': ['yes', 'no']})
print(df)
  my_customers                email other_column
0         John      email@gmail.com          yes
1          Foo  othermail@yahoo.com           no
I apply() a function to the rows and create a new column inside the function:
def func(row):
    # if this column is 'yes'
    if row['other_column'] == 'yes':
        # create a new column with 'Hello' in it
        row['new_column'] = 'Hello'
        # return to df
        return row
    # otherwise
    else:
        # just return the row
        return row
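Then I apply the function to df, and we can see that the order has changed: the columns are now in alphabetical order. Is there any way to avoid this? I want to keep the original order.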
df = df.apply(func, axis = 1)
print(df)
                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no
Edit to clarify: the code above is too simple.
Input:
df = pd.DataFrame({'my_customers': ['John', 'Foo'],
                   'email': ['email@gmail.com', 'othermail@yahoo.com'],
                   'api_status': ['data found', 'no data found'],
                   'api_response': ['huge json', 'huge json']})
  my_customers                email     api_status api_response
0         John      email@gmail.com     data found    huge json
1          Foo  othermail@yahoo.com  no data found    huge json
Parsing api_response. I need to create many new columns in the DF:
import numpy as np

# extract_values is the asker's own helper; it is not shown in the question
def api_parse(row):
    # if we have response data
    if row['api_status'] == 'data found':
        # get response for parsing
        response_data = row['api_response']
        # stays empty when the response has no URL section
        urls = []

        # Let's get associated URLs first
        # if there's a URL section in the response
        if 'urls' in response_data:
            # get all associated URLs into a list
            urls = extract_values(response_data['urls'], 'url')
            row['Associated_Urls'] = urls

        # Get a list of jobs
        if 'jobs' in response_data:
            # get all associated jobs and organizations into a list
            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')
            counter = 1
            # create a new column for each job
            for pair in zip(titles, organizations):
                row['Job' + '_' + str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'
                counter += 1

        # Get a list of education
        if 'educations' in response_data:
            # get all degrees into a list
            degrees = extract_values(response_data['educations'], 'display')
            counter = 1
            # create a new column for each degree
            for edu in degrees:
                row['education' + '_' + str(counter)] = edu
                counter += 1

        # Get a list of social profiles from the URLs we parsed earlier
        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]
        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon
        return row
    elif row['api_status'] == 'no data found':
        # do nothing
        return row
Expected output:
  my_customers                email     api_status api_response job_1 job_2  \
0         John      email@gmail.com     data found    huge json   xyz  xyz2
1          Foo  othermail@yahoo.com  no data found    huge json   nan   nan

  education_1  facebook other api info
0         foo  profile1            etc
1         nan       nan            nan
Best answer
After running the apply function, you can adjust the column order of the DataFrame. For example:
df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]
To reduce repetition (i.e. having to retype all the column names), you can capture the existing columns before calling the apply function:
columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
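Put together with the question's toy data, the approach above might look like the following minimal sketch. One assumption that is not in the original: func works on a copy of the row, so the Series that apply() passes in is not mutated.

```python
import pandas as pd

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['email@gmail.com', 'othermail@yahoo.com'],
    'other_column': ['yes', 'no'],
})

def func(row):
    # Assumption: copy first so the row handed in by apply() stays untouched.
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    return row

# Remember the original order before apply() has a chance to change it.
columns = list(df.columns)
result = df.apply(func, axis=1)

# Select the columns back in the original order, with the new one at the end.
result = result[columns + ['new_column']]
print(list(result.columns))
# ['my_customers', 'email', 'other_column', 'new_column']
```

The final selection only works if at least one row actually received new_column; otherwise that label does not exist and the selection raises a KeyError.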
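Update based on the author's edit to the original question. Although I am not sure the chosen data structure (storing API results in a DataFrame) is the best choice, one simple solution might be to work out which columns are new after calling the apply function.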
# Store the existing columns before calling apply
existing_columns = list(df.columns)
df = df.apply(func, axis = 1)
all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]
df = df[existing_columns + new_columns]
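To optimize performance, you can store the existing columns in a set instead of a list. Because Python sets are hash-based, membership lookups take constant time on average. This changes existing_columns = list(df.columns) to existing_columns = set(df.columns). A set has no defined order, though, so keep a separate ordered list of the original columns for the final reordering step.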
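As a rough sketch of how the set-based lookup and the ordered list can coexist, the helper below keeps both. The names apply_keeping_order and add_greeting are hypothetical and chosen only for illustration; they are not part of pandas or of the original answer.

```python
import pandas as pd

def apply_keeping_order(df, func):
    """Hypothetical helper: row-wise apply, then original columns first."""
    ordered_existing = list(df.columns)      # preserves the original order
    existing_lookup = set(ordered_existing)  # hash-based membership tests
    result = df.apply(func, axis=1)
    # New columns keep whatever relative order apply() returned them in.
    new_columns = [c for c in result.columns if c not in existing_lookup]
    return result[ordered_existing + new_columns]

def add_greeting(row):
    # Assumption: copy the row rather than mutating apply()'s input.
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    return row

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['email@gmail.com', 'othermail@yahoo.com'],
    'other_column': ['yes', 'no'],
})
ordered = apply_keeping_order(df, add_greeting)
print(list(ordered.columns))
```

For a frame with only a handful of columns the set makes no measurable difference; it starts to matter when the function creates a large number of new columns.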
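Finally, as @Parfait very kindly pointed out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of the list-based selection df = df[...] will make the warnings go away: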
new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
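To show what reindex does in isolation, here is a small sketch that starts from a frame whose columns are already in alphabetical order, as in the question's output. The label not_there is made up for the example; it illustrates that reindex fills a missing column with NaN instead of raising an error.

```python
import pandas as pd

# Columns in the alphabetical order that apply() produced in the question.
shuffled = pd.DataFrame({
    'email': ['email@gmail.com', 'othermail@yahoo.com'],
    'my_customers': ['John', 'Foo'],
    'new_column': ['Hello', None],
    'other_column': ['yes', 'no'],
})

wanted = ['my_customers', 'email', 'other_column', 'new_column']

# reindex returns a new frame with the columns in the requested order.
reordered = shuffled.reindex(columns=wanted)

# A label that does not exist becomes an all-NaN column rather than an error.
with_missing = shuffled.reindex(columns=wanted + ['not_there'])
print(list(reordered.columns))
print(with_missing['not_there'].isna().all())
```

That tolerance is convenient when not every row produces every new column, but it also means a misspelled column name silently yields an empty column.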
Regarding "python - How to stop apply() from changing the order of columns?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57662117/