python - How do I stop apply() from changing the order of columns?

Tags: python pandas

I have a reproducible example, a toy dataframe:

import pandas as pd

df = pd.DataFrame({'my_customers':['John','Foo'],'email':['email@gmail.com','othermail@yahoo.com'],'other_column':['yes','no']})

print(df)

  my_customers                email other_column
0         John      email@gmail.com          yes
1          Foo  othermail@yahoo.com           no

I apply() a function to the rows, and the function creates a new column:

def func(row):

    # if this column is 'yes'
    if row['other_column'] == 'yes':

        # create a new column with 'Hello' in it        
        row['new_column'] = 'Hello' 

        # return to df
        return row 

    # otherwise
    else: 

        # just return the row
        return row

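Then I apply the function to df, and we can see that the order has changed. The columns are now in alphabetical order. Is there any way to avoid this? I want to keep the original order.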

df = df.apply(func, axis = 1)
print(df)

                 email my_customers new_column other_column
0      email@gmail.com         John      Hello          yes
1  othermail@yahoo.com          Foo        NaN           no

Edit to clarify - the code above was too simple

Input

df = pd.DataFrame({'my_customers':['John','Foo'],
                   'email':['email@gmail.com','othermail@yahoo.com'],
                   'api_status':['data found','no data found'],
                   'api_response':['huge json','huge json']})

  my_customers                email     api_status api_response
0         John      email@gmail.com     data found    huge json
1          Foo  othermail@yahoo.com  no data found    huge json

Parsing the api_response. I need to create many new columns in the DF:

def api_parse(row):

    # if we have response data

    if row['api_response'] == 'huge json':  # 'huge json' stands in for the real payload

        # get response for parsing

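        response_data = row['api_response']

        # default to an empty list so the social-profile filters below
        # still work when the response has no 'urls' section
        urls = []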

        """Let's get associated URLS first"""

        # if there's a URL section in the response

        if 'urls' in response_data:

            # get all associated URLS into a list

            urls = extract_values(response_data['urls'], 'url')

            row['Associated_Urls'] = urls


        """Get a list of jobs"""

        if 'jobs' in response_data:

            # get all associated jobs and organizations into a list

            titles = extract_values(response_data['jobs'], 'title')
            organizations = extract_values(response_data['jobs'], 'organization')

            counter = 1

            # create a new column for each job

            for pair in zip(titles,organizations):

                row['Job'+'_'+str(counter)] = f'Title: {pair[0]}, Organization: {pair[1]}'

                counter +=1


        """Get a list of education"""

        if 'educations' in response_data:

            # get all degrees into list

            degrees = extract_values(response_data['educations'], 'display')

            counter = 1

            # create a new column for each degree

            for edu in degrees:

                row['education'+'_'+str(counter)] = edu

                counter +=1


        """Get a list of social profiles from URLS we parsed earlier"""

        facebook = [i for i in urls if 'facebook' in i] or [np.nan]
        instagram = [i for i in urls if 'instagram' in i] or [np.nan]
        linkedin = [i for i in urls if 'linkedin' in i] or [np.nan]
        twitter = [i for i in urls if 'twitter' in i] or [np.nan]
        amazon = [i for i in urls if 'amazon' in i] or [np.nan]

        row['facebook'] = facebook
        row['instagram'] = instagram
        row['linkedin'] = linkedin
        row['twitter'] = twitter
        row['amazon'] = amazon

        return row 

    elif row['api_status'] == 'no data found':
        # do nothing
        return row

Expected output:

  my_customers                email     api_status api_response job_1 job_2  \
0         John      email@gmail.com     data found    huge json   xyz  xyz2   
1          Foo  othermail@yahoo.com  no data found    huge json   nan  nan

  education_1  facebook other api info  
0         foo  profile1            etc  
1         nan  nan                 nan

Best answer

After running the apply function, you can adjust the column order of the DataFrame. For example:

df = df.apply(func, axis = 1)
df = df[['my_customers', 'email', 'other_column', 'new_column']]

To cut down on repetition (i.e. having to retype all the column names), you can capture the existing columns before calling the apply function:

columns = list(df.columns)
df = df.apply(func, axis = 1)
df = df[columns + ['new_column']]
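For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the toy data from the question. One assumption of mine: func works on a copy of the row rather than mutating the row that apply() hands over, which is not how the question wrote it.

```python
import pandas as pd

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'email': ['email@gmail.com', 'othermail@yahoo.com'],
    'other_column': ['yes', 'no'],
})

def func(row):
    # Assumption: copy first so the row passed in by apply() is left untouched.
    row = row.copy()
    if row['other_column'] == 'yes':
        row['new_column'] = 'Hello'
    return row

columns = list(df.columns)          # original order, captured before apply()
result = df.apply(func, axis=1)     # column order of this frame is up to pandas
result = result[columns + ['new_column']]

print(list(result.columns))
```

Whatever order apply() returns, the final selection puts the original columns first and new_column last.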

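Update based on the author's edit to the original question. I'm not sure the chosen data structure (storing API results in a dataframe) is the best choice, but one simple solution may be to work out which columns are new after calling the apply function.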

# Store the existing columns before calling apply
existing_columns = list(df.columns)

df = df.apply(func, axis = 1)

all_columns = list(df.columns)
new_columns = [column for column in all_columns if column not in existing_columns]

df = df[existing_columns + new_columns]
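If you need this in more than one place, the same steps could be wrapped in a small helper. This is only a sketch: apply_keep_order and add_jobs are names I made up, and add_jobs is a hypothetical stand-in for api_parse that adds a different number of columns per row.

```python
import pandas as pd

def apply_keep_order(df, func):
    """Apply func row-wise, then list original columns first and new ones after."""
    existing_columns = list(df.columns)
    result = df.apply(func, axis=1)
    new_columns = [c for c in result.columns if c not in existing_columns]
    return result[existing_columns + new_columns]

def add_jobs(row):
    # Hypothetical parser: one new column per comma-separated job title.
    row = row.copy()
    jobs = [job for job in row['jobs'].split(',') if job]
    for counter, job in enumerate(jobs, start=1):
        row[f'Job_{counter}'] = job
    return row

df = pd.DataFrame({
    'my_customers': ['John', 'Foo'],
    'jobs': ['engineer,manager', ''],
})

out = apply_keep_order(df, add_jobs)
print(list(out.columns))
```

Rows that produce no new columns simply end up with NaN in the added ones.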

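To optimise performance, you can test membership against a set instead of a list; because Python sets are hash-based, lookups take constant time on average. This would change existing_columns = list(df.columns) to existing_columns = set(df.columns). Note that a set is unordered and cannot be concatenated with a list, so keep the ordered list of original columns as well for the final reordering.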
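A rough sketch of how the list and the set can work together, with made-up column names. The result frame here is only a stand-in for what apply() would return, built with assign() and shuffled on purpose.

```python
import pandas as pd

df = pd.DataFrame({'b': [1], 'a': [2]})

ordered_columns = list(df.columns)   # keeps the original order
seen = set(ordered_columns)          # fast membership tests only

# Stand-in for the frame returned by apply(): two extra columns, shuffled.
result = df.assign(z=0, c=0)[['a', 'c', 'b', 'z']]

new_columns = [c for c in result.columns if c not in seen]
result = result[ordered_columns + new_columns]

print(list(result.columns))
```

For a handful of columns the speed difference is negligible; it only starts to matter when the frame has many columns.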


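Finally, as @Parfait very kindly pointed out in their comment, the code above may raise some deprecation warnings. Using pandas.DataFrame.reindex instead of selecting with a list of column names will make the warnings go away: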

new_columns_order = existing_columns + new_columns
df = df.reindex(columns=new_columns_order)
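To illustrate how the two differ, here is a small sketch assuming a current pandas version. The label not_there and the sample values are invented for the example. reindex accepts a label that is missing from the frame and fills it with NaN, while list indexing raises KeyError for it.

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['john@example.com'],
    'my_customers': ['John'],
    'new_column': ['Hello'],
})
order = ['my_customers', 'email', 'new_column', 'not_there']

# reindex tolerates the missing label and creates an all-NaN column for it.
reordered = df.reindex(columns=order)
print(list(reordered.columns))

# Plain list indexing is strict about missing labels.
raised = False
try:
    df[order]
except KeyError:
    raised = True
print(raised)
```

So reindex is the more forgiving choice when some of the expected columns may never have been created.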

Regarding "python - How do I stop apply() from changing the order of columns?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57662117/
