python - Pandas: remove duplicates whose rows hold partially complete data, and merge the data

Tags: python pandas

I have a DataFrame with duplicate IDs, where the data for each ID is only partially filled in and is spread across several rows.

import numpy as np
import pandas as pd

df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan, np.nan],
               [1234, 'Customer A', np.nan, '333 Street', np.nan],
               [1234, 'Customer A', '12345 Street', np.nan, np.nan],
               [1234, 'Customer A', np.nan, np.nan, np.nan],
               [1233, 'Customer B', '444 Street', '3335 Street', np.nan],
               [1233, 'Customer B', '555 Street', '666 Street', np.nan],
               [1233, 'Customer B', '553 Street', '666 Street', 'abc@email.com'],
               [1235, 'Customer C', '1553 Street', '644 Street', 'abc@email.com'],
               [1235, 'Customer C', '2553 Street', '644 Street', 'abc@email.com']],
               columns=['ID', 'Customer', 'Billing Address', 'Shipping Address', 'Contact'])


df
    ID      Customer    Billing Address Shipping Address    Contact
0   1234    Customer A  123 Street      NaN                 NaN
1   1234    Customer A  NaN             333 Street          NaN
2   1234    Customer A  12345 Street    NaN                 NaN
3   1234    Customer A  NaN             NaN                 NaN
4   1233    Customer B  444 Street      3335 Street         NaN
5   1233    Customer B  555 Street      666 Street          NaN
6   1233    Customer B  553 Street      666 Street          abc@email.com
7   1235    Customer C  1553 Street     644 Street          abc@email.com
8   1235    Customer C  2553 Street     644 Street          abc@email.com

I want to keep all of the data, creating new columns wherever additional values exist (the expected result was shown as an image in the original post).

I tried the following, but it drops data that I want to keep.

df.drop_duplicates(subset=['ID'], inplace=True)
df

    ID      Customer    Billing Address Shipping Address    Contact
0   1234    Customer A  123 Street      NaN                 NaN
4   1233    Customer B  444 Street      3335 Street         NaN
7   1235    Customer C  1553 Street     644 Street          abc@email.com

Edit: I added more data, because it was not clear from the original post that an ID can appear in multiple rows.

Best answer

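Here is one approach: use apply to create the new columns, building a pd.Series from a dict for each group. The output below has only two rows per ID, so it appears to have been produced from the sample data as it stood before the question's edit.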

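In [1057]: cols = ['Billing Address', 'Shipping Address']

In [1058]: # orient is spelled out as 'records'; newer pandas versions reject the 'r' shorthand
      ...: (df.groupby(['ID', 'Customer'])
      ...:    .apply(lambda g: pd.Series({'%s %s' % (x, i+1): v[x]
      ...:           for i, v in enumerate(g[cols].to_dict('records'))
      ...:           for x in v})))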
Out[1058]:
                Billing Address 1 Billing Address 2 Shipping Address 1  \
ID   Customer
1233 Customer B        444 Street        555 Street         333 Street
1234 Customer A        123 Street               NaN                NaN

                Shipping Address 2
ID   Customer
1233 Customer B         666 Street
1234 Customer A         333 Street
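
The edited question has a different number of rows per ID, so here is a minimal sketch of an alternative that does not depend on every group having the same size. It is a swapped-in technique (groupby().cumcount() followed by unstack), not the answer's apply-based method. The names keys, n and wide are introduced only for illustration, and the sample is a reduced version of the question's data.

```python
import numpy as np
import pandas as pd

# Reduced sample with unequal group sizes (3 rows for 1234, 2 rows for 1233).
df = pd.DataFrame(
    [[1234, 'Customer A', '123 Street', np.nan],
     [1234, 'Customer A', np.nan, '333 Street'],
     [1234, 'Customer A', '12345 Street', np.nan],
     [1233, 'Customer B', '444 Street', '3335 Street'],
     [1233, 'Customer B', '555 Street', '666 Street']],
    columns=['ID', 'Customer', 'Billing Address', 'Shipping Address'])

keys = ['ID', 'Customer']                       # hypothetical name: grouping columns
cols = ['Billing Address', 'Shipping Address']  # columns to spread out

wide = (df.assign(n=df.groupby(keys).cumcount() + 1)  # 1, 2, 3, ... within each group
          .set_index(keys + ['n'])[cols]
          .unstack('n'))                               # one column per (field, n) pair

# Flatten the (field, n) column MultiIndex into 'Billing Address 1', ...
wide.columns = [f'{field} {n}' for field, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Groups with fewer rows simply get NaN in the higher-numbered columns. Rows that are empty in every address column could be dropped beforehand with df.dropna(subset=cols, how='all') if they should not consume a number.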

Regarding python - Pandas: remove duplicates whose rows hold partially complete data, and merge the data, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45842932/
