The dataset I am using looks like this. It is a video captioning dataset, with the captions in the Description column.
Video_ID Description
mv89psg6zh4 A bird is bathing in a sink.
mv89psg6zh4 A faucet is running while a bird stands.
mv89psg6zh4 A bird gets washed.
mv89psg6zh4 A parakeet is taking a shower in a sink.
mv89psg6zh4 The bird is taking a bath under the faucet.
mv89psg6zh4 A bird is standing in a sink drinking water.
R2DvpPTfl-E PLAYING GAME ON LAPTOP.
R2DvpPTfl-E THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU A woman is pouring ingredients into a bowl.
l7x8uIdg2XU A woman is adding milk to some pasta.
l7x8uIdg2XU A person adds ingredients to pasta.
l7x8uIdg2XU the girls are doing the cooking.
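However, the number of captions per video is not uniform.
I want to extract one row for each unique Video_ID and combine those rows into a new DataFrame, while removing the same rows from the existing DataFrame.
The result I want should look like this:
DataFrame 1: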
Video_ID Description
mv89psg6zh4 A faucet is running while a bird stands.
mv89psg6zh4 A bird gets washed.
mv89psg6zh4 A parakeet is taking a shower in a sink.
mv89psg6zh4 The bird is taking a bath under the faucet.
mv89psg6zh4 A bird is standing in a sink drinking water.
R2DvpPTfl-E THE MAN IS WATCHING LAPTOP.
l7x8uIdg2XU A woman is adding milk to some pasta.
l7x8uIdg2XU A person adds ingredients to pasta.
l7x8uIdg2XU the girls are doing the cooking.
DataFrame 2:
Video_ID Description
mv89psg6zh4 A bird is bathing in a sink.
R2DvpPTfl-E PLAYING GAME ON LAPTOP.
l7x8uIdg2XU A woman is pouring ingredients into a bowl.
So the rows are essentially moved out of the existing DataFrame to form a new one.
Best answer
You can use groupby() to sample the index:
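# sample one index label per Video_ID; the values of s are the sampled labels
s = df.index.to_series().groupby(df['Video_ID']).apply(lambda x: x.sample(n=1))
# one random row per video (the new DataFrame)
df2 = df.loc[s]
# the rest of the data (the existing DataFrame without the sampled rows)
df1 = df.drop(s)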
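The snippet above picks a random row per video, whereas the expected output in the question happens to show the first row of each video. As a rough, self-contained sketch of both variants on the sample data, the following assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`) and a unique index. The names `picked_random`, `rest_random`, `picked_first`, and `rest_first` are illustrative only, and `random_state=0` is an arbitrary seed added to make the split repeatable.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Video_ID": ["mv89psg6zh4"] * 6 + ["R2DvpPTfl-E"] * 2 + ["l7x8uIdg2XU"] * 4,
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands.",
        "A bird gets washed.",
        "A parakeet is taking a shower in a sink.",
        "The bird is taking a bath under the faucet.",
        "A bird is standing in a sink drinking water.",
        "PLAYING GAME ON LAPTOP.",
        "THE MAN IS WATCHING LAPTOP.",
        "A woman is pouring ingredients into a bowl.",
        "A woman is adding milk to some pasta.",
        "A person adds ingredients to pasta.",
        "the girls are doing the cooking.",
    ],
})

# Random variant: one row per Video_ID. The seed is an assumption made
# here for repeatability; drop it to get a different pick on each run.
picked_random = df.groupby("Video_ID").sample(n=1, random_state=0)
rest_random = df.drop(picked_random.index)

# Deterministic variant: the first row of each Video_ID, which matches
# the expected output shown in the question.
is_first = ~df.duplicated("Video_ID")
picked_first = df[is_first]
rest_first = df[~is_first]

print(picked_first)
print(rest_first)
```

Both variants rely on the index labels being unique, since the leftover rows are identified by index (or by a boolean mask aligned on it). If the index has duplicates, calling `df.reset_index(drop=True)` first is one way to avoid dropping more rows than intended.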
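Regarding "python - How to select one row for each distinct value of a particular column and merge to form a new dataframe in Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/60891828/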