I have data like this:
| Term | Value|
| -------- | -----|
| Apple | 100 |
| Appel | 50 |
| Banana | 200 |
| Banan | 25 |
| Orange | 140 |
| Pear | 75 |
| Lapel | 10 |
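Currently, I am using the following code:
import difflib

# Assumption: `terms` holds the values of the "Term" column
terms = df["Term"].tolist()
matches = []
for term in terms:
    tlist = difflib.get_close_matches(term, terms, cutoff=0.80, n=5)
    matches.append(tlist)
df["terms"] = matches
The output looks like this: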
| Term | Value|
| --------------------- | -----|
| [Apple, Appel] | 100 |
| [Appel, Apple, Lapel] | 50 |
| [Banana, Banan] | 200 |
| [Banan, Banana] | 25 |
| [Orange] | 140 |
| [Pear] | 75 |
| [Lapel, Appel] | 10 |
This output is not much help. What I want is this:
| Term | Value|
| -------- | -----|
| Apple | 150 |
| Banana | 225 |
| Orange | 140 |
| Pear | 75 |
| Lapel | 10 |
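The main problem is that the lists come back in different orders, and usually only one or two words overlap between them. For example, I might have
- [Apple, Appel]
- [Appel, Apple, Lapel]
Ideally, both of these would return "Apple", since it has the highest value among the overlapping terms.
Is there a way to do this?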
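Best answer
A simple way to achieve this is to use the difflib module from the Python standard library, which provides helpers for computing deltas, as follows: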
from difflib import SequenceMatcher

import pandas as pd

# Toy dataframe
df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)

# Canonical spellings that every term gets mapped onto
KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")

for i, row in df.copy().iterrows():
    # Get the similarity ratio between the value in the "Term" column
    # and each term from KEY_TERMS, and store the pairs "term: ratio" in a dict
    similarities = {
        term: SequenceMatcher(None, row["Term"], term).ratio() for term in KEY_TERMS
    }
    # Find the key term for which the similarity ratio is maximum
    # and use it to replace the original term in the dataframe
    df.loc[i, "Term"] = max(similarities, key=similarities.get)

# Group by term and sum values
df = df.groupby("Term").agg("sum").reset_index()
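To see what the loop above is comparing, here is a minimal sketch that prints the ratios for two of the misspelled terms. The helper names `similarity_to_key_terms` and `closest_key_term` are made up for illustration and are not part of difflib; only `SequenceMatcher(...).ratio()` is.

```python
from difflib import SequenceMatcher

KEY_TERMS = ("Apple", "Banana", "Orange", "Pear")


def similarity_to_key_terms(term, key_terms=KEY_TERMS):
    """Hypothetical helper: map each key term to its similarity ratio with `term`."""
    return {key: SequenceMatcher(None, term, key).ratio() for key in key_terms}


def closest_key_term(term, key_terms=KEY_TERMS):
    """Hypothetical helper: return the key term with the highest ratio."""
    scores = similarity_to_key_terms(term, key_terms)
    return max(scores, key=scores.get)


for term in ("Appel", "Lapel"):
    scores = similarity_to_key_terms(term)
    best = closest_key_term(term)
    # ratio() is 2 * matches / (len(a) + len(b)), a float between 0 and 1
    print(f"{term!r} -> {best!r} (ratio {scores[best]:.2f})")
```

Because `max` always picks some key term, even a weak match such as "Lapel" gets assigned to one of them; there is no minimum ratio in this approach.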
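Then:
print(df)
# Outputs
     Term  Value
0   Apple    160
1  Banana    225
2  Orange    140
3    Pear     75
Note that "Lapel" is not listed in KEY_TERMS, so it is assigned to its closest key term, "Apple" (ratio 0.4). That is why Apple sums to 160 rather than the 150 asked for in the question. Adding "Lapel" to KEY_TERMS keeps it as a separate group.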
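If the key terms are not known in advance, one possible variation (not from the original answer) is to derive them from the data: walk through the terms from highest to lowest Value, and let each term either join an existing representative it closely matches or become a new representative itself. The sketch below assumes a cutoff of 0.8, as in the question; `group_similar_terms` is a hypothetical helper name.

```python
import difflib

import pandas as pd

df = pd.DataFrame(
    {
        "Term": ["Apple", "Appel", "Banana", "Banan", "Orange", "Pear", "Lapel"],
        "Value": [100, 50, 200, 25, 140, 75, 10],
    }
)


def group_similar_terms(frame, cutoff=0.8):
    """Hypothetical helper: fold each term into the highest-valued close match."""
    representatives = []  # terms chosen so far as group labels
    mapping = {}          # original term -> group label
    # Visit high-value terms first so they become the group labels
    for term in frame.sort_values("Value", ascending=False)["Term"]:
        close = difflib.get_close_matches(term, representatives, n=1, cutoff=cutoff)
        if close:
            mapping[term] = close[0]
        else:
            representatives.append(term)
            mapping[term] = term
    # Relabel the terms, then sum values per label (keeping first-seen order)
    relabeled = frame.assign(Term=frame["Term"].map(mapping))
    return relabeled.groupby("Term", sort=False, as_index=False)["Value"].sum()


result = group_similar_terms(df)
print(result)
```

With the toy data this keeps "Lapel" as its own group, because its ratio to every representative is below the cutoff. The result depends on the cutoff and on the visiting order, so it is worth checking the mapping on real data before trusting the sums.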
Regarding "python - How to find similar terms in a dataframe and group them to sum their values?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68774663/