python - Check whether string values in a dataframe column start with a string element of a tuple (other than str.startswith)

Tags: python pandas dataframe optimization startswith

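I have a pandas dataframe column with random values ("457645", "458762496", "1113423453", ...), and I need to check whether each value starts with one of the elements of a tuple ("323", "229", "111").

In this example, "1113423453" should match.

I tried df[column].str.startswith(tuple), which works fine; but with a lot of data (2M df rows and 3K tuple elements) it becomes much slower (about 28 seconds) than with 10K df rows and 3K tuple elements (1.47 seconds).

Is there a more efficient way?

Best answer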

I have tried df[column].str.startswith(tuple), which works fine … but I'm searching for a more efficient way to do it, if possible

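Since startswith() is not optimized for a large number of prefix strings and simply searches through them linearly, a binary search may be more efficient here. For that, the prefixes need to be sorted. Note that the lookup checks only the single closest prefix in sort order, so it assumes that no prefix is itself the beginning of another prefix.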

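from bisect import bisect_right

s = sorted(prefixes)  # prefixes: your tuple of prefix strings
# Compare each value only with the nearest prefix that sorts at or before it.
df[column].apply(lambda value: value.startswith(s[bisect_right(s, value) - 1]))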
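To see the lookup in isolation, here is a minimal sketch on plain Python strings, using the sample values from the question. The helper name `starts_with_any` is hypothetical, and the explicit guard for an index of 0 is an addition for readability rather than part of the original answer.

```python
from bisect import bisect_right

def starts_with_any(value, sorted_prefixes):
    """Check value against the nearest prefix that sorts at or before it.

    Assumes sorted_prefixes is sorted and that no prefix is the
    beginning of another prefix.
    """
    i = bisect_right(sorted_prefixes, value)
    if i == 0:
        # Every prefix sorts after value, so none can be its prefix.
        return False
    return value.startswith(sorted_prefixes[i - 1])

prefixes = sorted(("323", "229", "111"))
values = ["457645", "458762496", "1113423453"]
flags = [starts_with_any(v, prefixes) for v in values]
print(flags)
```

Each lookup costs one binary search plus one startswith() call, instead of one startswith() attempt per prefix.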

Is it possible to extract the prefix into a new column of the dataframe?

Yes, e.g. with this function:

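def matching_prefix(value):
    # s is the sorted prefix list from above; returns None when nothing matches
    prefix = s[bisect_right(s, value) - 1]
    if value.startswith(prefix):
        return prefix
    return None

df['new column'] = df[column].apply(matching_prefix)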
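One caveat: the bisect lookup inspects a single candidate, so if the tuple contains both "11" and "112", a value such as "1131" lands next to "112" and the match on "11" is missed. If your prefixes can overlap like this, a different technique may be safer. The following is a sketch of a set-based lookup, not the answer's method; `find_prefix`, `prefix_set` and `lengths` are hypothetical names, and the sample data is made up for illustration.

```python
def find_prefix(value, prefix_set, lengths):
    """Return the longest prefix in prefix_set that value starts with, else None."""
    for n in lengths:  # lengths are sorted longest first
        candidate = value[:n]
        if candidate in prefix_set:
            return candidate
    return None

prefixes = ("11", "112", "229")
prefix_set = set(prefixes)
# One set lookup per distinct prefix length, not one per prefix.
lengths = sorted({len(p) for p in prefixes}, reverse=True)

values = ["1131", "1129", "457645"]
found = [find_prefix(v, prefix_set, lengths) for v in values]
print(found)
```

The cost per value depends on the number of distinct prefix lengths rather than the number of prefixes, which tends to be small when the prefixes are short digit strings. It can be applied to the column in the same way, via df[column].apply(...).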

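Regarding "python - Check whether string values in a dataframe column start with a string element of a tuple (other than str.startswith)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58305553/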
