python - How to apply a function with multiple arguments to selected columns of a Pandas data frame

Tags: python pandas

I have the following data frame:

import pandas as pd 
data = {'gene':['a','b','c','d','e'],
        'count':[61,320,34,14,33],
        'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]

It looks like this:

In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111

What I want to do is apply a function:

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm

to the count and gene_length columns, with the constant N=12345, and name the new result "rpkm". But why does the following fail?

N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

What is the correct way to do this? The first row should look like this:

 gene  count  gene_length rpkm
   a     61          152  32508.366

Update: the error I get is:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6270e1d19b89> in <module>()
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL)
     13     theN  == Total reads mapped
     14     """
---> 15     rpkm = float((10**9) * theC)/(theN * theL)
     16     return rpkm

/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
     74             return converter(self.iloc[0])
     75         raise TypeError(
---> 76             "cannot convert the series to {0}".format(str(converter)))
     77     return wrapper
     78

Best answer

Don't cast to float inside your method and it will work fine:

In [9]:
def calculate_RPKM(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df

Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

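The error message tells you that a pandas Series cannot be cast to a float. You could use apply to call your method row by row, but it is better to rewrite the method so that it works on the entire Series: that version is vectorized and much faster than apply, which is essentially a for loop.

Timings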
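To make the difference concrete, here is a minimal sketch of the three cases described above: the failing cast, the row-wise call through apply, and the vectorized expression. The helper name `calculate_rpkm_scalar` and the variable names are hypothetical and chosen only for this illustration; the data are the ones from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345


def calculate_rpkm_scalar(count, total_reads, length):
    # Hypothetical helper: float() is fine here only because every
    # argument is a single number, not a Series.
    return float(10**9 * count) / (total_reads * length)


# 1) float() on a Series with more than one element raises TypeError.
try:
    float(df["count"])
    cast_failed = False
except TypeError:
    cast_failed = True

# 2) Row-wise: with axis=1, apply passes one row at a time, so the
#    function only ever sees scalars. This is effectively a Python loop.
rowwise = df[["count", "gene_length"]].apply(
    lambda row: calculate_rpkm_scalar(row["count"], N, row["gene_length"]),
    axis=1,
)

# 3) Vectorized: the arithmetic is applied to whole columns at once.
vectorized = (10**9 * df["count"]) / (N * df["gene_length"])
```

Both `rowwise` and `vectorized` hold the same values; only the way they are computed differs.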

In [11]:

def calculate_RPKM1(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
N=12345

%timeit calculate_RPKM1(df['count'],N,df['gene_length'])
%timeit df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)

1000 loops, best of 3: 238 µs per loop
100 loops, best of 3: 1.5 ms per loop

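You can see that the version without the cast is more than 6 times faster (238 µs versus 1.5 ms), and the gap will grow on larger datasets.

Update

The following code is semantically equivalent to using the version without the float cast: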
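If you want to check how the two approaches scale on your own machine, a rough sketch along these lines may help. It uses the standard-library `timeit` module instead of the IPython `%timeit` magic, and a synthetic 10,000-row frame; the frame, its value ranges, and the function names are assumptions made for this example, not part of the original question. No timings are quoted here because they depend on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, assumed only for this benchmark sketch.
rng = np.random.default_rng(0)
n_rows = 10_000
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=n_rows),
    "gene_length": rng.integers(50, 5000, size=n_rows),
})
N = 12345


def rpkm_vectorized(count, total_reads, length):
    # Works on scalars and on whole Series alike.
    return (10**9 * count) / (total_reads * length)


def rpkm_rowwise(frame):
    # Calls the same formula once per row through apply.
    return frame.apply(
        lambda row: rpkm_vectorized(row["count"], N, row["gene_length"]),
        axis=1,
    )


t_vec = timeit.timeit(
    lambda: rpkm_vectorized(big["count"], N, big["gene_length"]), number=3
)
t_row = timeit.timeit(lambda: rpkm_rowwise(big), number=3)

print("vectorized: %.4f s for 3 runs" % t_vec)
print("row-wise:   %.4f s for 3 runs" % t_row)
```

Increasing `n_rows` is a simple way to see how each approach behaves as the data grows.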

df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length'])
df

Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613
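As a quick sanity check of that equivalence, the sketch below computes the column both ways and compares the results with `pd.testing.assert_series_equal`. The variable names `from_ints` and `from_floats` are made up for this example; the data and `calculate_RPKM1` are taken from above.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345


def calculate_RPKM1(theC, theN, theL):
    # Same formula as in the answer, without the float() cast.
    return ((10**9) * theC) / (theN * theL)


# Integer columns passed straight through.
from_ints = calculate_RPKM1(df["count"], N, df["gene_length"])

# Count column converted to float first, as in the update.
from_floats = calculate_RPKM1(df["count"].astype(float), N, df["gene_length"])

# Raises AssertionError if the two results differ.
pd.testing.assert_series_equal(from_ints, from_floats, check_names=False)
```

If you do need a float conversion, `astype(float)` on the column is the Series-level counterpart of calling `float()` on a single number.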

This question, "python - How to apply a function with multiple arguments to selected columns of a Pandas data frame", comes from Stack Overflow: https://stackoverflow.com/questions/30840856/
