python - Break up a CSV by regex in Python

Tags: python regex csv pandas

I have a file in the following format:

S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1

I want to break it up at each "S_A_" header line and compute the correlation coefficient of the data beneath it. So far I have:

import re
import pandas as pd

test = pd.read_csv("predict.csv",sep=('S\d+A\d+'))

print test

But this only gives me:

  Unnamed: 0     ,
0  0.01,0.01  None
1  0.02,0.02  None
2  0.03,0.03  None
3        NaN     ,
4  0.05,0.06  None
5  0.07,0.08  None
6        NaN     ,
7  1000,0.04  None
8  2000,0.08  None
9   3000,0.1  None

[10 rows x 2 columns]

Ideally, I would like to keep the regex delimiter and end up with something like:

S1A23: 1.0
S2A123: 0.86
S303A1: 0.75

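Is this possible?

Edit
When running on a large file (~250k lines), I get the following error. It is not a problem with the data, because when I split the ~250k lines into smaller blocks, every piece runs fine.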

Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

My exact code is:

import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'keyStrokeFourgram/TESTING1'

df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match('^S\d+_A\d+')
df['S/A'] = (df['pause']
              .where(mask, np.nan)
              .fillna(method='ffill'))
df = df.loc[~mask]

result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)

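Best answer

The sep parameter specifies the pattern that separates values on the same line. It cannot be used to split the lines of a csv into separate DataFrames.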
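To illustrate what sep does accept, here is a minimal sketch with a regex separator. The input string and the `[;|]` pattern are made up for the example; `engine="python"` is passed because regex separators are handled by pandas' Python parsing engine.

```python
import io
import pandas as pd

# Hypothetical two-line input whose fields are separated by ';' or '|'.
raw = "1;2|3\n4;5|6\n"

# The regex splits fields within each line; the line boundaries are untouched,
# so the result is still one DataFrame with one row per input line.
df = pd.read_csv(io.StringIO(raw), sep=r"[;|]", header=None, engine="python")
print(df.shape)
```

The separator only ever decides where one field ends and the next begins, which is why a header pattern such as `S\d+A\d+` cannot carve the file into groups.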

EDIT: There is a way to read the csv into a DataFrame with read_csv. This is better than using a Python loop (as in my original answer) because read_csv should be faster. That can matter, particularly for large csv files.

import numpy as np
import pandas as pd
df = pd.read_csv("data", names=['x', 'y'])
mask = df['x'].str.match(r'^S\d+A\d+')        # 1
df['type'] = (df['x']
              .where(mask, np.nan)            # 2
              .ffill())                       # 3
df = df.loc[~mask].astype({'x': float})       # 4 (also converts x from str to float)

result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)

yields

type
S1A23      1.000000
S25A123    1.000000
S3034A1    0.981981
dtype: float64

  1. The mask is True for the rows whose 'x' column holds a "type" label.

    In [139]: mask
    Out[139]: 
    0      True
    1     False
    2     False
    3     False
    4      True
    5     False
    6     False
    7      True
    8     False
    9     False
    10    False
    Name: x, dtype: bool
    
  2. df['x'].where(mask, np.nan) returns a Series equal to df['x'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the most recent label.

    In [141]: df['x'].where(mask, np.nan).fillna(method='ffill')
    Out[141]: 
    0       S1A23
    1       S1A23
    2       S1A23
    3       S1A23
    4     S25A123
    5     S25A123
    6     S25A123
    7     S3034A1
    8     S3034A1
    9     S3034A1
    10    S3034A1
    Name: x, dtype: object
    
  4. Select only the rows where the mask is False.
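
The four steps can be tried end to end without a file on disk. The following is a sketch, not the answer's exact code: it assumes the sample data from the question is held in a string, uses `Series.ffill()` for the forward fill, converts `x` to float explicitly, and prints one `label: value` line per group in the format the question asks for. The names `raw`, `is_header`, and `data` are chosen for the example.

```python
import io
import pandas as pd

# Sample data copied from the question.
raw = """S1A23
0.01,0.01
0.02,0.02
0.03,0.03
S25A123
0.05,0.06
0.07,0.08
S3034A1
1000,0.04
2000,0.08
3000,0.1
"""

df = pd.read_csv(io.StringIO(raw), names=["x", "y"])

# Step 1: True on header rows such as "S1A23".
is_header = df["x"].str.match(r"^S\d+A\d+")

# Steps 2 and 3: keep the label on header rows, then carry it down.
df["type"] = df["x"].where(is_header).ffill()

# Step 4: drop the header rows; x was read as text, so convert it.
data = df.loc[~is_header].astype({"x": float})

# One correlation coefficient per label.
result = {label: grp["x"].corr(grp["y"]) for label, grp in data.groupby("type")}
for label, r in result.items():
    print(f"{label}: {r:.2f}")
```

Iterating over the groups instead of calling `apply` is only a stylistic choice here; it makes it easy to format each line as `label: value`.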

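Original answer:

Unfortunately, I don't see a way to read the data file directly into an appropriate DataFrame. You need to massage the rows into the right form with a Python loop.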

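import pandas as pd
import csv

def to_columns(f):
    """Yield [label, x, y] rows, tagging each data row with the latest header."""
    val = None
    for row in csv.reader(f):
        if len(row) == 1:
            # A single-field row is an "S...A..." header; remember it.
            val = row[0]
        elif row:
            # csv.reader yields strings, so convert the values to float.
            yield [val] + [float(v) for v in row]

with open('data', newline='') as f:
    df = pd.DataFrame.from_records(to_columns(f), columns=['type', 'x', 'y'])

print(df)
result = df.groupby(['type']).apply(lambda grp: grp['x'].corr(grp['y']))
print(result)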
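
On the TypeError from the question's edit: the traceback ends at `min(30, max_rows - 4)` with a `NoneType` operand, and the script sets `pd.options.display.max_rows = None`, so that setting looks like the trigger in that pandas version rather than the data. This reading is inferred from the traceback and is not confirmed in the thread. Below is a minimal sketch of one way to sidestep the repr path, assuming the goal is simply to print every row; `result` is a hypothetical stand-in for the real groupby output.

```python
import pandas as pd

# Hypothetical stand-in for a long groupby result (one value per S/A label).
result = pd.Series([0.5] * 2000, index=[f"S{i}A1" for i in range(2000)])

# to_string() renders every row itself instead of going through the
# truncated repr that print(result) uses.
text = result.to_string()
print(text.splitlines()[0])
print(len(text.splitlines()))
```

Leaving `display.max_rows` at an integer value and calling `to_string()` only where the full listing is needed keeps the ordinary `print(result)` path unaffected.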

Regarding python - Break up a CSV by regex in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/21755302/
