python - Using regex to move data from strings into a pandas DataFrame?

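I have a text file containing a number of single-line strings. The fields are not always in the same order, but the lines generally contain some of the same information.

For example: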

(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))
(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))

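Not everything on each line needs to be read in; for this example only "Names", "x", "y", "label", and "code" are needed. Assuming I have a few hundred lines that look like the example, is there an easy way to get the data I want from each line? Ideally I want to get the information into a pandas DataFrame, but the question is mainly about how to apply a regex to these strings correctly, since the fields follow no fixed order.

Example of the DataFrame (if it helps in understanding the question):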

Names   x   y   label   code
RED    123 456   ONE    XYZ
GREEN  789 101   TWO

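Is regex the best way to solve this? Looking across all the lines I did not find a real pattern, so it may not be ideal.

Best answer

Apart from the properties appearing in arbitrary order, the pattern is regular, so this is certainly doable. I did it in two steps: one regex to grab the color at the start and pull out the property string, and a second regex to extract the individual properties.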

import re


inputs = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]

# Step 1: capture the name after "Names" and the inner property string.
# Raw strings keep backslashes such as \( and \s from being read as string escapes.
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Step 2: capture every (key value) pair from
# (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = prop_re.findall(props)
    # [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
    print("%s: %s" % (color, properties))

The output is

RED: [('x', '123'), ('y', '456'), ('type', 'MT'), ('label', 'ONE'), ('code', 'XYZ')]
GREEN: [('type', 'MX'), ('label', 'TWO'), ('x', '789'), ('y', '101')]
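
With a few hundred lines read from a file, some of them may not fit the pattern, and `initial_re.match` returns None for those. The following is a minimal sketch, not part of the original answer, that wraps the two steps in a helper; the name `parse_line` is hypothetical, and returning None for unmatched lines is an assumption about how you would want to handle them.

```python
import re

# Same two patterns as in the answer, written as raw strings.
INITIAL_RE = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
PROP_RE = re.compile(r'\(([^\s]*)\s([^\s]*)\)')


def parse_line(line):
    """Return (name, {property: value}), or None if the line does not match.

    parse_line is a hypothetical helper introduced for illustration.
    """
    match = INITIAL_RE.match(line.strip())
    if match is None:
        return None
    name, props = match.groups()
    # findall yields (key, value) tuples, which dict() accepts directly.
    return name, dict(PROP_RE.findall(props))


print(parse_line('(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))'))
print(parse_line('not a record'))
```

Because each property is captured as a key/value pair, the order in which the properties appear on a line does not matter.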

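To get this into pandas, you can accumulate the properties in a dictionary of lists (I did this below using a defaultdict). You need to store something for missing values so that all columns have the same length; here I just store None (null). Finally, use pd.DataFrame.from_dict to build the final DataFrame.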

import re
import pandas as pd
from collections import defaultdict

inputs = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]

# Step 1: capture the name after "Names" and the inner property string
initial_re = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
# Step 2: capture every (key value) pair from
# (x 123) (y 456) (type MT) (label ONE) (code XYZ)
prop_re = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

columns = ['color', 'x', 'y', 'type', 'label', 'code']

# Maps each column name to the list of values seen so far
data_dict = defaultdict(list)

for s in inputs:
    parts = initial_re.match(s)
    color = parts.group(1)
    props = parts.group(2)
    # e.g. (x 123) (y 456) (type MT) (label ONE) (code XYZ)
    properties = dict(prop_re.findall(props))
    properties['color'] = color

    # Append one value per column so every list stays the same length
    for k in columns:
        v = properties.get(k)  # None if missing
        data_dict[k].append(v)


df = pd.DataFrame.from_dict(data_dict)
print(df)

The final output is

   color    x    y type label  code
0    RED  123  456   MT   ONE   XYZ
1  GREEN  789  101   MX   TWO  None
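
The question only asked for the "Names", "x", "y", "label", and "code" columns. As an alternative to the dictionary of lists, here is a minimal sketch (not from the original answer) that collects one dict per line and lets pandas fill in missing keys. The `wanted` list, the choice to skip unmatched lines, and the numeric conversion of `x` and `y` are assumptions made for illustration.

```python
import re
import pandas as pd

INITIAL_RE = re.compile(r'^\(Names\s([^\s]*)\s\(property\s(.*)\)\)')
PROP_RE = re.compile(r'\(([^\s]*)\s([^\s]*)\)')

lines = [
    '(Names RED (property (x 123) (y 456) (type MT) (label ONE) (code XYZ)))',
    '(Names GREEN (property (type MX) (label TWO) (x 789) (y 101)))',
]

# Columns requested in the question (assumed ordering).
wanted = ['Names', 'x', 'y', 'label', 'code']

records = []
for line in lines:
    match = INITIAL_RE.match(line)
    if match is None:
        continue  # assumption: silently skip lines that do not fit the pattern
    record = dict(PROP_RE.findall(match.group(2)))
    record['Names'] = match.group(1)
    records.append(record)

# A list of dicts becomes one row per dict; keys absent from a row become NaN.
# reindex keeps only the wanted columns, in the wanted order.
df = pd.DataFrame(records).reindex(columns=wanted)
# The regex captures strings, so convert the coordinate columns to numbers.
df[['x', 'y']] = df[['x', 'y']].apply(pd.to_numeric)
print(df)
```

With this approach the missing `code` for GREEN shows up as NaN rather than None, and the unwanted `type` property is dropped by the `reindex` call.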

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/54789106/
