python - Convert a tuple containing an OrderedDict with tagged parts to a table with columns named from the tagged parts

Tags: python transpose

Fuller title: Convert a tuple containing an OrderedDict with tagged parts to a table with columns named from the tagged parts (variable number of tagged parts and variable number of occurrences of each tag).

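I know address parsing better than I know Python, which is probably the root of the problem. How to do this may well be obvious. The usaddress library returns its results in this form on purpose, presumably because it is useful.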

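I am using usaddress, which "is a python library for parsing unstructured address strings into address components, using advanced NLP methods", and it seems to work well. See the usaddress source and website.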

So I ran it on a file containing:

2244 NE 29TH DR
1742 NW 57TH ST
1241 NE EAST DEVILS LAKE RD 
4239 SW HWY 101, UNIT 19 
1315 NE HARBOR RIDGE 
4850 SE 51ST ST 
1501 SE EAST DEVILS LAKE RD 
1525 NE REGATTA WAY 
6458 NE MAST AVE 
4009 SW HWY 101 
814 SW 9TH ST 
1665 SALMON RIVER HWY 
3500 NE WEST DEVILS LAKE RD, UNIT 18 
1912 NE 56TH DR 
3334 NE SURF AVE 
2734 SW DUNE CT
2558 NE 33RD ST 
2600 NE 33RD ST 
5617 NW JETTY AVE 

I would like to convert the results into something more table-like (eventually a CSV file or a database).

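I am not sure what data type is being returned. The docs tell me that the tag method returns a tuple containing an OrderedDict with the tagged parts. The parse method appears to return a slightly different type. This question helped me work out that it is a list of tuples (apparently token and tag pairs). Searching for how to convert a Python list with tagged parts to a table was unsuccessful.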
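
To make the two return shapes concrete, here is a small sketch that does not call usaddress at all. It uses literal values copied from the "1665 SALMON RIVER HWY" outputs shown further down. The "Street Address" string in the second slot of the tag result is an assumption about what the library puts there, so check it against your installed version.

```python
from collections import OrderedDict

# Shape of a parse-style result: a list of (token, label) tuples.
parsed = [
    ("1665", "AddressNumber"),
    ("SALMON", "StreetName"),
    ("RIVER", "StreetName"),
    ("HWY", "StreetNamePostType"),
]

# Shape of a tag-style result: a tuple whose first item is an OrderedDict
# mapping label -> text. The second item is assumed for illustration.
tagged = (
    OrderedDict([
        ("AddressNumber", "1665"),
        ("StreetName", "SALMON RIVER"),
        ("StreetNamePostType", "HWY"),
    ]),
    "Street Address",
)

# Transposing the parse-style list separates tokens from labels.
tokens, labels = (list(t) for t in zip(*parsed))

# The OrderedDict is already one table row: column name -> value.
row = dict(tagged[0])
print(tokens, labels, row, sep="\n")
```

Seen this way, the tag-style result is the easier starting point for a table, because each address is already a mapping from column name to value.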

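Searching for how to convert a tuple containing an OrderedDict did not turn up much. This is the closest I found. I also found that pandas is good at all sorts of formatting tasks, although it is not clear to me how to apply pandas here. Many of the closest questions I found, like the opposite question or one with named tuples, have low scores.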

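I also made some exploratory attempts to see whether this is feasible (below). Using a few ways of accessing the data, plus zip from this Matrix Transpose question, I got a little closer to a table: the data and the tag names are now separate, although the rows are not uniform. Is there a way to put these results, whether they come as a tagged list or as a tuple containing an OrderedDict with tagged parts, into a table? Is there a fairly direct way to do it from the returned results?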

Here is the parse method:

## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try the parse method
    parsed = usaddress.parse(line)
    ## See what the parse results look like
    zippy = [list(i) for i in zip(*parsed)]
    print(zippy)
    ## read the next line
    line = f.readline()

## close the file
f.close()

And the results it produces (note that when a tag covers multiple parts, the tag is repeated):

[['2244', 'NE', '29TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1742', 'NW', '57TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1241', 'NE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['4239', 'SW', 'HWY', '101,', 'UNIT', '19'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier']]
[['1315', 'NE', 'HARBOR', 'RIDGE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4850', 'SE', '51ST', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1501', 'SE', 'EAST', 'DEVILS', 'LAKE', 'RD'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['1525', 'NE', 'REGATTA', 'WAY'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['6458', 'NE', 'MAST', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['4009', 'SW', 'HWY', '101'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName']]
[['814', 'SW', '9TH', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['1665', 'SALMON', 'RIVER', 'HWY'], ['AddressNumber', 'StreetName', 'StreetName', 'StreetNamePostType']]
[['3500', 'NE', 'WEST', 'DEVILS', 'LAKE', 'RD,', 'UNIT', '18'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetName', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier']]
[['1912', 'NE', '56TH', 'DR'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['3334', 'NE', 'SURF', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2734', 'SW', 'DUNE', 'CT'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2558', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['2600', 'NE', '33RD', 'ST'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
[['5617', 'NW', 'JETTY', 'AVE'], ['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType']]
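
One way to get from these repeated labels to a single value per column is to join tokens that share a label. The sketch below is only an illustration: `merge_labels` is a hypothetical helper, not part of usaddress, and it assumes that joining with a space and dropping a trailing comma (as in '101,') is what you want. It also joins tokens with the same label even when they are not adjacent.

```python
def merge_labels(pairs):
    """Collapse (token, label) pairs into {label: 'token token ...'}."""
    row = {}
    for token, label in pairs:
        token = token.rstrip(",")  # drop a trailing comma such as in '101,'
        # Append to an existing value, or start a new one for a new label.
        row[label] = f"{row[label]} {token}" if label in row else token
    return row


# Pairs copied from the "1241 NE EAST DEVILS LAKE RD" output above.
pairs = [
    ("1241", "AddressNumber"),
    ("NE", "StreetNamePreDirectional"),
    ("EAST", "StreetName"),
    ("DEVILS", "StreetName"),
    ("LAKE", "StreetName"),
    ("RD", "StreetNamePostType"),
]
row = merge_labels(pairs)
print(row)
```

Each resulting dict can then be treated as one table row keyed by column name.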

Here is the tag method:

## Get a library
import usaddress

## Open the file with read-only permission
f = open('address_sample.txt')

## Read the first line 
line = f.readline()

## If the file is not empty keep reading line one at a time
## until the file is empty
while line:
    ## Try tag method
    tagged = usaddress.tag(line)
    ## See what the tag results look like
    items_ = list(tagged[0].items())
    zippy2 = [list(i) for i in zip(*items_)]
    print(zippy2)
    ## read the next line
    line = f.readline()

## close the file
f.close()

It produces the following output, which handles multiple parts with the same tag better by combining them:

[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2244', 'NE', '29TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1742', 'NW', '57TH', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1241', 'NE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName', 'OccupancyType', 'OccupancyIdentifier'], ['4239', 'SW', 'HWY', '101', 'UNIT', '19']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1315', 'NE', 'HARBOR', 'RIDGE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['4850', 'SE', '51ST', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1501', 'SE', 'EAST DEVILS LAKE', 'RD']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1525', 'NE', 'REGATTA', 'WAY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['6458', 'NE', 'MAST', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetNamePreType', 'StreetName'], ['4009', 'SW', 'HWY', '101']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['814', 'SW', '9TH', 'ST']]
[['AddressNumber', 'StreetName', 'StreetNamePostType'], ['1665', 'SALMON RIVER', 'HWY']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType', 'OccupancyType', 'OccupancyIdentifier'], ['3500', 'NE', 'WEST DEVILS LAKE', 'RD', 'UNIT', '18']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['1912', 'NE', '56TH', 'DR']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['3334', 'NE', 'SURF', 'AVE']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2734', 'SW', 'DUNE', 'CT']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2558', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['2600', 'NE', '33RD', 'ST']]
[['AddressNumber', 'StreetNamePreDirectional', 'StreetName', 'StreetNamePostType'], ['5617', 'NW', 'JETTY', 'AVE']]

Best answer

Just use the csv.DictWriter class with your tag approach:

from csv import DictWriter
import usaddress

tagged_lines = []
fields = set()
# Note 1: Use the 'with' statement instead of worrying about opening
# and closing your file manually
with open('address_sample.txt') as in_file:
    # Note 2: You don't need to mess with readline() and while loops; 
    # just iterate over the file handle directly, it produces lines.
    for line in in_file:
        tagged = usaddress.tag(line)[0]
        tagged_lines.append(tagged)
        fields.update(tagged.keys()) # keep track of all field names we see

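# Note 3: On Python 3, newline='' lets the csv module control line endings
with open('address_sample.csv', 'w', newline='') as out_file:
    # sorted() turns the set into a list with a reproducible column order
    writer = DictWriter(out_file, fieldnames=sorted(fields))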
    writer.writeheader()
    writer.writerows(tagged_lines)
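
Rows that lack some of the columns are not a problem for DictWriter: missing keys are filled with its restval, which defaults to an empty string. The sketch below shows this with two rows copied from the tag output in the question, writing to an in-memory buffer instead of a file. Sorting the field names is just one assumed way to get a reproducible column order.

```python
import csv
import io

# Two tag-style rows with different sets of labels.
rows = [
    {"AddressNumber": "4009", "StreetNamePreDirectional": "SW",
     "StreetNamePreType": "HWY", "StreetName": "101"},
    {"AddressNumber": "1665", "StreetName": "SALMON RIVER",
     "StreetNamePostType": "HWY"},
]

# Union of all keys, sorted so the column order is reproducible.
fields = sorted(set().union(*rows))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, restval="")
writer.writeheader()
writer.writerows(rows)  # missing keys become empty cells

csv_text = buffer.getvalue()
print(csv_text)
```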

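Note that this is inefficient for large files, because it holds the entire tagged input in memory at once. The only reason for doing so is that the set of field names (i.e. the CSV column headers) is not known in advance.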

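If you know the full set, you can do it in a single streaming pass, writing the tagged output as you read each line. Alternatively, you can make one pass over the file to build the set of headers, then a second pass to do the conversion.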
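
Both alternatives can be sketched roughly as follows. The function names and the `tagger` parameter are hypothetical, introduced only for illustration: `tagger` is any callable that turns one input line into a dict of label to value. Blank lines are skipped, which is an assumption about the input file.

```python
import csv


def stream_known_fields(in_path, out_path, fields, tagger):
    """Single pass: write each tagged row as soon as it is read."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for line in src:
            if line.strip():  # skip blank lines
                writer.writerow(tagger(line))


def two_pass(in_path, out_path, tagger):
    """Pass 1 collects the header names; pass 2 streams the rows."""
    fields = set()
    with open(in_path) as src:
        for line in src:
            if line.strip():
                fields.update(tagger(line))  # adds the dict's keys
    stream_known_fields(in_path, out_path, sorted(fields), tagger)


# Possible usage with usaddress (not executed here):
# import usaddress
# two_pass("address_sample.txt", "address_sample.csv",
#          lambda line: usaddress.tag(line)[0])
```

Neither function keeps more than one row in memory. The two-pass version pays for that by tagging every line twice, and the single-pass version raises ValueError (DictWriter's default behaviour) if a row contains a label that is not in `fields`.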

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/29782125/
