python - Matching states and cities that may consist of multiple words

Tags: python regex python-3.x

I have a Python list with elements like the following:

['Alabama[edit]',
 'Auburn (Auburn University)[1]',
 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]',
 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]',
 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]',
 'Alaska[edit]',
 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]',
 'Flagstaff (Northern Arizona University)[6]',
 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)',
 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']

The list is not complete, but it is enough to give you an idea of what it contains.

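The data is structured as follows:

There is a US state name, followed by the names of some cities in that state. As you can see, state names end with "[edit]", and city names end either with a number in brackets (e.g. "[1]" or "[2]") or with university names in parentheses (e.g. "(University of North Alabama)").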

(The complete file referenced in this question can be found here.)

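Ideally, I would like a Python dictionary with the state names as keys and, as the value for each key, a list of all the city names in that state. For example, the dictionary should look like this: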

{'Alabama': ['Auburn', 'Florence', 'Jacksonville'...], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson', ....], ......}

So far, I have tried the following solution to strip out the unwanted parts:

import numpy as np
import pandas as pd

def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"]  )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.

    '''

    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")

    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])

    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]

    towns = list()
    dic = dict()

    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"

        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item

        else:
            towns.append(item)

    return ftext

get_list_of_university_towns()

A snippet of the output generated by my code looks like this:

{'Alabama': ['Auburn',
  'Florence',
  'Jacksonville',
  'Livingston',
  'Montevallo',
  'Troy',
  'Tuscaloosa',
  'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia',
  'Conway',
  'Fayetteville',
  'Jonesboro',
  'Magnolia',
  'Monticello',
  'Russellville',
  'Searcy'],
 'California': ['Angwin',
  'Arcata',
  'Berkeley',
  'Chico',
  'Claremont',
  'Cotati',
  'Davis',
  'Irvine',
  'Isla',
  'University',
  'Merced',
  'Orange',
  'Palo',
  'Pomona',
  'Redlands',
  'Riverside',
  'Sacramento',
  'University',
  'San',
  'San',
  'Santa',
  'Santa',
  'Turlock',
  'Westwood,',
  'Whittier'],
 'Colorado': ['Alamosa',
  'Boulder',
  'Durango',
  'Fort',
  'Golden',
  'Grand',
  'Greeley',
  'Gunnison',
  'Pueblo,'],
 'Connecticut': ['Fairfield',
  'Middletown',
  'New',
  'New',
  'New',
  'Storrs',
  'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave',
  'Boca',
  'Coral',
  'DeLand',
  'Estero',
  'Gainesville',
  'Orlando',
  'Sarasota',
  'St.',
  'St.',
  'Tallahassee',
  'Tampa'],
 'Georgia': ['Albany',
  'Athens',
  'Atlanta',
  'Carrollton',
  'Demorest',
  'Fort',
  'Kennesaw',
  'Milledgeville',
  'Mount',
  'Oxford',
  'Rome',
  'Savannah',
  'Statesboro',
  'Valdosta',
  'Waleska',
  'Young'],
 'Hawaii': ['Manoa'],

However, the output has a bug: states with a space in their name (e.g. "North Carolina") are missing. I know the reason behind it.

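I thought of using regular expressions, but since I have not studied them yet, I do not know how to write one. Any ideas on how to do this, with or without regular expressions?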

Best answer

Then let us praise the power of regular expressions:

import re

# lst_ is the list of strings shown in the question
states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip()
        for city in cities_rx.finditer(state.group('cities'))]
        for state in states_rx.finditer(transformed)}
print(result)

This yields

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']}


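Explanation:

The idea is to split the task into several smaller ones:

  1. Join the complete list with \n
  2. Separate the states
  3. Separate the towns
  4. Use a dict comprehension over all items found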


First subtask

transformed = '\n'.join(your_list)

Second subtask

^                      # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*\[edit\]$|\Z)    # ... either another state or the very end of the string

See the demo on regex101.com.
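To make the state pattern concrete, here is a minimal sketch that runs it on a few sample lines. The `sample` data is made up for illustration (it is not taken from the file in the question) and deliberately includes a two-word state name, which is the case the question struggled with.

```python
import re

# Same pattern as in the answer above.
states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

# Hypothetical sample lines, including a two-word state name.
sample = '\n'.join([
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina)[2]',
    'Durham (Duke University)',
    'Ohio[edit]',
    'Athens (Ohio University)[3]',
])

# Each match carries the state name and the raw block of city lines after it.
blocks = [(m.group('state'), m.group('cities')) for m in states_rx.finditer(sample)]
for state, cities in blocks:
    print(repr(state), repr(cities))
```

Because the state group takes everything on the line up to "[edit]", spaces inside the state name need no special handling. The captured city block still contains the parentheses and footnote markers, which the next step removes.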

Third subtask

^[^()\n]+              # match start of the line, anything not a newline character or ( or )

See another demo on regex101.com.
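As a small illustration of the city pattern, the sketch below applies it to a made-up block of city lines shaped like the text captured by the `cities` group. The `cities_block` string is an assumption for demonstration only.

```python
import re

# Same pattern as in the answer above.
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Hypothetical city block; note the leading newline left over from the state line.
cities_block = (
    '\nChapel Hill (University of North Carolina)[2]'
    '\nSan Luis Obispo (Cal Poly)'
    '\nDurham\n'
)

# Each match runs from the start of a line up to the first parenthesis;
# rstrip() removes the space that precedes "(".
cities = [m.group(0).rstrip() for m in cities_rx.finditer(cities_block)]
print(cities)

# A footnote that is not preceded by parentheses is kept, since "[" is not excluded.
bare_footnote = [m.group(0) for m in cities_rx.finditer('Durham[2]')]
print(bare_footnote)
```

Multi-word city names survive intact because only parentheses and newlines stop the match. The second print shows a limitation worth knowing about: a line such as 'Durham[2]' (no parentheses) would keep its footnote marker, so it may need an extra cleanup step if the data contains such lines.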

Fourth subtask

result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}

This is roughly equivalent to:

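result = {}
for state in states_rx.finditer(transformed):
    # the state name is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # the city is in city.group(0), possibly with trailing whitespace,
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities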
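Since the question also asked about doing this without regular expressions, here is a minimal sketch of the same grouping using only string methods. The function name `group_towns_by_state` and the last two sample lines are hypothetical, and this is one possible approach rather than the method used above.

```python
def group_towns_by_state(lines):
    """Group town names under the most recent state header (hypothetical helper)."""
    result = {}
    towns = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        if line.endswith('[edit]'):
            # State header: drop the "[edit]" suffix and start a new town list.
            towns = result.setdefault(line[:-len('[edit]')], [])
        elif towns is not None:
            # Town line: keep the text before " (" or "[", whichever comes first.
            name = line.partition(' (')[0].partition('[')[0].strip()
            towns.append(name)
    return result


sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'North Carolina[edit]',                           # hypothetical line
    'Chapel Hill (University of North Carolina)[2]',  # hypothetical line
]
grouped = group_towns_by_state(sample)
print(grouped)
```

The key difference from the attempt in the question is that each line is cut at " (" or "[" instead of at the first space, so multi-word state and city names stay whole.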


Finally, some timings:

import timeit
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965

So running the above 100,000 times took about 12 seconds on my machine (roughly 0.12 ms per call), so it should be reasonably fast.
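The definition of `findstatesandcities` is not shown above. The sketch below assumes it simply wraps the join and the dict comprehension in a function, and it times that on a shortened version of the list from the question. It uses fewer repetitions than the measurement above, and absolute numbers will differ from machine to machine.

```python
import re
import timeit

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# A shortened version of the list from the question.
lst_ = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]


def findstatesandcities():
    # Assumed shape of the timed function: join the list, then apply both patterns.
    transformed = '\n'.join(lst_)
    return {state.group('state'): [city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))]
            for state in states_rx.finditer(transformed)}


# 1,000 repetitions instead of 10**5, to keep the sketch quick to run.
runs = 1000
elapsed = timeit.timeit(findstatesandcities, number=runs)
print(f'{elapsed / runs * 1e6:.1f} microseconds per call')
```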

Regarding "python - Matching states and cities that may consist of multiple words", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48049006/
