python - Parsing origin/destination cities from a string

Tags: python regex pandas nlp nltk

I have a pandas DataFrame in which one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the origin city and destination city (I would ultimately like two new columns named 'origin' and 'destination').

Data:

df_col = [
    'new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
]

This should result in:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)

What I've tried so far:
Various NLTK approaches, but the one that got me closest was using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples with each word and its associated tag. Here's an example...
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]

I'm stuck at this stage and unsure how best to implement this. Can anyone point me in the right direction? Thanks.

Best Answer

TL;DR

At first glance it looks almost impossible, unless you have access to an API with very sophisticated components.

In long

At first glance, it seems like you're asking for a natural language problem to be magically solved. But let's break it down and scope it to a point where something can be built.

First, to recognize countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json

At the top of the search results we find https://datahub.io/core/world-cities , which leads to the world-cities.json file. Now let's load it into sets of countries and cities.

import requests
import json

cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))

countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])

Now, given the data, let's try to build component one:
  • Task: detect whether any substring in a text matches a city/country.
  • Tool: https://github.com/vi3k6i5/flashtext (fast string search/matching)
  • Metric: the number of correctly identified cities/countries in a string

  • Let's put it all together.
    import requests
    import json
    from flashtext import KeywordProcessor
    
    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
    
    countries = set([city['country'] for city in cities_json])
    cities = set([city['name'] for city in cities_json])
    
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    keyword_processor.extract_keywords(texts[0])
    

    [out]:
    ['York', 'Venice', 'Italy']
    

    Huh, what happened?!

    Doing our due diligence, the first hunch is that 'new york' isn't in the data,
    >>> "New York" in cities
    False
    

    What?! #$%^&* For sanity's sake, let's check these:
    >>> len(countries)
    244
    >>> len(cities)
    21940
    

    Yes, you can't just trust a single data source, so let's try to fetch them all.

    From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json , you find another link, https://github.com/dr5hn/countries-states-cities-database . Let's ingest this one...
    import requests
    import json
    
    cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
    cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
    
    countries1 = set([city['country'] for city in cities1_json])
    cities1 = set([city['name'] for city in cities1_json])
    
    dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
    dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
    
    cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
    countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
    
    countries2 = set([c['name'] for c in countries2_json])
    cities2 = set([c['name'] for c in cities2_json])
    
    countries = countries2.union(countries1)
    cities = cities2.union(cities1)
    

    Being the neurotics we are, we do a sanity check.
    >>> len(countries)
    282
    >>> len(cities)
    127793
    

    Wow, that's a lot more cities than before.

    Let's try the flashtext code again.
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    keyword_processor.extract_keywords(texts[0])
    

    [out]:
    ['York', 'Venice', 'Italy']
    

    Seriously?! No New York?! $%^&*

    Okay, for more sanity checks, let's look for 'york' in the list of cities:
    >>> [c for c in cities if 'york' in c.lower()]
    ['Yorklyn',
     'West York',
     'West New York',
     'Yorktown Heights',
     'East Riding of Yorkshire',
     'Yorke Peninsula',
     'Yorke Hill',
     'Yorktown',
     'Jefferson Valley-Yorktown',
     'New York Mills',
     'City of York',
     'Yorkville',
     'Yorkton',
     'New York County',
     'East York',
     'East New York',
     'York Castle',
     'York County',
     'Yorketown',
     'New York City',
     'York Beach',
     'Yorkshire',
     'North Yorkshire',
     'Yorkeys Knob',
     'York',
     'York Town',
     'York Harbor',
     'North York']
    

    Eureka! It's because it's called 'New York City' and not 'New York'!

    You: What kind of prank is this?!

    Linguist: Welcome to the world of natural language processing, where natural language is a subjective social construct, subject to communal and idiolectal variation.

    You: Cut the crap and tell me how to solve this.

    NLP practitioner (a real one who works on noisy user-generated text): You just have to add it to the list. But before that, check your metric against the list you already have.

    For every text in your sample 'test set', you should provide some ground-truth labels to make sure you can 'measure your metric'.
    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]
    
    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0
    
    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)
    
        # We're making some assumptions here that the order of 
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)
    
        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()
    

    Actually, it doesn't look that bad. We get 90% accuracy:
    >>> true_positives / total_truth
    0.9
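
    Note that the zip_longest scoring above assumes the extraction order matches the label order. A hedged, order-insensitive alternative (a sketch, not the original method here) is to compare the two as sets and compute precision/recall instead of plain accuracy:

```python
def score(extracted, label):
    """Set-based scoring: ignores order, counts hits/extras/misses."""
    extracted_set, label_set = set(extracted), set(label)
    tp = len(extracted_set & label_set)  # extracted and correct
    fp = len(extracted_set - label_set)  # extracted but wrong
    fn = len(label_set - extracted_set)  # in the truth but missed
    return tp, fp, fn

# The first example above: 'York' was extracted instead of 'New York'.
tp, fp, fn = score(['York', 'Venice', 'Italy'], ('New York', 'Venice', 'Italy'))
precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
```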
    

    But I %^&*-ing want 100% extraction!!

    Alright, alright, so look at the 'only' error the approach above made: it's simply that 'New York' isn't in the list of cities.

    You: Why don't we just add 'New York' to the list of cities, i.e.
    keyword_processor.add_keyword('New York')
    
    print(texts[0])
    print(keyword_processor.extract_keywords(texts[0]))
    

    [out]:
    ['New York', 'Venice', 'Italy']
    

    You: See, I did it!!! Now I deserve a beer.
    Linguist: What about 'I live in Marawi'?
    >>> keyword_processor.extract_keywords('I live in Marawi')
    []
    

    NLP practitioner (chiming in): What about 'I live in Jeju'?
    >>> keyword_processor.extract_keywords('I live in Jeju')
    []
    

    Raymond Hettinger fan (from afar): "There must be a better way!"

    Yes, there is! What if we just try something silly, like adding the city keywords that end with 'City' to our keyword_processor?
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
                print(c[:-5])
    

    It works!

    Now let's retry our regression-test examples:
    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
    ('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
    ('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
    ('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
    ('I live in Florida', ('Florida',)), 
    ('I live in Marawi', ('Marawi',)), 
    ('I live in jeju', ('Jeju',))]
    
    # No. of correctly extracted terms.
    true_positives = 0
    false_positives = 0
    total_truth = 0
    
    for text, label in texts_labels:
        extracted = keyword_processor.extract_keywords(text)
    
        # We're making some assumptions here that the order of 
        # extracted and the truth must be the same.
        true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
        false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
        total_truth += len(label)
    
        # Just visualization candies.
        print(text)
        print(extracted)
        print(label)
        print()
    

    [out]:
    new york to venice, italy for usd271
    ['New York', 'Venice', 'Italy']
    ('New York', 'Venice', 'Italy')
    
    return flights from brussels to bangkok with etihad from €407
    ['Brussels', 'Bangkok']
    ('Brussels', 'Bangkok')
    
    from los angeles to guadalajara, mexico for usd191
    ['Los Angeles', 'Guadalajara', 'Mexico']
    ('Los Angeles', 'Guadalajara')
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['Australia', 'New Zealand', 'Paris']
    ('Australia', 'New Zealand', 'Paris')
    
    I live in Florida
    ['Florida']
    ('Florida',)
    
    I live in Marawi
    ['Marawi']
    ('Marawi',)
    
    I live in jeju
    ['Jeju']
    ('Jeju',)
    

    100% Yes, NLP-bunga!!!

    But seriously, this is only the tip of the problem. What happens if you have a sentence like this:
    >>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
    ['Adam', 'Bangkok', 'Singapore', 'China']
    

    Why is Adam extracted as a city?!

    Then you do some more neurotic checks:
    >>> 'Adam' in cities
    True
    

    Congratulations, you've jumped into another NLP rabbit hole: polysemy, where the same word can have different meanings. In this case, Adam most probably refers to a person in the sentence, but coincidentally it is also the name of a city (according to the data you pulled from).
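
    One cheap, hedged way to suppress person-name false positives like 'Adam' is to only trust a gazetteer hit when it directly follows a travel preposition. This is a sketch with a toy gazetteer; it handles single-token names only, and it would miss 'new york to venice' (no preposition before the origin):

```python
TRAVEL_PREPS = {'to', 'from', 'via', 'in'}

def hits_after_preposition(text, gazetteer):
    """Keep only gazetteer hits whose previous token is a travel preposition."""
    tokens = text.lower().split()
    found = []
    for i in range(1, len(tokens)):
        if tokens[i - 1] in TRAVEL_PREPS and tokens[i].capitalize() in gazetteer:
            found.append(tokens[i].capitalize())
    return found

gazetteer = {'Adam', 'Bangkok', 'Singapore', 'China'}  # toy stand-in for `cities`
hits_after_preposition('Adam flew to Bangkok from Singapore and then to China', gazetteer)
# -> ['Bangkok', 'Singapore', 'China']  ('Adam' is dropped)
```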

    You: I see what you did there... But even if we ignore this polysemy nonsense, you still haven't given me the desired output:

    [in]:
    ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags'
    ]
    

    [out]:
    Origin: New York, USA; Destination: Venice, Italy
    Origin: Brussels, BEL; Destination: Bangkok, Thailand
    Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
    Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
    

    Linguist: Even assuming that the preposition before the city (e.g. from, to) gives you the 'origin'/'destination' tag, how are you going to handle 'multi-leg' flights, e.g.
    >>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
    

    What is the desired output of this sentence:
    > Adam flew to Bangkok from Singapore and then to China
    

    Perhaps something like this? What is the specification? How (un-)structured is your input text?
    > Origin: Singapore
    > Departure: Bangkok
    > Departure: China
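
    Producing that kind of leg-labelled output can be approximated with a hedged token-walk sketch: whenever a 'from'/'to' immediately precedes a known location, record a labelled leg. Toy gazetteer, single-token names only; multi-word names would need the matcher's span output instead of a naive split():

```python
def label_legs(text, locations):
    """Label each (preposition, location) pair as an Origin or Destination leg."""
    tokens = text.lower().split()
    legs = []
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1].capitalize()
        if nxt not in locations:
            continue
        if tokens[i] == 'from':
            legs.append(('Origin', nxt))
        elif tokens[i] == 'to':
            legs.append(('Destination', nxt))
    return legs

locations = {'Bangkok', 'Singapore', 'China'}  # toy stand-in
label_legs('Adam flew to Bangkok from Singapore and then to China', locations)
# -> [('Destination', 'Bangkok'), ('Origin', 'Singapore'), ('Destination', 'China')]
```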
    

    Let's try to build component two to detect prepositions.

    Let's take the assumptions you have and try some hacks on the same flashtext approach.

    What if we add to and from to the list?
    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    
    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)
        print(extracted)
        print()
    

    [out]:
    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']
    
    return flights from brussels to bangkok with etihad from €407
    ['from', 'Brussels', 'to', 'Bangkok', 'from']
    
    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    ['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
    

    Heh, these are pretty lousy rules for to/from:
  • What if the 'from' refers to the price of the ticket?
  • What if there's no 'to/from' before the country/city?

  • Okay, let's work with the output above and see what we can do about problem 1. Maybe check whether the term after 'from' is a city, and if not, remove the to/from?
    from itertools import zip_longest
    from flashtext import KeywordProcessor
    
    keyword_processor = KeywordProcessor(case_sensitive=False)
    keyword_processor.add_keywords_from_list(sorted(countries))
    keyword_processor.add_keywords_from_list(sorted(cities))
    
    for c in cities:
        if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
            if c[:-5].strip():
                keyword_processor.add_keyword(c[:-5])
    
    keyword_processor.add_keyword('to')
    keyword_processor.add_keyword('from')
    
    texts = ['new york to venice, italy for usd271',
    'return flights from brussels to bangkok with etihad from €407',
    'from los angeles to guadalajara, mexico for usd191',
    'fly to australia new zealand from paris from €422 return including 2 checked bags']
    
    
    for text in texts:
        extracted = keyword_processor.extract_keywords(text)
        print(text)
    
        new_extracted = []
        extracted_next = extracted[1:]
        for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
            if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
                print(e_i, e_iplus1)
                continue
            elif e_i == 'from' and e_iplus1 is None:  # last word in the list.
                continue
            else:
                new_extracted.append(e_i)
    
        print(new_extracted)
        print()
    

    This seems to do the trick, removing the from that doesn't precede a city/country.

    [out]:
    new york to venice, italy for usd271
    ['New York', 'to', 'Venice', 'Italy']
    
    return flights from brussels to bangkok with etihad from €407
    from None
    ['from', 'Brussels', 'to', 'Bangkok']
    
    from los angeles to guadalajara, mexico for usd191
    ['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
    
    fly to australia new zealand from paris from €422 return including 2 checked bags
    from None
    ['to', 'Australia', 'New Zealand', 'from', 'Paris']
    

    But the 'from New York' case still isn't solved!!

    Linguist: Think about it carefully: should ambiguity be resolved by making an informed decision that renders the ambiguous phrase obvious? If so, what is the 'information' in that informed decision? Should some template be followed first to detect the information, before filling in the ambiguity?
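
    The linguist's 'template' idea can be made concrete with a hedged regex sketch (regex is in the question's tags): only commit to Origin/Destination when the text matches an explicit 'from X to Y' pattern, and leave everything else unresolved:

```python
import re

# Hypothetical template: lowercase text, origin/destination terminated by a
# comma, 'for', 'with', or end of string. A sketch, not a robust grammar.
TEMPLATE = re.compile(
    r'from\s+(?P<origin>[a-z ]+?)\s+to\s+(?P<dest>[a-z ]+?)(?:,|\s+for\b|\s+with\b|$)'
)

def match_template(text):
    m = TEMPLATE.search(text.lower())
    return (m.group('origin'), m.group('dest')) if m else None

match_template('from los angeles to guadalajara, mexico for usd191')
# -> ('los angeles', 'guadalajara')
match_template('new york to venice, italy for usd271')
# -> None (no 'from', so the template refuses to guess)
```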

    You: I'm losing my patience with you... You're leading me around in circles. Where is that AI that can understand human language that I keep hearing about from the news, from Google, from Facebook and whatnot?!

    You: Everything you've given me is rule-based. Where's the AI?

    NLP practitioner: Didn't you want 100%? Without any preset dataset that one can use to 'train an AI', writing 'business logic' or a rule-based system is the only way to really achieve '100%' on a given dataset.

    You: What does it mean to train an AI? Why can't I just use the AI from Google or Facebook or Amazon or Microsoft or even IBM?

    NLP practitioner: Let me introduce you to
  • https://learning.oreilly.com/library/view/data-science-from/9781492041122/
  • https://allennlp.org/tutorials
  • https://www.aclweb.org/anthology/

  • Welcome to the world of computational linguistics and NLP!

    In short

    Yes, there's no real magical out-of-the-box solution, and if you want to use an 'AI' or a machine learning algorithm, most probably you will need a lot more training data, like the texts_labels pairs shown in the example above.
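
    Coming back to the original pandas goal: here is a hedged, rule-based sketch that wires the 'from X'/'to X' heuristic into the two requested columns. It assumes a toy gazetteer and single-token names, and the first from/to hit wins; real data needs the fuller pipeline above:

```python
def parse_trip(text, locations):
    """First 'from <loc>' becomes origin, first 'to <loc>' becomes destination."""
    tokens = text.lower().split()
    origin = destination = None
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1].capitalize()
        if nxt not in locations:
            continue
        if tokens[i] == 'from' and origin is None:
            origin = nxt
        elif tokens[i] == 'to' and destination is None:
            destination = nxt
    return {'origin': origin, 'destination': destination}

locations = {'Brussels', 'Bangkok'}  # toy stand-in for the merged gazetteer
parse_trip('return flights from brussels to bangkok with etihad from €407', locations)
# -> {'origin': 'Brussels', 'destination': 'Bangkok'}

# With pandas (hypothetical column name 'text'), the same function yields the
# two new columns:
# df[['origin', 'destination']] = df['text'].apply(
#     lambda t: pd.Series(parse_trip(t, locations)))
```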

    Regarding python - parsing origin/destination cities from strings, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59956670/
