python - How to find the positions of a list of substrings in a string?

Tags: python string indexing substring offset


Given a string:

"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."

and a list of substrings:

['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

Desired output:

>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
 (34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
 (68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
 (110, 119), (120, 122), (123, 131), (131, 132)]

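To explain the output: the first substring, "The", can be recovered by slicing the string s with its (start, end) pair from the desired output, i.e. s[0:3].

So if we iterate over all the integer tuples in the desired output (call it out), we get back the list of substrings, i.e.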

>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']

I tried:

def find_offsets(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets

Is there another way to find the positions of a list of substrings in a string?

Best answer

First solution:

# Use a list comprehension and the str.index method.
[(s.index(e), s.index(e) + len(e)) for e in tokens]
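A quick way to see the limitation of this approach: `str.index` always returns the first match, so a token that occurs more than once gets the offsets of its first occurrence every time. The sketch below uses a short string with a repeated word; the helper name `naive_offsets` is hypothetical and only wraps the one-liner so it can be called.

```python
def naive_offsets(tokens, s):
    # Illustrative helper (not from the answer): each token is looked up
    # independently with str.index, which returns the first match in s.
    return [(s.index(tok), s.index(tok) + len(tok)) for tok in tokens]


s = 'The plane, plane'
tokens = ['The', 'plane', ',', 'plane']
print(naive_offsets(tokens, s))
# [(0, 3), (4, 9), (9, 10), (4, 9)]  <- second 'plane' is reported at 4, not 11
```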

The second solution fixes a problem with the first one, which always returns the first occurrence of a repeated token:

def find_offsets(tokens, s):
    offsets = []
    i = 0
    for token in tokens:
        # Advance until the current character matches the token's first character.
        while token[0] != s[i]:
            i += 1
        offsets.append((i, i + len(token)))
        # Resume the scan right after the end of this token.
        i += len(token)
    return offsets


find_offsets(tokens, s)
Out[201]: 
[(0, 3),
 (4, 9),
 (9, 10),
 (11, 16),
 (17, 20),
 (21, 23),
 (24, 34),
 (34, 35),
 (36, 43),
 (44, 46),
 (47, 52),
 (52, 54),
 (55, 60),
 (61, 67),
 (68, 72),
 (73, 75),
 (76, 83),
 (84, 89),
 (90, 98),
 (99, 103),
 (104, 109),
 (110, 119),
 (120, 122),
 (123, 131),
 (131, 132)]   

# Another test, with a repeated token
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
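One caveat about the loop in find_offsets: it compares only the first character of each token, and it raises IndexError if that character never appears again. As a hedged alternative (not part of the original answer), the same left-to-right scan can be written with the optional start argument of `str.index`, which matches whole tokens and raises ValueError when a token is missing. The name `find_offsets_strict` is hypothetical.

```python
def find_offsets_strict(tokens, s):
    """Return (start, end) offsets of tokens that occur in order in s."""
    offsets = []
    pos = 0  # the search never goes back before the end of the previous token
    for token in tokens:
        # str.index(sub, start) searches from `start` without slicing s,
        # and raises ValueError if the token is not found.
        start = s.index(token, pos)
        pos = start + len(token)
        offsets.append((start, pos))
    return offsets


s = 'The plane, plane'
tokens = ['The', 'plane', ',', 'plane']
offsets = find_offsets_strict(tokens, s)
print(offsets)  # [(0, 3), (4, 9), (9, 10), (11, 16)]

# Round-trip check: slicing with the offsets gives back the tokens.
assert [s[a:b] for a, b in offsets] == tokens
```

Because the whole token is matched, this variant also copes with input where an earlier word merely shares the token's first letter, at the cost of relying on `str.index` rather than a manual character scan.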

Regarding "python - How to find the positions of a list of substrings in a string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43773962/
