python - Search and extract values from txt files using Pandas and Regex

I have 2 data tables that I am trying to extract values from. Here is my current script.

import re 
import os
import pandas as pd

os.chdir('C:/Users/Sams PC/Desktop')

# Raw strings keep '\s+' from being read as an invalid escape sequence.
test1=pd.read_csv('test1.txt', sep=r'\s+', header=None)
test1.columns=['Column_1','Column_2','Column_3']
test2=pd.read_csv('test2.txt', sep=r'\s+', header=None)
test2.columns=['Column_1','Column_2','Column_3','Column_4']

if 'S31N' in test1:
    data2=nhsqc[['Column_1','Column_2']].copy()
    if 'S31N-CA-HN' in test2:
        data2=nhsqc[['Column_3']].copy()
    else:
        print('Not Found')      
else:
    print('Not Found')


print(test1)
print (test2)

This gives the following output:

Not Found
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582

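I was able to organize the tables with pandas. Next I want to extract the values from the columns associated with "S31N". However, as you can see, my if line fails to find S31N even though it is present in my data table. If I change the value to one of my headers (if 'Column_1' in test1:), it works. I don't quite understand why it searches only the column headers and not the actual table.

Also, although my if line does work (when I use a column header), the second if line overwrites the data2 table created by the first one. How can I add it to data2 as an extra column instead of overwriting it?

I removed the second half since that problem has been solved. The main problem remains, however: my script still can't find my values. Updated script: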

x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

for i in range (0,2):
    if x[i] in test1:
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in test2:
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])

Output:

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
Not Found
Not Found
Y32N

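Best answer

I'm guessing this might get you closer. The problem is probably the type of test1 and test2: both are DataFrames, and the in operator on a DataFrame tests the column labels rather than the cell values. Converting them to text with str(test1) or str(test2) before the membership test may be one way to make it work.

Test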
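To make the difference concrete, here is a minimal sketch of the three kinds of membership test. The small DataFrame is rebuilt by hand from the first table in the question, and the variable names (label_hit, value_miss, text_hit, cell_hit) are made up for this illustration.

```python
import pandas as pd

# Hand-built copy of the first table from the question.
test1 = pd.DataFrame({
    "Column_1": ["S31N-HN", "Y32N-HN"],
    "Column_2": [114.424, 121.981],
    "Column_3": [7.390, 7.468],
})

# 'in' on a DataFrame looks at the column labels only.
label_hit = "Column_1" in test1
value_miss = "S31N" in test1

# Converting the table to text turns this into a plain substring search.
text_hit = "S31N" in str(test1)

# Searching the cells of one column directly, without going through str().
cell_hit = bool(test1["Column_1"].str.contains("S31N", regex=False).any())

print(label_hit, value_miss, text_hit, cell_hit)
```

The str() approach is fine for small tables like these; note that the printed form of a large DataFrame can be truncated, so searching the column itself is likely the safer choice in general.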

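# Search the printed text of each table for residue labels such as 'S31N'.
x=re.findall('[A-Z][0-9][0-9][A-Z]',str(test1))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]',str(test2))
print (x,y)

# Assumption: nhsqc is never defined in the question; it is taken here
# to be the table loaded as test1.
nhsqc=test1

for i in range (0,2):
    # Compare against the text of the table, not the DataFrame itself.
    if x[i] in str(test1):
        data2=nhsqc[['Column_1','Column_2']].copy()
        if y[i] in str(test2):
            data2=nhsqc[['Column_3']].copy()
            print (data2)
        else:   
            print('Not Found')      
    else:
        print('Not Found')


print(x[i])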

Mock test

import re
test1 = '''
0  S31N-HN   114.424     7.390
1  Y32N-HN   121.981     7.468
'''

test2 = '''
           Column_1  Column_2  Column_3  Column_4
0  S31N-A30CA-S31HN   114.424    54.808     7.393
1  S31N-A30CA-S31HN   126.854    53.005     9.277
2        S31N-CA-HN   114.424    61.717     7.391
3        S31N-HA-HN   126.864    59.633     9.287
4  Y32N-S31CA-Y32HN   121.981    61.674     7.467
5        Y32N-CA-HN   121.981    60.789     7.469
6  Q33N-Y32CA-Q33HN   120.770    60.775     8.582
'''

x = re.findall('[A-Z][0-9][0-9][A-Z]', str(test1))
y = re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]', str(test2))
print(x, y)

for i in range(0, 2):
    if x[i] in str(test1):
        print(x[i])
        # The nhsqc lookups are left out here: test1 and test2 are plain
        # strings in this mock test and nhsqc is not defined.
        if y[i] in str(test2):
            print(y[i])
        else:
            print('Not Found')
    else:
        print('Not Found')

Output

['S31N', 'Y32N'] ['S31N-CA', 'Y32N-CA']
S31N
S31N-CA
Y32N
Y32N-CA
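The question also asks how to add the second lookup to data2 as an extra column instead of overwriting it. The answer above does not cover that, so the following is only a sketch of one possible approach: extract the residue label from each table and merge on it. The tables are rebuilt by hand from the question, and the names residue and CA_shift are hypothetical, introduced only for this example.

```python
import pandas as pd

# Hand-built copies of the two tables from the question.
test1 = pd.DataFrame(
    [["S31N-HN", 114.424, 7.390], ["Y32N-HN", 121.981, 7.468]],
    columns=["Column_1", "Column_2", "Column_3"],
)
test2 = pd.DataFrame(
    [
        ["S31N-A30CA-S31HN", 114.424, 54.808, 7.393],
        ["S31N-A30CA-S31HN", 126.854, 53.005, 9.277],
        ["S31N-CA-HN", 114.424, 61.717, 7.391],
        ["S31N-HA-HN", 126.864, 59.633, 9.287],
        ["Y32N-S31CA-Y32HN", 121.981, 61.674, 7.467],
        ["Y32N-CA-HN", 121.981, 60.789, 7.469],
        ["Q33N-Y32CA-Q33HN", 120.770, 60.775, 8.582],
    ],
    columns=["Column_1", "Column_2", "Column_3", "Column_4"],
)

# Same residue pattern as in the question, anchored to the start of the label.
pattern = r"^([A-Z][0-9][0-9][A-Z])"

# Left side: the two columns wanted from test1, plus a hypothetical key column.
left = test1[["Column_1", "Column_2"]].copy()
left["residue"] = left["Column_1"].str.extract(pattern, expand=False)

# Right side: only the rows of test2 labelled exactly '<residue>-CA-HN'.
is_ca = test2["Column_1"].str.contains(r"^[A-Z][0-9][0-9][A-Z]-CA-HN$")
right = test2.loc[is_ca, ["Column_1", "Column_3"]].copy()
right["residue"] = right["Column_1"].str.extract(pattern, expand=False)
right = right[["residue", "Column_3"]].rename(columns={"Column_3": "CA_shift"})

# A left merge keeps every row of test1 and appends the matching value
# as a new column instead of replacing data2.
data2 = left.merge(right, on="residue", how="left")
print(data2)
```

With how="left", a residue that has no matching row in test2 would keep its row in data2 and get NaN in the added column, which can stand in for the 'Not Found' branch.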

Regarding "python - Search and extract values from txt files using Pandas and Regex", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58843397/
