python - Get the line index of every extracted character from a csv file

Tags: python csv pandas dataframe

My csv file has a column (the second column, named character_position) that stores a list of characters together with their positions.

Each row of this column contains one character_position list; in total the column has 300 such rows. For example:
character_position = [['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507, '4', 1919, 1931, 486, 505, '1', 1935, 1947, 486, 505, '7', 1950, 1962, 486, 505, '2', 1965, 1976, 486, 505, '9', 1980, 1992, 486, 507, '6', 1995, 2007, 486, 505, '/', 2010, 2022, 484, 508, '4', 2025, 2037, 486, 505, '8', 2040, 2052, 486, 505, '3', 2057, 2067, 486, 507, '3', 2072, 2082, 486, 505, '0', 2085, 2097, 486, 507, '/', 2100, 2112, 484, 508, 'Q', 2115, 2127, 486, 507, '1', 2132, 2144, 486, 505, '7', 2147, 2157, 486, 505, '9', 2162, 2174, 486, 505, '/', 2175, 2189, 484, 508, 'C', 2190, 2204, 487, 505, '4', 2207, 2219, 486, 505, '1', 2241, 2253, 486, 505, '/', 2255, 2268, 484, 508, '1', 2271, 2285, 486, 507, '5', 2288, 2297, 486, 505], ['D', 2118, 2132, 519, 535, '.', 2138, 2144, 529, 534, '2', 2150, 2162, 516, 535, '0', 2165, 2177, 516, 535, '4', 2180, 2192, 516, 534, '7', 2196, 2208, 516, 534, '0', 2210, 2223, 514, 535, '1', 2226, 2238, 516, 534, '8', 2241, 2253, 514, 534, '2', 2256, 2267, 514, 535, '4', 2270, 2282, 516, 534, '0', 2285, 2298, 514, 535]]

Each character is followed by four values: left, top, right, bottom. For example, the character '1' has left=1890, top=1904, right=486, bottom=505.
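For clarity, this is how one such flat list decodes; a minimal sketch in pure Python (no assumptions beyond the 5-field layout described above):

# Walk the flat list in strides of 5 to recover
# (char, left, top, right, bottom) tuples.
flat = ['1', 1890, 1904, 486, 505, '8', 1905, 1916, 486, 507]
records = [tuple(flat[i:i + 5]) for i in range(0, len(flat), 5)]
print(records)  # [('1', 1890, 1904, 486, 505), ('8', 1905, 1916, 486, 507)]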

I read the whole csv file as follows:

df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1], names=['character_position'])
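As an aside, since each cell is a stringified Python list, it can also be parsed at read time. A sketch, assuming every cell is a valid Python literal:

import ast
import pandas as pd

# Parse the stringified lists while reading, so each cell becomes
# a real nested list instead of a string.
df = pd.read_csv('list_characters.csv', header=None, usecols=[1],
                 names=['character_position'],
                 converters={'character_position': ast.literal_eval})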

From this file I created a new csv file with five columns:

column 1: character, column 2: left, column 3: top, column 4: right, column 5: bottom.
cols = ['char','left','top','right','bottom']
df1 = df.character_position.str.strip('[]').str.split(', ', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print (df1)
   char  left  top  right  bottom
0   'm'    38  104   2456    2492
1   'i'    40  102   2442     222
2   '.'   203  213    191     198
3   '3'   235  262    131    3333
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444
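For reference, the `% 5` / `// 5` column trick above works by building a two-level column index and stacking the group level into rows; a toy sketch:

import pandas as pd

# One row holding two 5-field groups.
df = pd.DataFrame([['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448]])
# % 5 gives the position inside a group, // 5 gives the group number.
df.columns = [df.columns % 5, df.columns // 5]
# stack() moves the group level into rows: one row per character.
out = df.stack().reset_index(drop=True)
out.columns = ['char', 'left', 'top', 'right', 'bottom']
print(out)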

I want to add 2 columns named line_number and all_chars_in_same_row: 1) line_number is the line a record was extracted from; for example 'm' 38 104 2456 2492 was extracted from line 2; 2) all_chars_in_same_row holds all the (space-separated) characters that appear in the same row. For example, from the character_position list shown above I get '1' '8' '4' '1' '7' and so on.

More formally, all_chars_in_same_row means: for a given line_number, write out all the characters of that line.

char  left  top  right  bottom     line_number  all_chars_in_same_row
0   'm'    38  104   2456    2492   from line 2  'm' '2' '5' 'g'
1   'i'    40  102   2442     222   from line 4
2   '.'   203  213    191     198   from line 6
3   '3'   235  262    131    3333  
4   'A'   275  347    147     239
5   'M'   363  465    145    3334
6   'A'    73   91    373     394
7   'D'    93  112    373      39
8   'D'   454  473    663     685
9   'O'   474  495    664      33
10  'A'   108  129    727     751
11  'V'   129  150    727     444
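For what it's worth, once a line_number column exists, the second column can be produced with a single groupby. A sketch using transform (rather than the dictionary merge used in the answer below), assuming df1 already has 'char' and 'line_number' columns:

# Join all characters that share a line_number and broadcast the result
# back onto every row of that line.
df1['all_chars_in_same_row'] = (
    df1.groupby('line_number')['char'].transform(lambda s: ' '.join(s))
)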

EDIT 1:

import pandas as pd
df_data=pd.read_csv('/home/ahmed/internship/cnn_ocr/list_characters.csv')

df_data.shape

(50, 3)

df_data.iloc[:, 1]
0     [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442...
1     [['.', 203, 213, 191, 198, '3', 235, 262, 131,...
2     [['A', 275, 347, 147, 239, 'M', 363, 465, 145,...
3     [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 39...
4     [['D', 454, 473, 663, 685, 'O', 474, 495, 664,...
5     [['A', 108, 129, 727, 751, 'V', 129, 150, 727,...
6     [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970...
7     [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43...
8     [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, ...
9     [['h', 1686, 1703, 315, 339, 't', 1706, 1715, ...
10    [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, ...
11    [['N', 1758, 1775, 370, 391, 'D', 1785, 1803, ...
12    [['D', 2166, 2184, 370, 391, 'A', 2186, 2205, ...
13    [['2', 1395, 1415, 427, 454, '0', 1416, 1434, ...
14    [['I', 1533, 1545, 487, 541, 'I', 1548, 1551, ...
15    [['P', 1659, 1677, 490, 514, '2', 1680, 1697, ...
16    [['1', 1890, 1904, 486, 505, '8', 1905, 1916, ...
17    [['B', 1344, 1361, 583, 607, 'O', 1364, 1386, ...
18    [['B', 1548, 1580, 979, 1015, 'T', 1586, 1619,...
19    [['Q', 169, 190, 1291, 1312, 'U', 192, 210, 12...
20    [['1', 296, 305, 1492, 1516, 'S', 339, 357, 14...
21    [['G', 339, 362, 1815, 1840, 'S', 365, 384, 18...
22    [['2', 1440, 1455, 2047, 2073, '9', 1458, 1475...
23    [['R', 339, 360, 2137, 2163, 'e', 363, 378, 21...
24    [['R', 339, 360, 1860, 1885, 'e', 363, 380, 18...
25    [['0', 1266, 1283, 1951, 1977, ',', 1287, 1290...
26    [['1', 2207, 2217, 1492, 1515, '0', 2225, 2240...
27    [['1', 2364, 2382, 1552, 1585], [], ['E', 2369...
28                      [['S', 2369, 2382, 1833, 1866]]
29    [['0', 2243, 2259, 1951, 1977, '0', 2271, 2288...
30    [['0', 2243, 2259, 2227, 2253, '0', 2271, 2286...
31    [['D', 76, 88, 2580, 2596, 'é', 91, 100, 2580,...
32    [['ü', 1474, 1489, 2586, 2616, '3', 1541, 1557...
33    [['E', 1440, 1461, 2670, 2697, 'U', 1466, 1488...
34    [['2', 1685, 1703, 2670, 2697, '.', 1707, 1712...
35    [['1', 2202, 2213, 2668, 2695, '3', 2220, 2237...
36                         [['c', 88, 118, 2872, 2902]]
37    [['N', 127, 144, 2889, 2910, 'D', 156, 175, 28...
38    [['E', 108, 129, 3144, 3172, 'C', 133, 156, 31...
39    [['5', 108, 126, 3204, 3231, '0', 129, 147, 32...
40                                                 [[]]
41    [['1', 480, 492, 3202, 3229, '6', 500, 518, 32...
42    [['P', 217, 234, 3337, 3360, 'A', 235, 255, 33...
43                                                 [[]]
44    [['I', 954, 963, 2892, 2934, 'M', 969, 1011, 2...
45    [['E', 1385, 1407, 2970, 2998, 'U', 1410, 1433...
46    [['T', 2067, 2084, 2889, 2911, 'O', 2088, 2106...
47    [['1', 2201, 2213, 2970, 2997, '6', 2219, 2238...
48    [['M', 1734, 1755, 3246, 3267, 'O', 1758, 1779...
49    [['L', 923, 935, 3411, 3430, 'A', 941, 957, 34...
Name: character_position, dtype: object

Then, to produce my chars.csv, I do the following:

df = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df = df.replace(['\[','\]'], ['',''], regex=True)

cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left'] = df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top'] = df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right'] = df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom'] = df1['bottom'].replace(['\[','\]'], ['',''], regex=True)
df1.to_csv('chars.csv')
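As an aside, the five per-column replace calls above can be collapsed into one frame-wide call; an equivalent sketch with the same regexes:

# Strip stray brackets everywhere in one pass.
df1 = df1.replace([r'\[', r'\]'], ['', ''], regex=True)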

But I cannot see in your answer how you added the columns FromLine and all_chars_in_same_row.

When I execute your line of code:

df_data = df_data.character_position.str.strip('[]').str.split(',', expand=True)

I get the following:

df_data[0:10]
  0      1      2      3      4     5      6      7      8      9     ...   \
0  'm'     38    104   2456   2492   'i'     40    102   2442   2448  ...    
1  '.'    203    213    191    198   '3'    235    262    131    198  ...    
2  'A'    275    347    147    239   'M'    363    465    145    239  ...    
3  'A'     73     91    373    394   'D'     93    112    373    396  ...    
4  'D'    454    473    663    685   'O'    474    495    664    687  ...    
5  'A'    108    129    727    751   'V'    129    150    727    753  ...    
6  'N'     34     51    949    970   '/'     52     61    948    970  ...    
7  'S'   1368   1401     43     85   'A'   1406   1446     43     85  ...    
8  'S'   1437   1457    112    138   'o'   1458   1476    118    138  ...    
9  'h'   1686   1703    315    339   't'   1706   1715    316    339  ...    
   1821  1822  1823  1824  1825  1826  1827  1828  1829  1830  
0  None  None  None  None  None  None  None  None  None  None  
1  None  None  None  None  None  None  None  None  None  None  
2  None  None  None  None  None  None  None  None  None  None  
3  None  None  None  None  None  None  None  None  None  None  
4  None  None  None  None  None  None  None  None  None  None  
5  None  None  None  None  None  None  None  None  None  None  
6  None  None  None  None  None  None  None  None  None  None  

Here are the first 10 rows of my csv file:

    character_position
0   [['m', 38, 104, 2456, 2492, 'i', 40, 102, 2442, 2448, 'i', 40, 100, 2402, 2410, 'l', 40, 102, 2372, 2382, 'm', 40, 102, 2312, 2358, 'u', 40, 102, 2292, 2310, 'i', 40, 104, 2210, 2260, 'l', 40, 104, 2180, 2208, 'i', 40, 104, 2140, 2166, 'l', 40, 104, 2124, 2134]]
1   [['.', 203, 213, 191, 198, '3', 235, 262, 131, 198]]
2   [['A', 275, 347, 147, 239, 'M', 363, 465, 145, 239, 'S', 485, 549, 145, 243, 'U', 569, 631, 145, 241, 'N', 657, 733, 145, 239]]
3   [['A', 73, 91, 373, 394, 'D', 93, 112, 373, 396, 'R', 115, 133, 373, 396, 'E', 136, 153, 373, 396, 'S', 156, 172, 373, 396, 'S', 175, 192, 373, 396, 'E', 195, 211, 373, 396, 'D', 222, 241, 373, 396, 'E', 244, 261, 373, 396, 'L', 272, 285, 375, 396, 'I', 288, 293, 375, 396, 'V', 296, 314, 375, 396, 'R', 317, 334, 373, 396, 'A', 334, 354, 375, 396, 'I', 357, 360, 373, 396, 'S', 365, 381, 373, 396, 'O', 384, 405, 373, 396, 'N', 408, 425, 373, 394]]
4   [['D', 454, 473, 663, 685, 'O', 474, 495, 664, 687, 'C', 498, 516, 664, 687, 'U', 519, 536, 663, 687, 'M', 540, 561, 663, 687, 'E', 564, 581, 663, 685, 'N', 584, 600, 664, 685, 'T', 603, 618, 663, 685]]
5   [['A', 108, 129, 727, 751, 'V', 129, 150, 727, 753, 'O', 153, 175, 727, 753, 'I', 178, 183, 727, 751, 'R', 187, 210, 727, 751, 'S', 220, 240, 727, 753, 'U', 243, 263, 727, 753, 'R', 267, 288, 727, 751, 'F', 302, 318, 727, 751, 'A', 320, 341, 727, 751, 'C', 342, 363, 726, 751, 'T', 366, 384, 726, 750, 'U', 387, 407, 727, 751, 'R', 411, 432, 727, 751, 'E', 435, 453, 726, 751, 'P', 797, 815, 727, 751, 'A', 818, 839, 727, 751, 'G', 840, 863, 727, 751, 'E', 867, 885, 726, 751, '1', 900, 911, 727, 751, '1', 926, 934, 727, 751, '1', 947, 956, 727, 751, '5', 962, 979, 727, 751], ['R', 120, 142, 778, 807, 'T', 144, 165, 778, 805, 'T', 178, 199, 778, 805, 'e', 201, 219, 786, 807, 'c', 222, 240, 786, 807, 'h', 241, 258, 778, 807, 'n', 263, 279, 786, 807, 'i', 284, 287, 778, 805, 'c', 291, 308, 786, 807, 'a', 309, 327, 786, 807, 'R', 350, 374, 778, 807, 'e', 377, 395, 786, 807, 't', 396, 405, 780, 805, 'u', 408, 425, 786, 807, 'r', 429, 440, 786, 807, 'n', 441, 458, 786, 807, '-', 471, 482, 793, 798, 'D', 497, 518, 778, 807, 'O', 522, 548, 777, 807, 'A', 549, 573, 778, 807, '/', 585, 596, 778, 807, 'D', 606, 630, 778, 807, 'A', 632, 656, 778, 807, 'P', 659, 680, 778, 805]]
6   [['N', 34, 51, 949, 970, '/', 52, 61, 948, 970, 'C', 63, 81, 948, 970, 'O', 84, 103, 948, 970, 'M', 106, 127, 949, 970, 'M', 130, 151, 948, 970, 'A', 153, 172, 949, 970, 'N', 175, 192, 949, 970, 'D', 195, 213, 948, 970, 'E', 217, 232, 948, 970], ['1', 73, 84, 993, 1020, '1', 94, 105, 993, 1020, '8', 112, 130, 991, 1020, '4', 135, 153, 993, 1018, '5', 156, 172, 994, 1018, '7', 175, 192, 993, 1018, '6', 195, 213, 993, 1020, '0', 216, 235, 991, 1020, '6', 238, 257, 993, 1020, '5', 260, 278, 993, 1020, '0', 407, 425, 991, 1020, '9', 428, 446, 991, 1020, '.', 450, 455, 1015, 1020, '0', 459, 477, 991, 1020, '1', 485, 494, 994, 1018, '.', 503, 507, 1015, 1020, '2', 512, 530, 991, 1020, '0', 533, 551, 991, 1020, '1', 555, 566, 993, 1020, '5', 575, 593, 993, 1020, 'R', 632, 656, 991, 1020, 'M', 659, 684, 991, 1020, 'A', 689, 713, 991, 1020, 'N', 726, 747, 993, 1020, 'o', 752, 770, 999, 1020, '.', 774, 779, 1015, 1020, '5', 794, 812, 993, 1020, '8', 815, 833, 991, 1020, '4', 834, 852, 993, 1017, '4', 857, 873, 994, 1018, '3', 878, 896, 991, 1020, '8', 899, 917, 991, 1020, '0', 920, 938, 991, 1020, '/', 950, 960, 991, 1020, '0', 971, 990, 993, 1020, '7', 995, 1011, 993, 1018, '1', 1016, 1026, 993, 1018, '6', 1034, 1052, 993, 1020, '7', 1055, 1073, 993, 1020, '4', 1076, 1094, 993, 1018, '8', 1098, 1116, 991, 1020, '9', 1119, 1137, 991, 1020, '0', 1140, 1158, 993, 1020, '9', 1160, 1178, 991, 1020], ['N', 34, 51, 1045, 1066, '/', 54, 61, 1045, 1066, 'B', 63, 79, 1044, 1066, 'O', 82, 102, 1044, 1066, 'N', 105, 121, 1045, 1066, 'D', 133, 151, 1045, 1066, 'E', 156, 172, 1044, 1066, 'L', 183, 196, 1045, 1066, 'I', 199, 204, 1045, 1066, 'V', 205, 223, 1045, 1066, 'R', 226, 244, 1045, 1066, 'A', 246, 266, 1045, 1066, 'I', 267, 272, 1045, 1066, 'S', 275, 291, 1044, 1066, 'O', 294, 314, 1045, 1066, 'N', 318, 335, 1045, 1066], ['8', 72, 90, 1093, 1122, '2', 93, 109, 1093, 1122, '5', 114, 132, 1095, 1122, '9', 135, 153, 1093, 1122, '7', 154, 172, 1095, 1122, '1', 178, 189, 1093, 1122, '3', 196, 214, 1093, 1122, '1', 220, 231, 1095, 1122, '0', 238, 257, 1093, 1122, '3', 260, 278, 1093, 1122, '0', 407, 425, 1093, 1122, '6', 429, 447, 1095, 1122, '.', 452, 455, 1117, 1122, '0', 459, 477, 1093, 1122, '2', 480, 498, 1093, 1122, '.', 503, 507, 1117, 1122, '2', 512, 530, 1093, 1122, '0', 533, 551, 1093, 1122, '1', 557, 567, 1095, 1122, '5', 575, 593, 1095, 1122], ['v', 70, 90, 1150, 1171, '/', 88, 97, 1150, 1171, 'r', 100, 118, 1150, 1171, 'é', 121, 136, 1144, 1173, 'f', 141, 156, 1150, 1171, 'ê', 159, 174, 1144, 1173, 'r', 177, 195, 1150, 1173, 'e', 198, 214, 1150, 1171, 'n', 217, 234, 1150, 1171, 'c', 238, 257, 1149, 1171, 'e', 260, 276, 1149, 1173, 'B', 476, 497, 1152, 1179, 'O', 501, 527, 1149, 1179, 'G', 530, 555, 1150, 1180, 'D', 560, 582, 1152, 1179, 'O', 585, 611, 1149, 1179, 'A', 614, 638, 1150, 1179, '1', 642, 653, 1152, 1179, '5', 659, 677, 1153, 1180, 'B', 681, 701, 1152, 1179, 'T', 705, 726, 1152, 1179, '0', 728, 746, 1152, 1179, '6', 749, 767, 1152, 1179]]
7   [['S', 1368, 1401, 43, 85, 'A', 1406, 1446, 43, 85, 'M', 1451, 1491, 36, 85, 'S', 1500, 1533, 43, 85, 'U', 1539, 1574, 43, 85, 'N', 1581, 1616, 43, 85, 'G', 1623, 1662, 42, 85, 'E', 1686, 1719, 43, 85, 'L', 1725, 1755, 43, 85, 'E', 1763, 1794, 42, 85, 'C', 1800, 1836, 43, 85, 'T', 1841, 1874, 42, 85, 'R', 1880, 1914, 42, 84, 'O', 1919, 1959, 42, 85, 'N', 1965, 1998, 42, 84, 'I', 2007, 2016, 42, 84, 'C', 2022, 2058, 42, 84, 'S', 2066, 2099, 42, 84, 'F', 2121, 2151, 42, 84, 'R', 2159, 2193, 42, 84, 'A', 2198, 2237, 40, 84, 'N', 2243, 2277, 40, 84, 'C', 2285, 2321, 42, 84, 'E', 2328, 2360, 40, 84]]
8   [['S', 1437, 1457, 112, 138, 'o', 1458, 1476, 118, 138, 'c', 1479, 1493, 120, 138, 'i', 1494, 1499, 112, 136, 'é', 1503, 1518, 114, 138, 't', 1520, 1527, 115, 138, 'é', 1530, 1547, 112, 138, 'p', 1559, 1575, 120, 144, 'a', 1577, 1593, 118, 138, 'r', 1596, 1607, 118, 136, 'A', 1616, 1637, 112, 136, 'c', 1640, 1653, 118, 138, 't', 1655, 1664, 115, 136, 'i', 1665, 1670, 112, 136, 'o', 1673, 1688, 118, 138, 'n', 1692, 1707, 118, 136, 's', 1710, 1725, 118, 138, 'S', 1736, 1755, 112, 138, 'i', 1760, 1763, 112, 136, 'm', 1767, 1791, 118, 136, 'p', 1794, 1811, 118, 142, 'l', 1812, 1817, 112, 136, 'i', 1821, 1824, 112, 136, 'f', 1827, 1835, 112, 136, 'i', 1835, 1841, 112, 136, 'é', 1845, 1860, 112, 136, 'e', 1863, 1878, 118, 136, 'a', 1890, 1907, 118, 138, 'u', 1910, 1925, 118, 136, 'C', 1937, 1958, 112, 136, 'a', 1961, 1977, 118, 136, 'p', 1980, 1995, 118, 142, 'i', 1998, 2003, 112, 136, 't', 2006, 2013, 114, 136, 'a', 2015, 2030, 118, 136, 'l', 2034, 2037, 112, 136, 'd', 2051, 2066, 111, 136, 'e', 2069, 2085, 117, 136, '2', 2097, 2112, 112, 136, '7', 2115, 2132, 111, 136, '.', 2136, 2139, 132, 136, '0', 2144, 2159, 111, 136, '0', 2162, 2178, 111, 136, '0', 2180, 2196, 111, 136, '.', 2201, 2205, 132, 135, '0', 2208, 2225, 111, 136, '0', 2228, 2243, 111, 136, '0', 2246, 2261, 111, 136, 't', 2273, 2281, 112, 135, 'i', 2281, 2291, 111, 136], ['1', 1473, 1482, 153, 177, ',', 1491, 1494, 172, 181, 'r', 1508, 1517, 159, 177, 'u', 1520, 1535, 160, 177, 'e', 1538, 1554, 159, 177, 'F', 1566, 1583, 153, 177, 'r', 1587, 1596, 159, 177, 'u', 1598, 1613, 159, 177, 'c', 1617, 1631, 159, 177, 't', 1634, 1641, 154, 177, 'i', 1643, 1646, 153, 177, 'd', 1650, 1665, 151, 177, 'o', 1668, 1685, 159, 177, 'r', 1688, 1697, 159, 177, 'C', 1709, 1730, 153, 177, 'S', 1733, 1751, 153, 177, '2', 1764, 1779, 153, 177, '0', 1781, 1797, 153, 177, '0', 1800, 1817, 153, 177, '3', 1820, 1835, 151, 177, '9', 1847, 1863, 151, 177, '3', 1866, 1883, 151, 177, '4', 1883, 1901, 153, 175, '8', 1904, 1919, 151, 177, '4', 1919, 1937, 153, 175, 'S', 1950, 1968, 151, 177, 'A', 1971, 1992, 151, 175, 'I', 1995, 2000, 151, 175, 'N', 2004, 2024, 151, 175, 'T', 2027, 2046, 151, 175, 'O', 2058, 2081, 151, 177, 'U', 2085, 2105, 151, 177, 'E', 2109, 2127, 151, 177, 'N', 2130, 2150, 151, 175, 'C', 2163, 2186, 151, 175, 'e', 2187, 2204, 157, 175, 'd', 2207, 2222, 150, 175, 'e', 2225, 2240, 157, 175, 'x', 2243, 2258, 157, 175], ['T', 1638, 1656, 192, 216, 'É', 1659, 1677, 186, 217, 'L', 1682, 1697, 193, 217, 'É', 1701, 1719, 187, 217, 'P', 1722, 1742, 192, 217, 'H', 1746, 1766, 193, 217, 'O', 1770, 1793, 192, 217, 'N', 1796, 1815, 192, 216, 'E', 1820, 1838, 192, 217, '0', 1869, 1886, 190, 216, '1', 1890, 1899, 192, 216, '4', 1914, 1931, 193, 216, '4', 1934, 1950, 193, 216, '0', 1961, 1977, 190, 216, '4', 1980, 1997, 193, 216, '7', 2009, 2024, 192, 216, '0', 2027, 2042, 192, 216, '0', 2055, 2070, 192, 216, '0', 2073, 2090, 192, 216], ['R', 1517, 1538, 232, 258, '.', 1542, 1545, 253, 256, 'C', 1550, 1571, 232, 256, '.', 1575, 1580, 252, 256, 'S', 1584, 1602, 232, 256, '.', 1607, 1611, 252, 256, 'B', 1625, 1643, 232, 256, 'O', 1649, 1670, 231, 258, 'B', 1674, 1692, 232, 256, 'I', 1697, 1701, 232, 256, 'G', 1706, 1728, 232, 256, 'N', 1731, 1751, 232, 256, 'Y', 1754, 1775, 232, 256, 'B', 1788, 1806, 232, 256, '3', 1818, 1835, 231, 256, '3', 1838, 1855, 231, 256, '4', 1855, 1872, 232, 255, '3', 1884, 1899, 232, 256, '6', 1904, 1919, 232, 256, '7', 1922, 1937, 232, 256, '4', 1947, 1964, 232, 256, '9', 1967, 1983, 232, 256, '7', 1986, 2001, 232, 256, '-', 
2013, 2022, 244, 249, 'A', 2034, 2055, 231, 255, 'P', 2057, 2075, 231, 255, 'E', 2079, 2097, 231, 256, '4', 2109, 2126, 232, 255, '6', 2129, 2145, 232, 256, '5', 2148, 2163, 232, 256, '2', 2166, 2183, 232, 255, 'Z', 2193, 2211, 231, 255], ['C', 1628, 1647, 271, 297, 'o', 1652, 1670, 279, 297, 'd', 1671, 1689, 273, 297, 'e', 1692, 1709, 279, 298, 'T', 1721, 1739, 273, 297, 'V', 1742, 1763, 273, 297, 'A', 1763, 1787, 273, 297, 'F', 1818, 1835, 273, 297, 'R', 1839, 1859, 273, 297, '8', 1872, 1889, 273, 297, '9', 1890, 1905, 273, 297, '3', 1919, 1932, 273, 297, '3', 1937, 1952, 273, 297, '4', 1953, 1971, 273, 297, '3', 1983, 1998, 273, 297, '6', 2001, 2018, 273, 297, '7', 2021, 2036, 273, 295, '4', 2048, 2064, 274, 297, '9', 2066, 2082, 273, 297, '7', 2085, 2100, 273, 295]]
9   [['h', 1686, 1703, 315, 339, 't', 1706, 1715, 316, 339, 't', 1718, 1727, 316, 339, 'p', 1730, 1748, 321, 345, 'i', 1751, 1757, 321, 339, 'f', 1760, 1769, 315, 339, '/', 1769, 1776, 313, 339, 'w', 1779, 1804, 321, 337, 'w', 1804, 1829, 321, 339, 'w', 1830, 1854, 321, 337, '.', 1859, 1863, 333, 337, 's', 1868, 1883, 319, 339, 'a', 1886, 1901, 321, 337, 'm', 1905, 1929, 321, 337, 's', 1932, 1949, 321, 339, 'u', 1953, 1968, 321, 339, 'n', 1973, 1989, 321, 339, 'g', 1992, 2010, 319, 345, '.', 2015, 2019, 333, 337, 'f', 2021, 2033, 313, 337, 'r', 2034, 2045, 319, 337]]
10  [['N', 1331, 1349, 370, 391, 'C', 1361, 1379, 370, 393, 'O', 1382, 1403, 370, 393, 'M', 1404, 1425, 370, 391, 'P', 1430, 1446, 370, 391, 'T', 1448, 1464, 370, 391, 'E', 1467, 1484, 370, 393, 'C', 1494, 1512, 370, 393, 'L', 1515, 1532, 370, 393, 'I', 1533, 1539, 370, 393, 'E', 1542, 1559, 370, 393, 'N', 1560, 1580, 370, 393, 'T', 1580, 1598, 370, 393]]

And here is the second csv file:

    char    left    right   top bottom
0   'm' 38  104 2456    2492
1   'i' 40  102 2442    2448
2   'i' 40  100 2402    2410
3   'l' 40  102 2372    2382
4   'm' 40  102 2312    2358
5   'u' 40  102 2292    2310
6   'i' 40  104 2210    2260
7   'l' 40  104 2180    2208
8   'i' 40  104 2140    2166

EDIT 1 (continued):

Here is my output for solution 2 (with the `character_position` input described above):

    1831    1830    level_2 char    left    top right   bottom  FromLine    all_chars_in_same_row
0   0   character_position  0   character_position                  0   character_position
1   1   'm','i','i','l','m','u','i','l','i','l' 0   'm' 38  104 2456    2492    1   'm','i','i','l','m','u','i','l','i','l'
2   1   'm','i','i','l','m','u','i','l','i','l' 1   'i' 40  102 2442    2448    1   'm','i','i','l','m','u','i','l','i','l'
3   1   'm','i','i','l','m','u','i','l','i','l' 2   'i' 40  100 2402    2410    1   'm','i','i','l','m','u','i','l','i','l'

I think the problem comes from the fact that my data contains entries like [[',' , 'A', ',' , '.', ':' , ';', '1'], [], ['m', 'a', ]], so:

an empty `[]` breaks the column alignment. I noticed this when I tried to omit all the empty `[]`, because my csv looks like this:

in the char column I get ['a' instead of 'a', 8794] instead of 8794, or [5345 instead of 5345. So I processed the csv as follows:

df = pd.read_csv(filepath_or_buffer='list_characters.csv', header=None, usecols=[1,3], names=['character_position','LineIndex'])
df = df.replace(['\[','\]'], ['',''], regex=True)

cols = ['char','left','right','top','bottom']
df1 = df.character_position.str.strip('[]').str.split(',', expand=True)
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1.char = df1.char.replace(['\[','\]'], ['',''], regex=True)
df1['left'] = df1['left'].replace(['\[','\]'], ['',''], regex=True)
df1['top'] = df1['top'].replace(['\[','\]'], ['',''], regex=True)
df1['right'] = df1['right'].replace(['\[','\]'], ['',''], regex=True)
df1['bottom'] = df1['bottom'].replace(['\[','\]'], ['',''], regex=True)
df1.to_csv('char.csv')

Then I noticed the following:

Look at line 1221, column B: it is empty. The `[]` was replaced, and the empty char then shifts the columns (B and C) out of order. How can I fix this? I also have empty rows:

3831    '6' 296 314 3204    3231
3832                    
3833    '1' 480 492 3202    3229

Row 3832 should be deleted,

so that I end up with a clean table.
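One way to drop fully blank rows such as 3832, a sketch: treat empty strings as missing, then drop rows where every field is missing.

import numpy as np

# Blank cells become NaN; rows that are entirely NaN are dropped.
df1 = df1.replace('', np.nan).dropna(how='all')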

EDIT 2:

To fix the problem of the empty lists [] in list_characters.csv, such as:

[['1', 2364, 2382, 1552, 1585], [], ['E', 2369, 2381, 1623, 1640], ['8', 2369, 2382, 1644, 1668]] and rows that are just [[]],

I did the following:

import numpy as np

df1 = df.applymap(lambda x: [y for y in x if len(y) > 0])
df1 = df1[df1.applymap(len).ne(0).all(axis=1)]
df1 = df.replace(['\[\],', '\[\[\]\]', ''], ['', '', np.nan], regex=True)
df1 = df1.dropna()

Then:

import ast
import pandas as pd

df = pd.read_csv('character_position.csv', index_col=0)
df.positionlrtb = df.positionlrtb.apply(ast.literal_eval)
df.positionlrtb = df.positionlrtb.apply(lambda x: [y for y in x if len(y) > 0])
print(df.head())
      page_number                                       positionlrtb  \
0  1841729699_001  [[m, 38, 104, 2456, 2492, i, 40, 102, 2442, 24...   
1  1841729699_001   [[., 203, 213, 191, 198, 3, 235, 262, 131, 198]]   
2  1841729699_001  [[A, 275, 347, 147, 239, M, 363, 465, 145, 239...   
3  1841729699_001  [[A, 73, 91, 373, 394, D, 93, 112, 373, 396, R...   
4  1841729699_001  [[D, 454, 473, 663, 685, O, 474, 495, 664, 687...   

                    LineIndex  
0      [[mi, il, mu, il, il]]  
1                      [[.3]]  
2                   [[amsun]]  
3  [[adresse, de, livraison]]  
4                [[document]]

from itertools import chain

import numpy as np
import pandas as pd

cols = ['char','left','top','right','bottom']

df1 = pd.DataFrame({
        "a": np.repeat(df.page_number.values, df.positionlrtb.str.len()),
        "b": list(chain.from_iterable(df.positionlrtb))})

df1 = pd.DataFrame(df1.b.values.tolist())
df1.columns = [df1.columns % 5, df1.columns // 5]
df1 = df1.stack().reset_index(drop=True)
df1.columns = cols
df1[cols[1:]] = df1[cols[1:]].astype(int)
print(df1)
     char  left   top  right  bottom
0       m    38   104   2456    2492
1       i    40   102   2442    2448
2       i    40   100   2402    2410
3       l    40   102   2372    2382
4       m    40   102   2312    2358
5       u    40   102   2292    2310
6       i    40   104   2210    2260
7       l    40   104   2180    2208
8       i    40   104   2140    2166
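For reference, the `np.repeat` / `chain.from_iterable` pair above just flattens one nesting level while remembering which source row each sublist came from; a toy sketch:

import numpy as np
from itertools import chain

pages = ['p1', 'p2']
nested = [[['m', 1], ['i', 2]], [['.', 3]]]
# Repeat each page id once per sublist, and flatten the sublists in step.
flat_pages = np.repeat(pages, [len(x) for x in nested])
flat_lists = list(chain.from_iterable(nested))
print(list(zip(flat_pages, flat_lists)))
# [('p1', ['m', 1]), ('p1', ['i', 2]), ('p2', ['.', 3])]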

But:

df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

returns None values.

Best Answer

Once you have created the desired dataframe, after stacking, do not drop the index: it preserves your line numbers. Since this is a MultiIndex, take the first level; that is your line number.

df_data['LineIndex'] = df_data.index.get_level_values(0)

Then you can group by the LineIndex column and collect all the characters that share a LineIndex. This is built as a dictionary. Convert this dictionary into a dataframe and finally merge it with the actual data.
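A toy illustration of that point:

import pandas as pd

# After stack(), level 0 of the MultiIndex is the original row number.
s = pd.DataFrame([[1, 2], [3, 4]]).stack()
print(s.index.get_level_values(0).tolist())  # [0, 0, 1, 1]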


Solution 1


import pandas as pd

df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)
df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack()  # don't reset the index: it records the line each record came from
print(df_data)

df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column

cols = ['char','left','top','right','bottom','FromLine']
df_data.columns = cols  # assign the new column names

# build a dictionary:
# the line number is the key, all the characters from that line are the value
DictChar = {k: list(v) for k, v in df_data.groupby("FromLine")["char"]}

# convert the dictionary to a dataframe
df_chars = pd.DataFrame(list(DictChar.items()))
df_chars.columns = ['FromLine','char']

# merge the dataframes on column 'FromLine'
df_final = df_data.merge(df_chars, on='FromLine')
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_final.columns = cols
print(df_final)
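A toy illustration of the groupby-to-dictionary step, a sketch:

import pandas as pd

df = pd.DataFrame({'FromLine': [0, 0, 1], 'char': ['a', 'b', 'c']})
# One key per line number, with all of that line's characters as the value.
d = {k: list(v) for k, v in df.groupby('FromLine')['char']}
print(d)  # {0: ['a', 'b'], 1: ['c']}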

Solution 2


Personally, I prefer this solution to the first one. See the inline comments for more details.

import pandas as pd

df_data = pd.read_csv('list_characters.csv', header=None, usecols=[1], names=['character_position'])
df_data = df_data.character_position.str.strip('[]').str.split(', ', expand=True)

x = len(df_data.columns)  # total number of columns
# take the character from every 5th column, concatenate, and create a new column in df_data
df_data[x] = df_data[df_data.columns[::5]].apply(lambda row: ','.join(row.dropna()), axis=1)
# the index of each row is the line number of the record
df_data[x+1] = df_data.index.get_level_values(0)
# now set the line number and character columns as the index of the data frame
df_data.set_index([x+1, x], inplace=True, drop=True)

df_data.columns = [df_data.columns % 5, df_data.columns // 5]

df_data = df_data.stack()
df_data['FromLine'] = df_data.index.get_level_values(0)  # assign the line number to a column
df_data['all_chars_in_same_row'] = df_data.index.get_level_values(1)  # assign the joined characters to a column
cols = ['char','left','top','right','bottom','FromLine','all_chars_in_same_row']
df_data.columns = cols
df_data.reset_index(inplace=True)  # remove the multi-indexing
print(df_data[cols])

Output

      char  left   top right bottom  from line all_chars_in_same_row
0     '.'   203   213   191    198          0  ['.', '3', 'C']
1     '3'  1758  1775   370    391          0  ['.', '3', 'C']
2     'C'   296   305  1492   1516          0  ['.', '3', 'C']
3     'A'   275   347   147    239          1  ['A', 'M', 'D']
4     'M'  2166  2184   370    391          1  ['A', 'M', 'D']
5     'D'   339   362  1815   1840          1  ['A', 'M', 'D']
6     'A'    73    91   373    394          2  ['A', 'D', 'A']
7     'D'  1395  1415   427    454          2  ['A', 'D', 'A']
8     'A'  1440  1455  2047   2073          2  ['A', 'D', 'A']
9     'D'   454   473   663    685          3  ['D', 'O', '0']
10    'O'  1533  1545   487    541          3  ['D', 'O', '0']
11    '0'   339   360  2137   2163          3  ['D', 'O', '0']
12    'A'   108   129   727    751          4  ['A', 'V', 'I']
13    'V'  1659  1677   490    514          4  ['A', 'V', 'I']
14    'I'   339   360  1860   1885          4  ['A', 'V', 'I']
15    'N'    34    51   949    970          5  ['N', '/', '2']
16    '/'  1890  1904   486    505          5  ['N', '/', '2']
17    '2'  1266  1283  1951   1977          5  ['N', '/', '2']
18    'S'  1368  1401    43     85          6  ['S', 'A', '8']
19    'A'  1344  1361   583    607          6  ['S', 'A', '8']
20    '8'  2207  2217  1492   1515          6  ['S', 'A', '8']
21    'S'  1437  1457   112    138          7  ['S', 'o', 'O']
22    'o'  1548  1580   979   1015          7  ['S', 'o', 'O']
23    'O'  1331  1349   370    391          7  ['S', 'o', 'O']
24    'h'  1686  1703   315    339          8  ['h', 't', 't']
25    't'   169   190  1291   1312          8  ['h', 't', 't']
26    't'   169   190  1291   1312          8  ['h', 't', 't']
27    'N'  1331  1349   370    391          9  ['N', 'C', 'C']
28    'C'   296   305  1492   1516          9  ['N', 'C', 'C']
29    'C'   296   305  1492   1516          9  ['N', 'C', 'C']

Regarding "python - Get the line index of every extracted character from a csv file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43261258/
