python - Highlighting the differences between two HTML strings

Tags: python html python-3.x

I have 2 HTML strings with several small differences:

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span class="redFont">01448795</span> Advanced 
Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members 
needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td 
class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>

I'm trying to find the differences between the two strings. I need to return the second string with every difference highlighted using <mark> tags.

This is a bit hard to explain, so here are some examples:

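If the first string contains the text <span>This is a string</span> and the second has <span>Thiss is a string</span>, I want to get back <span><mark>Thiss is a string</mark></span>. If the first string has the text <p>36</p> and the second has <p>3</p>, I want to get back <p><mark>3</mark></p>.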

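Note that the <mark> tag is inserted right after the nearest > to the left of the difference, and </mark> is inserted right before the nearest < to the right of the difference.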

I'm sure this is possible, but I can't seem to find a working way to do it. Here is what I have so far:

skew=0
prev_i = []
highlighted_area_info = my_second_html_string
diff = difflib.ndiff(my_first_html_string, my_second_html_string)
for i,s in enumerate(diff, start=0):
    if s[0]==' ':
        continue
    else:
        if i in prev_i:
             continue
        count_right = my_second_html_string[i].find('<')
        
        count_left = 0
        for a, b in reversed(list(enumerate(my_second_html_string))):
            if a < i:
                if b == ">":
                    break
                else:
                    count_left += 1
                
        highlighted_area_info2 = highlighted_area_info[:i-count_left+skew]
        highlighted_area_info2 += highlight_beginning
        highlighted_area_info2 += highlighted_area_info[i-count_left+skew:i+count_right+skew]
        highlighted_area_info2 += highlight_end
        highlighted_area_info2 += highlighted_area_info[i+count_right+skew:]
        skew += len(highlight_beginning)+len(highlight_end)
        highlighted_area_info = highlighted_area_info2
        prev_i = list(range(i-count_left+skew, i+count_right+skew))
print(highlighted_area_info)

Unfortunately, the <mark></mark> tags are inserted in the wrong places, which produces something like this: <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</</ma<mark>rk>s</mark>pan></td> instead of <td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder"><mark>0</mark></span></td>, which is what I expected.

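I've already spent a few days on this, and I'm still not sure what I'm doing wrong, although something is clearly off. My code may also not be the most efficient way to reach my goal.

I need working code within a few days, so any help is greatly appreciated.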

Best answer

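I used print() to check the values of the variables in your code, and I found that you call ndiff(string1, string2), but it expects ndiff(list_of_lines1, list_of_lines2). So it treats your strings as lists of characters and compares each character separately. As a result it puts a <mark> around every changed character instead of a single <mark> around the whole word.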
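A small sketch of that behaviour, using the <p>36</p> example from the question (the variable names are only for illustration). It also shows a second consequence: the position in the diff output counts removed characters too, so it drifts away from the character position in the second string.

```python
import difflib

old = "<p>36</p>"
new = "<p>3</p>"

# Plain strings are iterated character by character,
# so every diff entry describes a single character.
char_diff = list(difflib.ndiff(old, new))
print(char_diff)

# Wrapping each string in a list makes ndiff compare whole lines instead.
line_diff = list(difflib.ndiff([old], [new]))
print(line_diff)

# The diff has one entry per character of `old` here, including the
# removed "6", so its indexes do not match positions in `new`.
print(len(char_diff), len(new))
```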

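I tried changing this to single-line lists, ndiff([string1], [string2]), and other variations, but in the end I gave up because it didn't make sense. You would rather need to use lxml or BeautifulSoup to parse the HTML into a tree of tags/nodes and then compare the text nodes.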
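Short of a real parser, the idea of comparing whole text runs rather than single characters can be approximated with the standard library alone. The following is only a rough sketch: the function name highlight_changes and the regular expression are assumptions made for illustration, and the naive tokenizer breaks on a > inside an attribute value, so it is not a substitute for lxml or BeautifulSoup.

```python
import difflib
import re

# Naive tokenizer (an assumption for this sketch): a token is either
# something that looks like a tag, or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]*>|[^<]+")


def highlight_changes(old_html, new_html):
    """Return new_html with changed text runs wrapped in <mark>."""
    old_tokens = TOKEN_RE.findall(old_html)
    new_tokens = TOKEN_RE.findall(new_html)
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)

    out = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        for token in new_tokens[j1:j2]:
            is_text = not token.startswith("<")
            # Mark only text that was replaced or inserted;
            # tags and whitespace-only runs are copied unchanged.
            if op != "equal" and is_text and token.strip():
                out.append("<mark>" + token + "</mark>")
            else:
                out.append(token)
    return "".join(out)


print(highlight_changes("<span>This is a string</span>",
                        "<span>Thiss is a string</span>"))
print(highlight_changes("<p>36</p>", "<p>3</p>"))
```

Because whole text runs are the unit of comparison, the <mark> lands right after the nearest > and the </mark> right before the nearest <, which is the placement the question asks for.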


I found the module xmldiff, which uses lxml and generates a list of changes between two XML or HTML documents.

import xmldiff.main

all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)

Each change provides an xpath, so I use lxml to find the node and replace text with <mark>text</mark>.

It can report different kinds of changes, but I only need UpdateTextIn (when the text is inside a tag, i.e. <a>new text</a>) and UpdateTextAfter (when the text comes after a tag, i.e. <a>...</a>new text).

highlighted_tree = lxml.etree.fromstring(my_second_html_string)

for item in all_changes:

    highlighted_node = highlighted_tree.xpath(item.node)[0]

    if isinstance(item, xmldiff.actions.UpdateTextIn):
        # text inside the tag: clear it and re-add it wrapped in <mark> as the first child
        highlighted_node.text = ''
        highlighted_node.insert(0, lxml.etree.fromstring('<mark>' + item.text + '</mark>'))

    if isinstance(item, xmldiff.actions.UpdateTextAfter):
        # text after the tag (tail): it has to be cleared before `addnext`
        highlighted_node.tail = ''
        highlighted_node.addnext(lxml.etree.fromstring('<mark>' + item.text + '</mark>'))
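The two cases map onto the text/tail model that lxml shares with the standard library's xml.etree.ElementTree. The sketch below uses ElementTree so that it runs without extra packages; the helper names mark_text and mark_tail are made up for illustration. It builds the <mark> element directly instead of parsing a concatenated string, which avoids a parse error when the changed text contains characters such as & or <.

```python
import xml.etree.ElementTree as ET


def mark_text(node):
    """Wrap the text inside `node` (<a>text</a>) in a <mark> child."""
    mark = ET.Element("mark")
    mark.text = node.text
    node.text = None
    node.insert(0, mark)


def mark_tail(parent, node):
    """Wrap the text after `node` (<a>...</a>text) in a <mark> sibling."""
    mark = ET.Element("mark")
    mark.text = node.tail
    node.tail = None
    # ElementTree has no addnext(), so insert right after `node` in its parent.
    parent.insert(list(parent).index(node) + 1, mark)


root = ET.fromstring("<div><a>inside</a>after</div>")
a = root.find("a")
mark_text(a)
mark_tail(root, a)
result = ET.tostring(root, encoding="unicode")
print(result)
```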

After that I convert the tree back to HTML:

html = lxml.etree.tostring(highlighted_tree)

print(html.decode())

Minimal working example with data

import xmldiff.main     # diff_texts
import xmldiff.actions  # UpdateTextIn, UpdateTextAfter
import lxml.etree

my_first_html_string = '''<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goal4s</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">1</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">9</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span 
class="redFont">01448795</span> Advanced Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusivey Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">7</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">1</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th 
class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmsasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th 
class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>'''
my_second_html_string = '''<tbody class="Expanded4" id="divisionG_area24_clubs"><!--<tr><th class='noBorderLeftRight'></th>--><th class="noBorderLeftRight" colspan="6"></th><th colspan="6"><table style="margin-bottom:auto;" width="100%"><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_blue Grid_Table" colspan="2">Membership</th><th class="Grid_top_blue Grid_Table" colspan="1">Goals</th><th class="Grid_Title_top_black grid_blue_border" colspan="6">Education</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Mem.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Trn.</th><th class="Grid_Title_top_black grid_blue_border" colspan="2">Rn.|Lst.</th></tr><tr><th class="noBorderTopLeft" style="width:auto;"></th><th class="Grid_top_black_max Grid_Table">Base</th><th class="Grid_top_black_max Grid_Table">To Date</th><th class="Grid_top_black_max Grid_Table blue_border_right">Met</th><th class="Grid_top_black" title="Four Level 1 awards">1</th><th class="Grid_top_black" title="Two Level 2 awards">2</th><th class="Grid_top_black" title="Two more Level 2 awards">3</th><th class="Grid_top_black" title="Two Level 3 awards">4</th><th class="Grid_top_black" title="One Level 4, Level 5, or DTM award">5</th><th class="Grid_top_black" title="One more Level 4, Level 5, or DTM award">6</th><th class="Grid_top_black max22" title="4 New members">7</th><th class="Grid_top_black max22" title="4 More new members">8</th><th class="Grid_top_black max22" title="4 Officers trained first training period">9a</th><th class="Grid_top_black max22" title="4 Officers trained second training period">9b</th><th class="Grid_top_black max22" title="1 Dues-renewal on time">10a</th><th class="Grid_top_black max22" title="1 officer list on time">10b</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=01448795'"><td class="Grid_Title_top5 min280 crop" title="Advanced Speakers on the Hill"> <span 
class="redFont">01448795</span> Advanced Speakers on the Hill</td><th class="Grid_Table_yellow"><span>29<span></span></span></th><td class="Grid_Table title_gray"><span>30<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">2</span></td><th class="Grid_Title_goal" title="3 Level 1s needed">1</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">7</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02194262'"><td class="Grid_Title_top5 min280 crop" title="Inclusive Toastmasters"> <span class="redFont">02194262</span> Inclusive Toastmasters</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th 
class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">5</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=02785335'"><td class="Grid_Title_top5 min280 crop" title="Club Toastmasters FrancoFun"> <span class="redFont">02785335</span> Club Toastmasters FrancoFun</td><th class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>21<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">1</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goalAchieved" title="Achieved">1</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr><tr class="Grid_Top_Row club_gray" onclick="window.location.href='ClubReport.aspx?id=04437661'"><td class="Grid_Title_top5 min280 crop" title="Feel Good Toastmasters"> <span class="redFont">04437661</span> Feel Good Toastmasters</td><th 
class="Grid_Table_yellow"><span>21<span></span></span></th><td class="Grid_Table title_gray"><span>22<span></span></span></td><td class="Grid_Table x_light_gray blue_border_right"><span class="chart_table_big_numbers goalsMetBorder">0</span></td><th class="Grid_Title_goal" title="4 Level 1s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 2s needed">0</th><th class="Grid_Title_goal" title="2 Level 3s needed">0</th><th class="Grid_Title_goal" title="1 Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="1 more Level 4, Level 5, or DTM needed">0</th><th class="Grid_Title_goal" title="3 New Members needed">1</th><th class="Grid_Title_goal" title="4 New Members needed">0</th><th class="Grid_Title_goalAchieved" title="First Training Period Achieved">6</th><th class="Grid_Title_goal" title="Second Training Period 4 needed">0</th><th class="Grid_Title_goal" title="On-time dues-renewal needed">0</th><th class="Grid_Title_goalAchieved" title="On-time officer list Achieved">1</th></tr></table></th></tbody>'''

#my_first_html_string =  '''<html>test1 <p>325</p><div>This</div> testA</html>'''
#my_second_html_string = '''<html>test2 <p>3</p><div>Thiss</div> testB</html>'''

all_changes = xmldiff.main.diff_texts(my_first_html_string, my_second_html_string)

#old_tree = lxml.etree.fromstring(my_first_html_string)
#new_tree = lxml.etree.fromstring(my_second_html_string)
highlighted_tree = lxml.etree.fromstring(my_second_html_string)

for item in all_changes:
    #print('item:', item)
    #print('item.xpath:', item.node)
    #print('item.text:', item.text)
    #old_node = old_tree.xpath(item.node)[0]
    #new_node = new_tree.xpath(item.node)[0]
    #print('old node:', lxml.etree.tostring(old_node))
    #print('new node:', lxml.etree.tostring(new_node))
    #print('old text and tail:', [old_node.text, old_node.tail])
    #print('new text and tail:', [new_node.text, new_node.tail])
    
    highlighted_node = highlighted_tree.xpath(item.node)[0]
    
    if isinstance(item, xmldiff.actions.UpdateTextIn):
        print('changed text:', item.text)
        highlighted_node.text = ''
        highlighted_node.insert(0, lxml.etree.fromstring('<mark style="background:red">' + item.text + '</mark>'))

    if isinstance(item, xmldiff.actions.UpdateTextAfter):
        print('changed tail:', item.text)
        highlighted_node.tail = '' # has to be removed before `addnext`
        highlighted_node.addnext(lxml.etree.fromstring('<mark style="background:red">' + item.text + '</mark>'))
    
    print('---')

html = lxml.etree.tostring(highlighted_tree)
html = html.decode()
print(html)

with open('output.html', 'w') as f:
    f.write(html)

Result: the rendered output.html shows the changed text highlighted in red (screenshot not included).


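The only problem is that sometimes the old and new text are identical except for a different number of spaces, tabs, or newlines, and this is also reported as a change. Such changes could be skipped, but that needs extra code.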
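One possible shape for that extra code is a whitespace-insensitive comparison, sketched below. The helper names normalize_ws and is_real_change are hypothetical, and the sketch assumes the old text is available, for example from a parsed copy of the first string such as the commented-out old_tree in the example above.

```python
def normalize_ws(text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    # `text` may be None when a node has no text or tail.
    return " ".join((text or "").split())


def is_real_change(old_text, new_text):
    """True only if the texts differ by more than whitespace."""
    return normalize_ws(old_text) != normalize_ws(new_text)


print(is_real_change("Advanced \nSpeakers on the Hill",
                     "Advanced Speakers  on the Hill"))
print(is_real_change("36", "3"))
```

A change would then be highlighted only when is_real_change returns True for the old and new text of that node.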

Regarding python - highlighting the differences between two HTML strings, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63770166/
