Ruby algorithm for determining a valid HTML structure

Tags: ruby algorithm

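I have to take an array of hashes as input, where each hash describes an HTML tag (its start and end positions in the text, plus the opening and closing tag strings). I need to generate another array in which the tags appear in the correct order.

For example: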

input = [
         {start_p: 0, end_p: 100, start_t: '<p>', end_t: '</p>'},
         {start_p: 10, end_p: 50, start_t: '<p>', end_t: '</p>'},
         {start_p: 0, end_p: 100, start_t: '<span>', end_t: '</span>'},
         {start_p: 20, end_p: 30, start_t: '<em>', end_t: '</em>'},
         {start_p: 40, end_p: 50, start_t: '<em>', end_t: '</em>'},
         {start_p: 50, end_p: 60, start_t: '<em>', end_t: '</em>'},
         {start_p: 70, end_p: 80, start_t: '<em>', end_t: '</em>'},
         {start_p: 8, end_p: 99, start_t: '<strong>', end_t: '</strong>'}
        ]

expected_output: [<p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>]

The output should not contain only the tags, though; each entry should be a hash with a position and a tag, such as:

     {position: 0, tag: '<p>'}

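The most important thing is to get the order right, respecting the rule that HTML tags must not cross: if several tags end at the same position, the one opened last must be closed first; if one tag ends and another opens at the same position, the closing tag comes first; and so on.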
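The non-crossing rule can be checked mechanically with a stack, which is handy for verifying any candidate ordering. The following is only a minimal sketch; `properly_nested?` is a hypothetical helper name, and it assumes each tag is a plain string such as `'<p>'` or `'</p>'` without attributes.

```ruby
# Hypothetical helper: returns true when every closing tag matches the most
# recently opened tag that is still open, and nothing is left open at the end.
def properly_nested?(tags)
  stack = []
  tags.each do |tag|
    if tag.start_with?('</')
      # A closing tag must match the tag on top of the stack.
      return false unless stack.pop == tag.sub('</', '<')
    else
      stack.push(tag)
    end
  end
  stack.empty?
end

expected = '<p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>'
puts properly_nested?(expected.scan(/<[^>]+>/))  # prints true
```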

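This is part of a legacy system, so the input and output formats cannot be changed for now. Also, the input can be very large (hundreds of thousands of elements).

Is there a better solution than brute-force recursion?

Best answer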

input.group_by { |h| h[:start_p] }.
      values.
      flat_map do |a|
        x = 1.0
        a.flat_map do |h|
          x /= 2.0
          [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
        end
      end.sort_by(&:first).map(&:last).join
#=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"

The steps are as follows.

b = input.group_by { |h| h[:start_p] }
  #=> { 0=>[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
  #        {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
  #    10=>[{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
  #    20=>[{:start_p=>20, :end_p=>30, :start_t=>"<em>", :end_t=>"</em>"}],
  #    40=>[{:start_p=>40, :end_p=>50, :start_t=>"<em>", :end_t=>"</em>"}],
  #    50=>[{:start_p=>50, :end_p=>60, :start_t=>"<em>", :end_t=>"</em>"}],
  #    70=>[{:start_p=>70, :end_p=>80, :start_t=>"<em>", :end_t=>"</em>"}],
  #     8=>[{:start_p=> 8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]}
c = b.values
  #=> [[{:start_p=>0, :end_p=>100, :start_t=>"<p>", :end_t=>"</p>"},
  #     {:start_p=>0, :end_p=>100, :start_t=>"<span>", :end_t=>"</span>"}],
  #    [{:start_p=>10, :end_p=>50, :start_t=>"<p>", :end_t=>"</p>"}],
  #   ...
  #    [{:start_p=>8, :end_p=>99, :start_t=>"<strong>", :end_t=>"</strong>"}]]
d = c.flat_map do |a|
      x = 1.0
      a.flat_map do |h|
        x /= 2.0
        [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
      end
    end
  #=> [[0.5, "<p>"], [99.5, "</p>"], [0.25, "<span>"], [99.75, "</span>"],
  #    [10.5, "<p>"], [49.5, "</p>"], [20.5, "<em>"], [29.5, "</em>"],
  #    [40.5, "<em>"], [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"],
  #    [70.5, "<em>"], [79.5, "</em>"], [8.5, "<strong>"], [98.5, "</strong>"]]

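The first four elements (tuples) of d are the most important for understanding the approach I have taken: <p> and <span> both cover positions 0 to 100, and halving the offset for each successive tag in the group (0.5, then 0.25) gives them distinct sort keys, so the tag that sorts first on opening sorts last on closing.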
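To see the offset idea in isolation, here is a small sketch that applies it to just the two tags sharing positions 0 and 100. The method name `offset_events` is hypothetical and only wraps the inner block of the code above, without modifying the input hashes.

```ruby
# Hypothetical wrapper around the inner block of the answer's code.
# For one group of tags sharing a start position, it returns
# [sort_key, tag] pairs for every opening and closing tag.
def offset_events(group)
  x = 1.0
  group.flat_map do |h|
    x /= 2.0  # 0.5 for the first tag, 0.25 for the second, and so on
    [[h[:start_p] + x, h[:start_t]], [h[:end_p] - x, h[:end_t]]]
  end
end

group = [
  { start_p: 0, end_p: 100, start_t: '<p>',    end_t: '</p>' },
  { start_p: 0, end_p: 100, start_t: '<span>', end_t: '</span>' }
]

puts offset_events(group).sort_by(&:first).map(&:last).join
# prints <span><p></p></span>
```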

e = d.sort_by(&:first)
  #=> [[0.25, "<span>"], [0.5, "<p>"], [8.5, "<strong>"], [10.5, "<p>"],
  #    [20.5, "<em>"], [29.5, "</em>"], [40.5, "<em>"], [49.5, "</p>"],
  #    [49.5, "</em>"], [50.5, "<em>"], [59.5, "</em>"], [70.5, "<em>"],
  #    [79.5, "</em>"], [98.5, "</strong>"], [99.5, "</p>"], [99.75, "</span>"]]

f = e.map(&:last)
  #=> ["<span>", "<p>", "<strong>", "<p>", "<em>", "</em>", "<em>", "</p>",
  #    "</em>", "<em>", "</em>", "<em>", "</em>", "</strong>", "</p>", "</span>"]
f.join
  #=> "<span><p><strong><p><em></em><em></p></em><em></em><em></em></strong></p></span>"

If it would help, I can elaborate on the calculation of d above.
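Note that in the output above, `</p>` is emitted before `</em>` around position 50, because both closing tags receive the same key, 49.5. One possible way to avoid such ties, offered only as a sketch and not as part of the original answer, is to sort on an explicit composite key instead of fractional offsets. The helper name `ordered_tags` is hypothetical, and the sketch assumes that `start_p < end_p` for every tag and that no two ranges partially overlap. It returns hashes in the `{position:, tag:}` format requested in the question, and a single sort keeps the cost at O(n log n).

```ruby
# Hypothetical helper, not part of the original answer.
# Assumes start_p < end_p for every tag and no partially overlapping ranges.
def ordered_tags(input)
  events = input.each_with_index.flat_map do |h, i|
    [
      # Opening tag: at a shared position it sorts after closings (1 > 0);
      # among openings, the tag that ends later opens first, and remaining
      # ties fall back to input order.
      [[h[:start_p], 1, -h[:end_p], i], { position: h[:start_p], tag: h[:start_t] }],
      # Closing tag: among closings, the tag that opened later closes first,
      # and remaining ties use reverse input order.
      [[h[:end_p], 0, -h[:start_p], -i], { position: h[:end_p], tag: h[:end_t] }]
    ]
  end
  events.sort_by(&:first).map(&:last)
end

input = [
  { start_p: 0,  end_p: 100, start_t: '<p>',      end_t: '</p>' },
  { start_p: 10, end_p: 50,  start_t: '<p>',      end_t: '</p>' },
  { start_p: 0,  end_p: 100, start_t: '<span>',   end_t: '</span>' },
  { start_p: 20, end_p: 30,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 40, end_p: 50,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 50, end_p: 60,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 70, end_p: 80,  start_t: '<em>',     end_t: '</em>' },
  { start_p: 8,  end_p: 99,  start_t: '<strong>', end_t: '</strong>' }
]

puts ordered_tags(input).map { |e| e[:tag] }.join
# prints <p><span><strong><p><em></em><em></em></p><em></em><em></em></strong></span></p>
```

Because the sort keys are arrays of integers, Ruby compares them element by element, so no floating-point offsets are needed and the number of tags sharing a position is not limited by float precision.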

This question about a Ruby algorithm for determining a valid HTML structure comes from Stack Overflow: https://stackoverflow.com/questions/45309534/