regex - Which regex tag to use in a Mechanize function?

I retrieved every link on a web page whose URL contains /title/tt and stored them in a list.

my @url_links= $mech->find_all_links( url_regex => qr/title\/tt/i );

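But the list is too long, so I would like to filter it further in find_all_links: the link must also sit inside a tag that starts with <id="actor-tt...">. This is where the link (/title/tt...) appears in the page source as retrieved in cmd.exe: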

<div class="filmo-row odd" id="actor-tt0361748">
<span class="year_column">
&nbsp;2009
</span>
<b><a href="/title/tt0361748/"
>Inglourious Basterds</a></b>
<br/>
Lt. Aldo Raine
</div>

I think you have to use tag_regex, but I don't know how, because the command prompt seems to ignore tag_regex when I pass it.

Best answer

Use HTML::TreeBuilder and HTML::Element instead of Mechanize:

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

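# Slurp the sample HTML from the __DATA__ section below.
my $html_string = join "", <DATA>;

my $tree = HTML::TreeBuilder->new_from_content($html_string);

# Read the chain bottom-up: find every element whose id starts with "actor-tt",
# then every element inside it whose href contains /title/tt,
# then extract that href value.
my @url_links = map { $_->attr_get_i("href") }
                map { $_->look_down(href => qr{/title/tt}) }
                $tree->look_down(id => qr/^actor-tt/);

# Print one link per line.
say for @url_links;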

__DATA__
<div class="filmo-row odd" id="actor-tt0361748">
    <span class="year_column">
      &nbsp;2009
    </span>
    <b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
    <br/>
    Lt. Aldo Raine
</div>
<div id="not-the-right-id">
    <a href="/title/tt-looks-correct-but-wrong-id/"></a>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
    <b><a href="/title/tt0123456/">Another movie</a></b>
</div>
<div class="filmo-row odd" id="actor-tt0123456">
    the id will match, but no href in here
</div>

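$tree->look_down(id => qr/^actor-tt/) finds all elements whose id starts with actor-tt. Then $_->look_down(href => qr{/title/tt}) finds, within each of those, all elements whose href attribute matches /title/tt. Finally, $_->attr_get_i("href") returns the value of that href attribute.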
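The three steps can also be written as explicit loops inside a small helper, which may be easier to follow than the chained map calls. This is only a sketch and not part of the original answer: the sub name actor_title_links is hypothetical, the extra _tag => 'a' criterion is an added assumption that only anchor elements are wanted, and attr is used in place of attr_get_i because the href is read directly from the matched element.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical helper (not from the original answer): takes an HTML string
# and returns the href of every <a> found inside an element whose id
# starts with "actor-tt".
sub actor_title_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @links;
    # Step 1: every element whose id attribute starts with "actor-tt".
    for my $row ( $tree->look_down( id => qr/^actor-tt/ ) ) {
        # Step 2: every <a> below it whose href contains /title/tt.
        for my $anchor ( $row->look_down( _tag => 'a', href => qr{/title/tt} ) ) {
            # Step 3: read the href value from that element.
            push @links, $anchor->attr('href');
        }
    }

    $tree->delete;    # release the parse tree once the links are copied out
    return @links;
}

# Reduced sample input, modelled on the markup shown in the question.
my $html = <<'HTML';
<div class="filmo-row odd" id="actor-tt0361748">
  <b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
</div>
<div id="not-the-right-id">
  <a href="/title/tt9999999/">Ignored</a>
</div>
HTML

say for actor_title_links($html);
```

If the page is still fetched with WWW::Mechanize, the string passed to the helper could presumably be $mech->content after a successful get. As far as the WWW::Mechanize documentation describes it, tag_regex in find_all_links matches the name of the tag a link came from (a, frame and so on), which would explain why it cannot filter on the id of an enclosing div.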
You may be interested in the new_from_file or new_from_url methods of HTML::TreeBuilder instead of new_from_content, which I used here.
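As an illustration of the new_from_file variant, here is a minimal sketch that writes a reduced copy of the sample markup to a temporary file and parses it from there. The temporary file, the use of File::Temp, and the _tag => 'a' criterion are assumptions made for the example; in practice the filename would be a page you have already saved.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempfile);
use HTML::TreeBuilder;

# Write sample markup to a temporary file (stand-in for a saved page).
my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} <<'HTML';
<div class="filmo-row odd" id="actor-tt0361748">
  <b><a href="/title/tt0361748/">Inglourious Basterds</a></b>
</div>
<div id="not-the-right-id">
  <a href="/title/tt9999999/">Ignored</a>
</div>
HTML
close $fh or die "Cannot close $filename: $!";

# new_from_file builds the tree straight from a file name.
my $tree = HTML::TreeBuilder->new_from_file($filename);

# Same filtering chain as in the answer, restricted to <a> elements.
my @url_links = map { $_->attr('href') }
                map { $_->look_down( _tag => 'a', href => qr{/title/tt} ) }
                $tree->look_down( id => qr/^actor-tt/ );

$tree->delete;    # release the parse tree

say for @url_links;
```

new_from_url should work the same way with a URL in place of the file name, but it fetches the page over the network, so it is left out of this sketch.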

For more on "regex - Which regex tag to use in a Mechanize function?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/63406019/
