python - Scraping with cssselect

Tags: python css regex css-selectors

I am trying to practice with cssselect. Unfortunately, it only works when I read the <h1>. Let me show you an example:

The HTML I am using for this exercise is as follows:

<h1 class="lbl_titulo">3 Bedroom House<span class="subtit"> - Well located house in  the heart of Lapa (Lisbon) </span></h1>

    <div class="bloco-imovel-dados">

                        <div class="bloco-imovel-resumo-dados">
                            <div id="Cpl_modulodadosresumidos_module_holder"     class="modulo-dados-resumidos">

    <h2 class="lbl_descricao_dados">Property Information</h2>

    <ul class="bloco-dados">

        <li>
            <b>Condition:</b> <span>Renewed</span></li>
        <li>
            <b>Living Area:</b><span> 80 m<sup>2</sup></span></li>
        <li>
            <b>Total Area:</b><span> 0 m<sup>2</sup></span></li>
        <li>
            <b>Bathrooms:</b><span> 1 </span></li>
        <li>
            <b>Bedrooms:</b><span> 2 </span></li>
        <li>
            <b>Energy Rating:</b><span> C</span></li>
        <li>
            <b>Construction year:</b><span> 1997</span></li>
        <li>
            <b>ID Property:</b><span> CAS.10.13286</span></li>

    </ul>

</div>

    <div class="row pref-search-results"><script>setTimeout(function(){ $(".pref-search-results").addClass("pref-search-result").removeClass("pref-search-results") }, 1000)</script>
        <div class="col-lg-4 col-md-4 col-sm-6 col-xs-12"><a class="pref-property-container" href="http://xxxxxxxxx.com/link-to-image" style="height: 420px;">
            <div class="pref-teaser-shadow"></div>
            <div class="pref-teaser-image">
                <div class="pref-teaser-icons-container bottom">
                    <div class="pref-watchlist-teaser-icon-container">
                        <div class="pref-teaser-icon pref-watchlist-icon active initial-hide"></div>
                        <div class="pref-teaser-icon pref-watchlist-icon inactive "></div>
                    </div>
                </div>
                <div class="pref-teaser-shadow"></div>
                <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Ranch_style_home_in_Salinas%2C_California.JPG/220px-Ranch_style_home_in_Salinas%2C_California.JPG" title="Link to Image" alt="Link Image">
            </div>
        </a></div>

                    </div>
                    <div class="bloco-imovel-caracteristicas-holder">
                        <div id="Cpl_modulocaracteristicas_module_holder" class="modulo-caracteristicas">
    <a class="pref-property-container" href="http://xxxxxxxxxx.com/link-to-listing" style="height: 420px;">Link 1</a>
    <a class="pref-property-container" href="http://xxxxxxxxxx.com/link-to-listing" style="height: 420px;">Link 2</a>
    <h3 class="lbl_titulo_caracteristicas">Features</h3>
    <div id="Cpl_modulocaracteristicas_div_caracteristicas_gerais" class="modulo-caracteristicas-item  open">
        <ul class="modulo-caracteristicas-conteudo js-caracteristicas-holder">
            <li>
                <span id="features">Garden</span>
            </li>
            <li>
                <span id="features">Gas Heating</span>
            </li>
            <li>
                <span id="features">2 garages</span>
            </li>
            <li>
                <span id="features">Large pool</span>
            </li>
        </ul>
    </div>


</div>

                    </div>

                    <div class="bloco-imovel-texto">
                        <h3 class="lbl_description">
                            Description </h3>
                        <p>At vero eos et accusamus et iusto odio dignissimos ducimus qui blanditiis praesentium voluptatum deleniti atque corrupti quos dolores et quas molestias excepturi sint occaecati cupiditate non provident, similique sunt in culpa qui officia deserunt mollitia animi, id est laborum et dolorum fugaEt harum quidem rerum facilis est et expedita distinctio.Nam libero tempore, cum soluta nobis est eligendi optio cumque nihil impedit quo minus id quod maxime placeat facere possimus, omnis voluptas assumenda est, omnis dolor repellendus.</p>
                    </div>

                     <div class="bloco-imovel-content">
            <!-- Galeria -->
            <div class="bloco-imovel-galeria">
                <div id="Cpl_modulogaleriavertical_module_holder" class="modulo-galeria-vertical">
            <span id="Cpl_modulogaleriavertical_lbl_galeria" class="lbl_galeria"><b>Relevant Information</b></span>

</div>

                    <div id="Cpl_pnl_mapa" class="pnl_mapa">

                        <span id="Cpl_lbl_mapa" class="lbl_mapa"><b>Location:</b></span>
                        <span id="Cpl_lbl_morada" class="lbl_morada">Portugal, Lisboa, Estrela, Lapa</span>
                        <div id="Cpl_modulomapa_module_holder" class="modulo-mapa">

    <div class="bloco-mapa">
        <div id="Cpl_modulomapa_mapa" class="mapa"></div>
        <div id="map-canvas" data-coorgps="36.5194999,-4.7743365"></div> 
    </div>

    </div>


</div>
                </div>
            </div>
            <div class="bloco-imovel-sidebar">
                <div id="Cpl_moduloinformacaolateral_module_holder" class="modulo-informacao-lateral">

    <div class="informacao">
        <div class="info1">
            <div class="lbl_titulo">Agent John Doe</div>
            <div class="lbl_subtitulo">
            Contact Information
            </div>
            <div class="lbl_resumo">
            Phone Number: 0800-1111<br><b><a href="mailto:casa@casa.pt">casa@casa.pt</a></b><br><b></b>
            </div>
        </div>
    </div>

</div>

<a class="pref-property-container" href="http://xxxxxxxxxx.com/link-to-listing" style="height: 420px;">Link 3</a>

</div></div></div>

So of course I install:

!pip install lxml
!pip install cssselect
from lxml import html,etree

I open and read the file:

with open(r'listing.html', "r") as f:
    page = f.read()
tree = html.fromstring(page)

Then I want to scrape the following information:

Name of the property
Number of bathrooms
Number of bedrooms
Living Area
Energy Rating
Description
Agent Name
Extract the location of the property

So what I tried is:

Selector_Example = "h1.lbl_titulo"
print('Example -> Property type: {}'.format(tree.cssselect(Selector_Example)[0].text))

And I got the answer:

Example -> Property type: 3 Bedroom House

When I try the other examples, I always run into problems:

#Number of bathrooms
Selector_1 = "li:nth-child(1)"
print('Bathrooms: {}'.format(tree.cssselect(Selector_1)[0].text))
print('')
# this returns just the word Bathrooms: but I tried to use `li.b` and does not work as well.

#Number of bedrooms
Selector_2 = "li:nth-child(2)"
print('Bedrooms: {}'.format(tree.cssselect(Selector_2)[0].text))
print('')
# this returns just the word Bedrooms: but I tried to use `li.b` and does not work as well.

#Living Area
Selector_3 = "li:nth-child(3)"
print('Total area: {}'.format(tree.cssselect(Selector_3)[0].text))
print('')
# this returns just the words Living Area: but I tried to use `li.b` and does not work as well.

#Energy Rating
Selector_4 = "li:nth-child(4)"

#Description
Selector_5 = "h3.lbl_description"
print('Description: {}'.format(tree.cssselect(Selector_5)[0].text))
print('')
# This returns the word Description: Description but not the description.



#Agent Name
Selector_6 = "div.lbl_titulo"
print('Agent name: {}'.format(tree.cssselect(Selector_6)[0].text))
print('')
# This does in fact give me the agent name: Agent John Doe

#Extract the location of the property
Selector_7 = "div.Cpl_lbl_morada.lbl_morada"
print('Location: {}'.format(tree.cssselect(Selector_7)[0].text))
# I got direct : IndexError: list index out of range

Does anyone know what I am doing wrong and how to fix this code? Thanks in advance for your suggestions!

Best answer

Your example:

#Number of bedrooms
Selector_2 = "li:nth-child(2)"
print('Bedrooms: {}'.format(tree.cssselect(Selector_2)[0].text))

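As far as I can tell, in this case "li:nth-child(2)" selects the second li element, and its .text is '\n ' (only the whitespace in front of its <b> child). Likewise, "li:nth-child(3)" selects the third li. If you want Bedrooms, you need to use: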

#Number of bedrooms
Selector_2 = "li:nth-child(5) span"
print('Bedrooms: {}'.format(tree.cssselect(Selector_2)[0].text))

li:nth-child(5) because Bedrooms is the 5th li element, and span because the information you need is in the span element inside that li.

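Edit: The strings you assign to the Selector_ variables are CSS, so you can use the same notation as in a CSS file. That means li.b stands for <li class='b'>, not for a <b> inside an <li>; the latter would be written li b.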

Regarding python - Scraping with cssselect, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57822677/
