php - Scrape specific elements inside a DIV from an external page

Tags: php html parsing dom html-parsing

I need to scrape the following elements inside each of these divs with class="product-grid-item" (the page contains several of them), but the truth is I don't know how to do it... so I need some help before I pull my hair out.

1 - The link and the image inside the div with class="product-element-top2":

<a href="https://...this_link" class="product-image-link"> (just the link)

<img width="300" height="300" src="https://...this_image_url... (just this image URL)

2 - The title inside the h3 tag, as follows:

<h3 class="wd-entities-title"><a href="https://...linkhere">The title goes here (just the title)

3 - Last but not least, I need to get the price inside this:

<span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">€</span>20,00</bdi></span></span> (just the "€20,00")

Here is the complete HTML:

<div class="product-grid-item" data-loop="1">

<div class="product-element-top">
    <a href="https://...linkhere" class="product-image-link">
        <img width="300" height="300" src="https://image-goes-here.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">    </a>
    
    <div class="top-information wd-fill">

        <h3 class="wd-entities-title"><a href="https://...linkhere">The title goes here</a></h3>        
                
        
    <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">€</span>20,00</bdi></span></span>

        <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
            <a href="https://...linkhere" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop"><span>Options</span></a></div> 
    </div>

    <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                            <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
            </div>
    <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                <a href="https://...linkhere" class="open-quick-view quick-view-button">quick view</a>
            </div>
                            <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                <a class="" href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
            </div>
            </div>
                <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon"><a href="#" rel="nofollow noopener">Close</a></div>
                <div class="quick-shop-form wd-scroll-content">
                </div>
            </div>
        </div>
</div>

One of my clumsy attempts:

$html = file_get_contents("https://url-here.goetohere");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'product-grid-item';
$classname = 'product-element-top2';
$classname = 'product-element-top2';
$classname = 'wd-entities-title';
$classname = 'price';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
    echo 'here »» ' . htmlentities($node->nodeValue) . '<br>';
}

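Best answer

Assuming the HTML is fetched correctly before any DOM processing is attempted, it is fairly straightforward to build some basic XPath expressions that find the indicated content.

Following the remark that the page contains several of them, the sample HTML here has 2 product-grid-item divs, as you will notice in the output.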
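The "fetched correctly" assumption is worth checking in code, because file_get_contents() returns false on failure and DOMDocument will then have nothing useful to parse. The following is a minimal sketch of such a check, not part of the original answer; the helper name fetchHtml, the timeout and the User-Agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: fetch a page and fail loudly instead of handing `false` to DOMDocument.
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    // The User-Agent value is an arbitrary placeholder; some sites reject requests without one.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)',
        ],
    ]);

    // file_get_contents() returns false on failure; @ silences the warning so we can throw instead.
    $html = @file_get_contents($url, false, $context);

    if ($html === false || trim($html) === '') {
        throw new RuntimeException("Could not fetch any HTML from {$url}");
    }

    return $html;
}
```

It would be called as $html = fetchHtml("https://url-here.goetohere"); inside a try/catch block. If the target site builds its product grid with JavaScript, the fetched HTML may not contain the product-grid-item divs at all, which is something to rule out before debugging the XPath.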

$html='
    <div class="product-grid-item" data-loop="1">
        <div class="product-element-top">
            <a href="https://...linkhere" class="product-image-link">
                <img width="300" height="300" src="https://image-goes-here.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">
            </a>
            <div class="top-information wd-fill">
                <h3 class="wd-entities-title">
                    <a href="https://...linkhere">The title goes here</a>
                </h3>
                <span class="price">
                    <span class="woocommerce-Price-amount amount">
                        <bdi>
                            <span class="woocommerce-Price-currencySymbol">€</span>20,00
                        </bdi>
                    </span>
                </span>
                <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
                    <a href="https://...linkhere" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop">
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                    <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
                </div>
                <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                    <a href="https://...linkhere" class="open-quick-view quick-view-button">quick view</a>
                </div>
                <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                    <a class="" href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon">
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div class="quick-shop-form wd-scroll-content"></div>
            </div>
        </div>
    </div>
    
    <div class="product-grid-item" data-loop="1">
        <div class="product-element-top">
            <a href="https://www.example.com/banana" class="product-image-link">
                <img width="300" height="300" src="https://www.example.com/kittykat.jpg" class="attachment-woocommerce_thumbnail size-woocommerce_thumbnail">
            </a>
            <div class="top-information wd-fill">
                <h3 class="wd-entities-title">
                    <a href="https://www.example.com/womble">Oh look, another title!</a>
                </h3>
                <span class="price">
                    <span class="woocommerce-Price-amount amount">
                        <bdi>
                            <span class="woocommerce-Price-currencySymbol">€</span>540,00
                        </bdi>
                    </span>
                </span>
                <div class="wd-add-btn wd-add-btn-replace woodmart-add-btn">
                    <a href="https://www.example.com/gorilla" data-quantity="1" class="button product_type_variable add_to_cart_button add-to-cart-loop">
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div class="wd-buttons wd-pos-r-t color-scheme-light woodmart-buttons">
                <div class="wd-compare-btn product-compare-button wd-action-btn wd-style-icon wd-compare-icon">
                    <a href="https://www.example.com/buy" data-added-text="Compare Products">Buy</a>
                </div>
                <div class="quick-view wd-action-btn wd-style-icon wd-quick-view-icon wd-quick-view-btn">
                    <a href="https://www.example.com/view" class="open-quick-view quick-view-button">quick view</a>
                </div>
                <div class="wd-wishlist-btn wd-action-btn wd-style-icon wd-wishlist-icon woodmart-wishlist-btn">
                    <a class="" href="https://www.example.com/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div class="quick-shop-wrapper wd-fill wd-scroll">
                <div class="quick-shop-close wd-action-btn wd-style-text wd-cross-icon">
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div class="quick-shop-form wd-scroll-content"></div>
            </div>
        </div>
    </div>';

Processing the downloaded HTML

# set the libxml parameters and create new DOMDocument/XPath objects.
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
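# prefix an XML encoding declaration so loadHTML reads the markup as UTF-8 (otherwise "€" may come out garbled)
$dom->loadHTML( '<?xml encoding="UTF-8">' . $html );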
libxml_clear_errors();

$xp=new DOMXPath( $dom );

# some basic XPath expressions
$exprs=(object)array(
    'product-link'      =>  '//a[@class="product-image-link"]',
    'product-img-src'   =>  '//a[@class="product-image-link"]/img',
    'h3-title-text'     =>  '//h3[@class="wd-entities-title"]',
    'price'             =>  '//span[@class="price"]/span/bdi'
);
# find the keys (for convenience) to be used below
$keys=array_keys( get_object_vars( $exprs ) );

# store results here
$res=array();

# loop through all patterns and issue XPath query.
foreach( $exprs as $key => $expr ){
    # add key to output and set as an array.
    $res[ $key ]=[];
    $col=$xp->query( $expr );
    
    # find the data if the query succeeds
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            switch( $key ){
                case $keys[0]:$res[$key][]=$node->getAttribute('href');break;
                case $keys[1]:$res[$key][]=$node->getAttribute('src');break;
                case $keys[2]:$res[$key][]=trim($node->textContent);break;
                case $keys[3]:$res[$key][]=trim($node->textContent);break;
            }
        }
    }
}
# show the result or do really interesting things with the data
printf('<pre>%s</pre>',print_r($res,true));

Which yields (provided the markup is parsed as UTF-8; without an encoding hint, loadHTML may garble the € sign, e.g. as â¬):

Array
(
    [product-link] => Array
        (
            [0] => https://...linkhere
            [1] => https://www.example.com/banana
        )

    [product-img-src] => Array
        (
            [0] => https://image-goes-here.jpg
            [1] => https://www.example.com/kittykat.jpg
        )

    [h3-title-text] => Array
        (
            [0] => The title goes here
            [1] => Oh look, another title!
        )

    [price] => Array
        (
            [0] => €20,00
            [1] => €540,00
        )
        )

)
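The result above is grouped by field, so the link, image, title and price of one product are matched up only by array index. If one record per product-grid-item is preferred, each div can be used as the context node for relative XPath queries. This is a sketch of that variation rather than the answer's own method; the hasClass helper and the trimmed-down sample markup are assumptions made for illustration.

```php
<?php
// Trimmed-down sample using the same class names as the markup above; values are placeholders.
$html = '
<div class="product-grid-item" data-loop="1">
  <div class="product-element-top">
    <a href="https://www.example.com/banana" class="product-image-link">
      <img width="300" height="300" src="https://www.example.com/kittykat.jpg">
    </a>
    <h3 class="wd-entities-title"><a href="https://www.example.com/womble">Oh look, another title!</a></h3>
    <span class="price"><span class="woocommerce-Price-amount amount"><bdi><span class="woocommerce-Price-currencySymbol">€</span>540,00</bdi></span></span>
  </div>
</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// The XML declaration prefix is a common trick to make loadHTML() read the input as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Hypothetical helper: builds a predicate that matches one class token,
// even when the attribute lists several classes.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
}

$products = [];
foreach ($xp->query('//div[' . hasClass('product-grid-item') . ']') as $item) {
    // The leading "." keeps each query relative to the current product div.
    $products[] = [
        'link'  => $xp->evaluate('string(.//a[' . hasClass('product-image-link') . ']/@href)', $item),
        'image' => $xp->evaluate('string(.//a[' . hasClass('product-image-link') . ']/img/@src)', $item),
        'title' => trim($xp->evaluate('string(.//h3[' . hasClass('wd-entities-title') . '])', $item)),
        'price' => trim($xp->evaluate('string(.//span[' . hasClass('price') . ']//bdi)', $item)),
    ];
}

print_r($products);
```

Each entry of $products then holds the four values for a single product, and a product with a missing element gets an empty string for that field instead of shifting the indexes of the others.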

Regarding "php - Scrape specific elements inside a DIV from an external page", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73632971/
