php - Get the innerHTML of an element, but not the element itself

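I am trying to extract data from a two-column table. The first column is the variable name and the second column is that variable's data.

I have this almost working, but some of the data may contain HTML and is usually wrapped in a DIV. I want to get the HTML inside the DIV, but not the DIV itself. I know a regular expression may be one solution, but I would like to understand DOMDocument better.

Here is my code so far: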

private function readHtml()
{

    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $htmlData = curl_exec($curl);
    curl_close($curl);

    $dom        = new \DOMDocument();
    $html       = $dom->loadHTML($htmlData);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

               $htmlNode = $node->getElementsByTagName('div');

                if($htmlNode->length >=1) {

                    $innerHTML= '';

                    foreach ($htmlNode as $innerNode) {

                        $innerHTML .= $innerNode->ownerDocument->saveHTML( $innerNode );
                    }

                    $value = $innerHTML;

                } else {

                    $value = $node->textContent;
                }
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

My output is correct, but I don't want to include the wrapper DIV around the data that contains HTML:

    Array
    (
        [type] => raw
        [direction] => north
        [intro] => Welcome to the test. 
        [html_body] => <div class="softmerge-inner" style="width: 5653px; left: -1px;">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut <span style="font-weight:bold;">aliquip</span> ex ea commodo consequat. Duis aute irure dolor in <span style="text-decoration:underline;">reprehenderit</span> in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, <span style="font-style:italic;">sunt in</span> culpa qui officia deserunt mollit anim id est laborum.</div>
        [count] => 1003
    )

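Update

Based on some of the feedback and ideas in the answers, here is the current iteration of the function. It is leaner and returns the desired output. I don't feel great about the double regex, but it works.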

private function readHtml()
{

    # the url given in your example
    $url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

    $dom = new \DOMDocument();
    $dom->loadHTMLFile($url);
    $dom->preserveWhiteSpace = false;

    $tables     = $dom->getElementsByTagName('table');
    $rows       = $tables->item(0)->getElementsByTagName('tr');
    $cols       = $rows->item(1)->getElementsByTagName('td');

    $table = [];
    $key = null;
    $value = null;

    foreach ($rows as $i => $row){

        //skip the heading columns
        if($i <= 1 ) continue;

        $cols = $row->getElementsByTagName('td');

        foreach ($cols as $count => $node) {

            if($count == 0) {

                $key = strtolower(str_replace(' ', '_',$node->textContent));

            } else {

                $value = $node->ownerDocument->saveHTML( $node );

                $value = preg_replace('/(<div.*?>|<\/div>)/','',$value);
                $value = preg_replace('/(<td.*?>|<\/td>)/','',$value);
            }
        }

        $table[$key] = $value;
    }

    return $table;
}

Best answer

Use `preg_replace`! Like this:

$table['html_body']=preg_replace('/(<div.*?>|<\/div>)/','',$table['html_body']);

See the PHP documentation for `preg_replace`, and a regular-expression reference for how the pattern works.

Or! You can use simple_html_dom.php like this:
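The update in the question applies two separate replacements, one for the `div` tags and one for the `td` tags. As a rough sketch, the two could be folded into a single pattern. The sample string and the variable names below are made up for illustration, modelled on the `html_body` value shown in the question.

```php
<?php
// Hypothetical sample cell, shortened from the html_body value in the question.
$value = '<td class="s2"><div class="softmerge-inner" style="width: 5653px; left: -1px;">'
       . 'Lorem <span style="font-weight:bold;">ipsum</span> dolor</div></td>';

// One pattern for both opening and closing <div> and <td> tags.
// \b stops the match from extending to longer tag names that start with "div" or "td";
// [^>]* consumes any attributes up to the end of the tag.
$stripped = preg_replace('/<\/?(?:div|td)\b[^>]*>/i', '', $value);

echo $stripped, "\n";
```

Like the original patterns, this removes every `div` tag in the string, including any nested inside the content, and it would misbehave if an attribute value contained a `>` character.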

<?php
include 'simple_html_dom.php';//<--- Must download to current directory
$url = 'https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml';
$html = file_get_html( $url );
foreach ( $html->find( "div[class=softmerge-inner]" ) as $element ) {
    echo $element->innertext;
    //See http://simplehtmldom.sourceforge.net/manual.htm for usage
}
?>
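Since the question asks specifically about DOMDocument, here is a minimal sketch that avoids both the regex and the extra library: serialize each child of the wrapper DIV instead of the DIV itself. The `innerHTML` helper name and the sample markup are assumptions made for illustration and are not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): returns the markup
// inside $node without the node's own opening and closing tags.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes only that child and its descendants.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup modelled on the question's output, shortened for illustration.
$sample = '<table><tr><td>HTML Body</td>'
        . '<td><div class="softmerge-inner">Lorem <span style="font-weight:bold;">ipsum</span> dolor</div></td>'
        . '</tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

$cell = $dom->getElementsByTagName('td')->item(1);
$div  = $cell->getElementsByTagName('div')->item(0);

// Fall back to the cell itself when there is no wrapper DIV.
echo innerHTML($div ?? $cell), "\n";
```

In the question's loop, the same idea would replace the `saveHTML( $innerNode )` call: pass the first DIV found in the cell (or the cell itself) to the helper and assign the result to `$value`.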

Regarding "php - Get the innerHTML of an element, but not the element itself", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36898222/
