I am using Cheerio to scrape a website that contains the tags below. How can I get the full text of the p tag, including the span with the attribute "data-mathml"?
<p><strong class="content_question">Đề bài</strong></p>
<p style="text-align: justify;">"a. "
<span class="MathJax_Preview" style="color: inherit; display: none;"></span>
<span id="MathJax-Element-1-Frame"
class="mjx-chtml MathJax_CHTML"
tabindex="0"
style="font-size: 121%; position: relative;"
data-mathml="<math xmlns="http://www.w3.org/1998/Math/MathML"><mn>5</mn></math>" role="presentation"><span id="MJXc-Node-1" class="mjx-math" aria-hidden="true"><span id="MJXc-Node-2" class="mjx-mrow"><span id="MJXc-Node-3" class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">5</span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mn>5</mn></math></span></span><script type="math/tex" id="MathJax-Element-1">5</script> và <span class="MathJax_Preview" style="color: inherit; display: none;"></span><span id="MathJax-Element-2-Frame" class="mjx-chtml MathJax_CHTML" tabindex="0" style="font-size: 121%; position: relative;" data-mathml="<math xmlns="http://www.w3.org/1998/Math/MathML"><mroot><mn>123</mn><mn>3</mn></mroot></math>" role="presentation"><span id="MJXc-Node-4" class="mjx-math" aria-hidden="true"><span id="MJXc-Node-5" class="mjx-mrow"><span id="MJXc-Node-6" class="mjx-mroot"><span class="mjx-root" style="font-size: 50%; vertical-align: 0.774em; width: 0px;"><span id="MJXc-Node-8" class="mjx-mn" style="padding-left: 0.543em;"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">3</span></span></span><span class="mjx-box" style="padding-top: 0.045em;"><span class="mjx-surd"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.507em; padding-bottom: 0.553em;">√</span></span><span class="mjx-box" style="padding-top: 0.119em; border-top: 1.6px solid;"><span id="MJXc-Node-7" class="mjx-mn"><span class="mjx-char MJXc-TeX-main-R" style="padding-top: 0.37em; padding-bottom: 0.37em;">123</span></span></span></span></span></span></span><span class="MJX_Assistive_MathML" role="presentation"><math xmlns="http://www.w3.org/1998/Math/MathML"><mroot><mn>123</mn><mn>3</mn></mroot></math></span></span>
<script type="math/tex" id="MathJax-Element-2">\root 3 \of {123} </script>
" ;"</p>
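For the span tag with the attribute "data-mathml", should I read the text from this attribute, or should I get the element itself and return that data to the client?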
const html = response.data;
const $ = cheerio.load(html);
const mathjaxEquations = $("span[data-mathml]");
console.log({ mathjaxEquations });
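Please help me, thank you very much!
Best answer
Based on your comments, you can extract this text with a tool such as Puppeteer. Cheerio does not evaluate JS, including MathJax, so it only sees the static HTML. Browser automation lets the live page run and gives you a chance to extract the data that JS injects.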
const puppeteer = require("puppeteer"); // ^21.0.2

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.goto(url);

  // Wait until MathJax has rendered at least one glyph.
  await page.waitForSelector(".mjx-char");

  // Remove injected outstream elements so they don't end up in the text.
  await page.$$eval('[data-id="sp-target-div-outstream"]', els =>
    els.forEach(el => el.remove())
  );

  // Runs in the browser. `$` here is assumed to be the jQuery instance
  // that the target page itself loads, not Cheerio.
  const result = await page.evaluate(() =>
    $("#box-content > p")
      .first()
      .nextUntil(":not(p)")
      .get()
      .map(e =>
        [...e.childNodes]
          .flatMap(e =>
            // Keep plain text nodes as-is; for rendered MathJax spans,
            // collect the glyph characters; drop everything else.
            e.nodeType === Node.TEXT_NODE
              ? e.textContent
              : e.classList?.contains("mjx-chtml")
              ? [...e.querySelectorAll(".mjx-char")].map(
                  e => e.textContent
                )
              : ""
          )
          .join("")
      )
  );
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[ 'So sánh', 'a) 5 và 3√123 ;', 'b) 53√6 và 63√5.' ]
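As for whether to read the "data-mathml" attribute instead: once the page has run in a real browser, the attribute can be read there as well, which keeps the MathML structure (for example `<mroot>`) rather than a flat run of glyphs. The following is only a minimal sketch. `readMathml` and `getMathml` are hypothetical helper names, and `page` is assumed to be a Puppeteer page that has already navigated to the URL and waited for MathJax to render.

```javascript
// Hypothetical helper: Puppeteer serializes this function and runs it in
// the browser, so it must not reference anything from the Node.js scope.
const readMathml = els => els.map(el => el.getAttribute("data-mathml"));

// Hypothetical wrapper: `page` is assumed to be a Puppeteer Page on which
// MathJax has already finished rendering.
const getMathml = page => page.$$eval("span[data-mathml]", readMathml);
```

Calling `await getMathml(page)` after the `waitForSelector` step should give one MathML string per equation, which the client could then render itself. Whether that is preferable to plain text depends on what the client needs to display.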
If you want a more raw version of the data that can be processed further and optionally joined later, replace .join("") in the script above with .filter(Boolean):
[
  [ 'So sánh' ],
  [
    'a) ', '5',
    ' và ', '3',
    '√', '123',
    ' ;'
  ],
  [
    'b) ', '5', '3',
    '√', '6', ' và ',
    '6', '3', '√',
    '5', '.'
  ]
]
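The later joining step could look something like this. It is a minimal sketch: `raw` is copied from the output above, and `joinTokens` is a hypothetical helper name, not part of the original answer.

```javascript
// Raw per-paragraph tokens, copied from the output above.
const raw = [
  ["So sánh"],
  ["a) ", "5", " và ", "3", "√", "123", " ;"],
  ["b) ", "5", "3", "√", "6", " và ", "6", "3", "√", "5", "."],
];

// Hypothetical helper: collapse each paragraph's tokens into one string.
const joinTokens = paragraphs => paragraphs.map(tokens => tokens.join(""));

console.log(joinTokens(raw));
```

This should reproduce the joined strings from the first output. Keeping the tokens separate until this point leaves room to filter or transform individual pieces first.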
Regarding "node.js - How to get elements in MathJax when crawling data?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/77129059/