html - Filtering tables from read_html by a regex on an attr

Tags: html pandas web-scraping beautifulsoup

I know how to filter on an exact ID with the attrs parameter.

tables = pd.read_html(url, attrs={"id": "box-CHI-game-basic"})

I don't know the exact ID in advance, but I do know its structure. I can capture the id with a regular expression:

re.search(".+-game-basic", "box-CHI-game-basic")

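Simply passing the regex as the value in attrs doesn't work.

The match parameter of read_html does accept a regex, but it is applied to the text of each table, and I want to restrict the match to the id.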

Best answer

I don't think you can do this with pd.read_html alone.

According to the documentation, the match parameter selects:

The set of tables containing text matching this regex or string

As for attrs, what you tried won't work because

attrs is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly

So I think you have to fall back on bs4 first, for example:

soup.find_all("table", id=re.compile(".+-game-basic"))

Then pass the matching tables to pandas for further parsing.
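
A minimal sketch of that two-step approach, assuming the page HTML is already available as a string. The sample markup and the names html, id_pattern and frames are made up for illustration and are not part of the original question; pd.read_html also needs a parser backend such as lxml installed.

```python
import re
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page.
html = """
<table id="box-CHI-game-basic">
  <tr><th>Player</th><th>PTS</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
<table id="box-CHI-game-advanced">
  <tr><th>Player</th><th>TS%</th></tr>
  <tr><td>A</td><td>0.55</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: let Beautiful Soup select tables whose id matches the regex.
id_pattern = re.compile(r".+-game-basic$")
matching = soup.find_all("table", id=id_pattern)

# Step 2: hand each matching table's markup to pandas.
# read_html returns a list of DataFrames; each snippet holds one table.
frames = {
    table["id"]: pd.read_html(StringIO(str(table)))[0]
    for table in matching
}

for table_id, df in frames.items():
    print(table_id)
    print(df)
```

Wrapping the markup in StringIO avoids passing a literal HTML string to read_html, which recent pandas versions deprecate. Keying the result by id keeps track of which table each DataFrame came from when the pattern matches more than one.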

For this topic (html - Filtering tables from read_html by a regex on an attr), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74487351/
