Web scraping with headless Chrome (Rust): clicking seems to do nothing

Tags: web-scraping rust headless-browser

I'm fairly new to Rust and completely new to the web (and to scraping). I'm implementing a web scraper as a pet project to get more familiar with Rust and the web stack.

I use headless-chrome to visit a website and scrape its links, which I'll look into later. So I open a tab, navigate to the site, scrape the URLs, and finally click the next-page button. Even though I find the next button (via a CSS selector) and call click() on it, nothing happens. On the next iteration I scrape the same list again (so apparently it never moved to the next page).

use headless_chrome::Tab;
use std::error::Error;
use std::sync::Arc;
use std::{thread, time};

pub fn scrape(tab: Arc<Tab>) {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP";

    if let Err(_) = tab.navigate_to(url) {
        println!("Failed to navigate to {}", url);
        return;
    }

    if let Err(e) = tab.wait_until_navigated() {
        println!("Failed to wait for navigation: {}", e);
        return;
    }

    if let Ok(gdpr_accept_button) = tab.wait_for_element(".sc-gsDKAQ.fILFKg") {
        if let Err(e) = gdpr_accept_button.click() {
            println!("Failed to click GDPR accept button: {}", e);
            return;
        }
    } else {
        println!("No GDPR popup to acknowledge found.");
    }

    let mut links = Vec::<String>::new();
    loop {
        let mut skipped: usize = 0;
        let new_urls_count: usize;
        match parse_list(&tab) {
            Ok(urls) => {
                new_urls_count = urls.len();
                for url in urls {
                    if !links.contains(&url) {
                        links.push(url);
                    } else {
                        skipped += 1;
                    }
                }
            }
            Err(_) => {
                println!("No more houses found: stopping");
                break;
            }
        }

        if skipped == new_urls_count {
            println!("Only previously loaded houses found: stopping");
            break;
        }

        if let Ok(button) = tab.wait_for_element("[class=\"arrowButton-20ae5\"]") {
            if let Err(e) = button.click() {
                println!("Failed to click next page button: {}", e);
                break;
            } else {
                println!("Clicked next page button");
            }
        } else {
            println!("No next page button found: stopping");
            break;
        }

        if let Err(e) = tab.wait_until_navigated() {
            println!("Failed to load next page: {}", e);
            break;
        }
    }

    println!("Found {} houses:", links.len());
    for link in links {
        println!("\t{}", link);
    }
}

fn parse_list(tab: &Arc<Tab>) -> Result<Vec<String>, Box<dyn Error>> {
    let elements = tab.find_elements("div[class*=\"EstateItem\"] > a")?; //".EstateItem-1c115"

    let mut links = Vec::<String>::new();
    for element in elements {
        if let Some(url) = element
            .call_js_fn(
                "function() { return this.getAttribute(\"href\"); }",
                vec![],
                true,
            )?
            .value
        {
            links.push(url.to_string());
        }
    }

    Ok(links)
}

When I call this code from main, I get the following output:

No GDPR popup to acknowledge found.
Clicked next page button
Only previously loaded houses found: stopping
Found 20 houses:
    ...

My problem is that I don't understand why clicking the next button does nothing. Since I'm new to Rust and web applications, I can't tell whether the problem lies with the crate (headless_chrome), with how I use it, or with my understanding of web scraping.

Best answer

tl;dr: replace the code that clicks the next-page button with:

if let Ok(button) = tab.wait_for_element(r#"*[class^="Pagination"] button:last-child"#) {
    // Both the left and right arrow buttons share the same class, so the
    // original selector matched the left (previous-page) button.
    if let Err(e) = button.click() {
        println!("Failed to click next page button: {}", e);
        break;
    } else {
        println!("Clicked next page button");
    }
} else {
    println!("No next page button found: stopping");
    break;
}

// Rust is too fast here: give Chrome a moment to load the next page
std::thread::sleep(std::time::Duration::from_secs(5)); // wait for 5 seconds
if let Err(e) = tab.wait_until_navigated() {
    println!("Failed to load next page: {}", e);
    break;
}
  1. The original code clicks the right (next) button on the first page, but the left (previous) button from then on, because the CSS selector also matches the left button; and since the left button comes first in the DOM tree, it is the one returned.
  2. The original code is too fast: Chrome needs a moment to load the next page. If you find this sleep-based approach unbearable, look into the protocol events and wait for the browser to emit one, e.g. https://docs.rs/headless_chrome/latest/headless_chrome/protocol/cdp/Accessibility/events/struct.LoadCompleteEvent.html.

As a final suggestion, all of the work above is unnecessary: the URL pattern is clearly https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC&sf=TIMESTAMP&sp={PAGINATION}. You can find all of the site's pages simply by scraping the pagination element, so you might as well drop Chrome entirely, issue plain HTTP requests, and parse the returned HTML. For that, check out https://docs.rs/scraper/latest/scraper/ and https://docs.rs/reqwest/latest/reqwest/. If performance matters for this spider, reqwest also works with tokio to scrape pages asynchronously/concurrently.
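Once the page count is known, turning that pattern into a list of page URLs is plain string formatting; a minimal std-only sketch (the page parameter name, sp here, is taken from the pattern above and is an assumption about the site):

```rust
// Build one URL per result page by appending the page parameter.
// The `sp` parameter name is assumed from the URL pattern observed above.
fn page_urls(base: &str, pages: u32) -> Vec<String> {
    (1..=pages).map(|p| format!("{base}&sp={p}")).collect()
}

fn main() {
    let urls = page_urls("https://www.immowelt.at/liste/example?sd=DESC", 3);
    assert_eq!(urls.len(), 3);
    assert_eq!(urls[0], "https://www.immowelt.at/liste/example?sd=DESC&sp=1");
    println!("{urls:#?}");
}
```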

Update:

Below are Rust and Python implementations of the suggestion above. That said, Rust libraries for parsing HTML/XML and evaluating XPath seem to be quite scarce and relatively immature.

use reqwest::Client;
use std::error::Error;
use std::sync::Arc;
use sxd_xpath::{Context, Factory, Value};

async fn get_page_count(client: &reqwest::Client, url: &str) -> Result<i32, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    let pages_count = body
        .split("\"pagesCount\":")
        .nth(1)
        .unwrap()
        .split(",")
        .next()
        .unwrap()
        .trim()
        .parse::<i32>()?;
    Ok(pages_count)
}

async fn scrape_one(client: &Client, url: &str) -> Result<Vec<String>, Box<dyn Error>> {
    let res = client.get(url).send().await?;
    let body = res.text().await?;
    let package = sxd_html::parse_html(&body);
    let doc = package.as_document();

    let factory = Factory::new();
    let ctx = Context::new();

    let houses_selector = factory
        .build("//*[contains(@class, 'EstateItem')]")?
        .unwrap();
    let houses = houses_selector.evaluate(&ctx, doc.root())?;

    if let Value::Nodeset(houses) = houses {
        let mut data = Vec::new();
        for house in houses {
            let title_selector = factory.build(".//h2/text()")?.unwrap();
            let title = title_selector.evaluate(&ctx, house)?.string();
            let a_selector = factory.build(".//a/@href")?.unwrap();
            let href = a_selector.evaluate(&ctx, house)?.string();
            data.push(format!("{} - {}", title, href));
        }
        return Ok(data);
    }
    Err("No data found".into())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC";
    let client = reqwest::Client::builder()
        .user_agent(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0",
        )
        .build()?;
    let client = Arc::new(client);
    let page_count = get_page_count(&client, url).await?;
    let mut tasks = Vec::new();
    for i in 1..=page_count {
        let url = format!("{}&sf={}", url, i);
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            scrape_one(&client, &url).await.unwrap()
        }));
    }
    let results = futures::future::join_all(tasks).await;
    for result in results {
        println!("{:?}", result?);
    }
    Ok(())
}
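The string-splitting inside get_page_count can be exercised against a literal response body, without any network; a std-only sketch of the same extraction:

```rust
// Extract the integer after "pagesCount": from a raw JSON-ish body,
// using the same split-based approach as get_page_count above.
// Like the original, it assumes a comma follows the number.
fn extract_pages_count(body: &str) -> Option<i32> {
    body.split("\"pagesCount\":")
        .nth(1)?          // text after the key
        .split(',')
        .next()?          // up to the next comma
        .trim()
        .parse()
        .ok()
}

fn main() {
    let body = r#"{"itemsCount":20,"pagesCount": 7,"page":1}"#;
    assert_eq!(extract_pages_count(body), Some(7));
    assert_eq!(extract_pages_count("no key here"), None);
    println!("ok");
}
```

Returning Option instead of unwrapping also avoids the panic the original would hit on a body without the key.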
The Python equivalent (meant to run in an async context such as a Jupyter notebook; it assumes an aiohttp ClientSession named session already exists):

import re
import asyncio
from lxml import etree

async def page_count(url):
    req = await session.get(url)
    return int(re.search(r'"pagesCount":\s*(\d+)', await req.text()).group(1))

async def scrape_one(url):
    req = await session.get(url)
    tree = etree.HTML(await req.text())
    houses = tree.xpath("//*[contains(@class, 'EstateItem')]")
    data = [
        dict(title=house.xpath(".//h2/text()")[0], href=house.xpath(".//a/@href")[0])
        for house in houses
    ]
    return data

url = "https://www.immowelt.at/liste/bezirk-bruck-muerzzuschlag/haeuser/kaufen?ami=125&d=true&lids=513958&lids=513960&lids=513966&pami=750&pma=500000&pmi=10000&sd=DESC"
result = await asyncio.gather(
    *[
        scrape_one(url + f"&sf={i}")
        for i in range(1, await page_count(url + "&sf=1") + 1)
    ]
)
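For completeness, the fan-out/join structure used by the tokio version above can also be sketched with plain std threads and a stubbed fetch (no network; the stub name and fake links are hypothetical), which shows the shape of the concurrency without the async machinery:

```rust
use std::thread;

// Hypothetical stub standing in for scrape_one: fake listing links per page.
fn scrape_one_stub(page: u32) -> Vec<String> {
    vec![format!("/expose/{page}-a"), format!("/expose/{page}-b")]
}

fn main() {
    // Spawn one worker per page, then join them all, like join_all above.
    let handles: Vec<_> = (1..=3)
        .map(|page| thread::spawn(move || scrape_one_stub(page)))
        .collect();
    let results: Vec<Vec<String>> = handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect();
    assert_eq!(results.len(), 3);
    assert_eq!(results[1][0], "/expose/2-a");
    println!("{results:?}");
}
```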

Regarding web scraping with headless Chrome (Rust) where clicking seems to do nothing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/76451684/
