php - Amazon scraper script works on XAMPP on Windows, but not with the PHP5 CLI on Linux

Tags: php linux windows screen-scraping

I am trying to scrape Amazon ASIN codes using the following code:

<?php

class Scraper {

const BASE_URL = "http://www.amazon.com";
private $categoryFile = "";
private $outputFile = "";
private $catArray;
private $currentPage = NULL;
private $asin = array();
private $categoriesMatched = 0;
private $categoryProducts = array();
private $pagesMatched = 0;
private $totalPagesMatched = 0;
private $productsMatched = 0;

public function __construct($categoryFile, $outputFile) {

    $this->categoryFile = $categoryFile;
    $this->outputFile = $outputFile;

}

public function run() {

    $this->readCategories($this->categoryFile);
    $this->setupASINArray($this->asin);

    $x = 1;

    foreach ($this->catArray as $cat) {

        $this->categoryProducts["$x"] = 0;

        if ($this->currentPage == NULL) {

            $this->currentPage = $cat;
            $this->scrapeASIN($this->currentPage, $x);
            $this->pagesMatched++;

        }           

        if ($this->getNextPageLink($this->currentPage)) {

            do {

                // next page found
                $this->pagesMatched++;
                $this->scrapeASIN($this->currentPage, $x);

            } while ($this->getNextPageLink($this->currentPage));

        }

        echo "Category complete: $this->pagesMatched Pages" . "\n";
        $this->totalPagesMatched += $this->pagesMatched;
        $this->pagesMatched = 0;
        $this->writeASIN($this->outputFile, $x);
        $x++;
        $this->currentPage = NULL;
        $this->categoriesMatched++;



    }

    $this->returnStats();


}

private function readCategories($categoryFile) {

    $catArray = file($categoryFile, FILE_IGNORE_NEW_LINES);

    $this->catArray = $catArray;

}

private function setupASINArray($asinArray) {

    $x = 0;

    foreach ($this->catArray as $cat) {

        $asinArray["$x"][0] = "$cat";
        $x++;

    }

    $this->asin = $asinArray;

}

private function getNextPageLink($currentPage) {

    $document = new DOMDocument();

    $html = file_get_contents($currentPage);

    @$document->loadHTML($html);

    $xpath = new DOMXPath($document);

    $element = $xpath->query("//a[@id='pagnNextLink']/@href");

    if ($element->length != 0) {

        $this->currentPage = self::BASE_URL . $element->item(0)->value;
        return true;

    } else {

        return false;

    }


}

private function scrapeASIN($currentPage, $catNo) {

    $html = file_get_contents($currentPage);

    $regex = '~(?:www\.)?ama?zo?n\.(?:com|ca|co\.uk|co\.jp|de|fr)/(?:exec/obidos/ASIN/|o/|gp/product/|(?:(?:[^"\'/]*)/)?dp/|)(B[A-Z0-9]{9})(?:(?:/|\?|\#)(?:[^"\'\s]*))?~isx';

    preg_match_all($regex, $html, $asin);

    foreach ($asin[1] as $match) {

        $this->asin[$catNo-1][] = $match;

    }   


}

private function writeASIN($outputFile, $catNo) {

    $fh = fopen($outputFile, "a+");

    $this->fixDupes($catNo);
    $this->productsMatched += (count($this->asin[$catNo-1]) - 1);
    $this->categoryProducts["$catNo"] = (count($this->asin[$catNo-1]) - 1);

    flock($fh, LOCK_EX);

    $x = 0;

    foreach ($this->asin[$catNo-1] as $asin) {

        fwrite($fh, "$asin" . "\n");

        $x++;

    }



    flock($fh, LOCK_UN);

    fclose($fh);

    $x -= 1;

    echo "$x ASIN codes written to file" . "\n";

}

private function fixDupes($catNo) {

    $this->asin[$catNo-1] = array_unique($this->asin[$catNo-1], SORT_STRING);

}

public function returnStats() {

    echo "Categories matched: " . $this->categoriesMatched . "\n";
    echo "Pages parsed: " . $this->totalPagesMatched . "\n";
    echo "Products parsed: " . $this->productsMatched . "\n";
    echo "Category breakdown:" . "\n";

    $x = 1;

    foreach ($this->categoryProducts as $catProds) {

        echo "Category $x had $catProds products" . "\n";
        $x++;

    }

}

}

$scraper = new Scraper($argv[1], $argv[2]);
$scraper->run();

?>

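It works fine under XAMPP on Windows but not on Linux. Any ideas why? Sometimes it writes 0 ASINs to the file, and sometimes it scrapes only 1 page of a category that has 400+ pages. On Windows/XAMPP the output and behavior are completely fine.

Any ideas would be greatly appreciated!

Cheers - Bryce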

Best answer

Try changing it like this, just to rule out error messages from a missing category file:

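private function readCategories($categoryFile) {

    if (file_exists($categoryFile)) {
        // Read one category URL per line, without trailing newlines.
        $this->catArray = file($categoryFile, FILE_IGNORE_NEW_LINES);
    } else {
        // Report the problem and fall back to an empty list so run() does not fail.
        echo "File " . $categoryFile . " does not exist!\n";
        $this->catArray = array();
    }

}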
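
The check above covers only the category file. The same kind of silent failure can happen in getNextPageLink() and scrapeASIN(): file_get_contents() returns false when a fetch fails, and the script then carries on with an empty page. That would match "0 ASINs" or "only 1 page", but whether it is the actual cause on the Linux machine is an assumption. The sketch below is a minimal way to make such failures visible. The helpers fetchPage() and environmentReport() are hypothetical names and are not part of the original script.

```php
<?php

/**
 * Fetch a URL or local path and return its contents, or null on failure.
 * Hypothetical helper; not part of the original Scraper class.
 */
function fetchPage($location) {
    // file_get_contents() returns false on failure. The @ operator hides the
    // warning so the failure can be reported in one place below.
    $html = @file_get_contents($location);

    if ($html === false) {
        $error = error_get_last();
        $reason = ($error !== null) ? $error['message'] : 'unknown error';
        // On the CLI, error_log() writes to stderr by default.
        error_log("Fetch failed for $location: $reason");
        return null;
    }

    return $html;
}

/**
 * Collect a few settings worth comparing between the Windows and Linux
 * installs. The choice of settings is an assumption about likely differences.
 */
function environmentReport() {
    return array(
        'php_version'     => PHP_VERSION,
        'allow_url_fopen' => ini_get('allow_url_fopen'),
        'dom_loaded'      => extension_loaded('dom'),
        'openssl_loaded'  => extension_loaded('openssl'),
    );
}
```

Inside the class, the two file_get_contents($currentPage) calls could be replaced with fetchPage($currentPage), returning early when the result is null. Printing environmentReport() with print_r() on both machines shows whether URL fetching and the DOM extension are available in each install.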

This post is based on a similar question found on Stack Overflow, "php - Amazon scraper script works on XAMPP on Windows, but not with the PHP5 CLI on Linux": https://stackoverflow.com/questions/28252982/
