Python:webdriver Firefox 和 PhantomJS 之间的区别问题

标签 python xpath selenium-webdriver

在过去的一两天里,我一直在研究这个 Python 脚本,当我使用 Firefox 网络驱动程序时一切正常,但是当我切换到使用像 PhantomJS 这样的 headless 浏览器时,它在 setNumber = parseSetNumber(setName[0]) 上失败了。出现错误 Error: list index out of range由于 setName是空的。

它前面的行 setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()")仅使用 PhantomJS 网络驱动程序时不返回任何内容,如果使用 Firefox 网络驱动程序则返回一个值。

仅当我将 webdriver 从 Firefox 切换到 PhantomJS 时才会发生该错误。我使用 PhantomJS 作为脚本在 linux 服务器上运行。

import time
import os.path
import lxml.html as LH
import re
import sys
from selenium import webdriver
from random import randint
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PARAMS = sys.argv
URL = PARAMS[1]
BASEURL = URL[:URL.rfind('/')+1]

# Parses the set name for the set number
def parseSetNumber(string):
    string = string.split(' ')
    stringLength = len(string)
    string = string[(stringLength - 1)]
    if string.replace('.','').isdigit():
        return string
    else:
        return ""

# Returns set reference for this site
def parseRefId(string):
    string = string.split('_')
    return str(string[2])

try:
    PAGE_NUMBER = 1

    #--------------------------------------------------
    ## Get initial page

    driver = webdriver.PhantomJS()
    driver.get(PARAMS[1])

    #--------------------------------------------------
    ## Get page count

    # Give page time to load
    time.sleep(2)

    PAGE_RAW = driver.page_source
    PAGE_RAW = LH.fromstring(PAGE_RAW)
    PAGE_COUNT_RAW = PAGE_RAW.xpath("//div[contains(@class, 'pageControlMenu')]/div/ul/li")
    PAGE_COUNT = len(PAGE_COUNT_RAW) - 2

    #--------------------------------------------------
    ## Get page if its not page one

    while PAGE_NUMBER <= PAGE_COUNT:
        #--------------------------------------------------
        ## Create empty file
        FILE_NAME = PARAMS[3] + 'json/' + time.strftime("%Y%m%d%H") + '_' + str(PARAMS[2]) + '_' + str(PAGE_NUMBER) + '.json'
        #--------------------------------------------------
        ## Create JSON file if it doesnt exist
        if os.path.exists(FILE_NAME)==False:
            JSON_FILE = open(FILE_NAME, "a+", encoding="utf-8")
        else:
            JSON_FILE = open(FILE_NAME, "w", encoding="utf-8")
        JSON_FILE.write("{")
        #--------------------------------------------------
        # Click page for next page if not page 1
        if PAGE_NUMBER > 1:
            index = 0
            for atag in PAGE_COUNT_RAW:
                if index == PAGE_NUMBER:
                    elements = driver.find_elements_by_xpath("//div[contains(@class, 'pageControlMenu')]/div/ul/li")
                    if elements:
                        element = elements[index].find_elements_by_xpath("./a")
                        if element:
                            element[0].click()
                            time.sleep(randint(3,5))
                index += 1
        #--------------------------------------------------
        ## Remove survey box if it pops up and log
        try:
            surveyBox = driver.find_element_by_link_text("No, thanks")
            if surveyBox:
                surveyBox.click()
                print("Store[" + str(PARAMS[2]) + "]: Survey box found on page - " + str(PAGE_NUMBER))
        except:
            print("Store[" + str(PARAMS[2]) + "]: No survey box on page - " + str(PAGE_NUMBER))
        #--------------------------------------------------
        ## Proces page
        # If page is greater then 1 then get the page source of the new page.
        if PAGE_NUMBER > 1:
            PAGE_RAW = driver.page_source
            PAGE_RAW = LH.fromstring(PAGE_RAW)
        PAGE_RAW = PAGE_RAW.xpath("//div[contains(@class, 'estore_product_container')]")
        index = 0
        size = len(PAGE_RAW)
        for atag in PAGE_RAW:
            if PAGE_NUMBER > 1 and index == 0:
                WebDriverWait(driver,10).until(EC.presence_of_element_located((By.XPATH, "./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a")))
            setStore = PARAMS[2]
            setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()")
            setNumber = parseSetNumber(setName[0])
            setPrice = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_price')]/text()")
            setLink = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/@href")
            setRef = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_price')]/@id")
            if setRef:
                setRef = parseRefId(setRef[0])
            if re.search('[0-9\.]+', setPrice[0]) is not None:
                JSON_FILE.write("\"" + str(index) + "\":{\"store\":\"" + str(setStore) + "\",\"name\":\"" + str(setName[0]) + "\",\"number\":\"" + str(setNumber) + "\",\"price\":\"" + re.search('[0-9\.]+', setPrice[0]).group() + "\",\"ref\":\"" + str(setRef) + "\",\"link\":\"" + str(setLink[0]) + "\"}")
                if index+1 < size:
                    JSON_FILE.write(",")
            index += 1
        #--------------------------------------------------
        ## Close JSON file
        JSON_FILE.write("}")
        JSON_FILE.close()
        #--------------------------------------------------
        ## Increment page number
        PAGE_NUMBER += 1
        #--------------------------------------------------

    #--------------------------------------------------
    ## Close webdriver
    driver.quit()
    #--------------------------------------------------

except Exception as e:
    print('Error: ' + str(e.args[0]))

# Remove gecodriver.log file
GHOSTDRIVER_FILE = str(PARAMS[3]) + 'jobs/ghostdriver.log'
if os.path.exists(GHOSTDRIVER_FILE)==True:
    os.remove(GHOSTDRIVER_FILE)

更新

看起来这是仅有的两行不使用 PhantomJS,它们都返回一个空值。
setName = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/text()")
setLink = atag.xpath("./div[contains(@class, 'product_info')]/div[contains(@class, 'product_name')]/a/@href")

最佳答案

好的,看来我已经解决了这个问题,我必须添加 set_windows_size使用 PhantomJS 时 webdriver 的选项。

起初:

driver = webdriver.PhantomJS()
driver.get(PARAMS[1])

解决方案:
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get(PARAMS[1])

现在 PhantomJS 网络驱动程序的工作方式与 Firefox 网络驱动程序的工作方式相同。

关于Python:webdriver Firefox 和 PhantomJS 之间的区别问题,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45325738/

相关文章:

java - Java中的Selenium找不到ahref

python - pandas如何计算仅给定月份和日期的增量

python - 选择 Selenium 中的焦点元素

python - Pandas 从长到宽

python - 借助xpath从没有类属性的元素中获取值

css - Selenium Web Driver-尝试使用Xpath时找不到元素。有人可以建议在元素下面找到更好的方法吗?

Python 子进程输出到标准输出

xslt - 如何使用xslt选择节点的祖 parent

java - 如何在 Selenium WebDriver 中对 FirefoxDriver、ChromeDriver 和 IEdriver 执行基本身份验证?

java - 即使程序在 excel 中将 resukt FALSE 放入数据中,数据仍然匹配 - Selenium Web 驱动程序