text - How to get the text of an <a> tag that contains a specific URL

Tags: text scrapy href contains

I have a question I don't know the answer to, and it may be an interesting one. I am scraping links like this one:

    <a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">Prosta delovna mesta  v Sandozu</a>

Now that I have found it, I also want the text of the tag: "Prosta delovna mesta v Sandozu"

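How do I get the text? With a hard-coded string it seems easy (see the commented-out contains(@href, "career") line in the code below).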


But I am inside a loop and only have a reference to the url. I tried several options, for example:


    word = "career"
    response.xpath('//a[contains(@href, "%s")]/text()').extract() % word


Thanks, Marko

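#Imports this method relies on (Python 3 module paths assumed; JobItem is the project's own item class).
from urllib.parse import urljoin, urlparse
from scrapy import Request
from scrapy.utils.response import get_base_url

def parse(self, response):

    #We take all urls, they are marked by "href". These are either webpages on our website or new websites.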
    urls = response.xpath('//@href').extract()

    #Base url.
    base_url = get_base_url(response) 

    #Loop through all urls on the webpage.
    for url in urls:

        #If the url points to a picture, a document, an archive ... we ignore it. We might have to change that because some companies provide job vacancy information in PDF.
        if url.endswith((
            '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', 
            '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', 

            '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', 
            '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', 

            #music and video
            '.mp3', '.mp4', '.mpg', '.ai', '.avi',
            '.MP3', '.MP4', '.MPG', '.AI', '.AVI',

            #compressions and other
            '.zip', '.rar', '.css', '.flv',
            '.ZIP', '.RAR', '.CSS', '.FLV',
        )):
            continue

        #If url includes characters like ?, %, &, # ... it is LIKELY NOT to be the one we are looking for and we ignore it. 
        #However in this case we exclude good urls like http://www.mdm.si/company#employment
        if any(x in url for x in ['?', '%', '&', '#']):
            continue

        #Ignore ftp.
        if url.startswith("ftp"):
            continue

        #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.
        # -- It is true, that we may get some strange urls, but it is fine for now.
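        #We remember the href exactly as it appears in the page, so that we can look up the link text later.
        url_orig = url
        if not (url.startswith("http")):
            url = urljoin(base_url,url)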

        #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.         
        if (urlparse(url).netloc == urlparse(base_url).netloc):

            #The main part. We look for webpages, whose urls include one of the employment words as strings.

            # -- Instruction. 
            # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
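            if any(x in url for x in [
                #The original word list is missing here; the entries below are only the examples named in the instruction above.
                'jobs', 'vacancies', 'career', 'employment',
            ]):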

                #We found url that includes one of the magic words. We check, if we have found it before. If it is new, we add it to the list "jobs_urls".
                if url not in self.jobs_urls:
                    self.jobs_urls.append(url)
                    item = JobItem()
                    item["link"] = url
                    #item["term"] = response.xpath('//a[@href=url_orig]/text()').extract() 
                    #item["term"] = response.xpath('//a[contains(@href, "career")]/text()').extract()

                    #We return the item.
                    yield item

            #We don't use an "else" branch because we also want to explore the employment webpage itself to find possible new employment webpages.
            #We keep looking for employment webpages, until we reach the DEPTH, that we have set in settings.py. 
            yield Request(url, callback = self.parse)
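
A note on the filtering steps in the loop above: they can be tried outside a spider with only the standard library. The following is a minimal sketch under stated assumptions, not the asker's code: normalize_link and SKIPPED_EXTENSIONS are hypothetical names, and the extension list is shortened for illustration. It lower-cases the href once instead of listing every extension twice, which also skips mixed-case endings such as ".Jpg" that the original list would let through.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical, shortened extension list for illustration only.
SKIPPED_EXTENSIONS = ('.jpg', '.png', '.pdf', '.zip', '.css')

def normalize_link(href, base_url):
    """Return an absolute same-site URL for href, or None if it should be skipped."""
    # Lower-casing once replaces the duplicated upper-case extension entries.
    if href.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    # Same heuristics as in the loop: skip query-like urls and ftp links.
    if any(ch in href for ch in '?%&#') or href.startswith('ftp'):
        return None
    # urljoin leaves absolute urls untouched and resolves relative ones.
    url = urljoin(base_url, href)
    # Stay on the site being investigated.
    if urlparse(url).netloc != urlparse(base_url).netloc:
        return None
    return url

base = "http://www.sandoz.com/index.shtml"
print(normalize_link("/careers/index.shtml", base))     # same site, kept
print(normalize_link("logo.PNG", base))                 # image, skipped
print(normalize_link("http://example.com/jobs", base))  # other site, skipped
```

Returning None for skipped links keeps the loop body to a single check, and the original href stays available to the caller for looking up the link text.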


You need to put the url in quotes and use string formatting:

    item["term"] = response.xpath('//a[@href="%s"]/text()' % url_orig).extract()
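
The attempt in the question fails because % is applied to the list returned by extract() rather than to the XPath string; the formatting has to happen before the query is run. As a rough stand-alone sketch of that ordering, the snippet below uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector (an assumption made only so it runs without Scrapy; ElementTree supports just a subset of XPath). The helper names link_text_for and links_containing are hypothetical, and the sample link is wrapped in a div so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a root element so it parses as XML.
HTML = (
    '<div>'
    '<a href="http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml">'
    'Prosta delovna mesta v Sandozu</a>'
    '<a href="/about.shtml">About</a>'
    '</div>'
)

def link_text_for(root, href):
    """Return the text of every <a> whose href equals href exactly."""
    # Format first, then query: the quotes around %s make it a string literal.
    query = './/a[@href="%s"]' % href
    return [a.text for a in root.findall(query)]

def links_containing(root, word):
    """Return the text of every <a> whose href contains word."""
    # ElementTree has no contains(), so the substring test is done in Python.
    return [a.text for a in root.iter('a') if word in a.get('href', '')]

root = ET.fromstring(HTML)
url_orig = "http://www.sandoz.com/careers/career_opportunities/job_offers/index.shtml"
print(link_text_for(root, url_orig))
print(links_containing(root, "career"))
```

Note that the exact-match query only finds the link when url_orig is the href exactly as written in the page, which is why the relative form has to be kept before urljoin is applied.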

Regarding "text - How to get the text of an <a> tag that contains a specific URL", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32508685/

