python - HTML cleaning code not quite working

Tags: python xml scrapy lxml

I am running Python 2.7 (64-bit) on 64-bit Windows Vista. I have the following code, which is meant to pull data from the Guardian Open Platform API and clean it up using some Scrapy modules:

import requests
from scrapy.utils.markup import remove_tags
from scrapy.selector import Selector


def get_content():
    api_url = 'http://beta.content.guardianapis.com/football/premierleague'
    payload = {
        'api-key':              '',
        'page-size':            10,
        'show-editors-picks':   'true',
        'show-elements':        'image',
        'show-fields':          'all'


    }
    response = requests.get(api_url, params=payload)

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for titles in titles:
            body = response.xpath("//p").extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')
            return titles

get_content()

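The code runs without raising any errors, but it prints nothing to Python IDLE. I suspect this is because I have indented something incorrectly. I have tried adjusting the indentation but have not got anywhere. Is that my problem, or am I doing something completely wrong with this code?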

Thanks

Best answer

Try parsing it with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

api_url = 'http://beta.content.guardianapis.com/football/premierleague'
payload = {
    'api-key':              '',
    'page-size':            10,
    'show-editors-picks':   'true',
    'show-elements':        'image',
    'show-fields':          'all'
}
# Download the raw response body.
response = requests.get(api_url, params=payload).content
soup = BeautifulSoup(response)

# For each <p> element, join all of its text nodes into one string.
text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
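
The idea in the list comprehension above (collect the text inside each <p> element and drop the tags) can be tried offline, without an API key. The sketch below is not the answerer's code: it swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra installs, it is written for Python 3, and the class name ParagraphText, the function paragraph_texts and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, in document order
        self._depth = 0        # greater than 0 while inside a <p>
        self._chunks = []      # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # The paragraph is complete: store its text without tags.
                self.paragraphs.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        # Keep text only while we are inside a <p>.
        if self._depth:
            self._chunks.append(data)


def paragraph_texts(html):
    """Return a list with the tag-free text of every <p> in `html`."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample input, standing in for the downloaded page.
sample = "<div><p>Hello <b>world</b>.</p><p>Second <a href='#'>link</a></p><span>skip</span></div>"
print(paragraph_texts(sample))  # ['Hello world.', 'Second link']
```

Text outside a <p>, such as the <span> in the sample, is ignored, which matches what soup.findAll('p') selects.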

OK, this code should do exactly what you want:

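import requests
from bs4 import BeautifulSoup

# api_url and payload are the same as in the snippet above.
response = requests.get(api_url, params=payload).content
soup = BeautifulSoup(response)

# Encode each paragraph's text as UTF-8 before printing (Python 2).
text = [''.join(s.findAll(text=True)).encode("utf-8") for s in soup.findAll('p')]
for x in text:
    print x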

*Plenty of sides tried free-flowing, pacey Latin football this summer – even England had their moments. A moment. But Argentina stayed functional. They haven’t conceded once in the knockouts, they’ve not been behind in any game, and they don’t mind a lack of respect. Coach Alejandro “The Sloth” Sabella says his side are “sore, beaten and tired after the war [with Holland]. But with work, humility and seriousness, we’ll get there”; Pablo Zabaleta says their strengths are spoiling, staying “compact and tight”, “closing down” and feeding on negativity. “Sometimes, if you have all the people against you, you feel even stronger.”
A series of heroic performances were undone by moments of cold quality – Switzerland, Mexico and Nigeria among those losing to cruel late strikes; and the USA stopped in extra time. But raw passion was at the heart of all the summer’s enduring images: Brazil’s maelstrom; Ivory Coast’s Serey Die in tears during his anthem; Suárez against England; Suárez against Chiellini; and the best squad meltdown for years – Ghana’s trip featuring a fist fight, suspensions, a plane load of cash and an inquiry. FA president Kwesi Nyantakyi: “We will unravel this farce.”
Van Gaal’s goalkeeper subbing move went down well: widely taken as evidence of brave, unsentimental, original thinking (even if Martin O’Neill did it first, in Leicester City’s 1996 play-off final) – and not as evidence of daft, look-at-me risk-taking, which it could have been if Tim Krul had gaffed. But the wider signs for Manchester United were good: a readiness to be flexible on tactics, to switch his back-line formation mid-game, to make space for flair, and to treat the press in a no-nonsense “je lot zijn idioten” way that’ll bring back warm pre-Moyes memories. He had no interest in the third-place play-off, and wasn’t shy to say so.
It’s a biennial revelation. The fundamentals of Germany’s 2002 football reboot are well-known - new academies with German quotas, leading to more German Bundesliga first-teamers at clubs where “50+1” ownership rules stop single entities from taking over. Joachim Löw was installed with a long-term brief, and will lead his team out in the final. England, in the same period, tried four different managers, giving each a smaller talent pool to pick from as the Premier League filled out with foreign owners and foreign players, gorged on its £5.5bn income, and grassroots facilities festered. Still, a Premier B League should fix it.
Bryan Ruiz, not good enough for Fulham’s relegation campaign and shipped out to make way for Kostas Mitroglou, captained Costa Rica into the knockout stage, scoring twice. He starred alongside Joel Campbell, who faces another season on loan from Arsenal. Also making points: Swiss Arsenal reject Johan Djourou; Colombia’s Pablo Armero, a loan flop at West Ham; Algeria pair Rafik Halliche (ex-Fulham) and Carl Medjani (ex-Liverpool); Mexico’s Spurs reject Giovani dos Santos; Germany’s Shkodran Mustafi, given a free by Everton in 2012; and former West Brom and Forest defender Gonzalo Jara, a star for Chile, despite a brutal own-goal/penalty miss double. Even Gervinho looked good.
The surprise on 2014’s top player lists so far: the number of keepers. There’s Tim Howard, whose old high school yearbook photo motto, “It will take a nation of millions to hold me back”, went viral; Costa Rica’s Keylor Navas, now in talks with Bayern Munich; Mexico’s free agent Guillermo Ochoa, whose Gordon Banks moment against Brazil put him in a good bargaining position; Nigeria’s Vincent Enyeama; Germany’s Manuel Neuer; Argentina’s Sergio Romero; and potentially Van Gaal’s strutting mind-gamer Tim Krul, who revelled in his cameo chance. Being a keeper is cool again. Even the ones who play for backwater minnows have their own Head & Shoulders ads.
Pre-tournament, DeAndre Yedlin was a Seattle Sounders homegrown full-back – low on European scouting lists, a known unknown. He’s now the answer to everyone’s full-back needs – his USMNTMVP game against Belgium drawing Roma, Liverpool, Inter, Genoa, Anderlecht and others. Club owner Adrian Hanauer says he doesn’t fancy selling Yedlin, but, on the other hand, “there’s always a number”. Among other talents who weren’t so well-known in inward-looking Premier League circles, where even Monaco’s €45m James Rodríguez counts as a breakthrough act: PSV’s winger Memphis Depay, Lille’s Divock Origi, about to join Liverpool, and Atlético Madrid’s José Giménez, whose buyout clause is on the rise.
Holland’s kicks were clinical against Costa Rica. Then, four days later, two players refused to take one and they lost 4-2. But science says it’s not a lottery. Among the historical World Cup data from analyst Robert O’Connor: the side kicking first win 60% of the time; players aged under 22 score 85% of their kicks, over-22s score 78%; keepers dive low and away from the centre of their net 94% of the time. Overall the ideal taker is young, left-footed, with a “well-established pre-shot routine” and wearing a red shirt. Today’s tailored facts: Argentina have won four out of five of their shootouts, Germany four out of four. One was against Argentina, in 2006.
For all the bad press, when something really bad happened – something disgusting – Fifa didn’t hold back. They fined Argentina £200,000 for breaching press conference regulations – failing to provide a player to give quotes “on three consecutive occasions”. Luis Suárez, meanwhile, was fined £96,000. But Suárez’s four-month ban did represent an unexpectedly heavy hit - a bit “fascist”, reckoned the Uruguay president José Mujica, who called it “an assault on the poor” driven by “Fifa’s bunch of old sons of bitches”. Meanwhile, Pepe was fined £10,000 for a headbutt, Alex Song £13,000 for an elbow chop, and Algeria £32,000 for fans using lasers, while none of the 12 complaints made about racist, homophobic or far-right chants or banners led to any Fifa action.


16 August Arsenal v Crystal Palace
16 August Burnley v Chelsea
16 August Leicester v Everton
16 August Liverpool v Southampton
16 August Man Utd v Swansea
etc...............*
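
As for why the code in the question prints nothing: one likely cause, judging only from the code as posted, is that parse is defined inside get_content but never called, so its print statement never runs. Separately, a requests response has no .selector or .xpath attribute (those belong to Scrapy's own response objects), so simply calling parse as written would probably raise an error. The Python 3 sketch below illustrates the first point only. The function bodies and the sample string are made up for illustration and do not touch the network.

```python
def get_content():
    data = "<p>hello</p>"          # stands in for the downloaded response

    def parse(text):               # defined here, but never called
        print("parsing", text)
        return text

    # No call to parse() and no return statement: the function ends here
    # and returns None without printing anything.


result = get_content()
print(result)                      # None


def get_content_fixed():
    data = "<p>hello</p>"

    def parse(text):
        return text.upper()

    return parse(data)             # call the helper and return its result


print(get_content_fixed())         # <P>HELLO</P>
```

Re-indenting alone does not help here: the nested function has to be called, or its body moved into get_content.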

For more on "python - HTML cleaning code not quite working", see the similar question on Stack Overflow: https://stackoverflow.com/questions/24708097/
