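Python: scraping text between HTML tags (continued)

Tags: python html

All, this is a continuation of my previous post, but for a different scenario.

I now have a specific scenario in which I need to extract the text between tags.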

    data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c10">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 16, 2016 Wednesday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section B; Column 0; Classified; Pg. 16</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3 </SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 16, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>

'''

What I have tried:

import re

# Greedy (.*) runs to the last </SPAN></P> on the line, so inner tags are captured too.
publicationnamepattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">(.*)</SPAN></P>'
# [^<]* stops at the first "<", so only the text of the first SPAN is captured.
copyrightpattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">([^<]*)</SPAN>'

publicationnamepatternvalues = [a.strip("*") for a in re.findall(publicationnamepattern, data)]
copyrightpatternvalues = [a.strip("*") for a in re.findall(copyrightpattern, data)]

print(str(publicationnamepatternvalues))
print(str(copyrightpatternvalues))

Result:

['The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company', 'The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company']

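Of these, I only need "The New York Times" as the publicationnamepattern value and "Copyright 2016 The New York Times Company" as the copyrightpattern value.

I cannot rely on more static values, because only these fields are common across the data, i.e. The New York Times.

Some of the data uses span classes such as c2, some uses c4, and so on.

Can anyone help me handle this situation?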
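The values captured by the regex attempt above still contain the inner `</SPAN><SPAN ...>` markup. One possible second pass, shown here only as a minimal sketch, strips anything that looks like a tag from each captured fragment and decodes entities. The helper name `strip_tags` and the `TAG` pattern are illustrative assumptions, not part of the original post, and a tag-stripping regex like this is only reasonable for simple, well-formed fragments such as the ones in the result.

```python
import html
import re

# Hypothetical helper: matches anything of the form <...>.
TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove tags from a captured fragment and decode entities such as &amp;."""
    return html.unescape(TAG.sub("", fragment)).strip()


# One of the captured values from the result shown in the question.
captured = 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company'
print(strip_tags(captured))  # Copyright 2016 The New York Times Company
```

This keeps the original patterns untouched and only cleans their output, so it does not depend on which span classes (c2, c3, c4, ...) appear inside the match.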

Best answer

from bs4 import BeautifulSoup

a="""
data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''
"""
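# Name the parser explicitly; html.parser lowercases tag and attribute names.
soup = BeautifulSoup(a, "html.parser")
# Every <div class="c0">, whatever span classes it contains.
soup2 = soup.select('div.c0')
# .text joins the text of all nested spans, so the inner tags disappear.
list1 = [b.text.strip() for b in soup2]
print(list1)
var1, var2 = list1[1], list1[2]
print(var1)
print(var2)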

Output:

['1 of 2 DOCUMENTS', 'The New York Times', 'Copyright 2016 The New York Times Company']
The New York Times
Copyright 2016 The New York Times Company
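The answer picks the two values by position (`list1[1]`, `list1[2]`), which only covers a single document. As a hedged sketch of how the same idea might extend to the two-document sample in the question, the following uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup and selects the values by content rather than by index. The class `C0TextCollector`, the function `extract_fields`, and the rules "a block ending in DOCUMENTS starts a record" and "a block starting with Copyright closes it" are assumptions inferred from the sample data, not something the original answer states. It also assumes the c0 divs are not nested.

```python
from html.parser import HTMLParser


class C0TextCollector(HTMLParser):
    """Collect the text of every <div class="c0"> block (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # finished text of each div.c0
        self._parts = None   # text pieces of the div.c0 being read, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so DIV/CLASS match.
        if tag == "div" and "c0" in (dict(attrs).get("class") or "").split():
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._parts is not None:
            # Join the pieces and collapse runs of whitespace.
            self.blocks.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_fields(html_text):
    """Return (publication, copyright) pairs, one per document."""
    parser = C0TextCollector()
    parser.feed(html_text)
    parser.close()
    results = []
    publication = None
    for text in parser.blocks:
        if text.endswith("DOCUMENTS"):       # e.g. "1 of 2 DOCUMENTS"
            publication = None
        elif text.startswith("Copyright"):   # copyright line closes a record
            results.append((publication, text))
        elif publication is None:            # first other c0 block is the name
            publication = text
    return results


sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''

for publication, copyright_line in extract_fields(sample):
    print(publication)       # The New York Times
    print(copyright_line)    # Copyright 2016 The New York Times Company
```

Because the text of every nested span is concatenated regardless of its class, this does not depend on whether a given record uses c2, c3 or c4 inside the c0 block.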

This question about scraping text between HTML tags in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40929609/
