python - How to match the start and end of lines across multiple lines

Tags: python regex python-3.x

I scraped some text from a website with BeautifulSoup. It initially looks like this:

9-Day Weather Forecast

General Situation: An anticyclone aloft over the northern part of the South China Sea will bring mainly fine and hot weather to the south China coast in the next few days. Under the influence of a trough of low pressure, there will be showers over southern China midweek next week.

Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

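I want to separate out each section between "Date/Month" and "Per Cent", each of which spans several lines. I get a NavigableString by looking up one large string inside an HTML tag. I tried, but could not search the NavigableString with re, so I converted it to a unicode string: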

daily_forecast_text = str(daily_forecast_text.encode('utf-8'))

This returns the following:

b'\r\n9-Day Weather Forecast\n\nGeneral Situation:\nAn anticyclone aloft over the northern part of the South\nChina Sea will bring mainly fine and very hot weather to the\nsouth China coast in the next few days. Under the influence\nof a trough of low pressure, there will be showers over\nsouthern China midweek next week.\n\nDate/Month 18/5 (Friday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 27 - 33 C\nR.H. Range: 60 - 85 Per Cent\n\nDate/Month 19/5(Saturday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 27 - 33 C\nR.H. Range: 60 - 85 Per Cent\n\nDate/Month 20/5(Sunday)\nWind: South force 2 to 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 28 - 33 C\nR.H. Range: 65 - 85 Per Cent\n\nDate/Month 21/5(Monday)\nWind: Southwest force 3.\nWeather: Fine. Very hot during the day.\nTemp Range: 28 - 33 C\nR.H. Range: 65 - 85 Per Cent\n\nDate/Month 22/5(Tuesday)\nWind: Southwest force 2 to 3.\nWeather: Mainly fine and very hot. Isolated showers later.\nTemp Range: 28 - 33 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 23/5(Wednesday)\nWind: Light winds force 2.\nWeather: Sunny intervals and a few showers.\nTemp Range: 27 - 31 C\nR.H. Range: 70 - 95 Per Cent\n\nDate/Month 24/5(Thursday)\nWind: South force 2 to 3.\nWeather: Hot with sunny periods and a few showers.\nTemp Range: 27 - 32 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 25/5(Friday)\nWind: South force 3.\nWeather: Hot with sunny periods and one or two showers.\nTemp Range: 27 - 32 C\nR.H. Range: 70 - 90 Per Cent\n\nDate/Month 26/5(Saturday)\nWind: South force 3 to 4.\nWeather: Hot with sunny periods and one or two showers.\nTemp Range: 27 - 32 C\nR.H. 
Range: 70 - 90 Per Cent\n\nSea surface temperature at 2 p.m.17/5/2018 at North Point\nwas 27 degrees C.\n\nSoil temperatures at 7 a.m.17/5/2018 at the Hong Kong\nObservatory:\n0.5 M below surface was 27.7 degrees C.\n1.0 M below surface was 26.6 degrees C.\n\nWeather Cartoons for 9-day weather forecast\nDay 1 cartoon no. 90 - Hot\nDay 2 cartoon no. 90 - Hot\nDay 3 cartoon no. 90 - Hot\nDay 4 cartoon no. 90 - Hot\nDay 5 cartoon no. 90 - Hot\nDay 6 cartoon no. 54 - Sunny Intervals with Showers\nDay 7 cartoon no. 53 - Sunny Periods with A Few Showers\nDay 8 cartoon no. 53 - Sunny Periods with A Few Showers\nDay 9 cartoon no. 53 - Sunny Periods with A Few Showers\n'

The following code returns nothing:

result = re.findall(
    "^Date.+Cent$", daily_forecast_text, flags=re.MULTILINE | re.DOTALL)

The following code captures all of the text, but it returns one big string that starts at the first "Date/Month" and ends at the last "Per Cent".

result = re.findall(
    "Date.+Cent", daily_forecast_text, flags=re.MULTILINE | re.DOTALL)

Best answer

HTML containing your text:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p id="weather">9-Day Weather Forecast

General Situation: An anticyclone aloft over the northern part of the South China Sea will bring mainly fine and hot weather to the south China coast in the next few days. Under the influence of a trough of low pressure, there will be showers over southern China midweek next week.

Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent

Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

Get the tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find(id='weather')

Although tag.string is a bs4 NavigableString, it is also a Python str:

>>> 
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
>>> isinstance(tag.string, str)
True
>>> 'South force 3' in tag.string
True
>>> 
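By contrast, the conversion used in the question, str(text.encode('utf-8')), does not produce ordinary text in Python 3. It produces the printed representation of a bytes object, in which each newline becomes the two characters backslash and n. The sketch below illustrates this on a short made-up sample (the variable names and the sample text are assumptions, not the real forecast), which is a likely reason the anchored pattern found nothing:

```python
import re

# Hypothetical short sample standing in for the scraped forecast text.
text = (
    "Intro\n\n"
    "Date/Month 18/5 (Friday)\n"
    "Wind: South force 3.\n"
    "R.H. Range: 65 - 85 Per Cent\n"
)

# What the question did: encode to bytes, then call str() on the bytes object.
converted = str(text.encode("utf-8"))

# The result is the repr of the bytes object: it starts with b' and every
# newline has become the two characters backslash + n.
starts_with_b = converted.startswith("b'")
has_real_newline = "\n" in converted
has_escaped_newline = "\\n" in converted

flags = re.MULTILINE | re.DOTALL

# With no real newlines there is only one "line", and it starts with b',
# so ^Date cannot match, even with re.MULTILINE.
anchored_on_converted = re.findall(r"^Date.+Cent$", converted, flags=flags)

# The same pattern applied to the original str does match.
anchored_on_text = re.findall(r"^Date.+Cent$", text, flags=flags)

print(starts_with_b, has_real_newline, has_escaped_newline)
print(anchored_on_converted)
print(len(anchored_on_text))
```

If the data really does arrive as bytes, calling .decode('utf-8') on it yields a str with real newlines, which is what ^ and $ need.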

No conversion is needed to search it with a regular expression:

import re

# Lazy .*? stops at the first 'Per Cent'; DOTALL lets . match newlines too
pattern = r'Date/Month.*?Per Cent'
rex = re.compile(pattern, flags=re.DOTALL)
for match in rex.findall(tag.string):
    print(match)
    print('**************')

>>>
Date/Month 18/5 (Friday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
**************
Date/Month 19/5(Saturday)

Wind: South force 3.

Weather: Fine and hot.

Temp Range: 27 - 32 C

R.H. Range: 65 - 85 Per Cent
**************
>>> 
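If the line anchors from the question are still wanted, they can be combined with a lazy quantifier. The following is a minimal sketch on a made-up two-section sample (the sample text and variable names are assumptions): with re.MULTILINE, ^ and $ match at the start and end of every line, and replacing .+ with .+? makes each match stop at the first "Per Cent" that ends a line instead of the last one.

```python
import re

# Hypothetical sample with two forecast sections, standing in for tag.string.
text = (
    "9-Day Weather Forecast\n\n"
    "Date/Month 18/5 (Friday)\n"
    "Wind: South force 3.\n"
    "R.H. Range: 65 - 85 Per Cent\n\n"
    "Date/Month 19/5(Saturday)\n"
    "Wind: South force 3.\n"
    "R.H. Range: 65 - 85 Per Cent\n"
)

flags = re.MULTILINE | re.DOTALL

# Greedy: .+ runs to the last "Cent", so both sections end up in one match.
greedy = re.findall(r"^Date.+Cent$", text, flags=flags)

# Lazy: .+? stops at the first "Per Cent" that is followed by an end of line.
lazy = re.findall(r"^Date/Month.+?Per Cent$", text, flags=flags)

print(len(greedy))
print(len(lazy))
for section in lazy:
    # Show only the first line of each section.
    print(section.splitlines()[0])
```

Note that $ in multiline mode matches only just before "\n", so text whose lines end in "\r\n" would need the carriage returns handled first, for example by allowing an optional \r before $.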

This question, python - How to match the start and end of lines across multiple lines, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/50403483/
