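I want to parse decimal numbers regardless of their format, which is unknown. The language of the original text is unknown and may vary. In addition, the source string can contain extra text before or after the number, such as a currency or a unit.
I am using the following: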
# NOTE: Do not use, this algorithm is buggy. See below.
import re

def extractnumber(value):
    if (isinstance(value, int)): return value
    if (isinstance(value, float)): return value
    result = re.sub(r'&#\d+', '', value)
    result = re.sub(r'[^0-9\,\.]', '', result)
    if (len(result) == 0): return None
    numPoints = result.count('.')
    numCommas = result.count(',')
    result = result.replace(",", ".")
    if ((numPoints > 0 and numCommas > 0) or (numPoints == 1) or (numCommas == 1)):
        decimalPart = result.split(".")[-1]
        integerPart = "".join ( result.split(".")[0:-1] )
    else:
        integerPart = result.replace(".", "")
    result = int(integerPart) + (float(decimalPart) / pow(10, len(decimalPart) ))
    return result
This sort of works...
>>> extractnumber("2")
2
>>> extractnumber("2.3")
2.3
>>> extractnumber("2,35")
2.35
>>> extractnumber("-2 000,5")
-2000.5
>>> extractnumber("EUR 1.000,74 €")
1000.74
>>> extractnumber("20,5 20,8") # Testing failure...
ValueError: invalid literal for int() with base 10: '205 208'
>>> extractnumber("20.345.32.231,50") # Returns false positive
2034532231.5
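So my method seems very fragile to me, and it returns a lot of false positives.
Is there any library or smart function that can handle this? Ideally, 20.345.32.231,50 should not pass, but numbers in other languages such as 1.200,50 or 1 200'50 would be extracted, regardless of the amount of other surrounding text and characters (including newlines).
(Updated implementation based on the accepted answer: https://github.com/jjmontesl/cubetl/blob/master/cubetl/text/functions.py#L91 )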
Best answer
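You can do this with a suitably fancy regular expression. Here is my best attempt at one. I use named capturing groups because, with a pattern this complex, numbered groups would be much more confusing to use in backreferences.
First, the regular expression pattern: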
_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+-\.]*            # prefixed junk
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>     # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""
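To make the role of the named backreference easier to see in isolation, here is a minimal sketch that keeps only the grouped-integer part of the idea. The name `grouped_int` is hypothetical and this is a simplified illustration, not the full pattern from the answer.

```python
import re

# Illustrative, simplified pattern (an assumption for demonstration only,
# not the answer's full _pattern): a grouped integer whose separators
# must all be the same character.
grouped_int = re.compile(r"""(?x)
    \d{1,3}            # leading group of one to three digits
    (?P<sep>[ ,.])     # first separator, captured by name
    \d{3}              # exactly three digits
    (?:                # any further groups
        (?P=sep)       # must reuse the captured separator
        \d{3}
    )*
""")

for text in ["1.234.567", "1 234 567", "1.234,567"]:
    match = grouped_int.fullmatch(text)
    print(repr(text), "->", repr(match.group("sep")) if match else None)
```

The first two strings match because every separator is the same character. The third fails as a whole because the second separator differs from the one captured in `sep`.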
Here is a function that uses _pattern:
def parse_number(text):
    match = re.match(_pattern, text)
    if match is None or not (match.group("integer_part") or
                             match.group("decimal_part")):    # failed to match
        return None                      # consider raising an exception instead
    num_str = match.group("number")      # get all of the number, without the junk
    sep = match.group("sep")
    if sep:
        num_str = num_str.replace(sep, "")     # remove thousands separators
    if match.group("decimal_part"):
        point = match.group("point")
        if point != ".":
            num_str = num_str.replace(point, ".")  # regularize the decimal point
        return float(num_str)
    return int(num_str)
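Some numeric strings with just one comma or period followed by exactly three digits (such as "1,234" and "1.234") are ambiguous. This code parses both of them as integers with a thousands separator (1234) rather than as floating point values (1.234), regardless of the actual separator character used. If you want a different result for those numbers (for example, if you would prefer to get a float from 1.234), you can handle this with a special case.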
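One possible shape for such a special case is sketched below. The helper name `resolve_ambiguous` and its `decimal_point` parameter are hypothetical and not part of the answer's code. The sketch assumes the caller knows which character should count as the decimal point, and that it receives the number string with the surrounding junk already removed.

```python
import re

# Hypothetical helper (an assumption, not from the original answer).
# It recognizes only the ambiguous shape: an optional sign, one to three
# digits, a single '.' or ',', then exactly three digits.
_AMBIGUOUS = re.compile(r"[+-]?\d{1,3}([.,])\d{3}")

def resolve_ambiguous(num_str, decimal_point="."):
    """Return a float if num_str is ambiguous and uses decimal_point, else None."""
    match = _AMBIGUOUS.fullmatch(num_str)
    if match is None or match.group(1) != decimal_point:
        return None  # not ambiguous, or the separator is not the chosen decimal point
    return float(num_str.replace(decimal_point, "."))

print(resolve_ambiguous("1.234"))
print(resolve_ambiguous("1,234"))
print(resolve_ambiguous("1,234", decimal_point=","))
```

Inside parse_number, a check like this could be run on match.group("number") before the thousands separator is stripped, falling through to the existing logic whenever it returns None.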
Some test output:
>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
...               "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
...     print("{!r:20}: {}".format(s, parse_number(s)))
...
'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
Source: "python - Fuzzy smart number parsing in Python" on Stack Overflow: https://stackoverflow.com/questions/20157375/