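I have been struggling with this problem for several days and have tried multiple approaches, but none of them seems to work in a way I can use...
The problem.
I am given an arbitrary byte stream, written as hex. Hidden among the bytes are some semantic elements, marked by curly brackets, square brackets and parentheses. These denote three different things: {} is a number of bytes, e.g. {17} is 17 bytes. [] is a range of byte values, e.g. [90:95] is the bytes x90, x91, x92, x93, x94, x95. () is a set of byte value 'OR' options, e.g. (46|47) means x46 or x47.
There are other syntactic constructs I also have to detect: '!', '*', '?' and ':'.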
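To make the three constructs concrete, here is a minimal sketch of how each token could be interpreted once it has been isolated. The function names (`expand_byte_range`, `expand_options`, `skip_length`) are hypothetical, and it assumes the count in {} is decimal while the values in [] and () are two-digit hex bytes.

```python
import re

HEX_BYTE = r"[0-9a-fA-F]{2}"

def expand_byte_range(token):
    """'[90:95]' -> [0x90, ..., 0x95]; bounds are hex and inclusive."""
    m = re.fullmatch(r"\[(%s):(%s)\]" % (HEX_BYTE, HEX_BYTE), token)
    if m is None:
        raise ValueError("not a byte range: %r" % token)
    low, high = int(m.group(1), 16), int(m.group(2), 16)
    return list(range(low, high + 1))

def expand_options(token):
    """'(46|47)' -> one bytes object per alternative."""
    m = re.fullmatch(r"\(((?:%s)+(?:\|(?:%s)+)+)\)" % (HEX_BYTE, HEX_BYTE), token)
    if m is None:
        raise ValueError("not an option set: %r" % token)
    return [bytes.fromhex(alt) for alt in m.group(1).split("|")]

def skip_length(token):
    """'{17}' -> 17; the count is assumed to be decimal."""
    m = re.fullmatch(r"\{([0-9]+)\}", token)
    if m is None:
        raise ValueError("not a byte count: %r" % token)
    return int(m.group(1))
```

The '!', '*', '?' and ':' constructs are left out of this sketch because their meaning is not spelled out here.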
字节流示例:524946(46|58){4}434452367672736E
I am trying to filter it so that I get something like:
1 string 524946
2 token (46|58)
3 token {4}
4 string 434452367672736E
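Once it is split up, I can process it further.
The closest I have come to getting it working (it is ugly, ugly, ugly code...): http://pastebin.com/XLg2H0PW
I tried using some regular expressions, but I cannot get them to stop counting the hex bytes inside the syntax units as ordinary string elements: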
range_masks_list = [m.span() for m in re.finditer(r"\{([0-9]+|[0-9]+-[0-9]+|[0-9]+-\*)\}", sequence)]  ## looks for {int}, {int-int} and {int-*}
byte_masks_list = [m.span() for m in re.finditer(r"\[[a-fA-F0-9]{2}:[a-fA-F0-9]{2}\]", sequence)]  ## looks for [a:b] where a and b are byte values
options_sets_list = [m.span() for m in re.finditer(r"\(([a-fA-F0-9]{2})+\|([a-fA-F0-9]{2})+(\|([a-fA-F0-9]{2})+)*\)", sequence)]  ## looks for regex OR clauses, e.g. (a|b)
string_chunk_list = [m.span() for m in re.finditer(r"([a-fA-F0-9]{2})+", sequence)]  ## looks for uninterrupted hex byte spans
In context, it looks like this:
import re

def do_fragmenter(self, sequence):
    """Convert the grep-grammar normalised string into a set of fragments and offsets for sig population."""
    sequence = sequence.replace(" ", "")
    # {int}, {int-int} and {int-*}
    range_masks_list = [m.span() for m in re.finditer(r"\{([0-9]+|[0-9]+-[0-9]+|[0-9]+-\*)\}", sequence)]
    # [a:b] where a and b are byte values
    byte_masks_list = [m.span() for m in re.finditer(r"\[[a-fA-F0-9]{2}:[a-fA-F0-9]{2}\]", sequence)]
    # regex OR clauses, e.g. (a|b)
    options_sets_list = [m.span() for m in re.finditer(r"\(([a-fA-F0-9]{2})+\|([a-fA-F0-9]{2})+(\|([a-fA-F0-9]{2})+)*\)", sequence)]
    # uninterrupted hex byte spans (this also matches bytes inside the tokens above)
    string_chunk_list = [m.span() for m in re.finditer(r"([a-fA-F0-9]{2})+", sequence)]
    string_chunks = [sequence[start:end] for start, end in string_chunk_list]
    string_chunks_len = [len(chunk) for chunk in string_chunks]
    print(list(zip(string_chunks, string_chunks_len)))
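The overlap described above can be reproduced in a few lines. This is only a sketch: `token_re` and `inside_token` are hypothetical helpers, and the bracket matching assumes tokens are never nested. It shows the hex-run pattern picking up the bytes inside (46|58), and one possible way to discard hex spans that fall within a bracketed token.

```python
import re

sequence = "524946(46|58){4}434452367672736E"

# The plain hex-run pattern also matches the bytes inside "(46|58)".
hex_run = re.compile(r"(?:[a-fA-F0-9]{2})+")
chunks = [m.group(0) for m in hex_run.finditer(sequence)]
print(chunks)  # ['524946', '46', '58', '434452367672736E']

# Spans of anything enclosed in (), [] or {} (assumes no nesting).
token_re = re.compile(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}")
token_spans = [m.span() for m in token_re.finditer(sequence)]

def inside_token(span):
    """True if the span lies entirely within one bracketed token."""
    return any(start <= span[0] and span[1] <= end for start, end in token_spans)

plain = [m.group(0) for m in hex_run.finditer(sequence) if not inside_token(m.span())]
print(plain)  # ['524946', '434452367672736E']
```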
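Best answer
Taking into account only the syntax elements you have defined, you could use something like this (replace the prints with whatever processing you need):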
#! /usr/bin/python3.2
import re

a = '524946(46|58){4}434452[22:33]367672736E'

# Patterns are tried in order; re.match only matches at the start of the string.
patterns = [
    (r'([0-9a-fA-F]+)', 'Sequence '),
    (r'(\([0-9a-fA-F]+\|[0-9a-fA-F]+\))', 'Option '),
    (r'(\{[0-9a-fA-F]+\})', 'Curly '),
    (r'(\[[0-9a-fA-F]+:[0-9a-fA-F]+\])', 'Slice '),
]

while a:
    found = False
    for pattern, name in patterns:
        m = re.match(pattern, a)
        if m:
            m = m.groups()[0]
            print(name + m)
            a = a[len(m):]  # consume the matched prefix
            found = True
            break
    if not found:
        raise Exception('Unrecognized sequence')
Output:
Sequence 524946
Option (46|58)
Curly {4}
Sequence 434452
Slice [22:33]
Sequence 367672736E
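The same prefix-consuming idea can also be written as one compiled pattern with named groups, which avoids re-slicing the string on every step. This is a sketch rather than a drop-in replacement: the names `TOKEN_RE` and `tokenize` are hypothetical, and the curly form is widened to {int}, {int-int} and {int-*} on the assumption that those are the forms targeted by the regexes in the question. The '!', '*', '?' and ':' constructs are not handled.

```python
import re

# One alternative per token kind; inner groups are non-capturing so that
# m.lastgroup is always the name of the alternative that matched.
TOKEN_RE = re.compile(r"""
    (?P<string>(?:[0-9a-fA-F]{2})+)
  | (?P<option>\((?:[0-9a-fA-F]{2})+(?:\|(?:[0-9a-fA-F]{2})+)+\))
  | (?P<curly>\{[0-9]+(?:-(?:[0-9]+|\*))?\})
  | (?P<slice>\[[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\])
""", re.VERBOSE)

def tokenize(sequence):
    """Split a signature string into (kind, text) pairs, left to right."""
    sequence = sequence.replace(" ", "")
    tokens = []
    pos = 0
    while pos < len(sequence):
        m = TOKEN_RE.match(sequence, pos)  # anchored at pos, no slicing needed
        if m is None:
            raise ValueError("unrecognized input at offset %d" % pos)
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens

for kind, text in tokenize("524946(46|58){4}434452[22:33]367672736E"):
    print(kind, text)
```

Because matching always restarts at the end of the previous token, hex digits inside a bracketed token are never reported as a separate string element.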
Regarding "Python - filtering characters into structures based on type", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14946993/