regex - Capturing all instances of a keyword with PRXNEXT

Tags: regex sas

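I am searching medical notes to capture every instance of a phrase, specifically "carbapenemase producing". Sometimes this phrase occurs more than once in a string. From some research, I think PRXNEXT makes the most sense, but I am having trouble getting it to do what I want. Take this string as an example: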

if amikacin results are needed please notify microbiology lab at ext for further testing the organism will be held until meropenem result obtained by disc diffusion presumptive carbapenemase producing cre see spmi for carba r pcr results not confirmed carbapenemase producing cre

From the comment above, I would like to extract the phrases

presumptive carbapenemase producing

not confirmed carbapenemase producing

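I realize I probably cannot extract those exact phrases, but something containing some variation of the substring would do. I found the code I have been using here. This is what I have so far, but it only captures the first phrase: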

carba_cnt = count(as_comments,'carba','i');

if _n_ = 1 then do;
retain reg1 neg1;
 reg1 = prxparse("/ca[bepr]\w+ prod/");
end;

start = 1;
stop = length(as_comments);
position = 0;
length = 0;

/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances.       */
/* PRXNEXT changes the start parameter so that searching  */
/* begins again after the last match.                     */

call prxnext(reg1, start, stop, as_comments, position, length);

lastpos = 0;
 do while (position > 0);
 if lastpos then do; 
 length found $200;
 found = substr(as_comments,lastpos,position-lastpos);
 put found=;
  output;
 end;
 lastpos = position;

 call prxnext(reg1, start, stop, as_comments, position, length);
 end;

 if lastpos then do;
 found = substr(as_comments,lastpos);
 put found=;
  output;
 end;

Best answer

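You are correct to use PRXNEXT to locate each occurrence of a regex match in the source. The regex pattern can be modified to use capture groups so that it also matches an optional leading "not confirmed". The approach least prone to coder error is to build the loop and the extraction around a single call to PRXNEXT.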

This example uses the pattern /((not confirmed\s*)?(ca[bepr]\w+ prod))/ and outputs one row per match.

data have;
  id + 1;
  length comment $2000;
  infile datalines eof=done;
  do until (_infile_ = '----');
    input;
    if _infile_ ne '----' then 
      comment = catx(' ',comment,_infile_);
  end;
  done:
  if not missing(comment);
  datalines4;
if amikacin results are needed please notify microbiology lab at ext 
for further testing the organism will be held until meropenem result 
obtained by disc diffusion presumptive carbapenemase producing cre 
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
if amikacin results are needed please notify microbiology lab at ext 
for further testing the organism will be held until meropenem result 
obtained by disc diffusion conjectured carbapenems producing cre 
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
;;;;
run;

data want;
  set have;
  prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');

  _start_inout = 1;

  do hitnum = 1 by 1 until (pos=0);
    call prxnext (prx, _start_inout, length(comment), comment, pos, len);
    if len then do;
      content = substr(comment,pos,len);
      output;
    end;
  end;

  keep id hitnum content;
run;
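
Because the pattern wraps the optional "not confirmed" in its own capture group, each hit can also be flagged as negated or not. The following is a minimal sketch, assuming the have data set built above; the data set name want_flagged and the variables negated, gpos and glen are made up for illustration. It uses CALL PRXPOSN, which reads the position and length of a capture buffer from the most recent PRXNEXT match, and treats a zero length for group 2 as "no leading 'not confirmed'".

```sas
data want_flagged;
  set have;
  length content $200;

  /* Same pattern as above: group 2 is the optional "not confirmed" prefix */
  prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');

  start = 1;
  stop  = length(comment);

  /* Prime the loop with the first match, then keep going while one is found */
  call prxnext(prx, start, stop, comment, pos, len);
  do hitnum = 1 by 1 while (pos > 0);
    content = substr(comment, pos, len);

    /* Position and length of capture group 2 for the current match */
    call prxposn(prx, 2, gpos, glen);
    negated = (glen > 0);   /* 1 when "not confirmed" preceded the keyword */

    output;
    call prxnext(prx, start, stop, comment, pos, len);
  end;

  keep id hitnum content negated;
run;
```

This keeps one row per match, as in the example above, and only adds the flag. Whether a zero-length group is the right test for "not negated" depends on how the notes are worded, so treat it as a starting point.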

Additional info: prxparse does not need to be inside an if _n_=1 block. See the PRXPARSE docs:

If perl-regular-expression is a constant or if it uses the /o option, the Perl regular expression is compiled only once. Successive calls to PRXPARSE do not cause a recompile, but returns the regular-expression-id for the regular expression that was already compiled. This behavior simplifies the code because you do not need to use an initialization block (IF _N_ = 1) to initialize Perl regular expressions.
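
One way to check this behavior is sketched below, assuming the have data set from the example above. It calls PRXPARSE with a constant pattern on every row and writes the returned id to the log. According to the documentation quoted above, the pattern is compiled only once, so the same id should appear on every row.

```sas
data _null_;
  set have;

  /* Constant pattern: no IF _N_ = 1 block and no RETAIN needed */
  prx = prxparse('/ca[bepr]\w+ prod/');

  /* Write the row id and the regular-expression id to the log */
  put id= prx=;
run;
```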

Regarding "regex - Capturing all instances of a keyword with PRXNEXT", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55033872/
