sas - Merging huge datasets in SAS

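I need some advice from SAS gurus :).
Suppose I have two large datasets. The first is huge (about 50-100 GB!) and contains phone numbers. The second contains prefixes (20,000-40,000 observations). For each phone number in the first table, I need to add the most appropriate prefix.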

For example, if I have the phone number +71230000 and the prefixes

+7
+71230
+7123

then the most appropriate prefix is +71230, that is, the longest prefix that the number starts with.
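To make the example concrete, here is a minimal sketch that builds the two tables as small test datasets. The variable names `prefix` and `phonenumber` and the lengths `$10` and `$20` are assumptions for illustration; the question does not state the real column names or lengths.

```sas
/* Test data only: the three prefixes from the example above */
data prefixtable ;
  length prefix $10 ;          /* assumed length */
  input prefix $ ;
  datalines ;
+7
+71230
+7123
;
run ;

/* Test data only: the single phone number from the example above */
data phonenumberstable ;
  length phonenumber $20 ;     /* assumed length */
  input phonenumber $ ;
  datalines ;
+71230000
;
run ;
```

Any matching approach can be checked against these two tables before it is run on the full 50-100 GB dataset.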

My idea: first, sort the prefix table. Then process the phone number table in a data step:

data OutputTable;
    set PhoneNumbersTable end=_last;
    retain dsid;  /* keep the dataset handle across observations */
    if _N_ = 1 then do;
        dsid = open('PrefixTable');
    end;
    /* For each observation in PhoneNumbersTable:
       1. Take the first digit of the phone number (`+7`).
          Look it up in PrefixTable and store the observation number of
          this prefix (`n_obs`).
       2. Take the first TWO digits of the phone number (`+71`).
          Look it up in PrefixTable, starting from observation `n_obs + 1`.
          Stop when this prefix is found
          (then store its observation number) or
          when the first digit changes (then the previous match was the
          most appropriate prefix).
       etc.
    */
    if _last then do;
        rc = close(dsid);
    end;
run;

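I hope my idea is clear enough; if not, I apologize.

So what would you suggest? Thanks for your help.

P.S. Of course, the phone numbers in the first table are not unique (they may repeat), and unfortunately my algorithm does not take advantage of that.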

Best answer

There are a couple of ways to do this: you can use a format or a hash table.

Example using a format:

/* Build a simple format of all prefixes, and determine max prefix length */
data prefix_fmt ;
  set prefixtable end=eof ;
  retain fmtname 'PREFIX' type 'C' maxlen . ;
  maxlen = max(maxlen,length(prefix)) ; /* Store maximum prefix length */
  start = prefix ;
  label = 'Y' ;
  output ;
  if eof then do ;
    hlo = 'O' ;
    label = 'N' ;
    output ;

    call symputx('MAXPL',maxlen) ;
  end ;

  drop maxlen ;
run ;
proc format cntlin=prefix_fmt ; run ; 

/* For each phone number, start with the longest candidate and drop one character at a time until a prefix matches */
/* For efficiency, first truncate the phone number to the maximum prefix length */
data match_prefix ;
  set phonenumberstable ;

  length prefix pnum $&MAXPL ;

  prefix = '' ;
  pnum = phonenumber ; /* Assignment truncates to the maximum prefix length */

  do while (missing(prefix) and not missing(pnum)) ;
    if put(pnum,$PREFIX.) = 'Y' then prefix = pnum ;
    else if length(pnum) > 1 then pnum = substr(pnum,1,length(pnum)-1) ; /* Drop last character */
    else pnum = '' ; /* Nothing left to try: leave prefix blank */
  end ;
  drop pnum ;
run ;
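
The answer mentions hash tables as the second option but does not show one. The following is a minimal sketch of that approach, not the answerer's code. It assumes the same names as the format example (`prefixtable` with a character variable `prefix`, `phonenumberstable` with a character variable `phonenumber`) and that `phonenumberstable` has no variable called `prefix` of its own. The output name `match_prefix_hash` and the helper variables `len` and `found` are made up for illustration.

```sas
data match_prefix_hash ;
  if _n_ = 1 then do ;
    /* Never executed; only defines PREFIX in the PDV with its stored length */
    if 0 then set prefixtable(keep=prefix) ;
    /* Load every prefix into memory once */
    declare hash h(dataset: 'prefixtable(keep=prefix)') ;
    h.defineKey('prefix') ;
    h.defineDone() ;
  end ;

  set phonenumberstable ;

  /* Try the longest candidate first and shorten it until a prefix matches */
  found = 0 ;
  do len = min(vlength(prefix), length(phonenumber)) to 1 by -1 until (found) ;
    prefix = substr(phonenumber, 1, len) ;
    found = (h.check() = 0) ;   /* CHECK returns 0 when the key is in the hash */
  end ;
  if not found then prefix = '' ;   /* no matching prefix */

  drop len found ;
run ;
```

With only 20,000-40,000 prefixes the hash should fit comfortably in memory, and the large phone table is read once without being sorted. Unlike the format example, this sketch starts from the declared length of `prefix` rather than the longest value actually present, so it may test a few extra candidates per row.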

Regarding "sas - Merging huge datasets in SAS", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32581629/
