perl - How can I extract US Department of Defense contract information for statistical analysis?

Tags: perl

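I'm trying to scrape and analyze the contracts awarded by the Department of Defense and correlate them with other economic data I've already obtained. All of it is publicly available on Defense.gov.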

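However, rather than listing the awards in a table, they write the relevant information (contractor, date, name, contract ID, etc.) in paragraph form. I've been trying to get the data into a CSV so that I can run it through R.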

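Normally I just extract based on the tags around the data, but can anyone recommend an easier way to get at this data? I've already downloaded the pages with wget; all that is left is the extraction.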

Here is an example of a typical paragraph:

Booz Allen Hamilton, Inc., Herndon, Va., is being awarded a $9,450,189 cost-plus-fixed-fee, indefinite-delivery, requirements contract for research and development in order to complete/deliver the assessment of army warfighting challenges and integrated learning plans, the experiment final reports, and experiment-to-action plans. The U.S. Army will use these reports to develop and revise Army concepts and contribute to other services and joint concepts; make recommendations for the development of Army and joint capabilities development scenarios; research current and future warfare through experimentation; and build models and simulations to test new warfighting ideas. ESG/PKS DTIC, Offut Air Force Base, Neb., is the contracting activity (SP0700-03-D-1380, Delivery Order: 0452).

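I started with a Perl script, but it doesn't extract very well. I'm curious whether anyone has built a more dynamic script that I could build on instead of rebuilding from scratch.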

#!/usr/bin/perl -w
use Spreadsheet::WriteExcel;


# Create a new workbook called Dec4_min.xls and add a worksheet.
my $workbook  = Spreadsheet::WriteExcel->new('Dec4_min.xls');
my $worksheet = $workbook->add_worksheet();
our $row = 0;


@files = <~/Def_Contracts/*.*>;

foreach $HTML (@files) {                # open each file in folder  #$HTML = "contract.html";
    open (HTML) or die "Can't open the file!";
    @fullpage = <HTML>;
    print "fullpage array size = ", @fullpage. "\n";

my @cleaned;                        # this is a simplified array we will create

foreach $curr (@fullpage){              #this for each loop cuts array elements without dollar signs

# [0-9]+?\/[0-9]+?\/[0-9]{3}

        if($curr =~ m/content="([0-9]+?\/[0-9]+?\/[0-9]{4})/) {     #get date - looking for this: content="8/29/1995"
        print $1;
#           if ($currnt =~ m/([0-9]+,.[0-9]{4}/){   #  extract date dd,(space)dddd  
        our $date = $1;
        }

                            # CLEAN UP
    while(substr($curr,0,1) =~ m/[^\w]/){       # while not a word char   
            substr($curr,0,1)='';                 #cut that char
    }   
    if($curr =~ m/\$[0-9]/) {           # only use if has $number.


#######################  Now we've got what we need, output relevant parts into excel.


        my $firstcom = index($curr, ',');
        $name = substr($curr,0,$firstcom); 
    #   print "Name:", $name. "\n";
        $worksheet->write($row,0,$name);        # print the name in the first col

        $worksheet->write($row,1,$date);            # print the date in the 2nd col

        if($curr =~ m/\$([0-9,]*)/) {           # finds the cost PROBLEM: there may be more than one
    #   print "Cost:", $1. "\n";
        $worksheet->write($row,2,$1);
        }

        if($curr =~ m/([A-Za-z0-9][A-Z0-9]{4}[A-Z0-9]?\-[0-9]+\-[A-Z]\-[A-Z0-9]{4})/) {     # print ref # in 4th col (index 3)
    #   print "Cost:", $1. "\n";
        $worksheet->write($row,3,$1);               # ref takes form (letter ...-...-...number)
        }
                                            # 2nd attempt to get ref #
        if($curr =~ m/\((.*\-.*\-.*)\)/){   # print ref # in 5th col (index 4)
    #   print "Cost:", $1. "\n";
        $worksheet->write($row,4,$1);               # ref takes form (letter ...-...-...number)
        }

        $worksheet->write($row,5,$curr);        # print full record (for verification!)

        $row ++;    
    }                           # close for if has a number statement
}                               # close foreach line of HTML Page

#print "cleaned array size = ", @cleaned. "\n";



print "The end.\n";
close (HTML);

}               # End of foreach file 

Best answer

Obviously very incomplete, but it usually takes a lot of cash to convince me to deal with this kind of mess (VIEWSTATE, really?):

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
use Regexp::Common qw( number );

my $parser = HTML::TokeParser::Simple->new('contract.html');
my %contracts;

while (my $tag = $parser->get_tag('p')) {
    if (defined( my $align = $tag->get_attr('align')) ) {
        my $text = get_text_in_p($parser);
        next unless defined $text;

        if (lc($text) eq 'contracts') {
            process_contracts($parser, \%contracts);
        }
    }
}

use YAML;
print Dump \%contracts;

sub process_contracts {
    my ($parser, $contracts) = @_;

    my $current_dept = '';

    while (my $tag = $parser->get_tag('p')) {
        my $text = get_text_in_p($parser);
        next unless defined $text;

        if (defined $tag->get_attr('align')) {
            $current_dept = $text;
            next;
        }

        my ($company)  = ($text =~ /^(.+?), (?:is|was)/);

        my ($amount)   = ($text =~ m{
                (
                    \$
                    $RE{num}{int}{-base => 10}{-sep => ','}
                )
            }x
        );

        my ($contract) = ($text =~ m{
                (
                    [A-Z0-9]{6}
                    [A-Z0-9/-]+
                )
            }x
        );

        push @{ $contracts->{$current_dept} }, {
            company  => $company,
            amount   => $amount,
            contract => $contract,
            # text     => $text,
        };
    }
}

sub get_text_in_p {
    my ($parser) = @_;
    my $text = $parser->get_text('/p');
    return unless defined $text;

    $text =~ s/^[^A-Z]+//;
    $text =~ s/\s+\z//;

    return $text;
}

Output for No. 1001-11:

---
DEFENSE LOGISTICS AGENCY:
  - amount: '$49,418,113'
    company: 'Physio-Control, Inc., Redmond, Wash.'
    contract: SPM200-07-D-8261/P00005
  - amount: '$43,246,524'
    company: 'Johnson & Johnson Healthcare Systems, on behalf of Ortho-McNell-Janssen Pharmaceuticals, Inc., Piscataway, N.J.'
    contract: SPM2D0-12-D-0001
  - amount: '$15,240,054'
    company: 'Patterson Dental Supply, Inc., Minn.'
    contract: SPM2DE-10-D-7447/P00005
NAVY:
  - amount: '$60,360,995'
    company: 'Raytheon Co., Integrated Defense Systems, San Diego, Calif.'
    contract: N00024-11-C-2404
  - amount: '$33,693,891'
    company: 'Wyle Laboratories, Inc., Huntsville, Ala.'
    contract: N00421-03-D-0015
  - amount: '$30,071,729'
    company: 'Deloitte Consulting, L.L.P., Lexington Park, Md.'
    contract: N00421-03-D-0014
  - amount: '$22,151,900'
    company: 'Raytheon Co., Tucson, Ariz.'
    contract: N00024-08-C-5401
  - amount: '$18,508,325'
    company: 'Canadian Commercial Corp., General Dynamics Land Systems - Canada, Ontario, Canada'
    contract: M67854-07-D-5028
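
Since the stated goal is a CSV that R can read, the per-paragraph extraction can also be sketched with core Perl only. This is a minimal sketch and not part of the answer above: `parse_award` and `csv_field` are hypothetical helper names, the sample text is an abbreviated version of the paragraph from the question, and the contract pattern assumes the reference sits in the last parenthesized group of the paragraph. It collects every dollar amount rather than only the first, which is the problem flagged in the question's script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: quote one value as a CSV field,
# doubling any embedded double quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: pull the fields out of one award paragraph.
sub parse_award {
    my ($text) = @_;

    # Company and location: everything before the first ", is" or ", was".
    my ($company) = $text =~ /^(.+?), (?:is|was)\b/;

    # Every dollar amount in the paragraph, not only the first one.
    my @amounts = $text =~ /\$([0-9][0-9,]*)/g;
    tr/,//d for @amounts;    # "9,450,189" becomes "9450189"

    # Assumption: the contract reference is the final parenthesized group
    # and starts with six characters from A-Z/0-9 followed by a hyphen.
    my ($contract) = $text =~ /\(([A-Z0-9]{6}-[^()]+)\)\.?\s*\z/;

    return { company => $company, amounts => \@amounts, contract => $contract };
}

# Abbreviated version of the sample paragraph from the question.
my $text = 'Booz Allen Hamilton, Inc., Herndon, Va., is being awarded a '
         . '$9,450,189 cost-plus-fixed-fee, indefinite-delivery, requirements '
         . 'contract for research and development. ESG/PKS DTIC, Offut Air '
         . 'Force Base, Neb., is the contracting activity '
         . '(SP0700-03-D-1380, Delivery Order: 0452).';

my $award = parse_award($text);

# One CSV row: company, first amount, contract reference.
print join(',', map { csv_field($_) }
    $award->{company}, $award->{amounts}[0], $award->{contract}), "\n";
```

For the sample text this prints `"Booz Allen Hamilton, Inc., Herndon, Va.","9450189","SP0700-03-D-1380, Delivery Order: 0452"`. Quoting every field keeps the commas inside company names and contract references from being read as column separators by R's `read.csv`. Awards written as "$49.5 million", or paragraphs that do not end with the parenthesized reference, would need extra patterns.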

Regarding "perl - How can I extract US Department of Defense contract information for statistical analysis?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8407785/
