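I'm trying to scrape and analyze the contracts awarded by the Department of Defense and correlate them with other economic data I've already gathered. It's all publicly available at Defense.gov.
However, rather than listing it in a table, they write up the relevant information (contractor, date, name, contract ID, etc.) in paragraph form. I've been trying to get the data into a CSV so I can run it through R.
Normally I just extract based on the tags around the data, but can anyone recommend a simpler way to get at this data? I've already pulled the pages down with wget; I just need to extract the fields.
Here is an example of a typical paragraph: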
Booz Allen Hamilton, Inc., Herndon, Va., is being awarded a $9,450,189 cost-plus-fixed-fee, indefinite-delivery, requirements contract for research and development in order to complete/deliver the assessment of army warfighting challenges and integrated learning plans, the experiment final reports, and experiment-to-action plans. The U.S. Army will use these reports to develop and revise Army concepts and contribute to other services and joint concepts; make recommendations for the development of Army and joint capabilities development scenarios; research current and future warfare through experimentation; and build models and simulations to test new warfighting ideas. ESG/PKS DTIC, Offut Air Force Base, Neb., is the contracting activity (SP0700-03-D-1380, Delivery Order: 0452).
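I started with a Perl script, but the extraction doesn't work very well. I'm curious whether anyone has built a more dynamic script that I could build on rather than starting from scratch.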
#!/usr/bin/perl -w
use Spreadsheet::WriteExcel;

# Create a new workbook called simple.xls and add a worksheet.
my $workbook = Spreadsheet::WriteExcel->new('Dec4_min.xls');
my $worksheet = $workbook->add_worksheet();
our $row = 0;
@files = <~/Def_Contracts/*.*>;

foreach $HTML (@files) { # open each file in folder
    #$HTML = "contract.html";
    open (HTML) or die "Can't open the file!";
    @fullpage = <HTML>;
    print "fullpage array size = ", @fullpage. "\n";
    my @cleaned; # this is a simplified array we will create

    foreach $curr (@fullpage){ #this for each loop cuts array elements without dollar signs
        # [0-9]+?\/[0-9]+?\/[0-9]{3}
        if($curr =~ m/content="([0-9]+?\/[0-9]+?\/[0-9]{4})/) { #get date - looking for this: content="8/29/1995"
            print $1;
            # if ($currnt =~ m/([0-9]+,.[0-9]{4}/){ # extract date dd,(space)dddd
            our $date = $1;
        }

        # CLEAN UP
        while(substr($curr,0,1) =~ m/[^\w]/){ # while not a word char
            substr($curr,0,1)=''; #cut that char
        }

        if($curr =~ m/\$[0-9]/) { # only use if has $number.
            ####################### Now we've got what we need, output relevant parts into excel.
            my $firstcom = index($curr, ',');
            $name = substr($curr,0,$firstcom);
            # print "Name:", $name. "\n";
            $worksheet->write($row,0,$name); # print the name in the first col
            $worksheet->write($row,1,$date); # print the date in the 2nd col
            if($curr =~ m/\$([0-9,]*)/) { # finds the cost PROBLEM: there may be more than one
                # print "Cost:", $1. "\n";
                $worksheet->write($row,2,$1);
            }
            if($curr =~ m/([A-Za-z0-9][A-Z0-9]{4}[A-Z0-9]?\-[0-9]+\-[A-Z]\-[A-Z0-9]{4})/) { # print ref # in 3rd col
                # print "Cost:", $1. "\n";
                $worksheet->write($row,3,$1); # ref takes form (letter ...-...-...number)
            }
            # 2nd attempt to get ref #
            if($curr =~ m/\((.*\-.*\-.*)\)/){ # print ref # in 4rd col
                # print "Cost:", $1. "\n";
                $worksheet->write($row,4,$1); # ref takes form (letter ...-...-...number)
            }
            $worksheet->write($row,5,$curr); # print full record (for verification!)
            $row ++;
        } # close for if has a number statement
    } # close foreach line of HTML Page

    #print "cleaned array size = ", @cleaned. "\n";
    print "The end.\n";
    close (HTML);
} # End of foreach file
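Best answer
Obviously, this is very incomplete, but then it usually takes a large amount of cash to convince me to deal with this kind of mess (VIEWSTATE, really?):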
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use Regexp::Common qw( number );

my $parser = HTML::TokeParser::Simple->new('contract.html');
my %contracts;

while (my $tag = $parser->get_tag('p')) {
    if (defined( my $align = $tag->get_attr('align')) ) {
        my $text = get_text_in_p($parser);
        next unless defined $text;
        if (lc($text) eq 'contracts') {
            process_contracts($parser, \%contracts);
        }
    }
}

use YAML;
print Dump \%contracts;

sub process_contracts {
    my ($parser, $contracts) = @_;
    my $current_dept = '';
    while (my $tag = $parser->get_tag('p')) {
        my $text = get_text_in_p($parser);
        next unless defined $text;
        if (defined $tag->get_attr('align')) {
            $current_dept = $text;
            next;
        }
        my ($company) = ($text =~ /^(.+?), (?:is|was)/);
        my ($amount) = ($text =~ m{
            (
                \$
                $RE{num}{int}{-base => 10}{-sep => ','}
            )
        }x
        );
        my ($contract) = ($text =~ m{
            (
                [A-Z0-9]{6}
                [A-Z0-9/-]+
            )
        }x
        );
        push @{ $contracts->{$current_dept} }, {
            company => $company,
            amount => $amount,
            contract => $contract,
            # text => $text,
        };
    }
}

sub get_text_in_p {
    my ($parser) = @_;
    my $text = $parser->get_text('/p');
    return unless defined $text;
    $text =~ s/^[^A-Z]+//;
    $text =~ s/\s+\z//;
    return $text;
}
Output for No. 1001-11:
---
DEFENSE LOGISTICS AGENCY:
  - amount: '$49,418,113'
    company: 'Physio-Control, Inc., Redmond, Wash.'
    contract: SPM200-07-D-8261/P00005
  - amount: '$43,246,524'
    company: 'Johnson & Johnson Healthcare Systems, on behalf of Ortho-McNell-Janssen Pharmaceuticals, Inc., Piscataway, N.J.'
    contract: SPM2D0-12-D-0001
  - amount: '$15,240,054'
    company: 'Patterson Dental Supply, Inc., Minn.'
    contract: SPM2DE-10-D-7447/P00005
NAVY:
  - amount: '$60,360,995'
    company: 'Raytheon Co., Integrated Defense Systems, San Diego, Calif.'
    contract: N00024-11-C-2404
  - amount: '$33,693,891'
    company: 'Wyle Laboratories, Inc., Huntsville, Ala.'
    contract: N00421-03-D-0015
  - amount: '$30,071,729'
    company: 'Deloitte Consulting, L.L.P., Lexington Park, Md.'
    contract: N00421-03-D-0014
  - amount: '$22,151,900'
    company: 'Raytheon Co., Tucson, Ariz.'
    contract: N00024-08-C-5401
  - amount: '$18,508,325'
    company: 'Canadian Commercial Corp., General Dynamics Land Systems - Canada, Ontario, Canada'
    contract: M67854-07-D-5028
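Since the goal in the question is a CSV that R can read, the nested structure dumped above still has to be flattened into one row per contract. The following is only a rough sketch of how that might look, not part of the original answer. The sub names `csv_field`, `amount_to_number` and `contracts_to_csv` are made up for illustration, and the hand-rolled quoting is an assumption chosen to keep the example free of dependencies; a dedicated module such as Text::CSV from CPAN would likely be the more robust choice.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: quote one CSV field. The value is wrapped in double
# quotes and embedded quotes are doubled, so company names that contain
# commas stay in a single column.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: turn '$49,418,113' into 49418113 (plain digits).
sub amount_to_number {
    my ($amount) = @_;
    return '' unless defined $amount;
    (my $digits = $amount) =~ s/[\$,]//g;
    return $digits;
}

# Flatten { dept => [ { company, amount, contract }, ... ] } into a list of
# CSV lines: a header first, then one line per contract, sorted by department.
sub contracts_to_csv {
    my ($contracts) = @_;
    my @lines = (join ',', map { csv_field($_) } qw(department company amount contract));
    for my $dept (sort keys %$contracts) {
        for my $rec (@{ $contracts->{$dept} }) {
            push @lines, join ',', map { csv_field($_) } (
                $dept,
                $rec->{company},
                amount_to_number($rec->{amount}),
                $rec->{contract},
            );
        }
    }
    return @lines;
}

# Two records copied from the output above, in the same shape as %contracts.
my %contracts = (
    'NAVY' => [
        {
            company  => 'Raytheon Co., Tucson, Ariz.',
            amount   => '$22,151,900',
            contract => 'N00024-08-C-5401',
        },
    ],
    'DEFENSE LOGISTICS AGENCY' => [
        {
            company  => 'Physio-Control, Inc., Redmond, Wash.',
            amount   => '$49,418,113',
            contract => 'SPM200-07-D-8261/P00005',
        },
    ],
);

print "$_\n" for contracts_to_csv(\%contracts);
```

In the script above, calling `contracts_to_csv(\%contracts)` in place of `print Dump \%contracts;` and redirecting the printed lines to a file should give something that `read.csv` in R can load. Records where a regex failed to match end up with empty fields rather than being dropped.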
Regarding "perl - How can I extract US Department of Defense contract information for statistical analysis?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8407785/