c++ - Email crawler in C++

Tags: c++ email

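I have an assignment I can't figure out. I want my function to take a line from an HTML file and extract an email address from it, then split that address into the full email, the user name, and the domain. After that I want a third function that gets the next email address in the HTML file.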

void get_line_emails(ifstream &in_stream, ofstream &out_stream, string email[], string users[], string domain[])
{
    int location, end;
    string mail;    
    getline(in_stream, mail);
    location = mail.find("mailto:");
    end = mail.find(">");
    mail = mail.substr(location, (end - 1));
    cout << mail << endl;
}

void get_next_email(ifstream &in_stream, string mail)
{
        getline(in_stream, mail);
        int location = mail.find("mailto:");
        int end = mail.find(">");
        mail = mail.substr(location, (end - 1));

}

void split_email(string email[], string domain[], string users)
{
    int count = 300;
    string mail;
    for (int i = 1; i < count; ++i) //For loop to input stream.
        {
            mail = email[i];
            int location = mail.find("@");
            int end = mail.find(">");
            string domain[i] = mail.substr(location, (end - 1));
            string users[i] = mail.substr(0, location);
        }
}

I also get this error when I run the program:

terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 4294967295) > this->size() (which is 244)
Abort (core dumped)

In case it helps, here is my main function:

int main()
{
    string email[1000];
    string users[1000];
    string domain[1000];
    int count = 300;
    string filename;

    ifstream in_stream;
    ofstream out_stream;
    cout << "Enter input filename: " << endl;
    cin >> filename; //Input of filename.
    in_stream.open(filename.c_str()); //Opening the input file for population and other information.
        if (in_stream.fail()) //Checking to see if file opens.
        {
            cout << "Error opening input/output files" << endl; //Telling user file isn't opening.
            exit(1); //Exiting program.
        }

    out_stream.open("Emails.txt");//If it does not exist it will not be created. If it exists it will be overwritten.
    out_stream << "Email " << right << setw(20) << "User " << right << setw(20) << "Domain" << endl;
    out_stream << "_______________________________________________________________________________" << endl;
    get_line_emails(in_stream, out_stream, email, users, domain);
    //split_email(email, domain, users);
    sort(email, users, domain, count);
    in_stream.close(); //Closing the in stream.
    out_stream.close(); //Closing the out stream.

    cout << "A new file Emails has been created with the emails extracted. Thank you." << endl; //End message.

    return 0;
}

Part of the HTML file I am using as input:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <!-- Content Copyright Ohio University Server ID: 2-->
<!-- Page generated 2016-03-22 14:55:21 by CommonSpot Build 9.0.3.119 (2015-08-14 15:00:01) -->
<!-- JavaScript & DHTML Code Copyright &copy; 1998-2015, PaperThin, Inc. All Rights Reserved. --> <head>
<meta name="Description" id="Description" content="Faculty" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="Keywords" id="Keywords" content="engineering" />
<meta name="Generator" id="Generator" content="CommonSpot Content Server Build 9.0.3.119" />
<link rel="stylesheet" href="/style/ouws_0111_allin1_nonav.css" type="text/css" />
<link rel="stylesheet" href="/engineering/upload/engineeringEV.css" type="text/css" />
<link rel="stylesheet" href="/engineering/upload/gridpak.css" type="text/css" />
<style type="text/css">
.mw { color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none; }
a.mw:link   {color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
a.mw:visited    {color:#000000;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
a.mw:hover  {color:#0000FF;font-family:Verdana,Arial,Helvetica;font-weight:bold;font-size:xx-small;text-decoration:none;}
</style> <script type="text/javascript">
<!--
var gMenuControlID = 0;
var menus_included = 0;
var jsDlgLoader = '/engineering/about/people/loader.cfm';
var jsSiteID = 1;
var jsSubSiteID = 6148;
var js_gvPageID = 2177477;
var jsPageID = 2177477;
var jsPageSetID = 0;
var jsPageType = 0;
var jsControlsWithRenderHandlers = ",1366057,1407941,1408984,1409120,1409220,1463564,1653027,1464282,1484855,1663987,1703445,1714178,1719109,1716274,1719109,1719109,1722161,1748941,1743237,1767756,1771704,1240950,1795856,1799077,1806233,1814378,1814378,1814378,36,1156323,958270,959997,36,1239784,1239535,1240103,1264495,1264559,1240832,1241026,1268776,1269019,1365662,1365798,1367666,1367112,1367146,1403322,1236239,1644435,1707482,36,1707482,1708185,1708185,1707846,1718301,1718356,1722082,1735273,1156092,1736675,1738340,1758445,1487747,1740183,1750814,1755341,36,4,1241075,1320447,1410344,1440455,1462605,1463564,1642797,1644920,1644955,1659254,1656252,1707459,1692320,1290294,1705469,1705596,1707846,1708163,1708367,1719109,1719109,1719109,1728460,1718356,1706218,1725200,1739433,1193755,1782561,1806244,1781609,1783821,1784445,1783821,1788664,1750814,1781533,1781788,1812661,1810778,1822088,1644219,39,36,36,438722,443887,523857,542895,36,867909,671210,733944,1074794,671213,671222,671225,671231,671234,1190981,1190914,1190943,1193755,1236239,1239497,1280404,1284325,860732,860741,1080236,671204,1237273,671216,671219,671228,671237,671207,1190973,1243855,1264544,1264564,1241172,1267910,1240840,1240849,1241220,1264699,1241365,1264571,1289737,8,1290184,1321465,1322500,1363024,1365670,1365954,1365998,1366014,2214456,2068897,1837521,1190931,1190931,2239453,1992371,1967400,1992371,1808005,1792195,1792195,1156323,1716646,1967400,1763595,1080236,1971121,1960374,1290151,2007514,2013290,2012663,2012302,2012026,2012663,2021773,1191128,426028,1808005,2108357,426028,36,36,36,2145522,2145522,2186158,1792195,1827509,1827486,1827486,1840641,1843869,1843869,1843879,1843879,1827509,1827486,635375,1190931,1853586,1854295,1854509,1854614,1855117,1855125,1859942,1232520,996841,999747,1074782,801933,1156092,1231112,1240950,1264518,1264536,1240828,1241280,1241033,1241322,1265043,1268750,1269805,1287352,1290231,1321501,1322534,1368599,1407796,1407917,1408156,1408447,1461409,1463586,1466072,1660460,17
04499,1701618,1704211,1701596,1707383,1706218,1713783,1713443,1715100,1716646,1714352,1723376,1706218,1717134,1717134,1759841,1740127,1740183,1737868,1755222,1763595,1750814,1812661,1784600,860732,1785700,1786558,1786640,1788366,1788803,1787835,1758851,1802116,1802116,1802116,1802116,1810778,1870892,1827509,1854528,1859942,1859942,1870780,1865837,1905202,1905202,1750814,1243855,1763595,1806295,1806280,860741,1893429,1893243,1893429,1898989,1913110,1915322,1921065,1871293,1872541,1900928,1708367,1874008,1827509,1808005,1948002,1708367,1859942,1827509,1243851,1959041,1243851,1746007,1243851,1243851,1967400,1967400,1191128,1780116,1960374,1960374,1780116,1827486,1156092,1153939,36,1827486,1859942,1974908,1156092,1156323,1763595,1080236,1763595,1854295,1854641,1865837,1867230,1867211,1869328,738180,8,1191128,1808005,1967400,1156323,2104541,2058309,2013290,2047047,2068897,2010928,2087246,2010928,2104541,2104541,2104578,2115265,1708185,2120941,426028,2129783,1663761,2166426,2068897,1967400,1967400,1967400,2068897,1808005,1716646,1833649,1827509,2010085,36,2167570,2068897,1706218,1156092,2012337,2186146,1191128,2191212,1190931,1156323,1716646,2012663,2508370,1992371,1080236,2280950,1808005,36,36,1156323,1808005,1819898,1191128,1243855,2281280,2013290,2239453,1837521,1156323,1644219,1849105,1849105,2376567,2381406,1808005,1808005,1156092,2552104,2552104,2281280,1805958,1967400,2068897,2390125,1808005,2444428,2459222,2013290,2568057,2508370,1661786,1763595,2349059,2349059,2438289,1708367,2120941,2508370,2120951,2596819,1156323,1191128,2239453,2367160,2012337,2451225,1808005,2615851,1808005,1849105,55,55,2734901,1191128,55,55,2012663,2734829,1967400,1967400,1996683,1992371,2013290,2018337,2012337,2018364,1156092,1363024,1967400,1888191,1888191,1805958,1967400,2057362,39,1153939,1708185,2010085,2010085,2010085,2079659,2079659,2010928,2010928,2087246,1808005,36,1190931,2369360,2380491,1808005,2120941,1153939,1708367,2511867,2540778,1704499,1787140,1758479,1716646,1827486,223945
3,1808005,1808005,1080236,2451225,2120941,1808005,";
var jsDefaultRenderHandlerProps = ",,";
var jsAuthorizedControls = ",65684,62081,62169,62236,62658,67860,70371,70560,70645,70911,71567,71570,71579,71582,71585,71588,71630,71645,73051,73055,73135,73175,73177,73179,73181,73183,73185,75593,75596,75598,75600,75602,75604,75943,77337,77339,77367,77369,77371,77397,77399,77401,77403,77406,77408,77423,77425,77429,77431,77433,77435,77454,77456,77458,77460,77462,77464,77524,77526,77528,77530,77533,77535,77564,77566,77569,77572,77579,77581,77755,77759,77771,77940,78254,78304,78759,81449,81447,81452,81454,86430,95027,110992,112176,114559,122476,122590,122592,122594,122998,123000,123002,123004,123010,123012,123014,123016,123113,123115,123117,123119,123121,123123,123125,123127,123129,123131,123133,123135,123137,123139,123141,123143,123193,123217,123219,123221,543,1784,1786,1791,1829,1901,1903,3434,3062,10165,17470,19113,17964,17975,20458,18450,19246,20461,20532,20535,20631,22975,22976,29043,29065,29198,29497,29894,32565,37812,42989,50270,50283,51427,51770,51940,51987,52309,52306,52325,52338,52440,52727,52935,53585,53717,54936,55739,56170,57624,70375,57659,58549,60274,60859,65324,65375,65378,65630,341266,341268,341270,343681,344120,344123,344125,344127,344129,344131,344133,1155418,344136,344142,344918,344920,346066,349254,349260,353078,353096,353249,353368,353500,353518,356036,356519,356527,356534,359303,359315,359619,365645,365647,365651,372637,372642,373892,409046,385136,402687,408565,416225,423380,423445,423634,423934,424407,424503,426545,425757,425785,426028,426263,433478,438722,440105,440778,441424,441447,441488,441530,441743,441914,441917,441920,441923,442181,442184,442228,442231,442767,443887,444519,444536,448085,446524,447856,448121,450241,450489,450583,451031,123223,123225,123227,123229,123231,123233,123235,123237,123239,123241,123243,123245,123247,123249,133712,138458,138462,138472,138493,140917,152719,152941,155012,174553,176272,182475,185313,185545,185572,185600,185653,189527,189717,189912,189915,209638,190014,209612,209640,210772,233752,233754,240835,242005,
245048,245061,246392,247905,253143,255217,258368,258370,258448,259352,259507,259535,259540,259557,259597,270079,272462,272484,273374,275946,276171,281359,281731,281886,285356,285362,285364,289279,290246,293573,293580,293990,306206,306372,307096,307117,1409047,1410292,1410344,1440455,1462692,1462605,1463206,1463358,1463363,1463559,1463575,1466067,1466072,1466949,565361,577664,577666,580782,580785,586106,593209,631308,631375,671204,671207,671210,671213,671216,671219,671222,671225,671228,630659,630928,631186,631230,671231,703507,703512,872630,872675,951724,1070639,1070773,1071579,1074782,1074794,1116648,1118602,1153954,1153962,310170,319781,325794,326607,326613,331241,331243,331248,338287,338305,338307,338805,340095,340098,341260,341264,523857,523883,540187,541324,542748,542895,543075,543442,543531,545031,545034,545925,550439,550694,551327,551342,551843,551848,554801,557468,563421,563522,564335,564350,564362,565392,565403,565430,565440,565460,578908,580751,589443,589691,589825,631522,631342,671234,704390,704500,730405,733189,733195,733931,733944,735045,721050,721061,720116,803640,807230,860741,867909,869754,878921,872399,911315,951437,952815,952921,954983,956036,958270,960899,960901,960903,960912,960914,960916,959997,990601,993320,996841,999438,999472,999741,999747,999871,1034551,1034553,1035679,1035681,1070829,1080236,1111202,1112587,1112594,1116088,1117180,566481,567951,635375,671237,705089,708277,738180,738270,738274,756640,808480,993241,993247,993326,998452,999162,1034549,1034793,1034795,1118837,1121340,1150407,1152064,1153928,1153933,1153939,1153948,1154637,1156092,1156320,753746,754822,754960,755002,755412,755426,755453,801854,801933,802037,802071,802077,802080,802083,802087,802091,802417,802525,804060,860732,753752,754885,753748,754422,802568,451785,453349,452911,452935,454345,454916,464533,465324,476013,469286,469308,470126,472222,476011,476015,489860,478066,482338,482852,492048,486517,489015,489681,492017,492050,492052,498151,516411,516413,516415,516417,516419
,516422,1935063,1939712,1992371,1996683,2010928,2012302,2012840,2013290,2021773,2047047,2058309,2079659,2104541,2108357,2115265,2120941,2120951,2135749,2145522,2157693,2157775,1193061
<a href="http://www.youtube.com/user/OhioUnivRussCollege"><img border="0" alt="YouTube" title="YouTube" src="/engineering/images/icon_youtube.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://www.linkedin.com/groups?home=&gid=3000035&trk=anet_ug_hm"><img border="0" alt="LinkedIn" title="LinkedIn" src="/engineering/images/icon_linkedin.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://www.facebook.com/ohio.engineering"><img border="0" alt="Facebook" title="Facebook" src="/engineering/images/icon_fb.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="https://twitter.com/russcollege"><img border="0" alt="Twitter" title="Twitter" src="/engineering/images/icon_twitter.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
<div class="imageImg">
<a href="http://instagram.com/russcollege"><img border="0" alt="Instagram" title="Instagram" src="/engineering/images/russ_instagram.png" /><span class="imageCaption" style="display:none;"></span></a>
</div>
</div></div></div><div id="cs_control_2398199" class="cs_control CS_Element_Custom"></div></div></div><div id="cs_control_2142700" class="contentWrap col row"><div  title="" id="CS_Element_2177477_2142700"><div id="cs_control_2142767" class="cs_control col pageTitle">
<!-- Portal Content -->
<div class="content-element">
<h2>Faculty</h2>
<p></p>
<br />
</div>
<!-- Portal Content -->
</div><div id="cs_control_2142762" class="mainContent col"><div  title="" id="CS_Element_2177477_2142762"><div id="cs_control_2142772" class="cs_control CS_Element_Custom">
<!-- Portal Content -->
<div class="content-element">
<p>  </p>
</div>
<!-- Portal Content -->
</div><div id="cs_control_2177314" class="cs_control">
<style type="text/css">
/* This fixes some issues with the anchor links from the A-Z bar at the top */
.group a[name]
{
position: absolute;
}
</style>
<div id="staffAlpha">
<ul class="azList">
<li class="children "><a href="#A">A</a></li>
<li class="children "><a href="#B">B</a></li>
<li class="children "><a href="#C">C</a></li>
<li class="children "><a href="#D">D</a></li>
<li class="children "><a href="#E">E</a></li>
<li class="children "><a href="#F">F</a></li>
<li class="children "><a href="#G">G</a></li>
<li class="children "><a href="#H">H</a></li>
<li class="children "><a href="#I">I</a></li>
<li class="children "><a href="#J">J</a></li>
<li class="children "><a href="#K">K</a></li>
<li class="children "><a href="#L">L</a></li>
<li class="children "><a href="#M">M</a></li>
<li class="children "><a href="#N">N</a></li>
<li class="children "><a href="#O">O</a></li>
<li class="children "><a href="#P">P</a></li>
<li>Q</li>
<li class="children "><a href="#R">R</a></li>
<li class="children "><a href="#S">S</a></li>
<li class="children "><a href="#T">T</a></li>
<li class="children "><a href="#U">U</a></li>
<li class="children "><a href="#V">V</a></li>
<li class="children "><a href="#W">W</a></li>
<li class="children "><a href="#X">X</a></li>
<li class="children "><a href="#Y">Y</a></li>
<li class="children last"><a href="#Z">Z</a></li>
</ul>
<div id="azContent">
<div class="group">
<a id="A" name="A"></a>
<h3 class="letter">A</h3>
<a href="profiles.cfm?profile=abukamai">Nasseef Abukamail</a><br />
Electrical Engineering and Computer Science <br />
Associate Lecturer <br />
<a href="mailto:abukamai@ohio.edu">abukamai@ohio.edu</a> <br />
740.593.1229 
<div><br />
</div><a href="profiles.cfm?profile=alam">Khairul Alam</a><br />
Mechanical Engineering, Center for Advanced Materials Processing, ESP Lab <br />
Professor <br />
<a href="mailto:alam@ohio.edu">alam@ohio.edu</a> <br />
740.593.1558 
<div><br />
</div><a href="profiles.cfm?profile=alim1">Muhammad Ali</a><br />
Biomedical Engineering, Mechanical Engineering, ESP Lab <br />
Associate Professor <br />
<a href="mailto:alim1@ohio.edu">alim1@ohio.edu</a> <br />
740.593.1389 
<div><br />
</div><a href="profiles.cfm?profile=arch">Deak Arch</a><br />
Aviation <br />
Associate Professor, Assistant Chair <br />
<a href="mailto:arch@ohio.edu">arch@ohio.edu</a> <br />
740.597.2688

Best answer

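Break the problem down into tasks. You have four tasks, and they should be handled separately. Don't move on to the next task until you are sure the current one does exactly what you want. Working on several tasks at once widens the scope of the problem, and that growth turns out to be more than geometric. Bugs tend to interact with other bugs: a bug in task 1 can make a bug in task 2 look different, so you end up debugging the wrong symptom.

Consider giving each task its own function or, if the task is complicated, its own file. That way each task can easily be tested on its own. Why? What if you change the code for task 1 and want to know whether you broke it? Sure, you can test the whole program, but what if you broke two things? And if you want to test the splitter logic against a few hundred addresses to make sure you handle all the weird edge cases, you can simply call the splitter function with those few hundred strings instead of having to invent a complicated file.

Task 1: Read the file line by line.

This comes first because you can't do anything else until you can do this.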

std::string line;
while (std::getline(in_stream, line))
{
    // output line to compare with source
}

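This reads the file until it can't read any more, whether that is because of the end of the file, corrupt data, some clown pulling out the USB drive while you were reading it, or any number of other problems. How do you test this? A simple way is to read the file from the stream line by line and print each line to the console. This is a fairly big file, though, and eyeballs are only good for comparing so much text, so write all of the received lines to an output file and then compare the files. If they match, you win; move on to task 2. If they don't, debug.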

Task 2: Look for "mailto".

This takes a line from task 1 and looks for "mailto":

size_t loc = line.find("mailto:");
if (loc != std::string::npos)
{
    std::cout << "found: " << line << std::endl;
}

This one is easier to test, so we can use the Mk 1 eyeball, or Notepad and Ctrl+F, to confirm that all of the mailto lines got printed.

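Task 3: Isolate the address.

You found the lines containing "mailto" in task 2. Now you have to isolate the address on that line. You have the starting position from task 2, so you can extract the string between the ":" after "mailto" and the next double quote ("). I'm not going to spend much time here because this is the meat and potatoes of the assignment. If I do too much here, I pass the class, not you. Basically it is a find and a substr, similar to what the OP has in their question.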
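For readers who only want to see the shape of that find and substr pair, here is one possible sketch. It assumes the address always sits between "mailto:" and the closing double quote of the href attribute, as it does in the sample HTML. The name extract_address and the empty-string return for "nothing found" are assumptions, not part of the assignment.

```cpp
#include <cstddef>
#include <string>

// Returns the text between "mailto:" and the next double quote,
// or an empty string if the line does not contain a complete link.
// `extract_address` is a hypothetical name used only for this sketch.
std::string extract_address(const std::string &line)
{
    const std::string marker = "mailto:";
    std::size_t start = line.find(marker);
    if (start == std::string::npos)
        return "";
    start += marker.size();                  // skip past "mailto:"
    std::size_t end = line.find('"', start); // closing quote of the href
    if (end == std::string::npos)
        return "";
    return line.substr(start, end - start);  // second argument is a length
}
```

The second argument to substr is a character count, not an end position. That is why it is end - start here, and not end - 1 as in the question's code.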

Task 4: Split the address from task 3.

This uses more find and substr to isolate the parts of the address.
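A minimal sketch of such a splitter, assuming the user name is everything before the first "@" and the domain is everything after it. EmailParts, split_address and the valid flag are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <string>

// The two halves of an address. `EmailParts` and `split_address`
// are hypothetical names used only for this sketch.
struct EmailParts
{
    std::string user;
    std::string domain;
    bool valid;
};

EmailParts split_address(const std::string &address)
{
    EmailParts parts{"", "", false};
    std::size_t at = address.find('@');
    // Reject a missing '@', an empty user name, or an empty domain.
    if (at == std::string::npos || at == 0 || at + 1 == address.size())
        return parts;
    parts.user = address.substr(0, at);    // everything before '@'
    parts.domain = address.substr(at + 1); // everything after '@'
    parts.valid = true;
    return parts;
}
```

Because the function takes a plain string, it can be called with a few hundred test addresses directly, with no input file, which is the kind of isolated testing described at the top of this answer.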

Regarding "c++ - Email crawler in C++", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36184009/
