Hi everyone. I sometimes need to automate data-collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and plenty of online services for that).
Anyway, as a follow-up to my previous question, I have written a small web crawler that can traverse a website.
Basic crawler class to easily and quickly interact with one website.
Override doAction(String URL, String content) to process the content further (e.g. store it, parse it).
The design allows for multi-threaded crawlers: all class instances share the processed and queued lists of links.
Instead of keeping track of processed and queued links inside the object, a JDBC connection could be established to store the links in a database.
It is currently limited to one website at a time, but it could be expanded by adding an externalLinks stack and adding to it as appropriate.
JCrawler is intended for quickly generating XML sitemaps or parsing websites for the information you want. It's lightweight.
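To make the concept concrete, here is a minimal sketch of the idea described above (the names and structure are illustrative, not the actual pastebin code): every crawler instance shares thread-safe processed/queued collections, and subclasses override doAction to consume each page. The fetch method is stubbed out so the sketch stays self-contained; a real crawler would issue an HTTP request there.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

abstract class BasicCrawler {
    // Shared across all crawler instances, as the concept describes,
    // so several threads can drain the same queue safely.
    protected static final Set<String> processed = ConcurrentHashMap.newKeySet();
    protected static final Queue<String> queued = new ConcurrentLinkedQueue<>();

    // Subclasses decide what to do with each page (store it, parse it, ...).
    protected abstract void doAction(String url, String content);

    // Stubbed fetch; a real implementation would perform an HTTP GET.
    protected String fetch(String url) {
        return "<html>content of " + url + "</html>";
    }

    public void enqueue(String url) {
        if (!processed.contains(url) && !queued.contains(url)) {
            queued.add(url);
        }
    }

    // Drain the queue until empty; add() returning false means another
    // thread already claimed the URL, so we skip it.
    public void run() {
        String url;
        while ((url = queued.poll()) != null) {
            if (!processed.add(url)) {
                continue;
            }
            doAction(url, fetch(url));
        }
    }
}
```

A sitemap generator, for example, would override doAction to append a `<url><loc>...</loc></url>` entry per page; swapping the static collections for a JDBC-backed store would follow the same shape.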
Given the limitations above, is this a good/decent way to write a crawler? Any input would help a lot :)
http://pastebin.com/VtgC4qVE - Main.java
http://pastebin.com/gF4sLHEW - JCrawler.java
http://pastebin.com/VJ1grArt - HTMLUtils.java
Best answer
Your crawler does not seem to respect robots.txt in any way, and it uses a fake User-Agent string to pass itself off as a web browser. This could cause legal trouble down the road; please take it into account.
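To address both points, a crawler can check a site's robots.txt before fetching and identify itself honestly. Below is a hedged sketch that handles only `Disallow` lines under the wildcard `User-agent: *` section; a production crawler should use a full parser (e.g. the crawler-commons library), and the User-Agent string shown is just an example.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsCheck {
    // An honest, identifiable User-Agent (illustrative value).
    static final String USER_AGENT = "JCrawler/0.1 (+mailto:you@example.com)";

    // Returns true if `path` is allowed under the "User-agent: *" rules
    // of the given robots.txt body. Only Disallow lines are handled.
    static boolean isAllowed(String robotsTxt, String path) {
        boolean wildcardSection = false;
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                wildcardSection = line.substring(11).trim().equals("*");
            } else if (wildcardSection && line.toLowerCase().startsWith("disallow:")) {
                String rule = line.substring(9).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule); // empty Disallow means "allow everything"
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }
}
```

The crawler would fetch http://example.com/robots.txt once per host, cache the result, and call isAllowed before enqueueing any URL on that host.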
Regarding "java - Web crawler, feedback?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/2936068/