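java - How to efficiently write the contents of a for loop (700,000 rows) to a file from Java?

I wrote the following code to fetch results as XML responses and write parts of them to a file from Java. It does this by receiving the XML responses to roughly 700,000 queries against a public database.

However, before the code can write to the file, it is stopped by exceptions from the server that occur at unpredictable points in the run. I first tried writing to the file from inside the for loop itself, but that did not work. I then tried storing the relevant block of each response in a Java HashMap and writing the HashMap to the file in a single call. But before the loop has received all the responses and stored them in the HashMap, it stops with an exception (at around the 15,000th iteration, for example). Is there another efficient way to write to a file from Java when this many iterations are needed to fetch the data?

The local file I use for this code is here.

My code is: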

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringWriter;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.XML;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;


public class random {

    static FileWriter fileWriter;
    static PrintWriter writer;

    public static void main(String[] args) {

        // Hashmap to store the MeSH values for each PMID 
        Map<String, String> universalMeSHMap = new HashMap<String, String>();

        try {

            // FileWriter for MeSH terms
            fileWriter = new FileWriter("/home/user/eclipse-workspace/pmidtomeshConverter/src/main/resources/outputFiles/pmidMESH.txt", true);
            writer = new PrintWriter(fileWriter);

            // Read the PMIDS from this file 
            String filePath = "file_attached_to_Post.txt";
            String line = null;
            BufferedReader bufferedReader = new BufferedReader(new FileReader(filePath));


            String[] pmidsAll = null;

            int x = 0;
            try {
                // Read only the first line, which holds the comma-separated PMIDs
                while(((line = bufferedReader.readLine()) != null) && x < 1) {
                    pmidsAll = line.split(",");
                    x++;
                }   
            }
            finally {   
                bufferedReader.close();         
            }

            // List of strings containing the PMIDs
            List<String> pmidList = Arrays.asList(pmidsAll);

            // Iterate through the list of PMIDs to fetch the XML files from PubMed using eUtilities API service from PubMed
            for (int i = 0; i < pmidList.size(); i++) {


                String baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id=";

                // Process to get the PMIDs
                String indPMID_p0 = pmidList.get(i).toString().replace("[", "");
                String indPMID_p1 = indPMID_p0.replace("]", "");
                String indPMID_p2 = indPMID_p1.replace("\\", "");
                String indPMID_p3 = indPMID_p2.replace("\"", "");

                // Fetch XML response from the eUtilities into a document object 
                Document doc = parseXML(new URL(baseURL + indPMID_p3));

                // Convert the retrieved XMl into a Java String 
                String xmlString = xml2String(doc); // Converts xml from doc into a string

                // Convert the Java String into a JSON Object
                JSONObject jsonWithMeSH = XML.toJSONObject(xmlString);  // Converts the xml-string into JSON

                // -------------------------------------------------------------------
                // Getting the MeSH terms from a JSON Object
                // -------------------------------------------------------------------
                JSONObject ind_MeSH = jsonWithMeSH.getJSONObject("PubmedArticleSet").getJSONObject("PubmedArticle").getJSONObject("MedlineCitation");

                // List to store multiple MeSH types
                List<String> list_MeSH = new ArrayList<String>();
                if (ind_MeSH.has("MeshHeadingList")) {

                    for (int j = 0; j < ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").length(); j++) {
                        list_MeSH.add(ind_MeSH.getJSONObject("MeshHeadingList").getJSONArray("MeshHeading").getJSONObject(j).getJSONObject("DescriptorName").get("content").toString());
                    }
                } else {

                    list_MeSH.add("null");

                }

                universalMeSHMap.put(indPMID_p3, String.join("\t", list_MeSH));

                writer.write(indPMID_p3 + ":" + String.join("\t", list_MeSH) + "\n");



            System.out.println("Completed iteration for " + i + " PMID");

        }

        // Write to the file here
        for (Map.Entry<String,String> entry : universalMeSHMap.entrySet()) {

            writer.append(entry.getKey() + ":" +  entry.getValue() + "\n");

        }

        System.out.print("Completed writing the file");

    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserConfigurationException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (SAXException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TransformerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } finally {
        // Closing the PrintWriter flushes it and closes the underlying FileWriter
        if (writer != null) {
            writer.close();
        }
    }

}

private static String xml2String(Document doc) throws TransformerException {

    TransformerFactory transfac = TransformerFactory.newInstance();
    Transformer trans = transfac.newTransformer();
    trans.setOutputProperty(OutputKeys.METHOD, "xml");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    trans.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", Integer.toString(2));

    StringWriter sw = new StringWriter();
    StreamResult result = new StreamResult(sw);
    DOMSource source = new DOMSource(doc.getDocumentElement());

    trans.transform(source, result);
    String xmlString = sw.toString();
    return xmlString;

}

private static Document parseXML(URL url) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse((url).openStream());
    doc.getDocumentElement().normalize();
    return doc;
}

private static String readAll(Reader rd) throws IOException {
    StringBuilder sb = new StringBuilder();
    int cp;
    while ((cp = rd.read()) != -1) {
        sb.append((char) cp);
    }
    return sb.toString();
}

public static JSONObject readJsonFromUrl(String url) throws IOException, JSONException {
    InputStream is = new URL(url).openStream();
    try {
        BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
        String jsonText = readAll(rd);
        JSONObject json = new JSONObject(jsonText);
        return json;
    } finally {
        is.close();
    }
}

}

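This is what is printed on the console before the exception occurs.

Completed iteration for 0 PMID
Completed iteration for 1 PMID
Completed iteration for 2 PMID
Completed iteration for 3 PMID
Completed iteration for 4 PMID
Completed iteration for 5 PMID
It keeps writing until the exception given below occurs...

So, at some random point in the loop, I get the exception below.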

java.io.FileNotFoundException: https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1890)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:263)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1304)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1270)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:264)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1161)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1045)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:959)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at pmidtomeshConverter.Convert2MeSH.parseXML(Convert2MeSH.java:240)
    at pmidtomeshConverter.Convert2MeSH.main(Convert2MeSH.java:121)

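Best answer

You want the parser to ignore the DTD when it parses the responses. The stack trace shows the parser trying to download https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd while parsing a response, and it is that download which fails.

Set this feature on the DocumentBuilderFactory before creating the DocumentBuilder: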

dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
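
As a rough illustration of where that call could go, here is a minimal sketch of a parseXML-style helper with the feature switched off. The class name DtdFreeParser, the InputStream parameter, and the sample document with its made-up DTD URL are assumptions for the demo, not part of the original code; the feature name is the one given above and is understood by Xerces-based parsers such as the one bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class; the name is illustrative only.
final class DtdFreeParser {

    // Feature name taken from the answer; recognised by Xerces-based parsers.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Document parse(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Must be set before newDocumentBuilder() so the builder inherits it.
        dbf.setFeature(LOAD_EXTERNAL_DTD, false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(in);
        doc.getDocumentElement().normalize();
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at a made-up URL that cannot be resolved;
        // with the feature off, the parser should never try to fetch it.
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE PubmedArticleSet SYSTEM \"https://dtd.example.invalid/missing.dtd\">\n"
                + "<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>";
        Document doc = parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

In the question's code the same setFeature call would sit in parseXML, right after DocumentBuilderFactory.newInstance().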

See the Xerces documentation for other features.
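
Separately from the DTD fix, the loop in the question sits inside a single try block, so one failed request ends the whole run. A possible way to make a long run more tolerant is to catch failures per item and write each line as soon as it is available. The sketch below is only that, a sketch: IncrementalWriter, the Fetcher interface, and the flush interval of 1000 are hypothetical names and values standing in for the fetch-and-extract step, not anything from the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names are illustrative only.
final class IncrementalWriter {

    /** Stand-in for "fetch the XML for one ID and extract its terms". */
    interface Fetcher {
        String fetch(String id) throws Exception;
    }

    /** Appends one "id:value" line per successful fetch; returns the IDs that failed. */
    static List<String> writeAll(List<String> ids, Fetcher fetcher, Path out) throws IOException {
        List<String> failed = new ArrayList<>();
        int written = 0;
        // try-with-resources flushes and closes the writer even if the loop throws.
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String id : ids) {
                String value;
                try {
                    value = fetcher.fetch(id);
                } catch (Exception e) {
                    // Record the failure and move on instead of aborting the run.
                    failed.add(id);
                    continue;
                }
                w.write(id + ":" + value);
                w.newLine();
                // Flush now and then so a crash loses at most a small batch of lines.
                if (++written % 1000 == 0) {
                    w.flush();
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("pmidMESH", ".txt");
        // Demo fetcher: fails for one ID to show that the loop keeps going.
        Fetcher demo = id -> {
            if (id.equals("2")) {
                throw new IOException("simulated server error");
            }
            return "term-" + id;
        };
        List<String> failed = writeAll(Arrays.asList("1", "2", "3"), demo, out);
        System.out.println("Lines written: " + Files.readAllLines(out));
        System.out.println("Failed IDs: " + failed);
    }
}
```

The returned list of failed IDs can be fed back into a second pass, which avoids restarting all 700,000 requests after a transient error.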

This question, "java - How to efficiently write the contents of a for loop (700,000 rows) to a file from Java?", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/53836473/
