javascript - Fetching a large file with Node.js for processing

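I have a Node.js application that needs to fetch this 6GB zip file from Census.gov and then process its contents. However, when I fetch the file with the Node.js https API, the download stops at different file sizes. Sometimes it fails at 2GB, sometimes at 1.8GB, and so on. I have never been able to download the whole file from the application, yet it downloads completely in a browser. Is there a way to download the full file? I can't start processing the zip until it has fully downloaded, so my processing code waits for the download to finish before it runs.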

const file = fs.createWriteStream(fileName);
http.get(url).on("response", function (res) {
      let downloaded = 0;
      res
        .on("data", function (chunk) {
          file.write(chunk);
          downloaded += chunk.length;
          process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
        })
        .on("end", async function () {
          file.end();
          console.log(`${fileName} downloaded successfully.`);
        });
    });

Best answer

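You have no flow control on file.write(chunk). You need to pay attention to the return value of file.write(chunk): when it returns false, you have to wait for the drain event before writing more. Otherwise you can overflow the buffer on the write stream, particularly when writing large amounts of data to a slow medium such as a disk.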

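Without flow control, when you try to write large amounts of data faster than the disk can keep up, memory usage can balloon, because the stream has to accumulate more data in its buffer than is desirable.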

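Since your data is coming from a readable stream, when file.write(chunk) returns false you also have to pause the incoming read stream, so that it doesn't keep sending you data events while you wait for the drain event on the write stream. When you get the drain event, you can resume the read stream.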
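The write()/drain handshake described above can also be sketched with async iteration in place of explicit pause() and resume() calls. This is only an illustration, not the answer's code: copyWithBackpressure is a hypothetical helper name, and it assumes a Node.js version where readable streams support for await.

```javascript
const { once } = require("events");

// Hypothetical helper: copies every chunk from `readable` to `writable`,
// waiting for 'drain' whenever write() reports that the buffer is full.
async function copyWithBackpressure(readable, writable) {
    let total = 0;
    // for await only pulls the next chunk once the loop body has finished,
    // so the source is not read while we are waiting for 'drain'.
    for await (const chunk of readable) {
        total += chunk.length;
        if (!writable.write(chunk)) {
            await once(writable, "drain");
        }
    }
    writable.end();
    // 'finish' fires once everything has been handed to the underlying sink.
    await once(writable, "finish");
    return total;
}
```

With an HTTP download, `readable` would be the response object and `writable` the stream returned by fs.createWriteStream(). Because events.once() rejects if the stream emits an error, a failed write surfaces as a rejected promise.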

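FYI, if you don't need the progress information, you can let pipeline() do all the work for you (including the flow control). You don't have to write that code yourself. You may even be able to collect progress information while using pipeline() just by watching the write stream's events.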
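As a rough sketch of the pipeline() approach, the helper below lets pipeline() handle the flow control and counts bytes in a small pass-through Transform stage. Note that this swaps in a Transform for progress reporting instead of watching write stream events; saveStream and onProgress are hypothetical names, and the promise-based pipeline from stream/promises is assumed to be available (Node.js 15 or later).

```javascript
const fs = require("fs");
const { Transform } = require("stream");
const { pipeline } = require("stream/promises");

// Hypothetical helper: copies any readable stream to destPath and reports
// the running byte count through the optional onProgress callback.
async function saveStream(source, destPath, onProgress = () => {}) {
    let downloaded = 0;
    // Pass-through stage that only counts bytes; it forwards each chunk unchanged.
    const counter = new Transform({
        transform(chunk, encoding, callback) {
            downloaded += chunk.length;
            onProgress(downloaded);
            callback(null, chunk);
        },
    });
    // pipeline() applies backpressure between stages and rejects
    // if any stage emits an error.
    await pipeline(source, counter, fs.createWriteStream(destPath));
    return downloaded;
}
```

For a download, the response object passed to the "response" handler would be the source, for example saveStream(res, fileName, n => process.stdout.write(`${n} bytes\r`)).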

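Here is one way to implement the flow control yourself, though I'd recommend using the pipeline() function from the stream module and letting it do all of this for you if you can: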

const file = fs.createWriteStream(fileName);
file.on("error", err => console.log(err));
http.get(url).on("response", function(res) {
    let downloaded = 0;
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        process.stdout.write(`Downloaded ${(downloaded / 1000000).toFixed(2)} MB of ${fileName}\r`);
    }).on("end", function() {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => console.log(err));
});

There also appears to be a timeout problem with the http request. When I added this:

// set client timeout to 24 hours
res.setTimeout(24 * 60 * 60 * 1000);

I was then able to download the whole 7GB ZIP file.

Here is the turnkey code that worked for me:

const fs = require('fs');
const https = require('https');
const url =
    "https://www2.census.gov/programs-surveys/acs/summary_file/2020/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.zip";
const fileName = "census-data2.zip";

const file = fs.createWriteStream(fileName);
file.on("error", err => {
    console.log(err);
});
const options = {
    headers: {
        "accept-encoding": "gzip, deflate, br",
    }
};
https.get(url, options).on("response", function(res) {
    const startTime = Date.now();

    function elapsed() {
        const delta = Date.now() - startTime;
        // convert to minutes
        const mins = (delta / (1000 * 60));
        return mins;
    }

    let downloaded = 0;
    console.log(res.headers);
    const contentLength = +res.headers["content-length"];
    console.log(`Expecting download length of ${(contentLength / (1024 * 1024)).toFixed(2)} MB`);
    // set timeout to 24 hours
    res.setTimeout(24 * 60 * 60 * 1000);
    res.on("data", function(chunk) {
        let readyForMore = file.write(chunk);
        if (!readyForMore) {
            // pause readstream until drain event comes
            res.pause();
            file.once('drain', () => {
                res.resume();
            });
        }
        downloaded += chunk.length;
        const downloadPortion = downloaded / contentLength;
        const percent = downloadPortion * 100;
        const elapsedMins = elapsed();
        const totalEstimateMins = (1 / downloadPortion) * elapsedMins;
        const remainingMins = totalEstimateMins - elapsedMins;

        process.stdout.write(
            `  ${elapsedMins.toFixed(2)} mins, ${percent.toFixed(1)}% complete, ${Math.ceil(remainingMins)} mins remaining, downloaded ${(downloaded / (1024 * 1024)).toFixed(2)} MB of ${fileName}                                 \r`
        );
    }).on("end", function() {
        file.end();
        console.log(`${fileName} downloaded successfully.`);
    }).on("error", err => {
        console.log(err);
    }).on("timeout", () => {
        console.log("got timeout event");
    });
});
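Because the original problem was a download that stopped early, it may be worth comparing the number of bytes received against the content-length header before reporting success. The function below is a minimal sketch of such a check; checkDownloadSize is a hypothetical helper that is not part of the code above, and it assumes the server sends a content-length header (without one, completeness cannot be verified this way).

```javascript
// Hypothetical helper: compares the bytes received with the content-length header.
// Returns status "complete", "incomplete", or "unknown" (header missing or invalid).
function checkDownloadSize(contentLengthHeader, bytesReceived) {
    const expected = Number(contentLengthHeader);
    if (contentLengthHeader === undefined || !Number.isFinite(expected)) {
        return { status: "unknown", expected: null, missing: null };
    }
    const missing = expected - bytesReceived;
    return {
        status: missing === 0 ? "complete" : "incomplete",
        expected,
        missing,
    };
}
```

In the "end" handler this could be called as checkDownloadSize(res.headers["content-length"], downloaded), printing the success message only when the status is "complete". The response object's complete property is another signal that the whole message arrived.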

Regarding javascript - Fetching a large file with Node.js for processing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73345878/
