multithreading - Downloading files from URLs using multiple threads

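I need to download a list of Excel files from URLs and save them in a folder (there can be up to 200 files). I started with the following code, which loops over the list and downloads each file: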

<cfloop query="idsToDownload">
    <cfset fileURL = "https://myLink/#downloadID#" /><!--- link to an xlsx file --->
    <cfexecute name="curl" arguments = "#fileURL# -k" timeout="10" outputFile="#downloadID#.xlsx" />
</cfloop>

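This downloads each file and saves it in the ColdFusion temp directory. (For now this is only for testing; eventually we will decide where the files should be stored and update outputFile with a path.) This works fine, except that it eventually hits the cfloop time limit (after downloading about 30 files). Either way, we do want to start a thread for each download to maximize efficiency. So I added a cfthread tag inside the loop (disclaimer: I am new to cfthread):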

<cfloop query="idsToDownload">
    <cfthread name="download_#downloadID#" action="run">
        <cfset fileURL = "https://myLink/#downloadID#" />
        <cfexecute name="curl" arguments = "#fileURL# -k" timeout="10" outputFile="#downloadID#.xlsx" />
    </cfthread>
</cfloop>

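I assumed this would behave the same as before, except that each download would run in an asynchronous thread. However, when I run it, nothing happens. I get no errors on the page, but no files show up in the ColdFusion temp directory (as they do with the simple unthreaded cfloop). What is wrong with this code?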

Edit:

I also tried a single thread with a single download, and that works fine:

<cfthread name="downloadFile" action="run">
    <cfset fileURL = "https://myLink/123" />
    <cfexecute name="curl" arguments = "#fileURL# -k" timeout="10" outputFile="123.xlsx" />
</cfthread>

So something seems to be going wrong with the cfloop/cfthread combination...

Best answer

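I think you probably need a different approach for downloading hundreds of files. One thread per file will only scale so far before the target server stops responding or blocks you (if it is not under your control). Also, if you use cURL you spawn a child process for every thread, which is very resource-intensive.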

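Instead, I would create a pool of threads and divide the work among them. Create N threads and give each one a list of files to download. Each thread works through its own list, and you can easily tune N to get the best trade-off between performance and resource usage.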

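A potential drawback of the approach above is that if one list of files downloads much faster than the others, its thread finishes early and the remaining work is done by fewer threads. You could implement a work tracker that each thread calls to pick up the next file to download. As long as its getNextFile() method is suitably synchronized, it will keep all N threads busy until there is no more work to do.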
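A minimal sketch of such a work tracker follows, written as a tag-based component. The component name WorkTracker, the init() method, the UUID-based lock name, and the use of an empty string to signal "no work left" are all assumptions made for illustration; only getNextFile() is named in the answer.

```cfml
<!--- WorkTracker.cfc (hypothetical component name) --->
<cfcomponent output="false">

    <cffunction name="init" access="public" returntype="any" output="false">
        <cfargument name="files" type="array" required="true">
        <cfset variables.files = arguments.files>
        <cfset variables.nextIndex = 1>
        <!--- Unique lock name so separate trackers do not block each other --->
        <cfset variables.lockName = "WorkTracker_" & CreateUUID()>
        <cfreturn this>
    </cffunction>

    <!--- Returns the next file to download, or an empty string when none are left --->
    <cffunction name="getNextFile" access="public" returntype="string" output="false">
        <cfset var nextFile = "">
        <!--- Exclusive lock: only one thread at a time may read and advance the index --->
        <cflock name="#variables.lockName#" type="exclusive" timeout="10">
            <cfif variables.nextIndex LTE ArrayLen(variables.files)>
                <cfset nextFile = variables.files[variables.nextIndex]>
                <cfset variables.nextIndex = variables.nextIndex + 1>
            </cfif>
        </cflock>
        <cfreturn nextFile>
    </cffunction>

</cfcomponent>
```

It might be used along these lines. The URLs, the worker count, and the request.tracker variable are placeholders. The tracker is reached through a shared scope rather than passed as a thread attribute, because attributes are copied when they are passed to a thread, so each worker would otherwise get its own private tracker.

```cfml
<!--- Placeholder inputs: three hypothetical file URLs shared by two workers --->
<cfset fileUrls = ["https://myLink/101", "https://myLink/102", "https://myLink/103"]>
<cfset workerCount = 2>
<cfset request.tracker = CreateObject("component", "WorkTracker").init(fileUrls)>

<cfloop from="1" to="#workerCount#" index="i">
    <cfthread action="run" name="dl-worker-#i#" threadNumber="#i#">
        <!--- Unscoped variables set inside the thread body are thread-local --->
        <cfset fileCount = 0>
        <cfset nextUrl = request.tracker.getNextFile()>
        <cfloop condition="Len(nextUrl)">
            <cfset fileCount = fileCount + 1>
            <cfhttp url="#nextUrl#" method="get" path="#GetDirectoryFromPath(GetTemplatePath())#" file="#attributes.threadNumber#_#fileCount#.html" />
            <cfset nextUrl = request.tracker.getNextFile()>
        </cfloop>
    </cfthread>
</cfloop>
```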

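Also, if the downloads are as simple as the ones in your example, consider not using cURL at all. CFHTTP or one of the Java HTTP client libraries would save you from spawning a process for every download.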

Edit
Regarding getting your existing code to run, I was able to build an equivalent example that seems to execute fine (CF10/OSX):

Thread test...<br/>
<cfloop from="1" to="3" index="i">
    <cfoutput>Starting #i# <br/></cfoutput><cfflush>
    <cfthread action="run" name="dl-thread-#i#" urlNumber="#i#">
        <cflog log="Application" text="#urlNumber#">
        <cfexecute name="/opt/local/bin/curl" arguments="https://www.google.co.uk/?q=#urlNumber#" outputfile="#GetTemplatePath()##urlNumber#.html" errorFile="#GetTemplatePath()##urlNumber#.html.err">
        <!--- alternatively....
        <cfhttp url="https://www.google.co.uk/?q=#urlNumber#" file="#urlNumber#.html" path="#GetDirectoryFromPath(GetTemplatePath())#" method="get" />
        --->
        </cfexecute>
    </cfthread>
</cfloop>
Done....

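The only real difference I can see is that I am passing the parameter to the thread explicitly and letting the thread code use it to assemble the URL (see the urlNumber attribute). Before I did that, I saw very strange results: the values written to my files were 2-4 rather than 1-3. My guess is that the thread bodies were reading the shared loop variable after the loop had already moved on.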

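I would make sure that any data the thread needs is passed in explicitly. Also, the documentation for cfexecute states that the name attribute must be an absolute path (including the extension), yet your code apparently works without one?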
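Applied to the loop from the question, that advice might look roughly like the sketch below. It is untested against the asker's setup. It assumes that the missing files are caused by the query column downloadID not being reliably available inside the thread body, which the question does not confirm. The threadNames list, the join step, and the 60-second join timeout are additions for illustration; joining and inspecting the cfthread scope is one way to see errors that a thread would otherwise swallow silently.

```cfml
<cfset threadNames = "">
<cfloop query="idsToDownload">
    <cfset threadName = "download_#idsToDownload.downloadID#">
    <cfset threadNames = ListAppend(threadNames, threadName)>
    <!--- Pass the current row's ID to the thread as an attribute --->
    <cfthread name="#threadName#" action="run" downloadID="#idsToDownload.downloadID#">
        <cfset fileURL = "https://myLink/#attributes.downloadID#" />
        <cfexecute name="curl" arguments="#fileURL# -k" timeout="10" outputFile="#attributes.downloadID#.xlsx" />
    </cfthread>
</cfloop>

<!--- Wait for all threads (timeout is in milliseconds), then report status and any errors --->
<cfthread action="join" name="#threadNames#" timeout="60000" />
<cfloop list="#threadNames#" index="threadName">
    <cfoutput>#threadName#: #cfthread[threadName].status#<br/></cfoutput>
    <cfif StructKeyExists(cfthread[threadName], "error")>
        <cfdump var="#cfthread[threadName].error#">
    </cfif>
</cfloop>
```

The cfexecute call is kept from the question so that the only change to the download itself is how downloadID reaches the thread.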

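I have included a commented-out example that uses <cfhttp> to do the same thing as curl. Launching any external process hundreds of times almost certainly will not scale. Modifying the example above to split up a list and have each thread work through its share should be straightforward.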

Edit 2
The following snippet distributes the workload across a configurable number of threads:

Thread test...<br/>
<cfscript>
    urlCount=100;
    threads=5;
    urls=[];

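    //utility function to split an array into a set of roughly equal arrays
    function ArrayDivide(arr,divisor){
        //var-scope the locals so they do not leak into the page's Variables scope
        var divided=[];
        var i=0;
        for(i=1;i<=divisor;i++){
            divided[i]=[];
        }
        for(i=1;i<=ArrayLen(arr);i++){
            //round-robin: element i goes into bucket (i mod divisor)+1
            ArrayAppend(divided[(i%divisor)+1],arr[i]);
            WriteOutput((i%divisor)+1 & "<br>");
        }
        return divided;
    }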

    //Create a set of dummy URLs to test against
    //sleep.cfm waits for as long as it's asked to in order to simulate downloads taking a bit of time
    for(i=1;i<=urlCount;i++){
        urls[i]="http://localhost:8500/sleep.cfm?duration="&(i*50);
    }

    urlLists=ArrayDivide(urls,threads);

</cfscript>

<cfloop from="1" to="#threads#" index="i">  
    <cfoutput>Starting #i# <br/></cfoutput><cfflush>
    <cflog log="Application" text="#i# spawn">
    <cfthread action="run" name="dl-thread-#i#" urlList="#urlLists[i]#" threadNumber="#i#">
        <cflog log="Application" text="#threadNumber# start">
        <cfloop from="1" to="#ArrayLen(urlList)#" index="j">
            <cfhttp url="#urlList[j]#" file="#threadNumber#_#j#.html" path="#GetDirectoryFromPath(GetTemplatePath())#" method="get" />
        </cfloop>
        <cflog log="Application" text="#threadNumber# end">
    </cfthread>

</cfloop>
Done....

This question about downloading files from URLs using multiple threads comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/25886338/
