sql - 分析大图 - 检索簇并计算最强路径

标签 sql graph-theory social-networking

我尝试使用广度优先算法来检索所选用户所属的整个连接集群,方法如下:

CREATE PROCEDURE Breadth_First (@StartNode varchar(50), @LinkStrength decimal(10,7) = 0.1, @EndNode varchar(50) = NULL)
    AS
    BEGIN
        -- Automatically rollback the transaction if something goes wrong.   
        SET XACT_ABORT ON   
        BEGIN TRAN

        -- Increase performance and do not intefere with the results.
        SET NOCOUNT ON;

        -- Create a temporary table for storing the discovered nodes as the algorithm runs
        CREATE TABLE #Discovered
        (
             DiscoveredUser varchar(50) NOT NULL,    -- The Node Id
            Predecessor varchar(50) NULL,    -- The node we came from to get to this node.
            LinkStrength decimal(10,7) NULL, -- positive - from predecessor to  DiscoveredUser, negative - from  DiscoveredUser to predecessor
            OrderDiscovered int -- The order in which the nodes were discovered.
        )

        -- Initially, only the start node is discovered.
        INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
        VALUES (@StartNode, NULL, NULL, 0)

        -- Add all nodes that we can get to from the current set of nodes,
        -- that are not already discovered. Run until no more nodes are discovered.
        WHILE @@ROWCOUNT > 0
        BEGIN
            -- If we have found the node we were looking for, abort now.
            IF @EndNode IS NOT NULL
                IF EXISTS (SELECT TOP 1 1 FROM #Discovered WHERE  DiscoveredUser = @EndNode)
                    BREAK   

            -- We need to group by ToNode and select one FromNode since multiple
            -- edges could lead us to new same node, and we only want to insert it once.
            INSERT INTO #Discovered ( DiscoveredUser, Predecessor, LinkStrength, OrderDiscovered)
            (SELECT mc.called_party, mc.calling_party, mc.link_strength, d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.calling_party
            WHERE mc.called_party NOT IN (SELECT  DiscoveredUser From #Discovered) AND mc.link_strength > @LinkStrength
            UNION
            SELECT mc.calling_party, mc.called_party, mc.link_strength * (-1), d.OrderDiscovered + 1
            FROM #Discovered d JOIN monthly_connections mc ON d. DiscoveredUser = mc.called_party
            WHERE mc.calling_party NOT IN (SELECT  DiscoveredUser FROM #Discovered) AND mc.link_strength > @LinkStrength
            )
        END;

        -- Select the results. We use a recursive common table expression to
        -- get the full path from the start node to the current node.
        WITH BacktraceCTE(Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path)
        AS
        (
            -- Anchor/base member of the recursion, this selects the start node.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered, 
                CAST(n. DiscoveredUser AS varchar(MAX))
            FROM #Discovered d JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
            WHERE d. DiscoveredUser = @StartNode

            UNION ALL

            -- Recursive member, select all the nodes which have the previous
            -- one as their predecessor. Concat the paths together.
            SELECT d.Predecessor, n. DiscoveredUser, d.LinkStrength, d.OrderDiscovered,
                CAST(cte.Path + ',' + CAST(n. DiscoveredUser as varchar(30)) as varchar(MAX))
            FROM #Discovered d JOIN BacktraceCTE cte ON d.Predecessor = cte. DiscoveredUser
            JOIN users n ON d. DiscoveredUser = n. DiscoveredUser
        )

        SELECT Predecessor,  DiscoveredUser, LinkStrength, OrderDiscovered, Path FROM BacktraceCTE
        WHERE  DiscoveredUser = @EndNode OR @EndNode IS NULL -- This kind of where clause can potentially produce
        ORDER BY OrderDiscovered                -- a bad execution plan, but I use it for simplicity here.

        DROP TABLE #Discovered
        COMMIT TRAN
        RETURN 0
    END

我当前正在分析的图表(社交网络)有 28M 连接,并且在不忽略弱连接(使用 @LinkStrength 设置阈值)的情况下,执行运行了很长时间(到目前为止,我还没有得到任何结果,并将尝试让它运行过夜)。

无论如何,下一步是计算两个用户(大约有 300 万用户)之间的最短(最强)链接。我正在考虑使用 Djikstra 算法,但不确定是否可以在我当前使用的 PC(Quad CPU 2.66 GHz,4GB RAM)上分析此类网络,并且数据存储在 MS SQL Server 2008 数据库中。

总而言之,我希望获得以下问题的一些答案/建议:

  1. 是否可以执行 与上面的查询一样复杂的查询 描述图(28M 连接,3M 用户)在所描述的机器上 (2.66 GHz,4GB RAM)?
  2. 如果不可能的话,有什么办法吗? 其他可能的方法 执行时间可以减少 (例如,创建带有部分内容的表 结果)?
  3. 您有其他推荐吗 检测簇的算法 并计算最短路径 所描述的图表?

谢谢!

最佳答案

首先,使用索引

其次,您需要减少内存需求。这意味着首先为 VARCHAR(50) 提供一个短别名,例如 4 字节而不是 50 字节的 int。这将使您的速度提高 10 倍。

declare @tmpPeople table(
  ixPerson int identity primary key,
  UserNodeID varchar(50),
  unique(UserNodeID, ix) -- this creates an index
)
Insert @tmpPeople(UserNodeID) select UserNodeID from NormalPeopleTable
declare @relationships table(
  ixParent int,
  ixChild int,
  unique(ixParent, ixChild),
  unique(ixChild, ixParent)
)
insert @relationships(ixParent, ixChild)
select distinct p.ixPerson, c.ixPerson
from NormalRelationshipsTable nr
inner join @tmpPeople p on p.UserNodeID = nr.ParentUserNodeID
inner join @tmpPeople c on c.UserNodeID = nr.ChildUserNodeID

-- OK now got a copy of all relationships, but it is a fraction of the size
-- because we are using integers for the keys.
-- if we need to we can insert the reverse relationships too.

您需要编写一个执行您想要的操作的查询,而不是“通用”的查询。

如果你想找到两个节点之间的最短距离,可以通过从两端同时搜索来减少搜索时间。

declare @p1 table(
ix int identity primary key,
ixParent int,
ixChild int,
nDeep int,
-- Need indexes
unique(ixParent, ixChild, nDeep, ix),
unique(ixChild, ixParent, nDeep, ix)
)
-- That's now indexed both ways. 
-- If you only need one, you can comment the other out.
-- define @p2 the same

insert @p1 (ixParent, ixChild, nDeep) select @ixUserFrom, @ixUserFrom, 0
insert @p2 ..... @ixUserTo, @ixUserTo, 0

-- Breadth first goes like this.
-- Just keep repeating it till you get as far as you want.
insert @p1 (ixParent, ixChild, nDeep)
select
p1.ixChild, r.ixChild, p1.nDeep+1
from @p1 p1 inner join @relationships r on r.ixParent = p1.ixChild
-- may want to exclude those already in the table
where not exists (
    select 1 from @p1 p1 
    where p1.ixParent = p.ixChild and p1.ixChild = r.ixChild
)

对于“从 Alice 到 Bob 的距离”,您并行执行两次搜索,并在 Alice 的搜索包含 Bob 的搜索中也包含的任何人时停止。这还会将您的时间缩短 n^2 倍,其中 n 是平均连接数。

如果深度变得太多,不要忘记停止。

关于sql - 分析大图 - 检索簇并计算最强路径,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/4124441/

相关文章:

java - 无法使用 executeQuery() 发出数据操作语句

MySQL 错误 1064 - 我瞎了吗?奇怪的错误

algorithm - 以最少的开销重命名目录的所有内容

graph-theory - 这个解决NP-Hard Vertex Cover的贪心算法叫什么名字

algorithm - 社交网络功能查找您可能知道的联系

twitter - Twitter 的速率限制是否允许我进行必要的数据挖掘以构建一个大约 60 万用户的完整社交网络图?

mysql - 如何将列命名为星期几

mysql - 如何过滤两列之间的一组数字?

r - 在 igraph(R 包)中反转有向图(转置图)中的边

database - 以适度可扩展的方式交付事件源项目