c# - Inverted index in C# generic collections

(Apologies in advance if the title turns out to be a complete red herring.)

Background:

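I am building a real-time map of all the tweets in the world using the Twitter Streaming API and ASP.NET SignalR. I use the Tweetinvi C# Twitter library to receive the tweets and SignalR to push them asynchronously to the browser. Everything works as expected; see http://dev.wherelionsroam.co.uk to get an idea of it.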

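The next step in development is to parse the text of each tweet with the Stanford Natural Language Parsing library ( http://nlp.stanford.edu/software/corenlp.shtml ), specifically the Named Entity Recognizer (also known as CRFClassifier), so that I can extract meaningful metadata from each tweet (i.e. the persons, places and organisations mentioned). The desired outcome is that I can determine which persons, places and organisations many people are talking about (similar to the concept of "trending") and broadcast them to all clients using SignalR. I know the Twitter API has a GET trends method, but that wouldn't be any fun, would it?!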

Here are the main classes in my application:

TweetModel.cs (holds all the information about a tweet as broadcast from the Streaming API):

public class TweetModel
{
    public string User { get; set; }
    public string Text { get; set; }
    public DateTime CreatedAt { get; set; }
    public string ImageUrl { get; set; }
    public double Longitude { get; set; }
    public double Latitude { get; set; }
    public string ProfileUrl { get; set; }

    // This field is set later during Tokenization / Named Entity Recognition
    public List<NamedEntity> entities = new List<NamedEntity>();
}

The abstract NamedEntity class:

/// <summary>
/// Abstract modelling class for NER tagging, overridden by specific named entities.
/// All entity classes inherit from this single base class so they can share a polymorphic list.
/// </summary>
public abstract class NamedEntity
{
    protected string _name;
    public abstract string Name { get; set; }
}

The Person class, an example of a class that overrides the abstract NamedEntity class:

public class Person : NamedEntity
{
    public override string Name
    {
        get
        {
            return _name;
        }
        set
        {
            _name = value;
        }
    }
    public string entityType = "Person";
}

The TweetParser class:

public class TweetParser
{
    // Static list holding all tweets (and their entities); tweets older than 20 minutes are cleared out
    public static List<TweetModel> tweets = new List<TweetModel>();

    public TweetParser(TweetModel tweet)
    {
        ProcessTweet(tweet);
        // Removed all of the NER logic from this class
    }
}

Explanation of the Named Entity Recognizer:

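The NER library works by tagging the words in a sentence with labels, for example "PERSON" for "Luis Suarez" and "LOCATION" for "New York". This information is stored in a subclass of the NamedEntity class, depending on which type of tag the NER library assigns to the word (one of PERSON, LOCATION, ORGANISATION).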
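As a rough illustration of that tag-to-subclass mapping, here is a minimal sketch. The `Location` and `Organisation` classes, the `NamedEntityFactory` name and the exact tag strings are assumptions made for this example; check the tags your classifier model actually emits before relying on them.

```csharp
// Simplified stand-ins for the entity classes (assumed shape, not the original code).
public abstract class NamedEntity { public string Name { get; set; } }
public class Person : NamedEntity { }
public class Location : NamedEntity { }      // hypothetical subclass
public class Organisation : NamedEntity { }  // hypothetical subclass

// Hypothetical factory: picks the NamedEntity subclass that matches an NER tag.
public static class NamedEntityFactory
{
    // Returns null for any tag that is not modelled here.
    public static NamedEntity FromTag(string tag, string text)
    {
        switch (tag)
        {
            case "PERSON":
                return new Person { Name = text };
            case "LOCATION":
                return new Location { Name = text };
            case "ORGANISATION":   // spelling used in the question
            case "ORGANIZATION":   // alternative spelling, accepted defensively
                return new Organisation { Name = text };
            default:
                return null;
        }
    }
}
```

Returning null for unknown tags keeps the caller simple: it can skip anything that is not a person, location or organisation.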

Question:

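My problem is this: multiple variants of the term "Luis Suarez" may appear (e.g. Luis Suarez, Luis Suárez), and each will be defined in its own separate NamedEntity instance (inside a List<NamedEntity> instance, which is in turn inside a TweetModel instance). What is the best way to group together the matching instances of the term "Luis Suarez" across all tweets while still preserving the TweetModel > List<NamedEntity> parent-child relationship? I have been told that this is effectively an inverted index, but I'm not sure how well-informed that person was!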

Visualisation of the structure:

[Figure: visualisation of the structure]

I'm really sorry if this question is unclear; I can't find a more succinct way of putting it! For the full source so far, see https://github.com/adaam2/FinalUniProject

Best answer

1- Add a List<TweetModel> property to your NamedEntity.

public abstract List<TweetModel> Tweets { get; set; }

2- Make sure your tokenization function always returns the same NamedEntity object for the same label.

3- When you add a NamedEntity to the entities list, also add the TweetModel to the list on the NamedEntity.

Person p = /* the result of the tokenization */;
entities.Add(p);
p.Tweets.Add(this);
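Put together, the three steps might look something like the sketch below. This is only one possible shape, not the code from the question: the classes are simplified stand-ins, `EntityRegistry`, `GetOrAdd` and `Link` are hypothetical names introduced for illustration, and `Tweets` is initialised in the base class (rather than declared abstract) so that it is never null.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the classes in the question.
public class TweetModel
{
    public string Text { get; set; }
    public List<NamedEntity> entities = new List<NamedEntity>();
}

public abstract class NamedEntity
{
    public string Name { get; set; }
    // Step 1: back-reference from an entity to every tweet that mentions it.
    public List<TweetModel> Tweets { get; } = new List<TweetModel>();
}

public class Person : NamedEntity { }

// Hypothetical registry: one shared NamedEntity object per (type, label) pair.
public static class EntityRegistry
{
    private static readonly object gate = new object();
    private static readonly Dictionary<string, NamedEntity> index =
        new Dictionary<string, NamedEntity>(StringComparer.OrdinalIgnoreCase);

    // Step 2: the same label always yields the same object.
    public static T GetOrAdd<T>(string label) where T : NamedEntity, new()
    {
        // Include the type in the key so a Person and another entity type
        // with the same label never collide.
        string key = typeof(T).Name + ":" + label;
        lock (gate)
        {
            if (!index.TryGetValue(key, out NamedEntity existing))
            {
                existing = new T { Name = label };
                index[key] = existing;
            }
            return (T)existing;
        }
    }

    // Step 3: record the relationship in both directions.
    public static void Link(TweetModel tweet, NamedEntity entity)
    {
        lock (gate)
        {
            tweet.entities.Add(entity);
            entity.Tweets.Add(tweet);
        }
    }
}
```

With this in place, calling `EntityRegistry.GetOrAdd<Person>("Luis Suarez")` for two different tweets returns the same `Person`, and after `Link` is called for each tweet its `Tweets` list holds both. The dictionary is the inverted index: it maps a label to the entity, and the entity leads to its tweets. The lock is there because tweets arrive asynchronously; a ConcurrentDictionary would be another reasonable choice.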

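Basically, the only hard part is getting the function that produces the named entities to return the same object when it finds the text "Luis Suarez" and "Luis Suárez" in different tweets.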
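One way to approach that hard part, at least for variants that differ only in accents or letter case, is to look entities up by a normalized key instead of the raw text. The sketch below is an assumption about how that could be done, not something from the original code; `EntityKey` and `Fold` are hypothetical names. It will not help with nicknames, misspellings or abbreviations.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper: reduces a name to an accent-free, lower-case lookup key.
public static class EntityKey
{
    public static string Fold(string name)
    {
        // Decompose accented characters into base letter + combining mark.
        string decomposed = name.Trim().Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks, keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Recompose and lower-case so the key is stable across spellings.
        return sb.ToString().Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }
}
```

Here `EntityKey.Fold("Luis Suárez")` and `EntityKey.Fold("Luis Suarez")` both produce "luis suarez", so the two spellings can share one dictionary entry and therefore one NamedEntity object.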

Regarding c# - Inverted index in C# generic collections, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24571499/
