machine-learning - Extracting fields from emails using database values as the training set

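I have 480 emails, each of which contains one or all of the following values:

[person, degree, working/not working, role]

For example, one of the emails looks like this: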

    Hi Amy,

    I wanted to discuss about Bob. I can see that he has a degree in 
    Computer Science which he got three years ago but hes still unemployed. 
    I dont know whetehr he'll be fit for the role of junior programmer at 
    our institute.
    Will get back to you on this.

    Thanks

The database entry corresponding to this email looks like this:

Email_123 | Bob | Computer Science | Unemployed | Junior Programmer

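Even though the data has not been labelled, we still have a database where we can look up which values were extracted into the 4 fields from each email. My question is: how can I use this corpus of 480 emails to learn and extract these 4 fields using machine learning/NLP? Do I need to manually tag all 480 emails, like this...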

    I wanted to discuss about <person>Bob</person>. I can see that he has a degree in 
    <degree>Computer Science</degree> which he got....

Or is there a better way? Something like this (MarI/O - Machine Learning for Video Games): https://www.youtube.com/watch?v=qv6UVOQ0F44&t=149s

Best answer

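Assuming there is only one value per field for each email, and that the value is always copied verbatim from the email, you could use something like WikiReading.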
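The verbatim-copy assumption is cheap to check against the database before committing to a model. Here is a minimal sketch, assuming each database row is available as a dict mapping field name to value; the `check_row` helper and the field names are made up for illustration:

```python
def check_row(mail, row):
    """Classify each field value as 'exact', 'case-insensitive', or 'missing'."""
    report = {}
    lowered = mail.lower()
    for field, value in row.items():
        if value in mail:
            report[field] = "exact"             # copied verbatim
        elif value.lower() in lowered:
            report[field] = "case-insensitive"  # same text, different casing
        else:
            report[field] = "missing"           # paraphrased or absent
    return report


mail = ("I wanted to discuss about Bob. I can see that he has a degree in "
        "Computer Science which he got three years ago but hes still unemployed. "
        "I dont know whetehr he'll be fit for the role of junior programmer at "
        "our institute.")
# Hypothetical field names; the values come from the example database entry.
row = {"person": "Bob", "degree": "Computer Science",
       "status": "Unemployed", "role": "Junior Programmer"}
report = check_row(mail, row)
```

On the sample email, `status` and `role` only match case-insensitively, so a strictly verbatim lookup would already miss two of the four fields.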

The problem is that WikiReading was trained on 4.7 million examples, so 480 examples are nowhere near enough to train a good model.

My suggestion is to preprocess your dataset so that the tags are added automatically, as in your example. Something like this, in pseudo-Python:

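    entity = "Junior Programmer"
    entity_type = "role"
    mail = "...[text of email]..."

    # Locate the first verbatim occurrence; str.find returns -1 when absent,
    # whereas str.index would raise ValueError
    ind = mail.find(entity)
    if ind != -1:
        # Wrap the matched text in <tag>...</tag>
        tagged = "{front}<{tag}>{ent}</{tag}>{back}".format(
            front=mail[:ind],
            back=mail[ind + len(entity):],
            tag=entity_type,
            ent=entity)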

You will need to adjust this for case differences, multiple matches, and so on.
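One way to handle those adjustments is to collect character offsets for every case-insensitive match first, and only then insert the tags. This is a rough sketch rather than a definitive implementation; `find_spans` and `tag` are hypothetical helper names, and the overlap rule (keep the earliest match, preferring the longer one on ties) is an assumption you may want to change:

```python
import re


def find_spans(mail, entities):
    """Return sorted, non-overlapping (start, end, label) spans for every match."""
    spans = []
    for label, value in entities.items():
        # re.escape keeps characters such as '+' or '.' in a value literal
        for m in re.finditer(re.escape(value), mail, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    # Earliest start first; on ties, the longer span first
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:  # drop spans that overlap one already kept
            kept.append((start, end, label))
            last_end = end
    return kept


def tag(mail, spans):
    """Wrap each span in <label>...</label>, right to left so offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        mail = "{}<{}>{}</{}>{}".format(
            mail[:start], label, mail[start:end], label, mail[end:])
    return mail


mail = "He has a degree in Computer Science but is still unemployed."
entities = {"degree": "Computer Science", "status": "Unemployed"}
spans = find_spans(mail, entities)
tagged = tag(mail, spans)
```

Note that plain substring matching will also tag "Bob" inside "Bobby"; if that matters for your data, you would need some kind of word-boundary check on top of this.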

Once the data is tagged, you can use a traditional NER system, such as a CRF. There is a tutorial on doing this in Python with spaCy.
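CRF-style taggers usually work on token-level labels rather than inline tags, and BIO (begin/inside/outside) is a common scheme for that. The sketch below converts character spans into BIO labels; the regex tokenizer is a deliberately simple stand-in for a real one, and `to_bio` and the example spans are hypothetical:

```python
import re


def to_bio(mail, spans):
    """Split into word/punctuation tokens and label each one B-x, I-x, or O."""
    tokens, labels = [], []
    for m in re.finditer(r"\w+|[^\w\s]", mail):
        label = "O"
        for start, end, name in spans:
            # A token belongs to a span only if it lies fully inside it
            if m.start() >= start and m.end() <= end:
                label = ("B-" if m.start() == start else "I-") + name
                break
        tokens.append(m.group())
        labels.append(label)
    return tokens, labels


mail = "Bob has a degree in Computer Science."
# (start, end, label) character offsets, e.g. produced by the automatic tagging step
spans = [(0, 3, "person"), (20, 36, "degree")]
tokens, labels = to_bio(mail, spans)
```

The resulting `tokens` and `labels` lists are parallel, one label per token, which is the shape most sequence-labelling toolkits expect as training input.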

This question, "machine-learning - Extracting fields from emails using database values as the training set", originally appeared on Stack Overflow: https://stackoverflow.com/questions/45451032/
