I have a tb_sentence table:
=========================================================================
| id_row | document_id | sentence_id | sentence_content |
=========================================================================
| 1 | 1 | 0 | Introduction to Data Mining. |
| 2 | 1 | 1 | Describe how data mining. |
| 3 | 2 | 0 | The boss is right. |
=========================================================================
I want to tokenize sentence_content, so that the tb_tokens table will contain:
==========================================================================
| tokens_id | tokens_word | tokens_freq | sentence_id | document_id |
==========================================================================
| 1 | Introduction | 1 | 0 | 1 |
| 2 | to | 1 | 0 | 1 |
| 3 | Data | 1 | 0 | 1 |
| 4 | Mining | 1 | 0 | 1 |
| 5 | Describe | 1 | 1 | 1 |
etc...
Here is my code:
$sentence_clean = array();
$q1 = mysql_query("SELECT document_id FROM tb_sentence ORDER BY document_id ") or die(mysql_error());
while ($row1 = mysql_fetch_array($q1)) {
    $doc_id[] = $row1['document_id'];
}
$q2 = mysql_query('SELECT sentence_content, sentence_id, document_id FROM tb_sentence ') or die(mysql_error());
while ($row2 = mysql_fetch_array($q2)) {
    $sentence_clean[$row2['document_id']][] = $row2['sentence_content'];
}
foreach ($sentence_clean as $kal) {
    if (trim($kal) === '')
        continue;
    tokenizing($kal);
}
The tokenizing function is:
function tokenizing($sentence) {
    foreach ($sentence as $sentence_id => $sentences) {
        $symbol = array(".", ",", "\\", "-", "\"", "(", ")", "<", ">", "?", ";", ":", "+", "%", "\r", "\t", "\0", "\x0B");
        $spasi = array("\n", "/", "\r");
        $replace = str_replace($spasi, " ", $sentences);
        $cleanSymbol = str_replace($symbol, "", $replace);
        $quote = str_replace("'", "\'", $cleanSymbol);
        $element = explode(" ", trim($quote));
        $elementNCount = array_count_values($element);
        foreach ($elementNCount as $word => $freq) {
            if (ereg("([a-z,A-Z])", $word)) {
                $query = mysql_query(" INSERT INTO tb_tokens VALUES ('','$word','$freq','$sentence_id', '$doc_id')");
            }
        }
    }
}
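The problem is that document_id cannot be read inside the function, so it is not inserted into the tb_tokens table. How can I access those document_id values? Thanks :)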
Edit:
Every word (the result of tokenization) has a document_id and a sentence_id. My problem is that I cannot access document_id. How can I get both the sentence_id and the document_id for each word?
Best answer
I don't think you need this code:
$q1 = mysql_query("SELECT document_id FROM tb_sentence ORDER BY document_id ") or die(mysql_error());
while ($row1 = mysql_fetch_array($q1)) {
$doc_id[] = $row1['document_id'];
}
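The $doc_id array is never actually used: tokenizing() cannot see it, because a PHP function has its own local scope, so '$doc_id' in the INSERT is empty there.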
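One reason the document_id never reaches the INSERT is variable scope: a variable filled at the top level of a script is not visible inside a function unless it is passed in (or declared global). A minimal sketch of that behavior follows; the function names withoutParam and withParam are made up for illustration and are not part of the question's code.

```php
<?php
// Hypothetical demo: a top-level variable is not visible inside a function.
$doc_id = 7;

function withoutParam() {
    // $doc_id is undefined in this local scope; isset() avoids a warning.
    return isset($doc_id) ? $doc_id : null;
}

function withParam($doc_id) {
    // The value arrives explicitly as an argument.
    return $doc_id;
}

var_dump(withoutParam());      // NULL
var_dump(withParam($doc_id));  // int(7)
```

Passing the value as an argument, as in withParam, is usually the simplest way to make it available.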
if (trim($kal) === '')
continue;
$kal is an array, so there is nothing to trim.
$sentence_clean[$row2['document_id']][] = $row2['sentence_content'];
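Since you want to record the sentence_id, the key should be $row2['sentence_id'] rather than [].
(Of course, you need to be sure that the same sentence_id never appears twice within one document_id; otherwise you should concatenate the contents.)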
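To illustrate the point about duplicate keys, here is a rough sketch of grouping rows by document_id and then sentence_id, concatenating the content when the same pair shows up twice. The helper name groupSentences and the hard-coded $rows array (standing in for the fetched query result) are assumptions made for the example.

```php
<?php
// Hypothetical helper: build $grouped[document_id][sentence_id] = content.
function groupSentences(array $rows) {
    $grouped = array();
    foreach ($rows as $row) {
        $doc = $row['document_id'];
        $sid = $row['sentence_id'];
        if (isset($grouped[$doc][$sid])) {
            // Same (document_id, sentence_id) seen before: append, don't overwrite.
            $grouped[$doc][$sid] .= ' ' . $row['sentence_content'];
        } else {
            $grouped[$doc][$sid] = $row['sentence_content'];
        }
    }
    return $grouped;
}

// Sample rows standing in for the result of the SELECT on tb_sentence.
$rows = array(
    array('document_id' => 1, 'sentence_id' => 0, 'sentence_content' => 'Introduction to Data Mining.'),
    array('document_id' => 1, 'sentence_id' => 1, 'sentence_content' => 'Describe how data mining.'),
    array('document_id' => 2, 'sentence_id' => 0, 'sentence_content' => 'The boss is right.'),
);

$grouped = groupSentences($rows);
print_r(array_keys($grouped)); // the document ids, here 1 and 2
```

With the sample data no pair repeats, so nothing is concatenated; the append branch only matters if the table can contain duplicates.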
Here is my corrected version:
$sentence_clean = array();
// Group the sentences as $sentence_clean[document_id][sentence_id] = content.
$q2 = mysql_query('SELECT sentence_content, sentence_id, document_id FROM tb_sentence ') or die(mysql_error());
while ($row2 = mysql_fetch_array($q2)) {
    $sentence_clean[$row2['document_id']][$row2['sentence_id']] = $row2['sentence_content'];
}
// The outer key is the document_id; pass it along with that document's sentences.
foreach ($sentence_clean as $doc_id => $kal) {
    tokenizing($kal, $doc_id);
}

function tokenizing($sentence, $doc_id) {
    foreach ($sentence as $sentence_id => $sentences) {
        $symbol = array(".", ",", "\\", "-", "\"", "(", ")", "<", ">", "?", ";", ":", "+", "%", "\r", "\t", "\0", "\x0B");
        $spasi = array("\n", "/", "\r");
        // Turn line breaks and slashes into spaces, then strip punctuation.
        $replace = str_replace($spasi, " ", $sentences);
        $cleanSymbol = str_replace($symbol, "", $replace);
        // Escape single quotes for the SQL string below.
        $quote = str_replace("'", "\'", $cleanSymbol);
        $element = explode(" ", trim($quote));
        // Count how often each word occurs within this sentence.
        $elementNCount = array_count_values($element);
        foreach ($elementNCount as $word => $freq) {
            // ereg() is deprecated since PHP 5.3; preg_match() is the replacement.
            if (preg_match('/[a-zA-Z]/', $word)) {
                $query = mysql_query(" INSERT INTO tb_tokens VALUES ('','$word','$freq','$sentence_id', '$doc_id')");
            }
        }
    }
}
I pass the document_id into the function as a parameter.
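As a way to check what would be inserted without touching the database, the tokenizing step can also be sketched as a function that only returns rows. This is not the answer's code: the name tokenRows and the regex-based cleanup (simplified compared to the symbol lists above) are assumptions, and the returned rows would still have to be inserted afterwards, ideally through a prepared statement.

```php
<?php
// Hypothetical sketch: return one row per distinct word in each sentence.
// $sentences is array(sentence_id => sentence_content) for a single document.
function tokenRows(array $sentences, $doc_id) {
    $rows = array();
    foreach ($sentences as $sentence_id => $content) {
        // Replace anything that is not a letter, digit or apostrophe with a space.
        $clean = preg_replace("/[^A-Za-z0-9']+/", ' ', $content);
        // Split on runs of whitespace and drop empty pieces.
        $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_count_values($words) as $word => $freq) {
            // Keep only tokens that contain at least one letter.
            if (preg_match('/[A-Za-z]/', $word)) {
                $rows[] = array(
                    'tokens_word' => $word,
                    'tokens_freq' => $freq,
                    'sentence_id' => $sentence_id,
                    'document_id' => $doc_id,
                );
            }
        }
    }
    return $rows;
}

// Document 1 from the question's sample table.
$rows = tokenRows(array(
    0 => 'Introduction to Data Mining.',
    1 => 'Describe how data mining.',
), 1);
echo count($rows), "\n"; // 8
```

Each returned row carries both ids, which is what the question asks for; counting is case-sensitive, so "Data" and "data" stay separate tokens here.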
Regarding "php - unable to store document_id", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11776974/