java - 如何优化倒排文件的全文索引？

我正在制作一个简单的程序，其中使用 PDF 文件示例在我的数据库上构建全文索引。我的想法是阅读每个 PDF 文件，提取单词并将它们存储在哈希集中。

然后，将循环中的每个单词及其文件路径添加到 MySQL 中的表中。因此，每个单词都会循环存储在每一列中，直到完成。它工作得很好。然而，当涉及到包含数千个单词的大型PDF文件时，建立索引表可能需要一些时间。换句话说，由于单词提取的速度很快，因此将每个单词保存到数据库需要很长时间。

代码:

public class IndexTest {

public static void main(String[] args) throws Exception {
    // write your code here
    //String path ="D:\\Full Text Indexing\\testIndex\\bell2009a.pdf";
    // HashSet<String> uniqueWords = new HashSet<>();
    /*StopWatch stopwatch = new StopWatch();
    stopwatch.start();*/
    File folder = new File("D:\\PDF1");
    File[] listOfFiles = folder.listFiles();

    for (File file : listOfFiles) {
        if (file.isFile()) {
            HashSet<String> uniqueWords = new HashSet<>();
            String path = "D:\\PDF1\\" + file.getName();
            try (PDDocument document = PDDocument.load(new File(path))) {

                if (!document.isEncrypted()) {

                    PDFTextStripper tStripper = new PDFTextStripper();
                    String pdfFileInText = tStripper.getText(document);
                    String lines[] = pdfFileInText.split("\\r?\\n");
                    for (String line : lines) {
                        String[] words = line.split(" ");

                        for (String word : words) {
                            uniqueWords.add(word);

                        }

                    }
                    // System.out.println(uniqueWords);

                }
            } catch (IOException e) {
                System.err.println("Exception while trying to read pdf document - " + e);
            }
            Object[] words = uniqueWords.toArray();
            String unique = uniqueWords.toString();
            //  System.out.println(words[1].toString());



            for(int i = 1 ; i <= words.length - 1 ; i++ ) {
                MysqlAccessIndex connection = new MysqlAccessIndex();
                connection.readDataBase(path, words[i].toString());

            }

            System.out.println("Completed");

        }
    }

SQL连接代码:

 public class MysqlAccessIndex {

      public MysqlAccessIndex() throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.178/fulltext_ltat?"
                        + "user=root&password=root123");
      //  statement = connect.createStatement();
        System.out.print("Connected");
    }


    public void readDataBase(String path,String word) throws Exception {
        try {




            statement = connect.createStatement();
            System.out.print("Connected");


            preparedStatement = connect
                    .prepareStatement("insert IGNORE into  fulltext_ltat.test_text values (?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();
            // resultSet = statement
            //.executeQuery("select * from fulltext_ltat.index_detail");



            //  writeResultSet(resultSet);
        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }

对于性能问题有什么改进或优化的建议吗？

最佳答案

问题在于以下代码:

// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(
        "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");

您正在为插入数据库的每个单词重新创建连接。更好的方法是这样的:

public MysqlAccess() {
    connect = DriverManager
                .getConnection("jdbc:mysql://126.32.3.20/fulltext_ltat?"
                        + "user=root&password=root");
}

这样，您只需在第一次创建该类的实例时创建connect。在 main 方法中，您必须在 for 循环之外创建 MysqlAccess 实例，因此它只会创建一次。

MysqlAccess 看起来像这样:

public class MysqlAccess {

    private Connection connect = null;
    private Statement statement = null;
    private PreparedStatement preparedStatement = null;
    private ResultSet resultSet = null;

    public MysqlAccess() {
        // Setup the connection with the DB
        connect = DriverManager.getConnection(
                "jdbc:mysql://126.32.3.20/fulltext_ltat?" + "user=root&password=root");
    }

    public void readDataBase(String path, String word) throws Exception {
        try {
            // Statements allow to issue SQL queries to the database
            statement = connect.createStatement();
            System.out.print("Connected");
            // Result set get the result of the SQL query

            preparedStatement = connect.prepareStatement(
                    "insert IGNORE into  fulltext_ltat.test_text values (default,?, ?) ");

            preparedStatement.setString(1, path);
            preparedStatement.setString(2, word);
            preparedStatement.executeUpdate();

        } catch (Exception e) {
            throw e;
        } finally {
            close();
        }

    }

    private void writeResultSet(ResultSet resultSet) throws SQLException {
        // ResultSet is initially before the first data set
        while (resultSet.next()) {
            // It is possible to get the columns via name
            // also possible to get the columns via the column number
            // which starts at 1
            // e.g. resultSet.getSTring(2);
            String path = resultSet.getString("path");
            String word = resultSet.getString("word");

            System.out.println();
            System.out.println("path: " + path);
            System.out.println("word: " + word);

        }
    }
}

关于java - 如何优化倒排文件的全文索引？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/52925314/

java - 如何优化倒排文件的全文索引？

上一篇：php - 如何在 PHP 中使用 mysql 循环和 xlsxwriter php 导出到 Excel 工作表

下一篇：mysql - 生成字母数字 ID VB.NET 时出错