java - Lucene index size

Tags: java indexing lucene

I have data like this:

1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 22 23 24 25 26 28 30 36 37 39 40 41 46 48 49 51 52 53 54 55 56 58 60 66 67 68 71 72 74 77 78 85 89 90 91 108 109 110 116 117 118 120 121 123 137 138 145 146 147 148 154 157 159 162 165 166 168 175 179 181 198 201 203 212 215 216 223 231 233 254 266 270 274 323 327 329 331 347 352 355 360 363 370 411 415 434 438 442 444 445 462 470 471 477 486 495 499 503 524 525 536 542 595 603 608 636 644 646 647 670 692 694 698 762 763 798 809 822 970 981 987 992 1040 1057 1066 1079 1089 1111 1233 1244 1302 1315 1327 1333 1336 1387 1411 1412 1432 1458 1486 1498 1509 1572 1573 1574 1607 1625 1784 1808 1824 1909 1933 1938 1940 2011 2077 2081 2093 2286 2289 2395 2427 2467 2911 2944 2962 2975 3121 3170 3172 3197 3236 3267 3334 3699 3731 3905 3945 3982 3999 4008 4161 4234 4235 4296 4374 4457 4494 4526 4717 4720 4723 4820 4875 5352 5423 5472 5728 5799 5813 5821 6032 6230 6244 6278 6859 6868 7186 7280 7401 8734 8832 8885 8886 8925 9363 9510 9517 9592 9707 9802 10002 11097 11192 11715 11716 11836 11945 11996 12025 12482 12703 12706 12887 13122 13372 13482 13577 14150 14161 14169 14461 14626 16057 16268 16415 17183 17398 17440 17464 18097 18690 18731 18834 20576 20603 21558 21839 22202 26201 26497 26654 26658 26776 28088 28531 28551 28775 29122 29407. 

This is a single line of data; there are many lines like it, stored in "training.txt". I index it with the following Lucene indexing code:

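import java.io.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class training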
{
    public static void main(String args[]) throws IOException, ParseException
    {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
//      IndexWriter w = new IndexWriter(FSDirectory.open(new File("../search/index")), analyzer, true, new IndexWriter.MaxFieldLength(1000000));
        IndexWriter w = new IndexWriter(FSDirectory.open(new File("index")), analyzer, true, new IndexWriter.MaxFieldLength(2139999999));


        File file = new File("training.txt");
        FileInputStream fis = null;
        BufferedInputStream bis = null;
        DataInputStream dis = null;

        File file1 = new File("fileName.txt");
        FileInputStream fis1 = null;
        BufferedInputStream bis1 = null;
        DataInputStream dis1 = null;


        try {
            fis = new FileInputStream(file);

            // Here BufferedInputStream is added for fast reading.
            bis = new BufferedInputStream(fis);
            dis = new DataInputStream(bis);

            fis1 = new FileInputStream(file1);

            // Here BufferedInputStream is added for fast reading.
            bis1 = new BufferedInputStream(fis1);
            dis1 = new DataInputStream(bis1);



            // dis.available() returns 0 if the file does not have more lines.
            while (dis.available() != 0 && dis1.available() != 0 ) {

                String tempImg = dis1.readLine();
                String temp = dis.readLine();
                addDoc(w, tempImg, temp);

                // System.out.println(temp);

            }

            // dispose all the resources after using them;
            // closing the outermost stream also closes the streams it wraps.
            dis.close();
            dis1.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }


        w.optimize();
        w.close();


    }
    private static void addDoc(IndexWriter w, String value1,String value2) throws IOException
    {
        Document doc = new Document();
        doc.add(new Field("fileId", value1, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("visualId", value2, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }

}

There is also another file, "fileName.txt", which holds the file names. My "training.txt" is 127.1 MB, and the index folder being created is 217.2 MB. I believe it should be smaller.

My search code:

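import java.io.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class search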
{
    public static void main(String args[]) throws IOException, ParseException
    {

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        String fname = "test.txt";

        File file = new File(fname);
        FileInputStream fis = null;
        BufferedInputStream bis = null;
        DataInputStream dis = null;


        try {
            fis = new FileInputStream(file);
            Writer fos = null;
            File outputFile = new File("outList.txt");

            fos = new BufferedWriter(new FileWriter(outputFile));

            // Here BufferedInputStream is added for fast reading.
            bis = new BufferedInputStream(fis);
            dis = new DataInputStream(bis);

            while (dis.available() != 0)
            {

                Query q = new QueryParser(Version.LUCENE_CURRENT, "visualId", analyzer).parse(dis.readLine());

                //3.search
                int hitsPerPage = 200;
                IndexSearcher searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File("index")), true));
                TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
                long startTime = System.currentTimeMillis();
                searcher.search(q, collector);
                long endTime = System.currentTimeMillis();
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                for(int i=0;i<hits.length;++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);

                    String text = d.get("fileId");

                    fos.write(text);
                    fos.write("\n");
                }

                searcher.close();
            }

            // dispose all the resources after using them.
            fis.close();
            fos.close();
            bis.close();
            dis.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        //out.close();


    }
}

My "test.txt" contains:



Thanks, Ravi.

Best answer

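When you add a field with Field.Store.YES, its original value is stored in the Lucene index in addition to being indexed. As a result, the index becomes larger than you expect.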
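
The search code in the question only reads "fileId" back from each hit, so one option, assuming nothing else needs the original "visualId" text, would be to pass Field.Store.NO instead of Field.Store.YES for "visualId" in addDoc. To make the effect visible without a Lucene setup, here is a minimal toy model in plain Java. It is not Lucene's real file format: the class StoredFieldSizeDemo, its methods, and the 4-bytes-per-posting cost are all assumptions made for illustration. It only shows that keeping a verbatim copy of every value adds roughly the raw text size on top of the inverted index.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (NOT Lucene's on-disk format) of an index with and without stored values. */
public class StoredFieldSizeDemo {

    /** Builds a tiny inverted index: term -> numbers of the documents that contain it. */
    static Map<String, List<Integer>> invert(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).trim().split("\\s+")) {
                if (term.isEmpty()) {
                    continue; // skip the empty token produced by a blank line
                }
                List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
                // record each document at most once per term
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }

    /** Rough size of the inverted part: term bytes plus an assumed 4 bytes per posting. */
    static long invertedBytes(Map<String, List<Integer>> postings) {
        long total = 0;
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += 4L * entry.getValue().size();
        }
        return total;
    }

    /** Extra bytes needed when every original value is also kept verbatim (the "stored" copy). */
    static long storedBytes(List<String> docs) {
        long total = 0;
        for (String doc : docs) {
            total += doc.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same shape as the question's data.
        List<String> docs = Arrays.asList("1 2 3 4 5", "2 3 5 8 13", "1 3 21 34");
        long inverted = invertedBytes(invert(docs));
        long stored = storedBytes(docs);
        System.out.println("indexed only:       " + inverted + " bytes");
        System.out.println("indexed and stored: " + (inverted + stored) + " bytes");
    }
}
```

The absolute numbers from this model mean nothing for a real index, since Lucene encodes its terms and postings far more compactly. The point is only the difference between the two totals: it equals the size of the raw text, which is the part a stored field adds.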

Regarding java - Lucene index size, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6175059/