java - Lucene index size

Tags: java indexing lucene

I have data like this:

1 2 3 4 5 6 7 8 9 10 12 13 14 15 16 17 18 19 20 22 23 24 25 26 28 30 36 37 39 40 41 46 48 49 51 52 53 54 55 56 58 60 66 67 68 71 72 74 77 78 85 89 90 91 108 109 110 116 117 118 120 121 123 137 138 145 146 147 148 154 157 159 162 165 166 168 175 179 181 198 201 203 212 215 216 223 231 233 254 266 270 274 323 327 329 331 347 352 355 360 363 370 411 415 434 438 442 444 445 462 470 471 477 486 495 499 503 524 525 536 542 595 603 608 636 644 646 647 670 692 694 698 762 763 798 809 822 970 981 987 992 1040 1057 1066 1079 1089 1111 1233 1244 1302 1315 1327 1333 1336 1387 1411 1412 1432 1458 1486 1498 1509 1572 1573 1574 1607 1625 1784 1808 1824 1909 1933 1938 1940 2011 2077 2081 2093 2286 2289 2395 2427 2467 2911 2944 2962 2975 3121 3170 3172 3197 3236 3267 3334 3699 3731 3905 3945 3982 3999 4008 4161 4234 4235 4296 4374 4457 4494 4526 4717 4720 4723 4820 4875 5352 5423 5472 5728 5799 5813 5821 6032 6230 6244 6278 6859 6868 7186 7280 7401 8734 8832 8885 8886 8925 9363 9510 9517 9592 9707 9802 10002 11097 11192 11715 11716 11836 11945 11996 12025 12482 12703 12706 12887 13122 13372 13482 13577 14150 14161 14169 14461 14626 16057 16268 16415 17183 17398 17440 17464 18097 18690 18731 18834 20576 20603 21558 21839 22202 26201 26497 26654 26658 26776 28088 28531 28551 28775 29122 29407. 

This is one line of data; there are many lines like it, all stored in "training.txt". I index the file with the following Lucene indexing code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class training
{
    public static void main(String args[]) throws IOException
    {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
//      IndexWriter w = new IndexWriter(FSDirectory.open(new File("../search/index")), analyzer, true, new IndexWriter.MaxFieldLength(1000000));
        IndexWriter w = new IndexWriter(FSDirectory.open(new File("index")), analyzer, true, new IndexWriter.MaxFieldLength(2139999999));

        // training.txt holds one line of visual IDs per document;
        // fileName.txt holds the file name for the same line number.
        try (BufferedReader ids = new BufferedReader(new FileReader("training.txt"));
             BufferedReader names = new BufferedReader(new FileReader("fileName.txt"))) {
            String temp;
            String tempImg;
            // Stop as soon as either file runs out of lines.
            while ((temp = ids.readLine()) != null && (tempImg = names.readLine()) != null) {
                addDoc(w, tempImg, temp);
                // System.out.println(temp);
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException; both readers are closed automatically.
            e.printStackTrace();
        }

        w.optimize();
        w.close();
    }

    // Adds one document; both fields are analyzed (searchable) and stored.
    private static void addDoc(IndexWriter w, String value1, String value2) throws IOException
    {
        Document doc = new Document();
        doc.add(new Field("fileId", value1, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("visualId", value2, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }
}

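There is also a second file, "fileName.txt", which holds the file names. My "training.txt" is 127.1 MB, and indexing it creates an index folder of 217.2 MB. I believe the index should be smaller than that.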

My search code:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class search
{
    public static void main(String args[]) throws IOException, ParseException
    {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        String fname = "test.txt";

        try (BufferedReader in = new BufferedReader(new FileReader(fname));
             Writer fos = new BufferedWriter(new FileWriter("outList.txt"))) {

            // Open the index once and reuse the searcher for every query line.
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")), true);
            IndexSearcher searcher = new IndexSearcher(reader);

            String line;
            while ((line = in.readLine()) != null)
            {
                Query q = new QueryParser(Version.LUCENE_30, "visualId", analyzer).parse(line);

                //3.search
                int hitsPerPage = 200;
                TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    // Only the stored fileId value is read back from each hit.
                    fos.write(d.get("fileId"));
                    fos.write("\n");
                }
            }

            searcher.close();
            reader.close();
        } catch (IOException e) {
            // Also covers FileNotFoundException; the reader and writer are closed automatically.
            e.printStackTrace();
        }
    }
}

My "test.txt" contains:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 55 56 57 58 59 60 61 63 64 65 66 67 69 70 72 73 76 77 78 80 82 83 85 86 88 89 90 91 92 93 94 95 97 99 100 102 105 106 107 108 109 110 111 112 114 115 116 117 118 119 120 121 122 124 126 127 128 129 130 132 133 135 136 137 138 141 142 143 144 145 147 148 151 153 154 155 156 157 160 164 165 167 168 169 170 172 173 174 175 176 178 179 180 181 182 183 184 190 191 194 195 199 200 202 206 208 211 215 216 217 220 228 231 234 239 246 248 250 254 259 264 266 267 268 270 271 272 275 276 278 281 284 285 292 296 297 300 306 307 314 316 317 320 321 322 323 325 326 327 330 331 333 336 343 345 348 349 350 351 353 354 355 357 358 360 361 362 364 365 367 371 372 379 381 384 385 386 388 391 396 398 399 404 405 406 407 409 412 415 423 424 427 428 429 431 432 434 435 436 442 443 444 453 458 461 462 466 468 472 479 493 494 495 496 500 501 502 503 504 506 507 508 509 510 515 518 519 521 526 528 533 535 537 538 540 544 545 547 549 551 569 570 574 582 583 586 597 599 601 605 607 618 623 624 632 644 645 649 651 661 683 694 701 702 718 737 738 739 743 751 762 776 777 778 792 797 800 803 809 811 812 813 817 825 828 833 843 853 854 875 889 892 900 918 919 922 941 949 951 961 963 964 965 966 967 969 975 976 977 979 980 990 992 993 1000 1007 1008 1009 1029 1036 1045 1047 1051 1052 1053 1058 1059 1061 1062 1064 1065 1066 1070 1072 1075 1081 1082 1083 1086 1093 1094 1101 1114 1116 1117 1136 1143 1152 1154 1158 1159 1165 1172 1188 1194 1198 1212 1216 1218 1220 1227 1236 1245 1269 1272 1280 1283 1284 1285 1287 1293 1295 1296 1303 1305 1307 1327 1329 1332 1358 1373 1374 1375 1384 1385 1386 1397 1404 1415 1416 1436 1437 1478 1481 1482 1485 1487 1489 1501 1503 1505 1506 1508 1511 1517 1518 1520 1521 1522 1524 1525 1527 1529 1545 1555 1556 1564 1577 1579 1583 1599 1606 1610 1611 1612 1615 1620 1632 1636 1640 1648 1654 1706 1711 1721 1746 1750 1758 1792 
1796 1802 1814 1820 1853 1869 1872 1897 1931 1932 1935 1946 1953 1982 2049 2082 2104 2107 2155 2211 2213 2216 2228 2253 2286 2329 2330 2332 2334 2377 2390 2399 2408 2427 2428 2433 2435 2440 2452 2475 2484 2498 2529 2559 2563 2626 2666 2675 2699 2754 2758 2765 2822 2847 2852 2882 2889 2893 2895 2898 2902 2906 2908 2925 2929 2932 2936 2939 2940 2971 2977 2980 2999 3022 3023 3024 3028 3086 3107 3134 3136 3140 3152 3156 3160 3174 3176 3182 3186 3192 3195 3197 3209 3216 3225 3242 3247 3249 3259 3279 3283 3303 3341 3349 3350 3352 3407 3429 3455 3462 3475 3476 3495 3515 3564 3581 3595 3637 3648 3653 3660 3681 3707 3735 3807 3817 3839 3850 3852 3856 3860 3878 3884 3889 3909 3916 3920 3980 3988 3997 4075 4120 4122 4123 4125 4152 4156 4157 4159 4191 4211 4244 4248 4307 4310 4434 4444 4446 4455 4462 4466 4503 4509 4516 4517 4525 4532 4551 4554 4559 4563 4564 4565 4573 4576 4581 4586 4634 4666 4669 4691 4730 4738 4748 4796 4817 4829 4832 4837 4846 4859 4896 4909 4919 4943 4962 5119 5132 5162 5237 5251 5275 5376 5387 5407 5441 5461 5559 5606 5608 5616 5692 5792 5797 5806 5837 5858 5947 6146 6245 6313 6320 6466 6632 6640 6648 6683 6759 6859 6987 6988 6989 6995 7003 7131 7171 7197 7223 7225 7280 7283 7299 7304 7320 7355 7357 7424 7451 7493 7586 7678 7690 7878 7997 8024 8096 8261 8275 8294 8465 8542 8556 8646 8667 8679 8685 8695 8707 8718 8724 8774 8786 8795 8808 8817 8819 8913 8932 8941 8996 9065 9069 9071 9085 9258 9321 9403 9408 9420 9456 9468 9481 9523 9528 9546 9559 9575 9584 9590 9592 9626 9648 9675 9727 9740 9742 9747 9776 9778 9836 9850 9909 10022 10046 10049 10056 10222 10288 10366 10385 10425 10429 10485 10546 10691 10744 10786 10912 10945 10958 10980 11043 11120 11205 11420 11451 11518 11551 11557 11568 11580 11633 11635 11652 11667 11728 11749 11760 11940 11963 11990 12225 12360 12367 12370 12375 12455 12468 12472 12476 12573 12632 12633 12731 12732 12745 12921 12922 12931 13303 13331 13332 13338 13364 13366 13386 13397 13510 13528 13548 13551 13575 13597 13654 13662 
13676 13688 13689 13690 13693 13694 13720 13728 13743 13757 13901 13999 14007 14074 14190 14214 14245 14389 14452 14487 14496 14511 14538 14578 14689 14726 14756 14829 14887 15357 15395 15485 15710 15754 15824 16128 16161 16220 16323 16384 16678 16819 16825 16848 17075 17375 17391 17417 17511 17575 17841 18439 18734 18940 18961 19399 19896 19920 19945 20050 20276 20578 20960 20964 20967 20986 21009 21393 21513 21591 21670 21676 21839 21849 21898 21911 21960 22066 22072 22271 22354 22480 22759 23033 23070 23635 23990 24073 24287 24784 24824 24882 25395 25625 25668 25938 26002 26036 26054 26056 26085 26122 26153 26173 26321 26358 26385 26423 26450 26456 26739 26796 26823 26987 27196 27206 27214 27255 27773 27962 28209 28225 28260 28369 28405 28443 28568 28585 28637 28676 28724 28753 28770 28775 28877 28944 29026 29180 29221 29225 29240 29327 29333 29507

Thanks, Ravi.

Best answer

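When you add a field with Field.Store.YES, its original value is stored in the index in addition to being indexed, so the index ends up larger than you expected. Your search code only reads fileId back from each hit, so indexing visualId with Field.Store.NO should shrink the index: the field stays searchable, but its text can no longer be retrieved from a hit.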
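
To make the stored-versus-indexed distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's on-disk format: the class `StoredVsIndexed` and its methods are hypothetical names invented for this illustration. The `store` flag plays the role of Field.Store.YES (true) or Field.Store.NO (false). Either way the terms go into the inverted index, but only `store == true` also keeps a verbatim copy of the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical class): this is NOT how Lucene lays out its files.
final class StoredVsIndexed {
    // Inverted index: term -> IDs of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Verbatim copies of field values, kept only when store == true.
    private final List<String> stored = new ArrayList<>();
    private int nextDocId = 0;

    // Indexes the whitespace-separated terms of text; optionally keeps the raw text too.
    void add(String text, boolean store) {
        int docId = nextDocId++;
        for (String term : text.trim().split("\\s+")) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        if (store) {
            stored.add(text);
        }
    }

    // True if the term is searchable, regardless of whether the text was stored.
    boolean contains(String term) {
        return postings.containsKey(term);
    }

    // Bytes spent on verbatim copies; this is the extra cost of storing a field.
    long storedBytes() {
        long total = 0;
        for (String s : stored) {
            total += s.getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }
}
```

Adding the same line once with `store` set to true and once with it set to false leaves every term searchable in both cases, but only the first keeps an extra copy of the text. In the question, that extra copy is on the order of the whole 127.1 MB input, on top of the inverted index itself.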

Regarding java - Lucene index size, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6175059/
