java.util.Scanner: reading files with different character encodings

Tags: java arrays character-encoding java.util.scanner

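I am reading a list of files in Java. Some of them have a different encoding, ANSI instead of UTF-8. java.util.Scanner cannot read these files, and I get an empty output string. I tried another approach: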

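                // Note: getEncoding() reports the charset the reader was opened with
                // (here the platform default), not the encoding of the file's contents.
                FileInputStream fis = new FileInputStream(my_file);
                InputStreamReader isr = new InputStreamReader(fis);
                BufferedReader br = new BufferedReader(isr);
                String encoding = isr.getEncoding();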

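I am not sure how to handle the ANSI character encoding. UTF-8 and ANSI files are mixed in the same folder. I tried using Apache Tika for this. After detecting the file's encoding I use a Scanner, but the output is empty.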

Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();

Best answer

There is a library called juniversalchardet that helps you guess the correct encoding. It has been updated recently and currently lives on GitHub:

https://github.com/albfernandez/juniversalchardet

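However, there is no fail-safe tool for detecting an encoding, because there are many unknowns:

  1. Is this file text, or is it partially a PNG?
  2. Is it stored in a (1,...,k,...,n)-bit encoding?
  3. Which k-bit encoding was used?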
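
For the first question, one cheap check is to look for a known magic number before treating the bytes as text. The sketch below is only an illustration: the class and method names are hypothetical, and it only recognizes data that starts with the 8-byte PNG signature, so it says nothing about files that merely contain binary parts further in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public final class PngSignature {

    // The fixed 8-byte signature that every PNG file starts with.
    private static final byte[] SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    private PngSignature() {
    }

    /** Returns true if the data begins with the PNG signature. */
    public static boolean startsWithPngSignature(byte[] data) {
        return data.length >= SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(data, SIGNATURE.length), SIGNATURE);
    }

    public static void main(String[] args) {
        byte[] text = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithPngSignature(text));      // not a PNG
        System.out.println(startsWithPngSignature(SIGNATURE)); // signature only
    }
}
```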

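Some guessing can be done by counting the number of uncommon control characters. When a file contains many control characters, you have most likely picked the wrong encoding. (Then try the next one.)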
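
A minimal sketch of that counting idea, using only the JDK. The class and method names are hypothetical, the candidate list is an assumption (UTF-8 first, then ISO-8859-1 as a stand-in for "ANSI"), and counting the replacement character U+FFFD in addition to control characters is my own addition, since that is what the JDK substitutes for bytes it cannot decode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public final class ControlCharHeuristic {

    private ControlCharHeuristic() {
    }

    /** Counts characters that rarely appear in plain text. */
    public static int suspiciousChars(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean commonWhitespace = c == '\n' || c == '\r' || c == '\t';
            if (c == '\uFFFD' || (Character.isISOControl(c) && !commonWhitespace)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate whose decoding contains the fewest suspicious
     * characters. Earlier candidates win ties; returns null for an empty list.
     */
    public static Charset guess(byte[] data, List<Charset> candidates) {
        Charset best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            // new String(...) replaces undecodable bytes with U+FFFD instead of throwing.
            int score = suspiciousChars(new String(data, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(latin1, candidates));
        System.out.println(guess(utf8, candidates));
    }
}
```

Putting UTF-8 first matters here: UTF-8 bytes also decode without any control characters as ISO-8859-1 (they just produce mojibake), so the tie has to be broken in favor of the stricter encoding.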

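juniversalchardet tries several more successful approaches to determine the encoding (even for Chinese encodings). It also offers a convenient way to open a reader on a file with the correct encoding already selected: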

(Snippet taken and adapted from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding)

import org.mozilla.universalchardet.ReaderFactory;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main (String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }

        File file = new File(args[0]);
        // readLine() is defined on BufferedReader, not on Reader, so keep the
        // concrete type. try-with-resources closes the reader even on failure.
        try (BufferedReader reader = ReaderFactory.createBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line to console
            }
        }
    }

}

Edit: added a ScannerFactory

/*
(C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;

/**
 * Create a scanner from a file with correct encoding
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }
    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset charset to use if the encoding can't be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must not be null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
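        // Only the Unicode encodings can start with a byte order mark (BOM).
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        // Wrap the stream in UnicodeBOMInputStream so a leading BOM is handled.
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());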
    }
    /**
     * Create a scanner from a file with correct encoding. If charset cannot be determined,
     * it uses the system default charset.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}
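
A likely reason for the empty output described in the question: Scanner does not throw when the underlying read fails. It records the error, behaves as if the input had ended, and exposes the error through ioException(). Bytes that are invalid in the charset passed to the Scanner(File, String) constructor are one such failure. The sketch below uses only the JDK; the class and method names are hypothetical, and ISO-8859-1 stands in for "ANSI".

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

public final class ScannerEncodingCheck {

    private ScannerEncodingCheck() {
    }

    /**
     * Returns the first line of the file, or null if the file is empty.
     * Rethrows the I/O error that Scanner recorded instead of hiding it.
     */
    public static String firstLine(File file, String charsetName) throws IOException {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            if (scanner.hasNextLine()) {
                return scanner.nextLine();
            }
            // No line available: find out whether that is due to a read error.
            IOException recorded = scanner.ioException();
            if (recorded != null) {
                throw recorded;
            }
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("latin1", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(firstLine(file, "ISO-8859-1"));
        try {
            firstLine(file, "UTF-8");
        } catch (IOException e) {
            System.out.println("Wrong charset: " + e);
        }
    }
}
```

Checking ioException() after an unexpectedly empty read makes a wrong charset guess visible, at which point the next candidate encoding can be tried.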

Source: a similar question on Stack Overflow about java.util.Scanner reading files with different character encodings: https://stackoverflow.com/questions/53171661/
