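I am reading an .xlsx file (60k+ rows) with Java and Apache POI, but I get an error.
I use the latest versions of poi and xmlbeans as Maven dependencies.
According to related questions I found on Stack Overflow, the latest POI should be able to handle files that contain special characters.
If it were an XML file, I could replace the special characters in my own program. But it is an Excel file.
The difficulty is that I do not know how to read this Excel file successfully with POI.
Or is there some other way to process the file?
I use OpenJDK, version "1.8.0_171-1-redhat".
The error message is: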
Caused by: java.io.IOException: unable to parse shared strings table
at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:134)
at org.apache.poi.xssf.model.SharedStringsTable.<init>(SharedStringsTable.java:111)
... 11 more
Caused by: org.apache.xmlbeans.XmlException: error: Character reference "�" is an invalid XML character.
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3440)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1272)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1259)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
at org.openxmlformats.schemas.spreadsheetml.x2006.main.SstDocument$Factory.parse(Unknown Source)
at org.apache.poi.xssf.model.SharedStringsTable.readFrom(SharedStringsTable.java:123)
Code:
import java.io.File;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class test2 {
    public static void main(String[] args) throws Exception {
        File file = new File("D:\\Users\\3389\\Desktop\\Review\\drive-download-20181112T012605Z-001\\ticket.xlsx");
        // The error occurs on the next line, while POI parses the shared strings table.
        XSSFWorkbook xssfWorkbook = new XSSFWorkbook(file);
        Workbook workbook = new SXSSFWorkbook(xssfWorkbook);
        Sheet sheet = xssfWorkbook.getSheetAt(0);
        System.out.println("the first row:" + sheet.getFirstRowNum());
    }
}
pom.xml
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.0.0</version>
</dependency>
UTF-16 surrogate pairs in sharedStrings.xml (a few examples):
������
��
��������������
etc....
Best answer
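Since your question title contains the question "is there a way to preprocess the Excel file?", I will try to answer that.
Assumption: /xl/sharedStrings.xml in the *.xlsx file contains XML numeric character references for UTF-16 surrogate pairs, such as &#55357;&#56833; = 😁. Lenient HTML tooling may emit or tolerate this form, but it is not allowed in Office Open XML: the parts are UTF-8 encoded XML, a numeric character reference must denote a legal XML character, and surrogate code points (0xD800 to 0xDFFF) are not legal XML characters.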
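So if /xl/sharedStrings.xml in the *.xlsx file contains numeric character references for UTF-16 surrogate pairs, the file is corrupt and should not be used as it is. The problem should be fixed by whoever creates that *.xlsx file.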
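To see why the parser gives up, the following minimal sketch feeds two tiny documents to the XML parser that ships with the JDK. It is only an illustration: the class name SurrogateReferenceCheck and the element name t are made up for this example, and POI itself goes through XMLBeans rather than through this exact code path. The first document uses the surrogate-pair form and should be rejected; the second uses the single code point reference and should be accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of POI or of the original answer.
public class SurrogateReferenceCheck {

    /** Returns true if the JDK's XML parser accepts the given document text. */
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("rejected: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Two references, one per UTF-16 surrogate: not well-formed XML.
        System.out.println(parses("<t>&#55357;&#56833;</t>"));
        // One reference for the code point U+1F601: well-formed XML.
        System.out.println(parses("<t>&#128513;</t>"));
    }
}
```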
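If the file needs to be repaired anyway, this can only be done at string level, because the XML cannot be parsed while it contains the surrogate-pair references. So: get /xl/sharedStrings.xml from the *.xlsx file, read its content as a string, and replace every UTF-16 surrogate-pair numeric character reference with the reference for the Unicode code point it encodes.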
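Before repairing anything, it can help to look at the raw part. An .xlsx file is a ZIP archive, so the sketch below reads xl/sharedStrings.xml as a string with java.util.zip only. This is a different route from the OPCPackage API used in this answer, and the class and method names (SharedStringsPeek, readEntry) are hypothetical. It assumes the part is UTF-8 encoded and avoids APIs newer than Java 8.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical inspection helper; it only reads the archive and never modifies it.
public class SharedStringsPeek {

    /** Returns the named ZIP entry decoded as UTF-8, or null if the entry is missing. */
    static String readEntry(File zipFile, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipFile)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    bytes.write(buffer, 0, n);
                }
                return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java SharedStringsPeek <file.xlsx>");
            return;
        }
        // ZIP entry names inside an .xlsx have no leading slash.
        String xml = readEntry(new File(args[0]), "xl/sharedStrings.xml");
        System.out.println(xml == null ? "no shared strings part" : xml.length() + " characters read");
    }
}
```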
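My code shows how to do this with java.util.regex.Matcher. It searches for text matching the pattern &#(\\d{5});&#(\\d{5});. For each match it takes the high part H and the low part L of the candidate pair as integers and checks whether they really form a surrogate pair (H must be between 0xD800 and 0xDBFF, L must be between 0xDC00 and 0xDFFF). If so, it computes N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000 and replaces the surrogate-pair reference with the single numeric character reference &#N; for that Unicode code point. Afterwards it replaces any remaining single halves of surrogate pairs with the empty string: a lone surrogate is not allowed either, so it is removed.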
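As a worked example of the formula, the sketch below combines H = 55357 and L = 56833, the pair for 😁, and cross-checks the result against the JDK's own Character.toCodePoint. The class name SurrogateMath and the method name combine are made up for this illustration.

```java
// Hypothetical illustration of the surrogate-pair formula; not part of the original answer.
public class SurrogateMath {

    /** Combines a high and a low surrogate value into the code point they encode. */
    static int combine(int h, int l) {
        if (h < 0xD800 || h > 0xDBFF || l < 0xDC00 || l > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair: " + h + ", " + l);
        }
        return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        int n = combine(55357, 56833);
        // (55357 - 55296) * 1024 + (56833 - 56320) + 65536 = 62464 + 513 + 65536
        System.out.println(n); // prints 128513, which is 0x1F601
        // The JDK computes the same value from the two UTF-16 code units.
        System.out.println(n == Character.toCodePoint((char) 55357, (char) 56833)); // prints true
    }
}
```

So &#55357;&#56833; becomes &#128513;.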
import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class XSSFWrongXMLinSharedStrings {

    static String replaceUTF16SurrogatePairs(String string) {
        // First pass: replace each valid high/low pair with one code point reference.
        Pattern pattern = Pattern.compile("&#(\\d{5});&#(\\d{5});");
        Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            String found = matcher.group();
            int h = Integer.parseInt(matcher.group(1));
            int l = Integer.parseInt(matcher.group(2));
            // Both upper bounds are inclusive: 0xDBFF and 0xDFFF are surrogates too.
            if (0xD800 <= h && h <= 0xDBFF && 0xDC00 <= l && l <= 0xDFFF) {
                int n = (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
                System.out.print(found + " will be replaced with ");
                System.out.println("&#" + n + ";");
                string = string.replace(found, "&#" + n + ";");
            }
        }
        // Second pass: remove surrogate references that are left without a partner.
        pattern = Pattern.compile("&#(\\d{5});");
        matcher = pattern.matcher(string);
        while (matcher.find()) {
            String found = matcher.group();
            int n = Integer.parseInt(matcher.group(1));
            if (0xD800 <= n && n <= 0xDFFF) {
                System.out.println(found + " is a single part of a surrogate pair. It will be removed.");
                string = string.replace(found, "");
            }
        }
        return string;
    }

    public static void main(String[] args) throws Exception {
        File file = new File("ticket.xlsx");

        // Repair /xl/sharedStrings.xml at string level. Parsing the XML is not possible
        // because of the UTF-16 surrogate-pair numeric character references.
        OPCPackage opcPackage = OPCPackage.open(file);
        PackagePart packagePart = opcPackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);

        // Read the whole part into memory as UTF-8 text.
        ByteArrayOutputStream sharedStringsBytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int length;
        InputStream inputStream = packagePart.getInputStream();
        while ((length = inputStream.read(buffer)) != -1) {
            sharedStringsBytes.write(buffer, 0, length);
        }
        inputStream.close();
        String sharedStrings = sharedStringsBytes.toString("UTF-8");

        // Replace each surrogate-pair reference with its code point reference,
        // for example "&#55357;&#56833;" with "&#128513;".
        sharedStrings = replaceUTF16SurrogatePairs(sharedStrings);

        // Write the repaired text back into the part; closing the package saves the file.
        OutputStream outputStream = packagePart.getOutputStream();
        outputStream.write(sharedStrings.getBytes("UTF-8"));
        outputStream.flush();
        outputStream.close();
        opcPackage.close();

        // Now /xl/sharedStrings.xml no longer contains surrogate-pair references.
        Workbook workbook = new XSSFWorkbook(file);
        Sheet sheet = workbook.getSheetAt(0);
        System.out.println("Success.");
        workbook.close();
    }
}
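The two-pass replacement can miss a valid pair that directly follows a stray surrogate, because the pair pattern then consumes the stray half together with the high half. A possible single-pass variant is sketched below. It is an alternative to the method above, not the same code, and it makes the same assumption that the references are five-digit decimal ones. The class name SinglePassSurrogateRepair and the method name repair are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-pass variant; assumes five-digit decimal references as above.
public class SinglePassSurrogateRepair {

    private static final Pattern REF = Pattern.compile("&#(\\d{5});");

    static boolean isHigh(int v) { return 0xD800 <= v && v <= 0xDBFF; }
    static boolean isLow(int v)  { return 0xDC00 <= v && v <= 0xDFFF; }

    static String repair(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        Matcher m = REF.matcher(xml);
        int pos = 0; // everything before pos has already been handled
        while (m.find(pos)) {
            out.append(xml, pos, m.start()); // copy the text between references
            int value = Integer.parseInt(m.group(1));
            pos = m.end();
            if (isHigh(value)) {
                // A low surrogate reference must follow immediately to form a pair.
                Matcher next = REF.matcher(xml).region(pos, xml.length());
                if (next.lookingAt() && isLow(Integer.parseInt(next.group(1)))) {
                    int low = Integer.parseInt(next.group(1));
                    int codePoint = (value - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
                    out.append("&#").append(codePoint).append(';');
                    pos = next.end();
                }
                // Otherwise the lone high surrogate is dropped.
            } else if (!isLow(value)) {
                out.append(m.group()); // ordinary reference: keep it unchanged
            }
            // A lone low surrogate is dropped.
        }
        out.append(xml, pos, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("a&#55357;&#56833;b"));        // prints a&#128513;b
        System.out.println(repair("&#56833;&#55357;&#56833;"));  // prints &#128513;
    }
}
```

The returned string can be written back to the package part in the same way as in the main method above.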
Source: the Stack Overflow question "java - APACHE POI EXCEL XmlException: is an invalid XML character, is there a way to preprocess the Excel file?": https://stackoverflow.com/questions/53357802/