XSLT Unicode Horror
May 18th, 2010 — 10:50pm
Different Java XSLT implementations handle UTF-8 characters differently. Here is test code that parses UTF-8 XML into a DOM document and then serializes it using a transformer.

    System.out.println(" SOURCE: " + source);
    // Parse the XML string into a DOM document with the parser under test
    DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance(parserClass, TestUnicode.class.getClassLoader());
    Document document = builderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(source)));
    // Serialize the DOM back to a string with the transformer under test
    TransformerFactory transformerFactory = TransformerFactory.newInstance(transformerClass, TestUnicode.class.getClassLoader());
    StringWriter writer = new StringWriter();
    transformerFactory.newTransformer().transform(new DOMSource(document), new StreamResult(writer));
    System.out.println(" RESULT: " + writer.toString());

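The fragment above depends on a surrounding `TestUnicode` class and on `parserClass` and `transformerClass` values that are not shown. As a rough, self-contained sketch for reproducing the experiment, the version below falls back to whatever parser and transformer the no-argument `newInstance()` calls pick up. That fallback, the class name `RoundTripSketch`, the helper names, and the `<test>` wrapper element are all assumptions made for illustration, not part of the original harness.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the original TestUnicode class.
class RoundTripSketch {

    /** Parses an XML string into a DOM document using the default parser. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Parses the XML and serializes it again with an identity transformer. */
    static String roundTrip(String source) throws Exception {
        StringWriter writer = new StringWriter();
        TransformerFactory.newInstance()
                .newTransformer()
                .transform(new DOMSource(parse(source)), new StreamResult(writer));
        return writer.toString();
    }

    /** Text content of the root element, handy for comparing input and output. */
    static String rootText(String xml) throws Exception {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // U+1D49F, built from its code point so the source file stays ASCII.
        String scriptD = new String(Character.toChars(0x1D49F));
        // The <test> element is an assumption; the original markup is not shown.
        String source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><test>" + scriptD + "</test>";
        System.out.println(" SOURCE: " + source);
        System.out.println(" RESULT: " + roundTrip(source));
    }
}
```

To compare implementations as the post does, the no-argument `newInstance()` calls would be swapped for the two-argument form that takes an explicit factory class name and class loader.").equals(d)) throw new AssertionError("parser altered the supplementary character");
if (!RoundTripSketch.rootText(RoundTripSketch.roundTrip("<test>abc</test>")).equals("abc")) throw new AssertionError("ASCII round trip failed");
</test>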
I tested the following transformers:
- Xalan 2.7.1:
  - org.apache.xalan.processor.TransformerFactoryImpl
  - org.apache.xalan.xsltc.trax.TransformerFactoryImpl
  - org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
- Sun-Xalan (an internal transformer factory present in Sun JDK 5 and 6):
  - com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
- Saxon 8.7:
  - net.sf.saxon.TransformerFactoryImpl
Here are the results for the Mathematical Script Capital D character (U+1D49F): 𝒟

    TRANSFORMER: org.apache.xalan.processor.TransformerFactoryImpl
    SOURCE: <?xml version="1.0" encoding="UTF-8"?> 𝒟
    RESULT: <?xml version="1.0" encoding="UTF-8"?> ��
    TRANSFORMER: org.apache.xalan.xsltc.trax.TransformerFactoryImpl
    SOURCE: <?xml version="1.0" encoding="UTF-8"?> 𝒟
    RESULT: <?xml version="1.0" encoding="UTF-8"?> ��
    TRANSFORMER: org.apache.xalan.xsltc.trax.SmartTransformerFactoryImpl
    SOURCE: <?xml version="1.0" encoding="UTF-8"?> 𝒟
    RESULT: <?xml version="1.0" encoding="UTF-8"?> ��
    TRANSFORMER: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl
    SOURCE: <?xml version="1.0" encoding="UTF-8"?> 𝒟
    RESULT: <?xml version="1.0" encoding="UTF-8" standalone="no"?> 𝒟
    TRANSFORMER: net.sf.saxon.TransformerFactoryImpl
    SOURCE: <?xml version="1.0" encoding="UTF-8"?> 𝒟
    RESULT: <?xml version="1.0" encoding="UTF-8"?> 𝒟

Or, summarized in a table:
Transformer | 𝒟 | 𝒟 |
---|---|---|
Xalan 2.7.1 | �� | �� |
Sun-Xalan (Sun JDK 1.5+) | 𝒟 | 𝒟 |
Saxon 8.7 | 𝒟 | 𝒟 |
The results were the same regardless of the parser implementation, Xerces or Saxon.
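It may help to see what makes this particular character a demanding test case. U+1D49F lies outside the Basic Multilingual Plane, so Java strings hold it as a pair of UTF-16 surrogates and UTF-8 encodes it in four bytes. The sketch below prints both forms. The class and method names are made up for illustration, and whether surrogate handling is really what goes wrong inside Xalan is an assumption that the test output alone does not prove.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original test code.
class ScriptCapitalD {

    /** MATHEMATICAL SCRIPT CAPITAL D, built from its code point. */
    static final String D = new String(Character.toChars(0x1D49F));

    /** Space-separated hex of the string's UTF-8 bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    /** Space-separated hex of the string's UTF-16 code units. */
    static String utf16Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("code points:  " + D.codePointCount(0, D.length())); // 1
        System.out.println("UTF-16 units: " + utf16Hex(D));                     // D835 DC9F
        System.out.println("UTF-8 bytes:  " + utf8Hex(D));                      // F0 9D 92 9F
    }
}
```

A serializer that treats each UTF-16 code unit as a character of its own would emit two separate items for this one code point, which would be consistent with the two-symbol output shown for Xalan.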
Xalan’s handling of UTF-8 multi-byte characters seems to be seriously flawed. �� are not valid UTF-8 characters, and both the Xerces and Saxon parsers throw a SAXParseException when trying to parse documents that contain them.
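The parse failure can be checked with a small helper that feeds a serialized document back into a parser. The sketch below assumes the broken output consists of the two surrogate halves written as separate numeric character references (`&#55349;&#56479;`), which is only a guess at what the two symbols above stand for. The class name, the method name, and the `<test>` element are likewise hypothetical, and the default JAXP parser is used rather than a specific Xerces or Saxon build.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, not part of the original test code.
class ReparseCheck {

    /** Returns true if the XML parses, false if the parser rejects it. */
    static boolean reparses(String xml) throws Exception {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // Not well-formed; the default error handler may also log to stderr.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // One reference to the full code point is legal XML 1.0.
        System.out.println(reparses("<test>&#x1D49F;</test>"));
        // Assumed shape of the broken output: each surrogate half referenced
        // on its own. Surrogate code points are not legal XML characters.
        System.out.println(reparses("<test>&#55349;&#56479;</test>"));
    }
}
```

Running a transformer's output through a check like this is a cheap way to catch serializers that produce documents no parser will accept.")) throw new AssertionError("single code point reference should parse");
if (!ReparseCheck.reparses("<test>" + new String(Character.toChars(0x1D49F)) + "</test>")) throw new AssertionError("literal character should parse");
if (ReparseCheck.reparses("<test>&#55349;&#56479;</test>")) throw new AssertionError("surrogate references should be rejected");
</test>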