Sanitize XML String
The following table lists the ranges of valid XML 1.0 characters. Any character outside these ranges is not allowed.
Hexadecimal | Decimal |
#x9 | #9 |
#xA | #10 |
#xD | #13 |
#x20-#xD7FF | #32-#55295 |
#xE000-#xFFFD | #57344-#65533 |
#x10000-#x10FFFF | #65536-#1114111 |
ref : http://www.w3.org/TR/REC-xml/#charsets.
CDATA sections are not an exception to this rule. A CDATA section lets markup characters such as < and & appear unescaped, but its content must still consist of characters in the above range.
For example, if the data comes from a cut-and-paste operation on a Microsoft Word document, you may end up with 0x1A characters. Later, when the XML data is parsed, an exception such as "hexadecimal value 0x1A, is an invalid character" is thrown.
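The failure described above can be reproduced with a minimal sketch like the one below. It assumes the standard JAXP parser bundled with the JDK. The class name InvalidCharDemo and the helper parses are hypothetical names chosen for illustration, and the exact wording of the error message depends on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original article.
final class InvalidCharDemo {

    /** Returns true if the JAXP parser accepts the document, false if it rejects it. */
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports an invalid character as a fatal parse error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String clean = "<note>hello</note>";
        // Same document with a 0x1A character in the text content.
        String dirty = "<note>hel" + (char) 0x1A + "lo</note>";

        System.out.println("clean parses: " + parses(clean));
        System.out.println("dirty parses: " + parses(dirty));
    }
}
```

The default error handler may also print the parser's own diagnostic to standard error before the exception is caught.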
The following methods remove all invalid XML characters from a given string. CDATA sections receive no special handling.
Using Regex
import java.util.regex.Pattern;

public static String sanitizeXmlChars(String xml) {
    if (xml == null || ("".equals(xml))) return "";
    // ref : http://www.w3.org/TR/REC-xml/#charsets
    // requires JDK 7 or later for the \x{...} code point syntax
    Pattern xmlInvalidChars = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]"
    );
    return xmlInvalidChars.matcher(xml).replaceAll("");
}
Using StringBuilder and for-loop
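public static String sanitizeXmlChars(String in) {
    if (in == null || ("".equals(in))) return "";
    StringBuilder out = new StringBuilder(in.length());
    // Iterate by code point, not by char: a char never exceeds 0xFFFF, so a
    // char-based test would drop every supplementary character (#x10000 and
    // above), which Java stores as a surrogate pair.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
        // Advance by 1 or 2 chars depending on the code point.
        i += Character.charCount(current);
    }
    return out.toString();
}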
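To see the effect on a typical input, a small self-contained sketch follows. It reuses the character class from the regex method above, compiled once as a constant. The class name SanitizeDemo and the sample string are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original article.
final class SanitizeDemo {

    // Same character class as the regex method, compiled once and reused.
    private static final Pattern XML_INVALID_CHARS = Pattern.compile(
        "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    static String sanitizeXmlChars(String xml) {
        if (xml == null || xml.isEmpty()) return "";
        return XML_INVALID_CHARS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 lies in the #x10000-#x10FFFF range and is stored as two chars.
        String supplementary = new String(Character.toChars(0x1F600));
        // Sample text with a 0x1A control character, as might come from a paste.
        String pasted = "caf\u00E9" + (char) 0x1A + " " + supplementary;

        String clean = sanitizeXmlChars(pasted);
        System.out.println("length before: " + pasted.length());
        System.out.println("length after:  " + clean.length());
        System.out.println("supplementary character kept: " + clean.endsWith(supplementary));
    }
}
```

Only the 0x1A character should be removed here. The accented letter and the supplementary character are both valid XML characters and pass through unchanged.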