Unaccent letters
Tag(s): Internationalization String/Number
The following snippets remove accented letters from a String and replace them with their plain ASCII equivalents.
This can be useful before inserting data into a database, to make sorting easier.
Using java.text.Normalizer
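It's simple with the java.text.Normalizer class. We call normalize() with the NFD form, which decomposes each accented letter into its base letter followed by a combining mark: if we pass à, the method returns a + ` (U+0300, the combining grave accent). Then, using a regular expression, we remove the combining marks so that only the base letters remain. Characters that have no decomposition, such as ø or ß, are left unchanged.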
    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public class StringUtils {
        // matches one or more combining marks left behind by the decomposition
        private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

        private StringUtils() {}

        public static String unAccent(String s) {
            //
            // JDK1.5
            //   use sun.text.Normalizer.normalize(s, Normalizer.DECOMP, 0);
            //
            String temp = Normalizer.normalize(s, Normalizer.Form.NFD);
            return DIACRITICS.matcher(temp).replaceAll("");
        }

        public static void main(String args[]) throws Exception {
            String value = "é à î _ @";
            System.out.println(StringUtils.unAccent(value));
            // output : e a i _ @
        }
    }
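To see what normalize() actually produces, here is a minimal sketch that prints the UTF-16 values of a string after NFD decomposition. The class NormalizerDemo and its helpers toHex and decompose are hypothetical names introduced for this illustration only. It also shows a limit of the technique: ø has no canonical decomposition, so it passes through unchanged.

```java
import java.text.Normalizer;

final class NormalizerDemo {
    private NormalizerDemo() {}

    // Format each UTF-16 char of s as U+XXXX, separated by spaces.
    public static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Canonical decomposition: base letter followed by its combining marks.
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // à becomes 'a' followed by the combining grave accent
        System.out.println(toHex(decompose("\u00E0"))); // U+0061 U+0300
        // ø has no decomposition and is returned as-is
        System.out.println(toHex(decompose("\u00F8"))); // U+00F8
    }
}
```

If letters such as ø must also be converted, they need an explicit mapping, for example with one of the table-based techniques.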
Using String.replaceAll()
As an alternative, replaceAll() and regular expressions on a String can also be used:

    public class Test {
        public static void main(String args[]) {
            String s = "È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô";

            // lowercase letters
            s = s.replaceAll("[èéêë]","e");
            s = s.replaceAll("[ûù]","u");
            s = s.replaceAll("[ïî]","i");
            s = s.replaceAll("[àâ]","a");
            s = s.replaceAll("ô","o");

            // uppercase letters
            s = s.replaceAll("[ÈÉÊË]","E");
            s = s.replaceAll("[ÛÙ]","U");
            s = s.replaceAll("[ÏÎ]","I");
            s = s.replaceAll("[ÀÂ]","A");
            s = s.replaceAll("Ô","O");

            System.out.println(s);
            // output : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o
        }
    }
Using String.indexOf()
While the two techniques above work, they are a little slow. The following technique is faster because it uses one String containing all the characters to convert and a second String with their ASCII equivalents at the same positions. We find the position of a character in the first String, then look up that position in the second String.
    public class AsciiUtils {
        private static final String PLAIN_ASCII =
              "AaEeIiOoUu"    // grave
            + "AaEeIiOoUuYy"  // acute
            + "AaEeIiOoUuYy"  // circumflex
            + "AaOoNn"        // tilde
            + "AaEeIiOoUuYy"  // umlaut
            + "Aa"            // ring
            + "Cc"            // cedilla
            + "OoUu"          // double acute
            ;

        private static final String UNICODE =
              "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
            + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
            + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
            + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
            + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
            + "\u00C5\u00E5"
            + "\u00C7\u00E7"
            + "\u0150\u0151\u0170\u0171"
            ;

        // private constructor, can't be instantiated!
        private AsciiUtils() { }

        // replace accented characters in a string with their ASCII equivalents
        public static String convertNonAscii(String s) {
            if (s == null) return null;
            StringBuilder sb = new StringBuilder();
            int n = s.length();
            for (int i = 0; i < n; i++) {
                char c = s.charAt(i);
                // same position in both strings: accented char -> plain char
                int pos = UNICODE.indexOf(c);
                if (pos > -1) {
                    sb.append(PLAIN_ASCII.charAt(pos));
                }
                else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String args[]) {
            String s =
                "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
            System.out.println(AsciiUtils.convertNonAscii(s));
            // output :
            // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
        }
    }
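The lookup only works if both strings have the same length and the same order. A missing or misplaced character shifts every mapping after it. As a minimal sketch of a sanity check, the hypothetical class LookupTableCheck below compares each table entry with what java.text.Normalizer produces. The shortened tables ACCENTED and PLAIN are illustrative assumptions, not the full tables from AsciiUtils.

```java
import java.text.Normalizer;

final class LookupTableCheck {
    // Shortened, illustrative tables (hypothetical, for this sketch only).
    static final String ACCENTED =
        "\u00C0\u00E0\u00C9\u00E9\u00D1\u00F1\u00C7\u00E7"; // À à É é Ñ ñ Ç ç
    static final String PLAIN = "AaEeNnCc";

    private LookupTableCheck() {}

    // Returns true when both strings have the same length and each accented
    // character, once decomposed (NFD) and stripped of its combining marks,
    // equals the plain character at the same position.
    public static boolean isConsistent(String accented, String plain) {
        if (accented.length() != plain.length()) {
            return false;
        }
        for (int i = 0; i < accented.length(); i++) {
            String nfd = Normalizer.normalize(
                String.valueOf(accented.charAt(i)), Normalizer.Form.NFD);
            String base =
                nfd.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
            if (!base.equals(String.valueOf(plain.charAt(i)))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(ACCENTED, PLAIN));      // true
        System.out.println(isConsistent(ACCENTED, "AaEeNnC"));  // false (lengths differ)
        System.out.println(isConsistent("\u00C0\u00E0", "aA")); // false (order swapped)
    }
}
```

A check like this could run once in a unit test, so the per-character lookup itself stays as cheap as before.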
Thanks to L.. Tama for the missing Ñ !
Thanks to T. Hague for the missing "double acute".
As a bonus, here is a method that converts a given string to uppercase without accents. This can be useful for a database field, to simplify searching for names with or without accents.
    public class StringUtils {
        private StringUtils() {}

        private static final String UPPERCASE_ASCII =
              "AEIOU"   // grave
            + "AEIOUY"  // acute
            + "AEIOUY"  // circumflex
            + "AON"     // tilde
            + "AEIOUY"  // umlaut
            + "A"       // ring
            + "C"       // cedilla
            + "OU"      // double acute
            ;

        private static final String UPPERCASE_UNICODE =
              "\u00C0\u00C8\u00CC\u00D2\u00D9"
            + "\u00C1\u00C9\u00CD\u00D3\u00DA\u00DD"
            + "\u00C2\u00CA\u00CE\u00D4\u00DB\u0176"
            + "\u00C3\u00D5\u00D1"
            + "\u00C4\u00CB\u00CF\u00D6\u00DC\u0178"
            + "\u00C5"
            + "\u00C7"
            + "\u0150\u0170"
            ;

        public static String toUpperCaseSansAccent(String txt) {
            if (txt == null) {
                return null;
            }
            // uppercase first, so only uppercase accented letters need a mapping
            String txtUpper = txt.toUpperCase();
            StringBuilder sb = new StringBuilder();
            int n = txtUpper.length();
            for (int i = 0; i < n; i++) {
                char c = txtUpper.charAt(i);
                int pos = UPPERCASE_UNICODE.indexOf(c);
                if (pos > -1) {
                    sb.append(UPPERCASE_ASCII.charAt(pos));
                }
                else {
                    sb.append(c);
                }
            }
            return sb.toString();
        }

        public static void main(String args[]) throws Exception {
            String s =
                "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
            System.out.println(
                StringUtils.toUpperCaseSansAccent(s));
            // output :
            // THE RESULT : E,E,E,E,U,U,I,I,A,A,O,E,E,E,E,U,U,I,I,A,A,O,C
        }
    }
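The name-searching use case comes down to comparing uppercase, accent-free keys rather than the raw strings. The following is a minimal sketch of that idea. It assumes a java.text.Normalizer-based key instead of the lookup tables above, and the names SearchKeyDemo and searchKey are hypothetical. Locale.ROOT is passed to toUpperCase() so the result does not depend on the default locale of the JVM.

```java
import java.text.Normalizer;
import java.util.Locale;

final class SearchKeyDemo {
    private SearchKeyDemo() {}

    // Build an uppercase, accent-free key suitable for comparisons.
    public static String searchKey(String s) {
        if (s == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped =
            decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        return stripped.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String stored = "H\u00E9l\u00E8ne C\u00F4t\u00E9"; // Hélène Côté
        String typed = "helene cote";

        System.out.println(searchKey(stored)); // HELENE COTE
        // the two spellings match once both are reduced to their key
        System.out.println(searchKey(stored).equals(searchKey(typed))); // true
    }
}
```

In a database, such a key would typically be stored in its own column next to the original value, so the original spelling is kept for display.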