Remove HTML tags from a file to extract only the TEXT
Using regular expression
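A regular expression strips out anything between a < and a >. This is quick, but it only removes the tags: entities such as &amp; are left undecoded, and the content of script and style elements stays in the output.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class Html2TextWithRegExp {
        private Html2TextWithRegExp() {}

        public static void main(String[] args) throws Exception {
            StringBuilder sb = new StringBuilder();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // keep the line break so words on adjacent lines do not run together
                    sb.append(line).append(System.lineSeparator());
                }
            }
            // [^>]* also matches line breaks, so tags spanning several lines are removed
            String nohtml = sb.toString().replaceAll("<[^>]*>", "");
            System.out.println(nohtml);
        }
    }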
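The regular expression approach can be pushed a little further. The sketch below is only an illustration, not part of the original HowTo: the class name RegexStripSketch and the method stripTags are hypothetical, and the list of decoded entities is deliberately tiny. It drops script and style blocks before removing the tags, then decodes a few common entities and collapses whitespace.

```java
import java.util.regex.Pattern;

public class RegexStripSketch {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // \1 refers back to whichever element name (script or style) was opened.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: a best-effort text extraction, not a real HTML parser.
    public static String stripTags(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Decode a handful of common entities; &amp; goes last so that
        // "&amp;lt;" becomes "&lt;" and is not decoded a second time.
        s = s.replace("&nbsp;", " ")
             .replace("&lt;", "<")
             .replace("&gt;", ">")
             .replace("&quot;", "\"")
             .replace("&amp;", "&");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>\n"
            + "<body><p class=\"x\">Tom &amp; Jerry</p>"
            + "<script>var a = 1 < 2;</script><p>Done</p></body></html>";
        System.out.println(stripTags(html)); // prints "Tom & Jerry Done"
    }
}
```

Even with these additions, comments, CDATA sections and a > inside an attribute value can still confuse the pattern, which is why the parser-based solutions are usually safer.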
Using javax.swing.text.html.HTMLEditorKit
In most cases, the HTMLEditorKit is used with a JEditorPane text component, but it can also be used directly to extract text from an HTML page.

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;
    import javax.swing.text.html.HTMLEditorKit.ParserCallback;
    import javax.swing.text.html.parser.ParserDelegator;

    public class HTMLUtils {
        private HTMLUtils() {}

        public static List<String> extractText(Reader reader) throws IOException {
            final List<String> list = new ArrayList<String>();
            ParserDelegator parserDelegator = new ParserDelegator();
            // ParserCallback already provides empty implementations for tags,
            // comments and errors, so only handleText needs to be overridden
            ParserCallback parserCallback = new ParserCallback() {
                @Override
                public void handleText(final char[] data, final int pos) {
                    list.add(new String(data));
                }
            };
            // true: ignore any charset declared in the document
            parserDelegator.parse(reader, parserCallback, true);
            return list;
        }

        public static void main(String[] args) throws Exception {
            // try-with-resources closes the file when parsing is done
            try (FileReader reader = new FileReader("java-new.html")) {
                List<String> lines = HTMLUtils.extractText(reader);
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
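To try the callback approach without a file on disk, the parser can be fed from a StringReader. This is a minimal sketch under assumptions: the class name SwingParserSketch and its extractText(String) method are hypothetical, and the way the Swing parser splits the text into chunks is not relied upon, which is why the result is simply joined with spaces and normalized.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingParserSketch {
    // Hypothetical helper: returns the text of an in-memory HTML string.
    public static String extractText(String html) {
        final StringBuilder sb = new StringBuilder();
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // each call delivers one run of text; separate runs with a space
                sb.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // a StringReader should not fail, but parse() declares IOException
            throw new UncheckedIOException(e);
        }
        // collapse runs of whitespace left by the joining above
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
            + "<p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Returning a single String instead of a List is only a design choice for this sketch; the List version keeps the chunk boundaries, which can be useful when each text run has to be processed separately.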
Using an HTML parser
This is probably the best solution, provided the chosen parser is good. Many parsers are available on the net. In this HowTo, I will use the open source package Jsoup.
Jsoup is entirely self-contained and has no dependencies, which is a good thing.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import org.jsoup.Jsoup;

    public class HTMLUtils {
        private HTMLUtils() {}

        public static String extractText(Reader reader) throws IOException {
            StringBuilder sb = new StringBuilder();
            BufferedReader br = new BufferedReader(reader);
            String line;
            while ((line = br.readLine()) != null) {
                // keep a separator so words on adjacent lines do not run together
                sb.append(line).append('\n');
            }
            // text() returns the combined text of the document, without the markup
            String textOnly = Jsoup.parse(sb.toString()).text();
            return textOnly;
        }

        public static void main(String[] args) throws Exception {
            // try-with-resources closes the file when extraction is done
            try (FileReader reader = new FileReader("C:/RealHowTo/topics/java-language.html")) {
                System.out.println(HTMLUtils.extractText(reader));
            }
        }
    }
Using Apache Tika
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    public class ParseHTMLWithTika {
        public static void main(String args[]) throws Exception {
            // try-with-resources closes the stream, replacing the finally block
            try (InputStream is = new FileInputStream("C:/Temp/java-x.html")) {
                // the no-arg BodyContentHandler stops after 100,000 characters;
                // new BodyContentHandler(-1) removes that limit
                ContentHandler contenthandler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                // AutoDetectParser picks the parser from the detected content type
                Parser parser = new AutoDetectParser();
                parser.parse(is, contenthandler, metadata, new ParseContext());
                System.out.println(contenthandler.toString());
            }
            catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
See also Extract links from an HTML page and Remove XML tags from a string to keep only text