Real's How-to

Optimize Collection usageTag(s): Language

From Usenet, a post (by "Ed") to demonstrate that depending on your need, choosing the right Collection implementation can be really important.

In the demonstration, we read a huge text file war-and-peace.txt and then count the duplicated words. as you can see depending on the Collection used, the execution time is very different!

import java.util.*;
import java.io.*;

class TestCollectionPerformance {
    private static String TEXT_BOOK_NAME = "war-and-peace.txt";

    public static void main(String[] args) {
      try {
        String text = readText();   // Read text into RAM
        countDuplicateWords(text, new HashSet<String>());
        countDuplicateWords(text, new TreeSet<String>());
        countDuplicateWords(text, new ArrayList<String>());
        countDuplicateWords(text, new LinkedList<String>());
      }
      catch (Throwable t) {
        System.out.println(t.toString());
      }
    }

    private static String readText() throws Throwable {
      BufferedReader reader =
        new BufferedReader(new FileReader(TEXT_BOOK_NAME));
      String line = null;
      StringBuffer text = new StringBuffer();
      while ((line = reader.readLine()) != null) {
        text.append(line + " ");
      }
      return text.toString();
    }

    private static void countDuplicateWords(String text,
                        Collection<String> listOfWords) {
      int numDuplicatedWords = 0;
      long startTime = System.currentTimeMillis();
      for (StringTokenizer i = new StringTokenizer(text);
         i.hasMoreElements();) {
           String word = i.nextToken();
           if (listOfWords.contains(word)) {
             numDuplicatedWords++;
           }
           else {
             listOfWords.add(word);
           }
      }
      long endTime = System.currentTimeMillis();
      System.out.println(numDuplicatedWords + " duplicated words. " +
        "Using " + listOfWords.getClass().getName() +
        ", time = " + (endTime - startTime) + "ms.");
    }
}

Result :

522396 duplicated words. Using java.util.HashSet, time = 453ms.
522396 duplicated words. Using java.util.TreeSet, time = 1031ms.
522396 duplicated words. Using java.util.ArrayList, time = 100937ms.
522396 duplicated words. Using java.util.LinkedList, time = 129375ms.

A Set offers a collection of unique elements.

An HashSet maintains its collection in an unordered manner.

A TreeSet keeps the elements in the collection in sorted order.

A List provides ordered access (by index), but it doesn't guarantee uniqueness.

The ArrayList provides a collection backed by an array. It provides quick indexed access to its elements, and works best when elements are only added and removed at the end. To make this happen, ArrayList performs an internal move operation when an element is added or removed.

The LinkedList is best when add and remove operations happen anywhere, not only at the end. LinkList doesn't do an internal move operation for an element insert or remove, it just manipulates reference pointers. But LinkedList's added flexibility comes at an added cost -- it results in much slower indexed operations.

Send comment, question or suggestion to howto@rgagnon.com

Optimize Collection usageTag(s): Language

About cookies on this site