Tag Archives: Data Structures

HashSet<T>

The HashSet<T> class is a bit of a tricky one to get your head around at first. Mainly due to the poor documentation and even poorer examples offered up by the blogging community. The thing that had me initially stumped was the fact that there was no retrieval method. Sure, you could use one of the LINQ-to-Object methods like ElementAt. But I knew it in my waters that such a retrieval was not really in keeping with the HashSet<T> collection’s purpose and reason for being.

HashSet does not implement IList<T>. It does implement ICollection<T>, and thus is a collection. But it is not a list. It has no indexer. I have seen others write that it is a list with no duplicate values.

  • No duplicate values – correct. Sets only contain unique items.
  • A list – not correct.

They probably developed that misconception because it implements IEnumerable<T> and ICollection<T>.

Anyhow, for me, the penny dropped when I was reading the MSDN documentation and struck on this sentence -:

In simple terms, the HashSet<(Of <(T>)>) class can be thought of as a Dictionary<(Of <(TKey, TValue>)>) collection without values.

So, if the are no values, then of course there will be no retrieval method. There’s only keys, and the Contains method will confirm whether or not a key is in the set or not. So, now that we know what the HashSet does and does not implement, and we know a little about its nature, how can we use it?

I will now endeavour to provide a useful example of this collection in action. Many examples that I see involve adding integers to a HashSet instance. I feel that a more useful example will include a collection of objects of a custom class. This example will involve students who are voting on a student guild proposal. A student can vote only once. Students can be uniquely identified by their ID. The student class:

    public class Student
    {
        public string Name { get; set; }
        public int ID { get; set; }
        public bool Vote { get; set; }

        public Student(string name, int id, bool vote)
        {
            Name = name;
            ID = id;
            Vote = vote;
        }
    }

As we are not populating the collection with primitive types, we need to provide it with a way of determining whether 2 student objects are the same. And in this case, I am using value equality, as distinct from reference equality. And I think that will be quite normal when working with this class. The object in the set will probably not occupy the same memory space as the external object being compared. To that end, I have created a simple comparer class (implements IEqualityComparer<T>):

public class StudentComparer : IEqualityComparer<Student>
{
	public bool Equals(Student x, Student y)
	{
		// use the ID member to see if two student objects are the same
		return x.ID.CompareTo(y.ID) == 0;
	}

	public int GetHashCode(Student obj)
	{
		// use the GetHashCode() method of the int struct for the ID member
		return obj.ID.ToString().ToLower().GetHashCode();
	}
}

As you can see, two student objects will be determined to be equal if they have the same ID (value equality).
With comparer in hand, lets see the HashSet do its thing:

        static void Main(string[] args)
        {
            //  students who voted in person
            Student dave = new Student("dave", 302, true);
            Student chad = new Student("chad", 104, false);
            Student barry = new Student("barry", 602,true);
            Student simon = new Student("simon", 432, true);
            Student kate = new Student("kate", 982, false);
            Student[] inPersonVoters = { dave, chad, barry, simon, kate };

            //  add the above voters to a new HashSet and pass in the IEqualityComparer object
            HashSet<Student> voters = new HashSet<Student>(inPersonVoters, new StudentComparer());

            //  students who voted by post
            //  note, melanie is really chad, trying to sneak in a second vote (ID = 104)
            Student clare = new Student("clare", 763, true);
            Student larry = new Student("larry", 783, false);
            Student melanie = new Student("melanie", 104, false);

            //  add the postal voters
            voters.Add(clare);
            voters.Add(larry);
            voters.Add(melanie);

            foreach (var item in voters)
            {
                //  see how the "melanie" object was not added to the collection
                //  and thus, could not vote twice.
                Console.WriteLine(item.Name);
            }

            //  note the count of voters all up is 7. Not 8. Melanie would have been nr.8
            Console.WriteLine(voters.Count);

            Console.ReadKey(); // press key to return the Main method
        }

So, as you can see from the commenting in that code, Chad and Melanie have the same ID number (104). It seems as though Chad is trying to submit a second vote. But the HashSet<T> is onto his caper and does not add Melanie. Whilst this is not an awesome example (in terms of usage in a line-of-business application), I think it sheds more light on this excellent collection than the multitudes of examples showing bare integers being added to a set.

Generic List vs Object List

I was reading through a Microsoft certfication training kit book, when I came upon a discussion about the performance of data structures which use generics, against those that use objects (and boxing). Intrigued, I knocked up a couple of Linked List types for a test.

The implementations of the Linked Lists were identical, bar the fact that one used Generics, whilst the other used Objects. I then used 2 successive, long-running for loops, each performing inserts on the respective Linked Lists, to test the performance of each. The System.Diagnostics.Stopwatch() class came in very handy:


static void Main(string[] args)
{
	ObjectNS.LinkedList objectsList = new ObjectNS.LinkedList();
	GenericNS.LinkedList<int> genericLinkedList = new GenericNS.LinkedList<int>();

	//  first, time the performance of genericLinkedList inserts
	System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
	watch.Start();

	for (int i = 0; i < 300000; i++)
		genericLinkedList.Insert(i, genericLinkedList.Count);

	watch.Stop();
	Console.WriteLine("Performance of Generics - {0}", watch.Elapsed.ToString());

	//  second, time the performance of objectsList inserts
	watch.Reset();
	watch.Start();

	for (int i = 0; i < 300000; i++)
		objectsList.Insert(i, objectsList.Count);

	watch.Stop();
	Console.WriteLine("\nPerformance of Object/Casting - {0}", watch.Elapsed.ToString());

	Console.WriteLine("\n\nPress any key to close window");
	Console.Read();
}

And the result:

Generics vs Objects - Performance Comparison
Generics vs Objects – Performance Comparison

So, it looks like the performance penalty of boxing the integers was enough to give generics a clear victory. And of this, I am glad. Generics are very cool and easy to use. So, finding out that they also perform well – that’s icing on the cake.

You can download the whole code for my test here.