Friday, April 29, 2011

Hashtable doubling?

I don't know if the title makes sense, but I am wondering how a hashtable enlarges when you add items to it?

Is it like the List<T> where it doubles in size when the limit is reached? If so, then does this doubling recreates the collection from scratch (this can be answered for List<T> too, since I am not sure if that's what it does)?

Lastly if it indeed recreates it from scratch, then this particular Add operation would be very expensive to the user who wouldn't know that the limit is reached, right?

From stackoverflow
  • The sizes are not always doubled, but have variable growth depending on the number of items.

    For a list, this is not nearly as expensive as re-creating a string or array for example, since only the pointers need to be copied from one list to another, and this can be done very efficiently.

    for a hashtable/dictionary the items must be redistributed, and that can be very expensive. Best to initialize your hashtable with an estimated size ahead of time.

  • I believe both Hashtable and Dictionary<TKey, TValue> expand to the next prime number after doubling the current count, e.g. 31 to 67.

    As I understand it, a resize doesn't involve recomputing the hashes (as they're stored with the entries) but involves putting each entry into its new bucket, where the bucket number is based on both the hash code and the bucket count.

    You asked about List<T> - there it's really simple. The list is backed by an array, and you just need to create a new array with the right size, and copy the contents of the current array. Something like:

    private void Resize(int newCapacity)
    {
        T[] tmp = new T[newCapacity];
        Array.Copy(backingArray, tmp, backingArray.Length);
        backingArray = tmp;
    }
    
    Jason Coyne : Interesting. I wonder if it has a list of primes, or if it calculates them on the fly. If its calculating, that calc could be more expensive than the copy!
    Jon Skeet : I believe it calculates them on the fly... but if you're going from about 1 million to about 2 million entries (i.e. it's a *big* map) you still only need to check each possible prime against about 1000 potential divisors. You've then got to find the right bucket for a million entries!
    Lucero : I guess this is a classic performance-vs-space question... because it adds 4 bates of storage per object stored. Maybe I should check out the source...
    Mehrdad Afshari : In a Hashtable (not a Dictionary) which uses rehashing as collision resolution, this "putting in a new bucket" can be costly.
    Jon Skeet : What does? Not recalculating the hashcode? If you don't store it, you're going to have to recalculate it every time you get a *potential* match when doing a lookup. A "bucket position" collision is much more likely than a real hash collision.
    Jon Skeet : (My previous comment was addressed to Lucero)
    Jon Skeet : @Mehrdad: Could you explain that comment a bit more in your answer?
    Lucero : @Jon: Seems you're right, at least in the rotor source code... http://www.123aspx.com/rotor/RotorSrc.aspx?rot=40806
    Mehrdad Afshari : @Jon Skeet: Look at the expand() function in the link provided, the `for` look which calls `putEntry` is what I meant.
    Joan Venge : Thanks Jon, this is a good one.
    Jon Skeet : Mehrdad: But that doesn't look terribly expensive to me. Yes, you need to recalculate the bucket position and keep looking for as many collisions as you see, but I suspect it's amortised O(n) and not *much* per entry.
    Mehrdad Afshari : Jon: It is indeed O(n), but becomes slower than Dictionary if collision rate is high. Well, not terribly... but definitely much slower than resizing List.
  • The hashtable works by using buckets, which can hold several items each (at least in most implementations, there are some that reuse other buckets in case of already used buckets). The number of buckets is usually a prime number, so that dividing the hashcode by the number of buckets returns an acceptable distribution for "good" hashes.

    Usually, there is a certain fill factor which triggers the addition of more buckets and therefore the rebuild of the hashtable. Since the hashes are divided by the bucket count, the instances need to be redistributed according to their new bucket index, which is basically a recreate from scratch.

    For the .NET hashtable, you can specify the "load factor" in some constructors. From MSDN:

    The load factor is the maximum ratio of elements to buckets. A smaller load factor means faster lookup at the cost of increased memory consumption. A load factor of 1.0 is the best balance between speed and size.

  • From the MSDN page on Hashtable.Add():

    If Count is less than the capacity of the Hashtable, this method is an O(1) operation. If the capacity needs to be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.

    Since List has the same remark I would assume they work similarly internally regarding their memory allocation.

  • why not dig into reflector to do some research if interested:

    private void Insert(object key, object nvalue, bool add)
    {
        uint num;
        uint num2;
        if (key == null)
        {
            throw new ArgumentNullException("key", Environment.GetResourceString("ArgumentNull_Key"));
        }
        if (this.count >= this.loadsize)
        {
            this.expand();
        }
        else if ((this.occupancy > this.loadsize) && (this.count > 100))
        {
            this.rehash();
        }
        uint num3 = this.InitHash(key, this.buckets.Length, out num, out num2);
        int num4 = 0;
        int index = -1;
        int num6 = (int) (num % this.buckets.Length);
    Label_0071:
        if (((index == -1) && (this.buckets[num6].key == this.buckets)) && (this.buckets[num6].hash_coll < 0))
        {
            index = num6;
        }
        if ((this.buckets[num6].key == null) || ((this.buckets[num6].key == this.buckets) && ((this.buckets[num6].hash_coll & 0x80000000L) == 0L)))
        {
            if (index != -1)
            {
                num6 = index;
            }
            Thread.BeginCriticalRegion();
            this.isWriterInProgress = true;
            this.buckets[num6].val = nvalue;
            this.buckets[num6].key = key;
            this.buckets[num6].hash_coll |= (int) num3;
            this.count++;
            this.UpdateVersion();
            this.isWriterInProgress = false;
            Thread.EndCriticalRegion();
        }
        else if (((this.buckets[num6].hash_coll & 0x7fffffff) == num3) && this.KeyEquals(this.buckets[num6].key, key))
        {
            if (add)
            {
                throw new ArgumentException(Environment.GetResourceString("Argument_AddingDuplicate__", new object[] { this.buckets[num6].key, key }));
            }
            Thread.BeginCriticalRegion();
            this.isWriterInProgress = true;
            this.buckets[num6].val = nvalue;
            this.UpdateVersion();
            this.isWriterInProgress = false;
            Thread.EndCriticalRegion();
        }
        else
        {
            if ((index == -1) && (this.buckets[num6].hash_coll >= 0))
            {
                this.buckets[num6].hash_coll |= -2147483648;
                this.occupancy++;
            }
            num6 = (int) ((num6 + num2) % ((ulong) this.buckets.Length));
            if (++num4 < this.buckets.Length)
            {
                goto Label_0071;
            }
            if (index == -1)
            {
                throw new InvalidOperationException(Environment.GetResourceString("InvalidOperation_HashInsertFailed"));
            }
            Thread.BeginCriticalRegion();
            this.isWriterInProgress = true;
            this.buckets[index].val = nvalue;
            this.buckets[index].key = key;
            this.buckets[index].hash_coll |= (int) num3;
            this.count++;
            this.UpdateVersion();
            this.isWriterInProgress = false;
            Thread.EndCriticalRegion();
        }
    }
    
  • It all depends on your hash implementation of course.

    Some hashes double, some change their size to an other arbitrary size (for instance, the next prime number).

    Most hashes will need a rehash after changing their buffer size, which is "just" moving pointers around, but is still linear with the hash size. However, some hashes use consistent hashing, which reduces the need to move elements around (typically, only one small fraction of the elements will need to be moved).

    Ben S : He's asking about the specific .NET Hashtable implementation.

0 comments:

Post a Comment