VBA Collections: AKA, Linked Lists
Everything I still remember about the Data Structures course I took 20 years ago as it applies to Collections in VBA.
Collections and dictionaries are often used similarly in VBA. But behind the scenes they are actually very different data structures. Many VBA developers don't come from a computer science background...and that's OK! I use a shockingly small amount of my computer science education when developing custom database applications. By far, the more important skill is my ability to understand my client's problem domain and craft a solution that solves their problem.
That said, I probably use my CS background more than I realize. I took an entire semester-long course on data structures. When writing high performance code, understanding data structures is critically important. Nobody writing high performance code is doing that in VBA (or any other interpreted language). However, it's still worthwhile having a bit of background knowledge in some of the differences among data structures.
Linked lists and hash tables
What and the what now? In computer science terms, a VBA collection is implemented as a linked list. Keep in mind, a "collection" is an abstraction; it's a convenient way of thinking about how the data structure works.
What do I mean by abstraction? The Windows file system uses the abstraction of folders to communicate to users the idea of organizing their files. But the computer is not literally storing pieces of paper inside of manila folders. The file system's implementation--how the bits are actually stored--is a different topic altogether.
Part 1. Collections as linked lists
Back to VBA collections as linked lists. If "collection" is the abstract concept, "linked list" is the actual implementation. It's how the bits are physically stored and accessed within the computer's memory. It works roughly like this:
The drawing above represents the following code:
10 Dim MyCollection As Collection
20 Set MyCollection = New Collection
30 MyCollection.Add "A"
40 MyCollection.Add "B"
50 MyCollection.Add "C"
Here's what VBA is actually doing behind the scenes when the above commands are run:
- Line 20: Point the local variable name "MyCollection" at the memory address
0x42b7
. - Line 30: Create a memory pointer from MyCollection at
0x42b7
to item A at0x1c3e
. - Line 40: Create a memory pointer from item A at
0x1c3e
to item B at0x54d9
. - Line 50: Create a memory pointer from item B at
0x54d9
to item C at0x877a
.
The reason this data structure is called a linked list should be obvious now. Each item links to the one after it. Note that the items only link in a single direction. There is a similar structure, known as a doubly linked list, which links in both directions. I'm not aware of any native VBA data structure implemented as a doubly linked list, but it would not be difficult to modify a collection to add that feature.
Deleting from a collection
What happens if we want to remove the second item added from our collection? Keep in mind that VBA collections--unlike almost all other sequential programming structures--are indexed from 1, not 0. Thus, to remove the second item, we use the following code:
MyCollection.Remove 2
What does this look like graphically?
That's easy enough code to write, but what's actually happening in the background? Probably more than you realize.
- Go to the MyCollection object at
0x42b7
. - Follow the memory pointer to item A at
0x1c3e
. - Save memory address
0x1c3e
. - Follow the memory pointer to item B at
0x54d9
. - Save item B's pointer to the next item in the linked list (
0x877a
). - Return to item A using the memory address we saved earlier (
0x1c3e
). - Change item A's memory pointer so that instead of pointing to item B at
0x54d9
it points to item C at0x877a
. - {If item B were an object, then this step would reduce the reference counter for item B. Once an object's reference count equals 0, the VBA garbage collector releases its memory. But that's waaaay beyond the scope of this article.}
What are a few things you've noticed? First of all, getting to a member of the collection is not a single operation. In this contrived example, there are only three items. But what if you packed 5 million items into a collection? And you needed to access the last record in the collection? A collection would not be an efficient data structure for that use case.
A reasonable person might think that a way around that would be to assign explicit keys to each item in the collection. Then, we could remove the item based on its key and not by its index number. Surely that would be more efficient? As it turns out...probably not. In fact, it might be even less efficient.
If a collection is implemented as a linked list, then the key is likely just an extra piece of metadata stored with each item. So, the process of deleting an item by its key may look like this:
- Go to the MyCollection object at
0x42b7
. - Follow the memory pointer to the first item (
0x1c3e
). - Check the value of the key attached to the first item (no match).
- Save memory address
0x1c3e
. - Follow the memory pointer to the second item (
0x54d9
). - Check the value of the key attached to the second item (it matches).
- Save the pointer to the next item in the linked list (
0x877a
). - Return to the item at the memory address we saved earlier (
0x1c3e
). - Change this item's memory pointer so that instead of pointing to the second item at
0x54d9
it points to the third item at0x877a
.
Caveats and assumptions
I'm assuming that Office VBA implements Collections as a simple linked list. This is a reasonable assumption, but not one that I've actually confirmed. The actual implementation may in fact be a doubly linked list. Since items can be removed by their string index, it's possible that Office VBA maintains a hash table to speed up that particular operation. In a compiled language, that's the sort of optimization that would be done at compile time.
I should note that there is nothing in the VBA Language Specification that states how a collection should be implemented. Therefore, different implementations of the language may in fact use different data structures. I'm inferring the data structure based on the stated behavior of the object in the language spec.
More to come
Stay tuned for a future article where I'll tackle the implementation of Dictionary objects as hash tables.