VBA and Unicode

VBA and Unicode go together like Bailey's and Lime. But pretending the problem doesn't exist won't make it go away. Even for us Americans.

VBA and Unicode

Cement mixer, anyone?


I've never worried too much about text file encodings or Unicode handling in my VBA code.  And why should I?  I'm an American, dammit!

I've modified an old joke to make my point for me:

Q: What do you call someone who speaks 3 languages?
A: Tri-lingual.

Q: What do you call someone who speaks 2 languages?
A: Bilingual.

Q: What do you call someone who speaks 1 language (and doesn't care about code pages and file encodings)?
A: American.

The past few days I've written a series of articles detailing a variety of file handling functions that I've used *with no problems* for over a decade.  Sounds great, right?  The problem is that most of them are pretty useless outside the good old U.S. of A.  

I mean, maybe my British brothers and sisters across the pond can use them without trouble.  Perhaps even the Aussies down under.  Those functions also probably work fine for Canadians.  (Which makes sense, since they're basically just Americans who like snow and don't know how to pronounce "about" correctly, eh?  No, wait, there is that weird province where the people can't decide if they're French or not.)

I was making a point before I went off on my ignorant-American digression.  For my international audience who've managed to read this far without closing their browser tab in disgust, I'd like to atone for my previous disregard for the plight of the world outside my small ANSI box.

I was going to write a full article about ANSI, Unicode, file encodings, code points, etc., but I realized that I don't really know very much about that world.  Instead, I'll do something I haven't done very often around here.  Put together a whole big pile of links to some better Unicode sources than anything I could provide myself.

Background

My fellow Americans, if you've read this far and been thinking, "What in the world is this guy even talking about?" then this first set of links are for you.  My international readers can just skip over most of these because, well, you've been living with this pain for years now and know all about it.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
"It does not make sense to have a string without knowing what encoding it uses."
- Joel Spolsky

This article from Joel Spolsky is the perfect starting point for learning about Unicode.  It's getting a bit long in the tooth, particularly with some of the website examples, but all the concepts are still solid.  Plus, Joel's articles are just fun to read.

On the Goodness of Unicode
"Thus, while there are officially two standards you should care about, Unicode and ISO 10646, through some political/organizational magic they are exactly the same, and if you're using one you're also using the other."
- Tim Bray

This article provides background on the world's writing systems; the interplay of characters, encodings, and fonts; the history and politics of Unicode; and some technical information on the Unicode standard itself.

Character code tutorial
"This document in itself does not contain solutions to practical problems with character codes (but see section Further reading). Rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place."
- Jukka "Yucca" Korpela

This is more thorough technical reference than Joel's.  It's from 2001, but the information seems to have aged pretty well.

Unicode and VBA

AKA, the stuff you actually care about.

Working with Unicode file names in VBA (using Dir, FileSystemObject, etc.)
"It sounds like you are being misled by the fact that while VBA itself supports Unicode characters, the VBA development environment does not. The VBA editor still uses the old "code page" character encodings based on the locale setting in Windows."
- Gord Thompson

The above stackoverflow answer from Gord Thompson shows how to use the FileSystemObject to iterate through a folder of Unicode file names, where the Dir() function would otherwise choke.

VBA Read Unicode file contents in various encodings
"To read a text file, you need to be able to handle more character sets than just ANSI. Not just in the contents but also in the file and folder names."
- Patrick O'Beirne

The above article includes two functions, ReadFileContentsAndCharset() (which tries to guess the character encoding), and ReadADOStreamText().  These are good alternatives to my FileRead() function.  (Note that they are not drop-in replacements, as they return file contents via a ByRef function argument, rather than as the return value of the function.)

Thanks to @ExcelAnalytics for pointing me to Patrick's article (and serving as the inspiration for this post).

Solving the Unicode, UTF8, UTF16 and Text Files conundrum in VBA
"No bulky and verbose .NET or undecipherable C++ code complications here.  Just immediately actionable, simple and humble, VBA code with one function to rule them all, and a 10 to 15 minutes read to understand it all."
- @francescofoti

Francesco recommends reading text files as binary using legacy statements like Open and Get.  By reading the files in as binary, VBA won't assume the contents are ANSI; it won't assume anything at all.  He then offers some code to convert the imported binary into UTF8 text.

VBA: Unicode Strings and the Windows API

"1. Do not use the A version of API calls, always use the W version instead."
"7. Remember that the VBE IDE cannot process and display Unicode, so don’t expect your debug.print to display anything other than ? for high code point characters."
- Renaud Bompuis

The above article has some great information about Unicode considerations when making Windows API calls.  He also talks about the built-in VBA message box's lack of Unicode support and how to work around that.  It's a great, practical read.

How To Display Foreign Characters In Excel VBE
"In this article, you will be walked through the process of showing foreign characters in Visual Basic Editor."

This article steps you through the process of changing the Windows Region settings so that the Visual Basic Editor will display Unicode characters correctly.  Unfortunately, this only supports one character set at a time.  Switching between character sets is clunky and appears to require a computer restart.

Note: In version 2004 of Windows 10, there is a regional setting to enable UTF-8 support system-wide.  The feature is still in Beta...and they're not kidding about that.

I tried enabling it and then using Alt codes to create Unicode characters in the VBE.  It did not work.  The Visual Basic Editor seems impervious to the Beta UTF-8 Windows setting, at least for now.  Also, after enabling the feature, I was no longer able to use Alt codes to create Unicode characters in a UTF-8 encoded file in Notepad++.  Instead, each time I tried, the Notepad++ window displayed a square box.  I was able to copy and paste Unicode characters from other text files and they displayed fine; the problem seemed to be with the Alt code handling.  I reverted to the non-UTF-8 Region Settings.  Your mileage may vary.  Mine sucked.

Assortment of stackoverflow VBA unicode questions
How to handle filenames with unicode characters in VBA using Dir?
I am selecting a list of files based on a criteria using Dir, and then storing these in an array of strings. I then iterate through the array and process the files. How can i handle filenames with
VBA - Convert string to UNICODE
I need to convert the string HTML from a mix of Cyrillic and Latin symbols to UNICODE. I tried the following: Public HTML As String Sub HTMLsearch() GetHTML (“http://nfs.mobile.bg/pcgi/m...
Using multiple language (english, chinese, japanese) in VBA
I need to be able to use strings of multiple languages (english, chinese and japanese) in VBA. Changing the region/locale setting of the computer only works if there is one language. Could someone ...
Getting the unicode value of a char in VB
How can I get the unicode value of a char? For example, I know that I can do this with ascii: i = Asc(“a”) // i == 97 (correct) What if I have a unicode char though? i = Asc(”•”) // i == 149 (

References

Now, for the hard-core material.  

Unicode Character Code Charts

This page from unicode.org has links to every currently recognized character set, from Basic Latin (ASCII) to Arabic, Ethiopic, Hebrew, Nandinagari, Tagalog, and Wancho.  There are also links to symbols and punctuation, including Braille Patterns, Mathematical Operators, Currency Symbols, Chess Symbols, Dingbats, and Emoticons.

Character Sets And Code Pages At The Push Of A Button

This is an old page with lots of links.  Unfortunately, most of those are now broken links, especially the ones linking to Microsoft and IBM reference pages toward the bottom of the site.  That said, if you're working on some esoteric issue and your Google-fu is failing you, this might be worth a shot.

Additional References

Dear readers in the future: leave additional references to good VBA Unicode articles in the comments below.  Dear realtime readers who have no option to leave comments, email me suggested articles at mike {at} nolongerset.com.  Or send them to me on Twitter @nolongerset, if you're into that sort of thing.

Image by LEEROY Agency from Pixabay

All original code samples by Mike Wolfe are licensed under CC BY 4.0