I’m more likely to swim the Channel in a pair of Uggs than I am to utter ‘caff’ in broad daylight, but that is because I’m a character-counting pedant.
At first glance, no matter how you pronounce it, the physical on-the-page representation of a parkin-parlour happens to have only four characters and be four bytes in length. As a word, ‘caff’ is very, very wrong. But if, like me, you are slightly averse to the notion of CCLXXX, then, well, ‘caff’ may have a subliminal allure.
Character counts count for SO much
You might assume that a character count, for Twitter, is defined by the byte length of the word or phrase you’ve written. For the most basic symbols, the characters do equate to bytes: one apiece. But if you use anything else, then things get a bit weird: accented vowels cause the most confusion.
Take the word café, for example. Four characters, n’est-ce pas? N’est pas. There are several sequences of code points that deliver the same outcome on the page, but each one uses a different number of bytes. Quod erat demonstrandum:
- The word café – using the “é” character, known as a ‘precomposed character’ – is encoded in UTF-8 like this: 0x63 0x61 0x66 0xC3 0xA9. Four characters, five bytes.
- But the word café – using a combining diacritic that overlaps the ‘e’ – is encoded so: 0x63 0x61 0x66 0x65 0xCC 0x81. Five characters, six bytes.
- Reassuringly, for pedants, the word café may be shown thus: /ˈkæfeɪ/ – indicating the refined, meet and right, and respectable pronunciation. Six characters, nine bytes though. Sorry, old chap.
- Aaaand… we’re back to the caff: four letters and four bytes, dagnammit.
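For fellow pedants who want to check the sums above, here is a minimal Python sketch. It assumes UTF-8 encoding (which is what the byte values in the list correspond to) and counts ‘characters’ as Unicode code points; `measure` is a hypothetical helper named for illustration, not anything Twitter provides.

```python
def measure(text: str) -> tuple:
    """Return (code points, UTF-8 bytes) for the given text."""
    return len(text), len(text.encode("utf-8"))


composed = "caf\u00e9"          # precomposed é (U+00E9)
combining = "cafe\u0301"        # plain e followed by a combining acute accent (U+0301)
ipa = "\u02c8k\u00e6fe\u026a"   # ˈkæfeɪ, without the enclosing slashes
caff = "caff"

for label, word in [
    ("composed", composed),
    ("combining", combining),
    ("ipa", ipa),
    ("caff", caff),
]:
    chars, size = measure(word)
    print(f"{label:10} {chars} characters, {size} bytes")
```

The two spellings of café look identical on screen, yet `composed == combining` is `False` in Python, which is exactly the trap.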
Luckily, for us, Twitter knows this doomed march towards Manningtree could be a problem. For it, and for us. (No offence, Manningtree.)
To the human eye the word ‘café’ is clearly four characters long, even if some ignoramuses utter blasphemies anon. However, with the high-handed precision of a man who’s circumcising beer-flies at a bar mitzvah, The Bird goes for the easiest and most common qualification of this word’s length, and inflicts a short, quick, and dirty outcome every time.
Four characters it is – never mind the bytes – because Twitter swaps out the letter-plus-accent combo for a doppelgänger that counts as a single character.
That works for me. It probably does the same thing for ü.
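The swap can be sketched with `unicodedata.normalize` from Python’s standard library. This is a sketch under assumptions, not Twitter’s actual code: it assumes the swap is Unicode Normalization Form C (NFC) and that length is then counted in code points. `counted_length` is a hypothetical helper.

```python
import unicodedata


def counted_length(text: str) -> int:
    """Count code points after NFC normalisation (assumed counting rule)."""
    return len(unicodedata.normalize("NFC", text))


composed_cafe = "caf\u00e9"    # precomposed é
combining_cafe = "cafe\u0301"  # e + combining acute accent
combining_u = "u\u0308"        # u + combining diaeresis, rendered as ü

# After normalisation, both spellings of café are the same string.
same = (
    unicodedata.normalize("NFC", composed_cafe)
    == unicodedata.normalize("NFC", combining_cafe)
)

print(counted_length(composed_cafe))
print(counted_length(combining_cafe))
print(counted_length(combining_u))
print(same)
```

Under those assumptions both cafés come out at four, and the two-part ü collapses to one, so the hunch about ü holds.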