First published on Synthesio’s tech blog

When Ancient Sogdia crashed our Elasticsearch cluster

Varahsha, Relief of a hunter, 5th-7th century CE
A Sogdian horseman hunting a wild Elasticsearch cluster, 5th century CE, Wikipedia

Ever heard of Sogdia? Well, until a few weeks ago, I hadn’t. It’s an ancient Iranian civilization that thrived between the 6th century BC and the 11th century AD, spanning what are now Uzbekistan, Turkmenistan, Tajikistan, Kazakhstan, and Kyrgyzstan. Sogdia has been gone for centuries, but, believe it or not, it recently managed to throw a wrench into our modern-day Elasticsearch cluster.

Okay, actually it didn’t completely take down our cluster, but it did trigger a rather peculiar warning message in our Elasticsearch logs:

"cause": {
    "type": "array_index_out_of_bounds_exception",
    "reason": "Index 184 out of bounds for length 184"
}

What on earth does that mean? Well, let’s dive in.

The Case of the Off-by-One Error

The root of this hiccup traces back to one of our custom Elasticsearch plugins, specifically one that handles tokenization. If you’re not familiar with the term, tokenization is the process of splitting sentences into individual words. But before we can even get to that, the tokenizer first needs to identify the “script” the text is written in, whether it’s Latin, Cyrillic, Chinese characters, or something else.

To help us figure that out, we rely on the ICU (International Components for Unicode) library, which is managed by the Unicode Consortium. It is able to identify 185 different scripts, referenced in the library with IDs from 0 to 184. Now, without drowning you in technical details, there’s a key moment in our tokenizer’s workflow where we need an array to hold all these possible scripts.

Originally, we based the size of this array on a constant called CODE_LIMIT from the ICU library, which had a value of 185. So, we created an array that could accommodate the 185 possible scripts.

But at some point, the CODE_LIMIT constant was deprecated, and it was advised to replace it with UCharacter.getIntPropertyMaxValue(UProperty.SCRIPT) instead. We made the switch, trusting that everything would work as before.

That’s when things got interesting. You see, UCharacter.getIntPropertyMaxValue(UProperty.SCRIPT) returns 184, the highest script ID, not 185, the number of possible scripts. So instead of creating an array with room for 185 scripts (IDs 0 through 184), we ended up with one that could only hold 184 (IDs 0 through 183). Our new array didn’t have room for the very last script, the one with ID 184.

This, my friends, is what we call an off-by-one bug: an error so common that it has its own Wikipedia page.
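Our tokenizer’s internals aren’t shown here, but the sizing mistake boils down to something like this sketch. It’s plain Java with a hypothetical stand-in for the ICU call (the class and method names below are illustrative, not our actual plugin code); the point is that a maximum ID is one less than a count:

```java
public class ScriptArraySizing {
    // Stand-in for UCharacter.getIntPropertyMaxValue(UProperty.SCRIPT),
    // which returns the HIGHEST script ID (184), not the script count (185).
    static int getMaxScriptId() {
        return 184;
    }

    // Buggy sizing: treats the max ID as a count.
    // length 184 means valid indices are only 0..183, so ID 184 falls out of bounds.
    static boolean fitsBuggy(int scriptId) {
        int[] perScript = new int[getMaxScriptId()];
        return scriptId < perScript.length;
    }

    // Fixed sizing: max ID + 1 slots, so every ID from 0 through 184 has a cell.
    static boolean fitsFixed(int scriptId) {
        int[] perScript = new int[getMaxScriptId() + 1];
        return scriptId < perScript.length;
    }

    public static void main(String[] args) {
        int oldSogdian = 184; // the ID of Old Sogdian in ICU
        System.out.println("buggy array fits ID 184: " + fitsBuggy(oldSogdian));
        System.out.println("fixed array fits ID 184: " + fitsFixed(oldSogdian));
    }
}
```

Indexing the buggy array at 184 is exactly what produced the "Index 184 out of bounds for length 184" exception above; the fix was simply to size the array as the maximum ID plus one.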

Enter the Sogdians

So, when did this bug rear its ugly head? Only when a document came along containing text in the script with ID 184. And what script might that be? You guessed it: Old Sogdian, as you can see in the source code of the ICU library:

public static final int OLD_SOGDIAN = 184; /* Sogo */

While Old Sogdian isn’t exactly trending in the world of modern linguistics, given the massive volumes of data we process daily, it was only a matter of time before we encountered it.

The good news? The bug has now been squashed, and we’ve been reminded of a valuable lesson: when you have a huge amount of data, you get to see the rarest of edge cases.

So here’s to Sogdia — long gone, but clearly not forgotten, especially by our Elasticsearch cluster.