Why isn’t String’s .length() accurate?


It isn’t accurate because it only counts the number of char values in the String, not the number of Unicode characters. In other words, it miscounts code points outside of what is called the BMP (Basic Multilingual Plane), that is, code points with a value of U+10000 or greater: each such code point is counted as two.

The reason is historical: when Java was first defined, one of its goals was to treat all text as Unicode; but at that time, Unicode did not define code points outside of the BMP. By the time Unicode defined such code points, it was too late for char to be changed.

This means that code points outside the BMP are represented with two chars in Java, in what is called a surrogate pair. Technically, a char in Java is a UTF-16 code unit.
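To make the surrogate pair idea concrete, here is a minimal sketch using standard methods of java.lang.Character. The class name SurrogatePairDemo, the helper charsNeeded and the two sample code points are only illustrative choices, not anything from the JDK or the question.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: how many Java chars (UTF-16 code units)
    // are needed to store a single code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int insideBmp = 0x00E9;    // U+00E9, inside the BMP
        int outsideBmp = 0x1F600;  // U+1F600, outside the BMP

        System.out.println(charsNeeded(insideBmp));   // 1
        System.out.println(charsNeeded(outsideBmp));  // 2

        // Converting the code point to chars yields the surrogate pair.
        char[] units = Character.toChars(outsideBmp);
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true
    }
}
```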

The correct way to count the real number of characters within a String, i.e. the number of code points, is:

someString.codePointCount(0, someString.length())
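As an illustration of the difference, the sketch below compares .length() with .codePointCount() on a string containing one code point outside the BMP. The class name, the helper countCodePoints and the sample string are assumptions made for the example. The last line shows codePoints().count(), which is another way to get the same number on Java 8 and later.

```java
public class CodePointCountDemo {
    // Hypothetical helper wrapping the call shown above.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as its surrogate pair, then 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());             // 4: counts UTF-16 code units
        System.out.println(countCodePoints(s));     // 3: counts code points
        System.out.println(s.codePoints().count()); // 3: stream-based alternative
    }
}
```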