Short JSPWiki update

I've managed to fix (hopefully, anyway) a couple of age-old problems relating to the "match English plurals" option. At least all of the tests run.

I also fixed a couple of authentication-related problems; it seems now that everything works except groups. Slowly, slowly...

On a tangent: development was halted today for a few hours, as I was seriously contemplating throwing my computer out of the window and simply giving it all up.

The reason was that I tried to upgrade my Debian installation to work natively with UTF-8, which created some really serious and non-obvious issues: for example, the only somewhat functional locale is en_US.UTF-8 - NOT fi_FI.UTF-8, or any combination thereof.

Another reason was that this simple JUnit test ceased to work:

        // Round-trip: encode to ISO-8859-1 bytes, then decode them straight back
        String src = "abcåäö";
        String res = new String( src.getBytes("ISO-8859-1"), "ISO-8859-1" );
        assertEquals( src, res );

You see, according to all possible specifications this should work regardless of the encoding system. But no.

The cure? Load the file in editor, save it. No changes. Just save it.

I have no idea whatsoever what was happening. It took me hours and a few reboots to figure out (yes, I was that desperate), and at some point I was considering reinstalling Java, or getting rid of Debian altogether. OK, my own fault for running the unstable distribution, but still... My guess is that for some reason, Java started to interpret the file encoding differently, as it uses the native system encoding. But what I don't understand is why it failed again when I reverted to the earlier configuration.
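At least the guess about the native encoding can be checked. A quick diagnostic sketch (nothing JSPWiki-specific; the `InputStreamReader.getEncoding()` trick reports the actual default even on 1.4-era JDKs, where `Charset.defaultCharset()` does not exist yet):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class EncodingCheck {
    public static void main(String[] args) {
        // The property javac and the runtime fall back to when no charset is given
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));

        // What a Reader actually uses by default; this is the value that changes
        // under you when the system locale switches to UTF-8
        InputStreamReader in = new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        System.out.println("reader default = " + in.getEncoding());
    }
}
```

If the two runs (before and after the locale change) print different encodings, that would explain the flaky compile.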

Gah. This stuff is getting more and more complicated year by year. Or I am getting more stupid. Could be both.


The formula:

        String src = "abcåäö";
        String res = new String( src.getBytes("ISO-8859-1"), "ISO-8859-1" );
        assertEquals( src, res );

will *NOT* work if the compiler thinks that the original characters you put into the string (the literal expression) contain characters outside of the ISO-8859-1 range. Remember that UTF-8 involves sequences of byte values in the >128 range. If the compiler were to decide that your source was UTF-8, and you happened to be unlucky enough that the literal was a valid UTF-8 byte sequence, then the string would be composed of characters outside of the 8859-1 range.

It is possible that editing the file and saving it again might change it so the compiler thinks it is 8859-1.
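This failure mode is easy to try out directly; a minimal sketch (U+0142, "ł", is just an arbitrary character outside the 8859-1 range):

```java
public class RoundTripDemo {
    public static void main(String[] args) throws Exception {
        // Every code point here exists in ISO-8859-1, so the round trip is lossless
        String latin1 = "abc\u00e5\u00e4\u00f6";   // "abcåäö"
        String back = new String(latin1.getBytes("ISO-8859-1"), "ISO-8859-1");
        System.out.println(latin1.equals(back));   // true

        // A character outside Latin-1 cannot survive: getBytes() has no byte
        // for it and substitutes '?' before the decode even starts
        String outside = "abc\u0142";              // "abcł"
        String mangled = new String(outside.getBytes("ISO-8859-1"), "ISO-8859-1");
        System.out.println(outside.equals(mangled)); // false
        System.out.println(mangled);                 // abc?
    }
}
```

So if the compiler misread the source as UTF-8, the literal would already contain non-Latin-1 characters, and the test would fail exactly like the second case.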

Yeah, I am grasping at straws, but do you have any better idea?

--KeithSwenson, 18-Sep-2003

Except that the string is NOT valid UTF-8. But actually, now that you mention it, I think JDK 1.4 no longer throws exceptions on invalid UTF-8, whereas 1.3 would. So the byte stream would still be decoded, no matter what.
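Easy enough to demonstrate; a sketch (the strict half uses the `java.nio.charset` decoder API, which requires 1.4 or later):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class InvalidUtf8 {
    public static void main(String[] args) throws Exception {
        // "abcåäö" as ISO-8859-1 bytes -- an invalid byte sequence in UTF-8
        byte[] bytes = { 'a', 'b', 'c', (byte) 0xE5, (byte) 0xE4, (byte) 0xF6 };

        // The String constructor substitutes U+FFFD for malformed input instead
        // of throwing, so the broken stream decodes "successfully"
        String decoded = new String(bytes, "UTF-8");
        System.out.println("mangled: " + (decoded.indexOf('\uFFFD') >= 0)); // true

        // An explicit decoder set to REPORT makes the error visible again
        CharsetDecoder strict = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("malformed UTF-8 detected");
        }
    }
}
```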

--JanneJalkanen, 18-Sep-2003

"Main_blogentry_140903_1" last changed on 14-Sep-2003 23:53:32 EEST by JanneJalkanen.
