RefactorMe, please. Mostly obsolete information.


ÄÖÜßäöü

-- 060118 rsc


The content of the file (jspwiki.pageProvider = FileSystemProvider) is little-endian UTF-16 (note the leading FF FE byte-order mark), not UTF-8:

00000000h: FF FE C4 00 D6 00 DC 00 DF 00 E4 00 F6 00 FC 00 ; ÿþÄ.Ö.Ü.ß.ä.ö.ü.
00000010h: 0D 00 0A 00 0D 00 0A 00 2A 00 5B 00 C4 00 C4 00 ; ........*.[.Ä.Ä.
00000020h: C4 00 C4 00 C4 00 5D 00 0D 00 0A 00 2A 00 5B 00 ; Ä.Ä.Ä.].....*.[.
00000030h: D6 00 D6 00 D6 00 D6 00 D6 00 5D 00 0D 00 0A 00 ; Ö.Ö.Ö.Ö.Ö.].....
00000040h: 2A 00 5B 00 DC 00 DC 00 DC 00 DC 00 DC 00 5D 00 ; *.[.Ü.Ü.Ü.Ü.Ü.].
00000050h: 0D 00 0A 00 2A 00 5B 00 DF 00 DF 00 DF 00 DF 00 ; ....*.[.ß.ß.ß.ß.
00000060h: DF 00 5D 00 0D 00 0A 00 2A 00 5B 00 E4 00 E4 00 ; ß.].....*.[.ä.ä.
00000070h: E4 00 E4 00 E4 00 5D 00 0D 00 0A 00 2A 00 5B 00 ; ä.ä.ä.].....*.[.
00000080h: F6 00 F6 00 F6 00 F6 00 F6 00 5D 00 0D 00 0A 00 ; ö.ö.ö.ö.ö.].....
00000090h: 2A 00 5B 00 FC 00 FC 00 FC 00 FC 00 FC 00 5D 00 ; *.[.ü.ü.ü.ü.ü.].
000000a0h: 0D 00 0A 00 0D 00 0A 00 2D 00 2D 00 20 00 30 00 ; ........-.-. .0.
000000b0h: 36 00 30 00 31 00 31 00 38 00 20 00 72 00 73 00 ; 6.0.1.1.8. .r.s.
000000c0h: 63 00 0D 00 0A 00                               ; c.....
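
A dump like the one above can be cross-checked independently of any editor by looking at the first bytes of the file. The following is only a minimal sketch: the class and method names are made up for illustration, and it recognizes just the three common byte-order marks (a UTF-32 mark, for instance, is not handled).

```java
class BomSniffer {
    /** Returns a label for the byte-order mark found at the start of the given bytes. */
    static String sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // No mark: could be UTF-8 without BOM, ISO-8859-1, or something else.
        return "none";
    }

    public static void main(String[] args) {
        // First four bytes of the dump above: FF FE C4 00
        byte[] head = {(byte) 0xFF, (byte) 0xFE, (byte) 0xC4, 0x00};
        System.out.println(sniff(head));
    }
}
```

A result of "none" says nothing about which encoding is in use; it only rules out a byte-order mark.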

The file names generated are always

%C3%83%C3%83%C3%83%C3%83%C3%83.txt

where I expected

%C3%84%C3%84%C3%84%C3%84%C3%84.txt
%C3%96%C3%96%C3%96%C3%96%C3%96.txt
%C3%9C%C3%9C%C3%9C%C3%9C%C3%9C.txt
%C3%9F%C3%9F%C3%9F%C3%9F%C3%9F.txt
%C3%84%C3%A4%C3%A4%C3%A4%C3%A4.txt
%C3%96%C3%B6%C3%B6%C3%B6%C3%B6.txt
%C3%9C%C3%BC%C3%BC%C3%BC%C3%BC.txt
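
The expected names are simply the percent-encoded UTF-8 bytes of the page name. A minimal sketch of that mapping follows, assuming that java.net.URLEncoder is a fair stand-in for whatever JSPWiki does internally (which is not verified here); the class and method names are hypothetical, and the Charset overload needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PageFileNames {
    /** Percent-encodes the UTF-8 bytes of a page name and appends ".txt". */
    static String fileNameFor(String pageName) {
        return URLEncoder.encode(pageName, StandardCharsets.UTF_8) + ".txt";
    }

    public static void main(String[] args) {
        // U+00C4 is the capital A umlaut; its UTF-8 bytes are C3 84.
        System.out.println(fileNameFor("\u00C4\u00C4\u00C4\u00C4\u00C4"));
        // U+00C3 is the capital A tilde; its UTF-8 bytes are C3 83.
        System.out.println(fileNameFor("\u00C3\u00C3\u00C3\u00C3\u00C3"));
    }
}
```

The second call reproduces the unexpected name: it is what one gets when the page name consists of U+00C3 characters instead of the umlauts.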

What did I miss? If UTF-8 is recommended for JSPWiki (and once I manage to understand it), I would like to convert the existing wiki files from ISO-8859-1 to UTF-8.


Having said that, I noticed that the behavior is different on this wiki. Here it works as I expected.

All I did was download the jspwiki.zip containing the JSPWiki.war and move it below tomcat5.5\webapps on W2k.

I do not understand what's going on.

Rolf


Checking 2.3.71-cvs out of the repository leads to the same problem on Debian Linux. Why does it work on jspwiki.org but not on my machines?

-- 060118 rsc


By the way: the way file names are produced from non-7-bit-ASCII characters has drawbacks on Windows systems:

A filename such as %C3%84.txt will be interpreted as: "look for an environment variable named C3 and substitute its value into the name." Several programs, especially backup software, will convert such a name to 84.txt because there is no environment variable named C3. And they do it silently.

I ran into this when using cvs for backup purposes.

--rsc, 19-Jan-2006

I'm sorry, but I don't use Windows. Could someone please open up a new bug and provide a patch which does not break existing Windows repositories?

-- JanneJalkanen

If someone faces such a problem, here is my solution so far: the error happened to me most often when feeding batch commands (.bat or .cmd) with the list of all files in a wiki page directory. When passing the filenames across, you ought to replace each percent sign with two percent signs. E.g. %C3%84.txt should be converted on the fly to %%C3%%84.txt. However, I'm far from happy with this solution.
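
The doubling can be done mechanically before a name is written into a batch file. A minimal sketch, with a hypothetical helper name; it covers only the percent sign and none of the other characters that cmd.exe treats specially:

```java
class BatchEscape {
    /** Doubles every percent sign so a batch file sees it as a literal character. */
    static String escapeForBatch(String fileName) {
        return fileName.replace("%", "%%");
    }

    public static void main(String[] args) {
        System.out.println(escapeForBatch("%C3%84.txt"));
    }
}
```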

-- rsc 060121


Seems like I'm not the first one who faced this problem.

I found a method in WikiEngine called safeGetParameter.

Substituting getParameter in DefaultURLConstructor.parsePage by this one solved the problem on my site.

        String pagereq = m_engine.safeGetParameter( request, "page" ); //request.getParameter( "page" );

But if you look into safeGetParameter, I guess this could be a hack. The %C3%9F%C3%9F string from my browser ends up being treated as ISO-8859-1, because this is the default character set in my Firefox. safeGetParameter assumes this, converts the ISO-8859-1 string back to the byte sequence C3 9F C3 9F and presents it to a UTF-8 decoder. The decoder finds legal UTF-8 (not in the request string, but in the byte sequence) and from that forms the Unicode characters that the java.lang.String class needs to represent "ßß".

Although this method works for me, it will not work in cases where the client's browser is set to anything other than ISO-8859-1.
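
The re-decoding trick described above can be sketched in a few lines. This is not the actual body of safeGetParameter, only an illustration of the same idea under the stated assumption that the string was first decoded as ISO-8859-1; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class Redecode {
    /** Re-interprets a string that was wrongly decoded as ISO-8859-1 as UTF-8. */
    static String latin1ToUtf8(String misdecoded) {
        // ISO-8859-1 maps code points 0..255 one-to-one onto bytes, so this recovers the raw bytes.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes C3 9F C3 9F read as ISO-8859-1 give U+00C3 U+009F U+00C3 U+009F.
        String wrong = "\u00C3\u009F\u00C3\u009F";
        System.out.println(latin1ToUtf8(wrong).equals("\u00DF\u00DF"));

        // If the string was already decoded correctly, the trick destroys it:
        // the bytes DF DF are not valid UTF-8 and come back as replacement characters.
        String alreadyRight = "\u00DF\u00DF";
        System.out.println(latin1ToUtf8(alreadyRight).contains("\uFFFD"));
    }
}
```

The second case is the reason the approach is fragile: it only helps when the input really was mis-decoded in exactly this way.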

The whole subject of switching to UTF-8 seems unclear to me. Maybe not only to me: http://weblogs.java.net/blog/satyak/archive/2004/05/working_with_se.html.

Result: I have to stick to ISO-8859-1 till I'm a bit more enlightened.

What astonishes me most: it makes no difference for request.getParameter whether you call request.setCharacterEncoding( "UTF-8" ) or request.setCharacterEncoding( "ISO-8859-1" ) prior to calling request.getParameter. The result is just the same.

--rsc, 20-Jan-2006

The decoding process is supposed to be done by TextUtil.urlDecode(), which is why we don't do it in that place. It might be that there's something strange going on. I'll look into it at some point, but it's strange that it works for me but does not work for you. Could you detail your environment?

BTW, this page name really sucks. It provides no context - it could be a comment on anything.

-- JanneJalkanen


While partly understanding what's this all about, I found that this issue was reported several times:

e.g.

Why does this thing not happen on the jspwiki.org site?

Have you configured ShortViewURLConstructor? This is the only URL constructor that does not use request.getParameter( "page" ), while DefaultURLConstructor and ShortURLConstructor rely on the HTTP protocol.

I recommend replacing request.getParameter( "page" ) with m_engine.safeGetParameter( request, "page" ) in the URL constructors.

However, I'm not an HTTP protocol expert. If an HTTP request can use encodings other than 7-bit ASCII, even safeGetParameter will fail.

And, Janne, please have a look at my second issue:

If FileSystemProvider saves in little-endian Unicode instead of UTF-8, the wiki page base might not be portable from a 386 machine to a Sun, PowerPC, Mac, MIPS or Alpha. However, I haven't tested this.

--Rolf Schumacher, 21-Jan-2006

Hi!

FileSystemProvider should never save in UTF-16 LE or BE. It should always use Latin1 or UTF-8... You're pretty much the first one to ever have reported this on a code base which has been pretty much unchanged for four-or-so years, so I have no clue about what is going on.

I don't know why this is not happening on jspwiki.org either. I need more information as to which OS/servlet container/web browser combinations cause this.

-- JanneJalkanen

I added some extra checks in 2.3.72 for the URL issue (let's hope it breaks nothing). However, I did check also the AbstractFileProvider class, and if it's writing things in Unicode-LE, there is probably a bug in the JVM you are using. The encoding is explicitly set to UTF-8 (or Latin1) when writing the data.

-- JanneJalkanen


1. Sorry, Janne. I switched hex editors and got plain UTF-8 in the stored file. You are right. The file is plain UTF-8. There is no problem. I was misled by a strange editor (UltraEdit).

2. As far as I can see, nothing has changed for filenames with UTF-8 encoding. See the four attached screenshot excerpts:

  1. the main page, roughly as shown above
  2. open editor by clicking on ÄÄÄ
    1. notice the page name: %C3%84%C3%84%C3%84 that would be ok.
  3. I entered some text on this shot
  4. the thing happens when you click on the Save button:
    1. the page name changes to %C3%83%C3%83%C3%83
    2. the displayed page name is ÃÃÃ

With UTF-8 I have no way of getting characters beyond 7-bit ASCII into a page name.

What can I show you more than this in order to isolate the difference between "plain 2.3.73 on a German machine (linux or windows)" and jspwiki.org?

--Rolf Schumacher, 25-Jan-2006


Some more investigation:

When I click in page Main on ÄÄÄ I click on the following html code:

<a class="editpage" title="Create 'ÄÄÄ'" href="Edit.jsp?page=%C3%84%C3%84%C3%84">ÄÄÄ</a>

This goes to the Edit.jsp which in turn activates the WikiEngine. WikiEngine uses the DefaultURLConstructor.parsePage method. Here something goes wrong - at least on my server: "%C3%84%C3%84%C3%84" goes in and C4 C4 C4 is what should come out. But it doesn't! Instead C3 84 C3 84 C3 84 is stored in pagereq. No conversion from UTF-8 to Unicode has taken place!

Subsequently the MarkupParser.cleanLink method cleans out the 84s as they are control Unicode characters. MarkupParser.cleanLink returns C3 C3 C3 which will be converted from Unicode to UTF-8 resulting in C3 83 C3 83 C3 83.
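
The whole chain can be reproduced outside the servlet container. The sketch below is an illustration only: stripControls is a simplified stand-in for whatever MarkupParser.cleanLink really does, the charset passed to the decoder plays the role of the container's query-string decoding, and all names are hypothetical. The Charset overloads need Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ManglingChain {
    /** Drops ISO control characters (U+0000..U+001F and U+007F..U+009F). */
    static String stripControls(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isISOControl(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Decodes the page parameter with the given charset, cleans it, and re-encodes it as UTF-8. */
    static String roundTrip(String encoded, Charset requestCharset) {
        String decoded = URLDecoder.decode(encoded, requestCharset);
        String cleaned = stripControls(decoded);
        return URLEncoder.encode(cleaned, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String page = "%C3%84%C3%84%C3%84";
        // Decoded as ISO-8859-1, each C3 84 pair becomes U+00C3 U+0084; the U+0084 is stripped.
        System.out.println(roundTrip(page, StandardCharsets.ISO_8859_1)); // %C3%83%C3%83%C3%83
        // Decoded as UTF-8, the name survives the round trip unchanged.
        System.out.println(roundTrip(page, StandardCharsets.UTF_8));      // %C3%84%C3%84%C3%84
    }
}
```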

This is the page name the Save button on the Edit page gets in its form:

<form accept-charset="UTF-8" method="post" 
      action="Edit.jsp?page=%C3%83%C3%83%C3%83" 
      name="editForm" enctype="application/x-www-form-urlencoded">
    <p>
        
        <input name="page" type="hidden" value="ÃÃÃ" />
        <input name="action" type="hidden" value="save" />
        <input name="edittime" type="hidden" value="1138238331000" />
    </p>

    <textarea style="width:100%;" class="editor" 
              id="editorarea" name="_editedtext" rows="25" cols="80">ÄÄÄ
</textarea>

The question is: why is %C3%84%C3%84%C3%84 not converted to Unicode C4 C4 C4 by request.getParameter( "page" ) in the DefaultURLConstructor.parsePage method? Is it because the default locale on my server machine is ISO-8859-1, and on yours UTF-8? I don't know.

One solution: If you would use m_engine.safeGetParameter( request, "page" ) instead of request.getParameter( "page" ) this would work better.

--Rolf Schumacher, 26-Jan-2006

2.3.73 actually does use WikiEngine.safeGetParameter(). I'm stymied. It could be a defaultlocale thing.

-- JanneJalkanen, 26-Jan-2006


Checked out 2.3.73 at about midnight yesterday. Please have a look at line #213 in DefaultURLConstructor. What I checked out there is: String pagereq = request.getParameter( "page" ) where I would prefer String pagereq = m_engine.safeGetParameter( request, "page" ).

But: in line #212 the character encoding is set. Hence, in my opinion, request.getParameter should convert what it gets (%C3%84) to Unicode (C4) after it has converted the escaped bytes. It does not do that (on all my machines, Windows & Linux)! However, I could not find the RFC that specifies exactly how to handle escaped bytes together with a non-Latin-1 character encoding in an HTTP request. It is questionable whether my (and your?) interpretation of request.setCharacterEncoding is consistent with that of the developer of that routine.

So in the long run I would recommend forgetting about escaped byte sequences when it comes to converting page names to filenames.

This is not an easy recommendation. Unfortunately, and obviously, it would break a lot of JSPWiki storages out there. At least a migration path or a conversion utility would be nice to have. That could be overkill for such a rarely noticed problem. So, in the short run, m_engine.safeGetParameter( request, "page" ) would be the not-so-clean-but-pragmatic solution.

(BTW: I'm not as sure as it might sound. I'm learning every day. So please take nothing for granted.)

--Rolf Schumacher, 26-Jan-2006


As of version 2.3.74 I could not see that this issue is addressed yet.

I've

  1. copied your DefaultURLConstructor.java to DefaultUTF8URLConstructor.java
  2. changed only line #213 to String pagereq = m_engine.safeGetParameter( request, "page" );
  3. built the JSPWiki
  4. changed the jspwiki.properties property jspwiki.urlConstructor = DefaultUTF8URLConstructor
  5. produced a conversion Main class de.toolsprofi.iso2utf.Iso2utf as appended (some code is borrowed from TextUtil.java)
  6. converted my 437MByte wiki file storage in about 2 minutes
  7. checked the result

and I'm now happy with proper recognition of German umlauts etc. in UTF-8 encoding, and would be really satisfied if this helps someone else.
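
For readers without access to the attachment, the core of such a conversion can be sketched as follows. This is not the attached Iso2utf.java; it is a minimal illustration with hypothetical names that re-encodes file contents only and leaves file names alone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReencodeSketch {
    /** Interprets the bytes as ISO-8859-1 text and returns the same text as UTF-8 bytes. */
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Reads an ISO-8859-1 file and writes its content as UTF-8 to the target path. */
    static void convertFile(Path source, Path target) throws IOException {
        Files.write(target, latin1ToUtf8(Files.readAllBytes(source)));
    }

    public static void main(String[] args) throws IOException {
        // Demo on temporary files: the single byte C4 becomes the two bytes C3 84.
        Path in = Files.createTempFile("page-latin1", ".txt");
        Path out = Files.createTempFile("page-utf8", ".txt");
        Files.write(in, new byte[] {(byte) 0xC4});
        convertFile(in, out);
        for (byte b : Files.readAllBytes(out)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        Files.delete(in);
        Files.delete(out);
    }
}
```

Running such a conversion twice on the same file would double-encode it, so working on a copy of the page directory is the safer choice.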

--rsc, 01-Feb-2006

I have the same problem with v2.3.92-alpha and other recent cvs versions, not with 2.2. I have tried on a Windows 2003 Server and on a Linux server with Apache Tomcat/5.5.9. If I make a link to a new page named "Ötest" and create it, it becomes "Ã Test".

--Per Johansson, 06-Apr-2006


Hey,

Have you tried adding URIEncoding="UTF-8" to your connector(s) in Tomcat? I have:

    <Connector port="80"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"
               URIEncoding="UTF-8"/>

That saved me a lot of weird trouble with the request-encoding! :)

:O) Mikkle

--Mikkel Troest, 11-Apr-2006


YEPP!! positive Mikkel.

I tried this one and it works as well. I no longer have to change line #213 in DefaultURLConstructor.java after each checkout.

Honestly, I'm as far from enlightenment about what's going on in HTTP request encoding as I was when starting this thread. For now I stick to my words: "the character encoding is set. Hence, in my opinion, request.getParameter should convert what it gets (%C3%84) to Unicode (C4)", but it doesn't do that.

But for now your finding is a perfect workaround, if not the solution. Maybe it is the explanation for why it works at http://www.jspwiki.org and not on my server. I wonder whether this is the solution for other servlet containers as well, and whether JSPWiki works in all circumstances, e.g. whether other (older) servlets can cope with the URIEncoding="UTF-8" attribute.

Many thanks for your help, Mikkel.

--rsc, 15-Apr-2006

JSPWiki.org has the URIEncoding set, so maybe that's why we don't have these issues. I seem to recall this was something which was changed in the later versions of Tomcat - certainly 4.x does not need this, and it caught me by surprise as well when we were upgrading.

The reason why DefaultURLConstructor does not do anything is that it's actually WikiServletFilter which does request.setCharacterEncoding("UTF-8") - after that request.getParameter() should automatically return the right encoding. Unfortunately, there is something weird going on with Tomcat - I've not heard of any of these issues with any other servlet container.

-- JanneJalkanen


Don't forget there is often more than one connector, for instance the one Apache uses to connect to Tomcat (jk2).

--Trond, 05-May-2006


Remember the connector used by mod_jk if you are using Apache and mod_jk with Tomcat together. In my case, it is 8009, see below.

<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" URIEncoding="UTF-8"/>
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" URIEncoding="UTF-8"/>


--PeterLiu, 20 Jun 08

List of attachments

Kind  Attachment Name       Size     Version  Date Modified      Author
png   Bildschirmfoto-1.png  30.8 kB  1        26-Jan-2006 00:12  Rolf Schumacher
png   Bildschirmfoto-2.png  27.0 kB  1        26-Jan-2006 00:12  Rolf Schumacher
png   Bildschirmfoto-3.png  25.6 kB  1        26-Jan-2006 00:12  Rolf Schumacher
png   Bildschirmfoto-4.png  29.7 kB  1        26-Jan-2006 00:13  Rolf Schumacher
java  Iso2utf.java          8.6 kB   1        01-Feb-2006 21:24  rsc

This page (revision-64) was last changed on 02-Oct-2008 01:00 by Baráti,Zoltán