Discussion on the problems in WikiRPCInterface of using UTF-8 in XML-RPC, which doesn't strictly allow it, and time zone issues.


The UTF-8 issue seems to have been talked to death on the XMLRPC mailing list. The summary seems to be: "While many toolkits might support something other than ASCII in string values, the XML-RPC spec is frozen, and will never change. If you transport something other than ASCII, you're in violation of the spec. Use base64."

I don't get it. The spec says explicitly that strings can be used to carry binary data (from http://xmlrpc.org/spec):

  Q. What characters are allowed in strings? Non-printable
     characters? Null characters? Can a "string" be used to
     hold an arbitrary chunk of binary data?

  A. Any characters are allowed in a string except < and &,
     which are encoded as &lt; and &amp;. A string can be
     used to encode binary data.
That would seem to make strings functionally interchangeable with base64s. What am I missing? --JeffDairiki

Same document:

Tag        Type           Example
<string>   ASCII string   hello world

ASCII. Not UTF-8, not ISO-Latin-1. ASCII. Yes, this is a clear problem within the spec. Actually, most clients and servers will survive UTF-8, no problem, so it's very much a convention nowadays. But we offer both options - both UTF-8 encoded strings, and base64-encoded strings. So far, nobody has used the UTF-8 version of the RPC interface...

--JanneJalkanen

Using base64 would mean that all methods that use strings now should use base64 instead (because JSPWiki supports UTF-8 all across the board - in fact even ISO-Latin1 is not supposed to go through XML-RPC strings). That means more work for the application writer, since he has to encode/decode everything going back and forth. Gng. XML-RPC is not person-to-person interoperable - many people are unable to write their own names as strings.

I'm seriously considering SOAP at this point. Or breaking the XML-RPC spec knowingly and willingly; call it WikiRPC or something =). (XML-RPC is a registered trademark of Userland Software).

--JanneJalkanen


MahlenMorris: It seems to be working fine. I'm turning the base64 back to a String by calling new String((byte[]) server.execute(GETPAGE, args), "UTF-8"); does that seem right? Like for most ignorant Americans, I18N and character encodings are very mysterious to me :)
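A minimal sketch of that conversion, for reference. It assumes the client library has already base64-decoded the value into a byte[], which is what the cast above relies on; the class and method names (PageTextDecoder, decodePage) are made up for illustration and are not part of JSPWiki or any XML-RPC library.

```java
import java.nio.charset.StandardCharsets;

public class PageTextDecoder {
    /**
     * Turns the result of a base64-typed XML-RPC value into text.
     * Assumption: the library hands the decoded value back as a byte[].
     */
    static String decodePage(Object rpcResult) {
        byte[] raw = (byte[]) rpcResult;   // note the array cast: (byte[]), not (byte)
        // Interpret the bytes explicitly as UTF-8 instead of the platform default.
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

In client code this would be called as decodePage(server.execute(GETPAGE, args)). Naming the charset explicitly matters: new String(raw) without one uses the platform default encoding, which may not be UTF-8.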

I'm still having trouble getting the time zone right, though. I've been looking at what you did in your code, but no matter what I do I get times that think they are PST but are in fact EET. For example, as I write this it thinks that the TODOList was last changed at 00:01 PST, when it was really 00:01 EET. If you were really sending UTC, I don't think I'd be getting that. Should that be working yet?

JanneJalkanen: Yeah, that's the correct way to get UTF-8. It's entirely possible that I screwed up something in the TimeZone thing... I didn't really test it properly. BTW, note that XML-RPC does not transport TimeZone information at all, and the Apache XML-RPC library always assumes your default TimeZone when it's reading the timestamp. You'll have to manipulate the result with the Calendar class to make sure it's UTC.
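One way the Calendar manipulation could look, as a rough sketch only. It assumes the server really did send UTC wall-clock fields and that the client library interpreted those same fields in some other zone (normally the JVM default). The helper name reinterpretAsUtc is hypothetical, not something from the Apache library.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class UtcFix {
    /**
     * Takes a Date whose wall-clock fields were read in the 'assumed' zone
     * and returns the instant those same fields denote in UTC.
     */
    static Date reinterpretAsUtc(Date parsed, TimeZone assumed) {
        // Recover the wall-clock fields exactly as the library saw them.
        Calendar local = Calendar.getInstance(assumed);
        local.setTime(parsed);

        // Re-enter the same fields into a UTC calendar.
        Calendar utc = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        utc.clear();   // zero all fields, including milliseconds
        utc.set(local.get(Calendar.YEAR),
                local.get(Calendar.MONTH),
                local.get(Calendar.DAY_OF_MONTH),
                local.get(Calendar.HOUR_OF_DAY),
                local.get(Calendar.MINUTE),
                local.get(Calendar.SECOND));
        return utc.getTime();
    }
}
```

A client would typically call reinterpretAsUtc(date, TimeZone.getDefault()). If the server is not actually sending UTC, this only moves the error around, so it is worth checking one known timestamp by hand first.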


Based on the work you have done here I've added experimental XML-RPC and SOAP support for the same methods as you use. You can find the methods (with some limited autogenerated documentation, expect better docs tomorrow) here: http://www.protocol7.com/services/openwiki.asp

One thing that is very different about my methods is that I have decided to break the ASCII rule of XML-RPC and return the data as UTF-8 anyway. If anyone has a huge problem with that, they can just use the SOAP method instead ;-)

Feedback is appreciated! Thanks for this very interesting work! I will follow it and probably evolve it a little bit myself :-)

/niklas http://www.protocol7.com

Whee, this is definitely cool :-). I deliberately wanted to stay compatible with XML-RPC spec because, well, it makes sense to be compatible. Not to mention that the Java XML-RPC library didn't take UTF-8 too well anyway. Also, you'll need to convert the page data anyway, since it's possible to use < and > inside the text, which makes it necessary to turn them into HTML entities. So it doesn't really matter much whether you do the whole UTF-8 into base64 or UTF-8 into escaped UTF-8.
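To make the comparison concrete, here is a small sketch of the two conversions side by side. It is illustrative only: the class and method names are invented, and java.util.Base64 (Java 8 and later) is used purely for brevity, not because either implementation discussed here uses it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PageEncoding {
    /** Escapes the markup characters so page text can sit inside an XML string value. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Encodes page text as UTF-8 bytes and then as base64, yielding pure ASCII. */
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Either way the sender does one pass over the text and the receiver undoes it. The difference is that the base64 result is plain ASCII, while the escaped form still carries any non-ASCII characters through the XML string.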

(I cleaned some older stuff away, BTW...)

--JanneJalkanen

I had three reasons for not using the base64 approach. 1. I think the ASCII rule in XMLRPC is a huge bug. And Dave Winer does as well (http://lists.xml.org/archives/xml-dev/200202/msg00920.html) :-) 2. My main platform is JavaScript... and it cannot handle base64 very well... 3. If anyone really objects to it, I can just point them to the SOAP implementation ;-)

Do you have any ideas for other methods that we should implement? :-) I've been thinking about making a setPage() method for writing content...

/niklas http://www.protocol7.com

On a secondary note - can you be sure that the newlines on the Wiki page (which tend to be very meaningful) always go through the XML transformation properly? I am not really certain about that myself, but I've found it best not to make assumptions. :-)

Careful reading of XML spec says that newlines go untranslated. So it's okay.

The whole XML-RPC is a bug. Darned infectious at that, I'd say =).

Note that you can, of course, break the XML-RPC standard. You just can't call it XML-RPC anymore, since UserLand software owns the trademark.

I think the proper call for setPage() is something like:

  • setPage( string pageName, base64 text ): Sets the page text. Now, what should it return? The old page text? An error code? An error message?

I think we can do user authentication in

  • a separate call (setPage( string username, string password, string pageName, base64 text )), or
  • using HTTP Basic authentication, or
  • allow both.

--JanneJalkanen

If Dave Winer breaks XMLRPC in that way I will as well :-) And if Userlands don't want me to I will just take down the XMLRPC end of that web service.

As for escaping the HTML: the string I return is inside a CDATA section, so it can contain any markup except the end-of-CDATA marker (which OpenWiki will fail on anyway :-). So, because of this bug in OpenWiki, I won't have this problem. But of course this is not a very good way of doing it... the CDATA sections need to be escaped.
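For what it's worth, the usual trick for text that may itself contain the end-of-CDATA marker is to split the marker across two adjacent CDATA sections. A minimal sketch follows; the class and method names are made up, and nothing here is taken from OpenWiki.

```java
public class CdataWriter {
    /**
     * Wraps text in a CDATA section. Every "]]>" in the text is split so that
     * "]]" ends one section and ">" begins the next; a parser concatenates
     * the sections back into the original text.
     */
    static String wrapInCdata(String text) {
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }
}
```

Text without the marker passes through unchanged apart from the wrapper, so this can be applied to every page unconditionally.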

The setPage() seems good. I will try to implement it later today. I would say we go for: setPage(pageName, text, username, password)

/niklas

Good point on Dave. So, I was going to release 1.7.0 over the weekend, which probably should have the API fixed. Shall we go with the "UTF-8 in strings" or the "UTF-8 in base64" approach? --JanneJalkanen


I'm having trouble deciding between base64 and straight UTF-8 representation of page data. The advantage of the former is that it's standards compliant, but it's more inconvenient (for example, for people who are working with JavaScript). The advantage of the latter is that it's much easier to work with, but it breaks the XML-RPC standard. Also, not all implementations can actually work with the UTF-8 strings - for example, you need to patch the Apache XML-RPC library to work with UTF-8, otherwise it loses all information. Also, the MinML parser it uses must be replaced with a fully standards-compliant parser such as Xerces, which roughly triples the distribution size...

Does anyone have any other opinions?

--JanneJalkanen

MahlenMorris: My gut says go with standards compliance. It may be a pain to work with, but I know that no matter what language I'm working with, it'll work. The alternative is that there'll need to be an "approved" list of libraries that'll support the mangled XML-RPC, and that seems a royal pain. Say I want to write a client in Haskell; will a particular Haskell XML-RPC client library work?

Put it this way: would you write Java code that only worked with particular JVMs?


Here's a crazy idea: Why don't we do both?

It's near-trivial to make a new API at, say http://www.ecyrd.com/JSPWiki/RPC3/ that provides the plain UTF-8 version of the API. It's just a new implementation of RPCHandler object anyway, and would involve about 100 lines of code... Yeah, and we'd need to include a patched version of the Apache XML-RPC library, but that's easy.

We're gonna at some point provide a SOAP interface anyway (presumably at http://www.ecyrd.com/JSPWiki/soap/), so why not support a third option for those who can't or won't handle things like base64?

--JanneJalkanen

Brilliant. I'm so used to making technical trade-offs that "you don't have to" doesn't get onto my radar.

--MahlenMorris

---

Off topic, are there instructions for setting up the RPC handler at a particular address? I wouldn't imagine it's difficult, but a list would help those of us who are staying up too late working on this :)

--MahlenMorris

Here's a clip of the JSPWiki web.xml deployment file:

   <servlet>
       <servlet-name>XMLRPC</servlet-name>
       <servlet-class>com.ecyrd.jspwiki.xmlrpc.RPCServlet</servlet-class>
   </servlet>

   <servlet-mapping>
       <servlet-name>XMLRPC</servlet-name>
       <url-pattern>/RPC2/*</url-pattern>
   </servlet-mapping>

The servlet part names a servlet and defines which class is used for it, and the servlet-mapping part defines the URL, relative to the current webapp. In this case, all possible URLs under "RPC2" are mapped to the XMLRPC servlet, but you could also do it without the wildcard, so that just a single URL resolves to the servlet itself.

--JanneJalkanen

This page (revision-20) was last changed on 17-Jul-2006 22:08 by Janne Jalkanen