Okay, so you use Lucene for search with the current 2.1.x head, and you create a page named ExamplePseudoPromblemOfSearchingWithLucene. You obviously talk about the Example Pseudo Promblem Of Searching With Lucene on that page, because it's an important thing to talk about.


The "problem" is that the word "lucene" is not present in that page's text. It is present in the page's title, but at the end (or in the middle). Page names are internally CamelCased in JSPWiki, so when the page is fed to Lucene for indexing, the title has no spaces in it and is stored as one whole token. A search for 'lucene' then does not find the page, because the page text does not contain the word and the page title was indexed as a single token. Lucene's query parser does not accept a wildcard at the start of a term, so a search for '*lucene' isn't possible either.

JohnV accidentally discovered this "problem" and wondered if the old regex-based search would have found the page, and yup, it does. (Though it's horribly slow!) So maybe a patch for the 2.1.x Lucene search is in order.

NOTE: This "problem" only manifests itself if breakTitlesWithSpaces is off.

Hmm, there is probably a whole nest of CamelCase related search problems with the current 2.1.x lucene search. If I get a chance to dig deeper I will. Off the cuff, my gut tells me that all wiki page names (as page titles and within page texts) must be fed to lucene twice, once as CamelCase and once as Camel Case.
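To make the "feed it twice" idea concrete, here is a minimal sketch in plain Java. It is not JSPWiki's or Lucene's code; the class and method names (CamelCaseSplitter, splitWords, indexableTitle) are made up for illustration, and the split rule (break before an upper-case letter that follows a lower-case letter or digit) is an assumption about what counts as a word boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JSPWiki or Lucene.
final class CamelCaseSplitter {

    /** Splits a CamelCase page name into its words, e.g. "SnarfGrobble" -> [Snarf, Grobble]. */
    static List<String> splitWords(String pageName) {
        List<String> words = new ArrayList<>();
        // Zero-width split: between a lower-case letter or digit and an upper-case letter.
        for (String word : pageName.split("(?<=[a-z0-9])(?=[A-Z])")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Returns the text to hand to the indexer: the name as CamelCase, then as "Camel Case". */
    static String indexableTitle(String pageName) {
        List<String> words = splitWords(pageName);
        if (words.size() <= 1) {
            return pageName; // nothing to split, avoid indexing the same word twice
        }
        return pageName + " " + String.join(" ", words);
    }
}
```

With this, indexableTitle("SnarfGrobble") gives "SnarfGrobble Snarf Grobble", so a whitespace-based tokenizer would see both the whole name and the individual words.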

Oh yeah, I think there is another problem in that the brackets are fed in as well. Create a page Snarf Grobble and on it put the text SnarfGrobble, Snarf Grobble, SnarfGrobble, and (Snarf Grobble)

MahlenMorris - In the version of the searching code that I have running here at work (which is different than the 2.1.x head), I think part of this problem is solved. The title is indeed broken up into individual words when given to the indexer; in fact, each continuous grouping of words is also given, so that, for example, a search for "pseudo", "pseudoproblem", and "problemsearching" would all find this page.
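For readers who want to picture the "each continuous grouping" approach, here is a rough sketch in plain Java. It is only a guess at the idea, not the code described above: the class name TitleGroupings and the method groupings are hypothetical, and the sketch does not drop stop words, so for a title containing "ProblemOfSearching" it emits "problemofsearching" rather than "problemsearching".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the actual JSPWiki search code.
final class TitleGroupings {

    /** Returns every contiguous run of title words, lower-cased and joined without spaces. */
    static List<String> groupings(String pageName) {
        // Same word-boundary assumption as before: lower-case/digit followed by upper-case.
        String[] words = pageName.split("(?<=[a-z0-9])(?=[A-Z])");
        List<String> terms = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder run = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                run.append(words[end].toLowerCase(Locale.ROOT));
                terms.add(run.toString()); // words[start..end] joined together
            }
        }
        return terms;
    }
}
```

For example, groupings("SnarfGrobbleFoo") yields snarf, snarfgrobble, snarfgrobblefoo, grobble, grobblefoo and foo. The number of terms grows quadratically with the number of words in the title, which is fine for page names but would be too much for page text.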

But I hadn't thought about the CamelCase within pages. That's another kettle of fish.

When I can tear myself away from playing Doom 3 (and it isn't easy to do so!), I'll send Janne a patch with some new searching code that should help.

Any update on the CamelCase within pages issue? I've actually sort of reconciled myself to eliminating all CamelCase on our internal wikis and requiring that the [] syntax be used. (For a variety of reasons, but this was a factor.) --JohnV

This is actually an indexing problem. It also severely affects languages like Chinese or Japanese, where sentences are generally written without whitespace, so the default English rules don't work because they cannot detect where words start and stop. It is also a problem for Finnish, where "wiki" and "wikissä" are in fact the same word; the latter just means "in the wiki".

I'd love to see this one solved.

-- JanneJalkanen

Um, fixing the CamelCase issue discussed here won't solve your "wiki" and "wikissä" example. You'd need to search for "wiki*" to find both. This is another issue that I have with the current search/Lucene implementation: I should be able to use the standard Lucene search syntax, but I cannot. I've not looked into it, but it seems like something is mutating the input before feeding it to Lucene. Anyway, fixing the CamelCase in page text problem, as well as allowing standard Lucene search syntax, would address all my known issues. --JohnV

What changes would need to be made to allow partial word matches? We implemented the Lucene search on our internal wiki, and when I search on Postgres, the pages with Postgresql are not returned in the search. I think this could be solved by allowing Lucene to search with wildcards... any ideas? You can do the same search on www.jspwiki.org: the Postgres search will find this page now, but Postgresql will find other pages. -- Shawkins
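As a rough illustration of why a trailing wildcard ("postgres*", "wiki*") is cheap while a leading one ("*lucene") is not, here is a toy sketch in plain Java. It uses a sorted set as a stand-in for an index's term dictionary; this is an analogy only, not how Lucene is implemented, and the names PrefixLookup and termsWithPrefix are invented for the example.

```java
import java.util.NavigableSet;
import java.util.SortedSet;

// Toy model of a sorted term dictionary; an analogy, not Lucene's implementation.
final class PrefixLookup {

    /** Returns all indexed terms that start with the given prefix. */
    static SortedSet<String> termsWithPrefix(NavigableSet<String> terms, String prefix) {
        // In sorted order, every term starting with the prefix lies in the range
        // [prefix, prefix + '\uffff'), so one range scan finds them all.
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }
}
```

With the terms lucene, postgres, postgresql, wiki and wikissä in a TreeSet, termsWithPrefix(terms, "postgres") returns postgres and postgresql, and termsWithPrefix(terms, "wiki") returns wiki and wikissä. There is no comparable range for "*lucene": terms ending in "lucene" are scattered through the sorted order, so finding them means scanning every term.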

This page (revision-19) was last changed on 26-Sep-2007 23:51 by JanneJalkanen