Intranet searching

We have been asked recently if Claromentis supports fuzzy logic searching and I am pleased to confirm that it does.

Not only that, but it has a neat feature for refining how much fuzziness is allowed... if only we could work out the least fuzzy way to use it.

To run a fuzzy logic search, just add a tilde (~) at the end of the search string.

For example, let’s search for “term” on a Claromentis intranet.

[Screenshot: normal search for “term”]

I have run a basic search and asked for results from documents and wikis only. I find only 3 results. The most relevant is a wiki post about search terms, the second is a PDF that mentions “term” in the title, and the third is a management report in the form of a Word document. That report contains 2,322 words and doesn’t mention “term” in any of the obvious fields like title or metadata, but it does indeed contain the phrase “long term basis” buried within the content.

Now let’s turn to fuzzy logic: I am looking for something like “term”, so I search for “term~”.

[Screenshot: default fuzzy search for “term”]

On a basic run I get 42 matching results, as usual returned in order of relevancy. The ones at the top contain words that are very similar to “term”; in fact they contain exactly that word. Hits after the sixth result start to contain “team”, and by the time we get to the last result, a Word document, I struggle to find the reason for the match, but there are indeed similar words buried in the document.

Now here’s the fun part. We can add a parameter after the tilde to define how fuzzy we really feel! The default is 0.5.
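To make the syntax concrete, here is a small Python sketch that builds the search strings used in this post. The helper name `fuzzy_query` is purely hypothetical (it is not part of Claromentis), and the check that the fuzzy level sits between 0 and 1 is my own assumption rather than a documented rule.

```python
def fuzzy_query(term, fuzzy_level=None):
    """Build a fuzzy search string such as 'term~' or 'term~0.75'.

    Hypothetical helper for illustration only. Leaving fuzzy_level
    as None produces the bare tilde, which uses the default level.
    """
    if fuzzy_level is None:
        return f"{term}~"
    # Assumption: the fuzzy level is a number between 0 and 1.
    if not 0 <= fuzzy_level <= 1:
        raise ValueError("fuzzy_level must be between 0 and 1")
    return f"{term}~{fuzzy_level}"


print(fuzzy_query("term"))               # term~
print(fuzzy_query("supervision", 0.75))  # supervision~0.75
```

The resulting string is simply what you would type into the search box.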

To see how this works we really need to search for something with more than 4 characters. I don’t pretend to fully understand the Levenshtein distance algorithm used, but I can see that with only 4 characters to play with, a fuzzy level of 0.75 or less gives close to 42 results, while 0.76 or more takes us back to the 3 we found in the first place.
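For the curious, the Levenshtein distance is the number of single-character insertions, deletions or substitutions needed to turn one word into another. The Python sketch below computes it and then converts it to a 0-to-1 similarity score by dividing by the length of the longer word. That normalisation is my assumption: it is one common convention, and I have not confirmed which formula the search engine behind Claromentis actually applies. The function names are hypothetical too.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("term", "team"))    # 1
print(similarity("term", "team"))     # 0.75
print(levenshtein("super", "upper"))  # 2
print(levenshtein("super", "suer"))   # 1
```

Under this assumption a four-letter word can only score 1.0, 0.75, 0.5 and so on, which would fit the sudden jump I saw between 0.75 and 0.76.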

So I search for “supervision”. A normal search finds 5 documents from our HR policies about, believe it or not, supervision.

[Screenshot: default search for “supervision”]

Now it gets interesting: “supervision~” finds 55 results, the same top 5 of course, but another 50 that mention “Support” amongst other things. Being a bit less fuzzy with “supervision~0.75” just finds the same 5, but “super~” finds 15, including documents that have “upper” in them.

I think a great example is “super~0.60”, which is slightly fussier (this is where English/American spelling is really important for a change!) and finds only one result. It turns out I have a document where “super” is in the metadata but has been mistyped as “Suer accounts”:

[Screenshot: fuzzy search result for “super”]

So where does this leave us? Well, it is great that we allow fuzzy searching but, to lead into the obvious pun, I am a little fuzzy as to exactly how to use it. It does, however, find some great typos.