Stemming Search Terms in Sitecore Lucene Indexes

Sitecore content search is a great technology that lets you add search to your Sitecore website with minimal effort. But one thing has always disappointed me: this search doesn’t understand word forms. The singular and plural forms of a noun are saved as two separate terms in the index (e.g. “tool” and “tools”). The base form, third-person form and past tense of a verb result in three different terms in the index (e.g. “deny”, “denies” and “denied”). This makes the search results worse: if you search for “deny”, documents that contain only “denies” or “denied” will not be found.

There are a few ways to “fix” this. The first is to use the similarity parameter in the query: x => x.YourFieldName.Like("tools", 0.8f). It is a quick and dirty solution: content search will now return results with similar words. But there is another side to the coin. You will also get results with similar words where you don’t expect them, e.g. a search for “Ireland” can return “Iceland” and “Island”.
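To see why similarity matching cannot tell an inflected form from a different word, here is a minimal, self-contained sketch. It assumes that similarity search is driven by edit distance (the number of single-character changes between two words), which is how Lucene's classic fuzzy queries work. The class and method names are mine and are not part of Sitecore or Lucene.Net.

```csharp
using System;

public static class EditDistanceDemo
{
    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions or substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (var j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (var i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Levenshtein("tools", "tool"));      // 1
        Console.WriteLine(Levenshtein("ireland", "iceland")); // 1
        Console.WriteLine(Levenshtein("ireland", "island"));  // 2
    }
}
```

“tools” is exactly as far from “tool” as “ireland” is from “iceland”, so a threshold loose enough to catch plural forms will also let unrelated words through.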

The other option is stemming. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem. There are different stemming algorithms, and Lucene.Net ships with an implementation of the Porter stemming algorithm, which can be used to extend Sitecore content search. We need to implement our own analyser:

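using System.IO;
using Lucene.Net.Analysis;

namespace Feature.StemmedSearch.Search
{
    // Analyzer that lower-cases the field value and then applies the Porter stemmer,
    // so that different forms of a word end up as the same term in the index.
    public class PorterStemLowerCaseKeywordAnalyzer : KeywordAnalyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // KeywordTokenizer emits the whole field value as a single token.
            // LowerCaseFilter normalizes the case, then PorterStemFilter reduces the token to its stem.
            return new PorterStemFilter(new LowerCaseFilter(new KeywordTokenizer(reader)));
        }
    }
}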
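
Before wiring the analyser into Sitecore, it can be handy to check what it actually produces. Below is a rough sketch of such a check, assuming the Lucene.Net 3.0.x token stream API (ITermAttribute, IncrementToken) and a project that references the analyser above. AnalyzerSmokeTest and Analyze are names I made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Feature.StemmedSearch.Search;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

public static class AnalyzerSmokeTest
{
    // Runs the text through the analyzer and collects the terms it would index.
    public static List<string> Analyze(Analyzer analyzer, string text)
    {
        var terms = new List<string>();
        TokenStream stream = analyzer.TokenStream("_content", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();

        while (stream.IncrementToken())
        {
            terms.Add(term.Term);
        }

        return terms;
    }

    public static void Main()
    {
        var analyzer = new PorterStemLowerCaseKeywordAnalyzer();

        // The word forms from the beginning of the post.
        foreach (var word in new[] { "deny", "denies", "denied", "tool", "tools" })
        {
            Console.WriteLine("{0} -> {1}", word, string.Join(", ", Analyze(analyzer, word)));
        }
    }
}
```

If stemming works, all forms of the same word print the same term. Note that this analyser is built on KeywordTokenizer, so a multi-word input comes back as one term rather than one term per word.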

Then register the field mapping in the content search configuration:

<fieldNames>
  <field fieldName="_content" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
    <analyzer type="Feature.StemmedSearch.Search.PorterStemLowerCaseKeywordAnalyzer, Feature.StemmedSearch" />
  </field>
</fieldNames>

I used the _content field as an example, but it is better not to change the fields that Sitecore ships out of the box and to use your own custom fields instead. After rebuilding the indexes, we can see that all search terms are saved in stemmed form:

[Screenshot: stemmed terms in the rebuilt index]

And when you check the search queries by turning on verbose logging, you will see that the query terms for the _content field are stemmed as well:

[Screenshot: stemmed query terms in the verbose search log]

Hurray! Now our Sitecore website search behaves a bit more like Google search. :-)

Stay tuned, in the second part we will do the same for Solr indexes.

Anton Tishchenko
Head of Digital Engineering
Anton has worked as a developer since 2007. He is a highly experienced Sitecore developer who previously worked as a Technical Team Lead at Sitecore. Anton's expertise in the Sitecore platform is formidable; he's definitely one of the world's finest Sitecore ninjas, and in 2019 he was recognised as the only Sitecore MVP in Ukraine when he achieved his Technology MVP status.

02 Nov 2018 - 7 minute read