Jai’s Weblog – Tech, Security & Fun…

ElasticSearch: Boosting score for content relevancy

Posted by Jai on April 10, 2013

Every search solution is built to serve relevant content to its users. Each one provides different algorithms and mechanisms for serving content that matches your business requirements. The flexibility to influence the relevancy of search results lets a business serve its end customers better. In this post, we will cover in detail how ElasticSearch helps you retrieve relevant content and the ways you can affect the scoring of your search data.

Content Relevance Ranking

Content relevancy means returning the documents that are relevant to a search query. Every business domain defines relevancy differently. For example, a typical eCommerce platform will rank products differently for different customers, different search criteria and different times.

[Figure: content relevancy scoring]

You would like each product to score differently based on different criteria. Some typical business requirements for product relevancy are:

  • Relevancy based on the search term: how many times a particular term occurs in a document
  • Relevancy based on different content fields of the document, e.g. title is more important than description
  • Relevancy based on a combination of different field values
  • Relevancy based on current conditions and field values at query time
  • Negative relevancy for some products based on certain criteria or field values
  • Higher relevancy for new data
  • Higher relevancy for products that are most liked, visited, etc.

Lucene Relevancy Ranking

Lucene is an information retrieval library that allows users to index data and search it.

Have a look at the Lucene scoring formula (the Lucene scoring algorithm), which allows us to retrieve the relevant documents matching our query.

Lucene computes the score for each document based on different matching parameters and their weights. Some of the things taken into account are:

  • How many of the query terms are found in the document
  • How frequently the term occurs in the document
  • How frequently the term occurs in the collection of documents, i.e. how rare the term is
  • Relative weight for the term at query time
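The factors listed above can be sketched in a few lines of plain Java. This is only a simplified illustration that loosely follows Lucene's classic TF-IDF similarity (square root of the term frequency, logarithmic inverse document frequency); it ignores length normalization, query normalization and the coordination factor. The class and method names are made up for this example and are not part of Lucene.

```java
/** Simplified, illustrative version of the classic TF-IDF factors (not Lucene's real code). */
final class TfIdfSketch {

    /** Term frequency factor: grows with the square root of the raw count. */
    static double tf(int termFreqInDoc) {
        return Math.sqrt(termFreqInDoc);
    }

    /** Inverse document frequency: rarer terms get a larger value. */
    static double idf(int docsContainingTerm, int totalDocs) {
        return 1.0 + Math.log((double) totalDocs / (docsContainingTerm + 1));
    }

    /** Contribution of one query term, ignoring all normalization factors. */
    static double termScore(int termFreqInDoc, int docsContainingTerm,
                            int totalDocs, double queryBoost) {
        double idf = idf(docsContainingTerm, totalDocs);
        return tf(termFreqInDoc) * idf * idf * queryBoost;
    }

    public static void main(String[] args) {
        // Same term frequency, but the second term is much rarer in the collection.
        System.out.println(termScore(4, 500, 1000, 1.0));
        System.out.println(termScore(4, 5, 1000, 1.0));
    }
}
```

In this sketch the rarer term contributes the higher score, and a query-time boost simply scales the contribution of its term.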

The OpenRelevance project allows you to do relevancy testing for information retrieval.

ElasticSearch Boosting Content

ElasticSearch is based on Lucene and uses the same scoring mechanism to score documents. On top of the existing functionality, ElasticSearch provides different ways to affect the scoring of documents.

We will look at the different options ElasticSearch provides for retrieving search results that match your needs.

What is Boosting

Boosting is the process of enhancing document relevancy.

As stated earlier, each document is given a score for a query, calculated by the scoring algorithm. Boosting allows you to alter that score. You can choose to place a document on top, or demote it with a low boost value so that it comes last in the result set.

You can set the boost value for a document both at index time and at query time.

Boosting at Index Time

While indexing your search data, you can boost content based on existing business logic. Different businesses and domains will use different index time boost values.

For example, on a typical eCommerce platform, product analytics such as the number of times a product has been sold are an important input for the index time boost. This lets you give a higher boost to best-selling products.

For newspaper, magazine or article data, the publishing date can be one of the deciding factors, giving a higher boost to recent documents.

Document Boost field

Each document contains a standard field, _boost, which controls the boost level of the indexed document.

The document mapping allows you to set a default boost value. This default is used as the document boost if no value is specified explicitly.

To set the default boost value for a document:


{
    "tweet" : {
        "_boost" : {"name" : "_boost", "null_value" : 1.0}
    }
}

The default value for the boost field here is set to 1.0.

To add a boost field value for a document:


{
    "tweet" : {
        "_boost" : 2.2
    }
}

contentBuilder.field("_boost", 2.2f);

The above document will be indexed with a boost value of 2.2 instead of the default.

Boosting at Query time

We will look at the different query options and how each of them affects the ranking or scoring of content.

ElasticSearch allows you to add boosting to most queries: every QueryBuilder that implements BoostableQueryBuilder lets you set a boost for your query.

To add a boost to a query, use the exposed boost method.


public interface BoostableQueryBuilder<B extends BoostableQueryBuilder<B>>
{
    public B boost(float boost);
}

String queryString = "text";
QueryStringQueryBuilder queryStringQueryBuilder = QueryBuilders.queryString(queryString).boost(1.5f);

The query string query, the most commonly used query, also allows you to add boosting. By default, however, the score for each document is calculated automatically by the Lucene scoring algorithm.

Think of a business scenario where we need to execute multiple queries on the same search data and want documents matching one particular query to score higher. You can use query boosting when executing multiple queries whose documents need to be scored relative to each other.

Constant Score Query

The constant score query allows you to wrap a filter or query and give every matching document the same constant score.


QueryBuilders.constantScoreQuery(QueryBuilders.queryString(queryString)).boost(1.5f);
QueryBuilders.constantScoreQuery(FilterBuilders.termFilter("name", "jai")).boost(2.5f);

As in the earlier case, when you run multiple queries you can give a higher boost to one query and the same constant boost to the others. You can use the constant score query when executing multiple queries whose matches should all receive the same score.

Custom Score Query

Most of the time you use a single query, and you may wish to affect the scoring of a document based on the values of its fields.

The custom score query allows you to achieve this. It wraps another query and lets you customize the score with a script expression that can use the field values of your document.


String script = "_score * doc['age'].value";
CustomScoreQueryBuilder customScoreQueryBuilder = QueryBuilders.customScoreQuery(QueryBuilders.matchAllQuery()).script(script);

The scripting is based on the MVEL expression language.

Custom Filters Score Query

Representing your data calculations purely as script expressions over fields can be cumbersome. Usually you prefer to express your conditions as filters. ElasticSearch allows you to add selective boosting based on filter criteria in the custom score query.

The custom filters score query allows you to wrap a query and apply a boost or a script when the document matches a given filter.


String script = "_score * doc['age'].value";
CustomFiltersScoreQueryBuilder customFiltersScoreQueryBuilder = QueryBuilders.customFiltersScoreQuery(QueryBuilders.matchAllQuery())
                .add(FilterBuilders.termFilter("name", "jai"), 2.0f)
                .add(FilterBuilders.termFilter("age", "31"), script);

You can use the score mode to control how the boosts are combined when multiple filters match.
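To make the effect of the score mode concrete, here is a small plain Java sketch of how the boosts of several matching filters could be combined. It assumes the modes first, min, max, total, avg and multiply, and that the combined value is multiplied with the score of the wrapped query. The class and its methods are hypothetical and only illustrate the arithmetic; they are not ElasticSearch code.

```java
import java.util.List;

final class ScoreModeSketch {

    enum ScoreMode { FIRST, MIN, MAX, TOTAL, AVG, MULTIPLY }

    /** Combines the boosts of all matching filters; returns 1.0 if none matched. */
    static double combine(List<Double> matchedBoosts, ScoreMode mode) {
        if (matchedBoosts.isEmpty()) {
            return 1.0;
        }
        double first = matchedBoosts.get(0);
        double min = first, max = first, total = 0.0, product = 1.0;
        for (double b : matchedBoosts) {
            min = Math.min(min, b);
            max = Math.max(max, b);
            total += b;
            product *= b;
        }
        switch (mode) {
            case FIRST:    return first;
            case MIN:      return min;
            case MAX:      return max;
            case TOTAL:    return total;
            case AVG:      return total / matchedBoosts.size();
            case MULTIPLY: return product;
            default: throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    /** Assumed final score: the wrapped query's score times the combined boost. */
    static double score(double queryScore, List<Double> matchedBoosts, ScoreMode mode) {
        return queryScore * combine(matchedBoosts, mode);
    }
}
```

For a document matching two filters with boosts 2.0 and 3.0, this sketch yields 2.0 for first, 3.0 for max, 5.0 for total and 6.0 for multiply.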

An important requirement is being able to pass parameters from the query environment and compare them with your document data. A typical example is the current date and time at query execution. Let’s say you would like the boost to depend on a date field in your document. You can pass the current time as a parameter and calculate the score from the parameter value.


String script = "(0.08 / ((3.16*pow(10,-11)) * abs(currenttimeinmillis - doc['date'].date.getMillis()) + 0.05)) + 1.0";
CustomFiltersScoreQueryBuilder customFiltersScoreQueryBuilder = QueryBuilders.customFiltersScoreQuery(QueryBuilders.matchAllQuery())
                .add(FilterBuilders.termFilter("name", "jai"), 2.0f)
                .add(FilterBuilders.existsFilter("date"), script)
                .param("currenttimeinmillis", (Long)new Date().getTime());
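To get a feeling for the values this script produces, the same formula can be evaluated in plain Java. The sketch below only mirrors the arithmetic of the script; the class and method names are made up for illustration. Note that the constant 3.16*10^-11 is roughly one divided by the number of milliseconds in a year, so the first term in the denominator is approximately the document age in years.

```java
final class RecencyBoostSketch {

    /** Mirrors the script: 0.08 / (3.16e-11 * |now - docDate| + 0.05) + 1.0 */
    static double boost(long nowMillis, long docDateMillis) {
        double ageMillis = Math.abs(nowMillis - docDateMillis);
        return 0.08 / (3.16e-11 * ageMillis + 0.05) + 1.0;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDay = 24L * 60 * 60 * 1000;
        System.out.println("today:       " + boost(now, now));
        System.out.println("30 days old: " + boost(now, now - 30 * oneDay));
        System.out.println("1 year old:  " + boost(now, now - 365 * oneDay));
    }
}
```

A document dated right now gets a boost of about 2.6, and the value decays towards 1.0 as the document gets older, so recent documents are favoured without old ones being pushed below their original score.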

For a variable boosting example that retrieves recent documents based on the current date, see Advanced scoring in ElasticSearch.

The Boosting Query

The boosting query allows you to demote documents that match a demoting (negative) query by reducing their score by a given factor.

Consider a search where you want to display all the documents and facets generated from some business condition. Some of those documents, however, match a critical business condition, such as products that are no longer in stock, and you would like to place them lower in the search results. You can use the boosting query to give a lower score to products that are not in stock.


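// Lucene's BoostingQuery: documents that also match negativeQuery have their score multiplied by 0.01
Query balancedQuery = new BoostingQuery(positiveQuery, negativeQuery, 0.01f);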

For the complete syntax, check the Boosting Query documentation.
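The demotion itself is simple arithmetic. As a minimal sketch, assuming that a document matching the negative query keeps its positive score multiplied by the negative boost factor, the effect on ranking looks like this. The class is hypothetical and independent of Lucene and ElasticSearch.

```java
final class BoostingQuerySketch {

    /** Keeps the positive score, scaled down when the negative query also matches. */
    static double score(double positiveScore, boolean matchesNegative, double negativeBoost) {
        return matchesNegative ? positiveScore * negativeBoost : positiveScore;
    }

    public static void main(String[] args) {
        double negativeBoost = 0.01; // demotion factor for out-of-stock products
        // An out-of-stock product with a strong text match...
        double outOfStock = score(3.0, true, negativeBoost);
        // ...now ranks below an in-stock product with a weaker match.
        double inStock = score(1.0, false, negativeBoost);
        System.out.println(outOfStock < inStock);
    }
}
```

The out-of-stock product still appears in the results and still counts towards facets; it is only pushed down the list.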

Defining Field Boosting

While querying data you can also specify boosting at the field level, stating which fields matter more for relevancy. Let’s say the text data for a product is spread over different fields, for example title, description, tags, categories and metadata. We would like to boost matches in the title more than matches in the tags.


String queryString = "text";
QueryStringQueryBuilder queryStringQueryBuilder = QueryBuilders.queryString(queryString)
                .field("title", 1.75f)
                .field("description", 1.5f)
                .field("tags", 1.35f);

If you don’t specify which fields to search in, the default _all field is used for querying.

Some of the upcoming features that will be very relevant to scoring are:

Index Boost

Index boost allows you to define boosting at the index level, which you can use when searching across multiple indices.


{
    "indices_boost" : {
        "index1" : 1.4,
        "index2" : 1.3
    }
}

Query Rescorer

Rescoring allows you to reorder the top documents returned by your query and filters. It applies a secondary query to alter the scores of the top results.

For details, have a look at Rescore.
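The idea behind rescoring can be sketched in plain Java as follows. The sketch assumes that only a window of the top hits is rescored and that the new score is a weighted sum of the original score and the secondary score. The Hit class, the SecondaryScorer interface and the parameter names are all hypothetical stand-ins, not the ElasticSearch API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RescoreSketch {

    static final class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    /** Hypothetical secondary scorer standing in for the rescore query. */
    interface SecondaryScorer {
        double score(String docId);
    }

    /**
     * Re-ranks only the first windowSize hits; the rest keep their position.
     * The hits must already be sorted by descending primary score.
     */
    static List<Hit> rescore(List<Hit> hits, int windowSize, double queryWeight,
                             double rescoreWeight, SecondaryScorer secondary) {
        int window = Math.min(windowSize, hits.size());
        List<Hit> top = new ArrayList<>();
        for (Hit h : hits.subList(0, window)) {
            double combined = queryWeight * h.score + rescoreWeight * secondary.score(h.id);
            top.add(new Hit(h.id, combined));
        }
        // Sort the rescored window by descending combined score.
        top.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        List<Hit> result = new ArrayList<>(top);
        result.addAll(hits.subList(window, hits.size()));
        return result;
    }
}
```

Because the secondary scorer only runs on the window of top hits, it can afford to be more expensive than the primary query.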

Using a combination of the boosting solutions above allows you to retrieve the results that are most relevant for your business data.
