An alternative way to ElasticSearch

You are viewing revision #4 of this wiki article.
This version may not be up to date with the latest version.
You may want to view the differences to the latest version or see the changes made in this revision.
« previous (#3)next (#5) »

Hello!

This article is for those who have dealt with the complexity of Elasticsearch or any other indexing machines and are looking for an easier way to index the existing database without additional effort.

Why this post?

The amount of data increases every day. As a result, the search in the databases becomes longer and longer. Conventional data structures must be realigned in order to be able to access information more quickly. There are already database systems like Elasticsearch that can do this. However, such systems also have disadvantages.

The most noticeable major drawbacks are:

Learning a new query language. SQL won’t get you far or is not flexible enough.
The existing programs must be rewritten in order to process the new result sets appropriately.
The safety regulations must be defined again.
A second database must be set up, which in principle contains the same data.

Who will benefit from this post?

This information is useful for any programmer who wants to integrate an index database into his existing system in a simple way without additional effort.

The basic idea behind Indexing-machines

We will use this simple Table to demonstrate what an Index-machine does

Tablename: object

UID	TITLE	DESCRIPTION
4711	Rudyard Kipling	If you can keep your head when all about you …
4712	John Magee	I have slipped the surly bonds of Earth and danced the skies on laugher-silvered wings …
4713	Wiliam Wordsworth	Ten thousand saw I at a glance, Tossing their heads in sprightly dance…

With this request we can find very specific texts in this single table:

SELECT ID, Title, Description
FROM object
WHERE Description like '%head%'

But what if we want to find ‚%head%‘ in all tables of our database? We have to write a code to do this job for us. This is inefficent and will work very slowy. The idea behind Elasticsearch and other indexing tables is – so far I understood – to break the strings in single tokens. That means in a very easy way that we have to transform the horicontal order of the table into a vertical order.

Tablename: ncp_index

UID	TABLENAME	FIELDNAME	ID	TOKEN
1001	object	Description	4711	if
1002	object	Description	4711	you
1003	object	Description	4711	can
…				
1010	object	Description	4712	I
1011	object	Description	4712	have
1012	object	Description	4712	slipped
…

We can tokenize any field of any table of our database into the table ncp_index. Now we can find with a single query very fast any (tokenized) word in our hole database.

SELECT Tablenname, Fieldname, Token
FROM ncp_index
WHERE Token like '%head%'

That is the secret of any Index-Searchengine. Yes, the ncp_index table has a lot of redundant data that we can normalize as follows:

Every field is stored in a system table and has a unique id. let us call it field_id Every content of a field has a lot of same words. These words should be stored only once in a separat words-table.

Our ncp_index table looks now so:

UID	FIELD_ID	ID	TOKEN_ID
1001	123	4711	1
1002	123	4711	2
1003	123	4711	3
…			
1010	123	4712	4
1011	123	4712	5
1012	123	4712	6
…	
		
Systemtable: fields

UID	TABLENAME	NAME
122	object	Name
123	object	Description
…		

Tablename: word

UID	TOKEN
1	if
2	you
3	can
…

Some basic examples

/**
 * @author ncp <necips@live.de>
 */
class NCPSearch
{
    /**
     * @param $model
     * @param $tablename
     * @param $fieldnames
     */
    public static function delete_ncp_search_item($model, $tablename) {
        $criteria = new CDbCriteria;
        $criteria->condition = "tablename = :tablename " .
                    "AND id = :id ";
        $criteria->params[":tablename"] = $tablename;
        $criteria->params[":id"] = $model->uid;
        NCPIndexModel::model()->deleteAll($criteria);
    }
    /**
     * @param $model
     * @param $tablename
     * @param $fieldnames
     */
    public static function update_ncp_search_item($model, $tablename, $fieldnames) {
        NCPSearch::delete_ncp_search_item($model, $tablename);
        NCPSearch::insert_ncp_search_item($model, $tablename, $fieldnames);
    }
    /**
     * @param $model
     * @param $tablename
     * @param $fieldnames
     */
    public static function insert_ncp_search_item($model, $tablename, $fieldnames) {
        foreach ($fieldnames as $fieldname) {
            $NCP_index_model = new NCPIndexModel("create");
            $NCP_index_model->tablename = $tablename;
            $NCP_index_model->fieldname = $fieldname;
            $NCP_index_model->id = $model->uid;
            $NCP_index_model->save();
            // a very simple way to tokenize the strings!
            $raw = strip_tags($model->{$fieldname});
            $tokens = explode( ' ', $raw);
            foreach ($tokens as $token) {
                $NCP_token_model = new NCPTokenModel("create");
                $NCP_token_model->NCP_index_uid = $NCP_index_model->uid;
                $NCP_token_model->token = $token;
                $NCP_token_model->save();
            }
        }
    }
    /**
     * @param $models
     * @param $tablename
     * @param $fieldnames
     */
    public static function insert_ncp_search_items($models, $tablename, $fieldnames) {
        foreach ($models as $model) {
            NCPSearch::insert_ncp_search_item($model, $tablename, $fieldnames);
        }
    }
}


// main.php:

// initialize ncp_search table once with all tables which has to be indexed in the main function
NCPSearch::insert_ncp_search_items(UserModel::model()->findAll(), "user", ["login", "mail", "name_last", "name_first"]);
NCPSearch::insert_ncp_search_items(DepartmentModel::model()->findAll(), "department", ["title", "description"]);
NCPSearch::insert_ncp_search_items(ObjectModel::model()->findAll(), "object", ["title", "description"]);
  

// model.php:

class Object : Model
{
  function afterSave()  {
    
    
     // insert this code to synchronize the informations on ncp_index
     if ($this->status === ObjectStatus::DELETED)
            NCPSearch::delete_ncp_search_item($this, "object");
        else
            NCPSearch::update_ncp_search_item($this, "object", ["title", "description"]);       
            
    ...
  }
  
  ...
  
}

Conclusion

These are my basic observations on this subject. These are the first steps to a search-engine that can index existing tables so that informations can be found quickly.

Thanks to

Translated with www.DeepL.com/Translator
Elasticsearch: https://www.elastic.co/blog/a-practical-introduction-to-elasticsearch

Github

https://github.com/Necip8/ncp_search

I hope this information helps you to build your own super search engine!

1 0

2 followers

Viewed: 49 808 times

Version: all

Category: How-tos

Tags: Index Engine

Written by:

Necip

Last updated by:

Necip

Created on: Jul 1, 2018

Last updated: 3 years ago

Update Article

Revisions

View all history

3 years ago by Necip
Reverted to revision #1
7 years ago by Necip
edited
8 years ago by samdark
Removed "Hello" since it was in the shor...
8 years ago by Necip
edited
8 years ago by Necip
edited
8 years ago by Necip
edited
8 years ago by Necip
edited
8 years ago by Necip

User Contributed Notes 4

#20272

0 0

Additional Information

Algorithm of misspelled words

e.g. "Field Poppy"

Method:
uppercase the word: FIELD POPPY
remove duplicate letters and white spaces: FIELDPOY
sort letters: DEFILOPY

Create binary code based on the letters A-Z
1: the letter is present in the word
0: the letter is not in word

Word        ABCDEFGHIJKLMNOPQRSTUVWXYZ
FIELD POPPY 00011100100100110000000010

Now you can also search for misspelled words of "Field Poppy".
The letters can even be wildly confused!

Searching for "00011100100100110000000010" will find all words containing the letters of F,I,E,L,D,P,O and Y.

Counting the found binaries in the binary code will give us an grad of accurancy.
For example we are searching for "FIELD" and we will found this binary codes of saved words:

Word        ABCDEFGHIJKLMNOPQRSTUVWXYZ
FIELD  		00011100100100000000000000		sum: 5	of 5	= 100%

Word        ABCDEFGHIJKLMNOPQRSTUVWXYZ
FIELD POPPY 00011100100100110000000010		sum: 8	of 5	= 62,5%

Necip at Jul 3, 2018, 6:11:53 PM

#20307

Data indexing (see ncp_search_engine) can be used for a chronological trend analysis to determine forecasts that can serve as a control tool for a company. It brings more clarity to questions such as "What is the average opinion/mood about the new product?"

First, the words must be weighted between a scale of -10 to 10. The feedback from customers, for example, is indexed.

2018-08-12-09_25_23-plasma-website-xlsx-excel.png?w=558

(1) Is the weighted word list

(2) Is the indexed table (e.g. feedback or forum post). The column "Weight" is determined from the word list.

I used Excel here to make a demonstrative pivot analysis.

Marked the range from E2 to G14 and select the pivot chart

Select the columns in this order: Time, word and weighting.

2018-08-12-09_26_37-plasma-website-xlsx-excel.png?w=558

The periods can be "compressed", i.e. they are combined from days to months and from months to years etc.

2018-08-12-09_26_57-plasma-website-xlsx-excel.png?w=558

Here is an overview of the overall trend of the feedback.

German translation

Necip at Aug 12, 2018, 7:48:57 AM

#20315

I think it is better to use fulltext search that is already part of your RDBMS, than rolling your own. I am using Postgresql fulltext search (in Yii2 application of course) and while it is not as advanced as Elasticsearch, I believe for most use cases it is the right tool.

Lazarević Ivica at Aug 18, 2018, 10:30:25 AM

#20334

How to build a search-engine with MySQL
https://www.mullie.eu/mysql-as-a-search-engine/

Necip at Sep 6, 2018, 5:47:44 AM

Please only use comments to help explain the above article.
If you have any questions, please ask in the forum instead.

Signup or Login in order to comment.

Categories

Popular Tags

Recent Comments

An alternative way to ElasticSearch

Revisions