When semantic search works it appears to be, depending upon your perspective, either magical or intelligent in the way a person is. As it turns out, neither perspective is correct. Semantic search, just like Boolean search before it, is based upon mathematics, statistical models and probabilities. What is different about it, however, is the degree of complexity and self-checking that goes on in the background before an answer appears in the search results for a query.
The machine learning algorithms that power all this are trained on a three-step program that uses different sets of Training and Test Data to perform a sequence of (a minimal code sketch follows the list):
- Training
- Validation
- Testing
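To make the sequence concrete, here is a minimal sketch of those three steps, assuming Python with scikit-learn and a synthetic dataset. The names, split proportions and model are illustrative only, not a description of any real search engine's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A synthetic stand-in for real-world data.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# First carve off the Test Data the model will not see during development.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into Training and Validation sets.
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # 1. Training
print("validation accuracy:", clf.score(X_val, y_val))         # 2. Validation
print("test accuracy:      ", clf.score(X_test, y_test))       # 3. Testing
```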
Remember that, in the wild, data is subject to the 4Vs of:
- Volume
- Velocity
- Variety
- Veracity
Solving for the last one first allows us to filter out many of the ambiguities produced by the first three. But here's the problem: Testing is only as good as the Classification we started off with. If the Classifier we have put in place is not good enough to recognize things it has never seen before, it will be unable to deal with the new data that appears on the web all the time, and search will be unable to handle queries it has never encountered before.
This may seem like a simple problem to solve: test the outcomes of validation, see where the algorithm has failed and then tweak the parameters so that it performs better. The problem is that as we tweak the parameters we are, incrementally, informing the algorithm of the nature of our Test Data. When we then test it, it appears to perform better and better against the Test Data we are using because it has come to 'see' it, yet it performs poorly in the wild against data it has never encountered before. What has happened in this case is that the algorithm knows exactly what we are testing for, so it is 'cheating': it scores well on the test but is unable to generalize sufficiently to use its training to understand new data.
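Here is a hypothetical illustration of that 'cheating', again assuming scikit-learn: we tweak one parameter dozens of times against the same Test Data, keep whichever setting happened to score highest, and then compare against data the tweaking never touched. The exact numbers will vary from run to run, but the score on the tuned-against set is systematically optimistic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# One pool of data: part for training, part we repeatedly test
# against while tweaking, and part standing in for "the wild".
X, y = make_classification(n_samples=4_000, n_features=20,
                           n_informative=5, random_state=0)
X_dev, X_wild, y_dev, y_wild = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_tune, y_train, y_tune = train_test_split(X_dev, y_dev,
                                                    test_size=0.5, random_state=0)

# Tweak the parameter again and again against the same Test Data,
# keeping whatever setting happens to score best on it.
best_k, best_score = 1, 0.0
for k in range(1, 50):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_tune, y_tune)
    if score > best_score:
        best_k, best_score = k, score

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("score on the data we tuned against:", round(best_score, 3))
print("score in the wild:                 ", round(final.score(X_wild, y_wild), 3))
```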
It’s a case of programmer bias affecting the performance of the algorithm.
The way around it is a paradox of sorts. Instead of using two data sets we use three: one we train on, one we test against and tweak, and one we never look at during development. When we finally test against that unseen set we can determine whether the algorithm has 'cheated' or whether it is genuinely capable of generalizing to correctly understand new data. This practice of holding data back to check generalization is called Cross Validation.
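A minimal sketch of that safeguard, under the same scikit-learn assumptions: the development data is rotated through train/validate folds while we tweak, and the third set stays hidden until the very end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold back a set the algorithm never sees while we are tweaking.
X_dev, X_hidden, y_dev, y_hidden = train_test_split(X, y, test_size=0.2, random_state=0)

# Rotate the rest through five train/validate folds (cross-validation).
clf = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(clf, X_dev, y_dev, cv=5).mean())

# Only once all tweaking is finished do we look at the hidden set.
print("accuracy on unseen data: ", clf.fit(X_dev, y_dev).score(X_hidden, y_hidden))
```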
Why is all of this important? Because if you are asking just how much data about you semantic search needs, the answer is: as much as you can throw at it. There is never a specific amount of data that is enough when semantic search relies on constant judgements based upon its ability to generalize and project. When you consider that it can take more than 30,000 examples of one type of data to train an algorithm, you begin to realize the scale of the task at hand and why doing as much as possible to help search understand you will only help you and your online business.
Get smart: SEO Help: 20 Semantic Search Steps that Will Help Your Business Grow is a practical step-by-step guide to applying semantic search principles to your business.