Frequently Asked Questions
Semantic Search

What kind of information does the Cortical.io Semantic Search Engine process?

Any kind of structured or unstructured text data, including emails, presentations, webpages, contracts, CVs, clinical studies, technical reports, handbooks, and social media posts. The content is always identified by meaning, not keywords.

Which languages are supported?

The primary supported language is English. Support for searching and processing other languages can be added on request.

What kind of information can the engine identify?

The Semantic Search Engine can process any kind of text. Among other things, the engine can identify the most relevant:

  • Documents for a search query
  • Answers to a customer query
  • Candidates for a job description
  • Product recommendations based on purchase history
  • Information on market competition
  • Sources of evidence within scientific literature

How does the engine handle ambiguous search queries?

The Semantic Search Engine represents every word with roughly 16,000 semantic features. With this many features, the engine can draw very fine semantic distinctions and disambiguate terms as each use case requires. For example, the representation of the word organ covers not only the sub-senses 'music' and 'anatomy' but also 'church', 'composer', and 'musical instrument'.

How does the engine handle irrelevant text?

All unstructured text is automatically filtered to remove irrelevant text, such as generic introductions and references, as well as duplicate text. The Semantic Search Engine identifies variations of this type of text throughout the data without requiring exact text searches for each variation.

How does the engine handle inaccurate search queries?

The Semantic Search Engine can match answers with search queries that use different words. For example, done deal and contract signed would be mapped to the same meaning to denote the conclusion of a business agreement.

Out of the activated semantic features for each of these expressions, a certain percentage of overlap is identified. By measuring this semantic overlap, the Semantic Search Engine understands that both expressions are related and should be mapped to the same meaning within the semantic space.
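The overlap measurement described above can be illustrated with a minimal sketch. It assumes that a semantic fingerprint can be modeled as a set of active feature positions; the function name, the toy fingerprints, and the choice of normalizing by the smaller fingerprint are illustrative assumptions, not the engine's actual implementation.

```python
def overlap_ratio(fp_a: set, fp_b: set) -> float:
    """Fraction of the smaller fingerprint's active features that are
    also active in the other fingerprint (assumed overlap measure)."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


# Hypothetical toy fingerprints: positions of active semantic features.
done_deal = {3, 17, 42, 108, 256, 911}
contract_signed = {3, 17, 42, 300, 911, 1500}
weather_report = {5, 64, 700, 1200, 4000, 9000}

# Related expressions share many active features; unrelated ones share few.
related = overlap_ratio(done_deal, contract_signed)
unrelated = overlap_ratio(done_deal, weather_report)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

In this toy data, the two business expressions share four of their six active features, while the unrelated expression shares none.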

Does the engine understand long sentences?

Yes, the engine can process sentences, paragraphs, and documents of any length. Generally, the longer the query, the clearer the context and the more accurate the search results.

How is the information indexed?

The information is stored as semantic fingerprints whereby all terms are mapped to the documents in which they appear to build an inverted index. The Semantic Search Engine converts a query—a word, paragraph, or document—into a semantic fingerprint and compares the query fingerprint to the document fingerprints stored in the index. This allows the engine to quickly look up query terms (once the index is computed) rather than fully scan all documents at query time.
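As a rough illustration of why an inverted index avoids a full scan, the sketch below maps each active feature position to the documents that contain it, then looks up only the positions present in the query. Keying the index by feature position (rather than by raw term), the function names, and the sample documents are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_index(doc_fingerprints):
    """Map each active feature position to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, fingerprint in doc_fingerprints.items():
        for feature in fingerprint:
            index[feature].add(doc_id)
    return index


def lookup(index, query_fingerprint):
    """Count shared features per document, touching only the index
    entries for features that are active in the query."""
    shared = Counter()
    for feature in query_fingerprint:
        for doc_id in index.get(feature, ()):
            shared[doc_id] += 1
    return shared


# Hypothetical document fingerprints (sets of active feature positions).
docs = {
    "contract.pdf": {3, 17, 42, 911},
    "memo.txt": {17, 64, 700},
    "cv.docx": {5, 1200, 4000},
}
index = build_index(docs)          # computed once, ahead of query time
hits = lookup(index, {3, 17, 911, 2048})
print(hits.most_common())
```

Documents that share no feature with the query, such as cv.docx here, are never visited during the lookup.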

What kind of queries can the engine handle?

The Semantic Search Engine can handle any text query—a word, paragraph, or document. Cortical.io can customize the engine to process text documents in any format (for example, .pdf, .doc(x), .xls(x), .csv, .ppt(x), .html, .xml, and .txt).

Note: Audio, video, and image content (except for scanned paper documents) cannot be processed, and numbers in text documents are not converted into semantic fingerprints.

How are the search results ranked?

The search results are ranked by their semantic fingerprint similarity. Depending on the use case, fine-grained similarity scores can also be used to compare across different document sections (for example, title, body text, and metadata).
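One way to picture section-aware ranking is the sketch below. It assumes Jaccard similarity between fingerprints modeled as sets and a hypothetical weighting of title, body text, and metadata; the engine's real similarity metric and weights are not specified here and may differ.

```python
def similarity(fp_a: set, fp_b: set) -> float:
    """Jaccard similarity between two fingerprints (assumed metric)."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


# Hypothetical per-section weights; a real deployment would tune these.
SECTION_WEIGHTS = {"title": 0.5, "body": 0.4, "metadata": 0.1}


def score(query_fp: set, sections: dict) -> float:
    """Weighted sum of per-section similarities for one document."""
    return sum(
        weight * similarity(query_fp, sections.get(name, set()))
        for name, weight in SECTION_WEIGHTS.items()
    )


def rank(query_fp: set, documents: dict) -> list:
    """Return document ids ordered from most to least similar."""
    return sorted(documents, key=lambda d: score(query_fp, documents[d]), reverse=True)


# Toy data: each document has one fingerprint per section.
documents = {
    "a": {"title": {1, 2}, "body": {1, 2, 3, 4}, "metadata": {9}},
    "b": {"title": {7, 8}, "body": {1, 5, 6}, "metadata": {9}},
}
query = {1, 2, 3}
print(rank(query, documents))
```

Weighting sections separately lets a strong title match outrank a document that only mentions the topic in passing in its body text.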

What kind of file formats can the engine process?

The Semantic Search Engine can process, among others, the following file formats: .pdf, .doc(x), .xls(x), .csv, .ppt(x), .html, .xml, and .txt. Owing to a dedicated OCR pipeline, the engine can also convert scanned paper documents into searchable text.

What are the standard functionalities of the engine?

The Semantic Search Engine expands a text query by matching the query automatically with both exact and approximate results. Custom functionalities can also be added to meet the particular needs of your use case.

Is the engine easy to customize?

As the Semantic Search Engine can be integrated into your existing system, you can change how the search results are displayed in your user interface and adjust the similarity metrics for comparing fingerprints. Cortical.io can also adapt the engine to your use case, for example, by adding components to filter out search results.

What kind of training material is required?

This is highly dependent on the use case. In general, the Semantic Search Engine should be trained on the same kind of material that the engine is expected to search and process. For example, the engine should be trained on emails if it is expected to search and identify relevant information in other emails.

The engine requires little training material, which is particularly helpful in use cases where such material is scarce (for example, in fraud detection).

How long does the training take?

Training the engine takes only a couple of hours. Some manual fine-tuning is usually required to improve the quality of search results.

Do I need to supervise the training?

No, the Semantic Search Engine learns the vocabulary of your company's business domain by analyzing a corpus of relevant information sources—for example, emails, presentations, webpages, contracts, CVs, clinical studies, technical reports, handbooks, and social media posts—in an unsupervised machine-learning approach.

How long does it take before I get a working Semantic Search Engine?

It takes only a few days to get a fully functioning Semantic Search Engine. Collaborating with subject-matter experts to update the engine might be necessary for more accurate search results.

Is the engine easily retrained?

As new product names, feature names, and other technical terms enter the jargon of your business domain, the Semantic Search Engine can be easily retrained (for example, every 6 or 12 months) on the updated vocabulary of your domain. Retraining can run in parallel with the normal operation of the engine and takes only a few hours.

How does the engine integrate into my existing search system?

The engine can be integrated into your existing system as a back-end solution through its REST API.

Can I connect the engine to other applications?

Yes, the engine can in principle be connected to applications like Salesforce, SAP, and SharePoint through its REST API.

Does the engine deliver an analytics dashboard?

The search results can be exported as relational databases and viewed in business intelligence solutions like Tableau.

How scalable is the engine?

The engine is easy to scale. We can switch to a more powerful server or CPU for more processing power, or load-balance between multiple Docker instances to support a higher number of users.

How quickly can the engine process documents?

The Semantic Search Engine can index over a million documents in 40 seconds or less. It can also retrieve more than 100 search results from over a million documents in 0.2 seconds or less. Because it uses an inverted index, the engine can quickly look up query terms (once the index is computed) rather than fully scan all documents at query time.

Note: Processing times can vary considerably depending on document sizes, deployment configurations, and system resources.

How is the engine deployed?

The engine can be installed on your own server (on your company's premises or in your private cloud) or on a third-party server. Third-party cloud production environments are currently operating on Google Compute Engine (GCE) and Amazon Web Services (AWS) instances.

Deployment configurations

  • Standalone JVM distribution, JRE version 8+
  • Docker Engine

Minimum system resources

For a single instance of the engine:

  • 8 GB RAM
  • 1 core

SSD space requirements are negligible.

Still have some questions? Contact us to get the answers!