

Semantic search

Phonexia’s text-embedding based semantic search project demo


Prerequisites

  • Docker installed
  • 25 GB of free disk space
  • 8 GB of RAM
  • basic experience with using Docker images

Running in Docker

First, load the CPU or GPU image from the provided tar file(s):

docker load --input phonexia_semantic_search_<cpu/cuda>_<hash>.tar.gz

Assuming all the necessary data (text documents) are in the folder <your_data_directory>, mount it into the container's /home/data folder via -v <your_data_directory>:/home/data.

We suggest running the container in interactive (-it) mode so that it behaves like a normal terminal.

docker run -it -v <your_data_directory>:/home/data phonexia_semantic_search_cpu

To run with GPU support, use the image with the _cuda suffix together with --gpus all. This requires the NVIDIA Container Toolkit as well as a properly set up NVIDIA driver.

docker run -it --gpus all -v <your_data_directory>:/home/data phonexia_semantic_search_cuda


For the purposes of this demo we provide two scripts: one prepares the semantic vector representation of the presented text documents (the so-called index); the other provides a minimal command-line UI for searching through the indexed documents.

Indexing document collection

To test that indexing works, you may index some example data (copied from ISW):

python3 --document_list /home/example_data/example_isw_10_document_list.txt --output_index_root_dir /home/indices

The index folder is then reported on stdout, e.g.: Output index directory: /home/indices/MODEL_1.0.0__DATA_example_isw_10_document_list.txt_TIME_2023-04-19--13-46-24--360

This folder is important because it is used as the input for the search script.

The simplest way to index your own data in Docker:

python3 --document_list <document_list> --output_index_root_dir <output_dir>

Here <document_list> (e.g. /home/documents.txt) may be a plain list of full paths to *.txt documents in UTF-8 encoding. Each document should contain relatively short lines (roughly sentences), not long paragraphs on a single line, as those would get truncated during indexing.

Example content of document list:

/home/data/meeting_ashari_bago.txt
/home/data/doc_twitter.1234.txt
Each document may also have metadata associated with it; metadata are textual and are specified after the first space on the document's line in <document_list>.

Example content of document list with metadata:

/home/data/meeting_ashari_bago.txt STT transcript of meeting between bosses Ashari and Bago from 17.3.2021
/home/data/doc_twitter.1234.txt Twitter posts related to eventful event

In this case the metadata (e.g. STT transcript of meeting between bosses Ashari and Bago from 17.3.2021) can later be displayed by the search script.

In a CUDA-supporting container you may still run indexing on the CPU by passing the --device cpu option to the indexing script.

Currently there is no way to append documents to an existing index or to merge multiple indices together. Nevertheless, adding this feature in a newer version of Semantic search would not be a technical problem.

Searching document collection index

To test that searching works, you may use the index built from the small example data, e.g.:

python3 --index /home/indices/MODEL_1.0.0__DATA_example_isw_10_document_list.txt_TIME_2023-04-19--13-46-24--360

To search over your own data, provide an index directory created by the indexing script:

python3 --index <index_dir>

Here <index_dir> is the output directory of the indexer (e.g. /home/indices/MODEL_1.0.0__DATA_example_isw_10_document_list.txt_TIME_2023-04-19--13-46-24--360).

This command starts the search UI, which includes its own built-in help.

In a CUDA-supporting container you may still run the search on the CPU by passing the --device cpu option to the search script.

Troubleshooting

The indexing script may run into memory issues with very long documents. In that case, try lowering the value of the --batch_size parameter (see the script's -h help).

If that is not sufficient, try lowering the value of the --max_sentence_length parameter as well, preferably with the --auto_split feature enabled. Note: lowering only --max_sentence_length may remove parts of sentences if your documents contain long lines. The indexing script prints a warning when a line's length exceeds the model's capabilities; in that case you may correct your data before indexing, or use the --auto_split parameter (see the script's help). Note that --auto_split may incorrectly split a sentence inside an unlikely kind of abbreviation (e.g. “N. A. S. A.”) or other unusual data.
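The truncation issue can also be mitigated by pre-splitting long lines yourself before indexing. A naive sketch of such a splitter (this is our own heuristic, not the actual --auto_split implementation; it guards single-letter abbreviations like “N. A. S. A.” but shares the general pitfall mentioned above):

```python
import re

# Split after ., ! or ? followed by whitespace, but not after a lone
# capital letter with a period (single-letter abbreviations).
_BOUNDARY = re.compile(r"(?<=[.!?])(?<!\b[A-Z]\.)\s+")

def split_long_line(line, max_len=256):
    """Split an over-long line into sentence-sized chunks so that no
    chunk exceeds max_len characters (very long sentences are hard-wrapped
    at the last space before the limit)."""
    chunks = []
    for sent in _BOUNDARY.split(line.strip()):
        while len(sent) > max_len:
            cut = sent.rfind(" ", 0, max_len)
            cut = cut if cut > 0 else max_len  # no space: hard cut
            chunks.append(sent[:cut].strip())
            sent = sent[cut:].strip()
        if sent:
            chunks.append(sent)
    return chunks
```

Running each document through this before writing it out keeps every line within the model's capacity without silently dropping sentence tails.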

The CUDA version and its cooperation with Docker may be problematic; for now we suggest running on the CPU, especially for small amounts of data. Indexing around 1000 lines of text should take under one minute even on the CPU.


These scripts are written as a proof of concept rather than a production-ready solution.


We value your feedback and would love to hear from you! Please don’t hesitate to reach out via email at [email protected] with any comments, suggestions, or questions you may have about this project.
