Search with BERT vectors in Solr and Elasticsearch

Overview

BERT models with Solr and Elasticsearch

streamlit-search_demo_solr-2021-05-13-10-05-91.mp4
streamlit-search_demo_elasticsearch-2021-05-14-22-05-55.mp4

This code is described in the following Medium stories, taking one step at a time:

Neural Search with BERT and Solr (August 18,2020)

Fun with Apache Lucene and BERT Embeddings (November 15, 2020)

Speeding up BERT Search in Elasticsearch (March 15, 2021)

Ask Me Anything about Vector Search (June 20, 2021) This blog post gives the answers to the 3 most interesting questions asked during the AMA session at Berlin Buzzwords 2021. The video recording is available here: https://www.youtube.com/watch?v=blFe2yOD1WA

Bert in Solr hat Bert with_es burger


Tech stack:

  • bert-as-service
  • Hugging Face
  • solr / elasticsearch
  • streamlit
  • Python 3.7

Code for dealing with Solr has been copied from the great (and highly recommended) https://github.com/o19s/hello-ltr project.

Install tensorflow

pip install tensorflow==1.15.3

If you try to install tensorflow 2.3, bert service will fail to start, there is an existing issue about it.

If you encounter issues with the above installation, consider installing full list of packages:

pip install -r requirements_freeze.txt

Let's install bert-as-service components

pip install bert-serving-server

pip install bert-serving-client

Download a pre-trained BERT model

into the bert-model/ directory in this project. I have chosen uncased_L-12_H-768_A-12.zip for this experiment. Unzip it.

Now let's start the BERT service

bash start_bert_server.sh

Run a sample bert client

python src/bert_client.py

to compute vectors for 3 sample sentences:

    Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174
      -0.31721866]
     [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
      -0.11345179]
     [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536
       0.07640187]]

This sets up the stage for our further experiment with Solr.

Dataset

This is by far the key ingredient of every experiment. You want to find an interesting collection of texts, that are suitable for semantic level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia, that I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into data/dbpedia directory in bz2 format. You don't need to extract this file onto disk: the provided code will read directly from the compressed file.

Preprocessing and Indexing: Solr

Before running preprocessing / indexing, you need to configure the vector plugin, which allows to index and query the vector data. You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring

After the plugin's jar has been added, configure it in the solrconfig.xml like so:

">

  

Schema also requires an addition: field of type VectorField is required in order to index vector data:

">

  

Find ready-made schema and solrconfig here: https://github.com/DmitryKey/bert-solr-search/tree/master/solr_conf

Let's preprocess the downloaded abstracts, and index them in Solr. First, execute the following command to start Solr:

bin/solr start -m 2g

If during processing you will notice:

<...>/bert-solr-search/venv/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=500" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)

The index_dbpedia_abstracts_solr.py script will output statistics:

Maximum tokens observed per abstract: 697
Flushing 100 docs
Committing changes
All done. Took: 82.46466588973999 seconds

We know how many abstracts there are:

bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

Preprocessing and Indexing: Elasticsearch

This project implements several ways to index vector data:

  • src/index_dbpedia_abstracts_elastic.py vanilla Elasticsearch: using dense_vector data type
  • src/index_dbpedia_abstracts_elastiknn.py Elastiknn plugin: implements own data type. I used elastiknn_dense_float_vector
  • src/index_dbpedia_abstracts_opendistro.py OpenDistro for Elasticsearch: uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

Each indexer relies on ready-made Elasticsearch mapping file, that can be found in es_conf/ directory.

Preprocessing and Indexing: GSI APU

In order to use GSI APU solution, a user needs to produce two files: numpy 2D array with vectors of desired dimension (768 in my case) a pickle file with document ids matching the document ids of the said vectors in Elasticsearch.

After these data files get uploaded to the GSI server, the same data gets indexed in Elasticsearch. The APU powered search is performed on up to 3 Leda-G PCIe APU boards. Since I’ve run into indexing performance with bert-as-service solution, I decided to take SBERT approach from Hugging Face to prepare the numpy and pickle array files. This allowed me to index into Elasticsearch freely at any time, without waiting for days. You can use this script to do this on DBPedia data, which allows choosing between:

EmbeddingModel.HUGGING_FACE_SENTENCE (SBERT)
EmbeddingModel.BERT_UNCASED_768 (bert-as-service)

To generate the numpy and pickle files, use the following script: scr/create_gsi_files.py. This script produces two files:

data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors.npy
data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors_docids.pkl

Both files are perfectly suitable for indexing with Solr and Elasticsearch.

To test the GSI plugin, you will need to upload these files to GSI server for loading them both to Elasticsearch and APU.

Running the BERT search demo

There are two streamlit demos for running BERT search for Solr and Elasticsearch. Each demo compares to BM25 based search. The following assumes that you have bert-as-service up and running (if not, laucnh it with bash start_bert_server.sh) and either Elasticsearch or Solr running with the index containing field with embeddings.

To run a demo, execute the following on the command line from the project root:

# for experiments with Elasticsearch
streamlit run src/search_demo_elasticsearch.py

# for experiments with Solr
streamlit run src/search_demo_solr.py
Owner
Dmitry Kan
I build search engines. Host of the Vector Podcast: https://www.youtube.com/channel/UCCIMPfR7TXyDvlDRXjVhP1g
Dmitry Kan
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022
NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels

NumPy String-Indexed NumPy String-Indexed is a NumPy extension that allows arrays to be indexed using descriptive string labels, rather than conventio

Aitan Grossman 1 Jan 08, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022
Implementation of "Adversarial purification with Score-based generative models", ICML 2021

Adversarial Purification with Score-based Generative Models by Jongmin Yoon, Sung Ju Hwang, Juho Lee This repository includes the official PyTorch imp

15 Dec 15, 2022
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
Implementation of TTS with combination of Tacotron2 and HiFi-GAN

Tacotron2-HiFiGAN-master Implementation of TTS with combination of Tacotron2 and HiFi-GAN for Mandarin TTS. Inference In order to inference, we need t

SunLu Z 7 Nov 11, 2022
NLP-based analysis of poor Chinese movie reviews on Douban

douban_embedding 豆瓣中文影评差评分析 1. NLP NLP(Natural Language Processing)是指自然语言处理,他的目的是让计算机可以听懂人话。 下面是我将2万条豆瓣影评训练之后,随意输入一段新影评交给神经网络,最终AI推断出的结果。 "很好,演技不错

3 Apr 15, 2022
Modified GPT using average pooling to reduce the softmax attention memory constraints.

NLP-GPT-Upsampling This repository contains an implementation of Open AI's GPT Model. In particular, this implementation takes inspiration from the Ny

WD 1 Dec 03, 2021
Code voor mijn Master project omtrent VideoBERT

Code voor masterproef Deze repository bevat de code voor het project van mijn masterproef omtrent VideoBERT. De code in deze repository is gebaseerd o

35 Oct 18, 2021
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
My Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks using Tensorflow

Easy Data Augmentation Implementation This repository contains my Implementation for the paper EDA: Easy Data Augmentation Techniques for Boosting Per

Aflah 9 Oct 31, 2022
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
A multi-voice TTS system trained with an emphasis on quality

TorToiSe Tortoise is a text-to-speech program built with the following priorities: Strong multi-voice capabilities. Highly realistic prosody and inton

James Betker 2.1k Jan 01, 2023
Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe NHV in the future.

Fast (GAN Based Neural) Vocoder Chinese README Todo Submit demo Support NHV Discription Include MelGAN, HifiGAN and Multiband-HifiGAN, maybe include N

Zhengxi Liu (刘正曦) 134 Dec 16, 2022
English loanwords in the world's languages

Wiktionary as CLDF Content cldf1 and cldf2 contain cldf-conform data sets with a total of 2 377 756 entries about the vocabulary of all 1403 languages

Viktor Martinović 3 Jan 14, 2022
Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

Yangming Li 128 Dec 29, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specia

Zihan Liu 89 Nov 10, 2022
Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2.

Galois is an auto code completer for code editors (or any text editor) based on OpenAI GPT-2. It is trained (finetuned) on a curated list of approximately 45K Python (~470MB) files gathered from the

Galois Autocompleter 91 Sep 23, 2022