An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out-of-the-box when you have less than 100K stored Documents. "U100K" stands for "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20 to 120 QPS for batch queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to conduct nearest-neighbour search. It returns the full Documents as-is, so there is no need to chain it with a separate key-value indexer to retrieve Documents.
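To make "exhaustive nearest-neighbour search that returns full Documents" concrete, here is a minimal standard-library sketch of the idea. It does not use Jina and is not the indexer's actual implementation: the dict-based documents, the function names `cosine_distance` and `exhaustive_search`, and the choice of cosine distance are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat zero vectors as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(
    query: Sequence[float], stored: List[Dict], top_k: int = 3
) -> List[Tuple[float, Dict]]:
    """Compare the query against every stored document (no approximation)."""
    scored = [(cosine_distance(query, doc["embedding"]), doc) for doc in stored]
    scored.sort(key=lambda pair: pair[0])
    # Each match carries the whole stored document, not just its id.
    return scored[:top_k]


# Hypothetical toy documents with 2-dim embeddings.
stored_docs = [
    {"id": "a", "text": "first", "embedding": [1.0, 0.0]},
    {"id": "b", "text": "second", "embedding": [0.0, 1.0]},
    {"id": "c", "text": "third", "embedding": [0.9, 0.1]},
]
matches = exhaustive_search([1.0, 0.0], stored_docs, top_k=2)
```

Because every stored embedding is scored, recall is as high as it can be, but the work per query grows linearly with the number of stored Documents.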

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

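Slow query time beyond 100K stored Documents: search is exhaustive, so latency grows with the number of stored Documents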

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
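The throughput figures quoted in the overview can be derived from the table by dividing the query batch size by the measured time. The short sketch below does this for the 100000-Document row; the variable and function names are illustrative only and are not part of benchmark.py.

```python
# Query times in seconds at 100000 stored Documents, keyed by query size
# (values copied from the table above).
timings_at_100k = {1: 0.297, 8: 0.332, 64: 0.536}


def queries_per_second(batch_size: int, seconds: float) -> float:
    """Throughput when `batch_size` queries finish in `seconds`."""
    return batch_size / seconds


qps = {size: queries_per_second(size, t) for size, t in timings_at_100k.items()}

for size, value in qps.items():
    print(f"query size {size}: {value:.1f} QPS")
```

This gives roughly 24 QPS at query size 8 and roughly 119 QPS at query size 64, which matches the 20 to 120 QPS range for batch queries; a single query at about 0.3 s corresponds to about 3.4 QPS.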

Tips

To change the workspace, pass metas to the constructor:

U100KIndexer(metas={'workspace': './my'})

Or use .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.

Owner

Jina AI