Learned model to estimate number of distinct values (NDV) of a population using a small sample.

Last update: Nov 21, 2022

Overview

Learned NDV estimator

Learned model to estimate the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper "Learning to be a Statistician: Learned Estimator for Number of Distinct Values" for more details.

How to use

Install the package

pip install estndv
Import and create an instance

   from estndv import ndvEstimator
   estimator = ndvEstimator()

Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV with:

ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)
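As a quick sanity check on any estimate, the population NDV cannot be smaller than the number of distinct values seen in the sample, and cannot exceed that count plus the number of unsampled rows. The sketch below computes these bounds. It assumes the sample was drawn without replacement, and `ndv_bounds` is a hypothetical helper written for illustration; it is not part of estndv.

```python
def ndv_bounds(sample, population_size):
    """Return (lower, upper) bounds on the population NDV implied by a sample.

    Assumption: the sample was drawn without replacement from the population.
    """
    n = len(sample)
    if population_size < n:
        raise ValueError("population size must be at least the sample size")
    d = len(set(sample))  # distinct values observed in the sample
    # Each of the N - n unsampled rows can add at most one new distinct value.
    return d, d + (population_size - n)


lower, upper = ndv_bounds([1, 1, 1, 3, 5, 5, 12], 100000)
print(lower, upper)
```

An estimate that falls outside this range usually points to a wrong N or a malformed sample.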
If you have the sample profile, e.g. f=[2,1,1] (the j-th entry is the number of distinct values that occur exactly j times in the sample), you can estimate the population NDV with:

ndv = estimator.profile_predict(f=[2,1,1], N=100000)
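The profile f=[2,1,1] is what you get by counting frequencies in the sample S=[1,1,1,3,5,5,12]: two values occur once (3 and 12), one occurs twice (5), and one occurs three times (1). A minimal sketch of that conversion follows. `sample_to_profile` is a hypothetical helper, not an estndv function, and the list layout (zeros kept for unused frequencies, length equal to the largest frequency) is an assumption inferred from the example above.

```python
from collections import Counter


def sample_to_profile(sample):
    """Return f where f[j-1] is the number of distinct values seen exactly j times."""
    value_counts = Counter(sample)                  # value -> occurrences in the sample
    freq_of_freq = Counter(value_counts.values())   # occurrences -> number of such values
    max_freq = max(freq_of_freq, default=0)
    return [freq_of_freq.get(j, 0) for j in range(1, max_freq + 1)]


profile = sample_to_profile([1, 1, 1, 3, 5, 5, 12])
print(profile)  # [2, 1, 1]
```

Working from the profile is convenient when the raw sample is large, since the profile is usually much shorter than the sample itself.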
If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with estimator.sample_predict_batch() or estimator.profile_predict_batch().

How to train the NDV estimator

You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. However, if you want to train the model from scratch anyway, do the following:

Go to the model_training folder

cd model_training
Install requirements

pip install -r requirements.txt
Generate training data. (This uses a lot of memory.)

python training_data_generation.py
Train model

python model_training.py
Save the trained PyTorch model parameters in NumPy format. This generates a file named model_paras.npy.

python torch2npy.py
Test with your model parameters by specifying a path to your model_paras.npy

estimator = ndvEstimator(para_path="<your path to model_paras.npy>")

Citation

If you use our work or find it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272--284},
   numpages = {13}
}
