DimReductionClustering - Dimensionality Reduction + Clustering + Unsupervised Score Metrics

Overview

Dimensionality Reduction + Clustering + Unsupervised Score Metrics

  1. Introduction
  2. Installation
  3. Usage
  4. Hyperparameters matters
  5. BayesSearch example

1. Introduction

DimReductionClustering is a sklearn estimator allowing to reduce the dimension of your data and then to apply an unsupervised clustering algorithm. The quality of the cluster can be done according to different metrics. The steps of the pipeline are the following:

  • Perform a dimension reduction of the data using UMAP
  • Numerically find the best epsilon parameter for DBSCAN
  • Perform a density based clustering methods : DBSCAN
  • Estimate cluster quality using silhouette score or DBCV

2. Installation

Use the package manager pip to install DimReductionClustering like below. Rerun this command to check for and install updates .

!pip install umap-learn
!pip install git+https://github.com/christopherjenness/DBCV.git

!pip install git+https://github.com/MathieuCayssol/DimReductionClustering.git

3. Usage

Example on mnist data.

  • Import the data
from sklearn.model_selection import train_test_split
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1]*x_train.shape[1]))
X, X_test, Y, Y_test = train_test_split(x_train, y_train, stratify=y_train, test_size=0.9)
  • Instanciation + fit the model (same interface as a sklearn estimators)
model = DimReductionClustering(n_components=2, min_dist=0.000001, score_metric='silhouette', knn_topk=8, min_pts=4).fit(X)

Return the epsilon using elbow method :

  • Show the 2D plot :
model.display_plotly()

  • Get the score (Silhouette coefficient here)
model.score()

4. Hyperparameters matters

4.1 UMAP (dim reduction)

  • n_neighbors (global/local tradeoff) (default:15 ; 2-1/4 of data)

    → low value (glue small chain, more local)

    → high value (glue big chain, more global)

  • min_dist (0 to 0.99) the minimum distance apart that points are allowed to be in the low dimensional representation. This means that low values of min_dist will result in clumpier embeddings. This can be useful if you are interested in clustering, or in finer topological structure. Larger values of min_dist will prevent UMAP from packing points together and will focus on the preservation of the broad topological structure instead.

  • n_components low dimensional space. 2 or 3

  • metric (’euclidian’ by default). For NLP, good idea to choose ‘cosine’ as infrequent/frequent words will have different magnitude.

4.2 DBSCAN (clustering)

  • min_pts MinPts ≥ 3. Basic rule : = 2 * Dimension (4 for 2D and 6 for 3D). Higher for noisy data.

  • Epsilon The maximum distance between two samples for one to be considered as in the neighborhood of the other. k-distance graph with k nearest neighbor. Sort result by descending order. Find elbow using orthogonal projection on a line between first and last point of the graph. y-coordinate of max(d((x,y),Proj(x,y))) is the optimal epsilon. Click here to know more about elbow method

! There is no Epsilon hyperparameters in the implementation, only k-th neighbor for KNN.

  • knn_topk k-th Nearest Neighbors. Between 3 and 20.

4.3 Score metric

5. BayesSearch example

!pip install scikit-optimize

from skopt.space import Integer
from skopt.space import Real
from skopt.space import Categorical
from skopt.utils import use_named_args
from skopt import BayesSearchCV

search_space = list()
#UMAP Hyperparameters
search_space.append(Integer(5, 200, name='n_neighbors', prior='uniform'))
search_space.append(Real(0.0000001, 0.2, name='min_dist', prior='uniform'))
#Search epsilon with KNN Hyperparameters
search_space.append(Integer(3, 20, name='knn_topk', prior='uniform'))
#DBSCAN Hyperparameters
search_space.append(Integer(4, 15, name='min_pts', prior='uniform'))


params = {search_space[i].name : search_space[i] for i in range((len(search_space)))}

train_indices = [i for i in range(X.shape[0])]  # indices for training
test_indices = [i for i in range(X.shape[0])]  # indices for testing

cv = [(train_indices, test_indices)]

clf = BayesSearchCV(estimator=DimReductionClustering(), search_spaces=params, n_jobs=-1, cv=cv)

clf.fit(X)

clf.best_params_

clf.best_score_
Few-shot Learning of GPT-3

Few-shot Learning With Language Models This is a codebase to perform few-shot "in-context" learning using language models similar to the GPT-3 paper.

Tony Z. Zhao 224 Dec 28, 2022
Some experiments with tennis player aging curves using Hilbert space GPs in PyMC. Only experimental for now.

NOTE: This is still being developed! Setup notes This document uses Jeff Sackmann's tennis data. You can obtain it as follows: git clone https://githu

Martin Ingram 1 Jan 20, 2022
A pytorch &keras implementation and demo of Fastformer.

Fastformer Notes from the authors Pytorch/Keras implementation of Fastformer. The keras version only includes the core fastformer attention part. The

153 Dec 28, 2022
EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation

EFENet EFENet: Reference-based Video Super-Resolution with Enhanced Flow Estimation Code is a bit messy now. I woud clean up soon. For training the EF

Yaping Zhao 19 Nov 05, 2022
PyTorch implementation of D2C: Diffuison-Decoding Models for Few-shot Conditional Generation.

D2C: Diffuison-Decoding Models for Few-shot Conditional Generation Project | Paper PyTorch implementation of D2C: Diffuison-Decoding Models for Few-sh

Jiaming Song 90 Dec 27, 2022
Semantically Contrastive Learning for Low-light Image Enhancement

Semantically Contrastive Learning for Low-light Image Enhancement Here, we propose an effective semantically contrastive learning paradigm for Low-lig

48 Dec 16, 2022
Notebooks em Python para Métodos Eletromagnéticos

GeoSci Labs This is a repository of code used to power the notebooks and interactive examples for https://em.geosci.xyz and https://gpg.geosci.xyz. Th

Victor Cezar Tocantins 1 Nov 16, 2021
Code repository for EMNLP 2021 paper 'Adversarial Attacks on Knowledge Graph Embeddings via Instance Attribution Methods'

Adversarial Attacks on Knowledge Graph Embeddings via Instance Attribution Methods This is the code repository to accompany the EMNLP 2021 paper on ad

Peru Bhardwaj 7 Sep 25, 2022
N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting

N-HiTS: Neural Hierarchical Interpolation for Time Series Forecasting Recent progress in neural forecasting instigated significant improvements in the

Cristian Challu 82 Jan 04, 2023
Process text, including tokenizing and representing sentences as vectors and Applying some concepts like RNN, LSTM and GRU to create a classifier can detect the language in which a sentence is written from among 17 languages.

Language Identifier What is this ? The goal of this project is to create a model that is able to predict a given sentence language through text proces

Hossam Asaad 9 Dec 15, 2022
A simple library that implements CLIP guided loss in PyTorch.

pytorch_clip_guided_loss: Pytorch implementation of the CLIP guided loss for Text-To-Image, Image-To-Image, or Image-To-Text generation. A simple libr

Sergei Belousov 74 Dec 26, 2022
Code for Learning Manifold Patch-Based Representations of Man-Made Shapes, in ICLR 2021.

LearningPatches | Webpage | Paper | Video Learning Manifold Patch-Based Representations of Man-Made Shapes Dmitriy Smirnov, Mikhail Bessmeltsev, Justi

Dima Smirnov 22 Nov 14, 2022
for taichi voxel-challange event

Taichi Voxel Challenge Figure: result of python3 example6.py. Please replace the image above (demo.jpg) with yours, so that other people can immediate

Liming Xu 20 Nov 26, 2022
A compendium of useful, interesting, inspirational usage of pandas functions, each example will be an ipynb file

Pandas_by_examples A compendium of useful/interesting/inspirational usage of pandas functions, each example will be an ipynb file What is this reposit

Guangyuan(Frank) Li 32 Nov 20, 2022
Tackling the Class Imbalance Problem of Deep Learning Based Head and Neck Organ Segmentation

Info This is the code repository of the work Tackling the Class Imbalance Problem of Deep Learning Based Head and Neck Organ Segmentation from Elias T

2 Apr 20, 2022
Tensor-Based Quantum Machine Learning

TensorLy_Quantum TensorLy-Quantum is a Python library for Tensor-Based Quantum Machine Learning that builds on top of TensorLy and PyTorch. Website: h

TensorLy 85 Dec 03, 2022
Implementation of neural class expression synthesizers

NCES Implementation of neural class expression synthesizers (NCES) Installation Clone this repository: https://github.com/ConceptLengthLearner/NCES.gi

NeuralConceptSynthesis 0 Jan 06, 2022
IPATool-py: download ipa easily

IPATool-py Python version of IPATool! Installation pip3 install -r requirements.txt Usage Quickstart: download app with specific bundleId into DIR: p

159 Dec 30, 2022
A TensorFlow implementation of the Mnemonic Descent Method.

MDM A Tensorflow implementation of the Mnemonic Descent Method. Mnemonic Descent Method: A recurrent process applied for end-to-end face alignment G.

123 Oct 07, 2022
Deep Sketch-guided Cartoon Video Inbetweening

Cartoon Video Inbetweening Paper | DOI | Video The source code of Deep Sketch-guided Cartoon Video Inbetweening by Xiaoyu Li, Bo Zhang, Jing Liao, Ped

Xiaoyu Li 37 Dec 22, 2022