Improving Representations via Similarities

Last update: Jan 08, 2023

Overview

embetter

warning

I like to build in public, but please don't expect anything yet. This is alpha stuff!

notes

Improving Representations via Similarities

The object to implement:

Embetter(multi_output=True, epochs=50, **sampling_kwargs)
  .fit(X, y)
  .fit_sim(X1, X2, y_sim, weights)
  .partial_fit(X, y, classes, weights)
  .partial_fit_sim(X1, X2, y_sim, weights)
  .predict(X)
  .predict_proba(X)
  .predict_sim(X1, X2)
  .transform(X)
  .translate_X_y(X, y, classes=None)

Observation: especially when multi_output=True, there is an opportunity with regard to NaN y-values. We can simply choose which values to translate and which to ignore.
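A minimal sketch of the NaN idea above, assuming multi-output labels arrive as rows of floats with NaN marking "no label for this output". The helper name `drop_nan_labels` and the triple layout are hypothetical and are not part of the planned API:

```python
import math


def drop_nan_labels(X, y):
    """Hypothetical helper: flatten multi-output labels into
    (x, output_index, label) triples, skipping NaN labels."""
    triples = []
    for x_row, y_row in zip(X, y):
        for output_index, label in enumerate(y_row):
            if isinstance(label, float) and math.isnan(label):
                continue  # no label for this output, so ignore it
            triples.append((x_row, output_index, label))
    return triples


X = ["good movie", "bad movie", "okay movie"]
y = [
    [1.0, float("nan")],           # only the first output is labelled
    [0.0, 1.0],                    # both outputs are labelled
    [float("nan"), float("nan")],  # nothing labelled, row is dropped entirely
]
triples = drop_nan_labels(X, y)
```

Rows with partial labels still contribute the outputs they do have, which is the opportunity the observation points at.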

Comments

[WIP] Feature/progress bar
Fixes issue #20

[x] Adds progress bar to all text and image embedders.

[x] Tests for SentenceEncoder.

[ ] Use perfplot for progress bar?

[ ] Can we ensure fast NumPy vectorization while using a progress bar?
opened by CarloLepelaars 5
[BUG] `device` should be attribute on `SentenceEncoder`
The device argument in SentenceEncoder is not defined as an attribute. This leads to bugs when using it with sklearn. I encountered attribute errors when trying to print out a Pipeline representation that has SentenceEncoder as a component.

Should be easy to fix by just adding self.device in SentenceEncoder.__init__. We can consider adding tests for text encoders so we can catch these errors beforehand.

The scikit-learn development docs make it clear every argument should be defined as an attribute:

every keyword argument accepted by init should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

Error message: AttributeError: 'SentenceEncoder' object has no attribute 'device'.

Reproduction: Python 3.8 with embetter = "^0.2.2"

se = SentenceEncoder()
repr(se)

Fix:

Add self.device on SentenceEncoder

class SentenceEncoder(EmbetterBase):
    ...
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        if not device:
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.device = device
        self.name = name
        self.tfm = SBERT(name, device=self.device)
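One way such a regression test could look, as a rough sketch using only the standard library. The helper `missing_init_attributes` and the two toy classes are hypothetical stand-ins, not code from embetter or scikit-learn:

```python
import inspect


def missing_init_attributes(estimator):
    """Return the __init__ parameter names that are not attributes on the instance."""
    signature = inspect.signature(type(estimator).__init__)
    skipped_kinds = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    names = [
        name
        for name, param in signature.parameters.items()
        if name != "self" and param.kind not in skipped_kinds
    ]
    return [name for name in names if not hasattr(estimator, name)]


class BrokenEncoder:
    # Hypothetical encoder that accepts `device` but never stores it.
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name


class FixedEncoder:
    # Same signature, but every keyword argument becomes an attribute.
    def __init__(self, name="all-MiniLM-L6-v2", device=None):
        self.name = name
        self.device = device


broken_missing = missing_init_attributes(BrokenEncoder())
fixed_missing = missing_init_attributes(FixedEncoder())
```

A test could assert that the returned list is empty for every encoder in the library, which would have caught the missing `device` attribute.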
opened by CarloLepelaars 4
Color Histograms - Additional Tricks

This approach could work pretty well as an implementation: https://danielmuellerkomorowska.com/2020/06/17/analyzing-image-histograms-with-scikit-image/

To do something similar to what is explained here: https://www.pinecone.io/learn/color-histograms/

opened by koaning 4
Support for word embeddings
Hi,

Do you think it would be a good idea to add support for static word embeddings (word2vec, glove, etc.)? The embedder would need:

A filename to a local embedding file (e.g., glove.6b.100d.txt)

Either a callable tokenizer or a regex string (i.e., the way scikit-learn's TfidfVectorizer splits words).

A (name of a) pooling function (e.g., "mean", "max", "sum").

The second and third parameters could easily have sensible defaults, of course. If you think it's a good idea, I can do the PR somewhere next week.
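To make the three parameters concrete, here is a rough sketch in plain Python. It assumes the embedding file has already been loaded into a dict (the `toy_vectors` below are made up), uses scikit-learn's default token pattern as the regex, and the names `embed_text` and `POOLERS` are hypothetical:

```python
import re

# Same pattern as scikit-learn's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Pooling functions applied per embedding dimension.
POOLERS = {
    "mean": lambda column: sum(column) / len(column),
    "max": max,
    "sum": sum,
}


def embed_text(text, vectors, pooling="mean", dim=2):
    """Pool the vectors of all known tokens in `text` into a single vector."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    pool = POOLERS[pooling]
    # zip(*known) iterates over dimensions instead of tokens.
    return [pool(column) for column in zip(*known)]


# Made-up two-dimensional vectors standing in for a loaded embedding file.
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 2.0]}
mean_vec = embed_text("A good movie", toy_vectors, pooling="mean")
sum_vec = embed_text("A good movie", toy_vectors, pooling="sum")
```

The single-letter token "A" is dropped by the pattern, so only "good" and "movie" are pooled.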

Stéphan
opened by stephantul 3
[FEATURE] SpaCyEmbedder
I think it would be a nice addition to add an embedder that can easily vectorize text through SpaCy. I already have an implementation class for this and would be happy to contribute it here.

SpaCy Docs on vector: https://spacy.io/api/doc#vector

Example code for single string:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This here text")
doc.vector
opened by CarloLepelaars 2
`get_feature_names_out` for encoders

I would be happy to implement get_feature_names_out for all the Embetter objects. I will implement them by just adding a new method (without a Mixin).
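As a rough illustration of what a plain method (no Mixin) might look like, here is a sketch on a hypothetical `ToyEmbedder`. The naming scheme (lowercased class name plus column index) and the `n_features` attribute are assumptions, and scikit-learn's own implementations return a NumPy array rather than a list:

```python
class ToyEmbedder:
    """Hypothetical stand-in for an Embetter encoder with a fixed output width."""

    def __init__(self, n_features=4):
        self.n_features = n_features

    def transform(self, X):
        # Placeholder embedding: every input maps to a zero vector.
        return [[0.0] * self.n_features for _ in X]

    def get_feature_names_out(self, input_features=None):
        # One generated name per output column, prefixed with the class name.
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self.n_features)]


names = ToyEmbedder(n_features=3).get_feature_names_out()
```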

opened by CarloLepelaars 1
Remove the classification layer in timm models

I was playing a bit with the library and found out that the TimmEncoder returns 1000-dimensional vectors for all the models I selected. That is caused by returning the state of the last FC classification layer and the fact all of the models were trained on ImageNet with 1000 classes. In practice, it's typically replaced with identity.

Are there any reasons for returning the state of that last layer as an embedding? I'd be happy to submit a PR fixing that.

opened by kacperlukawski 1
xception mobilenet

https://keras.io/api/applications/

https://www.tensorflow.org/api_docs/python/tf/keras/applications/mobilenet_v2/MobileNetV2 https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception/Xception

opened by koaning 0

'SentenceEncoder' object has no attribute 'device'

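import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the text column and embeds it.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
  text_emb_pipeline,
  LogisticRegression()
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})

X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col'])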

This code gives this error: 'SentenceEncoder' object has no attribute 'device'

opened by nicholas-dinicola 6

Releases(0.2.2)

0.2.2(Dec 20, 2022)

Adds GPU support for Sentence Encoders.
0.2.1(Dec 5, 2022)

Fixed some error messages related to installing extra dependencies.
0.2.0(Oct 10, 2022)

Fixes a bug related to the Timm vision models.
0.1.0(Sep 19, 2022)

The first original release. Should have enough components to be interesting.

Owner

vincent d warmerdam

Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
