VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Overview
   

Unittest GitHub stars GitHub license Black

VarCLR: Variable Representation Pre-training via Contrastive Learning

New: Paper accepted by ICSE 2022. Preprint at arXiv!

This repository contains code and pre-trained models for VarCLR, a contrastive learning based approach for learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on [email protected].

Step 0: Install

pip install -e .

Step 1: Load a Pre-trained VarCLR Model

from varclr.models import Encoder
model = Encoder.from_pretrained("varclr-codebert")

Step 2: VarCLR Variable Embeddings

Get embedding of one variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get embeddings of list of variables (supports batching)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])

Step 2: Get VarCLR Similarity Scores

Get similarity scores of N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]

Step 3: Reproduce IdBench Benchmark Results

Load the IdBench benchmark

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}

Results on IdBench benchmarks

Similarity

Method Small Medium Large
FT-SG 0.30 0.29 0.28
LV 0.32 0.30 0.30
FT-cbow 0.35 0.38 0.38
VarCLR-Avg 0.47 0.45 0.44
VarCLR-LSTM 0.50 0.49 0.49
VarCLR-CodeBERT 0.53 0.53 0.51
Combined-IdBench 0.48 0.59 0.57
Combined-VarCLR 0.66 0.65 0.62

Relatedness

Method Small Medium Large
LV 0.48 0.47 0.48
FT-SG 0.70 0.71 0.68
FT-cbow 0.72 0.74 0.73
VarCLR-Avg 0.67 0.66 0.66
VarCLR-LSTM 0.71 0.70 0.69
VarCLR-CodeBERT 0.79 0.79 0.80
Combined-IdBench 0.71 0.78 0.79
Combined-VarCLR 0.79 0.81 0.85

Pre-train your own VarCLR models

Coming soon.

Cite

If you find VarCLR useful in your research, please cite our [email protected]:

@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}
Owner
squaresLab
squaresLab
SparseInst: Sparse Instance Activation for Real-Time Instance Segmentation, CVPR 2022

SparseInst 🚀 A simple framework for real-time instance segmentation, CVPR 2022 by Tianheng Cheng, Xinggang Wang†, Shaoyu Chen, Wenqiang Zhang, Qian Z

Hust Visual Learning Team 458 Jan 05, 2023
ObsPy: A Python Toolbox for seismology/seismological observatories.

ObsPy is an open-source project dedicated to provide a Python framework for processing seismological data. It provides parsers for common file formats

ObsPy 979 Jan 07, 2023
TF Image Segmentation: Image Segmentation framework

TF Image Segmentation: Image Segmentation framework The aim of the TF Image Segmentation framework is to provide/provide a simplified way for: Convert

Daniil Pakhomov 546 Dec 17, 2022
Retinal Vessel Segmentation with Pixel-wise Adaptive Filters (ISBI 2022)

Retinal Vessel Segmentation with Pixel-wise Adaptive Filters (ISBI 2022) Introdu

anonymous 14 Oct 27, 2022
RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

[Paper] [Хабр] [Model Card] [Colab] [Kaggle] RuDOLPH 🦌 🎄 ☃️ One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP Russian Diffusio

AI Forever 232 Jan 04, 2023
This repository contains the code for the paper "PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization"

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization News: [2020/05/04] Added EGL rendering option for training data g

Shunsuke Saito 1.5k Jan 03, 2023
Language Used: Python . Made in Jupyter(Anaconda) notebook.

FACE-DETECTION-ATTENDENCE-SYSTEM Made in Jupyter(Anaconda) notebook. Language Used: Python Steps to perform before running the program : Install Anaco

1 Jan 12, 2022
UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

UNION Automatic Evaluation Metric described in the paper UNION: An UNreferenced MetrIc for Evaluating Open-eNded Story Generation (EMNLP 2020). Please

50 Dec 30, 2022
Python program that works as a contact list

Lista de Contatos Programa em Python que funciona como uma lista de contatos. Features Adicionar novo contato Remover contato Atualizar contato Pesqui

Victor B. Lino 3 Dec 16, 2021
RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Multipath RefineNet A MATLAB based framework for semantic image segmentation and general dense prediction tasks on images. This is the source code for

Guosheng Lin 575 Dec 06, 2022
This library provides an abstraction to perform Model Versioning using Weight & Biases.

Description This library provides an abstraction to perform Model Versioning using Weight & Biases. Features Version a new trained model Promote a mod

Hector Lopez Almazan 2 Jan 28, 2022
​TextWorld is a sandbox learning environment for the training and evaluation of reinforcement learning (RL) agents on text-based games.

TextWorld A text-based game generator and extensible sandbox learning environment for training and testing reinforcement learning (RL) agents. Also ch

Microsoft 983 Dec 23, 2022
FaceQgen: Semi-Supervised Deep Learning for Face Image Quality Assessment

FaceQgen FaceQgen: Semi-Supervised Deep Learning for Face Image Quality Assessment This repository is based on the paper: "FaceQgen: Semi-Supervised D

Javier Hernandez-Ortega 3 Aug 04, 2022
Contains a bunch of different python programm tasks

py_tasks Contains a bunch of different python programm tasks Armstrong.py - calculate Armsrong numbers in range from 0 to n with / without cache and c

Dmitry Chmerenko 1 Dec 17, 2021
SberSwap Video Swap base on deep learning

SberSwap Video Swap base on deep learning

Sber AI 431 Jan 03, 2023
Procedural 3D data generation pipeline for architecture

Synthetic Dataset Generator Authors: Stanislava Fedorova Alberto Tono Meher Shashwat Nigam Jiayao Zhang Amirhossein Ahmadnia Cecilia bolognesi Dominik

Computational Design Institute 49 Nov 25, 2022
Context-Sensitive Misspelling Correction of Clinical Text via Conditional Independence, CHIL 2022

cim-misspelling Pytorch implementation of Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence, CHIL 2022. This model (

Juyong Kim 11 Dec 19, 2022
CLDF dataset derived from Robbeets et al.'s "Triangulation Supports Agricultural Spread" from 2021

CLDF dataset derived from Robbeets et al.'s "Triangulation Supports Agricultural Spread" from 2021 How to cite If you use these data please cite the o

Digital Linguistics 2 Dec 20, 2021
The codebase for our paper "Generative Occupancy Fields for 3D Surface-Aware Image Synthesis" (NeurIPS 2021)

Generative Occupancy Fields for 3D Surface-Aware Image Synthesis (NeurIPS 2021) Project Page | Paper Xudong Xu, Xingang Pan, Dahua Lin and Bo Dai GOF

xuxudong 97 Nov 10, 2022
A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis

A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis This is the pytorch implementation for our MICCAI 2021 paper. A Mul

Jiarong Ye 7 Apr 04, 2022