VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Last update: Oct 24, 2022

Overview

New: Paper accepted by ICSE 2022. Preprint at arXiv!

This repository contains code and pre-trained models for VarCLR, a contrastive-learning-based approach to learning semantic representations of variable names that effectively captures variable similarity, with state-of-the-art results on the IdBench benchmark.

Step 0: Install

pip install -e .

Step 1: Load a Pre-trained VarCLR Model

from varclr.models import Encoder
model = Encoder.from_pretrained("varclr-codebert")

Step 2: VarCLR Variable Embeddings

Get embedding of one variable

emb = model.encode("squareslab")
print(emb.shape)
# torch.Size([1, 768])

Get embeddings of a list of variables (supports batching)

emb = model.encode(["squareslab", "strudel"])
print(emb.shape)
# torch.Size([2, 768])
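
For very long lists of names, it may help to feed the encoder in fixed-size chunks. The following is a minimal sketch under stated assumptions, not part of VarCLR: `batched`, `encode_in_batches`, and `toy_encode` are hypothetical helpers, and `toy_encode` is a stand-in so the example runs without the package. Passing `model.encode` as `encode_fn` is assumed to work because it accepts a list of names and returns one row per name, as shown above.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def encode_in_batches(names: Sequence[str], encode_fn: Callable, batch_size: int = 2) -> list:
    """Encode names chunk by chunk and collect one row per name."""
    rows = []
    for batch in batched(names, batch_size):
        # Iterating over the batch result yields one embedding row per name.
        rows.extend(encode_fn(batch))
    return rows


def toy_encode(batch: List[str]) -> List[List[float]]:
    """Stand-in encoder (hypothetical, NOT VarCLR): one 1-d 'embedding' per name."""
    return [[float(len(name))] for name in batch]


rows = encode_in_batches(["squareslab", "strudel", "neulab"], toy_encode, batch_size=2)
print(len(rows))
```

The order of the returned rows follows the order of the input names, so row `i` belongs to `names[i]`.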

Step 2 (continued): Get VarCLR Similarity Scores

Get similarity scores of N variable pairs

print(model.score("squareslab", "strudel"))
# [0.42812108993530273]
print(model.score(["squareslab", "average", "max", "max"], ["strudel", "mean", "min", "maximum"]))
# [0.42812108993530273, 0.8849745988845825, 0.8035818338394165, 0.889922022819519]
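
A common use of pairwise scores is ranking candidate names against a query. The sketch below is illustrative only: `rank_by_similarity` and `toy_score` are hypothetical names, and `toy_score` is a character-overlap stand-in so the example runs without the package. It assumes the scorer takes two equal-length lists and returns one float per pair, which matches the `model.score` output shown above, so `model.score` could presumably be passed as `score_fn`.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple


def rank_by_similarity(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[str], List[str]], List[float]],
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted from most to least similar."""
    # Pair the query with every candidate and score them in one call.
    scores = score_fn([query] * len(candidates), list(candidates))
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)


def toy_score(left: List[str], right: List[str]) -> List[float]:
    """Stand-in scorer based on character overlap (hypothetical, NOT VarCLR)."""
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(left, right)]


ranking = rank_by_similarity("max", ["minimum", "maximum", "strudel"], toy_score)
print([name for name, _ in ranking])
```

With the toy scorer, the ranking only reflects shared characters. A learned scorer is what would let semantically close but differently spelled names, such as "average" and "mean" above, rank highly.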

Get pairwise (N * M) similarity scores from two lists of variables

variable_list = ["squareslab", "strudel", "neulab"]
print(model.cross_score("squareslab", variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832]]
print(model.cross_score(variable_list, variable_list))
# [[1.0000007152557373, 0.4281214475631714, 0.7207341194152832],
#  [0.4281214475631714, 1.0000004768371582, 0.549992561340332],
#  [0.7207341194152832, 0.549992561340332, 1.000000238418579]]
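
The diagonal values of about 1.0 and the symmetric matrix above are consistent with cosine similarity between embeddings, although this README does not state the metric. Assuming cosine similarity, the shape of an N * M score matrix can be sketched with toy vectors as follows. `cosine`, `pairwise_cosine`, and the 3-dimensional `toy_embeddings` are hypothetical and are not VarCLR outputs.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(
    rows: Sequence[Sequence[float]], cols: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Return an N x M matrix whose [i][j] entry compares rows[i] with cols[j]."""
    return [[cosine(u, v) for v in cols] for u in rows]


# Toy 3-dimensional "embeddings" (made up for illustration).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
matrix = pairwise_cosine(toy_embeddings, toy_embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```

Under this assumption, diagonal entries that land slightly above 1.0, as in the output above, would be floating-point rounding rather than a meaningful difference.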

Step 3: Reproduce IdBench Benchmark Results

Load the IdBench benchmark

from varclr.benchmarks import Benchmark

# Similarity on IdBench-Medium
b1 = Benchmark.build("idbench", variant="medium", metric="similarity")
# Relatedness on IdBench-Large
b2 = Benchmark.build("idbench", variant="large", metric="relatedness")

Compute VarCLR scores and evaluate

id1_list, id2_list = b1.get_inputs()
predicted = model.score(id1_list, id2_list)
print(b1.evaluate(predicted))
# {'spearmanr': 0.5248567181503295, 'pearsonr': 0.5249843473193132}

print(b2.evaluate(model.score(*b2.get_inputs())))
# {'spearmanr': 0.8012168379981921, 'pearsonr': 0.8021791703187449}

Let's compare with the original CodeBERT

codebert = Encoder.from_pretrained("codebert")
print(b1.evaluate(codebert.score(*b1.get_inputs())))
# {'spearmanr': 0.2056582946575104, 'pearsonr': 0.1995058696927054}
print(b2.evaluate(codebert.score(*b2.get_inputs())))
# {'spearmanr': 0.3909218857993804, 'pearsonr': 0.3378219622284688}
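
The keys `spearmanr` and `pearsonr` suggest that `evaluate` reports the Spearman rank correlation and the Pearson correlation between predicted scores and human ratings. Assuming the standard definitions, the two metrics can be sketched in plain Python as below. This is not the benchmark's implementation; `pearson`, `average_ranks`, `spearman`, and the toy ratings are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up ratings and predictions, for illustration only.
human = [0.1, 0.4, 0.35, 0.8]
predicted = [0.2, 0.5, 0.3, 0.9]
print({"spearmanr": spearman(human, predicted), "pearsonr": pearson(human, predicted)})
```

Spearman depends only on the ordering of the scores, so a model that ranks pairs the same way as the human raters scores highly even when its raw values sit on a different scale.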

Results on IdBench benchmarks

Similarity

Method	Small	Medium	Large
FT-SG	0.30	0.29	0.28
LV	0.32	0.30	0.30
FT-cbow	0.35	0.38	0.38
VarCLR-Avg	0.47	0.45	0.44
VarCLR-LSTM	0.50	0.49	0.49
VarCLR-CodeBERT	0.53	0.53	0.51

Combined-IdBench	0.48	0.59	0.57
Combined-VarCLR	0.66	0.65	0.62

Relatedness

Method	Small	Medium	Large
LV	0.48	0.47	0.48
FT-SG	0.70	0.71	0.68
FT-cbow	0.72	0.74	0.73
VarCLR-Avg	0.67	0.66	0.66
VarCLR-LSTM	0.71	0.70	0.69
VarCLR-CodeBERT	0.79	0.79	0.80

Combined-IdBench	0.71	0.78	0.79
Combined-VarCLR	0.79	0.81	0.85

Pre-train your own VarCLR models

Coming soon.

Cite

If you find VarCLR useful in your research, please cite our ICSE 2022 paper:

@misc{chen2021varclr,
      title={VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning},
      author={Qibin Chen and Jeremy Lacomis and Edward J. Schwartz and Graham Neubig and Bogdan Vasilescu and Claire Le Goues},
      year={2021},
      eprint={2112.02650},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

Owner

squaresLab
