IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. It was published at EMNLP 2021 (main conference).

Last update: Nov 30, 2022

Overview

IndoBERTweet 🐦 🇮🇩

1. Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).

2. About

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.
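The average-pooling initialization described above can be illustrated with a minimal sketch. It is not the authors' training code: the helper `average_subword_init`, the toy tokenizer, and the three-dimensional toy embeddings are all hypothetical and chosen only to show the idea. A new domain-specific word is split into subwords by the original tokenizer, and its initial embedding is the element-wise mean of those subword embeddings.

```python
from typing import Callable, Dict, List

Vector = List[float]


def average_subword_init(
    new_word: str,
    tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, Vector],
) -> Vector:
    """Initialize a new word's embedding as the mean of its subword embeddings."""
    pieces = tokenize(new_word)                 # subwords under the ORIGINAL vocabulary
    vectors = [embeddings[p] for p in pieces]   # look up each subword embedding
    dim = len(vectors[0])
    # Element-wise average over all subword vectors.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy data (hypothetical): a 3-dimensional embedding table and a fixed tokenization.
toy_embeddings = {"lock": [1.0, 0.0, 2.0], "##down": [3.0, 2.0, 0.0]}
toy_splits = {"lockdown": ["lock", "##down"]}

new_vector = average_subword_init("lockdown", toy_splits.__getitem__, toy_embeddings)
print(new_vector)  # [2.0, 1.0, 1.0]
```

In a real setup the same averaging would be applied to rows of the pretrained model's input embedding matrix, and the resulting vectors would be appended as new rows for the added vocabulary before continued pretraining.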

3. Pretraining Data

We crawl Indonesian tweets over a 1-year period, from December 2019 to December 2020, using the official Twitter API with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of 409M word tokens, twice the size of the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.

4. How to use

Load model and tokenizer (tested with transformers==3.5.1)

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")

Preprocessing steps:

Lower-case all words.
Convert user mentions and URLs into @USER and HTTPURL, respectively.
Translate emoticons into text using the emoji package.
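The three preprocessing steps can be sketched roughly as follows. This is an approximation under stated assumptions, not the authors' exact pipeline: the function name `preprocess_tweet` and both regular expressions are hypothetical, and the use of `emoji.demojize` with its default `:name:` output is a guess at how the emoji package was applied. The import is guarded so the sketch still runs (skipping the emoji step) when the package is not installed.

```python
import re

try:
    import emoji  # third-party "emoji" package; optional in this sketch
except ImportError:
    emoji = None

# Assumed patterns; the original pipeline may match URLs and mentions differently.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def preprocess_tweet(text: str) -> str:
    """Apply the three steps: lower-case, mask mentions/URLs, verbalize emoji."""
    text = text.lower()                        # step 1: lower-case all words
    text = URL_PATTERN.sub("HTTPURL", text)    # step 2a: mask URLs first (they may contain '@')
    text = MENTION_PATTERN.sub("@USER", text)  # step 2b: mask user mentions
    if emoji is not None:
        text = emoji.demojize(text)            # step 3: replace emoji with text names
    return text


example = preprocess_tweet("Halo @Budi cek https://Example.com/A")
print(example)  # halo @USER cek HTTPURL
```

Lower-casing runs before masking so that the @USER and HTTPURL placeholders keep the casing given in the steps above.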

5. Results over 7 Indonesian Twitter Datasets

Models	Sentiment (IndoLEM)	Sentiment (SmSA)	Emotion (EmoT)	Hate Speech (HS1)	Hate Speech (HS2)	NER (Formal)	NER (Informal)	Average
mBERT	76.6	84.7	67.5	85.1	75.1	85.2	83.2	79.6
malayBERT	82.0	84.1	74.2	85.0	81.9	81.9	81.3	81.5
IndoBERT (Wilie et al., 2020)	84.1	88.7	73.3	86.8	80.4	86.3	84.3	83.4
IndoBERT (Koto et al., 2020)	84.1	87.9	71.0	86.4	79.3	88.0	86.9	83.4
IndoBERTweet (1M steps from scratch)	86.2	90.4	76.0	88.8	87.5	88.1	85.4	86.1
IndoBERT + Voc adaptation + 200k steps	86.6	92.7	79.0	88.4	84.0	87.7	86.9	86.5
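The Average column appears to be the unweighted mean of the seven dataset scores in each row, rounded to one decimal place. The short check below recomputes it for the two IndoBERTweet variants using the numbers from the table; the dictionary keys are shortened labels introduced here for illustration.

```python
# Scores copied from the table, in column order:
# IndoLEM, SmSA, EmoT, HS1, HS2, NER Formal, NER Informal.
scores = {
    "from_scratch_1M": [86.2, 90.4, 76.0, 88.8, 87.5, 88.1, 85.4],
    "voc_adapt_200k": [86.6, 92.7, 79.0, 88.4, 84.0, 87.7, 86.9],
}

# Unweighted mean over the seven datasets, rounded to one decimal place.
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
print(averages)  # {'from_scratch_1M': 86.1, 'voc_adapt_200k': 86.5}
```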
