NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types and categories. Here we use topic modeling and clustering to find out what each article is about, first based on its content and then based only on its title.

The process would be: Load the data. Keep what we need from the data. Clean the text (e.g. remove stopwords).

Build the bag of words for all documents. Build the bag of words for each document.

Vectorize the data. Run the LDA model. Run the model on all the data and save the output to a dataframe.

Run the clustering algorithm. Save the data to CSV. Make the charts.
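
The steps above can be sketched roughly as follows. This is only a minimal illustration on toy data, not the project's actual code: it assumes scikit-learn for the bag of words, the LDA model and the clustering (the proposal also lists nltk and gensim, which could be used instead), it assumes KMeans as the clustering algorithm, and the function name, the toy documents and the numbers of topics and clusters are all made up for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the 'content' column (hypothetical texts).
texts = [
    "The government passed a new tax law after a long debate in parliament.",
    "Parliament will vote on the election law and the new tax plan.",
    "The team won the football match with a late goal.",
    "The coach praised the players after the football season ended.",
    "Voters debate the government tax plan before the election.",
    "The players scored a goal in every match this season.",
]


def topics_and_clusters(texts, n_topics=2, n_clusters=2, seed=0):
    """Bag of words -> LDA topic distribution -> KMeans cluster per document."""
    # Clean and vectorize: lowercase, drop English stopwords, count words.
    vectorizer = CountVectorizer(stop_words="english")
    bow = vectorizer.fit_transform(texts)

    # Fit LDA and get one topic distribution per document.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topics = lda.fit_transform(bow)

    # Cluster the documents in topic space (KMeans is an assumption here).
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(doc_topics)

    # Collect everything in a dataframe, one row per document.
    out = pd.DataFrame(doc_topics, columns=[f"topic_{i}" for i in range(n_topics)])
    out["dominant_topic"] = doc_topics.argmax(axis=1)
    out["cluster"] = labels
    return out


result = topics_and_clusters(texts)
```

The resulting dataframe could then be written out with `result.to_csv("topics.csv", index=False)` (the file name is hypothetical) and plotted with Matplotlib or Seaborn. Running the same function on the 'title' column instead of 'content' would give the title-only variant described in the overview.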

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.

The 'title' is the headline of the news article or essay. The 'content' is the body text of the article or essay.
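
As a small sketch of the column selection described above, assuming the data has been loaded into a pandas dataframe: the toy rows below are invented stand-ins for the real dataset, the function name is hypothetical, and dropping rows with missing text is an extra assumption that the proposal does not state.

```python
import pandas as pd

# Toy stand-in with the 12 raw features (values are invented).
raw = pd.DataFrame(
    {
        "id": [1, 2],
        "title": ["Markets rally", "Untitled"],
        "author": ["A. Writer", "B. Writer"],
        "date": ["2017-01-01", "2017-01-02"],
        "content": ["Stocks rose sharply on Monday.", None],
        "year": [2017, 2017],
        "month": [1, 1],
        "publication": ["Example Times", "Example Times"],
        "category": ["business", "business"],
        "digital": [1, 1],
        "section": ["markets", "markets"],
        "url": ["https://example.com/1", "https://example.com/2"],
    }
)


def keep_text_columns(df):
    """Keep only 'title' and 'content'; all other features are ignored."""
    # Assumption: rows with a missing title or content are dropped.
    return df[["title", "content"]].dropna().reset_index(drop=True)


articles = keep_text_columns(raw)
```

With the real dataset, `raw` would come from something like `pd.read_csv(...)` on the downloaded file instead of the toy frame.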

TOOLS

Pandas, NumPy, scikit-learn, Matplotlib, Seaborn, nltk, gensim
