Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Overview

Neural Scam Artist

TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.

Comic stolen from Agent-X Comics.

📖 Table of contents

☁️ Project Description

Objective

The goal of this project is create a new dataset of fraudulent emails that can advance the research on intelligent email assistants.

Web Scraper

Data is scraped from the website https://antifraudintl.org/. At first, a set of thread urls is collected and stored. Then, each thread is searched for emails. For each thread, at most one email is kept as the rest are duplicates. Metadata (Subject, Date etc) is removed. The resultant dataset is stored inside a csv file.

Deduplication

To avoid the quadratic complexity, a cheap alternative is selected: MinHash and LSH using the datasketch library. For each document, this method efficiently locates its nearest neighbors. Because this leads to a a large amount of false negatives (i.e. dulpicate documents that are classified as non-duplicates), the approach is extended by creating a duplicate graph. Nodes in this graph represent documents and are connected with an edge if their respective documents have been classified as duplicates. To deduplicate the dataset, connected components of the graph are located and for each component only a single node is selected. A readability criterion is used for selection.

GPT-2

A small pretrained GPT-2 model from the Huggingface library is fine-tuned on the deduplicated dataset. A collection of cherry-picked randomly selected generated samples can be found here here.

📁 Shared Files

Resource Size #Samples Link
Full dataset 128.5 MB 85,160 Link
Deduplicated dataset 74.2 MB 58,227 Link
Thread urls 6.4 MB 95,324 Link
GPT-2 Checkpoints ~1.5 GB Link

🧰 Requirements

See requirements.txt.

⚙️ Installation

$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt

🧻 Usage

To generate dataset (~3 hours on Colab):


$ python create_dataset.py [-c configs/create_dataset.yaml]

To deduplicate dataset (~30 minutes on Colab):

$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]

To train GPT-2 (~3 hours/epoch on Colab with K80):

$ python gpt2_train.py [-c configs/gpt2_train.yaml]

To generate text with GPT-2:

$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]
Practical Machine Learning with Python

Master the essential skills needed to recognize and solve complex real-world problems with Machine Learning and Deep Learning by leveraging the highly popular Python Machine Learning Eco-system.

Dipanjan (DJ) Sarkar 2k Jan 08, 2023
DiY Oxygen Concentrator based on the OxiKit

M19O2 DiY Oxygen Concentrator based on / inspired by the OxiKit, OpenOx, Marut, RepRap and Project Apollo platforms. About Read about the project on H

Maker's Asylum 62 Dec 22, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
This is a GUI program that will generate a word search puzzle image

Word Search Puzzle Generator Table of Contents About The Project Built With Getting Started Prerequisites Installation Usage Roadmap Contributing Cont

11 Feb 22, 2022
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings of ACL: ACL 2021)

BERT-for-Surprisal Python Implementation of ``Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT'' (Findings

7 Dec 05, 2022
This project consists of data analysis and data visualization (done using python)of all IPL seasons from 2008 to 2019 and answering the most asked questions about the IPL.

IPL-data-analysis This project consists of data analysis and data visualization of all IPL seasons from 2008 to 2019 and answering the most asked ques

Sivateja A T 2 Feb 08, 2022
Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Linear Multihead Attention (Linformer) PyTorch Implementation of reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer:

Kui Xu 58 Dec 23, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
A script that automatically creates a branch name using google translation api and jira api

About google translation api와 jira api을 사용하여 자동으로 브랜치 이름을 만들어주는 스크립트 Setup 환경변수에 다음 3가지를 등록해야 한다. JIRA_USER : JIRA email (ex: hyunwook.kim 2 Dec 20, 2021

Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

Alexa 62 Dec 20, 2022
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985 赛题描述详见:https://www.datafountain.cn/competitions/474 文件说明 data: 存放训练数据和测试数据以及预处理代码 model_bert.py: 网络模型结构定义 adv_train

shuo 40 Sep 28, 2022
Text to speech converter with GUI made in Python.

Text-to-speech-with-GUI Text to speech converter with GUI made in Python. To run this download the zip file and run the main file or clone this repo.

SidTheMiner 1 Nov 15, 2021
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
hashily is a Python module that provides a variety of text decoding and encoding operations.

hashily is a python module that performs a variety of text decoding and encoding functions. It also various functions for encrypting and decrypting text using various ciphers.

DevMysT 5 Jul 17, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022