Data preprocessing rosetta parser for python

Last update: Nov 28, 2021

Overview

datapreprocessing_rosetta_parser

I've never done any NLP or text data processing before, so I wanted to use this hackathon as a learning opportunity, specifically targeting popular packages like pandas, beautifulsoup and spacy.

The main idea of my project is to recreate Jelle Teijema's preprocessing pipeline and then run a Dutch language model on each document to extract things of interest, such as emails, URLs, organizations, people and dates. Maybe at this point it shouldn't be considered just pre-processing, hmmm. Anyway, I've used the nl_core_news_lg model. It is not very reliable, especially for organization and person names; however, it still allows for interesting queries.
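To give a feel for the kind of extraction described above, here is a minimal sketch that pulls emails and URLs out of a piece of text. It is a regex-based stand-in that uses only the standard library, not the spaCy-based code in generate.py; the patterns, the sample sentence and the function name `find_emails_and_urls` are all assumptions made for illustration.

```python
import re

# Deliberately simple patterns (assumptions, not what generate.py uses).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def find_emails_and_urls(text):
    """Return (emails, urls) found in text, in order of appearance."""
    emails = EMAIL_RE.findall(text)
    # Strip trailing punctuation that usually belongs to the sentence, not the URL.
    urls = [u.rstrip(".,;:!?)") for u in URL_RE.findall(text)]
    return emails, urls


# Made-up sample sentence, only for demonstration.
sample = "Mail jan@example.nl of kijk op https://example.org/info."
emails, urls = find_emails_and_urls(sample)
print(emails, urls)
```

Organizations, people and dates cannot be found this way; those need a language model such as nl_core_news_lg.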

Moreover, I've decided to try summarization and to collect the most frequent words in the documents. My script tries to find the N_SUMMARY_SENTENCES most important sentences and stores them in the summary column. Please note, my Dutch is not very strong, so I can't really judge how well it works :)
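One common way to do this kind of extractive summary is to score each sentence by how frequent its words are in the whole document. The sketch below illustrates that idea with the standard library only; it is not necessarily how generate.py scores sentences. The value of `N_SUMMARY_SENTENCES`, the tiny `STOP_WORDS` set and the helper names are assumptions.

```python
import re
from collections import Counter

N_SUMMARY_SENTENCES = 2  # name taken from the README; the value is an assumption
# Tiny illustrative stop word list (assumption); a real one would be much longer.
STOP_WORDS = {"de", "het", "een", "en", "van", "is", "in", "op"}


def tokenize(text):
    """Lowercase words made of letters only, with stop words removed."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if w not in STOP_WORDS]


def top_words(text, n=5):
    """The n most frequent non-stop words in the text."""
    return [word for word, _ in Counter(tokenize(text)).most_common(n)]


def summarize(text, n_sentences=N_SUMMARY_SENTENCES):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    # Score of a sentence = sum of the document-wide frequencies of its words.
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


# Made-up sample text, only for demonstration.
text = "De minister sprak over het budget. Het budget van de minister is groot. Het regent."
print(summarize(text))
print(top_words(text, 2))
```

Summing frequencies favours long sentences; dividing the score by the sentence length is a simple alternative if that turns out to be a problem.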

Finally, the script also saves the cleaned title and file contents, as per the track's anticipated output.

Output file

generate.py reads the .csv files from the input_data folder and produces an output .csv file with | as the separator. It is pretty heavy (about 1.8x the size of the input csv, ~75MB) and has a total of 15 columns:

Column name	Description
filename	Original filename provided in the input file
file_content	Original file contents provided in the input file
id	The dot separated numbers from the filename
category	Type of a file
filename_date	Date extracted from a filename
parsed_date	Date extracted from file contents
found_emails	Emails found in the file contents
found_urls	URLs found in the file contents
found_organizations	Organizations found in the file contents
found_people	People found in the file contents
found_dates	Dates found in the file contents
summary	Summary of the document
top5words	Top 5 most frequently used words in the file contents
title	Somewhat cleaned title
abstract	Somewhat cleaned file contents
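The found_* columns hold lists, and a .csv cell can only hold text, so each list ends up as its string representation and has to be parsed back when the file is read. The sketch below shows that round trip on a made-up two-row file with the same | separator, using only the standard library. The filenames and addresses are invented, and the exact way generate.py writes the file is an assumption.

```python
import ast
import csv
import io

# Write a tiny, made-up file in the same shape: "|" separator, lists stored as text.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|")
writer.writerow(["filename", "found_emails"])
writer.writerow(["a.txt", str(["jan@example.nl", "piet@example.org"])])
writer.writerow(["b.txt", str([])])

# Read it back: every cell is a plain string until it is parsed.
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter="|"))

# ast.literal_eval turns "['x', 'y']" back into a real Python list, without using eval().
parsed = {row["filename"]: ast.literal_eval(row["found_emails"]) for row in rows}
print(parsed)
```

This is why the queries in this README wrap the list columns in ast.literal_eval before working with them.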

Some interesting queries that I could think of at 12pm

Load the output processed .csv file:

import pandas as pd
df = pd.read_csv('./output_data/processed_data.csv', sep='|',
                 index_col=0, dtype=str)

All unique emails found in the documents:

import ast
emails = sum([ast.literal_eval(x) for x in df['found_emails']], [])
unique_emails = set(emails)

Top 10 communicated domains in the documents:

from collections import Counter
domains = [x.split('@')[1] for x in emails]
d_counter = Counter(domains)
print(d_counter.most_common(10))

Top 10 organizations mentioned in the documents:

orgs = sum([ast.literal_eval(x) for x in df['found_organizations']], [])
o_counter = Counter(orgs)
print(o_counter.most_common(10))

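Find IDs of documents that contain the word "confidential":

# na=False treats documents with an empty abstract as non-matches,
# otherwise the NaN values make the boolean mask invalid
df['id'][df['abstract'].str.contains('confidential', na=False)]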

How many documents and categories there are in the dataset:

print(f'Total number of documents: {len(df)}')
print('Documents by category:')
df['category'].value_counts()

and I am sure you can be significantly more creative with this :)

How to generate output data

Install dependencies with conda and switch to the environment:

conda env create -f environment.yml
conda activate ftm_hackathon

Alternatively (not tested), you can install packages to your current environment manually:

pip install spacy tqdm pandas bs4

Download the Dutch spaCy model (~500MB):

python -m spacy download nl_core_news_lg

Put your raw .csv files into the input_data folder.
Run generate.py. On my 6yo laptop it takes ~17 minutes.
The result will be written to output_data/processed_data.csv

Owner

ASReview hackathon for Follow the Money
