Overview

An Arabic text processing library intended for use in NLP applications


Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out to our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.
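
As a quick taste, here is a minimal sketch built from the cleaner and parser calls that appear in the issues further down this page (outputs abbreviated and version-dependent):

>>> from maha.cleaners.functions import remove
>>> from maha.parsers.functions import parse_dimension
>>> # clean: remove anything matching a custom expression
>>> remove("الساعة الآن 12:00", custom_expressions=r"\d+:\d+")
>>> # parse: extract time expressions from text
>>> parse_dimension("الساعة الآن 12:00", time=True)
[Dimension(body=الساعة الآن 12:00, value=TimeValue(...), start=0, end=17, dimension_type=DimensionType.TIME)]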

Documentation

Documentation is hosted at ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments
  • Time: Add the ability to parse Hijri dates

    What does this pull request change?

    Closes #27.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 6
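
    The PR body does not include a usage example; presumably Hijri dates are handled through the existing time dimension, roughly like this hypothetical call (input and output shape assumed from the time examples further down this page):

    >>> from maha.parsers.functions import parse_dimension
    >>> # hypothetical: a Hijri date expression routed through the time dimension
    >>> parse_dimension("الخامس عشر من رمضان", time=True)
    [Dimension(body=..., value=TimeValue(...), dimension_type=DimensionType.TIME)]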
  • Added distance to dimension parsing

    What does this pull request change?

    Resolves #15.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    parsing highlight 
    opened by TRoboto 5
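
    No usage example is included in the PR body; by analogy with the other dimensions on this page, the new dimension is presumably enabled with a keyword flag, here assumed to be distance=True:

    >>> from maha.parsers.functions import parse_dimension
    >>> # the distance=True keyword is an assumption based on the PR title
    >>> parse_dimension("مشيت خمسة كيلومترات", distance=True)
    [Dimension(body=..., value=..., dimension_type=DimensionType.DISTANCE)]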
  • Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names

    What does this pull request change?

    This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names, with a description and origin included for most of them.

    Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 4
  • Add pyupgrade to pre-commit and upgrade to future-style type annotations

    What does this pull request change?

    Upgrades the codebase to the new type annotation style.

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    maintenance 
    opened by TRoboto 3
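
    For context, "future-style" annotations are the PEP 585/604 spellings that pyupgrade rewrites code to; a small illustration (the function is hypothetical, not from Maha):

    # After pyupgrade: built-in generics (PEP 585) and union syntax (PEP 604),
    # usable on older Python versions thanks to the __future__ import.
    from __future__ import annotations

    # Before, the same signature needed the typing module:
    #   from typing import List, Optional
    #   def tokenize(text: str) -> Optional[List[str]]: ...
    def tokenize(text: str) -> list[str] | None:
        """Split text on whitespace; return None for empty input."""
        return text.split() or None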
  • Deprecate and remove `datasets` module and host datasets on Hugging Face instead

    What does this pull request change?

    • Removes the datasets module.
    • Datasets are now hosted here

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    breaking changes deprecation 
    opened by TRoboto 3
  • Add the ability to parse names from text

    What does this pull request change?

    Adds #24. Depends on #40

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [x] updated the documentation
    • [x] tox passes
    new feature highlight 
    opened by TRoboto 3
  • Add a deprecation system

    What does this pull request change?

    • Closes #23
    • Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    development 
    opened by saedx1 3
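
    The decorators themselves are not shown in this summary; as a rough sketch of what a function-level deprecation decorator of this kind typically looks like (names are hypothetical, not Maha's actual API):

    import functools
    import warnings

    def deprecated_fn(replacement: str):
        """Hypothetical sketch: warn whenever a deprecated function is called."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                warnings.warn(
                    f"{func.__name__} is deprecated; use {replacement} instead.",
                    DeprecationWarning,
                    stacklevel=2,
                )
                return func(*args, **kwargs)
            return wrapper
        return decorator

    # usage: calling old_clean() now emits a DeprecationWarning
    @deprecated_fn(replacement="new_clean")
    def old_clean(text: str) -> str:
        return text.strip()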
  • Prepare for the next release of Maha (v0.3.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.3.0.
    • Bumped the PyPI version to v0.3.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Ordinal: Add support to `بعد` in ordinal parsing

    What does this pull request change?

    Closes #48.

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
  • Numeral: Add support for hierarchical parsing

    What does this pull request change?

    Closes #25

    Status (please check what you already did):

    • [x] added some tests for the functionality
    • [ ] updated the documentation
    • [x] tox passes
    new feature 
    opened by TRoboto 2
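
    Hierarchical parsing here presumably means composing compound numerals (units, tens, hundreds, thousands) into a single value; a hypothetical call, with the numeral=True keyword assumed by analogy with the other dimensions on this page:

    >>> from maha.parsers.functions import parse_dimension
    >>> # "خمسة وعشرون ألفا" is twenty-five thousand; the keyword is assumed
    >>> parse_dimension("خمسة وعشرون ألفا", numeral=True)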
  • Prepare for the next release of Maha (v0.2.0)

    This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

    • Generated changelogs for release v0.2.0.
    • Bumped the PyPI version to v0.2.0.
    • Updated the citation information.
    opened by github-actions[bot] 2
  • Update ci.yml

    Check support for Python 3.10.

    What does this pull request change?

    It checks whether the library supports Python 3.10.

    • ...

    Status (please check what you already did):

    • [ ] added some tests for the functionality
    • [ ] updated the documentation
    • [ ] tox passes
    opened by PAIN-BARHAM 1
  • Add the option to ignore Harakat when removing or replacing

    What problem are you trying to solve?

    Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

    Examples (if relevant)

    Current:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
    >>> output
    يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى
    

    Suggested:

    >>> from maha.cleaners.functions import remove
    >>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
    >>> output
    يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
    

    Definition of Done

    • It must adhere to the coding style used in the defined cleaner functions.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by xaleel 1
  • Wrong parsed name using name dimension

    What happened?

    The name parser extracts wrong names such as بي and شكرا.

    Example: for the text أريد البحث في سجل الإنفاق الخاص بي, the parser returns [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)].

    I expect it to extract only names that appear in the names dataset.

    Python version

    3.8

    What operating system are you using?

    Linux

    Code to reproduce the issue

    >>> from maha.parsers.functions import parse_dimension
    >>> text = "أريد البحث في سجل الإنفاق الخاص بي"
    >>> parse_dimension(text, names=True)
    [Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]
    

    Relevant log output

    No response

    bug parsing 
    opened by PAIN-BARHAM 0
  • Add feature to parse duration period

    What problem are you trying to solve?

    Parse the duration from text that expresses the difference between two dates.

    Examples (if relevant)

    >>> from maha.parsers.functions import parse_dimension
    >>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
    >>> output
    DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
    
    

    Definition of Done

    • It must adhere to the coding style used in the existing dimensions, such as the duration dimension.
    • The implementation should cover most use cases.
    • Adding tests
    feature request 
    opened by PAIN-BARHAM 1
  • Adding the parser functionality to Processors

    What problem are you trying to solve?

    Add the parser functionality to the processors so that they can parse different dimensions.

    Examples (if relevant)

    >>> from pathlib import Path
    >>> import maha
    >>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
    >>> data = resource_path.read_text()
    >>> print(data)
    
    الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
    طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
    يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
    مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
    لما حد يسالني بتختفي كتير لية =..
    زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
    #Windows11 is on the horizon. What feature are you looking forward to
    Get vaccinate #savethesaviour
    Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit
    
    >>> from maha.processors import FileProcessor
    >>> proc = FileProcessor(resource_path)
    >>> parsed = proc.parse_dimension(time=True)
    >>> parsed
    [Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
     Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
     Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
    
    

    Definition of Done

    • It must adhere to the coding style.
    • The implementation should cover most use cases.
    • Adding tests.
    good first issue feature request parsing 
    opened by PAIN-BARHAM 0
Releases: v0.3.0

Owner

Mohammad Al-Fetyani, Machine Learning Engineer