Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Last update: Dec 23, 2022

Overview

Polish Wordnet Python library

A simple, easy-to-use, and reasonably fast library for using Słowosieć (also known as PlWordNet), a lexico-semantic database of the Polish language. PlWordNet can also be browsed online.

I created this library because, since version 2.9, PlWordNet cannot be easily loaded into Python (for example with nltk): it is only provided in a custom plwnxml format.

Usage

Load wordnet from an XML file (this will take about 20 seconds), and print basic statistics.

import plwordnet
wn = plwordnet.load('plwordnet_4_2.xml')
print(wn)

Expected output:

PlWordnet
  lexical units: 513410
  synsets: 353586
  relation types: 306
  synset relations: 1477849
  lexical relations: 393137

Find lexical units with the name leśny and print all relations where that unit is in the subject/parent position.

for lu in wn.lemmas('leśny'):
    for s, p, o in wn.lexical_relations_where(subject=lu):
        print(p.format(s, o))

Expected output:

leśny.2 tworzy kolokację z polana.1
leśny.2 jest synonimem mpar. do las.1
leśny.3 przypomina las.1
leśny.4 jest derywatem od las.1
leśny.5 jest derywatem od las.1
leśny.6 przypomina las.1

Print all relation types and their ids:

for id, rel in wn.relation_types.items():
    print(id, rel.name)

Expected output:

10 hiponimia
11 hiperonimia
12 antonimia
13 konwersja
...

Installation

Note: plwordnet requires Python 3.7 or newer.

pip install plwordnet

Version support

This library should be able to read future versions of PlWordNet without modification, even if more relation types are added. Still, if you use this library with a version of PlWordNet that is not listed below, please consider contributing information on whether it is supported.

Documentation

See plwordnet/wordnet.py for RelationType, Synset and LexicalUnit class definitions.

Package functions

load(source): Reads PlWordNet, where source is a path to the wordnet XML file, or a path to the pickled wordnet object. Passed paths can point to files compressed with gzip or lzma.

`Wordnet` instance properties

lexical_relations: List of (subject, predicate, object) triples
synset_relations: List of (subject, predicate, object) triples
relation_types: Mapping from relation type id to object
lexical_units: Mapping from lexical unit id to unit object
synsets: Mapping from synset id to object
(lexical|synset)_relations_(s|o|p): Mapping from id of subject/object/predicate to a set of matching lexical unit/synset relation ids
lexical_units_by_name: Mapping from lexical unit name to a set of matching lexical unit ids
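The per-position mappings listed above suggest how queries by subject, predicate, or object can avoid scanning every triple. The following is a toy sketch of that idea, not the library's implementation: the TripleIndex class, its attribute names, and the ids are all made up for illustration.

```python
from collections import defaultdict


class TripleIndex:
    """Toy store of (subject, predicate, object) id triples with one index per position."""

    def __init__(self, triples):
        self.triples = list(triples)
        # Each index maps an id to the set of positions in self.triples.
        self.by_s = defaultdict(set)
        self.by_p = defaultdict(set)
        self.by_o = defaultdict(set)
        for i, (s, p, o) in enumerate(self.triples):
            self.by_s[s].add(i)
            self.by_p[p].add(i)
            self.by_o[o].add(i)

    def where(self, subject=None, predicate=None, object=None):
        """Return the triples matching every argument that is not None."""
        candidates = None
        for index, key in ((self.by_s, subject), (self.by_p, predicate), (self.by_o, object)):
            if key is None:
                continue
            matches = index.get(key, set())
            candidates = matches if candidates is None else candidates & matches
        if candidates is None:  # no filter given: return everything
            return list(self.triples)
        return [self.triples[i] for i in sorted(candidates)]


# Toy ids: units 1..3, relation types 10 and 11.
index = TripleIndex([(1, 10, 2), (1, 11, 3), (2, 10, 3)])
print(index.where(subject=1))                # [(1, 10, 2), (1, 11, 3)]
print(index.where(subject=1, predicate=11))  # [(1, 11, 3)]
```

Intersecting small id sets keeps a query proportional to the number of matches rather than to the total number of relations.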

`Wordnet` methods

lemmas(value): Returns a list of LexicalUnit objects whose name is equal to value.
lexical_relations_where(subject, predicate, object): Returns lexical relation triples with matching subject and/or predicate and/or object. The subject, predicate and object arguments can be integer ids or LexicalUnit and RelationType objects.
synset_relations_where(subject, predicate, object): Returns synset relation triples with matching subject and/or predicate and/or object. The subject, predicate and object arguments can be integer ids or Synset and RelationType objects.
dump(dst): Pickles the Wordnet object to the open file dst, or to a new file at path dst.

`RelationType` methods

format(x, y, short=False): Substitutes x and y into the RelationType display format (display). If short is true, x and y are instead separated by the short relation name (shortcut).

Comments

Fix for abstract attribute bug, MAJOR speedup of synset_relations_where

Hi Max.

I've fixed the bug related to abstract attribute of the synset (it was always True, because bool("non-empty-string") is always True)

I've also sped up synset_relations_where by 3-4 orders of magnitude.
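The truthiness pitfall described in this comment is easy to reproduce. The sketch below assumes the attribute arrives as a string such as "true" or "false"; the accepted spellings and the helper name parse_bool are assumptions, not the library's code.

```python
def parse_bool(text):
    """Interpret an attribute string such as "true" or "false" as a boolean."""
    # The set of accepted spellings is an assumption for this example.
    return text.strip().lower() in {"true", "1", "yes"}


print(bool("false"))        # True: any non-empty string is truthy
print(parse_bool("false"))  # False
print(parse_bool("true"))   # True
```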

opened by dchaplinsky 7

Exposing relations in Wordnet class

This might be a bit of overkill, but it has two advantages.

First is:

Another is that you can rewrite code like this:

def path_to_top(synset):
    spo = []
    for rel in [11, 107, 171, 172, 199, 212, 213]:

with meaningful names, not numbers

opened by dchaplinsky 3

Domains dict

I've used Wikipedia (https://en.wikipedia.org/wiki/PlWordNet) to decipher 45 of the 54 domains listed on Słowosieć.

There might be more: for example, zwz

Can you try to decipher the rest? My Polish isn't too good (yet ))

opened by dchaplinsky 3

WIP: hypernyms/hyponyms/hypernym_paths routines for WordNet class

So, here is my attempt. I've used the standard Python stack for now; I'll let you know if it causes any problems.

I've tested it on Africa/Afryka with different combinations, all looked sane to me:

for lu in wn.find("Afryka"):
    for i, pth in enumerate(wn.hypernym_paths(lu.synset, full_searh=True, interlingual=True)):
        print(f"{i + 1}: " + "->".join(str(s) for s in pth))

gave me

1: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{przestrzeń.1}
2: {kontynent.2}->{ląd.1 ziemia.4}->{obszar.1 rejon.3 obręb.1}->{location.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}
3: {kontynent.2}->{ląd.1 ziemia.4}->{land.4 dry land.1 earth.3 ground.1 solid ground.1 terra firma.1}->{object.1 physical object.1}->{physical entity.1}->{entity.1}

Sorry, I accidentally reformatted your file with black, so now it has more changes than expected. The important one, though, is this:

+        # For cases like Instance_Hypernym/Instance_Hyponym
+        for rel in self.relation_types.values():
+            if rel.inverse is not None and rel.inverse.inverse is None:
+                rel.inverse.inverse = rel
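The effect of this change can be tried on a toy model. In the sketch below, Rel is a hypothetical stand-in that models only a name and an inverse link, and the ids in the mapping are made up; the loop itself mirrors the diff above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # eq=False: avoid recursive comparison through the inverse cycle
class Rel:
    """Toy stand-in for a relation type; only `name` and `inverse` are modelled."""
    name: str
    inverse: Optional["Rel"] = None


def link_missing_inverses(relation_types):
    """Make inverse links symmetric where only one direction is set."""
    for rel in relation_types.values():
        if rel.inverse is not None and rel.inverse.inverse is None:
            rel.inverse.inverse = rel


hyponym = Rel("Instance_Hyponym")
hypernym = Rel("Instance_Hypernym", inverse=hyponym)  # only one direction is set
types = {1: hypernym, 2: hyponym}
link_missing_inverses(types)
print(hyponym.inverse.name)  # Instance_Hypernym
```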

opened by dchaplinsky 1

Question: hypernym/hyponym tree traversal and export
Hello.

The next logical step for me is to implement tree traversal and data export. For tree traversal I'd try to stick to the following algorithm:

Find the true top-level hypernyms for English and Polish (no interlingual hypernymy)

Calculate number of leaves under each top level hypernym (and/or number of LUs under it)

For each node calculate the distance from top-level hypernym
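The three steps above can be sketched on a plain edge list. This is only an illustration under stated assumptions: analyse_hypernym_tree is a hypothetical helper, the input is a list of (child, parent) pairs rather than plwordnet objects, and depth is taken as the shortest distance from any top-level node.

```python
from collections import defaultdict, deque


def analyse_hypernym_tree(edges):
    """edges: iterable of (child, parent) pairs. Returns (roots, leaf_counts, depths)."""
    children = defaultdict(set)
    has_parent = set()
    nodes = set()
    for child, parent in edges:
        children[parent].add(child)
        has_parent.add(child)
        nodes.update((child, parent))

    # Step 1: top-level hypernyms are the nodes that have no parent.
    roots = sorted(nodes - has_parent)

    # Step 2: count the distinct leaves reachable from each root.
    def leaves_under(root):
        seen, queue, leaves = {root}, deque([root]), 0
        while queue:
            node = queue.popleft()
            kids = children.get(node, ())
            if not kids:
                leaves += 1
            for kid in kids:
                if kid not in seen:
                    seen.add(kid)
                    queue.append(kid)
        return leaves

    leaf_counts = {root: leaves_under(root) for root in roots}

    # Step 3: breadth-first search from all roots gives the shortest distance.
    depths = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for kid in children.get(node, ()):
            if kid not in depths:
                depths[kid] = depths[node] + 1
                queue.append(kid)
    return roots, leaf_counts, depths


# Toy hierarchy with made-up labels.
edges = [("dog", "animal"), ("cat", "animal"), ("animal", "entity"), ("rock", "entity")]
roots, leaf_counts, depths = analyse_hypernym_tree(edges)
print(roots)        # ['entity']
print(leaf_counts)  # {'entity': 3}
```

A filter such as "only the first 3-4 levels of trees with more than X leaves" then becomes a simple predicate over leaf_counts and depths.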

To export, I'd like to use the information above and pass some callables for filtering, so that only particular nodes/rels are exported. For example, I only need the first 3-4 levels of the noun trees that have more than X leaves. This way I can export and visualize only the parts of the trees I need.

Speaking of export, I'm looking into graphviz (basically to lay the top-level ontology out on paper) and TTL, in a format similar to the original PWN TTL export.

I'd like to have your opinion on two things:

General approach

How to incorporate that into the code. It might be a part of the Wordnet class, a separate file (maybe under a contrib section), a usage example, or a separate script which I/we do or don't publish at all

opened by dchaplinsky 1

Separate file and classes for domains, support for bz2 in load helper

Hi Max. I've slightly cleaned up your spreadsheet on domains (replaced TODO and dashes with nones and made POSes compatible with the UD POS tagset) and wrapped everything into classes. I've also made two rows out of cwytw / cwyt and moved the pl description of adj/adv into the english one. I made the en fields the default ones for the str method

It's up to you to replace str domains in LexicalUnit with instances of Domain class as it's still ok to compare Domain to str

I've also added support for bz2 in loader helper.

opened by dchaplinsky 1
Include sentiment annotations

PlWordNet 4.2 comes with a supplementary file (słownik_anotacji_emocjonalnej.csv) containing sentiment annotations for lexical units. Users should be able to load and access sentiment data.
enhancement

opened by maxadamski 1

Parse the description format

Currently, nothing is done with the description field in Synset and LexicalUnit. Information about the description format can be found in PlWordNet's readme.

Parsing should be done lazily to avoid slowing down the initial loading of PlWordNet into memory.

Example description:

##K: og. ##D: owoc (wielopestkowiec) jabłoni. [##P: Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek.] {##L: http://pl.wikipedia.org/wiki/Jab%C5%82ko}

Desired behavior:

A new (memoized) method rich_description returns the following dict:

dict(
  qualifier='og.',
  definition='owoc (wielopestkowiec) jabłoni.',
  examples=['Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, z którego wystaje ogonek'],
  sources=['http://pl.wikipedia.org/wiki/Jab%C5%82ko'])
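One possible way to get this behavior is a small regex-based parser wrapped in functools.lru_cache, so each distinct description is parsed at most once and only on demand. This is a sketch based solely on the single example above: the marker meanings (##K, ##D, ##P, ##L) are inferred from the desired dict, the function is a standalone helper rather than a method, and, unlike the desired output, it keeps the trailing period of the example sentence.

```python
import re
from functools import lru_cache

# A field ends before the next marker, an opening bracket/brace, or the end of the string.
_FIELD_END = r"(?=\s*(?:##|\[|\{|$))"
_QUALIFIER = re.compile(r"##K:\s*(.*?)" + _FIELD_END)
_DEFINITION = re.compile(r"##D:\s*(.*?)" + _FIELD_END)
_EXAMPLE = re.compile(r"\[##P:\s*(.*?)\s*\]")
_SOURCE = re.compile(r"\{##L:\s*(.*?)\s*\}")


@lru_cache(maxsize=None)
def rich_description(description):
    """Parse a raw description string into its labelled parts (memoized).

    Note: the cached dict is shared between callers, so treat it as read-only.
    """
    qualifier = _QUALIFIER.search(description)
    definition = _DEFINITION.search(description)
    return dict(
        qualifier=qualifier.group(1) if qualifier else None,
        definition=definition.group(1) if definition else None,
        examples=_EXAMPLE.findall(description),
        sources=_SOURCE.findall(description),
    )


raw = (
    "##K: og. ##D: owoc (wielopestkowiec) jabłoni. "
    "[##P: Jabłka są kształtem zbliżone do kuli, z zagłębieniem na szczycie, "
    "z którego wystaje ogonek.] {##L: http://pl.wikipedia.org/wiki/Jab%C5%82ko}"
)
info = rich_description(raw)
print(info["qualifier"])  # og.
print(info["sources"])    # ['http://pl.wikipedia.org/wiki/Jab%C5%82ko']
```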

enhancement

opened by maxadamski 1

Releases(0.1.5)

0.1.5 (Feb 5, 2022)

Changelog:

Added convenience routines for querying hypernyms and hyponyms

Added domain and part-of-speech data

Added description and sentiment data parsing

Added HTML pretty printing

Reduced RAM usage

0.1.4 (Aug 17, 2021)

Bugfixes

0.1.3 (Aug 16, 2021)

Changelog:

Fixed the abstract attribute

Optimized lexical_relations_where and synset_relations_where

Owner

Max Adamski

Student of AI @ PUT

