Files
blender-portable-repo/extensions/.local/lib/python3.11/site-packages/segments-2.3.0.dist-info/METADATA
T
2026-03-17 14:58:51 -06:00

125 lines
3.4 KiB
Plaintext

Metadata-Version: 2.1
Name: segments
Version: 2.3.0
Summary: Segmentation with orthography profiles
Home-page: https://github.com/cldf/segments
Author: Steven Moran and Robert Forkel
Author-email: dlce.rdm@eva.mpg.de
License: Apache 2.0
Project-URL: Bug Tracker, https://github.com/cldf/segments/issues
Keywords: linguistics,tokenizer
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex
Requires-Dist: csvw>=1.5.6
Provides-Extra: dev
Requires-Dist: flake8; extra == "dev"
Requires-Dist: wheel; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=5; extra == "test"
Requires-Dist: pytest-mock; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
segments
========
[![Build Status](https://github.com/cldf/segments/workflows/tests/badge.svg)](https://github.com/cldf/segments/actions?query=workflow%3Atests)
[![PyPI](https://img.shields.io/pypi/v/segments.svg)](https://pypi.org/project/segments)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1051157.svg)](https://doi.org/10.5281/zenodo.1051157)
The segments package provides Unicode Standard tokenization routines and orthography segmentation,
implementing the linear algorithm described in the orthography profile specification from
*The Unicode Cookbook* (Moran and Cysouw 2018 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1296780.svg)](https://doi.org/10.5281/zenodo.1296780)).
Command line usage
------------------
Create a text file:
```
$ echo "aäaaöaaüaa" > text.txt
```
Now look at the profile:
```
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
Write the profile to a file:
```
$ cat text.txt | segments profile > profile.prf
```
Edit the profile:
```
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
```
Now tokenize the text without profile:
```
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
```
And with profile:
```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa
$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
API
---
```python
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme mapping
ab x
cd y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
```