125 lines
3.4 KiB
Plaintext
125 lines
3.4 KiB
Plaintext
Metadata-Version: 2.1
|
|
Name: segments
|
|
Version: 2.3.0
|
|
Summary: Segmentation with orthography profiles
|
|
Home-page: https://github.com/cldf/segments
|
|
Author: Steven Moran and Robert Forkel
|
|
Author-email: dlce.rdm@eva.mpg.de
|
|
License: Apache 2.0
|
|
Project-URL: Bug Tracker, https://github.com/cldf/segments/issues
|
|
Keywords: linguistics,tokenizer
|
|
Platform: any
|
|
Classifier: Development Status :: 5 - Production/Stable
|
|
Classifier: Intended Audience :: Developers
|
|
Classifier: Intended Audience :: Science/Research
|
|
Classifier: Natural Language :: English
|
|
Classifier: Operating System :: OS Independent
|
|
Classifier: Programming Language :: Python :: 3
|
|
Classifier: Programming Language :: Python :: 3.9
|
|
Classifier: Programming Language :: Python :: 3.10
|
|
Classifier: Programming Language :: Python :: 3.11
|
|
Classifier: Programming Language :: Python :: 3.12
|
|
Classifier: Programming Language :: Python :: 3.13
|
|
Classifier: Programming Language :: Python :: Implementation :: CPython
|
|
Classifier: Programming Language :: Python :: Implementation :: PyPy
|
|
Classifier: License :: OSI Approved :: Apache Software License
|
|
Requires-Python: >=3.8
|
|
Description-Content-Type: text/markdown
|
|
License-File: LICENSE
|
|
Requires-Dist: regex
|
|
Requires-Dist: csvw>=1.5.6
|
|
Provides-Extra: dev
|
|
Requires-Dist: flake8; extra == "dev"
|
|
Requires-Dist: wheel; extra == "dev"
|
|
Requires-Dist: build; extra == "dev"
|
|
Requires-Dist: twine; extra == "dev"
|
|
Provides-Extra: test
|
|
Requires-Dist: pytest>=5; extra == "test"
|
|
Requires-Dist: pytest-mock; extra == "test"
|
|
Requires-Dist: pytest-cov; extra == "test"
|
|
|
|
segments
|
|
========
|
|
|
|
[](https://github.com/cldf/segments/actions?query=workflow%3Atests)
|
|
[](https://pypi.org/project/segments)
|
|
|
|
|
|
[](https://doi.org/10.5281/zenodo.1051157)
|
|
|
|
The segments package provides Unicode Standard tokenization routines and orthography segmentation,
|
|
implementing the linear algorithm described in the orthography profile specification from
|
|
*The Unicode Cookbook* (Moran and Cysouw 2018 [](https://doi.org/10.5281/zenodo.1296780)).
|
|
|
|
|
|
Command line usage
|
|
------------------
|
|
|
|
Create a text file:
|
|
```
|
|
$ echo "aäaaöaaüaa" > text.txt
|
|
```
|
|
|
|
Now look at the profile:
|
|
```
|
|
$ cat text.txt | segments profile
|
|
Grapheme frequency mapping
|
|
a 7 a
|
|
ä 1 ä
|
|
ü 1 ü
|
|
ö 1 ö
|
|
```
|
|
|
|
Write the profile to a file:
|
|
```
|
|
$ cat text.txt | segments profile > profile.prf
|
|
```
|
|
|
|
Edit the profile:
|
|
|
|
```
|
|
$ more profile.prf
|
|
Grapheme frequency mapping
|
|
aa 0 x
|
|
a 7 a
|
|
ä 1 ä
|
|
ü 1 ü
|
|
ö 1 ö
|
|
```
|
|
|
|
Now tokenize the text without profile:
|
|
```
|
|
$ cat text.txt | segments tokenize
|
|
a ä a a ö a a ü a a
|
|
```
|
|
|
|
And with profile:
|
|
```
|
|
$ cat text.txt | segments --profile=profile.prf tokenize
|
|
a ä aa ö aa ü aa
|
|
|
|
$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
|
|
a ä x ö x ü x
|
|
```
|
|
|
|
|
|
API
|
|
---
|
|
|
|
```python
|
|
>>> from segments import Profile, Tokenizer
|
|
>>> t = Tokenizer()
|
|
>>> t('abcd')
|
|
'a b c d'
|
|
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
|
|
>>> print(prf)
|
|
Grapheme mapping
|
|
ab x
|
|
cd y
|
|
>>> t = Tokenizer(profile=prf)
|
|
>>> t('abcd')
|
|
'ab cd'
|
|
>>> t('abcd', column='mapping')
|
|
'x y'
|
|
```
|