Metadata-Version: 2.1 Name: segments Version: 2.3.0 Summary: Segmentation with orthography profiles Home-page: https://github.com/cldf/segments Author: Steven Moran and Robert Forkel Author-email: dlce.rdm@eva.mpg.de License: Apache 2.0 Project-URL: Bug Tracker, https://github.com/cldf/segments/issues Keywords: linguistics,tokenizer Platform: any Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: Intended Audience :: Science/Research Classifier: Natural Language :: English Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Classifier: Programming Language :: Python :: 3.13 Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Programming Language :: Python :: Implementation :: PyPy Classifier: License :: OSI Approved :: Apache Software License Requires-Python: >=3.8 Description-Content-Type: text/markdown License-File: LICENSE Requires-Dist: regex Requires-Dist: csvw>=1.5.6 Provides-Extra: dev Requires-Dist: flake8; extra == "dev" Requires-Dist: wheel; extra == "dev" Requires-Dist: build; extra == "dev" Requires-Dist: twine; extra == "dev" Provides-Extra: test Requires-Dist: pytest>=5; extra == "test" Requires-Dist: pytest-mock; extra == "test" Requires-Dist: pytest-cov; extra == "test" segments ======== [![Build Status](https://github.com/cldf/segments/workflows/tests/badge.svg)](https://github.com/cldf/segments/actions?query=workflow%3Atests) [![PyPI](https://img.shields.io/pypi/v/segments.svg)](https://pypi.org/project/segments) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1051157.svg)](https://doi.org/10.5281/zenodo.1051157) The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from *The Unicode Cookbook* (Moran and Cysouw 2018 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1296780.svg)](https://doi.org/10.5281/zenodo.1296780)). Command line usage ------------------ Create a text file: ``` $ echo "aäaaöaaüaa" > text.txt ``` Now look at the profile: ``` $ cat text.txt | segments profile Grapheme frequency mapping a 7 a ä 1 ä ü 1 ü ö 1 ö ``` Write the profile to a file: ``` $ cat text.txt | segments profile > profile.prf ``` Edit the profile: ``` $ more profile.prf Grapheme frequency mapping aa 0 x a 7 a ä 1 ä ü 1 ü ö 1 ö ``` Now tokenize the text without profile: ``` $ cat text.txt | segments tokenize a ä a a ö a a ü a a ``` And with profile: ``` $ cat text.txt | segments --profile=profile.prf tokenize a ä aa ö aa ü aa $ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize a ä x ö x ü x ``` API --- ```python >>> from segments import Profile, Tokenizer >>> t = Tokenizer() >>> t('abcd') 'a b c d' >>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'}) >>> print(prf) Grapheme mapping ab x cd y >>> t = Tokenizer(profile=prf) >>> t('abcd') 'ab cd' >>> t('abcd', column='mapping') 'x y' ```