koinenlp API reference

koinenlp.final_sigma(text)

Return the given text with every final sigma (ς) replaced by a medial sigma (σ).
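The replacement described above can be sketched with the standard library alone. This is an illustration under the assumption that the operation is a plain character substitution; `replace_final_sigma` is a hypothetical name, not part of koinenlp.

```python
FINAL_SIGMA = "\u03c2"   # ς, written at the end of a word
MEDIAL_SIGMA = "\u03c3"  # σ, written everywhere else


def replace_final_sigma(text: str) -> str:
    """Hypothetical helper: replace every final sigma with a medial sigma."""
    return text.replace(FINAL_SIGMA, MEDIAL_SIGMA)


# Escapes spell "λογος" with a final sigma as the last character.
print(replace_final_sigma("\u03bb\u03bf\u03b3\u03bf\u03c2"))
```

Folding both sigma forms into one lets an index match a word regardless of where the sigma falls.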

koinenlp.lowercase(text)

Return the given text in lowercase.
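One detail worth keeping in mind when lowercasing Greek: Python's built-in str.lower() applies the Unicode final-sigma rule, so a capital sigma at the end of a word becomes ς rather than σ. Whether koinenlp.lowercase does anything beyond the built-in is not stated here; the sketch below shows only the built-in behaviour.

```python
word = "\u039b\u039f\u0393\u039f\u03a3"  # ΛΟΓΟΣ in capitals
lowered = word.lower()

print(lowered)
# The last character is U+03C2 (final sigma), not U+03C3 (medial sigma).
print(hex(ord(lowered[-1])))  # 0x3c2
```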

koinenlp.normalize(text)

Return the given text in a normalized form suitable for indexing.

Specifically, the text is converted to lowercase, stripped of diacritics, has every final sigma converted to a medial sigma, has elided forms expanded to their full forms, and is Unicode-normalized.
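As a rough illustration of how these steps combine, here is a standard-library sketch. It is not koinenlp's implementation: `index_form` is a hypothetical name, the step order simply follows the description above, and elision expansion is left out because it needs a lexicon of elided forms.

```python
import unicodedata


def index_form(text: str) -> str:
    """Hypothetical stand-in for the normalization steps described above."""
    text = text.lower()
    # Decompose, then drop combining marks (accents, breathings, iota subscript).
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Final sigma (U+03C2) to medial sigma (U+03C3).
    text = text.replace("\u03c2", "\u03c3")
    # Elision expansion is omitted in this sketch.
    return unicodedata.normalize("NFKC", text)


# Escapes spell "Λόγος" (capital lambda, accented omicron, final sigma).
print(index_form("\u039b\u03cc\u03b3\u03bf\u03c2"))
```

Two spellings of the same word that differ only in case, accentuation, or sigma form map to the same string, which is the property an index key needs.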

koinenlp.remove_elision(text, diacritics=False)

Return the given text with all instances of elision removed.

Pass diacritics=True if the input text contains diacritics. They must be removed before elisions can be detected and removed.
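The sketch below illustrates why diacritics matter for this step, assuming elided forms are matched against a lookup table of diacritic-free spellings. The table, the helper names (`strip_marks`, `expand_elision`), and the use of U+2019 as the elision mark are all assumptions made for illustration; koinenlp's own data and matching rules may differ.

```python
import unicodedata

# Hypothetical table: diacritic-free elided forms and their full forms.
# A real lexicon would be much larger.
ELIDED_FORMS = {
    "\u03b1\u03c0\u2019": "\u03b1\u03c0\u03bf",              # απ’  -> απο
    "\u03b4\u03b9\u2019": "\u03b4\u03b9\u03b1",              # δι’  -> δια
    "\u03b1\u03bb\u03bb\u2019": "\u03b1\u03bb\u03bb\u03b1",  # αλλ’ -> αλλα
}


def strip_marks(text: str) -> str:
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_elision(text: str, diacritics: bool = False) -> str:
    """Replace whitespace-separated tokens found in ELIDED_FORMS."""
    if diacritics:
        text = strip_marks(text)
    return " ".join(ELIDED_FORMS.get(token, token) for token in text.split())


phrase = "\u1f00\u03c0\u2019 \u03b1\u1f50\u03c4\u03bf\u1fe6"  # ἀπ’ αὐτοῦ
print(expand_elision(phrase))                   # unchanged: the breathing blocks the match
print(expand_elision(phrase, diacritics=True))  # elided form expanded, diacritics gone
```

In this sketch the output of the diacritics=True call no longer carries diacritics, since they were stripped before matching.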

koinenlp.remove_punctuation(text)

Return the given text with punctuation removed.
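Which characters count as punctuation is not specified here. One plausible approach, sketched below, drops every character whose Unicode general category begins with "P", which covers Greek marks such as the ano teleia (U+0387) and the Greek question mark (U+037E). The helper name `drop_punctuation` is hypothetical.

```python
import unicodedata


def drop_punctuation(text: str) -> str:
    """Remove characters whose Unicode general category starts with 'P'."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


# "λόγος, καὶ θεός" followed by an ano teleia
sample = "\u03bb\u03cc\u03b3\u03bf\u03c2, \u03ba\u03b1\u1f76 \u03b8\u03b5\u03cc\u03c2\u0387"
print(drop_punctuation(sample))
```

Note that under this rule the right single quotation mark (U+2019), often used to mark elision, is also removed, so the order of punctuation removal and elision handling matters.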

koinenlp.simplify_tag(tag)

Simplify the given tag, returning only the POS portion.

Pass this function as the tag_mapping_function argument of nltk.corpus.reader.TaggedCorpusReader (or a similar class). The tagged_* methods of the resulting corpus then accept simplify_tags=True.
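The tag layout is not described here. Purely as an illustration, the sketch below assumes tags of the form POS-MORPHOLOGY (for example "N-NSM"), with the part of speech before the first hyphen. Both the layout and the name `pos_portion` are assumptions, not koinenlp's documented behaviour.

```python
def pos_portion(tag: str) -> str:
    """Return the text before the first hyphen (assumed layout: POS-MORPHOLOGY)."""
    return tag.split("-", 1)[0]


print(pos_portion("N-NSM"))  # N
print(pos_portion("CONJ"))   # CONJ (no hyphen, returned unchanged)
```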

koinenlp.strip_diacritics(text)

Return the given text string with Unicode diacritics removed.
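A common way to remove diacritics, shown here as a sketch rather than as koinenlp's implementation, is to decompose the text canonically (NFD) and drop every combining mark. For polytonic Greek this removes accents, breathings, and the iota subscript in one pass. `drop_diacritics` is a hypothetical name.

```python
import unicodedata


def drop_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, and recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# "ἐν ἀρχῇ": smooth breathings, plus a circumflex and iota subscript on the eta
print(drop_diacritics("\u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7"))
```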

koinenlp.unicode_normalize(text)

Return the given text normalized to Unicode NFKC.
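Normalization matters for Greek because the same visible letter can be encoded in several ways. The standard-library sketch below shows three encodings of an accented omicron that compare unequal as raw strings but converge under NFKC. It uses unicodedata.normalize directly and makes no claim about how koinenlp calls it.

```python
import unicodedata

oxia = "\u1f79"              # GREEK SMALL LETTER OMICRON WITH OXIA
tonos = "\u03cc"             # GREEK SMALL LETTER OMICRON WITH TONOS
decomposed = "\u03bf\u0301"  # omicron followed by COMBINING ACUTE ACCENT

print(oxia == tonos)                                       # False
print(unicodedata.normalize("NFKC", oxia) == tonos)        # True
print(unicodedata.normalize("NFKC", decomposed) == tonos)  # True
```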