Using koine-nlp
The normalize() function
In its most basic mode of operation, koine-nlp prepares polytonic
Greek text for indexing by normalizing it. This is done with the
omnibus normalize() function. The example Greek is from the SBLGNT.
>>> import koinenlp
>>> koinenlp.normalize("καὶ ἡ σκοτία αὐτὸ οὐ κατέλαβεν.")
'και η σκοτια αυτο ου κατελαβεν'
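To make concrete what normalization for indexing involves, here is a minimal sketch using only Python's standard library. It is not koine-nlp's implementation: `normalize_sketch` is a hypothetical name, and the steps (strip combining marks, drop punctuation, lowercase) are an assumption based on the output shown above. It does not attempt elision handling or any other step the library may perform.

```python
import unicodedata


def normalize_sketch(text):
    """Approximate normalization: strip diacritics, drop punctuation, lowercase."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn":  # accents, breathings, iota subscript
            continue
        if category.startswith("P"):  # punctuation
            continue
        kept.append(ch)
    # Lowercase, then collapse any whitespace runs left behind.
    return " ".join("".join(kept).lower().split())


print(normalize_sketch("καὶ ἡ σκοτία αὐτὸ οὐ κατέλαβεν."))
# και η σκοτια αυτο ου κατελαβεν
```

For real indexing work the library's own normalize() is the better choice, since it may cover cases this approximation misses.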
Other Functions
The normalize() function is just a chain of other functions in the
koinenlp module. You can use individual steps on their own if desired,
for example to remove all diacritics or to remove instances of elision:
>>> koinenlp.strip_diacritics("οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν.")
'ουτος ην εν αρχη προς τον θεον.'
>>> koinenlp.remove_elision("δι’ αὐτοῦ")
'δια αὐτοῦ'
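As a rough illustration of how removing elision can work, the sketch below expands elided forms with a lookup table. This is an assumption about the general technique, not a description of how remove_elision() is implemented: `expand_elision`, `ELISION_MARKS`, and the deliberately tiny `EXPANSIONS` table are all hypothetical.

```python
import re
import unicodedata

# Characters commonly used to mark elision: ASCII apostrophe, right single
# quotation mark, modifier letter apostrophe, and Greek koronis.
ELISION_MARKS = "'\u2019\u02bc\u1fbd"

# Hypothetical, deliberately tiny lookup table of elided stems.
EXPANSIONS = {"δι": "δια", "μετ": "μετα", "κατ": "κατα"}

# A word followed immediately by one of the elision marks.
_ELIDED = re.compile(r"(\w+)[" + re.escape(ELISION_MARKS) + "]")


def expand_elision(text):
    """Replace known elided stems with their full forms; leave others as-is."""
    text = unicodedata.normalize("NFC", text)

    def replace(match):
        # Unknown stems fall back to the original matched text.
        return EXPANSIONS.get(match.group(1), match.group(0))

    return _ELIDED.sub(replace, text)


print(expand_elision("δι\u2019 αὐτοῦ"))
# δια αὐτοῦ
```

A table-driven approach keeps the expansion explicit: any stem not listed is left untouched rather than guessed at.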
See the API reference documentation for a full description of available functions.
Stopwords
koine-nlp includes a list of stopwords, which can be filtered out of text to keep them out of the index.
Note
The list of stopwords has not been normalized. You’ll want to apply the same normalizations to the stopwords list as to the text from which you are removing them.
>>> text = koinenlp.normalize("ὅσοι δὲ ἔλαβον αὐτόν")
>>> normal_stops = [koinenlp.normalize(word) for word in koinenlp.stopwords]
>>> ' '.join([word for word in text.split() if word not in normal_stops])
'οσοι ελαβον αυτον'
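When filtering a large corpus, it can help to normalize the stopwords once and keep them in a set, so each membership test is constant time. The sketch below shows that pattern with standard-library stand-ins: `fold` is a hypothetical substitute for koinenlp.normalize(), and `SAMPLE_STOPWORDS` is an invented sample, not the list shipped with koine-nlp.

```python
import unicodedata


def fold(text):
    """Stand-in normalization: remove combining marks and lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()


# Hypothetical stopword sample; the real list ships with koine-nlp.
SAMPLE_STOPWORDS = ["καί", "δέ", "ὁ", "ἐν"]

# Normalize once and store in a set for constant-time membership tests.
NORMAL_STOPS = frozenset(fold(word) for word in SAMPLE_STOPWORDS)


def remove_stopwords(text):
    """Normalize the text the same way as the stopwords, then filter."""
    return " ".join(word for word in fold(text).split() if word not in NORMAL_STOPS)


print(remove_stopwords("ὅσοι δὲ ἔλαβον αὐτόν"))
# οσοι ελαβον αυτον
```

The key point is that the text and the stopwords pass through the same normalization, so "δὲ" in the text matches "δέ" in the list.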
The simplify_tag() function
When processing tagged corpora with NLTK, it is sometimes necessary to
provide a function to split the simplified tag (typically the
part-of-speech) from the rest of the tag. The provided simplify_tag()
function splits on the hyphen character and can be passed to NLTK for
this purpose.
>>> from nltk.corpus.reader import CategorizedTaggedCorpusReader
>>> lxx = CategorizedTaggedCorpusReader(
...     'lxxmorph-corpus/', r'\d{2}\..*', encoding='utf8',
...     tag_mapping_function=koinenlp.simplify_tag, cat_file='cats.txt')
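For readers who want to see what a hyphen-based tag splitter looks like, here is a minimal sketch. It assumes the simplified tag is the part before the first hyphen; `simplify_tag_sketch` and the sample tags are hypothetical, and the library's simplify_tag() may behave differently at the edges.

```python
def simplify_tag_sketch(tag):
    """Return the part of a tag before the first hyphen."""
    # split() with maxsplit=1 leaves tags without a hyphen unchanged.
    return tag.split("-", 1)[0]


# Hypothetical sample tags in a "part-of-speech hyphen parsing" layout.
for tag in ("N1-NSM", "V1-AAI3S", "C"):
    print(tag, "->", simplify_tag_sketch(tag))
# N1-NSM -> N1
# V1-AAI3S -> V1
# C -> C
```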