View on GitHub

Indic NLP Library

Resources and tools for Indian language Natural Language Processing

Download this project as a .zip file Download this project as a tar.gz file

Indic NLP Library

The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc. and this library is an attempt to provide a general solution to very commonly required toolsets for Indian language text.

The library provides the following functionalities:

Language Support

Language Support Details

Pre-requisites

Download Examples

You can download this IPython Notebook to play with examples of the API usage. If you just to browse through the examples, read this on IPython NBViewer

Text Normalization

Text written in Indic scripts display a lot of quirky behaviour on account of varying input methods, multiple representations for the same character, etc. There is a need to canonicalize the representation of text so that NLP applications can handle the data in a consistent manner. The canonicalization primarily handles the following issues:

- Non-spacing characters like ZWJ/ZWNL
- Multiple representations of Nukta based characters 
- Multiple representations of two part dependent vowel signs
- Typing inconsistencies: e.g. use of pipe (|) for poorna virama

You can check the documentation for each normalizer in the file src/normalize/indic_normalize.py to know the script specific normalizations.

The following scripts are supported:

Devanagari(Hindi,Marathi,Sanskrit,Konkani,Nepali), Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Tamil, Telugu, Kannada, Malayalam

Commandline Usage

python src/indic_nlp/normalize/indic_normalize.py <infile> <outfile> <language> [<replace_nukta>]

<language>: 2-letter ISO 639-1 language code. 
            Codes for some language not covered in the standard
            kK: Konkani
            bD: Bodo
            mP: Manipuri
<replace_nukta>: True/False. Default: False                

API Usage

e.g.

from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
input_text=u"\u0929 \u0928\u093c"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi",remove_nuktas)
print normalizer.normalize(input_text)

Unicode based Transliteration

Transliterate from one Indic script to another. This is a simple script which exploits the fact that Unicode points of various Indic scripts are at corresponding offsets from the base codepoint for that script. The following scripts are supported:

Devanagari(Hindi,Marathi,Sanskrit,Konkani,Nepali), Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Tamil, Telugu, Kannada, Malayalam

Commandline Usage

python src/indic_nlp/transliterate/unicode_transliterate.py <infile> <outfile> <language1> <language2>

<language1>,<language2>: 2-letter ISO 639-1 language code. 
            Codes for some language not covered in the standard
            kK: Konkani
            bD: Bodo
            mP: Manipuri

API Usage

e.g.

from indicnlp.transliterate.unicode_transliterate import IndicNormalizerFactory
input_text=u"\u0929 \u0928\u093c"
print UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")

Tokenization

A trivial tokenizer which just tokenizes on the punctuation boundaries. This also includes punctuations for the Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.

Commandline Usage

python src/indicnlp/tokenize/indic_tokenize.py <infile> <outfile> <language> 

<language>: 2-letter ISO 639-1 language code. 
            Codes for some language not covered in the standard
            kK: Konkani
            bD: Bodo
            mP: Manipuri

API Usage

e.g.

from indicnlp.tokenize import indic_tokenize  
indic_string=u'\u0905\u0928\u0942\u092a,\u0905\u0928\u0942\u092a?\u0964 '
indic_tokenize.trivial_tokenize(indic_string)

Morphological Analysis

Unsupervised morphological analysers for various Indian language. Given a word, the analyzer returns the componenent morphemes. The analyzer can recognize inflectional and derivational morphemes.

The following languages are supported:

Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam

Support for more languages will be added soon.

Commandline Usage

python src/indicnlp/morph/unsupervised_morph.py <infile> <outfile> <language> <resource_directory>

<language>: 2-letter ISO 639-1 language code. 
            Codes for some language not covered in the standard
            kK: Konkani
            bD: Bodo
            mP: Manipuri

<resource_directory>: Path to directory containg Indic NLP library resources. 

API Usage

e.g.

# -*- coding: utf-8 -*-

from indicnlp.morph import unsupervised_morph 
from indicnlp import common

common.INDIC_RESOURCES_PATH="resources"

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('mr')

indic_string=u'आपल्या हिरड्यांच्या आणि दातांच्यामध्ये जीवाणू असतात ."

analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

Author

Anoop Kunchukuttan ( anoop.kunchukuttan@gmail.com )

Version: 0.2

Revision Log

0.3 : 21 Oct 2014: Supports morph-analysis between Indian languages

0.2 : 13 Jun 2014: Supports transliteration between Indian languages and tokenization of Indian languages

0.1 : 12 Mar 2014: Initial version. Supports text normalization.

LICENSE

Copyright Anoop Kunchukuttan 2013 - present

Indic NLP Library is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Indic NLP Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Indic NLP Library. If not, see http://www.gnu.org/licenses/.