Japanese NLP Library

This is a fork of the jProcessing repository whose purpose is to add Python 3.x support.

Not all functions from version 0.2 are supported yet. Feel free to help debug or port the code.

Contents

1 Requirements
- 1.1 Links
- 1.2 Install
- 1.3 History
2 Libraries and Modules
3 Edict Japanese Dictionary Search with Example sentences
4 Sentiment Analysis Japanese Text -> Not supported yet !!
5 Contacts

1 Requirements

Third Party Dependencies
- Cabocha Japanese Morphological parser http://sourceforge.net/projects/cabocha/
Python Dependencies
- Python 3.*

1.1 `Links`

All code at jProcessing Repo GitHub (original repository)

Documentation and HomePage and Sphinx

clone git@github.com:kevincobain2000/jProcessing.git

1.2 `Install`

In Terminal

bash$ python3 setup.py install

1.3 History

current version (unofficial modification of 0.1)
- Python3 support of the 0.1 version
0.2
- Sentiment Analysis of Japanese Text
0.1
- Morphologically Tokenize Japanese Sentence
- Kanji / Hiragana / Katakana to Romaji Converter
- Edict Dictionary Search - borrowed
- Edict Examples Search - incomplete
- Sentence Similarity between two JP Sentences
- Run Cabocha (ISO-8859-1 configured) in Python.
- Longest Common String between Sentences
- Kanji to Katakana Pronunciation
- Hiragana, Katakana Chart Parser

2 Libraries and Modules

2.1 Tokenize `jTokenize.py`

In Python3

>>> from jNlp.jTokenize import jTokenize
>>> input_sentence = '私は彼を５日前、つまりこの前の金曜日に駅で見かけた'
>>> list_of_tokens = jTokenize(input_sentence)
>>> print(list_of_tokens)
>>> print('--'.join(list_of_tokens))

Returns:

... ['私', 'は', '彼', 'を', '５', '日', '前', '、', 'つまり', 'この', '前', 'の', '金曜日', 'に', '駅', 'で', '見かけ', 'た']
... 私--は--彼--を--５--日--前--、--つまり--この--前--の--金曜日--に--駅--で--見かけ--た

Katakana Pronunciation:

>>> from jNlp.jTokenize import jReads
>>> print('--'.join(jReads(input_sentence)))

... ワタシ--ハ--カレ--ヲ--ゴ--ニチ--マエ--、--ツマリ--コノ--マエ--ノ--キンヨウビ--ニ--エキ--デ--ミカケ--タ

2.2 Cabocha `jCabocha.py`

Run Cabocha with the original EUC-JP or ISO-8859-1 configured encoding.

If Cabocha is configured as UTF-8, see http://nltk.googlecode.com/svn/trunk/doc/book-jp/ch12.html#cabocha

>>> from jNlp.jCabocha import cabocha
>>> print(cabocha(input_sentence))

Output:

 <sentence>
<chunk id="0" link="7" rel="D" score="-1.901231" head="0" func="1">
 <tok id="0" feature="名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ">私</tok>
 <tok id="1" feature="助詞,係助詞,*,*,*,*,は,ハ,ワ">は</tok>
</chunk>
<chunk id="1" link="2" rel="D" score="0.231898" head="2" func="3">
 <tok id="2" feature="名詞,代名詞,一般,*,*,*,彼,カレ,カレ">彼</tok>
 <tok id="3" feature="助詞,格助詞,一般,*,*,*,を,ヲ,ヲ">を</tok>
</chunk>
<chunk id="2" link="7" rel="D" score="-1.901231" head="6" func="6">
 <tok id="4" feature="名詞,数,*,*,*,*,５,ゴ,ゴ">５</tok>
 <tok id="5" feature="名詞,接尾,助数詞,*,*,*,日,ニチ,ニチ">日</tok>
 <tok id="6" feature="名詞,副詞可能,*,*,*,*,前,マエ,マエ">前</tok>
 <tok id="7" feature="記号,読点,*,*,*,*,、,、,、">、</tok>
</chunk>
<chunk id="3" link="7" rel="D" score="-1.901231" head="8" func="8">
 <tok id="8" feature="接続詞,*,*,*,*,*,つまり,ツマリ,ツマリ">つまり</tok>
</chunk>
<chunk id="4" link="5" rel="D" score="1.309036" head="10" func="11">
 <tok id="9" feature="連体詞,*,*,*,*,*,この,コノ,コノ">この</tok>
 <tok id="10" feature="名詞,副詞可能,*,*,*,*,前,マエ,マエ">前</tok>
 <tok id="11" feature="助詞,連体化,*,*,*,*,の,ノ,ノ">の</tok>
</chunk>
<chunk id="5" link="7" rel="D" score="-1.901231" head="12" func="13">
 <tok id="12" feature="名詞,副詞可能,*,*,*,*,金曜日,キンヨウビ,キンヨービ">金曜日</tok>
 <tok id="13" feature="助詞,格助詞,一般,*,*,*,に,ニ,ニ">に</tok>
</chunk>
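
The XML returned by Cabocha can be post-processed with the standard library. The following is a minimal sketch and not part of jNlp: `parse_chunks` is a hypothetical helper, and it assumes the output is a complete, well-formed `<sentence>` element whose `feature` attribute carries the part of speech in its first comma-separated field and the katakana reading in its eighth, as in the sample above.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed sample in the same shape as the output above.
SAMPLE_XML = """<sentence>
<chunk id="0" link="7" rel="D" score="-1.901231" head="0" func="1">
 <tok id="0" feature="名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ">私</tok>
 <tok id="1" feature="助詞,係助詞,*,*,*,*,は,ハ,ワ">は</tok>
</chunk>
</sentence>"""


def parse_chunks(xml_text):
    """Return one dict per <chunk>, each with its id, link and tokens."""
    root = ET.fromstring(xml_text)
    chunks = []
    for chunk in root.iter("chunk"):
        tokens = []
        for tok in chunk.iter("tok"):
            features = tok.get("feature", "").split(",")
            tokens.append({
                "surface": tok.text,
                "pos": features[0],
                # Assumption: the reading is the eighth feature field.
                "reading": features[7] if len(features) > 7 else None,
            })
        chunks.append({
            "id": int(chunk.get("id")),
            "link": int(chunk.get("link")),  # id of the chunk this one depends on
            "tokens": tokens,
        })
    return chunks


for chunk in parse_chunks(SAMPLE_XML):
    surfaces = "".join(t["surface"] for t in chunk["tokens"])
    print(chunk["id"], "->", chunk["link"], surfaces)
```

If the Cabocha output is truncated or lacks the closing `</sentence>` tag, `ET.fromstring` raises `ParseError`, so the caller may want to catch that.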

2.3 Kanji / Katakana / Hiragana to Tokenized Romaji `jConvert.py`

Uses data/katakanaChart.txt and parses the chart. See katakanaChart.

>>> from jNlp.jConvert import *
>>> input_sentence = '気象庁が２１日午前４時４８分、発表した天気概況によると、'
>>> print(' '.join(tokenizedRomaji(input_sentence)))
>>> print(tokenizedRomaji(input_sentence))

...kisyoutyou ga ni ichi nichi gozen yon ji yon hachi hun  hapyou si ta tenki gaikyou ni yoru to
...['kisyoutyou', 'ga', 'ni', 'ichi', 'nichi', 'gozen',...]

katakanaChart.txt

katakanaChartFile and hiraganaChartFile
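
The chart-based conversion can be pictured as a table lookup that prefers two-character kana combinations over single characters. The sketch below is only an illustration under that assumption: `KANA_TO_ROMAJI` is a tiny hand-written table, not the contents or format of katakanaChart.txt, and `katakana_to_romaji` is a hypothetical helper, not the jConvert implementation.

```python
# Illustrative subset only; the real chart file covers the full syllabary.
KANA_TO_ROMAJI = {
    "キョ": "kyo",  # two-character combination, must be tried first
    "ガ": "ga",
    "イ": "i",
    "ウ": "u",
    "キ": "ki",
    "テ": "te",
    "ン": "n",
}


def katakana_to_romaji(text, table=KANA_TO_ROMAJI):
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
        elif text[i] in table:
            out.append(table[text[i]])
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(katakana_to_romaji("テンキ"))      # tenki
print(katakana_to_romaji("ガイキョウ"))  # gaikyou
```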

2.4 Longest Common String Japanese `jProcessing.py`

On English Strings

>>> from jNlp.jProcessing import long_substr
>>> a = 'Once upon a time in Italy'
>>> b = 'There was a time in America'
>>> print(long_substr(a, b))

Output

...a time in

On Japanese Strings

>>> a = 'これでアナタも冷え知らず'
>>> b = 'これでア冷え知らずナタも'
>>> print(long_substr(a, b))

Output

...冷え知らず
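
For readers who want to see how a longest common substring can be computed, here is a minimal dynamic-programming sketch. It is not the `long_substr` implementation from jProcessing.py; `longest_common_substring` is a hypothetical stand-alone function that works on any two Python strings, including Japanese text, because Python 3 strings are sequences of Unicode characters.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring('これでアナタも冷え知らず', 'これでア冷え知らずナタも'))
```

The table costs O(len(a) * len(b)) time and keeps only two rows in memory. When several substrings tie for the longest length, this sketch returns the one that ends first in `a`.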

2.5 Similarity between two sentences `jProcessing.py`

Uses MinHash to estimate the overlap between two sentences: http://en.wikipedia.org/wiki/MinHash

English Strings:

>>> from jNlp.jProcessing import Similarities
>>> s = Similarities()
>>> a = 'There was'
>>> b = 'There is'
>>> print(s.minhash(a,b))
...0.444444444444

Japanese Strings:

>>> from jNlp.jProcessing import *
>>> a = 'これは何ですか？'
>>> b = 'これはわからないです'
>>> print(s.minhash(' '.join(jTokenize(a)), ' '.join(jTokenize(b))))
...0.210526315789
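
As background, MinHash estimates the Jaccard overlap of two sets by comparing the minimum hash value of each set under many hash functions. The sketch below illustrates the general technique only; it is not the `Similarities.minhash` code, it will not reproduce the scores shown above, and the function names, the seeded SHA-256 hashing, and the hand-tokenized input are all assumptions made for the example.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=128):
    """One minimum per seeded hash function; tokens must be non-empty."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def minhash_similarity(a, b, num_hashes=128):
    """Fraction of signature positions where both sets share the minimum."""
    sig_a = minhash_signature(a, num_hashes)
    sig_b = minhash_signature(b, num_hashes)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes


# Hand-tokenized sentences (an assumption, not jTokenize output).
a = set("これ は 何 です か ？".split())
b = set("これ は わから ない です".split())
print(jaccard(a, b))             # exact overlap: 3 shared tokens out of 8
print(minhash_similarity(a, b))  # estimate of the same quantity
```

The estimate approaches the exact Jaccard value as `num_hashes` grows; for short sentences the exact computation is cheap enough that the estimate mainly serves to illustrate the idea.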

2.6 Word by word definition `jTranslate.py`

Gives a raw definition of a sentence (alpha version)

>>> from jNlp.jTranslate import Translator
>>> edict_path = 'src/jNlp/data/edict'
>>> specialdict_path = 'src/jNlp/data/particles.json'
>>> translator = Translator(edict_path, specialdict_path)
>>> input_sentence = "田中さんが下町に行きました。"
>>> for el in translator.parse(input_sentence):
...    print(el)

...{1: '$PERSON', 'japanese': '田中'}
...{1: ['Mr.', 'Mrs.', 'Miss', 'Ms.', '-san'], 2: ['makes words more polite'], 'japanese': 'さん'}
...{1: ['low-lying part of a city'], 2: ['Shitamachi'], 'japanese': '下町'}
...{1: ['to', 'in', 'at', 'by'], 'japanese': 'に'}
...{1: ['go', 'move', 'head', 'be transported', 'reach'], 2: ['proceed', 'take place'], 3: ['pass through', 'come and go'], 4: ['walk'], 5: ['do'], 6: ['stream', 'flow'], 7: ['continue'], 8: ['have an orgasm', 'come', 'cum'], 9: ['trip', 'get high', 'have a drug-induced hallucination'], 'japanese': '行く'}
...{1: ['.'], 'japanese': '。'}

3 Edict Japanese Dictionary Search with Example sentences

3.1 Sample Output Demo

3.2 Edict dictionary and example sentences parser.

This package uses the EDICT and KANJIDIC dictionary files. These files are the property of the Electronic Dictionary Research and Development Group, and are used in conformance with the Group's licence.

- Edict parser by Paul Goins, see edict_search.py
- Edict example sentences, parsed by query, by Pulkit Kathuria, see edict_examples.py
- Edict examples pickle files are provided, but the latest example files can be downloaded from the links provided.

3.3 Charset

Two files

utf8 Charset example file if not using src/jNlp/data/edict_examples

To convert EUCJP/ISO-8859-1 to utf8
```
iconv -f EUCJP -t UTF-8 path/to/edict_examples > path/to/save_with_utf-8
```
ISO-8859-1 edict_dictionary file

Outputs example sentences for a query in Japanese only for ambiguous words.
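
The same conversion can be done from Python when `iconv` is not available. This is a minimal sketch using only the built-in `open`; `convert_to_utf8` is a hypothetical helper, and the default source encoding `euc_jp` is an assumption that should be changed to match the actual file.

```python
def convert_to_utf8(src_path, dst_path, src_encoding="euc_jp"):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Example call with placeholder paths:
# convert_to_utf8("path/to/edict_examples", "path/to/save_with_utf-8")
```

A file that is not actually in the given source encoding raises `UnicodeDecodeError` instead of being silently corrupted.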

3.4 Links

Latest Dictionary files can be downloaded here

3.5 `edict_search.py`

author:	Paul Goins License included linkToOriginal:

For all entries of sense definitions

>>> from jNlp.edict_search import *
>>> query = '認める'
>>> edict_path = 'src/jNlp/data/edict'
>>> kp = Parser(edict_path)
>>> for i, entry in enumerate(kp.search(query)):
...     print(entry.to_string())

...認める [みとめる]: (v1,vt) (1) to recognize; to recognise; to observe; to notice; (2) to deem; to judge; to assess; (3) to approve; to deem acceptable; to allow; (4) to admit; to accept; to confess (to a charge); (5) to watch steadily; to observe carefully; (6) to renown; to give renown to; to appreciate; to acknowledge; (P)
...非を認める [ひをみとめる]: (exp,v1) to admit a fault; to admit one is wrong
...人影を認める [ひとかげをみとめる]: (exp,v1) to make out someone's figure
...自他共に認める [じたともにみとめる]: (exp,v1) to be generally accepted; to be acknowledged by oneself and others
...誤りを認める [あやまりをみとめる]: (exp,v1) to admit to a mistake
...自他ともに認める [じたともにみとめる]: (exp,v1) to be generally accepted; to be acknowledged by oneself and others
...必要と認める [ひつようとみとめる]: (exp,v1) to judge as necessary

3.6 `edict_examples.py` -> Not supported yet !!

Note:	Only outputs the example sentences for ambiguous words (if the word has more than one sense)
author:	Pulkit Kathuria

>>> from jNlp.edict_examples import *
>>> query = u'認める'
>>> edict_path = 'src/jNlp/data/edict-yy-mm-dd'
>>> edict_examples_path = 'src/jNlp/data/edict_examples'
>>> search_with_example(edict_path, edict_examples_path, query)

Output

認める

Sense (1) to recognize;
  EX:01 我々は彼の才能を*認*めている。We appreciate his talent.

Sense (2) to observe;
  EX:01 ｘ線写真で異状が*認*められます。We have detected an abnormality on your x-ray.

Sense (3) to admit;
  EX:01 母は私の計画をよいと*認*めた。Mother approved my plan.
  EX:02 母は決して私の結婚を*認*めないだろう。Mother will never approve of my marriage.
  EX:03 父は決して私の結婚を*認*めないだろう。Father will never approve of my marriage.
  EX:04 彼は女性の喫煙をいいものだと*認*めない。He doesn't approve of women smoking.
  ...

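4 Sentiment Analysis Japanese Text -> Not supported yet !!

4.1 Wordnet files download links

4.2 How to Use

- Adnouns, nouns, verbs, etc. are all included
- No WSD (word sense disambiguation) module for Japanese sentences
- Uses the word's common sense for the polarity score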

>>> from jNlp.jSentiments import *
>>> jp_wn = '../../../../data/wnjpn-all.tab'
>>> en_swn = '../../../../data/SentiWordNet_3.0.0_20100908.txt'
>>> classifier = Sentiment()
>>> classifier.train(en_swn, jp_wn)
>>> text = u'監督、俳優、ストーリー、演出、全部最高！'
>>> print(classifier.baseline(text))
...Pos Score = 0.625 Neg Score = 0.125
...Text is Positive

4.3 Japanese Word Polarity Score

>>> from jNlp.jSentiments import *
>>> jp_wn = '_dicts/wnjpn-all.tab' #path to Japanese Word Net
>>> en_swn = '_dicts/SentiWordNet_3.0.0_20100908.txt' #Path to SentiWordNet
>>> classifier = Sentiment()
>>> sentiwordnet, jpwordnet  = classifier.train(en_swn, jp_wn)
>>> positive_score = sentiwordnet[jpwordnet[u'全部']][0]
>>> negative_score = sentiwordnet[jpwordnet[u'全部']][1]
>>> print('pos score = {0}, neg score = {1}'.format(positive_score, negative_score))
...pos score = 0.625, neg score = 0.0
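
The double lookup used above (Japanese word to WordNet synset, then synset to a positive/negative score pair) can be sketched without the real data files. Everything in the example below is an assumption for illustration: the two toy dictionaries, their synset ids and scores, and the helper names `polarity` and `classify` are invented and are not the jSentiments implementation.

```python
# Toy stand-ins for the structures returned by classifier.train().
JP_TO_SYNSET = {"全部": "syn-a", "最高": "syn-b"}           # word -> synset id
SYNSET_SCORES = {"syn-a": (0.5, 0.0), "syn-b": (0.25, 0.125)}  # id -> (pos, neg)


def polarity(tokens, jp_to_synset, synset_scores):
    """Sum positive and negative scores over the tokens that have an entry."""
    pos = neg = 0.0
    for token in tokens:
        synset = jp_to_synset.get(token)
        if synset is None:
            continue  # word not covered by the lexicon
        p, n = synset_scores.get(synset, (0.0, 0.0))
        pos += p
        neg += n
    return pos, neg


def classify(pos, neg):
    """Label the text by whichever total is larger."""
    if pos > neg:
        return "Positive"
    if neg > pos:
        return "Negative"
    return "Neutral"


pos, neg = polarity(["全部", "最高", "！"], JP_TO_SYNSET, SYNSET_SCORES)
print("Pos Score = {0} Neg Score = {1}".format(pos, neg))
print("Text is " + classify(pos, neg))
```

Because each word is mapped to a single synset, no word sense disambiguation takes place, which matches the baseline behaviour described in 4.2.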

5 Contacts

Original Author: pulkit[at]jaist.ac.jp [change at with @]
