Software compiled from source does not work after first run
Hi, I am trying to use a piece of Japanese NLP software called MeCab through its Python bindings; it is distributed as source only. ( )
It has been giving me trouble since the first day. I have no problem using it on a Windows 7 machine, where it was installed from an exe. However, the Ubuntu version I compiled from source stops working from time to time.
I also asked on Stack Overflow, but no one has a clue.
I have just made some findings, and I want to ask if anyone here knows how to identify the problem.
The software works just fine right after it is installed, but only once. After that it stops working and throws this error:
Traceback (most recent call last):
  File "japan_text_analysis.py", line 304, in <module>
    result = Jp.main()
  File "japan_text_analysis.py", line 49, in main
    tagged_text_tp = self.parse_text(text)
  File "japan_text_analysis.py", line 33, in parse_text
    word = parsed.surface
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 1: invalid start byte

I could resolve this problem by executing the following in its source directory (where I got this command: ):
nkf -w --overwrite *.csv
nkf -w --overwrite *.def

and install it again:

./configure --with-charset=utf8
make
sudo make install

Please tell me where I can look for its compiled code, or wherever it installed itself on the machine? I know very little about Linux software compilation.
I am using Ubuntu 16.04 LTS 64 bit. Thanks!
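As a side note on finding where a source build installs itself: a default `make install` usually places the binary under /usr/local/bin, while the Ubuntu package uses /usr/bin. A minimal Python sketch for checking which `mecab` binary is on your PATH (only the standard library is used; whether anything is found depends on your machine):

```python
import shutil

# Look up an installed binary on PATH. A default `make install`
# typically puts mecab in /usr/local/bin; the Ubuntu package
# puts it in /usr/bin.
path = shutil.which("mecab")
print(path if path else "mecab is not on PATH")
```
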
2 Answers
You should not compile the package by yourself. Remove it with
cd mecab-0.996
sudo make uninstall

and then proceed with the deb package mecab from the repository. It has exactly the same version 0.996 that you are trying so hard to compile ...
xenial (16.04LTS) (misc): Japanese morphological analysis system [universe]
0.996-1.2ubuntu1: amd64 arm64 armhf i386 powerpc ppc64el s390x
Nkf application is packaged as nkf too. So the solution is simple:
sudo apt-get install mecab nkf

Note: you may be interested in other mecab-related packages (output from apt-cache search mecab):
darts - C++ Template Library for implementation of Double-Array
groonga-tokenizer-mecab - MeCab tokenizer for Groonga
libmecab-dev - Header files of Mecab
libmecab-java - mecab binding for Java - java classes
libmecab-jni - mecab binding for Java - native interface
libmecab-perl - mecab binding for Perl
libmecab2 - Libraries of Mecab
libtext-mecab-perl - alternate MeCab Interface for Perl
mecab - Japanese morphological analysis system
mecab-ipadic - IPA dictionary compiled for Mecab
mecab-ipadic-utf8 - IPA dictionary encoded in UTF-8 compiled for Mecab
mecab-jumandic - Juman dictionary compiled for Mecab
mecab-jumandic-utf8 - Juman dictionary encoded in UTF-8 compiled for Mecab
mecab-naist-jdic - free Japanese Dictionaries for mecab (replacement of mecab-ipadic)
mecab-naist-jdic-eucjp - free Japanese Dictionaries for mecab (replacement of mecab-ipadic) in EUC-JP
mecab-utils - Support programs of Mecab
open-jtalk - Japanese text-to-speech system
open-jtalk-mecab-naist-jdic - NAIST Japanese Dictionary for Open JTalk
python-mecab - mecab binding for Python
ruby-mecab - mecab binding for Ruby language
unidic-mecab - free Japanese Dictionaries for mecab

I did not find a solution directly answering my own question, but I found a way to work around it.
I came to the conclusion that the problem is related to the way MeCab handles encodings: it defaults to EUC-JP rather than UTF-8. There is also a bug that causes a UnicodeDecodeError, and a solution for it exists on the internet.
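The error in the question is consistent with this: bytes written in one encoding cannot be decoded as another. A minimal illustration (the sample text is arbitrary; the EUC-JP/UTF-8 pairing mirrors MeCab's default dictionary encoding):

```python
# Bytes produced under EUC-JP are not valid UTF-8, which is
# exactly the kind of mismatch that raises UnicodeDecodeError.
eucjp_bytes = "すもも".encode("euc-jp")
try:
    eucjp_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte ... invalid start byte
```
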
To sum it up:
- One has to install mecab and the IPA dictionary with the utf8 configuration. If you screw it up, run "sudo make uninstall" and start over.
- There is a bug which causes a UnicodeDecodeError; there is a way to solve it ( ):
import MeCab
mecab = MeCab.Tagger()
mecab.parse("") # This line is solution
node = mecab.parseToNode("すもももももももものうち")
while node:
    print(node.surface)
    node = node.next

Just add the mecab.parse("") line before parsing anything real, and MeCab will work just fine.