Python print non-english (telugu) text in readable format

Question

asked Feb 6, 2021 in Python by laddulakshana (16.4k points)

I'm processing Telugu text to analyze a few text tokens.

>>> sent = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')
>>> text = sent
>>> text = nltk.word_tokenize(text)
>>> result = nltk.pos_tag(text)
>>> for val in result:
...     print list(val)[0], list(val)[1]
...
నా JJ
పేరు NNP
కరీం NNP
ఉంది NNP
. .
నేను VB
భారత JJ
ఆహార NNP
ప్రేమ NNP

This way I can see the output in Telugu.

For the same text, when I tried the following, the output is an escaped unicode string. How can I print the token in Telugu?

>>> s = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')
>>> res = s.split(' ')
>>> res[0]
u'\u0c28\u0c3e'
>>> type(res[0])
<type 'unicode'>
>>> res[0].encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> res[0].encode('utf-8')
'\xe0\xb0\xa8\xe0\xb0\xbe'
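
For reference, the escaped form in the transcript above appears to be only how the prompt echoes a unicode value; the string itself is unchanged. A minimal sketch of that relationship, written so it also runs on Python 3 (the helper name `escaped` is made up for illustration and is not part of the original session):

```python
# coding: utf-8
# The first token from the session, spelled with escapes so the file stays ASCII.
word = u"\u0c28\u0c3e"

# Text -> bytes -> text: encoding to UTF-8 and decoding again returns the same string.
utf8_bytes = word.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")


def escaped(text):
    """Return an ASCII-only escaped form of text (hypothetical helper)."""
    return text.encode("unicode_escape").decode("ascii")


if __name__ == "__main__":
    print(escaped(word))     # the escaped spelling of the token
    print(len(word))         # number of code points
    print(len(utf8_bytes))   # number of UTF-8 bytes
```

The ASCII codec fails on this token because both code points are above 127, while UTF-8 represents each of them with three bytes.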

print res[0] displays it correctly. However, when I put this code in a .py script and run it, I get:

ubuntu@DELL-PC:~/Documents/codes$ python test.py
File "test.py", line 1
SyntaxError: Non-ASCII character '\xe0' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

test.py contains:

s = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')
a = s.split()
for i in a:
    print i

1 Answer

hari_sh · Answer 1 · 2021-02-06T11:18:29+0000

Since I don't have Telugu available in my console, the easiest answer was to run your Python session in Jupyter. That way you get rid of a lot of issues around the terminal character set, and so on.

Then I could simply print the results:

s = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ.".decode('utf-8')
a = s.split()
for i in a:
    print(i)
నా
పేరు
కరీం
ఉంది.
నేను
భారత
ఆహార
ప్రేమ.

Note that when putting this into a script file, you need to start the file with the magic lines:

#!/usr/bin/env python
#coding:utf-8
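
As a rough illustration of where those lines go, here is a minimal sketch of a complete script. It assumes the file is saved as UTF-8; the function name `split_tokens` is made up for this example, and a `u"..."` literal is used in place of `.decode('utf-8')` so the same file also runs on Python 3:

```python
#!/usr/bin/env python
# coding: utf-8
# The coding line must be on line 1 or 2 of the file (PEP 263). Without it,
# Python 2 rejects the non-ASCII literal below with a SyntaxError.


def split_tokens(text):
    """Split on whitespace; punctuation stays attached to its word."""
    return text.split()


# The u prefix makes this a unicode literal on Python 2 (no .decode needed)
# and is accepted unchanged on Python 3.3 and later.
sentence = u"నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ."
tokens = split_tokens(sentence)

if __name__ == "__main__":
    for token in tokens:
        print(token)
```

Whether the glyphs render still depends on the terminal: the declaration only tells Python how to read the source file.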

Having deduced that the OP was running Python 2, I tested and found that, in a terminal that supports UTF-8, the following gives results that look correct when run from a script file:

#!/usr/bin/env python
# coding: utf-8
from __future__ import print_function
import nltk
s = "నా పేరు కరీం ఉంది. నేను భారత ఆహార ప్రేమ." #.decode('utf-8')
a = s.split()
for i in a:
    print(i)
text = nltk.word_tokenize(s.decode('utf-8'))
result = nltk.pos_tag(text)
for val in result:
    print(list(val)[0].encode('utf-8'), list(val)[1])
$ python Untitled2.py
నా
పేరు
కరీం
ఉంది.
నేను
భారత
ఆహార
ప్రేమ.
నా JJ
పేరు NNP
కరీం NNP
ఉంది NNP
. .
నేను VB
భారత JJ
ఆహార NNP
ప్రేమ NNP
. .
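
The result above depends on the terminal supporting UTF-8. One way to check that before printing is sketched below; the helper names `stdout_encoding` and `can_encode` are hypothetical, and this is only an approximation, since an encoding that Python reports as usable does not guarantee the terminal font can draw Telugu:

```python
# coding: utf-8
import sys


def stdout_encoding():
    """Return the encoding Python reports for stdout, or 'unknown' if unset."""
    return getattr(sys.stdout, "encoding", None) or "unknown"


def can_encode(text, encoding):
    """Return True if text can be encoded with the named encoding."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encoding names Python does not recognise.
        return False
    return True


sample = u"\u0c28\u0c3e"  # the first Telugu token from the question

if __name__ == "__main__":
    enc = stdout_encoding()
    print("stdout encoding: %s" % enc)
    if can_encode(sample, enc):
        print(sample)
    else:
        # Fall back to an ASCII-only escaped form rather than raising.
        print(sample.encode("unicode_escape").decode("ascii"))
```

If the reported encoding is ASCII, the check fails for the same reason as the `UnicodeEncodeError` in the question.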
