I used the following Python code:

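import regex  # third-party module (pip install regex); \X matches a grapheme cluster

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    # print each code point of the matched cluster with its escape sequence
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')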

The results show ल्ली split into two Hindi characters:

ल b'\\u0932'
् b'\\u094d'
--------
ल b'\\u0932'
ी b'\\u0940'
--------

This is wrong: ल्ली is actually a single Hindi character. How can I split a string into Hindi characters (such as ल्ली), regardless of how many Unicode code points each one is composed of?

In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.

1 Answer

I am not quite sure if this is correct, being Finnish and not well versed in Hindi, but this merges each character with any subsequent Unicode Mark characters (general category starting with "M"):

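import unicodedata

def merge_compose(s: str):
    """Yield lists of code points: a base character plus its trailing marks."""
    current = []
    for c in s:
        # A non-mark character starts a new group; marks (Mn, Mc, Me) stay attached.
        if current and not unicodedata.category(c).startswith("M"):
            yield current
            current = []
        current.append(c)
    # Emit the final group, if any.
    if current:
        yield current

for group in merge_compose("कृपयाल्ली"):
    print(group, len(group), "->", "".join(group))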

Output is:

['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
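
Note that this output still separates ल् from ली, whereas the question asks for ल्ली as a single unit. The following is only a minimal sketch of one possible extension, not a full Devanagari segmenter. It assumes that a letter directly following a virama (U+094D) belongs to the same cluster as the preceding consonant. The names `split_aksharas` and `VIRAMA` are hypothetical, and cases such as ZWJ/ZWNJ or other scripts are not handled.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (assumption: the only conjunct-forming sign handled)

def split_aksharas(s: str):
    """Yield clusters: a base character, its marks, and letters joined via virama."""
    current = ""
    for c in s:
        category = unicodedata.category(c)
        is_mark = category.startswith("M")
        # Assumption: a letter right after a virama continues the conjunct.
        joined = current.endswith(VIRAMA) and category.startswith("L")
        if current and not is_mark and not joined:
            yield current
            current = ""
        current += c
    if current:
        yield current

print(list(split_aksharas("कृपयाल्ली")))
# → ['कृ', 'प', 'या', 'ल्ली']
```

For the example string this gives the four pieces requested in the question; whether the virama rule is right for every Hindi word would need checking by someone who knows the script well.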
