
I used the following Python code:

import regex

for m in regex.findall(r"\X", 'ल्लील्ली', regex.UNICODE):
    for i in m:
        print(i, i.encode('unicode-escape'))
    print('--------')

The results show ल्ली being split into two Hindi characters (only the output for the first ल्ली is shown):

ल b'\\u0932'
् b'\\u094d'
--------
ल b'\\u0932'
ी b'\\u0940'
--------

This is wrong: ल्ली is actually one Hindi character. How can I split a string into Hindi characters (such as ल्ली) and find out how many Unicode code points each one is composed of?

In short, I want to split 'कृपयाल्ली' into 'कृ', 'प', 'या', 'ल्ली'.

1 Answer


I am not quite sure this is correct, being Finnish and not well versed in Hindi, but the following merges each character with any subsequent Unicode Mark characters (those whose general category starts with "M"):

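import unicodedata

def merge_compose(s: str):
    # Group each character with the Unicode Mark characters that follow it.
    current = []
    for c in s:
        # A non-mark character starts a new group, so emit the finished one first.
        if current and not unicodedata.category(c).startswith("M"):
            yield current
            current = []
        current.append(c)
    # Emit the last group, if any.
    if current:
        yield current

for group in merge_compose("कृपयाल्ली"):
    print(group, len(group), "->", "".join(group))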

Output is:

['क', 'ृ'] 2 -> कृ
['प'] 1 -> प
['य', 'ा'] 2 -> या
['ल', '्'] 2 -> ल्
['ल', 'ी'] 2 -> ली
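Note that this output still splits ल्ली into ल् and ली, because the second ल is a letter rather than a mark, so it starts a new group. One possible way to get the split asked for in the question is to also treat the Devanagari virama (U+094D) as joining the letter that follows it. The sketch below is only that, a sketch: the names VIRAMA and split_aksharas are made up for illustration, and it assumes plain Devanagari text without ZWJ/ZWNJ or other special cases, so it may not be linguistically complete.

```python
import unicodedata

# DEVANAGARI SIGN VIRAMA (halant); hypothetical constant name for this sketch.
VIRAMA = "\u094d"

def split_aksharas(s: str):
    """Yield clusters: a base character, its marks, and any letters joined by a virama.

    Assumption: a virama followed by a letter forms a conjunct; ZWJ/ZWNJ are not handled.
    """
    current = ""
    for c in s:
        category = unicodedata.category(c)
        is_mark = category.startswith("M")
        # A letter directly after a virama stays in the same cluster.
        joins_previous = current.endswith(VIRAMA) and category.startswith("L")
        if current and not is_mark and not joins_previous:
            yield current
            current = ""
        current += c
    if current:
        yield current

for cluster in split_aksharas("कृपयाल्ली"):
    # len(cluster) is the number of Unicode code points in the cluster.
    print(cluster, len(cluster))
```

For the string in the question this yields कृ, प, या and ल्ली, with 2, 1, 2 and 4 code points respectively.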
