I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English; some are in other languages, including ones written in non-Latin scripts. For example:

▓ railway??

→ Cats and dogs

I'm on

Apples ⚛ 

✅ Vi sign

♛ I'm the king ♛ 

Corée ♦ du Nord ☁ (French)

 gjør at både ◄╗ (Norwegian)

Star me ★

Star ⭐ once more

早上好 ♛ (Chinese)

Καλημέρα ✂ (Greek)

another ✓ sign ✓

добрай раніцы ✪ (Belarusian)

◄ शुभ प्रभात ◄ (Hindi)

✪ ✰ ❈ ❧ Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.❉

...and many more of these.

I would like to get rid of all these signs/images and to keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser fails to remove most of the signs. So far, ♦ is the only one I have found that it removes. Other signs such as ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?

1 Answer

I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls the "general category" of each character. There are several letter and punctuation categories.

You can use Character.getType to find the general category of a given character. You should probably retain those characters that fall in these general categories:

COMBINING_SPACING_MARK

CONNECTOR_PUNCTUATION

CURRENCY_SYMBOL

DASH_PUNCTUATION

DECIMAL_DIGIT_NUMBER

ENCLOSING_MARK

END_PUNCTUATION

FINAL_QUOTE_PUNCTUATION

FORMAT

INITIAL_QUOTE_PUNCTUATION

LETTER_NUMBER

LINE_SEPARATOR

LOWERCASE_LETTER

MATH_SYMBOL

MODIFIER_LETTER

MODIFIER_SYMBOL

NON_SPACING_MARK

OTHER_LETTER

OTHER_NUMBER

OTHER_PUNCTUATION

PARAGRAPH_SEPARATOR

SPACE_SEPARATOR

START_PUNCTUATION

TITLECASE_LETTER

UPPERCASE_LETTER

(All of the characters you listed as specifically wanting to remove have general category OTHER_SYMBOL, which I did not include in the above category whitelist.)
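For anyone who wants a starting point, the category check above could be sketched roughly as follows. This is a minimal sketch, not a tested production filter: the class name SymbolFilter and the method names isKept and keepLettersAndPunctuation are made up for illustration, and the kept categories are simply the list above. It walks the string by code point rather than by char so that characters outside the Basic Multilingual Plane (most emoji) are classified as a whole.

```java
public class SymbolFilter {

    // Hypothetical helper: true if the code point's general category
    // is one of the categories listed above.
    static boolean isKept(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.COMBINING_SPACING_MARK:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.CURRENCY_SYMBOL:
            case Character.DASH_PUNCTUATION:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.ENCLOSING_MARK:
            case Character.END_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.FORMAT:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.LETTER_NUMBER:
            case Character.LINE_SEPARATOR:
            case Character.LOWERCASE_LETTER:
            case Character.MATH_SYMBOL:
            case Character.MODIFIER_LETTER:
            case Character.MODIFIER_SYMBOL:
            case Character.NON_SPACING_MARK:
            case Character.OTHER_LETTER:
            case Character.OTHER_NUMBER:
            case Character.OTHER_PUNCTUATION:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.SPACE_SEPARATOR:
            case Character.START_PUNCTUATION:
            case Character.TITLECASE_LETTER:
            case Character.UPPERCASE_LETTER:
                return true;
            default:
                // OTHER_SYMBOL, CONTROL, PRIVATE_USE, SURROGATE, UNASSIGNED
                return false;
        }
    }

    // Hypothetical helper: copies only the kept code points to the result.
    static String keepLettersAndPunctuation(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(SymbolFilter::isKept)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2605 is BLACK STAR, one of the signs from the question.
        System.out.println(keepLettersAndPunctuation("Star me \u2605"));
    }
}
```

Two things to watch with this list as written. CONTROL is not on it, so tabs and line breaks are dropped along with the symbols; add that category if you need them. MATH_SYMBOL is on it, so the → in "→ Cats and dogs" survives; remove that case if arrows and similar signs should go too. The spaces that surrounded a removed sign are left in place, so you may want to collapse repeated spaces afterwards.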
