Having a bit of trouble with some code I'm working through. Basically, I have transcripts (txt files) for a few Japanese anime, from which I want to remove everything but the spoken lines (Japanese sentences) in order to do some NLP experiments.

I've managed to accomplish a good bit of cleaning, but where I'm stuck is with parentheses. A majority of the elements in my list start with a character's name inside parentheses, e.g. (Armin). I want to remove these, but none of the regex code I've found online seems to work.

Here's a snippet of the list I'm working with:

['（アルミン）その日', '人類は思い出した', '（アルミン）奴らに', '支配されていた恐怖を', '（アルミン）鳥籠の中に', 'とらわれていた―', '屈辱を', '（キース）総員', '戦闘用意!', '目標は1体だ', '必ず仕留め―', 'ここを', '我々', '人類', '最初の壁外拠点とする!', '（エルヴィン）あっ…', '目標接近!', '（キース）訓練どおり5つに分かれろ!', '囮は我々が引き受ける!', '全攻撃班', '立体機動に移れ!', '（エルヴィン）全方向から', '同時に叩くぞ!', '（モーゼス）やあーっ!']

I've tried the following code (it's as close as I could get):

no_parentheses = []

for line in mylist:
    if '(' in line:
        line = re.sub('\(.*\)','', line)

But when I view the results, those pesky parentheses remain in my list mockingly.

Could anyone offer suggestions to resolve this issue?

Thanks again!

1 Answer

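The parentheses in your text are full-width (（ and ）, U+FF08 and U+FF09), not the ASCII ( and ) that your pattern escapes, so the pattern never matches. Your regex should use the full-width characters as well; they are not regex metacharacters, so they need no backslash: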

line = re.sub('（.*）', '', line)
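
Note also that the loop in the question only rebinds line and never appends anything to no_parentheses, so the list would stay empty even with a matching pattern. Here is a minimal sketch of the whole step, assuming the speaker name always comes at the start of an element. The names SPEAKER_TAG and strip_speaker_tags are made up for illustration, and the sample list is a shortened version of yours with one ASCII example added. The pattern accepts both full-width and ASCII parentheses and uses a non-greedy .*? so it stops at the first closing parenthesis.

```python
import re

# Leading speaker tag in full-width （…） (U+FF08/U+FF09) or ASCII (…) parentheses.
# .*? is non-greedy, so the match ends at the first closing parenthesis.
SPEAKER_TAG = re.compile(r'^\s*[（(].*?[）)]')


def strip_speaker_tags(lines):
    """Return a new list with any leading parenthesized speaker name removed."""
    return [SPEAKER_TAG.sub('', line) for line in lines]


# Shortened sample data; the last element uses ASCII parentheses.
mylist = ['（アルミン）その日', '人類は思い出した', '（キース）総員', '(Armin)hello']

no_parentheses = strip_speaker_tags(mylist)
print(no_parentheses)
# ['その日', '人類は思い出した', '総員', 'hello']
```

Elements without a leading tag pass through unchanged, so the if check from the question is not needed.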
