I'm trying to create a word count for a book (.txt file) and I'm trying to split each line into its separate words using this:

temp = re.split('[; |, |\*|\n| |\|:|.|’|"|&|#|$|(|)|]|//|'']', line)

However, this isn't working because every time I run the program, I have to add another delimiter to the list. This time I have to add '-' and '%'. I remember doing something similar in Java where I could specify a 'range' of delimiters and when I tried the same thing here, it didn't seem to work.

Is there any better way to do this and make sure I just get the word and nothing else?

1 Answer

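I think you're looking for \W, which matches any non-word character, i.e. anything that is not a letter, digit, or underscore.

For example (the r prefix makes this a raw string, so Python passes \W to the regex engine unchanged):

temp = re.split(r'\W+', line)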
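
Here is a rough sketch of how that split might feed a word count. The sample line and the names `words` and `counts` are made up for illustration; in the real program `line` would come from the book file. Note that `re.split` can leave an empty string at the start or end when the line begins or ends with punctuation, so those are filtered out.

```python
import re
from collections import Counter

# Hypothetical sample line, standing in for one line of the book.
line = "Hello, world -- 100% of the words; hello again!"

# Split on runs of non-word characters and drop the empty strings
# that re.split leaves when the line starts or ends with a delimiter.
words = [w for w in re.split(r'\W+', line) if w]

# Count case-insensitively so "Hello" and "hello" are the same word.
counts = Counter(w.lower() for w in words)

print(words)
print(counts.most_common(1))
```

One caveat: \W also treats apostrophes as delimiters, so a word like "don't" would come out as "don" and "t".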

By the way, characters inside a regex character set (the square brackets) are mostly literal, so | does not act as "or" there and repeating a character has no effect. Yours boils down to this:

[; |,*\n:.’"&#$()]/']
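
To illustrate the point about character sets, here is a small sketch (the input strings and variable names are invented for the example). It shows that | inside [...] is just another character to split on, and that a set of delimiters can be listed directly without any | separators.

```python
import re

# Inside [...], '|' is an ordinary character, not alternation:
# this set matches 'a', '|', or 'b'.
pipe_parts = re.split(r'[a|b]', 'x|yaz')
print(pipe_parts)

# Delimiters are simply listed in the set; '-' is escaped so it is
# not read as a range, and '+' treats a run of delimiters as one.
delim_parts = re.split(r'[,;:\-%]+', 'one,two;three-four%five')
print(delim_parts)
```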
