0 votes
2 views
in Data Science by (18.4k points)

How can I edit the normalize function so that it also removes punctuation and end-of-line characters?

Here is my current code:

    filename="bible.Sentences.15.txt"

    def getData(filename):
      with open(filename,'r') as f:
        #converting to list where each element is an individual line of text file
        lines=[line.rstrip() for line in f]
        return lines

    filename="bibleSentences.txt"
    getData(filename)

    def normalize(filename):
        #converting all letters to lowercase
        lowercase_lines=[x.lower() for x in getData(filename)]
        print(lowercase_lines)
        return lowercase_lines

    normalize(filename)

1 Answer

0 votes
by (36.8k points)

Here is the solution code. Note that normalize now takes the list of lines as its argument rather than the filename:

    import re

    ...

    def normalize(data):
        # convert all letters to lowercase
        lowercase_lines = [x.lower() for x in data]
        # remove every character that is not a word character, a space, or a tab
        # (this drops punctuation and end-of-line characters)
        stripped_lines = [re.sub(r"[^\w \t]+", "", x) for x in lowercase_lines]
        print(stripped_lines)
        return stripped_lines
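To see what the regular expression does, here is a minimal sketch that runs the same substitution on two made-up lines. The `sample_lines` list is a hypothetical stand-in for the contents of the text file, not data from the question, and the `print` call is moved outside the function so the result can be inspected directly.

```python
import re

def normalize(data):
    """Lowercase each line and keep only word characters, spaces, and tabs."""
    lowercase_lines = [x.lower() for x in data]
    # [^\w \t]+ matches runs of anything that is NOT a letter, digit,
    # underscore, space, or tab, so punctuation and \r / \n are removed.
    return [re.sub(r"[^\w \t]+", "", x) for x in lowercase_lines]

# Hypothetical sample input (assumption: lines may still carry line endings).
sample_lines = [
    "In the beginning, God created the heaven and the earth.\n",
    "And God said: Let there be light!\r\n",
]

normalized = normalize(sample_lines)
print(normalized)
```

With the functions from the question, the call would presumably be `normalize(getData(filename))`. Since `getData` already applies `line.rstrip()`, trailing end-of-line characters are gone before `normalize` runs, and the regex acts as a second safeguard. Keep in mind that `\w` also matches digits and underscores, so those are kept.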
