I am working on a CSV file that looks like this.

Name1, 123

Name2, 123

Name1, 456

Name3, 345

Name2, 456

Name1, 123

Name3, 123

Name4, 789

Name2, 789

Name5, 136

Code:

import pyspark

import numpy as np

import pandas as pd

import csv

with open('filehash.csv') as filehash:

    csv_reader=csv.reader(filehash, delimiter=",")

for filehash in csv_reader:

    print (filehash)

    csv_reader.duplicated()

Between csv_reader and .duplicated() I need to add some attribute (a column name). Since my CSV has no header row, I may not get my desired result. I also have no clue how to get the int values after the comma.

The result I am expecting is:

True, True, True, False, True, True, True, True, True, False

1 Answer


Read the CSV file with pandas' read_csv, then call duplicated on the second column to find which values are duplicates. Passing keep=False marks every occurrence of a repeated value as True, which matches the output you expect:

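import pandas as pd

# The file has no header row, so header=None makes pandas number the columns 0 and 1.
df = pd.read_csv('filehash.csv', header=None)

# keep=False flags every occurrence of a repeated value, not only the later ones.
duplicates = df[df.columns[1]].duplicated(keep=False).to_list()

print(duplicates)
# [True, True, True, False, True, True, True, True, True, False]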
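
If you would rather stay with the csv module from your original attempt, the same flags can be computed without pandas. The following is only a rough sketch: the helper name duplicate_flags is made up for illustration, and io.StringIO with the rows from the question stands in for open('filehash.csv'). It counts how often each second-column value occurs and flags the rows whose value appears more than once.

```python
import csv
import io
from collections import Counter

# Rows copied from the question; io.StringIO stands in for open('filehash.csv').
SAMPLE = """Name1, 123
Name2, 123
Name1, 456
Name3, 345
Name2, 456
Name1, 123
Name3, 123
Name4, 789
Name2, 789
Name5, 136
"""


def duplicate_flags(lines):
    """Return one bool per row: True if the row's second field occurs more than once."""
    # skipinitialspace=True drops the blank that follows each comma;
    # "if row" skips any empty lines in the input.
    rows = [row for row in csv.reader(lines, skipinitialspace=True) if row]
    values = [int(row[1]) for row in rows]
    counts = Counter(values)
    return [counts[value] > 1 for value in values]


flags = duplicate_flags(io.StringIO(SAMPLE))
print(flags)
# [True, True, True, False, True, True, True, True, True, False]
```

With a real file, you would pass the open file object instead, keeping all the reading inside the with block so the file is still open when csv.reader iterates over it.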
