in Azure by (5.8k points)

I have a number of large CSV (tab-delimited) files stored as Azure blobs, and I want to create a pandas DataFrame from them. I can do this locally as follows:

from azure.storage.blob import BlobService
import pandas as pd
import os.path

STORAGEACCOUNTNAME = 'account_name'
STORAGEACCOUNTKEY = 'key'
LOCALFILENAME = 'path/to.csv'
CONTAINERNAME = 'container_name'
BLOBNAME = 'bloby_data/000000_0'

blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# Only get a local copy if we haven't already got it
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)

df_customer = pd.read_csv(LOCALFILENAME, sep='\t')

However, when running the notebook in Azure ML Notebooks, I can't save a local copy and then read it as CSV, so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).

I can get to the desired end result (a pandas DataFrame for the blob CSV data) if I first create an Azure ML workspace, read the datasets into it, and then use https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas DataFrame, but I'd prefer to read straight from the blob storage location.

1 Answer

by (9.6k points)

This code reads the blob into memory as text and parses it with pandas, so no local file is needed:

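from io import StringIO

# Fetch the blob contents as text instead of saving them to a local file
blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)
# The data is tab-delimited, so pass sep='\t' as in the question
df = pd.read_csv(StringIO(blobstring), sep='\t')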

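If that raises an error because get_blob_to_text returns a blob object rather than a plain string (this seems to depend on your SDK version), take the text from its .content attribute:

blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content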
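
If you don't want to care which of the two variants applies, they can be folded into one small helper. This is only a rough sketch: blob_result_to_text and blob_result_to_dataframe are names made up for this answer (they are not part of the Azure SDK or pandas), and it assumes that whatever get_blob_to_text returns is either the text itself or an object carrying it in .content.

```python
from io import StringIO


def blob_result_to_text(result, encoding="utf-8"):
    """Return blob text whether `result` is a str, bytes, or an object with `.content`.

    Hypothetical helper for illustration; not part of any SDK.
    """
    # If the SDK wrapped the payload in an object, unwrap it; otherwise use it as-is.
    payload = getattr(result, "content", result)
    # Decode raw bytes so that StringIO receives text (encoding is an assumption).
    if isinstance(payload, bytes):
        payload = payload.decode(encoding)
    return payload


def blob_result_to_dataframe(result, sep="\t", **read_csv_kwargs):
    """Parse delimited blob text into a pandas DataFrame without touching disk."""
    import pandas as pd  # imported lazily so the text helper works without pandas

    text = blob_result_to_text(result)
    # read_csv accepts any file-like object, so an in-memory buffer is enough.
    return pd.read_csv(StringIO(text), sep=sep, **read_csv_kwargs)
```

With the names from the question, usage would look like df_customer = blob_result_to_dataframe(blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)). The default sep='\t' matches the tab-delimited files in the question; pass a different sep for comma-separated data.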

...