
I have a number of large CSV (tab-delimited) files stored as Azure blobs, and I want to create pandas data frames from them. I can do this locally as follows:

from azure.storage.blob import BlobService
import pandas as pd
import os.path

STORAGEACCOUNTNAME = 'account_name'   # placeholder
STORAGEACCOUNTKEY = 'account_key'     # placeholder
LOCALFILENAME = 'path/to.csv'
CONTAINERNAME = 'container_name'
BLOBNAME = 'bloby_data/000000_0'

blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

# Only download a local copy if we haven't already got it
if not os.path.isfile(LOCALFILENAME):
    blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)

df_customer = pd.read_csv(LOCALFILENAME, sep='\t')

However, when running the notebook on Azure ML Notebooks, I can't save a local copy and then read from the CSV, so I'd like to do the conversion directly (something like pd.read_azure_blob(blob_csv) or just pd.read_csv(blob_csv) would be ideal).

I can get to the desired end result (a pandas data frame for the blob CSV data) if I first create an Azure ML workspace, read the datasets into it, and then use https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python to access the dataset as a pandas data frame, but I'd prefer to read straight from the blob storage location.
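For reference, a hedged sketch of reading a blob straight into pandas without a local file, assuming the newer azure-storage-blob (v12) package with its BlobServiceClient API rather than the legacy BlobService used above; the connection string, container, and blob names are placeholders:

```python
from io import BytesIO

import pandas as pd


def blob_bytes_to_df(raw: bytes) -> pd.DataFrame:
    # Parse raw tab-delimited bytes straight into a DataFrame,
    # with no intermediate file on disk.
    return pd.read_csv(BytesIO(raw), sep="\t")


def load_blob_df(connection_string: str, container: str, blob_name: str) -> pd.DataFrame:
    # Assumes the azure-storage-blob v12 SDK is installed;
    # imported here so the parsing helper above works without it.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container, blob_name)
    # download_blob() returns a stream downloader; readall() gives bytes
    return blob_bytes_to_df(blob.download_blob().readall())
```

The pure parsing step is separated into `blob_bytes_to_df` so it can be exercised without Azure credentials.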

1 Answer


Here is some code that will help you (note sep='\t', since the data is tab-delimited):

from io import StringIO
import pandas as pd

blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)
df = pd.read_csv(StringIO(blobstring), sep='\t')

If you get an error (newer versions of the SDK return a Blob object rather than a plain string), add .content:

blobstring = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content
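Since the download itself needs Azure credentials, the StringIO-to-read_csv step can be sketched on its own with an in-memory tab-separated string (hypothetical data standing in for the blob text):

```python
from io import StringIO

import pandas as pd

# Stand-in for the text returned by get_blob_to_text(...)
blobstring = "id\tname\n1\tfoo\n2\tbar\n"

# sep='\t' matches the tab-delimited data described in the question
df = pd.read_csv(StringIO(blobstring), sep="\t")
```

The same two lines work unchanged once blobstring really comes from blob storage.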

