
I'm using the first 100 lines of The Enron Email Dataset for my experiment in Azure ML Studio. However, the Saved Dataset object is populated with about 4.8K lines instead of 100. I understand that this is most likely due to the "Inaccurate column separation on string data containing commas" issue.

However, when I use the same dataset in a local Python project or in an Azure ML Jupyter notebook (the same dataset imported from ML Studio, not imported separately into the notebook), the number of lines is read correctly and the downstream logic works fine.

Jupyter example:

from azureml import Workspace

ws = Workspace()

ds = ws.datasets['The Enron Email Dataset (Minimal)']

emails_df = ds.to_dataframe()

Local example:

import pandas as pd

emails_df = pd.read_csv('C:/enron-email-dataset/emails.csv', nrows=100)

And here is what the dataset visualization looks like in Azure ML Studio:

[screenshot: dataset visualization in Azure ML Studio]

It's clear that the data gets mangled when it is moved from saved datasets into an experiment, but my question is: what would be the best way to work around it? Loading the dataset from Azure Blob Storage inside my Python code, perhaps?

1 Answer


Once you have the output for your selected rows, download that dataset and upload it to the experiment. That seems to be the only way to do it.
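As a rough illustration of the download-and-re-upload route, here is a minimal sketch that reads the first 100 records with pandas (which parses quoted fields correctly) and saves them to a new file with every field quoted. The function name `write_clean_subset`, the column name `message`, and the idea of escaping embedded line breaks are assumptions for illustration; whether Azure ML Studio then parses the re-uploaded file as expected is not guaranteed.

```python
import csv

import pandas as pd


def write_clean_subset(src_path, dst_path, nrows=100, text_columns=("message",)):
    """Read the first `nrows` records and re-save them in a more defensive CSV form.

    `text_columns` is a hypothetical default: it assumes the free-text column
    is called "message". Adjust it to the columns in your own file.
    """
    # pandas honours quoted fields, so commas and line breaks inside a
    # message do not create extra columns or rows here.
    subset = pd.read_csv(src_path, nrows=nrows)

    # Assumption: line breaks inside text fields are what inflate the row
    # count downstream, so replace them with a literal "\n" marker.
    for col in text_columns:
        if col in subset.columns:
            subset[col] = subset[col].map(
                lambda v: v.replace("\r\n", "\n").replace("\n", "\\n")
                if isinstance(v, str)
                else v
            )

    # Quote every field so embedded commas stay inside one column.
    subset.to_csv(dst_path, index=False, quoting=csv.QUOTE_ALL)
    return subset
```

For example, `write_clean_subset('C:/enron-email-dataset/emails.csv', 'emails_100.csv')` would produce a file with one physical line per record plus the header, which can then be uploaded as a new dataset.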

You can also add an Execute Python Script module to the canvas and handle the data in a script.
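A minimal sketch of what such a script might look like, assuming the classic Studio convention of an `azureml_main` entry point that returns a tuple containing a DataFrame. The path `./Script Bundle/emails.csv` is hypothetical: it assumes the raw CSV was zipped and attached to the module's script bundle input, so that pandas does the parsing instead of the Studio importer. The extra `csv_path` and `nrows` parameters are added here only for illustration.

```python
import pandas as pd

# Hypothetical location: assumes emails.csv was zipped and connected to the
# module's script bundle port, where it is unpacked next to the script.
BUNDLED_CSV = "./Script Bundle/emails.csv"


def azureml_main(dataframe1=None, dataframe2=None, csv_path=BUNDLED_CSV, nrows=100):
    """Entry point sketch: ignore the (possibly mangled) input ports and
    read the raw CSV directly with pandas."""
    # Same call as in the local example: quoted fields with commas or
    # line breaks are kept as single values.
    emails_df = pd.read_csv(csv_path, nrows=nrows)

    # The module expects a sequence of DataFrames as its result.
    return emails_df,
```

This keeps the parsing identical to the local `pd.read_csv(..., nrows=100)` call that already works, at the cost of bundling the file with the script rather than using the saved dataset.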
