I’m having some issues trying to load a dataset in Azure ML Studio. The dataset contains a column that looks like a DateTime but is in fact a string. Azure ML Studio converts the values to DateTime internally, and no amount of wrangling seems to convince it that they’re in fact strings.
This is an issue because the values lose precision during conversion and start appearing as duplicates, whereas in fact they are unique. Does anybody know if ML Studio can be configured not to infer data types for columns while importing a dataset?
Now, for the long(er) story :)
I’m working here with a public dataset - specifically the one from Kaggle’s New York City Fare Prediction competition. I wanted to see if I could do a quick-and-dirty solution using Azure ML Studio; however, the dataset’s unique key values are of the form and so on.
When importing them in my experiment the key values get converted to DateTime, making them no longer unique, even though they’re unique in the csv. Needless to say, this prevents me from submitting any solution to Kaggle, since I can’t identify the rows uniquely :).
I’ve tried the following:
- edit the metadata of the dataset after it has been loaded and set the data type of the column to string, but this doesn’t do much as the precision has already been lost
- import the dataset from an Azure blob, convert it to csv and then load it in Jupyter/Python - this gives me the same (duplicated) keys.
- loading the dataset locally with pandas works, as expected.
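For reference, the local pandas check looks roughly like the sketch below. The column name `key`, the helper names and the sample rows are placeholders made up for illustration, not actual rows from the competition files; with the real data, the csv path would be passed instead of the in-memory sample.

```python
import io

import pandas as pd

# Hypothetical sample rows (not real data): the keys look like timestamps
# and differ only in their trailing fractional digits.
SAMPLE_CSV = """key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
"""


def load_with_string_key(source, key_column="key"):
    """Read the csv while forcing the key column to stay a plain string."""
    # dtype=str for this column stops pandas from inferring another type for it.
    return pd.read_csv(source, dtype={key_column: str})


def count_duplicate_keys(frame, key_column="key"):
    """Return how many rows repeat a key that already appeared earlier."""
    return int(frame[key_column].duplicated().sum())


frame = load_with_string_key(io.StringIO(SAMPLE_CSV))
print(frame["key"].tolist())
print("duplicate keys:", count_duplicate_keys(frame))
```

Running the same duplicate count on the csv exported from ML Studio is how the lost precision shows up: the local load reports no duplicates, while the exported file does.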
I’ve reproduced this behavior with both the big 5.5GB train dataset and the more manageable sample_submission dataset.
Curious to know if there is some sort of workaround to tell ML Studio not to try converting this column while loading the dataset. I'm looking here specifically for Azure ML Studio-only solutions, as I don't want to do any preprocessing on the dataset.