0 votes
1 view
in Data Science by (17.6k points)

I have a data like this 

enter image description here

enter image description here

there are 29 column ,out of which I have to predict winPlacePerc(extreme end of dataframe) which is between 1(high perc) to 0(low perc)

Out of 29 column 25 are numerical data 3 are ID(object) 1 is categorical

I dropped all the Id column(since they're all unique) and also encoded the categorical(matchType) data into one hot encoding

After doing all this I am left with 41 column(after one hot)

after one hot encoded

This is how i am creating data

X = df.drop(columns=['winPlacePerc'])

#creating a dataframe with only the target column

y = df[['winPlacePerc']]

Now my X have 40 column and this is my label data looks like

> y.head()

winPlacePerc

0   0.4444

1   0.6400

2   0.7755

3   0.1667

4   0.1875

I also happen to have very large amount of data like 400k data ,so for testing purpose I am training on fraction of that,doing that using sckit

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.997, random_state=32)

which gives almost 13k data for training

For model I'm using Keras sequential model

from keras.models import Sequential

from keras.layers import Dense

from keras.layers import Dense, Dropout, Activation

from keras.layers.normalization import BatchNormalization

from keras import optimizers

n_cols = X_train.shape[1]

model = Sequential()

model.add(Dense(40, activation='relu', input_shape=(n_cols,)))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='mean_squared_error',

              optimizer='Adam',

              metrics=['accuracy'])

model.fit(X_train, y_train,

          epochs=50,

          validation_split=0.2,

          batch_size=20)

Since my y-label data is between 0 & 1 ,I'm using sigmoid layer as my output layer

this is training & validation loss & accuracy plot

enter image description here

enter image description here

I also tried to convert label into binary using step function and binary cross entropy loss function

after that y-label data looks like

> y.head()

winPlacePerc

0   0

1   1

2   1

3   0

4   0

and changing loss function

model.compile(loss='binary_crossentropy',

              optimizer='Adam',

              metrics=['accuracy'])

enter image description here

this method was more worse than previous

as you can see its not learning after certain epoch,and this also happens even if I am taking all data rather than fraction of it

after this did not work I also used dropout and tried adding more layer,but nothing works here

Now my question ,what I am doing wrong here is it wrong layer or in data how can I improve upon this?

1 Answer

0 votes
by (38.2k points)
edited by

Accuracy doesn't really makes sense because this is a Regression problem because you can never predict the exact value of 0.23124.

So, firstly, before passing it to the network you should normalize your values. 

Then, you should change the activation function in the output layer. For this, try linear

and as a loss mean_squared_error will work fine

Refer the below code:

.

from sklearn.preprocessing import StandardScaler

n_cols = X_train.shape[1]

ss = StandardScaler()

X_train = ss.fit_transform(X_train)

model = Sequential()    

model.add(Dense(128, activation='relu', input_shape=(n_cols,)))

model.add(Dense(64, activation='relu'))

model.add(Dense(16, activation='relu'))

model.add(Dense(1))

model.compile(loss='mean_squared_error',

              optimizer='Adam',

              metrics=['mean_squared_error'])

model.fit(X_train, y_train,

          epochs=50,

          validation_split=0.2,

          batch_size=20)

If you want to learn Machine Learning then then follow the video tutorials of Intellipaat.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...