
I'm just learning to use TensorFlow's tf.data API, and I've found that it is slowing my code down a lot, measured in time per epoch, which is the opposite of what it's supposed to do. I wrote a simple linear regression program to test it out.

Tl;dr: With 100,000 training examples, tf.data slows time per epoch down by about a factor of ten if you use full-batch training, and it is worse with smaller batches. The opposite is true with 500 training examples.

My question: What is going on? Is my implementation flawed? Other sources I've read report tf.data improving speed by about 30%.

import tensorflow as tf
import numpy as np
import timeit

n_epochs = 10                 # matches the "10 epochs" reported in the timings below
input_dimensions_list = [10]  # matches the "input dimension 10" reported in the timings below

def function_to_approximate(x):
    # reconstructed helper: labels are a linear function of the inputs via random_covector
    return np.dot(x, random_covector).astype(np.float32)

def regress_without_tfData(n_epochs, input_dimension, training_inputs, training_labels):
    tf.reset_default_graph()
    weights = tf.get_variable(
        "weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    # feed the whole training set through placeholders on every epoch
    X = tf.placeholder(tf.float32, shape=(None, input_dimension), name="X")
    Y = tf.placeholder(tf.float32, shape=(None, 1), name="Y")
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(n_epochs):
            sess.run(loss_op, feed_dict={X: training_inputs, Y: training_labels})

def regress_with_tfData(n_epochs, input_dimension, training_inputs, training_labels, batch_size):
    tf.reset_default_graph()
    weights = tf.get_variable(
        "weights", initializer=np.random.randn(input_dimension, 1).astype(np.float32))

    # pull (X, Y) batches from the tf.data pipeline built in the timing loop below,
    # instead of using feed_dict
    X, Y = data_set.make_one_shot_iterator().get_next()
    prediction = tf.matmul(X, weights)
    loss = tf.reduce_mean(tf.square(tf.subtract(prediction, Y)))
    loss_op = tf.train.AdamOptimizer(.01).minimize(loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        # run until the repeated, batched dataset is exhausted
        while True:
            try:
                sess.run(loss_op)
            except tf.errors.OutOfRangeError:
                break

for input_dimension in input_dimensions_list:
    for data_size in [500, 100000]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        # time the placeholder / feed_dict version (full-batch training);
        # timing call reconstructed from the printed output, repetition count is assumed
        time_without = timeit.timeit(
            lambda: regress_without_tfData(
                n_epochs, input_dimension, training_inputs, training_labels),
            number=3) / 3
        print("Not using tf.data, with data size {}, input dimension {} and "
              "training with a full batch, it took an average of {} seconds "
              "to run {} epochs.\n".format(
                  data_size, input_dimension, time_without, n_epochs))

for input_dimension in input_dimensions_list:
    for data_size, batch_size in [(500, 50), (500, 500), (100000, 50), (100000, 100000)]:
        training_inputs = np.random.randn(data_size, input_dimension).astype(np.float32)
        random_covector = np.random.randint(-5, 5, size=(input_dimension, 1))
        training_labels = function_to_approximate(training_inputs)

        # build the tf.data pipeline: slice into samples, repeat for all epochs, then batch
        data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
        data_set = data_set.repeat(n_epochs)
        data_set = data_set.batch(batch_size)

        # time the tf.data version; timing call reconstructed from the printed output,
        # repetition count is assumed
        time_with = timeit.timeit(
            lambda: regress_with_tfData(
                n_epochs, input_dimension, training_inputs, training_labels, batch_size),
            number=3) / 3
        print("Using tf.data, with data size {}, and input dimension {}, and "
              "training with batch size {}, it took an average of {} seconds "
              "to run {} epochs.\n".format(
                  data_size, input_dimension, batch_size, time_with, n_epochs))

This outputs for me:

Not using tf.data, with data size 500, input dimension 10 and training with a full batch, it took an average of 0.20243382899980134 seconds to run 10 epochs.

Not using tf.data, with data size 100000, input dimension 10 and training with a full batch, it took an average of 0.2431719040000644 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 50, it took an average of 0.09512088866661846 seconds to run 10 epochs.

Using tf.data, with data size 500, and input dimension 10, and training with batch size 500, it took an average of 0.07286913600000844 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 50, it took an average of 4.421892363666605 seconds to run 10 epochs.

Using tf.data, with data size 100000, and input dimension 10, and training with batch size 100000, it took an average of 2.2555197536667038 seconds to run 10 epochs.

1 Answer


In your case, it seems like you are comparing apples with bananas.

With placeholders, you are feeding a single monolithic tensor each step. With Dataset.from_tensor_slices, you are first slicing that tensor into individual samples, which the pipeline then has to re-batch, so the two setups are not doing the same amount of work.

To mimic the monolithic placeholder feed within the dataset pipeline, use tf.data.Dataset.from_tensors instead of from_tensor_slices; it keeps the whole array as a single dataset element.

If you use from_tensors in your script, you should get computation times similar to those with placeholders.
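
As a minimal sketch of that suggestion (reusing the variable names from the question; the exact pipeline below is an assumption, not code from the answer):

data_set = tf.data.Dataset.from_tensors((training_inputs, training_labels))  # whole arrays form one element
data_set = data_set.repeat(n_epochs)                  # one monolithic element per epoch, no per-sample slicing
X, Y = data_set.make_one_shot_iterator().get_next()   # shapes now match the placeholder version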

If you want a pipeline that uses from_tensor_slices, compare it with placeholders fairly: shuffle your data and add some preprocessing on the slices, as in the sketch below. I have no doubt you will then observe the performance gain that makes people switch to this pipeline.
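
For example, a per-sample pipeline along these lines (the shuffle buffer size and the toy map function are illustrative assumptions, not part of the original answer):

data_set = tf.data.Dataset.from_tensor_slices((training_inputs, training_labels))
data_set = data_set.shuffle(buffer_size=10000)       # per-sample shuffling, not possible with one monolithic feed
data_set = data_set.map(lambda x, y: (x * 2.0, y))   # stand-in for real per-sample preprocessing
data_set = data_set.batch(batch_size)
data_set = data_set.repeat(n_epochs)
data_set = data_set.prefetch(1)                      # overlap input preparation with training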

See the TensorFlow tutorial for more details; since TensorFlow and Machine Learning are closely related, it is worth studying the latter as well.

Hope this answer helps.
