I am interested in deploying a machine learning model in Python so that predictions can be made through requests to a server.

I will create a Cloudera cluster and use Spark, through the pyspark library, to develop the models. I would like to know how a model can be saved so that it can be used on the server.

I have seen that the different algorithms have .save functions (as answered in the post "How to save and load MLLib model in Apache Spark"), but since the server will be on a different machine, without Spark and outside the Cloudera cluster, I don't know whether their .load and .predict functions can be used there.

Can this be done using the pyspark library functions for prediction without Spark underneath? Or would I have to apply some transformation in order to save the model and use it elsewhere?

1 Answer

After spending an hour on this, I got the following code working. It may not be optimized.

import os
import sys

# Path for spark source folder
# Append pyspark to Python Path
# (path setup goes here if pyspark is not installed as a package)

try:
    from pyspark.ml.feature import StringIndexer
    from numpy import array
    from math import sqrt
    from pyspark import SparkConf
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    print("Successfully imported Spark Modules")
except ImportError as e:
    # Stop early: nothing below works without pyspark
    print("Cannot import Spark Modules:", e)
    sys.exit(1)

if __name__ == "__main__":
    sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    sc = SparkContext(conf=sconf)
    # Four 2-D points that form two well-separated groups
    parsedData = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
    # Train k-means with k=2 on the points
    clusters = KMeans.train(sc.parallelize(parsedData), 2, maxIterations=10, initializationMode="random")
    clusters.save(sc, "mymodel")  # this will save model to file system
    sc.stop()


The code above creates a k-means clustering model and saves it to the file system. The server code below loads the saved model and returns a prediction:

from flask import jsonify, request, Flask
import os
import sys

# Path for spark source folder
# Append pyspark to Python Path
# (path setup goes here if pyspark is not installed as a package)

try:
    from pyspark.ml.feature import StringIndexer
    from numpy import array
    from math import sqrt
    from pyspark import SparkConf
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    print("Successfully imported Spark Modules")
except ImportError as e:
    # Stop early: nothing below works without pyspark
    print("Cannot import Spark Modules:", e)
    sys.exit(1)

app = Flask(__name__)

@app.route('/', methods=['GET'])
def predict():
    sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    # Reuse the running SparkContext: only one may exist per process,
    # so constructing a new one on every request would fail on the second call
    sc = SparkContext.getOrCreate(conf=sconf)
    sameModel = KMeansModel.load(sc, "mymodel")  # same path the training script saved to
    response = sameModel.predict(array([0.0, 0.0]))  # pass your data
    return jsonify(int(response))

if __name__ == '__main__':
    app.run()

The above API is written in Flask.
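Note that the API above still needs Spark installed on the server, because KMeansModel.load requires a SparkContext. For k-means specifically, one possible way around this (a minimal sketch, not part of the code tested above) is to export the cluster centers on the cluster, for example from `clusters.clusterCenters`, to a plain JSON file and do the nearest-center lookup on the server in pure Python. The function names `export_centers`, `load_centers` and `predict`, and the JSON layout, are hypothetical and chosen only for illustration.

```python
import json


def export_centers(centers, path):
    """Write cluster centers (an iterable of numeric sequences) to a JSON file.

    On the Spark side, `centers` could be `clusters.clusterCenters`
    (assumption: a trained pyspark.mllib KMeansModel named `clusters`).
    """
    # float() turns NumPy scalars into plain Python floats so json can serialize them
    plain = [[float(x) for x in center] for center in centers]
    with open(path, "w") as fh:
        json.dump(plain, fh)


def load_centers(path):
    """Read the centers back as a list of lists of floats."""
    with open(path) as fh:
        return json.load(fh)


def predict(centers, point):
    """Return the index of the center closest to `point` (squared Euclidean distance)."""
    if not centers:
        raise ValueError("no cluster centers available")
    best_index, best_dist = None, None
    for index, center in enumerate(centers):
        if len(center) != len(point):
            raise ValueError("point and center have different dimensions")
        dist = sum((c - p) ** 2 for c, p in zip(center, point))
        if best_dist is None or dist < best_dist:
            best_index, best_dist = index, dist
    return best_index
```

With this approach the Flask route would call `load_centers` once at startup and `predict(centers, point)` per request, with no pyspark import on the server. This only covers models that reduce to a small set of parameters, such as k-means centers; other algorithms would need their own export step.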

