
I've just discovered the Pipeline feature of scikit-learn, and I find it very useful for testing different combinations of preprocessing steps before training my model.

A pipeline is a chain of objects that implement the fit and transform methods. Until now, whenever I wanted to add a new preprocessing step, I wrote a class that inherits from sklearn.base.BaseEstimator. However, I suspect there must be a simpler way. Do I really need to wrap every function I want to apply in an estimator class?


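import sklearn.base


class Categorizer(sklearn.base.BaseEstimator):
    """Converts given columns into pandas dtype 'category'."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless step: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Note: this modifies the DataFrame that was passed in.
        for column in self.columns:
            X[column] = X[column].astype("category")
        return X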

1 Answer


For a general solution that works in many use cases, not just for one particular transformer, you can write your own decorator for any stateless function that does not need to implement fit.

Refer to the code below for an example:

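import sklearn.base


class TransformerWrapper(sklearn.base.BaseEstimator):
    """Wraps a stateless function so it can be used as a pipeline step."""

    def __init__(self, func):
        # Stored under the same name as the argument so that
        # get_params() and clone() can find it.
        self.func = func

    def fit(self, *args, **kwargs):
        # Nothing to learn: the wrapped function is stateless.
        return self

    def transform(self, X, *args, **kwargs):
        return self.func(X, *args, **kwargs)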

After this, you can decorate any function:

@TransformerWrapper
def foo(x):
    return x*2

which is equivalent to:

def foo(x):
    return x*2

foo = TransformerWrapper(foo)
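
As a quick check of the decorator pattern, here is a minimal, self-contained sketch that applies a wrapped function to a small NumPy array. It assumes a variant of the wrapper that stores the function as func (matching the constructor argument name), which is what scikit-learn's get_params and clone look for; the function name double and the sample array are only illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class TransformerWrapper(TransformerMixin, BaseEstimator):
    """Wrap a stateless function so it can be used as a pipeline step."""

    def __init__(self, func):
        # Same name as the constructor argument, so get_params()
        # and clone() can find it.
        self.func = func

    def fit(self, X, y=None):
        # Nothing to learn: the wrapped function is stateless.
        return self

    def transform(self, X):
        return self.func(X)


@TransformerWrapper
def double(x):
    return x * 2


X = np.array([[1, 2], [3, 4]])

# `double` is now a transformer object, not a plain function.
result = double.fit(X).transform(X)

# Cloning works because the parameter is discoverable via get_params().
copied = clone(double)
```

Because double is now an estimator instance, it can be passed directly as a step in a Pipeline.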

This is essentially what sklearn.preprocessing.FunctionTransformer does.

You can also use the scikit-learn class directly as a decorator:

from sklearn.preprocessing import FunctionTransformer

@FunctionTransformer
def foo(x):
    return x*2
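
To tie this back to the question, the sketch below shows one way the Categorizer step could be expressed as a plain function wrapped in FunctionTransformer and placed in a Pipeline. The function name categorize, the column names, and the sample data are assumptions made for illustration; kw_args is the FunctionTransformer parameter used to pass extra keyword arguments to the function.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def categorize(X, columns):
    """Return a copy of X with the given columns cast to 'category'."""
    X = X.copy()  # leave the caller's DataFrame untouched
    for column in columns:
        X[column] = X[column].astype("category")
    return X


# A single-step pipeline; further preprocessing steps or a model
# could be appended to this list.
pipeline = Pipeline([
    ("categorize", FunctionTransformer(categorize, kw_args={"columns": ["color"]})),
])

# Hypothetical sample data.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
out = pipeline.fit_transform(df)
```

With this approach no custom estimator class is needed for stateless steps; a class is still the better fit when the step has to learn something in fit.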
