
My general problem is that I have a dataframe whose columns correspond to feature values. There is also a date column in the dataframe. Each feature column may have missing NaN values. I want to fill a column with some fill logic such as "fill_mean" or "fill_zero".

But I do not want to just apply the fill logic to the whole column, because if one of the earlier values is a NaN, I do not want the mean I fill in for that NaN to be tainted by values that came later, which the model should have no knowledge of. Essentially it is the common problem of not leaking future information to your model, here specifically when filling a time series.

Anyway, I have simplified my problem to a few lines of code. This is my simplified attempt at the above general problem:

import numpy as np

# Assume ts_values is a time series where the first value in the list is the
# oldest and the last value in the list is the most recent.
ts_values = [17.0, np.nan, 12.0, np.nan, 18.0]

nan_inds = np.argwhere(np.isnan(ts_values))
for nan_ind in nan_inds:
    nan_ind_value = nan_ind[0]
    # Fill each NaN with the mean of everything before it (including
    # previously filled values).
    ts_values[nan_ind_value] = np.mean(ts_values[0:nan_ind_value])

The output of the above script is:

[17.0, 17.0, 12.0, 15.333333333333334, 18.0]

which is exactly what I would expect.

My only issue with this is that it runs in linear time with respect to the number of NaNs in the data set. Is there a way to do this in constant or log time, where I don't iterate through the NaN indices?


If you want to interpolate the series, you can use pandas directly:

>>> s = pd.Series([0, 1, np.nan, 5])
>>> s
0    0.0
1    1.0
2    NaN
3    5.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    3.0
3    5.0
dtype: float64
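If you want the expanding-mean fill from the question rather than interpolation, pandas can also vectorize that in a single pass. A sketch: note that it differs slightly from the loop in the question, because the loop feeds previously filled values back into later means, while `expanding().mean()` only averages the original non-NaN values seen so far (so the second NaN becomes 14.5 here instead of 15.33):

```python
import numpy as np
import pandas as pd

s = pd.Series([17.0, np.nan, 12.0, np.nan, 18.0])

# expanding().mean() skips NaNs, so at each position it is the mean of all
# non-NaN values seen so far; fillna applies it only where s is NaN.
filled = s.fillna(s.expanding().mean())
print(filled.tolist())  # [17.0, 17.0, 12.0, 14.5, 18.0]
```

This avoids the explicit Python loop over NaN positions, though the underlying work is still linear in the length of the series.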

You can also use numpy.interp instead of pandas.
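For a plain NumPy array, a sketch with numpy.interp that reproduces the pandas example, interpolating the NaN positions from the surrounding non-NaN samples:

```python
import numpy as np

arr = np.array([0.0, 1.0, np.nan, 5.0])
mask = np.isnan(arr)

# Interpolate at the NaN indices, using the non-NaN indices as x-coordinates
# and the non-NaN values as y-coordinates.
arr[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), arr[~mask])
print(arr.tolist())  # [0.0, 1.0, 3.0, 5.0]
```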
