I'm looking to construct or adapt a model, preferably one based on RL theory, that can solve the following problem. I would greatly appreciate any guidance or pointers.
I have a continuous action space, where actions can be chosen from the range 10-100 (inclusive). Each action is associated with a certain reinforcement value, ranging from 0 to 1 (also inclusive) according to a value function. So far, so good. Here's where I start to get in over my head:
The value function V maps actions to reinforcement according to the distance between a given action x and a target action A. The smaller the distance between the two, the greater the reinforcement (that is, reinforcement is inversely proportional to abs(A - x)). However, the value function is nonzero only for actions close to A (abs(A - x) less than some threshold, say epsilon) and zero elsewhere. So:
**V**(**x**) is proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and
**V**(**x**) = 0 for abs(**A** - **x**) >= epsilon.
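To make the setup concrete, here is a minimal sketch of that value function in Python. The names `target` and `epsilon` are my own placeholders, and I cap the output at 1, since a raw 1/distance would exceed the stated 0-1 range near the target:

```python
def value(x, target, epsilon):
    """Reinforcement for action x given a target action (the A above).

    Nonzero only within epsilon of the target, and capped at 1 so the
    reinforcement stays in the stated [0, 1] range.
    """
    distance = abs(target - x)
    if distance >= epsilon:
        return 0.0
    return 1.0 if distance == 0 else min(1.0, 1.0 / distance)
```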
I do not know precisely what action has been taken at each step. I know it only roughly, in the sense that the true action lies somewhere in a range x +/- sigma, so I cannot associate a single exact action value with the reinforcement I receive.
The precise problem I would like to solve is as follows: I have a series of noisy action estimates paired with exact reinforcement values (e.g., on trial 1 I might have x of ~15-30 and reinforcement of 0; on trial 2, x of ~25-40 and reinforcement of 0; on trial 3, x of ~80-95 and reinforcement of 0.6). I would like to construct a model that, after each step, represents the most likely location of the target action A, probably weighting new information according to some learning rate parameter (since certainty should increase as samples accumulate). A sketch of the kind of thing I have in mind follows below.
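One concrete formulation I've been toying with (not necessarily the right one) is a discretized Bayesian update over candidate values of A: each trial's likelihood averages over the actions consistent with the noisy estimate, and the posterior sharpens on its own as trials accumulate. Everything specific here is an assumption on my part: the grid resolution, the flat prior, the Gaussian noise on the reward, and the epsilon value.

```python
import numpy as np

def update_posterior(posterior, grid, x_low, x_high, reward,
                     epsilon=5.0, noise=0.1, n_samples=50):
    """One Bayesian update of P(A) after observing a noisy action
    interval [x_low, x_high] and an exact reinforcement value.

    For each candidate A on the grid, the likelihood of the observed
    reward is averaged over actions sampled from the interval, with
    Gaussian noise on the reward as a softening assumption.
    """
    xs = np.linspace(x_low, x_high, n_samples)   # actions consistent with the estimate
    dist = np.abs(grid[:, None] - xs[None, :])   # |A - x| for every (A, x) pair
    # The value function sketched above: 1/distance within epsilon, capped at 1.
    predicted = np.where(dist < epsilon,
                         np.minimum(1.0, 1.0 / np.maximum(dist, 1e-9)),
                         0.0)
    # Likelihood of the observed reward under each candidate A,
    # averaged over the uncertain action.
    lik = np.exp(-0.5 * ((reward - predicted) / noise) ** 2).mean(axis=1)
    posterior = posterior * lik
    return posterior / posterior.sum()

# Usage with the example trials from the question:
grid = np.linspace(10, 100, 901)             # candidate values of A
posterior = np.ones_like(grid) / grid.size   # flat prior over [10, 100]
for lo, hi, r in [(15, 30, 0.0), (25, 40, 0.0), (80, 95, 0.6)]:
    posterior = update_posterior(posterior, grid, lo, hi, r)
print("most likely A:", grid[posterior.argmax()])
```

With a sequential update like this, certainty grows automatically with each trial, which might stand in for the explicit learning-rate parameter, but I'd still welcome a pointer to a standard RL treatment of this kind of problem.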