I'm looking to construct or adapt a model, preferably based on RL theory, that can solve the following problem. I would greatly appreciate any guidance or pointers.

I have a continuous action space, where actions can be chosen from the range 10-100 (inclusive). Each action is associated with a certain reinforcement value, ranging from 0 to 1 (also inclusive) according to a value function. So far, so good. Here's where I start to get in over my head:

**Complication 1:**

The value function **V** maps actions to reinforcement according to the distance between a given action **x** and a target action **A**: the smaller the distance, the greater the reinforcement (that is, reinforcement is inversely proportional to abs(**A** - **x**)). However, the value function is nonzero only for actions close to **A** (abs(**A** - **x**) less than some epsilon) and zero elsewhere. So:

**V** is proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and

**V** = 0 for abs(**A** - **x**) >= epsilon.
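To make the shape of this concrete, here is a minimal sketch of such a value function. The clipping at 1 is my own assumption: since 1/abs(**A** - **x**) diverges as the distance goes to 0 and exceeds 1 whenever the distance is below 1, something has to cap it to keep reinforcement in the stated [0, 1] range.

```python
def value(x, A, epsilon):
    """Reinforcement for action x given target A: inversely
    proportional to distance inside the epsilon window, zero outside."""
    d = abs(A - x)
    if d >= epsilon:
        return 0.0
    # 1/d blows up as d -> 0 and exceeds 1 for d < 1, so clip to
    # keep the value in [0, 1] (an assumption, not from the problem).
    return min(1.0, 1.0 / d) if d > 0 else 1.0
```

For example, with epsilon = 5, an action 2 units from the target yields 0.5, while one 10 units away yields 0.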

**Complication 2:**

I do not know precisely what action was taken at each step. I know roughly what it was (it lies in the range **x** +/- sigma), but I cannot associate a single exact action value with the reinforcement I receive.

The precise problem I would like to solve is as follows: I have a series of noisy action estimates paired with exact reinforcement values (e.g., on trial 1 I might have **x** of ~15-30 and a reinforcement of 0; on trial 2, **x** of ~25-40 and a reinforcement of 0; on trial 3, **x** of ~80-95 and a reinforcement of 0.6). I would like to construct a model that represents the estimate of the most likely location of the target action **A** after each step, probably weighting new information by some learning-rate parameter (since certainty will increase with increasing samples).
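One way to frame this is as Bayesian filtering over the unknown target **A** rather than classical RL: maintain a posterior over candidate values of **A**, and on each trial score every candidate by how consistent it is with the observed (interval, reward) pair. Below is a rough grid-based sketch under several assumptions of mine: the action is uniform over the stated interval, epsilon = 5, the value function is the clipped 1/distance form from Complication 1, and a reward "matches" a prediction within a tolerance `tol`. None of these names or constants come from the problem statement.

```python
import numpy as np

A_grid = np.linspace(10, 100, 901)            # candidate targets, step 0.1
posterior = np.ones_like(A_grid) / len(A_grid)  # uniform prior over A

def value(x, A, epsilon=5.0):
    """Vectorized clipped-1/d value function (assumed form)."""
    d = np.abs(A - x)
    return np.where(d < epsilon,
                    np.minimum(1.0, 1.0 / np.maximum(d, 1e-9)),
                    0.0)

def update(posterior, x_lo, x_hi, reward, epsilon=5.0, tol=0.05):
    """One trial: action known only to lie in [x_lo, x_hi], reward exact.
    Likelihood of each candidate A = fraction of the interval whose
    predicted value falls within tol of the observed reward."""
    xs = np.linspace(x_lo, x_hi, 200)
    pred = value(xs[:, None], A_grid[None, :], epsilon)   # (x, A) grid
    lik = np.mean(np.abs(pred - reward) < tol, axis=0) + 1e-6  # floor > 0
    post = posterior * lik
    return post / post.sum()

# The three example trials from the question:
for lo, hi, r in [(15, 30, 0.0), (25, 40, 0.0), (80, 95, 0.6)]:
    posterior = update(posterior, lo, hi, r)

A_hat = A_grid[np.argmax(posterior)]  # MAP estimate of the target
```

The two zero-reward trials push mass away from the low end of the range, and the 0.6 trial concentrates it around 80-95, which matches the intuition from the example data. The posterior's spread also gives you the increasing-certainty behavior for free, without hand-tuning a learning rate.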