
I'm looking to construct or adapt a model, preferably based on RL theory, that can solve the following problem. I would greatly appreciate any guidance or pointers.

I have a continuous action space, where actions can be chosen from the range 10-100 (inclusive). Each action is associated with a certain reinforcement value, ranging from 0 to 1 (also inclusive) according to a value function. So far, so good. Here's where I start to get in over my head:

Complication 1:

The value function V maps actions to reinforcement according to the distance between a given action x and a target action A. The smaller the distance between the two, the greater the reinforcement (that is, reinforcement is inversely proportional to abs(A - x)). However, the value function is nonzero only for actions close to A (abs(A - x) less than some threshold, say epsilon) and zero elsewhere. So:

**V** is proportional to 1 / abs(**A** - **x**) for abs(**A** - **x**) < epsilon, and

**V** = 0 for abs(**A** - **x**) > epsilon.
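As a minimal sketch of this value function in Python: the proportionality constant `scale`, the cap at 1.0 (to keep the result in the stated 0 to 1 range), and the choice to return 0 exactly at the boundary abs(A - x) = epsilon are all assumptions that the description above does not pin down.

```python
def reinforcement(x, target, epsilon=5.0, scale=1.0):
    """Reinforcement for action x given the target action.

    Assumptions (not fixed by the problem statement): `scale` is the
    proportionality constant, the value is capped at 1.0, and the
    boundary case distance == epsilon is treated as zero.
    """
    distance = abs(target - x)
    if distance >= epsilon:
        return 0.0  # outside the window around the target
    if distance == 0:
        return 1.0  # avoid division by zero; use the cap
    return min(1.0, scale / distance)
```

For example, with `target=50` and the defaults, an action of 48 gives 0.5 and an action of 10 gives 0.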

Complication 2:

I do not know precisely what actions have been taken at each step. I know roughly what they are, such that I know they belong to the range x +/- sigma, but cannot exactly associate a single action value with the reinforcement I receive.

The precise problem I would like to solve is as follows: I have a series of noisy action estimates and exact reinforcement values (e.g. on trial 1 I might have x of ~15-30 and reinforcement of 0; on trial 2, x of ~25-40 and reinforcement of 0; on trial 3, x of ~80-95 and reinforcement of 0.6). I would like to construct a model that represents the estimate of the most likely location of the target action A after each step, probably weighting new information according to some learning rate parameter (since certainty will increase with the number of samples).
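One possible way to frame this is a grid-based Bayesian update over candidate values of A, sketched below. Note that this swaps the learning-rate idea for a posterior that sharpens on its own as trials accumulate. The sketch assumes the true action is uniformly distributed over its reported interval, assumes epsilon is known (5.0 here is made up), and uses only whether the reinforcement was zero or nonzero, ignoring its magnitude. The function name and grid resolution are illustrative.

```python
def update_posterior(grid, posterior, x_lo, x_hi, reward, epsilon):
    """One Bayesian update of the belief over the target action A.

    Assumes the true action is uniform on [x_lo, x_hi] and that only
    the zero/nonzero status of the reward is informative.
    """
    updated = []
    for a, p in zip(grid, posterior):
        # Length of the action interval that lies within epsilon of candidate a
        overlap = max(0.0, min(a + epsilon, x_hi) - max(a - epsilon, x_lo))
        p_nonzero = overlap / (x_hi - x_lo)  # P(reward > 0 | A = a)
        likelihood = p_nonzero if reward > 0 else 1.0 - p_nonzero
        updated.append(p * likelihood)
    total = sum(updated)
    return [u / total for u in updated]

# Candidate targets from 10 to 100 in steps of 0.1, with a flat prior
grid = [10 + 0.1 * i for i in range(901)]
posterior = [1.0 / len(grid)] * len(grid)

# The three example trials: (interval low, interval high, reinforcement)
trials = [(15, 30, 0.0), (25, 40, 0.0), (80, 95, 0.6)]
for x_lo, x_hi, reward in trials:
    posterior = update_posterior(grid, posterior, x_lo, x_hi, reward, epsilon=5.0)

posterior_mean = sum(a * p for a, p in zip(grid, posterior))
```

Under these assumptions, the two zero-reinforcement trials lower the belief near 15-40, and the nonzero trial confines A to within epsilon of the 80-95 interval. Using the magnitude of the reinforcement as well would narrow the estimate further, but that requires committing to an exact form for V.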


You can refer to this article, which addresses delayed rewards and robust learning in the presence of noise and inconsistent rewards:

"Rare neural correlations implement robot conditioning with delayed rewards and disturbances"

The method in this article traces (remembers) which synapses (or actions) were firing before a rewarding event and reinforces all of them, with the amount of reinforcement decaying with the time between the action and the reward.
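The general idea of decaying credit can be illustrated with a generic eligibility-trace sketch. This is not the article's actual algorithm; the exponential decay, the function name, and the `decay` and `learning_rate` values are assumptions chosen for illustration.

```python
def apply_delayed_reward(action_times, reward_time, reward,
                         decay=0.5, learning_rate=0.1):
    """Assign credit to past actions, decaying with the delay to the reward.

    `decay` and `learning_rate` are illustrative, not taken from the article.
    """
    credits = {}
    for action, t in action_times.items():
        delay = reward_time - t
        if delay < 0:
            continue  # actions taken after the reward get no credit
        credits[action] = learning_rate * reward * decay ** delay
    return credits

# Three hypothetical actions taken at time steps 0, 2 and 3; reward arrives at step 3
credits = apply_delayed_reward({"a": 0, "b": 2, "c": 3}, reward_time=3, reward=1.0)
```

Here the most recent action `"c"` receives the largest update and `"a"` the smallest.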
