Talk:Reinforcement learning

From Wikipedia, the free encyclopedia

Is R=Σ_tγ^tr_t to be read as $R = \sum\limits_{t^\gamma}^{t} r_t$, as $R = \sum\limits_{t\gamma}^{t} r_t$, or as $R = \sum\limits_{t}^{t} \gamma r_t$?

Answer: It is $R = \sum\limits_{t=0}^{\infty} \gamma^{t} r_t$, i.e. each reward r_t is weighted by γ raised to the power t.
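
For anyone who finds code easier to read than the sum, here is a minimal sketch of that formula for a finite list of rewards. The function name and the example numbers are made up for illustration; with an infinite horizon the sum is only guaranteed to converge when γ < 1 and the rewards are bounded.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Walk backwards so each step is a single multiply-add:
    # R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards, chosen only to make the arithmetic easy to check.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Note that γ multiplies the reward inside the sum; it is not part of the summation limits.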

1 Policies
2 merge with Q learning
3 algorithms/concepts not mentioned
4 Economics?
5 Psychology

Policies

What exactly is a policy? The Sutton-Barto book is very vague on this point, and so is this article. In both cases the word is used without much explanation.

According to both the book and the article, a policy is a mapping from states to action probabilities. Fine. But this is not elaborated upon. What does a policy look like? I infer that it must be a table (2-D array), indexed by state and action, and containing probabilities, say p_ij for the i-th state and j-th action, each p_ij being a transition probability for the MDP. If so, what is its relation to the values derived from rewards? I.e. where exactly do the probabilities p_ij come from? How does one generate a policy table starting from values?

Sorry if I appear stupid, but I've been studying the book and I find it very difficult to comprehend, even though the maths is very simple (almost too simple). Or maybe it's in there somewhere but I've missed it?

--84.9.83.127 09:36, 18 November 2006 (UTC)

A policy is indeed a mapping from states to action probabilities, usually written π. So we could write π:S×A→[0,1], meaning that π(s,a) is the probability of taking action a in state s. It doesn't have to be a table; it is just a function. If S and A are discrete then it can easily be written as a table, but if either is continuous then another form is needed. For instance, if S is the interval [0,10], we can place a number of radial basis functions over that interval (say, 11 of them, one at 0, one at 1, one at 2, etc.). Number them r₀, ..., r₁₀. Now our policy is a function of their activations, π:r₀×...×r₁₀×A→[0,1], which we can no longer write as a table.
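
To make the table-versus-function distinction concrete, here is a rough sketch. Everything in it is an assumption for illustration: the two-state table, the Gaussian shape and width of the basis functions, and the choice of a softmax over linear action preferences are just one common way to turn features into probabilities, not something taken from the book or the article.

```python
import math
import random

# Discrete case: the policy really is a table.
# policy_table[s][a] = probability of taking action a in state s.
policy_table = [
    [0.9, 0.1],  # state 0
    [0.5, 0.5],  # state 1
]


def rbf_features(s, centers=range(11), width=1.0):
    """Activations of Gaussian bumps centred at 0, 1, ..., 10 for a state s in [0, 10]."""
    return [math.exp(-((s - c) ** 2) / (2 * width ** 2)) for c in centers]


def rbf_policy(s, weights):
    """Action probabilities for a continuous state s.

    weights[a] holds one weight per basis function for action a (hypothetical
    parameters that a learning algorithm would adjust).
    """
    phi = rbf_features(s)
    prefs = [sum(w * f for w, f in zip(w_a, phi)) for w_a in weights]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In both cases the policy answers the same question ("how likely is each action in this state?"); only the representation differs.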

The relation of the policy to values depends on the particular solution being used for the RL problem. In an actor-critic architecture, the policy is the set of state-action values along with a function for selecting an action (softmax, for instance, or just choosing the action with the highest value), and the state-action values are updated according to the state values and the error signal. In a Q-learning agent, the policy and the values are essentially the same. Well, more correctly, the policy is a function of the values, given by the action selection mechanism.
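
As a sketch of what "the policy is a function of the values" can mean in practice, the two action selection mechanisms mentioned above can be written as functions that turn one state's action values into action probabilities. The Q-values, the state name, and the temperature parameter below are invented for illustration.

```python
import math


def greedy_policy(q_values):
    """Put probability 1 on the highest-valued action (ties go to the first)."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def softmax_policy(q_values, temperature=1.0):
    """Higher-valued actions are more likely; temperature controls how sharply."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical learned values for three actions in a single state.
q_table = {"s0": [1.0, 2.0, 0.5]}

print(greedy_policy(q_table["s0"]))   # [0.0, 1.0, 0.0]
print(softmax_policy(q_table["s0"]))  # three probabilities, largest for action 1
```

Seen this way, the probabilities in the policy are not stored separately: they are recomputed from the values whenever an action has to be chosen.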

For the most part, when you're just learning reinforcement learning theory, the use of policies may not be particularly clear. At least, in my own case, I didn't understand the focus on policies until I read Sutton, Precup, and Singh (1999) on options [1], at which point policies became crystal clear.

Hope that answers your question. digfarenough (talk) 19:25, 4 March 2007 (UTC)

Thanks. But your reply raises more questions for me, which I need to try and find answers to! --84.9.75.142 22:41, 16 March 2007 (UTC) (formerly 84.9.83.127)

Feel free to ask further questions on my talk page. I'm certainly no expert on reinforcement learning, but I've written one paper on it and have written a large number of simulations of RL-related things, so I at least know the basics. digfarenough (talk) 01:09, 17 March 2007 (UTC)

merge with Q learning

There is a short article on Q learning that could be merged with reinforcement learning. Kpmiyapuram 14:23, 24 April 2007 (UTC)

I'd offer that Q Learning be expanded instead. In Q Learning's "See Also" there's Watkins' thesis, which I faintly remember is where Q Learning was introduced; but there's no mention of Watkins or any other researcher in the article. Additionally, Sutton's RL book is listed, which would be a great source to mine for further detail on history and application. --59.167.203.115 (talk) 01:17, 11 January 2008 (UTC)

I'd back Q-learning being expanded instead, with a summary in RL. As Q-learning is an active area of research it will grow over time, so it would be short-sighted to merge them - especially as they are already separate. At the start of my research it would have been SO helpful to know what was applicable to RL generally, and what was Q-Learning. --217.37.215.53 (talk) 10:05, 6 March 2008 (UTC)

algorithms/concepts not mentioned

active (policy improvement) vs passive (policy evaluation)
Adaptive Dynamic Programming (ADP) —Preceding unsigned comment added by 132.177.27.1 (talk) 17:23, 1 April 2008 (UTC)

Economics?

Where's all the stuff about learning in games? It would be great if someone could incorporate this. Jeremy Tobacman 23:40, 1 August 2007 (UTC)

It's certainly relevant, but you may have to add it yourself if you're familiar with the subject. I've come across that aspect a few times but never really looked into it, though I have seen quite a few papers on interacting multiagent systems from the game and economic perspectives (always, I think, the agents were working against each other to try to maximize profit or win the game, etc.). So add what you know, and others may be able to clean up any incorrect claims. digfarenough (talk) 16:31, 2 August 2007 (UTC)

Psychology

This article starts with a reference to 'Reinforcement learning' in psychology. Isn't there an article about that? --Rinconsoleao 13:43, 27 September 2007 (UTC)

Found it... --Rinconsoleao 13:45, 27 September 2007 (UTC)
