Partially observable Markov decision process


  • [16] Another line of approximate solution techniques for solving POMDPs relies on using (a subset of) the history of previous observations, actions and rewards up to the current
    time step as a pseudo-state.
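
    A minimal sketch of this pseudo-state idea (the truncation length k, the dictionary-keyed table, and all names here are illustrative assumptions, not taken from any particular reference): the recent action-observation history is collapsed into a hashable key and used to index a tabular value function.

    ```python
    from collections import defaultdict

    # Q-values keyed by a truncated action-observation history (the pseudo-state).
    Q = defaultdict(float)

    def pseudo_state(history, k=4):
        """Collapse an episode's (action, observation) pairs into a pseudo-state
        by keeping only the last k pairs, trading accuracy for tractability."""
        return tuple(history[-k:])

    history = [(0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]
    Q[pseudo_state(history)] += 1.0  # update the entry for the current pseudo-state
    ```

    Truncation keeps the number of pseudo-states bounded, at the cost of aliasing histories that differ only in their distant past.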

  • These solutions typically attempt to approximate the problem or solution with a limited number of parameters,[7] plan only over a small part of the belief space online, or
    summarize the action-observation history compactly.

  • Policy and value function: Unlike the “originating” POMDP (where each action is available from only one state), in the corresponding Belief MDP all belief states allow
    all actions, since the agent (almost) always assigns some probability to being in any (originating) state.

  • Instead, it must maintain a sensor model (the probability distribution of different observations given the underlying state) and the underlying MDP.

  • However, by interacting with the environment and receiving observations, the agent may update its belief in the true state by updating the probability distribution of the
    current state.
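
    This belief update is a discrete Bayes filter. A minimal sketch (the array layout, with T[s, a, s2] for the transition model and O[s2, a, o] for the sensor model, is an assumption made for illustration):

    ```python
    import numpy as np

    def update_belief(b, a, o, T, O):
        """Bayes-filter belief update for a discrete POMDP.

        b : belief over states, shape (S,)
        a : action index
        o : observation index
        T : transition model, T[s, a, s2] = Pr(s2 | s, a)
        O : sensor model, O[s2, a, o] = Pr(o | s2, a)
        """
        # Predict: push the current belief through the transition model.
        predicted = b @ T[:, a, :]                 # shape (S,)
        # Correct: weight each state by the likelihood of the observation.
        unnormalized = O[:, a, o] * predicted
        return unnormalized / unnormalized.sum()   # renormalize to a distribution
    ```

    The predict step uses only the previous belief and the action; the correct step uses only the new observation, matching the Markov property discussed below.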

  • At the same time, the agent receives an observation o ∈ Ω which depends on the new state of the environment, s′, and on the just-taken action, a, with probability
    O(o | s′, a) (or sometimes O(o | s′), depending on the sensor model).

  • A consequence of this property is that the optimal behavior may often include (information gathering) actions that are taken purely because they improve the agent’s estimate
    of the current state, thereby allowing it to make better decisions in the future.

  • In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief
    states that are encountered which are not in the set of grid points.
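
    As a toy sketch of the interpolation step (nearest-neighbor lookup is just one simple choice of interpolation scheme, and the names here are assumptions): the value of an off-grid belief is approximated from the stored values at the grid points.

    ```python
    import numpy as np

    def interpolated_value(b, grid, values):
        """Approximate V(b) for an off-grid belief by nearest-neighbor
        interpolation over a fixed set of belief points.

        grid   : belief points, shape (G, S)
        values : value estimates at those points, shape (G,)
        """
        distances = np.linalg.norm(grid - b, axis=1)  # distance to each grid point
        return values[np.argmin(distances)]           # value of the closest one
    ```

    Finer grids (or the adaptive and point-based placements mentioned below) trade computation for a tighter approximation of the true value function.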

  • [2] Formally, the belief MDP is defined as a tuple (B, A, τ, r, γ), where • B is the set of belief states over the POMDP states, • A is the same finite set of actions
    as for the original POMDP, • τ is the belief state transition function, • r is the reward function on belief states, • γ is the discount factor, equal to the γ in the
    original POMDP.

  • [9][10] For example, adaptive grids and point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space.

  • The POMDP framework is general enough to model a variety of real-world sequential decision processes.

  • [2] An exact solution to a POMDP yields the optimal action for each possible belief over the world states.

  • [15] Similar to MDPs, it is possible to construct online algorithms that find arbitrarily near-optimal policies and have no direct computational complexity dependence on the
    size of the state and observation spaces.

  • [13] Online planning algorithms approach large POMDPs by constructing a new policy for the current belief each time a new observation is received.

  • A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state.

  • Such a policy only needs to consider future beliefs reachable from the current belief, which is often only a very small part of the full belief space.

  • Another dynamic programming technique called policy iteration explicitly represents and improves the policy instead.

  • Since the state is Markovian (by assumption), maintaining a belief over the states solely requires knowledge of the previous belief state, the action taken, and the
    current observation.

  • The belief state transition function is τ(b, a, b′) = Σ_{o∈Ω} Pr(b′ | b, a, o) Pr(o | b, a), where Pr(o | b, a) is the value derived in the previous section and
    Pr(b′ | b, a, o) is 1 if the belief update with arguments b, a, o returns b′ (and 0 otherwise). The belief MDP reward function r(b, a) is the expected reward from the
    POMDP reward function over the belief state distribution: r(b, a) = Σ_{s∈S} b(s) R(s, a).
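
    The two expectations above are plain weighted sums and can be sketched directly (the array layout, with T[s, a, s2], O[s2, a, o], and R[s, a], is an assumption for illustration):

    ```python
    import numpy as np

    def belief_reward(b, a, R):
        """r(b, a) = sum_s b(s) R(s, a): expected POMDP reward under belief b."""
        return float(b @ R[:, a])

    def observation_prob(b, a, o, T, O):
        """Pr(o | b, a) = sum_{s'} O(o | s', a) * sum_s T(s' | s, a) b(s)."""
        predicted = b @ T[:, a, :]            # belief pushed through the dynamics
        return float(predicted @ O[:, a, o])  # expected observation likelihood
    ```

    For a fixed belief and action, the observation probabilities sum to one over o, which is what makes τ a valid transition function for the belief MDP.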

  • More recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure, and has extended POMDP solving into large domains with
    millions of states.

  • An MDP does not include the observation set, because the agent always knows with certainty the environment’s current state.

  • The agent takes an action a ∈ A, which causes the environment to transition to state s′ with probability T(s′ | s, a).
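
    These dynamics, together with the observation step, define a generative model that can be sampled directly. A minimal sketch (the nested-list layout T[s][a] and O[s2][a] is an assumption for illustration):

    ```python
    import random

    def simulate_step(s, a, T, O, rng=None):
        """Sample one POMDP step: s' ~ T(. | s, a), then o ~ O(. | s', a).

        T[s][a]  : list of next-state probabilities given state s and action a
        O[s2][a] : list of observation probabilities given new state s2 and action a
        """
        rng = rng or random.Random()
        s2 = rng.choices(range(len(T[s][a])), weights=T[s][a])[0]   # next state
        o = rng.choices(range(len(O[s2][a])), weights=O[s2][a])[0]  # observation
        return s2, o
    ```

    Sampling-based planners such as Monte-Carlo methods need only this generative interface, never the full probability tables.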

  • Parity objectives are defined via parity games; they make it possible to state complex objectives, such as requiring that a good state be reached every 10 timesteps.

  • Alternatively, an MDP can be reformulated as a POMDP by setting the observation set to be equal to the set of states and defining the observation conditional probabilities
    to deterministically select the observation that corresponds to the true state.
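
    A short sketch of this reformulation (the array layout O[s2, a, o] and the function name are assumptions): the sensor model is the identity on states, so the observation always equals the true state.

    ```python
    import numpy as np

    def mdp_as_pomdp_sensor(num_states, num_actions):
        """Deterministic sensor model turning an MDP into a POMDP:
        Pr(o = s' | s', a) = 1 for every action, i.e. the observation
        reveals the true state with certainty."""
        O = np.zeros((num_states, num_actions, num_states))
        for a in range(num_actions):
            O[:, a, :] = np.eye(num_states)  # identity: observation index = state index
        return O
    ```

    Under this sensor model the belief collapses to a point mass after every step, recovering ordinary MDP behavior.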


Works Cited

1. Åström, K.J. (1965). “Optimal control of Markov processes with incomplete state information”. Journal of Mathematical Analysis and Applications. 10: 174–205. doi:10.1016/0022-247X(65)90154-X.
2. Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. (1998). “Planning and acting in partially observable stochastic domains”. Artificial Intelligence. 101 (1–2): 99–134. doi:10.1016/S0004-3702(98)00023-X.
3. Sondik, E.J. (1971). The optimal control of partially observable Markov processes (PhD thesis). Stanford University. Archived from the original on October 17, 2019.
4. Smallwood, R.D.; Sondik, E.J. (1973). “The optimal control of partially observable Markov decision processes over a finite horizon”. Operations Research. 21 (5): 1071–88. doi:10.1287/opre.21.5.1071.
5. Sondik, E.J. (1978). “The optimal control of partially observable Markov processes over the infinite horizon: discounted cost”. Operations Research. 26 (2): 282–304. doi:10.1287/opre.26.2.282.
6. Hansen, E. (1998). “Solving POMDPs by searching in policy space”. Proceedings of the Fourteenth International Conference on Uncertainty in Artificial Intelligence (UAI-98). arXiv:1301.7380.
7. Hauskrecht, M. (2000). “Value function approximations for partially observable Markov decision processes”. Journal of Artificial Intelligence Research. 13: 33–94. arXiv:1106.0234. doi:10.1613/jair.678.
8. Lovejoy, W. (1991). “Computationally feasible bounds for partially observed Markov decision processes”. Operations Research. 39: 162–175. doi:10.1287/opre.39.1.162.
9. Hoey, J.; von Bertoldi, A.; Poupart, P.; Mihailidis, A. (2007). “Assisting Persons with Dementia during Handwashing Using a Partially Observable Markov Decision Process”. Proc. International Conference on Computer Vision Systems (ICVS). doi:10.2390/biecoll-icvs2007-89.
10. Hoey, J.; Poupart, P.; von Bertoldi, A.; Craig, T.; Boutilier, C.; Mihailidis, A. (2010). “Automated Handwashing Assistance for Persons with Dementia Using Video and a Partially Observable Markov Decision Process”. Computer Vision and Image Understanding. 114 (5): 503–519. doi:10.1016/j.cviu.2009.06.008.
11. Pineau, J.; Gordon, G.; Thrun, S. (2003). “Point-based value iteration: An anytime algorithm for POMDPs”. International Joint Conference on Artificial Intelligence (IJCAI). Acapulco, Mexico. pp. 1025–32.
12. Hauskrecht, M. (1997). “Incremental methods for computing bounds in partially observable Markov decision processes”. Proceedings of the 14th National Conference on Artificial Intelligence (AAAI). Providence, RI. pp. 734–739.
13. Roy, N.; Gordon, G. (2003). “Exponential Family PCA for Belief Compression in POMDPs”. Advances in Neural Information Processing Systems.
14. Silver, D.; Veness, J. (2010). “Monte-Carlo planning in large POMDPs”. Advances in Neural Information Processing Systems.
15. Ye, N.; Somani, A.; Hsu, D.; Lee, W.S. (2017). “DESPOT: Online POMDP Planning with Regularization”. Journal of Artificial Intelligence Research. 58. arXiv:1609.03250. doi:10.1613/jair.5328.
16. Lim, M.H.; Becker, T.J.; Kochenderfer, M.J.; Tomlin, C.J.; Sunberg, Z.N. (2023). “Optimality Guarantees for Particle Belief Approximation of POMDPs”. Journal of Artificial Intelligence Research. 77. arXiv:2210.05015. doi:10.1613/jair.1.14525.
17. Francois-Lavet, V.; Rabusseau, G.; Pineau, J.; Ernst, D.; Fonteneau, R. (2019). “On overfitting and asymptotic bias in batch reinforcement learning with partial observability”. Journal of Artificial Intelligence Research. 65: 1–30. arXiv:1709.07796.
18. Chatterjee, K.; Chmelík, M.; Tracol, M. (2016). “What is decidable about partially observable Markov decision processes with ω-regular objectives”. Journal of Computer and System Sciences. 82 (5): 878–911. doi:10.1016/j.jcss.2016.02.009. ISSN 0022-0000.
19. Hauskrecht, M.; Fraser, H. (2000). “Planning treatment of ischemic heart disease with partially observable Markov decision processes”. Artificial Intelligence in Medicine. 18 (3): 221–244. doi:10.1016/S0933-3657(99)00042-1. PMID 10675716.
20. Chadès, I.; McDonald-Madden, E.; McCarthy, M.A.; Wintle, B.; Linkie, M.; Possingham, H.P. (2008). “When to stop managing or surveying cryptic threatened species”. Proc. Natl. Acad. Sci. U.S.A. 105 (37): 13936–40. Bibcode:2008PNAS..10513936C. doi:10.1073/pnas.0805265105. PMC 2544557. PMID 18779594.
21. Kochenderfer, M.J. (2015). “Optimized Airborne Collision Avoidance”. Decision Making Under Uncertainty. The MIT Press.
22. Kochenderfer, M.J.; Wheeler, T.A.; Wray, K.H. (2022). Algorithms for Decision Making. Cambridge, MA: MIT Press. p. 678. ISBN 9780262047012.
23. Moss, R.J. (2021). “POMDPs: Decision Making Under Uncertainty (POMDPs.jl, crying baby problem)” (video). The Julia Programming Language.