AAAI-15 Conference – Day 4 – Daniel Lee James

My observations from the 4th full day of sessions at the AAAI-15 artificial intelligence conference in Austin, Texas.

von Neumann’s Dream

Talk by Michael Bowling discussing games and applicability to AI, etc. 3 weeks ago Bowling’s team was able to announce that “poker has been solved”.

Stats on game history: 2007 checkers “solved”. 1996 Kasparov loses. 1994 first time any computer beats a chess master. 1948 Turing & 1949 Shannon both wrote chess programs. 1928-1944, theory of games by von Neumann. 1864 Babbage and Lovelace tic-tac-toe, with hints at chess.

Why games? So many aspects of intelligent thinking are combined in game play; why we use them to help develop our children. Easy to determine progress.

“Solved” implies guaranteed win.

Many board games are “perfect information” games. Poker, however, involves things like bluffing and uncertainty. There’s a part of the decision making space that involves observations, guesses, and missing information. How do we represent “bluffing”? Poker has generally been perceived to be a “huge” game that we would never be able to solve.

Looking at Heads-Up Limit Texas Hold’em (HULHE). 3.2 x 10^17 game states (though don’t necessarily know which is which). 3.2 x 10^4 infosets. 1.4 x 10^13 canonical infosets. Would need 3 petabytes to represent? Unlike other games that have 3 win states; e.g. {win, lose draw}, poker has more of a continuous range; e.g. [-24,24] big blind (bb). So solutions end up being approximations.

So instead of getting it perfect, we aim for “essentially solved” (something like 95% confidence), approaching as if training a player over a “lifetime” of poker playing. Also apply “regret-matching”, in which the AI looks back at the utility of past decisions and replays the ones that seemed to lead to less “regret”. Over time regret will converge to 0. Cumulative regret is measured over time, and the most regretful strategies end up getting pruned, which leads to a probability distribution on the no-regret leftovers. Strategies translate into decision edges in the tree. Keeping track of all that non-regret may be intractable, so divide and localize many regret algorithms at different points in the tree. Thus far, these changes between 2003 and 2013 lead to improvements of orders of magnitude.

Some ideas did not work…

Abstraction strategy. Start with the full game strategy being just too big to solve. Idea is to abstract the game itself into a form from which a tractable solution can be determined and then translate the strategy back into context. This actually was successful in that it did beat some top poker players. From there, we could start measure meaningfully how close we were getting to an optimal solution measuring “exploitability”. Watching the graph, results were looking very very promising, but misconstrued.

Decomposition strategy. Very successful in perfect information games. As name implies, you solve sub-trees independently. However, in poker, you don’t really have independent sub-trees because of unknown information that crosses potential future solutions (not a perfect information space). Did have some success with this using sub-trees to work out counterfactuals. Ends up still being resource intensive.

One did work…

Algorithmic strategy (the one that did work). Track and use cumulative regret and regret matching. Bad things in the past get reset, and good things get saved and adjusted.

Look at Rhode Island Hold’em; a smaller game. Solved in 2005, but takes 2.5 hours to run with that solution. New solution: 7 minutes.

Additionally strategy includes also dropping data that is “unlikely” to be relevant, reducing the amount of memory needed and the amount of data to process. When started, the bot plays horribly. After 4.5 days (many many cpu years). 69 days (900 cpu years) heads us to that “good enough” point where we can claim to have “solved” poker.

Justifying the relevance of this research (beyond being for the fun of it) with an application story… you are a doctor, and you have to come up with a treatment recommendation for a new patient you don’t know much about, so you start off with some uncertainty. Could sample something from your sample set of patients. You can frame it as a game similar to how we represented poker.

“Intent Prediction and Trajectory Forecasting via Predictive Inverse Linear-Quadratic Regulation”

Intent is not just the target trajectory but also the behavior type. Example of pedestrian movement prediction. How probable are different behaviors? Made use of Cornell Activity Dataset (CAD-120), which is a database of videos of various activities, looking to be mostly household. Dataset includes skeleton movement and object info. So sample here might be one of cleaning plates, etc off a table. If recognized in real time, the bot can use predictive information to assist in a non-disruptive way.

Bayesian intention and target prediction. Approached with Markov decision process, quadratic control for path determination combined with a dynamic based cost function, inverse reinforcement learning. Have location of objects but not context (plate or microwave platter?). Have to also consider sub-activity-specific behaviors. Each gets unique (quadratic) cost functions.

Able to gain a nice set of update rules to allow for fast enough behavior (1000 predictions/sec) for real-time, real-world interventions. Works best if combined with strong priors. Future work involves climbing up into higher level activities, actor identification, and actor dependent prediction.

“Model-Based Reinforcement Learning in Continuous Environments Using Real-Time Constrained Optimization”

Quadcopter navigation, 8 dimensional feed. Should be able to set a higher order objective deployed into foreign environments. (a little hard to understand speaker)

Use constraints to avoid undesirable states (obstacles, going to fast, incoming ground). Showed math for in slide given objective, model and constraints. Use FITC sparse Gaussian processes for learning function. Use SE ARD kernel for feature selection. SQP used for NLP, which reduces to a more manageable series of QP problems. Let first QP converge for locally feasible trajectory, and then “warm-start” QP approximations around the previous trajectory.

Use pole-balancing results for benchmark: good results. Then test in simulated flight environment. Applied constraints and dynamic changes in real time: also good results.

“Approximately Optimal Risk-Averse Routing Policies via Adaptive Discretization”

Example of trying to get to the airport on time (relevantly to us conference goers) from the BBQ joint, but you want to maximize your culinary experience time. Traffic is uncertain. 50% probability of getting there in time at various points in your (common formal) predictive scheme. Compare fastest expected/shortest path vs best path, etc. Once you get to another point in time, you can look back at the original prediction and adjust what part of the decision graph you want to go down. Mostly seems like playing with formal probability so far. Modeled as a DAG. Looking at maximum utility to meet end goals.

Approach taken: backwards induction. For each vertex in reverse topology (from future to present?), calculate best option and expected utility, using Bellman equation to calculate utility. However, best not to solve the Bellman *exactly* or use time based discretization (might experience gap in utility). Instead, apply adaptive discretization until the utility gap is less than the delta at any given step.

Experimental results… done on fairly uniform graphs that *might* represent traversing Manhattan to get to the airport. So tests kinda limited, but planning to apply to more complex scenarios.

**“Easily Accessible Paper: Reusing Previously Found A* Paths for Fast Goal-Directed Navigation in Dynamic Terrain”**

Note that this is one of the “accessible” papers in the conference.

Problem: pathfinding in general graphs applied to a dynamic terrain; costs increase and decrease; e.g. mars rovers. The paper describing A* bot information is relatively unreadable, so trying to make more accessible here.

Generic solution: search and store result in a path. follow the path to the goal, execute a solution, if the arc cost changes, update the search graph.

Repeated forward A*: shown briefly…

D* Lite: do an optimized dynamic backward A* and store result in path. move through path and adjust for unknown object. Generalized Adaptive A* (GAA*): do A* followed by heuristic update and store in path, follow path, adjusting for arc cost changes.

Showed 2d picture of agent, target and obstacles. Backward strategy requires knowing the location of the target and where all the obstacles are (it looks like). Pic of D* strategy space looked like a blob, and the A* looked fairly directed. However, D* is more efficient because A* has to be repeated many more times.

AA* will improve upon A* by re-using previous search information. Pseudo code displayed, but font kinda small for my friggin eyes… alas.

Regardless… experimental evaluation in random maps (random, room, and warcraft maps). MPGAA* performs faster than the other x* solutions in the random and indoor settings. Not much improvement over D* Lite in outdoor settings. MPGAA* in any case is much easier to implement.

“SCRAM: Scalable Collision-Avoiding Role Assignment with Minimal-Makespan for Formational Positioning”

UT LARG dudes.

Role assignment problem. Assign N robots to M targets into 1-1 mapping so that collisions are avoided while minimizing the makespan. Can represent as bipartite graph. Applicable to scenario where using robots to procure items in a warehouse, or robots on an assembly line. Also applied to RoboCup domain.

Minimum Maximal Distance Recursive (MMDR) Role Assignment… mapping costs calculated, then apply hungarian algorithm. Translate MMDR into the assignment problem.

Minimum maximal distance + minimum sum distance…

Algorithm is a little dense, and presenter zoomed through it a little quickly, but did indicate that the solution is speedy and scales well. Showed some results assigning 10 robots in a 100×100 grid, which indicate improvements over typical solutions (greedy, MSD, MSD^2, etc). Showed animations of paths taken, illustrating significant performance difference.

RoboCup simulation shown as well. Formations seemed to make sense relative to ball position. Also showed pre-defined formation movement which included effects of battery drain because of additional work needed to form under other circumstances.

May try to apply to more complex scenarios with obstacles and such, and may also trying applying auction algorithms to the solution.

“Learning to Mediate Perceptual Differences in Situated Human-Robot Dialogue”

Situated referential grounding. To bot: “get me the black panda on the floor there”. Bot sees a floor space with all sorts of objects, including a stuffed black panda. However, there are mismatched perceptions about that space between the human and the bot. Related work spans from the 70’s.

Focused here on grounding procedure and “word grounding models”. Reliability of word grounding models can be affected by situational factors. Propose a weight-learning approach. Can make use of peripheral verbal cues from human (e.g. if two objects look similar, and the human adds “the one in the back”, then probability can be applied to resolve ambiguity), and then use that info to help reinforce general recognition of the type of object. Make use of graph matching between known and experienced (and maybe predicted?), and adjust weights at the attribute and word levels. Can frame as a linear programming problem. Pull this into online learning with human feedback.

“Spatio-Spectral Exploration Combining In Situ and Remote Measurements”

Rover (as in Mars/space Rover) navigation. Currently terrain priors are pretty low-resolution. Would like to figure out a sensible way to collect spectra to more accurately reconstruct from imagery and improve adaptive path planning.

Concept of endmembers, similar to PCA, that ideally correspond to geologic materials; hard to determine from orbital pixels. Determining where would be good to collect a sample to help inform the images is part of the goal. Rover-based endmembers… As in, we don’t necessarily know what different colors in the spectral analysis mean. Collecting samples helps to dispel mystery.

Tested on data from Cuprite, Nevada (has a number of visually distinct spectral classes). Paths chosen at random, with samples in a uniform grid. AVIRIS: camera used for capturing high res spectral resolution (2m/pixel); spatial resolution meh. ASTER: low res orbital spectra (15-30m/pixel) — used to help inform sampling done with AVIRIS. Tried various planning options (random path, direct path, etc). MESMA and NNLS performed the best. Field tests coming in the near future in Atacama Desert, Chile.

“An Entorhinal-Hippocampal Model for Simultaneous Cognitive Map Building”

Brain-inspired navigation. Study how mammal perform spatial cognition. Reference to neuroscience work done on the “Brain’s Inner GPS”. Function of “place cells” in the hippocampus and “Grid cells” in the entorhinal cortex used as inspiration for the development of a model.

Visual => place cells (to cognitive map) => set of grid cells (path integration)… (accellerated through the slides at this point. as the material became more dense, of course).

Tested on Pioneer bot with arm. Live test: build cognitive maps of two indoor environments. Result images showed neural map responses to different locations in office environment. Cool video showed neural spikes along with video of bot moving through its environment.