My observations from the 2nd full day of sessions at the AAAI-15 artificial intelligence conference in Austin, Texas.
“Statistical Parsing with a Context-Free Grammar and Word Statistics”
“Task-Oriented Planning for Manipulating Articulated Mechanisms Under Model Uncertainty”
Task of cooking involves many subtasks. Many household objects have mechanical constraints: they are “articulated”. Here we’re looking at learning the kinematic structure of an object. Use the task itself to help motivate and guide the learning (task-driven). Graph representation, with vertices and edges representing joints, shape, etc., as a generalized kinematic graph. Given some number of candidate models, find the minimal cost to achieve the goal, using a user-defined cost.
Belief MDP… Assuming no noise in observation, and perfect motions, we can construct a sound logical space. Robot observes the environment, generates a plan. Learning is supervised in that the robot will know about and have prototypical concepts for drawers and doors.
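To make the "belief over candidate models" idea concrete, here is a minimal sketch of a Bayes update over two candidate kinematic models. The model names, the likelihood values and the function `update_belief` are all assumptions for illustration; the talk's actual formulation was not captured in these notes.

```python
def update_belief(belief, likelihoods):
    """Bayes update over candidate models: posterior is proportional to prior * likelihood."""
    unnormalized = {m: belief[m] * likelihoods[m] for m in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is impossible under every candidate model")
    return {m: p / total for m, p in unnormalized.items()}


# Hypothetical candidate models for a cabinet front, initially equally likely.
prior = {"revolute_door": 0.5, "prismatic_drawer": 0.5}
# Assumed likelihood of seeing the handle move in a straight line under each model.
likelihoods = {"revolute_door": 0.1, "prismatic_drawer": 0.9}
posterior = update_belief(prior, likelihoods)
print(posterior)
```

A planner could then pick the action with the lowest expected cost under `posterior`, which is roughly the role the user-defined cost plays above.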
Experiments conducted in office and kitchen spaces. Videos showed success finding and opening various cabinets and doors.
“Learning the State of the World: Object-based State Estimation for Mobile-Manipulation Robots”
Creating semantic world models. How should a robot represent its spatial environment? Object-based spatial representation using attributes… what and where?
Say a robot captures image of objects on a table. Apply object recognition processor… identify box, etc. However, object detection is noisy, misinterpreted, occluded, etc. Then can possibly combine partial views to form hypotheses about what’s on the table. Becomes a data association problem. Can use crossover clustering to help association accuracy.
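A rough sketch of the data association step, assuming detections are 2-D points on the table and using a simple greedy nearest-centroid grouping. This is a stand-in, not the clustering method from the talk; `associate` and `max_dist` are hypothetical names.

```python
import math


def associate(detections, max_dist=0.1):
    """Greedily group detections: join the nearest cluster within max_dist, else start a new one."""
    clusters = []  # each cluster is a list of (x, y) points believed to be one object
    for p in detections:
        best, best_d = None, max_dist
        for c in clusters:
            cx = sum(q[0] for q in c) / len(c)
            cy = sum(q[1] for q in c) / len(c)
            d = math.hypot(p[0] - cx, p[1] - cy)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([p])
        else:
            best.append(p)
    return clusters


# Four noisy detections from different partial views of the same table.
views = [(0.00, 0.00), (1.00, 1.00), (0.02, 0.01), (0.98, 1.03)]
clusters = associate(views)
print(len(clusters), "object hypotheses")
```

Each resulting cluster is one hypothesis about an object on the table; a real system would also weigh the recognized object type and detection confidence.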
What representation should be used? Object as atom? Occupancy grids? Why not use both? Fuse together on demand as needed.
Example shown of a toy train behind a board. Robot arm can sweep behind the board (?)
Scaling this up has issues, of course, but much of it is irrelevant (what you can’t see) and doesn’t matter. Instead, the estimator should be tied to the task, but flexible enough to be tasked on the fly. As in, the bot would generate an estimator for a task.
“Time-Optimal Learning, Exploration and Control for Mobile Robots in (Partially) Known Environments”
Robot knows there are objects in a space, but must find them. Circular sight range. Assume that bots pick up objects automatically, and that detection is instantaneous (to simplify computations), and uniform distribution over the space. (Egads, highly constrained to help the study)
Two approaches tried: heuristic worst-case optimization and heuristic probabilistic optimization, using decision trees and an exploration-to-action cycle. Given all the constraints, able to create a sound logical model (note to self: isn’t this part of the general problem in AI that we try to fit problems into sound logical models?) Showed various paths taken in either case, and that the probabilistic version performed a little better (moving in a spiral). Can solve an approximate version of the OCP problem (non-convex).
“Plan Execution Monitoring through Detection of Unmet Expectations about Action Outcomes”
Intelligent behavior via accurate model. Case in point with RoboCup robots. Many models used: optimal attack location, opponent marking, ball and robot motion. Works well if we have accurate models to work with. However, cannot always fully explore domain before deployment. Especially in adversarial or human environments and domains. In some cases too dangerous to deploy and test. So how to go about it?
Can possibly create “generally accurate” models with anomalous subspaces. Still applicable to RoboCup, in which case it depends on the opponents and how they play. What do you do if there’s a subspace wherein the bot is totally inaccurate?
Introduce into the sense-plan-act loop: the planner will also generate some expectations and monitor/correct based on observations/feedback. Optimization applied to identify and correct anomalous subspace regions. Using elliptical parametric subspaces, nonlinear optimizations, and a cost function (likelihood of anomaly over probability of anomaly).
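The expectation check inside the loop can be sketched as below. This only shows the "compare expected vs observed outcome" part with a fixed tolerance; the elliptical subspace fitting and nonlinear optimization from the talk are not reproduced. The log values and `find_anomalies` are made up for illustration.

```python
def find_anomalies(steps, tolerance=0.2):
    """Return the states in which the observed outcome missed the planner's expectation."""
    anomalies = []
    for state, expected, observed in steps:
        if abs(observed - expected) > tolerance:
            anomalies.append(state)
    return anomalies


# Hypothetical log: (robot x-position, expected ball travel in m, observed ball travel in m).
log = [(0.5, 2.0, 2.1), (1.5, 2.0, 1.2), (1.6, 2.0, 1.1), (3.0, 2.0, 1.9)]
anomalous_states = find_anomalies(log)
print(anomalous_states)
```

The flagged states cluster around x = 1.5, which is the kind of region an anomalous subspace would then be fitted to.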
Focusing on anomalous subspaces acts as negative feedback for learning. Showed various animations of their own bots learning off each other.
“Representation Learning for Robotics”
Can call it “feature learning” or “learning to see”. Showed animation of blurry visual feedback that came from a bot with 4 cameras trying to navigate in a box with different colored walls. The blurry image showed how vague the visual field is when interpreted by the bot.
Idea is to take this blurry observation and map it to a state. There’s a learning objective applied, with punishment and rewards applied (reward being a corner). Still supervised, but does convert raw visual imagery into an internal map of the walled box space.
Added “distractors”, a.k.a. obstacles. Relying on observations solely, not great. This new method performs as well as introducing “cheat” information about the environment.
“A Divide and Conquer Approach to Control Complex Continuous State Dynamic Systems using Hierarchical Reinforcement Learning”
Imagine Mario trying to drive around a track. Continuous space: x, y, direction, velocity. Typical reinforcement learning leads to consistent mistakes. Instead, discretize/decompose the track, with an assumed hierarchy. Completion function is used to re-assess what’s left after a subtask is completed. As well, divide the regions of the termination step, with multiple policies for each region. So then it becomes a matter of picking a particular exit region to aim toward. Reduces the solution space, so easier to solve. Reasonably successful, but still not optimal, and errors are introduced in a non-simulation, continuous environment. Plan to extend this research to pole balancing and real bots. Currently using a road bot in a track.
“Towards a Programmer’s Apprentice (again)”
Revisiting the original paper of same title from 1974. Motivation is to give aid to programmers; e.g. go back to old code or get code piled on to you, and think… Why that way? does it work? etc. Speaker formerly with Symbolics, etc.
Example: the mailer bug. Mail would just stay queued for no apparent reason. Had lunch with original author, who remembered the bug. As in “oh yeah, I remember that”.
Problem: software systems last forever, continually evolving, break, design rationale lost… so could a computer help? Concept of an apprentice working with the programmer, but in such a way as to capture the rationale; e.g. asks about the stuff that’s not routine. Programmer, draws, points, codes, talks, etc. Project to solve this problem started in 1974… Developed plan calculus, temporal abstraction, cliche recognition, etc. In emacs!!!
Why revisit now? Computers are better now, various technologies have improved, and massive open source libraries to reuse.
Prototype now uses Siri and Start natural language system to gather input. Involves cliches, taxonomy, viewpoints of the data, code generation and refinement. Runtime is a temporal sequence state machine of sorts. Examples… encounters code that intends to represent pixels as bytes, refines and generates code that works.
Video demo shows emacs talking to the programmer, asking questions. User says “add a disk drive”, and the system then corrects the user by saying that the data types in his design are not matching up. In another case, user asks to generate pixels for some data, and the bot asks how the user wants to do it, asking why once the user has made a choice. Kinda cool. Synthesized code is in lisp.
“Conducting Neuroscience to Guide the Development of AI”
Many people pursue AI as way to understand how the brain works. Rather, here suggesting that we use tools and research of neuroscience to make more intelligent systems or dispel any illusions about AI models being accurate. So, along those lines, what we usually do is feed input to computer, get output, then feed input to person, get output. But we can’t answer the question about whether what the bot did == what the human did. Still no idea what was going on in the human brain.
Interface between language and vision as a focus here. Action recognition and video captioning.
Action recognition: Take video data and map to fixed set of concepts; e.g. a video of a kiss ==> “kiss”. Typical way… extract space time features, classify with SVM, etc. Questions… how does this compare with the human brain? Well… MRI scan the brain with the same input, and compare. And of course, they compare poorly. Explore more… does the brain pool features? MRI scans once per two second time resolution, which is slow, but can improve to 300ms, which improves possible results because of the importance of hitting the critical identifying moments.
Video captioning: Video clip input ==> speech text output. Solutions use priors from web scale natural language corpora to determine likelihood. However, is this what the brain does? Probably not. Accurate? Only if your video clip happens to hit the right probability. Again did MRI scans to find active regions and crossover between. And compared. Not really similar.
“Mechanism Learning with Mechanism Induced Data”
Mechanism design in internet applications with multiple agents with independent intentions and self-interest. Context of crowd-sourcing, search ads, and app stores. Users <==> Platform <==> Agents. Agents may be pretty irrational, but some consistency in behavior. Note that this is from Microsoft Research.
Game theory assumes rational behavior. Machine learning has unknown, but fixed distribution. So… quandary is in dealing with a combination of bounded rational and mechanism dependent behaviors.
Introduce MLMID as a hybrid between game theory and machine learning. Even with irrational behavior, there will still be patterns, plus changes in behavior depending on circumstances that can be recorded and used in a probabilistic model. Take into consideration evolution of user behavior as well. Strict use of previous history may not be effective. Make use of “regret” and “equilibrium” analysis. All very complicated, however, and really there are just a lot of open questions.
“Challenges in Resource and Cost Allocation”
Food banks in Australia and worldwide. Using technology to do good in the world. In this case, creating an app to help people both donate and receive food. Technology-wise, this becomes a resource allocation problem (similar to vehicle routing problem, which is more or less a variation on the np-complete traveling salesman problem, etc). Great bit of complexity added with a number of constraints and granularity of items that can be distributed, money available, etc etc etc. 20k customers, 600 vehicles, $100’s million+, etc.
Challenges: development of complex models and mechanism for fair division, mixed fair divisibility, optimization, and behavior awareness (not everyone behaves fairly or rationally), cost allocation mechanisms, etc. And the whole domain of the environment is constantly changing. (note to self: seemed to miss reacting to delays and other extenuating circumstances)
Benefit to food providers is that costs can be cut in half if done right.
“Explaining Watson: Polymath Style”
Why does Watson work well? Not standard NL research. Despite its success, we still don’t know why it works. Figuring it out can be an open collaborative effort (“polymath style”).
Why do we care? Seems to be a mismatch between our theory of meaning vs what we’re experiencing with Watson. In the Jeopardy Challenge, there are 5 key dimensions… broad open domain, complex language, high precision, accurate confidence, high speed. Some questions are not encyclopedic that can just be looked up. Regardless, can sort of be considered solved as of 2010. Watson became the first non-human millionaire by winning Jeopardy.
Four years later, and Watson’s accuracy has not been replicated yet (something like 75%), even on factoid questions. Unlike Deep Blue, after which progress in chess playing increased.
Watson has two search engines. Analyses question ==> decomposes query ==> –2x split here– hypothesis generation (primary search with candidate answer) ==> soft filtering ==> hypothesis and evidence scoring (supporting evidence lookup and scoring) ==> –join split here– synthesis ==> merge and rank hundreds of scored items (logistic regression applied) ==> answer and confidence. Dozens of NLP models used in the process.
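The final merge-and-rank stage can be pictured as a logistic regression over evidence scores. A minimal sketch, assuming two made-up evidence features per candidate and hand-picked weights; none of these numbers or names come from Watson itself.

```python
import math


def confidence(features, weights, bias):
    """Logistic regression: squash a weighted sum of evidence scores into a 0..1 confidence."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates, weights, bias):
    """Return (confidence, answer) pairs, most confident first."""
    scored = [(confidence(f, weights, bias), name) for name, f in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical candidates with two evidence scores each (e.g. passage match, type match).
candidates = {"answer_a": [0.2, 0.1], "answer_b": [0.9, 0.8]}
ranked = rank(candidates, weights=[2.0, 1.5], bias=-1.5)
print(ranked)
```

The top entry supplies both the answer and the confidence used to decide whether to answer at all.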
So… where’s the meaning in all of that? We get the impression that the meaning is understood somewhere in there (the ghost in the machine). No formal model or theory. IBM has published it in the IBM research journal, so how Watson works is open to play with. Offered some possible research paths that could be taken. Can contact them if interested in doing related research. Opened up to public because they are not seeing any replication of this sort of tech in the academic community.
Presented by Geoffrey Hinton, of original backprop and boltzmann machine fame.
What is deep learning good for? Distinguish structure from noise. Example problem: pixels to words.
Backpropagation had a bit of promise, but didn’t make good use of hidden layers. Couldn’t get it to work effectively for recurrent neural networks. RNNs held hopes of being able to combine en masse to help solve problems. Regardless… what’s wrong with BPNN? Required labelled training data. Very slow to run. Only found local optima. Often good, but otherwise really inaccurate.
Attempt to overcome by using unsupervised learning. Make use of stochastic binary units. Then hook them together to form restricted Boltzmann machines. With one hidden layer, the chances of one feature detecting unit can be independent of other units. Hopfield energy function applied to determine weight of a joint configuration, and the derivatives are useful for defining probabilities. Go back and forth between training vectors until activity stabilizes. Applied to learning to model images from video data. Problem, however, is that it’s horribly slow.
However, the RBM can be improved. Corrupt the data with encoded “beliefs”, then reconstruct, then take the difference and learn from that.
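My reading is that this reconstruct-and-take-the-difference shortcut is what is usually called contrastive divergence. A minimal NumPy sketch of one CD-1 update for a binary RBM, under that assumption; sizes, learning rate and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def energy(v, h, W, a, b):
    """Hopfield-style energy of a joint configuration: E = -a.v - b.h - v.W.h"""
    return -(a @ v) - (b @ h) - (v @ W @ h)


def cd1_step(v0, W, a, b, lr=0.1):
    """One CD-1 update from a single binary training vector v0."""
    ph0 = sigmoid(b + v0 @ W)                    # P(h = 1 | data)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample stochastic binary hidden units
    pv1 = sigmoid(a + W @ h0)                    # reconstruct the visible units
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigmoid(b + v1 @ W)                    # hidden probabilities for the reconstruction
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))  # data statistics minus reconstruction statistics
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b


W = np.zeros((4, 3))          # 4 visible units, 3 hidden units
a, b = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 0.0, 1.0, 0.0])
for _ in range(10):
    W, a, b = cd1_step(v0, W, a, b)
```

The point of the shortcut is that a single reconstruction replaces the long back-and-forth sampling that made the original procedure so slow.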
Training a deep network: Based on this RBM, more or less. First train a base layer directly on the pixels. Then treat the activations as if they were pixels to learn features of features, creating a multi-layer generative model. Apparently it’s provable that each layer added leads to a better variational lower bound.
Then… fine tune for discrimination… and use BPNN to pre-process data (I think).
Example application to acoustic modeling using a DNN pretrained as a deep belief network. Last year all good speech recognition apps were using this.
How many layers? How big? Backprop works better with greedy pre-training, and for scenarios with limited label information.
stuff ==> image ==> label. This is typical, and is what commonly leads us to focus on the gap between image and label. However, seems sensible enough to go from stuff ==> label directly. Unsupervised pre-training not necessary for optimization, but helpful for generalization.
Success achieved with ILSVRC-2012 competition on ImageNet. 1.2 million images, 1000 classes of objects. Goal: get correct class in your top 5 bets. Other groups looking at 25%-30% error. This solution is hitting 8% error. Architecture used: 7 hidden layers (most recently using 20), early layers convolutional, last two layers globally connected, activation function is rectified linear units, global layers had most parameters utilizing “dropout”. Example images shown, with confidence list. Dropout involves randomly omitting units, which helps to retain some of the information already learned.
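For reference, the three ingredients named above (rectified linear units, dropout, and the top-5 criterion) are each only a few lines in NumPy. This is a generic sketch, not the network from the talk; it uses the common "inverted dropout" variant, which rescales the surviving units during training.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: pass positives through, clamp negatives to zero."""
    return np.maximum(0.0, x)


def dropout(activations, p_drop, rng, training=True):
    """Randomly omit units with probability p_drop during training and rescale the rest."""
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)


def in_top5(scores, label):
    """True when the correct class is among the five highest-scoring classes."""
    return bool(label in np.argsort(scores)[-5:])


rng = np.random.default_rng(0)
hidden = relu(np.array([-1.0, 2.0, 0.5, -0.3]))
dropped = dropout(hidden, p_drop=0.5, rng=rng)
```

At test time `dropout(..., training=False)` returns the activations unchanged, so the full network is used.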
Again… what’s wrong with backprop? Misconstrued before. In retrospect: too few labels, too slow computers, stupid method for initializing weights, used wrong type of non-linearity.
Back to RNNs. Hidden layers feed into themselves, can take input or output at any time slice, etc. Powerful because they combine distributed hidden states that allow them to store info about the past efficiently and non-linear dynamics that allow them to update… AND if they’re deep, then they work better. Applied to doing machine language translation, using hidden state to represent the thought that the sentence expresses. Take that “thought” state vector and feed it into the “french” translation RNN. This solution beats state of the art now. Goal is to create a real babelfish on a chip that goes in your ear.
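A bare-bones picture of "hidden layers feed into themselves": the encoder below folds a sequence of input vectors into one final hidden state, which plays the role of the "thought" vector. A plain tanh recurrence with random weights is assumed for brevity; the translation systems described would use trained and more elaborate recurrent units.

```python
import numpy as np


def rnn_encode(inputs, W_xh, W_hh, b_h):
    """Run a simple recurrent layer over a sequence and return the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h


rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 2))   # input (2-d) to hidden (3-d)
W_hh = rng.normal(size=(3, 3))   # hidden to hidden
b_h = np.zeros(3)
sentence = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two toy "word" vectors
thought = rnn_encode(sentence, W_xh, W_hh, b_h)
```

A decoder RNN for the target language would take `thought` as its starting state and emit words one at a time.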
Then… combine vision with language. Train the RNN with vision percepts. Tested on a database from Microsoft of 200k images with captions. Successfully trained on this data, was good at generating sentences for images. All done without symbols, etc. Suggested that this is really bad news for GOFAI. Connectionist vs Symbolist — FIGHT! Hinton’s being pretty forward about denouncing people who keep using symbols in knowledge representation; that symbols are only input and output.
– Rebuttal from GOFAI: defending ability of humans to use formal logic even if it’s not part of the basic way we work.
– Can get NN to learn mathematics and perhaps interpret code.
– Smolensky brought up as someone who worked on Boltzmann machines, but has been pursuing more symbolic solutions
– Thought experiment: black box that sorts numbers. Whether or not there is a NN on the inside does not matter. Arguing that we cannot conclude that there are not algorithms in the brain
Hinton was more or less saying that most people here are probably wasting their time if they’re still dabbling in GOFAI. What he says about symbolic knowledge rep with classical logic reasoning frameworks is pretty obviously correct: it’s not how our brains work.
“Spontaneous Retrieval from Long-Term Memory for a Cognitive Architecture”
Knowledge search heuristics… e.g. “find an item that has a solid line and ends in a square”. What if the agent doesn’t have any of the basic knowledge in the first place. Doesn’t know cue, cue relationship, when to search, etc.
Applied to missing link domain, remote associates test. Agent gets some words as a clue, then has to look up associative words. Assuming imperfect associative information, with no pre-established relationships, introduce “spontaneous” retrieval. Results show that this solution helps in some cases, but does not do any worse than typical solutions in other cases; so it only improves. For cases in which puzzles are not solvable or there’s no good solution, the spontaneous solution performs faster and does just as well.
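To illustrate the remote associates task itself (not the spontaneous retrieval mechanism, which these notes do not capture in enough detail), here is a toy solver that assumes a complete, hand-written association table. `solve_rat` and the table are hypothetical.

```python
from collections import Counter


def solve_rat(cues, associations):
    """Return the word associated with the most cues, or None if nothing is known."""
    votes = Counter()
    for cue in cues:
        for word in set(associations.get(cue, ())):
            votes[word] += 1
    if not votes:
        return None
    word, _count = votes.most_common(1)[0]
    return word


# Toy association table; a real agent would have imperfect, partial knowledge.
associations = {
    "cottage": ["house", "cheese", "garden"],
    "swiss": ["alps", "cheese", "watch"],
    "cake": ["birthday", "cheese", "icing"],
}
answer = solve_rat(["cottage", "swiss", "cake"], associations)
print(answer)
```

The interesting case in the talk is when the table is incomplete, which is where a deliberate lookup like this fails and spontaneous retrieval is meant to help.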
“Automatic Ellipsis Resolution: Recovering Covert Information from Text”
Ellipsis, in the sense of … or [e] or “what we didn’t do”.
Confounders… a lot of syntactic and other work makes many assumptions about the correctness and completeness in the environment. This group intends to handle actual natural language. Question to investigate here is to what extent we can explore situations with incomplete, dirty data. With focus on resolving syntactically incomplete sentences (with ellipses).
Processing cycle: (showed slide for 1 second, alas)… used standard input processing, then applied removal of false positives, ordered matches with level of confidence (using phrasals, parallel configurations, non-parallel matching modalities, etc). Some challenges experienced after detecting antecedent clauses.
Unclear what the results were or indicated, but expressed that it would hopefully help inform agent decision making.
“Automated Construction of Visual-Linguistic Knowledge via Concept Learning from Cartoon Video”
KidsVideo project. Representing and learning concepts from kid videos. Many aspects to consider. Multimodal, vision, language, story lines, grammar, etc. Previous approaches include semantic networks, etc.
Solution involves image pre-processing, multiple abstraction layers, sparse population code (hierarchy free brain inspired representations), deep concept hierarchy of cartoon videos using SPC models, empirical distribution of a scene utterance pair, graph monte-carlo (stochastic, with numerous sub graph methods-UGMC, PRGMC, FGMC), etc. Generates a multimodal concept map, which includes formation of character property classification and recognition.
Results seem to include success in scene to subtitle determination and generation of images associated with given sentences.
Note: very dense material for the short amount of time to present, so it went by a little fast.
“Ontology-Based Information Extraction with a Cognitive Agent”
Problem is looking at text from family/genealogy history books, taking it and populating an ontological model to gather meaning and establish meaningful relationships. How can we determine who’s who even within the same family article? Probabilistic models applied to determine likelihood that two symbols may represent the same referent.
Introduce “Ontosoar” architecture, which is a combination of off-the-shelf and some new home grown components, including a semantic analyzer. OntoES used to help tie things together.
Notion of construction grammar. Form patterns are constructed and matched against inputs to determine meaning. Translated into a knowledge structure, with deduced relational links.
Tested against real genealogy sources, got mostly good, but some mixed results; accuracy checked against human interpretation of the sources.
“Extending Analogical Generalization with Near-Misses”
Learning from structured examples. Extend analogical generalization with “near-misses”. Introduce “ALIGN”: analogical learning by integrating generalization and near-misses. Involves generating hypotheses and refining or filtering.
Applied to recognizing structures (relatively geometric). Assume some generalized contexts; e.g. for “arch” concept, etc. Case libraries exist for each case (?). Pseudo probabilistic methods used to extract hypotheses based on analogy. Relatively good success rate.
“Learning Plausible Inferences from Semantic Web Knowledge by Combining Analogical Generalization with Structured Logistic Regression”
Problem: learn to do inference on structures, with all the issues people run into using traditional methods that assume no noise or a sound, complete system of some sort. Trying to overcome issues with incomplete, noisy data. Solution tries to combine structural alignment with statistical learning. SLogAn: structured logistic regression with analogical generalization.
Test example… use semantic web to gather information to infer information about family relationships. Preprocess structure, then apply analogical generalization, then make weight adjustments with structured logistic regression… can then do a structure mapping between input and a template.
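The weight adjustment step can be pictured as ordinary logistic regression trained by gradient descent. In the sketch below a single made-up numeric feature stands in for a structural-match score; how SLogAn actually derives its features from analogical generalization is not shown here.

```python
import numpy as np


def train_logistic(X, y, lr=0.5, steps=2000):
    """Fit logistic regression weights by batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the mean log loss w.r.t. w
        b -= lr * np.mean(p - y)                 # gradient w.r.t. the bias
    return w, b


def predict(X, w, b):
    """Label 1 when the linear score is positive, else 0."""
    return (X @ w + b > 0).astype(int)


# Hypothetical match scores and whether the inferred relationship was correct.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic(X, y)
print(predict(X, w, b))
```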
Compared to state of the art classification models. Better results than even some NN solutions.