Methods:
Given past RGBD observations z_{1:t} and robot manipulation actions u_{1:t}, our aim is to estimate an unstructured scene as a collection of k objects, with object labels o, 2D image-space bounding boxes b_t, and six-DoF object poses q_t, where k is the number of objects recognized in the scene. Object labels are strings containing a semantic identifier assumed to be a human-intuitive reference. In our experiments, the manipulation actions u_t are pick-and-place actions, which invoke a motion planning process. However, our formulation is general: u_t can equally represent low-level motor commands, such as joint torques, as in object-tracking scenarios. The state of an individual object i in the scene is represented as the tuple (q_t^i, o^i, b_t^i) of its object pose, object label, and image-space bounding box.

We assume that every object is independent of all other objects, which implies that there is only one object with a given label in the eventual inferred scene estimate. Independence between objects, together with the statistical chain rule, allows us to factor the scene estimation problem for each object i into object detection, object recognition, and belief over object pose:

P(q_t^i, o^i, b_t^i | z_{1:t}, u_{1:t}) = P(b_t^i | z_t) P(o^i | b_t^i, z_t) P(q_t^i | o^i, b_t^i, z_{1:t}, u_{1:t}).

The object detection factor P(b_t^i | z_t) denotes the probability of object i being observed within bounding box b_t^i given the current observation. The object recognition factor P(o^i | b_t^i, z_t) denotes the probability of this object having label o^i given the observation inside the bounding box. These two distributions are generated as the output of a pre-trained discriminative object detector and recognizer that evaluates all candidate boxes and labels. The pose belief factor for a particular object is modeled over time by a recursive Bayesian filter. Illustrative sketches of the factored update and the pose filter appear at the end of this section.

We have demonstrated the effectiveness and robustness of SUM with respect to both perception and manipulation errors in a cluttered tabletop scenario, on an object sorting task with a Fetch mobile manipulator. We are now looking to extend these methods to unstructured environments beyond tabletops.
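To make the factored update concrete, the following is a minimal Python sketch. The names here are illustrative assumptions, not our implementation: det_scores stands in for the detector output P(b_i | z_t), rec_probs for the recognizer output P(o | b_i, z_t), crop for a helper that extracts the RGBD region inside a box, and each entry of pose_filters for the recursive Bayesian filter over one object's six-DoF pose (sketched afterward).

```python
import numpy as np

def crop(z, box):
    # Extract the RGBD region inside a box, assumed to be (x0, y0, x1, y1).
    x0, y0, x1, y1 = box
    return z[y0:y1, x0:x1]

def update_scene_estimate(boxes, det_scores, rec_probs, pose_filters, z_t, u_t):
    """One step of the factored per-object estimate
    P(q, o, b | z_{1:t}, u_{1:t}) = P(b | z_t) P(o | b, z_t) P(q | o, b, z_{1:t}, u_{1:t}).
    Since objects are independent and each label is unique, every label is
    assigned the candidate box maximizing the detection * recognition product."""
    scene = {}
    for label, pose_filter in pose_filters.items():
        # Joint detection/recognition weight of each candidate box for this label.
        weights = [s * p.get(label, 0.0) for s, p in zip(det_scores, rec_probs)]
        best = int(np.argmax(weights))
        # Pose belief: predict with the action u_t, correct with the RGBD
        # observation cropped to the selected bounding box.
        q_est = pose_filter.update(u_t, crop(z_t, boxes[best]))
        scene[label] = (q_est, boxes[best], weights[best])
    return scene
```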
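The pose belief factor is described above only as a recursive Bayesian filter, so the sketch below uses a generic bootstrap particle filter as one possible instantiation; the Gaussian motion noise, the likelihood callable, and the (x, y, z, roll, pitch, yaw) parameterization are all assumptions for illustration rather than our exact model.

```python
import numpy as np

class PoseParticleFilter:
    """Bootstrap particle filter over a six-DoF pose, parameterized here as
    (x, y, z, roll, pitch, yaw). One possible instantiation of the recursive
    Bayesian pose belief factor; an illustrative sketch only."""

    def __init__(self, likelihood, n_particles=500, seed=0):
        # likelihood(pose, z_crop) -> float: an assumed observation model P(z | q).
        self.likelihood = likelihood
        self.rng = np.random.default_rng(seed)
        self.particles = self.rng.normal(0.0, 0.1, size=(n_particles, 6))
        self.weights = np.full(n_particles, 1.0 / n_particles)

    def update(self, u_t, z_crop):
        n = len(self.weights)
        # Predict: propagate each particle by the action effect (treated here
        # as a six-vector displacement) plus assumed Gaussian process noise.
        self.particles += u_t + self.rng.normal(0.0, 0.01, self.particles.shape)
        # Correct: reweight particles by the observation likelihood evaluated
        # on the RGBD crop selected by the detection/recognition factors.
        self.weights *= np.array([self.likelihood(q, z_crop) for q in self.particles])
        self.weights /= self.weights.sum()
        # Resample when the effective sample size drops below half.
        if 1.0 / np.sum(self.weights ** 2) < n / 2:
            idx = self.rng.choice(n, size=n, p=self.weights)
            self.particles, self.weights = self.particles[idx], np.full(n, 1.0 / n)
        # Return the weighted-mean pose as the point estimate.
        return np.average(self.particles, axis=0, weights=self.weights)
```

With one such filter per recognized label, update_scene_estimate can be called once per frame to maintain the full factored scene estimate as observations and actions arrive.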