We present RiEMann, an end-to-end near Real-time SE(3)-Equivariant Robot Manipulation imitation learning framework from scene point cloud input. Compared to previous methods that rely on descriptor field matching, RiEMann directly predicts the target poses of objects for manipulation without any object segmentation.

RiEMann learns a manipulation task from scratch with 5 to 10 demonstrations, generalizes to unseen SE(3) transformations and instances of target objects, resists visual interference from distracting objects, and tracks pose changes of the target object in near real time. The scalable action space of RiEMann makes it easy to add custom equivariant actions, such as the direction in which to turn a faucet, which makes articulated object manipulation possible for RiEMann.

In simulation and real-world 6-DOF robot manipulation experiments, we test RiEMann on 5 categories of manipulation tasks with a total of 25 variants and show that RiEMann outperforms baselines in both task success rates and SE(3) geodesic distance errors on predicted poses (reduced by 68.6%), and achieves a 5.4 frames per second (fps) network inference speed.
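The exact definition of the SE(3) geodesic distance error is not given here. As a rough sketch under that caveat, one common way to compare a predicted pose with a ground-truth pose is to report the rotation geodesic angle on SO(3) together with the Euclidean translation error. The function names below are hypothetical, and poses are assumed to be 4x4 homogeneous matrices.

```python
import numpy as np

def rotation_geodesic(R_a, R_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def pose_errors(T_pred, T_true):
    """Return (rotation error in radians, translation error) for 4x4 poses.

    Assumption: the two errors are reported separately; how (or whether)
    they are combined into one number is not specified in this text.
    """
    rot_err = rotation_geodesic(T_pred[:3, :3], T_true[:3, :3])
    trans_err = float(np.linalg.norm(T_pred[:3, 3] - T_true[:3, 3]))
    return rot_err, trans_err
```

The rotation term is the angle of the relative rotation `R_a.T @ R_b`, so it is unchanged when both poses are moved by the same rigid transform.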

RiEMann models the SE(3)-equivariant action space of robot manipulation tasks as target poses, each consisting of
a **translational vector \( \mathbf{t} \in \mathbb{R}^3 \)** and a **rotation matrix \( \mathbf{R} \in \mathbb{R}^{3 \times 3} \)** (9 entries),
both of which are proven to be SE(3)-equivariant.
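One way to read this equivariance property, using our own notation \( g = (\mathbf{R}_g, \mathbf{t}_g) \in SE(3) \) for a rigid transform applied to the whole scene (an assumption for illustration, not notation from the paper), is that the target pose moves with the scene:

\[
\mathbf{t} \mapsto \mathbf{R}_g \mathbf{t} + \mathbf{t}_g, \qquad \mathbf{R} \mapsto \mathbf{R}_g \mathbf{R}.
\]

In homogeneous form, this is left multiplication of the 4x4 target pose by \( g \).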

For a scene point cloud input, an SE(3)-invariant backbone \( φ \) first outputs a type-0 saliency map, which is used to extract a small point cloud region \( B_{ROI} \). An SE(3)-equivariant policy network, consisting of a translational heatmap network \( ψ_1 \) and an orientation network \( ψ_2 \), then predicts the action vector fields on the points of \( B_{ROI} \). Finally, we apply softmax, mean pooling, and Iterative Modified Gram-Schmidt orthogonalization to obtain the target action \( T \).
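The final pooling step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes the heatmap network yields one logit per point in \( B_{ROI} \), that the orientation network yields three type-1 vectors per point (one per rotation-matrix column), and that "iterative" means repeating the modified Gram-Schmidt pass. All function names and array layouts are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

def iterative_mgs(V, iters=2):
    """Orthonormalize the columns of a 3x3 matrix with repeated modified
    Gram-Schmidt passes, then enforce det = +1 (assumed convention)."""
    Q = np.array(V, dtype=float)
    for _ in range(iters):
        for j in range(3):
            for k in range(j):
                # Remove the component of column j along column k.
                Q[:, j] -= np.dot(Q[:, k], Q[:, j]) * Q[:, k]
            Q[:, j] /= np.linalg.norm(Q[:, j])
    if np.linalg.det(Q) < 0:
        Q[:, 2] *= -1.0  # flip one axis to get a proper rotation
    return Q

def pool_target_pose(roi_points, heat_logits, orient_fields):
    """roi_points: (N, 3); heat_logits: (N,); orient_fields: (N, 3, 3),
    where orient_fields[i][:, c] is the c-th predicted vector at point i."""
    weights = softmax(heat_logits)       # per-point position weights
    t = weights @ roi_points             # weighted mean position, shape (3,)
    V = orient_fields.mean(axis=0)       # mean-pooled vectors, shape (3, 3)
    R = iterative_mgs(V)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T
```

Because the softmax weights sum to one and Gram-Schmidt commutes with rotations, applying a rigid transform to the points and rotating the predicted vectors in this sketch left-multiplies the pooled pose by the same transform.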

We present the video, point cloud with predicted pose, position heatmap, and orientation vector fields of mug and plane experiments to show the generalization ability of RiEMann to unseen SE(3) transformations, new object instances, distracting objects, and the combination of all of them.

Thanks to the end-to-end pipeline of RiEMann, the network forward pass reaches 5.4 fps, which enables the near real-time following experiments below.

Besides the target pose, RiEMann can also predict any SE(3)-equivariant type-l vector field, as long as demonstrations are provided. Here we show an example of predicting the direction in which to turn a faucet.

```
@article{gao2024riemann,
author = {Gao, Chongkai and Xue, Zhengrong and Deng, Shuying and Liang, Tianhai and Yang, Siqi and Shao, Lin and Xu, Huazhe},
title = {RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation},
journal = {arXiv preprint arXiv:2403.19460},
year = {2024},
}
```