Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level
semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims to
translate an untrimmed activity video into a sequence of action segments, remains challenging for this reason.
This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for
temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free
grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level
probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules.
Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence
prediction and discover its compositional structure.
Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of
both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
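
To illustrate the general idea of grammar-guided decoding, the following minimal Python sketch refines frame-level probabilities so that the output follows a grammar with a recursive rule. It is not the induction algorithm or the generalized parser proposed in this paper. As a simple stand-in, it enumerates the sequences of a hand-written toy grammar up to a bounded depth and aligns each one to the frames by dynamic programming. The grammar, the action names, the probabilities, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy grammar (hand-written, not induced from data):
#   S -> pour R serve      R -> stir | stir R   (recursive rule)
GRAMMAR = {"S": [("pour", "R", "serve")], "R": [("stir",), ("stir", "R")]}
CLASSES = ["pour", "stir", "serve"]


def enumerate_sequences(grammar, symbol, depth):
    """Yield terminal sequences derivable from `symbol` within a depth bound."""
    if symbol not in grammar:  # terminal symbol, i.e. an action label
        yield (symbol,)
        return
    if depth == 0:  # recursion bound reached, stop expanding
        return
    for production in grammar[symbol]:
        partials = [()]
        for sym in production:
            partials = [p + s for p in partials
                        for s in enumerate_sequences(grammar, sym, depth - 1)]
        yield from partials


def align(log_probs, actions, index):
    """Best monotonic alignment of `actions` to frames: (score, frame labels)."""
    T, n = len(log_probs), len(actions)
    neg = float("-inf")
    if n > T:  # more segments than frames, no valid alignment
        return neg, []
    score = [[neg] * T for _ in range(n)]
    stay = [[True] * T for _ in range(n)]
    score[0][0] = log_probs[0][index[actions[0]]]
    for t in range(1, T):
        for i in range(min(n, t + 1)):
            prev_stay = score[i][t - 1]
            prev_move = score[i - 1][t - 1] if i > 0 else neg
            best = max(prev_stay, prev_move)
            if best == neg:
                continue
            stay[i][t] = prev_stay >= prev_move
            score[i][t] = best + log_probs[t][index[actions[i]]]
    # Backtrack from the last action at the last frame.
    labels, i = [], n - 1
    for t in range(T - 1, -1, -1):
        labels.append(actions[i])
        if not stay[i][t]:
            i -= 1
    labels.reverse()
    return score[n - 1][T - 1], labels


def decode(log_probs, classes, grammar, start="S", depth=4):
    """Return frame labels of the best-scoring sequence allowed by the grammar."""
    index = {name: i for i, name in enumerate(classes)}
    candidates = sorted(set(enumerate_sequences(grammar, start, depth)))
    results = (align(log_probs, seq, index) for seq in candidates)
    return max(results, key=lambda r: r[0])[1]


# Made-up frame-level probabilities; the fourth frame is noisy and favors "pour".
probs = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1],
         [0.5, 0.4, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in row] for row in probs]
framewise = [CLASSES[max(range(len(row)), key=row.__getitem__)] for row in probs]
decoded = decode(log_probs, CLASSES, GRAMMAR)
```

In this toy example, the frame-wise argmax in `framewise` labels the noisy fourth frame as "pour", which no sequence of the toy grammar allows after "stir", whereas `decoded` assigns that frame to "stir". Exhaustive enumeration is only feasible for very small grammars, which is one reason an efficient parser is needed in practice.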