Abstract
We present a novel approach for the visual prediction of human-object interactions in videos. Rather than forecasting the human and object motion or the future hand-object contact points, we aim at predicting (a) the class of the on-going human-object interaction and (b) the class(es) of the next active object(s) (NAOs), i.e., the object(s) that will be involved in the interaction in the near future, as well as the time at which the interaction will occur. The observed activity is represented as a graph, and graph matching relies on the efficient Graph Edit Distance (GED) method. The experimental evaluation of the proposed approach was conducted on two well-established video datasets containing human-object interactions, namely MSR Daily Activities and CAD-120. High prediction accuracy was obtained for both action prediction and NAO forecasting.
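To make the graph-matching step concrete, the sketch below compares two toy activity graphs with Graph Edit Distance using networkx's built-in graph_edit_distance. The graph structure, node labels, and interaction triplets are illustrative assumptions, not the paper's actual representation or GED solver; a lower GED indicates more similar activities.

```python
# A minimal sketch of activity-graph matching via Graph Edit Distance (GED).
# All labels and interactions below are hypothetical examples.
import networkx as nx

def activity_graph(interactions):
    """Build a small activity graph: nodes are entities (human, objects)
    labeled with their class; edges are the observed interactions."""
    g = nx.Graph()
    for subj, verb, obj in interactions:
        g.add_node(subj, label=subj)
        g.add_node(obj, label=obj)
        g.add_edge(subj, obj, label=verb)
    return g

# Two hypothetical observed activities.
g1 = activity_graph([("human", "reach", "cup"), ("human", "hold", "kettle")])
g2 = activity_graph([("human", "reach", "cup"), ("human", "hold", "bowl")])

# GED is the cost of the cheapest sequence of node/edge insertions,
# deletions, and substitutions turning g1 into g2; here nodes and edges
# match only when their class labels agree.
ged = nx.graph_edit_distance(
    g1, g2,
    node_match=lambda a, b: a["label"] == b["label"],
    edge_match=lambda a, b: a["label"] == b["label"],
)
print(f"GED between activity graphs: {ged}")  # smaller = more similar
```

Note that exact GED computation is exponential in graph size, which is why an efficient GED method matters for this kind of matching; the toy graphs above are small enough for the exact computation to be immediate.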
Acknowledgements
This research was co-financed by Greece and the European Union (European Social Fund-ESF) through the Operational Programme “Human Resources Development, Education and Lifelong Learning” in the context of the Act “Enhancing Human Resources Research Potential by undertaking a Doctoral Research” Sub-action 2: IKY Scholarship Programme for PhD candidates in the Greek Universities. The research work was also supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1592) and by HFRI under the “1st Call for HFRI Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment”, project I.C.Humans, number 91.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Manousaki, V., Papoutsakis, K., Argyros, A. (2022). Graphing the Future: Activity and Next Active Object Prediction Using Graph-Based Activity Representations. In: Bebis, G., et al. (eds.) Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol. 13598. Springer, Cham. https://doi.org/10.1007/978-3-031-20713-6_23
DOI: https://doi.org/10.1007/978-3-031-20713-6_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20712-9
Online ISBN: 978-3-031-20713-6
eBook Packages: Computer Science, Computer Science (R0)