Graphing the Future: Activity and Next Active Object Prediction Using Graph-Based Activity Representations

  • Conference paper
  • First Online:
Advances in Visual Computing (ISVC 2022)

Abstract

We present a novel approach for the visual prediction of human-object interactions in videos. Rather than forecasting human and object motion or future hand-object contact points, we aim to predict (a) the class of the on-going human-object interaction and (b) the class(es) of the next active object(s) (NAOs), i.e., the object(s) that will be involved in the interaction in the near future, as well as the time at which that interaction will occur. To this end, activities are represented as graphs, and prediction is cast as a graph matching problem. Graph matching relies on the efficient Graph Edit Distance (GED) method. The experimental evaluation of the proposed approach was conducted on two well-established video datasets of human-object interactions, namely MSR Daily Activities and CAD120. High prediction accuracy was obtained for both action prediction and NAO forecasting.
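
As a rough illustration of the graph matching idea described above, the following Python sketch compares a partially observed activity graph against a set of reference activity graphs using the Graph Edit Distance, predicts the on-going activity as the closest reference, and reads candidate next active objects off the best match. This is not the authors' implementation: the toy node/edge attributes and the use of networkx's graph_edit_distance are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code): activity graphs with labeled nodes
# (a human node plus object nodes) are compared via Graph Edit Distance (GED)
# using networkx. Node/edge attributes are illustrative assumptions.
import networkx as nx


def make_activity_graph(objects, active):
    """Build a toy activity graph: a 'human' node connected to every object;
    an edge is tagged 'active' if that object is currently being manipulated."""
    g = nx.Graph()
    g.add_node("human", label="human")
    for obj in objects:
        g.add_node(obj, label=obj)
        g.add_edge("human", obj,
                   relation="active" if obj in active else "present")
    return g


def graph_distance(g1, g2):
    """Unit-cost GED; nodes/edges match when their labels/relations agree."""
    return nx.graph_edit_distance(
        g1, g2,
        node_match=lambda a, b: a["label"] == b["label"],
        edge_match=lambda a, b: a["relation"] == b["relation"],
    )


# Reference activities: the full set of objects each one eventually involves.
reference_objects = {
    "drink water": {"cup"},
    "eat meal": {"plate", "fork"},
}
reference_graphs = {name: make_activity_graph(objs, active=objs)
                    for name, objs in reference_objects.items()}

# Partial observation: plate and fork are visible, only the plate is active.
observed_active = {"plate"}
observed = make_activity_graph({"plate", "fork"}, active=observed_active)

# Predict the on-going activity as the reference with the smallest GED and
# report its not-yet-active objects as candidate next active objects (NAOs).
best = min(reference_graphs,
           key=lambda name: graph_distance(observed, reference_graphs[name]))
candidate_naos = reference_objects[best] - observed_active
print("predicted activity:", best, "| candidate NAOs:", candidate_naos)
```

For tiny graphs such as these, exact GED computation is feasible; larger activity graphs would typically require approximate or optimized GED solvers.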



Acknowledgements

This research was co-financed by Greece and the European Union (European Social Fund-ESF) through the Operational Programme “Human Resources Development, Education and Lifelong Learning” in the context of the Act “Enhancing Human Resources Research Potential by undertaking a Doctoral Research” Sub-action 2: IKY Scholarship Programme for PhD candidates in the Greek Universities. The research work was also supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 1592) and by HFRI under the “1st Call for HFRI Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment”, project I.C.Humans, number 91.

Author information

Corresponding authors

Correspondence to Victoria Manousaki or Antonis Argyros.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manousaki, V., Papoutsakis, K., Argyros, A. (2022). Graphing the Future: Activity and Next Active Object Prediction Using Graph-Based Activity Representations. In: Bebis, G., et al. (eds.) Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol. 13598. Springer, Cham. https://doi.org/10.1007/978-3-031-20713-6_23


  • DOI: https://doi.org/10.1007/978-3-031-20713-6_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20712-9

  • Online ISBN: 978-3-031-20713-6

  • eBook Packages: Computer Science, Computer Science (R0)
