By: Ryan Wendel
Human activity analysis is an ongoing area of research in which videos are
analyzed frame by frame
Most of the video recognition is pulled from
3-D graphic engines
“HAA” stands for Human Activity Analysis
Surveillance systems
Patient monitoring systems
Human-computer interfaces
We are going to take a look at methodologies
that have been developed for simple human
actions and for high-level activities.
Gestures
Actions
Interactions
Group activities
Basic movements of a person's body parts.
For example:
Raising an arm
Lifting a leg
A single person's activities, which may consist of
multiple gestures.
For example:
Walking
Waving
Shaking body
Interactions that involve two or more
people / items.
For example:
Two people fighting
Activities performed by multiple people.
For example:
A group running
A group walking
A group fighting
Can be separated into two sections
◦ Single-layered approaches: recognize human
activities directly from a video feed (frame by
frame).
◦ Hierarchical approaches: describe high-level
activities in terms of simpler activities
(sub-events).
Main objective is to analyze simple sequences
of movements of humans
Can be divided into two
categories
◦ Space-time approach: takes an input video as a 3D volume
◦ Sequential approach: takes an input video and
interprets it as a sequence of observations
Divided into three different subsections
based on features
◦ Space-time volume
◦ Space-time Trajectories
◦ Space-time features
Captures a group of human activities by
analyzing volumes of a video (frame by
frame.)
Recognizes an activity by measuring the similarity between
two space-time volumes
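A minimal sketch of the volume idea, assuming grayscale frames stored as plain lists of rows. The helper names (stack_frames, volume_similarity) and the choice of normalized correlation as the similarity measure are illustrative assumptions, not a method taken from the slides.

```python
import math

def stack_frames(frames):
    """Stack equally sized 2-D frames into one T x H x W volume (nested lists)."""
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must have the same size")
    return [[list(row) for row in frame] for frame in frames]

def _flatten(volume):
    """Return every pixel of the volume as one flat list, in (t, y, x) order."""
    return [value for frame in volume for row in frame for value in row]

def volume_similarity(vol_a, vol_b):
    """Normalized correlation between two volumes of identical shape.

    Returns a value in [-1, 1]; 1.0 means the volumes match up to a
    brightness offset and a contrast scale. This measure is an assumption
    chosen for illustration.
    """
    a, b = _flatten(vol_a), _flatten(vol_b)
    if len(a) != len(b):
        raise ValueError("volumes must have the same shape")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denom == 0:
        return 0.0  # at least one volume is constant, so correlation is undefined
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom
```

An unknown clip could then be labeled with the activity of whichever stored volume scores highest.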
Uses stick-figure modeling to extract the joint
positions of a person in each frame
Does not extract features frame by frame
Extracts features when there is an appearance
or shape change in the 3-D space-time volume
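A rough sketch of the idea of extracting features only where something changes. It marks a point when brightness jumps between two consecutive frames. This simple frame-differencing rule, the function name change_points, and the threshold parameter are all assumptions for illustration; published space-time feature detectors are more elaborate.

```python
def change_points(volume, threshold):
    """Return (t, y, x) positions where a pixel changes by more than
    `threshold` between frame t-1 and frame t of a T x H x W volume."""
    points = []
    for t in range(1, len(volume)):
        previous, current = volume[t - 1], volume[t]
        for y, (row_prev, row_curr) in enumerate(zip(previous, current)):
            for x, (before, after) in enumerate(zip(row_prev, row_curr)):
                if abs(after - before) > threshold:
                    points.append((t, y, x))
    return points
```

Static regions of the video produce no points at all, so only the sparse set of changing locations is passed on to recognition.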
Space-Time Volume
◦ Hard to differentiate between multiple people in the
same scene.
Space-Time Trajectories
◦ 3-D body-part detection and tracking is still an
unsolved problem, and it requires a strong low-level
component that can estimate 3-D joint
locations.
Space-Time features
◦ Not suitable for modeling complex activities
Divided into two different subsections based
on features
◦ Exemplar-based
◦ State model-based
Review
◦ Sequential approach: takes an input video and
interprets it as a sequence of observations
Exemplar-based
◦ Represents human activities with a set of sample
sequences of action executions
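A minimal sketch of exemplar matching, assuming each frame has already been reduced to a single number (for example, the height of a hand). Dynamic time warping (DTW) is used here as the matching method because it tolerates differences in execution speed. The function names and the one-number-per-frame feature are assumptions.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best alignment cost of seq_a[:i] against seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # seq_b element repeated
                                    cost[i][j - 1],      # seq_a element repeated
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def classify(observed, exemplars):
    """Return the label whose sample sequence is closest to `observed`.

    `exemplars` maps an activity label to a list of sample sequences.
    """
    best_label, best_distance = None, float("inf")
    for label, samples in exemplars.items():
        for sample in samples:
            distance = dtw_distance(observed, sample)
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label
```

Adding a new activity only requires storing more sample sequences; no model has to be retrained.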
State model-based
◦ Represents a human activity as a model composed
of a set of states.
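A minimal sketch of a state model, assuming a hidden Markov model (HMM) is used and that each frame yields one discrete observation symbol. The forward algorithm below computes how likely an observation sequence is under one activity's model. The function name and the dictionary layout of the model are assumptions.

```python
def sequence_likelihood(observations, start, trans, emit):
    """Forward algorithm: probability of `observations` under an HMM.

    start[s]     probability of beginning in state s
    trans[p][s]  probability of moving from state p to state s
    emit[s][o]   probability of seeing observation o while in state s
    """
    if not observations:
        raise ValueError("need at least one observation")
    states = list(start)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())
```

With one such model per activity, a video is assigned to the activity whose model gives the highest likelihood.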
Exemplar-based is more flexible in terms of
comparing multiple sample sequences
Whereas state model-based can better handle a
probabilistic analysis of an activity.
The sequential approach is able to detect
more complex activities
Whereas the space-time approach handles
simpler activities.
Both approaches operate on a
sequence of images
Allows the recognition of high-level activities
based on the recognition results of other
simpler activities
Advantages of the Hierarchical Approach
◦ Has the ability to recognize high-level activities
with a more in-depth structure
◦ Amount of data required to recognize an activity is
significantly less than with the single-layered approach
◦ Easier to incorporate human knowledge
Statistical approach
Syntactic approach
Description-based approach
Statistical approaches use state-based
models to recognize activities
Stacking multiple layers of state-based
models makes it possible to
recognize activities with sequential structures
Human activities are recognized as a string of
symbols
Human activities are shown as a set of
production rules generating a string of
actions
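A minimal sketch of syntactic recognition, assuming low-level actions have already been detected and written out as a string of symbols. The grammar is kept in Chomsky normal form and checked with the CYK algorithm. The "handshake" grammar and all rule names are hypothetical examples.

```python
# Hypothetical production rules for a "handshake" activity.
BINARY_RULES = {
    "HANDSHAKE": [("APPROACH", "GREETING")],
    "GREETING": [("EXTEND", "SHAKING")],
    "SHAKING": [("SHAKE", "SHAKING"), ("SHAKE", "WITHDRAW")],
}
TERMINAL_RULES = {
    "approach": {"APPROACH"},
    "extend": {"EXTEND"},
    "shake": {"SHAKE"},
    "withdraw": {"WITHDRAW"},
}

def recognizes(tokens, start, binary_rules, terminal_rules):
    """CYK check: can `start` generate the action string `tokens`?"""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][k] = symbols that generate tokens[i : i + k + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][0] = set(terminal_rules.get(token, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for head, bodies in binary_rules.items():
                    for first, second in bodies:
                        if first in left and second in right:
                            table[i][length - 1].add(head)
    return start in table[0][n - 1]
```

Because SHAKING is recursive, the same rules accept a handshake with any number of shakes.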
Human activities are recognized by describing their
complex spatio-temporal structures
◦ A spatio-temporal structure captures the temporal,
spatial, and logical relationships among an activity's sub-events
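A minimal sketch of checking the temporal part of a spatio-temporal structure, assuming each detected sub-event comes with a (start_frame, end_frame) interval. The relations follow the usual interval predicates (before, during). The "push" description and the event names are hypothetical.

```python
def before(a, b):
    """Interval a ends strictly before interval b starts."""
    return a[1] < b[0]

def during(a, b):
    """Interval a lies strictly inside interval b."""
    return b[0] < a[0] and a[1] < b[1]

def matches_push(events):
    """Check a hypothetical description of a 'push' interaction.

    `events` maps a sub-event name to its (start_frame, end_frame).
    Description: the touch happens during the arm stretch, and the
    other person moves back only after the touch has ended.
    """
    required = ("stretch_arm", "touch", "move_back")
    if any(name not in events for name in required):
        return False
    stretch, touch, retreat = (events[name] for name in required)
    return during(touch, stretch) and before(touch, retreat)
```

Spatial and logical conditions (for example, which person performs each sub-event) could be added as further predicates in the same style.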
Uses Context-free grammars (CFGs) to
represent activities
◦ CFGs are used to recognize high-level activities
◦ The detection extracts space-time points and local
periodic motions to obtain a sparse distribution of
interest points in a video
Probability theory
Fuzzy logic
Bayesian network:
◦ Used for recognition of an activity, based on the
activity's temporal structure representation
◦ Uses a large network with over 10,000 nodes
A group of persons marching
◦ The images are recognized as an overall motion of
an entire group
A group of people fighting
◦ Multiple videos are used to recognize the activity
that a “group is fighting”
Recognition of interactions between humans
and objects requires multiple
components.
Much work on human-object interaction ignores
the interplay between object recognition and
motion estimation
You can also factor in object dependencies,
motions, and human activities to determine
the activities involved
J.K. Aggarwal and M.S. Ryoo. 2011. Human activity analysis: A review.
ACM Comput. Surv. 43, 3, Article 16 (April 2011), 43 pages.
DOI=10.1145/1922649.1922653
http://doi.acm.org/10.1145/1922649.1922653
Christopher O. Jaynes. 1996. Computer vision and artificial intelligence.
Crossroads 3, 1 (September 1996), 7-10. DOI=10.1145/332148.332152
http://doi.acm.org/10.1145/332148.332152
Zhu Li, Yun Fu, Thomas Huang, and Shuicheng Yan. 2008. Real-time
human action recognition by luminance field trajectory analysis. In
Proceedings of the 16th ACM international conference on Multimedia
(MM '08). ACM, New York, NY, USA, 671-676.
DOI=10.1145/1459359.1459456
http://doi.acm.org/10.1145/1459359.1459456
Paul Scovanner, Saad Ali, and Mubarak Shah. 2007. A 3-dimensional sift
descriptor and its application to action recognition. In Proceedings of the
15th international conference on Multimedia (MULTIMEDIA '07). ACM,
New York, NY, USA, 357-360. DOI=10.1145/1291233.1291311
http://doi.acm.org/10.1145/1291233.1291311