DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2017
Online Predictions of Human Motion
ANDREAS EDVARDSSON
LUCAS GRÖNLUND
Abstract
Collaboration between humans and robots is becoming an increasingly common occurrence in both industry and homes, more so with every technological advance. This paper examines the possibility of predicting human hand movements on the fly, i.e. using only information available up to the moment at which the prediction is carried out.
Specifically, data will be collected using a Kinect (v.1).
The model used for the predictor developed is the Minimum Jerk model, which states that certain multi-joint reaching movements are planned so that the hand follows a straight path while maximizing smoothness.
The extent, direction and duration of the motion are the main quantities the predictor must determine, with a Kalman filter and curve fitting as the main constituents. Another assumption in this work is that a reliable start detector is available. An experiment in which five volunteers performed different reaching movements was conducted.
This study shows that the approach is feasible in some cases: usable predictions are acquired for long movements. For short movements, making no prediction at all was by all means the better alternative.
Referat
Collaboration between humans and robots is becoming increasingly common in both industry and households, more so with every further technological advance. This work examines the possibilities of performing predictions of hand movements online, i.e. taking into account only the information available up to the moment at which the prediction takes place. All data will be collected with a Kinect (v.1).

The assumption made is that a good start detector is available. The proposed and examined predictor is based on the minimum-jerk model, which holds that multi-joint reaching movements are performed in such a way that the hand follows a straight path with maximal smoothness. The extent, direction and duration of a movement constitute the main quantities that must be determined. An experiment in which volunteers performed different reaching movements was conducted.

It turns out that this predictor gives usable predictions for longer movements, while making no prediction at all is the better alternative for short movements. Finally, some suggestions for improving the predictor are presented.
Preface
We would like to thank everyone who participated in the experiment, helping us collect data, and Christian Smith for providing us with valuable information and support.
Contents

1 Introduction
  1.1 Previous work
  1.2 Goals
  1.3 Theory
    1.3.1 Minimum Jerk
    1.3.2 Kalman Filter
2 Method
  2.1 Problem
  2.2 Implementation
    2.2.1 Equipment
    2.2.2 Limitations
    2.2.3 Design considerations
    2.2.4 The Kalman filter
    2.2.5 Predictor
  2.3 Evaluation
    2.3.1 Experiment
  2.4 Analysis
    2.4.1 Good results
3 Results
4 Analysis
  4.1 The influence of previous experience
  4.2 Radial and angle errors
5 Discussion
  5.1 Future Work
  5.2 Conclusion
1 Introduction
In future generations of robots, close collaboration with humans will be a more common occurrence. When two humans interact, for example in a handover motion, the interaction has often begun long before the actual movement starts, through eye contact and body posture among other things.[9]
This all comes naturally to humans. If a robot is unable to mimic and interpret these signals, the interaction might feel unnatural and thus make humans less likely to proceed with the motion. For a robot to achieve more natural behavior in a handover motion, it must, among other things, be able to predict when a handover motion starts and where it will end.[11] This would enable the robot to start its motion and meet the human, making the interaction more human-like and possibly more efficient.
1.1 Previous work
The minimum-jerk model has been shown to describe human motion and can thus be used to predict future motions accurately.[16][13] Previous studies on human-robot interaction have shown that when a minimum-jerk motion is used as a basis for the robot, as opposed to a conventional trapezoidal one, reaction time is significantly shorter.[6] Another study suggests that in ordinary human-human interactions, about half of the giving motion has usually been carried out before the receiver reacts.[1]
Without knowing the end of a motion, fitting a minimum-jerk polynomial would involve two unknown parameters: the time and the position at which the movement ends. If the duration of a motion could instead be detected early, only one unknown parameter would remain to fit. Different ways of predicting the duration of a movement have been tried previously, such as using the acceleration or velocity profile of a minimum-jerk motion.[12][10]
1.2 Goals
This study focuses on predicting a reaching movement's final position early on using a relatively cheap sensor, the Kinect v.1, contrary to most previous studies, where more expensive motion capture equipment has been used. Since reaction time in human-human interactions seems to be up to about half the duration of the total motion, predictions up to the 50 % point of the completed movement are of interest, the goal being better performance than not using any prediction at all. The problem of finding the starting points of the motions will not be considered.
1.3 Theory
In order to make predictions, assumptions on how humans move have to be made. A widely used model is the minimum jerk model, which has been shown to capture certain properties of human reaching movements.[3] The Kalman filter[15] will also be central to the approach suggested in this study.
1.3.1 Minimum Jerk
One of the most prevalent mathematical models for predicting human movements is the minimum jerk model. For unconstrained point-to-point movements, the objective that human bodies seem to strive towards is to generate the smoothest motion that brings the hand from the initial position to the final position in a given time. One way to accomplish this is to minimize the mean-square of the jerk¹.[3][13]
J = \frac{1}{2} \int_{t_{\mathrm{start}}}^{t_{\mathrm{finish}}} \left( \frac{d^3 x}{dt^3} \right)^2 dt \qquad (1)
Under specific conditions the minimum jerk model can be simplified into a fifth degree polynomial, with the constraints being that the movement starts and ends with zero velocity and acceleration. Assuming this, the polynomial in equation (2) is acquired.
x(t) = x_0 + (x_0 - x_f)\left(15\tau^4 - 6\tau^5 - 10\tau^3\right) \qquad (2)

where x_0 is the initial position of the hand, x_f is the final position and \tau = t/t_f is the normalized time parameter. The position, velocity and acceleration profiles are shown in Figure (1).
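As an illustrative sketch (not code from the thesis), equation (2) and its time derivative can be evaluated numerically; the function names below are ours:

```python
import numpy as np

def min_jerk_position(t, x0, xf, tf):
    """Minimum-jerk position profile, eq. (2):
    x(t) = x0 + (x0 - xf)(15 tau^4 - 6 tau^5 - 10 tau^3)."""
    tau = t / tf
    return x0 + (x0 - xf) * (15 * tau**4 - 6 * tau**5 - 10 * tau**3)

def min_jerk_velocity(t, x0, xf, tf):
    """Time derivative of eq. (2); peak speed occurs at tau = 0.5."""
    tau = t / tf
    return (x0 - xf) / tf * (60 * tau**3 - 30 * tau**4 - 30 * tau**2)

# Profiles for a 1 m reach in 1 s, as in Figure (1).
t = np.linspace(0.0, 1.0, 101)
x = min_jerk_position(t, 0.0, 1.0, 1.0)
v = min_jerk_velocity(t, 0.0, 1.0, 1.0)
```

The boundary conditions can be checked directly: the trajectory starts at x_0, ends at x_f, and the velocity is zero at both ends, matching the constraints stated above.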
When the final position changes after a movement has begun, superposition of several minimum jerk motions has been suggested as a way to adapt the motion.[2] In this study, however, we only consider a single minimum jerk motion, not superpositions of several.
1.3.2 Kalman Filter
A Kalman filter is commonly used when the states of a linear model cannot accurately be calculated directly due to observation (measurement) noise.
Common applications include navigation, radar tracking and certain imple- mentations in econometrics. The classical Kalman filter, which is used in
¹ The time derivative of the acceleration.
Figure 1: Minimum jerk profiles. (a) Minimum jerk trajectory (displacement [m] vs. time [s]); (b) minimum jerk velocity profile [m/s]; (c) minimum jerk acceleration profile [m/s²].
this study, requires the model to be linear and the noise to be Gaussian in order to guarantee successful operation.[14]
To use a Kalman filter we need a state vector, a model and measurements.
The state vector contains all parameters needed to describe the movement and make a prediction, such as position, velocity and acceleration. The model describes how the system will behave as a function of the current state and the time step.
As mentioned before this is always a linear function of the states when using an ordinary Kalman filter.
\text{State vector:} \quad \mathbf{x}_t = \begin{bmatrix} x_{0,t} \\ x_{1,t} \\ \vdots \\ x_{n,t} \end{bmatrix}, \qquad \text{Model:} \quad \mathbf{x}_{t-1} \xrightarrow{\text{model}} \mathbf{x}_t \qquad (3)
Reality often differs quite a bit from the model, so an additional term is added to the state equation - the process noise v t , which is assumed to be Gaussian. This can all be expressed in vector form with F as the model matrix:
x_t = F x_{t-1} + v_{t-1} \qquad (4)
The measured data does not have to contain exactly the same parameters as the state vector. For example, predictions about a movement's velocity could be acquired using only position data. The extraction of relevant predictions from the state vector is done by multiplication with an observation model matrix H that maps from the state space to the observation space. Measurements are often accompanied by noise, which is assumed to be Gaussian, so a measurement noise term w_t is added to the equation:
z_t = H x_t + w_t \qquad (5)
The actual estimation is now done by taking a linear function of the predicted state and the difference between the actual measurement and the predicted measurement, weighted by the Kalman gain K:

x_{t,est} = x_t + K(z_{t,true} - z_t) \qquad (6)
A derivation of the Kalman gain will not be presented here, but the idea behind it is to look at estimations of the covariance matrices of the state and measurement predictions and use this to minimize the a posteriori error covariance. The estimated state covariance matrix P t tells us the variability of the state, and in the same way the estimated measurement covariance matrix S t tells us the variability of the measurements. Intuitively we want the Kalman gain to be small if the measurement noise is large, and the reverse if the process noise is large. The optimal Kalman gain[15] can be written as:
K = P_t H^T S_t^{-1} \qquad (7)
The idea behind a Kalman filter is to combine the predicted state and measurement with the actual sensor information, and thus get a better estimate. Only the previous estimated state and the current measurements are needed, making it ideal for use in a real-time scenario.[15, 5]
Each time new data is acquired, a cycle of two steps starts. First, during the prediction step, the predicted state and covariance are calculated. Then, during the update step, the Kalman gain is computed along with the estimates of the state and the covariance matrix.
Figure 2: Kalman filter cycle diagram.
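The two-step cycle above can be sketched as a single function. This is a generic textbook formulation of equations (4)-(7), and `kalman_step` is a hypothetical helper, not code from this study:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of the classical Kalman filter."""
    # Prediction step: propagate state and covariance through the model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update step: innovation covariance S, gain K = P H^T S^-1 (eq. (7)),
    # then correct the prediction with the measurement residual.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_est = x_pred + K @ (z - H @ x_pred)
    P_est = (np.eye(len(x)) - K @ H) @ P_pred
    return x_est, P_est
```

Repeated calls on incoming measurements drive the estimate towards the true state; as noted above, a large measurement noise R yields a small gain, and a large process noise Q the reverse.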
It is obvious that the model plays a huge role, but the requirement of linearity might seem limiting in applications involving real physical phenomena, since those are more often than not only approximately linear in some interval. Moreover, even with the linearity requirement removed, it is reasonable to assume that the model will not exactly describe the process in question. This problem, which is one common cause of filter divergence, is mitigated by including a model noise term. If the model linearisation errors are big, one could instead use the Extended Kalman filter.[14]
2 Method
In this section the problem and the proposed predictor are defined, as well as the analysis to be carried out. In addition, the question of what constitutes a good result is discussed.
2.1 Problem
One wishes to construct a causal predictor P delivering final position predictions \hat{x}_f = P(y_t, y_{t-1}, \ldots, y_0) such that the error ε, defined in (8), is minimized early on, where x_f is the true final position and y_{t-n} = y_{t-n}(x, y, z) is the measured data from n time steps before. Moreover, it should be able to operate successfully with data of lower quality than that of traditional motion capture laboratories. Final values of quantities are denoted with subscript f, while predictions are marked with a hat; hence the prediction of the final position is written \hat{x}_f.
\varepsilon := |x_f - \hat{x}_f| \qquad (8)
2.2 Implementation
Here the implementation and methods of the proposed predictor are described in greater detail.
2.2.1 Equipment
A Microsoft Kinect [7] with a frame rate of 30 fps was used together with Kinect for Windows SDK 1.7 for skeleton tracking. The joint positions reported by the development tool kit were recorded in a Cartesian coordinate system defined by the Kinect. Only the right hand joint data was saved and used in this study.
2.2.2 Limitations
A simple start detection algorithm based on edge detection using exponential moving averages was developed. However, for this study it was assumed that start points could be found consistently. That is, starting points acquired using edge detection that were not consistent with those of other movements were tuned by hand.
2.2.3 Design considerations
Since the main focus of this study was good performance during roughly the first 50 % of the movement, little work was put into performance during the later parts of a movement.
2.2.4 The Kalman filter
Because of the somewhat noisy data acquired from the Kinect, a Kalman filter was used to track motions more reliably. The state vector was defined as
\mathbf{x}_t = \begin{bmatrix} x_t \\ \dot{x}_t \\ \ddot{x}_t \end{bmatrix} \qquad (9)
where x_t is the position, \dot{x}_t the velocity and \ddot{x}_t the acceleration. Using the Wiener-process acceleration model[8], where the acceleration is assumed to be nearly constant, the state equation is
\mathbf{x}_t = F \mathbf{x}_{t-1} + v_{t-1}, \qquad F = \begin{bmatrix} 1 & T & T^2/2 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{bmatrix} \qquad (10)
where T is the time between the acquisition of two data points (in this case 33 ms because of the Kinect's 30 FPS) and v_{t-1} is the process noise. Since the Kinect only measures position data, the measurement equation is
z_t = H \mathbf{x}_t + w_t, \qquad H = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \qquad (11)
where w t is the measurement noise. As mentioned in [8], the process noise error covariance Q is set to be:
Q = \begin{bmatrix} T^5/20 & T^4/8 & T^3/6 \\ T^4/8 & T^3/3 & T^2/2 \\ T^3/6 & T^2/2 & T \end{bmatrix} \qquad (12)
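As a sketch, the matrices of equations (10)-(12) can be built directly for the Kinect's 30 FPS sampling; the variable names are ours, and Q is written here up to a scalar noise-intensity factor:

```python
import numpy as np

T = 1.0 / 30.0  # Kinect v1 frame time, ~33 ms

# State transition for the Wiener-process acceleration model, eq. (10).
F = np.array([[1.0, T,   T**2 / 2],
              [0.0, 1.0, T],
              [0.0, 0.0, 1.0]])

# Only position is measured, eq. (11).
H = np.array([[1.0, 0.0, 0.0]])

# Process noise covariance, eq. (12).
Q = np.array([[T**5 / 20, T**4 / 8, T**3 / 6],
              [T**4 / 8,  T**3 / 3, T**2 / 2],
              [T**3 / 6,  T**2 / 2, T]])
```

A quick sanity check: propagating the state (0, 0, 1) (unit acceleration, at rest) one step through F gives position T²/2 and velocity T, as constant-acceleration kinematics requires.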
2.2.5 Predictor
Simply performing a least squares fit of the minimum jerk polynomial (2) to the data in each of the x, y, z coordinates turns out to be problematic for a couple of reasons. Mainly, the number of data points is potentially very low because of the desire to make an early prediction, making the fit of a fifth degree polynomial with two unknowns very sensitive to measurement errors.
In order to combat this, it is suggested to instead look at the time derivative of this polynomial, (15), and determine one of the unknowns, t_f, first.[13]
While the raw Cartesian data from the Kinect is kept for direction predictions, the first step is a conversion to a spherical coordinate system with the origin set to the (Cartesian) start position of the movement.
\mathbf{x} = (x, y, z) \rightarrow \mathbf{r} = (r, \phi, \theta) \qquad (13)

Using the spherical representation, the radial component r is fed into the Kalman filter, from which the most likely velocity profile belonging to the data points is obtained.

r_t \xrightarrow{\text{Kalman}} v_t \qquad (14)
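A minimal sketch of the coordinate conversion in equation (13). The angle convention below (phi as azimuth, theta as polar angle) is our assumption, since the thesis does not spell it out:

```python
import numpy as np

def to_spherical(p, origin):
    """Convert a Cartesian point to (r, phi, theta), with the origin
    moved to the movement's start position as in eq. (13).
    Convention (an assumption): phi = azimuth, theta = polar angle."""
    d = np.asarray(p, dtype=float) - np.asarray(origin, dtype=float)
    r = np.linalg.norm(d)
    phi = np.arctan2(d[1], d[0])
    theta = np.arccos(d[2] / r) if r > 0 else 0.0
    return r, phi, theta
```

The radial component r returned here is what gets fed into the Kalman filter in equation (14).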
Finding the time it takes to perform the movement is now done using this velocity data, since the start of a movement is more discernible in velocity space. Fitting the minimum jerk time derivative, equation (15), using Matlab's least squares method gives a guess \hat{t}_f of the motion's duration.
\dot{x}(t) = \frac{x_0 - x_f}{t_f}\left(60\tau^3 - 30\tau^4 - 30\tau^2\right) \qquad (15)

With this duration guess, an ordinary minimum jerk polynomial, equation (2), with "known" time \hat{t}_f could be fitted to the unfiltered radial position data r_t. That is, the extent of the true movement r_f was predicted this way, giving \hat{r}_f. See Figure (3) for illustrative predictions of a movement.
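The duration fit can be sketched on synthetic data. The thesis uses Matlab's least squares on equation (15); the stand-in below exploits that the model is linear in the amplitude (x_0 - x_f), so for each candidate t_f the best amplitude has a closed form and a simple grid search over t_f finds the least-squares duration. All names and numbers here are ours:

```python
import numpy as np

def mj_velocity_shape(t, tf):
    """Shape of the minimum-jerk velocity profile, eq. (15),
    without the amplitude factor (x0 - xf)."""
    tau = t / tf
    return (60 * tau**3 - 30 * tau**4 - 30 * tau**2) / tf

def fit_duration(t, v, tf_grid):
    """Least-squares duration guess t_f. For each candidate t_f the
    optimal amplitude is solved in closed form; the candidate with the
    smallest residual wins."""
    best_tf, best_res = None, np.inf
    for tf in tf_grid:
        s = mj_velocity_shape(t, tf)
        amp = (s @ v) / (s @ s)          # closed-form linear LS amplitude
        res = np.sum((v - amp * s) ** 2)
        if res < best_res:
            best_tf, best_res = tf, res
    return best_tf

# Hypothetical early data: first 40 % of a 1.0 s minimum-jerk reach.
t_data = np.linspace(0.01, 0.4, 12)
v_data = -0.4 * mj_velocity_shape(t_data, 1.0)   # amp = x0 - xf = -0.4
tf_hat = fit_duration(t_data, v_data, np.arange(0.5, 2.01, 0.01))
```

On noiseless partial data the true duration is recovered exactly; with Kinect noise, the Kalman-filtered velocity profile takes the place of `v_data`.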
However, knowing \hat{r}_f and the origin is not enough to uniquely identify a point in a spherical coordinate system; the direction (\phi, \theta) is required as well. This was done by treating all the raw Cartesian data points belonging to a reaching movement as a point cloud, to which a best-fit line was computed using Matlab's least squares method. The direction of this line was taken as the direction guess \hat{d}_f. Note that this was calculated in Cartesian coordinates.
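The line fit can be sketched as follows. The thesis uses Matlab's least squares; the SVD-based fit below is an equivalent stand-in (the first principal direction of the centered point cloud is the orthogonal least-squares line direction), and the helper name is ours:

```python
import numpy as np

def direction_guess(points):
    """Fit a best line to a Cartesian point cloud and return its unit
    direction, oriented from the first point towards the last."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # First right-singular vector = direction of greatest spread.
    _, _, vt = np.linalg.svd(centered)
    d = vt[0]
    if d @ (pts[-1] - pts[0]) < 0:   # orient along the travel direction
        d = -d
    return d / np.linalg.norm(d)
```

In the predictor, this direction combined with the extent guess \hat{r}_f and the start position pins down the predicted final point.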
Every time new data is acquired, these steps are carried out anew by the predictor P.
2.3 Evaluation
Here the experiment conducted is explained, followed by a short discussion of the analysis and of what constitutes good results.
Figure 3: Predictions. (a) Predictions on a movement (displacement [m] vs. time [s]); (b) predictions on a Kalman filtered velocity profile ([m/s] vs. time [s]).
2.3.1 Experiment
An experiment in which people were asked to perform a simple reaching movement to different points in space was conducted. The setup consisted of a table with eight small targets at two different heights (15 cm, 40 cm) in five directions with respect to the subject's reaching arm, and a marker indicating the start position (height 30 cm). Five persons plus the authors of this paper performed five motions in sequence towards each of the targets. A Kinect (v.1) placed approximately 2 meters away at an angle of about 45° was used to capture the data.
A target consisted of a stick with a paper pulp ball in one end and a block used for attachment to a table in the other.
Each subject was instructed to perform a natural reaching movement from the start marker to, or as close as possible to, the target marker.
2.4 Analysis
The analysis is mainly carried out as a comparison between the proposed predictor and the naive approach of taking the latest measured position as the final position guess: \hat{x}_{f,naive} = y_t. The naive error is defined as \varepsilon_{naive} = |x_f - \hat{x}_{f,naive}|, analogous to the prediction error ε.
A separation of the radial (extent) guess and the direction guess, for the long and short movements respectively, will be made in separate figures, in order to draw more precise conclusions on the predictor's performance. The radial error is defined in a similar way to the total error ε: simply subtract the extent (radial) guess \hat{r}_f from the true value r_f, leading to the radial error defined in equation (16).

Figure 4: Final target placement.

Figure 5: Experiment setup illustration.

Figure 6: Setup idea, approximates the final experiment.
\rho = |r_f - \hat{r}_f| \qquad (16)
The direction guess \hat{d}_f = \hat{d}_f(x, y, z) is compared to the true direction d_f = d_f(x, y, z) using the planar angular error α, which follows from geometrical properties, leading to the expression in equation (17).

\alpha = \arctan\left( \frac{|\hat{d}_f \times d_f|}{\hat{d}_f \cdot d_f} \right) \qquad (17)
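Equation (17) can be computed directly. The helper name below is ours, and `arctan2` stands in for the plain arctan so the angle stays well defined when the directions are perpendicular or more than 90° apart:

```python
import numpy as np

def angular_error(d_hat, d_true):
    """Planar angular error alpha of eq. (17):
    alpha = arctan(|d_hat x d_true| / (d_hat . d_true))."""
    d_hat = np.asarray(d_hat, dtype=float)
    d_true = np.asarray(d_true, dtype=float)
    cross = np.linalg.norm(np.cross(d_hat, d_true))
    dot = d_hat @ d_true
    # arctan2 keeps the angle in [0, pi] even when the dot product
    # is zero or negative.
    return np.arctan2(cross, dot)
```

Since both the cross-product magnitude and the dot product scale with the vector lengths, the result is independent of whether the direction guesses are normalized.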
2.4.1 Good results
In a human-robot interaction, as long as ε is relatively small the human can always adapt its motion to suit the robot better. In this study, ε is the prediction error early on during the movement, so what is considered "small" depends on the situation. During a normal handover motion, for example handing over a tool in an industrial setting, an error of about 15 cm should be acceptable.
3 Results
The most obviously interesting measure is how the error ε in equation (8) depends on the proportion of the movement captured. Figure (7) shows both this error and the naive error ε_naive for each of the movements. The values up to 50 % are the most relevant to this study because of the characteristics of human-human handover motions, as mentioned earlier.

A natural second question is how the different parts of the guess, mainly the extent \hat{r}_f and the direction (\hat{\phi}, \hat{\theta}), contribute to the error. Figures (8) and (9) break up the total error into these components for the long and short movements respectively.

When making a predictor, certain assumptions about the properties of a movement have to be made. Does knowledge of the underlying assumptions, or experience with the predictor, affect the performance? Figure (10) attempts to illustrate this by comparing the recruited volunteers in the experiment to the authors of this paper performing the same experiment.

Table (1) attempts to give a quick overview of the performance by indicating the fraction of predictions with an error ε less than a certain amount, as a function of the part of the movement recorded. That is, the percentage of movements with ε < r, where r is the radius of a sphere centred at the final position.
Table 1: Percentage of position predictions after a given time within a sphere of given radius centred around the goal.

                    % of movement done
radius [cm]      10      20      30      40      50
      5           0    22.4    27.3    26.8    27.8
     10        10.2    47.8    56.4    52.1    50.6
     15        33.9    61.2    76.4    71.8    69.6
     20        57.6    82.1    87.3    84.5    88.6
     25        74.6    89.6    92.7    93.0   100.0
Figure 7: The absolute error of predictions in 3D space [m] plotted against the percentage of a movement done at the time of the prediction (panels (a)-(h): movements 1-8). The naive prediction is no prediction at all, i.e. to suppose that the latest measured position is final. Movement n corresponds to a motion from the start position to position n according to Figure (5). The errors ε are defined as in equation (8).
Figure 8: Errors for different components of the predictions on long movements, plotted against the percentage of the movement done: (a) radial error [m], (b) angular error [degrees], (c) total error [m]. The radial error ρ is defined in eq. (16), the angular α in eq. (17) and the total in eq. (8).
[Figure: (a) radial error [m], (b) angular error [degrees] and (c) total error [m] plotted against the percentage of the movement done, for the short movements.]