>> Phil Chou: All right. Welcome, everyone. Thanks for coming.
Today Arrigo and I have the distinct pleasure of hosting Ivana Tošic
from Ricoh Innovations in Menlo Park, California. Ivana got her
Ph.D., I think, around 2009 from EPFL. Spent a few years after that at
Berkeley at the Redwood Center for Theoretical Neuroscience and then
after leaving there went to Ricoh. So we're very excited to hear what
she's got to tell us, thanks.
>> Ivana Tošic: Thank you, Phil. Thank you for the introduction and
for the invitation to come here. It's a great pleasure; it's my first
time here at Microsoft Research. So I am currently with Ricoh
Innovations, but everything I'm going to talk about today I did during
my post-doc at the Redwood Center. So I'm going to talk about dictionary
learning for 3-D scene representation, and I will explain during the
talk what dictionary learning is.
So basically the whole idea and the whole goal of this particular work,
and of my research, is: what are the best representations for 3-D
scenes, how do we acquire them, and how do we build those
representations from the images and information that we acquire.
So this is joint work with Bruno Olshausen, who was my post-doc advisor
at U.C. Berkeley; Jack Culpepper, a Ph.D. student at U.C. Berkeley who
is now with IQ Engines, also in Berkeley; and Sarah Drewes, who was a
post-doc in the math department at Berkeley and is now with Deutsche
Telekom in Germany.
So before going into 3-D representations, I want to give you a little
bit of motivation and step back to see how we actually capture the
information about 3-D scenes.
There are many ways to capture it, and here I show the two most
commonly used. One is a network of cameras capturing a 3-D scene, so we
can have multiple views of that scene from different angles.
Another way is to use hybrid image and depth sensors. So this is a
time-of-flight camera from PMD, a German company, and it captures depth
information. I will usually show depth images, depth maps, in color,
where closer is red and further is blue.
And it also gives an intensity map of the object. Or Microsoft Kinect,
where you have the depth map and the color image that are
co-registered.
So in both ways of capturing, we have rich 3-D visual information. And
it has many applications in 3-D television, surveillance, robotics,
exploration, just to name a few.
But what is worth pointing out is that depth or disparity is central to
both these approaches to capturing 3-D scenes. So here the depth is
contained implicitly, just through the parallax between different
views. And here we have explicit measurements of depth, like the
distance from the sensor.
And we need that depth information if we want to interpolate between
the distant views. With current technologies we cannot have views from
every single point in space, so we just sample that space of camera
positions and then interpolate and extrapolate in between.
But there are problems with depth, and the main problem is that its
measurements are unreliable, so we usually get very noisy data. Here
I'm showing you an example from a time-of-flight camera; this is the
intensity, the reflectance image, and this is the depth image. And if
you zoom in, you can see that there is a lot of salt-and-pepper type
noise. This is much different than the noise we see in regular images.
Or we can look at laser range scanners; this is a tree in front of a
pillar, and you can see that there are errors in acquisition around the
boundaries of the objects. And this is from structured light sensors,
where you have missing pixels due to occlusions between the two views
of the scene.
So the bottom line is that for different types of depth sensors, we
need algorithms to do denoising, to remove the noise, or inpainting, to
recover the missing information. These are ill-posed inverse problems,
and to solve them we need some prior information about the data we are
reconstructing, in this case depth.
And to solve this, we actually need good representations. The idea of
using representations to solve inverse problems is not new. It's been
used for decades in image processing.
Various bases have been used to represent images, such as wavelets, and
in the last decade we've seen a lot of people using overcomplete
dictionaries, where you have a much larger number of elements in your
basis than the dimensionality of your signal.
And here I'm just showing an example of a dictionary that's been
learned from a database of images. The elements of the dictionary,
which we usually call atoms, if you vectorize them and put them in a
large matrix, give you a big fat matrix, and to reconstruct the signal,
here I'm just showing an image patch from Lena, you need to multiply it
by a tall vector. And if you want to do this really efficiently, you
can say, well, I want this vector to be sparse, to have some small
number of non-zero components.
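In symbols, the sparse model being described can be sketched as follows
(the notation here is generic, not taken from the slides):

```latex
f \approx D\,a, \qquad D \in \mathbb{R}^{N \times K},\; K \gg N, \qquad \|a\|_0 \ll K .
```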
Now let's go back to our problem of noisy images. What we will observe
is not exactly this signal, but something with added noise. So then the
denoising problem, if we are using representations, becomes estimating
this vector of coefficients, a, to reconstruct our original signal.
And in denoising, people usually assume for images either stationary
Gaussian noise or a Poisson distribution of the noise. For the Gaussian
it's usually considered to be the same distribution, the same variance
of the Gaussian, throughout the whole image, and for Poisson it's
linked to the intensity of the image.
So these are pretty well known models for images. But what about depth?
In depth we have seen that the noise is usually spatially varying. So
it's what I will call nonstationary, not nonstationary in time, but
nonstationary across the image. We could also add the time dimension,
but for this talk I'm going to stick to still images, still depth maps.
But a bigger difference is that the statistics of depth maps differ
from the statistics of images. Just by looking at a depth map you can
see there's much less texture than we have in images, and we have many
more sharp and oriented edges.
So let's see: if we do denoising with wavelet thresholding, the
simplest representation, we get a lot of ringing effects around the
edges. Or we can say, fine, we know that orthogonal wavelets are not
optimal, so we can use transforms that better represent boundaries and
get better results. Or we can say we're not going to use
representations at all, we're just going to use some kind of filter,
like non-local means, and this is the non-local means denoising result.
And --
>>: So that is the depth from --
>> Ivana Tošic: Yes, this is done only on depth. You can see this is a
great result, but if you zoom in for details you can see it lost some
details here. This is a part of my hair, and it smoothed out a lot of
the information in the depth map.
So we looked at this problem and said, well, let's see if we can learn
sparse representations of depth. So not just start from representations
that exist for images, like wavelets or curvelets, but learn them from
the data.
So in the rest of the talk, I will first go briefly through sparse
representations, just the background. Please interrupt me at any time
if you want clarifications on anything; I assume some people have
already heard this. Has anyone heard about sparse representations? Is
there someone who is not familiar?
Okay. Good. Then we can go quickly through that. The two parts of the
talk will be learning sparse representations for depth only, and then
the second, newer work, which is learning representations jointly on
images and depth, so intensity and depth. For both of these I will show
you how we model the data, how we learn the dictionaries, and what
results we get on inverse problems.
Okay. So sparse representations are also linear representations, like
transform coding with wavelets and curvelets, but the difference now is
that we have this big dictionary, which is overcomplete: it has a much
larger number of elements, basis functions or atoms, than the dimension
of the signal.
So our coefficient vector is long, and reconstruction becomes a
combinatorial problem. It's not easy to find these coefficients, and we
have an infinite solution set.
So what people have proposed is to look for a sparse solution, and
there have been some suboptimal algorithms like matching pursuit or
basis pursuit denoising.
I will just quickly explain basis pursuit denoising on the next slide,
because that's the type of algorithm that we use. And then another,
bigger problem is: well, great, we can get the coefficients, but how do
we find this big dictionary? Nobody's going to give it to us.
For that, people have developed dictionary learning, also called sparse
coding methods: from given data, learn what is the best dictionary for
that data.
Okay. So this is just one slide on how we find the sparse solutions
once we are given a dictionary, and then we'll go into the dictionary
learning problem.
If we want to find the sparsest solution to our reconstruction problem,
basically we want to minimize the L0 norm of our coefficient vector, so
that it has the smallest number of nonzero entries, under a quadratic
constraint that the approximation error is bounded. And this is
NP-hard.
So what people then proposed is: let's relax the L0 norm to an L1 norm
and have that as our objective function, and keep our quadratic
constraint, which is convex. It can also be formulated as an
unconstrained problem.
And the advantage, of course, is that this is convex and easy to
optimize. That is the basis pursuit denoising proposed by Chen, Donoho
and Saunders in 1999.
So great. Once we have the dictionary, we know how to find a vector of
coefficients. It's not necessarily the same solution as with the L0
norm, but in most cases it's a pretty good approximation.
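As a rough illustration of the basis pursuit denoising idea just
described, here is a minimal sketch in Python, using ISTA to solve the
unconstrained (lasso-style) form; the dictionary, patch and lambda are
toy placeholders, not anything from the talk:

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, f, lam, n_iters=200):
    """Approximately minimize 0.5*||f - D a||^2 + lam*||a||_1 over a."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = D.T @ (D @ a - f)           # gradient of the quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy usage: a random overcomplete dictionary and a noisy patch.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
f = D[:, :5] @ rng.standard_normal(5) + 0.05 * rng.standard_normal(64)
a_hat = ista(D, f, lam=0.1)
print("nonzero coefficients:", np.count_nonzero(np.abs(a_hat) > 1e-6))
```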
Then, of course, you wonder: why would we use sparsity? It turns out
that this is a really good prior for solving inverse problems, such as
denoising and inpainting, because it gives a good generative model of
the images.
And I'm just showing here an example, again on images, but I'll show
you depth maps in just a second. This is denoised using a sparse
representation in translation-invariant wavelet frames. By adding
overcompleteness to our representation we can get better results than,
for example, using plain wavelet thresholding. A simple example showing
that overcompleteness already gives us something, and sparsity as well.
But what we can also do, and it's really important, is that we can
adapt this dictionary to the signal statistics. So we don't have to use
these translation-invariant wavelet frames. We can learn our own. And
this is --
>>: You're saying that the noise, the distribution of the noise, while
you're learning the dictionary?
>> Ivana Tošic: At this point I'm just telling you the background,
which is learning the signal model, the statistics of the signal. Let's
say you have no noise in your signal, or the noise is subsumed in the
approximation error. So you say: I'm going to find a signal model, up
to this approximation error, that well represents my signal.
I will show you the depth case in just a sec. So people have previously
proposed dictionary learning, or sparse coding, and it was first
proposed by Olshausen and Field in '97. This is maximum likelihood
learning, where they said, well, we'll have a linear image model, and
for the learning we're going to maximize the log likelihood that
natural images arise from the model above, given the dictionary.
So they posed this as an optimization problem: maximize the log
likelihood over the set of dictionaries.
And they used images for training from the van Hateren database of
natural images, and, whoops, this is supposed to be -- okay. We have a
dictionary -- but before I show you what they obtained as the
dictionary, I'll show you how they actually solved this problem.
Because now they needed a probabilistic model of the signal.
So they decomposed the likelihood into the conditional probability of
the images given the dictionary and the coefficients, times the prior
on the coefficients. This prior on the coefficients is important
because that's where we put the sparsity assumption.
So you put this prior to be a kurtotic, heavy-tailed distribution, and
this one is just a Gaussian distribution if the noise is assumed to be
Gaussian. It turns out that this optimization problem can be cast
equivalently as minimizing an energy of this form: we have the
quadratic approximation error, plus lambda times the L1 norm of the
coefficients.
So you can see this is exactly the same objective as for basis pursuit
denoising, except that here the unknowns are both the dictionary and
the coefficients.
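Written out, the energy being referred to is the usual sparse coding
objective; this is a generic rendering in my notation, with both the
coefficients a and the dictionary D unknown:

```latex
E(a, D) = \|f - D\,a\|_2^2 + \lambda \|a\|_1 .
```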
So to solve it, they proposed a two-step alternating optimization. They
have a big set of natural image patches and they initialize the
dictionary randomly. In the first step they fix that dictionary and
minimize the objective over the sparse coefficients: basis pursuit
denoising again. Then in the next step, the learning step, they fix the
sparse coefficients and minimize the energy over the dictionary. So
that's the learning step.
Basically you can do simple gradient descent: you take gradient steps
in the coefficients and in the dictionary, alternating, until they
converge.
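A minimal sketch of that alternating scheme in Python might look like
the following; the solver choice (ISTA for the sparse coding step), the
step sizes, and the toy dataset are illustrative assumptions, not the
original implementation:

```python
import numpy as np

def sparse_code(D, F, lam, n_iters=100):
    """Inference step: ISTA over a batch of patches F (one patch per column),
    approximately minimizing 0.5*||F - D A||^2 + lam*||A||_1 over A."""
    L = np.linalg.norm(D, 2) ** 2                  # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], F.shape[1]))
    for _ in range(n_iters):
        Z = A - D.T @ (D @ A - F) / L              # gradient step
        A = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft threshold
    return A

def learn_dictionary(F, n_atoms=256, lam=0.1, n_epochs=20, lr=1e-2):
    """Alternate between inferring coefficients (D fixed) and a gradient step on D (A fixed)."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((F.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                 # start from random unit-norm atoms
    for _ in range(n_epochs):
        A = sparse_code(D, F, lam)                 # inference step
        D += lr * (F - D @ A) @ A.T                # learning step: gradient of the energy w.r.t. D
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalize atoms
    return D

# Toy usage with random "patches"; real training would use natural image patches.
F = np.random.default_rng(1).standard_normal((256, 500))   # 16x16 patches, vectorized
D_learned = learn_dictionary(F)
```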
And starting from a completely random dictionary, they learned a
dictionary that has oriented band pass filters that are Gabor-like. So
they look a lot like Gabor wavelets and this was just learned from the
statistics of the data.
This was actually an important result in computational neuroscience,
because they showed that this sparse coding could also be [inaudible]
in encoding information in the brain, because the emerged dictionary
looked a lot like the receptive fields of neurons in the primary visual
cortex [inaudible]. Not just the fact that they're band-pass and
Gabor-like, but also their distribution in orientation and in
frequency.
Okay. So that's the background. Do you have any questions so far,
because from now on I'm just going to build upon this -- okay. So then
we looked into how to learn these dictionaries for depth, and we had a
challenge: how can we learn robustly in the presence of this spatially
varying noise? We have to learn from noisy data, and the noise is no
longer the stationary Gaussian noise that was assumed previously in
dictionary learning.
So what we propose is to add another noise term. We had the linear
model, dictionary times coefficients, plus the approximation error, and
we can keep that stationary.
But we have another type of error, which is due to the acquisition
device, and we modeled it as a multivariate Gaussian. We wanted to be
general here and say, well, let's assume that each pixel has Gaussian
noise, but the variance of that noise can vary across the image. Some
pixels can be more noisy, some pixels can be less noisy.
And we're going to infer those statistics while also inferring the
sparse coefficients. So we assumed multivariate Gaussian noise, and
this is its covariance matrix, and we assume that the noise is not
correlated between pixels. That's one of the limitations of the model.
So how does that change our graphical model for the images? We still
have the sparse coefficients, we have the dictionary, and then we have
different noise variables added to each pixel in the depth map.
This is just to highlight the differences with respect to the basic
dictionary learning, sparse coding method. These are the new variables,
the new noise, that we have added just to be able to deal with the
different noise in depth maps.
Okay. How does that change our learning objective? We can still use
maximum likelihood. Now when we break down our likelihood, we have
three terms. This term is the Gaussian distribution of our
approximation noise. So this is fine.
Then we have the prior on our coefficients, which we can take to be
Laplacian, and a prior on our covariance matrix, a prior on our noise.
For the prior on the noise, we decided to use a noninformative Jeffreys
prior, so as not to impose any structure on the noise and to stay as
general as we can.
So the difference with respect to the regular objective in dictionary
learning is that we have another set of variables that we need to
infer, and we have one other distribution and one other set of
parameters here. The energy function is also slightly modified. Here we
have the sparsity term, which remains the same, but in this term we
have the log of the variance, which is due to our Jeffreys prior, and
the error at each pixel is divided by two times the variance of the
noise of that pixel. So these are all unknowns that we have to find.
So it's a pretty simple formulation, and it can also be modified if you
know some specifics of your noise variance; if you know how it's
related to the signal, you can put that in here.
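A rough written-out sketch of the modified energy being described (my
notation; the per-pixel variances sigma_i are the extra unknowns, and
the way the constant sigma_0 enters follows the discussion later in the
talk):

```latex
E(a, \sigma) \;=\; \sum_i \left[ \frac{\bigl(f_i - (D a)_i\bigr)^2}{2\,\hat{\sigma}_i^{\,2}} \;+\; \log \hat{\sigma}_i \right] \;+\; \lambda \|a\|_1,
\qquad \hat{\sigma}_i^{\,2} = \sigma_i^{\,2} + \sigma_0^{\,2} .
```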
>>: Seems like there's a couple of common models that I've seen for
error variance in depth images. One is that it varies with the square
of the depth. And the other is more along the lines of a Laplacian
model, where the variance is related to the brightness.
>> Ivana Tošic:
Yes.
>>: But this doesn't seem to be motivated really by either of those.
>> Ivana Tošic: No. So we wanted to keep it general, as general as we
can. You can adapt it to take more specific relations into account, but
we just wanted to keep it as broad as possible, so it's not tied just
to depth and you can use it in other scenarios, too. But if you have
that kind of deterministic relation, you can easily put it in, and it's
going to simplify finding your variance.
If it's related deterministically, you don't have to infer it anymore.
You can just relate it to the signal.
>>: So if it's proportional to, say, F, I guess, that is the -- is the
depth --
>> Ivana Tošic: Uh-huh. You can still use the same objective, it's just
that this might then become harder to optimize.
>>: [inaudible].
>> Ivana Tošic: Yes, uh-huh. And you can also change the prior. It
turns out we also wanted to infer this so we know the reliability, how
reliable each depth estimate is.
So basically we wanted to see, for each measurement of the depth, how
well it fits the signal model. So it's not necessarily always linked to
the acquisition device; it could also be occlusions or some other
noise.
>>: The lambda -- some Lagrange multiplier.
>> Ivana Tošic:
Yes.
>>: Can it be spatially varying as well? Let's say I have my depth map
and I know exactly where the bad measurements are, so can that be taken
into account? If you make the lambda spatially varying as well, does it
make solving the convex problem harder?
>> Ivana Tošic: So what you can do is use solvers that actually find
your lambda while finding your solution, for each patch. So it finds
the best lambda for each patch. You can already use a solver that deals
with that problem. Or if you know how to estimate lambda yourself, then
you can put it in and --
>>: Spatially varying does not --
>> Ivana Tošic: Well, you cannot vary it here for each pixel. You can
vary it for each patch in the image.
>>: For the prior [indiscernible].
>> Ivana Tošic: Yes, it might be possible to do it per pixel. I think
there's some work on that, but I can't really remember right now. But
yes -- oh, you can vary it per coefficient. Yes, in the work I've seen,
and I can point you to the reference, you can have a different lambda
for different coefficients. They call it kind of like the scale model.
That's kind of different from what you were saying.
>>: It seems like your model assumes that the noise [indiscernible] in
the pixel [indiscernible], so if it's the same pixel, the same
[indiscernible].
>> Ivana Tošic: So right now I'm just doing static depth maps. I'm
not -- it's not video depth maps.
>>: [indiscernible].
>> Ivana Tošic: Just static, still depth maps. If you do it frame by
frame, you can just find a different variance for each of the pixels in
different frames, or you can constrain it to change smoothly from one
frame to another. But the point is that it's different per pixel.
So, now how do we solve this, how do we learn the dictionary? We have
two parts. Again, we have inference first, so we need to find all the
coefficients and we need to find the noise variance for each pixel.
And we solve that in an iterative manner. We first initialize our
sigmas to be equal and a very large value. So we start from the
hypothesis that all the pixels are unreliable, and then we infer the
coefficients; then we fix the coefficients and solve for the variances,
and this is a closed-form solution.
So usually it takes just a couple of iterations between these two to
find the solution for the a's and the sigmas.
And then once we have that, we solve for the dictionary. And this is
just to show that the dictionary update is averaged over all the
examples in our patch dataset.
So it's a similar principle: we have inference and learning, and we
iterate. It's just that here, in the inference step, we also find the
variance of the noise per pixel.
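A minimal Python sketch of this inference step for a single patch,
alternating a weighted sparse coding step with the closed-form variance
update; the weighting, the solver, and the constants are my assumptions
for illustration, not the authors' code:

```python
import numpy as np

def weighted_ista(D, f, w, lam, n_iters=200):
    """Sparse coding with per-pixel precisions w_i = 1 / sigma_hat_i^2:
    approximately minimize 0.5 * sum_i w_i * (f_i - (D a)_i)^2 + lam * ||a||_1."""
    W = np.diag(w)
    L = np.linalg.norm(np.sqrt(W) @ D, 2) ** 2     # Lipschitz constant of the weighted quadratic
    a = np.zeros(D.shape[1])
    for _ in range(n_iters):
        z = a - D.T @ (W @ (D @ a - f)) / L        # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def infer(D, f, sigma0=0.05, lam=0.1, n_outer=3, sigma_init=10.0):
    """Alternate between coefficients and per-pixel variances for one patch."""
    sigma2 = np.full(f.shape, sigma_init ** 2)     # start: assume every pixel is unreliable
    for _ in range(n_outer):
        w = 1.0 / (sigma2 + sigma0 ** 2)           # effective per-pixel precision
        a = weighted_ista(D, f, w, lam)            # infer coefficients with sigmas fixed
        resid2 = (f - D @ a) ** 2
        sigma2 = np.maximum(resid2 - sigma0 ** 2, 0.0)  # closed-form update, floored at zero
    return a, sigma2
```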
Okay.
>>: So sigma I squared in that equation, it goes to 0 if the top --
minus [indiscernible]?
>> Ivana Tošic:
Yes.
>>: [indiscernible].
>> Ivana Tošic: No, because we have the sigma zero term which is added
to it.
>>: Sigma I -- so it's zero.
>> Ivana Tošic: Yes, good point. So basically this sigma I hat, we put
it here just to be simpler: sigma I hat squared is sigma I squared plus
sigma zero squared, which is the approximation error variance. Sigma
zero is constant; it's just for ease of presentation. Basically this
sigma I hat squared is sigma I squared plus sigma zero squared. That's
why it's here. And we don't let it go negative: if it goes negative, it
becomes zero.
>>: I minus [indiscernible] then you are ignoring basically the
difference in depth? [indiscernible].
>> Ivana Tošic:
Sorry?
Can you repeat that?
>>: When the difference between F I and F I hat is big, basically the
denominator becomes --
>> Ivana Tošic: Large.
>>: The same as the top.
>> Ivana Tošic:
Yes.
>>: So you're just ignoring that?
>> Ivana Tošic: Exactly. So if the measurement doesn't really fit the
model and is just isolated, then it reduces that term and says, well,
this is not a reliable measurement, because it doesn't fit our model
well.
>>: Assume you only have visually here -- so here -- I thought
[indiscernible] also a variable.
>> Ivana Tošic:
Yes.
>>: That's excellent.
>> Ivana Tošic: That's the learning, yes. So here we fix the
dictionary, we start from random, and then here we learn it; we
optimize over the dictionary.
>>: So there's a lot of variables here. You have F, and basically
[indiscernible] F, and A [indiscernible], and also [indiscernible], to
determine the model.
>> Ivana Tošic:
Yes, it's hard.
>>:
>> Ivana Tošic:
Yes, exactly.
>>: If it's converging, is it because of the convexity?
>> Ivana Tošic: So each step, because the objective is convex -- each
step is convex, but not necessarily the whole objective, because it has
not --
>>: [indiscernible].
>> Ivana Tošic: I have never gotten a nonconvergent dictionary, unless
you have a really small number of training samples.
>>: Are you learning one dictionary per image?
>> Ivana Tošic:
No, one dictionary for a database of images.
>>: Okay, because I thought you were learning this for one unit.
>> Ivana Tošic:
No.
>>: Because you assume that -- as before, the noise model, per pixel,
[indiscernible].
>> Ivana Tošic: Oh, that's what I meant. No, no, it changes from image
to image.
>>: Use different. Seems different. Seems different than
[indiscernible] even want to say pixel. So that means -- different
signal about different things.
>> Ivana Tošic: Yeah, that's why we infer sigma for each pixel. Yes,
per pixel, per image.
>>: Per image.
>> Ivana Tošic: Per pixel, per image. So, yeah, and here you see we
average for the dictionary. So this is done per patch, and because it's
independent you can just do sparse [indiscernible] for each patch. But
then for the learning step, we average over all the patches, over all
images.
>>: The position of the patches is just regular --
>> Ivana Tošic: It's randomly chosen. We take a lot of images and
randomly choose patches. So I can show you: this is from the Middlebury
benchmark. We took all the depth maps, I think we had something like 30
depth maps, and we extracted patches of 16 by 16 pixels. These are some
examples. Most of the irregularities here are the occlusions. These are
missing pixels, and we don't mark them as missing pixels; we just let
the algorithm figure out whether they're unreliable data.
So we have learned the dictionary, and I will show it here in grayscale
because I think it's easier to see. And what we saw is that we have a
lot of oriented edges. They're a bit different than for images; they're
kind of sharper edges. And we got a couple of slants, like here. We
think this is used for the ground, because usually in these depth maps
the ground is tilted; it looks that way because of how the sensor is
looking. And these can be multiplied by positive or negative
coefficients, so it doesn't really matter whether it's white in the
front or black in the back.
But the biggest conclusion is that it is different from the
dictionaries people use for images. So we tried to see, well, how much
better do these dictionaries do, and we did that on the denoising task.
This is just a mathematical formulation of the denoising task.
So in denoising we now have our learned dictionary, our coefficients,
and our noisy image here, and we want to reconstruct the denoised image
F hat. What we need to do is infer both the coefficients and the
covariance matrix of the noise for that patch.
So here the dictionary is fixed; it's just our inference step, the same
as before. I'll show you an example here: first, these are real depth
maps but with synthetic noise, just to have the ground truth to compare
against. This is an original depth map, obtained with a laser range
scanner; this is from the Yang and Purves database. And we have added
nonstationary Gaussian noise: we've corrupted one percent of the pixels
in the depth maps, and the variance was randomly chosen for each of the
pixels.
And we did total variation denoising and got results that smooth out
some parts but still leave some of the noise. Curvelet thresholding
also gave really bad results.
Nonlocal means gave nicely smoothed results but lost a lot of the
details; this is trees, a forest, in the back.
And this is the result that we obtained with our nonstationary sparse
coding. And --
>>: Just a question. So when -- you're talking, like, even with these,
of course [indiscernible].
>> Ivana Tošic:
Uh-huh.
>>: So do you try to [indiscernible].
>> Ivana Tošic: Yes. I tuned the total variation regularizer, the
lambda. I chose the best one for nonlocal means and for curvelets, yes.
>>: So I'm trying to understand what exactly is going on here, whether
you're basically trying to learn the dictionary and the condition of
the pixels. In the estimation literature there's a lot of work on
robust estimation in images, where you have these things to make the
estimation robust. And from the way you threshold, what it seems you're
doing is basically just taking away the outliers in the processing.
>> Ivana Tošic: Uh-huh.
>> Phil Chou: The many pixels.
>> Ivana Tošic: Well, we also estimate something about those outliers.
We're not just thresholding them and saying these are useless; we have
a reliability measure.
>>: You were doing that because when it's greater than sigma zero, you
basically are throwing it away, because your second term --
>> Ivana Tošic: No, it's smaller. When it's smaller than sigma.
>>: If F I hat, minus sigma 0 -- so you had a sigma 0. So we had it
[indiscernible] and the [indiscernible] was also [indiscernible]. So
that data has been thrown away.
>> Ivana Tošic: If it's below sigma 0? Yes, if it's subsumed in the
approximation error, we don't care about it.
>>: Below sigma 0, then you're using the data. But if it's above sigma
0, then you're just throwing it away.
>> Ivana Tošic: No, it's just appropriately weighted.
>>: The weighting --
>> Ivana Tošic: It's thresholded when it's --
>>: [indiscernible].
>> Ivana Tošic:
It's thresholding.
>>: So if it's on the top condition, then it's zero, which is sigma 0.
So you're weighting every pixel with the same weight then.
>> Ivana Tošic: Plus an additional one.
>>: And the second term, when you put it there -- because the bottom is
sigma I squared plus sigma 0 squared, so you end up with F I minus F I
hat squared on top, so you get one [indiscernible].
>> Ivana Tošic: Oh, I see what you mean. No, because in the first step
here you are inferring A, and these sigmas are from the previous
iteration. So you're never dividing by the same thing. You solve for it
afterwards and then you go back.
>>: Well, in robust estimation you can say, I threshold, I get rid of
10 percent of the estimates, [indiscernible] I [indiscernible] for the
outliers and search again. Seems very similar.
>> Ivana Tošic: I will show you on an example. We get per-pixel
estimates of the variance, and I'll show you what they look like. It's
not thresholding; there's some variability in them. This is just
showing similar results, averaged over different images in the same
Purves database, and this is the nonstationary sparse coding result,
which performs much better than the other ones.
And this is when we corrupt it with a different number of bad pixels.
And here I think maybe you can see what we get from the data. This is
from laser range data, the one that I showed before. This is the
original; when we put it into the nonstationary sparse coding we get an
estimate, then we get the reconstruction, and we also get per pixel the
inferred variance. You can see that it has different values for
different pixels, and you can also use this data, if you wanted to do
further processing, to put a reliability measure on each of the pixels.
And this is what you get from nonlocal means. It also removes some of
the noise, but it smooths out some small details, like here and here.
And this noise, which is basically correlated noise, we were not able
to denoise, because it doesn't fall under the assumption of the model
that the noise is uncorrelated.
So this is the limitation of this algorithm. But I think the best
property is that it gives us both the denoised depth map and the
inferred variance.
>>: Seems to be mostly on the edges.
>> Ivana Tošic: Uh-huh, yeah. So we can get some of the correlated
noise. These errors are correlated along the edges, so they're not just
single-pixel outliers, and when there is a lot of correlation it has
problems.
And we have done this also on time-of-flight data. This is the original
depth and this is the denoised, and I can just -- oops, it doesn't zoom
in. Basically we removed all these pixels while keeping some of the
fine details of the depth map. We can also do inpainting; this is just
an example from Kinect maps. And again, we can inpaint well around the
edges, like here. But when we have big pieces of missing information,
bigger than our patch size, we cannot fill it --
>>: What's the patch size?
>> Ivana Tošic: 16 by 16. You can learn it larger but then you need
bigger machines.
>>: Pixel [indiscernible].
>> Ivana Tošic: We learned it so it was just complete. You can also
make it more overcomplete and it would be better. So it was 256
elements.
>>: Could this be done hierarchically, where you start with smaller
patch sizes where you get it right, and then follow with bigger ones
just in the areas where it's bad?
>> Ivana Tošic: I think yes, and I think some people have done it as a
kind of multi-resolutional dictionary learning. Yes, there's also work
on that. That's a good point. Okay. So I will now go through the second
part, and it will be a bit quicker, but stop me, please, if you need
more information. Here we decided to go a step further and see if we
can learn multi-modal representations of intensity plus depth.
It's pretty easy to see that there is a lot of correlation between the
intensity and depth from time-of-flight cameras, or between intensity
and disparity in structured light. Here we need bi-modal
representations: basically we need our dictionary atoms to have two
parts, their intensity part and their depth part.
Let me quickly tell you why some very simple models that you might
think of will not work here. Two simple examples; this is just
synthetic, an illustration. If you have a 3-D edge observed with the
hybrid sensor, you have a discontinuity in the image and a
discontinuity in depth. Or, another example, you have a slanted surface
with some texture on it; let's say a sine grating. If you look at it,
the image will be a chirp and the depth will just be a gradient. And
you can say, well, let me just stack them: put the intensity on top of
the depth, so each atom has its intensity part and its depth part, one
vector on top of another, multiplied by the same coefficient.
In example one you can do that, but what will happen, because you have
the same coefficients for both intensity and depth, is that you will
have to put the variability in contrast into the atoms, which leads to
a combinatorial explosion of the dictionary.
So we cannot use this stacked model. What people have then proposed is
a common dictionary model; people refer to it as joint sparsity. You
have the same dictionary and two different coefficients. And here that
would be great: we just put a Haar wavelet or step function and
multiply it with different coefficients, and that accounts for the
contrast difference.
However, in the second example, we won't be able to use this model,
because we cannot have one atom that represents both a slant and a
chirp. So the conclusion is that we need a model that has different
atoms and different coefficients, but still models the correlation
between the two modalities.
>>: They did encoding like I can see [indiscernible] in the depth
issues. So [indiscernible] [indiscernible].
>> Ivana Tošic: You can do that, but then you have a problem of phase
wrapping, right?
>>: Yes.
>> Ivana Tošic: It has been -- I can point you to where it's been done
for videos, for optical flow, but not depth.
What we proposed is to have a sparse model for the image, a sparse
model for depth, and then coupling variables that multiply with the
magnitudes to give the coefficients.
These are ideally binary, but we relax them to be between 0 and 1, and
they just say: if it's 0, we're turning off both intensity and depth;
if it's 1, we're turning them on. So in the end we want sparsity on the
coupling variables.
And you can put that into a generative model, so you have atoms for
intensity and atoms for depth, the magnitudes of the coefficients for
intensity and for depth, and the coupling variables. And you can have
different noise for intensity and for depth.
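In symbols, a rough sketch of the coupled model just described (my
notation: x are the coupling variables, a and b the magnitudes, and the
noise terms can differ across the two modalities):

```latex
y_I = D_I\,(x \odot a) + n_I, \qquad
y_D = D_D\,(x \odot b) + n_D, \qquad
x \in [0,1]^K .
```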
So then it becomes a question of how we do inference. If you pose it as
an optimization problem, we want to minimize the L1 norm of our
coupling variables; since these are positive, this is just a sum. And
then we have a set of constraints. This is the quadratic constraint on
the representation of the intensity, and here I'm just assuming a
stationary type of Gaussian noise; it would be another step further to
handle the nonstationary case. The same for depth, and let's just say
that our magnitudes are bounded by certain values.
This problem is nonlinear, so we can't optimize it in a straightforward
way. But by doing a simple change of variables, where we take the
product of x with the magnitudes as new variables for both A and B, we
get a convex problem. It's a second-order cone program, and we can use
existing solvers to find the coefficients.
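Putting those pieces together, the optimization being described is
roughly the following; this is a sketch in my notation rather than the
exact formulation from the paper:

```latex
\min_{x,\,a,\,b}\; \sum_{k} x_k
\quad \text{s.t.} \quad
\|y_I - D_I\,(x \odot a)\|_2 \le \varepsilon_I,\quad
\|y_D - D_D\,(x \odot b)\|_2 \le \varepsilon_D,\quad
0 \le x \le 1,\quad |a_k| \le M,\quad |b_k| \le M .
```

With the substitution u = x ∘ a and v = x ∘ b, the quadratic constraints
become second-order cone constraints in (x, u, v) and the magnitude
bounds become linear constraints of the form |u_k| <= M x_k, which is
the convex second-order cone program mentioned above.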
This optimization problem gives us the values for X and, equivalently,
the magnitudes A and B. We named it joint basis pursuit because it
jointly finds the intensity and depth coefficients.
>>: XI is independent?
>> Ivana Tošic: Yes. So for each patch, for each pair of intensity and
depth, we get different X variables. So the Xs, As and Bs are hidden
variables; they're estimated for each image, and only the dictionaries
are the parameters that are estimated for the whole database.
So then, again in a similar way, we have a two-step iterative process
to learn the dictionary. We initialize it randomly, then run joint
basis pursuit to find the sparse coefficient vectors and the Xs, then
use those in the learning step to learn the dictionaries, and iterate
between these two steps. So it's very similar.
And we learned it again on the Middlebury benchmark database. Now we
used both intensity and depth, and we learned a twice-overcomplete
dictionary. Here I show in each of these little panels an atom that has
its image part and its disparity or depth part.
And just by looking at it, we can see that there are coinciding edges,
which we would expect, because an edge in 3-D induces an edge in both
intensity and depth, and there is texture. We saw that sometimes we
have a slant in depth and a texture in the image; this type of atom is
a bit rarer than these ones, but they still exist.
We can also make a scatter plot where on this axis I evaluated the
gradient angle of the depth atoms. So basically this is the gradient
angle, the normal; if you see it as a slant, it would be like a normal
to the surface. And this is the texture part, which I fit with a Gabor
for each atom and found its orientation. If you do a scatter plot,
where each point corresponds to an atom in the dictionary, you can see
there's a big diagonal structure: all the points are around the
diagonal. Basically there's a 90-degree relation between the gradient
and the Gabor orientation. And just for comparison, I didn't show it
here, we did a similar experiment with a group lasso algorithm and it
gave us a completely uniform distribution. It didn't at all give us the
same tendency of correlation between these two.
So this was an interesting result, because a lot of people have looked
into the statistics of intensity and depth, and they found that closer
objects tend to have higher luminance, higher intensity values.
But this was to the best of my knowledge the first time that we have
shown that there exists some type of correlation between the gradient
angle of the depth and the orientation of the texture.
And of course we have done some inpainting experiments, just to show
that if we have dictionaries learned on both intensity and depth, we
can get slightly better inpainting results than with just the depth
dictionary. Here we removed 90 percent of the pixels from the depth
map, and here I removed 96 percent, just to see how far we can go. And
just from four percent of the pixels, plus the intensity, we can
reconstruct this depth map, which is still blurry and not perfect, but
it's really from four percent of the data.
>>: So that's really missing pixels for the depth value and image
value?
>> Ivana Tošic: No, just the depth. Basically we are relying on the
intensity to give us the -- it would be a good experiment to try
removing both intensity and depth, either at the same locations or at
different ones.
>>: The correlation between the moving [inaudible].
>> Ivana Tošic: Yes. It's a prior. So if it sees an oriented texture in
the intensity, it will try to fit a slant; or if it finds an edge, it
will try to fit an edge in the depth.
>>: Not understanding why that is [indiscernible] just if it weren't
missing anything would it be learning?
>> Ivana Tošic: So I think that it's blurry because these models are
essentially linear; they're linearly combining these atoms. So if
you're bad at estimating one, you're just removing some of the -- it's
not like a layered model of the scene per depth where you would nicely
represent that, just --
>>: Is lambda turned up so high that it's not sparse?
>> Ivana Tošic: So here, because we have used this solver, we don't
even put the lambda in; it finds the best lambda itself. For the group
lasso, where I did the comparison, I just ran it for a bunch of lambdas
and took the best one. If you turn up lambda, it will be smoother for
the group lasso. But here you don't have that option to tune it.
>>: [indiscernible] because you have [indiscernible].
>> Ivana Tošic: Yes. That's true. It's overlapping. That's true. But
it's still a linear model.
>>: So here, in your prior, you're still 16 by 16?
>> Ivana Tošic: These are 12 by 12. We had to reduce them because
now, the dimensionality of the problem increases.
>>: Is this again twice overcomplete?
>> Ivana Tošic:
Yes.
>>: How many elements?
>> Ivana Tošic:
So 288.
>>: What's the training set? How many images do you need to train?
>> Ivana Tošic: This was all the Middlebury data that I could download,
so I think it's around 30 intensity-depth pairs. But within each image
you take random patches, which are 12 by 12, so you have a lot of data.
Yeah, it's just the dataset that I found I can do this type of training
on. I'm not aware of any other database that I can use; maybe the MPEG
dataset. Yes.
>>: On one of your earlier slides you found a relation between the
angles in the depth and intensity. Do you think that same relationship
would still hold if you looked at raw patches from the depth and
intensity rather than the dictionary?
>> Ivana Tošic: That's a good question. I don't know. That's something
worth checking. I suppose people would have already looked at something
like that. But --
>>: I don't know.
>> Ivana Tošic: I know, we tried it. I can show you; we have one slide,
I think it's this one. So this is the plot with group lasso, and it's
completely different: if you do dictionary learning with group lasso,
you don't get that. Maybe I can also show you this. I said that some
people have already looked into the relation between luminance and how
close the objects are, and I have looked at the same thing for my
atoms. It turns out that in the atoms we also get that closer objects
appear brighter. So it's pretty dominant: over here is when the bright
surface is closer, and the histogram in red is when the darker surface
is closer.
So we got the same tendency that people have already observed.
>>: Just to be clear, if you didn't have any depth pixels in a 12-by-12
region you couldn't indicate that; there's no propagation for cells
early [phonetic].
>> Ivana Tošic: You would have to learn larger dictionaries. If you
don't have data, you don't have a signal.
>>: Larger as in bigger patches or larger in bigger number of patches?
>> Ivana Tošic: Bigger patches. So, if you don't have any data, F,
your observations are empty sets. So you don't have anything.
>>: You get any blocking artifacts?
>> Ivana Tošic: So what we usually do is average: we shift patches and
then average. If you don't do that, you can have blocking artifacts,
yes.
So I think on this one I shifted by every two pixels and then averaged,
not every single pixel.
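A minimal sketch of that shift-and-average reconstruction in Python;
`denoise_patch` stands in for the per-patch inference described
earlier, and the stride of two pixels follows what is mentioned above:

```python
import numpy as np

def reconstruct(depth, denoise_patch, patch=16, stride=2):
    """Denoise overlapping patches on a shifted grid and average the overlaps
    to avoid blocking artifacts."""
    H, W = depth.shape
    out = np.zeros_like(depth, dtype=float)
    weight = np.zeros_like(depth, dtype=float)
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            block = depth[i:i + patch, j:j + patch]
            rec = denoise_patch(block.reshape(-1)).reshape(patch, patch)
            out[i:i + patch, j:j + patch] += rec       # accumulate reconstructions
            weight[i:i + patch, j:j + patch] += 1.0    # count overlaps per pixel
    return out / np.maximum(weight, 1.0)
```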
So, I always like to conclude with this representation from David Marr,
just to show you where I think we are with sparse representations right
now. He described a progression from the pixel representations, to
edges and blobs in images, to two-and-a-half-D representations where
you take into account local surface orientation and discontinuities, et
cetera. These are all kind of viewer-centered representations, and this
is mostly what we have worked on so far.
So what I presented today is mostly in this block. And to get to 3-D
representations that are object-centered is still part of future work
for sparse representations. So I think that's it. Thank you. [applause]
>> Ivana Tošic:
So please if you have any further questions.
>>: In this work, let's say I have prior knowledge -- for example, most
of my patches are multi-planar, or [indiscernible] most of my stuff
will be on a movie set, I know certain surfaces. Would adding extra
terms that enforce that somehow help, or is it enough if my training
set just [indiscernible]?
>> Ivana Tošic: So this is completely nonparametric learning; we're
learning pixel by pixel for each image. If we have a 12 by 12 image
patch, we have 144 parameters per patch. If you know you have planar
surfaces, you can say: I'm going to learn a normal that completely
defines my surface, and that will have three parameters, right? So you
can reduce the number of parameters. I don't know what your objective
function will look like, whether it will be convex or not, but you're
definitely reducing the number of parameters.
>>: This prior should be scale invariant, right? So it's equivalent as
you get closer to the subject or further away --
>> Ivana Tošic: So there are people who have tried to do a multi-scale
version, with larger or smaller objects. I have not seen much
difference in the way the dictionaries look.
>>: The 12 by 12 could be scaled to any size.
>> Ivana Tošic:
Yes.
>>: With the same priors.
>> Ivana Tošic:
Yes.
>>: So if you had a 12 by 12 patch that had no depth in it, you could
apply your entire dictionary but scale it up, so that you're basically
reusing the same [indiscernible] but on a larger scale.
>> Ivana Tošic: Well, you can. You're basically filtering the higher
frequencies from your data if you're just upscaling. You won't be able
to reconstruct the higher frequency information from that.
>>: But you would still be able to handle patches that were completely
devoid of depth. So for your region with no depth, you could
reconstruct it using the same dictionary.
>> Ivana Tošic:
Yes, that's a good point.
>>: It's basically tantamount to shrinking the original image, running
your algorithm, and then scaling it back up. So then you could inpaint
larger regions.
>> Ivana Tošic:
Yes, yes, that's it.
Good thing to try.
>> Phil Chou: All right. Thank you very much. [applause]