Lexicap

Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258


[Lex] The following is a conversation with Yann LeCun, his second time on the podcast. He is the chief AI scientist at Meta, formerly Facebook, professor at NYU, Turing Award winner, one of the seminal figures in the history of machine learning and artificial intelligence, and someone who is brilliant and opinionated in the best kind of way, and so is always fun to talk to. This is the Lex Fridman Podcast. To support it, please check out our sponsors in the description. And now, here's my conversation with Yann LeCun.
[Lex] You co-wrote the article "Self-Supervised Learning: The Dark Matter of Intelligence." Great title, by the way, with Ishan Misra. So let me ask, what is self-supervised learning, and why is it the dark matter of intelligence?
[Yann] I'll start with the dark matter part. There is obviously a kind of learning that humans and animals are doing that we currently are not reproducing properly with machines or with AI, right? So the most popular approaches to machine learning today, or paradigms, I should say, are supervised learning and reinforcement learning. And they are extremely inefficient. Supervised learning requires many samples to learn anything, and reinforcement learning requires a ridiculously large number of trials and errors for a system to learn anything. And that's why we don't have self-driving cars.
[Lex] That was a big leap from one to the other. Okay, so to solve difficult problems, you have to have a lot of human annotation for supervised learning to work. And to solve those difficult problems with reinforcement learning, you have to have some way to maybe simulate that problem, such that you can do that large-scale kind of learning that reinforcement learning requires.
[Yann] Right, so how is it that most teenagers can learn to drive a car in about 20 hours of practice, whereas even with millions of hours of simulated practice, a self-driving car can't actually learn to drive itself properly? So, obviously, we're missing something, right? And it's quite obvious to a lot of people; the immediate response you get from many people is, well, humans use their background knowledge to learn faster. And they're right. Now, how was that background knowledge acquired? And that's the big question. So now you have to ask, how do babies in the first few months of life learn how the world works? Mostly by observation, because they can hardly act in the world. And they learn an enormous amount of background knowledge about the world that may be the basis of what we call common sense. This type of learning is not learning a task, it's not being reinforced for anything, it's just observing the world and figuring out how it works. Building world models, learning world models. How do we do this? And how do we reproduce this in machines? So self-supervised learning is one instance, or one attempt, at trying to reproduce this kind of learning.
[Lex] Okay, so you're looking at just observation, so not even the interacting part of a child. It's just sitting there watching mom and dad walk around, pick up stuff, all of that. That's what we mean by background knowledge.
[Yann] Perhaps not even watching mom and dad.
[Lex] Just having eyes open or having eyes closed, or the very act of opening and closing eyes, that the world appears and disappears, all that basic information. And you're saying that the reason humans are able to learn to drive quickly, some faster than others, is because of the background knowledge built up in the many years leading up to it, the physics of basic objects, all that.
[Yann] That's right. I mean, the basic physics of objects, yes. You don't even need to know how a car works, because that you can learn fairly quickly. The example I use very often is you're driving next to a cliff. And you know in advance, because of your understanding of intuitive physics, that if you turn the wheel to the right, the car will veer to the right, will run off the cliff, fall off the cliff, and nothing good will come out of this, right? But if you are a sort of tabula rasa reinforcement learning system that doesn't have a model of the world, you have to repeat falling off this cliff thousands of times before you figure out it's a bad idea. And then a few more thousand times before you figure out how to not do it. And then a few more million times before you figure out how to not do it in every situation you ever encounter.
[Lex] So self-supervised learning still has to have some source of truth being told to it by somebody. So you have to figure out a way, without human assistance, or without a significant amount of human assistance, to get that truth from the world. So the mystery there is how much signal is there, how much truth is there that the world gives you, whether it's the human world, like you watch YouTube or something like that, or it's the more natural world. So how much signal is there?
[Yann] So here's the trick: there is way more signal in a self-supervised setting than there is in either a supervised or reinforcement setting. And this goes back to my analogy of the cake, the LeCake, as someone has called it, where you try to figure out how much information you ask the machine to predict and how much feedback you give the machine at every trial. In reinforcement learning, you give the machine a single scalar. You tell the machine, you did good, you did bad. And you only tell this to the machine once in a while. When I say you, it could be the universe telling the machine, right? But it's just one scalar. And so as a consequence, you cannot possibly learn something very complicated without many, many, many trials where you get many, many feedbacks of this type. In supervised learning, you give a few bits to the machine at every sample. Let's say you're training a system to recognize images on ImageNet; there are 1,000 categories, so that's a little less than 10 bits of information per sample. But self-supervised learning, here is the setting. Ideally, we don't know how to do this yet, but ideally you would show a machine a segment of a video and then stop the video and ask the machine to predict what's going to happen next. And so you let the machine predict, and then you let time go by and show the machine what actually happened, and hope the machine will learn to do a better job at predicting next time around. There's a huge amount of information you give the machine, because it's an entire video clip of the future after the video clip you fed it in the first place.
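To make the cake comparison concrete, here is a rough back-of-the-envelope sketch in Python of how much feedback each paradigm gives the learner per sample; the frame size and bit counts are illustrative assumptions, not figures from the conversation.

    import math

    # Reinforcement learning: a single scalar reward, and only once in a while.
    rl_bits_per_trial = 1

    # Supervised learning: one of 1,000 ImageNet categories per labeled image.
    supervised_bits_per_sample = math.log2(1000)   # ~9.97 bits, "a little less than 10"

    # Self-supervised video prediction: the target is an entire future clip.
    # Even a single small 64x64 grayscale frame at 8 bits per pixel dwarfs the above.
    self_supervised_bits_per_frame = 64 * 64 * 8   # 32,768 bits for just one frame

    print(rl_bits_per_trial, round(supervised_bits_per_sample, 2), self_supervised_bits_per_frame)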
[Lex] So both for language and for vision, there's a subtle, seemingly trivial construction, but maybe that's representative of what is required to create intelligence, which is filling the gap. So... filling the gaps.
[Yann] Filling the gaps.
[Lex] It sounds dumb, but is it possible that you could solve all of intelligence in this way? For language, just give a sentence and continue it, or give a sentence with a gap in it, some words blanked out, and you fill in what words go there. For vision, you give a sequence of images and predict what's going to happen next, or you fill in what happened in between. Do you think it's possible that that formulation alone, as a signal for self-supervised learning, can solve intelligence for vision and language?
[Yann] I think that's our best shot at the moment. So whether this will take us all the way to human-level intelligence or something, or just cat-level intelligence, is not clear, but among all the possible approaches that people have proposed, I think it's our best shot. So I think this idea of an intelligent system filling in the blanks, either predicting the future, inferring the past, filling in missing information. I'm currently filling in the blank of what is behind your head and what your head looks like from the back, because I have basic knowledge about how humans are made. And I don't know what word you're gonna say, at which point you're gonna speak, whether you're gonna move your head this way or that way, which way you're gonna look, but I know you're not gonna just dematerialize and reappear three meters down the hall, because I know what's possible and what's impossible.
[Lex] So you have a model of what's possible and what's impossible, and then you'd be very surprised if the impossible happens, and then you'll have to reconstruct your model.
[Yann] Right, so that's the model of the world. It's what tells you, you know, what fills in the blanks. So given your partial information about the state of the world, given by your perception, your model of the world fills in the missing information, and that includes predicting the future, re-predicting the past, filling in things you don't immediately perceive.
[Lex] And that doesn't have to be purely generic vision or visual information or generic language. You can go to specifics, like predicting what control decision you make when you're driving in a lane. You have a sequence of images from a vehicle, and then you have information, if you recorded it on video, of where the car ended up going, so you can go back in time and predict where the car went based on the visual information. That's very specific, domain-specific.
[Yann] Right, but the question is whether we can come up with a sort of generic method for training machines to do this kind of prediction or filling in the blanks. So right now, this type of approach has been unbelievably successful in the context of natural language processing. Every modern natural language processing system is pre-trained in a self-supervised manner to fill in the blanks. You show it a sequence of words, you remove 10% of them, and then you train some gigantic neural net to predict the words that are missing. And once you've pre-trained that network, you can use the internal representation learned by it as input to something that you train supervised, or whatever. That's been incredibly successful. Not so successful in images, although it's making progress, and it's based on sort of manual data augmentation. We can go into this later, but what has not been successful yet is training from video. So getting a machine to learn to represent the visual world, for example, by just watching video. Nobody has really succeeded in doing this.
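As a rough illustration of the fill-in-the-blanks pretraining described here, below is a minimal, hypothetical PyTorch sketch: mask a fraction of the tokens and train a network to predict the missing ones. The model size, masking rate, and random stand-in tokens are all assumptions for illustration, not the recipe of any particular system.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, d_model, seq_len, mask_rate = 1000, 64, 32, 0.15
    MASK_ID = 0  # reserve token id 0 as the blank/mask symbol (assumption)

    embed = nn.Embedding(vocab_size, d_model)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
    to_vocab = nn.Linear(d_model, vocab_size)  # a score for every word in the dictionary

    tokens = torch.randint(1, vocab_size, (8, seq_len))   # stand-in for real text
    mask = torch.rand(tokens.shape) < mask_rate           # choose roughly 15% of positions
    corrupted = tokens.masked_fill(mask, MASK_ID)         # blank them out

    hidden = encoder(embed(corrupted))                    # contextual representations
    logits = to_vocab(hidden)                             # scores over the vocabulary at each position
    loss = F.cross_entropy(logits[mask], tokens[mask])    # predict only the blanked words
    loss.backward()

After pretraining on lots of text this way, the internal representations (here, the hidden activations) are what get reused as input to a downstream supervised task, which is the point made above.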
[Lex] Okay, well, let's kind of give a high-level overview. What's the difference in kind and in difficulty between vision and language? So you said people haven't been able to really crack the problem of vision open in terms of self-supervised learning, but that may not necessarily be because it's fundamentally more difficult. Maybe, when we're talking about passing the Turing test in the full spirit of the Turing test, language might be harder than vision. That's not obvious. So in your view, which is harder, or perhaps are they just the same problem? The farther we get toward solving each, the more we realize it's all the same thing. It's all the same cake.
[Yann] We may eventually come up with methods that make them look essentially like the same cake, but currently they're not. And the main issue with learning world models or learning predictive models is that the prediction is never a single thing, because the world is not entirely predictable. It may be deterministic or stochastic; we can get into the philosophical discussion about it, but even if it's deterministic, it's not entirely predictable. And so if I play a short video clip and then ask you to predict what's going to happen next, there are many, many plausible continuations for that video clip. And the number of continuations grows with the interval of time that you're asking the system to make a prediction for. And so one big question with self-supervised learning is how you represent this uncertainty, how you represent multiple discrete outcomes, how you represent a sort of continuum of possible outcomes, et cetera.
[Yann] And if you are sort of a classical machine learning person, you say, oh, you just represent a distribution, right?
[Lex] Yeah.
[Yann] And that we know how to do when we're predicting words, missing words in the text, because you can have a neural net give a score for every word in the dictionary. It's a big list of numbers, maybe 100,000 or so. And you can turn them into a probability distribution that tells you, when I say a sentence, "the cat is chasing the blank in the kitchen," there are only a few words that make sense there. It could be a mouse, or it could be a laser spot, or something like that, right? And if I say "the blank is chasing the blank in the savanna," you also have a bunch of plausible options for those two words, right? Because you have kind of an underlying reality that you can refer to to fill in those blanks. So you cannot say for sure, in the savanna, if it's a lion or a cheetah or whatever, and you cannot know if it's a zebra or a gnu or a wildebeest. But it's the same thing: you can represent that uncertainty by just a long list of numbers. But if I show you a short video clip and ask you to predict what comes next, it's not a discrete set of potential frames. You have to have some way of representing a sort of infinite number of plausible continuations of multiple frames in a high-dimensional continuous space. And we just have no idea how to do this properly.
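The contrast being drawn can be sketched in a few lines: for a missing word, a network can output one score per dictionary entry and normalize them into a distribution, but there is no finite list of future video frames to score. The vocabulary size and frame resolution below are illustrative assumptions.

    import torch

    # Discrete case: one score per word, softmax gives a valid distribution.
    vocab_scores = torch.randn(100_000)              # a big list of numbers, one per word
    word_dist = torch.softmax(vocab_scores, dim=0)   # sums to 1 over the whole dictionary

    # Continuous case: a single 256x256 RGB frame already lives in a
    # ~196,608-dimensional continuous space, so you cannot enumerate
    # the possible next frames and softmax over them.
    frame_dim = 256 * 256 * 3
    print(word_dist.sum().item(), frame_dim)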
[Lex] Finite, high dimensional. So, just like the words, they try to get it down to a small finite set of, like, under a million, something like that. I mean, it's kind of ridiculous that we're doing a distribution over every single possible word for language, and it works. It feels like that's a really dumb way to do it.
[Yann] Something like that.
[Lex] I mean, it seems to me like there should be some more compressed representation of the distribution of the words.
[Yann] You're right about that.
[Lex] And so... I agree. Do you have any interesting ideas about how to represent all of reality in a compressed way, such that you can form a distribution over it?
[Yann] That's one of the big questions. How do you do that? Another thing that really is stupid about, I shouldn't say stupid, but simplistic about current approaches to self-supervised learning in NLP, in text, is that not only do you represent a giant distribution over words, but for multiple words that are missing, those distributions are essentially independent of each other. And you don't pay too much of a price for this. So the system, in the sentence that I gave earlier, if it gives a certain probability for lion and cheetah, and then a certain probability for gazelle, wildebeest, and zebra, those two distributions are independent of each other. And it's not the case that those things are independent. Lions actually attack bigger animals than cheetahs do. So, you know, there's a huge independence hypothesis in this process, which is not actually true. The reason for this is that we don't know how to properly represent distributions over combinatorial sequences of symbols, essentially, because the number of combinations grows exponentially with the length of the sequence. And so we have to use tricks for this, but those techniques kind of get around it, or don't even deal with it. So the big question is, would there be some sort of abstract latent representation of text that would say that, you know, when I switch lion for gazelle...
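The independence assumption can be made concrete with a toy example; the words and probabilities below are purely illustrative and not from any trained model. With two blanks predicted by two separate distributions, the joint probability is forced to factorize, so the model cannot express that the lion goes with the larger prey animal.

    import torch

    predators = ["lion", "cheetah"]
    prey = ["gazelle", "wildebeest"]

    p_predator = torch.tensor([0.5, 0.5])   # marginal distribution over the first blank
    p_prey = torch.tensor([0.5, 0.5])       # marginal distribution over the second blank

    # Factorized joint: every predator/prey pairing gets the same 0.25,
    # even though some pairings should be much more plausible than others.
    joint = torch.outer(p_predator, p_prey)
    print(joint)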
[Lex] Yeah, so this independence assumption... Let me throw some criticism at you that I often hear. So this kind of filling in the blanks is just statistics. You're not learning anything, like the deep underlying concepts. You're just mimicking stuff from the past. You're not learning anything new such that you can use it to generalize about the world. Or, okay, let me just say the crude version: it's just statistics, it's not intelligence. What do you have to say to that? What do you usually say to that?
[Yann] I don't get into those discussions because they are kind of pointless. So first of all, it's quite possible that intelligence is just statistics. It's just statistics of a particular kind. Yes.
[Lex] Is it possible that intelligence is just statistics?
[Yann] Yeah. But what kind of statistics? So if you're asking the question, do the models of the world that we learn have some notion of causality? Yes. So if the criticism comes from people who say, current machine learning systems don't care about causality, which, by the way, is wrong, I agree with them. Your model of the world should have your actions as one of the inputs. And that will drive you to learn causal models of the world, where you know what intervention in the world will cause what result. Or you can do this by observation of other agents acting in the world and observing the effect, other humans, for example. So I think at some level of description, intelligence is just statistics.
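A minimal sketch of a world model that takes the agent's own action as one of its inputs, trained on observed consequences; the architecture, dimensions, and random stand-in data are assumptions for illustration only.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    state_dim, action_dim = 16, 4

    # Predict the next state from the current state and the action taken.
    world_model = nn.Sequential(
        nn.Linear(state_dim + action_dim, 128),
        nn.ReLU(),
        nn.Linear(128, state_dim),
    )
    opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)

    # One training step on observed (state, action, next_state) triples, which could
    # come from the agent's own interventions or from watching other agents act.
    state = torch.randn(32, state_dim)
    action = torch.randn(32, action_dim)
    next_state = torch.randn(32, state_dim)   # stand-in for what actually happened

    pred = world_model(torch.cat([state, action], dim=-1))
    loss = F.mse_loss(pred, next_state)
    opt.zero_grad()
    loss.backward()
    opt.step()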
[Lex] Yes.
[Yann] But that doesn't mean you don't have models that have deep mechanistic explanations for what goes on. The question is, how do you learn them? That's the question I'm interested in. Some people say that those mechanistic models have to come from someplace else. They have to come from human designers, they have to come from I don't know what. And obviously we learn them. Or if we don't learn them as individuals, nature learned them for us through evolution. So regardless of what you think, those processes have been learned somehow.
[Lex] So if you look at the human brain, when we humans introspect about how the brain works, it seems like when we think about what intelligence is, we think about the high-level stuff, like the models we've constructed, concepts from cognitive science, like memory and reasoning modules, almost like these high-level modules. Does this serve as a good analogy? Are we ignoring the dark matter, the basic low-level mechanisms, just like we ignore the way the operating system works when we're just using the high-level software? We're ignoring that, at the low level, the neural network might be doing something like statistics. Sorry to use this word. It probably...