An Introduction to Transfer Learning in NLP and HuggingFace


In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of Transfer Learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction to the open-source tools released by HuggingFace, in particular our Transformers, Tokenizers and Datasets libraries and our models.


Transcript


Intro


Hi, everyone. Welcome to my talk. Today we're going to talk about transfer learning in NLP. I'll start by talking a little bit about the concept and the history, then present the tools that we're developing at Hugging Face. And then I hope you have a lot of questions for me for the Q&A session.


What is Transfer Learning?


[00:46] Let's start with concepts. What is transfer learning? That's a very good question. Here is the traditional way we do machine learning. Usually when we're faced with a first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our dataset to get the machine learning system that we use, for instance, in production. Now when we're faced with a second task, usually we start from scratch again. We'll gather another set of data, another dataset, we randomly initialize our model, and we train it from scratch again to get a second learning system. And the same when we're faced with a third task, okay?

We'll have a third dataset, and we'll have a third machine learning system, again initialized from scratch, that we'll use in production. So this is not the way we humans do learning. You see, when we're faced with a new task, we reuse all the knowledge we've learned in past tasks, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning, okay? So you can see that as having a lot of datasets that we've already used to generate a knowledge base, okay?

[02:03] And now this gives us two main advantages. The first one is that we can learn with just a few data points, because we can interpolate between these data points. This naturally gives us some form of data augmentation, if you want. And the second advantage is that we can also leverage all this knowledge to reach better performance. Humans are typically more data-efficient and have better performance than machine learning systems. So transfer learning is one way to try to do the same for statistical learning, for machine learning. Last summer we gave a very long tutorial. It was a three-hour tutorial, so you can check out these links. There are 300 slides, a lot of hands-on exercises, and an open-source code base. So if you want more information, you really should go there.


Sequential Transfer Learning


[02:53] There are a lot of ways you can do transfer learning, but today I'm going to talk about sequential transfer learning, which is currently the most used flavor, if you want, of transfer learning. Sequential transfer learning, like the name says, is a sequence of steps, at least two steps.

The first step is called pre-training, and during this step you try to gather as much data as you can, okay? We'll try to build basically some kind of knowledge base, like the knowledge base we humans build. And the idea is that we end up with a general-purpose model. There are a lot of different general-purpose models; you've probably heard about many of them. Word2Vec and GloVe were the first models leveraging transfer learning. They were word embeddings, but today we use models which have a lot more parameters and which are fully pre-trained, like BERT, GPT, or DistilBERT. And these models are pre-trained as general-purpose models. They are not focused on one specific task; they can be used on a lot of different tasks.

[03:54] So how do we do that? We do a second step of adaptation, or fine-tuning usually, in which we select the task we want to use our model for and we fine-tune it on this task, okay? So here you have a few examples: text classification, word labeling, question answering.

But let's start with the first step, pre-training. The way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label. We can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. So you can see that, given some context, you will try to maximize the probability of the next word or the probability of the masked word, okay?
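To make that decomposition concrete, here is the standard chain-rule factorization behind the causal variant of language modeling, written out as a formula rather than taken from the slides; the masked variant conditions on the surrounding words instead of only the left context.

```latex
% Causal language modeling objective (sketch): the probability of a text is
% factored into per-word conditional probabilities, and training maximizes
% the log-likelihood of each word given its preceding context.
P(w_1, \dots, w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1}),
\qquad
\mathcal{L}(\theta) = \sum_{i=1}^{N} \log P_\theta(w_i \mid w_{<i}).
```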

[04:46] The nice thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to really train a high-capacity model. Okay? So this is great for many things, and in particular for low-resource languages. It's also very versatile: as I told you, you can decompose this probability as a product of probabilities of various views of your text, and this is very interesting from a research point of view.


Pre-training Transformers models (BERT, GPT…)


[05:13] Now, what do the models look like? There are two main flavors of models. They are both Transformers, because Transformers are interesting from a scalability point of view. The first one is called BERT. To train a BERT model, you do what we call masked language modeling, which is a de-noising objective. So we take a sentence here, we mask one token, and we try to predict this token back.

So we project all these words into vectors, then we use what is the attention layer in the Transformer, which does a weighted average of these vectors, to end up with vectors again. But these vectors now depend on the context, which means that the vector associated with the masked token has enough information from the context to be able to predict back, or at least to try to predict back, the missing or masked word. Okay? So this is how BERT is trained: you train this model to predict back the masked token.
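As a minimal illustration of this de-noising objective in practice (a sketch, not from the talk, assuming the Transformers pipeline API and the bert-base-uncased checkpoint):

```python
from transformers import pipeline

# BERT-style masked language modeling: hide one token and predict it back.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("My dog [MASK] very cute."):
    # Each candidate comes with the filled-in token and a probability score.
    print(prediction["token_str"], round(prediction["score"], 3))
```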

[06:14] Now there is another flavor, which is called the autoregressive or causal model. These models look pretty much the same, but you can see that each token here is associated with the next token. So for "my dog", the token at the end of the hyper-column associated with "dog" will be "is", the next token. Okay. So we need to tweak the attention here: we have to mask the right-hand context, otherwise the model can just see the label here on the right. So these models are inherently less powerful, because they cannot use the right-hand context to make their predictions, but they get a lot more training signal here, so they usually train faster.
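The causal flavor can be sketched the same way (a hypothetical snippet assuming the gpt2 checkpoint): the model only sees the left context and keeps predicting the next token.

```python
from transformers import pipeline

# GPT-style causal language modeling: generate by repeatedly predicting
# the next token from the tokens on the left.
generator = pipeline("text-generation", model="gpt2")

result = generator("My dog is", max_length=15, num_return_sequences=1)
print(result[0]["generated_text"])
```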

So these are the two main flavors of general-purpose models. Now, how do you do the adaptation step? It's pretty easy. The first step is that we remove the pre-training head. So we take these pink boxes that I showed you on the previous slides and we remove them, and we replace them with a task-specific head. If you do text classification, it can just be a linear projection to the number of classes that you have. If you're doing a more complex task, you can add a full neural network on top of it.

[07:27] If your task is very complex, like structurally different, for instance, you do machine translation, then you have to do more complex things. We'll talk about that a little bit in the next section.
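In the Transformers library this head replacement is done for you when you load a pre-trained body into a task-specific class. Here is a minimal sketch, assuming bert-base-uncased and a two-class task; the newly added classification head starts from random weights and is trained during fine-tuning.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The BERT body comes from the pre-trained checkpoint; the small linear
# classification head on top is newly initialized (the "task-specific head").
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transfer learning is very data efficient.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2): one score per class
print(logits)
```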


Downstream tasks and Model Adaptation: Quick Examples


A - Transfer Learning for Text Classification

[07:38] So let me show you two main examples of downstream tasks that we can tackle: text classification and generation. Here is an example of text classification, to show you how it works. Let's say we have a sentence here: "Jim Henson was a puppeteer." And the task of our machine learning model is to predict whether the sentence is true or false. So this is just binary classification, if you want. The first step will be to convert this input sentence into something that our model can digest. Our models can handle integers and floats; they need numbers.

So the first step will be to convert these strings into numbers. And there is one interesting thing here: our models are made to be trained on and to process basically open-domain vocabulary, open-domain corpora; they're made to be trained on the internet. So we have to handle some uncommon words. Here, "puppeteer" is a word that is definitely not very common in English. The way we handle that is that we cut this word into prefixes and suffixes until we know every sub-part of the word. Okay. So here, for instance, "puppet" and "-eer" are a fairly common prefix and suffix, so they are in our vocabulary, and we can then just convert all of this into our vocabulary indices. This is now the input of the model we saw two slides ago, BERT or GPT. And as you remember, the output was a vector here: the output before the pink boxes was a vector. We then just have to project this vector back to two classes. So, for instance, we put it into a linear classifier, and then we get the classes.
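To make the subword step above concrete, here is a minimal sketch (assuming the bert-base-uncased vocabulary; the exact pieces may differ slightly from the prefix and suffix quoted in the talk):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Uncommon words are cut into sub-word pieces that exist in the vocabulary...
tokens = tokenizer.tokenize("Jim Henson was a puppeteer")
print(tokens)  # e.g. ['jim', 'henson', 'was', 'a', 'puppet', '##eer']

# ...and every piece is then mapped to its vocabulary index.
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```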

[09:15] So this linear classifier will be trained from scratch on our downstream task. Okay. So it has to be quite small in terms of parameters if you want to be data efficient. But the other part is pre-trained on the huge corpus that we've gathered, so it can really be reused. Okay.

So here is a practical example on a classification task called TREC-6. You can see more details in the slides of our tutorial last year, but basically you will see that just after one epoch, we already have an accuracy over 90%, which means that this is really data efficient. This TREC-6 dataset is really a small dataset, 2,500 examples. So this is really data efficient. And when you train for three epochs, you get down to an error rate of 3.6%, which is the state of the art. Well, it was the state of the art last year. So you get the two things that we're talking about: data efficiency and high performance. Okay.
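A fine-tuning run like the one described takes only a few lines with the Trainer API. This is a hedged sketch, not the original experiment: it swaps in the public IMDB dataset as a stand-in text-classification task (TREC-6 column names vary between dataset versions) and uses DistilBERT to keep it light.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Stand-in binary text-classification dataset with "text" / "label" columns.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", num_train_epochs=3),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()  # a few epochs are typically enough thanks to the pre-trained body
```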


B - Transfer Learning for Language Generation

[10:11] Now here's a totally different example, so you see the variety, the diversity of tasks that these models can face. This is a chatbot setting where we have a kind of knowledge base. Our chatbot is pretending to be an artist with four children who has a cat and likes to watch Game of Thrones. And now it's discussing with a user in an open-domain setting. So the user said, "Hey," the chatbot answered, "Hello. How are you?" Then the user said, "I'm good. Thank you. How are you?" And our dialogue agent, our machine learning model, is supposed to generate a reply which makes sense given this personality.

So you see, this is quite a different task, because we have a lot of different types of inputs. We have a kind of knowledge base here, we have a dialogue history as well, we have the last utterance, and even the beginning of the reply, because we generate the reply word by word. So we have the beginning of the reply that we should also take into account to generate the next word.

[11:05] So how do we handle that? Well, actually we had a paper last year at ACL that you can check out, but basically there are two main ways to handle this. You can concatenate all of this into a single input; this is one simple way to tackle it, and it actually works really well. Or you can duplicate your model and have several parts that process each type of input, and then you need to connect them together with cross-attention or another mechanism.
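A toy sketch of the first option, the simple concatenation (the strings and separators here are illustrative only, not the exact format used in the paper):

```python
# Illustrative only: build a single input sequence for a causal language model
# by concatenating the persona, the dialogue history, and the reply so far.
persona = [
    "I am an artist.",
    "I have four children.",
    "I have a cat.",
    "I like to watch Game of Thrones.",
]
history = ["Hey", "Hello. How are you?", "I'm good. Thank you. How are you?"]
reply_so_far = "I am"

model_input = " ".join(persona) + " " + " ".join(history) + " " + reply_so_far
# The model is then asked for the next word given this whole sequence,
# and the loop repeats until the reply is complete.
print(model_input)
```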

Okay, these are very different, but the idea is that overall, as you can see here, in a competition we participated in, which took place two years ago now, we were really leading on the automatic metrics by a strong margin using this type of transfer learning approach.


Trends and limits of Transfer Learning in NLP


[11:53] So, a few words on trends and limits. I'll go fast, but yes, there is a trend towards bigger models. This is quite a big problem because it's narrowing the competition. There are some ways we can reduce the size of these models: distillation, pruning, quantization. I won't spend too much time on this, but it's something we like a lot. These models have generalization problems: even though they are big, they really have difficulty with out-of-domain generalization. And there is basically a limit to text as a medium, which is that a lot of common sense is just not written down in text, and so this is a limit to what these models can learn. The main ways to overcome this are to use some database or images, or even to use a human in the loop. Okay.

And now there is the last main problem, which is that these models are trained just one time, and it's very hard to make them learn new knowledge later on, because they have this problem of catastrophic forgetting. So, for instance, GPT-3, which was very expensive to train, with a training cost of something like 10 million, has no idea what COVID is, even though COVID is such a strong element of our daily life today. Okay. So this is too bad. So there are a lot of shortcomings, and at Hugging Face we try to tackle some of these shortcomings.


Hugging Face Inc.


[13:16] So what do we do at Hugging Face? Well, we try to democratize NLP. We started as a chatbot company, building a game, and we open-sourced a lot of tools while we were doing that. And actually these tools caught so much interest in the community that we are now fully focused on catalyzing and democratizing all this research-level work. We do that in two main ways. The first way is knowledge sharing; that's why I'm talking today. And the second way is to open-source a lot of code and to open-source some libraries, which let people leverage all these developments and develop better models on top of them.


Open source Libraries

Okay. So let's see a few of our open-source libraries, and this will be the open door for the questions in the second part. OK. The first main library, which you probably know, is called the Transformers library, which is a way to access all these state-of-the-art models. They can be used both for NLU and NLG in many languages. Today we have a lot of models in the library: all the models you probably know, BERT, GPT, and also more recent, more interesting models like DPR, Pegasus, XLM-RoBERTa, and T5, which is a really huge model. So there are a lot of them, very simple to use; we can talk about that in the questions. And we even have a model hub where people can share their models. We have more than 3,000 models in many, many languages.

[14:51] So if you want to use a model for Finnish or German, you have something like 132 different models you can inspect and play with, even in the web interface: you can try the models and see how they behave. This is at HuggingFace.co.
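Loading one of those shared checkpoints from the hub takes the same two lines regardless of language or architecture. A sketch follows; bert-base-german-cased is just one example of a model id you can copy from huggingface.co.

```python
from transformers import AutoModel, AutoTokenizer

# Any checkpoint on the model hub can be pulled by its id.
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")

# "Machine learning is fun."
inputs = tokenizer("Maschinelles Lernen macht Spass.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
```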

More recently we open-sourced a second library, called Tokenizers, which does this first part that I was talking about: splitting the string into integers. So why did we do that? This was really a bottleneck in terms of speed, so we decided to write some very low-level Rust code, super fast, to solve this problem. So this is now really a very fast step. It's available in Python, Node, and Rust. Rust is a great language.
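A small sketch of the Tokenizers library, following its quick-tour style (the corpus file name here is hypothetical; you would point it at your own text files):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a byte-pair-encoding tokenizer from scratch, backed by Rust for speed.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # hypothetical corpus file

encoding = tokenizer.encode("Jim Henson was a puppeteer")
print(encoding.tokens)  # sub-word strings
print(encoding.ids)     # integer indices fed to the model
```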

[15:38] And more recently, a few months ago, we open-sourced a third library, which is called Datasets. Datasets is our third library, and it tackles one last problem we discovered, which is that datasets themselves are kind of hard for people to access. They are hard to share, and we spend a lot of time rewriting the same processing code that other people have actually already written for their own datasets. So we decided to make a library to solve this problem, to make it easier to share datasets and also to share metrics. Some metrics right now, some novel metrics, are actually very complex to use, or at least to install and to set up. Metrics to evaluate natural language generation, for instance, can be really complex. And so we decided it would also be good to make them very easy to upload and to share. So this is the Datasets library, with datasets and metrics.

And it has a few really cool features. For instance, there is built-in interoperability with NumPy, Pandas, PyTorch, and TensorFlow 2, so you can use basically any type of framework you want, well, any type of modern framework. It's also made specifically to tackle really large datasets. So if you have gigabytes of data, if you want, for instance, to train your model on Wikipedia, your dataset may be bigger than your RAM. You can use this library, because it has a lot of smart ways to do on-disk serialization: it takes just something like 9 megabytes of RAM, for instance, to work with the full Wikipedia dataset. There's also a lot of smart caching: if you process your dataset once, it's cached, and you won't spend time reprocessing it again.

[17:19] So you can check this one out. Here is an example of how you prepare a training set: this is the full preparation of a training set to train a model on GLUE, with the tokenization and the padding. And you see, this is just about 20 lines of code, and here you're ready to train the model. So it's made to be very efficient and, at the same time, very readable.
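The slide itself isn't reproduced here, but a comparable preparation (a sketch using the MRPC task of GLUE, with column names as they appear in the public glue dataset on the hub) looks roughly like this:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load one GLUE task (MRPC: sentence pairs labeled paraphrase / not paraphrase).
dataset = load_dataset("glue", "mrpc", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    # Tokenize the sentence pairs with truncation and padding.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length")

# map() results are cached on disk, so re-running does not reprocess the data.
dataset = dataset.map(encode, batched=True)

# Expose the columns as PyTorch tensors (NumPy / Pandas / TF work the same way).
dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
print(dataset[0]["input_ids"].shape)
```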

Same as for the models, we also have a hub where you can access and even explore the datasets. There are more than 160 datasets now, including multimodal datasets with images. So you can go to HuggingFace.co, explore all the datasets, see the citation and everything, and even see what's inside a dataset in the web interface. Okay. So same here, you can go and check it out at huggingface.co.

Okay. So these are the three main libraries and I'm really happy to talk about any question you may have on them. And also any question you may have on the first part, the concept, and the history.


Questions


[18:27] Sergii Khomenko: Hey Thomas, how's it going for you?

Thomas Wolf: Hi. Good, good. How's it going for you?

[18:34] Sergii Khomenko: Yeah, it's also pretty nice. I mean, the sun is already a bit out in Munich, but there's still quite a bunch of exciting talks, and you did kick off this conference pretty nicely, because I do believe that natural language processing, or language in general, is one of those indicators of how good we are at understanding machine learning. How do you feel?

[18:54] Thomas Wolf: Yeah, definitely. I think what we've seen in NLP right now, over the last two years probably, is that it has become really what we would have expected it to become from the start, which means really the way to process knowledge and the way to do reasoning, or what we hope would be reasoning. And so when we talk about AGI, I think right now a lot of people think about GPT-3, which is a pure text model. So I think this really shows how NLP is now the most exciting field to be in within AI.

[19:27] Sergii Khomenko: Yeah. And it's kind of funny, right? Because you almost predicted the first question. The first question is actually about GPT-3, and people are asking: hey, since you are an expert in NLP, and your company is driving such a good effort in this regard, what do you think is the advance of GPT-3 in comparison to GPT-2?

[19:47] Thomas Wolf: Yeah, it's a good question. I think, well, one of the problems of GPT-3 is that it's quite difficult to access. So I think we have not really evaluated the capabilities of GPT-3 like we've done for other models, like BERT or GPT-2, just because a lot of academics did not really have full access to be able to investigate what's happening, what you can do with it, to be able to test it fully on a lot of tasks. So it's kind of hard to give you a real answer, right?

What I think we can say is that GPT-3 behaves in some way like an interesting thing, which is retrieval, like a smooth retrieval engine over a really large dataset. It's like having a huge Google search where you can search every page on the internet and smoothly interpolate between all these pages. So I think this is very interesting, and what we see is that you can do some pretty cool applications with that: you can smoothly interpolate to generate code and to generate realistic-looking blog posts. Now, when you talk about real reasoning, meaning, anything like that, I don't think there is really any deep breakthrough in GPT-3, but that's my personal opinion. Yeah.

[21:12] Sergii Khomenko: Yeah, it's really good, at least for me it does resonate, that you separate reasoning from having a big database. Because sometimes it feels like our community, or parts of our community, is going like, "Hey, if the database is bigger, you can just solve all of the problems." And sometimes having a bigger model doesn't really mean that you suddenly have AGI. And it's good to remember that, basically.

Thomas Wolf: Yeah.

Sergii Khomenko: Yeah. So another question from the audience: there are so many different sizes of models, even among transformer-based ones. There is GPT-3, there are Transformers, there is BERT, all kinds of things, and there are also more distilled versions of them. And when a machine learning engineer starts to work on a task, what is a good rule of thumb for making this decision? Obviously there is no one clear answer, but do you have any, I don't know, mental model or framework, especially for beginners who may be working at a company that doesn't have a big machine learning group, where there's just one person who actually makes the calls? How can we help and support such a person?

[22:24] Thomas Wolf: Yeah. That's a good question. I think that's definitely something a lot of people face. And I mean, the very practical thing about our team at Hugging Face, I think, is that we should help people do that, because I understand we are providing a lot of models, we're providing a lot of checkpoints, but it's really hard to actually see the one you should select, the one you should use. So the first thing is that we will try to build some better tools for this and we'll try to help you. But now for the quick answer, I think it's good to keep your good reflexes, your good routine, which is that you should start with something simple. You should start with a smaller model, starting with DistilBERT, for instance, instead of BERT, starting with something small and seeing how far you can go with a compute-efficient model like DistilBERT or DistilGPT-2. You test a little bit with them, and if it's not enough, then you scale up: you start to use bigger models and you try to see if you need something like T5. Yeah. But that's the typical thing. It's the same as always: you should start with logistic regression before you move to something very fancy.

[23:40] Sergii Khomenko: Yeah. Yeah. Definitely. I can only agree with you, because once you start with GPT-3, you can only make it worse from there, and there's no way to explain it. But maybe just a random idea, because you're also building quite a bunch of tools: right now many companies are building this kind of model zoo. Maybe you could also have something like a use-case zoo, which kind of shows what the space of problems is and how it maps to the space of models and solutions. But again, just a random idea.

Thomas Wolf: Oh, that's a very good idea. I think you should come build that with us.

[24:10] Sergii Khomenko: Yep. I will try. Okay. So the next question would be about transformers. We have seen that transformers did reinvent, or revolutionize, the NLP industry. But how do you feel about the future of transformers? There were some papers also about using transformers for vision. Do you see that soon transformers are going to be all around machine learning, or deep learning in particular, or are they going to find their own niche and plateau from there, basically?

[24:43] Thomas Wolf: Yeah. That's a good question. That's a timely question. I think this vision transformer was a very impressive paper, because it actually showed that what we've seen for NLP was also true for vision, which is that transformers are quite efficient and they are really a nice model to scale to large datasets. So as soon as you start talking about transfer learning from a large dataset, then you really want the most efficient architecture that you can scale.

Yeah. I mean, I'm not a computer vision guy, so I don't really know why we are only seeing that today, for instance. I don't know if there was something that got unlocked, but yeah, I think they're nice. I'm pretty model-agnostic, I would say. I think the underlying strategy, transfer learning, is something very deep and important; regarding the precise model that we select to use, I don't care that much. If somebody finds something better than the transformer, I would be happy to switch to these new models. But I think the idea of reusing pre-trained models is a lot deeper, with deep consequences.

[25:56] Sergii Khomenko: Yeah. Yeah. I guess what is also happening in industry right now is not even about using transformers as an architecture, but more about using weakly supervised or unsupervised learning. Because the big part, correct me if I'm wrong, of transformers or BERT, or this direction, was: hey, we don't need to have a big labeled dataset, we can just reuse what we have. And using the same ideas, like learning representations from the image itself, and all of the ideas around contrastive loss, might be something that will spark in other, different directions.

Thomas Wolf: Definitely, yeah.

[26:28] Sergii Khomenko: Cool. One more question from the audience. I will rephrase it, but I hope I'm not going to lose its essence. So, as you suggested in your talk, if you can use transfer learning, you can try to start there, and try to unfreeze some layers and train the model gradually for the task. And the person is asking, essentially: how likely is it that, by doing transfer learning, you end up in a local optimum of convergence? So you don't get something as good as if you had started with a model from scratch and trained it all together. Because, I mean, usually people hope that the more you unfreeze, the better you get, and that it's almost a smooth transition from really out-of-the-box transfer learning to a model that has been trained from scratch. Do you agree with this? Or is there anything that people should be aware of, any tips and tricks of advanced transfer learning in NLP?

[27:37] Thomas Wolf: Yeah. There's a lot of things here. So the best resource is probably the link I had in my first slide, to the long tutorial where we talk about that, and I also made a video, which I could share, earlier this year on this question. But there is a lot of risk of falling into local optima. We see that when you fine-tune BERT, for instance, you can get an on/off behavior: for some random seeds the model just doesn't converge, it stays stuck in a minimum, and for some other random seeds it works really well, which was surprising in the beginning. A lot of people have been investigating what was happening, and you have a lot of tools now to try to avoid that.

You have tools like something related to what you were saying, which is called Mixout, which is an idea a bit like dropout. You probably want to regularize your model: with dropout you regularize towards zero, you cancel some weights, and with these other techniques you regularize towards the pre-trained model. So you can have a smooth interpolation from the pre-trained model to the fine-tuned model. There are a lot of other techniques that try to make some smooth interpolation so that you don't get stuck in local minima. That's a very active area of research. The other option is to do some ensembling, which works very well as well. But yeah, that's still quite an open research question, I think, and we have definitely not really understood everything that is happening here: why it fails sometimes, while it works very well in other cases that look really close. Yeah.

[29:30] Sergii Khomenko: Yeah. Yeah. So maybe, for whoever asked this question, it's also a good topic for research. So once you find some answers, please write an article or a paper, and we're all going to read it later on.

Thomas Wolf: Yeah, definitely.

[29:41] Sergii Khomenko: Cool. Maybe the last question, because it feels like this conversation could go on for a very long time, but we still have a bit of a schedule. So how do you feel about NLP on smartphones? Is it something that is already kind of sorted out, and we already have pretty good distillation, or are there still great breakthroughs ahead of us?

[30:01] Thomas Wolf: Yeah. I think it's a cool area. We actually have a few open-source libraries on Swift Core ML with BERT that you can run on the iPhone. And you have a lot of very recent papers, at EMNLP next week actually, on this: you can quantize models very efficiently. So yeah, I can see a lot of breakthroughs coming there, with very efficient models keeping a lot of the performance that we have on a computer, but on a smartphone. Yeah.

[30:32] Sergii Khomenko: Nice. Awesome. Thank you again for your amazing answers, and also for your talk. I would also like to mention, again, the work that Hugging Face does in open source. It is really helpful, and it does make the entry barrier lower for many people in this industry. So thank you again, and hope to see you somewhere around the internet.

Thomas Wolf: Thanks.

Sergii Khomenko: Yeah. And if you still have some questions and you would like to address them to Thomas directly, remember that we have the speakers room for that: you just go to the website, as AJ explained to you earlier, and there you will find the link to the conversation with Thomas, which is going to start more or less right now. So please go ahead if you want to continue the conversation on NLP.
