Thomas Wolf
Thomas Wolf is co-founder and Chief Science Officer of HuggingFace. His team is on a mission to catalyze and democratize NLP research. Prior to HuggingFace, Thomas gained a Ph.D. in physics, and later a law degree. He worked as a physics researcher and a European Patent Attorney.
An Introduction to Transfer Learning in NLP and HuggingFace
ML conf EU 2020
32 min
In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of Transfer Learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by HuggingFace, in particular our Transformers, Tokenizers and Datasets libraries and our models.
Hi, everyone. Welcome to my talk, and today we're going to talk about transfer learning in NLP. I'll start to talk a little bit about the concept, the history. Then present you the tools that we're developing at Hugging Face. And then I hope you have a lot of questions for me, so a Q and A session.
What is Transfer Learning?
[00:46] Let's start by concepts. What is transfer learning? That's a very good question. Here is the traditional way we do machine learning. Usually when we're faced with the first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our datasets to get the machine learning system that we use, for instance, to work in prediction. Now when we're faced with a second task, usually we start from scratch again. We'll gather another set of data, another dataset. We randomly initialize from our model, and we'll train it from scratch again to get a second learning system. And the same when we're faced with a third task, okay?
We'll have a third dataset, we'll have a third machine learning system, again, initialized from scratch and that we'll use in prediction. So this is not the way we humans do learning. You see, when we're faced with a new task, we reuse all the knowledge we've learned in the past task, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning, okay? So you can see that as having a lot of datasets that we've already used to generate a knowledge base, okay?
[02:03] And now this gives us two main advantages. The first one is that we can learn with just a few data points, because we can interpolate between these data points. And this can help us use some form of data augmentation if you want, naturally. And the second advantage is that we can also leverage all this knowledge to reach better performances. Humans are typically more data-efficient and have better performances than machine learning systems. So transfer learning is one way to try to do the same for statistical learning, for machine learning. So we've done, last summer, a very long tutorial. It was a three-hour tutorial, so you can check out these links. There are 300 slides, a lot of hands-on exercises, and an open source code base. So if you want more information, you really should go there.
Sequential Transfer Learning
[02:53] There are a lot of ways you can do transfer learning, but today I'm going to talk about sequential transfer learning, which is the currently most used flavor, if you want, of transfer learning. So sequential transfer learning, like the name says, it's a sequence of steps, so at least two steps.
The first step is called pre-training, and during this step, you'll try to gather as much data as you can, okay? We'll try to build basically some kind of knowledge base, like the knowledge base we humans build. And the idea is that we can end up with a general purpose model. So there's a lot of different general purpose models. You've probably heard about many of them. Word2Vec and GloVe were the first models leveraging transfer learning. They were word embeddings, but today we use models which have a lot more parameters, which are fully pre-trained, like BERT, GPT, or DistilBERT. And these models, they are pre-trained as general purpose models. They are not focused on one specific task. They can be used on a lot of different tasks.
[03:54] So how do we do that? We do a second step of adaptation, or fine-tuning usually, in which we will select the task we want to use our model for and we'll fine-tune it on this task, okay? So here you have a few examples: text classification, word labeling, question answering.
But let's start by the first step, pre-training. So the way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label. We can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. So you can see that given some context, you will try to maximize the probability of the next word or the probability of the last word, okay?
[04:46] The main thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to train a really high capacity model. Okay? So this is great for many things and, in particular, low resource languages. It's also very versatile, as I told you. You can decompose this probability as a product of probabilities over various views of your text. And this is very interesting from a research point of view.
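The decomposition of a text's probability into per-word probabilities can be sketched in a few lines of Python. This is a toy illustration only, not the training code of any real model: the table `TOY_BIGRAM` and its numbers are invented for the example, and a real language model would condition on the whole preceding context rather than on just the previous word.

```python
import math

# Hypothetical conditional probabilities P(next word | previous word).
# The numbers are invented purely for illustration.
TOY_BIGRAM = {
    ("<s>", "my"): 0.5,
    ("my", "dog"): 0.4,
    ("dog", "is"): 0.5,
    ("is", "cute"): 0.25,
}

def sequence_log_prob(words):
    """Sum the log-probabilities of each word given the previous one.

    This is the log of the product of per-word probabilities, the quantity
    that a language modeling objective tries to maximize.
    """
    log_prob = 0.0
    previous = "<s>"  # start-of-sentence marker
    for word in words:
        log_prob += math.log(TOY_BIGRAM[(previous, word)])
        previous = word
    return log_prob

sentence = ["my", "dog", "is", "cute"]
log_p = sequence_log_prob(sentence)
prob = math.exp(log_p)  # product of the four conditional probabilities
```

No labels are needed here beyond the text itself: each word serves as the target for the context that precedes it.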
Pre-training Transformers models (BERT, GPT…)
[05:13] Now, what do the models look like? So there are two main flavors of the model. They are both transformers because transformers are kind of interesting from a scalability point of view. The first one is called BERT. So to train a BERT model, you will do what we call masked language modeling, which is a de-noising objective. So we take a sentence here, we mask one token, and we try to predict this token back.
So we project all these words into vectors, then we will actually use what is the attention layer in the transformer, which will do a weighted average of these vectors, to end up with vectors as well. But these vectors now depend on the context, which means that the vector associated with the masked token has enough information from the context to be able to predict back, or at least to try to predict back, the missing or masked word. Okay? So this is how BERT is trained. You train this model to predict back the masked token.
[06:14] Now there is another flavor, which is called the autoregressive or causal model. And these models, they look pretty much the same, but you can see that each token here is associated with the next token. So for 'my dog', the token that will be at the end of the column associated with 'dog' will be 'is', the next token. Okay. So we need to tweak the attention here. We have to mask the right context, otherwise the model can just see the label here on the right. So these models, they are inherently less powerful because they cannot use the right context to do their prediction, but they have a lot more training signal here. So they usually train faster.
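The "weighted average of vectors" performed by an attention layer can be sketched as follows. This is a simplified, assumption-laden version: real transformer layers apply learned query, key and value projections and use several heads, all of which are omitted here, and the toy vectors are invented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return a weighted average of `values`.

    The weight of each value is the softmax of the scaled dot product
    between `query` and the matching key.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Invented 2-dimensional vectors standing in for "my", "[MASK]", "is".
vectors = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

# Context-dependent vector for the masked position (index 1).
context_vector, weights = attend(vectors[1], vectors, vectors)
```

The masked position starts with an uninformative vector, yet its output mixes in the vectors of its neighbors, which is what lets a prediction head guess the missing word.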
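Masking the right context and pairing each token with its successor can be sketched like this. The helper names are hypothetical and the sketch only builds the mask and the labels; it does not train anything.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j.

    Only positions j <= i are allowed, so no token can see what is
    on its right, where its own label sits.
    """
    return [[j <= i for j in range(length)] for i in range(length)]

def next_token_pairs(tokens):
    """Pair each token with the token that follows it (its training label)."""
    return list(zip(tokens[:-1], tokens[1:]))

tokens = ["my", "dog", "is", "cute"]
mask = causal_mask(len(tokens))
pairs = next_token_pairs(tokens)
```

Every position except the last yields a training pair, which is one way to see why this flavor gets more training signal per sentence than masking a single token.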
So these are the two main flavors of general purpose model. Now, how do you do the adaptation steps? It's pretty easy. These are the two advantages I'll show you. The first step is that we will remove the pre-training head. So we'll take these pink boxes that I showed you on the previous slides and we'll remove them. And we will replace them with a task-specific head. So if you do text classification, it can just be a linear prediction to the number of classes that you have. If you're doing a more complex task, you can add a full neural network on top of it.
[07:27] If your task is very complex, like structurally different, for instance, you do machine translation, then you have to do more complex things. We'll talk about that a little bit in the next section.
Downstream tasks and Model Adaptation: Quick Examples
A - Transfer Learning for Text Classification
[07:38] So let me show you two main examples of downstream tasks that we can tackle: text classification and generation. So here is an example for text classification, to show you how it works. Let's say we have a sentence here: Jim Henson was a puppeteer. And the task of our machine learning model is to predict if the sentence is true or false. So this is just binary classification, if you want. So the first step will be to convert this input sentence into something that our model can digest. So our models, they can tackle integers, floats, they need numbers.
So the first step will be to convert these strings into numbers. And there is one interesting thing here: our models are made to be trained on, and to process, basically open domain vocabulary, open domain corpora; they're made to be trained on the internet. So we have to handle some uncommon words. Here it is puppeteer, a word that is definitely not very common in English. So the way we handle that is that we will cut this word into prefixes and suffixes until we know every sub-part of the word. Okay. So here, for instance, puppet and -eer are two kind of common prefix and suffix. So they are in our vocabulary and we can then just convert all this into our vocabulary indices. This is now the input of the models we saw two slides ago, BERT and GPT. And as you remember, the output was a vector here. The output before the pink boxes was a vector. And we then just have to project this vector back to two classes. So for instance, we pool them, put them into a linear classifier, and then we get the classes.
[09:15] So this linear classifier will be trained from scratch on our downstream task. Okay. So it has to be quite small in terms of parameters if you want to be data efficient. But this one is pre-trained on the huge data that we've gathered. So this one can be really huge. Okay.
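Cutting an uncommon word into known sub-parts and mapping them to vocabulary indices can be sketched with a greedy longest-match split. The tiny vocabulary `VOCAB` and the `##` continuation marker are assumptions chosen for illustration; production tokenizers learn their vocabulary from data and handle punctuation, casing and unknown characters more carefully.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"[UNK]": 0, "was": 1, "a": 2, "puppet": 3, "##eer": 4}

def split_word(word, vocab):
    """Greedily take the longest known piece, then continue with the rest."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # piece inside a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known sub-part covers this position
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Split text on whitespace, cut each word into pieces, map to indices."""
    pieces = []
    for word in text.lower().split():
        pieces.extend(split_word(word, vocab))
    return pieces, [vocab[p] for p in pieces]

pieces, ids = encode("was a puppeteer", VOCAB)
```

Here the rare word is covered by two common pieces, so the model never sees an out-of-vocabulary token for it.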
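The small task-specific head can be sketched as a pooling step followed by one linear layer and a softmax. Everything numeric below is invented: `token_vectors` stands in for the output of the pre-trained model, mean pooling is just one possible pooling choice, and the head weights would normally be learned on the downstream task.

```python
import math

def mean_pool(token_vectors):
    """Average the per-token vectors into a single sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def linear_head(pooled, weights, bias):
    """One linear layer: a score (logit) per class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    highest = max(logits)
    exps = [math.exp(l - highest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Stand-in for contextual vectors from the pre-trained model (invented).
token_vectors = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2]]

# Task-specific head: 2 classes x 2 hidden dimensions (invented values).
head_weights = [[1.0, 0.0], [0.0, 1.0]]
head_bias = [0.0, 0.0]

pooled = mean_pool(token_vectors)
probs = softmax(linear_head(pooled, head_weights, head_bias))
predicted_class = probs.index(max(probs))
```

The head has only a handful of parameters, while the pre-trained part that produces `token_vectors` carries almost all of the model's capacity.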
So here is a practical example on a classification task called TREC-6. So you can see more details in the slides of our tutorials last year, but basically you will see that just after one epoch, we already have an accuracy over 90%, which means that this is really data efficient. This dataset, this TREC-6 is really a small dataset, 2,500 examples. So this is really data efficient. And when you train for three epochs, you get down to an error rate of 3.6, which is the state of the art. Well, it was the state of the art last year. So you get the two things that we're talking about: data efficiency and high performances. Okay.
B - Transfer Learning for Language Generation
[10:11] Now here's a totally different example. So you see the variety, the diversity of tasks that this model can face. This is a chat-bot setting where we have kind of knowledge base. Our chat-bot is pretending to be an artist with four children who got a cat and likes to watch Game of Thrones. And now he's discussing with a user in an open domain setting. So the user said, "Hey," chat-bot answered, "Hello. How are you?" Then the user said, "I'm good. Thank you. How are you?" And our dialogue agent, our machine learning model is supposed to generate a reply, which makes sense, given their personality.
So you see, this is quite a different task because we have a lot of different types of inputs. We have kind of a knowledge base here. We have a dialogue history as well, and we have this utterance, and even the beginning of the reply, because we've generated the reply word by word. So we have the beginning of the reply that we should also take into account to generate the next word.
[11:05] So how do we handle that? Well, actually we had a paper last year at ACL that you can check out, but basically there are two main ways to handle that. You can concatenate all this in a single input. This is one simple way to tackle this, and it actually works really well. Or, you can duplicate your model and have several parts that process each type of input, and then you need to connect them together with cross attention or another way to connect them.
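The first option, concatenating every type of input into a single sequence, can be sketched as below. The special tokens, the segment labels and the persona sentences are hypothetical choices made for this illustration; they are not taken from the paper mentioned in the talk.

```python
# Hypothetical special tokens marking who speaks (names are assumptions).
BOS, USER, BOT = "<bos>", "<user>", "<bot>"

def build_input(persona, history, reply_so_far):
    """Concatenate persona, dialogue history and partial reply.

    Returns the tokens plus a parallel list naming the segment of each
    token, so a model could still tell the input types apart.
    """
    tokens, segments = [BOS], ["persona"]
    for sentence in persona:
        words = sentence.split()
        tokens += words
        segments += ["persona"] * len(words)
    for speaker, utterance in history:
        words = [speaker] + utterance.split()
        tokens += words
        segments += ["history"] * len(words)
    words = [BOT] + reply_so_far
    tokens += words
    segments += ["reply"] * len(words)
    return tokens, segments

persona = ["I am an artist", "I have four children"]
history = [
    (USER, "Hey"),
    (BOT, "Hello. How are you?"),
    (USER, "I'm good. Thank you. How are you?"),
]

# The reply is generated word by word, so the words produced so far
# are part of the input used to predict the next one.
tokens, segments = build_input(persona, history, ["I'm"])
```

Calling `build_input` again with a longer `reply_so_far` gives the input for the following generation step.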
Okay, this is very different, but the idea is that overall, as you can see here, this is a competition we participated in taking place two years ago now. And you see, we were really leading on the automatic metrics by a strong margin, using this type of transfer learning approach.
Trends and limits of Transfer Learning in NLP
[11:53] So a few words on trends and limits. I'll go fast, but yeah, there is a trend to bigger models. This is quite a big problem because it's narrowing the competition. There are some ways we can reduce the size of these models: distillation, pruning, quantization. I won't spend too much time on this, but this is something we like a lot. These models have generalization problems. Even though they are big, they really have difficulties tackling out-of-domain generalization. And there is basically a limit to text as a medium, which is that a lot of common sense is just not written down in texts. And so this is a limit to what these models can learn. And the main way we can overcome this is to use some database or images, or even to use a human in the loop. OK.
And now there is the last main problem, which is that these models, they are trained just one time and it's very hard to make them learn new knowledge later on because they have this problem of catastrophic forgetting. And so for instance, GPT 3, which was very expensive to train, training costs like 10 million, GPT 3 has no idea what COVID is. Well, COVID is such a strong element of our daily life today. Okay. So this is too bad. So there's a lot of shortcomings. And at Hugging Face, we try to tackle some of these shortcomings.
Hugging Face Inc.
[13:16] So what do we do at Hugging Face? Well, we try to democratize NLP. We started as a chat-bot company, building a game. And we open sourced a lot of tools while we were doing that. And actually these tools caught so much interest in the community that we are now fully focused on catalyzing and democratizing all this research-level work. So we do that in two main ways. The first way is knowledge sharing. That's why I'm talking today. And the second way is to open source a lot of code and to open source some libraries, which let people leverage all these developments and develop better models on top of them.
Open source Libraries
Okay. So let's see a few of our open source libraries, and this will be the open door for the questions in the second part. OK. So the first main library, that you probably know, is called the Transformers library, which is a way to access all these state of the art models. They can be used both for NLU and NLG in many languages. We have today really a lot of models in the library. There are all the models you know, probably, BERT, GPT, and also more interesting models, more recent: DPR, Pegasus, XLM-RoBERTa. And T5 is a really huge model. So there are a lot of them, very simple to use. We can talk about that in the questions. And we even have a model hub where people can share their models. We have more than 3000 models in many, many languages.
[14:51] So if you want to use a model for Finnish or German you have like 132 different models you can expect and you can play with them, even in the web interface, you can try the model and see how they behave. This is at
Now we open sourced more recently a second library, called Tokenizers, which does this first part that I was talking about, like splitting the string into integers. So why did we do that? This was really a bottleneck in terms of speed. And so we decided to use some very low level Rust code, super fast, to solve this problem. So this is now really a very fast step. It's available in Python, Node, Rust. Rust is a great language.
[15:38] And more recently, a few months ago, we open sourced a third library, which is called Datasets. So Datasets is our third library, and this is to tackle one last problem we discovered, which is that datasets themselves, they are kind of hard to access for people. They are hard to share, and we spend a lot of time rewriting the same processing code that people have actually already written a lot, for their own datasets. So we decided to make a library to solve this problem, make it easier to share datasets and also to share metrics. Some metrics right now, some novel metrics, are actually very complex to use, or at least to install and to set up. Metrics, for instance, to evaluate natural language generation can be really complex. And so we decided it could be also good to make something very easy to share them, to upload them. So this is the Datasets library with datasets and metrics.
And it has a few really cool features. For instance, there is built-in interoperability with NumPy, Pandas, PyTorch, and TensorFlow 2. So you can use basically any type of framework if you want. Well, any type of modern framework. It's also made specifically to tackle really large datasets. So if you have gigabytes of data, if you want to, for instance, train your model on Wikipedia, your dataset may be, for instance, bigger than your RAM. You can use this library because it has a lot of smart ways to do on-disk serialization. So you just take 9 megabytes, for instance, to train on Wikipedia. There's a lot of smart caching. If you process your dataset once, it's already good and you won't spend time reprocessing it again.
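The general idea of keeping a dataset on disk and reading only what is needed can be sketched with Python's standard `mmap` module. This is not how the Datasets library is implemented internally; the fixed-width record format, the file name and the helper functions are assumptions made only to show why a file larger than RAM can still be accessed cheaply.

```python
import mmap
import os
import struct
import tempfile

# Toy on-disk format: one little-endian 64-bit integer per record.
RECORD = struct.Struct("<q")

def write_records(path, values):
    """Serialize the values to disk, one fixed-width record each."""
    with open(path, "wb") as handle:
        for value in values:
            handle.write(RECORD.pack(value))

def read_record(path, index):
    """Read one record through a memory map, without loading the whole file."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            offset = index * RECORD.size
            return RECORD.unpack_from(view, offset)[0]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_dataset.bin")
    write_records(path, range(1000))
    value = read_record(path, 123)
```

Because the operating system pages data in on demand, the memory used depends on what is read rather than on the size of the file.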
[17:19] So you can check this one. Here is an example of how you prepare. This is the full preparation of a training set to train a model on GLUE, with the tokenization, with the padding. And you see, this is just like 20 lines of code and here you're ready to train this model. So it's made to be very efficient and, at the same time, for everything to be very visible.
So same as the models. We also have a hub where you can access and even explore the datasets. So there are more than 160 datasets now. There are also multimodal datasets with images. And so you can go to, explore all the datasets, you can see the citation, everything, and you can even see what's inside the dataset in the web interface. Okay. So same here. You can go and check it out at
Okay. So these are the three main libraries and I'm really happy to talk about any question you may have on them. And also any question you may have on the first part, the concept, and the history.
[18:27] Sergii Khomenko: Hey Thomas, how's it going for you?
Thomas Wolf: Hi. Good, good. How's it going for you?
[18:34] Sergii Khomenko: Yeah, it's also pretty nice. I mean, sun is already a bit like out in Munich, but it's still quite a bunch of exciting talks and you did kick off this conference pretty nicely because I do believe that natural language processing, or language in general, is like one of those indicators, how good we are at understanding those machine learning. How do you feel?
[18:54] Thomas Wolf: Yeah, definitely. I think what we've seen in NLP right now, I guess over the last two years, probably, is that it has become really what we would have expected it to become from the start, which means really the way to process knowledge and the way to kind of do reasoning, or what we hope would be like reasoning. And so when we talk about AGI, I think right now a lot of people think about GPT 3, which is a full text model. So I think this is really an impressive thing about how NLP is now the most exciting field to be in AI.
[19:27] Sergii Khomenko: Yeah. And it's kind of funny, right? Because you are almost predicting the first question, right? And the first question is actually about GPT 3 and people are asking like: Hey, since you are an expert in NLP, and your company's driving such a good effort in this regard, what do you think is the advantage of GPT 3 in comparison to GPT 2?
[19:47] Thomas Wolf: Yeah, it's a good question. I think, well, one of the problems of GPT 3 is that it's quite difficult to access it. So I think we have not really evaluated the capability of GPT 3 like we've done for other models, like BERT or GPT 2, just because a lot of academics did not really have full access to be able to investigate what's happening, what you can do with it, to be able to test it fully on a lot of tasks. So it's kind of hard to give you really an answer, right?
What I think is that GPT 3 really behaves, in some way, like an interesting thing, which is a retrieval, like a smooth retrieval engine over a really large dataset. So it's like having a huge Google search where you can search every page on the internet and be able to smoothly interpolate between all these pages. So I think this is very interesting, and what we see is that you can do some pretty cool applications with that. Like you can smoothly interpolate to generate code and to generate realistic looking blog posts. Now, when you talk about really reasoning, meaning, anything like that, I don't think there is really any deep breakthrough in GPT 3, but that's my personal opinion. Yeah.
[21:12] Sergii Khomenko: Yeah, it's really good, at least for me it does resonate, that you separate reasoning from having a big database. Because sometimes you get the feeling that our community, or parts of our community, is going like, 'Hey, if the database is bigger, you can just solve all of the problems.' And sometimes having a bigger model doesn't really mean that you suddenly have AGI. And it's good to remember, basically.
Thomas Wolf: Yeah.
Sergii Khomenko: Yeah. So another question from the audience is also: There are so many different sizes of the model. Even a transformer base. There is GPT 3, there are transformers, there is BERT, all kinds of things, there's also more distilled versions of them. And when a machine learning engineer kind of starts to work on the task, what is a good rule of thumb for how to make this decision? And obviously there is no one clear answer. But do you have any, I don't know, a mental model or like a framework, especially for beginners who may be working at a company that doesn't have a big machine learning group, but they're like the one person who actually makes the calls. How can we help and support such a person?
[22:24] Thomas Wolf: Yeah. That's a good question. I think that's definitely something a lot of people face. And I mean, the very practical thing about our team at Hugging Face, I think, is that we should help people do that, because I understand we are providing a lot of models. We're providing a lot of checkpoints, but it's really hard to actually see the one you should select, the one you should use. So, the first thing is that we will try to build some better tools on this and we'll try to help you. But now for the quick answer, I think it's good to keep your good reflex, like your good routine, which is that you should start with something simple. Like you should start with a smaller model, like starting with DistilBERT, for instance, instead of BERT, like starting with kind of something small and see how far you can go with this compute-efficient model, like DistilBERT or GPT2. You test a little bit with them and if it's not enough, then you scale up. Then you start to use bigger models and you try to see if you need like a T5 or something like that. Yeah. But that's the typical thing. It's the same. You should always start with like logistic regression before you move to something very fancy.
[23:40] Sergii Khomenko: Yeah. Yeah. Definitely. I can only agree with you. Because once you start with GPT 3, you can only make it worse from there, with no way to explain it. But maybe just like a random idea, because you're also building quite a bunch of tools. So right now many companies are building this, like a model zoo. Maybe you can also have something like a use case zoo, that's kind of showing what the space of problems is, and how it maps to the space of models and solutions. But again, just like a random idea.
Thomas Wolf: Oh, that's a very good idea. I think you should come build that with us.
[24:10] Sergii Khomenko: Yep. I will try. Okay. So the next question would be about transformers. We have seen that transformers did reinvent or revolutionize the NLP industry overall. But how do you feel about the future of transformers? There were some papers also about using transformers for vision. Do you see that soon transformers are going to be all around machine learning, or deep learning in particular, or are they going to find their own niche and go from there, basically?
[24:43] Thomas Wolf: Yeah. That's a good question. That's a timely question. I think this vision transformer was a very impressive paper because it actually showed that what we've seen for NLP was also true for vision, which is that transformers are quite efficient and they are really a nice model to scale to large data sets. So, as soon as you start talking about transfer learning from a large data set, then you really want the most efficient architecture that you can train at scale.
Yeah. I mean, I'm not a computer vision guy, so yeah. I don't really know why we are only seeing that today, for instance. So I don't know if there was something unlocked, but yeah, I think they're nice. I'm pretty model-agnostic, I would say. I think the underlying strategy, like transfer learning, has something very deep and important. Regarding the precise model that we select to use, I don't care that much. If somebody finds something better than transformers, I would be happy to switch to these new models. But I think the idea of reusing pre-trained models is a lot deeper, with deep consequences.
[25:56] Sergii Khomenko: Yeah. Yeah. I guess what also is happening in industry right now is not even using transformers as an architecture, but more like using weakly supervised or unsupervised learning. Because the big part, correct me if I'm wrong, of a transformer or BERT, or this direction, was: hey, we don't need to have a big labeled data set, we could just reuse what we have. And using the same ideas of, hey, can we learn a representation from the image itself, and all the ideas of contrastive loss, might be something that will spark in other different directions.
Thomas Wolf: Definitely, yeah.
[26:28] Sergii Khomenko: Cool. One more question from the audience is basically around, I will rephrase it, but I hope I'm not going to lose the sense. So what you suggested also in your talk: if you can use transfer learning, you can try to start there, and try to unfreeze some layers and train it gradually, basically, to get to the task. And the person is asking essentially, how likely is it that by doing transfer learning you would end up in just like a local optimum of convergence? So you don't get something as good in comparison to starting with a model from scratch and training it all together. Because I mean, usually people hope that the more you're unfreezing, the better you're getting, and it's almost like a smooth transition from really out of the box transfer learning into a model that has been trained from scratch. Do you agree with this one? Or is there anything that people should be aware of, like any tips and tricks basically of advanced transfer learning in NLP?
[27:37] Thomas Wolf: Yeah. There's a lot of things here. So the best resource, I mean, is probably, I had a link in my first slide, the long tutorial where we talked about that. And I also made a video earlier this year on this question that I could share. But there is a lot of risk of falling into local optima. So we see that when you fine-tune BERT, for instance, you can get this kind of behavior where, for some random seeds, the model just doesn't converge. It just stays stuck in a minimum. And for some other random seeds, it really works well, which was surprising in the beginning. And a lot of people have been investigating what was happening, and you have a lot of tools now to try to avoid that.
You have tools, something related to what you were saying, like one called Mixout, where the idea is a bit like dropout. You probably want to regularize your model; with dropout you regularize towards zero, you cancel some weights. And with this other thing, you can regularize towards the pre-trained model. So you can have a smooth interpolation from the pre-trained model to the model that's fine-tuned. But there's a lot of other techniques that try to make some smooth interpolation to not be stuck in local minima. That's a very active area of research. And the other option is to try to do some ensembles, which work very well as well. But yeah, that's still quite an open research question, I think. And we definitely have not really understood everything that was happening here: why it's failing sometimes, while it's working very well in other cases that look really close. Yeah.
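The contrast between regularizing towards zero and regularizing towards the pre-trained weights can be sketched as follows. This is a simplified reading of the idea described in the answer, applied to a flat list of weights; the rescaling by 1 / (1 - p) follows the usual dropout convention and is an assumption here, as are all the numbers.

```python
import random

def mixout(current, pretrained, p, rng):
    """Randomly reset weights to their pre-trained values.

    Each weight is replaced by its pre-trained value with probability p.
    Kept weights are rescaled so the expected result equals `current`.
    """
    mixed = []
    for w, u in zip(current, pretrained):
        if rng.random() < p:
            mixed.append(u)  # fall back to the pre-trained weight
        else:
            mixed.append(u + (w - u) / (1.0 - p))
    return mixed

def dropout(current, p, rng):
    """Dropout seen as the special case where the target is all zeros."""
    return mixout(current, [0.0] * len(current), p, rng)

rng = random.Random(0)
pretrained = [1.0, -2.0, 0.5, 3.0]  # invented pre-trained weights
current = [1.5, -1.0, 0.0, 2.0]     # invented weights during fine-tuning

mixed = mixout(current, pretrained, p=0.5, rng=rng)
unchanged = mixout(current, pretrained, p=0.0, rng=rng)
```

With `p=0.0` nothing is reset and the weights come back as they were; with a larger `p` more of the model is pulled back to its pre-trained starting point at each step.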
[29:30] Sergii Khomenko: Yeah. Yeah. So maybe for whoever did ask this question, it's like a good topic also for research. So once you find some answers, please write an article or paper and we're all going to read it later on.
Thomas Wolf: Yeah, definitely.
[29:41] Sergii Khomenko: Cool. Maybe like the last question, because it feels like this conversation can be going on for a very long time, but we still have a bit of a schedule. So how do you feel about NLP and smartphones? Is it something that is already kind of like sorted out and we already have pretty good distilling or are there still great breakthroughs ahead of us?
[30:01] Thomas Wolf: Yeah. I think it's a cool area. We have a few open source libraries, actually, on Swift Core ML with BERT that you can run on the iPhone. And you have a lot of very recent papers, and talks next week at an NLP event on that, showing you can quantize models very efficiently. So yeah. I can see a lot of breakthroughs coming there with very efficient models, keeping a lot of the performance that we have on PC, on the computer, but on smartphone. Yeah.
[30:32] Sergii Khomenko: Nice. Awesome. Thank you again for your amazing answers, and also your talk. But also I would like to mention, again, the work that Hugging Face does in open source. It is really helpful and it does make the entry barrier lower for many people in this industry. So thank you again and hope to see you somewhere around the internet.
Thomas Wolf: Thanks.
Sergii Khomenko: Yeah. Yeah. And if you still have some questions, and you would like to address them to Thomas directly, remember that we have the speakers room for that. You just go to the website, as AJ explained to you earlier, and there you will find the link to the conversation with Thomas, that's going to be starting more or less right now. So please go ahead if you want to continue the conversation on NLP.