In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of transfer learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by HuggingFace, in particular our Transformers, Tokenizers and Datasets libraries and our models.
An Introduction to Transfer Learning in NLP and HuggingFace

AI Generated Video Summary
Transfer learning in NLP allows for better performance with minimal data. BERT is commonly used for sequential transfer learning. Models like BERT can be adapted for downstream tasks such as text classification. Handling different types of inputs in NLP involves concatenating or duplicating the model. Hugging Face aims to tackle challenges in NLP through knowledge sharing and open sourcing code and libraries.
1. Introduction to Transfer Learning in NLP
Today, we're going to talk about transfer learning in NLP. In transfer learning, we reuse knowledge from past tasks to bootstrap our learning. This approach allows us to learn with just a few data points and achieve better performance. At Hugging Face, we're developing tools for transfer learning in NLP.
Hi, everyone. Welcome to my talk. Today, we're going to talk about transfer learning in NLP. I'll start by talking a little bit about the concepts and the history, then present the tools we are developing at Hugging Face. And then I hope you have a lot of questions for me.
So, a Q&A session. OK. Let's start with the concepts. What is transfer learning? That's a very good question. So here is the traditional way we do transfer learning. Sorry, this is the traditional way we do machine learning. Usually, when we're faced with a first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our dataset to get the machine learning system that we'll use to predict, for instance, in production.
Now, when we're faced with a second task, usually we'll gather another set of data, another dataset, we'll randomly initialize our model, and we'll train it from scratch again to get the second learning system. And the same when we're faced with a third task: we'll have a third dataset and a third machine learning system, again initialized from scratch, that we'll use in production.

This is not the way we humans do learning. Usually, when we're faced with a new task, we reuse all the knowledge we've learned in past tasks, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning. You can see that as having a lot of data, a lot of datasets, that we've already used to build a knowledge base. This gives us two main advantages. The first one is that we can learn from just a few data points, because we can interpolate between these data points; this gives us some form of data augmentation, if you want, naturally. And the second advantage is that we can also leverage all this knowledge to reach better performance. So humans are typically more data efficient and have better performance than machine learning systems. Transfer learning is one way to try to do the same for statistical learning, for machine learning. Last summer we did a very long tutorial on this, a three-hour tutorial.
2. Sequential Transfer Learning with BERT
Today, we'll discuss sequential transfer learning, which involves pre-training and then fine-tuning a general-purpose model like BERT. Language modeling is a self-supervised pre-training objective that maximizes the probability of the next word. This approach doesn't require annotated data and is versatile, making it useful for low-resource languages. Transformers like BERT are commonly used for transfer learning in NLP.
So you can check out these links. There are 300 slides, a lot of hands-on exercises, and an open-source code base. So if you want more information, you really should go there.
So there are a lot of ways you can do transfer learning, but today I'm going to talk about sequential transfer learning, which is currently the most used flavor of transfer learning. Sequential transfer learning, like the name says, is a sequence of steps, at least two steps. The first step is called pre-training. During this step, you try to gather as much data as you can and basically build some kind of knowledge base, like the knowledge base we humans build. The idea is that we end up with a general-purpose model.
There are a lot of different general-purpose models. You've probably heard about many of them: Word2Vec and GloVe were the first models leveraging transfer learning; they were word embeddings. But today we use models which have a lot more parameters and are fully pre-trained, like BERT, GPT or DistilBERT. These models are pre-trained as general-purpose models: they are not focused on one specific task, but they can be used on a lot of different tasks. How do we do that? We do a second step of adaptation, or fine-tuning usually, in which we select the task we want to use our model for and fine-tune it on this task. Here you have a few examples: text classification, word labeling, question answering.
But let's start with the first step, pre-training. The way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label: we can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. You can see that as, given some context, trying to maximize the probability of the next word, or the probability of a masked word. The nice thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to train a really high-capacity model. This is great for many things, and in particular for low-resource languages. It's also very versatile: as I told you, you can decompose this probability as a product of probabilities over various views of your text, and this is very interesting from a research point of view.
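To make the objective concrete, here is a minimal sketch of how this factorized log-likelihood is computed in practice with the Transformers library; the checkpoint name is just an illustrative default, not necessarily the one from the talk.

```python
# Minimal sketch of the language-modeling objective:
# maximize log p(text) = sum_t log p(w_t | w_1 ... w_{t-1}).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transfer learning is changing NLP", return_tensors="pt")
# Passing the input ids as labels makes the library compute the next-word
# cross-entropy, i.e. the average negative log-probability above.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```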
Now, what do the models look like? There are two main flavors of models, and they are both Transformers, because Transformers are very interesting from a scalability point of view. The first one is called BERT. To train a BERT model, you do what we call masked language modeling, which is a denoising objective.
3. Training BERT and Adapting for Downstream Tasks
BERT is trained by predicting back the masked token. Another flavor is the auto-regressive or causal model, which associates each token with the next token. These models are less powerful but train faster. To adapt BERT, the pre-training head is removed and replaced with a task-specific head. Examples of downstream tasks include text classification and generation.
So we take a sentence, we mask one token, and we try to predict this token back. We project all these words into vectors, and then we use what is called the attention layer in the Transformer, which does a weighted average of these vectors to end up with vectors again. But these vectors now depend on the context, which means that the vector associated with the mask has enough information from the context to be able to predict back, or at least to try to predict back, the missing, masked word, okay?
So this is how BERT is trained: you train the model to predict back the masked token. Now, there is another flavor, which is called the auto-regressive or causal model. These models look pretty much the same, but you can see that each token here is associated with the next token. So with "my dog", the target at the top of the column associated with "dog" is the next token, okay? So we need to tweak the attention here: we have to mask the right context, otherwise the model can just see the label here on the right. These models are inherently less powerful, because they cannot use the right context to do their prediction, but they get a lot more training signal, so they usually train faster. So these are the two main flavors of models, the two main frameworks.
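As a hedged illustration of these two flavors (the checkpoints named below are the usual public ones, not necessarily the exact models from the slides), the two pre-training objectives map directly onto two ready-made pipelines:

```python
from transformers import pipeline

# BERT-style: denoising / masked language modeling
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("My dog is very [MASK]."))

# GPT-style: auto-regressive / causal language modeling
generator = pipeline("text-generation", model="gpt2")
print(generator("My dog is", max_length=10))
```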
Now, how do you do the adaptation step? It's pretty easy. The first step is that we remove the pre-training head: we take these pink boxes that I showed you on the previous slides, remove them, and replace them with a task-specific head. If you do text classification, it can just be a linear projection to the number of classes that you have. If you're doing a more complex task, you can add a full neural network on top. If your task is very complex, like structurally different, for instance machine translation, then you have to do more complex things. We'll talk about that a little bit in the next section.
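In code, the head swap is mostly handled for you; here is a minimal sketch, assuming a binary classification task and an illustrative checkpoint name:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pre-trained body, the "knowledge base"
    num_labels=2,         # fresh linear head, randomly initialized for our task
)
# The library drops the pre-training (pink-box) head and attaches a new
# classification head, which is the only part trained from scratch.
```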
So let me show you two main examples of downstream tasks that we can tackle: text classification and generation. Here is an example of text classification to show you how it works. Let's say we have the sentence "Jim Henson was a puppeteer", and the task of our machine learning model is to predict if the sentence is true or false. So this is just binary classification, if you want, okay? The first step is to convert this input sentence into something that our model can digest.
4. Model Training and Performance
Our models are trained to process open domain vocabulary and corpus. Uncommon words are handled by cutting them into prefixes and suffixes. The model's output is projected back to two classes using a linear classifier. This classifier is trained from scratch on the downstream task. In a practical example, we achieved over 90% accuracy after just one epoch on a small dataset. The model's versatility is demonstrated in a chatbot setting with a knowledge base.
Our models can only take numbers, integers or floats, so the first step is to convert these strings into numbers. One interesting thing here is that our models are made to be trained on and to process basically open-domain vocabulary, open-domain corpora; they are made to be trained on the internet. So we have to handle some uncommon words.
Here, "puppeteer" is a word that is definitely not very common in English. The way we handle that is that we cut this word into prefixes and suffixes until we know every part of the word. So here, for instance, "puppeteer" is split into pieces that are common enough to be in our vocabulary, and we can then just convert all of this into vocabulary indices. This is now the input of the model we saw two slides ago, the BERT or the GPT. And as you remember, the output before the pink boxes was a vector, and we just have to project this vector back onto our two classes.
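Here is a small sketch of that sub-word splitting with a Transformers tokenizer; the exact pieces depend on the vocabulary of the checkpoint, so the split shown in the comment is only what a typical BERT vocabulary produces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Jim Henson was a puppeteer"))
# e.g. ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
print(tokenizer.encode("Jim Henson was a puppeteer"))
# the corresponding vocabulary indices, with special tokens added
```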
So for instance, we put it through a linear classifier, and then we get the classes. This linear classifier is trained from scratch on our downstream task, okay? So it has to be quite small in terms of parameters if you want to be data efficient, but the rest of the model is pre-trained on the huge corpus that we've gathered, so it can really be reused. Here is a practical example on a classification task called TREC-6. You can see more details in the slides of our tutorial last year, but basically you'll see that just after one epoch, we already have an accuracy over 90%, which means that this is really data efficient. This TREC-6 dataset is really a small dataset, 2,500 examples, so this is really data efficient. And when you train for three epochs, you get down to an error rate of 3.6, which is the state of the art. Well, it was the state of the art last year. So you get the two things we were talking about: data efficiency and high performance.
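Here is a minimal, hypothetical fine-tuning loop under the same assumptions (toy batch, untuned hyper-parameters), just to show that only the small head is learned from scratch while the pre-trained encoder is fine-tuned for a few epochs:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

batch = tokenizer(["Jim Henson was a puppeteer"], return_tensors="pt")
labels = torch.tensor([1])  # 1 = "true" in our toy binary task

for epoch in range(3):                       # a few epochs are usually enough
    outputs = model(**batch, labels=labels)  # cross-entropy on the new head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```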
Okay. Now here's a totally different example. So you see the variety, the diversity of tasks that this model can face. This is a chatbot setting where we have kind of a knowledge base.
5. Handling Different Types of Inputs
The chatbot is pretending to be an artist with four children who got a cat and likes to watch Game of Thrones. The dialogue agent needs to generate a reply that makes sense given the personality. This is a different task because of the various types of inputs involved, such as the knowledge base, the dialogue history, and the beginning of the reply being generated. There are two main ways to handle this: concatenating all the inputs into a single input, or duplicating the model and processing each type of input separately.
The chatbot is pretending to be an artist with four children who got a cat and likes to watch Game of Thrones. And now it's discussing with a user in an open-domain setting. So the user said "Hey", the chatbot answered "Hello, how are you?", and the user said "I'm good, thank you. How are you?" And our dialogue agent, our machine learning model, is supposed to generate a reply which makes sense given the personality, okay? So you see, this is quite a different task, because we have a lot of different types of inputs. We have kind of a knowledge base here, we have a dialogue history as well, and we have the beginning of the reply itself, because we're generating the reply word by word, so we also have to take the words already generated into account to produce the next one.

So how do we handle that? Well, we had a paper at ACL last year that you can check out, but basically there are two main ways to handle this. You can concatenate all of this into a single input; this is one simple way to tackle it, and it actually works really well. Or you can duplicate your model and have several parts that each process one type of input, and then you need to connect them together, with cross-attention or some other way to connect them. Okay. These are very different, but the idea is that overall, as you can see here, in the ConvAI2 dialogue competition we participated in two years ago now, we were really leading on the automatic metrics by a strong margin using this type of transfer learning approach.
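To give an idea of the simpler of the two strategies, here is a hedged, framework-free sketch of concatenating the persona, the history, and the reply so far into one flat sequence with segment markers; the segment names are hypothetical, not the exact special tokens from the paper:

```python
persona = ["i am an artist", "i have four children", "i got a new cat"]
history = ["hello, how are you?", "i'm good, thank you. how are you?"]
reply_so_far = ["great", ","]

# One flat token sequence plus a parallel "segment" sequence telling the
# model which part of the input each token comes from.
words, segments = [], []
for sentence in persona:
    for w in sentence.split():
        words.append(w); segments.append("<persona>")
for turn in history:
    for w in turn.split():
        words.append(w); segments.append("<history>")
for w in reply_so_far:
    words.append(w); segments.append("<reply>")

# Both sequences are then converted to ids and fed to the model,
# which predicts the next word of the reply.
print(list(zip(words, segments))[-5:])
```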
6. Trends and Limitations in NLP
There is a trend to bigger models, which is narrowing the competition. However, there are ways to reduce the size of models through distillation, pruning, and quantization. These models have difficulties with out-of-domain generalization and are limited by the lack of common sense knowledge in text. Another challenge is making models learn new knowledge without forgetting previous knowledge. Despite these shortcomings, Hugging Face aims to tackle these challenges by democratizing NLP through knowledge sharing and open sourcing code and libraries, such as the Transformers Library.
A few words on trends and limits; I'll go fast. There is a trend towards bigger models. This is quite a big problem, because it's narrowing the competition. There are ways to reduce the size of these models: distillation, pruning, quantization. I won't spend too much time on this, but it's something we like a lot.
These models also have a generalization problem. Even though they are big, they really have difficulties with out-of-domain generalization. And there is basically a limit to text as a medium, which is that a lot of common sense is just not written down in text, and so there is a limit to what these models can learn from text alone. The main ways to overcome this are to use some database or images, or even to use humans in the loop.
And there is one last main problem, which is that these models are trained just once, and it's very hard to make them learn new knowledge later on, because they have this problem of catastrophic forgetting. So, for instance, GPT-3, which was very expensive to train, training cost something like 10 million dollars, has no idea what COVID is, while COVID is such a strong element of our daily life today, okay? So this is too bad. So there are a lot of shortcomings, and at Hugging Face, we try to tackle some of these shortcomings.
So what do we do at Hugging Face? Well, we try to democratize NLP. We started as a chatbot company, building a game, and we open-sourced a lot of tools while we were doing that. These tools caught so much interest in the community that we are now fully focused on catalyzing and democratizing all this research-level work. We do that in two main ways. The first way is through knowledge sharing; that's why I'm talking today. And the second way is to open-source a lot of code and libraries, which lets people leverage all these developments and build better models on top of them.

Okay, so let's see a few of our open-source libraries; this will be the open door for the questions in the second part. The first main library, which you probably know, is called the Transformers library. It is a way to access all the state-of-the-art models, and they can be used both for NLU and NLG, in many languages. We have today really a lot of models in the library. There are all the models you probably know, BERT, GPT, and also more recent, interesting models like DPR and Pegasus. XLM-RoBERTa is a great multilingual model by Alexis and team. There is also T5, which is a really huge model. So there are a lot of them, and they are very simple to use.
7. Model Hub, Tokenizer, and DataSets Libraries
We have a model hub where people can share over 3,000 models in many languages. We also have a web interface to try out the models. We recently open-sourced a second library called Tokenizers to solve the speed bottleneck in string splitting. We also released a third library called Datasets to make it easier to share datasets and metrics. Datasets has interoperability with popular frameworks and is designed for large datasets.
We can talk about that in the questions. We even have a model hub where people can share their models. We have more than 3,000 models in many, many languages, so if you want to use a model for Finnish or German, there are like 132 different models you can pick from and play with. Even in the web interface, you can try the models and see how they behave. This is at huggingface.co.
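Loading one of these community models from the hub is a one-liner; the identifier follows the "organization/model-name" pattern, and the German checkpoint below is just one example of what you can find by browsing huggingface.co:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "dbmdz/bert-base-german-cased"  # example hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```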
Now, we also recently open-sourced a second library called Tokenizers, which does this first step I was talking about, splitting the string into integers. Why did we do that? This was really a bottleneck in terms of speed, so we decided to write some very low-level, super fast Rust code to solve this problem. So this is now a really fast step. It's available in Python and Node. Rust is a great language.
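Here is a small sketch of the Tokenizers library in action, training a BPE tokenizer on a local corpus; the file name is a placeholder and the API shown is the one from recent versions of the library:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=30_000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder file

print(tokenizer.encode("Jim Henson was a puppeteer").tokens)
```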
And more recently, a few months ago, we open-sourced a third library, which is called Datasets. This is to tackle one last problem we discovered, which is that datasets themselves are kind of hard for people to access and hard to share, and we spend a lot of time rewriting the same processing code that other people have actually already written for the same dataset. So we decided to make a library to solve this problem, to make it easier to share datasets, and also to share metrics. Some novel metrics are actually very complex to use, or at least to install and set up; metrics to evaluate natural language generation, for instance, can be really complex. So we decided it would also be good to make something very easy to upload them and share them. So this is the Datasets library, with datasets and metrics.

It has a few really cool things. For instance, there is built-in interoperability with NumPy, pandas, PyTorch, and TensorFlow, so you can basically use any modern framework you want. It's also made specifically to tackle really large datasets. If you have gigabytes of data, if you want to, for instance, train your model on Wikipedia, your dataset may be bigger than your RAM, and you can still use this library, because it has a lot of smart ways to do on-disk serialization and memory mapping: it takes just nine megabytes of RAM, for instance, to work with the full Wikipedia dataset. There is also a lot of smart caching: if you process your dataset once, it's cached, and you won't spend time reprocessing it again. So you can check this one out. Here is an example of how you prepare data: the full preparation of a training set to train a model on GLUE, with the tokenization and the padding.
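Roughly what that slide shows, as a hedged sketch (the GLUE task and checkpoint names are illustrative, not necessarily the ones on the slide):

```python
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

dataset = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)   # processed once, then cached
dataset.set_format(type="torch",
                   columns=["input_ids", "attention_mask", "label"])
# dataset["train"] can now go straight into a PyTorch DataLoader.
```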
8. Model Training, Libraries, and Questions
The model training process is efficient and provides visibility. The hub allows access to datasets, including multi-modal datasets with images. Visit HuggingFace.co to explore the datasets and view their contents. These three main libraries are available for discussion, along with any questions about the concept and history.
And you see, this is just like 20 lines of code, and then you're ready to train your model. So it's made to be very efficient, and at the same time everything is very visible. Same as for the models, we also have a hub where you can access and even explore the datasets. There are like 160 datasets now, including multi-modal datasets with images. So you can go to huggingface.co and explore all the datasets: you can see the citation and everything, and you can even see what's inside a dataset in the web interface.
Okay. So same here. You can go and check it out at HuggingFace.co. Okay. So these are the three main libraries, and I'm really happy to talk about any question you may have on them. And also, any question you may have on the first part, the concept, the history. Thank you.
QnA
NLP Advancements and Choosing Models
Hey Thomas, how is it going for you? Good to hear! NLP has become the most exciting field in AI. GPT-3 is an impressive model, but its full capabilities are difficult to evaluate. It excels in retrieval tasks and can generate code and realistic blog posts. However, it lacks deep reasoning abilities. Having a big database doesn't equate to AGI. When choosing a model, start with something simple like a distilled version and scale up if needed.
Hey Thomas, how is it going for you?
Hi. Good, good. How is it going for you?
Yeah, it's also pretty nice. I mean, the sun is already a bit out in Munich, but there are still quite a bunch of exciting talks. And you did kick off this conference pretty nicely, because I do believe that natural language processing, or language in general, is one of those indicators of how good we really are at machine learning.
How do you feel?
Yeah, definitely. I think what we've seen in NLP over the last two years, probably, is that it has become really what we would have expected it to become from the start, which means really the way to process knowledge and the way to kind of do reasoning, or what we hope would be like reasoning. And so when we talk about AGI, I think right now a lot of people think about GPT-3, which is a pure text model. So I think it's really impressive how NLP is now the most exciting field to be in within AI.
Yeah, and it's kind of funny because you're almost predicting the first question right. And the first question is actually about GPT-3 and people asking like, hey, since you're an expert in NLP and your company is driving such good efforts in this regard, what do you think is the advance of GPT-3 in comparison to GPT-2?
Yeah, it's a good question. I think, well, one of the problems of GPT-3 is that it's quite difficult to access, so we have not really evaluated the capabilities of GPT-3 the way we've done it for other models like BERT or GPT-2, just because a lot of academics did not really have full access to investigate what's happening, what you can do with it, and to test it fully on a lot of tasks. So it's kind of hard to give you a real answer, right? What I think is that GPT-3 behaves in some ways like an interesting thing, which is retrieval, like a smooth retrieval engine over a really large dataset. It's like having a huge Google Search where you can search every page on the internet and smoothly interpolate between all these pages. So I think this is very interesting, and we see that you can do some pretty cool applications with that: you can smoothly interpolate to generate code, or to generate realistic-looking blog posts. Now, when you talk about really reasoning, meaning and things like that, I don't think there is any deep breakthrough in GPT-3, but that's my personal opinion.
Yeah, it really resonates with me that you separate reasoning from having a big database. Because sometimes parts of our community go like, hey, if the database is bigger, you can just solve all of the problems. And having a bigger model doesn't really mean that you suddenly have something like AGI; it's good to remember that, basically.
Yeah. So another question from the audience would be: there are so many different sizes of models, even transformer-based; there's GPT-3, there's BERT, RoBERTa, all kinds of things. There are also distilled versions, right? And when a machine learning engineer starts to work on a task, what is a good rule of thumb for making this decision? Obviously there is no one clear answer, but do you have any mental model or framework, especially for beginners who may be working in a company that doesn't have a big machine learning group but are the person who actually has to make the calls? How can we help and support such a person?
Yeah, that's a good question. I think that's definitely something a lot of people face. I mean, the very practical thing about our team at Hugging Face, I think, is that we should help people do that, because I understand we're providing a lot of models, we're providing a lot of checkpoints, but it's really hard to actually see the one you should select, the one you should use. So the first thing is that we will try to build some better tools for this. But now, for the quick answer, I think it's good to keep your good reflexes, your good routine, which is that you should start with something simple: start with a smaller model, like starting with DistilBERT, for instance, instead of BERT, and see how far you can go with that, with a compute-efficient model like a DistilBERT, a DistilGPT-2, a DistilRoBERTa. You test a little bit with them, and if it's not enough, then you scale up, you start to use bigger models and you see if you need, like, a T5 or something like that.
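In practice, the "start small, scale up" routine is often just a change of checkpoint name; here is a hedged sketch using a standard distilled sentiment checkpoint as the small starting point:

```python
from transformers import pipeline

# Start with a compute-efficient distilled model...
small = pipeline("sentiment-analysis",
                 model="distilbert-base-uncased-finetuned-sst-2-english")
print(small("This talk was great!"))

# ...and only if its quality is not enough, swap the model string for a
# bigger checkpoint fine-tuned on the same task (e.g. a full BERT/RoBERTa).
```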
Starting with logistic regression
Starting with logistic regression before moving to something more advanced is the typical approach; if you start with the biggest model, like GPT-3, you can only go down from there. It would be interesting to explore use cases that showcase the space of problems and how different models map onto it.
Yeah. Yeah, that's the typical thing. It's the same idea: you should always start with, like, logistic regression before you move to something very fancy. Yeah. Definitely. I can only agree with you, right? Because once you start with GPT-3, you can only make it worse from there, right? So there's nowhere to go from there, right? But maybe just a random idea, because you're also building quite a bunch of tools, right? Right now, many companies are building models, right? Maybe you could also have something like a use-case map, kind of showing, hey, what is the space of problems, and how does it map to the space of models and solutions. But again, just a random idea. Well, that's a very good idea. I think you should come build that with us. Yeah. Yeah, yeah. I will try.
Future of Transformers in NLP and Vision
Transformers have revolutionized the NLP industry and are now being explored for vision tasks. They are efficient and scalable for transfer learning on large datasets. The choice of model is not as important as the underlying strategy of transfer learning. There is also a trend towards weakly or unsupervised learning. When using transfer learning, there is a risk of falling into local optima, but resources are available to guide advanced transfer learning in NLP.
Okay. So the next question would be about transformers, right? We have seen that transformers did reinvent or revolutionize the NLP industry, right? But how do you feel about the future of transformers? There were also some papers about using transformers for vision, right? Do you see that soon transformers are going to be all around machine learning, or deep learning in particular, or are they going to find their own niche and kind of plateau from there, basically?
Yeah, that's a good question. That's a timely question. I think this vision transformer was a very impressive paper because it actually showed that what we've seen for NLP was also true for vision, which is that transformers are quite efficient, and they are really a nice model to scale to large data sets. As soon as you start talking about transfer learning from a large data set, then you really want the most efficient architecture that you can scale. Yeah. I mean, I'm not a computer vision guy. So yeah, I don't really know why we only see that today, for instance. So I don't know if there was something unlocked, but yeah, I think they're nice. I'm pretty model agnostic, I would say. I think the underlying strategy, like transfer learning, is something very deep and important. Regarding the precise model that we select to use, I don't care that much. If somebody finds something better than Transformer, I would be happy to switch to these new models. But I think the idea of reusing pre-trained model is a lot deeper with deep consequences.
Yeah, I guess something also happening in the industry right now is not even using the Transformer as an architecture, but more using weakly supervised or unsupervised learning. Because the big part of the Transformer, or BERT, or this direction, was like, hey, we don't need to have a big labeled dataset, we can just reuse what we have. And using the same ideas, hey, can we learn a representation from the image itself, and all of the ideas around contrastive losses, might be something that sparks another different direction.

Cool, one more question from the audience. I will rephrase it, but I hope I'm not going to lose its sense. So, as you also suggested in your talk, if you can use transfer learning, you can try to start there, unfreeze some layers and train it gradually to get the task right. And the person is asking, essentially: how likely is it that by doing transfer learning you end up in a local optimum, so you don't really get something good, compared to starting with a model from scratch and training it all together? Because, I mean, usually people hope that the more you unfreeze, the better you're getting, and it's almost a smooth transition from out-of-the-box transfer learning to a model that has been trained from scratch. Do you agree with this one? Or is there anything people should be aware of, like any tips and tricks of advanced transfer learning in NLP?

Yeah, there are a lot of things here. The best resource is probably the link in my first slide to the long tutorial where we talk about that, and I also made a video earlier this year on this question that I could share. But there is a real risk of falling into local optima. We see that when you fine-tune BERT, for instance, you can get an on-off behavior.
Model Conversion and Stuck Minima
For some random seeds, the model gets stuck in bad minima, while for others, it works well. Researchers are investigating this phenomenon and developing tools like MixOut to avoid getting stuck. There are various techniques to achieve smooth interpolation and avoid local minima. Ensembles are also effective. This is an open research question, and further exploration is encouraged.
For some random seeds the model just doesn't converge, it stays stuck in a bad minimum, and for some other random seeds it really works well, which was surprising in the beginning. A lot of people have been investigating what is happening, and you now have a lot of tools to try to avoid that. You have tools related to what you're saying, like one called MixOut, whose idea is a bit like dropout: you want to regularize your model, but while with dropout you regularize towards zero, like you cancel some weights, with these other methods you regularize towards the pre-trained model, so you can have a smooth interpolation from the pre-trained model to the fine-tuned model. There are a lot of other techniques that try to do some smooth interpolation to avoid being stuck in local minima; that's a very active area of research. The other option is to use ensembles, which also work very well. But yeah, that's still quite an open research question. I think we have not really understood everything that is happening here, why it fails some of the time while working very well in other cases that look really similar.

So maybe for whoever asked this question, it's also a good topic for research, right? Once you find some answers, please write an article or a paper, and we can all read it later.
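For readers who want a concrete picture of the "regularize towards the pre-trained model" idea mentioned above, here is a small hedged sketch of an L2-SP-style penalty (MixOut itself instead randomly swaps fine-tuned weights back to their pre-trained values); the function name and strength value are made up for illustration:

```python
import torch

def l2_to_pretrained(model, pretrained_state, strength=0.01):
    """Penalty pulling the fine-tuned weights back towards the
    pre-trained ones instead of towards zero (as weight decay would)."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + ((param - pretrained_state[name]) ** 2).sum()
    return strength * penalty

# Usage sketch: snapshot the weights right after loading the checkpoint,
#   pretrained_state = {n: p.detach().clone() for n, p in model.named_parameters()}
# then add l2_to_pretrained(model, pretrained_state) to the task loss
# at every training step before calling backward().
```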