In this talk I'll start by introducing the recent breakthroughs in NLP that resulted from the combination of transfer learning schemes and Transformer architectures. The second part of the talk will be dedicated to an introduction of the open-source tools released by HuggingFace, in particular our Transformers, Tokenizers and Datasets libraries and our models.
An Introduction to Transfer Learning in NLP and HuggingFace

AI Generated Video Summary
Transfer learning in NLP allows for better performance with minimal data. BERT is commonly used for sequential transfer learning. Models like BERT can be adapted for downstream tasks such as text classification. Handling different types of inputs in NLP involves concatenating or duplicating the model. Hugging Face aims to tackle challenges in NLP through knowledge sharing and open sourcing code and libraries.
1. Introduction to Transfer Learning in NLP
Today, we're going to talk about transfer learning in NLP. In transfer learning, we reuse knowledge from past tasks to bootstrap our learning. This approach allows us to learn with just a few data points and achieve better performance. At Hugging Face, we're developing tools for transfer learning in NLP.
Hi, everyone. Welcome to my talk. Today, we're going to talk about transfer learning in NLP. I'll start by talking a little bit about the concepts and the history, then present the tools we are developing at Hugging Face. And then I hope you have a lot of questions for me.
So, a Q&A session. OK. Let's start with the concepts. What is transfer learning? That's a very good question. So here is the traditional way we do transfer learning. Sorry, this is the traditional way we do machine learning. Usually, when we're faced with a first task in machine learning, we gather a set of data, we randomly initialize our model, and we train it on our dataset to get the machine learning system that we'll use to predict, for instance, in production.
Now, when we're faced with a second task, usually we'll gather another set of data, another dataset, we'll randomly initialize our model, and we'll train it from scratch again to get the second learning system. And the same when we're faced with a third task: we'll have a third dataset and a third machine learning system, again initialized from scratch, that we'll use in production.

This is not the way we humans do learning. Usually, when we're faced with a new task, we reuse all the knowledge we've learned in past tasks, all the things we've learned in life, all the things we've learned in university classes, and we use that to bootstrap our learning. You can see that as having a lot of data, a lot of datasets, that we've already used to build a knowledge base. This gives us two main advantages. The first one is that we can learn from just a few data points, because we can interpolate between these data points; this gives us some form of data augmentation, if you want, naturally. And the second advantage is that we can also leverage all this knowledge to reach better performance. So humans are typically more data efficient and have better performance than machine learning systems. Transfer learning is one way to try to do the same for statistical learning, for machine learning. Last summer we did a very long tutorial on this, a three-hour tutorial.
2. Sequential Transfer Learning with BERT
Today, we'll discuss sequential transfer learning, which involves pre-training and then fine-tuning a general-purpose model like BERT. Language modeling is a self-supervised pre-training objective that maximizes the probability of the next word. This approach doesn't require annotated data and is versatile, making it useful for low-resource languages. Transformers like BERT are commonly used for transfer learning in NLP.
So you can check out these links. There are 300 slides, a lot of hands-on exercises, and an open-source code base. So if you want more information, you really should go there.
So there are a lot of ways you can do transfer learning, but today I'm going to talk about sequential transfer learning, which is currently the most used flavor of transfer learning. Sequential transfer learning, like the name says, is a sequence of steps, at least two steps. The first step is called pre-training. During this step, you try to gather as much data as you can and basically build some kind of knowledge base, like the knowledge base we humans build. The idea is that we end up with a general-purpose model.
There are a lot of different general-purpose models. You've probably heard about many of them: Word2Vec and GloVe were the first models leveraging transfer learning; they were word embeddings. But today we use models which have a lot more parameters and are fully pre-trained, like BERT, GPT or DistilBERT. These models are pre-trained as general-purpose models: they are not focused on one specific task, but they can be used on a lot of different tasks. How do we do that? We do a second step of adaptation, or fine-tuning usually, in which we select the task we want to use our model for and fine-tune it on this task. Here you have a few examples: text classification, word labeling, question answering.
But let's start with the first step, pre-training. The way we pre-train our models today is called language modeling. Language modeling is a pre-training objective which has many advantages. The main one is that it is self-supervised, which means that we use the text as its own label: we can decompose the probability of the text as a product of the probabilities of the words, for instance, and we try to maximize that. You can see that as, given some context, trying to maximize the probability of the next word, or the probability of a masked word. The nice thing is that we don't have to annotate the data. So in many languages, just by leveraging the internet, we can have enough text to train a really high-capacity model. This is great for many things, and in particular for low-resource languages. It's also very versatile: as I told you, you can decompose this probability as a product of probabilities over various views of your text, and this is very interesting from a research point of view.
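To make the objective concrete, here is a minimal sketch of how this factorized log-likelihood is computed in practice with the Transformers library; the checkpoint name is just an illustrative default, not necessarily the one from the talk.

```python
# Minimal sketch of the language-modeling objective:
# maximize log p(text) = sum_t log p(w_t | w_1 ... w_{t-1}).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transfer learning is changing NLP", return_tensors="pt")
# Passing the input ids as labels makes the library compute the next-word
# cross-entropy, i.e. the average negative log-probability above.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```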
Now, what do the models look like? There are two main flavors of models, and they are both Transformers, because Transformers are very interesting from a scalability point of view. The first one is called BERT. To train a BERT model, you do what we call masked language modeling, which is a denoising objective.
3. Training BERT and Adapting for Downstream Tasks
BERT is trained by predicting back the masked token. Another flavor is the auto-regressive or causal model, which associates each token with the next token. These models are less powerful but train faster. To adapt BERT, the pre-training head is removed and replaced with a task-specific head. Examples of downstream tasks include text classification and generation.
So we take a sentence, we mask one token, and we try to predict this token back. We project all these words into vectors, and then we use what is called the attention layer in the Transformer, which does a weighted average of these vectors to end up with vectors again. But these vectors now depend on the context, which means that the vector associated with the mask has enough information from the context to be able to predict back, or at least to try to predict back, the missing, masked word, okay?
So this is how BERT is trained: you train the model to predict back the masked token. Now, there is another flavor, which is called the auto-regressive or causal model. These models look pretty much the same, but you can see that each token here is associated with the next token. So with "my dog", the target at the top of the column associated with "dog" is the next token, okay? So we need to tweak the attention here: we have to mask the right context, otherwise the model can just see the label here on the right. These models are inherently less powerful, because they cannot use the right context to do their prediction, but they get a lot more training signal, so they usually train faster. So these are the two main flavors of models, the two main frameworks.
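As a hedged illustration of these two flavors (the checkpoints named below are the usual public ones, not necessarily the exact models from the slides), the two pre-training objectives map directly onto two ready-made pipelines:

```python
from transformers import pipeline

# BERT-style: denoising / masked language modeling
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("My dog is very [MASK]."))

# GPT-style: auto-regressive / causal language modeling
generator = pipeline("text-generation", model="gpt2")
print(generator("My dog is", max_length=10))
```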
Now, how do you do the adaptation step? It's pretty easy. The first step is that we remove the pre-training head: we take these pink boxes that I showed you on the previous slides, remove them, and replace them with a task-specific head. If you do text classification, it can just be a linear projection to the number of classes that you have. If you're doing a more complex task, you can add a full neural network on top. If your task is very complex, like structurally different, for instance machine translation, then you have to do more complex things. We'll talk about that a little bit in the next section.
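In code, the head swap is mostly handled for you; here is a minimal sketch, assuming a binary classification task and an illustrative checkpoint name:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # pre-trained body, the "knowledge base"
    num_labels=2,         # fresh linear head, randomly initialized for our task
)
# The library drops the pre-training (pink-box) head and attaches a new
# classification head, which is the only part trained from scratch.
```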
So let me show you two main examples of downstream tasks that we can tackle: text classification and generation. Here is an example of text classification to show you how it works. Let's say we have the sentence "Jim Henson was a puppeteer", and the task of our machine learning model is to predict if the sentence is true or false. So this is just binary classification, if you want, okay? The first step is to convert this input sentence into something that our model can digest.
4. Model Training and Performance
Our models are trained to process open domain vocabulary and corpus. Uncommon words are handled by cutting them into prefixes and suffixes. The model's output is projected back to two classes using a linear classifier. This classifier is trained from scratch on the downstream task. In a practical example, we achieved over 90% accuracy after just one epoch on a small dataset. The model's versatility is demonstrated in a chatbot setting with a knowledge base.
Our models can only take numbers, integers or floats, so the first step is to convert these strings into numbers. One interesting thing here is that our models are made to be trained on and to process basically open-domain vocabulary, open-domain corpora; they are made to be trained on the internet. So we have to handle some uncommon words.
Here, "puppeteer" is a word that is definitely not very common in English. The way we handle that is that we cut this word into prefixes and suffixes until we know every part of the word. So here, for instance, "puppeteer" is split into pieces that are common enough to be in our vocabulary, and we can then just convert all of this into vocabulary indices. This is now the input of the model we saw two slides ago, the BERT or the GPT. And as you remember, the output before the pink boxes was a vector, and we just have to project this vector back onto our two classes.
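Here is a small sketch of that sub-word splitting with a Transformers tokenizer; the exact pieces depend on the vocabulary of the checkpoint, so the split shown in the comment is only what a typical BERT vocabulary produces:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Jim Henson was a puppeteer"))
# e.g. ['jim', 'henson', 'was', 'a', 'puppet', '##eer']
print(tokenizer.encode("Jim Henson was a puppeteer"))
# the corresponding vocabulary indices, with special tokens added
```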
So for instance, we put it through a linear classifier, and then we get the classes. This linear classifier is trained from scratch on our downstream task, okay? So it has to be quite small in terms of parameters if you want to be data efficient, but the rest of the model is pre-trained on the huge corpus that we've gathered, so it can really be reused. Here is a practical example on a classification task called TREC-6. You can see more details in the slides of our tutorial last year, but basically you'll see that just after one epoch, we already have an accuracy over 90%, which means that this is really data efficient. This TREC-6 dataset is really a small dataset, 2,500 examples, so this is really data efficient. And when you train for three epochs, you get down to an error rate of 3.6, which is the state of the art. Well, it was the state of the art last year. So you get the two things we were talking about: data efficiency and high performance.
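Here is a minimal, hypothetical fine-tuning loop under the same assumptions (toy batch, untuned hyper-parameters), just to show that only the small head is learned from scratch while the pre-trained encoder is fine-tuned for a few epochs:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

batch = tokenizer(["Jim Henson was a puppeteer"], return_tensors="pt")
labels = torch.tensor([1])  # 1 = "true" in our toy binary task

for epoch in range(3):                       # a few epochs are usually enough
    outputs = model(**batch, labels=labels)  # cross-entropy on the new head
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```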
Okay. Now here's a totally different example. So you see the variety, the diversity of tasks that this model can face. This is a chatbot setting where we have kind of a knowledge base.
5. Handling Different Types of Inputs
The chatbot is pretending to be an artist with four children who got a cat and likes to watch Game of Thrones. The dialogue agent needs to generate a reply that makes sense given the personality. This is a different task because of the various types of inputs involved, such as the knowledge base, the dialogue history, and the beginning of the reply being generated. There are two main ways to handle this: concatenating all the inputs into a single input, or duplicating the model and processing each type of input separately.
The chatbot is pretending to be an artist with four children who got a cat and likes to watch Game of Thrones. And now it's discussing with a user in an open-domain setting. So the user said "Hey", the chatbot answered "Hello, how are you?", and the user said "I'm good, thank you. How are you?" And our dialogue agent, our machine learning model, is supposed to generate a reply which makes sense given the personality, okay? So you see, this is quite a different task, because we have a lot of different types of inputs. We have kind of a knowledge base here, we have a dialogue history as well, and we have the beginning of the reply itself, because we're generating the reply word by word, so we also have to take the words already generated into account to produce the next one.

So how do we handle that? Well, we had a paper at ACL last year that you can check out, but basically there are two main ways to handle this. You can concatenate all of this into a single input; this is one simple way to tackle it, and it actually works really well. Or you can duplicate your model and have several parts that each process one type of input, and then you need to connect them together, with cross-attention or some other way to connect them. Okay. These are very different, but the idea is that overall, as you can see here, in the ConvAI2 dialogue competition we participated in two years ago now, we were really leading on the automatic metrics by a strong margin using this type of transfer learning approach.
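To give an idea of the simpler of the two strategies, here is a hedged, framework-free sketch of concatenating the persona, the history, and the reply so far into one flat sequence with segment markers; the segment names are hypothetical, not the exact special tokens from the paper:

```python
persona = ["i am an artist", "i have four children", "i got a new cat"]
history = ["hello, how are you?", "i'm good, thank you. how are you?"]
reply_so_far = ["great", ","]

# One flat token sequence plus a parallel "segment" sequence telling the
# model which part of the input each token comes from.
words, segments = [], []
for sentence in persona:
    for w in sentence.split():
        words.append(w); segments.append("<persona>")
for turn in history:
    for w in turn.split():
        words.append(w); segments.append("<history>")
for w in reply_so_far:
    words.append(w); segments.append("<reply>")

# Both sequences are then converted to ids and fed to the model,
# which predicts the next word of the reply.
print(list(zip(words, segments))[-5:])
```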
6. Trends and Limitations in NLP
There is a trend to bigger models, which is narrowing the competition. However, there are ways to reduce the size of models through distillation, pruning, and quantization. These models have difficulties with out-of-domain generalization and are limited by the lack of common sense knowledge in text. Another challenge is making models learn new knowledge without forgetting previous knowledge. Despite these shortcomings, Hugging Face aims to tackle these challenges by democratizing NLP through knowledge sharing and open sourcing code and libraries, such as the Transformers Library.
A few words on trends and limits; I'll go fast. There is a trend towards bigger models. This is quite a big problem, because it's narrowing the competition. There are ways to reduce the size of these models: distillation, pruning, quantization. I won't spend too much time on this, but it's something we like a lot.
These models also have a generalization problem. Even though they are big, they really have difficulties with out-of-domain generalization. And there is basically a limit to text as a medium, which is that a lot of common sense is just not written down in text, and so there is a limit to what these models can learn from text alone. The main ways to overcome this are to use some database or images, or even to use humans in the loop.
And there is one last main problem, which is that these models are trained just once, and it's very hard to make them learn new knowledge later on, because they have this problem of catastrophic forgetting. So, for instance, GPT-3, which was very expensive to train, training cost something like 10 million dollars, has no idea what COVID is, while COVID is such a strong element of our daily life today, okay? So this is too bad. So there are a lot of shortcomings, and at Hugging Face, we try to tackle some of these shortcomings.
So what do we do at Hugging Face? Well, we try to democratize NLP. We started as a chatbot company, building a game, and we open-sourced a lot of tools while we were doing that. These tools caught so much interest in the community that we are now fully focused on catalyzing and democratizing all this research-level work. We do that in two main ways. The first way is through knowledge sharing; that's why I'm talking today. And the second way is to open-source a lot of code and libraries, which lets people leverage all these developments and build better models on top of them.

Okay, so let's see a few of our open-source libraries; this will be the open door for the questions in the second part. The first main library, which you probably know, is called the Transformers library. It is a way to access all the state-of-the-art models, and they can be used both for NLU and NLG, in many languages. We have today really a lot of models in the library. There are all the models you probably know, BERT, GPT, and also more recent, interesting models like DPR and Pegasus. XLM-RoBERTa is a great multilingual model by Alexis and team. There is also T5, which is a really huge model. So there are a lot of them, and they are very simple to use.
7. Model Hub, Tokenizer, and DataSets Libraries
We have a model hub where people can share over 3,000 models in many languages. We also have a web interface to try out the models. We recently open-sourced a second library called Tokenizers to solve the speed bottleneck in string splitting. We also released a third library called Datasets to make it easier to share datasets and metrics. Datasets has interoperability with popular frameworks and is designed for large datasets.
We can talk about that in the questions. We even have a model hub where people can share their models. We have more than 3,000 models in many, many languages, so if you want to use a model for Finnish or German, there are like 132 different models you can pick from and play with. Even in the web interface, you can try the models and see how they behave. This is at huggingface.co.
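Loading one of these community models from the hub is a one-liner; the identifier follows the "organization/model-name" pattern, and the German checkpoint below is just one example of what you can find by browsing huggingface.co:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "dbmdz/bert-base-german-cased"  # example hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
```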
Now, we also recently open-sourced a second library called Tokenizers, which does this first step I was talking about, splitting the string into integers. Why did we do that? This was really a bottleneck in terms of speed, so we decided to write some very low-level, super fast Rust code to solve this problem. So this is now a really fast step. It's available in Python and Node. Rust is a great language.
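Here is a small sketch of the Tokenizers library in action, training a BPE tokenizer on a local corpus; the file name is a placeholder and the API shown is the one from recent versions of the library:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=30_000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder file

print(tokenizer.encode("Jim Henson was a puppeteer").tokens)
```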
And more recently, a few months ago, we open-sourced a third library, which is called Datasets. This is to tackle one last problem we discovered, which is that datasets themselves are kind of hard for people to access and hard to share, and we spend a lot of time rewriting the same processing code that other people have actually already written for the same dataset. So we decided to make a library to solve this problem, to make it easier to share datasets, and also to share metrics. Some novel metrics are actually very complex to use, or at least to install and set up; metrics to evaluate natural language generation, for instance, can be really complex. So we decided it would also be good to make something very easy to upload them and share them. So this is the Datasets library, with datasets and metrics.

It has a few really cool things. For instance, there is built-in interoperability with NumPy, pandas, PyTorch, and TensorFlow, so you can basically use any modern framework you want. It's also made specifically to tackle really large datasets. If you have gigabytes of data, if you want to, for instance, train your model on Wikipedia, your dataset may be bigger than your RAM, and you can still use this library, because it has a lot of smart ways to do on-disk serialization and memory mapping: it takes just nine megabytes of RAM, for instance, to work with the full Wikipedia dataset. There is also a lot of smart caching: if you process your dataset once, it's cached, and you won't spend time reprocessing it again. So you can check this one out. Here is an example of how you prepare data: the full preparation of a training set to train a model on GLUE, with the tokenization and the padding.
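Roughly what that slide shows, as a hedged sketch (the GLUE task and checkpoint names are illustrative, not necessarily the ones on the slide):

```python
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

dataset = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)   # processed once, then cached
dataset.set_format(type="torch",
                   columns=["input_ids", "attention_mask", "label"])
# dataset["train"] can now go straight into a PyTorch DataLoader.
```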
8. Model Training, Libraries, and Questions
The model training process is efficient and provides visibility. The hub allows access to datasets, including multi-modal datasets with images. Visit HuggingFace.co to explore the datasets and view their contents. These three main libraries are available for discussion, along with any questions about the concept and history.
And you see, this is just like 20 lines of code, and then you're ready to train your model. So it's made to be very efficient, and at the same time everything is very visible. Same as for the models, we also have a hub where you can access and even explore the datasets. There are like 160 datasets now, including multi-modal datasets with images. So you can go to huggingface.co and explore all the datasets: you can see the citation and everything, and you can even see what's inside a dataset in the web interface.
Okay. So same here. You can go and check it out at HuggingFace.co. Okay. So these are the three main libraries, and I'm really happy to talk about any question you may have on them. And also, any question you may have on the first part, the concept, the history. Thank you.
QnA
NLP Advancements and Choosing Models
Hey Thomas, how is it going for you? Good to hear! NLP has become the most exciting field in AI. GPT-3 is an impressive model, but its full capabilities are difficult to evaluate. It excels in retrieval tasks and can generate code and realistic blog posts. However, it lacks deep reasoning abilities. Having a big database doesn't equate to AGI. When choosing a model, start with something simple like a distilled version and scale up if needed.
Hey Thomas, how is it going for you?
Hi. Good, good. How is it going for you?
Yeah, it's also pretty nice. I mean, the sun is already a bit out in Munich, but there are still quite a bunch of exciting talks. And you did kick off this conference pretty nicely, because I do believe that natural language processing, or language in general, is one of those indicators of how good we really are at machine learning.
How do you feel?
Yeah, definitely. I think what we've seen in NLP over the last two years, probably, is that it has become really what we would have expected it to become from the start, which means really the way to process knowledge and the way to kind of do reasoning, or what we hope would be like reasoning. And so when we talk about AGI, I think right now a lot of people think about GPT-3, which is a pure text model. So I think it's really impressive how NLP is now the most exciting field to be in within AI.
Yeah, and it's kind of funny because you're almost predicting the first question right. And the first question is actually about GPT-3 and people asking like, hey, since you're an expert in NLP and your company is driving such good efforts in this regard, what do you think is the advance of GPT-3 in comparison to GPT-2?
Yeah, it's a good question. I think, well, one of the problems of GPT-3 is that it's quite difficult to access, so we have not really evaluated the capabilities of GPT-3 the way we've done it for other models like BERT or GPT-2, just because a lot of academics did not really have full access to investigate what's happening, what you can do with it, and to test it fully on a lot of tasks. So it's kind of hard to give you a real answer, right? What I think is that GPT-3 behaves in some ways like an interesting thing, which is retrieval, like a smooth retrieval engine over a really large dataset. It's like having a huge Google Search where you can search every page on the internet and smoothly interpolate between all these pages. So I think this is very interesting, and we see that you can do some pretty cool applications with that: you can smoothly interpolate to generate code, or to generate realistic-looking blog posts. Now, when you talk about really reasoning, meaning and things like that, I don't think there is any deep breakthrough in GPT-3, but that's my personal opinion.
Yeah, it really resonates with me that you separate reasoning from having a big database. Because sometimes parts of our community go like, hey, if the database is bigger, you can just solve all of the problems. And having a bigger model doesn't really mean that you suddenly have something like AGI; it's good to remember that, basically.
Yeah. So another question from the audience would be: there are so many different sizes of models, even transformer-based; there's GPT-3, there's BERT, RoBERTa, all kinds of things. There are also distilled versions, right? And when a machine learning engineer starts to work on a task, what is a good rule of thumb for making this decision? Obviously there is no one clear answer, but do you have any mental model or framework, especially for beginners who may be working in a company that doesn't have a big machine learning group but are the person who actually has to make the calls? How can we help and support such a person?
Yeah, that's a good question. I think that's definitely something a lot of people face. I mean, the very practical thing about our team at Hugging Face, I think, is that we should help people do that, because I understand we're providing a lot of models, we're providing a lot of checkpoints, but it's really hard to actually see the one you should select, the one you should use. So the first thing is that we will try to build some better tools for this. But now, for the quick answer, I think it's good to keep your good reflexes, your good routine, which is that you should start with something simple: start with a smaller model, like starting with DistilBERT, for instance, instead of BERT, and see how far you can go with that, with a compute-efficient model like a DistilBERT, a DistilGPT-2, a DistilRoBERTa. You test a little bit with them, and if it's not enough, then you scale up, you start to use bigger models and you see if you need, like, a T5 or something like that.
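In practice, the "start small, scale up" routine is often just a change of checkpoint name; here is a hedged sketch using a standard distilled sentiment checkpoint as the small starting point:

```python
from transformers import pipeline

# Start with a compute-efficient distilled model...
small = pipeline("sentiment-analysis",
                 model="distilbert-base-uncased-finetuned-sst-2-english")
print(small("This talk was great!"))

# ...and only if its quality is not enough, swap the model string for a
# bigger checkpoint fine-tuned on the same task (e.g. a full BERT/RoBERTa).
```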
Starting with logistic regression
Starting with logistic regression before moving to something more advanced is the typical approach; if you start with the biggest model, like GPT-3, you can only go down from there. It would be interesting to explore use cases that showcase the space of problems and how different models map onto it.
Yeah. Yeah, that's the typical thing. It's the same idea: you should always start with, like, logistic regression before you move to something very fancy. Yeah. Definitely. I can only agree with you, right? Because once you start with GPT-3, you can only make it worse from there, right? So there's nowhere to go from there, right? But maybe just a random idea, because you're also building quite a bunch of tools, right? Right now, many companies are building models, right? Maybe you could also have something like a use-case map, kind of showing, hey, what is the space of problems, and how does it map to the space of models and solutions. But again, just a random idea. Well, that's a very good idea. I think you should come build that with us. Yeah. Yeah, yeah. I will try.
Future of Transformers in NLP and Vision
Transformers have revolutionized the NLP industry and are now being explored for vision tasks. They are efficient and scalable for transfer learning on large datasets. The choice of model is not as important as the underlying strategy of transfer learning. There is also a trend towards weakly or unsupervised learning. When using transfer learning, there is a risk of falling into local optima, but resources are available to guide advanced transfer learning in NLP.
Okay. So the next question would be about transformers, right? We have seen that transformers did reinvent or revolutionize the NLP industry, right? But how do you feel about the future of transformers? There were also some papers about using transformers for vision, right? Do you see that soon transformers are going to be all around machine learning, or deep learning in particular, or are they going to find their own niche and kind of plateau from there, basically?
Yeah, that's a good question. That's a timely question. I think this vision transformer was a very impressive paper because it actually showed that what we've seen for NLP was also true for vision, which is that transformers are quite efficient, and they are really a nice model to scale to large data sets. As soon as you start talking about transfer learning from a large data set, then you really want the most efficient architecture that you can scale. Yeah. I mean, I'm not a computer vision guy. So yeah, I don't really know why we only see that today, for instance. So I don't know if there was something unlocked, but yeah, I think they're nice. I'm pretty model agnostic, I would say. I think the underlying strategy, like transfer learning, is something very deep and important. Regarding the precise model that we select to use, I don't care that much. If somebody finds something better than Transformer, I would be happy to switch to these new models. But I think the idea of reusing pre-trained model is a lot deeper with deep consequences.
Yeah, I guess something also happening in the industry right now is not even using the Transformer as an architecture, but more using weakly supervised or unsupervised learning. Because the big part of the Transformer, or BERT, or this direction, was like, hey, we don't need to have a big labeled dataset, we can just reuse what we have. And using the same ideas, hey, can we learn a representation from the image itself, and all of the ideas around contrastive losses, might be something that sparks another different direction.

Cool, one more question from the audience. I will rephrase it, but I hope I'm not going to lose its sense. So, as you also suggested in your talk, if you can use transfer learning, you can try to start there, unfreeze some layers and train it gradually to get the task right. And the person is asking, essentially: how likely is it that by doing transfer learning you end up in a local optimum, so you don't really get something good, compared to starting with a model from scratch and training it all together? Because, I mean, usually people hope that the more you unfreeze, the better you're getting, and it's almost a smooth transition from out-of-the-box transfer learning to a model that has been trained from scratch. Do you agree with this one? Or is there anything people should be aware of, like any tips and tricks of advanced transfer learning in NLP?

Yeah, there are a lot of things here. The best resource is probably the link in my first slide to the long tutorial where we talk about that, and I also made a video earlier this year on this question that I could share. But there is a real risk of falling into local optima. We see that when you fine-tune BERT, for instance, you can get an on-off behavior.
Model Conversion and Stuck Minima
For some random seeds, the model gets stuck in bad minima, while for others, it works well. Researchers are investigating this phenomenon and developing tools like MixOut to avoid getting stuck. There are various techniques to achieve smooth interpolation and avoid local minima. Ensembles are also effective. This is an open research question, and further exploration is encouraged.
For some random seeds the model just doesn't converge, it stays stuck in a bad minimum, and for some other random seeds it really works well, which was surprising in the beginning. A lot of people have been investigating what is happening, and you now have a lot of tools to try to avoid that. You have tools related to what you're saying, like one called MixOut, whose idea is a bit like dropout: you want to regularize your model, but while with dropout you regularize towards zero, like you cancel some weights, with these other methods you regularize towards the pre-trained model, so you can have a smooth interpolation from the pre-trained model to the fine-tuned model. There are a lot of other techniques that try to do some smooth interpolation to avoid being stuck in local minima; that's a very active area of research. The other option is to use ensembles, which also work very well. But yeah, that's still quite an open research question. I think we have not really understood everything that is happening here, why it fails some of the time while working very well in other cases that look really similar.

So maybe for whoever asked this question, it's also a good topic for research, right? Once you find some answers, please write an article or a paper, and we can all read it later.
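For readers who want a concrete picture of the "regularize towards the pre-trained model" idea mentioned above, here is a small hedged sketch of an L2-SP-style penalty (MixOut itself instead randomly swaps fine-tuned weights back to their pre-trained values); the function name and strength value are made up for illustration:

```python
import torch

def l2_to_pretrained(model, pretrained_state, strength=0.01):
    """Penalty pulling the fine-tuned weights back towards the
    pre-trained ones instead of towards zero (as weight decay would)."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + ((param - pretrained_state[name]) ** 2).sum()
    return strength * penalty

# Usage sketch: snapshot the weights right after loading the checkpoint,
#   pretrained_state = {n: p.detach().clone() for n, p in model.named_parameters()}
# then add l2_to_pretrained(model, pretrained_state) to the task loss
# at every training step before calling backward().
```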