Power of Transfer Learning in NLP: Build a Text Classification Model Using BERT


The domain of Natural Language Processing has seen a tremendous amount of research and innovation in the past couple of years, aimed at building high-quality machine learning and AI solutions that work with natural text. Text classification is one such area, and it is extremely important in sectors like finance, media, and product development. Building a text classification system from scratch for every use case can be challenging in terms of cost as well as resources, even assuming there is enough data to begin training with.

This is where transfer learning comes in. Taking models that have been pre-trained on terabytes of data and fine-tuning them for the problem at hand is the new way to implement machine learning solutions efficiently, without spending months on a data cleaning pipeline.

This talk will highlight ways of implementing the newly launched BERT and fine-tuning the base model to build an efficient text classification model. A basic understanding of Python is desirable.


Hello everyone, and welcome to this session on transfer learning with BERT. I'm very excited that you are all here joining me. So let's see what we can learn about transfer learning and BERT today.
Before we start, a quick introduction about me. My name is Jayita Puttatunda. I work as a senior data scientist at Indeliant US Inc., and we are based out of New York City. To give you an overview, Indeliant is a customer-driven professional services company. We specialize in tailor-made software and tech solutions. We work with a lot of data analytics and AI-based companies and build some of our tools and software around those spaces. You can connect with me on Twitter and LinkedIn if you have any questions about the session later on, and we can discuss them there.
For today's session, we're going to talk for about 20 minutes and then have some Q&A. I also have some code that I can share on my GitHub, so do reach out if you want to take a look at it after the session.
So now the exciting part. I say that NLP is hard, and you might ask why. Take a look at this picture. What's the first inference that comes to your mind? Human brains make great connections and references, so this is a very trivial task for us. But from the perspective of a computer program, understanding this complexity of the English language is a very daunting challenge. The picture says, "I'm a huge metal fan." As humans, we know the joke personifies an electric fan made of metal, but the phrase can also mean that you are a huge fan of metal music. So how does a computer differentiate between these two meanings of the same phrase?
So there is ambiguity, there's synonymy, there's coreference, and of course there are the syntactic rules of the English language, all of which make this a more daunting task for a computer program.
For today's agenda, we'll quickly go over what NLP is, how it is used, and how transfer learning can be applied. Then we'll look at a simple case of using BERT for a text classification model. So let's jump right into it.
For NLP, I feel that this image describes it very well. It's a subfield of linguistics, artificial intelligence, computer science, and information engineering. Basically, it helps machines understand and communicate back and forth with human beings in free-flowing speech without losing context or references. NLP has had exponential growth in the last two to three years: the huge models that Google, OpenAI, and NVIDIA have been working on and releasing have a humongous number of parameters. Just in May 2020, OpenAI came up with a 175-billion-parameter model, so you can imagine how much text went into training that model and how much work it can do with so much accuracy.
So where is NLP used? I know most of you are familiar with it, but I wanted to give a quick description of where I feel it's being used the most, and these are areas I've worked on hands-on. There's machine translation, text to audio, audio to text, chatbots, building knowledge trees, and intent classification. There's also natural text generation. I'm sure when you use Gmail, you have seen the prompt that keeps coming up while you're writing an email, suggesting the next two words that would be good for the sentence you're trying to complete. Those are text completion prompts.
It's also used a lot in topic modeling and clustering: understanding the context of the whole, or what kind of insights we can generate from a huge amount of text data. And there's text classification, for example sentiment analysis: how do you understand the general opinion in, say, Yelp reviews or Amazon product reviews? Those are all good applications for NLP.
So how do we do it? I know these steps can sound basic, but it is very important that we do them in every business problem we're trying to solve using NLP. First, once we have collected all our text, we make sure we handle extra spaces.
Then we need to look at how we tokenize our text. Tokenizing by spaces is the traditional norm, but we also need to take care of cases like the phrase United States of America. If we tokenize it just by spaces, the pieces may not make sense in the context we're working in. So we need to keep United States of America as a whole phrase rather than tokenizing it word by word, so that information extraction works much better than it would otherwise.
The next step is spell checking. I'm referring here to a great tool that Peter Norvig from Google created: a spell checker. The idea is to compare the distance between words and ask whether a spelling makes sense and how close it is to a similar spelling or a similar word that exists in the corpus. You can see here that when I pass "spelling" written with a single L, the returned value is "spelling" with a double L, and when I pass "corrected" written with a K, the returned value is "corrected" with a C.
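The spell-check step can be sketched in a few lines of Python in the spirit of Norvig's edit-distance approach. This is a minimal sketch, not his actual code: the `WORD_COUNTS` table is a made-up stand-in for counts from a large corpus, and the function names are chosen here for illustration.

```python
from collections import Counter

# Toy word counts standing in for a large corpus (illustrative assumption).
WORD_COUNTS = Counter({"spelling": 50, "corrected": 30, "correct": 40, "spell": 20, "the": 500})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit away: a delete, transpose, replace, or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the candidates that appear in the corpus."""
    return {w for w in words if w in WORD_COUNTS}


def correction(word):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correction("speling"))    # single L
print(correction("korrectud"))  # K instead of C, plus a second typo
```

With a real corpus the frequency table decides between competing candidates; here each misspelling has only one known neighbour, so the table mostly serves to define the vocabulary.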
So this can actually help in making sure that our data is clean. As they say in NLP, it's garbage in, garbage out. We need to make sure that the clean data makes grammatical sense and syntactic sense and is also semantically meaningful.
The next step is contraction mapping. It might not be obvious why we need it, but if you're using words like "won't" and "can't", the tokenizer may or may not handle them correctly. So we expand each one and map it back to its full words: "won't" becomes "will not" and "can't" becomes "cannot".
The next step is stemming and lemmatization. This is to keep our corpus matrix as compact as possible by grouping similar words: gaming, games, and gamed all have the base word game. The idea is to keep the base word as our token so that the meaning around it stays the same and the context can be identified more easily.
Next is stop words. When you're scraping a lot of text from Twitter, Reddit, and news blogs, there are a lot of extra words, such as the articles "a" and "the" and the pronouns, that may not add much context to the sentence. We can get rid of them. But I would also like to mention that this is very much case by case. If you have, say, a legal data set or a patent data set, removing all these stop words may not work well. So we need to understand our business use case and make sure it makes sense to remove the stop words. And of course there are other case-based things you can do.
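The cleaning steps described so far (extra spaces, phrase-aware tokenization, contraction mapping, base forms, stop words) can be strung together as a small pipeline. The sketch below is illustrative only: every lookup table is a tiny hand-written assumption, and `LEMMAS` is a toy dictionary standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny hand-written tables (illustrative assumptions, not complete resources).
CONTRACTIONS = {"won't": "will not", "can't": "cannot"}
PHRASES = ["united states of america"]          # multi-word units to keep whole
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "i", "he", "she", "it"}
LEMMAS = {"gaming": "game", "games": "game", "gamed": "game"}  # toy lemmatizer


def normalize_whitespace(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def expand_contractions(text):
    """Map contractions back to their full words."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def tokenize(text):
    """Split on non-letters, but protect known phrases by joining them with '_'."""
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return re.findall(r"[a-z_']+", text)


def preprocess(text, remove_stop_words=True):
    text = normalize_whitespace(text.lower())
    text = expand_contractions(text)
    tokens = tokenize(text)
    if remove_stop_words:  # optional: legal or patent text may need these words
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("I  won't stop   gaming in the United States of America"))
```

Note that "not" is deliberately absent from the stop-word set, since dropping negation would change the meaning of a sentence; the `remove_stop_words` flag reflects the point that stop-word removal is a case-by-case decision.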
So you can tag parts of speech: is it a verb, is it a noun? That helps you understand your text much better before you jump into modeling. That's why they say you should know your data, and unstructured text is more difficult to understand. So before you jump into the modeling section, make sure you have a complete understanding of, and insight into, the kind of data you are dealing with.
Great. Now that we have covered the baseline techniques, let's jump right in and think about what transfer learning is. This is a great quote by Andrew Ng, who is of course a huge figure in the AI and ML space. In 2016, he said that transfer learning will be the next great driver of ML success. We are now in 2020, and you can see that so many models have come out that solve different kinds of problems via transfer learning.
So what is it? Put simply, it's a machine learning technique where a model is trained on one task and then repurposed for a second task. Let's see what that means.
Say you have data set one, which is a general image data set, and Y, the target variable, is a classified object. That object could be a cat or a dog, a tree, a bus or a truck: any real-world object that's already there in the image data set. An example is ImageNet. I'm sure all of you have tried your hands on this data set in your work. Now this, in a general view, is the structure of a neural network with an input layer, a few hidden layers, and an output layer. The output layer for the data set one problem is the classification: is it a cat, a dog, or any other particular object from the training data set?
Now think about the second case. You have data set two, and it's a very small data set. You do not have a lot of training cases, only a few, and the target is whether an image is urban or rural. So what do you think is the common connection between these two data sets? Both of them deal with images. Both of them have a lot of real-world objects captured in them, but the final goal is not exactly the same.
Here transfer learning comes into action. You can reuse the knowledge from data set one that you built in your first phase of modeling and just retrain the last layer. Here it's layer four, the last dense layer. You retrain the last layer's weights with a softmax classifier and change the final output layer from classifying into cat, dog, ship, tree, et cetera, to classifying as urban or rural, based on the new data set that we have fed into layer four.
Now, before we ask why we are doing this, let's see where we are doing this. Take speech recognition. Say you're scraping a lot of voice data from different audio sources like YouTube, in English or in Hindi, which is one of the languages we speak back in India, and you pre-train your network on it. Now say you have a very small data set in Bengali and you want to understand what the speech is. You can take the model that you trained on the bigger English data set and then, with something like translation, use it for speech recognition in Bengali. The second example is what we spoke about just now.
If you have an image network trained on millions of tagged images of general objects of the world, you can reuse it for an image recognition task such as urban versus rural, and so on.
So why do we use transfer learning? Is there a positive gain that we want to achieve? Yes. First, there is the scarcity of labeled data. We understand that creating a labeled data set is sometimes very expensive. We spend time and expert resources on subject matter experts, SMEs, who help us create those tagged data sets. If you do not have that to start with, and it can get expensive, what is in our hands is to find out whether a bigger, related problem has already been solved, or whether a pre-trained model already exists. If there's already a network, pre-trained on massive amounts of data, that we can use, that's the next best scenario. And of course task one and task two should have the same kind of input properties. That's what I mentioned in the example we spoke about: in the first we had data with cats, dogs, and other real-world objects, and the second was rural versus urban, but the input features or components remain the same.
I'm sure all of you have seen this image before. It shows the difference from traditional machine learning, where you have different tasks and you create a different learning system for each task. In transfer learning, you have source tasks where you create a model and build a knowledge base. The idea is to leverage that knowledge base to achieve a target task, which in the case we discussed was urban versus rural. Now BERT comes into play here.
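The "freeze the early layers, retrain the last layer" idea can be shown with a dependency-free toy. This is only a sketch under loud assumptions: the "pre-trained" base is a fixed random projection rather than a network actually trained on a source data set, the four-number inputs and urban/rural labels are invented, and the new head is a two-class logistic layer rather than a full softmax.

```python
import math
import random

rng = random.Random(0)

# Stand-in for the layers learned on the large source data set (data set one).
# Random values are used purely for illustration; a real base would be pre-trained.
BASE_WEIGHTS = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]


def base_features(x):
    """Frozen layers: map a 4-value input to 3 features. Never updated below."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BASE_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Train only the new output layer (logistic regression) on frozen features."""
    feats = [(base_features(x), y) for x, y in data]
    head, bias, losses = [0.0] * 3, 0.0, []
    n = len(feats)
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * 3, 0.0, 0.0
        for f, y in feats:
            p = sigmoid(sum(w * fi for w, fi in zip(head, f)) + bias)
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            grad_b += p - y
            for j, fj in enumerate(f):
                grad_w[j] += (p - y) * fj
        # Gradient step on the head only; BASE_WEIGHTS is left untouched.
        head = [w - lr * g / n for w, g in zip(head, grad_w)]
        bias -= lr * grad_b / n
        losses.append(loss / n)
    return head, bias, losses


# Tiny "data set two": made-up inputs labelled 1 = urban, 0 = rural.
DATASET_TWO = [
    ([1.0, 0.9, 0.1, 0.0], 1),
    ([0.9, 1.0, 0.0, 0.2], 1),
    ([0.1, 0.0, 1.0, 0.8], 0),
    ([0.0, 0.2, 0.9, 1.0], 0),
]

head, bias, losses = train_head(DATASET_TWO)
print(f"loss before training: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The design point is that only `head` and `bias` receive gradient updates, which is why a handful of labelled examples can be enough: the small data set only has to fit three weights and a bias, not the whole network.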
Before we understand what BERT is, let's talk very briefly about word embeddings. A word embedding is nothing but a feature vector representation of a word in your training corpus. What does that mean? The underlying concept is to use context to create connections or references between words. The famous example is: man is to king as woman is to what? The answer would be queen. Those kinds of relations emerge when we map words, across many different sentences, into word embeddings.
The next example I wanted to highlight is this one. You can see multiple words in tiny clusters, and fish, eggs, and meat fall into a similar cluster. This is a simplified version in 2D; in real cases it is a much richer, multidimensional representation. Another example is heat, electricity, energy, oil, and fuel, which form a mini cluster of their own. So the model learns that the distances between these words are small and that they are close to each other. This example shows that helicopter, drone, and rocket are much closer to each other than helicopter is to goose, whereas goose, eagle, and bee are much closer to each other than they are to drone or rocket.
This is the very famous BERT mountain, as I call it; Chris McCormick coined the term. What he says is that BERT has been built on top of a lot of technology that we have seen developing over the years. But if you are starting right now and trying to understand NLP and BERT, don't think that it's a huge domain where you have to do a lot of research and it's impossible to learn. It is not.
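The "closer to each other" intuition is usually measured with cosine similarity between word vectors. The sketch below uses hand-made three-dimensional vectors, which are an assumption for illustration only; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (illustrative assumption, not learned embeddings).
EMBEDDINGS = {
    "helicopter": [0.9, 0.8, 0.1],
    "drone":      [0.8, 0.9, 0.2],
    "rocket":     [0.9, 0.6, 0.0],
    "goose":      [0.1, 0.7, 0.9],
    "eagle":      [0.2, 0.8, 0.9],
    "bee":        [0.0, 0.6, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, k=2):
    """Return the k other words whose vectors are most similar to `word`."""
    target = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    others.sort(key=lambda w: cosine_similarity(target, EMBEDDINGS[w]), reverse=True)
    return others[:k]


print(nearest("helicopter"))  # the two flying machines
print(nearest("goose"))       # the two flying animals
```

In these toy vectors the first coordinate loosely encodes "machine" and the third "animal", which is what produces the two clusters; in a trained model no single coordinate carries such a clean meaning.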
He says that transformers and attention are the two concepts we need to understand, and they will give you a good baseline for applying BERT and fine-tuning it. Sometimes these can seem like big, daunting terms. The way I feel is that transformers can be a heavy concept, but they are there to help us, just like Optimus Prime. Don't get derailed by it; think about it in simple terms. You have some input. There is a layer of encoders that digests that input, and then there's a decoder that transforms it into the output you want. Here, for example, translation is happening: the input is in one language and the output is in a different language.
So what is BERT? BERT is the first of its kind: the first deeply bidirectional model, trained in an unsupervised way on plain text. When I say bidirectional, what I mean is this. Given multiple sentences, some of the earlier models would process them sequentially: first they would learn from left to right, and then they would learn from right to left. BERT is a single model that does both at once. What this achieves is that it learns information from both sides together and takes a particular token's context into account during training, without losing information about the tokens on its left side and the tokens on its right side.
To give you a quick understanding, this is a very famous example. Look at these two sentences: "We went to the river bank" and "I need to go to a bank to make a deposit." The word bank is grammatically the same in both, but its meaning is completely different depending on the context words it is used with.
BERT can quickly understand which one is the river bank and which one is the bank that is a financial institution.
This is the structure of how BERT's input is arranged. It's token based, it's segment based, and it's also position based. The position tells you where the token sits in the whole sequence, and the segment tells you whether the token is in the first sentence or the second if it's a multi-sentence input. And of course there is the token itself.
These are the two pre-training tasks. I'll quickly walk through them. I know we are running short on time, but we can discuss more in the Q&A section if you have any particular questions.
The first one shows the strength of bidirectionality. Take "The man went to the [MASK]. He bought a [MASK] of milk." We are masking words here, and the labels that BERT is trying to predict are "store" for the first mask and "gallon" for the second.
The next task that BERT has been pre-trained on is next sentence prediction. In the first pair, sentence A is "The man went to the store" and sentence B is "He bought a gallon of milk." That makes sense, right? Those two sentences follow each other, so the label is IsNext. In the second pair, the sentences are "The man went to the store" and "Penguins are flightless." Both are grammatically correct on their own, but they do not belong with each other, so the label is NotNext. So it's predicting whether the next sentence makes sense with the one you have in hand or not.
Then there is the concept of attention. Look at the sentence "The animal didn't cross the street because it was too tired." I have given the link to the post by Jay Alammar, who explains it in a very detailed way. You can look at it to understand more about attention, and we can discuss it further in our Q&A section.
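The input structure and the two pre-training tasks described above can be sketched as plain data preparation. This is a simplified illustration, not BERT's actual preprocessing: it splits on spaces instead of using WordPiece, it always substitutes `[MASK]` (the published recipe sometimes keeps the original token or inserts a random one), and the function names are made up for this sketch.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def build_input(sentence_a, sentence_b):
    """Return tokens plus the segment and position ids that accompany them."""
    a, b = sentence_a.split(), sentence_b.split()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)  # sentence A vs. B
    position_ids = list(range(len(tokens)))                # place in sequence
    return tokens, segment_ids, position_ids


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked language modeling: hide some tokens and keep them as labels."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if token not in SPECIAL and rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)   # the model must predict this token
        else:
            masked.append(token)
            labels.append(None)    # no prediction needed at this position
    return masked, labels


def make_nsp_example(sentence_a, true_next, unrelated, rng=None):
    """Next sentence prediction: pair A with its real successor half the time."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        return sentence_a, true_next, "IsNext"
    return sentence_a, unrelated, "NotNext"


tokens, segments, positions = build_input(
    "the man went to the store", "he bought a gallon of milk"
)
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked)
print(make_nsp_example("the man went to the store",
                       "he bought a gallon of milk",
                       "penguins are flightless"))
```

Because the model sees tokens on both sides of each `[MASK]`, the hidden word has to be inferred from its full context, which is the point of the bidirectional training described in the talk.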
I'm not going to go over the code at this point, but I wanted to highlight one more factor. This is a report by MIT, and it says that training a single AI model can emit as much carbon as five cars in their lifetimes. So the idea of transfer learning is to see if we can take what's already pre-built and use it in our small daily tasks, rather than retraining again and again for each particular use case.
As closing remarks: it is difficult, but we need to make sure that the data is clean and understand our data better than anything else. It's not magic, and advanced neural networks are not always the answer to all your problems. Start with a baseline NLP model and then go up step by step. These are some additional resources you can look through, and you can read some of the blogs.
Thank you for having this brilliant event. Thank you to all the fellow speakers and to all of you who joined us. I'm ready to take any questions you have, and do check out Indeliant's portfolio. I'm sure you'll find a lot of good use cases and case studies that they have worked on in the industry. Great. Thank you all for joining, and let's answer some questions if you have any. Thanks.
Hey, Jay. How are you doing?
I'm doing fine.
Fantastic. I understand that you're also in the United States. It always comforts me to know that someone I'm talking to is very close to home, only a few thousand miles away, on the other side of the country.
Yeah, the other side of the country.
East Coast, West Coast, doesn't matter. It's cool. All right. So with that, let's jump into some Q&A from our lovely audience. Here's a question: what kind of NLP problems does BERT not work well on? Any thoughts on that?
Yeah. I guess once you start learning step by step, looking at some of the state-of-the-art benchmarks will give you an idea. If you choose BERT for classification, it may not be the optimal algorithm you can use. For all kinds of question answering, where you have a factual kind of data set and the model is reading from the blocks of text you trained it on, it works better. For classification, again, it depends on what kind of business problem you're trying to solve. What I've seen working with it is that ALBERT gave me a pretty good score, and even BERT gave a good score, on a small question-answering classification task in a very recent project. That helps. And I always feel that there is no single correct answer to this. Some of these models are built to solve multiple problems, and it depends on what kind of data you have, how you are mapping it, and what the end result should be. Looking at the use cases and the state-of-the-art results for those models can help you weed out the kinds of problems you should not use those techniques for. But it can definitely give you a head start. Even for my classification problem, it gave me a baseline score of, say, 0.6, which is not great, but it's a good place to start. That's where you get your hands on it and ask: how can I tune it, maybe add more parameters, or what kind of feature engineering can I do to improve on that?
Yeah, that is a very good point, because as you mentioned, BERT can perform well if there are very strict classification labels, as in question answering with very direct answers.
But I'm thinking BERT might not be the best, or at least we're not at the stage yet where it's very good, in conversational AI. I've seen some video examples where BERT gives slightly funky answers here and there, especially if the responses tend to be very long-winded. Sometimes they don't make sense.
Yeah. I guess that's why clarifying, or answering through attention, is the goal in the end. The big players in the industry are still working on it, and there's so much in progress that they're still developing. So I guess one day we can say, "Put everything into BERT and BERT will solve it." I'm waiting for that day to come.
One day, one day. Let's hope it will come soon. Alrighty. So another question. I think you mentioned towards the end of your presentation that we would tackle this in the Q&A session, so I'm going to ask you now. Why do we train BERT on the two problems of next sentence prediction and masked language modeling specifically? Why not any other tasks?
Yeah. That is actually the answer to what we are trying to solve, right? Let's start from the thought process of Word2Vec. Word2Vec is also a good way of training a language model. What it doesn't give us is contextual understanding, or a way to handle words with multiple meanings. Sometimes it will map them correctly. But take the example I showed, a river bank versus a bank as a financial institution: how do you correctly separate the same term based on context?
For masking, since we are doing bidirectional training with BERT, the idea is to capture all the essence of the context. If we do not do it the masked way, then BERT, because it's reading from both sides, will be able to cheat in its own way and see what the next word could be. It actually happens that way. So the masked approach makes sure that we restrict it from seeing the answers during its bidirectional training, so that it has to infer them from what it has seen. That is one of the core challenges we are trying to solve compared with the other models.
The second one is next sentence prediction. It's very important for understanding context. Think about it from the perspective of summarization. If you're doing abstractive summarization, how do you know that your summary is capturing the essence of all the words and all the sentences from the whole block of text that you're passing through? Next sentence prediction is there to ensure that we are not mapping things incoherently, because sentences can be grammatically correct on their own but may not create semantic meaning when placed next to each other. When placed next to each other, they should make sense, and that continuous flow of meaning should be kept within the whole analysis. So, building on all the models that we have seen over time, I guess BERT is trying to solve these two problems at its core. That's how I understand it and see it working in all the solutions that it's applied to.
Yeah.
Actually, that's another good point, because I guess overall, with masked language modeling, we're just trying to understand the language itself, whereas next sentence prediction is trying to understand context within sentences, or between sentences, for larger inputs, right?
Not only the directly adjacent one, but also how it maps into the flow. Say sentence one and sentence three make sense together, but sentence two doesn't belong with sentence three. So how do you keep that conversational, meaningful flow? It all comes down to attention again, to what you're talking about in the context: is this following a correct pattern? So it's one of the key features that BERT is built on.
Very fascinating. Another question here from our audience: what was your main challenge or difficulty when you started working as a data scientist?
Yeah, the challenge still remains. It doesn't go away. When I started working with NLP text data, the first challenge was that it was too unstructured. How do you actually pre-process it to make sense of the data? Because in NLP, if you put in garbage, you are going to get garbage. There were challenges in handling context-based word usage, in linking synonyms, and in fixing spelling errors. Spelling errors are a big issue in unstructured NLP problems. And sometimes, because the spellings are very domain specific, you may not be able to use any of the generic spell checkers that are out there. So you need to work with SMEs and figure out what domain knowledge you need. Can they give us a corpus, metadata, tags, or the taxonomy they have in the back end? Sometimes it doesn't work out that way.
So it does need a lot of exploring and understanding the data yourself to see how you can fix it. I guess data pre-processing is a very crucial part of any NLP task, and it leads you to a better model. Otherwise, it doesn't matter how much feature engineering or fine-tuning you do: it will not work out, and you're going to be stuck in a maze of "what am I doing wrong?"
Yeah. Your model is only as good as your data is. So yeah, garbage in, garbage out.
Absolutely.
Alrighty. So here's another question. You showed this BERT pyramid, which was very interesting. And as you said, learning all these concepts shouldn't be that daunting. But when I look at that pyramid, I see that to know BERT, I need to know transformers. To know that, I also need to know attention. And to know that, I need to read papers on LSTMs and RNNs and understand those. You can see how you get into a rabbit hole where, when you're studying just one topic, you have to read 200 papers. So how do you understand transformers and BERT specifically without being bogged down by these very hardcore technical details, when you still need a good understanding? How do I do this?
Yeah, I would say the first way of doing it is to understand what each piece is trying to solve. What are transformers trying to solve? The first thing is encode, decode, and then create that path of transformation for the data. What is attention trying to do? It's basically trying to create context within the data, saying that this word belongs to this other word, or this is the noun or the adjective that it's referring to. These are the two general ideas. Now you need to build on them.
I understand that, underneath, a lot of intermediate methods have been developed along the way. But start with this core understanding, and then maybe read up on some of the blogs I've mentioned by Jay Alammar and Chris McCormick, and take a look at how they have implemented things in their use cases. If you understand what they do, and you don't want to get stuck down the rabbit hole trying to understand all the math behind it at once, I would say do it step by step. First understand what it does. Then see what kind of business scenarios it can solve and what kind of data you need for that scenario. Once you start applying it a little and reading up step by step on what it's doing, that's how you get the hang of it: if I'm stuck at this point, what fine-tuning options do I have? What additional features can I create to address that problem?
It happens a lot. If you look up any topic, you'll find thousands of reading materials and thousands of papers associated with it. But my learning curve has been this. I understand what it's trying to solve and where it comes from: it's trying to solve some of the challenges that Word2Vec cannot handle. Building on top of that, what input parameters are needed? What do the output categories look like? And then what are the internal steps? It's doing masked language modeling and it's doing next sentence prediction; these are the two main features. Then, what kind of data do I need to solve my problem? So just keep that thought process going without getting too stuck in the internals. But if you're interested, you can always dig deeper.
Definitely. And honestly, I think a lot of us would be interested in hearing way more details. But unfortunately, at this moment, we're out of time for Q&A.
But Jayita will be available in her speaker room starting right now. So if you have any questions for her, you can go to her speaker room and talk to her right now, and I'm sure she would be pleased to answer some of your wonderful questions. Thank you so much. It was great having you here, Jayita.
Thank you so much, Ajay.
Of course. It was a pleasure having you here, and we'll see you very soon. Take care.
35 min
02 Jul, 2021
