Power of Transfer Learning in NLP: Build a Text Classification Model Using BERT


The domain of Natural Language Processing has seen a tremendous amount of research and innovation in the past couple of years to tackle the problem of implementing high quality machine learning and AI solutions using natural text. Text classification is one such area that is extremely important across sectors like finance, media, product development, etc. Building a text classification system from scratch for every use case can be challenging in terms of cost as well as resources, even when there is a good amount of data to begin training with.


Here comes the concept of transfer learning. Using models that have been pre-trained on terabytes of data and fine-tuning them for the problem at hand is the new way to efficiently implement machine learning solutions without spending months on a data cleaning pipeline.


This talk will highlight ways of implementing the newly launched BERT and fine-tuning the base model to build an efficient text classification model. A basic understanding of Python is desirable.

35 min
02 Jul, 2021

Video Summary and Transcription

Transfer learning is a technique used when there is a scarcity of labeled data, where a pre-trained model is repurposed for a new task. BERT is a bidirectional model trained on plain text that considers the context of tokens during training. Understanding baseline NLP modeling and addressing challenges like context-based words and spelling errors are crucial. BERT has applications in multiple problem-solving scenarios, but may not perform well with strict classification labels or in conversational AI. Training BERT involves next sentence prediction and masked language modeling to handle contextual understanding and coherent mapping.

1. Introduction to Transfer Learning and NLP

Short description:

Hello, everyone! Welcome to this session on Transfer Learning with BERT. Today, we'll learn about Transfer Learning and NLP. NLP is a subfield of linguistics and AI. We'll discuss its challenges and how it can be used for text classification.

Hello, everyone, and welcome to this session on Transfer Learning with BERT. I'm very excited that you are all here and joining me in the session. So let's see what we'll learn about Transfer Learning and BERT today.

Before we start, a quick introduction about me. Hi, my name is Jayita Pudukonda. I work as a Senior Data Scientist at Intelliint US Inc. We are based out of New York City. To give you an overview, at Intelliint we are a customer-driven professional services company. We specialize in tailor-made software and tech solutions. We work with a lot of data analytics and AI based companies and build some of our tools and software around those areas. You can connect with me on Twitter and LinkedIn if you have any questions about the session later on, and we can discuss them there.

Great. So for today's session, we're going to talk for about 20 minutes and then we'll have some Q&A. I also have some code that I can share over my GitHub, so do reach out if you want to take a look at it after the session. Now the exciting part. I say that NLP is hard, and you might ask why. Take a look at this picture. What's the first inference that comes to your mind? Human beings can make great connections and references, so this is a very trivial task for us. But if you think from the perspective of a computer program, it's a very daunting challenge to understand this complexity of the English language. Here the picture says, "I am a huge metal fan." As humans, we would know that the metal fan here is a personified electric component saying, okay, I'm a metal fan, but it can also mean that you are a huge metal music fan. So how does the computer differentiate between these two meanings of the same terminology? There's ambiguity, there's synonymity, there's coreference. And of course the syntactic rules of English make it an even more daunting task for a computer program. So for today's agenda, we'll quickly go over what NLP is, where it is used, how transfer learning can be used, and we'll look at a simple case of utilizing BERT for a text classification model. So let's jump right into it. For NLP, I feel this image describes it very well. It's a subfield of linguistics and artificial intelligence.

2. Introduction to NLP Applications and Techniques

Short description:

It also spans computer science and information engineering. NLP has had exponential growth in the last two years. It's used in machine translation, chatbots, intent classification, text generation, topic modeling, clustering, and text classification. To work with NLP, we need to handle extra spaces, tokenize text correctly, perform spell check, and use contraction mapping.

It also spans computer science and information engineering. So it basically helps machines to understand and communicate back and forth with human beings in free-flowing speech without losing context or references. And NLP has had exponential growth in the last two years or so; the huge models that Google, OpenAI and NVIDIA have been working on and releasing have humongous numbers of parameters. Just in May 2020, OpenAI came up with a 175 billion parameter model, and you can imagine how much text has gone into training that model and how much work it can do with such accuracy.

So where is NLP used? I know that most of you are definitely familiar with it, but I just want to give a quick description of where I feel it's used the most, and the areas I've worked on hands-on. There's machine translation, text to audio and audio to text, chatbots, building knowledge trees, intent classification, and also natural text generation. I'm sure when you use Gmail you have seen the prompt that keeps coming up while you're writing an email, suggesting the next couple of words for the sentence you're trying to complete. Those are text completion prompts. NLP is also used a lot in topic modeling and clustering, understanding the context of the whole corpus and what kind of insights can be generated from a huge amount of text data. And also text classification: say you want to do sentiment analysis, how do you understand the general sentiment of, say, Yelp reviews or Amazon product reviews? Those are areas with a lot of good applications for NLP.

So how do we do it? I know these can sound like a lot of underlying steps, but it's very important that we do them in all kinds of business cases or business problems that we're trying to solve using NLP, so just a quick idea. When we have all our text, we need to make sure that we handle extra spaces. Then we also need to look at how we tokenize our text. Tokenization just by spacing is the traditional norm, but we also need to take care of use cases like, say, the whole phrase United States of America: if we tokenize it just by the spaces, sometimes that doesn't make sense in the context we're trying to work with. In that case we need to keep United States of America as a whole phrase rather than splitting it on spaces, so that the information extraction works much better than it would if it were tokenized word by word. The next step is spell check. I'm referring here to a great tool that Peter Norvig from Google created: a spell checker. Basically the idea is to compare the distance between multiple words and see whether a spelling makes sense and how close it is to a similar spelling or a similar word meaning that's there in the vector space of the whole NLP corpus. You can see here that when I pass the wrong spelling with a single L, the return value is the spelling with a double L, and when I pass "korrectud" with a K, the final value is "corrected" with a C. This can really help in making sure that our data is clean. Like they say in NLP, it's garbage in, garbage out. So we need to make sure that the clean data makes sense grammatically, syntactically, and also semantically. The next step is contraction mapping. It might seem odd that we need to do it.
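As a quick illustration of that spell-check step, here is a minimal sketch of Peter Norvig's frequency-based corrector. The corpus file name is a placeholder: any large plain-text body of English will do for building the word counts.

```python
import re
from collections import Counter

# A minimal sketch of Norvig-style spell correction.
# "corpus.txt" is an assumed local text file used to count word frequencies.
WORDS = Counter(re.findall(r"\w+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only candidates that actually appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Pick the most frequent known candidate within one or two edits."""
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=WORDS.get)

print(correction("speling"))    # -> "spelling"
print(correction("korrectud"))  # -> "corrected"
```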

3. Tokenization, Stemming, and Lemmatization

Short description:

Tokenization may not handle contractions like won't and can't correctly. Stemming and lemmatization help maintain corpus metrics and identify context.

But then comes the understanding that if you're using words like won't and can't, the tokenization may or may not be able to handle them correctly. So we need to expand them and map them back to their full words: won't to will not, and can't to cannot. The next step is definitely stemming and lemmatization. This is just to ensure that we're keeping our NLP corpus matrix as tight as possible: similar words like gaming and games have the base word game. The idea is to keep the base word so that the meaning around it remains the same and the context can be identified in a better way.
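Below is a small sketch of that normalization step using NLTK (an assumed library choice; spaCy works just as well), showing how different word forms collapse to a base word.

```python
# Illustrative sketch: stemming vs. lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["gaming", "games", "gamed"]:
    # Stemming chops suffixes; lemmatization maps to a dictionary base form.
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# All three forms reduce to the base word "game".
```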

4. Stopwords and Understanding Data

Short description:

Stopwords are extra words like articles and pronouns that may not add much context to a sentence. Removing them depends on the specific use case. Legal or patent datasets may not benefit from removing stopwords. Understanding the data and its structure is crucial before modeling.

The next is stopwords. So, you know, we don't always write in a perfectly clean way: say you're scraping a lot of text from Twitter, Reddit and other news blogs, there are a lot of extra words like articles, the occasional "uh", and other pronouns that may not add much context to the whole sentence. We can get rid of those. But I would also like to mention that this is very much a case-by-case decision. If you have a dataset like a legal dataset or a patent dataset, it may not work well to remove all these stopwords. So we need to understand our business use case and make sure it makes sense to remove those stopwords. And of course there are case-based things you can do, like part-of-speech tagging to see whether a word is a verb or a noun; that helps you understand your text insights much better before jumping into modeling. That's why they say you should know your data, and unstructured text is more difficult to understand. So before you jump into the modeling sections, make sure you have a complete understanding of, and insights into, what kind of data you have at hand.
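A brief, hedged sketch of stopword removal and part-of-speech tagging with NLTK (again an assumed library choice; the sentence is a made-up example):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The reviews for the new product were surprisingly positive"
tokens = word_tokenize(sentence.lower())

# Removing stopwords is a case-by-case decision; for legal or patent text
# you might skip this step entirely.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)           # ['reviews', 'new', 'product', 'surprisingly', 'positive']

# POS tags help you understand the text before modeling.
print(nltk.pos_tag(filtered))
```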

5. Understanding Transfer Learning

Short description:

Transfer learning is a machine learning technique where a model trained on one task is repurposed for a second task. It involves utilizing the knowledge from a previously created dataset and retraining the last layer to classify new data. This technique can be applied in various scenarios, such as image classification and speech recognition.

Great. So now that we have covered all the baseline techniques, let's jump right in and think about what transfer learning is. There's a great quote by Andrew Ng; of course, he's a huge figure in the AI/ML space. In 2016 he mentioned that transfer learning would be the next great driver of ML success. And we are right now in 2020, and you can see that so many models have come out to solve different kinds of problems via transfer learning.

So what is it? A quick way to understand it: it's nothing but a machine learning technique where a model is trained on one task and then repurposed for a second task. Now, we'll see what that means. Let's say you have dataset one, which is, say, a general image dataset, and Y, the target variable, is a classified object. That object could be a cat or a dog, a tree, a bus, or a truck, any worldly object that's already there in the image dataset. An example of this is ImageNet; I'm sure all of you have worked a little bit or tried your hands on this dataset in your work. Now, in a general view, this is the structure of a neural network with a few hidden layers, input layers and an output layer. The output layer for the dataset one problem would be the classification between cat, dog, or any other particular object from the training dataset.

Now, think about the second case. You have dataset two, and it's a very small dataset. You do not have a lot of training cases, but you have a few. And the target is whether the image is an urban image or a rural image. So what do you think is the common connection between these two datasets? Both of them deal with images, both of them have a lot of worldly objects captured in them. But the final goal is not exactly the same. So here transfer learning comes into action, where you can utilize the knowledge from dataset one that you built in your first phase of modeling and just retrain the last layer. Here it's layer four, if you can see, the last dense layer: retrain the last layer's weights with a softmax over the new classes, changing that final output layer from classifying cat, dog, ship, tree, etc. to classifying urban or rural based on the new dataset we have brought in at layer four. Now this can seem like, okay, why are we doing this? Before that, let's see where we do this. Even in speech recognition: say you're scraping voice data from different sources like YouTube in English or Hindi, which is one of the languages we speak back in India, and you pre-train your network. Now say you have a very small dataset in Bengali and you want to translate or understand what the speech is. You can utilize the model that you have trained with a bigger dataset in English and then, via translation, use it for speech recognition in Bengali. The second image, the second example, is what we spoke about right now.
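Here is a minimal, hypothetical Keras sketch of that "retrain only the last layer" idea: freeze an ImageNet-pretrained backbone and attach a new two-class softmax head for urban versus rural. The backbone choice and the dataset are assumptions for illustration, not the exact setup from the slides.

```python
import tensorflow as tf

# Pretrained backbone (dataset one: ImageNet-style general objects).
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pretrained layers

# New output layer for dataset two: urban vs. rural.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds is assumed to be a small tf.data.Dataset of (image, label) pairs.
# model.fit(train_ds, epochs=5)
```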

6. Introduction to Transfer Learning

Short description:

Transfer learning is used when there is a scarcity of labeled data. Instead of creating a new labeled dataset, we can utilize a pre-trained model that has already solved a similar problem. The input properties of the tasks should be similar.

So if you have an image network with millions of tagged images of general objects of the world, you can utilize that for image recognition tasks like urban versus rural, and so on. So why do we use transfer learning? Is there a positive gain that we want to achieve? Yes. First, when there is a scarcity of labeled data. We understand that creating a labeled dataset can be very expensive; we need to spend time and expert resources on subject matter experts (SMEs) who help us create those tagged datasets. If you do not have that to start with, and it can get expensive, what is in our hands is to try and find out whether a similar, bigger problem has already been solved, or a pre-trained model already exists. If there's already a network that exists, pre-trained on massive amounts of data, that we can utilize, that's the next best scenario. And of course, task one and task two should have roughly the same input properties. That's what I mentioned in the example: we had data with cats, dogs and other worldly objects, and the second task was rural versus urban, but the input features or components remained essentially the same.

7. Introduction to Transfer Learning and BERT

Short description:

In transfer learning, we leverage a knowledge base from two source tasks to achieve a target task. BERT is a key component in this process. Word embeddings create connections between words based on their context. Clusters of related words help the model understand their proximity. BERT Mountain is a term coined by Chris McCormick, who emphasizes the importance of understanding transformers and attention for applying BERT.

I'm sure all of you have seen this image before. This is just the difference from a traditional ML model, where you have different tasks and you create a different learning system for each task. Here in transfer learning, you have two source tasks where you create a model and a knowledge base for it, and the idea is how we can leverage that knowledge base to achieve a target task, which in the case we discussed was urban versus rural.

Now, BERT comes into play here. And before we understand what BERT is, let's talk very briefly about word embeddings. A word embedding is nothing but a feature vector representation of a word in your training corpus. What does that mean? The underlying concept of utilizing context is to create connections or references between multiple words. The famous example is: man is to king as woman is to what? The answer would be queen. That kind of relation emerges when we map words using the different sentences and contexts in which they appear.
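As a small, hedged illustration of that analogy with pretrained vectors (the gensim downloader and the glove-wiki-gigaword-100 model name are an assumed, commonly available choice; any pretrained embedding would do):

```python
import gensim.downloader as api

# Load pretrained word vectors (downloads on first use).
vectors = api.load("glove-wiki-gigaword-100")

# "man is to king as woman is to ?"  ->  king - man + woman
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]
```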

The next example I just want to highlight: if you look here, there are multiple words and you can see tiny clusters, and you can see that fish, eggs and meat fall under a similar cluster. This is a simplified version in 2D; in a real case it's a much richer multidimensional matrix representation. In another example here, heat, electricity, energy, oil and fuel form a mini cluster of their own. So the model learns that when one of these words appears, the distances between these words are small and they are close to each other. In this example, it just shows that helicopter, drone and rocket are much closer to each other than helicopter is to goose, whereas goose, eagle and bee are much closer to each other than they are to drone or rocket. OK. So this is the very famous BERT Mountain, as I call it; Chris McCormick coined the term. What he mentions is that BERT has been built on top of a lot of technology that we have seen developing over the years. But if you are studying right now and trying to understand NLP and BERT, don't think that it's such a big domain that you have to do endless research and it's impossible to learn. It is not. He says that transformers and attention are the two concepts we need to understand, and that will give you a good baseline to start applying BERT and fine-tuning it. Now, sometimes these can seem like big, daunting terminologies. But how I feel is that Transformers can be a heavy concept, but they are there to help us.

8. Introduction to BERT and Bidirectionality

Short description:

BERT is the first deeply bidirectional model trained in an unsupervised way on plain text. It learns information from both sides of the input and considers the context of tokens during training. BERT's input structure is token-based, segment-based, and position-based. It can understand different meanings of words based on context. BERT has been pre-trained on tasks like masked word prediction and next sentence prediction. Attention is a concept that helps in understanding the context of words.

So, just like Optimus Prime, don't get derailed by it. Just think about it in a simple way: you have, say, some input. There is a layer of encoders that takes that input, and then there are decoders that transform those inputs into the outcome that you want. Like here, where translation is happening from one language and the output is in a different language.

So basically, what is BERT? BERT is the first of its kind: the first deeply bidirectional model trained in an unsupervised way on plain text. When I say bidirectional, what I mean is that if you have, say, multiple sentences, some of the earlier models would process them sequentially: first they would learn from left to right and then from right to left. But BERT is a single model that does it at once. What this helps us achieve is that it learns information from both sides together and takes a particular token's context into account during training without losing any additional information about the tokens available on the left side and the tokens available on the right side.

So just to give you a quick understanding, this is a very famous example. If you look at these two sentences, one says we went to the river bank, and the other says I need to go to a bank to make a deposit. The word bank here is grammatically the same word, but its meaning is completely different based on the context words it's used with. Now BERT can quickly understand which one is connected to a river and which is a bank as a financial institution. This is the quick structure of how BERT's input is arranged: it's token-based, segment-based, and position-based; that is, which position the token is at in the whole sentence, which sentence the token belongs to when it's a multi-sentence input, and of course the token itself. These are the two pre-training tasks I'll quickly walk over. I know we are closing in on time, but we can discuss more in the Q&A section if you have any particular questions about it.
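As a hedged sketch of that "bank" example, the snippet below uses the Hugging Face transformers library with the standard bert-base-uncased checkpoint to show that the contextual vector for "bank" differs between the two sentences:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("we went to the river bank")
money = bank_vector("i need to go to a bank to make a deposit")

# Same surface word, different contextual meaning: similarity is well below 1.0.
print(torch.cosine_similarity(river, money, dim=0).item())
```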

So, the idea of the first task, to understand the strength of bidirectionality, is: "the man went to the ___" and "he bought a ___ of milk", where we are masking the words, and the labels that BERT is trying to predict are store for the first blank and gallon for the second. The next task that BERT has been pre-trained on is next sentence prediction. If you look at the first pair, sentence A is "the man went to the store" and sentence B is "he bought a gallon of milk". That makes sense, right? Those two sentences go with each other, so the label would be "is next sentence". But in the second pair, "the man went to the store" and "penguins are flightless", though they're grammatically correct as individual sentences, they do not belong with each other. So the label would be "not next sentence". It's predicting whether the next sentence or the next token makes sense with the one you have in hand or not. Then there is the concept of attention. Attention is about, if you look at the sentence "the animal didn't cross the street because it was too tired", working out what "it" refers to. I have given the link to the post by Jay Alammar.
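A minimal sketch of that masked-word-prediction task using the Hugging Face fill-mask pipeline (the bert-base-uncased checkpoint is an assumed choice):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the masked position.
for prediction in fill_mask("The man went to the [MASK] and bought a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
# Plausible completions like "store" or "market" should rank near the top.
```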

QnA

Discussion on BERT and NLP Problems

Short description:

He talks about attention and the importance of utilizing pre-built AI models for small daily tasks. Clean data and understanding the baseline NLP modeling are crucial. Thank you to MLConf and the fellow speakers. Let's answer some questions. What kind of NLP problems does BERT not work well on? It depends on the business problem. ALBERT and lighter BERT variants can be effective for question answering classification.

He talks about it in a very particular way. So you can look at it and understand about attention more and we can discuss more about in our Q&A section.

So I'm not going to go over the code at this point, but I just want to highlight one more factor. There's a report by MIT which says that training a particular AI model can emit as much carbon as five cars in their lifetimes. So the idea of transfer learning is to see if we can utilize something already pre-built and use it for our small daily tasks, rather than retraining again and again for each particular use case.
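Since the code itself isn't walked through in the talk, here is a minimal, hypothetical sketch of what fine-tuning BERT for text classification typically looks like with the Hugging Face Trainer API. The texts, labels, and hyperparameters are placeholders for illustration, not the speaker's actual GitHub code.

```python
import torch
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

# Tiny placeholder dataset: 1 = positive, 0 = negative.
texts = ["great product, works as advertised", "terrible, broke after a day"]
labels = [1, 0]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Pretrained BERT with a fresh classification head on top.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ReviewDataset(encodings, labels),
)
trainer.train()
```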

So, as closing remarks, it is difficult, but we need to make sure the data is clean and understand our data better than anything else. It's not magic, and advanced neural networks are not always the answer to all your problems. Start with baseline NLP modeling, and then go up step by step. These are some additional resources you can look through, and you can read through some of the blogs. Thank you to MLConf for having this brilliant event, and thank you to all the fellow speakers and to you who joined us. Thanks. I'm ready to take any questions if you have them, and do definitely check out the Intelliint portfolio; I'm sure you'll find a lot of good use cases and case studies that we've worked on in the industry. Great, thank you to all of you and thanks for joining. Let's answer some questions, if you have any. Hey Jay, how are you doing? Hi, Ajay, how are you doing? I'm doing fine. Fantastic, fantastic. I understand that you're also in the United States. It always comforts me to know that someone I'm talking to is very close to home, only a few thousand kilometers away. Yeah, the other side of the country. East coast, west coast, doesn't matter. It's cool. All right, so with that, let's jump into some Q&A from our lovely audience. Here's a question: what kind of NLP problems does BERT not work well on? Any thoughts on that? Yeah. So I guess once you start learning step by step, looking at some of the state-of-the-art measures will give you an idea. I know for a fact why I chose BERT for classification; it may not be the optimal algorithm you can use for a classification technique, but for all kinds of question answering, where you have a factual kind of dataset and the model is reading from the blocks of data you're training on, it works better. For classification, it sometimes depends on what kind of business problem you're trying to solve. What I've seen working with it is that ALBERT gave me a pretty good score, and even a lighter BERT gave a good score for a tiny question answering classification task that I was working on in a very recent project. And that kind of helps.

Applications and Limitations of BERT

Short description:

Some applications of BERT can be used to solve multiple problems, depending on the data and mapping. The state of the art results can provide a head start. BERT may not perform well in strict classification labels or conversational AI. There is ongoing development in the industry to improve BERT's performance.

And I always feel that there is no good, correct answer to it. Some of these applications are built for multiple problems, to solve multiple problems, and it kind of depends on what kind of data you have, how you are mapping it, and what's the end result. So looking at some of the use cases and what are the state of the art results for those models, I guess can filter those weeds out of what kind of problems you may or may not use these techniques for, but definitely can give you a head start.

So even for the classification problem, it just gave me a very baseline score of say 0.6, which is not great, but it's good to start. And that's where you get your hands on, and say, okay, how can I tune it on, and maybe add more parameters, or what kinds of feature engineering I can do to improve that on.

Yeah, that actually is a very good point, because, like you mentioned, I don't think BERT would perform well if there are very strict classification labels in question answering, with very direct answers. And I'm thinking BERT might not be the best, at least we're not at the stage yet where it's very good at conversational AI. I've seen some video examples where BERT gives slightly funky answers here and there, especially if the responses tend to be very long-winded. Sometimes it doesn't make sense. I guess that's why the problem of clarifying, or answering through attention, is the goal at the end. And I guess the big players in the industry are still working on it, and there's so much in progress that they're still developing, so one day we can say, oh, give everything to BERT, and BERT will solve it. I'm waiting for that day to come. One day, one day. Let's hope, and it will come soon.

BERT Training and NLP Challenges

Short description:

BERT is trained on next sentence prediction and masked language modeling to handle synonymous words, contextual understanding, and coherent mapping. It solves the challenge of capturing essence in bidirectional training and ensuring meaningful context between sentences. Masked language modeling helps BERT understand the language itself, while next sentence prediction focuses on contextual understanding. The main challenge in working as a data scientist in NLP is dealing with unstructured text data and addressing issues like context-based words, synonyms, and spelling errors.

Alrighty, so another question. I think you mentioned towards the end of your presentation that we would attack this in the Q&A session, so I'm going to ask you now. Why do we train BERT on the two problems of next sentence prediction and masked language modeling specifically, and not any other tasks? Yeah, I guess that actually is the answer to what we are trying to solve, right? Let's start from the thought process of word2vec. Word2vec is also a good way of doing language model training, but what it doesn't serve is that it doesn't help us create the contextual understanding of how to handle synonymous words. Sometimes it will map them correctly, sometimes not, as in the use case example I showed for a bank: a river bank and a financial bank institution. How do you correctly segregate those same terms based on context? So for masking, what happens is, since we are doing bidirectional training with BERT, the idea is to capture all the essence, and if we do not do it the masked way, what will happen is that BERT, because it's reading from both sides, will be able to cheat in its own way and see what the next word should be. It actually happens that way. So I guess the masked approach is just to make sure we're restricting it so it doesn't see the solutions during its bidirectional training, but instead tries to infer them from what it has seen. So I guess that answers one of the core challenges we're trying to solve versus the other models. And the second one, next sentence prediction: I guess it's very important for understanding context. Think of it from the perspective that if you want to do summarization, say abstractive summarization, how do you know that when you are summarizing you're getting the essence of all the words or all the sentences from the whole block of training corpus you're passing through? So I guess the next sentence task is just to ensure we are not mapping things incoherently, because sentences can be grammatically correct on their own, but they may not create semantic meaning when placed next to each other. When placed next to each other, they should make sense, and that continuity of meaning should be kept within the overall analysis. So I guess, building on all the models we have seen over time, BERT is trying to solve these two problems in a hands-on, core way. That's how I infer it and see it working in all the solutions it's applied to. Yeah. So, and yeah, actually, that's also another good point, because overall, with masked language modeling, BERT is just trying to understand the language itself, whereas with next sentence prediction, it's trying to understand context within or between sentences for larger inputs, right? Even beyond the direct pair, also how it maps into the flow of it. Let's say sentence one and sentence three make sense together, but sentence two doesn't belong with sentence three. So how do you keep that conversational or meaningful flow? It all comes down to attention again, like what you're talking about in the context: is this following a correct pattern in the way it's being described? So yeah, it's one of the key features of BERT. Very fascinating.

Another question here from our audience. What was your main challenge or difficulty when you started working as a data scientist? Yeah, the challenge still remains; it doesn't go away. When I started working with NLP text data, the first challenge was that it was too unstructured. How do you actually pre-process it to make sense of the data? Because in NLP, if you put in garbage, you are going to get garbage out. There were challenges of fixing context-based word usage, how to attach synonyms, how to connect or fix spelling errors. Spelling errors are a big issue in unstructured NLP problems.

Understanding Transformers and BERT

Short description:

Sometimes domain-specific spellings require working with SMEs to understand the domain knowledge needed. Data preprocessing is crucial for NLP tasks and leads to better models. Understanding transformers and BERT can be achieved by focusing on the problems they solve and gradually exploring their implementation in business use cases. Core understanding, step-by-step learning, and applying the concepts help gain proficiency.

And sometimes it happens that since they are like very domain specific spellings, you may or may not be able to use any generic spell checkers that's out there. So you need to kind of work with SMEs, try and figure out what's the domain knowledge that you need. Can they give us some corpus or metadata or tags that they have or the taxonomy that they have in the back end?

Sometimes it doesn't work out that way. So it does need a lot of exploring and understanding the data yourself to see how you can fix it. So yeah, I guess data pre-processing for any NLP task is a very crucial part, and that's what leads you to a better model. Otherwise, it doesn't matter how much feature engineering or fine-tuning you do; it will not work out and you're going to be stuck in a maze of "what am I doing wrong". Your model is only as good as your data: garbage in, garbage out. Absolutely.

Alrighty. So here's another question. You showed this BERT pyramid, and that was very interesting. And like you said, just learning all these concepts shouldn't be that daunting. But when I look at that pyramid, I see: OK, for BERT I need to know transformers. And to know those, I also need to know attention. And to know that, there are many papers on LSTMs and RNNs to understand. And eventually you can see how you can get into this rabbit hole where, when you're just studying one topic, you have to read two hundred papers.

So how do you understand transformers, and BERT specifically, without being bogged down by these very, very hardcore technical details that you still need to understand? How do I do this? Yeah. So I would say the first way of doing it is to just understand what it's trying to solve. What are transformers trying to solve? The first thing is encode, decode, and then create that transformation of data. What is attention trying to do? It's basically trying to create context within the data, saying that, OK, this word belongs with this word, or with this noun or this object that it's referencing. These are the two general ideas. Now you need to map it up. I understand that, underlying these, a lot of in-depth approaches have been developed. But if you understand just this core idea, then maybe read up on some of the blogs I've mentioned, by Jay Alammar and Chris McCormick, and take a look at how they have implemented it in their business use cases. If you understand what they do, and you don't want to get stuck down the rabbit hole of understanding all the math behind it at once, I would say do it step by step. Understand what it does first, then see what kind of business scenarios it can solve and what kind of data you need for those scenarios. Once you start applying it a little bit and read up step by step on what it's doing, that's how you get a hang of it: if I'm stuck at this position, what fine-tuning options do I have? What additional features can I create to map that problem? It happens a lot.

Exploring Challenges and Q&A

Short description:

If you look up any topic, you'll find thousands of reading materials and thousands of papers associated with it. It's trying to solve some of the challenges that Word2Vec cannot handle. The internal steps involve masked language modeling and next sentence prediction. Unfortunately, we're out of time for Q&A, but Jayita will be available in her speaker room.

If you look up any topic, you'll find thousands of reading materials and thousands of papers associated with it. But that's been my learning curve. I just understand what it's trying to solve from where. It's trying to solve some of the challenges that Word2Vec cannot handle, building on top of that.

What are the parameters and inputs needed? What do the output categories look like? And then what are the internal steps? It's doing masked language modeling and next sentence prediction; these are the two main features. Okay, now what kind of data do I need to solve some challenges? So yeah, just keep that process and thinking flow going without getting too stuck in the internal details. But if you're interested, you can always go deeper, I guess. Definitely.

And honestly, I think a lot of us would be interested in hearing way more details. But unfortunately, at this moment, we're out of time for Q&A. Jayita will definitely be available in her speaker room starting right now. So if you have any questions you need to ask her, you can go talk to her right now in her speaker room, and I'm sure she would be splendidly pleasant to answer some of your wonderful questions. Thank you so much. It was great having you here, Jayita. Thank you so much, Ajay. Of course, it was a pleasure having you here. And we'll see you very soon. Take care.
