This workshop will be both a gentle introduction to Machine Learning and a practical exercise in using the cloud to train simple and not-so-simple machine learning models. We will start by using Automated ML to train a model to predict survival on the Titanic, and then move to more complex machine learning tasks such as hyperparameter optimization and scheduling series of experiments on a compute cluster. Finally, I will show how Azure Machine Learning can be used to generate artificial paintings using Generative Adversarial Networks, and how to train a language question-answering model on COVID papers to answer COVID-related questions.
Introduction to Machine Learning on the Cloud
AI Generated Video Summary
Azure Machine Learning is a powerful tool for both experienced data scientists and beginners, providing tools for training machine learning models and scheduling experiments. It offers flexibility in separating compute from storage and includes features like AutoML and the ability to create pipelines using the Designer tool. The workshop also covers running experiments remotely on clusters, hyperparameter optimization, training generative adversarial networks, and using DeepPavlov for question-answering models. Azure ML can be used for low-code and no-code scenarios, making it a versatile platform for machine learning tasks.
1. Introduction to Azure Machine Learning
Hello. Today we'll be talking about Azure Machine Learning, which is the specific service for machine learning in Microsoft Azure Cloud. It is a central part of the whole machine learning experience in Azure Cloud, suitable for experienced data scientists as well as beginners. Azure Machine Learning provides tools for training machine learning models, including the ability to record training results and schedule experiments. It also offers the flexibility to separate compute from storage, allowing for efficient data processing and training on powerful machines. The workspace in Azure Machine Learning contains everything needed for machine learning, including datasets, data stores, and compute resources. Jupyter Notebooks can be used to connect to the data store and run models. Scheduling experiments on clusters is another option, especially useful for collaborative projects with multiple data scientists.
Hello. I hope that everyone can hear me. Please let me know if that's true in the chat or say something.
Let's start. So what is Azure Machine Learning? Today we'll be talking about Azure Machine Learning. Azure is the Microsoft cloud, and in Microsoft Cloud there are many tools to do machine learning. The easiest way to do machine learning, of course, is just to create a virtual machine. That's quite obvious, right? You want to train a neural network. You create a virtual machine with a GPU, a powerful machine. You train your model, then you discard this machine, and you don't pay for it anymore. That's a good way to train your model, but probably not the best way, because a virtual machine is not specific to machine learning. It has all the tools installed, but there are sometimes better ways to do things. And definitely, like all cloud providers, Microsoft, Google, and Amazon understand that machine learning is a large part of cloud workloads, and that's why they try to create specific services for machine learning. This specific service for machine learning in the case of Microsoft is called Azure Machine Learning, and that is what we'll be talking about today.
There is also one more interesting service which you should at least know about, called Databricks, which is for big data. If you have a task where you have a lot of data and you need to process it on a cluster using Spark, Azure Databricks is a good solution for that. Databricks is a managed solution for Spark. You don't have to set up the cluster, you don't have to worry about how this thing works; you basically just get a notebook where you can type your code, and it will be executed in parallel on several machines. Azure Databricks is for big data; Azure Machine Learning is for machine learning. Of course, it also supports quite large datasets, large-scale tasks, parallel training, clusters, and all those kinds of things. In addition to that, Microsoft Cloud also has some pre-trained models. I will not talk about them today, but those are the so-called Cognitive Services. They are models which have already been trained on some internal datasets, which support computer vision tasks, and which can handle speech-to-text, text-to-speech, translation, and those kinds of things. Services like Custom Vision even allow you to recognize custom objects pretty easily. So, if you have a typical task, Cognitive Services can do that for you, but if you have some kind of specific task which you need to train on your own data, that can be done with Azure Machine Learning. So, Azure Machine Learning is a central part of the whole machine learning experience in Azure Cloud, and that's why it's good for experienced data scientists as well as for beginners. And that is a kind of hard positioning, because you cannot be good for everyone, and that's why there was some confusion: in the beginning, Azure Machine Learning was targeted only at very experienced professionals, and it was, quite honestly, quite difficult to start using it. Even now, starting to use it requires some kind of learning curve, and that is exactly what I want you to overcome today, so that you are not afraid to make that first step to start using Azure Machine Learning. Because for me — I did not introduce myself, maybe that's not good — I work for Microsoft, and for the last year I have worked as a cloud developer advocate, which means that I do talks, workshops, I write documentation, doing evangelism kind of work, and also collecting feedback. So if you have some problems with Microsoft products, developer products, you can always reach out to me. Before that, for two or three years, I was working as a software developer. We had a small team of people who jumped on small projects together with Microsoft partners to solve problems related to artificial intelligence and machine learning. For example, if a company wanted to create a model to detect goals in a football match — that is one real problem which we were trying to solve — we would come together and for about a week or two we would try to solve this problem. So we trained a lot of machine learning models, and in the beginning we were using virtual machines, exactly because Azure Machine Learning had this kind of learning curve which made it difficult to begin using. But once we realized all the benefits, it became definitely the tool of choice for all machine learning tasks. Lately it has also added tools for beginners.
So there are some tools which allow you to train machine learning models without knowing a lot, just by providing data and saying: I want to train a model based on this data. That is the simplest kind of task, and we'll start with it in a few moments. For experienced data scientists, the main problem with virtual machines is that normally a data scientist spends a lot of time preparing data, and then does a lot of experiments, tuning parameters for the models and trying to see which architecture of the model works best. And tuning parameters is a boring task, because you start training, you wait, and then when it finishes you need to record the accuracy, then you again change some parameter and wait again, and that becomes pretty boring. Azure Machine Learning can solve this problem. It can record the results of training for you in a registry. You will always know which model achieved which accuracy, and you can schedule, say, 100 experiments in parallel with different parameters and then see which one works better, and do those kinds of things. So it takes a lot of manual work away from you, and that is the main benefit of Azure Machine Learning for a professional. So, let's see how it is structured. The idea of Azure Machine Learning is that you have the notion of a workspace, and the workspace is something which contains everything you need for machine learning. For machine learning you need data, so in the workspace you have datasets and data stores. A data store is pretty much the storage where you can upload data, and a dataset is a logical abstraction on top of it. So if you have some piece of data which is, for example, a table — tabular data — you can define it as a dataset, and it will also contain some metadata about this data. After you have the dataset, what you also need is to train the model. So you have compute available inside the workspace: you can create a cluster, you can create a virtual machine and use that for training. And you can access this compute using Jupyter Notebooks. So one of the simplest ways to use Azure Machine Learning is just to say: I will upload my data to the data store, I will connect to this data store from some kind of compute, and using a Jupyter Notebook I will run my model. That is how data scientists normally work; they do it either on a virtual machine or on their own machine. But here you can do it in a much better way, because the advantage is that you can have the compute separated from storage. You can have the data in one place, and then in the beginning, when you are processing data, when you are doing data preparation, you can attach, for example, a normal virtual machine without a GPU, and once the data is ready, you can attach a powerful GPU machine to the same storage and train the model using powerful compute. It also supports another style of training, which is scheduling experiments on the cluster. That is quite good if you have, for example, several data scientists working on one project. They can all create different models, different training scripts, and they can submit them for execution. So you write the script and you say: I want to execute it on the cluster. And then the experiment can log the results into the registry. So normally, if you do that, all your results will be traced in one central place in this experiment. You can also do the same if you run in the notebooks.
You can use logging facilities to store obtained metrics in one place.
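As a minimal sketch of what that logging looks like (assuming the Azure ML Python SDK, azureml-core, is installed and the script is submitted as an experiment run):

```python
from azureml.core import Run

# Get the current run context (works inside a submitted experiment,
# and falls back to an "offline" run when executed locally).
run = Run.get_context()

# ... train your model here ...
accuracy = 0.85  # placeholder value for illustration

# Log the metric so it shows up in the experiment's run history.
run.log("accuracy", accuracy)
```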
2. Azure Machine Learning Workspace and Creation
You can store and deploy models in Azure Machine Learning workspace. It provides tools for different styles of using Azure ML, including Jupyter notebooks, scheduling experiments, parallel training, hyperparameter optimization, and setting up the ML Ops lifecycle. To start using Azure Machine Learning, you can create a workspace through the Azure portal. You can also use the Azure CLI, but the portal is more intuitive. Specify the resource group, workspace name, and region to create the workspace.
And then, together with metrics, you can also store models. So you always have the registry of all models that you trained. And once you have the model, once you selected the best model, you can deploy it also from inside the workspace. So workspace allows you to define the cluster for deployment. It's pretty much a Kubernetes cluster, but it takes away the burden from you. You just say here is my script, like scoring script, and here is my model, and I want to put that on the cluster, and I don't want to care about how this would be scaled, and those kind of things. So that is pretty much what is inside the workspace.
And then there are two tools for a no-code experience, AutoML and Designer, and we'll start by looking at them. So the takeaway from this picture is that there are many different styles of using Azure ML. It's not a tool where you have just one way to use it. You can, for example, use it just from Jupyter notebooks, run your models, and then log the results. Or you can schedule experiments on the cluster. You can do parallel training on the cluster, you can do hyperparameter optimization on the cluster. You can use it for setting up the MLOps lifecycle. So, normally, the data scientist only works at the beginning of this machine learning process. Then, once the model has been trained and you know that the problem is solvable, you normally want to set up a kind of workflow where, when new data arrives, the model is automatically retrained, and everything is versioned and automatically deployed — kind of an analog to DevOps. This is called MLOps, and Azure Machine Learning allows you to set up this kind of MLOps lifecycle as well.
So, let's start with some practical things. First of all, if we want to start playing with Azure Machine Learning, we need to create an Azure Machine Learning workspace. In Azure, you can normally achieve things using several approaches. You can do everything from the command line — there is the so-called Azure CLI, where you can issue a number of commands and it will do everything for you. But that is a less intuitive approach, because you need to know what to type, and the Azure CLI has quite a lot of different syntactic options which you need to take care of. The second way is to use the Azure portal, and I think that's what we will use in our case. So let's start by going to portal.azure.com. That is the place where you can create different objects in the cloud. If you want to follow along, feel free to do that, and feel free to tell me in the chat, or interrupt me, if I'm going too fast, because I will probably tend to go too fast. So here is the Azure portal. The Azure portal lists all the resources that you have in the cloud, and that's why I have a lot of different things here. What you need to do is create a new machine learning workspace. The way you do it: you say create a resource, and here you can type "machine learning". Or you can browse the catalog — there are really a lot of things you can create there — and go to the AI + Machine Learning category, even though it doesn't show the machine learning workspace in the top 10; if you click "see all" you will be able to find the Azure Machine Learning workspace somewhere. So the best thing is to search. I say I want to create a machine learning workspace, and it gives me the option to create the workspace — it's called just "Machine learning". So you click here, you see Create. Here you need to specify a few options, not too many. First of all, the resource group. A resource group is kind of a directory inside the cloud. In the cloud you will probably have different projects you are working on, and to be able to separate resources between the projects you use resource groups. So here you would probably want to create a new resource group: you say create new and type in some kind of name. Then you need the workspace name — it can be anything, for example "AzureMLWork". And the region. The region is the location of the data center, and you want to select the one which is closest to you. I typically use North Europe or West Europe; they are pretty much equally close to me — I am located in Russia, in Moscow, so any European location works. If you are, for some reason, not in Europe, then you can select any data center you want. And then you click review plus create, and create. I will not do that because I already have a couple of workspaces created in advance, just to save time. But you can go ahead and do that, and I will make sure that I don't go too fast while it's creating for you. So, I hope everything is clear so far.
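If you prefer to do this step from code rather than the portal, here is a rough sketch using the Azure ML Python SDK (azureml-core); the subscription ID, resource group, and names below are placeholders you would replace with your own:

```python
from azureml.core import Workspace

# Create (or get) a workspace programmatically. All values below are
# placeholders — substitute your own subscription, group, and names.
ws = Workspace.create(
    name="azureml-work",                  # workspace name
    subscription_id="<your-subscription-id>",
    resource_group="azureml-workshop-rg",
    create_resource_group=True,           # create the group if it doesn't exist
    location="northeurope",               # pick the region closest to you
)

# Save a config.json locally so you can reconnect later with Workspace.from_config().
ws.write_config()
```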
3. Azure ML Workspace and Compute
I have a question here: are all of the services and functionality that we will use today supported in all regions? If you create a machine learning workspace, the interface will only offer you the regions where it is supported. Prices for virtual machines differ from region to region, so you can select a region based on virtual machine pricing. Once the workspace is created, you can access it through the Azure portal page for the machine learning workspace. From there, you can download a config.json file to connect to the workspace programmatically. In the workspace, you have access to storage, datasets, compute, experiments, and more. We will start with the first of two tools that require no coding experience: AutoML, applied to the Titanic dataset. AutoML tries all possible models to find the best one, making it suitable for relatively simple problems with small datasets. Before training automated ML, you need to define the compute, which includes compute clusters and compute instances. Compute instances are like virtual machines, while compute clusters are used to schedule experiments.
I have a question here: so, are all of the services and functionality that we will use today supported in all regions? Well, if you create a machine learning workspace, it will only give you those regions where it is supported here in the interface. So I presume there might be regions where it is not supported. I'm not 100% sure — it should be supported everywhere, but I would probably not guarantee that. But in this list you have the regions which are supported, so you can select any one from this list. I was surprised to learn, by the way, that prices for virtual machines, for example, differ from region to region. Not too dramatically, but they do. So sometimes, if you want to save on training, you can also select the region based on virtual machine pricing.
Okay, excellent. So we are creating the workspace, and in the meanwhile I will show you how it looks once it has been created. Once it has been created, it will be available for you as a resource, and you will be able to see a page like this. This is the Azure portal page for the machine learning workspace. And here is a thing which is a little bit confusing: for machine learning, there is also a separate Machine Learning Studio. So normally, even though you have some options here — there used to be more options here, by the way — once you create this workspace, what you normally do is say "launch studio", and it will open a separate web page with a separate interface like this. I will go to the home page — like this. And this is the Azure Machine Learning workspace, in the Azure Machine Learning Studio interface. The only thing that you might want to do from this Azure portal page is download the config.json file. If you want to do something programmatically with the workspace — for example, you want to schedule an experiment to run on the cluster, but from your computer: you have a local Python environment and you want to schedule the experiment to run somewhere — then to connect to this workspace you need a few pieces of data. Basically, it's the name of the region, the name of the workspace, and also some kind of access key. This config.json file contains everything you need to connect to the workspace programmatically. So if you want to play with this workspace locally, from the Python SDK, you would want to download this file and keep it somewhere on your local PC. Otherwise, you won't need this page anymore. You switch to Machine Learning Studio, and everything is located there. I made this interface too big, and that's why it doesn't show the menu — yeah, like this. So here you have everything I mentioned in the workspace: you have storage, datasets, compute, experiments, and so on. Everything is here in the left menu, and you can start playing with this workspace.
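As a quick illustration (assuming you downloaded config.json into your working directory and have azureml-core installed), connecting to the workspace from a local Python session looks roughly like this:

```python
from azureml.core import Workspace

# Reads config.json from the current directory (or a .azureml subfolder)
# and authenticates interactively in the browser if needed.
ws = Workspace.from_config()

print(ws.name, ws.resource_group, ws.location)
```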
So let me show you what we will be doing. We will start with two tools which require no coding experience. Suppose you just want to solve some business problem; we will start with a well-known problem, the so-called Titanic dataset. You know the Titanic — the movie and the disaster itself — and there is a dataset of the people who were on the Titanic. So you basically have an Excel table, a CSV table, which lists all the passengers who were there on the Titanic, and our goal will be to predict the probability of surviving the Titanic catastrophe based on the data. For that we will use a tool called AutoML, Automated ML. The idea of Automated ML is pretty simple: we take the data and we say, I want to train the best model, and to find the best model we try pretty much all the models that there are. We just try to train all possible models and see which one works better. That is not the best way to spend computing time, because training all possible models requires a lot of resources, and a data scientist would probably know which model would work better. But if you have a relatively simple problem — and Titanic is a relatively simple problem, and the dataset is not too large, it's only about 1000 people, I think, 1000 rows — so a small dataset, it will be able to try all possible models in about an hour and find the best solution. And it doesn't require any experience from you. So let's start with this automated ML approach. Do I understand correctly that most of the people who want to follow along have the workspace created? Great. Before we start training automated ML, we need a couple of things. We need to define the compute. Compute defines the resources, basically the virtual machines, which you will be using to train your models. There are two different types of compute — it's not really clear why — there are compute clusters and compute instances. A compute instance is pretty much a virtual machine: once you create it, you can start a Jupyter notebook on it and do whatever you want. So a compute instance is for the traditional, very straightforward approach: you create a machine, you start a notebook, and you start typing your Python code. A compute cluster is something you use to schedule experiments on. A compute cluster is basically... well, of course, behind the scenes it's still virtual machines, but it's a kind of bare-bones virtual machine to which you deploy a container with all the software, and it runs this container and produces some result. So in our case, for AutoML, we need a cluster. You say: I want to create a cluster. The cluster is restricted to the region which you used when you selected the workspace, and that's quite normal because you probably don't want to change the region in between. Then you can select the virtual machine priority.
4. Azure VM Types and Cluster Creation
In Azure, there are low priority VMs that are cheaper and suitable for saving on training. Precautions must be taken to ensure that training scripts are structured to be resumable. For our tasks, we will use CPU and select a suitable virtual machine type. The cluster can have zero nodes, automatically scaling down when not in use. You can also enable SSH access if needed. Free virtual machines are not available in Azure, but you can minimize costs during the free trial. After creating the cluster, the next step is to define the dataset. The Titanic dataset is available on the Internet, and you can create a dataset from web files by specifying the URL and name.
And priority is an interesting thing, because in the cloud there are normally machines which nobody uses, but at any moment somebody may want a virtual machine. Those machines which are not used, but which can be taken away from you at any time, are called low priority. And they are cheaper — something like five times cheaper. So if you want to save on training, you can use low-priority VMs. But you need to make sure that if such a machine is stopped, you can continue training the model on the next instance. You need to take certain precautions to make sure your training script is structured in such a way that it is resumable — it saves checkpoints somewhere, those kinds of things. This is just for you to know; we will use a dedicated virtual machine at this point, just because it's easier. You can use low priority if you decide to train big models — then you would probably want to do that. Then, of course, you can choose CPU versus GPU. For our tasks we will need CPU. We will not train difficult things during these two hours, but if you want to train something cool, you would use GPU. If your models are small, you would use CPU. Then you can select the type of the virtual machine. You can see the RAM, the storage amount, and the cost here. I used the Standard DS3 v2, something like this, which you can use for anything. You can use a 2-core machine; I think that's also good. Select the one that you think is good. There are a few machines which you cannot select because they have too many cores — they are too powerful. If you ever need those, you would need to do another step: you go into the Azure portal and say I want to increase my quota, and they will increase the number of cores that you can use. This is done on purpose, so that people who do not know what they are doing cannot do it; those who know what they are doing will go and request it explicitly. So, yeah, select something. Let me check which kind of machine I have here — I have already created the cluster, so I don't have to wait. That is the one I was using today, so you can use Standard DS3 v2 as well. You will not be paying a lot, but after we finish this training, please do not forget to stop whatever you have started, so that you are not paying for it. The good thing about the cluster is that a cluster can have zero nodes. So when you create a cluster — let me show you the interface again — let's say I selected the virtual machine type. When you click next, you also select the name of the cluster, and then you select the minimum number of nodes and the maximum number of nodes. It makes sense to set the minimum number of nodes to zero, so that the cluster can scale down automatically: if you are not using it, it will shrink to zero and you will not be paying for it. If you submit a job on the cluster, it will create those machines for you, run the job, and scale down again. So it depends on your working style: if you are submitting a lot of small scripts — if you are debugging and you want to submit a new version of your script every 15 minutes — then you probably want the minimum number of nodes set to one, so that you always have the cluster ready, because the spin-up takes some time. If you set it to zero, that's good because it doesn't cost you anything, but you need to make sure to delete it if you are not using it, so that you're not paying for it.
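A minimal sketch of the same cluster creation done from the Python SDK (assuming azureml-core; the cluster name and VM size here are just examples):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Define a small CPU cluster that scales down to zero nodes when idle.
config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_DS3_V2",   # CPU machine, fine for Titanic-sized problems
    vm_priority="dedicated",     # or "lowpriority" to save money (may be preempted)
    min_nodes=0,                 # scale to zero so you don't pay when idle
    max_nodes=2,                 # at most two nodes for concurrent runs
)

cluster = ComputeTarget.create(ws, name="my-cluster", provisioning_configuration=config)
cluster.wait_for_completion(show_output=True)
```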
So that's pretty much it. There are also some options, like enabling SSH access, but we will not be using that today — though if you want, you can, so that you can access the machines individually. In terms of free machines, you do not have free virtual machines in Azure, but when you get the free trial you have some amount of money, like $200 for a month, so you would still be spending that money in this experiment. You probably want to minimize the costs anyway, even though it's a free trial, so that you have money left to do some more experimentation. Okay, so then you say create, and you create this cluster, and it will be created pretty fast, because if you say that the minimum number of nodes is zero, then you don't even need to spin up the virtual machines — it's created almost instantly. If you say that the minimum number of nodes is one, then it will be creating the cluster for a few moments. So, that is the cluster. The next thing that we need to do is to define the dataset. The Titanic dataset is available on the Internet. So what we will do here is go to datasets. As I mentioned, a data store is where you put the data, and a dataset is where you define the structure on top of it. I already have the data somewhere on the Internet, so I don't even have to worry about putting the data here. I just say I want to define the dataset, so I say create dataset from web files. Here I put the URL — I will paste it in the Zoom chat, so in the Zoom chat you can find the URL for the dataset. Let me just make sure that it is correct. Yeah, here it is: the CSV file with all the passengers of the Titanic. It's pretty much a CSV file sitting at some URL. So I put the URL here, then I specify the name, then the dataset type.
5. Dataset Type and Creation
Dataset type can be either tabular or non-tabular. In our case, it's tabular. The file format is delimited with a comma as the delimiter and UTF-8 encoding. The column headers are taken from the first line of the file. The dataset provides a sample and automatically determines the types of the columns. Changes can be made to the dataset, such as modifying the data types and importing or excluding specific fields. The dataset creation process includes creating metadata and performing statistical checks on the columns. The dataset can be explored and profiled to view column distributions and statistics. Python code is provided to access the dataset for model training, even when training locally. The code includes all the necessary details, such as subscription ID.
The dataset type can be either tabular, if it's a table, or file, if it's just a collection of files, like images or something like that. In our case, it's tabular. And that's it — we say next. What happens here is that it says "parsing dataset info": it tries to access the dataset and see what kind of data is there, and it tries to figure out the encoding and everything. So the question is, how do I know if it's tabular or not? If you just have the URL, you would probably have to look at it elsewhere. The way you access tabular datasets and non-tabular datasets is different. If you have a non-tabular dataset — for example, a large collection of images — you can mount it as a directory. If you have a tabular dataset, you can download it as a data frame in Python. So you probably would know if your dataset is tabular or binary, non-tabular. I say that the file format is delimited. It also supports different other formats: it has the JSON Lines format, if you have it structured as JSON, and it can be a number of files in the Parquet format, which is typical for big data storage. In our case, it's just delimited. The delimiter is a comma, the encoding is UTF-8, and then column headers. I can say "use headers from the first line", and that's what I will use, because in my file the first line contains the names of the fields. So here it says pclass, survived, name, sex, age, and so on. In our case, I want to use this first row as names, and then I do not skip rows. So I just leave it like this, and you can see it gives me a sample from the file here. I can make sure that the file is decoded correctly. It even figures out the types: it says survived is either 0 or 1, and it knows that it's a numerical field, which is kind of nice — you don't have to specify it manually. Then we say next. Here we can make some changes to our dataset. For example, it knows that pclass, the class of the passenger, is an integer: it's 1 to 3, first, second, or third class. But if we want, we can change the type. Survived is also an integer, 0 or 1. Name is a string, sex is a string, and so on. We can import all the fields, or if we don't want some fields — suppose we don't want to know where the person was going, the home destination, because the probability of survival probably does not depend too much on the destination — if I don't want to import this, I can just clear this checkmark. Then I say next, and I say create. It takes a little bit of time to create the dataset, because what it actually does is create some metadata for it and run some statistical checks on the columns. Once you open it, you can, for example, say explore, and you can play with it a little bit. You see that it shows me all the columns, it shows me the preview — the first 50 rows of 1300. Then you can say profile, and it will show me some statistics, like the distribution of the class. You can see that 700 people went third class, 251 second class, and 316 first class. Then the probability of survival: 489 people survived and 800 did not, and so on. You can see statistics, gender distribution, age distribution, different kinds of things. What you can also do here, if you want to train your model using Python: you can go to consume, and it gives you the Python code to access this data. You basically take this code, put it anywhere, and it will get you the data frame from this dataset.
Even if you're training locally on your machine — you see it contains all the details, like the subscription ID and so on — you just grab this code and use it. That's quite useful.
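A rough sketch of what that consume code boils down to (assuming azureml-core with the pandas extra installed; the dataset name and URL below are placeholders, since the workshop's actual URL was shared in chat):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Option 1: define a tabular dataset directly from a CSV file on the web
# and register it in the workspace (the URL is a placeholder).
titanic = Dataset.Tabular.from_delimited_files(
    path="https://<some-host>/titanic.csv"
)
titanic = titanic.register(ws, name="titanic", create_new_version=True)

# Option 2: get an already registered dataset by name.
titanic = Dataset.get_by_name(ws, name="titanic")

# Load it as a pandas DataFrame for local experimentation.
df = titanic.to_pandas_dataframe()
print(df.head())
```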
6. AutoML Experiment Setup and Featurization
To submit an AutoML experiment, go to automatedML and create a new automatedML run. Select the Titanic dataset and specify the target column as 'survived'. Choose the compute cluster and set the type of task as classification. Deep learning models can be enabled, but it's not recommended for small datasets like Titanic. Featurization settings allow you to select the features/columns to use for the model.
Okay, so now we have everything for submitting an AutoML experiment. We go to Automated ML and we say new Automated ML run. Which dataset do we want to use? We want to use one of the Titanic datasets. Then we click next. Experiment name: we want to create a new experiment for this and set up the name; I would say titanic2. Then what do we need to specify? We need to specify the target column — what we want to predict. In our case, we want to predict the probability of survival, so we select "survived" from this list. This is the column we want to predict. And which cluster do we want to use for compute? We select the cluster here. We can also create the cluster from here, by the way, but that's probably not the most convenient way; I always prefer to create the cluster beforehand, so that I have complete control of all the parameters. Good. So we defined those parameters. Then we need to select the type of task. In machine learning, there are a few different types of problems. In our case, we want to classify all passengers as either survived or not survived. This is called a classification problem, and classification is probably the most common one. When you want to predict the class of an object — for example, whether a handwritten digit is 1, 2, 3, 4, or 5 — that's a classification problem. Someone says their cluster is unusable. Why would the cluster be unusable? Let me think. Did you create the cluster using the compute cluster option — not compute instance, but compute cluster? That can be one problem, because a compute instance is just a virtual machine. You see, in my case it's called main compute, and it would not give me this main compute as an option in Automated ML; it only gives me the compute cluster, which is called my cluster. So you need to create a compute cluster here. If it's a cluster, maybe it's unusable because it has not been... it's actually hard to see. No, you don't need to start the cluster; if it succeeded with zero nodes, that's okay. What you can try to do in the meanwhile: set the minimum number of nodes to one and see if it starts up the cluster or not. Okay, a little slower. Okay, yeah. Somebody says that if you set the minimum number... yeah, great comment: the maximum number of nodes should be at least one. That is definitely true. The minimum and maximum number of nodes define how small the cluster can be and how many nodes it can create. So if you set the maximum number of nodes to zero, it means it would not be able to create any machines, and the cluster would not be usable. So in our case, we need to set the minimum number of nodes either to zero or one. One is probably safer, but then you must not forget to either delete the cluster or set it back to zero after the workshop; otherwise you would be paying for a running virtual machine. And the maximum number of nodes can be one or two or more — but don't set it to more than 4. Good, so it seems to be working for everyone. So let me start again: I want to do the AutoML run. I select the Titanic dataset. Then I create a new experiment, titanic2, the target column is survived, and the compute cluster is my cluster. Then I go to next. I see there are some more questions — I will answer them when I finish this definition of the experiment. So this is a classification problem. There is also a regression problem: regression is when you want to predict a number.
Suppose you want to predict the age of a person based on the products he buys, and you have this table of products. Predicting the age would be a regression problem, because you need to come up with a number. Then there is also time series forecasting: if you want to predict a currency value, that would be time series forecasting. In our case, it's classification. For classification, we can also say enable deep learning models. Normally, if you have little data — and this Titanic dataset is pretty small — you probably don't want a deep learning model. Deep learning means a neural network, and deep learning is also a little bit less interpretable. But sometimes you want to do that; you can experiment with that as well. If you enable deep learning, there will definitely be more models to test. Now there are also a few additional settings here at the bottom. Let's start with view featurization settings. The featurization settings allow you to select which features you want to use for the model. Features are pretty much columns: which columns you want to use and which columns you don't want to use.
7. Featurization and Configuration
Do you want to use the name of a person to predict if they will survive on the Titanic? Probably not. Names would all be different and not have predictive power. AutoML can figure this out and avoid wasting time on names. Instead, focus on relevant features like gender, age, and class. Other important features include the number of siblings or spouses and parents or children on board. You can customize the feature types and impute missing values. Save the settings and explore additional configuration options.
So let's try to figure this out. Do you want to use the name of a person to predict if they will survive or not? Probably not. Because if you use the name for prediction, you would not be able to apply this model to yourself — you want to see whether you would survive on the Titanic, and you don't want the model just to remember names. Also, the names are all different, so they would not have predictive power. AutoML would probably figure out that the name is not useful even if you don't tell it, but if you tell it not to use the name, it will be much faster; it will not waste time on that. So normally you would want to go to this featurization panel and uncheck the boxes which are not relevant. You don't want home destination, you don't want body, you don't want cabin, you don't want fare, you don't want ticket. What you want is gender, age, and class — definitely, because class is probably the most important feature. And then there are also two other features: sibsp, which is the number of siblings or spouses on board, and parch, parents or children — do they have any parents or children on board? Those might be important because, for example, if a person has a child on board, he would probably want to wait for them, and that might affect the probability. Then you can also change the feature type: the feature type of pclass would probably be numeric, gender would probably be categorical, and so on. And you can also select how to impute — what to do if the value is missing. For example, for some people the age is not filled in the table. What do we do in this case? Do we want to use the mean value instead of the missing value, or do we want to fill it with some kind of constant? We can try different settings here. So after I've done that, I can say save. And then there are additional configuration settings here.
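For reference, roughly the same kind of run can be configured from the SDK with an AutoMLConfig. The sketch below is an outline under assumptions (azureml-train-automl-client installed, a registered dataset named "titanic", a cluster named "my-cluster"), not the exact code the portal generates:

```python
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
titanic = Dataset.get_by_name(ws, "titanic")   # registered tabular dataset

automl_config = AutoMLConfig(
    task="classification",            # survived / not survived
    primary_metric="accuracy",        # metric used to rank models
    training_data=titanic,
    label_column_name="survived",     # the target column
    compute_target="my-cluster",      # the compute cluster created earlier
    experiment_timeout_hours=1,       # exit criterion: stop after an hour
    max_concurrent_iterations=2,      # matches a two-node cluster
)

run = Experiment(ws, "titanic2").submit(automl_config)
run.wait_for_completion(show_output=True)
```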
8. Configuration and Experiment Results
There is a question about whether the cabin number would count in the model. The value for cabin is different for each passenger, so just the number on the door, without any additional meaning, would probably not be usable. Additional configuration settings include selecting the primary metric, explaining the best model, blocking algorithms, the exit criterion, and concurrency. The stopping criteria cannot be defined based on epochs. After defining the parameters and clicking finish, the experiment is scheduled. The obtained accuracy for the run is 0.78. The best accuracy achieved is 0.81, with the Voting Ensemble algorithm. Other models include combinations like MaxAbs scaler and LightGBM. Each model can be explored in more detail, including metrics and outputs.
Yeah, so there is a good question, whether the cabin would count or not, because the cabin defines where you are in the ship. But the value for cabin is different for pretty much every passenger: it's an individual room number, different across all passengers. So while the cabin would be important, you would need to provide some more information, like which part of the ship it is in. If you divided all cabins between three or four different locations, that would be very useful; that would improve the model. But just the number on the door, without any additional meaning, would probably not be usable, because it's different for everyone.
Yeah, and there are also a lot of null values for cabin. So let's view additional configuration settings. Here we can say what our primary metric is — which model would be considered the best: based on accuracy, based on area under the curve; there are different possible metrics. Let's stick with accuracy as the simplest one, but depending on what you want, you can do it differently. Then you can check "explain best model"; let's leave it checked, and we'll see how it can explain the best model. Then blocked algorithms: by default it tries all possible algorithms, but if you know that some of them will not work, or you don't want to test them, you can blacklist them here. Then the exit criterion: when do you want to stop? You can either use training job time — after three hours of training, stop and select the best model found so far — or you can use a metric score threshold: if you get above 90%, you can stop. Validation: there was a question in the chat about how it validates the metric, and you can choose different types of validation here. You can do a train-validation split, where you split the data into two parts, train on one part and validate on the other, or you can use other methods like k-fold cross-validation and so on. Let's just leave auto. And concurrency: how many concurrent iterations? I have a cluster with two nodes, and that's why it suggests two concurrent iterations. There is a good question: can we define the stopping criteria based on epochs? Probably not, because epochs are not always applicable — if you are not training a neural network, if you are training, say, a KNN classifier, you don't have the notion of an epoch there. Okay, so I have defined the parameters and now I click finish. When I click finish — you can click finish — it schedules the experiment. I will not do it because I have already run the experiment; it takes some time to run, and that's why I did it beforehand. I will just show you how the result looks, but you can start the experiment now and then you will be able to see the result when we finish. So when you start the experiment, you can go to experiments and you will see this Titanic experiment here. You select it. I have done two runs of the Titanic experiment; one was less successful — I messed up the parameters — and the other one is the good one. You see it took something like 32 minutes for this run. What you see here is the obtained accuracy, which is 0.78, the best accuracy for this run. AutoML is scheduled as one experiment, but this experiment has so-called child runs. If I click "include child runs" here, it includes all child experiments as well. You see there are a lot of individual sub-experiments, each of which took something like 30 or 40 seconds, and each of those experiments tried its own algorithm. So if we go to this run 54, we can go to models. Models show us the result. Here you have the accuracy, and all the models are sorted by accuracy. The best accuracy we achieved is 0.81, which is quite good — 81% accuracy at predicting the probability of survival. And you see the algorithm name is Voting Ensemble, so it has several sub-models which contribute to the overall result. An ensemble is an often-used technique to improve accuracy: you train several models.
Each of them gives some kind of result, and you select the most frequent result. You can see that the other models include combinations, like a MaxAbs scaler followed by LightGBM. The scaler is important to scale the input data so that, for example, age, which goes from 0 to 80, and class, which goes from 1 to 3, are of comparable importance. That's why you do the scaler first and then something like LightGBM. For each model you can go deeper: you can click the model, go to metrics to see all the possible metrics — accuracy, F1 score, and so on — and go to outputs. The way each training runs is that, of course, behind the scenes there are Python scripts for each model.
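To make that concrete, here is a small illustrative sketch (my own example, not the code AutoML generates) of what a "MaxAbs scaler + LightGBM" combination looks like as a scikit-learn pipeline, assuming scikit-learn and lightgbm are installed and a local Titanic CSV with these columns:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

# Placeholder path; assumes the CSV has the columns used below.
df = pd.read_csv("titanic.csv")
X = df[["pclass", "age", "sibsp", "parch"]].fillna(0)
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features to comparable ranges, then fit a gradient-boosted tree model.
model = Pipeline([
    ("scale", MaxAbsScaler()),
    ("clf", LGBMClassifier()),
])
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```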
9. Running and Deploying the Model
And those Python scripts train the model and save it into the outputs directory. You can download the trained model file and use it in your own project. The explanation feature in Azure ML allows you to understand the important criteria used by the model. Gender is the most important feature, followed by class, age, and siblings/parents. Now, let's move on to the entry-level tool called designer. It allows you to draw diagrams and define workflows using the notion of pipelines.
Those Python scripts train the model — you can see the output of the script here; everything that happened can be tracked — and the script saves the model itself into the outputs directory here.
So if you want to use this model in your own project or in production, you can download the trained model file here in the output. This model.pkl is the saved model. And there is even a scoring file — a scoring Python file which uses this model. You can grab this file and use it as a sample of how to run the model, right?
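As a rough sketch (assuming you downloaded model.pkl and have a compatible scikit-learn/AutoML runtime installed locally — version mismatches can break unpickling), using the downloaded model looks something like this:

```python
import joblib
import pandas as pd

# Load the model that was downloaded from the run's outputs folder.
model = joblib.load("model.pkl")

# Score a new passenger; column names must match the features used in training.
passenger = pd.DataFrame([{
    "pclass": 3, "sex": "male", "age": 25, "sibsp": 0, "parch": 0,
}])
print(model.predict(passenger))
```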
So that's how you run the model. Or you can click deploy — you say deploy, and it will be deployed somewhere on some compute. The interesting thing here is that once I have this list of models — let me go back to the list of models — the top model has a "view explanation" link. Explanation is a very cool feature of Azure ML. It allows you to make sense of the model. For us, the accuracy of the model is not the most important thing; we would rather want to know what the most important criteria were — what the model uses to predict this probability: for example, is gender more important than age, or is age more important? That is interesting, and that is exactly what we can see in the explanation.
Explanation is a whole different topic — I could talk about explanations for a long time; there are black-box and white-box explanations. But just briefly, let's see what we have here. You can see that the most important feature is, in fact, gender. Whether the person is male or female is the most important thing. Unfortunately, we cannot see from here which gender results in a higher probability of survival — I don't think we can figure that out easily; probably we can somehow, but it's not obvious how. The second is class — one, two, three — and the third one is age, and then siblings and parents. So that makes sense, I think, right? Good, so that was explanation, and I think that's it for AutoML. Any questions on AutoML? We spent a lot of time on AutoML — I thought we would go faster, but given that there are a lot of people who want introductory-level material, that's good for you. So let me show you one more entry-level tool, which is called Designer. And the way I will show it to you is very quick: I will use an existing example here. If you go to Designer, you have "show more samples" at the top. If you click it, you see a lot of sample diagrams. Let's start with this multi-class classification, letter recognition.
This is one very common task of recognizing handwritten digits. There is a dataset called MNIST, which comes from the (Modified) National Institute of Standards and Technology. Let me show you how it looks. Yeah, so it's a dataset like this: you have handwritten digits in the shape of twenty-eight by twenty-eight grayscale pixels, and there is a dataset of something like sixty or seventy thousand of those digits. Let's try to see how we can train a model to recognize which digit it is. This is a good dataset because nothing can go wrong — it always gives you some kind of prediction; I think you cannot go below seventy percent accuracy on this dataset. So let's see how we can train this using Designer. I click here and it gives me an existing, already-defined diagram to do the training. Let me try to show it to you in a little bit more detail. There is a question in the chat: would I recommend using AutoML in real-world scenarios?
With AutoML, the problem is that it takes a lot of computing power. For the simple Titanic problem, a single model would be trained in one minute, and it took 30 minutes to try all the models. So if you have the computing power, and the problem is not too difficult, then of course AutoML should be your first choice. It will give you a baseline — how easily you can achieve a certain accuracy. It gives you an understanding of whether there is a predictive model at all; for example, if AutoML gives you around 50% accuracy on classification, it probably means that the problem cannot be easily solved. So this is definitely a good way to start, but if the model gets more complicated, it might require more computing power, and at some point you would have to decide whether it's cheaper for you to bring in a data scientist or to try all possible combinations of models. Are there tools to run AutoML on your own system? Not inside Azure ML, because the whole idea of Azure ML is to give you cloud tools to solve everything. Theoretically, of course, there are different tools to do that; one of them, if you are using .NET, for example, is ML.NET, and ML.NET supports local AutoML. It's a somewhat different AutoML, but the idea is the same. Good, so let's look at the Designer. The Designer allows you to draw diagrams like this. Behind these diagrams there is a notion in Azure ML of a pipeline. I will not go too much into the notion of pipelines today, because it is good for large-scale tasks, but the idea of pipelines is that you define the steps of a workflow which produce data, pass that data to the next step, and so on.
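For those curious, here is a very small sketch of what such a pipeline looks like in the SDK — a hypothetical two-step example, assuming azureml-pipeline-core/steps are installed and that "prepare.py" and "train.py" are your own scripts:

```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Intermediate data passed from one step to the next.
prepared = PipelineData("prepared_data", datastore=ws.get_default_datastore())

prep_step = PythonScriptStep(
    name="prepare data",
    script_name="prepare.py",            # hypothetical script
    arguments=["--output", prepared],
    outputs=[prepared],
    compute_target="my-cluster",
    source_directory="steps",
)

train_step = PythonScriptStep(
    name="train model",
    script_name="train.py",              # hypothetical script
    arguments=["--input", prepared],
    inputs=[prepared],
    compute_target="my-cluster",
    source_directory="steps",
)

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
run = Experiment(ws, "pipeline-demo").submit(pipeline)
```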
10. Using the Designer to Create a Pipeline
The designer defines a pipeline with predefined steps. Import data, split data, train models, score models, and evaluate models are the main steps. The experiment can be submitted and the results can be viewed, including a confusion matrix.
So this Designer in fact defines this kind of pipeline, but it also contains some predefined steps for you. In my case — how do I want to check if I can predict the digits correctly? — I take the data, and the first block here is Import Data. It's kind of a dataset: I specify the URL here, and it gives me the data as its output. Then I use the next block, which says Split Data. I want to split the data between a training and a test dataset. I can specify the percentage of the split — let's say I want 80% to go to one arrow and 20% to the other. Then I put this split data as input to Train Model. And there are two Train Model blocks here: I want to train two different models, so I put the same data into two Train Model blocks. Into one I put a two-class support vector machine converted to multi-class as the model, and into the other a multi-class decision forest. So in this diagram, I want to compare two different models: which one will perform better?
So yeah, there is a question: can I show how to create those diagrams? Sure — on the left you have all possible blocks. You go and select, for example, the Titanic dataset, you drag it here, and you have this kind of data source. Then, if you want to split, you select the block for splitting — you can search here by name, say "split", and it will give you Split Data. You drag it here, and when you have two blocks, you just draw the arrow like this and connect them. So it's pretty easy to set up those diagrams yourself. The reason I start with the pre-existing one is that I don't want to spend too much time on this.
Yeah, so I split the data and I trained two models. Then I have two more blocks which say Score Model. Into those Score Model blocks I put the trained model as one input, and the test dataset as the other — when I split, the left output is the training dataset and the right is the test dataset, so I put the test dataset into Score Model. Then I put Score Model into Evaluate Model. The scored model is basically a table which says the predicted class is one and the real class is also one — a lot of rows like this — and Evaluate Model collects the statistics on top of that; it gives you the metrics. Once you have done all that, you say "I want to submit this". It tells you to select a compute target, so you select the cluster here, and that's it: you say submit, then you need to select the experiment, and click submit again. I will not do that because I have already done it, but you can go ahead and do that, and I will show you how the result looks. Once your experiment has finished, here is this MNIST Designer experiment. It took about 15 minutes to train — well, 15 minutes doesn't mean that the whole 15 minutes is spent on training. One very important thing here: when you start the experiment for the first time, it needs to create a custom container for you. Especially if you do your own training with your own Python script and your own set of libraries, it takes some time to create this container, which can be time-consuming, like five minutes. But once you have done it for the first time, the second time, when you make some small changes, it does not recreate the whole container; it just deploys the changed script, so it becomes much faster. The long first execution should not fool you — it should not create the impression that it's slow or bad; it's not. It just takes some time to run the experiment for the first time. So once you run this experiment, let's go to this run 1. It shows you this diagram, and in this diagram you can go to the final block and check the results. How do I do that? I should be able to click here and see the result, or maybe I should go to outputs... Yeah, so each block is represented by its own kind of experiment. So here I can go to the last block, and the last block gives me a graphical representation of the results. My screen is unfortunately very small and difficult to see, but it gives me — let me make this a bit smaller — the confusion matrix. The confusion matrix shows you which classes were classified correctly. Because my problem was classifying digits, I have 10 classes, each digit from 0 to 9. The diagonal means that most of the digits were classified correctly: if I give it the digit 1, it classifies it as 1. But some of them — those light blue dots — mean that I gave it the digit 7, for example, and it was classified as 1, or vice versa. So this is called a confusion matrix.
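Here is a tiny sketch of computing the same kind of confusion matrix yourself with scikit-learn (my own illustration, assuming you have true labels and predictions from any 10-class digit model):

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# y_true are the real digit labels, y_pred are the model's predictions.
y_true = [7, 1, 0, 7, 3, 1, 7]          # placeholder values
y_pred = [7, 1, 0, 1, 3, 1, 7]          # one 7 misclassified as 1

cm = confusion_matrix(y_true, y_pred, labels=range(10))
ConfusionMatrixDisplay(cm, display_labels=range(10)).plot()
plt.show()
```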
11. Using Designer and Running Python Models
In Designer, you can see confusion matrices and evaluation results, which are in a somewhat difficult-to-understand format. Designer is an improved version of the earlier tool called Azure Machine Learning Studio Classic. It is more scalable and allows for larger datasets. You can use Designer to bring together bigger pieces of your own pipeline. To train models written in Python, you can schedule them as experiments on the cluster from Visual Studio Code. If you want to learn more about Azure ML, you can refer to the blog series on starting with Azure ML. You can clone the GitHub repository to access sample code for training. Now, we want to switch to a different workflow where we combine the flexibility of creating experiments with running things on the cluster and our local Python development. We have a script called trainUniversal.py that downloads the MNIST data and runs logistic regression on it. This script can be run locally if Python is installed.
And a confusion matrix is a good way to check what is wrong with the model and how good it is. Because we have the left model and the right model, it gives me confusion matrices for both models, and I can see which model I like better. It also gives me the evaluation results. The evaluation results are unfortunately in a somewhat difficult-to-understand format. Yeah, metrics. You can see these metrics here. So for the left model, you can see recall and accuracy, and 0.89 was the overall accuracy for the right model; you can see the same table for it as well. So yeah, you can see all the metrics here. So that was Designer.
Before Designer, there was an earlier tool from Microsoft called Azure Machine Learning Studio Classic, which was very similar to Designer and a little bit easier to use. But it is deprecated now because it did not scale well; it was good for very simple tasks but not for difficult ones. This Designer, even though it looks simple, in fact scales to larger datasets, because each block in this diagram runs as its own step, as its own container, and you can write Python code there. So you can use Designer to bring together bigger pieces of your own pipeline as well. OK, so that was it for simple tools. We played with Designer, and now let's try something more complicated, something more for data scientists. Data scientists want to train their models written in Python.
And the first thing I will show you is how to take your model written in Python and schedule it as a job on the cluster using Visual Studio Code. Visual Studio Code is a very good editor; many people love it, and I hope you already know about this tool. If you want to go deeper into learning Azure ML — because today we have less than half of our time left, and I probably will not be able to cover everything — I have on my site a collection of blog posts on how to start with Azure ML. It starts exactly at the point where I'm starting now: how to start with Visual Studio Code, how to do hyperparameter optimization, how to train generative adversarial networks, and how to train a question-answering model. So if we don't have enough time to talk about everything, you can always have a look at my blog. And I will make this presentation available; I will post the link, probably in the Zoom chat or in Discord tomorrow morning, so that you have access to it.
So what we will do now, if you want to follow along, is clone the GitHub repository. I have the GitHub repository; let me put this address into the Zoom chat. So this repository contains some sample code which we can use to play with: github.com, CloudAdvocacy, AzureMLStarter. This is the whole training for today, by the way. It's also described here: how to use AutoML, how to use AutoML for Titanic, how to use Azure ML Designer, and, where we are now, how to use Azure ML from Visual Studio Code. But to start using it from Visual Studio Code, you need to clone this repository into some directory on your computer. I hope you know how to do that: you can download the zip file from here, or you can do git clone, whichever way you think is easier. Once you do that, you open it in Visual Studio Code. This is Visual Studio Code. Should I wait a little bit? Probably yes. And while I wait, if there are any questions, I would be glad to answer. So what we want to do now is switch to a completely different workflow. What we did before was upload the dataset into the portal, play inside the portal, and get the result. Now imagine we are a data scientist: we used to train everything locally on our own machine, and now we want something better. And what Azure ML offers us, as we have seen so far, is experiments. We have seen that experiments contain accuracy and those kinds of metrics, right? When we ran AutoML, there was an experiment which recorded the obtained results, the obtained accuracy, and so on. Yeah, so the question is, do you need any Python on your local machine? Probably not. I don't think so. Let's not do it at the moment. So now we want to combine this power of creating experiments and running things on the cluster with our local Python development. Here I have the script called trainUniversal.py. What trainUniversal does is download the MNIST data from the internet. I will briefly show the code: it calls fetch_openml and downloads the MNIST data, the digits themselves. Then it takes the first 20,000 digits and runs a very simple logistic regression on top of them — the simplest possible model offered by Scikit-Learn, which is the Python package for doing machine learning. So it trains this model and reports the accuracy. Now, this script can run locally. So if I select this script and I have Python installed — I said that you don't have to have Python installed, but I do — you can, for example, select this and click to run it locally in the terminal.
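For reference, the core of a script like trainUniversal.py can be sketched in a few lines; treat this as an approximation of what the repository's script does, not the exact file:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# download MNIST and keep only the first 20,000 digits to speed things up
print('Fetching MNIST data...')
X, y = fetch_openml('mnist_784', return_X_y=True)
X, y = X[:20000], y[:20000]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=100)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print('Accuracy:', acc)
```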
12. Running the Script Remotely on the Cluster
To run the script remotely on the cluster, you need to install the Azure account and Azure machine learning extensions in Visual Studio Code. Once installed, right-click on the script and select 'Run as experiment in Azure'. Choose to run it remotely, select the Azure subscription and machine learning workspace, and create a new experiment. Select the compute cluster and the Azure ML environment, which specifies the libraries to use. You can also choose to use an Azure ML dataset with the training run. After configuring the settings, a JSON file is generated with the experiment details and run configuration.
And it will give me an error. I think it runs now. Yeah, it runs. I think I have some kind of misconfiguration here, or some problem with my Python, but I hope you trust me that you can run it locally, right? Yeah, somebody tells me I need to use accuracy instead of... where do I need to do that? Here? No, because here is the computed accuracy. Trust me that it is supposed to work. So this is the script which runs on your local machine. And then I made two modifications here: I added "from azureml.core.run import Run" and "run = Run.get_submitted_run()", and this is inside a try block. So if I have the Azure ML packages installed, I get a reference to the experiment run, and if I run this script on the cluster, it gives me that reference. And at the end of the script, I can log the obtained accuracy. So those are the two lines which I added, and I added them inside an if block, so I can still run this script locally. But what I can also do is run this script remotely on the cluster. How do I do that? Well, I right-click on the script and say Run as experiment in Azure. To do that, I need an extension. Visual Studio Code has the notion of extensions. So here somewhere I have the list of extensions, and I have installed a couple of them. One of them is called Azure Account, which is the extension you need to log in to Azure and do everything, and the second one is called Azure Machine Learning. You also need to have this installed. If you don't, you can install extensions from here: type "machine learning" here, it will give you the list of all matching extensions, and you click install for the Azure Machine Learning extension. That is pretty fast. So if you have Visual Studio Code but you don't have this extension, you can install it now. One of them is Azure Machine Learning, the other is Azure Account. Good, so once you have the extensions, next to the script you will have this menu saying Azure ML: Run as experiment in Azure. When I click that, it asks me whether I want to run it locally or remotely. You see, the good thing is that if you have a powerful machine, you can still run your script locally and benefit from logging to the Azure Machine Learning workspace. If you have 10 data scientists working locally, they can log their results centrally in one place, and then you can compare everything there. So you can run it locally, but I will do it remotely. And if you don't have Python installed, you just say you want to run it remotely. Then it asks me which Azure subscription to use; I select my subscription. Then it asks which Azure Machine Learning workspace to use, and I need to be careful to use AzureML Workshop, the one I created for this workshop. Then it asks which experiment I want to use. I already have a couple of experiments from the previous exercises, right? I already have the Titanic experiment and so on, but in my case this is a different thing; I want to create a new experiment, so I say create new experiment. Which compute do I want to use? I want to use my cluster. I already have the cluster, so I select it, but you can create a new cluster from here as well. Good. Then it asks me for the Azure ML environment.
And that is important, because the way this works is that it creates a container to deploy on the cluster, and the container needs to know which libraries you will be using. So for example, if you want to use TensorFlow, you need to specify the correct version of TensorFlow here. In my case I'm using scikit-learn, so I need to select something with scikit-learn. Here it is: AzureML scikit-learn. I select this one. Then it asks, would you like to use an Azure ML dataset with your training run? In my case I download the data directly from the internet, so I'm not using a dataset here, but normally I would probably want to use one. So in my case I say no. And what it does is generate this kind of description for me. This is a JSON file which gives the name of the experiment — and I can, by the way, give it a better name; I'll call it VS Code experiment. Then it defines a run configuration: which version of Python, which versions of libraries. I'm using scikit-learn 0.23. I can change those details here as well, and I can change some environment variables here.
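Before moving on, the two-line logging modification described a moment ago follows a common pattern: guard the Azure ML import so the script still runs locally, and log the metric only when a run context is available. A rough sketch (the transcript mentions Run.get_submitted_run(); newer SDK versions expose Run.get_context() for the same purpose):

```python
try:
    from azureml.core.run import Run
    run = Run.get_context()   # gives the live run when executed as an Azure ML experiment
except ImportError:
    run = None                # Azure ML SDK not installed -> plain local run

# ... training happens here and produces `acc` ...
acc = 0.89                    # placeholder value for this sketch

if run is not None:
    run.log('accuracy', acc)  # the metric then appears on the experiment in the portal
```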
13. Submitting and Running Experiments
I submitted the experiment from Visual Studio code. My current directory is captured and sent to Workspace. Once the experiment is submitted, it creates a docker image and deploys it on the cluster. The data is mounted from the datastore directly to the machine. The logs are streamed back to my computer. The outputs are copied over when the experiment finishes. Different compute targets can be used, such as Azure ML compute cluster, virtual machines, local computer, Azure Databricks, Hdinsight, and data analytics.
But in my case, I don't need to. OK, so it created this file for me, and if I want to submit it, I just click on this green button saying Submit experiment. I submit the experiment, and it tells me it is submitting the experiment to Azure ML. So what it does when it submits the experiment is package all the files in the current directory: it creates an archive and sends it to Azure ML. And inside Azure ML, it packages everything into a container. So let me now click here, View Experiment Run. To view the experiment run, I basically open it in the portal. And instead of opening it like this, I can go to my workspace, which is here, go to Experiments, and here is my experiment, the VS Code experiment. And right now, what's happening with it? It is running, which is kind of good. And what does it do? I can go inside the experiment, check Outputs + logs, go to the logs, and you can see that it's pulling the container from somewhere. What happened here is that there was already an existing container for this configuration, because I didn't use anything fancy — just scikit-learn without any external libraries — so it didn't need to create a new container; it just pulled the default one. And now it does something else, and it will soon start running my script, and I will be able to see the output here. So here, the 70_driver_log is the output of my Python script. You can see it's printing my messages: "fetching MNIST data" is my message, printed by my code. So it is currently fetching the MNIST data from the internet. This data is not very small; that's why it takes a little bit of time. So let me explain again what happens with this experiment submission. I submitted the experiment from Visual Studio Code. You can also do it from the command line: if you are a command-line expert, you can use az ml run submit-script and do the same. So what happens is that I have my computer with the script. When I submit the script, my current directory is captured and sent to the workspace. Why the whole directory? Because I might want to have some additional Python files together with my code, right? That's why it takes everything which is around my script. And that's why it's not a good idea to put big data into the same directory. I don't want to store a dataset inside the same directory, because in that case, each time I submit the experiment, it takes the data and sends it over. So I should not do that. Once the experiment has been submitted, it creates the Docker image and stores it in a Docker image registry inside the workspace, which is like a hidden image registry. Then it deploys it on the cluster. And if I have the data in the datastore, it mounts the datastore. The great thing is that I don't copy the data: when I create the cluster, I don't copy the data to each machine on the cluster; I just mount the datastore directly to the machine. So it mounts the datastore, launches the script, and starts streaming the logs back to my computer. If I want, I can observe the logs on my machine as well. I normally don't do that because it breaks from time to time; I prefer to go to the portal and view the logs there, like this. By the way, what happened with our script? It trained the model and achieved some accuracy, hopefully.
It's still running. Probably it is still training. OK, and then when it finishes, it copies over the outputs. So if I want to save the model or some kind of artifacts from my training, I put them into a special directory called outputs. I don't do this in this script; I will show you an example in the next one. But it is a good idea, because once I train the model, I want to be able to use it later on. What if this run was the most successful one and all the others were not as good? That's why, if I store the models together with the logs and all the metrics, I will always be able to easily get this model back. So that's the execution scheme. In this case, I can also use different compute targets. Here we used an Azure ML compute cluster, but I can use any virtual machine to submit the experiment to, I can use the local computer, I can use Azure Databricks, or even some other things like HDInsight and Data Lake Analytics. But normally, you probably won't want to do that.
14. Running Experiments Programmatically
Normally, you would want to use Azure ML compute because it supports all possible options. However, the current experiment is still running, and it's unclear why. The next step involves configuration, which will be addressed later. Programmatically submitting experiments is useful for hyperparameter optimization, where multiple experiments are run to test different parameter values. To do this, Python code can be run in Jupyter Notebook within Azure ML. Uploading the necessary notebooks and creating a compute instance are the initial steps. It's not possible to calculate costs per run in Azure ML, as the duration of an experiment is difficult to estimate in advance.
But normally, you probably won't want to do that. You want to use Azure ML compute because it supports all possible options. Good. So let me see again what the status of my experiment is. For some reason it is still running; I cannot understand why. Let's wait a bit more, maybe it will finish. So the question is, where do I do this configuration? I don't quite understand what you are referring to. We will do it in the next step: it asks me questions on each step, and I just need to answer them. "My question was more about Databricks, if I want to add Databricks into my pipeline." When you define the compute target, I think Visual Studio Code only gives you the opportunity to submit to an Azure ML cluster. But it created this JSON file before we submitted it, so you can adjust things there. OK, so what happened to our experiment? It is probably still running, and I don't know what went wrong with it. Let's not waste time; hopefully it will finish at some point. It should have already finished, but it hasn't for some reason. So that was submitting the experiment from Visual Studio Code. The next thing you might want to do is submit experiments programmatically. And the reason you want to do that — one of the most frequent cases — is hyperparameter optimization. So what is hyperparameter optimization? Sometimes when you train a model, you need to make some choices about parameters, for example how many neurons you want to have in your neural network layers. And it's difficult to tell beforehand, because you just need to test. So the way you handle this is to create a script which can accept those parameters from the command line. Then you can schedule, for example, one hundred experiments, have them run and log the accuracy, and you will see the result. So how do we do that? Let's see how we can programmatically submit the experiment, not from Visual Studio Code, but from Python code. To do that, I need to run Python code, and the best way to run Python code inside Azure ML is to use a Jupyter Notebook here. To use a Jupyter Notebook, you go to the Notebooks section. You will not have many things there, because for you it will just be an empty directory. Here you can create new notebooks or upload existing ones. So let's upload a few notebooks from your cloned GitHub repository. The notebooks that you need are create_dataset.ipynb and submit.ipynb; upload those two notebooks here using this upload-files arrow. Let me show you: I go here, I say create_dataset, but make sure to select the notebook, the .ipynb extension, and not the Python file. This one. Select this one. It asks whether to overwrite if it already exists, and it uploads. So once you upload it, you want to run it, but to run it, you need compute. And you cannot use a compute cluster here, because the compute cluster is where you schedule experiments. To run this, you need a normal, simple compute. You can create it from here: if you don't have a compute, you will have the New Compute button here. You say new compute, select a new compute instance, and it creates the same kind of virtual machine that we used before.
OK, is everyone still with me? Should I stop for a moment, or is everyone awake? I will answer questions meanwhile. Is it possible to calculate costs per run in any of the functions you've seen? Is it possible to calculate costs per run in Azure ML? No. Well, you can probably estimate how much it would cost per minute, because when you create the instance, it gives you the price per minute. But how long will the experiment run? It's not obvious how to say that in advance. Sometimes, if you use AutoML for example, you can say "I want to terminate after three hours", so you can cap how much money you spend: I allow this to run for three hours, and after that, please terminate and give me the best possible model. So you can do that. But estimating how long the model will train is, I think, a difficult task. Good. So I'm assuming that you have managed to create the compute for this notebook, and the compute for the notebook will take some time to start up as well. It says creating.
15. Uploading Dataset to Datastore
It is possible to run multiple VMs in the compute, but there is a limitation on the number of cores. Large files in the same directory can cause errors when submitting an experiment. To avoid this, it is recommended to keep data in separate directories. The next step is to upload the dataset to the datastore attached to the workspace. This can be done programmatically by creating the dataset and fetching the data from the internet. Once the dataset is created, it can be used in the main job, which is done by submit.ipynb. The submit.ipynb notebook creates a reference to the workspace and initializes it from the config.json file. Authentication may be required the first time you run it. You can choose to run the code in different environments, such as Jupyter or JupyterLab. After running the create_dataset code, the dataset is ready to be used.
So it should be possible to run more than one VM, because you have a limit on the number of cores. If you go to Compute and start creating a new compute, it says the total available quota is 8 cores. In the beginning, my usage was 16 of 24, so my limit was 24 cores, and I think you should have a similar amount. If you selected two machines for the cluster, that adds up to something like 8 cores, and then you can have the VM here as well. So the error: somebody got the error "your total project size exceeds the limit of 300 megabytes" when they tried to submit the experiment. That happens because in the directory you have some kind of large file, and I don't know where you got it from, because in the GitHub repository there should be no large files, I think. This happens because, as I mentioned, when you submit the experiment, it packages the whole directory and sends it to the cloud. So in your case, if there are large files in the same directory which push it over the 300-megabyte limit, you will not be able to submit the experiment. And the good idea, as I mentioned, is not to keep data in the same directory. Maybe you have some other files in the same folder and you should move them to a new directory, as somebody suggested.
OK, so let me slowly start explaining what we will do next. Now we want to do hyperparameter optimization, so we want to run the same script many times on the same dataset. And to do that, we want the dataset to be sitting in the datastore, not downloaded each time from the internet. That's why the first thing we need to do is upload the dataset into the datastore attached to this workspace. We can do it programmatically. First, we create the dataset, meaning that we want to get the MNIST data from the internet and put it into a local file; this is just a piece of code in the first notebook which does that. So you see what I do: I fetch the MNIST data using the same code as in my previous script, and then I select the first 20,000 records just to speed things up a little bit. Then I create a directory called dataset and I pickle.dump the file there. pickle.dump is the Python way to serialize things to disk, so I pretty much create a binary dataset on disk. So what you need to do: once you have the compute, you connect this notebook to the compute and you click this Run button. It says "fetching MNIST data", and in a few minutes you will get the data in the dataset directory. If you prefer the traditional Jupyter interface, if you have already worked with Jupyter, you can select Edit in Jupyter here, and it will open a new browser tab which will look familiar to those who like using Jupyter. If you want fancy Jupyter, you can select Edit in JupyterLab, and it will have an even fancier interface, because JupyterLab is a richer environment. It also contains the ability to work with different scripts: you see, here you again have the interface with different files, you can have terminals here and some other nice things. How do you run a terminal? Yeah, so you can have the Jupyter console like this and write Python code here. Anyway, you can run it in different environments, and this is the built-in editor of Azure ML, so use whichever one you like. So this is finished. I have created the dataset here, you see this MNIST pickle file is sitting here, and I can use it now. So you run this create_dataset notebook once and then forget about it. The main job is done by submit.ipynb; submit.ipynb does all the work for you. Let's go through it step by step. The first thing it does is create the reference to the workspace. To do anything with the workspace you need this reference, and the easiest way to create it is to say Workspace.from_config(). What it does is initialize the workspace from the config.json file. In the beginning, I showed you how to download config.json from the Azure portal; if you want to do anything locally on your machine, you would download this config.json to your machine and use the same kind of code. If you do it here from this notebook, it automatically knows where to get the config.json file. You just execute this piece of code, and it should work; it says the configuration succeeded. What your experience might be is that the first time you run it, it will ask you to authenticate to Azure in a separate window. It uses the Azure CLI in the background sometimes, so it needs to authenticate this library against Azure Active Directory.
It will tell you: please go to microsoft.com/devicelogin and enter this code, and it will show you the code here. You copy the code, open microsoft.com/devicelogin in a separate window, and paste the code there. You do this once when you start with the workspace. I cannot show it to you because it did not ask me to do that today; maybe you won't have this either.
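Put together, the first notebook boils down to something like the sketch below. The pickle file name and folder are illustrative; the idea is just "connect to the workspace, fetch MNIST, dump a subset to disk":

```python
import os
import pickle
from azureml.core import Workspace
from sklearn.datasets import fetch_openml

# connect to the workspace described by config.json
# (downloaded from the portal, or picked up automatically on an Azure ML compute instance)
ws = Workspace.from_config()

# fetch MNIST, keep the first 20,000 digits, and serialize them to a local file
X, y = fetch_openml('mnist_784', return_X_y=True)
X, y = X[:20000], y[:20000]
os.makedirs('dataset', exist_ok=True)
with open('dataset/mnist.pkl', 'wb') as f:   # file name is an assumption for this sketch
    pickle.dump((X, y), f)
```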
16. Submitting Experiment and Training
In this part, we are about to submit the experiment from submit.ipynb. We get the reference to the cluster and create it programmatically if it doesn't exist. Then, we upload the data to the default datastore and create a Python file for training. The file accepts parameters, such as the data folder, number of epochs, batch size, size of the hidden layer, and dropout value. We treat the data folder as a path and use normal Python file routines to load the data. Finally, we construct a neural network, train it, save the model, and log the metrics. The experiment is named keras-mnist.
Maybe it figured out a way to do it without authentication. Has anyone succeeded in running this cell? Or have I lost everyone, and you are too tired to continue? So this notebook is called submit.ipynb, and we are about to submit the experiment. Yeah, Titanic is still running. That's cool. Titanic will take some time to run.
OK, good. So, yeah, somebody managed to get this output, meaning that probably everyone will get the same output. So now we have the reference. This ws is the magic variable that you will use to access the workspace. Next, you get the reference to the cluster. To do that, you use this piece of code here; it also creates the cluster if it's not available. In your case, because you already have the cluster, you'd better specify the right cluster name here. Mine was called, I think, "my cluster"; let me check that under Compute, compute clusters. Yeah, my cluster is called "my cluster", so I will leave this name here. If your cluster name is different, please specify it here. But you can also create the cluster from here; this just shows you how to create the cluster programmatically from Python without doing it manually. If I already have the cluster, this code runs very fast. OK. Succeeded. If it created a new cluster — yeah, as somebody said, then you have two clusters; just don't forget to delete them after the exercise.
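The "get the cluster, or create it if it does not exist" pattern typically looks like this; the cluster name and VM size here are assumptions, so replace them with whatever you actually created:

```python
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = 'my-cluster'   # use the name of the cluster you created earlier

try:
    cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, reusing it')
except ComputeTargetException:
    # provision a small CPU cluster that scales down to 0 nodes when idle
    config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_DS3_V2', min_nodes=0, max_nodes=2)
    cluster = ComputeTarget.create(ws, cluster_name, config)
    cluster.wait_for_completion(show_output=True)
```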
Next thing: I want to upload the data to the datastore. This can be done pretty easily. I get the reference to the default datastore — each workspace has a default datastore where you put all the data — and I say ds.upload. I want to take everything from the dataset directory, which is where I put my beautiful file with digits. I say target_path equals mnist_data, because I want to put it in the target directory mnist_data, overwrite equals true, and show_progress equals true. And when I run it, it uploads the data pretty fast. Good. Now I have the data sitting in the datastore. What do I do next? Next, I create the Python file to do the training. Here I have this file written as one cell, but normally what you would do is keep it as a separate Python file, edit it in the editor, tweak it, and make sure it works. Once it works, you put it in the same directory. I call it mytrain.py. What it does is accept parameters: if you know argparse, it parses the command-line parameters. I can specify the data folder as a command-line parameter, the number of epochs, the batch size, the size of the hidden layer, and the dropout value. I will train a very simple neural network here, using Keras. So here I have the defaults for each parameter, but if I want to change them, I can specify them via the command line, right? Next, I get the dataset. I pretty much treat this data folder as a path, and I use normal Python routines — like opening a file — to load the data from this folder. The way it works is that, as I mentioned, datastores are mounted to the cluster, so they are mounted to some directory in the file system, and that directory is what I pass as the parameter to my script. And then I do the training: I construct the neural network, I do model.fit, I save the model to the outputs folder, and I log the metrics. That should look pretty familiar. OK. You run this cell to save it to the file called mytrain.py. And then here is the piece of code which creates an envelope around my script, the environment in which it will run. I say I want to create an experiment called keras-mnist.
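The upload step and the training script can be sketched roughly as follows. The file names (mnist.pkl, mytrain.py) and the exact network layout are assumptions, but the structure — argparse parameters, reading data from the mounted path, logging to the run, saving into ./outputs — matches what is described above.

```python
# in the notebook: upload the local ./dataset folder into the workspace's default datastore
ds = ws.get_default_datastore()
ds.upload(src_dir='./dataset', target_path='mnist_data',
          overwrite=True, show_progress=True)
```

And a skeleton of the parameterized training script:

```python
# mytrain.py -- sketch of the parameterized training script
import argparse
import os
import pickle
import numpy as np
from azureml.core.run import Run
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

parser = argparse.ArgumentParser()
parser.add_argument('--data_folder', type=str, default='dataset')
parser.add_argument('--epochs', type=int, default=5)
parser.add_argument('--batch_size', type=int, default=128)
parser.add_argument('--hidden', type=int, default=100)
parser.add_argument('--dropout', type=float, default=0.5)
args = parser.parse_args()

# the datastore is mounted on the cluster node, so the data folder is just a path
with open(os.path.join(args.data_folder, 'mnist.pkl'), 'rb') as f:
    X, y = pickle.load(f)
X = np.asarray(X, dtype='float32') / 255.0
y = to_categorical(np.asarray(y, dtype=int), 10)

model = Sequential([
    Dense(args.hidden, activation='relu', input_shape=(784,)),
    Dropout(args.dropout),
    Dense(10, activation='softmax')])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=args.epochs, batch_size=args.batch_size, validation_split=0.2)
_, acc = model.evaluate(X, y, verbose=0)

run = Run.get_context()
run.log('accuracy', float(acc))        # shows up as a metric on the experiment run

os.makedirs('outputs', exist_ok=True)  # anything written to ./outputs is kept with the run
model.save('outputs/mnist_model.hdf5')
```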
17. Running and Debugging Experiments
To run an experiment in Azure ML, you need to pass the data folder and specify the estimator and running environment. This may seem difficult at first, but with the provided code and resources, you can figure it out. Once you submit the experiment, you can track its progress, debug any issues, and view the logs and outputs. Remember that the experiment runs on a cluster, so you may encounter version or configuration problems. Debugging involves making changes to the script and resubmitting. Once the experiment is finished, you can access the results, including accuracy metrics and the trained model.
I want to pass the data folder, which points to the default datastore. And I want to have this estimator. The estimator is the object which knows how to run the script. To this estimator, I also specify the running environment: I say pip_packages, I want Keras, I want TensorFlow. That is the way I tell Azure ML which packages to install on the cluster, what to include in my container.
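Assuming the ws, ds, and cluster objects from the earlier cells, the submission part can be sketched with the (older-style) Estimator API that the workshop uses; the folder and script names are the ones assumed above:

```python
from azureml.core import Experiment
from azureml.train.estimator import Estimator

script_params = {
    '--data_folder': ds.path('mnist_data').as_mount()  # mount the uploaded folder on the cluster
}

est = Estimator(source_directory='.',
                entry_script='mytrain.py',
                script_params=script_params,
                compute_target=cluster,
                pip_packages=['keras', 'tensorflow'])   # libraries baked into the container

exp = Experiment(workspace=ws, name='keras-mnist')
run = exp.submit(est)
run.wait_for_completion(show_output=True)
```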
This can sound a little bit difficult. I know it is pretty confusing right now. But you have this piece of code, you have the link to the GitHub repository, and once we finish today, once you get a chance to have some sleep, and maybe even once the conference finishes, you can come back to this, spend a few hours here, and you will figure it out, I promise. It's not the easiest way to do things, but there are huge benefits at the end.
So once you create this, you just say experiment.submit. I click here, and it submits my experiment. Now I can go to Experiments; here is my Keras MNIST experiment. And what is happening here? It is queued. This is my new experiment, and here are previous runs. I had some failures earlier today because I specified the wrong version of Keras, and the wrong version of Keras resulted in huge problems. That is one complication here: each time you submit the experiment, it's not running on your machine. On your machine, of course, you control the environment; you know which Python version and which libraries are installed. But on the cluster, you don't see that; it's not easy to see. You can only specify it using those commands, and each time you specify it wrong, something goes wrong, and you need to go and track it down.
So the way you debug this is that you submit the experiment, and then, if you see that the experiment failed, you go inside, you go to Outputs, you go to Logs, and you see that something went wrong. Right? You can see the output; you can see that here it's training, but at the end there is a problem. You can see all the error messages here. And once you figure out what the problem is, you change the script, you resubmit, and then you debug again. That takes a little bit of time to get used to. And here you have all the logs. For example, because I specified my own versions of the libraries, it needed to build the container. So what happens in the beginning, this 20_image_build_log — what does it do? It, in fact, builds the container. You can see all the steps of the Dockerfile, which libraries were installed and so on. The successful version — yeah, my script, you see, finished in 39 seconds. That is kind of cool, because I did not make any changes to the container and I did not make any changes to the script. Even if I did make small changes to the script, it would still be about 39 seconds. The first time it ran, it ran for one minute and 41 seconds; that's because it created the container. So once it's finished, you can see that I have my accuracy metric here, and in Outputs + logs, in outputs, I would have my model. I think I disabled saving the model for some reason; maybe it was giving some errors. But normally — let me go to some earlier run — you would have outputs here, and you would have the binary model file, an HDF file. Cool. So that is just submitting one experiment.
18. Hyperparameter Optimization with Hyperdrive
To perform hyperparameter optimization, you can manually submit all possible script runs or use the hyperdrive technology for automatic parameter sampling. Hyperdrive allows you to specify the parameters and the sampling method, such as choosing from a set of values or using random distributions. You can also define early termination policies to stop training if the desired accuracy is not achieved. Once the experiment is submitted, hyperdrive tries different parameter combinations and selects the best model based on the specified metric. The outputs and logs of each model are available for further analysis and prediction. This automated approach saves time and effort compared to manually trying multiple parameter combinations. To submit the experiment automatically, you need to create a script that accepts parameters, loads data from the data store, trains the model, stores the model in the output directory, and logs the results.
But I started talking about hyperparameter optimization, right? How do I do hyperparameter optimization? Well, I can manually submit all possible script runs, because here, when I submit, I specify the parameters. I specified data_folder equals the datastore, but I can also put hidden equals 100, and then hidden equals 200, and I can submit a lot of experiments. That would be one option. But an even better option is to use so-called HyperDrive. HyperDrive is a technology which does automatic parameter sampling. So what I do here, I say I want to sample the hidden layer size from the choice of 5, 50, 100, 200, 300; I specify the batch size as a choice of 64 and 128; I specify the number of epochs as 5, 10, and 50; and then I specify the dropout in this range. Here I use choice, so it just selects one of those values, but I can do other kinds of parameter sampling: random distributions, like the uniform or normal distribution, which gives me a lot of flexibility. So this is how I want to sample my parameters. And this is how I want to do early termination. Sometimes when the script starts training and it trains and trains but doesn't achieve high accuracy, I want to be able to terminate it earlier; this is called early termination. And then I specify all of it: I want to use this termination policy and this parameter sampling, I want to use accuracy as the primary metric, I want to maximize this metric, I want a maximum of 16 total runs and a maximum of four concurrent runs, and I want to submit this experiment. When I click this — I will not do that because I have already done it many times before — it will submit a HyperDrive experiment. Let's have a look; I have it already here: Keras HyperDrive. Actually, I think I did not manage to do it here; I have another version of the workspace. Here is my Keras HyperDrive. It is just a different workspace where I finished it some other time, because for some reason it did not work for me today and I did not have time to finish this run before our workshop, so I will just show you the previous version. So you see, this is the previous version; it took 12 minutes. What happened? Sorry, are you still able to hear me? Yes. Good. I think something happened to my Zoom; maybe it realized that it's about to finish and stopped. Sorry about that. So, let me show you again. This is my Keras HyperDrive experiment. If I go there, I can see child runs. You see this beautiful graph: it shows me different accuracies, and each dot is a different run. So 0.97 is the highest accuracy. I can click here and see which parameters it corresponded to. Here, you see that the hidden layer is 300, which makes sense — it's the highest number of neurons in the hidden layer — and 50 epochs also makes sense, the highest number of epochs, and dropout 0.5 and batch size 64. So it automatically tried all the different parameter combinations and found out that this is the best model. And in Outputs + logs, each model has an associated output model file, which I can use for my predictions. So you see, the beauty of this is that I spent some time suffering and trying to configure this horrible experiment running on the cluster, but once I did it, I am able to easily schedule a lot of experiments and just select the best one. Because hyperparameter optimization can sound like a nightmare.
Imagine, like, you spend a week setting up the model and then somebody tells you, and why didn't you try it with 100 more different combinations of parameters? And then you say, oh my God, I need to remember where to put the parameters. I need to record all the accuracy. And here everything is done for you automatically. So, to summarize, if you want to submit the experiment automatically, you create the script, which accepts parameters. It loads data from data store as normal files. It trains the model and stores the model into output directory. And it logs the results using this logging. And that's it.
19. Hyperparameter Optimization and Model Deployment
To do hyperparameter optimization, specify the parameter sampling, such as choice, uniform distribution, or normal distribution. Use strategies like random parameter sampling or grid search. Azure ML provides a model registry for recording and deploying the best models. Deployment can be done on a cluster or virtual machine. If you have questions or need assistance, feel free to reach out to me on Telegram or visit my website for more information.
That's how you run the script. If you want to do hyperparameter optimization, you specify the parameter sampling: it can be a choice, a uniform distribution, or a normal distribution. And the strategy can be random parameter sampling, where it just randomly selects combinations; it can be grid, which is an exhaustive grid search over all possible combinations; and Bayesian sampling is a more clever way to select parameters. And then you just do something like this: a HyperDriveConfig, and you submit the experiment.
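Putting the sampling, the termination policy, and the configuration together looks roughly like this, assuming the est and exp objects from the earlier sketch; the dropout range and the Bandit policy settings are assumptions, while the choices and run counts follow the values mentioned above:

```python
from azureml.train.hyperdrive import (
    RandomParameterSampling, BanditPolicy, HyperDriveConfig,
    PrimaryMetricGoal, choice, uniform)

param_sampling = RandomParameterSampling({
    '--hidden': choice(5, 50, 100, 200, 300),
    '--batch_size': choice(64, 128),
    '--epochs': choice(5, 10, 50),
    '--dropout': uniform(0.3, 0.7),          # assumed range for this sketch
})

# stop runs that fall too far behind the best one so far
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

hd_config = HyperDriveConfig(estimator=est,
                             hyperparameter_sampling=param_sampling,
                             policy=early_termination,
                             primary_metric_name='accuracy',
                             primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                             max_total_runs=16,
                             max_concurrent_runs=4)

hd_run = exp.submit(hd_config)
```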
Finally, once you train the model, Azure ML also has the notion of a model registry. You can say, here is my best model, I want to register it, and it will be recorded in a table. Then you provide a scoring file and say you want to deploy it, and this is also handled by Azure ML. I will not go into details, but it can be deployed on a cluster or on a virtual machine. As for me, I have spent more of my life doing data science, so for me deployment is kind of a dirty, not very interesting thing — maybe for you too. Somebody has to do it, though, and Azure ML simplifies this step as well.
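A sketch of what registration and deployment look like in the SDK, assuming the ws and run objects from before; the model path, service name, and scoring script are assumptions here:

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

# register the model saved into ./outputs by the best run
model = run.register_model(model_name='keras-mnist',
                           model_path='outputs/mnist_model.hdf5')  # assumed path

# an environment plus a score.py you write describe how to load the model and serve requests
env = Environment('inference-env')
env.python.conda_dependencies = CondaDependencies.create(pip_packages=['keras', 'tensorflow'])
inference_config = InferenceConfig(entry_script='score.py', environment=env)

# deploy as a small Azure Container Instance web service
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, 'mnist-service', [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```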
So I think that was my hands-on introduction. If you want to practice more, please feel free to do that and ask me questions in Discord. I will probably not join Discord today because it's already 10 p.m. here at my location, and I will probably just go to sleep immediately after that. But tomorrow I will join the Discord and I will answer any questions that you might have. So feel free to continue experimenting tomorrow. And if you experience any problems later on, I will leave you my... Most frequently I use Telegram. I don't know if you use Telegram or not. I will leave you my Telegram address and I will leave you my website, which contains all the blog posts about Azure ML. And also my contact information. So if you have any questions later on, you feel free to drop me a note.
20. Training Generative Adversarial Networks
In this part, the speaker discusses the training of generative adversarial networks (GANs) to produce paintings. GANs consist of a generator network that produces paintings from random vectors and a discriminator network that distinguishes real paintings from fake ones. The networks are trained together, with the generator learning to produce paintings that the discriminator classifies as real. The training process involves using a dataset of paintings and a package called keragan. The speaker also mentions the cost of GPU machines for training and explains how Azure ML supports logging images and graphs for monitoring the training progress. The speaker provides an example of the training output, showing the progression from noise to recognizable shapes in the generated paintings.
So I know that we have... Okay, so it doesn't seem to exist. So for the blog channel, I have to be honest with you, I have not joined Discord yet. I tried to do it before the training and I could not remember my Discord password. I used it like about a month before and I forgot. So I will do it tomorrow. And in terms of the name for the channel, I'm pretty sure that the organizers will figure something out and they will add the channel there and you would be able to figure out the name for the channel.
So let me — I know that we are over time, so if you are tired and want to stop, you are free to drop off — briefly go through one more, or actually maybe even two more, case studies on using Azure ML to solve something interesting. One example is training generative adversarial networks to produce paintings like this. So what's interesting about producing paintings like this? This is done using a neural network, and the way this neural network works is that it consists of two networks. The first one is the generator. The generator takes a random vector, and from this random vector of numbers it produces a painting. Then there is a second neural network, the discriminator, which takes images as input, and its goal is to figure out whether the image is a fake produced by the network or a real painting produced by a real artist, a human being. The way it is trained is that we train both networks. In the beginning, we give the discriminator a real image and a fake image, and the fake image is blurry, it's rubbish, so it's an easy job for the discriminator to distinguish something beautiful from rubbish, and it learns to do that. And then we train the generator: we say, generator, your job is to produce such images that the discriminator classifies them as real ones, not as fakes. And the generator learns to do its job a little bit better. Those two networks are complementary to each other; they are trained together. As a result, we are left with this beautiful generator network, which can produce paintings from any random vector of numbers. As you can imagine, you can come up with a lot of random vectors, so you can come up with a lot of paintings. And this is trained just on a dataset of pictures: you get a dataset of paintings from somewhere and you can train this network. The sample code for this is also available in your GitHub repository, so if you really want to do this exercise, you can do that too. The training to produce paintings, as you see on the slides, takes something like 11 hours on a GPU-enabled cluster, so it's not a cheap exercise — not a terribly expensive one either, but not terribly cheap. I will not, I think, go into detail on how this is structured. If you know about convolutional neural networks: the generator is basically a deconvolutional neural network, so it starts with a vector and then recreates the image from it, and the discriminator is a normal convolutional neural network. To do the actual training, I use a package called keragan. It's a GAN training package which I put together, and most of the code is actually contained inside this package. About the cost of GPU: the cheapest GPU machine, I think, costs about $800 per month if you run it for the whole month, so you can figure out the cost from there — it's about $1 per hour, so something like $11 for this training. The interesting thing here, this is the piece of code which shows how the generator and discriminator are trained. In the examples that I showed you before, we logged metrics, because a metric describes how well the model performs. If we want to generate paintings, the only way to see that the model performs well is to look at the paintings ourselves; we basically want to look at the produced paintings at each step. So Azure ML supports logging pictures and graphs.
So for that, we use the same run object, but instead of run.log, we use run.log_image, and we log images at every step of the training. The way it looks in the portal: I have this experiment here, it's called keragan. You see that 11 hours was the training time. I can go to Outputs + logs, and every so often — not every step, not every epoch, but every 10 or 20 epochs — I save both the sample output and the models, generator and discriminator. So here is the sample output; it generates three random pictures. In the beginning, it looks like this: the untrained network produces noise. After a few epochs, it starts generating something like this, a little bit more beautiful noise. And then we need to go further down. You see, it starts getting some kind of shapes, more shapes, and more; there is a lot of training going on here. This one was probably trained on portraits, so it looks like faces of people, somehow. Let's go to the end. This is for portraits. This is beautiful — it's like drawing Halloween portraits. Something like this, yeah.
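The logging call itself is small. A hedged sketch of how generated samples might be pushed to the portal every few epochs — the sample images here are random placeholders standing in for real generator output:

```python
import numpy as np
import matplotlib.pyplot as plt
from azureml.core.run import Run

run = Run.get_context()
epoch = 20
samples = [np.random.rand(64, 64, 3) for _ in range(3)]   # stand-ins for generator output

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, img in zip(axes, samples):
    ax.imshow(img)
    ax.axis('off')

# the figure shows up under the run's images/outputs in the portal
run.log_image('samples_epoch_{:04d}'.format(epoch), plot=fig)
```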
21. Generating Landscape-like Pictures
This is my second experiment. I trained the first experiment on flowers, but this one is focused on generating landscape-like pictures using a neural network. Once the experiment finishes, I can programmatically download the latest version of the generator model and use it to produce random pictures. The generated pictures resemble landscapes, similar to the one behind me in the Zoom video.
And this is my second experiment. My first experiment was trained on flowers; I ran two of them. No, this is not flowers. This is, I think, pictures of landscapes. Or not — it looks like portraits again. Anyway, I think you get the idea of how this is done. It's pretty much the same scheduling of experiments, but instead of metrics, we look at images. What can we do after that? Once the experiment finishes, it saves all the models inside the experiment, and we can programmatically get the list of all file names using the same object called run. You say run.get_file_names(), take all file names which start with outputs/models and then generator, find the generator with the highest number, and download the latest version of the generator. Then I can produce pictures from it locally, because once I have the generator, I just feed a random vector as the input and get these kinds of random pictures from it. And you see, this looks like a landscape. The one behind me in the Zoom video is also one of the pictures generated using this neural network. So that's one example of using Azure ML.
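Programmatically pulling the newest generator out of a finished run might look like this; the experiment name, run id, and file-name pattern are assumptions based on what is shown in the portal:

```python
import re
import numpy as np
import keras
from azureml.core import Workspace, Experiment, Run

ws = Workspace.from_config()
run = Run(Experiment(ws, 'keragan'), run_id='<your-run-id>')   # hypothetical run id

# pick the generator file with the highest epoch number from the run's outputs
gen_files = [f for f in run.get_file_names() if f.startswith('outputs/models/gen')]
latest = max(gen_files, key=lambda f: int(re.findall(r'\d+', f)[-1]))
run.download_file(latest, output_file_path='generator.h5')

# load it locally and sample paintings from random latent vectors
generator = keras.models.load_model('generator.h5')
noise = np.random.normal(size=(5, generator.input_shape[1]))
images = generator.predict(noise)
```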
22. Training Neural Network for COVID Dataset
Another example of using Azure ML is training a neural network to answer questions based on a COVID dataset using the DeepPavlov library and the BERT model. BERT fills in gaps in the text by predicting missing words and the probability of the next sentence. Pretrained BERT models in DeepPavlov can be used for text classification, entity extraction, and question answering. The ranker and reader steps are used to answer questions based on a collection of documents. Training the model on the COVID dataset involves extracting data from the dataset and training on a GPU-enabled machine. The demo shows the process of defining the dataset, consuming it in Python, and extracting the data for training.
Another example of using Azure ML is training a neural network to answer questions based on a COVID dataset. This is pretty advanced in the sense that it uses a library called DeepPavlov — a library for text processing — and the BERT model to answer questions over any big dataset of texts. I'm going a little bit fast now, so if you want more details, just ask me and I will stop and talk a little bit more about it. So this method uses the BERT model. BERT is kind of the state of the art; it's a few-years-old model for modeling text. The way it works is that it tries to fill in gaps in the text, so you train it to predict missing words. For example, you have a sentence like "during holidays, I like to ___ with my dog", and the role of the network is to say that this blank can be "play" with a probability of 0.85, it can be "sleep" with a probability of 0.05, and so on. And then it can also predict the probability of the next sentence: for example, if you say "during holidays, I like to play with my dog" followed by "it is so cute", it can say that this sentence can follow with a probability of 0.8, and otherwise not. It's a very big neural network, so it's very expensive to train. It has been trained for several languages, and DeepPavlov contains pretrained BERT models. Based on this pretrained model, you can solve different problems. For example, if you want to do text classification — say, for each phrase you want to see whether it's insulting or not — you would start with the input text, use BERT to extract features, and then train a classifier on top of this to give the probability. If you want to extract entities — for example, to say which pieces of text are locations, like countries and city names — you would have the same pipeline, but instead of a classifier you would train a mask generator, which marks which words are locations. And if you want question answering, you would have the same BERT model extract features, and then it would give you the bounds: for example, you pass it the question "What is your age?", and somewhere in the text there is a sentence "my age is 21", and it tells you to take these bounds, and that is the answer. This is question answering. If you want to use DeepPavlov, it's pretty much a normal Python library. And if you want to answer questions based on a big document collection, it is done in two steps. You have this collection and you first train the ranker. The ranker takes in the question and produces a list of documents where the answer is contained — by documents I mean short pieces of text, like one or two sentences. And then there is a reader which takes those short documents, finds the exact answer to the question in them, and gives the highest-probability answer as the result to the user. So, to train this using DeepPavlov on the COVID dataset: there is a dataset of COVID papers, something like two hundred thousand papers on COVID, I think. I will not go into details, but I use Azure ML like this: I put the dataset of COVID papers into Datasets, and then I use a normal, non-GPU compute to do the data extraction. It took me something like a few hours to figure out the format of the dataset, and while I'm doing that I don't really need a GPU, so I just created a normal virtual machine, the one that you have been using today, and I spent a few hours doing that.
And once I had done that, I just connected the same storage, the same notebook directory, to a GPU-enabled compute, and to train this model I used a 2-GPU machine with, I think, 128 gigabytes of RAM — a huge machine, which is expensive. I connected it to the GPU compute, and it trained for a few hours, I think. So this demo shows how all of this is done. I defined the dataset. This dataset is also from web files, but this time the web file is an archive — a gzipped tar file — so I specify it as a file dataset, not a tabular one. Once I have it here, I can grab the code to consume this dataset in Python. Then I can go to Notebooks, paste this code, and mount this dataset as a directory. So I say mount, and I can see that this directory contains this big file of papers. Then we just do a tar extract; we extract all the documents. Those documents are JSON files which contain the full text of the article and the abstract. Then I run a piece of Python code to extract the data from the JSON files and put it into a separate text file for each article. So I end up with a collection of text files like this, in the text directory — a lot of text files. Each text file looks like this. It will print...
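A rough sketch of the mount-and-extract step just described, assuming a registered file dataset and CORD-19-style JSON; the dataset name and folder names are made up:

```python
import glob
import json
import os
import tarfile
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, 'covid-papers')          # hypothetical dataset name

# mount the file dataset and unpack the gzipped tar archive it contains
mount_ctx = dataset.mount()
mount_ctx.start()
archive = glob.glob(os.path.join(mount_ctx.mount_point, '*.tar.gz'))[0]
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall('papers')
mount_ctx.stop()

# each JSON file holds the full text; dump the body text into one .txt per article
os.makedirs('text', exist_ok=True)
for path in glob.glob('papers/**/*.json', recursive=True):
    with open(path) as f:
        doc = json.load(f)
    body = ' '.join(p.get('text', '') for p in doc.get('body_text', []))   # CORD-19-style layout
    with open(os.path.join('text', os.path.basename(path) + '.txt'), 'w') as out:
        out.write(body)
```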
23. DeepPavlov and Question-Answering
DeepPavlov lets you configure the pipeline for question-answering models. Out of the box it can answer questions using models pre-trained on Wikipedia, and you can also retrain it on your own dataset and ask questions against that. However, this is more of a toy exercise and may not surface the best answers from a large collection of papers. A more promising direction is medical data mining, extracting medical terms to simplify navigation for doctors and researchers.
Yeah, like this, just some plain text. Then, how does DeepPavlov work? DeepPavlov allows you to configure the pipeline, that is, what happens to the text and what you need to do with it. It already has a pipeline for answering questions from Wikipedia, so you can say, I want to build this question-answering model, and then you can ask it things like, where did guinea pigs originate, and it will tell you where, or when did the Lynmouth floods happen, and it will give you the answer, 1804. To train it for my own dataset, I need to retrain the ranker, and I also need to fine-tune the question-answering model. The way you do it, you just change the path to the dataset and call train. I will not go into details, but what you get at the end is a model where you can ask questions. You can say, what is the reproductive number, and because it was trained on the COVID dataset, it will tell you 2.2, which is the average reproductive number of COVID reported in the papers. What is an effective COVID treatment? It will give you some suggestions. What minerals help COVID patients? Of course, this is probably more of a funny demo than a real-world exercise. You cannot expect a neural network like that to really find the best answer in a huge collection of papers, or to extract new information, because it just looks for the place where the best-matching answer already is. So it's more of a toy exercise. But you can, of course, use the same approach to do more serious medical data mining. I think the most promising approach is to extract medical terms and build a kind of knowledge graph on top of the paper dataset, to simplify navigation for doctors and medical researchers, and maybe even to extract some knowledge using symbolic methods. But this is just a proof of concept for that.
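As a rough illustration, the pre-trained Wikipedia ODQA pipeline and the ranker retraining described above look roughly like this in DeepPavlov. The config names are taken from DeepPavlov's catalogue at the time and may differ between versions; the data path is a placeholder.

```python
# Sketch of DeepPavlov open-domain question answering, assuming a DeepPavlov
# version where these config names exist; paths are placeholders.
from deeppavlov import build_model, train_model, configs
from deeppavlov.core.common.file import read_json

# 1. Pre-trained ranker + reader built on Wikipedia.
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=True)
print(odqa(["Where did guinea pigs originate?"]))

# 2. Retrain the ranker on your own collection of .txt files by pointing the
#    dataset reader at the directory of extracted papers.
ranker_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
ranker_config["dataset_reader"]["data_path"] = "text"       # folder produced earlier
ranker_config["dataset_reader"]["dataset_format"] = "txt"
doc_retrieval = train_model(ranker_config)
```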
24. Azure ML for Low Code and No Code Scenarios
Azure ML can be used for low-code and no-code scenarios, including AutoML for beginners, and for configuring environments for serious machine learning. It supports distributed training, simplifies the process, and offers pre-configured setups for running distributed training with Horovod. DeepPavlov is a library for text manipulation, while BERT is a language model trained to predict missing words in text. Please delete or stop any resources created today to avoid unnecessary costs; unfortunately, automatic shutdown is not available in Azure ML. Thank you for your time, and feel free to ask any questions.
Okay, so I think what we learned today is, first of all, that Azure ML can be used for low-code and no-code scenarios. If you don't know anything about machine learning, you can use AutoML. If you do some serious machine learning, you can use Azure ML to configure an environment where you can schedule your experiments on a cluster and simplify hyperparameter optimization and those kinds of repetitive, boring, complicated tasks. Also, I did not mention this today, but it supports distributed training. In our experience, for example, we needed to train a model to recognize certain objects, a RetinaNet model, and a RetinaNet model trains for a long time. We used a cluster of four GPU machines and trained it for about 12 hours, I think, which is quite a lot. You don't want to wait two days for the result, right? You'd better put in more machines and train it in a distributed manner. Azure ML simplifies this process: it supports Horovod, the library for distributed training, and it comes with pre-configured environments for running Horovod distributed training, for example. You have seen the link to the GitHub repository; it contains instructions and the code. You can go through it at your own pace and ask me questions in Discord if you have them.
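For illustration, submitting a Horovod job to an Azure ML cluster might look roughly like this with the Python SDK. The cluster, environment, experiment and script names below are placeholders, not the actual workshop code.

```python
# Hedged sketch of submitting a Horovod (MPI) distributed training job to Azure ML.
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()
cluster = ws.compute_targets["gpu-cluster"]          # assumed existing AmlCompute cluster

# A curated GPU environment with Horovod pre-installed (exact name may vary by SDK version).
env = Environment.get(ws, name="AzureML-TensorFlow-2.3-GPU")

config = ScriptRunConfig(
    source_directory="train",
    script="train_retinanet.py",                     # hypothetical training script
    compute_target=cluster,
    environment=env,
    distributed_job_config=MpiConfiguration(process_count_per_node=1, node_count=4),
)

run = Experiment(ws, "distributed-training").submit(config)
run.wait_for_completion(show_output=True)
```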
So, what is BERT and what is DeepPavlov? Great question. DeepPavlov is a library for text manipulation. Unfortunately, I don't have the best slides for this here, but this one, I think, is the slide that best describes both BERT and DeepPavlov. DeepPavlov allows you to say, I want to solve this problem, I want to classify my text, and then to describe the pipeline: okay, I want to start with text, pass it through one model to extract features, then pass it through another model and get the probability vector. Now, suppose I have my dataset, which says that "I am angry, I hate this training" is an insulting comment, so "I hate this training" would have a probability of 0.9 of being insulting, and "I love this training" would have a probability of 0.1 of being insulting. I want to train this whole pipeline by passing the text to the input and this probability to the output, and fine-tune the classifier. That is what DeepPavlov allows you to do. All those built-in models, like the classifier, are existing models, taken from scikit-learn, for example; BERT is also an existing model, and I will tell you in a moment what it is. DeepPavlov can train this whole pipeline end-to-end, so it's a little bit similar to the Designer, which we have seen before. And BERT is a text model, a neural network which can predict how text behaves. It is trained on huge amounts of general-purpose text from the Internet, so it learns how you normally put words together: which words follow which words, and which words sit in between which words. So if you start with input text like "I hate this training", the BERT model can extract from it numerical features which roughly describe its semantics, and those can be picked up by the classifier to produce the final probability. That's a very quick explanation.
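As a concrete illustration of this kind of pipeline, DeepPavlov ships a pre-trained BERT-based insult classifier; a minimal sketch might look like this (the config name follows DeepPavlov's catalogue and may change between library versions):

```python
# Sketch of a BERT-based text classification pipeline in DeepPavlov,
# assuming a version where this config name exists.
from deeppavlov import build_model, configs

# Downloads the pre-trained BERT feature extractor + classifier for insult detection.
classifier = build_model(configs.classifiers.insults_kaggle_bert, download=True)

# Each input phrase gets a predicted label (roughly the 0.9 / 0.1 example above).
print(classifier(["I hate this training", "I love this training"]))
```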
And a great comment: please delete everything that we created today which you are not going to use. We don't want to waste your Azure subscription money, and we don't want you to be left with no credits. So please delete the cluster, and delete the virtual machine, or at least stop it. You don't have to delete it; you can just click stop, it will be stopped and it will not eat your money. Please do that immediately after we finish, unless you want to play a little bit more. Unfortunately, at the moment it's not possible to set automatic shutdown in Azure ML. For normal virtual machines you can say, I want this machine to shut down automatically every midnight, and that is a huge relief, because before I started using this feature, I forgot GPU machines a couple of times for a few days, and that was unpleasant. So now I always set a shutdown at midnight; it keeps me from working past midnight, and it also saves money if I forget about virtual machines. But in Azure ML you cannot do that for some reason; I think Azure ML is supposed to be for rich people, so they are not supposed to care about spending money on virtual machines. So please do it manually now. I hope they will add this feature; I keep asking for it and giving this feedback: please, please, please add automatic shutdown for virtual machines in Azure ML. I think I kept you for half an hour longer, but I hope that was half an hour well spent, because I showed you at least some nice pictures produced by a neural network, and you now know that a neural network can read papers and understand them a little bit, which is an exciting conclusion from this exercise, I think. Any other questions? Please feel free to ask me. Okay, the slides, great question! Let me put the slides somewhere right away; I need to find where they are... Excellent, I am very glad that you liked it. You will get the link in a few seconds, once I get it from SpeakerDeck. So I want to thank everyone, especially those who stayed to the end. I know that was difficult; a long remote workshop is not nearly as fun as being in person together with pizza lying on the table next to us, but it's probably easier to follow along because you have the links and you are at your own favorite computer, and that compensates a little bit for the pizza. Ideally, of course, the organizers should deliver pizza to all participants, but it doesn't work like that at the moment, I think. Okay. So here is the link. Good. Have a great evening, have a great machine learning conference, and I hope to see you there in a few days.