Dabl: Automatic Machine Learning with a Human in the Loop


In many real-world applications, data quality, data curation, and domain knowledge play a much larger role in building successful models than devising complex processing techniques or tweaking hyperparameters. A machine learning toolbox should therefore enable users to understand both data and model, without burdening the practitioner with picking preprocessing steps and hyperparameters. The dabl library is a first step in this direction. It provides automatic visualization routines and model inspection capabilities while automating away model selection.


dabl contains plot types not previously available in standard Python libraries, as well as novel algorithms for picking interesting visualizations. Heuristics are used to select appropriate preprocessing for machine learning, while state-of-the-art portfolio selection algorithms are used for efficient model and hyperparameter search.


dabl also provides easy access to the model evaluation and model inspection tools provided by scikit-learn.

35 min
02 Jul, 2021

Video Summary and Transcription

This talk introduces dabl, a library that allows data scientists to iterate quickly and incorporate human input into the machine learning process. dabl provides tools for each step of the machine learning workflow, including problem statement, data cleaning, visualization, model building, and model interpretation. It uses mosaic plots and pair plots to analyze categorical and continuous features. dabl also implements a portfolio-based automatic machine learning approach using successive halving to find the best model. Future goals for dabl include supporting more feature types, improving the portfolio, and building explainable models.


1. Introduction to Automatic Machine Learning

Short description:

Hello and welcome to my talk on automatic machine learning with a human in the loop with dabl. My name is Andreas Müller and I'm a scikit-learn core developer and a Principal Software Engineer at Microsoft. The standard machine learning workflow involves problem statement, data collection, data cleaning, visualization, building an initial ML model, and offline and online model evaluation. In applications, data collection and exploratory data analysis are often more important than model building.

Hello and welcome to my talk on automatic machine learning with a human in the loop with dabl. My name is Andreas Müller and I'm a scikit-learn core developer and a Principal Software Engineer at Microsoft.

So, let me start with what I view as the standard machine learning workflow. We start with a problem statement, then usually data collection, data cleaning, visualization, building an initial machine learning model, doing offline model evaluation, and then online evaluation within your application. I think these are the core steps of any machine learning process, and usually this is not a linear process. I drew it here as a circle, but really, it's more of a fully connected graph. After each step, you might go back to previous steps. After data cleaning, you might see that you need to change your data collection. After model building, you might see that you need to change something in data cleaning, and so on. I think all of these steps are really quite critical. In the machine learning community, we focus a lot on model building, whereas in applications this might not really be the most important part, and things like data collection and exploratory data analysis might be more important.

2. Introduction to dabl

Short description:

These days, starting a machine learning workflow in Python often involves using scikit-learn and pandas, or matplotlib and seaborn for visualization. However, these tools require a lot of work and explicit coding. Automatic machine learning tools like auto-sklearn and AutoGluon can help with model building, but they often take a long time and result in black-box models. To address these limitations, I introduce dabl, a library that allows data scientists to iterate quickly and incorporate human input into the machine learning process.

So, these days, if you start your machine learning workflow in Python, you might use something like scikit-learn and pandas, or matplotlib and seaborn for visualization. Here I'm using pandas and seaborn to do some initial visualization on the standard adult dataset, which is a classification dataset using adult census data. I get visualizations that tell me how age relates to income, but it actually requires quite a bit of work and quite a bit of seaborn knowledge to do it this nicely. If I wanted to do it with straight-up matplotlib, it would be much more code.
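For reference, a minimal sketch of what this kind of pandas/seaborn exploration looks like; the OpenML copy of the adult dataset and its column names (age, class) are assumptions on my part, not the exact code shown in the talk:

```python
# Minimal sketch: initial look at the adult census data with pandas + seaborn.
# Assumes the OpenML copy of the dataset, where the target column is "class".
import seaborn as sns
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True).frame
# Compare the age distribution across the two income brackets.
sns.histplot(data=adult, x="age", hue="class", stat="density", common_norm=False)
```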

Similarly, if I want to build a simple machine learning model with scikit-learn, say a logistic regression model, I have to write a lot of boilerplate for scaling the data, for imputing missing values, for one-hot encoding the categorical columns, and so on, and then adjust the regularization parameter of the logistic regression model. To me, building a logistic regression model is really the "hello world" of machine learning, but scikit-learn requires you to be very, very explicit. That has a lot of benefits, but if you're just starting out with your project, it's important that you're able to iterate very quickly.
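To make the boilerplate concrete, here is a hedged sketch of the kind of pipeline being described (not the exact code from the slides; the column selectors and the C grid are illustrative):

```python
# Sketch of the scikit-learn boilerplate: impute, scale, one-hot encode,
# then tune the regularization of a logistic regression.
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", StandardScaler())]),
     make_column_selector(dtype_include="number")),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     make_column_selector(dtype_exclude="number")),
])
model = Pipeline([("prep", preprocess),
                  ("lr", LogisticRegression(max_iter=1000))])
# Adjusting the regularization parameter requires yet another wrapper.
search = GridSearchCV(model, {"lr__C": [0.01, 0.1, 1, 10]}, cv=5)
```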

One way to do this is with some of the automatic machine learning tools that are out there these days, say auto-sklearn, or more recently, AutoGluon. And then there are several commercial solutions, from H2O, DataRobot, and so on. Again, I would say these automatic machine learning solutions really focus a lot on the model building part. That has two downsides, in my opinion, particularly for initial analysis and exploration. One is that they usually take a lot of time. Here I'm taking an example from the auto-sklearn website on a very small toy dataset. You can see that the default setting runs for one hour; it gives you very high accuracy, but it runs for a very long time, even on a small dataset. I think that if you have to wait that long for an initial result, it really interrupts your workflow. I really love what auto-sklearn and other AutoML libraries are doing, but their goal is much more to build a really, really good model given the dataset that you have.

However, in many applications, the dataset is not that fixed, and iteration is much more important than tuning hyperparameters or selecting the model. The other downside is that you end up with a very black-box model. In many applications, that really prevents you from understanding more about your problem. It prevents you from understanding what the important features are, how you should have done preprocessing differently, or what data you should have collected. As an alternative to this, I present dabl, the Data Analysis Baseline Library, which is there to help data scientists iterate quickly and do machine learning and data science with a human in the loop.

3. Workflow and Tools

Short description:

The idea is to provide tools for each step of the workflow, including problem statement, data cleaning, visualization, model building, and model interpretation. dabl.clean can detect types, missing values, rare values, ordinal versus categorical variables, and more. It provides a quick and dirty way to clean data, and there's also the EasyPreprocessor for a more formalized approach. For visualization, the dabl.plot function is geared towards supervised machine learning and identifies features that are informative about the target.

The idea is to go through this workflow and try to give you tools to do each of these steps more easily. The problem statement is something you'll have to do yourself, as well as data collection. But on the tooling side, dabl gives you dabl.clean for data cleaning, dabl.plot for visualization, dabl.AnyClassifier for model building and some automatic machine learning, and dabl.explain for model interpretation. For the evaluation within your application, you still have to think for yourself, because this is really about how your machine learning model fits into your particular setting.

Each of these is not a module but a single function. So you have a single function that will give you a lot of plots, or a single function that will do some initial cleaning. None of these are going to be perfect, but each of them will give you a very good starting point from which you can then iterate more quickly and zoom in on the parts that are really critical for your application. Let's start with the cleaning and preprocessing.

dabl.clean can detect types of features, detect missing and rare values, and detect ordinal versus categorical variables. All of this is done mostly using heuristics, and it also detects near-constant features and index features. For the features that you want to use, it finds out how to use them and how to preprocess them, and for the other features, it finds out whether to throw them away, say an index or a unique identifier that you usually don't want to include in a machine learning model. I want to give a quick shout-out to a paper that I learned about this week, the ML Data Prep Zoo ("Towards Semi-Automatic Data Preparation for ML") by Shah and Kumar. They have a very neat hierarchy of the different types of features and what you need to do in data preparation. What dabl does kind of recreates this, but they have a very nice taxonomy and framework for how to think about data preparation. Here is an example. I give it the Ames housing dataset, which is a regression dataset with, as you can see, 82 columns. If you call dabl.clean on it, it'll give you a clean data frame that dropped columns that are unnecessary or unique identifiers, and identified what is continuous and what is categorical.
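As a hedged sketch of what this looks like in code (the CSV path is illustrative; dabl also exposes a detect_types helper for the type-detection step, if I recall the API correctly):

```python
# Sketch: automatic cleaning and type detection with dabl.
import dabl
import pandas as pd

df = pd.read_csv("ames_housing.csv")   # illustrative path to the Ames data
df_clean = dabl.clean(df)              # drops index-like and constant columns
types = dabl.detect_types(df_clean)    # continuous / categorical / ... per column
```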

The clean function is really a quick and dirty way. If you want to do this in a more formalized way in a scikit-learn pipeline, for cross-validation for example, there's a wrapper called the EasyPreprocessor. The EasyPreprocessor is basically a scikit-learn transformer that puts together a custom column transformer that includes all the preprocessing you need. If there are missing values, it will put imputation in there. If there are categorical variables, it will put a one-hot encoder in there. And it takes care of some other tricky edge cases. So you can either use clean and just get a clean data frame, or, if you want to be careful in your cross-validation, you can use the EasyPreprocessor and have full control over your model building using any scikit-learn model that you want. For visualization, there's the dabl.plot function, which is really geared towards supervised machine learning, so towards classification and regression. You give it a data frame together with a target column. This can be either a column inside the data frame (here the data has a column called income, which will be used as the target column), or you can give it a separate series in a scikit-learn X and y style. It will then go through the features and find out which features are most informative about the target.
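A hedged sketch of both pieces, assuming placeholder variables X, y, and data with an income column:

```python
# Sketch: EasyPreprocessor inside a scikit-learn pipeline, so preprocessing
# is fit inside each cross-validation fold, plus the supervised plot function.
import dabl
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(dabl.EasyPreprocessor(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)

dabl.plot(data, target_col="income")   # or dabl.plot(X, y) in X/y style
```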

4. Analyzing Categorical and Continuous Features

Short description:

Using the adult data set, I analyze the categorical and continuous features using mosaic plots and pair plots. These plots provide insights into the size and balance of categories, as well as their influence on the target variable.

So here I'm using the adult dataset; dabl identifies several of the categorical features and shows the most important ones in a mosaic plot. On the right-hand side, you can see a pair plot of some of the continuous features it found to be important. I really like these mosaic plots because they tell you both the size of each of the categories and their relation to the target. For example, in the lower left you can see that this dataset contains more men than women, and you can also see that the fraction of orange, the higher income bracket, is about twice as large for men as for women. So you can see both the balance between the categories and how the categories influence the target.

5. Feature Selection and Model Building

Short description:

For lower-dimensional datasets, dabl uses pair plots. It selects informative feature interactions using a supervised-learning version of scagnostics. Linear discriminant analysis is an underused tool for dimensionality reduction and visualization. The SimpleClassifier allows quick prototyping with various models. dabl also implements a portfolio-based automatic machine learning approach using successive halving to find the best model from a diverse set of pre-selected models.

For datasets that are lower dimensional, it will just do a pair plot. The adult dataset had only two continuous variables, so we got a full pair plot. For high-dimensional datasets that's usually not feasible, and so dabl selects features that are informative.

In particular, it selects interactions that are informative. It uses something that is basically a supervised-learning version of scagnostics to figure out what the interesting plots are, by using relatively shallow trees on two-dimensional subsets. Here it figured out which feature interactions are interesting to look at in a dataset that has about 100 features. This can be much more informative than just looking at the univariately most important features, and it gives you something that shows you how to best separate the classes in 2D.
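dabl's actual selection code isn't shown in the talk; the following is only a rough illustrative sketch of the idea, ranking feature pairs by how well a shallow tree separates the classes using just those two features:

```python
# Illustrative sketch (not dabl's code): score 2D feature subsets with
# shallow trees and keep the most class-separating pairs for plotting.
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def rank_feature_pairs(X, y, max_depth=3, top_k=4):
    """Return the top_k feature index pairs by shallow-tree CV accuracy."""
    scores = {}
    for i, j in combinations(range(X.shape[1]), 2):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        scores[(i, j)] = cross_val_score(tree, X[:, [i, j]], y, cv=3).mean()
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```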

Then there's the same for principal component analysis and linear discriminant analysis. I really like linear discriminant analysis for dimensionality reduction and visualization for classification. I think it's a very underused tool. People jump to t-SNE and UMAP very quickly, but I usually find those much harder to interpret, and linear discriminant analysis often gives quite nice results.
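A minimal sketch of the discriminant-analysis projection recommended here, using plain scikit-learn (placeholder X and integer-coded y; a 2D projection needs at least three classes):

```python
# Sketch: supervised 2D projection with linear discriminant analysis.
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)  # needs >= 3 classes for 2D
X_2d = lda.fit_transform(X, y)                    # supervised, unlike PCA
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, alpha=0.5)  # y: integer class labels
plt.show()
```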

Then there's the SimpleClassifier, which is really for quick prototypes. It runs a lot of models that are basically instantaneous: dummy models, naive Bayes, small tree models, and fast linear models. It will then record a bunch of metrics and tell you which is the best among them. On the adult dataset here, logistic regression is the best one, and I get several metrics. The idea is to really allow you to iterate very quickly. After running all of these models, which is basically instantaneous, you could go to interpretation, see what the important features are, and check: are they things we expected? Was there data leakage, for example? Does something look wrong, or did the model not pick up on something that I thought was important? And then you can go back to data cleaning and data collection. If you want to build a more complex model, there is also automatic model search.
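In code, this is roughly a single call (a sketch assuming a data frame with an income target column):

```python
# Sketch: fit a set of near-instantaneous baselines and report the best one.
import dabl

sc = dabl.SimpleClassifier()
sc.fit(data, target_col="income")  # dummy, naive Bayes, small trees, linear models
```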

dabl has portfolio-based automatic machine learning. Basically, it has stored a portfolio of diverse good models with good hyperparameters. There's a lot of gradient boosting in there, but also random forests, linear models, and support vector machines, and they are all ranked. These were selected using a large benchmark suite, OpenML-CC18. The methodology is basically taken from "Practical Automated Machine Learning for the AutoML Challenge 2018", a paper by Feurer and Hutter. So this is something that was built for auto-sklearn that we are now also implementing in dabl. Given this portfolio, we use successive halving to find the best candidate, which is much quicker than fully trying all of the models. Successive halving is a fast alternative to grid search where you train your models on increasingly big subsets of the data. We take our whole portfolio, train it on a small fraction of the dataset, then select the best half or the best third of all the models, throw away the other ones, and retrain using more data.
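As an illustration of the mechanics (a toy reimplementation, not dabl's internals), successive halving over a portfolio looks roughly like the sketch below; scikit-learn also ships this idea as the experimental HalvingGridSearchCV / HalvingRandomSearchCV:

```python
# Toy successive halving over a model portfolio (illustrative, not dabl's code):
# train on a small subsample, keep the best third, grow the budget, repeat.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split

def successive_halving(portfolio, X, y, min_samples=1000, eta=3):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    candidates = list(portfolio)
    n = min_samples
    while len(candidates) > 1:
        n = min(n, len(X_train))
        scores = [clone(est).fit(X_train[:n], y_train[:n]).score(X_val, y_val)
                  for est in candidates]
        keep = max(1, len(candidates) // eta)   # keep roughly the best third
        best = np.argsort(scores)[::-1][:keep]
        candidates = [candidates[i] for i in best]
        n *= eta                                # grow the data budget
    return candidates[0]
```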

6. Model Explanation and Future Goals

Short description:

Weed out the worst models and select from the portfolio. Model explanation tools include the classification report, confusion matrix, precision-recall curve, feature importances, permutation importance, and partial dependence plots. Future goals include supporting more feature types, improving the portfolio with time sensitivity, model compression, and building explainable models. Explainable boosting machines are a potential addition to scikit-learn.

So we weed out the worst models, and we can basically select a model from our portfolio in time comparable to just training a single model. We showed that this kind of portfolio approach is actually quite effective. We did this for single classifiers in our paper "Learning Multiple Defaults", and, as I said, doing this across classifiers was done in the Practical Automated Machine Learning paper by Feurer.

Finally, we also have some model explanation. These are relatively basic tools that are mostly inside scikit-learn, but they're packaged in a way that makes them really easy to use. After you fit AnyClassifier, you can just call dabl.explain, optionally with a test set or a validation set, and it will give you metrics: a classification report and a confusion matrix (shown here on a 10-class classification dataset), and the ROC curve and precision-recall curve, which are all the standard metrics you want to look at. Then there's some simple model debugging: if there's a random forest, you get the impurity-based feature importances, which are the scikit-learn feature importances. These are actually usually a little bit misleading, so for any model we also compute the permutation importance, which was added to scikit-learn a bit more recently. The permutation importance usually gives you a much better idea of what the really important features in the dataset are. And then we do partial dependence plots. In scikit-learn they're accelerated for tree-based models, but we can also do them brute force for any kind of model. You get all of this just by calling the explain function.
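A hedged sketch of the entry point and the scikit-learn tools underneath it (argument names are from memory and may differ slightly):

```python
# Sketch: dabl's explain call, plus the underlying scikit-learn inspection tools.
import dabl
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

dabl.explain(model, X_test, y_test)  # report, confusion matrix, curves, importances

# Roughly what happens underneath for model debugging:
result = permutation_importance(model, X_test, y_test, n_repeats=10)
PartialDependenceDisplay.from_estimator(model, X_test, features=[0, 1])
```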

Some of the future goals for dabl are to support more different kinds of features, such as time series and text data; right now it's mostly focused on categorical, ordinal, and continuous features. We also want to improve our portfolio. Right now we have a portfolio of 100 classifiers that are ranked and that we search through with successive halving, but we want to make it time sensitive: we want to prioritize classifiers that are faster to evaluate, so that we spend less time and give feedback more quickly. Another thing we really want to include is model compression and building explainable models. Instead of building models that are black boxes, like gradient boosting models, and then trying to explain them with partial dependence plots, ideally we could build models that are interpretable themselves, either by compressing an existing good model or by using explainable models to start with. My favorite, which I'm hopefully going to include soon and which we might include in scikit-learn, is the explainable boosting machine, which is basically an additive model based on gradient boosting. Because it's an additive model, you can see the whole function by just looking at the univariate plots, which is quite nice (see the sketch at the end of this section). All right, that's it from my side so far. Let me know if you have any questions in the Q&A. Also, if you're new to machine learning, check out the book I wrote with Sarah Guido, Introduction to Machine Learning with Python. You can install dabl with pip install dabl, and the documentation is at dabl.github.io.
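For the explainable boosting machine mentioned above, a hedged sketch using the interpret package (an assumption on my part that this is the implementation meant; X_train and y_train are placeholders):

```python
# Sketch: a glass-box additive model from the interpretml project.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
# Because the model is additive, each feature's learned shape function
# can be inspected directly.
ebm_global = ebm.explain_global()
```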

7. Project Status and Focus

Short description:

The project is still in an early phase, about two years old. Feedback and contributions are welcome. dabl focuses on classification and regression, providing a quick way into these common use cases. It wraps around scikit-learn to make working with it easier. dabl does not currently support active learning, semi-supervised learning, or self-supervised learning. It is solely based on scikit-learn; LightGBM was used in the beginning.

The project is still in a somewhat early phase. It's about two years old now, but it's definitely still early days. So if you have any feedback about the project, please feel free to reach out, and if you want to contribute, that's very welcome. I hope to hear from you. Thank you.

Hello, Andreas. How are you today? Good, I'm good. How are you doing? I am doing splendid. You're joining us from right here in the United States, right? Yeah, I just moved to California from New York. Oh, sweet. Well, welcome to better weather. Alright, we have some pretty amazing questions for you today, so let's get started. Here's a question from the audience: does dabl support active learning, semi-supervised learning, or self-supervised learning? So far, I'm really only focusing on classification and regression, because these are really the majority of use cases, and it's more about trying to give you a quick way into these super common use cases. dabl is built on top of scikit-learn, and none of these are really supported by scikit-learn right now. dabl is more of a wrapper, a nicer interface to make working with scikit-learn easier, so we're not going to do any of these. Well, scikit-learn has some semi-supervised learning, but it's actually not that commonly used, so we rarely get any feature requests for that. Right, so it's completely based on scikit-learn, and no other module or package was considered when constructing dabl for now? It's just to make it easier and more tamable to use scikit-learn, is that correct? Yeah, though I might add some other things. I used LightGBM in the beginning.

8. Default Model Recommendation

Short description:

But now we have HistGradientBoosting in scikit-learn, which is basically a re-implementation of LightGBM, and so I'm using that. Is there a default model in dabl that is recommended regardless of the data?

But now we have HistGradientBoosting in scikit-learn, which is basically a re-implementation of LightGBM, and so I'm using that. What again was the default? I know you have this generic classifier call in dabl, right? And I also know that dabl can do many other things besides just the modeling aspect, but out of curiosity, is there a default model that dabl would recommend regardless of your data? Or do you think, no, we need that as a required parameter: for this kind of problem, maybe a CatBoost model would be good, or for that kind of problem, maybe a random forest or another type of model would be good. Is there anything like that in dabl?

9. Model Selection with Successive Halving

Short description:

dabl uses a diverse set of classifiers and runs successive halving to try out different models on a subset of data. It selects models based on their performance on a small subset of data, rather than using smart selection based on meta-features. This approach is based on research by Matthias Feurer, Frank Hutter, and others, who found it challenging to make good decisions based on meta-features. Instead, dabl focuses on the dataset itself and determines which models run well on it.

So, saying "I'm going to use this one model": dabl doesn't do that. What dabl does instead is say: we looked at these hundred datasets (I think it was actually 80) and created this portfolio, this diverse set of classifiers, and now we run successive halving on them. So it's going to try out a bunch of classifiers on a subset of data, including random forests, gradient boosting, support vector machines, and logistic regression, and see which of them run well on a small subset of data. You don't do any smart selection of the portfolio; the portfolio was fixed using meta-learning beforehand, but then the decisions are made using successive halving, so basically a very fast search. This is modeled on the paper I mentioned by Matthias Feurer, Frank Hutter, and others. Many people have found it quite hard to make good decisions based on meta-features, that is, based on what the dataset looks like. It's much easier to make good decisions based on what kinds of models run well, so you just try out a bunch of models very quickly and decide based on that.

10. Data-Driven Approach and Future Plans

Short description:

Based on the dataset, it's easier to make decisions on models. dabl can provide big-picture stats for exploration. However, the starting point is not data-driven. dabl currently handles mostly tabular data and plans to add support for text data and time series in the future.

Very interesting.

So as a follow-up to that: for now we're using meta-learning to determine what kind of modeling can be done, based on what historically works on several datasets, like the 80 datasets you mentioned. Is this the same kind of logic that you would also use to determine what kinds of charts or visuals a user would be interested in? I mean, that would be amazing.

The problem is that right now it's, in a sense, a lot of heuristics, because what you would really want is to use experiments, guided user experiments, and this is really hard to do. The question is what you want to learn about the dataset. Ideally you'd have a specific question you want to answer, and you'd want to see which visualizations helped the user determine that there is an important outlier, or that they forgot about this or that. You want to find visualizations that allow the user to answer questions quickly.

If you want to do an actual data-driven approach to this, you have to have a benchmark of questions, show them to data scientists, and measure how well they answer them with the given visualizations, and that's really hard. Coming up with the questions is hard, and getting data scientists to be part of your experiment is also really hard. I was wondering about that, yeah. But dabl can still help you get some big-picture stats, and using those we can determine the rabbit hole we want to go down and which part we want to explore deeper, and as the data scientist I would take it on my shoulders to do some EDA within that vertical. So it kind of gives us a good starting point, would you say?

Yeah, but the starting point is not data-driven in the sense of having a hundred data scientists look at it. It's mostly me looking at it, and other users of dabl. I mean, this is the way a lot of visualizations are actually created: you have some expert look at it and say, this is a useful visualization, and so we're going to use it. Got it. I have another question coming in from our audience: does dabl handle signal or sensor data processing, or time series as input? It's mostly made for tabular data. I feel like sensor data is usually in sort of a tabular format, but there's no time series support for now. What I'm hopefully going to add next is text data support, and after that time series. I'm actually working on some time series visualizations right now in a project at Microsoft, and hopefully this is going to feed back into dabl soon, but I don't have models for time series. That would be really great to have, so if there's anything you want, send me a pull request. This is an open source project.

QnA

Contributions and Comparison with statsmodels

Short description:

So, it'll grow with all of your contributions. dabl is more prediction-focused, while statsmodels is more inference-focused. dabl does not support p-values, and there are no p-values in scikit-learn either. Dealing with time series data can be quite different, depending on the task and the type of time series. Facebook Prophet is a one-stop shop for time series, but it is tuned for data with annual periodicity.

So, it'll grow with all of your contributions. That's right. I'm sure there are so many people even in the audience right now who have wonderful ideas. You can just go to dabl's main repository and submit a PR. If you think a feature would be useful to you in your own workspace, maybe it's useful to the entire community, right?

Alright, so moving on to another question. How would you compare dabl with statsmodels? Of course it's obvious that dabl seems to be doing a lot more in general, but would you say that it's kind of aiming to be more of a superset of statsmodels? Or would you say that in the future we might even be able to replace statsmodels for preliminary EDA when determining features? I think they're mostly orthogonal, because statsmodels really allows you to make statistical inferences. If you compare statsmodels and scikit-learn, statsmodels is really inference-driven whereas scikit-learn is really prediction-driven. scikit-learn is all about generalizing well to the test set, while statsmodels is about testing a hypothesis, and these are really different kinds of problem sets. dabl is built on scikit-learn and is very prediction-focused: the goal is to make good predictions on the test set, assuming the data is IID. statsmodels is more inference-focused. So if you use statsmodels for EDA and not for inference, then dabl can maybe replace some of those parts, but there are no p-values in dabl, and I don't intend to add any. There are also not really any p-values in scikit-learn. Yeah, I've noticed that as well.
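To make the contrast concrete, a small sketch (placeholder X, y and train/test splits; statsmodels surfaces coefficients with p-values, scikit-learn surfaces held-out performance):

```python
# Sketch: inference-focused vs. prediction-focused linear modeling.
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# statsmodels: fit and read the summary table (coefficients, p-values, CIs).
# Assumes y is a binary 0/1 target.
sm_fit = sm.Logit(y, sm.add_constant(X)).fit()
print(sm_fit.summary())

# scikit-learn: fit on training data, measure generalization on held-out data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```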

Coming back to the point that dabl currently doesn't support the sequential piece, sequential information like time series data. On the modeling front I can see that it is very different; dealing with time series data can be quite different, especially using traditional time series techniques like ARIMA. But what about from a data visualization perspective? I wouldn't think it would be too much of a difference to visualize data that is not in that sequential format versus data that is, or am I missing something here? So, there are many different kinds of time series data. The question is: are you looking at a single time series, or at time series of a fixed length that are each labeled, where you want to classify each time series, or do you want to classify each point in a time series? Do you have multiple time series of different lengths? Do you have aligned time series, where you have the same time steps but different features, like several different stock prices over time? Are they equally spaced, or do you have what's called a point process in statistics, where you have unique timestamps with events? Even the task is very different for these different kinds of time series. The most common is probably a time series with roughly equally spaced timestamps and some features, where you want to do regression for each time point. Doing something for that would be pretty straightforward, I guess, depending a little bit on what you're looking at. There's already some interesting stuff in this area. For example, Facebook Prophet tries to be sort of a one-stop shop for time series. It's pretty explicitly tuned for things that have an annual periodicity, so it has things like holidays, which are very important for sales data and so on. Yeah, but the only problem with Prophet, for example, is that it's again a traditional time series approach, where I'm not sure if you can use other variables that also vary over time as features.

Limitations of using time series varying variables

Short description:

If you want to forecast orders in the future, you can only use the series of past orders; you can't use other time-varying variables as input to Prophet. The case I would try to address first is a multivariate time series where you want to do a regression at every step.

So if you want to do some order forecasting in the future, you can only use the series of past orders, but you can't really use other time-varying variables as an input to Prophet, right? I'm not familiar enough to say; if that is the case, that's certainly limiting. I guess the case I would probably try to address first is: you have a multivariate time series and you want to do a regression at every step. So that's a standard forecasting thing. Definitely, yeah, I can definitely see it being useful there. I think right now, though, let me see if I can scroll to see some more questions here.
