Thoughts on AutoML

Context: I have spent the last few years manually tuning scikit-learn models.

I was looking for libraries that could handle that part as efficiently and seamlessly as possible.

I took a look at some Python AutoML libraries, and here are my first impressions of their hyperparameter optimization capabilities:


Auto-sklearn: probably the most advanced library out there. It uses Bayesian optimization on top of meta-learning to test a wide range of models and hyperparameters. Maybe a little overkill for simple problems. Also, there is no stopping criterion other than elapsed time, meaning it will run for the full hour even if the optimal model is found in less than a minute, wasting precious resources. It is not compatible with scikit-learn > 1.0.0, which can be a problem.
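To make the missing stopping criterion concrete, here is a minimal sketch of a search loop that stops either on a time budget or after a number of trials without improvement. This is not auto-sklearn's API: `search_with_patience`, `toy_score` and `sample` are hypothetical names, plain random search stands in for Bayesian optimization, and the toy objective stands in for a cross-validated score.

```python
import random
import time


def search_with_patience(evaluate, sample_params, time_budget_s=3600.0,
                         patience=20, min_delta=1e-4, seed=0):
    """Random search that stops on timeout OR after `patience` stale trials."""
    rng = random.Random(seed)
    deadline = time.monotonic() + time_budget_s
    best_score, best_params = float("-inf"), None
    stale = 0   # consecutive trials without a meaningful improvement
    trials = 0
    while time.monotonic() < deadline and stale < patience:
        params = sample_params(rng)
        score = evaluate(params)
        trials += 1
        if score > best_score + min_delta:
            best_score, best_params, stale = score, params, 0
        else:
            stale += 1
    return best_params, best_score, trials


def toy_score(params):
    # Hypothetical objective with its maximum (1.0) at C = 1.0.
    return 1.0 - (params["C"] - 1.0) ** 2


def sample(rng):
    # Draw one candidate configuration.
    return {"C": rng.uniform(0.0, 2.0)}


best_params, best_score, n_trials = search_with_patience(
    toy_score, sample, time_budget_s=5.0, patience=15
)
print(best_params, round(best_score, 4), n_trials)
```

With a criterion like this, the time budget becomes an upper bound rather than the actual run time. The `patience` and `min_delta` values are arbitrary here and would need adjusting to how noisy the score is.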


TPOT: it uses genetic algorithms to find the best hyperparameters. From the user's point of view, there is no difference from auto-sklearn, except that it has an early stopping criterion at the generation level. However, each generation can take hours, even on problems that can be solved in seconds.


PyCaret: blazing fast compared to the previous libraries; however, the default search is a random search over a fixed set of parameters. When I tested it, it did not even find better hyperparameters than the untuned model (you can find better ones quite easily with manual tuning). It is compatible with other tuning libraries, which is nice, but the results were the same.

It also does not work on Python 3.10, even with the latest version. More precisely, one dependency is not available for Python 3.10, meaning you will install and run the library for a while before figuring out that it will not work in your current environment.
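The "tuned versus untuned" comparison can be reproduced with plain scikit-learn. The sketch below is not PyCaret's API. It assumes an `SVC` on the iris dataset as a stand-in problem, samples from continuous distributions instead of a fixed set of values, and only keeps the tuned model if it beats the default one on the same cross-validation folds.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Baseline: the estimator with its default hyperparameters.
baseline_score = cross_val_score(SVC(), X, y, cv=5).mean()

# Random search over continuous log-uniform ranges (ranges are arbitrary choices).
search = RandomizedSearchCV(
    SVC(),
    param_distributions={
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-4, 1e0),
    },
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)
tuned_score = search.best_score_

# Keep the tuned model only if it actually beats the untuned one.
use_tuned = tuned_score > baseline_score
print(f"baseline={baseline_score:.3f} tuned={tuned_score:.3f} use_tuned={use_tuned}")
```

Note that `best_score_` is selected on the same folds it is measured on, so it is slightly optimistic. A held-out test set would give a fairer comparison.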


AutoVIML: a very verbose one, with lots of information about the learning process. It gave me a model reported at 100% accuracy that always predicts the same value on a balanced 5-class dataset. Too bad. (Yes, I double-checked and triple-checked the training set.)
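The arithmetic behind that suspicion is simple: on a balanced 5-class dataset, a model that always predicts the same class gets exactly 1/5 of the samples right. The sketch below uses hypothetical labels (20 samples per class) and two small helpers of my own naming to flag this kind of degenerate model independently of whatever score a library reports.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def is_constant(predictions):
    """True when the model outputs a single distinct value."""
    return len(set(predictions)) == 1


# Hypothetical balanced 5-class labels: 20 samples per class.
y_true = [c for c in range(5) for _ in range(20)]
y_pred = [0] * len(y_true)  # a model that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline = majority_baseline_accuracy(y_true)

print(f"accuracy={accuracy:.2f} baseline={baseline:.2f} constant={is_constant(y_pred)}")
```

If `is_constant` is true and the reported accuracy is far above the majority baseline, the reported score cannot have been computed on that same balanced set.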


Hpsklearn: not compatible with the latest scikit-learn version, and it does not specify which version it actually needs. I must admit I did not bother to find the right one in this dependency hell.
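Version mismatches like the ones mentioned above can at least be detected before a long install-and-run cycle. Here is a minimal sketch using only the standard library. The helper names are hypothetical, and the default limits (Python 3.9, scikit-learn 1.0) are illustrative values taken from my observations above, not constraints published by any of these libraries.

```python
import re
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Turn '1.0.2' into (1, 0); missing or non-numeric parts count as 0."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def check_environment(max_python=(3, 9), max_sklearn=(1, 0)):
    """Collect human-readable problems; the limits are illustrative assumptions."""
    problems = []
    if sys.version_info[:2] > max_python:
        problems.append(f"Python {sys.version_info[0]}.{sys.version_info[1]} is newer than {max_python}")
    sklearn_version = installed_version("scikit-learn")
    if sklearn_version is None:
        problems.append("scikit-learn is not installed")
    elif major_minor(sklearn_version) > max_sklearn:
        problems.append(f"scikit-learn {sklearn_version} is newer than {max_sklearn}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

The comparison is deliberately coarse (major and minor version only). It is a quick pre-flight check, not a replacement for properly pinned requirements.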



In the end, we have libraries that work but waste far too much time and resources, and others that are fast but cannot tune the model. And in almost all cases, there is a dependency hell to deal with, as maintenance cannot keep up with new Python/sklearn/{insert_dependency} versions.

Now, I know this is open source and these libraries are free to use; this is an observation, not a criticism. And this observation is limited to a first try on a very simple problem.


Maybe these libraries are made to deal with bigger problems, and that is why I missed the point. However, not all problems are big. I think there are lots of cases where the problem is a little too hard for classical programming but very simple for machine learning, and still worth tuning.


I will test other use cases to find out, but in the meantime, I would like to hear other people's thoughts on this problem, or pointers to what I missed.
