Pentaho's machine learning capabilities have been extended with new model management functionality that reduces the time data scientists must spend manually rewriting algorithms.
Hitachi Vantara has announced a new collection of capabilities around Pentaho that it terms machine learning model management. These machine learning orchestration capabilities are designed to help data scientists monitor, test, retrain and redeploy supervised models in production, greatly improving their efficiency and shortening customer time-to-value.
“Machine learning has always been baked into Pentaho,” said Geoff Marsh, Vice President and Analytics Leader, Americas for Hitachi Vantara. “What is new here is the model management capability. One of the biggest pieces in this process has always been writing the algorithms, and going back and redoing them. What the model management does is help make that smoother.”
Marsh said that data scientist efficiency has been limited by constant manual work, because the algorithms they create aren't updated as regularly as they should be.
“Data scientists’ job is to create and manage these algorithms once they get that data,” he said. “Those algorithms aren’t updated as often as they should be. Data scientists are also incredibly expensive. For them to sit there and change things manually costs a lot. Model management helps you train that data. You can now see it through the pipeline, so instead of having to rewrite it, you can make the changes on the fly. It makes the data scientists more efficient because now instead of having to go back from scratch, and having to code each time, they can see the more successful algorithm right away and get to it quicker.”
Marsh said that this new capability will be critical to data governance, a topic that has lately been overshadowed by hotter commodities but remains important.
“This is a back-end capability designed to get data out to end users as accurately as possible,” he said. “Data governance has not been as sexy as Big Data or IoT, but it is now coming back into the forefront, because being able to govern data all the way along, to adjust to new data sources quickly, and improve time to decision — those things are incredibly important. That’s how our customers will use this.”
The new model management capabilities will get models into production faster by evaluating them and improving their accuracy using real production data before going live. Data scientists will also be able to work on new models instead of spending most of their time writing and maintaining code. In addition, algorithm-specific data preparation and cleaning tasks – also referred to as “last mile data prep” – are now automated.
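The idea of evaluating candidate models against real production data before going live can be sketched in a few lines. This is a generic illustration, not Pentaho's implementation; the `evaluate` and `pick_best` helpers, the toy threshold models, and the sample data are all hypothetical.

```python
# Hypothetical sketch (not Pentaho's API): score candidate models on a
# sample of real production data before promoting one to live use.

def evaluate(model, records):
    """Accuracy of `model` (a predict function) on (features, label) pairs."""
    hits = sum(1 for features, label in records if model(features) == label)
    return hits / len(records)

def pick_best(models, records):
    """Return the (name, accuracy) of the highest-scoring candidate."""
    scores = {name: evaluate(fn, records) for name, fn in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy candidates: threshold classifiers over a single numeric feature.
models = {
    "low_cutoff":  lambda x: 1 if x > 2 else 0,
    "high_cutoff": lambda x: 1 if x > 5 else 0,
}
# (feature, true label) pairs drawn from recent production traffic.
production_sample = [(1, 0), (3, 1), (4, 1), (6, 1), (7, 1)]
name, score = pick_best(models, production_sample)
```

Running both candidates against the same production sample makes the comparison concrete: whichever scores best on live data is the one promoted.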
The new orchestration capabilities also maximize model accuracy while in production. Typically, once a model is in production, its accuracy degrades as new production data runs through it. A new range of evaluation statistics helps identify degraded models, and rich visualizations and reports make it easier to analyze model performance and uncover errors, so models can be adjusted sooner.
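The degradation check described above amounts to comparing a model's accuracy on labeled production data against its baseline from training time. The following is a minimal generic sketch of that idea, not Pentaho's evaluation statistics; the function names and the 0.05 tolerance are assumptions for illustration.

```python
# Generic sketch (not Pentaho's implementation): flag a deployed model
# whose accuracy on labeled production data drifts below its baseline.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def check_degradation(baseline_acc, y_true, y_pred, tolerance=0.05):
    """Return (current_accuracy, degraded) where degraded means the
    production accuracy fell more than `tolerance` below baseline."""
    current = accuracy(y_true, y_pred)
    return current, current < baseline_acc - tolerance

# Example: baseline accuracy of 0.90 from validation at training time,
# checked against the model's predictions on newly labeled data.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
acc, degraded = check_degradation(0.90, y_true, y_pred)
```

When `degraded` comes back true, that is the signal to retrain and redeploy, which is exactly the loop the orchestration capabilities are meant to shorten.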
Finally, these capabilities enhance collaboration among the groups that deploy and maintain models, including operations teams, data scientists, data engineers, developers and application architects. These groups have traditionally suffered from poor transparency. The new capabilities promote collaboration by providing data lineage of model steps and visibility into the data sources and features that feed the model.
Hitachi Vantara Labs is making machine learning model management available now as free plug-ins through the Pentaho Marketplace. These plug-ins are currently unsupported.
“We determine if one of these needs full support based on the number of people that are using it,” Marsh said. “If we see a lot of users liking it and thinking it’s important, we will bake it into the actual product. I think in this case, the chance of it being baked into the next major release of Pentaho is pretty high.”