Data-centric and model-centric machine learning – By Aditya Abeysinghe
Machine learning offers different approaches to building AI (Artificial Intelligence) models. Two common ones are the model-centric method and the data-centric method. The model-centric method focuses on improving the model, while the data-centric method focuses on improving the data used to build the model. Both methods have benefits and drawbacks, and both can be applied to any model.
In the model-centric approach, the data used for the model is not changed. Instead, the model is changed to increase accuracy and performance. Typical ways to improve the model include increasing the number of training cycles until the model starts to overfit and changing the values of training inputs, such as hyperparameters, in each training cycle. Most machine learning models are built using this method because changing the model is often easier than changing the data.
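As a minimal sketch of the model-centric idea, the Python below keeps a small dataset fixed and varies only the training settings (number of training cycles and learning rate), then keeps the setting with the lowest validation error. The toy data, the linear model, and the function names `train`, `mse`, and `search` are assumptions made for illustration and do not come from the article.

```python
# Model-centric sketch: the data stays fixed, only the training settings change.

def train(xs, ys, epochs, learning_rate):
    """Fit y = w * x + b with plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data, roughly following y = 2x + 1. It is never modified.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.1]
val_x = [0.5, 1.5, 2.5, 3.5]
val_y = [2.0, 4.1, 5.9, 8.0]


def search(settings):
    """Train once per (epochs, learning_rate) pair and return the best result."""
    results = []
    for epochs, learning_rate in settings:
        w, b = train(train_x, train_y, epochs, learning_rate)
        results.append((mse(val_x, val_y, w, b), epochs, learning_rate))
    return min(results)  # lowest validation error first


best_error, best_epochs, best_lr = search(
    [(10, 0.01), (100, 0.01), (1000, 0.01), (1000, 0.05)]
)
print(f"best setting: epochs={best_epochs}, learning_rate={best_lr}")
```

Scoring each setting on held-out validation points, rather than on the training points, is one common way to notice when extra training cycles stop helping.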
In the data-centric approach, the model is not changed; the data is changed to increase accuracy and performance. Most researchers build models on freely available external datasets. The data in these datasets can be improved in different ways, such as removing columns that are not required or deleting empty records. This method is currently rarely used to train models.
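The two cleaning steps named above, removing columns that are not required and deleting empty records, can be sketched in plain Python. The records, the column names, and the helpers `drop_columns`, `is_empty`, and `drop_incomplete_rows` are hypothetical and only illustrate the idea.

```python
# Data-centric sketch: the model is untouched, only the dataset is cleaned.

def drop_columns(rows, unwanted):
    """Return copies of the rows without the columns listed in `unwanted`."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def is_empty(value):
    """Treat None and blank strings as empty data."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def drop_incomplete_rows(rows):
    """Keep only rows in which every remaining column has a value."""
    return [row for row in rows if not any(is_empty(v) for v in row.values())]


# Hypothetical raw records; "notes" is assumed not to be needed for training.
raw_rows = [
    {"id": 1, "age": 34, "income": 52000, "notes": "call back"},
    {"id": 2, "age": None, "income": 48000, "notes": ""},
    {"id": 3, "age": 29, "income": 61000, "notes": ""},
    {"id": 4, "age": 41, "income": "", "notes": "vip"},
]

cleaned = drop_incomplete_rows(drop_columns(raw_rows, {"notes"}))
print(cleaned)
```

In this sketch the unneeded column is dropped first, so that a blank value in a column nobody uses does not cause an otherwise complete record to be discarded.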
Focus of each method
When using the data-centric method, changing the data often means prioritizing the quality of data over its quantity. Large datasets may be useful during training because they expose a model to many different values. However, improving the quality of existing data is often found to increase accuracy. Improving data quality also means reducing the noise within the data. Therefore, data consistency is important when the data-centric method is considered. In this method, data is evaluated at each phase of the model building flow.
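One narrow way to look for noise and inconsistency is to check whether identical inputs carry different labels. The sketch below assumes a labeled dataset of (features, label) pairs; the example records and the function name `find_conflicting_labels` are hypothetical, and real projects may define consistency quite differently.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Map each feature tuple to its labels, keeping only those with more than one label."""
    labels_by_features = defaultdict(set)
    for features, label in examples:
        labels_by_features[features].add(label)
    return {
        features: labels
        for features, labels in labels_by_features.items()
        if len(labels) > 1
    }


# Hypothetical labeled examples: (features, label).
examples = [
    (("red", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("red", "round"), "tomato"),   # same features as the first row, different label
    (("yellow", "long"), "banana"),
]

conflicts = find_conflicting_labels(examples)
for features, labels in conflicts.items():
    print(features, "has conflicting labels:", sorted(labels))
```

Records flagged this way would still need a human decision: the label may be wrong, or the features may simply be too coarse to tell the two cases apart.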
In the model-centric method, the model is changed while the data is left unchanged. Therefore, data consistency is not a priority in this approach. Multiple methods of building models, such as supervised or unsupervised learning, may be used. This method often requires a technical background in how models are trained and how different training or testing algorithms can be used. In this method, data is evaluated only during the training phase of the model building flow.
Why use the data-centric approach?
The model-centric approach was the most used machine learning method until recently. However, recent research has turned to the data-centric approach to overcome many issues with the model-centric approach. Every model is trained on a dataset or some other source of data. Before any algorithm is selected, this data is transformed from its raw state into a form that can be used directly for training. The model-centric approach does not evaluate any stage that comes before training a model. Therefore, uncleaned, empty, or duplicate data is not treated as important in this approach. Such data is then used directly for training, and any issue introduced before this stage often causes incorrect sources to be used to train a model. As a result, the model-centric approach often trains on data that has issues, which may produce less accurate models.
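To make the pre-training stage visible, a short audit can count how many records are empty or duplicated before any algorithm sees them. This is a sketch under assumptions: the records, the report fields, and the function name `audit_rows` are hypothetical, and values are assumed to be simple hashable types such as numbers and strings.

```python
def audit_rows(rows):
    """Count total rows, rows with empty values, and exact duplicate rows."""
    seen = set()
    report = {"total": 0, "with_empty_values": 0, "duplicates": 0}
    for row in rows:
        report["total"] += 1
        if any(value is None or value == "" for value in row.values()):
            report["with_empty_values"] += 1
        # A sorted tuple of (column, value) pairs identifies identical rows.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report


# Hypothetical sales records as they might arrive before cleaning.
rows = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": None},
    {"customer": "a", "amount": 10},   # exact duplicate of the first row
    {"customer": "c", "amount": 25},
]

report = audit_rows(rows)
print(report)
```

A report like this does not fix anything by itself; it only shows how much of the data would reach training with known problems if no cleaning step were applied.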
Focusing on data also allows the data to be used for other purposes, such as decision-making and analytics. Because data prepared with this approach has fewer errors and high reliability, decisions made using a model or dataset based on this approach are often highly accurate. For example, a business can become more efficient by running analytics on customer-related sales and purchasing trends.
Image Courtesy: https://www.iiot-world.com/