Introduction
AI (Artificial Intelligence) is a field of Computer Science that aims to enable computers to automate intelligent behaviours the way humans do. Over the past decade, the application of AI has been thriving: businesses large and small are developing AI for every area of human life, such as insurance, healthcare, information security, and more.
According to Savvycom’s forecast, “AI will be worth $190 billion in 2025”, and according to PwC, the overall contribution of AI solutions to the global economy will reach $15.7 trillion by 2030. The AI development opportunities for technology companies are therefore very large, but the challenges are not small either: limited data, limited technology, expenditure, and so on. After many years of building AI products, we have realised that the same problems recur across AI projects, so in this blog we will share some of the most common challenges and our experience in overcoming them.
I. Lack of quality data sources
When we start a project, we first have to determine what the AI model will do, how fast it must run, and how it will be deployed into the system. Once that is settled, the first problem that needs solving is the dataset for the model. In the majority of AI applications, data is not readily available, so AI teams must start collecting it. Even when data is available, the same problems appear again and again: duplicated records, incomplete fields, and human error. The following solutions can be taken into account when we start to prepare data for AI models.
First, we must define a standard for the training data: tagging, metadata, or reuse of available data, provided that it is as close as possible to the actual problem the business wants to solve. With such a standard in place, the business can assess whether the data being collected is consistent with the problem at hand.
In addition, simulation and data analytics can help the business visualise the data it already owns, check it against the problems mentioned above, and quickly find a way to fix them; a minimal sketch of such a check is shown below.
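As an illustration (ours, not from any specific tool), here is a minimal Python sketch using pandas that flags two of the data-quality problems mentioned above: duplicated rows and incomplete fields. The column names are hypothetical.

```python
import pandas as pd

def audit_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Report duplicated rows and missing values per column."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),                 # incomplete fields
        "missing_pct": df.isna().mean().round(3),   # share of rows affected
    })
    n_dupes = df.duplicated().sum()                 # exact duplicate rows
    print(f"{n_dupes} duplicated rows out of {len(df)}")
    return report

# Hypothetical example: a tiny insurance dataset
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, None, None, 51],
    "claim_amount": [1200.0, 800.0, 800.0, None],
})
print(audit_dataset(df))
```

A report like this, run right after collection, lets the team fix duplicates and gaps before any labeling effort is spent on them.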
II. The dataset has too much variation and requires too much cleaning effort
With data that comes from open sources or was collected previously, businesses always face a backlog of variation issues. Data with no common standard is a frequent problem with open-source data sources; in some cases, the available data even has a completely different standard and quality than the AI problem at hand requires.
Data labeling and standardization often take upwards of 80% of the overall time in our projects. Some datasets, such as those for segmentation problems, require meticulous, highly accurate labeling, so they take even more time and need a supervisor and quality management of the labels after they are drawn. In practice, if one polygon corresponds to one object, a labeler needs about one minute to draw it accurately, so a large dataset demands a lot of labeling and monitoring.

At the moment, there are not many tools that support monitoring, checking, and evaluating data. This creates serious difficulties for supervisors, who have to wait until the data is completely labeled before they can evaluate the quality of the dataset. It slows the project down, and by the time the dataset is fully evaluated it may be too late and the deadline already missed. In addition, a supervisor is often very expensive, because the role demands knowledge of both Data Science and data pipelines.
To resolve this problem, we need a tool that supports data labeling together with data management, evaluation, and statistics. BlueEye is a great fit here, with a feature that separates the Labeler and Reviewer roles, making it easier for managers to manage the project’s dataset. The tool also provides statistical charts of the labels that have been drawn and of the data that has been completed and verified, which helps us detect many problems in the dataset, such as data bias, missing labels, or slow completion, so that the manager can catch them quickly and come up with a timely solution.
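To make the idea concrete, here is a small, hypothetical Python sketch (ours, not BlueEye’s implementation) of the kind of statistics such a tool computes: the label distribution, which exposes class bias, and the verification rate, which exposes slow progress.

```python
from collections import Counter

# Hypothetical label records: (image_id, label, status)
records = [
    ("img_001", "car", "verified"),
    ("img_002", "car", "verified"),
    ("img_003", "pedestrian", "labeled"),
    ("img_004", "car", "pending"),
    ("img_005", "bicycle", "verified"),
]

label_counts = Counter(label for _, label, _ in records)
done = sum(1 for _, _, status in records if status == "verified")

print("label distribution:", dict(label_counts))      # reveals class bias
print(f"completion: {done}/{len(records)} verified")  # reveals slow progress
```

Even these two numbers, tracked daily, let a manager react while labeling is still in progress instead of after the dataset is finished.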
III. No existing research for the problem
AI can be applied in many fields and industries, but not all of them. Every year, thousands of papers research and improve AI across every field, yet some areas still have no AI solution optimal enough to completely replace humans: the accuracy is insufficient, the operating expense and the cost of the research team are too large, and it takes a long time before the results can be applied in practice. A few studies even suggest that the AI technologies mature enough for practical application account for only about 10% of all research each year.
So how can we work around this problem? AI does not have to completely replace humans; it can support them in a specific part of a field. Try to make AI support the simple things first and optimise the workflow, so that we reduce human effort and process time as much as possible.
IV. Labelling costs and time keep skyrocketing
One of the biggest problems in preparing data for AI models is the inability to anticipate the cost of data collectors, labelers, and supervisors. Sometimes so much data has to be collected that the cost in money and time balloons into the majority of the project’s investment. Suppose the labor price for a car segmentation problem, with an average of 10 polygons per image, is $0.7 to label a photo and $0.3 to review it, so the total cost per image is $1 and the full process takes at least 5 minutes per image. If the AI model requires 1 million images, the data preparation expense is $1,000,000 and roughly 112 months of nonstop, single-worker effort (5 minutes × 1 million images) to complete the entire dataset. That is not a small cost, and the time it takes is far too long.
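A back-of-the-envelope estimator makes these numbers easy to check and to re-run for other projects. Here is a minimal Python sketch using the rates assumed in the example above:

```python
def labeling_budget(n_images: int, label_cost: float, review_cost: float,
                    minutes_per_image: float) -> tuple[float, float]:
    """Return (total cost in $, effort in months of round-the-clock work)."""
    cost = n_images * (label_cost + review_cost)
    minutes = n_images * minutes_per_image
    months = minutes / (60 * 24 * 31)  # 31-day months, 24/7, one worker
    return cost, months

cost, months = labeling_budget(1_000_000, 0.7, 0.3, 5)
print(f"${cost:,.0f}, ~{months:.0f} months of single-worker effort")
# -> $1,000,000, ~112 months
```

Running the same function with a different price per polygon, team size, or dataset size gives the budget before any labeling contract is signed.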
To reduce this cost, we can hire workers in countries with cheap labor, such as Vietnam; use data labeling platforms that support monitoring and control of labels and labeling schedules; and finally, use AI techniques that reduce the number of labels a model needs, such as self-supervised learning and contrastive learning, so that fewer images need manual annotation, cutting both the cost and the time of labeling. A small sketch of the contrastive idea follows.
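As a flavour of that last point, below is a minimal PyTorch sketch (our own illustration, not a production recipe) of a SimCLR-style contrastive loss, the core of the technique: each image and its augmented view form a positive pair, and every other image in the batch acts as a negative, so the model learns useful features from unlabeled images.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Pull two views of the same image together, push other images apart."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2n unit vectors
    sim = z @ z.t() / tau                                # cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # mask self-similarity
    # the positive pair of sample i sits at index i + n (and vice versa)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: random "embeddings" of two augmented views of 8 images
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent_loss(z1, z2))
```

Pre-training an encoder this way on unlabeled images means only a small labeled subset is needed afterwards for the actual task.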
V. Lack of an efficient development and deployment pipeline
During their studies and early careers in Computer Science, AI engineers often receive little or no training in good standards and practices for effective development and deployment. They know how to do research, use the Python programming language, and build models with libraries such as TensorFlow or PyTorch, but most of them do not know how to deploy their system to servers such as Amazon’s to reach users. As a result, deploying the finished system takes a long time and is rarely optimal; they are likely to waste a lot of resources, at no small cost.
Most AI engineers use Jupyter or Google Colab to make building and testing models convenient. Because they rely on such tools, they cannot deploy the system either on a local machine or on a server. In addition, data, code, and model weights (hyperparameters and learned parameters) are scattered everywhere: there is no version control, no tracking, and nobody knows which problems are outstanding, what is missing, which issues need solving, or what their teammates are doing.
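One lightweight habit that already helps, even before adopting a full MLOps stack, is to snapshot every training run together with its configuration. A hypothetical Python sketch (the file names and fields are our own):

```python
import hashlib
import json
import time
from pathlib import Path

def snapshot_run(config: dict, weights_path: str, runs_dir: str = "runs") -> Path:
    """Save a run's config next to a pointer to its weights under a content
    hash, so every experiment can be found, compared, and reproduced later."""
    tag = hashlib.sha1(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]
    run_dir = Path(runs_dir) / f"{time.strftime('%Y%m%d-%H%M%S')}-{tag}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))
    (run_dir / "weights.path").write_text(weights_path)  # checkpoint pointer
    return run_dir

run = snapshot_run({"lr": 3e-4, "batch_size": 32, "epochs": 10},
                   weights_path="checkpoints/model_final.pt")
print("run recorded in", run)
```

In practice, dedicated tools such as Git, DVC, or MLflow do this job properly; the point is simply that every run should be traceable.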
Backend and frontend developers usually have common standards and formats for their code when they start a project or ship it to production, but AI developers do not. There is no standard way to deploy AI models and expose them as APIs; libraries and structures are arranged according to each person’s own thinking and coding style, so they are very diverse and share no common format. This makes it very hard to grasp what teammates are doing and how their code works, and it takes a lot of time to read and learn the code. On top of that, operating an AI system often requires a powerful server and a lot of hardware resources, so infrastructure sized with the standard ratios used for web applications is often not suitable for heavy AI workloads.
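One widely used pattern for exposing a model as an API is a small HTTP service. Here is a minimal, hypothetical sketch using FastAPI; the model, schema, and endpoint names are placeholders, and the dummy model stands in for real weights:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]  # placeholder input schema

def load_model():
    """Stand-in for loading real weights (e.g. with torch.load)."""
    return lambda x: sum(x) / len(x)  # dummy "model"

model = load_model()

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # A real service would run the trained model on the validated input
    return {"prediction": model(req.features)}

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
# (assuming this file is saved as service.py)
```

Agreeing on one such serving convention across the team removes most of the "everyone deploys differently" problem described above.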
The best solution to these problems is to adopt an MLOps system, build the pipeline from the start, and require the AI team to use it as the common standard for AI models. The team will need some time to adapt, but once they are familiar with the system, team coordination and system scaling become extremely smooth and effective.
Copyright SETA International; please credit us when sharing.
👉 SETA International is exploring new market sectors regionally and internationally. We provide end-to-end technology solutions and services for AI, VR/AR, IoT, web, mobile, and cloud.
—
✉️contact@setacinq.vn
🔗seta-international.com