
HOW TO MEASURE THE PERFORMANCE OF YOUR AI/MACHINE LEARNING PLATFORM? - B-AIM PICK selects


With each passing day, new technologies are emerging across the world. They are not just bringing innovation to industries but also radically transforming entire societies. Be it artificial intelligence, machine learning, the Internet of Things, or the cloud, all of these have found a plethora of applications that are implemented through specialized platforms. Organizations choose a platform that can unlock the full benefits of the respective technology and deliver the desired results.

But choosing a platform isn’t as easy as it seems. It has to be high-caliber, fast, and vendor-independent. In other words, it should be worth your investment. Let’s say that you want to compare the performance of a CPU against others. That’s easy, because you have PassMark for the job. Similarly, when you want to check the performance of a graphics processing unit, you have Unigine’s Superposition. But when it comes to machine learning, how do you figure out how fast a platform is? And as an organization, if you have to invest in a single machine learning platform, how do you decide which one is the best?

Why Do We Need Benchmarking Tools for AI and ML

For a long time, there was no benchmark for judging the worth of machine learning platforms. Put differently, the artificial intelligence and machine learning industry has lacked reliable, transparent, standard, and vendor-neutral benchmarks that help in flagging performance differences between the different components used to handle a workload. These components include hardware, software, algorithms, and cloud configurations, among others.

Even though this has never been a roadblock when designing applications, the choice of platform affects the efficiency of the final product in one way or another. As research progresses, technologies like artificial intelligence and machine learning are becoming extremely resource-intensive. For this reason, practitioners of AI and ML are seeking the fastest, most scalable, most power-efficient, and lowest-cost hardware and software platforms to run their workloads.

This need has emerged because machine learning is moving towards a workload-optimized structure. As a result, there is a greater need than ever for standard benchmarking tools that help machine learning developers assess and analyze which target environments are best suited for a given job. Not just developers but enterprise information technology professionals also need a benchmarking tool for a specific training or inference job. Andrew Ng, CEO of Landing AI, points out that there is no doubt that AI is transforming multiple industries, but for it to reach its full potential, we still need faster hardware and software. Therefore, unless we have a way to measure the efficiency of hardware and software specifically for the needs of ML, we cannot design more advanced ones for our requirements.

David Patterson, co-author of Computer Architecture: A Quantitative Approach, highlights the fact that good benchmarks enable researchers to compare different ideas quickly, which makes it easier to innovate. This makes the case for a standard ML benchmarking tool stronger than ever.

To address the lack of an unbiased benchmarking tool, machine learning expert David Kanter, along with scientists and engineers from reputed organizations such as Google, Intel, and Microsoft, has come up with a new solution. Welcome MLPerf: a machine learning benchmark suite that measures how fast a system can perform ML inference using a trained model.

Measuring the speed of a machine learning workload is already a complex task, and it gets more tangled the longer the workload is observed. This is because problem sets and architectures vary so widely across machine learning services. For this reason, MLPerf measures the accuracy of a platform in addition to its performance. It is intended for the widest possible range of systems, from mobile devices to servers.

Training and Inference

Training is the process in machine learning where a network is fed large datasets and let loose to find any underlying patterns in them. In general, the more data the network sees, the better the resulting system performs. It is called training because the network learns from the data and trains itself to recognize a particular pattern. For example, Gmail’s Smart Reply is trained on 238,000,000 sample emails. Similarly, Google Translate is trained on a trillion samples. This makes training computationally expensive. Systems that are designed for training have large and powerful hardware, since their job is to chew through the data as fast as possible. Once the system is trained, using it to produce outputs for new inputs is called inference.

Performance matters in both phases, but in different ways. On the one hand, the training phase requires as many operations per second as possible, with little concern for latency. On the other hand, latency is a big issue during inference, since a human is often waiting on the other end to receive the result of the inference query.
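The tension between throughput and latency can be made concrete with a toy cost model. The sketch below assumes, purely for illustration, that processing a batch costs a fixed overhead plus a per-sample cost. The function name `batch_stats` and all the numbers are hypothetical and not taken from any real platform.

```python
def batch_stats(batch_size, overhead_s=0.010, per_sample_s=0.002):
    """Return (throughput in samples/s, latency in s) for one batch.

    Assumed toy cost model (not measured on real hardware):
        batch time = overhead_s + per_sample_s * batch_size
    """
    batch_time_s = overhead_s + per_sample_s * batch_size
    throughput = batch_size / batch_time_s
    # Every sample in the batch waits until the whole batch has finished.
    latency_s = batch_time_s
    return throughput, latency_s


for size in (1, 8, 64):
    throughput, latency_s = batch_stats(size)
    print(f"batch={size:3d}  throughput={throughput:7.1f} samples/s  "
          f"latency={latency_s * 1000:6.1f} ms")
```

Under this assumed model, larger batches raise throughput, which is what training wants, but they also raise the time each individual query has to wait, which is what inference wants to avoid.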

Complex Answers

Due to the complex nature of the architectures and metrics involved, MLPerf does not produce a single overall score. Because MLPerf covers a wide range of workloads and a very large variety of architectures, one cannot boil a system down to one number the way CPU or GPU benchmarks do. In MLPerf, scores are broken down into training workloads and inference workloads, and then divided further into tasks, models, datasets, and scenarios. The result obtained from MLPerf is therefore not a single score but a wide spreadsheet. Each task is measured under the following four scenarios:

  • Single Stream: It measures the performance in terms of latency. For example, a phone camera working with a single image at a time.

  • Multi-Stream: It measures the performance in terms of the number of streams that can be handled at once. For example, a driver-assistance algorithm that processes images from multiple cameras.

  • Server: It measures the performance in queries per second.

  • Offline: It measures the performance in terms of raw throughput. For example, photo sorting and automatic album creation.
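As a rough illustration of how two of these scenarios differ, the sketch below times a stand-in model in a single-stream style (one query at a time, reporting a tail latency) and in an offline style (all samples at once, reporting throughput). This is not MLPerf's own load generator. `fake_model`, the sample data, and the choice of the 90th percentile are assumptions made for the sketch.

```python
import math
import time


def fake_model(sample):
    """Hypothetical stand-in for a trained model's predict call."""
    return sum(sample)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def single_stream_latency(model, samples, pct=90):
    """Send one query at a time; report a tail latency in seconds."""
    latencies = []
    for sample in samples:
        start = time.perf_counter()
        model(sample)
        latencies.append(time.perf_counter() - start)
    return percentile(latencies, pct)


def offline_throughput(model, samples):
    """Process all samples back to back; report samples per second."""
    start = time.perf_counter()
    for sample in samples:
        model(sample)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed


# Made-up workload: 200 samples of 1000 integers each.
samples = [list(range(1000)) for _ in range(200)]
p90 = single_stream_latency(fake_model, samples)
print(f"single stream, p90 latency: {p90 * 1e3:.3f} ms")
print(f"offline throughput: {offline_throughput(fake_model, samples):.0f} samples/s")
```

The Multi-Stream and Server scenarios are left out because they depend on how queries arrive over time, which a short sketch cannot model faithfully.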

Conclusion

Finally, MLPerf separates the benchmark into Open and Closed divisions, with stricter requirements for the Closed division. Similarly, the hardware for an ML workload is separated into categories: Available, Preview, and Research, Development, or Other. All these factors give ML experts and practitioners an idea of how close a given system is to real production.
