Semantic segmentation is a fundamental problem in computer vision that involves partitioning an image into regions based on predefined object categories, assigning a class label to each pixel. Unlike traditional image classification, which provides a single label for the entire image, or object detection, which identifies and localizes objects with bounding boxes, semantic segmentation offers a pixel-level understanding of visual scenes. This fine-grained labeling is critical for applications where precise localization and delineation of objects are necessary, such as autonomous driving, medical imaging, and robotics. The core challenge lies in accurately classifying each pixel while preserving object boundaries and accounting for variations in scale, lighting, occlusion, and scene complexity.
Achieving real-time semantic segmentation on edge devices presents a significant challenge due to the inherent trade-off between model accuracy and computational efficiency. Edge devices, such as mobile phones, drones, or embedded systems in autonomous vehicles, often have limited processing power, memory, and energy resources. Deploying complex deep learning models on such constrained hardware requires careful optimization to maintain acceptable frame rates without severely sacrificing segmentation quality. Ensuring robust performance across diverse real-world environments—while meeting the strict latency requirements of real-time applications—remains an open problem in the field.
Semantic segmentation models built on convolutional architectures operate by extracting hierarchical features from input images using layers of convolution, activation, and pooling. As the image passes through deeper layers of the network, the spatial resolution of the feature maps decreases due to strided convolutions and pooling operations. This downsampling enables the model to capture increasingly abstract, semantic features that are crucial for distinguishing between object categories, but it comes at the cost of losing precise spatial information about object boundaries and locations. To address this, modern segmentation architectures often employ a multi-branch design: subnetworks with fewer convolutional layers retain high-resolution spatial information but capture less semantic detail, while another, deeper subnetwork extracts rich semantic features at a lower resolution. These feature maps are then fused, through concatenation, addition, or attention mechanisms, to combine the strengths of both branches. Following this fusion, a decoder network gradually upsamples the combined feature maps, using techniques such as transposed convolutions or bilinear interpolation, to restore them to the original image resolution. The final output is a dense segmentation mask in which each pixel is classified according to its corresponding object category. This encoder-decoder structure effectively balances semantic understanding with spatial precision, making it well suited for detailed segmentation tasks.
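To make the multi-branch idea concrete, here is a minimal Keras sketch of a shallow high-resolution branch fused with a deeper low-resolution branch, followed by a simple decoder. The layer counts, filter sizes, and resolutions are illustrative assumptions, not taken from any particular model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def two_branch_segmenter(input_shape=(256, 256, 3), num_classes=21):
    """Illustrative two-branch encoder-decoder: a shallow high-resolution
    branch for spatial detail and a deeper low-resolution branch for semantics."""
    inputs = tf.keras.Input(shape=input_shape)

    # Shallow "detail" branch: few layers, keeps 1/4 of the input resolution.
    detail = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    detail = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(detail)

    # Deeper "semantic" branch: more layers, downsampled to 1/16 resolution.
    sem = detail
    for filters in (64, 128):
        sem = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(sem)
        sem = layers.Conv2D(filters, 3, padding="same", activation="relu")(sem)

    # Fuse: upsample the semantic branch back to the detail branch's resolution
    # and combine by concatenation (addition or attention are common alternatives).
    sem_up = layers.UpSampling2D(size=4, interpolation="bilinear")(sem)
    fused = layers.Concatenate()([detail, sem_up])

    # Decoder: restore full resolution and predict a per-pixel class map.
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(fused)
    x = layers.UpSampling2D(size=4, interpolation="bilinear")(x)
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```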
MobileNetV3 [1] is a lightweight convolutional neural network architecture designed specifically for efficient deployment on mobile and edge devices, making it a popular backbone for real-time semantic segmentation tasks. Built upon the principles of depthwise separable convolutions, MobileNetV3 achieves a strong balance between accuracy and speed. In the context of semantic segmentation, MobileNetV3 is typically used as the encoder in an encoder-decoder framework, where it extracts compact yet expressive features from the input image. While its streamlined design reduces computational cost, it still preserves essential semantic information through a combination of squeeze-and-excitation modules and novel activation functions like h-swish. To counter the loss of spatial resolution inherent in deep convolutional networks, segmentation models using MobileNetV3 often integrate lightweight decoder modules—such as those from DeepLabV3 or custom upsampling layers—to restore full-resolution segmentation maps. This architecture enables real-time, on-device inference for tasks like road scene understanding, human parsing, and augmented reality.
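As a rough illustration of how MobileNetV3 can serve as a segmentation encoder, the sketch below uses the off-the-shelf MobileNetV3-Small from tf.keras.applications with a deliberately simplified decoder; the decoder here is an assumption made for brevity and is not the DeepLabV3 head.

```python
import tensorflow as tf
from tensorflow.keras import layers

# MobileNetV3-Small as a feature extractor; include_top=False drops the
# classification head so the output is a low-resolution feature map.
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)  # roughly 7x7 spatial resolution at stride 32

# Minimal decoder: 1x1 projection followed by bilinear upsampling back to
# the input resolution, producing a per-pixel class map.
x = layers.Conv2D(64, 1, activation="relu")(features)
x = layers.UpSampling2D(size=32, interpolation="bilinear")(x)
outputs = layers.Conv2D(2, 1, activation="softmax")(x)  # e.g. person vs. background

model = tf.keras.Model(inputs, outputs)
```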
The MobileNetV3 architecture is the basis for many successful semantic segmentation models. One notable example is SelfieSegmenter from the MediaPipe framework [2].
MediaPipe Selfie Segmentation is a lightweight, real-time segmentation model developed by Google, optimized specifically for human figure segmentation in images and video streams. Designed with mobile and web applications in mind, it enables fast and efficient background removal or replacement by generating a binary mask that separates the human subject (typically a selfie or portrait) from the background. SelfieSegmenter provides a good trade-off between speed and accuracy.
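For reference, here is a minimal sketch of running MediaPipe's legacy Selfie Segmentation solution from Python; the file names are placeholders, and model_selection=1 selects the landscape-optimized variant.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_selfie = mp.solutions.selfie_segmentation

# model_selection=0 is the general model, 1 is the landscape-optimized variant.
with mp_selfie.SelfieSegmentation(model_selection=1) as segmenter:
    image_bgr = cv2.imread("selfie.jpg")           # placeholder input image
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

    results = segmenter.process(image_rgb)

    # The mask is a float map in [0, 1]; threshold it to get a binary
    # person-vs-background mask, then blank out the background.
    mask = results.segmentation_mask > 0.5
    output = np.where(mask[..., None], image_bgr, 0)
    cv2.imwrite("selfie_no_background.png", output)
```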
One common approach to improving a model's speed is to reduce the input resolution of the image. Models typically do not operate on the native resolution of the input. Convolutional filters sweep over the input image and the subsequent activation maps, so the amount of computation depends directly on the input size and scales quadratically with its side length. On the other hand, the lower the input resolution, the less detail is available to the model.
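A quick back-of-the-envelope calculation illustrates this quadratic scaling; the channel counts below are arbitrary examples, and only the cost of a single standard 3×3 convolution is counted.

```python
def conv_macs(height, width, in_ch, out_ch, kernel=3):
    """Approximate multiply-accumulates for one standard convolution layer."""
    return height * width * in_ch * out_ch * kernel * kernel

full = conv_macs(256, 144, 32, 32)   # full input resolution
half = conv_macs(128, 72, 32, 32)    # both sides halved

print(full / half)  # 4.0 -- halving the input side length cuts the cost ~4x
```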
Another architectural option is to reduce the depth of the model, i.e. the number of convolutional layers. This reduces the amount of semantic information the model can encode and can harm accuracy. We could also reduce the number of convolutional filters in each layer. Simply cutting the number of parameters is always effective in terms of speed, but it usually hurts accuracy. Finding the right allocation of model parameters across the architecture is the key to obtaining a model that is both fast and accurate, and this is the core of our solution.
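For a similar intuition about width reduction: the weight count of a standard convolution layer is proportional to the product of its input and output channels, so halving the filters in consecutive layers roughly quarters their weights. The numbers below are arbitrary examples.

```python
def conv_params(in_ch, out_ch, kernel=3):
    """Weights of one standard convolution layer (bias ignored)."""
    return kernel * kernel * in_ch * out_ch

full = conv_params(64, 128)   # 73,728 parameters
thin = conv_params(32, 64)    # 18,432 parameters (0.5x width multiplier)

print(thin / full)  # 0.25 -- half the filters per layer, about a quarter of the weights
```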
Our solution uses MobileNetV3 as the base architecture. We propose to divide the model into three subnetworks. Two of them we call "lean" networks, because they use only a few convolutional layers and return high-resolution activation maps of 128×72 and 64×36 pixels, respectively. The main backbone network has significantly more convolutional layers and operates at an internal resolution of 16×9 pixels.
When designing such a neural network architecture, it is important to decide not only the number of layers and their parameters, but also where the "lean" networks start and where the networks merge back together. These decisions are not independent of each other: if we want to merge two networks, we need to make sure their activation map dimensions are compatible. The main advantage of this optimization is that the final model can achieve better segmentation accuracy without losing speed. In particular, a well-chosen placement of the bifurcation and merge points between subnetworks allows the model to segment finer details of the image, e.g. fingers.
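The sketch below illustrates only this resolution bookkeeping at the merge points, using the internal resolutions mentioned above (128×72, 64×36, 16×9). The channel counts and layer choices are invented placeholders and do not reflect our actual architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Three feature maps at the internal resolutions mentioned above; the channel
# counts are placeholders.
lean_hi  = tf.keras.Input(shape=(72, 128, 16))   # first "lean" branch, 128x72
lean_mid = tf.keras.Input(shape=(36, 64, 32))    # second "lean" branch, 64x36
backbone = tf.keras.Input(shape=(9, 16, 128))    # deep backbone, 16x9

# Merge point 1: the backbone (16x9) must be upsampled 4x to match the 64x36 branch.
up = layers.UpSampling2D(size=4, interpolation="bilinear")(backbone)
merged_mid = layers.Concatenate()([lean_mid, up])

# Merge point 2: the fused map (64x36) must be upsampled 2x to match 128x72.
up2 = layers.UpSampling2D(size=2, interpolation="bilinear")(merged_mid)
merged_hi = layers.Concatenate()([lean_hi, up2])

print(merged_hi.shape)  # (None, 72, 128, 176): the branches are now compatible
```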
As described above, finding the optimal architecture is a complex task: the large number of parameter combinations results in a huge number of potential model architectures. At NeuralSpike we designed an automatic architecture search algorithm that lets us define base conditions, e.g. the number of subnetworks and the candidate bifurcation and merge points. The algorithm generates potential model architectures, which are benchmarked by running inference on the selected platform (e.g. a PC or an edge device). Candidates that pass the speed benchmark are trained on a subset of the data. Their training curves are then compared against each other, and the most promising candidates are trained on the full dataset. This procedure allows us to find the architecture best suited to the selected deployment target.
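A schematic version of this search loop is sketched below. All helper functions (sample_architecture, measure_fps, and so on) are hypothetical stubs standing in for the real benchmarking and training steps, not our actual implementation.

```python
import random

# Hypothetical stand-ins for the real benchmarking and training steps.
def sample_architecture(space):     return {k: random.choice(v) for k, v in space.items()}
def measure_fps(arch, device):      return random.uniform(50, 150)   # device benchmark
def train_and_score(arch, subset):  return random.random()           # short training run
def train_full(arch):               return arch                      # full training run

def architecture_search(space, min_fps=90, n_candidates=50, n_finalists=5):
    """Schematic search loop: generate, speed-filter, short-train on a data
    subset, then fully train only the best candidates."""
    candidates = [sample_architecture(space) for _ in range(n_candidates)]

    # 1. Keep only candidates that meet the speed budget on the target device.
    fast = [a for a in candidates if measure_fps(a, device="target") >= min_fps]

    # 2. Rank survivors by their training curves on a subset of the data.
    ranked = sorted(fast, key=lambda a: train_and_score(a, subset=True), reverse=True)

    # 3. Train only the most promising candidates on the full dataset.
    return [train_full(a) for a in ranked[:n_finalists]]

space = {"n_branches": [2, 3], "bifurcation": ["early", "mid"], "merge": ["add", "concat"]}
print(architecture_search(space))
```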
After applying the above procedure we obtained a model architecture with 172K parameters that performs very well in terms of both speed and accuracy, as we show in the next section.
In our benchmark we focus on devices with a low compute budget. Inference can run either on an x64 PC platform (in a web browser) or, more interestingly, on an edge device.
For the PC we ran our benchmark on an Intel Core i7-1185G7 CPU.
For edge computing we selected the MediaTek Genio 700 SOM, a great platform for AI that allows easy deployment of models. The Genio 700 offers strong performance while remaining energy efficient and cost effective.
Below we present a speed comparison between our segmentation model and SelfieSegmenter on both the CPU and the MediaTek Genio 700. First we compare the number of parameters. For SelfieSegmenter we downloaded the .tflite file from the MediaPipe website and estimated the parameter count from it. SelfieSegmenter contains 206K parameters, while our model is smaller at 172K parameters; ours therefore has 34K fewer parameters and is ~16% smaller. Our model is also faster: on the CPU it is faster by ~26 FPS, a ~10% improvement, while on the MediaTek Genio 700 it reaches an impressive 92 FPS, faster by ~7 FPS (an 8% improvement).
| Model | # Params | x64 CPU FPS (higher is better) | MediaTek Genio 700 FPS (higher is better) |
|---|---|---|---|
| SelfieSegmenter (landscape) | 206K | 279.42 | 85.50 |
| Ours | 172K | 305.83 | 92.34 |
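For readers who want to reproduce this kind of measurement, the snippet below shows one way to time a .tflite model with the TensorFlow Lite interpreter. This is a generic sketch, not our exact benchmark harness; the model path, thread count, and iteration counts are placeholder assumptions.

```python
import time
import numpy as np
import tensorflow as tf

# Load a .tflite model; the path and thread count are placeholders.
interpreter = tf.lite.Interpreter(model_path="segmenter.tflite", num_threads=4)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Random input with the model's expected shape and dtype.
dummy = np.random.random_sample(inp["shape"]).astype(inp["dtype"])

# Warm-up, then time a batch of invocations.
for _ in range(10):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

n = 200
start = time.perf_counter()
for _ in range(n):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
elapsed = time.perf_counter() - start

print(f"{n / elapsed:.2f} FPS")
```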
In this section we compare both models qualitatively. In general, we can distinguish three main challenges for segmentation models:

- segmenting fine details such as fingers,
- keeping the segmentation map stable under motion blur,
- capturing the global structure of the scene, e.g. enclosed background regions.
Below we show a couple of screenshots demonstrating that our model pays more attention to details. First, our model is visibly better at finger segmentation, as shown in the screenshot below.
Our model also returns a more stable segmentation map, especially when the picture is blurred due to motion.
Finally, our model is also able to capture the global structure of the image, making fewer errors in enclosed regions.
We proposed a new segmentation model based on MobileNetV3, with an improved architecture found by our architecture search algorithm. We compared our model to SelfieSegmenter from MediaPipe: our model is roughly 8-10% faster on both the x64 CPU and the MediaTek Genio 700, and it segments more details.
[1] A. Howard et al., "Searching for MobileNetV3," https://arxiv.org/pdf/1905.02244
[2] MediaPipe Image Segmenter, https://ai.google.dev/edge/mediapipe/solutions/vision/image_segmenter