MIDV-550

Technical Report – April 2026

Abstract

The proliferation of mobile-based identity-verification services has created a pressing need for realistic, large-scale datasets that capture the visual variability of government-issued identification (ID) documents photographed with consumer-grade smartphones. We introduce MIDV-550, a publicly released benchmark consisting of 5,550 high-resolution images of five common ID-document types (passport, national ID card, driver's licence, residence permit, and employee badge) captured under uncontrolled lighting, pose, motion-blur, and occlusion conditions. Each image is richly annotated with document-level bounding boxes, per-field polygons, text transcriptions, and a hierarchy of quality-assessment tags. We present a systematic evaluation of state-of-the-art detection (YOLOv8, EfficientDet-D4) and recognition pipelines (CRNN, Transformer-based OCR) on MIDV-550, establishing baseline performance and highlighting the remaining challenges in mobile ID verification. The dataset, annotation tools, and evaluation scripts are released under a permissive CC-BY-4.0 license to foster reproducible research.

1. Introduction

Mobile identity verification (MIV) has become a core component of financial onboarding, e-government services, and travel-related applications. Unlike traditional document-verification workflows that rely on high-quality scanners, MIV must cope with images captured by handheld smartphones in a wide range of uncontrolled environments. This introduces a set of visual degradations (low illumination, motion blur, perspective distortion, specular highlights, and partial occlusion) that dramatically affect both document detection and optical character recognition (OCR).

Existing public benchmarks (e.g., [1], IDDoc [2], SROIE [3]) either contain a limited number of document classes, provide only coarse bounding-box annotations, or lack realistic mobile acquisition conditions. Consequently, progress in robust MIV systems has been hindered by a mismatch between training data and real-world deployment scenarios.

2. Related Work

- Document detection: Object detectors such as Faster R-CNN [5], YOLOv8 [6], and EfficientDet [7] have become de facto standards. However, their performance on low-resolution, heavily distorted ID images remains under-explored.
- Field segmentation: Recent works use instance segmentation (Mask R-CNN [8]) or keypoint-based approaches (DETR-Doc [9]) to isolate MRZ, portrait, and signature regions.
- Text recognition: Sequence-to-sequence models (CRNN [10]), Transformer-based recognizers (SATRN [11]), and large-scale pre-trained vision-language models (TrOCR [12]) have set the state of the art on clean scanned documents but degrade sharply on mobile captures.

3. Baseline Results

YOLOv8-x attains the highest detection recall (98 %) while maintaining real-time speed on mobile-grade CPUs (≈ 150 ms per image using TensorRT). Field-localization accuracy is summarized below:

| Model | Mean IoU (all fields) | MRZ IoU | Portrait IoU |
|-------|-----------------------|---------|--------------|
| Mask R-CNN (ResNeXt-101) | 0.78 | 0.84 | 0.71 |
| DETR-Doc (ViT-B) | 0.74 | 0.80 | 0.68 |
| Mask R-CNN + Geometric Refine (baseline) | 0.82 | 0.88 | 0.75 |
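The per-field IoU scores reported in the table can be reproduced with a standard axis-aligned box IoU. A minimal sketch is given below; the `(x1, y1, x2, y2)` box format and the field-name dictionaries are illustrative assumptions, not the dataset's actual evaluation-script interface:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle; width/height clamp to zero when boxes do not overlap.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predictions, ground_truth):
    """Average IoU over the field names present in both dicts (name -> box)."""
    shared = predictions.keys() & ground_truth.keys()
    return sum(iou(predictions[f], ground_truth[f]) for f in shared) / len(shared)
```

For example, `mean_iou({"mrz": (0, 0, 2, 2)}, {"mrz": (1, 1, 3, 3)})` evaluates a single field whose boxes overlap in a unit square, giving 1/7.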

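The abstract describes per-image annotations (a document-level bounding box, per-field polygons, text transcriptions, and quality tags). The report does not specify the release format, so the JSON layout and field names below are purely hypothetical, sketched only to show how such a record might be consumed:

```python
import json

# Hypothetical annotation record; every key name here is an illustrative
# assumption, not the actual MIDV-550 schema.
SAMPLE = """
{
  "image": "passport_0001.jpg",
  "document_bbox": [34, 120, 980, 710],
  "fields": [
    {"name": "mrz",
     "polygon": [[40, 600], [950, 600], [950, 690], [40, 690]],
     "text": "P<UTOERIKSSON<<ANNA<MARIA<<<<<<<<<<<<<<<<<<<"},
    {"name": "portrait",
     "polygon": [[60, 160], [300, 160], [300, 470], [60, 470]],
     "text": null}
  ],
  "quality_tags": ["motion_blur", "specular_highlight"]
}
"""


def load_annotation(raw):
    """Parse one annotation record and re-index its field list by field name."""
    record = json.loads(raw)
    record["fields"] = {f["name"]: f for f in record["fields"]}
    return record


ann = load_annotation(SAMPLE)
```

Indexing the fields by name makes per-field lookups (e.g., fetching the MRZ polygon for the IoU evaluation) a constant-time dictionary access.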