Abstract:
Automatic detection and recognition of safety helmet wearing based on video analysis is important for ensuring production safety, as manual supervision of whether workers wear safety helmets is inefficient. With the advancement of deep learning, using computer vision to assist in detecting safety helmet wearing holds significant research and application value. However, complex environments and variable conditions make accurate detection and recognition of safety helmet usage challenging. Helmet-wearing detection methods are generally classified into traditional machine learning methods and deep learning methods. Traditional machine learning methods rely on manually selected or statistical features, resulting in poor model stability. Deep learning–based methods are further divided into "two-stage" and "one-stage" methods: two-stage methods achieve high detection accuracy but cannot run in real time, whereas one-stage methods run in real time at the cost of accuracy. Achieving both accuracy and real-time performance is therefore a central challenge in developing video-based helmet detection systems, since accurate and fast detection is essential for effective real-time monitoring of production sites. To address these challenges, this paper proposes DS-YOLOv5, a real-time helmet detection and recognition model based on YOLOv5. The proposed model addresses three main problems: first, the insufficient global information extraction of convolutional neural network (CNN) models; second, the lack of robustness of Deep SORT to multiple targets and occlusion in video scenes; and third, the inadequate feature extraction for multiscale targets. The DS-YOLOv5 model leverages an improved Deep SORT multitarget tracking algorithm to reduce the rate of missed detections under multitarget and occlusion conditions and to increase error tolerance in video detection.
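The occlusion tolerance described above comes from associating detections across frames: a target missed by the detector in one frame can be carried forward by its track. The paper's improved Deep SORT additionally uses appearance features and Kalman-filter motion prediction; the snippet below is only a minimal, dependency-free sketch of the IoU-based association step common to SORT-style trackers (function names are illustrative, not from the paper's code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match existing track boxes to new detection boxes by IoU.
    Returns (matches, unmatched_track_ids, unmatched_detection_ids)."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matched_t, matched_d, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same target
        if ti in matched_t or di in matched_d:
            continue  # each track and each detection is used at most once
        matched_t.add(ti)
        matched_d.add(di)
        matches.append((ti, di))
    unmatched_t = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_d = [i for i in range(len(detections)) if i not in matched_d]
    return matches, unmatched_t, unmatched_d
```

Unmatched tracks correspond to targets the detector missed (e.g. under occlusion) and can be kept alive for a few frames; unmatched detections start new tracks. Production trackers replace the greedy loop with Hungarian matching and add appearance-embedding distances.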
Further, a simplified transformer module is integrated into the backbone network to enhance the capture of global information from images and thereby improve feature learning for small targets. Finally, a bidirectional feature pyramid network (BiFPN) is used to fuse multiscale features, better adapting to target scale changes caused by varying photographic distance. The DS-YOLOv5 model was validated on the GDUT-HWD dataset through ablation and comparison experiments, and the tracking capability of the improved Deep SORT was compared with that of the baseline YOLOv5 model on the public MOT pedestrian dataset. Comparisons against five one-stage methods and four helmet detection and recognition models show that the proposed model handles occlusion and target scale variation better. The model achieved a mean average precision (mAP) of 95.5%, which is superior to that of the other helmet detection and recognition models.
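The multiscale fusion mentioned above can be illustrated by BiFPN's fast normalized fusion, which combines feature maps from different scales with learned non-negative weights rather than simple addition. Below is a toy scalar-level sketch (the abstract does not give the paper's exact formulation, so this follows the standard BiFPN fusion rule; names and the epsilon value are illustrative):

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with non-negative learned weights,
    normalized so the output stays on the scale of the inputs (BiFPN-style).

    features: list of equal-length lists of floats (already resized to one scale)
    weights:  one learnable scalar per input feature
    """
    clipped = [max(0.0, w) for w in weights]  # ReLU keeps each weight non-negative
    total = sum(clipped) + eps                # eps avoids division by zero
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / total
        for i in range(len(features[0]))
    ]
```

In a real network the inputs are tensors resampled to a common resolution and the weights are trained parameters; the normalization lets the network learn how much each scale should contribute, which helps when target size varies with camera distance.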