• Indexed in Engineering Index (EI)
  • Scopus
  • Chinese Core Journals
  • Chinese Science Citation Database (CSCD) source journal

Hamba: Image Harmonization Based on Selective State Space Model

孙金胜, 姚超, 班晓娟

Citation: 孙金胜, 姚超, 班晓娟. Hamba: Image Harmonization based on Selective State Space Model[J]. Chinese Journal of Engineering. DOI: 10.13374/j.issn2095-9389.2024.09.12.006


Details
  • CLC Number: TP37


  • Abstract: In recent years, deep learning models incorporating Transformer components have pushed the performance boundaries of image editing tasks, including image harmonization. Unlike convolutional neural networks (CNNs), which use static local filters, Transformers employ a self-attention mechanism that allows adaptive non-local filtering to sensitively capture long-range context. However, this sensitivity comes at the cost of significant model complexity, which can hinder learning efficiency, especially on relatively medium-scale imaging datasets. Here, we propose a novel network model, Hamba, for image harmonization, leveraging selective state space modeling (SSM) to effectively capture long-range context while maintaining local precision. To this end, Hamba constructs a U-shaped network based on VSS blocks, where SSM layers are applied across multiple spatial dimensions to learn contextual relationships. Moreover, it establishes connections between the semantic and stylistic features of the foreground and background in composite images through a local-global feature sequence extractor. Our results demonstrate that Hamba outperforms state-of-the-art CNN-based and Transformer-based methods.
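The selective state space recurrence that underlies models of this kind can be sketched as a discretized linear scan over a sequence, where the projections and step sizes are input-dependent ("selective"). The sketch below is purely illustrative and is not the authors' implementation: the function name, shapes, and the diagonal-A zero-order-hold discretization are assumptions, shown only to make the SSM mechanism mentioned in the abstract concrete.

```python
import numpy as np

def selective_ssm_scan(x, A, B, C, delta):
    """Illustrative sketch of a discretized selective SSM scan (not Hamba's code).

    x:     (L,) input sequence
    A:     (N,) diagonal continuous-time state matrix
    B, C:  (L, N) per-step input/output projections ("selective": they vary with t)
    delta: (L,) per-step discretization step sizes
    Returns y: (L,) output sequence.
    """
    L, N = B.shape
    h = np.zeros(N)          # hidden state
    y = np.empty(L)
    for t in range(L):
        # Zero-order-hold discretization of the diagonal system
        A_bar = np.exp(delta[t] * A)
        B_bar = delta[t] * B[t]
        h = A_bar * h + B_bar * x[t]   # state update carries long-range context
        y[t] = C[t] @ h                # readout at step t
    return y
```

Because B, C, and delta are recomputed from the input at every step, the scan can emphasize or forget context adaptively, which is the property the abstract contrasts with both static CNN filters and quadratic-cost self-attention.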
Metrics
  • Article views: 51
  • HTML full-text views: 2
  • PDF downloads: 6
  • Citations: 0
Publication History
  • Online publication date: 2025-02-20
