Downloads | Citations | Reads
109 | 1 | 87
Abstract: Large convolutional kernel models have been proven effective in the image domain, but no comparably strong 3D large convolutional kernel model was available in the video domain. In addition, previous work ignored the fact that humans are the subjects of the spatio-temporal action detection task, so its backbone networks extracted features only for generic targets. To address these issues, a 3D large convolutional kernel neural network containing a feature fusion structure (FFConvNeXt3D) was proposed. Firstly, the mature ConvNeXt network was inflated into a ConvNeXt3D network for the video domain, and the pre-trained weights were processed accordingly for the inflated network. Secondly, the effect of the size and position of the convolutional kernel's temporal dimension on model performance was investigated. Finally, a feature fusion structure was proposed that focuses on improving the backbone network's ability to extract features from medium- and large-sized targets such as humans. Ablation and comparison experiments were conducted on the UCF101-24 dataset; the results verified the effectiveness of the feature fusion structure and showed that the model outperformed other methods.
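The weight-inflation step mentioned in the abstract follows the standard I3D-style bootstrapping idea: replicate each pretrained 2D kernel along a new temporal axis and rescale it so that the 3D filter reproduces the 2D response on a temporally constant clip. A minimal PyTorch sketch, assuming this standard scheme (the function name and the default temporal size are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pretrained 2D convolution into a 3D one (I3D-style):
    the 2D kernel is repeated time_dim times along the temporal axis
    and divided by time_dim, preserving activations on a static input."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        groups=conv2d.groups,  # keeps ConvNeXt's depthwise convolutions depthwise
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (C_out, C_in/groups, kH, kW) -> (C_out, C_in/groups, T, kH, kW)
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        conv3d.weight.copy_(weight3d / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

For example, inflate_conv2d(nn.Conv2d(96, 96, 7, padding=3, groups=96)) turns ConvNeXt's 7×7 depthwise convolution into a 3×7×7 depthwise one, the kind of kernel whose temporal size and position the abstract says were then varied.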
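The abstract does not describe the feature fusion structure itself. Purely as a hypothetical illustration of fusing backbone stages so that medium- and large-sized targets such as humans retain both semantic and spatial detail, a minimal FPN-style top-down fusion of two 3D stages might look like this (all module names and channel sizes are assumptions, not the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion3D(nn.Module):
    """Hypothetical fusion of a deep (coarse, semantic) stage into a
    shallow (fine, detailed) stage of a 3D backbone."""
    def __init__(self, c_shallow: int, c_deep: int, c_out: int):
        super().__init__()
        self.lateral = nn.Conv3d(c_shallow, c_out, kernel_size=1)
        self.top_down = nn.Conv3d(c_deep, c_out, kernel_size=1)
        self.smooth = nn.Conv3d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample the deep map to the shallow map's (T, H, W) size, then merge.
        deep_up = F.interpolate(self.top_down(deep), size=shallow.shape[2:],
                                mode="trilinear", align_corners=False)
        return self.smooth(self.lateral(shallow) + deep_up)
```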
Basic information:
DOI:10.13705/j.issn.1671-6841.2023124
CLC number: TP391.41; TP183
Citation:
[1] HUANG Q K, HUANG W, LING X H. FFConvNeXt3D: a large convolutional kernel network for extracting features of medium- and large-scale targets[J]. Journal of Zhengzhou University (Natural Science Edition), 2025, 57(2): 37-43. DOI: 10.13705/j.issn.1671-6841.2023124.
Funding:
Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172021K08); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD)