2025, Vol. 57, No. 02, pp. 37-43
FFConvNeXt3D: A Large-Kernel Convolutional Network for Extracting Features of Medium- and Large-Scale Objects
Foundation:
Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172021K08);
Priority Academic Program Development of Jiangsu Higher Education Institutions
Email:
huangwei@suda.edu.cn;
DOI:
10.13705/j.issn.1671-6841.2023124
Affiliations:
School of Computer Science and Technology, Soochow University; Department of Computer Science, Dongwu College, Soochow University; School of Computational Science and Artificial Intelligence, Suzhou City University; Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University
Abstract:
Large-kernel models have proven their effectiveness in the image domain, but no strong 3D large-kernel model yet exists for video. Moreover, prior work on spatio-temporal action detection overlooked the fact that its subjects are people: the backbone networks extract features only for generic objects. To address these issues, a 3D large-kernel neural network with a feature-fusion structure (FFConvNeXt3D) is proposed. First, the mature ConvNeXt network is inflated into a ConvNeXt3D network for the video domain, and the pretrained weights are processed accordingly for use in the inflated network. Second, the influence of the size and position of the kernel's temporal dimension on model performance is studied. Finally, a feature-fusion structure is proposed to strengthen the backbone's ability to extract features of person-sized objects. Ablation and comparison experiments on the UCF101-24 dataset verify the effectiveness of the feature-fusion structure, and the model outperforms other methods.
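The abstract's inflation step follows the standard I3D recipe (reference [10]): each pretrained 2D kernel is replicated along a new temporal axis and rescaled so that the 3D network initially reproduces the 2D network's response on a temporally constant input. A minimal NumPy sketch of this recipe (an illustration of the general technique, not the paper's exact code; shapes follow the usual (C_out, C_in, kH, kW) convolution-weight layout):

```python
import numpy as np

def inflate_conv_weight(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D conv weight (C_out, C_in, kH, kW) into a 3D weight
    (C_out, C_in, t, kH, kW): replicate along the new temporal axis and
    divide by t, so a temporally constant input clip produces the same
    response as the original 2D kernel on a single frame."""
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t

# Sanity check: inflate an 8x3x7x7 kernel to temporal extent 5.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 7, 7))
w3d = inflate_conv_weight(w2d, t=5)

# A single-frame patch, repeated over time into a constant clip.
patch = rng.standard_normal((3, 7, 7))
clip = np.broadcast_to(patch, (5, 3, 7, 7)).transpose(1, 0, 2, 3)

# Responses of the 2D kernel to the patch and the 3D kernel to the clip.
out2d = np.tensordot(w2d, patch, axes=([1, 2, 3], [0, 1, 2]))
out3d = np.tensordot(w3d, clip, axes=([1, 2, 3, 4], [0, 1, 2, 3]))
assert np.allclose(out2d, out3d)  # identical by construction
```

The 1/t rescaling is what keeps the inflated network's activations at the same scale as the 2D network's, which is why the ImageNet-pretrained ConvNeXt weights remain a useful initialization after inflation.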
Keywords:
large convolution kernel; object detection; spatio-temporal action detection; action recognition; feature fusion
Downloads: 96 | Citations: 0 | Reads: 12
References
[1] WANG Y, YUAN G W, QU R, et al. Target detection method of airport apron based on improved YOLOv3[J]. Journal of Zhengzhou university (natural science edition), 2022, 54(5): 22-28.
[2] JIANG W Y, LIU C M. Adaptive algorithm for human motion classification based on depth map[J]. Journal of Zhengzhou university (natural science edition), 2021, 53(1): 16-21.
[3] TANG J J,XIA J,MU X Z,et al.Asynchronous interaction aggregation for action detection[EB/OL].(2020-04-16)[2023-03-11].https://arxiv.org/abs/2004.07485.pdf.
[4] FEICHTENHOFER C,FAN H Q,MALIK J,et al.SlowFast networks for video recognition[C]//IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Press,2020:6201-6210.
[5] ZHAO J J,ZHANG Y Y,LI X Y,et al.TubeR:tubelet transformer for video action detection[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2022:13588-13597.
[6] TRAN D,WANG H,FEISZLI M,et al.Video classification with channel-separated convolutional networks[C]//IEEE/CVF International Conference on Computer Vision.Piscataway:IEEE Press,2020:5551-5560.
[7] LIU Z,MAO H Z,WU C Y,et al.A ConvNet for the 2020s[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2022:11966-11976.
[8] LIU Z,LIN Y T,CAO Y,et al.Swin transformer:hierarchical vision transformer using shifted windows[EB/OL].(2021-03-25)[2023-03-11].https://arxiv.org/abs/2103.14030.pdf.
[9] HE K M,ZHANG X Y,REN S Q,et al.Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2016:770-778.
[10] CARREIRA J,ZISSERMAN A.Quo vadis,action recognition?A new model and the kinetics dataset[C]//IEEE Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2017:4724-4733.
[11] REN S Q,HE K M,GIRSHICK R,et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE transactions on pattern analysis and machine intelligence,2017,39(6):1137-1149.
[12] TRAN D,WANG H,TORRESANI L,et al.A closer look at spatiotemporal convolutions for action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2018:6450-6459.
[13] XIE S N,SUN C,HUANG J,et al.Rethinking spatiotemporal feature learning:speed-accuracy trade-offs in video classification[C]//European Conference on Computer Vision.Cham:Springer International Publishing,2018:318-335.
[14] FEICHTENHOFER C.X3D:expanding architectures for efficient video recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2020:200-210.
[15] SHE H, WU L, SHAN L Q. Improved rice pest recognition based on SSD network model[J]. Journal of Zhengzhou university (natural science edition), 2020, 52(3): 49-54.
[16] KÖPÜKLÜ O,WEI X Y,RIGOLL G.You only watch once:a unified CNN architecture for real-time spatiotemporal action localization[EB/OL].(2019-11-15)[2023-03-11].https://arxiv.org/abs/1911.06644.pdf.
[17] PAN J T,CHEN S Y,SHOU M Z,et al.Actor-context-actor relation network for spatio-temporal action localization[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2021:464-474.
[18] LIU S,QI L,QIN H F,et al.Path aggregation network for instance segmentation[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2018:8759-8768.
[19] HOU R,CHEN C,SHAH M.Tube convolutional neural network (T-CNN) for action detection in videos[EB/OL].(2017-03-30)[2023-03-11].https://arxiv.org/abs/1703.10664.pdf.
[20] SONG L,ZHANG S W,YU G,et al.TACNet:transition-aware context network for spatio-temporal action detection[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2020:11979-11987.
[21] KALOGEITON V,WEINZAEPFEL P,FERRARI V,et al.Action tubelet detector for spatio-temporal action localization[C]//IEEE International Conference on Computer Vision.Piscataway:IEEE Press,2017:4415-4423.
[22] LI Y X,WANG Z X,WANG L M,et al.Actions as moving points[C]//European Conference on Computer Vision.Cham:Springer International Publishing,2020:68-84.
[23] YANG X T,YANG X D,LIU M Y,et al.STEP:spatio-temporal progressive learning for video action detection[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Piscataway:IEEE Press,2020:264-272.
Basic information:
DOI:10.13705/j.issn.1671-6841.2023124
CLC number: TP391.41; TP183
Citation:
[1] HUANG Q K, HUANG W, LING X H. FFConvNeXt3D: a large-kernel convolutional network for extracting features of medium- and large-scale objects[J]. Journal of Zhengzhou university (natural science edition), 2025, 57(02): 37-43. DOI: 10.13705/j.issn.1671-6841.2023124.
Fund:
Open Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172021K08); Priority Academic Program Development of Jiangsu Higher Education Institutions