Enhanced Feature Fusion and Transfer Learning for Multi-Format Government Document Classification

Qiaomu Zhang

Abstract

Government document digitization faces significant challenges due to diverse formats, degraded quality, and limited annotated data. This paper presents an enhanced feature fusion framework combining convolutional neural networks and transformer architectures for multi-format government document classification. The proposed approach integrates hierarchical visual features with contextual text embeddings via a cross-modal attention mechanism, leveraging progressive transfer learning from general document corpora to specialized government domains. Experimental results on real-world administrative datasets demonstrate classification accuracy improvements of 5.6-8.3 percentage points (pp) over baseline methods, with particular robustness on degraded historical documents. The framework achieves 94.7% accuracy across multiple document formats while maintaining computational efficiency suitable for large-scale deployment in federal and state digitization initiatives.
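The article's implementation is not reproduced on this page. As a rough illustration of the cross-modal attention fusion the abstract describes, the sketch below (in PyTorch) projects CNN visual features and transformer text embeddings into a shared space and lets text tokens attend to visual patches; all dimensions, names, and the class count are hypothetical choices for illustration, not details taken from the paper.

# Minimal sketch of cross-modal attention fusion (hypothetical dimensions and names).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses CNN visual features with transformer text embeddings:
    text tokens act as queries over visual patch keys/values."""
    def __init__(self, vis_dim=2048, txt_dim=768, fused_dim=768,
                 heads=8, num_classes=12):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, fused_dim)   # project CNN patch features
        self.txt_proj = nn.Linear(txt_dim, fused_dim)   # project text token embeddings
        self.cross_attn = nn.MultiheadAttention(fused_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(fused_dim)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, N_patches, vis_dim) from a CNN backbone (flattened conv map)
        # txt_feats: (B, N_tokens, txt_dim) from a text encoder
        v = self.vis_proj(vis_feats)
        t = self.txt_proj(txt_feats)
        # Cross-modal attention: text queries attend over visual keys/values.
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm(fused + t)     # residual connection on the text stream
        pooled = fused.mean(dim=1)       # mean-pool fused tokens
        return self.classifier(pooled)   # document-class logits

# Usage with dummy tensors:
model = CrossModalFusion()
logits = model(torch.randn(2, 49, 2048), torch.randn(2, 128, 768))
print(logits.shape)  # torch.Size([2, 12])

Mean-pooled residual fusion is one common design choice here; in the paper's framework, the hierarchical visual features and the progressive transfer-learning schedule (general document corpora, then government-domain fine-tuning) would sit around a fusion core of this kind.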

How to Cite

Zhang, Q. (2026). Enhanced Feature Fusion and Transfer Learning for Multi-Format Government Document Classification. Journal of Science, Innovation & Social Impact, 1(1), 427-441. https://sagespress.com/index.php/JSISI/article/view/64
