Intelligent Detection and Protection of Personally Identifiable Information in Clinical Text: An Advanced NLP Approach with Optimized Attention Mechanisms
Abstract
The protection of Personally Identifiable Information (PII) in clinical text is a critical challenge in healthcare data management, particularly as medical institutions increasingly adopt digital health records and data-sharing initiatives. This paper presents a novel natural language processing framework that uses optimized attention mechanisms and context-aware tokenization strategies to detect and protect sensitive information in clinical documents with high accuracy. Our approach integrates transformer-based architectures with domain-specific enhancements, achieving a 95.3% F1-score on standard benchmarks while satisfying HIPAA Safe Harbor requirements through a combination of deep learning and rule-based processing. The proposed method introduces a hierarchical detection system that processes clinical text at multiple levels of granularity and employs specialized attention heads for different PII categories. Experimental results on three large-scale clinical datasets show that our framework outperforms existing state-of-the-art methods by 8.7% in detection accuracy and reduces the false-positive rate by 59% relative to ClinicalBERT (from 12.8% to 5.2%). Furthermore, our intelligent redaction strategy preserves the semantic integrity of clinical content, enabling secure data sharing while maintaining the utility of medical information.
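To make the rule-based side of such a pipeline concrete, the following is a minimal sketch of category-preserving redaction, in which each detected span is replaced with a placeholder naming its PII category so the surrounding clinical sentence stays readable. It is not the authors' implementation: the pattern set `PII_PATTERNS`, the functions `detect_pii` and `redact`, and the sample note are all hypothetical, and the four regular expressions cover only a small fraction of the identifiers a real de-identification system must handle. The helper `relative_reduction` simply reproduces the arithmetic behind the reported drop in false-positive rate from 12.8% to 5.2%.

```python
import re

# Hypothetical rule set for illustration only; these are NOT the rules
# used in the paper and do not cover all HIPAA Safe Harbor identifiers.
PII_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}


def detect_pii(text):
    """Return sorted, non-overlapping (start, end, category) spans."""
    spans = []
    for category, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), category))
    # Sort by start offset; on ties, prefer the longer span.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    last_end = -1
    for start, end, category in spans:
        if start >= last_end:  # drop spans overlapping one already kept
            kept.append((start, end, category))
            last_end = end
    return kept


def redact(text):
    """Replace each detected span with a category placeholder."""
    pieces = []
    cursor = 0
    for start, end, category in detect_pii(text):
        pieces.append(text[cursor:start])
        pieces.append(f"[{category}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def relative_reduction(before, after):
    """Fractional reduction when a rate falls from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    note = "Patient seen on 03/14/2021, MRN: 1234567. Call 555-123-4567."
    print(redact(note))
    print(round(relative_reduction(12.8, 5.2), 2))
```

Replacing a span with its category label rather than deleting it is one simple way to keep a note interpretable after redaction; a learned detector, such as the transformer model the abstract describes, would supply additional spans to the same replacement step.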