Automatic depersonalization of confidential information
https://doi.org/10.32362/2500-316X-2023-11-5-7-18
Abstract
Objectives. As the scope of personal data transmitted online continues to grow, national legislatures are increasingly regulating the storage and processing of digital information. This paper raises the problem of protecting personal data and other confidential information such as bank secrecy or medical confidentiality of individuals. One approach to the protection of confidential data is to depersonalize it, i.e., to transform it so that it becomes impossible to identify the specific subject to whom the data belongs. The aim of the work is to develop a method for the rapid and safe automation of the depersonalization process using machine learning technologies.
Methods. The authors propose using artificial intelligence models to build a system that depersonalizes personal data automatically, without human labor, and that recognizes confidential information with sufficient accuracy even in unstructured data. Rule-based algorithms that improve the precision of the depersonalization system are also described.
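The abstract does not say which rules the system uses. As a minimal sketch of how a rule can raise precision, the example below assumes one hypothetical rule: candidate bank card numbers found by a regular expression are kept only if they pass the Luhn checksum. The pattern `CARD_CANDIDATE` and the function names are illustrative assumptions, not taken from the paper.

```python
import re

# Hypothetical pattern (an assumption, not from the paper):
# 16 digits, optionally grouped in fours by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of candidates that pass the checksum."""
    spans = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):  # the rule discards look-alike digit runs
            spans.append(match.span())
    return spans
```

A 16-digit order number matches the pattern but usually fails the checksum, so the rule removes that kind of false positive without affecting true card numbers.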
Results. To solve this problem, a named entity recognition model is trained on confidential data provided by the authors. Combined with rule-based algorithms, it achieves an F1 score greater than 0.9. Several anonymization algorithm variants are implemented, so that one can be chosen to suit a specific depersonalization task.
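For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it at the entity level, assuming that an entity counts as correct only when its span and label both match exactly; the abstract does not state which matching scheme the authors used, and the tuple layout and labels are hypothetical.

```python
def entity_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for exact-match entity sets.

    Each entity is assumed to be a hashable item such as
    (start, end, label); this layout is an illustrative assumption.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of three predictions are correct, and two of three gold entities are found.
gold = {(0, 5, "PER"), (10, 20, "ORG"), (25, 30, "PHONE")}
predicted = {(0, 5, "PER"), (10, 20, "ORG"), (40, 45, "PER")}
precision, recall, f1 = entity_f1(gold, predicted)
```

Because F1 penalizes both missed entities and false alarms, a rule that removes false positives raises precision and, with recall unchanged, raises F1 as well.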
Conclusions. The developed system solves the problem of automatic anonymization of confidential data. This opens an opportunity to ensure the secure processing and transmission of confidential information in many areas, such as banking, government administration, and advertising campaigns. The automation of the depersonalization process makes it possible to transfer confidential information in cases where it is necessary, but not currently possible due to legal restrictions. The distinctive feature of the developed solution is that both structured data and unstructured data are depersonalized, including the preservation of context.
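The abstract does not list the implemented anonymization variants. As a hedged illustration of how a choice of variant affects context preservation, the sketch below assumes three hypothetical modes: masking with asterisks, replacement with a typed placeholder, and consistent pseudonyms that map each distinct value to the same token everywhere. Entity spans are assumed to be non-overlapping and already detected.

```python
Entity = tuple[int, int, str]  # (start, end, label); layout is an assumption


def depersonalize(text: str, entities: list[Entity], mode: str = "placeholder") -> str:
    """Replace detected entity spans according to the chosen mode."""
    counters: dict[str, int] = {}
    pseudonyms: dict[tuple[str, str], str] = {}
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        value = text[start:end]
        if mode == "mask":
            replacement = "*" * (end - start)
        elif mode == "placeholder":
            replacement = f"[{label}]"
        elif mode == "pseudonym":
            key = (label, value)
            if key not in pseudonyms:  # same value always gets the same token
                counters[label] = counters.get(label, 0) + 1
                pseudonyms[key] = f"[{label}_{counters[label]}]"
            replacement = pseudonyms[key]
        else:
            raise ValueError(f"unknown mode: {mode}")
        parts.append(text[cursor:start])
        parts.append(replacement)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Ivan called Anna. Ivan left."
entities = [(0, 4, "PER"), (12, 16, "PER"), (18, 22, "PER")]
result = depersonalize(text, entities, mode="pseudonym")
```

In the pseudonym mode a reader can still tell that the person who called is the person who left, which is one way to read "preservation of context"; the mask mode hides that link.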
About the Authors
N. G. Babak
Russian Federation
Nikita G. Babak, Postgraduate Student, Department of Computing Machines, Systems and Networks; Chief Data Protection Officer, Cybersecurity Department
14/1, Krasnokazarmennaya ul., Moscow, 111250; 19, Vavilova ul., Moscow, 117312
ResearcherID HHY-9372-2022
Competing Interests:
The authors declare no conflicts of interest.
L. Yu. Belorybkin
Russian Federation
Leonid Yu. Belorybkin, Director of Data Protection Projects, Cybersecurity Department
19, Vavilova ul., Moscow, 117312
S. A. Otsokov
Russian Federation
Shamil A. Otsokov, Dr. Sci. (Eng.), Professor, Department of Intelligent Information Security Systems, Institute of Cybersecurity and Digital Technologies
78, Vernadskogo pr., Moscow, 119454
Scopus Author ID 57212622267
A. A. Terenin
Russian Federation
Alexey A. Terenin, Cand. Sci. (Eng.), Managing Director, Cybersecurity Department
19, Vavilova ul., Moscow, 117312
A. I. Shabrova
Russian Federation
Anastasia I. Shabrova, Data Protection Architect, Cybersecurity Department
19, Vavilova ul., Moscow, 117312
Supplementary files
1. Data processing by the depersonalization system (Type: Research Instruments; 92 KB)
- This paper raises the problem of protecting personal data and other confidential information such as bank secrecy or medical confidentiality of individuals.
- The authors propose using artificial intelligence models to build a system that depersonalizes personal data automatically, without human labor, and that recognizes confidential information with sufficient accuracy even in unstructured data.
- A named entity recognition model is trained on confidential data provided by the authors. Combined with rule-based algorithms, it achieves an F1 score greater than 0.9.
- The distinctive feature of the developed solution is that both structured data and unstructured data are depersonalized, including the preservation of context.
For citations:
Babak N.G., Belorybkin L.Yu., Otsokov S.A., Terenin A.T., Shabrova A.I. Automatic depersonalization of confidential information. Russian Technological Journal. 2023;11(5):7-18. https://doi.org/10.32362/2500-316X-2023-11-5-7-18