
Russian Technological Journal


Automatic depersonalization of confidential information

https://doi.org/10.32362/2500-316X-2023-11-5-7-18

Abstract

Objectives. As the volume of personal data transmitted online continues to grow, national legislatures are increasingly regulating the storage and processing of digital information. This paper addresses the problem of protecting personal data and other confidential information about individuals, such as information covered by bank secrecy or medical confidentiality. One approach to protecting confidential data is depersonalization, i.e., transforming the data so that the specific subject to whom it belongs can no longer be identified. The aim of the work is to develop a method for the rapid and safe automation of the depersonalization process using machine learning technologies.

Methods. The authors propose using artificial intelligence models to implement a system that depersonalizes personal data automatically, without human labor. The system is intended to make confidential information unrecognizable, with sufficient accuracy, even in unstructured data. Rule-based algorithms that improve the precision of the depersonalization system are also described.
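As an illustration of how a rule can raise precision, the sketch below filters candidate spans produced by a recognition model and discards card-number candidates that fail the Luhn checksum. This is a hypothetical example, not the authors' rule set: the `CARD` label, the span format, and the sample text are all assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); hypothetical span format


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # assumption: shorter digit runs are not card numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def filter_spans(text: str, spans: List[Span]) -> List[Span]:
    """Drop CARD candidates that fail the checksum; keep all other spans."""
    kept = []
    for start, end, label in spans:
        if label == "CARD" and not luhn_valid(text[start:end]):
            continue  # likely a false positive, e.g. an order number
        kept.append((start, end, label))
    return kept


text = "Card 4111 1111 1111 1111, order 1234 5678 9012 3456."
# Stand-in for model output: both digit groups were tagged as card numbers.
candidates = [(5, 24, "CARD"), (32, 51, "CARD")]
print(filter_spans(text, candidates))
```

Only the first candidate survives, so the false positive never reaches the anonymization step. A rule of this kind removes wrong detections without adding new ones, which is why it affects precision rather than recall.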

Results. To solve this problem, a named entity recognition (NER) model is trained on confidential data provided by the authors. In conjunction with rule-based algorithms, the model achieves an F1 score greater than 0.9. Several variants of the anonymization algorithm are implemented, so that one can be chosen to suit a specific depersonalization task.
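For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The minimal sketch below computes it over entity spans, assuming exact-match evaluation (a predicted span counts only if its boundaries and label both match a gold span). The abstract does not state the evaluation protocol, and the spans here are made-up illustrations, not the authors' data.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label); hypothetical span format


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans only.
gold = {(0, 5, "PERSON"), (10, 20, "PHONE"), (30, 40, "EMAIL")}
predicted = {(0, 5, "PERSON"), (10, 20, "PHONE"), (50, 55, "PERSON")}
print(precision_recall_f1(gold, predicted))
```

Here two of three predictions are correct and two of three gold spans are found, so precision, recall, and F1 all equal 2/3.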

Conclusions. The developed system solves the problem of automatically anonymizing confidential data. This makes it possible to ensure the secure processing and transmission of confidential information in many areas, such as banking, government administration, and advertising campaigns. Automating the depersonalization process makes it possible to transfer confidential information in cases where this is necessary but currently not possible due to legal restrictions. The distinctive feature of the developed solution is that it depersonalizes both structured and unstructured data while preserving context.
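The difference between anonymization variants, and what preserving context can mean in practice, may be easier to see in code. The sketch below contrasts two common variants: masking, which hides each value, and replacement with typed placeholders, which keeps the entity type and shows that repeated mentions refer to the same subject. Both functions, the placeholder format, and the sample spans are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); hypothetical span format


def mask(text: str, spans: List[Span], char: str = "*") -> str:
    """Variant 1: overwrite each detected value with a masking character."""
    out = text
    for start, end, _ in sorted(spans, reverse=True):  # right to left keeps offsets valid
        out = out[:start] + char * (end - start) + out[end:]
    return out


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Variant 2: replace values with typed placeholders; equal values share one placeholder."""
    placeholders: Dict[Tuple[str, str], str] = {}
    counters: Dict[str, int] = {}
    for start, end, label in sorted(spans):  # number entities in order of appearance
        key = (label, text[start:end])
        if key not in placeholders:
            counters[label] = counters.get(label, 0) + 1
            placeholders[key] = f"[{label}_{counters[label]}]"
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + placeholders[(label, text[start:end])] + out[end:]
    return out


text = "Anna called Boris. Anna left a message."
spans = [(0, 4, "PERSON"), (12, 17, "PERSON"), (19, 23, "PERSON")]
print(mask(text, spans))          # **** called *****. **** left a message.
print(pseudonymize(text, spans))  # [PERSON_1] called [PERSON_2]. [PERSON_1] left a message.
```

Masking discloses the least but loses who did what. Typed placeholders keep the text usable for downstream analysis at the cost of revealing that two mentions refer to the same subject.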

About the Authors

N. G. Babak
National Research University “Moscow Power Engineering Institute”; Sberbank of Russia
Russian Federation

Nikita G. Babak, Postgraduate Student, Department of Computing Machines, Systems and Networks; Chief Data Protection Officer, Cybersecurity Department

14/1, Krasnokazarmennaya ul., Moscow, 111250; 19, Vavilova ul., Moscow, 117312

ResearcherID HHY-9372-2022


Competing Interests:

The authors declare no conflicts of interest.



L. Yu. Belorybkin
Sberbank of Russia
Russian Federation

Leonid Yu. Belorybkin, Director of Data Protection Projects, Cybersecurity Department

19, Vavilova ul., Moscow, 117312





S. A. Otsokov
MIREA – Russian Technological University
Russian Federation

Shamil A. Otsokov, Dr. Sci. (Eng.), Professor, Department of Intelligent Information Security Systems, Institute of Cybersecurity and Digital Technologies

78, Vernadskogo pr., Moscow, 119454

Scopus Author ID 57212622267





A. A. Terenin
Sberbank of Russia
Russian Federation

Alexey A. Terenin, Cand. Sci. (Eng.), Managing Director, Cybersecurity Department

19, Vavilova ul., Moscow, 117312





A. I. Shabrova
Sberbank of Russia
Russian Federation

Anastasia I. Shabrova, Data Protection Architect, Cybersecurity Department

19, Vavilova ul., Moscow, 117312







Supplementary files

1. Data processing by the depersonalization system
Type: Research tools


For citations:


Babak N.G., Belorybkin L.Yu., Otsokov S.A., Terenin A.A., Shabrova A.I. Automatic depersonalization of confidential information. Russian Technological Journal. 2023;11(5):7–18. https://doi.org/10.32362/2500-316X-2023-11-5-7-18



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2782-3210 (Print)
ISSN 2500-316X (Online)