Automating the search for legal information in Arabic: A novel approach to document retrieval

K. S. Jafar; A. A. Mohammad; A. H. Issa; A. V. Panov

doi:10.32362/2500-316X-2024-12-5-7-16


EDN: CBEERK


Abstract

Objectives. The retrieval of legal information, including information related to issues such as punishment for crimes and felonies, represents a challenging task. The approach proposed in the article represents an efficient way to automate the retrieval of legal information without requiring a large amount of labeled data or consuming significant computational resources. The work set out to analyze the feasibility of a document retrieval approach in the context of Arabic legal texts using natural language processing and unsupervised clustering techniques.

Methods. The Topic-to-Vector (Top2Vec) topic modeling algorithm, which generates document embeddings based on semantic context, is used to cluster Arabic legal texts into relevant topics. The HDBSCAN density-based clustering algorithm is then used to identify subtopics within each cluster. Challenges of working with Arabic legal text, such as morphological complexity, ambiguity, and a lack of standardized terminology, are addressed by a proposed preprocessing pipeline that includes tokenization, normalization, stemming, and stop-word removal.
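
The preprocessing steps named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the stop-word list, the affix lists, the length thresholds, and the function names (`normalize`, `light_stem`, `preprocess`) are all hypothetical, and normalization is applied before tokenization here so that diacritics do not split words.

```python
import re

# Assumed character classes: short vowels (harakat), dagger alif, and tatweel.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# A token is taken to be a run of Arabic letters.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

# Hypothetical affix lists for a very light stemmer; longer prefixes come first.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")
SUFFIXES = ("ات", "ون", "ين")


def normalize(text):
    """Strip diacritics and unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub(r"[أإآ]", "ا", text)  # alif variants -> bare alif
    return text.replace("ى", "ي").replace("ة", "ه")


# Tiny illustrative stop-word list, normalized the same way as the input text.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن", "أن", "هذا")}


def light_stem(token):
    """Remove at most one prefix and one suffix, keeping a minimum stem length."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text):
    """Normalize, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = ARABIC_WORD.findall(normalize(text))
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]
```

The resulting token lists would then be passed to the embedding and clustering stages; a production system would more likely rely on an established Arabic NLP toolkit than on hand-written affix rules like these.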

Results. An evaluation of the approach on a dataset of Arabic legal texts, based on keywords, demonstrated its superior effectiveness in terms of precision and recall. The proposed approach achieves 87% precision and 80% recall. This can significantly improve the search for legal documents, making the process faster and more accurate.
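
Assuming the two reported figures are the standard retrieval metrics of precision and recall (the abstract does not define them), they can be computed per query as in the sketch below. The function name and the document identifiers in the usage example are made up for illustration and do not come from the article's dataset.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, five documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d5", "d6"])
print(p, r)
```

Averaging these per-query values over a set of test queries gives system-level figures of the kind quoted in the abstract.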

Conclusions. Our findings suggest that this approach can be a valuable tool for legal professionals and researchers to navigate the complex landscape of Arabic legal information to improve efficiency and accuracy in legal information retrieval.

Keywords

search for documents, NLP, Top2Vec, HDBSCAN, Arabic legal documents, word embeddings, cosine similarity

About the Authors

K. S. Jafar

MIREA – Russian Technological University
Russian Federation

Kamel S. Jafar, Postgraduate Student, Department of Corporate Information Systems, Institute of Information Technologies

Scopus Author ID 57552322300

78, Vernadskogo pr., Moscow, 119454

A. A. Mohammad

HSE University
Russian Federation

Ali A. Mohammad, Master Student, Faculty of Computer Science

11, Pokrovsky bulv., Moscow, 109028

A. H. Issa

Russian Biotechnological University
Russian Federation

Ali H. Issa, Postgraduate Student, Department of Automated Control Systems for Biotechnological Processes

11, Volokolamskoye sh., Moscow, 125080

A. V. Panov

MIREA – Russian Technological University
Russian Federation

Alexander V. Panov, Cand. Sci. (Eng.), Associate Professor, Department of Corporate Information Systems, Institute of Information Technologies

78, Vernadskogo pr., Moscow, 119454


Supplementary files

1. Retrieving dense document regions using spatial clustering of applications based on hierarchical density with noise
Type: Research tools
Date: 2024-10-16



For citations:

Jafar K.S., Mohammad A.A., Issa A.H., Panov A.V. Automating the search for legal information in Arabic: A novel approach to document retrieval. Russian Technological Journal. 2024;12(5):7-16. https://doi.org/10.32362/2500-316X-2024-12-5-7-16. EDN: CBEERK


This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2782-3210 (Print)
ISSN 2500-316X (Online)
