Authors: Sindane, T.; Marivate, V.; Moodley, Avashlin
Date: 2025-12-15 (issued 2025-12)
ISBN: 978-1-0370-5280-4
URI: http://hdl.handle.net/10204/14525

Abstract: Code-switching has become the modus operandi of internet communication in many communities, such as South Africans, who are domestically multilingual. This phenomenon has made processing textual data increasingly complex due to non-standard ways of writing, spontaneous word replacements, and other challenges. Pre-trained multilingual models have shown strong text-processing capabilities in related downstream tasks such as language identification, dialect detection, and language-family discrimination. In this study, we extensively investigate the use of the pre-trained multilingual models AfroXLMR and Serengeti for code-switching detection on five South African languages: Sesotho, Setswana, IsiZulu, IsiXhosa, and English, with English used interchangeably with the other four languages, under various transfer-learning settings. Additionally, we explore modeling known switching pairs within a dataset through explicit cross-lingual embeddings extracted using the projection models VecMap, MUSE, and Canonical Correlation Analysis (CCA). The resulting cross-lingual embeddings are used to replace the embedding layer of a pre-trained multilingual model without additional training. Concretely, our results show that performance gains can be realized (from 59.1% monolingual to 74.1% cross-lingual, and to 90.8% multilingual) by closing the representational gap between the languages of the code-switched dataset with known codes, using cross-lingual representations. Moreover, expanding code-switched datasets with datasets of closely related languages improves code-switching classification, especially in cases with minimal training examples.

Access: Fulltext
Language: en
Keywords: Pre-trained multilingual models; Cross-lingual embeddings; Code-switching detection
Title: Injecting explicit cross-lingual embeddings into pre-trained multilingual models for code switching detection
Type: Conference Presentation
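To illustrate the step the abstract describes (using externally aligned cross-lingual embeddings in place of a pre-trained model's embedding layer, without additional training), the following is a minimal sketch, not the authors' released code. The checkpoint name, the number of labels, and the format of the aligned vectors are assumptions for illustration; the aligned vectors are taken to come from a projection method such as VecMap, MUSE, or CCA.

```python
# Hedged sketch: overwrite rows of a multilingual model's input-embedding matrix
# with externally projected cross-lingual vectors, then freeze that layer.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed AfroXLMR checkpoint; any compatible multilingual encoder would do.
model_name = "Davlan/afro-xlmr-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

# Hypothetical: cross-lingual vectors already aligned with VecMap/MUSE/CCA,
# keyed by token string. Vector dimensionality must match the model's hidden
# size, and only tokens that also exist in the model's (subword) vocabulary
# can be injected directly.
aligned_vectors = {}  # e.g. parsed from a projected .vec file

embedding = model.get_input_embeddings()  # nn.Embedding(vocab_size, hidden_size)
with torch.no_grad():
    for token, vec in aligned_vectors.items():
        token_id = tokenizer.convert_tokens_to_ids(token)
        # Skip tokens the tokenizer maps to <unk>; they have no dedicated row.
        if token_id is not None and token_id != tokenizer.unk_token_id:
            embedding.weight[token_id] = torch.tensor(
                vec, dtype=embedding.weight.dtype
            )

# Freeze the (now cross-lingual) embedding layer so it is used as-is,
# i.e. without further training of these representations.
for param in model.get_input_embeddings().parameters():
    param.requires_grad = False
```

Under this reading, only the embedding rows are swapped; the rest of the encoder and the classification head are left to the usual fine-tuning or inference setup described in the study.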