
Literator: Journal of Literary Criticism, Comparative Linguistics and Literary Studies - Orthographic measures of language distances between the official South African languages


 

Abstract


Two methods for determining the relationships between the eleven official languages of South Africa are described. The first method makes use of n-grams. The confusions that occur in a language identification system provide information on the relationships between the languages. N-gram statistics are computed from text documents and then used as features for classification. We show that the outputs of a validation test can be used to determine how close languages lie to one another. From these measurements we derived a visual representation of the relationships between the languages.
We also used the Levenshtein method to determine the distance between the orthographic transcriptions of words, focusing on the eleven official languages of South Africa. A graphical grouping according to the distances between the different languages again shows the relationships between the languages as well as family groups. Both the dendrograms and the multidimensional scaling indicate particular family groups, and even the finer relationships within these family groups.

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically.
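
As a rough illustration of what n-gram statistics used as features can look like, the following Python sketch counts overlapping character n-grams and compares two count profiles with cosine similarity. This is not the classifier described in the abstract, which derives similarity from classification confusions; the function names, the choice of character trigrams, and the cosine comparison are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngram_counts(text, n=3):
    """Count overlapping character n-grams (n=3 is an assumed default)."""
    # Lowercase and collapse whitespace so spacing differences do not matter.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    # Counter returns 0 for missing keys, so unseen n-grams contribute nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In such a setup, each language would be represented by a profile built from a text sample, and a higher similarity between two profiles would suggest orthographically closer languages.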


We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
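
For readers unfamiliar with the measure, a minimal Python sketch of the Levenshtein distance follows, together with one possible way to average it over word pairs. The normalisation by the longer word's length and the helper `mean_normalised_distance` are assumptions for illustration; the abstract does not specify how distances were aggregated across word lists.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def mean_normalised_distance(word_pairs):
    """Average of distance / longer-word length over (word1, word2) pairs.
    The normalisation is an assumed choice, not taken from the article."""
    scores = [levenshtein(w1, w2) / max(len(w1), len(w2))
              for w1, w2 in word_pairs if w1 or w2]
    return sum(scores) / len(scores) if scores else 0.0


# Illustrative only: two Afrikaans/English word pairs, not data from the study.
example = mean_normalised_distance([("water", "water"), ("huis", "house")])
```

A matrix of such pairwise language distances is the kind of input that hierarchical clustering and multidimensional scaling operate on.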

/content/literat/29/1/EJC62001
2008-04-01
2016-12-09