
Lexikos - Semi-automating the reading programme for a historical dictionary project

Volume 28 Number 1
  • ISSN : 1684-4904
  • E-ISSN: 2224-0039

Abstract

This paper describes the resources and software procedures used or developed in a major enabling step towards the revision of the scholarly reference work A Dictionary of South African English on Historical Principles (DSAE, Silva et al. 1996), namely the semi-automatic generation of a digitally-sourced lexical database on which new and updated dictionary entries will be based; as well as the addition, in parallel, of a new corpus of South African English (SAE) to the project. Drawing on online data sources and an extensive list of known SAE word forms, we have developed a software toolchain to gather, encode, annotate and collate textual sources, producing: (i) a 3.1-billion part-of-speech-annotated corpus of South African English; (ii) a lexical database of illustrative quotations for over 20,000 known SAE word forms, available for selection at the entry-revision stage; and (iii) a list of potential new variant spellings and headword inclusion candidates. These steps replace, where recent electronic sources are concerned, the mechanical aspects of quotation gathering, normally undertaken manually through a reading programme requiring years of teamwork to acquire sufficient coverage (cf. Hicks 2010).
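The quotation-gathering step can be illustrated with a minimal sketch. It is not the project's toolchain: the function names, the record fields (`source`, `date`, `text`), the naive sentence splitter and the sample data are all assumptions made for illustration, and only single-word forms are matched.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def collect_quotations(documents, word_forms):
    """Map each known word form to the sentences in which it occurs."""
    forms = {w.lower() for w in word_forms}
    quotations = defaultdict(list)
    for doc in documents:
        for sentence in split_sentences(doc["text"]):
            # Lowercased tokens, allowing internal hyphens and apostrophes.
            tokens = set(re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower()))
            for form in sorted(tokens & forms):
                quotations[form].append(
                    {"source": doc["source"], "date": doc["date"], "quote": sentence}
                )
    return dict(quotations)


# Hypothetical sample data, for illustration only.
documents = [
    {"source": "example-news-site", "date": "2015-03-02",
     "text": "We had a braai on Sunday. The bakkie broke down on the way home."},
    {"source": "example-blog", "date": "2016-07-19",
     "text": "Nothing beats a braai in summer!"},
]
word_forms = ["braai", "bakkie", "veld"]

result = collect_quotations(documents, word_forms)
```

Each matched sentence is stored with its source and date, so that a dated quotation remains available for selection when an entry is revised. Word forms with no hits, such as `veld` in this sample, are simply absent from the result.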




/content/journal/10520/EJC-146326954e
2018-12-01
2020-02-19
