Repository of colleges and higher education institutions


Title:Multilevel complex systems approaches to computational linguistics : doctoral dissertation
Authors:Ban, Kristina (Author)
Levnajić, Zoran (Mentor)
Boshkoska, Biljana Mileva (Comentor)
Files:.pdf DR_2018_Kristina_Ban.pdf (17,51 MB)
MD5: FB4C4E62DA4D2522F3C9807F2CB9D764
 
Language:English
Work type:Doctoral dissertation
Typology:2.08 - Doctoral Dissertation
Organization:FIŠ - Faculty of Information Studies in Novo mesto
Abstract:Complex systems are omnipresent in nature, society and human culture. The last few decades have seen growing interest in their study, particularly through graph-theoretic methodologies. By identifying a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to a number of disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of major human languages, carried out in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a span of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six well-defined clusters of Wikipedias that share common growth patterns, with their make-up robust against the method used to determine them. Interestingly, the identified clusters were found to have little correlation with the respective language families. Rather, our results suggest that the growth of Wikipedias is primarily governed by an intricate set of other factors, from culture to information literacy. Second, to approach human languages at another, independent level, we gathered a dataset comprising a list of syllables and a list of syllabified words in 10 different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized repositories for each language and benchmarked in the same way. Syllable networks were created by linking pairs of syllables that jointly compose at least one word. We then carried out a systematic network analysis, relying on both standard network analysis methods and more recent techniques, such as k-core analysis and graphlet statistics. 
The research revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with expected differences between the families. Indeed, the structures of syllable networks were found to quantify well the linguistic similarities among these 10 languages, exactly as known from classical linguistics. Most interestingly, we found that the Basque language, whose classification remains unknown to this day, bears a strong resemblance to Latin, at least as far as the syllable network representation is concerned. Earlier stages of this doctoral work involved comparing the performance of network alignment algorithms used in bioinformatics for studying protein networks. Several alignment algorithms were compared by scoring their performance on standard protein datasets. It was found that three algorithms, HUBALIGN, L-GRAAL and NATALIE, regularly produce the most topologically and biologically coherent alignments. Owing to a change of doctoral adviser, this research topic was abandoned in favour of language/syllable networks. In sum, this doctoral work involved two distinct directions of research in network science, one related to developing the methodology of network analysis (alignment algorithms), and the other devoted to extracting new information from specifically designed datasets (syllable networks). Therefore, the original contribution of this work to science includes both theory and methodology. Future research avenues include advancement along both directions, the most interesting being the application of network alignment methods to syllable datasets, which could yield a more precise quantification of the structural differences among syllable networks.
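Illustration (not part of the dissertation): the abstract describes syllable networks in which two syllables are linked when they jointly compose at least one word, and mentions k-core analysis among the techniques applied. The following is a minimal sketch of that idea using only the Python standard library. It assumes that "jointly compose" means any two distinct syllables occurring in the same word; the thesis may use a different rule, such as linking only adjacent syllables. The function names and the toy syllabified words are hypothetical and are not taken from the dissertation's datasets.

```python
from collections import defaultdict
from itertools import combinations


def build_syllable_network(syllabified_words):
    """Build an undirected graph as a dict mapping syllable -> set of neighbours.

    Assumption: two distinct syllables are linked if they occur together
    in at least one word (not only when they are adjacent).
    """
    graph = defaultdict(set)
    for syllables in syllabified_words:
        unique = sorted(set(syllables))
        for syllable in unique:
            graph[syllable]  # make sure isolated syllables are still nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def core_numbers(graph):
    """Return the k-core number of every node.

    Repeatedly removes a node of minimum remaining degree; the core number
    of a node is the largest minimum degree seen up to its removal.
    This simple version is quadratic and meant for small examples only.
    """
    degree = {node: len(neighbours) for node, neighbours in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in graph[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


if __name__ == "__main__":
    # Hypothetical toy input: each word is given as a list of its syllables.
    toy_words = [["ba", "na", "na"], ["na", "no"], ["ba", "no"], ["so", "lo"]]
    network = build_syllable_network(toy_words)
    for syllable, k in sorted(core_numbers(network).items()):
        print(syllable, k)
```

In this toy input the syllables "ba", "na" and "no" form a triangle and therefore lie in the 2-core, while "so" and "lo" share a single link and lie only in the 1-core. For networks of realistic size, a graph library with a linear-time core decomposition would be a more practical choice than this sketch.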
Keywords:computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics
Place of publishing:Novo mesto
Place of performance:Novo mesto
Publisher:[K. Ban]
Year of publishing:2018
Year of performance:2018
Number of pages:XX, 135 pp.
PID:20.500.12556/ReVIS-5369
COBISS.SI-ID:297952768
UDC:004.8:519.765:81'322(043.3)
Publication date in ReVIS:21.12.2018
Views:3878
Downloads:139

Licences

License:CC BY-NC-ND 4.0, Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Link:http://creativecommons.org/licenses/by-nc-nd/4.0/
Description:The most restrictive Creative Commons license. This only allows people to download and share the work for no commercial gain and for no other purposes.
Licensing start date:21.12.2018

Secondary language

Language:Slovenian
Title:Večnivojski kompleksni pristopi k računalniški lingvistiki : doktorska disertacija
Abstract:Complex systems are omnipresent in nature, society and human culture. In the last few decades, interest in studying them has grown, particularly using graph-theoretic methods. By representing the units of a system as nodes and modelling the interactions between units as links, the study of complex networks has spread to numerous disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of the major world languages, carried out in two stages. First, we looked at the growth rate of Wikipedias in 26 different languages over a period of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six clearly defined clusters of Wikipedias that share common growth patterns, and their composition is surprisingly robust with respect to the method used to determine them. Interestingly, the identified clusters correlate very little with the language families of the 26 languages. Instead, our results indicate that the growth of Wikipedias is determined primarily by an intricate set of other factors, from culture to information literacy. In the second stage, we approached the world languages at a new level: we collected datasets with a list of syllables and a list of syllabified words in ten different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized data repositories for each language and harmonized accordingly. Syllable networks were created by representing pairs of syllables that jointly compose at least one word as a linked pair of nodes. We then carried out a systematic network analysis that relied on standard network analysis methods as well as more recent techniques, such as k-core analysis and graphlet statistics. 
The research revealed striking similarities between the architectures of syllable networks belonging to the same language family, along with expected differences between different language families. Most interestingly, the syllable structure of the Basque language, whose classification remains unknown to this day, strongly resembles that of Latin.
Keywords:computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics

