Volume 1, Issue 1, May 2013, Page: 1-6
Rule-Based Sentence Detection Method (RBSDM) for Turkish
Özlem AKTAŞ, Computer Engineering Department, Dokuz Eylul University, Izmir, Turkey
Yalçın ÇEBİ, Computer Engineering Department, Dokuz Eylul University, Izmir, Turkey
Received: Mar. 20, 2013; Published: May 2, 2013
DOI: 10.11648/j.ijll.20130101.11
Abstract
The first step in building a corpus that is representative of a language is identifying sentences, a task that is complicated and hard to solve but essential to corpus generation. Different approaches have been tried for finding sentence boundaries in some languages. For Turkish, the best-known approaches rely on statistics and machine learning. In this study, a rule-based method called “Rule-Based Sentence Detection Method for Turkish (RBSDM)” was developed to determine sentence boundaries in contemporary Turkish, taking into account the agglutinative and rule-based structure of the language. The method was tested on two test sets built from randomly selected columns from two Turkish newspapers. RBSDM determines sentence endings correctly and efficiently in terms of time and other costs, with a success rate between 99.60% and 99.80%.
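The abstract does not list the rules RBSDM applies, so the following is only a generic illustration of what rule-based sentence boundary detection can look like, not the authors' method. The abbreviation list, the two rules, and the function name split_sentences are all assumptions made for this sketch.

```python
import re

# Hypothetical abbreviation list for illustration only; not taken from the paper.
ABBREVIATIONS = {"dr", "prof", "doç", "vb", "vs", "bkz"}

# A candidate boundary is a run of ., ! or ? followed by whitespace or end of text.
# A period inside a number such as "3.5" is never a candidate.
CANDIDATE = re.compile(r"[.!?]+(?=\s|$)")


def split_sentences(text):
    """Split text into sentences using two simple, assumed rules."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end()
        tokens = text[start:match.start()].split()
        previous = tokens[-1] if tokens else ""
        rest = text[end:].lstrip()

        # Assumed rule 1: a period right after a known abbreviation
        # does not end the sentence.
        if match.group() == "." and previous.lower() in ABBREVIATIONS:
            continue

        # Assumed rule 2: the following text must start with an uppercase
        # letter, unless the end of the text has been reached.
        if rest and not rest[0].isupper():
            continue

        sentences.append(text[start:end].strip())
        start = end

    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Ayşe geldi. Saat 3.5 gibi çıktı! İyi misin? evet."):
        print(sentence)
```

A sketch like this would misjudge sentences that begin with a digit or a quotation mark. A production rule set for Turkish would need further rules, which this page does not describe.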
Keywords
Linguistics, Natural Language Processing, Corpus, Turkish, Morphological Analysis, Sentence Boundary Detection
To cite this article
Özlem AKTAŞ, Yalçın ÇEBİ, Rule-Based Sentence Detection Method (RBSDM) for Turkish, International Journal of Language and Linguistics. Vol. 1, No. 1, 2013, pp. 1-6. doi: 10.11648/j.ijll.20130101.11