Course Unit Title: Introduction to Text Mining
Course Unit Code: TBL337
Type of Course Unit: Elective
Level of Course Unit: Bachelor's degree
Year of Study: 3
Semester: Fall
ECTS Credits: 5

Name of Lecturer(s)

Associate Prof. Dr. Zeynep Hilal KİLİMCİ

Learning Outcomes of the Course Unit

1) Understand the concept of text mining and its close relationship with statistical natural language processing (SNLP)
2) Learn statistical inference models
3) Learn text preprocessing methods that improve inference models
4) Learn and apply the basic algorithms used in this field

Program Competencies-Learning Outcomes Relation

Rows: Learning Outcomes. Columns: Program Competencies.

Outcome  1     2     3     4     5     6     7     8     9       10      11
1        High  High  High  Low   Low   High  Low   Low   Middle  Middle  No relation
2        High  High  High  Low   Low   High  Low   Low   Middle  Middle  No relation
3        High  High  High  High  High  High  Low   Low   Middle  Middle  No relation
4        High  High  High  High  High  High  Low   Low   Middle  High    No relation

Mode of Delivery

Face to Face

Prerequisites and Co-Requisites

None

Recommended Optional Programme Components

Probability and Statistics, Data Mining, Introduction to Machine Learning

Course Contents

Introduction to text mining: mining complex text data; introduction to statistical natural language processing (NLP); mathematical foundations; linguistic essentials and corpus-based work; collocations: selection by frequency, hypothesis testing, mutual information; statistical inference: n-gram models over sparse data; preparing data for data mining algorithms; clustering; classification; web page classification.

Weekly Schedule

1) Introduction to Text Mining
2) Introduction to Statistical Natural Language Processing (NLP)
3) Mathematical Foundations: elementary probability theory, essential information theory
4) Linguistic Essentials and Corpus-Based Work: low-level processing of text corpora, tokenization, sentence boundary detection, part-of-speech tagging, stemming (Porter's stemmer algorithm), stop words
5) Collocations: selection of collocations by frequency, hypothesis testing, mutual information
6) Statistical Inference: n-gram Models over Sparse Data. Statistical estimators, combining estimators
7) Statistical Inference: n-gram Models over Sparse Data. Statistical estimators, combining estimators
8) Spelling correction and synonyms: edit distance, Soundex, language detection. Readings: IIR Ch. 3; Techniques for automatically correcting words in text (Kukich 1992); Finding approximate matches in large lexicons (Zobel and Dart 1995); Efficient Generation and Ranking of Spelling Error Corrections (Tillenius); How to write a spelling corrector (Peter Norvig)
9) Preparing data for data mining algorithms: index structures; scoring, term weighting, and the vector space model; tf-idf weighting; the cosine measure
10) Clustering 1: introduction to the problem. Partitioning methods: k-means clustering
11) Clustering 2: hierarchical clustering
12) Classification 1: introduction to text classification, Naive Bayes models, spam filtering
13) Readings: Machine learning in automated text categorization (Sebastiani 2002); A re-examination of text categorization methods (Yang et al. 1999); A Comparison of event models for naive Bayes text classification (McCallum et al. 1998)
14) Classification 2: K Nearest Neighbors, decision boundaries, vector space classification, decision trees, comparative results. Readings: NLP Ch. 16, IIR Ch. 14; Web page classification: Features and algorithms (Qi, Davison 2009); Semi-supervised text classification using EM (Nigam et al., 2006); Transductive SVMs (Joachims, 1999); Link-based classification (Getoor 2005)
15) Review and examples from real-world applications. Term project presentations. Evaluation
16) Review and examples from real-world applications. Term project presentations. Evaluation

Recommended or Required Reading

1- Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze (2003)
2- Introduction to Information Retrieval, Manning, Raghavan and Schütze, Cambridge University Press (2008)
3- Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti (2003)
4- Information Retrieval, C. J. van Rijsbergen

Planned Learning Activities and Teaching Methods

1) Group Study
2) Self Study
3) Project Based Learning


Assessment Methods and Criteria

Contribution of Midterm Examination to Course Grade

40%

Contribution of Final Examination to Course Grade

60%

Total

100%

Language of Instruction

Turkish

Work Placement(s)

Not Required