Information Retrieval - Syllabus

Information Retrieval (IR, course code CSC413) is offered by the CSIT department of Tribhuvan University under the 2074 syllabus. The course combines theoretical frameworks with practical sessions. Assessment follows a 60 + 20 + 20 marks scheme, and students must meet the passing threshold.

The course carries 3 credit hours. Alongside the theory, students work through practical sessions that build skills applicable to real-world retrieval problems.


Course Description:

This course familiarizes students with the core concepts of information retrieval, focusing on clustering, classification, search engines, ranking, and query operations.

Course Objective:

The main objective of this course is to provide knowledge of different information retrieval techniques so that students will be able to develop an information retrieval engine.

Units

Introduction to IR and Web Search

Introduction, Data vs Information Retrieval, Logical view of the documents, Architecture of IR System, Web search system, History of IR, Related areas



Text properties, operations and preprocessing

Tokenization, Text Normalization, Stop-word removal, Morphological Analysis, Word Stemming (Porter Algorithm), Case folding, Lemmatization, Word statistics (Zipf's law, Heaps' law), Index term selection, Inverted indices, Positional Inverted index, Natural Language Processing in Information Retrieval, Basic NLP tasks (POS tagging, shallow parsing)
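
As an illustration of how tokenization, case folding, stop-word removal, and a positional inverted index fit together, here is a minimal sketch. The stop-word list, the function names, and the sample documents are assumptions made for this example and are not part of the syllabus.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; real systems use much larger lists.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Case-fold the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each non-stop-word term to {doc_id: [token positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            if term in STOP_WORDS:
                continue  # positions still count the skipped stop words
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

docs = {1: "The cat sat in the hat", 2: "The hat of the cat"}
index = build_positional_index(docs)
print(index["cat"])
```

Positions are counted before stop-word removal here, so phrase and proximity queries can still reason about the original word order. Stemming or lemmatization would be applied to each token before it is indexed.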



Basic IR Models

Classes of Retrieval Model, Boolean model, Term weighting mechanism (TF, IDF, TF-IDF weighting), Cosine Similarity, Vector space model, Probabilistic models (the binary independence model, Language models, KL-divergence, Smoothing), Non-Overlapping Lists, Proximal Nodes Model
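
The TF-IDF weighting and cosine similarity topics above can be sketched as follows. This assumes raw term frequency and the unsmoothed idf = log(N / df) variant. Other weighting schemes exist, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["banana", "banana", "date"],
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]))
```

In the vector space model a query is weighted the same way as a document, and documents are ranked by their cosine similarity to the query vector.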



Evaluation of IR

Precision, Recall, F-Measure, MAP (Mean Average Precision), DCG (Discounted Cumulative Gain), Known-item Search Evaluation
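
A small sketch of how these measures might be computed for a single query. The function names and the sample ranking are assumptions for illustration, and the DCG shown uses the common gain / log2(rank + 1) form, which is only one of several variants.

```python
import math

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def dcg(gains):
    """Discounted cumulative gain for graded gains listed in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
print(precision_recall_f1(ranking, relevant))
print(average_precision(ranking, relevant))
```

MAP is the mean of `average_precision` over all queries in a test collection.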



Query Operations and Languages

Relevance feedback and pseudo relevance feedback, Query expansion (with a thesaurus or WordNet and correlation matrix), Spelling correction (Edit distance, k-gram indexes, Context-sensitive spelling correction), Query languages (Single-Word Queries, Context Queries, Boolean Queries, Structural Queries, Natural Language)
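
For the spelling correction topic, edit distance can be illustrated with the standard Levenshtein dynamic program. The `suggest` helper and the tiny vocabulary are hypothetical. A practical corrector would first narrow the candidates with a k-gram index instead of scanning the whole vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def suggest(word, vocabulary):
    """Return the vocabulary entry closest to word (illustrative helper)."""
    return min(vocabulary, key=lambda w: edit_distance(word, w))

print(edit_distance("kitten", "sitting"))
print(suggest("retreival", ["retrieval", "relevance", "recall"]))
```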



Web Search

Search engines (working principle), Spidering (Structure of a spider, Simple spidering algorithm, Multithreaded spidering, Bot), Directed spidering (Topic directed, Link directed), Crawlers (Basic crawler architecture), Link analysis (HITS, Page ranking), Query log analysis, Handling the “invisible” Web, Snippet generation, CLIR (Cross Language Information Retrieval)
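
As one example from the link analysis topic, page ranking can be sketched with power iteration. The damping factor of 0.85, the fixed iteration count, the uniform handling of dangling pages, and the three-page graph are all assumptions for this example. The sketch also assumes every link target appears as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration page ranking over {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the teleportation share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

The scores sum to 1, so they can be read as the long-run probability that a random surfer is on each page.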



Text Categorization

Categorization, Learning for Categorization, General learning issues, Learning algorithms (naïve Bayes, Decision tree, KNN, Rocchio)
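
The naïve Bayes approach to text categorization could look roughly like the sketch below, which assumes a multinomial model with add-one (Laplace) smoothing. The function names and the four training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Collect class and term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(tokens, model):
    """Return the label with the highest log posterior score."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((term_counts[label][t] + 1)
                              / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["goal", "match", "team"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["team", "win", "match"], "sports"),
    (["party", "minister", "vote"], "politics"),
]
model = train_naive_bayes(training)
print(classify(["match", "team", "win"], model))
```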


Text Clustering

Clustering, Clustering algorithms (Hierarchical clustering, k-means, k-medoid, Expectation maximization (EM), Text shingling)
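
Text shingling can be illustrated with word-level k-shingles compared by Jaccard similarity. The shingle size of 3 and the sample sentences are assumptions for this sketch.

```python
def shingles(text, k=3):
    """Return the set of word-level k-shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = shingles("the quick brown fox jumps")
s2 = shingles("the quick brown fox sleeps")
print(jaccard(s1, s2))
```

Documents whose shingle sets have a high Jaccard similarity are near-duplicates and can be placed in the same cluster.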


Recommender System

Personalization, Collaborative filtering recommendation, Content-based recommendation



Question Answering

Information bottleneck, Information Extraction, Ambiguities in IE, Architecture of QA system, Question processing, Paragraph retrieval, Answer processing



Advanced IR Models

Latent Semantic Indexing (LSI), Singular value decomposition, Latent Dirichlet Allocation, Efficient string searching, Knuth-Morris-Pratt, Boyer-Moore family, Pattern matching
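
For the string searching topics, the Knuth-Morris-Pratt algorithm can be sketched as follows. The function names are chosen for this example.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # continue, allowing overlapping matches
    return matches

print(kmp_search("abababa", "aba"))
```

Because the text pointer never moves backwards, the search runs in time linear in the combined length of the text and the pattern.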


Laboratory Works:

The laboratory work should cover all the features mentioned in the course and include at least the following tasks:

  1. Write a program to demonstrate the Boolean Retrieval Model and the Vector Space Model
  2. Tokenize the words of large documents according to type and token
  3. Write a program to find the similarity between documents
  4. Implement the Porter stemmer
  5. Build a spider that tracks only the links of Nepali documents
  6. Group online news into different categories such as sports, entertainment, and politics
  7. Build a recommender system for an online music store
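
As a possible starting point for the first task, the Boolean retrieval model can be sketched with set operations over an inverted index. The function names and the three sample documents are hypothetical, and tokenization is a plain whitespace split with no stemming.

```python
def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents that contain every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents that contain at least one term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def boolean_not(index, term, all_ids):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term.lower(), set())

docs = {
    1: "information retrieval models",
    2: "boolean retrieval model",
    3: "vector space model",
}
index = build_index(docs)
print(boolean_and(index, "retrieval", "model"))
```

Without stemming, "models" and "model" are different terms, so document 1 does not match the query for "model". Running the Porter stemmer from task 4 over the tokens before indexing would change that.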