Information Retrieval - Syllabus

Embark on a profound academic exploration as you delve into the Information Retrieval course within the distinguished Tribhuvan University's CSIT department. Aligned with the 2065 Syllabus, this course (CSC-405) merges theoretical frameworks with practical sessions, ensuring a comprehensive understanding of the subject. Assessment is based on a 60 marks system with a set passing threshold, which pushes students to strive for excellence and a deeper grasp of the course content.

This 3 credit-hour journey unfolds as a holistic learning experience, bridging theory and application. Beyond theoretical comprehension, students actively engage in practical sessions, acquiring valuable skills for real-world scenarios. Immerse yourself in this well-structured course, where each element, from the course description to interactive sessions, is meticulously crafted to shape a well-rounded and insightful academic experience.


Course Synopsis: Advanced aspects of Information Retrieval and Search Engine

Goal: To study advanced aspects of information retrieval and the working principles of search engines, encompassing the principles, research results, and commercial applications of current technologies.

Units

1. Introduction

Introduction, History of Information Retrieval, The retrieval process, Block diagram and architecture of IR System, Web search and IR, Areas and role of AI for IR


2. Basic IR Models

Introduction, Taxonomy of information retrieval models, Document retrieval and ranking, A formal characterization of IR models, Boolean retrieval model, Vector-space retrieval model, Probabilistic model, Text-similarity metrics: TF-IDF (term frequency/inverse document frequency) weighting and cosine similarity
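
As an illustration of the TF-IDF weighting and cosine similarity listed in this unit, a minimal Python sketch follows. The function names, the unsmoothed log(N/df) form of IDF, and the toy documents are assumptions chosen for illustration; the syllabus does not prescribe a particular variant.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term frequency times unsmoothed inverse document frequency.
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, already tokenized.
docs = [
    "information retrieval with vector space".split(),
    "vector space retrieval model".split(),
    "music store recommender".split(),
]
vecs = tf_idf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing three terms
sim_02 = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Storing each vector as a dictionary keeps only the non-zero weights, which is one simple way to get the sparse-vector processing mentioned in the next unit.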


3. Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Simple tokenizing, Word tokenization, Text Normalization, Stop-word removal, Word Stemming (Porter Algorithm), Case folding, Lemmatization, Inverted indices (Indexing architecture), Efficient processing with sparse vectors, Sentence segmentation and Decision Trees
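
The tokenizing, case folding, stop-word removal, and inverted index topics above can be sketched together in a few lines of Python. This is only a sketch under stated assumptions: the stop-word list is a tiny hypothetical one, the tokenizer handles ASCII letters and digits only, and the helper names are invented for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

def tokenize(text):
    """Case-fold, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, *terms):
    """Boolean AND query: documents that contain every given term."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = [
    "The spider crawls the Web.",
    "A crawler is a kind of spider.",
    "Stemming and case folding normalize text.",
]
index = build_inverted_index(docs)
spider_docs = index["spider"]
both = boolean_and(index, "spider", "crawler")
```

Stemming and lemmatization are left out here; a stemmer would slot in as one more step inside `tokenize`.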


4. Experimental Evaluation of IR

Relevance and retrieval, Performance metrics, Basic measures of text retrieval (recall, precision, and F-measure)
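
The three basic measures named above can be computed from a set of retrieved document ids and a set of relevant ones. The sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the function name and the example ids are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four documents retrieved, six relevant in the collection, two in common.
p, r, f = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
```

With two of the four retrieved documents relevant and two of the six relevant documents found, precision is 2/4, recall is 2/6, and F1 works out to 0.4.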


5. Query Operations and Languages

Relevance feedback and pseudo-relevance feedback, Query expansion/reformulation (with a thesaurus or WordNet, and techniques such as spelling correction), Query languages (single-word queries, context queries, Boolean queries, natural language)


6. Text Representation

Word statistics (Zipf's law), Morphological analysis, Index term selection, Using thesauri, Metadata, Text representation using markup languages (SGML, HTML, XML)


7. Search Engine

Search engines (working principle), Spidering (Structure of a spider, Simple spidering algorithm, Multithreaded spidering, Bot), Directed spidering (Topic directed, Link directed), Crawlers (Basic crawler architecture), Link analysis (e.g. hubs and authorities, Page ranking, Google PageRank), Shopping agents
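
For the link-analysis topic, the following is a minimal power-iteration sketch of PageRank in Python. The damping factor of 0.85, the fixed iteration count, the uniform treatment of pages without outgoing links, and the four-page toy graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a {page: [linked pages]} graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

In this toy graph page C collects links from three pages and ends up with the highest score, while D, which nobody links to, keeps only the random-jump share.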


8. Text Categorization and Clustering

Categorization algorithms (Rocchio; naive Bayes; decision trees; and nearest neighbor), Clustering algorithms (agglomerative clustering; k-means; expectation maximization (EM)), Applications to information filtering and organization
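
Of the clustering algorithms listed, k-means is the simplest to sketch. The Python example below assumes that the initial centroids are supplied by the caller (real implementations usually pick them at random), uses squared Euclidean distance, and runs a fixed number of iterations; the two-dimensional points stand in for document vectors.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical toy data: two obvious groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```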


9. Recommender Systems

Personalization, Collaborative filtering recommendation, Content-based recommendation
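
Collaborative filtering can be illustrated with a small user-based sketch in Python: a missing rating is predicted as the similarity-weighted average of other users' ratings for the same item. The cosine similarity over co-rated items, the user names, and the ratings are all hypothetical choices for illustration; other similarity measures (for example Pearson correlation) are equally common.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two users' {item: rating} dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_sim(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None  # None when no similar user rated it

# Hypothetical ratings on a 1-5 scale for an online music store.
ratings = {
    "asha": {"folk": 5, "rock": 1},
    "bikram": {"folk": 5, "rock": 1, "jazz": 4},
    "chandra": {"folk": 1, "rock": 5, "jazz": 1},
}
pred = predict(ratings, "asha", "jazz")
```

Because asha's tastes match bikram's much more closely than chandra's, the predicted jazz rating lands nearer to bikram's 4 than to chandra's 1.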


10. Information Extraction and Integration

Information extraction and applications, Extracting data from text, Evaluating IE Accuracy, XML and Information Extraction, Semantic web (purpose, Relation to hypertext page), Collecting and integrating specialized information on the web.


11. Advanced IR Models with Indexing and Searching Text

Probabilistic models, Generalized Vector Space Model, Latent Semantic Indexing (LSI), Efficient string searching, Pattern matching


12. Multimedia IR

Introduction, Multimedia data support in commercial DBMSs, Query languages, Trends and research issues


Lab works

The laboratory work should cover all the features mentioned in the course.

Samples

1. Program to demonstrate the Boolean Retrieval Model and Vector Space Model

2. Program to find the similarity between documents

3. Tokenize the words of large documents and count them by type and token

4. Segment the documents according to sentences

5. Implement Porter stemmer

6. Try to build a stemmer for the Nepali language

7. Build a spider that follows only links to Nepali documents

8. Group online news into different categories such as sports, entertainment, and politics

9. Build a recommender system for an online music store