Information Retrieval - Syllabus

Course Overview and Structure

Delve into the Information Retrieval course (CSC-405) in the CSIT department of Tribhuvan University. Aligned with the 2065 Syllabus, the course combines theoretical frameworks with practical sessions to give a comprehensive understanding of the subject. Assessment is based on a 60-mark system with a minimum passing mark.

This 3-credit-hour course bridges theory and application. Alongside the theoretical material, students take part in practical sessions and build skills for real-world scenarios.


Course Synopsis: Advanced aspects of Information Retrieval and Search Engines

Goal: To study advanced aspects of information retrieval and the working principles of search engines, encompassing the principles, research results, and commercial applications of current technologies.

Units

Key Topics

  • Introduction to Computers
    IN-01

    An overview of computers and their significance in today's world. This topic sets the stage for understanding the basics of computers.

  • Digital and Analog Computers
    IN-02

    Understanding the difference between digital and analog computers, their characteristics, and applications.

  • Characteristics of Computers
    IN-03

    Exploring the key characteristics of computers, including input, processing, storage, and output.

  • History of Computers
    IN-04

    A brief history of computers, from their inception to the present day, highlighting key milestones and developments.

  • Generations of Computers
    IN-05

    Understanding the different generations of computers, including their features, advantages, and limitations.

  • Classification of Computers
    IN-06

    Categorizing computers based on their size, functionality, and application, including desktops, laptops, and mobile devices.

Key Topics

  • Introduction to IR Models
    2.1

    Overview of information retrieval models and their significance in IR systems.

  • Taxonomy of IR Models
    2.2

    Categorization of information retrieval models and their relationships.

  • Document Retrieval and Ranking
    2.3

    The process of retrieving and ranking documents based on relevance to a query.

  • Formal Characterization of IR Models
    2.4

    Mathematical representation of IR models and their underlying assumptions.

  • Boolean Retrieval Model
    2.5

    A model that retrieves documents based on exact matching of query terms.

  • Vector-Space Retrieval Model
    2.6

    A model that represents documents and queries as vectors in a high-dimensional space.

  • Probabilistic Retrieval Model
    2.7

    A model that estimates the probability of a document being relevant to a query.

  • Text-Similarity Metrics
    2.8

    Measures of similarity between documents and queries, including TF-IDF and cosine similarity.
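
The vector-space model and the text-similarity metrics listed above can be illustrated with a small sketch. It assumes one common TF-IDF variant (term frequency normalized by document length, times log(N/df)); other weightings exist, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection.
docs = [
    "information retrieval ranks documents".split(),
    "boolean retrieval matches documents exactly".split(),
    "music store recommender".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shares "retrieval" and "documents"
print(cosine(vecs[0], vecs[2]))  # no shared terms
```

Documents that share weighted terms score above zero, while documents with no terms in common score exactly zero.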

Key Topics

  • Memory Read
    BA-01

    Memory Read operation involves retrieving data from memory locations. It is a fundamental operation in microprocessor-based systems.

  • Memory Write
    BA-02

    Memory Write operation involves storing data in memory locations. It is a crucial operation in microprocessor-based systems.

  • I/O Read
    BA-03

    I/O Read operation involves retrieving data from input/output devices. It enables the microprocessor to interact with the external environment.

  • I/O Write
    BA-04

    I/O Write operation involves sending data to input/output devices. It enables the microprocessor to interact with the external environment.

  • Direct Memory Access
    BA-05

    Direct Memory Access (DMA) is a technique that allows peripheral devices to access system memory directly, reducing the microprocessor's workload.

  • Interrupt
    BA-06

    An interrupt is a signal to the microprocessor that an event has occurred, requiring immediate attention. It enables the microprocessor to handle asynchronous events.

  • Types of Interrupts
    BA-07

    There are different types of interrupts, including maskable and non-maskable interrupts, which vary in their priority and handling by the microprocessor.

  • Interrupt Masking
    BA-08

    Interrupt Masking is a technique that enables the microprocessor to temporarily ignore or mask interrupts, allowing it to focus on high-priority tasks.

  • Non-Overlapping Lists
    BA-09

    The non-overlapping lists model is a structured text retrieval model that divides each document's text into non-overlapping regions (for example chapters, sections, or paragraphs) and collects them in lists that can be indexed and searched.

  • Proximal Nodes Model
    BA-10

    The proximal nodes model is a structured text retrieval model that defines independent hierarchical indexing structures over the same document text, so that queries can combine content and structure.

  • Performing CDB and PDB Flashback
    BA-11

    Understanding flashback technology, including performing flashback on Container Database (CDB) and Pluggable Database (PDB).

Relevance and Retrieval, performance metrics, Basic Measures of text retrieval (Recall, Precision and F-measure)
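
The basic measures named above can be computed directly from a set of retrieved documents and a set of relevant documents. The following is a minimal sketch; the function name and document ids are hypothetical, and the F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and balanced F-measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical query result: 4 documents retrieved, 3 documents relevant.
p, r, f = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(p, r, f)
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents are retrieved (recall 2/3).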

Key Topics

  • Query Processing
    QU-1

    Concept of query processing, including the steps involved in processing a query and the role of the query processor.

  • Query Trees and Heuristics
    QU-2

    Query trees and heuristics for query optimization, including the use of query trees to represent queries and heuristics to guide optimization.

  • Query Execution Plans
    QU-3

    Choice of query execution plans, including the factors that influence the choice of plan and the importance of plan selection.

  • Cost-Based Optimization
    QU-4

    Cost-based optimization, including the use of cost estimates to guide optimization and the role of cost-based optimization in query processing.

Word statistics (Zipf's law), Morphological analysis, Index term selection, Using thesauri, Metadata, Text representation using markup languages (SGML, HTML, XML)
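
Zipf's law states that a term's frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays roughly constant on a large corpus. A minimal sketch for building the rank-frequency table follows; the function name and the tiny sample text are hypothetical, and a text this small only shows the mechanics, not the law itself.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(counts, start=1)]

# Hypothetical sample text; use a large corpus to observe the law.
text = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(text.split())
for row in rows:
    print(row)
```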

Search engines (working principle), Spidering (Structure of a spider, Simple spidering algorithm, Multithreaded spidering, Bot), Directed spidering (Topic directed, Link directed), Crawlers (Basic crawler architecture), Link analysis (e.g. hubs and authorities, Page ranking, Google PageRank), Shopping agents
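
As an illustration of the link-analysis topic above, the following is a minimal power-iteration sketch of PageRank. It assumes a damping factor of 0.85, a fixed number of iterations, and an even redistribution of rank from pages with no outgoing links; the function name and the four-page graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B, and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

The scores form a probability distribution over pages, and the page with the most incoming rank (C in this graph) ends up on top.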

Categorization algorithms (Rocchio; naive Bayes; decision trees; and nearest neighbor), Clustering algorithms (agglomerative clustering; k-means; expectation maximization (EM)), Applications to information filtering and organization

Personalization, Collaborative filtering recommendation, Content-based recommendation
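
A minimal sketch of user-based collaborative filtering follows. It assumes cosine similarity between users' rating vectors and a similarity-weighted average for predictions, which is only one of several possible designs; the user names, item names, and ratings are hypothetical.

```python
import math

def similarity(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    dot = sum(r * b[i] for i, r in a.items() if i in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(ratings, user, k=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], their)
        if sim <= 0:
            continue
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:k]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "asha": {"song_a": 5, "song_b": 4},
    "bikash": {"song_a": 5, "song_b": 5, "song_c": 5, "song_d": 1},
    "chandra": {"song_a": 1, "song_b": 1, "song_d": 5},
}
print(recommend(ratings, "asha"))
```

Because "asha" rates songs much like "bikash", items that "bikash" rated highly are ranked first. A content-based recommender would compare item descriptions instead of other users' ratings.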

Information extraction and applications, Extracting data from text, Evaluating IE Accuracy, XML and Information Extraction, Semantic web (purpose, Relation to hypertext page), Collecting and integrating specialized information on the web.

Probabilistic models, Generalized Vector Space Model, Latent Semantic Indexing (LSI), Efficient string searching, Pattern matching

Introduction, multimedia data support in commercial DBMSs, Query languages, Trends and research issues

Lab works

The laboratory work should cover all the features mentioned in the course.

Samples

1. Program to demonstrate the Boolean Retrieval Model and Vector Space Model

2. Program to find the similarity between documents

3. Tokenize large documents and count both word types and tokens

4. Segment documents into sentences

5. Implement the Porter stemmer

6. Try to build a stemmer for the Nepali language

7. Build a spider that follows only links to Nepali documents

8. Group online news into categories such as sports, entertainment, and politics

9. Build a recommender system for an online music store
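
As a possible starting point for sample 1, the Boolean retrieval model can be sketched with an inverted index and set operations. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the function names and the three sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents that contain every given term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents that contain at least one given term."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, term, all_ids):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical document collection.
docs = {
    1: "Nepali news about sports",
    2: "Nepali music store",
    3: "Sports news and politics",
}
index = build_index(docs)
print(boolean_and(index, "nepali", "news"))
print(boolean_or(index, "music", "politics"))
print(boolean_and(index, "sports") & boolean_not(index, "nepali", docs))
```

A document either matches a Boolean query or it does not; ranking by degree of relevance is what the vector-space model adds.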