
Search Engine Technology

This unit is about Web Search & Text Analysis.

Weeks 1-6: Basic concepts for Information Retrieval (IR) and Web Search.
Weeks 7-12: Text Analysis & NLP
  • Supervised text analysis
  • Unsupervised text analysis

Textbook

Part One: Web Search & IR

  • Architecture of a search engine
  • Basic concepts for text processing
  • Information Retrieval
  • Evaluate search results & IR models

Part Two: Text Analysis

  • Supervised methods:
    • Information filtering
    • Text classification
    • Relevance discovery
  • Unsupervised methods:
    • Text feature selection
    • Topic modelling
    • Sentiment analysis
    • Document summarization

Why Do We Care?

  1. More than 80% of data contains a large amount of knowledge that is still waiting to be extracted.
  2. There are many different types of data. They extend beyond structured data to include unstructured data:
    • text
    • audio
    • video
    • log files

HTML vs. XML

HTML is a language for marking up text for presentation.

XML (eXtensible Markup Language) is a language for describing data/content. In other words, it does not describe how to present that content. This is what makes Internet data machine-readable.
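
The difference can be sketched in a few lines of Python. This is only an illustration, not part of the unit material: the record, the tag names (book, title, year) and the helper read_book are all made up for the example. It uses the standard library's xml.etree.ElementTree to pull typed fields out of an XML record, which works because the tags name the data itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record: the tag names say WHAT each value is.
XML_DOC = """
<book>
  <title>Example Title</title>
  <year>2020</year>
</book>
"""

# Hypothetical HTML for the same information: the tags only say
# HOW to display the text (bold, italic), not what it means.
HTML_DOC = "<p><b>Example Title</b> <i>2020</i></p>"


def read_book(xml_text):
    """Parse the XML record and return its fields as a dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title"),
        "year": int(root.findtext("year")),
    }


book = read_book(XML_DOC)
print(book)  # {'title': 'Example Title', 'year': 2020}
```

A program reading HTML_DOC would have to guess that the bold text is a title and the italic text is a year, whereas the XML version states it directly.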

Appendix

Readings

Weekly Schedule

