
Search Engine Technology

This unit is about Web Search & Text Analysis.

Weeks 1-6: Basic concepts for Information Retrieval (IR) and Web Search
Weeks 7-12: Text Analysis & NLP
  • Supervised text analysis
  • Unsupervised text analysis

Textbook

Part One: Web Search & IR

  • Architecture of a search engine
  • Basic concepts for text processing
  • Information Retrieval
  • Evaluate search results & IR models

Part Two: Text Analysis

  • Supervised methods:
    • Information filtering
    • Text classification
    • Relevant discovery
  • Unsupervised methods:
    • Text feature selection
    • Topic modelling
    • Sentiment analysis
    • Document summarization

Why Do We Care?

  1. More than 80% of data contains a large amount of knowledge that is still waiting to be extracted.
  2. There are many different types of data. They extend beyond structured data to include unstructured data such as:
    • text
    • audio
    • video
    • log files

HTML vs. XML

HTML is a language for marking up text for presentation.

XML (eXtensible Markup Language) is a language for describing data/content; it does not describe how that content should be presented. This is what makes Internet data machine-readable.
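
The contrast can be illustrated with a minimal Python sketch. The record, its tag names (`book`, `title`, `year`), its values, and the helper `read_book` are all hypothetical and chosen only for illustration. The sketch uses the standard-library `xml.etree.ElementTree` parser, and the HTML fragment is deliberately written to be well-formed so the same parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record: the tags name what each value IS.
xml_doc = """
<book>
  <title>Example Title</title>
  <year>2020</year>
</book>
"""

# Hypothetical HTML fragment with the same text: the tags only say
# how to DISPLAY each value (bold, italic), not what it means.
html_doc = "<p><b>Example Title</b> <i>2020</i></p>"


def read_book(xml_text):
    """Return the fields of a <book> record as a dict keyed by tag name."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}


book = read_book(xml_doc)
print(book["title"], book["year"])  # prints: Example Title 2020

# Parsing the HTML fragment yields presentation tags, which give a
# program no hint about which value is the title and which is the year.
html_tags = [child.tag for child in ET.fromstring(html_doc)]
print(html_tags)  # prints: ['b', 'i']
```

With the XML version, a program can look up a field by name; with the HTML version, it would have to guess the meaning from the formatting.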

Appendix

Readings

Weekly Schedule

[Image: weekly schedule]

Vocabulary

prong

noun
  1. A thin, pointed, projecting part, as of an antler or a fork or similar tool. A tine.
  2. A branch; a fork.
  3. The penis.
verb
  1. To pierce or poke with, or as if with, a prong.

antler

noun
  1. A branching and bony structure on the head of deer, moose and elk, normally in pairs. They are grown and shed each year. (Compare with horn, which is generally not shed.)