Search Engine Technology
This unit is about Web Search & Text Analysis.
| Weeks | Content |
|---|---|
| Weeks 1-6 | Basic concepts for Information Retrieval (IR) and Web Search |
| Weeks 7-12 | Text Analysis & NLP |
📄️ Information Retrieval
We discuss "structured" and "unstructured" data, the applications and tasks that search engines can perform, and the main issues associated with information retrieval and search engines.
📄️ text-statistics
📄️ Text Processing
work in progress
📄️ Information Extraction
work in progress
📄️ Abstract Model of Ranking
abstract-model-of-ranking
📄️ A More Concrete Ranking Model
a-more-concrete-model-of-ranking
🗃️ Inverted Indexes
4 items
📄️ Auxiliary Structures
Inverted lists are usually stored together in a single file for efficiency.
📄️ index-construction
📄️ Query Processing
Explore query processing techniques: document-at-a-time and term-at-a-time.
Part One: Web Search & IR
- Architecture of a search engine
- Basic concepts for text processing
- Information Retrieval
- Evaluate search results & IR models
Part Two: Text Analysis
- Supervised methods:
- Information filtering
- Text classification
- Relevance discovery
- Un-supervised methods:
- Text feature selection
- Topic modelling
- Sentiment analysis
- Document summarization
Why Do We Care?
- More than 80% of data contains a large amount of knowledge that is still waiting to be extracted.
- There are many different types of data. They extend beyond structured data to include unstructured data such as:
- text
- audio
- video
- log files
HTML vs. XML
HTML is a language for marking up text for presentation.
XML (eXtensible Markup Language) is a language for describing data and content. In other words, it does not describe how the data should be presented. This makes data on the Internet machine-readable.
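The difference can be illustrated with a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The two snippets, the `<book>` record, and the `parse_book` helper are hypothetical examples made up for illustration, not part of the unit material. The HTML tags only say how the text should look, while the XML tags name what each piece of data is, so a program can read the fields directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML fragment: tags describe presentation (bold, italic).
html_snippet = "<p><b>Search Engines</b> by <i>A. Author</i>, 2010</p>"

# Hypothetical XML fragment: tags describe the data itself.
xml_snippet = """<book>
  <title>Search Engines</title>
  <author>A. Author</author>
  <year>2010</year>
</book>"""


def parse_book(xml_text: str) -> dict:
    """Return the child elements of a <book> record as a field -> value dict."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


# The XML tag names double as field names a program can rely on.
book = parse_book(xml_snippet)
print(book["title"], book["year"])

# The HTML fragment happens to be well-formed, so it parses too, but its
# tag names only reveal formatting, not which part is the title or author.
html_tags = [element.tag for element in ET.fromstring(html_snippet)]
print(html_tags)
```

Real-world HTML is often not well-formed XML, so parsing it with `ElementTree` as done here is only safe for a tidy fragment like this one.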
Appendix
Readings
Weekly Schedule
Vocabulary