11642 Lecture 01
Note Information
Title | Introduction |
---|---|
Course Name | 11642: Search Engines |
Lecturer | Jamie Callan |
Date | 01/17/2023 |
Note Author | Ziang Zhou |
Description | Course Logistics, Heaps' Law and Zipf's Law |
Search Concept: Diversity
Not everyone agrees on which results should top a given query. For instance, a search for "Michael Jordan" could mean the basketball player or the machine-learning researcher, so a good result list covers multiple interpretations.
Administrative Info
- Office Hours
Day | Time | Room | Instructor |
---|---|---|---|
Monday | 4:00 - 5:30 | GHC 5417 | Meghna |
Tuesday | 1:00 - 2:30 | GHC 5417 | Guanzhong |
Wednesday | 2:00 - 3:30 | GHC 5417 | Anubha |
Thursday | 2:00 - 3:30 | GHC 5417 | Siddharth |
Friday | 4:00 - 5:30 | GHC 5417 | Raghav |
- Textbook: Introduction to Information Retrieval
No Java pickup
This course is not the place to learn Java. If you don't already know it, find a Java crash-course tutorial and check whether you can pick the language up quickly.
Two laws describe how words are used in human language
Heaps' Law
Heaps' Law: the size of the vocabulary
Heaps’ Law is a statistical law that describes the relationship between the number of distinct words (vocabulary size) and the total number of words in a text corpus. It states that the vocabulary size, $V$, of a text corpus grows at a slower rate than the number of words, $N$, in the corpus.
$$V=KN^{\beta}$$
Typically there are two ways to estimate the parameters $K$ and $\beta$:
- Use a small set to fit parameters like $K$ and $\beta$
- Maximum likelihood estimation (MLE)
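The first approach above can be sketched as follows. This is a minimal illustration, not anything from the lecture: it fits $K$ and $\beta$ by ordinary least squares in log space, using a small set of hypothetical $(N, V)$ measurements, since $V = KN^{\beta}$ becomes the line $\log V = \log K + \beta \log N$.

```python
import math

# Hypothetical observations: (total words N, distinct words V) measured
# on increasingly large samples of some corpus.
samples = [(1_000, 400), (10_000, 2_300), (100_000, 13_000), (1_000_000, 72_000)]

# Heaps' Law V = K * N^beta is linear in log space:
#   log V = log K + beta * log N,
# so fit (log K, beta) with ordinary least squares.
xs = [math.log(n) for n, _ in samples]
ys = [math.log(v) for _, v in samples]
m = len(samples)
mx, my = sum(xs) / m, sum(ys) / m
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
K = math.exp(my - beta * mx)

print(f"K = {K:.2f}, beta = {beta:.2f}")
# Extrapolate to a larger corpus:
N_new = 10_000_000
print(f"Predicted vocabulary size for N = {N_new}: {K * N_new ** beta:.0f}")
```

For natural-language corpora, $\beta$ typically lands between 0.4 and 0.6; the made-up numbers above give a larger exponent, which only matters for the mechanics of the fit.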
Zipf’s Law
Zipf’s Law: the term frequency distribution
Zipf’s Law is a statistical law that describes the distribution of word frequencies in a text corpus. It states that the frequency of a word, $f$, is inversely proportional to its rank, $r$, in the frequency distribution of words.
$$f=\frac{k}{r}$$ where $k$ is a constant. Zipf's Law relates a term's frequency to its rank.
Empirical Observation: $$P(t_R)=\frac{ctf_{t_R}}{N}\approx\frac{A}{R}$$ where $A\approx 0.1$
Therefore, the predicted probability of the most frequent word is $P(t_1)=\frac{0.1}{1}=0.1$, and that of the second most frequent word is $P(t_2)=\frac{0.1}{2}=0.05$.
- (tf stands for term frequency)
- (ctf stands for collection term frequency)
Rank $\times$ Frequency = Constant (Constant = 0.1 $\times$ N), therefore,
- Rank = $\frac{0.1 \times N}{\text{Frequency}}$
- Frequency = $\frac{0.1 \times N}{\text{Rank}}$
where $N$ stands for the total number of term occurrences in the collection.
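The rank/frequency relations above can be checked with a short sketch. The collection size $N$ here is a hypothetical value; $A \approx 0.1$ is the empirical constant from the lecture.

```python
# Zipf's Law predictions with A ~= 0.1 (empirical constant from the lecture).
# N is a hypothetical total number of term occurrences in a collection.
N = 1_000_000
A = 0.1

def predicted_prob(rank):
    """P(t_R) ~= A / R."""
    return A / rank

def predicted_freq(rank):
    """Frequency = (A * N) / Rank."""
    return A * N / rank

def predicted_rank(freq):
    """Rank = (A * N) / Frequency (the inverse of the relation above)."""
    return A * N / freq

# Most frequent word: P = 0.1; second most frequent: 0.05 (as in the note).
print(predicted_prob(1), predicted_prob(2))  # 0.1 0.05

# Rank x Frequency stays constant at A * N for every rank:
for r in (1, 10, 100):
    assert r * predicted_freq(r) == A * N
```

Note that rank and frequency play symmetric roles: `predicted_rank` and `predicted_freq` are the same function, which is exactly what "Rank $\times$ Frequency = Constant" says.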
Terminologies
- $tf_{t,d}$: term frequency, the number of occurrences of term $t$ in document $d$
- $ctf_t$: collection term frequency, the total number of occurrences of term $t$ across the collection
- $df_t$: document frequency, the number of documents that contain term $t$
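The three statistics are easy to confuse, so here is a toy illustration (the three "documents" are made up for this example):

```python
from collections import Counter

# A tiny hypothetical collection of three documents.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# tf_{t,d}: occurrences of term t within a single document d.
tf = [Counter(tokens) for tokens in tokenized]

# ctf_t: total occurrences of term t across the whole collection.
ctf = Counter()
for counts in tf:
    ctf.update(counts)

# df_t: number of documents that contain term t at least once.
df = Counter()
for counts in tf:
    df.update(set(counts))

print(tf[0]["the"])  # 2 ("the" appears twice in document 0)
print(ctf["the"])    # 3 (twice in document 0, once in document 1)
print(df["the"])     # 2 (documents 0 and 1 contain "the")
```

In a real engine these counts come from the inverted index rather than from rescanning documents, but the definitions are the same.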
Summary
- Heaps' Law states that the vocabulary size of a text corpus grows at a slower rate than the number of words in the corpus, described by the power-law function $V=KN^{\beta}$, where $K$ and $\beta$ are constants.
- Zipf's Law states that the frequency of a term in a text corpus is inversely proportional to its rank in the frequency distribution of words, described by the equation $f=k/r$, where $k$ is a constant. In particular, the most frequent term is expected to appear about twice as frequently as the second most frequent term.
Next Class Reading
- Chapter 1
- Chapter 5.1.1 - 5.1.2