11642 Lecture 01

Note Information

Title Introduction
Course Name 11642: Search Engines
Lecturer Jamie Callan
Date 01/17/2023
Note Author Ziang Zhou
Description Course Logistics, Heap’s Law and Zipf’s Law

Searching Concept: Diversity

Not everyone agrees on what should appear at the top of the results for a given query. For instance, a search for "Michael Jordan" can refer to different people (e.g., the basketball player or the machine-learning researcher), so a good result list should reflect that diversity.

Administrative Info

  • Office Hours
| Day | Time | Room | Instructor |
| --- | --- | --- | --- |
| Monday | 4:00 - 5:30 | GHC 5417 | Meghna |
| Tuesday | 1:00 - 2:30 | GHC 5417 | Guanzhong |
| Wednesday | 2:00 - 3:30 | GHC 5417 | Anubha |
| Thursday | 2:00 - 3:30 | GHC 5417 | Siddharth |
| Friday | 4:00 - 5:30 | GHC 5417 | Raghav |
  • Textbook: Introduction to Information Retrieval

No Java Pickup

This course is not a place to pick up Java. If you are not already comfortable with the language, work through a Java crash course on your own and make sure picking it up will not be a problem.

Two laws describe how words are used in human language

Heaps' Law

  • Heaps' Law: the size of the vocabulary

    Heaps’ Law is a statistical law that describes the relationship between the number of distinct words (vocabulary size) and the total number of words in a text corpus. It states that the vocabulary size, $V$, of a text corpus grows at a slower rate than the number of words, $N$, in the corpus.

    $$V=KN^{\beta}$$

    Typically there are two ways to estimate the parameters $K$ and $\beta$:

    • Fit $K$ and $\beta$ on a small sample of the corpus (see the sketch after this list)
    • Maximum likelihood estimation (MLE)
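
As a rough illustration of the first approach: taking logs of Heaps' Law gives $\log V = \log K + \beta \log N$, so $K$ and $\beta$ can be estimated with a linear fit in log-log space over a few $(N, V)$ measurements from a small sample. The sketch below uses made-up counts purely for illustration, not data from a real collection.

```python
# Sketch: fit Heaps' Law V = K * N^beta by linear regression in log-log space.
# The (N, V) measurements below are made up for illustration.
import numpy as np

N = np.array([1_000, 10_000, 100_000, 1_000_000], dtype=float)  # total words
V = np.array([350, 2_200, 14_000, 88_000], dtype=float)         # distinct words

# log V = log K + beta * log N, so fit a straight line in log space
beta, log_K = np.polyfit(np.log(N), np.log(V), 1)
K = np.exp(log_K)
print(f"K = {K:.2f}, beta = {beta:.2f}")

# Extrapolate: predicted vocabulary size for a 10-million-word corpus
print(f"Predicted V(10M) = {K * 10_000_000 ** beta:,.0f}")
```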

Zipf’s Law

  • Zipf’s Law: the term frequency distribution

    Zipf’s Law is a statistical law that describes the distribution of word frequencies in a text corpus. It states that the frequency of a word, $f$, is inversely proportional to its rank, $r$, in the frequency distribution of words.
    $$f=\frac{k}{r}$$ Where $k$ is a constant.

    In other words, Zipf's Law relates a term's frequency to its rank.

  • Empirical Observation: $$P(t_R)=\frac{ctf_{t_R}}{N}\approx\frac{A}{R}$$ where $A\approx 0.1$

    Therefore, the predicted probability of the most frequent word is $P(t_1)=\frac{0.1}{1}=0.1$, and the predicted probability of the second most frequent word is $P(t_2)=\frac{0.1}{2}=0.05$.

    • (tf stands for term frequency)
    • (ctf stands for collection term frequency)
  • Rank $\times$ Frequency = Constant (Constant = 0.1 $\times$ N); therefore (see the sketch after this list),

    • Rank = $\frac{0.1 \times N}{\text{Frequency}}$
    • Frequency = $\frac{0.1 \times N}{\text{Rank}}$
      where $N$ stands for the total number of term occurrences in the collection
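
A minimal sketch of the two relations above, assuming $A \approx 0.1$; the collection size used here is an arbitrary illustrative value.

```python
# Sketch of the Zipf rank/frequency relations, assuming A ≈ 0.1.
N = 1_000_000  # total term occurrences in the collection (illustrative)

def predicted_frequency(rank, N, A=0.1):
    """Estimated collection frequency of the term at a given rank."""
    return A * N / rank

def predicted_rank(frequency, N, A=0.1):
    """Estimated rank of a term with a given collection frequency."""
    return A * N / frequency

print(predicted_frequency(1, N))  # most frequent term: ~100,000 occurrences (P = 0.1)
print(predicted_frequency(2, N))  # second term: ~50,000 occurrences (P = 0.05)
print(predicted_rank(1_000, N))   # a term seen 1,000 times ranks around 100
```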

Terminologies

  • $tf_{t,d}$: term frequency (occurrences of term $t$ in document $d$)
  • $ctf_t$: collection term frequency (occurrences of term $t$ in the whole collection)
  • $df_{t}$: document frequency (number of documents that contain term $t$)
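
A small toy example showing how these three statistics differ; the documents below are made up for illustration.

```python
# Toy corpus: compute tf, ctf, and df for each term.
from collections import Counter

docs = [
    "search engines rank documents",
    "search queries retrieve documents",
    "ranking models score documents",
]
tokenized = [d.split() for d in docs]

# tf_{t,d}: occurrences of term t within a single document d
tf = [Counter(tokens) for tokens in tokenized]

# ctf_t: total occurrences of term t across the whole collection
ctf = Counter(t for tokens in tokenized for t in tokens)

# df_t: number of documents that contain term t at least once
df = Counter(t for tokens in tokenized for t in set(tokens))

print(tf[0]["search"])    # 1
print(ctf["documents"])   # 3
print(df["search"])       # 2
```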

Summary

  • Heaps' Law states that the vocabulary size of a text corpus grows at a slower rate than the number of words in the corpus, described by the power-law function $V=KN^{\beta}$, where $K$ and $\beta$ are constants.
  • Zipf's Law states that the frequency of a term in a text corpus is inversely proportional to its rank in the frequency distribution of words, described by the equation $f=k/r$, where $k$ is a constant. As a consequence, the most frequent term is expected to appear about twice as often as the second most frequent term.

Next Class Reading

  • Chapter 1
  • Chapter 5.1.1 - 5.1.2