11642 Lecture 01
Note Information
Title | Introduction |
---|---|
Course Name | 11642: Search Engines |
Lecturer | Jamie Callan |
Date | 01/17/2023 |
Note Author | Ziang Zhou |
Description | Course Logistics, Heaps' Law and Zipf's Law |
Search Concept: Diversity
Not everyone agrees on which results should top a given query. For instance, a search for "Michael Jordan" could mean the basketball player or the machine-learning researcher, so a good result list covers multiple interpretations.
Administrative Info
- Office Hours
Day | Time | Room | Instructor |
---|---|---|---|
Monday | 4:00 - 5:30 | GHC 5417 | Meghna |
Tuesday | 1:00 - 2:30 | GHC 5417 | Guanzhong |
Wednesday | 2:00 - 3:30 | GHC 5417 | Anubha |
Thursday | 2:00 - 3:30 | GHC 5417 | Siddharth |
Friday | 4:00 - 5:30 | GHC 5417 | Raghav |
- Textbook: Introduction to Information Retrieval
No Java pickup
This course is not the place to learn Java. If you don't already know it, find a Java crash-course tutorial and check whether you can pick the language up quickly.
Two laws describe how words are used in human language
Heaps' Law
Heaps' Law: the size of the vocabulary
Heaps’ Law is a statistical law that describes the relationship between the number of distinct words (vocabulary size) and the total number of words in a text corpus. It states that the vocabulary size, $V$, of a text corpus grows at a slower rate than the number of words, $N$, in the corpus.
$$V=KN^{\beta}$$
Typically there are two ways to estimate the parameters $K$ and $\beta$:
- Use a small set to fit parameters like $K$ and $\beta$
- Maximum likelihood estimation (MLE)
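The first approach above can be sketched as follows. This is a minimal illustration, not anything from the lecture: it fits $K$ and $\beta$ by ordinary least squares in log space, using a small set of hypothetical $(N, V)$ measurements, since $V = KN^{\beta}$ becomes the line $\log V = \log K + \beta \log N$.

```python
import math

# Hypothetical observations: (total words N, distinct words V) measured
# on increasingly large samples of some corpus.
samples = [(1_000, 400), (10_000, 2_300), (100_000, 13_000), (1_000_000, 72_000)]

# Heaps' Law V = K * N^beta is linear in log space:
#   log V = log K + beta * log N,
# so fit (log K, beta) with ordinary least squares.
xs = [math.log(n) for n, _ in samples]
ys = [math.log(v) for _, v in samples]
m = len(samples)
mx, my = sum(xs) / m, sum(ys) / m
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
K = math.exp(my - beta * mx)

print(f"K = {K:.2f}, beta = {beta:.2f}")
# Extrapolate to a larger corpus:
N_new = 10_000_000
print(f"Predicted vocabulary size for N = {N_new}: {K * N_new ** beta:.0f}")
```

For natural-language corpora, $\beta$ typically lands between 0.4 and 0.6; the made-up numbers above give a larger exponent, which only matters for the mechanics of the fit.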
Zipf’s Law
Zipf’s Law: the term frequency distribution
Zipf’s Law is a statistical law that describes the distribution of word frequencies in a text corpus. It states that the frequency of a word, $f$, is inversely proportional to its rank, $r$, in the frequency distribution of words.
$$f=\frac{k}{r}$$ where $k$ is a constant. Zipf's Law relates a term's frequency to its rank.
Empirical Observation: $$P(t_R)=\frac{ctf_{t_R}}{N}\approx\frac{A}{R}$$ where $A\approx 0.1$
Therefore, the predicted probability of the most frequent word is $P(t_1)=\frac{0.1}{1}=0.1$, and that of the second most frequent word is $P(t_2)=\frac{0.1}{2}=0.05$.
- (tf stands for term frequency)
- (ctf stands for collection term frequency)
Rank $\times$ Frequency = Constant (Constant = 0.1 $\times$ N), therefore,
- Rank = $\frac{0.1 \times N}{\text{Frequency}}$
- Frequency = $\frac{0.1 \times N}{\text{Rank}}$
where $N$ stands for the total number of term occurrences in the collection.
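The rank/frequency relations above can be checked with a short sketch. The collection size $N$ here is a hypothetical value; $A \approx 0.1$ is the empirical constant from the lecture.

```python
# Zipf's Law predictions with A ~= 0.1 (empirical constant from the lecture).
# N is a hypothetical total number of term occurrences in a collection.
N = 1_000_000
A = 0.1

def predicted_prob(rank):
    """P(t_R) ~= A / R."""
    return A / rank

def predicted_freq(rank):
    """Frequency = (A * N) / Rank."""
    return A * N / rank

def predicted_rank(freq):
    """Rank = (A * N) / Frequency (the inverse of the relation above)."""
    return A * N / freq

# Most frequent word: P = 0.1; second most frequent: 0.05 (as in the note).
print(predicted_prob(1), predicted_prob(2))  # 0.1 0.05

# Rank x Frequency stays constant at A * N for every rank:
for r in (1, 10, 100):
    assert r * predicted_freq(r) == A * N
```

Note that rank and frequency play symmetric roles: `predicted_rank` and `predicted_freq` are the same function, which is exactly what "Rank $\times$ Frequency = Constant" says.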
Terminologies
- $tf_{t,d}$: term frequency, the number of occurrences of term $t$ in document $d$
- $ctf_t$: collection term frequency, the total number of occurrences of term $t$ across the collection
- $df_t$: document frequency, the number of documents that contain term $t$
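The three statistics are easy to confuse, so here is a toy illustration (the three "documents" are made up for this example):

```python
from collections import Counter

# A tiny hypothetical collection of three documents.
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]

# tf_{t,d}: occurrences of term t within a single document d.
tf = [Counter(tokens) for tokens in tokenized]

# ctf_t: total occurrences of term t across the whole collection.
ctf = Counter()
for counts in tf:
    ctf.update(counts)

# df_t: number of documents that contain term t at least once.
df = Counter()
for counts in tf:
    df.update(set(counts))

print(tf[0]["the"])  # 2 ("the" appears twice in document 0)
print(ctf["the"])    # 3 (twice in document 0, once in document 1)
print(df["the"])     # 2 (documents 0 and 1 contain "the")
```

In a real engine these counts come from the inverted index rather than from rescanning documents, but the definitions are the same.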
Summary
- Heaps' Law states that the vocabulary size of a text corpus grows at a slower rate than the number of words in the corpus, described by the power-law function $V=KN^{\beta}$, where $K$ and $\beta$ are constants.
- Zipf's Law states that the frequency of a term in a text corpus is inversely proportional to its rank in the frequency distribution of words, described by the equation $f=k/r$, where $k$ is a constant. In particular, the most frequent term is expected to appear about twice as frequently as the second most frequent term.
Next Class Reading
- Chapter 1
- Chapter 5.1.1 - 5.1.2