11642 Lecture 02

Note Information

Title Introduction to Search
Course Name 11642: Search Engines
Lecturer Jamie Callan
Date 01/19/2023
Note Author Ziang Zhou
Description Description Holder

Documents are unstructured

compared to relational databases, they are structured. Structured is specified for this kind of concepts. So, documents are considered unstructured by search engines.

The query is a clue. The search engine is a student that tries to make users happy. Query is an approximation of user’s information need.

Today relational databases and search engines are very different, but back in the 60s they are pretty much the same.

Ad hoc retrieval

You don’t carefully come up with a query for information retrieval.

This is a template of the lecture note.

Frequency-based full-text representations

corpus is collection of documents.

Inverted List

Definition: An inverted list, also known as an inverted index, is a data structure used in information retrieval to map content, such as words or phrases, to the location where they appear in a set of documents. It is called inverted because it is the reverse of a standard index, which maps documents to the terms they contain.

An inverted list stores a list of all the unique terms that appear in a set of documents, along with a list of the documents in which each term appears. The list of documents is called a posting list, and it typically includes information such as the frequency of the term in each document and the position of the term in the document.

Sample Inverted List

1
2
3
4
5
6
7
8
9
10
term: "cat"
document 1: (frequency: 2, positions: [1, 5])
document 2: (frequency: 1, positions: [3])
document 3: (frequency: 3, positions: [2, 6, 8])
term: "dog"
document 1: (frequency: 1, positions: [2])
document 2: (frequency: 2, positions: [1, 4])
term: "fish"
document 1: (frequency: 1, positions: [4])
document 3: (frequency: 1, positions: [7])

In this sample, we can see that there are three unique terms in our set of documents: “cat”, “dog”, and “fish”. Each term has its own entry in the inverted list, and for each entry, there is a list of documents in which the term appears, along with some information about the term’s appearance in each document. The list of documents which specifies the term positions is called posting list.

Summary: An inverted index is a data structure that maps terms to the documents in which they appear. Each term in the index is associated with a posting list, which contains information about the documents in which the term appears. The posting list typically includes the document ID, the frequency of the term in the document, and the positions at which the term appears in the document.

Zipf’s Law predicts that the median term occurs twice

$$
\begin{split}
R_L & = \frac{0.1\times N}{1} \\
R_M & = 0.5 \times R_L = \frac{1}{2} \times \frac{0.1\times N}{1} \\
cft_M & = 2 \\
\end{split}
$$

Sparse Representation of Inverted Lists

  • $df_t$=18, docids 1, 5, 6, 9 $\cdots$
    • $df_t$: document freqency (number of documents contains term $t$)

Exact match retrieval

  • Definition: retrieve documents if and only if they satisfy a boolean expression.

Unranked Boolean

The set of matching is unordered, typically sorted by date.

Ranked Boolean

Usually score is used in ranking. There are two common types of scores. They are $tf_{t,d}$ and tfidf

$tf_{t,d}$

Refers to the frequency of term $t$ in document $d$.

tfidf

  • tfidf $tf_{t,d}\times idf_t=tf_{t,d}\times log(N/df_t)$

    • $t$ is a term
    • $N$ is the number of documents in the collection
    • $df_t$ number of documents that contain term $t$
  • A summary

    Component Description
    Term Frequency (tf) The number of times a term appears in a document. Higher values indicate that the term is more important to the document.
    Inverse Document Frequency (idf) A measure of how rare a term is across all documents in a collection. Higher values indicate that the term is more rare and therefore more informative.
    tf-idf score The product of tf and idf for a term in a document. Higher values indicate a better match between the document and the query.

Need to calculate scores using tf or tfidf, or other score functions.

Precision and recall

  • Precision: Precision is the proportion of relevant documents among the retrieved documents.
  • Recall: Recall is the proportion of relevant documents that are retrieved among all relevant documents.

Which ranking is better

Ranked Boolean Unranked Boolean
Definition A retrieval method that ranks the documents based on the relevance of the query terms. A retrieval method that returns documents that match certain conditions specified by the query without ranking them.
Score calculation The relevance of the query terms to the documents is calculated based on different weighting schemes like tf-idf, BM25 etc. The relevance of the query terms to the documents is not calculated, only the matching condition is verified.
Output A ranked list of the retrieved documents in descending order of relevance. A list of the retrieved documents without any ranking.
Use Case When the user wants to get the most relevant documents first and does not mind about missing some relevant documents. When the user wants to get all the documents that match certain conditions and doesn’t care about the relevance of the documents.

Document Retrieval

Mainly two types of document retrieval, TAAT and DAAT. Will be discussed in the next note.

Next Class Reading

  • Chapter 2.4.2
Author

Ziang Zhou

Posted on

2023-01-19

Updated on

2023-01-27

Licensed under

Comments