程序代写代做代考 AI Bayesian scheme COMP6714: Informa2on Retrieval & Web Search – INTERVIEW&CODEHELP

COMP6714: Informa2on Retrieval & Web Search

Introduc*on to

Informa(on Retrieval

Lecture 9: Probabilis*c Model & Language Model

COMP6714: Informa2on Retrieval & Web Search

Recap of the last lecture
§  Improving search results

§  Especially for high recall. E.g., searching for aircra? so it
matches with plane; thermodynamic with heat

§  Op*ons for improving results…
§  Global methods

§  Query expansion
§  Thesauri
§  Automa*c thesaurus genera*on

§  Global indirect relevance feedback
§  Local methods

§  Relevance feedback
§  Pseudo relevance feedback

COMP6714: Informa2on Retrieval & Web Search

Probabilis*c relevance feedback
§  Rather than reweigh*ng in a vector space…
§  If user has told us some relevant and some irrelevant
documents, then we can proceed to build a
probabilis*c classifier, such as a Naive Bayes model:
§  P(tk|R) = |Drk| / |Dr|
§  P(tk|NR) = |Dnrk| / |Dnr|

§  tk is a term; Dr is the set of known relevant documents; Drk is the
subset that contain tk; Dnr is the set of known irrelevant
documents; Dnrk is the subset that contain tk.

COMP6714: Informa2on Retrieval & Web Search

Why probabili*es in IR?

User
Information Need

Documents
Document

Representation

Query
Representation

How to match?

In traditional IR systems, matching between each document and
query is attempted in a semantically imprecise space of index terms.

Probabilities provide a principled foundation for uncertain reasoning
.
Can we use probabilities to quantify our uncertainties?

Uncertain guess of
whether document
has relevant content

Understanding
of user need is
uncertain

COMP6714: Informa2on Retrieval & Web Search

Probabilis*c IR topics

§  Classical probabilis*c retrieval model
§  Probability ranking principle, etc.

§  (Naïve) Bayesian Text Categoriza*on
§  Bayesian networks for text retrieval
§  Language model approach to IR

§  An important emphasis in recent work

§  Probabilis2c methods are one of the oldest but also
one of the currently hoGest topics in IR.
§  Tradi2onally: neat ideas, but they’ve never won on
performance. It may be different now.

COMP6714: Informa2on Retrieval & Web Search

The document ranking problem
n We have a collec*on of documents
n User issues a query
n A list of documents needs to be returned
n Ranking method is core of an IR system:

n In what order do we present documents to the user?
n We want the “best” document to be first, second best
second, etc….

n Idea: Rank by probability of relevance of the
document w.r.t. informa(on need
n P(relevant|documenti, query)

COMP6714: Informa2on Retrieval & Web Search

Recall a few probability basics
§  For events a and b:
§  Bayes’ Rule

§  Odds:

∑ =
==

==∩=

aax
xpxbp

apabp
bp
apabp

bap

apabpbpbap
apabpbpbapbapbap

,
)()|(

)()|(
)(
)()|(

)|(

)()|()()|(
)()|()()|()(),(

)(1
)(

)(
)(

)(
ap
ap

ap
ap

aO
−

Posterior

Prior

COMP6714: Informa2on Retrieval & Web Search

The Probability Ranking Principle

“If a reference retrieval system’s response to each request is a
ranking of the documents in the collec*on in order of decreasing
probability of relevance to the user who submiged the request,
where the probabili*es are es*mated as accurately as possible on
the basis of whatever data have been made available to the system
for this purpose, the overall effec2veness of the system to its user
will be the best that is obtainable on the basis of those data.”

§  [1960s/1970s] S. Robertson, W.S. Cooper, M.E. Maron;
van Rijsbergen (1979:113); Manning & Schütze (1999:538)

COMP6714: Informa2on Retrieval & Web Search

Probability Ranking Principle

Let x be a document in the collection.
Let R represent relevance of a document w.r.t. given (fixed)
query and let NR represent non-relevance.

)(
)()|(

)|(

)(
)()|(

)|(

xp
NRpNRxp

xNRp

xp
RpRxp

xRp

p(x|R), p(x|NR) – probability that if a relevant (non-relevant)
document is retrieved, it is x.

Need to find p(R|x) – probability that a document x is relevant.

p(R),p(NR) – prior probability
of retrieving a (non) relevant
document

1)|()|( =+ xNRpxRp

R={0,1} vs. NR/R

COMP6714: Informa2on Retrieval & Web Search

Probability Ranking Principle (PRP)
§  Simple case: no selec*on costs or other u*lity
concerns that would differen*ally weight errors

§  Bayes’ Op*mal Decision Rule
§  x is relevant iff p(R|x) > p(NR|x)

§  PRP in ac*on: Rank all documents by p(R|x)

§  Theorem:
§  Using the PRP is op*mal, in that it minimizes the loss
(Bayes risk) under 1/0 loss

§  Provable if all probabili*es correct, etc. [e.g., Ripley 1996]

COMP6714: Informa2on Retrieval & Web Search

Probability Ranking Principle

§  More complex case: retrieval costs.
§  Let d be a document
§  C – cost of retrieval of relevant document
§  C’ – cost of retrieval of non-relevant document

§  Probability Ranking Principle: if

for all d’ not yet retrieved, then d is the next
document to be retrieved

§  We won’t further consider loss/u(lity from now
on

�

C ⋅ p(R | d) + ′ C ⋅ (1− p(R | d)) ≤ C ⋅ p(R | ′ d ) + ′ C ⋅ (1− p(R | ′ d ))

COMP6714: Informa2on Retrieval & Web Search

Probability Ranking Principle
§  How do we compute all those probabili*es?

§  Do not know exact probabili*es, have to use es*mates
§  Binary Independence Retrieval (BIR) – which we
discuss later today – is the simplest model

§  Ques*onable assump*ons
§  “Relevance” of each document is independent of
relevance of other documents.
§  Really, it’s bad to keep on returning duplicates

§  Boolean model of relevance
§  That one has a single step informa*on need

§  Seeing a range of results might let user refine query

COMP6714: Informa2on Retrieval & Web Search

Probabilis*c Retrieval Strategy

§  Es*mate how terms contribute to relevance
§  How do things like u, df, and length influence your
judgments about document relevance?
§  One answer is the Okapi formulae (S. Robertson)

§  Combine to find document relevance probability

§  Order documents by decreasing probability

COMP6714: Informa2on Retrieval & Web Search

Probabilis*c Ranking

Basic concept:

“For a given query, if we know some documents that are
relevant, terms that occur in those documents should be
given greater weighting in searching for other relevant
documents.

By making assumptions about the distribution of terms
and applying Bayes Theorem, it is possible to derive
weights theoretically.”

Van Rijsbergen

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model
§  Tradi*onally used in conjunc*on with PRP
§  “Binary” = Boolean: documents are represented as binary

incidence vectors of terms (cf. lecture 1):
§ 
§  iff term i is present in document x.

§  “Independence”: terms occur in documents independently
§  Different documents can be modeled as same vector

§  Bernoulli Naive Bayes model (cf. text categoriza*on!)

),,( 1 nxxx …
!
=
1=ix

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model
§  Queries: binary term incidence vectors
§  Given query q,

§  for each document d need to compute p(R|q,d).
§  replace with compu*ng p(R|q,x) where x is binary term
incidence vector represen*ng d Interested only in ranking

§  Will use odds and Bayes’ Rule:

)|(
),|()|(

),|(
),|(

),|(

qxp
qNRxpqNRp

qxp
qRxpqRp

xqNRp
xqRp

xqRO
!
!

!
!

!
==

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model

•  Using Independence Assumption:

∏
=

=
n

i i

qNRxp
qRxp

1 ),|(
),|(

),|(
),|(

!
!

),|(
),|(

)|(
)|(

),|(
),|(

),|(
qNRxp
qRxp

qNRp
qRp

xqNRp
xqRp

xqRO !
!

!
!

!
⋅==

Constant for a
given query Needs estimation

∏
=

⋅=
n

i i

qNRxp
qRxp

qROdqRO
1 ),|(

),|(
)|(),|(• So :

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model

∏
=

⋅=
n

i i

qNRxp
qRxp

qROdqRO
1 ),|(

),|(
)|(),|(

•  Since xi is either 0 or 1:

∏∏
== =

=
⋅

=
⋅=

01 ),|0(
),|0(

),|1(
),|1(

)|(),|(
ii x i

x i

qNRxp
qRxp

qROdqRO

•  Let );,|1( qRxpp ii == );,|1( qNRxpr ii ==

•  Assume, for all terms not occurring in the query (qi=0) ii rp =

Then…
This can be
changed (e.g., in
relevance feedback)

COMP6714: Informa2on Retrieval & Web Search

All matching terms Non-matching
query terms

Binary Independence Model

All matching terms
All query terms

∏∏

===

=
===

−

−
⋅

−

−
⋅=

−

−
⋅⋅=

1
01

1
1

)1(
)1(

)|(

1
1

)|(),|(

iii

i
iii

q i

qx ii

q
x i

qx i

r
p

pr
rp

qRO

r
p

qROxqRO
!

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model

Constant for
each query

Only quantity to be estimated
for rankings

∏∏
=== −

−
⋅

−

−
⋅=

11 1
1

)1(
)1(

)|(),|(
iii q i

qx ii

r
p

pr
rp

qROxqRO
!

•  Retrieval Status Value:

�

RSV = log
pi(1− ri)
ri(1− pi)xi = qi =1

∏ = log pi(1− ri)
ri(1− pi)xi = qi =1

∑

�

= log(odds(pi))− log(odds(ri))( )
xi = qi =1
∑ = logit(pi) − logit(ri)( )

xi = qi =1
∑

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model

•  All boils down to computing RSV.

∑∏
==== −

−
=

−

−
=

11 )1(
)1(

log
)1(
)1(

log
iiii qx ii

qx ii

pr
rp

RSV

∑
==

=
1
;

ii qx
icRSV

�

ci = log
pi(1− ri)
ri(1− pi)

= logit(pi) − logit(ri)

So, how do we compute ci’s from our data ?

COMP6714: Informa2on Retrieval & Web Search

Binary Independence Model
•  Estimating RSV coefficients.
•  For each term i look at this table of document counts:

S
s

pi ≈ )(
)(
SN
sn

ri
−

−
≈

)()(
)(

log),,,(
sSnNsn

sSs
sSnNKci

+−−−

−
=≈

•  Estimates:

Documents Relevant Non-Relevant Total

Xi = 1

Xi = 0

Total N S

n s n-s
S-s N-n-S+s N-n

N-S

However, these
estimates could
be 0.

COMP6714: Informa2on Retrieval & Web Search

Add ½ Smoothing
•  Add ½ to each of the center four cells.

�

ci ≈ K(N,n,S,s) = log
(s+1 2) (S − s+1 2)

(n − s+1 2) (N − n − S + s+1 2)

Documents Relevant Non-Relevant Total

Xi = 1

Xi = 0

Total N+2 S+1

n+1 s+½ n-s+½
S-s+½ N-n-S+s+½ N-n+1

N-S+1

COMP6714: Informa2on Retrieval & Web Search

Example /1

§  Query = {x1, x2}
§  O(R=1|D3, q)

Doc Judgment x1 x2 x3
D1 R 1 1 1

D2 R 0 0 1

D3 R 1 0 0

D4 NR 1 0 1

D5 NR 0 1 1

COMP6714: Informa2on Retrieval & Web Search

Example /2

§  Es*mate pi and ri

Doc Judgment x1 x2 x3
D1 R 1 1 1

D2 R 0 0 1

D3 R 1 0 0

D4 NR 1 0 1

D5 NR 0 1 1

COMP6714: Informa2on Retrieval & Web Search

Es*ma*on – key challenge
§  If non-relevant documents are approximated by
the whole collec*on, then ri (prob. of occurrence
in non-relevant documents for query) is n/N and
§  log (1– ri)/ri = log (N– n)/n ≈ log N/n = IDF!

§  pi (probability of occurrence in relevant
documents) can be es*mated in various ways:
§  from relevant documents if know some

§  Relevance weigh*ng can be used in feedback loop
§  constant (Cro} and Harper combina*on match – 0.5) –
then just get idf weigh*ng of terms

§  propor*onal to prob. of occurrence in collec*on
§  more accurately, to log of this (Greiff, SIGIR 1998)

in fact, u-idf can be deemed as the cross-entropy

COMP6714: Informa2on Retrieval & Web Search

Itera*vely es*ma*ng pi
1.  Assume that pi constant over all xi in query

§  pi = 0.5 (even odds) for any given doc
2.  Determine guess of relevant document set:

§  V is fixed size set of highest ranked documents on this
model (note: now a bit like u.idf!)

3.  We need to improve our guesses for pi and ri, so
§  Use distribu*on of xi in docs in V. Let Vi be set of

documents containing xi
§  pi = |Vi| / |V|

§  Assume if not retrieved then not relevant
§  ri = (ni – |Vi|) / (N – |V|)

4.  Go to 2. un*l converges then return ranking

op*onal

COMP6714: Informa2on Retrieval & Web Search

Probabilis*c Relevance Feedback
1.  Guess a preliminary probabilis*c descrip*on of R

and use it to retrieve a first set of documents V, as
above.

2.  Interact with the user to refine the descrip*on:
learn some definite members of R and NR

3.  Rees*mate pi and ri on the basis of these
§  Or can combine new informa*on with original guess (use

Bayesian prior):

4.  Repeat, thus genera*ng a succession of
approxima*ons to R.

κ
κ
+

+
=

||
|| )1()2(
V

pV
p iii

κ is
prior

weight

op*onal

COMP6714: Informa2on Retrieval & Web Search

PRP and BIR

§  Ge�ng reasonable approxima*ons of probabili*es
is possible.

§  Requires restric*ve assump*ons:
§  term independence
§  terms not in query don’t affect the outcome
§  boolean representa*on of documents/queries/
relevance

§  document relevance values are independent
§  Some of these assump*ons can be removed
§  Problem: either require par*al relevance informa*on or only can

derive somewhat inferior term weights
29

COMP6714: Informa2on Retrieval & Web Search

Okapi BM25

§  Heuris*cally extend the BIR to include informa*on
of term frequencies, document length, etc.

§  Typically,

�

RSVd = log
N
dft

⎛

⎝
⎜

⎞

⎠
⎟

t∈q
∑ ⋅ (k1 +1)tf t ,d

k1 (1− b) + b ⋅
Ld
Lave

⎛

⎝
⎜

⎞

⎠
⎟ + tf t ,d

⋅
(k3 +1)tf t ,q
k3 + tf t,q

idf Normalized term
freq (doc)

Normalized term
freq (query)

�

k1,k3 ∈ [1.2,2.0],b = 0.75

caps the contribu*on of R

COMP6714: Informa2on Retrieval & Web Search

Good and Bad News
§  Standard Vector Space Model

§  Empirical for the most part; success measured by results
§  Few proper*es provable

§  Probabilis*c Model Advantages
§  Based on a firm theore*cal founda*on
§  Theore*cally jus*fied op*mal ranking scheme

§  Disadvantages
§  Making the ini*al guess to get V
§  Binary word-in-doc weights (not using term frequencies)
§  Independence of terms (can be alleviated)
§  Amount of computa*on
§  Has never worked convincingly beger in prac*ce

COMP6714: Informa2on Retrieval & Web Search

Resources
S. E. Robertson and K. Spärck Jones. 1976. Relevance Weigh*ng of Search

Terms. Journal of the American Society for Informa2on Sciences 27(3):
129–146.

C. J. van Rijsbergen. 1979. Informa2on Retrieval. 2nd ed. London:
Bugerworths, chapter 6. [Most details of math] hgp://
www.dcs.gla.ac.uk/Keith/Preface.html

N. Fuhr. 1992. Probabilis*c Models in Informa*on Retrieval. The Computer
Journal, 35(3),243–255. [Easiest read, with BNs]

F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This
Document Relevant? … Probably: A Survey of Probabilis*c Models in
Informa*on Retrieval. ACM Compu2ng Surveys 30(4): 528–552.

hgp://www.acm.org/pubs/cita*ons/journals/surveys/1998-30-4/p528-crestani/

[Adds very ligle material that isn’t in van Rijsbergen or Fuhr ]

COMP6714: Informa2on Retrieval & Web Search

Resources
H.R. Turtle and W.B. Cro}. 1990. Inference Networks for Document Retrieval.

Proc. ACM SIGIR: 1-24.
E. Charniak. Bayesian nets without tears. AI Magazine 12(4): 50-63 (1991). hGp://

www.aaai.org/Library/Magazine/Vol12/12-04/vol12-04.html
D. Heckerman. 1995. A Tutorial on Learning with Bayesian Networks. Microso}

Technical Report MSR-TR-95-06
hGp://www.research.microso?.com/~heckerman/

N. Fuhr. 2000. Probabilis*c Datalog: Implemen*ng Logical Informa*on Retrieval
for Advanced Applica*ons. Journal of the American Society for Informa2on
Science 51(2): 95–110.

R. K. Belew. 2001. Finding Out About: A Cogni2ve Perspec2ve on Search Engine

Technology and the WWW. Cambridge UP 2001.

MIR 2.5.4, 2.8
33

COMP6714: Informa2on Retrieval & Web Search

LANGUAGE MODEL

COMP6714: Informa2on Retrieval & Web Search

Today
§  The Language Model Approach to IR

§  Basic query genera*on model
§  Alterna*ve models

COMP6714: Informa2on Retrieval & Web Search

Standard Probabilis*c IR

query

…

Information
need

document collection

matching

),|( dQRP

COMP6714: Informa2on Retrieval & Web Search

IR based on Language Model (LM)

query

…

Information
need

document collection

generation

)|( dMQP 1d
M

2d
M

…

nd
M§  A common search heuris*c is to use words that you expect to find in matching documents as your

query – why, I saw Sergey Brin advoca*ng that
strategy on late night TV one night in my hotel
room, so it must be good!

§  The LM approach directly exploits that idea!
§  See later slides for a more formal jus*fica*on 37

COMP6714: Informa2on Retrieval & Web Search

Formal Language (Model)

§  Tradi*onal genera*ve model: generates strings
§  Finite state machines or regular grammars, etc.

§  Example:

I wish

I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish
…

COMP6714: Informa2on Retrieval & Web Search

Stochas*c Language Models

§  Models probability of genera*ng strings in the
language (commonly all strings over alphabet ∑)

0.2 the

0.1 a

0.01 man

0.01 woman

0.03 said

0.02 likes

… …

the man likes the woman

0.2 0.01 0.02 0.2 0.01

multiply

Model M

P(s | M) = 0.00000008

COMP6714: Informa2on Retrieval & Web Search

Stochas*c Language Models

§  Model probability of genera*ng any string

0.2 the

0.01 class

0.0001 sayst

0.0001 pleaseth

0.0001 yon

0.0005 maiden

0.01 woman

Model M1 Model M2

maiden class pleaseth yon the

0.0005 0.01 0.0001 0.0001 0.2
0.01 0.0001 0.02 0.1 0.2

P(s|M2) > P(s|M1)

0.2 the

0.0001 class

0.03 sayst

0.02 pleaseth

0.1 yon

0.01 maiden

0.0001 woman

COMP6714: Informa2on Retrieval & Web Search

Stochas*c Language Models

§  A sta*s*cal model for genera*ng text
§  Probability distribu*on over strings in a given language

P ( | M ) = P ( | M )

P ( | M, )

P ( | M, )
41

COMP6714: Informa2on Retrieval & Web Search

Unigram and higher-order models

§  Unigram Language Models

§  Bigram (generally, n-gram) Language Models

§  Other Language Models
§  Grammar-based models (PCFGs), etc.

§  Probably not the first thing to try in IR

= P ( ) P ( | ) P ( | ) P ( | )

P ( ) P ( ) P ( ) P ( )

P ( )

P ( ) P ( | ) P ( | ) P ( | )

Easy.
Effective!

COMP6714: Informa2on Retrieval & Web Search

Using Language Models in IR
§  Treat each document as the basis for a model (e.g.,
unigram sufficient sta*s*cs)

§  Rank document d based on P(d | q)
§  P(d | q) = P(q | d) x P(d) / P(q)

§  P(q) is the same for all documents, so ignore
§  P(d) [the prior] is o}en treated as the same for all d

§  But we could use criteria like authority, length, genre
§  P(q | d) is the probability of q given d’s model

§  Very general formal approach

COMP6714: Informa2on Retrieval & Web Search

The fundamental problem of LMs
§  Usually we don’t know the model M

§  But have a sample of text representa*ve of that model

§  Es*mate a language model from a sample
§  Then compute the observa*on probability

P ( | M ( ) )

M
doc query

COMP6714: Informa2on Retrieval & Web Search

Language Models for IR
§  Language Modeling Approaches

§  Agempt to model query genera(on process
§  Documents are ranked by the probability that a query
would be observed as a random sample from the
respec(ve document model

§  Mul*nomial approach

COMP6714: Informa2on Retrieval & Web Search

Retrieval based on probabilis*c LM

§  Treat the genera*on of queries as a random process.
§  Approach

§  Infer a language model for each document.
§  Es*mate the probability of genera*ng the query according
to each of these models.

§  Rank the documents according to these probabili*es.
§  Usually a unigram es*mate of words is used

§  Some work on bigrams, paralleling van Rijsbergen

COMP6714: Informa2on Retrieval & Web Search

Retrieval based on probabilis*c LM

§  Intui*on
§  Users …

§  Have a reasonable idea of terms that are likely to occur in
documents of interest.

§  They will choose query terms that dis*nguish these documents
from others in the collec*on.

§  Collec*on sta*s*cs …
§  Are integral parts of the language model.
§  Are not used heuris*cally as in many other approaches.

§  In theory. In prac*ce, there’s usually some wiggle room for
empirically set parameters

COMP6714: Informa2on Retrieval & Web Search

Query genera*on probability (1)
§  Ranking formula

§  The probability of producing the query given the language model of
document d using MLE is:

∏

∈

Qt d

Qt
dmld

dl
tf

MtpMQp

),(

)|(ˆ)|(ˆ

Unigram assumption:
Given a particular language model, the
query terms occur independently

),( dttf

ddl

: language model of document d

: raw tf of term t in document d

: total number of tokens in document d

)|()(
)|()(),(

dMQpdp
dQpdpdQp

≈

COMP6714: Informa2on Retrieval & Web Search

Insufficient data
§  Zero probability

§  May not wish to assign a probability of zero to a document
that is missing one or more of the query terms [gives
conjunc*on seman*cs]

§  General approach
§  A non-occurring term is possible, but no more likely than
would be expected by chance in the collec*on.

§  If ,

0)|( =dMtp

0),( =dttf

cs
cf

Mtp td =)|(

tcf : raw count of term t in the collection
: raw collection size(total number of tokens in the collection)

COMP6714: Informa2on Retrieval & Web Search

Insufficient data
§  Zero probabili*es spell disaster

§  We need to smooth probabili*es
§  Discount nonzero probabili*es
§  Give some probability mass to unseen things

§  There’s a wide space of approaches to smoothing
probability distribu*ons to deal with this problem,
such as adding 1, ½ or ℇ to counts, Dirichlet priors,
discoun*ng, and interpola*on
§  [See FSNLP ch. 6 or CS224N if you want more]

§  A simple idea that works well in prac*ce is to use a
mixture between the document mul*nomial and the
collec*on mul*nomial distribu*on

COMP6714: Informa2on Retrieval & Web Search

Mixture model
§  Jelinek-Mercer method

§  P(w|d) = λPmle(w|Md) + (1 – λ)Pmle(w|Mc)
§  Mixes the probability from the document with the
general collec*on frequency of the word.

§  Correctly se�ng λ is very important
§  A high value of lambda makes the search
“conjunc*ve-like” – suitable for short queries

§  A low value is more suitable for long queries
§  Can tune λ to op*mize performance

§  Perhaps make it dependent on document size (cf. Dirichlet
prior or Wigen-Bell smoothing) 51

COMP6714: Informa2on Retrieval & Web Search

Basic mixture model summary
§  General formula*on of the LM for IR

§  The user has a document in mind, and generates the query
from this document.

§  The equa*on represents the probability that the
document that the user had in mind was in fact this one.

∏
∈

+−=
Qt

dMtptpdpdQp ))|()()1(()(),( λλ

collection/background language model

individual-document model

COMP6714: Informa2on Retrieval & Web Search

Rela*onship to idf
fqi,D=0 è query word that
does not occur in the doc

Add contribu*ons
from i:fqi,D>0

Becomes a
constant | Q, C

propor*onal to the u, inversely
propor*onal to the cf

Note here (i.e., [CMS09]) λ is
mul*plied to the background
model.

COMP6714: Informa2on Retrieval & Web Search

Example
§  Document collec*on (2 documents)

§  d1: Xerox reports a profit but revenue is down
§  d2: Lucent narrows quarter loss but revenue decreases
further

§  Model: MLE unigram from documents; λ = ½
§  Query: revenue down

§  P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2]
= 1/8 x 3/32 = 3/256
§  P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2]
= 1/8 x 1/32 = 1/256

§  Ranking: d1 > d2 54

COMP6714: Informa2on Retrieval & Web Search

Ponte and Cro} Experiments
§  Data

§  TREC topics 202-250 on TREC disks 2 and 3
§  Natural language queries consis*ng of one sentence each

§  TREC topics 51-100 on TREC disk 3 using the concept fields
§  Lists of good terms

Number: 054

Domain: International Economics

Topic: Satellite Launch Contracts </p> <p><desc>Description:<br /> … </desc> </p> <p><con>Concept(s): </p> <p>1.  Contract, agreement<br /> 2.  Launch vehicle, rocket, payload, satellite<br /> 3.  Launch services, … </con> 55 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Precision/recall results 202-250 </p> <p>56 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Precision/recall results 51-100</p> <p>57 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Language models: pro & con<br /> §  Novel way of looking at the problem of text retrieval<br /> based on probabilis*c language modeling </p> <p>§  Conceptually simple and explanatory<br /> §  Formal mathema*cal model<br /> §  Natural use of collec*on sta*s*cs, not heuris*cs (almost…) </p> <p>§  LMs provide effec*ve retrieval and can be improved to the<br /> extent that the following condi*ons can be met<br /> §  Our language models are accurate representa*ons of the data.<br /> §  Users have some sense of term distribu*on.* </p> <p>§  *Or we get more sophis*cated with transla*on model </p> <p>58 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Comparison With Vector Space<br /> §  There’s some rela*on to tradi*onal u.idf models: </p> <p>§  (unscaled) term frequency is directly in model<br /> §  the probabili*es do length normaliza*on of term<br /> frequencies </p> <p>§  the effect of doing a mixture with overall collec*on<br /> frequencies is a ligle like idf: terms rare in the general<br /> collec*on but common in some documents will have a<br /> greater influence on the ranking </p> <p>59 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Comparison With Vector Space<br /> §  Similar in some ways </p> <p>§  Term weights based on frequency<br /> §  Terms o}en used as if they were independent<br /> §  Inverse document/collec*on frequency used<br /> §  Some form of length normaliza*on useful </p> <p>§  Different in others<br /> §  Based on probability rather than similarity </p> <p>§  Intui*ons are probabilis*c rather than geometric<br /> §  Details of use of document length and term, document,<br /> and collec*on frequency differ </p> <p>60 </p> <p> COMP6714: Informa2on Retrieval & Web Search </p> <p>Resources<br /> J.M. Ponte and W.B. Cro}. 1998. A language modelling approach to informa*on </p> <p>retrieval. In SIGIR 21.<br /> D. Hiemstra. 1998. A linguis*cally mo*vated probabilis*c model of informa*on </p> <p>retrieval. ECDL 2, pp. 569–584.<br /> A. Berger and J. Lafferty. 1999. Informa*on retrieval as sta*s*cal transla*on. SIGIR 22, </p> <p>pp. 222–229.<br /> D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model informa*on </p> <p>retrieval system. SIGIR 22, pp. 214–221.<br /> [Several relevant newer papers at SIGIR 23–25, 2000–2002.]<br /> Workshop on Language Modeling and Informa*on Retrieval, CMU 2001. hgp://</p> <p>la.l*.cs.cmu.edu/callan/Workshops/lmir01/ .<br /> The Lemur Toolkit for Language Modeling and Informa*on Retrieval. hgp://</p> <p>www-2.cs.cmu.edu/~lemur/ . CMU/Umass LM and IR system in C(++), currently<br /> ac*vely developed. </p> <p>61</p> </div> <footer class="entry-meta entry-meta-footer"> <span class="cat-links cat-links-single">Posted in Uncategorized</span> </footer> </div> </div> </article> </div> <div class="entry-author"> <div class="author-avatar"> <img alt='' src='https://secure.gravatar.com/avatar/587d467738f02efc1f3ad41e6550ae64?s=80&d=mm&r=g' srcset='https://secure.gravatar.com/avatar/587d467738f02efc1f3ad41e6550ae64?s=160&d=mm&r=g 2x' class='avatar avatar-80 photo' height='80' width='80' loading='lazy' decoding='async'/> </div> <div class="author-heading"> <h2 class="author-title"> Published by <span class="author-name">admin</span> </h2> </div> <p class="author-bio"> <a class="author-link" href="https://www.cscodehelp.com/author/admin/" rel="author"> View all posts by admin </a> </p> </div> <nav class="navigation post-navigation" aria-label="Posts"> <h2 class="screen-reader-text">Post navigation</h2> <div class="nav-links"><div class="nav-previous"><a href="https://www.cscodehelp.com/c-c-%e4%bb%a3%e5%86%99/%e7%a8%8b%e5%ba%8f%e4%bb%a3%e5%86%99%e4%bb%a3%e5%81%9a%e4%bb%a3%e8%80%83-confidence-intervals-ppt/" rel="prev"><span class="meta-nav">Prev</span> <span class="post-title">程序代写代做代考 Confidence Intervals.ppt</span></a></div><div class="nav-next"><a href="https://www.cscodehelp.com/c-c-%e4%bb%a3%e5%86%99/%e7%a8%8b%e5%ba%8f%e4%bb%a3%e5%86%99%e4%bb%a3%e5%81%9a%e4%bb%a3%e8%80%83-deep-learning-gpu-algorithm-flex-rethinking-the-inception-architecture-for-computer-vision/" rel="next"><span class="meta-nav">Next</span> <span class="post-title">程序代写代做代考 deep learning GPU algorithm flex Rethinking the Inception Architecture for Computer Vision</span></a></div></div> </nav> <div id="comments" class="comments-area"> <div id="respond" class="comment-respond"> <h3 id="reply-title" class="comment-reply-title">Leave a Reply <small><a rel="nofollow" id="cancel-comment-reply-link" href="/c-c-%e4%bb%a3%e5%86%99/%e7%a8%8b%e5%ba%8f%e4%bb%a3%e5%86%99%e4%bb%a3%e5%81%9a%e4%bb%a3%e8%80%83-ai-bayesian-scheme-comp6714informa2onretrievalwebsearch/#respond" style="display:none;">Cancel reply</a></small></h3><form action="https://www.cscodehelp.com/wp-comments-post.php" method="post" id="commentform" class="comment-form" novalidate><p class="comment-notes"><span id="email-notes">Your email address will not be published.</span> <span class="required-field-message">Required fields are marked <span class="required">*</span></span></p><p class="comment-form-comment"><label for="comment">Comment <span class="required">*</span></label> <textarea id="comment" name="comment" cols="45" rows="8" maxlength="65525" required></textarea></p><p class="comment-form-author"><label for="author">Name <span class="required">*</span></label> <input id="author" name="author" type="text" value="" size="30" maxlength="245" autocomplete="name" required /></p> <p class="comment-form-email"><label for="email">Email <span class="required">*</span></label> <input id="email" name="email" type="email" value="" size="30" maxlength="100" aria-describedby="email-notes" autocomplete="email" required /></p> <p class="comment-form-url"><label for="url">Website</label> <input id="url" name="url" type="url" value="" size="30" maxlength="200" autocomplete="url" /></p> <p class="comment-form-cookies-consent"><input id="wp-comment-cookies-consent" name="wp-comment-cookies-consent" type="checkbox" value="yes" /> <label for="wp-comment-cookies-consent">Save my name, email, and website in this browser for the next time I comment.</label></p> <p class="form-submit"><input name="submit" type="submit" id="submit" class="submit" value="Post Comment" /> <input type='hidden' name='comment_post_ID' value='19449' id='comment_post_ID' /> <input type='hidden' name='comment_parent' id='comment_parent' value='0' /> </p></form> </div> </div> </div> </main> </div> <div id="site-sidebar" class="sidebar-area col-16 col-sm-16 col-md-16 col-lg-5 col-xl-5 col-xxl-5 order-lg-1 order-xl-1 order-xxl-1"> <div id="secondary" class="sidebar widget-area sidebar-widget-area" role="complementary"> <aside id="text-7" class="widget widget_text"><h2 class="widget-title">联系我</h2> <div class="textwidget"><p><code></code></p> <ul> <li><strong>网站: </strong><a href="https://www.cscodehelp.com/">https://www.cscodehelp.com/</a></li> <li><strong>Email</strong>: kyit630461@outlook.com</li> <li><strong>QQ</strong>:2235208643<br />   </li> <li><strong>Wechat:</strong> aplg6666</li> </ul> <figure class="wp-block-image size-large is-resized"><img decoding="async" loading="lazy" class="alignnone size-medium wp-image-26377" src="https://www.cscodehelp.com/wp-content/uploads/2023/05/WechatIMG181-202x300.png" alt="" width="202" height="300" srcset="https://www.cscodehelp.com/wp-content/uploads/2023/05/WechatIMG181-202x300.png 202w, https://www.cscodehelp.com/wp-content/uploads/2023/05/WechatIMG181-690x1024.png 690w, https://www.cscodehelp.com/wp-content/uploads/2023/05/WechatIMG181-768x1140.png 768w, https://www.cscodehelp.com/wp-content/uploads/2023/05/WechatIMG181.png 802w" sizes="(max-width: 202px) 100vw, 202px" /></figure> <p>为了保证我尽快联系和评估您的面试，作业, 请注明您的面试，作业具体要求、微信号和年级</p> <p>Notice: 一般在美东时间 11:00 AM – 17:00之间可能回复稍慢，望谅解</p> <p></p> <p class="has-text-color has-medium-font-size has-vivid-red-color"><strong>100% Plagiarism Free 代码保证唯一</strong></p> <p> </p> <p class="has-text-color has-medium-font-size has-vivid-red-color"><strong>100% Confidentiality 完全保密</strong></p> <p> </p> <p class="has-text-color has-medium-font-size has-vivid-red-color"><strong>100% Quality Assurance 保证质量</strong></p> <p></p> </div> </aside><aside id="text-2" class="widget widget_text"><h2 class="widget-title">请大家注意</h2> <div class="textwidget"><p>我从18年开始做代面，12年开始做代写，跌跌撞撞走来。若干网站内容都抄袭，请大家提高警惕。</p> </div> </aside><aside id="text-5" class="widget widget_text"><h2 class="widget-title">友情提示 </h2> <div class="textwidget"><p>我的Google排名是靠质量和口碑做起来的, 和带Ad标识的花钱排名不一样。codehelp的排名从来不需要通过付钱刷存在感</p> </div> </aside><aside id="categories-3" class="widget widget_categories"><h2 class="widget-title">成功案例分类</h2><form action="https://www.cscodehelp.com" method="get"><label class="screen-reader-text" for="cat">成功案例分类</label><select name='cat' id='cat' class='postform'> <option value='-1'>Select Category</option> <option class="level-0" value="212">AI</option> <option class="level-0" value="214">bigdata</option> <option class="level-0" value="1">C/C++ 代写</option> <option class="level-0" value="175">Compilers: Principles</option> <option class="level-0" value="253">computer architecture</option> <option class="level-0" value="251">computer graphics</option> <option class="level-0" value="133">computer network</option> <option class="level-0" value="54">computer system计算机系统</option> <option class="level-0" value="259">computer theory</option> <option class="level-0" value="32">Data Analysis/数据分析</option> <option class="level-0" value="33">Data Science</option> <option class="level-0" value="208">data visualization</option> <option class="level-0" value="236">DATABASE</option> <option class="level-0" value="165">Distributed System 分布式系统</option> <option class="level-0" value="246">essay</option> <option class="level-0" value="217">genetic algorithm</option> <option class="level-0" value="68">Go语言程序设计</option> <option class="level-0" value="162">Hadoop</option> <option class="level-0" value="226">Haskell</option> <option class="level-0" value="229">hive</option> <option class="level-0" value="147">IOS开发</option> <option class="level-0" value="40">JAVA</option> <option class="level-0" value="80">JavaScript</option> <option class="level-0" value="86">JS</option> <option class="level-0" value="75">Junit</option> <option class="level-0" value="186">kaggle</option> <option class="level-0" value="181">Machine Learning</option> <option class="level-0" value="79">matlab</option> <option class="level-0" value="119">matlab代写</option> <option class="level-0" value="228">mobile system</option> <option class="level-0" value="98">NLP</option> <option class="level-0" value="273">OA代做</option> <option class="level-0" value="39">OOP编程</option> <option class="level-0" value="237">ORACLE SQL</option> <option class="level-0" value="239">paper代写</option> <option class="level-0" value="280">powcoder</option> <option class="level-0" value="142">Prolog</option> <option class="level-0" value="164">Python</option> <option class="level-0" value="23">Python代写</option> <option class="level-1" value="24"> Tensorflow代写</option> <option class="level-0" value="203">R语言代写</option> <option class="level-0" value="177">Scientific Computing Language</option> <option class="level-0" value="29">Software Engineering</option> <option class="level-0" value="163">Spark</option> <option class="level-0" value="60">SPARK代写</option> <option class="level-0" value="94">sql代写</option> <option class="level-0" value="205">Statistical Analysis</option> <option class="level-0" value="196">Stochastic Analysis in Finance</option> <option class="level-0" value="193">Stylish</option> <option class="level-0" value="148">Swift开发</option> <option class="level-0" value="78">web</option> <option class="level-0" value="85">WEBGL</option> <option class="level-0" value="113">人工智能</option> <option class="level-0" value="67">分布式系统</option> <option class="level-0" value="93">加拿大代写</option> <option class="level-0" value="128">动态规划DP dynamic programming</option> <option class="level-0" value="106">商业分析</option> <option class="level-0" value="192">图像处理</option> <option class="level-0" value="107">图算法</option> <option class="level-0" value="114">图算法graph</option> <option class="level-0" value="198">工程</option> <option class="level-0" value="36">平时作业代写</option> <option class="level-0" value="188">强化学习|reinforcement learning</option> <option class="level-0" value="61">推荐系统</option> <option class="level-0" value="48">数据库</option> <option class="level-0" value="59">数据挖掘</option> <option class="level-0" value="47">数据结构｜Data Structure</option> <option class="level-0" value="185">文本分类</option> <option class="level-0" value="77">机器学习</option> <option class="level-0" value="171">汇编</option> <option class="level-0" value="219">深度学习</option> <option class="level-0" value="69">澳洲程序设计</option> <option class="level-0" value="123">生物信息工程</option> <option class="level-0" value="26">科研代码代写</option> <option class="level-0" value="159">移动通信</option> <option class="level-0" value="127">算法algorithm</option> <option class="level-0" value="204">统计分析</option> <option class="level-0" value="170">编译原理</option> <option class="level-0" value="49">网络安全</option> <option class="level-0" value="99">自然语言处理</option> <option class="level-0" value="254">计算机体系结构</option> <option class="level-0" value="250">计算机图形学</option> <option class="level-0" value="263">计算机编译原理</option> <option class="level-0" value="132">计算机网络</option> <option class="level-0" value="53">计算机网络代写</option> <option class="level-0" value="28">软件工程代码代写</option> <option class="level-0" value="120">边缘计算</option> <option class="level-0" value="158">通信工程</option> <option class="level-0" value="104">通信编程</option> <option class="level-0" value="218">遗传算法</option> <option class="level-0" value="108">量化代写</option> <option class="level-0" value="197">随机过程</option> <option class="level-0" value="274">面试</option> <option class="level-0" value="272">面试代面</option> <option class="level-0" value="279">骗子曝光</option> </select> </form> <script type="text/javascript"> /* <![CDATA[ */ (function() { var dropdown = document.getElementById( "cat" ); function onCatChange() { if ( dropdown.options[ dropdown.selectedIndex ].value > 0 ) { dropdown.parentNode.submit(); } } dropdown.onchange = onCatChange; })(); /* ]]> */ </script> </aside> <aside id="recent-posts-3" class="widget widget_recent_entries"> <h2 class="widget-title">成功案例</h2> <ul> <li> <a href="https://www.cscodehelp.com/c-c-%e4%bb%a3%e5%86%99/%e4%ba%9a%e9%ba%bboa-%e9%9d%a2%e7%bb%8f%e4%ba%9a%e9%a9%ac%e9%80%8a%e9%9d%a2%e7%bb%8famazon-oaoa%e4%ba%9a%e9%ba%bb%e7%bd%91%e7%94%b3amazon-%e4%bb%a3%e9%9d%a2amazon-oa%e4%bb%a3%e5%86%99/">亚麻OA 面经|亚马逊面经|AMAZON OA|OA亚麻网申|AMAZON 代面|AMAZON OA代写</a> </li> <li> <a href="https://www.cscodehelp.com/c-c-%e4%bb%a3%e5%86%99/breaking-the-mold-unconventional-strategies-for-accelerating-your-career-path/">Breaking the Mold: Unconventional Strategies for Accelerating Your Career Path</a> </li> <li> <a href="https://www.cscodehelp.com/c-c-%e4%bb%a3%e5%86%99/ng-%e9%9d%a2%e8%af%95-system-design%ef%bc%8c%e6%b1%97%e6%b5%81%e6%b5%83%e8%83%8c%e4%ba%86/">NG 面试 System Design，汗流浃背了</a> </li> </ul> </aside><aside id="archives-6" class="widget widget_archive"><h2 class="widget-title">日期分类</h2> <ul> <li><a href='https://www.cscodehelp.com/2024/03/'>March 2024</a></li> <li><a href='https://www.cscodehelp.com/2024/02/'>February 2024</a></li> <li><a href='https://www.cscodehelp.com/2023/11/'>November 2023</a></li> <li><a href='https://www.cscodehelp.com/2023/06/'>June 2023</a></li> <li><a href='https://www.cscodehelp.com/2022/12/'>December 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/11/'>November 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/08/'>August 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/07/'>July 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/06/'>June 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/05/'>May 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/04/'>April 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/03/'>March 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/02/'>February 2022</a></li> <li><a href='https://www.cscodehelp.com/2022/01/'>January 2022</a></li> <li><a href='https://www.cscodehelp.com/2021/12/'>December 2021</a></li> <li><a href='https://www.cscodehelp.com/2021/04/'>April 2021</a></li> <li><a href='https://www.cscodehelp.com/2021/03/'>March 2021</a></li> <li><a href='https://www.cscodehelp.com/2021/02/'>February 2021</a></li> <li><a href='https://www.cscodehelp.com/2021/01/'>January 2021</a></li> <li><a href='https://www.cscodehelp.com/2020/06/'>June 2020</a></li> <li><a href='https://www.cscodehelp.com/2020/05/'>May 2020</a></li> <li><a href='https://www.cscodehelp.com/2020/04/'>April 2020</a></li> <li><a href='https://www.cscodehelp.com/2020/03/'>March 2020</a></li> <li><a href='https://www.cscodehelp.com/2020/02/'>February 2020</a></li> <li><a href='https://www.cscodehelp.com/2019/12/'>December 2019</a></li> <li><a href='https://www.cscodehelp.com/2019/06/'>June 2019</a></li> </ul> </aside><aside id="text-3" class="widget widget_text"><h2 class="widget-title">口碑</h2> <div class="textwidget"><p>我们的<strong>Google</strong>排名是靠质量和口碑做起来的, 和带Ad标识的花钱排名不一样。codehelp的排名从来不需要通过付钱刷存在感</p> </div> </aside> </div> </div> </div> </div> </div> </div> <footer id="colophon" class="site-footer"> <div class="site-info"> <div class="site-info-inside"> <div class="container"> <div class="row"> <div class="col"> <div class="credits-wrapper"> <div class="credits credits-blog">cscodehelp™ 博士课程作业面试辅导 CS 计算机科学 | EE 电气工程 | STATICS 统计 | FINANCE 金融 | 程序代做 | 工作代做 | 面试代面 | CS代做</div><div class="credits credits-designer">Amphibious Theme by <a href="https://templatepocket.com" title="TemplatePocket">TemplatePocket</a> <span>⋅</span> Powered by <a href="https://wordpress.org" title="WordPress">WordPress</a></div> </div> </div> </div> </div> </div> </div> </footer> </div> <div class="overlay-effect"></div> <script type='text/javascript' src='https://www.cscodehelp.com/wp-content/themes/amphibious/js/enquire.js?ver=2.1.6' id='enquire-js'></script> <script type='text/javascript' src='https://www.cscodehelp.com/wp-content/themes/amphibious/js/fitvids.js?ver=1.1' id='fitvids-js'></script> <script type='text/javascript' src='https://www.cscodehelp.com/wp-content/themes/amphibious/js/hover-intent.js?ver=r7' id='hover-intent-js'></script> <script type='text/javascript' src='https://www.cscodehelp.com/wp-content/themes/amphibious/js/superfish.js?ver=1.7.10' id='superfish-js'></script> <script type='text/javascript' src='https://www.cscodehelp.com/wp-includes/js/comment-reply.min.js?ver=73f13a27fe977958cf7880e6b25eac33' id='comment-reply-js'></script> <script type='text/javascript' src='https://www.cscodehelp.com/wp-content/themes/amphibious/js/custom.js?ver=1.0' id='amphibious-custom-js'></script> </body> </html>