IT代考 IR H/M Course – cscodehelp代写
IR H/M Course
(Recall) Bag of Words Representation
• Simple strategy for representing documents
• Count how many times each term occurs – Binary mode uses only 0 & 1
Copyright By cscodehelp代写 加微信 cscodehelp
• A ‘term’ is any lexical item that you chose such as:
– A word (delimited by ‘white space’ or punctuation)
– Some conflated ‘root form’ of each word (e.g. a stem)
– An n-gram (a sequence of any consecutive n chars)
• Doesn’t consider the ordering of words in a document
– John is quicker than Mary and Mary is quicker than John have the same
representation
– This could be a set back: positional information allows to distinguish these 2 docs
• For now: Bag of Words Model (BoW)
Vector Space Model
LET’S LOOK AT THIS PROCESS DIFFERENTLY ….
What%is%a%bag%of%word%%representation?
IR H/M Course
Document Vectors One location for each word
diet film fur galaxy heat h’wood nova role
“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A 10 10 “Heat” occurs 3 times in text A 9 10
(Blank means 0 occurrences.)
H 6 10 2 8 I7513
Document Vectors One location for each word
diet film fur galaxy heat h’wood nova role
F G579 H 6 10 2 8
10 5 3 5 10
“Hollywood” o1c0curs 78times7in text I “Film” occu9rs 5 tim10es in 5text I
“Diet” occurs 1 time in text I 10 10 “Fur” occurs 3 times in text I 9 10
IR H/M Course
Document ids
galaxy heat
Document Vectors
One vector for each document
A B C D E F G H I
Vector Space Model
• Documents are also treated as a bag of words or terms
• Each document is represented as a vector in a t-dimensional vector space (t is the number of index terms)
• Each term weight is computed based on some variations of TF or TF-IDF scheme
IR H/M Course
Document ids
diet film fur
galaxy heat
TF-IDF Vectors
A B C D E F G H I
Sparse’Matrix
More Formally ….
• Documents and queries are represented by
vectors of term weights
• A collection is represented by a matrix of term weights
So,$we$have$docs$represented$ as$vectors
How$can$we$use$this$in$retrieval?
t”is”the”number”of”index”terms”(words,”stems,”etc)
IR H/M Course
Retrieval Model
An IR Model defines
– a model for document representation
– a model for query representation
– a mechanism for estimating the relevance of a
document for a given query
Progress’in’retrieval’models’has’corresponded’with’ improvements’in’effectiveness
Relevance Estimation
Retrieval in Vector Space Model • Vector space model represents both query and
documents using term sets (term vectors)
• Documents and queries are represented in a high
dimensional space (Bag of Words)
– Each dimension of the space corresponds to a term in
the document collection (t-dimensional vector space) • Relevance Estimation is performed by identifying
documents similar to the query
– Relevance of di to q:”Compare the similarity of query q and document di
What%is%a%retrieval model?
What%is%its%purpose?
IR H/M Course
Geometrically: Vector Space Model
Assumption: Documents that are “close together” in vector space “talk about” the same things
NB:$3D#diagrams#useful,#but#can#be#misleading#for# high6dimensional#space
Geometrically: Vector Space Model
Assumption: Documents that are “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
IR H/M Course
Vector Space • X = (t1,t2, …, tt)
– The number ti is called the i-th component of the vector
– Magnitude: is defined by the square root of the sum of the squares of the components
• that is, ∑ti2
– If ||X|| =1 then X is a unit vector • Concept of length normalization
Summary: Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
– A vector is like an array of floating point
– Has direction and magnitude
– Each vector holds a (unique) place for every term in the collection
– Therefore, most vectors are sparse
IR H/M Course
Plotting the Vectors … & Intuition
Doc about astronomy
Doc about movie stars
Doc about mammal behavior Diet
Vector Space Intuition
– Books from a domain are organised at the same
place/ shelf/ nearby shelves
– Human organisation – librarian
What’is’the’intuition’behind The’vector4space’model?
IR H/M Course
Vector Space Model
• The relevant documents for a query are expected to be those represented by the vectors closest to the query
• Documents ranked by distance between points representing query and documents
– Similarity measure more common than a distance or dissimilarity measure
Cosine Measure
In(IR(we(consider(only(the(similarity(range(from(0(to(1 Why?(Why(not(-1(to(1?
• It measures cosine of the angle between the vectors
• Cosine ranges from 1 for vectors pointing in the same direction over zero for orthogonal vectors and -1 for vectors pointing in opposite directions
• If Cosine is applied to normalised (unit) vectors it gives the same ranking as Euclidean distance does
Cos(0’(=(1 Cos(90’(=(0 Cos(180’(=(-1 C
IR H/M Course
Similarity Calculation
– Consider two documents D1, D2 and a query Q
• D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
How$could$we$implement$a$cosine$similarity3based$ measure$using$inverted$index?
What$role$the$denominator$plays?
Algorithm (Reminder)
For each document I, Score(I) =0; I = 1 to N For each query term tk
– Search the vocabulary list
– Pull out the postings list
– For each document J in the list,
• Score(J) =Score(J) + wkj
IR H/M Course
• D1 = (T1 => 12 ,T2=> 23 , T3=>3)
• D2 = (T1 => 3 , T2 => 2 , T3 => 1)
• Q = (T1 => 0 ,T2=> 0, T3=>2)
• Sim(D1,Q) = 12*0 + 23*0 + 3*2 =6
• Sim(D2,Q) = 3*0+3*0+1*2 = 2
sim
Matching Coefficient (Coordination Level)
• Simply counts the number of dimensions on which both vectors are non-zero
• |X ! Y| ” #xi * yi
• Number of shared index terms (binary vectors)
• Does not take into account the sizes of the vectors
If#you#are#using#an#inverted#index? 3*2=6
Role#of#Document#length?
Is#this#familiar?
IR H/M Course
Some Problems …
• Normalisation …
• Consider a single word query and a single word document (In Binary mode…)
– If that matches • Coefficient is 1
• Same query against a thousand word document
– If that matches • Coefficient is 1
Dice Coefficient
• 2 |X!Y|/(|X|+|Y|)
• Normalises for length by dividing by the total number of non-zero entries.
• We multiply by 2 so that we get a measure that ranges from 0 to 1.0
Justify(the(need(for(vector length(normalization
What%is%a%dice%coefficient?
IR H/M Course
Query Term Weighting • Boolean representation
– Just have a weight of zero or 1
• Short queries
– Typical of web searches
– Multiple keyword occurrences are rare
• Wkq = idfk
• Long queries
– Result of relevance feedback (will talk about it later)
• Wkq =fkq .idfk
Advantages and Disadvantages of a Vector-space Model
Advantages
− Simple geometric interpretation of retrieval readily comprehensible to non- specialist and a uniform basis for wide range of operations
− Easy to compute measure (any similarity measure can be used)
− Easy to adapt to various weighting schemes
− Provision for ranked output
Disadvantages
− High dimensionality − Term independence
assumption
− Adhoc similarity metric: Cosine, Dice, etc. (which one to use?)
−Adhoc term weighting (not theoretically founded)
−No guidance on when to stop ranking
Discuss&three&query&term& weighting&strategies!
程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com