Determining the Relevance of Documents Using Vector Space Analysis
Posted: Thu Jan 30, 2025 6:46 am
To determine the relevance of documents in relation to a search query, Google uses so-called vector space analyses, which map the search query as a vector and relate it to relevant documents in the vector space. If known entities already appear in the search term, Google can relate the query to documents in which these entities are also mentioned. Documents can be any type of content, such as text, images, or videos.
Documents can be scored based on the size of the angle between the vectors: the smaller the angle, the more relevant the document. Click behavior in the SERPs could also play a role in determining relevance.
Vectors and vector spaces can be used at different levels, whether Word2Vec, Phrase2Vec, Text2Vec, or Entity2Vec. The main vector represents the central element. If it is an entity, it can be placed in relation to other entities or documents.
[Figure: Entities and Documents in Vector Space]
Vector space analyses can be used for scoring in terms of ranking, but they can also simply be used for organizing elements such as semantically related entities, topics, or topic-specific terms. If the main vector is a search term related to documents, the size of the angle or the proximity can be used for ranking or scoring.
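As a minimal sketch of how such angle-based scoring could work, assuming we already have embedding vectors for the query and the documents (the vectors and document names below are invented for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; in practice these would come from a model
# such as Word2Vec or a document encoder.
query_vec = np.array([0.2, 0.8, 0.1])
documents = {
    "doc_a": np.array([0.25, 0.75, 0.05]),
    "doc_b": np.array([0.9, 0.1, 0.3]),
}

# Rank documents by the size of the angle (via cosine similarity).
ranking = sorted(documents.items(),
                 key=lambda kv: cosine_similarity(query_vec, kv[1]),
                 reverse=True)
for doc_id, vec in ranking:
    print(doc_id, round(cosine_similarity(query_vec, vec), 3))
```

The smaller the angle between query and document vector, the higher the cosine similarity and the higher the document ranks in this sketch.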
Determination of entity-relevant documents
The identification of documents that are relevant to a requested entity can be carried out by annotating or tagging the relevant documents, or by identifying entity mentions. This can be done manually by editors or automatically.
Authors and editors can add tags to posts for all named entities that appear in the text, such as people, places, organizations, products, events, and concepts. This would create a relevant corpus of documents for each entity. However, this method alone cannot be used to weight the documents in relation to the entity: a document is either tagged or untagged.
At Google, due to the large number of documents, one can assume an automated process, as described in the section on Natural Language Processing.
Furthermore, it is possible that Google weights documents based on the frequency of entity mentions, similar to term frequency. Documents that exceed a certain threshold of term or entity frequency are included in the scoring process. The rest remain unevaluated and are ranked more or less randomly from positions 30 to 50.
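A hedged sketch of such a threshold filter, assuming we simply count entity mentions per document (the threshold value, entity, and texts are invented; a real system would use NER rather than substring matching):

```python
def entity_frequency(text: str, entity: str) -> int:
    """Naive mention count; real systems would use NER, not substring matching."""
    return text.lower().count(entity.lower())

MIN_MENTIONS = 2  # hypothetical threshold

docs = {
    "doc_a": "Berlin is the capital of Germany. Berlin has 3.7M inhabitants.",
    "doc_b": "A travel report that mentions Berlin only once.",
}

# Only documents above the threshold enter the scoring process;
# the rest would remain unevaluated.
scored = {d: entity_frequency(t, "Berlin") for d, t in docs.items()}
candidates = [d for d, freq in scored.items() if freq >= MIN_MENTIONS]
print(candidates)  # ['doc_a']
```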
Analyzing the frequency of occurring terms is not a new invention and should be familiar to any SEO.
Similar to TF-IDF (term frequency-inverse document frequency) or WDF*IDF, an inverse entity frequency can be determined. Here, the terms and other entities occurring in an entity description are counted and then compared against all entity-relevant documents in the corpus.
In the first step, connections between terms and entities can be determined using the entity-relevant documents. The more often certain terms co-occur with an entity, the more likely a relationship between them is. While in TF-IDF the supporting terms are derived with reference to a keyword, here the terms are determined in relation to the requested entity.
The weighting also takes into account the relevance of the respective document in which the co-occurrence occurs, in other words, the proximity of the document to the entity in question.
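One way this weighting could look in practice is a simple co-occurrence count per entity-relevant document, multiplied by a per-document relevance score. This is a sketch under my own assumptions, not a confirmed Google method; the relevance values and texts are invented:

```python
from collections import defaultdict

# Entity-relevant documents with an assumed relevance score (0..1)
# describing how close each document is to the entity in question.
corpus = [
    {"text": "berlin hosts the berlinale film festival every year", "relevance": 0.9},
    {"text": "the festival draws visitors to the german capital", "relevance": 0.6},
]

def weighted_cooccurrence(corpus, entity_terms):
    """Sum term counts over entity-relevant docs, weighted by doc relevance."""
    weights = defaultdict(float)
    for doc in corpus:
        for token in doc["text"].split():
            if token not in entity_terms:
                weights[token] += doc["relevance"]
    return weights

weights = weighted_cooccurrence(corpus, entity_terms={"berlin"})
# 'festival' co-occurs in both documents -> weight 0.9 + 0.6 = 1.5
# (a real system would also filter stopwords like 'the')
print(sorted(weights.items(), key=lambda kv: -kv[1])[:3])
```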
The formula follows the familiar TF-IDF pattern, where t stands for term, e for entity, and d for document.
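The exact formula is not reproduced here. By analogy with TF-IDF and the definitions above, a plausible reconstruction of such a term-entity weight might look like this (my own sketch, not a confirmed Google formula):

```latex
% Hypothetical term-entity weight, analogous to TF-IDF:
%   tf(t, d)  - frequency of term t in document d
%   rel(d, e) - relevance of document d for entity e
%   D_e       - set of documents relevant to entity e
%   ief(t)    - inverse entity frequency of term t across all
%               entity-relevant documents
w(t, e) = \sum_{d \in D_e} \mathrm{tf}(t, d) \cdot \mathrm{rel}(d, e) \cdot \mathrm{ief}(t)
```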
Terms that occur as co-occurrences in the immediate vicinity of named entities can be linked to them. From this, attributes as well as other "secondary entities" related to the "main entity" can be extracted from the content and stored in the respective "entity profile". Both the proximity between the terms and the entity in the text and the frequency of the occurring main-entity/attribute pairs or main-entity/secondary-entity pairs can be used for validation as well as for weighting.
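A minimal sketch of such window-based extraction, assuming a fixed token window around each entity mention (the window size and example text are invented):

```python
def window_cooccurrences(tokens, entity, window=4):
    """Collect terms within +/- `window` tokens of each entity mention."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok == entity:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            pairs.extend((entity, t) for t in tokens[lo:hi] if t != entity)
    return pairs

tokens = "the brandenburg gate in berlin is a famous landmark".split()
# Candidate attribute / secondary-entity pairs for the main entity 'berlin':
print(window_cooccurrences(tokens, "berlin"))
```

The frequency with which the same pair recurs across documents could then serve as both validation and weighting, as described above.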
Google has officially confirmed that it has been using a similar approach for a long time when evaluating links. The focus here is not only on the anchor text of the link, but also on the surrounding terminology.
The two methods Bag of Words and Continuous Bag of Words (CBOW) should not go unmentioned here. Both consider terms within a defined text window rather than in isolation.
It is not possible to say in general how many words such a text fragment should contain or how large the window should be. Theoretically, it could span an entire text. However, it makes more sense to look at individual paragraphs or chapters in combination with an overall view of the text. This would also explain why very extensive content can achieve top rankings for hundreds of terms at once.
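To illustrate the window question, here is a hedged sketch that builds bag-of-words counts per paragraph and for the whole text, so both the local and the overall view can be combined (the paragraph-splitting rule and example text are simplifications):

```python
from collections import Counter

text = """Berlin is the capital of Germany.

The Berlinale is a film festival held in Berlin every February."""

paragraphs = [p for p in text.split("\n\n") if p.strip()]

# Bag of words per paragraph (local windows) ...
local_bows = [Counter(p.lower().split()) for p in paragraphs]
# ... combined with an overall view of the text.
global_bow = Counter(text.lower().split())

print(local_bows[1]["berlin"], global_bow["berlin"])  # 1 2
```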