The problem with knowledge databases like Wikipedia and Wikidata

Post by Reddi1 »

Since Wikidata and Wikipedia capture only a fraction of all real-world entities, the most difficult task for Google is to extract information about entities and entity types from other websites. Most websites and documents are built differently and rarely follow a uniform structure, so Google still has a big task ahead of it in expanding the Knowledge Graph further.

Structured and semi-structured information from manually maintained data sources such as Wikipedia or Wikidata is often checked and prepared in such a way that Google can easily extract it and add it to the Knowledge Graph. But even these websites and databases are not perfect.

The problem with manually maintained databases and semi-structured websites such as Wikipedia is the lack of completeness, validity and timeliness of the data.

Completeness refers both to the entities recorded in a database and to their attributes and associated entity types.
Validity refers to the correctness of the recorded attributes, statements or facts.
Timeliness refers to how up to date the attributes of the recorded entities are (see the sketch after this list).
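To make the three dimensions concrete, here is a minimal, hypothetical sketch in Python. The record fields, the required-attribute set and the freshness threshold are my own illustrative assumptions, not anything Google or Wikidata actually uses:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class EntityRecord:
    """Hypothetical knowledge graph entry; field names are illustrative."""
    name: str
    entity_type: str
    attributes: dict = field(default_factory=dict)
    last_updated: date = date(2020, 1, 1)

REQUIRED_ATTRIBUTES = {"description", "founded"}  # assumed schema

def completeness(rec: EntityRecord) -> float:
    """Share of required attributes that are actually recorded."""
    return len(REQUIRED_ATTRIBUTES & rec.attributes.keys()) / len(REQUIRED_ATTRIBUTES)

def is_valid(rec: EntityRecord, trusted_facts: dict) -> bool:
    """Do the recorded attributes agree with a trusted reference source?"""
    return all(trusted_facts[k] == v
               for k, v in rec.attributes.items() if k in trusted_facts)

def is_timely(rec: EntityRecord, max_age_days: int = 365) -> bool:
    """Was the record updated within the freshness window?"""
    return date.today() - rec.last_updated <= timedelta(days=max_age_days)
```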
There is a real tension between validity and completeness. If Google relies solely on Wikipedia, the validity of the information is very high thanks to the strict checks carried out by diligent Wikipedians. Timeliness is already more difficult, and when it comes to completeness the information is simply not sufficient, as Wikipedia covers only a fraction of the world's knowledge.

To come close to completeness, Google must be able to extract unstructured data from websites while still ensuring validity and timeliness. Articles in Google News, for example, are a very interesting source for keeping entities already recorded in the Knowledge Graph up to date.

Through trillions of indexed documents, Google has access to a huge wealth of knowledge: news sites, blogs, magazines, reviews, shops, glossaries, dictionaries and so on.

However, not every source of information is valid enough to be useful. The first step is therefore to identify trustworthy domains as sources.

In a first step, entity-relevant documents can be identified by spotting mentions of entities already stored in the Knowledge Graph, as sketched below.
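A minimal sketch of this filtering step, with a small known-entity set and naive substring matching as stand-ins for real entity linking:

```python
KNOWN_ENTITIES = {"Wikipedia", "Wikidata", "Google News"}  # illustrative

def entity_mentions(text: str, entities: set) -> set:
    """Return which known entities a document mentions (naive matching)."""
    lowered = text.lower()
    return {e for e in entities if e.lower() in lowered}

def entity_relevant_docs(docs: list, entities: set) -> list:
    """Keep only documents that mention at least one known entity."""
    return [(doc, hits) for doc in docs
            if (hits := entity_mentions(doc, entities))]

docs = [
    "Wikidata stores structured statements about entities.",
    "A recipe for banana bread.",
]
print(entity_relevant_docs(docs, KNOWN_ENTITIES))
# -> [('Wikidata stores structured statements about entities.', {'Wikidata'})]
```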

Terms that occur in the immediate vicinity of a named entity, its co-occurrences, can be linked to it. From these, attributes and other entities related to the main entity can be extracted from the content and stored in the respective entity profile. Both the proximity between a term and the entity in the text and the frequency of the occurring main-entity/attribute pairs or main-entity/secondary-entity pairs can be used for validation and for weighting.
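Here is a minimal sketch of such proximity-and-frequency weighting, assuming whitespace tokenization and a single-token main entity; the window size and the 1/distance score are my own plausible choices, not a documented Google method:

```python
from collections import Counter

def cooccurrence_weights(text: str, main_entity: str, window: int = 10) -> Counter:
    """Score terms that co-occur with the main entity inside a token window.

    Closer terms contribute more (1 / distance), and repeated
    entity-term pairs accumulate score, so the result combines
    proximity and frequency as described above.
    """
    tokens = text.lower().split()
    target = main_entity.lower()  # simplification: single-token entity
    weights = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                weights[tokens[j]] += 1.0 / abs(j - i)
    return weights

text = ("wikidata is a free knowledge base and wikidata stores "
        "structured statements about entities")
print(cooccurrence_weights(text, "Wikidata", window=4).most_common(3))
```

In a real system, stopwords would be filtered out and multi-word entities resolved by an entity linker before any weighting is applied.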

This allows Google to continuously enrich the entities in the Knowledge Graph with new information.

For the sections below, I researched Google patents and other sources to find approaches for ensuring completeness (recall), validity and timeliness.