How AI could help make Wikipedia entries more accurate


A system that learns from all of Wikipedia

At first, the scale of the work seemed formidable, even for an advanced AI system: there were millions of citations to check and millions of potential evidence documents to consider. Even more intimidating, verifying citations requires near-human understanding of language. To succeed at this task, an AI model must understand the claim in question, find the corresponding passage on the cited website, and predict whether the source actually verifies the claim.

At Meta AI, we have already started developing the building blocks for the next generation of citation tools. Last year we released an AI model that integrates information retrieval and verification, and we are training neural networks to learn more nuanced representations of language so they can pinpoint relevant source material in an internet-sized pool of data.

Where a person would use reasoning and common sense to evaluate a citation, our system applies natural language understanding (NLU) techniques to estimate the likelihood that a claim can be inferred from a source. In NLU, a model translates human language (words, phrases, sentences, or paragraphs) into complex mathematical representations. We designed our tools to compare these representations in order to determine whether one statement supports or contradicts another.
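
The idea of comparing representations can be illustrated with a deliberately simple sketch. The real system uses learned neural embeddings and an entailment model; here, a bag-of-words vector and cosine similarity stand in for them, and the sentences are illustrative paraphrases of the example discussed later in this post:

```python
import math
from collections import Counter

def encode(text):
    """Toy 'representation': a bag-of-words count vector.
    A real NLU model would produce a learned neural embedding."""
    return Counter(text.lower().split())

def support_score(a, b):
    """Cosine similarity between two vectors, standing in for the
    learned score of how strongly one statement supports another."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

claim = "Simon Woods was appointed CEO of the Los Angeles Philharmonic"
source_a = "The Los Angeles Philharmonic announced Simon Woods as its CEO"
source_b = "The Dallas Symphony announced a new president and CEO"

score_a = support_score(encode(claim), encode(source_a))
score_b = support_score(encode(claim), encode(source_b))
```

Even this crude comparison scores the on-topic source higher than the off-topic one; the learned representations the post describes capture far subtler distinctions, such as negation and paraphrase.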

One of the main components of the system is our new dataset of 134 million web pages: Sphere, an open-source, web-scale retrieval library.

To find suitable sources among the millions of web pages in the library, we designed a way to use AI to index vast amounts of information. We fed our algorithms 4 million claims from Wikipedia, teaching them to zero in on a single source from a huge pool of web pages to validate each statement.

During a search, the models create and compare mathematical representations of the meaning of entire statements rather than of individual words. Because web pages can contain long stretches of text, the models evaluate content in blocks and consider only the most relevant passage when deciding whether to recommend a URL. These precomputed indices, which cover 40 times more content than other Wikipedia indices, will be included with Sphere.
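
The block-by-block evaluation can be sketched as follows. This is a minimal illustration, not the actual retrieval code: token overlap with the claim stands in for the learned passage score, and the page names and texts are made up:

```python
def best_passage_score(page_text, claim, window=8, stride=4):
    """Score a page by its single best-matching block of text.
    A real dense-retrieval model would score each block with a
    neural encoder; token overlap is a stand-in here."""
    claim_tokens = set(claim.lower().split())
    words = page_text.lower().split()
    # Split the page into overlapping fixed-size blocks.
    chunks = [words[i:i + window]
              for i in range(0, max(len(words) - window, 0) + 1, stride)]
    return max(len(claim_tokens & set(chunk)) for chunk in chunks)

claim = "Simon Woods appointed CEO of the Los Angeles Philharmonic"
pages = {
    "la-phil-post": "news archive the los angeles philharmonic announced "
                    "simon woods as its new chief executive plus many "
                    "unrelated concert listings",
    "dallas-release": "the dallas symphony association announced a new "
                      "president and chief executive officer today",
}
ranked = sorted(pages, reverse=True,
                key=lambda url: best_passage_score(pages[url], claim))
```

Representing each page by its best passage keeps long, mostly irrelevant pages from drowning out the one paragraph that actually matters.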

The retrieval system passes potential sources to an evidence-ranking model, which compares the new text with the original citation. Using fine-grained language understanding, the model ranks the cited source and the retrieved alternatives by how likely they are to support the claim. When deployed in the real world, the model will suggest the most relevant URLs as potential citations for a human editor to review and approve.
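
The final review step reduces to comparing scores and surfacing the winner to a human. In this sketch the URLs and support scores are hypothetical placeholders for what the evidence-ranking model would output:

```python
def review_citation(current, alternatives):
    """current and alternatives are (url, support_score) pairs, where
    the scores would come from the evidence-ranking model (the values
    used below are invented for illustration). Returns the best URL
    and whether it would replace the existing citation. The suggestion
    still goes to a human editor for review and approval."""
    best = max([current] + alternatives, key=lambda pair: pair[1])
    return {"suggested_url": best[0],
            "replaces_current": best[0] != current[0]}

current = ("dallas-symphony-press-release", 0.21)
alternatives = [("la-phil-blog-post", 0.67), ("unrelated-page", 0.05)]
result = review_citation(current, alternatives)
```

Keeping the existing citation in the candidate pool means the system only flags a change when an alternative genuinely outscores it.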

Better citations in action

Usually, models like this are developed with inputs of just a sentence or two. We trained our models on complicated statements from Wikipedia, along with full websites that may or may not support the claims. As a result, our models achieved a leap in performance at detecting citation accuracy. For example, our system found a better source for a citation in the Wikipedia article "2017 in classical music." The claim reads as follows:

“The Los Angeles Philharmonic Announces the Appointment of Simon Woods as President and CEO, Effective January 22, 2018.”

The current Wikipedia footnote for this statement points to a press release from the Dallas Symphony Association announcing the appointment of its own new president and CEO, also effective January 22, 2018. Despite the similarities, our evidence-ranking model inferred that the press release was irrelevant to the claim. Our system suggested another possible source, a blog post, which notes:

“On Thursday, the Los Angeles Philharmonic announced the appointment of Simon Woods as General Manager, effective January 22, 2018.”

The evidence-ranking model then correctly concluded that this source was more relevant to the claim than the existing Wikipedia citation.

Building an AI that can make sense of the real world

When ready to deploy, our models will strengthen the quality of knowledge on Wikipedia, helping to preserve the accuracy of a resource that virtually everyone uses. Beyond that, this project could ultimately help the research community solve tough problems in AI. Our models have been trained on realistic data at an unprecedented scale. In addition to representing preliminary steps toward an automatic fact-checking system, they could pave the way, for example by serving as pretrained models, to better results on many other tasks, such as classical natural language inference, question answering systems, and few-shot learning.

Open-source projects like these, which teach algorithms to understand dense material with ever-increasing sophistication, help AI make sense of the real world. Although we cannot yet design a computer system with human-level understanding of language, our research is producing smarter, more flexible algorithms. This capability will only grow in importance as we depend on computers to interpret the ever-growing volume of text generated every day.

What's next?

As we continue to refine our verification and retrieval models, we are also looking ahead. These models are the first components of a potential editing assistant that could help verify documents in real time. In addition to suggesting citations, the system would propose auto-complete text, informed by relevant documents found on the web, and offer proofreading corrections. Ideally, the models would understand multiple languages and handle several media types, including video, images, and data tables. These capabilities are among Meta AI's new goals as we help teach technology to understand our world.

Note: Wikimedia and Meta are not partners on this project. The project is still in the research phase and is not being used to automatically update Wikipedia content.
