
Even GenAI uses Wikipedia as a source

Ryan is joined by Philippe Saade, the AI project lead at Wikimedia Deutschland, to dive into the Wikidata Embedding Project and how their team vectorized 30 million of Wikidata’s 119 million entries for semantic search.


February 20, 2026


  • Image credit: Alexandra Francis

They discuss how this project helped offload the burden that scraping was creating for their sites, what Wikimedia.DE is doing to maintain data integrity for their entries, and the importance of user feedback even as they work to bring Wikipedia’s vast knowledge to people building open-source AI projects.

Wikimedia.DE announced the Wikidata Embedding Project with MCP support in October of last year. Check out their vector database and codebase for the project.

Connect with Philippe on LinkedIn and his Wiki page.

Today’s shoutout goes to an Unsung Hero on Stack Overflow—someone who has more than 10 accepted answers with a zero score, making up 25% of their total. Thank you to user MWB for bringing your knowledge to the community!

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, and today we're talking about the Grand Vectorization MCP Project that happened with Wikimedia Deutschland. My guest today is Philippe Saade, who is the AI project lead at Wikimedia Deutschland. So, welcome to the program.

Philippe Saade: Thank you so much.

Ryan Donovan: So, before we get started, we'd like to find out a little bit about our guest. How did you get involved in software technology?

Philippe Saade: So, for me, I started my bachelor's in computer science.

Ryan Donovan: Mm-hmm.

Philippe Saade: And jumped into my Master's here in Germany. And it was more general computer science, but then I took a few more classes around AI and machine learning, and it was a little bit before what now AI means for people, like before chatbots.

Ryan Donovan: Yeah.

Philippe Saade: So, I was more intrigued by the simpler AI techniques, like how you can teach a model to do some classifications, or things like that. So, I delved right into it.

Ryan Donovan: Is that the sort of more classical deep learning, unsupervised, or is it a more labeled data training?

Philippe Saade: Mostly both. I was more obviously interested in the deep learning stuff, 'cause it's just super interesting how you can just from data create this model that does really cool things.

Ryan Donovan: Mm-hmm.

Philippe Saade: From classification or even unsupervised learning. Reinforcement learning is super interesting. So, I just wanted to go into this topic and learn more how things work.

Ryan Donovan: Yeah. Well, today we're talking about a very big classification project, right? So, obviously Wikimedia, and Wikipedia, and all the Wiki sites are a very important part of the internet. I'm sure with all of the new AI real-time data, just-in-time data, you guys were getting pretty hammered, right?

Philippe Saade: I mean, yeah, for sure. We've been having a lot of discussions on scraping and how it's been affecting the infrastructure. I assume these are probably RAG applications collecting all this data to provide answers, but also companies gathering the information for training, and that's a huge aspect; and we're hoping to find better solutions.

Ryan Donovan: Yeah. It seems like the better solution is to cooperate and not resist, right?

Philippe Saade: Yeah, of course. I mean, you can't really stop these types of scraping. Better to find solutions to either provide the data in a simpler way, instead of having multiple calls on the API or the SPARQL query service that do a lot of computations. So, it's better to just either provide the data or find a solution to make it simpler, and with less resources.

Ryan Donovan: Yeah, and I think, you know, that's sort of the conclusion we've come to here at Stack Overflow, as well. The blog post you wrote about this came to my attention from some of our engineers, who said, 'oh, this is great. This is such an interesting project at scale.' Can you give us a little sense of what is the scope of this project?

Philippe Saade: Yeah. So, basically, at Wikimedia Deutschland, we're working mostly on Wikidata, which is the sister project of Wikipedia and the knowledge graph behind Wikimedia. Basically, the project that we're working on is a vector database on top of Wikidata, and we're hoping, by developing this vector database, to provide a simpler access point to Wikidata and enable semantic search; and at the same time, to encourage the open-source AI community to build projects with Wikidata.
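At its core, the semantic search Philippe describes means embedding a query and ranking stored item vectors by similarity. Here is a minimal, self-contained sketch: a toy bag-of-words function stands in for a real embedding model, and the Q-ids, vocabulary, and texts are purely illustrative, not the project's actual data or API.

```python
import math
from collections import Counter

VOCAB = ["film", "director", "city", "river", "germany", "physics"]

def toy_embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model: word counts over a tiny vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# In-memory "vector database": item id -> embedding of its textual representation
index = {
    "Q64": toy_embed("city germany capital"),       # Berlin
    "Q1055": toy_embed("city germany port river"),  # Hamburg
    "Q937": toy_embed("physics scientist"),         # Albert Einstein
}

def search(query: str, k: int = 2) -> list[str]:
    """Embed the query, then return the ids of the k most similar items."""
    qv = toy_embed(query)
    ranked = sorted(index.items(), key=lambda kv: cosine(qv, kv[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]
```

A real deployment would swap `toy_embed` for a pre-trained sentence-embedding model and the dict for an actual vector store with approximate nearest-neighbor search, but the query path is the same shape.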

Ryan Donovan: I think the numbers were something like 1.78 terabytes of text with 119 million entries. Is that about right?

Philippe Saade: Yeah, that's true. So, Wikidata currently has around 119 million items in it. And you know, items that are found in the knowledge graph. Currently, with the vector database, we're embedding around, I would say, 30 million items, and those are filtered based on current testing, seeing what makes the most sense. The 30 million are the ones that are currently linked to a Wikipedia page.
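The "linked to a Wikipedia page" filter Philippe mentions can be sketched in a few lines. The record shape below is an illustrative assumption loosely modeled on the sitelinks field in Wikidata's JSON data model, not the actual dump format, and the heuristic is deliberately rough.

```python
# Illustrative Wikidata-style items: real dumps carry a "sitelinks" map keyed by
# wiki site id (e.g. "enwiki" for English Wikipedia). Shapes here are assumptions.
items = [
    {"id": "Q64", "sitelinks": {"enwiki": "Berlin", "dewiki": "Berlin"}},
    {"id": "Q13442814", "sitelinks": {}},  # e.g. a scholarly article with no Wikipedia page
    {"id": "Q42", "sitelinks": {"enwiki": "Douglas Adams"}},
]

def linked_to_wikipedia(item: dict) -> bool:
    """Rough heuristic: keep items with at least one sitelink to a '*wiki' site."""
    return any(site.endswith("wiki") for site in item.get("sitelinks", {}))

# The subset that would go on to the embedding pipeline
to_embed = [it["id"] for it in items if linked_to_wikipedia(it)]
```

This is how a 119-million-item graph shrinks to the roughly 30 million items that carry general-knowledge Wikipedia coverage.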

Ryan Donovan: Wow, okay. So, there's some dead pages in there.

Philippe Saade: Not necessarily dead pages. So, a lot of these other items, for example, are about scientific articles, which it would be super interesting to have a vector database for. But then I would imagine if we add more and more items to the vector database, it would explode, and it would be a bit more difficult. For now, for testing purposes, it's best to keep it to the more general knowledge information.

Ryan Donovan: Mm-hmm. I definitely wanna get into those limits at some point, but you talked about it as a knowledge graph. How much sort of organization and structure did you have to work with [in the] beginning?

Philippe Saade: In terms of the knowledge graph, how it's internally structured?

Ryan Donovan: Yeah, yeah. How much information and how much of that knowledge graph were you able to embed in the project?

Philippe Saade: Oh, for the vector database, basically, it's not as straightforward to transform an item on Wikidata from the knowledge graph into an embedding, because most embeddings now work with textual information. Currently, we're using a pre-trained embedding model, not one we trained ourselves.

Ryan Donovan: Mm-hmm.

Philippe Saade: So, there was a process of transforming each item into a textual representation to then transform it into an embedding. And that was a little bit of a challenge, I would say, because we had to pass through the data dump multiple times to gather all the information from the connected links in the graph, and aggregate everything into one text to then pass it to the embedding model.
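The multi-pass aggregation Philippe describes, resolving an item's linked Q-ids to their labels and then flattening everything into one text for the embedding model, might look roughly like this. The record shape, property names, and helper are illustrative simplifications, not Wikidata's actual dump format.

```python
# Pass 1 (simulated): a label table gathered from the dump for linked Q-ids.
labels = {"Q64": "Berlin", "Q183": "Germany", "Q515": "city"}

# Illustrative, simplified item record: real Wikidata stores claims as
# property -> values, where entity values reference other items by Q-id.
item = {
    "id": "Q64",
    "label": "Berlin",
    "description": "capital and largest city of Germany",
    "claims": {
        "instance of": ["Q515"],  # linked items, stored as Q-ids
        "country": ["Q183"],
    },
}

def textualize(item: dict, labels: dict) -> str:
    """Pass 2: flatten one item into a single text, resolving Q-ids via `labels`."""
    parts = [f"{item['label']}, {item['description']}."]
    for prop, values in item["claims"].items():
        resolved = ", ".join(labels.get(v, v) for v in values)  # fall back to raw Q-id
        parts.append(f"{prop}: {resolved}.")
    return " ".join(parts)
```

The resulting string is what actually gets passed to the pre-trained embedding model; the multiple passes over the dump exist because `labels` must be built before any item's claims can be resolved into readable text.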

Ryan Donovan: Yeah. I think you have a very curated data set, right? You have a lot of metadata associated with it. In those multiple passes, what was the sort of additional embedding information you were gathering?

[...]

