Bismillah: where it all began
I was four years old when my father held my hand and had me read my Bismillah. That was the beginning of a relationship with the Quran that would last a lifetime.
By the time I was nine, I was trying to understand what I was reading. The Urdu translations were more difficult than the Arabic itself: heavy, old-fashioned language that a child couldn't penetrate. I was frustrated. I wanted to know what Allah was saying to me, but the tools available felt like barriers rather than bridges.
But I kept going. The Quran was a friend through all those years. A companion.
2003 to 2010: when friendship became love
From 2003 to 2010, something shifted. I could not live without the Quran. Every day, whatever I could do: listening to it, trying to recite it, sitting with it even when I didn't understand every word. I didn't know Arabic. I didn't have a teacher. But I had the Book, and I had the desperate desire to understand it.
The way I learned was the only way I knew: I would read an ayah, then search the entire Quran for every other ayah that used the same word.
Allah says the Quran explains itself: one part is explained by another. I experienced this firsthand. The word used in one surah carries a shade of meaning that only becomes clear when you see how the same root appears in another surah, in a different context. But finding those connections manually was exhausting. I would read an ayah with the word رِزْق (provision) and want to find every other ayah that talks about provision. Not just رزق but يَرْزُقُ، رَازِقِينَ، الرَّزَّاقُ. All the same root, all different forms. How do you find them without knowing every morphological pattern?
The book that changed everything
Then I found Fuad Abd al-Baqi's concordance, al-Mu'jam al-Mufahras li-Alfaz al-Quran al-Karim. A book where you look up any Arabic root and find every ayah containing it. It was exactly what I needed.
But you still had to know the root. If you didn't know that كَتَبَ, يَكْتُبُ, كِتَابٌ, مَكْتُوبٌ, and الكَاتِبِينَ all come from the same root ك-ت-ب, the book couldn't help you. And it covered only the Quran. What about the hadith? What if the Prophet ﷺ explained the same concept using the same root?
At the time, I didn't know Wensinck's hadith concordance existed. I didn't know anyone had ever tried to index hadith by root. All I knew was: Fuad Abd al-Baqi's book helped me find Quran words by root, and I wished someone had done the same thing for hadith.
2015 to 2017: my only companion
From 2015 to 2017, the Quran was my everything. My happiness, my sadness, my tears of joy, my embrace. I was memorizing it. I could not bear to be separated from it. Every surah I memorized became part of me.
During that time, the concordance was literally in my head. I could recall a root and find all the words from the Quran in my mind. Every root I encountered became a thread I wanted to pull, to see where it led across the Book. When you have memorized a surah and you encounter the same root in another surah, you feel the connection physically. You want to see ALL of them. You want to see the pattern Allah wove across 114 surahs.
Ramadan: where the idea was born
It started in Ramadan. I was deep in the Quran, using Fuad Abd al-Baqi's concordance to trace words across surahs. And I had discovered al-Furuq al-Lughawiyyah, the science of linguistic distinctions. Arabic doesn't have true synonyms. خَشْيَة and خَوْف both translate as "fear" in English, but they are not the same. رَحْمَة and رَأْفَة both translate as "mercy," but a scholar knows the difference. Every root carries a precise shade of meaning that no translation captures.
I was sitting with these books in Ramadan, tracing roots, reading the Furuq, and I thought: what if this could be an app? Starting from the Quran. Click any word, see its root, see al-Raghib al-Isfahani's classical definition, and see the Furuq: how this word differs from its near-synonyms. And then, from that root, see every verse in the Quran and every hadith across every book that shares it.
I am not a developer or a software engineer. I didn't know what NLP was, or what a morphological analyzer does, or how to parse Arabic text computationally. But the idea was born in that Ramadan, and it wouldn't leave.
The first steps: Quran bil-Quran
It was the 19th of Ramadan, March 8th, 2026. The idea crystallized. And I started. Not knowing where it would go, not knowing how. Just starting.
The earliest work wasn't about hadith at all. It was pure Quran. Root databases, stem mappings, ayah-root connections. Then Mutashabihat ul-Quran, finding the similar and matching verses. The exact concept of "Quran explains Quran" turned into code.
Then it became Quran bil-Quran, a full app where you could click any word in any verse and see every other verse sharing its triliteral root. I grouped 1,651 roots into semantic families. I integrated al-Raghib al-Isfahani's Mufradat and the Furuq al-Lughawiyyah. The Quran was finally explaining itself digitally, the way I had been trying to do by hand for years.
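The click-a-word lookup can be pictured as an inverted index from root to verses. This is only a minimal sketch, assuming a word-to-root table already exists. The names `WORD_ROOTS`, `build_root_index`, and `verses_sharing_root` are hypothetical, and the four sample rows (written without diacritics) are illustrative, not the app's actual schema or data.

```python
from collections import defaultdict

# Hypothetical sample rows: ((surah, ayah), word, root). A real table would
# come from a morphological database covering every word in the Quran.
WORD_ROOTS = [
    ((2, 2), "الكتاب", "كتب"),
    ((2, 79), "يكتبون", "كتب"),
    ((51, 58), "الرزاق", "رزق"),
    ((62, 11), "الرازقين", "رزق"),
]

def build_root_index(rows):
    """Map each root to the sorted list of verses containing any of its forms."""
    index = defaultdict(set)
    for verse, _word, root in rows:
        index[root].add(verse)
    return {root: sorted(verses) for root, verses in index.items()}

def verses_sharing_root(rows, index, word):
    """Return every verse that shares a root with `word`, or [] if the word is unknown."""
    roots = {root for _verse, w, root in rows if w == word}
    found = set()
    for root in roots:
        found.update(index.get(root, []))
    return sorted(found)

INDEX = build_root_index(WORD_ROOTS)
print(verses_sharing_root(WORD_ROOTS, INDEX, "الرزاق"))  # [(51, 58), (62, 11)]
```

The reader never has to know the root: the table resolves the clicked form to its root, and the index returns every verse where any form of that root appears.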
It was deployed. It worked. But it covered only the Quran.
Then came the hadith
The question was obvious: what about the Sunnah? The Prophet ﷺ explained the Quran. His words used the same Arabic roots. If I could connect Quranic roots to hadith roots, I would have the first tool that bridges both corpora computationally, through shared morphology.
What followed was the most intense period of development. Every day a new exploration, every day a new problem solved, every day the corpus grew and the connections deepened. And every day, I am utterly astounded by Allah's mercy. How He helps me. How He lets me do it. Something appears in my inbox, a random video, a link, anything that comes my way, and it adds a layer into the project I didn't know was missing. This entire journey has been guided in ways I cannot explain rationally.
And then I discovered Wensinck
It was only during development that I learned about Wensinck's hadith concordance, a 7-volume work that a team of European orientalists spent from 1936 to 1969 compiling, doing for 9 hadith books what Fuad Abd al-Baqi did for the Quran. I had been building the same thing computationally without knowing it already existed in physical form.
But Wensinck's concordance and Fuad Abd al-Baqi's concordance existed in complete isolation. No one had ever connected them. Later, Wensinck's methodology would solve a problem I couldn't solve alone: 315 Quranic roots that my primary stemmer couldn't match to hadiths. Two methods, built independently, cross-validating each other. Root coverage jumped from 81% to 96.3%.
The problems that made me laugh and cry
Every step forward revealed a new problem. Some were so absurd I had to laugh. Others made me want to give up.
The 22 classical texts
The narrator database didn't arrive as a plan. It grew because each problem demanded a new source.
111,604 profiles and still not enough
By this point, 22 classical texts had been parsed, cleaned, deduplicated. 111,604 narrator profiles. 70.6% graded. 99.5% clean names. A confidence scoring system. Teacher-student chains from AR-Sanad. The database was real. But every external source we tried to cross-reference against was a dead end. Dorar.net blocks automation. Islamweb serves the same texts we already have. Shamela has Cloudflare. Sunnah.com needs an API key. There is no publicly accessible, structured narrator grading database anywhere on the internet.
Then we found Gawami al-Kalim
A 2.8GB Windows desktop app from 2010, archived on archive.org. Gawami al-Kalim 4.5, funded by Qatar's Directorate of Endowments. Abandoned software, no API, no documentation. Downloaded the RAR, extracted it, found Data/Rawy.zip (116 MB). Inside: .TBX files in a custom binary format. Record separator 0x01, tab-delimited, scattered control bytes.
Cracked it open: 49,845 narrator profiles (we initially thought 27,756, then discovered each record packed multiple narrators as 29-field blocks). 549,173 chain links. 38,355 isnad evaluations. About 4.2 million extended chain links. And in Books.zip: 1,004 hadith collections with narrator L-tags embedded in the text, every narrator in every hadith pre-tagged with their GK ID. 26,759 hadiths across 6 canonical books with perfect narrator identification.
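The record layout described above (0x01 record separator, tab-delimited fields, stray control bytes, 29 fields per narrator) can be sketched roughly as follows. This is not the converter that was actually used. The function name `parse_tbx`, the set of control bytes being stripped, and the default text encoding are all assumptions. The real encoding of the .TBX files is not stated here and would need to be confirmed against the data.

```python
import re

RECORD_SEP = b"\x01"       # record separator, as described above
FIELDS_PER_NARRATOR = 29   # width of one narrator block, as described above
# Assumption: "scattered control bytes" means any control byte other than
# tab (0x09, the field delimiter) and 0x01 (the record separator).
CONTROL_BYTES = re.compile(rb"[\x00\x02-\x08\x0a-\x1f]")

def parse_tbx(raw: bytes, encoding: str = "utf-8"):
    """Split raw bytes into narrator blocks of FIELDS_PER_NARRATOR fields each."""
    narrators = []
    for record in raw.split(RECORD_SEP):
        cleaned = CONTROL_BYTES.sub(b"", record)
        if not cleaned.strip():
            continue  # skip empty records
        fields = cleaned.decode(encoding, errors="replace").split("\t")
        # One record may pack several narrators back to back; any trailing
        # partial block (fewer than 29 fields) is dropped in this sketch.
        last_start = len(fields) - FIELDS_PER_NARRATOR
        for start in range(0, last_start + 1, FIELDS_PER_NARRATOR):
            narrators.append(fields[start:start + FIELDS_PER_NARRATOR])
    return narrators
```

Treating each record as a single narrator is exactly the mistake that would undercount the profiles. Stepping through the fields in blocks of 29 is what recovers the packed ones.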
Cross-referenced against our existing 111,604 profiles: 2,578 new grades recovered. Validated: 91.4% agreement on 5,062 overlapping narrators. GK didn't replace what we had built. It confirmed it, and filled gaps we couldn't reach from classical texts alone.
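The agreement figure is a simple overlap comparison. A minimal sketch, assuming both sources have already been mapped to shared narrator IDs and to the same grade labels (both of which are the hard part and are not shown); `grade_agreement` and the sample IDs are hypothetical.

```python
def grade_agreement(ours: dict, theirs: dict):
    """Return (overlap_count, agreement_fraction) over narrators graded in both maps."""
    shared = ours.keys() & theirs.keys()
    if not shared:
        return 0, 0.0
    matches = sum(1 for key in shared if ours[key] == theirs[key])
    return len(shared), matches / len(shared)

# Hypothetical toy data: three narrators overlap, two of them agree.
ours = {"n1": "reliable", "n2": "weak", "n3": "reliable", "n4": "weak"}
theirs = {"n1": "reliable", "n2": "weak", "n3": "weak", "n9": "reliable"}
overlap, agreement = grade_agreement(ours, theirs)
```

Narrators graded in only one source are the candidates for gap-filling; the disagreements among the overlapping ones are the cases worth checking against the source text.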
GK Data Available (too large for Git, available on request)
gk_json/gk_narrators.json (25 MB): 49,845 narrator profiles
gk_json/gk_chain_links_sanadrowah.json (110 MB): 549,173 chain links
gk_json/gk_chain_links_phase2.json (431 MB): 4,179,587 extended links
gk_json/gk_chain_evaluations.json (15 MB): 38,355 isnad evaluations
gk_json/gk_hadith_narrator_tags.json (1.5 MB): 26,759 hadiths with narrator L-tags
gk_json/gk_book_index.json (62 KB): 1,004 book names and IDs
gk_full_graph.json (30 MB): 72,339 narrators, 255k links
Total: ~731 MB of structured GK data converted from .TBX to JSON. Contact Ali Bin Shahid if you need access.
The grading engine: 50% to 77%
The entire narrator database was built for one purpose: grading hadiths. Take a hadith text, extract the isnad chain, look up each narrator, apply the weakest-link rule.
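The weakest-link rule itself is small. A minimal sketch, assuming an ordered grade scale: only the label `mostly_reliable` appears in this write-up, so the other labels, the function name `grade_chain`, and the narrator IDs are placeholders.

```python
# Hypothetical grade scale, ordered weakest to strongest.
GRADE_ORDER = ["fabricator", "weak", "mostly_reliable", "reliable"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

def grade_chain(chain, narrator_grades):
    """Return the weakest grade in the chain, or None if any narrator is ungraded."""
    grades = []
    for narrator_id in chain:
        grade = narrator_grades.get(narrator_id)
        if grade is None:
            return None  # the chain cannot be graded without every link
        grades.append(grade)
    return min(grades, key=RANK.__getitem__) if grades else None

narrator_grades = {"a": "reliable", "b": "weak", "c": "mostly_reliable"}
print(grade_chain(["a", "b", "c"], narrator_grades))  # weak
```

Because the result is only as good as the lookup, every mistake upstream (wrong narrator selected, reversed chain, missing profile) surfaces here as a wrong grade.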
First attempt: 50% accuracy against Albani's grades. Half wrong. The chain direction was reversed (isnad goes backward, position 0 heard from position 1). The wrong Malik was being selected (a companion instead of Imam Malik). The isnad extractor was deleting the يعني clarifications that hadith scholars put there specifically for disambiguation.
Then came four words that changed the architecture: "but we have AR-Sanad." AR-Sanad had bidirectional teacher-student chains (234k links each way). GK was the name/grade layer. AR-Sanad was the chain layer. The combined approach worked.
The comma revelation: "عبد العزيز، يعني ابن محمد" means "Abdul Aziz, MEANING Ibn Muhammad." Our extractor was throwing away the يعني text. Line 73 literally stripped the disambiguation hints that scholars put there for exactly this purpose.
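Keeping the hint instead of stripping it might look like the sketch below. It assumes the clarification follows an Arabic comma and the word يعني, as in the example above; real isnads will have more variants than this one pattern, and `split_clarification` is a hypothetical name, not the extractor's actual function.

```python
import re

ARABIC_COMMA = "\u060c"
YANI_WORD = "يعني"
# Name, Arabic comma, the word يعني, then the disambiguation hint.
YANI = re.compile(
    rf"^(?P<name>.+?)\s*{ARABIC_COMMA}\s*{YANI_WORD}\s+(?P<hint>.+)$"
)

def split_clarification(mention: str):
    """Return (name, hint); hint is None when no يعني clarification is present."""
    text = mention.strip()
    match = YANI.match(text)
    if match:
        return match.group("name"), match.group("hint")
    return text, None

name, hint = split_clarification(f"عبد العزيز{ARABIC_COMMA} {YANI_WORD} ابن محمد")
```

The hint can then be passed to the narrator lookup as an extra filter, so that "Abdul Aziz" is matched only against candidates whose father is Muhammad.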
Fame as disambiguation: when scholars write just "مالك" without qualification, they mean the famous one. The single name IS the disambiguation. Prefer the narrator with the most connections.
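As a sketch, "prefer the narrator with the most connections" is a single `max` over the candidates. The IDs and link counts below are invented for illustration, and `pick_most_connected` is a hypothetical name.

```python
def pick_most_connected(candidates, link_counts):
    """Among candidate narrator IDs for a bare name, prefer the one with the most chain links."""
    if not candidates:
        return None
    # Ties go to the higher ID: an arbitrary choice that keeps results deterministic.
    return max(candidates, key=lambda nid: (link_counts.get(nid, 0), nid))

# Hypothetical candidates for a bare "مالك", with invented link counts.
link_counts = {"malik_ibn_anas": 900, "malik_companion": 12}
chosen = pick_most_connected(["malik_companion", "malik_ibn_anas"], link_counts)
```

This should only apply when the name appears with no qualification at all; a mention that carries a nasab or a يعني hint has already told you which narrator is meant.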
The Albani reverse loop: 40,000 graded hadiths implicitly tell you about their narrators. If a narrator appears in 191 sahih hadiths and only 12 da'if ones, they're demonstrably reliable. 587 wrong grades fixed from statistical evidence.
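The reverse loop can be sketched as a threshold on how a narrator's graded hadiths split. The cut-offs `min_total` and `min_ratio` are assumptions chosen for illustration, not the values used to fix the 587 grades, and `infer_grade` is a hypothetical name.

```python
def infer_grade(sahih_count, daif_count, min_total=50, min_ratio=0.9):
    """Suggest 'reliable' or 'weak' from graded-hadith counts, or None if evidence is thin or mixed."""
    total = sahih_count + daif_count
    if total < min_total:
        return None  # too few graded hadiths to say anything
    sahih_share = sahih_count / total
    if sahih_share >= min_ratio:
        return "reliable"
    if sahih_share <= 1 - min_ratio:
        return "weak"
    return None  # mixed evidence: leave the existing grade alone

print(infer_grade(191, 12))  # reliable (191 of 203 is about 94% sahih)
```

The sahih side is the stronger signal: a sahih grade vouches for every narrator in the chain, while a da'if grade only says that something in the chain is weak, not which link.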
50% to 77% accuracy. Every percentage point earned through a different insight.
Then Bukhari changed everything
We tested on Bukhari. Undisputed sahih. Starting accuracy: 98.0%. Every "wrong" grade traced back to a narrator our database had graded too low. Fixed 25 narrators, each traced to taqrib, GK, and AR-Sanad evidence. Result: 99.99%. The one remaining "error" is not an error at all. Taqrib says أسيد بن زيد is weak, and Bukhari used him in a mutaba'a (supported) role. Our engine found what Ibn Hajar documented 600 years ago.
Muslim: 98.5% to 99.88% after 15 fixes. Abu Dawud improved to 77.0% from shared narrator corrections.
The insight that changed the architecture: Bukhari and Muslim are calibration tools. If a narrator appears in their chains, they are at least mostly_reliable. Any narrator graded lower in our database is wrong. Their acceptance IS the evidence.
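Applied literally, the calibration rule is a floor on grades. A minimal sketch: the grade scale (apart from `mostly_reliable`), the function name `apply_sahihayn_floor`, and the IDs are assumptions for illustration.

```python
# Hypothetical scale, weakest to strongest; only mostly_reliable is from the text.
GRADE_ORDER = ["weak", "mostly_reliable", "reliable"]
RANK = {grade: position for position, grade in enumerate(GRADE_ORDER)}

def apply_sahihayn_floor(narrator_grades, sahihayn_narrators, floor="mostly_reliable"):
    """Raise every narrator found in Bukhari/Muslim chains to at least `floor`.

    Returns (corrected_grades, changed_ids); the input mapping is not modified.
    """
    corrected = dict(narrator_grades)
    changed = []
    for narrator_id in sorted(sahihayn_narrators):
        current = corrected.get(narrator_id)
        if current is None or RANK[current] < RANK[floor]:
            corrected[narrator_id] = floor
            changed.append(narrator_id)
    return corrected, changed

grades = {"a": "weak", "b": "reliable", "c": "weak"}
corrected, changed = apply_sahihayn_floor(grades, {"a", "b", "d"})
```

Returning the list of changed IDs matters: the أسيد بن زيد case shows that a narrator can appear in Bukhari in a supporting role and still be weak, so each raise is better treated as a flag to verify against the sources than as an automatic overwrite.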
Then we built 8 disambiguation rules from taqrib raw text. When scholars write just "سفيان" without qualification, which Sufyan do they mean? Taqrib tells you. "أبيه" (his father) resolved from the previous narrator's nasab. 623 dropped narrators restored. AR-Sanad audited: 99.99% accurate, 72 compressed chains, zero data errors. Publishable.
And the golden rule was established: go to the source text. Not the parsed data. Not the CSV. Not the pattern. The book.
What it became
112,221 hadiths across 18 books. 1,590 of 1,651 Quranic roots connected. 1,528,346 verified root links. 115,735 narrator profiles from 22 classical texts, 72.6% graded, 99.5% clean, 96% uniquely identified, 100% traceable. 217,762 name variants. 127,411 teacher-student links. 33,758 Arabic words with morphological definitions. 100,656 isnad chains parsed. A digital Wensinck concordance. Confidence scoring on every profile. Bukhari 99.99%, Muslim 99.88%, Abu Dawud 77%. 8 disambiguation rules sourced from taqrib. 59,365 hadiths graded by named scholars, about 53% of the corpus.
What took Wensinck's team 33 years, this computes in seconds. What took a scholar physically opening four volumes of biographical dictionaries, this consolidates into one search. What no one had ever done, connecting the Quran concordance to the hadith concordance through shared roots, is now a click away.
And it is free. And it will remain free. Because this is a sadaqah jariyah.
What comes next
Break past 77%
GK's Books.zip has isnad/matn boundary markers for 26,759 hadiths. Use these as training data to teach the engine where isnads end and matns begin. Use the L-tags as ground truth to find extraction bugs systematically. The goal: 85%+ accuracy.
1,004 hadith collections
Books.zip contains 1,004 hadith collections. We use 6. Each new collection feeds the reverse learning loop with more narrator evidence. More evidence means more grades, which means better grading, which means more evidence. The virtuous cycle.
The narrator's portrait
Every narrator has a story. The hadiths they narrate are the hadiths they carried through their lives, what they heard, what they memorized, what they chose to transmit. A narrator's collection IS their portrait: Abu Huraira's 5,000+ narrations paint a picture of a man who spent every moment near the Prophet ﷺ. Aisha's narrations reveal the private life no one else had access to. Ali's cluster in Ahmad reflects the Kufan transmission school. The next step is to build this portrait.
The problem that hasn't been solved
We searched for a curated universal narrator ID database. GitHub, islam-db, dorar.net, islamweb, shamela. The n007rehan dataset that everyone references doesn't exist. There is no shortcut. The hadith narrator database is a problem that hasn't been solved computationally by anyone. We are building it from scratch, one classical text at a time, one war story at a time.