Itqan — The Story

Bismillah: where it all began

I was four years old when my father held my hand and had me read my Bismillah. That was the beginning of a relationship with the Quran that would last a lifetime.

By the time I was nine, I was trying to understand what I was reading. The Urdu translations were more difficult than the Arabic itself: heavy, old-fashioned language that a child couldn't penetrate. I was frustrated. I wanted to know what Allah was saying to me, but the tools available felt like barriers rather than bridges.

But I kept going. The Quran was a friend through all those years. A companion.

2003 to 2010: when friendship became love

From 2003 to 2010, something shifted. I could not live without the Quran. Every day I did whatever I could: listening to it, trying to recite it, sitting with it even when I didn't understand every word. I didn't know Arabic. I didn't have a teacher. But I had the Book, and I had the desperate desire to understand it.

The way I learned was the only way I knew: I would read an ayah, then search the entire Quran for every other ayah that used the same word. Because Allah says:

وَنَزَّلْنَا عَلَيْكَ الْكِتَابَ تِبْيَانًا لِّكُلِّ شَيْءٍ

An-Nahl 16:89, "And We have sent down to you the Book as a clarification for all things"

Allah says the Quran explains itself: one part is explained by another. I experienced this firsthand. The word used in one surah carries a shade of meaning that only becomes clear when you see how the same root appears in another surah, in a different context. But finding those connections manually was exhausting. I would read an ayah with the word رِزْق (provision) and want to find every other ayah that talks about provision. Not just رزق but يَرْزُقُ، رَازِقِينَ، الرَّزَّاقُ. All the same root, all different forms. How do you find them without knowing every morphological pattern?

The book that changed everything

Then I found Fuad Abd al-Baqi's concordance, al-Mu'jam al-Mufahras li-Alfaz al-Quran al-Karim. A book where you look up any Arabic root and find every ayah containing it. It was exactly what I needed.

But you still had to know the root. If you didn't know that كَتَبَ, يَكْتُبُ, كِتَابٌ, مَكْتُوبٌ, and الكَاتِبِينَ all come from the same root ك-ت-ب, the book couldn't help you. And it covered only the Quran. What about the hadith? What if the Prophet ﷺ explained the same concept using the same root?

At the time, I didn't know Wensinck's hadith concordance existed. I didn't know anyone had ever tried to index hadith by root. All I knew was: Fuad Abd al-Baqi's book helped me find Quran words by root, and I wished someone had done the same thing for hadith.

2015 to 2017: my only companion

From 2015 to 2017, the Quran was my everything. My happiness, my sadness, my tears of joy, my embrace. I was memorizing it. I could not bear to be separated from it. Every surah I memorized became part of me.

During that time, the concordance was literally in my head. I could recall a root and find all the words from the Quran in my mind. Every root I encountered became a thread I wanted to pull, to see where it led across the Book. When you have memorized a surah and you encounter the same root in another surah, you feel the connection physically. You want to see ALL of them. You want to see the pattern Allah wove across 114 surahs.

Ramadan: where the idea was born

It started in Ramadan. I was deep in the Quran, using Fuad Abd al-Baqi's concordance to trace words across surahs. And I had discovered al-Furuq al-Lughawiyyah, the science of linguistic distinctions. Arabic doesn't have true synonyms. خَشْيَة and خَوْف both translate as "fear" in English, but they are not the same. رَحْمَة and رَأْفَة both translate as "mercy," but a scholar knows the difference. Every root carries a precise shade of meaning that no translation captures.

I was sitting with these books in Ramadan, tracing roots, reading the Furuq, and I thought: what if this could be an app? Starting from the Quran. Click any word, see its root, see al-Raghib al-Isfahani's classical definition, see the Furuq, how this word differs from its near-synonyms. And then, from that root, see every verse in the Quran and every hadith across every book that shares it.

I am not a developer or a software engineer. I didn't know what NLP was, or what a morphological analyzer does, or how to parse Arabic text computationally. But the idea was born in that Ramadan, and it wouldn't leave.

The first steps: Quran bil-Quran

It was the 19th of Ramadan, March 8th, 2026. The idea crystallized. And I started. Not knowing where it would go, not knowing how. Just starting.

The earliest work wasn't about hadith at all. It was pure Quran. Root databases, stem mappings, ayah-root connections. Then Mutashabihat ul-Quran, finding the similar and matching verses. The exact concept of "Quran explains Quran" turned into code.

Then it became Quran bil-Quran, a full app where you could click any word in any verse and see every other verse sharing its triliteral root. I grouped 1,651 roots into semantic families. I integrated al-Raghib al-Isfahani's Mufradat and the Furuq al-Lughawiyyah. The Quran was finally explaining itself digitally, the way I had been trying to do by hand for years.
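The lookup behind "click a word, see every verse sharing its root" can be sketched as an inverted index from root to verse references. This is only an illustration, not the app's actual code: the `WORD_ROOTS` sample, the function names, and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows: (surah, ayah, word, root). Real data would come
# from a morphologically annotated Quran corpus.
WORD_ROOTS = [
    (2, 2, "الْكِتَابُ", "كتب"),
    (2, 3, "رَزَقْنَاهُمْ", "رزق"),
    (2, 183, "كُتِبَ", "كتب"),
    (62, 11, "الرَّازِقِينَ", "رزق"),
    (51, 58, "الرَّزَّاقُ", "رزق"),
]

def build_root_index(rows):
    """Map each root to every (surah, ayah, word) occurrence, in input order."""
    index = defaultdict(list)
    for surah, ayah, word, root in rows:
        index[root].append((surah, ayah, word))
    return index

def verses_sharing_root(index, root, exclude=None):
    """Return all occurrences of a root, optionally skipping the clicked verse."""
    return [(s, a, w) for s, a, w in index.get(root, []) if (s, a) != exclude]

index = build_root_index(WORD_ROOTS)
# Clicking رَزَقْنَاهُمْ in 2:3 lists the other verses built on ر-ز-ق.
hits = verses_sharing_root(index, "رزق", exclude=(2, 3))
```

The index is built once, so each click is a dictionary lookup rather than a scan of the whole text.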

It was deployed. It worked. But it covered only the Quran.

Then came the hadith

The question was obvious: what about the Sunnah? The Prophet ﷺ explained the Quran. His words used the same Arabic roots. If I could connect Quranic roots to hadith roots, I would have the first tool that bridges both corpora computationally, through shared morphology.

What followed was the most intense period of development. Every day a new exploration, every day a new problem solved, every day the corpus grew and the connections deepened. And every day, I am utterly astounded by Allah's mercy. How He helps me. How He lets me do it. Something appears in my inbox, a random video, a link, anything that comes my way, and it adds a layer into the project I didn't know was missing. This entire journey has been guided in ways I cannot explain rationally.

And then I discovered Wensinck

It was only during development that I learned about Wensinck's hadith concordance, a 7-volume work that a team of European orientalists compiled between 1936 and 1969, doing for 9 hadith books what Fuad Abd al-Baqi did for the Quran. I had been building the same thing computationally without knowing it already existed in physical form.

But Wensinck's concordance and Fuad Abd al-Baqi's concordance existed in complete isolation. No one had ever connected them. Later, Wensinck's methodology would solve a problem I couldn't solve alone: 315 Quranic roots that my primary stemmer couldn't match to hadiths. Two methods, built independently, cross-validating each other. Root coverage jumped from 81% to 96.3%.

The problems that made me laugh and cry

Every step forward revealed a new problem. Some were so absurd I had to laugh. Others made me want to give up.

The roots wouldn't transfer

The Quranic root data uses one canonical form for a root; CAMeL Tools uses another. قضي becomes قضو. بيع becomes بوع. 315 Quranic roots returned zero hadith matches. Not because the words didn't exist, but because the two systems disagreed on what to call the root. I built an alias map, 196 entries, painstakingly cross-referencing forms. Still stuck. Then the Wensinck concordance solved it.
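The alias map idea can be sketched as a small normalization step applied before roots are compared. The two entries below are the pairs named above; the table name, function names, and the direction of the mapping are assumptions for illustration, and no CAMeL Tools call is shown.

```python
# Hypothetical alias table: analyzer root form -> Quranic root form.
# The real map had 196 entries; only the two pairs named in the text appear here.
ROOT_ALIASES = {
    "قضو": "قضي",
    "بوع": "بيع",
}

def to_quranic_root(analyzer_root):
    """Translate an analyzer root to the Quranic spelling when an alias exists."""
    return ROOT_ALIASES.get(analyzer_root, analyzer_root)

def match_roots(quranic_roots, hadith_roots):
    """Return the Quranic roots that also occur among the normalized hadith roots."""
    normalized = {to_quranic_root(r) for r in hadith_roots}
    return set(quranic_roots) & normalized

matched = match_roots({"قضي", "بيع", "كتب"}, ["قضو", "بوع", "كتب", "خرج"])
```

Without the alias step, only كتب would match in this example; with it, all three Quranic roots find their hadith counterparts.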

The ayah markers ate the words

Clicking a word gave me the definition of the wrong word. At first off by one. Then by ten. Then by fifty. Then by hundreds. By the later surahs, the lag was thousands of words. Every ayah marker, every delimiter, every verse number was being counted as a "word" by the tokenizer. The lag accumulated across the entire Quran. I had no clue what was happening until I traced it to the delimiters.
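The drift is easy to reproduce on two verses. The sketch below assumes verse markers are whitespace-separated tokens made only of the end-of-ayah sign (U+06DD), digits, or ornate brackets; the actual marker format in the project's text may differ.

```python
import re

# Assumption: a verse marker is a token consisting solely of the end-of-ayah
# sign, ornate parentheses, or digits (\d also matches Arabic-Indic digits).
MARKER = re.compile(r"^[\u06DD\uFD3E\uFD3F\d]+$")

def tokenize_words(text):
    """Split on whitespace and drop tokens that are pure verse markers."""
    return [tok for tok in text.split() if not MARKER.match(tok)]

text = (
    "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ \u06DD١ "
    "الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ \u06DD٢"
)

naive = text.split()          # markers counted as words
clean = tokenize_words(text)  # words only

# Index 4 should be the first word of the second verse. In the naive list it
# is the verse marker, so every later word index is shifted by one per verse.
```

With one extra token per ayah, the offset grows with every verse, which is why the error reached thousands of words by the later surahs.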

The hadith IDs were lying

idInBook restarts at 1 for every chapter in 14 of 18 books. Hadith #2 in Bukhari means 97 different hadiths, one in each chapter. Every Quran-to-Hadith filter had been showing false matches since the beginning. The entire bridge was producing wrong results.
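One way to detect and avoid this kind of collision is to check which (book, idInBook) pairs repeat and to key every hadith on book, chapter, and idInBook together. The record layout and key format below are assumptions for the sketch, not the project's schema.

```python
from collections import Counter

# Hypothetical records: (book, chapter, idInBook).
records = [
    ("bukhari", 1, 1), ("bukhari", 1, 2),
    ("bukhari", 2, 1), ("bukhari", 2, 2),
]

def ambiguous_ids(rows):
    """Return the (book, idInBook) pairs that point at more than one hadith."""
    counts = Counter((book, id_in_book) for book, _, id_in_book in rows)
    return {key for key, n in counts.items() if n > 1}

def global_key(book, chapter, id_in_book):
    """Build a key that stays unique even when idInBook restarts per chapter."""
    return f"{book}:{chapter}:{id_in_book}"

collisions = ambiguous_ids(records)
keys = {global_key(*row) for row in records}
```

Running the collision check over each book shows immediately which books restart their numbering and therefore need the composite key.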

Ghost narrators: أبيه, عمه, أمي, خالي, جدي

"أبيه" (his father) was a top narrator in ALL 11 books. Then أبي, جده, عمه, أمه, خاله, مولاه. None of them are people. Each needed a different solution: genealogy tables (76 entries), kunya merging, name extraction from comma-separated text. Abu Huraira was appearing as two different narrators because of an honorific suffix.

Abu Huraira lost his عبد

The most prolific narrator in Islamic history was listed as "الرحمن بن صخر." The عبد had been silently stripped from 13,189 names. عبد الملك became "الملك." عبد الله became "الله." Every major companion had a broken name. 80+ divine attributes catalogued to fix it.
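The repair can be sketched as a rule: when a divine attribute stands alone as a name component (at the start of the name or right after بن / ابن), the عبد was probably stripped and is put back. The attribute set below is a tiny illustrative subset, and the rule itself is an assumption about how such a fix could work, not the project's exact logic.

```python
# Illustrative subset only; the project catalogued 80+ divine attributes.
DIVINE_ATTRIBUTES = {"الله", "الرحمن", "الملك", "العزيز"}
PATRONYMIC_MARKERS = {"بن", "ابن"}

def restore_abd(name):
    """Re-insert عبد before a divine attribute that starts a name component."""
    out = []
    for tok in name.split():
        prev = out[-1] if out else None
        starts_component = prev is None or prev in PATRONYMIC_MARKERS
        if tok in DIVINE_ATTRIBUTES and starts_component:
            out.append("عبد")
        out.append(tok)
    return " ".join(out)

fixed = restore_abd("الرحمن بن صخر")
```

Names that already carry عبد are left untouched, because the attribute then follows عبد rather than starting a component.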

"بن ماجة" is not a father

1,388 profiles started with "بن ماجة." The parser was eating book source markers as narrator names. The disambiguation: بن ماجة as a book name is ALWAYS preceded by a sigla. Real patronymics are ALWAYS followed by another name. Zero false positives on 11,761 entries.

The 42% garbage names

After the full pipeline, a name quality audit revealed 42.4% of all profiles had issues. Transmission chains embedded in names. Entire biographical sentences masquerading as names. 15 of 22 parsers had minimal or no cleaning. We built a universal post-processing gate. Clean names went from 82.5% to 99.5%.

The 22 classical texts

The narrator database didn't arrive as a plan. It grew because each problem demanded a new source.

The core eight came first. Taqrib al-Tahdhib (Ibn Hajar's one-line grades), Tahdhib al-Tahdhib (his expanded assessments), Tahdhib al-Kamal (al-Mizzi's encyclopedia that both abbreviate), Mizan al-I'tidal (al-Dhahabi's criticized narrators), Al-Jarh wa al-Ta'dil (Ibn Abi Hatim, the earliest systematic critic), Al-Thiqat (Ibn Hibban's reliable ones), Al-Kamil fi Du'afa (Ibn 'Adi's weak narrators), Tarikh Baghdad (al-Khatib's Baghdad scholars). These gave us the foundation: 83,082 entries.

The companion problem brought the biographical encyclopedias. Al-Isaba fi Tamyiz al-Sahaba (Ibn Hajar's 10,000+ companion encyclopedia) was needed because our companion count was suspiciously low. Tabaqat Ibn Sa'd (the earliest biographical dictionary, 3rd century AH) and Siyar A'lam al-Nubala (al-Dhahabi's masterwork covering 800 years of scholars) filled the generational gaps. Companions went from ~1,663 to ~10,489.

The grade gaps demanded specialized works. Tarikh al-Islam (al-Dhahabi's universal history), Lisan al-Mizan (Ibn Hajar's expansion of Mizan), Al-Durar al-Kamina (Ibn Hajar's contemporaries), Al-Kashif (al-Dhahabi's condensed reference). Then the du'afa books to find weak narrators: Mughni fi al-Du'afa, Diwan al-Du'afa, Dhayl Diwan al-Du'afa. And the specialized lists: Tadhkirat al-Huffaz (hadith masters), Mu'jam al-Shuyukh (al-Dhahabi's personal teachers), Ma'rifat al-Qurra (Quran reciters), Mu'in fi Tabaqat al-Muhaddithin (generation list).

Then came the discovery that the book IS the grade. If a narrator is listed in "The Book of Weak Narrators," their grade is weak. If they're in "The Reliable Ones," they're reliable. If they're in the companion encyclopedia, they're a companion. This single insight recovered 19,633 grades in one stroke. The highest-yield fix in the entire project.
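The "book is the grade" rule can be sketched as a lookup that fills a missing grade from the kind of book a narrator is listed in. The grade labels, the profile fields, and the choice of which books imply which grade are assumptions for illustration.

```python
# Assumed mapping from source book to the grade its listing implies.
BOOK_IMPLIED_GRADE = {
    "Al-Thiqat": "reliable",
    "Al-Kamil fi Du'afa": "weak",
    "Mughni fi al-Du'afa": "weak",
    "Diwan al-Du'afa": "weak",
    "Al-Isaba fi Tamyiz al-Sahaba": "companion",
}

def fill_missing_grade(profile):
    """Return a copy of the profile with a grade inferred from its sources.

    Profiles that already have an explicit grade are returned unchanged.
    """
    if profile.get("grade"):
        return profile
    for book in profile.get("sources", []):
        implied = BOOK_IMPLIED_GRADE.get(book)
        if implied:
            return {**profile, "grade": implied, "grade_basis": f"listed in {book}"}
    return profile

ungraded = {"name": "example narrator", "grade": None, "sources": ["Al-Thiqat"]}
graded = fill_missing_grade(ungraded)
```

Recording `grade_basis` keeps the inferred grade traceable to the book that implied it, separate from grades stated explicitly by a critic.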

111,604 profiles and still not enough

By this point, 22 classical texts had been parsed, cleaned, deduplicated. 111,604 narrator profiles. 70.6% graded. 99.5% clean names. A confidence scoring system. Teacher-student chains from AR-Sanad. The database was real. But every external source we tried to cross-reference against was a dead end. Dorar.net blocks automation. Islamweb serves the same texts we already have. Shamela has Cloudflare. Sunnah.com needs an API key. There is no publicly accessible, structured narrator grading database anywhere on the internet.

Then we found Gawami al-Kalim

A 2.8GB Windows desktop app from 2010, archived on archive.org. Gawami al-Kalim 4.5, funded by Qatar's Directorate of Endowments. Abandoned software, no API, no documentation. Downloaded the RAR, extracted it, found Data/Rawy.zip (116 MB). Inside: .TBX files in a custom binary format. Record separator 0x01, tab-delimited, scattered control bytes.

Cracked it open: 49,845 narrator profiles (we initially thought 27,756, then discovered each record packed multiple narrators as 29-field blocks). 549,173 chain links. 38,355 isnad evaluations. About 4.18 million extended chain links. And in the Books.zip: 1,004 hadith collections with narrator L-tags embedded in the text, every narrator in every hadith pre-tagged with their GK ID. 26,759 hadiths across 6 canonical books with perfect narrator identification.
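A rough sketch of reading such a file, based only on the details above: records separated by 0x01, tab-delimited fields, stray control bytes, and 29 fields per narrator. The text encoding (Windows-1256 here) and the handling of leftover fields are assumptions, since the format is undocumented.

```python
FIELDS_PER_NARRATOR = 29  # each record packs narrators as 29-field blocks

def parse_tbx(raw, encoding="cp1256"):
    """Split raw .TBX bytes into one 29-field list per narrator.

    The encoding is an assumption. Control bytes other than tab are dropped,
    and trailing fields that do not fill a whole block are ignored.
    """
    narrators = []
    for record in raw.split(b"\x01"):
        cleaned = bytes(b for b in record if b == 0x09 or b >= 0x20)
        if not cleaned:
            continue
        fields = cleaned.decode(encoding, errors="replace").split("\t")
        last_start = len(fields) - FIELDS_PER_NARRATOR
        for start in range(0, last_start + 1, FIELDS_PER_NARRATOR):
            narrators.append(fields[start:start + FIELDS_PER_NARRATOR])
    return narrators

# Synthetic demo: one record holding two narrators, one holding a single narrator.
two = "\t".join(f"f{i}" for i in range(58)).encode("cp1256")
one = "\t".join(str(i) for i in range(29)).encode("cp1256")
rows = parse_tbx(two + b"\x01" + one)
```

Treating each record as a single narrator would undercount, which matches the jump from 27,756 to 49,845 once the 29-field blocks were recognized.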

Cross-referenced against our existing 111,604 profiles: 2,578 new grades recovered. Validated: 91.4% agreement on 5,062 overlapping narrators. GK didn't replace what we had built. It confirmed it, and filled gaps we couldn't reach from classical texts alone.

GK Data Available (too large for Git, available on request)

gk_json/gk_narrators.json (25 MB) 49,845 narrator profiles

gk_json/gk_chain_links_sanadrowah.json (110 MB) 549,173 chain links

gk_json/gk_chain_links_phase2.json (431 MB) 4,179,587 extended links

gk_json/gk_chain_evaluations.json (15 MB) 38,355 isnad evaluations

gk_json/gk_hadith_narrator_tags.json (1.5 MB) 26,759 hadiths with narrator L-tags

gk_json/gk_book_index.json (62 KB) 1,004 book names and IDs

gk_full_graph.json (30 MB) 72,339 narrators, 255k links

Total: ~731 MB of structured GK data converted from .TBX to JSON. Contact Ali Bin Shahid if you need access.

The grading engine: 50% to 77%

The entire narrator database was built for one purpose: grading hadiths. Take a hadith text, extract the isnad chain, look up each narrator, apply the weakest-link rule.
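The weakest-link rule itself is small: a chain can be no stronger than its weakest narrator. The three-level grade scale below is an assumption for the sketch; classical grading uses a much finer vocabulary.

```python
# Assumed ordering from weakest to strongest.
GRADE_RANK = {"weak": 0, "mostly_reliable": 1, "reliable": 2}

def grade_chain(narrator_grades):
    """Return the lowest grade in the chain, or "unknown" if any link is ungraded."""
    if not narrator_grades or any(g is None for g in narrator_grades):
        return "unknown"
    return min(narrator_grades, key=GRADE_RANK.__getitem__)

result = grade_chain(["reliable", "mostly_reliable", "reliable"])
```

Returning "unknown" for an ungraded narrator keeps a missing profile from being silently treated as reliable.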

First attempt: 50% accuracy against Albani's grades. Half wrong. The chain direction was reversed (isnad goes backward, position 0 heard from position 1). The wrong Malik was being selected (a companion instead of Imam Malik). The isnad extractor was deleting the يعني clarifications that hadith scholars put there specifically for disambiguation.

Then came four words that changed the architecture: "but we have AR-Sanad." AR-Sanad had bidirectional teacher-student chains (234k links each way). GK was the name/grade layer. AR-Sanad was the chain layer. The combined approach worked.

The comma revelation: "عبد العزيز، يعني ابن محمد" means "Abdul Aziz, MEANING Ibn Muhammad." Our extractor was throwing away the يعني text. Line 73 of the extractor literally stripped the disambiguation hints that scholars put there for exactly this purpose.
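Keeping the hint instead of deleting it might look like the following. The pattern and function name are assumptions for illustration; real isnad text has more variants than this single form.

```python
import re

# Assumed pattern: a name, a comma (Arabic or Latin), then يعني and the hint.
YANI = re.compile(r"^(?P<name>[^،,]+)[،,]\s*يعني\s+(?P<hint>.+)$")

def split_clarification(segment):
    """Split a chain segment into (name, hint); hint is None when absent."""
    segment = segment.strip()
    match = YANI.match(segment)
    if not match:
        return segment, None
    return match.group("name").strip(), match.group("hint").strip()

name, hint = split_clarification("عبد العزيز، يعني ابن محمد")
# name is the bare name, hint is the patronymic that narrows down which
# عبد العزيز is meant.
```

The hint can then be passed to narrator lookup as an extra filter rather than being lost before disambiguation starts.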

Fame as disambiguation: when scholars write just "مالك" without qualification, they mean the famous one. The single name IS the disambiguation. Prefer the narrator with the most connections.
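As a tie-breaker, "prefer the narrator with the most connections" reduces to a single comparison. The candidate IDs and link counts below are invented for the example.

```python
def pick_most_connected(candidates, link_counts):
    """Return the candidate ID with the most teacher-student links, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda nid: link_counts.get(nid, 0))

# Hypothetical IDs and counts: two narrators both named مالك.
link_counts = {"malik_imam": 950, "malik_other": 12}
chosen = pick_most_connected(["malik_other", "malik_imam"], link_counts)
```

This only applies when the text gives a bare name; any qualifier in the chain should take priority over fame.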

The Albani reverse loop: 40,000 graded hadiths implicitly tell you about their narrators. If a narrator appears in 191 sahih hadiths and only 12 da'if ones, they're demonstrably reliable. 587 wrong grades fixed from statistical evidence.
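The reverse loop can be sketched as a threshold on how a narrator's hadiths were graded. The minimum sample size and the 90% threshold are assumptions chosen for the example, not the values the project used.

```python
def infer_grade(sahih, daif, min_total=20, threshold=0.9):
    """Infer a narrator grade from counts of graded hadiths they appear in.

    Returns None when there is too little evidence or the evidence is mixed.
    """
    total = sahih + daif
    if total < min_total:
        return None
    share = sahih / total
    if share >= threshold:
        return "reliable"
    if share <= 1 - threshold:
        return "weak"
    return None

# The example from the text: 191 sahih and 12 da'if appearances.
inferred = infer_grade(191, 12)
```

Here 191 of 203 appearances (about 94%) are in sahih hadiths, which clears the assumed threshold. A narrator with only a handful of appearances stays ungraded.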

50% to 77% accuracy. Every percentage point earned through a different insight.

Then Bukhari changed everything

We tested on Bukhari. Undisputed sahih. Starting accuracy: 98.0%. Every "wrong" grade was a narrator our database had graded too low. Fixed 25 narrators, each traced to taqrib, GK, and AR-Sanad evidence. Result: 99.99%. The one remaining "error" is not an error at all. Taqrib says أسيد بن زيد is weak, and Bukhari used him in a mutaba'a (supported) role. Our engine found what Ibn Hajar documented 600 years ago.

Muslim: 98.5% to 99.88% after 15 fixes. Abu Dawud improved to 77.0% from shared narrator corrections.

The insight that changed the architecture: Bukhari and Muslim are calibration tools. If a narrator appears in their chains, they are at least mostly_reliable. Any narrator graded lower in our database is wrong. Their acceptance IS the evidence.
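Using the two collections as a calibration floor could be sketched like this. The grade scale and function name are assumptions, and the sketch ignores the supporting (mutaba'a) case, which a real implementation would need to handle as an exception.

```python
# Assumed ordering from weakest to strongest.
GRADE_RANK = {"weak": 0, "mostly_reliable": 1, "reliable": 2}

def apply_calibration_floor(grades, calibrated_narrators, floor="mostly_reliable"):
    """Raise any narrator found in the calibration chains to at least `floor`.

    Returns the corrected grades and the list of narrators that were raised,
    so each change can be reviewed against the source texts.
    """
    corrected = dict(grades)
    raised = []
    for narrator in calibrated_narrators:
        current = corrected.get(narrator)
        if current is None or GRADE_RANK[current] < GRADE_RANK[floor]:
            corrected[narrator] = floor
            raised.append(narrator)
    return corrected, raised

grades = {"a": "weak", "b": "reliable"}
corrected, raised = apply_calibration_floor(grades, ["a", "b", "c"])
```

Returning the list of raised narrators matters as much as the correction itself, since each one is a candidate for checking against taqrib rather than accepting blindly.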

Then we built 8 disambiguation rules from taqrib raw text. When scholars write just "سفيان" without qualification, which Sufyan do they mean? Taqrib tells you. "أبيه" (his father) resolved from the previous narrator's nasab. 623 dropped narrators restored. AR-Sanad audited: 99.99% accurate, 72 compressed chains, zero data errors. Publishable.

And the golden rule was established: go to the source text. Not the parsed data. Not the CSV. Not the pattern. The book.

What it became

112,221 hadiths across 18 books. 1,590 of 1,651 Quranic roots connected. 1,528,346 verified root links. 115,735 narrator profiles from 22 classical texts, 72.6% graded, 99.5% clean, 96% uniquely identified, 100% traceable. 217,762 name variants. 127,411 teacher-student links. 33,758 Arabic words with morphological definitions. 100,656 isnad chains parsed. A digital Wensinck concordance. Confidence scoring on every profile. Bukhari 99.99%, Muslim 99.88%, Abu Dawud 77%. 8 disambiguation rules sourced from taqrib. 59,365 hadiths graded by named scholars, about 53% of the corpus.

What took Wensinck's team 33 years, this computes in seconds. What took a scholar physically opening four volumes of biographical dictionaries, this consolidates into one search. What no one had ever done, connecting the Quran concordance to the hadith concordance through shared roots, is now a click away.

And it is free. And it will remain free. Because this is a sadaqah jariyah.

Open & free. This project is built as a sadaqah jariyah, ongoing charity through beneficial knowledge. Use it, share it, build upon it. If it helps your research, a citation is appreciated. Please do not place it behind a paywall. Let it remain accessible to anyone seeking to learn. If someone does profit from it without giving back, I trust that account will be settled by the One from whom nothing is hidden.

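مَن سَنَّ فِي الإِسْلَامِ سُنَّةً حَسَنَةً فَلَهُ أَجْرُهَا وَأَجْرُ مَنْ عَمِلَ بِهَا

Sahih Muslim 1017, "Whoever establishes a good practice in Islam will have its reward and the reward of those who act upon it"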

What comes next

Break past 77%

GK's Books.zip has isnad/matn boundary markers for 26,759 hadiths. Use these as training data to teach the engine where isnads end and matns begin. Use the L-tags as ground truth to find extraction bugs systematically. The goal: 85%+ accuracy.

1,004 hadith collections

Books.zip contains 1,004 hadith collections. We use 6. Each new collection feeds the reverse learning loop with more narrator evidence. More evidence means more grades, which means better grading, which means more evidence. The virtuous cycle.

The narrator's portrait

Every narrator has a story. The hadiths they narrate are the hadiths they carried through their lives, what they heard, what they memorized, what they chose to transmit. A narrator's collection IS their portrait: Abu Huraira's 5,000+ narrations paint a picture of a man who spent every moment near the Prophet ﷺ. Aisha's narrations reveal the private life no one else had access to. Ali's cluster in Ahmad reflects the Kufan transmission school. The next step is to build this portrait.

The problem that hasn't been solved

We searched for a curated universal narrator ID database. GitHub, islam-db, dorar.net, islamweb, shamela. The n007rehan dataset that everyone references doesn't exist. There is no shortcut. The hadith narrator database is a problem that hasn't been solved computationally by anyone. We are building it from scratch, one classical text at a time, one war story at a time.
