The Foundation
Started with raw hadith JSON from sunnah.com (49k hadiths across the six canonical books) and
a basic reader. The breakthrough was running every Arabic word through CAMeL Tools, the Cairo
Arabic Morphological Analyzer, to extract three-letter roots. Then building the concordance:
an inverted index mapping every word to every hadith containing it. This was the computational
equivalent of what Fuad Abd al-Baqi did by hand for the Quran in 1945.
49k hadiths
6 books
32,413 words defined
Key files created: word_defs_v2.json concordance.json roots_lexicon.json
Musannaf + Root Bridge + Families
Added Musannaf Ibn Abi Shaybah (37,943 hadiths, not available on Sunnah.com), pushing the
corpus to 87k. Built the Quran–Hadith root bridge: for each of 1,651 Quranic roots, find
every hadith containing a word from that root. Created 39 thematic families grouping roots by
semantic field from al-Raghib al-Isfahani's classical lexicography. Built the isnad parser extracting
narrator chains from Arabic text, and the chord visualizations showing how themes interconnect.
87k hadiths
18 books
384,016 root links
1,201 shared roots (73%)
Key files: quran_hadith_bridge.json family_corpus.json isnad_graph.json chord_matrices.json
Quran bil-Quran + Live App
Built the Quran reader with interactive Arabic text: hover for meaning, click for root panel
with Mufradat al-Quran definitions. The reverse bridge: from any root in the Quran, jump to all
connected hadiths. Deep-link URLs for sharing specific hadiths. GitHub Pages deployment.
The chord viewer was rebuilt from scratch. The old Book×Family tab was useless
(every book connected to every family proportionally), replaced with Book Distinctiveness
showing only over-represented connections.
Zenodo DOI minted
GitHub Pages live
The Bridge Was Wrong
A user reported that Quran verse 19:85 connected to the wrong hadiths. Investigation revealed
that idInBook restarts at 1 for every chapter in 14 of 18 books. The bridge was matching
hadith #2 in ALL 97 Bukhari chapters instead of the correct one. Every Quran→Hadith filter
had been producing false positives since v1.0. Fixed with chapter-aware IDs, verified 100% accuracy.
Also fixed root word highlighting (isnad suppression was hiding matches) and built the How It Works
guide page with SVG flow diagram.
Bridge IDs rebuilt chapter-aware
0 false positives (verified)
Key files: bridge_ids/*.json (1,324 files) rebuild_bridge_ids.py
Musnad Ahmad Expansion
Musnad Ahmad had only 1,374 hadiths from sunnah.com. The full book has 26,539.
Found the complete Arnaut edition on OpenITI (the Open Islamicate Texts Initiative) and wrote
a parser for their mARkdown format. Corpus jumped from 87k to 112k. Ahmad became the 2nd largest
book. Rebuilt the FAISS semantic index (112k vectors) and uploaded to HuggingFace. Fixed both
HF apps (repo_type bug + Gradio CSS compatibility).
26,539 Ahmad hadiths (was 1,374)
112,221 total hadiths
1,326,229 root links
Key files: parse_openiti_musnad.py app/data/sunni/ahmed/ (1,014 chapter files)
Isnad Cleanup + Narrator Profiles
Discovered that “أبيه” (his father) was appearing as a top narrator in ALL 11 books. It's not
a person, it's a relative reference. Built a 37-entry genealogy lookup to resolve father-son chains
(e.g., هشام بن عروة → عروة بن الزبير). Fixed kunya splitting where أبي صالح was being
broken into two tokens. Merged AR-Sanad 280K (18,298 narrators) and hatemben (701 with full jarh
wa ta'dil) into a unified database. Built the Rijal page, a searchable narrator browser.
18,298 narrator profiles
72,767 name variants
Isnad grading: 26% → 61%
Key files: narrator_unified.json rijal.html isnad_father_map.json isnad_kunya_map.json
8 Classical Rijal Texts Parsed
Downloaded and parsed 8 foundational texts of ilm al-rijal from OpenITI. Tahdhib al-Kamal,
Mizan al-I'tidal, Al-Jarh wa al-Ta'dil, Al-Thiqat, Al-Kamil fi Du'afa, Tarikh Baghdad,
Tahdhib al-Tahdhib, and Taqrib al-Tahdhib. 83,082 entries across 82.6 MB of Classical Arabic.
Fuzzy name matching merged them into the narrator database, creating the largest structured
open-source narrator database available.
65,391 narrator profiles (was 18,298)
119,860 name variants
83,082 classical entries parsed
31,822 source cross-references
Key files: download_openiti_rijal.py parse_openiti_rijal.py merge_classical_rijal.py
The Dual-Stemmer Breakthrough
315 Quranic roots had zero hadith connections, not because the words don't exist in hadiths,
but because CAMeL Tools canonicalizes roots differently from the Quranic Arabic Corpus. The Wensinck
concordance was built independently as a digital recreation of the 33-year physical concordance.
It was only after building it that we realized it solved the root gap: Wensinck's light stemmer,
using a completely different method, found hadith attestations for 254 of the 315 “missing” roots.
1,345 surface forms discovered by the light stemmer were patched into the morphological dictionary.
Two stemmers now cross-validate each other: where both agree a root has no hadith match,
we have the first empirical proof that it's genuinely absent. not a tooling failure.
Root coverage: 81% → 96.3%
Root links: 1.33M → 1.53M
Zero roots: 315 → 61
1,345 new word forms patched
Key files: wensinck.json (1,486 roots, 1,042,279 refs) root_alias_map.json (196 aliases)
Per-Hadith Grading + Ghost Narrators
112,000 hadiths and not a single one was graded. Bukhari and Muslim are self-authenticated
as Sahih, so those were tagged directly. Then Al-Albani's grades were scraped from sunnah.com
for four more books. Then Arnaut's tahqiq edition of Musnad Ahmad was found as a DOCX file
on Archive.org, and all 26,539 grades were parsed from it. Riyad al-Salihin grades extracted
from inline Arabic on web pages. 59,365 hadiths graded, 52% of the corpus.
Meanwhile, the isnad visualizer had ghost narrators everywhere. "أبيه" (his father) was a
top narrator in ALL 11 books. Then أبي, جده, عمه, أمه, خاله, مولاه. Each one needed a
different solution: 37 father-son pairs, 15 mother entries, 6 grandmother entries, 10 uncle
entries. And when the relative term was followed by a name ("عن عمه، واسع بن حبان"),
the parser had to extract the real name from the text. 76 genealogy entries, each researched
from classical rijal sources.
59,365 hadiths graded (52%)
76 genealogy lookups
9 books graded
22 Classical Texts + Teacher-Student Network
7 more classical texts parsed from OpenITI: Tabaqat Ibn Sa'd, Siyar A'lam al-Nubala,
Al-Isaba fi Tamyiz al-Sahaba, Tarikh al-Islam, Lisan al-Mizan, Al-Durar al-Kamina,
and Al-Kashif. 22 texts total, 152,000+ entries of classical Arabic biographical text.
The narrator database exploded from 65,391 to 111,604 profiles. 178,859 name variants.
41,131 death years (up from 9,141). Companions went from ~1,663 to ~10,489 because
Al-Isaba is specifically a companion encyclopedia.
Teacher-student links from AR-Sanad: 127,411 bidirectional connections mapping who taught
whom across 18,258 narrators. Global IDs from muslimscholars.info for 12,492 profiles.
Then came deduplication. 6,394 merges across two layers. But the really hard problem was
the false positives: 50 companion merges that looked correct but were actually different
people with the same name. "عبد الله بن الحارث" isn't one person. It's dozens. Arabic
biographical literature's oldest problem, now in code.
111,604 narrator profiles
22 classical texts
127k teacher-student links
6,394 dedup merges
The problems that made me laugh and cry
The 42% that found a home.
When we merged 21,465 entries from 3 new classical texts (Tabaqat, Siyar, Al-Isaba),
8,913 (42%) matched existing profiles. But the 58% that didn't match wasn't all new
people. Many were the same narrators with slightly different spellings across texts,
and we couldn't merge them safely.
Death years locked in prose.
98% of new profiles lack death years in structured form. The years ARE there, written
as "مات سنة ست وعشرين ومائة" (died in the year one hundred and twenty-six). Our regex
only catches numeric "سنة 126". Thousands of temporal anchors, locked behind Arabic
word-form numbers. We built a word-form parser that handles 21 different patterns.
The collision that wasn't.
"أنس بن مالك" (Anas ibn Malik) exists as TWO separate profiles: ID 8 (the famous
companion, 13+ name variants, 5 classical sources) and ID 17074 (a different person
with mizan+jarh+tabaqat sources). Common names create phantom duplicates that look
like real merges.
Teacher-student can't help (yet).
127,411 bidirectional teacher-student links. The strategy: if two profiles share 5+
teachers, they're probably the same person. Found 13,928 candidate pairs. Applied
name filtering. Result: ZERO dedup matches. Why? The teacher-student data only exists
for the original 18k AR-Sanad profiles. The 100k new profiles from classical texts
don't have it. Can't intersect what isn't there.
The ceiling.
After three dedup strategies (companion anchoring, death-year clustering, teacher-student
intersection), we found exactly 2 safe merges out of 77,794 profiles. The algorithmic
approach hit its ceiling. The real solution isn't better code. It's better data: curated
IDs from scholars who already solved this problem by hand.
Gawami al-Kalim + Grading Engine (77%)
Found a 2.8GB Windows desktop app (Gawami al-Kalim 4.5, Qatar Endowments, 2010) archived
on archive.org. Cracked 10 custom .TBX binary files to extract 49,845 narrator profiles,
549k chain links, 38k isnad evaluations, 4.1M extended links. All converted to JSON (731 MB).
Discovered Books.zip: 1,004 hadith collections with narrator L-tags (GK IDs embedded in text).
Built a hadith grading engine using the weakest-link rule. Starting accuracy: 50%.
Fixed chain direction (isnads go backward). Integrated AR-Sanad's bidirectional chains (234k links).
Fame-based disambiguation (famous narrators need no qualifier). The comma revelation:
يعني clarifications are disambiguation hints, not noise. Albani reverse loop: 40k graded
hadiths upgrade 587 narrator grades from statistical evidence. HokmText: 312 more fixes.
Final accuracy: 77% vs Albani. 115,112 profiles, 72.6% graded.
115,112 narrator profiles
77% grading accuracy
49,845 GK narrators mapped
731 MB GK data converted
Bukhari 99.99% — The Calibration Breakthrough
Tested on Bukhari (undisputed sahih). Starting accuracy: 98.0%. Fixed 25 narrators,
each traced to taqrib/GK/AR-Sanad evidence. Result: 99.99%. The 1 remaining
“error” (أسيد بن زيد) is CORRECT — taqrib says weak, Bukhari used him in
mutaba’a (supported) role. Our engine found what Ibn Hajar documented.
Muslim: 98.5% → 99.88% after 15 fixes. Abu Dawud improved to 77.0% from
shared narrator fixes.
Key insight: Bukhari and Muslim are calibration tools. If a narrator is in their
chains, they’re at least mostly_reliable. Any narrator graded lower in our database
is wrong — Bukhari’s acceptance IS the evidence.
Built 8-name disambiguation rule table from taqrib raw text (سفيان, الحسن, نافع, عكرمة,
قتادة, شعبة, جابر, أبيه). Resolved أبيه (his father) from previous narrator’s nasab.
623 dropped AR-Sanad narrators restored. AR-Sanad audited: 99.99% accurate (72 compressed
chains, 0 data errors — publishable). The golden rule established: go to the source
text, not the parsed data, not the pattern.
115,735 narrator profiles
99.99% Bukhari
99.88% Muslim
77% Abu Dawud