Itqan — Development History

v1.0 April 5, 2026

The Foundation

Started with raw hadith JSON from sunnah.com (49k hadiths across the six canonical books) and a basic reader. The breakthrough was running every Arabic word through CAMeL Tools, the open-source Arabic NLP toolkit from the CAMeL Lab at NYU Abu Dhabi, to extract its root (typically three letters). The next step was building the concordance: an inverted index mapping every word to every hadith containing it. This was the computational equivalent of what Fuad Abd al-Baqi did by hand for the Quran in 1945.
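As a rough illustration of the inverted index described above, here is a minimal sketch. It assumes whitespace tokenization and a plain `{hadith_id: text}` mapping; the IDs and the `build_concordance` name are hypothetical, and the real pipeline presumably normalizes and analyzes words before indexing.

```python
from collections import defaultdict


def build_concordance(hadiths):
    """Map each surface word to the sorted list of hadith IDs containing it."""
    index = defaultdict(set)
    for hadith_id, text in hadiths.items():
        # Assumption: words are separated by whitespace and already cleaned.
        for word in text.split():
            index[word].add(hadith_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Toy corpus with made-up IDs; real entries would come from the hadith JSON.
sample_hadiths = {
    "h1": "إنما الأعمال بالنيات",
    "h2": "إنما الأعمال بالنية",
}
concordance = build_concordance(sample_hadiths)
first_word, second_word, third_word = sample_hadiths["h1"].split()
```

Looking up `concordance[second_word]` returns both IDs, while `concordance[third_word]` returns only `h1`, because the two texts differ in their last word.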

49k hadiths

6 books

32,413 words defined

Key files created: word_defs_v2.json concordance.json roots_lexicon.json

v1.0.1 April 5–6

Musannaf + Root Bridge + Families

Added Musannaf Ibn Abi Shaybah (37,943 hadiths, not available on sunnah.com), pushing the corpus to 87k. Built the Quran–Hadith root bridge: for each of 1,651 Quranic roots, find every hadith containing a word from that root. Created 39 thematic families grouping roots by semantic field, following al-Raghib al-Isfahani's classical lexicography. Built the isnad parser, which extracts narrator chains from the Arabic text, and the chord visualizations showing how themes interconnect.
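The root bridge can be pictured as a join between a word-to-root dictionary and the concordance. The sketch below is only an illustration under that assumption: `build_root_bridge`, the argument shapes, and the toy data are hypothetical and not taken from the project's files.

```python
def build_root_bridge(quran_roots, word_to_root, concordance):
    """For each Quranic root, collect every hadith containing a word of that root."""
    # Invert word -> root into root -> [words].
    words_by_root = {}
    for word, root in word_to_root.items():
        words_by_root.setdefault(root, []).append(word)

    bridge = {}
    for root in quran_roots:
        hadith_ids = set()
        for word in words_by_root.get(root, []):
            hadith_ids.update(concordance.get(word, []))
        bridge[root] = sorted(hadith_ids)
    return bridge


# Toy inputs (hypothetical): two Quranic roots, one of which has no hadith match.
toy_roots = ["علم", "رحم"]
toy_word_to_root = {"العلم": "علم", "يعلم": "علم", "كتاب": "كتب"}
toy_concordance = {"العلم": ["h1", "h3"], "يعلم": ["h2", "h3"], "كتاب": ["h4"]}

bridge = build_root_bridge(toy_roots, toy_word_to_root, toy_concordance)
shared_roots = [root for root, ids in bridge.items() if ids]
```

Roots with an empty list are the "zero roots" that later versions investigate; the share of non-empty roots corresponds to the shared-roots percentage reported below.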

87k hadiths

18 books

384,016 root links

1,201 shared roots (73%)

Key files: quran_hadith_bridge.json family_corpus.json isnad_graph.json chord_matrices.json

v1.1 April 7

Quran bil-Quran + Live App

Built the Quran reader with interactive Arabic text: hover for meaning, click for a root panel with Mufradat al-Quran definitions. Added the reverse bridge: from any root in the Quran, jump to all connected hadiths. Added deep-link URLs for sharing specific hadiths and deployed to GitHub Pages. The chord viewer was rebuilt from scratch. The old Book×Family tab was useless (every book connected to every family proportionally), so it was replaced with Book Distinctiveness, which shows only over-represented connections.
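The text does not say how "over-represented" is measured. One common way to express it, shown here purely as an assumption, is the ratio of a book's observed link count for a family to the count expected if every book followed the corpus-wide family proportions. The `distinctiveness` function and the toy counts are hypothetical.

```python
def distinctiveness(counts):
    """Return {(book, family): observed/expected} for over-represented pairs only.

    counts[book][family] is the number of links between a book and a family.
    Expected counts assume each book mirrors the corpus-wide family shares.
    """
    grand_total = sum(sum(families.values()) for families in counts.values())
    family_totals = {}
    for families in counts.values():
        for family, n in families.items():
            family_totals[family] = family_totals.get(family, 0) + n

    over_represented = {}
    for book, families in counts.items():
        book_total = sum(families.values())
        for family, observed in families.items():
            expected = book_total * family_totals[family] / grand_total
            if expected > 0 and observed / expected > 1.0:
                over_represented[(book, family)] = observed / expected
    return over_represented


# Toy counts (hypothetical): book A leans toward family x, book B toward y.
toy_counts = {"A": {"x": 30, "y": 10}, "B": {"x": 30, "y": 30}}
scores = distinctiveness(toy_counts)
```

A purely proportional table yields a ratio of 1.0 everywhere and therefore an empty result, which matches the observation that the old Book×Family view carried no signal.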

Zenodo DOI minted

GitHub Pages live

v1.2 April 8

The Bridge Was Wrong

A user reported that Quran verse 19:85 connected to the wrong hadiths. Investigation revealed that idInBook restarts at 1 for every chapter in 14 of 18 books. The bridge was matching hadith #2 in ALL 97 Bukhari chapters instead of the correct one. Every Quran→Hadith filter had been producing false positives since v1.0. Fixed with chapter-aware IDs, verified 100% accuracy. Also fixed root word highlighting (isnad suppression was hiding matches) and built the How It Works guide page with SVG flow diagram.
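The bug and its fix can be reproduced in a few lines. This is a sketch, not the contents of rebuild_bridge_ids.py: `idInBook` comes from the text above, while the `book` and `chapter` field names and the helper functions are assumptions made for illustration.

```python
def naive_key(hadith):
    # Buggy key: idInBook restarts in every chapter, so this is not unique.
    return (hadith["book"], hadith["idInBook"])


def chapter_aware_key(hadith):
    # Fixed key: the chapter disambiguates the per-chapter counter.
    return (hadith["book"], hadith["chapter"], hadith["idInBook"])


def lookup(hadiths, key_fn, wanted):
    """Return the texts of all hadiths whose key equals `wanted`."""
    return [h["text"] for h in hadiths if key_fn(h) == wanted]


# Toy records (hypothetical): hadith #2 exists in three different chapters.
toy_hadiths = [
    {"book": "bukhari", "chapter": 1, "idInBook": 2, "text": "A"},
    {"book": "bukhari", "chapter": 2, "idInBook": 2, "text": "B"},
    {"book": "bukhari", "chapter": 3, "idInBook": 2, "text": "C"},
]

naive_hits = lookup(toy_hadiths, naive_key, ("bukhari", 2))
exact_hits = lookup(toy_hadiths, chapter_aware_key, ("bukhari", 2, 2))
```

The naive lookup returns one hit per chapter, which is the false-positive pattern the user noticed; the chapter-aware lookup returns exactly one.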

Bridge IDs rebuilt chapter-aware

0 false positives (verified)

Key files: bridge_ids/*.json (1,324 files) rebuild_bridge_ids.py

v1.3 April 9

Musnad Ahmad Expansion

Musnad Ahmad had only 1,374 hadiths from sunnah.com. The full book has 26,539. Found the complete Arnaut edition on OpenITI (the Open Islamicate Texts Initiative) and wrote a parser for their mARkdown format. Corpus jumped from 87k to 112k. Ahmad became the 2nd largest book. Rebuilt the FAISS semantic index (112k vectors) and uploaded to HuggingFace. Fixed both HF apps (repo_type bug + Gradio CSS compatibility).

26,539 Ahmad hadiths (was 1,374)

112,221 total hadiths

1,326,229 root links

Key files: parse_openiti_musnad.py app/data/sunni/ahmed/ (1,014 chapter files)

v1.4 April 9

Isnad Cleanup + Narrator Profiles

Discovered that “أبيه” (his father) was appearing as a top narrator in ALL 11 books. It's not a person; it's a relative reference. Built a 37-entry genealogy lookup to resolve father-son chains (e.g., هشام بن عروة → عروة بن الزبير). Fixed kunya splitting, where أبي صالح was being broken into two tokens. Merged AR-Sanad 280K (18,298 narrators) and hatemben (701 with full jarh wa ta'dil) into a unified database. Built the Rijal page, a searchable narrator browser.
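A minimal sketch of the father lookup follows. It assumes the chain is a list of narrator names in reading order and that "his father" refers to the narrator immediately before it; `resolve_relatives` and the single-entry map are illustrative stand-ins for isnad_father_map.json, whose actual structure is not shown here.

```python
RELATIVE_TERM = "أبيه"  # "his father"

# One entry from the example in the text; the real lookup has 37.
FATHER_MAP = {"هشام بن عروة": "عروة بن الزبير"}


def resolve_relatives(chain, father_map=FATHER_MAP):
    """Replace 'his father' with the known father of the preceding narrator."""
    resolved = []
    for name in chain:
        if name == RELATIVE_TERM and resolved:
            father = father_map.get(resolved[-1])
            # Keep the placeholder when the son is not in the lookup.
            resolved.append(father if father else name)
        else:
            resolved.append(name)
    return resolved


example_chain = ["مالك", "هشام بن عروة", RELATIVE_TERM, "عائشة"]
resolved_chain = resolve_relatives(example_chain)
```

Leaving unresolved placeholders in place, rather than dropping them, keeps the chain length intact so missing genealogy entries remain visible for later research.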

18,298 narrator profiles

72,767 name variants

Isnad grading: 26% → 61%

Key files: narrator_unified.json rijal.html isnad_father_map.json isnad_kunya_map.json

v1.5 April 9

8 Classical Rijal Texts Parsed

Downloaded and parsed 8 foundational texts of ilm al-rijal from OpenITI: Tahdhib al-Kamal, Mizan al-I'tidal, Al-Jarh wa al-Ta'dil, Al-Thiqat, Al-Kamil fi Du'afa, Tarikh Baghdad, Tahdhib al-Tahdhib, and Taqrib al-Tahdhib. Together they hold 83,082 entries across 82.6 MB of Classical Arabic. Fuzzy name matching merged them into the narrator database, creating the largest structured open-source narrator database available.
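The matching method itself is not described, so the following is only one plausible shape for it: normalize orthographic variation, then compare with a similarity ratio from the standard library. The normalization rules, the 0.9 threshold, and both function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Short vowels and related marks (U+064B..U+0652), dagger alef, and tatweel.
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Hamzated alefs -> bare alef, alef maqsura -> ya, ta marbuta -> ha.
_LETTERS = str.maketrans(
    "\u0623\u0625\u0622\u0649\u0629",
    "\u0627\u0627\u0627\u064A\u0647",
)


def normalize_name(name):
    """Strip diacritics, unify common letter variants, collapse whitespace."""
    name = _MARKS.sub("", name).translate(_LETTERS)
    return " ".join(name.split())


def is_probable_match(a, b, threshold=0.9):
    """True when two normalized names are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
    return ratio >= threshold


spaced = "عبد الله بن الحارث"
joined = "عبدالله بن الحارث"
same_spelling_family = is_probable_match(spaced, joined)
```

A high name similarity only nominates a candidate pair. It cannot confirm identity on its own, since many distinct narrators share a name, so a real merge would need further evidence such as sources or dates.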

65,391 narrator profiles (was 18,298)

119,860 name variants

83,082 classical entries parsed

31,822 source cross-references

Key files: download_openiti_rijal.py parse_openiti_rijal.py merge_classical_rijal.py

v1.6 April 9

The Dual-Stemmer Breakthrough

315 Quranic roots had zero hadith connections, not because the words don't exist in hadiths, but because CAMeL Tools canonicalizes roots differently from the Quranic Arabic Corpus. The Wensinck concordance was built independently as a digital recreation of the 33-year physical concordance. It was only after building it that we realized it solved the root gap: Wensinck's light stemmer, using a completely different method, found hadith attestations for 254 of the 315 “missing” roots.

1,345 surface forms discovered by the light stemmer were patched into the morphological dictionary. Two stemmers now cross-validate each other: where both agree a root has no hadith match, we have the first empirical evidence that it's genuinely absent, not a tooling failure.
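The cross-validation reduces to set arithmetic over the roots each stemmer finds. The sketch below uses toy root labels and a hypothetical `cross_validate` helper; the final two lines simply recompute the coverage figures from the counts given in this history (1,651 Quranic roots, 315 and then 61 with no match).

```python
def cross_validate(quran_roots, morph_hits, light_hits):
    """Compare which Quranic roots each stemmer attests in the hadith corpus."""
    roots = set(quran_roots)
    morph_found = roots & set(morph_hits)
    light_found = roots & set(light_hits)
    return {
        # Missed by the morphological analyzer, found by the light stemmer.
        "recovered": sorted((roots - morph_found) & light_found),
        # Missed by both: candidates for genuine absence.
        "absent_in_both": sorted(roots - morph_found - light_found),
        "coverage": len(morph_found | light_found) / len(roots),
    }


report = cross_validate(
    quran_roots=["r1", "r2", "r3", "r4"],  # toy labels, not real roots
    morph_hits=["r1", "r2"],
    light_hits=["r2", "r3"],
)

# Coverage recomputed from the counts reported in this history.
coverage_before = round((1651 - 315) / 1651 * 100, 1)
coverage_after = round((1651 - 61) / 1651 * 100, 1)
```

Roots in `absent_in_both` are the ones where agreement between two independent methods makes a tooling failure less likely.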

Root coverage: 81% → 96.3%

Root links: 1.33M → 1.53M

Zero roots: 315 → 61

1,345 new word forms patched

Key files: wensinck.json (1,486 roots, 1,042,279 refs) root_alias_map.json (196 aliases)

v1.6.1 April 9, 2026

Per-Hadith Grading + Ghost Narrators

112,000 hadiths and not a single one was graded. Bukhari and Muslim are self-authenticated as Sahih, so those were tagged directly. Then Al-Albani's grades were scraped from sunnah.com for four more books. Then Arnaut's tahqiq edition of Musnad Ahmad was found as a DOCX file on Archive.org, and all 26,539 grades were parsed from it. Riyad al-Salihin grades were extracted from inline Arabic on web pages. In total, 59,365 hadiths were graded, 52% of the corpus.

Meanwhile, the isnad visualizer had ghost narrators everywhere. "أبيه" (his father) was a top narrator in ALL 11 books. Then أبي, جده, عمه, أمه, خاله, مولاه. Each one needed a different solution: 37 father-son pairs, 15 mother entries, 6 grandmother entries, 10 uncle entries. And when the relative term was followed by a name ("عن عمه، واسع بن حبان"), the parser had to extract the real name from the text. 76 genealogy entries, each researched from classical rijal sources.

59,365 hadiths graded (52%)

76 genealogy lookups

9 books graded

v1.8 April 9-10, 2026

22 Classical Texts + Teacher-Student Network

7 more classical texts parsed from OpenITI: Tabaqat Ibn Sa'd, Siyar A'lam al-Nubala, Al-Isaba fi Tamyiz al-Sahaba, Tarikh al-Islam, Lisan al-Mizan, Al-Durar al-Kamina, and Al-Kashif. 22 texts total, 152,000+ entries of classical Arabic biographical text.

The narrator database exploded from 65,391 to 111,604 profiles. 178,859 name variants. 41,131 death years (up from 9,141). Companions went from ~1,663 to ~10,489 because Al-Isaba is specifically a companion encyclopedia.

Teacher-student links from AR-Sanad: 127,411 bidirectional connections mapping who taught whom across 18,258 narrators. Global IDs from muslimscholars.info for 12,492 profiles.

Then came deduplication. 6,394 merges across two layers. But the really hard problem was the false positives: 50 companion merges that looked correct but were actually different people with the same name. "عبد الله بن الحارث" isn't one person. It's dozens. Arabic biographical literature's oldest problem, now in code.

111,604 narrator profiles

22 classical texts

127k teacher-student links

6,394 dedup merges

Wicked Problems

The problems that made me laugh and cry

The 42% that found a home. When we merged 21,465 entries from 3 new classical texts (Tabaqat, Siyar, Al-Isaba), 8,913 (42%) matched existing profiles. But the 58% that didn't match wasn't all new people. Many were the same narrators with slightly different spellings across texts, and we couldn't merge them safely.

Death years locked in prose. 98% of new profiles lack death years in structured form. The years ARE there, written as "مات سنة ست وعشرين ومائة" (died in the year one hundred and twenty-six). Our original regex only caught the numeric form, "سنة 126", which left thousands of temporal anchors locked behind Arabic word-form numbers. We built a word-form parser that handles 21 different patterns.
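To make the word-form problem concrete, here is a deliberately small parser. It is not the project's 21-pattern parser: it assumes years are written as whitespace-separated number words, optionally prefixed with the conjunction "و", whose values simply add up, and it covers only the spellings listed in its table. `parse_death_year` is a hypothetical name.

```python
import re

# Partial table of number words (assumption: values are purely additive).
VALUES = {
    "إحدى": 1, "اثنتين": 2, "ثلاث": 3, "أربع": 4, "خمس": 5,
    "ست": 6, "سبع": 7, "ثمان": 8, "تسع": 9,
    "عشرة": 10, "عشرين": 20, "ثلاثين": 30, "أربعين": 40, "خمسين": 50,
    "ستين": 60, "سبعين": 70, "ثمانين": 80, "تسعين": 90,
    "مائة": 100, "مئة": 100, "مائتين": 200, "مئتين": 200,
}


def word_value(token):
    """Value of one number word, allowing a leading 'و' (and)."""
    if token in VALUES:
        return VALUES[token]
    if token.startswith("و") and token[1:] in VALUES:
        return VALUES[token[1:]]
    return None


def parse_death_year(text):
    """Extract the year following 'سنة', in digits or in number words."""
    match = re.search(r"سنة\s+(\S.*)", text)
    if not match:
        return None
    tokens = match.group(1).split()
    if tokens[0].isdecimal():
        return int(tokens[0])
    total, consumed = 0, 0
    for token in tokens:
        value = word_value(token)
        if value is None:
            break  # stop at the first word that is not a number
        total += value
        consumed += 1
    return total if consumed else None


year_from_words = parse_death_year("مات سنة ست وعشرين ومائة")
year_from_digits = parse_death_year("مات سنة 126")
```

Both calls resolve to the same year, 126, which is the point of the exercise: the prose form and the numeric form become interchangeable anchors.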

The collision that wasn't. "أنس بن مالك" (Anas ibn Malik) exists as TWO separate profiles: ID 8 (the famous companion, 13+ name variants, 5 classical sources) and ID 17074 (a different person with mizan+jarh+tabaqat sources). Common names create phantom duplicates that look like real merges.

Teacher-student can't help (yet). 127,411 bidirectional teacher-student links. The strategy: if two profiles share 5+ teachers, they're probably the same person. Found 13,928 candidate pairs. Applied name filtering. Result: ZERO dedup matches. Why? The teacher-student data only exists for the original 18k AR-Sanad profiles. The 100k new profiles from classical texts don't have it. Can't intersect what isn't there.
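The shared-teacher heuristic, and the reason it came back empty, can be sketched as follows. The 5-teacher threshold is from the text; the function name, the data shape (profile ID mapped to a set of teacher IDs), and the toy profiles are assumptions.

```python
from itertools import combinations


def shared_teacher_candidates(teachers_by_profile, min_shared=5):
    """Return (profile_a, profile_b, n_shared) for pairs sharing enough teachers."""
    pairs = []
    for a, b in combinations(sorted(teachers_by_profile), 2):
        shared = teachers_by_profile[a] & teachers_by_profile[b]
        if len(shared) >= min_shared:
            pairs.append((a, b, len(shared)))
    return pairs


# Toy data (hypothetical IDs). "new_1" stands for a profile from the classical
# texts: it has no teacher data, so it can never intersect with anything.
toy_teachers = {
    "old_1": {"t1", "t2", "t3", "t4", "t5", "t6"},
    "old_2": {"t1", "t2", "t3", "t4", "t5"},
    "old_3": {"t1", "t9"},
    "new_1": set(),
}
candidates = shared_teacher_candidates(toy_teachers)
```

The pairwise loop is quadratic and is only meant to show the logic; at the scale of tens of thousands of profiles an inverted teacher-to-students index would be the more practical route.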

The ceiling. After three dedup strategies (companion anchoring, death-year clustering, teacher-student intersection), we found exactly 2 safe merges out of 77,794 profiles. The algorithmic approach hit its ceiling. The real solution isn't better code. It's better data: curated IDs from scholars who already solved this problem by hand.

v1.9 April 10-11, 2026

Gawami al-Kalim + Grading Engine (77%)

Found a 2.8GB Windows desktop app (Gawami al-Kalim 4.5, Qatar Endowments, 2010) archived on archive.org. Cracked 10 custom .TBX binary files to extract 49,845 narrator profiles, 549k chain links, 38k isnad evaluations, 4.1M extended links. All converted to JSON (731 MB). Discovered Books.zip: 1,004 hadith collections with narrator L-tags (GK IDs embedded in text).

Built a hadith grading engine using the weakest-link rule. Starting accuracy: 50%. Fixed chain direction (isnads go backward). Integrated AR-Sanad's bidirectional chains (234k links). Fame-based disambiguation (famous narrators need no qualifier). The comma revelation: يعني clarifications are disambiguation hints, not noise. Albani reverse loop: 40k graded hadiths upgrade 587 narrator grades from statistical evidence. HokmText: 312 more fixes.
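In its simplest form, the weakest-link rule says a chain is only as strong as its least reliable narrator. The sketch below assumes a four-level scale; only `mostly_reliable` appears in this history, and the other labels, the ranking, and `grade_chain` are hypothetical. None of the later refinements (chain direction, disambiguation, reverse loop) are modeled.

```python
# Assumed ordering from strongest to weakest; labels are illustrative.
GRADE_RANK = {"reliable": 3, "mostly_reliable": 2, "weak": 1, "rejected": 0}


def grade_chain(chain, narrator_grades):
    """Grade a chain by its weakest narrator; 'unknown' if any grade is missing."""
    grades = [narrator_grades.get(name) for name in chain]
    if not grades or None in grades:
        return "unknown"
    return min(grades, key=GRADE_RANK.__getitem__)


toy_grades = {"n1": "reliable", "n2": "mostly_reliable", "n3": "weak"}
strong_chain_grade = grade_chain(["n1", "n2"], toy_grades)
weak_chain_grade = grade_chain(["n1", "n3", "n2"], toy_grades)
```

Returning "unknown" when any narrator is ungraded is a conservative choice made for this sketch: under a weakest-link rule, a single misidentified or unresolved narrator is enough to change the result, which is why the disambiguation fixes above moved accuracy so much.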

Final accuracy: 77% vs Albani. 115,112 profiles, 72.6% graded.

115,112 narrator profiles

77% grading accuracy

49,845 GK narrators mapped

731 MB GK data converted

v1.20 April 11, 2026

Bukhari 99.99% — The Calibration Breakthrough

Tested on Bukhari (undisputed sahih). Starting accuracy: 98.0%. Fixed 25 narrators, each traced to taqrib/GK/AR-Sanad evidence. Result: 99.99%. The 1 remaining “error” (أسيد بن زيد) is CORRECT: taqrib says weak, and Bukhari used him in a mutaba’a (supporting) role. Our engine found what Ibn Hajar documented.

Muslim: 98.5% → 99.88% after 15 fixes. Abu Dawud improved to 77.0% from shared narrator fixes.

Key insight: Bukhari and Muslim are calibration tools. If a narrator is in their chains, they’re at least mostly_reliable. Any narrator graded lower in our database is wrong — Bukhari’s acceptance IS the evidence.
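Used as a calibration check, this insight amounts to scanning the sahih chains for narrators whose stored grade falls below the floor. The sketch assumes the same illustrative grade scale as above (only `mostly_reliable` is from the text) and a hypothetical `calibration_flags` helper. It flags narrators for review rather than upgrading them automatically, since a case like the mutaba’a narrator mentioned earlier can be a legitimate exception.

```python
# Assumed ordering; only "mostly_reliable" is a label taken from the text.
GRADE_RANK = {"reliable": 3, "mostly_reliable": 2, "weak": 1, "rejected": 0}


def calibration_flags(sahih_chains, narrator_grades, floor="mostly_reliable"):
    """List narrators in sahih chains whose grade is missing or below the floor."""
    flagged = set()
    for chain in sahih_chains:
        for narrator in chain:
            grade = narrator_grades.get(narrator)
            if grade is None or GRADE_RANK[grade] < GRADE_RANK[floor]:
                flagged.add(narrator)
    return sorted(flagged)


toy_chains = [["n1", "n2"], ["n2", "n3"], ["n4"]]
toy_grades = {"n1": "reliable", "n2": "mostly_reliable", "n3": "weak"}
needs_review = calibration_flags(toy_chains, toy_grades)
```

Each flagged name then goes back to the source text for a decision, in line with the rule of checking the source rather than the parsed data.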

Built 8-name disambiguation rule table from taqrib raw text (سفيان, الحسن, نافع, عكرمة, قتادة, شعبة, جابر, أبيه). Resolved أبيه (his father) from previous narrator’s nasab. 623 dropped AR-Sanad narrators restored. AR-Sanad audited: 99.99% accurate (72 compressed chains, 0 data errors — publishable). The golden rule established: go to the source text, not the parsed data, not the pattern.

115,735 narrator profiles

99.99% Bukhari

99.88% Muslim

77% Abu Dawud


Current Scale (v1.20)