Case Study Details:
About the Client
For the university’s information systems division, maintaining a single, accurate, institution-wide student identity had become increasingly difficult. Over 1.6 million student records were spread across multiple academic, enrollment, and administrative platforms. As integrations expanded, so did duplicate student profiles, often caused by inconsistent data entry, format variations, or incomplete records.
Challenges
- Lack of a standardized enterprise data platform, leading to inconsistent data structures and unreliable reporting
- Lack of a true Medallion (Bronze–Silver–Gold) architecture, resulting in poor data organization and limited scalability
- Absence of formal data quality controls, increasing the risk of inaccurate and incomplete analytics
- Unified all enterprise data in OneLake with a true Medallion architecture
- Enhanced enterprise-wide trust in analytics
Solutions
- Configurable SQL pipelines generate potential duplicate clusters using flexible attribute matching rules, enabling preliminary grouping before deeper analysis.
- Large Language Models evaluate names, emails, phone numbers, and addresses to identify semantic similarity beyond exact matches, resolving ambiguity caused by data variations or partial inputs.
- Deterministic match signals and LLM similarity outputs feed into a unified scoring engine, producing confidence scores to rank potential duplicate records.
- An Azure-based architecture (Azure Data Factory, Logic Apps, Azure Functions, and Cosmos DB) enables incremental, automated, and scalable processing of millions of student records.
- High-confidence matches are auto-classified, while borderline clusters are routed to administrative reviewers for precise and controlled merging.
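The flow described in the bullets above (candidate grouping, similarity evaluation, unified scoring, and routing) can be sketched roughly as follows. This is a minimal illustration in Python, not the client's implementation. The record fields, the birth-date blocking rule, the weights, and the thresholds are all assumptions made for the example. The standard library's `difflib.SequenceMatcher` stands in for the LLM similarity step so the sketch runs without any external service.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical weights and thresholds; a real deployment would tune these.
WEIGHTS = {"email": 0.4, "phone": 0.2, "name": 0.4}
AUTO_MERGE_AT = 0.85
REVIEW_AT = 0.60


@dataclass(frozen=True)
class Student:
    record_id: str
    name: str
    email: str
    phone: str
    birth_date: str  # ISO date string, used here as the (assumed) blocking key


def digits_only(phone: str) -> str:
    """Strip formatting so '(555) 010-1234' and '555-010-1234' compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def name_similarity(a: str, b: str) -> float:
    """Stand-in for the LLM similarity step: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(records):
    """Preliminary grouping: only compare records that share a birth date."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.birth_date].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def score_pair(a: Student, b: Student) -> float:
    """Combine deterministic signals and the fuzzy signal into one score."""
    email_a, phone_a = a.email.lower(), digits_only(a.phone)
    # Empty values never count as a match.
    email_match = float(bool(email_a) and email_a == b.email.lower())
    phone_match = float(bool(phone_a) and phone_a == digits_only(b.phone))
    return (
        WEIGHTS["email"] * email_match
        + WEIGHTS["phone"] * phone_match
        + WEIGHTS["name"] * name_similarity(a.name, b.name)
    )


def classify(score: float) -> str:
    """Route a pair: auto-classify, send to a reviewer, or treat as distinct."""
    if score >= AUTO_MERGE_AT:
        return "auto_merge"
    if score >= REVIEW_AT:
        return "manual_review"
    return "distinct"


def rank_candidates(records):
    """Return (score, id_a, id_b, decision) tuples, highest confidence first."""
    scored = [
        (score_pair(a, b), a.record_id, b.record_id)
        for a, b in candidate_pairs(records)
    ]
    scored.sort(reverse=True)
    return [(s, i, j, classify(s)) for s, i, j in scored]


# Made-up sample records for illustration only.
records = [
    Student("S1", "Maria Gonzalez", "maria.g@uni.edu", "(555) 010-1234", "2001-04-12"),
    Student("S2", "Maria Gonzales", "Maria.G@uni.edu", "555-010-1234", "2001-04-12"),
    Student("S3", "M. Gonzalez", "maria.g@uni.edu", "555-010-0000", "2001-04-12"),
    Student("S4", "Mario Gonzalo", "mgonzalo@uni.edu", "555-010-9999", "2001-04-12"),
    Student("S5", "Maria Gonzalez", "maria.g@uni.edu", "(555) 010-1234", "1999-01-01"),
]

for score, id_a, id_b, decision in rank_candidates(records):
    print(f"{id_a} vs {id_b}: {score:.2f} -> {decision}")
```

Keeping the blocking rule, the weights, and the thresholds as separate, named values mirrors the configurable design described above: the matching rules can change without touching the scoring or routing logic. In this sketch, record S5 is never compared with the others because it falls in a different block, which shows the trade-off of any blocking rule: it cuts the number of comparisons but can miss duplicates whose blocking attribute was entered incorrectly.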
Results
- High-quality, reliable data through Delta Lake and automated validation
- Faster, data-driven decision-making across the enterprise
- Scalable support for future system integrations
- Lower long-term cost and effort for data quality management