Graph-Based Retrieval-Augmented Generation for SQL Querying of Longitudinal Databases
Many scientific institutions rely on large, historically grown databases that span multiple decades, are distributed across several servers and contain heterogeneous documentation. This is especially true for the agricultural funding data maintained by the Federal Institute of Agricultural Economics, Rural and Mountain Research, where, among others, the INVEKOS and ÖPUL systems have accumulated more than thirty years of records across several PostgreSQL instances. These databases do not form a uniform whole. Instead, they exist as separate but interdependent systems, each with slightly different schemas, versions and update histories. Over time, new variables were added, definitions changed and older fields became obsolete. As a result, understanding how a table or attribute was used in a specific year often requires reconstructing its historical evolution.
In addition to the structural complexity of the databases, the essential contextual knowledge that users and domain experts need is scattered across metadata files, internal documentation, legal descriptions and PDF manuals. These documents encode definitions, measurement rules, program descriptions and exceptions that must be understood before any meaningful analysis can take place. Because they are not directly linked to the database structures, researchers often rely on personal experience, institutional memory or extensive manual searching to interpret variables correctly. Historical SQL logs further complicate the picture: they reveal how analysts have queried the system, but they are not integrated with the metadata landscape.
The combination of distributed storage, shifting schemas and fragmented documentation means that researchers must navigate a multilayered information space. Access is additionally restricted by strict role-based permissions, ensuring that different users can only see the parts of the data relevant to their responsibilities. These constraints often make it difficult for scientific staff without strong SQL expertise to explore the data independently. Instead, they depend on specialized technical staff to translate domain questions into correct and secure SQL queries. This dependency slows research, increases communication overhead and raises the risk of misinterpretation.
Graph-based Retrieval-Augmented Generation (Graph RAG) presents a promising solution to these challenges. Unlike classical retrieval approaches that rely solely on vector similarity, a graph-based method can connect distributed database instances, versioned schemas, metadata relationships and textual documentation into a single, structured representation. Temporal edges can express how a column changed meaning over time; semantic links can connect a PDF definition to a specific table; RBAC rules can be represented directly in the graph. In this way, a Graph RAG system does not merely search for information—it reconstructs the relationships that define the data landscape itself. By combining this graph structure with the generative abilities of large language models, the system can translate natural-language questions into historically accurate, role-compliant and well-grounded SQL queries.
Objectives
The overall objective of this project is to develop an on-premise Graph RAG system that gives researchers intuitive access to distributed longitudinal databases through natural language. The system aims to integrate the multiple PostgreSQL instances, their historical schema versions, and the diverse metadata sources into a single temporal knowledge graph. This graph will link database elements to definitions in the associated documentation, to patterns from SQL logs and to RBAC constraints that determine who may access which parts of the data.
The knowledge graph will serve as the foundation for a hybrid retrieval mechanism in which vector-based search provides linguistic flexibility and graph traversal ensures structural accuracy. A natural-language interface will allow researchers to pose domain questions in their own words, while the system identifies the relevant tables, time periods, definitions and access restrictions. It will then generate an SQL query that reflects both the semantics of the question and the historical context of the data. The system will return not only the SQL query but also the reasoning behind it, including references to documentation, variable definitions and metadata relationships.
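The division of labour between vector search and graph traversal can be sketched as a two-step lookup: similarity search shortlists documentation chunks, and the graph then supplies the tables those chunks are linked to. The toy 3-dimensional embeddings, chunk identifiers and link structure below are illustrative assumptions, not the planned implementation.

```python
# Sketch of hybrid retrieval: vector search shortlists documentation
# chunks, graph traversal expands them to linked schema elements.
# Embeddings are toy 3-d vectors; names and links are illustrative.
import math

doc_vectors = {
    "doc:oepul#alpine_pasture": [0.9, 0.1, 0.0],
    "doc:invekos#parcel_area":  [0.1, 0.8, 0.2],
    "doc:oepul#organic_bonus":  [0.2, 0.2, 0.9],
}

# Graph edges from documentation chunks to the tables they describe.
defines = {
    "doc:oepul#alpine_pasture": ["oepul.measures"],
    "doc:invekos#parcel_area":  ["invekos.parcels"],
    "doc:oepul#organic_bonus":  ["oepul.measures", "oepul.payments"],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, k=2):
    # Step 1: vector search for the k most similar documentation chunks.
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)[:k]
    # Step 2: graph traversal from those chunks to the linked tables.
    tables = sorted({t for d in ranked for t in defines[d]})
    return ranked, tables

chunks, tables = retrieve([0.15, 0.1, 0.95])
print(chunks)   # most relevant documentation chunks
print(tables)   # candidate tables for SQL generation
```

The shortlisted chunks provide the textual grounding for the answer, while the expanded table set constrains which schema elements the SQL generator may use.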
By integrating distributed data sources into a coherent structure, the project seeks to reduce the dependency on technical specialists and to make the data accessible to researchers from a wide range of backgrounds. The long-term goal is to create a blueprint for secure and explainable data access in other sensitive or longitudinal research settings.
Work Packages
WP 1 – Project Foundations and Data Understanding
The project begins with a detailed examination of the distributed PostgreSQL databases, their schema evolution and the associated metadata and documentation. During this phase, the conceptual approach is defined and the methodological groundwork is validated. The presence and quality of metadata and documentation will be assessed. This stage ensures a complete understanding of the data environment, access restrictions and the technical boundaries required for an on-premise Graph RAG system.
WP 2 – Knowledge Graph Construction and Retrieval Prototype
Once the foundations are established, the distributed data sources, metadata and PDF documents are processed to extract entities, relationships and temporal information. These elements are merged into an initial temporal knowledge graph that represents the evolution of the system across decades. A first version of the hybrid retrieval mechanism—combining vector search with graph traversal—is implemented to demonstrate how structured and unstructured information can be connected into a coherent reasoning space.
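The extraction step described above can be pictured as mapping each metadata record to a small set of graph triples that are then merged into the knowledge graph. The record fields and relation names below are illustrative assumptions about what such metadata might contain.

```python
# Tiny sketch of turning structured metadata records into graph triples
# for merging into the temporal knowledge graph. Field names, relation
# names and the record format are illustrative assumptions.
metadata_records = [
    {"table": "invekos.parcels", "column": "area_m2",
     "definition_doc": "doc:invekos_manual#area", "introduced": 2015},
    {"table": "oepul.measures", "column": "measure_code",
     "definition_doc": "doc:oepul_manual#codes", "introduced": 1995},
]

def to_triples(record):
    """Map one metadata record to (subject, relation, object) triples."""
    col = f"{record['table']}.{record['column']}"
    return [
        (record["table"], "has_column", col),
        (record["definition_doc"], "defines", col),
        (col, "introduced_in", record["introduced"]),
    ]

triples = [t for rec in metadata_records for t in to_triples(rec)]
print(len(triples))  # 6 triples ready for graph insertion
```

Running the same mapping over all metadata sources, and deduplicating the resulting triples, yields the initial graph that the retrieval prototype operates on.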
WP 3 – RAG Pipeline, SQL Generation and System Integration
The prototype graph and retrieval components are expanded into a fully functioning RAG pipeline capable of interpreting natural-language questions. Temporal reasoning, semantic disambiguation and RBAC rules are incorporated into the SQL generation logic. Historical SQL logs are integrated to align generated queries with established analytical patterns. This milestone produces a unified system that can translate research questions into well-grounded and historically accurate SQL statements.
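One way to picture the RBAC step is as a guard that inspects a generated query before it is returned: the tables it references are compared against the requesting role's permissions. The role names, permission sets and the deliberately naive regex-based table extraction below are illustrative assumptions; a real system would use a proper SQL parser.

```python
# Simplified sketch of an RBAC compliance check on a generated query.
# Roles, tables and the naive table extraction are illustrative; a
# production system would parse the SQL properly instead of using regex.
import re

ROLE_PERMISSIONS = {
    "funding_analyst": {"invekos.parcels", "oepul.measures"},
    "auditor": {"invekos.parcels", "oepul.measures", "oepul.payments"},
}

def referenced_tables(sql: str) -> set[str]:
    # Naive extraction of schema-qualified names after FROM/JOIN.
    return set(re.findall(r"(?:from|join)\s+([a-z_]+\.[a-z_]+)",
                          sql, flags=re.IGNORECASE))

def check_rbac(sql: str, role: str) -> tuple[bool, set[str]]:
    """Return (allowed, offending_tables) for the given role."""
    used = referenced_tables(sql)
    denied = used - ROLE_PERMISSIONS.get(role, set())
    return (not denied, denied)

query = """
SELECT m.code, SUM(p.amount)
FROM oepul.measures m
JOIN oepul.payments p ON p.measure_id = m.id
GROUP BY m.code
"""
print(check_rbac(query, "funding_analyst"))  # blocked: payments not readable
print(check_rbac(query, "auditor"))          # allowed
```

Returning the offending tables alongside the verdict also supports the explainability goal: the user can be told exactly why a query was rejected.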
WP 4 – Evaluation, Refinement and Final Deliverables
The system undergoes comprehensive evaluation focusing on accuracy, interpretability, temporal correctness and compliance with access-control rules. User feedback informs refinements to the interface and reasoning components. A complete version of the system and its documentation is produced, followed by final revisions and the preparation of a scientific paper or extended abstract summarizing the project’s methodology and results.
Deliverables
At the end of the six-month project period, the primary outcome will be a beta version of an on-premise Graph RAG system capable of translating natural-language questions into historically aware, structurally grounded and RBAC-compliant SQL queries. This beta system will include a functioning temporal knowledge graph that integrates distributed database schemas, metadata sources and documentation into a coherent representation. It will further provide an early hybrid retrieval mechanism and an initial generation pipeline that demonstrates the technical and methodological feasibility of the overall approach.
In addition to the software prototype, the project will deliver a comprehensive body of research that examines the methodological, technical and operational questions relevant to a full-scale implementation. This research will include an analysis of the system’s performance, an assessment of its suitability for long-term adoption and a detailed reflection on architectural choices such as temporal modelling, graph construction strategies, retrieval mechanisms and access-control integration. The outcome of this work will form the basis for an informed decision on whether the system should be developed into a production-ready tool in a following project phase.
Timeline
Start: 02/2026
End: 07/2026