Google Summer of Code 2024

Name	Raya Chakravarty
Organisation	INCF
Mentors	Arman Jahanpour, Sebastian Urchs, Alyssa Dai, Brent McPherson, Jean-Baptiste Poline
Project	A natural language interface for querying federated research data
Link to the project repository	GitHub Repository

A natural language interface for querying federated research data

Neurobagel is a federated data ecosystem that enables researchers and other data users to locate and access research data that must remain at its original institute due to data governance requirements.

Currently, Neurobagel offers a graphical web query interface that interacts with the node APIs on the user's behalf, simplifying the process of formulating complex queries.

This project intends to build a chatbot that utilizes existing large language models (LLMs) to parse text provided by users into precise queries and reliably summarize the results for them. The chatbot should be able to receive and comprehend user prompts in natural language, initiate corresponding API calls using predefined Neurobagel parameters (like minimum age, maximum age, sex, etc.), interpret the results, and communicate that information back to the user. The goal is to choose open tools and models to allow for flexible hosting options.
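To make the parsing step concrete, here is a minimal sketch of how the structured output of an LLM could be checked before any API call is made. It assumes the model has been prompted to answer with a JSON object; the `QueryParams` class, the `parse_llm_output` function, and the set of allowed `sex` values are hypothetical names chosen for illustration, not the project's actual code.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: the plain-language values the model is allowed to return for sex.
ALLOWED_SEX = {"male", "female", "other"}


@dataclass
class QueryParams:
    """Hypothetical container for the parameters extracted from a user prompt."""
    min_age: Optional[float] = None
    max_age: Optional[float] = None
    sex: Optional[str] = None


def parse_llm_output(raw: str) -> QueryParams:
    """Parse the JSON text returned by the model and reject inconsistent values."""
    data = json.loads(raw)
    params = QueryParams(
        min_age=data.get("min_age"),
        max_age=data.get("max_age"),
        sex=data.get("sex"),
    )
    # Guard against hallucinated categories that have no matching query term.
    if params.sex is not None and params.sex not in ALLOWED_SEX:
        raise ValueError(f"unsupported sex value: {params.sex!r}")
    # An age range only makes sense if the lower bound does not exceed the upper one.
    if (
        params.min_age is not None
        and params.max_age is not None
        and params.min_age > params.max_age
    ):
        raise ValueError("min_age must not be greater than max_age")
    return params
```

Validating the model's answer in ordinary code, rather than trusting it directly, is one way to keep the resulting queries precise even when the model occasionally produces malformed output.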

Understanding the codebase

The Neurobagel query tool AI codebase consists of three main parts:
- extracting information from user queries using LLMs
- mapping extracted terms to TermURLs
- generating the final API URL.
The project uses pytest for testing, continuous integration to automate testing and linting, and Dockerization for scalable deployment.
A FastAPI server handles API requests, and a React-based chatbot interface enables users to input queries, which are processed via the API.
The project includes comprehensive documentation for both local and Dockerized setup options.
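The second and third stages listed above (mapping extracted terms to TermURLs and generating the final API URL) can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the `TERM_URLS` table, its placeholder identifiers, the `diagnosis` key, the function names, and the `https://example.org/query` base URL are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical lookup table from plain-language terms to TermURLs.
# The identifiers below are placeholders, not real vocabulary terms.
TERM_URLS = {
    "sex": {"female": "vocab:female", "male": "vocab:male"},
    "diagnosis": {"parkinson's disease": "vocab:parkinsons"},
}


def map_terms(extracted: dict) -> dict:
    """Replace categorical values with TermURLs; pass other values through."""
    mapped = {}
    for key, value in extracted.items():
        if value is None:
            continue  # the user did not mention this parameter
        table = TERM_URLS.get(key)
        if table is None:
            mapped[key] = value  # e.g. numeric bounds such as min_age
            continue
        term = table.get(str(value).lower())
        if term is None:
            raise KeyError(f"no TermURL known for {key}={value!r}")
        mapped[key] = term
    return mapped


def build_query_url(base_url: str, params: dict) -> str:
    """Append the mapped parameters to the base URL as a query string."""
    return f"{base_url}?{urlencode(params)}" if params else base_url


# Example usage with values an extraction step might have produced.
extracted = {"min_age": 20, "max_age": None, "sex": "Female"}
url = build_query_url("https://example.org/query", map_terms(extracted))
```

Keeping the term mapping in a plain lookup table, separate from the LLM step, means an unknown term fails loudly instead of producing a query with an invented identifier.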

For a deeper understanding of the codebase, visit here.

Contributions

A separate GitHub repository, query-tool-ai, was established for the GSoC project, and all the code was merged into the project's main branch. Below are links to the key contributions I made:

Future Scope

Updates to the LLM: Continuously updating and refining the LLM to enhance output quality and minimize hallucinations.

Conclusion

My GSoC experience has been incredibly rewarding and educational, and it has contributed significantly to my future aspirations. I am deeply thankful to my mentors, Arman Jahanpour, Sebastian Urchs, Alyssa Dai, Brent McPherson, and Jean-Baptiste Poline, as well as to INCF and Neurobagel, for their unwavering guidance and support throughout the project. Their expertise and readiness to assist whenever needed have been invaluable. I also extend my gratitude to Google for offering such a fantastic opportunity to student developers worldwide.

I hope to continue collaborating with Neurobagel and INCF in the future and to contribute further to their impactful projects.