# Model Selection Process
To extract structured information from user queries, several LLMs were tested and evaluated. The goal was to identify a model that could accurately extract all required parameters without introducing hallucinations or errors.
## Tested Models
### google/flan-t5-xxl

The `google/flan-t5-xxl` model, a large-scale Transformer model fine-tuned on a wide range of NLP tasks, was one of the first models tested for extracting parameters from user queries. Known for its versatility on complex language tasks, it was integrated into the project via the Hugging Face Hub.
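The snippet below is a minimal sketch of that integration, assuming the legacy LangChain `HuggingFaceHub` wrapper; the prompt wording, parameter names, and generation settings are illustrative rather than taken from the project.

```python
from langchain_community.llms import HuggingFaceHub

# Requires HUGGINGFACEHUB_API_TOKEN to be set in the environment.
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.1, "max_new_tokens": 128},
)

prompt = (
    "Extract age, sex, and diagnosis from the query below and return them as JSON.\n"
    "Query: How many female subjects older than 50 with a Parkinson's diagnosis?"
)
print(llm.invoke(prompt))
```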
#### Performance overview
- **Extracting all parameters together**

  Prompt: *How many female subjects older than 50 with a Parkinson's diagnosis?*

  Issue: While the model performed well in terms of fluency and coherence, it exhibited a significant flaw: it introduced assumptions not present in the original query. For example, it added values like `imaging_sessions` and `phenotypic_sessions`, which were not mentioned by the user.
- **Extracting values one by one**

  To address the issue of assumptions, each parameter was instead extracted individually using separate scripts (see the sketch after this list):

  - `extract_age.py`
  - `extract_sex.py`
  - `extract_sessions.py`

  Issue: While this approach reduced the likelihood of hallucinations, it introduced other challenges. The `google/flan-t5-xxl` model has a rate limit when accessed via the Hugging Face Hub, which constrained the speed and scalability of this approach. Additionally, the model's accuracy was inconsistent for categorical values such as `diagnosis`, `assessment tool`, `health-control`, and `image-modality`.
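As a rough illustration of the one-parameter-per-script strategy (e.g. `extract_age.py`), consider the following sketch; the prompt wording and return format are assumptions, not the project's actual code.

```python
from langchain_community.llms import HuggingFaceHub

llm = HuggingFaceHub(repo_id="google/flan-t5-xxl")  # needs HUGGINGFACEHUB_API_TOKEN

def extract_age(query: str) -> str:
    """Ask the model for the age constraint only, to limit hallucinated fields."""
    prompt = (
        "From the query below, return only the age constraint "
        "(e.g. '>50'), or 'None' if no age is mentioned.\n"
        f"Query: {query}"
    )
    return llm.invoke(prompt).strip()

print(extract_age("How many female subjects older than 50 with a Parkinson's diagnosis?"))
```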
### llama-2

`Llama-2` is an advanced language model known for its large size and ability to handle complex language tasks. Despite these capabilities, its accuracy in this specific context fell short. The model was integrated using the `ChatOllama` framework, and its performance was evaluated on the task of extracting structured information from user queries.
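A minimal sketch of that integration is shown below, assuming the `langchain-community` `ChatOllama` wrapper and a local Ollama server with the `llama2` model pulled; the prompt is illustrative.

```python
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

# Assumes `ollama pull llama2` has been run and the Ollama server is listening.
llm = ChatOllama(model="llama2", temperature=0)

query = "How many female subjects older than 50 with a Parkinson's diagnosis?"
response = llm.invoke(
    [HumanMessage(content=f"Extract age, sex, and diagnosis as JSON from: {query}")]
)
print(response.content)
```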
#### Performance overview
Despite the high expectations due to its size and architecture, Llama-2 struggled to deliver accurate results for this task. The model often produced outputs that were incomplete or contained errors, particularly when extracting detailed parameters from complex queries.
*Figure: Example of LLM response.*
Issue: The model failed to accurately extract some of the key elements of the query, leading to incorrect or incomplete responses. This lack of precision made it unsuitable for the specific needs of this project, where accuracy in parameter extraction is crucial.
### gemma

The `gemma` model is a large language model designed for natural language processing tasks. It aims to improve on predecessors such as `llama-2` by offering better performance in text generation, information extraction, and question answering. The model was implemented using the `ChatOllama` framework and tested on queries requiring the extraction of structured data.
#### Performance overview
Compared to `llama-2`, `gemma` demonstrated better accuracy in extracting information from user queries. However, it still struggled with hallucinations, generating information not present in the original query or misinterpreting the provided data.
*Figure: Example of LLM response.*
Issue: While `gemma` improved on `llama-2` in extracting relevant information, it occasionally generated incorrect or irrelevant details that were not part of the original query. These hallucinations, though less frequent, could lead to misleading or inaccurate outputs, especially in cases requiring precise categorical information such as diagnosis, assessment tools, or imaging modalities.
Despite its better performance in some areas, the inconsistency in handling specific queries made it unreliable for scenarios where accuracy is critical.
### mistral

The `mistral` LLM is a state-of-the-art language model developed by Mistral AI and served here through Ollama, known for its impressive capabilities in natural language understanding and generation. It is designed to handle a wide range of language tasks, including text generation, comprehension, and contextual understanding. The model was implemented using the `ChatOllama` framework and tested on queries requiring the extraction of structured data.
#### Performance overview
- `mistral` excelled at accurately extracting parameters such as age, sex, and diagnosis from user queries, making it well suited for detailed information extraction.
- It showed fewer hallucinations than models like `llama-2` and `gemma`, providing more reliable outputs by avoiding the generation of irrelevant data.
- `mistral` integrated effectively with structured data models like Pydantic, allowing extracted information to be mapped seamlessly onto predefined schemas (see the sketch after this list).
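The following is a minimal sketch of that Pydantic mapping, assuming Pydantic v2 and Ollama's JSON mode; the schema fields and prompt are illustrative assumptions, not the project's actual schema.

```python
from typing import Optional

from langchain_community.chat_models import ChatOllama
from pydantic import BaseModel

class QueryParams(BaseModel):
    """Parameters extracted from a natural-language cohort query (illustrative)."""
    min_age: Optional[int] = None
    max_age: Optional[int] = None
    sex: Optional[str] = None
    diagnosis: Optional[str] = None

# format="json" asks Ollama to constrain the output to valid JSON.
llm = ChatOllama(model="mistral", format="json", temperature=0)

query = "How many female subjects older than 50 with a Parkinson's diagnosis?"
raw = llm.invoke(
    f"Extract min_age, max_age, sex, and diagnosis as JSON from: {query}"
)
params = QueryParams.model_validate_json(raw.content)  # validates and coerces fields
print(params)
```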
*Figure: Examples of LLM responses.*
## Open-source vs closed-source models
### Open-Source Models
Open-source models, such as `mistral`, offer the advantage of accessibility and customization, allowing users to modify and integrate them into various applications. These models provide significant benefits in terms of transparency and flexibility. However, they are not without limitations.
In the case of `mistral`, an open-source model, there were notable issues with handling certain parameter extractions. For instance, it occasionally struggled to identify and interpret vague age ranges such as "above 40" or "below 60".
### Closed-Source Models
Closed-source models, while less customizable, often come with robust support and fine-tuning specific to accuracy and performance. They are typically developed with extensive resources and advanced techniques that address various limitations seen in open-source alternatives.
To address the challenges experienced with open-source models, experimentation was extended to closed-source options such as `openai/chatgpt-4o-latest`. This model demonstrated superior performance, extracting information with almost no hallucinations. It effectively handled ambiguous queries and accurately interpreted parameters like age ranges, providing reliable and precise outputs.
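Below is a minimal sketch of the same extraction against the OpenAI API; the model name follows the document, while the prompt and field names are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

query = "How many subjects above 40 with a Parkinson's diagnosis?"
response = client.chat.completions.create(
    model="chatgpt-4o-latest",
    messages=[
        {
            "role": "user",
            "content": (
                "Extract min_age, max_age, sex, and diagnosis as JSON from the "
                f"query; use null for anything not mentioned.\nQuery: {query}"
            ),
        }
    ],
)
print(response.choices[0].message.content)
```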
### Examples

- *Figure: `mistral` unable to identify the context of age.*
- *Figure: `openai/chatgpt-4o-latest` giving accurate answers to the same questions.*
### Summary
While open-source models like `mistral` offer valuable benefits, they can exhibit limitations such as mishandling specific parameter nuances. Closed-source models, such as `openai/chatgpt-4o-latest`, often provide enhanced accuracy and reliability, making them a preferable choice for applications where precision is critical.
## GPU Utilization
In the process of implementing and testing various language models for the extraction of parameters from user queries, GPU utilization played a crucial role in optimizing performance and ensuring efficient execution.
For the testing of both open-source and closed-source models, NVIDIA GPUs were utilized to accelerate the inference process, which is often resource-intensive due to the large-scale nature of these models.
The GPU setup was deployed on a virtual machine, provided by Neurobagel, which supported high-performance computing with NVIDIA GPUs. The virtual machine allowed seamless integration with the testing pipeline, enabling quick switching between different models, including `google/flan-t5-xxl`, `llama-2`, `gemma`, `mistral`, and `openai/chatgpt-4o-latest`. By utilizing NVIDIA's CUDA cores, the models performed complex computations with significantly reduced latency compared to running on a CPU.
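A quick sanity check for this kind of setup, assuming PyTorch with CUDA support is installed on the VM (an assumption; the document does not name the framework), might look like:

```python
import torch

# Report whether the locally served models can run on the GPU.
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; falling back to CPU (expect higher latency).")
```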