- Rishika Srinivas (@rishikasrinivas)
- Nataliia Kulieshova (@Kulieshova)
- Anushka Limaye (@anushkalimaye)
- Kymari Bratton (@Kymari28)
- Fernanda Del Toro (@Fernandadeltoro)
- Languages: Python, HTML, CSS, JavaScript
- Database: MongoDB
- AI Integration: OpenAI API
- User Interface: React with Cytoscape.js for graph visualization
Hand annotations were meticulously developed by team members, who manually reviewed every sentence across three provided PDFs. These annotations formed triplets structured as:
`(subj, rel, obj)`, where `subj` is the Subject, `rel` is the Relationship, and `obj` is the Object.
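For illustration, a minimal sketch of how one such triplet could be represented in Python (the sentence and values below are hypothetical, not taken from the annotation files):

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    subj: str  # Subject: a clinical entity
    rel: str   # Relationship between the two entities
    obj: str   # Object: a clinical entity

# Hypothetical annotation for a sentence such as "Sertraline may cause nausea."
example = Triplet(subj="Sertraline", rel="side effect", obj="nausea")
```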
Easily import PDFs, which are converted into Knowledge Graphs (KGs) that extract clinical entities and relationships.
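As a rough illustration of this conversion step, here is a minimal sketch of prompting the OpenAI chat completions API to extract triplets from a chunk of PDF text; the prompt wording, model choice, and function name are assumptions, not the project's exact code:

```python
# Minimal sketch of the PDF-to-KG extraction step, assuming the OpenAI
# chat completions API; the prompt and model choice are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_triplets(pdf_text: str) -> str:
    """Ask the model for (subj, rel, obj) triplets found in a chunk of PDF text."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Extract clinical (subject, relationship, object) "
                        "triplets from the given text, one per line."},
            {"role": "user", "content": pdf_text},
        ],
    )
    return response.choices[0].message.content
```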
Retrieve saved Knowledge Graphs for continued analysis or updates.
A search-first design lets users quickly locate nodes or relationships. Results are highlighted in orange and zoomed in for clarity.
Nodes represent clinical entities, and edges use color coding and varying thickness to show relationship categories and strength. The Relationship Table offers a legend with clickable colored circles for more details on each relationship. A magnitude table shows the significance of relationships.
Users can filter the graph to focus on specific relationship types (e.g., "Side Effects" or "Recommendations"), improving clarity without clutter.
An intuitive Help button offers guidance on the graph's features, ensuring accessibility for new users and clinicians unfamiliar with knowledge graphs.
- Brightside Health Branding: The design aligns with Brightside Health's brand using calming blues, pastels, and creative accents like color-coded edges for a clean, engaging experience.
- User-Centered Design: Focuses on usability with:
- Interactive Relationship Table for easy data interpretation
- Edge Thickness to prioritize strong relationships for evidence-based decisions
- Search and Filtering for focused, efficient navigation
| Metric | Type | Score | Key Pitfalls |
|---|---|---|---|
| Accuracy | Model-Based | 86.0% | Uses an LLM to evaluate an LLM. |
| Precision | Statistical | 81.81% | Can be misled by cosine similarity. |
| Hallucination | Model-Based | 33.80% | Uses an LLM to evaluate an LLM. |
- Framework used: GPT Critic
- Evaluation Type: Model-Based
- Method:
- Uses 10 worker threads to enable parallel comparisons.
- Compares each LLM output row with ground truth rows using GPT-3.5-turbo.
- Finds the best similarity score for each LLM output.
- Accuracy: 86.0%
- Output: Best ground truth match for each LLM output row.
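A minimal sketch of this parallel LLM-judge loop, assuming the OpenAI chat completions API; the scoring prompt and helper names are illustrative, not the project's exact code:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def judge_similarity(llm_row: str, truth_row: str) -> float:
    """Ask GPT-3.5-turbo for a 0-1 similarity score between two triplet rows."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Rate the similarity of these two triplets "
                              f"from 0 to 1. Reply with the number only.\n"
                              f"A: {llm_row}\nB: {truth_row}"}],
    )
    return float(response.choices[0].message.content.strip())

def best_match(llm_row: str, ground_truth: list[str]) -> tuple[str, float]:
    """Return the ground truth row with the highest judged similarity."""
    scores = [(row, judge_similarity(llm_row, row)) for row in ground_truth]
    return max(scores, key=lambda pair: pair[1])

def evaluate(llm_rows: list[str], ground_truth: list[str]) -> list[tuple[str, float]]:
    # 10 worker threads compare LLM output rows against ground truth in parallel.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(lambda row: best_match(row, ground_truth), llm_rows))
```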
- Evaluation Type: Statistical (word match and cosine similarity)
- Method: Checking if the extracted relationship is in the source text or in the ground truth annotations
- Threshold for Matching: 0.7
- Precision Score: 81.81%
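A rough sketch of this check: a literal word match against the source text, falling back to cosine similarity against the ground truth annotations. Using TF-IDF vectors for the cosine half is an assumption; the 0.7 threshold comes from the method above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.7

def is_supported(relationship: str, source_text: str, ground_truth: list[str]) -> bool:
    # Word match: the extracted relationship appears verbatim in the source.
    if relationship.lower() in source_text.lower():
        return True
    # Otherwise, compare against each ground truth annotation by cosine similarity.
    matrix = TfidfVectorizer().fit_transform([relationship] + ground_truth)
    similarities = cosine_similarity(matrix[0:1], matrix[1:])[0]
    return similarities.max() >= THRESHOLD
```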
- Evaluation Type: Model-Based (Factual Alignment and Consistency)
- Method: DeepEval Hallucination Metric
- Threshold for Matching: 0.5
- Hallucination Score: 33.80%
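A minimal sketch using DeepEval's HallucinationMetric, assuming the library's test-case interface; the example strings are placeholders, not project data:

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

metric = HallucinationMetric(threshold=0.5)

test_case = LLMTestCase(
    input="Extract triplets from the passage.",         # placeholder prompt
    actual_output="(Sertraline, side effect, nausea)",  # placeholder LLM output
    context=["Sertraline may cause nausea."],           # placeholder source sentence
)
metric.measure(test_case)
print(metric.score)  # hallucination score; passes when at or below the threshold
```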
| Method | Type | Accuracy | Key Pitfalls |
|---|---|---|---|
| Fuzzy Wuzzy | Statistical | 35.32% | Can be misled by surface-level string similarity. |
| TF-IDF Vector and Cosine Method | Statistical | 36.28% | Can be misled by cosine similarity. |
| GPT Critic | Model-Based | 86.0% | Uses an LLM to evaluate an LLM. |
| RAGA | Model-Based | 69.67% | Uses an LLM to evaluate an LLM. |
| G-Eval | Model-Based | 46.60% | Uses an LLM to evaluate an LLM. |
- Evaluation Type: Statistical
- Method:
- Compare each row of the ground truth to each row of LLM output.
- Threshold for "matching" requires 70% or above similarity.
- Accuracy: 35.32%
- Output:
- Rows of LLM output that match the ground truth at or above 70% similarity.
- Only one matching triplet pair was found at this threshold.
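A minimal sketch of this pairwise comparison using fuzzywuzzy; the choice of `token_sort_ratio` as the scoring function is an assumption:

```python
from fuzzywuzzy import fuzz

THRESHOLD = 70  # percent similarity required for a "match"

def fuzzy_matches(llm_rows: list[str], truth_rows: list[str]) -> list[tuple[str, str, int]]:
    """Return every (LLM row, ground truth row, score) pair at or above the threshold."""
    matches = []
    for llm_row in llm_rows:
        for truth_row in truth_rows:
            score = fuzz.token_sort_ratio(llm_row, truth_row)
            if score >= THRESHOLD:
                matches.append((llm_row, truth_row, score))
    return matches
```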
- Evaluation Type: Statistical, Feature-weighting
- Method:
- Combine the triplet columns into a single string.
- Vectorize text using TF-IDF to convert it to numeric form.
- Compare each LLM row to each ground truth row using cosine similarity.
- Accuracy: 36.28%
- Output: Best matching ground truth row for each LLM output row.
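A compact sketch of this matcher using scikit-learn; it assumes the triplet columns have already been joined into one string per row, and the variable names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_tfidf_matches(llm_rows: list[str], truth_rows: list[str]) -> list[tuple[str, float]]:
    """For each LLM row, return the most similar ground truth row and its score."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(llm_rows + truth_rows)  # shared vocabulary
    llm_vecs, truth_vecs = matrix[: len(llm_rows)], matrix[len(llm_rows):]
    sims = cosine_similarity(llm_vecs, truth_vecs)
    return [(truth_rows[row.argmax()], row.max()) for row in sims]
```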
- Evaluation Type: Model-Based
- Method:
- Uses 10 worker threads to enable parallel comparisons.
- Compares each LLM output row with ground truth rows using GPT-3.5-turbo.
- Finds the best similarity score for each LLM output.
- Accuracy: 86.0%
- Output: Best ground truth match for each LLM output row.
- Evaluation Type: Model-Based
- Method:
- Uses 10 worker threads to enable parallel comparisons.
- Compares each LLM output row with ground truth rows using GPT-3.5-turbo.
- Finds the best similarity score for each LLM output based on 3 criteria (Retrieval, Augmentation, and Generation).
- Accuracy: 69.67%
- Retrieval: 61.0%
- Augmentation: 73.0%
- Generation: 75.0%
- Output: Best ground truth match for each LLM output row.
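Notably, the overall 69.67% is the mean of the three criterion scores. A schematic of how such per-criterion LLM judgments could be combined; the prompt wording and function names are assumptions, not the project's exact code:

```python
from openai import OpenAI

client = OpenAI()
CRITERIA = ["Retrieval", "Augmentation", "Generation"]

def score_criterion(criterion: str, llm_row: str, truth_row: str) -> float:
    """Ask GPT-3.5-turbo for a 0-1 score on one criterion."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Score the {criterion} quality of output A "
                              f"against reference B from 0 to 1. Number only.\n"
                              f"A: {llm_row}\nB: {truth_row}"}],
    )
    return float(response.choices[0].message.content.strip())

def raga_score(llm_row: str, truth_row: str) -> float:
    """Average of the three per-criterion scores."""
    return sum(score_criterion(c, llm_row, truth_row) for c in CRITERIA) / len(CRITERIA)
```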
- Evaluation Type: Model-Based
- Method:
- Define common synonyms and related terms in the medical domain
- Calculate semantic similarity between two triples considering medical domain knowledge.
- Evaluate matches between ground truth and LLM output
- Accuracy: 46.60%
- Output: Each ground truth row paired with its actual output row, along with the best evaluation score (ranging from 0.10 to 0.80).
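A toy sketch of the synonym-aware matching step; the synonym table and the slot-fraction scoring below are illustrative placeholders, not the project's actual domain lists:

```python
# Hypothetical medical-domain synonym table; the real one would be larger.
MEDICAL_SYNONYMS = {
    "ssri": {"selective serotonin reuptake inhibitor"},
    "nausea": {"queasiness", "upset stomach"},
}

def normalize(term: str) -> str:
    """Map a term to its canonical form if it is a known synonym."""
    term = term.lower().strip()
    for canonical, synonyms in MEDICAL_SYNONYMS.items():
        if term == canonical or term in synonyms:
            return canonical
    return term

def triple_similarity(a: tuple[str, str, str], b: tuple[str, str, str]) -> float:
    """Fraction of (subj, rel, obj) slots that match after synonym normalization."""
    return sum(normalize(x) == normalize(y) for x, y in zip(a, b)) / 3
```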
Clone the repo by pasting this into the terminal:

```bash
git clone git@github.com:rishikasrinivas/KnowledgeGraphMentalHealth.git
```

Using the terminal, `cd` into the `KnowledgeGraphMentalHealth/` folder:

```bash
cd KnowledgeGraphMentalHealth/
```

Install all necessary dependencies by running:

```bash
python3 install_dependencies.py
```

Start the program by executing:

```bash
python3 frontend/app.py
```
We welcome contributions to improve the evaluation methods, refine the UI, or expand the dataset. Please feel free to submit issues or pull requests.
This project is licensed under the MIT License.