News Release

Unveiling large multimodal models in pulmonary CT: A comparative assessment of generative AI performance in lung cancer diagnostics

Peer-Reviewed Publication

FAR Publishing Limited

Image caption: Diagnostic workflow and per-response accuracy comparison of Gen-AI for pulmonary CT images. Credit: The authors.

Gen-AI is increasingly recognized for its potential in healthcare, particularly in complex radiological interpretations. However, the clinical utility of Gen-AI requires thorough validation with real-world data. 

Among 184 confirmed malignant lung tumor cases, diagnostic accuracy varied significantly across the three models. Gemini achieved the highest accuracy, followed by Claude-3-opus, with both exceeding 90%, while GPT scored lowest at 65.22%. Statistical analysis confirmed that Gemini's diagnostic accuracy in single-image tasks significantly exceeded that of Claude and GPT.

However, Gemini's accuracy plummeted to 58.51% with continuous slices, likely reflecting difficulty in interpreting lesion continuity and spatial relationships. Adding clinical history improved its results slightly (68.30%), yet Gemini still showed the steepest performance decline of the three models, suggesting that it relies heavily on text input while neglecting imaging features. Similarly, GPT performed poorly with continuous CT slices or added clinical history, averaging 48.91% and 63.95% accuracy, respectively. Claude-3-opus and GPT showed greater stability across the different image inputs, with Claude-3-opus demonstrating a significant accuracy advantage on continuous slices.

Using an identical result in at least two attempts as the final diagnostic standard, the researchers compared model accuracy under the different inputs: Claude outperformed Gemini, which in turn outperformed GPT. After incorporating non-malignant nodules (n=66), inflammatory lesions (n=100), and normal lungs (n=54) to enhance sample diversity, Claude and Gemini (both AUC = 0.61) performed best with single CT images. As input complexity increased, however, both models' AUC decreased significantly. GPT's AUC improved slightly with increasing input complexity but remained between 0.50 and 0.60, indicating near-random performance.
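The consensus rule described above lends itself to a compact illustration. Below is a minimal Python sketch, assuming three repeated queries per case (the release does not state the exact number of attempts); the function name and diagnostic labels are hypothetical stand-ins, not the study's actual pipeline.

```python
from collections import Counter

def consensus_diagnosis(attempts):
    """Return the diagnosis given in at least two of three attempts,
    or None when all attempts disagree (no consensus reached)."""
    label, count = Counter(attempts).most_common(1)[0]
    return label if count >= 2 else None

# Hypothetical example: three repeated queries for one CT case.
print(consensus_diagnosis(["malignant", "malignant", "benign"]))  # -> "malignant"
print(consensus_diagnosis(["malignant", "benign", "normal"]))     # -> None
```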

Simplified prompts significantly improved diagnostic performance: Claude (AUC = 0.69), Gemini (AUC = 0.76), and GPT (AUC = 0.73) all showed higher AUC values. Accuracy, sensitivity, specificity, and F1 scores also improved, indicating more balanced performance. However, this improvement was not consistent in the Gemini and GPT tests that used normal images as controls. ROC curves for the different control groups further demonstrated Claude's significant diagnostic improvement, while Gemini and GPT struggled to recognize normal images. Comparison across pathological subtypes showed similar diagnostic sensitivity under all prompt conditions, but overall performance was most balanced with simplified prompts.
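Evaluating each control group against the malignant cases amounts to computing one ROC curve per group. The snippet below is an illustrative sketch only, using scikit-learn's roc_auc_score; the group sizes match those reported in the release, but the score distributions are synthetic placeholders, not the models' actual outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: each control group is scored against the same
# set of malignant cases, mirroring a per-control-group ROC comparison.
rng = np.random.default_rng(42)
malignant_scores = rng.normal(0.7, 0.15, 184)   # model scores, malignant cases
controls = {
    "non-malignant nodules": rng.normal(0.50, 0.20, 66),
    "inflammatory lesions":  rng.normal(0.45, 0.20, 100),
    "normal lungs":          rng.normal(0.40, 0.20, 54),
}

for name, ctrl_scores in controls.items():
    y_true = np.r_[np.ones(len(malignant_scores)), np.zeros(len(ctrl_scores))]
    y_score = np.r_[malignant_scores, ctrl_scores]
    print(f"{name}: AUC = {roc_auc_score(y_true, y_score):.2f}")
```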

Evaluation of the lesion features identified by the Gen-AI models showed that Claude and GPT demonstrated greater diversity and accuracy than Gemini in locating and describing lesions. Likert self-assessment indicated that all models relied heavily on morphological and margin features for malignancy diagnosis, with "spiculated" and "irregular" serving as key differentiators; lesion density and tumor size also played important roles. Because missing data could not be traced or supplemented during sequential queries, the researchers further analyzed feature recognition and response rates.

Results showed that Morphology/Margins features had the highest response rates, with "spiculated" and "lobulated" especially prominent, and Likert scale results indicated that the models weighted Morphology/Margins features most heavily when diagnosing malignant tumors. In non-malignant lesions, false positives displayed feature patterns similar to malignant cases but with reduced diversity. Coefficient of variation analysis showed that Claude had the lowest overall variation in the malignant lesion group. Claude and Gemini demonstrated good feature-scoring consistency for both malignant and non-malignant lesions, while GPT fluctuated more on malignant lesions.
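The coefficient of variation used here is simply the standard deviation divided by the mean, so lower values indicate more consistent scoring across repeated queries. A minimal sketch, with hypothetical Likert scores standing in for the study's data:

```python
import numpy as np

def coefficient_of_variation(scores):
    """CV = sample standard deviation / mean; lower values indicate
    more consistent feature scoring across repeated queries."""
    scores = np.asarray(scores, dtype=float)
    return scores.std(ddof=1) / scores.mean()

# Hypothetical Likert scores (1-5) for one feature over repeated queries.
claude_scores = [4, 4, 5]
gpt_scores    = [2, 4, 5]
print(f"Claude CV: {coefficient_of_variation(claude_scores):.2f}")  # lower = steadier
print(f"GPT CV:    {coefficient_of_variation(gpt_scores):.2f}")
```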

In misdiagnosed cases, the Gen-AI models showed significant deviations across multiple feature dimensions, some completely opposite to the actual imaging findings, indicating a risk of feature fabrication and calling into question how thoroughly imaging features were learned during model training. For performance optimization, Lasso regression achieved AUCs of 0.896 and 0.884 before and after cross-validation, showing good stability; stepwise regression achieved comparable AUC values (0.898 and 0.883) but with higher variability. The TCGA-LUAD, TCGA-LUSC, and MIDRC-RICORD-1A datasets were used for external validation. Consistent with the earlier findings, Claude showed better overall performance with simplified prompts, and after feature dimensionality reduction, Lasso's performance indicators became more balanced, a result further validated by ROC curve analysis.
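The release does not detail the exact regression setup, so the following is only a sketch of the general approach: L1-penalized (Lasso-style) logistic regression with cross-validated AUC. The data, dimensions, and hyperparameters are synthetic placeholders, not the study's feature matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scored-feature matrix: rows are cases,
# columns are Gen-AI-rated imaging features; y marks malignancy.
rng = np.random.default_rng(7)
X = rng.normal(size=(250, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=250) > 0).astype(int)

# The L1 penalty shrinks uninformative feature weights to zero,
# performing the feature dimensionality reduction described above.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```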

