This study evaluates domain-specific transformer embeddings combined with classical machine learning models for detecting dosing errors in clinical trial protocols. The aim is to improve patient safety and trial integrity by flagging preventable medication errors early, using learned representations of the protocol text.

  • Textual data from clinical trials was encoded using ClinicalBERT, PubMedBERT, BioBERT, and MedCPT, then integrated with categorical features.
  • BioBERT consistently outperformed other encoders under a logistic regression baseline, achieving an ROC-AUC of 0.794, which is a 3.95% improvement over ClinicalBERT.
  • Combining multiple embeddings did not yield performance improvements, indicating that domain alignment is more critical than representational stacking.
  • Gradient boosting models, support vector classifiers, logistic regression, and residual neural networks achieved the strongest overall performance with ROC-AUCs ranging from 0.821 to 0.853.
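The feature integration and the evaluation metric described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the three-dimensional embedding stands in for a real encoder output (which would have several hundred dimensions), the field names `phase` and `masking` and their vocabularies are hypothetical, and the helper functions `one_hot`, `fuse_features`, and `roc_auc` are names introduced here for illustration only.

```python
from typing import Mapping, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> list[float]:
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def fuse_features(
    embedding: Sequence[float],
    categorical: Mapping[str, str],
    vocabularies: Mapping[str, Sequence[str]],
) -> list[float]:
    """Concatenate a text embedding with one-hot encoded categorical fields."""
    fused = list(embedding)
    for field, vocabulary in vocabularies.items():
        fused.extend(one_hot(categorical[field], vocabulary))
    return fused


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical categorical metadata; the real feature set is not specified here.
vocabularies = {
    "phase": ["Phase 1", "Phase 2", "Phase 3"],
    "masking": ["Open", "Double-blind"],
}

# Toy embedding standing in for the output of an encoder such as BioBERT.
fused = fuse_features(
    [0.12, -0.40, 0.88],
    {"phase": "Phase 2", "masking": "Double-blind"},
    vocabularies,
)

# Toy labels (1 = elevated dosing error risk) and classifier scores.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
```

The fused vector is what a downstream classifier (logistic regression, gradient boosting, and so on) would consume. The pairwise formulation of ROC-AUC is quadratic in the number of samples, so it is suitable for illustration only; a library implementation would be the practical choice on real data.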

Integrating domain-specific transformer embeddings with structured metadata makes it possible to identify trials that meet elevated dosing error risk criteria, which advances safety monitoring and supports informed regulatory decision-making.