DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

This study evaluates whether transformer-based DNA language models such as DNABERT2 deliver performance gains over convolutional models such as ConvNova that are large enough to justify the high computational cost of pre-training. It also analyzes the impact of Byte Pair Encoding (BPE) tokenization on genomic tasks.

The research compares transformer-based architectures against convolutional models to determine if performance gains outweigh pre-training costs.
It assesses how much pre-training actually contributes to downstream performance when models are fine-tuned for DNA sequence analysis.
The study examines how BPE tokenization affects model performance on genomics-related tasks, a topic of ongoing debate in the field.
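
To make the tokenization question concrete, the following is a minimal sketch of how BPE could build a vocabulary over nucleotide sequences. It is a toy illustration only, not the tokenizer used by DNABERT2 or by this study; the function names (learn_bpe_merges, apply_merge, tokenize) and the example sequences are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    # Start from single-nucleotide tokens (A, C, G, T).
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # so the result is deterministic (an arbitrary choice for this sketch).
        best_pair = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical toy corpus, not data from the study.
merges = learn_bpe_merges(["ATATAT", "ATATGC"], num_merges=2)
print(merges)
print(tokenize("ATATGC", merges))
```

Unlike fixed-length k-mers, the tokens produced this way vary in length and depend on the training corpus, which is one reason the effect of BPE on genomic tasks is debated.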