DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks
This study evaluates the performance gains of transformer-based DNA language models like DNABERT2 compared to conventional approaches such as ConvNova, specifically addressing the high cost of pre-training. It investigates whether these improvements justify the computational overhead and analyzes the impact of Byte Pair Encoding (BPE) tokenization on genomic tasks.