MetaPlant-RNAseq: Large-Scale RNA-seq Dataset Annotations for Stress Response in Agricultural Crops
Abstract
Plant RNA-seq repositories hold vast amounts of data but often lack structured metadata on stress conditions, treatments, and tissues, which limits their utility. We present MetaPlant-RNAseq, a large-scale collection of plant RNA-seq datasets enriched with annotations generated by large language models (LLMs) and validated by experts. Our multi-agent pipeline integrates BioSample records, BioProject context, and linked publications to extract key fields such as stress type, treatment/control status, tissue, and developmental stage. A proof of concept on soybean datasets shows high precision on well-documented cases while exposing the difficulty of annotating cryptically encoded metadata. By combining automated annotation with human validation, MetaPlant-RNAseq provides a scalable, FAIR-aligned foundation for predictive modeling of crop stress responses, biomarker discovery, and AI-driven agricultural research.
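As an illustration of the kind of record such a pipeline might produce, the sketch below merges candidate values from the three sources named above into one structured annotation. It is a minimal sketch under stated assumptions, not the MetaPlant-RNAseq implementation: the class `SampleAnnotation`, the function `merge_sources`, the field names, the source priority order (BioSample, then BioProject, then publication), and the sample identifier are all hypothetical, and the LLM extraction step is replaced by plain dictionaries of already extracted values.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical field names, taken from the fields listed in the abstract.
FIELDS = ("stress_type", "treatment_status", "tissue", "developmental_stage")


@dataclass
class SampleAnnotation:
    """One structured annotation per RNA-seq sample (illustrative schema)."""
    sample_id: str
    stress_type: Optional[str] = None
    treatment_status: Optional[str] = None
    tissue: Optional[str] = None
    developmental_stage: Optional[str] = None
    # Records which source supplied each field, to support expert validation.
    provenance: Dict[str, str] = field(default_factory=dict)

    @property
    def needs_review(self) -> bool:
        """True if any field could not be filled from any source."""
        return any(getattr(self, name) is None for name in FIELDS)


def merge_sources(
    sample_id: str,
    biosample: Dict[str, str],
    bioproject: Dict[str, str],
    publication: Dict[str, str],
) -> SampleAnnotation:
    """Fill each field from the first source that has a non-empty value.

    The priority order used here is an assumption made for illustration.
    """
    sources = (
        ("biosample", biosample),
        ("bioproject", bioproject),
        ("publication", publication),
    )
    annotation = SampleAnnotation(sample_id=sample_id)
    for name in FIELDS:
        for label, candidates in sources:
            value = (candidates.get(name) or "").strip()
            if value:
                setattr(annotation, name, value)
                annotation.provenance[name] = label
                break
    return annotation


# Usage with made-up values; "SAMPLE_001" is a placeholder identifier.
record = merge_sources(
    "SAMPLE_001",
    biosample={"tissue": "leaf", "treatment_status": ""},
    bioproject={"stress_type": "drought"},
    publication={"treatment_status": "treatment"},
)
print(record)
print("needs expert review:", record.needs_review)
```

Keeping a per-field provenance map and a review flag is one way to route incompletely documented samples to human validators, in line with the combination of automated annotation and expert validation described above.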