pydra: Probing Code Representations With Synthetic Clones and Bugs
Ellie Kitanidis · Cole Hunter
Abstract
We introduce \texttt{pydra}: an open-source dataset of $\sim$9k Python examples, each paired with synthetic clones and buggy variants. Our augmentation pipeline generates both semantics-preserving and bug-injecting code variants via AST transforms and stores rich metadata for analysis. Using \texttt{pydra}, we probe state-of-the-art code embedding models and find a stark limitation in their ability to rank correct variants above incorrect ones. Our analysis suggests that embeddings remain dominated by token overlap and code length rather than true program semantics. We hope that \texttt{pydra} serves the research community by filling several gaps in the Python code dataset ecosystem and by providing a general tool for training and evaluating code embedding models.
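The abstract does not specify the pipeline's implementation, but the two variant types it describes can be illustrated with Python's standard \texttt{ast} module. Below is a minimal, hypothetical sketch (the transform names and the example function are ours, not from the dataset): a semantics-preserving clone produced by renaming a variable, and a buggy variant produced by flipping a comparison operator.

```python
import ast

class RenameVar(ast.NodeTransformer):
    """Semantics-preserving transform: rename one local variable."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_arg(self, node):
        # Rename matching function parameters.
        if node.arg == self.old:
            node.arg = self.new
        return node

    def visit_Name(self, node):
        # Rename matching variable references.
        if node.id == self.old:
            node.id = self.new
        return node

class FlipComparison(ast.NodeTransformer):
    """Bug-injecting transform: swap < and <= in comparisons."""
    SWAP = {ast.Lt: ast.LtE, ast.LtE: ast.Lt}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAP[type(op)]() if type(op) in self.SWAP else op
                    for op in node.ops]
        return node

src = "def clamp(x):\n    if x < 10:\n        return x\n    return 10\n"

# Clone: same behavior, different surface tokens.
clone = ast.unparse(RenameVar("x", "value").visit(ast.parse(src)))
# Buggy variant: nearly identical tokens, different behavior at x == 10.
buggy = ast.unparse(FlipComparison().visit(ast.parse(src)))
```

A clone/bug pair like this is exactly the hard case the abstract highlights: the buggy variant shares almost all tokens with the original, while the clone shares fewer, so an embedding driven by token overlap will rank them the wrong way.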