TY - JOUR
T1 - Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network
AU - Taylor, Duncan Alexander
AU - Humphries, Melissa
PY - 2025/6/25
Y1 - 2025/6/25
N2 - DNA profiles are made up of multiple series (relating to different fluorophores, referred to as ‘dyes’) of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts ‘read’ DNA profiles, using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network (ANN) to classify the fluorescence in DNA profile electrophoretic signal into these categories. However, creating the large amount of labelled training data that the ANN requires is time-consuming and expensive, and limits the ability to train the ANN robustly. If realistic, pre-labelled, and biologically informed training data could be simulated, this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network (GAN), modified from the pix2pix GAN, to achieve this task. We train the GAN on 1078 DNA profiles to simulate DNA profile information, and then use the generator from the GAN as a ‘realism filter’ that applies the noise and artefact elements exhibited in typical electrophoretic signal. The GAN utilises a custom generator architecture, based on a U-Net configuration, but with two ‘U’ paths: one that models across-dye features and one that models within-dye features. Convergence was achieved after 150 epochs. The Fréchet Inception Distance showed that the generator increased the realism of an idealised (noiseless) mock electropherogram: real-to-real profile comparisons yielded a distance of 4.0, real-to-idealised comparisons a distance of 5.3, and real-to-generated comparisons a distance of 4.7. The realism of the generated profiles was confirmed by a DNA profile expert.
Generating realistic DNA profiles makes it possible to simulate an unlimited amount of training data that possesses specific features of interest. This overcomes the limiting expense associated with laboratory-created profiles.
KW - Biologically informed AI
KW - DNA profile simulation
KW - Electropherogram
KW - Generative adversarial network
KW - Pix2pix
UR - http://www.scopus.com/inward/record.url?scp=105002299351&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2025.127536
DO - 10.1016/j.eswa.2025.127536
M3 - Article
AN - SCOPUS:105002299351
SN - 0957-4174
VL - 280
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 127536
ER -