In machine learning, innovation requires data. The reality for many companies, however, is that the data access and environmental controls that are vital for security can also add inefficiencies to the model development and testing life cycle.
To overcome this challenge, Capital One has open-sourced a new project called Synthetic Data.
“With this tool, data sharing can be done safely and quickly, allowing for faster hypothesis testing and iteration of ideas,” said Taylor Turner, lead machine learning engineer and co-developer of Synthetic Data.
Synthetic data is artificially generated data that can be used in place of “real” data. It often shares the same schema and statistical properties as the original data but contains no personally identifiable information. Synthetic data is most useful where complex, nonlinear datasets are needed, as is often the case with deep learning models.
To use Synthetic Data, the model builder supplies the statistical properties the experiment’s dataset should have: the marginal distribution of each input, the correlations between inputs, and an analytical expression that maps inputs to outputs.
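Setting the library’s own API aside, the underlying recipe can be sketched in a few lines of NumPy and SciPy: draw correlated inputs via a Gaussian copula, impose the chosen marginals, then apply the analytical input-to-output mapping. The distributions, correlation value, and mapping below are illustrative assumptions, not anything prescribed by the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_samples = 10_000

# Desired dependence between the two inputs (illustrative value).
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])

# Gaussian copula: sample correlated normals, map them to uniforms,
# then apply the inverse CDF of each desired marginal distribution.
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=corr, size=n_samples)
u = stats.norm.cdf(z)
x1 = stats.expon(scale=2.0).ppf(u[:, 0])    # exponential marginal
x2 = stats.beta(a=2.0, b=5.0).ppf(u[:, 1])  # beta marginal

# Analytical expression mapping inputs to outputs (nonlinear, plus noise).
y = np.sin(x1) + x2**2 + rng.normal(scale=0.1, size=n_samples)
```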
“And then you can experiment to your heart’s content. It’s as simple as possible, yet as artistically flexible as needed to do this type of machine learning,” said Brian Barr, senior machine learning engineer and researcher at Capital One.
According to Barr, early efforts around synthetic data in the 1980s led to capabilities in the popular Python machine learning library scikit-learn. As machine learning has evolved, however, those capabilities are “not as flexible and complete for deep learning where there’s nonlinear relationships between inputs and outputs,” said Barr.
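The scikit-learn capabilities Barr refers to are presumably dataset generators such as make_regression and make_classification, which construct toy datasets whose targets are, by design, largely linear combinations of the informative features:

```python
from sklearn.datasets import make_classification, make_regression

# make_regression builds y as a *linear* combination of the informative
# features, which illustrates the limitation Barr describes for deep
# learning problems with nonlinear input-output relationships.
X, y = make_regression(n_samples=1_000, n_features=10,
                       n_informative=5, noise=0.5, random_state=0)

X_clf, y_clf = make_classification(n_samples=1_000, n_features=10,
                                   n_informative=5, random_state=0)
```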
The Synthetic Data initiative originated within Capital One’s machine learning research program, which is dedicated to exploring and advancing progressive methods, applications, and techniques in machine learning to enhance the simplicity and security of banking. It emerged from the Capital One research paper, “Towards Ground Truth Explainability on Tabular Data,” co-authored by Barr.
“Sharing our research and developing tools for the open-source community are integral aspects of our mission at Capital One. We are excited about the ongoing exploration of the synergies between data profiling and synthetic data, and we are committed to sharing our insights,” Turner stated.
This project integrates seamlessly with Data Profiler, Capital One’s open-source machine learning library for monitoring large datasets and identifying sensitive information that requires secure handling. Data Profiler compiles a dataset’s representative statistics, and Synthetic Data can then generate artificial data based on those empirical metrics.
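Concretely, that pipeline can look like the sketch below. The profiling calls follow Data Profiler’s documented basics; the final generation step is left as a labeled placeholder, since the exact Synthetic Data entry point is not covered here, and the filename is illustrative.

```python
import dataprofiler as dp

# Profile the real dataset; Data Profiler infers the file type and
# computes per-column statistics (types, distributions, null rates, ...).
data = dp.Data("real_dataset.csv")  # illustrative filename
profile = dp.Profiler(data)
report = profile.report(report_options={"output_format": "compact"})

# Hypothetical hand-off: a Synthetic Data generator would consume these
# empirical statistics to produce a lookalike dataset. The actual entry
# point depends on the Synthetic Data library's API.
# synthetic_df = generate_from_profile(report, n_samples=10_000)
```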