How to Reduce Complexity of Big Data? Researchers Have the Answer

Scientific experiments turn their results into numbers, and those numbers often add up to huge datasets. To reduce the size of this data, computer programmers use algorithms that find and extract the features representing its most salient statistical properties. However, many of these algorithms cannot be applied directly to such large volumes of data.

Reza Oftadeh, a doctoral student in the Department of Computer Science and Engineering at Texas A&M University, developed an algorithm that is applicable to large datasets. This machine-learning tool can extract features and directly order them from most salient to least. You can find the published paper here.

The abstract of R. Oftadeh’s Paper

“There are many ad hoc ways to extract these features using machine-learning algorithms, but we now have a fully rigorous theoretical proof that our model can find and extract these prominent features from the data simultaneously, doing so in one pass of the algorithm,” commented Oftadeh.

A subfield of machine learning deals with the problem of identifying a raw dataset’s features to help reduce its dimensionality. Once identified, the features are used to make annotated samples of the data for further analysis. Analyzing massive datasets by hand is a complicated, time-consuming process for programmers, so in recent years artificial neural networks (ANNs) have been used to help.

ANNs are computational models that are designed to simulate how the human brain analyzes and processes information. They are typically made of dozens to millions of artificial neurons, called units. ANNs can be used in various ways, but they are most commonly used to identify the unique features that best represent the data and classify them into different categories based on that information. Oftadeh added:

“There are many ANNs that work very well, and we use them every day on our phones and computers. For example, applications like Alexa, Siri and Google Translate utilize ANNs that are trained to recognize what different speech patterns, accents and voices are saying.”

However, not all features are equally significant, and they can be ranked by importance. Previous approaches use a type of ANN called an autoencoder to extract the features, but these approaches cannot tell exactly where each feature is located or which features are more important.
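
The limitation described above can be illustrated with a small sketch. It assumes a linear autoencoder with tied weights on synthetic data; the data, the variable names, and the 45-degree rotation are illustrative choices, not taken from the paper. The point is that an ordinary reconstruction cost cannot distinguish a latent space whose units are sorted by importance from one in which the same features are mixed together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 samples in 5 dimensions,
# with strongly decreasing variance along each axis.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

# An optimal 2-unit linear autoencoder can be read off the SVD:
# encode with X @ W, decode with Z @ W.T.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:2].T  # shape (5, 2), columns sorted by importance


def reconstruction_error(weights):
    """Mean squared error of encoding and then decoding X."""
    Z = X @ weights
    return float(np.mean((X - Z @ weights.T) ** 2))


# Rotate the latent space by 45 degrees: the two features are now mixed.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W_mixed = W @ Q

err_ordered = reconstruction_error(W)
err_mixed = reconstruction_error(W_mixed)

# Variance carried by each latent unit, before and after mixing.
var_ordered = (X @ W).var(axis=0)
var_mixed = (X @ W_mixed).var(axis=0)

print("reconstruction error:", err_ordered, err_mixed)
print("per-unit variance, ordered:", var_ordered)
print("per-unit variance, mixed:  ", var_mixed)
```

Both weight matrices reconstruct the data equally well, yet only the first one has units that line up with the individual features in order of importance. A network trained on reconstruction error alone has no reason to prefer it.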

To make a more intelligent algorithm, the researchers propose adding a new cost function to the network that provides the exact location of the features, directly ordered by their relative importance. Once this cost function is incorporated, their method processes data more efficiently and can be fed bigger datasets for classic data analysis.
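
To give a feel for how a cost function can enforce an ordering, here is a minimal sketch. The `nested_cost` below is an assumed, simplified form chosen for illustration and should not be read as the loss published by the researchers: it sums the reconstruction error obtained from the first unit alone, then the first two units, and so on. The data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of toy data as an illustration: decreasing variance per axis.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)

# Latent units sorted by importance (principal directions from the SVD).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_ordered = Vt[:2].T

# The same two features, mixed by a 45-degree rotation of the latent space.
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
W_mixed = W_ordered @ Q


def plain_cost(W):
    """Ordinary autoencoder cost: reconstruct X from all latent units."""
    return float(np.mean((X - X @ W @ W.T) ** 2))


def nested_cost(W):
    """Assumed ordering cost: sum of errors using the first k units, k = 1..p."""
    total = 0.0
    for k in range(1, W.shape[1] + 1):
        Wk = W[:, :k]  # keep only the first k latent units
        total += float(np.mean((X - X @ Wk @ Wk.T) ** 2))
    return total


print("plain cost: ", plain_cost(W_ordered), plain_cost(W_mixed))
print("nested cost:", nested_cost(W_ordered), nested_cost(W_mixed))
```

The plain cost is the same for both solutions, while the nested cost is lower for the ordered one, because the first unit alone has to explain as much of the data as possible. A cost of this kind gives training a reason to place the most important feature first.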

To verify the effectiveness of their method, they trained their model in an optical character recognition (OCR) experiment. OCR is the conversion of images of typed or handwritten text, such as those a scanner produces from physical documents, into machine-encoded text. Once the model is trained for OCR using the proposed method, it can tell which features are most important.

Currently, the algorithm can only be applied to one-dimensional data samples, but the team is interested in extending their algorithm’s abilities. The next step of their work is to generalize their method in a way that provides a unified framework to produce other machine-learning methods that can find the underlying structure of a dataset and extract its features by setting a small number of specifications.