Launchora (since 2014)

Technical specifications of data preprocessing and data transformation

Data preprocessing is one of the most complex tasks in machine learning. It covers a wide range of techniques that ultimately lead to model building, and it is challenging because it is not only time consuming but also accounts for a large share of a data scientist's effort. It is therefore important to understand the technical specifications of data preprocessing, and this article is a step in that direction.

An overview of data preprocessing and other stages

Before understanding data preprocessing, it is important to get an overview of the different stages of the analytics lifecycle.

At the first stage, data is sourced from various repositories such as data lakes. For simplicity, let us assume that data comes from two sources. This data is then subjected to the preprocessing stage, which is divided into five levels.

The first level involves merging data from the different sources. The second level involves encoding the categorical variables. At the third level, the data is scaled and normalized. At the fourth level, the data is checked for missing values, which are then imputed. It is at the fifth and final level that the train-test split is performed.
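The five levels above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the two sources, the "id", "age" and "city" columns, and the split ratio are all hypothetical, and imputation is done before scaling so that the column mean is computable.

```python
import random
import statistics

# Hypothetical records from two sources, sharing an "id" column.
source_a = [{"id": 1, "age": 25.0}, {"id": 2, "age": None}, {"id": 3, "age": 40.0}]
source_b = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Pune"}, {"id": 3, "city": "Oslo"}]

# Level 1: merge the two sources on "id".
by_id = {r["id"]: dict(r) for r in source_a}
for r in source_b:
    by_id[r["id"]].update(r)
rows = list(by_id.values())

# Level 2: encode the categorical "city" column as integer labels.
labels = {c: i for i, c in enumerate(sorted({r["city"] for r in rows}))}
for r in rows:
    r["city"] = labels[r["city"]]

# Level 4 (performed before level 3 here, so the mean exists to scale):
# impute missing "age" values with the column mean.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = statistics.mean(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# Level 3: min-max scale "age" into the [0, 1] range.
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age"] = (r["age"] - lo) / (hi - lo)

# Level 5: shuffle and split into train and test sets (2:1 in this toy case).
random.seed(0)
random.shuffle(rows)
train, test = rows[:2], rows[2:]
```

In practice, libraries such as pandas and scikit-learn provide battle-tested versions of each of these steps.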

After data preprocessing has been completed, it is time for exploratory analysis, which involves data visualisation and related techniques. The prime goal of this stage is to convey the extracted information as simply as possible. This is done with the help of graphs, histograms, heat maps, pie charts and tables.
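Underneath any histogram plot is a simple bin-counting computation. The sketch below, with made-up values and an equal-width binning rule, shows the counts a plotting library would draw:

```python
# Compute the bin counts behind a simple equal-width histogram.
values = [2.1, 3.4, 3.9, 5.0, 5.5, 6.7, 8.2, 8.9, 9.5]

def histogram(data, bins=3):
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    return counts

counts = histogram(values)  # one count per bin
```

A library such as matplotlib performs this same counting before rendering the bars.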

We can now move on to model building, which is the most important stage. The first step is selecting an appropriate model, followed by tuning its hyperparameters. Once hyperparameter tuning is done, the model is trained on the labeled or unlabeled data that was preprocessed earlier.

Finally, the model is evaluated to ensure that it is not overfitting. The output of this stage is then passed on, and predictions are made.
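Training and evaluation can be illustrated with the simplest possible model, a straight-line fit by ordinary least squares on made-up data. Comparing the error on the training set with the error on the held-out test set is one basic check for overfitting: a large gap between the two is a warning sign.

```python
# Fit y = a*x + b by ordinary least squares on hypothetical data,
# then compare training error with held-out test error.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0], [10.1, 11.9]

n = len(train_x)
mx = sum(train_x) / n
my = sum(train_y) / n
a = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / \
    sum((x - mx) ** 2 for x in train_x)
b = my - a * mx

def mse(xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_error, test_error = mse(train_x, train_y), mse(test_x, test_y)
```

Here the test error stays close to the training error, so this toy model generalises; a test error far above the training error would suggest overfitting.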

Technical specifications

Data preprocessing can be primarily divided into three stages: the first is data cleansing, the second is data integration and the third is data transformation.

While performing data cleansing, we may encounter the problem of missing values. In this case, missing values are either deleted or replaced with the mean or the median.
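Replacement with the mean or median can be sketched in a few lines; the column values and the `None` placeholder for missing entries are illustrative.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in values]

ages = [22, None, 30, 26, None]
filled = impute(ages)  # missing entries become the mean of 22, 30 and 26
```

The median variant is generally preferred when the column contains extreme values, since the median is robust to outliers while the mean is not.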

While performing data transformation, we first divide the data into categorical and numerical types. The categorical type is subdivided into ordinal and nominal values; similarly, the numerical type is divided into discrete and continuous values.

Ordinal and nominal data are handled with the help of two important methods. The first method is encoding; the second is replacement with alternate values derived from an average. Encoding can be further classified into label encoding and dummy (one-hot) encoding.
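The two encoding styles can be shown side by side in plain Python; the "sizes" column is a made-up example, and the alphabetical category order used here is a simplification (for genuinely ordinal data, an explicit order such as small < medium < large should be supplied instead).

```python
def label_encode(values):
    """Map each distinct category to a single integer code."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

def dummy_encode(values):
    """One-hot encode: one 0/1 indicator column per category (suits nominal data)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

sizes = ["small", "large", "medium", "small"]
```

Dummy encoding avoids the spurious ordering that integer labels impose on nominal categories, at the cost of one extra column per category.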

Approaches of data transformation

Data transformation converts data from its present format to an alternate format. Broadly speaking, there are two main types of data transformation: the constructive approach and the destructive approach. The constructive approach involves replicating data sets and creating multiple copies of the same data. The destructive approach involves deleting records or incorrect values so that consistency in the data is maintained.
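Both approaches reduce to simple operations on a record set. In this toy sketch, the "value must be non-negative" validity rule and the record layout are assumptions for illustration:

```python
# Constructive approach: replicate the data set before any risky modification.
records = [{"id": 1, "value": 10}, {"id": 2, "value": -5}, {"id": 3, "value": 7}]
backup = [dict(r) for r in records]  # an independent copy of every record

# Destructive approach: drop records that violate a validity rule
# (here, a hypothetical rule that "value" must be non-negative).
cleaned = [r for r in records if r["value"] >= 0]
```

Keeping the constructive copy before the destructive pass means the deleted records can always be recovered if the validity rule turns out to be wrong.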

Mathematical approaches to data transformation

There are five major mathematical approaches that are followed during data transformation. The first is the logarithmic transformation and the second is the exponential transformation. The third transforms the data using the square root method, the fourth is the reciprocal approach, and the final one is the Box-Cox transformation.
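All five can be gathered into one small helper. This sketch assumes positive input values (the logarithmic, square root, reciprocal and Box-Cox transforms are only defined, or only well behaved, for positive data), and the default Box-Cox parameter of 0.5 is an arbitrary illustration; in practice the parameter is estimated from the data.

```python
import math

def transform(x, method, lam=0.5):
    """Apply one of five common transformations to a positive value x."""
    if method == "log":
        return math.log(x)
    if method == "exp":
        return math.exp(x)
    if method == "sqrt":
        return math.sqrt(x)
    if method == "reciprocal":
        return 1.0 / x
    if method == "boxcox":
        # Box-Cox: (x**lam - 1) / lam for lam != 0, and log(x) when lam == 0.
        return math.log(x) if lam == 0 else (x ** lam - 1) / lam
    raise ValueError(f"unknown method: {method}")
```

Note that Box-Cox with lam = 0 reduces exactly to the logarithmic transformation, which is why the two are often discussed together.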

Concluding remarks

Apart from the approaches mentioned above, there are other methods of processing and transforming data, such as outlier treatment and the Z-score approach. All of these approaches and methodologies aim to maintain overall data integrity and consistency.
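As a closing illustration, the Z-score approach to outlier treatment flags any value that lies more than a chosen number of standard deviations from the mean. The data and the threshold of 2 below are illustrative; a threshold of 3 is also commonly used.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values whose Z-score, i.e. distance from the mean
    measured in (population) standard deviations, exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

data = [10, 12, 11, 9, 10, 11, 50]
outliers = zscore_outliers(data)  # only the extreme value is flagged
```

Once flagged, such values can be removed, capped, or imputed, just like the missing values discussed earlier.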