Data exploration and preparation
Report on Whether Taiwanese Customers will Default on their Loan Repayments
Table of Contents
1A. Initial Data Exploration 2
Attribute type 2
2. Summarizing Properties 3
Measures of spread (Variance and standard deviation) 6
Percentiles and Inter-quartile range 7
1B Preprocessing 8
a. Binning 8
Equi-width binning 8
Equi-depth binning 9
b. Normalization 9
Min-max normalization 9
Z-score normalization 10
c. Discretization 10
d. Education Binarization 11
This report presents data analytical results that seek to help in determining whether or not, the Taiwanese customers would repay their loans. The determination of whether they will pay or not depends on the reliability of the data in making such a decision. The reliability measures is based on the distribution characteristics of the attributes of data (Shevlyakov and Vilʹchevskiĭ, 2002). The conclusion on the matter is based on the analytical results of the current records. The analytical results of the given data is as follows.
Note: The year of reference in this report is 2005
1A. Initial Data Exploration
For this dataset, the data attributes are identified as either nominal, ordinal, interval or ratio. The loan information contained in dataset included the customer id, loan limit balance, gender, education, marital status, age, past monthly payment records, bill statement, and defaulting information.
According to Stephens (2009), nominal data/scale can simply be described as a name or better put, label. In this sample data, the nominal data includes the customer Id, marital status (dichotomous), sex (dichotomous) and default status (dichotomous).
An ordinal data set/scale is one which has hierarchical dimensions. I.e, order matters. In this particular case, the data that order matters include education, repayment delay status (Stephens , 2009).
Interval scales can be described as the numeric scales in which not only the order, but also the exact differences between the values. The given classic example would be temperature. There is no such thing as ‘no temperature’ and 40 and 50 degrees celcius is the difference between 90 and 100 degrees Celcius is the same 10 degrees. Interval scales can give an easier realm of statistical analysis on the data sets. In this particular case, data attribute with such characteristics is the age, limit balance, monthly amount paid, monthly bill statement.
Stephens (2009) describes ratio scales as data which can have an absolute zero-which allows for a wide range of descriptive and inferential statistics. Ratio scales are those which can be easily have their ratios computed (subtraction, multiplication, addition) and have measures of central tendency easily determined eg the mean, median and measures of dispersion such as the standard deviation, coefficient of variation. In this data set, the first data set that meets this description is the age, limit balance, monthly bills and the monthly payment amount. Simply put, in this particular case, all the interval data attributes up there are ratio data attributes as well. In fact, when it comes to the monthly computations, they are mainly ratio data.
2. Summarizing Properties
Except for the case of high influence of the outliers, the mean is used to give a generalized representation of the data (Bowman and Robinson, 2007). In this case, the mean would be used to represent the average amount given to the customers, the average age of the customers, the average payments for each month (so as to compare the month statistics) and the average monthly repayments. For this dataset, the mean amount given out to the customers is 167065 NT Dollar, with an average borrowing age of 35-36 (35.7 years average) years. The average bill amounts for the months from September backwards to April are approximately:
NT Dollars respectively. From this it can be seen that towards September from April, the average Bill amounts were increasing with highest average in September.
The average amounts paid by the customers for the months of September backwards to April are approximately
NT Dollars respectively. On average, the month of August recorded the highest amount of repayment by the customers.
The median is much more like the middle of the data set about which values lie above and below and may give a more robust representation of the customer behavior compared to the mean computed above. It can also be said to be a more representative of the population than the mean (Bowman and Robinson, 2007). The median amount given out to customers
is 140000 NT Dollars
The median age of customers is 34 years. The median bills for September backwards to April are
NT Dollars respectively. September records the highest representative bills of the months given.
The median monthly repayments from September backwards to April are
NT Dollars respectively. The median shows September as the month with the highest repayments and with the repayments increasing from April to September. In this case, The median may be viewed as more representative of the customers as opposed to the mean which is highly affected by the outliers (Bowman and Robinson, 2007). The median for defaulting status is 0 indicating no defaulting.
The range is used to determine the difference between the smallest and the highest recording of the data. The difference between the highest and the smallest amount borrowed is 990000 NT Dollars. The difference between the youngest and the oldest borrowers is 52 years.
The range for the Bills from September backwards to April is
NT Dollars respectively. August has the highest difference between the smallest and the largest Bill and July has the smallest difference.
The repayment range is
NT Dollars for September to backwards to April respectively. July has the smallest difference between the highest and lowest repayment.
This measure is used to determine the mostly occurring entry.. It is very helpful in determining which group of a given attribute has the highest score/occurrence. In this case, 50000 NT Dollars is the amount that was most given out to the customers. There were more females borrowing compared to males. Most of the customers are University students. In terms of marriage, most of the customers are single. The largest borrowing age is 27 years. The payment status show that people mostly repaid earlier than due date for all the months. Both bills and the repayment amounts do not show one particular entry that mostly occurred. Defaulting for next month shows that most of the customers did not default payment for the next month the default status of highest frequency is 0 meaning more of the customers did not default on payment.
With respect to this case, the most appropriate data attribute distribution analysis is the age. In this particular case, the age distribution analysis was as shown below
In this case, the most common customers lie in the age between 30 and 39 years of age
Measures of spread (Variance and standard deviation)
For this data set, variance and standard deviation is very large for the amount given to customers, age. Amount variance and standard deviation are 16802336943 and 129623.8286 respectively. The variance and standard deviation for age is 84.94041 and 9.216312. A large variance and standard deviation indicate that the data is highly spread about the mean. This implies that the data is unreliable in decision making.
Percentiles and Inter-quartile range
The inter-quartile range is computed from the first and second quartile. It is used to measure spread. In this particular data, the difference between the third quarter of age and the first quarter is 14 years. This is not quite a spread data and can therefore be relied on. The range for amount given to customers is 180000 NT Dollars. This shows that there is a reasonably smaller difference between the highest quarter and the lowest quarter in the amount given to customers. This means that based on this as a measure of spread, the data is reliable. The quartile range for the months in the repayments show that the difference between the first and third quartile is small. This shows that data information is reliable for the bills and the repayments by the customers.
Clusters and outliers
Using Knime, the data was determined to be linear and with very without outliers. There is only one outlier in the age which is beyond 70 years from the mean of 35 years. The amount borrowed and the payments are also mainly linear with very few outliers. This data, can be described to be homogeneous.
Han, Kamber and Pei (2012) describe equi-width binning refers to categorizing the data into groups. For example having age in numerical values and we wish to categorise it into certain age brackets the equi-width binning involves age brackets of equal interval for example, each age bracket would be of interval 10 years. In this case, the procedure followed was as below.
First, the desired number of bins was determined to be 5 under the given age range. This would help categorise the customers into young-senior citizens. The interval was then determined by dividing the range (max-min) by the bin number. In this, the range was determined as below
=(MAX(A2:A2001)-MIN(A2:A2001) + 0.0000000001)/5
The 0.000000001 was used to avoid values equal to the maximum being put into their own bin.
For actual binning, the formula below was used to determine the number of bins greater than the interval.
From this categorization, the age equi-depth binning result is as shown below
Equi-depth binning is also known as equal-frequency binning. It is carried by classifying the data into classes of equal frequency. According to Han et al (2012), equi-depth binning divides data into N groups with each group having approximately same number of values. Using excel, the desired frequency for the age was determined to be 10 and then the equi-depth binning carried out using the function below.
The min-max normalization involves dividing the difference between a particular value and the minimum value of the dataset by the range of the dataset. A min-max score is an index of how much the value is far from the minimum value of the dataset. The min-max normalization may not give a true smoothing of the given data since the maximum value and the minimum values may be outliers (Brackett, 2012). A plot of the min-max is as shown below
Figure1: The min-max distribution. The score is always positive
The Z-score was carried out by dividing the difference between the value and the mean by the standard deviation. The Z-normaization can be a better smoothing criterion compared to min-max normalization (Brackett, 2012). The Z-score helps us determine the probability of a score occurring within within our normal distribution. The scores closer to the mean have the highest probability of occuring. Below is a plot of the Z-normalization.
In order to carry out data discretization, Excel If function was applied to filter and assign the discretization values that describe the age bracket into which a particular age falls. To carry out the process, age was determined to be of the groups in the teenage, young, mid-age, mature and old age ranges <20, 21-30, 31-45, 46-65 and >66 respectively. This classification was in accordance with Grasso and Burkins’ (2010) categorisation criterion. The excel if function entered was as shown below
Where A2 is the discretization value of the age and varies along the column. The discretization can then be used to determine the most common age bracket. In this case the most frequent of customers is the mid-age. The discretization results are as shown in the excel file.
d. Education Binarization
Binarization means encoding data into assuming one of two possible states. One is either this or that (Aggarwal and Zhai, 2012). In this case, education level of the customers is graduate school, university, high school or others. ‘Others’ in this case would mean level lower than high school or higher but not university. This could be college school or technical education. In this particular situation of binarization, the customers only belong to one of two groups formed depending on the given information. Aggarwal and Zhai (2012) Classifies all university education into one category. Based on this classification, university and graduate school are classified into one category. The others including the high school will belong to one category. The most appropriate categories in this case would be university education (code=1) and no university education (code=0). As such, the codes 1 and 2 denoting graduate school and university were coded as 1 and the high school and others codes 3 and 4 were coded 0. With this in mind, the codes were entered in excel and the mapping computation done as shown in the excel equations and encoding process below.
Under the respective columns of the excel worksheet for this particular job, the job was accomplished under following equation:
A snapshot of how the table looks like is as shown below.
Figure 1: An illustration of the excel worksheet.
From the binarized information, one can know how the university or otherwise of the customers’ education impacts on repayment. Good binarization is especially good since it makes decision making an easy process because it only involves studying reduced number of variables and thus decision making process is made easier. By matching the binarized information with a factor of consideration, a wiser decision can be easily reached through easy pattern presentability.
This report of data mining contains the basic information about the given data set. A data attribute is either nominal, ordinal, interval or ratio. Ratio and interval attributes are almost the same. The summarising properties of the data such as the mean, range, percentiles, frequency and median are used to show quickly the characteristics of customers. From the summarizing properties, it can be seen that on average, most of the customers are young and have most have gone through university. There are more females than males who take loans. From the same data attributes the month of September has highest bill amounts and also shows month of highest repayments. From Knime analysis of the data shows that age is almost evenly distributed. There are very minimal clustered distributions and more of linear. Thus it can be reliably said that the median age is a more representative compared to the mean.
The pre-processing of the information was of most relevance to age to understand how the age was associated with borrowing. From equi-width binning, it can be concluded that most of the customers (over 40%) are of age between 30-39. Thus from discretization, most of the customers are young. The normalization and smoothing of the data shows no skewness of data in either of the normalization processes (min-max and Z-normalization). It shows a normal and unskewed data thus confirming the linearity of the data from KNIME results. From the binarized results of education, it can be concluded that most of the customers have university education and thus should thus should be the target for marketers.
The data characteristics indicate that this data can be relied on in making the decision about the defaulting nature of the Taiwanese customers. With this in mind, it can be concluded that the defaulting analysis results for the data is reliable. From, the defaulting mean, median and frequency analysis results, it can be concluded that the Taiwanese customers do not default the payment of the results i.e the analysis results show 0 for the more reliable median analysis and the mean also tends to 0 more than it tends to 1 implying that the customers do not default on the average.
. New York: Springer.Mining text dataAggarwal, C. and Zhai, C. (2012).
. Bristol: Hilger.Introduction to statisticsBowman, A. and Robinson, D. (1987).
. Bradley Beach, N.J.: Technics Pub.Data resource designBrackett, M. (2012).
. Amsterdam: Elsevier/Morgan Kaufmann.Data miningHan, J., Kamber, M. and Pei, J. (2012).
. Utrecht: VSP.Robustness in data analysisShevlyakov, G. and Vilʹchevskiĭ, N. (2002).
. New York: McGraw-Hill.Schaum’s outline of theory and problems of beginning statisticsStephens, L. (1998).