
Practical Data Mining using C language

Institution of Affiliation

By definition, data mining is the practice of examining large databases with the aim of generating new information. Put differently, it is a technique for handling big data by breaking it down into smaller sections and applying a set of algorithms to extract meaningful information that can be used in the analysis of a system. Through data mining, relationships that were previously unidentified can be discovered and analyzed. The analysis is carried out from different perspectives and views in order to generate a structured, comprehensive report on the data set. Hence, the initial stage of data mining is to prepare data sets that can be easily understood and processed to extract relevant information and relations. From the computation of discovered relationships and patterns, diverse methods can be applied to the data sets, and these find practical use in a number of fields, including machine learning, artificial intelligence, database systems and models, and statistics.

The core source of data in the process of data mining is a database containing a mixture of data of different types. The data therein is what is extracted, and models are applied to it so that the types can be isolated and analyzed individually.

There are diverse data mining techniques that are in existence today. They include:

  • Classification

  • Clustering

  • Regression

  • Anomaly detection

  • Association rules

  • Reinforcement learning

  • Structured prediction

  • Feature engineering

  • Summarization

Any of the above techniques can be used individually, or in combination with one or more of the others, to analyze a data set. Data mining itself is universally identified as the fourth step of the Knowledge Discovery in Databases (KDD) process. For a successful data mining process, a decision tree is first generated from the data set whose information is to be extracted; hence it is imperative to construct the decision tree first.

The decision tree is created using well-defined algorithms. The common algorithms used include the ID3 algorithm and the C4.5 algorithm. Since the requirement for the assignment is to use the ID3 algorithm, the following is a brief description of it. Ideally, it is a straightforward decision-tree learning algorithm that applies to data sets whose attributes are well defined and belong to clearly distinguished classes. The algorithm analyses the input data set iteratively, starting from the root node, from which it builds the decision tree. At every node, the attribute that best classifies the data is chosen.
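To make the attribute-selection step concrete, the following is a minimal sketch of how ID3's entropy and information-gain computation could look in C. It is not part of the assignment code; the entropy helper and the example counts are purely illustrative:

    #include <math.h>
    #include <stdio.h>

    // Shannon entropy of a two-class split (e.g. edible vs. poisonous).
    static double entropy(int pos, int neg){
        double total = (double)(pos + neg);
        double h = 0.0;
        if(pos > 0) h -= (pos / total) * log2(pos / total);
        if(neg > 0) h -= (neg / total) * log2(neg / total);
        return h;
    }

    int main(void){
        // Hypothetical counts: 10 examples (6 positive, 4 negative) split
        // by an attribute into two subsets with counts (4,1) and (2,3).
        double parent = entropy(6, 4);
        double children = 0.5 * entropy(4, 1) + 0.5 * entropy(2, 3);
        printf("Information gain = %f\n", parent - children);
        // ID3 expands the node on the attribute with the highest gain.
        return 0;
    }

ID3 repeats this computation for every candidate attribute at every node and branches on the winner (compile with -lm for the math library).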

The code used to do the data mining in this assignment has been written in the C language and prints its output to the console; it prepares the data set for the ID3 algorithm. Here is a description of its functionality:

The program includes the standard I/O library and the string library, which provide the pre-written routines it needs to read string data from the file. It then sets the string buffer size, expanding the buffer to 1024 × 1024 bytes (1 MiB) so that it can accommodate the longest line available. The fields, which also act as data headers, correspond to the columns of the data table: the leading class column plus 22 attribute columns, 23 fields in total. This is established in the initial stage of the program under the program constants, spanning lines 17 to 39 of the code, as commented appropriately.
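For reference, the relevant constants from the appendix code are shown below (note that, counting the class column, each line actually carries 23 comma-separated fields, so NUM_FIELDS is set to 23):

    #define BUFFER_SIZE (1024 * 1024)   // 1 MiB line buffer
    #define NUM_FIELDS 23               // class column + 22 attribute columns
    #define MAXERRS 5                   // tolerate at most this many bad lines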

[Figure: Practical data mining project]

The next step is to read the CSV file as the input. Here, a pointer to the file is used to read the input. The main advantage of, and reason for, using a file pointer is that the data is extracted without the program changing the file: it is opened in read mode, so the raw data remains intact and the file is not modified in any way. The file is thus read by the loadFile method, which takes the CSV file pointer as a parameter and returns a value of type long, since the file could be extremely large. This is as shown below:

[Figure: Practical data mining project 1]

The second parameter of the loadFile method receives the count of errors found in the file referenced by the pointer *pFile.

The definition of the loadFile method is given from code line 45 all the way through line 77.
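Its signature, as declared in the appendix code, is:

    // Reads the open CSV file line by line; returns the number of lines
    // read and accumulates parse failures through *errcount.
    long loadFile(FILE *pFile, long *errcount);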

This step entails loading the values of the file into the program for analysis, which is done by the loadValues method; it returns an integer and carries the static modifier.

[Figure: Practical data mining project 2]

On success, the method returns the predefined constant RET_OK.
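Correspondingly, its declaration and return convention in the appendix code are:

    // Splits one CSV line in place; on return the pFields[] pointers
    // reference the null-terminated fields. Returns RET_OK (0) on
    // success and RET_FAIL (1) on a malformed line.
    static int loadValues(char *line, long lineno);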

The two methods are called and coordinated by the main method, which runs from line 121 to line 146.

The analysis process is described step by step below:

In step 1, three library files are included: stdio.h, which allows the program to use the input and output functions; stdlib.h, which holds the standard library tools the program relies on; and string.h. The methods for file processing, including fopen() and fclose(), are declared in stdio.h. The string.h header provides the pre-written text-processing routines; this allows the program to handle long texts as character strings, since a string type is not built into the C language itself. The next section of code is the definition of the fields as constants, since they are not expected to change while the program is running. Because they are used repeatedly, they need fixed names and values, so they are defined as constants, each assigned a value in the form of an index. Since there are 23 fields in total (the class column plus 22 attributes), they are all indexed, and the fields themselves are stored in an array of character pointers (char *pFields[]).

The next step is the call to loadFile, with two distinct parameters passed as arguments. The pointer to the file, named *pFile, causes the program to read the source file containing the data that needs to be analysed, while *errcount points to a counter of the errors identified in the file. The method returns a result of type long, primarily because the line count could be extremely large and the return type needs to be big enough to hold it. Since the initial name of the provided CSV file was mushroom.csv, it was prudent to rename it to pFile.csv for consistency with the camel-case standard used for naming the variables; supplying a wrong file name on the command line would result in an error, since the program would get a null file pointer and deem the source file non-existent. The method is defined at a later stage, where the processing features pertaining to the flow and control of the data are declared.
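The corresponding argument handling, excerpted from main in the appendix (the program expects the CSV path and the delimiter on the command line, e.g. the file name followed by a comma), is:

    if(argc != 3){   // program name, csv path, delimiter
        printf("Usage: %s csvfilepath delimiter\n", basename(argv[0]));
        return RET_FAIL;
    }
    delim = argv[2][0];   // e.g. ',' for the mushroom CSV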

In the definition of the loadFile method, the memory buffer that temporarily holds the file data and the line counter lineno are declared locally, and memory is allocated to them. The buffer is sized by BUFFER_SIZE, expanding the default buffer so that the program can handle a large volume of data. A conditional statement validates the file pointer passed in; if the file was not found or could not be opened along the given path, RET_FAIL is returned to indicate a failed read attempt. Otherwise the file is read and the lines are processed iteratively through the entire file, going line after line, which is achieved by the loop statement within the method. The results of the line-by-line analysis are buffered and, after the read operation, printed to the terminal window.
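The heart of the method is the read loop, slightly condensed here from the appendix:

    while(!feof(pFile)){
        if(fgets(sInputBuf, BUFFER_SIZE-1, pFile) == NULL)
            break;                   // end of file or read error
        if(++lineno == 1)
            continue;                // skip the header line
        if(strlen(sInputBuf) == 0)
            continue;                // jump over empty lines
        if(loadValues(sInputBuf, lineno) == RET_FAIL)
            (*errcount)++;           // count the error, keep reading
    }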

The loadValues method then performs the line-by-line processing. It checks for special characters in the data set and, using a counter, records their instances. One of the characters processed is the double-quote character, which is compared against the character variable ch. Its occurrences are analysed in each line, and the buffer is updated until the scan reaches the end of the line.
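The quote handling inside the scanning loop, taken from the appendix, looks like this:

    if(ch == '"'){                       // entering or leaving a quoted field
        if(!inquote)
            pFields[fld] = cptr + 1;     // field starts after the opening quote
        else
            *cptr = '\0';                // terminate the field at the closing quote
        inquote = !inquote;
    }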

Finally, the main method is invoked to integrate the methods described above. The code is straightforward and has been annotated with the necessary comments for clarification. It is displayed in the appendix section below and can be followed through with ease.

Appendix

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>    // errno, reported when fopen() fails
#include <libgen.h>   // basename(), used in the usage message

// adjust BUFFER_SIZE to suit longest line
#define BUFFER_SIZE (1024 * 1024)
#define NUM_FIELDS 23   // class column + 22 attribute columns
#define MAXERRS 5
#define RET_OK 0
#define RET_FAIL 1
#define FALSE 0
#define TRUE 1

// char* array will point to fields
char *pFields[NUM_FIELDS];

// field offsets into pFields array (hyphens in the CSV headers are not
// valid in C identifiers, so they are mapped to underscores):
#define CLASS 0
#define CAP_SHAPE 1
#define CAP_SURFACE 2
#define CAP_COLOR 3
#define BRUISES 4
#define ODOR 5
#define GILL_ATTACHMENT 6
#define GILL_SPACING 7
#define GILL_SIZE 8
#define GILL_COLOR 9
#define STALK_SHAPE 10
#define STALK_ROOT 11
#define STALK_SURFACE_ABOVE_RING 12
#define STALK_SURFACE_BELOW_RING 13
#define STALK_COLOR_ABOVE_RING 14
#define STALK_COLOR_BELOW_RING 15
#define VEIL_TYPE 16
#define VEIL_COLOR 17
#define RING_NUMBER 18
#define RING_TYPE 19
#define SPORE_PRINT_COLOR 20
#define POPULATION 21
#define HABITAT 22

long loadFile(FILE *pFile, long *errcount);
static int loadValues(char *line, long lineno);

static char delim;

long loadFile(FILE *pFile, long *errcount){
    char sInputBuf[BUFFER_SIZE];
    long lineno = 0L;

    if(pFile == NULL)
        return RET_FAIL;

    while(!feof(pFile)){
        // load line into static buffer
        if(fgets(sInputBuf, BUFFER_SIZE-1, pFile) == NULL)
            break;
        // skip first line (headers)
        if(++lineno == 1)
            continue;
        // jump over empty lines
        if(strlen(sInputBuf) == 0)
            continue;
        // set pFields array pointers to null-terminated string fields in sInputBuf
        if(loadValues(sInputBuf, lineno) == RET_FAIL){
            (*errcount)++;
            if(*errcount > MAXERRS)
                break;
        } else {
            // On return the pFields pointers reference the loaded fields,
            // ready for load into a DB or whatever; fields can be
            // accessed via pFields, e.g.:
            printf("Class=%s, cap-shape=%s, cap-surface=%s\n",
                   pFields[CLASS], pFields[CAP_SHAPE], pFields[CAP_SURFACE]);
        }
    }
    return lineno;
}

static int loadValues(char *line, long lineno){
    if(line == NULL)
        return RET_FAIL;

    // chop a trailing CR and/or LF off the line
    // (e.g. a Windows file read in a Unix environment)
    if(*(line + strlen(line)-1) == '\r' || *(line + strlen(line)-1) == '\n')
        *(line + strlen(line)-1) = '\0';
    if(*(line + strlen(line)-1) == '\r' || *(line + strlen(line)-1) == '\n')
        *(line + strlen(line)-1) = '\0';

    char *cptr = line;
    char ch;
    int fld = 0;
    int inquote = FALSE;

    pFields[fld] = cptr;
    while((ch = *cptr) != '\0' && fld < NUM_FIELDS){
        if(ch == '"'){
            if(!inquote)
                pFields[fld] = cptr + 1;   // field starts after opening quote
            else
                *cptr = '\0';              // zero out closing quote
            inquote = !inquote;
        } else if(ch == delim && !inquote){
            *cptr = '\0';                  // end of field, null terminate it
            pFields[++fld] = cptr + 1;
        }
        cptr++;
    }
    if(fld > NUM_FIELDS-1){
        fprintf(stderr, "Expected field count (%d) exceeded on line %ld\n", NUM_FIELDS, lineno);
        return RET_FAIL;
    } else if(fld < NUM_FIELDS-1){
        fprintf(stderr, "Expected field count (%d) not reached on line %ld\n", NUM_FIELDS, lineno);
        return RET_FAIL;
    }
    return RET_OK;
}

int main(int argc, char **argv){
    FILE *fp;
    long errcount = 0L;
    long lines = 0L;

    if(argc != 3){
        // argv[1] holds the path to the csv file, argv[2] the delimiter
        printf("Usage: %s csvfilepath delimiter\n", basename(argv[0]));
        return RET_FAIL;
    }
    if((delim = argv[2][0]) == '\0'){
        fprintf(stderr, "delimiter must be specified\n");
        return RET_FAIL;
    }
    fp = fopen(argv[1], "r");   // open the csv file in read mode
    if(fp == NULL){
        fprintf(stderr, "Error opening file: %d\n", errno);
        return RET_FAIL;
    }
    lines = loadFile(fp, &errcount);
    fclose(fp);

    printf("Processed %ld lines, encountered %ld error(s)\n", lines, errcount);

    if(errcount > 0)
        return RET_FAIL;
    return RET_OK;
}