We solve the "bottleneck" problems for people in the data analysis arena.

 



Benchmark Study for the University of California Irvine (UCI) Machine Learning Repository:

Prediction of persons with more than $50K income based on U.S. Census Bureau data.

Many researchers who analyze real data sets, for example in medicine or security, need algorithms and knowledge-extraction tools with an intuitive, understandable interface. Many research algorithms and methods provide good classification and prediction but fail to convey the results to people with little experience in data mining techniques. The same applies to prediction problems and to validating the outcome of any data mining endeavor. Many tools have been developed for this purpose, e.g. regression, discriminant analysis, decision trees, Bayes classifiers etc., but they all require a definite and complete set of true answers to understand the results. We offer data classification and prediction tools based on the feature-extraction neural network developed by Kohonen, the Self-Organizing Feature Map (SOM).

See the potential of this approach and what SOM can extract for you from a minimal available data set. Is it possible to learn about income without knowing the income, using only census data? The answer is YES.

Let us show you the possibilities of SOM in the following example.

One of the main tasks in business planning is target group selection based on commonly available information: for example, using U.S. Census Bureau data (age, class of worker, education, marital status, occupation code, race, sex, capital gains, capital losses, working hours per week and native country) to select persons with an "adjusted gross income" above $50K. We used data sets from the Current Population Survey (CPS) database provided by the U.S. Census Bureau and posted on the University of California Irvine (UCI) Repository (see the Acknowledgement in the Notes) to predict whether a person's income is over or under $50K. The data is publicly available and free of charge.

The first data set (the "adult" data set) was extracted from 1994 CPS data. Its 48,842 instances were divided into two files: a training file and a testing file. Fourteen attributes were chosen, eight categorical and six continuous (see "Data Selected" in the Notes): age, work class, weight, education, years of education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week and native country. The six continuous attributes were discretized into quintiles (see Notes) before running the algorithm.

The second data set (the "Census Income Database") with 199,523 instances was extracted from 1994 and 1995 CPS data and contains 41 demographic and economic variables. These include most, but not all, of the 14 attributes in the adult data set. Another difference between the two data sets is the decision variable provided for the classification problem: for the adult data set the decision was drawn from the "adjusted gross income", whereas the Census Income Database uses the "total personal income".
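The quintile step applied to the continuous attributes can be illustrated with a minimal sketch. It assumes a simple nearest-rank choice of cut points; the function names and the sample ages are hypothetical, and the study does not say how its own cut points were computed.

```python
from bisect import bisect_right

def quintile_edges(values):
    """Return four cut points that split the sorted values into five roughly equal groups."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the 20%, 40%, 60% and 80% positions (nearest-rank style, an assumption).
    return [ordered[(n * k) // 5] for k in range(1, 5)]

def to_quintile(value, edges):
    """Map a value to a bin index 0..4 given the four cut points."""
    return bisect_right(edges, value)

# Made-up sample of ages, not taken from the census data.
ages = [19, 23, 25, 31, 34, 38, 42, 47, 55, 63]
edges = quintile_edges(ages)
bins = [to_quintile(a, edges) for a in ages]
print(edges)  # [25, 34, 42, 55]
print(bins)   # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

Each continuous attribute then takes one of five ordered values, which puts attributes with very different ranges (such as age and capital gain) on a comparable footing.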

The SOM tool organizes the set of records into classes that are ordinated (see Notes) on a plane in such a manner that persons with similar records are mapped into neighboring classes on the ordination plane. Consider the distribution of age values on the ordination plane. The color gradient spans from green (young) to red (declining years):

[Figure: age distribution on the ordination plane. Legend: young, middle-aged, declining years]
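How records end up in neighboring classes on a plane can be sketched with a very small SOM. This is a generic textbook-style Kohonen update, not the implementation used in the study; the grid size, the linearly decaying learning rate and radius, and the demo data are all assumptions made for illustration.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid node whose weight vector is closest to record x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))

def train_som(records, rows, cols, epochs=30, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(records[0])
    radius0 = max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(records)
    step = 0
    for _ in range(epochs):
        order = records[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                      # learning rate decays to zero
            radius = max(radius0 * (1.0 - frac), 0.5)    # neighborhood shrinks over time
            br, bc = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    # Pull the winner and its grid neighbors toward the record.
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

# Made-up demo data: two loose groups of 2-D records scaled to [0, 1].
demo_rng = random.Random(1)
group_a = [[demo_rng.uniform(0.0, 0.2), demo_rng.uniform(0.0, 0.2)] for _ in range(10)]
group_b = [[demo_rng.uniform(0.8, 1.0), demo_rng.uniform(0.8, 1.0)] for _ in range(10)]
som = train_som(group_a + group_b, rows=3, cols=3)
node_a = best_matching_unit(som, [0.1, 0.1])
node_b = best_matching_unit(som, [0.9, 0.9])
print(node_a, node_b)
```

Coloring each node by the average value of one attribute (age, for instance) over the records mapped to it gives a picture of the kind described above.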

Another set of pictures shows the distribution of the different classes of worker. The color gradient ranges from green (no persons in this class of worker) to red (only persons in this class of worker):

[Figure: three maps, one each for local government workers, state government workers and federal government workers. Legend: low proportion, considerable proportion, high proportion]

One can see some regularity in the patterns: the mapping changes gradually when moving from local to federal government workers. The education level shows a clearer ordination. The color gradient ranges from green (1 year spent on education) to red (16 years spent on education):

[Figure: education level on the ordination plane. Legend: not educated, secondary education, doctorate]

This picture shows that the level of education forms a solid gradient on the map, from the upper left-hand corner (uneducated persons) to the lower right-hand corner (highly educated persons). Let us now look at the connection between the features selected (positions on the ordination plane) and income. The resulting mapping (see Notes) was calibrated: each node was assigned the probability that a person attributed to this node has an income above $50K.

The color legend is dark green where the probability lies below 5%, green where the probability lies between 5% and 75%, light red where the probability lies between 75% and 90%, and red where the probability lies above 90%.

[Figure: probability of income above $50K for each node of the map. Legend: < 5%, 5% - 75%, 75% - 90%, > 90% of High Income]
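The calibration step can be sketched as follows, assuming it amounts to the fraction of high-income records among the records mapped to each node. The study does not spell out its exact procedure; the function names and the sample assignment are hypothetical. The color thresholds follow the legend in the text.

```python
from collections import defaultdict

def calibrate_nodes(node_of_record, high_income_flags):
    """Return {node: fraction of its records with income above $50K}."""
    counts = defaultdict(lambda: [0, 0])  # node -> [high-income count, total count]
    for node, is_high in zip(node_of_record, high_income_flags):
        counts[node][0] += 1 if is_high else 0
        counts[node][1] += 1
    return {node: high / total for node, (high, total) in counts.items()}

def legend_color(p):
    """Map a probability to the color classes named in the legend."""
    if p < 0.05:
        return "dark green"
    if p < 0.75:
        return "green"
    if p <= 0.90:
        return "light red"
    return "red"

# Made-up assignment of eight records to three nodes.
nodes = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (0, 1), (2, 2), (2, 2)]
flags = [False, False, True, False, False, False, True, True]
node_prob = calibrate_nodes(nodes, flags)
for node in sorted(node_prob):
    print(node, node_prob[node], legend_color(node_prob[node]))
```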

We can conclude that a high probability of income above $50K correlates directly with the level of education, but that there are more complex dependencies on age and class of worker.

To classify persons by target group, a probability cut level was defined. A cut level near 0.5 gives the maximum total accuracy (82%) on the testing set but selects only 50% of the target persons; the balanced cut level of about 0.3 gives equal prediction accuracy for both classes and selects 75% of the desired records.
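The trade-off between the two cut levels can be illustrated with a small sketch. The probabilities and labels below are made up and do not reproduce the 82% or 75% figures from the study; the function name and the returned keys are hypothetical.

```python
def evaluate_cut(probs, labels, cut):
    """Classify records as target when prob >= cut and report three rates."""
    selected = [p >= cut for p in probs]
    true_pos = sum(1 for s, y in zip(selected, labels) if s and y)
    true_neg = sum(1 for s, y in zip(selected, labels) if not s and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return {
        "accuracy": (true_pos + true_neg) / len(labels),
        "target_selected": true_pos / positives,       # share of high-income persons found
        "non_target_rejected": true_neg / negatives,   # share of others correctly left out
    }

# Made-up node probabilities and true answers for eight test records.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.80]
labels = [False, False, False, False, True, False, True, True]

high_cut = evaluate_cut(probs, labels, 0.5)
low_cut = evaluate_cut(probs, labels, 0.3)
print(high_cut)
print(low_cut)
```

In this toy example the higher cut gives the better total accuracy but misses one of the three target persons, while the lower cut finds all of them at the price of a larger selected group.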

The results obtained with the SOM algorithm can be compared with those of other methods reported in the UCI Repository (http://www.ics.uci.edu/~mlearn/MLSummary.html); see the table. Taking into account that the prediction was done without an answer set, the total error of 18% obtained with the SOM algorithm seems satisfactory in comparison with other classifiers used for the same purpose. The tools also make it possible to control the outcome depending on the specific aims of the investigation. For example, the relative error can be chosen according to what is more important: not missing a person who is in the above-$50K income bracket, or not increasing the size of the group selected as the target.

Method used             Error(%)    Method used             Error(%)
FSS Naive Bayes         14.05       Voted ID3 (0.6)         15.64
NBTree                  14.10       CN2                     16.00
C4.5-auto               14.46       Naive-Bayes             16.12
IDTM (Decision table)   14.46       Voted ID3 (0.8)         16.47
HOODG                   14.82       T2                      16.84
C4.5 rules              14.94       1R                      19.54
OC1                     15.04       Nearest-neighbor (3)    20.35
C4.5                    15.54       Nearest-neighbor (1)    21.42

Further possibilities open up when the answers are processed as well. We used <data base field transform> to recode them as the high-income probability. The resulting distribution of the probability of income above $50K for persons attributed to each node is shown below. As before, the color legend is dark green where the probability lies below 5%, green where the probability lies between 5% and 75%, light red where the probability lies between 75% and 90%, and red where the probability lies above 90%.

[Figure: probability of income above $50K for each node after recoding the answers. Legend: < 5%, 5% - 75%, 75% - 90%, > 90% of High Income]

We have now obtained more detailed information, e.g. classes populated with wealthy people (more than 95% of them have an income above $50K). The total accuracy of the prediction has increased to 85%. If the records with <data permit> are used, the average influence of each variable on the probability can be estimated, as has been demonstrated for the predicted probabilities. The probabilities attributed to working class, education and hours-per-week alone are shown below. The color gradient runs from green (low probability predicted by this variable) to red (high probability predicted by this variable).

[Figure: three maps of the high-income probability predicted by working class, education and hours-per-week. Legend: low, average, high]

The patterns shown can be used to correlate a person's income with specific variables and to formulate hypotheses to be tested with the usual statistical tools.

For more information on this study, please contact Anatoly Saveliev: info(at)itcsoftware.com


Notes:

Acknowledgement: Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html ]. Irvine, CA: University of California, Department of Information and Computer Science.

Quintiles are a measure of location, like quartiles and percentiles. Quintiles divide an ordered distribution into five equal parts.

  Data Selected: Of the roughly 50 attributes in the Census Bureau dataset, only 14 were selected for this benchmark task. Data processing traditionally distinguishes three kinds of values: nominal or categorical, ordinal (categories with an order) and scalar or continuous (any value on some scale). Categorical attributes have a discrete (countable) and finite set of values; for example, sex has two possible values, male and female. Continuous attributes can take any value in a finite or infinite interval.

  Ordination is the procedure (and also the result) of placing (mapping) objects into a low-dimensional space (on a line or on a plane) while preserving some properties of the objects' mutual relationships (distance in N-dimensional space, or similarity).

  Mapping: The Kohonen SOM produces a classification whose classes are ordinated on the plane in such a manner that similar classes are located next to each other.
 


© 2002-2015 ITC Software. All rights reserved.