UCI Benchmark Study
Databases with millions of records and thousands of fields are now common in business, medicine, engineering, and the sciences. Extracting useful information from such data sets is an important practical problem. Research on this topic focuses on key questions such as: how can one build descriptive models that are both accurate and understandable? Probabilistic and statistical techniques, in particular, play a key role both in analyzing the inference process from a theoretical viewpoint and in providing a principled basis for algorithm development. [http://www.ics.uci.edu/~mlearn/Machine-Learning.html]
Our main focus is to help people understand data (multidimensional and spatial) when they have no prior knowledge of its structure and dependencies - an area of data mining and exploratory data analysis.
We are a small team oriented toward ill-defined problems, applying the same methods in various domains to problems that have only inefficient solutions, or no known solutions at all:
- Multidimensional data exploratory analysis and visualization:
Remote-sensed (RS) data, multi-source images, scientific, sociological, medical data, ... The task is usually not to test known hypotheses but to generate new ones from huge amounts of very noisy, unstructured data, discovering data dependencies using statistical and "esoteric" means (such as Kohonen self-organizing maps (SOM), which produce pictures representing the data that are useful for interpretation by domain specialists).
- Multidimensional (including spatio-temporal and economic) data analysis, modeling and prediction using "black-box" models (where no information about the object's internal structure is available):
Various regressions, Artificial Neural Networks (ANN), fuzzy logic, expert systems, etc. For example, the EUNITE website (European Network of Excellence on Intelligent Technologies for Smart Adaptive Systems) has a short description of the competition team (we were participants) on glass-manufacturing melting-tank modeling.
In spatial data processing we combine GIS techniques; geostatistics (variography, kriging, splines); (contextual) Kohonen neural networks (SOM); and stochastic spatial models (Gaussian and Markov Random Fields - MRF), providing a means of estimating and mapping various spatial gradients and ecotones using multidimensional sets of features.
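The variography mentioned above starts from the empirical semivariogram, which measures how the half mean-squared difference between observation values grows with the distance separating them. A minimal NumPy sketch (the function name and the lag-binning scheme are illustrative choices, not a reference implementation):

```python
import numpy as np

def empirical_variogram(coords, values, lag_edges):
    """Empirical semivariogram: for each lag bin, the mean of
    0.5 * (z_i - z_j)^2 over all point pairs whose separation
    distance falls in that bin."""
    # pairwise separation distances and half squared value differences
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)   # count each pair once
    d, sq = d[iu], sq[iu]
    gamma = np.full(len(lag_edges) - 1, np.nan)
    for k in range(len(lag_edges) - 1):
        mask = (d >= lag_edges[k]) & (d < lag_edges[k + 1])
        if mask.any():
            gamma[k] = sq[mask].mean()       # NaN where a bin has no pairs
    return gamma
```

In practice a model variogram (spherical, exponential, Gaussian) is then fitted to these binned estimates and plugged into the kriging system to weight neighbouring observations.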
The main reason for using SOM and other topographic mapping techniques in data analysis (not only spatial) is that they combine data classification and ordination, and can be used in data mining and exploratory data analysis to discover data structures, especially when the data is of a continuous nature and cannot be separated into discrete classes - almost all natural data is of this kind.
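The SOM idea described above can be sketched in a few lines of NumPy: each sample pulls its best-matching unit (BMU) and the BMU's grid neighbours toward itself, so the grid simultaneously clusters the data and orders the clusters. This is a minimal illustration with assumed hyperparameters (grid size, decay schedules), not any of the SOM implementations actually used in the projects mentioned here:

```python
import numpy as np

def train_som(data, grid_w=5, grid_h=5, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal sequential SOM with linearly decaying learning rate
    and Gaussian neighbourhood radius."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.random((grid_h, grid_w, dim))
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    coords = np.stack([gy, gx], axis=-1).astype(float)  # unit grid positions
    total_steps, step = epochs * n, 0
    for _ in range(epochs):
        for x in data[rng.permutation(n)]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5     # shrinking neighbourhood
            d2 = ((weights - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(np.argmin(d2), d2.shape)
            # Gaussian neighbourhood kernel on the 2-D grid
            g2 = ((coords - np.array(bmu, float)) ** 2).sum(axis=-1)
            h = np.exp(-g2 / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def map_to_bmu(weights, x):
    """Grid cell of the best-matching unit for a sample."""
    d2 = ((weights - x) ** 2).sum(axis=-1)
    return np.unravel_index(np.argmin(d2), d2.shape)
```

Because neighbouring grid cells end up with similar codebook vectors, colouring the grid by feature values yields the interpretable "pretty pictures" that domain specialists can read directly, even when the data has no crisp class boundaries.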
For example, NeRIS software, utilizing SOM and MRF, was used to create the Atlas of Russia's Intact Forest Landscapes from RS data; it made it possible to process the full RS coverage of Russia and to represent all results using the same legend (Aksenov et al., Atlas of Russia's Intact Forest Landscapes, Moscow: 2002, 186 p.).
Another example is the application of ANN in water quality assessment based on a set of indirect tests, where ANN models were compared with traditional statistical approaches (see references).
In these areas we develop appropriate "technology chains" from available modules for the task at hand, rather than "end-user" software.