Application of cluster analysis: dividing a set of objects into homogeneous groups

CLUSTER ANALYSIS IN THE PROBLEMS OF SOCIO-ECONOMIC FORECASTING

Introduction to cluster analysis.

When analyzing and forecasting socio-economic phenomena, the researcher often encounters the multidimensionality of their description. This happens when solving the problem of market segmentation, building a typology of countries according to a sufficiently large number of indicators, predicting the market situation for individual goods, studying and predicting economic depression, and many other problems.

Multivariate analysis methods are the most effective quantitative tool for studying socio-economic processes described by a large number of characteristics. These include cluster analysis, taxonomy, pattern recognition and factor analysis.

Cluster analysis most clearly reflects the features of multivariate analysis in classification; factor analysis, in the study of relationships.

Sometimes the cluster analysis approach is referred to in the literature as numerical taxonomy, numerical classification, self-learning recognition, etc.

Cluster analysis found its first application in sociology. The name comes from the English word cluster: bunch, accumulation. The subject of cluster analysis was first defined, and its description given, by the researcher Tryon in 1939. The main purpose of cluster analysis is to divide the set of objects and features under study into groups (clusters) that are homogeneous in an appropriate sense. This means that the problem of classifying the data and identifying the corresponding structure in it is being solved. Cluster analysis methods can be applied in a variety of cases, even for a simple grouping in which everything comes down to forming groups by quantitative similarity.

The great advantage of cluster analysis is that it allows objects to be partitioned not by one parameter but by a whole set of features. In addition, cluster analysis, unlike most mathematical and statistical methods, does not impose any restrictions on the type of objects under consideration and allows a set of initial data of almost arbitrary nature to be considered. This is of great importance, for example, for market forecasting, when the indicators take a variety of forms that make it difficult to use traditional econometric approaches.

Cluster analysis makes it possible to consider a sufficiently large amount of information and drastically reduce, compress large arrays of socio-economic information, making them compact and visual.

Cluster analysis is of great importance in relation to sets of time series characterizing economic development (for example, general economic and commodity conditions). Here it is possible to single out the periods when the values ​​of the corresponding indicators were quite close, as well as to determine the groups of time series, the dynamics of which are most similar.

Cluster analysis can be used cyclically. In this case, the study is carried out until the desired results are achieved. At the same time, each cycle here can provide information that can greatly change the direction and approaches of further application of cluster analysis. This process can be represented by a system with feedback.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Like any other method, cluster analysis has certain disadvantages and limitations: In particular, the composition and number of clusters depends on the selected partitioning criteria. When reducing the initial data array to a more compact form, certain distortions may occur, and the individual features of individual objects may also be lost due to their replacement by the characteristics of the generalized values ​​of the cluster parameters. When classifying objects, very often the possibility of the absence of any cluster values ​​in the considered set is ignored.

In cluster analysis, it is considered that:

a) the selected characteristics allow, in principle, the desired clustering;

b) the units of measurement (scale) are chosen correctly.

The choice of scale plays a big role. Typically, data is normalized by subtracting the mean and dividing by the standard deviation so that the variance is equal to one.

The problem of cluster analysis.

The task of cluster analysis is to split the set of objects G into m (m an integer) clusters (subsets) Q1, Q2, ..., Qm, based on the data contained in the set X, so that each object Gj belongs to one and only one subset of the partition, and so that objects belonging to the same cluster are similar while objects belonging to different clusters are heterogeneous.

For example, let G include n countries, each characterized by GNP per capita (F1), the number of cars per 1,000 people (F2), per capita electricity consumption (F3), per capita steel consumption (F4), and so on. Then X1 (the measurement vector) is the set of these characteristics for the first country, X2 for the second, X3 for the third, and so on. The task is to partition the countries by level of development.

The solution to the problem of cluster analysis is given by partitions that satisfy some optimality criterion. This criterion can be a functional expressing the desirability of various partitions and groupings, called the objective function. For example, the intragroup sum of squared deviations can be taken as the objective function:

W = Σj=1..n (xj − x̄)²,

where xj represents the measurements of the j-th object and x̄ is the mean of the group.

To solve the problem of cluster analysis, it is necessary to define the concept of similarity and heterogeneity.

It is clear that the i-th and j-th objects fall into the same cluster when the distance between the points Xi and Xj is small enough, and into different clusters when this distance is large enough. Thus, whether objects fall into one cluster or into different ones is determined by the distance between Xi and Xj in Ep, where Ep is the p-dimensional Euclidean space. A non-negative function d(Xi, Xj) is called a distance function (metric) if:

a) d(Xi, Xj) ≥ 0 for all Xi and Xj in Ep;

b) d(Xi, Xj) = 0 if and only if Xi = Xj;

c) d(Xi, Xj) = d(Xj, Xi);

d) d(Xi, Xj) ≤ d(Xi, Xk) + d(Xk, Xj), where Xi, Xj and Xk are any three vectors in Ep.

The value d(Xi, Xj) for Xi and Xj is called the distance between Xi and Xj and is equivalent to the distance between Gi and Gj according to the selected characteristics (F1, F2, F3, ..., Fp).

The most commonly used distance functions are:

1. Euclidean distance: d2(Xi, Xj) = [ Σk=1..p (xik − xjk)² ]^(1/2)

2. l1-norm: d1(Xi, Xj) = Σk=1..p |xik − xjk|

3. Supremum norm: d∞(Xi, Xj) = sup k=1..p |xik − xjk|

4. lp-norm: dp(Xi, Xj) = [ Σk=1..p |xik − xjk|^p ]^(1/p)

The Euclidean metric is the most popular. The l1 metric is the easiest to calculate. The supremum norm is easy to calculate and includes an ordering procedure, while the lp-norm covers the distance functions 1, 2 and 3.

Let the n measurements X1, X2, ..., Xn be represented as a p × n data matrix:

X = (xik), i = 1, ..., p; k = 1, ..., n,

where the k-th column contains the p feature values of the k-th object.

Then the distances between pairs of vectors d(Xi, Xj) can be represented as a symmetric n × n distance matrix

D = ( d(Xi, Xj) ),

with zeros on the diagonal and d(Xi, Xj) = d(Xj, Xi).
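As a quick illustration, the four metrics above can be computed in R with the built-in dist() function; a minimal sketch, where the 5 × 3 data matrix is made up:

    # Made-up data: 5 objects described by p = 3 features
    X <- matrix(rnorm(15), nrow = 5)

    d2   <- dist(X, method = "euclidean")          # Euclidean distance
    d1   <- dist(X, method = "manhattan")          # l1-norm
    dsup <- dist(X, method = "maximum")            # supremum norm
    dp   <- dist(X, method = "minkowski", p = 4)   # lp-norm, here with p = 4

    as.matrix(d2)   # the symmetric distance matrix D with zeros on the diagonal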

The concept opposite to distance is that of similarity between the objects Gi and Gj. A non-negative real function S(Xi, Xj) = Sij is called a similarity measure if:

1) 0 ≤ S(Xi, Xj) < 1 for Xi ≠ Xj;

2) S(Xi, Xi) = 1;

3) S(Xi, Xj) = S(Xj, Xi).

Pairs of similarity values can be combined into a similarity matrix. The value Sij is called the similarity coefficient.

Methods of cluster analysis.

Today there are many methods of cluster analysis. Let us dwell on some of them (the methods given below are usually called the methods of minimum variance).

Let X be the observation matrix, X = (X1, X2, ..., Xn), and let the square of the Euclidean distance between Xi and Xj be determined by the formula:

d²(Xi, Xj) = Σk=1..p (xik − xjk)²

1) The method of complete connections.

The essence of this method is that two objects belonging to the same group (cluster) have a similarity coefficient no less than some threshold value S. In terms of the Euclidean distance d, this means that the distance between two points (objects) of the cluster must not exceed some threshold value h. Thus, h determines the maximum allowable diameter of a subset forming a cluster.

2) Method of maximum local distance.

Each object is considered as a one-point cluster. Objects are grouped according to the following rule: two clusters are combined if the maximum distance between the points of one cluster and the points of another is minimal. The procedure consists of n - 1 steps and results in partitions that match all possible partitions in the previous method for any threshold values.

3) Ward's method.

In this method, the intragroup sum of squared deviations is used as the objective function; it is nothing other than the sum of the squared distances between each point (object) and the mean of the cluster containing that object. At each step, the two clusters whose merger leads to the minimum increase in the objective function, i.e. in the intragroup sum of squares, are combined. This method tends to merge closely spaced clusters.

4) Centroid method.

The distance between two clusters is defined as the Euclidean distance between the centers (means) of these clusters:

d²ij = (X̄ − Ȳ)ᵀ (X̄ − Ȳ).

Clustering proceeds in stages: at each of the n − 1 steps, the two clusters with the minimum value of d²ij are merged. If n1 is much greater than n2, the center of the merged cluster is close to that of the first cluster, and the characteristics of the second cluster are practically ignored when the clusters are merged. This method is sometimes also called the method of weighted groups.
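These methods correspond to linkage options of R's built-in hclust(); a minimal sketch (the data are made up, and squared Euclidean distances are passed for the centroid method, as the hclust documentation recommends):

    X <- matrix(rnorm(30), nrow = 10)   # 10 objects, 3 features
    d <- dist(X)

    hc_complete <- hclust(d,   method = "complete")   # merges by maximum inter-cluster distance
    hc_ward     <- hclust(d,   method = "ward.D2")    # Ward's minimum-variance method
    hc_centroid <- hclust(d^2, method = "centroid")   # centroid method, on squared distances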

Random Forest is one of my favorite data mining algorithms. Firstly, it is incredibly versatile: it can be used for both regression and classification problems, as well as for anomaly detection and predictor selection. Secondly, it is an algorithm that is genuinely difficult to apply incorrectly, simply because, unlike other algorithms, it has few tunable parameters. And yet it is surprisingly simple in essence, while remaining remarkably accurate.

What is the idea of such a wonderful algorithm? The idea is simple: suppose we have some very weak learner, say a decision tree. If we build many different models using this weak algorithm and average their predictions, the final result will be much better. This is so-called ensemble learning in action. Hence the name "Random Forest": for the input data it creates many decision trees and then averages the results of their predictions. An important point here is the element of randomness in the creation of each tree: if we created many identical trees, averaging them would give the accuracy of a single tree.

How does it work? Suppose we have some input data. Each column corresponds to some parameter, each row corresponds to some data element.

We can randomly choose from the entire data set a certain number of columns and rows and build a decision tree from them.
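A minimal sketch of this in R, using the randomForest package and the built-in iris data (both are illustrative assumptions, not part of the original example):

    library(randomForest)   # assumed to be installed

    data(iris)
    # Each tree is grown on a bootstrap sample of the rows; at every split
    # only a random subset of mtry columns is considered.
    model <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    print(model)                  # shows the out-of-bag (OOB) error estimate
    predict(model, head(iris))    # the trees' averaged (voted) prediction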




As an example use case, let us print the dividend yield of Russian companies. As the base price we take the closing price of the share on the day the shareholder register closes. For some reason this information is not on the Troika website, but it is much more interesting than the absolute values of the dividends.
Attention! The code takes a long time to execute, because for each stock a request must be made to the Finam servers to fetch its price.

    # Assumes library(quantmod) with a Finam data source (e.g. the rusquant
    # package) and a data frame divs with columns Symbol, Name, RegistryDate, Divs.
    result <- NULL
    for (i in 1:nrow(divs)) {
      d <- divs[i, ]
      if (d$Divs > 0) {
        try({
          quotes <- getSymbols(d$Symbol, src = "Finam",
                               from = "2010-01-01", auto.assign = FALSE)
          price <- Cl(quotes[d$RegistryDate])  # close on the registry-closing day
          if (length(price) > 0) {
            dd <- d$Divs
            result <- rbind(result,
                            data.frame(d$Symbol, d$Name, d$RegistryDate,
                                       as.numeric(dd) / as.numeric(price),
                                       stringsAsFactors = FALSE))
          }
        }, silent = TRUE)
      }
    }
    colnames(result) <- c("Symbol", "Name", "RegistryDate", "Divs")
    result


Similarly, you can build statistics for past years.

Cluster analysis

Most researchers are inclined to believe that the term "cluster analysis" (from the English cluster: bunch, clot, swarm) was first proposed by the mathematician R. Tryon. Subsequently a number of terms arose that are now considered synonymous with "cluster analysis": automatic classification, botryology.

Cluster analysis is a multivariate statistical procedure that takes data containing information about a sample of objects and then arranges the objects into relatively homogeneous groups (clusters) (Q-clustering, or Q-technique, cluster analysis proper). A cluster is a group of elements characterized by a common property, and the main goal of cluster analysis is to find groups of similar objects in a sample. The range of applications of cluster analysis is very wide: it is used in archaeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, marketing, sociology and other disciplines. This universality, however, has led to a large number of incompatible terms, methods and approaches that make it difficult to use cluster analysis unambiguously and to interpret it consistently. A. I. Orlov proposes a way of distinguishing between them.

Tasks and conditions

Cluster analysis pursues the following main goals:

  • Development of a typology or classification.
  • Exploring useful conceptual schemes for grouping objects.
  • Generation of hypotheses based on data exploration.
  • Hypothesis testing or research to determine whether types (groups) identified in one way or another are actually present in the available data.

Regardless of the subject of study, the use of cluster analysis involves the following steps:

  • Sampling for clustering. It is understood that it makes sense to cluster only quantitative data.
  • Definition of a set of variables by which objects in the sample will be evaluated, that is, a feature space.
  • Calculation of the values ​​of one or another measure of similarity (or difference) between objects.
  • Application of the cluster analysis method to create groups of similar objects.
  • Validation of the results of the cluster solution.

Cluster analysis imposes the following requirements on the data:

  1. indicators should not correlate with each other;
  2. indicators should not contradict the theory of measurements;
  3. the distribution of indicators should be close to normal;
  4. indicators must meet the requirement of "stability", which means the absence of influence on their values ​​by random factors;
  5. the sample should be homogeneous, not contain "outliers".

A description of two fundamental requirements for the data, homogeneity and completeness, can also be found:

Homogeneity requires that all entities represented in a table be of the same nature. The requirement of completeness is that the sets I and J present a complete description of the manifestations of the phenomenon under consideration. If one considers a table in which I is a population and J is the set of variables describing it, then I should be a representative sample from the studied population, and the system of characteristics J should give a satisfactory vector representation of the individuals i from the researcher's point of view.

If cluster analysis is preceded by factor analysis, the sample does not need to be "repaired": the stated requirements are fulfilled automatically by the factor modeling procedure itself (there is one more advantage, z-standardization without negative consequences for the sample; if z-standardization is carried out directly for cluster analysis, it can lead to a decrease in the clarity of the separation of groups). Otherwise the sample must be adjusted.

Typology of clustering problems

Input Types

In modern science, several algorithms for processing input data are used. Analysis that compares objects on the basis of features (most common in the biological sciences) is called Q-type analysis; in the case of comparing features on the basis of objects, R-type analysis. There are attempts to use hybrid types of analysis (for example, RQ-analysis), but this methodology has not yet been properly developed.

Goals of clustering

  • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying its own analysis method to each cluster (the “divide and conquer” strategy).
  • Data compression. If the initial sample is excessively large, then it can be reduced, leaving one of the most typical representatives from each cluster.
  • Novelty detection. Atypical objects are selected that cannot be attached to any of the clusters.

In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, individual objects that do not fit into any of the clusters are of greatest interest.

In all these cases hierarchical clustering can be applied, when large clusters are split into smaller ones, which in turn are split into even smaller ones, etc. Such tasks are called taxonomy tasks. The result of a taxonomy is a tree-like hierarchical structure, and each object is characterized by an enumeration of all the clusters it belongs to, usually from large to small.

Clustering methods

There is no generally accepted classification of clustering methods, but the solid attempt of V. S. Berikov and G. S. Lbov can be noted. If the various classifications of clustering methods are generalized, a number of groups can be distinguished (some methods can be assigned to several groups at once, so this typification should be considered an approximation to the real classification of clustering methods):

  1. Probabilistic approach. It is assumed that each object under consideration belongs to one of k classes. Some authors (for example, A. I. Orlov) believe that this group does not belong to clustering at all, opposing it under the name "discrimination", that is, the assignment of objects to one of several known groups (training samples).
  2. Approaches based on artificial intelligence systems. A very conditional group, since there are a lot of AI methods and methodically they are very different.
  3. Logical approach. The construction of the dendrogram is carried out using a decision tree.
  4. Graph-theoretic approach.
    • Graph clustering algorithms
  5. Hierarchical approach. The presence of nested groups (clusters of different orders) is assumed. Algorithms, in turn, are divided into agglomerative (unifying) and divisive (separating). According to the number of features, monothetic and polythetic methods of classification are sometimes distinguished.
    • Hierarchical divisive clustering, or taxonomy. Clustering problems are considered in quantitative taxonomy.
  6. Other Methods. Not included in the previous groups.
    • Statistical clustering algorithms
    • Ensemble of clusterers
    • Algorithms of the KRAB family
    • Algorithm based on the sifting method
    • DBSCAN etc.

Approaches 4 and 5 are sometimes combined under the name of the structural or geometric approach, which has a more formalized concept of proximity. Despite significant differences between the listed methods, they all rely on the original "compactness hypothesis": in object space, all close objects must belong to the same cluster, and all dissimilar objects, respectively, must be in different clusters.

Formal Statement of the Clustering Problem

Let X be a set of objects and Y a set of cluster numbers (names, labels). A distance function ρ(x, x′) between objects is given, and there is a finite training sample of objects X^m = {x1, ..., xm} ⊂ X. It is required to split the sample into non-overlapping subsets, called clusters, so that each cluster consists of objects that are close in the metric ρ, while the objects of different clusters differ significantly. Each object xi ∈ X^m is assigned a cluster number yi.

A clustering algorithm is a function a: X → Y that assigns a cluster number y ∈ Y to any object x ∈ X. In some cases the set Y is known in advance, but more often the task is to determine the optimal number of clusters from the point of view of one or another criterion of clustering quality.

Clustering (unsupervised learning) differs from classification (supervised learning) in that the labels yi of the original objects are not given initially, and even the set Y itself may be unknown.

The solution of the clustering problem is fundamentally ambiguous, and there are several reasons for this (according to a number of authors):

  • there is no uniquely best criterion for the quality of clustering. A number of heuristic criteria are known, as well as a number of algorithms that do not have a clearly defined criterion, but carry out a fairly reasonable clustering “by construction”. All of them can give different results. Therefore, to determine the quality of clustering, an expert in the subject area is required, who could assess the meaningfulness of the selection of clusters.
  • the number of clusters is usually unknown in advance and is set according to some subjective criterion. This is true only for discrimination methods, since in clustering methods, clusters are selected using a formalized approach based on proximity measures.
  • the clustering result significantly depends on the metric, the choice of which, as a rule, is also subjective and is determined by an expert. But it is worth noting that there are a number of recommendations for choosing proximity measures for various tasks.

Application

In biology

In biology, clustering has many applications in a wide variety of fields. For example, in bioinformatics, it is used to analyze complex networks of interacting genes, sometimes consisting of hundreds or even thousands of elements. Cluster analysis allows you to identify subnets, bottlenecks, hubs and other hidden properties of the system under study, which ultimately allows you to find out the contribution of each gene to the formation of the phenomenon under study.

In the field of ecology, it is widely used to identify spatially homogeneous groups of organisms, communities, etc. Less commonly, cluster analysis methods are used to study communities over time. The heterogeneity of the structure of communities leads to the emergence of non-trivial methods of cluster analysis (for example, the Czekanowski method).

In general, it is worth noting that historically, similarity measures are more often used as proximity measures in biology, rather than difference (distance) measures.

In sociology

When analyzing the results of sociological research, it is recommended to carry out the analysis using the methods of a hierarchical agglomerative family, namely the Ward method, in which the minimum dispersion is optimized within the clusters, as a result, clusters of approximately equal sizes are created. Ward's method is the most successful for the analysis of sociological data. As a measure of difference, the quadratic Euclidean distance is better, which contributes to an increase in the contrast of clusters. The main result of hierarchical cluster analysis is a dendrogram or "icicle diagram". When interpreting it, researchers are faced with a problem of the same kind as the interpretation of the results of factor analysis - the lack of unambiguous criteria for identifying clusters. It is recommended to use two methods as the main ones - visual analysis of the dendrogram and comparison of the results of clustering performed by different methods.

Visual analysis of the dendrogram involves "cutting" the tree at the optimal level of similarity of the sample elements. The "vine branch" (in the terminology of M. S. Oldenderfer and R. K. Blashfield) should be "cut off" at around the mark 5 on the Rescaled Distance Cluster Combine scale, thus achieving an 80% similarity level. If selecting clusters at this mark is difficult (several small clusters merge into one large one), another mark can be chosen. This technique is proposed by Oldenderfer and Blashfield.

Now the question arises of the stability of the adopted cluster solution. In essence, checking the stability of a clustering comes down to checking its reliability. There is a rule of thumb here: a stable typology is preserved when the clustering method is changed. The results of hierarchical cluster analysis can be verified by iterative k-means cluster analysis. If the compared classifications of groups of respondents have a share of coincidences of more than 70% (more than 2/3 of coincidences), the cluster solution is accepted.

It is impossible to check the adequacy of a solution without resorting to another type of analysis; at least theoretically, this problem has not been solved. Oldenderfer and Blashfield's classic Cluster Analysis elaborates on, and ultimately rejects, five additional robustness-testing methods.

In computer science

  • Clustering of search results - used for "intelligent" grouping of results when searching files, websites or other objects, allowing the user to navigate quickly, select an obviously more relevant subset and exclude an obviously less relevant one, which can improve the usability of the interface compared with output as a simple list sorted by relevance.
    • Clusty - Vivísimo's clustering search engine
    • Nigma - Russian search engine with automatic results clustering
    • Quintura - visual clustering in the form of a cloud of keywords
  • Image segmentation - clustering can be used to split a digital image into distinct regions for edge detection or object recognition.
  • Data mining - clustering becomes valuable in data mining when it acts as one of the stages of analysis in building a complete analytical solution. It is often easier for an analyst to single out groups of similar objects, study their features and build a separate model for each group than to create one general model for all the data. This technique is constantly used in marketing to single out groups of customers, buyers and goods and to develop a separate strategy for each of them.


Clustering tasks in Data Mining


1. Sequential clustering algorithm

Consider I = (I1, I2, ..., In) as the set of n single-element clusters (I1), (I2), ..., (In). Let us choose two of them, say Ii and Ij, that are in some sense closest to each other, and merge them into one cluster. The new set of clusters, now consisting of n − 1 clusters, will be:

(I1), (I2), ..., (Ii, Ij), ..., (In).

Repeating the process, we obtain successive sets of clusters consisting of n − 2, n − 3, n − 4, etc. clusters. At the end of the procedure we obtain a single cluster consisting of all n objects and coinciding with the original set I = (I1, I2, ..., In).

As a measure of distance we take the squared Euclidean metric dij² and compute the matrix D = (dij²), where dij² is the squared distance between Ii and Ij:

          I1     I2     I3     ...    In
  I1      0      d12²   d13²   ...    d1n²
  I2             0      d23²   ...    d2n²
  I3                    0      ...    d3n²
  ...                          ...    ...
  In                                  0

Let the distance between Ii and Ij be minimal:

dij² = min { dij², i ≠ j }.

We merge Ii and Ij into a new cluster (Ii, Ij) and build a new ((n − 1), (n − 1)) distance matrix:

            (Ii, Ij)   I1        I2        ...    In
  (Ii, Ij)  0          d(ij)1²   d(ij)2²   ...    d(ij)n²
  I1                   0         d12²      ...    d1n²
  I2                             0         ...    d2n²
  ...                                      ...    ...
  In                                              0

The n − 2 rows of the last matrix are taken from the previous one, and the first row is recomputed. Computations can be kept to a minimum if the distances d(ij)k², k = 1, 2, ..., n (k ≠ i, k ≠ j), can be expressed through the elements of the original matrix.

Initially, the distance is defined only between single-element clusters, but distances between clusters containing more than one element must also be determined. This can be done in various ways, and depending on the chosen method we obtain cluster analysis algorithms with different properties. One can, for example, set the distance between the cluster i + j and some other cluster k equal to the arithmetic mean of the distances between clusters i and k and between clusters j and k:

d(i+j),k = ½ (dik + djk).

One can also define d(i+j),k as the minimum of these two distances:

d(i+j),k = min(dik, djk).

Thus, the first step of an agglomerative hierarchical algorithm is described. The subsequent steps are analogous.

A fairly wide class of algorithms can be obtained if the following general formula is used to recalculate the distances:

d(i+j),k = A(w) · min(dik, djk) + B(w) · max(dik, djk),

where the weight coefficients A(w) and B(w) depend on which of dik, djk is the smaller, on the numbers of elements ni and nj in clusters i and j, and on a free parameter w whose choice determines the particular algorithm. For example, when w = 1 we obtain the so-called "average connection" algorithm, for which the distance recalculation formula takes the form

d(i+j),k = (ni · dik + nj · djk) / (ni + nj).

In this case the distance between two clusters at each step of the algorithm turns out to be equal to the arithmetic mean of the distances between all pairs of elements such that one element of the pair belongs to one cluster and the other to the other.

The intuitive meaning of the parameter w becomes clear if we let w → ∞. The distance recalculation formula then takes the form

d(i+j),k = min(dik, djk).

This is the so-called "nearest neighbor" algorithm, which makes it possible to select clusters of arbitrarily complex shape, provided that the various parts of such clusters are connected by chains of elements close to each other. In this case the distance between two clusters at each step of the algorithm equals the distance between the two closest elements belonging to these two clusters.
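A sketch of one step of this "nearest neighbor" recalculation in R (toy data; in practice hclust(d, method = "single") performs the whole procedure):

    X <- matrix(rnorm(10), nrow = 5)
    D <- as.matrix(dist(X))
    diag(D) <- Inf                      # ignore zero self-distances

    # find the closest pair of clusters (i, j)
    ij <- which(D == min(D), arr.ind = TRUE)[1, ]
    i <- ij[1]; j <- ij[2]

    # nearest-neighbor recalculation: d(i+j, k) = min(d(i,k), d(j,k))
    d_new <- pmin(D[i, ], D[j, ])[-c(i, j)]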

Quite often it is assumed that the initial distances (differences) between the grouped elements are given. In some cases this is indeed true. Usually, however, only the objects and their characteristics are specified, and the distance matrix must be built from these data. Depending on whether distances between objects or between characteristics of objects are calculated, different methods are used.

In the case of cluster analysis of objects, the most common measure of difference is either the squared Euclidean distance

dij² = Σh=1..m (xih − xjh)²

(where xih and xjh are the values of the h-th feature for the i-th and j-th objects, and m is the number of features), or the Euclidean distance itself. If the features are assigned different weights wh, these weights can be taken into account when calculating the distance:

dij² = Σh=1..m wh (xih − xjh)².

Sometimes the distance calculated by the formula

dij = Σh=1..m |xih − xjh|

is used as a measure of difference; it is called the "Hamming", "Manhattan" or "city-block" distance.

A natural measure of the similarity of the characteristics of objects in many problems is the correlation coefficient between them,

rij = Σh=1..m (xih − mi)(xjh − mj) / (m · di · dj),

where mi, mj, di, dj are, respectively, the means and standard deviations of characteristics i and j. The value 1 − rij can serve as a measure of the difference between characteristics. In some problems the sign of the correlation coefficient is insignificant and depends only on the choice of the unit of measurement; in this case 1 − |rij| is used as the measure of difference.
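A sketch in R of turning correlations between characteristics into a difference matrix suitable for clustering (the data matrix is made up):

    X <- matrix(rnorm(100), nrow = 20)   # 20 objects, 5 characteristics

    r  <- cor(X)                   # correlation coefficients between characteristics
    d  <- as.dist(1 - r)           # difference measure 1 - r
    da <- as.dist(1 - abs(r))      # sign-insensitive variant 1 - |r|
    hclust(da, method = "average") # cluster the characteristics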

2. Number of clusters

A very important issue is the choice of the required number of clusters. Sometimes the number m of clusters can be chosen a priori. In the general case, however, it is determined in the course of partitioning the set into clusters.

Studies by Fortier and Solomon showed that the number of partitions must be chosen so as to achieve a given probability α of finding the best partition. The optimal number of partitions is thus a function of the given fraction β of the best, or in some sense admissible, partitions in the set of all possible ones; the total scatter will be the greater, the higher the fraction β of admissible partitions. Fortier and Solomon compiled a table from which one can find the required number of partitions S(α, β) depending on α and β (where α is the probability that the best partition is found, and β is the fraction of the best partitions in the total number of partitions). Moreover, as the measure of heterogeneity, not the scatter measure is used but the membership measure introduced by Holzinger and Harman. The table of values S(α, β) is given below.

Table of values S(α, β)

  β \ α     0.20    0.10    0.05    0.01    0.001   0.0001
  0.20         8      11      14      21      31       42
  0.10        16      22      29      44      66       88
  0.05        32      45      59      90     135      180
  0.01       161     230     299     459     689      918
  0.001     1626    2326    3026    4652    6977     9303
  0.0001   17475   25000   32526   55000   75000   100000

Quite often, the criterion for merging (for choosing the number of clusters) is the change in an appropriate function, for example the sum of squared deviations

E = Σl=1..m Σ(xj ∈ Ql) (xj − x̄l)²,

where x̄l is the mean of cluster Ql. The grouping process must correspond here to a sequentially minimal increase in the value of the criterion E. A sharp jump in the value of E can be interpreted as a characteristic of the number of clusters that objectively exist in the population under study.

So, the second way to determine the best number of clusters is to identify the jumps determined by the phase transition from a strongly coupled to a weakly coupled state of the objects.
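This "jump in E" heuristic is easy to sketch in R with k-means, whose tot.withinss is exactly the intragroup sum of squares (the data are made up):

    set.seed(1)
    X <- scale(matrix(rnorm(200), nrow = 50))
    E <- sapply(1:8, function(m) kmeans(X, centers = m, nstart = 20)$tot.withinss)
    plot(1:8, E, type = "b", xlab = "number of clusters m", ylab = "E")
    # a sharp drop followed by a plateau marks the candidate number of clusters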

3. Dendrograms

The best known method of representing a distance or similarity matrix is based on the idea of a dendrogram, or tree diagram. A dendrogram can be defined as a graphical representation of the results of a sequential clustering process carried out in terms of the distance matrix. With the help of a dendrogram one can depict the clustering procedure graphically or geometrically, provided that the procedure operates only with elements of the distance or similarity matrix.

There are many ways to construct dendrograms. In a dendrogram the objects are located vertically on the left, and the clustering results on the right. Distance or similarity values corresponding to the formation of new clusters are displayed along a horizontal line above the dendrogram.

Fig. 1. An example dendrogram for six objects.

Figure 1 shows an example of a dendrogram. It corresponds to the case of six objects (n = 6) and k characteristics (features). Objects A and C are the closest and are therefore combined into one cluster at a proximity level of 0.9. Objects D and E are combined at the level 0.8. Now we have four clusters:

(A, C), (F), (D, E), (B).

Next the clusters (A, C, F) and (E, D, B) are formed, corresponding to the proximity levels 0.7 and 0.6. Finally, all objects are grouped into one cluster at the level 0.5.

The type of dendrogram depends on the choice of the similarity measure or the distance between an object and a cluster, and on the clustering method. The most important point is the choice of the measure of similarity or distance between an object and a cluster.

The number of cluster analysis algorithms is very large. All of them can be divided into hierarchical and non-hierarchical.

Hierarchical algorithms are associated with the construction of dendrograms and are divided into:

a) agglomerative, characterized by the successive merging of the initial elements and a corresponding decrease in the number of clusters;

b) divisive, in which the number of clusters increases starting from one, resulting in a sequence of splitting groups.

Cluster analysis algorithms today have good software implementations that make it possible to solve problems of very high dimension.
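A minimal R sketch of the agglomerative branch: build a dendrogram and then "cut" it at a chosen level (the data are made up):

    X  <- matrix(rnorm(40), nrow = 20)
    hc <- hclust(dist(X), method = "average")
    plot(hc)                     # the dendrogram itself
    groups <- cutree(hc, k = 4)  # cut the tree into 4 clusters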

4. Data

Cluster analysis can be applied to interval data, frequencies, binary data. It is important that the variables change on comparable scales.

The heterogeneity of units of measurement, and the resulting impossibility of reasonably expressing the values of different indicators on the same scale, leads to the distance between points, which reflects the position of objects in the space of their properties, depending on an arbitrarily chosen scale. To eliminate this heterogeneity of the initial data, all their values are normalized beforehand, i.e. expressed through the ratio of these values to some quantity reflecting certain properties of the given indicator. Normalization of the initial data for cluster analysis is sometimes carried out by dividing the initial values by the standard deviation of the corresponding indicator. Another way is to compute the so-called standardized contribution, also called the Z-score.

The Z-score shows how many standard deviations an observation lies from the mean:

z = (xi − x̄) / S,

where xi is the value of the observation, x̄ the mean, and S the standard deviation.

The mean of the Z-scores is zero and their standard deviation is 1.

Standardization allows comparison of observations from different distributions. If the distribution of a variable is normal (or close to normal) and the mean and variance are known or estimated from large samples, the Z-score of an observation provides more specific information about its location.
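In R this standardization is a one-liner; a sketch:

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)
    z <- (x - mean(x)) / sd(x)    # the Z-score formula above

    X <- matrix(rnorm(30), nrow = 10)
    Z <- scale(X)                 # column-wise: mean 0, standard deviation 1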

Note that normalization methods mean the recognition of all features as equivalent from the point of view of elucidating the similarity of the objects under consideration. It has already been noted that in relation to the economy, the recognition of the equivalence of various indicators does not always seem justified. It would be desirable, along with normalization, to give each of the indicators a weight that reflects its significance in the course of establishing similarities and differences between objects.

In this situation one has to resort to a method of determining the weights of individual indicators: a survey of experts. For example, when solving the problem of classifying countries by level of economic development, the results of a survey of 40 leading Moscow experts on the problems of developed countries, on a ten-point scale, were used:

generalized indicators of socio-economic development - 9 points;

indicators of sectoral distribution of the employed population - 7 points;

indicators of the prevalence of hired labor - 6 points;

indicators characterizing the human element of the productive forces - 6 points;

indicators of the development of material productive forces - 8 points;

indicator of public spending - 4 points;

"military-economic" indicators - 3 points;

socio-demographic indicators - 4 points.

The experts' estimates were relatively stable.

Expert assessments provide a well-known basis for determining the importance of indicators included in a particular group of indicators. Multiplying the normalized values ​​of indicators by a coefficient corresponding to the average score of the assessment makes it possible to calculate the distances between points that reflect the position of countries in a multidimensional space, taking into account the unequal weight of their features.

Quite often, when solving such problems, not one, but two calculations are used: the first, in which all signs are considered equivalent, the second, where they are given different weights in accordance with the average values ​​of expert estimates.

5. Application of cluster analysis

Let's consider some applications of cluster analysis.

1. The division of countries into groups according to the level of development.

65 countries were studied according to 31 indicators (national income per capita, the share of the population employed in industry in %, savings per capita, the share of the population employed in agriculture in %, average life expectancy, the number of cars per 1,000 inhabitants, the number of armed forces per 1 million inhabitants, the share of GDP from industry in %, the share of GDP from agriculture in %, etc.).

Each of the countries acts in this consideration as an object characterized by certain values ​​of 31 indicators. Accordingly, they can be represented as points in a 31-dimensional space. Such a space is usually called the property space of the objects under study. Comparison of the distance between these points will reflect the degree of proximity of the countries under consideration, their similarity to each other. The socio-economic meaning of this understanding of similarity means that countries are considered the more similar, the smaller the differences between the same indicators with which they are described.

The first step of such an analysis is to identify the pair of national economies included in the similarity matrix, the distance between which is the smallest. These will obviously be the most similar, similar economies. In the following consideration, both of these countries are considered a single group, a single cluster. Accordingly, the original matrix is ​​transformed so that its elements are the distances between all possible pairs of not 65, but 64 objects - 63 economies and a newly transformed cluster - a conditional union of the two most similar countries. Rows and columns corresponding to the distances from a pair of countries included in the union to all the others are discarded from the original similarity matrix, but a row and column are added containing the distance between the cluster obtained by the union and other countries.

The distance between the newly obtained cluster and the countries is assumed to be equal to the average of the distances between the latter and the two countries that make up the new cluster. In other words, the combined group of countries is considered as a whole with characteristics approximately equal to the average of the characteristics of its constituent countries.

The second step of the analysis is to consider the matrix transformed in this way, with 64 rows and columns. Again a pair of economies is identified, the distance between which has the smallest value, and they are merged, just as in the first step. In this case the smallest distance can be either between a pair of countries, or between a country and the union of countries obtained at the previous stage.

Further procedures are similar to those described above: at each stage the matrix is transformed so that the two columns and two rows containing the distances to the objects (pairs of countries or clusters) merged at the previous stage are excluded from it; the excluded rows and columns are replaced by a column and a row containing the distances from the new union to the remaining objects; then, in the modified matrix, the pair of closest objects is identified. The analysis continues until the matrix is completely exhausted (i.e., until all countries are merged). The generalized results of the analysis can be represented as a similarity tree (dendrogram) similar to the one described above, with the only difference that this tree, reflecting the relative proximity of all 65 countries under consideration, is much more complicated than a scheme in which only five national economies appear. According to the number of merged objects, the tree includes 65 levels. The first (lowest) level contains points corresponding to each country separately. The connection of two points at the second level shows the pair of countries that are closest in terms of the general type of their national economies. At the third level the next most similar pairwise relation of countries is noted (as already mentioned, either a new pair of countries, or a new country and an already identified pair of similar countries, can be in such a relation). And so on up to the last level, at which all the studied countries act as a single set.
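The stepwise matrix-merging procedure just described is average-linkage agglomerative clustering; a hedged R sketch on synthetic "country" data (all numbers are invented, only the 65 × 31 shape follows the text):

    set.seed(1)
    countries <- matrix(rnorm(65 * 31), nrow = 65)   # 65 countries, 31 indicators

    hc <- hclust(dist(scale(countries)), method = "average")
    plot(hc)                      # the 65-level similarity tree (dendrogram)
    groups <- cutree(hc, k = 5)   # five groups, as in the result below
    table(groups)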

As a result of applying cluster analysis, the following five groups of countries were obtained:

Afro-Asian group;

Latin-Asian group;

Latin-Mediterranean group;

group of developed capitalist countries (without the USA);

USA.

The introduction of new indicators beyond the 31 indicators used here, or their replacement by others, naturally leads to a change in the results of the country classification.

2. The division of countries according to the criterion of proximity of culture.

As you know, marketing must take into account the culture of countries (customs, traditions, etc.).

The following groups of countries were obtained through clustering:

· Arabic;

· Middle Eastern;

· Scandinavian;

· German-speaking;

· English-speaking;

· Romance European;

· Latin American;

· Far Eastern.

3. Development of a zinc market forecast.

Cluster analysis plays an important role at the stage of reducing the economic-mathematical model of commodity market conditions, helping to ease and simplify computational procedures and ensuring greater compactness of the results while maintaining the required accuracy. It makes it possible to divide the entire initial set of market indicators into groups (clusters) according to the relevant criteria, thereby facilitating the selection of the most representative indicators.

Cluster analysis is widely used to model market conditions; in practice, a large share of forecasting tasks rely on it. Consider, as an example, the task of developing a forecast of the zinc market.

Initially, 30 key indicators of the global zinc market were selected:

X1 - time

Production indicators:

X2 - in the world

X4 - in Europe

X5 - in Canada

X6 - in Japan

X7 - in Australia

Consumption indicators:

X8 - in the world

X10 - in Europe

X11 - in Canada

X12 - in Japan

X13 - in Australia

Producer stocks of zinc:

X14 - in the world

X16 - in Europe

X17 - in other countries

Consumer stocks of zinc:

X18 - in the USA

X19 - in England

X20 - in Japan

Imports of zinc ores and concentrates (thousand tons):

X21 - to the USA

X22 - to Japan

X23 - to Germany

Exports of zinc ores and concentrates (thousand tons):

X24 - from Canada

X25 - from Australia

Imports of zinc (thousand tons):

X26 - to the USA

X27 - to England

X28 - to Germany

Exports of zinc (thousand tons):

X29 - from Canada

X30 - from Australia

To determine the specific dependencies, the apparatus of correlation and regression analysis was used. Relationships were analyzed on the basis of the matrix of pair correlation coefficients, under the accepted hypothesis that the analyzed market indicators are normally distributed. Clearly, the coefficients r_ij are not the only possible measure of the relationship between the indicators. The need for cluster analysis in this problem stems from the fact that the number of indicators affecting the price of zinc is very large, and they have to be reduced for the following reasons:

a) lack of complete statistical data for all variables;

b) a sharp complication of computational procedures when a large number of variables are introduced into the model;

c) the optimal use of regression analysis methods requires that the number of observations exceed the number of variables by at least 6-8 times;

d) the desire to use statistically independent variables in the model, etc.

It is very difficult to carry out such an analysis directly on a relatively bulky matrix of correlation coefficients. Cluster analysis allows the entire set of market variables to be divided into groups so that the elements of each cluster are strongly correlated with one another, while representatives of different groups are only weakly correlated.
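As an illustration of this idea outside any particular package, here is a minimal sketch in Python (synthetic data; the 1 - |r| distance is one plausible choice, not necessarily the formula used in the original study):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(2)
data = rng.normal(size=(120, 12))        # stand-in: 120 observations of 12 indicators
R = np.corrcoef(data, rowvar=False)      # matrix of pair correlation coefficients r_ij

D = 1.0 - np.abs(R)                      # strongly correlated indicators -> close together
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="average")

groups = fcluster(Z, t=4, criterion="maxclust")  # cut into, say, 4 groups of indicators
print(groups)                            # group label for each of the 12 indicators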

To solve this problem, one of the agglomerative hierarchical cluster analysis algorithms was applied. At each step the number of clusters is reduced by one through the union of two groups that is optimal in a certain sense. The criterion for merging is the change in the corresponding objective function, for which the sums of squared deviations were used, calculated as

$$E_j = \sum_{i=1}^{n} \left( r_{ij} - \bar{r}_j \right)^2, \qquad j = 1, 2, \ldots, m,$$

where j is the cluster number, n is the number of elements in the cluster, r_ij is the pair correlation coefficient, and \bar{r}_j is its mean value within cluster j.

Thus, the grouping process must correspond to a sequential minimal increase in the value of the criterion E, the sum of E_j over all clusters.

At the first stage, the initial data array is presented as a set of clusters each containing one element. The grouping process begins with the union of the pair of clusters that leads to the minimal increase in the sum of squared deviations, which requires estimating this increase for each possible union. At the next stage, the sums of squared deviations are considered for the resulting clusters, and so on. The process is stopped at some step by monitoring the value of the sum of squared deviations: considering the sequence of its increasing values, one can catch a jump (one or more) in its dynamics, which can be interpreted as an indication of the number of groups "objectively" existing in the population under study. In the example above, jumps took place when the number of clusters was 7 and 5; beyond that the number of groups should not be reduced, because this degrades the quality of the model. After the clusters are obtained, the variables most important in the economic sense and most closely related to the selected market criterion are chosen, in this case the London Metal Exchange zinc quotations. This approach allows a significant part of the information contained in the original set of market indicators to be preserved.
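A minimal sketch of this jump-monitoring logic, assuming SciPy and synthetic data (SciPy's Ward linkage heights grow monotonically with the increase of the sum-of-squared-deviations criterion, so gaps between successive heights play the role of the jumps described above):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
# Synthetic data with three built-in groups, so the jump is clearly visible.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 5)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")             # each merge minimizes the growth of the criterion
heights = Z[:, 2]
step = int(np.argmax(np.diff(heights)))   # position of the largest jump
print("suggested number of clusters:", len(X) - (step + 1))   # 3 for this synthetic data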

There are two main types of cluster analysis in statistics (both represented in SPSS): hierarchical and k-means. In the first case, the automated statistical procedure itself determines the optimal number of clusters and a number of other parameters required for the analysis. The second type has significant limitations in practical applicability: for it, the exact number of clusters, the initial values of the cluster centers (centroids), and some other statistics must be specified independently. These problems of the k-means method are usually solved by first running a hierarchical cluster analysis and then computing the k-means model on the basis of its results, which in most cases does not simplify but, on the contrary, complicates the researcher's work (especially for an unprepared researcher).

In general, one can say that because hierarchical cluster analysis is very demanding of computer resources, k-means clustering was introduced into SPSS to process very large data sets, consisting of many thousands of observations (respondents), under conditions of insufficient computing capacity. Sample sizes used in marketing research in most cases do not exceed four thousand respondents, and the practice of marketing research shows that it is the first type of cluster analysis, hierarchical, that is recommended in all cases as the most relevant, universal and accurate. At the same time, it should be emphasized that the selection of relevant variables is essential when conducting cluster analysis: the inclusion of several or even one irrelevant variable can cause the whole statistical procedure to fail.
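The two-step scheme just described can be sketched outside SPSS as follows, assuming scikit-learn is available; the data and parameters are illustrative only:

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 7))        # stand-in for 7 rating variables

# Step 1: a preliminary hierarchical run fixes the number of clusters and centroids.
hier = AgglomerativeClustering(n_clusters=2).fit(X)
centroids = np.vstack([X[hier.labels_ == k].mean(axis=0) for k in range(2)])

# Step 2: k-means started from those centroids refines the partition.
km = KMeans(n_clusters=2, init=centroids, n_init=1).fit(X)
print(np.bincount(km.labels_))       # sizes of the two final clusters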

We will describe the methodology for conducting cluster analysis using the following example from the practice of marketing research.

Initial data:

During the study, 745 air passengers flying with one of 22 Russian and foreign airlines were interviewed. They were asked to rate, on a five-point scale from 1 (very poor) to 5 (excellent), seven aspects of the airlines' ground staff performance during check-in: courtesy, professionalism, promptness, helpfulness, queue management, appearance, and the work of the staff in general.

Required:

Segment the studied airlines according to the level of quality of work of ground personnel perceived by air passengers.

So, we have a data file consisting of seven interval variables holding the performance ratings of the ground personnel of the various airlines (q13-q19), presented on a single five-point scale. The file also contains a single variable q4 indicating the airlines selected by the respondents (22 in total). Let us carry out a cluster analysis and determine into which target groups the airlines can be divided.

Hierarchical cluster analysis is carried out in two stages. The result of the first stage is the number of clusters (target segments) into which the studied sample of respondents should be divided. The cluster analysis procedure as such cannot independently determine the optimal number of clusters; it can only suggest it. Since determining the optimal number of segments is a key problem, it is usually solved at a separate stage of the analysis. At the second stage, the actual clustering of observations is performed according to the number of clusters determined during the first stage. Let us now look at these steps in order.

The cluster analysis procedure is launched using the Analyze > Classify > Hierarchical Cluster menu. In the dialog box that opens, from the left list of all variables available in the data file, select the variables that serve as the segmentation criteria. In our case there are seven of them, denoting the ratings of the ground personnel's work, q13-q19 (Fig. 5.44). In principle, specifying the set of segmentation criteria is quite enough to perform the first stage of cluster analysis.

Fig. 5.44.

By default, in addition to the table with the results of cluster formation, on the basis of which we will determine their optimal number, SPSS also displays a special inverted icicle diagram which, according to the intention of the program's creators, helps to determine the optimal number of clusters; diagrams are displayed using the Plots button (Fig. 5.45). However, if we leave this option enabled, we will spend a lot of time processing even a relatively small data file. In addition to the icicle, a faster Dendrogram chart can be selected in the Plots window: a set of horizontal bars reflecting the process of cluster formation. Theoretically, with a small number of respondents (up to 50-100), this diagram really does help to choose the optimal number of clusters. However, in almost all marketing research examples the sample size exceeds this value, and the dendrogram becomes completely useless, since even with a relatively small number of observations it is a very long sequence of row numbers of the original data file connected by horizontal and vertical lines. Most SPSS textbooks contain examples of cluster analysis on just such artificial, small samples. In this tutorial we show how to get the most out of SPSS in a practical setting, with examples from real market research.

Fig. 5.45.

As we have established, neither the icicle plot nor the dendrogram is suitable for practical purposes. Therefore, in the main Hierarchical Cluster Analysis dialog box, it is recommended not to display charts by deselecting the default Plots option in the Display area, as shown in Fig. 5.44. Now everything is ready for the first stage of cluster analysis; start the procedure by clicking the OK button.

After a while, the results will appear in the SPSS Viewer window. As mentioned above, the only result of the first stage of the analysis that is significant for us will be the Average Linkage (Between Groups) table, shown in Fig. 5.46. Based on this table, we must determine the optimal number of clusters. It should be noted that there is no single universal method for determining the optimal number of clusters. In each case, the researcher must determine this number himself.

Based on experience, the author proposes the following scheme for this process. First, let us apply the most common standard method of determining the number of clusters. Using the Average Linkage (Between Groups) table, we must determine at what step of the cluster formation process (the Stage column) the first relatively large jump in the agglomeration coefficient occurs (the Coefficients column). This jump means that up to that point, observations separated by fairly small distances were being combined into clusters (in our case, respondents with a similar level of ratings on the analyzed parameters), while from that stage onward more distant observations are being joined.

In our case, the coefficients increase smoothly from 0 to 7.452; that is, the differences between the coefficients at steps 1 through 728 are small (for example, between steps 727 and 728 it is 0.534). At step 729 the first significant jump in the coefficient occurs: from 7.452 to 10.364 (by 2.912). The step at which the first jump occurs is thus 729. To determine the optimal number of clusters, this value must be subtracted from the total number of observations (the sample size). The total sample size in our case is 745 people; therefore, the optimal number of clusters is 745 - 729 = 16.


Fig. 5.46.
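For readers reproducing this rule outside SPSS, here is a minimal sketch on a toy agglomeration schedule (the largest jump stands in for the first "relatively large" one; the real Coefficients column would be read from the SPSS output):

import numpy as np

def optimal_clusters(coefficients, n_observations):
    """Number of clusters = N minus the step at which the first big jump occurs."""
    jumps = np.diff(coefficients)
    step = int(np.argmax(jumps)) + 2   # 1-based step showing the jumped coefficient
    return n_observations - step

coeffs = [0.2, 0.5, 0.9, 5.0, 5.6, 6.1]   # toy schedule for 7 observations
print(optimal_clusters(coeffs, 7))         # the jump is at step 4 -> 7 - 4 = 3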

We got a fairly large number of clusters, which will be difficult to interpret later on. Therefore, we must now examine the obtained clusters and determine which of them are significant and which we should try to eliminate by reducing their number. This problem is solved at the second stage of the cluster analysis.

Open the main dialog box of the cluster analysis procedure (menu Analyze > Classify > Hierarchical Cluster). In the field for analyzed variables, we already have seven parameters. Click the Save button. The dialog box that opens (Fig. 5.47) allows you to create a new variable in the source data file that distributes respondents into target groups. Select the Single Solution option and specify the required number of clusters in the corresponding field - 16 (determined at the first stage of the cluster analysis). Clicking the Continue button will return you to the main dialog box, where you can click the OK button to start the cluster analysis procedure.
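The analogue of this Save step outside SPSS can be sketched with SciPy's fcluster, on synthetic stand-in data:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = rng.normal(size=(745, 7))             # stand-in for the 745 x 7 ratings

Z = linkage(X, method="average")          # between-groups linkage
clu16 = fcluster(Z, t=16, criterion="maxclust")   # cluster number per respondent,
                                                  # the analogue of clu16_1
print(np.bincount(clu16)[1:])             # the sixteen cluster sizes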

Before continuing the description of the cluster analysis process, it is worth briefly describing the other parameters, among which there are both useful features and, from the point of view of practical marketing research, superfluous ones. For example, the main Hierarchical Cluster Analysis dialog box contains a Label Cases by field, in which you can optionally place a text variable identifying the respondents. In our case, the variable q4, which encodes the airlines chosen by the respondents, could serve this purpose. In practice it is difficult to come up with a rational reason to use the Label Cases by field, so you can safely leave it empty.

Fig. 5.47.

The Statistics dialog box, called by the button of the same name in the main dialog box, is used infrequently when performing cluster analysis. It allows a Cluster Membership table to be displayed in the SPSS Viewer window, in which each respondent in the source data file is mapped to a cluster number. With a sufficiently large number of respondents (as in almost all marketing research examples), this table is completely useless, since it is a long sequence of "respondent number / cluster number" pairs that cannot be interpreted in this form. The technical goal of cluster analysis is always to create an additional variable in the data file reflecting the division of respondents into target groups (by clicking the Save button in the main cluster analysis dialog box); this variable, paired with the respondent numbers, is exactly what the Cluster Membership table contains. The only practical option in the Statistics window is the display of the Average Linkage (Between Groups) table, but it is already enabled by default. Thus, using the Statistics button to display a separate Cluster Membership table in the SPSS Viewer window is not practical.

The Plots button has already been mentioned above: it should be deactivated by deselecting the Plots parameter in the main cluster analysis dialog box.

In addition to these rarely used features, the cluster analysis procedure in SPSS also offers some very useful options. Among them, first of all, is the Save button, which allows you to create a new variable in the source data file distributing respondents among clusters. The main dialog box also has an area for selecting the object of clustering: respondents or variables. This possibility was discussed above in section 5.4. In the first case, cluster analysis is mainly used to segment respondents by some criteria; in the second, its purpose is similar to that of factor analysis: the classification (reduction in number) of variables.

As can be seen from Fig. 5.44, the only cluster analysis option not yet considered is the Method button, which selects how the statistical procedure is conducted. Experimenting with this parameter allows greater accuracy in determining the optimal number of clusters. The general view of this dialog box with the default settings is shown in Fig. 5.48.

Fig. 5.48.

The first thing set in this window is the method of forming clusters (that is, of combining observations). Among all the statistical methods offered by SPSS, you should choose either the default Between-groups linkage method or Ward's method. The first is used more often, owing to its versatility and the relative simplicity of the underlying statistical procedure: the distance between clusters is calculated as the average of the distances between all possible pairs of observations, one observation taken from one cluster and the other from the other at each iteration. Ward's method is harder to understand and less commonly used; it consists of many steps and is based on averaging the values of all variables for each observation and then summing the squared distances from the calculated means to each observation. For the practical tasks of market research, we recommend always using the default Between-groups linkage method.
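Both recommended schemes can be sketched side by side with SciPy (synthetic data; "average" corresponds to Between-groups linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 7))             # stand-in ratings matrix

Z_between = linkage(X, method="average")  # mean of all pairwise inter-cluster distances
Z_ward = linkage(X, method="ward")        # minimal growth of within-cluster sum of squares
print(Z_between[-1, 2], Z_ward[-1, 2])    # the heights of the final merges differ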

After selecting the statistical clustering procedure, a method of calculating the distances between observations is chosen (the Measure area of the Method dialog box). Different distance methods are offered for the three types of variables that can take part in cluster analysis (segmentation criteria): interval (Interval), nominal (Counts) and dichotomous (Binary). The dichotomous (Binary) scale is meant only for variables reflecting the occurrence or non-occurrence of an event (bought/not bought, yes/no, etc.); other kinds of dichotomous variables (for example, male/female) should be treated and analyzed as nominal (Counts).

The most commonly used method of determining distances for interval variables is the default Squared Euclidean Distance, which has proven itself in marketing research as the most accurate and universal. For dichotomous variables, however, where observations take only two values (for example, 0 and 1), this method is unsuitable: it accounts only for interactions of the type X = 1, Y = 0 and X = 0, Y = 1 (where X and Y are variables) and ignores other types. The most comprehensive distance measure, taking into account all important types of interaction between two dichotomous variables, is the Lambda method, which we recommend for its versatility. There are also other methods, such as Shape, Hamann or Anderberg's D.

When specifying the method of determining distances for dichotomous variables, indicate in the corresponding fields the specific values the studied dichotomous variables can take: in the Present field, the coding for Yes, and in the Absent field, for No. The field names Present and Absent reflect the fact that the Binary group of methods is meant only for dichotomous variables describing the occurrence or non-occurrence of an event. For the Interval and Binary variable types there are several methods of determining distance; for variables with a nominal scale SPSS offers only two, the Chi-square measure and the Phi-square measure, of which we recommend the first as the most common.

The Method dialog box has a Transform Values area containing a Standardize field. It is used when variables of different scale types (for example, interval and nominal) take part in the cluster analysis: to use such variables together, they must be standardized, bringing them to a single, interval type of scale. The most common method of standardization is Z-standardization (Z scores): all variables are brought to a common range of values (in practice mostly from -3 to +3) and after transformation are interval.
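A minimal sketch of this preprocessing chain, assuming SciPy and synthetic data:

import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(7)
raw = rng.normal(loc=50.0, scale=10.0, size=(200, 5))  # variables on an arbitrary scale

Xz = zscore(raw, axis=0)                 # mean 0, standard deviation 1 per variable
d = pdist(Xz, metric="sqeuclidean")      # squared Euclidean distance between observations
Z = linkage(d, method="average")         # between-groups linkage on those distances
print(Xz.mean(axis=0).round(6), Xz.std(axis=0).round(6))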

Since all the optimal methods (of clustering and of determining distances) are set by default, it is advisable to use the Method dialog box only to specify the type of variables being analyzed and, when necessary, to request Z-standardization of the variables.

So, we have described all the main features SPSS provides for cluster analysis. Let us return to the cluster analysis carried out to segment the airlines. Recall that we settled on a sixteen-cluster solution and created a new variable clu16_1 in the source data file, distributing all the analyzed respondents among the clusters.

To check how correctly we determined the optimal number of clusters, let us build a linear distribution of the clu16_1 variable (menu Analyze > Descriptive Statistics > Frequencies). As seen in Fig. 5.49, in clusters 5-16 the number of respondents ranges from 1 to 7. Along with the universal method described above for determining the optimal number of clusters (based on the difference between the total number of respondents and the step of the first jump in the agglomeration coefficient), there is an additional recommendation: the size of each cluster should be statistically and practically meaningful. With our sample size, this critical value can be set at a level of at least 10. We see that only clusters 1-4 meet this condition; therefore, the cluster analysis procedure must now be recalculated with a four-cluster solution (a new variable clu4_1 will be created).


Fig. 5.49.

Having built a linear distribution of the newly created variable clu4_1, we see that only two clusters (1 and 2) contain a practically significant number of respondents. We need to rebuild the cluster model once more, now for a two-cluster solution. After that, we construct the distribution of the variable clu2_1 (Fig. 5.50). As the table shows, the two-cluster solution has a statistically and practically significant number of respondents in each of the two clusters: 695 respondents in cluster 1 and 40 in cluster 2. Thus, we have determined the optimal number of clusters for our task and carried out the actual segmentation of respondents according to the seven selected criteria. The main goal of the task can now be considered achieved, and we can proceed to the final stage of cluster analysis: the interpretation of the obtained target groups (segments).


Fig. 5.50.

The resulting solution differs somewhat from those you may have seen in SPSS teaching aids. Even the most practice-oriented textbooks provide artificial examples in which clustering produces ideal target groups of respondents; in some cases (5) the authors even point directly to the artificial origin of the examples. In this tutorial we use, as an illustration of how cluster analysis works, a real example from practical marketing research that is not distinguished by ideal proportions. This allows us to show the most common difficulties in conducting cluster analysis, as well as the best ways to eliminate them.

Before proceeding with the interpretation of the resulting clusters, let's summarize. We have the following scheme for determining the optimal number of clusters.

¦ At stage 1, we determine the number of clusters by the mathematical method based on the agglomeration coefficient.

¦ At stage 2, we cluster the respondents into the obtained number of clusters and then build a linear distribution of the newly formed variable (clu16_1). From it we determine how many clusters consist of a statistically significant number of respondents. As a rule, the minimum significant cluster size should be set at no fewer than 10 respondents.

¦ If all clusters satisfy this criterion, we proceed to the final stage of cluster analysis, the interpretation of clusters. If there are clusters with an insignificant number of observations, we determine how many clusters do consist of a significant number of respondents.

¦ We recalculate the cluster analysis procedure by specifying in the Save dialog box the number of clusters consisting of a significant number of observations.

¦ We build a linear distribution on a new variable.

This sequence of actions is repeated until a solution is found in which all clusters will consist of a statistically significant number of respondents. After that, you can proceed to the final stage of cluster analysis - the interpretation of clusters.

It should be specially noted that the criterion of practical and statistical significance of the number of clusters is not the only criterion by which the optimal number of clusters can be determined. The researcher can independently, based on his experience, suggest the number of clusters (the condition of significance must be satisfied). Another option is a fairly common situation when, for the purposes of the study, a condition is set in advance to segment respondents according to a given number of target groups. In this case, you just need to perform a hierarchical cluster analysis once, keeping the required number of clusters, and then try to interpret what happens.

To describe the resulting target segments, one should use the procedure for comparing the mean values of the studied variables (the cluster centroids). We will compare the mean values of the seven segmentation criteria in each of the two resulting clusters.

The procedure for comparing means is called via the Analyze > Compare Means > Means menu. In the dialog box that opens (Fig. 5.51), select the seven variables chosen as segmentation criteria (q13-q19) from the left-hand list and transfer them to the Dependent List field. Then move the variable clu2_1, which reflects the division of respondents into clusters in the final (two-cluster) solution, from the left-hand list into the Independent List field. Then click the Options button.

Fig. 5.51.

The Options dialog box will open; select in it the statistics needed to compare the clusters (Fig. 5.52). To do this, leave only Mean in the Cell Statistics field, removing the other default statistics. Close the Options dialog box by clicking Continue and, finally, start the mean comparison procedure from the main Means dialog box (the OK button).

Fig. 5.52.

In the SPSS Viewer window that opens, the results of the statistical mean-comparison procedure appear. We are interested in the Report table (Fig. 5.53). It shows on what basis SPSS divided the respondents into two clusters; in our case, the criterion is the level of the ratings on the analyzed parameters. Cluster 1 consists of respondents whose average scores on all segmentation criteria are at a comparatively high level (4.40 points and above); cluster 2 consists of respondents who rated the segmentation criteria quite low (3.35 points and below). Thus, we can conclude that the 93.3% of respondents forming cluster 1 rated the analyzed airlines as good in all respects; 5.4% rated them quite low; and 1.3% found it difficult to answer (see Fig. 5.50). From Fig. 5.53 one can also conclude what level of rating counts as high and what level as low for each parameter considered separately (and this conclusion is made by the respondents themselves, which allows a high classification accuracy to be achieved). The Report table shows that for the queue management variable an average score of 4.40 is considered high, while for the appearance parameter it is 4.72.


Fig. 5.53.

It may turn out that in a similar case 4.5 is considered a high score for parameter X, and only 3.9 for parameter Y. This is not a clustering error; on the contrary, it allows an important conclusion about the significance of the parameters for the respondents: for parameter Y, 3.9 points is already a good score, while respondents impose more stringent requirements on parameter X.
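Outside SPSS, such a centroid comparison can be sketched with pandas (synthetic stand-in data; the variable names mirror those used above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.integers(1, 6, size=(745, 7)),
                  columns=[f"q{i}" for i in range(13, 20)])
df["clu2_1"] = rng.integers(1, 3, size=len(df))   # stand-in membership variable

report = df.groupby("clu2_1").mean().round(2)     # cluster centroids, as in Fig. 5.53
print(report)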

We have identified two significant clusters differing in the level of their average scores on the segmentation criteria. Now we can label the clusters: cluster 1, airlines meeting the respondents' requirements (on the seven analyzed criteria); cluster 2, airlines not meeting them. We can now see which particular airlines (coded in the variable q4) meet the respondents' requirements and which do not. To do this, build a cross-tabulation of the variable q4 (the analyzed airlines) against the clustering variable clu2_1. The results of this cross-tabulation are presented in Fig. 5.54.

Based on this table, the following conclusions can be drawn regarding the membership of the studied airlines in the selected target segments.


Fig. 5.54.

1. Airlines that fully meet the requirements of all their customers with regard to the work of ground personnel (falling into the first cluster only):

¦ Vnukovo Airlines;

¦ American Airlines;

¦ Delta Airlines;

¦ Austrian Airlines;

¦ British Airways;

¦ Korean Airlines;

¦ Japan Airlines.

2. Airlines that meet the requirements of most of their customers in terms of the work of ground personnel (most of the respondents flying with these airlines are satisfied with the work of ground personnel):

¦ Transaero.

3. Airlines that do not meet the requirements of the majority of their customers in terms of the work of ground personnel (most of the respondents flying with these airlines are not satisfied with the work of ground personnel):

¦ Domodedovo Airlines;

¦ Pulkovo;

¦ Siberia;

¦ Ural Airlines;

¦ Samara Airlines.

Thus, by the level of their average ratings, three target segments of airlines were obtained, characterized by different degrees of respondent satisfaction with the work of ground personnel:

  • the most attractive airlines for passengers in terms of ground personnel performance (14);
  • rather attractive airlines (1);
  • rather unattractive airlines (7).

We have successfully completed all stages of cluster analysis and segmented airlines according to seven selected criteria.

Let us now describe the methodology of cluster analysis used in tandem with factor analysis, reusing the problem conditions from Section 5.2.1 (factor analysis). As already mentioned, in segmentation problems with a large number of variables it is advisable to precede cluster analysis with factor analysis, reducing the segmentation criteria to the most significant ones. In our case, the original data file contained 24 variables; factor analysis reduced them to 5 factors, which can now be used effectively in cluster analysis, with the factors themselves serving as segmentation criteria.

If we face the task of segmenting respondents by their assessment of various aspects of airline X's current competitive position, we can conduct a hierarchical cluster analysis on the five criteria identified (variables nfac1_1-nfac5_1). In our case, the variables were measured on scales of opposite polarity. For example, a score of 1 for the statement "I would not want the airline to change" and the same score for the statement "Changes in the airline will be a positive thing" are diametrically opposed in meaning: in the first case, 1 point (strongly disagree) means that the respondent welcomes changes in the airline, while in the second it indicates that the respondent rejects them. When interpreting clusters we would inevitably run into difficulties, since variables so opposite in meaning can fall into the same factor. For segmentation purposes, therefore, it is recommended first to align the scales of the variables under study, then to recalculate the factor model, and only then to carry out the cluster analysis on the resulting factor variables. We will not describe the factor and cluster analysis procedures in detail again (this was done above in the relevant sections); we only note that with this technique we obtained three target groups of air passengers differing in the level of their scores on the selected factors (that is, on groups of variables): the lowest, the average and the highest.

A very useful application of cluster analysis is the division of frequency tables into groups. Suppose we have a linear distribution of answers to the question "What brands of antivirus are installed in your organization?". To draw conclusions from this distribution, the brands must be divided into several groups (usually 2-3). To divide all brands into three groups (the most popular brands, moderately popular brands, and unpopular brands), it is best to use cluster analysis, although researchers as a rule separate the elements of frequency tables by eye, on subjective grounds. In contrast to that approach, cluster analysis makes it possible to give the grouping a scientific basis. To do this, enter the values of each parameter into SPSS (preferably as percentages) and perform a cluster analysis on these data; saving the cluster solution for the required number of groups (3 in our case) as a new variable yields a statistically valid grouping.
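A minimal sketch of this grouping, with hypothetical brand percentages:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

shares = np.array([38.0, 31.0, 12.0, 9.0, 6.0, 4.0])      # hypothetical percentages
Z = linkage(shares.reshape(-1, 1), method="average")
tiers = fcluster(Z, t=3, criterion="maxclust")             # 3 popularity tiers
for share, tier in zip(shares, tiers):
    print(f"{share:5.1f}% -> group {tier}")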

We will devote the final part of this section to describing the use of cluster analysis for classifying variables and comparing its results with the results of the factor analysis carried out in Section 5.2.1. To do this, we will again use the condition of the problem about assessing the current position of airline X in the air transportation market. The methodology for conducting cluster analysis almost completely repeats the one described above (when the respondents were segmented).

So, in the original data file we have 24 variables describing respondents' attitudes to various aspects of airline X's current competitive position. Open the main Hierarchical Cluster Analysis dialog box and place the 24 variables (q1-q24) in the Variable(s) field (Fig. 5.55). In the Cluster area, indicate that variables are being classified (select the Variables option). You will see that the Save button is no longer available: unlike factor analysis, cluster analysis cannot save group scores for each respondent. Disable plotting by deactivating the Plots option. No other options are needed at the first step, so simply click OK to start the procedure.

The Agglomeration Schedule table appears in the SPSS Viewer window; from it we determine the optimal number of clusters using the method described above (Fig. 5.56). The first jump in the agglomeration coefficient occurs at step 20 (from 18834.000 to 21980.967). Given that the total number of analyzed variables is 24, the optimal number of clusters is 24 - 20 = 4.

Fig. 5.55.


Fig. 5.56.

When classifying variables, a cluster consisting of only one variable is practically and statistically significant. Therefore, since we have obtained an acceptable number of clusters by the mathematical method, no further checks are required. Instead, open the main cluster analysis dialog box again (all the data used in the previous step is preserved) and click the Statistics button to display the classification table. You will see a dialog box of the same name, where you must specify the number of clusters into which 24 variables must be divided (Fig. 5.57). To do this, select the Single solution option and specify the required number of clusters in the corresponding field: 4. Now close the Statistics dialog box by clicking the Continue button and run the procedure from the main cluster analysis window.

As a result, the Cluster Membership table will appear in the SPSS Viewer window, distributing the analyzed variables into four clusters (Fig. 5.58).

Fig. 5.58.

According to this table, each considered variable can be assigned to a specific cluster as follows.

Cluster 1

q1. Airline X has a reputation for excellent passenger service.

q2. Airline X can compete with the best airlines in the world.

q3. I believe that Airline X has a promising future in global aviation.

q5. I am proud to work for Airline X.

q9. We have a long way to go before we can claim to be a world class airline.

q10. Airline X really cares about passengers.

q13. I like how Airline X presents itself visually to the general public (in terms of colors and corporate identity).

q14. Airline X is the face of Russia.

q16. Airline X's service is consistent and recognizable throughout

q18. Airline X needs to change in order to exploit its full potential.

q19. I think Airline X needs to present itself visually in a more modern way.

q20. Changes in airline X will be a positive thing.

q21. Airline X is an efficient airline.

q22. I would like to see the image of airline X improve in terms of foreign passengers.

q23. Airline X is better than most people think.

q24. It is important that people all over the world know that we are a Russian airline.

Cluster 2

q4. I know what the future strategy of Airline X will be.

q6. Airline X has good communication between departments.

q7. Each employee of the airline makes every effort to ensure its success.

q8. Now Airline X is improving rapidly.

q11. There is a high degree of job satisfaction among airline employees.

q12. I believe that senior managers work hard to achieve the airline's success.

Cluster 3

q15. We look like "yesterday" compared to other airlines.

Cluster 4

q17. I would not want airline X to change.

If you compare the results of the factor analysis (Section 5.2.1) with those of the cluster analysis, you will see that they differ significantly. Cluster analysis not only provides considerably fewer options for grouping variables (for example, no ability to save group scores) than factor analysis, but also produces much less clear results. In our case, clusters 2, 3 and 4 lend themselves to logical interpretation, but cluster 1 contains statements that are quite different in meaning. In such a situation one can either try to describe cluster 1 as it is, or rebuild the statistical model with a different number of clusters. In the latter case, to find an optimal number of clusters that can be described logically, use the Range of solutions parameter in the Statistics dialog box (see Fig. 5.57), specifying the minimum and maximum number of clusters in the corresponding fields (in our case, 4 and 6 respectively). SPSS will then rebuild the Cluster Membership table for each number of clusters; the analyst's task is to choose the classification model in which all clusters are interpreted unambiguously. To demonstrate the capabilities of the variable-clustering procedure, we will not rebuild the cluster model here, but will limit ourselves to what has been said above.

It should be noted that, despite the apparent simplicity of cluster analysis compared to factor analysis, in almost all cases of marketing research factor analysis proves faster and more effective. Therefore, for the classification (reduction) of variables we strongly recommend factor analysis, leaving cluster analysis for the classification of respondents.

Classification analysis is perhaps one of the most complex statistical tools from the point of view of an unprepared user, which explains its very low prevalence in marketing companies. At the same time, this group of statistical methods is among the most useful for practitioners in the field of marketing research.