Department of Information Technologies, student independent work (СӨЖ): "Clustering task"



Date: 02.12.2023
Related:
СРС Бердіғалиев Ақниет ИС 20-11

Clustering goals
Why cluster data at all? Several goals can be highlighted:
• dividing complex data into groups of similar objects, which simplifies their representation and therefore the processing within each group (usually as a preliminary step in classification and regression problems);
• reducing the sample size by keeping a small number of characteristic representatives of each cluster;
• identifying atypical objects (outliers) that cannot be attributed to any cluster (classification tasks, antifraud and spam systems);
• building a hierarchy of objects (for example, the classification of animals and plants by K. Linnaeus, that is, the task of taxonomy).
What problems arise along the way? First, clustering problems often do not have a single clearly defined solution (even from the perspective of human perception). The following example shows how the same set of points can be divided into groups in different ways:
Here, the result of clustering may depend on the choice of metric, on the algorithm used, and on the initial conditions, which are often chosen randomly. Therefore, completely different clustering results can be obtained for the same data.
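To make the dependence on the starting point concrete, here is a minimal sketch. It assumes a plain k-means procedure (the text above does not name a specific algorithm), and the names `sq_dist`, `kmeans`, and the four sample points are hypothetical, chosen only for illustration. The same data is clustered twice from two different pairs of starting centers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means iterations from the given starting centers.

    Returns (labels, total within-cluster squared distance).
    """
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # nothing changed, so we have converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    cost = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, cost


# Four corners of a 4 x 1 rectangle (illustrative data, not from the lesson).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

labels_a, cost_a = kmeans(points, [(0, 0), (4, 0)])  # start: one center per side
labels_b, cost_b = kmeans(points, [(0, 0), (0, 1)])  # start: both centers on the left

print(labels_a, cost_a)  # [0, 0, 1, 1] 1.0   (left / right split)
print(labels_b, cost_b)  # [0, 1, 0, 1] 16.0  (bottom / top split)
```

Both runs converge, but to different partitions of the same four points, and the second one is clearly worse by the within-cluster cost. This is one reason practical implementations usually restart from several random initializations and keep the best result.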
Note that in the last two examples there are no clusters as such. Geometric shapes are, as a rule, not the kind of structure the customer is interested in (even though our brain distinguishes them well). And the example below is, on the whole, a single cluster. With such a distribution of sample objects, it is pointless to pose the clustering task. Nevertheless, in practice the customer often poses a clustering task for data that form one such distribution, a single cluster along all axes, in which it is almost impossible to single out any separate structures. In this case, the task should be reformulated and solved by other methods.
Clustering quality criteria
You may have a question here: what are these objects for which only the pairwise distances can be calculated, but not the cluster centers? What does it mean that they are not represented in a feature space? One example is distances between words. A word by itself is an object without numerical features. Nevertheless, there are techniques for determining how close (or far) one word is to another. The same can be done with sounds, nucleotide sequences in DNA and RNA, and so on. There are many tasks where we can define distances (a metric) but not a feature space. Or we may simply not need separate features, so extracting them would be pointless extra work. In all these cases, characteristics based only on distances can be used to assess the quality of clustering. If numerical features are given explicitly for the objects, then we can move on to quality measures that are easier to understand (in my opinion).
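As an illustration of working with distances alone, here is a small sketch. It assumes the Levenshtein (edit) distance as the word metric and the mean intra-cluster and inter-cluster distances as the quality characteristics; the text does not specify either, so both are stand-ins. The word list, the labels, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        prev = cur
    return prev[-1]


def mean_pair_distance(objects, labels, dist, same_cluster):
    """Mean distance over pairs inside one cluster (same_cluster=True)
    or over pairs from different clusters (same_cluster=False)."""
    total, count = 0, 0
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if (labels[i] == labels[j]) == same_cluster:
                total += dist(objects[i], objects[j])
                count += 1
    return total / count if count else 0.0


# Hypothetical objects without numerical features, and a candidate clustering.
words = ["cat", "cut", "cot", "house", "mouse", "horse"]
labels = [0, 0, 0, 1, 1, 1]

intra = mean_pair_distance(words, labels, levenshtein, same_cluster=True)
inter = mean_pair_distance(words, labels, levenshtein, same_cluster=False)
print(intra, inter)  # a good clustering has small intra and large inter
```

No cluster centers are computed anywhere: a "mean word" does not exist, yet the pairwise distances are enough to compare one partition with another.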

  • Partial (semi-supervised) learning: we build an algorithm that predicts a class (cluster number) for any input object, not necessarily one from the training sample.

  • Transductive learning: we build an algorithm that produces cluster predictions only for the objects of the given sample.

Where is the partial learning approach used? For example, in cataloging texts, when they need to be grouped by similarity, that is, matched against previously labeled texts (those that already have category labels). The same applies to all kinds of data, including images, sounds, DNA sequences, and so on.
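The text cataloging case can be sketched roughly as follows. This is only an illustration under simple assumptions: texts are compared by the Jaccard distance between their word sets, and a new text takes the category of its nearest labeled text. The lesson does not prescribe this rule, and the sample texts, categories, and function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus the share of common words among all words of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1 - len(wa & wb) / len(union) if union else 0.0


# Previously labeled texts (hypothetical examples).
labeled = {
    "the team won the football match": "sport",
    "the striker scored a late goal": "sport",
    "the bank raised interest rates": "finance",
    "stock markets fell on rate fears": "finance",
}


def predict(text):
    """Category of the nearest labeled text; works for any input text."""
    nearest = min(labeled, key=lambda known: jaccard_distance(text, known))
    return labeled[nearest]


# A fixed batch of unlabeled texts to be cataloged.
unlabeled = ["a late goal decided the match", "interest rates hit the markets"]
catalog = {text: predict(text) for text in unlabeled}

print(catalog)
print(predict("the bank cut rates"))  # a text that was not in the batch
```

The `predict` function corresponds to the first formulation in the list above: it answers for any input text. A transductive algorithm would stop at something like `catalog`, with answers only for the texts it was given.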
This concludes the introductory lesson on clustering. In the following lessons, we will continue this topic and consider specific algorithms for dividing objects into categories.







