
In the field of Data Mining, is there a specific sub-discipline called 'Similarity'? If yes, what does it deal with? Any examples, links, references will be helpful.

Also, being new to the field, I would like the community's opinion on how closely related Data Mining and Artificial Intelligence are. Are they synonyms, or is one a subset of the other?

by (108k points)

Similarity is a measure of how alike two data objects are.

In data mining, similarity is usually described as a distance, with the dimensions representing features of the objects. A small distance between two objects indicates a high degree of similarity, and a large distance indicates a low degree of similarity.
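To make the distance idea concrete, here is a minimal sketch using Euclidean distance. The function names, the sample points, and the `1 / (1 + distance)` conversion are assumptions chosen for illustration; other distance measures and conversions are equally valid.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]; 1 means identical."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

# Hypothetical objects, each described by two features.
p = [1.0, 2.0]
q = [1.0, 2.5]   # close to p
r = [9.0, 7.0]   # far from p

print(similarity(p, q))  # larger value: p and q are close
print(similarity(p, r))  # smaller value: p and r are far apart
```

The closer pair gets the higher score, which is all the "small distance means high similarity" rule says.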

Similarity is subjective and highly dependent on the domain and application. For example, are two people more similar because they have the same first name, or because they live in the same city? Care should be taken when calculating distance across dimensions or features that are unrelated. The values of each feature must be normalized, or else one feature could end up dominating the distance calculation. For example, suppose you judged how similar two people are by their height and by how far apart they currently live. If you measured both in centimeters, the distance between their dwellings would swamp any difference in their heights.
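The height-versus-dwelling problem can be sketched in a few lines. The data below is made up, and min-max scaling is only one possible normalization (standardizing to zero mean and unit variance is another common choice).

```python
import math

def min_max_normalize(rows):
    """Rescale each feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical people: (height in cm, position of home along a road in cm).
people = [
    [160.0, 0.0],         # A
    [190.0, 100_000.0],   # B: much taller than A, lives 1 km from A
    [161.0, 500_000.0],   # C: almost A's height, lives 5 km from A
]

# Raw distances: the home position swamps the height difference.
raw_ab = euclidean_distance(people[0], people[1])
raw_ac = euclidean_distance(people[0], people[2])

# After scaling, both features contribute on a comparable footing.
scaled = min_max_normalize(people)
scaled_ab = euclidean_distance(scaled[0], scaled[1])
scaled_ac = euclidean_distance(scaled[0], scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

Before scaling, the distances are essentially the home separation alone. After scaling, the 30 cm height gap between A and B counts as much as the full range of home positions.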

As for the relation between Data Mining and Artificial Intelligence: data mining is divided into several stages (such as defining the goal of your mining process and cleaning the data), but the real work on the data set you collected is where you need AI.

You are going to need AI techniques and algorithms to inspect the data and obtain results. These techniques belong to machine learning, whose two main types are supervised and unsupervised learning.

In supervised learning, a common technique is the neural network: you divide your data into a training set and a test set (most commonly 70% and 30%), let the network learn to classify the training data, and check it against the test data. It is supervised because you train the network by showing it examples that are already labelled with the correct class.
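The 70/30 split can be sketched with the standard library alone. The function name, the seed, and the toy data are assumptions for illustration, and the network itself is left out; in practice a library helper would usually do this step.

```python
import random

def train_test_split(samples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labelled data: (feature, label) pairs.
data = [(x, x >= 5) for x in range(10)]
train, test = train_test_split(data)

print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back to estimate how well it classifies data it has not seen.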

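In unsupervised learning, the most common family of techniques is clustering (genetic algorithms, sometimes mentioned in this context, are better described as an optimization method). It is unsupervised because you don't teach the algorithm anything: you run it on your data set and expect to discover hidden relationships in the data.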
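As a small illustration of running an algorithm on unlabelled data and letting it find the groups, here is a 1-D k-means sketch. Note that this is clustering, picked because it is short and ties back to the distance idea; it is not a genetic algorithm. The data and the starting centers are made up.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then move
    each center to the mean of the values assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabelled measurements with two visible groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = k_means_1d(values, centers=[0.0, 5.0])

print(centers)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge purely from which values are close to each other.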