Concept Mining

Concept mining enables machines to autonomously correlate structured data to obtain 'the big picture'. The technique is inspired by how humans think and cooperate in teams to make sense of data.

Associative Reasoning

Hey guys, look what I found!

This phrase can be heard daily in teams all over the globe. Imagine a police investigation team for instance. Investigator Bob: "Look, I found an e-mail message sent to that address we found yesterday, [email protected]." He continues: "And next to this e-mail address a name: John Wilson. I guess the sender had the real name of this dude in their address book!"

A couple of desks away, Bob's colleague Alice is looking at address books from seized smartphones. She responds: "John Wilson, that sounds familiar. Where did I see that name again?" Some minutes later: "Ah, here it is. That name is in a contacts list of this phone we found. It has phone number 0031-61234567. Apparently, he frequently called a guy named Leonard."

This exchange is a typical example of associative reasoning in teams. It can be observed in police investigation teams, fraud research teams, security operations centers and many other types of environments where digging in data is key. The associative reasoning performed by humans is incredibly powerful, but also fragile. What if Alice had left her desk to go to the coffee machine while Bob said "Hey guys, look what I found"? The link with the address book in the smartphone might not have been established. Or what if Alice did not recall seeing that name in the contacts list? Same thing.

Now imagine a machine being part of that team. A machine that has all available research data at its disposal, able to locate any data record in microseconds. A machine that knows how to correlate all of that data, knows how it is structured, knows how to associate one record with related ones, just like human analysts do. Would that machine not be a perfect team member?

[Image: Fitting two jigsaw puzzle pieces together]

Data, a Jigsaw Puzzle

Data is like a pile of pieces of a jigsaw puzzle. Like the pieces of a puzzle, the individual data records come in many shapes. A single piece does not reveal much of the big picture. Getting the big picture requires a human analyst to sift through the pieces and connect the ones that fit.

Consider a very, very big jigsaw puzzle, consisting of millions of pieces: the EDXML events. The puzzle, once completed, shows a detailed picture of a big city with skyscrapers, trees, people and cars on the streets. Only, we cannot see the big picture yet. All we have is a big pile of pieces.

Completing the puzzle is a daunting task. And unfortunately, computers are generally not very helpful in trying to complete the puzzle. The reason is that computers do not know the meaning of the data they work with. They have no idea what a skyscraper or a tree is. To the computer, the pieces in the puzzle show all sorts of seemingly unrelated colorful patterns. It takes a human analyst to interpret the data and connect the dots.

Concepts

EDXML enables associating the puzzle pieces with concepts like tree or car. In a real jigsaw puzzle, many pieces convey only a suggestion of what kind of thing the piece belongs to. A piece may look like the wheel of a vehicle, but we cannot see if it is from a family car or from a truck. EDXML concepts enable computers to recognize what a piece looks like. They enable the computer to actually start puzzling in the same way as a human being: Iteratively connect pieces of a vehicle to get a better idea of what kind of vehicle it is. A piece of a kind of vehicle may turn out to fit another piece that clearly belongs to a truck, revealing that it is actually a truck.

[Image: A pile of jigsaw puzzle pieces representing a data set]

Relations and Shapes

Besides concepts, EDXML also supports defining relations between event properties. These relations have the exact same purpose as the shapes of puzzle pieces: they determine which events fit together. The EDXML specification describes how to define both inter-concept and intra-concept relations. An intra-concept relation relates information about the same concept. It can be used to connect two pieces of a car-like thing. An inter-concept relation relates different concepts. It can be used to connect a skyscraper to a car parked in front of it.
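The difference between the two kinds of relation can be sketched in a few lines of plain Python. This is only an illustration: it does not use the EDXML SDK or the ontology syntax from the specification, and the `Relation` class, the property names and the concept names are all hypothetical. In particular, deriving the kind from the two concept names is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical model of a relation between two properties of an event type."""
    source_property: str
    target_property: str
    source_concept: str
    target_concept: str

    @property
    def kind(self) -> str:
        # Both ends describe the same concept: intra-concept.
        # The ends describe different concepts: inter-concept.
        if self.source_concept == self.target_concept:
            return "intra"
        return "inter"


# Two pieces of the same car-like thing.
wheel_to_door = Relation("wheel", "door", "car", "car")

# A skyscraper and the car parked in front of it.
building_to_car = Relation("building", "parked-vehicle", "skyscraper", "car")

print(wheel_to_door.kind)    # intra
print(building_to_car.kind)  # inter
```

In this toy model the relation plays the role of the shape of a puzzle piece: it says which properties fit together, and whether joining them extends one concept or links two.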

Concept Mining

Now it is easy to understand what concept mining is. Concept mining is the process of completing the puzzle. Start with a piece that is associated with the skyscraper concept. Then, find other skyscraper pieces that fit to reveal the entire skyscraper. Repeat it to construct another skyscraper. Connect the two. And so on.

Real life is always more complicated than theory. Actual data sets are more like a puzzle that is missing a lot of pieces. Adding information from more data sources into the mix may be needed to complete the puzzle. The composable knowledge offered by EDXML enables us to do just that.

Also, real-life data puzzles may involve so many pieces that they overwhelm us humans. This is where concept mining comes in. Now machines can work tirelessly to find the pieces we need and hand them to us.

Man and machine join forces to obtain the big picture.

How it works

Implementing basic EDXML concept mining is fairly straightforward. It involves an iterative process of jumping to related events and their objects, guided by the information in the ontology. This process is illustrated in the following graphic.

[Graphic: how EDXML event property relations and shared objects can be used to automatically discover related information]

Imagine taking some object value, say e-mail address [email protected], as a starting point. This starting point is called the concept seed. And imagine searching the data set, finding that the object value appears in a property of some EDXML event. Associating the object value (o) to the property (p) of that event (e) is represented by step 1 in the above graphic.

Suppose that the e-mail address is associated with some concept (c). This might be a person concept for instance. The concept is shared by another property containing names. Both properties are related by means of an intra-concept relation which relates both e-mail addresses and names in the events to the person concept. This is step 2 in the graphic. The related property might have object value "Alice Johnson". That is step 3.

Next, let us assume that the name "Alice Johnson" also appears as an object of a property of another event (step 4). This event is of a different type. In this event type, the property containing the name is related to another property containing an organization name. Because a person and an organization are different concepts, the two properties are related by means of an inter-concept relation (step 5). The related property also has an object value, "ACME Corporation" for example (step 6).

So, by following related properties and shared object values across two events we have learned that

e-mail address [email protected] belongs to a person named Alice Johnson
Alice Johnson belongs to an organization named ACME Corporation

By iterating these steps, we might learn a lot more about Alice and ACME Corporation, find other persons that also belong to ACME and discover details about those persons. The characters in the story that the data tells us, as well as their relationships, are revealed to us in an automated fashion.
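The iteration described above can be sketched as a breadth-first search over shared object values. The following is a minimal sketch in plain Python, not the EDXML SDK and not necessarily the algorithm used by any real implementation. The event types, property names, the `mine` function and the sample data (including the stand-in address `alice@example.com` and the extra person "Bob Smith") are hypothetical.

```python
from collections import deque

# Toy data set: (event type, {property: object value}). Hypothetical.
EVENTS = [
    ("mail", {"address": "alice@example.com", "name": "Alice Johnson"}),
    ("staff", {"name": "Alice Johnson", "employer": "ACME Corporation"}),
    ("staff", {"name": "Bob Smith", "employer": "ACME Corporation"}),
]

# Toy ontology: per event type, which properties are related and how.
RELATIONS = {
    "mail": [("address", "name", "intra")],    # both describe a person
    "staff": [("name", "employer", "inter")],  # person to organization
}


def mine(seed):
    """Return (value, relation kind, related value) triples found from the seed."""
    found = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        value = queue.popleft()
        # Steps 1 and 4: find events in which the value appears as an object.
        for event_type, props in EVENTS:
            for a, b, kind in RELATIONS.get(event_type, []):
                # Steps 2 and 5: follow the relation, in either direction.
                for src, dst in ((a, b), (b, a)):
                    if props.get(src) != value or dst not in props:
                        continue
                    # Steps 3 and 6: read the object of the related property.
                    related = props[dst]
                    if related not in seen:
                        seen.add(related)
                        found.append((value, kind, related))
                        queue.append(related)
    return found


for triple in mine("alice@example.com"):
    print(triple)
```

Starting from the e-mail address as the concept seed, the sketch first reaches the name, then the organization, and finally a second person who belongs to the same organization. A real implementation would use an index instead of scanning all events on every hop.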

A blog post about concept mining algorithms can be found here.
