Teaching machines to tell stories
EDXML enables machines to tell stories. Before we explain how that works, let us briefly state why we want to do this in the first place. The goal is to make computers interact more naturally with human data analysts. We want to perceive computers not just as machines that take commands and execute them, but as knowledgeable assistants that can actively help us find the missing pieces of the puzzle, connect the dots, and reveal the bigger picture.
For thousands of years humans have exchanged knowledge by telling stories. This is our source of inspiration. We can teach machines to tell stories too, and to see the story told by the data that is given to them. Machines can then understand our data the way we do, see how it fits together, and recognize patterns.
Your computer will be your guide rather than you having to guide your computer.
We will demonstrate how machines can tell stories by telling the story of Snow White. Along the way, we will show how we can use EDXML as a language to express the meaning of data in a way that both humans and machines can understand.
To give an impression of what things look like 'under the hood' we will be showing snippets of XML. In case you are not familiar with that, it may look a bit intimidating. Do not worry though, fully understanding these snippets is not essential to follow along.
So let us begin the story of Snow White by introducing the main character:
Note that the XML snippets shown here have been slightly abbreviated. As such they may not be valid EDXML.
What we see here is called an event. In EDXML, all data is represented as events. While the above event may look complicated, events are actually not much different from rows in a spreadsheet. In our example, the event has three 'columns':
look. These columns are called properties. Properties have values called objects.
When we, humans, look at the above event and see a property 'look' with value 'handsome', then we can easily guess what the meaning of that data is. Machines cannot do that. Throughout the course of our lives we become familiar with a vast number of concepts. Machines do not.
To a machine, the event displayed above has no meaning whatsoever. This is key to understanding what EDXML is all about: giving machines access to those concepts and enabling them to work with data knowing what that data means, just like humans do.
But how do we do that? In EDXML, events are given meaning by associating them with an event type. You may already have spotted the following detail in the event that we showed earlier:
This associates the event with an event type named character-intro. An event type defines which properties the events have, what kinds of values these properties have, which concept the property refers to, and so on. We will show this in some detail in a moment.
EDXML integrates event type definitions and events into a single data format. This ensures that no data ever goes without its meaning.
An event type allows events to be transformed from a row in a spreadsheet to a paragraph in a novel. Literally. Based on the event type definition a computer can translate the above EDXML event into the following text:
Once upon a time there was a princess named Snow White, who looked particularly handsome and friendly.
So here we have a machine explaining the exact meaning of the data to us. We will see how that trick works later.
As mentioned earlier, event types define which properties an event can have. For example, the definition of the name property might look like this:
This is basically just the equivalent of saying that the rows in a spreadsheet have a column titled "name".
The real magic happens when properties get associated with concepts. Let us extend the property definition to include a concept association:
Now the property is associated with a concept named princess. Concepts describe what the data is about. A concept is like a character in the story that the data tells us. A concept can make its appearance in many places throughout the data set. Remember how the example event mentioned a name:
Using the property definition and its associated concept, a computer can now learn what this part of the event means:
There is a princess. The princess has a name. The name is "Snow White".
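The reasoning step above can be sketched in a few lines of Python. This is only an illustration of the idea, not EDXML syntax or the EDXML SDK: the dictionary layout and the explain function are hypothetical names introduced here.

```python
# Hypothetical stand-in for a property definition with a concept association.
# The layout is illustrative only; it is not EDXML syntax.
property_definitions = {
    "name": {"concept": "princess"},
}

# The part of the example event that mentions a name.
event = {"name": "Snow White"}


def explain(event, definitions):
    """Spell out what each concept-associated property in the event means."""
    sentences = []
    for prop, value in event.items():
        concept = definitions.get(prop, {}).get("concept")
        if concept is None:
            continue  # without a concept association nothing can be said
        sentences.append(
            f"There is a {concept}. The {concept} has a {prop}. "
            f'The {prop} is "{value}".'
        )
    return sentences


print(" ".join(explain(event, property_definitions)))
```

Without the concept association, the function has nothing to say about the property, which mirrors the point that a bare column title carries no meaning for a machine.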
In a data set there can be many events that refer to a particular concept. Each event may reveal more information about the concept. Just like the reader of a novel gradually gets to know a character by reading the novel, paragraph by paragraph, a machine learns all about a concept by reading the EDXML data, event by event.
Data analysis typically requires the analyst to "read the book" first. What we mean by this is that the analyst repeatedly searches the data set to find related bits of information that are needed to answer a question. Using EDXML, machines can read the book for us and tell us about it.
The story continues. We learn that the princess lives in a castle:
This event has a different event type, because it has different properties and generally a different meaning. Let us assume that the name property is defined in a similar way as in the previous event type. It is associated with the same concept. The whereabouts property tells us where the character lives. It could be defined like this:
Now, both the name and whereabouts properties are associated with the princess concept. This suggests that both are related.
Now look at our 'whereabouts' event. What does it mean? As humans, we are quick to infer that both "the castle" and "Snow White" are properties of one particular princess. Snow White lives in a castle, right? Well, the data does not explicitly say this. The name and whereabouts could also be details about two different princesses.
More technically speaking: there could be two distinct instances of the princess concept. Somehow, we need to state explicitly that both properties refer to the same instance. We can do that by adding a property relation to the event type definition. Such a definition could look like this:
What we see here is an intra-concept relation, relating the name property to the whereabouts property. In EDXML an intra-concept relation indicates that both properties refer to the same concept instance. The predicate tells us in more detail how the two are related. Using this predicate, which is just free-form text, computers can describe relations between property objects:
Snow White lives in the castle
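As a rough sketch of how a predicate turns two property objects into a sentence, consider the Python below. The IntraConceptRelation class, its field names and the predicate text "lives in" are assumptions made for this illustration; they are not taken from the EDXML specification or SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntraConceptRelation:
    """Hypothetical model of a relation between two properties of one event type."""
    source: str     # property whose object starts the sentence
    target: str     # property whose object ends the sentence
    predicate: str  # free-form text describing how the two are related


def describe(relation, event):
    """Describe the relation between two property objects of a single event."""
    return f"{event[relation.source]} {relation.predicate} {event[relation.target]}"


# Assumed predicate text, chosen so that it reproduces the sentence shown above.
relation = IntraConceptRelation(source="name", target="whereabouts", predicate="lives in")
event = {"name": "Snow White", "whereabouts": "the castle"}

print(describe(relation, event))
```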
Combining both events we have seen so far, machines can learn two things about our princess:
- Snow White
- the castle
By repeatedly using relations to connect one piece of information to another, machines can learn everything there is to know about the concepts in a data set. This process is called concept mining. It is a simple form of machine reasoning which is similar to human associative reasoning.
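To make the idea of following relations more tangible, here is a deliberately naive sketch in Python. It is not the algorithm used by the EDXML SDK. The event layout, the event type name "whereabouts" and the mine_concept function are all hypothetical, and real concept mining has to deal with ambiguity that this toy version ignores.

```python
from collections import deque

# Two simplified events, loosely modelled on the ones from the story.
# The event type name "whereabouts" is an assumption made for this sketch.
EVENTS = [
    {"type": "character-intro", "properties": {"name": "Snow White"}},
    {"type": "whereabouts",
     "properties": {"name": "Snow White", "whereabouts": "the castle"}},
]

# Per event type: pairs of properties tied together by an intra-concept relation.
RELATIONS = {
    "whereabouts": [("name", "whereabouts")],
}


def mine_concept(seed, events, relations):
    """Collect every (property, value) fact reachable from the seed fact."""
    known = {seed}
    queue = deque([seed])
    while queue:
        prop, value = queue.popleft()
        for event in events:
            props = event["properties"]
            if props.get(prop) != value:
                continue  # this event does not mention the fact we are following
            for a, b in relations.get(event["type"], []):
                if prop == a:
                    other = b
                elif prop == b:
                    other = a
                else:
                    continue  # the relation does not involve this property
                if other in props:
                    fact = (other, props[other])
                    if fact not in known:
                        known.add(fact)
                        queue.append(fact)
    return known


for prop, value in sorted(mine_concept(("name", "Snow White"), EVENTS, RELATIONS)):
    print(f"{prop}: {value}")
```

Starting from the name, the sketch hops across the relation in the second event and picks up the whereabouts, which is the same connect-the-dots step described above.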
The EDXML SDK contains a basic concept mining implementation. Let us use its edxml-mine utility to see what a machine can learn by reading the full Snow White data set:
Above, we actually see a machine reading some data and telling us what it learns from it. You can try this for yourself if you like. The GitHub page of the EDXML Foundation hosts both the EDXML SDK and the Snow White data set.
We also promised to reveal how EDXML events are translated into English prose. This is actually disappointingly simple. Each event type definition can contain a text template, such as this one:
[[when]] there was a [[type]] named [[name]], who looked particularly [[look]].
The object values from the event are inserted into the placeholders (the names between square brackets), et voilà. Transforming the full Snow White data set into English can be done using the edxml-to-text utility from the EDXML SDK. The following command should do the trick:
edxml-to-text -f snow-white.edxml
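The placeholder substitution itself can be imitated in plain Python. The sketch below is not how the EDXML SDK implements it, and actual EDXML templates may support more than simple substitution. The event values are inferred from the sentence shown earlier in the story, and the render and join_values helpers are hypothetical.

```python
import re

TEMPLATE = "[[when]] there was a [[type]] named [[name]], who looked particularly [[look]]."

# Property objects of the example event, inferred from the rendered sentence.
# A property may hold more than one object, so every value is a list.
EVENT = {
    "when": ["Once upon a time"],
    "type": ["princess"],
    "name": ["Snow White"],
    "look": ["handsome", "friendly"],
}


def join_values(values):
    """Join multiple objects the way English prose would: 'a, b and c'."""
    if len(values) == 1:
        return values[0]
    return ", ".join(values[:-1]) + " and " + values[-1]


def render(template, event):
    """Replace each [[property]] placeholder with the objects of that property."""
    def fill(match):
        return join_values(event[match.group(1)])
    return re.sub(r"\[\[([^\[\]]+)\]\]", fill, template)


print(render(TEMPLATE, EVENT))
```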
Hopefully the Snow White story helped you get a basic idea of what EDXML is. You can use the links below to continue reading about specific subjects in more detail.
Explore the idea of story telling and how it is embedded in the data modelling mechanisms of EDXML.
If you have ever put together a jigsaw puzzle, you are not far from understanding how EDXML concept mining works.
About the science of knowledge representation and where EDXML fits in.
Ontologies are the technical heart that makes EDXML tick. This article uncovers a bit of how machines view the human world.
About the EDXML specification, which describes in detail how EDXML data is structured.
The EDXML Software Development Kit provides the basic software components for developing EDXML generators, processors, storage backends and so on.
Copyright © The EDXML Foundation