Versatile

EDXML has not been designed for representing any specific type of information. Therefore, it has a virtually unlimited scope.

Simple

Despite its broad scope, EDXML is not a complex specification. Implementing data processing systems is relatively quick and affordable.

Semantic

EDXML combines data with semantics: Computers can dynamically learn how to treat any type of information, simply by reading the EDXML stream.

So, you need to deal with dozens of types of information flowing around in your organization. Information from various sources. And now you want to combine all of this information to see the big picture. Really know what is going on. Without spending a fortune on complex data warehouse solutions.

Enter EDXML.


A BIT ABOUT EDXML


A common data representation lies at the heart of any data integration effort. Many, many different data representation standards exist, but these are typically limited to specific domains like forensics, e-commerce or cyber security. Integrating data from different domains allows for two approaches.

Option one is to create a new representation by combining aspects from all domains. This yields a custom, complex and verbose data representation that requires complex, custom-made analysis software to process it.

Option two involves finding a 'common ground' for all domains. The more domains are involved, the narrower this common ground will be and the more the 'fidelity' of the data will suffer.

EDXML offers a data representation that evades the typical scope / complexity / fidelity tradeoff by combining data with semantics. The physical structure of the data is decoupled from its semantic structure, resulting in a simple representation having a virtually unlimited scope while preserving the fidelity of the original data as much as possible.



EDXML IS VERSATILITY



EDXML has been designed without any specific type of data in mind. Data sources that generate EDXML data streams can define their own data structures that are relevant to their own domain. The addition of semantics makes these data streams self-explanatory to data consumers.

By reading the EDXML streams, computers can dynamically learn how to process, store, analyse, visualize or report the data contained within the stream, no matter what kind of data it is. This allows the use of versatile, reusable data processing components, saving the cost of developing custom systems that are tailored to your dataset.



EDXML IS SIMPLICITY



The physical structure of the data in EDXML streams is similar to the simple row-and-column structure seen in spreadsheet applications. In EDXML, the rows are called events and the columns are called properties. Due to this simple structure, EDXML data streams are easy to process.

If so desired, the semantics embedded in the EDXML stream can be used to transform the data into more complex structures, which may for instance be useful for graph analysis.

The dual character of the data representation offers different 'views' of the same data, either of which may be more appropriate for a particular task.
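
As a rough illustration of these two views, the sketch below holds a few events as flat rows of properties and then uses a small relation definition to derive graph edges from them. The event type name, the property names and the relation dictionary are hypothetical; they do not reflect actual EDXML syntax or the SDK.

```python
# Hypothetical events: each one is a flat row of named properties.
events = [
    {"type": "contact-list-entry", "owner": "alice", "contact": "bob"},
    {"type": "contact-list-entry", "owner": "alice", "contact": "carol"},
    {"type": "contact-list-entry", "owner": "bob", "contact": "carol"},
]

# Hypothetical semantics: which two properties of an event type are related, and how.
relations = {
    "contact-list-entry": ("owner", "contact", "knows"),
}


def to_edges(events, relations):
    """Derive (source, label, target) graph edges from flat events."""
    edges = []
    for event in events:
        relation = relations.get(event["type"])
        if relation is None:
            continue  # no semantics known for this event type
        source_prop, target_prop, label = relation
        edges.append((event[source_prop], label, event[target_prop]))
    return edges


edges = to_edges(events, relations)
```

The function itself knows nothing about contact lists. All domain knowledge sits in the relation definition, which in this sketch stands in for the semantics carried by the stream.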



EDXML IS SEMANTIC



EDXML allows the precise meaning of the data to be encoded in a way that computers can understand. In fact, EDXML demands that the meaning of the data is specified, allowing computers to analyse it and uncover patterns. For example, the fact that contact lists of Facebook users can be combined to visualize their social network can be encoded into EDXML, enabling computers to learn how to perform social network analysis (SNA) on the data contained in the EDXML file.


OTHER NOTABLE FEATURES


Advanced Report Generation

The semantics encoded into EDXML streams allow computers to describe the data in plain English, explaining precisely how it should be interpreted and where it originated. Even relations between fragments of data can be described, explaining the exact nature of the relation. This capability can be used for more than just generating reports. For instance, computers can use semantics to autonomously generate database queries to find related data and explain to the user what the query will do. This is most useful when developing intuitive graphical user interfaces to process EDXML data.
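
A minimal sketch of the general idea, assuming that the data source attaches a textual template to an event type and that a consumer fills it in with property values. The template syntax and the property names are hypothetical and are not taken from the EDXML specification.

```python
# Hypothetical template supplied by the data source for one event type.
template = "{user} logged in from {address} at {time}."

# Hypothetical event of that type, as a flat set of properties.
event = {"user": "alice", "address": "192.0.2.10", "time": "2014-03-01T09:30:00Z"}

# The consumer needs no knowledge of logins to produce a readable sentence.
description = template.format(**event)
```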

Syncing Live Data Sources

Data sources can generate updates for EDXML data that they generated in the past, not by generating a new version of the entire dataset (although this is also possible) but by sending only the changes. This allows EDXML to be used for efficient, near real-time synchronization of target systems with highly dynamic source systems.

Sticky Hashes

Many data sources fail to offer a unique, persistent identifier that allows specific bits of data to be referred to. EDXML mitigates this problem by specifying a hash computation method that yields an identifier that is guaranteed to be unique. These identifiers are called sticky hashes in EDXML. Every EDXML event (unit of data) has one, allowing other systems to refer to specific EDXML events.

Sticky hashes are called sticky as they do not change if the event itself is changed. This makes for a truly persistent identifier.
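
The idea can be illustrated with a simplified sketch. It assumes that the identifier is derived from only a subset of an event's properties, so that changes to the remaining properties leave it untouched. The actual computation method is defined in the specification document; the function name, the property names and the use of JSON as a canonical form are all hypothetical.

```python
import hashlib
import json


def sketch_sticky_hash(event, identifying_properties):
    """Hash only the identifying properties (illustration, not the EDXML algorithm)."""
    subset = {name: event[name] for name in sorted(identifying_properties)}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


original = {"ticket": "T-1001", "status": "open"}
updated = {"ticket": "T-1001", "status": "closed"}

hash_before = sketch_sticky_hash(original, ["ticket"])
hash_after = sketch_sticky_hash(updated, ["ticket"])

# Because the identifier is stable, a target system can apply the update in place.
store = {hash_before: original}
store[hash_after] = {**store.get(hash_after, {}), **updated}
```

In this sketch both versions of the event map to the same key, so the store ends up with a single, updated entry rather than two separate ones.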

Online Data Model Upgrades

The modular, distributed data model that characterizes EDXML allows the data model of databases to be upgraded without down time. Introducing new types of information and attaching new data sources can become part of regular operation.

Hash Chaining

Every EDXML event can refer to other EDXML events. In complex analysis environments, multiple systems might generate output events based on input events, where the output events refer back to the input events. This creates a chain of evidence that allows every event to be traced back to its source. This is one of the many features of EDXML that make it highly suitable for forensic science applications.
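
Tracing such a chain could look roughly like the sketch below. The event store, the short identifiers and the "parents" field are hypothetical stand-ins; how references are actually encoded is described in the specification document.

```python
# Hypothetical event store keyed by identifier; "parents" lists the input events.
events = {
    "e1": {"description": "raw log line", "parents": []},
    "e2": {"description": "parsed login attempt", "parents": ["e1"]},
    "e3": {"description": "brute force alert", "parents": ["e2"]},
}


def trace_to_sources(event_id, events):
    """Return all events reachable via parent references, nearest first."""
    chain, queue, seen = [], list(events[event_id]["parents"]), set()
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # an event may be reachable along several paths
        seen.add(current)
        chain.append(current)
        queue.extend(events[current]["parents"])
    return chain


chain = trace_to_sources("e3", events)
```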



FREQUENTLY ASKED QUESTIONS



Who created EDXML?

EDXML was created in 2009 by Dik Takken in a quest to find a data representation that is extensible, true to the original data, domain independent and not too complicated.

EDXML has been a formal specification since 2012, published under a permissive Creative Commons license.

Where does the name of EDXML originate from?

E is for Event, D is for Dataset. And XML is... well, you know what that is.

How is EDXML different from other versatile data serialization formats, like Apache Thrift, Protocol Buffers or Smile?

The biggest difference is the semantics included in the data stream. This allows machines to 'understand' what the data means and how to process it, providing the means to develop simple, reusable data processing components that automatically do 'the right thing' with your data.

Existing serializations require that all system components know in advance what the data means, so every component must be programmed to interpret it. With EDXML, you could say that the basic application logic needed to process the data is packaged with the data itself, generated by the data source. Only the source needs to know how the data works; other systems 'learn' from the source.

EDXML is event based. Does every event need a timestamp?

No. The name 'event' suggests that it does, but a timestamp is not mandatory. Events can contain zero, one or multiple timestamps.

Can EDXML streams contain binary data, like PDF documents and photos?

Not directly. EDXML can refer to externally stored binary files by means of hashlinks. A hashlink is an EDXML data primitive that contains the SHA-1 hash of the file it refers to.
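
Computing the SHA-1 hash of a file takes only a few lines with the Python standard library. The sketch below shows the hash computation only; the function name is hypothetical, and how the resulting value is embedded in a hashlink is defined by the specification.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA-1 digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```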

What limitations does EDXML have, representation-wise?

First of all, context is everything in EDXML. Without context, data has no meaning. Without meaning, computers have no clue what to do with it. So, you cannot just take an arbitrary bunch of information and call it an event. You need to define an event type first, which provides context for your data.

Second, information must be at least semi-structured. Events are made up of properties, some of which may be optional and some mandatory. Unstructured data, like human-written text, can be stored as event content.

Can EDXML represent nested, hierarchical data structures?

Yes, but not within a single event. You can define parent-child relationships between events, creating tree structures. These relationships are implicit, encoded in the semantics. The event data itself does not show any hierarchical structure.
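
To give an impression of how a tree can emerge from flat events, the sketch below assumes that a child event names its parent through a shared property value. The event types and property names are hypothetical, and the parent-child rule is hard-coded here where EDXML would express it in the semantics.

```python
from collections import defaultdict

# Hypothetical flat events: a file event names its folder via a property value.
events = [
    {"type": "folder", "path": "/docs"},
    {"type": "file", "name": "a.txt", "folder": "/docs"},
    {"type": "file", "name": "b.txt", "folder": "/docs"},
]


def group_children(events):
    """Map each folder path to the names of the file events that refer to it."""
    tree = defaultdict(list)
    for event in events:
        if event["type"] == "folder":
            tree[event["path"]]  # touch the key so empty folders appear too
        elif event["type"] == "file":
            tree[event["folder"]].append(event["name"])
    return dict(tree)


tree = group_children(events)
```

None of the events contains nested data; the hierarchy only appears once the relationship between the two event types is applied.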

How is EDXML different from standards like DFXML and STIX?

Scope, mainly. A single EDXML stream can mix DFXML data and STIX data, while neither DFXML nor STIX can represent EDXML data, generally.

Is EDXML a schemaless format?

No. The structure of events needs to be defined first (event types). However, introducing new event types is easy and can usually be done without software updates or down time.

Does EDXML include a transport protocol?

No, it does not. You are free to use HTTP, message queueing systems or a pile of paper as a means of transport.



RESOURCES

EDXML 2.1.0




Specification Document

This is the formal specification document.

The document both explains and complements the RelaxNG schema, and explains all aspects of EDXML, such as hash collisions, sticky hashes and event structure, in detail.

The EDXML specification is the combination of both this specification document and the RelaxNG schema.


RelaxNG Schema

The RelaxNG schema is used for partial validation of EDXML data streams.

Note that this schema is not complete: a data stream that validates against the schema is not necessarily valid EDXML. The specification document explains why this is the case and how full EDXML validation is done.

The EDXML specification is the combination of both this schema and the specification document.


SDK

The Software Development Kit is a Python module and a collection of example applications.

Both the module and the examples can be downloaded from Github.

Documentation of the Python module is available on readthedocs.org.

The module can be installed directly through Pip:

pip install edxml

