4 key components of a strong graph data product builder
As digital architects and data scientists, we constantly need to improve our ability to manage, understand, and make use of data. One approach that has gained traction in recent years is the use of semantic knowledge graphs based on ontologies and taxonomies. Such knowledge graphs provide a powerful means of organizing and connecting data, making it more expressive and useful.
Knowledge canvas showing a configured canvas with links between selected security controls and the NIST Cybersecurity Framework.
However, building and managing these knowledge graphs can be challenging, especially when working with large, complex datasets where boundaries are fuzzy. In this blog post, we will explore how our DFRNT data product builder platform helps build accurate semantic knowledge graphs.
Accurate graph data products help digital architects and data scientists streamline the process of building and managing knowledge graphs, and ultimately enable better decision-making and data-driven insights.
Key elements of a graph data product builder platform
A data product builder platform for semantic knowledge graphs is a tool that allows digital architects and data scientists to create and manage accurate knowledge graphs as data products with ease. Graph data is often represented as RDF triples or as property graphs. Without constraints, accuracy and governance suffer; to enable accurate data to be created, a schema is needed.
Strong schema for RDF knowledge graphs, a key differentiator
Triples can be collected together into documents, and there is a perfect format for doing just that: JSON-LD. Each "layer" of a JSON-LD document corresponds to triples between subjects, predicates/properties, and objects. This layered form of JSON-LD is called framed JSON-LD.
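To make the layering concrete, here is a minimal sketch of how a framed JSON-LD document maps to triples. The vocabulary, identifiers, and the `triples` helper are invented for illustration; a real implementation would use a JSON-LD processor.

```python
# A framed JSON-LD document: nested objects mirror the subject ->
# predicate -> object structure of the underlying triples.
# The vocabulary IRI and identifiers are illustrative only.
doc = {
    "@context": {"@vocab": "https://example.org/vocab#"},
    "@id": "ex:control-1",
    "@type": "SecurityControl",
    "label": "Access control policy",
    "mitigates": {
        "@id": "ex:risk-7",
        "@type": "Risk",
        "label": "Unauthorized access",
    },
}

def triples(node):
    """Walk a framed JSON-LD node and yield (subject, predicate, object)."""
    subject = node["@id"]
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        if key == "@type":
            yield (subject, "rdf:type", value)
        elif isinstance(value, dict):
            yield (subject, key, value["@id"])
            yield from triples(value)  # each nested "layer" adds more triples
        else:
            yield (subject, key, value)

for t in triples(doc):
    print(t)
```

Each nested object contributes one linking triple plus its own triples, which is exactly the layering the framed form makes visible.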
A good framed JSON-LD modeller enables data structures and enums to be modelled in reusable ways across data products, preferably with multiple inheritance of property definitions and composable data structures. The schema should make it possible to effectively manage taxonomies, ontologies, and instances of defined entities.
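A toy sketch of what multiple inheritance of property definitions could look like: class documents declare properties and parents, and a resolver collects the effective property set. The schema format here is invented for the example, not the actual DFRNT or TerminusDB schema language.

```python
# Invented schema documents with multiple inheritance and an enum;
# the "@inherits"/"@enum" keys are illustrative, not a real spec.
schema = {
    "Named":     {"@inherits": [], "label": "xsd:string"},
    "Versioned": {"@inherits": [], "version": "xsd:integer"},
    "Status":    {"@enum": ["draft", "approved", "retired"]},
    # SecurityControl composes properties from both parent classes.
    "SecurityControl": {
        "@inherits": ["Named", "Versioned"],
        "status": "Status",
    },
}

def properties(cls):
    """Resolve the effective properties of a class, parents first."""
    doc = schema[cls]
    resolved = {}
    for parent in doc.get("@inherits", []):
        resolved.update(properties(parent))
    # local declarations override anything inherited
    resolved.update({k: v for k, v in doc.items() if not k.startswith("@")})
    return resolved

print(properties("SecurityControl"))
```

Resolving parents first means reusable property definitions like `label` and `version` are written once and composed into many entity types.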
Having a strong schema enables governance, quality and deriving accurate answers from the knowledge graph.
Importing and creating data
Excel and other tabular tools are important for data people like us, even though they are poorly suited to graph data. What makes them great is that it is easy to shape lists into an importable form. A good tool enables graph imports from Excel, but also makes it possible to import advanced data structures.
A well-integrated graph data product builder needs to support everything from importing a flat list of records, essentially CSV, to deeply nested and interconnected graph data. RDF is built on web URIs (more specifically, IRIs) and forms links between triples using them. JSON-LD is just an expression of such triples, and a convenient way to organize them in a recognizable form.
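The CSV-to-graph step above can be sketched in a few lines: each row becomes a subject, each column a property, and cross-references between rows become IRI-to-IRI links. The column names and base IRI are made up for the example.

```python
import csv
import io

# A flat CSV export (e.g. from Excel); columns are illustrative.
raw = """id,name,depends_on
app-1,Billing,app-2
app-2,Ledger,
"""

BASE = "https://example.org/resource/"  # assumed base IRI

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    subject = BASE + row["id"]
    triples.append((subject, BASE + "name", row["name"]))
    if row["depends_on"]:  # empty cells produce no triple
        # cross-references between rows become graph edges
        triples.append((subject, BASE + "depends_on", BASE + row["depends_on"]))

for t in triples:
    print(t)
```

The same pattern scales from a simple record list to nested structures by letting some columns reference other rows' identifiers.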
Building schemaful documents means that they can be checked using JSON Schema, and it should be possible to generate that JSON Schema from the schema that governs the documents stored in a data product builder.
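As a sketch of that generation step, here is a minimal mapping from an invented class definition to a JSON Schema document. The class format and the XSD-to-JSON type table are assumptions for illustration.

```python
import json

# Illustrative mapping from XSD datatypes to JSON Schema types.
XSD_TO_JSON = {"xsd:string": "string", "xsd:integer": "integer"}

# A made-up class definition governing stored documents.
control_class = {"label": "xsd:string", "version": "xsd:integer"}

def to_json_schema(cls):
    """Derive a JSON Schema that validates documents of this class."""
    return {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {k: {"type": XSD_TO_JSON[v]} for k, v in cls.items()},
        "required": sorted(cls),  # all declared properties mandatory here
    }

print(json.dumps(to_json_schema(control_class), indent=2))
```

Generating JSON Schema from the governing schema keeps validation in ordinary JSON tooling while the graph schema remains the single source of truth.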
No knowledge graph platform is complete without a strong ability to work with the data. The best ones are coupled not only with a simple querying engine that can follow paths, but with advanced expert system capabilities and datalog reasoning. Most systems struggle to visualize queries, but the best ones offer multiple ways to interact with their data, preferably document APIs, GraphQL, and some form of datalog engine.
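To illustrate the kind of rule a datalog engine evaluates, here is a naive fixpoint computation of a transitive "depends on" relation over toy edges; real engines evaluate such rules far more efficiently.

```python
# Toy base facts: direct dependencies between made-up systems.
edges = {("app-1", "app-2"), ("app-2", "db-1")}

def transitive_closure(pairs):
    """Apply depends(X, Z) :- depends(X, Y), depends(Y, Z) to a fixpoint."""
    closure = set(pairs)
    changed = True
    while changed:  # keep applying the rule until nothing new is derived
        changed = False
        for (x, y1) in list(closure):
            for (y2, z) in list(closure):
                if y1 == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

print(transitive_closure(edges))
```

The derived fact that `app-1` transitively depends on `db-1` is exactly the kind of answer path-following queries and datalog reasoning produce from a knowledge graph.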
Visualizing and navigating data
What I found most important when building a graph data platform was the ability to incrementally build canvases to see the data and how it is linked. Most software, like spreadsheets and mind-mapping tools, supports tree-like views and tabular data. But where I got stuck was with hyperconnected data, which is anything but tree-structured.
Being able to view graphs in multiple forms, with heatmaps, filtering, force-directed diagrams and directed graphs is a key part of good graph-interactive environments.
Beyond visualisation, what is even more important for exploratory data work is being able to see, query, computationally reason about, and edit the data for each defined entity in the system in a structured way, and to navigate the data structures in a wiki-like format.
Lastly, lineage, provenance and version control
The software world has an enormous advantage over the data world: effective version control. A solid graph data product approach should include the ability to create branches for experiments, pull successful experiments back in, rebase changes from the main branch onto a branch, and merge a branch back into the main branch again.
Such a git-for-data approach should support collaborators working concurrently on branches, with push/pull abilities to exchange changes, whether they are simulating scenarios, building hypotheses, or just want to work in solitude for a while on a new data schema. The graph should record who committed each change and when, and keep a full history of changes.
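The branching and history model described above can be sketched as a toy in-memory repository; everything here (class name, commit layout, the replay strategy) is invented to illustrate the idea, not how any real git-for-data engine works.

```python
# A toy git-for-data: commits record triples added/removed plus author,
# and branches are named pointers into the commit history.
class GraphRepo:
    def __init__(self):
        self.commits = {}            # id -> (parent, author, adds, removes)
        self.branches = {"main": None}

    def commit(self, branch, author, adds=(), removes=()):
        parent = self.branches[branch]
        cid = f"c{len(self.commits)}"
        self.commits[cid] = (parent, author, frozenset(adds), frozenset(removes))
        self.branches[branch] = cid
        return cid

    def branch(self, name, start="main"):
        """Create a branch pointing at the same commit as `start`."""
        self.branches[name] = self.branches[start]

    def triples(self, branch):
        """Replay the commit chain to materialize the branch's graph."""
        chain, cid = [], self.branches[branch]
        while cid is not None:
            chain.append(self.commits[cid])
            cid = self.commits[cid][0]
        state = set()
        for _, _, adds, removes in reversed(chain):
            state -= removes
            state |= adds
        return state

repo = GraphRepo()
repo.commit("main", "alice", adds={("app-1", "name", "Billing")})
repo.branch("experiment")
repo.commit("experiment", "bob", adds={("app-1", "status", "draft")})
print(repo.triples("main"))        # unaffected by the experiment branch
print(repo.triples("experiment"))  # includes both commits
```

Because each commit stores its author and parent, the full who/when history falls out of the chain, and an experiment branch never disturbs `main` until it is merged.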
Getting a platform that excels on all of this
Wouldn't this be quite the platform? For keeping the hyperconnected data that no longer fits Excel, for testing a hypothesis about the data structures that interconnect all the disconnected pieces of a work process, or for quickly preparing, importing, and navigating data that needs to be visualized as a graph but is just a long list of records in a CSV or tabs in an Excel workbook.
For years I have been asking architects I meet in client settings whether they have any tools to manage and visualize their hyperconnected data, and I have so far not heard of a tool they could depend upon. I have found some products that handle parts of it, but far from all of it.
The journey to build DFRNT
A while back, while researching how to build digital boundary objects, I found TerminusDB and decided to build DFRNT around it. I wanted to add security and secrets protection and orchestration, visualisation, a solid editing environment for the data model, together with hosting. What felt important was to have dependable and scalable SaaS infrastructure for the data products. I ended up building the tool I wanted to have all along, for me and the community around me.
We have many of the features on the list, including really hard ones like git-for-data, and we hope to build out the additional features that the community needs.
We build for data scientists in AI and machine learning who need data governance, for digital architects tasked with bringing order to messy data through better visualisation and organisation, and for cybersecurity professionals who secure and manage companies' digital estates, where governance, risk, and compliance need to be brought under control, with a good place to store all the data that fits no other tool.
The DFRNT tool is free to use with data products at TerminusX, and there is a subscription for hosted data products that are expensive to run as in-memory instances for high performance, backed by durable storage. There is also an option for those who choose to model their TerminusDB data products locally.