Knowledge Graphs 2.0: High Performance Computing Emerges
June 2, 2021
The increasing reliance on knowledge graphs parallels the reliance on Artificial Intelligence for three reasons. Knowledge graphs are among the most effective means of preparing data for statistical AI; credible knowledge graph platforms use supervised and unsupervised learning to accelerate numerous processes; and their smart inferences are themselves a form of machine intelligence.
Coupling knowledge graphs with high performance computing enables organizations to not only avail themselves of sophisticated techniques to optimize AI, but also employ it at the scale and speed of contemporary data demands.
According to Katana Graph CEO Keshav Pingali, “There is a need for high performance graph computing…in two ways. One is the volume of data, and the other is time to insight.”
Scaling knowledge graphs with high performance computing is a means of rapidly analyzing the tremendous data quantities organizations routinely contend with, enabling informed, low-latency action across numerous use cases including “intrusion detection, fraud detection, and Anti-Money Laundering,” Pingali noted.
Moreover, this synthesis allows them to do so with established and emerging AI approaches that are imperative for accurately producing desired results in these and other deployments.
Scaling Out High Performance Computing
High performance computing typically involves clusters with multiple processing units swiftly performing complicated calculations. It has a lengthy history in science use cases that are too computationally intense for conventional approaches. “People are used to the notion of high performance computing in the context of computational science applications, where you’re solving very large systems of PDEs and you use something like finite elements to ultimately generate systems of linear and non-linear equations,” Pingali explained.
This computing paradigm is critical for deployments in which copious amounts of storage and compute are required. The ability to distribute workloads among multiple servers is fundamental to high performance computing, particularly for apportioning a knowledge graph engine among machines for parallel processing tasks. According to Pingali, top options in this space “scale to 256 machines” as required for computational demands at scale. This capability is primed for AI deployments involving event stream processing and other applications.
Performant Knowledge Graphs
Parallel processing with high performance computing is ideal for addressing the size of knowledge graphs in the post big data era. “A scale out solution is essential in some verticals…in fintech, security identity,” Pingali reflected. “We’re talking about very big graphs, very big topologies, in some cases maybe a trillion edges. And also, lots of property data on nodes and edges.” The overflowing quantities of predominantly unstructured data inundating the enterprise via external, cloud, social media, and IoT sources are directly responsible for the expansiveness of contemporary knowledge graphs.
Pingali echoed the notion that “more than half of the world’s data was created in the last two years, but less than 2 percent of it has been analyzed. Some of this data is of course structured data…but a lot of that data is also unstructured and can be viewed usefully as graphs and processed usefully with graph algorithms.” Fortified by high performance computing, knowledge graphs successfully represent and process this data via:
- Nodes and Edges: Graphs represent data and their relationships via nodes and edges. “The nodes represent entities of some kind and the edges represent binary relations between these entities,” Pingali commented.
- Labeled Properties: Users can annotate graphs with labeled properties that are helpful for data provenance and recording confidence scores, both of which enhance machine learning use cases. “In a lot of applications, the nodes and edges also have a lot of property data,” Pingali revealed. “For example, if there is a node that represents a person, the property associated with that node could be the first name, the last name, the social security number, the date of birth, where the person resides, citizenship, and so on.”
- Graph Algorithms: There are several algorithms that excel in graph settings for understanding data. Specific algorithm types include “path finding, node ranking, community detection, structural properties, and graph mining algorithms,” Pingali disclosed. “Most of them run on CPUs, GPUs, as well as distributed CPUs and GPUs.”
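The three ideas above can be sketched in a few lines of Python. This is a minimal, single-machine illustration rather than any vendor's API: the `PropertyGraph` class, its method names, and the sample people and accounts are all hypothetical. It stores property data on nodes and edges, and uses breadth-first search as one simple example of a path finding algorithm.

```python
from collections import deque


class PropertyGraph:
    """Tiny in-memory labeled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}      # node id -> property dict
        self.edges = {}      # (src, dst) -> property dict
        self.adjacency = {}  # node id -> list of neighbour ids

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.adjacency.setdefault(node_id, [])

    def add_edge(self, src, dst, **properties):
        # An edge is a binary relation between two existing entities.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[(src, dst)] = properties
        self.adjacency[src].append(dst)

    def shortest_path(self, start, goal):
        """Breadth-first search: fewest-hop directed path, or None."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if current == goal:
                path = []
                while current is not None:
                    path.append(current)
                    current = parents[current]
                return path[::-1]
            for neighbour in self.adjacency[current]:
                if neighbour not in parents:
                    parents[neighbour] = current
                    queue.append(neighbour)
        return None


# Hypothetical sample data: a person node and two account nodes.
graph = PropertyGraph()
graph.add_node("p1", first_name="Ada", last_name="Example", citizenship="UK")
graph.add_node("acct1", kind="account")
graph.add_node("acct2", kind="account")
# Edge properties can carry provenance and confidence scores.
graph.add_edge("p1", "acct1", relation="owns", confidence=0.98)
graph.add_edge("acct1", "acct2", relation="transfers_to", source="ledger-feed")

path = graph.shortest_path("p1", "acct2")
print(path)
```

A production engine would distribute this structure across machines and offer many more algorithms; the sketch only shows how entities, relations, and properties fit together.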
The responsiveness of knowledge graphs underpinned by high performance computing greatly exceeds that of other methods. These performance gains are often the vital distinction between simply amassing immense knowledge graphs and actually deriving low-latency action from them. “A lot of the time there is a window of opportunity within which, if your analytics completes, you can get insights and you can act on those insights,” Pingali remarked. “Then, you benefit from the analytics. But if the answer comes too late outside of that window of opportunity, then you might as well not have done the analytics.”
Pingali described a use case in which the Defense Advanced Research Projects Agency (DARPA) utilized knowledge graphs enhanced by high performance computing for real-time intrusion detection in their computer networks. “They build interaction graphs and then you are pattern mining within that graph to find what are called forbidden patterns,” Pingali mentioned. “If you find a forbidden pattern then you raise an alarm, somebody steps in, and so on.”
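As a rough illustration of that idea, the sketch below builds an interaction graph from a list of events and searches it for a forbidden pattern. It assumes, purely for the example, that a pattern is a fixed sequence of edge labels; the function names, event names, and the pattern itself are hypothetical and are not drawn from the DARPA deployment.

```python
from collections import defaultdict


def build_interaction_graph(events):
    """Map each source entity to its outgoing (action, target) interactions."""
    graph = defaultdict(list)
    for source, action, target in events:
        graph[source].append((action, target))
    return graph


def find_forbidden_paths(graph, pattern):
    """Return every path whose consecutive edge labels equal `pattern`."""
    matches = []

    def extend(path, step):
        if step == len(pattern):
            matches.append(list(path))
            return
        # .get avoids creating empty entries in the defaultdict.
        for action, target in graph.get(path[-1], []):
            if action == pattern[step]:
                path.append(target)
                extend(path, step + 1)
                path.pop()

    for start in list(graph):
        extend([start], 0)
    return matches


# Hypothetical events observed on a network: (source, action, target).
events = [
    ("browser", "downloads", "payload.bin"),
    ("payload.bin", "spawns", "shell"),
    ("shell", "connects_to", "203.0.113.7"),
    ("editor", "writes", "notes.txt"),
]
FORBIDDEN = ["downloads", "spawns", "connects_to"]  # assumed pattern

alarms = find_forbidden_paths(build_interaction_graph(events), FORBIDDEN)
for path in alarms:
    print("ALARM:", " -> ".join(path))
```

Real pattern mining works on far richer subgraph shapes and on graphs that change continuously, which is where the performance of the underlying engine matters.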
The capacity to rapidly traverse extensive topologies laden with labeled properties at the speed of high performance computing is primarily based on the following three considerations for distributing workloads among machines.
- Sharding: Sharding is a means of partitioning workloads among different machines. Once that’s done “each of the machines has a small portion of the graph, and so you can do graph computing on that single machine,” Pingali specified.
- Dynamic Load Balancing: Unlike the case with many computational science applications, the computations for workloads in graph computing aren’t always predictable or static. Dynamic load balancing addresses this by redistributing work among machines at runtime as the workload shifts.
- In-Memory: In-memory capabilities are fundamental for the quick processing high performance computing is acclaimed for. According to Pingali, sharding also allows an “in-memory compute engine to run on each machine.” Credible options in this space also have runtime capabilities for inter-machine communication.
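The first two considerations can be approximated on a single machine. The sketch below is an assumption-laden stand-in, not a description of any product: it assigns each node to a shard with a CRC32 hash, keeps each shard's adjacency lists in memory, counts the edges that would require inter-machine communication, and uses a thread pool as a simple proxy for dynamic load balancing, since idle workers pick up the next block of work. All function names and the sample edges are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def owner(node_id, num_shards):
    """Deterministically assign a node to a shard (CRC32 is stable across runs)."""
    return zlib.crc32(node_id.encode("utf-8")) % num_shards


def shard_edges(edges, num_shards):
    """Place each edge on the shard that owns its source node."""
    shards = [dict() for _ in range(num_shards)]
    for src, dst in edges:
        shards[owner(src, num_shards)].setdefault(src, []).append(dst)
    return shards


def count_cross_shard_edges(shards):
    """Edges whose destination lives on another shard need inter-machine messages."""
    num_shards = len(shards)
    return sum(
        1
        for shard_id, adjacency in enumerate(shards)
        for neighbours in adjacency.values()
        for dst in neighbours
        if owner(dst, num_shards) != shard_id
    )


def out_degrees(block):
    """Per-block work; its cost varies with how many edges the block holds."""
    return {node: len(neighbours) for node, neighbours in block}


def balanced_out_degrees(shards, workers=4, block_size=2):
    """Split shards into small blocks; idle workers pick up the next block."""
    blocks = []
    for adjacency in shards:
        items = list(adjacency.items())
        blocks += [items[i:i + block_size] for i in range(0, len(items), block_size)]
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(out_degrees, blocks):
            result.update(partial)
    return result


# Hypothetical directed edges.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")]
shards = shard_edges(edges, num_shards=3)
degrees = balanced_out_degrees(shards)
print(degrees, "cross-shard edges:", count_cross_shard_edges(shards))
```

Hash partitioning is only one possible sharding policy. It spreads nodes evenly but ignores graph structure, so a real system might prefer a partitioner that reduces the number of cross-shard edges.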
Advancing Knowledge Graphs
Although the knowledge graph idiom is widely proclaimed by a number of vendors with varying approaches, pairing this technology with high performance computing is a significant development for meeting the needs of the contemporary data ecosystem. It addresses the burgeoning size of knowledge graphs, the real-time responsiveness required to succeed with them, and the computational demands required for AI in mission-critical use cases at enterprise scale.
“That is where the real intelligence comes in, to figure out what might happen in the future and mitigate any bad things that might happen and ensure you can exploit all the good things that might happen,” Pingali posited. “This is going to require using lots and lots of knowledge graphs, as well as AI. Knowledge graphs and AI are really made for each other…in a platform where you can quickly spin up those kinds of applications and exploit the enormous amount of data that we all have.”