Quantifying dissimilarities between networked data

Data is getting big, but more than big, it is getting pervasive. As our lives become more dependent on and integrated with the digital infrastructure, more aspects of our lives get measured and recorded. This pervasiveness leads to the emergence of novel types of signals to which existing analytic tools cannot be applied easily. Networked data falls in this category. As an example, the networks above are the coauthorship networks constructed from IEEE Trans. Signal Process. publications of two famous signal processing researchers, Prof. Georgios B. Giannakis (GG) of the University of Minnesota and Prof. Martin Vetterli (MV) of the École Polytechnique Fédérale de Lausanne, over the selected timeframe. Each node in the networks represents an author. The size of a node is proportional to the number of publications of the corresponding author, the width of a link is proportional to the number of joint publications of the author pair, and the color intensity of a shaded triangle is proportional to the number of joint publications of the enclosed author triplet. Distinguishable differences exist in their respective collaboration patterns. A natural question is: how can we quantify these differences?

This problem has traditionally been approached by comparing features of the respective networks. Feature analysis is application dependent, utilizes only a small portion of the information the networks convey, and may yield conflicting judgements: two networks each close to a third may nonetheless be quite different from each other. For the networks presented above, it is not clear which features one should use. Moreover, feature analysis does not generalize easily to incorporate higher-order information, e.g. joint publications among three or four authors, as in the example above. We tackle this problem from a different perspective: by defining valid metric distances that are universal, depend on all edge weights, and are null if and only if the networks are isomorphic; see Metrics in the Space of High Order Networks. We also establish a connection between Network Science and Computational Topology and use tools from the latter to efficiently quantify network differences; see Persistent Homology Lower Bounds on High Order Network Distances.
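The persistent-homology connection can be caricatured with a small sketch (my own toy construction, not the bounds developed in the papers above): for a dissimilarity network, the zeroth-dimensional persistence events are the scales at which connected components merge, and these merge scales are exactly the edge weights of a minimum spanning tree. Comparing sorted merge scales yields a cheap, permutation-invariant discrepancy between two networks. The function names and the zero-padding convention for unequal sizes are illustrative choices.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def zero_dim_persistence(D):
    """Merge scales of 0-dim persistent homology for a dissimilarity
    matrix D: the (sorted) edge weights of a minimum spanning tree,
    i.e. the scales at which connected components merge."""
    mst = minimum_spanning_tree(D).toarray()
    return np.sort(mst[mst > 0])

def persistence_lower_bound(D1, D2):
    """A simple permutation-invariant discrepancy: the largest gap
    between matched merge scales. Invariant under relabeling of nodes,
    so isomorphic networks yield 0. (Illustrative sketch only.)"""
    p1, p2 = zero_dim_persistence(D1), zero_dim_persistence(D2)
    n = max(len(p1), len(p2))
    p1 = np.pad(p1, (n - len(p1), 0))  # pad smaller diagram with zeros
    p2 = np.pad(p2, (n - len(p2), 0))
    return float(np.max(np.abs(p1 - p2)))
```

Because the quantity depends only on merge scales and not on node labels, permuting the nodes of a network leaves it unchanged, mirroring the isomorphism-invariance required of a valid network metric.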


Recommendation systems from graph signal processing

Often, networks have intrinsic value and are themselves the object of study. Other times, the network defines an underlying notion of proximity, but the objects of interest are signals defined on top of the graph. In these scenarios, the network serves as an underlying topology describing proximities between nodes, and the subject of study is a signal supported on the network. As an example, consider the core task of a recommendation system, which is to estimate the rating that a user would give to a certain item. Traditionally, ratings are arranged in a two-dimensional user-item matrix, and methods rely on the fact that while the number of users of the system may be large, the number of distinct tastes is much smaller. While empirical observations suggest this is largely true, the inherent noise in the available ratings limits the precision of these tools.

My approach to this problem is to use ratings and meta-information to construct a network of similarities between items. The ratings given by each user can then be viewed as a graph signal, and the ratings given by the same user should be similar whenever the evaluated items are highly alike with respect to the similarity measure of the constructed network. The same information can also be used to construct an underlying network describing proximities between user tastes, on top of which the ratings for each item form a graph signal (the exact dual, obtained by switching the roles of items and users); there, the ratings given to the same item should be similar whenever the users who assign them possess similar tastes. In either case, the estimation of ratings can be formulated as a low-pass filtering problem: the signal we are given is highly variable because many ratings are missing, while the signal we want is a low-frequency signal in which similar nodes have similar ratings.
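A minimal sketch of this low-pass formulation (my own toy implementation, not the exact method of the papers cited below): given an item-similarity matrix, a mask marking which ratings were observed, and the observed ratings, recover a smooth graph signal by solving a Laplacian-regularized least-squares problem in closed form.

```python
import numpy as np

def recover_ratings(W, y, mask, alpha=1.0):
    """Estimate missing ratings by low-pass graph filtering: minimize
    the data-fit term on observed entries plus alpha * x^T L x, where
    L is the Laplacian of the item-similarity network W. The smoothness
    penalty pushes similar items toward similar ratings."""
    L = np.diag(W.sum(axis=1)) - W        # combinatorial graph Laplacian
    M = np.diag(mask.astype(float))       # selects observed ratings
    # Closed-form minimizer of ||M(x - y)||^2 + alpha * x^T L x.
    return np.linalg.solve(M + alpha * L, M @ y)
```

On a toy network of two disconnected item pairs with one observed rating per pair, the solver propagates each observed rating to its similar, unrated neighbor, which is exactly the qualitative behavior the low-pass formulation asks for.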

The image above illustrates a concrete example of 9 movies and the ratings a user created for them. Each node in the network denotes a movie, and two nodes are connected by an edge if they are highly similar (e.g. belong to the same series). The line segments attached to each node represent the ratings: segments pointing upwards denote positive ratings (the user likes the movie) and segments pointing downwards denote negative ratings (the user dislikes the movie). The latent signal (a) is unknown but assumed to be the ground-truth preference of the considered user. The observation (b) is the recorded rating; it differs from the latent signal due to inconsistencies during the rating process. Network tools can be used to recover the signal (c) by treating ratings as signals supported on the network describing similarities. See Diffusion Filtering for Graph Signals and Its Use in Recommendation Systems for its application to recommendation systems, as well as Diffusion and Superposition Distances for Signals Supported on Networks for applications to handwritten digit recognition and ovarian cancer classification.
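The recovery from (b) to (c) can be caricatured by an iterative diffusion filter. The sketch below is a generic heat-diffusion smoother assumed for illustration, not necessarily the exact filter developed in the papers: each step averages a node's signal with its neighbors, attenuating high graph frequencies while preserving the total mass of the ratings.

```python
import numpy as np

def diffuse(W, y, steps=10, eps=0.1):
    """Illustrative diffusion filter: repeat x <- x - eps * L x, where
    L is the Laplacian of similarity network W. Each step pulls a
    node's value toward its neighbors', smoothing noisy ratings along
    the network while leaving the signal's sum unchanged."""
    L = np.diag(W.sum(axis=1)) - W
    x = y.astype(float).copy()
    for _ in range(steps):
        x = x - eps * (L @ x)
    return x
```

For a stable filter, eps should be below 2 divided by the largest Laplacian eigenvalue; on the two-node example in the test, each step shrinks the disagreement between neighbors by a constant factor.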


Signal processing and neuroscience

Human brain activity can be naturally modeled using networks and signals supported on networks. Each node in the network represents a brain region of similar functionality, signals denote the level of activity of each brain region, and the network describes structural connectivity or functional coherence. Such an interpretation motivates the application of traditional signal processing and graph signal processing methods to examine patterns of brain activation. I have been exploiting this connection to apply signal processing tools to recognize patterns associated with differences in human behaviors and abilities, and several specific and interesting observations have emerged.

In the temporal domain, we found that when individuals are unfamiliar with a task, good learners exhibit more numerous and more frequent variations in the activity of each brain region, and their brain activity is also less predictable. This matches the intuition that stronger variation implies better adaptation and therefore faster learning; the association is no longer clear once subjects are familiar with the task. In the spatial domain, we found that smooth, spread-out, and cooperative brain activity across the brain favors learning when we face an unfamiliar task. As we gradually become familiar with the task, such a smooth and cooperative signal distribution becomes less and less important, and beyond a certain level of exposure it becomes destructive rather than constructive. Once we are highly familiar with the task, varied, spiking, and competitive brain signals across the brain are instead what favor further learning. See Graph Frequency Analysis of Brain Signals for more details.
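The graph-frequency language used here can be made concrete with a small sketch (my own minimal implementation, with illustrative function names): the graph Fourier transform expands a signal in the eigenbasis of the network Laplacian, where small eigenvalues correspond to smooth, spread-out modes and large eigenvalues to varied, spiking modes; the quadratic form x^T L x measures how smoothly activity varies across connected regions.

```python
import numpy as np

def graph_fourier(W, x):
    """Graph Fourier transform: expand signal x in the eigenbasis of
    the graph Laplacian of W. The eigenvalues act as graph frequencies:
    small values index smooth modes, large values spiky ones."""
    L = np.diag(W.sum(axis=1)) - W
    freqs, U = np.linalg.eigh(L)     # ascending graph frequencies
    return freqs, U.T @ x            # spectral coefficients of x

def smoothness(W, x):
    """Total variation x^T L x: sums (x_i - x_j)^2 over edges, so low
    values mean activity varies smoothly across connected regions."""
    L = np.diag(W.sum(axis=1)) - W
    return float(x @ L @ x)
```

A constant signal on a connected network has zero total variation and all its spectral energy at frequency zero, while an alternating signal concentrates energy in high frequencies; contrasting these two energy profiles is the basic operation behind a graph frequency analysis of brain signals.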