Cranic Computing

Data Science

Critical nodes reveal peculiar features of human essential genes and protein interactome

Network-based ranking methods (e.g., centrality analysis) have found extensive use in systems biology and network medicine for the prediction of essential proteins, for the prioritization of drug target candidates in the treatment of several pathologies, for biomarker discovery, and for the identification of human disease genes. We here studied the connectivity of the human protein-protein interaction network (i.e., the interactome) to find the nodes whose removal has the heaviest impact on the network, i.e., maximizes its fragmentation. Such nodes are known as Critical Nodes (CNs). Specifically, we implemented a Critical Node Heuristic (CNH) and compared its performance against four other heuristics based on well-known centrality measures. To better understand the structure of the interactome, the role played by CNs in the network, and the ability of the different heuristics to capture biologically relevant nodes, we compared the sets of nodes identified as CNs by each heuristic with two experimentally validated sets of essential genes, i.e., the genes whose removal impairs a given organism's ability to survive. Our results show that classical centrality measures (i.e., closeness centrality, degree) identify more essential genes than CNH on the current version of the human interactome; however, the removal of such nodes does not have the greatest impact on interactome connectivity, while, interestingly, the genes identified by CNH show peculiar characteristics from both the topological and the biological point of view. Finally, even if a relevant fraction of essential genes is found via the classical centrality measures, the same measures fail to identify the whole set of essential genes, suggesting once again that some of them are not central in the network, that there may be biases in the current interaction data, and that different, combined graph-theoretical and other techniques should be applied for their discovery. (Paper)
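
The fragmentation objective behind CN detection can be made concrete in a few lines of code. Below is a minimal sketch, assuming the networkx library and using pairwise connectivity as the fragmentation measure; the greedy routine is an illustrative heuristic, not the paper's actual CNH implementation, and the karate-club graph merely stands in for the interactome.

    import networkx as nx

    def pairwise_connectivity(G):
        # Number of node pairs still joined by a path: the quantity
        # a critical-node heuristic tries to minimize.
        return sum(len(c) * (len(c) - 1) // 2 for c in nx.connected_components(G))

    def greedy_critical_nodes(G, k):
        # Greedily remove the node whose deletion most reduces
        # pairwise connectivity (an illustrative CN heuristic).
        H, removed = G.copy(), []
        for _ in range(k):
            best = min(H.nodes,
                       key=lambda v: pairwise_connectivity(nx.restricted_view(H, [v], [])))
            removed.append(best)
            H.remove_node(best)
        return removed

    G = nx.karate_club_graph()  # toy stand-in for the human interactome
    print("greedy CNs:", greedy_critical_nodes(G, 3))
    print("top degree:", sorted(G.nodes, key=G.degree, reverse=True)[:3])

Comparing the two printed rankings on a real interactome illustrates the point made above: degree-based rankings tend to pick hubs, while the greedy criterion favors nodes bridging otherwise disconnected regions.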

Traffic Data Classification for Police Activity

Traffic data, automatically collected en masse every day, can be mined to discover information or patterns to support police investigations. Leveraging domain expertise, in this paper we show how unsupervised clustering techniques can be used to infer trending behaviors for road users and thus classify both routes and vehicles. We describe a tool devised and implemented upon openly available scientific libraries, and we present a new set of experiments involving three years' worth of data. Our classification results show robustness to noise and have high potential for detecting anomalies possibly connected to criminal activity. (Paper)
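
As a hedged illustration of the clustering step, the sketch below applies scikit-learn's KMeans to per-vehicle features; the input file and feature names are hypothetical placeholders, since the actual features are derived from the raw transit records.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Hypothetical per-vehicle features extracted from gate transit records.
    features = pd.read_csv("vehicle_features.csv")
    X = StandardScaler().fit_transform(
        features[["trips_per_day", "mean_transit_hour", "distinct_routes"]])
    # Each cluster corresponds to a trending-behavior class.
    features["behavior"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(features.groupby("behavior").size())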

Unsupervised Classification of Routes and Plates from the Trap-2017 Dataset

This paper describes the efforts, pitfalls, and successes of applying unsupervised classification techniques to the analysis of the Trap-2017 dataset. Guided by the informative perspective on the nature of the dataset obtained through a set of specifically written perl/bash scripts, we devised an automated clustering tool implemented in Python upon openly available scientific libraries. By applying our tool to the original raw data it is possible to infer a set of trending behaviors for vehicles travelling over a route, yielding an instrument to classify both routes and plates. Our results show that addressing the main goal of the Trap-2017 initiative (“to identify itineraries that could imply a criminal intent”) is feasible even in the presence of an unlabelled and noisy dataset, provided that the unique characteristics of the problem are carefully considered. Although several optimizations of the tool are still under investigation, we believe that it may already pave the way to further research on the extraction of high-level travelling behaviors from gate transit records. (Paper)
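
A natural preprocessing step for such a tool is turning raw gate transit records into per-plate itineraries. The sketch below, with hypothetical file and column names, shows one way to do this with pandas; it illustrates the data preparation, not the paper's exact pipeline.

    import pandas as pd

    # Hypothetical raw records: one row per gate transit (plate, gate, timestamp).
    records = pd.read_csv("transits.csv", parse_dates=["timestamp"])
    itineraries = (records.sort_values("timestamp")
                          .groupby("plate")["gate"]
                          .agg(tuple))          # ordered gate sequence per plate
    # Plates sharing a gate sequence travel the same route; frequencies
    # expose trending behaviors over each route.
    print(itineraries.value_counts().head())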

Traffic Data: Exploratory Data Analysis with Apache Accumulo

The amount of traffic data collected by automatic number plate reading systems constantly increases. It is therefore important for law enforcement agencies to find convenient techniques and tools to analyze such data. In this paper we propose a scalable and fully automated procedure, leveraging the Apache Accumulo technology, that allows effective importing and processing of traffic data. We discuss preliminary results obtained by using our application for the analysis of a dataset containing real traffic data provided by the Italian National Police. We believe the results described here can pave the way to further interesting research on the matter. (Paper)
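
To give a flavor of the ingestion step, here is a minimal sketch assuming a running Accumulo Thrift proxy and the pyaccumulo client; the table name and the row-key schema (gate id plus transit timestamp) are hypothetical choices, not necessarily those used in the paper.

    from pyaccumulo import Accumulo, Mutation

    conn = Accumulo(host="localhost", port=42424, user="root", password="secret")
    if not conn.table_exists("transits"):
        conn.create_table("transits")

    # Row key: gate id + timestamp, so that scanning a gate's rows
    # returns its transits in chronological order.
    writer = conn.create_batch_writer("transits")
    m = Mutation("gate042_20170301T080515")
    m.put(cf="plate", cq="AB123CD", val="")   # plate in the column qualifier
    writer.add_mutation(m)
    writer.close()

    for entry in conn.scan("transits"):       # full scan; range scans narrow the query
        print(entry.row, entry.cq)
    conn.close()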

Tor marketplaces exploratory data analysis: the drugs case

The anonymous marketplaces ecosystem represents a new channel for black-market goods and services, offering a huge variety of illegal items. For many darknet marketplaces, the overall sales volume is not (yet) comparable with that of the corresponding physical market; however, since it represents a further trade channel, offering opportunities to new and old forms of illegal trade with worldwide customers, anonymous trading should be carefully studied, via regular crawling and data analysis, in order to detect new trends in illegal goods and services (physical and digital), new drug substances and sources, and alternative paths for importing socially dangerous goods (e.g., drugs and weapons). Such markets, modelled on e-commerce retail leaders such as Amazon and eBay, are designed with ease of use in mind and built on off-the-shelf web technologies: users have their own profiles and credentials and act as sellers (posting offers), as buyers (posting reviews), or both. This leads to very poor data quality for market offers and their associated feedback, increasing the complexity of extracting reliable data. In this paper we present a methodology to crawl such marketplaces and manipulate the resulting data for the analysis of the illicit drug trade taking place there. We focus our analysis on AlphaBay, Nucleus, and East India Company, and we show how to prepare the data and how to carry out a preliminary investigation based on Exploratory Data Analysis. (Paper, pdf)
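
As a minimal sketch of the crawling step, the snippet below fetches a listing page through the local Tor SOCKS proxy with requests (which needs the PySocks extra installed) and extracts offer titles with BeautifulSoup; the onion address and the CSS class are placeholders, not real marketplace details.

    import requests
    from bs4 import BeautifulSoup

    # socks5h makes requests resolve the .onion hostname through Tor itself.
    proxies = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}
    resp = requests.get("http://exampleonionaddress.onion/listings",
                        proxies=proxies, timeout=60)
    soup = BeautifulSoup(resp.text, "html.parser")
    offers = [n.get_text(strip=True) for n in soup.select(".offer-title")]
    print(len(offers), "offers scraped from this page")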

Multi-word Structural Topic Modelling of ToR Drug Marketplaces

Topic Modelling (TM) is a widely adopted generative model used to infer the thematic organization of text corpora. When document-level covariate information is available, so-called Structural Topic Modelling (STM) is the state-of-the-art approach to embed this information in the topic mining algorithm. Usually, TM algorithms rely on unigrams as the basic text generation unit, whereas the quality and intelligibility of the identified topics would significantly benefit from the detection and usage of topical phrasemes. Following on from previous research, in this paper we propose the first iterative algorithm to extend STM with n-grams, and we test our solution on textual data collected from four well-known ToR drug marketplaces. Significantly, we employ an STM-guided n-gram selection process, so that topic-specific phrasemes can be identified regardless of their global relevance in the corpus. Our experiments show that enriching the dictionary with selected n-grams improves the usability of STM, allowing the discovery of key information hidden in an apparently "mono-thematic" dataset. (Paper)
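
The unigram-versus-phraseme issue can be illustrated with gensim's collocation detector: the sketch below merges statistically associated token pairs into bigrams before topic modelling. Note that this corpus-wide statistic is only a stand-in; the paper's actual selection is STM-guided and topic-specific, which this snippet does not reproduce.

    from gensim.models.phrases import Phrases, Phraser

    # Toy tokenized listings; a real corpus would come from the crawled markets.
    docs = [["blue", "dream", "top", "quality", "blue", "dream"],
            ["free", "shipping", "top", "quality", "stealth"]]
    bigrams = Phraser(Phrases(docs, min_count=1, threshold=0.1))
    enriched = [bigrams[doc] for doc in docs]   # merges e.g. "blue_dream", "top_quality"
    print(enriched)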