Analysing the Tor Web
As the Web has become the main means for information exchange and retrieval, a whole body of work focuses on gaining a better understanding of its content and shape, in order to improve usability and security. Web mining becomes an even more interesting and challenging task when the target includes the submerged portion of the Internet usually known as the "deep" Web, i.e., the content that traditional search engines do not crawl or index.
A recent research trend focuses on the subset of the deep Web usually called the "dark" Web: the collection of web resources hosted on darknets, i.e., overlay networks that, although built on top of the public Internet, require specific software, configuration or authorization to access. Among darknets, Tor (The Onion Router) is probably the best known and most widely used. It is a communication network designed as a low-latency, anonymity-guaranteeing and censorship-resistant network, relying on an implementation of the so-called onion routing protocol. Its servers, run by volunteers over the Internet, work as routers that allow Tor users to access the Internet anonymously, evading traditional network surveillance and traffic analysis mechanisms. Besides providing anonymous access to regular websites, Tor allows running anonymous and untraceable services, known as hidden services, that can only be accessed using a Tor-enabled browser.
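The layered encryption that gives onion routing its name can be illustrated with a small sketch. This is a toy model, not the real Tor protocol: the per-hop XOR "cipher", the hop keys, and the message are all invented for demonstration. The point is only the structure, namely that the sender wraps the message once per relay, and each relay peels exactly one layer.

```python
# Toy illustration of onion routing's layered encryption (NOT the real Tor
# protocol): the sender adds one layer per relay, and each relay removes
# exactly one, so no single relay sees both the sender and the payload.
import itertools

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Stand-in symmetric cipher for demonstration purposes only.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

def wrap(message: bytes, hop_keys: list[bytes]) -> bytes:
    # Encrypt innermost-first, so the entry relay's layer ends up outermost.
    for key in reversed(hop_keys):
        message = xor_cipher(message, key)
    return message

def unwrap(onion: bytes, hop_keys: list[bytes]) -> bytes:
    # Each relay, in path order, removes its own layer.
    for key in hop_keys:
        onion = xor_cipher(onion, key)
    return onion

keys = [b"entry", b"middle", b"exit"]   # hypothetical per-hop keys
cell = wrap(b"GET /index.html", keys)
assert cell != b"GET /index.html"       # payload is hidden in transit
assert unwrap(cell, keys) == b"GET /index.html"
```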
Onion under Microscope: An in-depth analysis of the Tor Web
Tor is open-source software that allows users to access various kinds of resources, known as hidden services, while guaranteeing sender and receiver anonymity. Tor relies on a free, worldwide overlay network, managed by volunteers, that works according to the principles of onion routing, in which messages are encapsulated in layers of encryption, analogous to the layers of an onion. The Tor Web is the set of web resources that exist on the Tor network, and Tor websites are part of the so-called dark web. Recent research has evaluated Tor's security, its evolution over time, and its thematic organization. Nevertheless, limited information is available about the structure of the graph defined by the network of Tor websites, not to be confused with the network of nodes that supports the onion routing. The limited number of entry points that can be used to crawl the network makes the study of this graph far from simple. In the present paper we analyze two graph representations of the Tor Web and the relationship between content and structural features, considering three crawling datasets collected over a five-month time frame. Among other findings, we show that Tor consists of a tiny strongly connected component, in which link directories play a central role, and of a multitude of services that can (only) be reached from there. From this viewpoint, the graph appears inefficient. Nevertheless, if we only consider mutual connections, a more efficient subgraph emerges, which is probably the backbone of social interactions in Tor. (Paper, pdf)
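The "tiny strongly connected component plus a periphery reachable only from it" shape can be reproduced on a toy graph. The sketch below implements Kosaraju's algorithm (a standard SCC method, not necessarily the one used in the paper) on an invented link structure where two directory services link each other and many leaf services never link back.

```python
# Sketch: strongly connected components of a toy service-level graph via
# Kosaraju's algorithm. The edge list is hypothetical, mimicking the paper's
# finding of a tiny SCC of link directories plus out-reachable leaf services.
from collections import defaultdict

def kosaraju_sccs(edges):
    graph, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v); rev[v].append(u); nodes |= {u, v}
    order, seen = [], set()
    def dfs1(u):                       # first pass: record finish order
        seen.add(u)
        for v in graph[u]:
            if v not in seen: dfs1(v)
        order.append(u)
    for u in nodes:
        if u not in seen: dfs1(u)
    sccs, assigned = [], set()
    def dfs2(u, comp):                 # second pass on the reversed graph
        assigned.add(u); comp.append(u)
        for v in rev[u]:
            if v not in assigned: dfs2(v, comp)
    for u in reversed(order):
        if u not in assigned:
            comp = []; dfs2(u, comp); sccs.append(comp)
    return sccs

# Hypothetical links: directories "dirA"/"dirB" cite each other; leaf
# services svc1..svc3 are linked but never link back.
edges = [("dirA", "dirB"), ("dirB", "dirA"),
         ("dirA", "svc1"), ("dirA", "svc2"), ("dirB", "svc3")]
largest = max(kosaraju_sccs(edges), key=len)
assert sorted(largest) == ["dirA", "dirB"]   # the tiny SCC of directories
```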
Spiders like Onions: on the Network of Tor Hidden Services
Tor hidden services allow offering and accessing various Internet resources while guaranteeing a high degree of provider and user anonymity. So far, most research work on the Tor network has aimed at discovering protocol vulnerabilities to de-anonymize users and services. Other work has aimed at estimating the number of available hidden services and classifying them. What still remains largely unknown is the structure of the graph defined by the network of Tor services. In this paper, we describe the topology of the Tor graph (aggregated at the hidden service level), measuring both global and local properties by means of well-known metrics. We consider three different snapshots obtained by extensively crawling Tor three times over a five-month time frame. We separately study these three graphs and their shared "stable" core. In doing so, besides assessing the renowned volatility of Tor hidden services, we make it possible to distinguish time-dependent and structural aspects of the Tor graph. Our findings show that, among other things, the graph of Tor hidden services presents some of the characteristics of social and surface web graphs, along with a few unique peculiarities, such as a very high percentage of nodes having no outbound links. (Paper, pdf) [Supporting material, Poster]
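One of the local metrics the abstract highlights, the fraction of nodes with no outbound links, is easy to compute from an edge list. The sketch below does so on an invented graph; both the function name and the data are assumptions for illustration.

```python
# Minimal sketch of one metric from the abstract: the fraction of "sink"
# nodes, i.e. services that are linked to but never link out. Edge list is
# hypothetical.
def sink_fraction(edges):
    nodes = {n for e in edges for n in e}
    out = {n: 0 for n in nodes}
    for u, _ in edges:
        out[u] += 1
    sinks = sum(1 for n in nodes if out[n] == 0)
    return sinks / len(nodes)

edges = [("dir", "a"), ("dir", "b"), ("dir", "c"), ("a", "dir")]
# 'b' and 'c' have no outbound links -> 2 of 4 nodes
assert sink_fraction(edges) == 0.5
```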
Design, Implementation and Test of a Flexible Tor-Oriented Web Mining Toolkit
Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g. by optimizing query results) and enforcing data/user security (e.g. by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While being easily configurable to work on the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e. the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper: it includes two mining modules, respectively able to prepare content to be fed to an (external) semantic engine, and to reconstruct the graph structure of the explored portion of the Web. Besides discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, which clarify the potential of our solution. (Paper, pdf)
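A crawler tailored to Tor typically routes its HTTP traffic through a local Tor SOCKS proxy and restricts itself to valid .onion hostnames. The sketch below shows this plumbing in a hedged form: the proxy address/port are the common Tor defaults, not values taken from the paper, and the actual fetch (commented out) would require a running Tor daemon plus SOCKS support in the HTTP client.

```python
# Hedged sketch of Tor-oriented crawling plumbing. 127.0.0.1:9050 is the
# conventional default for a local Tor SOCKS proxy; this is an assumption,
# not a detail from the paper. Only the offline parts run here.
import re

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve .onion inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

# v2 onion addresses are 16 base32 chars, v3 are 56 (16 + 40 below).
ONION_RE = re.compile(r"^[a-z2-7]{16}(?:[a-z2-7]{40})?\.onion$")

def is_onion(host: str) -> bool:
    """Cheap pre-filter so the crawler only queues plausible onion hosts."""
    return bool(ONION_RE.match(host))

assert is_onion("expyuzz4wqqyqhjn.onion")   # a (retired) v2-style address
assert not is_onion("example.com")

# An actual fetch would look like (requires `requests` with SOCKS support):
#   requests.get("http://<address>.onion/", proxies=TOR_PROXIES, timeout=60)
```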
Tor marketplaces exploratory data analysis: the drugs case
The anonymous marketplaces ecosystem represents a new channel for black-market goods and services, offering a huge variety of illegal items. For many darknet marketplaces, the overall sales volume is not (yet) comparable with that of the corresponding physical market; however, since it represents an additional trade channel, providing opportunities to new and old forms of illegal trade with worldwide customers, anonymous trading should be carefully studied, via regular crawling and data analysis, in order to detect new trends in illegal goods and services (physical and digital), new drug substances and sources, and alternative paths to import socially dangerous goods (e.g. drugs, weapons). Such markets, modeled on e-commerce retail leaders such as Amazon and eBay, are designed with ease of use in mind, using off-the-shelf web technologies; users have their own profiles and credentials, and act as sellers (posting offers), buyers (posting reviews), or both. This leads to very poor data quality in market offers and the related feedback, increasing the complexity of extracting reliable data. In this paper we present a methodology to crawl and manipulate data for the analysis of the illicit drug trade taking place in such marketplaces. We focus our analysis on AlphaBay, Nucleus and East India Company, and we show how to prepare the data and how to carry out a preliminary investigation based on Exploratory Data Analysis. (Paper, pdf)
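A typical preliminary step in this kind of exploratory analysis is normalizing heterogeneous offers (different quantities, different markets) into a comparable unit such as price per gram. The sketch below does this on invented listings; all field names, markets' prices and quantities are illustrative, not data from the paper.

```python
# Toy exploratory pass over hypothetical scraped listings: normalize offers
# to price-per-gram and aggregate per market. All values are invented for
# illustration; real crawled data is far noisier and needs heavy cleaning.
from statistics import median

listings = [
    {"market": "AlphaBay", "drug": "cannabis", "grams": 5.0,  "usd": 40.0},
    {"market": "AlphaBay", "drug": "cannabis", "grams": 10.0, "usd": 70.0},
    {"market": "Nucleus",  "drug": "cannabis", "grams": 3.5,  "usd": 35.0},
]

def median_price_per_gram(rows, market):
    unit = [r["usd"] / r["grams"] for r in rows if r["market"] == market]
    return median(unit)

# 40/5 = 8.0 and 70/10 = 7.0 -> median 7.5 USD/gram
assert median_price_per_gram(listings, "AlphaBay") == 7.5
```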
Multi-word Structural Topic Modelling of ToR Drug Marketplaces
Topic Modelling (TM) is a widely adopted generative model used to infer the thematic organization of text corpora. When document-level covariate information is available, so-called Structural Topic Modelling (STM) is the state-of-the-art approach for embedding this information in the topic mining algorithm. Usually, TM algorithms rely on unigrams as the basic text generation unit, whereas the quality and intelligibility of the identified topics would significantly benefit from the detection and usage of topical phrasemes. Following on from previous research, in this paper we propose the first iterative algorithm to extend STM with n-grams, and we test our solution on textual data collected from four well-known Tor drug marketplaces. Significantly, we employ an STM-guided n-gram selection process, so that topic-specific phrasemes can be identified regardless of their global relevance in the corpus. Our experiments show that enriching the dictionary with selected n-grams improves the usability of STM, allowing the discovery of key information hidden in an apparently "mono-thematic" dataset. (Paper)
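The core idea of enriching a topic model's dictionary with n-grams can be sketched as detecting frequent bigrams and merging them into single tokens before modelling. Note that the simple count threshold below is a stand-in: the paper's selection is STM-guided and topic-specific, which this toy version does not implement.

```python
# Minimal sketch of dictionary enrichment with bigrams: frequent adjacent
# pairs are fused into single tokens ("free shipping" -> "free_shipping").
# The raw count threshold is a simplification of the paper's STM-guided
# selection, used here only to illustrate the mechanics.
from collections import Counter

def merge_bigrams(docs, min_count=2):
    pairs = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    keep = {p for p, c in pairs.items() if c >= min_count}
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in keep:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2                      # consume both tokens of the pair
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged

docs = [["free", "shipping", "worldwide"],
        ["free", "shipping", "tracked"]]
assert merge_bigrams(docs)[0] == ["free_shipping", "worldwide"]
```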
Exploring and Analyzing the Tor Hidden Services Graph
The exploration and analysis of Web graphs has flourished in the recent past, producing a large number of relevant and interesting research results. However, the unique characteristics of the Tor network limit the applicability of standard techniques and demand specific algorithms for its exploration and analysis. The attention of the research community has focused on assessing the security of the Tor infrastructure (i.e., its ability to actually provide the intended level of anonymity) and on discussing what Tor is currently being used for. Since there are no foolproof techniques for automatically discovering Tor hidden services, little or no information is available about the topology of the Tor Web graph. Even less is known about the relationship between content similarity and topological structure. The present paper aims at addressing this lack of information. Among its contributions: a study of automatic Tor Web exploration and data collection approaches; the adoption of novel representative metrics for evaluating Tor data; a novel in-depth analysis of the hidden services graph; a rich correlation analysis of hidden services' semantics and topology. Finally, we provide a broad and interesting set of novel insights and considerations on the organization and content of the Tor Web. (Paper, pdf)
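A correlation analysis between semantics and topology boils down to comparing a topological measure per node (e.g. degree) with a content measure (e.g. similarity to neighbours). The sketch below computes a Pearson coefficient from scratch on invented numbers; both series are illustrative, not results from the paper.

```python
# Sketch of a semantics-topology correlation check: Pearson correlation
# between node degree and a (hypothetical) content-similarity score with
# neighbours. All numbers are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

degrees    = [1, 2, 3, 4, 5]
similarity = [0.9, 0.8, 0.65, 0.5, 0.4]   # invented per-node content scores

r = pearson(degrees, similarity)
assert -1.0 <= r < -0.95   # strongly negative in this toy example
```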