A combination of network science and text analytics was used to construct a taxonomy of science fields, identify relevant articles in each knowledge area, and organize maps, communities and connections (image: visualization of citation network obtained by querying 'complex network' in Web of Science database/ Filipe N. Silva)

Method uses computers to survey scientific literature
2016-06-08

A combination of network science and text analytics was used to construct a taxonomy of science fields, identify relevant articles in each knowledge area, and organize maps, communities and connections.

Method uses computers to survey scientific literature

A combination of network science and text analytics was used to construct a taxonomy of science fields, identify relevant articles in each knowledge area, and organize maps, communities and connections.

2016-06-08

A combination of network science and text analytics was used to construct a taxonomy of science fields, identify relevant articles in each knowledge area, and organize maps, communities and connections (image: visualization of citation network obtained by querying 'complex network' in Web of Science database/ Filipe N. Silva)

 

By José Tadeu Arantes  |  Agência FAPESP – Searching the internet for information by means of keywords is a trivial activity that is part of people’s everyday lives worldwide. Selecting information with real scientific relevance from thousands of references is far more complicated.

Even harder is knowing how the relevant information is organized, how the research field concerned is structured, and which knowledge areas and sub-areas are involved, let alone which communities work in them and how the various research groups are connected. Yet, this knowledge is fundamental for anyone who needs to conduct a survey of the specialized literature in any domain of science.

One method for performing literature surveys by computational means has recently been developed by a group of researchers and is described in a paper published in the Journal of Informetrics.

“This type of computational resource is increasingly necessary, owing not only to the volume of specialized literature but also to the growing interdisciplinarity in science,” said Filipi Nascimento Silva, lead author of the article, in an interview with Agência FAPESP.

“Because of interdisciplinarity, in order to create something new, a researcher in any given knowledge area may need to locate articles in other areas with which she may not be familiar. For example, a researcher in oncology may need to know more about complex networks. We’ve created a method to map the different areas using data from indexed journals, including titles, abstracts and citations of scientific publications,” said Nascimento Silva, a researcher at the University of São Paulo’s São Carlos Physics Institute (IFSC-USP) in Brazil, where he is working on the project entitled “Complex network approach to e-Science and dynamic datasets” using a postdoctoral scholarship from FAPESP.

“The methodology enables users to visualize the area, identify the most important keywords for each sub-area, discover connections among sub-areas, and finally access the articles that are genuinely relevant,” he added.

Considering the highly relevant datasets deriving from publications in indexed journals and the existence of efficient search systems based on keywords, the authors of the study took on the challenge of organizing all the material that could be found.

“We set out to organize this mass of information in a hierarchy tree or dendrogram. This involves the use of two distinct procedures. The first identifies the most relevant articles in each set of articles. The second labels the different communities identified in each of the areas,” said Osvaldo Novais de Oliveira Júnior, a professor at IFSC-USP and coordinator of the study.

The most relevant articles were determined by means of citation networks, in which each article is treated as a node of a network and each citation of an article by another is considered a connection.

Frequently cited articles become nodes with many connections. Groups of nodes with many connections to each other but not connected to nodes in other groups define communities, which are specific subsets of the overall set. This is the type of design typically used in network science.

Text-analysis technology was used to label communities in the second procedure. The most important topics in each article were identified on the basis of the title and abstract, discarding words that occur frequently in any type of text (such as the various forms of the verbs to be and to have, prepositions, definite and indefinite articles, and common nouns), and labels were then defined.

“By combining these two kinds of information, we produce a map of each area, with its different communities and connections, most important and influential articles, and so on,” said Novais, who is also a member of FAPESP’s Physics Area Panel.

Communities with few connections

To test the model, the researchers chose two areas in which they are experts so that they could subjectively evaluate whether the results obtained made sense. The areas chosen were Complex Networks and Photonic Crystals.

“When we tested our methodology in these areas, we fortuitously made a few very interesting discoveries. For example, in the area of photonic crystals, we identified two distinct communities, both very large: a community of engineers working in telecommunications, and an even larger community of physicists and chemists who develop concepts and fabricate materials,” Novais said.

“We found that these communities have relatively few connections with each other. This means the knowledge available in the area may not be used by researchers in that same area because one community is unaware of what’s happening in the other. This was an accidental discovery, but it evidences the importance of having a computational model for surveys of specialized literature.”

According to Novais, the scripts for the programs used by the researchers are available by request, but proper use of the programs requires knowledge of computer languages.

The next step is to transform these scripts into software with an interface that can be easily used by non-specialists in computing. “For now, the programs can only be used by specialists, but in the future we want to make them more accessible so they can be used by the entire community,” Novais said. “We hope researchers in any area can survey the literature using our methodology.”

A computer animation of the citation network obtained by querying the term “Complex Network” in the Web of Science database can be seen at the following link: www.youtube.com/watch?v=5shcaMJ-gJI.

The article entitled “Using network science and text analytics to produce surveys in a scientific topic” (doi:10.1016/j.joi.2016.03.008) by Filipi Nascimento Silva, Osvaldo Novais de Oliveira Júnior and others can be read at www.sciencedirect.com/science/article/pii/S1751157715301966 and http://arxiv.org/pdf/1506.05690v2.pdf.

 

  Republish
 

Republish

The Agency FAPESP licenses news via Creative Commons (CC-BY-NC-ND) so that they can be republished free of charge and in a simple way by other digital or printed vehicles. Agência FAPESP must be credited as the source of the content being republished and the name of the reporter (if any) must be attributed. Using the HMTL button below allows compliance with these rules, detailed in Digital Republishing Policy FAPESP.