Algorithms facilitate automated classification of web texts
September 02, 2015
By Diego Freire
Agência FAPESP – A set of algorithms developed at the University of São Paulo’s Mathematics & Computer Science Institute (ICMC-USP) in São Carlos, Brazil, can be used to filter large amounts of text to extract data as a basis for classifying texts according to their contents.
Comments and other texts posted to social networks, for example, can easily be identified as positive or negative, and the collections held by virtual libraries can be classified by literary genre, subject matter and other specific aspects of each work.
The algorithms were developed by Rafael Geraldeli Rossi, who conducted his doctoral research on “Pattern extraction from textual document collections using heterogeneous networks” with support from FAPESP. Rossi won a best paper award at the 16th International Conference on Intelligent Text Processing & Computational Linguistics, held in April 2015 in Egypt.
“Growing amounts of information are available on different easily accessible platforms like the web,” Rossi said. “New strategies are needed to filter them in an intelligent way without losing data in the process while guaranteeing greater accuracy in the interpretation of information.”
The algorithms developed by Rossi classify texts on the basis of both the frequency of certain terms and networks of associations among terms, speeding up the process and reducing the amount of information that needs to be provided to “train” the computer software.
The basis for the method is machine learning, a subfield of artificial intelligence involving data mining, pattern recognition algorithms and other techniques that enable a computer to improve its own performance on a task by learning from examples previously classified by a user or specialist.
According to Solange Oliveira Rezende, a researcher at ICMC-USP and Rossi’s PhD supervisor, representing data as networks enhances its organization and classification on the basis of a few examples that have already been classified.
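To give a rough idea of how a network can spread a few known labels to unclassified texts, the sketch below propagates class scores back and forth between documents and the terms they contain. It is a generic illustration of label propagation on a document-term network, not Rossi's algorithms; the function name `propagate_labels`, its parameters and the toy documents are all hypothetical.

```python
from collections import Counter, defaultdict


def propagate_labels(docs, seeds, classes, iterations=10):
    """Assumed inputs: docs maps doc id -> list of tokens,
    seeds maps a few doc ids -> known class, classes lists all classes."""
    # Term counts act as edge weights between documents and terms.
    tf = {d: Counter(tokens) for d, tokens in docs.items()}

    # Every document starts at zero for each class; labeled ones get 1.0.
    doc_scores = {d: {c: 0.0 for c in classes} for d in docs}
    for d, c in seeds.items():
        doc_scores[d][c] = 1.0

    for _ in range(iterations):
        # Step 1: each term averages the scores of the documents it occurs in.
        term_scores = defaultdict(lambda: {c: 0.0 for c in classes})
        term_weight = Counter()
        for d, counts in tf.items():
            for t, w in counts.items():
                term_weight[t] += w
                for c in classes:
                    term_scores[t][c] += w * doc_scores[d][c]
        for t in term_scores:
            for c in classes:
                term_scores[t][c] /= term_weight[t]

        # Step 2: each unlabeled document averages the scores of its terms.
        for d, counts in tf.items():
            total = sum(counts.values())
            if d in seeds or total == 0:
                continue  # labeled examples stay fixed; skip empty documents
            for c in classes:
                doc_scores[d][c] = (
                    sum(w * term_scores[t][c] for t, w in counts.items()) / total
                )

    # Each document receives the class with the highest score.
    return {d: max(classes, key=lambda c: doc_scores[d][c]) for d in docs}


docs = {
    "a": "great movie loved it".split(),
    "b": "terrible movie hated it".split(),
    "c": "loved the great acting".split(),
    "d": "hated the terrible plot".split(),
}
labels = propagate_labels(
    docs, seeds={"a": "positive", "b": "negative"}, classes=["positive", "negative"]
)
print(labels)
```

In this toy run only documents "a" and "b" are labeled by hand; "c" and "d" receive their classes through the terms they share with the labeled documents.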
“The representation of term relations in network form facilitates the learning of patterns that aren’t assimilated in other types of representation,” Rezende said. “So algorithms were developed to manipulate these representations in term networks that can then be analyzed for the different kinds of relationship among terms while tailoring the machine learning process to the user’s needs.”
For Rezende, the algorithms developed by Rossi simplify the classification process without impairing its accuracy and with far less computational complexity than when more conventional methods are used.
“The most distinctive feature of this method is that it doesn’t consider only the frequency with which terms occur in documents, as most text classification techniques typically do,” she said. “It also takes into account the relationships among terms.”
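The difference between counting terms and relating them can be illustrated with a minimal sketch. It assumes one simple kind of relationship, co-occurrence of two terms in the same document, which is only one of many possible choices and is not claimed to be the relation used in Rossi's work; the helper names and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(docs):
    # Frequency view: how often each term occurs across all documents.
    return Counter(term for tokens in docs for term in tokens)


def term_network(docs):
    # Network view: an edge links two terms that appear in the same document;
    # its weight is the number of documents in which they co-occur.
    edges = Counter()
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


docs = [
    ["dengue", "outbreak", "city"],
    ["dengue", "outbreak", "vaccine"],
    ["city", "election"],
]
print(term_frequencies(docs))
print(term_network(docs))
```

Here "dengue", "outbreak" and "city" all have the same frequency, but the network shows that "dengue" and "outbreak" co-occur in two documents while "city" and "election" co-occur in only one, information that frequency counts alone do not capture.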
The algorithms were developed as part of the research project “Machine learning for WebSensors: algorithms and applications” led by Rezende at ICMC-USP and also supported by FAPESP.
The project’s primary aim is to investigate machine learning methods to support the automatic construction of web sensors, Rezende explained.
“The development of a web sensor depends on specialists to define the sensor’s parameters, such as search query expressions, filters and web content mining, all of which makes the process more complex,” she said. “Semi-supervised machine learning algorithms for text classification such as those developed for our project can be used to create sensors and monitor whatever interests the user.”
According to Rezende, she and her group also aim to enable leveraging of the web as a “powerful and very large social sensor capable of identifying and monitoring events of various kinds on the basis of texts published by news portals and social networks: this includes epidemic detection, sentiment analysis, and extraction of political and economic indicators”.
Rossi’s research results, produced in collaboration with Rezende and Alneu de Andrade Lopes, a professor at ICMC-USP, can be viewed at www.researchgate.net/profile/Rafael_Rossi2.
Agência FAPESP licenses news reports under Creative Commons license CC-BY-NC-ND so that they can be republished free of charge and in a straightforward manner by other digital media or by print media. The name of the author or reporter (when applicable) must be cited, as must the source (Agência FAPESP).