Researchers at Carnegie Mellon University and McGill University have adapted an algorithm first developed to spot anomalies in data, like typos in patient information at hospitals or errant figures in accounting, to identify similarities across escort ads.
The algorithm scans text for similarities and clusters them, which could help law enforcement direct their investigations and better identify human traffickers and their victims, said Christos Faloutsos, the Fredkin Professor in Artificial Intelligence at CMU's School of Computer Science, who led the team.
"Our algorithm can put the millions of advertisements together and highlight the common parts," Faloutsos said. "If they have a lot of things in common, it's not guaranteed, but it's highly likely that it is something suspicious."
The team calls the algorithm InfoShield and presented a paper on their findings at this year's IEEE International Conference on Data Engineering (ICDE).
According to the International Labor Organization, an estimated 24.9 million people are trapped in forced labor. Of those, 55% are women and girls trafficked in the commercial sex industry, where most ads are posted online. The same person may write ads for four to six victims, leading to similar phrasing and duplication among listings.
"Human trafficking is a dangerous societal problem which is difficult to tackle," lead authors Catalina Vajiac and Meng-Chieh Lee wrote. "By looking for small clusters of ads that contain similar phrasing rather than analyzing standalone ads, we're finding the groups of ads that are most likely to be organized activity, which is a strong signal of (human trafficking)."
To test InfoShield, the team ran it on a set of escort listings in which experts had already identified trafficking ads. The team found that InfoShield outperformed other algorithms at identifying the trafficking ads, flagging them with 85% precision. Perhaps more importantly, it did not incorrectly flag any escort listings as human trafficking ads when they were not. False positives can quickly erode trust in an algorithm, Faloutsos said.
Proving this success was tricky. The test data set contained actual ads placed by human traffickers. The information in these ads is sensitive and kept private to protect the victims of human trafficking, so the team could not publish examples of the similarities identified or the data set itself. This meant that other researchers could not verify their work.
"We were basically saying, 'Trust us, our algorithm works,'" Vajiac said.
To remedy this, the team looked for public data sets they could use to test InfoShield that mimicked what the algorithm looked for in human trafficking data: text and the similarities in it. They turned to Twitter, where they found a trove of text and similarities in that text created by bots.
Bots will often tweet the same information in similar ways. Like a human trafficking ad, a bot tweet might keep the same format with only some pieces of information changed. Reihaneh Rabbany of McGill University, a co-author of the paper, said that in both cases -- Twitter bots and human trafficking ads -- the goal is to find organized activity.
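The idea of a shared format with a few changed pieces can be sketched in a few lines. The example below is a simplified illustration, not the InfoShield algorithm: it assumes every text in a group has the same number of words, keeps the words that never change, and marks the rest as slots. The function name, the slot marker and the sample tweets are hypothetical.

```python
def extract_template(texts, slot="*"):
    """Return the shared template of same-length texts.

    Words that are identical at a position across all texts are kept;
    positions where the texts differ are replaced by the slot marker.
    """
    token_rows = [t.split() for t in texts]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("this sketch assumes equal word counts")
    return " ".join(
        column[0] if len(set(column)) == 1 else slot
        for column in zip(*token_rows)
    )


# Invented bot-style tweets that differ only in two positions.
sample_tweets = [
    "Win a free phone at shop-a.example today",
    "Win a free laptop at shop-b.example today",
    "Win a free tablet at shop-c.example today",
]
template = extract_template(sample_tweets)
print(template)
```

Real ads and tweets rarely line up word for word, so a practical version would first need to align texts of different lengths before comparing positions.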
Among tweets, InfoShield outperformed other state-of-the-art algorithms at detecting bots. Vajiac said this finding was a surprise, given that other algorithms take into account Twitter-specific metrics such as the number of followers, retweets and likes, and InfoShield did not. The algorithm instead relied solely on the text of the tweets to determine bot or not.
"That speaks a lot to how important text is in finding these types of organizations," Vajiac said.
The paper's authors are Christos Faloutsos, Catalina Vajiac and Namyong Park from Carnegie Mellon University; Reihaneh Rabbany, Aayushi Kulshrestha and Sacha Levy from McGill University; Meng-Chieh Lee from National Chiao Tung University; and Cara Jones from Marinus Analytics.