News Release

Word scans indicate new ways of searching the Web

Cornell University

DENVER -- In the years after the American Revolution, U.S. presidents were talking about the British a lot, and then about militias, France and Spain. In the mid-19th century, words like "emancipation," "slaves" and "rebellion" popped up in their speeches. In the early 20th century, presidents started using a lot of business-expansion words, soon to be replaced by "depression."

A couple of decades later they spoke of atoms and communism. By the 1990s, buzzwords prevailed.

Jon Kleinberg, a professor of computer science at Cornell University, Ithaca, N.Y., has developed a method for a computer to find the topics that dominate a discussion at a particular time by scanning large collections of documents for sudden, rapid bursts of words. Among other tests of the method, he scanned presidential State of the Union addresses from 1790 to the present and created a list of words that eerily reflects historical trends. The technique, he suggests, could have many "data mining" applications, including searching the Web or studying trends in society as reflected in Web pages.

Kleinberg will emphasize the Web applications of his searching technique in a talk, "Web Structure and the Design of Search Algorithms," at the annual meeting of the American Association for the Advancement of Science (AAAS) in Denver on Feb. 18. He is taking part in a symposium on "Modeling the Internet and the World Wide Web."

Kleinberg says he got the idea of searching over time while trying to deal with his own flood of incoming e-mail. He reasoned that when an important topic comes up for discussion, keywords related to the topic will show a sudden increase in frequency. A search for these words that suddenly appear more often might, he theorized, provide ways to categorize messages.

He devised a search algorithm that looks for "burstiness," measuring not just the number of times words appear, but the rate of increase in those numbers over time. Programs based on his algorithm can scan text that varies with time and flag the most "bursty" words. "The method is motivated by probability models used to analyze the behavior of communication networks, where burstiness occurs in the traffic due to congestion and hot spots," he explains.
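The release does not spell out the algorithm itself. As a rough illustration only, the sketch below treats each time step as being in either a "baseline" or a "burst" state and picks the cheapest sequence of states, charging a penalty each time the burst state is entered. The function names, the rate multiplier `s`, the penalty weight `gamma` and the sample counts are assumptions made for this example, not details taken from Kleinberg's work.

```python
import math


def burst_states(hits, totals, s=2.0, gamma=1.0):
    """Label each time step 0 (baseline) or 1 (burst) for one word.

    hits[t]   = number of documents at step t that contain the word
    totals[t] = number of all documents at step t
    s, gamma  = illustrative tuning knobs (assumed, not from the release)
    """
    n = len(hits)
    total = sum(totals)
    p0 = sum(hits) / total if total else 0.0   # baseline rate over the whole period
    if not 0.0 < p0 < 1.0:
        return [0] * n                         # word never (or always) appears: no burst
    p1 = min(s * p0, 0.9999)                   # elevated rate assumed for the burst state
    rates = (p0, p1)
    enter_cost = gamma * math.log(n)           # penalty for switching into the burst state

    def fit_cost(state, r, d):
        # Negative log-likelihood of r hits in d documents (binomial, constant dropped).
        p = rates[state]
        return -(r * math.log(p) + (d - r) * math.log(1.0 - p))

    # Dynamic programming: cost[k] is the cheapest labelling so far that ends in state k.
    cost = [0.0, math.inf]                     # the sequence starts in the baseline state
    back = []
    for r, d in zip(hits, totals):
        new_cost, pointers = [], []
        for state in (0, 1):
            from0 = cost[0] + (enter_cost if state == 1 else 0.0)
            from1 = cost[1]                    # staying in or leaving a burst is free
            pointers.append(0 if from0 <= from1 else 1)
            new_cost.append(min(from0, from1) + fit_cost(state, r, d))
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards to recover one state per time step.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


def burst_intervals(states, labels):
    """Turn per-step states into (first, last) label pairs; last is None if still open."""
    intervals, start, prev_label = [], None, None
    for label, state in zip(labels, states):
        if state == 1 and start is None:
            start = label
        elif state == 0 and start is not None:
            intervals.append((start, prev_label))
            start = None
        prev_label = label
    if start is not None:
        intervals.append((start, None))
    return intervals


# Made-up counts: a word appears in 1 of 10 documents per year, then in 8 or 9 of 10.
years = list(range(1990, 2000))
states = burst_states([1, 1, 1, 1, 8, 9, 8, 1, 1, 1], [10] * 10)
print(burst_intervals(states, years))
```

Raising `gamma` in this sketch makes it more reluctant to declare a burst, so only longer or sharper surges are flagged; raising `s` demands a higher rate before a step counts as bursty.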

In his own e-mail -- largely from other computer scientists -- he quickly found keywords relating to hot topics. In mail from students he found bursts in the word "prelim" shortly before each midterm exam. Later, he tried the same technique on the texts of State of the Union addresses, all of which are available on the Web, from Washington in 1790 through George W. Bush in 2002. From these speeches he produced a long list of words (see attached table) that summarizes American politics from early revolutionary fervor up to the age of the modern speechwriter.

While we already know about these trends in American history, Kleinberg points out, a computer doesn't, and it has found these ideas just by scanning raw text. So such a technique should work just as well on historical records in obscure situations where we have no idea what the important terms or keywords are. It might even be used to screen e-mail "chatter" by terrorists. Sociologists, Kleinberg adds, may find it interesting to look for trends in personal Web logs popularly known as "blogs."

For searching the Web, Kleinberg suggests, such a technique could help zero in on what a searcher wants by recognizing the time context of such material as news stories. For instance, he says, a person searching for the word "sniper" today is likely to be looking for information about the recent attacks around the nation's capital -- but the same search nearly four decades ago might have come from someone interested in the Kennedy assassination.

In his AAAS talk Kleinberg also explores other Web-searching techniques. A few years ago, he suggested that a way to find the most useful Web sites on a particular subject would be to look at the way they are linked to one another. Sites that are "linked to" by many others are probably "authorities." Sites that link to many others are likely to be "hubs." The most authoritative sites on a topic would be the ones that are linked to most often by the most active hubs, he reasoned. A variation on this idea is used by Google, and a more formal version is being used in a new search engine called Teoma (http://www.teoma.com).
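The hub-and-authority idea can be sketched as a simple repeated scoring loop: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch under that reading; the function name, the fixed iteration count and the toy link data are assumptions for illustration, and it does not claim to reproduce what Google or Teoma actually run.

```python
import math


def hubs_and_authorities(links, iterations=50):
    """Score pages as hubs and authorities from a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    def normalise(scores):
        # Rescale so the scores stay bounded from one round to the next.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for page in scores:
            scores[page] /= norm

    for _ in range(iterations):
        # Authority: sum of the hub scores of every page that links to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        normalise(auth)

        # Hub: sum of the authority scores of every page this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        normalise(hub)

    return hub, auth


# Hypothetical toy Web: three link lists pointing at two candidate authorities.
toy_links = {
    "list1": ["siteA", "siteB"],
    "list2": ["siteA"],
    "list3": ["siteA"],
}
hub, auth = hubs_and_authorities(toy_links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy data `siteA` ends up as the strongest authority because every list links to it, and `list1` as the strongest hub because it links to both candidate authorities.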

Kleinberg and others have found that despite its anarchy, there is a great deal of "self-organization" on the Web. In a variation on the "six degrees of separation" idea, Kleinberg says, almost every site on the Web can be reached from almost any other through a series of steps. The structure seems to be a bit like the Milky Way galaxy, with a very dense "core" of heavily interconnected sites surrounded by less dense regions. Nodes outside the core are divided into three categories: "upstream" nodes that link to the core but cannot be reached from it; "downstream" nodes that can be reached from the core but don't link back to it; and isolated "tendrils" that are not linked directly to the core at all.
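As an illustration of the categories described above, the sketch below sorts the pages of a small link graph into core, upstream, downstream and everything else by checking which pages can reach, and be reached from, one page assumed to lie in the core. The function names, the choice of a single seed page and the toy graph are assumptions made for this example; the "tendrils" bucket here simply collects every page that falls outside the other three groups.

```python
from collections import deque


def reachable(start, neighbours):
    """Return every node reachable from start by following neighbours (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def classify_pages(links, core_seed):
    """Split pages into core / upstream / downstream / tendrils.

    links     = {page: [pages it links to]}
    core_seed = a page assumed (for this sketch) to be inside the core
    """
    nodes = set(links) | {core_seed}
    reverse = {}
    for source, targets in links.items():
        for target in targets:
            nodes.add(target)
            reverse.setdefault(target, []).append(source)

    forward = reachable(core_seed, links)      # pages the core can reach
    backward = reachable(core_seed, reverse)   # pages that can reach the core
    core = forward & backward                  # mutually reachable with the seed

    return {
        "core": core,
        "upstream": backward - core,           # link into the core, not reachable from it
        "downstream": forward - core,          # reachable from the core, no path back
        "tendrils": nodes - forward - backward,  # everything else in this sketch
    }


# Hypothetical toy graph.
toy_web = {
    "up": ["core1", "side"],
    "core1": ["core2"],
    "core2": ["core1", "down"],
}
print(classify_pages(toy_web, "core1"))
```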

Within this structure there are many "communities" of sites representing common interests that are extensively linked to one another. So, Kleinberg suggests, searches might be done by following along the link paths from one site to another, as well as just scanning an index of everything.

"Deeper analysis, exposing the structure of communities embedded in the Web, raises the prospect of bringing together individuals with common interests and lowering barriers to communication," Kleinberg concludes.

###

Related World Wide Web sites: The following site provides additional information on this news release:

Jon Kleinberg's page, with links to papers: http://www.cs.cornell.edu/home/kleinber/

The 150 term bursts of highest weight in Presidential State of the Union Addresses, 1790-2002

Word            Interval of burst

gentlemen       1790 - 1800
militia         1801 - 1816
british         1809 - 1814
enemy           1812 - 1814
savages         1812 - 1819
spain           1818 - 1821
likewise        1818 - 1824
chambers        1833 - 1835
french          1833 - 1835
bank            1833 - 1836
france          1834 - 1835
texas           1843 - 1846
annexation      1844 - 1846
mexican         1845 - 1847
her             1846 - 1847
mexico          1846 - 1847
steamers        1847 - 1849
oregon          1847 - 1852
california      1848 - 1852
kansas          1856 - 1858
slavery         1857 - 1860
whilst          1857 - 1860
slaves          1859 - 1863
rebellion       1861 - 1871
emancipation    1862 - 1864
paper           1867 - 1868
coinage         1877 - 1886
silver          1884 - 1885
silver          1889 - 1891
spanish         1897 - 1898
cuba            1897 - 1899
puerto          1898 - 1901
reserves        1901 - 1904
forest          1901 - 1905
forests         1907 - 1908
interstate      1907 - 1908
marketing       1919 - 1929
tile            1922 - 1928
ought           1925 - 1926
veterans        1925 - 1931
relief          1929 - 1935
depression      1930 - 1937
recovery        1930 - 1937
banks           1931 - 1934
democracy       1937 - 1941
wartime         1941 - 1947
production      1942 - 1943
fighting        1942 - 1945
japanese        1942 - 1945
war             1942 - 1945
peacetime       1945 - 1947
program         1946 - 1948
veterans        1946 - 1948
wage            1946 - 1949
housing         1946 - 1950
atomic          1947 - 1959
collective      1947 - 1961
aggression      1949 - 1955
defense         1951 - 1952
free            1951 - 1953
soviet          1951 - 1953
korea           1951 - 1954
communist       1951 - 1958
program         1954 - 1956
alliance        1961 - 1966
communist       1961 - 1967
poverty         1963 - 1969
propose         1965 - 1968
tonight         1965 - 1969
billion         1966 - 1969
vietnam         1966 - 1973
america         1970 - 1972
goal            1970 - 1974
inflation       1971 - 1980
energy          1974 - 1978
oil             1974 - 1981
significant     1974 - 1981
ensure          1974 - 1988
nuclear         1975 - 1981
strategic       1975 - 1981
percent         1975 - 1984
major           1977 - 1983
we've           1978 - 1980
commitment      1978 - 1981
sector          1978 - 1986
nation's        1979 - 1981
soviet          1979 - 1983
energy          1980 - 1981
1980's          1980 - 1982
initiatives     1980 - 1985
afghanistan     1980 - 1988
program         1981 - 1982
programs        1981 - 1983
women           1981 - 1984
chamber         1982 -
that's          1982 -
we're           1982 -
we've           1982 -
deficits        1982 - 1988
america's       1982 - 1992
spending        1982 - 1995
it's            1982 - 1996
there's         1982 - 1996
we'll           1982 - 1998
they're         1982 - 1999
can't           1983 -
child           1983 -
i'm             1983 - 1998
tonight         1984 -
tell            1984 - 1995
freedom         1985 - 1991
don't           1986 -
america         1986 - 1991
let's           1987 -
get             1987 - 1995
kids            1987 - 1995
let             1987 - 1995
businesses      1990 -
got             1990 -
parents         1990 -
something       1990 - 1997
cuts            1991 -
families        1991 -
crime           1991 - 1996
cut             1991 - 1996
jobs            1991 - 1998
hard            1991 - 1999
know            1991 - 1999
children        1992 -
thank           1992 -
health          1992 - 1994
want            1992 - 1995
you             1992 - 1995
americans       1994 -
medicare        1994 -
school          1994 -
welfare         1994 - 1997
bipartisan      1995 -
college         1995 -
communities     1995 -
working         1995 - 1996
america         1996 -
challenge       1996 -
schools         1996 -
teachers        1996 -
21st            1997 -
ask             1997 -
century         1997 -
help            1998 -
you             1998 - 1999

E-Mail: deb27@cornell.edu

