searchnet06 - People - Hobart and William Smith Colleges

Search and the ‘Net @ 2006
Trends, Challenges and Cutting-Edge
Developments in Internet Search
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
For
Rochester Regional Library Council
Member Libraries’ Staff
Sponsored by the
Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the
New York State Library 2005
For Today …
 The current “state of web search”
 What’s new among established services
 Services launched in 2005
 The Latest from the Living Web
– Deep Web
– RSS feeds
– Blogs
– Podcasts
 Cutting Edge in Search
– Natural Language Processing
– Text mining
 Current trends and future possibilities
Linklist for today’s session:
people.hws.edu/hunter/search06links.htm
Web Search @ 2006
Who’s crawling the Web?
 Yahoo
– Owns AlltheWeb, Altavista, Inktomi, Overture
 Google
 MSN
 AskJeeves owns Teoma
 Gigablast
 NOTE: Ownership is different from
database affiliation
Google
Database Affiliates
Google
AOL
Excite
Netscape
Most popular services
 Google: 48%
 Yahoo: 29% (up 20% from last year)
 MSN: 8% (up 30% from last year)
 All others: 15% (AOL, AJ, Net, Gig)
 Study by Harris Interactive (must purchase)
– www.harrisinteractive.com
Database Size
 Google: ca. 10 billion web pages
(???)
 Yahoo – 20 billion “web objects”
 MSN – 6 billion (est.)
 Teoma – 3 billion (est.)
 Gigablast – 1.5 billion (est.)
Search Engine Overlap
 Results compared from 12,500 random queries
from the largest engines
 85% were unique to one engine
 11% were shared by any two
 3% were shared by any three
 1% were shared by all
 Study by Dogpile, U Pittsburgh and Penn State
– CompareSearchEngines.dogpile.com/OverlapAnalysis
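Overlap figures of this kind come from counting, for each distinct result, how many engines returned it. A minimal sketch of that tally, assuming "shared by any two" means "returned by exactly two engines"; the engine names and result IDs are invented for illustration and are not data from the Dogpile study:

```python
from collections import Counter

def overlap_shares(results_by_engine):
    """Return {k: share of distinct results returned by exactly k engines}."""
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        # set() counts a result once per engine even if it is listed twice.
        for url in set(urls):
            engines_per_url[url] += 1
    total = len(engines_per_url)
    by_count = Counter(engines_per_url.values())
    return {k: by_count[k] / total for k in sorted(by_count)}

# Hypothetical first-page results for a single query.
sample = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u1", "u5", "u6"],
    "engine_c": ["u1", "u2", "u7", "u8"],
}
shares = overlap_shares(sample)
for k, share in shares.items():
    print(f"returned by exactly {k} engine(s): {share:.1%}")
```

A real study would repeat this over thousands of queries and average the shares.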
Recent Developments
Among Established Services
2005: A lot to Yahoo! about
 No longer just a subject directory
 New features and an estimated 20%
increase in users
 Vertical Search Engines
– Music, health, finance, shopping and over 20 more
 Personalization – My Yahoo and Yahoo
360
– Creates an online identity with photos, restaurant
reviews, personal histories and personal blog
2005: A lot to Yahoo! about
 RSS feeds
– Offered as part of My Yahoo
– User-friendly Reader/Aggregator provided;
limited to 250,000 Yahoo-selected feeds
– Yahoo content as RSS: News, Ask Yahoo, Buzz
Index (popular searches), News Groups
 Video search (beta): video.search.yahoo.com
– Advanced search features: KW, format, file size,
length, content filter
 Creative Commons search: search.yahoo.com/cc
– Content that is free to share or modify
2005: A lot to Yahoo! About
Contextual Searching - Y!Q
 Selected web pages or highlighted
sections analyzed for word frequency
and “concept extraction” and used as
basis for a search
 Results give basis for query in “context
selection box”
 Refinements include removing
unwanted terms/phrases and “more
like this” link
 Requires download of free toolbar
toolbar.yahoo.com
2005: A lot to Yahoo! About
Open Content Alliance (10/3/05)
 Large scale E-text initiative
 Members include Yahoo, Internet
Archive, National Archives (UK), RLG,
LC, 8 US and 6 Canadian Universities
 Over 25,000 Digitized copies of public
domain AND copyrighted works
 Works under copyright only available if
permission granted by owner
 Yahoo plans to include the content in
its database or subject directory
2005: A lot to Yahoo! About
Yahoo/OCLC toolbar
 Searchers may restrict their results to the
Open World Cat database, currently at 57
million records
 Displays library holdings in the searcher’s
vicinity
 Download (free) at www.oclc.org/toolbar
AOL
search.aol.com
 Results from Google
 Personalization (with free account)
 Results clustering a la Vivisimo
 “Smartbox” query refinement
– Offers suggestions BEFORE search button is
clicked
 “Snapshots”: human-created answers a
la AJ
 Local Search, Maps, Vertical Engines
Gigablast
www.gigablast.com
 “Related pages” – Relevant search
results which may not contain original
search terms
 Database now at 1.5 billion (up 50%)
 One to keep your eye on
Ixquick Metaengine
www.ixquick.com
 Repetitive results removed
 Results marked as irrelevant by user
used to delete other similar pages in
real time
 International price comparison
covering over 5,000 merchants
 International phone directory,
residential and business
The Year at Google
Google
Personalization
 Re-orders search results based on user’s past
searches and click tracks
 Ranking will change, depending on user
profiles
 Requires setting up a (free) account
 Personalized home page (G. as portal?)
 Complex profiles are problematic
e.g. “Movies, computer hardware, the Internet,
general news, astronomy”
SEARCH: cars
Which categories take precedence over others?
Google
Personalization
 Search records personally associated
with a user are deleted if service is
dropped
 Search log data for all Google searches
kept (via cookies)
 Google’s privacy policy:
www.google.com/privacy.html
 Bookmark entire web pages
Google
Google Earth earth.google.com
 Geographic search application
 Originally Keyhole 3D, now a free
Google download
 Images taken by satellites and aircraft
“sometime in the last 3 years”
 “Fly to” accepts an address or coordinates, returns a view from 3,000
ft. above, with zoom capabilities
Google
Local for Mobile google.com/glm
 Free download
 Unique ID associated with your phone
 Simplified version of the web-based Local
Search
 Emphasis on maps and directions
 Point-to-point directions limited to a certain
area
 Business listings offer address and phone
number only
 Does not support all mobile phones
Google
Video Search
video.google.com
 Index of closed captioning and text
descriptions from selected TV and
other video content after Dec. 2004
 Results include thumbnail, description,
source, date, duration and hyperlink
 Currently hyperlink links to more
description, not to the video itself
 Q&A Service: ready reference service
providing answers to fact-based
queries
Google Print’s 2 divisions
Publisher Program and Library Project
Publisher Program
 Publishers authorize G. to scan and
make searchable the full text of their
books
 Users see only the full page containing
their search terms
 Link to purchase copy
Google Print’s 2 divisions
Publisher Program and Library Project
Library Project
 Scan and make searchable 15 million
books, in and out of copyright, from
Harvard, Stanford, Oxford, U. Michigan
and NYPL
 For works in copyright, users see only a
few sentences around search terms
 Users may browse full text of public
domain works
 NOTE: Not possible to print ANY material
from either Google Print project
Library Project in 2005
 June – Assoc. of American Publishers
questions legality of Library Project
 August 15 – G. “temporarily halts”
scanning in-copyright works; continues
scanning public domain works
 September 20 – Authors Guild files a
formal complaint against G. in NY
Federal District Court alleging “massive
copyright infringement”
Services Launched in 2005
Icerocket
www.icerocket.com
 Results Enhancements
– Thumbnails of pages retrieved
– Archived version (Internet Archive)
– Quick View
 Full Boolean
 Includes Web, Blogs, Multimedia and
News, with unique advanced features
 “Blog Trends” tool
 MAY be using Google
 May become www.blogscour.com
Brainboost
www.brainboost.com
 A natural language “answer engine”
 Results include “Related Questions” as
well as responses to your query
Topic Hunter
www.topichunter.com
 Rich interface to over 175 general,
meta and vertical search engines,
many of them new
 Categories include
– General (16)
– Answer Searching
– News
– Metaengines
– Blogs
– Invisible Web
– Images
– Audio/Video
– +11 more
 Not a meta; carries your search terms
to each engine you query
Queryster
queryster.com
 Interface that provides quick scanning
of results from up to 10 engines
– Yahoo, Google, MSN, AJ, WNut, Teoma, AV,
Amazon, Ebay, A9
 Executes your search as you click on
the engine
 Batch search – executes multiple
queries in each engine
 Fresh Google – uses daterange search,
(not reliable)
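Part of what made date-restricted searching awkward is that Google's daterange: operator expected Julian day numbers rather than calendar dates. A minimal sketch of building such a query; how Queryster itself does this is not described here, so the `fresh_query` helper and the 7-day window are assumptions for illustration only:

```python
from datetime import date, timedelta

# Offset between Python's proleptic Gregorian ordinal and the Julian Day Number.
JDN_OFFSET = 1721425

def julian_day(d):
    """Julian Day Number for a calendar date."""
    return d.toordinal() + JDN_OFFSET

def fresh_query(terms, today, days_back):
    """Build a query limited to the last `days_back` days (hypothetical helper)."""
    start = today - timedelta(days=days_back)
    return f"{terms} daterange:{julian_day(start)}-{julian_day(today)}"

print(fresh_query("podcast", date(2005, 11, 28), 7))
```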
RedLightGreen
www.redlightgreen.com
 120 million titles from the Research
Libraries Group union catalog
 Search options
– Boolean
– Phrase
– Author
– Title
– Keyword (Title, LCSH)
– Subject (LC)
– Limits by language and date
 Results refined by Related Subjects,
Authors and Language
 Reviews of books linked to record
 5 citation outputs available
The Latest from the Living Web
Deep Web – Weblogs
RSS Feeds - Podcasts
Deep Web
 Estimated to be from 67,000 to
92,000 terabytes, with surface web at
167 terabytes
– A terabyte is equivalent to 1024 gigabytes
– The Library of Congress contains ca. 20
terabytes of text
 http://www.sims.berkeley.edu/research/projects/how-much-info-2003/internet.htm
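To put those sizes in proportion, the figures above can be run through some quick arithmetic. This is only a sketch that reuses the numbers on the slide; the variable names are mine:

```python
# Figures as quoted on the slide (terabytes).
SURFACE_TB = 167
DEEP_TB_LOW, DEEP_TB_HIGH = 67_000, 92_000
LOC_TEXT_TB = 20      # Library of Congress, text only
GB_PER_TB = 1024

low_ratio = DEEP_TB_LOW / SURFACE_TB
high_ratio = DEEP_TB_HIGH / SURFACE_TB
surface_gb = SURFACE_TB * GB_PER_TB

print(f"Deep web is roughly {low_ratio:.0f} to {high_ratio:.0f} times the surface web")
print(f"Surface web in gigabytes: {surface_gb:,}")
print(f"Deep web (low estimate) in Libraries of Congress: {DEEP_TB_LOW / LOC_TEXT_TB:,.0f}")
```

On these estimates the deep web is several hundred times larger than what crawlers reach.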
Deep Web
 Crawlers are increasing their coverage
– Proprietary, non-html filetypes
– Multimedia
– Software
– Weblogs
 Still “in the deep dark web”
– Dynamically-created pages
– Password-protected sites
– Sites prohibiting crawlers (robots.txt
exclusion)
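The robots.txt exclusion can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch only: the robots.txt content, the site address and the crawler name "ExampleBot" are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(path, agent="ExampleBot"):
    """True if a well-behaved crawler is allowed to fetch this path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)

for path in ("/index.html", "/members/profile.html", "/search/results.html"):
    print(path, "->", "crawl" if may_crawl(path) else "skip")
```

Pages under the disallowed paths never reach the engine's index, which is one reason they stay in the deep web.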
Dynamically-created Web pages
 Created at the moment of the query
using the most recent version of the
database.
 Database-driven
 Require interaction
– Amazon.com
 What titles are available? At what price?
 Used widely in e-commerce, news,
statistical and other time-sensitive sites.
Turbo10
turbo10.com
 Free, web-based service
 Create your own metaengine
(“collection”) of up to 10 engines from
turbo’s listing of ca. 400
 New engines can be added
 Requires setting up a turbo e-mail
account
 Maintenance (??)
Copernic Agent
www.copernic.com
 Software that organizes, executes and
collates results of deep web searches
by engines of your choosing
 “Basic” version is free download
 “Personal” version is $29.95
Blogs: What are they?
 Online diaries or journals, usually by
one person, though many invite
“comments”
 First developed in 1997
 Within the same blog, tone can range
from personal musings to discussion
of recent issues in technology and
research
 High link-to-word ratio
 Often link to other weblogs of similar
content
Blogs: What are they?
 Can contain rumor, inside information,
speculation, blatant errors as well as
– Breaking news: political and
technical/research
– Commentary on new software or websites
– Consumer reaction to products or services
 Blog authoring tools are basic content
management software, useful in ways
other than online diaries
– Typify the spirit of information sharing that has
fueled the Internet since its beginnings
Today’s Blogosphere
 The blogosphere is now over 30 times
as big as it was 3 years ago, with no
signs of letup in growth
 As of October 2005, Technorati is now
tracking 19.6 million weblogs
 The total number of weblogs tracked
continues to double about every 5
months
 About one new weblog is created each
second
Today’s Blogosphere
 2% - 8% of new weblogs per day are
fake or spam weblogs
 Between 700,000 and 1.3 Million posts
are made each day
 http://www.problogger.net/archives/2005/10/17/state-of-the-blogosphere-october2005/
Blogs and Search: Google
blogsearch.google.com and
search.blogger.com
 First major engine to offer a blog-specific search (Sept. ’05)
 Defines blogs as “sites which use RSS
and other structured feeds and update
content on a regular basis”
 Advanced Search features
– Blog title
– Author of post
– Date range
– Language limit
– Safe Search option
Blogs and Search: Clusty
clusty.com
 Formerly Vivisimo
 Metasearch engine with a blog search
capability
 Source engines for blog search
– Blogdigger
– Feedster
– Blogpulse
– Technorati
– Daypop
Blogs and Search: Clusty
clusty.com
 Results clustered in topical folders
 Source engine given for each result
 Date and time of each posting given
 Accepts natural language queries
 Full Boolean capabilities
 Phrase search (“ “)
 Limits include
– Domain
– Host
– Number of results
– Source Engine
– Length of search (timeout)
RSS: What is it?
 A broadcast version of current
content from a website, blog, news
page or other source (aka “RSS
Feed”)
 A live, constantly updated table of
contents with links to the full text,
eg. a feed from NYTimes.com
How do I access RSS feeds?
 Sites with RSS feeds display a small icon
(usually orange) labeled RSS or XML or
Atom
 As RSS is in XML, may require
downloading reader software (older
versions of browsers cannot read XML).
Sources for reader software include
– www.download.com (search rss reader)
 Aggregators allow for reading and
organizing feeds of your choosing
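Because an RSS feed is plain XML, the "table of contents" a reader displays can be pulled out with a few lines of code. A minimal sketch using Python's standard library; the feed text is invented for illustration, and a real reader would first download it from the feed's URL:

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed, used purely for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Library News</title>
    <link>http://www.example.org/</link>
    <item>
      <title>New database trial</title>
      <link>http://www.example.org/news/1</link>
    </item>
    <item>
      <title>Holiday hours</title>
      <link>http://www.example.org/news/2</link>
    </item>
  </channel>
</rss>"""

def read_feed(xml_text):
    """Return the feed title and a list of (headline, link) pairs."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return channel.findtext("title"), items

feed_title, headlines = read_feed(SAMPLE_FEED)
print(feed_title)
for headline, link in headlines:
    print(" -", headline, link)
```

An aggregator repeats this for every subscribed feed and shows only the items it has not seen before.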
RSS: Crossing into the Mainstream
Study of 4,000 respondents by Yahoo! and Ipsos
Insight, August 2005
 Who is using RSS?
12% were aware of RSS
4% had knowingly used it
27% unknowingly use RSS via
personalized start pages, eg. My Yahoo
 Why do they use RSS?
Ease of use
Choice of content
Instant updating capability (only 7% !!!)
RSS: Crossing into the Mainstream
 What feeds are they using?
(in order of popularity)
World news
National news
Entertainment
Science and technology
Weather
Local news
 http://publisher.yahoo.com/rss/RSS_WhitePaper1004.pdf
MY Yahoo! Ticker
yahoo.com
 RSS reader and aggregator
 Click on Downloads
 Click Deskbar for MS Windows
 Choose among 250,000 Yahoo-selected
RSS feeds
 News and Stocks Server Options allow
filtering by a list of topics
RSS at Google
www.google.com/reader
 Requires setting up a (free) account
 Subscribe to any feed of your choosing
 Keyword search available for feeds in
Google’s database
 RSS feeds available for Google News
 Folders (labels) available for grouping
feeds of similar content
 Sort feed items by date and relevance
Podcasts 101
iPod + broadcast = podcast
 Downloadable audio or video files which
can be played on many devices
PC, home systems
Mobile (iPods, cars, MP3)
 “Broadcast” by means of RSS
 Not limited to Apple’s iPod or MP3 format
 Often embedded in weblogs with RSS
feeds
 As with any Living Web (RSS) content
podcasts can go offline; may or may not
be archived
Podcasts 101
 Development
Ease of publication (cheap storage, MP3 format)
Ease of subscription (RSS 2.0)
Ease of use (iPod, other mobile audio devices)
 Create files (audio parameters apply!)
 Publish files
– From iTunes.com
– From any web site capable of supplying
content via RSS (Most blogs do)
Podcasts 101
 Subscribe to files via the URL for the
RSS Podcast feed (Red RSS or XML
button)
 Podcatcher – Freeware that receives
and organizes podcasts (an RSS
aggregator for podcasts).
Available at podcatcher.rubyforge.org
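What turns an ordinary RSS 2.0 feed into a podcast feed is the `enclosure` element, which points to the media file for an item. A minimal sketch of what a podcatcher looks for, using Python's standard library; the feed content and URLs are invented:

```python
import xml.etree.ElementTree as ET

# Made-up podcast feed: one item carries audio, one is text only.
PODCAST_FEED = """<rss version="2.0">
  <channel>
    <title>Example Lecture Podcast</title>
    <item>
      <title>Lecture 1</title>
      <enclosure url="http://www.example.edu/audio/lecture1.mp3"
                 length="12345678" type="audio/mpeg"/>
    </item>
    <item>
      <title>Course announcement (text only)</title>
    </item>
  </channel>
</rss>"""

def find_episodes(xml_text):
    """Return (title, media URL, media type) for each item that has an enclosure."""
    episodes = []
    for item in ET.fromstring(xml_text).iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None:
            episodes.append((item.findtext("title"),
                             enclosure.get("url"),
                             enclosure.get("type")))
    return episodes

episodes = find_episodes(PODCAST_FEED)
for title, url, media_type in episodes:
    print(title, url, media_type)
```

A podcatcher would then download each listed URL and hand the file to the player or device.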
Podcasts in Higher Education
 Drexel (Chemistry)
– Lectures podcasted; class time used for
problem solving
 Duke (Computer Science)
– Students required to listen to podcasts on
related topics not covered in class
 U. of Hawaii (Computer Science)
– Intro class of 600; lectures podcasted;
“Listen to them when you have the time”
Video podcasting
“Vodcasts”
 June, 2005 - Apple iTunes begins to
support video podcasting
 Can provide supplemental multimedia
content as part of a course, or public
relations initiative
 With Web cams, DV cameras and
vodcasting, we may be headed toward
the democratization of video content
 60gb Video iPod now available
 Arstechnica.com/news.ars/post/20050915-5313.html
Podcasting and Search
 Many podcasts are embedded in blogs
 Google blog search tool
blogsearch.google.com
Query: (subject) podcast
Main Google search is still text-based;
rock filetype:mp3 = 124 results on 11/28/05
 Blog-based search engine with media
search: www.blogdigger.com/media/
Podcasting and Search
Podcast Directories and Catalogs
 www.podcast.net
Directory of over 15,000 podcast feeds
Searchable by Title & Description, KW, Host (Author),
Location and Episode
 www.odeo.com
Searchable catalog of mp3 podcasts
Updates every 3 hours
Offers text snippet of the latest podcast from each
feed
 www.podcastdirectory.com
 www.podfeed.net
 http://dmoz.org/Computers/Internet/On_the_Web/Podcasts/
Podcast Subscriptions
From Feedburner 12/4/05
 Diggnation w/Kevin Rose & Alex Albrecht (31247)
 Photoshop TV (15915)
 English as a Second Language Podcast (10486)
 IT Conversations (8969)
 TOEFL Podcast (5526)
 Bruce Springsteen Born To Run 30th Anniversary Podcast (5378)
 Digital Photography Tips From The Top Floor (4308)
 The Secrets of Harry Potter (2832)
 Next Big Hit (2614)
 EarthCore - A Podcast Novel (2536)
The Cutting Edge in Search:
Natural Language Processing
Beyond Searching the Full Text:
Natural Language Processing
(aka Text mining, Data mining)
 How can we manage unstructured
information?
 Current web search engines match query
terms from the full text of downloaded
documents (“bag of words”)
 Term frequency, position, page linkage and
popularity and other factors used to create
the final selection and ranking of results.
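The "bag of words" idea can be shown in a few lines: each document is reduced to word counts, and pages are ranked by how often the query terms occur. This is a deliberately simplified sketch that uses term frequency only; real engines also weigh position, linkage and popularity, and the sample pages are invented:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query, documents):
    """Rank documents by the total number of query-term occurrences."""
    query_terms = tokenize(query)
    scored = []
    for name, text in documents.items():
        bag = Counter(tokenize(text))          # word order is thrown away here
        score = sum(bag[term] for term in query_terms)
        if score:
            scored.append((score, name))
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]

docs = {
    "page1": "Penicillin allergy: penicillin can trigger shock.",
    "page2": "Aspirin and penicillin compared.",
    "page3": "River bank erosion.",
}
ranking = rank("penicillin shock", docs)
print(ranking)
```

Nothing in this approach knows what the words mean, which is the gap NLP tries to fill.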
Enter Natural Language
Processing (NLP)
 With NLP software unstructured text
and data can be processed to reveal
degrees of meaning by
– Extracting terms identified as significant
– Summarizing content
– Discovering relationships among terms and
groups of terms
– HOW???
NLP Extraction
 Take all articles from a group of
pharmaceutical journals published in
one year (the “corpus”)
 Extraction – Run a relevant controlled
vocabulary (list of all known drugs)
against the corpus
NLP Extraction
 Drugs found, number of occurrences
and location in the corpus plus a list of
possible drugs not in the controlled
vocabulary
86>penicillin click for locations
124>tetracycline click for locations
213>aspirin click for locations
Are these also drugs? XXX, XXX, XXX
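The extraction step can be sketched as a lookup of every word in the corpus against the controlled vocabulary, recording where each hit occurs. This is a minimal illustration with an invented two-article corpus; it handles single-word terms only and does not attempt the "possible drugs not in the vocabulary" list, which needs more than a lookup:

```python
import re
from collections import defaultdict

def extract_terms(corpus, vocabulary):
    """Map each vocabulary term found to its (document, word position) locations."""
    vocab = {term.lower() for term in vocabulary}
    locations = defaultdict(list)
    for doc_id, text in corpus.items():
        for position, word in enumerate(re.findall(r"[A-Za-z]+", text)):
            if word.lower() in vocab:
                locations[word.lower()].append((doc_id, position))
    return dict(locations)

# Invented mini-corpus and vocabulary.
corpus = {
    "article1": "Penicillin and aspirin were compared. Penicillin performed well.",
    "article2": "Aspirin is widely used.",
}
found = extract_terms(corpus, ["penicillin", "aspirin", "tetracycline"])
for term, locs in sorted(found.items()):
    print(f"{len(locs)}>{term}", locs)
```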
NLP Summarization
 Retain phrases surrounding the
extracted term(s) with links to
locations in the corpus (KWIC Index)
rare uses of penicillin
Often penicillin is contraindicated when
responds well to penicillin
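A KWIC (keyword in context) index keeps a few words on either side of each hit, together with its location. A minimal sketch, assuming a window of three words; the sample text extends the fragments above with invented wording:

```python
import re

def kwic(text, keyword, width=3):
    """Return (word position, context line) for each occurrence of keyword."""
    words = re.findall(r"\w+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((i, f"{left} [{word}] {right}".strip()))
    return lines

text = ("Often penicillin is contraindicated when allergies exist. "
        "The infection responds well to penicillin.")
index = kwic(text, "penicillin")
for position, line in index:
    print(position, line)
```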
NLP Summarization
 Tag all words in the corpus with their
grammatical function and search for
noun – verb – noun and other
syntactic patterns
(drug A) treats (disease B)
(drug C) causes (disease B)
(drug D) is contraindicated in (disease B)
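Searching for noun – verb – noun patterns can be approximated without a grammatical tagger. In this sketch a fixed list of relation phrases stands in for real part-of-speech tagging and single words stand in for noun phrases, so it is a rough simplification of the technique rather than an NLP system; the placeholder names mirror the pattern shown above:

```python
import re

# Simplification: a hand-written list of relation phrases replaces real tagging.
RELATIONS = ["is contraindicated in", "treats", "causes"]
PATTERN = re.compile(
    r"\b(\w+) (" + "|".join(re.escape(r) for r in RELATIONS) + r") (\w+)\b"
)

def find_relations(text):
    """Return (subject, relation, object) triples matching the pattern."""
    return [match.groups() for match in PATTERN.finditer(text)]

sample = ("DrugA treats DiseaseB. DrugC causes DiseaseB. "
          "DrugD is contraindicated in DiseaseB.")
triples = find_relations(sample)
for triple in triples:
    print(triple)
```

Once relations are stored as triples, a question such as "what is contraindicated in DiseaseB?" becomes a simple filter over them.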
NLP Term Relationship
 Queries answered by tracking
references across sentences
Can penicillin cause shock?
“Penicillin treatment is not without risks. In
certain cases it can trigger anaphylactic
shock.”
NLP can do even more …
 Word disambiguation
bank (river)
bank (finances)
bank (verb)
 Retrieval of alternative word forms
 Retrieval of variants in capitalization
and spelling
 Topic detection and tracking
Following different themes in a changing RSS
feed
 Machine translation
NLP and Real Life
 Early recognition of emerging market
trends and/or competitors
 Monitoring content from bio-medical
and other journal literature that grows
faster than the ability of researchers
to read it
 Improve relevancy in searches of
content from libraries, publishers and
the Web
Trends and Future
Possibilities
Search Today …
 “Mass Media” as “My Media”
Podcasting iTunes
Blogs
RSS
“Search is no longer about a text-based web
index. It’s about a person’s interface to the
world” -- SEO executive
 Enhancing search through context and
user personal profiles
– My Yahoo!
– Google Personalized Search
Search Today …
 Federated search (single-point
access, enterprise applications)
The Desktop “without walls”
Unstructured and structured data
Internal, personal sources and WWW
 XML helps make this possible
– “Middleware” layer with modules that
acquire, manage, retrieve and rank text, data
and multimedia from a variety of sources and
formats
Search tomorrow ???
 Search will become more
Sophisticated
Individualized
Portable
Specialized (vertical, subject-specific services)
 Voice recognition, GPS and mobile,
local search will grow
“Where can I find the best bargains on this in the
area?”
“Where is the nearest pizza parlor and how do I
get there from here?”
Thank You and
Happy Holidays!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
hunter@hws.edu