Date: Tuesday, 09 Mar 2010 19:22
Krisztian posted a link to the TREC 2009 Entity Track Overview, part of the TREC 2009 proceedings.

The track website has information on the 2009 track and what is planned for 2010. One change they are seeking discussion about is a new semantic entity search subtask:
We propose a semantic entity search subtask for 2010: return URIs of related entities, instead of their homepages. We are planning to enrich topics with URIs of the input entities. URIs need to come from a predefined set of semantic data sources (which will include DBPedia and Freebase, at least).
The plan is to use the full category A set of ClueWeb09, which has 500 M English web pages, instead of the smaller B subset, which doesn't contain many entity homepages.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Tuesday, 09 Mar 2010 15:42
An updated draft of the upcoming book, Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer is available.

The book isn't finished, but it still has interesting material. It emphasizes algorithms for processing text with MapReduce: co-occurrence analysis, inverted index construction, and the EM algorithm applied to estimating parameters in HMMs.
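To make the inverted index case concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not taken from the book; the function names and the toy documents are hypothetical, and a real job would run the mapper and reducer on a cluster rather than in one loop.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Mapper: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = defaultdict(int)
    for token in text.lower().split():
        counts[token] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)

def reduce_postings(term, postings):
    """Reducer: collect all postings for a term, sorted by document id."""
    return term, sorted(postings)

def build_inverted_index(docs):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_postings(t, p) for t, p in grouped.items())

# Toy collection (made up for illustration).
docs = {1: "hot dog stand", 2: "hot water", 3: "dog park dog"}
index = build_inverted_index(docs)
```

Each entry of `index` maps a term to its postings list of (document id, term frequency) pairs.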

You can also see Jimmy's cloud computing course (spring 2010) and the Ivory search engine.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Thursday, 04 Mar 2010 15:40
Peter Mika highlights the Semantic Search competition at the upcoming Semantic Search 2010 workshop at WWW 2010. From Peter's post,
Participants will be given queries sampled from a web search query log provided by the Yahoo Webscope program, and have to try to answer those queries using the Billion Triples Challenge corpus from 2009. The queries that are selected are all entity queries in that they are looking to find information about a single entity.
This is an interesting competition because it attempts to use unstructured web queries to do retrieval over a heterogeneous collection of structured data. The Billion Triples collection contains data from DBpedia (extracted from Wikipedia), Geonames, a variety of social networks, and other sources.

There's a group of us here working on an entry; we'll see how it goes.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Wednesday, 24 Feb 2010 20:26
Yahoo! has announced a Learning to Rank challenge as part of the Learning to Rank Workshop at ICML 2010.

They are releasing (to participants) two large real-world datasets. The first dataset has:
29,921 queries
744,692 URLs
519 features

For details on the second set, see the website.

The URLs are rated on a graded scale, 0 (irrelevant) to 5 (perfect). The evaluation will use Normalized Discounted Cumulative Gain (NDCG) and Expected Reciprocal Rank (ERR).
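For reference, here is a small sketch of both metrics using their common textbook definitions, with gain 2^g - 1 for a grade g. The exact variants used by the challenge (cutoffs, gain function) are an assumption on my part, and the top grade is passed in as `max_grade` rather than hard-coded.

```python
import math

def dcg(grades, k=None):
    """Discounted cumulative gain over a ranked list of relevance grades."""
    grades = grades[:k] if k else grades
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

def err(grades, max_grade):
    """Expected reciprocal rank under a cascade model of user behavior."""
    p_continue = 1.0  # probability the user reaches the current rank
    score = 0.0
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / (2 ** max_grade)  # chance this result satisfies the user
        score += p_continue * r / rank
        p_continue *= 1 - r
    return score
```

A call such as `ndcg([3, 2, 0, 1], k=10)` scores one query's ranked list; averaging over queries gives the collection-level number.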

The set only includes query and URL identifiers without the original information, so engineering new features seems unlikely.

The competition begins March 1st and goes through May 31st.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Wednesday, 24 Feb 2010 14:57
I haven't been writing much recently. I was a bit burnt out after paper season. I submitted a short paper on synonym recognition to ACL 2010. I hope to share more on that in the future. On the topic of synonyms, the recent Wired article on How Google's Algorithm Rules the Web briefly mentions their synonym recognition algorithm.

Towards the middle of the article, Amit Singhal talks about synonyms. The first part talks about the straightforward mappings identified from query reformulations. I think the more interesting case is when you don't have millions of those to learn from. You can use the information in the web documents. Here's the relevant section,
Google’s synonym system understood that a dog was similar to a puppy and that boiling water was hot. But it also concluded that a hot dog was the same as a boiling puppy. The problem was fixed in late 2002 by a breakthrough based on philosopher Ludwig Wittgenstein’s theories about how words are defined by context... “Today, if you type ‘Gandhi bio,’ we know that bio means biography,” Singhal says. “And if you type ‘bio warfare,’ it means biological.”
This type of query-sensitive synonym usage is quite important for web retrieval.

See also my recent post on Google's synonym effectiveness and their recent patent on using query context for determining synonyms.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Thursday, 04 Feb 2010 14:28
Today is the beginning of the Web Search and Data Mining conference, WSDM 2010. You can follow the conference on Twitter, #wsdm2010 as Daniel Tunkelang, Ian Soboroff, and others are tweeting it. The proceedings are also available. I hope to highlight some of the papers here in future posts.

Two of my UMass CIIR friends and colleagues have papers that they are presenting:
Good luck to them, and check out their papers.

Since I don't have a paper, I'm not attending. As a side note, I think that academic conferences should drop the registration fee for students. I would have been able to afford the travel given the proximity, but the registration fee was prohibitively expensive for me.

For those of you attending, please take notes and share them!



Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Thursday, 04 Feb 2010 14:17
Yesterday was the Search in Social Media (SSM) 2010 workshop. Daniel posted his summary. You can also read the tweet stream.

Of particular note is the coverage of Jan Pedersen's keynote where he highlighted some opportunities and challenges with SSM.
The benefits of social media search include trust and personal interaction (as compared to web content that is often soulless and of uncertain provenance), low latency (though perhaps at the cost of accuracy), and access to niche or ephemeral information that web search rarely surfaces.
I look forward to reading more coverage and getting other perspectives on the event.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Wednesday, 03 Feb 2010 15:05
Today is the Search in Social Media Workshop, SSM 2010. Unfortunately, I'm not attending, but you should follow the coverage on Twitter, #SSM2010. You can read the papers from the program. However, the really interesting parts will be the keynote from Jan Pedersen and the panel discussions:
  • The Big Players and Integrating Social Media
  • Social Media Companies - How does search fit?
  • What are the most important research problems for SSM?
I look forward to reading the coverage!
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Saturday, 30 Jan 2010 21:47
I'm starting a new project. We're building a prototype search system, so we want it to be quick and simple. This started me thinking about options...

In the past, I built RecipeComun using PHP and Java running on top of Quercus inside a Java application server. However, that's a relatively heavy-weight solution.

Now, I would likely build it or something similar with a Ruby on Rails frontend UI. For an initial prototype I would index with Solr and integrate the two with Blacklight, Solr Flare, or just straight RSolr. For a facet engine on top of Lucene, I think Bobo-Browse from LinkedIn is an interesting alternative to Solr.

Anyone else have different ideas?

Update: I tried installing Ruby/Blacklight on my Windows laptop. I don't recommend this. It needs Linux (for compiling some of the native dependencies used by the Ruby libraries). You can work around it with Cygwin with enough effort, I think, but I will not be trying.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Friday, 22 Jan 2010 15:40
SearchEngineLand has an interview with Stefan Weitz, a Director at Bing Search. The interview highlights some of Bing's future directions and challenges in search. One emphasis is on better supporting users' tasks. Stefan says,
Relevancy is relative. It is about the intent of the user, first of all. What is the user trying to do? Then, secondly, what do you know about the user or the query that could help to better refine the results?
I disagree with the above statement. By definition, a document is relevant if it satisfies a user's "information need". I think he's really trying to say that too often we make the mistake of removing the user from the equation and creating a universal relevance judgment that holds across all users who issue a query.

The interview goes on to talk about how Microsoft is investing in technologies to support complex decision making in vertical categories like travel, health, and shopping. He highlights Farecast. However, it's also clear from the current Bing results for [plan a trip to Florida] that there is still a long way to go.

The article goes on to detail Microsoft's "vertical" strategy as a means of differentiation:
We will continue to introduce these verticals, in pretty short order, frankly. The sum of those parts will become a very differentiated experience that will expand how people think about search...
If you missed it, yesterday Microsoft started rolling out Bing Recipe Search.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Thursday, 21 Jan 2010 14:41
Google has used synonyms for query expansion for several years now. It is part of their attempt to find what you mean, not just what you type. Steven Baker, an engineer on the quality team, wrote a post covering a recent examination of synonym usage in query expansion. He writes,
...our measurements show that synonyms affect 70 percent of user searches across the more than 100 languages Google supports. We took a set of these queries and analyzed how precise the synonyms were, and were happy with the results: For every 50 queries where synonyms significantly improved the search results, we had only one truly bad synonym.
Another tidbit is that Google is expanding their highlighting of synonyms in search result summaries.

Lastly, a tip if you get stuck with one of the 1 in 50 queries where synonyms go bad:
You can also turn off a synonym for a specific term by adding a "+" before it or by putting the words in quotation marks.
Bill Slawski has good coverage of the post, and previous work on synonym usage, including Steven's patent, Determining query term synonyms within query context.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Tuesday, 19 Jan 2010 15:15
I'm a bit behind here because of the upcoming SIGIR deadline. However, I wanted to make sure to mention an article in the January CACM, New Search Challenges and Opportunities. The article highlights three main directions:
1) Web-scale information extraction
2) Real-time search: blogs and status updates
3) Task-based search: Time and location

Web-scale information extraction
In the first section, they highlight Oren Etzioni's work on TextRunner. It's an interesting project, but it was published in 2008. If you're interested in more recent and in-depth work I suggest reading the students' theses: Michael Cafarella, Extracting and Managing Structured Web Data and Michele Banko, Open Information Extraction for the Web. From Michael's introduction,
TextRunner is an extractor for processing natural language Web text. WebTables extracts and provides applications on top of relations in HTML tables. Finally, Octopus provides integration services over extracted Structured Web data. Together, these three systems demonstrate that managing structured data on the Web is possible today, and also suggest directions for future systems.
The work on integration with Octopus was recently published, Data Integration for the Relational Web.

Blogs and real-time search
I thought that this section didn't add useful discussion over what was previously discussed at length in other forums. In particular, I think there was little useful discussion on Twitter and status updates. One thing of note was Susan Dumais' comment on the challenge of opinion analysis in blogs:
But rating postings as positive or negative, or figuring out whether they're aimed at an older or younger audience or have a left-leaning, right-leaning, or middle-of-the-road viewpoint, is challenging, she says.
A key challenge here is that simple term-based algorithms do not capture meaning in complex discourse.

Task based search: Utilizing time and location
Jon Kleinberg highlights the need to integrate more tightly with user applications,
The real issue with a search engine is not just to serve up results, but to help people accomplish what they're trying to do...
They discuss it mainly in the context of mobile search and utilizing a user's location to help better identify search intent, an obvious evolution.

A few useful reminders of trends over the past few years, but nothing particularly new.

Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Thursday, 14 Jan 2010 23:24
Yesterday, Technology Review posted an interview with Amit Singhal, on How Google Ranks Tweets. According to Amit, one key is to find "reputed followers",
"You earn reputation, and then you give reputation. If lots of people follow you, and then you follow someone--then even though this [new person] does not have lots of followers," his tweet is deemed valuable because his followers are themselves followed widely...
It seems like a pretty straightforward translation of PageRank with "following" as a form of link endorsement. In this vein, see also Daniel's TunkRank.
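As a rough illustration of that reading, here is a PageRank-style power iteration over a follower graph. This is my own sketch, not Google's formula and not TunkRank (which uses a different model); the function name, damping factor, and toy graph are all assumptions.

```python
def follower_rank(follows, damping=0.85, iterations=50):
    """PageRank-style scores; follows maps each user to the set of users they follow."""
    users = set(follows)
    for followed in follows.values():
        users.update(followed)
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            followed = follows.get(u, set())
            if followed:
                # A user splits their reputation among the people they follow.
                share = damping * rank[u] / len(followed)
                for v in followed:
                    new_rank[v] += share
            else:
                # Users who follow nobody spread their reputation evenly.
                share = damping * rank[u] / n
                for v in users:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Toy graph: "newbie" has a single follower, but that follower is widely followed.
follows = {"a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "hub": {"newbie"}, "newbie": set()}
rank = follower_rank(follows)
```

In this toy graph the new account outranks the accounts nobody follows, because its one follower carries reputation earned from others, which matches the behavior Amit describes.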

The interview goes on to mention the use of geolocation in tweets as a likely next step. Amit also rightly points out that blogs and news organizations are important components of "real-time" search; it's not just tweets.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Monday, 28 Dec 2009 18:44
Jeff Dean and Sanjay Ghemawat wrote an article for the January edition of CACM, MapReduce: A Flexible Data Processing Tool. In the article, they refute the findings of A Comparison of Approaches to Large-Scale Data Analysis. On their blog, the authors of that paper also wrote a post bashing MapReduce: MapReduce, A major step backwards. The post is no longer available, but thankfully Greg had good coverage.

In the article Dean and Ghemawat address the paper and attempt to debunk its claims, although they lack the benchmarks to back this up. In the process, they inform you about the right way to run M/R jobs efficiently:
  1. Avoid starting processes for each new job, reuse workers.
  2. Shuffle data carefully; avoid O(M*R) disk seeks.
  3. Beware of text storage formats.
  4. Use natural indices like timestamps on files.
  5. Do not merge reducer output.
They present some good M/R lessons in their refutation. You should be using a binary serialization system like Avro or Protocol Buffers and storing your data in a format that provides efficient access, using a natural file structure or using a database system like HBase.
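To illustrate the point about text storage formats, here is a small sketch comparing a tab-separated text record with a fixed-width binary one. It uses Python's `struct` module purely as a stand-in; it is not Avro or Protocol Buffers, and the record layout (timestamp, document id, score) is hypothetical.

```python
import struct

# Hypothetical record layout: int64 timestamp, uint32 doc id, float32 score.
RECORD = struct.Struct("<qIf")

def encode_text(ts, doc_id, score):
    """Text format: every field must be formatted on write and parsed on read."""
    return f"{ts}\t{doc_id}\t{score}\n".encode("utf-8")

def decode_text(line):
    ts, doc_id, score = line.decode("utf-8").rstrip("\n").split("\t")
    return int(ts), int(doc_id), float(score)

def encode_binary(ts, doc_id, score):
    """Binary format: fixed 16-byte records, no string parsing on read."""
    return RECORD.pack(ts, doc_id, score)

def decode_binary(buf):
    return RECORD.unpack(buf)
```

Because every binary record has the same size, a reader can also seek directly to record i at byte offset `i * RECORD.size`, which a variable-length text line does not allow.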

Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Monday, 28 Dec 2009 16:59
The NY Times ran an op-ed article, Search, but you may not find. I can't believe they ran such rubbish. I'm not going to bother to debunk it; Paul Kedrosky did a better job than I could.

The problem is that commercial search engines are inherently conflicted: they have products to sell and advertisers to please. The question is: Should search be a public service, like a library?

The French are taking on Google Books with Polinum, the "Operating Platform for Digital Books." Jimmy Wales's efforts with Wikia Search failed because they didn't execute and weren't profitable. Daniel, a longtime advocate of transparency in search, now works for Google.

There will always be disgruntled quacks, but in the long run, is it healthy for a company, or even a small group of companies, to have such a large share of search?

Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Sunday, 27 Dec 2009 10:09
The NY Times had a recent article on search for kids. It covered a Google-sponsored user study of 83 kids, run by Allison Druin at the HCI lab at UMd, to understand how they search. My wife is an elementary school teacher, so this is a topic we've often discussed and one I find particularly interesting.

In recent related work, Druin published How Children Search the Internet with Keyword Interfaces, a study conducted with 12 kids. Read section 6 for their suggestions on user interfaces. Here are several of their possibilities: (1) using voice search instead of typing, (2) simplified results pages, and (3) results that are at an appropriate reading level. The NY Times article appears to describe a larger follow-up study.

The NY Times interviewed Irene Au, Google’s Director of User Experience, about ways the research could be incorporated into a product. They note that the keyword mismatch problem is much more challenging for kids, who have less of the conceptual framework of a subject necessary to be effective. From the article, “The problems that kids have with search are probably the problems adults experience, just magnified... If we can solve that for children we can solve that for adults.” However, I'm not convinced that this is a correct conclusion. Druin says that the bottom of the screen offers an important area to suggest related searches.

In the article, representatives from Bing and Ask.com also weigh in; a representative from Y! is notably absent given Y!'s presence in this market. Stefan Weitz, from Bing, suggests that visual interfaces offer an opportunity because kids haven't developed typing skills. Scott Kim, from Ask.com, says that kids are more likely than adults to ask questions. Perhaps if we catch them early enough, we can study them before they are brainwashed into keywordese.

Given their lack of typing skill, the article briefly mentions that voice search, like that used for mobile search, offers an interface opportunity for kids.

At the end one of the kids interviewed suggests, “I think there should be a program where Google asks kids questions about what they’re searching for,” he said, “like a Google robot.”

I look forward to reading the paper on the study. Hopefully it will contain the concrete solutions to improve the search experience for kids that they foreshadow in their earlier work.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Sunday, 13 Dec 2009 21:12
I'm writing a Hadoop job and I ran into a little problem that I wanted to share (and remind myself of the solution for the future).

I am packaging up my Hadoop program into a jar file. It has external dependencies on text parsers. One way to include these with my program is to package the dependencies inside the jar in a /lib directory. This ensures the jar and all dependencies get copied to the Hadoop Mappers.

I create my jar file by right-clicking on the project --> export --> Java --> Jar file. I then select my code and the lib directory. However, the problem I had was that my lib directory was not being exported. I learned that this happens if the jars in lib are on your build path. To solve this, the jars need to be "external" or in a different folder. Then you can export the lib directory as a resource.
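The same layout can also be produced outside the IDE. Since a jar is a zip archive, here is a minimal sketch that writes compiled classes at the jar root and dependency jars under lib/. The function name and the directory arguments are hypothetical, and it writes no manifest, so treat it as an illustration of the layout rather than a replacement for a proper build tool.

```python
import zipfile
from pathlib import Path

def build_job_jar(classes_dir, lib_dir, jar_path):
    """Write compiled classes at the jar root and dependency jars under lib/."""
    classes_dir, lib_dir = Path(classes_dir), Path(lib_dir)
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        # Compiled .class files keep their package-relative paths.
        for path in sorted(classes_dir.rglob("*")):
            if path.is_file():
                jar.write(path, path.relative_to(classes_dir).as_posix())
        # Dependency jars are nested, unexpanded, under lib/.
        for dep in sorted(lib_dir.glob("*.jar")):
            jar.write(dep, f"lib/{dep.name}")
    return jar_path

# Example call (paths are placeholders):
# build_job_jar("bin", "deps", "myjob.jar")
```

Keep the output jar outside `classes_dir`, otherwise the archive would try to include itself.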

Anyone care to share a better solution?
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Friday, 11 Dec 2009 22:03
Today, Google announced an extension, Quick Scroll, for Chrome 4 Beta that utilizes your previous search while browsing. Consider their example query [belgian waffles served by street vendors?] and browsing a result:
... a small black box appears in the lower right hand corner of the browser with a couple snippets of text from the page that might be relevant to your query.... Quick Scroll analyzes things like proximity, prominence and position of the words to identify the most relevant content.
... who will be the first to reverse engineer the formula from the extension?
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Wednesday, 09 Dec 2009 21:53
The ICWSM is a conference on blogs and social media. For the conference, they issued a data challenge.
The dataset, provided by Spinn3r.com, is a set of 44 million blog posts made between August 1st and October 1st, 2008. The post includes the text as syndicated, as well as metadata such as the blog's homepage, timestamps, etc. The data is formatted in XML and is further arranged into tiers approximating to some degree search engine ranking. The total size of the dataset is 142 GB uncompressed, (27 GB compressed).
The deadline is March 1st.

Something to look at after the SIGIR deadline....
Author: "jeffdalton104@hotmail.com (jeff.dalton)"
Date: Monday, 07 Dec 2009 18:31
Google is hosting a press event at the Computer History Museum offering an "inside look at the evolution of Google search".

Danny Sullivan is live-blogging the event at SELand.
Author: "jeffdalton104@hotmail.com (jeff.dalton)"