Date: Thursday, 19 Nov 2009 12:10
The annual TREC meeting is this week in Maryland. The proceedings won't be available until February, but you can get hints about what is happening (though no eval results) by following along on Twitter, #trec09. Some highlights:
Keep up the news, Ian and Iadh!
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Thursday, 19 Nov 2009 08:00
Yesterday, I mentioned that Mahout has an implementation of LDA, a form of clustering.

Today, there is a post on the LingPipe blog covering a recent paper, Reading Tea Leaves: How Humans Interpret Topic Models. Read the post for an overview of what the authors found when they used Mechanical Turk to evaluate the coherence of topic-document and topic-word clusters.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Thursday, 19 Nov 2009 07:52
Yesterday Microsoft Live Labs launched Pivot. Pivot is a desktop application for faceted navigation and visualization to explore collections of information. Watch the YouTube demo.

I don't have an invitation to the tech preview, so you'll have to watch the demo for more details.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 18 Nov 2009 10:59
Grant announced on the Lucid Imagination blog that Mahout 0.2 is released. Mahout is a library of scalable (distributed) machine learning algorithms using MapReduce.

Mahout 0.2 has several key new features that are worth taking a look at:
The release also has many other bug fixes and improvements. Keep up the good work, guys!
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Monday, 16 Nov 2009 09:38
Hugo Zaragoza let me know that Yahoo! Research has a new demo out, Quest. Quest is a faceted navigation interface on Q&A data. It lets you browse using key phrases, nouns, and verbs extracted from a dependency parse of the questions.

For their description, you can read the announcement on the Y! Sandbox. The demo uses a set of 8 million Q&A documents from Yahoo! Answers collected in 2007. Here's their description of some of the challenges they faced:
The first one is to select the right "lexical units" of the collection in order to produce meaningful browsing suggestions. The next challenge is to develop interesting list suggestions, on the fly, for whatever query the user may submit. Lastly, we had to invent an interface that would allow users to interact with the suggestions and the results, and enable a natural browsing experience.
They used the DeSR dependency parser to extract terms and phrases, and then used a forward index with Archive4J to count and sort the terms in the questions that are returned by a query.
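To make the count-and-sort step concrete, here is a minimal sketch in Python. It is not Quest's implementation and does not use Archive4J; the `FORWARD_INDEX` data and the `facet_counts` function are hypothetical names chosen for illustration. It assumes a forward index is simply a map from a document id to the terms extracted from that document.

```python
from collections import Counter

# Hypothetical forward index: document id -> terms extracted from that question.
FORWARD_INDEX = {
    1: ["pasta", "salad", "recipe"],
    2: ["pasta", "sauce", "tomato"],
    3: ["salad", "dressing"],
}

def facet_counts(result_ids, forward_index, top_k=5):
    """Count terms across the documents returned by a query, most frequent first."""
    counts = Counter()
    for doc_id in result_ids:
        # Count each term once per document so long questions don't dominate.
        counts.update(set(forward_index.get(doc_id, [])))
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Suppose a query returned documents 1 and 2.
suggestions = facet_counts([1, 2], FORWARD_INDEX)
```

The point of the forward index is that the suggestion list can be computed on the fly from whatever result set a query produces, without going back to the text.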

I tried it for pasta and then filtered to "pasta salad". I was hoping that some of the nouns would include common ingredients: bacon, chicken, olives, onion, pepperoni, mozzarella cheese, etc. However, most of the nouns and verbs are more general and somewhat redundant given my selected filters. I think the algorithm to select the terms could still be improved.

Faceted search interfaces are important browsing tools, and automatically extracting and selecting facets is a challenging problem. It's good to see first steps applying NLP to the task. I look forward to seeing how Quest evolves.

Be sure to check out the Correlator demo if you haven't seen it.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Thursday, 12 Nov 2009 13:56
Today at the Yahoo!-sponsored machine learning lunch, Lee Spector presented his work on genetic programming. His talk, Expressive Languages For Evolved Programs, highlighted his work using the Push programming language to solve interesting and hard real-world problems.

He pointed to two key principles that these systems need to have to learn solutions, based on observations from biology:
  • Meaningful variation - Variations can't just be random; the mutations and selections have to produce meaningful effects in the domain.

  • Heritability - Children need the ability to inherit desirable features from the parent without being clones.
During the talk, it struck me that an obvious application would be to use GP to learn IR ranking functions. Ronan Cummins recently did some work in this area: his SIGIR 2009 paper, Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval, applied GP to learning proximity functions.

I think there's still interesting work combining GP with IR. For example, one problem is that collections and users evolve over time, but most ranking functions are static.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Monday, 02 Nov 2009 20:38
Evan Sandhaus announced in a NYTimes Open blog post that they are opening up the NYT subject headings. Today they are announcing the release of the first batch of 5,000 headings.
Over the last several months we have manually mapped more than 5,000 person name subject headings onto Freebase and DBPedia. And today we are pleased to announce the launch of http://data.nytimes.com and the release of these 5,000 person name subject headings as Linked Open Data.
Over the next few months they plan to expand this to over 30,000 tags.

Also, check out the NYT Article Search API.

Browse the headings and get hacking!

Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Monday, 02 Nov 2009 10:32
Today there were workshops and tutorials at CIKM.

Workshops
Social Web Search and Mining (SWSM2009)
Web Information and Data Management (WIDM)
Cloud Data Management (CloudDB 2009)

There were also four tutorials.

I'm particularly disappointed to miss Marius Pasca's tutorial on the acquisition of Open-Domain Concepts and Conceptual Hierarchies.

There's little coverage of the conference so far, but I'll try to link to what I find.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 28 Oct 2009 17:04
Jon points out a post on the Y! Developer Network Blog detailing their use of the M45 cluster for Information Extraction.

The post is by Andy Carlson and Justin Betteridge, PhD students working on the Read the Web project. The goal is to generate a knowledge base from web documents.

They ran MapReduce jobs over a large web crawl to find:
  1. Given a list of patterns, what noun phrases fill in the blanks of those patterns?
  2. Given a list of noun phrases, what patterns do those noun phrases occur with?
  3. Given a list of patterns and noun phrases, how many times does each pattern co-occur with each noun phrase (or pair of noun phrases)?
They are currently scaling their techniques up to ClueWeb09 and using features from a dependency parse obtained from the Malt parser.
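The third question, counting how often each pattern co-occurs with each noun phrase, maps naturally onto a map step and a reduce step. Below is a minimal single-machine sketch in Python, not the authors' code. The patterns, the sentences, and the function names are all hypothetical, and as a simplifying assumption a single word stands in for a noun phrase (a real system would use a chunker or parser).

```python
import re
from collections import Counter

# Hypothetical extraction patterns; "_" marks the slot a noun phrase fills.
PATTERNS = ["cities such as _", "_ is a city"]

def compile_pattern(pattern):
    # Escape the literal text and let the slot match one word
    # (an assumed stand-in for a full noun phrase).
    return re.compile(r"(\w+)".join(re.escape(part) for part in pattern.split("_")))

def map_sentence(sentence, compiled):
    """Map step: emit ((pattern, noun_phrase), 1) for every match in a sentence."""
    for pattern, regex in compiled.items():
        for match in regex.finditer(sentence):
            yield (pattern, match.group(1)), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (pattern, noun_phrase) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

sentences = [
    "We visited cities such as Boston last year.",
    "Boston is a city in Massachusetts.",
    "They toured cities such as Boston and Chicago.",
]
compiled = {p: compile_pattern(p) for p in PATTERNS}
counts = reduce_counts(pair for s in sentences for pair in map_sentence(s, compiled))
```

On a cluster the map step would run over shards of the crawl and the framework would group keys before the reduce, but the logic per record is the same.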

See their upcoming paper at WSDM 2010, Coupled Semi-Supervised Learning for Information Extraction.

You can also see the Read the Web course wiki page.

My group here at the CIIR uses M45 for large-scale extraction and organization work on the Million Book Project data. More on that work as it develops.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Friday, 23 Oct 2009 13:03
I'm not attending either event, but I'm trying to follow what's going on.

The 2009 conference on recommendation systems in NY is happening this weekend. Follow the conference on Twitter, #recsys09. I'm particularly looking for coverage of the Netflix Challenge panel: What did we learn from the Netflix Prize? Perspectives from some of the leading contestants.

The HCIR Workshop is also taking place in DC. Daniel is one of the chairs. You can also see other coverage on #hcir09. The proceedings for the workshop are available. Henry is attending and taking part in a panel, so hopefully I'll be able to share some of his highlights.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Tuesday, 20 Oct 2009 20:14
The IR field is largely driven by empirical experiments to validate theory. Today, one of the biggest perceived problems is that academia does not have access to the query and click log data collected by large web search engines. While this data is critical for improving a search service and useful for other interesting experiments, ultimately I believe it would lead to researchers being distracted by the wrong problems.

Data is limiting. Once you have it, you immediately start analyzing it and developing methods to improve relevance. For example, identifying navigational queries using click entropy. You can also apply supervised machine learning to rank documents and weight features. These are important and practical things to do if you run a service, but they aren't the fundamental problems that require research.
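As a concrete illustration of the click entropy idea, here is a small sketch in Python. It assumes the log for a query is just a list of clicked URLs; the function names, the example data, and the 1.0-bit threshold are hypothetical choices for illustration, not values from any published system. Low entropy means nearly everyone clicks the same result, which suggests a navigational query.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the click distribution for one query."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    # Sum p * log2(1/p) over the distinct clicked URLs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def looks_navigational(clicked_urls, threshold=1.0):
    # The threshold is an assumed cutoff; in practice it would be tuned on labeled queries.
    return click_entropy(clicked_urls) < threshold

# Hypothetical click logs for two queries.
focused_clicks = ["example.com"] * 9 + ["example.org"]
spread_clicks = ["a.com", "b.com", "c.com", "d.com"]
```

Here `focused_clicks` is dominated by one destination, while `spread_clicks` is spread evenly over four results and has an entropy of 2 bits.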

The IR community has its own data: TREC. TREC deserves credit for driving significant improvements in IR technology in the early and mid 90s. However, it too can be limiting. For many in academia, success and failure are measured by TREC retrieval performance. Too often, a researcher struggles with superhuman effort to get incremental improvements on well-studied corpora that won't make a significant long-term contribution to the field. What's missing are the big leaps: disruptive innovation.

Academia should be building solutions for tomorrow's data, not yesterday's.

What will the queries and documents look like in 5 or even 10 years and how can we improve retrieval for those? It's not an easy question to answer, but you can watch Bruce Croft's CIKM keynote for some ideas. Without going into too much detail, also consider trends like cross-language retrieval, structured data, and search from mobile phones.

One proven pattern is that breakthroughs often come from synthesizing a model from a radically different domain. One recent intriguing direction is Keith van Rijsbergen's work on The Geometry of Information Retrieval, which applies models from quantum mechanics to describe document retrieval. Similarly, is there potential for models of information derived from molecular genetics and other fields? If you're a molecular geneticist and are interested in collaborating, e-mail me!

I still believe in empirical research. However, I'm also well aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

That said, if you want me to study your query logs... I'd be happy to do it. After all, I need those publications to graduate.

Am I wrong? I'm interested to hear your thoughts; tell me in the comments.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Thursday, 15 Oct 2009 08:28
The second edition of the classic, Elements of Statistical Learning, is available. The book covers topics such as:
  • Supervised and Unsupervised Learning
  • Regression
  • Linear Classification (including LDA)
  • Kernel Methods
  • Evaluation and Assessment (including the right and wrong way to do cross-validation)
  • Bayesian inference
  • Decision Trees and boosting methods
  • Neural Nets
  • SVMs
  • K-Means clustering and nearest neighbor classification
  • Random Forests
  • Ensemble learning (shown to be very effective in the Netflix competition)
  • Undirected Graphical Models (including RBMs)
The PDF is available for download, so you can read/search it before you buy it.

While you're looking at books, another book to check out is Probabilistic Graphical Models.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Thursday, 08 Oct 2009 15:05
Last week BusinessWeek had a series of articles interviewing Google Search Quality leaders. In the interview with CEO Eric Schmidt, there was a question about innovation:
The days when you can come in with some new idea and change everything are gone. It's a much more sophisticated set of problems than can be done with a small team coming up with a new development.
Instead, he says that disruptive ideas will focus on a smaller part of the system, e.g. an important new ranking feature that will be assimilated into the massive behemoth of a system. One example that comes to mind is Sep Kamvar's work on personalized PageRank at Kaltix, which Google acquired in 2003 and has now integrated.

As Eric also mentions in the interview, a key obstacle to web search innovators is scale. In economic terms, the market has a high barrier to entry.

Despite the barriers, I think he is wrong. There are still disruptive innovations left in search.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 07 Oct 2009 08:15
Daniel points out that the HCIR 2009 workshop proceedings are available. Here are a few highlights:
  • Modeling Searcher Frustration by Henry Feild
    Henry is a labmate who recently ran a user study analyzing the affective mental state of users during search tasks in order to detect 'frustration'. The goal is then to predict when a user is frustrated based on observable query log data. He has some interesting results:

    1) Users who get frustrated tend to stay frustrated
    2) Frustration tends to increase with the number of queries submitted
    3) Certain users are more predisposed to being frustrated than others
    4) Frustration levels depend on the type of task

  • Using Twitter to Assess Information Needs: Early Results by Max Wilson
    They analyze 189,000 tweets, collected by retrieving 100 results for 10 search queries hourly over a two-week period. Their goal was to understand the kinds of things people are looking for.
  • I Come Not to Bury Cranfield, but to Praise It by Ellen Voorhees
    She argues that the very simplified (impoverished) role of the user in Cranfield is necessary in order to run highly controlled experiments. A key challenge is the cost of judging results. She says,
    Modifications as small as moving from MAP to a more user-focused measure like precision at ten documents retrieved require larger topic sets for a similar level of confidence. More radical departures will require even larger topic sets.
  • Freebase Cubed: Text-based Collection Queries for Large, Richly Interconnected Data Sets by David Huynh, creator of Parallax.
    David explores some of the challenges of presenting faceted interfaces across large, heterogeneous domain models. He writes,
    Any large data set such as Freebase that contains a large number of types and properties accumulated over actual use rather than fixed at design time poses challenges to designing easy-to-use faceted browsers. This is because the faceted browser cannot be tuned with domain knowledge at design time, but must operate in a generic manner, and thus become unwieldy.
  • Usefulness as the Criterion for Evaluation of Interactive Information Retrieval by Michael Cole, et al. from Belkin's group at Rutgers.

    The paper argues that pure relevance-based measures fail to capture whether or not a system helped a user accomplish their task. They propose a method to measure 'usefulness'.
    ... usefulness judgment can be explicitly related to the perceived contribution of the judged object or process to progress towards satisfying the leading goal or a goal on the way. In contrast to relevance, a judgment of usefulness can be made of a result or a process, rather than only to the content of an information object. It also applies to all scales of an interaction.
  • Towards Timed Predictions of Human Performance for Interactive Information Retrieval Evaluation by Mark Smucker

    He advocates an extension of the Cranfield paradigm that measures the user's ability to find relevant documents within a timed environment. The overall goal is to develop a model of user behavior in order to inform decisions about which UI and search features provide the most opportunity for improvement. They use GOMS to estimate the time for users to complete a task given an interface. He writes,
    The acronym GOMS stands for Goals, Operators, Methods, and Selections. In simple terms, GOMS is about finding the sequence of operations on a user interface that allows the user to achieve the user’s goal in the shortest amount of time.
That's all for now, although there is a lot more interesting work in the proceedings!
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 07 Oct 2009 00:22
You may have noticed my post frequency decreasing. It's inversely proportional to the amount of homework. This semester I am taking three classes, all of which relate to my research interests (for a nice change).

CS646: Information retrieval
The graduate IR class with James. The slides from the class are available for you to follow along. For texts we're using:
  1. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison Wesley, February 2009. [amazon]

  2. C. Manning, P. Raghavan, and H. Schutze, Introduction to Information Retrieval. Cambridge University Press, 2008. [cup]. The authors also make it available online.
This is the first time there have ever been good comprehensive texts for IR. I recommend you pick them up if it's your area of interest.

STAT 607: Mathematical Statistics I
This is an introductory class on statistical theory taught by Michael Lavine. We're learning R for data analysis. The textbook, Introduction to Statistical Thought is available for free download. It has lots of good R examples.

CS791: Information Retrieval Seminar on User Modeling
Bruce is leading a seminar on User Modeling for IR. Last week we focused on query term weighting, led by Michael. This week we'll cover Query Reformulation techniques. There is no website or text for this course, but I'll try to provide some links to relevant papers and presentations as we cover material.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Friday, 02 Oct 2009 00:38
To coincide with Cloudera's Hadoop World, today Y! announced an updated release of the distribution that runs their clusters. The new version is based on Hadoop 0.20.1.

Watch #HadoopWorld on Twitter for more updates on Hadoop World tomorrow.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 30 Sep 2009 23:08
Cloudera today announced a new version of the Cloudera CDH2 testing release.

The new version is based on Hadoop 0.20.1 and has compatible versions of Pig, Hive, and HBase 0.20. This is a big deal because these were previously unavailable to early adopters.

They announced it just in time for Hadoop World in NYC, which starts tomorrow. I won't be attending due to classes and other scheduling conflicts, but please go and let me know what's going on.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Wednesday, 23 Sep 2009 10:34
IBM Research Cambridge hosted the Transparent Text Symposium yesterday and today. Judging by the conversation it's sparking, I missed a great event. The sheer volume of interesting presentations looks astounding.

Start by checking Twitter, #tt09. Daniel has been carpet-tweeting the entire event! Check out his coverage of Day 1 and Day 2.

Be sure to read Ethan Zuckerman's posts on what Matthew Gray is doing with the Google Books data and on Beth Noveck's open government keynote.

If you haven't checked out the Guardian's datablog, do it! It has tons of captivating and informative visualizations. Their talk at TT highlighted the MP Spending crowdsourcing project.

I look forward to catching up with the videos once they are posted!
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Tuesday, 22 Sep 2009 17:15
Today Yahoo! launched a new search page framework.

The new framework integrates SearchMonkey, SearchPad, and SearchAssist. The team streamlined the layout to make it faster and more modular. As they write,
Now, here’s the best part: Rather than building this new experience on top of our existing front-end technology, our talented engineering and design teams rebuilt much of the foundational markup/CSS/JavaScript for the SRP design and core functionality completely from scratch. This allowed us to get rid of old cruft and take advantage of quite a few new techniques and best practices, reducing core page weight and render complexity in the process.
I like the SearchPad integration. However, I find the three-column Y! search result page layout cramped compared to Google and Bing. It feels cluttered and less readable. I have a big monitor and the restricted layout doesn't utilize the space well.

Interesting stuff, but now let's see what the Y! team can do with a newly revamped code base.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"
Date: Monday, 21 Sep 2009 15:39
In case you've been living under a rock, today Netflix formally awarded the $1M prize to BellKor. See the NYTimes article. Wired has a picture of the team, which met for the first time to receive the award.

The top finishing teams recently published papers outlining their strategies via the Netflix Prize Forum message.
I look forward to hearing more about the second iteration of the contest! According to the NY Times article, it will be different:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies.
Author: "jeff.dalton (jeffdalton104@hotmail.com)"