Date: Friday, 21 Sep 2007 18:49

Recently a Japanese user wondered why searching for "あいう" or "あいうえ" would not find a document containing the string "あいうえおかきくけこさしすせそ" even though searching for "あい" or "あいうえお" would find it. Recall that the default for search in Vista is that we look for words beginning with what you typed, so it was reasonable to expect that all four strings would find the document.

The reason is that word breaking in Japanese (and various other languages, especially East Asian languages) is difficult and highly context sensitive. In Western languages we usually have whitespace or punctuation between words and, except for some quirks and exceptions, breaking a paragraph into words is usually straightforward and unambiguous. In Japanese, on the other hand, words are frequently not separated by spaces and therefore word breaking becomes a guessing game. A good lexicon helps, but unless you have a way of verifying that the resulting word-broken sentence makes sense semantically (which is what a Japanese-speaking human would do), there are usually a number of syntactically valid ways of word breaking a sentence and you have to resort to heuristics for picking what is likely the right one. Considering this, our Japanese word breaker is doing a really nice job, but it means that the same characters can be broken differently when you change the characters around them.

So let us look at how various prefixes of "あいうえおかきくけこさしすせそ" are broken (each line shows a string, then its word breaking):

あいうえおかきくけこさしすせそ → あいうえお - かき - くけ - こさ - し - す - せ - そ
あい → あい
あいう → あ - いう
あいうえ → あい - うえ
あいうえお → あいうえお
あいうえおか → あいうえお - か
あいうえおかき → あいうえお - かき
あいうえおかきく → あいうえお - か - きく
あいうえおかきくけ → あいうえお - かき - くけ
あいうえおかきくけこ → あいうえお - かき - くけ - こ

You see that as we take longer and longer prefixes, the word breaking approaches that of the full string. Now recall that when you type a query we will word break it and then see if we can find that sequence of words. Or, if you have "Find partial matches" on (which is the default in Vista's search), we will look for a sequence of words where each word begins with one you typed, in order. So in English if your query was "my foot", we'd find a document containing "mystic footwear" as that is a word beginning with "my" followed by a word beginning with "foot".

Now look at the table above. For example, if your query was "あい", then we will look for documents with words that begin with "あい" and we will find the one with "あいうえおかきくけこさしすせそ" as it has a word "あいうえお" in it.

And if your query was "あいうえおか" we will look for a document that has in it a word beginning with "あいうえお" followed by a word beginning with "か". Then too we will find the document with "あいうえおかきくけこさしすせそ" as it has the word "あいうえお" followed by the word "かき" in it.

But if your query was "あいう", then we will look for a document with a word beginning with "あ" followed by a word beginning with "いう". This time we will not find the document containing "あいうえおかきくけこさしすせそ": even though it contains the word "あいうえお", which begins with "あ", that word is followed by "かき", which doesn't begin with "いう".
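The matching rule described above can be sketched in a few lines of Python. This is only an illustration, not the actual Vista implementation: the function name `partial_match` is made up, no word breaker is run, and the word breaks are hard-coded from the table above (with the one-character words written out as "あ", "か", "し" and so on, which is an assumption based on what each string must add up to).

```python
def partial_match(query_words, doc_words):
    """Return True if doc_words contains consecutive words that each
    begin with the corresponding query word, in order."""
    n = len(query_words)
    for start in range(len(doc_words) - n + 1):
        window = doc_words[start:start + n]
        if all(d.startswith(q) for q, d in zip(query_words, window)):
            return True
    return False


# Word breaks of the full document string (hard-coded, see the table).
DOC = ["あいうえお", "かき", "くけ", "こさ", "し", "す", "せ", "そ"]

# Word breaks of a few queries (hard-coded, see the table).
QUERIES = {
    "あい": ["あい"],
    "あいう": ["あ", "いう"],
    "あいうえ": ["あい", "うえ"],
    "あいうえお": ["あいうえお"],
    "あいうえおか": ["あいうえお", "か"],
}

results = {query: partial_match(words, DOC) for query, words in QUERIES.items()}
for query, found in results.items():
    print(query, "finds the document" if found else "does not find the document")
```

The two queries that fail are exactly the ones whose word breaks do not line up with a run of words in the document.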

Even though the behaviour is better for real text ("あいうえおかきくけこさしすせそ" is simply the first fifteen Hiragana characters) we obviously wish this wouldn't happen and we are looking into ways of improving the behaviour in a future version. But it may please you to know that it is a negative side effect of a positive way of tackling a hard problem: word breaking of Japanese and other East Asian languages.

Author: "JonasBar" Tags: "AQS"
Date: Monday, 10 Sep 2007 19:56

Say you have a document containing the number 1234567890 (it might be a part number or an invoice number) and another containing the number 1234567891 (presumably the next part or invoice). You want to find the former document so you do Start > Search and type in

1234567890

but the result contains both documents. This is not what you expected and rightly so. What happens is that when a document is indexed and it contains a numeral, Vista will index that exact numeral but also a normalized form of that numeral, which is its numeric value. This is primarily so that if you have a document containing something like '1,234' you can find it with a query string such as '1234' that doesn't have the comma. That is a good thing. The problem is that on Vista, if the numeral denotes a sufficiently large number, the numeric value used will be approximate. The numeric values of 1234567890 and 1234567891 will be indistinguishable and that is why searching for one will also return the other.
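To see how an approximate numeric value can make two different numerals collide, here is a small Python sketch. It assumes, purely for illustration, that the normalized value is kept as a 32-bit IEEE 754 float; this post does not say which representation Vista actually uses, and the helper names are made up.

```python
import struct


def normalize_numeral(text):
    """Strip thousands separators so that '1,234' and '1234' agree."""
    return int(text.replace(",", ""))


def to_float32(n):
    """Round an integer to the nearest 32-bit float (assumed representation)."""
    return struct.unpack("f", struct.pack("f", float(n)))[0]


# Small numbers keep their exact value...
print(to_float32(normalize_numeral("1,234")) == to_float32(1234))  # same numeral
print(to_float32(1234) == to_float32(1235))                        # different numerals

# ...but large neighbours round to the same approximate value.
print(to_float32(1234567890) == to_float32(1234567891))
```

With only about seven significant decimal digits available in such a representation, ten-digit part numbers that differ in the last digit cannot be told apart by numeric value alone.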

The best workaround I can think of is using the ~ operator to force a character match. What you have to keep in mind is that ~ will match on the whole string value of a property so usually you have to use * at the beginning and end. Moreover, it cannot search across all properties so you have to know what property to search in. Suppose the number you are searching for is in the title property of the documents. Then your query would be:

  • title:~"*1234567890*"

You need to use the doublequotes above so the wildcards are interpreted correctly. If the property you search over simply has the number as the value (as opposed to having the value appear embedded in a piece of text), then you can simplify a bit, for example:

  • subject:=1234567890

Searching for large numbers using WDS on Windows XP does not have this problem and we are looking into improving the behaviour in a future version.

Author: "JonasBar" Tags: "AQS"
Date: Tuesday, 13 Mar 2007 19:13

There is a major difference between file search on Windows XP and desktop search on Windows Vista (or on Windows XP with Windows Desktop Search [WDS] installed) that has perhaps not been made sufficiently clear.

On Windows XP search is character based. That is, if you search for a string 'test', it will find files named 'my test data.doc', 'additional testing.xls' as well as 'latest junk.txt' or (if you tell it to also search the contents of files) files containing words such as 'test', 'tester' and 'fattest'.

On Windows Vista, and on Windows XP with WDS installed, search is normally word based. Searching for the string 'test' will only find documents with the word 'test' in them, or words beginning with 'test'. So it will find the files named 'my test data.doc' and 'additional testing.xls' but it will not find 'latest junk.txt'. Moreover, it will find documents containing 'test' or 'tester' but it will not find documents containing 'fattest'.
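The difference between the two behaviours can be mimicked in Python with the file names from the example. This is a rough sketch only: splitting words on anything that is not a letter or digit is an assumption, and the real word breaker is more sophisticated.

```python
import re


def words(text):
    """Naive word breaking: runs of letters and digits (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def character_search(term, names):
    """Windows XP style: the term may occur anywhere in the name."""
    return [n for n in names if term.lower() in n.lower()]


def word_search(term, names):
    """Vista/WDS style: some word must begin with the term."""
    t = term.lower()
    return [n for n in names if any(w.startswith(t) for w in words(n))]


FILES = ["my test data.doc", "additional testing.xls", "latest junk.txt"]

print(character_search("test", FILES))  # includes 'latest junk.txt'
print(word_search("test", FILES))       # leaves 'latest junk.txt' out
```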

One cannot really say one is better than the other; if one is really looking for a word 'test', finding documents containing 'fattest' is just unwanted noise. On the other hand, sometimes one is really looking for a string, wherever it occurs, and then character based search is the only game in town.

The main reason for the change is that by making search word based one can use an index to make searches much faster. This is why searches on Windows Vista are generally so much faster than on Windows XP (without WDS): on Windows XP each search basically plows through every single file, looking for the string, while on Windows Vista an index lookup produces the right documents instantly. By the way, this is how most Internet search engines work and that is why they too are word based.
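Why word-based search lends itself to an index can be shown with a toy inverted index in Python. The data structure below (a dictionary from words to document names plus a sorted word list) is a textbook simplification, not a description of the actual Windows index, and all names and sample documents are invented.

```python
import bisect
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index, sorted(index)


def prefix_lookup(index, vocabulary, prefix):
    """Find documents with a word beginning with prefix.

    Only the sorted vocabulary is consulted; no document is re-read.
    """
    prefix = prefix.lower()
    hits = set()
    i = bisect.bisect_left(vocabulary, prefix)
    while i < len(vocabulary) and vocabulary[i].startswith(prefix):
        hits |= index[vocabulary[i]]
        i += 1
    return hits


DOCS = {
    "a.txt": "Run the test before shipping",
    "b.txt": "The tester found a bug",
    "c.txt": "The fattest cat",
}
index, vocabulary = build_index(DOCS)
print(sorted(prefix_lookup(index, vocabulary, "test")))
```

Words that begin with the same prefix sit next to each other in the sorted vocabulary, which is what makes "words beginning with" cheap. A string in the middle of a word, as in 'fattest', gets no such help from the sort order.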

But what if you really want to look for a string anywhere? The good news is that you can do that also on Windows Vista (or Windows XP with WDS). You do it by searching for a string that contains '?' and/or '*'. As on Windows XP (and harking all the way back to DOS), '?' matches exactly one arbitrary character while '*' matches zero, one or more arbitrary characters. So to search for 'test' occurring anywhere in words, search for '*test*'. It finds all of the examples above, just as 'test' would on Windows XP. Note that the pattern will be matched against the whole value, not just against each word, so searching for 'lo?t' will not find documents with the word 'lost' in them unless that was the only word. You would have to search for '*lo?t*' even though that will also find documents containing words such as 'plotting'.
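The rule that a pattern is matched against the whole value is easy to try out with Python's standard `fnmatch` module, which uses the same '?' and '*' conventions. This is an analogy, not the search engine's own matcher: `fnmatch` also understands `[...]` character classes that are not discussed here, and ignoring case is an assumption.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, value):
    """Match a '?'/'*' pattern against the whole value, ignoring case."""
    return fnmatchcase(value.lower(), pattern.lower())


print(wildcard_match("lo?t", "lost"))               # the only word: matches
print(wildcard_match("lo?t", "I lost my keys"))     # whole value differs
print(wildcard_match("*lo?t*", "I lost my keys"))   # stars absorb the rest
print(wildcard_match("*lo?t*", "plotting graphs"))  # also matches inside words
```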

You can also restrict your search to a single property. For example, 'name:*bum*' will return any document where the string 'bum' occurs somewhere in the name property, 'subject:???' will search for documents where the subject has exactly three characters, and 'filename:l?t*t' will find documents with file names such as 'latest', 'littlest' or 'lqtfgdfhgt', and so on. See New Mansions in Search - Advanced Query Syntax for more on querying over specific properties.

There is one exception: if what you search for has a '*' wildcard at the end and no other wildcards, as in 'subject:test*', the search will still be word based and the results will contain any document that has a word beginning with 'test' somewhere in the subject. However, you can force the character based search by using the operator '~'. 'subject:~test*' will return only those documents where the subject begins with 'test' and 'subject:~*test*' will return those where 'test' occurs anywhere (also inside words).
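The rules for when a search is word based and when it is character based can be summed up as a small decision function. The Python sketch below only restates the rules from this post; the function name `search_mode` is made up, and it looks at the value part of a query (what comes after 'subject:' and the like).

```python
def search_mode(value):
    """Classify a query value as a 'word' or 'character' based search."""
    if value.startswith("~"):
        # The ~ operator always forces character based matching.
        return "character"
    has_wildcards = "?" in value or "*" in value
    only_trailing_star = (
        value.endswith("*") and "?" not in value and value.count("*") == 1
    )
    if has_wildcards and not only_trailing_star:
        return "character"
    # Plain words, and words with a single '*' at the end, stay word based.
    return "word"


for value in ["test", "test*", "*test*", "l?t*t", "~test*", "~*test*"]:
    print(value, "->", search_mode(value))
```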

So what is the price for using '*' and '?', that is, using character based search? Time! The search engine is forced to go through every document in the scope and look for the specified pattern. If you are searching over a large collection of documents, this can take a long time, but the choice is up to you, the user!

Author: "JonasBar" Tags: "AQS"
Date: Wednesday, 31 Jan 2007 23:05

With the Windows Vista General Availability behind us, it's time to call attention to one of the useful features of the new search facilities in Vista.

I am sure that you have noticed that there are "search boxes" sprinkled in strategic places across the user interface, in particular:

  • In the Start menu (this one is a bit special)
  • In the Search Home (you go there by Start > Search)
  • In any Windows Explorer folder.

There are many things to say about how to use these search boxes but I will focus on one of them: the Advanced Query Syntax (AQS) that you can use in all of them. Often all you need to do to find something is to type one or more words that you recall from that document, e-mail or whatever into a search box, and in a jiffy you have found it. But sometimes that returns way too many results to wade through, so you need to be more precise, or you need to search on something other than words, like a date. That is when AQS comes in very handy.

Terminology: I'll write item when I mean any kind of document, e-mail, image, text file or other searchable thing on your computer.

Terminology: Items have properties that can be textual (author, title, name, tags, etc), numeric (size, importance, etc), date/time (when created, modified, or sent, etc), and so on. They also usually have some form of content which often is searchable, such as the text of a document, or the body of an e-mail message.

Me lazy boy: As you have probably figured out on your own, when you type, say, 'ste' in the search box, Vista will actually search for anything beginning with 'ste'. But it would get very tedious to write "search for words beginning with blah, blah" all over this post, so unless I specifically say otherwise, when I write "search for blah, blah" below, I mean "search for words beginning with blah, blah".

Here are a couple of examples of queries that illustrate some of the abilities of AQS:

  • author:Darrell
  • modified:Jan 2006
  • taken:yesterday
  • date:6/29/2004..10/10/2006
  • genre:rock year:<1990 NOT artist:jack
  • folder:"Sample Music" barra*
  • filename:M?c?o*ft
  • tags:(hamster baseball religion)
  • from:(Bill OR Steve)
  • from:Bill OR to:Ray
  • importance:high
  • kind:docs

And here are some facts about AQS that were illustrated above:

  • If you just type arbitrary words, they will be searched for in the content of items as well as in the properties of those items. But if you write a property followed by a colon and then a word, Vista will look for the word in that property only. So 'author:Darrell' searches for items that have the word Darrell somewhere in the author property.
  • You can search for items by date and/or time and there are various date/time properties to use. The property 'date' is "some useful date associated with the item" and every item has it so when you are searching for items of any kind by date, it's a handy property to use.
  • When specifying a date, you can either use an absolute date (as in '10/15/2007' or 'Jan 2006') or a relative date (as in 'yesterday', 'last week' or 'this month'). You can also specify an explicit range, as in '6/29/2004..10/10/2006'. (Go ahead, try 'yesterday..tomorrow' or '9/24/2000..next year', they work too.)
  • You can stick an operation such as '<', '>', '<=', '>=' in front of a value to look for anything before or after (inclusive or not) that value, so 'year:<1990' will find songs published before 1990.
  • If you want to exclude items that contain some word, use NOT, which must then be in upper case (so you can search for the word 'not' by writing it in lower case). Actually you can use NOT not only for words but also together with date/time, etc, so 'date: NOT 2006' or 'NOT date:2006' will find items with a primary date that is not in 2006.
  • If you write more than one condition, Vista will only return items that satisfy all conditions. So 'genre:rock year:<1990 NOT artist:jack' will return items that have a genre of 'rock', were published before 1990, and that do not contain the word 'jack' in the artist property. You could also stick the word AND (in upper case) between the conditions but if conditions simply follow each other, AND is what we assume anyway.
  • You can search for a phrase by enclosing it in doublequotes. So "Kurt Bloch" (with the doublequotes) will search for items that have the word 'Kurt' somewhere immediately followed by the word 'Bloch'. Here we really mean the words 'Kurt' and 'Bloch', not words beginning with 'Kurt' and 'Bloch'. If you want to search for items that have a word beginning with 'Kurt' immediately followed by a word beginning with 'Bloch', type "Kurt Bloch"* (with the star immediately following the closing doublequote).
  • If you put a star '*' immediately after a word, it means to search for anything beginning with that word. So 'barf' and 'barf*' both search for anything beginning with 'barf'. How silly, you might think, but actually one can turn off the automatic "partial matching" in the Folder and Search Options, and then it's very handy to have a way of getting the "partial matching" when you really want it.
  • So, 'folder:"Sample Music" barra*' will find any item that resides in a folder with a name that has the word 'Sample' (really) followed by the word 'Music' (really), and that has some word beginning with 'barra'.
  • If you think you miss good old DOS wildcards, the good news is that you can use them, too: '?' matches one arbitrary character, '*' matches any number of characters. The bad news is that such queries run much, much slower than others (exception: just a '*' at the end of a word, as above, doesn't bring the speed down). But sometimes this is just what you want, so go buy a short, nonfat latte while the computer finds every item that has a file name matching 'M?c?o*ft' (actually it only has to begin with that, just like for regular words).
  • Using parentheses you can group together multiple words, which will all be searched for. So 'tags:(hamster baseball religion)' finds items that have all three tags 'hamster', 'baseball' and 'religion', in any order. It would find exactly the same things as the query 'tags:hamster tags:baseball tags:religion'. If you stick extra parentheses around things, they are simply ignored, so 'tags:(((latex)))' is the same query as 'tags:latex'.
  • If you put OR between words, we will search for items that contain either. This often requires using parentheses if only a certain property is to be searched. 'from:(Bill OR Steve)' will find all your items that have a sender with either the word 'Bill' or the word 'Steve' in them.
  • OR can go between larger chunks as well, so 'from:Bill OR to:Ray' finds items that either have 'Bill' in the sender or 'Ray' in the recipient. Sometimes you want to use them with parentheses to be sure there is no misunderstanding, as in '(from:Bill OR to:Ray) tags:bowling' that will find items with 'bowling' among the tags and that are either from 'Bill' or to 'Ray'. On a bit of a tangent, note that 'from:Bill Ramsey' will not look for items with 'Bill Ramsey' as sender. It will look for items having a sender with 'Bill' in it and 'Ramsey' anywhere (in the contents, for example). To really look for items with Bill Ramsey as sender, try 'from:(Bill Ramsey)' or 'from:"Bill Ramsey"', depending on how exact a match you want.
  • Finally, some properties contain special values that you want to search for without having to remember what they actually are. So you can write 'importance:high' and not have to know that this looks for a value of 5 in the importance property. Or you can write 'kind:docs' and not have to care that it actually will search for items with a kind containing 'document'.
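A few of the facts above (property conditions, implicit AND, NOT, comparison operations, and bare words that search everywhere) can be played with in a toy evaluator. This Python sketch is in no way the real AQS parser: it handles only whitespace-separated conditions, knows nothing about quotes, parentheses, OR, dates or special values like 'high', and the function names and sample items are all invented for illustration.

```python
import operator
import re

# Two-character operations are listed first so '<=' is not read as '<'.
COMPARE = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def has_prefix_word(text, prefix):
    """True if some word in text begins with prefix (case-insensitive)."""
    return any(w.startswith(prefix.lower())
               for w in re.findall(r"\w+", str(text).lower()))


def condition_holds(cond, item):
    """Evaluate one 'word' or 'property:value' condition against an item."""
    prop, sep, value = cond.partition(":")
    if not sep:  # bare word: look in every property of the item
        return any(has_prefix_word(v, cond) for v in item.values())
    for symbol, compare in COMPARE.items():
        if value.startswith(symbol):
            return prop in item and compare(item[prop], int(value[len(symbol):]))
    return has_prefix_word(item.get(prop, ""), value)


def matches(query, item):
    """Implicit AND over all conditions; NOT negates the next condition."""
    negate = False
    for token in query.split():
        if token == "NOT":
            negate = True
            continue
        if condition_holds(token, item) == negate:
            return False
        negate = False
    return True


SONGS = [
    {"title": "A", "genre": "Rock", "year": 1985, "artist": "Jill Smith"},
    {"title": "B", "genre": "Rock", "year": 1985, "artist": "Jack Jones"},
    {"title": "C", "genre": "Rock", "year": 1995, "artist": "Jill Smith"},
]
query = "genre:rock year:<1990 NOT artist:jack"
print([s["title"] for s in SONGS if matches(query, s)])

# 'from:Bill Ramsey' is 'from:Bill' AND the bare word 'Ramsey' anywhere.
mail = {"from": "Bill Jones", "body": "Ask Ramsey about the invoice"}
print(matches("from:Bill Ramsey", mail))
```

The last line shows the tangent about 'from:Bill Ramsey': the toy evaluator accepts a mail from Bill Jones because 'Ramsey' occurs in the body.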

When you use "Advanced Search" in the Search Home, pay attention to the search box. As you compose a query in the "query builder", it will put the corresponding AQS in the search box. Another time you might try simply typing an AQS query yourself if it saves you time!

My final tip is that AQS works also when searching in Outlook 2007, though only for Outlook properties.

And if the title reminded you of something in your CD collection, you rock!

Author: "JonasBar" Tags: "AQS"
Date: Thursday, 29 Jan 2004 18:27
To spill my mind at will, with a readership of millions! Or something. Stay tuned for revolutionary insights, secret passageways, and mindbending opinions. Or something.
Author: "JonasBar"