

Date: Friday, 26 Jun 2009 12:59


I've been quiet recently. I've been working flat out on a project that has required all of my attention: increasing the number of UK blogs for Wikio UK (www.wikio.co.uk). The UK site was the last to appear, after wikio.fr, wikio.it, wikio.es, wikio.de and wikio.com, and it has always lagged somewhat in the number of sites in its database. I therefore put some adapted algorithms in place several weeks ago, and I'm happy to announce that the UK site has now passed 100,000 blogs: 113,000 at the time of writing, and this number is set to increase further in the coming hours, as there are nearly 30,000 more blogs in the pipeline.



If you go to the site you will see "Live breaking news from 156920 blogs", but this is simply the number of anglophone blogs, and not only those from the UK. The same number is indeed shown on wikio.com. Both sites draw from the same database but do not display the same results: it's all a question of weighting. The UK site prioritises UK news and the US site prioritises US news (hence the need to geolocate sources). You will see for example the differing reactions to international events, be it the situation in Iran, or the death of Michael Jackson - all rather interesting.

It is alas very complicated in practice. It is extremely difficult for our machines to determine whether a site is American or British (or Canadian or Australian etc.). Obviously if the URL ends in .co.uk, there is little ambiguity. But this is in fact rarely the case. Most British blogs for example are on blogspot.com, wordpress.com, etc.

The algorithms are rather sensitive and, as far as I'm aware, no other service goes as far as we do at Wikio in distinguishing UK from US blogs. If you try Google Blog Search or Technorati, you will see that the results are a mish-mash, without any real attempt to sort by country except a (probable) bias towards .co.uk domains.

The difficulty comes from the fact that no single criterion is sufficient on its own. We can, for example, check the spelling. We know that in Britain they write colour or neighbour and not color and neighbor as in America. This can be useful, but it does not in fact concern that many words, and we are not guaranteed to find them on your average blog. To further complicate matters, Canadian, Australian and other Commonwealth bloggers use the British spelling style. So we can also turn to the blogger's profile: if it cites "London, UK", there you have it. But very often there is no profile on the page, and when there is one it must be found and correctly parsed by the machines. Web 2.0, it appears, lacks certain standards! So in practice this requires a fair bit of work...

We can also look at the topology of the blogosphere (I hope soon to be able to show you some maps of the US/UK à la Wikiopole FR). UK blogs tend principally to reference UK blogs, and US blogs US blogs. The web is simply a sum of communities... However, in practice it's a little trickier than that: UK blogs also reference US blogs (yet this tends not to happen in the opposite direction, which does help a little).
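To make the idea of combining weak criteria concrete, here is a minimal sketch in Python. It is not Wikio's algorithm: the weights, the tiny word lists and the function names are all placeholders invented for illustration. It simply adds up the signals mentioned so far (domain, profile, spelling, share of links to UK blogs) into one score.

```python
import re

# Hypothetical weights: the post gives no figures, these are placeholders.
WEIGHTS = {"tld": 3.0, "profile": 3.0, "spelling": 1.0, "links": 2.0}

# Tiny illustrative word lists; a real system would need far more pairs.
BRITISH = {"colour", "neighbour"}
AMERICAN = {"color", "neighbor"}


def spelling_signal(text):
    """Return +1 if only British spellings occur, -1 if only American, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    brit, amer = bool(words & BRITISH), bool(words & AMERICAN)
    if brit == amer:  # neither seen, or both seen: no evidence either way
        return 0.0
    return 1.0 if brit else -1.0


def uk_score(url, text, profile_location=None, uk_link_share=None):
    """Weighted sum of weak signals: positive leans UK, negative leans US."""
    score = 0.0
    host = re.sub(r"^https?://", "", url).split("/")[0]
    if host.endswith(".co.uk"):
        score += WEIGHTS["tld"]
    if profile_location and profile_location.strip().upper().endswith("UK"):
        score += WEIGHTS["profile"]
    score += WEIGHTS["spelling"] * spelling_signal(text)
    if uk_link_share is not None:
        # Share of outgoing links that point at known UK blogs, rescaled to [-1, 1].
        score += WEIGHTS["links"] * (uk_link_share - 0.5) * 2
    return score
```

Note that the spelling signal alone cannot tell a British blog from a Canadian or Australian one, which is exactly why it gets a low weight here and why the other signals are needed.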

So, in order to end up with a reliable sourcing technique, one must combine all these criteria, and let me assure you it has not been simple. But I am rather pleased with the results, both in terms of coverage and reliability. The UK site is now the second biggest in terms of the number of blogs. I hope it will be useful for you if you are interested in British culture, and wish to discover blogs from across the channel. I would have loved that when I was learning English at school (we had only the BBC on short wave radio...). The themed rankings are still somewhat light, but I am currently working furiously on this with a team of Masters students whom Wikio kindly granted internships, and we are already seeing some great categories emerging. I don't know whether some (perhaps Wine & Beer) will see the light of day for the next ranking, but if not, it will be at the end of July.

That is also a real challenge: categorising hundreds of thousands of blogs as reliably as possible. It's not simple: a nice example of intermingled semantics and topology. That, however, will be the subject of another post. I don't wish to wear you all out!
Author: "--"
Date: Monday, 20 Oct 2008 21:15


We have already discussed several times on this blog (most notably in the comments) the fact that French bloggers seem to link to one another with far less regularity than their American counterparts. I had wanted to avoid stereotypes - the disciplined Germans, complaining French and romantic Italians etc. - and approach this with an open mind. Still, the results are fairly clear cut: each country has a different approach to Web 2.0.

I worked out the proportion of links on the blogs of the various countries in the Wikio database for September 2008. The results are clear: the US is well out in front with 0.17 links per post, or one link for every 6 posts. Then comes Germany, followed by the UK and then France, which has half the proportion of links of the United States (0.08 per post, or one link for every 12 posts published). Finally come Spain and Italy. Clichés aside, it's funny to see the Anglo-Saxons and Germans on one side and the Latin countries on the other, with the French sitting slap-bang in the middle.


Country   Links/post
US        0.17
DE        0.12
UK        0.09
FR        0.08
ES        0.06
IT        0.05



Number of links per post


Even more interesting is to see the results separated by link type: to another post or to a blog's homepage.

Country   To a post   To homepage
US        0.14        0.03
DE        0.10        0.02
UK        0.07        0.02
FR        0.03        0.04
ES        0.03        0.02
IT        0.03        0.02


You will notice that the differences between countries come essentially from post-to-post links: links to homepages occur in pretty much the same proportions from country to country (with a slightly higher rate for France). French, Spanish and Italian bloggers link to other posts more than four times less often than Americans do (0.03 against 0.14 links per post).
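The arithmetic behind these statements is easy to check. The short Python sketch below uses only the figures from the two tables above. The function names are mine, chosen for illustration.

```python
# Links per post, overall (first table) and post-to-post only (second table).
TOTAL = {"US": 0.17, "DE": 0.12, "UK": 0.09, "FR": 0.08, "ES": 0.06, "IT": 0.05}
TO_POST = {"US": 0.14, "DE": 0.10, "UK": 0.07, "FR": 0.03, "ES": 0.03, "IT": 0.03}


def posts_per_link(country):
    """How many posts are published, on average, for each link."""
    return 1.0 / TOTAL[country]


def us_ratio(country):
    """How many times more often US bloggers link post-to-post."""
    return TO_POST["US"] / TO_POST[country]


for country in TOTAL:
    print(f"{country}: one link every {posts_per_link(country):.1f} posts, "
          f"US post-to-post rate is {us_ratio(country):.1f}x higher")
```

For the US this gives a little under 6 posts per link, and for France about 12.5, in line with the rounded figures quoted in the text.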

These results lead me to two remarks:

1. First off, they explain the difficulties one encounters in trying to make a memetracker work in several countries - one which tracks 'hot' discussions by following post-to-post links. The example of Techmeme seems a difficult one to recreate in other cultures: for example Wikio's memetracker works better in the US than in France, where the discussions are less easily aggregated. New ideas are needed!

2. Also, as I explained in my last post [Fr], the Wikio rankings do not currently take into account links to homepages. This was a way of combating "chains", but (having seen your reactions and comments) we will evidently have to rethink this one too!

Date: Monday, 29 Sep 2008 13:43

A big clean is more commonly carried out in spring... but there's no reason not to do one at the end of the summer! This was indeed my recent advice to Wikio regarding their famous blog rankings. I told you recently that the rankings would be one of the projects receiving my attention, in collaboration with you all. In fact, I have completely reworked them and, as promised [fr], I will provide you with the details of the algorithm in the days to come. My first observation is that there was a significant amount of dust in certain nooks and crannies, which needed a little attention before we could progress and try to improve the rankings as a whole. This is not a criticism: such a ranking is an extremely technical undertaking and even the very big names have trouble with it (Technorati for example [fr]).



So, the various Wikio teams have spent September with broom in hand and the results are likely to ruffle a few feathers... There will surely be some grinding of teeth (there always is: not everyone can be on top), but the engine is now much cleaner. Several of you had noted that inactive blogs were sticking around in the rankings, even though they had not published for a few weeks. Well, no more: they're out. I got our developers to create several indicators, one of which flags publication volume, that allow us to follow more closely the behaviour of the tens of thousands of sources in our database. All blogs that had not published for four months have thus been jettisoned. Other indicators were a little more difficult to implement, but now that they are in place they allow us to assess the similarity between sources and so deal with spammers, aggregators and multiple posting (which is sometimes legitimate, but can seriously distort the analysis of backlinks, and thus the rankings, since they are based solely on this criterion). So out also with aggregators and other duplicates (much of the recent work was precisely this: dealing with the enormous amount of source duplication, which is a delicate and extensive process).
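As an illustration of the inactivity rule only, here is a minimal Python sketch. It assumes we know the date of each source's last post and counts idle time in whole calendar months, which is my simplification. The function names and the data layout are hypothetical, not Wikio's.

```python
from datetime import date


def months_between(earlier, later):
    """Number of calendar-month boundaries crossed between two dates."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def active_sources(last_post_dates, today, max_idle_months=4):
    """Keep only sources whose last post is less than max_idle_months old.

    last_post_dates maps a source name to the date of its most recent post.
    """
    return {
        name
        for name, last in last_post_dates.items()
        if months_between(last, today) < max_idle_months
    }


blogs = {
    "still-writing": date(2008, 9, 1),
    "gone-quiet": date(2008, 4, 15),
}
print(active_sources(blogs, today=date(2008, 9, 29)))
```

The similarity indicators mentioned above (for spotting aggregators and duplicated sources) are a harder problem and are not covered by this sketch.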

I also implemented a small change, which has no bearing on the overall principle but smooths the transition from one month to the next. Many of you had seen that there was sometimes a yo-yo effect, whereby blogs suddenly lost a large number of positions or, on the contrary, shot up the rankings like a rocket. This was largely due to the time period used when analysing backlinks. This period, as you will know, was four months. Say a blog is very heavily buzzed in April: it will then appear high up in the rankings from May to August and then (if it is not talked about again in the meantime) suddenly plummet in September. Clearly not ideal. I thus replaced the straight four-month calculation with a progressive attenuation over nine months. So September's links have a value of 1, August's a value of 1 – 1/9, July's 1 – 2/9, and so on. The variations are now a lot more temperate.
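The weighting just described is simple enough to write down. The Python sketch below compares the old flat four-month window with the nine-month linear attenuation. The function names and the example figures (a single burst of 90 backlinks) are mine. Only the weights 1, 1 – 1/9, 1 – 2/9 and so on come from the text.

```python
def month_weight(age_in_months, window=9):
    """Weight 1 for the current month, minus 1/window per month of age, never below 0."""
    return max(0.0, 1.0 - age_in_months / window)


def attenuated_score(links_by_age, window=9):
    """links_by_age[k] is the number of backlinks received k months ago."""
    return sum(n * month_weight(k, window) for k, n in enumerate(links_by_age))


def flat_score(links_by_age, window=4):
    """Old method: every link inside the window counts fully, older ones not at all."""
    return sum(links_by_age[:window])


# A blog that received a single burst of 90 backlinks, seen 3 and then 4 months later.
for age in (3, 4):
    history = [0] * age + [90]
    print(age, flat_score(history), round(attenuated_score(history), 1))
```

With the flat window, the score drops from 90 to 0 between one month and the next. With the attenuation it only slides from 60 to 50, which is the smoothing effect described above.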

Before


Now

Obviously this month there will still be a lot of change in the rankings as many things have been adjusted. The good news is that the clearing out of moribund or spammer blogs has cleared a number of places, and there are thus more blogs on their way up than on their way down. I don't yet wish to reveal the rankings as verifications are still being carried out, but there are some noteworthy and indeed worthy leaps. A few falls as well but that is to be expected. The summer entailed a drop in activity for many blogs but that is true everywhere (you will have likely seen the report on Technorati). It is of course up for analysis, but we hope at least to have provided an improved and cleaner ranking.

Date: Wednesday, 24 Sep 2008 08:44

I’ve dreamt about it (and I’m sure you have too), Google have done it (in part at least)... How many times have you sent a message and later realized that you have forgotten to send the attachment? Embarrassment guaranteed. It has nearly come to be a standing joke with me to say that the automatic detection of missing attachments will be one of the best selling natural language processing programs in the world. A few years ago I even had discussions with students in my seminars on the various ways of developing such a function.

Well, believe it or not Google has announced that it has developed this function as part of GMail, under the mildly sexy name of "Forgotten attachment detector".



It must seem slightly magical to some of you, almost the stuff of science-fiction (could Google now be able to guess, or even anticipate our thoughts? It’s enough to make you shiver...). I am the first to denounce false announcements, which do more harm than good in the field of language technologies (there have been a slew of them over the last half century or more, on automatic translation, man-machine dialogue, and others). We know the problem with these technologies, and the greatest modesty still reigns. As I say in my first lesson, in fifty years we have managed to decode the human genome, but not the language... In this particular case however, I do believe it’s perfectly feasible.




How on earth has Google managed to do it? Honestly I have no idea, but I can tell you how I would have done it (and it seems to me to be just about the only way). The wrong way, in my opinion, is to scratch your head and try to find expressions to detect in the body of mails: "please find attached", etc. Even if you hire the best linguists in the world, the majority will still more than likely be missed.

So here’s my recipe:
I’ve just done a little rough test with my own mails and I can see word sequences appearing like: “hereafter”, “attached file(s)”, “attachment(s)”, “I’m sending you”, “I’m forwarding to you”, “here is the report”, “here is the file”, “here is the/a document”, “here is the estimate”, “please find”, etc.

Of course, a program like this will generate a little noise (false alerts) and silence (missed attachments), but if 95% of cases can be detected, it’s a more than useful function.
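To show how such word sequences would be used once they have been found, and how noise and silence could be counted, here is a minimal Python sketch. The cue list is a small hand-picked subset of the sequences quoted above. The function names and the test mails are invented for illustration. This says nothing about how Google does it.

```python
# A few of the cue sequences quoted above; a real list would be much longer.
CUES = (
    "attached file",
    "attachment",
    "i'm sending you",
    "here is the report",
    "here is the file",
    "please find",
)


def mentions_attachment(body):
    """True if the message body contains at least one cue sequence."""
    text = body.lower().replace("’", "'")
    return any(cue in text for cue in CUES)


def should_warn(body, has_attachment):
    """Warn when the text announces an attachment but none is present."""
    return mentions_attachment(body) and not has_attachment


def noise_and_silence(mails):
    """Count false alerts (noise) and missed cases (silence).

    mails is a list of (body, really_forgot) pairs for messages sent
    without any attachment; really_forgot is the human judgement.
    """
    noise = sum(1 for body, forgot in mails if should_warn(body, False) and not forgot)
    silence = sum(1 for body, forgot in mails if forgot and not should_warn(body, False))
    return noise, silence
```

For example, "I hate attachments" would trigger a false alert with this cue list, while "Document enclosed" would be missed, which is the kind of noise and silence the paragraph above has in mind.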

My estimate:
Maybe I should offer my services to Google, since if I am to believe the mini-test featured on Pulse 2.0, it's not very good. The detector recognizes "I have attached", but not "Attach a document" or "Here is the attachment"... I tested this myself, with phrases like "Attached please find a copy of...", without much more success. Rather strange all the same.

It remains to be seen (after having resolved these few details...) if Google will offer a French version. I’ve already mentioned in the past the amount of time Google takes in localizing its products. Sometimes a few years. Watch this space.

Date: Monday, 28 Apr 2008 10:35

I have already mentioned the admiration I have for Wikio's linguistic technology, which I find to be one of the most well-developed amongst the many tools, search engines and portals currently available on the web. One very interesting function is the automatic detection of "named entities", those being names of people, places and companies. You might have already noticed that the engine displays, in the summary of each news story, a selection of different links to the various entities that it has recognised, allowing one to launch further, related searches with a single click. An inveterate tinkerer, I had a little fun with this and aggregated a few statistics to give you the daily buzz:



Interesting, no? Do bookmark this page; it's automatically updated every hour according to the latest news.

And if you are interested in other languages, drop by this page, where you will find the daily buzz for the news in German, French, Spanish and Italian. You will be surprised at just how much tastes vary from country to country!
Date: Saturday, 01 Mar 2008 07:44


Quite a while ago now, I promised to talk to you about the intelligent news portal Wikio [fr]. I came across this site in passing, as many of you no doubt did, and stupidly saw it as just another aggregator, albeit with Digg-style vote buttons, but nothing to write home about. Tragic error. Wikio is undoubtedly the service which harbors the most advanced linguistic technology on the Web at the current time (and you’ve noticed that that’s the theme of this blog... it just had to interest me!).



I’ll no doubt come back to it in other postings, but I just wanted to give you an example. Wikio doesn’t just aggregate news and postings ad hoc. When you go to its main competitor, Google News, the home page offers you today’s headlines grouped into major categories (Sports, International, France, Economy, etc.). That’s basically where the intelligence of the service ends. It’s true that when you enter a keyword, the articles are presented to you in groups, but this grouping is of poor quality. Enter “Yahoo”, for example, and you will see that the groups are quite unreadable. Many news items are not grouped at all and the existing groups overlap each other: the Microsoft affair is spread over a variety of groups, etc. (by the time you enter the query, the page will certainly have changed, but you get the idea). Yet I praised this service when it came online in 2002. Document clustering (and thus news clustering) is an extremely difficult problem, as you can imagine, and the system seemed very promising. Alas, as with many Google products, it has hardly evolved since its initial launch, although it officially left beta in 2006. Google concentrated more on the number of sources (4,500 for English, so we’re told) than on their quality, or that of the algorithms… Increasing the number of sources (easy to do automatically) quite logically degrades the quality of the clustering.

For Wikio, it’s not perfect (the service is clearly announced as a beta version), but the underlying technology is infinitely more promising. Articles (from media or blogs) are not merely grouped into high-level categories (Sports, etc.) but into a veritable “knowledge tree” which currently includes over 30,000 categories (at least on the French site; Wikio.com is more recent and might be a little behind):



If you count, you will see that there aren’t quite 30,000 categories (even on the French site). I asked Wikio about this: it's normal, as the list changes constantly and only categories which have had recent news appear.

To my knowledge, the categories are not visible anywhere in tree form, but one can guess the organization from the form of the URL. Take the “deafness” category for example. When you enter this keyword into the engine, it sends you back to a page containing news on the topic, with a URL giving the following hierarchy:


http://www.wikio.com/health/disability/deafness

The Health theme contains numerous sub-themes, including Disability, which in turn contains Deafness. This hierarchy is also clearly given by navigation links in the top left hand corner of the page:

News > Health > Disability > Deafness
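The correspondence between the URL and the navigation trail can be sketched in a few lines of Python. This is only an illustration of reading the hierarchy off the path: the function name is mine, and the handling of hyphens in multi-word categories is an assumption, since the only URL shown has single-word segments.

```python
from urllib.parse import urlparse


def breadcrumb(url, root="News"):
    """Turn the path of a category URL into a 'News > A > B > C' trail."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption: multi-word categories would be hyphenated in the URL.
    labels = [s.replace("-", " ").capitalize() for s in segments]
    return " > ".join([root] + labels)


print(breadcrumb("http://www.wikio.com/health/disability/deafness"))
```

Each prefix of the path is itself a category, so the same function applied to a shorter path gives the parent theme.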




The Deafness theme in turn contains other sub-themes: Cochlear implants, Sign language, Lip reading, Cued speech and others. But navigating to the sub-categories is less easy, which is a shame (a bunch of tags can indeed be found to the right of the screen, but they are often complex and don't only present daughter categories). One could imagine other more practical solutions (a small drop-down menu, for example, under the word Deafness in the navigation link at the top of the page).

Don’t think that this is merely an alert on the keyword deafness, as is the case with Google. The page offers articles which don’t contain this word, but which contain related words: deaf, hearing, hearing loss, etc. And, above all, Wikio doesn’t let itself be hoodwinked too easily by articles (and there are plenty in its database, I’ve just checked) which talk about the deafness of power, politicians turning a deaf ear and so on.

Wikio presents a fantastic reservoir of structured information, that is, to my knowledge, unrivalled. The beauty of the thing is that everyone can create their own news pages, either by subscribing directly to a category’s RSS feed (for example here for deafness), or by combining the categories with each other to create one’s own tabs – which can in turn be exploited by a specific RSS feed!

Absolutely fascinating. The possibilities of such a system are mind-boggling... Of course, there is some tweaking to be done here and there, as you may imagine. This is the very forefront of language technologies (and, believe me, extremely difficult). And there are some perverse cases. One of my postings, on Google and internet referencing, has gone into the Cosmetics category because I quoted the expression nail varnish, for example. But, honestly, only HAL's grandson [fr]… would be able to resolve that one, and in 3001 no doubt.

I'll be brief... I know that we live in the age of channel-hopping and that most of you have already switched to other channels. So I’ll come back to this. I’ll go into greater detail about what I've been able to understand of the surprising technology behind all this. Meanwhile, I’m eagerly awaiting the new version on which Wikio will apparently begin to do “teasing” [fr] ;-) So watch this space!



PS


It's confirmed! [fr], a new version is in the starting-blocks.
Date: Friday, 08 Feb 2008 11:40

A lot has been said about Yahoo! lately. The company is clearly suffering from image problems, and a certain lack of coherence… But, as I’ve already had the opportunity to point out (for example here, or here…), its technology is not bad at all and in a certain number of cases it is even superior to that of Google.

This is the case for example with search suggestions. It’s true that Google announced this function in late 2004, three years earlier than Yahoo! [Nostalgia trip: it was one of the first postings (fr) I wrote on this blog...] We’re now familiar with Ajax and relatively blasé about Web 2.0 (which already sometimes seems dated), but remember: at the time, this type of interactive technology (based on JavaScript and XMLHttpRequest, see reverse-engineering here) was a small revolution…

The problem with Google Suggest, as with many other Google technologies, is that it has hardly evolved since it was launched. It’s true that it has been integrated into the Toolbar and the service home page no longer says “beta”, but I can hardly see any changes compared with 2004. Especially as, unless I’m mistaken, Google Suggest still doesn't distinguish between languages, which is really awkward for us French speakers (and a few other internet users throughout the world).

So, when I type “ala” in my search box, I get suggestions concerning Alaska airlines, Alamo car rental or alarm clocks, which are not exactly subjects of interest for your average Frog.



Yahoo! took three years longer (that just might be their problem...) but the version which came out in December (after coming out in October in the States) is "localized":




It is even better designed than Google Suggest, as it knows how to search in individual words inside complex requests:




Compared with Google Suggest, which settles for searching in the first word:
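One reading of the difference between the two behaviours can be sketched in Python as follows. The query list and the function names are invented for illustration, and neither function claims to reproduce what Yahoo! or Google actually run. The first matches what you type against the start of the whole query only, the second against the start of any word in it.

```python
# A made-up list of past queries to draw suggestions from.
QUERIES = [
    "alarm clock",
    "alaska airlines",
    "radio alarm clock",
    "cheap car rental alamo",
]


def prefix_suggestions(typed, queries):
    """Suggest only queries that begin with the typed characters."""
    typed = typed.lower()
    return [q for q in queries if q.startswith(typed)]


def word_suggestions(typed, queries):
    """Suggest queries in which any individual word begins with the typed characters."""
    typed = typed.lower()
    return [q for q in queries if any(w.startswith(typed) for w in q.split())]


print(prefix_suggestions("ala", QUERIES))
print(word_suggestions("ala", QUERIES))
```

With this data, typing "ala" brings back two queries with the first function and all four with the second, since "alarm" and "alamo" also occur in the middle or at the end of longer requests.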


Other interesting functions from a linguistics point of view have also appeared recently with Yahoo! I’m a little short of time these days (I’m sure you've noticed ;-) but I'll try to come back to this... Anyway, once again I find it hard to understand the disdain internet users have for Yahoo! (especially in France). Questions of image, marketing, buzz... That’s the way the world goes round, and the Web with it!

Date: Wednesday, 09 Jan 2008 10:38


You know I love tools (some [fr] might even say I'm a tool junkie). I've come up with a few myself, you remember, but I like it even better when they are developed by others – especially when it's precisely an idea I've been playing with for months, and when I haven't been able to find the time to do it.

You have probably noticed that numerous sites offer buzzometers, providing graphs: Technorati, BlogPulse, BlogScope, Trendio, soon Wikio Buzz [fr], etc. There's just one thing though... These services usually dish up beautiful graphics, but not the data themselves (not as daft as that), and obviously, the graphs cannot be compared (different scales, different time frames, etc.). Maybe you did what I did: manually superimposed the graphs playing with transparency in Photoshop. But I wondered whether the images couldn't be analyzed automatically...

Fait accompli. Philippe Gambette [fr], another surprising tool junkie (and an admirer of my blog – I promise I paid nothing for the title of his [fr]!) designed this tool. Simple, effective, open and free.

With, as a bonus, the comparative analysis of the "Manaudou naked" [fr] mega-buzz:



Click to enlarge and view on Philippe's blog

To be seen (and buzzed!) urgently! It's here.


New

Date: Thursday, 13 Dec 2007 15:09

You have no doubt noticed that I have started posting in English again lately. I don’t know if I’ll carry on as it takes more time…

Anyway, this time I thought it might present problems for some tools like Wikio or RollSense which apparently don't detect (not yet, at least) the language. This means my postings are wrongly indexed…

So I programmed a little system enabling this blog to have two streams:
This might be more comfortable for those of you who only want one or other of the versions. If you want both French and English, the following stream will provide both:
It’s a shame that the blog platforms don't offer this function. It's so easy though...
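For the curious, here is one very rough way of sorting postings into a French and an English stream, sketched in Python. It is not the system used on this blog: the stopword lists, the function names and the tie-breaking rule (ties go to English) are all assumptions made for the example.

```python
# Very small stopword lists, for illustration only.
FRENCH = {"le", "la", "les", "des", "est", "et", "un", "une", "que", "pour"}
ENGLISH = {"the", "of", "and", "is", "to", "in", "that", "for", "it", "with"}


def guess_language(text):
    """Guess 'fr' or 'en' by counting common function words (ties go to 'en')."""
    words = text.lower().split()
    fr = sum(w in FRENCH for w in words)
    en = sum(w in ENGLISH for w in words)
    return "fr" if fr > en else "en"


def split_streams(posts):
    """Distribute posts into one list per language."""
    streams = {"fr": [], "en": []}
    for post in posts:
        streams[guess_language(post)].append(post)
    return streams
```

On full postings, which contain plenty of function words, even a crude count like this separates the two languages fairly cleanly. Short titles are another matter.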
Date: Wednesday, 12 Dec 2007 21:07

After publishing a posting on Google I like looking at my statistics just as California is waking up.



Hi guys, greetings from Aix!

Date: Wednesday, 12 Dec 2007 14:59

But not Dailymotion! I recently told you that my site was second in the image search on Google for the word "Google", which was undreamt of... Well, would you believe that if you activate the SafeSearch function (read “anti-porn”), this image disappears from the results!



However, the posting featuring the Google image in question is hardly pornographic... My only conclusion is that Google is all tits and bums (unless the words Thatcher, Saddam and Putin, which appear on the page, can be considered obscenities!).

I have already mentioned the mysteries of the SafeSearch function and the sledge hammer that Google sometimes uses to crack nuts, but it’s still funny...

As for Dailymotion, I was pleasantly surprised to see that I am first on the first two images for the keyword with the SafeSearch function activated. Not bad, there too, for such a common keyword! And I see that it brings visitors (I should start making this site “pay”…).





But the funniest thing is, that with the SafeSearch function deactivated, so normally letting through the hardest of porn, my little image only comes 12th, behind, amongst others, www.book.fr, www.politique.net, tempsreel.nouvelobs.com and sntic.parti-socialiste.fr. Of these 11 sites eliminated due to the public health hazard, only one seems to me to be an "erotic site"…



Basically, it’s a load of rubbish, isn't it?



Follow up
