AOL has just published the search data of 650,000 of its users. The data covers three months of search history, along with the links clicked for each query. It is a 439 MB compressed file that weighs 2 GB once decompressed: the file was available here (direct link to the file, since removed by AOL) and consists of 10 text files packed together. AOL has since disabled the link, but it is a bit late.
A full article is available on Techcrunch.
UPDATE 07/08/2006 15:00:
Having looked at the structure of the file, I can see that users are distinguished by the “AnonID” field, which is an anonymous identifier. As a result, I find it much harder to see where the disclosure of private data lies. I expect this file to be picked apart by quite a few people, with more or less commendable ambitions.
UPDATE 08/08/2006 09:00:
AOL has apologized in a press release as well as in comments on some blogs. AOL also confirms, as I said yesterday, that all the data is keyed to an anonymous identifier.
UPDATE 08/08/2006 13:50:
It was clear that things would not stop there. A site (with a particularly well-chosen domain name) has just launched and offers a web interface for browsing this data set. Other sites are bound to follow…
UPDATE 08/08/2006 15:44:
Concrete examples of the database contents are available in the full text of this post.
Examples of the database contents (quoted in the original English)
User 6426084
6426084 is a definite fan of pitbull dogs. And pitbull fighting. Looks like he wants to register a pitbull dog now, too. Other than that, 6426084 likes to search for “gangbuses” and “gangboats”.
User 8268
We got a power searcher here. 8268 makes frequent use of the minus search operator, and is interested in anything from aerospace technology to Thai food, from the Windows Multimedia Knowledgecenter to the Alias season finale.
User 29665
29665 is one of the more innocent searchers, looking for Johnny Cash, the Middle East, and pictures of famous psychologists. 29665 also wants to know how to save the rainforest.
User 19655
It’s after midnight. 19655 is looking for “dirty jokes for Christians”. Later, 19655 clarifies; “clean dirty jokes” is what he’s after. Finally, 19655 decides to settle for “inspiring bible quotes”.
In another search, 19655 reveals a full name, including when and where that person went to University, and other names of that family (as well as their jobs).
User 1045042
1045042 is researching the relationship between Republicans and terrorism.
User 24868
The life of 24868 circles around pottery barns, HTML, MySpace (one of AOL users’ favorites), camping, limos, bedroom furniture and hair extension tools.
User 11829
11829 is also into dogs (bulldogs), though not quite as obsessed as 6426084 above. Who knows why 11829 looked up red roofs, palm trees, kibbutz houses and chicken houses… or “little Arabian boys.” Wait… March 7, several searches for “dog porn”. OK, maybe 11829 is obsessed with dogs. Searches for “submit pictures of dogs online” follow. (Is 11829 producing dog porn?) Other searches reveal what might be 11829’s home town. A couple of more regular searches, like “hairy chests”, “fake hairy chests” and “the theme from jurassic park” round up the day.
User 20320
We got a horse guy here. 20320’s searches circle around saddlery, horse racing, and jockeys. Other searches reveal 20320’s hometown, the age of 20320’s children, and the summer camp they’re going to. A search on May 18 is compiling facts on a “fast divorce”.
User 22542
AOL user 22542 is a classic case of confusing the search box with the browser address bar. Almost all searches are URLs, like www.bowwowinformation.com and www.barbie.com.
User 22817
User 22817 seems to look up every word in a dictionary. 22817’s quest starts on March 12, 5 PM:
what does acute mean
what does accompany mean
what does adrenaline mean
what does alternative mean
what does acute mean
what does ample mean
what does abundant mean
what does ambition mean
what does ambiguous mean
what does agony mean
what does achieve mean
what does apprehend mean
what does annoy mean
what does aggravate mean
22817 gives up after just two hours. A while later, 22817 searches for “summer activities”. Maybe there’s something more interesting to do?
User 28963
At 10:08 PM, 28963 looks for “porn sites”. 28963 quickly amends the search query to read “freee porn sites”. (Two days later, 28963 shows a sudden interest in genital warts.)
User 29076
Hip Hop fan 29076 likes AntiStudy.com. His searches include “disney chanal”, “emty lots”, “michael jordon timeline” and “goolge”.
User 1133
1133 is looking for “Google grass”. (What’s Google grass?)
User 2761
2761 wants to acquire a box of lobster tails. Might come in handy for the trip to Amsterdam…
Source:
The descriptive txt file accompanying the AOL archive
500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY.
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.
Brief description:
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields — namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next “page” or results for some query, this appears as a subsequent identical query with a later time stamp.
CAVEAT EMPTOR — SEXUALLY EXPLICIT DATA! Please be aware that these queries are not filtered to remove any content. Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material. There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE. This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms. If you are offended by sexually explicit language you should not read through this data. Also be aware that in some states it may be illegal to expose a minor to this data. Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.
Basic Collection Statistics
Dates:
01 March, 2006 - 31 May, 2006
Normalized queries:
36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for “next page” of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID’s
Please reference the following publication when using this collection:
G. Pass, A. Chowdhury, C. Torgeson, “A Picture of Search” The First International Conference on Scalable Information Systems, Hong Kong, June, 2006.
Copyright (2006) AOL
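To make the record layout described in the README above more concrete, here is a minimal Python sketch of how such lines might be read and split into the two event types (query only vs. click-through). Several things here are assumptions and not stated in the README excerpt: that fields are tab-separated, that a header row starting with `AnonID` may be present, and that query-only rows leave the last two fields empty or absent. The names `Event`, `parse_events` and `tally`, as well as the sample rows, are invented for illustration and are not taken from the AOL files.

```python
import csv
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class Event:
    anon_id: int
    query: str
    query_time: str                   # kept as raw text; the timestamp format is not specified above
    item_rank: Optional[int] = None   # only present for click-through events
    click_url: Optional[str] = None   # only present for click-through events

    @property
    def is_click(self) -> bool:
        return self.item_rank is not None


def parse_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one Event per data line (assumed tab-separated)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "AnonID":   # skip blank lines and an assumed header row
            continue
        row = row + [""] * (5 - len(row))   # pad query-only rows to five fields
        anon_id, query, query_time, rank, url = row[:5]
        yield Event(
            anon_id=int(anon_id),
            query=query,
            query_time=query_time,
            item_rank=int(rank) if rank else None,
            click_url=url or None,
        )


def tally(events: Iterable[Event]) -> Dict[str, int]:
    """Count click-through events versus query-only events."""
    counts = {"click_through": 0, "query_only": 0}
    for event in events:
        counts["click_through" if event.is_click else "query_only"] += 1
    return counts


# Made-up sample rows: one query without a click, then two clicks on the
# same query, which the README says appear as two separate lines.
SAMPLE = [
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL",
    "1\texample query\t2006-03-01 12:00:00",
    "1\texample query\t2006-03-01 12:00:05\t1\thttp://www.example.com",
    "1\texample query\t2006-03-01 12:00:09\t3\thttp://www.example.org",
]

events = list(parse_events(SAMPLE))
print(tally(events))
```

Run over the real files, a tally like this should line up with the README's own figures, where the 19,442,629 click-through events and the 16,946,938 queries without click-through add up to the 36,389,567 lines of data.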
Further reading:
- AOL has just put a large quantity of private data online… on Techcrunch
- AOL: “We made a big mistake” on Techcrunch
- The AOL saga continues: a first website is up, on Techcrunch
- Hello, AOL? on Potinblog
- AOL and privacy on Pointblog
- AOL releases its users’ private data on Easy-IT
- AOL apologizes on Easy-IT

AOL and privacy
Strangely, AOL’s more than unfortunate publication of files containing 20 million searches made over a 3-month period by more than 650,000 people on the AOL Search site is not at the top of the…
AOL has done the irreparable!
AOL has just done the irreparable by releasing its users’ search queries! The database, which contains 20 million queries from more than 650,000 members spread over three months, was released publicly in
Legal action launched following AOL’s data disclosure
A class action has been filed by the “victims” of AOL’s disclosure of private data. It seeks USD 1,000 in damages per user, plus a further 4,000 for users in California.
Thanks for this post, it is always interesting to read you. I was wondering, though, why this parenthetical: two days later, 28963 shows a sudden interest in genital warts?