
Posts Tagged ‘desktop search’

Knowledge is Half the Battle

July 9th, 2009

Shaving is not something I usually think about. Every so often, I would end up clogging the razor and eventually cutting my face. After running into my friend Holly, who is a brand manager at Gillette, I learned how to improve my shaving experience with minimal effort. Here’s how it works:

  • Run water over the back of the blades. That’s it.

I used to rinse the blade from the front (blades facing me) but it would eventually get clogged. For some reason, the Mach 3 rinses out much better from the back. I haven’t cut my face since — knock on wood.

This got me thinking about the little things that make our lives easier. Microsoft has a way to index shared drives on a network. This is useful because desktop search can then find files on shared drives. We surveyed several companies involved in everything from medical devices to architecture, and we discovered that close to 100% of them used shared network drives to store information. Being able to find documents that your teammates share would be a boon, so we posted some directions for Windows XP and Windows Vista on our FAQ. Note that this only works if you have Windows Desktop Search, which comes with Office 2007 or Windows Vista.

Small Talk

Desktop Search: What hasn’t changed

May 20th, 2009

A few posts back, we talked about why search on the desktop works a lot better than it did just a few years ago.  In this post, we’ll talk about how desktop search hasn’t kept up as the way we find and consume content on our computers has changed.

As recently as 2000, the deluge of emails, files, podcasts, blog posts and everything else that we have to keep track of was more like a drizzle.  The average hard drive held about 8 GB of data and we averaged about 7 non-junk emails per day.

As of 2009, those numbers look pretty different.  My laptop’s hard drive is a relatively tiny 160 GB; most computers come with at least 320 GB.  The way we work with email has changed too.  We now average 25 emails per day (almost a whopping 10,000 per year!) thanks to a lot of mailing lists and a lot of CCing.

Of course, we’re not suddenly 60 times more productive than we used to be.  Instead, we just get more of other people’s content.  Before Gmail made email quotas obsolete, CCing large files to everyone who might want a document wasn’t practical.  In 2000, blogs didn’t really exist, and the number of pages that interested each of us on the Internet was orders of magnitude smaller.

The problem only intensifies if we think about it from a corporate perspective.  How many gigabytes of data does your entire company have?  Where does it live?  At our former company, many groups had internal wikis, all of them had internal SharePoint sites (at least three, and as many as fifteen per group!), we had a document management library, and we had personal websites with documents attached.  Everyone cared more about getting the job done than about making it easy for other people to find what they created.

So there are now a lot more fragments of information in our brains and a lot more places that the rest of that information could be.  We spend a lot more time asking ourselves “where did I see that again?”  That translates into a lot of time and money. Bill Gates says that the average knowledge worker spends 11 hours a week looking for information, costing his/her company $18,000 per year in lost time.

The future looks like it is going to be even more chaotic – we will not only access more information in more places, but on more devices as well.  We will see some content on our computers, some on our $200 Netbooks, more on our iPhones or BlackBerries, and even more on our Kindles or Sony Readers.  And as we see more content on more devices, remembering where we saw the content we need NOW is going to get even harder.

A lot of productivity gurus are challenging us to “take charge of our Inboxes!” and implement a regimen that will help us manage the information.  But technology caused this problem.  Why isn’t it fixing it?

Fundamentally, the way we look for information hasn’t changed a lick since 2000.  Whether searching our computers or the Internet, we try to figure out what we want to find and we type it into a search box.  We get results that we hope are good enough – they often are.  When programmers have tried to improve on the search box, they’ve come up with some terrifying things.

I’ve attached a screenshot of the MIT Simile Seek project’s implementation of what is called faceted search below.  It’s a programmer’s dream.  I think I am wired to love driving tools like this.  It feels like piloting a starship.  If I know I want the 2nd top level domain to be .mit.edu because I know it came from someone at MIT, but I don’t know which lab, faceted search puts that power right at my fingertips.

[Screenshot: MIT Simile Seek faceted search]

But when I showed faceted search to anyone who doesn’t program computers for a living (like Electrical Engineers), they did not share my enthusiasm.  Other search improvements yielded similar gnashing of teeth.  The search box remains the search box.

So we’ve got a lot more content than we’ve ever had before, located in a lot more places than it’s ever been before, and we access it on more devices than we’ve ever used before.  And we still do pretty much the same things to find it that we did in 2000, when we had a lot less content, all on one hard drive, all on one computer.

So there’s a lot to fix.  And we’d love to fix all of it!  But for now, we’re trying to siphon off just one aspect of the problem where we think our technology can make a big difference.  In a few days, we’ll talk more about how we’re going to do it.

Technical

A Brief History of Search on the Desktop

May 6th, 2009

Desktop search has come a long way in the past few years.  In this post, we’ll explore how the technology behind all of the major desktop search options has changed based on web search innovations.  In the follow-up posts, we’ll talk a little bit about how desktop search is different from web search and how it has both succeeded and failed at making interacting with our computers better.  We’ll share a few tricks for getting more out of Desktop Search and a few things we wish it could do.  We’ll also share a little bit about how Baydin plans to fill in the gaps.

There are two major advantages to a modern desktop search experience: the first is that searching for a document is a lot faster than it used to be, and the second is that in virtually all file types, the text inside the document is searchable, instead of just the filename.

Think back to the file search in Windows 95.  It was pretty terrible.  All it could do was search for filenames, and it took the better part of eternity to find anything.  Here’s why: when someone searched for a word, Windows opened the file system and looked at every single file it had.  It compared the search query with the filename for each file, and as it found matches, it added the files to the results listing.  Every time a new search started, Windows had to look at every single file, which is why the results trickled in over a period of a few minutes.  If the search term were somewhere in a document or in an email rather than in the filename of a physical file, we were pretty much out of luck.

[Screenshot: Windows 95 file search]
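The filename-only scan described above can be sketched in a few lines of Python. This is only an illustration of the approach, not how Windows 95 was actually written: the function name `scan_filenames` and the sample files are made up for the example.

```python
import os
import tempfile

def scan_filenames(root, query):
    """Walk every file under root and yield paths whose filename contains query.

    Nothing is remembered between searches, so every call repeats the full walk.
    """
    query = query.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only the filename is compared; the file is never opened.
            if query in name.lower():
                yield os.path.join(dirpath, name)

# Hypothetical sample files, created in a throwaway folder.
with tempfile.TemporaryDirectory() as root:
    for name in ("chicken florentine.doc", "cashflow.xls", "notes.txt"):
        with open(os.path.join(root, name), "w") as f:
            f.write("a note that mentions chicken")
    matches = [os.path.basename(p) for p in scan_filenames(root, "chicken")]
    print(matches)
```

All three sample files mention "chicken" in their text, but only the file with "chicken" in its name is found, which is exactly the limitation described above.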

Searching the full text of documents was beyond the pale.  To do that, Windows would need to open every single file as it came across it and extract the text.  It would have been slower than slow, it would have required every piece of software that saved any kind of document to provide hooks for Windows to extract the text, and it probably would have made the computer rottenly unstable.

Searching through email in Office (up until 2003) used the same method, but since every email had a known structure, Outlook could search through the full text of messages.  When a user started searching for something, Outlook opened the most recent email and compared the search terms against each word in that email.  If there was a match, it would add the email to the result list in real-time.  When it finished with the most recent email, it would move on to the next, then to the next, then to the next.  Searching through email was a slow process, but it would eventually yield results where the terms were found only in the text of emails.
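As a rough sketch of that newest-first scan (the `Email` class, its fields, and the sample mailbox are invented for illustration and say nothing about how Outlook stores mail):

```python
from dataclasses import dataclass

@dataclass
class Email:
    received: int  # hypothetical timestamp; larger means more recent
    subject: str
    body: str

def scan_mailbox(emails, query):
    """Check each email in turn, newest first, yielding matches as they are found."""
    query = query.lower()
    for email in sorted(emails, key=lambda e: e.received, reverse=True):
        # Compare the search term against each word in this one email.
        words = (email.subject + " " + email.body).lower().split()
        if query in words:
            yield email

mailbox = [
    Email(1, "Lunch", "chicken florentine on friday"),
    Email(2, "Budget", "cashflow numbers attached"),
    Email(3, "Re: Lunch", "Chicken sounds great"),
]
hits = [e.subject for e in scan_mailbox(mailbox, "chicken")]
print(hits)
```

Because the function is a generator, results can be shown as soon as each match turns up, but the total work still grows with the size of the mailbox on every single search.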

A real innovation happened, though, when software developers realized that the same technology that powers web search engines could be applied to the desktop.

When someone clicks the search button on a web search engine, the search engine responds in a totally different way from Windows 95-style search.  Google does not crawl every page on the web, word for word, comparing the search terms for a match.  Instead, Google just looks in a previously-generated database where it has already prepared a list of all the web pages that contain the search term (along with a bunch of other information that helps it order the results).

Instead of sifting through every word ever written on the Internet in real time, Google crawls each page on the web only every few hours, days, or weeks depending on how important a site is and how frequently its content changes.  When Google crawls a site, its crawler looks through every page, processes every term, and updates the database. 

Very crudely, that index looks like this:

Term      Results
baydin    http://www.baydin.com
          http://burmadigest.info/2008/03/20/set-ka-lay-baydin-burmese
          http://www.baydin.com/blog
          etc.
chicken   http://en.wikipedia.org/wiki/Chicken
          http://allrecipes.com/Recipes/Chicken
          etc.
outlook   http://www.microsoft.com/outlook
          http://en.wikipedia.org/wiki/Microsoft_Outlook
          etc.

All Google has to do when you search for “chicken” is find that entry in the index and list the results.
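A structure like this is usually called an inverted index. Here is a minimal sketch of building one and looking up a term. The URLs come from the table above, but the page text, the `tokenize` rule, and the function names are assumptions made up for the example, and none of this reflects Google's actual implementation.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase terms (a deliberately crude rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each term to the list of URLs whose text contains it."""
    index = defaultdict(list)
    for url, text in pages.items():
        for term in set(tokenize(text)):
            index[term].append(url)
    return index

def search(index, term):
    """A search is one dictionary lookup; no page is re-read."""
    return index.get(term.lower(), [])

# Made-up page text standing in for crawled content.
pages = {
    "http://www.baydin.com": "Baydin makes Outlook better",
    "http://en.wikipedia.org/wiki/Chicken": "The chicken is a domesticated bird",
    "http://allrecipes.com/Recipes/Chicken": "Chicken recipes for every night",
}
index = build_index(pages)
print(search(index, "chicken"))
```

All of the expensive work happens once in `build_index`; each search afterwards costs the same no matter how many pages were crawled.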

Of course, that’s a sweeping simplification: it doesn’t address multiple-term searches, result order, or the fact that the index is HUGE and difficult to maintain.  There are dozens of fantastic papers from Google engineers that explain a lot of the details; try http://labs.google.com/papers for a listing, or start here for an overview from when Sergey and Larry were still at Stanford.  But for the purposes of this post, that’s all we need to worry about.

Creating and maintaining a mapping from search terms to web pages is the critical innovation for desktop search.  The idea extends quite well to our individual computers.  Instead of a mapping from terms to web pages, though, we need to make a mapping from terms to documents. So the problem is a little bit harder in that we have to be able to index a whale of a lot of document types instead of just HTML, but it is a lot easier in that the index size is nowhere near as large as the index for the web.  It can be generated relatively fast (probably under an hour for the average computer) and does not require a lot of space.

Google Desktop Search, Windows Desktop Search, and all the competitors do exactly this.  Their indexers run in the background, open every file on the computer, and create a database in the same format as the web index above:

Term      Results
baydin    C:\Alex\Documents\baydin_biz_plan.doc
          C:\Alex\Desktop\blog\post1.html
          C:\Alex\Documents\cashflow.xls
          etc.
chicken   C:\Alex\Documents\Recipes\chicken florentine.doc
          C:\Alex\Desktop\chicken.jpg
          etc.
outlook   C:\Program Files\Microsoft\Outlook.exe
          C:\Alex\Documents\problems with outlook.doc
          etc.

When I search on my computer for a word, like the web search engines, all my computer now has to do is look in that index and find the already-generated list of files that match my term.
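To make the idea concrete, here is a small sketch of a desktop indexer. It assumes every file is plain text, which sidesteps the hard part: a real indexer needs a separate text extractor for each document type. The names `extract_text` and `index_folder` and the sample files are hypothetical, and this is not how Google Desktop Search or Windows Desktop Search is implemented.

```python
import os
import re
import tempfile
from collections import defaultdict

def extract_text(path):
    """Stand-in extractor that only handles plain text files."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

def index_folder(root):
    """Open every file under root once and map each term to the files containing it."""
    index = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Index the filename together with the text inside the file.
            text = name + " " + extract_text(path)
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                index[term].add(path)
    return index

# Hypothetical documents, written to a throwaway folder.
with tempfile.TemporaryDirectory() as root:
    files = {
        "baydin_biz_plan.txt": "Baydin will improve Outlook search",
        "chicken florentine.txt": "chicken spinach cream",
        "cashflow.txt": "Baydin cashflow projections",
    }
    for name, body in files.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(body)
    index = index_folder(root)
    baydin_hits = sorted(os.path.basename(p) for p in index["baydin"])
    chicken_hits = sorted(os.path.basename(p) for p in index["chicken"])
    print(baydin_hits, chicken_hits)
```

Note that `cashflow.txt` turns up for "baydin" even though the word appears only inside the file, not in its name. A real indexer would also save the index to disk and update it as files change, rather than rebuilding it on every run.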

The key takeaway is that thanks to these indexes, searching through the full text of every file on a computer is now thousands of times faster than just searching the filenames used to be. 

Technical