Archive for November, 2009

TechStars: Just Do It

November 30th, 2009

TechStars opened up applications for its 2010 classes a couple of weeks back.  If you’re thinking about quitting your job to start a company, or you’re already working on a company but haven’t got all the kinks worked out yet, you should apply. (Full Disclosure: Baydin is a TechStars alum from Boston’s 2009 program.)

I’ll be putting up a few additional posts over the next couple of weeks (with a few more to follow in a couple of months) about my experience in TechStars, what to do while applying to the program and before it starts, and how teams getting started in the program can make the most of it.

For now, I want to shill a little bit about how valuable and helpful and exciting the program was.  There are three reasons a startup should seriously think about TechStars – all more valuable, in my mind, than the ramen-level funding and the investor access.

Mentorship

TechStars is mentorship-driven.  As soon as you arrive at the office, you’ll start connecting with 60+ mentors, most of whom have been in the trenches at least once, and understand how the sausage gets made.  They’re eager to help, excited to teach, and even more excited to learn. 

Both Boston and Boulder have amazing entrepreneurial ecosystems.  There’s a really strong culture of “giving back” to the entrepreneurial community after running a successful startup (or funding a few).  And the recent buildup of “valley envy” in Boston means that the community here has closed ranks and is serious about helping first-timers. 

That means you’ll get tons of help figuring out novel business strategies, tons of folks who will give you product feedback, and a lot of help avoiding common pitfalls, like not knowing who the CEO is or building your product in a vacuum until it’s finished, with no feedback from customers.

Eran Egozy, one of the founders of Harmonix (the folks who made Guitar Hero and Rock Band), told us our product was too slow and hooked us up with an ex-Microsofter at Harmonix who shared some tricks for optimizing it.  Warren Katz, founder of VT MAK, taught us about SBIR grants, a nondilutive grant program for commercial technology research run by NASA, the Department of Defense, and a few other agencies.  David Skok from Matrix Partners came in to teach us how to build a sales and marketing pipeline on the web – and what to measure to find out if it’s working.  And that’s just the beginning – Richard Dale met with us almost every week to keep us on the straight and narrow.  When nobody else would even think about installing our 3-week-old prototype crap software inside MS Outlook, Will Herman ran it every day.  He also helped talk me through some of the hardest decisions I’ve ever had to make.

When things are going smoothly and everything is working right, mentorship is a nice-to-have.  When the crap hits the fan (and it pretty much always hits the fan at least a few times for a tech startup) that mentorship is the difference between a company in the Deadpool and a company that is still fighting.  You want these folks on your side. 

Camaraderie

TechStars called the relationships that form between the companies during the summer “cooper-tition” in one of the handouts to the mentors.  For a kind of cute phrase, it’s amazing how accurately it describes the way we interacted during the summer. 

We all helped each other out and we were all there for each other pretty much anytime.  If you need some help figuring out where to find good interns, one of the companies will know.  If you are trying to figure out how to reach an elusive blogger, someone at one of the other companies will have been there.  If you need to know which version control system makes the most sense, someone from one of the companies will be able to get you started.  Because we were all in the same boat, we all understood, and we all had valuable information to share. 

And when things are going crappily (which they will), whether you just waxed an important blogger’s data, or you can’t get any customers, or one of your most promising potential investors just said no, there’s someone right there who’s been there too.

At the same time, when you’re seeing the LangoLab folks still working until 4 in the morning and the TempMine guys in the office before the crack of dawn, and one of the TechStars community interns crashing in one of the “conference rooms” after an all-night marketing blitz session… it inspires you to work harder too.

Plus ping pong, a medium-well stocked beer fridge, lots of free food, and a handful of pretty awesome gatherings. 

Validation

Paul Graham said in one of his essays that in many cases, the most valuable thing companies get from Y Combinator is the kick in the ribs to abandon a stable job with a salary and take a chance on really making the startup idea work. 

TechStars validating our idea and our team was critical to starting Baydin.  Without that kick in the ribs, it probably would have remained a weekend project and never really gotten off the ground.  It’s totally different working on a startup full time.  And you’ll find out, a lot faster, whether or not it’s the right product and the right company if you go full time. 

If you get into TechStars, or even become a finalist, you’ll get feedback on where the strengths and holes in your business idea fall.  You’ll know whether this is the right startup to take a chance on and commit to, or if you’re better off figuring something else out, or if you’re best off bootstrapping this idea on nights and weekends.

Either way, you’ll get that feedback just for applying.  Plus, the application process will force you to crystallize your idea and make it stronger. 

So go get started!

Uncategorized

VirtualBox Image for Enron Email and Twitter Data Analysis

November 23rd, 2009

As I mentioned during my talk at Defrag 2009, the best corpus of sample email data we have is the email data dump that the federal government released after Enron’s collapse.  The corpus includes over 400,000 real email messages from Enron employees, and it’s ripe for analysis.

The data is available on the web in a ton of different formats, but none of them lets you just pick it up and start analyzing, especially in Windows – the original data is actually posted with filenames ending in periods, so the files are completely invisible to the 93% of us who are in Windows.  It took me about 2 days of solid work to get something running before I could actually work on analyzing the data.
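
If you do grab the original data rather than the image, one way around the trailing-period problem is to rename the files on a system that can see them (Linux, or the VM itself).  Here is a minimal Python sketch.  The function name and the skip-on-collision behavior are illustrative assumptions, not part of any official Enron dataset tooling.

```python
import os


def strip_trailing_periods(root):
    """Rename every file under root whose name ends in '.'; return the renames."""
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            new_name = name.rstrip(".")
            # Skip names that are already fine or that consist only of periods.
            if new_name == name or not new_name:
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, new_name)
            # Never overwrite an existing file; leave collisions for manual review.
            if os.path.exists(dst):
                continue
            os.rename(src, dst)
            renamed.append((src, dst))
    return renamed
```

Calling strip_trailing_periods on the top-level directory of the extracted data returns a list of (old path, new path) pairs, so you can check what changed before copying anything over to Windows.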

I’ve created a VirtualBox image that includes the Enron data, the tools to gather a large sample of Twitter data, and some sample Python scripts that take almost all of the work out of accessing the email and Twitter data for analysis.  I think this is by far the easiest way to start analyzing the Enron email corpus, and a pretty darned easy way to get started collecting and analyzing Twitter data.

It should take you less than half an hour of hands-on time (plus a little time for zip files to extract) to go from nothing to running the sample scripts and generating histograms of message length.  Good luck, enjoy, and feel free to email me or leave a comment with questions.

Getting Started

  1. Download and install Sun’s VirtualBox.
  2. Download my VirtualBox Image file, which runs Xubuntu (a streamlined Ubuntu installation optimized for slower hardware – perfect for a virtual machine!).
  3. Extract the VirtualBox Image file into a directory you can remember.
  4. Run VirtualBox and create a new virtual machine.  Name it whatever you like; the OS is Linux and the version is Ubuntu.
  5. Set the memory to at least 512 MB (if you have 3+ GB, I recommend 1.5 GB so that you can load the entire enron messages table into memory).
  6. Leave Boot Hard Disk checked and choose Use an Existing Hard Disk. Click the folder icon next to the dropdown, click Add, and navigate to the EnronTwitter.vdi file extracted from the download link above.
  7. Highlight the EnronTwitter.vdi file and click Select.
  8. Click Finish.  Select the new VM and click Start.  Wait for the image to boot.

Everything you need to get started is in your home directory in the data folder.  Double click the Home icon on the desktop, and double click the data folder inside that directory.  To edit the files: Right-click and choose Open With Mousepad or use a text editor of your choice.

To get to a Terminal: double-click the Terminal icon on the Desktop, or click Applications at the top left, then Accessories, then Terminal.  You will need to type cd data to get into the directory with all the sample scripts.

Linux login info: enron/enr0n
MySQL login info: root/enr0n

Enron Data

There are two sample scripts in the directory – enron.py and enronrecpients.py.  The enron.py script generates a histogram of the message lengths of all of the emails in the corpus.  enronrecpients.py counts how many emails from the corpus are multi-recipient.

One caveat – these scripts load the entire database into memory before they run.  For that reason, enron.py is currently set up to run on only the first 200k messages.  If you chose to provide more than 1GB of memory, you should be OK to load the full set of messages, so just remove the LIMIT 200000 from the SQL command.
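
If memory is tight, another option is to pull rows in batches rather than all at once.  The sketch below is not the enron.py from the image: the messages table and body column are hypothetical names, and sqlite3 stands in for MySQL so the example runs anywhere.  With a MySQL driver the placeholder would typically be %s rather than ?, and some drivers buffer the whole result set on the client by default, so check whether yours needs an unbuffered cursor before counting on the memory savings.

```python
import sqlite3
from collections import Counter


def length_histogram(conn, bucket_size=500, limit=None, batch_size=1000):
    """Count message bodies per length bucket, reading rows in batches."""
    sql = "SELECT body FROM messages"  # hypothetical table and column names
    params = ()
    if limit is not None:
        sql += " LIMIT ?"  # sqlite3 placeholder; MySQL drivers usually use %s
        params = (limit,)
    cursor = conn.cursor()
    cursor.execute(sql, params)
    histogram = Counter()
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for (body,) in rows:
            # The key is the lower bound of the bucket: 0, 500, 1000, ...
            histogram[(len(body or "") // bucket_size) * bucket_size] += 1
    return histogram


# Tiny in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (body TEXT)")
conn.executemany(
    "INSERT INTO messages (body) VALUES (?)",
    [("a" * 10,), ("b" * 499,), ("c" * 500,), ("d" * 1200,)],
)
demo = length_histogram(conn, bucket_size=500)
print(sorted(demo.items()))
```

Passing limit=200000 mirrors the cap described above, and leaving it as None reads every row.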

To run the script, open a terminal, cd into the data directory, then type
python enron.py

You can modify these scripts to analyze additional message data.  The comments describe what does what and explain how to figure out what else is inside the Enron data using the MySQL client.
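
As an illustration of the kind of analysis these scripts do, here is a rough sketch of counting multi-recipient messages.  It is not the code from enronrecpients.py: the dict shape with to, cc and bcc strings is a hypothetical stand-in for whatever the database actually returns, and treating addresses case-insensitively is an assumption.

```python
from email.utils import getaddresses


def count_multi_recipient(messages):
    """Count messages addressed to more than one distinct recipient.

    Each message is a dict with optional 'to', 'cc' and 'bcc' header strings
    (a hypothetical shape, not the schema used in the image).
    """
    count = 0
    for msg in messages:
        # Only pass non-empty headers to the parser.
        headers = [msg[field] for field in ("to", "cc", "bcc") if msg.get(field)]
        addresses = {addr.lower() for _name, addr in getaddresses(headers) if addr}
        if len(addresses) > 1:
            count += 1
    return count


sample = [
    {"to": "alice@example.com"},
    {"to": "alice@example.com, bob@example.com"},
    {"to": "alice@example.com", "cc": "Carol <carol@example.com>"},
    {"to": "alice@example.com", "cc": "ALICE@example.com"},
]
print(count_multi_recipient(sample))
```

Using a set means a message that lists the same address twice (say, in both To and Cc) still counts as single-recipient; drop the set if you would rather count raw header entries.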

Twitter Data

The relevant Python data analysis script is twitter.py and the relevant data collection script is datacollector.php.  The twitter script uses simplejson to access the fields in the Twitter JSON stream and counts the number of multi-reply (@ to multiple people) as well as multi-retweet (multiple RT in one tweet) messages in the sample data.

To run the Python analysis script, open a terminal, cd into the data directory, and type:
python twitter.py
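
For a feel of what that counting involves, here is a minimal sketch.  It is not the twitter.py from the image: it uses the standard library’s json module instead of simplejson, assumes one JSON object per line with a text field, and the two regular expressions are just one plausible definition of multi-reply and multi-retweet.

```python
import json
import re

MENTION = re.compile(r"(?<!\w)@\w+")  # @name not preceded by a word character
RETWEET = re.compile(r"\bRT\b")       # standalone "RT" token


def count_multi(lines):
    """Return (multi_reply, multi_retweet) counts for JSON-encoded tweets."""
    multi_reply = multi_retweet = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip truncated or non-JSON lines
        text = tweet.get("text") if isinstance(tweet, dict) else None
        if not text:
            continue  # e.g. delete notices carry no text
        if len(MENTION.findall(text)) > 1:
            multi_reply += 1
        if len(RETWEET.findall(text)) > 1:
            multi_retweet += 1
    return multi_reply, multi_retweet


sample = [
    json.dumps({"text": "@alice @bob lunch?"}),
    json.dumps({"text": "RT @carol: RT @dave: big news"}),
    json.dumps({"text": "just one @erin here"}),
    json.dumps({"delete": {"status": {"id": 1}}}),
    "not json",
]
print(count_multi(sample))
```

Because count_multi takes any iterable of lines, you can hand it an open file object directly.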

There’s only a tiny amount of Twitter data in the image as-is.  You’ll need to run the datacollector.php script to pull data from a Twitter streaming API called the “Gardenhose” – a medium-volume feed that provides a pretty good way to get a bunch of data fast.  The script pulls from what is called the “spritzer” stream, which is just a random, undirected sample.  I got this script from this streaming API tutorial.  You’ll get about 25,000 Tweets per hour.

To run it, you will need to open datacollector.php and replace twitterusername with your Twitter account’s user name and twitterpassword with your Twitter password.

Then open a terminal, cd into the data directory and type:
php datacollector.php

After you’ve run the script long enough to get all the data you want, I recommend that you cat the files together into a single file so the Python script can digest it in one pass.  Do this by typing
cat 20*.txt > tweetcorpus.txt
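
If you’d rather not make a second copy of the data on disk, a small generator can walk the same files in order instead.  This is a sketch, not part of the image; iter_tweet_lines is a hypothetical helper name, and the 20*.txt pattern simply mirrors the cat command above.

```python
import glob
import os


def iter_tweet_lines(directory, pattern="20*.txt"):
    """Yield every line from the matching files, in filename order."""
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # errors="replace" keeps one badly encoded tweet from stopping the run.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line
```

Anything that accepts an iterable of lines can consume iter_tweet_lines("data") without the combined file ever being written.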

There’s a lot more information about customizing the stream coming out of the Twitter streaming API, including using search on the front end to restrict the stream, at Twitter’s Streaming API Documentation page.

Getting Data out of the Virtual Machine (into Windows)

VirtualBox helpfully provides the ability to share a folder between the guest OS (the Xubuntu image) and the host OS (whatever you’re running, in my case Windows).  To do that as of 11/23/09, click the Devices menu entry at the top of the VirtualBox window, and select Shared Folders.  Click the Add button on the right, click the dropdown under Folder Path and choose Other.  Select the folder you want to share, and give it a name (I shared my Desktop, so I just called it Desktop).  Click OK on both dialog boxes.

Now you need to mount the shared folder, so you can access it in the guest OS.  Open a Terminal and type the following (replacing Desktop with the name you chose for the folder you shared).  If mount complains that /media/windows-share does not exist, create it first with sudo mkdir -p /media/windows-share.

sudo mount.vboxsf Desktop /media/windows-share

Now, double-click the File System icon on the Desktop in the VirtualBox image, and double-click the media folder.  The shared folder you selected will appear there as windows-share, and you can exchange data with your computer’s regular file system using that folder.

Helpful Links

I already set up the VirtualBox image with all the scripts and data you should need to get started, but here are some links in case you need to replicate some of these steps or if you need to find the original Enron source data.

Why Email's Not Going Anywhere

November 16th, 2009

I presented the following slides at Defrag 2009 last week in Denver.

I wanted to take a quick look at the different use cases for email vs. microblogging and make some predictions about how the email experience will be changing over the next handful of years.  To do this, I performed some analysis on the Enron email corpus and a half-a-million-message Twitter corpus to illustrate the differences in the way people use these services.  I used that information and some trends about Twitter’s growth to make a few educated guesses about which Web 2.0 features we’ll watch make their way into email clients and servers.

There are a ton of useful links and papers on Slide 15.

Email is Here to Stay (Baydin Defrag 2009)

I’ll be posting my virtual machine image with all of the Enron data and the download scripts for the Twitter data I used shortly.

Enron/Twitter Data Coming Soon

November 11th, 2009

Defrag is *awesome* so far – so much to learn! I’ll be posting the virtual machine image with the Enron/Twitter data I used, plus information about how to process it, right here this weekend. Check back Monday, and if you’re at Defrag, say hi!
