At E3 2009 Microsoft showed a prototype video of ‘Project Natal’, their next-generation controller technology. It’s pretty cool:
This excited a lot of hobbyists (including myself!), as time-of-flight (TOF) cameras are unique in their ability to capture 3D images from a single sensor, removing the need for multiple sensors, complicated calibration and extra software to capture a 3D scene. The prospect of one being included with a videogame accessory meant that there would be an inexpensive option for people wanting to play with a 3D camera at home.
There is a nice debunking of the idea that the Kinect is a TOF camera here, where the blogger shows some videos that demonstrate the technique actually used: structured light. Structured light 3D scanning is not something introduced by the Kinect, but it’s definitely the most impressive implementation of the technique I’ve ever seen. I’ll explain a bit more about how structured light scanning works in the following paragraphs.
In conventional stereo vision, two cameras are placed in different locations both looking at the same scene. A point-matching algorithm is used to identify identical points in the images resulting from both cameras. The distance between the cameras and the location of the matched pixels in each image can be used to triangulate the depth of the object at each pixel location.
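The triangulation step can be sketched in a few lines. This is only an illustration, and it assumes the simplest case: two identical, rectified cameras mounted side by side, so that a matched point differs only in its horizontal pixel position. The function and parameter names are my own, not taken from any particular library.

```python
def depth_from_disparity(focal_px, baseline_m, x_left, x_right):
    """Depth in metres of a matched point seen by two rectified cameras.

    focal_px   -- focal length of both cameras, in pixels (assumed equal)
    baseline_m -- distance between the two camera centres, in metres
    x_left     -- horizontal pixel position of the point in the left image
    x_right    -- horizontal pixel position of the same point in the right image
    """
    disparity = x_left - x_right  # in pixels; larger for closer objects
    if disparity <= 0:
        raise ValueError("a matched point must have positive disparity")
    return focal_px * baseline_m / disparity


# A point that shifts 25 pixels between cameras placed 12.5 cm apart
print(depth_from_disparity(600.0, 0.125, 345.0, 320.0))
```

The hard part in practice is not this formula but finding reliable matches, which is exactly the step that structured light replaces.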
The basic idea of structured light scanning is that, instead of using two cameras and a point-matching (correspondence) algorithm, we use one camera and one projector. If you could project a unique colour onto every pixel column of the scene that the camera sees and then pick out those colours in the resulting image, you would have a virtual correspondence between two ‘cameras’ in different locations. If this is difficult to understand, there is an alternative technique (using a moving pattern) based on a similar concept that is pretty intuitive and can be seen in this video:
This is not how the Kinect works, but hopefully it paints a picture of how encoding information in the projected image can help us retrieve depth information from a single camera.
It is possible to encode all the information necessary to reconstruct the scene in just three images, so that we don’t have to use a moving image.
If you want to try this out for yourself (with your own webcam and projector), there is a great Processing implementation by Kyle McDonald here.
This implementation is based on a three-phase technique developed by Song Zhang for his PhD thesis. If you’re interested in developing your own implementation, the paper you want to look at is probably S. Zhang, “Recent progresses on real-time 3-D shape measurement using digital fringe projection techniques”.
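To give a flavour of what a three-phase method computes, here is a minimal sketch of the standard three-step phase-shifting formula for a single pixel. It assumes three sinusoidal patterns shifted by -120, 0 and +120 degrees; the function names and the simulated pixel are my own, and the Processing implementation mentioned above may be organised quite differently.

```python
import math


def wrapped_phase(i1, i2, i3):
    """Wrapped phase (radians, in (-pi, pi]) from three intensity samples.

    i1, i2, i3 are the brightness of one camera pixel under patterns
    shifted by -120, 0 and +120 degrees respectively.
    """
    return math.atan2(math.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)


def simulate_pixel(phase, ambient=0.5, modulation=0.4):
    """Hypothetical helper: the three intensities a pixel would record."""
    shift = 2.0 * math.pi / 3.0
    return tuple(ambient + modulation * math.cos(phase + k * shift)
                 for k in (-1, 0, 1))


# The ambient light and pattern contrast cancel out, leaving only the phase
i1, i2, i3 = simulate_pixel(1.0)
print(wrapped_phase(i1, i2, i3))
```

The phase tells you which part of the projected stripe landed on that pixel, which plays the role of the 'unique colour per column' described earlier. A full scanner still has to unwrap the phase (it repeats every stripe) and convert it to depth, neither of which is shown here.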
The Kinect proves that real-time structured light scanners can be made compact and performant, and gives TOF cameras a run for their money. TOF or not, the Kinect is still the most inexpensive 3D camera available, and it does structured light so well that it doesn’t really matter.
The Kinect has several advantages over a home-made 3D scanner:
- It projects in the infrared spectrum so the pattern is invisible to the human eye
- It has the projector and camera in one unit, in fixed positions for easy calibration
- It generates a depth map in real-time
Almost as soon as the Kinect was released, an effort to develop drivers to repurpose the device was underway. People from all over the world collaborated, and within a few days there were already functioning open source drivers. The awesome OpenKinect people have developed the libfreenect library, which has become the standard library for Kinect hacking. There are drivers available for the three major operating systems, along with wrappers for several programming languages.
One look at Engadget is proof enough that the Kinect Hacks community is prolific and vibrant. So many creative and talented people have made use of the amazing potential of the device in different ways.
One of the reasons a 3D camera is nice to use as an input device is that it makes it easier to tackle a tricky problem in computer vision: background subtraction (i.e. telling which pixels are in the foreground and which are in the background). With that problem solved, things like hand tracking and pose estimation become easier.
If you’re interested in extracting real-world coordinates from the Kinect, you probably want to look at Matthew Fisher’s site, where he posts example code for the transform, including some empirical calibration factors. Depending on which wrapper you use, those transforms may already be applied for you when you use the wrapper API to get the depth information from the device.
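For a rough idea of what such a transform looks like, here is a sketch of a plain pinhole-camera back-projection. It assumes the depth has already been converted to metres, and the intrinsic values in the example are placeholders I made up for illustration. They are not the calibration factors from Matthew Fisher's site, so use his numbers (or your own calibration) for real work.

```python
def depth_pixel_to_world(u, v, depth_m, fx, fy, cx, cy):
    """Back-project depth pixel (u, v) to camera-space (x, y, z) in metres.

    fx, fy -- focal lengths in pixels
    cx, cy -- principal point (optical centre) in pixels
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Placeholder intrinsics for a 640x480 depth image (NOT measured values)
FX, FY, CX, CY = 580.0, 580.0, 320.0, 240.0

# A pixel 58 columns right of centre, 2 m away
print(depth_pixel_to_world(378, 240, 2.0, FX, FY, CX, CY))
```

Note that the raw values coming off the device are not metres, so a separate raw-to-metric conversion is needed first. That step is where the empirical factors come in, and it is deliberately left out here.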
I bought a Kinect on release day but until recently I had only connected it to my computer a couple of times to see how the driver development was progressing. Yesterday I started hacking up a quick implementation of the classic ‘Pong’ game, but controlled by the hands of two players.
Here’s a video of an early prototype:
The video is very jerky, but that only happens when something is capturing the OpenGL window.
The source is available on GitHub here.
The ‘Pong’ game is set up as a single Box2D world with the ball, paddles and walls all as Box2D bodies. Body movement and collision detection are handled by the physics engine, with the bodies rendered using openFrameworks.
The goals are also Box2D bodies that aren’t rendered to screen, and on collision with the goals the score is incremented and a new game starts.
The controls take the depth image from the Kinect and threshold it between a near and far plane, leaving only items that are a configurable distance from the sensor appearing in the image.
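The thresholding step is simple enough to sketch with plain lists. This is an illustration rather than the code from the repository: the depth image is assumed to be a list of rows of depth readings, and the function name is my own.

```python
def threshold_depth(depth, near, far):
    """Return a binary mask: 1 where near <= depth <= far, else 0."""
    return [[1 if near <= d <= far else 0 for d in row] for row in depth]


# Tiny 2x3 'depth image'; only readings between the two planes survive
depth = [
    [400, 800, 1200],
    [0,   900, 2047],
]
for row in threshold_depth(depth, near=700, far=1000):
    print(row)
```

Everything closer than the near plane or further than the far plane drops out, which is what makes background subtraction so much easier with a depth camera than with a colour image.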
The depth image is then processed to find contours that are in a certain size range, and those contours are tested for their curvature to try to determine if they contain fingers. If they do, they are chosen as a ‘control blob’ (hand) and, depending on which half of the Kinect’s field of view they are in, assigned to a player’s paddle.
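Here is a simplified sketch of the size filter and the paddle assignment. It makes a few assumptions of its own: blobs arrive as (centroid_x, centroid_y, area) tuples from some contour finder, the curvature test for fingers is skipped entirely, and when several blobs land on one side the largest one wins. None of these names or choices are taken from the actual source.

```python
def assign_paddles(blobs, frame_width, min_area, max_area):
    """Pick one control blob per player from a list of candidate blobs.

    blobs -- list of (centroid_x, centroid_y, area) tuples
    Returns {"left": blob or None, "right": blob or None}.
    """
    players = {"left": None, "right": None}
    for blob in blobs:
        cx, _cy, area = blob
        if not (min_area <= area <= max_area):
            continue  # too small (noise) or too big (a body, not a hand)
        side = "left" if cx < frame_width / 2 else "right"
        # Assumption: keep the largest acceptable blob on each side
        if players[side] is None or area > players[side][2]:
            players[side] = blob
    return players


blobs = [(100, 200, 1500), (500, 220, 1800), (300, 50, 20), (120, 210, 2500)]
print(assign_paddles(blobs, frame_width=640, min_area=1000, max_area=5000))
```

The vertical position of each chosen blob can then drive the corresponding paddle.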
(All trademarks and registered trademarks are the property of their respective owners.)