AN INTUITIVE SENSORY SUBSTITUTION SYSTEM FOR THE VISUALLY IMPAIRED USING TACTILE FEEDBACK

By

RYAN CHILTON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014
© 2014 Ryan Chilton
To my Lord and Savior, Jesus Christ and to my wife, Kristin
ACKNOWLEDGMENTS

I would like to thank Dr. Carl Crane for providing me with the opportunity to join the Center for Intelligent Machines and Robotics and for all the other great experiences that came out of that. I have learned so much from the various projects that the team has worked on over the years and have also been given a lot of independence from Dr. Crane in choosing projects for my personal research, which I am grateful for. I am also thankful to my committee members, Dr. Antonio Arroyo, Dr. Douglas Dankel, Dr. John Schueller, and Dr. Gloria Wiens for their help and guidance.

I would also like to thank my wonderful wife who has been such a help and support through the whole process. She has kept me company while working on research late at night, helped me think through problems, helped with testing, given me motivation to keep going when I needed encouragement, and helped me relax when I needed to take a break.

I have had a wonderful group of peers through the years at CIMAR. Jae was always eager to help me get caught up to speed when I was the new guy. With Drew, Jonathon, and Nick, there was never a lack of good conversation and practical jokes. Vishesh could always do a good job explaining concepts and showing me how much more there is to learn. Shannon is the smartest designer I know and I can always go to him for questions. And Bob, Darsan, Moses, and Sujin have all been great, too. Thank you.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
    Background on Visual Impairment
    Current Methods for Sensing Surroundings and Navigating
        The White Cane
        Guide Dogs
        Echolocation
    Background on Sensory Substitution
    Background on Relevant Sensor Technologies
        Cameras
        Stereo Camera Systems
        Laser Ranging Sensors
        Sonars
        Structured Light Sensor Systems

2 MOTIVATION
    The Need to Improve How People with Blindness Sense Their Environment
    The Applicability of Robotic Sensor Systems
    Problem Statement

3 REVIEW OF LITERATURE
    Sensory Substitution Research
    Blind Navigation Aids
        Simple Systems
        More Advanced Systems
        Aids that use Stereo Vision
    Blind Navigation Aids Using Tactile Feedback to the Hands
    Stereo Vision

4 SYSTEM DESIGN AND IMPLEMENTATION
    Concept
    System Design
        Design Considerations
        System Functionality
        Performance Goals
    Prototype System Implementation
        Hardware
        Algorithms
            Stereo vision
            Ground removal
            Hand tracking
            Collision detection
            Software process outline

5 PERFORMANCE AND USER TESTING
    Performance
        Stereo Vision
        Hand Tracking
        Ground Removal
        System Functionality
    User Testing
        Test 1: Outdoor Walkway
        Test 2: Outdoor Breezeway
        Test 3: Indoor Hallway
        Test 4: Parking Lot and Wall Following
        Test 5: Indoor meeting hall
        Test 6: Outdoor Trail
        Test 7: Apartment Complex
        Summary

6 DISCUSSION AND CONCLUSIONS
    Discussion
    Future Work
    Conclusions

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table

3-1 Summary of reviewed navigation aids.
4-1 Relationship between disparity, camera separation, and minimum detection distance.
4-2 Listing of parameter values used in the stereo algorithm with a 480 by 360 pixel image size.
4-3 Plane parameters and their bin sizes for the Hough planes.
5-1 The range of position and orientation values the hand marker can have and still be detected by the head mounted camera.
LIST OF FIGURES

Figure

1-1 A person using a white cane to navigate without running into obstacles.
1-2 A white cane being used in a busy crosswalk.
1-3 A guide dog leading a person down a street.
1-4 A person using echolocation to judge his distance to the building.
1-5 Stereoscopy geometry.
1-6 Two corresponding features in a stereo image pair.
1-7 A sliding window used to estimate disparity.
1-8 A stereo image pair and the resulting disparity map. In the disparity map, lighter colors represent greater disparity and closer range.
2-1 The flow of data in a typical autonomous system's decision making process and in a human's decision making process.
2-2 The flow of data in a typical sensory substitution application.
3-1 Shapes used in the identification testing of an electro tactile tongue display.
3-2 The CyberGrasp force feedback glove.
3-3 An ultrasonic navigation aid.
3-4 MOWAT travel aid.
3-5 The hardware setup and a diagram of the operation of the navigation aid developed at FIU.
3-6 The Tactile Handle.
3-7 Illustration of disparity computation.
4-1 Sensor is positioned to capture the field of view in front of the user.
4-2 Sensor capturing 3D shape of vehicle.
4-3 The sensed model is scaled down to create a feedback field that the user can interact with.
4-4 A user stands in front of a table. The scaled down scene can be seen in blue to represent the virtual feedback field generated.
4-5 When the user's hands collide with part of an object in the feedback field, a tactile stimulus is created by the gloves and felt by the user.
4-6 An example of a marker which can be detected to judge position and orientation.
4-7 One of the vibration motors used in the feedback gloves.
4-8 An example of what the feedback field would look like as a user walks down a sidewalk.
4-9 The vibration feedback, illustrated as a spark, as a user stands in front of an outdoor scene and explores the feedback field with his hands.
4-10 Arrangement of the cameras used for sensing the environment and the hands. Their fields of view are also drawn.
4-11 Illustration of how the sensors' fields of view change to permit lateral exploration as the user turns his head.
4-12 An illustration of the scenario in which a short obstacle would be confused for the ground plane.
4-13 The vibration feedback in a confusing situation where a user's fingers intersect the ground plane and a short obstacle.
4-14 A more useful feedback state is provided when the ground plane is removed from the geometry which provides feedback.
4-15 The remaining feedback geometry once the ground plane is removed.
4-16 Detailed picture of fingers intersecting with a point cloud model of a doorway.
4-17 Blue Fox Cameras mounted in a horizontal stereo configuration.
4-18 The Sony Eye camera used for hand tracking.
4-19 The navigation aid worn by a user.
4-20 The hand feedback interface electronics which includes a microcontroller board and a set of opto isolators.
4-21 The connectivity diagram of the entire system. Signal lines are shown in blue and power lines are shown in red.
4-22 The gloves equipped with feedback motors on the fingertips.
4-23 Epipolar geometry. The blue triangle represents the epipolar plane and the red lines represent the epipolar lines on the image planes.
4-24 A chessboard tracking pattern used in the stereo calibration process.
4-25 A visual example of undistorting and aligning the two stereo images in order to create collinear epipolar lines (shown in dashed lines).
4-26 The sliding window used to find the disparity between a feature in one image and the same feature in the other image.
4-27 Schematic illustrating the geometry of the minimum detection distance for a stereo algorithm with a maximum detectable disparity.
4-28 Image of a brick column and the disparity output from the stereo algorithm.
4-29 Image of some barricades in a grassy area and the disparity map output from the stereo algorithm.
4-30 Image of bike racks and the disparity map output from the stereo algorithm.
4-31 A point cloud generated from stereo images on the left and the original scene on the right.
4-32 A point cloud of some boulders along a sidewalk on the left and the original scene on the right.
4-33 Diagram showing the parameterization of a plane used in the Hough plane detection algorithm.
4-34 The results of the ground plane identification process on a scene with boulders along the sidewalk.
4-35 The results of the ground plane identification process on a scene with bike racks on a walkway.
4-36 Process of masking the image from the hand tracking camera
4-37 The hand markers being detected with the hand tracking camera in an indoor environment.
4-38 The hand markers being detected with the hand tracking camera in an outdoor environment.
4-39 Diagram showing the angle between the hand tracking camera and the stereo cameras.
4-40 Diagram showing the artificial offsets added to the hand tracking camera in order to situate the hands at the proper level in the stereo camera's reference frame.
4-41 The measured pose of the hands being used to map the finger positions to the disparity map and test for collisions.
4-42 Software flow diagram including hardware interfaces.
5-1 A thin sign post and a thin tree being dilated by the stereo block matching algorithm.
5-2 Example disparity map generated of a grassy area and a box.
5-3 A disparity map of a scene with large variation in lighting.
5-4 An example of a failure mode of the stereo matching algorithm when the image contains tight, horizontally repeating pattern such as the slats of the fence.
5-5 An example of a failure mode of the stereo matching algorithm caused by shiny surfaces in the image such as the hood of the sports car pictured.
5-6 An example of a failure mode of the stereo matching algorithm caused by a surface with very little texture.
5-7 An example of a failure mode of the stereo matching algorithm caused by saturated portions of the image.
5-8 The error in the distance measured to the marker using the Aruco Marker detection code. The error bars indicate one standard deviation.
5-9 One of the hand tracking markers fails to be detected due to the shadow and bright sun light.
5-10 A cat participating in the research study is clearly distinguished from the ground plane in the disparity map.
5-11 Small objects being distinguished from the ground plane.
5-12 Small objects being distinguished from the ground plane. The smallest object of 13 cm is not discriminated from the ground and hence will not be included in the feedback field.
5-13 The low curb is somewhat distinguished from the ground plane, but not entirely.
5-14 An illustration of the data gleaned from haptic exploration of the feedback field.
5-15 An illustration of the data gleaned from haptic exploration of the feedback field.
5-16 An example of the data gleaned from haptic exploration over time
5-17 Test subjects in the outdoor setting used for Test 1.
5-18 An example scene from the Test 1 obstacle course
5-19 Example scene from Test 1 showing subject feeling an obstacle with the left hand.
5-20 A thin object being sensed by the test subject.
5-21 A user performing the distance comparison test.
5-22 A user detecting a moving person walking towards them.
5-23 A tester feeling a column while using the system to navigate through an outdoor area.
5-24 A tester feeling a picnic table while using the system to navigate an outdoor area.
5-25 A tester feeling the corner of a building in an outdoor breezeway.
5-26 A test subject using the prototype system to navigate an indoor obstacle course.
5-27 The system being used on an indoor hallway.
5-28 A tester feeling a wall with the system.
5-29 A user feeling a car with the system.
5-30 A user performing the wall following test.
5-31 A user feeling a wall of shrubs while trying to follow the wall.
5-32 A user feeling a fence beyond a shrub while trying to perform wall following.
5-33 A user feeling a fence while trying to perform wall following.
5-34 User navigating around chairs indoors.
5-35 Feeling chairs with the system during an indoor obstacle course.
5-36 Feeling chairs during an indoor obstacle course.
5-37 A user walking along a bridge with the aid of the system.
5-38 A user feeling brush to the left along a wooded trail.
5-39 A user detecting a tree along a trail.
5-40 A user feeling a trash can marker and raising a hand to feel its height.
5-41 A user feeling a low retaining wall, but not the traversable ground.
5-42 An individual who is blind using the system to navigate through bushes.
5-43 A test subject rounding a corner while feeling and avoiding a bush.
5-44 Two traffic barriers being detected by the stereo cameras.
5-45 An object being felt up and down to determine its height.
5-46 A small tree being felt with the outer fingers of the left hand.
6-1 Rendering of what the system could look like if it were commercialized.
LIST OF ABBREVIATIONS

CPU   Central Processing Unit
DSP   Digital Signal Processing
FPGA  Field Programmable Gate Array
GPU   Graphics Processing Unit
MEMS  Micro Electronic Mechanical System
RAM   Random Access Memory
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN INTUITIVE SENSORY SUBSTITUTION SYSTEM FOR THE VISUALLY IMPAIRED USING TACTILE FEEDBACK

By

Ryan Chilton

August 2014

Chair: Carl Crane
Major: Mechanical Engineering

Vision is the most descriptive sense humans use to understand their surroundings. It is extremely valuable for navigating unknown and changing environments, so when vision is impaired, whether partially or fully, people must use other means for getting around. The current set of available tools is still severely limiting in the amount and the quality of the information provided to people when trying to understand their surroundings. By pairing sensing techniques commonly used in the field of robotics with a novel way of communicating this information to people through the sense of touch, a better navigation aid is created.

The basic premise is to use a body-mounted 3D sensor to sense the geometry of the user's surroundings and scale it down to arm's length to create a virtual haptic feedback field that the user can explore by moving their hands around. The feedback is conveyed to the user through a pair of gloves that simulate contact with the scaled down virtual objects through the sense of touch. In order to create the sensation of a feedback field, it is necessary to know the location of the user's hands relative to their body.
The design and development of the prototype system is presented, as is the performance of the system when put through validation testing. The results show that the system is effective at conveying 3D information about the environment to users. Test subjects have been able to complete obstacle courses in various settings as well as complete other challenges such as determining the closest object or the tallest object using only the feedback provided through the gloves. The system has been proven to be both useful and intuitive and has great potential for becoming a commercialized product that would give better clarity to those without sight than is possible with any other device.
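The premise summarized in the abstract (sense the 3D geometry, scale it down to arm's reach, and trigger feedback when a hand meets the scaled geometry) can be illustrated with a minimal sketch. This is not the implementation presented in this dissertation: the function names, the scale factor of 0.2, the 3 cm contact radius, and the example points are all hypothetical, and a simple point-to-point distance check stands in for the collision test described in Chapter 4.

```python
from math import dist  # Euclidean distance, Python 3.8+

def scale_scene(points, scale):
    """Shrink sensed 3D points (meters, sensor frame) toward the sensor origin."""
    return [(x * scale, y * scale, z * scale) for x, y, z in points]

def touching(fingertip, field_points, radius):
    """Return True if any scaled point lies within `radius` of the fingertip."""
    return any(dist(fingertip, p) <= radius for p in field_points)

# Hypothetical scene: two points on an obstacle 3 m ahead, one point 6 m away.
scene = [(0.0, 0.0, 3.0), (0.5, 0.0, 3.0), (-2.0, 0.0, 6.0)]

# Assumed scale factor: 3 m in the world maps to 0.6 m, within arm's reach.
field = scale_scene(scene, scale=0.2)

# A fingertip at the scaled obstacle would trigger vibration; one elsewhere would not.
print(touching((0.0, 0.0, 0.6), field, radius=0.03))  # True
print(touching((0.0, 0.3, 0.3), field, radius=0.03))  # False
```

In a sketch like this, the contact radius trades off sensitivity against false contacts; a real system would also need the hand positions expressed in the same reference frame as the sensed geometry.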
CHAPTER 1
INTRODUCTION

Human sensing abilities are nothing short of extraordinary. When matched against state-of-the-art sensors and data interpretation algorithms, the human ability to perceive and quickly understand the environment is often the clear winner. So when a person's sensory abilities are impaired, the effects can be burdensome and the lack of information can seem an insurmountable challenge. And while man-made solutions will likely never be able to replace something as marvelous as the human eye, there have been technologies that can help individuals who live with visual impairment to overcome some of the disadvantages. Many of these sensing technologies have been heavily researched because of their application to the field of robotics, but there has also been much research done with the express intention of helping people who live with sensory impairments.

Background on Visual Impairment

For people living with visual impairment, the most important sense for navigating and understanding the world around them is nearly or totally cut off. The number of people who live with blindness in the US is 1.3 million according to the National Federation of the Blind. Blindness can be a congenital condition or something that occurs later in life. When eyesight cannot be relied upon, it can make certain things impossible, like driving or bicycling, and it can make other things very difficult, such as walking around unknown environments or using computer applications designed with graphical user interfaces.

Vision is the richest of the human senses. It is remarkable in both the amount of information the eyes take in and in the amount of brain power devoted to processing
and interpreting this information. The lens and iris work together to produce a focused image on the retina with proper brightness. The retina of the human eye detects electromagnetic radiation in the wavelength range of 400-700 nm and is capable of detecting color information and light intensity separately. It is unclear whether human vision is processed at a discrete "frame rate" or whether it is more akin to a continuous signal, but in either case, discrete changes can be detected up to around 30 Hz.

With two eyes (stereo vision), humans are able to perceive depth to an object by several methods. If both eyes are directed at an object, then the distance can be gauged by the amount of eye convergence, which is simply the angular difference between the direction of each eye. Distance can also be estimated by the amount of disparity that exists between the eyes, a measure of how much objects appear to be shifted between one eye and the other. Stereo vision enables depth to be gauged independently, but it also allows people to accurately compare distances to objects in order to determine which object is closer. Two eyes are not the only way of determining depth, however. A single eye can be used to determine distance because the brain infers the depth order from object occlusions and infers distance by comparing the apparent size of an object to its assumed size.

A subjective analysis of human eyesight reveals that sight can be used to very quickly understand the qualities of our environment. It can be used to observe the geometry of our surroundings. Vision can tell us where it is safe to walk and where obstacles are. The brain can quickly discriminate between moving and static parts of the environment. It can inform us of the texture and light scattering properties of objects from which we can make inferences about material properties. Eyesight even helps
humans balance to some degree. And eyesight has many non-utilitarian purposes too. impossible without vision. By simply collecting, measuring, and interpreting the intensity and frequency of light coming in to the eye, a wide range of critical information can be gathered.

From the previous cursory discussion of human eyesight, it is obvious that any impairment leading to partial or total vision loss will cause significant degradation to a person's quality of life. With this in mind, it is important to devote research efforts towards alleviating some of the negative effects of blindness and providing solutions that will in some way compensate for the lack of sight. To date, the blind community has been helped by an array of useful inventions, such as braille, text-to-speech software, text-to-speech book readers, electronic navigation aids, trained guide dogs, white canes, and more elaborate remote sensing devices tailored for obstacle avoidance.

One area where people without sight can be helped is in the task of navigation. While there has been significant research on this topic since the 1960s, people who are blind today still use very low-tech devices to find their way around and have not adopted newer high-tech solutions. This can be attributed to a general hesitancy to adopt new technology, but the more likely reason is the inadequacies of products on the market and the unavailability of more useful product ideas that are still only in the research stage.

Current Methods for Sensing Surroundings and Navigating

In order to navigate known or unknown environments, people without useful eyesight have come to rely upon several tools to accomplish the task. The following list
does not encompass all of the methods used, simply the most common. These methods are not able to provide high-level directional guidance to a person, but they are able to help an individual who can roughly navigate a pre-learned path to travel along it without running into an obstacle or deviating off the walkway into a stairwell. These tools include the white cane, guide dogs, and, very rarely, echolocation.

The White Cane

The white cane is a stick about 4 to 5 feet long that is held in the hand and can be tapped back and forth in front of the person to create a swath in which obstacles can be detected. The white cane is an excellent way to detect low obstacles, but it is not able to detect objects that protrude at torso height or head height. The white cane is also able to detect negative obstacles such as the drop-off of a curb or a stairwell. There are different ways to use the cane, but the side-to-side tapping technique is the most popular. The cane is very simple, requires very little training, and is not an expensive product. Figures 1-1 and 1-2 show examples of white canes in use.

Guide Dogs

Guide dogs are able to help direct individuals along walking routes while helping them to avoid obstacles, cars, and other pedestrians. They have the advantage of being another set of eyes that can discern more about the surroundings, such as approaching traffic or unexpected dangers such as wet patches of ground. Guide dogs still need to be directed in the general direction of travel, but while walking in that direction, they provide excellent guidance. Unlike canes, which are obstacle detectors, guide dogs are obstacle avoiders. They can be great companions and provide an added sense of security; yet, for all their benefits, they are more difficult to obtain than most people realize. The cost of all the training that guide dogs and users go through is estimated to
be between $25,000 and $40,000, and the time required to get a guide dog is around 1 to 2 years. Figure 1-3 shows a guide dog being used in service.

Echolocation

The use of echolocation to detect obstacles is an amazing feat learned by many individuals without sight. Using either ambient noise or a click made with their mouth, some people are able to detect where the sound bounced off a solid object and gauge its position, as shown in Figure 1-4. The echo can even be used to detect the size of an object, and this technique can be used both indoors and outside. This surprisingly adept method is only used by a small percentage of individuals who are blind, but it is still a method of navigating around obstacles that can be used with success.

Background on Sensory Substitution

One way of helping those with sensory impairments with day-to-day tasks is to provide them with the information they need artificially through alternative means. The concept of replacing or enhancing human senses (especially in the case of senses that have been impaired) is referred to as sensory substitution, while adding sensory abilities that humans do not naturally have is referred to as sensory augmentation.

Sensory substitution is the process of replacing information that would normally be captured in one sensory mode with information through another sensory mode. This could mean that stimuli coming in on one channel, say hearing, are replaced with stimuli on another channel, say vision. But this could also mean that information from one channel is passed through that same channel, but in a different mode. For example, if a person's sense of touch were lost on one side of the body, their sense of touch could be artificially replicated on the other side of their body by using a pressure sensor array on the
damaged half of their body which activated a tactile feedback array on an undamaged part of their body. But, most frequently, sensory substitution involves mapping sensory input from one channel to another.

If information that is normally perceived through the ears is instead conveyed through vision, this would be an example of sensory substitution. One could imagine wearing glasses which use a heads-up style display, with a graphical shape, such as a radar screen, projected into vision. If the radar graphic had visual indicators to denote where loud sounds were coming from, the user would be able to understand where sounds are originating from using vision alone.

Another example could involve substituting the sense of balance with the sense of touch. Suppose an inclinometer were attached to a person and the inclination of his or her body was conveyed through the sense of touch, using devices that could exert a controllable pressure on the user's shoulders. An inclination to the left could trigger a pressure on the left shoulder, and an inclination to the right could trigger a pressure on the right shoulder. This would enable the individual to sense their static balance through an entirely different sensory channel.

Both of the previous examples have limitations in the amount and the quality of the information that they convey. This is an unfortunate but unavoidable difficulty in the field of sensory substitution. When replacing one sense with another, the degradation in information quality can be caused by physiological limits of the senses, the inability of the person to correctly interpret the new information in its alternate representation, or simply inherent limitations in mapping (translating) the data from one form to another.
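The balance-to-touch example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the sign convention (negative roll means leaning left), and the 20-degree saturation angle are hypothetical and do not come from any particular device.

```python
def tilt_to_shoulder_pressure(roll_deg, max_tilt_deg=20.0):
    """Map body roll to (left, right) shoulder pressure, each in [0, 1].

    Assumed convention: negative roll = leaning left, positive = leaning right.
    max_tilt_deg is a hypothetical angle at which the pressure saturates.
    """
    if max_tilt_deg <= 0:
        raise ValueError("max_tilt_deg must be positive")
    # Pressure grows linearly with the size of the lean and saturates at 1.0.
    level = min(abs(roll_deg) / max_tilt_deg, 1.0)
    if roll_deg < 0:
        return (level, 0.0)   # leaning left: press on the left shoulder
    if roll_deg > 0:
        return (0.0, level)   # leaning right: press on the right shoulder
    return (0.0, 0.0)         # upright: no pressure
```

Even this toy mapping shows the information loss discussed in the text: a continuous inclination is reduced to one scalar per shoulder, and anything beyond the saturation angle feels the same.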
Consider first the effect of physiological limits. If sound waves are captured using a microphone and the raw signal is mapped directly to a vibration motor on the surface of the skin, a bandwidth limitation in the tactile sense will be encountered. While the ears can sense vibrations in the air up to 20 kHz, the tactile sense cannot distinguish between frequencies anywhere near that range. This is a hard-and-fast physiological barrier that cannot be surmounted using this type of substitution. While an alternative representation could be designed where different frequencies were mapped to different locations along the length of the arm (similar to a Fourier transform plot), this would bring up the second issue: the inability of the person to correctly interpret the information in its alternate representation.

Sometimes it is too difficult or simply impossible for humans to correctly interpret an alternative representation of information. To illustrate this point, consider the sound-to-touch mapping above, arranged such that each frequency component is mapped to a different location along the forearm and the signal strength at each frequency is mapped to the intensity of the vibration at the respective motor. While this mapping does convey more information, a person will likely never be able to interpret speech in this way. Another example of this type of limitation would be mapping sound information to vision by simply showing the user a screen that displayed a live spectrograph of auditory information. While much information is displayed, the user would likely not be able to interpret it in a meaningful enough way to understand speech.

Finally, some types of sensory information simply have inherent limitations when mapping from one channel to another. It would certainly seem infeasible to map vision
to taste. And because binocular vision gives depth information, it is not feasible to directly map the 3D depth we get from vision to the 2D surface of our skin. Certainly, some creative mapping schemes can be developed that attempt to overcome these limitations in sensory substitution, but the risk of creating confusing or convoluted data for the user is always present.

Sensory augmentation is another subject, in which information not normally sensed by people is made available to them. People cannot normally sense magnetic fields, but if a magnetometer were used to sense magnetic north and this direction were conveyed to a person through a belt which created a vibration in that direction, then that person would have an augmented sense of direction. Infrared goggles, used by pilots and military personnel, also augment the sense of sight by allowing people to see electromagnetic radiation at wavelengths they normally could not sense.

Background on Relevant Sensor Technologies

There are several sensing technologies which can provide data that in some way is similar to that of human vision. While most are inferior to human vision in the amount and quality of the information they provide, they still can be used to gather information that may be beneficial in the attempt to aid non-sighted individuals in day-to-day tasks through some form of sensory substitution. These technologies include cameras, stereo camera systems, laser ranging sensors, sonar sensors, and structured light sensors.

Cameras

Cameras have been around for many years, and the transition from analog to digital models has spurred their use in many different applications. Cameras are
conceptually very similar to the human eye in that they can collect visual information about light intensity and color in the same wavelength range as humans. They consist of a lens to focus the incoming light on an image plane and an imaging sensor to measure this light. Full-color cameras record the strength of light in the red, green, and blue wavelength ranges, whereas black-and-white cameras measure only light intensity in the range of the visual spectrum.

While cameras are very similar in nature to the human eye in the way they collect and measure light, the way this information is processed is a different story. Until recently, computer image processing was severely limited by the speed of computers, such that high-level information could not be quickly extracted from the raw images. Many image processing techniques were very simple in nature and only produced low-level information such as edge detection, motion detection, color tracking, and other low-level filters. But as computers became more powerful, the complexity of image processing algorithms increased, as did the usefulness of the output data. Real-time algorithms could perform classification, pattern recognition, visual odometry, and object tracking. The field of computer vision is a very active area of research today and will continue to be so as autonomous systems continue to grow in popularity and as computing hardware increases in power yet decreases in physical footprint.

Stereo Camera Systems

Computer stereo vision has been around since at least the 1970s. Using two cameras separated by some distance, a matching algorithm, knowledge of the camera properties, and some basic geometry, the depth to points in an image can be estimated. The basic goal is to determine the depth of any given point in the scene by finding the same point in each of the images and calculating the distance through
triangulation. Figure 1-5 illustrates a top-down view of the basic geometry involved in stereoscopy for a side-by-side camera setup. The star represents a point in the scene plane. Since the two cameras are separated by a distance T, the star appears in a different part of each image. The left camera detects the star in its image plane at a point defined by x_L, and the right camera detects the star on its image plane at a point defined by x_R. The difference between x_L and x_R is the disparity associated with that point in the image. If the disparity is known, it can be used with the other camera parameters to calculate the distance, Z, to the actual object.

For example, Figure 1-6 shows a pair of stereo images with a horizontal slice highlighted. In this slice, a green lamp pole (circled in red) is visible in both of the images, but its location in each image is different. One common way to estimate the disparity between the location in the left image and the location in the right image is by sliding a window of pixels from the left image over the right image and then finding the location that minimizes a difference measure. Figure 1-7 shows a window from the right image being slid over the left image. Looking at the graph, the minimum of the difference measure occurs at the disparity that best aligns the two windows. When this process of finding the disparity is performed at all locations in the image, a disparity map is created from which depth can be inferred, such as the one in Figure 1-8.

Stereo vision systems have the benefit of being relatively cheap for the amount of data that they provide, they are completely passive, and they can be packaged into a small footprint. These systems are very similar in nature to our eyes and, therefore,
offer a valuable option for augmenting natural senses if eyesight is impaired. There are some drawbacks that must be considered, though. The quality of the data is very dependent on how much texture objects have, so scenes that have a very flat texture or have poor lighting will not be captured well by these systems. Shadows can cause lighting variations that make processing difficult. Improper exposure will generally cause issues with solving the correspondence problem because recognizable features get washed out in overexposed areas of the image or lost in the noise of underexposed areas. In addition, the accuracy of the data is very dependent on the quality of the cameras and lenses as well as the calibration done to determine the internal camera properties and each camera's position and orientation. Finally, the computational requirements for processing the image data to extract depth information are not trivial and scale with image size.

Laser Ranging Sensors

Laser ranging sensors can provide very accurate distance information at very high rates. These sensors come in different versions that detect data in one, two, or three dimensions. These devices have been on the market long enough to gain good market acceptance and have been widely used on mobile robots and in industrial settings. They have the advantage of being able to operate both indoors and outdoors, as they are not strongly affected by sunlight. Additionally, since they are active sensors providing their own radiation pulse, they can operate in the absence of ambient light and work day or night.

Products which detect data in one dimension report the range of an object along a single line. Products capable of detecting data in two dimensions usually consist of a beam that is rotating around an axis perpendicular to the beam. The laser is pulsed so
as to create a 2D scan of the objects in the plane of the laser rotation. This creates a map similar to a radar sweep with a high degree of precision (~4 cm) and at relatively long ranges (~300 m). Products can be found with angular resolutions ranging from 1 degree to 0.25 degrees and with update rates from 10 to 50 Hz. Some models have a field of view that is limited to 180 degrees, while others have wider ranges up to 360 degrees.

Products are also available which detect data in three dimensions by scanning with multiple lasers around one axis or by quickly sweeping a single laser about two axes. These can create dense point clouds of 3D data at high refresh rates, making them very valuable for robot navigation in unknown environments. This class of devices may operate by rapidly varying the angle of a single laser beam about two different axes, or by rotating an array of multiple laser beams about a single axis.

Sonars

Sonars can detect ranges to objects by creating a pulse of ultrasonic sound and listening for the echo. By measuring the time of flight between the pulse and the perceived echo, the distance can be calculated based on the speed of sound. Sonars for robotic applications usually have ranges on the order of 10 to 20 feet. Since sound propagates as a wave, these devices have a cone within which objects will be detected. They cannot be used to find where in the cone of detection an object is, only the range to the object. Sonars can be found with a range of different cone angles, with some models having narrow cones at around 10 degrees and other models having wide cones at around 60 degrees. Sonars are very useful short-range distance sensors because of their small size, low power, and low price point. The limitation that they are by nature 1D sensors can be
overcome by creating arrays of multiple sensors aimed at different angles; however, interference will become an issue if the sensors are not synchronized to fire at separate times. This fact imposes a limitation on the number of sonars that can be placed in an array and on their overall update rate.

Structured Light Sensor Systems

Structured light sensors are the combination of a camera, a light pattern emitter, and a data processing module. Usually, the light pattern is composed of infrared light so that the camera is able to filter out visible light and capture only the infrared light that came from the emitter. To collect distance data, a known pattern of light is first projected onto a scene, and the camera is used to detect the pattern once it has been cast onto the objects and reflected back. The idea is that once the pattern is reflected back to the camera, the differences between the projected pattern and the detected pattern can be used to infer the geometry of the scene. This process is mainly one of triangulation.

For instance, consider the case where an infrared light emitter is placed 10 cm to the right of an infrared camera and both devices are aimed in the same direction. If the infrared emitter projected a single dot of infrared light directly ahead onto a wall, then the camera would detect the dot somewhere in the right half of its image, and the offset from the center of the camera image to the sensed dot on the wall would be enough to calculate the distance to the wall. The closer the wall was to the device, the further toward the right edge of the image the dot would appear. Now imagine an array of dots or some other known pattern. By doing the same calculation for each known element or feature in the light pattern, a fairly dense field of distance estimates can be calculated.

Unfortunately, one of the major drawbacks of these devices is that because most rely on infrared light, they
do not function very well outdoors or wherever sunlight is present, since so much of the sun's radiation lies in the infrared band. And while bright laser light could be used to overcome this drawback, it would pose an eye hazard for other people.

Figure 1-1. A person using a white cane to navigate without running into obstacles.
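The single-dot structured light example described earlier in this chapter reduces to a similar-triangles calculation. The sketch below assumes an ideal pinhole camera, an emitter beam parallel to the camera's optical axis, and a hypothetical focal length of 580 pixels; only the 10 cm baseline comes from the text.

```python
def wall_distance_from_dot(offset_px, baseline_m=0.10, focal_px=580.0):
    """Estimate the distance (m) to a wall from the image position of one projected dot.

    offset_px  : horizontal offset of the dot from the image center, in pixels
                 (positive toward the emitter's side of the image)
    baseline_m : emitter-to-camera separation; 0.10 m as in the text's example
    focal_px   : camera focal length in pixels (hypothetical value)

    Pinhole model: offset_px = focal_px * baseline_m / distance,
    so distance = focal_px * baseline_m / offset_px.
    """
    if offset_px <= 0:
        raise ValueError("dot must appear on the emitter's side of the image center")
    return focal_px * baseline_m / offset_px
```

Under these assumptions, a dot seen 29 pixels from the image center corresponds to a wall about 2 m away, and doubling the offset to 58 pixels halves the estimate to about 1 m, so nearer surfaces push the dot further from the image center.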
Figure 1-2. A white cane being used in a busy crosswalk.

Figure 1-3. A guide dog leading a person down a street.
Figure 1-4. A person using echolocation to judge his distance to the building.

Figure 1-5. Stereoscopy geometry.
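For the side-by-side geometry of Figure 1-5, the distance Z follows from similar triangles as Z = f * T / (x_L - x_R). The text does not state this formula, so the sketch below uses the standard pinhole relation and assumes two identical, parallel cameras with focal length f, with both image coordinates measured in pixels from each image center. The function and parameter names are illustrative.

```python
def depth_from_disparity(x_left, x_right, baseline, focal_length):
    """Depth Z of a scene point seen by two parallel, identical pinhole cameras.

    x_left, x_right : horizontal image coordinates of the point (pixels),
                      measured from the center of the left and right images
    baseline        : camera separation T (Z is returned in the same unit)
    focal_length    : focal length f in pixels (assumed equal for both cameras)
    """
    disparity = x_left - x_right
    if disparity <= 0:
        # Zero disparity means a point at infinity; negative means a bad match.
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length * baseline / disparity
```

Because Z is inversely proportional to disparity, a one-pixel matching error matters little for nearby objects but produces large depth errors at long range.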
Figure 1-6. Two corresponding features in a stereo image pair.

Figure 1-7. A sliding window used to estimate disparity.
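The sliding-window search of Figure 1-7 can be sketched on a single scanline. This is a minimal illustration that uses the sum of absolute differences (SAD) as the difference measure, which is one common choice and not necessarily the one used to produce the figure. It assumes rectified images, so a feature at column x in the left image appears at column x - d in the right image.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel windows."""
    return sum(abs(p - q) for p, q in zip(a, b))


def match_disparity(left_row, right_row, x, half_window, max_disparity):
    """Find the disparity of pixel x on the left scanline.

    Slides the left-image window over the right scanline and returns
    (best_disparity, costs), where costs[d] is the SAD at disparity d.
    """
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window does not fit inside the scanline")
    window = left_row[lo:hi]
    costs = []
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # the shifted window would leave the right image
        costs.append(sad(window, right_row[lo - d:hi - d]))
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# Toy scanlines: the bright feature sits 3 pixels further right in the left image.
right_row = [10, 10, 10, 50, 90, 50, 10, 10, 10, 10, 10, 10]
left_row = [10, 10, 10, 10, 10, 10, 50, 90, 50, 10, 10, 10]
best, costs = match_disparity(left_row, right_row, x=7, half_window=1, max_disparity=5)
# best == 3 and costs == [160, 120, 120, 0, 120, 120]
```

The cost list plays the role of the graph in the figure, with its minimum at the disparity that aligns the two windows. Repeating the search for every pixel yields a disparity map.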
Figure 1-8. A stereo image pair and the resulting disparity map. In the disparity map, lighter colors represent greater disparity and closer range.
CHAPTER 2
MOTIVATION

The Need to Improve How People with Blindness Sense Their Environment

The goal of this research is to provide people who are visually impaired with a better way of sensing their environment. The intent is to use sensing techniques prevalent in the field of mobile robotics to enable people who cannot rely on eyesight to tap into vision information using other means. Currently, the methods used by people with blindness to sense their environment when moving around are very simplistic. Simplicity is certainly not a drawback; indeed, it is often an advantage. But the information gathered by vision is so rich that much of it is lost when simplistic sensing methods are used.

Consider the importance of visual information. In the British Journal of Vision Impairment, Michael Tobin posits that the entirety of what is different between blind and sighted individuals is purely a matter of the information available to them. He implies that even differences which do not obviously stem from a lack of information (cognitive abilities, spatial visualization abilities, etc.) are caused by a lack of information during formative years. He summarizes as follows:

[...] and from commonly observed behavioral phenomena at various stages of the [...] the case that all delays and carriers experienced by blind people have as their causation the lack, the inadequacy, or the [...]

Perhaps giving individuals with blindness better tools for understanding their environment will help in more ways than we might expect.

There have been many attempts to provide aids that compensate somewhat for the lack of vision, but they have met with limited success or limited adoption. Vision is a
requisite to perform many basic tasks, so when vision is not possible, the effects are obviously enough to require extreme lifestyle changes. While the effects of living without vision will never be completely alleviated without giving or restoring sight completely, much can be done with modern technology to make living with such a condition easier. There have been some wonderful advancements that have made life easier, such as braille, electronic text readers, speech-to-text software on computing devices, and extreme magnification devices for those without total vision loss; but navigation around unknown or dynamic environments is still quite a challenge.

Vision is used constantly to understand new objects as they are observed. It is used to gain an understanding of where one is in one's environment: to understand where one is relative to a landmark, a building exit, an intersection, or a pedestrian hazard. So without vision, navigation is done completely differently. Other senses, such as sound and touch, can be used to build a mental map of the surroundings, but with much less detail and speed. And with a diminished mental map of the immediate surroundings, navigation becomes that much more difficult.

Safe and efficient pedestrian navigation is extremely valuable for independence, community interaction, and a sense of freedom. To this end, much of this research will be aimed at providing a better solution for safe pedestrian navigation in the absence (or near absence) of sight. Part of the motivation to perform research in this area also stems from advancements seen in the field of sensing for robotic applications, which will be presented in the next section.
The Applicability of Robotic Sensor Systems

Mobile robotic platforms that operate in unknown or dynamic environments must also overcome the challenge of understanding the surroundings in order to perform tasks such as obstacle avoidance, target tracking, path planning, and high-level decision making. Robots can only react to their environment if they are able to sense it in some way, and their reactions and decisions will only be as good as the information available to them. Many creative sensor technologies have enabled robots to sense their environment, and many creative algorithms have enabled them to further understand this sensor data through higher-level classification and abstraction schemes. It is important to note that data is easily collected, but this data can only become usable information upon further processing and interpretation.

The goal of providing meaningful information about the environment to robots is very similar to the goal of providing meaningful information about the environment to the visually impaired. Having information about the geometry of the environment is an important factor when considering how to navigate. There are other things which are important too, such as the ability to recognize and classify objects with known attributes or the ability to determine which parts of the ground are fit for load bearing. But for both applications, the geometric information about the environment gathered by manmade sensors can be very valuable. Robots can use this information to estimate where it is safe to travel, just as a human would use visual information to sense where it is safe to walk and where barriers, walls, or stairs may present a hazard.

In the realm of robotics, once sensor data is collected and processed, the challenge is in further classifying and abstracting the data to make useful decisions. In
the realm of sensory substitution, the challenge lies in representing this information in a way that is meaningful to the human using alternative sensory channels. For the reasons discussed in the previous chapter, this can be quite difficult.

Examine the normal flow of data in the decision-making process in both autonomous systems and in humans. In both cases, data is needed before knowledge can be formed and used in the decision-making process. A typical sequence for an autonomous system is data collection (sensing), processing, interpretation, representation, and decision making. This sequence is illustrated in the left half of Figure 2-1. In the data collection stage, raw data from the environment is sensed. This could be raw voltage levels from a transducer, a phase shift from a laser pulse, or a time differential between an ultrasonic chirp and its subsequent return. This data could also be gathered from an array of sensors, such as the grid of light-sensing elements in a camera or a range scanner with multiple lasers. The next stage, processing, turns the raw data into something more meaningful and might remove noise as well as convert raw signals to actual distance values or light intensity values. The interpretation stage is critical for usable data. This can be an operation as simple as determining whether or not obstacles exist in front of the robot or something as complex as segmenting the data into discrete objects, classifying them based on an algorithm that was trained beforehand, and creating predictive models for the dynamic objects in the area. Next, the data must be represented in a way that a decision-making algorithm can operate on. The data could be stored in a vector or rasterized format. It could be represented using parameterized models. Or it could be represented in a probabilistic manner where definite values are not stored,
but rather probability density functions are used to represent object states. Finally, decisions can be made using this data in the chosen representation format.

An analogous sequence of events can be seen in the way humans process data about their environment. This sequence is also illustrated in Figure 2-1 (on the right side), with the only difference being that the representation stage is instead called the understanding stage. The steps are not as clear-cut on the natural side. It is known that senses do capture inputs and transmit them as electrical signals, but the actual processing, interpretation, and knowledge stages are as mysterious to researchers as they are beyond the scope of this research.

Giving consideration to how artificial sensing can be used to replace natural sensing, one can see, at a system level, how these systems can be useful for helping people cope with sensory disabilities such as blindness. To create a sensory substitution system that can be used by a human, it must include artificial means to sense the environment, process the sensed data, and convert it to a new type of sensation that can be fed back through an alternate sensory channel. This arrangement can be seen in Figure 2-2, which shows a visual-to-tactile substitution. In the middle of the graphic, the path of visual data is shown marked off and in red to indicate a lack of information flow. In an attempt to compensate for this lack of visual information, an artificial vision system using cameras can be used to transmit useful data back to the individual through the sense of touch. By having the artificial sensing system process the visual data for the person and re-represent the data in a tactual way, the person can regain some understanding of the environment that normally would be sensed through eyesight. As can be seen, the artificial sensing flow chart in Figure 2-2 is very similar to
that in Figure 2-1 except that the final step is to output the data to the individual instead of making autonomous decisions.

Since the process of artificial sensing is applicable in concept to helping people with vision impairment, how exactly might a system like this function? There are many different ways that robots sense the world, and there are many different ways to convey information to humans. This results in a diverse array of possibilities for pairing artificial sensors with natural senses, and many of them have been explored by other researchers. However, the task of providing humans with meaningful and intuitive information through alternative sensing channels requires thoughtfulness and ingenuity. If the task is not given such care, the result can be so cumbersome that it becomes more of a hindrance than a help, and traditional guidance techniques such as the white cane retain a clear advantage.

Problem Statement

To limit the scope of this research from the broad field of sensory substitution and blind navigational aids, a more specific problem statement was formulated. The desire was to create a tool or system that would enable those who cannot rely on eyesight to gain an understanding of their surroundings with enough acuity to be able to navigate unknown environments. The goal was to employ tactile vision substitution, that is, to provide some of the information that would normally be sensed through vision through the sense of touch, in order to convey geometrical (and perhaps even higher-level) information about the scene to a user. Furthermore, the operation of such a system should be intuitive, not requiring extensive training to reinterpret cryptic feedback that has little conceptual relation to the actual information it attempts to represent. This requirement was presumed to be important to usability and user acceptance.
It is not intended for this tool to be a high-level navigational aid with the ability to provide global orientation and location information or to direct users along predefined routes. This task is important, and heavily explored, but not within the scope of this research. A system for high-level navigation would be very useful for the overall navigational process, but this research only attempted to address the issues related to perception of the immediate surroundings. The desire to provide people with a better understanding of their immediate environment is indeed only a solution to part of the problem, but it does appear to be the more difficult part. High-level navigation is within the scope of traditional path planning algorithms, and its two major challenges lie in 1) the method used to convey the desired path to the user and 2) indoor position estimation, which is difficult in the absence of GPS reception. Since high-level navigation and sensing of the surroundings are both important to being able to get around, a marriage between the two types of systems would be ideal, in that a user would know in what general direction to walk and be able to navigate around obstacles, other people, and hazards. That being said, the focus will be on the problem of low-level, close-range navigation.

To get a clearer picture of the intended system, the following list of requirements was selected:

1. Wearable. The system should be wearable and allow the user to move about freely.
2. Self-contained. The system should not rely on external power or computational processing.
3. Tactile feedback. The system should provide feedback about the environment to the user through the sense of touch.
4. Intuitive feedback. The system should provide feedback that is intuitive, so that most training goes into familiarizing users with the new form of feedback as opposed to learning how to correlate abstract feedback signals with reality.
5. Indoor/outdoor capability. The system should function indoors as well as outdoors.

Figure 2-1. The flow of data in a typical autonomous system's decision making process and in a human's decision making process.
Figure 2-2. The flow of data in a typical sensory substitution application.
CHAPTER 3
REVIEW OF LITERATURE

This chapter presents the research that has been done in the area of blind navigation aids. The literature ranges from some very early experiments done in the 1960s and 1970s up to the most recent research endeavors. Since the proposed subject covers areas of sensory substitution, navigational aids for the blind, and specifically tactile feedback, the survey of literature will be broken down into four categories: sensory substitution, blind navigation aids, blind navigation aids using tactile feedback, and finally a special section on stereo vision, which will prove to be a valuable tool for artificially sensing the environment.

Sensory Substitution Research

Two common devices seen in sensory substitution are vibrotactile elements, which use a motor to produce vibrations that can be sensed by the skin, and electrotactile elements, which produce a tactile sensation by applying a voltage and creating a current through the skin. The science of electrotactile and vibrotactile feedback displays was discussed by Kaczmarek et al. in an article that attempted to summarize their usefulness in sensory substitution. These feedback mechanisms have been used to transmit information by varying the stimulation frequency, varying the stimulation amplitude, or a combination of both. The tactile elements can be arranged in either 1D or 2D arrays on the skin, and these arrays usually require training before people can use them with any effectiveness.

One of the earlier experiments in sensory substitution was published in 1974 by Collins and Madley. Collins and Madley developed a tactile feedback system to help those who lacked sensation in their fingertips. The sensing device consisted of
strain gages mounted in a special glove to measure the force on each fingertip. Users were also fitted with five electrotactile elements on their foreheads. Each of the five sensors controlled the electrotactile stimulation intensity of one forehead electrode. During testing, users without sensation in the hands were able to use the gloves and the feedback on their foreheads to handle objects and distinguish between smooth and rough surfaces and between soft and hard objects. Despite the feedback system being very basic and the display having a low resolution (only five feedback elements), the users were surprisingly able to detect edges and corners by scanning the objects back and forth with their fingers. This is an important phenomenon, and it shows the power of exploration in understanding more complex information with limited feedback resolution.

Another early experiment was performed by Bach-y-Rita, Collins, et al. in 1969, in which they tested the efficacy of vision substitution by tactile image projection. It was an interesting attempt to convey visual information to non-sighted people that would lead to other similar implementations. The setup was cumbersome because of the technological limitations of the day, but one of the goals was to provide a practical aid for the blind. In the experiment, the output of a video camera was mapped to a 20 x 20 vibrotactile display in the backrest of a chair in which the user sat. (A vibrotactile element produces a tangible sensation by applying a vibration to the skin.) The intensity of the light at each location in the image controlled the strength of the corresponding vibrotactile element, so that the image could be felt by directly projecting it onto the feedback array on the back. The video camera could be aimed by the person as he or she sat in the chair, enabling them
to explore the scene by tilting and panning the camera. The testing was primarily in object recognition tasks, and identification times dropped significantly after 10-20 hours of training. After training, subjects were able to identify letters after 10 seconds of scanning them with the camera. Similarly, subjects could identify one of several objects from a pre-learned training set after 5-20 seconds of scanning. This research proved that visual information could, in a limited form, be translated to different sensory channels in a way that could be understood by people.

A much more recent adaptation of the previous experiment was published in 1998, again by Bach-y-Rita et al. In this research, a 49-point electrotactile device was designed to be placed on the tongue instead of on the back. The main focus was to study form perception of basic shapes using the 7x7 electrotactile array. The long-term goal was to build a minimally invasive system which wirelessly transmits data from a small camera to the tongue display. This was an attempt to take the tactile vision substitution system to a more practical level, as previous attempts to interface such systems with the abdomen, back, thigh, and fingertip had man-machine interface limitations such as bulkiness and higher voltage requirements. In testing, users were tasked with recognizing basic shapes like the ones shown in Figure 3-1. The shape recognition test showed a better recognition rate than when using a similar raised-dot display on the fingertips from their previous research; however, real-world data from cameras were not used.

Another application of a tactile array was in a device called the Optacon. Kaczmarek et al. note that systems with low resolution, such as the one developed by Collins and Madley, which only had five feedback electrodes on the forehead and five
force sensors on the hand, enabled users to gauge shape, smoothness, and edges very well (a feat which is somewhat unexpected with so few sensors). They believe that this is a result of spatial information received by manually scanning objects. While the fingertips are more sensitive than other parts of the body by an order of magnitude, they note a critical limiting phenomenon, which is that the threshold amplitude for vibrotactile stimulation increases after a strong conditioning stimulus. This means that the fingertips tend to become more numb to slight stimuli after receiving a stronger stimulus for a while. They reported that this effect went away after about 2 minutes of rest.

When using an array of tactile feedback elements, whether vibration motors or electrodes, the spatial resolution of tactile inputs can be measured with the Two Point Discriminatory Threshold (TPDT). The TPDT is the minimum separation required to distinguish between two simultaneous stimuli. This distance can be smaller if the two stimuli are activated alternately instead of simultaneously, but the TPDT still imposes a limit on the density or resolution of the array. Another disadvantageous phenomenon can occur when two or more different tactile stimuli are perceived as a stimulus from a single point. A similar effect observed in tactile feedback systems, known as masking, occurs when two impulses in close spatial or temporal proximity interfere with each other so as to mask the intended pattern and inhibit signal recognition. This effect limits the amount of information that can be effectively communicated through the skin. Masking can render the intended stimulus imperceptible. This is a key limitation in the information bandwidth that can be passed
through the tactile sense. While some people believe that this effect is due to physiological reasons, such as the nervous system's transmission of the sensation or the lingering stimulus sensation stored in the receptors, Loepelmann believes that the primary reason is actually temporal integration in our processing of the impulses.

The previous examples have all used tactile feedback as a form of sensory substitution, but this is just one of the possible options. An interesting device called vOICe is designed to convey images to people through sound [14]. Developed by Peter B. Meijer, the device takes a 64x64 pixel image with 4-bit grayscale depth (16 different shades) and creates a 1-to-1 mapping of images to sounds. This means that the tone sequence produced is unique to each image and, thus, it is possible to invert the function and generate an image from the audio. A computer would be able to reproduce the grayscale image exactly from the sound, but a human should also be able to make some sort of mental leap from the audible realm to the visual realm. The concept is intended for eventual use by people who are blind, and a software implementation has been developed for mobile computing devices.

For the vOICe project, the image-to-sound mapping is performed as follows. Consider a single column of pixels from an image. Each row is assigned a different tone so that frequency corresponds to the row number. The amplitude of the tone is determined by the brightness of the corresponding pixel. To convey a single column of the image, the tones for all rows are played simultaneously. To convey the entire image, the sounds generated by each column are played sequentially by scanning across the image from left to right, which takes 1-2 seconds. Then a click is played to indicate the start of a new frame and the process of playing back the image is repeated.
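The column-scanning scheme just described can be illustrated with a short sketch. This is not Meijer's implementation; it is a minimal example written under several assumptions that the description above does not fix: the frequency range (500 to 5000 Hz), linear spacing of the tones, the top row receiving the highest tone, the sample rate, and the per-column duration are all illustrative choices, and the function names are hypothetical. The frame-start click is omitted.

```python
import math


def row_frequencies(n_rows, f_low=500.0, f_high=5000.0):
    """Assign one tone per image row.

    Assumption: tones are linearly spaced and row 0 (top of the image)
    gets the highest frequency. The source only says that frequency
    corresponds to the row number.
    """
    if n_rows == 1:
        return [f_high]
    step = (f_high - f_low) / (n_rows - 1)
    return [f_high - r * step for r in range(n_rows)]


def column_to_samples(column, freqs, levels=16, sample_rate=8000,
                      duration=1.0 / 64):
    """Mix one sine tone per row, weighted by pixel brightness.

    column: pixel values in 0..levels-1 (4-bit grayscale gives 16 levels).
    Returns audio samples in the range [-1, 1].
    """
    n_samples = int(round(sample_rate * duration))
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        mixed = sum((pixel / (levels - 1)) * math.sin(2 * math.pi * f * t)
                    for pixel, f in zip(column, freqs))
        samples.append(mixed / len(column))  # keep the mix within [-1, 1]
    return samples


def image_to_audio(image, **kwargs):
    """Scan the image left to right, one column of tones at a time."""
    n_rows, n_cols = len(image), len(image[0])
    freqs = row_frequencies(n_rows)
    audio = []
    for c in range(n_cols):
        column = [image[r][c] for r in range(n_rows)]
        audio.extend(column_to_samples(column, freqs, **kwargs))
    return audio
```

With the assumed defaults, a 64-column image produces one second of audio, which is consistent with the 1-2 second scan time reported above. A black image yields silence, and a single bright pixel yields a short tone whose pitch encodes its row and whose onset time encodes its column.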
Meijer believes that simple shapes will be easily recognized. He posits that more complicated images will be understandable with conscious effort at first, but thinks that eventually the interpretation will become subconscious. Unfortunately, the ability of humans to interpret this information was not presented, and the efficacy of this idea is still under active research.

The next example is not purely sensory substitution; however, it is an excellent example of how haptic feedback can be used to present information to those who are visually impaired. In 2004, researchers at the Informatics and Telematics Institute in Greece developed a system to help train people to navigate by exploring a haptic virtual environment with their hands and a virtual cane. The goal was to create a training environment, but the concepts could be applied to sensory substitution if the system were modified to use actual data instead of a virtual environment. The users are fitted with a force feedback glove, shown in Figure 3-2, that can exert a pulling force on the fingers, and they are also equipped with tracking sensors on their arms. The system is able to track a person and their pose using the MotionStar wireless tracker system, which uses pulsed magnetic fields to track sensors located on the body. Using this pose information, a collision detection algorithm determines when one of the fingers is in contact with any of the virtual environment models. Once a finger comes into contact, a pulling force can be applied to the finger by the glove. (The entire hand could, however, be inserted through objects because there was no way to exert a force on the entire arm.) The idea was then extended so that the user could use a virtual cane to detect objects instead of the fingers. The virtual cane was programmed as an extension of the index finger. When
the virtual cane collided with the ground, force was applied to the index finger. When the cane hit an object to the left, force was applied to the ring finger, and when the cane hit an object to the right, force was applied to the thumb. Additionally, if the virtual cane were to penetrate an object, a constant buzzing was generated. In this way, users could explore the virtual scene while receiving force feedback when their virtual cane was touching one of the virtual objects.

For testing, the researchers created a virtual model of an intersection crosswalk. In their experimental street-crossing simulation, they expected users to be able to cross the street in 3 minutes, and the average time was about 2 minutes. The testing gave positive results for the overall concept of the system, but the system was limited to indoor laboratory testing because of the method of measuring the user's hands, and it could only use virtual models. Tracking position and orientation requires a fixed base station, and the sensing range from this base station is limited to a 7 m radius. The system is therefore not a practical navigation aid but rather a training tool to help people explore potentially dangerous scenarios in the safety of a laboratory.

Two valuable findings were reported at the end of the testing. One finding was that it is important to utilize both acoustic and haptic feedback, as both were found to be indispensable. The other finding supports the idea that haptic feedback that follows this intuitive and natural scheme can be useful for a broad range of people who are visually impaired.
Blind Navigation Aids

Blind navigation aids are designed to be practical devices that help people navigate through an unknown area without using eyesight. The National Research Council lists guidelines for electronic travel aids (ETAs), as they call them. These guidelines are as follows:

1. Detection of obstacles in the travel path from ground level to head height for the full body width.
2. Travel surface information including textures and discontinuities.
3. Detection of objects bordering the travel path for shorelining (wall following) and projection.
4. Distant object and cardinal direction information for projection of a straight line.
5. Landmark location and identification information.
6. Information enabling self-familiarization and mental mapping of an environment.
7. In addition: ergonomic, operate with minimal interference with natural sensory channels, single unit, reliable, user choice of auditory or tactile modalities, durable, easily repairable, robust, low power, and cosmetically accepted.

The previous list contains guidelines that provide situational awareness of the surroundings (1-3), high-level directional information (4 and 5), and some interface and hardware requirements (6 and 7). None of the devices found in the literature review, whether commercial products or experimental prototypes, met all of these guidelines. In fact, most of them only met one or two, but these requirements should be kept in mind when developing a new type of aid.

Simple Systems

One of the earliest research experiments to use remote sensing technology for the visually impaired was performed by Leslie Kay in 1964. Kay realized that ultrasonic sensors can be a valuable method of sensing the environment for the blind. After performing initial research, he reported that blind people are averse to having artificial sounds fed into their ears and that they do not take naturally to artificial aids. He
also states that a functional aid must be capable of detecting all forms of obstacle, including curbs and descending steps, with 100% accuracy. The aid developed by Kay, seen in Figure 3-3, consisted of a handheld sonar range sensor, an electronics box which could be clipped to the belt, and an in-ear speaker for auditory feedback. The system played a frequency which varied based on the distance to the object pointed at by the handheld sonar and could detect objects up to 20 feet away. The auditory feedback signal was a single frequency if only one sonar bounce was detected, but could contain multiple frequencies if multiple bounces were received.

Two popular navigation aids today are the Mowat sensor (Figure 3-4) and the Polaron. Karyl Moore, an orientation and mobility specialist, gives a non-empirical comparison of these two sensors. The Polaron and Mowat aids both use ultrasonic sensors to help visually impaired people detect obstacles. They use both auditory signals and tactile vibration to indicate the existence of and distance to the nearest obstacle in the cone of detection. The Mowat gradually increases the frequency of the vibration feedback as an object gets closer, whereas the Polaron has a stepped frequency scheme with three discrete frequencies denoting the distance as being close, near, or far. They both produce an audible pitch that scales with distance, either through a speaker or, in the case of the Mowat, optional headphones. The Mowat has a range of up to 8 m and the Polaron has a shorter reach of about 5 m. Moore notes that, due to the nature of sonar sensors, both can fail to detect angled obstacles such as the hood of a
car because the signal glances off instead of being reflected. These sensors require scanning with the device to interpret the surroundings and can even be used to locate dropped objects of sufficient size by scanning the ground and feeling the change in vibration intensity.

Since the task of helping people without sight navigate with minimal impedance is such a challenging problem, no one aid can solve the total problem. Some aids attempt to provide safety in more specific scenarios. An example of this is the Ultra Bodyguard. This small device is worn on a lanyard about the neck and alerts the user when a low-hanging obstacle like a tree branch is imminent. This helps to alleviate the danger of obstacles at the torso or head level which the white cane would not detect. The device uses a sonar sensor, and the warning can be vibrational or auditory.

More Advanced Systems

The previous examples were somewhat simplistic in their method of detecting objects and providing information to users. This is not necessarily a drawback, but there are usually complexities in a scene that will not be easily sensed or understood when using a simple handheld sensor. Much research has been done on more complex implementations in an attempt to provide more information to non-sighted individuals. These types of setups usually rely on arrays of sonar sensors or stereo vision systems in order to get a more detailed picture of the world around the user. In obtaining more information, however, the added challenge is always how to convey this information in the most usable and understandable way.

A clever system was developed by students at Florida International University that creates virtual directional sounds to inform users of obstacles. Using transfer
functions, a sound can be artificially manipulated to create the illusion that the origin of the sound is located at an arbitrary point in space. This technique is leveraged to convey obstacle location information to the user. The mobile system, shown in Figure 3-5, consists of headphones to convey the spatialized sound, six head-mounted sonars, and a pocket PC for data processing and synthesizing the spatialized sound. The system also contains a magnetometer, which enables directional North to be communicated by another synthesized sound. The object-sound mapping scheme plays a sound coming from the direction of the obstacle with an amplitude that scales with its proximity. This system was tested by having four blindfolded users navigate around the inside of a university building. Users were asked to navigate from point A to point B, and the efficiency of the path each user took was determined by comparing it to the ideal path length. While the ratio of actual path length to ideal path length was relatively good, the average walking speed was fairly slow (34 ft/min, or about 0.4 mph). The authors attribute this to possible causes such as a lack of user training or the small delay that precedes the auditory feedback which conveys the physical environment. A similar project was completed that used the same method of communicating obstacle locations, but used stereo cameras instead of sonars as the sensor.

Another device called Navbelt used 8 body-mounted sonars to detect obstacles at 8 different angles around a person. The object locations were communicated to the user by translating the data into a succession of quick beeps, similar to a radar sweep, in which the amplitude of each beep indicated the distance. The advantage of this system is that it is very minimalistic and unencumbering, yet does a good job of
conveying the gross geometry of the immediate surroundings. The disadvantage is that lower objects would not be detected, and the auditory interface requires constant attention to be able to mentally map the tone sequence to the 2D object map that it represents.

Other tools attempt to solve the problem of high-level navigation. One such navigational tool determines position and orientation from a GPS receiver, magnetometer, inertial measurement unit (IMU), and velocity sensor. Then, using a geographic database of preloaded information, users are directed along predefined routes using auditory signals. Audio signals for waypoint guidance can take several forms, but the one that yielded the best results indicated the direction to the waypoint by a synthesized stereo effect, giving the impression that the audio command was coming from the direction of the waypoint. One limitation of the system is that, to navigate effectively, a relatively fast communication rate to the user is required (about every 5 seconds), but this limits the amount of other information that can be passed through the auditory channel, such as obstacle avoidance cues. In testing, users were able to navigate to the waypoints at roughly 1.5 mph.

A similar system used physical markers located along a predefined route to guide users instead of predefined GPS waypoints. The team developed the system by employing GPS for outdoor positioning, a chest-mounted camera for visual marker detection, and an RFID reader on the end of the white cane to detect landmark RFID chips placed in the ground. The vision-based guidance is comprised of a detection algorithm which looks for white circles placed on the ground used to designate a
walking path. The angle to the nearest marker is calculated and discretized into one of five possible angles. This correction angle is then conveyed to the user with one of five corresponding vibration motors on the body to indicate the direction to the next waypoint.

Aids that use Stereo Vision

A more advanced use of stereo vision in a navigation aid for the visually impaired was proposed by Molton et al. They note that a few sonar-based mobility aids were developed very early but have not gained widespread adoption, probably because of their lack of detail, as they offer little more information than the low-tech white cane. They also note that a simple mapping from intensity information to electronic impulses on the skin, such as those used in the experiments conducted by Collins and Bach-y-Rita, results in information overload when used outside because of all the clutter in a typical outdoor scene. (Most of those experiments were with simple black and white shape detection.)

One of the main goals of using stereo vision in their research was to distinguish obstacles from the ground plane. To do this, the algorithm was designed to keep track of the ground plane between processing loops so as to distinguish objects that are on the ground from the ground itself. A model of the ground plane is maintained so that the disparity of detected objects can be compared to this model. If the disparity of any point is greater than that predicted for the ground plane, it is marked as an obstacle. Testing was performed by mounting the cameras to a person, but the maximum translation of the cameras was limited to 0.2 m and the maximum rotation of the cameras was limited to 5 degrees. The reported results focused mainly on the ability to track the ground plane and do not mention the ability to detect or
the accuracy of detecting actual obstacles. The method of communicating obstacle locations to the user was not discussed.

The same researchers published another paper which detailed a system with additional features such as sonars and rudimentary stair detection. Four sonar sensors were mounted around the person, with vibration motors mounted next to each sensor to indicate close objects. The stair detection algorithm did not detect the 3D geometry of stairs, but rather processed the 2D images looking for a pattern of horizontal lines. This method works only if enough of the stairs are visible and the contrast between steps is high. The overall goal of distinguishing obstacles from the ground plane is very useful, but the ability to do so was not adequately demonstrated, probably because of the complexities introduced by trying to predict the ground plane instead of directly sensing it from frame to frame.

Another stereo vision-based aid called NAVI was developed which gave feedback through sound. Wong et al. developed a navigational aid that uses head-mounted stereo cameras to detect obstacles and a computer to convey the information about the obstacle to the user. The stereo algorithm first segments the two images and then attempts to compute disparities between the two using a rule-based approach. The distance is conveyed by a verbal sound. Simple shapes were tested (circles, triangles, squares). They tried to segment the background from the foreground and then extract the outline of the foreground object. The distance to an object is discretized into one of four distinct ranges. These ranges are communicated to the user by the system playing audio of someone saying a word for that range, such as "low" or "high". The position of the blob is
conveyed using a tonal scheme where the vertical location in the image is mapped to tone amplitude and the horizontal location in the image is mapped to frequency. Again, no testing results were reported.

A system called SVETA used another variation of tonal feedback for collision avoidance. This system also employed head-mounted stereo cameras, with a slightly different auditory feedback scheme. In this system, the depth map provided by the stereo algorithm is conveyed to the users using different tones and the left/right audio channels. The vertical component of an object is mapped to the frequency of the tone, the horizontal component is mapped to the fade between the left and right audio channels, and the distance is mapped to the amplitude of the tone. Using this scheme, users were successfully tested in recognizing basic shapes, and a blind individual was shown to be able to navigate around obstacles in some capacity.

Another research project employed stereo cameras to determine object locations, which were then mapped to a vibration feedback belt. Stereo cameras mounted at the hips were used to create a 2D slice of the scene. Feedback was provided through 14 vibration motors mounted to a belt, each of which corresponded to a different detection angle. The ranges discovered by the stereo imaging slice controlled how strongly each motor vibrated (the closer the object, the greater the vibration). The stereo algorithm worked by detecting edges in both cameras, which it then used to compute disparity (depth) along a single horizontal slice of the image. However, since the granularity of the stereo algorithm was fairly coarse (it was not a dense stereo reconstruction), holes in the data occurred where defined edges were not present.
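The belt mapping described above (one motor per detection angle, stronger vibration for closer objects, and gaps wherever the stereo slice has no data) can be sketched as follows. This is an illustrative example only and not the cited system's code: the function name, the 5 m maximum range, the linear range-to-intensity law, and the choice of zero vibration for holes are all assumptions.

```python
def belt_intensities(ranges, n_motors=14, max_range_m=5.0):
    """Map a horizontal slice of range readings to per-motor vibration levels.

    ranges: readings in meters ordered left to right across the slice;
            None marks a hole where the stereo algorithm found no edge.
    Returns n_motors values in [0, 1]; 1.0 means an object at zero range.
    Assumptions: equal-width angular sectors, a linear intensity law, and
    a 5 m maximum range. None of these are specified in the source.
    """
    n = len(ranges)
    intensities = []
    for m in range(n_motors):
        lo = m * n // n_motors
        hi = max((m + 1) * n // n_motors, lo + 1)
        valid = [r for r in ranges[lo:hi] if r is not None]
        if not valid:
            intensities.append(0.0)  # hole in the data: no vibration
            continue
        nearest = min(valid)  # the closest object in the sector dominates
        clipped = min(max(nearest, 0.0), max_range_m)
        intensities.append(1.0 - clipped / max_range_m)
    return intensities
```

Taking the nearest valid reading in each sector is a conservative choice, since it errs on the side of warning about an obstacle. Note that a hole and a distant object both produce no vibration in this sketch, which mirrors the ambiguity that sparse stereo data introduces.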
A similar stereo vision-based aid called Tyros used the cameras to obtain depth information of the scene in front of the user. The depth information was then down-sampled to a 4 x 4 grid and, instead of relaying this via a vibration belt as in the previous project, the system used a 4 x 4 grid of vibration pads mounted to the stomach. Stronger vibrations were mapped to closer proximities. This system communicated more information than a single horizontal slice, but the low resolution still significantly limits the quality of the information.

Blind Navigation Aids Using Tactile Feedback to the Hands

Several devices have used tactile feedback to the hands as a way of communicating information about the surroundings. Passing feedback to the hands has the added benefit of utilizing a part of the body that is both familiar with touch and extra sensitive. Some devices use gloves to provide feedback, while others are completely handheld with the sensing and feedback components packaged together.

One research project attempted to convey obstacles detected by stereo vision to users through vibration motors attached to their hands. The researchers' framing of the problem was noteworthy: it is naïve to assume that the richness provided by the visual sense can be duplicated and represented in a tactile form, given that the tactile medium is already limited in the information it can convey compared to vision. This applies not only to tactile substitution, but to all forms of sensory substitution which attempt to compensate for a lack of vision. They go on to summarize the sensory substitution literature. They also note that although much research has attempted to map information to a 1D or 2D array of vibration elements, such arrays usually require training as the feedback is not natural. They go on to point out the shortcomings
of ETAs that use sonar or laser in that, among other issues, (1) the user must actively scan the environment to detect obstacles, which is time consuming and requires a conscious effort, and (2) the user must perform additional measurements to determine the dimensions and shape of obstacles.

In implementing their idea to convey obstacles through the sense of touch to the hands, the researchers chose to use mechanical vibrators built into a glove for feedback and chose to use cameras for sensing because of the wealth of information collected and the similarity to the way the eye senses the environment. Unfortunately, their method for conveying obstacles was quite rudimentary, using only three feedback motors. One motor indicated an obstacle to the left, another motor on a different part of the hand indicated an obstacle in front of the user, and another motor indicated an obstacle to the right. In this configuration, the entire 640 x 480 pixel image is reduced to just three outputs. The vibration magnitude was mapped to distance and the vibration frequency was mapped to certainty; the data was updated every 3-5 seconds. Their system was designed to help users avoid obstacles at a range of up to 10 meters, but was not meant to be a complete way-finding replacement, only an augmentation of current way-finding devices. They also state that conveying changes in terrain is, in general, not possible with the simple mapping described, and they discovered that objects with little texture would sometimes go undetected due to the requirements of the stereo vision processing.

This source did outline a useful way of testing the effectiveness of way-finding devices. They used an obstacle course with two objects and an auditory signal to broadcast the location of the end goal. They then compared the users' paths to one
generated from a gradient descent method, which they decided was the optimal route. Any additional distance walked by the users compared to the optimal route was considered inefficiency and counted against the score of the travel aid.

Another example of a navigation aid that uses tactile feedback to the hands is the Tactile Handle, a clever handheld device that uses four sonar range sensors and a coarse 4x4 array of vibration motors on the grip of the device to provide range information. The team from Rutgers and Northeastern University attempted to intuitively map the sonar sensors on the Tactile Handle to the feedback motors so that obstacle locations are easily inferred. When grasping the device, each finger has four vibration motors in contact with it. The distance to the nearest object is encoded as vibration amplitude on the feedback motors such that the closer the object is to the device, the greater the amplitude. The direction to an object determines which of the tactile feedback motors are active. Figure 3-6 shows that the sonars are mounted facing four different directions: left, right, down, and forward. The concept of providing intuitive feedback had merit, but the implementation was less intuitive and informative than desired. Since the sensors have a 60 degree detection cone, the information was imprecise, and the mapping between the obstacle directions and the fingers was not that intuitive after all. It was desired for users of the device to be able to distinguish between different obstacle geometries based on which feedback motors were active, but the low resolution of the sonar sensors limited the perceived geometrical data to the point that the full capability of the feedback handle was not used. No results for actual navigational testing were presented, nor were subsequent publications found.
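The path-based scoring used in the obstacle-course test described earlier on this page (extra distance walked relative to an optimal route) is simple enough to sketch. The example below is illustrative only: the function names are hypothetical, paths are assumed to be recorded as sequences of 2D points in meters, and the optimal route is taken as a given input rather than generated by the gradient descent planner the authors used.

```python
import math


def path_length(points):
    """Total length of a polyline given as a sequence of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def inefficiency(walked, optimal):
    """Extra distance walked compared to the optimal route (never negative)."""
    return max(path_length(walked) - path_length(optimal), 0.0)


def length_ratio(walked, optimal):
    """Ratio of actual to optimal path length; 1.0 is a perfect score."""
    return path_length(walked) / path_length(optimal)


# Hypothetical example: a user detours around an obstacle on a 6 m course.
walked = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
optimal = [(0.0, 0.0), (6.0, 0.0)]
extra_m = inefficiency(walked, optimal)   # 10 m walked versus 6 m optimal
ratio = length_ratio(walked, optimal)
```

Reporting both the absolute extra distance and the ratio can be useful, since the ratio allows courses of different lengths to be compared. Neither measure captures walking speed, which other studies in this chapter report separately.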
62 Another system which used feedback to the hands was developed by researchers at the University of Guelph. They used a chest-mounted stereo camera setup to detect obstacles and five vibration motors placed at the fingertips. The space in front of the user was divided into five angular regions, which were mapped to the five fingers with the buzzers on them. When an object is detected closer than 3 feet away, the appropriate finger buzzer is activated to indicate an object in that direction. A similar system was developed by Meers and Ward, who used electrotactile elements on all ten fingers. Besides using 10 angular regions instead of 5, the system also produced a stimulation that was proportional to the nearness of the objects.

Navigational systems can also use force feedback to convey obstacles. The . An ultrasonic range sensor determines the distance to the object it is pointed at and adjusts the tension on the string to create the sensation of a force pulling the handheld device backwards, with increasing force as an object gets closer. This idea creates a very intuitive experience and requires little training. Two drawbacks of this system are lack of resolution and lack of force feedback directionality. The system has a relatively low resolution because the feedback is one dimensional (the force is exerted on a single handle), so detail is somewhat lacking. And the force feedback is helpful, but not actually indicative of surface normals. For example, the hand will always be pulled back towards the waist even though in some cases, such as pointing the device towards a wall to the right, a more useful feedback response would be to pull the handle to the left, not backwards.
63 Several studies have also been performed which attest to the power of touch when learning. One such study was an experiment to see how three different modes of learning help convey spatial understanding to adults who are blind. The three modes tested were 1) direct experience, 2) cartographic representation, and 3) verbal description. Learning ability through direct experience was tested by first leading the test subject on a route and then allowing the test subject to re-navigate the same route without aid. Learning ability through cartographic representation was tested by allowing users to feel around a scale model of the test route, complete with 3D landmarks and braille markings to identify the landmarks. Subjects were then tested on their ability to walk the predefined course laid out on the scale model. The results showed that when subjects were initially allowed to learn a route with a tactile map, their spatial knowledge was better than when learning with verbal description and even better than direct experience. While this experiment illustrated the superiority of using tactile input (over auditory description) to learn large-scale maps, the same should be true of learning small-scale maps of nearby obstacles.

Another study showed how haptic feedback devices enable individuals to mentally map the geometry of unseen objects. Researchers from the Royal Institute of Technology in Sweden studied blind and sighted users interacting in a virtual environment using a haptic feedback pointing device. Two users, one sighted and one blind, were tasked with exploring a virtual environment and moving various primitive shapes around in it. The exploration task required both users to determine where certain shapes were and what their dynamic properties were (such as their firmness and surface friction since they could be moved
64 around on the virtual floor). The task of moving these objects around required coordinated efforts from both users, such as handing off objects from one grasper to the other (the virtual grasper was controlled by the haptic pointing device). They write: Some of the users did request a way of knowing when their feedback tool was outside the scope of the environment instead of being able to unknowingly leave the virtual working space, a feature that would be important for any type of haptic feedback device where exploration is allowed.

A summary of the key navigation aids that have been reviewed so far is presented in Table 3-1. This table lists the sensing method, the output method, and notes on how the information was conveyed to the user. The aids are generally sorted by sensing method and level of complexity.
65 Table 3-1. Summary of reviewed navigation aids.
Ref. | Name | Sensing method | Output method | Notes
 | - | Not presented | 49x49 electrotactile array on tongue | Current proportional to pixel value.
 | vOICe | Camera (64x64 grayscale image) | Audio | Brightness mapped to tone amplitude, vertical direction mapped to frequency, horizontal columns played sequentially.
 | Polaron, Mowat | 1 Sonar | Audio tone | Handheld device. Proximity mapped to pitch.
 | Ultra Bodyguard | 1 Sonar | Vibration or warning tone | Worn around neck. Warns if object closer than threshold dist.
 | CyARM | 1 Sonar | Force | Proximity of object pointed to by handheld device creates greater pull towards body.
 | Tactile Handle | 4 Sonars | 4 vibration motors in handheld device | Proximity mapped to vibration intensity.
 | Navbelt | 8 Sonars on belt | Sound | Beeps played like a radar sweep where amplitude corresponds to proximity.
 | - | 6 Sonars on head | Spatialized sound | Sound appears to come from direction of obstacle.
 | - | Stereo cameras | Spatialized sound | Sound appears to come from direction of obstacle.
 | - | Stereo cameras and sonars on vest | Vibration on vest (from sonar) | Use of stereo data not discussed. Sonar proximity mapped to intensity of vibration motors on vest.
66 Table 3-1. Continued
Ref. | Name | Sensing method | Output method | Notes
 | NAVI | Stereo cameras | Sound | Vertical mapped to amplitude, horizontal mapped to frequency, and verbal cues.
 | SVETA | Stereo cameras | Sound | Vertical mapped to frequency, horizontal mapped to L/R pan.
 | - | Stereo cameras | Vibrotactile belt | 2D slice from stereo cameras mapped to vibration motors on belt.
 | TYROS | Stereo cameras | 4x4 vibrotactile grid on stomach | Proximity mapped to vibration intensity.
 | - | Stereo cameras | 3 vibration motors on hand | Segment image into 3 regions mapped to three feedback motors.
 | - | Stereo cameras on chest | Vibration motors on 5 fingers | Segment image into 5 regions. If an object is within 3 ft., the corresponding motor vibrates.
 | - | Stereo cameras | Electrotactile feedback on 10 fingers | Segment image into 10 regions mapped to fingers.

Stereo Vision

As was mentioned earlier, computer vision offers capabilities that most closely resemble the capacity of human eyesight. When attempting to provide a system to compensate for the lack of vision in people, the techniques used in the field of computer vision can be very useful. Computer stereo vision is most akin to our mechanism of sight and was used in this research; thus a quick review of current stereo vision techniques follows.

Computer stereo vision is still an active area of research as new, improved algorithms are developed. Improvements can be made by increasing the
67 accuracy of the data, by increasing the robustness to various types of lighting and texture conditions, or by adding resilience to confusing geometry. Improvements are also made by increasing data density (i.e., minimizing areas where depth cannot be calculated) and by increasing the update rate through hardware or software advancements.

The goal of stereo vision is simple: use the information from two different cameras to infer the distance to the objects in the scene. The methods for doing this vary considerably, as many different approaches have been taken. Stereo algorithms can be classified as dense, which attempt to assign a depth value to every pixel location, or as sparse or feature-based, which do not attempt to assign a value to each pixel. For instance, a sparse algorithm may only assign distance values to edges in the image by first filtering the image for edges and then matching and triangulating these edges between the two images. A feature-based stereo algorithm will only match key features which are detected in both frames, providing a limited subset of depth information compared to the more densely populated depth field that would be provided by a dense stereo algorithm.

Scharstein and Szeliski developed a taxonomy and an evaluation framework for stereo vision algorithms and presented a variety of these algorithms with qualitative and quantitative comparisons. In attempting to reduce the various algorithms developed to something that could be compared, they note that stereo algorithms usually perform all or some of the following steps:
1. matching cost computation
2. cost (support) aggregation
68 3. disparity computation/optimization
4. disparity refinement

The first step, matching cost computation, computes the cost of matching a pixel in one frame with a pixel in the other frame. The goal is to match pixels in one frame with the pixels in the other frame that image the same spot in the scene. A lower cost means a better guess that these pixels are matched. The second step, cost aggregation, combines the matching costs of neighboring pixels (for example, by summing them over a support window) so that each candidate match is judged together with its neighborhood. The third step, disparity computation, selects a disparity for each pixel from the aggregated costs, either locally by picking the lowest-cost candidate or through an optimization in which the matching candidates are selected so as to best explain the image given certain assumptions of smoothness. The depth or distance to the object then follows from the selected disparities. The fourth step is optional but can improve the accuracy of the depth map through noise reduction or other means.

The process of computing matching costs can be visualized using Figure 3-7. The images from the two cameras are shown in Figure 3-7a and Figure 3-7b, where the dashed green line denotes the selected row. The graph at the bottom (Figure 3-7c) shows the matching cost over a range of disparity and x values. A brighter color indicates a higher matching cost and a darker color indicates a lower cost. Observing this graph, we can see that at most of the x (column) locations, there is a disparity that minimizes the matching cost. Selecting the disparity at each x value that minimizes the cost is the simplest way of resolving the depth data along this scan line.
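The scan-line procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this research: it assumes a rectified image pair, so that a point at column x in the left row appears at column x - d in the right row, and it uses a small window of squared differences as the matching cost. The function names and the window size are hypothetical.

```python
import numpy as np


def scanline_cost(left_row, right_row, max_disp, half_win=2):
    """Return cost[d, x]: windowed sum of squared differences between the
    left row at column x and the right row at column x - d.

    Entries whose windows would fall outside either row stay at infinity.
    """
    left = np.asarray(left_row, dtype=float)
    right = np.asarray(right_row, dtype=float)
    n = left.size
    cost = np.full((max_disp + 1, n), np.inf)
    for d in range(max_disp + 1):
        for x in range(half_win + d, n - half_win):
            left_win = left[x - half_win:x + half_win + 1]
            right_win = right[x - d - half_win:x - d + half_win + 1]
            cost[d, x] = np.sum((left_win - right_win) ** 2)
    return cost


def winner_take_all(cost):
    """Pick the lowest-cost disparity per column; -1 where no cost is valid."""
    valid = np.isfinite(cost).any(axis=0)
    disparity = np.argmin(cost, axis=0)
    disparity[~valid] = -1
    return disparity
```

Plotting the returned cost array as an image, with disparity on one axis and column on the other, gives a graph of the same kind as the one described for Figure 3-7c.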
69 The task of determining matching points between the left and the right images (the correspondence problem) is commonly done using the sum of squared differences (SSD) approach. In this method, a window of pixels from one image is slid across the other image and the sum of the squared differences between the pixel intensities is computed at each disparity. For each pixel, the best disparity (lowest matching error) is selected and assigned to that pixel. The sliding window approach using the SSD is an example of a local method for finding correspondence since it does not consider the matching costs of other parts of the image simultaneously. Global methods, on the other hand, use global optimization on the whole image during the correspondence step. These global methods attempt to solve the correspondence problem by finding the least-cost disparity map, minimizing an energy or cost function over the whole image as opposed to a local subset of it. The cost function used in these methods often combines how well the disparity map matches the data and how smooth the disparity map is. Adding a smoothness term to the cost function is a way of explicitly assuming a degree of smoothness in the depth field, which is usually the case away from object boundaries.

Some algorithms make sub-pixel accurate disparity maps. Single-pixel accuracy is sufficient for many applications, but it leads to a definite discretization of the disparity space and hence a discretization of the distances that are computed from the correspondence algorithm. Certain cases do require extra accuracy, such as 3D scanning and lifelike rendering, but for many cases, such as obstacle detection, this extra accuracy is not worth the sacrifice of processing speed. Other techniques can create multi-valued representations, voxel-based representations, and triangular meshes.
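For reference, the quantities discussed above can be written compactly. The notation is introduced here for illustration and is not taken from the original text: I_L and I_R are the left and right image intensities of a rectified pair, W is the matching window, d is the disparity, f is the focal length in pixels, B is the camera baseline, and lambda weights the smoothness term.

```latex
% SSD matching cost for pixel (x, y) at candidate disparity d
C(x, y, d) = \sum_{(i,j) \in W} \bigl[ I_L(x+i,\, y+j) - I_R(x+i-d,\, y+j) \bigr]^2

% Local (winner-take-all) selection
\hat{d}(x, y) = \arg\min_{d} C(x, y, d)

% Global methods: minimize an energy over the whole disparity map
E(d) = E_{\text{data}}(d) + \lambda\, E_{\text{smooth}}(d)

% Depth from disparity, and the depth step caused by a disparity step \Delta d
Z = \frac{f B}{d}, \qquad |\Delta Z| \approx \frac{Z^2}{f B}\, |\Delta d|
```

The last relation follows from differentiating Z = fB/d with respect to d. It shows why single-pixel disparity accuracy discretizes distance more coarsely for far objects than for near ones: the depth step grows with the square of the distance.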
70 Research has also been directed at making algorithms that are faster and that can be run on dedicated hardware instead of in software. Lai et al. have developed a simplified algorithm designed to run on a digital signal processing (DSP) chip. Their goal was to design a lower cost system for stereo imaging. Their algorithm does not use the full-blown stereo correspondence windows typically used, but instead uses a three-step process consisting of a Gaussian filter for blurring, a modified Sobel filter for edge finding, and correspondence matching on those edges. Using a DSP chip allowed a 720x576 image to be processed at about 25 frames per second. Unfortunately, applying the Sobel edge filter reduces the features available for stereo correspondence (in areas which lack edges), and the additional noise reduction filters reduce the number of pixels with depth data by roughly 50%.

Anderson et al. have developed an embedded stereo vision processing system with the intention of providing guidance to the visually impaired. Their code runs on an FPGA, which improves performance dramatically over running code on a CPU. Their system attempts to represent the depth data along a horizontal scan line with a spline which is found using a genetic algorithm. The success of the stereo algorithm was presented, but the interface with the human was not described.

Once stereo data is collected and turned into a 3D point cloud, it needs to be further processed to be useful to a sensory substitution application. For instance, objects can be extracted from the data and conveyed to the user. Or the entire 3D geometry can be simplified so that objects in predefined zones trigger warnings to users. Collision detection methods, such as the one presented in , can be used to speedily
71 determine if a region of space is occupied by data in the point cloud. This can be useful for processing the large volume of raw data into something more meaningful.

Figure 3-1. Shapes used in the identification testing of an electrotactile tongue display.

Figure 3-2. The CyberGrasp force feedback glove.
72 Figure 3-3. An ultrasonic navigation aid.

Figure 3-4. MOWAT travel aid.
73 Figure 3-5. The hardware setup and a diagram of the operation of the navigation aid developed at FIU.

Figure 3-6. The Tactile Handle.
74 Figure 3-7. Illustration of disparity computation. A) The image from the left camera. B) The image from the right camera. C) The matching error as a function of disparity and column number.
75 CHAPTER 4
SYSTEM DESIGN AND IMPLEMENTATION

Concept

The goal of this research was to develop a new system that makes it easier for people with blindness to understand the geometry of the surrounding world with more clarity and simplicity than ever before. Taking motivation from how people naturally feel around with their hands to discover something new, the aim is to design an intuitive tactile feedback system that effectively extends the range at which people can feel with their hands. This system should help people who are visually impaired to understand their environment and navigate using a more natural feedback method while giving them richer detail about their surroundings.

Conceptually, the idea is to measure the geometry of the area around a user with a body-mounted 3D sensor, then to scale this sensed geometry down to arm's length to be used as the model for a feedback field. The feedback field is a volume in front of the user which represents the scaled-down environment geometry. When the user's fingers are in a position where they intersect a piece of the geometry in that feedback field, a stimulus is provided to the intersecting finger. In this manner, a subject can learn about the geometry far beyond arm's length by simply moving his or her hands around in this space, sensing where objects are based on the feedback to the fingers, and forming a mental map of the geometry. This mental map can then be used to infer what the scene truly looks like.

Additionally, the model used for the feedback field is updated continuously from the sensor data so that the model always represents the most current state of the environment, giving the user the ability to feel the scene in front of him or her in real
76 time. Having a sufficiently fast update rate is important for several reasons. One benefit is that as the scene changes, these changes can be sensed by the user. For example, if a person were walking across the path of the individual using the device, the user would be able to hold out her hands and feel her fingers being stimulated sequentially from left to right, indicating that an object was moving in this direction. Another benefit of having the data refresh quickly is that it enables the device to be used while walking and not just while standing still. As the user travels along a path, he can sweep the area in front of him with his hands in search of potential obstacles. For this to be useful, the data must be updated sufficiently fast. Similarly, the user may want to change her field of view by rotating to scan the area. As she rotates, the feedback field must be refreshed so that it accurately reflects the scene in front of her. Maintaining a sufficiently fast update frequency is critical to successfully creating the illusion of interacting with a live model.

The system comprises four components: 1) a device to sense the geometry of the surroundings, 2) feedback gloves to provide tactile feedback, 3) a system to sense the position and orientation of the hands, and 4) a computational device to process the data. The feedback gloves are worn on the hands to convey the sensation of touching a surface, even though the surface is virtual. This can be done by creating a vibration, by applying an electrical stimulus, or by applying a force to the fingers. The computational device runs the software which is responsible for processing the 3D data, processing the location of the hands, and setting the stimulation state of the feedback gloves to simulate the sensation of touching the virtual object.
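The interaction between the scaled-down geometry and the fingers can be sketched as follows. This is a simplified illustration under stated assumptions, not the software developed for this system: the sensed geometry is taken to be a point cloud already expressed in the same coordinate frame as the fingertips, contact is declared when any point lies within a small radius of a fingertip, and the scale factor, offset, and contact radius are hypothetical values.

```python
import numpy as np


def build_feedback_field(points_m, scale, offset_m):
    """Scale sensed 3D points (meters) down and shift them to form the
    feedback field. `scale` and `offset_m` are tunable, assumed parameters."""
    points = np.asarray(points_m, dtype=float)
    return points * scale + np.asarray(offset_m, dtype=float)


def finger_stimuli(field_points, fingertips_m, contact_radius_m=0.02):
    """Return one boolean per fingertip: True when any feedback-field point
    lies within `contact_radius_m` of that fingertip (brute-force check)."""
    field = np.asarray(field_points, dtype=float)
    tips = np.asarray(fingertips_m, dtype=float)
    # Pairwise distances between F fingertips and N field points: shape (F, N).
    dists = np.linalg.norm(tips[:, None, :] - field[None, :, :], axis=2)
    return (dists <= contact_radius_m).any(axis=1)


# Example: a flat wall 3 m ahead, scaled by 0.2 so it sits 0.6 m from the sensor.
xs, ys = np.meshgrid(np.linspace(-1, 1, 41), np.linspace(-1, 1, 41))
wall = np.column_stack([xs.ravel(), ys.ravel(), np.full(xs.size, 3.0)])
field = build_feedback_field(wall, scale=0.2, offset_m=(0.0, 0.0, 0.0))
active = finger_stimuli(field, [[0.0, 0.0, 0.6], [0.0, 0.0, 0.3]])
```

In this sketch the field would be rebuilt every time new sensor data arrives, which is what makes the update rate discussed above matter. The brute-force distance check is chosen for clarity; a faster occupancy test would be needed for dense point clouds.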
77 The operation of the system is described using an example which is illustrated in Figure 4-1 through Figure 4-5. Suppose that a person without vision is trying to understand the scene in front of him. In Figure 4-1, he is standing in front of a car. The beams radiating from his body In Figure 4-2, the 3D model measured with the sensor is shown. This data is scaled down and placed at arm's length from the user to create the virtual feedback field, shown in Figure 4-3. This is the virtual object that the feedback gloves will attempt to convey as the user moves his hands and fingers within this space, giving him a tactile representation of the larger scene in front of him.

Another scene is pictured in Figure 4-4. Here a person is standing in an outdoor area with a picnic table and support columns. The geometry of the surrounding area is captured using the 3D sensor and scaled down in size. In the figure, the virtual 3D geometry is shown semi-transparent in blue. The computer constantly receives updated measurements of hand locations relative to this virtual 3D geometry so that when a finger is in contact with the virtual surface, a tactile sensation is triggered, as depicted in Figure 4-5.

The aim is to give non-sighted people the ability to extend their reach and thereby give them the ability to better navigate in unknown and changing environments. The hope is that if the system is intuitive enough, it can give people the ability to form mental maps of the surrounding structures. The system should be wearable, require no remote computer, and give a person the ability to use the system in their daily tasks, both indoors and outdoors. The power of the
78 intuitive nature lies in being able to use the hands as a natural extension for probing the environment at artificially enhanced ranges. While this design has some similarities to that presented in , the vision-tactile mapping in that project was not intuitive because it did not take the location of the hands into account. By not using the location of the hands relative to the body, the choice of placing the vibration motors on the hands is arbitrary, and they could just as well have been placed in other locations. In that scheme, the power of haptic exploration in a feedback field is not enabled and the system is limited to expressing data in only 2 dimensions.

System Design

The previous section described the desired operation of the system and the overall effect to be achieved. This section discusses how those goals influenced the system design. This includes what systems were selected for measuring the surroundings and the pose of the hands as well as the method of conveying stimuli through the gloves. It also includes how the data is used to create the effect of a feedback field based on the surroundings.

Design Considerations

The first component of the design is the sensor for measuring the surrounding geometry. This is critical for providing a useful picture to the user because the output information can only be as good as the input information. This sensor needs to have an adequate working range for obstacle detection, which is at least 3-4 meters. It needs to be lightweight since it will be worn by the individual, and it needs to be effective both
79 indoors and outside. The affordability of this sensor is also important if such a system is one day to be commercialized.

Chapter 1 covered some options for detecting the geometry of the environment and listed their benefits and drawbacks. To meet the requirement of being able to function both indoors and outside, structured light sensors cannot be used, since most of these sensors do not function well in daylight because sunlight washes out the visible or infrared light pattern. An array of sonars could be used to provide a fairly accurate image of the surroundings, but the resolution would be limited by the number of sonars. If one desired an array with a very modest resolution, say 10 x 10, then the 100 sonars required would cause the array to be very bulky. Such an array would also require that the sonar pulses be staggered so as not to interfere with neighboring sensors, causing a lengthy capture time. A laser range scanner would be excellent at providing an accurate and high-resolution 3D image at fast update rates, but the cost of these sensors is still prohibitive for applications such as wearable electronics. The 2 kg Velodyne HDL-32E is capable of producing dense point clouds at 10 Hz with a 40 degree vertical field of view and a full 360 degree horizontal field of view, but the cost is currently around $30,000, making it an excellent choice for performance but a poor choice for meeting the price criterion.

Stereo camera systems have been used for decades and have the benefits of functioning both inside and outside and of being relatively inexpensive and lightweight. They also produce a 3D image that can have the same angular resolution as the imaging sensor, which is much higher than anything achievable with sonar. There are
80 some drawbacks that make stereo cameras non-ideal, however. They are affected by lighting conditions, as all cameras are, so dimly lit environments can cause the image to be too dark if a short exposure is used, or too blurry if a longer exposure is used. Glare from the sun can cause saturated areas of the images that cannot be processed properly. They are also ineffective at determining the distance to transparent surfaces such as glass doors or highly reflective surfaces such as water.

In light of all this, stereo cameras were selected for the task of providing a 3D image of the surroundings because of their small size, affordability, and indoor/outdoor capability. As computing power improves, larger stereo images can be processed at faster speeds, and the use of parallelized computing systems such as GPUs and FPGAs greatly increases the processing speed. The size of the stereo cameras can also eventually be reduced to something that could fit in a pair of glasses, making them very unobtrusive.

The second component of the design is the system for sensing the position and orientation of the hands. This task is essential for knowing where the fingers are relative to the scaled-down geometry that will be used as the feedback field. When collisions are detected between the fingers and the feedback field model, a tactile sensation will be sent to the user's fingers. Some of the options considered for measuring the hand positions are listed below:
1) Visually sensing the positions of the hands using the stereo cameras
2) Visually sensing the positions of the hands using a separate camera
3) Mechanically sensing the positions of the hands
And two of the options considered for measuring the hand orientation are listed below:
81 A) Visually measuring the orientation of the hands
B) Measuring the orientation of the hands using MEMS inclinometers

Option 1 was ruled out because initial tests showed that using the stereo cameras to sense the hand locations blocked out too much of the image, reducing the ability to sense objects close to the user. Option 3 could have been implemented with a string potentiometer mounted to the hips with two additional rotation sensors to measure the angle at which the string was pulled. Then, by connecting the string to the gloves, the position could be determined. This option was ruled out because it was cumbersome and the string could become snagged on an object. To measure the orientation of the hands, option B was considered but was ruled out because it would be unable to measure the yaw of the hand.

The chosen solution was a system that measured both the position and orientation of the hands with a separate dedicated camera. This was accomplished by using a 2D marker with a known pattern. When the marker is in the field of view of the camera and its features can be distinguished, its position and orientation relative to the camera can be determined. To accomplish this, a software system commonly used with augmented reality applications was adapted. An example of what these markers look like is shown in Figure 4-6. These markers could then be placed on the back of the hands, and the camera looking down at the hands could determine each hand's pose relative to the camera.

The third component of the design is the feedback gloves. This is a crucial part of the system as it is all the user will experience. Even if the data is perfect, and the processing is perfect,
82 if the gloves cannot convey that information, the system will be non-functional. Since the overall goal of this work is to provide a feedback field model of the environment that can be explored with the hands, it is necessary to provide a sensation that indicates when the fingers are colliding with part of this model. One thought was to emulate the haptic CyberGrasp glove used in , which was capable of exerting an upward force on individual fingers. One drawback to this design is that if fingers can only be pulled upward, then it might be confusing if the top of the finger were to brush the side of a virtual wall and be lifted up when a more sensible response would be to push the finger downward. So a system using force feedback to the fingers would preferably be able to exert a force in either direction depending on the orientation of the finger and the surface normal of the object. Such a glove would be fairly bulky, though, so a simpler approach was selected. Vibration motors were placed at the fingertips of the gloves to simply provide vibration feedback upon collision with a surface in the feedback field model. This made the gloves lighter and less cumbersome. Since one of the goals is to design the system to be as unobtrusive as possible, the small vibration motors could even be placed on the top of the fingers, and the gloves could be cut to expose the fingertips for reading braille and handling small objects. Figure 4-7 shows one of the vibration motors used in the feedback gloves. Figure 4-8 shows a user approaching an area with a light pole and park bench to his left as well as the corresponding feedback field. Figure 4-9 shows how individual fingers would be stimulated as the fingers intersect parts of the geometry.

The fourth component of the system is the computational resource. This is responsible for processing the images from the stereo cameras and the hand tracking
83 camera and for setting the state of the feedback gloves to produce the intended effect. It is desired that this be small, lightweight, and have ample computing power for processing stereo video and performing all the other required processing. This could eventually be miniaturized into an embedded system optimized for the task, but for this prototype, a laptop computer was selected to run the software required for the data processing. While not ideal because of its weight and power requirements, it proves the sy customized for such an application.

System Functionality

The system functionality includes all the specifics that are required to create the intended user experience. Since it is desired to create a model of the surroundings that the user can feel with his or her hands, the location of the stereo cameras that sense the environment and the location of the third camera that senses the hands are both important design decisions that affect how the system functions. If the stereo cameras are mounted on the chest, they will capture the direction the torso is facing; and if the cameras are mounted on the head, they will capture images in whatever orientation the head is pointing. Similarly, the location of the hand tracking camera will dictate the reference frame from which the hands are measured. If the hand tracking camera is mounted to the chest, the hands will have a different detection zone than if the camera were mounted on the head.

Also, considering that the finger collision detection step requires that the location and orientation of the hands be represented in the same coordinate system as the 3D
84 geometry measured with the stereo cameras, there are two possible mounting configurations. The first is that the stereo cameras and the hand tracking camera are mounted on different parts of the body that can move relative to each other (e.g., the stereo cameras might be mounted to the head and the hand tracking camera might be mounted to the chest). In this scenario, the position and orientation of the stereo cameras must be measured relative to the hand tracking camera to transform positions from one coordinate system to the other. The other option is to mount the hand tracking camera and the stereo cameras on the same object so that they do not move relative to each other. This scenario is beneficial because it does not require the measurement of any joints, such as the neck joint with its many degrees of freedom. A fixed, known transformation between the two camera systems will exist and can be used for placing the hands and the environment geometry in a common coordinate system.

Taking the previous considerations into account, the decision was made to mount the stereo cameras rigidly to the hand tracking camera and to have these cameras mounted on the head, as shown in Figure 4-10. This offers a simpler design and gives users the ability to change the field of view by turning their head, in the same way people with vision do. In this configuration, a person wearing the system could be facing one direction but be curious about what is to his immediate left. By turning his head to the left and extending his left hand to his side, he would still be able to interact with the feedback field, as the stereo cameras would be capturing in that direction and the hand tracking camera would be aimed toward his laterally extended hand, as shown in Figure 4-11. This would not be possible if the cameras were mounted to the chest, as both the
3D geometry to the left and the laterally extended hand would be out of the field of view of the cameras.

Another decision that must be made is how to inform users when their hand pose is either not properly detected or outside the field of view of the camera. One option is to indicate this by playing a tone, perhaps one tone to indicate the left hand is undetected and a different tone for the right hand. Another option is to place an additional vibration motor somewhere else on the hand which would indicate by vibrating that that hand was not detected. This information is important to convey so that when the hands do go undetected, the user does not interpret the lack of vibration feedback to the fingers as a lack of obstacles. Users should be made aware of the invalid state of the feedback gloves until the hand pose is reacquired. It would also be possible to predict the location and orientation of the hands for a short period of time after the tracking pattern was lost, but because the hand motion is not constrained and is fairly random, no useful model for predicting hand motion would likely be valid for time spans greater than about a second. Therefore, it was decided not to attempt to predict hand locations during the periods when pose measurements are lost, and instead to warn the user of this condition using an additional vibration motor mounted to the back of each hand.

Another important design choice is the scheme used for scaling the sensed geometry of the surroundings. The stereo cameras detect objects from a couple of feet away to hundreds of feet away. The way in which this data is scaled and translated affects which parts of the surroundings can be sensed by the user and how much detail is present. It is desired to map the physical surroundings to a comfortable range of hand positions. A reasonable range for obstacle detection would probably be from
about one half meter up to five meters. This range should allow for the detection of obstacles which pose an immediate risk to walking as well as objects which are farther s, or hedges. It is desirable to have the part of the feedback field that corresponds to the ground level be sense low objects by putting their hands at that height. Higher objects can, of course, be sensed by raising the hands. To achieve this, the 3D model from the stereo cameras will need to be scaled down and translated downward to the desired level. These scaling and translation values will need to be easily modified parameters in order to find the values that work best for the individual. They may only need to be modified until a suitable value is found that works for most people, or they may need to be parameters that can be tuned by the users of the device so that a custom experience is achievable. One way to do this automatically could be to have users place their hands in a comfortable position and then activate a calibration process that places the ground level where their hands are.

In the case where either of the hands is not properly detected, the user will know of this condition by the vibration on the top of that hand. This situation can arise when the marker on the back of the hand becomes undetectable by the hand tracking camera for one of several reasons: 1) the hand marker goes outside the field of view of the hand tracking camera, 2) the marker is rotated so much that the marker detection algorithm fails, 3) the marker is occluded by the other hand, or 4) poor image quality due to lighting or blurring causes the marker to become undetectable. It is expected
that a normal method for navigating a pathway would be to walk with the hands at about waist height while sweeping them back and forth to detect potential obstructions. This is similar to how a long cane would be used, except that it would require less side to side movement since there are effectively ten probes (one for each finger) which increase the area of coverage.

An important question arises when considering what geometric information to relay to the hands through the feedback field. Should the ground surface be included in the 3D model? If the ground surface is retained in the 3D model, then a user will be her hands and will need to be able to distinguish between touching the ground and touching an object that is on or above the ground. But if a method is used to remove the ground from the 3D model, then the user will be able to assume that any feedback she receives through the gloves is indicative of an object that is either on or above the ground plane. Figure 4-12 illustrates the scenario that would occur if a user were to come across a relatively short obstacle, in this case, a soccer ball. All ten fingers are currently receiving feedback, making it nearly impossible to distinguish what is traversable ground and what is an impediment to walking. A closer look is offered in Figure 4-13. Here it can be seen that the fingers are both intersecting the ground and touching the soccer ball model, creating a full feedback condition. If the hands were raised slightly, perhaps the only fingers to receive feedback would be those touching the soccer ball, but since the ground geometry will be somewhat noisy, this could still strain the abilities of human perception. To remedy this difficulty, the ground plane could be classified as such and not included in the feedback field. Figure 4-14 shows the ground plane
drawn in red to indicate that it is not used to create feedback, so even though the user's hands are intersecting it, only the two fingers which are touching the soccer ball would have active stimuli and the trip hazard to the user could be recognized and avoided. The simplified model without the ground plane (Figure 4-15) is much easier to scan for obstacles than the original model (Figure 4-12). Because excluding the ground plane from the feedback model makes it much simpler to recognize shorter obstacles, it was decided to design this feature into the system. The description of the process used for the ground plane exclusion is left for the next section on implementation. One of the difficulties involved with removing the ground plane is that the orientation of the cameras relative to the ground is not measured. This means that a simple scheme that eliminates points below not produce the desired result since the true ground plane could be oriented at a wide range of angles from the camera.

Another design choice involves how the feedback state of the gloves is determined as the fingers interact with the feedback field. Since the geometry of the feedback field is based on the output of the stereo camera algorithm, only the outer surfaces of the objects in the environment are captured. If the finger feedback is only activated when the fingertip is intersecting a surface in the feedback field, then feedback will not be provided if the fingertip has already penetrated through the surface. However, if the finger feedback is activated when the fingertip is either touching a surface or has been pushed through a surface, users will be less likely to overlook obstacles when they position their fingers inside of or beyond an object.
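The difference between the two activation rules can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the prototype: it assumes that the fingertip and the feedback field points are expressed in the same camera-centered frame with the camera at the origin, and it treats a fingertip as having penetrated a surface when a sensed point lies in nearly the same direction from the camera but at a shorter range. The function names and the tolerance values (`radius`, `angle_tol_rad`, `depth_tol`) are hypothetical.

```python
import math


def _norm(v):
    """Euclidean length of a 3D vector."""
    return math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)


def touching_only(fingertip, cloud, radius=0.03):
    """Rule 1: feedback only while the fingertip is within `radius` of a sensed point."""
    for p in cloud:
        diff = (fingertip[0] - p[0], fingertip[1] - p[1], fingertip[2] - p[2])
        if _norm(diff) <= radius:
            return True
    return False


def touching_or_penetrating(fingertip, cloud, angle_tol_rad=0.05, depth_tol=0.02):
    """Rule 2: feedback when the fingertip is at or beyond a sensed surface.

    Assumption: the camera sits at the origin, so a fingertip is "behind" a
    surface if some sensed point lies along nearly the same ray from the
    camera at a shorter (or roughly equal) range.
    """
    finger_range = _norm(fingertip)
    if finger_range == 0.0:
        return False
    for p in cloud:
        point_range = _norm(p)
        if point_range == 0.0:
            continue
        dot = fingertip[0] * p[0] + fingertip[1] * p[1] + fingertip[2] * p[2]
        cos_angle = max(-1.0, min(1.0, dot / (finger_range * point_range)))
        same_ray = math.acos(cos_angle) <= angle_tol_rad
        if same_ray and finger_range >= point_range - depth_tol:
            return True
    return False
```

With a flat wall of points one unit in front of the camera, a fingertip at a range of 1.3 units along the optical axis is missed by the surface-only rule but still triggers feedback under the second rule, which is the behavior argued for above.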
The model used in the feedback field is based on the data produced by the stereo algorithm, which natively generates a disparity map. This disparity map can also be converted into a point cloud using the known mapping between the pixel locations and the angles they correspond to. While the previous illustrations have depicted the feedback field model using smooth vector geometries, the actual model is a collection of points as shown in Figure 4-16. This collection of points is the actual data on which the finger collision/penetration test is performed.

Performance Goals

There are several performance categories which are important for creating a helpful user experience. Good performance in these categories is expected to produce a navigation aid that will be enjoyable to use and that will outperform the tools currently in use as well as the systems which have been developed in the research community thus far. The performance goals in these categories are established by making an educated estimate about what level of performance should be required to achieve the intended effect of making a usable live feedback field. These categories include 1) the update rate, 2) the ground detection accuracy, 3) the level of detail in the feedback field model, and 4) the level of accuracy in measuring the hand position and orientation.

The update rate is critical for a usable human interface. If the hand feedback state is computed based on hand position data and stereo vision data from a moment in time sufficiently long ago, then the delayed output will be confusing and it will be difficult to form a mental map based on haptic exploration. It is crucial to deliver a sufficiently fast update rate for the entire process of sensing and processing the 3D environment data, sensing the pose of the hands, performing collision detection, and setting the
feedback state of the gloves. Any delays will be perceived as unresponsiveness and will require the user to explore the feedback field more gradually and to walk more slowly. It is desired to avoid this situation at all costs to ensure that the tool does not get in the way but rather enables the user to explore freely. It is expected that 5 Hz will be the minimum allowable update rate, but a rate closer to 10 Hz is more desirable.

The accuracy of the ground removal algorithm will also be important. Since the ground removal algorithm will be responsible for delineating between potential obstacles and the load bearing surface, it is important that it perform this task while minimizing false positives and false negatives. False positives would incorrectly identify actual obstacles as ground, creating the hazard of not informing the user of the threat, whereas false negatives would incorrectly leave parts of the ground in the feedback field which could be interpreted as obstacles when really none are present. Although the former is the greater danger, the latter will be an annoyance and will decrease the

The level of detail in the feedback field is another factor which will determine how useful the system will be. A low resolution field might be useful for generally avoiding obstacles, but a higher resolution field would provide enough detail to clearly understand properties such as how much walking space is available between two obstacles and whether a small obstacle is lying in the pathway. The level of detail is dependent upon the resolution of the stereo images used in the stereo algorithm, the accuracy of the stereo system (a function of the stereo algorithm and the camera separation), and the type of post processing performed on the data.
orientation is also critical to the overall experience. Even if the quality of the feedback ate, noisy, or intermittent, then the details of the feedback field will be nearly impossible to sense and interpret. Ideally, the accuracy of the hand pose estimation would be similar to the resolution and accuracy of the feedback field model. Keep in mind that since the feedback field is scaled down by a scaling factor, the accuracy of the feedback field geometry relative to the accuracy of the hand positions will be proportionally better.

It was desired that the aforementioned properties (the update rate, the detail and accuracy of the feedback field, the hand tracking accuracy, and the ground removal feature) be sufficient to create a user experience which allows people to perform tasks such as:
1) Finding a doorway
2) Perceiving obstacles on the ground which are at least the size of a shoebox
3) Sensing and avoiding small trees, tree branches, and cables
4) Sensing and avoiding moving obstacles such as people and bikes
5) Finding a chair and understanding its orientation

Prototype System Implementation

This section covers the implementation details of the prototype device. The design was implemented as a proof of concept system, so there are many elements that could be improved upon if a commercialized device were desired; nevertheless, the system described below demonstrates the feasibility of the concept in all areas. This
section will cover first the hardware which was either selected or fabricated, followed by the software algorithms and architecture.

Hardware

The hardware for the proof of concept system consists of 1) a computing device, 2) the cameras for the stereo system and the hand tracking system, 3) the mounting hardware, and 4) a microcontroller board for controlling the vibration motors and camera timing.

The computing device is a laptop PC with an Intel i7 quad core processor and 8 GB of RAM. The three cameras are connected to this device, as is the feedback interface board, with which it communicates over a serial connection. The computer was not equipped with a dedicated Graphics Processing Unit (GPU), although having one would have enabled the stereo data processing step to be offloaded to the GPU, providing a noticeable speed boost since this step is very amenable to parallelization.

The stereo cameras selected for the application are a pair of MatrixVision BlueFox-120aC machine vision cameras with a USB 2.0 interface. The cameras have a (4.8 mm by 3.6 mm) which has a resolution of 640 by 480 pixels. Attached to the cameras are 4 mm lenses which provide a field of view angle of 48.5 degrees in the vertical direction and 62 degrees in the horizontal; these values were determined using (4-1) and (4-2). It would be desirable to eventually transition to using fish eye lenses with a field of view closer to 180 degrees. The cameras can be triggered to capture an image at a regular interval or through an external signal. If the cameras are not synchronized to capture at the same time and there is any motion in the scene (either from objects moving or the cameras moving), then the stereo algorithm will produce erroneous results since it assumes the pictures are captured simultaneously. To avoid
this, the cameras are both triggered from a common external signal which is generated from a microcontroller board at a rate of 10 Hz.

(4-1)

(4-2)

The hand tracking camera is a Sony Playstation Eye camera. This camera has a resolution of 640 by 480 pixels and can capture video at 120 Hz, although this speed is not needed as the processing algorithm runs much slower than this. The camera has an adjustable focal length lens which is set to the widest setting, giving a 75 degree field of view.

Both the stereo vision BlueFox cameras and the Eye camera are mounted on a helmet. This is more cumbersome than is necessary, and the design could be improved by using smaller cameras and mounting them to perhaps a hat. A backpack is used to carry the laptop computer and the hand feedback interface electronics. The cameras are, as mentioned earlier, rigidly mounted to a helmet. From the backpack come all the wires for the feedback gloves and for the cameras (including the lines that carry the stereo camera triggering signal). Figure 4-19 shows the complete navigation aid system worn by a user.

A microcontroller board with an Atmel ATmega128 chip, seen in Figure 4-20, is used to control the twelve vibration motors used for feedback to the hands. This board controls the vibration motors by controlling the state of an array of 12 opto-isolators which either interrupt or connect the motors to a 5 V power source. The board
communicates with the computer over a serial communication line so that the software on the laptop can send the commanded feedback state to the microcontroller board. The board also simultaneously sends a 10 Hz pulse from one of its GPIO pins to the two stereo cameras to provide the trigger signal for capturing a new frame. A diagram illustrating the connectivity is shown in Figure 4-21.

The feedback gloves are made from a pair of polyester gloves to which the vibration motors have been attached on each finger as shown in Figure 4-22. Additionally, another vibration motor is mounted on the back of each hand to indicate to angled too much, or otherwise unrecognized by the software). The wire pairs from each motor are routed back towards the wrist, where they are connected into a single bundle to be routed along the arm and into the backpack. The gloves also have a Velcro pad on the top to which the hand tracking marker is affixed. This arrangement is suitable for concept validation, but to be practical in daily life, the feedback motors would need to be placed somewhere besides the fingertips because fingertip placement would interfere with reading braille and with tasks requiring dexterity. An alternative location for the motors is on the top of the finger or on the bottom of the finger but closer to the palm.

Algorithms

Stereo vision

The images captured from the stereo cameras are used to generate the 3D data which needs to be conveyed to the user. Before going into the details of the algorithm, the theory behind this process is covered. Each camera is modeled as a pinhole camera, which assumes that all the rays of light which strike the imaging plane pass through the same point, called the focal point. The focal length, f, is the distance
between the focal point and the imaging plane. Knowing the dimensions of the sensor on the imaging plane allows the field of view to be calculated using simple geometry as in (4-1) and (4-2). These parameters can also be used to determine where a point in 3D space would be projected onto the imager using the equations below

x = f \frac{X}{Z} + c_x    (4-3)

y = f \frac{Y}{Z} + c_y    (4-4)

where c_x and c_y are the x and y coordinates of the principal point; X, Y, and Z are the coordinates of the point in 3D space; and x and y are the location of the projected point on the imager (the principal point is roughly the center of the imager, where a ray would strike if it passed through the focal point perpendicular to the image plane). These equations can be written in matrix form as follows

(4-5)

Or as

(4-6)

where M is called the camera matrix. The elements of the camera matrix can be scaled to convert the 3D point to a 2D location on the imager in units of cm or in units of pixels, which is usually more useful. These elements of the camera matrix are often referred to as the intrinsic properties of the camera. Since camera construction and lens geometry
are not perfect, there will always be non-ideal aspects of a camera that create distortions in images. Parameters that describe these distortions, known as distortion coefficients, can be found through regression; they are generally grouped with the intrinsic properties, whereas the extrinsic properties describe the pose of the camera.

Essential to understanding how stereoscopy can be accomplished is understanding the geometry of the problem. One feature that can be useful in describing how two cameras view the same scene is called the epipolar line, and Figure 4-23 will be a helpful reference for visualizing this. The epipolar plane corresponding to a point in space, P (green dot in Figure 4-23), is the plane which contains P and the two camera focal points (the blue dots). The epipolar lines (shown in red) are then defined by the intersection between the epipolar plane and the image planes. This is useful because if the location of a point P is known in one of the images, then its location in the other image must be somewhere along the corresponding epipolar line. And, of course, since the cameras are not coincident, the epipolar line will generally be different in each image.

The fact that the location of a feature in one image will be somewhere along the corresponding epipolar line in the other image is very useful to stereoscopy. This means that the correspondence step, which attempts to match features in one image with the same features in the other image, can be performed along a one dimensional line instead of over the entire two dimensional image, saving computational time. In practice, it is easier to rectify (by remapping) the two images into another image space so that epipolar lines are all horizontal and the search for corresponding pairs can be performed along a single pixel row of the image. To be able to achieve this, the translation between the two cameras, T, and the rotation matrix describing the
rotation between them, R, must be measured accurately. This is achieved by analyzing pairs of images with easily identified points in a known arrangement that are visible in both frames. Typically, to perform calibrated stereo vision, a known pattern such as the chessboard pattern shown in Figure 4-24 is held in front of the stereo cameras and the locations of the corner points are found in both images. Since the pattern shape and size are known, the position and orientation of the pattern relative to the camera can be estimated for each image. The estimated position and orientation of the calibration pattern for each image can then be used to estimate the critical parameters for stereo processing, which are the rotation matrix, R, and translation vector, T, that describe the orientation and translation between the two cameras. The following equations show how the rotation matrix and the translation vector are calculated and use the following naming conventions: denotes the rotation matrix that A P denotes a vector P coordinate system, and T A/B denotes the vector coordinate origin coordinate origin. Let L be the coordinate system of the left camera, R be the coordinate system of the right camera, and C be the coordinate system attached to the 2D calibration pattern. coordinate system, C P system using the following equations

(4-7)

(4-8)
And the same relationship between the left and right camera coordinate systems exists:

(4-9)

The rotation matrix between the two cameras is easily recovered using the following identity

(4-10)

Now substituting (4-9) into (4-7) produces

(4-11)

and rearranging gives

(4-12)

Since , multiplying both sides by gives

(4-13)

which finally reduces to

(4-14)

The actual values for the rotation matrix and translation vector between the two cameras are empirically determined by analyzing multiple image pairs with the tracking pattern in various poses and selecting the values of R and T that minimize the re-projection error for both cameras. At this same step, the distortion coefficients for both of the cameras can also be estimated.

After the relative pose of the cameras is computed, the next step is to calculate the image rotations and translations that are necessary to rectify the images and create epipolar lines that are horizontal and collinear in order to make the stereo matching problem easier. These image adjustments create a new projection matrix, P, for each camera which can be used to convert points from 3D space to 2D image coordinates, as well as a re-projection matrix, Q, which converts points in the disparity map to 3D space given the disparity, d, at that point as shown below where

(4-15)

The steps of undistorting the image (making linear features also remain linear in the image), rotating the image, and cropping the image can be combined into a single remapping table that can be used to remap and interpolate the pixels from the original image into the undistorted and rectified image. Once this remapping table is computed, performing these steps sequentially is shown in Figure 4-25 although this is just
for visualization purposes, as these steps are performed simultaneously when remapping the image. To perform the calibration of the cameras and the undistortion and rectification of the images, the popular open source software library OpenCV was used. After the cameras have been calibrated using a known pattern and the undistortion/rectification maps have been computed, the remapping functions can be applied to every new pair of images that are acquired with the cameras, and the epipolar lines should be nearly coincident in these rectified image pairs.

The next step is to actually process the images by finding the corresponding points in the images and computing the disparity. There are many ways of solving the correspondence problem, as was mentioned in Chapter 3, but one of the simplest and most popular methods, called block matching, was used for this application. OpenCV was also used for this step of the process. Recall that finding the correspondence requires matching features between the two stereo images. In the block matching method, this is done by taking a horizontal slice of the left image and sliding it across the same row in the right image. At each incremental step, the difference between the pixels of the sliding window and the pixels of the image underneath the window is computed by summing the absolute differences of the pixel intensities as illustrated in Figure 4-26. The offset or disparity that minimizes this difference measure is selected as the best estimated disparity for this particular pixel in the image. The process is then repeated for the next pixel until the entire row has been processed and until all the rows have been covered.
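The sliding-window search described above can be illustrated with a short Python sketch. It is a simplified illustration rather than the OpenCV block matcher used in this work: it assumes rectified images, operates on a single pixel row with a one dimensional window instead of a two dimensional block, and omits the texture and uniqueness checks. The function name `match_row` and its parameters `window` and `num_disparities` are hypothetical names chosen for this example.

```python
def sum_abs_diff(a, b):
    """Sum of absolute differences between two equal-length pixel windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_row(left_row, right_row, window=5, num_disparities=8):
    """Estimate a disparity for each pixel of one rectified image row.

    Convention assumed here: a feature at column x in the left image appears
    at column x - d in the right image, with d >= 0.
    Returns a list holding the best disparity per pixel, or None where the
    window does not fit inside the row.
    """
    half = window // 2
    n = len(left_row)
    disparities = [None] * n
    for x in range(half, n - half):
        patch = left_row[x - half:x + half + 1]
        best_d, best_cost = None, None
        for d in range(num_disparities):
            xr = x - d
            if xr - half < 0:
                break  # the shifted window would fall off the left edge
            candidate = right_row[xr - half:xr + half + 1]
            cost = sum_abs_diff(patch, candidate)
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x] = best_d
    return disparities
```

For example, if `left_row` is a copy of `right_row` shifted three pixels to the right, the pixels whose windows lie fully inside the shifted region are assigned a disparity of 3. The nested loops also make the cost structure visible: the work grows with the number of pixels multiplied by the number of disparities tried.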
There are several adjustable parameters that govern exactly how the block matching algorithm works. One parameter is the window size. Increasing the window size will tend to smooth out the disparity map and create fewer areas without disparities, but increasing the window also causes a loss of detail in the disparity map. For this application, it is more desirable to have a smoother and more continuous disparity map while sacrificing some detail. Through trial and error, a value of 17 pixels for the window size was found to give the best results.

Another parameter, the minimum disparity, determines at what disparity the sliding window starts. Normally this can be set to zero since all the points in the left image will be at the same location or further right than all the corresponding points in the right image. However, if the cameras are not perfectly calibrated, objects which are very far away from the camera may have negative disparity values. A value of zero was selected for the minimum disparity and no reason was found to reduce this number.

The number of disparities is another critical parameter and it has significant trade-offs associated with it. This parameter controls the number of pixels over which the sliding window will be shifted while searching for the best match. If the number of disparities to try is too low, then close objects, which have larger disparities, will not receive a proper disparity estimate. However, if the number of disparities to try is raised, the computational expense is increased greatly because this involves computing that many more difference measurements, and this increased expense is multiplied by the number of pixels in each image.

Given the maximum disparity value and the camera parameters, the minimum detectable range for objects can be estimated using the geometry portrayed in Figure 4-27. The distance to an object, D, is approximated in
(4-16), and (4-17) relates the disparity to the difference in the angles measured from the left and right cameras.

(4-16)

(4-17)

Rearranging (4-17) gives the following equation which approximates based on the disparity and the known camera parameters.

(4-18)

Substituting this back into (4-16) and setting the disparity to the maximum disparity yields an approximation for the minimum detection distance.

(4-19)

This distance is the closest at which an object will be detected by the stereo algorithm given a maximum disparity value. There are two ways to decrease the minimum range at which objects are detected, and each has its own tradeoff. One is that increasing the maximum disparity allows objects to be detected at a closer range, but requires more computational time. The other is that decreasing the camera separation allows objects to be detected at a closer range, but decreases the accuracy of the depth map due to ill-conditioned geometry. Table 4-1 lists four possible scenarios based on two different disparity limits (16 and 32 pixels) and two different camera separation
distances (67 mm and 100 mm). Since it is unacceptable to have a zone between the ed, the only acceptable configuration in this table is the second row. A minimum detection distance of 84 cm should be small enough that obstacles which are that close to the cameras are already close enough for the user to physically touch them with extended hands and therefore

Table 4-1. Relationship between disparity, camera separation, and minimum detection distance.

Two other important tunable parameters are the texture threshold and the uniqueness ratio. Since the correspondence step employs block matching to find matching features between the two images, it is necessary for there to be enough texture in the image to create a unique match with the corresponding location in the other image. The texture threshold specifies the minimum amount of texture that must be present within the window before the algorithm even attempts to find the corresponding position in the other frame. The uniqueness ratio specifies how unique a match between points in the images should be before it is actually used as the disparity for that point. The uniqueness ratio measures how much better the best match is than the second best match. Increasing these values will increase the quality of the data by reducing the number of incorrect
matches, but will also decrease the number of pixels which have assigned disparity values, leaving more blank area.

Primarily from trial and error, the values listed in Table 4-2 were determined to work best for the application. These gave disparity maps which were not too noisy, yet still had adequate coverage without too many holes in the map. The selected window size, which is at the higher end, also helped to fill in gaps and smooth out the data. And the maximum disparity value enabled the cameras to sense objects closer to the cameras so that objects are not dangerously omitted from the feedback field. Once the algorithm has performed the matching process over each row and over all the rows of the image, the disparities between the matches at each pixel are saved to a disparity map. Figure 4-28 through Figure 4-30 show some examples of the disparity map computed for various scenes.

Table 4-2. Listing of parameter values used in the stereo algorithm with a 480 by 360 pixel image size.

Once the disparity map has been created, the distance associated with any point in the map can be computed. This is accomplished using the information known about the camera parameters and the pose of the right camera relative to the left camera which was computed during the calibration step. Recall that the calibration and rectification steps produced a re-projection matrix, Q, that can be used as shown in (4-15) to convert elements in the disparity map to points in 3D space given the disparity at that pixel. Using this relationship, the disparity map can then be converted into a collection of 3D points. An example of this point cloud can be seen in Figure 4-31 and Figure 4-32, where a 3D viewer based on OpenGL has been written to display these points.

Ground removal

The ground removal step is important for the user experience, as was discussed earlier in the chapter. To accomplish this, some method of distinguishing between the ground and the other objects in the scene is required. Unfortunately, the straightforward approach of simply finding the lowest points in the point cloud and assuming that these represent the ground will not work since the head mounted cameras could be oriented coordinate system. Additionally, since no inclinometer was used, the orientation of the head relative to the gravity vector is unknown.

There are several characteristics of the ground that can be used to identify it: 1) limited range (about ±30°), 2) it has little curvature on the scale of the search (~7 m), and 3) there is nothing that can be detected below the ground. An approach which takes advantage of the third feature was implemented by using a technique presented in  . This method classifies points as terrain if and only if there exist no other points in a downward facing cone extending from the candidate point. This worked somewhat, but because the point cloud had a fair amount of variation in the ground surface, it resulted in very spotty classification accuracy. Also, since the angle of the ground plane could be quite tilted relative to the cameras, the test cones had to be made fairly narrow which
increased the false positive rate among other features like walls and vertical or slightly slanted surfaces.

Another technique was tried which, rather than attempting to classify individual points as either ground or not ground, instead attempts to extract a parameterized model of the ground from the data. To do this, the Hough plane detection method was implemented to find the plane that best fits the point cloud data. In order to prevent the algorithm from selecting a plane that fits other flat surfaces such as walls or ceilings, constraints were placed on the search to limit the best-fit plane to one that could possibly be the ground.

The Hough plane detection algorithm takes a collection of points in space and finds planes that fit the greatest number of points. This is different from regression, which attempts to minimize the error between the plane and all the points in the set. It instead finds candidate planes that best represent the points in the set. Figure 4-33 will be helpful in understanding the problem. In this figure, the axes represent the coordinate system in which the points are represented. The candidate planes are parameterized by an angle about one coordinate axis, an angle about the modified z axis, and the distance of the plane from the origin (measured normal to the plane). Any point has an infinite number of planes passing through it, but for a given pair of angles, the distance $\rho$ from the origin to the plane through the point $p$ with that orientation can be determined using:

$$ \rho = \hat{n} \cdot p \tag{4-20} $$

where $\hat{n}$ is the unit normal vector to the plane and is determined by:
(4-21)

The candidate planes are selected by aggregating votes for each of the possible planes through a binning process. The space of all possible planes is divided into bins, one for each combination of the discretized plane parameters. For each point, the plane passing through that point is computed for every combination of the orientation angles, and each time a plane falls within a bin, the value of that bin is incremented. After this process is performed for all of the points, the bins with the highest values describe the planes that best fit the data. For this implementation, the parameters used to generate the bins are given in Table 4-3. After all the points are added and the bin values are accumulated, the plane that best estimates the ground is assumed to be the one whose bin has the highest number of votes, as long as that number is above a certain threshold. If it is below the threshold, then it is assumed that the ground plane could not be detected and none is used. The range of possible planes is limited by the selection of the range of the parameters for the bins. Selecting a reasonable range implicitly restricts the returned ground plane to a range of orientations reasonable for the head-mounted camera system. If the ground falls outside this range, the ground plane will not be detected properly.
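The voting procedure described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in this work: the angle convention in `plane_normal` (an azimuth and an elevation with the z axis pointing up), the function names, and the bin sizes in the usage note are all assumptions chosen for the example, and they do not reproduce the values in Table 4-3.

```python
import math
from collections import Counter

def plane_normal(theta, phi):
    """Unit normal for azimuth theta and elevation phi (assumed convention)."""
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))

def hough_ground_plane(points, thetas, phis, rho_step, min_votes):
    """Vote over (theta, phi, rho) bins; return the winning plane or None.

    Limiting thetas and phis to plausible ground orientations is what
    keeps walls and ceilings from being returned as the ground.
    """
    votes = Counter()
    for p in points:
        for ti, theta in enumerate(thetas):
            for pj, phi in enumerate(phis):
                n = plane_normal(theta, phi)
                # Distance from the origin to the plane through p, as in (4-20).
                rho = n[0] * p[0] + n[1] * p[1] + n[2] * p[2]
                votes[(ti, pj, round(rho / rho_step))] += 1
    if not votes:
        return None
    (ti, pj, rk), count = votes.most_common(1)[0]
    if count < min_votes:
        return None  # no bin is strong enough: treat the ground as undetected
    return thetas[ti], phis[pj], rk * rho_step, count
```

For example, calling `hough_ground_plane(points, thetas, phis, 0.1, 50)` with `phis` restricted to a few values around 90 degrees only ever returns near-horizontal planes, so a scene containing only a vertical wall yields `None` rather than a wrong ground estimate.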
Table 4-3. Plane parameters and their bin sizes for the Hough planes.

Once the ground plane is successfully detected, it can be used to remove those points that are part of the ground from the point cloud data set. This is simply a matter of testing each point and determining whether it is above or below the plane at that location. However, since there is variance in the point cloud data due to sensor inaccuracy and noise, an additional margin is required above the plane so that anything above this margin remains in the set used for the feedback field, and all the points below are excluded. Figure 4-34 shows the disparity map after the ground plane has been detected and shifted up by some margin, with all the points that lie below this plane shaded in red. Removing the red-shaded data from the feedback field makes it much easier for the user to distinguish walking hazards from the ground itself. Figure 4-35 shows the ground plane extracted in a scene where bicycle racks could be easily discerned from the feedback field.

Hand tracking

The next step required for presenting the information to the user is detecting the location and orientation of the hands. For this step, the head-mounted hand tracking camera is used to capture images in which the hand-mounted markers are detected. To accomplish the marker detection, the ArUco library written by Rafael Muñoz Salinas was used and adapted. The algorithm works by applying an edge filter to the image and detecting all the quadrilaterals which could represent the
outline of the marker. Then, each of these candidate quadrilaterals is tested to see whether it is a valid marker by looking at the inner region and determining if the pattern inside encodes a valid marker ID. If the inner region does in fact contain a valid marker pattern, then the position and orientation of the marker can be determined using the information known about the actual size and shape of the marker.

The position and orientation of a pattern with a known shape can be computed using the homography matrix. The homography matrix relates points represented in the marker's coordinate system to the corresponding points on the image plane. If we represent the points in homogeneous coordinates, and let the point on the image plane be written as

$$ p = \begin{bmatrix} x & y & 1 \end{bmatrix}^{T} \tag{4-22} $$

and the point on the marker be written as

$$ P = \begin{bmatrix} X & Y & Z & 1 \end{bmatrix}^{T} \tag{4-23} $$

then the two points are related by

$$ p = s\,M\,T\,P \tag{4-24} $$

where s is a scaling factor, M is the camera matrix, and T is the transformation matrix which takes coordinates in the marker's coordinate system to the camera's coordinate system. Written out in expanded form, the equation is:
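$$ \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = s \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{4-25} $$

Now since the point P will be a point on the 2D marker and will be represented in the marker's coordinate system, its Z coordinate is zero and the third column of the rotation drops out, so (4-25) can be rewritten as

$$ \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = s\,M \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = s\,H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \tag{4-26} $$

where the product of the two matrices, H, is called the homography matrix. This relationship can be used in reverse to find the position and orientation of the marker given the location of at least four non-collinear points on the marker surface. By supplying the location of the four corners of the marker in the coordinate system attached to the marker, together with their detected locations in the image, the homography matrix can be calculated (or estimated through an optimization algorithm if more than four points are provided). And since the camera matrix is known, the rotation and translation information can be factored out of the homography matrix to reveal the pose of the marker plane relative to the camera. Writing the columns of H as $h_1$, $h_2$, and $h_3$, and choosing the scale $\lambda = 1 / \lVert M^{-1} h_1 \rVert$ so that $r_1$ has unit length (the homography is only recovered up to scale), the pose follows from the equations

$$ r_1 = \lambda\,M^{-1} h_1 \tag{4-27} $$

$$ r_2 = \lambda\,M^{-1} h_2 \tag{4-28} $$

$$ t = \lambda\,M^{-1} h_3 \tag{4-29} $$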
where $r_1$ is the first column of the condensed T matrix, $r_2$ is the second column, and $t$ is the third column. Once a marker has been successfully detected and the corner points determined, the process of determining the pose of the marker is straightforward. Unfortunately, the process of extracting the marker from the image is fairly computationally intensive. The algorithm works by applying an edge filter to the image, then converting all the raster edge data into a vector format, simplifying the vector representation where possible, and searching through the list of connected vector edges to find those that are quadrilaterals. Next, all of the quadrilaterals are tested to see if a valid marker is inside. The processing burden arises primarily because noisy or high-contrast images produce a lot of edges when passed through the edge detection filter. This in turn produces a lot of data for the software to approximate with vectors and search through.

To mitigate this bottleneck in processing the whole image, a method was devised to process only the subset of the image where the marker is expected to be. This is done by searching for another feature. A red dot on a black background is placed just below the marker, and its location is found using a computationally simpler process. The feature is found by first down-sampling the image (to further speed up this step) and applying a filter which takes every pixel and subtracts from it the values of the pixels located n pixels above it, n pixels below it, and n pixels to the left and right of it. This effectively produces an image where bright spots surrounded on all sides by dark regions are highlighted and all other areas tend to have
pixel values pushed towards zero. To further refine the results, only pixels with a significant amount of red are kept. This proved to be a very simple and robust way of finding the red dot on the black background and worked well in different lighting conditions and viewing angles. Once the location of the dot is determined, the part of the image just above it is unmasked to reveal the area where the hand tracking marker should be, as shown in Figure 4-36. Applying the ArUco marker detection algorithm to this masked image showed speed improvements that scaled with the amount of area that was masked (typically about 1/2 to 3/4 of the image is masked). Two examples of the marker detection algorithm output can be seen in Figure 4-37 and Figure 4-38, where the coordinate axes of the marker/hand have been overlaid on the image.

Collision detection

After the stereo camera algorithm has been used to obtain 3D data of the environment and the hand locations have been estimated using the optical marker detection algorithm, the next step is to scale the environment data appropriately and perform collision detection between the hands and the feedback field. The desired behavior is to provide feedback to a finger if it touches or penetrates part of the sensed geometry in the environment. Since the actual finger locations are not measured, their positions are inferred from the pose of the hand by assuming that the fingers are extended and spread out evenly. In this way, the positions of all ten digits are estimated given an image in which both hand tracking markers are visible.

The scaling and translation of the environment data are chosen to create a configuration where the ground level in the feedback field corresponds to having the hands at about waist level and where the geometry that is about 5 m away corresponds
to having the hands extended. Whether the finger positions are scaled up or the sensed geometry is scaled down is indistinguishable. Similarly, either the finger positions can be shifted in one direction or the sensed geometry can be shifted in the opposite direction, and the effect will be the same. The scaling and alignment are accomplished by tuning three parameters. The first is the vertical shift of the hand positions (along the y axis), the second is the forward shift of the hand positions (along the z axis), and the third is the scaling applied to the sensed geometry. By setting these parameters appropriately, the relevant part of the scene can be positioned centrally in the region the hands can reach.

The transformation is performed as follows. The coordinates of the fingers (which are assumed to be extended) are transformed from the hand coordinate system to the hand tracking camera's coordinate system using the hand rotation, R, and translation, t, recovered from the hand markers. This transformation takes the position of the i-th finger, $P_i$, from the hand coordinate system into the hand tracking camera's coordinate system to obtain ${}^{H}P_i$ as shown below

$$ {}^{H}P_i = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} P_i \tag{4-30} $$

These finger positions, which are represented in the hand tracking camera's system, can then be transformed to the coordinate system of the stereo cameras to get ${}^{S}P_i$. Writing ${}^{S}T_{H}$ for the transformation that accounts for the orientation difference between the two camera systems shown in Figure 4-39:

$$ {}^{S}P_i = {}^{S}T_{H}\,{}^{H}P_i \tag{4-31} $$
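The chain of transformations in (4-30) and (4-31) can be illustrated with a short sketch. It is a minimal example under stated assumptions rather than the code used in the prototype: the camera-to-camera transformation is modeled here as a pure rotation about the x axis by a hypothetical `tilt` angle, the vertical and forward shifts are applied as a y and z translation, and all function and parameter names are invented for the illustration.

```python
import math

def mat_vec(T, p):
    """Multiply a 4x4 homogeneous transform by a homogeneous point."""
    return [sum(T[r][c] * p[c] for c in range(4)) for r in range(4)]

def rigid(R, t):
    """Build a 4x4 transform from a 3x3 rotation (list of rows) and a translation."""
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def rot_x(angle):
    """Rotation about the x axis (assumed form of the camera-to-camera tilt)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def finger_in_stereo_frame(p_hand, R_hand, t_hand, tilt,
                           y_offset=0.0, z_offset=0.0):
    """Map a finger position from the hand frame to the stereo camera frame."""
    # Hand frame -> hand tracking camera frame, using the marker pose.
    hand_to_cam = rigid(R_hand, t_hand)
    # Hand tracking camera frame -> stereo camera frame, plus tunable shifts.
    cam_to_stereo = rigid(rot_x(tilt), [0.0, y_offset, z_offset])
    p_cam = mat_vec(hand_to_cam, list(p_hand) + [1.0])
    return mat_vec(cam_to_stereo, p_cam)[:3]
```

Because the shifts enter only through the last transform, the same finger estimate can be repositioned in the feedback field by changing `y_offset` and `z_offset` without touching the marker pose.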
But since it is desirable to position the hands in a convenient zone in the stereo data, as shown in Figure 4-40, an artificial offset in the y and z directions is added to situate the finger locations in a better region. By adding these modifications to (4-31), the transformation becomes:

$$ {}^{S}P_i = {}^{S}T_{H}\,{}^{H}P_i + \begin{bmatrix} 0 & y_o & z_o & 0 \end{bmatrix}^{T} \tag{4-32} $$

where ${}^{S}T_{H}$ is the transformation of (4-31) and $y_o$ and $z_o$ are the artificial offsets.

Once the finger positions are determined in the stereo camera coordinate system, the collision detection can be accomplished in one of two ways. The first approach is to convert the stereo data to a point cloud and test whether there are any points between the 3D position of each finger and the body. The other approach is to map the 3D location of the finger to the corresponding 2D location in the disparity map using the projection matrix, and then test whether the z coordinate of the finger is less than or greater than the z coordinate of the object given by the disparity value at this point. If the z coordinate of the depth map is less than that of the finger, then the finger has penetrated the sensed geometry and the corresponding feedback motor should be activated. It was this second approach that was implemented because it is more efficient than searching through a 3D point cloud (although such searches could still be made quite efficient since the point cloud is an ordered point cloud).

The first step of mapping the 3D finger location to the corresponding 2D location in the disparity map's frame is performed using the projection matrix obtained when the stereo cameras were rectified. This projection matrix, P, can be used to obtain the pixel coordinates of the disparity map that correspond to a particular point in space as follows:
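$$ \begin{bmatrix} u' \\ v' \\ w \end{bmatrix} = P\,{}^{S}P_i, \qquad (u, v) = \left( \frac{u'}{w}, \frac{v'}{w} \right) \tag{4-33} $$

Now that the location in the disparity map is identified, the actual distance corresponding to this disparity needs to be calculated. Using the re-projection matrix, Q, the depth to the object can be computed from the disparity as shown below:

$$ Q \begin{bmatrix} u \\ v \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix}, \qquad \text{3D point} = \left( \frac{X}{W}, \frac{Y}{W}, \frac{Z}{W} \right) \tag{4-34} $$

But since only the depth to the object must be tested to determine if the finger has intersected or penetrated the sensed object, only the formula for the z coordinate is needed, which is found using (4-35) below

$$ z = \frac{Z}{W} = \frac{-f\,T_x}{d - (c_x - c_x')} \tag{4-35} $$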
where f is the focal length, $T_x$ is the horizontal separation of the cameras, d is the disparity value, $c_x$ is the x coordinate of the principal point in the left image, and $c_x'$ is the x coordinate of the principal point in the right image. Since the feedback field is a scaled-down version of the environment, the actual test is performed after scaling the sensed geometry distance by a tunable scaling factor, k, as follows:

$$ k\,z < {}^{S}z_i \tag{4-36} $$

where ${}^{S}z_i$ is the z coordinate of the i-th finger in the stereo camera coordinate system.

Now the transformed location of each finger can be tested to determine whether or not it has penetrated a piece of the sensed geometry from the environment, and this state can be conveyed to the user using the vibration feedback. The entire process of measuring the pose of the hands, transforming and shifting the finger coordinates, and performing the collision detection is done every time a new set of images is acquired. An example of this process can be seen in Figure 4-41, where the hands are shown on the right side with coordinate axes overlaid on the image and the shifted positions of the fingers are shown on the disparity map on the left. The green boxes represent fingers which are not intersecting any of the sensed geometry and the red boxes represent fingers which are intersecting the sensed geometry. As the user moves his or her hands about this space, the shape of the surrounding objects should become apparent.

Software process outline

The entire software architecture is designed to run as fast as possible to provide the best experience for the user. The multi-threaded code takes advantage of multiple processor cores to reduce processing time. The camera interface code, for
example, is designed with two image buffers so that as one is copied from, the other can be written to as a new frame is acquired from the camera. The stereo camera image processing is also performed in parallel with the hand tracking image processing so that these steps do not increase the loop time by being performed sequentially.

Figure 4-42 shows how the data is processed within the software architecture. There is a class devoted to handling everything related to hand tracking, called the hand tracking manager, and another class devoted to handling everything related to the stereo image processing, called the stereo camera manager. The hand tracking manager first grabs the latest image from the EyeCam interface, then detects the red dots in the image and uses those locations to mask the image for better speed during the marker detection. The markers are detected and the resulting rotation vector is converted to a rotation matrix and combined with the translation vector to form the transformation matrix which relates the hand to the camera. This transformation matrix is then combined with the transformation matrix which relates the hand camera to the stereo cameras and used to compute the finger positions in the stereo camera coordinate system.

Before the stereo camera manager can process images, the camera parameters determined from the calibration step must be loaded and the rectification parameters must be computed. Then, the mapping function which performs the un-distortion and rectification on each new image can be initialized. After these initialization steps are performed, the main loop can be run, which first grabs the latest pair of stereo images and un-distorts and rectifies them using the pre-computed mapping, then performs the correspondence step to generate the disparity map, and finally detects and
removes the ground plane from the data. Once both of these processes have been completed, the finger positions and the disparity map with the ground surface removed are used to perform the collision detection step. Finally, the feedback state is sent via a serial message to the microcontroller, which then activates the correct feedback motors on the gloves.

Figure 4-1. Sensor is positioned to capture the field of view in front of the user.
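The per-finger collision test from the collision detection step can be sketched as follows. This is a simplified illustration and not the prototype's code. It assumes a positive baseline length in place of a signed horizontal camera separation, equal principal points in the two images by default, a disparity map stored as a list of rows, and a disparity of zero or less marking pixels with no data (including removed ground); all names are hypothetical.

```python
def project(P, point):
    """Project a 3D point with a 3x4 projection matrix; return pixel (u, v)."""
    X, Y, Z = point
    h = [row[0] * X + row[1] * Y + row[2] * Z + row[3] for row in P]
    return h[0] / h[2], h[1] / h[2]

def depth_from_disparity(d, f, baseline, cx_left=0.0, cx_right=0.0):
    """Depth for a disparity value; baseline is a positive length (assumption)."""
    return f * baseline / (d - (cx_left - cx_right))

def finger_collides(finger, disparity_map, P, f, baseline, k):
    """True if the finger lies at or beyond the scaled depth of the scene."""
    if finger[2] <= 0:
        return False  # behind the camera: nothing to test
    u, v = project(P, finger)
    col, row = int(round(u)), int(round(v))
    if not (0 <= row < len(disparity_map) and 0 <= col < len(disparity_map[0])):
        return False  # finger falls outside the disparity map
    d = disparity_map[row][col]
    if d <= 0:
        return False  # no valid disparity at this pixel
    # Scale the sensed depth by k before comparing with the finger depth.
    return k * depth_from_disparity(d, f, baseline) <= finger[2]
```

Running `finger_collides` once per estimated finger position yields the ten on/off states that make up the feedback state sent to the gloves.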
Figure 4-2. Sensor capturing 3D shape of vehicle.
Figure 4-3. The sensed model is scaled down to create a feedback field that the user can interact with.
Figure 4-4. A user stands in front of a table. The scaled-down scene can be seen in blue to represent the virtual feedback field generated.
Figure 4-5. When the user's hands collide with part of an object in the feedback field, a tactile stimulus is created by the gloves and felt by the user.
Figure 4-6. An example of a marker which can be detected to judge position and orientation.
Figure 4-7. One of the vibration motors used in the feedback gloves.
Figure 4-8. An example of what the feedback field would look like as a user walks down a sidewalk.
Figure 4-9. The vibration feedback, illustrated as a spark, as a user stands in front of an outdoor scene and explores the feedback field with his hands.
Figure 4-10. Arrangement of the cameras used for sensing the environment and the hands. Their fields of view are also drawn.
Figure 4-11. Illustration of how the sensors' fields of view change to permit lateral exploration as the user turns his head.
Figure 4-12. An illustration of the scenario in which a short obstacle would be confused for the ground plane.
Figure 4-13. The vibration feedback in a confusing situation where a user's fingers intersect the ground plane and a short obstacle.
Figure 4-14. A more useful feedback state is provided when the ground plane is removed from the geometry which provides feedback.
Figure 4-15. The remaining feedback geometry once the ground plane is removed.
Figure 4-16. Detailed picture of fingers intersecting with a point cloud model of a doorway.
Figure 4-17. BlueFox cameras mounted in a horizontal stereo configuration.
Figure 4-18. The Sony Eye camera used for hand tracking.
Figure 4-19. The navigation aid worn by a user, May 4, 2014. Courtesy of Kristin Chilton.
Figure 4-20. The hand feedback interface electronics, which include a microcontroller board and a set of opto-isolators.
Figure 4-21. The connectivity diagram of the entire system. Signal lines are shown in blue and power lines are shown in red.
Figure 4-22. The gloves equipped with feedback motors on the fingertips.
Figure 4-23. Epipolar geometry. The blue triangle represents the epipolar plane and the red lines represent the epipolar lines on the image planes.
Figure 4-24. A chessboard tracking pattern used in the stereo calibration process.
Figure 4-25. A visual example of un-distorting and aligning the two stereo images in order to create collinear epipolar lines (shown in dashed lines).
Figure 4-26. The sliding window used to find the disparity between a feature in one image and the same feature in the other image.
Figure 4-27. Schematic illustrating the geometry of the minimum detection distance for a stereo algorithm with a maximum detectable disparity.
Figure 4-28. Image of a brick column and the disparity output from the stereo algorithm.
Figure 4-29. Image of some barricades in a grassy area and the disparity map output from the stereo algorithm.
Figure 4-30. Image of bike racks and the disparity map output from the stereo algorithm.
Figure 4-31. A point cloud generated from stereo images on the left and the original scene from the camera.
Figure 4-32. A point cloud of some boulders along a sidewalk on the left and the original from the camera.
Figure 4-33. Diagram showing the parameterization of a plane used in the Hough plane detection algorithm.
Figure 4-34. The results of the ground plane identification process on a scene with boulders along the sidewalk. The disparity map on the left has been shaded red where the elevations are near or below the detected ground plane.
Figure 4-35. The results of the ground plane identification process on a scene with bike racks on a walkway. The disparity map on the left has been shaded red where the elevations are near or below the detected ground plane.
Figure 4-36. Process of masking the image from the hand tracking camera. A) The original image. B) The image after subtracting pixels to the left. C) The image after subtracting pixels to the right. D) The image after subtracting pixels to the bottom. E) The image after subtracting pixels to the top. F) The original image masked except for the region above the remaining dot.
Figure 4-37. The hand markers being detected with the hand tracking camera in an indoor environment.
Figure 4-38. The hand markers being detected with the hand tracking camera in an outdoor environment.
Figure 4-39. Diagram showing the angle between the hand tracking camera and the stereo cameras.
Figure 4-40. Diagram showing the artificial offsets added to the hand tracking camera in order to situate the hands at the proper level in the stereo camera's reference frame.
Figure 4-41. The measured pose of the hands being used to map the finger positions to the disparity map and test for collisions. The green blocks represent finger positions which do not intersect sensed geometry while the red blocks indicate a collision. The top image shows the left pinky being stimulated and the bottom image shows the ring and middle finger being stimulated.
Figure 4-42. Software flow diagram including hardware interfaces.
CHAPTER 5
PERFORMANCE AND USER TESTING

Performance

The performance of the system was evaluated first on a subsystem level by gauging the performance of each of the individual components. The ability of the stereo vision system to sense the environment, the ability of the hand tracking system to accurately determine hand poses, and the effectiveness of the ground removal algorithm were investigated before evaluating the system functionality as a whole. These evaluations have a large qualitative aspect to them. For instance, in evaluating the performance of the stereo disparity maps, a ground truth disparity map was not available for the scenes, so the density and loss of detail are discussed instead. Similarly, for the ground removal algorithm, a visual inspection of the output classification should make its effectiveness obvious and forgo the need to compare the ground classification to a ground truth dataset. The next section on user testing provides a further assessment of the system's merits.

Stereo Vision

The stereo vision block matching algorithm was tuned to provide useful depth maps to use as a feedback field model. While tuning the parameters, the adjustment of the sliding window size revealed a tradeoff between a smooth, dense disparity map and a sharper disparity map having more missing values. Using a relatively large sliding window size of 17 pixels turned out to be advantageous both by creating a denser disparity map and by enlarging small features such as small limbs or thin sign posts, making them easier to
detect in the feedback field. An example of how the sliding window size helped to dilate thin objects is seen in Figure 5-1. The images from the synchronized BlueFox cameras were generally able to be converted to well-filled disparity maps by the stereo algorithm when the parameters were set to the values listed in Table 4-2. Figure 5-2 and Figure 5-3 show two typical examples of scenes in front of a user and the corresponding disparity maps. In these two examples, most of the disparity map is populated with values and few areas are blank as a result of lack of texture or lack of uniqueness in the matching step.

There are several failure modes, however. One scenario that can cause erroneous disparity values is the case where a horizontally repeating pattern causes the correspondence step to identify the wrong location in the other image. This case is shown in Figure 5-4, where the slats in the fence create a situation where several nearly identical matches between the images are possible, causing incorrect disparity values to be extracted. Another problem arises when very reflective surfaces, such as the car hood in Figure 5-5, fail to provide any distinguishing features on the surface for the image to capture. This is also a problem with mirrors and highly polished floors (although failure to detect the floor is not problematic since it would be removed from the feedback field anyway). Also, if an object does not have enough visual texture, such as the plain white wall shown in Figure 5-6, then the algorithm is unable to find strong correspondence matches between the two images and a depth value will not be determined. And even if a particular surface does have significant texture, such as the sidewalk in Figure 5-7, the algorithm will still fail if the image quality suffers from overexposed or underexposed areas.
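The repeating-pattern and low-texture failure modes can be reproduced with a toy one-dimensional matcher. The sketch below is a deliberately simplified stand-in for the block matching algorithm, not the implementation used here: it scores candidate disparities along a single image row with a sum of absolute differences and rejects a match when another, non-adjacent disparity scores nearly as well. The function names and the `uniqueness` margin are assumptions made for the illustration.

```python
def sad_costs(left_row, right_row, x, half_window, max_disparity):
    """Sum-of-absolute-differences cost for each candidate disparity at x."""
    costs = []
    for d in range(max_disparity + 1):
        if x - d - half_window < 0:
            break  # window would run off the left edge of the right row
        cost = sum(abs(left_row[x + o] - right_row[x - d + o])
                   for o in range(-half_window, half_window + 1))
        costs.append(cost)
    return costs

def best_disparity(costs, uniqueness=0.15):
    """Return the lowest-cost disparity, or None when the match is ambiguous."""
    best = min(range(len(costs)), key=costs.__getitem__)
    # Ignore immediate neighbours, which are naturally similar in cost.
    others = [c for d, c in enumerate(costs) if abs(d - best) > 1]
    if others and min(others) <= costs[best] * (1 + uniqueness):
        return None
    return best
```

With a uniquely textured row the true shift is recovered, while a row of alternating fence-like stripes or a row of constant intensity produces several equally good candidates and is rejected, which corresponds to a blank pixel in the disparity map.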
But generally, using the stereo vision system to sense the geometry of the environment provided excellent data for the feedback field and worked well both inside buildings and outside, providing a major benefit over a structured light sensor. The problems arising from poor image quality can be mitigated somewhat with better imaging hardware. Good optics and good sensors will produce sharp images with little motion blur and fewer underexposed and overexposed areas. The prototype system typically provided usable data whether the lighting conditions consisted of bright sunlight, the waning light of day, or indoor lighting.

Hand Tracking

The hand tracking algorithm was able to determine the six degrees of freedom of each hand by visually detecting the markers in the image from the camera. Of course, the space in which the hands are detected is limited by the field of view of the camera, and the range of orientations is limited by the viewing angle of the marker. Initial testing showed that when the hand tracking camera was mounted below the stereo cameras, the hands were often outside of the detection range. To improve the range, the camera was raised by about 30 cm. A more space-saving approach would be to use a camera with a wider lens, and this would certainly be done if such a system were commercialized. After raising the camera, the range of detection improved to the point that users were freer to explore the feedback field without going beyond hand tracking range. The range of position and orientation values of the hands (as measured from the camera's coordinate system) is presented in Table 5-1, which shows a lateral range of motion (x direction) of more than a meter, a forward range of motion (y direction) of about 60 cm (limited by arm reach), and a vertical range of motion of about 80 cm (also
limited by arm reach). The roll and pitch limits are also reasonable, with a range of about 130 degrees each. The accuracy of the distance measurement was evaluated at various distances and is presented in Figure 5-8 along with error bars showing one standard deviation.

The markers did quite well in user testing in terms of the detection rate, but because they were being detected visually, there were intrinsic limitations. The first is that bright light and shadows (see Figure 5-9) can cause the marker to go undetected because the edge finding step extracts the contour of the shadow instead of just the edges of the marker. This was primarily an issue when walking under tree branches while the sun was high and bright. A second issue is that any hand orientation which gave the marker a poor viewing angle would leave the marker undetected, and a third type of detection failure occurred when a user put one hand over the other and occluded the lower marker.

The ArUco marker detection update rate usually varied between 5 and 10 Hz and depended highly on the amount of clutter in the image. This process was in fact the slowest of all the computing processes, so any speed improvement to this step would immediately increase the overall update rate of the sense-process-feedback loop. (The stereo processing and ground removal are performed in parallel with the hand detection algorithm, but since the updated hand locations must be paired with the updated feedback field to generate a new feedback state, the code runs at the rate of the slower of the two operations.) While the visual marker tracking performed reasonably well for this application, alternative means of detecting the pose of the
hands could provide a better solution to the problem without changing the overall operation of the system.

Table 5-1. The range of position and orientation values the hand marker can have and still be detected by the head-mounted camera.

Ground Removal

The ground removal process is crucial to providing a feedback field that is understandable to the user. If the ground were not removed, it would be quite difficult to perceive a small object on the ground by any means other than slowly raising and lowering the hands while attempting to notice when certain fingers ceased to vibrate before others. But with the ground removed from the feedback field, objects are much more noticeable as the user explores the space. The performance of the ground removal is much more important in the area close to the user and less important farther away, because not all the geometry sensed by the cameras is within sensing range using the gloves.

In testing, the ground removal algorithm did quite well at detecting and removing the ground to isolate objects, and was robust to a wide range of camera orientations. Even small objects such as the kitten shown in Figure 5-10 were distinguishable from the ground, a feat that hardly any of the systems reviewed in Chapter 3 would have performed, since most would only detect and convey large obstacles in a 2D plane. The buffer region for the ground plane was tuned to strike a balance between consistently
removing the noisy ground surface and not removing actual objects and hiding their existence. The minimum detectable object height was tested by placing objects of different sizes on the ground and observing which were discriminated from the ground, as seen in Figure 5-11 and Figure 5-12. The test showed that for objects on a smooth surface, anything taller than 18 cm would remain in the feedback field. Some objects, such as the curb in Figure 5-13, are marginally detected and would not be felt entirely if the hand were in that area.

The limitations of the ground removal process stem from the fact that it assumes that the ground is a plane, so there will be variances between the model and the actual ground surface if any curvature or irregularities exist. If this discrepancy is large enough, it will cause higher parts of the ground surface to remain in the feedback field, or it will cause small objects in lower parts to be removed from the feedback field. Any errors in the point cloud created from the stereo camera data can also cause parts of the ground to be misclassified. Also, since the Hough plane detection returns the closest-fitting plane in a discretized space, the plane will have some position and orientation deviation from the actual ground surface. Based on the Hough space bin sizes listed in Table 4-3, the plane could deviate by up to 7.5 cm in position and 3 to 4 degrees in orientation. However, since the model only needs to be accurate close to the user, these variances are usually not significant enough to cause issues at close range. The model will often become invalid at farther distances, but that is usually beyond the region the user can reach in the feedback field.

System Functionality

The system as a whole achieved the intended functionality required to make the user experience concept a reality. All subsystems worked together to create an intuitive
tactile feedback field which could present users with an understandable model of the immediate environment. The update rate varied somewhat based on the environment, but was generally fast enough to provide a usable experience, usually staying in the range of 6-10 Hz. In some situations where the hand tracking image contained dense, cluttered texture, the marker detection process slowed the update rate to 3-4 Hz, but this was rare. On the finger feedback side, the vibration motors were determined to have a bandwidth of 15 Hz, meaning they could be turned off and on 15 times per second (with a 50% duty cycle) and still have clearly distinguished on and off states. Since this is faster than the update rate of the software, the motors did not limit the rate at which feedback was perceived.

In order to visualize how a person would use mental integration to make sense of the vibration feedback while exploring the space with his or her hands, several example scenes have been selected. A static depth map was used to create the feedback field and the hand locations were tracked over time as usual. The hands were waved around for a short time span and, whenever a finger received vibration feedback, a circle was drawn at that finger's location in the image with a color that corresponded to the finger. This image therefore contains all the depth information that would be known to the user from the finger feedback as it is built up over time. While this image shows the accumulation of information explicitly in a single image, it is just for visualization purposes; the task of creating a mental map over time is actually performed by the user in his or her mind. Figure 5-14 shows the actual feedback received while exploring a scene with some objects on the right hand side having a
range of distances, Figure 5-15 shows the feedback received while exploring a scene with a bridge and its guard rails, and Figure 5-16 shows the feedback received while feeling a sign and waste basket. In all these examples, the ground surface has been removed from the depth map and feedback field.

User Testing

User testing was performed to validate the effectiveness of the prototype system in enabling users to understand their surroundings. Volunteers were solicited to perform a variety of tasks using the prototype without being able to use their eyesight, relying solely on the information the system provided through the gloves. Very little training was performed before allowing the participants to conduct the tests, in order to verify the intuitiveness of the system. Each of the participants was informed about the basic operating principles of the system, such as how the environment was sensed and scaled down and how feedback to the fingers indicated the presence of an obstacle in that vicinity. They were also informed about the limitations of the system, such as the field of view limits on the stereo cameras and the hand tracking camera. They were then asked to don the system and were allowed to get a feel for how it worked by using their hands to explore the feedback field while standing in front of different scenes such as a light pole, a person, a box, or a fence.

One of the testing methods involved asking users to navigate an obstacle course without the use of sight by relying solely on the feedback from the gloves. The obstacle course consisted of 4-5 cardboard boxes placed at random in the path between the starting and ending locations. The box sizes varied between about 2-4 feet tall and between about 0.5-2 feet wide. The ending location was
communicated to the test subject by someone standing at the finish line who periodically called out to the subject to provide an auditory beacon for the ending spot. Another testing method involved testing the ability to perceive a three-dimensional picture of the environment by placing two objects in front of the user and asking him or her to identify which one was closer each time they were rearranged. A similar test was performed by using two objects of different heights and asking the user to identify the taller of the two each time they were rearranged.

Test 1 Outdoor Walkway

The first test was performed outside on a concrete walkway. Three test subjects were used for this test. First, the subjects were given information about how the system works and what kind of sensation to expect for different surrounding geometry and different hand positions. Then, before the test, each subject was allowed to try the device with their eyes open to get used to the feel of it. This entire process took only about 10 minutes, so experience with the device was relatively limited. While this step would not be possible for someone who is completely blind, the intention of including it in these tests was to speed up the training process. A person who is totally blind would require more advanced training to teach them the relationship between the scale of what they feel and the scale of the actual environment. Once each subject was comfortable with the operation of the system, the subject was asked to walk through an obstacle course of roughly 20 meters, trying to avoid any sensed obstacles and head towards the sound of the voice of the person at the finish line.

Test Subject A successfully navigated through the obstacle course without hitting any objects in 2 minutes 0 seconds. For this part of the test, the environment scaling parameter was set to the short range value, meaning the range at which the hands can
sense when fully extended is less than if the long range scaling value were used. In addition to the boxes, a tall thin object with a width of 4 cm, visible in Figure 5-17, was placed in the course and was successfully detected and avoided. After this test, the two other tests were performed, in which the subject was able to correctly identify the nearer of two objects 3 out of 3 times with about 5-10 seconds of haptic exploration. The objects were about 1.5 m away and the difference in their proximity from the user was about 60 cm. The subject was also able to correctly identify the taller of two objects 3 out of 3 times with about 5-10 seconds of hand movement, using objects which were 70 cm and 106 cm tall.

Test Subject B performed the obstacle course in 2 minutes 5 seconds, but used the environment scaling parameter set to the long range value. During the course, she grazed one of the boxes, and she was often observed feeling objects that were far away and trying to navigate away from them too early. This increased the time to complete the course, and changing the parameter to the short range setting allowed her to complete the reconfigured course faster.

Test Subject C performed the obstacle course in 1 minute 10 seconds using the short range scaling parameter. The subject was also able to correctly identify the nearer of two objects in the proximity test 3 out of 3 times. A variation of the obstacle course was also performed in which a person walked into the subject's path several times during the attempt to navigate the obstacle course, to test the ability to sense and react to a changing environment. The subject recognized the sudden appearance of the person and stopped in time to avoid him all three times.
Some example images collected during these tests are shown in Figure 5-18 through Figure 5-24. On the left is shown the view from one of the stereo cameras, with the view from the hand tracking camera as an inset so that the pose of the hands can be seen (as in the previous images, the detected marker axes are overlaid on the image). On the right is the disparity map, where brighter areas indicate greater disparity and closer distance. The dark red shading of the disparity map indicates the detected ground plane, and these areas are not included in the collision detection. The locations of the fingers are also drawn over the disparity map; the green (lighter) color means that the finger is not colliding with any of the scaled geometry and the red (darker) color means that the finger is colliding with some part of the scaled geometry.

Figure 5-18 shows part of the obstacle course, where the ground removal algorithm has clearly isolated the geometry that poses a risk to walking, such as the low brick wall. Since the user's hands are not extended very far, she is feeling the area close to her and, feeling no feedback, knows that this area is clear of obstructions. Figure 5-19 shows an example of the user extending the left hand and feeling the box with the thumb and index finger. In Figure 5-20, the user is feeling a thin 4 cm object with the left index finger. Thin objects can be readily detected because the stereo vision algorithm is configured with a larger sliding window size, which tends to dilate objects in the disparity map. This is a definite advantage over sonar-based approaches, which could miss thin objects. An example state of a user performing the distance test is shown in Figure 5-21, where the user is extending her hands until the fingers first start vibrating. The image of the hand locations shows that the right hand is extended farther before feeling first contact, so the user can perceive that the object on
the right is farther away than the one on the left. An example state from the moving person test can be seen in Figure 5-22, where a person walking towards the user is detected with the outside of the extended left hand, giving the user enough information to avoid the person. Figure 5-23 shows a test subject feeling a brick column and Figure 5-24 shows the detection of a picnic table.

Test 2 Outdoor Breezeway

The obstacle course was again performed in an outdoor breezeway. Test Subject D used the system with his eyes uncovered for less than ten minutes to get an idea of how the system operates. After this brief training period, he was asked to navigate an obstacle course which was not known to him, using the short range scaling setting. The breezeway was set up with cardboard boxes for obstacles, but it also contained some protruding corners to be avoided. The 12 m long course was completed in 1 minute 40 seconds on the first attempt and, after being reconfigured, in 1 minute 5 seconds on the second attempt. A scene from this test can be seen in Figure 5-25. Note that in this figure the plain white column, which lacks sufficient texture, is not fully detected by the stereo algorithm, leaving gaps in the disparity map which could cause confusion for the user. Fortunately, the corners are easily detected and can therefore be sensed.

Test 3 Indoor Hallway

The obstacle course was again performed in an indoor hallway using the short range scaling setting. Without seeing the arrangement of the boxes beforehand, Test Subject D, seen in Figure 5-26, was able to navigate the 25 m course through the hallway using only the navigation aid in 2 minutes 0 seconds. The test subject did brush the side of a box with his leg two times because, after initially detecting them, he
underestimated how far away they were, deviated early, and came back on course too early, causing his leg to brush the box. This highlights one of the drawbacks of the prototype implementation, which is that the lenses on the stereo cameras do not provide a wide enough field of view, so an obstacle that is close to the user but outside the field of view is not detected. Using a fisheye lens would help to alleviate this problem. The user then completed the proximity test by correctly telling the closer of two objects 3 out of 3 times and completed the height test by correctly telling the taller of two objects side by side 3 out of 3 times. Figure 5-27 shows the hallway with the obstacles in place, and Figure 5-28 shows the user detecting a wall on his right. The conditions in the hallway did cause some problems in the generation of the disparity map, such as some voids along the bright parts of the white walls and a failure to consistently detect the shiny floor. Neither of these problems was critical, however, since the walls could be detected by intermittent or noisy feedback, and the obstacles were still clearly detected even if the ground was not.

Test 4 Parking Lot and Wall Following

The obstacle course test was performed in a parking lot with Test Subject E, who had received about 15-20 minutes of test time before attempting the test. The tester was able to navigate 20 m through the course with boxes and a vehicle in her path in 2 minutes 15 seconds. The user at one point even entered a dead end and was able to successfully turn around and take a better path to the goal. Figure 5-29 shows the system being used to detect the front end of a car. The pure white, textureless regions of the car are poorly detected by the stereo algorithm, but there is enough texture in other parts to successfully feel that there is an object there. Another test was performed in which the user was asked to follow a wall of shrubs and fence line by walking
alongside them (Figure 5-30). The subject followed the wall of shrubs and fence for about 15 m in 2 minutes 0 seconds while walking at a cautious pace. Figure 5-31 shows a good example of the ground detection algorithm correctly detecting the tilted ground and leaving the bushes easily distinguished, and Figure 5-32 shows how the user can feel above the bushes to touch the fence by raising her right hand higher. Figure 5-33 shows an example of a high quality disparity map resulting from good texture visible throughout the scene.

Test 5 Indoor Meeting Hall

The obstacle course test was again performed in a meeting hall with Test Subject F, who had received about 15-20 minutes of test time before attempting the test. During familiarization and pre-test use, he tended to keep his head level when trying to detect close obstacles below waist level, which caused them to go undetected because of the field of view limitation on the stereo cameras. After being reminded a few times to keep his head tilted down to detect obstacles this close, he performed better. The obstacle course in this setting consisted of chairs that were arranged haphazardly around the room, as shown in Figure 5-34. The tester was able to navigate through the 12 m course using the prototype system in 2 minutes 15 seconds. Figure 5-35 and Figure 5-36 show the feedback received for two different scenes.

Test 6 Outdoor Trail

A more experienced user, Test Subject G, was able to use the system to navigate through a long outdoor trail that had trees and foliage on either side and included a narrow wooden bridge. Using both the short range and the long range scaling settings, the user was able to navigate about 80 m in 3 minutes using only the prototype system. The disparity map formation and ground plane
detection work very well in this type of outdoor natural setting, which has plenty of texture for the stereo vision algorithm. The bridge was easily crossed, as the system clearly conveyed only the side of the bridge to the user, as seen in Figure 5-37. The system also performed well at conveying the presence of trees, foliage, and other objects along the trail, as can be seen in Figure 5-38 through Figure 5-40. The ground plane detection and removal proved especially useful for situations such as the one shown in Figure 5-41, where a low retaining wall is correctly left in the feedback field as an obstacle, yet the sidewalk itself is shown to be clear, as indicated by the lack of feedback to the fingers in that area.

Test 7 Apartment Complex

Test Subject H was an individual who was legally blind, but since her blindness was not total, she still wore a blindfold during the tests. She was asked to use the system to navigate around various parts of an apartment complex. The subject was able to walk around a building, taking a route of about 130 m in about 3 minutes. The route required walking along a sidewalk while avoiding shrubs and cars on either side and then transitioning to walking along a dirt trail behind the building that had trees along the side. At one point in the test, the researcher stopped the test subject before she ran into a small tree, thinking it was not sensed, but apart from this one intervention, the subject completed the course without collision. A second route was attempted that consisted of walking through two traffic barriers, then along a sidewalk, then making a turn through an opening in a hedge of bushes, and then walking through a grassy area with trees on either side. This route of about 70 m took about 1 minute 50 seconds and was completed without any collisions.
After reviewing the recorded data later, it was observed that the hand positions were consistently low in the feedback field, so the user was not feeling a majority of the sensed geometry. This was probably due to a combination of longer arm length, a preferentially lower hand position, and a misadjusted orientation of the hand tracking camera. This could have been resolved with a larger vertical offset for the hand positions, as shown in Figure 4-40, but the parameters which map the hand positions to the feedback field should really be calibrated automatically for each individual, since stature and preferences vary.

Some example situations from the test are shown in the following figures. Figure 5-43 shows the user detecting some low shrubs to her left, and Figure 5-44 shows how the thin traffic barrier is distinguished from the ground plane and easily sensed by one of the fingers. Figure 5-45 shows a sign post being felt up and down to gauge its height, and Figure 5-46 shows the user detecting a tree to her left by extending her hand.

After using the system, this individual gave feedback suggesting the addition of another vibration motor on the bottom of the hand to indicate sharp changes in elevation such as curbs or stairs. An extension of this idea could be to affix a motor on the bottom and the top of the hand; the bottom motor could indicate a sharp jump upwards (need to step up a curb), and the top motor could indicate a sharp jump downwards (need to step off a curb). She also suggested making the vibration strength proportional to the closeness of the objects. While this was initially considered, the thought was that it would be too confusing if all motors were constantly vibrating at different strengths and the user was required to interpret both the relative strength of the vibration and the
position of the hands. This concept is discussed further in the future work section of the next chapter.

Summary

The user testing showed that the design has definite prospects of becoming a useful commercial device for blind navigation. While the prototype system had some limitations that kept the subjects from feeling a truly realistic feedback field modeled after the environment, changes could be made that would improve the user experience greatly. It is important to note that this system provided arguably more information, and made it more accessible, than any of the other navigation aids found in the literature review. While many of these systems did not present user testing results, those that did are discussed below.

The most similar concept was the virtual cane that used a force feedback glove to simulate hitting objects in a virtual environment without actually touching them. Even though that research did not involve actively sensing the environment and was intended for training in a virtual environment, the test subjects required 2 minutes to cross a street and were still limited to detecting objects like the curb using a single probing point. Also, this system relied on a tracking technology that was not mobile.

Trials using the Stereo Vision based Electronic Travel Aid (SVETA), which turned the stereo depth map into an audible signal, resulted in the user being able to walk around without hitting objects, but the testing was performed in a clean room with simple obstacles like a large table draped with a cloth. Corridor navigation was also successful, but no times were listed.
The stereo-based aid developed by Zelek et al., which conveyed object locations through vibration feedback, showed that users could navigate around large obstacles. The methodology for transforming the spatial information to the 3 vibration motors involved downsampling the depth map into 3 coarse regions, which throws away large amounts of data. Additionally, the actual user testing was performed with a human controlling the state of the vibration motors, as opposed to allowing their state to be driven by stereo vision data.

The navigation aid developed by Molton et al. was tested using only the vest-mounted sonar sensors and corresponding vibration motors. This device seemed to have the most success, probably because the vibration feedback was felt on the part of the body that was facing the obstacle, making the interpretation of the feedback very straightforward. The vest allowed someone to navigate through an obstacle course without hitting anything (no times or distances were listed), but the researchers stated that the user had to be encouraged to sweep with their body to detect obstacles to the side.

The system that created a 3D spatialized sound corresponding to objects detected by head-mounted sonars gave promising results, and users were reported to achieve a walking speed of 34 ft./min around obstacles. The relatively slow speed was attributed to a lack of familiarity with the feedback mode. Traveling through hallways using the device was reported to be achievable as well, although obstacles were not mentioned. This device does not detect low obstacles, however.
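For context, the 34 ft./min figure can be put in the same units as the course lengths and completion times reported for the prototype earlier in this chapter. The short Python sketch below performs only that arithmetic. The course lengths are the approximate values stated in the test descriptions, the run labels and function names are illustrative and not part of the prototype software, and the courses differed in layout, obstacle density, and user experience, so the resulting speeds should not be read as a controlled comparison.

```python
FT_TO_M = 0.3048  # one international foot in metres (exact by definition)

# (label, approximate course length in m, minutes, seconds) as reported in the text
RUNS = [
    ("Test 1, Subject A", 20, 2, 0),
    ("Test 1, Subject B", 20, 2, 5),
    ("Test 1, Subject C", 20, 1, 10),
    ("Test 2, Subject D, first attempt", 12, 1, 40),
    ("Test 2, Subject D, second attempt", 12, 1, 5),
    ("Test 3, Subject D", 25, 2, 0),
    ("Test 4, Subject E", 20, 2, 15),
    ("Test 5, Subject F", 12, 2, 15),
    ("Test 6, Subject G", 80, 3, 0),
    ("Test 7, Subject H, route 1", 130, 3, 0),
    ("Test 7, Subject H, route 2", 70, 1, 50),
]


def speed_m_per_min(distance_m, minutes, seconds):
    """Average speed over a course in metres per minute."""
    return distance_m / (minutes + seconds / 60.0)


def summarize(runs):
    """Map each run label to its average speed, rounded to 0.1 m/min."""
    return {label: round(speed_m_per_min(d, m, s), 1) for label, d, m, s in runs}


# 34 ft./min reported for the spatialized sound system, converted to m/min
SONAR_SPEED_M_PER_MIN = 34 * FT_TO_M

if __name__ == "__main__":
    for label, speed in summarize(RUNS).items():
        print(f"{label}: {speed} m/min")
    print(f"Spatialized sound system: {SONAR_SPEED_M_PER_MIN:.1f} m/min")
```

Because the reported times are rounded and the distances are approximate, differences of a few metres per minute between runs are within the uncertainty of the inputs.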
The wearable obstacle detection system for the visually impaired developed by Cardin et al. was another system that employed vibration feedback on the body based on body-mounted sonar sensors. Their user testing showed that people could navigate through a corridor and avoid people walking towards them, but as with similar devices, this system does not detect low obstacles since it only senses in a plane.

The system developed in this research was able to present users with a clearer picture of the environment because it made the geometric information easily accessible in a mode that was most natural. The only other devices that attempted to convey the full three-dimensional information gathered by the cameras were those that remapped the data to a series of tones, and these trials have not shown that users are able to fully comprehend these patterns. Those tones would certainly become less intelligible as the geometry of the environment becomes more complicated. Additionally, by presenting users with a model that isolated objects from the ground, the sensory experience was more useful than if the raw data had been used. The addition of the user-controlled scaling parameter, which adjusted how much the environment was scaled down for the feedback field, gave people the option to feel a more detailed model with a shorter reach, or to have a longer reach with less detail. Some users liked the short range scaling for the obstacle course and the long range scaling for more open areas. Overall, the system was shown to be successful in providing enough information to successfully navigate fairly complex environments and several
Figure 5-1. A thin sign post and a thin tree being dilated by the stereo block matching algorithm.

Figure 5-2. Example disparity map generated of a grassy area and a box.

Figure 5-3. A disparity map of a scene with large variation in lighting.

Figure 5-4. An example of a failure mode of the stereo matching algorithm when the image contains a tight, horizontally repeating pattern such as the slats of the fence.

Figure 5-5. An example of a failure mode of the stereo matching algorithm caused by shiny surfaces in the image such as the hood of the sports car pictured.

Figure 5-6. An example of a failure mode of the stereo matching algorithm caused by a surface with very little texture.

Figure 5-7. An example of a failure mode of the stereo matching algorithm caused by saturated portions of the image.

Figure 5-8. The error in the distance measured to the marker using the ArUco marker detection code. The error bars indicate one standard deviation.

Figure 5-9. One of the hand tracking markers fails to be detected due to the shadow and bright sunlight.

Figure 5-10. A cat participating in the research study is clearly distinguished from the ground plane in the disparity map.

Figure 5-11. Small objects being distinguished from the ground plane. From left to right the object heights are 13 cm, 18 cm, 24 cm, and 31 cm.

Figure 5-12. Small objects being distinguished from the ground plane. The smallest object of 13 cm is not discriminated from the ground and hence will not be included in the feedback field.

Figure 5-13. The low curb is somewhat distinguished from the ground plane, but not entirely.
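The behavior illustrated in Figures 5-11 through 5-13 can be summarized with a minimal Python sketch of ground removal using a buffer region around a fitted plane. This is a sketch under stated assumptions and not the prototype's implementation: the plane is assumed to be already available as a normal vector and offset (the Hough plane detection itself is not reproduced), the function name `remove_ground` is illustrative, and the default buffer of 0.15 m is a hypothetical value chosen only so that the example agrees with the observation that a 13 cm object was removed while an 18 cm object remained.

```python
import math


def remove_ground(points, normal, offset, buffer_m=0.15):
    """Keep only the points that lie more than buffer_m above the ground plane.

    The plane is assumed to satisfy normal . p + offset = 0, with the normal
    pointing up and away from the ground. buffer_m is a hypothetical tuning
    value, not a figure taken from the prototype.
    """
    length = math.sqrt(sum(c * c for c in normal))
    unit = [c / length for c in normal]
    d = offset / length
    kept = []
    for p in points:
        # Signed height of the point above the plane, in metres
        height = sum(u * c for u, c in zip(unit, p)) + d
        if height > buffer_m:
            kept.append(p)
    return kept


if __name__ == "__main__":
    # Flat ground at z = 0 with points at the object heights used in Figure 5-11
    cloud = [(1.0, 0.0, 0.02), (1.0, 0.5, 0.13), (1.5, 0.0, 0.18), (2.0, 0.0, 0.31)]
    print(remove_ground(cloud, normal=(0.0, 0.0, 1.0), offset=0.0))
```

A larger buffer removes more of the noisy ground surface but also hides shorter objects, which is the balance described in the ground removal discussion. If the true ground sits above or below the fitted plane, the computed heights shift by the same amount, which is why plane deviation can either leave ground in the feedback field or remove small objects from it.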
Figure 5-14. An illustration of the data gleaned from haptic exploration of the feedback field. A) The original scene. B) The depth map without the ground plane. C) The feedback received after 5 seconds of exploring. D) The feedback received after 10 seconds.

Figure 5-15. An illustration of the data gleaned from haptic exploration of the feedback field. A) The original scene. B) The depth map without the ground plane. C) The feedback received after 4 seconds of exploring. D) The feedback received after 10 seconds.

Figure 5-16. An example of the data gleaned from haptic exploration over time. A) The original scene. B) The depth map without the ground plane. C) The feedback received after 2 seconds of exploring. D) The feedback received after 4 seconds. E) The feedback received after 10 seconds.
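The accumulation images in Figures 5-14 through 5-16 were described as being built by drawing a circle at a finger's image location whenever that finger received vibration feedback. A minimal Python sketch of that bookkeeping is given below. The sample format `(time_s, finger, u, v, vibrating)`, the function name, the grid representation, and the circle radius are all assumptions made for illustration; the original visualization code is not reproduced here.

```python
def accumulate_feedback(samples, width, height, until_s, radius=1):
    """Build a canvas marking where each finger has felt feedback so far.

    samples: iterable of (time_s, finger, u, v, vibrating) tuples, where
    (u, v) is the finger's pixel location in the depth image. This format
    is hypothetical. Cells hold the name of the last finger that vibrated
    there, or None if no feedback has been felt at that location.
    """
    canvas = [[None] * width for _ in range(height)]
    for time_s, finger, u, v, vibrating in samples:
        if time_s > until_s or not vibrating:
            continue
        # Fill a small disc around the finger location, clipped to the image
        for y in range(max(0, v - radius), min(height, v + radius + 1)):
            for x in range(max(0, u - radius), min(width, u + radius + 1)):
                if (x - u) ** 2 + (y - v) ** 2 <= radius ** 2:
                    canvas[y][x] = finger
    return canvas


if __name__ == "__main__":
    log = [
        (1.0, "index", 2, 2, True),
        (3.0, "thumb", 5, 2, False),
        (7.0, "middle", 7, 3, True),
    ]
    for row in accumulate_feedback(log, width=10, height=6, until_s=10.0):
        print("".join(cell[0] if cell else "." for cell in row))
```

Calling the function with increasing values of `until_s` (for example 5 and 10 seconds) gives successive snapshots analogous to the later panels of each figure.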
Figure 5-17. Test subjects in the outdoor setting used for Test 1, May 7, 2014. Courtesy of Ryan Chilton.

Figure 5-18. An example scene from the Test 1 obstacle course. A) The view from one of the stereo cameras and an inset of the view from the hand tracking camera. B) The disparity map with the ground plane shaded and the finger locations drawn.

Figure 5-19. Example scene from Test 1 showing subject feeling an obstacle with the left hand. A) The view from a stereo camera and the hand tracking camera. B) The disparity map and finger location.

Figure 5-20. A thin object being sensed by the test subject. A) The view from a stereo camera and the hand tracking camera. B) The disparity map and finger location.

Figure 5-21. A user performing the distance comparison test. A) The view from a stereo camera and the hand tracking camera. B) The disparity map and finger location.

Figure 5-22. A user detecting a moving person walking towards them. A) The view from a stereo camera and the hand tracking camera. B) The disparity map and finger location.

Figure 5-23. A tester feeling a column while using the system to navigate through an outdoor area.

Figure 5-24. A tester feeling a picnic table while using the system to navigate an outdoor area.

Figure 5-25. A tester feeling the corner of a building in an outdoor breezeway.

Figure 5-26. A test subject using the prototype system to navigate an indoor obstacle course, May 8, 2014. Courtesy of Ryan Chilton.

Figure 5-27. The system being used in an indoor hallway.

Figure 5-28. A tester feeling a wall with the system.

Figure 5-29. A user feeling a car with the system.

Figure 5-30. A user performing the wall following test, May 9, 2014. Courtesy of Ryan Chilton.

Figure 5-31. A user feeling a wall of shrubs while trying to follow the wall.

Figure 5-32. A user feeling a fence beyond a shrub while trying to perform wall following.

Figure 5-33. A user feeling a fence while trying to perform wall following.

Figure 5-34. User navigating around chairs indoors, May 11, 2014. Courtesy of Ryan Chilton.

Figure 5-35. Feeling chairs with the system during an indoor obstacle course.

Figure 5-36. Feeling chairs during an indoor obstacle course.

Figure 5-37. A user walking along a bridge with the aid of the system.

Figure 5-38. A user feeling brush to the left along a wooded trail.

Figure 5-39. A user detecting a tree along a trail.

Figure 5-40. A user feeling a trash can marker and raising a hand to feel its height.

Figure 5-41. A user feeling a low retaining wall, but not the traversable ground.

Figure 5-42. An individual who is blind using the system to navigate through bushes, May 18, 2014. Courtesy of Ryan Chilton.

Figure 5-43. A test subject rounding a corner while feeling and avoiding a bush.

Figure 5-44. Two traffic barriers being detected by the stereo cameras. The one on the left is felt with the middle finger of the left hand.

Figure 5-45. An object being felt up and down to determine its height.

Figure 5-46. A small tree being felt with the outer fingers of the left hand.
CHAPTER 6
DISCUSSION AND CONCLUSIONS

Discussion

The attempt to design an intuitive navigation aid for people who are blind that gives them the ability to understand the 3D structure of their surroundings has been successful. That users can put on the system and use it to navigate with as little as ten minutes of training is a testament to the intuitiveness of the system. The ability to freely explore the feedback field with the hands also allows users to adopt different scanning modes that are better suited for different tasks. When walking on a sidewalk, for instance, users will hold their hands low and sweep their hands inward and outward. When in an unknown environment, the hands will explore vertically as well as horizontally. During the wall following test, the user tended to keep the hand feeling the wall extended and the other hand close to the body. This design is very natural and flexible, as it artificially extends the range of the hands, which people are already accustomed to using for sensory feedback. Additionally, since the device employs remote sensing, things such as people, strollers, or pets can be felt without harm or embarrassment. For instance, when walking along the beach, it would be much better to feel a sand castle with a virtual feedback field than by hitting it with a long cane.

The addition of other features makes this more than simply extending the reach of the hands. If an object reaches a minimum proximity, then all the columns in that area of the disparity map will be set to a high value to ensure that a user will detect it regardless of whether the hands are held high or low. Also, the ground plane removal provides a level of processing that makes the objects easier to
distinguish. The ability for the user to adjust the scaling parameter allows the system to be tuned to provide appropriate feedback for different navigation situations.

The testing showed that the device can be learned relatively easily, and that with more experience, users are able to perform navigation faster and more confidently. Simple tests such as the obstacle distance test and obstacle height test showed that users were able to explore and sense in all three dimensions, and it is expected that even greater fidelity could be achieved through more training and with feedback fields that had more filtering to improve their consistency over time. The obstacle course showed that users could navigate among both static and dynamic obstacles, and since these tests were conducted in realistic conditions (outside, inside, paved, unpaved, areas with structures, and wooded trails), the results have more weight than if they had been conducted in uncluttered rooms with well controlled lighting.

The prototype system demonstrated the capabilities of the concept, but also imposed some unwanted limitations. The most noticeable limitation was the field of view of the stereo cameras, which was narrower than the field of view of human vision. Ideally, the system would be able to detect objects in a wide 180 degree field of view in front of the user. This would prevent close objects from going undetected by the cameras and unnoticed by the user if the object is approached while the head is pointed in a different direction. A wider field of view would also reduce the need to scan with the head, and objects could no longer slip under the field of view and be absent from the feedback field. Users could then walk more naturally with their head held level and sweep their hands over an even broader area to feel their surroundings.
The hand tracking system suffered from the same drawback. While a lens with a wider angle would have improved the situation, a better hand tracking technology is making headway in the gaming market. Systems that measure a magnetic field emitted by a base station are used to measure the absolute position and orientation of handheld controllers with a positional accuracy of 1 mm and a rotational accuracy of 1 degree. If the magnetic base station were mounted on the body and the sensors were placed on the hands, this could produce a system without range limits on the hands, so a much more extensive feedback field would be possible.

When envisioning this system as a commercial product, there are many improvements that could be made to produce a system that is simpler than the prototype implementation. The system may look something like that shown in Figure 6-1. Wide angle stereo cameras could be built into the frame of a pair of glasses. The hand tracking system could consist of another small camera built into the glasses and looking downward, or could be implemented with a magnetic field sensing system. The computing resource could be reduced to a specialized embedded system that implements the algorithms in hardware on an FPGA or specialized DSP chip, saving space and power. The feedback gloves could be designed to be less encumbering by placing the vibration motor on the finger closer to the hand in order to leave the fingertips free to grasp small objects and read braille.
Figure 6-1. Rendering of what the system could look like if it were commercialized. The stereo cameras (A) could be built into a pair of glasses, magnetic hand tracking sensors (B) could be used to measure the pose of the hands, and all computing could be done on an embedded system (C).

Future Work

There are two specific areas that could be explored in future work. One is the scheme used to convey geometry to the fingers, and the other is the processing of the 3D geometry data performed before providing this information to the user. In the current design, the sensed geometry is scaled down and the user can feel this model only when a finger touches or penetrates a surface. There are a couple of alternatives that may achieve greater acceptance among users. One method is to use an imaginary ray which passes through the fingertip; that finger could
then be given a vibration whose strength is proportional to the proximity of the nearest object along that line. Another scheme could be to use the orientation of the finger and construct an imaginary line that extends in the direction of the finger; that finger could then be given a vibration whose strength is proportional to the proximity of the nearest object pointed to by the finger. Both of these schemes could potentially be more intuitive, and the results would be interesting.

The second area for exploration is in further processing the raw 3D data. An algorithm that classified items such as doorways, stairs, signs, etc. could be used to convey more descriptive information to a user, including information that cannot be gleaned from geometry alone. This could perhaps be done by activating a mode where touching something in the feedback field that has been classified triggers its verbal description to be played audibly. Another possibility is to track moving obstacles like people, shopping carts, or cars and predict their expected trajectory for the next few seconds. These trajectories could be added to the feedback field so that those areas do not seem like free space. There is no end to the various types of additional processing that could be explored.

Conclusions

The field of navigation aids for people who are blind is still wide open for innovation, and methods developed for robotic navigation can be successfully applied to this area. Both areas share many of the same goals and challenges, so the predicted explosion of robotic devices and autonomous vehicles that will be interacting with unknown environments should drive advances in sensing and perception technology that will also apply to helping people without sight. The most difficult part will be finding
appropriate and efficient ways of communicating this information to individuals through sensory substitution, but this system has been shown to communicate the 3D geometric information necessary for navigation in an intuitive way. This system is the first of its kind and has the potential to be further developed into a commercial product that will give people who are blind a much richer picture of their environment.

A myriad of sensory substitution devices have been proposed and tested to help people with blindness, and the application of technology and research in this endeavor is very important. Despite all these efforts, however, no systems have emerged that have been greeted with great acceptance other than simple, one-dimensional sonar-based systems. This is almost certainly because the more advanced systems, which collect more information about the environment, do a poor job of conveying high quality information through an appropriate human interface. By removing the ground plane and preserving only the important parts of the picture, and by conveying this information in an easily explored tactile feedback field, this system has come closer to the goal of providing a better picture of the environment through non-visual channels.

The test results have shown that stereo vision is an excellent way of sensing the surroundings and can provide much of the information necessary to navigate. The method used to track the hand locations could possibly be improved by using a magnetic field sensing system to increase the range of motion, but the concept was successfully demonstrated through the use of visual marker tracking. The hand feedback scheme was useful almost immediately to test subjects, who demonstrated their ability to both navigate and achieve an awareness of the general configuration of
objects around them through haptic exploration. The system as a whole has shown much promise in improving how people without sight sense the world around them.
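The two ray-based feedback schemes proposed under Future Work can be illustrated with one short sketch, since both reduce to casting a ray into the sensed geometry and mapping the nearest hit to a vibration strength. This is only a sketch under stated assumptions: the point-cloud representation, the function names, the `ray_radius` tolerance, the `max_range` cutoff, and the linear distance-to-strength mapping are all hypothetical and are not taken from the prototype. The origin of the ray in the first scheme is likewise left to the caller.

```python
import math
from typing import Iterable, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def nearest_hit_distance(
    origin: Vec3,
    direction: Vec3,
    points: Iterable[Vec3],
    ray_radius: float,  # hypothetical tolerance: how far off the ray a point may lie
) -> Optional[float]:
    """Distance from origin, along the ray, to the nearest point near the ray."""
    length = math.sqrt(_dot(direction, direction))
    if length == 0.0:
        raise ValueError("direction must be non-zero")
    unit = (direction[0] / length, direction[1] / length, direction[2] / length)
    best: Optional[float] = None
    for p in points:
        rel = _sub(p, origin)
        along = _dot(rel, unit)  # signed distance along the ray
        if along <= 0.0:
            continue             # point is behind the ray origin
        perp_sq = _dot(rel, rel) - along * along  # squared distance off the ray
        if perp_sq <= ray_radius ** 2 and (best is None or along < best):
            best = along
    return best


def vibration_strength(distance: Optional[float], max_range: float) -> float:
    """Map a hit distance to a strength in [0, 1]; closer objects vibrate more."""
    if distance is None or distance >= max_range:
        return 0.0
    return 1.0 - distance / max_range
```

For the first scheme the caller would pass a chosen origin with `direction = fingertip - origin`. For the second scheme the origin would be the fingertip itself and the direction would be the measured pointing direction of the finger.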
BIOGRAPHICAL SKETCH

Ryan Chilton was born in Irvine, CA in 1986. He received his B.S. and M.S. degrees in mechanical engineering. His interests include anything related to robotics and autonomous vehicles, as well as devices [...]. Ryan enjoys spending time with his wife, being outside, playing racquetball and tennis, biking, and skiing.