Testing With Human Users: What Can Go Wrong?

When Testing is Not Tested

We are at the point in DIDYMOS-XR where we begin validating, with users, the technologies we have developed. In our previous blog we discussed the technology we are developing at the Vision and Robotics Lab at the American University of Beirut, where humans use augmented reality (AR) headsets to improve the generation and maintenance of digital twins.

In the past few months, we embarked on testing the performance and user acceptance of this technology. The idea was simple: use the idealworks Research and Development Center in Munich to create a realistic environment in which users would have to virtually place assets in the digital twin using either traditional methods or an AR headset running our technology. This would allow us to compare our technology against traditional methods using measures such as placement accuracy, completion time, and participant feedback. The setup was tested, a team of researchers was identified, and participants were recruited. We were up against a deadline, right around the holiday season in December. I bring this up because we believe it contributed to the oversight that followed.
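
The comparison itself boils down to per-condition summaries of placement error, completion time, and questionnaire scores. The post does not describe the project's actual analysis pipeline, so the snippet below is only a minimal sketch, with a made-up record format, of how such summaries might be computed.

```python
import numpy as np

# Hypothetical trial records: (condition, placed position, ground-truth
# position, completion time in seconds). Field names and values are
# illustrative, not the project's actual data format.
trials = [
    ("traditional", np.array([1.02, 0.48, 0.00]), np.array([1.00, 0.50, 0.00]), 41.2),
    ("ar_headset",  np.array([2.05, 1.52, 0.00]), np.array([2.00, 1.50, 0.00]), 58.7),
    # ... one record per placed asset per participant
]

def summarize(condition: str):
    """Mean placement error (m) and completion time (s) for one condition."""
    errors = [np.linalg.norm(placed - truth)
              for cond, placed, truth, _ in trials if cond == condition]
    times = [t for cond, *_, t in trials if cond == condition]
    return np.mean(errors), np.mean(times)

for cond in ("traditional", "ar_headset"):
    err, t = summarize(cond)
    print(f"{cond}: mean error {err:.3f} m, mean completion time {t:.1f} s")
```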

The experimental setup at idealworks.

The testing was undertaken with about 20 subjects and all the data was collected. When we came to the analysis, however, there was a surprise: the results came out completely counter to expectations! Now, this is research and things do not always come out the way we hope, but when all the objective and subjective measures agree on such a counterintuitive outcome, it typically indicates that something was off. Especially when a placement task shows higher accuracy when a human estimates a position by eye than when a camera snaps it into place.

The team of researchers started a series of meetings and tests dedicated to root cause analysis. After several discussions and evaluations, the evidence pointed to two issues: one technical and one experimental. On the technical side, it turned out that our method of aligning the AR headset's axes with the axes of the assets in the environment was not accurate. This made the accuracy appear lower at certain positions, as we demonstrated using updated code. On the experimental side, things are a bit more complicated because we are dealing with human subjects. We speculate that because users of the traditional methods could see the assets from a seated position, whereas with the AR headset they had to move around, the AR condition ended up with longer completion times and lower user ratings. Since in a real warehouse scenario users would have to walk around regardless of the method used, this small detail produced unrealistic outcomes and could explain the results. The solution? Redo all the experiments while requiring all users to walk around while placing assets, as they would in reality.
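
The post does not say how the headset frame was aligned with the asset frame, so the following is only an illustrative sketch: it fits a rigid transform to a few known reference points using the standard Kabsch method and shows how a small residual rotation grows into a noticeable placement error a few metres away, which is exactly the kind of effect that can masquerade as lower AR accuracy at certain positions.

```python
import numpy as np

def estimate_alignment(headset_pts: np.ndarray, world_pts: np.ndarray):
    """Fit the rigid transform (R, t) that maps headset-frame reference
    points onto their known world-frame positions (Kabsch/Procrustes)."""
    ch, cw = headset_pts.mean(axis=0), world_pts.mean(axis=0)
    H = (headset_pts - ch).T @ (world_pts - cw)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cw - R @ ch

# Why a small alignment error matters: an uncorrected 2-degree yaw offset
# already displaces an asset 5 m away by roughly 17 cm.
yaw = np.radians(2.0)
R_err = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0,          0.0,         1.0]])
asset = np.array([5.0, 0.0, 0.0])
print(f"placement error: {np.linalg.norm(R_err @ asset - asset):.3f} m")  # ~0.175
```

Checking the residuals of such a fit against surveyed reference points after every headset (re)calibration is one cheap way to catch this class of error before a full user study.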

We can blame deadlines, holidays, and even team members for this, but the fact of the matter is that the Test was Not Tested! Tests that involve users require considerable time and effort and are costly to repeat. That is why, when human subject testing commences, it is always recommended to fully analyze the data collected from the first few sessions in order to identify irregularities early on. This leaves room for remedial action before the tests continue.
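
In practice, such an early check can be as simple as a script run after the first few sessions that flags anything contradicting the expected direction of the effect or showing an implausible spread. A minimal sketch, with made-up numbers and thresholds:

```python
import statistics

# Hypothetical per-trial placement errors (m) from the first few sessions;
# condition names and thresholds are illustrative only.
pilot = {
    "traditional": [0.031, 0.028, 0.044, 0.036],
    "ar_headset":  [0.012, 0.055, 0.140, 0.019],
}

def flag_irregularities(expected_better: str = "ar_headset") -> list[str]:
    flags = []
    means = {cond: statistics.mean(errs) for cond, errs in pilot.items()}
    spreads = {cond: statistics.pstdev(errs) for cond, errs in pilot.items()}
    # 1. Direction check: does the pilot contradict the working hypothesis?
    worse = max(means, key=means.get)
    if worse == expected_better:
        flags.append(f"{expected_better} shows HIGHER mean error than the baseline")
    # 2. Spread check: a very noisy condition often hints at a setup problem.
    for cond, sd in spreads.items():
        if sd > 0.5 * means[cond]:
            flags.append(f"{cond}: error spread ({sd:.3f} m) is large relative to its mean")
    return flags

for f in flag_irregularities():
    print("CHECK:", f)
```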

Author: Imad H. Elhajj, American University of Beirut