A new research project led by Facebook’s AI team suggests the scope of the company’s ambitions. It imagines AI systems that constantly analyze people’s lives using first-person video, recording what they see, do and hear in order to help them with everyday tasks. Facebook’s researchers have outlined a series of skills they want these systems to develop, including “episodic memory” and “audio-visual diarization”.

Right now, the tasks outlined above cannot be achieved reliably by any AI system, and Facebook stresses that this is a research project rather than a commercial development. However, it’s clear that the company sees functionality like this as the future of AR computing. Facebook AI research scientist Kristen Grauman commented:

“Definitely, thinking about augmented reality and what we’d like to be able to do with it, there are possibilities down the road that we’d be leveraging this kind of research.”

Privacy experts are already worried about how Facebook’s AR glasses allow wearers to covertly record members of the public. Such concerns will only be exacerbated if future versions of the hardware not only record footage but analyze and transcribe it, turning wearers into walking surveillance machines.

The name of Facebook’s research project is Ego4D, which refers to the analysis of first-person, or “egocentric,” video. It consists of two major components: an open dataset of egocentric video and a series of benchmarks that Facebook thinks AI systems should be able to tackle in the future.


Facebook partnered with 13 universities around the world to collect the data. In total, some 3,205 hours of footage were recorded by 855 participants living in nine different countries. The universities, rather than Facebook, were responsible for collecting the data. Participants, some of whom were paid, wore GoPro cameras and AR glasses to record videos of unscripted activity, ranging from construction work to baking to playing with pets and socializing with friends. All footage was de-identified by the universities, which included blurring the faces of bystanders and removing any personally identifiable information.

Grauman says the dataset is the “first of its kind in both scale and diversity.” The nearest comparable project, she says, contains 100 hours of first-person footage shot entirely in kitchens.

“We’ve opened up the eyes of these AI systems to more than just kitchens in the UK and Sicily, but to footage from Saudi Arabia, Tokyo, Los Angeles, and Colombia.”

The second component of Ego4D is a series of benchmarks, or tasks, that Facebook wants researchers around the world to try and solve using AI systems trained on its dataset. The company describes these as:

Episodic memory: What happened when (e.g., “Where did I leave my keys?”)?

Forecasting: What am I likely to do next (e.g., “Wait, you’ve already added salt to this recipe”)?

Hand and object manipulation: What am I doing (e.g., “Teach me how to play the drums”)?

Audio-visual diarization: Who said what when (e.g., “What was the main topic during class?”)?

Social interaction: Who is interacting with whom (e.g., “Help me better hear the person talking to me at this noisy restaurant”)?

Photo Credits: Facebook

Right now, AI systems would find tackling any of these problems incredibly difficult, but creating datasets and benchmarks is a tried-and-tested method of spurring development in the field of AI.

Indeed, the creation of one particular dataset and associated annual competition, known as ImageNet, is often credited with kickstarting the recent AI boom. The ImageNet dataset consists of pictures of a huge variety of objects, which researchers trained AI systems to identify. In 2012, the winning entry in the competition used a particular method of deep learning to blast past rivals, inaugurating the current era of research.

Facebook is hoping its Ego4D project will have similar effects on the world of augmented reality. The company says systems trained on Ego4D might one day be used not only in wearable cameras but also in home assistant robots, which also rely on first-person cameras to navigate the world around them.

Nikoleta Yanakieva Editor at DevStyleR International