Food logging, i.e., monitoring food eaten by individuals along with various nutritional information associated with that food, is becoming increasingly popular for a variety of reasons. For example, obesity has been linked to conditions such as cardiovascular disease, diabetes, and cancer, and dramatically impacts both life expectancy and quality of life. Furthermore, the rapid rise in the prevalence of obesity presents a critical public health concern. While diet and exercise have been shown to be central to combating obesity, changes in a person's diet and exercise habits are often difficult. However, it has been shown that the use of exercise in combination with accurate food logging supports such changes. Further, food logging is known to be well correlated with increased initial weight loss and improved weight maintenance.
Unfortunately, food logging is often performed as a fully or partially manual process, with the result that the effectiveness of food logging is often limited by inconvenience to the user. Attempts to perform automatic food logging, based on inferring nutritional information from a single food image, have shown generally poor performance for a variety of reasons. For example, there may be significant occlusions (e.g., a sausage hidden under a side of coleslaw) in a food image, resulting in missing information. Further, it is highly unlikely that visual information alone conveys all the details of food preparation (e.g., amount of oil, fat content of meats, sugar content, salt content, etc.) that strongly impact nutritional content. In addition, accurate volume estimation from a single image remains a challenging computational task.
In light of such issues, effectively estimating nutritional statistics (e.g., calories, fats, carbohydrates, etc.) from single images of realistic meals remains a challenging problem. Some existing attempts to address such issues relax the single-image assumption and rely on auxiliary hardware or inputs such as calibration targets, multiple images, laser scanners, and structured light. Further, such techniques generally assume unrealistic arrangements of the food items on a plate, such that each individual item can be clearly imaged. Unfortunately, techniques requiring users to provide food images using various combinations of calibration targets, multiple images, laser scanning, careful arrangement of food items on a plate, etc. before consuming a meal are not generally considered to be “user friendly.”
Additional attempts to address some of the aforementioned challenges relax the goal of estimating nutritional statistics and focus instead on core computer vision challenges. For example, one approach suggests the use of a feature descriptor but evaluates only on the highly controlled “Pittsburgh Fast-Food Image Dataset” (also referred to as the “PFID”). Another approach considers the use of user-supplied images and nutritional statistics to bootstrap classification. This approach utilizes a nutritional table with five categories: grain, vegetable, meat/fish/beans, fruit, and milk. Images are mapped to these categories and serving sizes are then supplied by the user. Unfortunately, such approaches are limited by the granularity of the nutritional table and portion sizes. In particular, the coarse nutritional information used in such approaches carries large standard deviations of serving counts, preventing accurate calorie estimation. Yet another approach considers manual crowd-sourced assessments of nutritional information based on images of food being consumed. This crowd-sourced approach has been observed to show results similar to those supplied by a dietitian, but at the cost of significant human input and delay in feedback to the person consuming the meal.