1. Field of the Invention
The present invention relates to an operational system and computer system architecture for improved manipulation of video and other time-based media and associated time-based metadata. More specifically, the present invention relates to a system and architectural model for organizing a method for manipulating, editing and viewing video and other time-based media and associated time-based metadata and edits thereto without changing initially secured and underlying video data, wherein a series of user interfaces, an underlying operative program module and a supportive data module are provided within a cohesive operating system.
2. Description of the Related Art
Consumers are shooting more and more personal video using camera phones, webcams, digital cameras, camcorders and other devices, but consumers are typically neither skilled videographers nor able or willing to learn complex, traditional video editing and processing tools like Apple iMovie or Windows Movie Maker. Nor are most users willing to watch most video “VCR-style”, that is, in a steady stream of unedited, undirected, unlabeled video.
Thus consumers are being faced with a problem that will be exacerbated as both the number of videos shot and the length of those videos grows (supported by increased processing speeds, memory and bandwidth in end-user devices such as cell phones and digital cameras) while the usability of editing tools lags behind. The result will be more and longer video files whose usability will continue to be limited by the inability to locate, access, label, discuss and share granular subsegments of interest within the longer videos in an overall library of videos.
In the absence of editing tools for the videos, adding titles and comments to the videos as a whole does not adequately address the difficulties within the present technology, as will become apparent to those of skill in the art who recognize the technological challenges illustrated herein after reviewing the instant discussion. As an example of the challenges currently unmet within the present technology, there may be only three 15-second segments of interest scattered throughout a 10 minute long, unedited video. A special problem, recognizable by those of skill in the art, is that distinct viewers may find distinct 15-second intervals of interest.
The challenge faced by viewers is to find those few short segments of video which are of interest to them at that time without being required to scan through the many sections which are not of interest.
The reciprocal challenge is for users to help each other find those interesting segments of video. As evidenced by the broad popularity of chat rooms, blogs etc. viewers want a forum in which they can express their views about content to each other, that is, to make comments. Due to the time-based nature of the video, expressing interest levels, entering and tracking comments and/or tags or labels on subsegments in time of the video or other time-based media is a unique and previously unsolved problem. Based on the disclosure herein, those of skill in the art should recognize that such time-variant metadata has properties very different from non-time-variant metadata and will require substantially distinct mechanisms and systems to manipulate and manage it.
Additional challenges described in Applicant's incorporated references apply equally well here including especially:
a. the fact that video and accompanying audio is a time-dependent, four-dimensional object which needs to be viewed, manipulated and managed by users on a two-dimensional screen when time is precious to the user who does not wish to watch entire, unedited videos (discussed in detail below with regard to the special complexities of digitally encoded video with synchronized audio, referred to as DEVSA data);
b. the wide diversity of capabilities of the user devices which users wish to use to watch such videos ranging from PCs to cell phones (as noted further below); and
c. the need for any proposed solution to be able to be structured for ready adaptation and re-encodation to accommodate the rapidly changing capabilities of the end-user devices and of the networks which support them.
Those with skill in the art should recognize the more generic terminology “time-based media” which encompasses not only video with synchronized audio but also audio alone plus also a range of animated graphical media forms ranging from sequences of still images to what is commonly called ‘cartoons’. All of these forms are addressed herein. The terms, video, time-based media, and digitally encoded video with synchronized audio (DEVSA) are used as terms of convenience within this application with the intention to encompass all examples of time-based media.
A further challenge is that video processing uses a lot of computer power and special hardware often not found on personal computers. Video processing also requires careful hardware and software configuration by the consumer. Consumers need ways to edit video without having to learn new skills, buy new software or hardware, become expert systems administrators or dedicate their computers to video processing for great lengths of time.
Consumers have been limited to editing and sharing video that they could actually get onto their computers, which requires the right kind of hardware to handle their own video and also requires physical movement of media and encoding if they wish to use video shot by another person or taken from stock libraries.
When coupled with the special complexities of digitally encoded video with synchronized audio, the requirements for special hardware, difficult processing and storage demands combine to reverse the common notion of using “free desktop MIPS and GBs” to relieve central servers. Unfortunately, for video review and editing the desktop is just not enough for most users. The cell phone is certainly not enough, nor is the personal digital assistant (PDA). There is, therefore, a need for an improved system and architectural model for shared viewing and editing of time-based media.
Those with skill in the conventional arts will readily understand that the terms “video” and “time-based media” as used herein are terms of convenience and should be interpreted generally below to mean DEVSA including content in which the original content is graphical.
This application addresses a unique consumer and data model and other systems that involve manipulation of time-based media. As introduced above, those of skill in the art reviewing this application will understand that the detailed discussion below addresses novel methods of receiving, managing, storing, manipulating and delivering digitally encoded video with synchronized audio (conveniently referred to as DEVSA). Those of skill in the art will also recognize that a focus of the present application is, in parallel with the actions applied to the DEVSA, to provide a novel system architecture and data model to gather, analyze, process, store, distribute and present to users a variety of novel and useful forms of information (referred to as metadata) concerning that DEVSA. This information is synchronized to the internal time of the DEVSA and multiply linked to the users both as individuals and as groups (defined in a variety of ways), and it enables them to utilize the DEVSA in a range of novel and useful manners, all without changing the originally encoded DEVSA.
In order to understand the concepts provided by the present, and related inventions within the inventive family, those of skill in the art should understand that DEVSA data is fundamentally distinct from and much more complex than data of those types more commonly known to the public and processed by computers such as basic text, numbers, or even photographs, and as a result requires novel techniques and solutions to achieve commercially viable goals (as will be discussed more fully below).
Techniques (editing, revising, compaction, etc.) previously applied to these other forms of data types cannot be reasonably extended due to the complexity of the DEVSA data, and if commonly known forceful extensions are orchestrated, they would:
a. be ineffective in meeting users' objectives; and/or
b. be economically infeasible for non-professional users; and/or
c. make the so-rendered DEVSA data effectively inoperable in a commercially realistic manner.
Therefore a person skilled in the art of text or photo processing cannot easily extend the techniques that person knows to DEVSA.
What is proposed for the present invention is a new architecture for managing, storing, manipulating, operating with and delivering, etc. DEVSA data and novel kinds of metadata associated with, linked to and, in many cases, synchronized with said DEVSA. As will be discussed herein the demonstrated state-of-the-art in DEVSA processing suffers from a variety of existing, fundamental challenges associated with known DEVSA data operations. The differences between DEVSA and other data types and the consequences thereof are discussed in the following paragraphs. These challenges affect not only the ability to manipulate the DEVSA itself but also to manipulate associated metadata linked to the internals of the DEVSA. Hence those of skill in the art not only face the challenges associated with dealing with DEVSA but also face the challenges of new metadata forms such as deep tagging, synchronized commenting, visual browsing and social browsing as discussed herein and in Applicant's related applications.
This application does not address new techniques for digitally encoding video and/or audio or for decoding DEVSA. There is related art in this area that can provide a basic understanding of the same and those of skill in the electronic arts know these references. Those of skill in the art will understand however that more efficient encoding/decoding to save storage space and to reduce transmission costs only serves to greatly exacerbate the problems of operating on DEVSA and having to re-save revised DEVSA data at each step of an operation if the DEVSA must be decoded, modified and re-encoded to perform any of those operations and as such are in direct contrast to the teachings of the present invention.
A distinguishing point about video and, by extension, stored DEVSA is to emphasize that video or DEVSA represents an object with four dimensions: X, Y, A-audio, and T-time, whereas photos can be said to have only two dimensions (X, Y) and can be thought of as a single object that has two spatial dimensions but no time dimension.
With video or stored DEVSA, there are large numbers of pixels arranged in a fixed X-Y plane which vary smoothly with T (time), plus A (audio amplitude over time), which also varies smoothly in time in synchrony with the video. For convenience this is often described as a sequence of “frames” (such as 24 frames per second). This is however a fundamentally arbitrary choice (number of “frames” and use of “frame” language) and is a settable parameter at encoding time. In reality the rate at which a pixel can change with time is limited only by the speed of the semiconductors that sense the light. The difficulty in dealing with mere two-dimensional photo technology is therefore so fundamentally different as to have no bearing on the present discussion (even more lacking are text art solutions).
Before going further it is also important for those of skill in the art to understand the scale of these DEVSA data elements that sets them apart from text or photo data elements and why this scale is so extremely difficult to manage. As a first example, a 10 minute video at 24 “frames” per second would contain 14,400 frames. At 600×800 pixel resolution, 480,000 pixels, one approaches 7 billion pixel representations.
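As an illustration only, the arithmetic of the preceding example can be checked with a short sketch. The variable names are chosen here for convenience and are not part of the disclosure; the figures are those stated above and describe uncompressed pixel counts only.

```python
# Worked arithmetic for the 10 minute example (uncompressed pixel count only).
minutes = 10
frames_per_second = 24
width, height = 600, 800

frames = minutes * 60 * frames_per_second   # number of "frames" in the video
pixels_per_frame = width * height           # pixels in one 600x800 frame
total_pixels = frames * pixels_per_frame    # pixel representations in the whole video

print(f"{frames:,} frames")
print(f"{pixels_per_frame:,} pixels per frame")
print(f"{total_pixels:,} pixel representations")
```

The product is 6,912,000,000, which is the "approaches 7 billion" figure quoted above.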
When one adds in the fact that each pixel needs 10 to 20 bits to describe it and the need to simultaneously describe the audio track, there is a clear and an impressive need for an invention that addresses both the complexity of the data and the fact that the DEVSA represents not a fixed, single object but rather a continuous stream of varying objects spread over time whose characteristics can change multiple times within a single video. To date no viable solutions have been provided which are accessible to the typical consumer, other than very basic functions such as storing pre-encoded video files, manipulating those as fixed files, and executing START and STOP play commands such as those on a video tape recorder.
While one might have imagined that photos and video offer similar technical challenges, the preceding discussion makes it clear again that the difficulties in dealing with mere two-dimensional photos which are fixed in time are so fundamentally different and less challenging as to have no bearing on the present discussion. The preceding sentence applies at least as strongly to the issue of metadata associated with DEVSA. Tags, comments, etc. on an object fixed in time, such as a text document, a picture or a photo, are well-understood objects (metadata in a broad sense) with clear properties. The available technology has made such things more accessible but has not really changed their nature from that of the printed word on paper: fixed comment tied to fixed object.
In this and Applicant's related applications an emphasis is placed on metadata including tags, comments, visual browsing and social browsing information which are synchronized to the internal time-line of the DEVSA including after the DEVSA has been “edited”, all without changing the DEVSA.
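As a minimal sketch only, and not the data model of the present invention, metadata synchronized to the internal time-line of a DEVSA might be held as records that merely reference the stored media. The names `TimedTag`, `tags_at`, the field names and the sample identifiers below are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimedTag:
    """Hypothetical metadata record anchored to an interval of a video's own time-line."""
    video_id: str   # reference to the stored DEVSA; the media itself is never modified
    start_s: float  # start of the tagged interval, in seconds of source time
    end_s: float    # end of the tagged interval (exclusive)
    author: str
    text: str


def tags_at(tags: List[TimedTag], video_id: str, t_s: float) -> List[TimedTag]:
    """Return the tags on video_id whose interval contains source time t_s."""
    return [t for t in tags
            if t.video_id == video_id and t.start_s <= t_s < t.end_s]


# Sample data (invented for illustration).
tags = [
    TimedTag("vid-001", 75.0, 90.0, "alice", "first steps"),
    TimedTag("vid-001", 300.0, 315.0, "bob", "birthday cake"),
    TimedTag("vid-002", 10.0, 25.0, "alice", "goal"),
]

print([t.text for t in tags_at(tags, "vid-001", 80.0)])
```

Because each record holds only a reference and a time interval, adding, removing or changing a tag in such a sketch never touches the encoded media.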
By way of background information, some additional facts about DEVSA should be well understood by those of skill in the art; these include:
a. Current decoding technology allows one to select any instant in time within a video and resolve a “snapshot” of that instant, in effect rendering a photo of that instant, and to save that rendering in a separate file. As has been shown, for example in surveillance applications, this is a highly valuable technology but it completely fails to address the present needs.
b. It is not possible to take a “snapshot” of audio as it is perceived by a person. Those of skill in the electronic and audio-electronic arts recognize that audio data is a one-dimensional data type (amplitude versus time). It is only as amplitude changes with time that audio is perceivable by a person. Electronic equipment can measure that amplitude at an instant if desired for special reasons.
The present application and those related family applications apply to this understanding of DEVSA when the actual video and audio is compressed (as an illustration only) by factors of a thousand or more but the files remain nonetheless very large. Due to the complex encoding techniques employed, those files cannot be disrupted or manipulated without a severe risk to the inherent stability of the underlying video and audio content. This explains in part the importance of keeping metadata and DEVSA as separate, linked entities previously unknown in the art.
The conventional manner in which users edit digitized data, whether numbers, text, graphics, photos, or DEVSA, is to display that data in viewable form, make desired changes to that viewable data directly and then re-save the now-changed data in digitized form.
The phrase above, “make desired changes to that viewable data”, could also be stated as “make desired changes to the manner in which that data is viewed” because what a user “views” changes because the data changes, which is the normative modality. In contrast to this position, the proposed invention changes the viewing of the data without changing the data itself. The distinction is material and fundamental.
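The distinction between changing the data and changing the viewing of the data can be illustrated with a hedged sketch in which an "edit" is nothing more than an ordered list of intervals on the unmodified source time-line. This is one common way to express non-destructive editing and is an assumption made here for illustration, not a statement of the claimed method; the names `view_duration`, `view_to_source` and `playlist` are hypothetical.

```python
from typing import List, Optional, Tuple

# (start_s, end_s) on the unmodified source time-line
Segment = Tuple[float, float]


def view_duration(segments: List[Segment]) -> float:
    """Total running time of the edited view, in seconds."""
    return sum(end - start for start, end in segments)


def view_to_source(segments: List[Segment], t_view: float) -> Optional[float]:
    """Map a time in the edited view back to the source time-line.

    Returns None when t_view falls outside the view.
    """
    if t_view < 0:
        return None
    elapsed = 0.0
    for start, end in segments:
        length = end - start
        if t_view < elapsed + length:
            return start + (t_view - elapsed)
        elapsed += length
    return None


# Three 15-second segments of interest within a 10 minute video
# (segment positions invented for illustration).
playlist = [(75.0, 90.0), (300.0, 315.0), (480.0, 495.0)]

print(view_duration(playlist))
print(view_to_source(playlist, 20.0))
```

In this sketch a second viewer could hold a different list over the same source, so distinct users see distinct "edits" while the stored media remains a single, unchanged file.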
In conventional data changes, where storage cost is not an issue to the user, the user can choose to save both the original and the changed version. Some sophisticated commercial software for text and number manipulation can remember a limited number of user-changes and, if requested, display and, if further requested, may undo prior changes.
As an illustrative example only, those of skill in the art should recognize the below comparison between DEVSA and other somewhat related data types.
Originally, the most common data type on computers was numbers. This problem was well solved in the 1950s on computers and, as a material example of this success, one can buy a nice calculator today for $9.95 at a local non-specialty store. As another example, first Lotus® and now Excel® software systems solve most data display problems on the desktop as far as numbers are concerned.
Today the most common data type on computers is text. Text is a one-dimensional array of data: a sequence of characters. That is, the characters have an X component (no Y or other component). All that matters is their sequence. The way in which the characters are displayed is the choice of the user. It could be on an 8×10 inch page, on a scroll, on a ticker-tape, in a circle or a spiral. The format, font type, font size, margins, etc. are all functions added after the fact easily because the text data type has only one dimension and places only one single logical demand on the programmer, that is, to keep the characters in the correct sequence.
More recently a somewhat more complex data type has become popular, photos or images. Photos have two dimensions: X and Y. A photo has a set of pixels arranged in a fixed X-Y plane and the relationship among those pixels does not change. Thus, those of skill in the art will recognize that the photo can be treated as a single object and manipulated accordingly.
While techniques have been developed to allow one to “edit” photos by cropping, brightening, changing tone, etc., those techniques require one to make a new data object, a new “photo” (a newly saved image), in order to store and/or retrieve this changed image. This changed image retains the same restrictions as the original: if one user wants to “edit” the image, the user needs to change the image and re-save it. It turns out that there is little “size”, “space” or “time” penalty to that approach to photos because, compared to DEVSA, images are relatively small and fixed data objects.
In summary, DEVSA should be understood by those of skill in the art as a type of data with very different characteristics from data representing numbers, text, photos or other commonly found data types. Those differences and their impacts are fundamental to the present disclosure. As a consequence, extensions of the ideas and techniques that have been applied to those other, substantially less complex data types have no corollary to the conceptions and solutions noted below. The present invention must be appreciated by those of skill in the art as providing a new manner of (and a new solution for) dealing with DEVSA type data that both overcomes the detriments of such data noted above and results in a substantial improvement demonstrated via the present system and architectural model.
The present invention also recognizes the earlier-discussed need for a system to manage and use DEVSA data in a variety of ways while providing extremely rapid response to user input without changing the underlying DEVSA data.
What is also needed is a new manner of dealing with DEVSA that overcomes the challenges inherent in such data and enables immediate and timely response both to initial video and DEVSA data and to DEVSA and time-based media data that is amended or updated on a continual or rapidly changing basis.
What is not appreciated by the related art is the fundamental data problem involving DEVSA and current systems for manipulating the same in a consumer responsive manner.
What is also not appreciated by the related art is the need for providing a data model that accommodates (effectively) all present modern needs involving high speed and high volume video data manipulation and usages.
Accordingly, there is a need for an improved system and architectural model for shared viewing and editing of time-based media without changing the underlying video media content and which additionally takes into account the time-variant nature of the incorporated metadata.