Field of the Invention
This invention relates generally to event/message processing, and more particularly to systems and associated methods for clustering messages/events received from managed infrastructure using graph entropy.
Description of the Related Art
The World Wide Web is becoming an increasingly important and frequently used form of communication between people. The primary form of web-based communication is electronic mail. Other forms of communication are also used, however, such as news groups, discussion groups, bulletin boards, voice-over IP, and so on. Because of the vast amount of information that is available on the web, it can be difficult for a person to locate information that may be of interest. For example, a person who receives hundreds of electronic mail messages/events from infrastructure a day may find it impractical to take the time to store the messages/events from infrastructure in folders of the appropriate topic. As a result, it may be difficult for the person to later find and retrieve all messages/events from infrastructure related to the same topic. A similar situation arises when a person tries to locate news groups or discussion groups of interest. Because there may be no effective indexing of these groups, it can be difficult for the person to find groups related to the topic of interest.
Some attempts have been made to help the retrieval of information of interest by creating web directories that provide a hierarchical organization of web-based information. The process of creating the directories and deciding into which directory a particular piece of information (e.g., a news group) should go is typically not automated. Without an automated approach it is impractical to handle the massive amounts of web-based information that are being generated on a daily basis. Moreover, because a person may not be fully aware of the entire web directory hierarchy or may not fully understand the semantics of information, the person may place the information in a directory that is not the most appropriate, making later retrieval difficult. It would be desirable to have an automated technique that would help organize such information.
The advent of global communications networks such as the Internet has provided alternative forms of communicating worldwide. Additionally, it has increased the speed at which communications can be sent and received. Not only can written or verbal messages/events from infrastructure be passed through the Internet, but documents, sound recordings, movies, and pictures can be transmitted by way of the Internet as well. As can be imagined, inboxes are being inundated with countless items. The large volume can be more than difficult for most users to manage and/or organize.
In particular, a few of the more common activities that a user performs with respect to email, for example, are: sorting of new messages/events from infrastructure, task management using messages/events from infrastructure that can serve as reminders, and retrieval of past messages/events from infrastructure. Retrieval of recent messages/events from infrastructure can be more common than retrieval of older messages/events from infrastructure. Traditional systems employed today support at least some aspect of these three activities using folders such as an inbox, task-oriented folders, and user-created folders, respectively. However, this and other existing approaches present several problems. The folders make stark divisions between the three activities that are not conducive to, or coincident with, user behavior in general. For example, tasks are not visible to the user, or rather are “out of sight, out of mind”, and thus can be easily, if not frequently, neglected, overlooked, or forgotten. In addition, in many current systems any given message can only be in one folder at a time. Hence, the particular message cannot serve multiple activities at once. Other current systems have attempted to ease these problems; however, they fall short as well for similar reasons.
A user can communicate using one or more different messaging techniques known in the art: email, instant messaging, social network messaging, cellular phone messages/events from infrastructure, etc. Typically, the user can accumulate a large collection of messages/events from infrastructure using one or more of these different messaging techniques. This user collection of messages/events from infrastructure can be presented as a large collection of messages/events from infrastructure with limited options of grouping or clustering the messages/events from infrastructure.
One way of grouping messages/events from infrastructure is to group multiple emails into an email thread. An email thread is a collection of emails that are related based on the subjects of the emails. For example, one user sends an email to one or more users based on a given subject. Another user replies to that email, and a computer would mark those two emails as belonging to a thread. Another way of grouping messages/events from infrastructure is to put the messages/events from infrastructure into folders. This can be done manually by the user or can be done automatically by the user setting up rules for message processing.
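The subject-based threading described above can be illustrated with a minimal sketch. The message dictionaries, the `normalize_subject` helper, and the set of reply/forward prefixes it strips are assumptions made for illustration only; they are not taken from any particular mail system.

```python
import re
from collections import defaultdict

# Assumed convention: replies and forwards prepend "Re:", "Fw:", or "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes and case so related subjects compare equal."""
    return _PREFIX.sub("", subject).strip().lower()

def group_into_threads(emails):
    """Map each normalized subject to the ids of the emails that share it."""
    threads = defaultdict(list)
    for email in emails:
        threads[normalize_subject(email["subject"])].append(email["id"])
    return dict(threads)

# Hypothetical sample messages.
emails = [
    {"id": 1, "subject": "Disk alert on db01"},
    {"id": 2, "subject": "Re: Disk alert on db01"},
    {"id": 3, "subject": "Quarterly report"},
    {"id": 4, "subject": "RE: Fwd: Disk alert on db01"},
]
threads = group_into_threads(emails)
```

In this sketch, messages 1, 2, and 4 fall into one thread and message 3 into another. Grouping purely on subject is brittle: unrelated messages that happen to share a subject are merged, and a reply whose subject was edited is split off.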
Document clustering and classification techniques can provide an overview or identify a set of documents based upon certain criteria, which amplifies or detects certain patterns within its content. In some applications these techniques lead to filtering unwanted email and in other applications they lead to effective search and storage strategies. An identification strategy may for example divide documents into clusters so that the documents in a cluster are similar to one another and are less similar to documents in other clusters, based on a similarity measurement. One refers to the process of clustering and classification as labeling. In demanding applications labeling can greatly improve the efficiency of an enterprise, especially for storage and retrieval applications, provided that it is stable, fast, efficient, and accurate.
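As a rough illustration of dividing documents into clusters by a similarity measurement, the sketch below uses Jaccard similarity over word sets and a greedy single-pass assignment. Both choices, along with the `threshold` value and the sample documents, are assumptions for illustration; the text above does not prescribe a particular measure or algorithm.

```python
def tokens(text):
    """Represent a document as the set of its lowercase words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(documents, threshold=0.5):
    """Greedy single-pass clustering.

    Each cluster is a list of document indexes; the first index acts as the
    cluster's representative. A document joins the first cluster whose
    representative is at least `threshold` similar, else starts a new cluster.
    """
    clusters = []
    for i, doc in enumerate(documents):
        doc_tokens = tokens(doc)
        for members in clusters:
            if jaccard(doc_tokens, tokens(documents[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical sample documents.
documents = [
    "link down on switch 12",
    "link down on switch 14",
    "cheap watches buy now",
    "buy cheap watches now",
]
result = cluster(documents)
```

Here the two link-down messages share four of six distinct words (similarity about 0.67) and are grouped together, while the two advertising messages have identical word sets and form a second cluster.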
Users of information technology must effectively deal with countless unwanted emails, unwanted text messages/events from infrastructure and crippling new viruses and worms every day. This largely unnecessary, high volume of network traffic decreases worker productivity and slows down important network applications. One of the most serious problems in today's digital economy has to do with the increasing volume of spam. As such, recipients of email as well as the service providers need effective solutions to reduce its proliferation on the World Wide Web. However, as spam detection becomes more sophisticated, spammers invent new methods to circumvent detection. For example, one prior art methodology provides a centralized database for maintaining signatures of documents having identified attributes against which emails are compared; however, spammers now modify the content of their email either slightly or randomly such that the message itself may be intelligible, but it evades detection under various anti-spam filtering techniques currently employed.
At one time, at least 30 open relays dominated the world, bursting messages/events from infrastructure at different rates and different levels of structural variation. Because certain types of email mutate or evolve, as exemplified by spam, spam-filtering detection algorithms must constantly adjust to be effective. In the case of spam email, for example, the very nature of the spam corpus undergoes regime changes. Therefore, clustering optimality depends heavily on the nature of the data corpus and the changes it undergoes.
Decomposing a traffic matrix has proven to be challenging. In one method, a matrix factorization system is used to extract application dependencies in an enterprise network, a cloud-based data center, and other like data centers, using a temporal global application traffic graph dynamically constructed over time and spatial local traffic observed at each server of the data center. The data center includes a plurality of servers running a plurality of different applications, such as e-commerce and content delivery. Each of the applications has a number of components, such as a web server, application server, and database server, in the application's dependency path, where one or more of the components are shared with one or more of the other applications.
Because such data centers typically host a large number of multi-tier applications, the application requests overlap in both the spatial and temporal domains, making it very difficult for conventional pairwise statistical correlation techniques to correctly extract these interleaved but independent applications. A matrix-based representation of application traffic is used which captures both system snapshots and their historical evolution. The system and method decompose a matrix representation of application graphs into small sub-graphs, each representing a single application.
The number of applications is usually unknown a priori due to interleaving and overlapping application requests, which further imposes a challenge to discovery of the individual application sub-graphs. In one prior method and system, the number of applications is determined using low rank matrix estimation either with singular value decomposition or power factorization based solvers, under complete and incomplete traffic data scenarios, with theoretical bound guarantee.
Traffic tapping from switches is limited by the capability of the switches as well as the monitoring hosts. A switch typically can mirror only a few ports at the same time. In addition, monitoring data collected over multiple switches, each with multiple ports, may result in high-volume aggregate network traffic and potentially packet loss. Both cases lead to significant loss in the monitoring data.
One system and method to overcome this problem utilizes historical data to provide redundancy and employs power factorization based techniques to provide resilience to data loss and estimation errors. In one system and method, distributed network monitoring and centralized data processing are used to determine application dependency paths in a data center.
The majority of current service management solutions are rule based. The concept behind rule-based systems is that you start with the system you are monitoring, analyze and model it, and turn it into a series of business logic rules that respond to events as they occur. For example, in response to some logged text, you apply logic that turns the text into a database record, to which you apply more logic that turns it into an alert, before applying yet more logic to connect the alert to a trouble ticket.
The fundamental problem with this approach is that the rules depend on a point-in-time snapshot of the infrastructure being managed, which is subject to continual change. So, every time the infrastructure changes, the business logic must be modified. Clearly the rule-based approach is not a scalable way of running a business.
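The rule chain described above (logged text to record, record to alert, alert to trouble ticket) and its sensitivity to infrastructure change can be sketched as follows. The log format, severity names, host prefixes, and queue names are all hypothetical and chosen only for illustration; this is not the rule language of any particular service management product.

```python
import re

# Rule 1 (assumed log format): "<host> <SEVERITY> <message>".
LOG_PATTERN = re.compile(r"^(?P<host>\S+) (?P<severity>[A-Z]+) (?P<message>.+)$")

def parse_record(line):
    """Turn a line of logged text into a record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def record_to_alert(record):
    """Rule 2: only ERROR and CRITICAL records become alerts."""
    if record and record["severity"] in {"ERROR", "CRITICAL"}:
        return {"host": record["host"], "summary": record["message"]}
    return None

# Rule 3: routing table that hard-codes a snapshot of the infrastructure.
QUEUE_BY_PREFIX = {"db": "database-team", "web": "web-team"}

def alert_to_ticket(alert):
    """Connect an alert to a trouble-ticket queue based on the host prefix."""
    if alert is None:
        return None
    for prefix, queue in QUEUE_BY_PREFIX.items():
        if alert["host"].startswith(prefix):
            return {"queue": queue, **alert}
    return None  # no rule covers this host

def process(line):
    """Apply the three rules in sequence to one line of logged text."""
    return alert_to_ticket(record_to_alert(parse_record(line)))

ticket = process("db01 ERROR disk full")
ignored = process("web02 INFO started")
unrouted = process("cache07 CRITICAL out of memory")
```

In this sketch, the critical event from `cache07` produces no ticket because the routing table predates that class of host; someone has to notice the gap and edit `QUEUE_BY_PREFIX` by hand.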