1. Technical Field
The embodiments of the invention relate in general to the field of automated information techniques, and more specifically to an improved code compression method. The invention relates also to a system for code compression. The invention relates also to a network element for communicating code over a boundary layer between said network element and a second network element of a communications network. The invention relates also to computer program products in machine-readable form for executing the above mentioned method or operating the above mentioned system.
2. Discussion of Related Art
It is the communication capacity of the network that sets an upper limit on the quantity of information passed through the network. As communications develop, the complexity of the information loads the available resources, which seem to run out for the users as services and hardware take their turn in the evolution. One direction of development points towards capacity increase and larger amounts of data being transferred; however, new applications may very quickly consume the additional benefits of the new techniques. Another way to mitigate the communications needs, besides capacity increase, is to compress the information to be sent. One important field therein is code compression, in which the information to be transmitted and/or stored is processed into a packed format that has a smaller effective volume to be treated, but a larger information density than the original information as such. Such densification, called packing, is normally performed by an algorithm directed according to certain predetermined rules definable by the user. The inverse operation of packing, unpacking, restores the original information as such from the packed format. In the early days of the technique, it was advantageous to drop out so-called empty bits and/or portions and thus increase the density of the information.
However, it was noted that it is not necessary to transfer and/or store all the information of the original data as such. Even a significant part of a file can be omitted if there is a detectable period in the data. In such a case it is sufficient to send only one such period and the number of times it is to be repeated in a certain order.
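The period-based idea above can be sketched as a minimal packing/unpacking pair; the function names and the packed representation are illustrative only, not taken from the invention:

```python
def pack_periodic(data: bytes, period: int) -> tuple[bytes, int]:
    """If data is an exact repetition of its first `period` bytes,
    return (one period, repeat count); otherwise raise ValueError."""
    unit = data[:period]
    count, rem = divmod(len(data), period)
    if rem != 0 or unit * count != data:
        raise ValueError("data is not periodic with the given period")
    return unit, count

def unpack_periodic(unit: bytes, count: int) -> bytes:
    # Inverse operation: restore the original data from one period.
    return unit * count
```

Only one period plus a repeat count needs to be transferred; the inverse operation restores the file exactly.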
It is also possible to drop from a file parts that do not actually comprise significant information for the purpose of use of the file. In picture formats as well as in audio formats, files can be processed by such techniques so that the user will not sense the loss of information from the original file, or such loss can be interpolated back into the file to a certain degree, if needed. A man skilled in the art recognizes several such picture and audio formats, as well as the utilization of several ZIP algorithms or similar known variants thereof at the priority date of this application. Such techniques belong to a passive way of compressing code and do not actually have relevance to the specific field of technology to which the invention belongs.
In general, code compression is important in several areas of information techniques, especially in fields that use e.g. embedded systems, where the storage space may be very limited with respect to the information to be processed.
For successful code compression, especially when dealing with automatically performing applications, the compression can be divided into phases in which the code is processed for transfer and/or storage. Such processing generically involves recognition of the parts of the code to be compressed. For instance, the degree of compression can be governed by certain predetermined values defined by the user of the compression algorithm. However, the algorithm needs to decide at some phase what to do and to what extent, and certain rules are needed to guide such operation. In an automated approach, the critical compression parameters can be set by the operator, or they can be deduced from the file itself, its structure and/or the file type, in order to convey the information to the algorithm for fulfilment of the compression according to the rules.
Those rules are driven by the compressors, which can be categorized into non-model based and model based, according to whether or not they use a model in the compressing. In coding performed without a model, a compression table is needed for determination of the code words for all input elements, also called tokens; the compression table is generated anew from each input file every time the file is to be compressed and encoded, so that each input uses its own table. In model based techniques, in contrast, such a table is generated from all input, and each input is then encoded every time by using such a common table.
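The distinction can be sketched as follows; the frequency-ranked table is a simplified stand-in for a real code-word assignment such as Huffman coding, and all names are illustrative:

```python
from collections import Counter

def build_table(tokens):
    # Assign smaller indices (standing in for shorter code words)
    # to the more frequent tokens.
    freq = Counter(tokens)
    return {tok: i for i, (tok, _) in enumerate(freq.most_common())}

def encode(tokens, table):
    return [table[t] for t in tokens]

# Non-model based: each file is encoded with its own table,
# which must be stored or sent alongside the encoded data.
file_a = ["LOAD", "ADD", "LOAD", "STORE"]
table_a = build_table(file_a)

# Model based: one common table is built from all input once,
# and every file is then encoded against that shared table.
all_inputs = file_a + ["LOAD", "JMP", "ADD"]
shared_table = build_table(all_inputs)
```

The per-file table must travel with each file, whereas the shared table covers tokens (such as "JMP") that never occur in a particular file.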
An example relating to file transfer is considered. If a first file with a first size were to be sent through a network, said first file would be compressed as much as possible. In model based techniques a model is created, the model having a certain model size. When the first file is compressed according to the model, the result of the compression, a second file, is produced, having a certain second size. When decompressing, the model is used in order to decompress the second file.
However, the network capacity needed for transferring the whole data for decoding must provide bandwidth for a transferred file whose size is the model size plus the second size. Therefore, in order to save transfer capacity, the transferred file size should beneficially be minimized. It is problematic, however, that during model creation, while compressing an actual part of the first file, it is almost impossible to compute the size of the corresponding part of the second file within the time scale required for the compression.
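The transfer-capacity arithmetic above can be made concrete with purely illustrative numbers (none are taken from the invention):

```python
# A model is worth transferring only when model size plus
# compressed (second) file size stays below the original size.
first_size = 10_000    # original (first) file, bytes
model_size = 1_500     # size of the created model
second_size = 6_000    # compressed (second) file

transferred = model_size + second_size  # what the network must carry
saving = first_size - transferred       # capacity saved by compressing
```

Here the network must carry 7 500 bytes instead of 10 000, so compression pays off; were the model larger than 4 000 bytes, it would not.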
A sequence of operations and/or decisions can form a tree-type structure. Just as real trees in nature have branches and leaves, branching further towards finer and finer structures, the consequential decisions of a code compression algorithm may form similar branches, as operations and/or decisions focus on finer and finer structures of the file to be further processed, as long as there is some code left to be compressed in the light of the predetermined rules and the compression table.
In disadvantageous conditions such a process may lead to complicated rules, and even to descriptions of the performed operations that are larger than the code to be coded itself. Time may also be needed for decoding very complicated structures, so that the more complicated a structure is, the more time is used.
One common code compression technique uses a model, splitting the coding into a learning phase, comprising the building and pruning phases of the tree, and a coding phase using the model. In the building phase the tree is grown from a single node and an order of decisions is given. To reduce the complexity and size of the model, the tree is pruned to the required level; in this way the precision of the tree is reduced while its compression capability is maintained.
A man skilled in the art recognizes from publication [1], Christopher W. Fraser, Automatic Inference of Models for Statistical Code Compression, in Proceedings of PLDI'99, pages 242-246, May 1999, a method for building a decision tree as described therein. In [1] decision trees are used as models of coding. Binary trees are built, but not actually pruned; instead, in order to reduce their size, these trees are simply transformed into DAGs (Directed Acyclic Graphs, i.e. graphs that do not contain cycles, as a man skilled in the art would immediately recognize) by merging similar leaves.
A man skilled in the art recognizes from publication [2], Minos Garofalakis, Dongjoon Hyun, Rajeev Rastogi and Kyuseok Shim, Efficient Algorithms for Constructing Decision Trees with Constraints, in Knowledge Discovery and Data Mining, pages 335-339, 2000, a method for tree pruning. In [2] the method for tree pruning is described using cost functions that involve only the size of the tree. Garofalakis et al. developed a method that first builds the full tree, which is then pruned in such a way that the result would be encoded into the minimal functional size. However, that method does not involve the information content of the tree when pruning, and the tree is not used as a model of coding as such, but for other purposes.
Decision trees are commonly used in data mining, and commercially available packages such as CART and C4.5 exist. The known decision trees are binary, rooted trees. http://www.cse.ucsc.edu/reserarch/compbio/genex/otherTest.html links to an Internet page [3] in which tree evaluation and the growing/pruning of a tree are described. The document provides a description of classifiers of internal nodes against a certain threshold. In addition to the standard algorithm, considered to be C4.5, a hyperplane technique is referred to in relation to an OC1 system, as well as to an improved version of it, called therein MOC1, which also relates to Vapnik-Chervonenkis theory.
Parzen windows classification, a generalization of the k-nearest neighbours technique, relates to a known technique. In such techniques nonparametric density estimation is used, and it is also known to use the approximated densities for a posterior probability. The Parzen windows classification algorithm does not need a training phase, but lack of sparseness can make the performance of the algorithm slow.
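As a rough, one-dimensional sketch of that known technique (not part of the invention; parameter names and the Gaussian window are illustrative choices), a Parzen windows classifier could look like:

```python
import math

def parzen_classify(x, train, h=1.0):
    """Classify scalar x by Parzen-window density estimates.
    `train` maps class label -> list of training points.
    There is no training phase: every stored point is revisited
    at each prediction, which is what makes the method slow
    when the representation is not sparse."""
    def kernel(u):  # Gaussian window
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    total = sum(len(v) for v in train.values())
    scores = {}
    for label, points in train.items():
        density = sum(kernel((x - p) / h) for p in points) / (len(points) * h)
        prior = len(points) / total
        scores[label] = prior * density  # proportional to the posterior
    return max(scores, key=scores.get)
```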
Fisher's linear discriminant and Fisher's criterion relate to projecting high-dimensional data onto a line and performing the classification in one dimension. It is also described in [3] that a cost function can be optimized on a training set for threshold determination.
FIG. 1A indicates a model based code compression method, which comprises the phases of starting 1 the model based coding, creating 2 a model, a utilisation phase 3 of the model for compressing and/or decompressing, and an ending phase 4 of the model based coding. In phase 3 the model is used for compressing/decompressing the code; phase 4 ends the process.
The phase 2 of FIG. 1A is described in more detail in FIG. 1B. Model creation is started in sub-phase 21, which may comprise a sub-step of selecting and/or adjusting the model to be used in the coding session. In phase 22 the input data that has significance for the coding session according to the method of FIG. 1A is processed into a form utilizable by the model and by the grow/prune sub-phase 23 of FIG. 1B.
The sub-phase 23 is illustrated in FIG. 1C. The sub-phase 23 comprises the sub-step of starting 231 the grow/prune phase of the model for utilisation in the code compression/decompression, both of which are also called treatment of code. The phase 23 further comprises a tree growing phase 232 and a tree pruning phase 233, which are performed separately and are each described in more detail in FIGS. 1D and 1E, respectively.
The sub-phase 23 also has an ending phase 234, which involves the steps necessary for stopping the process of the sub-phase 23. The phase 234 can, however, comprise steps related to the product of the sub-phase 23.
The sub-phase 232 has sub-phases that are described in FIG. 1D. The sub-phase 2321 starts the growing of a tree with the necessary preparation of the means and data to be used in the growing phase 2322. In phase 2322 a sub tree is grown at the root. The sub tree growing is stopped in phase 2323.
The sub-phase 233 has sub-phases that are described further in FIG. 1E. The sub-phase 2331 starts the pruning of a tree with the necessary preparation of the means and data to be used in the pruning phase 2332. In phase 2332 a sub tree is pruned at the root. The sub tree pruning is stopped in phase 2333.
In FIG. 1F the sub-phase 2322 is described in more detail. There are sub-phases of starting 23221 a sub tree growing at a node, and a check phase 23222 of whether or not a stopping criterion is met, defining the conditions in which the sub tree growing should stop in the sub-phase 2322. If the stopping criterion is met, no children are created in phase 23223, and the sub-phase continues by skipping phase 23224 to the stopping phase 23225 of the sub tree growing. If the stopping criterion of phase 23222 is not met, children are created in phase 23223 where and/or when needed. Since children were created in phase 23223, in phase 23224 a sub tree is grown at each child. The stopping phase 23225 stops the sub-phase 2322.
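The recursion of sub-phase 2322 can be sketched as follows; the depth limit stands in for whatever stopping criterion phase 23222 actually applies, and the class and parameter names are illustrative only:

```python
class Node:
    def __init__(self):
        self.children = []

def grow(node, depth, max_depth=3, fanout=2):
    # Check the stopping criterion (23222); here a simple depth
    # limit is used as a placeholder for the model's criterion.
    if depth >= max_depth:
        return node            # criterion met: no children created
    for _ in range(fanout):    # create children (23223)
        child = Node()
        node.children.append(child)
        grow(child, depth + 1, max_depth, fanout)  # grow at each child (23224)
    return node
```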
In FIG. 1G the sub-phase 2332 is described further in more detail. There are sub-phases of starting 23321 a sub tree pruning at a node, and a check phase 23322 of whether or not a stopping criterion is met, defining the conditions in which the sub tree pruning should stop in the sub-phase 2332. Such a stopping criterion comprises a check of whether or not the node is a leaf; if the node is a leaf, the pruning is stopped for that node. If the node is not a leaf, the process continues in phase 23323 by pruning the sub tree at each child. The costs are then evaluated in phases 23324 and 23325 for a decision phase 23326. In phase 23324 a cost C1 is evaluated for all children plus the internal node, and in phase 23325 a cost C2 is evaluated for a leaf replacing the node. In the decision phase 23326 a decision is made between two possibilities: keep the children, which has the cost C1, or drop them and replace the sub tree with a leaf, which has the cost C2. Since the cost is to be minimized, the operation belonging to the lower cost is performed: if C1 is less than C2 the children are kept, but if C2 is less than C1 they are dropped. In case of equality (C1=C2) the simpler structure, a leaf, is chosen, so the children are also dropped. The stopping phase 23327 stops the sub-phase 2332.
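The cost-driven recursion of sub-phase 2332 can be sketched as below; the cost functions are hypothetical stand-ins for the model's cost evaluation in phases 23324 and 23325, and all names are illustrative:

```python
class Node:
    def __init__(self, leaf_cost, children=None):
        self.leaf_cost = leaf_cost
        self.children = children or []

def prune(node, cost_internal, cost_leaf):
    """Prune the sub tree rooted at `node`, returning its minimal cost."""
    if not node.children:                 # the node is a leaf: stop (23322)
        return cost_leaf(node)
    # Prune the sub tree at each child first (23323), then compare
    # C1 (internal node plus children, 23324) with C2 (a leaf
    # replacing the node, 23325).
    c1 = cost_internal(node) + sum(prune(c, cost_internal, cost_leaf)
                                   for c in node.children)
    c2 = cost_leaf(node)
    if c1 < c2:                           # keep the children (23326)
        return c1
    node.children = []                    # C2 <= C1: replace by a leaf
    return c2
```

Note that on equality the children are dropped, matching the rule that the simpler structure, a leaf, is chosen.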
For proper use it is necessary that the compression can be reversed, preferably with no errors or within tolerable margins. Code compression algorithms that either have an inverse algorithm or comprise one themselves are here called "bijective"; such bijective algorithms thus always have or comprise an inverse algorithm for performing the code compression backwards. Such pairs of algorithms can be regarded as operators and their inverse operators, respectively. Especially model based algorithms that have the bijective property are very useful.
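The operator/inverse-operator pairing can be illustrated with any lossless compressor; here the standard zlib library merely stands in for a bijective pair in the sense above, and is not the compression of the invention:

```python
import zlib

def pack(code: bytes) -> bytes:
    # The forward operator.
    return zlib.compress(code)

def unpack(packed: bytes) -> bytes:
    # The inverse operator: restores the original code exactly.
    return zlib.decompress(packed)
```

The defining property is that `unpack(pack(code)) == code` for every input, i.e. the compression can always be done backwards without error.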
As to the terminology, the term method is used for a series of actions, as in normal language in common patent terminology. The term algorithm is used to refer to such a method that comprises method steps especially advantageous for implementation by a computer or the like.
The term code should be understood here also as a file, which can be just a data file or a series of commands, preferably executable by a processor of a computer, independently of the form in which the code is presented to a computer for execution. A file can be almost any suitable ensemble of mechanically and/or electromagnetically handled values or characters, provided that it is commonly understood in the field of the technique of this application and that it is in a machine-readable form.
When making programmatic structures for an executable file, before such a file becomes executable in a processor, the file must be coded or translated from one language to another that is more relevant to the hardware and to the execution in it. In such a case there are often programmatic structures, such as commands or combinations thereof, written in several times, which may disadvantageously only increase the size of the file. The files can therefore grow into extremely large sizes, sometimes so large that even the programmer cannot know what certain lines were written for. Automation may leave even more such structures, repeated several times more than by a human programmer. It would be sufficient to write such repeated structures only once in the coded code and to link to the point where each structure was used, guaranteeing correct performance with a reasonable file size. Using a model that comprises a suitable pruning phase of the code can also reduce errors or human mistakes, in addition to saving memory and execution time.
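Writing each repeated structure only once and linking back to it can be sketched as a simple dictionary substitution; the encoded `("def", ...)`/`("ref", ...)` format is a hypothetical illustration, not the coding of the invention:

```python
def dedup(lines):
    # Keep the first occurrence of each structure; replace every
    # repeat by a link (index) to the first occurrence.
    table, out = {}, []
    for line in lines:
        if line in table:
            out.append(("ref", table[line]))
        else:
            table[line] = len(table)
            out.append(("def", line))
    return out

def restore(encoded):
    # Inverse operation: follow each link back to its definition.
    defs, out = [], []
    for kind, val in encoded:
        if kind == "def":
            defs.append(val)
            out.append(val)
        else:
            out.append(defs[val])
    return out
```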
To explain the term tree and the related terminology, some terms are considered, although a man skilled in the art knows their meaning in the field. "Node" and "(directed) edge" are terms used mathematically for graphs. A graph can be drawn on paper, for instance; when drawn, a node becomes a "point" on the paper, and a directed edge between two nodes becomes an "arrow" between two points on the paper. Thus "node" and "point" mean essentially the same, as do "directed edge" and "arrow". A directed graph has nodes, also called points, as well as directed edges, also called arrows, between the nodes. The term tree means here a special kind of (directed) graph: all but one node have exactly one incoming edge, i.e. all but one point have exactly one arrow pointing to them. A node that has no incoming edges is called a root. One node can have many outgoing edges; in other words, many arrows can start from one point.
The nodes pointed to by the outgoing edges of a parent node are called the children of said parent node. A node that has no children is called a leaf. Each node in a tree is the root of a sub tree of said tree: if the incoming edge of said node is deleted and said node is taken as the root, a tree can be formed that was a part of the original tree. In the extreme case each tree is a sub tree of itself.
The problems relating to conventional model based code compression according to known techniques are solved, or at least mitigated considerably, by the merits of the embodiments of the invention.