By elements we understand, for example, Internet users; workgroups of a company or government institution; users of the same application, such as the same computer program or information system; cells in a biological system, such as a multicellular organism; or microorganisms or viruses that communicate or exchange fragments of DNA or entire DNA, as in gene transfer. Elements may also be intermediaries of information sharing, such as communication channels or nodes, and they may also be groups of elements or subsystems. Data information about an element may be, for example, the genetic information of a cell, the electric signals representing the state of an operating system at a certain point in time, or the electric or wave signals representing identification data stored in artificial databases, such as age and address. By information sharing we understand, for example, Internet communication; aggregate communication data (e.g. a trace of data exchange among multiple users as stored by an intermediate communication network node); exchange of information contained within operator networks; or publication of aggregate data about, for example, network searches. By sharing of information in biological systems we understand, for example, the processing of various signals, or of fragments of genetic information contained in genomic DNA, by cells or by parts of the DNA of a single cell.
Common statistical methods gather information about elements, and information shared by elements, without removing the private content. These methods assume that the information is gathered by an authority that is generally trusted not to abuse it and to protect it from malicious attacks. This assumption is unrealistic: the individual elements often have no reason to trust the information-gathering authority. They therefore do not make the information available, and the statistical analysis is not possible. Sometimes, access to sensitive information during analysis or study by a third party is protected by bilateral legal contracts and considerable sanctions in the case of an information leak or abuse. The disadvantages of such an approach are considerable work expenditure, poor scalability, complicated or impossible access for further parties, complicated and costly control and enforcement, complicated and costly security protection, and considerable ineffectiveness (in particular against leaks caused by one's own personnel).
However, there exist important techniques concerning systems that need only partial information. An example of such a technique is intrusion detection in Internet communication, where the presence of a virus is indicated by the communication containing a short segment that is present in a database of malicious segments. A different threat may be indicated by frequent repetition of the same short segment within the Internet communication.
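The two indicators described above can be illustrated by a minimal sketch. It assumes that the communication is available as a byte string and that all segments have a fixed length `k`; the function names, the signature set and the threshold are hypothetical and serve only as an illustration.

```python
from collections import Counter


def segments(data: bytes, k: int):
    """Yield every contiguous segment of length k (illustrative helper)."""
    for i in range(len(data) - k + 1):
        yield data[i:i + k]


def matches_known_segment(data: bytes, signatures: set, k: int) -> bool:
    """First indicator: some short segment occurs in a set of known malicious segments."""
    return any(seg in signatures for seg in segments(data, k))


def repeated_segments(data: bytes, k: int, threshold: int) -> dict:
    """Second indicator: segments that occur at least `threshold` times."""
    counts = Counter(segments(data, k))
    return {seg: n for seg, n in counts.items() if n >= threshold}
```

Both checks use only the short segments and how often they occur; neither needs the position of a segment within the communication.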
For techniques concerning systems that require only partial data information, the important parts are short segments of the original data information, which we denote as local data information segments. We denote as a collection of local data information segments any unordered group of multiple local data information segments. The above reasoning motivates the need to process the original data information, encoded as a sequence of symbols, so that its content is concealed yet the relevant collection of local data information segments is preserved.
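As an illustration only, a collection of local data information segments can be modelled as a multiset of fixed-length substrings of the symbol sequence. The fixed segment length `k` and the function name below are assumptions made for this sketch and are not taken from the description.

```python
from collections import Counter


def local_segment_collection(symbols: str, k: int) -> Counter:
    """Return the unordered collection (multiset) of all length-k segments.

    Illustrative helper: positions are discarded, only segments and
    their multiplicities are kept.
    """
    return Counter(symbols[i:i + k] for i in range(len(symbols) - k + 1))


# Two different sequences may yield the same collection.
first = local_segment_collection("abaca", 2)
second = local_segment_collection("acaba", 2)
same_collection = first == second
```

In this example the sequences "abaca" and "acaba" differ, yet both produce the segments "ab", "ba", "ac" and "ca" exactly once, so an analysis that depends only on the collection cannot distinguish them.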
The processed data information containing the collection of local data information segments could be shared; it would enable analysis and correction of the system and would encourage mutual communication among elements. A proposal for processing of data information was presented in the work: Lukas Kencl, Jose Zamora, Martin Loebl, “Packet Content Anonymization by Hiding Words”, Demo at IEEE INFOCOM, Barcelona, Spain, April 2006, which uses random permutations of a collection of short overlapping data segments. However, it was later shown that this technique does not conceal the original data information.
From biology we know that a large fraction of the eukaryotic genome is composed of DNA sub-segments that repeat many times, either exactly or with slight alterations. In computational biology this phenomenon is identified as the main obstacle for current methods of reconstructing longer segments of DNA from a known set of shorter overlapping segments. The reason is that if the set contains a large number of shorter segments with repeating initial or terminal sub-segments, there exists an uncontrollable number of possible reconstructions of the longer segments that are consistent with the analysis of overlaps.
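The ambiguity caused by repeats can be made concrete with a small brute-force sketch. It assumes that all shorter segments have the same length k of at least 2 and that consecutive segments overlap in k - 1 symbols; the function names are hypothetical, and the exhaustive search is meant only for toy inputs, not as a practical assembly method.

```python
from collections import Counter


def overlapping_segments(sequence: str, k: int) -> list:
    """All length-k segments of a sequence, in order (illustrative helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def reconstructions(segment_list: list) -> set:
    """Enumerate every sequence that uses each given segment exactly once,
    with consecutive segments overlapping in k - 1 symbols."""
    k = len(segment_list[0])
    if k < 2:
        raise ValueError("segments must have length at least 2")
    remaining = Counter(segment_list)
    total = len(segment_list)
    found = set()

    def extend(sequence: str, used: int) -> None:
        if used == total:
            found.add(sequence)
            return
        suffix = sequence[-(k - 1):]
        for seg in list(remaining):
            if remaining[seg] > 0 and seg.startswith(suffix):
                remaining[seg] -= 1
                extend(sequence + seg[-1], used + 1)
                remaining[seg] += 1  # backtrack

    for start in list(remaining):
        remaining[start] -= 1
        extend(start, 1)
        remaining[start] += 1
    return found


# Without repeats the overlaps determine a single sequence.
unique = reconstructions(overlapping_segments("ACGTT", 3))
# The repeated sub-segment "AT" allows a second consistent sequence.
ambiguous = reconstructions(overlapping_segments("GGATCATGATT", 3))
```

For "ACGTT" the only reconstruction is the original sequence, whereas the segments of "GGATCATGATT" are also consistent with "GGATGATCATT". The number of such alternatives grows quickly as more repeats are present.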