Chemoinformatics plays an important role in areas that rely on the topology of, and information about, chemical space. Many areas concerned with the discovery and formulation of new materials or drugs involve an immense amount of study, modeling and simulation of chemical structures, formulae, properties and similar aspects to achieve the end result.
Chemoinformatics methods are often used in pharmaceutical companies in the process of drug discovery or formulation. These methods can also be used in chemical and other allied industries. Interpreting chemical structures and formulae as computable structures is cumbersome and time-consuming, and often requires manual intervention. Enormous effort goes into drafting images for scholarly papers and articles, yet such images cannot be reused for computational purposes.
Some documents teach the extraction of data relating to chemical structures. Reference may be made to Patent Application US2011202331, which discloses methods and software for processing text documents and extracting chemical data therein. Preferred method embodiments of said invention comprise: (a) identifying and tagging one or more chemical compounds within a text document; (b) identifying and tagging physical properties related to one or more of those compounds; (c) translating one or more of those compounds into a chemical structure; (d) identifying and tagging one or more chemical reaction descriptions within the text document; and (e) extracting at least some of the tagged information and storing it in a database.
Reference may be made to an article titled “CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical Chemical Structure Recognition” by Aniko T. Valko et al. in J. Chem. Inf. Model., 2009, 49(4), pp 780-787, which discloses an advanced version of the CLiDE software, CLiDE Pro, for extracting chemical structure and generic structure information from electronic images of chemical molecules available online and from pages of scanned documents. The extraction process has three steps: segmentation of the image into text and graphical regions; analysis of the graphical regions and reconstruction of the connection table; and interpretation of generic structures by matching R-groups found in the structure diagrams with those located in the text.
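As a rough illustration of what reconstructing a connection table involves, the following minimal sketch merges the endpoints of already vectorized bond lines into atom nodes and records which nodes each line joins. It is not the algorithm used by CLiDE Pro; the function name `build_connection_table`, the segment input format and the `tolerance` parameter are assumptions made for this example only.

```python
from math import hypot


def build_connection_table(segments, tolerance=2.0):
    """Merge nearby segment endpoints into atoms and return (atoms, bonds).

    segments: list of ((x1, y1), (x2, y2)) line segments, one per drawn bond.
    atoms:    list of (x, y) node positions, in order of first appearance.
    bonds:    list of (i, j) pairs of atom indices, with i < j.
    The tolerance (in pixels) is a hypothetical value chosen for illustration.
    """
    atoms = []
    bonds = []

    def atom_index(point):
        # Reuse an existing atom if the endpoint lies within the tolerance.
        for i, (ax, ay) in enumerate(atoms):
            if hypot(ax - point[0], ay - point[1]) <= tolerance:
                return i
        atoms.append(point)
        return len(atoms) - 1

    for start, end in segments:
        i, j = atom_index(start), atom_index(end)
        if i != j:  # skip degenerate segments that collapse to one atom
            bonds.append((min(i, j), max(i, j)))
    return atoms, bonds


# A slightly imprecise triangle, as a vectorizer might produce it.
triangle = [
    ((0, 0), (10, 0)),
    ((10.5, 0.5), (5, 8)),
    ((5, 8.5), (0.5, 0)),
]
atoms, bonds = build_connection_table(triangle)
```

For the triangle above, the six endpoints collapse into three atoms joined by three bonds. A real system would also have to assign element labels and bond orders, which this sketch leaves out.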
Reference may be made to U.S. Pat. No. 5,157,736, which discloses apparatus and methods for optical recognition of chemical graphics that allow documents containing chemical structures to be optically scanned so that both the text and the chemical structures are recognized. In the said invention, the structures are directly converted into molecular structure files suitable for direct input into chemical databases, molecular modeling programs, image rendering programs, and programs that perform real-time manipulation of structures. Reference may be made to a paper titled “Optical recognition of chemical graphics” by Casey R. et al., which appeared in Document Analysis and Recognition, 1993, Proceedings of the Second International Conference, and which discloses a prototype system for encoding chemical structure diagrams from scanned printed documents.
Reference may be made to an article titled “Automatic Recognition of Chemical Images” by Maria-Elena Algorri, which discloses a system that can automatically reconstruct the chemical information associated with images of chemical molecules, thus rendering them computer readable. The system consists of 5 modules: 1) a Pre-processing module, which binarizes the input image and labels its constituent connected components; 2) an OCR module, which examines the connected components and recognizes those that represent letters, numbers or special symbols; 3) a Vectorizer module, which converts the connected components not labeled by the OCR into graphs of vectors; 4) a Reconstruction module, which analyzes the graphs of vectors produced by the vectorizer and annotates the vectors with their chemical significance using a library of chemical graph-based rules, and which also analyzes the results of the OCR, groups the letters, numbers and symbols into names of atoms and superatoms, and associates the chemically annotated vector graphs with the results of the OCR; and 5) a Chemical Knowledge module, which turns the chemically annotated vector graphs into chemical molecules under knowledge-based chemical rules, verifies the chemical validity of the molecules and produces the final chemical files.
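The pre-processing step described above, binarizing an image and labeling its connected components, can be sketched as follows. This is a minimal illustration and not the cited system's implementation; the grayscale input as a list of rows, the fixed threshold of 128, the use of 8-connectivity and the function names are all assumptions made for the example.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Return a grid of 0/1 values where 1 marks dark (ink) pixels.

    gray is a list of rows of grayscale values in 0..255; the threshold
    is a hypothetical fixed cut-off chosen for illustration.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def label_components(binary):
    """Label 8-connected ink regions; return (labels grid, component count)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or labels[y][x] != 0:
                continue
            # Start a new component and flood-fill it breadth-first.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Two separate vertical strokes on a white background.
gray = [
    [255, 0, 255, 255],
    [255, 0, 255, 0],
    [255, 255, 255, 0],
]
labels, count = label_components(binarize(gray))
```

Each labeled component could then be passed either to a character recognizer or to a vectorizer, in the manner of modules 2) and 3) above.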
Reference may be made to an article in J. Chem. Inf. Model. 2009, 49, 740-743, wherein the authors built an optical structure recognition application, OSRA, based on modern advances in image processing implemented in open source tools. OSRA can read documents in over 90 graphical formats, including GIF, JPEG, PNG, TIFF, PDF, and PS; it automatically recognizes and extracts the graphical information representing chemical structures in such documents and generates the SMILES or SD representation of the encountered molecular structure images.
However, processing live images from webcams to harvest chemical data from hand-drawn images is found to be difficult. There exists a need for a tool that acquires data from a digital imaging apparatus and efficiently converts it into file formats suitable for reuse in simulation and modeling.