(1) Field of Invention
The present invention relates to a system for detecting dates in text and, more particularly, to a system for detecting dates in texts written in Persian (Farsi).
(2) Description of Related Art
Date detection is an important component of conceptual search in texts. While searching for content in text is important, it is often equally important to be able to automatically identify or detect relevant dates in streaming text, online text, or text on other platforms (e.g., social media platforms, such as Twitter®). As can be appreciated, it is also desirable to be able to extract dates across dialects and languages.
To address this need, researchers have developed Spanish and Portuguese date extractors (see “Detecting future social unrest from unprocessed twitter data,” Compton, Lee, De Silva, Lu, Macy, IEEE ISI 13, which is incorporated in its entirety by reference as though fully set forth herein). While operable for Spanish and Portuguese, the work of Compton et al. does not address the detection of dates written in the Farsi language.
More than 70 million Iranians speak Farsi, of whom about 40% are between the ages of 15 and 45. Despite censorship by the Iranian government, according to Wikipedia®, “Iran experienced a great surge in Internet usage, and, with 20 million people on the Internet, currently has the second highest percentage of its population online in the Middle East, after Israel.” As can be appreciated, software that detects dates in Farsi text is potentially of great use.
Although an algorithm for extracting dates in Spanish and Portuguese is available, the very different structure and grammar of Persian make it impossible to use that software for Farsi in a trivial way. Unlike in Spanish and Portuguese, the names of most days of the week in Farsi have two components, which may be joined directly to each other, separated by a space, or separated by a zero-width non-joiner character.
FIG. 1, for example, shows two forms of writing Sunday in two Tweets® 101 and the corresponding date stamps (in the left column 103), i.e., the creation date of each Tweet®. Sunday is written with a space between its components in the first Tweet® 100 and without a space in the second Tweet® 102. Since one of the components appears in the names of six days of the week, and the second component is a number, searching for the name of a day requires more work than just looking for one name.
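As a non-limiting sketch of why matching a day name takes more than a single string lookup, the following Python fragment accepts the three spellings described above (components joined directly, separated by a space, or separated by a zero-width non-joiner). The names WEEKDAY_RE, PREFIX_TO_DAY, and find_weekdays are hypothetical and chosen for illustration only. The fragment is not the method of Compton et al., and it assumes the text uses the Persian forms of the letters (it does not normalize Arabic-script character variants).

```python
import re

ZWNJ = "\u200c"   # zero-width non-joiner
SHANBEH = "شنبه"  # shared component; on its own it means Saturday
FRIDAY = "جمعه"   # Friday does not use the shared component

# Numeric first component -> English day name (illustrative mapping).
PREFIX_TO_DAY = {
    "یک": "Sunday",
    "دو": "Monday",
    "سه": "Tuesday",
    "چهار": "Wednesday",
    "پنج": "Thursday",
}

_prefix_alt = "|".join(map(re.escape, PREFIX_TO_DAY))

# Optional numeric prefix, then an optional space or ZWNJ, then the shared
# component; Friday is matched as a separate alternative.
WEEKDAY_RE = re.compile(
    rf"(?:(?P<prefix>{_prefix_alt})[ {ZWNJ}]?)?{SHANBEH}|{FRIDAY}"
)


def find_weekdays(text):
    """Return the English names of the weekdays mentioned in text, in order."""
    days = []
    for match in WEEKDAY_RE.finditer(text):
        if match.group(0) == FRIDAY:
            days.append("Friday")
        else:
            # No numeric prefix means the bare shared component, i.e. Saturday.
            days.append(PREFIX_TO_DAY.get(match.group("prefix"), "Saturday"))
    return days
```

In this sketch, a single pattern covers all three separator variants, so the two Tweets® of FIG. 1 would both resolve to the same day name even though their spellings differ.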
In addition, people mention dates online in ways that differ from the official way of writing dates in Farsi. They might mention the name of the month from either the Persian or the Gregorian calendar, using the Persian alphabet; for more details on this method of writing, see en.wikipedia.org/wiki/Fingilish. People might also write the numbers in English, while the names of the months are from the Persian calendar. These challenges, among many others, make it impossible to apply the algorithm for Spanish and Portuguese to Farsi in a trivial way.
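The mixed use of numerals noted above can be illustrated with a minimal, non-limiting Python sketch that maps Persian and Arabic-Indic digits to ASCII before looking for a day number followed by a Persian month name. The names DIGIT_MAP, PERSIAN_MONTHS, DAY_MONTH_RE, and find_day_month are hypothetical, and the single "day, space, month" pattern is an assumption made for illustration; real text contains many more orderings and spellings than this fragment handles.

```python
import re

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII.
DIGIT_MAP = {0x06F0 + i: str(i) for i in range(10)}
DIGIT_MAP.update({0x0660 + i: str(i) for i in range(10)})

# Months of the Persian calendar, in calendar order.
PERSIAN_MONTHS = [
    "فروردین", "اردیبهشت", "خرداد", "تیر", "مرداد", "شهریور",
    "مهر", "آبان", "آذر", "دی", "بهمن", "اسفند",
]

# A one- or two-digit day, a space, then a month name that is not
# immediately followed by another word character.
DAY_MONTH_RE = re.compile(
    r"(?<!\d)(\d{1,2}) (" + "|".join(PERSIAN_MONTHS) + r")(?!\w)"
)


def find_day_month(text):
    """Return (day, month_number) pairs found in text after digit normalization."""
    normalized = text.translate(DIGIT_MAP)
    return [
        (int(day), PERSIAN_MONTHS.index(month) + 1)
        for day, month in DAY_MONTH_RE.findall(normalized)
    ]
```

Normalizing the digits first means the same pattern serves whether the writer used English or Persian numerals, which keeps the matching rules from multiplying with each numeral style.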
Thus, a continuing need exists for a system for temporally tagging (e.g., extracting dates) Farsi language text in a variety of platforms.