Monolingual Search Engines Launched in 5 Indian Languages
Search in Hindi, Bengali, Marathi, Tamil and Telugu Enabled for Tourism Sector
SANDHAN, the Indian language search engine for the tourism domain, has been launched. It is intended to fill the wide gap in meeting the information needs of Indians not conversant with English, estimated at 90% of the population.
At the unveiling of the search engines for Bengali, Hindi, Marathi, Tamil and Telugu, it was observed that the real success would come when even village-level e-services are available in local languages.
SANDHAN has been developed by 120 researchers from 12 institutions over a period of 6 years, led by Dr. Pushpak Bhattacharya under the supervision of TDIL, DeitY. The project aims to satisfy users' information needs through text documents available on the web. The search engine accepts a query in one of the 5 Indian ‘query’ languages: Bengali, Hindi, Marathi, Tamil and Telugu. The query is processed to retrieve a set of relevant documents in the same language from tourism-domain data crawled from the World Wide Web (WWW). The retrieved documents are presented to the user as a list ordered by relevance.
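The monolingual retrieval flow described above can be illustrated with a minimal Python sketch. It is not SANDHAN's actual implementation: the `Document` class, the language codes, the whitespace tokenizer and the term-count scoring are all assumptions made for illustration, and a production system would use language-specific analysis and a proper ranking model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    lang: str  # hypothetical language tag, e.g. "hi", "bn", "mr", "ta", "te"
    text: str


def tokenize(text):
    # Assumption: plain whitespace splitting stands in for real
    # language-specific tokenization and stemming.
    return text.split()


def search(query, query_lang, docs):
    """Return documents in query_lang, ordered by a simple term-count score."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in docs:
        # Monolingual constraint: skip documents in any other language.
        if doc.lang != query_lang:
            continue
        counts = Counter(tokenize(doc.text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

The language is passed in explicitly here rather than detected from the script, since Hindi and Marathi share the Devanagari script and cannot be told apart by character range alone.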
Although designed mainly for tourism, sectors such as business and academia would also benefit from SANDHAN. It can also be deployed as part of e-governance and e-learning.
The following are the salient features of SANDHAN:
• The system is developed to satisfy users' information needs in the tourism domain.
• Users can submit a query using either an InScript keyboard or a phonetic keyboard. With InScript, the query can be typed on the physical keyboard or on an onscreen keyboard.
• It processes each query according to its language and retrieves results ONLY from that language.
• A snippet is generated for each retrieved document, helping the user understand the context of the query terms in that document.
• A summary is generated for each retrieved document, giving the user an idea of its overall content without opening it.
• An additional URL-based semantic search facility is provided for the Tamil language.
• Ten results are displayed at a time to improve readability.
• Many Indian-language web pages use custom fonts, which makes it difficult for the system to retrieve them. SANDHAN uses a font transcoder that converts custom-font text into Unicode for processing.
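Two of the features listed above, snippet generation and showing ten results at a time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `make_snippet` and `paginate`, the word-window approach and the whitespace tokenization are hypothetical and are not taken from SANDHAN itself.

```python
def make_snippet(text, query_terms, window=5):
    """Return the words around the first query term found in text."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token in query_terms:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            return " ".join(tokens[start:end])
    # No query term present: fall back to the opening words.
    return " ".join(tokens[: 2 * window + 1])


def paginate(results, page, page_size=10):
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

For example, `paginate(results, 1)` would give the first ten results and `paginate(results, 2)` the next ten.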
SANDHAN is a mission-mode project of a consortium of academic and research institutions and industry partners. The institutes involved are IIT Bombay (Consortium leader), CDAC Noida (Co-Consortium leader), IIT Kharagpur, Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar, Anna University-Centre for Electronics, Anna University-Knowledge-based Computing Centre, CDAC Pune, Gauhati University, Indian Institute of Information Technology Bhubaneswar, International Institute of Information Technology Hyderabad, ISI Kolkata and Jadavpur University. It was conceptualized, evolved and funded as a national-level project in the emerging area of Information Retrieval and Access in Indian Languages by the Department of Electronics & Information Technology (DeitY).
The SANDHAN project has been put together by Technology Development for Indian Languages (TDIL), a flagship programme of DeitY involved in research, development, standardization and proliferation of language technology in India in the 22 constitutionally recognized Indian languages. The TDIL Programme is also associated with international standardization bodies such as the Unicode Consortium, W3C, IETF and ELRA.
The link for SANDHAN is www.tdil-dc.in/sandhan