Thursday 25 October 2018

Information Left Out of the Computing Section of the Sindhi Encyclopedia



Mazhar Ali Dootio
Sindhi is one of the world's oldest and original languages, and it is written, spoken and read all over the world. It is also written and read on social networking websites, which gives the world's researchers, learners and language analysts access to a store of Sindhi words, or corpus. Computer science works on languages from many angles, and that work is under way for Sindhi too. For example, many Sindhi books have now been digitized with computer software, so anyone can read them anywhere on a mobile phone or computer. In addition, photographs and videos of Sindh's recreational and historical sites are being put online, or the sites are being added to Google Maps, so that the world can see them. This also makes things easier for tourists. Programmers, meanwhile, are building various programs and fonts that improve Sindhi writing and bring Sindhi to different computer platforms.
Alongside all of this, another kind of work is being done on Sindhi that is of great importance to the world today. It is called natural language processing and computational linguistics. These two fields bring the world's spoken languages into the fold of international languages and also help with translation and information retrieval. Thanks to them, a dedicated translation system or tool can be built for any language, and the information and sentiments hidden in a language's text or corpus can be understood, extracted or analysed. I have done this work for Sindhi. With it, the store of Sindhi words, or corpus, can be understood and analysed through machine learning or deep learning; Sindhi can be translated into any language of the world; and Sindhi sentences can be segmented according to universal grammar and Sindhi grammar, with their words and markers displayed. Lemmas can be extracted from Sindhi text and derivations of those words can be produced. Sentiments can also be identified in people's opinions and analysed, including the opinions people give about different products. Because of our work, research on Sindhi has begun around the world. Regrettably, our government and the institutions concerned with the language are ignoring it, even though it is historic work that has never before been done at this level.
The Sindhi Language Authority has published the Sindhi Encyclopedia, an excellent effort that one cannot help but praise. Regrettably, these important topics were left out when the Sindhi computing section of the encyclopedia was written. Including them was essential, so that the world would know that the Sindhi nation is an alert nation that keeps pace with the world. This work is available on the internet under the name Sindhi NLP (http://www.sindhinlp.com), and a great deal of research is being done with it at home and abroad. The whole world has acknowledged our work, yet the Sindhi Language Authority has ignored it. What the purpose of ignoring it might be is unknown, but leaving important research on Sindhi out of Sindhi computing is a loss for the Sindhi language. This work will certainly come to light, if not today then tomorrow, when Sindhi becomes a subject of research worldwide, and researchers will then cite it.
Majid Bhurgri did the Sindhi language a great service by computerizing it. That made Sindhi newspapers, books, websites, blogs and more possible, to the point that work on Sindhi computational linguistics could be done. A great deal of work is now being done worldwide on computational linguistics and natural language processing, through which many of the world's languages have reached the international level, and we Sindhis are fortunate that Sindhi has joined that group. Regrettably, it has to be written that the Sindhi Language Authority has ignored this major work, which ranks in importance next to Majid Bhurgri's own.
To view this work online, open the following website.

Wednesday 24 October 2018

Unicode-8 based linguistics data set of annotated Sindhi text

https://www.sciencedirect.com/science/article/pii/S2352340918305687


The Sindhi Unicode-8 based linguistics data set is a multi-class, multi-featured data set developed to solve the natural language processing (NLP) and linguistics problems of the Sindhi language. It presents information on the grammatical and morphological structure of Sindhi text as well as the sentiment polarity of Sindhi lexicons. The data set may therefore be used for information retrieval, machine translation, lexicon analysis, language modeling, grammatical and morphological analysis, and semantic and sentiment analysis.

To read this article, please open the link given above.




Sentiment Summerization and Analysis of Sindhi Text

http://thesai.org/Downloads/Volume8No10/Paper_38-Sentiment_Summerization_and_Analysis_of_Sindhi_Text.pdf

A text corpus is important for assessing language features and analysing variation. Machine learning techniques identify language terms, features, text structures and sentiment from a linguistic corpus. Sindhi is one of the oldest languages of the world, with a proper script and a complete grammar, yet it has remained a computationally less-resourced language even in this digital era. In view of this problem, the Sindhi NLP toolkit was developed to solve Sindhi NLP and computational linguistics problems, so this research work may be an addition to NLP. The study develops its own sentimentally structured and analysed Sindhi corpus from the accumulated results of the Sindhi sentiment analysis tool. The corpus is normalized and analysed for language features and variation using document-term matrix (DTM) and TF-IDF techniques, both applied with an n-gram model. A supervised machine learning model is formulated using SVM and K-NN techniques to analyse the Sindhi sentiment analysis corpus data set. Precision, recall and F-score show that the machine learning techniques perform better than other techniques. 10-fold cross-validation is used to validate and evaluate the data set randomly for the supervised machine learning analysis. The study opens doors for linguists, data analysts and decision makers to work further on sentiment summarization and visual tracking.

To read this research article, please open the link given above.
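
The DTM and TF-IDF step named in the abstract above can be illustrated roughly as follows. This is a minimal sketch and not the paper's implementation: the function names, the whitespace tokenization, the unsmoothed `log(N / df)` weighting and the placeholder documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs, n=1):
    """Build a DTM: one Counter of n-gram counts per document, plus the vocabulary."""
    rows = [Counter(ngrams(doc.split(), n)) for doc in docs]
    vocab = sorted(set().union(*rows))
    return rows, vocab

def tf_idf(rows, vocab):
    """Weight each raw count by log(N / df), the plain unsmoothed TF-IDF variant."""
    n_docs = len(rows)
    df = {term: sum(1 for row in rows if term in row) for term in vocab}
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in row.items()}
        for row in rows
    ]

# Placeholder documents (hypothetical); a real run would use normalized Sindhi text.
docs = ["good film good story", "bad film", "good story"]
rows, vocab = document_term_matrix(docs, n=1)
weights = tf_idf(rows, vocab)
print(vocab)
print(weights[0])
```

Calling `document_term_matrix(docs, n=2)` gives the same structure over bigrams, which is one way to read "performed using ngram model".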

Syntactic parsing and supervised analysis of Sindhi text


https://www.sciencedirect.com/science/article/pii/S1319157817301696


This research study addresses the morphological and syntactic problems of Sindhi text by proposing an algorithm for tokenization and syntactic parsing. A Sindhi parser is developed on the basis of the proposed algorithm to perform syntactic parsing of Sindhi text using Sindhi WordNet (SWN) and a corpus. The parsing results are accumulated into a multi-class, multi-feature Sindhi dataset in CSV format. Three attributes of the dataset are labelled as classes, and each of the three classes comprises a different number of categories. SVM, Random Forest and K-NN supervised machine learning methods are trained to analyse and evaluate the dataset, with 80% of the dataset used as the training set and 20% as the test set. 10-fold cross-validation is applied to evaluate and validate the supervised machine learning process. The SVM classifier gives better results on the classes phrase and UPOS, whereas Random Forest gives better results on the class TagStatus. Precision, recall, F-measure and the confusion matrix confirm the performance of all supervised classifiers. The good performance of the supervised machine learning methods supports the use of the Sindhi dataset and the Sindhi online parser in future research. This study opens new doors for research on right-to-left written languages, especially Sindhi, to solve their computational linguistics problems.

Click the link above to read the complete research paper.
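
The evaluation procedure described in the abstract above (10-fold cross-validation, then per-class precision, recall and F-measure) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function names are hypothetical, the folds are taken in order without the shuffling a real experiment would normally add, and no classifier is included.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F-measure from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Placeholder labels (hypothetical), standing in for one class column such as UPOS.
y_true = ["NOUN", "VERB", "NOUN", "ADJ"]
y_pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
print(precision_recall_f1(y_true, y_pred, "NOUN"))
print([len(test) for _, test in k_fold_indices(25, k=10)])
```

In a full experiment each `(train, test)` pair would be used to fit and score one classifier, and the per-fold scores would then be averaged.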

Is Sindhi an ‘endangered’ language?

The decrease in use of this indigenous language is a threat to Pakistan’s linguistic diversity


Mazhar Ali Dootio

Published in Daily Times, July 21st 2018.
Language is one of the most significant markers of identity for any nation or society. It is the main means of communication and the primary way of sharing thoughts, values, emotions, feelings, opinions, social integration, socio-political status and educational development. Language is therefore considered a vital asset for people and the world. About 7,000 languages are spoken around the world today, while hundreds are already dead or extinct.
Linguists are of the opinion that more than 70% of the languages used today will be dead or extinct within the coming decades. Several languages are in danger of becoming extinct, and this is an increasing threat to linguistic diversity.
Languages are the most influential tools for preserving and developing our tangible and intangible heritage. A society or nation is known by its speakers, so language reveals the structure, moral status, behaviour, educational development, social integration, culture, financial status, everyday routine and lifestyle of that nation or society. Language is a natural phenomenon that was born along with man. A mother language is as necessary for man as air, water, oxygen, food, land, culture, rains, etc.
Gyula Juhasz, a Hungarian poet of the early twentieth century, explained the significance of language in these touching words: “Let us take care of the people who sing, the people who create myths, the women who draw pictures, and the men that carve wood and bone, and do not become extinct, for their extinction would mean the end of the mankind. Let us preserve, collect and disseminate their work; let us learn from them; let their art which is permanent as nature and continuous as history, become a part of our own living soul.”
Pakistan is a multi-lingual country that has several socio-economic and linguistic diversity problems. Because there is no proper “language policy”, some languages are being transformed under outside influence and others are in danger of extinction.
The Sindhi language is one of the oldest indigenous languages of the world, with its own history, culture, civilization, heritage, grammar, morphology, land and nation.
Sindhi is one of the country’s major languages and pivotal to Sindhi identity. A majority of historians, anthropologists, linguists and scholars believe that the Sindhi language and culture are not offshoots of any foreign influence but are indigenous, and owe their roots to the Indus Valley civilization.
The Sindhi language has faced several attacks and conspiracies over many years, but it has survived thanks to its rich literature, morphology and complex grammar. The British gave importance to the language and made it an official language; all officers were bound to learn Sindhi while working in Sindh.
After partition, Sindhi, like other regional languages, was ignored. Urdu was declared the official language of the country for all purposes, including education. However, the Sindhis did not accept this decision and struggled to restore their mother language’s status. Some of their efforts were rewarded, but the National Commission for Education’s edict has stopped Sindhi from regaining the status it held under the British.
On 7th July 1972, the Sindh assembly passed a “Sindhi Language Bill,” but it was opposed by non-Sindhi groups. After amendments, the bill was passed in 1990. However, it has not been properly implemented.
Despite these problems, the Sindhi language is alive and developing. It stands together with developed languages such as English, Arabic and Urdu. Sindhi is equipped with natural language processing (NLP) and computational tools and techniques. It is a fully Unicode-based language and has the status, capacity and capability of a modern language. A Sindhi text corpus has been developed for analysis with machine learning and deep learning techniques, which enables machine translation, information retrieval, syntactic, semantic and sentiment analysis, and other computing and business operations for Sindhi. There are hundreds of daily and weekly Sindhi newspapers and magazines, as well as a good number of online blogs and websites. The Sindhi keyboard enables Sindhi users all over the world to write texts and messages on different social media sites and applications.
Sindhi NLP has been developed professionally to solve the computational linguistics and NLP problems of the language. A Sindhi text corpus can now be segmented, tagged with universal and Sindhi parts of speech, parsed syntactically and analysed morphologically with the Sindhi NLP tools (http://www.sindhinlp.com/).
Sentiments and opinions are identified in Sindhi text and analysed to evaluate their positive, negative or neutral polarity. Research on different aspects of Sindhi language and linguistics is in progress and will be published in national and international journals.
However, language diversity, the trend of learning different languages and an “international education system” are generating problems for Sindhi. People living in big cities use other languages to communicate with their children. They tell them stories, poems and songs in foreign languages.
All this might lead to a coming Sindhi generation that does not speak or write Sindhi. This would be a great loss not only to the language but to our society and nation as well, and may even lead to Sindhi’s extinction.
The government, linguists, researchers, scholars and socio-political organisations should take serious notice of this situation and work to prevent this outcome. All these parties should take steps such as incorporating Sindhi in the country’s primary education syllabus and implementing the “Sindhi Language Bill” as safeguards to protect and preserve this rich language.
https://dailytimes.com.pk/270587/is-sindhi-an-endangered-language/

Sindhi NLP Blog

Mazhar Ali Dootio   
مظهر علي ڏُوتيو
Computational Linguist and NLP practitioner 
http://www.sindhinlp.com/     

Sindhi NLP deals with the Sindhi language and addresses its computational linguistics problems by providing the Sindhi Online Parser, Sindhi WordNet, Sindhi Lemma, Sindhi Stemmer and Sentiment Analysis. Sindhi is one of the oldest languages of the world, with fifty-two alphabetical letters and room to adopt lexicons from several other languages. It is written, read and spoken all over the world. Sindhi is grammatically complex and morphologically rich. Its grammar differs from that of English and other languages, and even the meaning and sense of Sindhi lexicons are understood differently.
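
Tools such as a parser, lemmatizer or stemmer usually start by splitting Unicode text into tokens. The sketch below shows one simple way to do that for Sindhi text. It is not the Sindhi NLP toolkit's algorithm: the `tokenize` function and its rules are assumptions for illustration, and it only groups letters, combining marks and digits into words while emitting each punctuation character separately.

```python
import unicodedata

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    tokens, current = [], []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in "LMN":      # letters, combining marks (diacritics), digits
            current.append(ch)
        else:
            if current:               # a non-word character ends the current word
                tokens.append("".join(current))
                current = []
            if not ch.isspace():      # keep punctuation, drop whitespace
                tokens.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("سنڌي ٻولي قديم آهي."))
print(tokenize("Sindhi NLP, 2018."))
```

Keeping combining marks inside the word matters for Sindhi, because a vowel sign written on a letter would otherwise split the word in two. Format characters such as the zero-width non-joiner are treated as separators here, which a production tokenizer would likely handle differently.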


International Mother Language Day 2024: Linguistic Complexities in Pakistan and Global Impact

  Dr. Mazhar Ali Dootio International Mother Language Day, observed annually on February 21st, serves as a global platform to celebrate li...