C. R. Rashmi 1, C. P. Shantala 2 and T. R. Yashavanth 3. 1 Department of CSE, PG Student, CIT, Gubbi, Tumkur, Karnataka, India. 2 Department of CSE, Vice Principal & HOD, CIT, Gubbi, Tumkur, Karnataka, India. 3 Department of CNE, PG Student, VTU, Belgaum, Karnataka, India. e-mail: rashmicr46@gmail.com; shantala.cp@cittumkur.org; yashavanthtr@gmail.com

Abstract. This paper aims to develop a proof-of-concept system for monitoring the functioning of school classes. We envisage this system being used (a) as a supporting tool for monitoring, (b) to provide reliable quantitative data for effective evaluation of the system, and (c) at the primary, secondary and higher education levels. This is an ICT (Information & Communications Technology) based intervention to monitor the functioning of the school class. Currently, one aspect of effective functioning, the functioning of the classrooms expressed as quantitative data, is covered ineffectively. The proposed system aims to monitor each and every classroom of the school and provide daily, weekly and monthly reports to the Head-teacher of the school, the SDMCs (School Development and Monitoring Committees) and the BEOs (Block Education Officers), respectively.

Keywords: Speech processing, MFCC, Vector quantization.

1. Introduction

Sarva Shiksha Abhiyan (SSA) is the Government of India's first flagship programme for the achievement of Universalization of Elementary Education (UEE), launched in 2001. Community-based monitoring is one of the strengths of this programme. The community, through its representative institutions like Village Education Committees (VECs) and SDMCs (School Development and Monitoring Committees), has been entrusted with the primary responsibility of ensuring that the schools are functioning effectively. In order to monitor the functioning of a class, we use audio samples recorded from the school environment. Speech processing is performed on the collected audio samples to develop the software system.
Speech processing can be divided into five categories. Speech coding is the encoding of voice, for example the digitization of the voice signal into WAV or MP3 format. Speech recognition identifies what the speaker is saying; for example, text-processing software that recognizes speech and translates it into text, a method usually used for dictation. Speech enhancement, which amplifies the speaker's voice, is another category; for example, the voice in a song can be enhanced using filters, as audio players do. The next category is speech synthesis, which converts text to voice; such systems help people who cannot use their vocal cords, or serve in large companies as automatic phone answerers. Speaker recognition is another category, which has the capability of automatically recognizing who is speaking on the basis of information included in the speech waves. Because a speaker's voice can be used to verify their identity, several applications are possible, such as banking by telephone, telephone shopping, voice mail, database access services, remote access to computers and voice dialling. Speaker recognition has two main types: text-dependent and text-independent systems. A text-independent system recognizes the speaker without knowledge of any word in the database; singular characteristics of the speaker's voice are extracted by the system, which makes recognition possible without the speaker saying any precise word. A text-dependent system uses words or phrases that were previously recorded and stored in a database for speaker recognition; for example, the speaker says a PIN or his name as a password to open the door of his office [1].

Corresponding author. Elsevier Publications 2013.
Figure 1. Central system.

2. System Functioning

Microphones, each with a one-to-one association with a classroom, pick up the audio data of the classroom(s) and transmit it to the central system shown in figure 1. The central system works on this audio data and uses the in-built algorithm to take an appropriate decision on the classroom state. The classroom functioning will be categorized suitably [categorization depends on the needs of the concerned parties like the Head-Teacher, SDMC, BEO, etc.] and stored for future transmission. Periodically (once a day/week/month), the stored data will be summarized and sent as SMSs (Short Message Service).

3. Overview of Implementation Steps

1. Students will visit a few government schools to collect audio data: classroom audio recordings under different cases [interactive class, lecture-only class, no-teacher class, etc.].
2. This audio data will be used to develop a software system, which will analyze the noise levels to make proper estimations. The better step 1 is performed, the better the estimations can get.
3. The developed software system can be installed on an embedded PC.
4. A multi-audio-in interface can be built for the embedded PC by combining off-the-shelf components.
5. The necessary modem for SMS transmission will be connected.
6. The system can be deployed in a school and tested.

4. Software System

For the development of the software system to check the functioning of a class, we use the text-independent speech processing technique, since we are not concerned with the text of the speech samples; hence the text-dependent technique is not necessary. The collected speech samples are categorized into three types:

1. Interactive class
2. Teacher-only class
3. Noise

In our system we consider that if there is noise, then the class is not functioning. The above types are used for speech processing, which has mainly two techniques:

a. Feature extraction
b. Feature matching

Feature extraction is the process of extracting a small amount of data from the audio sample, which will be used further to represent each of the speech samples classified above. Feature matching is the process of comparing the extracted features with unknown speech samples in order to identify them. There is a wide range of possibilities for parametrically representing the speech samples, among them Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC) and others. MFCC is considered a well-known and popular method, and will be used in our system for feature extraction. Different feature matching techniques are Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM) and Vector Quantization (VQ). In this system, the VQ approach will be used.

4.1 Feature Extraction (MFCC)

In order to get better recognition performance, we should extract the best parametric representation of the acoustic signals. This phase should be efficient, and it is also important for the next phase since it may affect its behavior [2].
Figure 2. MFCC process [3]. Figure 3. Mel scale filter bank [2].

4.1.1 MFCC process

MFCC (Mel-Frequency Cepstrum Coefficients) is a method based on human hearing behavior, which cannot resolve frequencies above 1 kHz linearly. The coefficients are based on the differences of frequencies that the human ear can distinguish. The speech signal is expressed on the mel scale, a scale of pitches judged by listeners to be equally spaced. The mel scale uses a filter bank that is linearly spaced at frequencies below 1000 Hz and logarithmically spaced above 1000 Hz [1]. The MFCC process has a few steps, explained as follows. Figure 2 shows the complete process of MFCC.

4.1.1.1 Framing

In this step the continuous speech sample is segmented, or blocked, into frames. Frames consist of N samples, and adjacent frames are separated by M samples (M < N). Usually the values of M and N are 100 and 256 respectively [1].

4.1.1.2 Windowing

In order to minimize the spectral distortion, or signal discontinuities, at the beginning and end of each frame, windowing is applied to each frame [4]. For this step the Hamming window is used.

4.1.1.3 Fast Fourier transform (FFT)

The FFT is used to convert every frame of N samples from the time domain to the frequency domain. It converts the convolution of the glottal pulse and the vocal-tract impulse response in the time domain into a multiplication in the frequency domain [2].

4.1.1.4 Mel filters

The mel filter bank performs the operation of filtering an input power spectrum through a bank of mel filters [4]. Figure 3 shows a set of triangular filters that compute a weighted sum of the spectral components, which leads to an approximation of the mel scale. Every filter's magnitude frequency response is triangular in shape; it is equal to unity at the centre frequency and decreases linearly to zero at the centre frequencies of the two adjacent filters. Each filter's output is then the sum of its filtered spectral components [2].
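As a sketch of the filter-bank operation just described, the triangular weighting could be written in plain Python as below; the function names, the FFT size and the use of the standard mel mapping 2595 * log10(1 + f/700) to place the centre frequencies are our own assumptions for illustration, not the paper's MATLAB implementation.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filter_bank(num_filters, fft_size, sample_rate):
    """Build Q triangular filters with centre frequencies equally spaced
    on the mel scale; each filter is unity at its centre and falls
    linearly to zero at the centres of its two neighbouring filters."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (num_filters + 1)
                  for i in range(num_filters + 2)]
    bins = [int(fft_size * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, num_filters + 1):
        filt = [0.0] * (fft_size // 2 + 1)
        for k in range(bins[j - 1], bins[j]):          # rising edge
            if bins[j] != bins[j - 1]:
                filt[k] = (k - bins[j - 1]) / (bins[j] - bins[j - 1])
        for k in range(bins[j], bins[j + 1]):          # falling edge
            if bins[j + 1] != bins[j]:
                filt[k] = (bins[j + 1] - k) / (bins[j + 1] - bins[j])
        bank.append(filt)
    return bank

def apply_filter_bank(power_spectrum, bank):
    """Each filter output is the weighted sum of its spectral components."""
    return [sum(w * p for w, p in zip(filt, power_spectrum)) for filt in bank]
```

A usage example with the paper's 8000 Hz sampling rate would be `triangular_filter_bank(20, 256, 8000)`, applied to the 129-point power spectrum of each windowed frame.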
Finally, the following equation is used to compute the mel value for a given frequency f in Hz:

F(mel) = 2595 * log10(1 + f/700)    (1)

4.1.1.5 Logarithm

Compression is carried out in this step. The set of values generated by the mel filter bank is reduced by replacing each value with its logarithm [4].
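A minimal sketch of the mel conversion in equation (1), together with the logarithm (compression) step just described, assuming the filter-bank outputs arrive as a plain Python list; the small floor guarding against log(0) is our own addition.

```python
import math

def hz_to_mel(f):
    """Equation (1): F(mel) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def log_compress(filter_outputs, floor=1e-10):
    """Logarithm step: replace each mel filter-bank output by its
    logarithm, with a small floor to avoid log(0)."""
    return [math.log(max(x, floor)) for x in filter_outputs]
```

For example, `hz_to_mel(1000)` is approximately 1000 mel, reflecting the roughly linear behavior of the scale below 1 kHz.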
4.1.1.6 Discrete cosine transform (DCT)

In this step, the log mel spectrum is converted back into the time domain using the DCT. The output of this step is the set of Mel Frequency Cepstrum Coefficients (MFCCs). MFCCs are time-domain coefficients, and each such set is called an acoustic vector. Each input utterance is thus transformed into a sequence of acoustic vectors [2]. The above steps extract the best parametric representation of the acoustic signals. These sets of coefficients are used at a later stage.

4.2 Feature Matching (VQ)

We decided to use Vector Quantization (VQ) for feature matching.

4.2.1 Vector quantization (VQ)

VQ is the process of mapping a large number of vectors to a finite number of regions. Every region is called a cluster, and it is represented by its center. The center is called a codeword, and the collection of codewords is called a codebook [5]. In 1980, Linde, Buzo and Gray (LBG) proposed a VQ design algorithm based on a training sequence. The LBG algorithm is as follows:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors.
2. Double the size of the codebook by splitting each current codeword y_n according to the rule

y_n+ = y_n (1 + ε)    (2)
y_n- = y_n (1 - ε)    (3)

where n varies from 1 to the current size of the codebook, and ε is a splitting parameter.
3. Nearest-neighbour search: for each training vector, find the codeword in the current codebook that is closest, and assign that vector to the corresponding cell.
4. Update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Repeat steps 2, 3 and 4 until a codebook of size M is designed.

This VQ algorithm gives a fixed-size codebook of size Q x T. Here Q is the number of mel filters, which is fixed, and T is any number satisfying the condition

T = 2^i, i = 1, 2, 3, ...

If we follow the above steps, the codebook will be created [6].

5. Methodology
We use MATLAB for the implementation. There are two main stages: the training phase and the testing phase. The requirements of the training phase are shown in table 1. In the testing phase, audio samples are taken at random and tested. The requirements of the testing phase are similar to those of the training phase, but it computes the minimum (Euclidean) distance and the minimum distortion. Based on the minimum distortion, by comparing the random audio sample against the codebooks, the system classifies it into one of the three cases. If the random audio sample matches noise, then the class is not functioning; otherwise the class is functioning.

Table 1. Training requirements.

Sl. No. | Process            | Description
1       | Data               | Audio samples: 1. Interactive class, 2. Teacher-only class, 3. Noise
2       | Sampling frequency | 8000 Hz
3       | Audio format       | .wav
4       | Number of cases    | 15 in each case, taken in a school environment
5       | Duration           | Each sample of 20 s
6       | Feature extraction | Uses MFCC and VQ to create the codebook
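The two phases could be sketched in plain Python as follows (the paper's actual implementation is in MATLAB): the LBG routine builds one codebook per class from its training vectors, and the testing routine classifies a sample by minimum average distortion against the three codebooks. The vector representation, function names and the ε/threshold defaults are illustrative assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two acoustic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(len(vectors[0]))]

def lbg(training, M, eps=0.01, threshold=1e-6):
    """LBG codebook design: split-and-refine until M codewords (M = 2^i)."""
    codebook = [centroid(training)]
    while len(codebook) < M:
        # step 2: split each codeword into y(1 + eps) and y(1 - eps)
        codebook = [[x * (1 + s) for x in y] for y in codebook for s in (eps, -eps)]
        prev = float("inf")
        while True:
            # step 3: nearest-neighbour assignment of training vectors to cells
            cells = [[] for _ in codebook]
            total = 0.0
            for v in training:
                i = min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))
                cells[i].append(v)
                total += dist(v, codebook[i])
            # step 4: move each codeword to the centroid of its cell
            codebook = [centroid(c) if c else codebook[i] for i, c in enumerate(cells)]
            avg = total / len(training)
            if prev - avg < threshold:  # step 5: average distance has settled
                break
            prev = avg
    return codebook

def distortion(sample, codebook):
    """Average distance from each vector of the sample to its nearest codeword."""
    return sum(min(dist(v, c) for c in codebook) for v in sample) / len(sample)

def classify(sample, codebooks):
    """Testing phase: pick the class whose codebook gives minimum distortion.
    A 'noise' match means the class is not functioning."""
    label = min(codebooks, key=lambda name: distortion(sample, codebooks[name]))
    return label, label != "noise"
```

In use, `lbg` would be run once per class on the MFCC vectors of its training samples, and `classify` would receive the MFCC vectors of a random test sample together with the three stored codebooks.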
Table 2. Performance measure for classification among all cases.

Number of audio samples in training phase | Number of audio samples in testing phase | Efficiency
15 | 6  | 66%
18 | 15 | 70%
45 | 20 | 80%

6. Results

For experimentation purposes we have used different numbers of audio samples and obtained the performance shown in table 2. As the training data is increased, the efficiency also increases. If a random audio sample used in the testing phase is classified as an interactive or teacher-only class, then we say that the class is functioning; otherwise the class is not functioning. The result also depends on the noise levels, because noise may be present even in a teacher-only class, which reduces efficiency.

7. Conclusion

This paper provides a proof-of-concept system for monitoring the functioning of a classroom. For real-time data samples we have obtained 80% efficiency. The efficiency can be improved by increasing the training data, and different methods like Linear Prediction Coding (LPC), Hidden Markov Models (HMM) and Artificial Neural Networks (ANN) can also be tried. This system can be enhanced in the future by deploying it in a school, embedding the software system in a PC with the necessary components and a modem for SMS transmission.

References

[1] Jorge Martinez, Hector Perez, Enrique Escamilla and Masahisa Mabo Suzuki, Speaker recognition using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) techniques, pp. 248-251, 2012.
[2] Lindasalwa Muda, Mumtaj Begam and I. Elamvazuthi, Voice recognition algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) techniques, vol. 2, issue 3, ISSN 2151-9617, pp. 138-143, March 2010.
[3] B. G. Nagaraja and H. S. Jayanna, Mono and cross lingual speaker identification with the constraint of limited data, pp. 439-443, March 21-23, 2012.
[4] Ali Zulfiqar, Aslam Muhammad and A. M. Martinez Enriquez, A speaker identification system using MFCC features with VQ technique, pp. 115-118, 2009.
[5] Fatma Zohra Chelali and Amar Djeradi, MFCC and vector quantization for Arabic fricatives speech/speaker recognition, 2012.
[6] Amruta Anantrao Malode and Shashikant Sahare, Advanced speaker recognition, vol. 4, issue 1, pp. 443-455, July 2012.