Auto Halal Detection Products Based on Euclidian Distance and Cosine Similarity

Although Indonesia is the world's most populous Muslim-majority country, only 20% of the products on the Indonesian market are halal-certified. Because halal certification is voluntary, many food products are halal but are not certified as such. In principle, these food products may share halal ingredients with halal-certified products. In this study, we build a system that compares products that have not been certified halal with halal-certified products based on their ingredients. The food products are collected from Open Food Facts, the Institute for Foods, Drugs, and Cosmetics, Indonesian Council of Ulama (LPPOM MUI), and our halal system. At the time of writing, the halal-certified products are obtained from LPPOM MUI. The system uses Euclidean distance and cosine similarity to generate the top five similar products. Both similarity calculations are based on the Term Frequency-Inverse Entity Frequency (TF-IEF) weighting function, which calculates the frequency of a term in a product name and its ingredients. If the similarity between a product with no halal certification and a halal-certified product is higher than 75%, the former can be indicated as a halal product. The system can thus give a recommendation for unknown products from a related pool of halal-certified products based on the similarity of product composition. Cosine similarity is more accurate than both Euclidean distance and MoreLikeThis. Cosine similarity achieves the highest precision because it is based on the angle between the term vectors of the products.


I. INTRODUCTION
According to the 2010 population census, Indonesia had 207,176,162 Muslims, or 87.2% of its population [1]-[3]. To serve the needs of Muslims, the Institute for Foods, Drugs, and Cosmetics, Indonesian Council of Ulama (LPPOM MUI) acts as the authorised institution supervising halal food products in Indonesia. LPPOM MUI was established by the Indonesian Ulama Council and has the authority to conduct audits and issue halal certificates [4] to industries such as food, medicine, and cosmetics, as well as to slaughterhouses, restaurants, catering services, and kitchens [5].
According to the LPPOM MUI report [6], the number of halal-certified products is increasing. However, halal-certified products account for only 20% of the products on the Indonesian market, because companies are not required to apply for halal certificates for their products [7]. A product that is not halal-certified cannot be assumed to be haram, since its composition may be similar to that of a halal-certified product. Therefore, we build a system that compares products that have not been certified against similar halal-certified products based on the composition of their ingredients.
To date, there are over 200 halal certification organisations in the world, but Muslims may still find it challenging to identify halal products, especially in non-Muslim countries [8]. Our system allows users to survey, check, and verify food products based on their ingredients in comparison to halal-certified products. An important limitation of this work is that we focus only on the components of a product; our system does not take food processing into account. Open Food Facts (https://world.openfoodfacts.org/), a crowdsourced food database, labels some of its products as halal. However, this label is unreliable because it does not come from a halal certification organisation, and anyone is free to claim a halal status for a product without supervision.
Our system computes the Euclidean distance and cosine similarity between two products, using the composition of the products as the basis for assessing whether or not a product is halal. For finding similar products, cosine similarity is expected to improve accuracy [9]. Previous studies have used cosine similarity and Euclidean distance to group vector data reduced with Principal Component Analysis (PCA), since mean correction of the original data may alter the location of a document [10]. Another study [11] improved the accuracy of classifying the authenticity of tea samples using Euclidean distance.
Our main contributions are as follows:
• Proposing two similarity methods to detect the halal status of products with no halal certification
• Providing a system that displays the top five similar halal-certified products for the product in question
This paper is structured as follows: Section II describes the materials and methods, including our existing system and the two similarity methods. Section III presents the results and discussion, and Section IV concludes the paper.

II. MATERIAL AND METHOD
Fig. 1 gives an overview of our system. Every product is compared with every other product. If the similarity between a product with no halal certification and a halal-certified product is higher than 75%, the former can be indicated as a halal product.
Our methodology consists of four steps: data collection, data indexing, TF-IEF term weighting, and the calculation of Euclidean distance and cosine similarity.
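The decision rule described above (rank the halal-certified products by similarity, keep the top five, and flag the uncertified product when a score exceeds 75%) can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `recommend`, the product names, and the scores are all hypothetical, and similarity scores are assumed to lie in [0, 1].

```python
from typing import Dict, List, Tuple


def recommend(
    scores: Dict[str, float], threshold: float = 0.75, top_k: int = 5
) -> Tuple[List[Tuple[str, float]], bool]:
    """Rank certified products by similarity and apply the 75% rule.

    scores maps the name of each halal-certified product to its similarity
    with the uncertified product under inspection (assumed to be in [0, 1]).
    """
    # Sort by similarity, highest first, and keep only the top_k entries.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    # The product is indicated as halal only if a score is strictly above the threshold.
    indicated_halal = any(score > threshold for _, score in ranked)
    return ranked, indicated_halal


# Hypothetical similarity scores against six certified products.
example_scores = {
    "certified noodle A": 0.81,
    "certified noodle B": 0.64,
    "certified noodle C": 0.58,
    "certified snack D": 0.12,
    "certified drink E": 0.05,
    "certified sauce F": 0.33,
}
top_five, indicated = recommend(example_scores)
```

Here `top_five` holds five (name, score) pairs in descending order and `indicated` is true because one score exceeds 0.75.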

A. Data collection
Our data were obtained from three sources: 1) crowdsourcing through the product input feature of our website, http://halal.addi.is.its.ac.id; 2) web scraping of the LPPOM MUI website, http://www.halalmui.org/; and 3) Open Food Facts (https://world.openfoodfacts.org/). http://www.halalmui.org/ provides only the product name, the manufacturer name, and the expiry date of the halal certificate. https://world.openfoodfacts.org/ is an initiative to establish a collaborative, free, and open database of food products from around the world. Open Food Facts provides information about food products, including the product name, manufacturer, ingredients, and nutrition.
In 2016, the Halal Nutrition Food framework (http://halal.addi.is.its.ac.id) was developed by Fatawi and Rakhmawati [14]. The system lets users enter halal products, search for halal products, and find more information by exploiting Linked Data technologies [15].

B. Data Indexing
The collected data were then indexed with Apache Lucene (https://lucene.apache.org/), which stores the terms of the labels and ingredients of each document in the Lucene index. Before the product data were indexed, they were pre-processed to speed up the weighting and the calculation of similarity between products. The initial pre-processing step was tokenisation, which cuts a sentence into its constituent words, called tokens, based on spaces and punctuation. These tokens were then indexed into Apache Lucene. Examples of the indexing of SampleP Mie Goreng and SampleQ Rasa Soto Ayam can be seen in Fig. 2 and Fig. 3.
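The tokenisation step can be illustrated with a short sketch that splits an ingredient string on spaces and punctuation. It is not the Lucene analyzer used by the system; the function name `tokenise`, the lowercasing step, and the sample ingredient string are assumptions made for illustration.

```python
import re
from typing import List


def tokenise(text: str) -> List[str]:
    """Split text into tokens on any run of spaces or punctuation."""
    # Lowercasing is an assumption; it makes "Salt" and "salt" the same term.
    pieces = re.split(r"[^\w]+", text.lower())
    # re.split can yield empty strings at the edges, so drop them.
    return [piece for piece in pieces if piece]


# Hypothetical ingredient list for illustration.
tokens = tokenise("Wheat flour, palm oil, salt (iodised).")
# tokens == ['wheat', 'flour', 'palm', 'oil', 'salt', 'iodised']
```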

C. TF-IEF Weighting
Delbru presents a weighting scheme for Linked Data search called Term Frequency-Inverse Entity Frequency, or TF-IEF [16]. This weighting measures the importance of a term across the entities in the collection. In our case, a product together with its ingredients is treated as an entity. For instance, if $p_i$ is an ingredient in product $p$, the weight of $p_i$ in product $p$ takes the form

$$w_{p,i} = \frac{f_{p_i,p}}{|p|} \times \log\frac{N}{n_{p_i}}$$

where $f_{p_i,p}$ is the number of occurrences of the ingredient $p_i$ in $p$, $|p|$ specifies the number of ingredients in the product $p$, $N$ is the number of products in the dataset, and $n_{p_i}$ is the number of products that contain $p_i$. The inverse entity frequency is written here in its common logarithmic form.
Each product, along with its ingredients, was weighted using TF-IEF [16]. The TF-IEF weighting compares the composition of each product with the compositions of the other stored products. Apache Lucene reads the previously built index before calculating the term weights.
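The weighting step can be sketched in a few lines. The sketch assumes that a term's weight is its relative frequency within the product multiplied by log(N / n), where N is the number of products and n is the number of products containing the term; the exact inverse-entity-frequency form in [16] may differ (for example by smoothing). The function name `tf_ief` and the two toy products are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_ief(products: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Return, for every product, a mapping from term to TF-IEF weight."""
    n_products = len(products)
    # Count in how many products each term appears (entity frequency).
    entity_freq: Counter = Counter()
    for tokens in products.values():
        entity_freq.update(set(tokens))

    weights: Dict[str, Dict[str, float]] = {}
    for name, tokens in products.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[name] = {
            # Term frequency times the (assumed) log inverse entity frequency.
            term: (count / total) * math.log(n_products / entity_freq[term])
            for term, count in counts.items()
        }
    return weights


# Two hypothetical tokenised products.
toy = {
    "P": ["flour", "salt", "oil", "salt"],
    "Q": ["flour", "chicken"],
}
toy_weights = tf_ief(toy)
```

In this toy example "flour" occurs in both products, so its weight is zero, while "salt" occurs only in P and receives a positive weight. Terms shared by every product therefore do not contribute to the similarity.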

D. Calculation of Euclidean Distance and Cosine Similarity
Euclidean distance is a distance measure and is the one most commonly used in cluster analysis to measure the distance from a data object to a cluster centre. It is the geometric distance between two data objects. The smaller the distance between two objects, the more similar they are [12]. The similarity derived from Euclidean distance ranges from 0 to 1. A value of zero (0) means that the two products bear no resemblance, while a value of one (1) means that the two products are identical.
Given $p$ as the first product and $q$ as the second product, the Euclidean distance formula for finding the similarity between products $p$ and $q$ can be defined as follows:

$$sim_{E}(p,q) = \frac{1}{1 + dist(p,q)}$$

where $dist(p,q)$ can be formulated as follows:

$$dist(p,q) = \sqrt{\sum_{i=1}^{n} \left(w_{p,i} - w_{q,i}\right)^{2}}$$

Cosine similarity is a method used to calculate the degree of similarity between two objects. In general, the calculation is based on the vector space similarity measure. The similarity between two objects is expressed through two vectors that use the keywords of a document as their components [13]. The vector representation is used to facilitate calculations on long documents. Like the Euclidean similarity, cosine similarity ranges from 0 to 1.
Given $p$ as the first product and $q$ as the second product, the cosine similarity formula for finding the similarity between two products can be defined as follows:

$$sim_{C}(p,q) = \frac{\sum_{i=1}^{n} w_{p,i}\, w_{q,i}}{\sqrt{\sum_{i=1}^{n} w_{p,i}^{2}}\; \sqrt{\sum_{i=1}^{n} w_{q,i}^{2}}}$$

where $w_{p,i}$ and $w_{q,i}$, used in both formulas, are the weights of the $i$-th ingredient in products $p$ and $q$, and $n$ is the number of distinct ingredient terms.
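Both measures can be sketched over sparse weight vectors stored as dictionaries. The conversion of distance to similarity as 1 / (1 + dist) is an assumption chosen to match the stated range, in which identical products score 1; the function names and the sample vectors are hypothetical.

```python
import math
from typing import Dict

Vector = Dict[str, float]


def euclidean_similarity(wp: Vector, wq: Vector) -> float:
    """Similarity derived from Euclidean distance (assumed form 1 / (1 + dist))."""
    terms = set(wp) | set(wq)
    # A term missing from one product contributes a weight of zero.
    dist = math.sqrt(sum((wp.get(t, 0.0) - wq.get(t, 0.0)) ** 2 for t in terms))
    return 1.0 / (1.0 + dist)


def cosine_similarity(wp: Vector, wq: Vector) -> float:
    """Cosine of the angle between two weight vectors."""
    dot = sum(weight * wq.get(term, 0.0) for term, weight in wp.items())
    norm_p = math.sqrt(sum(weight * weight for weight in wp.values()))
    norm_q = math.sqrt(sum(weight * weight for weight in wq.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # An all-zero vector has no direction to compare.
    return dot / (norm_p * norm_q)


# Hypothetical weights: q has the same composition as p, with doubled weights.
p = {"flour": 3.0, "salt": 4.0}
q = {"flour": 6.0, "salt": 8.0}
sim_e = euclidean_similarity(p, q)  # distance is 5, so the score is 1/6
sim_c = cosine_similarity(p, q)     # same direction, so the score is 1
```

The example shows how the two measures differ: the vectors point in the same direction, so the cosine score is 1, whereas the Euclidean score is penalised by the difference in magnitude.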
The Euclidean distance calculation takes into account the distance between two points in Euclidean space. First, the TF-IEF weights of each document were calculated so that the document could be represented as a point in Euclidean space. Table III shows the steps in calculating the Euclidean distance between SampleP Mie Goreng (p) and SampleQ Rasa Soto Ayam (q). The similarity score between SampleP Mie Goreng and SampleQ Rasa Soto Ayam is then obtained from the similarity formula for Euclidean distance.
The calculation of cosine similarity also uses the TF-IEF term weights. Table IV shows the steps in calculating the cosine similarity between SampleP Mie Goreng (p) and SampleQ Rasa Soto Ayam (q).

III. RESULTS AND DISCUSSION
Based on testing of Euclidean distance and cosine similarity in the Halal Nutrition Food application, an example search result is shown in Fig. 4.
Each search result displays a list of other products related to the product being searched. The products related to SampleP Mie Goreng are listed in Table V and Table VI. Based on the product lists in Tables V and VI, it can be seen that the related products are instant noodle products. Therefore, the results of using Euclidean distance and cosine similarity on SampleP Mie Goreng are relevant.
Meanwhile, Fig. 5 displays the notification shown for a product (Chocolate Sample) that does not have a halal certificate, does not contain additives in the haram category, and has a similarity of more than 75% based on Euclidean distance or cosine similarity.
To test the relevance of the related products shown, the precision of the Euclidean distance and cosine similarity algorithms was measured. In addition, we examined the relevance of the related products returned by MoreLikeThis from Apache Lucene. Precision results for several products can be seen in Table VII. Table VII shows that the precision of related-product searches using Euclidean distance, cosine similarity, and MoreLikeThis is 72%, 84%, and 80%, respectively. Euclidean distance yields the lowest precision because it measures the distance between two points in Euclidean space and is therefore influenced by the magnitude of the term weights. Cosine similarity achieves the highest precision because it is based on the angle between the term vectors of the documents. Further research requires more samples and evaluations to investigate the reliability and reproducibility of the results. Before weighting the terms, it is also suggested to pre-process the documents with a string similarity algorithm, such as Levenshtein [17] or Jaccard [18], to guard against typing errors.
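Precision for a top-five list is the fraction of returned products that are judged relevant. The sketch below shows that standard calculation; it is not the evaluation script behind Table VII, and the function name `precision_at_k`, the product names, and the relevance judgements are hypothetical.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the first k retrieved products that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0  # Nothing was retrieved, so no result can be relevant.
    hits = sum(1 for product in top if product in relevant)
    return hits / len(top)


# Hypothetical query result: four of the five returned products are relevant.
returned = ["noodle A", "noodle B", "snack C", "noodle D", "noodle E"]
judged_relevant = {"noodle A", "noodle B", "noodle D", "noodle E"}
score = precision_at_k(returned, judged_relevant)  # 4 / 5 = 0.8
```

Averaging this score over all test queries gives a single precision figure per method, which is one way the three methods could be compared.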

IV. CONCLUSION
A related-product search system was developed for the Halal Nutrition Food framework. The related-product search feature displays the top five related products on the product detail page. Cosine similarity outperforms both Euclidean distance and MoreLikeThis from Apache Lucene and is therefore recommended for further use in the product search system. It is worth noting that the findings of this work do not replace the halal certification process, which includes document and site audits covering all aspects of food processing, from ingredients to the processing line and logistics. Nevertheless, the system offers users an opportunity to survey, check, and verify food products based on their ingredients in comparison to halal-certified products. In the future, the system could be further refined to include the standard requirements for halal certification.

Fig. 5. Notification for a product (chocolate sample) that does not have a halal certificate, does not contain additives in the haram category, and has a similarity of more than 75% based on Euclidean distance or cosine similarity.