Journal:Privacy preservation techniques in big data analytics: A survey
Full article title | Privacy preservation techniques in big data analytics: A survey |
---|---|
Journal | Journal of Big Data |
Author(s) | Rao, P. Ram Mohan; Krishna, S. Murali; Kumar, A.P. Siva |
Author affiliation(s) | MLR Institute of Technology, Sri Venkateswara College of Engineering, JNTU Anantapur |
Primary contact | Email: rammohan04 at gmail dot com |
Year published | 2018 |
Volume and issue | 5 |
Page(s) | 33 |
DOI | 10.1186/s40537-018-0141-8 |
ISSN | 2196-1115 |
Distribution license | Creative Commons Attribution 4.0 International |
Website | https://link.springer.com/article/10.1186/s40537-018-0141-8 |
Download | https://link.springer.com/content/pdf/10.1186%2Fs40537-018-0141-8.pdf (PDF) |
Abstract
Incredible amounts of data are being generated by organizations such as hospitals, banks, e-commerce sites, retailers, and supply chains by virtue of digital technology. Not only humans but also machines contribute to data streams, in the form of closed circuit television (CCTV) streaming, web site logs, etc. Social media and smart phones generate vast amounts of data every minute. The voluminous data generated from these sources can be processed and analyzed to support decision making. However, data analytics is prone to privacy violations. One application of data analytics is the recommendation system, which is widely used by e-commerce sites like Amazon and Flipkart to suggest products to customers based on their buying habits, and which can expose those customers to inference attacks. Although data analytics is useful in decision making, it can lead to serious privacy concerns. Hence, privacy-preserving data analytics has become very important. This paper examines various privacy threats, privacy preservation techniques, and models with their limitations. The authors then propose a data lake-based modernistic privacy preservation technique to handle privacy preservation in unstructured data.
Keywords: data, data analytics, privacy threats, privacy preservation
Introduction
The volume and variety of data are growing exponentially due to diverse applications of computers in all domain areas. This growth has been driven by the affordable availability of computer technology, storage, and network connectivity. Large-scale data, which also includes person-specific private and sensitive data like gender, zip code, disease, caste, shopping cart, religion, etc., is being stored in a variety of public and private domains. The data holder can then release this data to a third-party data analyst to gain deeper insights and identify hidden patterns that are useful in making important decisions that may help in improving businesses and providing value-added services to customers[1], as well as in activities such as prediction, forecasting, and recommendation.[2] One of the prominent applications of data analytics is the recommendation system, which is widely used by e-commerce sites like Amazon and Flipkart for suggesting products to customers based on their buying habits. Facebook does something similar by suggesting friends, places to visit, and even movies to watch based on our interests. However, releasing user activity data may lead to inference attacks, such as identifying gender based on user activity.[3] We have studied a number of privacy-preserving techniques that are being employed to protect against privacy threats. Each of these techniques has its own merits and demerits. This paper explores the merits and demerits of each of these techniques and also describes the research challenges in the area of privacy preservation. There is always a trade-off between data utility and privacy. This paper also proposes a data lake-based modernistic privacy preservation technique to handle privacy preservation in unstructured data with maximum data utility.
Privacy threats in data analytics
Privacy is the ability of an individual to determine what data can be shared and to control who can access it. Once data is held by a data holder, whether in the public or private domain, it becomes a potential threat to individual privacy. The data holder can be a social networking application, website, mobile app, e-commerce site, bank, hospital, etc. It is the responsibility of the data holder to ensure the privacy of the users' data. Apart from the data held in various domains, users may knowingly or unknowingly contribute to data leakage. For example, most mobile apps seek access to our contacts, files, camera, etc., and we often agree to all of their terms and conditions without reading the privacy statement, thereby contributing to data leakage.
Hence there is a need to educate smart phone users regarding privacy and privacy threats. Some of the key privacy threats include (1) surveillance, (2) disclosure, (3) discrimination, and (4) personal embarrassment and abuse.
Surveillance
Many businesses, such as retailers and e-commerce sites, study their customers' buying habits and try to come up with various offers and value-added services.[4] Based on opinion data and sentiment analysis, social media sites may recommend new friends, places to visit, people to follow, etc. This is possible only when they continuously monitor their customers' transactions. This is a serious privacy threat, as no individual welcomes surveillance.
Disclosure
Consider a hospital holding patients' data, which often contains identifying or revealing information such as zip code, gender, age, and disease.[5][6][7] The data holder, the hospital, releases the data to a third party for analysis after anonymizing sensitive personal information so that individuals cannot be identified. The third-party data analyst can map this information to freely available external data sources like census data and then identify the person suffering from a disorder. This is how the private information of a person can be disclosed, which is considered to be a serious privacy breach.
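The mapping step described above is often called a linkage attack. The following is a minimal sketch of it, assuming a release that keeps zip code, gender, and age as quasi-identifiers and a public source that still carries names. All records, names, and the `link` helper are hypothetical illustrations and are not taken from the article.

```python
# Hypothetical toy data; none of these records come from the article.
released = [  # hospital release: names removed, quasi-identifiers kept
    {"zip": "57677", "gender": "M", "age": 27, "disease": "cardiac problem"},
    {"zip": "57602", "gender": "F", "age": 41, "disease": "cancer"},
    {"zip": "57677", "gender": "F", "age": 27, "disease": "skin allergy"},
]
census = [  # public external source that still carries names
    {"name": "John", "zip": "57677", "gender": "M", "age": 27},
    {"name": "Mary", "zip": "57602", "gender": "F", "age": 41},
    {"name": "Alice", "zip": "57677", "gender": "F", "age": 27},
    {"name": "Bob", "zip": "57690", "gender": "M", "age": 55},
]
QUASI_IDENTIFIERS = ("zip", "gender", "age")


def link(released, external, keys=QUASI_IDENTIFIERS):
    """Return {name: disease} for released records matched by exactly one external record."""
    index = {}
    for person in external:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    disclosed = {}
    for record in released:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the person
            disclosed[names[0]] = record["disease"]
    return disclosed


reidentified = link(released, census)
print(reidentified)
```

In this toy release every quasi-identifier combination is unique, so removing names alone does not prevent disclosure.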
Discrimination
Discrimination is the bias or inequality that can arise when some private information of a person is disclosed. For instance, suppose statistical analysis of electoral results reveals that one community voted overwhelmingly against the party that went on to form the government. The government may then neglect that community or act with bias against it.
Personal embarrassment and abuse
Whenever a person's private information is disclosed, it can even lead to personal embarrassment or abuse. For example, suppose a person is privately taking medication for some specific problem and buys the medicine on a regular basis from a medical shop. As part of its regular business model, the medical shop may send reminders and offers related to the medicine over the phone. If another family member notices this, it may lead to personal embarrassment and even abuse.[8]
Data analytics activity creates data privacy issues. Many countries are enacting and enforcing privacy preservation laws. Yet something as simple as a lack of awareness can still leave users exposed to privacy attacks despite these mechanisms. For example, many smart phone users are not aware of the information that many apps steal from their phones. Previous research shows that only 17 percent of smart phone users are aware of such privacy threats.[9]
Privacy preservation methods
Many privacy preserving techniques have been developed, but most of them are based on anonymization of data. A list of privacy preservation techniques includes:
- K-anonymity
- L-diversity
- T-closeness
- Randomization
- Data distribution
- Cryptographic techniques
- Multidimensional sensitivity-based anonymization (MDSBA)
K-anonymity
Anonymization is the process of modifying data before it is given for data analytics[10], so that re-identification is not possible: any attempt to re-identify a person by mapping the anonymized data to external data sources leads to K indistinguishable records. K-anonymity[11] is prone to two attacks, namely the homogeneity attack and the background knowledge attack. Algorithms applied to ensure anonymization include Incognito[12] and Mondrian[13]. K-anonymity is applied on the patient data shown in Table 1, which shows the data before anonymization.
The K-anonymity algorithm is applied with a K value of 3 to ensure three indistinguishable records when an attempt is made to identify a particular person's data. K-anonymity is applied on the two attributes, viz., zip and age shown in Table 1. The result of applying anonymization on the zip and age attributes is shown in Table 2.
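As a rough illustration of generalizing the zip and age attributes with K = 3, the sketch below masks the last two zip digits and replaces age with a ten-year band, then checks that every combination of quasi-identifiers occurs at least K times. The records, the masking widths, and the helper names `generalize` and `is_k_anonymous` are assumptions made for this example; they do not reproduce the article's tables, nor the Incognito or Mondrian algorithms.

```python
from collections import Counter


def generalize(record):
    """Mask the last two zip digits and replace age with a 10-year band (assumed widths)."""
    low = (record["age"] // 10) * 10
    return {**record, "zip": record["zip"][:3] + "**", "age": f"{low}-{low + 9}"}


def is_k_anonymous(records, k, quasi_identifiers=("zip", "age")):
    """True if every quasi-identifier combination is shared by at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())


# Hypothetical patient records; not the article's Table 1.
patients = [
    {"zip": "57677", "age": 27, "disease": "cardiac problem"},
    {"zip": "57602", "age": 22, "disease": "cardiac problem"},
    {"zip": "57678", "age": 29, "disease": "cardiac problem"},
    {"zip": "57905", "age": 43, "disease": "skin allergy"},
    {"zip": "57909", "age": 47, "disease": "cancer"},
    {"zip": "57906", "age": 45, "disease": "cardiac problem"},
]
anonymized = [generalize(p) for p in patients]

print(is_k_anonymous(patients, 3))    # raw records are all unique
print(is_k_anonymous(anonymized, 3))  # two classes of three records each
```

Coarser generalization makes the check easier to satisfy but discards more information, which is the utility and privacy trade-off the article refers to.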
The above technique used generalization[14] to achieve anonymization. Suppose we know that John is 27 years old and lives in the 57677 zip code. We can then still conclude that John has a cardiac problem even after anonymization, as shown in Table 2. This is called a homogeneity attack. Similarly, if John is 36 years old and we know that John does not have cancer, then John must have a cardiac problem. This is called a background knowledge attack. K-anonymity[15][16] can be achieved by using either generalization or suppression. K-anonymity can be optimized if minimal generalization can be done without significant data loss.[17] Identity disclosure is the major privacy threat, and protection against it cannot be guaranteed by K-anonymity.[18] Personalized privacy is the most important aspect of individual privacy.[19]
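Both attacks can be sketched as one inference over an equivalence class: collect the sensitive values in the class the target falls into, then discard any values the attacker can rule out. The generalized records below and the `infer` helper are hypothetical and only mirror the shape of the John examples; they are not the article's Table 2.

```python
from collections import defaultdict

# Hypothetical 3-anonymous release; not the article's Table 2.
release = [
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "579**", "age": "30-39", "disease": "cancer"},
    {"zip": "579**", "age": "30-39", "disease": "cardiac problem"},
    {"zip": "579**", "age": "30-39", "disease": "cancer"},
]


def equivalence_classes(records, quasi_identifiers=("zip", "age")):
    """Group records that share the same generalized quasi-identifiers."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes


def infer(records, target_class, ruled_out=frozenset(), sensitive="disease"):
    """Sensitive values still possible for a person known to be in target_class."""
    members = equivalence_classes(records)[target_class]
    return {m[sensitive] for m in members} - set(ruled_out)


# Homogeneity attack: the whole class shares one sensitive value.
print(infer(release, ("576**", "20-29")))
# Background knowledge attack: the attacker knows the target does not have cancer.
print(infer(release, ("579**", "30-39"), ruled_out={"cancer"}))
```

Whenever `infer` returns a single value, the sensitive attribute is disclosed even though the release is K-anonymous.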
L-diversity
To address the homogeneity attack, another technique called L-diversity has been proposed. As per L-diversity, there must be L well-represented values for the sensitive attribute (disease) in each equivalence class.
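"Well represented" can be formalized in more than one way. The sketch below assumes the simplest reading, distinct L-diversity, in which each equivalence class must contain at least L different sensitive values. The records and the function name `distinct_l_diversity` are hypothetical.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers=("zip", "age"), sensitive="disease"):
    """Return the smallest number of distinct sensitive values in any equivalence class."""
    values = defaultdict(set)
    for r in records:
        values[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(v) for v in values.values())


# Hypothetical releases, both 3-anonymous on (zip, age).
homogeneous = [
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
]
diverse = [
    {"zip": "576**", "age": "20-29", "disease": "cardiac problem"},
    {"zip": "576**", "age": "20-29", "disease": "cancer"},
    {"zip": "576**", "age": "20-29", "disease": "skin allergy"},
]

print(distinct_l_diversity(homogeneous))  # only one value: open to the homogeneity attack
print(distinct_l_diversity(diverse))      # three distinct values in the class
```

A release satisfies distinct L-diversity for a chosen L when the returned value is at least L.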
Implementing L-diversity is not always possible because of the variety of data. L-diversity is also prone to the skewness attack: when the overall distribution of data is skewed into a few equivalence classes, protection against attribute disclosure cannot be ensured. For example, if all the records are distributed into only three equivalence classes, then the semantic closeness of the values in each class may lead to attribute disclosure. L-diversity may also lead to a similarity attack. From Table 3, it can be noticed that if we know that John is 27 years old and lives in the 57677 zip code, then John definitely falls under the low income group, because the salaries of all three persons in the 576** zip are low compared to others in the table. This is called a similarity attack.
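One way to see the similarity attack is to compare the salary range inside each equivalence class with the range across the whole release: a class can hold three distinct salaries and still reveal that its members are all low earners. The sketch below is a hypothetical illustration; the records, the `fraction` threshold of 0.5, and the helper names are assumptions and do not come from the article's Table 3.

```python
from collections import defaultdict

# Hypothetical release that is 3-diverse on salary; not the article's Table 3.
release = [
    {"zip": "576**", "age": "20-29", "salary": 3000},
    {"zip": "576**", "age": "20-29", "salary": 4000},
    {"zip": "576**", "age": "20-29", "salary": 5000},
    {"zip": "579**", "age": "40-49", "salary": 6000},
    {"zip": "579**", "age": "40-49", "salary": 11000},
    {"zip": "579**", "age": "40-49", "salary": 8000},
]


def salary_ranges(records, quasi_identifiers=("zip", "age")):
    """Map each equivalence class to the (min, max) salary it contains."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r["salary"])
    return {qi: (min(s), max(s)) for qi, s in groups.items()}


def narrow_classes(records, fraction=0.5):
    """Classes whose salary range spans at most `fraction` of the overall range."""
    salaries = [r["salary"] for r in records]
    overall = max(salaries) - min(salaries)
    return {qi for qi, (low, high) in salary_ranges(records).items()
            if high - low <= fraction * overall}


narrow = narrow_classes(release)
print(narrow)
```

Here the 576** class is flagged because its salaries, although distinct, all sit at the low end of the overall range.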
Abbreviations
- CCTV: closed circuit television
- MDSBA: multidimensional sensitivity-based anonymization
References
- ↑ Ducange, P.; Pecori, R.; Mezzina, P. (2018). "A glimpse on big data analytics in the framework of marketing strategies". Soft Computing 22 (1): 325–42. doi:10.1007/s00500-017-2536-4.
- ↑ Chauhan, A.; Kummamuru, K.; Toshniwal, D. (2017). "Prediction of places of visit using tweets". Knowledge and Information Systems 50 (1): 145–66. doi:10.1007/s10115-016-0936-x.
- ↑ Yang, D.; Qu, B.; Cudre-Mauroux, P. (2018). "Privacy-Preserving Social Media Data Publishing for Personalized Ranking-Based Recommendation". IEEE Transactions on Knowledge and Data Engineering. doi:10.1109/TKDE.2018.2840974.
- ↑ Liu, Y.; Guo, W.; Fan, C.-I. et al. (2018). "A Practical Privacy-Preserving Data Aggregation (3PDA) Scheme for Smart Grid". IEEE Transactions on Industrial Informatics. doi:10.1109/TII.2018.2809672.
- ↑ Duncan, G.T.; Fienberg, S.E.; Krishnan, R. et al. (2001). "Disclosure limitation methods and information loss for tabular data". In Doyle, P.; Lane, J.; Theeuwes, J. et al. Confidentiality, disclosure and data access: Theory and practical applications for statistical agencies. Elsevier. pp. 135–66. ISBN 9780444507617.
- ↑ Duncan, G.T.; Lambert, D. (1986). "Disclosure-Limited Data Dissemination". Journal of the American Statistical Association 81 (393): 10–18. doi:10.1080/01621459.1986.10478229.
- ↑ Lambert, D. (1993). "Measures of disclosure risk and harm". Journal of Official Statistics 9 (2): 313–31.
- ↑ Spiller, K.; Ball, K.; Bandara, A. et al. (2017). "Data Privacy: Users’ Thoughts on Quantified Self Personal Data". In Ajana, B. Self-Tracking. Palgrave Macmillan, Cham. pp. 111–24. doi:10.1007/978-3-319-65379-2_8. ISBN 9783319653792.
- ↑ Hettig, M.; Kiss, E.; Jassel, J.-F. et al. (2013). "Visualizing Risk by Example: Demonstrating Threats Arising From Android Apps". Symposium on Usable Privacy and Security (SOUPS) 2013: 1–2. https://cups.cs.cmu.edu/soups/2013/risk/paper.pdf.
- ↑ Iyengar, V.S. (2002). "Transforming data to satisfy privacy constraints". Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 279–88. doi:10.1145/775047.775089.
- ↑ Bayardo, R.J.; Agrawal, R. (2005). "Data privacy through optimal k-anonymization". Proceedings of the 21st International Conference on Data Engineering: 217–28. doi:10.1109/ICDE.2005.42.
- ↑ LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. (2005). "Incognito: Efficient full-domain K-anonymity". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 49–60. doi:10.1145/1066157.1066164.
- ↑ LeFevre, K.; DeWitt, D.J.; Ramakrishnan, R. (2006). "Mondrian: Multidimensional K-Anonymity". Proceedings of the 22nd International Conference on Data Engineering: 25. doi:10.1109/ICDE.2006.101.
- ↑ Samarati, P.; Sweeney, L. (1998). "Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression". Technical Report SRI-CSL-98-04. SRI International. http://www.csl.sri.com/papers/sritr-98-04/.
- ↑ Sweeney, L. (2002). "Achieving k-anonymity privacy protection using generalization and suppression". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5): 571–88. doi:10.1142/S021848850200165X.
- ↑ Sweeney, L. (2002). "K-Anonymity: A model for protecting privacy". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5): 557–70. doi:10.1142/S0218488502001648.
- ↑ Meyerson, A.; Williams, R. (2004). "On the complexity of optimal K-anonymity". Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems: 223–28. doi:10.1145/1055558.1055591.
- ↑ Xiao, X.; Tao, Y. (2006). "Personalized privacy preservation". Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data: 229–40. doi:10.1145/1142473.1142500.
Notes
This presentation is faithful to the original, with only a few minor changes to presentation. Grammar was cleaned up for smoother reading. In some cases important information was missing from the references, and that information was added. The citations for 10 and 11 were flipped because the original applied the citation to the title, which we don't do here.