    • A Case Study on Apache HBase

      Nalla, Rohit Reddy; Sengupta, Sam (Adviser); Novillo, Jorge (Reviewer); Rezk, Mohamed (Reviewer) (2015-05-16)
      Apache HBase is an open-source, non-relational, distributed database built on top of HDFS (the Hadoop Distributed File System). HBase is modeled after Google's Bigtable, is written in Java, and was developed as part of Apache's Hadoop project. It provides a fault-tolerant way of storing small amounts of sparse data caught within large amounts of empty data, and it is used when real-time read/write access to very large datasets is required. The HBase project was started at the end of 2006 by Chad Walters and Jim Kellerman at Powerset [2]. The main purpose of HBase is to process large amounts of data. Mike Cafarella initially worked on the code of the working system, and Jim Kellerman later carried it to the next stage. HBase was first released as part of Hadoop 0.15.0 in October 2007 [2]. The project's goal was to hold very large tables, on the order of billions of rows by millions of columns. In May 2010, HBase graduated to become an Apache Top-Level Project. Companies such as Adobe, Twitter, Yahoo, and Trend Micro use this database, and social networking sites like Facebook have implemented their messenger applications using HBase. This document explains how HBase works and how it differs from other databases. It highlights current challenges in data security, and a couple of models are proposed for security and levels of data access to overcome those challenges. It also discusses workload challenges and techniques to overcome them, and gives an overview of how HBase has been implemented in a real-time application, the Facebook Messenger app.
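      The sparse storage model the abstract describes (small amounts of data caught within large amounts of empty cells, with versioned values) can be sketched as a nested map from row key to column to timestamped versions. This is only an illustrative model of the Bigtable-style layout; the class and method names below are invented for the sketch and are not HBase's actual Java API.

```python
import time
from collections import defaultdict

class SparseTable:
    """Toy model of HBase's data layout: a sparse map of
    row key -> "family:qualifier" column -> {timestamp: value}.
    Cells that were never written cost no storage at all."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        # Each write is a new timestamped version of the cell.
        ts = ts if ts is not None else time.time_ns()
        self.rows[row][column][ts] = value

    def get(self, row, column):
        # Reads return the newest version by default.
        versions = self.rows[row][column]
        if not versions:
            return None
        return versions[max(versions)]

table = SparseTable()
table.put("user#42", "info:name", "Alice", ts=1)
table.put("user#42", "info:name", "Alice B.", ts=2)  # newer version wins
print(table.get("user#42", "info:name"))
```

      Only written cells occupy memory, which is why a table of billions of rows by millions of columns stays tractable when most cells are empty.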
    • Data Mining: Privacy Preservation in Data Mining Using Perturbation Techniques

      Patel, Nikunjkumar; Sengupta, Sam (Adviser); Andriamanalimanana, Bruno (Reviewer); Novillo, Jorge (Reviewer) (2015-05-06)
      In recent years, data mining has become an important player in determining future business strategies. Data mining helps identify patterns and trends in large amounts of data, which can be used to reduce costs, increase revenue, and more. With the increased use of various data mining technologies and larger storage devices, the amount of data collected and stored has grown significantly. This data contains personal information such as credit card details and contact and residential information. All of these reasons have made it essential to concentrate on the privacy of the data. To alleviate privacy concerns, a number of techniques have recently been proposed to perform data mining in a privacy-preserving way. This project surveys various data mining models and explains perturbation techniques in detail. The main objective of this project is twofold: first, to preserve the accuracy of the data mining models, and second, to preserve the privacy of the original data. The discussion of transformation-invariant data mining models shows that multiplicative perturbations can theoretically guarantee zero loss of accuracy for a number of models.
    • High Performance Distributed Big File Cloud Storage

      Shakelli, Anusha; Sengupta, Sam (Adviser); White, Joshua (Reviewer) (2016-05-01)
      Cloud storage services are growing at a fast rate and are emerging in the data storage field. People use these services to back up data and to share files through social networks such as Facebook [3] and Zing Me [2]. Users can upload data from a computer, mobile phone, or tablet, and also download files and share them with others; as a result, the system load in cloud storage becomes huge. Cloud storage has become a crucial requirement for many enterprises because of features such as cost savings, performance, security, and flexibility. Designing an efficient storage engine for cloud-based systems means dealing with requirements such as big-file processing, lightweight metadata, deduplication, and high scalability. Here we propose a big file cloud architecture to handle these problems: a scalable, distributed cloud data store that supports big files up to several terabytes in size. Because the system load in cloud storage is usually heavy, data deduplication is used to reduce the storage space wasted by storing the same static data from different users. A common method used in cloud storage to address these problems is to divide a big file into small blocks, store the blocks on disk, and manage them through a metadata system [1], [6], [19], [20]. Current cloud storage services have complex metadata systems, so the space complexity of the metadata system is O(n), which is not scalable for big files. In this research, a new big file cloud storage architecture and a better solution for reducing the space complexity of the metadata are suggested.
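      The block-splitting and deduplication scheme described above can be sketched in a few lines: split a file into fixed-size blocks, key each block by its content hash so identical blocks are stored once, and keep only the ordered hash list as per-file metadata. The names and the tiny block size are illustrative, not the architecture proposed in the research.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use megabyte-scale blocks

def store(data, block_store):
    """Split a file into fixed-size blocks, keep each unique block once
    (keyed by its SHA-256 digest), and return the file's lightweight
    metadata: the ordered list of block hashes needed to reassemble it."""
    metadata = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # dedup: store each block once
        metadata.append(digest)
    return metadata

def load(metadata, block_store):
    """Reassemble a file from its metadata."""
    return b"".join(block_store[d] for d in metadata)

blocks = {}
meta_a = store(b"AAAABBBBAAAA", blocks)  # the repeated "AAAA" is stored once
meta_b = store(b"AAAACCCC", blocks)      # shares its "AAAA" block with file A
print(len(blocks))  # 3 unique blocks instead of 5
```

      Two files totaling five blocks occupy only three stored blocks, while each file's metadata stays a flat list proportional to its block count.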
    • Representational State Transfer as a Web Service

      Desai, Dhruv; Sengupta, Sam (Adviser); Novillo, Jorge (Reviewer); Andriamanalimanana, Bruno (Reviewer) (2015-12-01)
      This report is a study of the Representational State Transfer (REST) architectural style and its usefulness for implementing web services. The report highlights the differences between perceiving REST as an architectural style and perceiving it as a web service. It also discusses web services in general and highlights important differences between web services implemented in different programming languages. The goal of this report is to clarify that REST is an architectural style, one that has proved a popular choice for implementing web services, rather than a web service itself, and to compare web services based on their performance in a Java application.
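      The style-versus-service distinction turns on REST's uniform interface: the HTTP verb names the operation and the URI names the resource, with no per-service method vocabulary. The dispatcher below is a minimal sketch of that constraint; the function and resource names are invented for illustration and do not come from the report.

```python
# Minimal sketch of REST's uniform interface over one resource collection.
resources = {}
next_id = 1

def handle(method, path, body=None):
    """Dispatch a (method, path) pair the way a RESTful service would:
    the verb carries the operation, the URI identifies the resource."""
    global next_id
    parts = path.strip("/").split("/")
    if method == "POST" and len(parts) == 1:      # POST /books -> create
        rid = str(next_id)
        next_id += 1
        resources[rid] = body
        return 201, rid
    rid = parts[1]
    if method == "GET":                           # GET /books/1 -> read
        return (200, resources[rid]) if rid in resources else (404, None)
    if method == "PUT":                           # PUT /books/1 -> replace
        resources[rid] = body
        return 200, body
    if method == "DELETE":                        # DELETE /books/1 -> remove
        return 204, resources.pop(rid, None)
    return 405, None

status, rid = handle("POST", "/books", {"title": "RESTful Web Services"})
print(handle("GET", f"/books/{rid}"))
```

      Because every interaction is stateless and self-describing, any client that knows HTTP can use the service without a service-specific interface definition, which is the property that distinguishes the style from any one service built in it.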
    • Social Media Emoji Analysis, Correlations and Trust Modeling

      Preisendorfer, Matthew; Sengupta, Sam (Adviser); White, Joshua (Adviser); Tekeoglu, Ali (Adviser) (2018-01-18)
      Twitter is an ever-growing social-media platform where users post tweets, or short messages, for all of their followers to see and react to. This is old news, of course, as the platform first launched over ten years ago. Currently, Twitter handles approximately six thousand new tweets every second, so there is plenty of data to be analyzed. With a character limit of 140 per tweet, emojis are commonly used to express feelings without spending the extra characters a fuller explanation would require. This is helpful in identifying the mood or state of mind a person may have been in when writing a tweet, and from a computing standpoint it makes mood analysis much easier: rather than analyzing a group of words and predicting moods from keywords, we can analyze one or more emojis and match them to commonly expressed emotions and feelings. The objective of this research is to gather large amounts of Twitter data and analyze the emojis used in order to find correlations in societal interactions and to study how current events may drive social media interactions and behaviors. By creating a topic model for each user and comparing it with the emoji distribution analysis, a trust "fingerprint" can be created to measure the authenticity or genuineness of a given user or group of users. The emoji distribution analysis also opens the possibility of demographic predictions. The analysis is not limited to Twitter, of course, but Twitter is used here because its API is free and generally easy to use. This paper aims to demonstrate the validity of emoji analysis as a method of user identification and to show how the resulting trust models can be used in conjunction with pre-existing models to improve those models' success rates.
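      One way to realize the emoji-distribution "fingerprint" idea is to normalize each user's emoji frequencies into a vector and compare users by cosine similarity, where values near 1.0 suggest similar expressive habits. This is a hedged sketch of that idea only: the function names, the tiny hard-coded emoji set, and the similarity threshold interpretation are assumptions for illustration, not the paper's method.

```python
import math
from collections import Counter

EMOJI = {"😀", "😢", "🔥"}  # tiny stand-in set; real analysis needs full Unicode emoji ranges

def emoji_fingerprint(tweets):
    """Build a user's emoji-distribution fingerprint: the normalized
    frequency of each emoji across all of their tweets."""
    counts = Counter(ch for tweet in tweets for ch in tweet if ch in EMOJI)
    total = sum(counts.values()) or 1
    return {emoji: n / total for emoji, n in counts.items()}

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints: near 1.0 means the
    accounts express themselves alike, near 0.0 means they do not."""
    keys = set(fp_a) | set(fp_b)
    dot = sum(fp_a.get(k, 0.0) * fp_b.get(k, 0.0) for k in keys)
    norm = (math.sqrt(sum(v * v for v in fp_a.values()))
            * math.sqrt(sum(v * v for v in fp_b.values())))
    return dot / norm if norm else 0.0

user_a = emoji_fingerprint(["great day 😀🔥", "so happy 😀"])
user_b = emoji_fingerprint(["love it 😀 🔥"])
print(round(similarity(user_a, user_b), 2))
```

      A fingerprint like this could then be compared against a user's topic model, in the spirit of the trust modeling the abstract describes, to flag accounts whose emoji behavior diverges from their stated interests.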