Browsing by Subject "Text mining"


Distant-supervised algorithms with applications to text mining, product search, and scholarly networks
(2020-11) Manchanda, Saurav
In recent times, data has become the lifeblood of nearly all businesses. As a result, the real-world impact of data-driven machine learning has grown by leaps and bounds. It has established itself as a standard tool for organizations to draw insights from data at scale and, hence, to enhance their profits. However, one of the key bottlenecks in deploying machine learning models in practice is the unavailability of labeled training data. Manually labeled training sets are expensive and tedious to create. Moreover, they cannot practically be reused for new objectives if the underlying distribution of the data changes with time. Distant supervision provides an alternative to expensive hand-labeled datasets by leveraging alternative sources of weak supervision. In this thesis, we identify and provide solutions to some of the challenges that can benefit from distant-supervised approaches. First, we present a distant-supervised approach to accurately and efficiently estimate a vector representation for each sense of multi-sense words. Second, we present approaches for distant-supervised text segmentation and annotation, which is the task of associating individual parts of a multilabel document with their most appropriate class labels. Third, we present approaches for query understanding in product search. Specifically, we develop distant-supervised solutions to three challenges in query understanding: (i) determining, when a query contains multiple terms, which terms are representative of the query's product intent; (ii) bridging the vocabulary gap between the terms in the query and the product's description; and (iii) annotating individual terms in a query with the corresponding intended product characteristics (product type, brand, gender, size, color, etc.). Fourth, we present approaches to estimate content-aware bibliometrics that accurately and quantitatively measure the scholarly impact of a publication.
Our proposed metric assigns content-aware weights to the edges of a citation network; these weights quantify the extent to which the cited node informs the citing node. Consequently, this weighted network can be used to derive impact metrics for the various entities involved, such as publications and authors.
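The idea of deriving impact metrics from a weighted citation network can be illustrated with a minimal sketch. The abstract does not say how the content-aware weights are estimated or how they are aggregated, so everything below is an assumption made for illustration: the edge weights are made-up numbers, the paper and author identifiers are hypothetical, and summing incoming weights is just one simple aggregation, not the thesis's actual metric.

```python
from collections import defaultdict

# Hypothetical citation edges: (citing paper, cited paper, weight).
# Each weight stands in for a content-aware estimate of how much the
# cited paper informs the citing paper; the values are invented.
edges = [
    ("P3", "P1", 0.75),
    ("P3", "P2", 0.25),
    ("P4", "P1", 0.5),
    ("P4", "P3", 0.5),
]

# Hypothetical authorship table: paper -> list of authors.
authors = {"P1": ["A"], "P2": ["B"], "P3": ["A", "B"], "P4": ["C"]}


def publication_impact(edges):
    """Sum the weights of incoming citation edges for each cited paper."""
    impact = defaultdict(float)
    for _citing, cited, weight in edges:
        impact[cited] += weight
    return dict(impact)


def author_impact(pub_impact, authors):
    """Credit each author with the full impact of every paper they wrote."""
    totals = defaultdict(float)
    for paper, score in pub_impact.items():
        for author in authors.get(paper, []):
            totals[author] += score
    return dict(totals)


pub_scores = publication_impact(edges)
author_scores = author_impact(pub_scores, authors)
print(pub_scores)
print(author_scores)
```

With unweighted citation counts, P1 would score 2 and P2 would score 1; with the illustrative weights, a citation that draws heavily on a paper counts for more than a passing mention, which is the distinction the abstract describes.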
Leveraging open source web resources to improve retrieval of low text content items
(2014-08) Singhal, Ayush
With the exponential increase in the amount of digital information in the world, search engines and recommendation systems have become the most convenient ways to find relevant information. As an example, the number of web pages on the World Wide Web was estimated to have passed the one-trillion mark by 2008. However, search today is no longer limited to documents on the World Wide Web. New "information needs" such as multimedia items (images, videos) open up challenging avenues for scientific research. Thus, the search techniques used to find content-rich items (e.g., documents) no longer hold for items with low text content. In the literature, several solutions have been proposed for developing search frameworks for multimedia items, including the use of the visual or audio content of such items for retrieval. However, there is little research on this problem in the domain of scientific research artifacts. This thesis investigates the problem of retrieving low-text-content items for search and recommendation purposes and proposes novel techniques to improve the retrieval of such items. In particular, we focus on scientific research datasets, owing to their importance and exponential growth in the last few decades. One of the main challenges in searching research datasets is the lack of text content surrounding the dataset. While the datasets themselves have raw content, their low text content makes conventional text-based search techniques inadequate for their retrieval. In contrast to multimedia items, where visual and audio features have been used to enhance search- or recommendation-based retrieval, scientific research datasets lack a uniform schema for representing their raw content. Although solutions such as curation and annotation by experts or data scientists exist, these are infeasible in practice on a large scale.
As a solution, this thesis provides an efficient computational framework for retrieving such low-text-content items. We primarily present two retrieval models, namely, (1) a user-profile-based search and (2) a keyword-based search. For the user-profile-based search model, we show that both the text content of the item and the relevance ranking can be derived from the user's profile. We find that the proposed approach, which uses open source knowledge for item extraction, outperforms a local content-based extraction approach. For the keyword-based search model, we have developed a content-rich database for research datasets. We use novel content generation techniques to overcome the low-text-content challenge for datasets. The content information is extracted from open source and crowdsourced knowledge resources such as academic search engines and Wikipedia. In addition to a stand-alone quantitative assessment of the generated content, we evaluate the efficiency of the entire keyword-based search framework via a user study. Based on user responses, the thesis reports positive evidence that the proposed search framework is better than a popular general-purpose search engine for searching datasets with context-based queries. The ideas developed in this thesis are implemented in a real search system, DataGopher.org: an open source search engine for scientific research datasets. Moreover, the approaches developed for research datasets apply to other low-content items such as short text documents, news feeds, and tweets. In summary, the computational approaches proposed in this thesis advance the state of the art in the retrieval of low-content items, and the extensive evaluations performed on items such as scientific research datasets and low-text-content documents demonstrate the validity of the findings.

University Digital Conservancy
