Crowd-Computing Hybrids in Scientific Discovery

Abstract

Big Data has become a major research area in today’s computing landscape. The term refers to datasets whose volume, velocity and variety are so extreme that current automatic tools cannot collect, manage and analyze them accurately in a reasonable amount of time. Researchers, program managers and venture capital investors are overwhelmed by thousands of findings and devote substantial effort to keeping up with advances in their rapidly expanding fields. Understanding scientific topics and domains is a labor-intensive and time-consuming endeavor that current systems support poorly, because they lack a semantic characterization of relevant entities such as publications, publication venues, researchers, research areas, trends and relationships. Given the need to make new discoveries from vast volumes of data accurately classified according to each user’s interests, this project aims to show how machine learning can be harnessed by leveraging the strengths of humans and computational agents to solve crowdsourcing tasks in the context of Big Data. The technical contribution is a model for integrating computerized classification and faceted search with human interaction, demonstrating how hybrid intelligence systems can drive the future of scientific data analysis and classification. SciCrowd will be developed to support a higher level of engagement by researchers and the general public through a Software as a Service (SaaS) approach, in which users access the software and its functions remotely as a Web-based service for a subscription fee.

 

Impact

The SciCrowd project has repercussions on a global scale, reaching institutions (e.g., universities and research labs) all over the world. For instance, Portuguese institutions can access our system as they do with services like Elsevier’s Scopus (https://www.scopus.com/home.uri) and B-on (http://www.b-on.pt/). An increased level of engagement between digital volunteers and paid workers (e.g., on Amazon Mechanical Turk) can also have a global impact on society through different forms of collective intelligence and through monetary rewards that motivate collaborators to perform at their best and to pursue both company and individual goals. As more users become ‘contributors’ to the system, there is great potential for efficiency gains in terms of resource and time savings, suitability to consumers’ information needs, and accuracy of the machine learning algorithms. Given the continual flow of new information and experiences engaging individuals in the digital era, fighting info-exclusion is also an objective we intend to achieve globally. In addition, scholars may become more aware of developments in their fields of research, and students can obtain information with more precision than with digital libraries (e.g., ACM Digital Library) and search engines (e.g., Google Scholar), which only provide access to elementary metadata (e.g., title, abstract, authors) and PDF files of individual publications.

 

Business Perspective

Research is a complex activity: it is very difficult to find the right results while uncovering trends, discovering sources and collaborators, and analyzing outputs to yield further insights. The company focuses on providing academic solutions for institutions and individual consumers that want to work with highly specialized Big Data in scientific contexts. SciCrowd is an intelligent system that provides novel functionalities integrated into an innovative environment for exploring and making sense of scholarly data based on crowdsourcing, large-scale data mining, semantic technologies, and visual analytics. The solution integrates publicly (freely) accessible, filtered and contextualized metadata from major commercial publishers, including fine-grained visualizations that support users in examining topics, authors, and research communities. While a crowd of contributors is continually motivated to populate a NoSQL database with new information (e.g., classification labels) about each publication, active learning algorithms decide which data sample should be labeled next and which annotator should be queried, learning from human inputs in a highly dynamic way. SciCrowd users are thus able to detect and make sense of trends and topics in each field of research while tracking the evolution of communities and the relationships between researchers, organizations, and countries.
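The active learning step described above can be illustrated with a minimal sketch. It assumes uncertainty sampling (pick the publication whose predicted label distribution has the highest entropy) and a simple "most accurate annotator" rule; neither is confirmed as the method actually used in SciCrowd. All identifiers, probabilities and accuracy estimates below are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def pick_next_sample(predictions):
    """Return the id of the unlabeled publication whose predicted label
    distribution is the most uncertain (highest entropy)."""
    return max(predictions, key=lambda pub_id: entropy(predictions[pub_id]))


def pick_annotator(annotators):
    """Return the annotator with the highest estimated accuracy."""
    return max(annotators, key=lambda name: annotators[name])


# Hypothetical model outputs: publication id -> probabilities over three labels.
predictions = {
    "pub-001": [0.90, 0.05, 0.05],
    "pub-002": [0.40, 0.35, 0.25],
    "pub-003": [0.70, 0.20, 0.10],
}

# Hypothetical accuracy estimates, e.g. derived from previously verified labels.
annotators = {"alice": 0.82, "bob": 0.91, "carol": 0.77}

next_pub = pick_next_sample(predictions)
next_annotator = pick_annotator(annotators)
print(next_pub, next_annotator)
```

In a real deployment the probabilities would come from a trained classifier and the loop would repeat after each new label is stored; the sketch only shows a single selection round.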

SciCrowd is being adapted to the needs of academic organizations, research labs and venture capital investors. Lecturers, professors, students, researchers, research assistants and other faculty staff all over the world can use the system to find qualitative and quantitative data while making correlations and adding new insights to the results achieved. For instance, a student working on a Master’s Thesis, a researcher applying for research funding, or a professor preparing teaching activities can use SciCrowd to achieve rigorous research, robust findings, and comprehensive recommendations. The general public and external consumers from industry interested in obtaining highly specialized knowledge can also pay for unlimited access to the system. The system will be accessible (as a Web service) through an annual subscription fee.

 

Innovation Factors

SciCrowd is enabled by a crowd-computing model scientifically validated through a 4-year research effort carried out in the context of a Ph.D. Thesis in Computer Science at UTAD. A total of 1335 tools were compared to identify functional gaps and to gather requirements for the development of a novel technology able to cope with the limitations of current systems and tools. The use of crowdsourcing to engage users in the evaluation of scientific data, with active learning algorithms working in the background (based on human inputs), constitutes a paradigm shift in Big Data and semantic analysis. The customization of the system according to each user’s interests is also a significant feature, since most current tools are static by nature. SciCrowd provides access to scientific data “anywhere at anytime”. Each contributor is rewarded through a robust engagement mechanism designed to give proper credit for the work performed. Implementations at this level include, but are not limited to, gamification approaches (e.g., barnstars) that maintain reputation while providing better interaction with requesters.

 

SciCrowd is a novel intelligent Big Data analysis system that addresses a vast set of scientific needs reported in the literature. Examples include the time-consuming and laborious processes of scientific data seeking and analysis. The advancements enabled by SciCrowd can help overcome a further set of problems, such as the lack of quantitative data perspectives and the absence of more qualitative evidence. Crowdsourcing via the Web is a creative mode of user interactivity and makes it possible to achieve a multidimensional view of Big Data, a particular advantage when compared with proprietary databases such as Web of Science (http://webofknowledge.com) and SpringerLink libraries (http://link.springer.com). Gamification mechanisms are also absent from such services, since they rely on their own staff/contributors instead of a crowd of people with different backgrounds and cognitive abilities. Distinct licenses for Ph.D. students and researchers, universities, and other entities (e.g., the general public) will be based on a division between standard, plus, and premium licenses with different levels of access to the system functionalities.

 

The crowd-enabled model implemented in SciCrowd is innovative by nature and aims at inspiring a new set of powerful tools for analyzing complex Big Data from scientific publications. The framework is based on putting human crowds into the loop of scientific evaluation, while “machines” (i.e., active learning algorithms) work in the background to learn from human inputs and make better decisions on intelligent tasks. For example, consider the following evaluation scenario. A paper classified as “medical informatics” could be characterized by subarea (e.g., cognitive aging), aims and purpose, setting and context, key concepts and definitions, participant characteristics, research boundaries and limitations, method, results and findings, socio-technical aspects concerning a certain technology (e.g., a Wiki to support knowledge exchange in public health), related work, bibliometric data (e.g., affiliation and country of authors’ affiliation), and crowd annotations as a meta-cognitive activity. In addition, all these data could be correlated and filtered to present the final results for specific research purposes (e.g., identifying what kinds of features were introduced in health care technologies by Portuguese researchers between 2009 and 2015).
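The filtering scenario above (features introduced in health care technologies by Portuguese researchers between 2009 and 2015) can be sketched as a faceted query over annotated records. This is an illustrative sketch only: the `Publication` fields, the `faceted_filter` helper and the sample records are hypothetical and do not reflect the actual SciCrowd data model.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """Hypothetical annotated record; the field names are illustrative only."""
    title: str
    year: int
    area: str
    countries: set = field(default_factory=set)  # countries of authors' affiliations
    features: set = field(default_factory=set)   # crowd-annotated technology features


def faceted_filter(pubs, area=None, country=None, year_range=None):
    """Keep the publications that match every facet that was specified."""
    results = []
    for pub in pubs:
        if area is not None and pub.area != area:
            continue
        if country is not None and country not in pub.countries:
            continue
        if year_range is not None and not (year_range[0] <= pub.year <= year_range[1]):
            continue
        results.append(pub)
    return results


def features_introduced(pubs):
    """Collect the distinct annotated features across a set of publications."""
    found = set()
    for pub in pubs:
        found |= pub.features
    return sorted(found)


# Made-up sample data, for illustration only.
corpus = [
    Publication("Paper A", 2011, "medical informatics", {"Portugal"},
                {"wiki-based knowledge exchange"}),
    Publication("Paper B", 2016, "medical informatics", {"Portugal"},
                {"remote monitoring"}),
    Publication("Paper C", 2013, "medical informatics", {"Spain"},
                {"medication reminders"}),
    Publication("Paper D", 2014, "medical informatics", {"Portugal", "Spain"},
                {"medication reminders", "fall detection"}),
]

matches = faceted_filter(corpus, area="medical informatics",
                         country="Portugal", year_range=(2009, 2015))
summary = features_introduced(matches)
print([pub.title for pub in matches])
print(summary)
```

Each facet is optional, so the same helper covers narrower or broader queries; in a document database the equivalent filter would be expressed as a query rather than a Python loop.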

 

SciCrowd is an intelligent system that provides features for tracking, analyzing and visualizing research. The Web service is being developed in PHP, since it is a popular, easily accessible language used by many large projects and companies (e.g., Wikipedia, Facebook, WordPress, and Joomla). Symfony (http://symfony.com) was the chosen framework because it is well suited to large projects and highly modular. In addition, a non-relational, distributed, open-source and horizontally scalable database (NoSQL) is the selected approach for coping with the high variability and volume of data sources. MongoDB (http://www.mongodb.com) and Apache Hadoop (http://hadoop.apache.org) are some of the possible solutions for advancing the prototype already created. Currently, the system prototype only allows users to authenticate and to edit and classify data using different annotations, comments and categories. The system will have functionalities for human crowd inputs (e.g., inserting labels), supervised (active) learning algorithms, and visual analytics.

 

Market Segmentation

 

Considering the limitations already identified in manually finding and analyzing all kinds of digital artifacts and other intellectual assets produced by researchers, SciCrowd aims to 1) reduce the bias, time and cognitive effort spent in scientific data seeking, cataloguing, analysis, and classification, 2) extend the limited scope (e.g., sample size and analytical dimensions) of literature reviews and bibliometric studies using crowd-based human computation and machine learning techniques, and 3) address the lack of human-centered results by evaluating convergence indicators and interaction requirements in meta-cognitive research practices at a large scale. It is expected that this system will allow the extraction of relevant facts about the relationships between disciplines, scholars and publications, overcoming the limitations of current tools for understanding research attributes and trends effectively at different levels of granularity while relating them semantically through an integrated solution. Since the symbiosis between collective and computational intelligence is not adequately supported by an overarching framework for science mapping and Big Data analytics, SciCrowd can improve scientific practices with the potential to contribute to: 1) reducing the errors of the automatic extraction of semantics from scientific literature, 2) understanding human information needs based on semantic metadata while modeling human discovery and exploration behavior, and 3) enhancing computational reasoning capabilities for data discovery processes, keeping machines accurate when “timely” scientific information suddenly needs updating.

 

The market segmentation of our company is based on a global approach in which academics from all parts of the world can contribute to enhancing the system’s behavior with access to high-dimensional Big Data. In this sense, the target market for the proposed product is global. Consequently, the recent local/European economic problems have no direct impact on the project. In the first stage, our company will target institutions inside Portugal (a first implementation will be proposed at UTAD) and then expand to the rest of the world through a strong marketing effort using social media. Age (18 and over) is the only demographic segmentation condition, since this system does not need to consider aspects such as a contributor’s income, gender, ethnic background and family life cycle. Psychographic segmentation relies mainly on people with an interest in scholarly data and Big Data analytics.

 

Resources

 

Access to funders and companies interested in the product is fundamental to its success. For this reason, an intermediary is needed to put us in direct contact with potentially interested parties. We believe that our system has the potential to capture the interest of big companies such as Google, Elsevier, and ResearchGate, which could complement their systems with our integrated Artificial Intelligence (AI) and crowdsourcing solution. In the first stage of the system development, we will not need many financial resources to put the system online. Nevertheless, we will explore the crowdfunding market by making our project available for funding on platforms such as Kickstarter (http://www.kickstarter.com).

 

In terms of human resources, we have a strong opportunity to involve students from UTAD in the system development. Some recent research has also focused on crowdsourcing the software development process. In this context, we are also considering the possibility of making our project accessible to freelancers with programming abilities (in order to reduce the costs of employing a full-time system developer). Concerning physical facilities, we have access to an open room at UTAD for performing research activities, as well as our own devices and servers for deploying the system.

 

Development Expectations

 

The sale of software licenses to academic service providers and individual users through an integrated Software as a Service (SaaS) and Business-to-Consumer (B2C) approach is the most valid proposal for converting the idea into value. The requirements elicitation is at an advanced stage. Thus, the deployment of a functional prototype represents the next step for introducing the service into the market. Subsequently, the strategy is to acquire customers in a continuous process in which the satisfaction of their needs is our main objective.

 

After reaching product/market fit, our objective is to create a strong sense of commitment between contributors and crowdfunding members, enhancing the business as more consumers pay for the service. We will start by introducing metadata about Computer Science publications (specifically, Human-Computer Interaction – HCI), gradually expanding the system to other fields and subdomains (e.g., medicine) and thus capturing new consumers in a dynamic process.

 

Several contacts will be made with companies such as ResearchGate GmbH (http://researchgate.net) to obtain strategic insights and specialized knowledge for our business. In addition, because the company is built on a crowd-enabled funding and development approach, we expect many partnerships from both academia and industry.

 

SciCrowd has the potential to create specialized jobs (at least 1 new employee per year). In addition, the community of contributors is rewarded according to their contributions in a model very similar to the one implemented by Amazon Mechanical Turk, where users are paid by requesters for performing microtasks.

 

Critical Factors

 

The critical factors for the success of our company lie mainly in the recognition of SciCrowd as a very robust alternative to current services, justifying worldwide investment and appropriate adoption by institutions.

 

The technological know-how and complexity associated with SciCrowd imply a steep learning curve, since the data provided to the consumer is processed and visualized from a multidimensional perspective. A lack of investment by institutions (e.g., universities) facing financial and market problems and a lack of funding (e.g., crowdfunding) can also delay the time-to-market of our product. In addition, a lack of potential consumers and contributors can lead to the failure of the project.

 


Working Plan

 

Phase 1

 

Since it remains unclear whether manual data gathering and evaluation can scale to a large set of publications and scholars, it is assumed that (H1) the prerequisites for crowdsourcing are present in academic settings, and scientists perceive it as a useful tool for supporting research, (H2) current data mining, machine learning and bibliometric tools meet the requirements for large-scale scholarly data analysis, and (H3) by systematically applying crowd-based human computation methods, techniques and tools to analyze scientific data, the output of the community (in terms of “value” as judged by scientists) can be significantly increased compared to automatic approaches. To test these assumptions, a set of steps must be followed, including: (T1) identifying limitations, challenges and opportunities in understanding research dynamics from scientific publications through the use of collective intelligence as a source for machine learning self-supervision, (T2) surveying the main issues/dimensions behind massively collaborative science, crowd-based human computation and machine computation in science, (T3) concluding the feature analysis and systematic literature review comparing human intelligence (e.g., Amazon Mechanical Turk, Foldit), computational intelligence (e.g., Rexplore, SciMAT), and hybrid intelligence (e.g., Apolo, Cascade) tools for requirements elicitation, and (T4) performing case studies and experiments using current crowdsourcing and machine learning systems to obtain insights for informing the design of a mixed-initiative system able to support human intelligence tasks and automated reasoning to systematically evaluate scientific data.

 

Phase 2

 

Development of a framework for large-scale scientific data search, analysis and classification comprising crowd-based human computation and automated reasoning on publications. The framework will inform the development of SciCrowd, a mixed-initiative knowledge discovery system for examining publications “from the ground up”. This task relies on the requirements elicitation, deployment, and evaluation of a prototype under development.

 

Phase 3

 

Development of the business plan and possible introduction of the product into the market of semantic technologies for Big Data analytics and science mapping.

 

Team

António Correia received an M.Sc. degree in Information and Communication Technologies from the University of Trás-os-Montes e Alto Douro (UTAD), Vila Real, Portugal, in 2012. His Master’s thesis, entitled “Characterization of the state-of-the-art of CSCW”, was awarded Summa Cum Laude (20/20). Currently, he is a Ph.D. Candidate in Computer Science at UTAD. More than 6 years of experience as a researcher and 2 years as a lecturer and project supervisor inform his background in the study and application of collaborative systems in many fields. In recent years, he was a Portuguese Government’s +E+I Programme fellow, a junior researcher at UTAD, and an Espírito Santo Bank trainee.

The team also includes André Lopes, a Bachelor’s Degree student in Audiovisual and Multimedia at the School of Communication and Media Studies (Lisbon Polytechnic Institute – IPL), with User eXperience (UX) design responsibilities.

This project has the support of Benjamim Fonseca and Hugo Paredes (Assistant Professors at UTAD and Senior Researchers of the INESC TEC Associated Laboratory – formerly, INESC Porto) for achieving more specialized knowledge. Currently, INESC TEC provides the support required for implementing an overarching model robust enough to systematically analyze large volumes of scientific data by combining human and machine intelligence.

The strengths of each team member can be further leveraged by including undergraduate and graduate students (with programming skills) in Computer Science at UTAD, through the development of Bachelor’s Degree projects and Master’s Theses on the SciCrowd system development.

  • John Bestevaar Backer

    Sep 30, 2017

    I recognized the value of this project only because i had been thinking along similar lines. I think you failed here because the way you sold it to the public is dead boring.

  • António Correia Researcher

    Oct 01, 2017

    Thank you, John! I totally agree with you. It was my first experience with crowdfunding projects and I learned a lot from it. All the best!

  • Brittany Hollerbach Backer

    Oct 01, 2017

    It is much harder than I thought it would be to crowdfund! Best of luck to you and your team!

Backers (4)

  1. JamesT  
    11 months ago
    $120
  2. KICKSTARTER Offical 
    1 year ago
    $25
  3. Shella Zercho 
    1 year ago
    $15
  4. Yeron@LOCODOR  
    1 year ago
    $7
