Predictive Analytics and Data Mining
Concepts and Practice with RapidMiner
Vijay Kotu
Bala Deshpande, PhD
Amsterdam • Boston • Heidelberg • London
New York • Oxford • Paris • San Diego
San Francisco • Singapore • Sydney • Tokyo
Morgan Kaufmann is an imprint of Elsevier
Executive Editor: Steven Elliot
Editorial Project Manager: Kaitlin Herbert
Project Manager: Punithavathy Govindaradjane
Designer: Greg Harris
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2015 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by
any means, electronic or mechanical, including photocopying, recording, or any
information storage and retrieval system, without permission in writing from
the publisher. Details on how to seek permission, further information about the
Publisher’s permissions policies and our arrangements with organizations such as the
Copyright Clearance Center and the Copyright Licensing Agency, can be found at our
website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under
copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research
and experience broaden our understanding, changes in research methods,
professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and
knowledge in evaluating and using any information, methods, compounds, or
experiments described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they
have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors,
or editors, assume any liability for any injury and/or damage to persons or property
as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas contained in the material
herein.
ISBN: 978-0-12-801460-8
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Library of Congress Cataloging-in-Publication Data
A catalogue record for this book is available from the Library of Congress.
For information on all MK publications visit our website at www.mkp.com.
Dedication
To the contributors to the Open Source Software movement
We dedicate this book to all those talented and generous developers around
the world who continue to add enormous value to open source software tools,
without whom this book would never have seen the light of day.
Foreword
Everybody can be a data scientist. And everybody should be. This book shows
you why everyone should be a data scientist and how you can get there. In
today’s world, it should be embarrassing to make any complex decision without
understanding the available data first. Being a “data-driven organization”
is the state of the art and often the best way to improve a business outcome
significantly. Consequently, we have seen a dramatic change in the tools that
help us get to this success quickly. Only in the last few years has building a
data warehouse and creating reports or dashboards on top of it become the norm
in larger organizations. Technological advances have made this process easier
than ever; in fact, data discovery tools now allow business users to build
dashboards themselves without an army of Information Technology consultants
supporting them. But now that we have managed to effectively answer questions
based on our data from the past, a new paradigm shift is underway: Wouldn’t it
be better to answer what is going to happen instead? This is the realm of
advanced analytics and data science: moving your interest from the past to the
future and optimizing the outcomes of your business proactively.
Here are some examples of this paradigm shift:
Traditional Business Intelligence (BI) system and program answers: How many
customers did we lose last year? Although certainly interesting, the answer
comes too late: the customers are already gone and there is not much we can do
about it. Predictive analytics will show you who will most likely churn within
the next 10 days and what you can do best for each customer to keep them.

Traditional BI answers: What campaign was the most successful in the past?
Although certainly interesting, the answer will only provide limited value to
determine what is the best campaign for your upcoming product. Predictive
analytics will show you what will be the next best action to trigger a purchase
action for each of your prospects individually.
Traditional BI answers: How often did my production stand still in the past and
why? Although certainly interesting, the answer will not change the fact that
profit was decreased due to suboptimal utilization. Predictive analytics will
show you exactly when and why a part of a machine will break and when you should
replace the parts instead of backlogging production without control.
Those are all high-value questions and knowing the answers has the potential
to positively impact your business processes like nothing else. And the good
news is that this is not science fiction; predicting the future based on data
from the past and the inherent patterns living in the data is absolutely possible
today. So why isn’t every company in the world exploiting this potential all day
long? The answer is the data science skills gap.
Performing advanced analytics (predictive analytics, data mining, text
analytics, and the necessary data preparation) requires, well, advanced skills.
In fact, a data scientist is seen as a superstar programmer with a PhD in
statistics who just happens to understand every business problem in the world.
Of course, people with such a skill mix are very rare; in fact, McKinsey has
predicted a shortage of 1.8 million data scientists by the year 2018 in the
United States alone. This is a classical dilemma: we have identified the value
of future-oriented questions and of solving them with data science methods, but
at the same time we can’t find the answers to those questions since we don’t
have the people able to do so.
The only way out of this dilemma is a democratization of advanced analytics.
We need to empower more people to create predictive models: business analysts,
Excel power users, data-savvy business managers. We can’t transform this group
of people magically into data scientists, but we can give them the tools and
show them how to use them to act like a data scientist. This book can guide you
in this direction.
We are in a time of modern analytics, with “big data” fueling an explosion in
the need for answers. It is important to understand that big data is not just
about volume but also about complexity. More data means new and more
complex infrastructures. Unstructured data requires new ways of storage and
retrieval. And sometimes the data is generated so fast it should not be stored
at all, but analyzed directly at the source and the findings stored instead.
Real-time analytics, stream mining, and the Internet of Things are becoming a
reality now. At the same time, it is also clear that we are in the midst of a
sea change: data alone has no value, but the hidden patterns and insights in
the data are an extremely valuable asset. Accessing this asset should no longer
be an option for experts only but should be placed in the hands of analytical
practitioners and business managers of all kinds. This democratization of
advanced analytics removes the bottleneck of data science and unleashes new
business value in an instant.