# Chapter 4

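**Bin Model (Marbles in a Bin)**

When the sample is drawn independently from the same distribution (i.i.d.), the proportion observed in the sample approximates the proportion in the whole population.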

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMLw9szQ_vUIJt1fWf%2F-LQMNg_sh6FPLE7KCoJE%2FScreen%20Shot%202018-11-03%20at%2010.24.19.png?alt=media\&token=d9f58a7e-8419-4f12-ace1-bca56b969ff9)

Hoeffding's Inequality

$$\epsilon$$ is the error tolerance, i.e. the allowed gap between the sample proportion and the population proportion.
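As a rough illustration, assume the standard form of the inequality for the bin model, $$P[|\nu-\mu|>\epsilon]\le 2\exp(-2\epsilon^2 N)$$, where $$\nu$$ is the sample proportion, $$\mu$$ the population proportion and $$N$$ the sample size (this notation is an assumption, not taken from the notes). The sketch below compares that bound with a simulated violation rate; the function names and the values `mu=0.4`, `n=100` are illustrative only.

```python
import math
import random


def hoeffding_bound(epsilon: float, n: int) -> float:
    """Hoeffding upper bound on P[|nu - mu| > epsilon] for n i.i.d. draws."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n)


def violation_rate(mu: float, n: int, epsilon: float,
                   trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated samples whose proportion nu misses mu by more than epsilon."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        # Draw n marbles; each one is "marked" with probability mu.
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > epsilon:
            violations += 1
    return violations / trials


bound = hoeffding_bound(epsilon=0.1, n=100)
observed = violation_rate(mu=0.4, n=100, epsilon=0.1)
print(f"bound={bound:.4f} observed={observed:.4f}")
```

The bound does not depend on $$\mu$$, which is why it can be used even though the population proportion is unknown; the simulated rate is typically well below it because the bound is loose.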

**Connection to Learning**

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMLw9szQ_vUIJt1fWf%2F-LQMRtcp3ajEcQjLhs6O%2FScreen%20Shot%202018-11-03%20at%2010.42.46.png?alt=media\&token=8a2c171a-a832-40bf-8755-9f83dabf6337)

E denotes the error.

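We use the probability that h disagrees with y on the sample (in, drawn from P) to estimate the probability that h disagrees with f on the whole population (out).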

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMLw9szQ_vUIJt1fWf%2F-LQMT3EHX34uj3g1Cl9v%2FScreen%20Shot%202018-11-03%20at%2010.47.51.png?alt=media\&token=b31f8073-dd4e-4abc-b06c-5b9a859c5395)

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMLw9szQ_vUIJt1fWf%2F-LQMUKg1QDlnkLXIzRIA%2FScreen%20Shot%202018-11-03%20at%2010.52.56.png?alt=media\&token=fcb3a909-b935-4b2c-842f-6e9a5e13ff82)

The purpose of this process is to verify whether the h we have chosen is good.
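A minimal sketch of this verification step, under made-up assumptions: inputs are uniform on [0, 1), the target `f` and the fixed hypothesis `h` are simple threshold functions invented for the example, and the in-sample error is the fraction of sample points where `h` disagrees with the label.

```python
import random


def f(x: float) -> int:
    """Hypothetical target function (unknown in practice)."""
    return 1 if x > 0.5 else -1


def h(x: float) -> int:
    """Hypothetical fixed hypothesis being verified."""
    return 1 if x > 0.3 else -1


def error_rate(hyp, xs, ys) -> float:
    """Fraction of points on which hyp disagrees with the given labels."""
    return sum(hyp(x) != y for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs = [rng.random() for _ in range(1000)]  # i.i.d. sample of size N = 1000
ys = [f(x) for x in xs]                   # labels y = f(x), no noise

e_in = error_rate(h, xs, ys)
# For uniform x, h and f disagree exactly on (0.3, 0.5], so the
# out-of-sample error of this toy h is 0.2.
e_out = 0.2
print(f"E_in={e_in:.3f} E_out={e_out:.3f}")
```

In a real problem `f` is unknown, so only `e_in` can be computed; Hoeffding's inequality is what justifies treating it as an estimate of the out-of-sample error for a fixed h.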

Whether the learned model is feasible rests on two conditions:

① whether $$E_{in}(h)$$ is small enough ② whether $$E_{in}(h)\approx E_{out}(h)$$ holds

h is just one of many hypotheses; g is the model we actually need.

**Connection to Real Learning**

**From a single hypothesis to multiple hypotheses**

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMX_2AQ7nfmiGMKMnU%2F-LQMyF_VREV2Ski7Cnts%2FScreen%20Shot%202018-11-03%20at%2013.08.13.png?alt=media\&token=af0acbf1-d54a-4cda-869d-587c41d79b24)

Bad data: a sample on which $$E_{in}$$ and $$E_{out}$$ differ greatly.

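Consider flipping a coin and recording whether the i-th flip lands heads or tails. Suppose our sample happens to contain 5 heads in a row (unlikely, but possible). We would then conclude that the probability of heads is 100%, even though we know the actual probability is 50%. We call such a sample a bad sample. The same thing happens in machine learning, where we call such unrepresentative data bad data. Learning has to pick the best h(x) from the hypothesis set, but whenever any h(x) meets bad data, the choice can be affected. For example, h1 may not be a good choice, yet bad data happens to make $$E_{in}(h_1)$$ very small, so the algorithm A selects it.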
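The coin argument can be made concrete with a short calculation. The sketch below assumes fair, independent coins; the count of 150 coins is an illustrative choice, not a figure from these notes. Each coin plays the role of one hypothesis, and "all heads" plays the role of a misleadingly small in-sample error.

```python
def prob_all_heads(flips: int) -> float:
    """Probability that one fair coin lands heads on every one of `flips` tosses."""
    return 0.5 ** flips


def prob_some_coin_all_heads(coins: int, flips: int) -> float:
    """Probability that at least one of `coins` independent fair coins is all heads."""
    return 1.0 - (1.0 - prob_all_heads(flips)) ** coins


single = prob_all_heads(5)                 # one coin, five tosses
many = prob_some_coin_all_heads(150, 5)    # 150 coins, five tosses each
print(f"one coin: {single:.4f}, at least one of 150 coins: {many:.4f}")
```

A bad sample is rare for any single coin, but once there are many coins to choose from, finding at least one that looks perfect becomes almost certain. This is the effect that picking the best-looking hypothesis has on the guarantee.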

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMX_2AQ7nfmiGMKMnU%2F-LQMscH_86zIBk9liN2d%2FScreen%20Shot%202018-11-03%20at%2012.43.24.png?alt=media\&token=08086cdb-6666-4f62-b590-0b49116397ba)

Different datasets Dn may turn out to be bad data for different hypotheses. As long as Dn is bad data for at least one hypothesis, Dn counts as bad data.

By Hoeffding's inequality, an upper bound on the probability of bad data can be written as:

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMX_2AQ7nfmiGMKMnU%2F-LQMtA_0stzt-7L1AGy4%2FScreen%20Shot%202018-11-03%20at%2012.46.19.png?alt=media\&token=4758f041-8274-48ad-aa4f-cea8986a3a06)

M is the number of hypotheses, and N is the number of examples in the sample D.
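To see how M and N pull in opposite directions, the sketch below assumes the usual finite-hypothesis (union) bound $$P[\text{bad data}]\le 2M\exp(-2\epsilon^2 N)$$. The function name and the sample values are illustrative, and the result is clipped at 1 because a probability cannot exceed it.

```python
import math


def bad_data_bound(m: int, epsilon: float, n: int) -> float:
    """Union bound on the probability that the sample is bad for any of m hypotheses."""
    return min(1.0, 2.0 * m * math.exp(-2.0 * epsilon ** 2 * n))


# Fix epsilon and vary M and N to see the trade-off.
for m in (1, 100, 10_000):
    for n in (100, 1_000):
        print(f"M={m:>6} N={n:>5} bound={bad_data_bound(m, 0.1, n):.3e}")
```

For a fixed N the bound grows linearly in M, while for a fixed M it shrinks exponentially in N, so a moderately larger sample can compensate for a much larger hypothesis set.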

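The size of $$E_{in}$$ depends on M: with a larger M there are more hypotheses to choose from, so a smaller $$E_{in}$$ can be found, but the probability of bad data goes up.

A large N keeps the gap between $$E_{in}$$ and $$E_{out}$$ small.

M must be large enough to make $$E_{in}$$ small enough, but a large M also increases the amount of bad data, which widens the gap between $$E_{in}$$ and $$E_{out}$$.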

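The bound above shows that when M is finite and N is large enough, the algorithm A can produce a g that satisfies the feasibility conditions stated earlier.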
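"N large enough" can be quantified by solving the bound for N. Assuming the union bound $$2M\exp(-2\epsilon^2 N)\le\delta$$, where $$\delta$$ is a tolerated probability of bad data (a symbol introduced here for illustration), rearranging gives $$N\ge\frac{1}{2\epsilon^2}\ln\frac{2M}{\delta}$$. A minimal sketch:

```python
import math


def required_sample_size(m: int, epsilon: float, delta: float) -> int:
    """Smallest N with 2 * m * exp(-2 * epsilon**2 * N) <= delta."""
    return math.ceil(math.log(2.0 * m / delta) / (2.0 * epsilon ** 2))


n_needed = required_sample_size(m=100, epsilon=0.1, delta=0.05)
print(n_needed)
```

Because M enters only through a logarithm, multiplying the number of hypotheses by ten adds a fixed number of examples to the requirement rather than multiplying it.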

![](https://2779524803-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LQIIMtXAR9kxe0OARuh%2F-LQMX_2AQ7nfmiGMKMnU%2F-LQMxQv1qFzeZQyOl09U%2FScreen%20Shot%202018-11-03%20at%2013.04.55.png?alt=media\&token=9a77ef42-a912-4b30-bc6b-e0e7b2a74854)

