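My main goal here is to achieve something like Readability or Safari's Reader, where the main content of a web page is reduced to text. I don't actually want to display any images, just get all the important text from the page. I'm currently using some rather long home-grown code that parses the page's <h1>s to work out what the title might be, and I also parse the <p>s, which I hope contain most of the page's content.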
    - (void)interpretAndDisplay {
        NSURL *URL = [NSURL URLWithString:self.url];
        // Note: dataWithContentsOfURL: blocks the calling thread until the download finishes.
        NSData *data = [NSData dataWithContentsOfURL:URL];
        if (data == nil) {
            return;
        }
        // [data bytes] is not NUL-terminated, so decode using the data's own length.
        NSString *html = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        if (html == nil) {
            return;
        }

        // Getting the H1s
        NSMutableArray *h1Full = [self stringsBetweenString:@"<h1" andString:@">" andText:html];
        if ([h1Full count] > 0) {
            NSString *h1Open = [NSString stringWithFormat:@"<h1%@>", [h1Full firstObject]];
            NSMutableArray *h1Content = [self stringsBetweenString:h1Open andString:@"</h1>" andText:html];
            NSMutableArray *h1Sanitize = [self stringsBetweenString:@"<" andString:@">" andText:html];
            if ([h1Content count] > 0) {
                NSString *finalTitle = [h1Content firstObject];
                // Strip every tag that occurs in the page from the title text.
                for (NSString *tag in h1Sanitize) {
                    NSString *toRemove = [NSString stringWithFormat:@"<%@>", tag];
                    finalTitle = [finalTitle stringByReplacingOccurrencesOfString:toRemove withString:@""];
                }
                finalTitle = [finalTitle stringByReplacingOccurrencesOfString:@"\n" withString:@""];
                finalTitle = [self sanitizeString:finalTitle];
                [self.titleLabel setText:finalTitle];
            }
        }

        // Now for the body!
        NSMutableArray *pTag = [self stringsBetweenString:@"<p" andString:@">" andText:html];
        if ([pTag count] > 0) {
            // Only paragraphs whose opening tag is identical to the first <p ...> are matched.
            NSString *pOpen = [NSString stringWithFormat:@"<p%@>", [pTag firstObject]];
            NSMutableArray *pContent = [self stringsBetweenString:pOpen andString:@"</p>" andText:html];
            NSMutableArray *pSanitize = [self stringsBetweenString:@"<" andString:@">" andText:html];
            if ([pContent count] > 0) {
                for (NSUInteger i = 0; i < [pContent count]; i++) {
                    NSString *pToEdit = [pContent objectAtIndex:i];
                    // Strip tags, then newlines, from each paragraph.
                    for (NSString *tag in pSanitize) {
                        NSString *toRemove = [NSString stringWithFormat:@"<%@>", tag];
                        pToEdit = [pToEdit stringByReplacingOccurrencesOfString:toRemove withString:@""];
                    }
                    pToEdit = [pToEdit stringByReplacingOccurrencesOfString:@"\n" withString:@""];
                    [pContent replaceObjectAtIndex:i withObject:pToEdit];
                }
                // Join the paragraphs with a blank line between them.
                NSString *finalBody = [pContent componentsJoinedByString:@"\n\n"];
                finalBody = [self sanitizeString:finalBody];
                [self.textLabel setText:finalBody];
            }
        }
    }
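The code above does a good job of extracting all the elements and cleaning them up with the methods I wrote. The problem is that analysing only the P tags sometimes fails to simplify the content at all, while analysing every possible content tag could scramble the order and layout of the content.
Is there a better approach, or a framework, that can turn all the text into a nice string?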
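The question does not show the stringsBetweenString:andString:andText: helper it relies on. For readers following along, here is a minimal sketch of what such a helper might look like, assuming it returns every substring found between a start marker and the next end marker. The function name StringsBetween and its exact behaviour are guesses for illustration, not the asker's implementation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for the question's unshown helper: collects every
// substring of `text` that lies between `start` and the next `end`.
static NSMutableArray<NSString *> *StringsBetween(NSString *start, NSString *end, NSString *text) {
    NSMutableArray<NSString *> *results = [NSMutableArray array];
    NSScanner *scanner = [NSScanner scannerWithString:text];
    scanner.charactersToBeSkipped = nil; // keep whitespace inside matches

    while (!scanner.isAtEnd) {
        // Advance to the next start marker; stop if there is none left.
        [scanner scanUpToString:start intoString:NULL];
        if (![scanner scanString:start intoString:NULL]) {
            break;
        }
        // Capture everything up to the end marker (nil if it follows immediately).
        NSString *match = nil;
        [scanner scanUpToString:end intoString:&match];
        if (![scanner scanString:end intoString:NULL]) {
            break; // unterminated match: ignore it
        }
        [results addObject:match ?: @""];
    }
    return results;
}
```

Note that a start marker such as "<p" also matches tags like <pre>, which is one reason plain string matching gets brittle on real pages.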
Edit
Searching around, I found the Boilerpipe project ( https://github.com/k-bx/boilerpipe/wiki/QuickStart ), which makes extracting the text extremely easy. It looks as simple as this: String text = ArticleExtractor.INSTANCE.getText(url);
Can I do this in Objective-C?
Edit 2
There seems to be a Boilerpipe API, but its requests are limited. I'm mainly looking for a client-side solution.
Best answer
In my opinion, regex is not the most forgiving approach.
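I would try to find an existing open-source implementation (such as https://github.com/Kerrick/readability-js ) and use WebKit to inject the JS into the web page after it has loaded.
After that you can inject another piece of JS that extracts the processed content (using the appropriate class from the source).
Then, using JavaScriptCore, you can pass the contents of the div to Objective-C (JS provides plenty of ways to do this).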
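The steps above could be sketched roughly as follows. This is an illustration under several assumptions, not a tested integration: the class name ReaderExtractor, the readabilitySource property (the contents of a readability script you have bundled yourself) and the #readability-content selector are all hypothetical, so check the script's source for the element it actually produces. The sketch also uses WKWebView's evaluateJavaScript:completionHandler: to bring the string back to Objective-C, rather than talking to JavaScriptCore directly.

```objc
#import <WebKit/WebKit.h>

// Builds the JS that reads the processed article back as plain text.
static NSString *ExtractionScriptForSelector(NSString *selector) {
    return [NSString stringWithFormat:
        @"(function(){var n=document.querySelector('%@');return n?n.innerText:'';})()", selector];
}

// Assumption: where the injected script leaves the cleaned-up article.
static NSString * const kArticleSelector = @"#readability-content";

@interface ReaderExtractor : NSObject <WKNavigationDelegate>
@property (nonatomic, strong) WKWebView *webView;
@property (nonatomic, copy) NSString *readabilitySource; // hypothetical: bundled script text
@property (nonatomic, copy) void (^completion)(NSString *text, NSError *error);
- (void)extractTextFromURL:(NSURL *)url
                completion:(void (^)(NSString *text, NSError *error))completion;
@end

@implementation ReaderExtractor

- (void)extractTextFromURL:(NSURL *)url
                completion:(void (^)(NSString *text, NSError *error))completion {
    self.completion = completion;
    self.webView = [[WKWebView alloc] initWithFrame:CGRectZero];
    self.webView.navigationDelegate = self;
    [self.webView loadRequest:[NSURLRequest requestWithURL:url]];
}

// Called once the page has finished loading.
- (void)webView:(WKWebView *)webView didFinishNavigation:(WKNavigation *)navigation {
    // Step 1: inject the readability script. Its return value may not be
    // convertible to an Objective-C object, so the error is ignored here.
    [webView evaluateJavaScript:self.readabilitySource
              completionHandler:^(id ignored, NSError *injectError) {
        // Step 2: inject a second script that returns the processed text.
        [webView evaluateJavaScript:ExtractionScriptForSelector(kArticleSelector)
                  completionHandler:^(id text, NSError *readError) {
            if (self.completion) {
                self.completion([text isKindOfClass:[NSString class]] ? text : nil, readError);
            }
        }];
    }];
}

- (void)webView:(WKWebView *)webView didFailNavigation:(WKNavigation *)navigation
      withError:(NSError *)error {
    if (self.completion) {
        self.completion(nil, error);
    }
}

@end
```

If the script you choose does its work asynchronously, the second call may need to wait until the processed element exists before reading it.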
Source: a similar question on Stack Overflow, "html - Reducing a web page to text on iOS (Objective-C)": https://stackoverflow.com/questions/30677385/